Kernel 4.4.0 CPU stalling

I’m getting some CPU stalling on this new release (16.02, 20160227-66). I have a process that uses about 1-1.5 CPU cores 24/7. It was running fine for weeks on the previous release (15.11, 20151204-36), but after the update it only made it about 90 minutes before the system started freaking out.

Every ~30 seconds I’m getting this in /var/log/messages:
Mar 1 02:01:56 linaro-developer kernel: [24374.351267] Task dump for CPU 3:
Mar 1 02:01:56 linaro-developer kernel: [24374.356820] process R running task 0 9040 2316 0x00000003
Mar 1 02:01:56 linaro-developer kernel: [24374.367068] [<ffffffc000086ca4>] __switch_to+0x94/0xb4
Mar 1 02:01:56 linaro-developer kernel: [24374.369410] [<ffffffc030c37c00>] 0xffffffc030c37c00

I can’t kill the process (called “process” in the output above) even with kill -9, it’s using 100% CPU and not doing anything, and even attempts to shut down or reboot are met with:

reboot

Failed to start reboot.target: Connection timed out

Anybody else seeing this? It’s not a one-time thing for me, I can’t get it to run for longer than ~60-90 minutes before the entire system locks up and I have to power cycle, every time.

I’ve had to drop back to the 20151204-36 release, which is working fine.

Yes, I am seeing something similar. I had thought it was the fact that I had used ssh to logon to the board, and scp to move large (1GB+) files to and from the board over WiFi, but maybe I am seeing the same thing you are seeing. I see all 4 cores loaded at 25%, and usually it takes a few hours before I see the issue. Once the issue appears, even echoing key strokes can take many seconds.

Full disclosure: I am an employee of Qualcomm Canada, any opinions expressed in this or any other post may not reflect the opinions of my employer.

That seems pretty similar to what I’m seeing. It did it three times in a row for me before I finally broke down and re-flashed the previous release.

Of those three times, the first one would still allow ssh sessions, but they were VERY slow to respond, it would take 20+ seconds to get the password prompt, and tasks I would perform were very slow. There was a process stuck at 100% CPU, but it wasn’t doing anything, I couldn’t kill it, and every ~30 seconds I was getting a message in the system log like I posted earlier. My system is running on a bench power supply, and normally the current draw is very dynamic, bouncing erratically between 200mA and 400mA at 10V as it’s running and executing different tasks (as you’d expect). When this happened it was stuck at ~250mA solid, absolutely no fluctuation.

The other two times it happened I was locked out completely. I didn’t have an active ssh session so I couldn’t see what was going on, and any attempts to ssh into it just dropped. As before, the current draw was ~250mA solid, as if it was hung and not actually operational.

All three times it took around 60-90 minutes before this happened. A full power cycle gets it going again, until 1-2 hours later when it happens again. To reiterate, 20151204-36 does not have this problem.

This is the closest match I could find on Google:
https://archlinuxarm.org/forum/viewtopic.php?f=64&t=9362

It’s a different kernel version, but the symptoms are the same. The cause in that case was a regression in the 4.1.9-1 kernel.

When I try to tar and untar the kernel source files I seem to have a similar problem with 4.4.0:

[ 1114.572173] —[ end trace e7a152329f7602dc ]—
[ 1114.573739] Call trace:
[ 1114.578340] [<ffffffc000106404>] lock_release+0x264/0x330
[ 1114.580516] [<ffffffc00090be88>] _raw_spin_unlock_irqrestore+0x24/0x74
[ 1114.586071] [<ffffffc0004d7194>] regmap_unlock_spinlock+0x10/0x18
[ 1114.592495] [<ffffffc0004d97b4>] regmap_update_bits+0x54/0x64
[ 1114.598658] [<ffffffc00068b7fc>] __mux_div_set_src_div+0x94/0x110
[ 1114.604387] [<ffffffc00068bce8>] __mux_div_set_rate_and_parent.isra.0+0x17c/0x1a8
[ 1114.610463] [<ffffffc00068bd34>] mux_div_set_rate+0xc/0x14
[ 1114.617927] [<ffffffc00068240c>] clk_change_rate+0x58/0x1b4
[ 1114.623309] [<ffffffc000682a18>] clk_core_set_rate_nolock+0x68/0x124
[ 1114.628777] [<ffffffc000682afc>] clk_set_rate+0x28/0x50
[ 1114.635374] [<ffffffc000637e58>] set_target+0x1e4/0x66c
[ 1114.640323] [<ffffffc00063118c>] __cpufreq_driver_target+0x244/0x3ac
[ 1114.645534] [<ffffffc0006353a8>] dbs_freq_increase+0x48/0x70
[ 1114.652128] [<ffffffc000635438>] od_check_cpu+0x68/0xd8
[ 1114.657768] [<ffffffc000637168>] dbs_check_cpu+0x198/0x1e8
[ 1114.662718] [<ffffffc000635938>] od_dbs_timer+0x5c/0x100
[ 1114.668274] [<ffffffc000637350>] dbs_timer+0xac/0xe4
[ 1114.673744] [<ffffffc0000d2afc>] process_one_work+0x1a0/0x3cc
[ 1114.678691] [<ffffffc0000d2e60>] worker_thread+0x138/0x438
[ 1114.684332] [<ffffffc0000d96a8>] kthread+0xdc/0xf0
[ 1114.689715] [<ffffffc000085c50>] ret_from_fork+0x10/0x40

hi,

some update on this issue… we spent a significant amount of time debugging it and i think we figured it out… It seems that this issue has been in all our release so far, but because it’s very “racy” and it looks like something that changed in 4.4 makes this bug much more reproducible than it used to be. There have a few similar reports with 4.4 branch, such as this one for example: https://bugs.96boards.org/show_bug.cgi?id=305.

The two commits that we pushed on the 4.4 branch today are supposed to fix this bug, see

https://git.linaro.org/landing-teams/working/qualcomm/kernel.git/shortlog/6bfb715ed71e6948d08b0205f721488bd764e65c

I am trying to get a build with these patches but we have some issues on Jenkins and it’s taking forever… so I am sharing the patches here in case someone can try them out… I am very interested in feedback.

thanks

a snapshot build with these changes is now available here:

http://builds.96boards.org/snapshots/dragonboard410c/linaro/debian/73/

if you are running 16.02 release, you just need to download the boot image and use it (no need to reflash the entire rootfs)

I re-compiled from source and with the fix the issue is no longer present.
Thank you very much!

@ljking
Did this patch resolve your scp issue? I have been seeing something similar (with scp + large files + downward performance) - wondering if it was related or not… my gut feeling is that it’s either wifi driver or cpufreq related, though.