HiKey Unexpected Shutdown (Overheating?)

johnson1066 · February 20, 2017, 9:27am

While using the HiKey board, it seems to shut down during periods of intense computation. The most likely culprit seems to be overheating, but I cannot find a way to measure temperature; the method listed on the Wiki of using lm-sensors fails, because the program reports that there are no sensors on the board. Is there any way to check/confirm this?

ldts-atsuka · February 20, 2017, 9:27am

Hi @johnson1066,

I have experienced the same.
Are you using the original image that was installed on the HiKey when you bought it?
Thermal management has improved since the 15.11 images, which are what the original HiKey ships with.

Would you mind trying a newer image?
(Please use the fastboot method; the SD card method does not work at the moment.)

http://www.96boards.org/documentation/ConsumerEdition/HiKey/Downloads/Debian.md/

http://www.96boards.org/documentation/ConsumerEdition/HiKey/Installation/README.md/

ldts · February 20, 2017, 9:27am

@johnson1066, could you share which kernel you are using?

$ uname -a

johnson1066 · February 20, 2017, 9:27am

Uname -a output:

linaro@linaro-alip:~$ uname -a
Linux linaro-alip 3.18.0-linaro-hikey #1 SMP PREEMPT Thu Feb 25 07:17:29 UTC 2016 aarch64 GNU/Linux

From what I can tell, that is the newest image. Is that correct? Additionally, I attempted the other method listed on the wiki, on this page:

http://wiki.lemaker.org/How_to_read_the_CPU%26PMU_temperature

And read the value at /sys/class/thermal/thermal_zone1/temp

Unfortunately, I get values of ~35000, which are hard to decipher without units.

ldts-atsuka · February 20, 2017, 9:27am

Hi @johnson1066,

I mistakenly pointed to the old image instead of the new one.

Here are the instructions for the new images:
https://github.com/Linaro/documentation/blob/master/Reference-Platform/RPOfficial/ConsumerEdition/HiKey/README.md

ldts · February 20, 2017, 9:27am

The temperature is exported through sysfs in millidegrees Celsius (so 35000 means 35 C).
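To make the raw readings easier to interpret, here is a minimal sketch that prints every thermal zone in degrees Celsius. The helper names millideg_to_c and print_zone_temps are made up for illustration, and the script assumes the kernel exposes the standard /sys/class/thermal/thermal_zone*/temp files.

```shell
#!/bin/sh
# Convert a raw sysfs thermal reading (millidegrees Celsius) to degrees Celsius.
# millideg_to_c is a hypothetical helper name, not part of any tool.
millideg_to_c() {
    awk -v raw="$1" 'BEGIN { printf "%.1f\n", raw / 1000 }'
}

# Print the type and temperature of every thermal zone that can be read.
print_zone_temps() {
    for zone in /sys/class/thermal/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        raw=$(cat "$zone/temp" 2>/dev/null) || continue
        printf '%s (%s): %s C\n' "${zone##*/}" \
            "$(cat "$zone/type" 2>/dev/null)" "$(millideg_to_c "$raw")"
    done
}

print_zone_temps
```

With this, a raw value of 35000 is shown as 35.0 C.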

You are using an old kernel; its temperature trip points are probably not well adjusted, so the zones do not get the time they need to cool down.

You could either modify those values in the device tree, if that is indeed the problem, or simply upgrade your system.

johnson1066 · February 20, 2017, 9:27am

Thanks for the info, guys. I’ll update and see how that works.

johnson1066 · February 20, 2017, 9:27am

With the new kernel, the board seems to reboot instead of just shutting off. It reboots during the compilation of OpenBLAS, which I was able to do with the old kernel. An interesting note is that cpufreq-info shows no/unknown cpufreq driver, and temperatures get as high as 78C right before the reboot, which is pretty worrying.

ldts · February 20, 2017, 9:27am

Can you share the kernel version, please?

$ uname -a

johnson1066 · February 20, 2017, 9:27am

Output of uname -a:

Linux linaro-alip 4.4.0-135-arm64 #1 SMP Debian 4.4.11.linaro.135-1linaro.135-1linarojessie1 (2016-06-28) aarch64 GNU/Linux

ldts · February 20, 2017, 9:27am

Yes, I can reproduce this on my end. It seems there are no cooling devices for the trip points.

$ apt-get install stress
$ stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M

This puts the SoC at 100C, at which point the system reboots. I’ll get back to you on this.
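One way to check for missing cooling devices yourself is to list the trip points and cooling devices that the kernel has registered. The following is a rough sketch only: the function name list_thermal_layout and its optional directory argument are invented for illustration, and it assumes the usual sysfs layout (trip_point_N_temp, trip_point_N_type, cooling_deviceN/type).

```shell
#!/bin/sh
# List trip points per thermal zone, then any registered cooling devices.
# The optional first argument (a hypothetical override) selects another
# directory instead of /sys/class/thermal.
list_thermal_layout() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        [ -d "$zone" ] || continue
        echo "${zone##*/}:"
        for trip in "$zone"/trip_point_*_temp; do
            [ -r "$trip" ] || continue
            type_file="${trip%_temp}_type"
            echo "  trip $(cat "$type_file" 2>/dev/null) at $(cat "$trip") millidegrees C"
        done
    done
    found=0
    for dev in "$root"/cooling_device*; do
        [ -d "$dev" ] || continue
        found=1
        echo "${dev##*/}: $(cat "$dev/type" 2>/dev/null)"
    done
    [ "$found" -eq 1 ] || echo "no cooling devices registered"
}

list_thermal_layout
```

If the final line reports that no cooling devices are registered, the trip points have nothing to throttle, which would match the behaviour described above.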

ldts · February 20, 2017, 9:27am

The support team at Linaro came up with an analysis and a temporary solution for this issue on the RPB 16.06 release. Please read below:

The problem is the missing ‘hisi-acpu-cpufreq’ module and the need to set “power_allocator” as the thermal governor.

As a temporary solution, add the following two lines to /etc/rc.local so that it looks like this:

test -f /etc/ssh/ssh_host_dsa_key || dpkg-reconfigure openssh-server

modprobe hisi-acpu-cpufreq
echo "power_allocator" > /sys/class/thermal/thermal_zone0/policy

exit 0

In my tests, the temperature does not rise above 75C.
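Before relying on the rc.local change, it may be worth confirming that the kernel actually offers the power_allocator governor for the zone. This sketch assumes the zone exposes the standard available_policies and policy files; zone_has_policy is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if the named governor is listed in a zone's available_policies file.
# zone_has_policy is a hypothetical helper, not part of any package.
zone_has_policy() {
    zone_dir="$1"
    wanted="$2"
    [ -r "$zone_dir/available_policies" ] || return 1
    for p in $(cat "$zone_dir/available_policies"); do
        [ "$p" = "$wanted" ] && return 0
    done
    return 1
}

zone=/sys/class/thermal/thermal_zone0
if zone_has_policy "$zone" power_allocator; then
    echo "power_allocator is available; current policy: $(cat "$zone/policy")"
else
    echo "power_allocator is not listed for $zone"
fi
```

If the governor is not listed, the echo into the policy file would be expected to fail, so the kernel configuration is the first thing to check.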

johnson1066 · February 20, 2017, 9:27am

That did the trick. Thank you. Does this trigger some manner of CPU throttling?

ldts · February 20, 2017, 9:27am

Yes, that is right.
You could open a couple of terminals (over ssh or the desktop) and see the throttling:

terminal 1 $ watch -n1 "cpufreq-info | grep \"CPU frequency\""
terminal 2 $ watch -n1 cat /sys/class/thermal/thermal_zone0/temp
terminal 3 $ stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M
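If a single log is more convenient than several terminals, something along these lines could record temperature and frequency together while stress runs. It is only a sketch: sample_line and log_throttling are made-up names, and it assumes cpu0 exposes scaling_cur_freq (in kHz) once a cpufreq driver is loaded.

```shell
#!/bin/sh
# Print one sample: wall-clock time, raw temperature, and current CPU frequency.
# Unreadable files are reported as "n/a" rather than aborting the run.
sample_line() {
    temp_file="$1"
    freq_file="$2"
    temp=$(cat "$temp_file" 2>/dev/null || echo "n/a")
    freq=$(cat "$freq_file" 2>/dev/null || echo "n/a")
    printf '%s temp_mC=%s freq_kHz=%s\n' "$(date +%T)" "$temp" "$freq"
}

# Take COUNT samples, one second apart (default 3).
log_throttling() {
    count="${1:-3}"
    i=0
    while [ "$i" -lt "$count" ]; do
        sample_line /sys/class/thermal/thermal_zone0/temp \
            /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
        i=$((i + 1))
        sleep 1
    done
}

log_throttling 3
```

For a longer run, raise the count and redirect the output to a file; with throttling active, freq_kHz should drop as temp_mC approaches the trip point.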

ivid · November 17, 2017, 7:29pm

@ldts @leo-yan

I am facing a similar overheating issue on a HiKey 620, but with the 4.9 kernel. I could not insert the hisi-acpu-cpufreq module, since it has been removed from the kernel source. I tried adding the patch, but the kernel panics.

Is there any other cpufreq module available for CPU throttling? Please let me know your views.

Thanks,
Ivid