GPU Crash During Specific Temperature Range

kimbo · July 9, 2021, 9:39pm

Hi. We are seeing a GPU crash during temperature changes on multiple custom boards with the Snapdragon 820c chip. When the Snapdragon reads between 9 and 12 degrees Celsius, we get the crash below. We use /sys/class/thermal/thermal_zone0/temp to read this temperature. Can anyone offer some guidance?

Thanks,
Kim

[ 1520.504254] msm 900000.mdss: gpu fault ring 1 fence 2e1f7 status C00401C3 rb 01e1/01e1 ib1 000000000234F000/012d ib2 000000000D5A1000/0000
[ 1520.504430] msm 900000.mdss: A530: hangcheck recover!
[ 1520.516875] msm 900000.mdss: A530: offending task: V:flush_queue0 (./VTNext_ARM64_Release. out -FULL)
[ 1520.520857] revision: 530 (5.3.0.2)
[ 1520.531692] rb 0: fence: 0/0
[ 1520.533228] rptr: 39
[ 1520.536277] rb wptr: 39
[ 1520.539096] rb 1: fence: 188917/188919
[ 1520.541584] rptr: 464
[ 1520.546172] rb wptr: 481
[ 1520.548559] rb 2: fence: 0/0
[ 1520.550785] rptr: 0
[ 1520.553641] rb wptr: 0
[ 1520.556065] rb 3: fence: 0/0
[ 1520.558505] rptr: 0
[ 1520.561639] rb wptr: 0
[ 1520.564028] CP_SCRATCH_REG0: 0
[ 1520.566481] CP_SCRATCH_REG1: 0
[ 1520.569614] CP_SCRATCH_REG2: 188917
[ 1520.572620] CP_SCRATCH_REG3: 0
[ 1520.576053] CP_SCRATCH_REG4: 0
[ 1520.576116] qcom-camss a34000.camss: Active buffer mismatch!
[ 1520.579137] CP_SCRATCH_REG5: 3069549
[ 1520.588003] CP_SCRATCH_REG6: 3069565
[ 1520.591591] CP_SCRATCH_REG7: 3069550
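
For anyone reading along, the "gpu fault" line can be decoded with a few lines of Python. This is only a sketch: the field names (rptr, wptr, ib1_base and so on) are my own labels, and the assumption that every value after the ring number is hexadecimal is inferred from the log text, not taken from driver documentation. Under that assumption, fence 2e1f7 is 188919 in decimal, which matches the second number on the "rb 1: fence: 188917/188919" line, and rb 01e1 is 481, which matches "rb wptr: 481".

```python
import re

# Field layout inferred from the log lines in this thread (an assumption,
# not a documented format). All values except the ring number are treated as hex.
FAULT_RE = re.compile(
    r"gpu fault ring (?P<ring>-?\d+) fence (?P<fence>[0-9a-fA-F]+) "
    r"status (?P<status>[0-9a-fA-F]+) "
    r"rb (?P<rptr>[0-9a-fA-F]+)/(?P<wptr>[0-9a-fA-F]+) "
    r"ib1 (?P<ib1_base>[0-9a-fA-F]+)/(?P<ib1_size>[0-9a-fA-F]+) "
    r"ib2 (?P<ib2_base>[0-9a-fA-F]+)/(?P<ib2_size>[0-9a-fA-F]+)"
)

HEX_FIELDS = ("fence", "status", "rptr", "wptr",
              "ib1_base", "ib1_size", "ib2_base", "ib2_size")


def parse_gpu_fault(line):
    """Return the fault fields as integers, or None if the line does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    fields = {"ring": int(m.group("ring"))}
    for name in HEX_FIELDS:
        fields[name] = int(m.group(name), 16)
    return fields


# First fault line from the log above.
line = ("[ 1520.504254] msm 900000.mdss: gpu fault ring 1 fence 2e1f7 "
        "status C00401C3 rb 01e1/01e1 "
        "ib1 000000000234F000/012d ib2 000000000D5A1000/0000")
fault = parse_gpu_fault(line)
print(fault["ring"], fault["fence"], fault["wptr"])  # prints "1 188919 481"
```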

kimbo · July 14, 2021, 1:20pm

We were able to recreate this crash with the Mesa OpenGL application called gpxgears. We soaked the hardware in a temperature chamber at -20 C, then turned on the unit and started the gpxgears application. We slowly increased the chamber temperature and monitored the temperature of the hardware with /sys/class/thermal/thermal_zone0/temp. At around 7.3 C, we saw the GPU crash shown below. We also observed that the frame rate of gpxgears dropped from 60 fps to 30 fps. We continued to increase the chamber temperature until the reading was 9 C, and then everything went back to normal. Any suggestions would be appreciated.

root@sla-alip:~# cat /sys/class/thermal/thermal_zone0/temp
6000
root@sla-alip:~# cat /sys/class/thermal/thermal_zone0/temp
6400
root@sla-alip:~# cat /sys/class/thermal/thermal_zone0/temp
6400
root@sla-alip:~# cat /sys/class/thermal/thermal_zone0/temp
6400
root@sla-alip:~# cat /sys/class/thermal/thermal_zone0/temp
7000
root@sla-alip:~# cat /sys/class/thermal/thermal_zone0/temp
7000
root@sla-alip:~# cat /sys/class/thermal/thermal_zone0/temp
7000
root@sla-alip:~# cat /sys/class/thermal/thermal_zone0/temp
7000
root@sla-alip:~# cat /sys/class/thermal/thermal_zone0/temp
7300
root@sla-alip:~# cat /sys/class/thermal/thermal_zone0/temp
7300
root@sla-alip:~# [ 458.933132] msm 900000.mdss: gpu fault ring 1 fence 1fbb8 status C00401C3 rb 0218/0218 ib1 00000000021DF000/0000 ib2 0000000003A7F000/0000
[ 458.933193] msm 900000.mdss: A530: hangcheck recover!
[ 458.944500] msm 900000.mdss: A530: offending task: X:flush_queue0 (/usr/lib/xorg/Xorg -nolisten tcp -auth /var/run/sddm/{6001025e-8eaf-46b8-b4fa-5d6671f66083} -background none -noreset -displayfd 17 -seat seat0 vt7)
[ 458.949620] revision: 530 (5.3.0.2)
[ 458.968520] rb 0: fence: 0/0
[ 458.971985] rptr: 39
[ 458.975108] rb wptr: 39
[ 458.977889] rb 1: fence: 129974/129976
[ 458.980378] rptr: 480
[ 458.984312] rb wptr: 536
[ 458.986913] rb 2: fence: 0/0
[ 458.989518] rptr: 0
[ 458.992442] rb wptr: 0
[ 458.994899] rb 3: fence: 0/0
[ 458.997330] rptr: 0
[ 459.000429] rb wptr: 0
[ 459.002888] CP_SCRATCH_REG0: 0
[ 459.005319] CP_SCRATCH_REG1: 0
[ 459.008419] CP_SCRATCH_REG2: 129974
[ 459.011485] CP_SCRATCH_REG3: 0
[ 459.014866] CP_SCRATCH_REG4: 0
[ 459.017990] CP_SCRATCH_REG5: 3324011
[ 459.021031] CP_SCRATCH_REG6: 3324032
[ 459.024740] CP_SCRATCH_REG7: 782638
[ 459.122710] msm 900000.mdss: gpu fault ring 1 fence 1fbd7 status C00401C3 rb 0490/04a0 ib1 0000000002245000/0000 ib2 0000000003847000/0000
[ 459.122769] msm 900000.mdss: A530: hangcheck recover!
[ 459.134075] msm 900000.mdss: A530: offending task: X:flush_queue0 (/usr/lib/xorg/Xorg -nolisten tcp -auth /var/run/sddm/{6001025e-8eaf-46b8-b4fa-5d6671f66083} -background none -noreset -displayfd 17 -seat seat0 vt7)
[ 459.139202] revision: 530 (5.3.0.2)
[ 459.158100] rb 0: fence: 0/0
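
Instead of running cat by hand, the same sysfs file can be polled from a small script so that every reading carries a timestamp. The sketch below assumes the file reports millidegrees Celsius, which is consistent with the 6000 to 7300 readings above corresponding to 6.0 to 7.3 C. The function names, the one-second default interval, and the output format are illustrative choices, and whether time.monotonic() lines up with the kernel log timestamps on a given system is something to check, not a given.

```python
import time

THERMAL_ZONE = "/sys/class/thermal/thermal_zone0/temp"  # path used in this thread


def read_temp_c(path=THERMAL_ZONE):
    """Return the zone temperature in degrees Celsius.

    Assumes the file holds an integer number of millidegrees Celsius.
    """
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def log_temps(path=THERMAL_ZONE, interval_s=1.0, samples=None, out=print):
    """Emit 'monotonic_seconds temperature_C' lines.

    samples=None keeps polling until interrupted; otherwise stop after
    that many readings. 'out' is any callable taking one string.
    """
    count = 0
    while samples is None or count < samples:
        out(f"{time.monotonic():.3f} {read_temp_c(path):.1f}")
        count += 1
        if samples is None or count < samples:
            time.sleep(interval_s)
```

Calling log_temps() with no arguments prints one reading per second until stopped with Ctrl-C; redirecting the output to a file gives a record that can be compared with the kernel log afterwards.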

doitright · July 15, 2021, 2:06pm

That’s really interesting, especially since the temperature range where you are experiencing the fault is typically a very stable temperature for electronics. Can this be reproduced with a different unit?

Best expert I know of for this hardware is @robclark – he should at least be able to give you a good place to start looking.

kimbo · July 15, 2021, 2:46pm

Hi. Yes, we have seen the issue on multiple SOMs. A couple of customers actually reported the issue to us, and we also thought the temperature range was quite odd. We have contacted the manufacturer of the SOMs with the Snapdragon chip on them, and they are also able to reproduce the issue at their site. When I find out more from them, I will post here. I would be interested to hear from @robclark if he has any thoughts. Thanks, Kim

robclark · July 15, 2021, 4:01pm

I guess either (a) some sort of power issue, or (b) possibly a GPMU firmware issue (I assume GPMU is monitoring some thermal sensor(s) within the GPU… but the GPMU fw is kind of a black box to me)

ljking · July 15, 2021, 6:08pm

If I had to guess, I would suspect a PLL settings issue. If the PLL loses lock, then nothing is guaranteed.

But I also like Rob’s suggestion. We know that the chip has a carefully controlled balance between die temperature, operating frequency and supply voltage. Maybe turning the supply voltage to the GPU up a little bit in this temperature range might help.

kimbo · July 15, 2021, 6:25pm

Hi. We heard back from the manufacturer of the boards, and they can only reproduce the issue with Linux, not Android. Are either of the suggestions from @robclark or @ljking device tree changes that I can try, or are they more hardware changes? I will pass along your suggestions to the hardware people here.
We do have a Dragonboard 820C that we can also test to see if it also has this issue.
Thanks,
Kim

robclark · July 15, 2021, 6:41pm

One thing that might work to rule out a GPMU issue (but this configuration has not been tested, so it also might just completely not work) is to disable SP/TP power collapse by commenting out the call to a5xx_pc_init() in drivers/gpu/drm/msm/adreno/a5xx_power.c

Another thing that might be worth a try is using a530v3_gpmu.fw2 from the android build.

kimbo · July 15, 2021, 7:01pm

Hi @robclark,
These sound like good things to try! Where does the a530v3_gpmu.fw2 reside?

Thanks,
Kim

robclark · July 15, 2021, 7:45pm

it would be in /lib/firmware/qcom on linux… not sure on android, maybe somewhere under /vendor I assume?
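
Before swapping firmware files, it may be worth checking whether the Linux and Android copies actually differ. A minimal sketch follows, assuming both files have been copied somewhere readable; the helper names are made up for illustration, and the Android-side path is a placeholder because its real location is not known in this thread.

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def same_firmware(linux_copy, android_copy):
    """True if the two firmware files are byte-for-byte identical."""
    return sha256_of(linux_copy) == sha256_of(android_copy)
```

For example, same_firmware("/lib/firmware/qcom/a530v3_gpmu.fw2", "android_copy/a530v3_gpmu.fw2") would compare the Linux file against a copy pulled from the Android image, where "android_copy/" is a hypothetical directory. If the result is True, trying the Android firmware would not change anything.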

kimbo · July 15, 2021, 9:39pm

Great! I will let you know when I get a chance to try these things and how they work out

doitright · July 18, 2021, 1:00pm

What kind of Android are they running on it? AOSP/Mesa or CAF/qcom? The latter would have different drivers entirely, whereas the former would be the same drivers as GNU/Linux.

kimbo · July 19, 2021, 5:06pm

Hi. Good question! I asked the board manufacturer, and they said they tested with Android 7 Nougat CAF LA.UM.5.5.r1-01800-8x96.0. So they are using different drivers than Linux, which would explain why they didn’t see the issue on Android.

kimbo · July 20, 2021, 10:41pm

Hi @robclark,
We still got the same crash when I commented out the call to a5xx_pc_init() in a5xx_power.c. Does this mean it could still be a GPMU issue?

Thanks,
Kim

robclark · July 21, 2021, 1:40am

I think that makes it less likely to be GPMU, but GPMU is still a bit of a black box… it wouldn’t hurt to try the same GPMU fw that is used on the android build.

One way or another this seems like something power related, so either related to GPMU or something outside of the gpu driver.

robclark · July 21, 2021, 2:30am

Also, someone may well have already tested this, but on the off chance no one has, it would probably be a good idea to compare the exact same board with CAF/Android vs Linux, just to rule out a physical issue.

doitright · July 21, 2021, 7:54pm

Have you tested on the DB820C? That should at least tell you if it’s something tied to the board implementation.

And what are you using for a kernel? I.e. version and, if possible, link to source?

kimbo · July 21, 2021, 7:59pm

That is our next step. I am going to load Linux version 4.14 onto the Dragonboard 820C and then run the temperature test, most likely by the end of the week. I will let you know the results.
Thanks,
Kim

kimbo · July 22, 2021, 9:56pm

We were able to reproduce the issue on the Dragonboard 820C, running the gpxgears application. We used the boot and rootfs from here: Linaro Snapshots

The unit was powered on at room temperature and then the temperature in the chamber was lowered. When the dragonboard read 8.3C, the application became glitchy and then the board crashed.
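
To pin down the temperature at which each fault occurs across runs, fault timestamps from the kernel log can be paired with the nearest temperature reading. The sketch below assumes temperature readings are available as (seconds, degrees C) pairs on the same time base as the kernel log, which has to be arranged by whoever collects them. The function names are illustrative, and the sample readings in the example are made-up values, not measurements from this test.

```python
import bisect
import re

# Matches the "[ 458.933132]" style timestamp at the start of a kernel log entry.
TS_RE = re.compile(r"\[\s*(\d+\.\d+)\]")


def fault_times(dmesg_text):
    """Return the timestamps (seconds) of all 'gpu fault' lines."""
    times = []
    for line in dmesg_text.splitlines():
        if "gpu fault" in line:
            m = TS_RE.search(line)
            if m:
                times.append(float(m.group(1)))
    return times


def temp_at(samples, t):
    """Return the temperature of the sample nearest in time to t.

    samples must be a non-empty list of (time_s, temp_c) sorted by time.
    """
    times = [s[0] for s in samples]
    i = bisect.bisect_left(times, t)
    candidates = samples[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda s: abs(s[0] - t))[1]


# Log lines copied from earlier in this thread (truncated after the status field).
dmesg = """[ 458.933132] msm 900000.mdss: gpu fault ring 1 fence 1fbb8 status C00401C3
[ 458.933193] msm 900000.mdss: A530: hangcheck recover!
[ 459.122710] msm 900000.mdss: gpu fault ring 1 fence 1fbd7 status C00401C3"""

# Hypothetical readings for illustration only.
samples = [(450.0, 7.0), (458.0, 7.3), (465.0, 7.3)]

for t in fault_times(dmesg):
    print(t, temp_at(samples, t))
```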

kimbo · July 22, 2021, 9:58pm

I confirmed with the manufacturer of the custom boards that they tested Android (CAF) and Linux with the same hardware.