GPU Crash During Specific Temperature Range

doitright · July 22, 2021, 10:13pm

That’s an extremely convincing test. Was the temperature reading true?

Also, I notice that you picked a 4.14 snapshot. Are you able to reproduce with 5.4, 5.10, and 5.13?

kimbo · July 22, 2021, 10:34pm

I loaded the Dragonboard 820C with 4.14 because this is the version we are using on our custom board and we wanted a side by side comparison (or close). No, we haven’t tried any newer versions on either the DB820 or the custom board.

We were reading the temperature from here: /sys/class/thermal/thermal_zone0/temp
Thanks,
Kim
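
For reference, the mainline kernel documents the thermal zone `temp` attribute in millidegrees Celsius, so a raw value of 8000 would mean 8 °C. Vendor kernels may scale it differently, so treat the unit as an assumption to verify on your build. A minimal Python sketch for reading it (the function names are illustrative, not from any library):

```python
from pathlib import Path


def parse_millicelsius(raw: str) -> float:
    """Convert a raw sysfs string, assumed to be millidegrees Celsius, to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_zone_celsius(zone: int = 0, base: str = "/sys/class/thermal") -> float:
    """Read thermal_zone<zone>/temp under `base` and return degrees Celsius."""
    path = Path(base) / f"thermal_zone{zone}" / "temp"
    return parse_millicelsius(path.read_text())


if __name__ == "__main__":
    # Only meaningful on a Linux target that exposes thermal_zone0.
    print(f"thermal_zone0: {read_zone_celsius():.1f} C")
```

Comparing this reading against an external probe at a known chamber temperature is one way to answer whether the number is true.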

doitright · July 22, 2021, 10:52pm

Try other kernels to see if the problem is fixed anywhere.
I asked you if the temperature was true, not where you read it from. For example, is it actually 25 degrees when you read 8? Or is it actually 8?

kimbo · July 23, 2021, 4:57pm

Hi @doitright,

We will try the latest kernel on the DB820C soon.
After discussing this with the hardware folks here: we are reading the internal temperature of the chip, which should be more accurate than a sensor placed on the chip. The chamber is around -5C when the crash occurs, and the chip is at a higher temperature than this. Can you please advise?

Thanks,
Kim

kimbo · July 23, 2021, 7:37pm

We are not having a lot of luck with getting other versions to boot on the DB820C. We tried 5.13 (build 820) and 5.10 (build 800). Neither booted successfully. We will try 5.7 next. If anyone can point me to which of these builds we should try, that would be great: Linaro Snapshots

kimbo · July 23, 2021, 10:25pm

We got Linux version 5.7 to run on the DragonBoard 820c. It does not crash!

doitright · July 27, 2021, 12:41pm

Good news then. You’re better off in every way with the newer kernel. Do you think you could get the boot logs for the newer kernels? It would be nice to see why they aren’t booting, 5.10 especially, since it’s an LTS.

It looks like the kernel versions map to builds as follows:
5.7: 572 through 743 inclusive.
5.10: 744 through 817 inclusive.

If you could bisect that and see what exactly was the first build that didn’t boot, it would help get things straightened out.

kimbo · July 28, 2021, 9:32pm

Hi @doitright,

Yes, I think I can get some boot logs of the newer kernels that were failing. I can probably get the system going next week (I don’t have it with me).

Is it possible to determine what area of 5.7 linux kernel code caused the DB820 not to crash the GPU? Any particular files? For us, it will take a lot of testing to switch to the 5.7 linux version. We are wondering if there is an area we could compare files between the two versions and come up with a patch for our customers now and switch to 5.7 later?

Thanks,
Kim

doitright · July 28, 2021, 9:53pm

Not unless you can get a log that shows a clear crash.

Bisecting is actually a pretty fast process. Do you know how to do that?

Imagine that you have known working build number 100 and known broken build number 200. You then try build 150. If 150 works, you try 175 (half way between 150 and 200). If 175 is broken, you try 162.

Basically, you go up or down half way based on working or non-working until you find the first broken build. That is pretty fast because each try cuts the number of possibilities in half, so it takes about 7 iterations for a range of 100.

Once you find the first broken build, then you find the kernel commits of the last working and first broken and bisect the kernel (you can use git to bisect the kernel!). Maybe it will be just 1 commit, but probably very few anyway since these builds are made nightly.
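
The halving procedure above can be sketched in Python. This is only an illustration: `works` is a hypothetical callback standing in for "flash this build and see whether it boots", and the sketch assumes every build after the first broken one is also broken.

```python
from typing import Callable


def first_broken_build(good: int, bad: int, works: Callable[[int], bool]) -> int:
    """Return the lowest failing build number between a known-good and a known-bad build."""
    if good >= bad:
        raise ValueError("good must be lower than bad")
    # Invariant: `good` works and `bad` is broken; stop when they are adjacent.
    while bad - good > 1:
        mid = (good + bad) // 2
        if works(mid):
            good = mid
        else:
            bad = mid
    return bad


if __name__ == "__main__":
    tried = []

    def works(build: int) -> bool:
        # Made-up stand-in for a manual boot test: pretend build 163 broke things.
        tried.append(build)
        return build < 163

    print(first_broken_build(100, 200, works), "after trying", tried)
```

With good build 100 and broken build 200, the first tries are 150, 175 and 162, matching the walk-through above. `git bisect` automates the same search over commits once the last working and first broken kernel revisions are known.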

danielt · July 29, 2021, 10:18am

Nothing easy!

You could try merging changes across from the v5.7 DT into the v4.14 DT and vice versa (can you fix v4.14 by changing the DT to be more like v5.7? can you break v5.7 by changing the DT to be more like v4.14?). The idea is to see if the useful change is DT related or code related. For DTs that are included in the kernel sources, bisecting both at once is possible, but you might get lucky and discover that the beneficial change comes from the DT. If so, then things get much easier!

PS A friend of mine observed that the temperatures where it fails correspond to the temperature range where some modern silicon processes are “fastest”… so if you are comparing the two source bases and come across anything that could reduce voltage or increase minimum clock speed, then it should really attract your attention.

doitright · July 29, 2021, 4:40pm

Oh maybe…

https://git.linaro.org/landing-teams/working/qualcomm/kernel.git/commit/arch/arm64/boot/dts/qcom/apq8096-db820c.dtsi?h=release/qcomlt-5.7&id=7a2a2231ef22cb158ea05e60ba6a6d329327a963

https://git.linaro.org/landing-teams/working/qualcomm/kernel.git/commit/arch/arm64/boot/dts/qcom/apq8096-db820c.dtsi?h=release/qcomlt-5.7&id=2800aaa3b8bfb3bbc341ed515bf9e50a0f58fe22

kimbo · August 2, 2021, 8:43pm

@danielt and @doitright, just want to let you know that we are investigating both of your suggestions. I will let you know when we get something working in 4.14.

Thanks,
Kim

kimbo · August 19, 2021, 6:30pm

Hi @danielt and @doitright,

We did some further investigation into this issue. The first thing we did was a binary search of the Linaro builds to determine at which build the GPU crash stopped occurring. We could not make an accurate determination and could only conclude that the latest 4.14 images have the issue and the earliest (loadable) 5.7 image (574) did not have the crash.

The 5.4 images were not that reliable. The 5.4 buster versions did not allow us to even run glxgears. The newer 5.4 sid versions were loadable, but glxgears ran slower and was jerky.

We then tried building version 5.7 ourselves and started modifying the device tree sources to get the gpu to crash on the DB820. We were able to get the gpu crash, in the specific temperature range, if we only allowed one gpu frequency (no other changes in the device tree), such as 624MHz:

opp-624000000 {
    opp-hz = /bits/ 64 <624000000>;
    opp-supported-hw = <0x01>;
};

We are confused why this would cause the gpu crash. Do you have any theories or suggestions?

Thanks,
Kim

robclark · August 26, 2021, 7:08pm

If there is an issue with the power rail to the gpu, that could definitely cause crashes… and I could believe that higher freq’s would be more problematic. (IIRC 624MHz is the highest freq for msm8996)

But this is somewhat outside of my area of expertise (ie. it is a power issue, not a gpu issue)

kimbo · August 26, 2021, 11:20pm

I should mention that we also tried setting the frequency to be only 510MHz and we got the same failure. Even the lowest frequency (133MHz) had the failure, when it was the only frequency in the device tree.

Thanks,
Kim

kimbo · January 21, 2022, 2:46pm

We still have not been able to switch kernel versions because the other kernel versions we have tried have other issues that we cannot resolve. These are the known issues with the kernel versions we have tried:

4.14
• GPU crash issue in specific temperature range
5.7
• Encoder occasionally crashes
• Capture isn’t as reliable and drops frames occasionally. With HDMI display enabled (and GPU drawing), capture will drop more frames.
5.13
• Capture isn’t as reliable and drops frames occasionally. With HDMI display enabled (and GPU drawing), capture will drop more frames.

We are currently planning to implement a workaround for the issue in 4.14 that avoids using the GPU in the temperature range where we see the crash. If anyone has any ideas/suggestions to fix the issues in 5.7 or 5.13, we are open to trying your suggestions.

Thanks,
Kim
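
A workaround of this kind could be sketched roughly as below. This is a hypothetical illustration, not the implementation used on the custom board: the thread does not state the exact failing range, so `low_c` and `high_c` are placeholders, and the hysteresis margin is an added assumption meant to stop the GPU toggling on and off at the edge of the range.

```python
from dataclasses import dataclass


@dataclass
class GpuGate:
    """Decide whether GPU use is allowed for a given chip temperature."""
    low_c: float            # lower edge of the failing range (placeholder value)
    high_c: float           # upper edge of the failing range (placeholder value)
    margin_c: float = 2.0   # hysteresis margin, an assumption of this sketch
    allowed: bool = True

    def update(self, temp_c: float) -> bool:
        if self.low_c <= temp_c <= self.high_c:
            # Inside the failing range: stop using the GPU.
            self.allowed = False
        elif temp_c < self.low_c - self.margin_c or temp_c > self.high_c + self.margin_c:
            # Clearly outside the range: GPU use is allowed again.
            self.allowed = True
        # Within the margin band the previous decision is kept.
        return self.allowed


if __name__ == "__main__":
    gate = GpuGate(low_c=5.0, high_c=12.0)  # made-up range for illustration
    for reading in (20.0, 8.0, 13.0, 15.0):
        print(reading, gate.update(reading))
```

The caller would feed it the chip temperature on each poll and fall back to a non-GPU rendering path whenever `update` returns False.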