Frequency switching stability issues on Q820

Hi,

I am running the cpufreq-bench benchmark utility on a Q820-based board, using Linux Debian with kernel version 4.14.96 (https://git.linaro.org/landing-teams/working/qualcomm/kernel.git/log/?h=release/qcomlt-4.14) and this rootfs: http://snapshots.linaro.org/96boards/dragonboard820c/linaro/debian/361/

When I run the benchmark, the board sometimes crashes. Sometimes cpufreq-bench completes successfully, but a crash occurs in roughly 1 out of 5 test runs. Crash logs are attached here: https://pastebin.com/Eu9r79gQ

I am running cpufreq-bench on all 4 CPUs using the commands below:

cpufreq-bench -l 25000 -s 25000 -n 200 -y 25000 -x 25000 -r 5 -c 0 &
cpufreq-bench -l 25000 -s 25000 -n 200 -y 25000 -x 25000 -r 5 -c 1 &
cpufreq-bench -l 25000 -s 25000 -n 200 -y 25000 -x 25000 -r 5 -c 2 &
cpufreq-bench -l 25000 -s 25000 -n 200 -y 25000 -x 25000 -r 5 -c 3 &
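
The four commands above differ only in the -c argument, so they can be generated by a loop. This is a minimal sketch, assuming cpufreq-bench is on PATH; the function name run_bench_all and the BENCH override variable are made up for illustration and are not part of cpufreq-bench.

```shell
# Launch one cpufreq-bench instance per CPU with the same options as above,
# then wait for all of them to finish.
# BENCH is a hypothetical override, useful for a dry run (BENCH=echo).
run_bench_all() {
    ncpus="${1:-4}"
    bench="${BENCH:-cpufreq-bench}"
    cpu=0
    while [ "$cpu" -lt "$ncpus" ]; do
        "$bench" -l 25000 -s 25000 -n 200 -y 25000 -x 25000 -r 5 -c "$cpu" &
        cpu=$((cpu + 1))
    done
    wait    # returns once every background instance has exited
}
```

Calling `run_bench_all 4` reproduces the test above; `BENCH=echo run_bench_all 4` only prints the arguments each instance would receive.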

Can someone suggest how to debug this issue?

Well, we think it could be related to doing frequency scaling without doing voltage scaling (which could be fixed in the msm8996 CPR driver). You may want to test this patch: https://lore.kernel.org/patchwork/patch/937453/ but it has been reverted because it was causing some kernel panics, AFAIK…

I have tried the patch you suggested (https://lore.kernel.org/patchwork/patch/937453/). With this patch the board gets stuck at boot time and then crashes, so there is no successful boot.

Attaching the log here: https://pastebin.com/j3y3S69Z. I am trying to debug what causes the issue.

I have also tried the last 3 patches from the "show related" link, but the board does not boot in that case either. Attaching the log here: https://pastebin.com/M3eYyEPd

Hi @Loic, @anon91830841

With the above patch, when I use saw3 the board does not boot and gets stuck, so I used apcs as the saw reg instead, and then the board booted successfully. But when I run my cpufreq-bench use case on all 4 CPUs, the board still crashes.

-qcom,saw-reg = <&saw3>;
+qcom,saw-reg = <&apcs>;

So what is the issue here? And is there any branch in which voltage scaling is implemented together with frequency scaling for Q820?

Thanks,
Hiren

Yes, it’s related to voltage scaling not being implemented… I’m afraid there is currently no plan to implement it and no company sponsoring this work.

Hi @Loic
What would be involved in getting this working, at least to the point where these SD820 boards would be able to operate stably? With some moderate guidance, I wouldn’t mind getting my hands dirty on this, since without it, these two DB820Cs I’ve recently obtained aren’t particularly useful. I’ve got the random boot failures, which, as I understand it, are caused by the same problem.

@Hiren, @doitright,

Yes, to summarize, we think this issue is due to doing CPU frequency scaling without doing voltage scaling. For debugging purposes, I’ve tried in the past to play with the OPP frequency table in the msm8996 device tree, which seems to affect how reproducible the boot hang is (I did not perform cpufreq-bench stress tests), and limiting the big/LITTLE cores to ~556 MHz prevented the issue from reproducing.

The idea would be to add support for CPU voltage adjustment on frequency change, or maybe, as a first step, statically set the voltage to the highest value… The regulator responsible for this is ‘S11’; it can be controlled via the existing qcom_spmi-regulator driver (pm8994-regulators compatible node in dts).

The patch I mentioned above [1] was an attempt to set up automatic voltage scaling, but it was reverted at some point because of a kernel panic issue (I don’t know the exact root cause).

The first thing I would recommend is to take inspiration from the linked patch and add only the pmic@1 node to the devicetree, then set/force regulator-min-microvolt to the max value (1140000), which should force S11 to 1140 mV at boot (add some debug to confirm this). Then try to reproduce the issue (either boot or stress test hang).
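
For stress tests after boot, a similar frequency cap can be tried at runtime through the cpufreq sysfs interface instead of editing the OPP table. This is a different mechanism from the device tree change described above and cannot influence a hang that happens during boot. A minimal sketch, run as root; the function name cap_max_freq and the CPUFREQ_ROOT override are made up for illustration, and 556800 is only an assumed example value (pick one reported in scaling_available_frequencies on the board).

```shell
# Cap every cpufreq policy at max_khz by writing scaling_max_freq.
# CPUFREQ_ROOT is a hypothetical override so the function can be tried
# against a fake directory tree.
cap_max_freq() {
    max_khz="$1"
    root="${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        # Skip policies that are missing or not writable (needs root).
        [ -w "$policy/scaling_max_freq" ] || continue
        echo "$max_khz" > "$policy/scaling_max_freq"
        echo "$(basename "$policy"): max now $(cat "$policy/scaling_max_freq") kHz"
    done
}
```

For example, `cap_max_freq 556800` before starting cpufreq-bench would show whether the crash still reproduces with the higher frequencies excluded.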

[1] https://lore.kernel.org/patchwork/patch/937453/
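
Once the board boots, the voltage that was actually applied can also be read back from the regulator sysfs class, as a complement to kernel debug prints. A minimal sketch, assuming the regulator is registered under the name s11 and exposes a microvolts attribute; the function name regulator_uv and the REGULATOR_ROOT override are made up for illustration.

```shell
# Print the current voltage (in microvolts) of the regulator named $1.
# REGULATOR_ROOT is a hypothetical override for testing on a fake tree.
regulator_uv() {
    want="${1:-s11}"
    root="${REGULATOR_ROOT:-/sys/class/regulator}"
    for reg in "$root"/regulator.*; do
        [ -r "$reg/name" ] || continue
        if [ "$(cat "$reg/name")" = "$want" ]; then
            cat "$reg/microvolts"
            return 0
        fi
    done
    echo "regulator $want not found" >&2
    return 1
}
```

With regulator-min-microvolt forced as described, `regulator_uv s11` would be expected to print the forced value.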

Thanks @Loic, I will give that a shot and respond back on how it works out.

Hmm

[     6.027090] s11: Bringing 1175000uV into 1140000-1140000uV
...
[     8.257100] cpufreq_online: CPU2: Unlisted initial frequency changed to: 307200 KHz

And dead.

It seems to be starting at a higher voltage, 1175000 uV.

It also seems that when it locks up, it is almost always just after setting the regulators. I wonder if it just needs more power? What would be the safe maximum?

Correct, not sure however if it’s a sane default.

APQ8096 VDD APC (CPU power) can go up to 1.23V (seems to be the max for turbo mode). PM8996 itself (the PMIC) can generate 0.375V to 1.275V on S11/S9/S10 (connected to APQ8096 VDD APC).

Can you test the following branch (or the top patch): loic.poulain/linux.git

It SEEMS to fix the boot hang issue on my side; I would be interested to get feedback from @Hiren as well.
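
The limits quoted above can be turned into a quick sanity check before hard-coding a value in the device tree. A minimal sketch; the function name vdd_apc_uv_ok is made up for illustration, and the three numbers are simply the figures from this post converted to microvolts, not values taken from a datasheet.

```shell
# Return 0 if $1 (microvolts) respects the limits quoted above.
vdd_apc_uv_ok() {
    uv="$1"
    pmic_min_uv=375000    # PMIC lower output limit quoted above (0.375 V)
    pmic_max_uv=1275000   # PMIC upper output limit quoted above (1.275 V)
    cpu_max_uv=1230000    # VDD APC maximum quoted above (1.23 V)
    [ "$uv" -ge "$pmic_min_uv" ] &&
        [ "$uv" -le "$pmic_max_uv" ] &&
        [ "$uv" -le "$cpu_max_uv" ]
}
```

Both 1140000 and 1175000 pass this check, while 1250000 is within the PMIC range but above the quoted CPU maximum and is rejected.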

I’ll give the top patch a shot when I’m home this evening. I’m running AOSP though, so I won’t be able to run the entire kernel.

No difference that I can tell. It’s still happening at the same point and about the same frequency. I would expect that having the voltage maxed out like that would show some difference if it was a voltage problem.

I wonder if there is something else wrong with the aosp kernel?

Yeah, there is definitely something else wrong with the aosp kernel. While I wasn’t able to boot android, your kernel is apparently not exhibiting that problem. Well, I guess at least I have something to go on now :)

Do you get a crash in your case or just hang and reboot? Is the issue similar to https://bugs.96boards.org/show_bug.cgi?id=804 ?

It’s a hang and reboot. No obvious crash, at least nothing that prints on the serial.

I have applied the changes and run the cpufreq-bench test in Linux for approximately 4-5 hours and didn’t observe any crash. I will do more testing, running cpufreq-bench for a longer time, and let you know if any issue or crash is observed.

Thanks,
Hiren

The aosp kernel source appears to be quite severely out of date.

Adding @amit in case of any feedback

Hi,

With this change the cpufreq-bench issue is resolved, and I didn’t observe any crash while running it.

But the boot hang issue is still there. The board sometimes gets stuck at boot time and goes into RAM dump mode.

Thanks,
Hiren

Hi @Loic

It seems that policy->cur is set to 614400 KHz for CPU0 and 19200 KHz for CPU2. Both values are outside the frequency table registered with the cpufreq core, so the CPU might become unstable, leading to the boot freeze.

I am trying to set policy->cur to a frequency that is available in the frequency table. From the code I have looked at, it seems this frequency comes from the perfcl_smux, pwrcl_smux or perfcl_pmux, pwrcl_pmux clock sources.

So any pointers would be helpful on how I can change this frequency, and on whether it is appropriate to change it at all.

Thanks,
Hiren

Good!

These freqs are set by the bootloader, but cpufreq is supposed to adjust the frequencies with the provided table.
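
Whether cpufreq has moved each policy onto a listed frequency can be checked from userspace after boot. A minimal sketch, assuming the driver exposes scaling_available_frequencies; the function name check_cur_in_table and the CPUFREQ_ROOT override are made up for illustration. It only reflects the state after cpufreq has come up, so it cannot show what the bootloader left behind during a boot that hangs.

```shell
# For each cpufreq policy, report whether the current frequency is one of
# the frequencies listed by the driver. Returns 1 if any policy is off-table.
# CPUFREQ_ROOT is a hypothetical override for testing on a fake tree.
check_cur_in_table() {
    root="${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpufreq}"
    status=0
    for policy in "$root"/policy*; do
        [ -r "$policy/scaling_available_frequencies" ] || continue
        cur=$(cat "$policy/scaling_cur_freq")
        # Pad with spaces so that 19200 does not match inside 119200.
        case " $(cat "$policy/scaling_available_frequencies") " in
            *" $cur "*) echo "$(basename "$policy"): $cur kHz listed" ;;
            *) echo "$(basename "$policy"): $cur kHz NOT listed"; status=1 ;;
        esac
    done
    return "$status"
}
```

A policy reported as NOT listed after boot would suggest that the adjustment described above did not take place for that cluster.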