Boot failure / crash during the boot?

Hi,
We have been seeing that the Dragonboard sometimes doesn’t reach the login prompt after power-on.
I put the log file here. It includes kernel logs from 2 failure cases and 1 normal boot.

Dragonboard 820c with:

  • Dragonboard-820c-bootloader-ufs-linux-39
  • Boot-linaro-buster-dragonboard-820c-228.img
  • linaro-buster-alip-dragonboard-820c-228.img
  • USB keyboard + HDMI out + debug serial.

When it happens, the serial console is locked up.
It happens relatively often (it took me 17 power cycles to get 2 failures). Has anybody else seen this? How can we debug this?

Thanks,
tamo2

Is it something new (not reproduced with an older build)?

I don’t think this is anything new.
I originally had an older release on the board (I don’t remember which one, though) and had similar issues. Before posting to the forum, I upgraded to the latest release to make sure the issue still happens there.

It looks like fixing this issue is difficult, so we are looking for a workaround.

The kernel has a watchdog and indeed it seems to be in effect. By looking at the failure cases, after the boot sequence halts, it goes back to some boot sequence, then stops:


[ 4.120588] cpufreq: cpufreq_online: CPU0: Running at unlisted freq: 614400 KHz
[ 4.128401] cpufreq: cpufreq_online: CPU0: Unlisted initial frequency changed to: 652800 KHz
[ 4.136225] cpufreq: cpufreq_online: CPU2: Running at unlisted freq: 19200 KHz
[ 4.143911] cpufreq: cpufreq_online: CPU2: Unlisted initial frequency changed to: 307200 KHz

Format: Log Type - Time(microsec) - Message - Optional Info
Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic
S - QC_IMAGE_VERSION_STRING=BOOT.XF.1.0-00301
S - IMAGE_VARIANT_STRING=M8996LAB
S - OEM_IMAGE_VERSION_STRING=crm-ubuntu68
S - Boot Interface: UFS
… omitted …
B - 2169526 - usb: PLL lock success , 0x2
B - 2187704 - usb: SUPER , 0x900e

Instead of stopping, would it be possible to restart the kernel?

I do not observe this boot issue. @anon91830841 is it something familiar to your team (sporadic boot issues on db820c with reboot in recovery mode)?

@tamo2, just in case, do you reproduce the issue without USB(keyboard)/HDMI connected?

Not easy here since the board seems to reboot in ‘recovery’ mode and does not even reach the LK bootloader (which could have been tweaked to boot unconditionally).

This is certainly because of the power-on-reason:
B - 411597 - PON REASON:PM0:0x20060 PM1:0x20020

Hi Tamotsu @tamo2

Have you checked your power supply? This looks to me like the power supply is dropping out at about the instant the kernel speeds up the main CPU which causes an increased load (a spike in current demand) on the power supply.

The power supply we were shipping for the 820c was a 3A/12V supply which is much larger than needed and never showed this problem. If you are using a lower voltage supply (6.5V will work), or if you have a long or thin wire between your power supply and the 820c board it is possible that the input voltage will momentarily droop below 6V, which would cause a reboot.

Of course this is just a guess on my part, but it is certainly something you can quickly and easily take a look at to confirm this is not causing your problem.
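
To put rough numbers on the cable-droop idea, here is a minimal sketch of the resistive drop across the supply cable. The per-metre resistances are approximate textbook figures for copper wire, and the 3 A load spike and 1 m cable length are assumptions for illustration, not measurements from the 820c.

```python
# Approximate resistance of copper wire in ohms per metre, keyed by AWG.
# These are rounded reference values and are an assumption of this sketch.
OHMS_PER_M = {18: 0.0210, 20: 0.0333, 24: 0.0842}


def voltage_at_board(supply_v, current_a, cable_len_m, awg):
    """Return the input voltage at the board after the cable's resistive drop."""
    # Current flows out and back, so the conductor is twice the cable length.
    loop_resistance = 2 * cable_len_m * OHMS_PER_M[awg]
    return supply_v - current_a * loop_resistance


if __name__ == "__main__":
    # Hypothetical 3 A spike through 1 m of thin (AWG 24) cable.
    for supply_v in (12.0, 6.5):
        v = voltage_at_board(supply_v, current_a=3.0, cable_len_m=1.0, awg=24)
        print(f"{supply_v:.1f} V supply -> {v:.2f} V at the board")
```

Under these assumptions a 12 V supply keeps a lot of headroom above 6 V, while a 6.5 V supply on a thin cable can land right at the limit. The sketch ignores connector resistance and the supply's own transient response.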

-Lawrence-
no longer a Qualcomm employee
looking for a new opportunity

Loic, ljking
Thanks for the response.
I just re-tested without any peripherals attached, and the problem occurred on the first boot. I added the log and the test HW setup to the doc.

The power supply I use is the one that came with the board, but it is 2A/12V. I will try the same test with an external power supply later.
Thanks,
tamo2

I tried the same test with an external 12V power supply, with the current limit set to 4A.
Initially, it looked like the failure rate had dropped, but I am not sure. I still see the failure pretty regularly.

I don’t know if it matters but my board’s serial # is:
Job#: 4941523
-H100 Rev: D
COO: US
N10N3DL1R

We only have 1 DB820c, but we have several tens of SOMs from a third party, and they show the same issue. We also noticed that those SOMs behave quite differently at higher temperatures: some work better (running at a higher clock) than others in the same environment, using the exact same software and clock governor settings.
So I am speculating that, if the problem is related to the clock change and the power supply, some systems may simply happen to work better(?)

p.s. In this test, Ethernet was connected to DB820c.

Hi @tamo2

The 12V/2A supply (24 Watts) should be plenty to run the 820c (assuming that you don’t have a very power hungry mezzanine board attached), and your test with a 12V/4A supply didn’t change anything. I am assuming that the wires from the supply to the 820c are not too long (~1m) and of sufficient gauge to deliver the necessary current to the board.

I think you can eliminate my power supply idea from your list of potential causes of the issue.

I read through the boot log. It appears that the Qualcomm proprietary loaders (messages beginning with “[SBD] -”) get through with no issue, then LK (messages with integer time stamps) is started with no issue, then finally the Linux kernel (messages with real time stamps) gets started. The Linux kernel seems to be doing fine bringing things up and gets all 4 cores running, then tries to bring up the SDCard controller.

That’s when something goes wrong. There is no SDCard in the slot in the setup pictures, so I don’t think that is the issue. Maybe the controller bring-up just happens to coincide with the crash, which was also right after the 4 cores were powered up. The watchdog then trips, and the system goes back to the Qualcomm proprietary boot loader, which realises that the problem is a watchdog timeout and proceeds to generate an XBLRamDump image. Finally the Qualcomm proprietary boot loader stops and waits for a debugger to look into the problem.

At this point I agree it is a kernel SW issue, but beyond my debugging skills.

-Lawrence-
No longer a Qualcomm employee
looking for a new opportunity.

Perhaps initcall_debug (see “Initcall Debug” on eLinux.org) is worth enabling. Assuming the kernel got far enough to start working its way through initcalls then it can be useful to know exactly which one we stopped in!
Not every initcall prints something so, whilst initcall_debug sometimes just confirms our suspicions from the last message printed before death, it is still an important part of narrowing things down.
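
As a rough illustration, a captured serial log with initcall_debug enabled could be scanned for initcalls that were entered but never returned. This is a minimal sketch that assumes the usual "calling  fn+0x0/0x... @ pid" and "initcall fn+0x0/0x... returned N after M usecs" message format. The function name and any sample log you feed it are hypothetical.

```python
import re

# Patterns for the two messages printed around each initcall.
CALL_RE = re.compile(r"calling\s+(\S+)\s+@\s+\d+")
RET_RE = re.compile(r"initcall\s+(\S+)\s+returned\s+-?\d+\s+after\s+\d+\s+usecs")


def unfinished_initcalls(log_text):
    """Return initcalls that were entered but never reported a return value."""
    pending = []
    for line in log_text.splitlines():
        ret = RET_RE.search(line)
        if ret:
            # The initcall completed, so it is no longer a suspect.
            if ret.group(1) in pending:
                pending.remove(ret.group(1))
            continue
        call = CALL_RE.search(line)
        if call:
            pending.append(call.group(1))
    return pending


if __name__ == "__main__":
    import sys
    # Usage: python3 this_script.py captured_serial.log
    with open(sys.argv[1], errors="replace") as f:
        for name in unfinished_initcalls(f.read()):
            print("never returned:", name)
```

These messages are printed at debug level, so the console log level may need raising (for example with ignore_loglevel on the kernel command line) before they show up on the serial port.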

I reproduced this issue today, so I’m going to investigate.

A bug already exists for this issue: https://bugs.96boards.org/show_bug.cgi?id=738

My coworker found a patch which is not included in the Linaro kernel yet: https://patchwork.kernel.org/patch/10377933/
Do you think this may fix the issue?
We are going to try it next week.

I can’t say but feel free to give it a try.

There is a tentative patch associated with the bug ticket, and it is worth testing as well: https://bugs.96boards.org/attachment.cgi?id=309

We have tried both patches, but neither fixed the issue.
With https://bugs.96boards.org/attachment.cgi?id=309 , it failed after 17 power cycles.

With https://patchwork.kernel.org/patch/10377933, it failed after 39 and 21 power cycles.

Our test currently stops at the first failure, so we don’t measure the failure rate, but we have sometimes seen it go over 50 cycles without a failure. The issue is really sporadic.

The test was done on a third party SOM, but I think it would be similar on the DB820c. Any help would be great.
We have started thinking about adding a hardware watchdog as a backup.
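
For a boot test that keeps going past the first failure, each captured serial log could be classified automatically. This is only a sketch: it treats a bootloader banner that appears after kernel messages as a watchdog reset, which matches the failure logs earlier in this thread. The "login:" marker and the result labels are assumptions.

```python
import re

# Kernel messages start with a "[ seconds.micros]" timestamp.
KERNEL_LINE = re.compile(r"^\[\s*\d+\.\d+\]")
# First line the Qualcomm boot loader prints after a reset (from the logs above).
BOOTLOADER_BANNER = "Format: Log Type - Time(microsec)"
CPUFREQ_MARK = "cpufreq: cpufreq_online"


def classify_boot(log_text, login_marker="login:"):
    """Classify one power cycle as ok, reset-after-cpufreq, reset-other or hang."""
    seen_kernel = False
    last_kernel_line = ""
    for raw in log_text.splitlines():
        line = raw.strip()
        if KERNEL_LINE.match(line):
            seen_kernel = True
            last_kernel_line = line
        elif seen_kernel and BOOTLOADER_BANNER in line:
            # The boot loader showed up again after the kernel had started.
            if CPUFREQ_MARK in last_kernel_line:
                return "reset-after-cpufreq"
            return "reset-other"
        elif login_marker in line:
            return "ok"
    # Neither a login prompt nor a reset was seen before the capture ended.
    return "hang"
```

Counting the labels over many power cycles would give both a failure rate and a breakdown by crash signature.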

Just to confirm, is the bug signature still:

[ 4.120588] cpufreq: cpufreq_online: CPU0: Running at unlisted freq: 614400 KHz
[ 4.128401] cpufreq: cpufreq_online: CPU0: Unlisted initial frequency changed to: 652800 KHz
[ 4.136225] cpufreq: cpufreq_online: CPU2: Running at unlisted freq: 19200 KHz
[ 4.143911] cpufreq: cpufreq_online: CPU2: Unlisted initial frequency changed to: 307200 KHz

Yes. We have seen other crashes but cpufreq was still the majority of the crashes.

Nothing new, but I would like to share one of our boot test results. This was using a SOM (not DB820c) with a heatsink + fan.

>   cnt#     NumFailed  failure%
> [    0] (numFailed=0, 0.00%) Mon Nov 05 16:30:41 2018
> [   73] (numFailed=1, 1.37%) Mon Nov 05 17:21:10 2018
> [   93] (numFailed=2, 2.15%) Mon Nov 05 17:35:29 2018
> [  132] (numFailed=3, 2.27%) Mon Nov 05 18:02:44 2018
> [  203] (numFailed=4, 1.97%) Mon Nov 05 18:51:42 2018
> [  204] (numFailed=5, 2.45%) Mon Nov 05 18:53:06 2018
> [  391] (numFailed=6, 1.53%) Mon Nov 05 21:00:52 2018
> [  650] (numFailed=7, 1.08%) Mon Nov 05 23:57:32 2018
> [ 1556] (numFailed=8, 0.51%) Tue Nov 06 10:13:41 2018
> [ 1562] (numFailed=9, 0.58%) Tue Nov 06 10:18:29 2018
> [ 1585] (numFailed=9, 0.57%) Tue Nov 06 10:34:07 2018 

About 900 cycles passed without a single failure (cycles 650 to 1556), which shows how sporadic the issue is. Those cycles ran during the night, so it could be because the temperature in the office was very steady then. During the day, the air conditioner goes on and off quite often in our office.
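
The gaps between failures can be pulled out of the table above with a short script. This is a minimal sketch: it assumes a row marks a new failure whenever numFailed increases, and the function names are made up for illustration.

```python
import re

# Rows copied from the test output above.
TABLE = """
[    0] (numFailed=0, 0.00%) Mon Nov 05 16:30:41 2018
[   73] (numFailed=1, 1.37%) Mon Nov 05 17:21:10 2018
[   93] (numFailed=2, 2.15%) Mon Nov 05 17:35:29 2018
[  132] (numFailed=3, 2.27%) Mon Nov 05 18:02:44 2018
[  203] (numFailed=4, 1.97%) Mon Nov 05 18:51:42 2018
[  204] (numFailed=5, 2.45%) Mon Nov 05 18:53:06 2018
[  391] (numFailed=6, 1.53%) Mon Nov 05 21:00:52 2018
[  650] (numFailed=7, 1.08%) Mon Nov 05 23:57:32 2018
[ 1556] (numFailed=8, 0.51%) Tue Nov 06 10:13:41 2018
[ 1562] (numFailed=9, 0.58%) Tue Nov 06 10:18:29 2018
[ 1585] (numFailed=9, 0.57%) Tue Nov 06 10:34:07 2018
"""

ROW_RE = re.compile(r"\[\s*(\d+)\]\s*\(numFailed=(\d+),")


def failure_cycles(table_text):
    """Return the cycle counts at which numFailed increased."""
    cycles = []
    prev_failed = 0
    for cnt, failed in ROW_RE.findall(table_text):
        cnt, failed = int(cnt), int(failed)
        if failed > prev_failed:
            cycles.append(cnt)
        prev_failed = failed
    return cycles


def gaps_between_failures(cycles):
    """Return the number of cycles from each failure back to the previous one."""
    return [b - a for a, b in zip([0] + cycles, cycles)]


if __name__ == "__main__":
    cycles = failure_cycles(TABLE)
    gaps = gaps_between_failures(cycles)
    print("failures at cycles:", cycles)
    print("gaps:", gaps, "longest:", max(gaps))
```

The gaps range from a single cycle to several hundred, which is why a run that stops at the first failure says little about the real failure rate.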

I have tried changing the default CPU governor to powersave or conservative in the kernel config.
After this change, boot failures became less frequent than with ondemand. It now crashes rarely (after 200 boots), and at a location other than the cpufreq driver.
Is this because CPR or voltage scaling is missing in the current kernel?
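
When comparing governors like this, it may help to confirm on a booted board which governor each CPU actually ended up with. Here is a minimal sketch that reads the standard cpufreq sysfs files. The function name is hypothetical, and the root path is a parameter only so the code can be tried against a copy of the tree.

```python
from pathlib import Path


def read_governors(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Return a {cpu_name: governor} mapping for CPUs that expose cpufreq."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in Path(sysfs_cpu_root).glob(pattern):
        # .../cpuN/cpufreq/scaling_governor -> "cpuN"
        cpu_name = gov_file.parent.parent.name
        try:
            governors[cpu_name] = gov_file.read_text().strip()
        except OSError:
            # The CPU may be offline or the file unreadable; skip it.
            continue
    return governors


if __name__ == "__main__":
    for cpu, governor in sorted(read_governors().items()):
        print(cpu, governor)
```

If a crash still happens with powersave on every CPU, that would at least suggest that frequency changes made by the governor are not the only trigger.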