How to analyze the boot up failures

kldixon · November 2, 2018, 3:08pm

The lines following ‘Format: Log Type’ are from the ROM stage loader in the SoC followed by messages from the Primary Boot Loader which the ROM code loads from xbl.elf in the xbl partition and appear because the kernel failed in some way and, I am presuming, a watchdog times out, sets some flag somewhere and reboots. The PBL sees that the kernel failed and loads the XBLRamDump Image from the xbl.elf.
I am guessing there is some Qualcomm application that will communicate with this to allow the core image to be downloaded from the board by USB for analysis.
Anyway, that is what I infer is happening. You can use readelf on xbl.elf and see that three of the LOAD sections add up to 342782 Bytes. BTW the following LOAD section is UEFI.

Loic · November 2, 2018, 3:31pm

detect that the system has failed to boot and do something about it, such as force a restart…

cf @kldixon response.
The board actually resets, but in a specific mode. This behavior is not easy to change since it’s part of boot code we don’t have source code.

make sure the system never has a boot issue

This is probably the way to go.

In the other thread @danielt suggested to add initcall debugging, did you try?
Basically, you need to add initcall_debug=1 and ignore_loglevel=1 to the kernel boot cmdline.
This can help to identify if the problem happens in a particular init code.

You can achieve this with dbootimg [1]:

# install dbootimg
git clone https://github.com/96boards/dt-update.git
cd dt-update
make
sudo make install

# Update kernel cmdline
dbootimg boot-db820c.img -u cmdline "root=/dev/sda1 rw rootwait initcall_debug=1 ignore_loglevel=1"

[1] https://github.com/96boards/dt-update

kimbo · November 2, 2018, 4:43pm

Hi Loic,
I did not try @danielt suggestion initially. So I tried it just now. I copied a portion of a failed boot and a non-failed boot. It looks like cpufreq_dt_platdev_init is taking much longer (27ms). I have reproduced this failure twice now and this call takes a lot longer on failure and successful boots are around 4ms.
Thanks,
Kim

On a failure I see these messages:
[ 13.580704] calling cpufreq_gov_dbs_init+0x0/0x1c @ 1
[ 13.588098] initcall cpufreq_gov_dbs_init+0x0/0x1c returned 0 after 1 usecs
[ 13.593118] calling dt_cpufreq_platdrv_init+0x0/0x20 @ 1
[ 13.600188] initcall dt_cpufreq_platdrv_init+0x0/0x20 returned 0 after 190 usecs
[ 13.605537] calling cpufreq_dt_platdev_init+0x0/0xdc @ 1
[ 13.614868] cpufreq: cpufreq_online: CPU0: Running at unlisted freq: 614400 KHz
[ 13.618450] cpufreq: cpufreq_online: CPU0: Unlisted initial frequency changed to: 652800 KHz
[ 13.626536] cpufreq: cpufreq_online: CPU2: Running at unlisted freq: 19200 KHz
[ 13.634147] cpufreq: cpufreq_online: CPU2: Unlisted initial frequency changed to: 307200 KHz
[ 13.641333] initcall cpufreq_dt_platdev_init+0x0/0xdc returned 0 after 27653 usecs

Format: Log Type - Time(microsec) - Message - Optional Info
Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic
S - QC_IMAGE_VERSION_STRING=BOOT.XF.1.0-00301
S - IMAGE_VARIANT_STRING=M8996LAB
S - OEM_IMAGE_VERSION_STRING=crm-ubuntu68
S - Boot Interface: UFS
S - Secure Boot: Off

A no failure shows:
[ 11.427835] calling uniphier_wdt_driver_init+0x0/0x20 @ 1
[ 11.428228] initcall uniphier_wdt_driver_init+0x0/0x20 returned 0 after 369 usecs
[ 11.428237] calling cpufreq_gov_powersave_init+0x0/0x1c @ 1
[ 11.428249] initcall cpufreq_gov_powersave_init+0x0/0x1c returned 0 after 3 usecs
[ 11.428256] calling cpufreq_gov_userspace_init+0x0/0x1c @ 1
[ 11.428264] initcall cpufreq_gov_userspace_init+0x0/0x1c returned 0 after 0 usecs
[ 11.428271] calling cpufreq_gov_dbs_init+0x0/0x1c @ 1
[ 11.428280] initcall cpufreq_gov_dbs_init+0x0/0x1c returned 0 after 1 usecs
[ 11.428286] calling dt_cpufreq_platdrv_init+0x0/0x20 @ 1
[ 11.428570] initcall dt_cpufreq_platdrv_init+0x0/0x20 returned 0 after 262 usecs
[ 11.428577] calling cpufreq_dt_platdev_init+0x0/0xdc @ 1
[ 11.430727] cpufreq: cpufreq_online: CPU0: Running at unlisted freq: 614400 KHz
[ 11.430854] cpufreq: cpufreq_online: CPU0: Unlisted initial frequency changed to: 652800 KHz
[ 11.432134] cpufreq: cpufreq_online: CPU2: Running at unlisted freq: 19200 KHz
[ 11.432190] cpufreq: cpufreq_online: CPU2: Unlisted initial frequency changed to: 307200 KHz
[ 11.432533] initcall cpufreq_dt_platdev_init+0x0/0xdc returned 0 after 3849 usecs
[ 11.432537] calling brcm_avs_cpufreq_platdrv_init+0x0/0x20 @ 1
[ 11.432795] initcall brcm_avs_cpufreq_platdrv_init+0x0/0x20 returned 0 after 247 usecs
[ 11.432798] calling scpi_cpufreq_platdrv_init+0x0/0x20 @ 1
[ 11.432870] initcall scpi_cpufreq_platdrv_init+0x0/0x20 returned 0 after 66 usecs
[ 11.432873] calling tegra_cpufreq_init+0x0/0xb0 @ 1
[ 11.432877] initcall tegra_cpufreq_init+0x0/0xb0 returned -2 after 0 usecs
[ 11.432879] calling tegra_cpufreq_init+0x0/0x90 @ 1
[ 11.432886] initcall tegra_cpufreq_init+0x0/0x90 returned -19 after 3 usecs
[ 11.432888] calling arm_idle_init+0x0/0x200 @ 1
[ 11.432898] initcall arm_idle_init+0x0/0x200 returned -19 after 6 usecs
[ 11.432901] calling mmc_pwrseq_simple_driver_init+0x0/0x20 @ 1

kimbo · November 2, 2018, 4:52pm

It may not always be the case. This was another failure:
[ 13.602839] calling cpufreq_gov_dbs_init+0x0/0x1c @ 1
[ 13.610234] initcall cpufreq_gov_dbs_init+0x0/0x1c returned 0 after 1 usecs
[ 13.615254] calling dt_cpufreq_platdrv_init+0x0/0x20 @ 1
[ 13.622322] initcall dt_cpufreq_platdrv_init+0x0/0x20 returned 0 after 189 usecs
[ 13.627674] calling cpufreq_dt_platdev_init+0x0/0xdc @ 1
[ 13.636970] cpufreq: cpufreq_online: CPU0: Running at unlisted freq: 614400 KHz
[ 13.640585] cpufreq: cpufreq_online: CPU0: Unlisted initial frequency changed to: 652800 KHz
[ 13.648669] cpufreq: cpufreq_online: CPU2: Running at unlisted freq: 19200 KHz
[ 13.656280] cpufreq: cpufreq_online: CPU2: Unlisted initial frequency changed to: 307200 KHz

Format: Log Type - Time(microsec) - Message - Optional Info
Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic

kldixon · November 2, 2018, 5:53pm

There was a presentation at Connect on Kernel Power Management

The dynamic power management code is under heavy development and I suspect is likely going to be pretty flaky for some time. You may have to revert to an earlier kernel.
I have an issue with HDMI which only works with 4.11.

althea · November 5, 2018, 7:40am

The board actually resets, but in a specific mode. This behavior is not easy to change since it’s part of boot code we don’t have source code.

Hi @Loic, does this mean that we can’t stop the Qualcomm bootloaders deliberately stopping in bootloader under certain circumstances?

That makes it tricky to make production units where we need to maintain constant operation by incorporating watchdogs etc.

kldixon · November 5, 2018, 10:33am

Unless you get the source of linux-board-support-package-r01700.1.zip, directly from Qualcomm under a NDA, I suspect you are stuck.

In correction of my previous comment; the ROM stage loader is the Primary Boot Loader and what I called the PBL is in fact the first segment of the Secondary Boot Loader, SBL1, loaded from the xbl partition.

Loic · November 5, 2018, 12:46pm

This seems the default behavior for non-production devices (AFAIK, production config can be fused differently). I’m looking for a way to disable this without having to fuse a PBL config or modify the SBL code, not sure it’s possible though.

In worst case you can add external watchdog.

althea · November 5, 2018, 1:18pm

Thanks, would be great to hear what you find.

Failing that… are there instructions anywhere on setting these fuses ourselves (assuming not already set)?

danielt · November 5, 2018, 1:52pm

We should probably also make clear that these answers are for DB820C only. If you are interested in an answer for other boards you need to speak to the board vendor (or to chip supplier if you are the board vendor).

suspendme · December 3, 2018, 11:27am

Hi Kim,
Since you want the system to reboot, would this workaround shared below help?
Add this below the line in rootfs in /etc/sysctl.conf
kernel.panic = 3 (or 2 or any number of seconds that you want the system to reboot after)

kimbo · December 3, 2018, 4:35pm

Hi suspendme,

I did try adding panic=3 in the cmdline argument when I run mkbootimg, but it did not seem to do anything for this boot issue.

Thanks,
Kim

Loic · December 4, 2018, 10:27am

I created a bug [1] to track this remaining boot issue.

[1] https://bugs.96boards.org/show_bug.cgi?id=804

vinaysimha · December 12, 2018, 6:07am

how to reproduce this issue?

regards,
vinaysimha

danielt · December 12, 2018, 12:13pm

As best as I understand things, adopt a v4.14 kernel and reboot several times (rate of nonoccurence may be fairly low).

tamo2 · December 13, 2018, 10:55pm

Sometimes takes a few hundred reboots: Boot failure / crash during the boot? - #18 by tamo2

suspendme · December 14, 2018, 10:18am

Hi @Loic, Why would you attribute the solution to “fuse a PBL config or modify the SBL code” here? Could you please share your thoughts?

danielt · December 14, 2018, 11:18am

Note that Loic’s comment was about how to stop the system entering a debug mode after it had failed (it was not about stopping it from failing). In that case of debug mode no other code has been executed yet!

Loic · December 14, 2018, 2:15pm

Correct! my point was not about root cause resolution but how to prevent the bootloader to switch to ramdump mode after a crash.

I’ve tried to enable QCOM krait watchdog on Linux, and this also causes a reboot in ramdump mode when triggered. So I do not see any easy way to cleanly reboot after a crash/watchdog without modifying the second stage bootloader.

pwhewell · May 14, 2019, 7:21pm

After some research around on the internet it looks like when you enter this Ramdump mode you should be able to dump the memory or some kind of stack for analysis.

Are there are instructions or tools to be able to do this with the Dragonboard by any chance?

I have gotten into this mode a couple of times but without being able to see how I got there I am not able to determine if there is a solution to my problem.