How to analyze the boot up failures

Hi. I am seeing some boot failures and am wondering how to interpret the information that prints. Sometimes what this post shows occurs.

Today I observed different boot failure. So how can I decode the boot failures, such as the one below?

Thanks,
kimbo

[ 3.567517] dwc3 6a00000.dwc3: failed to initialize core
[ 3.569361] ehci_hcd: USB 2.0 ‘Enhanced’ Host Controller (EHCI) Driver
[ 3.569460] ehci-pci: EHCI PCI platform driver
[ 3.569515] ehci-platform: EHCI generic platform driver
[ 3.569784] ehci-orion: EHCI orion driver

Format: Log Type - Time(microsec) - Message - Optional Info
Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic
S - QC_IMAGE_VERSION_STRING=BOOT.XF.1.0-00301
S - IMAGE_VARIANT_STRING=M8996LAB
S - OEM_IMAGE_VERSION_STRING=crm-ubuntu68
S - Boot Interface: UFS
S - Secure Boot: Off
S - Boot Config @ 0x00076044 = 0x000001c9
S - JTAG ID @ 0x000760f4 = 0x4003e0e1
S - OEM ID @ 0x000760f8 = 0x00000000
S - Serial Number @ 0x00074138 = 0x2d9fe61b
S - OEM Config Row 0 @ 0x00074188 = 0x0000000000000000
S - OEM Config Row 1 @ 0x00074190 = 0x0000000000000000
S - Feature Config Row 0 @ 0x000741a0 = 0x0050000010000100
S - Feature Config Row 1 @ 0x000741a8 = 0x00fff00001ffffff
S - Core 0 Frequency, 1228 MHz
B - 0 - PBL, Start
B - 10413 - bootable_media_detect_entry, Start
B - 46030 - bootable_media_detect_success, Start
B - 46034 - elf_loader_entry, Start
B - 47083 - auth_hash_seg_entry, Start
B - 47184 - auth_hash_seg_exit, Start
B - 79150 - elf_segs_hash_verify_entry, Start
B - 81650 - PBL, End
B - 83661 - SBL1, Start
B - 179614 - usb: hs_phy_nondrive_start
B - 179950 - usb: PLL lock success - 0x3
B - 182939 - usb: hs_phy_nondrive_finish
B - 187056 - boot_flash_init, Start
D - 0 - boot_flash_init, Delta
B - 194163 - sbl1_ddr_set_default_params, Start
D - 0 - sbl1_ddr_set_default_params, Delta
B - 202154 - boot_config_data_table_init, Start
D - 179248 - boot_config_data_table_init, Delta - (60 Bytes)
B - 385916 - CDT Version:3,Platform ID:24,Major ID:1,Minor ID:0,Subtype:0
B - 390613 - Image Load, Start
D - 22235 - PMIC Image Loaded, Delta - (37272 Bytes)
B - 412848 - pm_device_init, Start
B - 418612 - PON REASON:PM0:0x20020 PM1:0x200a0
B - 454999 - PM_SET_VAL:Skip
D - 40016 - pm_device_init, Delta
B - 456920 - pm_driver_init, Start
D - 2928 - pm_driver_init, Delta
B - 463508 - pm_sbl_chg_init, Start
D - 91 - pm_sbl_chg_init, Delta
B - 470279 - vsense_init, Start
D - 0 - vsense_init, Delta
B - 477904 - Pre_DDR_clock_init, Start
D - 366 - Pre_DDR_clock_init, Delta
B - 483913 - ddr_initialize_device, Start
B - 487603 - 8996 v3.x detected, Max frequency = 1.8 GHz
B - 495167 - ddr_initialize_device, Delta
B - 497180 - Basic DDR tests done
B - 2060092 - clock_init, Start
D - 244 - clock_init, Delta
B - 2062105 - Image Load, Start
D - 6130 - QSEE Dev Config Image Loaded, Delta - (46008 Bytes)
B - 2068540 - Image Load, Start
D - 5704 - APDP Image Loaded, Delta - (0 Bytes)
B - 2098522 - usb: fedl, vbus_low
B - 2098583 - Image Load, Start
D - 10766 - XBLRamDump Image Loaded, Delta - (342782 Bytes)
B - 2163914 - usb: init start
B - 2165744 - usb: PLL lock success , 0x2

Are your running 4.14 qcomlt release branch? Did you try the mentioned patch [1]?

[1] https://bugs.96boards.org/attachment.cgi?id=309

I presume you are referring to the first line of the log you report. I have an old log with precisely those lines which went on to boot successfully.

Hi Loic,
Yes, I have tried the patch (and am on 4.14 qcomlt release branch) and it does reduce the number of boot failures (I think, but could be imaging), but still I see a boot issue on occasion. When it doesn’t boot I want to know how to analyze the lines starting at “Format: Log Type”. Can I know where or what component in the boot is having an issue?
Thanks,
kimbo

These just look like they are boot up messages (so from the point of view of debugging I think it just means the board went into reset). Naturally you can double check this (my DB820C isn’t plugged in at the moment for me to check) by rebooting the board manually :wink:

Hi Daniel,

On a normal boot I see the “Format:” messages first, then the kernel starts booting and eventually I get the hdmi display.

On a failed boot I see the “Format:” messages first, then the kernel starts boot, but will lock up at some point and then print these “Format:” messages again. At that point, yes, I have to reboot the system and then it usually comes up per the normal boot and I get to the hdmi display successfully.

The problem is that our customers will notice that the system isn’t coming up properly and “cycling power” to the board isn’t very nice for them. So our options are either 1) make sure the system never has a boot issue or 2) detect that the system has failed to boot and do something about it, such as force a restart…without the customer knowing anything about it.

Are either of these two options possible for us?

Thanks,
Kim

The lines following ‘Format: Log Type’ are from the ROM stage loader in the SoC followed by messages from the Primary Boot Loader which the ROM code loads from xbl.elf in the xbl partition and appear because the kernel failed in some way and, I am presuming, a watchdog times out, sets some flag somewhere and reboots. The PBL sees that the kernel failed and loads the XBLRamDump Image from the xbl.elf.
I am guessing there is some Qualcomm application that will communicate with this to allow the core image to be downloaded from the board by USB for analysis.
Anyway, that is what I infer is happening. You can use readelf on xbl.elf and see that three of the LOAD sections add up to 342782 Bytes. BTW the following LOAD section is UEFI.

  1. detect that the system has failed to boot and do something about it, such as force a restart…

cf @kldixon response.
The board actually resets, but in a specific mode. This behavior is not easy to change since it’s part of boot code we don’t have source code.

  1. make sure the system never has a boot issue

This is probably the way to go.

In the other thread @danielt suggested to add initcall debugging, did you try?
Basically, you need to add initcall_debug=1 and ignore_loglevel=1 to the kernel boot cmdline.
This can help to identify if the problem happens in a particular init code.

You can achieve this with dbootimg [1]:

# install dbootimg
git clone https://github.com/96boards/dt-update.git
cd dt-update
make
sudo make install

# Update kernel cmdline
dbootimg boot-db820c.img -u cmdline "root=/dev/sda1 rw rootwait initcall_debug=1 ignore_loglevel=1"

[1] https://github.com/96boards/dt-update

Hi Loic,
I did not try @danielt suggestion initially. So I tried it just now. I copied a portion of a failed boot and a non-failed boot. It looks like cpufreq_dt_platdev_init is taking much longer (27ms). I have reproduced this failure twice now and this call takes a lot longer on failure and successful boots are around 4ms.
Thanks,
Kim

On a failure I see these messages:
[ 13.580704] calling cpufreq_gov_dbs_init+0x0/0x1c @ 1
[ 13.588098] initcall cpufreq_gov_dbs_init+0x0/0x1c returned 0 after 1 usecs
[ 13.593118] calling dt_cpufreq_platdrv_init+0x0/0x20 @ 1
[ 13.600188] initcall dt_cpufreq_platdrv_init+0x0/0x20 returned 0 after 190 usecs
[ 13.605537] calling cpufreq_dt_platdev_init+0x0/0xdc @ 1
[ 13.614868] cpufreq: cpufreq_online: CPU0: Running at unlisted freq: 614400 KHz
[ 13.618450] cpufreq: cpufreq_online: CPU0: Unlisted initial frequency changed to: 652800 KHz
[ 13.626536] cpufreq: cpufreq_online: CPU2: Running at unlisted freq: 19200 KHz
[ 13.634147] cpufreq: cpufreq_online: CPU2: Unlisted initial frequency changed to: 307200 KHz
[ 13.641333] initcall cpufreq_dt_platdev_init+0x0/0xdc returned 0 after 27653 usecs

Format: Log Type - Time(microsec) - Message - Optional Info
Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic
S - QC_IMAGE_VERSION_STRING=BOOT.XF.1.0-00301
S - IMAGE_VARIANT_STRING=M8996LAB
S - OEM_IMAGE_VERSION_STRING=crm-ubuntu68
S - Boot Interface: UFS
S - Secure Boot: Off

A no failure shows:
[ 11.427835] calling uniphier_wdt_driver_init+0x0/0x20 @ 1
[ 11.428228] initcall uniphier_wdt_driver_init+0x0/0x20 returned 0 after 369 usecs
[ 11.428237] calling cpufreq_gov_powersave_init+0x0/0x1c @ 1
[ 11.428249] initcall cpufreq_gov_powersave_init+0x0/0x1c returned 0 after 3 usecs
[ 11.428256] calling cpufreq_gov_userspace_init+0x0/0x1c @ 1
[ 11.428264] initcall cpufreq_gov_userspace_init+0x0/0x1c returned 0 after 0 usecs
[ 11.428271] calling cpufreq_gov_dbs_init+0x0/0x1c @ 1
[ 11.428280] initcall cpufreq_gov_dbs_init+0x0/0x1c returned 0 after 1 usecs
[ 11.428286] calling dt_cpufreq_platdrv_init+0x0/0x20 @ 1
[ 11.428570] initcall dt_cpufreq_platdrv_init+0x0/0x20 returned 0 after 262 usecs
[ 11.428577] calling cpufreq_dt_platdev_init+0x0/0xdc @ 1
[ 11.430727] cpufreq: cpufreq_online: CPU0: Running at unlisted freq: 614400 KHz
[ 11.430854] cpufreq: cpufreq_online: CPU0: Unlisted initial frequency changed to: 652800 KHz
[ 11.432134] cpufreq: cpufreq_online: CPU2: Running at unlisted freq: 19200 KHz
[ 11.432190] cpufreq: cpufreq_online: CPU2: Unlisted initial frequency changed to: 307200 KHz
[ 11.432533] initcall cpufreq_dt_platdev_init+0x0/0xdc returned 0 after 3849 usecs
[ 11.432537] calling brcm_avs_cpufreq_platdrv_init+0x0/0x20 @ 1
[ 11.432795] initcall brcm_avs_cpufreq_platdrv_init+0x0/0x20 returned 0 after 247 usecs
[ 11.432798] calling scpi_cpufreq_platdrv_init+0x0/0x20 @ 1
[ 11.432870] initcall scpi_cpufreq_platdrv_init+0x0/0x20 returned 0 after 66 usecs
[ 11.432873] calling tegra_cpufreq_init+0x0/0xb0 @ 1
[ 11.432877] initcall tegra_cpufreq_init+0x0/0xb0 returned -2 after 0 usecs
[ 11.432879] calling tegra_cpufreq_init+0x0/0x90 @ 1
[ 11.432886] initcall tegra_cpufreq_init+0x0/0x90 returned -19 after 3 usecs
[ 11.432888] calling arm_idle_init+0x0/0x200 @ 1
[ 11.432898] initcall arm_idle_init+0x0/0x200 returned -19 after 6 usecs
[ 11.432901] calling mmc_pwrseq_simple_driver_init+0x0/0x20 @ 1

It may not always be the case. This was another failure:
[ 13.602839] calling cpufreq_gov_dbs_init+0x0/0x1c @ 1
[ 13.610234] initcall cpufreq_gov_dbs_init+0x0/0x1c returned 0 after 1 usecs
[ 13.615254] calling dt_cpufreq_platdrv_init+0x0/0x20 @ 1
[ 13.622322] initcall dt_cpufreq_platdrv_init+0x0/0x20 returned 0 after 189 usecs
[ 13.627674] calling cpufreq_dt_platdev_init+0x0/0xdc @ 1
[ 13.636970] cpufreq: cpufreq_online: CPU0: Running at unlisted freq: 614400 KHz
[ 13.640585] cpufreq: cpufreq_online: CPU0: Unlisted initial frequency changed to: 652800 KHz
[ 13.648669] cpufreq: cpufreq_online: CPU2: Running at unlisted freq: 19200 KHz
[ 13.656280] cpufreq: cpufreq_online: CPU2: Unlisted initial frequency changed to: 307200 KHz

Format: Log Type - Time(microsec) - Message - Optional Info
Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic

There was a presentation at Connect on Kernel Power Management

The dynamic power management code is under heavy development and I suspect is likely going to be pretty flaky for some time. You may have to revert to an earlier kernel.
I have an issue with HDMI which only works with 4.11.

The board actually resets, but in a specific mode. This behavior is not easy to change since it’s part of boot code we don’t have source code.

Hi @Loic, does this mean that we can’t stop the Qualcomm bootloaders deliberately stopping in bootloader under certain circumstances?

That makes it tricky to make production units where we need to maintain constant operation by incorporating watchdogs etc.

Unless you get the source of linux-board-support-package-r01700.1.zip, directly from Qualcomm under a NDA, I suspect you are stuck.

In correction of my previous comment; the ROM stage loader is the Primary Boot Loader and what I called the PBL is in fact the first segment of the Secondary Boot Loader, SBL1, loaded from the xbl partition.

This seems the default behavior for non-production devices (AFAIK, production config can be fused differently). I’m looking for a way to disable this without having to fuse a PBL config or modify the SBL code, not sure it’s possible though.

In worst case you can add external watchdog.

Thanks, would be great to hear what you find.

Failing that… are there instructions anywhere on setting these fuses ourselves (assuming not already set)?

We should probably also make clear that these answers are for DB820C only. If you are interested in an answer for other boards you need to speak to the board vendor (or to chip supplier if you are the board vendor).

Hi Kim,
Since you want the system to reboot, would this workaround shared below help?
Add this below the line in rootfs in /etc/sysctl.conf
kernel.panic = 3 (or 2 or any number of seconds that you want the system to reboot after)

Hi suspendme,

I did try adding panic=3 in the cmdline argument when I run mkbootimg, but it did not seem to do anything for this boot issue.

Thanks,
Kim

I created a bug [1] to track this remaining boot issue.

[1] https://bugs.96boards.org/show_bug.cgi?id=804

how to reproduce this issue?

regards,
vinaysimha