I have been investigating why a kexec booted kernel will always hang in the early boot phase on the DragonBoard 820c. This hang has persisted since my first attempt to use kexec with numerous kernels and many different command lines so I thought it was time to try to nail it. Unfortunately, I have failed.
I use the same Image.gz and initrd for the first and second kernels.
The OS installed on the board is Debian snapshot 355:
boot-linaro-buster-dragonboard-820c-355.img.gz
linaro-buster-alip-dragonboard-820c-355.img.gz
The initrd used comes from this Debian snapshot with /lib/firmware added and /lib/modules replaced by that built with the kernel. The kernel comes from:
$ git clone https://git.linaro.org/landing-teams/working/qualcomm/kernel.git
$ git checkout -t origin/integration-linux-qcomlt
This is the kernel build information:
Linux version 5.2.0-02469-g44b4395655ff-dirty (dixon@computer2) (gcc version 8.3.0 (GNU Toolchain for the A-profile Architecture 8.3-2019.03 (arm-rel-8.36))) #2 SMP PREEMPT Mon Jul 29 10:38:30 BST 2019
The boot image for the first kernel is built and booted with:
$ ../skales/dtbTool -o dt.img -s 4096 ./arch/arm64/boot/dts/qcom/
$ ../skales/mkbootimg --output boot.img --cmdline "root=/dev/disk/by-partlabel/rootfs rw rootwait console=tty0 console=ttyMSM0,115200n8" --base 0x80000000 --pagesize 4096 --kernel ./arch/arm64/boot/Image.gz --dt dt.img --ramdisk ../snapshot-355/contents/ramdisk.new.packed
# fastboot boot boot.img
The first kernel seems to boot fine. (Although, recently, a problem with the ahci driver has appeared, but that is for a separate investigation.)
The second kernel is loaded and run with:
root@linaro-alip:~# kexec -v
kexec-tools 2.0.18
root@linaro-alip:~# kexec -d -l /boot/Image.gz --initrd=/boot/ramdisk.new.packed --command-line='maxcpus=1 root=/dev/disk/by-partlabel/rootfs rw rootwait console=tty0 console=ttyMSM0,115200n8 earlycon earlyprintk initcall_debug debug reset_devices ignore_loglevel dynamic_debug.verbose=1 dyndbg="file regmap-mmio.c +p; file regmap.c +p; file qcom-apcs-ipc-mailbox.c +p; file mailbox.c +p; file qcom_glink_native.c +p; file qcom_glink_rpm.c +p; file dd.c +p; file bus.c +p; file driver.c +p; file platform.c +p"'
root@linaro-alip:~# systemctl kexec
and so uses the device tree from the first kernel.
This is the tail of the boot log from the second kernel. The first line is normally the last line seen before the hang when the debugging kernel parameters are not given:
[ 9.056670] EDAC MC: Ver: 3.0.0
[ 9.060870] bus: 'edac': registered
[ 9.063754] bus: 'edac': add device mc
[ 9.067694] EDAC DEBUG: edac_mc_sysfs_init: device mc created
[ 9.075382] initcall edac_init+0x0/0x90 returned 0 after 15625 usecs
[ 9.076790] calling mmc_init+0x0/0x48 @ 1
[ 9.083584] bus: 'mmc': registered
[ 9.087952] bus: 'sdio': registered
[ 9.090481] initcall mmc_init+0x0/0x48 returned 0 after 3906 usecs
[ 9.094162] calling leds_init+0x0/0x68 @ 1
[ 9.100447] initcall leds_init+0x0/0x68 returned 0 after 0 usecs
[ 9.104532] calling qcom_scm_init+0x0/0x20 @ 1
[ 9.110589] __platform_driver_register
[ 9.114833] driver_register
[ 9.118503] bus: 'platform': add driver qcom_scm
[ 9.121359] bus_add_driver
[ 9.126156] driver_attach
[ 9.128570] bus_for_each_dev
[ 9.131377] __driver_attach
[ 9.134216] device_driver_attach
[ 9.136830] driver_probe_device
[ 9.140207] bus: 'platform': driver_probe_device: matched device firmware:scm with driver qcom_scm
[ 9.143175] really_probe_debug
[ 9.152135] bus: 'platform': really_probe: probing driver qcom_scm with device firmware:scm
[ 9.155275] really_probe
[ 9.163692] driver: 'qcom_scm': driver_bound: bound to device 'firmware:scm'
[ 9.166441] bus: 'platform': really_probe: bound device firmware:scm to driver qcom_scm
[ 9.173516] probe of firmware:scm returned 1 after 12000 usecs
[ 9.181580] initcall qcom_scm_init+0x0/0x20 returned 0 after 39062 usecs
[ 9.187207] calling scmi_bus_init+0x0/0x44 @ 1
[ 9.194063] bus: 'scmi_protocol': registered
[ 9.198273] initcall scmi_bus_init+0x0/0x44 returned 0 after 3906 usecs
[ 9.202613] calling scmi_clock_init+0x0/0x20 @ 1
[ 9.208988] initcall scmi_clock_init+0x0/0x20 returned 20 after 0 usecs
[ 9.213809] calling scmi_perf_init+0x0/0x20 @ 1
[ 9.220197] initcall scmi_perf_init+0x0/0x20 returned 19 after 0 usecs
[ 9.225060] calling scmi_power_init+0x0/0x20 @ 1
[ 9.231380] initcall scmi_power_init+0x0/0x20 returned 17 after 0 usecs
[ 9.236110] calling scmi_sensors_init+0x0/0x20 @ 1
[ 9.242520] initcall scmi_sensors_init+0x0/0x20 returned 21 after 0 usecs
[ 9.247397] calling glink_rpm_init+0x0/0x20 @ 1
[ 9.254321] __platform_driver_register
[ 9.259003] driver_register
[ 9.262513] bus: 'platform': add driver qcom_glink_rpm
[ 9.265280] bus_add_driver
[ 9.270444] driver_attach
[ 9.273101] bus_for_each_dev
[ 9.275873] __driver_attach
[ 9.278747] device_driver_attach
[ 9.281301] driver_probe_device
[ 9.284739] bus: 'platform': driver_probe_device: matched device rpm-glink with driver qcom_glink_rpm
[ 9.287707] really_probe_debug
[ 9.297014] bus: 'platform': really_probe: probing driver qcom_glink_rpm with device rpm-glink
[ 9.300073] really_probe
[ 9.308676] glink_rpm_probe ffff000012ca0804 ffff000012ca0c00
[ 9.311457] qcom_glink_native_probe
[ 9.316905] qcom_glink_send_version
[ 9.320177] qcom_glink_tx
[ 9.323628] mbox_send_message
[ 9.326406] msg_submit
[ 9.329349] qcom_apcs_ipc_send_data
[ 9.331630] regmap_write
[ 9.334994] _regmap_write
[ 9.337769] _regmap_bus_reg_write
[ 9.340294] regmap_mmio_write
[ 9.343589] regmap_mmio_write32le ffff00001000d000 10 1
At this point the watchdog times out, the primary boot loader starts and the normal boot sequence ensues.
Much of the output in the log is from pr_debugs inserted by myself to trace the calls. I can pastebin the full boot log from the first and second kernels if anyone is willing to read it.
This is the source for the last C function before the hang, from ./drivers/base/regmap/regmap_mmio.c.
static void regmap_mmio_write32le(struct regmap_mmio_context *ctx,
unsigned int reg,
unsigned int val)
{
pr_debug("%s %llx %x %x\n", __func__, (u64)ctx->regs, reg, val);
writel(val, ctx->regs + reg);
}
Disassembling ./drivers/base/regmap/regmap_mmio.o with Radare2 + Cutter gives the following for this function:
/ (fcn) sym.regmap_mmio_write32le 64
| sym.regmap_mmio_write32le (int32_t arg3, int32_t arg2, int32_t arg1, int32_t arg_20h);
| ; arg int32_t arg_20h @ sp+0x20
| ; arg int32_t arg3 @ x2
| ; arg int32_t arg2 @ x1
| ; arg int32_t arg1 @ x0
| 0x080002d8 stp x29, x30, [sp, -0x30]!
| 0x080002dc mov x29, sp
| 0x080002e0 stp x19, x20, [sp, 0x10]
| 0x080002e4 mov w19, w1 ; arg2
| 0x080002e8 mov w20, w2 ; arg3
| 0x080002ec str x21, [sp + arg_20h]
| 0x080002f0 mov x21, x0 ; arg1
| 0x080002f4 nop
| 0x080002f8 dsb st
| 0x080002fc ldr x0, [x21] ; [0x8000000:4]=0x464c457f ; loc.imp.__devm_regmap_init
| 0x08000300 add x19, x0, w19, uxtw
| 0x08000304 str w20, [x19]
| 0x08000308 ldp x19, x20, [sp, 0x10]
| 0x0800030c ldr x21, [sp + arg_20h] ; [0x20:4]=-1 ; 32
| 0x08000310 ldp x29, x30, [sp], 0x30
\ 0x08000314 ret
This is the call stack.
Determined in the most labour intensive manner imaginable, because I do not know any better.
hangs in ./arch/arm64/include/asm/io.h:writel
called by ./drivers/base/regmap/regmap-mmio.c:regmap_mmio_write32le
called through ctx->reg_write(ctx, reg, val)
called by ./drivers/base/regmap/regmap-mmio.c:regmap_mmio_write
called through map->bus->reg_write(map->bus_context, reg, val)
called by ./drivers/base/regmap/regmap.c:_regmap_bus_reg_write
called through map->reg_write(context, reg, val)
called by ./drivers/base/regmap/regmap.c:_regmap_write
called by ./drivers/base/regmap/regmap.c:regmap_write
called by ./drivers/mailbox/qcom-apcs-ipc-mailbox.c:qcom_apcs_ipc_send_data
called through chan->mbox->ops->send_data(chan, chan->msg_data[idx])
called by ./drivers/mailbox/mailbox.c:msg_submit
called by ./drivers/mailbox/mailbox.c:mbox_send_message
called by ./drivers/rpmsg/qcom_glink_native.c:qcom_glink_tx
called by ./drivers/rpmsg/qcom_glink_native.c:qcom_glink_send_version
called by ./drivers/rpmsg/qcom_glink_native.c:qcom_glink_native_probe
called by ./drivers/rpmsg/qcom_glink_rpm.c:glink_rpm_probe
called through ret = drv->probe(dev)
called by ./drivers/base/dd.c:really_probe
called by ./drivers/base/dd.c:really_probe_debug
called by ./drivers/base/dd.c:driver_probe_device
called by ./drivers/base/dd.c:device_driver_attach
called by ./drivers/base/dd.c:__driver_attach
called through fn(dev, data)
called by ./drivers/base/bus.c:bus_for_each_dev
called by ./drivers/base/dd.c:driver_attach
called by ./drivers/base/bus.c:bus_add_driver
called by ./drivers/base/driver.c:driver_register
called by ./drivers/base/platform.c:__platform_driver_register
called as ./include/linux/platform_device.h:platform_driver_register
called by ./drivers/rpmsg/qcom_glink_rpm.c:glink_rpm_init
called through initcall mechanism
There are two calls of writel immediately before the call to qcom_glink_native_probe in ./drivers/rpmsg/qcom_glink_rpm.c:glink_rpm_probe. The pr_debug, inserted by me, is included to show where the values in the log come from.
writel(0, tx_pipe->head);
writel(0, rx_pipe->tail);
pr_debug("%s %llx %llx\n", __func__, (u64)tx_pipe->head, (u64)rx_pipe->tail);
glink = qcom_glink_native_probe(&pdev->dev,
0,
&rx_pipe->native,
&tx_pipe->native,
true);
These calls compile to:
| || ||: 0x08000448 dsb st
| || ||: 0x0800044c ldr x0, [x20, 0x30] ; [0x30:4]=-1 ; '0' ; 48
| || ||: 0x08000450 str wzr, [x0]
| || ||: 0x08000454 dsb st
| || ||: 0x08000458 ldr x4, [x19, 0x28] ; [0x28:4]=-1 ; '(' ; 40
| || ||: 0x0800045c str wzr, [x4]
Prior to that, the only calls to regmap_mmio_write32le occur as follows
[ 6.983167] driver_probe_device
[ 6.988049] bus: 'platform': driver_probe_device: matched device 9820000.mailbox with driver qcom_apcs_ipc
[ 6.990926] really_probe_debug
[ 7.000688] bus: 'platform': really_probe: probing driver qcom_apcs_ipc with device 9820000.mailbox
[ 7.003749] really_probe
[ 7.012957] driver: 'qcom_apcs_ipc': driver_bound: bound to device '9820000.mailbox'
[ 7.015611] bus: 'platform': really_probe: bound device 9820000.mailbox to driver qcom_apcs_ipc
[ 7.023595] probe of 9820000.mailbox returned 1 after 12000 usecs
[ 7.031655] OF: create child: /soc/clock-controller@300000
[ 7.038113] bus: 'platform': add device 300000.clock-controller
[ 7.043887] driver_probe_device
[ 7.049483] bus: 'platform': driver_probe_device: matched device 300000.clock-controller with driver gcc-msm8996
[ 7.052388] really_probe_debug
[ 7.062832] bus: 'platform': really_probe: probing driver gcc-msm8996 with device 300000.clock-controller
[ 7.065730] really_probe
[ 7.075721] _regmap_write
[ 7.077826] _regmap_bus_reg_write
[ 7.080345] regmap_mmio_write
[ 7.083642] regmap_mmio_write32le ffff000012300000 7d024 80282001
[ 7.086785] _regmap_write
[ 7.092664] _regmap_bus_reg_write
[ 7.095274] regmap_mmio_write
[ 7.098570] regmap_mmio_write32le ffff000012300000 7d024 80000000
[ 7.101683] _regmap_write
[ 7.107595] _regmap_bus_reg_write
[ 7.110206] regmap_mmio_write
[ 7.113501] regmap_mmio_write32le ffff000012300000 7d034 80282000
[ 7.116609] _regmap_write
[ 7.122525] _regmap_bus_reg_write
[ 7.125135] regmap_mmio_write
[ 7.128432] regmap_mmio_write32le ffff000012300000 7d038 80282001
[ 7.131519] _regmap_write
[ 7.137456] _regmap_bus_reg_write
[ 7.140067] regmap_mmio_write
[ 7.143361] regmap_mmio_write32le ffff000012300000 7d038 80000000
[ 7.152943] driver: 'gcc-msm8996': driver_bound: bound to device '300000.clock-controller'
[ 7.153333] bus: 'platform': really_probe: bound device 300000.clock-controller to driver gcc-msm8996
[ 7.160904] probe of 300000.clock-controller returned 1 after 36000 usecs
At this point, I am stuck. I have insufficient understanding of the arm architecture.
I am guessing that the data synchronisation barrier is hanging because kexec has omitted some necessary action, such as modifying a system register, when preparing for the second kernel.
Any suggestions on what I could check next would be very welcome. Or where I should be posting to to find someone who could help.