Getting kernel crash in Linaro-18.01

Hello Team,

We are working on a product based on the APQ8016, using Linaro-18.01.

We see the kernel crash below when we boot the kernel.

[ 5.349792] Synchronous External Abort: synchronous external abort (0x96000010) at 0xffff0000097ddfe0
[ 5.349862] Internal error: : 96000010 [#1] PREEMPT SMP
[ 5.358005] Modules linked in: i2c_qcom_cci
[ 5.363054] CPU: 1 PID: 62 Comm: kworker/1:1 Not tainted 4.14.0-qcomlt-arm64 #292
[ 5.367213] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
[ 5.374872] Workqueue: events amba_deferred_retry_func
[ 5.381623] task: ffff80003d66d700 task.stack: ffff000009808000
[ 5.386580] PC is at amba_device_try_add+0xe4/0x238
[ 5.392392] LR is at amba_device_try_add+0xd0/0x238
[ 5.397247] pc : [] lr : [] pstate: 40000145
[ 5.402106] sp : ffff00000980bd50
[ 5.409740] x29: ffff00000980bd50 x28: 0000000000000000
[ 5.412954] x27: ffff80003d6d4fb8 x26: ffff000008e0bee0
[ 5.418337] x25: ffff0000080def18 x24: ffff0000097dd000
[ 5.423631] x23: ffff80003d0ec6f8 x22: 0000000000001000
[ 5.428925] x21: 0000000000000000 x20: 0000000000000000
[ 5.434221] x19: ffff80003d0ec400 x18: 0000000000000000
[ 5.439516] x17: 0000000000000000 x16: 0000000000000000
[ 5.444810] x15: 0000000000000000 x14: ffff80003d42ab80
[ 5.450107] x13: 0000800036b68000 x12: 000000013df321af
[ 5.455402] x11: 0000000000000000 x10: 0000000000000a80
[ 5.460698] x9 : ffff00000980ba50 x8 : ffff80003d66e1e0
[ 5.465992] x7 : ffff80003fe44d00 x6 : 00000000000e9cc1
[ 5.471286] x5 : 0000000000000002 x4 : 0000000000000000
[ 5.476581] x3 : 0000000000000000 x2 : 0000000000000000
[ 5.481877] x1 : ffff0000097ddfe0 x0 : 0000000000000000
[ 5.487179] Process kworker/1:1 (pid: 62, stack limit = 0xffff000009808000)
[ 5.492468] Call trace:
[ 5.499154] Exception stack(0xffff00000980bc10 to 0xffff00000980bd50)
[ 5.501593] bc00: 0000000000000000 ffff0000097ddfe0
[ 5.508196] bc20: 0000000000000000 0000000000000000 0000000000000000 0000000000000002
[ 5.516011] bc40: 00000000000e9cc1 ffff80003fe44d00 ffff80003d66e1e0 ffff00000980ba50
[ 5.523822] bc60: 0000000000000a80 0000000000000000 000000013df321af 0000800036b68000
[ 5.531633] bc80: ffff80003d42ab80 0000000000000000 0000000000000000 0000000000000000
[ 5.539446] bca0: 0000000000000000 ffff80003d0ec400 0000000000000000 0000000000000000
[ 5.547260] bcc0: 0000000000001000 ffff80003d0ec6f8 ffff0000097dd000 ffff0000080def18
[ 5.555073] bce0: ffff000008e0bee0 ffff80003d6d4fb8 0000000000000000 ffff00000980bd50
[ 5.562885] bd00: ffff0000084d3180 ffff00000980bd50 ffff0000084d3194 0000000040000145
[ 5.570697] bd20: ffff00000980bd50 ffff0000084d3180 ffffffffffffffff 00000000fffffffe
[ 5.578504] bd40: ffff00000980bd50 ffff0000084d3194
[ 5.586321] [] amba_device_try_add+0xe4/0x238
[ 5.591009] [] amba_deferred_retry_func+0x48/0xc8
[ 5.596913] [] process_one_work+0x1c8/0x328
[ 5.603159] [] worker_thread+0x44/0x450
[ 5.609066] [] kthread+0x128/0x130
[ 5.614446] [] ret_from_fork+0x10/0x18
[ 5.619481] Code: 35000820 d10082c1 52800002 8b010301 (b9400020)
[ 5.624769] ---[ end trace 605510ee26ddc69c ]---
[ 8.936611] EXT4-fs (mmcblk0p19): recovery complete
[ 8.937358] EXT4-fs (mmcblk0p19): mounted filesystem with ordered data mode. Opts: (null)
done.

Can anyone please help here?

Regards,
Parth Y Shah

Can you add the custom_board tag?

You indicate neither how to reproduce this issue nor its context. I would also suggest moving to the latest release.

Added here.

After flashing the signed images on our board, we see these crash logs during kernel bootup. Can you please help here?

There’s not much to go on here so I’m afraid I can only really state the obvious.

It looks like accessing an AMBA device is causing bus errors. You should convert amba_device_try_add+0xe4 into a line number, then add instrumentation just before that line to find out which device the kernel is trying to add.
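
One way to do that conversion is the kernel tree's scripts/faddr2line helper, which takes a vmlinux and a function+offset/size token. The sketch below only extracts that token from an oops line and builds the command to run. It assumes a vmlinux with debug info from the same build as the crashing kernel, and that the command is run from the top of the kernel source tree. The Python function names are made up for illustration.

```python
import re

# Matches the "function+offset/size" token printed in the oops,
# e.g. "amba_device_try_add+0xe4/0x238".
SYMBOL_RE = re.compile(r"([A-Za-z_][\w.]*)\+(0x[0-9a-fA-F]+)/(0x[0-9a-fA-F]+)")


def parse_oops_symbol(line):
    """Return (function, offset, size) from an oops line, or None."""
    m = SYMBOL_RE.search(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


def faddr2line_argv(vmlinux, line):
    """Build the argument list for scripts/faddr2line (path is an assumption)."""
    parsed = parse_oops_symbol(line)
    if parsed is None:
        raise ValueError("no function+offset/size token in: %r" % line)
    func, offset, size = parsed
    return ["scripts/faddr2line", vmlinux, "%s+%#x/%#x" % (func, offset, size)]


if __name__ == "__main__":
    pc_line = "[ 5.386580] PC is at amba_device_try_add+0xe4/0x238"
    # Print the command; run it by hand or pass the list to subprocess.run().
    print(" ".join(faddr2line_argv("vmlinux", pc_line)))
```

The offset changes whenever the function is rebuilt (the later log in this thread shows +0x1e0/0x378 after debug prints were added), so the vmlinux has to match the exact kernel that crashed.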

Additionally, if the unsigned case is working, you should be comparing the dmesg traces between the working and not-working cases (perhaps with initcall_debug enabled). You are likely to find that they differ from each other well before the fragment of boot log you have permitted us to see.
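
A minimal sketch of such a comparison is below. It strips the leading timestamps, which differ on every boot, and reports the first line where the two logs disagree. The file names good.log and bad.log and the function names are assumptions for illustration. Boot messages can also be reordered between runs, so the first mismatch is only a starting point for reading both logs.

```python
import re
import sys

# Leading "[    5.349792] " style timestamp printed by the kernel.
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamp(line):
    """Drop the dmesg timestamp and trailing newline from one log line."""
    return TIMESTAMP_RE.sub("", line.rstrip("\n"))


def first_divergence(good_lines, bad_lines):
    """Return (index, good_line, bad_line) for the first mismatch, or None."""
    good = [strip_timestamp(l) for l in good_lines]
    bad = [strip_timestamp(l) for l in bad_lines]
    for i, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            return i, g, b
    if len(good) != len(bad):
        # One log is a prefix of the other; report where the shorter one ends.
        i = min(len(good), len(bad))
        return (i,
                good[i] if i < len(good) else None,
                bad[i] if i < len(bad) else None)
    return None


if __name__ == "__main__":
    # Usage (hypothetical file names): python3 compare_dmesg.py good.log bad.log
    with open(sys.argv[1]) as f_good, open(sys.argv[2]) as f_bad:
        print(first_divergence(f_good.readlines(), f_bad.readlines()))
```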

Hi Daniel,

After further investigation, below are our findings:
In drivers/amba/bus.c from Linaro-18.01, the crash comes from the function amba_device_try_add(). We added some debug prints and compared the logs from the earlier unsigned binaries with those from the latest signed binaries.

========= Fail case log(signed binary case)============================

[ 5.371402] amba_device_try_add --- line 404 start 820000 tmp ffff0000097e5000 size 1000

[ 5.373542] Synchronous External Abort: synchronous external abort (0x96000010) at 0xffff0000097e5fe0
[ 5.381601] Internal error: : 96000010 [#1] PREEMPT SMP
[ 5.390678] Modules linked in: i2c_qcom_cci
[ 5.395720] CPU: 1 PID: 1607 Comm: kworker/1:2 Not tainted 4.14.0-qcomlt-arm64 #41
[ 5.399883] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
[ 5.407534] Workqueue: events amba_deferred_retry_func
[ 5.414378] task: ffff80003d2fd700 task.stack: ffff00000c830000
[ 5.419332] PC is at amba_device_try_add+0x1e0/0x378
[ 5.425146] LR is at amba_device_try_add+0x1d4/0x378
[ 5.430352] pc : [] lr : [] pstate: 40000145
[ 5.435299] sp : ffff00000c833d10
[ 5.442672] x29: ffff00000c833d10 x28: 0000000000000000
[ 5.445887] x27: 0000000000001000 x26: ffff80003d0f0ef8
[ 5.451269] x25: 0000000000000000 x24: ffff0000097e5000
[ 5.456563] x23: 0000000000001000 x22: ffff000008e66360
[ 5.461859] x21: ffff000008b615a0 x20: 0000000000000000
[ 5.467155] x19: ffff80003d0f0c00 x18: 0000000000000000
[ 5.472449] x17: 0000000000000000 x16: 0000000000000000
[ 5.477745] x15: 0000000000000010 x14: 3035653739303030
[ 5.483039] x13: 3030666666662070 x12: 6d74203030303032
[ 5.488335] x11: 3820747261747320 x10: 0000000000000004
[ 5.493629] x9 : 000000000000005d x8 : 000000000000000d
[ 5.498926] x7 : ffff00000c833abc x6 : 00000000000001e3
[ 5.504220] x5 : 0000000000000000 x4 : 0000000000000000
[ 5.509516] x3 : ffffffffffffffff x2 : 0000000000000000
[ 5.514810] x1 : ffff80003d2fd700 x0 : ffff0000097e5fe0
[ 5.520108] Process kworker/1:2 (pid: 1607, stack limit = 0xffff00000c830000)
[ 5.525403] Call trace:
[ 5.532432] Exception stack(0xffff00000c833bd0 to 0xffff00000c833d10)
[ 5.534696] 3bc0: ffff0000097e5fe0 ffff80003d2fd700
[ 5.541296] 3be0: 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000000
[ 5.549109] 3c00: 00000000000001e3 ffff00000c833abc 000000000000000d 000000000000005d
[ 5.556921] 3c20: 0000000000000004 3820747261747320 6d74203030303032 3030666666662070
[ 5.564734] 3c40: 3035653739303030 0000000000000010 0000000000000000 0000000000000000
[ 5.572547] 3c60: 0000000000000000 ffff80003d0f0c00 0000000000000000 ffff000008b615a0
[ 5.580358] 3c80: ffff000008e66360 0000000000001000 ffff0000097e5000 0000000000000000
[ 5.588172] 3ca0: ffff80003d0f0ef8 0000000000001000 0000000000000000 ffff00000c833d10
[ 5.595985] 3cc0: ffff0000084d3530 ffff00000c833d10 ffff0000084d353c 0000000040000145
[ 5.603796] 3ce0: 0000000000000194 0000000000820000 ffffffffffffffff 0000000000001000
[ 5.611606] 3d00: ffff00000c833d10 ffff0000084d353c
[ 5.619419] [] amba_device_try_add+0x1e0/0x378
[ 5.624107] [] amba_deferred_retry_func+0xd4/0x134
[ 5.630359] [] process_one_work+0x1c8/0x328
[ 5.636692] [] worker_thread+0x44/0x450
[ 5.642336] [] kthread+0x128/0x130
[ 5.647717] [] ret_from_fork+0x10/0x18
[ 5.652754] Code: 97f12b69 d10082e0 52800002 8b000300 (b9400001)
[ 5.658045] ---[ end trace 997e35eaf2bc9e88 ]---

==========Pass case log(unsigned case)========================
[ 5.368101] amba_device_try_add --- line 404 start 820000 tmp ffff0000097e5000 size 1000
No crash

Function amba_device_try_add()
364         /*
365          * Dynamically calculate the size of the resource
366          * and use this for iomap
367          */
368         printk("%s --- line %d\n", __func__, __LINE__);
369         size = resource_size(&dev->res);
370         tmp = ioremap(dev->res.start, size);
371         if (!tmp) {
372                 ret = -ENOMEM;
373                 goto err_release;
374         }
.........
404         printk("%s --- line %d start %llx tmp %p size %x\n", __func__, __LINE__,dev->res.start,  tmp, size);
405                 /*
406                  * Read pid and cid based on size of resource
407                  * they are located at end of region
408                  */
409                 for (pid = 0, i = 0; i < 4; i++)
410                         pid |= (readl(tmp + size - 0x20 + 4 * i) & 255) <<
411                                 (i * 8);

Here, as per our observation, the crash comes from line 410, where the pid is read. But as the pass case and fail case logs show, dev->res.start, tmp and size are the same in both cases. This resource comes from the following node in the dtsi file arch/arm64/boot/dts/qcom/msm8916.dtsi:

tpiu@820000 {
        compatible = "arm,coresight-tpiu", "arm,primecell";
        reg = <0x820000 0x1000>;

        clocks = <&rpmcc RPM_QDSS_CLK>, <&rpmcc RPM_QDSS_A_CLK>;
        clock-names = "apb_pclk", "atclk";

        port {
                tpiu_in: endpoint {
                        slave-mode;
                        remote-endpoint = <&replicator_out1>;
                };
        };
};
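
As a cross-check, the faulting address in the fail log (0xffff0000097e5fe0) can be reproduced from the values printed at line 404. The sketch below repeats the address arithmetic of the pid loop on lines 409-411, once for the virtual mapping and once for the physical base from the reg property. The helper name is made up for illustration.

```python
def pid_register_addresses(base, size):
    """Addresses read by the pid loop: base + size - 0x20 + 4 * i, for i = 0..3."""
    return [base + size - 0x20 + 4 * i for i in range(4)]


if __name__ == "__main__":
    # tmp and size as printed by the debug line in the fail case log.
    virt = pid_register_addresses(0xFFFF0000097E5000, 0x1000)
    # dev->res.start, i.e. the first cell of reg = <0x820000 0x1000>.
    phys = pid_register_addresses(0x820000, 0x1000)
    for v, p in zip(virt, phys):
        print("virt %#x  phys %#x" % (v, p))
```

The first virtual address is 0xffff0000097e5fe0, so the abort happens on the very first register read (i = 0), at physical address 0x820fe0 inside the tpiu block.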

So please suggest what we can do further here.

Regards,
Parth Y Shah

Now that you know which hardware is impacted, this is probably a question for whoever supplied you with the alternative firmware. In particular, you should ask whether this piece of hardware is still enabled for non-secure access.

If your new firmware does disable access to this hardware then you would need to remove it from the device tree.

Hello Daniel,

Thanks for all your inputs. There are 14 nodes in total that use apb_pclk, and the first of these is tpiu. We tried disabling it, and the crash then came from the next node. So we disabled all 14 nodes, and now we no longer see any crash. But we have a few questions:

  1. What is the use case for all these nodes?
  2. Will disabling these nodes affect the performance of any other peripheral or of the device itself?

In the pass case (unsigned images), when all 14 nodes came up properly, we could see them as below:
root@linaro-alip:~# ls /sys/bus/amba/devices/
820000.tpiu 825000.etf 850000.debug 856000.debug 85e000.etm
821000.funnel 826000.etr 852000.debug 85c000.etm 85f000.etm
824000.replicator 841000.funnel 854000.debug 85d000.etm

Please help us with our queries.

Regards,
Parth Y Shah

These are CoreSight components [1], so you lose the hardware-assisted tracing features, which should not cause any performance issues for other functions.

[1] https://www.kernel.org/doc/Documentation/trace/coresight.txt

Thanks @Loic for your quick reply.