Hi all,
First of all, I apologize for the long post but I couldn’t find a simple and easy way to explain the problem without providing at least this information.
I am frequently facing kernel crashes when two PCIe devices are connected and in use on the Kirin970 platform.
Hardware setup and Image:
My development scenario requires ethernet connection (using onboard PCIe Realtek 8168 SoC) and an additional 802.11ac radio module (attached to the mini-PCIe slot), which should work simultaneously.
The radio modules that I’ve tried are a QCA6174 (an M.2 radio module that requires an additional M.2 to mini-PCIe adapter; I tried to fit it to the M.2 slot but it does not fit properly and, even if forced, it is not detected because the contacts seem to be misaligned) and, more recently, a QCA9994 (a mini-PCIe radio module, directly connected to the slot). Both Qualcomm radios rely on the ath10k driver, but to enable VHT160 or VHT80+80 modes the QCA9994 requires patching both the ath10k driver and the respective FW.
The image running on the Hikey970 is based on Debian and was custom compiled following the instructions provided in this post. To better debug this problem, my current kernel config is very similar to the default hikey970_defconfig config file (I am just enabling some debug options and some ath10k-related options, and disabling the android_paranoid option).
Making the Ath10k work (MSI vs INT interrupt mode):
The first issue that I saw was that ath10k constantly failed to detect the HTT core and/or load the FW, which made the initial device probe fail. I discovered that loading the ath10k kernel module with legacy interrupt mode forced allows the radios to enumerate successfully and work. However, I know from previous designs that at least the QCA6174 can work fine in MSI interrupt mode, which suggested to me that there was some problem in the Kirin970/designware PCIe driver layer. I tried to debug why it was failing but couldn’t find a solution (it seemed to hang forever waiting for an interrupt/value that never arrived). So I passed the boot argument pci=nomsi to the kernel to force all PCIe devices to work in INT (aka legacy) mode.
Behavior when using just one of the specified devices:
If just “enabling one PCIe device at a time”, they seem to work perfectly fine (tested for at least 48 consecutive hours each one).
The problem is that “enabling one PCIe device at a time” means either:
- just have the Realtek 8168 connected (onboard PCIe connection); no radio module fitted in either the M.2 or mini-PCIe slots;
- fit one radio module to the mini-PCIe slot but remove one of the devices (either the Realtek or the QCA radio module) from the PCI enumeration list during boot;
With the two devices enumerated at the same time on the system, the chances for a crash increase drastically, especially if under heavy network traffic load.
Crash dump analysis:
Although the crashes can have different output information, they seem to happen always due to the same reason: an attempt to read/access information on a PCIe IO memory area that is not accessible at that precise moment.
The following crash dump example illustrates the typical crash info output I see:
Exceptions seen:
[ 36.601101] Unhandled fault: synchronous external abort (0x96000210) at 0xffff00000e000018
[ 36.601104] Unhandled fault: synchronous external abort (0x96000210) at 0xffff00001139703e
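As a reading aid, the syndrome value printed in both messages can be decoded by hand. This is a minimal sketch, assuming the standard ARMv8-A ESR_ELx data-abort layout; the helper names are hypothetical and not taken from the kernel.

```c
#include <stdint.h>
#include <stdio.h>

/* Field helpers for an AArch64 ESR_ELx value reporting a data abort. */
static inline unsigned esr_ec(uint32_t esr)   { return (esr >> 26) & 0x3fu; } /* exception class     */
static inline unsigned esr_ea(uint32_t esr)   { return (esr >> 9) & 0x1u; }   /* external abort type */
static inline unsigned esr_wnr(uint32_t esr)  { return (esr >> 6) & 0x1u; }   /* 1 = write, 0 = read */
static inline unsigned esr_dfsc(uint32_t esr) { return esr & 0x3fu; }         /* fault status code   */

/* Print the decoded fields of one ESR value. */
void esr_describe(uint32_t esr)
{
    printf("ESR 0x%08x: EC=0x%02x EA=%u WnR=%u (%s) DFSC=0x%02x\n",
           (unsigned)esr, esr_ec(esr), esr_ea(esr), esr_wnr(esr),
           esr_wnr(esr) ? "write" : "read", esr_dfsc(esr));
}
```

Calling `esr_describe(0x96000210)` gives EC=0x25 (data abort taken without a change in exception level), DFSC=0x10 (synchronous external abort, not on a translation table walk) and WnR=0 (a read), which matches the load instructions shown in the per-CPU analysis.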
Looking at the PC register per CPU when the exception occurred:
CPU0:
[ 36.601188] PC is at rtl8168_interrupt+0x40/0x408 [r8168]
which corresponds to drivers/net/ethernet/realtek/r8168_n.c:28165, function rtl8168_interrupt, instruction:
status = RTL_R16(IntrStatus);
and, more specifically, to the following assembly code:
0000000000066348 <rtl8168_interrupt>:
(…)
__raw_readw():
/home/vagrant/projects/hikey970_linux_k4_9/linux/./arch/arm64/include/asm/io.h:80
asm volatile(ALTERNATIVE("ldrh %w0, [%1]",
66388: 79400022 ldrh w2, [x1]
where the X1 register contains:
x1 : ffff00001139703e
CPU1:
[ 38.696705] PC is at kirin_pcie_rd_own_conf+0x6c/0x140
which corresponds to drivers/pci/host/pcie-kirin.c:62, function kirin_pcie_rd_own_conf, instruction:
*val = readl(pp->dbi_base + (where & ~0x3));
which corresponds to the following assembly code:
00000000000001b0 <kirin_pcie_readl_rc>:
(…)
__raw_readl():
/home/vagrant/projects/hikey970_linux_k4_9/linux/./arch/arm64/include/asm/io.h:91
2e4: b9400000 ldr w0, [x0]
where the X0 register contains:
x0 : ffff00000e000018
Both the X1 and X0 registers contain valid addresses that correspond to the allocated PCIe IO memory areas (“R8168->ioaddr” and “kirin->dbi_base” respectively).
I am confident that these registers contain valid addresses at the time of the crash because the same accesses succeeded previously (typically thousands of times) before a crash and, when I am lucky (which is rare), the system can run for hours performing those accesses without crashing.
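To relate the faulting addresses to device registers, the offset inside each mapping can be extracted. This is only a sketch: it assumes both mappings are aligned to a 4 KiB page (which the low bits of the addresses suggest but do not prove), and the helper names are hypothetical.

```c
#include <stdint.h>

#define MMIO_PAGE_SIZE 0x1000ull  /* assumption: 4 KiB pages, page-aligned mapping */

/* Offset of a faulting MMIO address inside its (assumed page-aligned) mapping. */
static inline uint32_t mmio_offset(uint64_t fault_addr)
{
    return (uint32_t)(fault_addr & (MMIO_PAGE_SIZE - 1));
}

/* Round a config-space offset down to its 32-bit word, as `where & ~0x3` does. */
static inline uint32_t cfg_dword(uint32_t where)
{
    return where & ~0x3u;
}
```

With these, `mmio_offset(0xffff00001139703e)` is 0x3e and `mmio_offset(0xffff00000e000018)` is 0x18. Offset 0x3e should be the IntrStatus register read by `RTL_R16(IntrStatus)` (worth double-checking against the r8168 headers in use), and 0x18 is the dword that holds the primary/secondary/subordinate bus numbers in a Type 1 (bridge) config header. So both faults are ordinary register reads at plausible offsets, not wild pointers.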
PCIe power management is disabled (through the kernel boot argument pcie_aspm=off) and I have ruled out a hardware fault because:
- the radio modules work fine on other platforms;
- different Hikey970 boards exhibit the same crash behavior.
It seems that there is a race condition that allows two or more entities to access the PCIe bus at the same time (through the IO memory resource).
Looking at the typical backtraces, I am unsure where to put the mutual exclusion logic to prevent this situation. (Please note that the first functions listed were the last ones running before the crash.)
Example 1:
CPU_A:
[ 81.120100] [<ffff000000d34388>] rtl8168_interrupt+0x40/0x408 [r8168]
[ 81.120169] [<ffff00000811b7ac>] __handle_irq_event_percpu+0x5c/0x148
[ 81.120169] [<ffff00000811b8cc>] handle_irq_event_percpu+0x34/0x88
[ 81.120175] [<ffff00000811b968>] handle_irq_event+0x48/0x78
[ 81.120197] [<ffff00000811f868>] handle_fasteoi_irq+0xd8/0x1
[ 81.120199] [<ffff00000811a88c>] generic_handle_irq+0x24/0x38
[ 81.120200] [<ffff00000811af14>] __handle_domain_irq+0x84/0xf0
[ 81.120221] [<ffff00000808173c>] gic_handle_irq+0x54/0xa8
[ 81.120329] [<ffff0000080831e4>] el1_irq+0xe4/0x188
[ 81.120348] [<ffff0000080c9850>] irq_exit+0x90/0xd0
[ 81.120348] [<ffff00000811af18>] __handle_domain_irq
[ 81.120359] [<ffff00000808173c>] gic_handle_irq+0x54/0xa8
CPU_B:
[ 84.351246] [<ffff000008436820>] kirin_pcie_readl_rc+0x20/0x40
[ 84.357087] [<ffff000008435508>] dw_pcie_readl_rc+0x18/0x38
[ 84.362667] [<ffff000008435734>] dw_pcie_prog_outbound_atu+0x20c/0x390
[ 84.369202] [<ffff000008435c78>] dw_pcie_rd_conf+0x150/0x1a0
[ 84.374873] [<ffff000008411278>] pci_bus_read_config_word+0x80/0xe0
[ 84.381200] [<ffff000000d3df5c>] rtl8168_esd_timer+0xf4/0x560 [r8168]
[ 84.387654] [<ffff00000812ed2c>] call_timer_fn.isra.5+0x24/0x80
[ 84.393583] [<ffff00000812ee34>] expire_timers+0xac/0xd0
[ 84.398904] [<ffff00000812ef54>] run_timer_softirq+0xfc/0x1a0
[ 84.404658] [<ffff000008081a18>] __do_softirq+0x128/0x230
[ 84.410065] [<ffff0000080c9850>] irq_exit+0x90/0xd0
[ 84.414954] [<ffff00000811af18>] __handle_domain_irq+0x88/0xf0
[ 84.420793] [<ffff00000808173c>] gic_handle_irq+0x54/0xa8
Example 2:
CPU_A:
[ 34.079436] [<ffff0000084369c8>] kirin_pcie_readl_rc+0x20/0x40
[ 34.079446] [<ffff000008435508>] dw_pcie_readl_rc+0x18/0x38
[ 34.079456] [<ffff000008435734>] dw_pcie_prog_outbound_atu+0
[ 34.079466] [<ffff000008435c78>] dw_pcie_rd_conf+0x150/0x1a0
[ 34.079475] [<ffff0000084111b0>] pci_bus_read_config_byte+0x7
[ 34.079491] [<ffff000000d3e19c>] rtl8168_esd_timer+0x334/0x560
[ 34.079562] [<ffff00000812ed2c>] call_timer_fn.isra.5+0x24/0x80
[ 34.079582] [<ffff00000812ee34>] expire_timers+0xac/0xd0
[ 34.079593] [<ffff00000812ef54>] run_timer_softirq+0xfc/0
[ 34.079607] [<ffff000008081a18>] __do_softirq+0x128/0x230
[ 34.079619] [<ffff0000080c9850>] irq_exit+0x90/0xd0
[ 34.079621] [<ffff0000080906c8>] handle_IPI+0x138/0x
[ 34.079641] [<ffff000008081788>] gic_handle_irq+0xa0/0xa
[ 34.079748] [<ffff0000080831e4>] el1_irq+0xe4/0x188
CPU_B:
[ 37.076376] [<ffff000000d34388>] rtl8168_interrupt+0x40/0x408 [r8168]
[ 37.082827] [<ffff00000811b7ac>] __handle_irq_event_percpu+0x5c/0x148
[ 37.089276] [<ffff00000811b8cc>] handle_irq_event_percpu+0x34/0x88
[ 37.095463] [<ffff00000811b968>] handle_irq_event+0x48/0x78
[ 37.101048] [<ffff00000811f868>] handle_fasteoi_irq+0xd8/0x1c8
[ 37.106894] [<ffff00000811a88c>] generic_handle_irq+0x24/0x38
[ 37.112650] [<ffff00000811af14>] __handle_domain_irq+0x84/0xf0
[ 37.118494] [<ffff00000808173c>] gic_handle_irq+0x54/0xa8
Looking at these backtraces, it is possible to infer that:
- the R8168 interrupt handler has no call dependency on the kirin or designware-host pcie driver layers;
- the only common path that I can see is gic_handle_irq, but I am not sure whether this would be the best place to put the mutual exclusion logic;
At first glance, it seems I should prevent the two interrupt handlers from running at the same time, but this is not the only condition that makes it crash.
If R8168 is loaded and the device enumerated, just running a simple lspci command while transmitting/receiving data through the interface makes it crash (even in MSI interrupt mode and with it being the only PCIe device present). In this case, one of the calls is not in interrupt context but it also tries to access the same resource.
lspci inconsistent output:
More recently, while debugging just the crashes when running the lspci command, I noticed that the hex dump output (of the config space) for the Root PCIe device is not always the same (even across consecutive lspci calls in the same Linux session…when the system does not crash).
lspci run 1:
0000:00:00.0 PCI bridge: Huawei Technologies Co., Ltd. Device 3670 (rev 01)
Kernel driver in use: pcieport
lspci: Unable to load libkmod resources: error -12
00: e5 19 70 36 07 01 10 00 01 00 04 06 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 01 ff 00 f0 00 00 20
20: 00 01 30 01 40 01 40 01 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
(…)
0000:05:00.0 Network controller: Qualcomm Atheros Device 0046 (rev 01)
Subsystem: Qualcomm Atheros Device cafe
Kernel driver in use: ath10k_pci
00: 8c 16 46 00 06 00 10 00 01 00 80 02 00 00 00 00
10: 04 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 8c 16 fe ca
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
lspci run 2:
0000:00:00.0 PCI bridge: Huawei Technologies Co., Ltd. Device 3670 (rev 01)
Kernel driver in use: pcieport
lspci: Unable to load libkmod resources: error -12
00: e5 19 70 36 ff ff ff ff 01 00 04 06 00 00 01 00
10: 00 00 00 00 ff ff ff ff 00 01 ff 00 ff ff ff ff
20: 00 01 30 01 40 01 40 01 00 00 00 00 ff ff ff ff
30: 00 00 00 00 ff ff ff ff 00 00 00 00 ff 01 00 00
(…)
0000:05:00.0 Network controller: Qualcomm Atheros Device 0046
Subsystem: Qualcomm Atheros Device cafe
Kernel driver in use: ath10k_pci
00: 8c 16 46 00 06 00 10 00 00 00 80 02 00 00 00 00
10: 04 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 8c 16 fe ca
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
Comparing these outputs, it is possible to see that, for some reason, on lspci run 2 some config space reads failed (ff). In this precise example, the command register should be the same and even the status register read should not return ff; I have seen it fail for Device ID and Vendor ID as well. This seems to be frequent on the Root Bridge but less so on the other devices. On the other devices, apart from the fields that naturally change (e.g. the status register), the ff read failure was not seen…but as can be seen in run 2, sometimes the revision is wrongly read on the QCA9994 PCIe device…
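The comparison of the two root bridge dumps can be made mechanical. The sketch below copies the first 64 bytes of each run from the output above and lists every aligned 32-bit word that reads as all ff in run 2 but not in run 1; the function and array names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Root bridge 0000:00:00.0, first 64 config bytes, copied from run 1 and run 2. */
const uint8_t root_run1[64] = {
    0xe5,0x19,0x70,0x36, 0x07,0x01,0x10,0x00, 0x01,0x00,0x04,0x06, 0x00,0x00,0x01,0x00,
    0x00,0x00,0x00,0x00, 0x00,0x00,0x00,0x00, 0x00,0x01,0xff,0x00, 0xf0,0x00,0x00,0x20,
    0x00,0x01,0x30,0x01, 0x40,0x01,0x40,0x01, 0x00,0x00,0x00,0x00, 0x00,0x00,0x00,0x00,
    0x00,0x00,0x00,0x00, 0x40,0x00,0x00,0x00, 0x00,0x00,0x00,0x00, 0xff,0x01,0x00,0x00,
};
const uint8_t root_run2[64] = {
    0xe5,0x19,0x70,0x36, 0xff,0xff,0xff,0xff, 0x01,0x00,0x04,0x06, 0x00,0x00,0x01,0x00,
    0x00,0x00,0x00,0x00, 0xff,0xff,0xff,0xff, 0x00,0x01,0xff,0x00, 0xff,0xff,0xff,0xff,
    0x00,0x01,0x30,0x01, 0x40,0x01,0x40,0x01, 0x00,0x00,0x00,0x00, 0xff,0xff,0xff,0xff,
    0x00,0x00,0x00,0x00, 0xff,0xff,0xff,0xff, 0x00,0x00,0x00,0x00, 0xff,0x01,0x00,0x00,
};

/*
 * Store in out[] the offsets of aligned 32-bit words that are all-ones in
 * cur but not in ref. Returns the number of offsets stored (at most max_out).
 */
size_t failed_dwords(const uint8_t *ref, const uint8_t *cur, size_t len,
                     size_t *out, size_t max_out)
{
    static const uint8_t all_ff[4] = { 0xff, 0xff, 0xff, 0xff };
    size_t n = 0;

    for (size_t off = 0; off + 4 <= len; off += 4) {
        int cur_ff = memcmp(cur + off, all_ff, 4) == 0;
        int ref_ff = memcmp(ref + off, all_ff, 4) == 0;
        if (cur_ff && !ref_ff && n < max_out)
            out[n++] = off;
    }
    return n;
}
```

Run on the two arrays, this reports five offsets: 0x04, 0x14, 0x1c, 0x2c and 0x34. Every bad value covers one whole aligned dword, which would be consistent with individual 32-bit reads such as `readl(pp->dbi_base + (where & ~0x3))` returning 0xffffffff, rather than with single corrupted bytes.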
I do not see any errors in the boot log related to PCIe link training so, although unsure, I believe this is only related to faulty arbitration of the PCIe bus (multiple concurrent accesses being allowed).
Some of the attempts already tried without success:
- compile the newer driver (v8.046) provided directly by Realtek; it didn’t make any difference;
- at some point, after consulting the ARM A53/A73 errata, I started suspecting that this could also be related to one of the bugs that the ALTERNATIVES framework tries to work around (e.g. in certain situations, between the instruction that sets the address in an X register and a ldr instruction that uses that X register as the source location, there should be at least two other instructions that don’t modify the X register, to avoid invalid data in the X register)… Looking at the assembly code, it is not clear to me that these conditions are being satisfied, which was one of the reasons why I wanted to debug using the JTAG (see post);
- compile kernel 4.19.1 (which includes updated pcie patches and a different structure, as well as other patches in the basic IO kernel subsystem, including the ALTERNATIVES framework)…but I couldn’t get LDO33 and five other regulators to initialize properly (system hangs waiting for a specific status); without these regulators, I can only see the root bridge and none of the PCIe ports or devices;
- force IRQ affinity to just one core; this allows it to run much more reliably (as expected) but it can still fail because “data reading/writing processes” (e.g. iperf or lspci) can still force a direct (and concurrent) access from any other core (and of course, limiting everything to run on one core is not an optimal solution);
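For anyone wanting to reproduce the IRQ affinity experiment, the pinning can be done through procfs. This is a minimal sketch, assuming the standard /proc/irq/<n>/smp_affinity interface (a hex CPU bitmask) and root privileges; the function names are hypothetical and the IRQ number has to be looked up in /proc/interrupts on the target.

```c
#include <stddef.h>
#include <stdio.h>

/* Format the hex CPU bitmask for a single CPU (limited to CPUs 0..31 here). */
int format_affinity_mask(unsigned cpu, char *buf, size_t len)
{
    if (cpu >= 32)
        return -1;
    int n = snprintf(buf, len, "%x", 1u << cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Pin one IRQ to one CPU by writing the mask to /proc/irq/<irq>/smp_affinity. */
int set_irq_affinity(unsigned irq, unsigned cpu)
{
    char path[64], mask[16];

    if (format_affinity_mask(cpu, mask, sizeof mask) != 0)
        return -1;
    snprintf(path, sizeof path, "/proc/irq/%u/smp_affinity", irq);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fputs(mask, f) >= 0;
    if (fclose(f) != 0)   /* a rejected mask may only surface when the buffer is flushed */
        ok = 0;
    return ok ? 0 : -1;
}
```

Note that this only steers the interrupt handlers; config space reads issued from process or timer context on other cores are unaffected, which is why the crash can still occur.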
It is possible that this race condition is “visible” because I am forcing the legacy interrupt mode instead of MSI, which could potentially avoid the race condition…however, as mentioned previously, I couldn’t get it to work in MSI interrupt mode…and in the case of the R8168, just running a lspci while transmitting/receiving data also makes it crash.
Questions:
- has anyone used the Hikey970 with two PCIe devices without facing/spotting this issue?
- has anyone seen this issue and solved it?
- has anyone got a QCA radio (based on the ath10k driver) working with MSI IRQ support?
Currently, I am trying to implement some mutual exclusion logic shared between the pcie-kirin and r8168 drivers based on spinlocks (not sure if it will be needed on pcie-designware as well). So far, the main problem is to avoid deadlock situations.
I am not sure if this is the best approach to solve the race condition, but it seems the quickest and the one that “apparently” needs the fewest modifications to the r8168 and pcie-kirin drivers (or others).
Any help, guidance or suggestions would be very much appreciated.
Kind regards