Bad RAM, or something else?


#1

Hello,

I just got my board this week and have been attempting to get it to boot properly through all the different methods given in the docs. At this point I am certain I’ve followed the setup properly, with both the recent v179 and v184 releases.

Once flashed, the kernel eventually panics, but at seemingly random points during the boot process. Here’s one of the more common cases, although sometimes the panic happens before the mount:

Scanning for Btrfs filesystems
done.
Begin: Will now check root file system ... fsck from util-linux 2.31.1
fsck: error 2 (No such file or directory) while executing fsck.ext4 for /dev/sda7
fsck exited with status code 8
done.
Warning: File system check failed but did not detect errors
[   11.906340] EXT4-fs (sda7): mounted filesystem with ordered data mode. Opts: (null)
done.
Begin: Running /scripts/local-bottom ... done.
Begin: Running /scripts/init-bottom ... done.
[   12.097387] Insufficient stack space to handle exception!
[   12.097399] ESR: 0x96000044 -- DABT (current EL)
[   12.101766] FAR: 0xfffec000080000d0
[   12.106453] Task stack:     [0xffff0000090d0000..0xffff0000090d4000]
[   12.109670] IRQ stack:      [0xffff000008000000..0xffff000008004000]
[   12.116267] Overflow stack: [0xffff8000de98f2e0..0xffff8000de9902e0]
[   12.122609] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-qcomlt-arm64 #1
[   12.128942] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[   12.135973] task: ffff0000090e3080 task.stack: ffff0000090d0000
[   12.141968] PC is at el1_sync+0x0/0xb0
[   12.147777] LR is at swiotlb_unmap_sg_attrs+0x58/0x88
[   12.151594] pc : [<ffff000008082e40>] lr : [<ffff0000083faf78>] pstate: 200003c5
[   12.156721] sp : fffec000080000d0
[   12.164178] x29: fffec00008003cd0 x28: ffff0000090e3080 
[   12.167393] x27: 0000000000000000 x26: ffff000008ec3d88 
[   12.172775] x25: 0000000000000000 x24: 0000000000000000 
[   12.178070] x23: ffff8000d9742410 x22: 0000000000000010 
[   12.183365] x21: 0000000000000002 x20: 0000000000000006 
[   12.188660] x19: ffff8000d7e8a8a0 x18: 0000000000040e0f 
[   12.193955] x17: 0000ffff91ee2a98 x16: 0000000000000005 
[   12.199250] x15: 7ffffffffffff7ff x14: 0000000000000000 
[   12.204545] x13: 0000000000000000 x12: 0000000000000001 
[   12.209841] x11: 0000000000000000 x10: 0000000000000040 
[   12.215136] x9 : ffff0000090f47c8 x8 : 0000000000030e75 
[   12.220431] x7 : ffff8000d9c00270 x6 : 00000000dafff000 
[   12.225725] x5 : 00000000defff000 x4 : 0000000000000000 
[   12.231022] x3 : 0000000000000002 x2 : 0000000000001000 
[   12.236317] x1 : 000000015772a000 x0 : ffff8000d9742410 
[   12.241614] Kernel panic - not syncing: kernel stack overflow
[   12.246910] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-qcomlt-arm64 #1
[   12.252553] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[   12.259671] Call trace:
[   12.265656] [<ffff000008089020>] dump_backtrace+0x0/0x3a0
[   12.268005] [<ffff0000080893d4>] show_stack+0x14/0x20
[   12.273560] [<ffff000008aae5d8>] dump_stack+0x9c/0xbc
[   12.278596] [<ffff0000080c9c68>] panic+0x11c/0x28c
[   12.283626] [<ffff0000080c98a4>] nmi_panic+0x6c/0x70
[   12.288316] [<ffff000008089c9c>] handle_bad_stack+0x11c/0x128
[   12.293437] Exception stack(0xffff8000de9901a0 to 0xffff8000de9902e0)
[   12.299082] 01a0: ffff8000d9742410 000000015772a000 0000000000001000 0000000000000002
[   12.305508] 01c0: 0000000000000000 00000000defff000 00000000dafff000 ffff8000d9c00270
[   12.313320] 01e0: 0000000000030e75 ffff0000090f47c8 0000000000000040 0000000000000000
[   12.321134] 0200: 0000000000000001 0000000000000000 0000000000000000 7ffffffffffff7ff
[   12.328945] 0220: 0000000000000005 0000ffff91ee2a98 0000000000040e0f ffff8000d7e8a8a0
[   12.336757] 0240: 0000000000000006 0000000000000002 0000000000000010 ffff8000d9742410
[   12.344570] 0260: 0000000000000000 0000000000000000 ffff000008ec3d88 0000000000000000
[   12.352383] 0280: ffff0000090e3080 fffec00008003cd0 ffff0000083faf78 fffec000080000d0
[   12.360194] 02a0: ffff000008082e40 00000000200003c5 0000000000000000 0000000000000000
[   12.368007] 02c0: 0001000000000000 0000000000000000 fffec00008003cd0 ffff000008082e40
[   12.375820] [<ffff00000808285c>] __bad_stack+0x88/0x8c
[   12.383628] [<ffff000008082e40>] el1_sync+0x0/0xb0
[   12.388669] SMP: stopping secondary CPUs
[   12.393447] Kernel Offset: disabled
[   12.397515] CPU features: 0x002008
[   12.400724] Memory Limit: none
[   12.404206] ---[ end Kernel panic - not syncing: kernel stack overflow
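
For anyone reading the dump above, here is a minimal sketch of how the ESR value and the stack pointer can be checked by hand. It assumes the standard ARMv8 ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for data aborts WnR in bit 6 and DFSC in bits 5:0). The constants are copied from the log; the function names and the small lookup tables are made up for illustration and are not exhaustive.

```python
# Values copied from the panic log above.
ESR = 0x96000044
SP = 0xfffec000080000d0
STACKS = {
    "task": (0xffff0000090d0000, 0xffff0000090d4000),
    "irq": (0xffff000008000000, 0xffff000008004000),
    "overflow": (0xffff8000de98f2e0, 0xffff8000de9902e0),
}

# Illustrative subset of exception classes and fault status codes.
EC_NAMES = {0x24: "DABT (lower EL)", 0x25: "DABT (current EL)"}
DFSC_NAMES = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
}


def decode_esr(esr):
    """Split an ESR_ELx value into its EC, IL and ISS fields."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    return ec, il, iss


def decode_data_abort_iss(iss):
    """Return (is_write, dfsc) for a data abort ISS."""
    is_write = bool((iss >> 6) & 0x1)
    dfsc = iss & 0x3F
    return is_write, dfsc


def stack_containing(addr):
    """Name of the stack range that holds addr, or None if none does."""
    for name, (low, high) in STACKS.items():
        if low <= addr < high:
            return name
    return None


if __name__ == "__main__":
    ec, il, iss = decode_esr(ESR)
    is_write, dfsc = decode_data_abort_iss(iss)
    print("EC:", hex(ec), EC_NAMES.get(ec, "unknown"))
    print("access:", "write" if is_write else "read")
    print("DFSC:", hex(dfsc), DFSC_NAMES.get(dfsc, "unknown"))
    print("SP in stack:", stack_containing(SP))
```

Under those assumptions, the ESR decodes to a write data abort at the current EL, and the reported sp falls outside all three stack ranges the kernel printed. That matches the "Insufficient stack space" message, but on its own it says nothing about whether the cause is hardware or software.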

Other errors that pop up include unhandled level 0/3/11 translation faults (usually in a random process), a NULL dereference in the kernel, and a bad PC value in the panic trace (!). Here’s a trace from a successful boot followed by a panic.

These all seem to me like issues caused by RAM errors, which sucks because the 820c is not up on Arrow right now (404 since at least this morning) and I can’t order a replacement.

I guess I’m looking for validation on whether it is indeed bad RAM. If not, maybe there is something else going on? Thanks for your input.


#2

Hey,

It is not a hardware issue but a software one. We pushed a fix/workaround in the afternoon; Jenkins should pick it up soon and make a new build…


#3

Thanks for the quick fix! v185 resolved the issue and the board now boots normally.