AOSP display freeze: *ERROR* too many planes! cnt=7, start stage=2

@jstultz : Since I suspect that this may be tied to drm_hwcomposer, you would probably be the right one to ask about this.

I’m getting a situation where the display freezes up. Sometimes it recovers, and sometimes not. The system seems to continue functioning, but nothing is updated on the screen. When this happens, I get a bunch of messages like this;

[   76.351693] msm 900000.mdss: [drm:mdp5_crtc_atomic_check [msm]] *ERROR* too many planes! cnt=7, start stage=2
[   76.367321] msm 900000.mdss: [drm:mdp5_crtc_atomic_check [msm]] *ERROR* too many planes! cnt=7, start stage=2
[   76.384007] msm 900000.mdss: [drm:mdp5_crtc_atomic_check [msm]] *ERROR* too many planes! cnt=7, start stage=2
[   76.400598] msm 900000.mdss: [drm:mdp5_crtc_atomic_check [msm]] *ERROR* too many planes! cnt=7, start stage=2

When I get just 1 or 2 of those error lines, it usually recovers. If I get a lot more, say 50, it doesn’t recover.

Usually happens when switching things around in the UI, like changing programs, opening notification shade, popup windows, etc., and I think it happens more frequently when there is a mouse attached (cursor display) as compared to a touchscreen. Presumably the cursor takes up a plane?

Have you seen anything like this before? Any idea where I should start looking?

Some more info… this just happened in the logcat;

06-03 16:09:08.201   555   578 I ActivityManager: Done dumping
06-03 16:30:46.564     0     0 I sysrq   : Show Blocked State
06-03 16:30:46.564     0     0 I         : task                        PC stack   pid father
06-03 16:30:46.567     0     0 I kworker/u8: 4    D    0   195      2 0x00000028
06-03 16:30:46.572     0     0 I Workqueue: msm retire_worker [msm]
06-03 16:30:46.578     0     0 W Call trace:  
06-03 16:30:46.582     0     0 W         : __switch_to+0xe4/0x148
06-03 16:30:46.584     0     0 W         : __schedule+0x344/0x750
06-03 16:30:46.588     0     0 W         : schedule+0x40/0xe0
06-03 16:30:46.591     0     0 W         : schedule_preempt_disabled+0x20/0x38
06-03 16:30:46.594     0     0 W         : __mutex_lock.isra.0+0x17c/0x5b8
06-03 16:30:46.599     0     0 W         : __mutex_lock_slowpath+0x10/0x18
06-03 16:30:46.604     0     0 W         : mutex_lock+0x34/0x50
06-03 16:30:46.608     0     0 W         : retire_worker+0x60/0x80 [msm]
06-03 16:30:46.611     0     0 W         : process_one_work+0x198/0x320
06-03 16:30:46.615     0     0 W         : worker_thread+0x48/0x420
06-03 16:30:46.619     0     0 W         : kthread+0xf0/0x120
06-03 16:30:46.623     0     0 W         : ret_from_fork+0x10/0x18
06-03 16:30:46.626     0     0 I HwBinder: 387_1  D    0   412      1 0x04000001
06-03 16:30:46.630     0     0 W Call trace:  
06-03 16:30:46.635     0     0 W         : __switch_to+0xe4/0x148
06-03 16:30:46.637     0     0 W         : __schedule+0x344/0x750
06-03 16:30:46.641     0     0 W         : schedule+0x40/0xe0
06-03 16:30:46.644     0     0 W         : schedule_preempt_disabled+0x20/0x38
06-03 16:30:46.647     0     0 W         : __mutex_lock.isra.0+0x17c/0x5b8
06-03 16:30:46.652     0     0 W         : __mutex_lock_slowpath+0x10/0x18
06-03 16:30:46.656     0     0 W         : mutex_lock+0x34/0x50
06-03 16:30:46.661     0     0 W         : msm_gem_new_impl.isra.0+0x104/0x150 [msm]
06-03 16:30:46.664     0     0 W         : _msm_gem_new+0xc4/0x1a8 [msm]
06-03 16:30:46.669     0     0 W         : msm_gem_new_handle+0x2c/0x80 [msm]
06-03 16:30:46.673     0     0 W         : msm_ioctl_gem_new+0x2c/0x50 [msm]
06-03 16:30:46.677     0     0 W         : drm_ioctl_kernel+0xb8/0x108
06-03 16:30:46.682     0     0 W         : drm_ioctl+0x28c/0x4f0
06-03 16:30:46.686     0     0 W         : do_vfs_ioctl+0x98c/0xb60
06-03 16:30:46.689     0     0 W         : ksys_ioctl+0x78/0xa8
06-03 16:30:46.693     0     0 W         : __arm64_sys_ioctl+0x1c/0x28
06-03 16:30:46.696     0     0 W         : el0_svc_common.constprop.0+0x68/0x160
06-03 16:30:46.700     0     0 W         : el0_svc_handler+0x20/0x80
06-03 16:30:46.705     0     0 W         : el0_svc+0x8/0xc
06-03 16:30:46.709     0     0 I sysrq   : Show backtrace of all active CPUs
06-03 16:30:46.712     0     0 I sysrq   : CPU1:
06-03 16:09:13.358   555  1597 I DropBoxManagerService: add tag=system_server_watchdog isTagEnabled=true flags=0x2
06-03 16:09:13.381   555   578 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on display thread (android.display), Blocked in handler on animation thread (android.anim)
06-03 16:09:13.382   555   578 W Watchdog: android.fg annotated stack trace:
06-03 16:09:13.382   555   578 W Watchdog:     at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:17666)
06-03 16:09:13.382   555   578 W Watchdog:     - waiting to lock <0x0cd2e606> (a com.android.server.am.ActivityManagerService)
06-03 16:09:13.382   555   578 W Watchdog:     at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:239)
06-03 16:09:13.382   555   578 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:934)
06-03 16:09:13.383   555   578 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:99)
06-03 16:09:13.383   555   578 W Watchdog:     at android.os.Looper.loop(Looper.java:216)
06-03 16:09:13.383   555   578 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:67)
06-03 16:09:13.383   555   578 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:44)
06-03 16:09:13.386   555   578 W Watchdog: android.display annotated stack trace:
06-03 16:09:13.386   555   578 W Watchdog:     at com.android.server.wm.WindowManagerService$H.handleMessage(WindowManagerService.java:4901)
06-03 16:09:13.386   555   578 W Watchdog:     - waiting to lock <0x0952298c> (a com.android.server.wm.WindowManagerGlobalLock)
06-03 16:09:13.386   555   578 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:106)
06-03 16:09:13.386   555   578 W Watchdog:     at android.os.Looper.loop(Looper.java:216)
06-03 16:09:13.386   555   578 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:67)
06-03 16:09:13.386   555   578 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:44)
06-03 16:09:13.388   555   578 W Watchdog: android.anim annotated stack trace:
06-03 16:09:13.388   555   578 W Watchdog:     at android.view.Surface.nativeLockCanvas(Native Method)
06-03 16:09:13.389   555   578 W Watchdog:     at android.view.Surface.lockCanvas(Surface.java:373)
06-03 16:09:13.390   555   578 W Watchdog:     - locked <0x0eec88d5> (a java.lang.Object)
06-03 16:09:13.390   555   578 W Watchdog:     at android.view.ViewRootImpl.drawSoftware(ViewRootImpl.java:3834)
06-03 16:09:13.390   555   578 W Watchdog:     at android.view.ViewRootImpl.draw(ViewRootImpl.java:3795)
06-03 16:09:13.390   555   578 W Watchdog:     at android.view.ViewRootImpl.performDraw(ViewRootImpl.java:3495)
06-03 16:09:13.391   555   578 W Watchdog:     at android.view.ViewRootImpl.performTraversals(ViewRootImpl.java:2779)
06-03 16:09:13.391   555   578 W Watchdog:     at android.view.ViewRootImpl.doTraversal(ViewRootImpl.java:1745)
06-03 16:09:13.391   555   578 W Watchdog:     at android.view.ViewRootImpl$TraversalRunnable.run(ViewRootImpl.java:7768)
06-03 16:09:13.391   555   578 W Watchdog:     at android.view.Choreographer$CallbackRecord.run(Choreographer.java:970)
06-03 16:09:13.391   555   578 W Watchdog:     at android.view.Choreographer.doCallbacks(Choreographer.java:794)
06-03 16:09:13.391   555   578 W Watchdog:     at android.view.Choreographer.doFrame(Choreographer.java:729)
06-03 16:09:13.391   555   578 W Watchdog:     at android.view.Choreographer$FrameDisplayEventReceiver.run(Choreographer.java:955)
06-03 16:09:13.391   555   578 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:934)
06-03 16:09:13.391   555   578 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:99)
06-03 16:09:13.392   555   578 W Watchdog:     at android.os.Looper.loop(Looper.java:216)
06-03 16:09:13.392   555   578 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:67)
06-03 16:09:13.392   555   578 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:44)
06-03 16:09:13.392   555   578 W Watchdog: *** GOODBYE!
06-03 16:30:46.716     0     0 W Call trace:  
06-03 16:30:46.719     0     0 W         : dump_backtrace+0x0/0x168
06-03 16:30:46.721     0     0 W         : show_stack+0x14/0x20
06-03 16:30:46.725     0     0 W         : showacpu+0x68/0x78
06-03 16:30:46.728     0     0 W         : flush_smp_call_function_queue+0xa0/0x150
06-03 16:30:46.731     0     0 W         : generic_smp_call_function_single_interrupt+0x10/0x18
06-03 16:30:46.736     0     0 W         : handle_IPI+0x10c/0x190
06-03 16:30:46.742     0     0 W         : gic_handle_irq+0x144/0x148
06-03 16:30:46.746     0     0 W         : el1_irq+0xf0/0x1c0
06-03 16:30:46.750     0     0 W         : _raw_spin_unlock_irq+0x10/0x48
06-03 16:30:46.753     0     0 W         : __schedule+0x348/0x750
06-03 16:30:46.757     0     0 W         : schedule+0x40/0xe0
06-03 16:30:46.760     0     0 W         : futex_wait_queue_me+0xbc/0x138
06-03 16:30:46.763     0     0 W         : futex_wait+0xe0/0x208
06-03 16:30:46.768     0     0 W         : do_futex+0x18c/0xd48
06-03 16:30:46.771     0     0 W         : __arm64_sys_futex+0x110/0x150
06-03 16:30:46.775     0     0 W         : el0_svc_common.constprop.0+0x68/0x160
06-03 16:30:46.779     0     0 W         : el0_svc_handler+0x20/0x80
06-03 16:30:46.783     0     0 W         : el0_svc+0x8/0xc
06-03 16:09:13.685   283   283 I lowmemorykiller: lmkd data connection dropped
06-03 16:09:13.685   283   283 I lowmemorykiller: closing lmkd data connection
06-03 16:09:13.700   367   367 I chatty  : uid=0(root) expire 3 lines
06-03 16:30:47.058     0     0 I binder  : release 555:570 transaction 28221 out, still active
06-03 16:30:47.058     0     0 I binder  : release 555:570 transaction 28215 in, still active
06-03 16:30:47.064     0     0 I binder  : undelivered TRANSACTION_COMPLETE
06-03 16:30:47.070     0     0 I binder  : release 555:576 transaction 27447 out, still active
06-03 16:30:47.075     0     0 I binder  : undelivered TRANSACTION_COMPLETE
06-03 16:30:47.081     0     0 I binder  : release 555:909 transaction 29822 in, still active
06-03 16:30:47.086     0     0 I binder  : send failed reply for transaction 29822 to 765:1093
06-03 16:09:13.738   765  1093 E AndroidRuntime: FATAL EXCEPTION: AsyncTask #1
06-03 16:09:13.738   765  1093 E AndroidRuntime: Process: com.android.systemui, PID: 765
06-03 16:09:13.738   765  1093 E AndroidRuntime: DeadSystemException: The system died; earlier logs will point to the root cause
06-03 16:30:47.093     0     0 I binder  : undelivered death notification, b400e95a5a9132a0

It looks like something jamming up with the mutex at these two points…?

@robclark any ideas about this call trace? AOSP display freeze: *ERROR* too many planes! cnt=7, start stage=2

Probably that “too many planes” should only be printed when debug prints are enabled. It isn’t really a kernel error. It is the kernel telling you that userspace asked for a configuration with too many hw overlays, and the atomic-check ioctl properly returns an error. Userspace should fall back to a config with fewer overlays and using the gpu to composite the rest.

There are various conditions, some transient, where fewer planes may be available:

  1. For 4k (or iirc, anything wider than 2k) multiple hwpipes (planes) are needed per layer
  2. Sometimes transiently a hwpipe may not be available until after the next vblank (ie. waiting for it to be released from it’s previous use)

Looks to me like a userspace problem.

I kind of understand the too many planes not being a kernel error. The call trace though, does come from the kernel. It to me looks like synchronization hell, and I wonder if it is an error that might be leading to the too many planes message?

not sure what is up with that (and the message board makes it annoying to even copy/paste the contents of it into something I can actually read)… not sure if it is related to the too-many-planes (or even what kernel version you are looking at)… you could try enabling lockdep

Kernel version is 5.4.20 with just a few (77) commits added on top of it:

The first several commits were needed just to get the board to boot, and most of the rest are related to a custom sound card. Only this commit should have any impact on the gpu: https://gitlab.com/aosp-automotive/dragonboard-kernel-src/-/commit/ed17921de9a10efca2631dd80abe44631dac8ef9

I’ve copied the call trace to a gitlab snipped, if that makes it any easier?

I’ll give lockdep a try as soon as I figure out how to work it. Thanks!

A bit more on this;

I think the kernel crash was a once in a lifetime event. I haven’t even been able to reproduce it. The number of planes error is definitely tied to the problem of the display locking up though.

I’ve found 2 things that make an impact on the occurrence of the display lockup;

What I believe happens in drm_hwcomposer under normal circumstances, is when the kernel says its hit the maximum number of planes, it switches additional planes over to client compositing. If that happens, it will probably lock up. The two changes I found causes it to switch to client compositing sooner, and it locks up more easily. So there is something bad happening with client compositing on this hardware.

@jstultz : I think you are an expert on drm_hwcomposer, since you’re listed as an owner: https://android.googlesource.com/platform/external/drm_hwcomposer/+/refs/heads/master/OWNERS

And I wonder if it could be related to this bug? https://bugs.96boards.org/show_bug.cgi?id=759