You are doing better than me! I have installed Snapshot 678 and do not get a flicker from the monitor, even though the files /sys/class/drm/card0-HDMI-A-1/* show the expected content and there is nothing unexpected in /var/log/Xorg.0.log. I have not had any output from the monitor since my thread “Blank monitor after software update” but, recently, I have been investigating with kernel 5.8.0-rc2-00481-gc2d7a8db1c3a from working/qualcomm/kernel.git, the Qualcomm Landing Team kernel.
I have suspected the power management code since the thread above and have now got some evidence. I am fairly certain that the problem you are seeing is gpu_gx being powered off. When you get a console attached you will probably see “gpu_gx status stuck at ‘off’”. That message comes from a patch from Bjorn Andersson, commit f02fba3aa8feeee0a9f9c82c6db2ae9dda7825cd.
There seems to have been some thrashing about with the GDSC code recently. The issue appears to arise from the fact that you cannot associate more than one power domain with a single device. The two commits a25c0a854bd2015c9a402e9567e5a89eb90986d9 and 90361c81f7bae1e152916e1e77a5a89cbc10e7d4 have precisely the same commit messages, including date and time in 2016, but subtly different patches. The second includes the declaration ‘static struct gdsc gpu_gdsc;’, which puts the gdsc structs for gpu and gpu_gx in each other's scope. That enables the commit 90a3691e0bd907daae23bb22850d4f4f4bfefa50 to create a power domain dependency graph consisting of a two-element cycle of gpu and gpu_gx. When I saw that, I thought: that is going to lead to grief.
If you do:
$ git log -p --follow ./drivers/clk/qcom/mmcc-msm8996.c
the commit 90361c81f7bae1e152916e1e77a5a89cbc10e7d4 is the most recent, immediately preceded by 90a3691e0bd907daae23bb22850d4f4f4bfefa50.
As far as I can see, drivers/base/power/domain.c:genpd_power_on traverses the graph by recursion, i.e. it presumes the graph is a tree. Although there could be a depth limit, it is not enough, and you end up with this:
[ 247.008584] INFO: task kworker/0:1:12 blocked for more than 120 seconds.
[ 247.012334] Not tainted 5.8.0-rc2-00481-gc2d7a8db1c3a-dirty #46
[ 247.019015] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 247.025365] kworker/0:1 D 0 12 2 0x00000028
[ 247.033167] Workqueue: events deferred_probe_work_func
[ 247.038538] Call trace:
[ 247.043644] __switch_to+0xf0/0x158
[ 247.045995] __schedule+0x33c/0x800
[ 247.049467] schedule+0x74/0x100
[ 247.052936] schedule_preempt_disabled+0x20/0x38
[ 247.056419] __mutex_lock+0x334/0x8f0
[ 247.061014] mutex_lock_nested+0x30/0x58
[ 247.064577] genpd_lock_nested_mtx+0x10/0x18
[ 247.068571] genpd_power_on.part.0+0x6c/0x188
[ 247.072824] genpd_power_on.part.0+0x80/0x188
[ 247.077078] __genpd_dev_pm_attach+0xf8/0x240
[ 247.081418] genpd_dev_pm_attach+0x58/0x68
[ 247.085759] dev_pm_domain_attach+0x1c/0x30
[ 247.089750] platform_drv_probe+0x38/0xa0
[ 247.093828] really_probe+0xd8/0x438
[ 247.097993] driver_probe_device+0x64/0x158
[ 247.101642] __device_attach_driver+0x88/0x108
[ 247.105552] bus_for_each_drv+0x74/0xc0
[ 247.110059] __device_attach+0xdc/0x160
[ 247.113793] device_initial_probe+0x10/0x18
[ 247.117613] bus_probe_device+0x98/0xa0
[ 247.121778] deferred_probe_work_func+0x88/0xd8
[ 247.125605] process_one_work+0x288/0x6e0
[ 247.130112] worker_thread+0x1f0/0x418
[ 247.134279] kthread+0x140/0x160
[ 247.137920] ret_from_fork+0x10/0x18
[ 247.141327] INFO: lockdep is turned off.
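To illustrate why a recursive walk that assumes a tree gets into trouble on a two-element cycle, here is a minimal toy model in C. It is not the kernel's genpd code: struct toy_domain and toy_power_on are hypothetical names, the "lock" is just a flag, and where a real mutex would block (as the nested genpd_power_on frames in the trace suggest) the sketch returns an error instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a power domain; NOT the kernel's struct generic_pm_domain. */
struct toy_domain {
    const char *name;
    struct toy_domain *parent; /* NULL for a root domain */
    bool locked;               /* models the per-domain mutex being held */
    bool powered;
};

/*
 * Power on a domain after powering on its parent, holding the domain's
 * "lock" for the duration, the way a recursive tree walk would.
 * Returns 0 on success, or -1 if the walk reaches a domain whose lock is
 * already held further up the recursion. With a real mutex that second
 * lock attempt would simply block forever.
 */
int toy_power_on(struct toy_domain *d)
{
    int ret = 0;

    assert(d != NULL);
    if (d->locked)
        return -1;              /* cycle: this lock is already held */
    d->locked = true;
    if (d->parent != NULL)
        ret = toy_power_on(d->parent);
    if (ret == 0)
        d->powered = true;      /* only power on once the parent is up */
    d->locked = false;
    return ret;
}
```

With gpu_gx as the sole parent of gpu, toy_power_on on gpu returns 0 and both domains end up powered. If gpu_gx is also given gpu as its parent, the same call returns -1 and neither domain is powered.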
So, one of the dependencies has to go, to break the cycle. I have had most success with gpu_gx being the parent of gpu, i.e. keeping “.parent = &gpu_gx_gdsc.pd” and setting “power-domains = <&mmcc GPU_GDSC>;” in the device tree nodes gpu@b00000 and iommu@b40000, the adreno smmu. I seem to get communication with user space through drm ioctls from different pids.
The other way round I get:
[ 35.074362] gpu_gx: gdsc_check_status 0 00004024 00282001
[ 35.075114] [drm:mdp5_crtc_err_irq] crtc-0: error: 40000000
[ 35.075124] [drm:mdp5_irq_error_handler] *ERROR* errors: 40000000
Note, drivers/gpu/drm/msm/disp/mdp5/mdp5.xml.h contains #define MDP5_IRQ_INTF3_UNDER_RUN 0x40000000, but I do not understand the significance of that.
The first line is from a pr_debug I inserted in drivers/clk/qcom/gdsc.c:gdsc_check_status:
pr_debug("%s: %s %d %08x %08x\n", sc->pd.name, __func__, ret, reg, val);
Bit 31 is the power-on bit, and it is clear in the last value printed (00282001), so gpu_gx is off at this point. That is also the last call to gdsc_check_status.
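As a quick check of that reading, the bit test can be sketched in standalone C. The macro and function names are hypothetical, and the only assumptions are the ones stated in the post: bit 31 is the power-on bit, and the last field of the debug line is the status value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption taken from the post: bit 31 of the status word is the power-on bit. */
#define TOY_GDSC_PWR_ON (UINT32_C(1) << 31)

static_assert(TOY_GDSC_PWR_ON == UINT32_C(0x80000000), "bit 31 mask");

/* Returns true if the power-on bit is set in a raw status value. */
bool toy_gdsc_is_on(uint32_t status)
{
    return (status & TOY_GDSC_PWR_ON) != 0;
}
```

Calling toy_gdsc_is_on(0x00282001) returns false, which matches gpu_gx being reported as off.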
However, there are other things going on. I still have nothing on the monitor, and there is probably a race involved too, as I get different behaviour on each boot.