DragonBoard 820c doesn't come up with build 683 alip

ljking · June 26, 2020, 11:41am

I am trying to get my old 820c board back up and running with the latest kernel. I reloaded the partition table, the boot loaders, the boot and rootfs with the latest from the snapshots directories for the 820c. I didn’t re-provision the UFS.

The board mostly comes up, but then fails to bring up the desktop. Sometimes when I boot, the desktop is visible on the HDMI monitor for a few seconds, then the desktop goes black. The HDMI monitor is correctly synced, the mouse pointer is visible and responsive to mouse movement, but the desktop is black.

Unfortunately I have not yet set up a serial monitor on the console so I don’t have any logs (yet).

Has anyone else successfully brought up the ALIP build from here: Linaro Snapshots? I did try a few slightly older builds (680, 670 and 660), but the results were the same.

doitright · June 26, 2020, 11:56am

Hmm, there could be some hairy edges. I haven’t run a Debian build (I use Android) since some early debugging I was doing at around builds 400-415, which were still on kernel 4.14 (which often doesn’t boot successfully due to a bug that was patched upstream for 5.x). I recall that the early 5.4 builds didn’t work well since they hadn’t brought in several patches required for db820c, but it was my understanding that some time in the early 5.7’s they had it working. The switch to 5.7 (rc) was at build 572.

kldixon · June 27, 2020, 3:03pm

You are doing better than me! I have installed Snapshot 678 and get not a flicker from the monitor even though the files /sys/class/drm/card0-HDMI-A-1/* show the expected content and there is nothing unexpected in /var/log/Xorg.0.log. I have not had any output from the monitor since my thread Blank monitor after software update but, recently, I have been making some investigations with kernel 5.8.0-rc2-00481-gc2d7a8db1c3a from working/qualcomm/kernel.git - Qualcomm Landing Team kernel.

I have suspected the power management code since the thread above and have now got some evidence. I am fairly certain that the problem you are seeing is gpu_gx being powered off. When you get a console attached you will probably see “gpu_gx status stuck at ‘off’”. That message comes from a patch from Bjorn Andersson, commit f02fba3aa8feeee0a9f9c82c6db2ae9dda7825cd.

There seems to have been some thrashing about with the GDSC code recently. The issue appears to arise from the fact that you cannot associate more than one power domain with a single device. The two commits a25c0a854bd2015c9a402e9567e5a89eb90986d9 and 90361c81f7bae1e152916e1e77a5a89cbc10e7d4 have precisely the same commit messages, including date and time in 2016, but subtly different patches. The second includes the declaration ‘static struct gdsc gpu_gdsc;’ which allows the gdsc structs for gpu and gpu_gx to be in each other’s scope, enabling the commit 90a3691e0bd907daae23bb22850d4f4f4bfefa50 to create a power domain dependency graph consisting of a two element cycle of gpu and gpu_gx. When I saw that I thought that is going to lead to grief.
If you do:

$ git log -p --follow ./drivers/clk/qcom/mmcc-msm8996.c

the commit 90361c81f7bae1e152916e1e77a5a89cbc10e7d4 is the most recent, immediately preceded by 90a3691e0bd907daae23bb22850d4f4f4bfefa50.
As far as I can see, drivers/base/power/domain.c:genpd_power_on traverses the graph by recursion, i.e. it presumes the graph is a tree. Although there could be a depth limit, it is not enough, and you end up with this:

[  247.008584] INFO: task kworker/0:1:12 blocked for more than 120 seconds.
[  247.012334]       Not tainted 5.8.0-rc2-00481-gc2d7a8db1c3a-dirty #46
[  247.019015] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.025365] kworker/0:1     D    0    12      2 0x00000028
[  247.033167] Workqueue: events deferred_probe_work_func
[  247.038538] Call trace:
[  247.043644]  __switch_to+0xf0/0x158
[  247.045995]  __schedule+0x33c/0x800
[  247.049467]  schedule+0x74/0x100
[  247.052936]  schedule_preempt_disabled+0x20/0x38
[  247.056419]  __mutex_lock+0x334/0x8f0
[  247.061014]  mutex_lock_nested+0x30/0x58
[  247.064577]  genpd_lock_nested_mtx+0x10/0x18
[  247.068571]  genpd_power_on.part.0+0x6c/0x188
[  247.072824]  genpd_power_on.part.0+0x80/0x188
[  247.077078]  __genpd_dev_pm_attach+0xf8/0x240
[  247.081418]  genpd_dev_pm_attach+0x58/0x68
[  247.085759]  dev_pm_domain_attach+0x1c/0x30
[  247.089750]  platform_drv_probe+0x38/0xa0
[  247.093828]  really_probe+0xd8/0x438
[  247.097993]  driver_probe_device+0x64/0x158
[  247.101642]  __device_attach_driver+0x88/0x108
[  247.105552]  bus_for_each_drv+0x74/0xc0
[  247.110059]  __device_attach+0xdc/0x160
[  247.113793]  device_initial_probe+0x10/0x18
[  247.117613]  bus_probe_device+0x98/0xa0
[  247.121778]  deferred_probe_work_func+0x88/0xd8
[  247.125605]  process_one_work+0x288/0x6e0
[  247.130112]  worker_thread+0x1f0/0x418
[  247.134279]  kthread+0x140/0x160
[  247.137920]  ret_from_fork+0x10/0x18
[  247.141327] INFO: lockdep is turned off.
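
To make the two element cycle described above concrete, here is a minimal Python sketch that walks parent links and reports a cycle. It is only an illustration, not the kernel's genpd code: the dictionary of parents, the function name find_cycle and the assumption that each domain has a single parent are all hypothetical.

```python
# Hypothetical model of power-domain parent links; this is NOT the
# kernel's genpd data structure, just a dict of name -> parent name.
def find_cycle(parents, start):
    """Follow parent links from `start`; return the cycle as a list, or None."""
    path = []
    seen = set()
    node = start
    while node is not None:
        if node in seen:
            # We came back to a domain already on the path: that is a cycle.
            return path[path.index(node):] + [node]
        seen.add(node)
        path.append(node)
        node = parents.get(node)
    return None


# gpu and gpu_gx each name the other as parent (the two element cycle).
cyclic = {"gpu": "gpu_gx", "gpu_gx": "gpu"}
# Only one dependency kept: gpu_gx is the parent of gpu.
acyclic = {"gpu": "gpu_gx", "gpu_gx": None}

print(find_cycle(cyclic, "gpu"))   # ['gpu', 'gpu_gx', 'gpu']
print(find_cycle(acyclic, "gpu"))  # None
```

A recursive walk that assumes a tree never hits the "already seen" check, which is the situation the trace above appears to show.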

So, one of the dependencies has to go, to break the cycle. I have had most success with gpu_gx being the parent of gpu, i.e. keeping “.parent = &gpu_gx_gdsc.pd” and setting “power-domains = <&mmcc GPU_GDSC>;” in the device tree nodes gpu@b00000 and iommu@b40000, the adreno smmu. I seem to get communication with user space through drm ioctls from different pids.
The other way round I get:

[   35.074362] gpu_gx: gdsc_check_status 0 00004024 00282001
[   35.075114] [drm:mdp5_crtc_err_irq] crtc-0: error: 40000000
[   35.075124] [drm:mdp5_irq_error_handler] *ERROR* errors: 40000000

Note, drivers/gpu/drm/msm/disp/mdp5/mdp5.xml.h contains #define MDP5_IRQ_INTF3_UNDER_RUN 0x40000000, but I do not understand the significance of that.
The first line is from a pr_debug I inserted in drivers/clk/qcom/gdsc.c:gdsc_check_status:

	pr_debug("%s: %s %d %08x %08x\n", sc->pd.name, __func__, ret, reg, val);

Bit 31 is the power on bit so gpu_gx is off at this point and that is the last call to gdsc_check_status.
However, there are other things going on. I still have nothing on the monitor and there is probably a race involved too as I get different behaviour on each boot.
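
For anyone who wants to check the register value from the pr_debug line by hand, here is a small Python sketch. It assumes, going by the format string above, that the last three fields of "0 00004024 00282001" are ret, reg and val in that order; the names PWR_ON_BIT and gdsc_is_on are made up for the example.

```python
# Bit 31 is described above as the power-on bit of the GDSC register value.
PWR_ON_BIT = 1 << 31  # hypothetical name, chosen for this sketch

# Value quoted above from drivers/gpu/drm/msm/disp/mdp5/mdp5.xml.h
MDP5_IRQ_INTF3_UNDER_RUN = 0x40000000


def gdsc_is_on(val):
    """Return True if bit 31 is set in the GDSC register value."""
    return bool(val & PWR_ON_BIT)


# Fields of the debug line, assumed to be ret, reg, val in that order.
ret, reg, val = 0, int("00004024", 16), int("00282001", 16)
print(gdsc_is_on(val))               # False: bit 31 is clear, gpu_gx is off
print(gdsc_is_on(val | PWR_ON_BIT))  # True once bit 31 is set

# The mdp5 error mask from the log matches the INTF3 underrun bit.
errors = 0x40000000
print(bool(errors & MDP5_IRQ_INTF3_UNDER_RUN))  # True
```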

doitright · June 27, 2020, 9:36pm

Just so we don’t derail this thread too much…
Since yours hasn’t lit up since kernel 4.11 a couple of years ago, there is a very big chance that you are experiencing a hardware failure.

Also, the gpu_gx stuck off error happens even with a working display, nothing to worry about by itself.

This lights the display up reliably and makes for a really good starting point:

… It also gives the gpu_gx off error 100% of boots.

kldixon · June 28, 2020, 10:10am

I found my old build of kernel 4.11.12-30705-g0e82eeffbc29, booted it with ‘fastboot boot’, and voila, I have boot messages followed by desktop on my monitor.
I have one of the earlier boards, with a UFS chip. I reported the markings here: USB ports stopped working - #14 by kldixon.
It is entirely possible that early revision, or even defective, silicon was placed on these boards, silicon that was always only ever going to work with kernel 4.11. If so, I doubt anyone is going to admit to it.
I will try to wade through the commits you have included but I am not very familiar with GitLab.

doitright · June 28, 2020, 11:49am

That’s interesting. I guess a compatibility problem could have crept in with regards to older boards. You should probably start a thread for your board though, as the differences you are trying to deal with are certainly a different set of problems.

kldixon · June 28, 2020, 1:09pm

I think the OP also has an old board.

doitright · June 28, 2020, 1:40pm

Yet his lights up. I seriously don’t see anything to link the issues you are experiencing with this thread. I’d advise that you cease to derail it. Start your own thread for your own issue.

ljking · June 28, 2020, 2:45pm

Yes, I do have an old board; it was one of the first batch of 500 built. It has UFS, not eMMC, for storage. But it isn’t one of the ‘P1’ boards, it is a P2 revision. I also have a very large heatsink bolted to the bottom of the board and thermally connected to the area near the CPU and PMIC.

Last night I managed to connect a serial console. From the serial console everything looks fine, so I connected to wifi with nmcli, and then ran sudo apt-get multiple times (update, upgrade, autoremove and full-upgrade). The upgrade brought in over 200 MB of changes, and all applied successfully.

Now the board boots, the GUI is present and seems to be stable, although I haven’t tried anything difficult yet. Performance looks OK. I monitored the core frequencies with

watch -n1 "sudo cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq"

Under heavy load from stress-ng the big cores stay at 1 GHz and the little cores at 0.5 GHz, which is expected.
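
If a labelled readout is easier to follow than raw cat output, the same sysfs files can be read with a short Python script. This is only a sketch: the function name read_cur_freqs and its base parameter are made up for the example, and it relies on cpufreq reporting values in kHz. Like the watch command, it normally needs root to read cpuinfo_cur_freq; files it cannot read are skipped.

```python
import glob
import os


def read_cur_freqs(base="/sys/devices/system/cpu"):
    """Return {cpu_name: frequency in kHz} for each readable cpuinfo_cur_freq."""
    freqs = {}
    pattern = os.path.join(base, "cpu[0-9]*", "cpufreq", "cpuinfo_cur_freq")
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]  # e.g. "cpu0"
        try:
            with open(path) as f:
                freqs[cpu] = int(f.read().strip())
        except (OSError, ValueError):
            # Unreadable (not root) or unexpected content: skip this core.
            continue
    return freqs


if __name__ == "__main__":
    for cpu, khz in read_cur_freqs().items():
        print(f"{cpu}: {khz / 1e6:.2f} GHz")
```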

doitright · June 28, 2020, 3:07pm

I’m actually working with someone who has one of the newest boards, delivered from Arrow yesterday, and his still has a UFS, just a different brand.

And I have a couple from the previous run, also UFS.

Pretty sure that they’ve never equipped a db820c with eMMC.

ljking · June 30, 2020, 9:42pm

You could be right about all of the boards having UFS memory. The board layout was dual footprinted for eMMC or UFS. We went with UFS because the secondary boot loaders were only tested with UFS. eMMC would have saved a lot of money bringing the board cost down. But marketing wanted to show the best possible performance. Memory prices change wildly day to day so for all I know UFS is cheaper this week.

samit · October 29, 2020, 9:16am

It requires a patch in DT file msm8996.dtsi in qcomlt-5.7 branch.