DB410c freezes and requires a cold reset

NiB · April 23, 2020, 7:01pm

Dear All,

I am using the DragonBoard 410c with Linaro release 19.01.
Unfortunately I cannot provide a specific scenario to reproduce the problem,
because it happens only once every several hours and I am not yet sure what is causing it.
I can say that I am using the WiFi chip and switching between AP and device modes
very often.
The bottom line is that once it occurs I have to do a cold reset (completely remove power).

From ‘syslog’ it is very hard to tell which module actually causes the kernel to freeze (‘Caught , dumped core as pid 11628.’).
However, there are visible issues related to memory allocation (most of them refer to wcn36xx).
For example: trying to free a nonexistent vm area (‘Trying to vfree() nonexistent vm area (ffff000013273000)’) and failing to allocate contiguous memory (‘cma: cma_alloc: alloc failed’, ‘wcn36xx: ERROR Failed to allocate BD mempool’, ‘wcn36xx: ERROR Failed to alloc DXE mempool: -12’).
Eventually a SEGV signal is caught and the system freezes.
The SEGV signal may be the result of an illegal address access.

Can someone explain how I can upload a text file (e.g. the syslog)? Only images seem to be accepted.
It looks like there is no option for that.

Please advise,
Nissim

Loic · April 24, 2020, 7:07am

Could you try with the latest release? https://releases.linaro.org/96boards/dragonboard410c/linaro/debian/latest/

You can use e.g. pastebin.com to copy the log and attach the link here.

NiB · May 5, 2020, 7:08pm

Sorry for the late response, but I ran the DragonBoard in a different use case in order to provide more information. Moving to the latest version is not an option at the moment.
We are using the Desktop image of the 19.01 release as follows:

an HDMI display is connected to the DB
our process consumes a lot of physical RAM (only 40 MB are left for the entire system)
we are using the WiFi

The DB works fine until it freezes.

The kernel warns about problems with contiguous memory allocation:
May 5 06:31:20 linaro-alip kernel: [37941.593143] alloc_contig_range: 1782 callbacks suppressed
May 5 06:31:20 linaro-alip kernel: [37941.593152] alloc_contig_range: [bee00, bee04) PFNs busy
May 5 06:31:20 linaro-alip kernel: [37941.598097] alloc_contig_range: [bee04, bee08) PFNs busy
May 5 06:31:20 linaro-alip kernel: [37941.603393] alloc_contig_range: [bee08, bee0c) PFNs busy
May 5 06:31:20 linaro-alip kernel: [37941.608502] alloc_contig_range: [bee0c, bee10) PFNs busy
May 5 06:31:20 linaro-alip kernel: [37941.613889] alloc_contig_range: [bee10, bee14) PFNs busy
May 5 06:31:20 linaro-alip kernel: [37941.619025] alloc_contig_range: [bee14, bee18) PFNs busy
May 5 06:31:20 linaro-alip kernel: [37941.624555] alloc_contig_range: [bee18, bee1c) PFNs busy
May 5 06:31:20 linaro-alip kernel: [37941.629705] alloc_contig_range: [bee1c, bee20) PFNs busy
May 5 06:31:20 linaro-alip kernel: [37941.635393] alloc_contig_range: [bee20, bee24) PFNs busy
May 5 06:31:20 linaro-alip kernel: [37941.640421] alloc_contig_range: [bee24, bee28) PFNs busy
The kernel tries to free nonexistent virtually contiguous memory:
May 5 06:31:35 linaro-alip kernel: [37956.562948] cma: cma_alloc: alloc failed, req-size: 5 pages, ret: -16
May 5 06:31:35 linaro-alip kernel: [37956.563009] Trying to vfree() nonexistent vm area (ffff00000fed3000)
…
…
…
…
May 5 06:31:36 linaro-alip kernel: [37957.616722] wcn36xx: ERROR Failed to allocate BD mempool
May 5 06:31:36 linaro-alip kernel: [37957.621145] wcn36xx: ERROR Failed to alloc DXE mempool: -12
We did not find a memory leak in our process.

A detailed log (syslog) is attached at the link below:
https://drive.google.com/open?id=1glhUzJQ5juwXxIN4P0JlfJYIRoi9Akc3
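
For anyone triaging a similar syslog, here is a minimal Python sketch that counts how often each of the kernel messages quoted above appears. The regular expressions are taken from the excerpts in this post; the function name `count_kernel_errors` and the category labels are made up for illustration and are not part of any tool.

```python
import re
from collections import Counter

# Patterns copied from the log excerpts quoted in this thread.
# The dictionary keys are hypothetical labels, chosen only for readability.
PATTERNS = {
    "alloc_contig_busy": re.compile(
        r"alloc_contig_range: \[[0-9a-f]+, [0-9a-f]+\) PFNs busy"
    ),
    "cma_alloc_failed": re.compile(r"cma: cma_alloc: alloc failed"),
    "vfree_nonexistent": re.compile(r"Trying to vfree\(\) nonexistent vm area"),
    "wcn36xx_error": re.compile(r"wcn36xx: ERROR"),
}


def count_kernel_errors(lines):
    """Count how many lines match each pattern in PATTERNS."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

It can be fed an open file, for example `count_kernel_errors(open("/var/log/syslog", errors="replace"))`, assuming the log lives at that path. Note that the "callbacks suppressed" lines mean the kernel rate-limited the output, so the real number of failures is higher than the count.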

In addition, if we run the same use case without the HDMI display, the system does not freeze, but the errors still occur.

My questions:

Is this a known problem?
Is it valid to run the DB with only 40 MB of physical RAM left?
Is there a minimum amount of free physical RAM for the 19.01 Desktop image below which such issues may occur?

Thanks,
Nissim

Loic · May 6, 2020, 7:39am

If you’re low on RAM, I would suggest trying zram, which is compressed RAM swap: Create Swap Space using ZRAM or RAM Compression - 96Boards

For the freeze issue, I assume that at some point there are not enough contiguous pages for the desktop/graphics to run (though the system should still be alive).
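
One way to check whether the contiguous (CMA) pool is what runs dry is to watch the `CmaTotal` and `CmaFree` fields of `/proc/meminfo`, which are reported when the kernel is built with CMA support. A minimal Python sketch follows; `parse_meminfo` and `cma_summary` are hypothetical helper names, not an existing API.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts. The unit suffix is ignored here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cma_summary(fields):
    """Return (CmaTotal, CmaFree) in kB, or None if the kernel does not report them."""
    if "CmaTotal" not in fields or "CmaFree" not in fields:
        return None
    return fields["CmaTotal"], fields["CmaFree"]
```

Usage would be something like `cma_summary(parse_meminfo(open("/proc/meminfo").read()))`, logged periodically next to the syslog timestamps. Keep in mind that the `ret: -16` failures in the log come with "PFNs busy" messages, so an allocation can fail because pages in the range could not be moved, even if the pool is not completely used up.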

NiB · May 6, 2020, 9:32am

Hi,

As I mentioned previously, if we run the system without the HDMI display it does not freeze, but the errors still occur.

How much RAM does the desktop/graphics require?

I found two more interesting issues in this use case:

From the attached syslog: at 05/05/20 6:36:48 (BUG Bad page map in …)
I believe this implies that there is a problem with the physical RAM (very hard to tell what exactly)
syslog
I run a script that monitors our process via the top command every 15 minutes.
As can be seen from the attached log file (mem_usage.log), the total free physical RAM
is stable (~40 MB). However, our process has consumed an additional 99856 KB since it started.
How can this be, if the total free physical RAM did not change accordingly?
monitor process log file

Thanks,
Nissim

Loic · May 13, 2020, 8:23am

That depends on usage; I see that kwin requires ~100 MB at startup. But do you even need graphics, or a desktop environment?

800 MB is quite a lot of memory for one process; is there room to reduce it?

According to your log, swap usage has increased.

doitright · May 13, 2020, 2:51pm

Because as running processes consume more memory, you see an increase in swap usage in order to ensure that the free memory remains at a safe level. This will continue until there is no more swap space available, at which point it will either start killing processes, or crash, or both.
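
To see this effect for a single process, its resident and swapped-out sizes can be logged side by side. On Linux, `/proc/<pid>/status` reports both as `VmRSS` and `VmSwap`. The following is only a sketch; `parse_status_kb` and `read_process_memory` are hypothetical helper names.

```python
def parse_status_kb(text, keys=("VmRSS", "VmSwap")):
    """Extract the selected kB fields from /proc/<pid>/status-style text."""
    result = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in keys:
            # Lines look like "VmRSS:     1234 kB"; take the number only.
            result[name] = int(rest.split()[0])
    return result


def read_process_memory(pid):
    """Read VmRSS and VmSwap (in kB) for a running process on Linux."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status_kb(f.read())
```

If `VmSwap` keeps growing while the system-wide free memory stays flat, the process is growing and the kernel is pushing its pages out to swap to compensate.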

I’d suggest that more important than monitoring the “whole system” memory usage, you probably should focus on your process, whatever that is. I have the feeling that you have memory leaks that need to be addressed.

If your process is stable in terms of memory utilization, then start monitoring other processes until you track down what is causing the increased memory usage.
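
As a rough illustration of that last step, the sketch below takes two snapshots of every process's resident size and ranks the processes by growth. It assumes a Linux `/proc` filesystem; `snapshot_rss` and `top_growers` are hypothetical names, and resident size alone does not show pages that have already been swapped out.

```python
import os


def snapshot_rss():
    """Return {pid: (name, rss_kB)} for every process that reports VmRSS."""
    snap = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(f"/proc/{entry}/status") as f:
                text = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        name, rss = None, None
        for line in text.splitlines():
            if line.startswith("Name:"):
                name = line.split(":", 1)[1].strip()
            elif line.startswith("VmRSS:"):
                rss = int(line.split()[1])
        if name is not None and rss is not None:
            snap[int(entry)] = (name, rss)
    return snap


def top_growers(before, after, limit=5):
    """Rank processes present in both snapshots by RSS growth, largest first."""
    growth = []
    for pid, (name, rss_after) in after.items():
        if pid in before:
            growth.append((rss_after - before[pid][1], pid, name))
    growth.sort(reverse=True)
    return growth[:limit]
```

Taking one snapshot, waiting (for example with `time.sleep(900)` to match a 15-minute interval), taking a second one and passing both to `top_growers` gives a short list of candidates to look at more closely.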