Audio issues with DragonBoard 410c (Yocto) on newer kernels: dead period interrupt

Hello,

While working with the DragonBoard 410c I ran into an interesting issue with the audio pipeline using alsaloop.

I moved up to Linux kernel 5.1, based on the mainline 5.1 branch. The same issue happens with the Linaro branch (release/db845c/qcomlt-5.1).

The issue occurs when running alsaloop: at some point alsaloop hangs, starts skipping, and never seems to recover. I use this command ( alsaloop -C hw:0,1 -P hw:0,0 -U -S 0 -l 2400 -v -v -v -v -v -v -v -v -v -v -v -v -v -v -v -v ) and after some amount of time the console output stops, with a message along the lines of “poll took 65431256421us”.

I made a post about this in the past (Kernel 5.1: Audio hanging on Alsaloop Poll), but more work has since been done to figure out the cause of the issue. The efforts have not been successful, but a lot was learned. It's important to note that this patch is applied (Real Time Audio and DragonBoard 410c). That patch allows the buffer sizes to be adjusted, which is a nice feature, but it does make the issue more prevalent; the issue does happen with or without the patch, though. Another strange thing that seems to cause the hung state to occur faster is running alsaloop through PuTTY connected to the serial terminal of the DragonBoard with the verbosity level at 16.

During testing we used devmem2 to check register values and see what's going on with the hardware.
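
(For anyone unfamiliar with devmem2, each read below roughly amounts to mmap()-ing /dev/mem at the register's page and reading a 32-bit word. This is a minimal sketch, not a drop-in replacement for devmem2; the address used is the capture LPAIF_DMACURR_REG from the dumps.)

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* map one page of /dev/mem and read a 32-bit register at a physical address */
static uint32_t read_reg(off_t phys)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd = open("/dev/mem", O_RDONLY | O_SYNC);
	volatile uint8_t *map;
	uint32_t val;

	if (fd < 0) {
		perror("open /dev/mem");
		exit(1);
	}
	map = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, phys & ~(page - 1));
	if (map == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	val = *(volatile uint32_t *)(map + (phys & (page - 1)));
	munmap((void *)map, page);
	close(fd);
	return val;
}

int main(void)
{
	/* capture-channel LPAIF_DMACURR_REG, from the dumps below */
	printf("0x%08X\n", read_reg(0x0771300C));
	return 0;
}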

When alsaloop hangs, it is in a state where it is waiting for the period interrupt, but the interrupt never arrives. The interesting thing is that the DMACURR register continues to change as though nothing is wrong. Below is a capture of the LPASS control registers, taken with devmem2, while alsaloop is running and again once it has hung. All the registers have values indicating it should be running, yet no interrupt is firing. We have also traced into the irq functions in lpass-platform.c and the GIC irq chip and confirmed that they are not executing for lpass-irq-lpaif while it is in this state.

Running:
// Capture registers
Read at address 0x07713004 (0xffff81568004) : 0xBEC80000 // LPAIF_DMABASE_REG
Read at address 0x07713008 (0xffff95c3c008) : 0x00000FFF // LPAIF_DMABUFF_REG
Read at address 0x07713010 (0xffff976fe010) : 0x000001FF // LPAIF_DMAPER_REG
Read at address 0x07713000 (0xffffb980b000) : 0x0000083F // LPAIF_DMACTL_REG
Read at address 0x0770E00C (0xffff9af4d00c) : 0x00000000 // LPAIF_IRQCLEAR_REG
Read at address 0x0770E000 (0xffff971a6000) : 0x00038007 // LPAIF_IRQEN_REG
Read at address 0x0771300C (0xffff9450800c) : 0xBEC811A0 // LPAIF_DMACURR_REG
Read at address 0x0770B000 (0xffffb314d000) : 0x00000110 // LPAIF_I2SCTL_REG

// Playback registers
Read at address 0x07710404 (0xffffbda36404) : 0xBEC7C000 // LPAIF_DMABASE_REG
Read at address 0x07710408 (0xffff8eb96408) : 0x00000FFF // LPAIF_DMABUFF_REG
Read at address 0x07710410 (0xffffa5a33410) : 0x000001FF // LPAIF_DMAPER_REG
Read at address 0x07710400 (0xffffb2866400) : 0x0000081F // LPAIF_DMACTL_REG
Read at address 0x0770E00C (0xffffa2dce00c) : 0x00000000 // LPAIF_IRQCLEAR_REG
Read at address 0x0770E000 (0xffff95300000) : 0x00038007 // LPAIF_IRQEN_REG
Read at address 0x0771040C (0xffffbdc6540c) : 0xBEC7F130 // LPAIF_DMACURR_REG
Read at address 0x07709000 (0xffffbf164000) : 0x00004400 // LPAIF_I2SCTL_REG

Hung :
// Capture registers
Read at address 0x07713004 (0xffffb1afe004) : 0xBEC80000 // LPAIF_DMABASE_REG
Read at address 0x07713008 (0xffffa1798008) : 0x00000FFF // LPAIF_DMABUFF_REG
Read at address 0x07713010 (0xffff9283b010) : 0x000001FF // LPAIF_DMAPER_REG
Read at address 0x07713000 (0xffff9b0a7000) : 0x0000083F // LPAIF_DMACTL_REG
Read at address 0x0770E00C (0xffffb8e2e00c) : 0x00000000 // LPAIF_IRQCLEAR_REG
Read at address 0x0770E000 (0xffff8a3bb000) : 0x00038007 // LPAIF_IRQEN_REG
Read at address 0x0771300C (0xffff9a58700c) : 0xBEC83CC0 // LPAIF_DMACURR_REG
Read at address 0x0770B000 (0xffffaa90d000) : 0x00000110 // LPAIF_I2SCTL_REG

// Playback registers
Read at address 0x07710404 (0xffffb403b404) : 0xBEC7C000 // LPAIF_DMABASE_REG
Read at address 0x07710408 (0xffffa6e91408) : 0x00000FFF // LPAIF_DMABUFF_REG
Read at address 0x07710410 (0xffff8a6ae410) : 0x000001FF // LPAIF_DMAPER_REG
Read at address 0x07710400 (0xffff83d48400) : 0x0000081F // LPAIF_DMACTL_REG
Read at address 0x0770E00C (0xffffb09a600c) : 0x00000000 // LPAIF_IRQCLEAR_REG
Read at address 0x0770E000 (0xffff83f41000) : 0x00038007 // LPAIF_IRQEN_REG
Read at address 0x0771040C (0xffff979da40c) : 0xBEC7D950 // LPAIF_DMACURR_REG
Read at address 0x07709000 (0xffffbf164000) : 0x00004400 // LPAIF_I2SCTL_REG
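
For reference, the simplest way to confirm the handler is never entered is a trace_printk() at the top of the LPAIF interrupt handler in sound/soc/qcom/lpass-platform.c (handler name as in mainline; this is a sketch, not our exact debug patch):

static irqreturn_t lpass_platform_lpaif_irq(int irq, void *data)
{
	/* while hung, nothing shows up in /sys/kernel/debug/tracing/trace */
	trace_printk("lpaif irq entered\n");

	/* ... existing handler body unchanged ... */
	return IRQ_HANDLED;
}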

Since ALSA is in a wait state and doesn't seem to recognize that there is an issue, we tried jump-starting the hardware by executing the following sequence for each substream, and suddenly ALSA comes back to life and runs happily for some time until it hangs the same way again. We can repeat this jump start each time it hangs, but eventually it comes back to life only to quickly error out with a broken pipe.

lpass_platform_pcmops_trigger with command SNDRV_PCM_TRIGGER_SUSPEND
lpass_platform_pcmops_trigger with command SNDRV_PCM_TRIGGER_RESUME
lpass_cpu_daiops_trigger with command SNDRV_PCM_TRIGGER_SUSPEND
lpass_cpu_daiops_trigger with command SNDRV_PCM_TRIGGER_RESUME
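
(For illustration only: the sequence boils down to something like the sketch below. This assumes the pre-5.5, non-component trigger signatures, and that the two normally-static functions have been exposed for debugging; the helper name and the plumbing to get hold of the substream and DAI are hypothetical.)

/* Hypothetical debug helper: cycle the LPASS trigger ops through SUSPEND
 * then RESUME for one substream.  lpass_platform_pcmops_trigger() and
 * lpass_cpu_daiops_trigger() are normally static in lpass-platform.c and
 * lpass-cpu.c, so this only works if they are exported for debugging. */
static void lpass_dbg_jump_start(struct snd_pcm_substream *substream,
				 struct snd_soc_dai *dai)
{
	lpass_platform_pcmops_trigger(substream, SNDRV_PCM_TRIGGER_SUSPEND);
	lpass_platform_pcmops_trigger(substream, SNDRV_PCM_TRIGGER_RESUME);
	lpass_cpu_daiops_trigger(substream, SNDRV_PCM_TRIGGER_SUSPEND, dai);
	lpass_cpu_daiops_trigger(substream, SNDRV_PCM_TRIGGER_RESUME, dai);
}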

We have also repeated our testing with various kernel versions from 5.3 back to 4.14 and the issue occurs the same in all of them.

We are currently testing and looking all the way back to 4.4.9.

Any input or ideas are welcomed!

We spent some time working with the alsaloop code, and the increase_playback_avail_min function seems to make the issue worse. We removed the function and the issue almost never happens: it ran for three days with no issues other than one hang right when it started (we used the "jump start" sequence to restart it). Below is the commit that we reverted to make it work better.

@srini

Hey, any chance you have any feedback on this? I know it's not the driver specifically, but do you have any ideas on what might cause this, or any input on what I could work on to improve the situation? Thank you in advance.

Hi Jessie,
Does the GIC interrupt count increase (in /proc/interrupts) during this lockup state?
Wanted to understand if we actually receive the hw interrupt but for some reason the code in lpass-platform.c is ignoring it.

–srini

@srini

The GIC interrupt count stops increasing during this locked state. Can I provide you any more information?
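
(For reference, a minimal sketch of one way to watch that count from userspace; the "lpaif" substring is an assumption — match whatever name your /proc/interrupts shows for the LPASS interrupt.)

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* print the /proc/interrupts line(s) for the LPAIF interrupt */
static void dump_lpaif_line(void)
{
	char line[512];
	FILE *f = fopen("/proc/interrupts", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "lpaif"))
			fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	dump_lpaif_line();
	sleep(5);		/* the counts should move while audio runs */
	dump_lpaif_line();
	return 0;
}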

(Just for a little bit more information, I jumped up to kernel 5.5 and configured it to run on the board. I am having the same issue)

@srini

Any updates? Or anything else I can provide?

@Jessie
Not seeing the h/w interrupt is a real issue. Either we messed up a register write and disabled those interrupt sources (maybe some debugging of the register writes to the IRQ port register LPAIF_IRQ_PORT_HOST would help), or maybe we are hitting some race condition here.
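
(As an illustration of the kind of instrumentation I mean, something along these lines in sound/soc/qcom/lpass-platform.c — which already includes lpass.h and lpass-lpaif-reg.h — would log every write that touches the host-port IRQ registers. A rough, untested sketch:)

/* Sketch: funnel writes through a helper that logs anything touching the
 * host-port IRQ enable/clear registers, so we can see whether the enable
 * bits ever get cleared behind our back. */
static int lpass_dbg_write(struct lpass_data *drvdata, u32 reg, u32 val)
{
	struct lpass_variant *v = drvdata->variant;

	if (reg == LPAIF_IRQEN_REG(v, LPAIF_IRQ_PORT_HOST) ||
	    reg == LPAIF_IRQCLEAR_REG(v, LPAIF_IRQ_PORT_HOST))
		pr_info("lpass dbg: irq port write reg=0x%x val=0x%08x\n",
			reg, val);

	return regmap_write(drvdata->lpaif_map, reg, val);
}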

@srini

We did some debugging on the registers outlined above using devmem2. The two LPAIF_IRQ_PORT_HOST registers read as follows:

Running:
// Capture registers
Read at address 0x0770E00C (0xffff9af4d00c) : 0x00000000 // LPAIF_IRQCLEAR_REG
Read at address 0x0770E000 (0xffff971a6000) : 0x00038007 // LPAIF_IRQEN_REG

// Playback registers
Read at address 0x0770E00C (0xffffa2dce00c) : 0x00000000 // LPAIF_IRQCLEAR_REG
Read at address 0x0770E000 (0xffff95300000) : 0x00038007 // LPAIF_IRQEN_REG

Hung :
// Capture registers
Read at address 0x0770E00C (0xffffb8e2e00c) : 0x00000000 // LPAIF_IRQCLEAR_REG
Read at address 0x0770E000 (0xffff8a3bb000) : 0x00038007 // LPAIF_IRQEN_REG

// Playback registers
Read at address 0x0770E00C (0xffffb09a600c) : 0x00000000 // LPAIF_IRQCLEAR_REG
Read at address 0x0770E000 (0xffff83f41000) : 0x00038007 // LPAIF_IRQEN_REG

The values are the same in both states. (What other debug info would you want or need?)

I think it's possible that it is the race condition. The DMA still appears to be working (based on watching the DMACURR value) and we hear the last buffer being played over and over. When we removed the increase_playback_avail_min function from the alsaloop source, it almost entirely eliminated the issue. increase_playback_avail_min calls snd_pcm_sw_params with some updated params. Is it possible that the XRUN handling that tears the stream down and restarts it is getting some kind of bad data? I looked at the source for snd_pcm_sw_params and I am going to add some logging, but you know more about this kind of stuff than I do. The bit of code I am looking at is:

	snd_pcm_stream_lock_irq(substream);
	runtime->tstamp_mode = params->tstamp_mode;
	if (params->proto >= SNDRV_PROTOCOL_VERSION(2, 0, 12))
		runtime->tstamp_type = params->tstamp_type;
	runtime->period_step = params->period_step;
	runtime->control->avail_min = params->avail_min;
	runtime->start_threshold = params->start_threshold;
	runtime->stop_threshold = params->stop_threshold;
	runtime->silence_threshold = params->silence_threshold;
	runtime->silence_size = params->silence_size;
        params->boundary = runtime->boundary;
	if (snd_pcm_running(substream)) {
		if (substream->stream == SNDRV_PCM_STREAM_PLAYBACK &&
		    runtime->silence_size > 0)
			snd_pcm_playback_silence(substream, ULONG_MAX);
		err = snd_pcm_update_state(substream, runtime);
	}
	snd_pcm_stream_unlock_irq(substream);
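
Concretely, the kind of logging I plan to add (just a sketch) is a print next to the avail_min assignment, so the update can be correlated with the point where the period interrupt stops:

	runtime->control->avail_min = params->avail_min;
	/* added logging (sketch): record the new avail_min and whether the
	 * stream was running when the update landed */
	pr_info("snd_pcm_sw_params: avail_min=%lu running=%d\n",
		(unsigned long)params->avail_min,
		snd_pcm_running(substream));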

I will update you if I find anything but feedback would be nice.
Thank you for the help

@srini

As a bit of an update: I am a little lost on where to go next. The problem seems to stem from some kind of race condition (honestly, I have no facts to prove this), but tracing it down has been quite the task.

To the best of my understanding, the avail_min adjustment in alsaloop is causing the hardware to go into this state because it adjusts these values while the stream is playing and a lock gets stuck somewhere, but again I have no proof.

If you have any input or other ideas, I would be happy to explore them.

So what's the next step in your book?

Is it something you can reproduce with the latest release (based on kernel 5.4)? Linaro Linux 20.02 Dragonboard410c releases

Sorry for the delayed response but I just got the time to test this.

The issue is reproducible with the new release. I tested it with and without the alsaloop code change and it behaves the same.

At the moment I have been running two boards with the alsaloop code change for about five days and the issue hasn't shown up. I'm not sure what increase_playback_avail_min is really doing on the ALSA side, but it seems to create the issue…
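
(As far as I can tell — illustration only, not alsaloop's exact code — it boils down to re-applying the software params, with a larger avail_min, to a stream that may already be running:)

#include <alsa/asoundlib.h>

/* re-apply sw params with a new avail_min on a possibly-running stream */
static int bump_avail_min(snd_pcm_t *pcm, snd_pcm_uframes_t new_avail_min)
{
	snd_pcm_sw_params_t *swparams;
	int err;

	snd_pcm_sw_params_alloca(&swparams);
	err = snd_pcm_sw_params_current(pcm, swparams);
	if (err < 0)
		return err;
	err = snd_pcm_sw_params_set_avail_min(pcm, swparams, new_avail_min);
	if (err < 0)
		return err;
	/* this lands in the kernel's snd_pcm_sw_params() quoted earlier,
	 * while the stream may still be running */
	return snd_pcm_sw_params(pcm, swparams);
}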

Thank you!

@Loic

It hit me that my answer is not correct.

I have to build it myself because I have a change in the lpass platform driver to allow the buffer size to be adjustable (Real Time Audio and DragonBoard 410c).

Using the release build you will probably never run into this issue, because the increase_playback_avail_min function in alsaloop won't do anything due to lpass-platform having the hardcoded values by default.
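
(For context, the hardcoded constraint in lpass-platform.c looks roughly like this; constants quoted from memory of the mainline driver around v5.4, so double-check against your tree:)

#define LPASS_PLATFORM_BUFFER_SIZE	(16 * 1024)
#define LPASS_PLATFORM_PERIODS		2

static const struct snd_pcm_hardware lpass_platform_pcm_hardware = {
	/* the period/buffer geometry is fixed, so userspace tweaks such as
	 * alsaloop's increase_playback_avail_min have little effect unless
	 * the driver is patched to relax these */
	.buffer_bytes_max	= LPASS_PLATFORM_BUFFER_SIZE,
	.period_bytes_max	= LPASS_PLATFORM_BUFFER_SIZE /
					LPASS_PLATFORM_PERIODS,
	.period_bytes_min	= LPASS_PLATFORM_BUFFER_SIZE /
					LPASS_PLATFORM_PERIODS,
	.periods_min		= LPASS_PLATFORM_PERIODS,
	.periods_max		= LPASS_PLATFORM_PERIODS,
	/* ... rate/format fields omitted ... */
};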

I use zeus and its kernel branch (so kernel 5.4), with just that code change in the file, because I am looking for low-latency audio. I even went as far as pulling 5.5 and making the config file changes to build it on the board, with the same outcome.

Thank you again!