Wcn36xx AP Dropout

sabjorn · April 13, 2018, 1:19am

Having trouble with AP mode on DragonBoard 410c running Linux 4.9.

AP mode will work for a while, but after a few hours the SSID is still observable (broadcasting) while the AP will not accept any connections. The only way to connect to the AP again is to reset the DragonBoard.

I have been unable to track down where this issue is coming from. I am using hostapd to manage AP mode, and even with it set to high verbosity I cannot find any problems in the logs.

Any other ideas where this problem might be coming from?

THANKS!

anon91830841 · April 13, 2018, 6:53am

hi,

There have been a very large number of bug fixes in the wcn36xx driver (the WLAN driver used on the db410c) in the last few months. Most (if not all) have now been merged into mainline, and we have backported them into our 4.14 release branch; however, we are no longer supporting the 4.9 branch.

If you cannot upgrade to 4.14, you might be able to cherry-pick all the bug fixes, hopefully without too much pain. Looking at the ath-next branch, I can see the following patches to backport. Not all of them are strictly needed, but I would recommend taking them all:

5151a673da43 (ath-next/ath-next) wcn36xx: allocate skbs with GFP_KERNEL during init
6062546d9b7f wcn36xx: Remove useless skb spinlock
1391cce7daf6 wcn36xx: Add missing fall through comment in smd.c
2edfcf2b303c wcn36xx: don't delete invalid bss indices
271f1e65ff38 wcn36xx: don't keep reference to skb if transmission failed
7cae35199bee wcn36xx: check for DMA mapping errors in wcn36xx_dxe_tx_frame()
f276ba06e8b2 wcn36xx: dequeue all pending indicator messages
e5f9908155c9 wcn36xx: Fix firmware crash due to corrupted buffer address
ee35eecb0822 wcn36xx: turn off probe response offloading
2ef00c53049b wireless: Use octal not symbolic permissions
6767b302e1c9 wcn36xx: Check DXE IRQ reason
e5d04670904f wcn36xx: calculate DXE default channel values
6ced7958168f wcn36xx: calculate DXE control registers values
6b8a127bf66d wcn36xx: reduce verbosity of drivers messages
9bfd05e35ac3 wcn36xx: Fix warning due to duplicate scan_completed notification
d0bb950b9f5f wcn36xx: release DMA memory in case of error
4c8cf8df2f42 wcn36xx: fix incorrect assignment to msg_body.min_ch_time
0856655a2547 wcn36xx: Fix dynamic power saving
6d1f37323f5b wcn36xx: Reduce spinlock in indication handler
2f3bef4b247e wcn36xx: Add hardware scan offload support
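
The list above is in `git log` order (newest first), so it has to be reversed before cherry-picking so that the oldest patch is applied first. A minimal sketch, assuming the list has been saved to a hypothetical file `wcn36xx-fixes.txt` (one `hash subject` entry per line) and that the ath-next tree has already been fetched into the kernel repository; the function names are illustrative only:

```shell
#!/bin/sh
# Print commit hashes oldest-first from a newest-first "hash subject" list on stdin.
# Blank lines are skipped; only the first field (the hash) is kept.
oldest_first() {
    awk 'NF { h[++n] = $1 } END { for (i = n; i > 0; i--) print h[i] }'
}

# Cherry-pick every commit named in the list file, stopping at the first conflict.
# -x records the original commit hash in each backported commit message.
pick_all() {
    list="$1"
    for commit in $(oldest_first < "$list"); do
        git cherry-pick -x "$commit" || {
            echo "conflict at $commit; resolve it, then run: git cherry-pick --continue" >&2
            return 1
        }
    done
}
```

Run `pick_all wcn36xx-fixes.txt` from the top of the kernel tree. If a pick stops on a conflict, the commits already applied have to be removed from the list before rerunning.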

Loic · April 13, 2018, 7:06am

I think this is the same issue as the one in this topic: Access Point wifi connection that hangs up, we have to open an issue.

On my side the problem happens on rekeying. Could you try disabling rekeying in your conf file (wpa_group_rekey=0) and let me know?

Bug: https://bugs.96boards.org/show_bug.cgi?id=727

danielt · April 13, 2018, 8:49am

Having trouble with AP mode on DragonBoard 410c running Linux 4.9.

Anything running that kernel must be at least six months old (maybe
more) and there have been some significant improvements to the DB410C
WiFi over that time. Do you still see problems in the latest 18.01
release (or, given that 18.04 is due “real soon now”, if 18.01 doesn’t
work, perhaps with the most recent snapshot)?

Daniel.

danielt · April 13, 2018, 8:52am

Sorry. I ended up merely repeating, with less detail, what others already said (I didn’t spot the later contributions when processing my mail this morning… sorry for the noise).

sabjorn · April 13, 2018, 6:48pm

I should mention that I am running the oe-rpb image, lovingly built with bitbake.

Was running a pretty old build of 4.9:

in: meta-qcom/recipes-kernel/linux/linux-linaro-qcomlt_4.9.bb

SRCREV ?= "8165c999c87f1fe205e6fad779ded1f3e9bc382f"

Did a build with:

SRCREV = "3e797f5b154e7b382aa1801b022d0d431ca00b86"

And still the same problem.

@anon91830841, I’ll try cherry-picking these patches and get back to you.

@Loic, I’ll look into this rekeying issue as well.

Thanks everyone! Fingers crossed!

sabjorn · January 23, 2019, 10:59pm

@anon91830841

Started experimenting with 4.14 (as you’ve noticed with other threads we’ve been communicating through). The hope is the latest driver for WCN36xx will solve our AP Mode dropout problem.

To re-iterate the problem (from above):

AP mode will work for a while, but after a few hours the SSID is still observable (broadcasting) while the AP will not accept any connections. The only way to connect to the AP again is to reset the DragonBoard.

However, we’ve run into the same behaviour (although, possibly from a different source) in the 4.14 kernel.

I’m hoping you can give me some suggestions on further debugging this problem.

I’ve tried setting hostapd to log everything, but there doesn’t appear to be anything in the logs that would show a problem.

So, perhaps there is still a kernel problem. Is there any way to increase the log level for the wcn36xx driver?

I was thinking this may come from power management. As in, after some amount of time, some system is turned off and fails to return when trying to connect to the Access Point. Do you know of any way to observe the power management behaviours of the system (or better yet, to set power management to never sleep)?

Thanks in advance!

danielt · January 24, 2019, 12:38pm

Bug #727 is still open so you should probably try disabling WPA rekeying as @Loic suggests above.

sabjorn · January 25, 2019, 2:21am

@danielt | @Loic
The bug we’re experiencing does not appear to match Bug #727.

We’re using an open network, not WPA, so my assumption was that setting:

wpa_group_rekey=0

will have no effect.

I tested this assumption today:

First, I set up hostapd with the configuration from the Bug #727 report. I was able to recreate the reported issue by pinging the device continuously, both with wpa_group_rekey=60 and with wpa_group_rekey=0.

With wpa_group_rekey=60, after a couple of minutes pings stopped succeeding (Request timeout for icmp_seq). Restarting ping still produced failures. I then disconnected from the network, and my attempt to reconnect failed. I could not reconnect to the AP without restarting the device.

I restarted the test with wpa_group_rekey=0 and the same issue came up (pings stopped succeeding). However, this time when I disconnected and reconnected to the AP, there was no trouble connecting.
So, I was able to recreate the bug you have referenced.

However, when I moved back to our original device configuration (a non-password AP) and ran the same test, there were no problems with either wpa_group_rekey=60 or wpa_group_rekey=0, even after letting ping run for 20 minutes. Hence my assumption that wpa_group_rekey only has an effect when the AP is configured for WPA. Is this a valid assumption?
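
The ping test described above can be left running unattended if each reply is timestamped and the lost replies are counted afterwards. A minimal sketch, assuming the client is a machine with a POSIX shell; the function names are made up for illustration. The "Request timeout for icmp_seq" wording is the one quoted above, and "no answer yet for icmp_seq" is what Linux iputils ping prints when run with `-O`:

```shell
#!/bin/sh
# Ping the AP forever and append each output line, prefixed with the local time, to a log.
# $1 = AP address, $2 = log file.
ping_log() {
    ping "$1" | while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done >> "$2"
}

# Count lost-reply lines in saved ping output read from stdin.
# grep -c exits non-zero when the count is 0, so the status is forced to success.
count_timeouts() {
    grep -c -E 'Request timeout for icmp_seq|no answer yet for icmp_seq' || true
}
```

For example, run `ping_log <ap-address> ap-ping.log` in one terminal and later `count_timeouts < ap-ping.log`; the timestamp on the first timeout line gives the moment the AP stopped answering, which can then be matched against the board-side logs.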

Thus far, for the past 10 months, we have been unable to recreate our specific bug on demand. For some unknown reason, intermittently, we are unable to access the AP (non-password protected) without restarting the device.

Can you think of any other reasons why this might be happening? Again, is there any chance this is being caused by a power management setting? Or possibly a power fluctuation? Does the PMIC on the board have different power modes that it can automatically adjust to?

Furthermore, can you point me towards any information regarding setting the wcn36xx into a higher verbosity debug mode?

–
I have 4 devices that I am going to set up in WPA mode for the weekend to see if perhaps the problem is simply with being an open AP. I’ll let you know if I experience any of the same issues we’ve been having.

Loic · January 25, 2019, 8:06am

Well, the effort has been mainly done in fixing client mode, though some fixes address both AP and client modes.

Correct.

There is no fine-grained power management in AP mode; the WiFi controller is simply on and acting as an AP.

$ dmesg -n 8
$ echo 0x550 > /sys/module/wcn36xx/parameters/debug_mask

This will show 802.11 MAC operations and RX/TX events, so you can check that packets are correctly received during the connection attempt. I also suggest running wpa_supplicant in verbose/debug mode.
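
For a long-running test it may help to wrap the two commands above and stream the kernel log to a file, so the debug output is not lost when the kernel ring buffer wraps. A minimal sketch, assuming root access on the board and a util-linux `dmesg` that supports `-w` (follow); the function names and the output path are hypothetical:

```shell
#!/bin/sh
# Write a debug mask to the wcn36xx module parameter and print the value read back.
# The optional second argument exists only so the function can be tried on a scratch file.
set_wcn36xx_debug() {
    mask="$1"
    param="${2:-/sys/module/wcn36xx/parameters/debug_mask}"
    printf '%s\n' "$mask" > "$param" && cat "$param"
}

# Raise the console log level, enable driver debug, then follow the kernel log into a file.
capture_wcn36xx_log() {
    out="${1:-/tmp/wcn36xx-debug.log}"   # hypothetical output path
    dmesg -n 8
    set_wcn36xx_debug 0x550 || return 1
    dmesg -w > "$out"
}
```

`capture_wcn36xx_log` blocks until interrupted, so it is best started in the background or from a service. On a system with a log rotator, writing to a dedicated file also avoids losing the start of the log.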

Ideally, an air capture (via e.g. wireshark) would be useful.

sabjorn · January 25, 2019, 7:22pm

Thanks for getting back to me!
This is very helpful. I’ll get back to you with my findings.

sabjorn · February 7, 2019, 12:53am

Bad news. Bug #727 does not appear to be the same issue we’re facing here. I still have found no way to force this issue to occur, so recreating it requires leaving some boards running for a few days until the problem appears.

I’m setting up a test with 4 boards and enabling driver debug. I’ll post results once I capture the failure.

The next step will be to try to set up Wireshark as well.

Thanks for the info up to this point!

Loic · February 8, 2019, 2:00pm

If your wireless card (e.g. wlan0) is capable of monitor mode, you can sniff on e.g. channel 1 with:

sudo rfkill unblock wifi
sudo ifconfig wlan0 down
sudo iwconfig wlan0 mode monitor
sudo ifconfig wlan0 up
sudo iwconfig wlan0 channel 1
sudo wireshark -i wlan0 -k

sabjorn · February 14, 2019, 9:47pm

I have not yet set up Wireshark.

However, I set up 4 devices with the WiFi driver in verbose mode:

One pair running debian-4.14.0 (official Linaro release)
One pair running rpb-console-image-4.9.39 (Yocto built from Linaro repo)

Fortunately, for each pair, I had a failure and a success.

Logs

Pair 1

4.14 wifi continues
4.14 wifi fails

Pair 2

4.9.39 wifi continues
4.9.39 wifi fails

Notes

hostapd settings

All systems were configured with the same hostapd config file (with different SSIDs). They were configured without any encryption/passwords. This is generally how our devices are in the field.

Problem Description

After leaving systems to run in AP mode, eventually the system is no longer accessible via this AP.

The AP continues to broadcast but in attempting to connect, the connection fails.

We have had a lot of trouble recreating this bug. It can happen after leaving the system for 10 minutes and sometimes it can take several days for the problem to surface.

Logs Incomplete

You will notice that the 4.9.39 dmesg logs don’t start from the very top. I believe this is because we have a log rotator running on this system. I am hoping it didn’t cut off too much info.

Indicator

In each log you will see:

usb 1-1.3: new high-speed USB device

This marks when I plugged in a USB ethernet adaptor. After inserting the adaptor, for each system I connected to their AP in order to capture the connection successes and failures.

This should hopefully make it very easy to see what’s going on.

Additional Test

Seeing that this may be a driver problem, I ran:

modprobe -r wcn36xx
modprobe wcn36xx
systemctl restart hostapd@wlan0

On one of the failed boxes. This allowed me to access the AP without restarting the system (which is what we currently have to do in the field to regain AP access).

However, fairly quickly after, I was locked out again.

Please take a look and let me know if there is anything that can be done.

This is currently my primary task, so if there is anything I can do to help facilitate finding a solution, please let me know.

Basically, we’ve got systems out in the field that keep having this failure and our customers are not happy.

Loic · February 15, 2019, 8:46am

I do not see any obvious error in the driver’s log, just a ‘delete station indication’ in case of failure log, which indicates that firmware reports (maybe wrongly) the client has left.

Do you have any board side hostapd log in that case? (from serial/UART).

I notice the WiFi is running the old firmware:

[   12.856411] wcn36xx: firmware WLAN version 'WCN v2.0 RadioPhy vRhea_GF_1.12 with 19.2MHz XO' and CRM version 'CNSS-PR-2-0-1-2-c1-00021'
[   12.856445] wcn36xx: firmware API 1.5.1.2, 41 stations, 2 bssids

The first thing to do is to try to reproduce with the latest release (19.01), where you should see:

[    6.844556] wcn36xx: firmware WLAN version 'WCN v2.0 RadioPhy vRhea_GF_1.12 with 19.2MHz XO' and CRM version 'CNSS-PR-2-0-1-2-c1-74-130449-3'
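
To compare boards quickly, the CRM version can be pulled out of the kernel log instead of being read by eye. A minimal sketch, assuming the firmware line keeps the format shown in the two excerpts above; the function name is hypothetical:

```shell
#!/bin/sh
# Extract the CRM version string from wcn36xx firmware lines read on stdin.
# Lines that do not match the expected format produce no output.
crm_version() {
    sed -n "s/.*wcn36xx: firmware WLAN version .* and CRM version '\([^']*\)'.*/\1/p"
}
```

On the board, `dmesg | crm_version` prints only the version string, which can be compared with the value quoted above for 19.01.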

On my side I use the following hostapd config:

interface=wlan0
driver=nl80211
ssid=testap
hw_mode=g
ieee80211n=1
ht_capab=[DSSS_CCK-40][SHORT-GI-20]
channel=1

sabjorn · February 15, 2019, 5:00pm

I’ll be able to set up the latest release early next week and get back to you on the findings.

I’ll also set up the previous tests with hostapd in verbose mode and capture those logs as well.

In the meantime, can you keep your device running in AP mode for a while and see if you can catch this same failure? In some cases, it’s taken >1 week for this error to occur.

sabjorn · March 13, 2019, 8:13pm

Here is a complete log output from a failing system.

hostapd config:

ssid=Aurora-a06400
interface=wlan0
driver=nl80211
hw_mode=g
channel=6
macaddr_acl=0
auth_algs=1

The system is a 4.14 kernel. I’ve tried the latest Linaro Debian (19.01) release but there are a bunch of problems coming up for us with that release. After about a week of non-stop operation, the system freezes so I was unable to catch it in a frozen WiFi state.

I’ve also tried using the latest firmware but the driver complains and WiFi fails to start.

HostAPD and the wcn36xx driver were in full verbose mode.

Couple of things to note:

ignore the date stamps. Being in AP mode there is no NTP.
line 11955 is the last TX from the wcn36xx. My assumption is this is when AP mode starts to fail since it is ONLY RX from this point on.
line 12312, you can see a USB device attach. That is the USB ethernet adapter allowing me to get the logs off the device. After plugging in the adapter I attempted one more connection to the AP in order to get a sample of what a ‘failure’ looks like.

My current theory is based on a note on the driver page:

Not working yet:

Recovery(In case of chip hanging, wcn36xx must restart it)

Basically, something is happening with the chip where it gets locked up. But the driver has no way of knowing this is happening.

This might explain why:

modprobe -r wcn36xx
modprobe wcn36xx
systemctl restart hostapd@wlan0

allows the AP to recover.
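
Since the log shows only RX after the last TX, one possible stopgap is a small watcher that reloads the driver when that pattern appears. The following is only a sketch under several assumptions: that the interface is `wlan0`, that the netdev counters under `/sys/class/net/wlan0/statistics/` actually stop advancing on TX when the chip hangs (this has not been verified on a failing board), and that the interval and threshold values are reasonable. The function names are hypothetical, and as noted earlier in the thread a reload only restored access for a short time in one test.

```shell
#!/bin/sh
# Succeeds when TX did not advance between two samples but RX did.
# $1 = previous tx, $2 = current tx, $3 = previous rx, $4 = current rx
tx_stalled() {
    [ "$2" -eq "$1" ] && [ "$4" -gt "$3" ]
}

# Reload the driver and restart hostapd, as in the manual recovery above.
reload_wifi() {
    modprobe -r wcn36xx && modprobe wcn36xx && systemctl restart hostapd@wlan0
}

# Poll the interface counters and reload after several consecutive stalled samples.
watch_ap() {
    stats="/sys/class/net/${1:-wlan0}/statistics"
    interval="${2:-60}"   # seconds between samples (assumed value)
    limit="${3:-5}"       # stalled samples before reloading (assumed value)
    stalled=0
    prev_tx="$(cat "$stats/tx_packets")"; prev_rx="$(cat "$stats/rx_packets")"
    while sleep "$interval"; do
        cur_tx="$(cat "$stats/tx_packets")"; cur_rx="$(cat "$stats/rx_packets")"
        if tx_stalled "$prev_tx" "$cur_tx" "$prev_rx" "$cur_rx"; then
            stalled=$((stalled + 1))
        else
            stalled=0
        fi
        if [ "$stalled" -ge "$limit" ]; then
            echo "wlan TX stalled, reloading wcn36xx" >&2
            reload_wifi
            stalled=0
            # Re-read the counters, which start again after the module reload.
            cur_tx="$(cat "$stats/tx_packets")"; cur_rx="$(cat "$stats/rx_packets")"
        fi
        prev_tx="$cur_tx"; prev_rx="$cur_rx"
    done
}
```

Running `watch_ap wlan0 60 5` as root would reload the driver after roughly five minutes of RX-only traffic. An idle AP with no clients sends no data frames either, which is why the check also requires RX to be advancing.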

Please let me know if you see anything of note.

Loic · March 15, 2019, 7:03pm

I suggest adding the following to your hostapd config:

ht_capab=[DSSS_CCK-40][SHORT-GI-20]

Weird. To work around this, you can use the watchdog to reboot the system if it freezes.

So, which release are you running? 18.01 with up to date kernel?

To replace the firmware, copy all wcnss files (/lib/firmware/wcnss.*) from 19.01 to the same directory on your target. It’s important to check whether the issue is reproducible on the most recent firmware.
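
The copy step can be done with a backup of the existing files, so the old firmware can be restored if the driver refuses the new one. A minimal sketch, assuming the 19.01 root filesystem is unpacked or mounted somewhere on the board; the function name and the backup directory are hypothetical:

```shell
#!/bin/sh
# Copy wcnss.* firmware files from a source directory into the target firmware
# directory, keeping a copy of the files that are about to be replaced.
# $1 = source directory (the lib/firmware directory of the 19.01 image)
# $2 = target directory (defaults to /lib/firmware)
update_wcnss() {
    src="$1"
    dst="${2:-/lib/firmware}"
    backup="$dst/wcnss-backup"    # hypothetical backup location
    mkdir -p "$backup" || return 1
    for f in "$dst"/wcnss.*; do
        # Skip the unexpanded pattern when no wcnss file exists yet.
        [ -e "$f" ] && cp -p "$f" "$backup/"
    done
    cp -p "$src"/wcnss.* "$dst"/
}
```

After running `update_wcnss <path-to-19.01>/lib/firmware` as root, reboot and check the `wcn36xx: firmware WLAN version` line in dmesg to confirm which firmware was actually loaded.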

I primarily think about a firmware issue, but it could be a race condition in the driver tx path as well.

sabjorn · March 21, 2019, 6:28pm

So, which release are you running? 18.01 with up to date kernel?

We are running a modified version of an older (about 1 year?) RPB image built with Bitbake. The kernel is 4.9.39. We’ve attempted to move to newer versions, but Bitbake has made that process very annoying, and aside from this WiFi issue the stability of our current build is sufficient (so pushing forward is not worth the effort).

Added:

ht_capab=[DSSS_CCK-40][SHORT-GI-20]

and updated the firmware as suggested (copying /lib/firmware/wcnss.* from the latest release).

Still hitting the same connection problem.

here’s the log.

Please let me know if you see anything of note. As I said before, I set up 19.01, but something caused it to freeze up after a few days (likely some interaction with our software, but since it was a quick experiment I was not interested in trying to debug that problem).

Also, thanks for all the time you’ve spent on this. Finding a fix for this problem is ideal for us as a company but on a personal level I just want to help make sure people in the future don’t run into this same issue.

Loic · March 21, 2019, 6:44pm

Ouch. Anyway, I suggest upgrading at least the wifi driver, which contains a lot of fixes (though it may not fix your specific issue):

tree: wcn36xx « ath « wireless « net « drivers - working/qualcomm/kernel.git - Qualcomm Landing Team kernel
log: working/qualcomm/kernel.git - Qualcomm Landing Team kernel

If it’s a firmware issue (in TX path) it’s going to be hard to fix without qcom support.