Wcn36xx AP Dropout


#7

I should mention that I’m running the oe-rpb image, lovingly built with bitbake.

Was running a pretty old build of 4.9:

in: meta-qcom/recipes-kernel/linux/linux-linaro-qcomlt_4.9.bb

SRCREV ?= "8165c999c87f1fe205e6fad779ded1f3e9bc382f"

Did a build with:

SRCREV = "3e797f5b154e7b382aa1801b022d0d431ca00b86"

And still the same problem.

@ndec, I’ll try cherry-picking these patches and get back to you.

@Loic, I’ll look into this keyring issue as well.

Thanks everyone! Fingers crossed!


#8

@ndec

Started experimenting with 4.14 (as you’ve noticed with other threads we’ve been communicating through). The hope is the latest driver for WCN36xx will solve our AP Mode dropout problem.

To reiterate the problem (from above):

AP mode works for a while, but after a few hours the SSID is still observable (broadcasting) while the AP no longer accepts any connections. The only way to connect to the AP again is to reset the DragonBoard.

However, we’ve run into the same behaviour (although, possibly from a different source) in the 4.14 kernel.

I’m hoping you can give me some suggestions on further debugging this problem.

I’ve tried setting hostapd to log everything, but nothing in the logs appears to show a problem.

So, perhaps there is still a kernel problem. Is there any way to increase the log level for the wcn36xx driver?

I was thinking this may come from power management. As in, after some amount of time, some system is turned off and fails to return when trying to connect to the Access Point. Do you know of any way to observe the power management behaviours of the system (or better yet, to set power management to never sleep)?

Thanks in advance!


#9

Bug #727 is still open so you should probably try disabling WPA rekeying as @Loic suggests above.


#10

@danielt | @Loic
The bug we’re experiencing does not appear to match Bug #727.

We’re using an open network, not WPA, so my assumption was that setting:

wpa_group_rekey=0

will have no effect.

I tested this assumption today:

First, I set up hostapd with the configuration in the Bug #727 report. I was able to recreate the reported issue by pinging the device continuously, both with wpa_group_rekey=60 and with wpa_group_rekey=0.

With wpa_group_rekey=60, after a couple of minutes the pings started failing (Request timeout for icmp_seq). Restarting ping still produced failures. I then disconnected from the network, and my attempt to reconnect failed. I could not reconnect to the AP without restarting the device.

I restarted the test with wpa_group_rekey=0 and had the same issue come up (packets stopped succeeding). However, this time when I disconnected and reconnected to the AP there was no trouble connecting.
So, I was able to recreate the bug you have referenced.

Moving back to our original device configuration (a non-password AP), I attempted the same test. With both wpa_group_rekey=60 and wpa_group_rekey=0, there were no problems even after letting ping run for 20 minutes. Hence my assumption that wpa_group_rekey only has an effect when the AP is configured for WPA. Is this a valid assumption?

Thus far, for the past 10 months, we have been unable to recreate our specific bug on demand. For some unknown reason, intermittently, we are unable to access the AP (non-password protected) without restarting the device.

Can you think of any other reasons why this might be happening? Again, is there any chance this is being caused by a power management setting? Or possibly a power fluctuation? Does the PMIC on the board have different power modes that it can automatically adjust to?

Furthermore, can you point me towards any information regarding setting the wcn36xx into a higher verbosity debug mode?


I have 4 devices that I am going to set up in WPA mode for the weekend to see if perhaps the problem is simply with being an open AP. I’ll let you know if I experience any of the same issues we’ve been having.


#11

Well, most of the effort has gone into fixing client mode, though some fixes address both AP and client modes.

Correct.

There is no fine-grained power management in AP mode; the WiFi controller is simply on and acting as an AP.

$ dmesg -n 8
$ echo 0x550 > /sys/module/wcn36xx/parameters/debug_mask

This will show 802.11 MAC operations and RX/TX events, so you can check that packets are correctly received during the connection attempt. I also suggest running wpa_supplicant in verbose/debug mode.
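
For anyone wondering which debug categories a mask like 0x550 switches on, here is a minimal sketch that decodes a debug_mask value into bit names. The bit layout is an assumption based on my reading of enum wcn36xx_debug_mask in the driver’s wcn36xx.h, and the function name decode_wcn36xx_mask is made up for illustration, so please check it against the kernel tree you actually build.

```shell
#!/bin/sh
# Decode a wcn36xx debug_mask value into the names of the bits that are set.
# ASSUMPTION: the bit layout below follows enum wcn36xx_debug_mask in
# drivers/net/wireless/ath/wcn36xx/wcn36xx.h as I remember it; verify it
# against your own kernel source before relying on it.
decode_wcn36xx_mask() {
    mask=$(( $1 ))
    names=""
    for entry in \
        0x001:DXE 0x002:DXE_DUMP 0x004:SMD 0x008:SMD_DUMP \
        0x010:RX 0x020:RX_DUMP 0x040:TX 0x080:TX_DUMP \
        0x100:HAL 0x200:HAL_DUMP 0x400:MAC 0x800:BEACON
    do
        bit=${entry%%:*}     # part before the colon, e.g. 0x010
        name=${entry#*:}     # part after the colon, e.g. RX
        if [ $(( $mask & $bit )) -ne 0 ]; then
            names="${names:+$names }$name"
        fi
    done
    printf '%s\n' "$names"
}
```

Under that assumed layout, `decode_wcn36xx_mask 0x550` prints `RX TX HAL MAC`, which lines up with the description above (MAC operations plus RX/TX events).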

Ideally, an air capture (via e.g. wireshark) would be useful.


#12

Thanks for getting back to me!
This is very helpful. I’ll get back to you with my findings.


#13

Bad news. Bug #727 does not appear to be the same issue we’re facing here. I still have found no way to trigger this issue on demand, so recreating it requires leaving some boards running for a few days until the problem appears.

I’m setting up a test with 4 boards and enabling driver debug. I’ll post results once I capture the failure.

The next step will be to try to set up Wireshark as well.

Thanks for the info up to this point!


#14

If your wireless card (e.g. wlan0) supports monitor mode, you can sniff on e.g. channel 1 with:

rfkill unblock wifi
iwconfig wlan0 mode monitor
sudo ifconfig wlan0 up
sudo iwconfig wlan0 channel 1
sudo wireshark -i wlan0 -k
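
iwconfig and ifconfig are legacy tools, so here is a rough equivalent of the steps above using iw and ip. It is only a sketch: it assumes iw, iproute2 and rfkill are installed, that the card supports monitor mode, and that nothing else (NetworkManager, wpa_supplicant) is managing the interface. The helper names run and start_monitor and the DRY_RUN switch are my own. The capture channel should match the AP under test.

```shell
#!/bin/sh
# Put an interface into monitor mode on one channel using iw/ip.
# ASSUMPTIONS: iw, ip and rfkill are available; the function names and the
# DRY_RUN variable are illustrative. Run as root unless DRY_RUN is set.

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1 = interface name, $2 = channel number
start_monitor() {
    ifname=$1
    channel=$2
    run rfkill unblock wifi &&
    run ip link set "$ifname" down &&
    run iw dev "$ifname" set type monitor &&
    run ip link set "$ifname" up &&
    run iw dev "$ifname" set channel "$channel"
}
```

Calling `DRY_RUN=1 start_monitor wlan0 6` only prints the five commands, which is a cheap way to review them first. Once the interface is in monitor mode, Wireshark can be started as in the post above.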

#15

I have not yet set up Wireshark.

However, I set up 4 devices with the WiFi driver in verbose mode:

  • One pair running debian-4.14.0 (official Linaro release)
  • One pair running rpb-console-image-4.9.39 (Yocto built from Linaro repo)

Fortunately, for each pair, I had a failure and a success.

Logs

Pair 1

4.14 wifi continues
4.14 wifi fails

Pair 2

4.9.39 wifi continues
4.9.39 wifi fails

Notes

hostapd settings

All systems were configured with the same hostapd config file (with different SSIDs). They were configured without any encryption/passwords. This is generally how our devices are in the field.

Problem Description

After leaving systems to run in AP mode, eventually the system is no longer accessible via this AP.

The AP continues to broadcast but in attempting to connect, the connection fails.

We have had a lot of trouble recreating this bug. It can happen after leaving the system for 10 minutes and sometimes it can take several days for the problem to surface.

Logs Incomplete

You will notice that the 4.9.39 dmesg logs don’t start from the very top. I believe this is because we have a log rotator running on this system. I am hoping it didn’t cut off too much info.

Indicator

In each log you will see:

usb 1-1.3: new high-speed USB device 

This marks when I plugged in a USB ethernet adaptor. After inserting the adaptor, for each system I connected to their AP in order to capture the connection successes and failures.

This should hopefully make it very easy to see what’s going on.

Additional Test

Seeing that this may be a driver problem, I ran the following on one of the failed boxes:

modprobe -r wcn36xx
modprobe wcn36xx
systemctl restart hostapd@wlan0

This allowed me to access the AP without restarting the system (which is what we currently have to do in the field to regain AP access).

However, fairly quickly after, I was locked out again.


Please take a look and let me know if there is anything that can be done.

This is my primary task right now, so if there is anything I can do to help facilitate finding a solution, please let me know.

Basically, we’ve got systems out in the field that keep having this failure and our customers are not happy :frowning:


#16

I do not see any obvious error in the driver’s log, just a ‘delete station indication’ in the failure log, which indicates that the firmware reports (maybe wrongly) that the client has left.

Do you have any board side hostapd log in that case? (from serial/UART).

I notice the WiFi is running old firmware:

[   12.856411] wcn36xx: firmware WLAN version 'WCN v2.0 RadioPhy vRhea_GF_1.12 with 19.2MHz XO' and CRM version 'CNSS-PR-2-0-1-2-c1-00021'
[   12.856445] wcn36xx: firmware API 1.5.1.2, 41 stations, 2 bssids

The first thing to do is try to reproduce with the latest release (19.01), where you should see:

[    6.844556] wcn36xx: firmware WLAN version 'WCN v2.0 RadioPhy vRhea_GF_1.12 with 19.2MHz XO' and CRM version 'CNSS-PR-2-0-1-2-c1-74-130449-3'

On my side I use the following hostapd config:

interface=wlan0
driver=nl80211
ssid=testap
hw_mode=g
ieee80211n=1
ht_capab=[DSSS_CCK-40][SHORT-GI-20]
channel=1

#17

I’ll be able to set up the latest release early next week and get back to you on the findings.

I’ll also rerun the previous tests with hostapd in verbose mode and capture those logs as well.

In the meantime, can you keep your device running in AP mode for a while and see if you can catch this same failure? In some cases, it’s taken >1 week for this error to occur.


#18

Here is a complete log output from a failing system.

hostapd config:

ssid=Aurora-a06400
interface=wlan0
driver=nl80211
hw_mode=g
channel=6
macaddr_acl=0
auth_algs=1

The system is a 4.14 kernel. I’ve tried the latest Linaro Debian (19.01) release but there are a bunch of problems coming up for us with that release. After about a week of non-stop operation, the system freezes so I was unable to catch it in a frozen WiFi state.

I’ve also tried using the latest firmware but the driver complains and WiFi fails to start.

hostapd and the wcn36xx driver were in full verbose mode.

Couple of things to note:

  1. ignore the date stamps. Being in AP mode there is no NTP.
  2. line 11955 is the last TX from the wcn36xx. My assumption is this is when AP mode starts to fail since it is ONLY RX from this point on.
  3. line 12312, you can see a USB device attach. That is the USB ethernet adapter allowing me to get the logs off the device. After plugging in the adapter I attempted one more connection to the AP in order to get a sample of what a ‘failure’ looks like.
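
In case it helps others locate the same point in their own logs, here is a small sketch that prints the line number of the last line matching a pattern. The helper name last_match_line is made up, and the pattern that identifies a TX event is an assumption: it depends on how your kernel prints the wcn36xx debug output, so adjust it after looking at a few lines of your own log.

```shell
#!/bin/sh
# Print the line number of the last line in a log file that matches a pattern.
# Prints nothing if the pattern never matches.
# ASSUMPTION: the caller supplies a pattern that really identifies TX events
# in their log; the driver's exact message format is not taken from this thread.
last_match_line() {
    pattern=$1
    logfile=$2
    grep -n -e "$pattern" "$logfile" | tail -n 1 | cut -d: -f1
}
```

For example, `last_match_line 'tx' dmesg.log` (the pattern and the file name are both placeholders) gives the line to start reading from when looking for what changed just before TX stopped.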

My current theory is based on a note on the driver page:

Not working yet:

Recovery(In case of chip hanging, wcn36xx must restart it)

Basically, something is happening with the chip where it gets locked up. But the driver has no way of knowing this is happening.

This might explain why:

modprobe -r wcn36xx
modprobe wcn36xx
systemctl restart hostapd@wlan0

allows the AP to recover.
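
Since the driver cannot detect the hang itself, one stopgap would be a userspace check that reloads the module when the interface looks stuck. The sketch below is only a heuristic under stated assumptions: it treats "RX counter still advancing while the TX counter has stopped" as the failure signature, reads the counters from /sys/class/net/wlan0/statistics/, and all function names are my own. It would need tuning (for example, requiring several stalled intervals in a row) before going anywhere near a field device.

```shell
#!/bin/sh
# Heuristic watchdog for the "RX only" state: the interface keeps receiving
# frames while its TX packet counter has stopped moving.
# ASSUMPTIONS: counters live under /sys/class/net/<ifname>/statistics/;
# an AP with no traffic at all is NOT treated as stalled (RX must advance).

# $1 = statistics directory, $2 = counter name (tx_packets or rx_packets)
read_counter() {
    cat "$1/$2"
}

# $1 = previous tx, $2 = current tx, $3 = previous rx, $4 = current rx
tx_stalled() {
    [ "$2" -eq "$1" ] && [ "$4" -gt "$3" ]
}

# Same recovery sequence as used in this thread; needs root.
reload_driver() {
    modprobe -r wcn36xx &&
    modprobe wcn36xx &&
    systemctl restart hostapd@wlan0
}

# $1 = statistics directory, $2 = polling interval in seconds
watch_ap() {
    stats=${1:-/sys/class/net/wlan0/statistics}
    interval=${2:-60}
    prev_tx=$(read_counter "$stats" tx_packets)
    prev_rx=$(read_counter "$stats" rx_packets)
    while sleep "$interval"; do
        cur_tx=$(read_counter "$stats" tx_packets)
        cur_rx=$(read_counter "$stats" rx_packets)
        if tx_stalled "$prev_tx" "$cur_tx" "$prev_rx" "$cur_rx"; then
            reload_driver
        fi
        prev_tx=$cur_tx
        prev_rx=$cur_rx
    done
}
```

This only automates the manual recovery; it does not address whatever makes the chip lock up in the first place.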

Please let me know if you see anything of note.


#19

I suggest you add the following to your hostapd config:

ht_capab=[DSSS_CCK-40][SHORT-GI-20]

Weird. To work around this you can use the watchdog to reboot the system if it freezes.

So, which release are you running? 18.01 with up to date kernel?

To replace the firmware you have to copy all the wcnss files (/lib/firmware/wcnss.*) from 19.01 to the same directory on your target. It’s important to check whether the issue is reproducible with the most recent firmware.
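
A minimal sketch of that copy step, keeping a backup of the files it replaces so the old firmware can be restored if the driver refuses the new one. The function name, the backup directory name, and the idea that the 19.01 files are already unpacked in some source directory are all assumptions for illustration.

```shell
#!/bin/sh
# Copy wcnss.* firmware files from a source directory into the firmware
# directory, backing up any files that are about to be replaced.
# ASSUMPTIONS: function name and backup location are illustrative; the newer
# wcnss.* files have already been extracted into the directory given as $1.
install_wcnss_firmware() {
    src=$1
    dst=${2:-/lib/firmware}
    backup="$dst/wcnss-backup"
    mkdir -p "$backup" || return 1
    for f in "$dst"/wcnss.*; do
        # Keep the first backup only, so a second run does not overwrite
        # the original files with already-updated ones.
        if [ -e "$f" ] && [ ! -e "$backup/${f##*/}" ]; then
            cp -p "$f" "$backup/" || return 1
        fi
    done
    cp -p "$src"/wcnss.* "$dst/"
}
```

Run as root with the source directory as the first argument. Rebooting afterwards is probably the simplest way to be sure the new files are the ones that get loaded, and the version line in dmesg shows which firmware is actually running.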

I primarily suspect a firmware issue, but it could be a race condition in the driver TX path as well.


#20

So, which release are you running? 18.01 with up to date kernel?

We are running a modified version of an older (about 1 year?) RPB image built with Bitbake. The kernel is 4.9.39. We’ve attempted to move to newer versions, but Bitbake has made that process very annoying, and aside from this WiFi issue the stability of our current build is sufficient (so moving forward has not been worth the effort).


Added:

ht_capab=[DSSS_CCK-40][SHORT-GI-20]

and updated the firmware as suggested (copying /lib/firmware/wcnss.* from the latest release).

Still hitting the same connection problem.

Here’s the log.

Please let me know if you see anything of note. As I said before, I set up 19.01 but something caused it to freeze up after a few days (likely some interaction with our software; since it was a quick experiment, I was not interested in debugging that problem).

Also, thanks for all the time you’ve spent on this. Finding a fix for this problem is ideal for us as a company but on a personal level I just want to help make sure people in the future don’t run into this same issue. :smiley:


#21

Ouch. Anyway, I suggest you upgrade at least the WiFi driver, which contains a lot of fixes (though it may not fix your specific issue):

tree: https://git.linaro.org/landing-teams/working/qualcomm/kernel.git/tree/drivers/net/wireless/ath/wcn36xx?h=release/qcomlt-4.14
log: https://git.linaro.org/landing-teams/working/qualcomm/kernel.git/log/drivers/net/wireless/ath/wcn36xx?h=release/qcomlt-4.14

If it’s a firmware issue (in TX path) it’s going to be hard to fix without qcom support.


#22

Hi, did you solve this issue? In my tests I noticed that wcn36xx has a problem when using WPA/PSK encryption: the connection drops, and the AP only comes back after running:

modprobe -r wcn36xx
modprobe wcn36xx

So I assume the problem is related to wcn36xx: restarting NetworkManager and the wlan0 interface did not help, and the AP only returned to normal after I reloaded the wireless module.


#23

Yes, it seems to be a firmware issue. After a few group rekeyings, the firmware seems unable to send TX packets. A workaround for this issue is to set wpa_group_rekey=0 in hostapd.conf.
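
If you need to apply this on several boards, a small helper along these lines could set the option without duplicating it. This is a sketch only: the function name and the default config path are assumptions, and hostapd has to be restarted afterwards for the change to take effect.

```shell
#!/bin/sh
# Set wpa_group_rekey in a hostapd config file: replace an existing setting,
# or append one if the option is not present yet.
# ASSUMPTION: function name and default path are illustrative.
set_group_rekey() {
    value=$1
    conf=${2:-/etc/hostapd/hostapd.conf}
    if grep -q '^wpa_group_rekey=' "$conf"; then
        # Rewrite via a temporary file so this also works without GNU sed -i.
        sed "s/^wpa_group_rekey=.*/wpa_group_rekey=$value/" "$conf" > "$conf.tmp" &&
        mv "$conf.tmp" "$conf"
    else
        printf 'wpa_group_rekey=%s\n' "$value" >> "$conf"
    fi
}
```

For example, `set_group_rekey 0` (run as root) followed by a hostapd restart.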


#25

It works! Thanks a lot. Is there a way to configure DHCP for this AP? I’m using hostapd with a fixed IP address.


#26

I did it. I’ll come back soon and post all the steps to work around this firmware issue.

Thanks!


#27

Hi, I’ve used this configuration to set up the AP.

Tested with the images below:

  • Download rescue files, version 108 (dragonboard-410c-bootloader-emmc-linux-108.zip):
    https://snapshots.linaro.org/96boards/dragonboard410c/linaro/rescue/108/

  • Download image files, version 530 (linaro-buster-developer-dragonboard-410c-530.img.gz and boot-linaro-buster-dragonboard-410c-530.img.gz):
    http://snapshots.linaro.org/96boards/dragonboard410c/linaro/debian/530/

  • Disable power management:
    sudo iwconfig wlan0 power off
    sudo iwconfig wlan0 txpower auto

  • Configure hostapd.conf:
    sudo nano /etc/hostapd/hostapd.conf

  • Insert this information, save and exit:
    interface=wlan0
    driver=nl80211
    channel=6
    ssid=AccessPoint
    hw_mode=g
    wpa=3
    wpa_passphrase=1234567890
    wpa_key_mgmt=WPA-PSK
    wpa_pairwise=TKIP
    rsn_pairwise=CCMP
    auth_algs=1
    macaddr_acl=0
    ht_capab=[DSSS_CCK-40][SHORT-GI-20][HT40-]
    wpa_group_rekey=<0 or Big number>

  • Stop NM (NetworkManager):
    sudo systemctl stop NetworkManager

  • Set DHCP:
    cd ~
    sudo nano dnsmasq.conf

  • Insert this information, save and exit:
    # Interface to bind to
    interface=wlan0
    # Specify starting_range,end_range,lease_time
    dhcp-range=10.0.0.3,10.0.0.20,12h

  • Apply Settings:
    sudo dnsmasq -C dnsmasq.conf

  • Set network in wlan0:
    sudo ifconfig wlan0 down
    sudo ifconfig wlan0 10.0.0.1 netmask 255.255.255.0 up

  • Run AP with:
    sudo hostapd -B /etc/hostapd/hostapd.conf

  • Use the arp command to see the IP addresses of connected devices:
    sudo arp

  • Ping the IP address, printing a timestamp on each line:
    ping <address> | perl -nle 'print scalar(localtime), " ", $_'

  • If ping fails and the AP stops working, execute:
    modprobe -r wcn36xx
    modprobe wcn36xx
    systemctl restart hostapd@wlan0
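
For long-running ping tests on a machine without perl, the timestamping in the ping step can also be done in plain shell. A minimal sketch (the function name stamp_lines is made up; the client address in the usage note is just an example from the DHCP range above):

```shell
#!/bin/sh
# Prefix every line read from stdin with the current local date and time.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}
```

Usage would look like `ping 10.0.0.5 | stamp_lines | tee ping.log`, so the moment the replies stop can be matched against the timestamps in dmesg.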