Inconsistent WiFi


#1

Not sure whether this is an issue with the hardware, firmware, software, or all of the above, but the performance of the WCN3620 is quite poor. When using WiFi on the DB410c I experience frequent connection drops and pings in excess of 1000ms.

Looking at the 96Boards bug tracker I see several bugs open on this issue, but not much in the way of improvement:
https://bugs.96boards.org/show_bug.cgi?id=272
https://bugs.96boards.org/show_bug.cgi?id=254
https://bugs.96boards.org/show_bug.cgi?id=300
https://bugs.96boards.org/show_bug.cgi?id=214
…There are a few more but these seem to be the most relevant.

Is there anything I can try to make this a bit more consistent?


#2

Hi Rob,

There are two known issues with symptoms matching yours:

The first one is that the latency of individual packets is very high (500+ms for me) when power save is enabled on the interface. This can be “mitigated” by executing “iw dev wlan0 set power_save off” any time the interface is up. I believe that this is either a configuration issue or some protocol issue between the kernel driver and the firmware.
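As a concrete sketch of that workaround (wrapping it in a shell function and defaulting to wlan0 are my additions, not requirements of the driver):

```shell
# Disable WiFi power save on a given interface (default wlan0).
# Must be re-run whenever the interface comes back up; the
# setting does not persist across reboots or reconnects.
wifi_powersave_off() {
    dev="${1:-wlan0}"
    iw dev "$dev" set power_save off
}

# On the board: wifi_powersave_off wlan0
```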

The second is that the coexistence with Bluetooth (especially BLE) doesn’t seem to be reasonably tuned, resulting in WiFi losses during certain BT operations. This problem likely stems from the fact that the driver does not tune the BT coexistence parameters.

Regards,
Bjorn


#3

Hi Bjorn,

I set power_save to off and I still get intermittent disconnects. Would disabling NetworkManager help with this at all?
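In case it helps, here is the sort of experiment I have in mind (the service and tool names assume a systemd-based Debian image, and the wpa_supplicant config path is a placeholder):

```shell
# Take NetworkManager out of the loop and associate manually,
# to see whether its periodic scans are behind the drops.
manual_wifi_up() {
    systemctl stop NetworkManager
    # Placeholder config path: adjust for your image.
    wpa_supplicant -B -i wlan0 -c /etc/wpa_supplicant/wpa_supplicant.conf
    dhclient wlan0
}
```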

Also, are there any known interference issues when using multiple dragonboards in close proximity?

Thanks,

Rob


#4

Over the past day we’ve been doing some testing of mitigation strategies for this WiFi issue and found some pretty interesting things.

When pinging through to the Internet we see two major issues: packets arriving out of sequence, and very, very high ping times.

To mitigate this we tried bringing the interface down and up again, as well as restarting NetworkManager. These actions were performed while running the ping test, when the destination host became unreachable.

You can see the logs in the gist below:


#5

I also did a few tests on the 17.04.01 release and I’m afraid the power_save off workaround doesn’t work for me either.

I tried to characterise things a bit by using iperf in UDP mode (sending at 1Mbps, bidirectional, ~90 packets/second). The results are pretty distinctive:

  • Packet loss comes in occasional avalanches and is otherwise zero (i.e. I’m not suffering interference from overloaded airwaves)
  • Packet loss is cyclic (occurs every 120 seconds)
  • Data being sent to the dragonboard exhibits packet loss equivalent to approximately 1000ms worth of data, at which point the delay receiving a packet rises to ~400ms.
  • Data being sent by the dragonboard exhibits no packet loss, but the maximum delay is ~1400ms

I’ve absolutely no idea if the fact that 1000ms + 400ms = 1400ms is a coincidence or not :slight_smile: .

For the record the two commands I am using are:

  1. [On PC] iperf -s -u -i 10 -e
  2. [On Dragonboard] iperf -c wychelm -d -u -e -i 10 -t $((60*20))
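As an aside, the ~90 packets/second figure above follows from iperf’s default UDP datagram size of 1470 bytes (an assumption about the defaults of the iperf build in use):

```shell
# 1 Mbit/s divided into 1470-byte datagrams:
# 1000000 / (1470 * 8) = 85 datagrams per second,
# close to the ~90 packets/second observed.
echo $(( 1000000 / (1470 * 8) ))   # → 85
```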

#6

I’m aware that my observations have essentially nothing in common with Rob’s, so I also did a ping-the-outside-world test. For me (quiet home network with cable broadband) I get similar observations pinging the outside world (unexpected latency every two minutes) to those I get with iperf to a local machine. I cannot reproduce behavior similar to Rob’s.

It’s also interesting to note that when pinging from the dragonboard we never observe packet loss. I think this is because the timing window to observe packet loss is pretty small (when there is only one ping per second). As we saw with iperf, the system buffers up the outgoing packets, so if we ping during a blackout the outgoing ping requests are delayed rather than lost, and by the time they go out the system is healthy again and ready to receive the reply.


#7

Could this be due to differences in WiFi router setup?

Can we compare configurations?

I’m using the following settings:
Channel 8,
HT20 channel bandwidth,
WPA2-Personal,
RTS Threshold 2347,
Router is Equipped with Beamforming


#8

We can (I’m using channel 11, also WPA2-PSK, no other knobs that I can see), but given we are debugging WiFi it would also be good to eliminate the “through to the Internet” part of your test first (since that’s the router’s problem, not the dragonboard’s :wink: ). Basically there’s a whole pile of potential bufferbloat and traffic shaping that could exist between either of us and 8.8.8.8!

So… I can see that same high latency every two minutes just by pinging my router directly… If it helps, the Linux-alike timestamps in my logs come from: https://github.com/daniel-thompson/toys/blob/master/src/tstamp.c (so it’s just “tstamp ping 192.168.1.1” for me).


#9

Have you tried the following:
Pinging the gateway from the db410c while another host pings the db410c?

This results (at least in my environment) in a total loss of connectivity to the db410c for 2 minutes and 49 seconds before it becomes reachable once again. During this outage the sequence number jumps from 163 to 329, meaning 166 packets were dropped while connectivity was lost.
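The numbers cross-check nicely, assuming the default rate of one ping per second:

```shell
# Sequence jumped 163 -> 329: packets lost during the outage.
echo $(( 329 - 163 ))    # → 166
# Reported outage of 2 minutes 49 seconds, in seconds.
echo $(( 2 * 60 + 49 ))  # → 169
# 166 lost pings at one per second is consistent with a ~169s outage.
```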

You can see my test here (I used your tstamp program to generate the timestamps):


#10

Hi Rob,

Having seen your timestamped logs, I think your results now have enough in common with my own that we don’t have to be disturbed by the differences.

To me, the interesting points in your log appear to be…

At 44 seconds where you see behaviour similar to me (although the length of delay you observe is larger):

[   41.517770] 64 bytes from 10.1.100.1: icmp_seq=42 ttl=64 time=0.691 ms
[   44.187370] 64 bytes from 10.1.100.1: icmp_seq=43 ttl=64 time=1655 ms
[   44.187706] 64 bytes from 10.1.100.1: icmp_seq=44 ttl=64 time=631 ms
[   44.557915] 64 bytes from 10.1.100.1: icmp_seq=45 ttl=64 time=0.679 ms

At 164 seconds (notable for being 44 + 2*60) where we see a similar problem… and a disconnect of some form:

[  161.862395] 64 bytes from 10.1.100.1: icmp_seq=161 ttl=64 time=2.46 ms
[  164.284070] 64 bytes from 10.1.100.1: icmp_seq=162 ttl=64 time=1422 ms
[  164.284070] 64 bytes from 10.1.100.1: icmp_seq=163 ttl=64 time=408 ms
[  333.828653] 64 bytes from 10.1.100.1: icmp_seq=329 ttl=64 time=0.659 ms

Finally at 404 seconds (notable for being 164 + 4*60) we see another bubble:

[  401.814760] 64 bytes from 10.1.100.1: icmp_seq=396 ttl=64 time=0.681 ms
[  404.288240] 64 bytes from 10.1.100.1: icmp_seq=397 ttl=64 time=1467 ms
[  404.288240] 64 bytes from 10.1.100.1: icmp_seq=398 ttl=64 time=444 ms
[  405.860574] 64 bytes from 10.1.100.1: icmp_seq=400 ttl=64 time=0.665 ms

As I said, this is sufficiently similar to what I observe to call it the same phenomenon. The lengths of interruption might be longer for you (and you are seeing a full network disconnect too, whilst I only see the latency problems) but the characteristic every-two-minutes pattern does seem to be common.

Since it appears to be regular interference, if you want to get a better sense of the maximum delay you need to increase the ping rate. For comparison, when I ping at -i 0.01 my latency bubbles look like this:

[    8.138460] 64 bytes from 192.168.1.1: icmp_seq=558 ttl=64 time=0.582 ms
[    9.580049] 64 bytes from 192.168.1.1: icmp_seq=559 ttl=64 time=1426 ms
[    9.580896] 64 bytes from 192.168.1.1: icmp_seq=560 ttl=64 time=1411 ms
[    9.597706] 64 bytes from 192.168.1.1: icmp_seq=561 ttl=64 time=1405 ms
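To pull bubbles like these out of a long high-rate trace automatically, a filter along these lines works on standard ping output (the 100ms threshold is an arbitrary choice of mine):

```shell
# Print ping replies whose round-trip time exceeds a threshold in ms.
slow_pings() {
    limit="${1:-100}"
    awk -v limit="$limit" '/time=/ {
        t = $0
        sub(/.*time=/, "", t)   # strip everything up to the RTT value
        sub(/ ms.*/, "", t)     # strip the trailing units
        if (t + 0 > limit) print
    }'
}

# Usage: ping -i 0.01 192.168.1.1 | slow_pings 100
```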

Daniel.


#11

Great, I’m glad that we’re on the same page now.
I am having a hard time figuring out what is causing this issue; any wild guesses as to why this is happening?
Could this be due to a hardware issue with the WCN3620 itself? Or is this still a software problem?

Would you like me to run the ping test on my side with the increased ping rate?

Thanks for all your help,

Rob

EDIT: I found an old post to the mailing list about a potential issue where the 3620 keeps asking wpa_supplicant for a scan whilst connected to a network.

Here’s the post: http://lists.infradead.org/pipermail/wcn36xx/2015-January/001424.html
Could this potentially be our issue? It looks like very similar behavior.

Is there a verbose mode for the driver that we can enable to see more in the way of logging? I feel like we’re just taking stabs in the dark.


#12

It is likely that the upstream driver fails to tickle some part of the system in the right way! I was not able to reproduce this issue on the #99 Android build, which uses a different driver (I used iperf3 from Termux to run a few tests). Unfortunately, finding out what the correct tickle actually is could still prove very difficult (and there’s no guarantee it is the network driver that provides it).

I would be quite interested to see what the result is but it would be an indulgence… I doubt it will bring about a fix any quicker.

If only it were that much help. It is always useful to get the most detailed possible description of a bug, but the second part is describing the bug to the person who can say: “two minutes, that’s odd, the only thing that happens every two minutes is XXX”. Finding that person is hard :frowning: .


#13

This seems like quite a complex bug, and just hearing someone on the team verify that there is something strange happening with this driver is very important to me.

Otherwise I would have gone off on tangents looking into things that are probably not the issue (e.g. RF interference and router setup). Now I have some idea of what might be going on when we lose connectivity like this.

Besides that, I found a way to get more logging out of the driver by scouring some posts on the mailing lists.
If you remove the module and reinsert it like this: insmod wcn36xx.ko debug_mask=0xffffffff
the driver will give tons of output in your dmesg log. I actually found something quite interesting, but I’m not sure if it’s related.
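As a sketch, the reload sequence can be wrapped up like this (using modprobe rather than an explicit .ko path is my assumption and may not match every setup):

```shell
# Reload wcn36xx with all debug categories enabled.
wcn36xx_debug_reload() {
    modprobe -r wcn36xx
    modprobe wcn36xx debug_mask=0xffffffff
}

# On the board: wcn36xx_debug_reload && dmesg -w | grep wcn36xx
```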

It looks like the driver is trying to scan for 5GHz networks even though ours is a 2.4GHz-only card:
[ 906.859852] wcn36xx: HAL >>> 00000000: 06 00 00 00 09 00 00 00 38
[ 906.867046] wcn36xx: SMD <<< 00000000: 07 00 00 00 15 00 00 00 00 00 00 00 0b 7e 6e 8b 0f 00 00 00 57
[ 906.867077] wcn36xx: SMD command completed in 4ms
[ 906.903164] wcn36xx: hal end scan channel 56
[ 906.903179] wcn36xx: HAL >>> 00000000: 08 00 00 00 09 00 00 00 38
[ 906.903830] wcn36xx: SMD <<< 00000000: 09 00 00 00 0c 00 00 00 00 00 00 00
[ 906.903856] wcn36xx: SMD command completed in 0ms
[ 906.903865] wcn36xx: hal start scan channel 60
[ 906.903874] wcn36xx: HAL >>> 00000000: 06 00 00 00 09 00 00 00 3c
[ 906.911017] wcn36xx: SMD <<< 00000000: 07 00 00 00 15 00 00 00 00 00 00 00 0b 7e 6e 8b 0f 00 00 00 57

I know that the WCN3620/3660/3680 utilize the same driver, but could we possibly be reporting ourselves as a card with capabilities that we don’t have, causing the issue? I know it’s a real stab in the dark but I hope it helps.

-Rob


#14

So I decided to give the latest OpenEmbedded build a try to see if WiFi was a little more consistent. I am running the latest OpenEmbedded build located here: https://builds.96boards.org/snapshots/dragonboard410c/linaro/openembedded/morty/rpb/latest/

Turns out it exhibits most of the same findings as last week (frequent disconnects, high ping times).

The OE build does exhibit a timeout/auth issue that is not seen in the Debian distro:
root@dragonboard-410c:~# [ 1981.081587] wlan0: authenticate with ac:9e:17:7f:57:f0
[ 1981.126768] wlan0: send auth to ac:9e:17:7f:57:f0 (try 1/3)
[ 1981.330108] wlan0: send auth to ac:9e:17:7f:57:f0 (try 2/3)
[ 1981.534101] wlan0: send auth to ac:9e:17:7f:57:f0 (try 3/3)
[ 1981.738205] wlan0: authentication with ac:9e:17:7f:57:f0 timed out
[ 1993.484093] wlan0: authenticate with ac:9e:17:7f:57:f0
[ 1993.523771] wlan0: send auth to ac:9e:17:7f:57:f0 (try 1/3)
[ 1993.726115] wlan0: send auth to ac:9e:17:7f:57:f0 (try 2/3)
[ 1993.930136] wlan0: send auth to ac:9e:17:7f:57:f0 (try 3/3)
[ 1994.134170] wlan0: authentication with ac:9e:17:7f:57:f0 timed out
[ 2004.039973] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 2005.746710] wlan0: authenticate with ac:9e:17:7f:57:f0
[ 2005.786838] wlan0: send auth to ac:9e:17:7f:57:f0 (try 1/3)
[ 2005.990157] wlan0: send auth to ac:9e:17:7f:57:f0 (try 2/3)
[ 2006.194144] wlan0: send auth to ac:9e:17:7f:57:f0 (try 3/3)
[ 2006.398245] wlan0: authentication with ac:9e:17:7f:57:f0 timed out
[ 2018.127251] wlan0: authenticate with ac:9e:17:7f:57:f0
[ 2018.167481] wlan0: send auth to ac:9e:17:7f:57:f0 (try 1/3)
[ 2018.370299] wlan0: send auth to ac:9e:17:7f:57:f0 (try 2/3)
[ 2018.574211] wlan0: send auth to ac:9e:17:7f:57:f0 (try 3/3)
[ 2018.778206] wlan0: authentication with ac:9e:17:7f:57:f0 timed out
[ 2029.041306] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 2033.023120] wlan0: authenticate with ac:9e:17:7f:57:f0
[ 2033.062778] wlan0: send auth to ac:9e:17:7f:57:f0 (try 1/3)
[ 2033.266226] wlan0: send auth to ac:9e:17:7f:57:f0 (try 2/3)
[ 2033.470244] wlan0: send auth to ac:9e:17:7f:57:f0 (try 3/3)
[ 2033.674236] wlan0: authentication with ac:9e:17:7f:57:f0 timed out

EDIT: Some more weirdness from the latest OpenEmbedded build: when removing and reinserting the wcn36xx module we get a panic stating “HAL TX failed” and “Failed to push NV to chip”. You can see the latest log here:


#15

Sorry for not responding earlier; apparently the new forum doesn’t automatically send an email notification when threads are updated.

This is very much in line with the previously reported issue that NetworkManager will scan for networks every minute or so. This previously caused a 15-second hiccup, but with the introduction of

03c95dbef642 (“wcn36xx: Implement cancel_hw_scan”)

I got this down to < 1s. I do, however, remember seeing some delays in the serving of other traffic while testing this.

To confirm this you should be able to trigger this behavior by just issuing:

iw dev wlan0 scan

IIRC, the command sequence between the driver and the firmware is (now) identical for this operation between the two drivers, but I do expect this to be a driver/firmware interaction issue in some way.

Do you have the downstream driver set up so that you can confirm that there are no traces of the scan in the latency numbers during a scan?

Regards,
Bjorn


#16

I started seeing this issue recently as well, but I thought it was related to me switching out my older Apple AP for a Google WiFi.

I created a bug report to track this particular issue here: https://bugs.96boards.org/show_bug.cgi?id=538

Regards,
Bjorn


#17

This is a known bug, which was fixed in efad8396e906 (“wcn36xx: Close SMD channel on device removal”) and backported to the Linaro tree; if it is not already in a release, it should show up soon.


#18

> Do you have the downstream driver setup so that you can confirm that there are no traces of the scan in the latency numbers during a scan?

I am running the most recent snapshot on my DB410C, but can flash any version that is helpful for diagnosing this issue. What is a good version to use as a test for this issue?

> To confirm this you should be able to trigger this behavior by just issuing:
> iw dev wlan0 scan

I have been able to confirm that scanning for networks while pinging increases latency each time the command is executed. Is there a way to disable this automatic scan so as to lower the latency experienced during use?
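One thing I may try (an assumption on my part, not something confirmed in this thread) is pinning the NetworkManager connection to the AP’s BSSID, which should give it less reason to scan for roaming candidates; the connection name and MAC below are placeholders:

```shell
# Pin a NetworkManager WiFi connection to a specific access point.
pin_bssid() {
    nmcli connection modify "$1" 802-11-wireless.bssid "$2"
}

# Example (placeholder values): pin_bssid my-wifi ac:9e:17:7f:57:f0
```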

> I started seeing this issue (auth timeout) recently as well, but I thought it was related to me switching out my older Apple AP for a Google WiFi.

Glad to help narrow down the issue; WiFi problems can make you question even the slightest change in configuration! :smile:


#19

Our OE build config was updated on 6/23 with this fix, and no build has been triggered since… so the next build will indeed have the fix!


#20

In Android builds there is a file named WCNSS_qcom_cfg.ini that contains user preferences which override the factory defaults in the WLAN driver. This file is usually found in the /data/misc/wifi directory. Does the driver on Debian look for parameter overrides in this file?

I ask because one of the issues we see is that the idle-scan mode of the radio ends up causing latency issues when the radio is in use. In the WCNSS_qcom_cfg.ini file there is a parameter that might be able to turn this idle-scan mode off:

# Enable/Disable Idle Scan
gEnableIdleScan=0

Is this possible in the Debian distribution?