Inconsistent WiFi


#1

Not sure whether this is an issue with the hardware, firmware, software, or all of the above, but the performance of the WCN3620 is quite poor. When using WiFi on the DB410c I experience frequent connection drops and pings in excess of 1000ms.

Looking at the 96Boards bug tracker I see several bugs open on this issue, but not much in the way of improvement:
https://bugs.96boards.org/show_bug.cgi?id=272
https://bugs.96boards.org/show_bug.cgi?id=254
https://bugs.96boards.org/show_bug.cgi?id=300
https://bugs.96boards.org/show_bug.cgi?id=214
…There are a few more but these seem to be the most relevant.

Is there anything I can try to make this a bit more consistent?


#2

Hi Rob,

There are two known issues with symptoms matching yours:

The first one is that the latency of individual packets is very high (500+ms for me) when power save is enabled on the interface. This can be “mitigated” by executing “iw dev wlan0 set power_save off” any time the interface is up. I believe that this is either a configuration issue or some protocol issue between the kernel driver and the firmware.
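As a concrete sketch of that workaround (wrapping it in a shell function and defaulting to wlan0 are my additions, not requirements of the driver):

```shell
# Disable WiFi power save on a given interface (default wlan0).
# Must be re-run whenever the interface comes back up; the
# setting does not persist across reboots or reconnects.
wifi_powersave_off() {
    dev="${1:-wlan0}"
    iw dev "$dev" set power_save off
}

# On the board: wifi_powersave_off wlan0
```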

The second is that the coexistence with Bluetooth (especially BLE) doesn’t seem to be reasonably tuned, resulting in WiFi losses during certain BT operations. This problem likely stems from the fact that the driver does not tune the BT coexistence parameters.

Regards,
Bjorn


#3

Hi Bjorn,

I set power_save to off and I still get intermittent disconnects. Would disabling NetworkManager help with this at all?
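In case it helps, here is the sort of experiment I have in mind (the service and tool names assume a systemd-based Debian image, and the wpa_supplicant config path is a placeholder):

```shell
# Take NetworkManager out of the loop and associate manually,
# to see whether its periodic scans are behind the drops.
manual_wifi_up() {
    systemctl stop NetworkManager
    # Placeholder config path: adjust for your image.
    wpa_supplicant -B -i wlan0 -c /etc/wpa_supplicant/wpa_supplicant.conf
    dhclient wlan0
}
```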

Also, are there any known interference issues when using multiple dragonboards in close proximity?

Thanks,

Rob


#4

Over the past day we’ve been doing some testing of mitigation strategies for this WiFi issue and found some pretty interesting things.

When pinging through to the Internet we see two major issues: packets arriving out of sequence, and very, very high ping times.

To mitigate this we tried bringing the interface down and up again, as well as restarting NetworkManager. These actions were performed while running the ping test, when the destination host became unreachable.

You can see the logs in the gist below:


#5

I also did a few tests on the 17.04.01 release and I’m afraid the power_save off workaround doesn’t work for me either.

I tried to characterise things a bit by using iperf in UDP mode (sending at 1Mbps, bidirectional, ~90 packets/second). The results are pretty distinctive:

  • Packet loss comes in occasional avalanches and is otherwise zero (i.e. I’m not suffering interference from overloaded airwaves)
  • Packet loss is cyclic (occurs every 120 seconds)
  • Data being sent to the dragonboard exhibits packet loss equivalent to approximately 1000ms worth of data, at which point the delay receiving a packet rises to ~400ms.
  • Data being sent by the dragonboard exhibits no packet loss, but the maximum delay is ~1400ms

I’ve absolutely no idea if the fact that 1000ms + 400ms = 1400ms is a coincidence or not :slight_smile: .

For the record the two commands I am using are:

  1. [On PC] iperf -s -u -i 10 -e
  2. [On Dragonboard] iperf -c wychelm -d -u -e -i 10 -t $((60*20))
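As an aside, the ~90 packets/second figure above follows from iperf’s default UDP datagram size of 1470 bytes (an assumption about the defaults of the iperf build in use):

```shell
# 1 Mbit/s divided into 1470-byte datagrams:
# 1000000 / (1470 * 8) = 85 datagrams per second,
# close to the ~90 packets/second observed.
echo $(( 1000000 / (1470 * 8) ))   # → 85
```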

#6

I’m aware that my observations have essentially nothing in common with Rob’s, so I also did a ping-the-outside-world test. For me (quiet home network with cable broadband) I get similar observations pinging the outside world (unexpected latency every two minutes) to those I get with iperf to a local machine. I cannot reproduce behavior similar to Rob’s.

It’s also interesting to note that when pinging from the dragonboard we never observe packet loss. I think this is because the timing window to observe packet loss is pretty small (when there is only one ping per second). As we saw with iperf, the system buffers up the outgoing packets, so if we ping during a blackout the outgoing ping requests are delayed rather than lost, and by the time they go out the system is healthy again and ready to receive the reply.


#7

Could this be due to differences in WiFi router setup?

Can we compare configurations?

I’m using the following settings:
Channel 8,
HT20 channel bandwidth,
WPA2-Personal,
RTS Threshold 2347,
Router is Equipped with Beamforming


#8

We can (I’m using channel 11, also WPA2-PSK, no other knobs that I can see), but given we are debugging WiFi it would also be good to eliminate the “through to the Internet” part of your test first (since that’s the router’s problem, not the dragonboard’s :wink: ). Basically there’s a whole pile of potential bufferbloat and traffic shaping that could exist between either of us and 8.8.8.8!

So… I can see that same high latency every two minutes just by pinging my router directly… If it helps, the Linux-alike timestamps in my logs come from: https://github.com/daniel-thompson/toys/blob/master/src/tstamp.c (so it’s just “tstamp ping 192.168.1.1” for me).


#9

Have you tried the following:
Pinging the gateway from the db410c while another host pings the db410c?

This results (at least in my environment) in a total loss of connectivity to the db410c for 2 minutes and 49 seconds before it becomes reachable once again. During this outage the sequence number jumps from 163 to 329, meaning 166 packets were dropped while connectivity was lost.
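The numbers cross-check nicely, assuming the default rate of one ping per second:

```shell
# Sequence jumped 163 -> 329: packets lost during the outage.
echo $(( 329 - 163 ))    # → 166
# Reported outage of 2 minutes 49 seconds, in seconds.
echo $(( 2 * 60 + 49 ))  # → 169
# 166 lost pings at one per second is consistent with a ~169s outage.
```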

You can see my test here (I used your tstamp program to generate the timestamps):


#10

Hi Rob,

Having seen your timestamped logs, I think your results now have enough in common with my own that we don’t have to be disturbed by the differences.

To me, the interesting points in your log appear to be…

At 44 seconds where you see behaviour similar to me (although the length of delay you observe is larger):

[   41.517770] 64 bytes from 10.1.100.1: icmp_seq=42 ttl=64 time=0.691 ms
[   44.187370] 64 bytes from 10.1.100.1: icmp_seq=43 ttl=64 time=1655 ms
[   44.187706] 64 bytes from 10.1.100.1: icmp_seq=44 ttl=64 time=631 ms
[   44.557915] 64 bytes from 10.1.100.1: icmp_seq=45 ttl=64 time=0.679 ms

At 164 seconds (notable for being 44 + 2*60) where we see a similar problem… and a disconnect of some form:

[  161.862395] 64 bytes from 10.1.100.1: icmp_seq=161 ttl=64 time=2.46 ms
[  164.284070] 64 bytes from 10.1.100.1: icmp_seq=162 ttl=64 time=1422 ms
[  164.284070] 64 bytes from 10.1.100.1: icmp_seq=163 ttl=64 time=408 ms
[  333.828653] 64 bytes from 10.1.100.1: icmp_seq=329 ttl=64 time=0.659 ms

Finally at 404 seconds (notable for being 164 + 4*60) we see another bubble:

[  401.814760] 64 bytes from 10.1.100.1: icmp_seq=396 ttl=64 time=0.681 ms
[  404.288240] 64 bytes from 10.1.100.1: icmp_seq=397 ttl=64 time=1467 ms
[  404.288240] 64 bytes from 10.1.100.1: icmp_seq=398 ttl=64 time=444 ms
[  405.860574] 64 bytes from 10.1.100.1: icmp_seq=400 ttl=64 time=0.665 ms

As I said, this is sufficiently similar to what I observe to call it the same phenomenon. The lengths of interruption might be longer for you (and you are seeing a full network disconnect too, whilst I only see the latency problems) but the characteristic every-two-minutes pattern does seem to be common.

Since it appears to be regular interference, if you want to get a better sense of the maximum delay you need to increase the ping rate. For comparison, when I ping at -i 0.01 my latency bubbles look like this:

[    8.138460] 64 bytes from 192.168.1.1: icmp_seq=558 ttl=64 time=0.582 ms
[    9.580049] 64 bytes from 192.168.1.1: icmp_seq=559 ttl=64 time=1426 ms
[    9.580896] 64 bytes from 192.168.1.1: icmp_seq=560 ttl=64 time=1411 ms
[    9.597706] 64 bytes from 192.168.1.1: icmp_seq=561 ttl=64 time=1405 ms
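To pull bubbles like these out of a long high-rate trace automatically, a filter along these lines works on standard ping output (the 100ms threshold is an arbitrary choice of mine):

```shell
# Print ping replies whose round-trip time exceeds a threshold in ms.
slow_pings() {
    limit="${1:-100}"
    awk -v limit="$limit" '/time=/ {
        t = $0
        sub(/.*time=/, "", t)   # strip everything up to the RTT value
        sub(/ ms.*/, "", t)     # strip the trailing units
        if (t + 0 > limit) print
    }'
}

# Usage: ping -i 0.01 192.168.1.1 | slow_pings 100
```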

Daniel.


#11

Great, I’m glad that we’re on the same page now.
I am having a hard time figuring out what is causing this issue; any wild guesses as to why this is happening?
Could this be due to a hardware issue with the WCN3620 itself? Or is this still a software problem?

Would you like me to run the ping test on my side with the increased ping rate?

Thanks for all your help,

Rob

EDIT: I found an old post to the mailing list about a potential issue where the 3620 keeps asking wpa_supplicant for a scan whilst connected to a network.

Here’s the post: http://lists.infradead.org/pipermail/wcn36xx/2015-January/001424.html
Could this potentially be our issue? It looks like very similar behavior.

Is there a verbose mode for the driver that we can enable to see more in the way of logging? I feel like we’re just taking stabs in the dark.


#12

It is likely that the upstream driver fails to tickle some part of the system in the right way! I was not able to reproduce this issue on the #99 Android build, which uses a different driver (I used iperf3 from Termux to run a few tests). Unfortunately, finding out what the correct tickle actually is could still prove very difficult (and there’s no guarantee it is the network driver that provides it).

I would be quite interested to see what the result is but it would be an indulgence… I doubt it will bring about a fix any quicker.

If only it were that much help. It is always useful to get the most detailed possible description of a bug, but the second part is describing the bug to the person who can say: “two minutes, that’s odd, the only thing that happens every two minutes is XXX”. Finding that person is hard :frowning: .


#13

This seems like quite a complex bug, and just hearing someone on the team verify that there is something strange happening with this driver is very important to me.

Otherwise I would have gone off on tangents looking into things that are probably not the issue (e.g. RF interference and router setup). Now I have some idea of what might be going on when we lose connectivity like this.

Besides that, I found a way to get more logging out of the driver by scouring some posts on the mailing lists.
If you remove the module and reinsert it like this: insmod wcn36xx.ko debug_mask=0xffffffff
the driver will give tons of output in your dmesg log. I actually found something quite interesting, but I’m not sure if it’s related.
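As a sketch, the reload sequence can be wrapped up like this (using modprobe rather than an explicit .ko path is my assumption and may not match every setup):

```shell
# Reload wcn36xx with all debug categories enabled.
wcn36xx_debug_reload() {
    modprobe -r wcn36xx
    modprobe wcn36xx debug_mask=0xffffffff
}

# On the board: wcn36xx_debug_reload && dmesg -w | grep wcn36xx
```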

It looks like the driver is trying to scan for 5GHz networks even though ours is a 2.4GHz-only card:
[ 906.859852] wcn36xx: HAL >>> 00000000: 06 00 00 00 09 00 00 00 38
[ 906.867046] wcn36xx: SMD <<< 00000000: 07 00 00 00 15 00 00 00 00 00 00 00 0b 7e 6e 8b 0f 00 00 00 57
[ 906.867077] wcn36xx: SMD command completed in 4ms
[ 906.903164] wcn36xx: hal end scan channel 56
[ 906.903179] wcn36xx: HAL >>> 00000000: 08 00 00 00 09 00 00 00 38
[ 906.903830] wcn36xx: SMD <<< 00000000: 09 00 00 00 0c 00 00 00 00 00 00 00
[ 906.903856] wcn36xx: SMD command completed in 0ms
[ 906.903865] wcn36xx: hal start scan channel 60
[ 906.903874] wcn36xx: HAL >>> 00000000: 06 00 00 00 09 00 00 00 3c
[ 906.911017] wcn36xx: SMD <<< 00000000: 07 00 00 00 15 00 00 00 00 00 00 00 0b 7e 6e 8b 0f 00 00 00 57

I know that the WCN3620/3660/3680 utilize the same driver, but could we possibly be reporting ourselves as a card with capabilities that we don’t have, causing the issue? I know it’s a real stab in the dark but I hope it helps.

-Rob


#14

So I decided to give the latest OpenEmbedded build a try to see if WiFi was a little more consistent. I am running the latest OpenEmbedded build located here: https://builds.96boards.org/snapshots/dragonboard410c/linaro/openembedded/morty/rpb/latest/

Turns out it exhibits most of the same findings as last week (frequent disconnects, high ping times).

The OE build does exhibit a timeout/auth issue that is not seen in the Debian distro:
root@dragonboard-410c:~# [ 1981.081587] wlan0: authenticate with ac:9e:17:7f:57:f0
[ 1981.126768] wlan0: send auth to ac:9e:17:7f:57:f0 (try 1/3)
[ 1981.330108] wlan0: send auth to ac:9e:17:7f:57:f0 (try 2/3)
[ 1981.534101] wlan0: send auth to ac:9e:17:7f:57:f0 (try 3/3)
[ 1981.738205] wlan0: authentication with ac:9e:17:7f:57:f0 timed out
[ 1993.484093] wlan0: authenticate with ac:9e:17:7f:57:f0
[ 1993.523771] wlan0: send auth to ac:9e:17:7f:57:f0 (try 1/3)
[ 1993.726115] wlan0: send auth to ac:9e:17:7f:57:f0 (try 2/3)
[ 1993.930136] wlan0: send auth to ac:9e:17:7f:57:f0 (try 3/3)
[ 1994.134170] wlan0: authentication with ac:9e:17:7f:57:f0 timed out
[ 2004.039973] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 2005.746710] wlan0: authenticate with ac:9e:17:7f:57:f0
[ 2005.786838] wlan0: send auth to ac:9e:17:7f:57:f0 (try 1/3)
[ 2005.990157] wlan0: send auth to ac:9e:17:7f:57:f0 (try 2/3)
[ 2006.194144] wlan0: send auth to ac:9e:17:7f:57:f0 (try 3/3)
[ 2006.398245] wlan0: authentication with ac:9e:17:7f:57:f0 timed out
[ 2018.127251] wlan0: authenticate with ac:9e:17:7f:57:f0
[ 2018.167481] wlan0: send auth to ac:9e:17:7f:57:f0 (try 1/3)
[ 2018.370299] wlan0: send auth to ac:9e:17:7f:57:f0 (try 2/3)
[ 2018.574211] wlan0: send auth to ac:9e:17:7f:57:f0 (try 3/3)
[ 2018.778206] wlan0: authentication with ac:9e:17:7f:57:f0 timed out
[ 2029.041306] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 2033.023120] wlan0: authenticate with ac:9e:17:7f:57:f0
[ 2033.062778] wlan0: send auth to ac:9e:17:7f:57:f0 (try 1/3)
[ 2033.266226] wlan0: send auth to ac:9e:17:7f:57:f0 (try 2/3)
[ 2033.470244] wlan0: send auth to ac:9e:17:7f:57:f0 (try 3/3)
[ 2033.674236] wlan0: authentication with ac:9e:17:7f:57:f0 timed out

EDIT: Some more weirdness from the latest OpenEmbedded build: when removing and reinserting the wcn36xx module we get a panic stating “HAL TX failed” and “Failed to push NV to chip”. You can see the latest log here:


#15

Sorry for not responding earlier; apparently the new forum doesn’t automatically send an email notification when threads are updated.

This is very much in line with the previously reported issue that NetworkManager will scan for networks every minute or so. This previously caused a 15-second hiccup, but with the introduction of

03c95dbef642 (“wcn36xx: Implement cancel_hw_scan”)

I got this down to < 1s. I do, however, remember seeing some delays in the serving of other traffic while testing this.

To confirm this you should be able to trigger this behavior by just issuing:

iw dev wlan0 scan

IIRC, the command sequence between the driver and the firmware is (now) identical for this operation between the two drivers, but I do expect this to be a driver/firmware interaction issue in some way.

Do you have the downstream driver set up so that you can confirm that there are no traces of the scan in the latency numbers during a scan?

Regards,
Bjorn


#16

I started seeing this issue recently as well, but I thought it was related to me switching out my older Apple AP for a Google WiFi.

I created a bug report to track this particular issue here: https://bugs.96boards.org/show_bug.cgi?id=538

Regards,
Bjorn


#17

This is a known bug, which was fixed in efad8396e906 (“wcn36xx: Close SMD channel on device removal”) and backported to the Linaro tree; if it is not already in a release, it should show up soon.


#18

> Do you have the downstream driver setup so that you can confirm that there are no traces of the scan in the latency numbers during a scan?

I am running the most recent snapshot on my DB410C, but can flash any version that is helpful for diagnosing this issue. What is a good version to use as a test for this issue?

> To confirm this you should be able to trigger this behavior by just issuing:
> iw dev wlan0 scan

I have been able to confirm that scanning for networks while pinging increases latency each time the command is executed. Is there a way to disable this automatic scan so as to lower the latency experienced during use?
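One thing I may try (an assumption on my part, not something confirmed in this thread) is pinning the NetworkManager connection to the AP’s BSSID, which should give it less reason to scan for roaming candidates; the connection name and MAC below are placeholders:

```shell
# Pin a NetworkManager WiFi connection to a specific access point.
pin_bssid() {
    nmcli connection modify "$1" 802-11-wireless.bssid "$2"
}

# Example (placeholder values): pin_bssid my-wifi ac:9e:17:7f:57:f0
```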

> I started seeing this issue (auth timeout) recently as well, but I thought it was related to me switching out my older Apple AP for a Google WiFi.

Glad to help narrow down the issue; WiFi problems can make you question even the slightest change in configuration! :smile:


#19

Our OE build config was updated on 6/23 with this fix, and no build has been triggered since… so the next build will indeed have the fix!


#20

In Android builds there is a file named WCNSS_qcom_cfg.ini that contains user preferences which override the factory defaults in the WLAN driver. This file is usually found in the /data/misc/wifi directory. Does the driver on Debian look for parameter overrides in this file?

I ask because one of the issues we see is that the idle-scan mode of the radio ends up causing latency issues when the radio is in use. In the WCNSS_qcom_cfg.ini file there is a parameter that might be able to turn this idle-scan mode off:

# Enable/Disable Idle Scan
gEnableIdleScan=0

Is this possible in the Debian distribution?