Inconsistent WiFi


#41

If you would expect >1000ms ping times with any single tuner wifi, then why would linaro use networkmanager at all?

Why don’t we see this with Android if it is a fundamental issue with single tuner wifi?


#42

Ultimately NetworkManager is the tool that gves us roaming support.

Ultimately scanning decisions are heuristics and NM and Android can
implement different approaches. The last time I looked into this there
was a very active discussion within NetworkManager on alternative
heuristics. In fact it looks like that work has just landed in the
upstream development branches:
https://bugzilla.gnome.org/show_bug.cgi?id=766482

Could still be a while before it hits a released image though…


#43

We don’t expect 1s ping times due to NetworkManager, this is much more likely the result of you running wcn36xx with power save enabled; which does trigger some bug showing up as ridiculous rtt. Try running “iw dev wlan0 set power_save off” and there’s a good chance you’re seeing a huge improvement.

The network manager periodically issue a network scan, which does cause a measurable difference in the throughput and latency, but not on the scale affecting TCP traffic.


#44

Here is what I see with power save off, network manager on

root@:~# iw dev wlan0 set power_save off
root@:~# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=57 time=925.628 ms
64 bytes from 8.8.8.8: seq=1 ttl=57 time=12.352 ms
64 bytes from 8.8.8.8: seq=2 ttl=57 time=12.170 ms
64 bytes from 8.8.8.8: seq=3 ttl=57 time=12.537 ms
64 bytes from 8.8.8.8: seq=4 ttl=57 time=12.217 ms
64 bytes from 8.8.8.8: seq=5 ttl=57 time=14.719 ms
64 bytes from 8.8.8.8: seq=6 ttl=57 time=952.531 ms
64 bytes from 8.8.8.8: seq=7 ttl=57 time=11.568 ms
64 bytes from 8.8.8.8: seq=8 ttl=57 time=11.998 ms
64 bytes from 8.8.8.8: seq=9 ttl=57 time=1012.062 ms
64 bytes from 8.8.8.8: seq=10 ttl=57 time=12.633 ms
64 bytes from 8.8.8.8: seq=11 ttl=57 time=11.972 ms
64 bytes from 8.8.8.8: seq=12 ttl=57 time=11.848 ms
64 bytes from 8.8.8.8: seq=13 ttl=57 time=11.751 ms
64 bytes from 8.8.8.8: seq=14 ttl=57 time=11.758 ms
64 bytes from 8.8.8.8: seq=15 ttl=57 time=12.942 ms
64 bytes from 8.8.8.8: seq=16 ttl=57 time=1012.024 ms
64 bytes from 8.8.8.8: seq=17 ttl=57 time=11.984 ms
64 bytes from 8.8.8.8: seq=18 ttl=57 time=1012.240 ms
64 bytes from 8.8.8.8: seq=19 ttl=57 time=15.943 ms
64 bytes from 8.8.8.8: seq=20 ttl=57 time=11.678 ms
64 bytes from 8.8.8.8: seq=21 ttl=57 time=11.872 ms
64 bytes from 8.8.8.8: seq=22 ttl=57 time=564.177 ms


#45

same build, disable NetworkManager.service
ip link set wlan0 up
wpa_supplicant -B -D wext -i wlan0 -c /etc/wpa_supplicant.conf
dhclient wlan0

root@:~# iw dev wlan0 set power_save off
root@:~# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=57 time=11.601 ms
64 bytes from 8.8.8.8: seq=1 ttl=57 time=11.482 ms
64 bytes from 8.8.8.8: seq=2 ttl=57 time=11.448 ms
64 bytes from 8.8.8.8: seq=3 ttl=57 time=11.491 ms
64 bytes from 8.8.8.8: seq=4 ttl=57 time=11.523 ms
64 bytes from 8.8.8.8: seq=5 ttl=57 time=13.511 ms
64 bytes from 8.8.8.8: seq=6 ttl=57 time=13.004 ms
64 bytes from 8.8.8.8: seq=7 ttl=57 time=11.780 ms
64 bytes from 8.8.8.8: seq=8 ttl=57 time=11.616 ms
64 bytes from 8.8.8.8: seq=9 ttl=57 time=12.711 ms
64 bytes from 8.8.8.8: seq=10 ttl=57 time=11.579 ms
64 bytes from 8.8.8.8: seq=11 ttl=57 time=11.516 ms
64 bytes from 8.8.8.8: seq=12 ttl=57 time=11.395 ms
64 bytes from 8.8.8.8: seq=13 ttl=57 time=11.466 ms
64 bytes from 8.8.8.8: seq=14 ttl=57 time=11.426 ms
64 bytes from 8.8.8.8: seq=15 ttl=57 time=11.383 ms
64 bytes from 8.8.8.8: seq=16 ttl=57 time=11.368 ms
64 bytes from 8.8.8.8: seq=17 ttl=57 time=11.667 ms
64 bytes from 8.8.8.8: seq=18 ttl=57 time=11.376 ms
64 bytes from 8.8.8.8: seq=19 ttl=57 time=11.620 ms
64 bytes from 8.8.8.8: seq=20 ttl=57 time=12.670 ms
64 bytes from 8.8.8.8: seq=21 ttl=57 time=11.575 ms
64 bytes from 8.8.8.8: seq=22 ttl=57 time=11.549 ms
64 bytes from 8.8.8.8: seq=23 ttl=57 time=11.396 ms
64 bytes from 8.8.8.8: seq=24 ttl=57 time=11.458 ms
64 bytes from 8.8.8.8: seq=25 ttl=57 time=11.421 ms
64 bytes from 8.8.8.8: seq=26 ttl=57 time=11.412 ms
64 bytes from 8.8.8.8: seq=27 ttl=57 time=11.424 ms
64 bytes from 8.8.8.8: seq=28 ttl=57 time=12.370 ms
64 bytes from 8.8.8.8: seq=29 ttl=57 time=25.271 ms
64 bytes from 8.8.8.8: seq=30 ttl=57 time=11.516 ms
64 bytes from 8.8.8.8: seq=31 ttl=57 time=12.590 ms
64 bytes from 8.8.8.8: seq=32 ttl=57 time=11.633 ms
64 bytes from 8.8.8.8: seq=33 ttl=57 time=13.570 ms
64 bytes from 8.8.8.8: seq=34 ttl=57 time=11.742 ms
64 bytes from 8.8.8.8: seq=35 ttl=57 time=13.009 ms
64 bytes from 8.8.8.8: seq=36 ttl=57 time=15.340 ms
64 bytes from 8.8.8.8: seq=37 ttl=57 time=11.456 ms


#46

Note that at each scan, 37 channels are scanned.

$ echo 0x400 > /sys/module/wcn36xx/parameters/debug_mask
$ dmesg
wcn36xx: mac80211 scan 37 channels worker

Looking at the driver, we can find the scan method (wcn36xx_hw_scan_worker):
for (i = 0; i < req->n_channels; i++) {
...
wcn36xx_smd_start_scan(wcn, req->channels[i]->hw_value);
msleep(30);
wcn36xx_smd_end_scan(wcn, req->channels[i]->hw_value);
...
}

So we can deduct that the scan loop will take about 1110ms (30ms * 37 channels).
I would say that this seems to match the different observations.
That said, I don’t know what is really done inside the firmware, if some sort of antenna time-sharing is performed.

The remaining question is why scanning 37 channels since there are only 14 channels in the 2.4Ghz band.

Debug shows that the driver actually try to scan 5Ghz channels:
[ 1908.495625] wcn36xx: hal start scan channel 14
[ 1908.538978] wcn36xx: hal end scan channel 14
[ 1908.539629] wcn36xx: hal start scan channel 36
[ 1908.586980] wcn36xx: hal end scan channel 36

This should not happen with this 2.4Ghz(only) chip (we could save 700ms).
iw phy0 info reports 5GHz support.


#47

It looks like there is already a bug report for 5GHz.

https://bugs.96boards.org/show_bug.cgi?id=493

and there was a quick failed attempt and removing 5GHz. Can this be picked up again?


#48

Yes this one seems coming from a different root cause since disabling 5GHz band capability from driver src does not seem to help.

However, for this particular scan issue, I think adding some chip/band-support detection could help.
We are going to have a look.


#49

Thanks for looking into it. As of now this is killing us. How difficult do you think it will be to add chip/band-support detection?


#50

Patch is currently under review, latency during scan is reduced as expected.


#51

Hi @Loic,

I’ve integrated your patch that is under review for the chip/band support detection, and we still experience the reboot issue when pushing UDP packets through the interface with iperf. However, it seems that it is a little bit more reliable from the standpoint that it won’t try to connect to a network that it cannot support.

The iperf test in Inconsistent WiFi will trigger the bug if the -t switch is increased to 5 Minutes (300s). It seems that sustained activity at 30s is no longer able to trigger the bug reliably.


#52

fwiw, the patch has not been merge into linux-next, so it will show up in the next Linux kernel version, and it was backported into the Linaro kernel branch (based on 4.9), so it will be included in the next Linaro release as well.


#53

Hi @ndec,

That’s good news. I’ve been following the work that is on going in the Linaro Git repos, so I’ve been backporting some of the new patches into the stable branch #260 build to see if anything improves in the way of wifi support.

I’ve been messing with the following patches from ath-next, qcom’s codeaurora rpb repo, and upstream linaro and applying them to the #260 build that I’ve been using as a baseline:

The result is that the WiFi adaptor is a little bit more stable, however the UDP bug still exists and can be triggered with iperf with around ~3-5 minutes of sustained activity. I’ve also seen that the wifi module will stall when it doesn’t crash and cause a loss of connectivity. As @2048Bit has said before, this is killing us and making it hard to work with this particular chip. Hopefully this helps shed some additional light on some of the issues that we are having.

I’ll continue to mess around with this and will share if I find a particular configuration that works reliably.

-Rob


#54

Has anyone tried building and using Dan Williams’ version of network manager from Daniel T’s post here:

I’m building it now inside of build #260, along with the patches that I found and cataloged.

FYI - this is the repo - https://cgit.freedesktop.org/NetworkManager/NetworkManager/log/?h=dcbw/wifi-scan


#55

This hack is known to eliminate the crash with UDP - https://source.codeaurora.org/quic/imm/imm/kernel/commit/?h=IMM.LE.1.0&id=9deb3c88a455290c31aa5d93953a45a09809eb5c


#56

I’ve actually tried that one. It makes the crash harder to reproduce but it is definitely there. Try running iperf for an extended period of time when that patch is incorporated.


#57

Thanks Rob, you refer to the ‘firmware/remoteproc crash’ right ? not a Kernel one ?


#58

@Loic

To be honest, I’m not quite sure what is causing the complete reboot since I do not have any logging facilities when running the system like this.

It is possible that it is a firmware/remoteproc crash, but am I certain? No. All I know is that when running into the issue the entire system restarts without any appreciable error or warning. I don’t have a license for ARM Coresight but I do have a proper JTAG port. So if CoreSight is enabled, and I believe it is I could perhaps get some additional information.

Has anyone tried to reproduce this with an ARM debugger attached via Coresight STM? If not perhaps it should be done?


#59

Ok thanks rob, seems that this topic includes different issues, the best way to track them is to open a ticket on bugzilla to track them separately.


#60

@Loic,

I’ve posted the bug here:
https://bugs.96boards.org/show_bug.cgi?id=609