WILC1000 firmware floods packets when multiple stations are connected to third party AP #6
Microchip Salesforce case number 00374724. |
Thank you very much for your effort and insight! We have basically the same issue rendering any WiFi Network in range unable to send data or even be seen, as soon as ~10 WILC3000 access points are switched on (even without traffic). Our Microchip Salesforce case number is 00328657 (Date/Time Opened | 9/6/2018 10:39 AM) where we also supplied packet capture data. If not, I'm considering reporting this "network jamming" feature to the German RF-network authorities so they can effectively ban the hardware from being sold. |
@ZonKnoZ We would be interested in seeing your issue report details posted here too. Please share the details, including packet capture files, so that when Microchip eventually post a "fix" we can all test it knowing the likely failure modes. I have suspected that this issue will be provoked by any AP in range, even with the WILC1000 stations not connected to it, however we have already spent vast amounts of time attempting to diagnose this problem. I do not have time to allocate to further digging. It sounds like your experience has been similar to ours. This is a very simple setup to recreate and while the details of the problem are difficult to determine, the effect is very very obvious. Hopefully our combined efforts will motivate an appropriate response. |
Thanks @tsifb for the detailed report. |
@AdhamAbozaeid I see they have replied in Salesforce saying they will investigate. I hope that will be given high priority. Are there any plans to open source the firmware for the WILC modules? While we do not really want to be inside the module firmware, it might be useful for our own wifi experts to be able to inspect the code. Currently the issues with this wifi module are stalling our product release, and it is now critical path for us. |
@tsifb , there are currently no plans to open source the firmware. |
I can confirm that something happens when more than one WILC1000 device is in one wifi network. |
@MTaulin your observations match ours. Also, if you closely inspect the on-air traffic with a single WILC1000 active, there will still be erroneous packets and you will observe some incorrectly re-transmitted legitimate packets. This does slow the throughput from what the device is capable of. Using firmware and driver from more than 12 months ago, we saw significantly higher throughput - however there were other serious bugs that made that combination unusable. We also observed packets with source and destination addresses that matched the AP (TP-link). We proved without doubt that these packets were NOT transmitted from the AP, but in fact were sent by the WILC1000 modules. You can prove this using a packet sniffer (we use an rtl8812AU device). If it is located very close to the WILC1000, and distant from the AP, you will be able to observe different per-packet RSSI per transmitter. Strong RSSI will indicate that the WILC1000 is transmitting (the radiotap header includes the RSSI value). |
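The RSSI comparison described above can be sketched in a few lines. This is only an illustration: the function name, the input format (frame number plus RSSI already extracted from the radiotap header), and the -40 dBm threshold are assumptions, not values from the capture tooling. The sample values mirror the signal levels quoted elsewhere in this thread (-26 to -28 dBm near the module, -82 dBm for a distant transmitter).

```python
# Hypothetical helper: label each captured frame as coming from a transmitter
# near the sniffer or far from it, based only on per-packet RSSI.
NEAR_THRESHOLD_DBM = -40  # assumed cut-off; tune it for your own bench layout


def attribute_by_rssi(frames, near_threshold_dbm=NEAR_THRESHOLD_DBM):
    """frames: iterable of (frame_number, rssi_dbm) pairs.

    Returns a dict mapping frame_number to "near" or "far".
    """
    labels = {}
    for number, rssi_dbm in frames:
        labels[number] = "near" if rssi_dbm >= near_threshold_dbm else "far"
    return labels


# Illustrative input only, using signal levels of the kind reported in this issue.
sample = [(54, -82), (55, -26), (56, -28)]
labels = attribute_by_rssi(sample)
print(labels)
```

With the sniffer placed next to the WILC1000 and far from the AP, frames labelled "near" would most likely have been transmitted by the module, whatever MAC address they carry.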
@AdhamAbozaeid I have had no solid feedback via salesforce (7 days) Can you raise this internally at Microchip to progress the investigating with high priority? @MTaulin's report shows the same symptoms. @ZonKnoZ's issue is likely closely related or the same. |
@tsifb the support team is currently working on reproducing the issue and will get back to you soon with feedback. |
@rocky134 mentioned our problem in issue #36 as well. Meanwhile, we are making plans to drop support of the WILC3K altogether and switch to a USB dongle. The sad part is that we are going to ditch 1K+ WILC3000 chips due to this issue... @tsifb you can contact me on my GitHub email address to discuss the issue |
@ZonKnoZ , Work on WILC is prioritized according to the open tickets on salesforce. There have been a lot of improvements and bug fixes implemented since release 15.0, and more stability implemented on 15.2 (current dev branch) that's being released on February 11th. Thanks, |
Thanks for the update. This is very important info because we need to handle this situation and base our business decisions on something... |
@ZonKnoZ Please share your findings on alternate parts. We are in the same position as you, and are considering alternatives as well. We also have a spare USB port that connects out through a small daughter board, which we could probably remake with a USB-connected wifi chipset. |
@AdhamAbozaeid, are you directly involved in or with the Engineering team working on this firmware issue? We would appreciate some more frequent updates with some details of your progress towards a fix. Everyone following this ticket understands that Salesforce tickets need to be prioritized. Can you tell us all what priority level has been placed in this issue? Also, we are watching GitHub closely (both Driver and Firmware repos), currently running the latest from the dev branches, including the latest firmware 15.2 release candidate. We are able to update to any new commits and re-test very quickly, if this is of any assistance. |
@tsifb There are several WILC3000 APs active and I see quite a few retransmissions (starting with Frame 100). Nevertheless it would be nice to have a software solution we could distribute OTA instead of replacing hardware. The USB replacement is just a measure to "fix" the problem of already produced units. |
@tsifb My team is responsible for the driver/FW development for WILC1000/WILC3000. We started working on this issue few days ago. I'll post updates once I have reasonable findings. |
@ZonKnoZ the content of your capture is very very similar to my single unit capture. @AdhamAbozaeid note this characteristic. In this case, the WILC module does not send erroneous CF-END frame, but re-transmits a (valid?) packet several times (with the R flag set indicating a retry, and the same sequence number as the original). While the on-air characteristics are different to the multi unit CF-END bursts, my experience and best guesswork tells me that this is likely the same root cause. |
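To make the retry pattern above easier to spot in a capture, here is a minimal sketch that flags frames with the Retry bit set and a sequence number already seen from the same transmitter. It assumes raw 802.11 management or data frames with the radiotap header already stripped and a standard 24-byte MAC header. The helper names and the synthetic frames are hypothetical, not taken from either capture.

```python
import struct


def parse_retry_info(frame: bytes):
    """Return (retry_flag, sequence_number, transmitter) for a mgmt/data frame."""
    if len(frame) < 24:
        raise ValueError("expected at least a 24-byte 802.11 MAC header")
    retry = bool(frame[1] & 0x08)           # Retry is bit 3 of the second FC byte
    transmitter = frame[10:16]              # Address 2 (transmitter address)
    (seq_ctrl,) = struct.unpack_from("<H", frame, 22)
    return retry, seq_ctrl >> 4, transmitter  # upper 12 bits = sequence number


def find_retries(frames):
    """Indices of frames that repeat an already-seen (transmitter, sequence) pair."""
    seen = set()
    retries = []
    for index, frame in enumerate(frames):
        retry, seq, transmitter = parse_retry_info(frame)
        if retry and (transmitter, seq) in seen:
            retries.append(index)
        seen.add((transmitter, seq))
    return retries


def make_frame(retry, seq, transmitter=b"\x02\x00\x00\x00\x00\x01"):
    """Build a synthetic Probe Response header, for demonstration only."""
    frame_control = bytes([0x50, 0x08 if retry else 0x00])
    receiver = b"\x02\x00\x00\x00\x00\x02"
    return (frame_control + b"\x00\x00" + receiver + transmitter
            + transmitter + struct.pack("<H", seq << 4))


frames = [make_frame(False, 100), make_frame(True, 100),
          make_frame(True, 100), make_frame(False, 101)]
retry_indices = find_retries(frames)
print(retry_indices)
```

Control frames such as ACK and CF-End have shorter headers and no sequence control field, so they would need to be filtered out before calling this.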
@tsifb , this is a normal behavior at packet 100. WILC will retransmit unicast packets if the receiver didn't send back an ACK frame |
@AdhamAbozaeid what is the timeout in microseconds before a Probe Response retry is sent? Our observation is that the WILC module sends retries after a much shorter period than other devices (as seen in @ZonKnoZ 's capture). Also, in my single-unit iperf capture, there are significant numbers of QoS-Data frame retries. @ZonKnoZ can you tell us your setup to capture these packets, including the hardware device you used? |
@tsifb I can see the CF sequence happening even after probe responses directed from the AP to other station, not only after re transmissions from WILC |
@AdhamAbozaeid I understand the timing may not be accurate. I don't have immediate access to the hardware needed to get more accurate timings. What is the actual timeout value / calculation used in the WILC firmware? I did notice the WILC management frames in @ZonKnoZ 's capture were at 6Mbps. I expected these to be 1Mbps. I'm not sure if this is another useful observation. Unsure of the exact configuration in his case. It might be worth reviewing the tx data rates used by the WILC module for the management frames. Regarding the CF-End Frames, note the signal strength in the radio tap header. These are ALL very strong (-26 to -28dBm), so must have been transmitted by the WILC module. (Ignore the MAC addresses, which I believe are most likely incorrect) Maybe configure a WILC to not actively probe, and configure it to connect to a non-existent SSID, single channel (which should mean the WILC is quiet). Monitor it to see if it can be triggered into sending erroneous frames by other activity on the same channel from a network using a different SSID? I would like to help out directly with this, but I am unable to do so right now. |
@tsifb The hostapd AP config is done as follows:
I was scanning at work with 10 WILC APs active. It is hard to find a spot with less than 50 access points active in the neighbourhood and I have no Faraday cage available, but I might try to find a less noisy spot. |
@tsifb, the ACK timeout ranges between 33 to 207 usec. I have also changed the location of WILC a few times and I can see the RSSI captured by the sniffer changes proportionally, so I agree that the packets might be coming out of WILC |
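As a rough way to compare capture timings against the 33 to 207 usec window quoted above, the sketch below computes the idle gap between the end of one transmission and the start of its retry, then classifies it. The function names, the input format, and the sample timestamps are all assumptions for illustration. Sniffer timestamps may not be accurate enough for this comparison, as already noted in this thread.

```python
# ACK timeout window as quoted in this thread (microseconds).
ACK_TIMEOUT_MIN_US = 33
ACK_TIMEOUT_MAX_US = 207


def retry_gaps(transmissions):
    """transmissions: ordered list of (start_us, end_us) for a frame and its retries.

    Returns the idle gap between each transmission's end and the next one's start.
    """
    return [next_start - prev_end
            for (_, prev_end), (next_start, _) in zip(transmissions, transmissions[1:])]


def classify_retry_gap(gap_us):
    """Label a gap relative to the quoted ACK timeout window."""
    if gap_us < ACK_TIMEOUT_MIN_US:
        return "early"   # retry started before the shortest quoted timeout
    if gap_us > ACK_TIMEOUT_MAX_US:
        return "late"    # longer than the window, e.g. after backoff
    return "within"


# Synthetic timestamps, for demonstration only.
gaps = retry_gaps([(0, 100), (120, 220), (300, 400)])
labels = [classify_retry_gap(g) for g in gaps]
print(gaps, labels)
```

A gap labelled "early" would be worth a closer look, since it suggests the retry began before an acknowledgement could have timed out.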
Given the relative RSSI difference in my captures, it is proven without doubt that the packet comes from the WILC module. I see @ZonKnoZ 's configuration has hw_mode=g which explains the mgmt frames sent at the lowest g rate. |
@ZonKnoZ I have used the AR9271 in another product. We considered testing it in this product - but decided on another device. We have samples enroute to us now. We intend to integrate its driver and initially use this a test comparison against the WILC, with some packet captures that we can all use as reference. |
@AdhamAbozaeid today is day 26 since reporting this issue. We have product waiting to ship, and new production runs are on hold. |
It's 190 days for me, since I communicated this issue towards Microchip the first time. I know that being asked about progress might be stressful while coping with complex dev/debug tasks, but the same goes for waiting without any feedback. @AdhamAbozaeid I'm really sorry that you have to cope with customers as well as with development at the same time. It seems that you are a lone wolf @microchip handling this, which I hope is just a false & provocative assumption causing an official statement on this matter that might help handling this from a business perspective... |
@AdhamAbozaeid I see you just released the firmware 15.2 - does this include a fix for this issue? I don't see any additional commits. |
@ZonKnoZ, @tsifb release 15.2 doesn't include a fix for this issue. This is the same as the dev branch that was under QA to be released. I assure you that this issue is currently getting highest priority, so thanks for understanding. So far, I was able to reproduce the problem, and I could confirm that these packets aren't generated nor received from the FW, same as all medium control packets that requires accurate timing that can't be handled by SW. As for the estimate for a fix, it will be hard to give an accurate estimate given the current information I have for now, but I'd say it would take roughly 2~3 weeks to find a fix, plus 2~3 more weeks for QA sanity checks for an engineering release with a fix to be available. |
Just compiled the latest dev driver for a 4.9.59 kernel
I started testing the setup and will try to replicate our previous problems.
with a single device traffic seems to flow ok:
Unfortunately the whole system froze shortly after transferring those ~200MB with no output I could capture leading to this. Would the wilc debug port output help if i reproduce and record this? The stability and congestion tests with multiple devices will take a bit more time. @AdhamAbozaeid which kernels are officially supported (the Linux user guide describes only 4.9 for SAMA5D4)? |
@ZonKnoZ , the changes submitted to the dev branch don't have fixes for the CF-End problem, nor the uptime yet. We use SAMA5D4 with kernel 4.9 for testing, but note that the dev branch is fully tested at official releases. All intermediate commits are sanity checked only. |
All, the attached firmware has a tentative fix for the CF-End flood packets. You can use it to verify the test as we previously discussed. |
@AdhamAbozaeid , here we go:
Then I just connect a tablet to the AP and start to stream a youtube video. Freeze comes after a while. Btw. I'm aware that this is a bleeding edge test case here + I use a WILC3000 , so sorry but can't test it. |
@ZonKnoZ , Do you see the same issue with the master, or on the dev only? Do you have the driver/FW logs? |
We have built in the build 11050 firmware and are testing. @AdhamAbozaeid Are there additional issues / fixes that you are aware of that will be included in the next release? |
Most of the changes will be in the driver to align with WILC's driver on the kernel staging tree, in addition to:
|
@AdhamAbozaeid My question was about the next firmware release, although if there are driver changes that need to also be included at the same time - that would be useful to know. I am trying to assess which combinations we should be focusing on testing, and how much effort to burn on them. I see there are a bunch of new driver dev branch commits. We are also working in integrating these into our latest internal dev branch, which is on a slightly newer linux kernel. Your input would be helpful on this. As part of the testing, we will be carefully observing the debug output from the driver and WILC module serial port - if we notice anything suspicious, we will open new issues here to get your feedback. |
Testing Update: Yesterday with 5 units connected to the AP has shown a large improvement in usability. I do have some concerns about performance where the AP to Station signal strength is lower (~-75dBm) - the throughput drops more than I expected, even at the reduced data rate. More investigation is needed here. Also the ability to maintain a somewhat usable ssh connection to one unit, while a file transfer / iperf is operating with another unit is not what I expect. It seems like the file transfer (TP-link AP-> WILC Station) does significantly delay packets to other units (on the order of several seconds to 10s of seconds.) Further testing in the next few day: |
WILC FW and driver releases are always aligned. |
@AdhamAbozaeid did you build a new WILC3000 binary set too? @Mateusz-Gwara may be able to test as well? @Mateusz-Gwara we would appreciate your eyes on this new firmware too, if that is possible ;) |
@AdhamAbozaeid ok. thanks for that information. |
@tsifb , I just added the WILC3000 FW as well to my comment above |
I still observe some CF-End frames being transmitted by the WILC module. |
@tsifb , these CF-End packets are legitimate. They are used by STAs to indicate the truncation of an obtained TXOP. You can refer to the "Truncation of TXOP" section in the 802.11 standard.
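For anyone filtering captures for these frames, a CF-End frame can be recognised from the first byte of the 802.11 Frame Control field: type 1 (control) with subtype 14, or subtype 15 for CF-End+CF-Ack. The sketch below decodes that byte. The function names are hypothetical, and it assumes the radiotap header has already been removed.

```python
# Control-frame subtypes relevant to this discussion (802.11 Frame Control field).
CONTROL_SUBTYPES = {11: "RTS", 12: "CTS", 13: "ACK",
                    14: "CF-End", 15: "CF-End+CF-Ack"}


def decode_frame_control(first_byte):
    """Split the first Frame Control byte into (version, type, subtype)."""
    version = first_byte & 0x03
    frame_type = (first_byte >> 2) & 0x03
    subtype = (first_byte >> 4) & 0x0F
    return version, frame_type, subtype


def is_cf_end(first_byte):
    """True for CF-End and CF-End+CF-Ack control frames."""
    _, frame_type, subtype = decode_frame_control(first_byte)
    return frame_type == 1 and subtype in (14, 15)


for byte in (0xE4, 0xD4, 0x50, 0x88):
    print(hex(byte), decode_frame_control(byte), is_cf_end(byte))
```

This only identifies the frame type. Whether a given CF-End is a legitimate TXOP truncation or part of an erroneous burst still has to be judged from context, such as timing, repetition, and RSSI.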
Ok, so.... if I disable RTS/CTS by setting the rts threshold to 2346, will I still see the CF-End frames? |
When using mulitple WILCs on the same network, multiple transmissions of CF-End packets were observed on the air coming out from WILC Fixes #6 Signed-off-by: Adham Abozaeid <[email protected]>
@AdhamAbozaeid I noticed the binaries recently added to dev branch have a different build number to the binary you attached here for testing. Are there additional changes? |
No, TXOP is different from RTS/CTS |
The binary provided in this thread (11050) is 15.2 + the fix for this issue. The fix on the dev branch has other fixes on top of 15.2 |
ok yes, I understood that 11050 just fixed this issue. I was referring to the last dev commit a few days ago d66a52f I'm expending resources testing this firmware. I need to understand which binary we should be using so that we don't have to waste time. My assumption is that 11064M = 15.2 release + CF END fix (as per 11050) + something else. I would prefer to be close as possible to the next release candidate - so if 11064M has additional fixes, this is of interest. |
I mean that the fix on the dev. branch has the fix for cf-end along with other fixes on top of 15.2 The engineering release that we are currently running sanity QA on will be release 15.2 + CF-End fix to minimize the changes and avoid introducing further bugs since we don't run full QA on engineering releases. This engineering release will be available in 2~3 weeks through salesforce for users who don't want to wait till the official release is ready. |
@AdhamAbozaeid ok great, thanks for the clarification and timings. That helps us a lot. @Mateusz-Gwara I recommend you try the wilc3000 binaries linked above if you haven't done so already. |
I ran the driver ( a65d074 ) and firmware ( d66a52f ) from the latest dev branch commits respectively.
After a while the device freezes, stops responding, and the console is flooded with this:
|
Are you seeing the same behavior with the firmware binary posted here, with the driver from 15.2 (a65d074 )? |
@AdhamAbozaeid @Mateusz-Gwara we haven't tested AP mode yet (or AP+STA) I will try this sometime over the next few working days. Reliability seems to be good in STA mode.... no crashes / lockups seen so far. |
@AdhamAbozaeid
So far, the firmware posted here seems to be more stable. This is what I see when a client connects with the most promising variant 4:
|
Summary
The WILC1000 module occasionally transmits large volumes of erroneous packets in bursts, even when a device should be almost completely idle. The problem is more prevalent when more than one WILC1000 device is associated with the same AP, as if the devices are interacting with each other and triggering this condition.
The setup is a TP-link or Netgear AP, with WILC1000 modules on custom PCBs connected to SAMA5D42 running linux kernel 4.9.52 and the linux4wilc driver.
The WILC1000 modules are configured only as Stations. (no AP is configured in the WILC1000)
The problem has been initially observed as intermittent poor throughput when loading ~100MB root filesystem images into the custom PCB over Wi-Fi. The problem appears to be much worse when multiple units are operating, even when only loading the file into one unit at a time.
We have made a very detailed analysis of the Wi-Fi behaviour, including using an independent Wi-Fi packet capture device (sniffer), a spectrum analyser, and monitoring the WILC1000 supply current to see when it transmits.
Analysis
Packet Capture file notes:
Setup:
Filename: multiplewilc1000-no-traffic.zip
Multiple WILC1000 stations connected to AP, with no other additional higher level traffic (no pings, no iperf, no file transfers)
Frame 54: This has correct addressing for a CF-End frame, however it was most probably sent from another WILC1000 given the -82dBm signal strength, and it is out of context (no other PCF packets seen).
Frame 55: First Erroneous CF-End Frame.
Frame 56: Second Erroneous CF-End frame.
Frame 57: Third Erroneous CF-End frame, however it is sent by another WILC1000.
Frames 58 - 1893: alternating erroneous CF-End frames transmitted by WILC1000 unit, as seen by inspecting rssi values.
Throughout the rest of the capture log:
Filename: singlewilc1000-iperf-udp30M.zip
Packet capture of a single WILC1000 sending UDP frames from iperf.
When comparing the previous packet capture with a single WILC1000 connected to the SAME Access Point, the blocks of rapidly repeating CF-End frames do not occur, and observed file transfer throughput is much higher, although still not as high as should be possible.
There are lots of re-transmitted frames. For example, Frames 91 and 92 are re-transmissions of Frame 90.
Frame 93: Decodes as a CF-End frame, however is transmitted by the WILC1000 station. CF-End should only be sent from the AP. Note: RSSI is -26dBm, so must have been transmitted by WILC1000
CF-End frames are transmitted by WILC1000, however there are no other PCF packets sent. This indicates that PCF function is not being used, however WILC1000 seems to be transmitting CF-End frames.
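The blocks of rapidly repeating CF-End frames described in these notes can be located automatically once each frame has been labelled. A minimal sketch follows, assuming the caller has already decided which frames are CF-End (for example from the frame type and subtype). The function name, the minimum run length, and the sample data are hypothetical and not taken from the capture files.

```python
def find_cf_end_bursts(frames, min_length=3):
    """frames: iterable of (frame_number, is_cf_end) in capture order.

    Returns a list of (first_frame, last_frame) for each run of at least
    min_length consecutive CF-End frames.
    """
    bursts = []
    run_start = run_end = None
    run_length = 0
    for number, cf_end in frames:
        if cf_end:
            if run_length == 0:
                run_start = number
            run_end = number
            run_length += 1
        else:
            if run_length >= min_length:
                bursts.append((run_start, run_end))
            run_length = 0
    if run_length >= min_length:  # close a run that reaches the end of the capture
        bursts.append((run_start, run_end))
    return bursts


# Synthetic labels, for demonstration only: frames 55 to 59 are CF-End.
sample = [(n, 55 <= n <= 59) for n in range(50, 63)]
bursts = find_cf_end_bursts(sample)
print(bursts)
```

Comparing the number and length of bursts between the multi-unit and single-unit captures gives a simple way to check whether a firmware change reduces the flooding.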
Conclusion
When multiple WILC1000 Stations are close, they trigger each other's firmware into transmitting junk packets very rapidly. A single WILC1000 will not flood the network, however it does send incorrect junk frames and will re-transmit many otherwise valid frames, sometimes so quickly that an acknowledgement cannot have been received.
This issue renders the WILC1000 unusable or, at best (when a single Station is used), leaves it with poor performance.
We have spent considerable engineering resources on investigation of this and other previous WILC1000 issues. We require urgent and dedicated attention from Microchip engineers to fix this serious issue.
A Salesforce issue will be opened referencing this report.