Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WILC1000 firmware floods packets when multiple stations are connected to third party AP #6

Closed
tsifb opened this issue Jan 18, 2019 · 63 comments

Comments

@tsifb
Copy link

tsifb commented Jan 18, 2019

Summary

The WILC1000 module is occasionally transmitting large volumes of erroneous packets in bursts even when a device should be almost completely idle. The problem is particularly more prevalent when more than one WILC1000 devices are associated with the same AP as if they are interacting with each other, and trigger this condition.

The setup is a TP-link or Netgear AP, with WILC1000 modules on custom PCBs connected to SAMA5D42 running linux kernel 4.9.52 and the linux4wilc driver.

The WILC1000 modules are configured only as Stations. (no AP is configured in the WILC1000)

The problem has been initially observed as intermittent poor throughput when loading ~100MB root filesystem images into the custom PCB over Wi-Fi. The problem appears to be much worse when multiple units are operating, even when only loading the file into one unit at a time.

We have made a very detailed analysis of the Wi-Fi behaviour, including using an independent Wi-Fi packet capture device (sniffer), a spectrum analyser, and monitoring the WILC1000 supply current to see when it transmits.

Analysis

  1. With two devices (or more) connected to the same AP, if either device is generating even a small amount of traffic (i.e. transmitting), then both devices begin rapidly transmitting very significant amounts of traffic when they should be transmitting essentially nothing and sitting idle.
    1. The devices have been verified as transmitting by watching the VBATT supply current using an oscilloscope across a shunt resistor.
      1. Inspection of the oscilloscope waveform show very clearly when a WILC 1000 is transmitting.
    2. Using a Wi-Spy DBx spectrum analyser, the rapid transmitting can be observed.
      1. Typically there will be a burst of rapid transmission that lasts for 10-12 seconds, and can be often followed by shorter (0.5-1s) burst approximately 1-2 seconds after the longer burst.
    3. The data transmitted has been captured using a Wi-Fi sniffer (rtl8812au usb Wi-Fi adapter + wireshark (802.11 + radio tap ) + kali linux)
      1. The packets captured record the RSSI in the radio tap header.
      2. The physical organisation of AP and stations have (significantly) different rssi values at the sniffer.
      3. This gives a per packet record of which device actually transmitted a packet, irrespective of the MAC addresses included in the packet.
    4. Using the Wi-Fi sniffer packet capture, many thousands of irrelevant packets are received
      1. Packets are largely correctly formatted, with correct checksums.
      2. Packets are mostly decoded in wireshark as CF-END frames, although some other almost instantaneous retry packets have also been observed.
      3. Some of the erroneous CF-END packets have invalid 00:00:00:00:00:00 source address (BSSID)
      4. Some of the erroneous CF-END packets have invalid 00:00:00:00:00:00 destination address (RA)
      5. Some of the erroneous CF-END packets have both invalid source and invalid destination addresses of 00:00:00:00:00:00
    5. Some of the erroneous CF-END packets have the access point address in either or both source or destination.
      1. We have absolutely proven without doubt that even though the source MAC address in these packet indicates the source MAC of the AP, it has been transmitted my the WILC1000 station.
      2. The captured packet's RSSI iu the radio tap header can only be that of a WILC1000, not the AP.
        1. The Wi-Spy Spectrum analyser shows the signal strength that can only be the WILC1000
        2. The VBATT current on the oscilloscope shows the WILC 1000 is transmitting during this period
    6. CF-END frames should only be transmitted by the AP (PCF) and contain the AP MAC as source (BSSID) and broadcast address as RA.
    7. During this time throughput performance of the drops significantly / ping packets are significantly delayed
  2. WILC 1000 Devices are running in STA mode only. (no AP is configured)
  3. Problem occurs with both a TP-Link Archer C20 or a Linksys WRT ACS-1900 as the access point.
    1. With the Linksys, problem appears slightly less prevalent.
  4. Problem occurs at intermittent intervals - when just a simple 1s ping from one device to the AP, the problem might occur after a few minutes, and last for 10-12 seconds. (TP-link AP is more frequent)
  5. Problem starts once module connects to AP (i.e. driver, module is up, firmware is loaded and wpa_supplicant running)
    1. Problem does NOT stop when driver is stopped and kernel module is removed from kernel - i.e. firmware is still running independent of host.
    2. Problem only stops once kernel driver is reloaded AND firmware is reloaded i.e. once WILC1000 is reset/restarted.
  6. Problem occurs when using an Microchip (Atmel) SAMA5D4XULT dev board with a Rev 4 WILC1000-SD dev board and wilc_demo image i.e. solely Microchip hardware and firmware.
  7. With only one WILC1000 unit running, the problem is much less prevalent
    1. There are still numerous Wi-Fi retries similar to: RX packets queuing up in wilc1000 firmware driver#25
  8. With multiple WILC100 stations connected, the network becomes almost useless, with very frequent periods of extremely low throughput. (6 stations for example)
  9. Versions:
    1. WILC1000 firmware: 15.2 RC2
    2. linux4wilc driver: commit hash 06108eeb4beb070d2ed870c2ac86aeb65b5c6da0 from -dev branch

Packet Capture file notes:

Setup:

  • Multiple WILC1000 stations connected to TP-Link Access Point.
  • Packet sniffer (RTL8812AU + Wireshark on Kali linux) located very close to WILC1000 MAC f8:f0:05:d0:9f:22
  • AP and Other WILC1000 stations located several meters away from above WILC1000
  • This setup provides VERY clear indication of when 9f:22 transmits because Radiotap header contains the signal strength.
  • Observation of typical (normal) packets in the capture logs show RSSI values for various transmitters at the packer sniffer
    • e.g. Beacon Frames sent from AP show typical basic rate RSSI value.
    • e.g NULL data frames or QOS-Data frames show RSSI from 9f:22.

Filename: multiplewilc1000-no-traffic.zip

Multiple WILC1000 stations connected to AP, with no other additional higher level traffic (no pings, no iperf, no file transfers)

Frame 54: This is correct addressing for CF-END frame, however most probably send from Other WILC1000 due to -82dBm signal strength, and is out of context (no other PCF packets seen)

Frame 55: First Erroneous CF-End Frame.

  • Radio Tap Header Antenna Signal = -24dBm indicating it is transmitted from WILC1000 (MAC: f8:f0:05:d0:9f:22) located a few cm from sniffer.
  • time delta from F54 is 0.000001003s
  • Source MAC address is not correct - this MAC is the AP's MAC
  • Destination MAC is not correct, this is the AP's MAC - and for a CF-END frame, it should be broadcast address

Frame 56: Second Erroneous CF-End frame.

  • Again rssi is -24dbm - sent from 9f:22 WILC1000
  • time delta from F55 is 0.000000620

Frame 57: Third Erroneous CF-End frame, however is sent my another WILC1000.

  • RSSI is -80dBm - much weaker signal, around that observed from other WILC1000.
  • Incorrect addressing.

Frames 58 - 1893: alternating erroneous CF-End frames transmitted by WILC1000 unit, as seen by inspecting rssi values.

  • A very large block of frames - should never be seem in normal PCF operation.
  • each frame is transmitted in very rapid succession, as seem by inspecting the packet delta times.
  • elapsed time from frame 52-1893 = 0.332s / Frames transmitted at average of 5545 pkts/s.

Throughout the rest of the capture log:

  • bursts of erroneous CF-End frames are observed.

Filename: singlewilc1000-iperf-udp30M.zip

Packet capture of a single WILC1000 sending UDP frames from iperf.

When comparing the previous packet capture with a single WILC1000 connected to the SAME Access Point, the blocks of rapidly repeating CF-End frames do not occur, and observed file transfer throughput is much higher, although still not as high as should be possible.

There are lots of re-transmitted frames. for example, Frames 91 and 92 are re-transmissions of Frame 90.

Frame 93: Decodes as a CF-End frame, however is transmitted by the WILC1000 station. CF-End should only be sent from the AP. Note: RSSI is -26dBm, so must have been transmitted by WILC1000

CF-End frames are transmitted by WILC1000, however there are no other PCF packets sent. This indicates that PCF function is not being used, however WILC1000 seems to be transmitting CF-End frames.

Conclusion

When Mulitple WILC1000 Stations are close, they trigger each other's firmware into transmitting junk packets very rapidly. A single WILC1000 will not flood the network, however does send incorrect junk frames and will re-transmit many otherwise valid frames, sometimes so quickly that an acknowledgement can not have been received.

This issue renders the WILC1000 unusable, or at best - when a single Station is used - poor performance.

We have spent considerable engineering resources on investigation of this and other previous WILC1000 issues. We require urgent and dedicated attention from Microchip engineers to fix this serious issue.

A Salesforce issue will be opened referencing this report.

@tsifb
Copy link
Author

tsifb commented Jan 18, 2019

Microchip Salesforce case number 00374724.

@Mateusz-Gwara
Copy link

Mateusz-Gwara commented Jan 18, 2019

Thank you very much for your effort and insight!

We have basically the same issue rendering any WiFi Network in range unable to send data or even be seen, as soon as ~10 WILC3000 access points are switched on (even without traffic).
And I'm talking about any Network, not just the WILC access points.

Our Microchip Salesforce case number is 00328657 (Date/Time Opened | 9/6/2018 10:39 AM) where we also supplied packet capture data.
Unfortunately Microchip wasn't able to reproduce this until now.
So, let's hope your input allows to shed some light on this issue!

If not, I'm considering reporting this "network jamming" feature to the German RF-network authorities so they can effectively ban the hardware from being sold.
It's just 1.5K bucks.
Maybe that will motivate enough to try to find anything ;)

@tsifb
Copy link
Author

tsifb commented Jan 20, 2019

@ZonKnoZ We would be interested in seeing your issue report details posted here too. Please share the details, including packet capture files, so that when Microchip eventually post a "fix" we can all test it knowing the likely failure modes.

I have suspected that this issue will be provoked by any AP in range, even with the WILC1000 stations not connected to it, however we have already spent vast amounts of time attempting to diagnose this problem. I do not have time to allocate to further digging.

It sounds like your experience has been similar to ours. This is a very simple setup to recreate and while the details of the problem are difficult to determine, the effect is very very obvious. Hopefully our combined efforts will motivate an appropriate response.

@AdhamAbozaeid
Copy link

Thanks @tsifb for the detailed report.
The salesforce team will analyze it and take appropriate action

@tsifb
Copy link
Author

tsifb commented Jan 21, 2019

@AdhamAbozaeid I see they have replied in Salesforce saying they will investigate. I hope that will be given high priority.

Are there any plans to open source the firmware for the WILC modules? While we do not really want to be inside the module firmware, it might be useful for our own wifi experts to be able to inspect the code.
If not open source, is there possibility of getting access under NDA so we can assist directly?

Currently the issues with this wifi module are stalling our product release, and it is now critical path for us.

@AdhamAbozaeid
Copy link

@tsifb , there are currently no plans to open source the firmware.
Please mention the criticality and its impact on your production schedule in the ticket so that the salesforce team can prioritize it with respect to their input queue.

@MTaulin
Copy link

MTaulin commented Jan 24, 2019

I can confirm that something happens when more than one WILC1000 device is in one wifi network.
It's not that critical like mentioned from the threads above but i also get some suspicious packages when im using more than one wilc1000 device.
I a use custom imx6 board with wilc1000 and Tp-link accesspoint. In my case im not sure about the packages because the "source" and "destination" are both from the tp-link but those always just happening when two or more devices with the wilc1000 are connected. The packages are CF-End control-frames and also coming in bursts in an interval of a few seconds but not that much to completely drop the data rate. Also i get some of those packages with 00:00:00_00:00:00 as source and destination address.
I also connected two more devices into the wifi network and can see that the amount of those packages also raises so i expect that as the reason for those packages

@tsifb
Copy link
Author

tsifb commented Jan 24, 2019

@MTaulin your observations match ours.

Also, If you closely inspect the on-air traffic with a single WILC1000 active, there will still be erroneous packets and you will observe some incorrectly re-transmitted legitimate packets. This does slow the throughput from what the device is capable of. Using firmware and driver from more than 12 months ago, we saw significantly higher throughput - however there were other serious bugs that made that combination unusable.

We also observed packets with source and destination addresses that matched the AP (TP-link) We proved without doubt that these packets were NOT transmitted from the AP, but in fact were sent by the WILC1000 modules. You can prove this using a packet sniffer (we use rtl8812AU device). If it is located very close to the wilc1000, and distant from the AP, you will be able to observe different per packet RSSI per transmitter. Strong RSSI will indicate WILC1000 is transmitting. (radio tap header includes RSSI value)

@tsifb
Copy link
Author

tsifb commented Jan 24, 2019

@AdhamAbozaeid I have had no solid feedback via salesforce (7 days)

Can you raise this internally at Microchip to progress the investigating with high priority? @MTaulin's report shows the same symptoms. @ZonKnoZ's issue is likely closely related or the same.

@AdhamAbozaeid
Copy link

@tsifb the support team is currently working on reproducing the issue and will get back to you soon with feedback.
Thank you for your patience.

@Mateusz-Gwara
Copy link

@tsifb

@rocky134 mentioned our problem in issue #36 as well

meanwhile we are making plans to drop support of the WILC3K altogether and switching to an usb dongle.
The same setup we have simply works as you would expect WiFi to work with any other WiFi chip (atheros, Ralink, you name it).

The sad part is that we are going to ditch 1K+ WILC3000 chips due to this issue...

@tsifb you can contact me on my github email adress to discuss the issue
After 1 Year of problems with this chip I simply don't expect things to magically change...

@AdhamAbozaeid
Copy link

@ZonKnoZ , Work on WILC is prioritized according to the open tickets on salesforce. There have been a lot of improvements and bug fixes implemented since release 15.0, and more stability implemented on 15.2 (current dev branch) that's being released on February 11th.
For the jamming issue, the salesforce team was able to reproduce the problem as I mentioned last week, and as soon as we find a proper fix it will be submitted to the dev branch.

Thanks,
Adham

@Mateusz-Gwara
Copy link

Thanks for the update.
You as a developer have more current info on our tickets issue than we get via Salesforce.

This is very important info because we need to handle this situation and base our business decisions on something...

@tsifb
Copy link
Author

tsifb commented Feb 4, 2019

@ZonKnoZ Please share your findings on alternate parts. We are in the same position as you, and are considering alternatives as well. We so have a spare USB port that connects out through small daughter board which we could probably remake with an USB connected wifi chipset.
For our application, we would need significant volume of parts for mass production, therefore a usb dongle arrangement is unlikely to be suitable.
We also require both AP and Station mode operation, preferably concurrently. I have seen some alternatives using the Atheros AR9271, which I have used in a previous company / product. The driver did get some attention for our specific use, but overall it was very reliable and worked well.
We would also consider an SDIO connected module, with some type of adapter board, or making our own module to be footprint compatible with the wilc1000 module - this is within our capabilities, however a late stage change such as this is very expensive to our business.

@tsifb
Copy link
Author

tsifb commented Feb 4, 2019

@AdhamAbozaeid, are you directly involved in or with the Engineering team working on this firmware issue?
Unfortunately we do not see much movement in Saleforce. They only response in the last 15 days being that the team has reproduced this issue and is working on a fix.

We would appreciate some more frequent updates with some details of your progress towards a fix.

Everyone following this ticket understands that Salesforce tickets need to be prioritized. Can you tell us all what priority level has been placed in this issue?

Also, we are watching GitHub closely (both Driver and Firmware repos), currently running the latest from the dev branches, including the latest firmware 15.2 release candidate. We are able to update to any new commits and re-test very quickly, if this is of any assistance.

@Mateusz-Gwara
Copy link

@tsifb
Here is a dump I submitted to our Issue on salesforce on 11/5/2018 4:36 PM
capture-of-l2-wilc3k-connection-filtered.zip

There are several WILC3000 APs active and I see quite a few retransmissions(starting with Frame 100).

Nevertheless it would be nice to have a software solution, we could distribute OTA instead of replacing hardware.
We also are running the latest driver/FW commits from the dev branch.

The USB replacement ist just a measure to "fix" the problem of already produced units.
We are validating alternatives (SPI/SDIO) for our next production run as well to be able to switch if this issue is still there until then.

@AdhamAbozaeid
Copy link

AdhamAbozaeid commented Feb 4, 2019

@tsifb My team is responsible for the driver/FW development for WILC1000/WILC3000. We started working on this issue few days ago. I'll post updates once I have reasonable findings.
Thanks for offering assistance. I'll let you know if I needed more logs or maybe to validate a fix

@tsifb
Copy link
Author

tsifb commented Feb 5, 2019

@ZonKnoZ the content of your capture is very very similar to my single unit capture.
Notably, the burst of retransmitted packets starting from packet 100. Note the time interval between packet 100 and 101 is very short.

@AdhamAbozaeid note this characteristic. In this case, the WILC module does not send erroneous CF-END frame, but re-transmits a (valid?) packet several times (with the R flag set indicating a retry, and the same sequence number as the original).

While the on-air characteristics are different to the multi unit CF-END bursts, my experience and best guesswork tells me that this is likely the same root cause.

@AdhamAbozaeid
Copy link

@tsifb , this is a normal behavior at packet 100. WILC will retransmit unicast packets if the receiver didn't send back an ACK frame

@tsifb
Copy link
Author

tsifb commented Feb 6, 2019

@AdhamAbozaeid what is the timeout in microseconds before a Probe Response retry is sent? Observation is that the WILC module sends retries after a much shorter period than other devices (as seen in this @ZonKnoZ 's capture)

Also, in my Single unit iperf capture, there are significant numbers of QOS-Data frame retries.
Notes:
1, these are also very small deltas.
2, they are variable in number
3, mostly followed by a WILC module sending a CF-END frame
Given these observation, I suspect this is very closely related to the flooding issue with multiple devices.

@ZonKnoZ can you tell us your setup to capture these packets, including the hardware device you used?
I took some care to make sure that my channel was (relatively) quiet, and that the sniffer was only using a single channel. I suspect from looking at your log, that some packets are missed due to the sniffer scanning other channels. Notably there are not many ACK packets recorded.
This was not easy to do. Something like: kill off Kali's Network Manager, manually setup wpa supplicant to a single channel scan, connect to the AP, then kill wpa_supplicant, set monitor mode, run wireshark.... I don't recall exactly the steps, but it took some work to get it right.

@AdhamAbozaeid
Copy link

@tsifb I can see the CF sequence happening even after probe responses directed from the AP to other station, not only after re transmissions from WILC
Also please note that at the level microseconds, the air sniffer will not be very accurate. You would need a high speed scope to get accurate interframe spacing times. On my sniffer, I see the retries varies from hundred microseconds to few milliseconds both for the AP and for WILC

@tsifb
Copy link
Author

tsifb commented Feb 7, 2019

@AdhamAbozaeid I understand the timing may not be accurate. I don't have immediate access to the hardware needed to get more accurate timings.
After actually doing the calculations , I got around 340us for an Ack response time at 1Mbps.

What is the actual timeout value / calculation used in the WILC firmware?

I did notice the WILC management frames in @ZonKnoZ 's capture were at 6Mbps. I expected these to be 1Mbps. I'm not sure if this is another useful observation. Unsure of the exact configuration in his case. It might be worth reviewing the tx data rates used by the WILC module for the management frames.

Regarding the CF-End Frames, note the signal strength in the radio tap header. These are ALL very strong (-26 to -28dBm), so must have been transmitted by the WILC module. (Ignore the MAC addresses, which I believe are most likely incorrect)
My understanding from @ZonKnoZ 's previous comment is that the WILC tx'ed CF-End frames will happen even if they are not connected to an AP. This probably indicates that frames directed to other stations could likely be triggering the WILC to tx these frames.
I haven't done any testing on this specifically, but it is worthwhile looking at this.

Maybe configure a WILC to not actively probe, and configure it to connect to a non-existent SSID, single channel (which should mean the WILC is quiet). Monitor it to see if it can be triggered into sending erroneous frames by other activity on the same channel from a network using a different SSID?

I would like to help out directly with this, but I am unable to do so right now.

@Mateusz-Gwara
Copy link

Mateusz-Gwara commented Feb 7, 2019

@tsifb
The chip I use for scanning is:
ID 0cf3:9271 Atheros Communications, Inc. AR9271 802.11n

The hostapd AP config is done as follows:

interface=wlan0
driver=nl80211
country_code=DE
channel=".(int rand(8)+3)."
auth_algs=1
hw_mode=g
max_num_sta=16
wpa_group_rekey=7200
wpa_ptk_rekey=7200
wpa_gmk_rekey=86400
macaddr_acl=0
wpa=2
wpa_key_mgmt=WPA-PSK
wpa_pairwise=CCMP
rsn_pairwise=CCMP
ssid=".$ssid."
wpa_passphrase=".$pass."

I was scanning at work with 10 WILC APs active.
I was fixed to one channel as described in these instructions:
WLAN Sniffer.pdf
^^these were the steps Microchip support wanted me to perform in order to have a valid scan result

It is hard to find a spot with less than 50 acess points active in the neighbourhood and I have no faraday cage available but I might try to find a spot less noisy.
Sorry for being so unresponsive right now, but we are quite busy atm.
Still, I'm willing to help where I can in order to get this issue resolved.

@AdhamAbozaeid
Copy link

@tsifb, the ACK timeout ranges between 33 to 207 usec.
There shouldn't be a problem with using 6Mbps to transmit management frames. I even think 1Mbps would be more prune to collision, and hence might required CFP.

I have also changed the location of WILC a few times and I can see the RSSI captured by the sniffer changes proportionally, so I agree that the packets might be coming out of WILC

@tsifb
Copy link
Author

tsifb commented Feb 8, 2019

Given the relative RSSI difference in my captures, it is proven without doubt that the packet comes from the WILC module.

I see @ZonKnoZ 's configuration has hw_mode=g which explains the mgmt frames sent at the lowest g rate.

@tsifb
Copy link
Author

tsifb commented Feb 8, 2019

@ZonKnoZ I have used the AR9271 in another product. We considered testing it in this product - but decided on another device. We have samples enroute to us now. We intend to integrate its driver and initially use this a test comparison against the WILC, with some packet captures that we can all use as reference.

@tsifb
Copy link
Author

tsifb commented Feb 11, 2019

@AdhamAbozaeid today is day 26 since reporting this issue.
Can I get an update on your progress so far, and an estimated date on a new release that includes a fix?

We have product waiting to to ship, and new production runs are on hold.

@Mateusz-Gwara
Copy link

It's 190 days for me, since I communicated this issue towards microchip the first time.

I know that being asked about progress migt be stressful while coping with complex dev/debug tasks, but the same goes for waiting without any feedback.

@AdhamAbozaeid I'm really sorry that you have to cope with customers as well as with development at the same time.

It seems that you are a lone wolf @microchip handlig this, which I hope is just a false&provocative assumption causing an official statement on this matter that might help handling this from a business perspective...

@tsifb
Copy link
Author

tsifb commented Feb 14, 2019

@AdhamAbozaeid I see you just released the firmware 15.2 - does this include a fix for this issue? I don't see any additional commits.

@AdhamAbozaeid
Copy link

AdhamAbozaeid commented Feb 14, 2019

@ZonKnoZ, @tsifb release 15.2 doesn't include a fix for this issue. This is the same as the dev branch that was under QA to be released.

I assure you that this issue is currently getting highest priority, so thanks for understanding.
Usually higher layers of support at microchip takes care of handling customer issues, but this one looks too deep and has to involve multiple development teams.

So far, I was able to reproduce the problem, and I could confirm that these packets aren't generated nor received from the FW, same as all medium control packets that requires accurate timing that can't be handled by SW.
For now, I'm working with the HW team to understand how these packets are generated, and if the FW is mis-configuring any options related to CF-end packets.

As for the estimate for a fix, it will be hard to give an accurate estimate given the current information I have for now, but I'd say it would take roughly 2~3 weeks to find a fix, plus 2~3 more weeks for QA sanity checks for an engineering release with a fix to be available.
Once we have more information, I'll send you a fix for validation, and will get you more accurate information about the engineering release availability.

@Mateusz-Gwara
Copy link

Just compiled the latest dev driver for a 4.9.59 kernel
Running with the latest dev firmware it at least seems to load.
This comes up as the FW version:

[wilc_wlan_initialize]WILC Firmware Ver = WILC_WIFI_FW_REL_15_2_RC3 Build: 11045M

I started testing the setup and will try to replicate our previous problems.
Uptime is still high after loading the wilc module though:

15:44:06 up 32 min, 1 user, load average: 2.93, 2.77, 2.19

with a single device traffic seems to flow ok:

wlan0 Link encap:Ethernet HWaddr f8:f0:05:c0:2a:1c
inet addr:192.168.0.1 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:372 Metric:1
RX packets:287085 errors:0 dropped:0 overruns:0 frame:0
TX packets:549804 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:20031590 (19.1 MiB) TX bytes:210797809 (201.0 MiB)

Unfortunately the whole system froze shortly after transferring those ~200MB with no output I could capture leading to this.

Would the wilc debug port output help if i reproduce and record this?

The stability and congestion tests with multiple devices will take a bit more time.

@AdhamAbozaeid which kernels are officially supported (the Linux user guide describes only 4.9 for SAMA5D4)?

@AdhamAbozaeid
Copy link

@ZonKnoZ , the changes submitted to the dev branch doesn't have fixes for the CF end problem, nor the uptime yet.
Can you describe the steps you are doing to reproduce this freeze?

We use SAMA5d4 with kernel 4.9 for testing, but note that the dev branch is fully tested at official releases. All intermediate commits are sanity checked only.

@AdhamAbozaeid
Copy link

AdhamAbozaeid commented Mar 1, 2019

All, the attached firmware has a tentative fix for the CF-End flood packets. You can use it to verify the test as we previously discussed.
Please note that this FW is for testing purposes only, and isn't ready for release yet.
wilc1000_wifi_firmware.zip
wilc3000_wifi_firmware.zip

@AdhamAbozaeid AdhamAbozaeid transferred this issue from linux4wilc/driver Mar 1, 2019
@Mateusz-Gwara
Copy link

@AdhamAbozaeid , here we go:

  1. ti AM3352 CPU based board similar to Beaglebone Black running on ti-processor-sdk-linux-am335x-evm-04.02.00.09 (that is linux-4.9.59+gitAUTOINC+a75d8e9305-ga75d8e9305).
  2. the drivers & firmware were downloaded last friday 15p.m. CET from the dev branches respectively.
  3. modprobe wilc_spi
  4. service hostapd start
    with hostapd.conf:

interface=wlan0
hw_mode=g
channel=3
ieee80211d=1
country_code=DE
ieee80211n=1
wmm_enabled=1
auth_algs=1
wpa=2
wpa_key_mgmt=WPA-PSK
rsn_pairwise=CCMP
ssid=test123
wpa_passphrase=test123!

Then I just connect a tablet to the AP and start to stream a youtube video. Freeze comes after a while.

Btw. I'm aware that this is a bleeding edge test case here + I use a WILC3000 , so sorry but can't test it.

@AdhamAbozaeid
Copy link

@ZonKnoZ , Do you see the same issue with the master, or on the dev only? Do you have the driver/FW logs?
Also, please open a new ticket for this one, since it's not related to the CF End packets problem

@tsifb
Copy link
Author

tsifb commented Mar 5, 2019

We have built in the build 11050 firmware and are testing.
So far, initial observation is that the flood of CF-END does not occur.
Now testing with more units powered on, and using our typical application use-case.

@AdhamAbozaeid Are there additional issues / fixes that you are aware of that will be included in the next release?

@AdhamAbozaeid
Copy link

AdhamAbozaeid commented Mar 5, 2019

Most of the changes will be in the driver to align with WILC's driver on the kernel staging tree, in addition to:

@tsifb
Copy link
Author

tsifb commented Mar 5, 2019

@AdhamAbozaeid My question was about the next firmware release, although if there are driver changes that need to also be included at the same time - that would be useful to know.

I am trying to assess which combinations we should be focusing on testing, and how much effort to burn on them.

I see there are a bunch of new driver dev branch commits. We are also working in integrating these into our latest internal dev branch, which is on a slightly newer linux kernel.
We do not want to change too much - so we can retest as close as possible to the original setup to provide quality feedback to you, however our eventual target will include a newer kernel and important wilc driver updates.

Your input would be helpful on this.

As part of the testing, we will be carefully observing the debug output from the driver and WILC module serial port - if we notice anything suspicious, we will open new issues here to get your feedback.

@tsifb
Copy link
Author

tsifb commented Mar 6, 2019

Testing Update: Yesterday with 5 units connected to the AP has shown a large improvement in usability. I do have some concerns about performance where the AP to Station signal strength is lower (~-75dBm) - the throughput drops more than I expected, even at the reduced data rate. More investigation is needed here.

Also the ability to maintain a somewhat usable ssh connection to one unit, while a file transfer / iperf is operating with another unit is not what I expect. It seems like the file transfer (TP-link AP-> WILC Station) does significantly delay packets to other units (on the order of several seconds to 10s of seconds.)
This may impact our factory use-case, where commands need to be remotely executed in units, while others are downloading large firmware images. This may be a separate issue, but I thought it was worth mentioning early.

Further testing in the next few day:
1, more units + throughput to one unit
2, more units + multiple concurrent high throughput operations ( file transfers )
3, more units + remote shell + file transfers
4, concurrent WILC AP and WILC STA operation on one unit.
5, testing latest driver dev branch + linux kernel 4.14
During this testing we will be trying to get an understanding of reliability as well. (Power cycling in our end use application is a very rare event, and is difficult to perform)

@AdhamAbozaeid
Copy link

WILC FW and driver releases are always aligned.
So far, the changes done on top of 15.2 for the driver and firmware are independent, so you can use the fix for this issue with the driver from 15.2, without having to integrate driver changes from the dev branch

@tsifb
Copy link
Author

tsifb commented Mar 6, 2019

@AdhamAbozaeid did you build a new WILC3000 binary set too? @Mateusz-Gwara may be able to test as well?

@Mateusz-Gwara we would appreciate your eyes on this new firmware too, if that is possible ;)

@tsifb
Copy link
Author

tsifb commented Mar 6, 2019

@AdhamAbozaeid ok. thanks for that information.

@AdhamAbozaeid
Copy link

@tsifb , I just added the WILC3000 FW as well to my comment above

@tsifb
Copy link
Author

tsifb commented Mar 6, 2019

I still observe some CF-End frames being transmitted by the WILC module.
I do not see the large bursts of CF-End frames.
See attached file.
As before, the Source address is not the WILC MAC, however based on the signal strength, they must be tx'ed by the WILC module. I do not believe these are legitimate packets.

cfend-11050-sta-only.zip

@AdhamAbozaeid
Copy link

@tsifb , these CF-end packets are legitimate. It's used by STAs to indicate the truncation of an obtained TXOP. You can refer to the "Truncation of TXOP" section int he 802.11

@tsifb
Copy link
Author

tsifb commented Mar 7, 2019

Ok, so.... if I disable RTS/CTS by setting the rts threshold to 2346, will I still see the CF-End frames?
I'm assuming these are used to signal an early end to the period allocated by RTS/CTS, is that correct?

AdhamAbozaeid pushed a commit that referenced this issue Mar 7, 2019
When using mulitple WILCs on the same network, multiple transmissions
of CF-End packets were observed on the air coming out from WILC

Fixes #6
Signed-off-by: Adham Abozaeid <[email protected]>
@tsifb
Copy link
Author

tsifb commented Mar 10, 2019

@AdhamAbozaeid I noticed the binaries recently added to dev branch have a different build number to the binary you attached here for testing.
Build: 11050 vs 11064M

Are there additional changes?
Should we move to the dev branch version to continue testing?
What does the "M" suffix mean?

@AdhamAbozaeid
Copy link

No, TXOP is different from RTS/CTS

@AdhamAbozaeid
Copy link

The binary provided in this thread (11050) is 15.2 + the fix for this issue. The fix on the dev branch has other fixes on top of 15.2
The engineering release that will be provided through salesforce will be 15.2+ fix for this issue only, same as 11050

@tsifb
Copy link
Author

tsifb commented Mar 11, 2019

ok yes, I understood that 11050 just fixed this issue.
but your answer is confusing - can you clarify?

I was referring to the last dev commit a few days ago d66a52f
which appears to have build number 11064M, and the commit comment says the cf end fix is included.
Your previous comment confuses us because you implied this CF-end fix is not included.

I'm expending resources testing this firmware. I need to understand which binary we should be using so that we don't have to waste time.

My assumption is that 11064M = 15.2 release + CF END fix (as per 11050) + something else.
Is this correct?
What is the "something else"?
Is 11064M going to be closer to your next (15.3?) release version?

I would prefer to be close as possible to the next release candidate - so if 11064M has additional fixes, this is of interest.

@AdhamAbozaeid
Copy link

I mean that the fix on the dev. branch has the fix for cf-end along with other fixes on top of 15.2
d66a52f has release 15.2 (2342582) + 5ca0dc2+ CF-End bug fix.
5ca0dc2 is a modification to move initialization code from the driver to the FW.

The engineering release that we are currently running sanity QA on will be release 15.2 + CF-End fix to minimize the changes and avoid introducing further bugs since we don't run full QA on engineering releases. This engineering release will be available in 2~3 weeks through salesforce for users who don't want to wait till the official release is ready.
Release 15.3 will be the first official release that has this fix, and is planned for July time frame

@tsifb
Copy link
Author

tsifb commented Mar 12, 2019

@AdhamAbozaeid ok great, thanks for the clarification and timings. That helps us a lot.
Also as a heads up, I am going to open a new issue to discuss another observation we have made during testing, however the results for 11050 build with respect to this issue are good.
We see much better operations when concurrently updating firmware images (~100MB) in 7 units in the same location, and have not seen any other reliability issues with this combination of firmware and driver so far.

@Mateusz-Gwara I recommend you try the wilc3000 binaries linked above if you haven't been done so already.

@Mateusz-Gwara
Copy link

I ran the driver ( a65d074 ) and firmware ( d66a52f ) from the latest dev branch commits respectively.
The units I ran it on, were running kernel 4.9.59 as I already described above.
Unfortunately all test units ended up in a “semi-frozen” state where “ifconfig” would hang and a normal reboot would hang as well.
As our devices initially load a hostapd config, they perform a “service hostapd restart”.
This is what comes up then:

[  151.018370] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[  151.026957] pgd = da4f4000
[  151.029799] [00000010] *pgd=9ce91831, *pte=00000000, *ppte=00000000
[  151.036579] Internal error: Oops: 17 [#1] ARM
[  151.041122] Modules linked in: wilc_spi(C) cfg80211 rfkill xt_TCPMSS xt_tcpudp iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrao
[  151.081654] CPU: 0 PID: 3969 Comm: hostapd Tainted: G         C      4.9.59casenio-ga75d8e9305 #4
[  151.090902] Hardware name: Generic AM33XX (Flattened Device Tree)
[  151.097250] task: da48f1c0 task.stack: dca5a000
[  151.102105] PC is at get_key+0x5c/0xc4 [wilc_spi]
[  151.107055] LR is at get_key+0x40/0xc4 [wilc_spi]
[  151.111961] pc : [<bf12805c>]    lr : [<bf128040>]    psr: 60000013
[  151.111961] sp : dca5bcd8  ip : 00000000  fp : bf27378e
[  151.123931] r10: dc820000  r9 : da4e2000  r8 : 00000000
[  151.129372] r7 : dc9fc900  r6 : 00000000  r5 : 00000002  r4 : da4e2280
[  151.136184] r3 : 00000000  r2 : da48f1c0  r1 : c0a152f0  r0 : da4e2288
[  151.142989] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  151.150424] Control: 10c5387d  Table: 9a4f4019  DAC: 00000051
[  151.156410] Process hostapd (pid: 3969, stack limit = 0xdca5a210)
[  151.162764] Stack: (0xdca5bcd8 to 0xdca5c000)
[  151.167309] bcc0:                                                       60000013 dc9fc900
[  151.175841] bce0: 00000000 c0163e4c c0a39cb4 00000002 00000000 dca5bd60 bf128000 bf224f9c
[  151.184370] bd00: 00000000 dca5bd24 bf21fd48 c0610d58 da605014 da4e2280 0000000c bf263734
[  151.192886] bd20: 00000002 dc9fc900 00000000 00000002 00000004 bf26c5cc bf263734 de3bc814
[  151.201416] bd40: dc8daa80 00000000 00000014 da397400 00000000 c0577e08 bf2630c4 00000000
[  151.209940] bd60: 386d4420 00000f72 de3bc800 de3bc810 de3bc814 da397400 da4e2000 dc820000
[  151.218475] bd80: de3bc800 00000000 c0577be0 dc8daa80 dca5bdcc 00000001 00000000 c0577128
[  151.227010] bda0: c0a57448 dc8daa80 00000024 dc8daa80 dca5bdcc c0577bcc de1e7000 dcd30c00
[  151.235549] bdc0: 00000024 c0576a78 00000001 7fffffff 00000008 00000000 dc8daa80 00000024
[  151.244075] bde0: dcd30c00 00000000 dca5bf68 c0576ecc be86fc88 00000001 dca5bf68 dca5a000
[  151.252603] be00: 00000000 de280e80 00000000 00000f81 00000000 00000000 be86fcc0 dca5bf68
[  151.261129] be20: 00000000 00000000 d4236480 c0107704 dca5be50 00000000 00067cc8 c052edd4
[  151.269659] be40: dca5bf68 c052f910 00000001 00000000 da48f1c0 c0166b54 b2b2929f 14a58638
[  151.278193] be60: 00000000 00000050 da48f668 00000000 0009a980 00000024 00000000 c12100e4
[  151.286724] be80: da48f1c0 c0166b54 1db89764 17cc08f3 00000000 da48f668 da48f660 00000000
[  151.295260] bea0: 00000000 c0a5f494 60000013 c0168538 00000010 00000000 00000000 d44a8f8c
[  151.303792] bec0: 00000000 00000000 00000000 c0a5f494 60000013 c0168538 00000001 00000000
[  151.312319] bee0: 00000000 00000000 00000000 c0a17754 00000000 00000000 00000002 c0a5f494
[  151.320857] bf00: 60000013 c0168538 00000000 00000000 00000000 c0255350 00000000 00000000
[  151.329391] bf20: c0233cb0 c017abcc c0a5eabb de012600 00000000 be86fc70 00000000 d4236480
[  151.337929] bf40: 00000128 c0107704 dca5a000 00000000 00067cc8 c05306ac 00000000 00000000
[  151.346463] bf60: 00000000 fffffff7 dca5beb0 0000000c 00000001 00000000 00000000 dca5be78
[  151.354990] bf80: 00000000 dca5a000 00000000 00000000 00000000 00000000 00097e58 00096c90
[  151.363521] bfa0: b6edb000 c0107560 00097e58 00096c90 00000005 be86fc70 00000000 00000000
[  151.372057] bfc0: 00097e58 00096c90 b6edb000 00000128 000303ad 00096e00 00097ba0 00067cc8
[  151.380596] bfe0: be86fc70 be86fc58 b6eccd17 b6cfaffc 40000010 00000005 9eff6861 9eff6c61
[  151.389849] [<bf12805c>] (get_key [wilc_spi]) from [<bf224f9c>] (nl80211_get_key+0x1fc/0x430 [cfg80211])
[  151.400107] [<bf224f9c>] (nl80211_get_key [cfg80211]) from [<c0577e08>] (genl_rcv_msg+0x228/0x3dc)
[  151.409456] [<c0577e08>] (genl_rcv_msg) from [<c0577128>] (netlink_rcv_skb+0xbc/0xe0)
[  151.417625] [<c0577128>] (netlink_rcv_skb) from [<c0577bcc>] (genl_rcv+0x20/0x34)
[  151.425434] [<c0577bcc>] (genl_rcv) from [<c0576a78>] (netlink_unicast+0x160/0x1f0)
[  151.433419] [<c0576a78>] (netlink_unicast) from [<c0576ecc>] (netlink_sendmsg+0x2f8/0x370)
[  151.442052] [<c0576ecc>] (netlink_sendmsg) from [<c052edd4>] (sock_sendmsg+0x14/0x24)
[  151.450225] [<c052edd4>] (sock_sendmsg) from [<c052f910>] (___sys_sendmsg+0x1ec/0x200)
[  151.458476] [<c052f910>] (___sys_sendmsg) from [<c05306ac>] (__sys_sendmsg+0x40/0x6c)
[  151.466657] [<c05306ac>] (__sys_sendmsg) from [<c0107560>] (ret_fast_syscall+0x0/0x1c)
[  151.474929] Code: e3130010 1a000011 e0840105 e5903590 (e5932010) 
[  151.481811] ---[ end trace 54605407682fda2c ]---

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.036579] Internal error: Oops: 17 [#1] ARM

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.156410] Process hostapd (pid: 3969, stack limit = 0xdca5a210)

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.162764] Stack: (0xdca5bcd8 to 0xdca5c000)

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.167309] bcc0:                                                       60000013 dc9fc900

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.175841] bce0: 00000000 c0163e4c c0a39cb4 00000002 00000000 dca5bd60 bf128000 bf224f9c

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.184370] bd00: 00000000 dca5bd24 bf21fd48 c0610d58 da605014 da4e2280 0000000c bf263734

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.192886] bd20: 00000002 dc9fc900 00000000 00000002 00000004 bf26c5cc bf263734 de3bc814

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.201416] bd40: dc8daa80 00000000 00000014 da397400 00000000 c0577e08 bf2630c4 00000000

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.209940] bd60: 386d4420 00000f72 de3bc800 de3bc810 de3bc814 da397400 da4e2000 dc820000

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.218475] bd80: de3bc800 00000000 c0577be0 dc8daa80 dca5bdcc 00000001 00000000 c0577128

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.227010] bda0: c0a57448 dc8daa80 00000024 dc8daa80 dca5bdcc c0577bcc de1e7000 dcd30c00

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.235549] bdc0: 00000024 c0576a78 00000001 7fffffff 00000008 00000000 dc8daa80 00000024

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.244075] bde0: dcd30c00 00000000 dca5bf68 c0576ecc be86fc88 00000001 dca5bf68 dca5a000

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.252603] be00: 00000000 de280e80 00000000 00000f81 00000000 00000000 be86fcc0 dca5bf68

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.261129] be20: 00000000 00000000 d4236480 c0107704 dca5be50 00000000 00067cc8 c052edd4

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.269659] be40: dca5bf68 c052f910 00000001 00000000 da48f1c0 c0166b54 b2b2929f 14a58638

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.278193] be60: 00000000 00000050 da48f668 00000000 0009a980 00000024 00000000 c12100e4

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.286724] be80: da48f1c0 c0166b54 1db89764 17cc08f3 00000000 da48f668 da48f660 00000000

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.295260] bea0: 00000000 c0a5f494 60000013 c0168538 00000010 00000000 00000000 d44a8f8c

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.303792] bec0: 00000000 00000000 00000000 c0a5f494 60000013 c0168538 00000001 00000000

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.312319] bee0: 00000000 00000000 00000000 c0a17754 00000000 00000000 00000002 c0a5f494

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.320857] bf00: 60000013 c0168538 00000000 00000000 00000000 c0255350 00000000 00000000

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.329391] bf20: c0233cb0 c017abcc c0a5eabb de012600 00000000 be86fc70 00000000 d4236480

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.337929] bf40: 00000128 c0107704 dca5a000 00000000 00067cc8 c05306ac 00000000 00000000

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.346463] bf60: 00000000 fffffff7 dca5beb0 0000000c 00000001 00000000 00000000 dca5be78

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.354990] bf80: 00000000 dca5a000 00000000 00000000 00000000 00000000 00097e58 00096c90

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.363521] bfa0: b6edb000 c0107560 00097e58 00096c90 00000005 be86fc70 00000000 00000000

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.372057] bfc0: 00097e58 00096c90 b6edb000 00000128 000303ad 00096e00 00097ba0 00067cc8

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.380596] bfe0: be86fc70 be86fc58 b6eccd17 b6cfaffc 40000010 00000005 9eff6861 9eff6c61

Message from syslogd@89490200001095386101 at Mar 13 14:14:13 ...
 kernel:[  151.474929] Code: e3130010 1a000011 e0840105 e5903590 (e5932010) 
[  156.641080] WILC_SPI spi1.2 wlan0: INFO [debug_thread]*** Debug Thread Running ***
[  163.040175] WILC_SPI spi1.2 wlan0: INFO [debug_thread]*** Debug Thread Running ***
[  169.440404] WILC_SPI spi1.2 wlan0: INFO [debug_thread]*** Debug Thread Running ***
[  175.839809] WILC_SPI spi1.2 wlan0: INFO [debug_thread]*** Debug Thread Running ***
[  182.240015] WILC_SPI spi1.2 wlan0: INFO [debug_thread]*** Debug Thread Running ***

After a while the device freezes stops responding and the console is flooded with this:

[ 7882.740654] WILC_SPI spi1.2: Failed cmd response, cmd (cf), resp (00)
[ 7882.747379] WILC_SPI spi1.2: Failed cmd reset
[ 7882.751947] WILC_SPI spi1.2: Reset and retry 3 1 -568744354
[ 7882.758913] WILC_SPI spi1.2: Failed cmd response, cmd (c3), resp (00)
[ 7882.765662] WILC_SPI spi1.2: Failed cmd, write reg (00000001)...
[ 7882.773086] WILC_SPI spi1.2: Failed cmd response, cmd (cf), resp (00)
[ 7882.779808] WILC_SPI spi1.2: Failed cmd reset
[ 7882.784376] WILC_SPI spi1.2: Reset and retry 2 1 -568744354
[ 7882.791366] WILC_SPI spi1.2: Failed cmd response, cmd (c3), resp (00)
[ 7882.798091] WILC_SPI spi1.2: Failed cmd, write reg (00000001)...
[ 7882.805524] WILC_SPI spi1.2: Failed cmd response, cmd (cf), resp (00)

@AdhamAbozaeid
Copy link

Are you seeing the same behavior with the firmware binary posted here, with the driver from 15.2 (a65d074 )?

@tsifb
Copy link
Author

tsifb commented Mar 14, 2019

@AdhamAbozaeid @Mateusz-Gwara we haven't tested AP mode yet (or AP+STA) I will try this sometime over the next few working days.
Also note we use sdio and are testing on kernel 4.14.73-linux4sam_6.0 whereas @Mateusz-Gwara is using spi.

Reliability seems to be good in STA mode.... no crashes / lockups seen so far.

@Mateusz-Gwara
Copy link

Mateusz-Gwara commented Mar 14, 2019

@AdhamAbozaeid
I have 4 versions running now:

  • 4.9 kernel (ti-processor-sdk-linux-am335x-evm-04.02.00.09 linux-4.9.59+gitAUTOINC+a75d8e9305-ga75d8e9305)
  • a65d074 driver
  • dev firmware from 7.March 2019
  • 4.9 kernel (ti-processor-sdk-linux-am335x-evm-04.02.00.09 linux-4.9.59+gitAUTOINC+a75d8e9305-ga75d8e9305)
  • a65d074 driver
  • firmware from 5.March 2019 posted here
  • 4.14 kernel (ti-processor-sdk-linux-am335x-evm-05.01.00.11 linux-4.14.67+gitAUTOINC+d315a9bb00-gd315a9bb00)
  • a65d074 driver
  • dev firmware from 7.March 2019
  • 4.14 kernel (ti-processor-sdk-linux-am335x-evm-05.01.00.11 linux-4.14.67+gitAUTOINC+d315a9bb00-gd315a9bb00)
  • a65d074 driver
  • firmware from 5.March 2019 posted here

So far, the firmware posted here seems to be more stable.
At least it does not show the errors I posted about the dev-branch firmware before.

This is what I see when a client connects with the most promising variant 4:

[  209.109897] ========================================================
[  209.116527] WARNING: possible irq lock inversion dependency detected
[  209.123163] 4.14.67casenio-gd315a9bb00 #4 Tainted: G         C
[  209.129698] --------------------------------------------------------
[  209.136327] swapper/0 just changed the state of lock:
[  209.141588]  (_xmit_ETHER#2){+.-.}, at: [<c0593b50>] sch_direct_xmit+0x9c/0x184
[  209.149245] but this lock took another, SOFTIRQ-unsafe lock in the past:
[  209.156238]  (fs_reclaim){+.+.}
[  209.156247]
[  209.156247]
[  209.156247] and interrupts could create inverse lock ordering between them.
[  209.156247]
[  209.171408]
[  209.171408] other info that might help us debug this:
[  209.178218]  Possible interrupt unsafe locking scenario:
[  209.178218]
[  209.185293]        CPU0                    CPU1
[  209.190015]        ----                    ----
[  209.194727]   lock(fs_reclaim);
[  209.198007]                                local_irq_disable();
[  209.204179]                                lock(_xmit_ETHER#2);
[  209.210365]                                lock(fs_reclaim);
[  209.216279]   <Interrupt>
[  209.218998]     lock(_xmit_ETHER#2);
[  209.222726]
[  209.222726]  *** DEADLOCK ***
[  209.222726]
[  209.228902] 4 locks held by swapper/0:
[  209.232806]  #0:  (rcu_read_lock){....}, at: [<c0568700>] process_backlog+0x48/0x144
[  209.240897]  #1:  (rcu_read_lock){....}, at: [<c057b4b0>] neigh_update+0x270/0x698
[  209.248804]  #2:  (rcu_read_lock_bh){....}, at: [<c056ab9c>] __dev_queue_xmit+0x40/0x890
[  209.257266]  #3:  (dev->qdisc_running_key ?: &qdisc_running_key#2){+...}, at: [<c057b520>] neigh_update+0x2e0/0x698
[  209.268187]
[  209.268187] the shortest dependencies between 2nd lock and 1st lock:
[  209.276376]  -> (fs_reclaim){+.+.} ops: 915842 {
[  209.281202]     HARDIRQ-ON-W at:
[  209.284588]                       fs_reclaim_acquire.part.2+0x30/0x40
[  209.291315]                       kmem_cache_alloc_trace+0x2c/0x1dc
[  209.297865]                       alloc_worker.constprop.13+0x18/0x54
[  209.304590]                       create_worker+0x30/0x170
[  209.310319]                       workqueue_init+0x70/0x110
[  209.316134]                       kernel_init_freeable+0x44/0x1b8
[  209.322490]                       kernel_init+0x8/0x110
[  209.327942]                       ret_from_fork+0x14/0x20
[  209.333566]     SOFTIRQ-ON-W at:
[  209.336929]                       fs_reclaim_acquire.part.2+0x30/0x40
[  209.343647]                       kmem_cache_alloc_trace+0x2c/0x1dc
[  209.350183]                       alloc_worker.constprop.13+0x18/0x54
[  209.356905]                       create_worker+0x30/0x170
[  209.362632]                       workqueue_init+0x70/0x110
[  209.368445]                       kernel_init_freeable+0x44/0x1b8
[  209.374800]                       kernel_init+0x8/0x110
[  209.380247]                       ret_from_fork+0x14/0x20
[  209.385883]     INITIAL USE at:
[  209.389162]                      fs_reclaim_acquire.part.2+0x30/0x40
[  209.395793]                      kmem_cache_alloc_trace+0x2c/0x1dc
[  209.402242]                      alloc_worker.constprop.13+0x18/0x54
[  209.408862]                      create_worker+0x30/0x170
[  209.414489]                      workqueue_init+0x70/0x110
[  209.420217]                      kernel_init_freeable+0x44/0x1b8
[  209.426473]                      kernel_init+0x8/0x110
[  209.431833]                      ret_from_fork+0x14/0x20
[  209.437371]   }
[  209.439201]   ... key      at: [<c0a266a8>] __fs_reclaim_map+0x0/0x10
[  209.445913]   ... acquired at:
[  209.449094]    kmem_cache_alloc_trace+0x2c/0x1dc
[  209.454033]    txq_add_mgmt_pkt+0xac/0x164 [wilc_spi]
[  209.459347]    wilc_wfi_mon_xmit+0x148/0x224 [wilc_spi]
[  209.464799]    dev_hard_start_xmit+0xb8/0x2f8
[  209.469333]    sch_direct_xmit+0xc0/0x184
[  209.473514]    __dev_queue_xmit+0x688/0x890
[  209.477882]    packet_sendmsg+0x5ec/0x127c
[  209.482161]    sock_sendmsg+0x14/0x24
[  209.485982]    ___sys_sendmsg+0x204/0x21c
[  209.490158]    __sys_sendmsg+0x40/0x6c
[  209.494061]    ret_fast_syscall+0x0/0x28
[  209.498146]
[  209.499693] -> (_xmit_ETHER#2){+.-.} ops: 342 {
[  209.504429]    HARDIRQ-ON-W at:
[  209.507710]                     _raw_spin_lock+0x28/0x38
[  209.513244]                     dev_deactivate_many+0xe8/0x3b4
[  209.519327]                     __dev_close_many+0x68/0xd8
[  209.525049]                     __dev_change_flags+0x98/0x178
[  209.531046]                     dev_change_flags+0x18/0x48
[  209.536771]                     devinet_ioctl+0x6ec/0x824
[  209.542402]                     sock_ioctl+0xe4/0x328
[  209.547668]                     do_vfs_ioctl+0x8c/0x95c
[  209.553129]                     SyS_ioctl+0x64/0x74
[  209.558223]                     ret_fast_syscall+0x0/0x28
[  209.563857]    IN-SOFTIRQ-W at:
[  209.567131]                     _raw_spin_lock+0x28/0x38
[  209.572677]                     sch_direct_xmit+0x9c/0x184
[  209.578405]                     __dev_queue_xmit+0x688/0x890
[  209.584303]                     neigh_update+0x2e0/0x698
[  209.589849]                     arp_process+0x2f8/0x960
[  209.595295]                     arp_rcv+0x15c/0x234
[  209.600380]                     __netif_receive_skb_core+0x4f4/0xb94
[  209.607010]                     process_backlog+0x90/0x144
[  209.612738]                     net_rx_action+0xc8/0x3a8
[  209.618279]                     __do_softirq+0xb8/0x478
[  209.623723]                     irq_exit+0x124/0x170
[  209.628908]                     __handle_domain_irq+0x50/0xa8
[  209.634903]                     __irq_svc+0x70/0x98
[  209.639986]                     arch_cpu_idle+0x20/0x3c
[  209.645439]                     do_idle+0x100/0x18c
[  209.650528]                     cpu_startup_entry+0xc/0x10
[  209.656257]                     start_kernel+0x330/0x39c
[  209.661790]    INITIAL USE at:
[  209.664978]                    _raw_spin_lock+0x28/0x38
[  209.670433]                    dev_deactivate_many+0xe8/0x3b4
[  209.676417]                    __dev_close_many+0x68/0xd8
[  209.682045]                    __dev_change_flags+0x98/0x178
[  209.687944]                    dev_change_flags+0x18/0x48
[  209.693565]                    devinet_ioctl+0x6ec/0x824
[  209.699102]                    sock_ioctl+0xe4/0x328
[  209.704285]                    do_vfs_ioctl+0x8c/0x95c
[  209.709636]                    SyS_ioctl+0x64/0x74
[  209.714631]                    ret_fast_syscall+0x0/0x28
[  209.720164]  }
[  209.721904]  ... key      at: [<c11c5398>] netdev_xmit_lock_key+0x8/0x1c8
[  209.728997]  ... acquired at:
[  209.732088]    __lock_acquire+0x480/0x18d0
[  209.736356]    lock_acquire+0xb0/0x1b4
[  209.740263]    _raw_spin_lock+0x28/0x38
[  209.744258]    sch_direct_xmit+0x9c/0x184
[  209.748442]    __dev_queue_xmit+0x688/0x890
[  209.752805]    neigh_update+0x2e0/0x698
[  209.756806]    arp_process+0x2f8/0x960
[  209.760711]    arp_rcv+0x15c/0x234
[  209.764254]    __netif_receive_skb_core+0x4f4/0xb94
[  209.769345]    process_backlog+0x90/0x144
[  209.773523]    net_rx_action+0xc8/0x3a8
[  209.777526]    __do_softirq+0xb8/0x478
[  209.781434]    irq_exit+0x124/0x170
[  209.785069]    __handle_domain_irq+0x50/0xa8
[  209.789513]    __irq_svc+0x70/0x98
[  209.793056]    arch_cpu_idle+0x20/0x3c
[  209.796966]    do_idle+0x100/0x18c
[  209.800505]    cpu_startup_entry+0xc/0x10
[  209.804683]    start_kernel+0x330/0x39c
[  209.808678]
[  209.810225]
[  209.810225] stack backtrace:
[  209.814780] CPU: 0 PID: 0 Comm: swapper Tainted: G         C      4.14.67casenio-gd315a9bb00 #4
[  209.823849] Hardware name: Generic AM33XX (Flattened Device Tree)
[  209.830216] [<c010e168>] (unwind_backtrace) from [<c010c1f0>] (show_stack+0x10/0x14)
[  209.838288] [<c010c1f0>] (show_stack) from [<c01643f4>] (check_usage_forwards+0x134/0x158)
[  209.846902] [<c01643f4>] (check_usage_forwards) from [<c0164888>] (mark_lock+0x3c4/0x6c0)
[  209.855435] [<c0164888>] (mark_lock) from [<c0165274>] (__lock_acquire+0x480/0x18d0)
[  209.863515] [<c0165274>] (__lock_acquire) from [<c0166f54>] (lock_acquire+0xb0/0x1b4)
[  209.871688] [<c0166f54>] (lock_acquire) from [<c06454f0>] (_raw_spin_lock+0x28/0x38)
[  209.879763] [<c06454f0>] (_raw_spin_lock) from [<c0593b50>] (sch_direct_xmit+0x9c/0x184)
[  209.888206] [<c0593b50>] (sch_direct_xmit) from [<c056b1e4>] (__dev_queue_xmit+0x688/0x890)
[  209.896913] [<c056b1e4>] (__dev_queue_xmit) from [<c057b520>] (neigh_update+0x2e0/0x698)
[  209.905345] [<c057b520>] (neigh_update) from [<c05dc940>] (arp_process+0x2f8/0x960)
[  209.913329] [<c05dc940>] (arp_process) from [<c05dd118>] (arp_rcv+0x15c/0x234)
[  209.920863] [<c05dd118>] (arp_rcv) from [<c05665d0>] (__netif_receive_skb_core+0x4f4/0xb94)
[  209.929574] [<c05665d0>] (__netif_receive_skb_core) from [<c0568748>] (process_backlog+0x90/0x144)
[  209.938922] [<c0568748>] (process_backlog) from [<c056c6e0>] (net_rx_action+0xc8/0x3a8)
[  209.947265] [<c056c6e0>] (net_rx_action) from [<c0101560>] (__do_softirq+0xb8/0x478)
[  209.955339] [<c0101560>] (__do_softirq) from [<c012d9a8>] (irq_exit+0x124/0x170)
[  209.963063] [<c012d9a8>] (irq_exit) from [<c0172d7c>] (__handle_domain_irq+0x50/0xa8)
[  209.971221] [<c0172d7c>] (__handle_domain_irq) from [<c0645e30>] (__irq_svc+0x70/0x98)
[  209.979478] [<c0645e30>] (__irq_svc) from [<c0108300>] (arch_cpu_idle+0x20/0x3c)
[  209.987198] [<c0108300>] (arch_cpu_idle) from [<c015cad4>] (do_idle+0x100/0x18c)
[  209.994911] [<c015cad4>] (do_idle) from [<c015cec4>] (cpu_startup_entry+0xc/0x10)
[  210.002719] [<c015cec4>] (cpu_startup_entry) from [<c0900c00>] (start_kernel+0x330/0x39c)
[  214.241231] WILC_SPI spi0.2 wlan0: INFO [debug_thread]*** Debug Thread Running ***

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants