Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

usb: fix enumeration issues by propagating buffer overflow error. #738

Closed
wants to merge 1 commit into from

Conversation

mattthebaker
Copy link
Contributor

Summary

Fixes USB enumeration issues with fast host controllers.

The USB spec permits SETUP for control transfers to be sent at any time. The device is not permitted to STALL or NAK, so the USBD controller ignores BK0RDY status when it receives a SETUP.

Some faster host controllers may send a SETUP very soon after the previous OUT, for example the Raspberry Pi5 has been confirmed on USB analyzer to SETUP in as little as 20us after the previous OUT. This is a very narrow window to handle the OUT and maintain proper state.

In my project (atsamd21 at 32mhz cpu clock with USB at mid priority), by the time the OUT in the endpoint buffer is read it has already been overwritten by the SETUP packet. The USB stack is expecting a zero length packet to end the previous descriptor read, so it has passed in a zero length slice. The bank read finds that the endpoint has more than zero data and throws a BufferOverflow.

This change allows the error to propagate without dropping the SETUP in the endpoint buffer. The USB stack will then exit the previous descriptor read and proceed with the SETUP of the next control transfer.

In my project this fixes (at least) enumeration issues with Pi5, Pi CM4, and so far a couple of x86 class PCs running Windows or Linux. So far all hosts known to have enumeration issues that were tested with this change have been resolved.

Checklist

  • [ X] CHANGELOG.md for the BSP or HAL updated
  • All new or modified code is well documented, especially public items

@mattthebaker
Copy link
Contributor Author

Rebased on present HEAD~1 to avoid inheriting CI errors.

@jbeaurivage
Copy link
Contributor

@mattthebaker, I'm going to run a few tests with this PR, but it seems like it solves some USB issues on my HW as well.

@ianrrees
Copy link
Contributor

ianrrees commented Jun 5, 2024

There's some infrastructure for testing USB, related to the usb-device crate. Offhand, I don't know if it would be feasible to make a test case that reproduces this issue as the host side of that uses libusb, however it would be worth running the suite against this PR to catch any regressions (and to dust off the tests :) ).

A good entry point to the current situation might be this issue - sorry I've not advanced it, but am now gainfully unemployed and maybe even getting past a period of burnout so could try to get back in to it. rust-embedded-community/usb-device#100

@sajattack
Copy link
Member

This will need to be rebased to fix the ci errors.

@mattthebaker
Copy link
Contributor Author

@mattthebaker, I'm going to run a few tests with this PR, but it seems like it solves some USB issues on my HW as well.

Excellent!

There's some infrastructure for testing USB, related to the usb-device crate. Offhand, I don't know if it would be feasible to make a test case that reproduces this issue as the host side of that uses libusb, however it would be worth running the suite against this PR to catch any regressions (and to dust off the tests :) ).

A good entry point to the current situation might be this issue - sorry I've not advanced it, but am now gainfully unemployed and maybe even getting past a period of burnout so could try to get back in to it. rust-embedded-community/usb-device#100

Hmm, I'm not sure if it can be easily created in user space tests since the OS will typically handle enumeration and believe SETUP is primarily used there? Any test that resets a device might be able to look for re-enumeration within some short period of time, but the host would need to be confirmed to be susceptible to the issue.

Any kind of extended regression coverage would be nice.

Fix is already in production for me and will quickly see use on at least a couple hundred devices and a lot of unique hosts (pretty much all Linux though). During debug I tested against usb_device 0.3.1 and 0.3.2 -- new debug hooks there ultimately helped identify the root cause. My device only uses CDC/ACM.

@ianrrees
Copy link
Contributor

ianrrees commented Aug 1, 2024

@mattthebaker - thanks very much for your work on this, and apologies it's taken me so long to do anything with the PR! I've run the usb-device test suite against your branch on SAMD21 and SAMD51 (Adafruit Metro M0 and M4 boards), can't see any change there.

I'm not quite sure this fixes the underlying issue, but I'm still getting back up to speed on USB and don't have a firm idea of a different fix. A couple items I'm currently pondering, which you might have considered already:

In the circumstance where Inner::read()'s caller expects a ZLP, should an incoming SETUP result in the transfer (perhaps not the right terminology - what I mean is the data that might span multiple reads or require a ZLP to terminate) being discarded or accepted? My hunch is that the transfer should be discarded, like this PR does.

How about the case where the caller expects more data in a transfer, so supplies a larger buffer than an unexpected SETUP?

I wonder if we should split Inner::read() in to two methods like Inner::read_setup() and Inner::read_data(), because it seems like we're currently trying to do two semantically different things with the same function.

Tangentially related: we should update the usb-device crate used by the HAL. I'll put that on my TO-DO list.

@ianrrees
Copy link
Contributor

ianrrees commented Aug 1, 2024

The SAMD21 datasheet (DS40001882H) in section 32.6.2.6 describes management around SETUP transactions being handled largely in hardware. The USB peripheral will overwrite the endpoint buffer with an incoming SETUP transaction so long as EPINTFLAG.RXSTP is 0; effectively, SETUP transactions can't overwrite each other, but they can overwrite non-SETUP transactions.

In a hypothetical Inner::read_data() method, I think the best we can do is check received_setup_interrupt() after the read has been attempted (and that needs to happen if the endpoint returned a buffer overflow error, to cover the ZLP case). But, even that seems to leave a timing hole, because the datasheet describes the USB peripheral writing data before it sets the RXSTP bit...

@ianrrees
Copy link
Contributor

ianrrees commented Aug 1, 2024

I'm thinking the next step is to create a test case that aims to reproduce the issue that this PR is meant to fix. How does the failure manifest - is it just that the device doesn't enumerate successfully at all, does it take noticeably longer (I guess due to a retry), something else?

(updated to add) My current thought is two types of test:

  1. Harness triggers a restart of the device, expects it to enumerate within some reasonable amount of time
  2. Harness writes some data to the firmware via a control transfer, then reads it back via another control transfer. The harness should compare the data, and use amounts of data aimed at hitting problem areas (where ZLPs might be expected, and not). If the device is too slow in reading out control data, it might erroneously get some of the next SETUP data.

@mattthebaker
Copy link
Contributor Author

No problem, trying to help others that have likely seen the issue.

In the circumstance where Inner::read()'s caller expects a ZLP, should an incoming SETUP result in the transfer (perhaps not the right terminology - what I mean is the data that might span multiple reads or require a ZLP to terminate) being discarded or accepted? My hunch is that the transfer should be discarded, like this PR does.

So the specific case I'm aware of where this happens, the status stage of a control transfer gets its ZLP clobbered by the next control transfer's SETUP. In this case, that ZLP is the final action of the first control transfer, and on reception by the device it will be ACKed if the OUT buffer is ready. The device could STALL if its not ready, but then the host would try again rather than proceeding.

From the host's perspective the control transfer is now entirely complete with no errors. If it is a fast host it may move immediately into another control transfer SETUP.

If there were something the device side stack could discard, in this case I believe you would prefer to accept it. However, the only case I'm aware that this happens is when the control transfer is reading -- IN transfers -- so afaik there isn't any device bound transfer that could be uncertain. To resolve the issue the next control transfer will need to start correctly.

How about the case where the caller expects more data in a transfer, so supplies a larger buffer than an unexpected SETUP?

I'm not aware of a case where the host would behave this way, but I'm not an expert. AFAIK the rapid SETUP only comes due to how the final status stage works. In other cases if the state machines get out of sync or have bugs they would STALL / NAK until the host decides to just reset the device bus and try again. Only the SETUP per spec bypasses those mechanisms and can lead to this issue.

Splitting the read function into two might make sense.

In a hypothetical Inner::read_data() method, I think the best we can do is check received_setup_interrupt() after the read has been attempted (and that needs to happen if the endpoint returned a buffer overflow error, to cover the ZLP case). But, even that seems to leave a timing hole, because the datasheet describes the USB peripheral writing data before it sets the RXSTP bit...

Do you have a spot in the datasheet where you saw that comment about RXSTP? It seems that reading received_setup_interrupt() may not reliable if it has its own race with the buffer state. If the expected ZLP reads incorrectly, aborting the read but keeping all other state valid for the next state advance seems appropriate still. Its closely equivalent to what you're describing other than the received_setup_interrupt() will be flagged when the interrupt fires or next state advance that has the read instead.

I'm thinking the next step is to create a test case that aims to reproduce the issue that this PR is meant to fix. How does the failure manifest - is it just that the device doesn't enumerate successfully at all, does it take noticeably longer (I guess due to a retry), something else?

It depends on how rapidly the host sends the SETUP and how quick the device is with clearing:

  1. If the host is only marginally faster than the device at handling, the enumeration would often succeed, but some packets (like say the manufacturer name string) would get dropped. Or sometimes the first enumeration step would glitch and not recover, the usb controller would time out, restart and try again. So the enumeration time could be very slow (5-10s or more).
  2. If the host is very fast compared to the device, it would fail enumeration on the first control transfer every time and never enumerate.
  1. Harness triggers a restart of the device, expects it to enumerate within some reasonable amount of time

This should hit it directly with the caveat that the host its run on would need to have fast setup behavior and/or artificial delays slowing down the device to force the behavior. For reference it seems like many host controllers will only send 1 control transfer per start of frame, so 1ms, but others will blast them out with basically zero delay.

2. Harness writes some data to the firmware via a control transfer, then reads it back via another control transfer. The harness should compare the data, and use amounts of data aimed at hitting problem areas (where ZLPs might be expected, and not). If the device is too slow in reading out control data, it might erroneously get some of the next SETUP data.

If that is possible with the host libraries that would be great for extra targeted coverage.

A possible 3. is considering the timing dependent nature, would be to have a device build with intentional long delays (somewhere around 1-3ms?) inserted between handling events, and run significant parts of the existing regressions against it.

@ianrrees
Copy link
Contributor

ianrrees commented Aug 2, 2024

Cool, sounds like progress. I don't have much time for programming today, but will try to implement a couple tests and a possible fix (which will involve changes to usb-device) soon.

I'm not aware of a case where the host would behave this way

It may be that a reasonable host implementation would never do that - I'd forgotten that the SETUP includes a length field, so the device knows how much data to expect.

Do you have a spot in the datasheet where you saw that comment about RXSTP?

Section 32.6.2.6 though it's a bit narrative.

I'm envisioning that read() will be changed to check the RXSTP bit near the end, and if it is set return an error (maybe InvalidState) without changing the peripheral state - this says the data (if any) has been overwritten by a new SETUP. A new read_setup() will work much like read() now does, but return an error if there is data but RXSTP is not set, and clear RXSTP before returning successfully. The logic in usb-device for control pipes will need to change slightly so that the new read_setup() is used when a setup is expected, and the places it reads regular data or ZLPs may need to handle that error.

It seems that reading received_setup_interrupt() may not reliable if it has its own race with the buffer state

Maybe it's not a real problem if the OUT ZLP is the only case where this occurs, since the device behaviour doesn't change depending on whether it successfully reads the ZLP or if the SETUP arrives before/during the ZLP handling. I guess this hinges on whether a host can write OUT data that the device needs, then follow up with a SETUP too quickly (as above) - my hunch is that this doesn't happen.

It depends on how rapidly the host sends the SETUP and how quick the device is with clearing:

Do you have examples of devices (or maybe device+driver combos) in those two categories? I've got a small assortment of SBCs and laptops, have seen the occasional weird enumeration issue but nothing as consistent as your 2.

If that is possible with the host libraries that would be great for extra targeted coverage.

I think it should be - the usb-device test suite is pretty minimal, but based on libusb which can do control transfers.

have a device build with intentional long delays (somewhere around 1-3ms?) inserted between handling events

Yep, that seems like a great idea. Should be easy to do between calls to poll() from the normal device firmware, which I imagine would be sufficient. Might be a bit awkward to put that functionality inside the usb-device code though, if that's necessary.

@mattthebaker
Copy link
Contributor Author

I'm envisioning that read() will be changed to check the RXSTP bit near the end, and if it is set return an error (maybe InvalidState) without changing the peripheral state - this says the data (if any) has been overwritten by a new SETUP. A new read_setup() will work much like read() now does, but return an error if there is data but RXSTP is not set, and clear RXSTP before returning successfully. The logic in usb-device for control pipes will need to change slightly so that the new read_setup() is used when a setup is expected, and the places it reads regular data or ZLPs may need to handle that error.

Sounds pretty clean to me.

Do you have examples of devices (or maybe device+driver combos) in those two categories? I've got a small assortment of SBCs and laptops, have seen the occasional weird enumeration issue but nothing as consistent as your 2.

The two I've debugged personally are my desktop PC -- an Intel i7-12700K / Z690 chipset and Win10 Pro -- and the raspberry pi 5. This is in conjunction with a samd21e clocked at 32mhz and RTIC where USB had to be lower priority than some real time tasks. Older firmware build that had a polling loop instead was slightly more reliable with the pi5 (more likely to glitch or slowly enumerate), but never worked on the desktop PC. My users had issues on some other various other embedded linux boards but I would need to hunt down old notes.

A Pi5 would be the simplest way to get a hardware test setup going, but need to note now that some field updates may have to be rolled back -- they issued their own fix (1 SETUP per SOF) as unrelated devices had similar issues, I'm not sure if they even ran a rust stack. https://forums.raspberrypi.com/viewtopic.php?t=365410

I'll be able to test any proposed alternate fixes in a pi5 setup.

@ianrrees
Copy link
Contributor

ianrrees commented Aug 6, 2024

I've started working on this today, but only got a bit of refactoring the test suite done. WIP on usb-device is at https://github.com/ianrrees/usb-device/tree/enumeration .

I'll be able to test any proposed alternate fixes in a pi5 setup.

Great, thanks! I don't have a Pi5 yet.

@ianrrees
Copy link
Contributor

ianrrees commented Aug 10, 2024

I've implemented the enumeration time test and found that the control loopback test exists already[1]. With my laptop and a Metro M0 Express (SAMD21, no bootloader) I see about 95% of enumerations happening in something like 8-900ms, but some are about 6 seconds. Haven't yet made the delay test; I started looking at where to put the delay and didn't find a spot where it seemed to make the enumeration problem more likely.

Proposed fixes span usb-device and atsamd-rs, I'd suggest using https://github.com/ianrrees/test-usb-device/tree/enumeration as an entry point to try the changes out. It's a little bit of a Git hack as the usb-device submodule commit is one from my usb-device repo. Probably the easiest thing to do is clone the test-usb-device repo with --recursive, cd in to the usb-device directory and do git remote add ianrrees [email protected]:ianrrees/usb-device.git, git fetch ianrrees, cd .., git checkout enumeration, git submodule update.

With those changes, I haven't seen any enumeration failures in ~40 runs of the test suite. (edit: now a few thousand, with both SAMD21 and SAMD51)

[1] here, though we might want to revise the list of lengths.

@ianrrees
Copy link
Contributor

Forgot to mention, there's a little bit of Cargo secret sauce that you probably know about already, but is quite helpful here. The "patch" feature that can be used to replace the source of a crate through the whole dependency tree, for example here replaces usb-device.

@ianrrees
Copy link
Contributor

@mattthebaker, just checking in on this as it's been a few days. I've had a brief chat with the original author of usb-device and it seems the mixing up of SETUP and OUT data is a known issue so it would be good to PR a fix to usb-device, and once that's done we'd update the HAL to suit.

With my laptop, I can see the changes I referenced above make enumerations more reliable, but it would be good to confirm whether is addresses the complete enumeration failures you've reported with RPi5 as well.

Great find! I suspect that sorting this out will fix a longstanding annoyance with https://github.com/ianrrees/usb-spi which is used at my old job.

@ianrrees ianrrees mentioned this pull request Aug 22, 2024
4 tasks
@ianrrees
Copy link
Contributor

would be good to confirm whether is addresses the complete enumeration failures you've reported with RPi5 as well.

Sorry, just realised I misread and the complete failures were with a desktop!

@9names ran some tests of my proposed fix with RPi5 today and was able to reproduce the failures using current code, and see that the fix it appears to work. So, I'm thinking the next step is to get those changes in to usb-device, and once that's done we can make corresponding changes to the HAL. Will close this PR out in the mean time, since the new approach would overwrite these changes anyway.

Thanks everyone!

@ianrrees ianrrees closed this Aug 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants