feat: Add stall detection to recover from frozen uploads #775
Conversation
Force-pushed from 93158d8 to e5f7871
More tests coming soon.
Thank you for working on this, Tom! It's greatly appreciated! I'm excited about this feature, and we have been kicking around this idea for some time already.
I had a high-level look at your PR and left a few comments.
Rate-based: Detects when overall transfer rate drops below threshold
This mechanism will likely only work if the upload is split across multiple requests, won't it? If a large file is uploaded at normal speed over 5 minutes in one request, but the HTTP stack does not support progress events, it will be detected as a stall. Such a scenario is not distinguishable from an actual stall.
If progress events are not available and one wants stall detection, the best option would be to enable chunkSize and set an appropriate request timeout that fits the chunk size.
In addition, requests can be delayed for reasons other than stalls. For example, some people use the onBeforeRequest callback to implement a rate limit/queue system for outgoing requests, or they fetch authentication tokens to be used in the upload requests. These processes might delay requests and thus progress events, so we have to be careful not to flag them as stalls.
Let me know what you think!
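For illustration, here is a minimal sketch of the chunked-upload fallback suggested above, using tus-js-client's existing chunkSize and retryDelays options in Node.js. The endpoint, file path, and chunk size are placeholders; the idea is simply that with bounded request sizes, a per-request timeout in the HTTP stack can act as a coarse stall detector even when no progress events are available.

```ts
import * as fs from 'node:fs'
import * as tus from 'tus-js-client'

const path = './large-file.bin' // placeholder file
const file = fs.createReadStream(path)

const upload = new tus.Upload(file, {
  endpoint: 'https://tusd.example.com/files/', // placeholder endpoint
  uploadSize: fs.statSync(path).size,
  // Bounded request sizes: each PATCH request carries at most 5 MiB, so a
  // request timeout sized for one chunk can catch a frozen request.
  chunkSize: 5 * 1024 * 1024,
  // Failed or timed-out requests are retried on this schedule.
  retryDelays: [0, 1000, 3000, 5000],
  onError: (err) => console.error('Upload failed:', err),
  onProgress: (sent, total) => console.log(`${sent} of ${total} bytes sent`),
  onSuccess: () => console.log('Upload finished:', upload.url),
})

upload.start()
```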
Force-pushed from aa262bd to 578c8d8
Hi @Acconut, I think all the feedback above has now been addressed:
- New component
- Updates to existing
Force-pushed from 578c8d8 to e6d0f8e
Force-pushed from 389875b to ead2cf0
Hi @Acconut - just giving this a bump, in case you haven't seen it. Thanks!
Thanks for the updates. I had another look and left some comments. Overall, this is moving in the right direction!
Force-pushed from 53e7269 to 2df3d6c
Thank you for the updates! I left a few more comments to make this detection extra robust!
Force-pushed from 2e91cc0 to 7c94d1e
Feel free to review again; all that feedback should now have been addressed. We'd really love to get our hands on this as soon as possible. What do you think about shipping a pre-release? We have a system we can test this with now, and a load of established test automation that leverages a containerised sat-link emulation proxy (we run through a matrix of dynamic adverse network scenarios and transfers of varying sizes). I hope to open-source that tooling at some point so you guys can use it too.
Sure, a pre-release is doable once this has been merged.
This feature addresses the issue of uploads hanging indefinitely in unreliable network conditions, particularly in Node.js environments where no default timeout exists. When uploads stall due to network issues, TCP connections can enter a degraded state where no data is transferred but no error is triggered. This implementation detects such stalls and forces a retry.

Implementation details:
- Progress-based: Detects when no upload progress events are fired
- Gracefully integrates with the existing retry mechanism
- Fully configurable with sensible defaults:
  - 30s stall timeout (time with no progress before considering the upload stalled)
  - 5s check interval (how often to check for stalls)

This is especially important for uploads over satellite links, cellular networks, or other unreliable connections where TCP backoff can cause indefinite stalls.

Also refactors the test-stall-detection test suite.
Force-pushed from 7c94d1e to daf2ae0
Those bits should now have been addressed, feel free to take another look. It's worth noting my wife is about to have a baby any day now, so I may not be available to progress this soon. Once we get a pre-release out we'll be sure to heavily test it and share our findings.
Do you think we might be able to ship a pre-release of this before the end of the week? Sorry to keep bugging you 😅 - we'd really love to get our hands on this feature asap.
I'll try, but I cannot make any promises at this time.
Maybe hold off merging; I've found an issue we might need to talk about first when this is used with NodeHttpStack. No rush from our side; I've published a release of the fork to test with a real-world consuming service. Basically, in my real-world tests this morning I discovered that we can't get low-level TCP socket transfer progress from the Node HTTP module. You can only see an internal I/O buffer being filled; you then get silence until that buffer is completely drained (transmitted) and starts to be filled again, which means you consistently get false positives for stall detection.
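To make the buffering behaviour described above concrete, here is a small Node sketch (the URL and chunk size are placeholders). http.ClientRequest is a Writable stream, so write() only reports whether data was accepted into the stream's internal buffer, and 'drain' fires once that buffer has been handed off to the socket; neither signal says anything about bytes actually delivered over the network.

```ts
import * as http from 'node:http'

// Placeholder request; the target URL is illustrative only.
const req = http.request('http://example.com/upload', { method: 'POST' })
req.on('error', (err) => console.error('Request error:', err))

// Keep writing until the internal stream buffer (highWaterMark) is full.
// write() returning false is the only backpressure signal available here;
// it says nothing about how many bytes have actually left the machine.
const chunk = Buffer.alloc(64 * 1024)
while (req.write(chunk)) {
  // data accepted into the local buffer, not necessarily on the wire
}

// 'drain' fires once the buffer has been flushed to the socket. Between the
// last write() and this event there is silence, even though the OS/TCP stack
// may be transmitting the whole time - which is what produces the bursty
// "progress" and the false stall positives described above.
req.once('drain', () => {
  console.log('local buffer drained; resuming writes or ending the request')
  req.end()
})
```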
That's great and is probably the best way to take this for a test drive right now.
Oh, I see. Getting progress information in Node.js has been rather tedious compared to XMLHttpRequest in browsers, which provides progress events on its own. In Node.js, we use our own logic to write smaller chunks to the HTTP request stream to get progress information; see lib/node/NodeHttpStack.ts, lines 225 to 276 at 021960d.

But as you correctly mentioned, this only allows us to retrieve details about how much data landed in the HTTP request stream. There are usually other buffers involved before the data hits the network. We can't even assume that the HTTP request has its own TCP connection, since HTTP/2 and HTTP/3 perform multiplexing over TCP and UDP, respectively. So even if there were a way to get low-level TCP information, it would only be useful for HTTP/1.1 requests.

That being said, I wonder if you could prevent false positives by tuning the stall detection parameters. Increasing the stall timeout while reducing the high water mark in the HTTP request stream (not sure how that's best done, to be honest) could help. In addition, you might want to experiment with different chunk sizes or throttle configurations.

In addition, I tested the stall detection on my own yesterday and it does not seem to fully work in browsers yet. The logic assumes that our HTTP stacks emit an error when requests are aborted, but the XHRHttpStack does not do that; it just leaves the promise hanging. This should be easy to fix and I'm working on a patch, but more testing is needed.

I also learned that Uppy has a similar approach to stall detection: https://github.com/transloadit/uppy/blob/main/packages/%40uppy/utils/src/ProgressTimeout.ts It's a bit simpler as it doesn't use check intervals, but just renews the timeout whenever a progress event is emitted. It's also enabled by default with a timeout of 30s and they say it's working well. Might be worth a look, although Uppy is only designed for browsers and not Node.js.
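For comparison, a minimal sketch of the "renew the timeout on every progress event" approach mentioned above, similar in spirit to Uppy's ProgressTimeout. The class and names here are illustrative rather than Uppy's or this PR's actual API:

```ts
// Illustrative stall watchdog: a single timer that is reset on every
// progress event and fires the stall handler if none arrives in time.
class ProgressTimeout {
  private timer: ReturnType<typeof setTimeout> | undefined

  constructor(
    private readonly timeoutMs: number,   // e.g. 30_000, matching the 30s default discussed above
    private readonly onStall: () => void, // called when no progress arrives in time
  ) {}

  // Call this from the upload's onProgress callback.
  progress(): void {
    this.reset()
    this.timer = setTimeout(this.onStall, this.timeoutMs)
  }

  // Clear the pending timer, e.g. on success, error, or abort.
  reset(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer)
      this.timer = undefined
    }
  }
}

// Usage: wire it into an upload's progress callback.
const watchdog = new ProgressTimeout(30_000, () => {
  console.warn('No progress for 30s, treating the upload as stalled')
  // abort the request here so the existing retry logic can kick in
})
// e.g. onProgress: (sent, total) => watchdog.progress()
```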
Opened #778 for this.
Related issue
#773
Overview
This feature addresses the issue of uploads hanging indefinitely in unreliable network conditions, particularly in Node.js environments where no default timeout exists.
When uploads stall due to network issues, TCP connections can enter a degraded state where no data is transferred but no error is triggered. This implementation detects such stalls and forces a retry.
This is especially important for uploads over satellite links, cellular networks, or other unreliable connections where TCP backoff can cause indefinite stalls.
Implementation details
- Progress-based: Detects when no upload progress events are fired
- Rate-based: Detects when the overall transfer rate drops below a threshold
- Automatically selects the appropriate method based on HTTP stack capabilities
- 1 byte/s minimum transfer rate
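As a rough sketch of the mechanism described in the list above, a periodic check can compare the time since the last qualifying progress event against the stall timeout. The class, option names, and wiring here are illustrative and not necessarily the public API exposed by this PR; the defaults mirror the values described above (30s stall timeout, 5s check interval, 1 byte/s minimum rate).

```ts
interface StallDetectorOptions {
  stallTimeoutMs: number    // time with no useful progress before assuming a stall
  checkIntervalMs: number   // how often to check for stalls
  minBytesPerSecond: number // progress slower than this does not count
}

class StallDetector {
  private lastProgressAt = Date.now()
  private lastBytes = 0
  private timer: ReturnType<typeof setInterval> | undefined

  constructor(
    private readonly onStall: () => void,
    private readonly opts: StallDetectorOptions = {
      stallTimeoutMs: 30_000,
      checkIntervalMs: 5_000,
      minBytesPerSecond: 1,
    },
  ) {}

  start(): void {
    this.lastProgressAt = Date.now()
    this.timer = setInterval(() => this.check(), this.opts.checkIntervalMs)
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer)
  }

  // Feed this from the upload's progress events (total bytes sent so far).
  onProgress(bytesSent: number): void {
    const now = Date.now()
    const elapsedSeconds = (now - this.lastProgressAt) / 1000
    const bytesSinceLast = bytesSent - this.lastBytes
    // Rate-based: progress slower than the minimum rate is ignored, so a
    // trickle of bytes cannot keep an effectively frozen upload alive forever.
    if (elapsedSeconds === 0 || bytesSinceLast / elapsedSeconds >= this.opts.minBytesPerSecond) {
      this.lastProgressAt = now
      this.lastBytes = bytesSent
    }
  }

  // Progress-based: if no qualifying progress arrived within the stall
  // timeout, treat the upload as stalled and let the caller abort and retry.
  private check(): void {
    if (Date.now() - this.lastProgressAt >= this.opts.stallTimeoutMs) {
      this.stop()
      this.onStall()
    }
  }
}
```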