feat: Add stall detection to recover from frozen uploads #775


Open · m7kvqbe1 wants to merge 11 commits into main from feat/stall-detection

Conversation

@m7kvqbe1 commented May 20, 2025

Related issue

#773

Overview

This feature addresses the issue of uploads hanging indefinitely in unreliable network conditions, particularly in Node.js environments where no default timeout exists.

When uploads stall due to network issues, TCP connections can enter a degraded state where no data is transferred but no error is triggered. This implementation detects such stalls and forces a retry.

This is especially important for uploads over satellite links, cellular networks, or other unreliable connections where TCP backoff can cause indefinite stalls.

Implementation details

  • Supports two detection methods:
    • Progress-based: Detects when no upload progress events are fired
    • Rate-based: Detects when overall transfer rate drops below threshold
  • Automatically selects the appropriate method based on HTTP stack capabilities
  • Gracefully integrates with the existing retry mechanism
  • Fully configurable with sensible defaults (usage sketch after this list):
    • 30s stall timeout (time with no progress before considering stalled)
    • 5s check interval (how often to check for stalls)
    • 1 byte/s minimum transfer rate
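
For illustration, here is a minimal sketch of how these options might be passed to an upload. The stallDetection key and the file variable are assumptions for illustration based on the defaults above, not necessarily the final API:

```ts
import { Upload } from 'tus-js-client'

declare const file: File // a File/Blob in browsers, or a stream/buffer in Node.js

const upload = new Upload(file, {
  endpoint: 'https://tusd.example.com/files/',
  retryDelays: [0, 1000, 3000, 5000],
  // Hypothetical option shape mirroring the defaults listed above;
  // the exact key names may differ in the merged API.
  stallDetection: {
    enabled: true,
    stallTimeout: 30_000, // ms with no progress before the upload counts as stalled
    checkInterval: 5_000, // ms between stall checks
  },
  onError: (err) => console.error('Upload failed:', err),
})

upload.start()
```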

@m7kvqbe1 (Author)

More tests coming soon.

@Acconut (Member) left a comment

Thank you for working on this, Tom! It's greatly appreciated! I'm excited about this feature, and we have been kicking around this idea for some time already.

I had a high level look at your PR and left a few comments.

> Rate-based: Detects when overall transfer rate drops below threshold

This mechanism will likely only work if the upload is split across multiple requests, won't it? If a large file is uploaded at normal speeds over 5 minutes in one request, but the HTTP stack does not support progress events, it will be detected as a stall. Such a scenario is not distinguishable from actual stalls.

If progress events are not available and one wants stall detection, the best option would be to set chunkSize and an appropriate request timeout that fits the chunk size.
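
For example (a rough sketch; chunkSize is an existing tus-js-client option, while the endpoint URL and the timeout arithmetic are placeholders):

```ts
import { Upload } from 'tus-js-client'

declare const file: File

const upload = new Upload(file, {
  endpoint: 'https://tusd.example.com/files/',
  // Bound each PATCH request to 5 MiB. At a minimum acceptable rate of
  // ~50 KiB/s, one chunk finishes within ~100 s, so a per-request timeout
  // of e.g. 120 s on the HTTP stack side doubles as stall detection.
  chunkSize: 5 * 1024 * 1024,
})

upload.start()
```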

In addition, requests can be delayed for reasons other than stalls. For example, some people use the onBeforeRequest callback to implement a rate limit/queue system for outgoing requests. Or they fetch authentication tokens to be used in the upload requests. These processes might delay requests and thus progress events, so we have to be careful not to flag them as stalls.

Let me know what you think!

@m7kvqbe1 force-pushed the feat/stall-detection branch from aa262bd to 578c8d8 on May 28, 2025 10:45
@m7kvqbe1 (Author) commented May 28, 2025

Hi @Acconut, I think all the feedback above has now been addressed:

New component

  • StallDetector class - separate component as requested
  • Only uses progress-based detection (removed flawed rate-based approach)
  • Pauses during onBeforeRequest callbacks to prevent false positives

Updates to existing

  • Added stall detection to Upload class with proper cleanup
  • Added supportsProgressEvents() to all HTTP stacks (XHR: true, Fetch: false, Node: true; see the sketch after this list)
  • Deep merge for stall detection options in browser/node index files
  • Defaults: { enabled: false, stallTimeout: 30000, checkInterval: 5000 }
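
As a rough sketch of that capability flag (the interface and class shapes here are simplified assumptions; only the supportsProgressEvents() method name and the per-stack values are taken from this PR):

```ts
// Simplified sketch; the real tus-js-client HttpStack interface also has
// request-creation methods, elided here.
interface HttpStack {
  supportsProgressEvents(): boolean
}

class XHRHttpStack implements HttpStack {
  // XMLHttpRequest exposes native upload progress events.
  supportsProgressEvents(): boolean {
    return true
  }
}

class FetchHttpStack implements HttpStack {
  // fetch() provides no upload progress reporting.
  supportsProgressEvents(): boolean {
    return false
  }
}

class NodeHttpStack implements HttpStack {
  // Node.js streams let us report progress as chunks are written.
  supportsProgressEvents(): boolean {
    return true
  }
}
```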

@m7kvqbe1 force-pushed the feat/stall-detection branch from 578c8d8 to e6d0f8e on May 28, 2025 11:19
@m7kvqbe1 marked this pull request as ready for review on May 28, 2025 11:19
@m7kvqbe1 requested a review from Acconut on May 28, 2025 11:19
@m7kvqbe1 force-pushed the feat/stall-detection branch 2 times, most recently from 389875b to ead2cf0 on May 28, 2025 11:36
@m7kvqbe1 (Author) commented Jun 2, 2025

Hi @Acconut - just giving this a bump, in case you haven't seen it. Thanks!

@Acconut (Member) left a comment

Thanks for the updates. I had another look and left some comments. Overall, this is moving in the right direction!

@m7kvqbe1 force-pushed the feat/stall-detection branch 2 times, most recently from 53e7269 to 2df3d6c on June 4, 2025 11:16
@m7kvqbe1 requested a review from Acconut on June 4, 2025 11:25
@Acconut (Member) left a comment

Thank you for the updates! I left a few more comments to make this detection extra robust!

@m7kvqbe1 force-pushed the feat/stall-detection branch 2 times, most recently from 2e91cc0 to 7c94d1e on June 5, 2025 20:34
@m7kvqbe1 (Author) commented Jun 5, 2025

@Acconut

Feel free to review again; all that feedback should now have been addressed.

We'd really love to get our hands on this as soon as possible. What do you think about shipping a pre-release?

We have a system we can test this with now and a load of established test automation that leverages a containerised sat-link emulation proxy (we run through a matrix of dynamic adverse network scenarios and transfers of varying sizes).

I hope to open-source that tooling at some point so you guys can use it too.

@m7kvqbe1 requested a review from Acconut on June 5, 2025 20:55
@Acconut (Member) left a comment

Sure, a pre-release is doable once this has been merged.

m7kvqbe1 added 5 commits June 6, 2025 12:46 (the commit messages restate the PR description above; one also refactors the test-stall-detection test suite)
@m7kvqbe1 force-pushed the feat/stall-detection branch from 7c94d1e to daf2ae0 on June 6, 2025 12:02
@m7kvqbe1 requested a review from Acconut on June 6, 2025 12:10
@m7kvqbe1 (Author) commented Jun 6, 2025

@Acconut

Those bits should now have been addressed; feel free to take another look.

It's worth noting my wife is about to have a baby any day now, so I may not be available to progress this soon.

Once we get a pre-release out we'll be sure to heavily test it and share our findings.

@m7kvqbe1 (Author)

@Acconut

Do you think we might be able to ship a pre-release of this before the end of the week?

Sorry to keep bugging you 😅 - we'd really love to get our hands on this feature asap.

@Acconut (Member) commented Jun 12, 2025

> Do you think we might be able to ship a pre-release of this before the end of the week?

I'll try, but I cannot make any promises at this time.

@m7kvqbe1 (Author) commented Jun 12, 2025

Maybe hold off merging; I've found an issue with NodeHttpStack that we might need to talk about first.

No rush from our side; I've published a release of the fork to test with a real-world consuming service.

Basically, in my real-world tests this morning I've discovered that we can't get low-level TCP socket transfer progress from the Node HTTP module. You can only see an internal I/O buffer being filled; then you get silence until that buffer is completely drained (transmitted) and starts to be filled again. This means you consistently get false positives for stall detection.
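
To illustrate the buffering behavior (a minimal Node sketch with a placeholder URL, not code from this PR): write() only reports that data entered the request stream's internal buffer, and 'drain' fires only once that whole buffer has been flushed onward, so progress arrives in bursts with silence in between.

```ts
import { request } from 'node:http'

const req = request('http://example.com/files/abc', { method: 'PATCH' })

const chunk = Buffer.alloc(64 * 1024)

// Returns true if the chunk fit into the stream's internal buffer.
// This says nothing about whether any bytes have reached the wire yet.
const canContinue = req.write(chunk)

if (!canContinue) {
  // Until the entire internal buffer has drained towards the socket,
  // we hear nothing: there is no per-byte transmission progress.
  req.once('drain', () => {
    console.log('buffer drained, safe to write more')
  })
}
```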

@Acconut (Member) commented Jun 13, 2025

> No rush from our side; I've published a release of the fork to test with a real-world consuming service.

That's great, and it is probably the best way to take this for a test drive right now.

> Basically, in my real-world tests this morning I've discovered that we can't get low-level TCP socket transfer progress from the Node HTTP module. You can only see an internal I/O buffer being filled; then you get silence until that buffer is completely drained (transmitted) and starts to be filled again. This means you consistently get false positives for stall detection.

Oh, I see. Getting progress information in Node.js has been rather tedious compared to XMLHttpRequest in browsers, which provides progress events on its own. In Node.js, we use our own logic to write smaller chunks to the HTTP request stream to get progress information, see:

```ts
// Imports added here for context; in the tus-js-client source these come
// from the surrounding module (`throttle` is lodash.throttle, and
// HttpProgressHandler is the package's progress callback type).
import { Writable } from 'node:stream'
import throttle from 'lodash.throttle'

type HttpProgressHandler = (bytesSent: number) => void

// writeBufferToStreamWithProgress writes chunks from `source` (either a
// Buffer or Uint8Array) to the writable stream `stream`.
// The size of the chunk depends on the stream's highWaterMark to fill the
// stream's internal buffer as best as possible.
// If the internal buffer is full, the callback `onprogress` will be invoked
// to notify about the write progress. Writing will be resumed once the internal
// buffer is empty, as indicated by the emitted `drain` event.
// See https://nodejs.org/docs/latest/api/stream.html#buffering for more details
// on the buffering behavior of streams.
function writeBufferToStreamWithProgress(
  stream: Writable,
  source: Uint8Array,
  onprogress: HttpProgressHandler,
) {
  onprogress = throttle(onprogress, 100, {
    leading: true,
    trailing: false,
  })

  let offset = 0

  function writeNextChunk() {
    // Take at most the amount of bytes from highWaterMark. This should fill the
    // stream's internal buffer already.
    const chunkSize = Math.min(stream.writableHighWaterMark, source.length - offset)

    // Note: We use subarray instead of slice because it works without copying data for
    // Buffers and Uint8Arrays.
    const chunk = source.subarray(offset, offset + chunkSize)
    offset += chunk.length

    // `write` returns true if the internal buffer is not full and we should write more.
    // If the stream is destroyed because the request is aborted, it will return false
    // and no 'drain' event is emitted, so we won't continue writing data.
    const canContinue = stream.write(chunk)

    if (!canContinue) {
      // If the buffer is full, wait for the 'drain' event to write more data.
      stream.once('drain', writeNextChunk)
      onprogress(offset)
    } else if (offset < source.length) {
      // If there's still data to write and the buffer is not full, write the next chunk.
      writeNextChunk()
    } else {
      // If all data has been written, close the stream if needed, and emit a 'finish' event.
      stream.end()
    }
  }

  // Start writing the first chunk.
  writeNextChunk()
}
```

But as you correctly mentioned, this only allows us to retrieve details about how much data landed in the HTTP request stream. There are usually other buffers involved before the data hits the network. We can't even assume that the HTTP request has its own TCP connection, since HTTP/2 and HTTP/3 perform multiplexing over TCP and UDP, respectively. So even if there were a way to get low-level TCP information, it would only be useful for HTTP/1.1 requests.

That being said, I wonder if you could prevent false positives by tuning the stall detection parameters. Increasing the stall timeout while reducing the high water mark in the HTTP request stream (not sure how that's best done, to be honest) could help. In addition, you might want to experiment with different chunk sizes or throttle configurations in writeBufferToStreamWithProgress to get the desired behavior.

In addition, I tested the stall detection on my own yesterday and it does not seem to fully work in browsers yet. The logic assumes that our HTTP stacks emit an error when requests are aborted, but the XHRHttpStack does not do that. It just leaves the promise hanging. This should be easy to fix and I'm working on a patch, but more testing is needed.

I also learned that Uppy has a similar approach to stall detection: https://github.com/transloadit/uppy/blob/main/packages/%40uppy/utils/src/ProgressTimeout.ts It's a bit simpler as it doesn't use check intervals, but just renews the timeout whenever a progress event is emitted. It's also enabled by default with a timeout of 30s, and they see it working well. Might be worth a look, although Uppy is only designed for browsers and not Node.js.
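
The renewal pattern is roughly this (a sketch of the idea, not Uppy's actual code):

```ts
// Reset a single timer on every progress event; if the timer ever fires,
// the upload has gone `timeoutMs` without progress and counts as stalled.
class ProgressTimeout {
  private timer?: ReturnType<typeof setTimeout>

  constructor(
    private timeoutMs: number,
    private onTimedOut: () => void,
  ) {}

  // Call on every progress event; restarts the stall timer.
  progress(): void {
    this.done()
    this.timer = setTimeout(this.onTimedOut, this.timeoutMs)
  }

  // Call when the upload finishes or is aborted, to stop the timer.
  done(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer)
      this.timer = undefined
    }
  }
}
```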

@Acconut (Member) commented Jun 13, 2025

> In addition, I tested the stall detection on my own yesterday and it does not seem to fully work in browsers yet. The logic assumes that our HTTP stacks emit an error when requests are aborted, but the XHRHttpStack does not do that. It just leaves the promise hanging. This should be easy to fix and I'm working on a patch, but more testing is needed.

Opened #778 for this.
