
Thoughts on handling connection issues

Marius edited this page Sep 22, 2020 · 2 revisions

Right now, tusd has a read timeout of 6s, after which the server closes the connection. The idea is to detect dead connections quickly, so that the upload lock is not held for too long. The problem is that the timeout may also fire when the connection is still alive but data is transferred very slowly (e.g. on a 1 KB/s connection with a server receive buffer of 16 KB, it would take 16s to fill the buffer, exceeding the timeout of 6s).

We could increase the read timeout, so that slow connections have enough time to fill the buffers and do not trigger the timeout as easily. However, this also means that the lock for an upload is held for a longer duration. Assume we have a timeout of 30s: if the client's connection gets interrupted without the server noticing, the server is left with a half-open connection (see https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html). If the client then wants to resume the interrupted upload, the server might take up to 30s until it declares the connection timed out and releases the lock.

Therefore, we want both the quick detection of dead connections that a small timeout gives us and the tolerance for slow connections that a big timeout gives us.

Can TCP Keepalive help? It allows the server to probe the client and check if the connection is still alive (https://felixge.de/2014/08/26/using-tcp-keepalive-with-go/). The default settings on Linux are to start sending probes after 2 hours of inactivity, at an interval of 75s. If 9 probes in a row go unanswered, the connection is reset. Maybe it would be possible to lower these numbers and detect connection issues earlier.

The approaches above deal with server-side detection of connection issues. The underlying problem is that an upload resource can only be accessed by one client at a time to avoid data corruption. Concurrent access is prevented by locks. If we assume that only one client modifies an upload resource, an incoming request from that client indicates that it is no longer interested in its previous requests. Therefore, the server should not return an error if the lock cannot be acquired, but should instead terminate any previous request, causing the lock to be released and making it available for the new request.

So, if the connection is half-open, the client can just send a new request and the server will take care of cleaning up the previous request and making sure the upload can be modified again.

Does this have security implications? Yes, an attacker can limit the availability of an upload resource. By sending a request to the upload URL, the attacker can interrupt the legitimate upload request. However, this is only possible if the attacker knows the upload URL, and it can be prevented by requiring authentication.

Tasks:

  • Measure the current rates of 500 i/o timeout and 423 Locked responses and the rate of unfinished uploads
  • Ensure that filestore and s3store properly handle cancellations
  • Implement a new memorylocker which can cancel other requests
  • Increase the default read timeout
  • Deploy and measure changes in error rates