Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid failing requests when re-establishing connections to the cluster #273

Open
kostja opened this issue Nov 23, 2023 · 5 comments
Open
Labels
enhancement New feature or request

Comments

@kostja
Copy link

kostja commented Nov 23, 2023

Currently the python driver is frivolously failing network requests if for whatever reason there is no connection to the cluster. or the socket is dead. However, there may be many valid reasons why the connection is not available, and if the request is not in progress, it should not be dropped.

Observed in scylladb/scylladb#16110 and in scylladb/scylladb#14746 , where a belated notification about a node restart leads to a spurious test failure.

How the driver should behave:

  • if there is no viable connection in the pool, wait until there is an available connection, and then try to execute the request.
  • if there is a write failure when trying to write the request data to the socket, read(0 bytes) from the socket, to see if there is an EOF. If there is an eof, e.g. the physical connection is dead, try to open another connection, and retry the request.

The two steps above should significantly reduce the amount of spurious failures on topology changes.

@avelanarius
Copy link

Refs scylladb/java-driver#236 (from a different perspective, but a similar issue)

@avelanarius
Copy link

avelanarius commented Feb 9, 2024

How the driver should behave:

  • if there is no viable connection in the pool, wait until there is an available connection, and then try to execute the request.
  • if there is a write failure when trying to write the request data to the socket, read(0 bytes) from the socket, to see if there is an EOF. If there is an eof, e.g. the physical connection is dead, try to open another connection, and retry the request.
    The two steps above should significantly reduce the amount of spurious failures on topology changes.

Regarding the proposed new behavior, wouldn't it be better if the request was sent to another replica instead of waiting for a connection to the replica that we had connection issues with? However what I've just described is essentially a retry policy that the driver already has support for...

@kbr-scylla
Copy link

if there is a write failure when trying to write the request data to the socket, read(0 bytes) from the socket, to see if there is an EOF. If there is an eof, e.g. the physical connection is dead, try to open another connection, and retry the request.

This is only safe for idempotent queries.

Regarding the proposed new behavior, wouldn't it be better if the request was sent to another replica instead of waiting for a connection to the replica that we had connection issues with?

The client may direct the query to a specific replica.


So I generally agree that we should avoid failing requests.
This should be done by not closing connections that are live and currently running queries.

If the driver for whatever reason really really needs to reopen a connection, it should do it in phases:

  • open a new connection, but don't yet close the old one
  • start routing new queries to the new connection
  • drain queries from the old connection (i.e. wait for all existing queries to finish)
  • only then close the old connection

Retrying queries can only happen if the user requests for such policy, and only for idempotent queries.

The most problems we have in tests is because the driver is sometimes closing connections for no apparent reason -- e.g. after node restart, we verify on the test side that a SELECT directed to this node succeeds (which implies that the driver already managed to open a new connection to this node!), and then subsequent query fails (because driver decides to close it again.) If we prevent that from happening, we'd be good.

@kostja
Copy link
Author

kostja commented Mar 5, 2024

if there is a write failure when trying to write the request data to the socket, read(0 bytes) from the socket, to see if there is an EOF. If there is an eof, e.g. the physical connection is dead, try to open another connection, and retry the request.

This is only safe for idempotent queries.

This is useful, first of all, during testing. Your choice in the test: fail the test because of a connection error right away, or retry and maaaybe fail the test because of the lack of idempotency + double execution.

How likely there is a double execution? If you get a write failure from a syscall it means the OS actually failed to store the request in the system TCP buffer. It is in theory possible that the request is actually stored and then sent to the destination, however, in practice, the implementations are careful enough to not do that. I haven't seen this ever leading to a double execution in my life (although I agree one has to study the actual tcp stack implementation to be sure).

Regarding the proposed new behavior, wouldn't it be better if the request was sent to another replica instead of waiting for a connection to the replica that we had connection issues with?

The client may direct the query to a specific replica.

So I generally agree that we should avoid failing requests. This should be done by not closing connections that are live and currently running queries.

If the driver for whatever reason really really needs to reopen a connection, it should do it in phases:

  • open a new connection, but don't yet close the old one
  • start routing new queries to the new connection
  • drain queries from the old connection (i.e. wait for all existing queries to finish)
  • only then close the old connection

Retrying queries can only happen if the user requests for such policy, and only for idempotent queries.

The most problems we have in tests is because the driver is sometimes closing connections for no apparent reason -- e.g. after node restart, we verify on the test side that a SELECT directed to this node succeeds (which implies that the driver already managed to open a new connection to this node!), and then subsequent query fails (because driver decides to close it again.) If we prevent that from happening, we'd be good.

I don't mind this suggestion at all (just more extensive than what I proposed).

@kbr-scylla
Copy link

The most problems we have in tests is because the driver is sometimes closing connections for no apparent reason

Example of this behavior: #317

@roydahan roydahan added the enhancement New feature or request label Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants