Avoid failing requests when re-establishing connections to the cluster #273
Refs scylladb/java-driver#236 (from a different perspective, but a similar issue)
Regarding the proposed new behavior, wouldn't it be better if the request was sent to another replica instead of waiting for a connection to the replica we had connection issues with? However, what I've just described is essentially a retry policy, which the driver already supports...
This is only safe for idempotent queries.
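The safety condition being discussed can be sketched as follows. This is a toy illustration of the retry-policy idea, not the driver's actual API: `execute_with_retry`, `ReplicaConnectionError`, and the callable replicas are all hypothetical names.

```python
# Hypothetical sketch: retry a request on another replica only when the
# request is marked idempotent -- the safety condition discussed above.
class ReplicaConnectionError(Exception):
    """Raised when a replica cannot be reached (illustrative)."""


def execute_with_retry(replicas, request, is_idempotent):
    """Try each replica in turn; retrying is only safe for idempotent requests."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ReplicaConnectionError as exc:
            last_error = exc
            if not is_idempotent:
                # A non-idempotent request may already have taken effect on
                # the failed replica, so surface the error instead of
                # risking double execution.
                raise
    raise last_error
```

A real retry policy also has to distinguish errors that occurred before the request was sent from errors that occurred mid-flight, which is what the rest of this thread discusses.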
The client may direct the query to a specific replica, so I generally agree that we should avoid failing requests. If the driver for whatever reason really needs to reopen a connection, it should do it in phases:
Retrying queries can only happen if the user requests such a policy, and only for idempotent queries. Most of the problems we have in tests arise because the driver sometimes closes connections for no apparent reason -- e.g. after a node restart, we verify on the test side that a SELECT directed to this node succeeds (which implies that the driver already managed to open a new connection to this node!), and then a subsequent query fails (because the driver decides to close the connection again). If we prevent that from happening, we'd be good.
This is useful, first of all, during testing. Your choice in the test: fail the test because of a connection error right away, or retry and maaaybe fail the test because of the lack of idempotency + double execution. How likely is a double execution? If you get a write failure from a syscall, it means the OS actually failed to store the request in the system TCP buffer. It is in theory possible that the request is actually stored and then sent to the destination; however, in practice, the implementations are careful enough not to do that. I have never seen this lead to a double execution (although I agree one has to study the actual TCP stack implementation to be sure).
I don't mind this suggestion at all (it's just more extensive than what I proposed).
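The double-execution argument above can be summarized in a tiny, deliberately conservative model. This is a sketch of the reasoning, not driver code; `safe_to_retry` and its parameters are hypothetical, and it is stricter than the comment above (which argues that even a partial write rarely reaches the server in practice):

```python
# Toy model of the retry-safety argument: if the write() syscall failed
# before any byte of the request entered the kernel's TCP send buffer,
# the request cannot have reached the server, so retrying is safe even
# for non-idempotent requests.
def safe_to_retry(bytes_accepted_by_kernel, is_idempotent):
    if bytes_accepted_by_kernel == 0:
        # Nothing left the client; there is no risk of double execution.
        return True
    # Part of the request may be in flight; only idempotent requests
    # can be retried without risking double execution.
    return is_idempotent
```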
Example of this behavior: #317
Currently the Python driver fails network requests whenever, for whatever reason, there is no connection to the cluster or the socket is dead. However, there may be many valid reasons why the connection is temporarily unavailable, and if the request is not yet in progress, it should not be dropped.
Observed in scylladb/scylladb#16110 and in scylladb/scylladb#14746, where a belated notification about a node restart leads to a spurious test failure.
How the driver should behave:
The two steps above should significantly reduce the amount of spurious failures on topology changes.
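The proposed behavior -- hold requests while the connection is being re-established instead of failing them outright -- could look roughly like this. All names here (`PendingConnection`, `execute`, `on_reconnected`) are illustrative, not the driver's actual internals:

```python
# Minimal sketch of the proposal: when the connection is down, park new
# requests in a queue with a deadline instead of failing them, and flush
# the queue once the connection is re-established.
import collections
import time


class PendingConnection:
    def __init__(self, timeout=5.0):
        self.timeout = timeout          # how long a parked request may wait
        self.connected = False
        self._pending = collections.deque()

    def execute(self, request, send):
        """Send immediately if connected; otherwise park the request."""
        if self.connected:
            return send(request)
        # Connection is being re-established: hold the request with a
        # deadline instead of failing it immediately.
        self._pending.append((request, time.monotonic() + self.timeout))
        return None

    def on_reconnected(self, send):
        """Flush parked requests whose deadline has not yet passed."""
        self.connected = True
        results = []
        now = time.monotonic()
        while self._pending:
            request, deadline = self._pending.popleft()
            if now <= deadline:
                results.append(send(request))
            # Requests past their deadline time out, as they would have anyway.
        return results
```

A request that would have timed out during the outage still times out; the point is that a request issued during a brief reconnection window no longer fails with a spurious connection error.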