Connection timeouts interact poorly with multiple targets #2171

Open
jwreschnig-utiq opened this issue Nov 21, 2024 · 0 comments

We have been testing pgxpool failover with a primary/standby database setup; one of the scenarios we've tested is (a minimal sketch of this setup follows the list):

  • a pool configured with postgres://primary,standby/...&prefer-standby
  • no global connect timeout, because we have a wide range of acceptable times
  • instead, we always use context timeouts, e.g. calling pool.Exec or pool.Acquire with a context carrying a 5-second timeout
  • blocking all network traffic between the client and primary
  • the standby is healthy

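For concreteness, a minimal sketch of that setup; the host names, credentials, and database name are placeholders, and the query is arbitrary:

```go
// Minimal sketch of the setup above; host names, credentials, and the
// database name are placeholders.
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	// Two hosts, standby preferred, no pool-level connect timeout.
	connString := "postgres://user:secret@primary:5432,standby:5432/app" +
		"?target_session_attrs=prefer-standby"

	pool, err := pgxpool.New(context.Background(), connString)
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// All timeouts are handled per call via contexts.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// With traffic to "primary" black-holed, this spends the whole five
	// seconds dialing the primary and never reaches the healthy standby.
	if _, err := pool.Exec(ctx, "SELECT 1"); err != nil {
		log.Printf("exec failed: %v", err)
	}
}
```
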
In this situation, the dial appears to spend its entire five seconds trying to connect to the primary, then gives up before ever attempting what would be a successful and preferred connection to the standby. From reading the code, I believe this happens because the primary is first in the list, not specifically because it's a primary or because of the preference.

To me this is surprising behavior; I expected something like net.DialContext, where the context's timeout is spread over the possible hosts, or perhaps a parallel race of dials, picking the best successful result.

I'm not filing this as a bug, because I am not 100% sure my read of the problem is correct, and because I understand there's value in aligning with libpq behavior; if libpq also does this, it might be the right default. But I cannot figure out a satisfying way around this:

  • Any pool-level connect timeout is too low for some cases and too high for others. My experience is that in general it's confusing and fragile to mix timeouts at multiple levels (pool and query), and we prefer to handle all timeouts via contexts if possible.
  • If I set a DialFunc to shorten the per-attempt timeout myself, it still only sees a single address at a time (see the sketch after this list). Even if I leak my original host string into it, some of those hosts might have resolved to multiple addresses, so I don't know how to divide up the dial context's timeout.
  • Retrying at the application level is ineffective; it gets stuck at the first server each time. We would need e.g. random load balancing for this to work.

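For example, the DialFunc workaround from the second bullet ends up looking something like this sketch; the two-second per-attempt cap is an arbitrary guess, precisely because the DialFunc cannot see how many candidate addresses remain:

```go
// A sketch of the DialFunc workaround; the two-second cap is an
// arbitrary guess, because the DialFunc sees one address at a time and
// cannot know how many candidates remain.
package db

import (
	"context"
	"net"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func NewPoolWithCappedDial(ctx context.Context, connString string) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(connString)
	if err != nil {
		return nil, err
	}

	const perAttempt = 2 * time.Second // guess at a "fair share" of the caller's budget

	cfg.ConnConfig.DialFunc = func(ctx context.Context, network, addr string) (net.Conn, error) {
		// Bound this single attempt so one unreachable address cannot
		// consume the caller's entire deadline.
		dialCtx, cancel := context.WithTimeout(ctx, perAttempt)
		defer cancel()

		var d net.Dialer
		return d.DialContext(dialCtx, network, addr)
	}

	return pgxpool.NewWithConfig(ctx, cfg)
}
```

Even with this cap, if the first host resolves to several addresses, several two-second attempts can still exhaust the caller's five-second budget before the standby is ever tried.
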
Some potential fixes I see would be:

  • spread the timeout transparently, as net.DialContext does (see the sketch after this list); this would probably be a breaking change for some people, so it might need an explicit flag to enable
  • pass details about the possible connection targets through to the DialFunc via context values (ugh), to let us implement our own timeout based on the number of possible hosts and preferences
  • allow passing a second value specifically for connect timeouts with each request, overriding the pool connect timeout (again, this would probably have to go by context value for API compatibility)
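
To illustrate the first option: the net package divides a dial deadline evenly among the remaining candidate addresses, so one unreachable host cannot consume the whole budget. A simplified sketch of that arithmetic (this is not pgx API, just the shape of the proposed behavior; the stdlib version also enforces a sane per-attempt minimum):

```go
// Not pgx API; just the shape of the proposed behavior, simplified
// from the deadline-splitting that net.DialContext already does.
package dial

import "time"

// perAttemptDeadline gives each remaining candidate address an equal
// share of whatever time is left before the overall deadline.
func perAttemptDeadline(now, overall time.Time, remainingAddrs int) time.Time {
	if remainingAddrs <= 0 {
		return overall
	}
	share := overall.Sub(now) / time.Duration(remainingAddrs)
	return now.Add(share)
}
```

With a five-second budget and two hosts, the primary's dial would be cut off after roughly 2.5 seconds, leaving the standby roughly 2.5 seconds; today the primary gets the entire five.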