Connection timeouts interact poorly with multiple targets #2171

Open
jwreschnig-utiq opened this issue Nov 21, 2024 · 0 comments

We have been testing pgxpool failover with a primary/standby database setup; one of the scenarios we've tested is (a minimal sketch of this setup follows the list):

  • a pool configured with postgres://primary,standby/...&prefer-standby
  • no global connect timeout, because we have a wide range of acceptable times
  • instead, we always use context timeouts, e.g. calling pool.Exec or pool.Acquire with a context carrying a 5-second timeout
  • blocking all network traffic between the client and primary
  • the standby is healthy

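For concreteness, a minimal sketch of that setup; the host names, credentials, and database name are placeholders, and the query is arbitrary:

```go
// Minimal sketch of the setup above; host names, credentials, and the
// database name are placeholders.
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	// Two hosts, standby preferred, no pool-level connect timeout.
	connString := "postgres://user:secret@primary:5432,standby:5432/app" +
		"?target_session_attrs=prefer-standby"

	pool, err := pgxpool.New(context.Background(), connString)
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// All timeouts are handled per call via contexts.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// With traffic to "primary" black-holed, this spends the whole five
	// seconds dialing the primary and never reaches the healthy standby.
	if _, err := pool.Exec(ctx, "SELECT 1"); err != nil {
		log.Printf("exec failed: %v", err)
	}
}
```
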
In this situation, the dial appears to spend its entire five seconds trying to connect to the primary, then gives up before ever attempting what would be a successful and preferred connection to the standby. From reading the code, I believe this happens because the primary is first in the list, not specifically because it's a primary or because of the preference.

To me this is surprising behavior; I expected something like net.DialContext, where the context's timeout is spread over the possible hosts, or perhaps a parallel race of dials, picking the best successful result.

I'm not filing this as a bug, because I am not 100% sure my read of the problem is correct, and because I understand there's value in aligning with libpq behavior; if libpq also does this, it might be the right default. But I cannot figure out a satisfying way around this:

  • Any pool-level connect timeout is too low for some cases and too high for others. My experience is that in general it's confusing and fragile to mix timeouts at multiple levels (pool and query), and we prefer to handle all timeouts via contexts if possible.
  • If I set a DialFunc to shorten the per-attempt timeout myself, it still only sees a single address at a time (see the sketch after this list). Even if I leak my original host string into it, some of those hosts might have resolved to multiple addresses, so I don't know how to divide up the dial context's timeout.
  • Retrying at the application level is ineffective; it gets stuck at the first server each time. We would need e.g. random load balancing for this to work.

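For example, the DialFunc workaround from the second bullet ends up looking something like this sketch; the two-second per-attempt cap is an arbitrary guess, precisely because the DialFunc cannot see how many candidate addresses remain:

```go
// A sketch of the DialFunc workaround; the two-second cap is an
// arbitrary guess, because the DialFunc sees one address at a time and
// cannot know how many candidates remain.
package db

import (
	"context"
	"net"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func NewPoolWithCappedDial(ctx context.Context, connString string) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(connString)
	if err != nil {
		return nil, err
	}

	const perAttempt = 2 * time.Second // guess at a "fair share" of the caller's budget

	cfg.ConnConfig.DialFunc = func(ctx context.Context, network, addr string) (net.Conn, error) {
		// Bound this single attempt so one unreachable address cannot
		// consume the caller's entire deadline.
		dialCtx, cancel := context.WithTimeout(ctx, perAttempt)
		defer cancel()

		var d net.Dialer
		return d.DialContext(dialCtx, network, addr)
	}

	return pgxpool.NewWithConfig(ctx, cfg)
}
```

Even with this cap, if the first host resolves to several addresses, several two-second attempts can still exhaust the caller's five-second budget before the standby is ever tried.
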
Some potential fixes I see would be:

  • spread the timeout transparently, as net.DialContext does (see the sketch after this list); this would probably be a breaking change for some people, so it might need an explicit flag to enable
  • pass details about the possible connection targets through to the DialFunc via context values (ugh), to let us implement our own timeout based on the number of possible hosts and preferences
  • allow passing a second value specifically for connect timeouts with each request, overriding the pool connect timeout (again, this would probably have to go by context value for API compatibility)
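
To illustrate the first option: the net package divides a dial deadline evenly among the remaining candidate addresses, so one unreachable host cannot consume the whole budget. A simplified sketch of that arithmetic (this is not pgx API, just the shape of the proposed behavior; the stdlib version also enforces a sane per-attempt minimum):

```go
// Not pgx API; just the shape of the proposed behavior, simplified
// from the deadline-splitting that net.DialContext already does.
package dial

import "time"

// perAttemptDeadline gives each remaining candidate address an equal
// share of whatever time is left before the overall deadline.
func perAttemptDeadline(now, overall time.Time, remainingAddrs int) time.Time {
	if remainingAddrs <= 0 {
		return overall
	}
	share := overall.Sub(now) / time.Duration(remainingAddrs)
	return now.Add(share)
}
```

With a five-second budget and two hosts, the primary's dial would be cut off after roughly 2.5 seconds, leaving the standby roughly 2.5 seconds; today the primary gets the entire five.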