Postgres Connection Timeout Crash #1606
Here's a test script that causes a connection timeout error for some users:
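A minimal sketch of the kind of script in question, based on the loop quoted below; the product name, app name, and logging setup are placeholders, not the original code:

```python
# Minimal sketch only: the product name, app name and logging setup are placeholders.
import logging

import datacube

logging.basicConfig(filename="datasets.log", level=logging.INFO)
log = logging.getLogger(__name__)

dc = datacube.Datacube(app="timeout-test")
product = "ga_ls8c_ard_3"  # placeholder product name

# search() returns a lazy generator backed by a database cursor, so the
# connection stays in use for as long as this loop runs.
for l1_dataset in dc.index.datasets.search(product=product):
    log.info("Processing dataset %s", l1_dataset.id)
```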
The suspicious part, I think, is: for l1_dataset in dc.index.datasets.search(product=product):
While the cursor is active, the connection stays open and idle between fetches. Perhaps each chunk of results that the iterator returns takes long enough to "use" that it goes beyond the timeout? In this particular example we're only logging, so I wouldn't think we're being too slow consuming our chunk of the iterator. But if we're logging to the filesystem, we do know the Lustre filesystem sometimes has very large pauses, and a handful of "large pauses" within one chunk of the iterator could push it over the connection timeout (i.e. nothing has been read from the cursor/connection within that timeout). (Only a guess at this point.)
I agree that is almost certainly the root cause. What happens if we greedily suck up the results before logging? e.g.
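A sketch of that greedy approach, reusing the placeholder names from the sketch above (not the exact snippet that was posted):

```python
# Greedy variant (sketch): drain the search cursor into a list up front,
# then do the slow per-dataset work afterwards.
datasets = list(dc.index.datasets.search(product=product))
for l1_dataset in datasets:
    log.info("Processing dataset %s", l1_dataset.id)
```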
(Obviously it hangs with no feedback to the user for a long time while it sucks the whole product's worth of datasets into memory, but does it time out?)
Also try:
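A sketch of the batch-get approach referred to in a later comment; the search_returning/bulk_get calls and the batch size here are assumptions, not the original snippet:

```python
# Batch-get sketch (assumed API usage): fetch lightweight dataset ids first,
# then pull full datasets in fixed-size batches so no single query keeps a
# cursor open for long.
ids = [row.id for row in dc.index.datasets.search_returning(('id',), product=product)]

BATCH_SIZE = 1000  # arbitrary, for illustration
for start in range(0, len(ids), BATCH_SIZE):
    for l1_dataset in dc.index.datasets.bulk_get(ids[start:start + BATCH_SIZE]):
        log.info("Processing dataset %s", l1_dataset.id)
```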
@SpacemanPaul your first example also fails on my end, although with no stack trace - it simply outputs:
That makes sense. I think the batch-get API above is probably the best approach to this class of problems.
Interesting. So we have:
I'm hunting a crash for some users running ODC code on the NCI, somewhere in database connection handling. A unique quirk of the NCI environment is that idle TCP connections are killed after 5 minutes, whereas in most environments the limit is 2 hours. This is probably relevant, but I'm not sure.
An example stack trace
I can reliably cause exceptions using plain psycopg2, by attempting to use a connection that's been idle for more than 5 minutes.
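A sketch of that reproduction (host and database name are placeholders):

```python
# Sketch of the plain-psycopg2 reproduction; host and dbname are placeholders.
import time

import psycopg2

conn = psycopg2.connect(host="example-db-host", dbname="datacube")

with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())

# Sit idle for longer than the NCI's 5-minute idle-connection limit.
time.sleep(330)

with conn.cursor() as cur:
    cur.execute("SELECT 1")  # fails with an OperationalError once the idle TCP connection has been dropped
```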
I haven't been able to reproduce the exception using the ODC Index API. I've been inserting time.sleep(330) between different calls, which is what I could do with plain psycopg2. I suspect that ODC is more resilient due to setting pool_recycle=60 on the SQLAlchemy connection pool:
datacube-core/datacube/drivers/postgres/_connections.py
Lines 131 to 134 in 4f1e636
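Those lines configure the SQLAlchemy engine roughly along these lines (a sketch, not the exact source):

```python
# Approximate shape of the referenced engine setup, not the exact source lines.
from sqlalchemy import create_engine

engine_url = "postgresql://user@host/datacube"  # placeholder

engine = create_engine(
    engine_url,
    pool_recycle=60,  # pooled connections older than 60s are discarded and replaced at checkout
)
```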
One line in the traceback that I'm suspicious of is this line 231, since I think exiting the context would close the connection. Manually calling .close() seems unnecessary:
datacube-core/datacube/drivers/postgres/_connections.py
Lines 226 to 231 in 4f1e636
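The pattern at those lines is roughly this shape (an illustrative sketch, not a copy of the source):

```python
# Illustrative sketch of the pattern in question: the connection is already
# managed by a `with` block, so the explicit close() looks redundant.
from contextlib import contextmanager

from sqlalchemy import create_engine

engine = create_engine("postgresql://user@host/datacube")  # placeholder URL

@contextmanager
def connect():
    with engine.connect() as connection:
        try:
            yield connection
        finally:
            connection.close()  # the suspicious call: the enclosing `with` already closes the connection
```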
Summary