feat(compute): Introduce Postgres downtime metrics #11346
Merged
8305 tests run: 7817 passed, 0 failed, 488 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results
53536e7 at 2025-04-24T12:03:41.714Z
a5bb108
to
efb87af
Compare
ololobus
commented
Mar 31, 2025
tristan957
reviewed
Mar 31, 2025
ololobus
commented
Apr 23, 2025
Fixed all the review comments, thanks
efb87af
to
3fb9098
Compare
tristan957
reviewed
Apr 23, 2025
Looks solid. Just a few more comments
3fb9098
to
3c4d4c5
Compare
tristan957
approved these changes
Apr 23, 2025
3c4d4c5
to
53536e7
Compare
github-merge-queue bot
pushed a commit
that referenced
this pull request
Jun 5, 2025
## Problem

After introducing a naive downtime calculation for the Postgres process inside compute in #11346, I noticed that some amount of computes regularly report short downtime.

After checking some particular cases, it looks like all of them report downtime close to the end of the life of the compute, i.e., when the control plane calls a `/terminate` and we are waiting for Postgres to exit.

Compute monitor also produces a lot of error logs because Postgres stops accepting connections, but it's expected during the termination process.

## Summary of changes

Regularly check the compute status inside the main compute monitor loop and exit gracefully when the compute is in some terminal or soon-to-be-terminal state.

---------

Co-authored-by: Tristan Partin <[email protected]>
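The status check described in this commit message can be illustrated with a minimal sketch. It is not the code from the commit: the `ComputeStatus` variants, `should_stop_monitor`, and `run_monitor` are hypothetical names chosen for illustration, and the sleep between iterations that a real monitor loop would have is omitted.

```rust
/// Hypothetical subset of compute states; the real names in compute_ctl may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ComputeStatus {
    Running,
    TerminationPending,
    Terminated,
}

impl ComputeStatus {
    /// True for terminal or soon-to-be-terminal states (an assumed classification).
    fn should_stop_monitor(self) -> bool {
        matches!(self, Self::TerminationPending | Self::Terminated)
    }
}

/// Runs activity checks until the compute status says the monitor should stop.
/// Returns the number of checks performed before exiting.
fn run_monitor(
    mut status: impl FnMut() -> ComputeStatus,
    mut check: impl FnMut(),
) -> usize {
    let mut checks = 0;
    loop {
        // Look at the status first, so that no check (and no error log or
        // downtime report) is produced while Postgres is shutting down.
        if status().should_stop_monitor() {
            return checks;
        }
        check();
        checks += 1;
    }
}
```

Checking the status before each activity check, rather than after a failure, is what keeps the expected connection errors during `/terminate` from being counted as downtime in this sketch.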
skyzh
pushed a commit
that referenced
this pull request
Jun 6, 2025
Problem
Currently, we only report the timestamp of the last moment we think Postgres was active. The problem is that if Postgres becomes completely unresponsive, we still report a stale timestamp, so it is impossible to distinguish between 'Postgres is effectively down' and 'Postgres is running, but there is no client activity'.
Summary of changes
Refactor `compute_ctl`'s compute monitor so that it is easier to track connection errors and failed activity checks, and report `now() - last_successful_check` as the current downtime on any failure.

After adding a test, I also noticed that the compute monitor may not reconnect even though queries fail with `connection closed` or `error communicating with the server: Connection reset by peer (os error 54)`. For some reason, we do not catch this with `client.is_closed()`, so I added an explicit reconnect in case of any failure.

Discussion: https://neondb.slack.com/archives/C03TN5G758R/p1742489426966639
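The downtime rule in the summary can be sketched in a few lines of Rust. This is an illustration under assumptions, not the code from this PR: `DowntimeTracker`, `record`, and `needs_reconnect` are hypothetical names, and the check result is passed in rather than obtained from a real Postgres client.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker; names are illustrative and not taken from compute_ctl.
struct DowntimeTracker {
    /// Time of the most recent check that succeeded.
    last_successful_check: Instant,
    /// Set on any failure, so the caller reconnects explicitly instead of
    /// relying on the client reporting itself as closed.
    needs_reconnect: bool,
}

impl DowntimeTracker {
    fn new(now: Instant) -> Self {
        Self {
            last_successful_check: now,
            needs_reconnect: false,
        }
    }

    /// Records the outcome of one activity check and returns the current downtime.
    fn record<E>(&mut self, now: Instant, check: Result<(), E>) -> Duration {
        match check {
            Ok(()) => {
                self.last_successful_check = now;
                self.needs_reconnect = false;
                Duration::ZERO
            }
            Err(_) => {
                self.needs_reconnect = true;
                // Downtime is now() - last_successful_check, clamped at zero.
                now.saturating_duration_since(self.last_successful_check)
            }
        }
    }
}
```

In this sketch, consecutive failures keep growing the reported downtime because `last_successful_check` only moves forward on success, and a single successful check resets it to zero.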