Skip to content

feat(compute): Introduce Postgres downtime metrics #11346

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 24, 2025
Merged

Conversation

ololobus
Copy link
Member

@ololobus ololobus commented Mar 21, 2025

Problem

Currently, we only report the timestamp of the last moment we think Postgres was active. The problem is that if Postgres gets completely unresponsive, we still report some old timestamp, and it's impossible to distinguish situations 'Postgres is effectively down' and 'Postgres is running, but no client activity'.

Summary of changes

Refactor the compute_ctl's compute monitor so that it was easier to track the connection errors and failed activity checks, and report

  • now() - last_successful_check as current downtime on any failure
  • cumulative Postgres downtime during the whole compute lifetime

After adding a test, I also noticed that the compute monitor may not reconnect even though queries fail with connection closed or error communicating with the server: Connection reset by peer (os error 54), but for some reason we do not catch it with client.is_closed(), so I added an explicit reconnect in case of any failures.

Discussion: https://neondb.slack.com/archives/C03TN5G758R/p1742489426966639

Copy link

github-actions bot commented Mar 21, 2025

8305 tests run: 7817 passed, 0 failed, 488 skipped (full report)


Code coverage* (full report)

  • functions: 33.1% (9016 of 27256 functions)
  • lines: 49.0% (78035 of 159255 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
53536e7 at 2025-04-24T12:03:41.714Z :recycle:

@ololobus ololobus force-pushed the alexk/pg-health-check branch from a5bb108 to efb87af Compare March 24, 2025 17:48
@ololobus ololobus marked this pull request as ready for review March 24, 2025 17:48
@ololobus ololobus requested a review from a team as a code owner March 24, 2025 17:48
@ololobus ololobus requested review from knizhnik and tristan957 March 24, 2025 17:48
@ololobus ololobus changed the title feat(compute): Introduce compute_pg_downtime_ms metric feat(compute): Introduce Postgres downtime metrics Mar 24, 2025
@ololobus ololobus requested a review from myrrc March 25, 2025 11:41
Copy link
Member Author

@ololobus ololobus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed all the review comments, thanks

@ololobus ololobus force-pushed the alexk/pg-health-check branch from efb87af to 3fb9098 Compare April 23, 2025 12:39
Copy link
Member

@tristan957 tristan957 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks solid. Just a few more comments

@ololobus ololobus force-pushed the alexk/pg-health-check branch from 3fb9098 to 3c4d4c5 Compare April 23, 2025 15:33
@ololobus ololobus force-pushed the alexk/pg-health-check branch from 3c4d4c5 to 53536e7 Compare April 24, 2025 10:51
@ololobus ololobus added this pull request to the merge queue Apr 24, 2025
Merged via the queue into main with commit 985056b Apr 24, 2025
100 checks passed
@ololobus ololobus deleted the alexk/pg-health-check branch April 24, 2025 13:59
github-merge-queue bot pushed a commit that referenced this pull request Jun 5, 2025
## Problem

After introducing a naive downtime calculation for the Postgres process
inside compute in #11346, I
noticed that some amount of computes regularly report short downtime.
After checking some particular cases, it looks like all of them report
downtime close to the end of the life of the compute, i.e., when the
control plane calls a `/terminate` and we are waiting for Postgres to
exit.

Compute monitor also produces a lot of error logs because Postgres stops
accepting connections, but it's expected during the termination process.

## Summary of changes

Regularly check the compute status inside the main compute monitor loop
and exit gracefully when the compute is in some terminal or
soon-to-be-terminal state.

---------

Co-authored-by: Tristan Partin <[email protected]>
skyzh pushed a commit that referenced this pull request Jun 6, 2025
## Problem

After introducing a naive downtime calculation for the Postgres process
inside compute in #11346, I
noticed that some amount of computes regularly report short downtime.
After checking some particular cases, it looks like all of them report
downtime close to the end of the life of the compute, i.e., when the
control plane calls a `/terminate` and we are waiting for Postgres to
exit.

Compute monitor also produces a lot of error logs because Postgres stops
accepting connections, but it's expected during the termination process.

## Summary of changes

Regularly check the compute status inside the main compute monitor loop
and exit gracefully when the compute is in some terminal or
soon-to-be-terminal state.

---------

Co-authored-by: Tristan Partin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants