feat(compute): Introduce Postgres downtime metrics #11346

ololobus · 2025-03-21T19:35:52Z

Problem

Currently, we only report the timestamp of the last moment we think Postgres was active. If Postgres becomes completely unresponsive, we keep reporting a stale timestamp, so it is impossible to distinguish 'Postgres is effectively down' from 'Postgres is running, but there is no client activity'.

Summary of changes

Refactor compute_ctl's compute monitor to make it easier to track connection errors and failed activity checks, and report:

- `now() - last_successful_check` as the current downtime on any failure
- cumulative Postgres downtime over the whole compute lifetime
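The two reported values can be illustrated with a minimal sketch. It is not the code from this PR: `DowntimeTracker` and its methods are hypothetical names, and the choice not to count the interval between the last failed check and the next successful one as downtime is an assumption made for simplicity.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker; names are illustrative, not taken from compute_ctl.
#[derive(Debug)]
pub struct DowntimeTracker {
    /// Time of the last successful activity check.
    last_successful_check: Instant,
    /// Point in time up to which downtime is already included in the total.
    accounted_until: Instant,
    /// Downtime accumulated over the whole lifetime of the tracker.
    total_downtime: Duration,
}

impl DowntimeTracker {
    pub fn new(now: Instant) -> Self {
        Self {
            last_successful_check: now,
            accounted_until: now,
            total_downtime: Duration::ZERO,
        }
    }

    /// Record a successful check at `now`; this ends the current outage.
    pub fn record_success(&mut self, now: Instant) {
        self.last_successful_check = now;
        self.accounted_until = now;
    }

    /// Record a failed check at `now` and return the current downtime,
    /// i.e. `now - last_successful_check`.
    pub fn record_failure(&mut self, now: Instant) -> Duration {
        // Add only the part of the outage that was not counted yet,
        // so repeated failures do not double-count the same interval.
        self.total_downtime += now.saturating_duration_since(self.accounted_until);
        self.accounted_until = now;
        now.saturating_duration_since(self.last_successful_check)
    }

    /// Cumulative downtime across all outages seen so far.
    pub fn total_downtime(&self) -> Duration {
        self.total_downtime
    }
}
```

Passing `now` explicitly rather than calling `Instant::now()` inside the tracker keeps the arithmetic easy to exercise with synthetic timestamps.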

After adding a test, I also noticed that the compute monitor may not reconnect even though queries fail with `connection closed` or `error communicating with the server: Connection reset by peer (os error 54)`. For some reason, `client.is_closed()` does not catch this, so I added an explicit reconnect on any failure.
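The reconnect behaviour described above can be sketched roughly as follows. This is an assumption-laden illustration, not the monitor's actual code: the `Connection` trait, `Monitor`, and `FakeConn` are hypothetical stand-ins for a real Postgres client, and errors are plain strings.

```rust
/// Hypothetical abstraction standing in for a Postgres client.
pub trait Connection {
    /// Whether the client itself believes the connection is closed.
    fn is_closed(&self) -> bool;
    /// Run one activity check; any failure is returned as `Err`.
    fn check(&mut self) -> Result<(), String>;
}

/// Hypothetical monitor that owns at most one connection.
pub struct Monitor<C, F> {
    connect: F,
    conn: Option<C>,
}

impl<C: Connection, F: FnMut() -> Result<C, String>> Monitor<C, F> {
    pub fn new(connect: F) -> Self {
        Self { connect, conn: None }
    }

    /// One monitor iteration: (re)connect if needed, then run the check.
    pub fn tick(&mut self) -> Result<(), String> {
        let needs_connect = match &self.conn {
            Some(c) => c.is_closed(),
            None => true,
        };
        if needs_connect {
            self.conn = None;
            self.conn = Some((self.connect)()?);
        }
        let result = match self.conn.as_mut() {
            Some(c) => c.check(),
            None => Err("no connection".to_string()),
        };
        if result.is_err() {
            // Do not rely on is_closed() alone: drop the connection on any
            // failure so that the next iteration reconnects explicitly.
            self.conn = None;
        }
        result
    }
}

/// Test double: never reports itself closed, but fails while unhealthy.
pub struct FakeConn {
    pub healthy: bool,
}

impl Connection for FakeConn {
    fn is_closed(&self) -> bool {
        false
    }
    fn check(&mut self) -> Result<(), String> {
        if self.healthy {
            Ok(())
        } else {
            Err("connection reset by peer".to_string())
        }
    }
}
```

With `FakeConn`, a failed check is followed by a fresh connection on the next `tick()` even though `is_closed()` keeps returning `false`, which is the situation the paragraph above describes.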

Discussion: https://neondb.slack.com/archives/C03TN5G758R/p1742489426966639

github-actions · 2025-03-21T20:32:22Z

8305 tests run: 7817 passed, 0 failed, 488 skipped (full report)

Code coverage* (full report)

functions: 33.1% (9016 of 27256 functions)
lines: 49.0% (78035 of 159255 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: 53536e7 at 2025-04-24T12:03:41.714Z

compute_tools/src/monitor.rs

compute_tools/src/metrics.rs

compute_tools/src/monitor.rs

test_runner/regress/test_compute_monitor.py

ololobus

Fixed all the review comments, thanks

compute_tools/src/metrics.rs

compute_tools/src/monitor.rs

test_runner/regress/test_compute_monitor.py

tristan957

Looks solid. Just a few more comments

compute_tools/src/monitor.rs

## Problem

After introducing a naive downtime calculation for the Postgres process inside compute in #11346, I noticed that a number of computes regularly report short downtime. After checking some particular cases, it looks like all of them report downtime close to the end of the life of the compute, i.e., when the control plane calls `/terminate` and we are waiting for Postgres to exit.

Compute monitor also produces a lot of error logs because Postgres stops accepting connections, but this is expected during the termination process.

## Summary of changes

Regularly check the compute status inside the main compute monitor loop and exit gracefully when the compute is in some terminal or soon-to-be-terminal state.

---------

Co-authored-by: Tristan Partin <[email protected]>
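As a rough illustration of the graceful exit described in that follow-up, the sketch below checks a status before every activity check and leaves the loop once the status is terminal or about to become terminal. The `Status` enum, its variants, and `run_monitor` are hypothetical; in particular, which states count as terminal is an assumption, not something stated in the text above.

```rust
/// Hypothetical compute status; the variants are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Running,
    TerminationPending,
    Terminated,
    Failed,
}

impl Status {
    /// Whether the monitor should stop instead of reporting downtime.
    /// Treating `Failed` as terminal is an assumption of this sketch.
    pub fn should_stop_monitor(self) -> bool {
        matches!(
            self,
            Status::TerminationPending | Status::Terminated | Status::Failed
        )
    }
}

/// Run up to `max_iterations` checks, returning how many were performed
/// before the status asked the monitor to stop.
pub fn run_monitor<S, C>(mut status: S, mut check: C, max_iterations: usize) -> usize
where
    S: FnMut() -> Status,
    C: FnMut(),
{
    let mut performed = 0;
    for _ in 0..max_iterations {
        // Look at the status first, so that no check (and no error log)
        // is produced once termination has started.
        if status().should_stop_monitor() {
            break;
        }
        check();
        performed += 1;
    }
    performed
}
```

A real loop would also sleep between iterations; that is left out here to keep the sketch deterministic.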

ololobus force-pushed the alexk/pg-health-check branch from a5bb108 to efb87af Compare March 24, 2025 17:48

ololobus marked this pull request as ready for review March 24, 2025 17:48

ololobus requested a review from a team as a code owner March 24, 2025 17:48

ololobus requested review from knizhnik and tristan957 March 24, 2025 17:48

ololobus changed the title ~~feat(compute): Introduce compute_pg_downtime_ms metric~~ feat(compute): Introduce Postgres downtime metrics Mar 24, 2025

ololobus requested a review from myrrc March 25, 2025 11:41

ololobus commented Mar 31, 2025

compute_tools/src/monitor.rs (resolved)

tristan957 reviewed Mar 31, 2025

ololobus commented Apr 23, 2025

ololobus force-pushed the alexk/pg-health-check branch from efb87af to 3fb9098 Compare April 23, 2025 12:39

tristan957 reviewed Apr 23, 2025

compute_tools/src/monitor.rs (outdated, resolved)

compute_tools/src/monitor.rs (outdated, resolved)

compute_tools/src/monitor.rs (resolved)

compute_tools/src/monitor.rs (outdated, resolved)

ololobus force-pushed the alexk/pg-health-check branch from 3fb9098 to 3c4d4c5 Compare April 23, 2025 15:33

tristan957 approved these changes Apr 23, 2025

ololobus added 5 commits April 24, 2025 12:51

feat(compute): Introduce compute_pg_downtime_ms metric

b62949b

Add test and cumulative downtime metric

318febf

Review fixes

3aa375f

Review pt. 2

b3fe320

Minor comment fix

53536e7

ololobus force-pushed the alexk/pg-health-check branch from 3c4d4c5 to 53536e7 Compare April 24, 2025 10:51

ololobus added this pull request to the merge queue Apr 24, 2025

Merged via the queue into main with commit 985056b Apr 24, 2025
100 checks passed

ololobus deleted the alexk/pg-health-check branch April 24, 2025 13:59

ololobus mentioned this pull request May 13, 2025

feat(compute_ctl): Implement graceful compute monitor exit #11911

Merged
