Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Sorry to do another pull request on this, especially since this one is slightly more disruptive. I just discovered a number of problems while trying to monitor more of our stuff. I also discovered that my original pull request didn't quite solve the problem that I was trying to solve. This corrects a few problems:
The query itself was computing values backwards and generating negative differences in all cases. Also, the coalesce that I submitted yesterday was too simplistic and ended up returning incorrect values in some cases. This query will return correct values that I've tested against multiple clusters that I currently run including against a cluster that has multiple active and inactive replication slots.
Line 5536/5550, there is no value in
$db->{slot_name}
ever. It is never assigned. I'm fairly certain that what is intended is to record the maximum backlog on the database across all slots so that's what it does now.Line 5541/5560, when no slots match the exclusion rules it should still be OK and not UNKNOWN.
In general,
$max
should contain the maximum number of bytes that a slot is causing the database to retain. In this original code$max
was also being used to store the results of the check (were slots skipped? set to-2
I guess?). Because the query was generating negative differences the check on line 5530 would never pass because$r->{delta}
would never be greater than-1
(or-2
) because it was always negative. So basically this check would never fail and definitely didn't work when multiple slots existed.The tests for this (in 02_replication_slots.t) continue to pass with these changes but I'm not 100% sure that they were good or exhaustive tests to begin with. I don't have a good example in the existing tests of how to test replication between multiple servers in the tests so if you have pointers on where to start writing those tests then I'd be happy to write some.