bugfix: refactor alerts to accomodate for single-node clusters #1010
I still feel that reusing the same alert for single-node clusters creates confusion. For instance, the alert description doesn't fit in this case:
I thought about reserving the final 15% as leeway for background processes that may belong to userland but are not necessarily initiated by the user, and so fall into a "grey" area.
But I agree, this introduces an opinionated constant threshold that doesn't make sense if we look at the metrics and alert definitions. Like you said, "allocatable" should mean exactly that, all the way up to 100%, and "overcommitment" should quite literally mean exceeding that limit. I'll drop the threshold.
Talking with Balut, I'm inclined to believe that users will expect the alert to adapt to SNO, that is, to fire if the requests simply exceed the allocatable resources. It may also be ambiguous to introduce new alerts for SNO: a 1+1 multi-node system may stop firing `AlertX` and start firing `AlertYSNO` when it's reduced to SNO, and the latter is arguably a derivative of the former to some degree (as some users may expect).
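To make the threshold discussion above concrete, here is a minimal sketch in Python rather than in the rule language the alerts are actually written in. The function name `is_overcommitted` and the `leeway` parameter are hypothetical and only illustrate the two readings discussed: with `leeway=0.15` the final 15% is reserved, while with the default `leeway=0.0` overcommitment literally means requests exceeding the allocatable resources.

```python
def is_overcommitted(requested: float, allocatable: float, leeway: float = 0.0) -> bool:
    """Return True when requests exceed the usable share of allocatable resources.

    `leeway` is a hypothetical knob (not taken from the PR): the fraction of
    allocatable resources held back. 0.0 means the full 100% is usable.
    """
    if allocatable <= 0:
        raise ValueError("allocatable must be positive")
    if not 0.0 <= leeway < 1.0:
        raise ValueError("leeway must be in [0.0, 1.0)")
    # Requests may use at most (1 - leeway) of the allocatable resources.
    return requested > allocatable * (1.0 - leeway)


# 9 of 10 cores requested: only the 15% leeway variant treats this as overcommitted.
print(is_overcommitted(9.0, 10.0))               # no leeway
print(is_overcommitted(9.0, 10.0, leeway=0.15))  # 15% reserved
```

Dropping the threshold corresponds to always calling the function with the default `leeway=0.0`.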
(We could of course explore the possibility of adding SNO-exclusive alerts; I just wanted to put these points out there. I'm mostly on the fence.)
To add on, I think we also need to revise how we handle SNO downstream. At the moment we recognize SNO by a `SingleReplica` infrastructure topology; however, SNO can technically have more than one node (SNO+1 configurations), which would be in line with having a different set of alerts for SNO while ensuring that upstream expects the same. I'll see if I can find any SNO-dedicated teams or people who can shed some more light here.
@rexagod how would you like to proceed with this PR? I've held back from merging as it seems there's open discussion still.
Apologies for the delay here, we are talking this through internally and I'll update as soon as there's a resolution.
FYI the tests have moved into the `tests/` directory since: