bugfix: refactor alerts to accommodate single-node clusters #1010
Conversation
For the sake of brevity, let:

- Q: `kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}` (allocatable), and
- QQ: `namespace_cpu:kube_pod_container_resource_requests:sum{}` (requested).

Both quota alerts relevant here (`KubeCPUOvercommit` and `KubeMemoryOvercommit`) thus exist in the form:

`sum(QQ) by (cluster) - (sum(Q) by (cluster) - max(Q) by (cluster)) > 0 and (sum(Q) by (cluster) - max(Q) by (cluster)) > 0`

which, in the case of a single-node cluster (`sum(Q) by (cluster)` = `max(Q) by (cluster)`), reduces to:

`sum(QQ) by (cluster) > 0`

i.e., the alert will fire if *any* resource requests exist.

To address this, drop the `max(Q) by (cluster)` buffer assumed in non-SNO clusters from SNO, reducing the expression to:

`sum(QQ) by (cluster) - sum(Q) by (cluster) > 0`

(total requested - total allocatable > 0 triggers the alert), since with only a single node a buffer of that sort does not make sense.

Signed-off-by: Pranshu Srivastava <[email protected]>
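For concreteness, the reduced single-node expression renders to roughly the following PromQL once the mixin's template variables are substituted (the selector and cluster-label values below are typical defaults, assumed here for illustration, not taken from this diff):

```promql
# Assumed substitutions: kubeStateMetricsSelector = 'job="kube-state-metrics"',
# clusterLabel = 'cluster'. Total requested CPU minus total allocatable CPU,
# per cluster; any positive difference means the cluster is overcommitted.
sum(namespace_cpu:kube_pod_container_resource_requests:sum{}) by (cluster)
  -
sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)
  > 0
```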
Force-pushed from `5b96fb5` to `5cd53d6`.
```
and
(sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
```
Suggested change:

```diff
-sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
+0.95 * sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
```
Since a `max(Q)` buffer is not applicable in SNO, how about a numeric buffer of 5% (or more)? That should help alert before things go out of budget.
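As a rough sketch of what that buffer would do (numbers are hypothetical, not from this PR):

```promql
# Hypothetical single-node cluster with 8 allocatable cores, 0.95 factor applied:
#   sum(QQ) by (cluster) - 0.95 * sum(Q) by (cluster) > 0
# fires once total requests exceed 0.95 * 8 = 7.6 cores, i.e. the alert
# triggers while ~5% of the node's capacity is still uncommitted.
```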
```
@@ -34,18 +34,34 @@
      } +
      if $._config.showMultiCluster then {
        expr: |||
          sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) - (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
          (count(kube_node_info) == 1
```
If `showMultiCluster` is `true`, that implies the `cluster` label is available, so the check here should probably use the `cluster` label (so that each cluster is checked on whether it has a single node). Additionally, I suggest a de-dupe for multiple KSM instances using `max`, like so:
Suggested change:

```diff
-(count(kube_node_info) == 1
+(count by (cluster) (max by (cluster, node) (kube_node_info)) == 1
```
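A hypothetical illustration of why the `max` de-dupe matters: with two KSM replicas, each node's `kube_node_info` series is exported twice (differing only in labels such as `instance`), so a plain `count` double-counts nodes:

```promql
# Hypothetical series from two KSM replicas scraping a single-node cluster:
#   kube_node_info{cluster="a", node="n1", instance="ksm-0"}  1
#   kube_node_info{cluster="a", node="n1", instance="ksm-1"}  1

count(kube_node_info)                                         # => 2 (looks multi-node)
count by (cluster) (max by (cluster, node) (kube_node_info))  # => 1 (correct)
```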
```
and
(sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) -
max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0)
```
From what I gather here, this pre-existing latter part of the `and` effectively excludes single-node clusters (because `sum(cpu) - max(cpu)` can never be greater than zero on a single-node cluster). This seems to conflict with the PR description:

> the alert will fire if *any* resource requests exist

Can you be more specific or provide a test case where the existing rule is not behaving as expected? Maybe this particular alert isn't affected?
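For what it's worth, a quick hypothetical evaluation of the pre-existing form on a single-node cluster (values assumed) seems to bear this out:

```promql
# Single node with 4 allocatable cores and 1 core requested:
#   sum(Q) = 4, max(Q) = 4, sum(QQ) = 1
#
# Left side:  1 - (4 - 4) = 1 > 0   -> matches
# Guard:          (4 - 4) = 0 > 0   -> empty result
#
# `and` keeps only left-side samples with a matching right-side sample,
# so the overall expression returns nothing and the alert never fires.
```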