Objectstorage access alerts #1428

Merged
merged 4 commits into from
Nov 19, 2024
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Bump alloy-rules app version to 0.7.0
- Upgrades alloy from 1.4.2 to 1.5.0

### Added

- new MimirObjectStorageLowRate alert
- new LokiObjectStorageLowRate alert

## [4.25.0] - 2024-11-18

### Changed
@@ -195,3 +195,32 @@ spec:
        severity: page
        team: atlas
        topic: observability
    - alert: LokiObjectStorageLowRate
      annotations:
        dashboard: loki-operational/loki-operational
        description: '{{`Loki object storage write rate is down.`}}'
        opsrecipe: loki/
      expr: |
        irate(loki_rate_store_stream_rate_bytes_count[5m]) == 0

Contributor:
Why do we need the second part of the expression?

Contributor Author:
It makes sure the alert fires when the metrics don't exist.
I added a comment to explain it.

Contributor:
Oh, good thinking. You might do this for Loki canary too, because I think we don't have that for them.

        # This part will fire the alert when the metric does not exist
        or (
          label_replace(
            capi_cluster_status_condition{type="ControlPlaneReady", status="True", cluster_type="management_cluster"},
            "cluster_id",
            "$1",
            "name",
            "(.*)"
          ) == 1
        ) unless on (cluster_id) (
          count(loki_rate_store_stream_rate_bytes_count) by (cluster_id)
        )
      for: 1h
      labels:
        area: platform
        cancel_if_cluster_status_creating: "true"
        cancel_if_cluster_status_deleting: "true"
        cancel_if_cluster_status_updating: "true"
        cancel_if_outside_working_hours: "true"
        severity: page
        team: atlas
        topic: observability
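
The fallback discussed in the review thread above relies on PromQL set operators: `A or (B unless on (cluster_id) C)`. The following is a minimal Python sketch of that logic, not how Prometheus evaluates it internally. It models an instant vector as a mapping from label set to value; the helper names (`labels`, `unless_on`, `or_`, `evaluate`) are hypothetical and exist only for this illustration. The `label_replace` step is left out, since it only copies `name` into `cluster_id`.

```python
from typing import Dict, FrozenSet, Tuple

# An instant vector is modeled as {label set -> sample value} (illustrative only).
Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]


def labels(**kv: str) -> Labels:
    """Build a hashable label set from keyword arguments."""
    return frozenset(kv.items())


def unless_on(left: Vector, right: Vector, key: str) -> Vector:
    """Keep left-hand series whose `key` label value has no match on the right."""
    right_keys = {dict(ls).get(key) for ls in right}
    return {ls: v for ls, v in left.items() if dict(ls).get(key) not in right_keys}


def or_(left: Vector, right: Vector) -> Vector:
    """All left-hand series, plus right-hand series with no identical label set on the left."""
    out = dict(left)
    for ls, v in right.items():
        out.setdefault(ls, v)
    return out


def evaluate(rate: Vector, control_plane_ready: Vector, metric_count: Vector) -> Vector:
    """Mirror the shape of the alert expression: stalled series, or ready clusters with no metric."""
    stalled = {ls: v for ls, v in rate.items() if v == 0}               # irate(...) == 0
    ready = {ls: v for ls, v in control_plane_ready.items() if v == 1}  # capi condition == 1
    return or_(stalled, unless_on(ready, metric_count, "cluster_id"))


if __name__ == "__main__":
    series = labels(cluster_id="myinstall")
    ready = {labels(cluster_id="myinstall", name="myinstall"): 1.0}

    print(evaluate({series: 5.0}, ready, {series: 1.0}))  # metric present and moving
    print(evaluate({series: 0.0}, ready, {series: 1.0}))  # metric present but flat
    print(evaluate({}, ready, {}))                        # metric missing entirely
```

In the third case the result carries the labels of the cluster condition series rather than those of the storage metric, which is why the unit tests in this PR expect extra `name`, `status`, and `type` labels on the alert that fires while the metric is absent.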
@@ -266,4 +266,33 @@ spec:
        severity: page
        team: atlas
        topic: observability
    - alert: MimirObjectStorageLowRate
      annotations:
        dashboard: 8280707b8f16e7b87b840fc1cc92d4c5/mimir-writes
        description: '{{`Mimir object storage write rate is down.`}}'
        opsrecipe: mimir/
      expr: |
        irate(cortex_bucket_store_sent_chunk_size_bytes_count[5m]) == 0
        # This part will fire the alert when the metric does not exist
        or (
          label_replace(
            capi_cluster_status_condition{type="ControlPlaneReady", status="True", cluster_type="management_cluster"},
            "cluster_id",
            "$1",
            "name",
            "(.*)"
          ) == 1
        ) unless on (cluster_id) (
          count(cortex_bucket_store_sent_chunk_size_bytes_count) by (cluster_id)
        )
      for: 1h
      labels:
        area: platform
        cancel_if_cluster_status_creating: "true"
        cancel_if_cluster_status_deleting: "true"
        cancel_if_cluster_status_updating: "true"
        cancel_if_outside_working_hours: "true"
        severity: page
        team: atlas
        topic: observability
{{- end }}
@@ -566,3 +566,66 @@ tests:
              dashboard: mimir-continous-test/mimir-continous-test
              description: "Mimir continuous test myinstall is not producing metrics."
              opsrecipe: "mimir/"

  # Test for MimirObjectStorageLowRate alert
  - interval: 1m
    input_series:
      - series: 'cortex_bucket_store_sent_chunk_size_bytes_count{cluster_id="myinstall", cluster_type="management_cluster", installation="myinstall", namespace="mimir", pipeline="stable", provider="capa"}'
        values: "_x90 1+1x90 90+0x90"
      - series: 'capi_cluster_status_condition{cluster_id="myinstall", cluster_type="management_cluster", installation="myinstall", namespace="mimir", pipeline="stable", provider="capa", name="myinstall", type="ControlPlaneReady", status="True"}'
        values: "1+0x270"
    alert_rule_test:
      - alertname: MimirObjectStorageLowRate
        eval_time: 40m
      - alertname: MimirObjectStorageLowRate
        eval_time: 70m
        exp_alerts:
          - exp_labels:
              area: platform
              cancel_if_outside_working_hours: "true"
              cancel_if_cluster_status_creating: "true"
              cancel_if_cluster_status_deleting: "true"
              cancel_if_cluster_status_updating: "true"
              cluster_id: myinstall
              cluster_type: management_cluster
              installation: myinstall
              name: myinstall
              namespace: mimir
              pipeline: stable
              provider: capa
              severity: page
              status: "True"
              team: atlas
              topic: observability
              type: ControlPlaneReady
            exp_annotations:
              dashboard: 8280707b8f16e7b87b840fc1cc92d4c5/mimir-writes
              description: "Mimir object storage write rate is down."
              opsrecipe: "mimir/"
      - alertname: MimirObjectStorageLowRate
        eval_time: 100m
      - alertname: MimirObjectStorageLowRate
      - alertname: MimirObjectStorageLowRate
        eval_time: 200m
      - alertname: MimirObjectStorageLowRate
        eval_time: 250m
        exp_alerts:
          - exp_labels:
              area: platform
              cancel_if_outside_working_hours: "true"
              cancel_if_cluster_status_creating: "true"
              cancel_if_cluster_status_deleting: "true"
              cancel_if_cluster_status_updating: "true"
              cluster_id: myinstall
              cluster_type: management_cluster
              installation: myinstall
              namespace: mimir
              pipeline: stable
              provider: capa
              severity: page
              team: atlas
              topic: observability
            exp_annotations:
              dashboard: 8280707b8f16e7b87b840fc1cc92d4c5/mimir-writes
              description: "Mimir object storage write rate is down."
              opsrecipe: "mimir/"
@@ -566,3 +566,66 @@ tests:
              dashboard: mimir-continous-test/mimir-continous-test
              description: "Mimir continuous test myinstall is not producing metrics."
              opsrecipe: "mimir/"

  # Test for MimirObjectStorageLowRate alert
  - interval: 1m
    input_series:
      - series: 'cortex_bucket_store_sent_chunk_size_bytes_count{cluster_id="myinstall", cluster_type="management_cluster", installation="myinstall", namespace="mimir", pipeline="stable", provider="capz"}'
        values: "_x90 1+1x90 90+0x90"
      - series: 'capi_cluster_status_condition{cluster_id="myinstall", cluster_type="management_cluster", installation="myinstall", namespace="mimir", pipeline="stable", provider="capz", name="myinstall", type="ControlPlaneReady", status="True"}'
        values: "1+0x270"
    alert_rule_test:
      - alertname: MimirObjectStorageLowRate
        eval_time: 40m
      - alertname: MimirObjectStorageLowRate
        eval_time: 70m
        exp_alerts:
          - exp_labels:
              area: platform
              cancel_if_outside_working_hours: "true"
              cancel_if_cluster_status_creating: "true"
              cancel_if_cluster_status_deleting: "true"
              cancel_if_cluster_status_updating: "true"
              cluster_id: myinstall
              cluster_type: management_cluster
              installation: myinstall
              name: myinstall
              namespace: mimir
              pipeline: stable
              provider: capz
              severity: page
              status: "True"
              team: atlas
              topic: observability
              type: ControlPlaneReady
            exp_annotations:
              dashboard: 8280707b8f16e7b87b840fc1cc92d4c5/mimir-writes
              description: "Mimir object storage write rate is down."
              opsrecipe: "mimir/"
      - alertname: MimirObjectStorageLowRate
        eval_time: 100m
      - alertname: MimirObjectStorageLowRate
      - alertname: MimirObjectStorageLowRate
        eval_time: 200m
      - alertname: MimirObjectStorageLowRate
        eval_time: 250m
        exp_alerts:
          - exp_labels:
              area: platform
              cancel_if_outside_working_hours: "true"
              cancel_if_cluster_status_creating: "true"
              cancel_if_cluster_status_deleting: "true"
              cancel_if_cluster_status_updating: "true"
              cluster_id: myinstall
              cluster_type: management_cluster
              installation: myinstall
              namespace: mimir
              pipeline: stable
              provider: capz
              severity: page
              team: atlas
              topic: observability
            exp_annotations:
              dashboard: 8280707b8f16e7b87b840fc1cc92d4c5/mimir-writes
              description: "Mimir object storage write rate is down."
              opsrecipe: "mimir/"
@@ -324,3 +324,63 @@ tests:
              opsrecipe: "loki/"
      - alertname: LokiMissingLogs
        eval_time: 300m

  # Test for LokiObjectStorageLowRate alert
  - interval: 1m
    input_series:
      - series: 'loki_rate_store_stream_rate_bytes_count{cluster_id="myinstall", cluster_type="management_cluster", installation="myinstall", namespace="loki", pipeline="stable"}'
        values: "_x90 1+1x90 90+0x90"
      - series: 'capi_cluster_status_condition{cluster_id="myinstall", cluster_type="management_cluster", installation="myinstall", namespace="loki", pipeline="stable", name="myinstall", type="ControlPlaneReady", status="True"}'
        values: "1+0x270"
    alert_rule_test:
      - alertname: LokiObjectStorageLowRate
        eval_time: 40m
      - alertname: LokiObjectStorageLowRate
        eval_time: 70m
        exp_alerts:
          - exp_labels:
              area: platform
              cancel_if_outside_working_hours: "true"
              cancel_if_cluster_status_creating: "true"
              cancel_if_cluster_status_deleting: "true"
              cancel_if_cluster_status_updating: "true"
              cluster_id: myinstall
              cluster_type: management_cluster
              installation: myinstall
              name: myinstall
              namespace: loki
              pipeline: stable
              severity: page
              status: "True"
              team: atlas
              topic: observability
              type: ControlPlaneReady
            exp_annotations:
              dashboard: loki-operational/loki-operational
              description: "Loki object storage write rate is down."
              opsrecipe: "loki/"
      - alertname: LokiObjectStorageLowRate
        eval_time: 100m
      - alertname: LokiObjectStorageLowRate
        eval_time: 200m
      - alertname: LokiObjectStorageLowRate
        eval_time: 250m
        exp_alerts:
          - exp_labels:
              area: platform
              cancel_if_outside_working_hours: "true"
              cancel_if_cluster_status_creating: "true"
              cancel_if_cluster_status_deleting: "true"
              cancel_if_cluster_status_updating: "true"
              cluster_id: myinstall
              cluster_type: management_cluster
              installation: myinstall
              namespace: loki
              pipeline: stable
              severity: page
              team: atlas
              topic: observability
            exp_annotations:
              dashboard: loki-operational/loki-operational
              description: "Loki object storage write rate is down."
              opsrecipe: "loki/"