mongo node sync alert implementation #2142
base: development/2.6
Conversation
Hello benzekrimaha, my role is to assist you with the merge of this pull request.

Available options / Available commands

Status report is not available.

Incorrect fix version
Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:
Please check the
Force-pushed from 544a30d to 15b9ede (compare)
monitoring/mongodb/alerts.yaml (outdated)

```yaml
- alert: MongoDbNodeRecovering
  expr: |
    sum by (pod) (mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", rs_state="3"} >= 1)
```
We can have all these states (excluding arbiter):
https://www.mongodb.com/docs/manual/reference/replica-states/

I wonder if we should alert on all the unwanted states, and not restrict the alert to the RECOVERING state only? Like, detecting that the state is neither 0, 1, nor 2. We may also need some specific alerts depending on the state:

- For 3 (RECOVERING):
  - MongoDB secondaries will periodically enter this state during high load, if they fail to keep up with the incoming data. In general, this state is not harmful.
  - But an issue arises when it stays there for too long (the oplog gets overwritten); that's why I think the `for: 5m` should be changed to something higher, like 24h...
  - When we perform the "full init sync" procedure, the affected MongoDB instance will also end up in this state, and depending on the cluster's size, it can take hours; but in this case it is expected...
  - It's great to have a separate alert, so we can ask the reader to check the logs; if the instance cannot recover, it will have a specific log (we document it in the troubleshooting procedure, but I'm not sure we want to reference it from Zenko).
- For 5 (STARTUP2), this is also fine, but again, if it stays for too long (> several hours), the instance may be stuck.
- For 6 (UNKNOWN), 8 (DOWN), and 10 (REMOVED), these are unwanted states, so we can alert directly without waiting.
- For 9 (ROLLBACK), this is fine as well.
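A per-state split along these lines could be sketched as follows. This is a hypothetical illustration, not the PR's final rules: the alert names and `for:` durations are assumptions, reusing the `mongodb_rs_members_state` metric and the `${namespace}`/`${service}` variables already used in this PR.

```yaml
# Hypothetical sketch, not the PR's final rules: reuse mongodb_rs_members_state
# (value = numeric replica-set state) with a different "for:" budget per state.
- alert: MongoDbNodeInUnwantedState   # 6 UNKNOWN, 8 DOWN, 10 REMOVED: alert quickly
  expr: |
    mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", rs_state=~"6|8|10"} > 0
  for: 1m
  labels:
    severity: critical
- alert: MongoDbNodeRecoveringTooLong # 3 RECOVERING: harmless unless it outlives the oplog window
  expr: |
    mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", rs_state="3"} > 0
  for: 24h
  labels:
    severity: warning
```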
done here: 6bc518d
Request integration branches
Waiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command:
Alternatively, the
Force-pushed from 225aeff to 7d5b0fe (compare), then from 7d5b0fe to 92ba78e (compare)
monitoring/mongodb/alerts.yaml (outdated)

```yaml
  labels:
    severity: warning
  annotations:
    description: "The Mongodb instance `{{ $labels.pod }}` is in the 'RECOVERING' state for over an hour. The instance may not be able to join the replica set if the platform ingests a large number of operations during this time. This alert is expected if the 'Resync a Data Services MongoDB Member' procedure has recently been executed."
```
> This alert is expected if the 'Resync a Data Services MongoDB Member' procedure has recently been executed.

Is it really? We must make sure alerts will not trigger when nothing wrong is happening, to avoid alert fatigue and people either contacting support or ignoring alerts: if the recovering is proceeding fine, then we should not have any alert... it is needed only when the recovering fails.
When we perform a full init sync, if the cluster is still processing operations, there is a risk that we never converge if the recovering takes longer than the oplog history...
The recovering instance might be in this state for more than one hour on most of the platforms deployed today.
monitoring/mongodb/alerts.yaml (outdated)

```yaml
- alert: MongoDbNodeRecovering
  expr: |
    avg_over_time(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", rs_state="3"}[1h]) == 3
```
What happens when recovering fails: does the pod restart, or does it stay in RECOVERING forever?
If the pod restarts, then the average may well be below 3...
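One way to make the expression robust to restarts would be to test whether the member hit the RECOVERING state at any point in the window rather than averaging over it. This is a hypothetical variant, not code from the PR; the window length is an illustrative assumption.

```yaml
# Hypothetical variant: max_over_time fires if the member was in state 3
# (RECOVERING) at any point in the window, so a pod restart that pulls the
# average below 3 does not mask the condition.
- alert: MongoDbNodeRecovering
  expr: |
    max_over_time(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", rs_state="3"}[1h]) == 3
```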
Indeed, we have the `livenessProbe` configured to force the pod to restart; otherwise it would stay in the recovering state forever.
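For context, such a probe can be sketched as below. The command and timings here are illustrative assumptions, not the chart's actual values.

```yaml
# Hypothetical sketch of a MongoDB liveness probe; real chart values differ.
livenessProbe:
  exec:
    command:
      - mongosh
      - --eval
      - "db.adminCommand('ping')"   # fails once mongod stops responding
  periodSeconds: 20
  timeoutSeconds: 10
  failureThreshold: 6               # kubelet restarts the pod after ~2 min of failures
```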
I actually went for a warning if, during the last hour, it has been in the recovering state at least once. If we want an alert only when it's stuck in that state, we could point that out in MongoDbNodeNotSynced or in the alert for the primary, as the node being absent might be due to the fact that it's stuck.
monitoring/mongodb/alerts.yaml (outdated)

```yaml
- alert: MongoDbNodeNotSynced
  expr: |
    sum by (pod) (mongodb_mongod_replset_number_of_members{set="data-db-mongodb-sharded-shard-0"}) != ${replicas}
```
This is not precise enough: this alert would not trigger if some pods are RECOVERING...

We already have an alert when there is no PRIMARY:

```
absent_over_time(mongodb_rs_members_state{namespace="${namespace}",pod=~"${service}.*",member_state="PRIMARY"}[1m]) == 1
```

Would it not make sense to check that we have the expected number of SECONDARY members (i.e. `${replicas} - 1`)?
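That check could be sketched as follows. The alert name and expression are assumptions, not code from the PR; the metric and template variables follow the existing alerts.

```yaml
# Hypothetical sketch: count the members currently reporting SECONDARY and
# compare with the expected count, ${replicas} - 1 (the remaining member
# being the PRIMARY, covered by its own alert).
- alert: MongoDbMissingSecondaries
  expr: |
    count(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"}) != (${replicas} - 1)
  for: 1m
  labels:
    severity: critical
# Caveat: count() returns no data when no series match at all (zero
# secondaries), so a real rule may need an absent_over_time() companion,
# as is already done for the PRIMARY.
```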
done here: 5e684e8
monitoring/mongodb/alerts.yaml (outdated)

```yaml
- alert: MongoDbNodeNotSynced
  expr: |
    sum by (pod) (mongodb_mongod_replset_number_of_members{set="data-db-mongodb-sharded-shard-0"}) != ${replicas}
```
`data-db-mongodb-sharded-shard-0` cannot and should not be hard-coded:

- we should not add a parameter, but make the alert "generic", i.e. it would trigger for any shard (or config-svr);
- filtering needs to be done using the "existing" parameters (namespace, service).
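Concretely, dropping the hard-coded set name might look like the sketch below. Treating `set` as the per-replica-set grouping label is an assumption about the exporter's labels, and `max` is used because every pod of a set reports the same member count.

```yaml
# Hypothetical sketch: no hard-coded replica-set name; the rule matches any
# shard or config server selected by the existing namespace/service
# parameters, grouped by the exporter's per-set label (assumed to be "set").
- alert: MongoDbNodeNotSynced
  expr: |
    max by (set) (mongodb_mongod_replset_number_of_members{namespace="${namespace}", pod=~"${service}.*"}) != ${replicas}
```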
done here: 5e684e8
monitoring/mongodb/alerts.yaml (outdated)

```yaml
- alert: MongoDbInvalidState
  expr: |
    avg_over_time(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", rs_state=~"6|8|10"}[1m]) > 0
  for: 1m
```
We can probably tolerate more than 1m; there may be some temporary switches to these states (on startup or shutdown, maybe) → 5m?
Or are we guaranteed this should not happen?
done here: 5e684e8
monitoring/mongodb/alerts.yaml (outdated)

```yaml
  labels:
    severity: warning
  annotations:
    description: "The Mongodb instance `{{ $labels.pod }}` is in the 'RECOVERING' state for over an hour. The instance may not be able to join the replica set if the platform ingests a large number of operations during this time. This alert is expected if the 'Resync a Data Services MongoDB Member' procedure has recently been executed."
```
We have a different way to check for the STARTUP2 and RECOVERING states: is it expected? Is there a reason not to use the same approach (or possibly the same alert)?
I went for 2 alerts in order to have a more precise description, as the recovering can be expected in some cases.
Force-pushed from f5cbaad to 5e684e8 (compare)
```yaml
    sum by (pod) (mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"}) != (${replicas} - 1)
  for: 1m
  labels:
    severity: critical
  annotations:
    description: "The MongoDB instance `{{ $labels.pod }}` is out of the replica set. It no longer receives any data and must be added back to the cluster to avoid performance and storage problems."
    summary: MongoDB node not in replica set
```
- The `pod` would be different for each time series (it is `${service}.*`), so the `sum` would not compare to `${replicas}`. To do this aggregation, we need to aggregate on a label which has a different value for each "group" of pods (mongo-shard$X, mongo-cfgsvr), but also the same value for every pod in the group.
- Since we are doing an alert on the aggregate, we will not have a single "pod", and cannot point to the node in the bad state.
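Aggregating on such a group label could be sketched as below. Assuming the exporter exposes the replica-set name as a `set` label (same value for every pod of a shard or of the config servers, different across groups); this is an assumption, not confirmed by the PR.

```yaml
# Hypothetical sketch: compare the SECONDARY count per replica set instead of
# per pod, grouping on the assumed "set" label.
- alert: MongoDbNodeNotSynced
  expr: |
    count by (set) (mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"}) != (${replicas} - 1)
# Trade-off noted above: alerting on the aggregate means the alert can name
# the group (set) but not the individual pod in the bad state.
```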
```yaml
  labels:
    severity: warning
  annotations:
    description: "The Mongodb instance `{{ $labels.pod }}` is in the 'STARTUP2' state for an hour. The instance might be stuck."
```
`instance` can be misleading; usually we (esp. customers) would think of "instance" as the whole MongoDB cluster... → prefer something less ambiguous, like `pod`:

```suggestion
description: "Mongodb pod `{{ $labels.pod }}` is in the 'STARTUP2' state for an hour. The instance might be stuck."
```
(same for the other alerts)
Issue: ZENKO-4881