
mongo node sync alert implementation #2142

Open
wants to merge 3 commits into development/2.6 from improvement/ZENKO-4881-mongo-sync-alert
Conversation

benzekrimaha
Contributor

@benzekrimaha benzekrimaha commented Sep 2, 2024

Issue: ZENKO-4881

@bert-e
Contributor

bert-e commented Sep 2, 2024

Hello benzekrimaha,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

| name | description | privileged | authored |
| ---- | ----------- | ---------- | -------- |
| /after_pull_request | Wait for the given pull request id to be merged before continuing with the current one. | | |
| /bypass_author_approval | Bypass the pull request author's approval | | |
| /bypass_build_status | Bypass the build and test status | | |
| /bypass_commit_size | Bypass the check on the size of the changeset | TBA | |
| /bypass_incompatible_branch | Bypass the check on the source branch prefix | | |
| /bypass_jira_check | Bypass the Jira issue check | | |
| /bypass_peer_approval | Bypass the pull request peers' approval | | |
| /bypass_leader_approval | Bypass the pull request leaders' approval | | |
| /approve | Instruct Bert-E that the author has approved the pull request. | | ✍️ |
| /create_pull_requests | Allow the creation of integration pull requests. | | |
| /create_integration_branches | Allow the creation of integration branches. | | |
| /no_octopus | Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead | | |
| /unanimity | Change review acceptance criteria from one reviewer at least to all reviewers | | |
| /wait | Instruct Bert-E not to run until further notice. | | |

Available commands

| name | description | privileged |
| ---- | ----------- | ---------- |
| /help | Print Bert-E's manual in the pull request. | |
| /status | Print Bert-E's current status in the pull request | TBA |
| /clear | Remove all comments from Bert-E from the history | TBA |
| /retry | Re-start a fresh build | TBA |
| /build | Re-start a fresh build | TBA |
| /force_reset | Delete integration branches & pull requests, and restart merge process from the beginning. | |
| /reset | Try to remove integration branches unless there are commits on them which do not appear on the source branch. | |

Status report is not available.

@bert-e
Contributor

bert-e commented Sep 2, 2024

Incorrect fix version

The Fix Version/s in issue ZENKO-4881 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 2.10.0
  • 2.6.67
  • 2.7.63
  • 2.8.43
  • 2.9.18

Please check the Fix Version/s of ZENKO-4881, or the target
branch of this pull request.

@benzekrimaha benzekrimaha force-pushed the improvement/ZENKO-4881-mongo-sync-alert branch 2 times, most recently from 544a30d to 15b9ede Compare September 2, 2024 11:26
@benzekrimaha benzekrimaha marked this pull request as ready for review September 2, 2024 11:26

- alert: MongoDbNodeRecovering
  expr: |
    sum by (pod) (mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", rs_state="3"} >= 1)
Contributor

We can have all these states (excluding the arbiter):
https://www.mongodb.com/docs/manual/reference/replica-states/

I wonder if we should alert on all the unwanted states, rather than restricting the alert to the RECOVERING state, e.g. detecting that the state is neither 0, 1 nor 2. We may also need some specific alerts depending on the state:

  • For 3 (RECOVERING):

    • MongoDB secondaries will periodically enter this state during high load, if they fail to keep up with the incoming data. In general, this state is not harmful.
    • But the issue arises when it stays there for too long (the oplog is overwritten); that's why I think the for: 5m should be changed to something higher, like 24h...
    • When we perform the "full init sync" procedure, the affected MongoDB instance will also end up in this state, and depending on the cluster's size it can take hours, but in this case it is expected...
    • It's great to have a separate alert, so we can ask the reader to check the logs; if the instance cannot recover, it will log a specific message (we document it in the troubleshooting procedure, but I'm not sure we want to reference it from Zenko).
  • For 5 (STARTUP2), this is also fine, but again, if it stays there for too long (> several hours), the instance may be stuck.

  • For 6 (UNKNOWN), 8 (DOWN) and 10 (REMOVED), these are unwanted states, so we can alert directly without waiting.

  • For 9 (ROLLBACK), this is fine as well.
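For illustration, a minimal sketch of the catch-all variant floated above (hypothetical rule name and thresholds, reusing the ${namespace}/${service} parameters and the rs_state label from this PR's expressions; the review below converged on separate per-state alerts instead):

```yaml
# Hypothetical catch-all: fire when a member reports any state other than
# STARTUP (0), PRIMARY (1), SECONDARY (2) or ARBITER (7).
# Assumes the sample value is the numeric replica set state, as in the
# expressions reviewed in this PR.
- alert: MongoDbNodeInUnexpectedState
  expr: |
    mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", rs_state!~"0|1|2|7"} > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: MongoDB member in an unexpected replica set state
    description: "MongoDB pod `{{ $labels.pod }}` has been reporting replica set state `{{ $labels.rs_state }}` for 5 minutes."
```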

Contributor Author

done here: 6bc518d

monitoring/mongodb/alerts.yaml (4 outdated review comments, resolved)
@bert-e
Contributor

bert-e commented Sep 2, 2024

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@benzekrimaha benzekrimaha force-pushed the improvement/ZENKO-4881-mongo-sync-alert branch 3 times, most recently from 225aeff to 7d5b0fe Compare September 4, 2024 10:53
@benzekrimaha benzekrimaha force-pushed the improvement/ZENKO-4881-mongo-sync-alert branch from 7d5b0fe to 92ba78e Compare September 4, 2024 13:58
labels:
  severity: warning
annotations:
  description: "The Mongodb instance `{{ $labels.pod }}` is in the 'RECOVERING' state for over an hour. The instance may not be able to join the replica set if the platform ingests a large number of operations during this time. This alert is expected if the 'Resync a Data Services MongoDB Member' procedure has recently been executed."
Contributor
@francoisferrand francoisferrand Sep 6, 2024

This alert is expected if the 'Resync a Data Services MongoDB Member' procedure has recently been executed.

Is it really?

We must make sure alerts will not trigger when nothing wrong is happening, to avoid alert fatigue and people either contacting support or ignoring the alerts: if recovering is proceeding fine, we should not raise any alert... it is only needed when recovering fails.

Contributor

When we perform a full init sync, if the cluster is still processing operations, there is a risk that we never converge if the recovery takes longer than the oplog history...
The recovering instance might stay in this state for more than one hour on most of the platforms deployed today.


- alert: MongoDbNodeRecovering
  expr: |
    avg_over_time(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", rs_state="3"}[1h]) == 3
Contributor

What happens when recovering fails: does the pod restart, or does it stay in RECOVERING forever?

If the pod restarts, then the average may well be below 3...

Contributor Author
@benzekrimaha benzekrimaha Sep 6, 2024

Indeed, we have the livenessProbe configured to force the pod to restart; otherwise it would stay in the recovering state forever.
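For context, a purely hypothetical sketch of such a probe (the probe actually configured in the chart may well differ; the mongosh invocation, the state check and the thresholds are all assumptions):

```yaml
livenessProbe:
  exec:
    command:
      - mongosh
      - --eval
      # Exits non-zero when the member is neither primary nor secondary,
      # e.g. when it is stuck in RECOVERING, so the kubelet eventually restarts the pod.
      - 'const h = db.hello(); if (!h.isWritablePrimary && !h.secondary) quit(1)'
  initialDelaySeconds: 30
  periodSeconds: 20
  failureThreshold: 6
```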

Contributor Author

I actually went for a warning if, during the last hour, the node has been in the recovering state at least once. If we want an alert only when it is stuck in that state, we could rely on MongoDbNodeNotSynced or the alert for the primary instead, since the node not being present in the replica set might be because it is stuck.
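For reference, the two semantics discussed in this thread can be sketched roughly as follows (assuming the series behaves as in the expressions above, i.e. the sample value is 3 while the member is RECOVERING; the second rule name is hypothetical):

```yaml
# Variant A: "was in RECOVERING at least once during the last hour"
# (what the PR went for).
- alert: MongoDbNodeRecovering
  expr: |
    max_over_time(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", rs_state="3"}[1h]) == 3

# Variant B: "stayed in RECOVERING for the whole hour"; only fires when the
# node is stuck, but can be defeated by the liveness-probe restarts mentioned above.
- alert: MongoDbNodeStuckRecovering
  expr: |
    min_over_time(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", rs_state="3"}[1h]) == 3
```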


- alert: MongoDbNodeNotSynced
  expr: |
    sum by (pod) (mongodb_mongod_replset_number_of_members{set="data-db-mongodb-sharded-shard-0"}) != ${replicas}
Contributor

This is not precise enough: this alert would not trigger if some pods are RECOVERING...

We already have an alert for when there is no PRIMARY:
absent_over_time(mongodb_rs_members_state{namespace="${namespace}",pod=~"${service}.*",member_state="PRIMARY"}[1m]) == 1

Would it not make sense to check that we have the expected number of SECONDARY members? (i.e. ${replicas} - 1)
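A minimal sketch of that suggestion (hypothetical rule name and thresholds, reusing the member_state label from the existing PRIMARY alert; it assumes a single replica set matches the selector, see the aggregation remark further down in this review):

```yaml
# Hypothetical: alert when the number of reported SECONDARY members differs
# from the expected count (${replicas} - 1, the primary being counted separately).
- alert: MongoDbMissingSecondaries
  expr: |
    count(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"}) != (${replicas} - 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: MongoDB replica set does not have the expected number of secondaries
    description: "The number of SECONDARY members reported for the MongoDB replica set differs from ${replicas} - 1."
```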

Contributor Author

done here: 5e684e8


- alert: MongoDbNodeNotSynced
  expr: |
    sum by (pod) (mongodb_mongod_replset_number_of_members{set="data-db-mongodb-sharded-shard-0"}) != ${replicas}
Contributor

data-db-mongodb-sharded-shard-0 cannot and should not be hard-coded.

  • we should not add a parameter, but make the alert "generic": i.e. it would actually trigger for any shard (or config-svr).
  • filtering needs to be done using the existing parameters (namespace, service)
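Concretely, the change being asked for amounts to replacing the hard-coded set selector with the existing parameters, roughly along these lines (a sketch only; the label names follow the other expressions in this PR):

```yaml
# Before: hard-coded to a single shard
#   mongodb_mongod_replset_number_of_members{set="data-db-mongodb-sharded-shard-0"}
# After: scoped by the existing namespace/service parameters, so the rule
# covers every shard and the config server replica set.
expr: |
  sum by (pod) (mongodb_mongod_replset_number_of_members{namespace="${namespace}", pod=~"${service}.*"}) != ${replicas}
```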

Contributor Author

done here: 5e684e8

- alert: MongoDbInvalidState
  expr: |
    avg_over_time(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", rs_state=~"6|8|10"}[1m]) > 0
  for: 1m
Contributor
@francoisferrand francoisferrand Sep 6, 2024

We can probably tolerate more than 1m; there may be some temporary switch to these states (on startup or shutdown, maybe) → 5m?
Or are we guaranteed this should not happen?

Contributor Author

done here: 5e684e8

labels:
  severity: warning
annotations:
  description: "The Mongodb instance `{{ $labels.pod }}` is in the 'RECOVERING' state for over an hour. The instance may not be able to join the replica set if the platform ingests a large number of operations during this time. This alert is expected if the 'Resync a Data Services MongoDB Member' procedure has recently been executed."
Contributor

We have a different way to check for the STARTUP2 and RECOVERING states: is this expected? Is there a reason not to use the same approach (or possibly the same alert)?

Contributor Author

I went for 2 alerts in order to have a more precise description, as RECOVERING can be expected in some cases.

Comment on lines +176 to +182

    sum by (pod) (mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"}) != (${replicas} - 1)
  for: 1m
  labels:
    severity: critical
  annotations:
    description: "The MongoDB instance `{{ $labels.pod }}` is out of the replica set. It does no longer receive any data and must be added back to the cluster to avoid performance and storage problems."
    summary: MongoDB node not in replica set
Contributor

  • the pod would be different for each time series (it is ${service}.*), so the sum cannot be compared to ${replicas}. To do this aggregation, we need to aggregate on a label which has a different value for each "group" of pods (mongo-shard$X, mongo-cfgsvr) but the same value for every pod in the group.
  • since we are doing an alert on the aggregate, we will not have a single "pod", and cannot point to the node in the bad state.
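A rough sketch of that grouping (hypothetical: rs_nm stands in for whichever label actually carries the replica set name on this exporter, and the alert then points at a set rather than a pod):

```yaml
# Hypothetical: count SECONDARY members per replica set instead of across all
# pods matched by ${service}.*, so each shard and the config server replica set
# is checked against its own expected count.
- alert: MongoDbReplicaSetMissingSecondaries
  expr: |
    count by (rs_nm) (mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"}) != (${replicas} - 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: MongoDB replica set does not have the expected number of secondaries
    description: "Replica set `{{ $labels.rs_nm }}` does not report the expected number of SECONDARY members."
```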

labels:
  severity: warning
annotations:
  description: "The Mongodb instance `{{ $labels.pod }}` is in the 'STARTUP2' state for an hour. The instance might be stuck."
Contributor

"instance" can be misleading: usually we (esp. customers) would think of "instance" as the whole MongoDB cluster... → prefer something less ambiguous, like "pod"

Suggested change
description: "The Mongodb instance `{{ $labels.pod }}` is in the 'STARTUP2' state for an hour. The instance might be stuck."
description: "Mongodb pod `{{ $labels.pod }}` is in the 'STARTUP2' state for an hour. The instance might be stuck."

Contributor

(same for the other alerts)

4 participants