Improve MongoDB ReplicationLag alert #2205
base: development/2.10
Conversation
Force-pushed from f082f1b to 0554827
- The threshold is not 30s anymore. This fixed value has two problems:
  - it is not consistent with the flowControlTargetLagSeconds config set as 10s.
  - it creates a lot of alerts when a 3-node cluster is under load, where the lag may be in the 1min-30min range without impact on the replication.
- The new alert considers the oplog window of the mongod instances to dynamically create alerts:
  - Warning when we exceed 70% of the oplog.
  - Critical when we exceed 95%. In this case there is a sizing issue and a high risk of requiring full init syncs.

Issue: ZENKO-4986
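A minimal sketch of the resulting alert structure, assuming a hypothetical recording rule mongodb:replication_lag_ratio that divides the current secondary lag by the primary oplog window (both in the same time unit); the group name, the critical rule name, and its hold duration are assumptions, not taken from this PR:

groups:
  - name: mongodb-replication   # group name is illustrative
    rules:
      - alert: ReplicationLagWarning
        # Hypothetical recording rule: secondary lag / primary oplog window.
        expr: mongodb:replication_lag_ratio > 0.70
        for: 8m
        labels:
          severity: warning
      - alert: ReplicationLagCritical   # name assumed
        # Hold duration for the critical rule is not specified in this PR.
        expr: mongodb:replication_lag_ratio > 0.95
        labels:
          severity: critical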
Force-pushed from 0554827 to aee13f6
max by(pod, rs_nm) (
  mongodb_rs_members_optimeDate{namespace="${namespace}",pod=~"${service}.*",member_state="PRIMARY"}
)
- ignoring(member_idx) group_right min by(pod, rs_nm, member_idx) (
  mongodb_rs_members_optimeDate{namespace="${namespace}",pod=~"${service}.*",member_state="SECONDARY"}
)
Overall, this computation is done on each pod: we compare the primary's oplog date vs all the secondaries' oplog dates, as viewed by this pod.
→ this gives the max optime delta as viewed by this pod.
Should it not be the other way around, i.e. for each member_idx, diff the max(PRIMARY) with the max(SECONDARY) [ignoring pods]? That gives a simpler expression (no ignoring, no need to perform 2 nested max) and more precisely measures the lag of that member_idx vs the primary.
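A rough PromQL sketch of one possible reading of this suggestion (illustrative only, not code from this PR; the label matching may need adjustment on real data):

# Aggregate across pods first, then diff the primary optime against each
# secondary member's optime, matching on the replica-set name only.
max by(rs_nm) (
  mongodb_rs_members_optimeDate{namespace="${namespace}",pod=~"${service}.*",member_state="PRIMARY"}
)
- on(rs_nm) group_right
max by(rs_nm, member_idx) (
  mongodb_rs_members_optimeDate{namespace="${namespace}",pod=~"${service}.*",member_state="SECONDARY"}
)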
I'll check that approach, yes (I'm also trying that on multi-shard clusters to ensure it's future-proof).
(From what I tested, we hit issues because labels are not unique the other way around... giving errors.)
max (
  (mongodb_mongod_replset_oplog_head_timestamp{namespace="${namespace}",pod=~"${service}.*"}
   - on(pod)
   mongodb_mongod_replset_oplog_tail_timestamp{namespace="${namespace}",pod=~"${service}.*"})
  * on(pod) group_left()
  (mongodb_mongod_replset_my_state{namespace="${namespace}",pod=~"${service}.*"} == 1)
)
- Can we/should we only look at the oplog headroom from the PRIMARY? (i.e. if one SECONDARY has lots of headroom, it would affect the metrics even though it does not minimize the risk)
- Should we not take the min headroom (instead of max)?
This part is the journal window. The headroom would typically be something like
(avg(mongodb_mongod_replset_oplog_tail_timestamp - mongodb_mongod_replset_oplog_head_timestamp)
- (avg(mongodb_mongod_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_mongod_replset_member_optime_date{state="SECONDARY"})))
Here, the whole alert is similar to the headroom, but computed as a percentage so we can define thresholds and not just alert when it's too late.
As the replication happens between the primary and secondaries only, not between secondaries, I don't think we need to consider the secondary oplog windows: their size has nothing to do with a secondary's ability to converge its replication.
should we not take the min headroom (instead of max)?
I asked myself this question, and with only one replica set we only have one value anyway. But I chose max to be on the optimistic side and avoid alerting too early. min makes sense if we want to detect the problem as early as possible, so I'll go with it; I am aligned.
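A sketch of the agreed change, assuming it only swaps the outer aggregation of the oplog-window expression shown above (the final rule in the PR may differ):

# Take the smallest primary oplog window instead of the largest, so the alert
# fires as early as possible when several replica sets / shards are monitored.
min (
  (mongodb_mongod_replset_oplog_head_timestamp{namespace="${namespace}",pod=~"${service}.*"}
   - on(pod)
   mongodb_mongod_replset_oplog_tail_timestamp{namespace="${namespace}",pod=~"${service}.*"})
  * on(pod) group_left()
  (mongodb_mongod_replset_my_state{namespace="${namespace}",pod=~"${service}.*"} == 1)
)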
This part is the journal window. The headroom would typically be something like
ok, journal window :-)
But my point was really about looking only at the primary, since this is the "important" oplog in this case.
Yes, this is what we do already, by filtering on "my state" being 1 (=== PRIMARY). Did you have something else in mind?
ReplicationLagWarning alert to become dynamic:
- The fixed 30s threshold was not consistent with the flowControlTargetLagSeconds config set as 10s.
- Warning when the optimeDate (current lag) of any secondary exceeds 70% of the primary oplog window for 8min.

Issue: ZENKO-4986
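Putting the two expressions from the diff together, a rough sketch of what the warning rule could look like (illustrative only; it assumes the lag and the oplog window end up in the same time unit, otherwise a conversion factor is needed, and it uses min over the window as agreed above):

- alert: ReplicationLagWarning
  expr: |
    (
      max by(pod, rs_nm) (
        mongodb_rs_members_optimeDate{namespace="${namespace}",pod=~"${service}.*",member_state="PRIMARY"}
      )
      - ignoring(member_idx) group_right
      min by(pod, rs_nm, member_idx) (
        mongodb_rs_members_optimeDate{namespace="${namespace}",pod=~"${service}.*",member_state="SECONDARY"}
      )
    )
    /
    scalar(
      min(
        (mongodb_mongod_replset_oplog_head_timestamp{namespace="${namespace}",pod=~"${service}.*"}
         - on(pod)
         mongodb_mongod_replset_oplog_tail_timestamp{namespace="${namespace}",pod=~"${service}.*"})
        * on(pod) group_left()
        (mongodb_mongod_replset_my_state{namespace="${namespace}",pod=~"${service}.*"} == 1)
      )
    )
    > 0.70
  for: 8m
  labels:
    severity: warning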