-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix rebalance coordinator spec #1485
base: master
Are you sure you want to change the base?
Conversation
maxRebalanceDuration = 10.seconds | ||
) | ||
revokeDuration <- coordinator.toRebalanceListener.onRevoked(Set(tp)).timed.map(_._1) | ||
} yield assertTrue(revokeDuration.toMillis > 150, revokeDuration.toMillis < 6000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a huge fan of timing conditions like this as they are often flaky. Any way to keep using the promise for signalling stuff and expressing the condition in a different way..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point and yet and I don't see how it could be done differently.
These tests are written such that at worst they can cause false positives. False negatives are either not possible or very unlikely.
In test should wait for the last pulled offset to commit
we want to validate the that the rebalance is extended till all three offsets are committed.
Failure scenarios:
- Rebalance doesn't wait for commit
- validated by asserting minimum duration. This can cause false positives, but not false negatives: if it it takes less time it has to be because it didn't wait
- alternative: count commits. When commit count after rebalance is not number of expected records the rebalance ended too quickly. This is worse: we now we have a race condition on the commit count which can also cause false positives.
- Rebalance waits too long (no commits handled at all)
- validated by asserting maximum duration. The max rebalance duration is quite long: 10s, the assertion is using max 6s. You need a very slow machine to cause false negatives.
- alternative: no idea
In test should continue if waiting for the stream to continue has timed out
we want to validate the the rebalance did its best in waiting for commits that never happened.
Failure scenarios:
- Rebalance ends before max rebalance duration
- validated by checking actual rebalance duration. This can cause false positives, but not false negatives: if it it did not wait long enough, it has to be because it didn't wait
- alternative: no idea
In test should not wait for stream to commit
we want to validate the the rebalance doesn't wait.
Failure scenarios:
- Rebalance does wait for commits:
- validated by checking that the rebalance took shorter than what it would have taken when it awaited the commits. The rebalance could be slow for other reasons, so this can indeed cause false negatives. However, with 600ms there is so much time that a false negative is unlikely. (False positives are not possible.)
- alternative: no idea. We can't count commits because we don't do commits. Doing commits increases the complexity of the test because then it needs to address the uncertainty of whether the rebalance ended because of commits, or because its was not waiting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See alternative in #1498
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See alternative in #1498
IMHO this is worse. We now we have a race condition on the commit count which can also cause false positives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can combine the approaches though. I'll try it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the additional assertion in 6dbf077
In addition, we now ensure that the records were pulled by the stream in 203b77d, before the revoke could theoretically happen before the stream pulled the records,this would completely invalidate the test.
The remaining time based assertions are still needed because they validate the time based behavior of the rebalance listener.
f6bac44
to
447cbec
Compare
Previously, some of the rebalanceSafeCommits test passed regardless of whether rebalanceSafeCommits was enabled or not. This is now fixed. Also: - added test outlines - made assertions more precise - removed unnecessary code - add a test for when rebalanceSafeCommits is disabled
203b77d
to
a81aa06
Compare
Previously, some of the rebalanceSafeCommits tests passed regardless of whether rebalanceSafeCommits was enabled or not. This is now fixed.
Also: