Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rebalance coordinator spec #1485

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

erikvanoosten
Copy link
Collaborator

@erikvanoosten erikvanoosten commented Feb 23, 2025

Previously, some of the rebalanceSafeCommits tests passed regardless of whether rebalanceSafeCommits was enabled or not. This is now fixed.

Also:

  • added test outlines
  • made assertions more precise
  • removed unnecessary code
  • add a test for when rebalanceSafeCommits is disabled

maxRebalanceDuration = 10.seconds
)
revokeDuration <- coordinator.toRebalanceListener.onRevoked(Set(tp)).timed.map(_._1)
} yield assertTrue(revokeDuration.toMillis > 150, revokeDuration.toMillis < 6000)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a huge fan of timing conditions like this as they are often flaky. Any way to keep using the promise for signalling stuff and expressing the condition in a different way..?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point and yet and I don't see how it could be done differently.

These tests are written such that at worst they can cause false positives. False negatives are either not possible or very unlikely.

In test should wait for the last pulled offset to commit we want to validate the that the rebalance is extended till all three offsets are committed.
Failure scenarios:

  • Rebalance doesn't wait for commit
    • validated by asserting minimum duration. This can cause false positives, but not false negatives: if it it takes less time it has to be because it didn't wait
    • alternative: count commits. When commit count after rebalance is not number of expected records the rebalance ended too quickly. This is worse: we now we have a race condition on the commit count which can also cause false positives.
  • Rebalance waits too long (no commits handled at all)
    • validated by asserting maximum duration. The max rebalance duration is quite long: 10s, the assertion is using max 6s. You need a very slow machine to cause false negatives.
    • alternative: no idea

In test should continue if waiting for the stream to continue has timed out we want to validate the the rebalance did its best in waiting for commits that never happened.
Failure scenarios:

  • Rebalance ends before max rebalance duration
    • validated by checking actual rebalance duration. This can cause false positives, but not false negatives: if it it did not wait long enough, it has to be because it didn't wait
    • alternative: no idea

In test should not wait for stream to commit we want to validate the the rebalance doesn't wait.
Failure scenarios:

  • Rebalance does wait for commits:
    • validated by checking that the rebalance took shorter than what it would have taken when it awaited the commits. The rebalance could be slow for other reasons, so this can indeed cause false negatives. However, with 600ms there is so much time that a false negative is unlikely. (False positives are not possible.)
    • alternative: no idea. We can't count commits because we don't do commits. Doing commits increases the complexity of the test because then it needs to address the uncertainty of whether the rebalance ended because of commits, or because its was not waiting.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See alternative in #1498

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See alternative in #1498

IMHO this is worse. We now we have a race condition on the commit count which can also cause false positives.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can combine the approaches though. I'll try it out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the additional assertion in 6dbf077

In addition, we now ensure that the records were pulled by the stream in 203b77d, before the revoke could theoretically happen before the stream pulled the records,this would completely invalidate the test.

The remaining time based assertions are still needed because they validate the time based behavior of the rebalance listener.

Previously, some of the rebalanceSafeCommits test passed regardless of whether rebalanceSafeCommits was enabled or not. This is now fixed.

Also:
- added test outlines
- made assertions more precise
- removed unnecessary code
- add a test for when rebalanceSafeCommits is disabled
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants