KAFKA-19421: Deflake streams_broker_down_resilience_test #19999


Open · wants to merge 1 commit into trunk

Conversation

@lucasbru (Member) commented Jun 19, 2025

streams_broker_down_resilience_test produces messages with null keys
to a topic with three partitions and expects each partition to be
non-empty afterward. But I don't think this is a correct assumption, as
a sticky producer may batch all records into only two of the partitions.

This causes occasional flakiness in the test.

The fix is to produce records with keys.
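
To illustrate the idea, here is a minimal sketch using the kafka-python client (not what the system test itself uses; the broker address and topic name are placeholders):

    # With key=None, the producer's sticky partitioner fills a batch for one
    # partition at a time, so a short run may touch only a subset of the
    # partitions. Hashing explicit keys spreads records across partitions;
    # with enough distinct keys, every partition is hit with high probability.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder
    num_messages = 9

    for i in range(num_messages):
        # Cycling keys 0..num_messages-1 mirrors what repeating_keys=num_messages
        # is intended to achieve in the test framework.
        producer.send("input-topic",  # placeholder topic
                      key=str(i % num_messages).encode(),
                      value=("message-%d" % i).encode())

    producer.flush()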

@lucasbru lucasbru requested review from Copilot and mjsax June 19, 2025 12:27
@github-actions github-actions bot added tests Test fixes (including flaky tests) small Small PRs labels Jun 19, 2025

@Copilot (Copilot AI) left a comment


Pull Request Overview

This PR fixes flakiness in streams_broker_down_resilience_test by producing records with keys (via a new repeating_keys parameter) to ensure all partitions receive messages.
Key changes:

  • Added repeating_keys argument to assert_produce calls in four resilience tests with clarifying comments.
  • Extended the assert_produce signature in base_streams_test.py to accept and pass through repeating_keys (see the sketch below).
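
For context, a simplified sketch of what the extended method could look like (assuming get_producer accepts and forwards repeating_keys to the VerifiableProducer; details may differ from the actual patch):

    from ducktape.utils.util import wait_until

    def assert_produce(self, topic, test_state, num_messages=5, timeout_sec=60,
                       repeating_keys=None):
        # repeating_keys is assumed to be forwarded to the VerifiableProducer,
        # which then cycles that many distinct keys instead of sending null keys.
        producer = self.get_producer(topic, num_messages, repeating_keys=repeating_keys)
        producer.start()

        wait_until(lambda: producer.num_acked >= num_messages,
                   timeout_sec=timeout_sec,
                   err_msg="At %s failed to send messages" % test_state)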

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Files reviewed:
  • tests/kafkatest/tests/streams/streams_broker_down_resilience_test.py: Added comments and passed repeating_keys=self.num_messages to assert_produce in multiple tests.
  • tests/kafkatest/tests/streams/base_streams_test.py: Updated assert_produce signature to include repeating_keys and forwarded it to get_producer.
Comments suppressed due to low confidence (2)

tests/kafkatest/tests/streams/base_streams_test.py:85

  • [nitpick] The parameter name 'repeating_keys' is ambiguous—it sounds like a boolean flag but is used as a numeric key count. Consider renaming it to something more descriptive like 'num_keys' or 'unique_key_count'.
    def assert_produce(self, topic, test_state, num_messages=5, timeout_sec=60, repeating_keys=None):

tests/kafkatest/tests/streams/base_streams_test.py:85

  • The new parameter 'repeating_keys' is undocumented. Please update the method docstring or add a comment to explain its intended use and expected values.
    def assert_produce(self, topic, test_state, num_messages=5, timeout_sec=60, repeating_keys=None):

Comment on lines +132 to +137
    + # repeating_keys enables production of records with keys, ensuring that we produce to all 3 partitions
      self.assert_produce(self.inputTopic,
                          "sending_message_after_broker_down_initially",
                          num_messages=self.num_messages,
    -                     timeout_sec=120)
    +                     timeout_sec=120,
    +                     repeating_keys=self.num_messages)

Copilot AI Jun 19, 2025


[nitpick] This comment is repeated verbatim across multiple test cases. Consider abstracting the key-production logic into a helper or adding a shared comment to reduce duplication.
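
Such a helper might look like the following (hypothetical sketch, not part of the PR; assert_keyed_produce is an invented name):

    def assert_keyed_produce(self, topic, test_state, num_messages=5, timeout_sec=120):
        # Producing with repeating keys ensures records land on all partitions,
        # so the explanatory comment lives here instead of at every call site.
        self.assert_produce(topic, test_state,
                            num_messages=num_messages,
                            timeout_sec=timeout_sec,
                            repeating_keys=num_messages)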


@FrankYang0529 (Member) left a comment


Thanks for the fix. The StreamsBrokerDownResilience e2e cases now pass.

Out of curiosity, do we want to introduce repeating_keys to assert_produce_consume as well?

    def assert_produce_consume(self,
                               streams_source_topic,
                               streams_sink_topic,
                               client_id,
                               test_state,
                               num_messages=5,
                               timeout_sec=60):
        self.assert_produce(streams_source_topic, test_state, num_messages, timeout_sec)
        self.assert_consume(client_id, test_state, streams_sink_topic, num_messages, timeout_sec)

@lucasbru (Member, Author)

Thanks for the comment @FrankYang0529

> Out of curiosity, do we want to introduce repeating_keys to assert_produce_consume as well?

We could, but it's not necessary: in that case we only assert that our messages are produced to and consumed from some partition. We add it to assert_produce specifically to support the assertion that all partitions have data.
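
For illustration, a standalone sketch (kafka-python, not the test framework) of the kind of all-partitions check that keyed production is meant to support:

    from kafka import KafkaConsumer, TopicPartition

    def all_partitions_non_empty(topic, bootstrap="localhost:9092"):  # placeholder
        # A partition is non-empty if its end offset is past its beginning offset.
        consumer = KafkaConsumer(bootstrap_servers=bootstrap)
        partitions = [TopicPartition(topic, p)
                      for p in consumer.partitions_for_topic(topic)]
        beginnings = consumer.beginning_offsets(partitions)
        ends = consumer.end_offsets(partitions)
        consumer.close()
        return all(ends[tp] > beginnings[tp] for tp in partitions)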

@lucasbru (Member, Author)

Here are 40 passing runs (across all parameters) of the test with this PR. Not sure if this is sufficient to trigger the original flake, but at least it should show that I didn't break things:

https://confluent-open-source-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/trunk/2025-06-19--001.f4645ca5-144e-40cd-85c5-16efd1b32e71--1750341166--lucasbru--deflake_streams_broker_down_resilience_test--c06d4dedde/report.html

@lucasbru lucasbru requested a review from bbejeck June 20, 2025 09:45
@lucasbru (Member, Author)

PTAL @aliehsaeedii
