Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-17672 Run quarantined tests separately #17329

Merged
merged 11 commits into from
Oct 6, 2024

Conversation

mumrah
Copy link
Contributor

@mumrah mumrah commented Sep 30, 2024

Introduce new quarantinedTest that excludes tests tagged with "quarantined". Also introduce two new build parameters "maxQuarantineTestRetries" and "maxQuarantineTestRetryFailures".

@github-actions github-actions bot added core Kafka Broker connect build Gradle build or GitHub Actions labels Sep 30, 2024
@mumrah mumrah added the do-not-merge PRs that are only open temporarily and should not be merged label Oct 1, 2024
@mumrah
Copy link
Contributor Author

mumrah commented Oct 2, 2024

@chia7712 looks like #17336 fixed the issue

@mumrah
Copy link
Contributor Author

mumrah commented Oct 2, 2024

In the build, we can see the main tests and the quarantined tests being run

https://github.com/apache/kafka/actions/runs/11130826247/job/30931421135?pr=17329#step:7:21

Parsing build/junit-xml/share-coordinator/test/TEST-org.apache.kafka.coordinator.share.ShareCoordinatorRecordHelpersTest.xml
  Parsing build/junit-xml/share-coordinator/test/TEST-org.apache.kafka.coordinator.share.ShareCoordinatorRecordSerdeTest.xml
  Parsing build/junit-xml/share-coordinator/test/TEST-org.apache.kafka.coordinator.share.ShareCoordinatorServiceTest.xml
  Parsing build/junit-xml/share-coordinator/test/TEST-org.apache.kafka.coordinator.share.ShareCoordinatorShardTest.xml
  
  // snip
  
  Parsing build/junit-xml/core/quarantinedTest/TEST-kafka.coordinator.transaction.ProducerIdManagerTest.xml

So the test collection and JUnit reporting are working.

@mumrah mumrah removed the do-not-merge PRs that are only open temporarily and should not be merged label Oct 2, 2024
@mumrah mumrah requested a review from chia7712 October 2, 2024 14:23
@mumrah
Copy link
Contributor Author

mumrah commented Oct 3, 2024

@chia7712 can you take a look?

Copy link
Contributor

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mumrah thanks for this patch. that is cooooooooooooooool!

build.gradle Show resolved Hide resolved
if (ext.isGithubActions) {
def dest = rootProject.layout.buildDirectory.dir("junit-xml/${project.name}").get().asFile
println "Copy JUnit XML for ${project.name} to $dest"
ant.copy(todir: "$dest/quarantinedTest", failonerror: "false") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has it been included in the artifact? The artifact link (https://github.com/apache/kafka/actions/runs/11130826247/artifacts/2002753330) doesn't seem to include it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

build.gradle Show resolved Hide resolved
@mumrah
Copy link
Contributor Author

mumrah commented Oct 4, 2024

I realized we should not have quarantinedTest task explicitly fail the build like we do with test. This is because we are explicitly telling Gradle not to cache this task

    outputs.upToDateWhen { false }
    outputs.cacheIf { false }

This means we can also stop forcing ignoreFailures to true in the CI build. If we have more than 20 flaky failures in the quarantinedTest task, it will fail the overall Gradle run.

@mumrah mumrah requested a review from chia7712 October 4, 2024 12:37
Copy link
Contributor

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have more than 20 flaky failures in the quarantinedTest task, it will fail the overall Gradle run.

pardon me, I assumed the quarantined tests should NOT impact the CI flow for PR

build.gradle Outdated
@@ -501,6 +504,7 @@ subprojects {

useJUnitPlatform {
includeEngines 'junit-jupiter'
excludeTags 'quarantined'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we use deflake.yml to run flaky tests if they are excluded by the test task? Maybe we should enable deflake.yml to also run quarantinedTest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I will update that job.

@Target({ElementType.TYPE, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@Tag("quarantined")
public @interface Quarantined {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please update KIP based on this new tag? the tag name in KIP is Flaky rather than Quarantined

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I wonder if @Flaky is better for the annotation.

The "quarantine" as a concept includes flaky tests as well as new integration tests, so it might make sense to keep the flaky annotation.

@Tag("quarantined")
public @interface Flaky 

or

@Tag("flaky")
public @interface Flaky 

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tag("flaky")
public @interface Flaky

I prefer this one since it’s shorter and has been part of Kafka for a long time.

@mumrah
Copy link
Contributor Author

mumrah commented Oct 4, 2024

I assumed the quarantined tests should NOT impact the CI flow for PR

Since we're defining this new task separately, we have some flexibility. We can have it fail after some high threshold (like 20 that I arbitrarily chose), or we can prevent it from ever failing the build.

In principle, I guess it's best to not fail the build and rely on the quarantined test reports to ensure we don't accumulate too many quarantined tests or have them for too long.

@chia7712 chia7712 merged commit d38a90d into apache:trunk Oct 6, 2024
5 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Gradle build or GitHub Actions connect core Kafka Broker
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants