test(IDX): only enable flaky = True for tests that are >= 1% flaky #4325

Merged

Conversation


@basvandijk basvandijk commented Mar 11, 2025

Many tests marked with flaky = True are not actually flaky (anymore) or have a very low flakiness rate (< 1%). To reduce noise, we remove these flaky = True settings. For the tests that are actually flaky (>= 1% flakiness rate), we make sure they're marked as such and add a comment with their actual flakiness rate to document the motivation for flaky = True.

We also remove flaky = False settings since that's the default.
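
For illustration, a minimal sketch of what an annotated target could look like after this change (the target name, source file, and rate below are hypothetical):

sh_test(
    name = "some_flaky_test",       # hypothetical target
    srcs = ["some_flaky_test.sh"],
    # Flaky: failed ~2.5% of runs over the last month (hypothetical rate).
    flaky = True,
)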

@basvandijk basvandijk added the CI_ALL_BAZEL_TARGETS label (Runs all bazel targets and uploads them to S3) Mar 11, 2025
@github-actions github-actions bot added the test label Mar 11, 2025
@basvandijk basvandijk removed the CI_ALL_BAZEL_TARGETS label (Runs all bazel targets and uploads them to S3) Mar 11, 2025
@basvandijk basvandijk marked this pull request as ready for review March 11, 2025 17:50
@basvandijk basvandijk requested review from a team as code owners March 11, 2025 17:50
Contributor

@github-actions github-actions bot left a comment

If this pull request affects the behavior of any canister owned by
the Governance team, remember to update the corresponding
unreleased_changes.md file(s).

To acknowledge this reminder (and unblock the PR), dismiss this
code review by going to the bottom of the pull request page, and
supply one of the following reasons:

  1. Done.

  2. No canister behavior changes.


@basvandijk basvandijk dismissed github-actions[bot]’s stale review March 11, 2025 18:02

No canister behavior changes.

@mbjorkqvist
Member

Thanks for your relentless work on improving our testing, and in particular the flakiness, @basvandijk! Do you have some concrete numbers or estimates on the impact on CI performance? Looking at the issue from another PoV, this will potentially increase the time that developers need to spend on PRs, either looking into the logs of failed (now non-flaky) tests to figure out if they were caused by their changes, or alternatively simply clicking retry once or twice (possibly with some delay) before deciding to investigate further? Yet another viewpoint is that some of the tests that are now being marked non-flaky are actually not flaky (anymore), with the rare failures being due to e.g., unrelated transient infrastructure issues, in which case removing the flaky flag definitely makes sense.

@basvandijk
Collaborator Author

@mbjorkqvist

Do you have some concrete numbers or estimates on the impact on CI performance?

No I don't have concrete numbers nor an estimate. It's hard to predict what effect this will have in practice.

Looking at the issue from another PoV, this will potentially increase the time that developers need to spend on PRs, either looking into the logs of failed (now non-flaky) tests to figure out if they were caused by their changes, or alternatively simply clicking retry once or twice (possibly with some delay) before deciding to investigate further?

True, which is why we limited the removal of flaky = True to those tests with a flakiness rate of < 1%.

Yet another viewpoint is that some of the tests that are now being marked non-flaky are actually not flaky (anymore), with the rare failures being due to e.g., unrelated transient infrastructure issues, in which case removing the flaky flag definitely makes sense.

Yes, the vast majority of tests where I removed flaky = True had a flakiness rate of 0%.

Note that we will extend Superset with new charts showing the tests that were run as part of multiple attempts of a single workflow (i.e., workflows that were manually retried) and that had both more than 0 failures and more than 0 successes, i.e., flaky tests.

@basvandijk basvandijk enabled auto-merge March 11, 2025 19:20
@alin-at-dfinity
Contributor

alin-at-dfinity commented Mar 12, 2025

[Please take the below with a grain of salt. It likely looks like me pushing back against the change. It's not that, I'm just doing a brain-dump of whatever came to mind when looking at the change and the discussion. I.e. feel free to politely ignore any or all of the points below if you find them unconvincing.]

Tests marked with flaky = True are not cached by bazel which means they'll run for every PR regardless of whether they ran successfully before.

Would it be possible to only cache successful runs of flaky tests instead?

Do you have some concrete numbers or estimates on the impact on CI performance?
No I don't have concrete numbers nor an estimate.

An exceedingly naive estimate, looking at the number of flaky = True lines in the PR (100+) and assuming a 99.5% success rate, yields an overall CI failure rate of 40%, which, anecdotally, is not far from my personal experience. Even a 99.9% success rate across 100 flaky tests results in a 10% CI failure rate.
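
For reference, the arithmetic behind this estimate, assuming n independent flaky tests that each pass with probability p on a given run:

P(at least one flaky failure) = 1 - p^n
1 - 0.995^100 ≈ 0.39   (roughly 40% of CI runs hit at least one flaky failure)
1 - 0.999^100 ≈ 0.10   (roughly 10%)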

Yet another viewpoint is that some of the tests that are now being marked non-flaky are actually not flaky (anymore), with the rare failures being due to e.g., unrelated transient infrastructure issues, in which case removing the flaky flag definitely makes sense.

I would argue that it doesn't matter whether the source of the flakiness is the test itself or the infrastructure. Sure, in the former case you would expect the owner to do something about it; and in the latter case it's IDX's responsibility. But in terms of outcomes, there is no difference: the test simply fails X% of the time even though the tested functionality is correct.

Therefore we only mark tests as flaky for tests that were actually more than 1% flaky over the last month.

The way I look at flaky = True is that it should be a determination made by a human, meaning that either the test is fundamentally flaky (e.g. it tests that such and such a thing completes within 100 ms) or it heavily relies on infrastructure that is overloaded and thus flaky (e.g. the replica upgrade-downgrade tests, that do multiple replica deployments and are thus disproportionately affected by slow CI machines).

Automating what gets labeled as flaky loses this property. And it affects (it's hard to say in which direction) the likelihood of the owners doing anything about fixing actually flaky tests: on the one hand, they get a monthly reminder (assuming we do this monthly); on the other hand, it's just something they will likely automatically rubber-stamp, potentially without noticing that most of their tests have been marked as flaky (because the system will automatically rerun their half-broken tests for them).

Member

@gregorydemay gregorydemay left a comment


Thanks for this PR @basvandijk! This also helps us remove some "historical" flaky = True flags.

@basvandijk basvandijk disabled auto-merge March 12, 2025 10:52
@basvandijk
Collaborator Author

basvandijk commented Mar 12, 2025

@mbjorkqvist @alin-at-dfinity your comments got me thinking. There was an assumption in our team that tests marked as flaky were not cached by bazel. I just tested this assumption by adding the following bazel test:

sh_test(
    name = "flaky_test",
    srcs = ["flaky_test.sh"],
    flaky = True,
)

where flaky_test.sh is a script which fails 50% of the time:

#!/bin/bash
set -e
# Pick a single random 0 or 1 from /dev/urandom.
NUMBER="$(grep -m1 -ao '[0-1]' /dev/urandom | head -n1)"
if [[ "$NUMBER" -eq 1 ]]; then
  echo "fail"
  exit 1
else
  exit 0
fi

It succeeded the first time:

ubuntu@devenv-container:/ic$ bazel test :flaky_test 
INFO: Invocation ID: 251a4dc2-e7df-4130-afcc-b3dd108cd227
INFO: Analyzed target //:flaky_test (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
Target //:flaky_test up-to-date:
  bazel-bin/flaky_test
INFO: Elapsed time: 0.388s, Critical Path: 0.13s
INFO: 2 processes: 2 processwrapper-sandbox.
INFO: Build completed successfully, 2 total actions
//:flaky_test                                                            PASSED in 0.1s

Executed 1 out of 1 test: 1 test passes.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.

The second time it got a cache hit:

ubuntu@devenv-container:/ic$ bazel test :flaky_test 
INFO: Invocation ID: 195f85aa-428f-474c-82f2-091b80375b0d
INFO: Analyzed target //:flaky_test (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
Target //:flaky_test up-to-date:
  bazel-bin/flaky_test
INFO: Elapsed time: 0.218s, Critical Path: 0.01s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
//:flaky_test                                                   (cached) PASSED in 0.1s

Executed 0 out of 1 test: 1 test passes.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.

This seems to disprove our theory that tests marked as flaky are not cached.

ChatGPT agrees with that when asked "Does bazel cache tests marked as flaky?":

Yes, Bazel does cache tests marked as flaky, but with some caveats.

How Bazel Handles Flaky Tests

  1. Caching of Successful Runs:
  • If a flaky test (flaky = True) passes, Bazel caches the result just like a non-flaky test.
  • Future runs will use the cached result unless invalidated by changes to dependencies.
  2. Automatic Retries:
  • When a flaky test fails, Bazel automatically retries it (by default, up to 3 times).
  • If the test eventually passes, Bazel caches that successful result.
  3. Cache Invalidation on Failure:
  • If a flaky test fails all retry attempts, Bazel does not cache the failure.
  • This ensures that subsequent runs will re-execute the test instead of using a previously failed result.
  4. Impact on --cache_test_results:
  • If --cache_test_results=no is set, Bazel will re-run tests regardless of whether they are flaky or not.
  • If --cache_test_results=auto (default), Bazel caches successful flaky test results but does not cache persistent failures.

Key Takeaways

  • Flaky tests can be cached, but only if they pass.
  • If a flaky test fails all retries, Bazel does not cache the failure.
  • Running bazel test later might re-run the flaky test even if it previously failed.

Would you like help tweaking your test strategy to handle flakiness better? 🚀

So, taking this into account, I would like to propose a minor addition to this PR:

Replace this bazel config with:

test --flaky_test_attempts=3

This will have the effect that any failing test will always be retried up to 3 times. See the docs for details: https://bazel.build/reference/command-line-reference#flag--flaky_test_attempts

This will have the following advantages:

  • No manual retries of the whole job are necessary.
  • We keep detecting flakiness and no need to modify our GitHub stats service nor any Superset datasets or charts.

Note that ChatGPT's 3rd point about "Cache Invalidation on Failure" is important. We do want to keep the flaky = True setting for tests that are quite flaky (>= 1%) since, if they fail 3 times in a row, their failure will be cached. That would mean that the only way to rerun those tests would be to change one of their inputs, which might not be trivial (merging master into your PR might not always change their inputs).

WDYT?

@basvandijk
Collaborator Author

There is an obvious downside to test --flaky_test_attempts=3: it will cause ALL failing tests to rerun up to 3 times. This includes tests that are really not flaky and fail because of bugs in the PR (like //bazel:buildifier_test for example).

Contributor

@kpop-dfinity kpop-dfinity left a comment


Thanks a ton, Bas!

@mbjorkqvist
Member

Note that ChatGPT's 3rd point about "Cache Invalidation on Failure" is important. We do want to keep the flaky = True setting for tests that are quite flaky (>= 1%) since, if they fail 3 times in a row, their failure will be cached.

Is this really the case? I don't think I've ever seen bazel cache a failure result, regardless of whether the test was marked flaky = True or not. I haven't tried with --flaky_test_attempts, but I would find it odd if failures were cached with that.

I don't think I've thought of all the (corner) cases regarding the different approaches, but I think I'd be inclined to remove the ("historical") flaky = True flags (as in commit bd393b2 in this PR), as some of the tests are probably not flaky anymore - if they are, the owners could re-add the flags. Running all failing tests up to three times feels like it would mask the problem that @gregorydemay tried to address and referred to, although your improved processes for detecting and analyzing the flakiness are a difference wrt. that effort.

@basvandijk
Collaborator Author

basvandijk commented Mar 12, 2025

Is this really the case?

I just tested this and it indeed seems bazel doesn't cache failures.

@basvandijk
Collaborator Author

Running all failing tests up to three times feels like it would mask the problem that @gregorydemay tried to address and referred to, although your improved processes for detecting and analyzing the flakiness are a difference wrt. that effort.

Yes, with Superset we can easily see which tests are flaky so we won't be masking anything.

Considering that system-tests are our most flaky tests, an alternative to test --flaky_test_attempts=3 would be:

test --flaky_test_attempts=//rs/test/.*@3

which would retry only the system-tests when they fail.

@basvandijk basvandijk added this pull request to the merge queue Mar 13, 2025
Merged via the queue into master with commit 085e56a Mar 13, 2025
50 of 51 checks passed
@basvandijk basvandijk deleted the basvandijk/disable-flaky-for-less-then-1pct-flaky-tests branch March 13, 2025 14:10