test(IDX): only enable flaky = True for tests that are >= 1% flaky #4325

Merged

Conversation


@basvandijk basvandijk commented Mar 11, 2025

Many tests marked with flaky = True are not actually flaky (anymore) or have a very low flakiness rate (< 1%). To reduce noise, we remove these flaky = True settings. For the tests that are actually flaky (>= 1% flakiness rate), we make sure they're marked as such and add a comment with their actual flakiness rate to document the motivation for flaky = True.

We also remove flaky = False settings since that's the default.
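
For illustration, a minimal sketch of what an annotated target could look like after this change (the target name, source file, and rate below are hypothetical):

sh_test(
    name = "some_flaky_test",       # hypothetical target
    srcs = ["some_flaky_test.sh"],
    # Flaky: failed ~2.5% of runs over the last month (hypothetical rate).
    flaky = True,
)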

@basvandijk basvandijk added the CI_ALL_BAZEL_TARGETS label (Runs all bazel targets and uploads them to S3) Mar 11, 2025
@github-actions github-actions bot added the test label Mar 11, 2025
@basvandijk basvandijk removed the CI_ALL_BAZEL_TARGETS label (Runs all bazel targets and uploads them to S3) Mar 11, 2025
@basvandijk basvandijk marked this pull request as ready for review March 11, 2025 17:50
@basvandijk basvandijk requested review from a team as code owners March 11, 2025 17:50
Contributor

@github-actions github-actions bot left a comment

If this pull request affects the behavior of any canister owned by
the Governance team, remember to update the corresponding
unreleased_changes.md file(s).

To acknowledge this reminder (and unblock the PR), dismiss this
code review by going to the bottom of the pull request page, and
supply one of the following reasons:

  1. Done.

  2. No canister behavior changes.


@basvandijk basvandijk dismissed github-actions[bot]’s stale review March 11, 2025 18:02

No canister behavior changes.

@mbjorkqvist
Member

Thanks for your relentless work on improving our testing, and in particular the flakiness, @basvandijk! Do you have some concrete numbers or estimates on the impact on CI performance? Looking at the issue from another PoV, this will potentially increase the time that developers need to spend on PRs, either looking into the logs of failed (now non-flaky) tests to figure out if they were caused by their changes, or alternatively simply clicking retry once or twice (possibly with some delay) before deciding to investigate further? Yet another viewpoint is that some of the tests that are now being marked non-flaky are actually not flaky (anymore), with the rare failures being due to e.g., unrelated transient infrastructure issues, in which case removing the flaky flag definitely makes sense.

@basvandijk
Collaborator Author

@mbjorkqvist

Do you have some concrete numbers or estimates on the impact on CI performance?

No I don't have concrete numbers nor an estimate. It's hard to predict what effect this will have in practice.

Looking at the issue from another PoV, this will potentially increase the time that developers need to spend on PRs, either looking into the logs of failed (now non-flaky) tests to figure out if they were caused by their changes, or alternatively simply clicking retry once or twice (possibly with some delay) before deciding to investigate further?

True, which is why we limited the removal of flaky = True to those tests with a flakiness rate of < 1%.

Yet another viewpoint is that some of the tests that are now being marked non-flaky are actually not flaky (anymore), with the rare failures being due to e.g., unrelated transient infrastructure issues, in which case removing the flaky flag definitely makes sense.

Yes, the vast majority of tests where I removed flaky = True had a flakiness rate of 0%.

Note that we will extend Superset with new charts showing the tests that were run as part of multiple attempts of a single workflow (i.e., workflows that were manually retried) and that had both more than 0 failures and more than 0 successes, i.e., flaky tests.

@basvandijk basvandijk enabled auto-merge March 11, 2025 19:20
@alin-at-dfinity
Contributor

alin-at-dfinity commented Mar 12, 2025

[Please take the below with a grain of salt. It likely looks like me pushing back against the change. It's not that, I'm just doing a brain-dump of whatever came to mind when looking at the change and the discussion. I.e. feel free to politely ignore any or all of the points below if you find them unconvincing.]

Tests marked with flaky = True are not cached by bazel which means they'll run for every PR regardless of whether they ran successfully before.

Would it be possible to only cache successful runs of flaky tests instead?

Do you have some concrete numbers or estimates on the impact on CI performance?
No I don't have concrete numbers nor an estimate.

An exceedingly naive estimate, looking at the number of flaky = True lines in the PR (100+) and assuming a 99.5% success rate, yields an overall CI failure rate of 40%, which, anecdotally, is not far from my personal experience. Even a 99.9% success rate across 100 flaky tests results in a 10% CI failure rate.
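
For reference, the arithmetic behind this estimate, assuming n independent flaky tests that each pass with probability p on a given run:

P(at least one flaky failure) = 1 - p^n
1 - 0.995^100 ≈ 0.39   (roughly 40% of CI runs hit at least one flaky failure)
1 - 0.999^100 ≈ 0.10   (roughly 10%)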

Yet another viewpoint is that some of the tests that are now being marked non-flaky are actually not flaky (anymore), with the rare failures being due to e.g., unrelated transient infrastructure issues, in which case removing the flaky flag definitely makes sense.

I would argue that it doesn't matter whether the source of the flakiness is the test itself or the infrastructure. Sure, in the former case you would expect the owner to do something about it; and in the latter case it's IDX's responsibility. But in terms of outcomes, there is no difference: the test simply fails X% of the time even though the tested functionality is correct.

Therefore we only mark tests as flaky for tests that were actually more than 1% flaky over the last month.

The way I look at flaky = True is that it should be a determination made by a human, meaning that either the test is fundamentally flaky (e.g. it tests that such and such a thing completes within 100 ms) or it heavily relies on infrastructure that is overloaded and thus flaky (e.g. the replica upgrade-downgrade tests, that do multiple replica deployments and are thus disproportionately affected by slow CI machines).

Automating what gets labeled as flaky loses this property. And it affects (it's hard to say in which direction) the likelihood of the owners doing anything about fixing actually flaky tests: on the one hand, they get a monthly reminder (assuming we do this monthly); on the other hand, it's just something they will likely automatically rubber-stamp, potentially without noticing that most of their tests have been marked as flaky (because the system will automatically rerun their half-broken tests for them).

Member

@gregorydemay gregorydemay left a comment


Thanks for this PR @basvandijk! This also helps us remove some "historical" flaky = True flags.

@basvandijk basvandijk disabled auto-merge March 12, 2025 10:52
@basvandijk
Collaborator Author

basvandijk commented Mar 12, 2025

@mbjorkqvist @alin-at-dfinity your comments got me thinking. There was an assumption in our team that tests marked as flaky were not cached by bazel. I just tested this assumption by adding the following bazel test:

sh_test(
    name = "flaky_test",
    srcs = ["flaky_test.sh"],
    flaky = True,
)

where flaky_test.sh is a script which fails 50% of the time:

#!/bin/bash
set -e
# Pick a single random 0 or 1 from /dev/urandom.
NUMBER="$(grep -m1 -ao '[0-1]' /dev/urandom | head -n1)"
if [[ "$NUMBER" -eq 1 ]]; then
  echo "fail"
  exit 1
else
  exit 0
fi

It succeeded the first time:

ubuntu@devenv-container:/ic$ bazel test :flaky_test 
INFO: Invocation ID: 251a4dc2-e7df-4130-afcc-b3dd108cd227
INFO: Analyzed target //:flaky_test (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
Target //:flaky_test up-to-date:
  bazel-bin/flaky_test
INFO: Elapsed time: 0.388s, Critical Path: 0.13s
INFO: 2 processes: 2 processwrapper-sandbox.
INFO: Build completed successfully, 2 total actions
//:flaky_test                                                            PASSED in 0.1s

Executed 1 out of 1 test: 1 test passes.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.

The second time it got a cache hit:

ubuntu@devenv-container:/ic$ bazel test :flaky_test 
INFO: Invocation ID: 195f85aa-428f-474c-82f2-091b80375b0d
INFO: Analyzed target //:flaky_test (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
Target //:flaky_test up-to-date:
  bazel-bin/flaky_test
INFO: Elapsed time: 0.218s, Critical Path: 0.01s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
//:flaky_test                                                   (cached) PASSED in 0.1s

Executed 0 out of 1 test: 1 test passes.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.

This seems to disprove our theory that tests marked as flaky are not cached.

ChatGPT agrees with that when asked "Does bazel cache tests marked as flaky?":

Yes, Bazel does cache tests marked as flaky, but with some caveats.

How Bazel Handles Flaky Tests

  1. Caching of Successful Runs:
  • If a flaky test (flaky = True) passes, Bazel caches the result just like a non-flaky test.
  • Future runs will use the cached result unless invalidated by changes to dependencies.
  2. Automatic Retries:
  • When a flaky test fails, Bazel automatically retries it (by default, up to 3 times).
  • If the test eventually passes, Bazel caches that successful result.
  3. Cache Invalidation on Failure:
  • If a flaky test fails all retry attempts, Bazel does not cache the failure.
  • This ensures that subsequent runs will re-execute the test instead of using a previously failed result.
  4. Impact on --cache_test_results:
  • If --cache_test_results=no is set, Bazel will re-run tests regardless of whether they are flaky or not.
  • If --cache_test_results=auto (default), Bazel caches successful flaky test results but does not cache persistent failures.

Key Takeaways

  • Flaky tests can be cached, but only if they pass.
  • If a flaky test fails all retries, Bazel does not cache the failure.
  • Running bazel test later might re-run the flaky test even if it previously failed.

Would you like help tweaking your test strategy to handle flakiness better? 🚀

So, taking this into account, I would like to propose a minor addition to this PR:

Replace this bazel config with:

test --flaky_test_attempts=3

This will have the effect that any failing test will always be retried up to 3 times. See the docs for details: https://bazel.build/reference/command-line-reference#flag--flaky_test_attempts

This will have the following advantages:

  • No manual retries of the whole job are necessary.
  • We keep detecting flakiness and no need to modify our GitHub stats service nor any Superset datasets or charts.

Note that ChatGPT's 3rd point about "Cache Invalidation on Failure" is important. We do want to keep the flaky = True setting for tests that are quite flaky (>= 1%) since, if they fail 3 times in a row, their failure will be cached. That would mean that the only way to rerun those tests would be to change one of their inputs, which might not be trivial (merging master into your PR might not always change their inputs).

WDYT?

@basvandijk
Collaborator Author

There is an obvious downside to test --flaky_test_attempts=3: it will cause ALL failing tests to rerun up to 3 times. This includes tests that are really not flaky and fail because of bugs in the PR (like //bazel:buildifier_test for example).

Contributor

@kpop-dfinity kpop-dfinity left a comment


Thanks a ton, Bas!

@mbjorkqvist
Member

Note that ChatGPT's 3rd point about "Cache Invalidation on Failure" is important. We do want to keep the flaky = True setting for tests that are quite flaky (>= 1%) since, if they fail 3 times in a row, their failure will be cached.

Is this really the case? I don't think I've ever seen bazel cache a failure result, regardless of whether the test was marked flaky = True or not. I haven't tried with --flaky_test_attempts, but I would find it odd if failures were cached with that.

I don't think I've thought of all the (corner) cases regarding the different approaches, but I think I'd be inclined to remove the ("historical") flaky = True flags (as in commit bd393b2 in this PR), as some of the tests are probably not flaky anymore - if they are, the owners could re-add the flags. Running all failing tests up to three times feels like it would mask the problem that @gregorydemay tried to address and referred to, although your improved processes for detecting and analyzing the flakiness are a difference wrt. that effort.

@basvandijk
Collaborator Author

basvandijk commented Mar 12, 2025

Is this really the case?

I just tested this and it indeed seems bazel doesn't cache failures.

@basvandijk
Collaborator Author

Running all failing tests up to three times feels like it would mask the problem that @gregorydemay tried to address and referred to, although your improved processes for detecting and analyzing the flakiness are a difference wrt. that effort.

Yes, with Superset we can easily see which tests are flaky so we won't be masking anything.

Considering that system-tests are our most flaky tests, an alternative to test --flaky_test_attempts=3 would be:

test --flaky_test_attempts=//rs/test/.*@3

which would retry only the system-tests when they fail.

@basvandijk basvandijk added this pull request to the merge queue Mar 13, 2025
Merged via the queue into master with commit 085e56a Mar 13, 2025
50 of 51 checks passed
@basvandijk basvandijk deleted the basvandijk/disable-flaky-for-less-then-1pct-flaky-tests branch March 13, 2025 14:10