test(IDX): only enable flaky = True for tests that are >= 1% flaky #4325
Conversation
If this pull request affects the behavior of any canister owned by the Governance team, remember to update the corresponding unreleased_changes.md file(s).

To acknowledge this reminder (and unblock the PR), dismiss this code review by going to the bottom of the pull request page, and supply one of the following reasons:

- Done.
- No canister behavior changes.
No canister behavior changes.
Thanks for your relentless work on improving our testing, and in particular the flakiness, @basvandijk! Do you have some concrete numbers or estimates of the impact on CI performance? Looking at the issue from another point of view, this will potentially increase the time that developers need to spend on PRs, either looking into the logs of failed (now non-flaky) tests to figure out whether the failures were caused by their changes, or alternatively simply clicking retry once or twice (possibly with some delay) before deciding to investigate further. Yet another viewpoint is that some of the tests now being marked non-flaky are actually not flaky (anymore), with the rare failures being due to, e.g., unrelated transient infrastructure issues, in which case removing the flaky flag definitely makes sense.
No, I don't have concrete numbers or an estimate. It's hard to predict what effect this will have in practice.
True, which is why we limited the removal of flaky = True to tests with a flakiness rate below 1%.
Yes, the vast majority of tests where I removed flaky = True are not actually flaky (anymore) or have a very low flakiness rate.

Note that we will extend Superset with some new charts that show the tests that were run as part of multiple attempts of a single workflow (i.e. those workflows that were manually retried) and that had both more than 0 failures and more than 0 successes, i.e. flaky tests.
[Please take the below with a grain of salt. It likely looks like I'm pushing back against the change; I'm not, I'm just doing a brain dump of whatever came to mind when looking at the change and the discussion. I.e., feel free to politely ignore any or all of the points below if you find them unconvincing.]
Would it be possible to only cache successful runs of flaky tests instead?
An exceedingly naive estimate looking at the number of
I would argue that it doesn't matter whether the source of the flakiness is the test itself or the infrastructure. Sure, in the former case you would expect the owner to do something about it; and in the latter case it's IDX's responsibility. But in terms of outcomes, there is no difference: the test simply fails X% of the time even though the tested functionality is correct.
The way I look at flaky = True, marking a test as flaky is an explicit, deliberate decision by the test owners. Automating what gets labeled as flaky loses this property. And it affects (it's hard to say in which direction) the likelihood of the owners doing anything about fixing actually flaky tests: on the one hand, they get a monthly reminder (assuming we do this monthly); on the other hand, it's just something they will likely automatically rubber-stamp, potentially without noticing that most of their tests have been marked as flaky (because the system will automatically rerun their half-broken tests for them).
Thanks for this PR @basvandijk! This also helps us remove some "historical" flaky = True flags.
@mbjorkqvist @alin-at-dfinity your comments got me thinking. There was an assumption in our team that tests marked as flaky were not cached by bazel. I just tested this assumption by adding the following bazel test:

```
sh_test(
    name = "flaky_test",
    srcs = ["flaky_test.sh"],
    flaky = True,
)
```

where flaky_test.sh is:

```
#!/bin/bash
set -e
NUMBER="$(grep -m1 -ao '[0-1]' /dev/urandom | head -n1)"
if [[ "$NUMBER" -eq 1 ]]; then
  echo "fail"
  exit 1
else
  exit 0
fi
```

It succeeded the first time:
The second time it got a cache hit:
This seems to disprove our theory that tests marked as flaky are not cached. ChatGPT agrees with that when asked:
So taking this into account I would like to propose a minor addition to this PR: replace this bazel config with:

This will have the effect that any failing test will always be retried up to 3 times. See the docs for details: https://bazel.build/reference/command-line-reference#flag--flaky_test_attempts

This will have the following advantages:
Note that ChatGPT's 3rd point about "Cache Invalidation on Failure" is important. We do want to keep the behavior where a failed test result is never served from the cache. WDYT?
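For illustration, here is a minimal sketch of such a setting, assuming it goes into the repository's .bazelrc with the retry count of 3 mentioned above (the file location and exact form are assumptions, not the literal config from this PR):

```
# Assumed .bazelrc entry: retry every failing test up to 3 times
# before Bazel reports it as failed.
test --flaky_test_attempts=3
```

Note that --flaky_test_attempts applies to every test run via bazel test, not only to targets marked flaky = True, so under this setting the per-target flaky annotations mainly serve as documentation of which tests are known to be flaky.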
There is an obvious downside to
Thanks a ton, Bas!
Is this really the case? I don't think I've ever seen bazel cache a failure result, regardless of whether the test was marked as flaky or not.

I don't think I've thought of all the (corner) cases regarding the different approaches, but I think I'd be inclined to remove the ("historical") flaky = True flags.
I just tested this and it indeed seems bazel doesn't cache failures.
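For reference, a minimal sketch of one way to check this (target and file names are hypothetical): define a test that always fails and run it twice; if Bazel cached failed results, the second run would be reported as cached instead of re-executing the test.

```
# BUILD.bazel (hypothetical target)
sh_test(
    name = "always_fails_test",
    srcs = ["always_fails_test.sh"],
)
```

```
#!/bin/bash
# always_fails_test.sh (hypothetical): fail unconditionally.
exit 1
```

Running bazel test //:always_fails_test twice executes the test both times, whereas a passing test would show up as "(cached) PASSED" on the second run.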
Yes, with Superset we can easily see which tests are flaky, so we won't be masking anything. Considering that system-tests are our most flaky tests, an alternative for
Many tests marked with flaky = True are not actually flaky (anymore) or have a very low flakiness rate (< 1%). To reduce noise we remove these flaky = True settings. For the ones that are actually flaky (>= 1% flakiness rate) we make sure they're marked as such and add a comment with their actual flakiness rate to document the motivation for flaky = True.

We also remove flaky = False settings since that's the default.
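As an illustration of this convention (target names and rates below are hypothetical, not taken from the actual diff), the resulting BUILD annotations could look like this:

```
# Hypothetical examples; names and flakiness rates are made up.

sh_test(
    name = "consensus_integration_test",
    srcs = ["consensus_integration_test.sh"],
    # Flakiness rate over the measurement window: ~2.3% (>= 1%),
    # so the flag stays and the comment documents why.
    flaky = True,
)

sh_test(
    name = "ledger_unit_test",
    srcs = ["ledger_unit_test.sh"],
    # flaky = True was removed here: observed flakiness rate < 1%.
    # flaky = False is never spelled out since it's the default.
)
```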