Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Previous failing test masks correct result in remote cache #11057

Closed
gergelyfabian opened this issue Apr 2, 2020 · 29 comments
Closed

Previous failing test masks correct result in remote cache #11057

gergelyfabian opened this issue Apr 2, 2020 · 29 comments
Assignees
Labels
not stale Issues or PRs that are inactive but not considered stale P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@gergelyfabian
Copy link

gergelyfabian commented Apr 2, 2020

ATTENTION! Please read and follow:

  • if this is a question about how to build / test / query / deploy using Bazel, or a discussion starter, send it to [email protected]
  • if this is a bug or feature request, fill the form below as best as you can.

Description of the problem / feature request:

Assume we have two branches of a git repository. master with passing unit tests and a unit_test_failure branch with failing unit tests. I have disk_cache enabled.
I run:

$ git checkout master
$ bazel test //...
-> unit tests pass. No cache hit.
$ bazel test //...
-> unit tests pass. Cache hit.
$ git checkout unit_test_failure
$ bazel test //...
-> unit tests fail
$ git checkout master
$ bazel test //...
-> unit tests pass. No cache hit.

If instead, I add "rm -rf bazel-out/k8-fastbuild/testlogs/" right before running the last "bazel test" on master (or "bazel clean"), then there will be a remote cache hit:

$ git checkout master
$ bazel test //...
-> unit tests pass. No cache hit.
$ bazel test //...
-> unit tests pass. Cache hit.
$ git checkout unit_test_failure
$ bazel test //...
-> unit tests fail
$ git checkout master

Run "bazel clean" or "rm -rf bazel-out/k8-fastbuild/testlogs/"

$ rm -rf bazel-out/k8-fastbuild/testlogs/
$ bazel test //...
-> unit tests pass. Cache hit.

I would expect a remote cache hit also in the first case. But this does not happen.

I suppose this is a bug. If not, please let me know what I'm doing wrong.

Feature requests: what underlying problem are you trying to solve with this feature?

Replace this line with your answer.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

See in Description.
Please use my repo on github: https://github.com/gergelyfabian/bazel-scala-example.
The example branches I used in this description (master and unit_test_failure) are there and testable.

What operating system are you running Bazel on?

Ubuntu 18.04.

What's the output of bazel info release?

INFO: Invocation ID: a2999d7d-0b72-4f0f-b42d-9590c7ea672f
release 2.2.0

Tested on Bazel 2.0.0 and 2.2.0.

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

Replace this line with your answer.

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

[email protected]:gergelyfabian/bazel-scala-example.git
e1e4ae4da5a3d8c18a68e1c8c6baddb34d187659
9b4a64e1be19b28ff73941446d6729cd98dcd8cc

Have you found anything relevant by searching the web?

There is another issue that seems similar: #9389.

Any other information, logs, or outputs that you want to share?

Case reproducing the bug:

$ git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
$ bazel test //...
INFO: Invocation ID: 92696c4f-639f-4e5b-87b7-f22adcfd8d3a
INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured).
INFO: Found 3 targets and 2 test targets...
INFO: Elapsed time: 0.167s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
//example-lib:test (cached) PASSED in 0.0s
//example-maven:test (cached) PASSED in 0.0s

Executed 0 out of 2 tests: 2 tests pass.
INFO: Build completed successfully, 1 total action
$ git checkout unit_test_failure
Switched to branch 'unit_test_failure'
$ bazel test //...
INFO: Invocation ID: 1eef1f1d-7df4-4d3d-ac92-97e638fe17e9
INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured).
INFO: Found 3 targets and 2 test targets...
FAIL: //example-maven:test (see /home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log)
INFO: Elapsed time: 1.233s, Critical Path: 1.07s
INFO: 2 processes: 1 remote cache hit, 1 linux-sandbox.
INFO: Build completed, 1 test FAILED, 3 total actions
//example-lib:test (cached) PASSED in 0.0s
//example-maven:test FAILED in 1.0s
/home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log

Executed 1 out of 2 tests: 1 test passes and 1 fails locally.
INFO: Build completed, 1 test FAILED, 3 total actions
$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ bazel test //...
INFO: Invocation ID: 3a21e211-1902-4433-b344-9f04b3d83558
INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured).
INFO: Found 3 targets and 2 test targets...
INFO: Elapsed time: 1.240s, Critical Path: 1.07s
INFO: 2 processes: 1 remote cache hit, 1 linux-sandbox.
INFO: Build completed successfully, 3 total actions
//example-lib:test (cached) PASSED in 0.0s
//example-maven:test PASSED in 1.0s

Executed 1 out of 2 tests: 2 tests pass.
INFO: Build completed successfully, 3 total actions


Forcing remote cache hit:

$ git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
$
$ bazel test //...
INFO: Invocation ID: 71339ed1-128c-412c-a6cd-6e812b38a880
INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured).
INFO: Found 3 targets and 2 test targets...
INFO: Elapsed time: 0.234s, Critical Path: 0.01s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
//example-lib:test (cached) PASSED in 0.0s
//example-maven:test (cached) PASSED in 1.0s

Executed 0 out of 2 tests: 2 tests pass.
INFO: Build completed successfully, 1 total action
$ git checkout unit_test_failure
Switched to branch 'unit_test_failure'
$ bazel test //...
INFO: Invocation ID: ec056e85-d024-4072-916b-d89fd017b278
INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured).
INFO: Found 3 targets and 2 test targets...
FAIL: //example-maven:test (see /home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log)
INFO: Elapsed time: 1.285s, Critical Path: 1.04s
INFO: 2 processes: 1 remote cache hit, 1 linux-sandbox.
INFO: Build completed, 1 test FAILED, 3 total actions
//example-lib:test (cached) PASSED in 0.0s
//example-maven:test FAILED in 1.0s
/home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log

Executed 1 out of 2 tests: 1 test passes and 1 fails locally.
INFO: Build completed, 1 test FAILED, 3 total actions
$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ bazel clean
INFO: Invocation ID: 0cbb6d3d-a42d-41c2-9ece-dde8feb7d795
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
$ bazel test //...
INFO: Invocation ID: afbf5193-0f6a-4b0e-8b6d-a301aee6ba11
INFO: Analyzed 5 targets (58 packages loaded, 1299 targets configured).
INFO: Found 3 targets and 2 test targets...
INFO: From Building external/com_google_protobuf/libprotobuf_java.jar (123 source files, 1 source jar):
warning: -parameters is not supported for target value 1.7. Use 1.8 or later.
INFO: Elapsed time: 5.215s, Critical Path: 2.67s
INFO: 212 processes: 212 remote cache hit.
INFO: Build completed successfully, 245 total actions
//example-lib:test (cached) PASSED in 0.0s
//example-maven:test (cached) PASSED in 0.0s

Executed 0 out of 2 tests: 2 tests pass.
INFO: Build completed successfully, 245 total actions

@gergelyfabian gergelyfabian changed the title Failing test masks previous correct result in local workspace cache Previous failing test masks correct result in remote cache Apr 2, 2020
@gergelyfabian
Copy link
Author

Workaround

When returning to the proper branch, instead of running "bazel clean", it is enough to run:

rm -rf bazel-out/k8-fastbuild/testlogs/

Then the test results will be properly taken from remote cache. I'd consider doing this always before running tests (e.g. in a CICD tool).

@dslomov dslomov added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged labels Apr 3, 2020
@dslomov
Copy link
Contributor

dslomov commented Apr 3, 2020

/cc @meteorcloudy

@ulfjack
Copy link
Contributor

ulfjack commented Apr 3, 2020

I think this is just a display issue - when Bazel gets a test result from the disk cache, it incorrectly doesn't mark it as a cache hit even though it is. Try adding a sleep 20 to the test and see how long it takes?

@gergelyfabian
Copy link
Author

I added a sleep 20s to the test (not yet committed), and yes, it's not just a display issue, it is indeed waiting 20s when running the test again on the "good" branch.

If I delete testlogs, then it's not waiting 20s and has the results right away.

@gergelyfabian
Copy link
Author

Just a clarification... I think this is not a bug in remote cache (because the remote cache stores good results), but rather somewhere in the "communication" between the sandbox and the remote cache. Why is the previous failed test result in the sandbox "masking" the good result from the remote cache?

@meteorcloudy meteorcloudy self-assigned this Apr 3, 2020
@meteorcloudy
Copy link
Member

Thanks for reporting this! I'll look into the cause.

@gergelyfabian
Copy link
Author

One addition. If I have two branches, that differ in their test code, but tests are passing on both branches, assuming the remote cache is already populated, then the behavior is the following:

  1. test on branch1, tests pass, remote cache hit.
  2. test on branch1, tests pass, local cache hit.
  3. test on branch2, tests pass, remote cache hit.
  4. test on branch2, tests pass, local cache hit.
  5. test on branch1, tests pass, remote cache hit.

It seems that it's only the fail that somehow "tags" a failed test target and forces it to be rebuilt when switching back to the "good" branch. This can be avoided by the workaround to run "bazel clean" or delete the testlogs.

@meteorcloudy
Copy link
Member

@gergelyfabian Thanks for the detailed info! I can also reproduce this with a very simple case: https://github.com/meteorcloudy/my_tests/tree/master/test_result_cache_test, both on Windows and Linux.
Investigating now..

@meteorcloudy
Copy link
Member

I think the problematic logic is here:

if (cacheTestResults == TriState.AUTO && prevStatus != null && !prevStatus.getTestPassed()) {

Here if Bazel detects previous test run failed, it will unconditionally rerun the test without taking any cache into consideration.

rm bazel-out/k8-fastbuild/testlogs/ worked because it removed the previous test status.

Let me see how can we optimize the logic.

@meteorcloudy
Copy link
Member

Update: If we simply remove this check, we'll cause failed test to be cached.

@ulfjack
Copy link
Contributor

ulfjack commented Apr 3, 2020

Removing the check isn't the right solution. I think the issue is that we use that to also toggle a bit in the remote caching protocol on whether we accept a cache hit or not. If we assume that the remote cache never caches failed actions, then we don't need to set that bit. However, we do need to set the bit if --cache_test_results=no is set.

@ulfjack
Copy link
Contributor

ulfjack commented Apr 3, 2020

To clarify - the check in TestRunnerAction is also applied to the local cache, where we do want this behavior - at least that's what the flag documentation for cache_test_results says. You could work around by setting --cache_test_results=yes.

@meteorcloudy
Copy link
Member

But --cache_test_results=yes will also cause failed test not to rerun anymore.
@ulfjack Can you tell me where do we toggle the bit for remote caching protocol? I'm not quite sure how we decide a test will be cached by remote caching or not.

@ulfjack
Copy link
Contributor

ulfjack commented Apr 3, 2020

boolean uploadResults = Status.SUCCESS.equals(result.status()) && result.exitCode() == 0;

@ulfjack
Copy link
Contributor

ulfjack commented Apr 3, 2020

And here:

@ulfjack
Copy link
Contributor

ulfjack commented Apr 3, 2020

Anyway, that's the class that's primarily responsible.

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) type: bug and removed untriaged labels Dec 9, 2020
@github-actions
Copy link

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 2+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label Apr 19, 2023
@github-actions
Copy link

github-actions bot commented May 4, 2023

This issue has been automatically closed due to inactivity. If you're still interested in pursuing this, please reach out to the triage team (@bazelbuild/triage). Thanks!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 4, 2023
@juanzolotoochin
Copy link

@bazelbuild/triage can this issue be reopened? it's a big issue when bazel is used for large integration tests in a CI with persistent workers.

Any code change that goes through CI can affect code in a way that breaks one or many large tests. Because of this bug, it also leaves the worker in a state in which it will not reuse test results from remote cache when moving on to work on the CI run for another, completely unrelated PR.

@tjgq tjgq reopened this Jan 31, 2024
@tjgq tjgq added not stale Issues or PRs that are inactive but not considered stale and removed stale Issues or PRs that are stale (no activity for 30 days) labels Jan 31, 2024
@tomdegoede
Copy link
Contributor

Wow I just observed this issue when preparing a Bazel demo. This explains some weird cache misses in our CI as well. Since this discussion the implementation of maybeReadCacheStatus has changed which breaks the workaround of deleting test.cache_status files. Not loading a cached test result when the previous local run has failed, but code modifications have been reverted, is most definitely a bug.

I feel like the fix should be relatively easy as TestRunnerAction.shouldCacheResult should never consider the previous TestResultData as it is invoked for a new TestRunnerSpawn. Could you confirm my thought @meteorcloudy @ulfjack ? I'm willing to work on a PR for this.

@tomdegoede
Copy link
Contributor

@tjgq could you verify my thoughts above?

@tjgq
Copy link
Contributor

tjgq commented Aug 1, 2024

I feel like the fix should be relatively easy as TestRunnerAction.shouldCacheResult should never consider the previous TestResultData as it is invoked for a new TestRunnerSpawn.

Did you mean "shouldCacheResult should ignore the previous TestResultData if it is invoked with a different TestRunnerSpawn than the one that produced the TestResultData"? Conceptually I agree, but I'm not sure how to translate that into code without keeping around a reference to the TestRunnerSpawn (which might be a bad idea).

It's also worth keeping in mind that, since the status file is persisted to disk, any solutions that rely on keeping additional in-memory state won't fix the issue across server restarts. For that, we'd need to store something in the status file that captures the identity of the test (e.g. something like j.c.g.devtools.build.lib.remote.ActionKey, although that particular choice is architecturally awkward because the core test machinery should not depend on the remote package which is mostly Bazel-specific).

@tjgq
Copy link
Contributor

tjgq commented Aug 1, 2024

I also believe this cache bypass logic would not be needed at all if we could ensure that a failed action can never be stored in a disk or remote cache. This is already the case for a disk cache or for a remote cache that only Bazel writes to (i.e., no remote execution, nor clients other than Bazel), because Bazel will never upload a failed action. But it's an issue for remote execution (or a remote cache that may have been previously used with remote execution) because the protocol is silent on whether failed actions are cached, and doesn't give clients a way to ask one way or the other (they can only force unconditional reexecution of a cached action, not reexecution only in the case of a cached failure).

@henkerik
Copy link

henkerik commented Aug 1, 2024

Hey @tjgq, I'm probably missing a lot here but shouldn't we not simply change the implementation of computeExecuteUnconditionallyFromTestStatus? Something like this:

https://github.com/dhlparcel/bazel/pull/1/files

Just joining the conversation because I run into this issue at work as well and there is finally some more activity on this issue. 👍 . In the past we simply removed the testlogs as mentioned earlier in this issue, but that doesn't seem to work anymore.

@tjgq
Copy link
Contributor

tjgq commented Aug 1, 2024

Maybe, but since I don't fully understand the reason for the existence of this logic (my best guess is what I've described in #11057 (comment) but it's just that, a guess) I'm worried that removing it might have adverse effects. Let me do a bit of archaeology and get back to you.

@tjgq
Copy link
Contributor

tjgq commented Aug 1, 2024

For future reference, here's a super tiny repro:

cat >BUILD <<EOF
sh_test(srcs = ["test.sh"])
EOF
cat >.bazelrc <<EOF
build --disk_cache=/tmp/disk_cache
EOF
touch test.sh && chmod +x test.sh

echo "exit 0" > test.sh
bazel test :test   # build successful, disk cache miss
bazel test :test   # build successful, disk cache hit
echo "exit 1" > test.sh
bazel build :test  # build failure, disk cache miss
echo "exit 0" > test.sh
bazel test :test  # build successful, disk cache miss (but expected disk cache hit)

After some spelunking I think I know what the correct fix looks like. Expect a commit soon.

@fmeum
Copy link
Collaborator

fmeum commented Aug 9, 2024

@bazel-io flag

@bazel-io bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Aug 9, 2024
@iancha1992
Copy link
Member

@bazel-io fork 7.4.0

@bazel-io bazel-io removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Aug 9, 2024
fmeum pushed a commit to fmeum/bazel that referenced this issue Sep 19, 2024
…a failure of the same test.

A source edit might have occurred in the interim, putting the test back into a passing state that has been previously stored in the cache. The current logic disables disk/remote caching when rerunning a test following a failure, which causes spurious reexecutions (see issue bazelbuild#11057).

Although this code has been around for a very long time, it seems that it was originally added to ensure that a flaky test failure doesn't become sticky. However, for this purpose it suffices to bust the local cache, as a disk/remote cache should never store failures (if it did, this would not be a complete fix, and we'd have much bigger problems elsewhere). Note that Bazel itself never uploads local failures to a disk/remote cache, but can't control whether a remote executor does.

The fix is to split the `TestRunnerAction.canBeCached` logic into two versions: one that takes into account a previous test result (for the local cache check) and another that doesn't (to determine whether we should attempt to hit the disk/remote cache).

Fixes bazelbuild#11057.

PiperOrigin-RevId: 661319835
Change-Id: I9248cbfa31dd135b2bda971e48bb17d7f828889c
(cherry picked from commit e9709b7)
fmeum pushed a commit to fmeum/bazel that referenced this issue Sep 19, 2024
…a failure of the same test.

A source edit might have occurred in the interim, putting the test back into a passing state that has been previously stored in the cache. The current logic disables disk/remote caching when rerunning a test following a failure, which causes spurious reexecutions (see issue bazelbuild#11057).

Although this code has been around for a very long time, it seems that it was originally added to ensure that a flaky test failure doesn't become sticky. However, for this purpose it suffices to bust the local cache, as a disk/remote cache should never store failures (if it did, this would not be a complete fix, and we'd have much bigger problems elsewhere). Note that Bazel itself never uploads local failures to a disk/remote cache, but can't control whether a remote executor does.

The fix is to split the `TestRunnerAction.canBeCached` logic into two versions: one that takes into account a previous test result (for the local cache check) and another that doesn't (to determine whether we should attempt to hit the disk/remote cache).

Fixes bazelbuild#11057.

PiperOrigin-RevId: 661319835
Change-Id: I9248cbfa31dd135b2bda971e48bb17d7f828889c
(cherry picked from commit e9709b7)
github-merge-queue bot pushed a commit that referenced this issue Sep 19, 2024
…llowing … (#23673)

…a failure of the same test.

A source edit might have occurred in the interim, putting the test back
into a passing state that has been previously stored in the cache. The
current logic disables disk/remote caching when rerunning a test
following a failure, which causes spurious reexecutions (see issue
#11057).

Although this code has been around for a very long time, it seems that
it was originally added to ensure that a flaky test failure doesn't
become sticky. However, for this purpose it suffices to bust the local
cache, as a disk/remote cache should never store failures (if it did,
this would not be a complete fix, and we'd have much bigger problems
elsewhere). Note that Bazel itself never uploads local failures to a
disk/remote cache, but can't control whether a remote executor does.

The fix is to split the `TestRunnerAction.canBeCached` logic into two
versions: one that takes into account a previous test result (for the
local cache check) and another that doesn't (to determine whether we
should attempt to hit the disk/remote cache).

Fixes #11057.

PiperOrigin-RevId: 661319835
Change-Id: I9248cbfa31dd135b2bda971e48bb17d7f828889c 
(cherry picked from commit e9709b7)

Fixes #23259

Co-authored-by: Googler <[email protected]>
@iancha1992
Copy link
Member

A fix for this issue has been included in Bazel 7.4.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.4.0rc1. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
not stale Issues or PRs that are inactive but not considered stale P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests