-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Previous failing test masks correct result in remote cache #11057
Comments
Workaround When returning to the proper branch, instead of running "bazel clean", it is enough to run: rm -rf bazel-out/k8-fastbuild/testlogs/ Then the test results will be properly taken from remote cache. I'd consider doing this always before running tests (e.g. in a CICD tool). |
/cc @meteorcloudy |
I think this is just a display issue - when Bazel gets a test result from the disk cache, it incorrectly doesn't mark it as a cache hit even though it is. Try adding a sleep 20 to the test and see how long it takes? |
I added a sleep 20s to the test (not yet committed), and yes, it's not just a display issue, it is indeed waiting 20s when running the test again on the "good" branch. If I delete testlogs, then it's not waiting 20s and has the results right away. |
Just a clarification... I think this is not a bug in remote cache (because the remote cache stores good results), but rather somewhere in the "communication" between the sandbox and the remote cache. Why is the previous failed test result in the sandbox "masking" the good result from the remote cache? |
Thanks for reporting this! I'll look into the cause. |
One addition. If I have two branches, that differ in their test code, but tests are passing on both branches, assuming the remote cache is already populated, then the behavior is the following:
It seems that it's only the fail that somehow "tags" a failed test target and forces it to be rebuilt when switching back to the "good" branch. This can be avoided by the workaround to run "bazel clean" or delete the testlogs. |
@gergelyfabian Thanks for the detailed info! I can also reproduce this with a very simple case: https://github.com/meteorcloudy/my_tests/tree/master/test_result_cache_test, both on Windows and Linux. |
I think the problematic logic is here: bazel/src/main/java/com/google/devtools/build/lib/analysis/test/TestRunnerAction.java Line 429 in a9b118b
Here if Bazel detects previous test run failed, it will unconditionally rerun the test without taking any cache into consideration.
Let me see how can we optimize the logic. |
Update: If we simply remove this check, we'll cause failed test to be cached. |
Removing the check isn't the right solution. I think the issue is that we use that to also toggle a bit in the remote caching protocol on whether we accept a cache hit or not. If we assume that the remote cache never caches failed actions, then we don't need to set that bit. However, we do need to set the bit if |
To clarify - the check in TestRunnerAction is also applied to the local cache, where we do want this behavior - at least that's what the flag documentation for |
But |
|
And here:
|
Anyway, that's the class that's primarily responsible. |
Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 2+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team ( |
This issue has been automatically closed due to inactivity. If you're still interested in pursuing this, please reach out to the triage team ( |
@bazelbuild/triage can this issue be reopened? it's a big issue when bazel is used for large integration tests in a CI with persistent workers. Any code change that goes through CI can affect code in a way that breaks one or many large tests. Because of this bug, it also leaves the worker in a state in which it will not reuse test results from remote cache when moving on to work on the CI run for another, completely unrelated PR. |
Wow I just observed this issue when preparing a Bazel demo. This explains some weird cache misses in our CI as well. Since this discussion the implementation of maybeReadCacheStatus has changed which breaks the workaround of deleting test.cache_status files. Not loading a cached test result when the previous local run has failed, but code modifications have been reverted, is most definitely a bug. I feel like the fix should be relatively easy as TestRunnerAction.shouldCacheResult should never consider the previous TestResultData as it is invoked for a new TestRunnerSpawn. Could you confirm my thought @meteorcloudy @ulfjack ? I'm willing to work on a PR for this. |
@tjgq could you verify my thoughts above? |
Did you mean "shouldCacheResult should ignore the previous TestResultData if it is invoked with a different TestRunnerSpawn than the one that produced the TestResultData"? Conceptually I agree, but I'm not sure how to translate that into code without keeping around a reference to the TestRunnerSpawn (which might be a bad idea). It's also worth keeping in mind that, since the status file is persisted to disk, any solutions that rely on keeping additional in-memory state won't fix the issue across server restarts. For that, we'd need to store something in the status file that captures the identity of the test (e.g. something like |
I also believe this cache bypass logic would not be needed at all if we could ensure that a failed action can never be stored in a disk or remote cache. This is already the case for a disk cache or for a remote cache that only Bazel writes to (i.e., no remote execution, nor clients other than Bazel), because Bazel will never upload a failed action. But it's an issue for remote execution (or a remote cache that may have been previously used with remote execution) because the protocol is silent on whether failed actions are cached, and doesn't give clients a way to ask one way or the other (they can only force unconditional reexecution of a cached action, not reexecution only in the case of a cached failure). |
Hey @tjgq, I'm probably missing a lot here but shouldn't we not simply change the implementation of https://github.com/dhlparcel/bazel/pull/1/files Just joining the conversation because I run into this issue at work as well and there is finally some more activity on this issue. 👍 . In the past we simply removed the testlogs as mentioned earlier in this issue, but that doesn't seem to work anymore. |
Maybe, but since I don't fully understand the reason for the existence of this logic (my best guess is what I've described in #11057 (comment) but it's just that, a guess) I'm worried that removing it might have adverse effects. Let me do a bit of archaeology and get back to you. |
For future reference, here's a super tiny repro:
After some spelunking I think I know what the correct fix looks like. Expect a commit soon. |
@bazel-io flag |
@bazel-io fork 7.4.0 |
…a failure of the same test. A source edit might have occurred in the interim, putting the test back into a passing state that has been previously stored in the cache. The current logic disables disk/remote caching when rerunning a test following a failure, which causes spurious reexecutions (see issue bazelbuild#11057). Although this code has been around for a very long time, it seems that it was originally added to ensure that a flaky test failure doesn't become sticky. However, for this purpose it suffices to bust the local cache, as a disk/remote cache should never store failures (if it did, this would not be a complete fix, and we'd have much bigger problems elsewhere). Note that Bazel itself never uploads local failures to a disk/remote cache, but can't control whether a remote executor does. The fix is to split the `TestRunnerAction.canBeCached` logic into two versions: one that takes into account a previous test result (for the local cache check) and another that doesn't (to determine whether we should attempt to hit the disk/remote cache). Fixes bazelbuild#11057. PiperOrigin-RevId: 661319835 Change-Id: I9248cbfa31dd135b2bda971e48bb17d7f828889c (cherry picked from commit e9709b7)
…a failure of the same test. A source edit might have occurred in the interim, putting the test back into a passing state that has been previously stored in the cache. The current logic disables disk/remote caching when rerunning a test following a failure, which causes spurious reexecutions (see issue bazelbuild#11057). Although this code has been around for a very long time, it seems that it was originally added to ensure that a flaky test failure doesn't become sticky. However, for this purpose it suffices to bust the local cache, as a disk/remote cache should never store failures (if it did, this would not be a complete fix, and we'd have much bigger problems elsewhere). Note that Bazel itself never uploads local failures to a disk/remote cache, but can't control whether a remote executor does. The fix is to split the `TestRunnerAction.canBeCached` logic into two versions: one that takes into account a previous test result (for the local cache check) and another that doesn't (to determine whether we should attempt to hit the disk/remote cache). Fixes bazelbuild#11057. PiperOrigin-RevId: 661319835 Change-Id: I9248cbfa31dd135b2bda971e48bb17d7f828889c (cherry picked from commit e9709b7)
…llowing … (#23673) …a failure of the same test. A source edit might have occurred in the interim, putting the test back into a passing state that has been previously stored in the cache. The current logic disables disk/remote caching when rerunning a test following a failure, which causes spurious reexecutions (see issue #11057). Although this code has been around for a very long time, it seems that it was originally added to ensure that a flaky test failure doesn't become sticky. However, for this purpose it suffices to bust the local cache, as a disk/remote cache should never store failures (if it did, this would not be a complete fix, and we'd have much bigger problems elsewhere). Note that Bazel itself never uploads local failures to a disk/remote cache, but can't control whether a remote executor does. The fix is to split the `TestRunnerAction.canBeCached` logic into two versions: one that takes into account a previous test result (for the local cache check) and another that doesn't (to determine whether we should attempt to hit the disk/remote cache). Fixes #11057. PiperOrigin-RevId: 661319835 Change-Id: I9248cbfa31dd135b2bda971e48bb17d7f828889c (cherry picked from commit e9709b7) Fixes #23259 Co-authored-by: Googler <[email protected]>
A fix for this issue has been included in Bazel 7.4.0 RC1. Please test out the release candidate and report any issues as soon as possible. |
Description of the problem / feature request:
Assume we have two branches of a git repository. master with passing unit tests and a unit_test_failure branch with failing unit tests. I have disk_cache enabled.
I run:
$ git checkout master
$ bazel test //...
-> unit tests pass. No cache hit.
$ bazel test //...
-> unit tests pass. Cache hit.
$ git checkout unit_test_failure
$ bazel test //...
-> unit tests fail
$ git checkout master
$ bazel test //...
-> unit tests pass. No cache hit.
If instead, I add "rm -rf bazel-out/k8-fastbuild/testlogs/" right before running the last "bazel test" on master (or "bazel clean"), then there will be a remote cache hit:
$ git checkout master
$ bazel test //...
-> unit tests pass. No cache hit.
$ bazel test //...
-> unit tests pass. Cache hit.
$ git checkout unit_test_failure
$ bazel test //...
-> unit tests fail
$ git checkout master
Run "bazel clean" or "rm -rf bazel-out/k8-fastbuild/testlogs/"
$ rm -rf bazel-out/k8-fastbuild/testlogs/
$ bazel test //...
-> unit tests pass. Cache hit.
I would expect a remote cache hit also in the first case. But this does not happen.
I suppose this is a bug. If not, please let me know what I'm doing wrong.
Feature requests: what underlying problem are you trying to solve with this feature?
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
See in Description.
Please use my repo on github: https://github.com/gergelyfabian/bazel-scala-example.
The example branches I used in this description (master and unit_test_failure) are there and testable.
What operating system are you running Bazel on?
Ubuntu 18.04.
What's the output of
bazel info release
?INFO: Invocation ID: a2999d7d-0b72-4f0f-b42d-9590c7ea672f
release 2.2.0
Tested on Bazel 2.0.0 and 2.2.0.
If
bazel info release
returns "development version" or "(@non-git)", tell us how you built Bazel.What's the output of
git remote get-url origin ; git rev-parse master ; git rev-parse HEAD
?[email protected]:gergelyfabian/bazel-scala-example.git
e1e4ae4da5a3d8c18a68e1c8c6baddb34d187659
9b4a64e1be19b28ff73941446d6729cd98dcd8cc
Have you found anything relevant by searching the web?
There is another issue that seems similar: #9389.
Any other information, logs, or outputs that you want to share?
Case reproducing the bug:
$ git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
$ bazel test //...
INFO: Invocation ID: 92696c4f-639f-4e5b-87b7-f22adcfd8d3a
INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured).
INFO: Found 3 targets and 2 test targets...
INFO: Elapsed time: 0.167s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
//example-lib:test (cached) PASSED in 0.0s
//example-maven:test (cached) PASSED in 0.0s
Executed 0 out of 2 tests: 2 tests pass.
INFO: Build completed successfully, 1 total action
$ git checkout unit_test_failure
Switched to branch 'unit_test_failure'
$ bazel test //...
INFO: Invocation ID: 1eef1f1d-7df4-4d3d-ac92-97e638fe17e9
INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured).
INFO: Found 3 targets and 2 test targets...
FAIL: //example-maven:test (see /home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log)
INFO: Elapsed time: 1.233s, Critical Path: 1.07s
INFO: 2 processes: 1 remote cache hit, 1 linux-sandbox.
INFO: Build completed, 1 test FAILED, 3 total actions
//example-lib:test (cached) PASSED in 0.0s
//example-maven:test FAILED in 1.0s
/home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log
Executed 1 out of 2 tests: 1 test passes and 1 fails locally.
INFO: Build completed, 1 test FAILED, 3 total actions
$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ bazel test //...
INFO: Invocation ID: 3a21e211-1902-4433-b344-9f04b3d83558
INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured).
INFO: Found 3 targets and 2 test targets...
INFO: Elapsed time: 1.240s, Critical Path: 1.07s
INFO: 2 processes: 1 remote cache hit, 1 linux-sandbox.
INFO: Build completed successfully, 3 total actions
//example-lib:test (cached) PASSED in 0.0s
//example-maven:test PASSED in 1.0s
Executed 1 out of 2 tests: 2 tests pass.
INFO: Build completed successfully, 3 total actions
Forcing remote cache hit:
$ git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
$
$ bazel test //...
INFO: Invocation ID: 71339ed1-128c-412c-a6cd-6e812b38a880
INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured).
INFO: Found 3 targets and 2 test targets...
INFO: Elapsed time: 0.234s, Critical Path: 0.01s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
//example-lib:test (cached) PASSED in 0.0s
//example-maven:test (cached) PASSED in 1.0s
Executed 0 out of 2 tests: 2 tests pass.
INFO: Build completed successfully, 1 total action
$ git checkout unit_test_failure
Switched to branch 'unit_test_failure'
$ bazel test //...
INFO: Invocation ID: ec056e85-d024-4072-916b-d89fd017b278
INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured).
INFO: Found 3 targets and 2 test targets...
FAIL: //example-maven:test (see /home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log)
INFO: Elapsed time: 1.285s, Critical Path: 1.04s
INFO: 2 processes: 1 remote cache hit, 1 linux-sandbox.
INFO: Build completed, 1 test FAILED, 3 total actions
//example-lib:test (cached) PASSED in 0.0s
//example-maven:test FAILED in 1.0s
/home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log
Executed 1 out of 2 tests: 1 test passes and 1 fails locally.
INFO: Build completed, 1 test FAILED, 3 total actions
$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ bazel clean
INFO: Invocation ID: 0cbb6d3d-a42d-41c2-9ece-dde8feb7d795
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
$ bazel test //...
INFO: Invocation ID: afbf5193-0f6a-4b0e-8b6d-a301aee6ba11
INFO: Analyzed 5 targets (58 packages loaded, 1299 targets configured).
INFO: Found 3 targets and 2 test targets...
INFO: From Building external/com_google_protobuf/libprotobuf_java.jar (123 source files, 1 source jar):
warning: -parameters is not supported for target value 1.7. Use 1.8 or later.
INFO: Elapsed time: 5.215s, Critical Path: 2.67s
INFO: 212 processes: 212 remote cache hit.
INFO: Build completed successfully, 245 total actions
//example-lib:test (cached) PASSED in 0.0s
//example-maven:test (cached) PASSED in 0.0s
Executed 0 out of 2 tests: 2 tests pass.
INFO: Build completed successfully, 245 total actions
The text was updated successfully, but these errors were encountered: