Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skyframe invalidation with disk or remote cache #22367

Open
fmeum opened this issue May 14, 2024 · 16 comments
Open

Skyframe invalidation with disk or remote cache #22367

fmeum opened this issue May 14, 2024 · 16 comments
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@fmeum
Copy link
Collaborator

fmeum commented May 14, 2024

Description of the bug:

When using a disk cache (possibly also a remote cache), Bazel repeatedly marks output files obtained from the cache as dirty to trigger a download, but then doesn't download the files, so they are still missing on the next invocation.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  1. Patch Bazel with the diff further below.
  2. Repeatedly run baze --nosystem_rc --nohome_rc build --disk_cache=some_path //... on some project with Java targets.

See lines such as:

artifact is dirty: File:[[<execution_root>]bazel-out/darwin_arm64-fastbuild/bin]jeydtzir/libjeydtzir-hjar.jdeps RemoteFileArtifactValueWithExpiration{digest=0x7FACF6298A30C6CB373A988E3A3683CD3E8E9C46C59ECC570F089867234A3FEF, size=23, locationIndex=1, materializationExecPath=null, expireAtEpochMilli=1715713661059}
Invalidating: [ActionLookupData1{actionLookupKey=ConfiguredTargetKey{label=//vkprdlnk:vkprdlnk, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=1}, ActionLookupData0{actionLookupKey=ConfiguredTargetKey{label=//wztejdxc:wztejdxc, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=0}, ActionLookupData0{actionLookupKey=ConfiguredTargetKey{label=@@rules_java~//toolchains:platformclasspath, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=0}, ActionLookupData1{actionLookupKey=ConfiguredTargetKey{label=//jeydtzir:jeydtzir, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=1}, ActionLookupData0{actionLookupKey=ConfiguredTargetKey{label=//vkprdlnk:vkprdlnk, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=0}, ActionLookupData1{actionLookupKey=ConfiguredTargetKey{label=//mynbiqpm:mynbiqpm, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=1}, ActionLookupData1{actionLookupKey=ConfiguredTargetKey{label=//zjplsgqe:zjplsgqe, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=1}, ActionLookupData1{actionLookupKey=ConfiguredTargetKey{label=@@rules_java~//toolchains:platformclasspath, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=1}, ActionLookupData0{actionLookupKey=ConfiguredTargetKey{label=//jeydtzir:jeydtzir, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=0}, ActionLookupData1{actionLookupKey=ConfiguredTargetKey{label=//wztejdxc:wztejdxc, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=1}] com.google.devtools.build.skyframe.InvalidatingNodeVisitor$DirtyingInvalidationState@104d38f9

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

diff --git a/src/main/java/com/google/devtools/build/lib/skyframe/FilesystemValueChecker.java b/src/main/java/com/google/devtools/build/lib/skyframe/FilesystemValueChecker.java
index 60a08ddf68..363c5291af 100644
--- a/src/main/java/com/google/devtools/build/lib/skyframe/FilesystemValueChecker.java
+++ b/src/main/java/com/google/devtools/build/lib/skyframe/FilesystemValueChecker.java
@@ -539,6 +539,7 @@ public class FilesystemValueChecker {
     for (Map.Entry<Artifact, FileArtifactValue> entry : actionValue.getAllFileValues().entrySet()) {
       if (artifactIsDirtyWithDirectSystemCalls(
           knownModifiedOutputFiles, remoteArtifactChecker, entry, modifiedOutputsReceiver)) {
+        System.err.println("artifact is dirty: " + entry.getKey() + " " + entry.getValue());
         isDirty = true;
       }
     }
@@ -551,6 +552,8 @@ public class FilesystemValueChecker {
           tree.getChildValues().entrySet()) {
         if (artifactIsDirtyWithDirectSystemCalls(
             knownModifiedOutputFiles, remoteArtifactChecker, childEntry, modifiedOutputsReceiver)) {
+          System.err.println(
+              "tree artifact entry is dirty: " + childEntry.getKey() + " " + childEntry.getValue());
           isDirty = true;
         }
       }
@@ -572,6 +575,7 @@ public class FilesystemValueChecker {
       if (shouldCheckTreeArtifact(sortedKnownModifiedOutputFiles.get(), treeArtifact)
           && treeArtifactIsDirty(treeArtifact, entry.getValue())) {
         // Count the changed directory as one "file".
+        System.err.println("tree artifact entry is dirty: " + treeArtifact);
         modifiedOutputsReceiver.reportModifiedOutputFile(
             getBestEffortModifiedTime(treeArtifact.getPath()), treeArtifact);
         isDirty = true;
diff --git a/src/main/java/com/google/devtools/build/skyframe/EagerInvalidator.java b/src/main/java/com/google/devtools/build/skyframe/EagerInvalidator.java
index f8f2c0d560..07988108b3 100644
--- a/src/main/java/com/google/devtools/build/skyframe/EagerInvalidator.java
+++ b/src/main/java/com/google/devtools/build/skyframe/EagerInvalidator.java
@@ -82,6 +82,7 @@ public final class EagerInvalidator {
       DirtyAndInflightTrackingProgressReceiver progressReceiver,
       InvalidationState state)
       throws InterruptedException {
+    System.err.println("Invalidating: " + diff + " " + state);
     DirtyingNodeVisitor visitor =
         createInvalidatingVisitorIfNeeded(graph, diff, progressReceiver, state);
     if (visitor != null) {
@fmeum
Copy link
Collaborator Author

fmeum commented May 14, 2024

cc @tjgq @coeuvre

@iancha1992 iancha1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label May 14, 2024
@coeuvre
Copy link
Member

coeuvre commented May 15, 2024

@coeuvre
Copy link
Member

coeuvre commented May 15, 2024

so they are still missing on the next invocation.

Are these outputs should be downloaded (e.g. toplevel outputs, or specified by --remote_download_regex)?

If Bazel marks an action dirty to trigger a download but then doesn't download it because it shouldn't, then it's a performance issue. Otherwise, it's a correctness issue.

@fmeum
Copy link
Collaborator Author

fmeum commented May 15, 2024

Can you repro with --noexperimental_merged_skyframe_analysis_execution?

I can't.

Are these outputs should be downloaded (e.g. toplevel outputs, or specified by --remote_download_regex)?

No, these outputs should not be downloaded (just hjars, jdeps files, ...), so I think this is just a performance issue.

@fmeum
Copy link
Collaborator Author

fmeum commented May 15, 2024

This is the root cause:

// With Skymeld, we don't have enough information at this stage to consider remote files.
// This allows BwoB to function (in the sense that it's now compatible with Skymeld), but
// there's a performance penalty for incremental build: all action nodes will be dirtied.
// We'll be relying on the other forms of caching (local action cache or remote cache).
// TODO(b/281655526): Improve this.
skyframeExecutor.detectModifiedOutputFiles(
modifiedOutputFiles,
env.getBlazeWorkspace().getLastExecutionTimeRange(),
RemoteArtifactChecker.IGNORE_ALL,
buildRequestOptions.fsvcThreads);

Note the RemoteArtifactChecker.IGNORE_ALL, which means that all remote artifacts will be marked as dirty. Happy to work on this, are there any ideas of what would need to be done to improve this? CC @joeleba

@coeuvre
Copy link
Member

coeuvre commented May 15, 2024

skymeld calls skyframeExecutor.detectModifiedOutputFiles when the first toplevel target is analyzed. However, the RemoteArtifactChecker for BwoB needs full analyze result to be able to correctly determine which action should be marked as dirty because it records the path of toplevel outputs in a trie: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/remote/RemoteOutputChecker.java;l=128;drc=6f48f1c2b3bb73768b9ff15ce6698e21eddc503a (there are some skymeld only code, but is for clean build). Otherwise, actions for other toplevel targets will not be invalidated. So we decided to use RemoteArtifactChecker.IGNORE_ALL for now.

One potential solution I have discussed with @joeleba offline is to let skymeld calls skyframeExecutor.detectModifiedOutputFiles for each toplevel target so that the RemoteArtifactChecker.IGNORE_ALL can be replaced with the one from BowB.

@coeuvre
Copy link
Member

coeuvre commented May 15, 2024

This is the root cause:

Argh, I intended to post this link but somehow my clipboard is corrupted.

@joeleba
Copy link
Member

joeleba commented May 15, 2024

This is a known limitation of Skymeld + BwoB. It's a tradeoff essentially: having both skymeld + bwob improves clean builds' performance, but comes with a performance penalty for incremental builds. Luckily the penalty only comes from repeating the skyframe work, since the have other layers of caches for actions.

In the current state, detectModifiedOutputFiles is called once before any action execution in the build. At this point, it's not possible to construct the full picture of what's dirty or not in the RemoteArtifactChecker (only available after the full analysis).

To resolve this, we would need to do detectModifiedOutputFiles incrementally, as the information for each top level target/aspect becomes available.

@zhengwei143 zhengwei143 added P3 We're not considering working on this, but happy to review a PR. (No assignee) help wanted Someone outside the Bazel team could own this and removed untriaged labels May 21, 2024
@brentleyjones
Copy link
Contributor

We reproduced this with just the remote cache. Can this have a higher than P3 priority?

@brentleyjones brentleyjones changed the title Skyframe invalidation with --disk_cache Skyframe invalidation with disk or remote cache Jun 12, 2024
@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) and removed P3 We're not considering working on this, but happy to review a PR. (No assignee) labels Jun 12, 2024
@coeuvre
Copy link
Member

coeuvre commented Jun 12, 2024

Raising it to P2 but we probably won't have time to work on the fix until next month or two. In the mean while, PR is welcome!

@brentleyjones
Copy link
Contributor

brentleyjones commented Aug 19, 2024

Friendly ping on this.

@coeuvre
Copy link
Member

coeuvre commented Aug 21, 2024

I have a proper fix and should be able to submit it in a few days.

@coeuvre coeuvre removed P2 We'll consider working on this in future. (Assignee optional) help wanted Someone outside the Bazel team could own this labels Aug 21, 2024
@coeuvre coeuvre added the P1 I'll work on this now. (Assignee required) label Aug 21, 2024
copybara-service bot pushed a commit that referenced this issue Aug 26, 2024
Previously, we download outputs of toplevel artifacts in CompletionFunction because intermediate targets might be promoted to toplevel by skymeld in the middle of the build. However, it didn't handle runfiles of the target. This CL fixes that.

A test case will be added in a followup CL because it requires changes to skymeld & BwoB to unveil the bug.

Working towards #22367.

PiperOrigin-RevId: 667532830
Change-Id: Idfe2a0b693f103d501c9b24a454c16186468667a
copybara-service bot pushed a commit that referenced this issue Aug 26, 2024
Previously, a incremental build with Skymeld and BwoB has to re-evaluate action node in order to correctly staging toplevel artifacts. The reason is BwoB requires full analysis result before execution phrase to correctly invalidate action nodes for toplevel artifacts and with Skymeld, the full analysis result is not available yet when execution phrase starts.

This CL changes how BwoB invalidate action nodes before entering execution phrase: instead of depending on full analysis result, it now uses `remoteOutputChecker` from previous build -- if `lastRemoteOutputChecker` said Bazel should download one toplevel artifact but it's missing locally, we invalidate the action node. For clean build, `lastRemoteOutputChecker` is initialized to `null` because there is no action nodes to invalidate.

If an artifact is promoted from intermediate output to toplevel in an incremental build, no action node is invalidated. Instead, we rely on `CompletionFunction` to download them. There is one caveat though: sometimes we need to invalidate `CompletionFunction`, e.g. when `outputsMode` is changed.

Working towards #22367.

PiperOrigin-RevId: 667562015
Change-Id: Ida86cf04d99fcdec998970eb3cabd63b2318246f
@coeuvre
Copy link
Member

coeuvre commented Aug 27, 2024

@bazel-io fork 7.4.0

copybara-service bot pushed a commit that referenced this issue Aug 27, 2024
…ArtifactChecker.IGNORE_ALL.

Working towards #22367.

PiperOrigin-RevId: 667965692
Change-Id: Ifc999a51d7ce4d5466d66c0b44e09e1491b87a6e
coeuvre added a commit to coeuvre/bazel that referenced this issue Aug 27, 2024
Previously, we download outputs of toplevel artifacts in CompletionFunction because intermediate targets might be promoted to toplevel by skymeld in the middle of the build. However, it didn't handle runfiles of the target. This CL fixes that.

A test case will be added in a followup CL because it requires changes to skymeld & BwoB to unveil the bug.

Working towards bazelbuild#22367.

PiperOrigin-RevId: 667532830
Change-Id: Idfe2a0b693f103d501c9b24a454c16186468667a
coeuvre added a commit to coeuvre/bazel that referenced this issue Aug 27, 2024
Previously, we download outputs of toplevel artifacts in CompletionFunction because intermediate targets might be promoted to toplevel by skymeld in the middle of the build. However, it didn't handle runfiles of the target. This CL fixes that.

A test case will be added in a followup CL because it requires changes to skymeld & BwoB to unveil the bug.

Working towards bazelbuild#22367.

PiperOrigin-RevId: 667532830
Change-Id: Idfe2a0b693f103d501c9b24a454c16186468667a
coeuvre added a commit to coeuvre/bazel that referenced this issue Aug 28, 2024
Previously, we download outputs of toplevel artifacts in CompletionFunction because intermediate targets might be promoted to toplevel by skymeld in the middle of the build. However, it didn't handle runfiles of the target. This CL fixes that.

A test case will be added in a followup CL because it requires changes to skymeld & BwoB to unveil the bug.

Working towards bazelbuild#22367.

PiperOrigin-RevId: 667532830
Change-Id: Idfe2a0b693f103d501c9b24a454c16186468667a
github-merge-queue bot pushed a commit that referenced this issue Aug 28, 2024
Note that I have slightly modified the code to make it work with
`release-7.4.0` source tree because of missing other commits. More
specifically, instead of using `inputMap.getRunfilesTrees()` to get
runfiles, it uses `TargetCompleteEvent`.

Working towards #23434.

Original Description:

Previously, we download outputs of toplevel artifacts in
CompletionFunction because intermediate targets might be promoted to
toplevel by skymeld in the middle of the build. However, it didn't
handle runfiles of the target. This CL fixes that.

A test case will be added in a followup CL because it requires changes
to skymeld & BwoB to unveil the bug.

Working towards #22367.

PiperOrigin-RevId: 667532830
Change-Id: Idfe2a0b693f103d501c9b24a454c16186468667a
coeuvre added a commit to coeuvre/bazel that referenced this issue Aug 28, 2024
Previously, a incremental build with Skymeld and BwoB has to re-evaluate action node in order to correctly staging toplevel artifacts. The reason is BwoB requires full analysis result before execution phrase to correctly invalidate action nodes for toplevel artifacts and with Skymeld, the full analysis result is not available yet when execution phrase starts.

This CL changes how BwoB invalidate action nodes before entering execution phrase: instead of depending on full analysis result, it now uses `remoteOutputChecker` from previous build -- if `lastRemoteOutputChecker` said Bazel should download one toplevel artifact but it's missing locally, we invalidate the action node. For clean build, `lastRemoteOutputChecker` is initialized to `null` because there is no action nodes to invalidate.

If an artifact is promoted from intermediate output to toplevel in an incremental build, no action node is invalidated. Instead, we rely on `CompletionFunction` to download them. There is one caveat though: sometimes we need to invalidate `CompletionFunction`, e.g. when `outputsMode` is changed.

Working towards bazelbuild#22367.

PiperOrigin-RevId: 667562015
Change-Id: Ida86cf04d99fcdec998970eb3cabd63b2318246f
github-merge-queue bot pushed a commit that referenced this issue Aug 28, 2024
Previously, a incremental build with Skymeld and BwoB has to re-evaluate
action node in order to correctly staging toplevel artifacts. The reason
is BwoB requires full analysis result before execution phrase to
correctly invalidate action nodes for toplevel artifacts and with
Skymeld, the full analysis result is not available yet when execution
phrase starts.

This CL changes how BwoB invalidate action nodes before entering
execution phrase: instead of depending on full analysis result, it now
uses `remoteOutputChecker` from previous build -- if
`lastRemoteOutputChecker` said Bazel should download one toplevel
artifact but it's missing locally, we invalidate the action node. For
clean build, `lastRemoteOutputChecker` is initialized to `null` because
there is no action nodes to invalidate.

If an artifact is promoted from intermediate output to toplevel in an
incremental build, no action node is invalidated. Instead, we rely on
`CompletionFunction` to download them. There is one caveat though:
sometimes we need to invalidate `CompletionFunction`, e.g. when
`outputsMode` is changed.

Working towards #22367.

PiperOrigin-RevId: 667562015
Change-Id: Ida86cf04d99fcdec998970eb3cabd63b2318246f
@coeuvre
Copy link
Member

coeuvre commented Aug 28, 2024

This is fixed both in master and release-7.4.0 branches.

@brentleyjones Can you verify whether it improves your builds?

@brentleyjones
Copy link
Contributor

Thanks! I'll have to wait for an RC to start validation.

@iancha1992
Copy link
Member

A fix for this issue has been included in Bazel 7.4.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.4.0rc1. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

8 participants