Crash due to invalid precondition in remote execution #26055

Open
jmmv opened this issue May 12, 2025 · 3 comments
Labels: P2 (We'll consider working on this in future; assignee optional), team-Remote-Exec (Issues and PRs for the Execution (Remote) team), type: bug

@jmmv
Contributor

jmmv commented May 12, 2025

Description of the bug:

We have a pretty heavy "bazel coverage" run that has started crashing randomly. One of the crashes looks like this and, from reading the code, I suspect there is a race condition somewhere:

FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.RuntimeException: Unrecoverable error while evaluating node 'UnshareableActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//ExecPlatform/src/tests/WarpspeedIntegrationTests:ExecutorInvocationRequestTest, config=BuildConfigurationKey[53fdec518cb175bd25c52e6abc424cf7c26111e7aa02b768746e45777f0f72b9]}, actionIndex=10}' (requested by nodes 'TestCompletionKey{configuredTargetKey=ConfiguredTargetKey{label=//ExecPlatform/src/tests/WarpspeedIntegrationTests:ExecutorInvocationRequestTest, config=BuildConfigurationKey[53fdec518cb175bd25c52e6abc424cf7c26111e7aa02b768746e45777f0f72b9]}, topLevelArtifactContext=com.google.devtools.build.lib.analysis.TopLevelArtifactContext@90904c3b, exclusiveTesting=false}')
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:550)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
        at java.base/java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(Unknown Source)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
        at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: java.lang.IllegalStateException
        at com.google.common.base.Preconditions.checkState(Preconditions.java:496)
        at com.google.devtools.build.lib.remote.ExperimentalGrpcRemoteExecutor$Execution.waitExecution(ExperimentalGrpcRemoteExecutor.java:193)
        at com.google.devtools.build.lib.remote.util.Utils.refreshIfUnauthenticated(Utils.java:528)
        at com.google.devtools.build.lib.remote.ExperimentalGrpcRemoteExecutor$Execution.lambda$start$1(ExperimentalGrpcRemoteExecutor.java:165)
        at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:245)
        at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:127)
        at com.google.devtools.build.lib.remote.ExperimentalGrpcRemoteExecutor$Execution.start(ExperimentalGrpcRemoteExecutor.java:163)
        at com.google.devtools.build.lib.remote.ExperimentalGrpcRemoteExecutor.executeRemotely(ExperimentalGrpcRemoteExecutor.java:370)
        at com.google.devtools.build.lib.remote.RemoteExecutionService.executeRemotely(RemoteExecutionService.java:1885)
        at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$2(RemoteSpawnRunner.java:318)
        at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:245)
        at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:127)
        at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:116)
        at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:291)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:158)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:118)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
        at com.google.devtools.build.lib.exec.StandaloneTestStrategy.runTestAttempt(StandaloneTestStrategy.java:779)
        at com.google.devtools.build.lib.exec.StandaloneTestStrategy.beginTestAttempt(StandaloneTestStrategy.java:318)
        at com.google.devtools.build.lib.exec.StandaloneTestStrategy$StandaloneTestRunnerSpawn.execute(StandaloneTestStrategy.java:584)
        at com.google.devtools.build.lib.analysis.test.TestRunnerAction.executeAllAttempts(TestRunnerAction.java:1177)
        at com.google.devtools.build.lib.analysis.test.TestRunnerAction.execute(TestRunnerAction.java:989)
        at com.google.devtools.build.lib.analysis.test.TestRunnerAction.execute(TestRunnerAction.java:966)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1159)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1076)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:165)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:94)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:573)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:862)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:334)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:172)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:461)
        ... 7 more

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

7.4.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?


If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

iancha1992 added the team-Remote-Exec label on May 12, 2025
meisterT added the P2 label and removed the untriaged label on May 13, 2025
@jmmv
Contributor Author

jmmv commented May 13, 2025

Any thoughts on what might be the trigger for this crash? We hadn't seen it before in our nightly coverage runs and now we are seeing it pretty consistently every day. We haven't changed anything significant on the Bazel side or in our Buildbarn deployment.

@jmmv
Contributor Author

jmmv commented Jun 2, 2025

I took a closer look at the java.log before the crash and saw:

250602 22:10:45.953:W 76 [io.grpc.netty.NettyClientHandler$2.onGoAwayReceived] Received GOAWAY with ENHANCE_YOUR_CALM. Debug data: too_many_pings

My current theory is:

  1. Bazel sends an Execute RPC.
  2. The Execute RPC times out because the remote coverage test takes a long time to complete.
  3. Bazel enters the waitExecution phase. This is wrapped in a Retrier(!!).
  4. waitExecution expects lastOperation to NOT be null, and it isn't on the first try.
  5. Somewhere in waitExecution, we set lastOperation = null. Maybe because of NOT_FOUND (the remote action expired from Buildbarn's scheduler), or maybe via the GOAWAY codepath.
  6. The retrier decides to re-execute waitExecution instead of propagating the error or the null return value. This is where I'm getting confused because of the layers of indirection and the wrapping of error codes that gRPC does with unchecked exceptions. But the NOT_FOUND ad-hoc handling feels problematic.
  7. waitExecution finds lastOperation IS null (because it set it itself!) and crashes.
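
To make the theory concrete, here is a minimal, self-contained Java sketch of the suspected sequence. It is a simplified model and not Bazel's actual code: the class RetryStateSketch, the NotFoundException type, the retry helper, and the operation name are all hypothetical stand-ins, and it assumes that the NOT_FOUND path both clears lastOperation and surfaces as an error the retrier considers retriable.

```java
import java.util.concurrent.Callable;

/** Hypothetical, simplified model of the theory above. This is NOT Bazel's code. */
class RetryStateSketch {
  /** Stand-in for a gRPC NOT_FOUND status that the retrier treats as retriable. */
  static class NotFoundException extends Exception {}

  /** Stand-in for the per-action execution state. */
  static class Execution {
    // Assumed to have been set by the earlier Execute call (step 4).
    String lastOperation = "operations/hypothetical";
    int waitCalls = 0;

    String waitExecution() throws NotFoundException {
      waitCalls++;
      // Mirrors the precondition that fails in the stack trace.
      if (lastOperation == null) {
        throw new IllegalStateException("lastOperation is null");
      }
      // Assumption: the server no longer knows the operation, so the
      // error handling clears the state (step 5) ...
      lastOperation = null;
      // ... and the error reaches the retrier as a retriable one (step 6).
      throw new NotFoundException();
    }
  }

  /** Minimal retrier: re-runs the same call on NotFoundException. */
  static <T> T retry(int maxAttempts, Callable<T> call) throws Exception {
    NotFoundException last = new NotFoundException();
    for (int i = 0; i < maxAttempts; i++) {
      try {
        return call.call();
      } catch (NotFoundException e) {
        last = e; // Retriable: loop and call waitExecution again.
      }
    }
    throw last;
  }

  /** Runs the scenario and reports how it ended. */
  static String simulate() {
    Execution execution = new Execution();
    try {
      retry(3, execution::waitExecution);
      return "completed";
    } catch (IllegalStateException e) {
      // The second attempt trips the precondition (step 7).
      return "crashed on attempt " + execution.waitCalls;
    } catch (Exception e) {
      return "failed cleanly";
    }
  }

  public static void main(String[] args) {
    System.out.println(simulate());
  }
}
```

In this model the first attempt clears the state and the retried attempt fails the precondition, which matches the shape of the stack trace (checkState inside waitExecution, reached through Retrier.execute). Whether the real code follows this path is exactly the open question.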

@jmmv
Contributor Author

jmmv commented Jun 3, 2025

I removed our custom keepalive flags, which apparently are mismatched with our Buildbarn deployment, and this made the GOAWAY messages disappear. However, Bazel still crashed in the same way.
