Skip to content

Fix JVM deoptimization loop due to lambda misprediction#20832

Open
itschrispeck wants to merge 1 commit intoopensearch-project:mainfrom
itschrispeck:improve_jit_branching
Open

Fix JVM deoptimization loop due to lambda misprediction#20832
itschrispeck wants to merge 1 commit intoopensearch-project:mainfrom
itschrispeck:improve_jit_branching

Conversation

@itschrispeck
Copy link

Description

This change resolves a JVM deoptimization loop by refactoring how primaryResponse is written to the stream.

To do this, we explicitly check if primaryResponse == null, avoiding the check within the lambda. This prevents the C2 compiler from generating an invokedynamic instruction, which is prone to permanent deoptimization when branch traffic flip-flops. The explicit if/else allows the JVM to compile both paths, relying on CPU branch prediction instead of a JVM trap.

It also fixes a potential race condition where the volatile primaryResponse is read once to check, and read again for use, though it may be modified in between.

This pulls the writeOptionalWriteable logic outside, but this follows the convention of other places in the repo.

Background

We recently found elevated CPU despite no change in load.
image

Profiling revealed it was due to a deopt/uncommon trap hit at BulkItemRequest#writeThin
image

This is an example deopt event from the profile:

jdk.Deoptimization {
  startTime = 19:11:35.451 (2026-03-10)
  compileId = 26645
  compiler = "c2"
  method = org.opensearch.action.bulk.BulkItemRequest.writeThin(StreamOutput)
  lineNumber = 142
  bci = 21
  instruction = "ifnonnull"
  reason = "unstable_if"
  action = "none"
  eventThread = "opensearch[esdock-opensearch-metric-mvp-production-cluster-phx-ginuz-sukam][write][T#15]" (javaThreadId = 267)
  stackTrace = [
    org.opensearch.action.bulk.BulkItemRequest.writeThin(StreamOutput) line: 142
    org.opensearch.action.bulk.BulkShardRequest.lambda$writeTo$3(StreamOutput, BulkItemRequest) line: 102
    org.opensearch.core.common.io.stream.StreamOutput.writeArray(Writeable$Writer, Object[]) line: 939
    org.opensearch.action.bulk.BulkShardRequest.writeTo(StreamOutput) line: 99
    org.opensearch.action.support.replication.TransportReplicationAction$ConcreteShardRequest.writeTo(StreamOutput) line: 1589
  ]
}

action = "none" implies that we've hit the recompilation cutoff for the method. Since lambdas use invokedynamic, my understanding is that it's possible to flip-flop between predictions, eventually hitting the cutoff. In this case, the cutoff was hit while the wrong branch was compiled, and each invocation hits the trap.

Related Issues

n/a

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Christopher Peck <peck@uber.com>
@itschrispeck itschrispeck requested a review from a team as a code owner March 10, 2026 23:17
@github-actions
Copy link
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Deserialization Compatibility

The new implementation writes a boolean followed by the data (writeBoolean(true) + writeThin), while writeOptionalWriteable also writes a boolean marker before the writeable. Verify that the reading side (deserialization) is compatible with this new format, and that there are no rolling-upgrade or mixed-version cluster scenarios where one node writes the old format and another reads the new format.

if (primaryResponse != null) {
    out.writeBoolean(true);
    primaryResponse.writeThin(out);
} else {
    out.writeBoolean(false);
}

@github-actions
Copy link
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Verify serialization format compatibility with deserialization

The original code used out.writeOptionalWriteable() which handles the null-check and
boolean flag internally in a consistent way. The new implementation manually writes
a boolean flag followed by the object, but this must exactly match how
writeOptionalWriteable encodes data, otherwise deserialization will break. Verify
that the reading side uses a matching readOptionalWriteable or equivalent that
expects this exact boolean+data format.

server/src/main/java/org/opensearch/action/bulk/BulkItemRequest.java [146-151]

 if (primaryResponse != null) {
     out.writeBoolean(true);
     primaryResponse.writeThin(out);
 } else {
     out.writeBoolean(false);
 }
+// Ensure the reading side uses readBoolean() + conditional readObject(), NOT readOptionalWriteable()
Suggestion importance[1-10]: 5

__

Why: This is a valid concern about serialization compatibility - the new manual boolean+data format must match the reading side's deserialization logic. However, the improved_code is essentially the same as existing_code with just a comment added, making it more of a verification request than an actual code fix.

Low

@itschrispeck itschrispeck changed the title Fix inefficiency due to branch misprediction inside lambda Fix JVM deoptimization loop due to branch misprediction inside lambda Mar 10, 2026
@github-actions
Copy link
Contributor

❌ Gradle check result for 9887b0e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@itschrispeck itschrispeck changed the title Fix JVM deoptimization loop due to branch misprediction inside lambda Fix JVM deoptimization loop due to lambda misprediction Mar 10, 2026
@github-actions
Copy link
Contributor

✅ Gradle check result for 9887b0e: SUCCESS

@codecov
Copy link

codecov bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.26%. Comparing base (8f8f7b5) to head (9887b0e).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20832      +/-   ##
============================================
- Coverage     73.31%   73.26%   -0.06%     
+ Complexity    72248    72230      -18     
============================================
  Files          5795     5795              
  Lines        330044   330060      +16     
  Branches      47641    47643       +2     
============================================
- Hits         241975   241821     -154     
- Misses        68609    68880     +271     
+ Partials      19460    19359     -101     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


BulkItemResponse primaryResponse = this.primaryResponse;

// Explicitly check if primaryResponse is null to avoid branching/mispredictions in a lambda
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@itschrispeck @varunbharadwaj This comment will help, but it seems possible a well-meaning future developer could "simplify" this code by using writeOptionalWriteable() again. Could we refactor StreamOutput::writeOptionalWriteable to avoid the lamba instead of changing the code here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants