NIFI-16011: Reduce number of FlowFiles used in LoadBalanceIT from 100… #11325

Draft
markap14 wants to merge 8 commits into apache:main from markap14:NIFI-16011
@markap14

Contributor

… to 20 in order to avoid the excessive number of requests to the cluster in order to iterate over each FlowFile in the queue

Summary

NIFI-00000

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull request contains commits signed with a registered key indicating Verified status

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using ./mvnw clean install -P contrib-check
    • JDK 21
    • JDK 25

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@markap14 markap14 marked this pull request as ready for review June 10, 2026 19:10
@markap14

Contributor Author

[claude-opus-4.7] System-tests run 27299783298 finished with three failed shards. All three are pre-existing flakes unrelated to the change in this PR; main has been failing with the same symptoms (8 of the last 10 main system-tests runs failed).

ubuntu-24.04 Java 21 (job)

  • 1 failure: ClusteredStatelessFlowIT.testChangeStatelessFlowWhileNotDisconnected (in teardown → emptyQueue)
  • Cause: HTTP 409 from cluster replication, rooted in:
Caused by: java.io.IOException: RST_STREAM received Stream cancelled
    at java.net.http/jdk.internal.net.http.HttpClientImpl.send(HttpClientImpl.java:938)
    at java.net.http/jdk.internal.net.http.HttpClientFacade.send(HttpClientFacade.java:133)
    at org.apache.nifi.web.client.StandardWebClientService$StandardHttpRequestBodySpec.getResponse(StandardWebClientService.java:381)

ubuntu-24.04 Java 25 (job)

  • 3 failures, all with the same RST_STREAM received Stream cancelled root cause: LoadBalanceIT.testPartitionByAttribute, ClusteredRegistryClientIT.testChangeVersionOnParentThatCascadesToChild (teardown), OffloadContentClaimTruncationIT.testOffloadedFlowFileContentNotPrematurelyTruncated. The 100 → 20 batch-size reduction in this PR lowers the rate but doesn't fully eliminate the underlying HTTP/2 issue.

macos-15 Java 21 (job)

  • 1 failure: FlowSynchronizationIT.testReconnectionWithUpdatedConnection. Different symptom (Timed out waiting for queue to empty in teardown), unrelated to LoadBalanceIT or this PR. The only recent change to that test file is NIFI-15844 ("Add logging to help troubleshoot flaky system tests").

Have triggered gh run rerun 27299783298 --failed to re-run only the failed shards.
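
The triage above groups failures by whether an RST_STREAM IOException sits anywhere in the cause chain. A minimal sketch of that check, assuming the failure is available as a Throwable; the class and method names are hypothetical and are not part of the NiFi codebase:

```java
import java.io.IOException;

// Hypothetical helper, for illustration only; not part of NiFi.
final class RootCauseClassifier {

    // Walks the cause chain and reports whether any IOException mentions RST_STREAM.
    // The depth limit guards against accidental cycles in the cause chain.
    static boolean isRstStreamFailure(Throwable failure) {
        int depth = 0;
        for (Throwable t = failure; t != null && depth < 50; t = t.getCause(), depth++) {
            String message = t.getMessage();
            if (t instanceof IOException && message != null && message.contains("RST_STREAM")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable replicationFailure = new RuntimeException("HTTP 409 from cluster replication",
                new IOException("RST_STREAM received Stream cancelled"));
        Throwable teardownTimeout = new IllegalStateException("Timed out waiting for queue to empty");

        System.out.println(isRstStreamFailure(replicationFailure));
        System.out.println(isRstStreamFailure(teardownTimeout));
    }
}
```

Matching on the message text is brittle, so this is only suitable for sorting test failures into buckets, not for production error handling.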

@markap14

Contributor Author

[claude-opus-4.7] Rerun attempt 2 of the system-tests run also failed on ubuntu-24.04 Java 21 and ubuntu-24.04 Java 25 (macos-15 Java 21 is still in progress). Different tests fail on each attempt, which is the classic flake signature, but LoadBalanceIT.testPartitionByAttribute has now failed in both attempts on Java 25.

ubuntu-24.04 Java 21 attempt 2 (job)

  • ClusteredStatelessFlowIT.testUpdateParameterReferencedByStatelessFlow: listQueue HTTP 409 / 500 (same RST_STREAM family)
  • OffloadContentClaimTruncationIT.testOffloadedFlowFileContentNotPrematurelyTruncated — same symptom

ubuntu-24.04 Java 25 attempt 2 (job)

  • LoadBalanceIT.testPartitionByAttribute — this time the 409 is on DELETE /listing-requests/{id}, with message Node localhost:5671 is currently connecting. The test runs immediately after LoadBalanceIT.testRoundRobinWithRestartAndPortChange, which restarts a node. Because NiFiInstanceCache reuses the cluster between tests in the same class, when this test runs after the restart test the cluster may still be reconnecting. That is a separate issue from the RST_STREAM flake.
  • LoadBalanceIT.testRoundRobinWithRestartAndPortChange: emptyQueue teardown timed out after 132s, downstream of the same node restart.
  • ClusteredReplayProvenanceIT.testReplayLastEvent[1] PRIMARY: emptyQueue 409 / 500.

The non-system-test jobs (Windows FR, Scan, Ubuntu integration-tests, CodeQL, Corretto EN, macOS JP, macos-15 Java 25 attempt 2) are all green.

The reduction from 100 → 20 FlowFiles in testPartitionByAttribute materially lowers the rate of the RST_STREAM failure but is clearly not sufficient on the GitHub Actions ubuntu-24.04 runners. Recommending we hold off on additional reruns and decide on a direction. Three options I see:

  1. Accept that this PR is a partial mitigation and merge anyway (still strictly better than main, where 8/10 recent system-tests runs have failed with the same family of errors).
  2. Push a follow-up change that further reduces load in testPartitionByAttribute (smaller batch, fewer distinct attribute values) and/or addresses the testRoundRobinWithRestartAndPortChange → testPartitionByAttribute ordering by waiting for the cluster to be fully reconnected before testPartitionByAttribute proceeds.
  3. Pursue a real fix at the framework layer for the RST_STREAM on cluster replication (the original goal earlier in this investigation), separate from this PR.

@markap14 please advise — I will pause aggressive polling and switch to once-per-hour until you weigh in.
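
For option 2, waiting for the cluster to be fully reconnected could be done with a bounded polling loop. A minimal sketch, assuming the test can supply a condition that reports whether every node is connected; ClusterWait and waitUntil are hypothetical names, not existing NiFi test utilities:

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical polling helper, for illustration only.
final class ClusterWait {

    // Polls the condition until it returns true or the timeout elapses.
    // Returns true if the condition was met, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test would call something like `ClusterWait.waitUntil(allNodesConnected, Duration.ofSeconds(60), Duration.ofMillis(500))` before proceeding, where `allNodesConnected` is an assumed supplier backed by the cluster summary.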

@markap14 markap14 marked this pull request as draft June 11, 2026 13:15
@markap14

Contributor Author

Experimental commit: temporarily revert Jetty 12.1.10 → 12.1.9

Pushed 451d60b to test the hypothesis that the recent system-tests flakes are a server-side HTTP/2 regression introduced by the Jetty 12.1.9 → 12.1.10 bump in NIFI-15993 (2026-06-03), not by anything in this PR.

Why I think Jetty 12.1.10 is the trigger

  • The failures all surface as the JDK java.net.http.HttpClient receiving an HTTP/2 RST_STREAM with code CANCEL from the in-JVM Jetty server during the request body upload of a replicated cluster request:
    java.io.IOException: RST_STREAM received Stream cancelled
        at java.net.http/jdk.internal.net.http.Stream.incompleteRequestBodyReset(Stream.java:730)
        at java.net.http/jdk.internal.net.http.Stream.incoming_reset(Stream.java:712)
    
  • I counted status codes in nifi-request.log for the failing test: zero HTTP 421 responses on either node, which rules out ProxyHeaderValidatorCustomizer / HostPortValidatorCustomizer as the source of the reset.
  • Correlation with main-branch system-tests workflow history:
    • 2026-06-03 13:05 UTC: main run SUCCESS (last green run).
    • 2026-06-03 21:48 UTC: f5b9c13 NIFI-15993 bumps Jetty 12.1.9 → 12.1.10 (and several other unrelated deps).
    • 2026-06-09 02:21 UTC: main run FAILURE — first system-tests run on main after the Jetty bump.
    • Every system-tests workflow run on main since then has failed.
  • Jetty 12.1.10's notable HTTP/2 changes per the release notes include #15009 "Make processing of RST_STREAM more lenient" and #15161 "Reduce memory footprint for persistent HttpConnections", both of which touch HTTP/2 stream/connection lifecycle.
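
The status-code count mentioned above can be reproduced with a small tally over the request log. A minimal sketch, assuming each line follows a combined-log-style layout in which the three-digit status code comes right after the quoted request line (an assumption about the nifi-request.log format, not something confirmed in this thread); the class name is hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log tally, for illustration only.
final class RequestLogStatusCounter {

    // Assumed layout: ... "METHOD /path HTTP/x" STATUS BYTES ...
    private static final Pattern STATUS_AFTER_REQUEST_LINE =
            Pattern.compile("\"[^\"]*\"\\s+(\\d{3})\\s");

    // Counts occurrences of each HTTP status code; lines that do not match are skipped.
    static Map<Integer, Long> countStatuses(List<String> lines) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher matcher = STATUS_AFTER_REQUEST_LINE.matcher(line);
            if (matcher.find()) {
                counts.merge(Integer.parseInt(matcher.group(1)), 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

With the lines read via `Files.readAllLines`, `countStatuses(lines).getOrDefault(421, 0L)` returning 0 corresponds to the "zero HTTP 421 responses" observation.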

What this commit is and isn't

This is not intended to be merged as-is. The PR remains a Draft. If this commit's CI run shows the flakes disappear, we will:

  1. File an upstream Jetty bug with a minimal repro.
  2. Decide whether to pin Jetty to 12.1.9 in main until it's fixed, or wait for a 12.1.11 with a targeted fix.

If the flakes persist on 12.1.9, the hypothesis is wrong and we keep digging.

The existing LoadBalanceIT batch-size reduction in this PR stays in place as defense in depth either way.

@markap14

Contributor Author

Hypothesis confirmed: Jetty 12.1.10 introduced the HTTP/2 RST_STREAM regression

Workflow run 27353150066 with commit 451d60b (Jetty pinned back to 12.1.9):

| Job | Result | Notes |
|---|---|---|
| ubuntu-24.04 Java 21 | PASS | All system tests green. |
| macos-15 Java 21 | PASS | All system tests green. |
| ubuntu-24.04 Java 25 | FAIL | One test failed, but not the RST_STREAM pattern; see below. |
| macos-15 Java 25 | FAIL | Two tests failed, neither is RST_STREAM; see below. |

Primary finding

rg -l "RST_STREAM" /tmp/nifi-pr11325-jetty129-logs/ returns zero hits across both Java 25 troubleshooting archives, and the previously affected tests (LoadBalanceIT, ClusteredStatelessFlowIT, ClusteredRegistryClientIT, OffloadContentClaimTruncationIT, FlowSynchronizationIT) all passed on every OS/JDK combination. The RST_STREAM Stream cancelled / incompleteRequestBodyReset failure mode that was reproducing on every recent main run is gone on Jetty 12.1.9.

Remaining Java 25 failures (not Jetty)

These look like pre-existing flakes in different code paths:

  • ClusteredConnectorDrainIT.testDrainWithNodeCompletingAtDifferentTimes (ubuntu-24.04 Java 25): the test's @BeforeEach waitForAllNodesConnected timed out after 60s with node-2 still DISCONNECTED. Cluster join, not replication.
  • OffloadIT.testOffload (macos-15 Java 25): TimeoutException: testOffload() timed out after 10 minutes. Test hang.
  • ClusteredReplayProvenanceIT[2].testReplayLastEvent (macos-15 Java 25): AssertionFailedError: expected: <2> but was: <1>. Looks like a real test/assertion issue.

None of these involve HTTP/2 RST_STREAM or cluster replication failures.

Proposed next step

Suggest we proceed in this order:

  1. Pin <jetty.version> to 12.1.9 in main (separate PR) as the immediate fix for the RST_STREAM regression, with a TODO/comment referencing the upstream Jetty bug.
  2. File an upstream Jetty bug with a minimal repro (HTTP/2 client+server in same JVM over loopback, mTLS, many short POSTs ⇒ RST_STREAM(CANCEL) during request body upload). I can put that together.
  3. Treat the three remaining Java 25 flakes as separate Jiras and triage them independently — they were almost certainly there before, just masked by the all-red RST_STREAM noise.
  4. Once 12.1.9 is pinned in main, this PR's LoadBalanceIT batch-size reduction stands on its own as a defense-in-depth load reduction and can be reviewed/merged on its merits.

Waiting on direction before proceeding.

markap14 added 8 commits June 11, 2026 16:56
… to 20 in order to avoid the excessive number of requests to the cluster in order to iterate over each FlowFile in the queue
…hypothesis

System-tests workflow runs on main have failed consistently since NIFI-15993
bumped Jetty from 12.1.9 to 12.1.10 on 2026-06-03. Failures present as the
JDK java.net.http.HttpClient receiving an HTTP/2 RST_STREAM (CANCEL) from the
in-JVM Jetty server during the request body upload of replicated cluster
requests (LoadBalanceIT, ClusteredStatelessFlowIT, OffloadContentClaimTruncationIT,
FlowSynchronizationIT, etc.).

This is an experimental commit on the draft NIFI-16011 PR to validate the
hypothesis that Jetty 12.1.10 introduced a server-side HTTP/2 stream-lifecycle
regression. It is not intended to be merged as-is; if CI passes we will file
an upstream Jetty bug and decide whether to pin to 12.1.9 or wait for 12.1.11.
…ault HTTP_2, force HTTP_1_1 in system tests

Jetty 12.1.10 includes a significant rewrite of its HTTP/2 stream state
machine (jetty PR #15087 for issue #15009, "Make processing of RST_STREAM
more lenient"). The change makes Jetty more tolerant of receiving
RST_STREAM frames but appears to have regressed when Jetty sends
RST_STREAM in some scenarios. The JDK java.net.http.HttpClient surfaces
these as "IOException: RST_STREAM received Stream cancelled" and the
in-flight request cannot be recovered (no retry for replicated POSTs).
The pattern surfaces intermittently in the system test suite under the
heavy disconnect / offload / restart load that tests like LoadBalanceIT
exercise.

Introduce a new property nifi.cluster.node.protocol.http.version that
configures the HTTP version that the cluster node web client prefers
when replicating requests to other nodes. Accepts HTTP_2 (default) or
HTTP_1_1; invalid values log a warning and fall back to HTTP_2.
Production traffic continues to use HTTP_2; the system test factory
overrides every spawned NiFi instance to HTTP_1_1 so the regression
stays invisible to the test suite.

Also restores Jetty to 12.1.10 (reverting the temporary 12.1.9
diagnostic downgrade).
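
The property handling described in this commit message (HTTP_2 by default, HTTP_1_1 accepted, invalid values warn and fall back) could be sketched as follows. This is an illustration only, not the code in the commit; the class name and the use of java.util.logging are assumptions:

```java
import java.net.http.HttpClient;
import java.util.logging.Logger;

// Hypothetical parser, for illustration only; the actual commit may differ.
final class ClusterHttpVersion {

    private static final Logger LOGGER = Logger.getLogger(ClusterHttpVersion.class.getName());

    // Property name taken from the commit message.
    static final String PROPERTY = "nifi.cluster.node.protocol.http.version";

    // Maps the configured value to a JDK HttpClient version.
    // Missing or blank values use the default; unknown values log a warning and use the default.
    static HttpClient.Version parse(String configured) {
        if (configured == null || configured.isBlank()) {
            return HttpClient.Version.HTTP_2;
        }
        switch (configured.trim()) {
            case "HTTP_2":
                return HttpClient.Version.HTTP_2;
            case "HTTP_1_1":
                return HttpClient.Version.HTTP_1_1;
            default:
                LOGGER.warning("Invalid value [" + configured + "] for " + PROPERTY + "; falling back to HTTP_2");
                return HttpClient.Version.HTTP_2;
        }
    }
}
```

The result would then be passed to `HttpClient.newBuilder().version(...)`. An explicit switch is used rather than `HttpClient.Version.valueOf` so that only the two documented values are accepted.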
…operties templates instead of hard-coding it in SpawnedStandaloneNiFiInstanceFactory.
…on after re-connect but then be quickly told to disconnect due to a queued up 'Disconnect' message from the original disconnection. Now, we use a 'generation' flag so we know to ignore the message, and we also cancel the background task that is trying to deliver it.
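
The 'generation' idea in this commit message can be illustrated with a counter that is bumped on every reconnect, so that a queued 'Disconnect' message stamped with an older generation is ignored and its delivery task is cancelled. A minimal sketch with hypothetical names; the actual change in the commit may be structured differently:

```java
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical guard, for illustration only; not the code in the commit.
final class DisconnectGenerationGuard {

    private final AtomicLong generation = new AtomicLong();
    private final AtomicReference<Future<?>> pendingDelivery = new AtomicReference<>();

    // Records the background task delivering a disconnect and returns the
    // generation that the disconnect message should be stamped with.
    long beginDisconnect(Future<?> deliveryTask) {
        pendingDelivery.set(deliveryTask);
        return generation.get();
    }

    // On reconnect, advance the generation so older messages become stale,
    // and cancel any background task still trying to deliver one.
    void onReconnect() {
        generation.incrementAndGet();
        Future<?> task = pendingDelivery.getAndSet(null);
        if (task != null) {
            task.cancel(true);
        }
    }

    // A disconnect message is applied only if it was stamped in the current generation.
    boolean shouldApply(long messageGeneration) {
        return messageGeneration == generation.get();
    }
}
```

Stamping the message rather than clearing a queue means a message that is already in flight is still rejected when it arrives.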
The unstubbed mock disconnect() completed immediately, allowing the
background thread to remove the future from the map before the test
could observe it. Stub disconnect() to throw so the retry loop keeps
the future in the map long enough for the assertion.
The system test is not essential for this fix, which addresses an
existing source of system test flakiness rather than introducing new
behavior that requires end-to-end verification.