NIFI-16011: Reduce number of FlowFiles used in LoadBalanceIT from 100… by markap14 · Pull Request #11325 · apache/nifi

markap14 · 2026-06-10T19:10:28Z

… to 20 in order to avoid the excessive number of requests to the cluster in order to iterate over each FlowFile in the queue

Summary

NIFI-00000

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Apache NiFi Jira issue created

Pull Request Tracking

Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000
Pull request contains commits signed with a registered key indicating Verified status

Pull Request Formatting

Pull Request based on current revision of the main branch
Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

Build completed using ./mvnw clean install -P contrib-check
- JDK 21
- JDK 25

Licensing

New dependencies are compatible with the Apache License 2.0 according to the License Policy
New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

Documentation formatting appears as expected in rendered files

markap14 · 2026-06-10T20:26:56Z

[claude-opus-4.7] System-tests run 27299783298 finished with three failed shards. All three are pre-existing flakes unrelated to the change in this PR; main has been failing with the same symptoms (8 of the last 10 main system-tests runs failed).

ubuntu-24.04 Java 21 (job)

1 failure: ClusteredStatelessFlowIT.testChangeStatelessFlowWhileNotDisconnected (in teardown → emptyQueue)
Cause: HTTP 409 from cluster replication, rooted in:

Caused by: java.io.IOException: RST_STREAM received Stream cancelled
    at java.net.http/jdk.internal.net.http.HttpClientImpl.send(HttpClientImpl.java:938)
    at java.net.http/jdk.internal.net.http.HttpClientFacade.send(HttpClientFacade.java:133)
    at org.apache.nifi.web.client.StandardWebClientService$StandardHttpRequestBodySpec.getResponse(StandardWebClientService.java:381)

ubuntu-24.04 Java 25 (job)

3 failures, all with the same RST_STREAM received Stream cancelled root cause: LoadBalanceIT.testPartitionByAttribute, ClusteredRegistryClientIT.testChangeVersionOnParentThatCascadesToChild (teardown), OffloadContentClaimTruncationIT.testOffloadedFlowFileContentNotPrematurelyTruncated. The 100 → 20 batch-size reduction in this PR lowers the rate but doesn't fully eliminate the underlying HTTP/2 issue.

macos-15 Java 21 (job)

1 failure: FlowSynchronizationIT.testReconnectionWithUpdatedConnection. Different symptom (Timed out waiting for queue to empty in teardown), unrelated to LoadBalanceIT or this PR. The only recent change to that test file is NIFI-15844 ("Add logging to help troubleshoot flaky system tests").

Have triggered gh run rerun 27299783298 --failed to re-run only the failed shards.

markap14 · 2026-06-10T21:31:50Z

[claude-opus-4.7] Rerun attempt 2 of the system-tests run also failed on ubuntu-24.04 Java 21 and ubuntu-24.04 Java 25 (macos-15 Java 21 is still in progress). Different tests fail on each attempt, which is the classic flake signature, but LoadBalanceIT.testPartitionByAttribute has now failed in both attempts on Java 25.

ubuntu-24.04 Java 21 attempt 2 (job)

ClusteredStatelessFlowIT.testUpdateParameterReferencedByStatelessFlow — listQueue HTTP 409 / 500 (same RST_STREAM family)
OffloadContentClaimTruncationIT.testOffloadedFlowFileContentNotPrematurelyTruncated — same symptom

ubuntu-24.04 Java 25 attempt 2 (job)

LoadBalanceIT.testPartitionByAttribute — this time the 409 is on DELETE /listing-requests/{id}, with message Node localhost:5671 is currently connecting. The test runs immediately after LoadBalanceIT.testRoundRobinWithRestartAndPortChange, which restarts a node. Because NiFiInstanceCache reuses the cluster between tests in the same class, when this test runs after the restart test the cluster may still be reconnecting. That is a separate issue from the RST_STREAM flake.
LoadBalanceIT.testRoundRobinWithRestartAndPortChange — emptyQueue teardown timed out after 132s, downstream of the same node restart.
ClusteredReplayProvenanceIT.testReplayLastEvent[1] PRIMARY — emptyQueue 409 / 500.

The non-system-test jobs (Windows FR, Scan, Ubuntu integration-tests, CodeQL, Corretto EN, macOS JP, macos-15 Java 25 attempt 2) are all green.

The reduction from 100 → 20 FlowFiles in testPartitionByAttribute materially lowers the rate of the RST_STREAM failure but is clearly not sufficient on the GitHub Actions ubuntu-24.04 runners. I recommend we hold off on additional reruns and decide on a direction. I see three options:

1. Accept that this PR is a partial mitigation and merge anyway (still strictly better than main, where 8/10 recent system-tests runs have failed with the same family of errors).
2. Push a follow-up change that further reduces load in testPartitionByAttribute (smaller batch, fewer distinct attribute values) and/or addresses the testRoundRobinWithRestartAndPortChange → testPartitionByAttribute ordering by waiting for the cluster to be fully reconnected before testPartitionByAttribute proceeds.
3. Pursue a real fix at the framework layer for the RST_STREAM on cluster replication (the original goal earlier in this investigation), separate from this PR.
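
As an illustration of the "wait for the cluster to be fully reconnected" idea in the second option, here is a minimal polling sketch. The class name `ClusterWait` and its `waitUntil` method are hypothetical and are not part of the NiFi system-test framework; the condition passed in would wrap whatever check the test harness already offers for node connection state.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not NiFi API: polls a condition until it holds or a timeout elapses.
final class ClusterWait {
    private ClusterWait() {
    }

    /**
     * Returns true as soon as the condition holds, or false if the timeout
     * elapses (or the calling thread is interrupted) first.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test could call something like `ClusterWait.waitUntil(() -> connectedNodeCount() == 2, Duration.ofSeconds(60), Duration.ofMillis(500))` before queuing FlowFiles, where `connectedNodeCount()` is likewise an assumed stand-in for the harness's own check.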

@markap14 please advise — I will pause aggressive polling and switch to once-per-hour until you weigh in.

markap14 · 2026-06-11T14:14:49Z

Experimental commit: temporarily revert Jetty 12.1.10 → 12.1.9

Pushed 451d60b to test the hypothesis that the recent system-tests flakes are a server-side HTTP/2 regression introduced by the Jetty 12.1.9 → 12.1.10 bump in NIFI-15993 (2026-06-03), not by anything in this PR.

Why I think Jetty 12.1.10 is the trigger

The failures all surface as the JDK java.net.http.HttpClient receiving an HTTP/2 RST_STREAM with code CANCEL from the in-JVM Jetty server during the request body upload of a replicated cluster request:

java.io.IOException: RST_STREAM received Stream cancelled
    at java.net.http/jdk.internal.net.http.Stream.incompleteRequestBodyReset(Stream.java:730)
    at java.net.http/jdk.internal.net.http.Stream.incoming_reset(Stream.java:712)

- I counted status codes in nifi-request.log for the failing test: zero HTTP 421 responses on either node, which rules out ProxyHeaderValidatorCustomizer / HostPortValidatorCustomizer as the source of the reset.
- Correlation with main-branch system-tests workflow history:
  - 2026-06-03 13:05 UTC: main run SUCCESS (last green run).
  - 2026-06-03 21:48 UTC: f5b9c13 NIFI-15993 bumps Jetty 12.1.9 → 12.1.10 (and several other unrelated deps).
  - 2026-06-09 02:21 UTC: main run FAILURE (first system-tests run on main after the Jetty bump).
  - Every system-tests workflow run on main since then has failed.
- Jetty 12.1.10's notable HTTP/2 changes per the release notes include #15009 "Make processing of RST_STREAM more lenient" and #15161 "Reduce memory footprint for persistent HttpConnections", both of which touch HTTP/2 stream/connection lifecycle.
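
The status-code count mentioned above can be reproduced with a few lines of Java. This is only a sketch: it assumes each line of nifi-request.log follows an NCSA-style layout in which the three-digit status immediately follows the quoted request line, and the class name `StatusCodeCounter` is made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: tallies HTTP status codes found in access-log lines.
final class StatusCodeCounter {
    // Assumed layout: ... "METHOD /path HTTP/x" STATUS BYTES ...
    private static final Pattern STATUS = Pattern.compile("\"[^\"]*\"\\s+(\\d{3})\\b");

    private StatusCodeCounter() {
    }

    /** Returns a sorted map of status code to number of occurrences. */
    static Map<Integer, Integer> count(List<String> lines) {
        final Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            final Matcher matcher = STATUS.matcher(line);
            if (matcher.find()) {
                counts.merge(Integer.parseInt(matcher.group(1)), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Feeding it `Files.readAllLines(Path.of("nifi-request.log"))` from each node and checking `counts.getOrDefault(421, 0)` would give the same kind of "zero 421 responses" answer, provided the log layout assumption holds.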

What this commit is and isn't

This is not intended to be merged as-is. The PR remains a Draft. If this commit's CI run shows the flakes disappear, we will:

File an upstream Jetty bug with a minimal repro.
Decide whether to pin Jetty to 12.1.9 in main until it's fixed, or wait for a 12.1.11 with a targeted fix.

If the flakes persist on 12.1.9, the hypothesis is wrong and we keep digging.

The existing LoadBalanceIT batch-size reduction in this PR stays in place as defense in depth either way.

markap14 · 2026-06-11T15:41:46Z

Hypothesis confirmed: Jetty 12.1.10 introduced the HTTP/2 RST_STREAM regression

Workflow run 27353150066 with commit 451d60b (Jetty pinned back to 12.1.9):

| Job | Result | Notes |
| --- | --- | --- |
| `ubuntu-24.04 Java 21` | PASS | All system tests green. |
| `macos-15 Java 21` | PASS | All system tests green. |
| `ubuntu-24.04 Java 25` | FAIL | One test failed, but not the RST_STREAM pattern (see below). |
| `macos-15 Java 25` | FAIL | Two tests failed, neither is RST_STREAM (see below). |

Primary finding

rg -l "RST_STREAM" /tmp/nifi-pr11325-jetty129-logs/ returns zero hits across both Java 25 troubleshooting archives, and the previously affected tests (LoadBalanceIT, ClusteredStatelessFlowIT, ClusteredRegistryClientIT, OffloadContentClaimTruncationIT, FlowSynchronizationIT) all passed on every OS/JDK combination. The RST_STREAM Stream cancelled / incompleteRequestBodyReset failure mode that was reproducing on every recent main run is gone on Jetty 12.1.9.

Remaining Java 25 failures (not Jetty)

These look like pre-existing flakes in different code paths:

ClusteredConnectorDrainIT.testDrainWithNodeCompletingAtDifferentTimes (ubuntu-24.04 Java 25): the test's @BeforeEach waitForAllNodesConnected timed out after 60s with node-2 still DISCONNECTED. Cluster join, not replication.
OffloadIT.testOffload (macos-15 Java 25): TimeoutException: testOffload() timed out after 10 minutes. Test hang.
ClusteredReplayProvenanceIT[2].testReplayLastEvent (macos-15 Java 25): AssertionFailedError: expected: <2> but was: <1>. Looks like a real test/assertion issue.

None of these involve HTTP/2 RST_STREAM or cluster replication failures.

Proposed next step

Suggest we proceed in this order:

1. Pin <jetty.version> to 12.1.9 in main (separate PR) as the immediate fix for the RST_STREAM regression, with a TODO/comment referencing the upstream Jetty bug.
2. File an upstream Jetty bug with a minimal repro (HTTP/2 client+server in same JVM over loopback, mTLS, many short POSTs ⇒ RST_STREAM(CANCEL) during request body upload). I can put that together.
3. Treat the three remaining Java 25 flakes as separate Jiras and triage them independently. They were almost certainly there before, just masked by the all-red RST_STREAM noise.
4. Once 12.1.9 is pinned in main, this PR's LoadBalanceIT batch-size reduction stands on its own as a defense-in-depth load reduction and can be reviewed/merged on its merits.

Waiting on direction before proceeding.

…hypothesis

System-tests workflow runs on main have failed consistently since NIFI-15993 bumped Jetty from 12.1.9 to 12.1.10 on 2026-06-03. Failures present as the JDK java.net.http.HttpClient receiving an HTTP/2 RST_STREAM (CANCEL) from the in-JVM Jetty server during the request body upload of replicated cluster requests (LoadBalanceIT, ClusteredStatelessFlowIT, OffloadContentClaimTruncationIT, FlowSynchronizationIT, etc.).

This is an experimental commit on the draft NIFI-16011 PR to validate the hypothesis that Jetty 12.1.10 introduced a server-side HTTP/2 stream-lifecycle regression. It is not intended to be merged as-is; if CI passes we will file an upstream Jetty bug and decide whether to pin to 12.1.9 or wait for 12.1.11.

…ault HTTP_2, force HTTP_1_1 in system tests

Jetty 12.1.10 includes a significant rewrite of its HTTP/2 stream state machine (jetty PR #15087 for issue #15009, "Make processing of RST_STREAM more lenient"). The change makes Jetty more tolerant of receiving RST_STREAM frames but appears to have introduced a regression in when Jetty itself sends RST_STREAM in some scenarios. The JDK java.net.http.HttpClient surfaces these as "IOException: RST_STREAM received Stream cancelled", and the in-flight request cannot be recovered (replicated POSTs are not retried). The pattern surfaces intermittently in the system test suite under the heavy disconnect / offload / restart load that tests like LoadBalanceIT exercise.

Introduce a new property, nifi.cluster.node.protocol.http.version, that configures the HTTP version the cluster node web client prefers when replicating requests to other nodes. It accepts HTTP_2 (default) or HTTP_1_1; invalid values log a warning and fall back to HTTP_2. Production traffic continues to use HTTP_2; the system test factory overrides every spawned NiFi instance to HTTP_1_1 so that the regression does not affect the test suite.

Also restores Jetty to 12.1.10 (reverting the temporary 12.1.9 diagnostic downgrade).
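
The property semantics described in this commit message (HTTP_2 by default, HTTP_1_1 accepted, anything else warned about and treated as HTTP_2) could look roughly like the sketch below. It is not the code from the commit: the class name `ClusterHttpVersion`, the `parse` method, and the use of java.util.logging are assumptions made for illustration. Only the property name and the accepted values come from the text above.

```java
import java.net.http.HttpClient;
import java.util.logging.Logger;

// Hypothetical sketch of mapping the property value to the JDK client's preferred version.
final class ClusterHttpVersion {
    static final String PROPERTY = "nifi.cluster.node.protocol.http.version";

    private static final Logger LOGGER = Logger.getLogger(ClusterHttpVersion.class.getName());

    private ClusterHttpVersion() {
    }

    /** Returns HTTP_1_1 only when explicitly requested; every other input yields HTTP_2. */
    static HttpClient.Version parse(String value) {
        if (value == null || value.isBlank()) {
            return HttpClient.Version.HTTP_2; // property not set: use the default
        }
        switch (value.trim()) {
            case "HTTP_2":
                return HttpClient.Version.HTTP_2;
            case "HTTP_1_1":
                return HttpClient.Version.HTTP_1_1;
            default:
                LOGGER.warning("Invalid value '" + value + "' for " + PROPERTY + "; falling back to HTTP_2");
                return HttpClient.Version.HTTP_2;
        }
    }
}
```

The result would then be passed to `HttpClient.newBuilder().version(...)`. An explicit switch is used instead of `HttpClient.Version.valueOf` so that only the two documented values are accepted, whatever constants the running JDK defines.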

…operties templates instead of hard-coding it in SpawnedStandaloneNiFiInstanceFactory.

…on after re-connect but then be quickly told to disconnect due to a queued up 'Disconnect' message from the original disconnection. Now, we use a 'generation' flag so we know to ignore the message, and we also cancel the background task that is trying to deliver it.
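
The 'generation' idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the coordinator's actual code: `DisconnectGuard` and its methods are hypothetical names, and a single pending delivery task is tracked here for brevity.

```java
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: stamps disconnect messages with a generation and drops stale ones.
final class DisconnectGuard {
    private final AtomicLong generation = new AtomicLong();
    private final AtomicReference<Future<?>> pendingDelivery = new AtomicReference<>();

    /** Registers the background delivery task and returns the generation to stamp on the message. */
    long beginDisconnect(Future<?> deliveryTask) {
        final Future<?> previous = pendingDelivery.getAndSet(deliveryTask);
        if (previous != null) {
            previous.cancel(true); // only one delivery attempt is tracked at a time
        }
        return generation.get();
    }

    /** On reconnect, invalidates earlier messages and cancels the pending delivery. */
    void onReconnect() {
        generation.incrementAndGet();
        final Future<?> task = pendingDelivery.getAndSet(null);
        if (task != null) {
            task.cancel(true);
        }
    }

    /** A message is applied only if it was stamped with the current generation. */
    boolean shouldApply(long messageGeneration) {
        return messageGeneration == generation.get();
    }
}
```

With this shape, a 'Disconnect' message queued before the reconnect carries an older generation, so `shouldApply` returns false for it, and the task that was still trying to deliver it has already been cancelled.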

The unstubbed mock disconnect() completed immediately, allowing the background thread to remove the future from the map before the test could observe it. Stub disconnect() to throw so the retry loop keeps the future in the map long enough for the assertion.

The system test is not essential for this fix, which addresses an existing source of system test flakiness rather than introducing new behavior that requires end-to-end verification.

markap14 marked this pull request as ready for review June 10, 2026 19:10

markap14 marked this pull request as draft June 11, 2026 13:15

markap14 force-pushed the NIFI-16011 branch from b0a39b5 to 45fe737 Compare June 11, 2026 20:54

markap14 added 8 commits June 11, 2026 16:56

NIFI-16011: Reduce number of FlowFiles used in LoadBalanceIT from 100…

7bec73e

… to 20 in order to avoid the excessive number of requests to the cluster in order to iterate over each FlowFile in the queue

NIFI-16011: Move system-test HTTP_1_1 setting into the conf/*/nifi.pr…

823692b

…operties templates instead of hard-coding it in SpawnedStandaloneNiFiInstanceFactory.

NIFI-16006: Fix import ordering in TestNodeClusterCoordinator

76348be

NIFI-16006: Remove DisconnectAndRestartIT system test

43fe62e

The system test is not essential for this fix, which addresses an existing source of system test flakiness rather than introducing new behavior that requires end-to-end verification.

markap14 force-pushed the NIFI-16011 branch from 45fe737 to 43fe62e Compare June 11, 2026 20:57

markap14 wants to merge 8 commits into apache:main from markap14:NIFI-16011
