feat: introduce intermediate valid task counts for big partition counts by Fly-Style · Pull Request #19549 · apache/druid

Fly-Style · 2026-06-03T15:26:17Z

Description

The cost-based autoscaler derives candidate task counts from possible partitions-per-task ratios. For large partition counts, these candidates can be very far apart near the top of the assignment range: for 400 partitions, for example, the two largest candidates are 200 and 400. Because the cost model only evaluates the generated candidates, it has no intermediate option to settle on, which forces coarse, all-or-nothing scaling decisions.
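To illustrate how sparse the candidates get near the top of the range, here is a minimal Python sketch. It assumes each candidate is `ceil(partitions / partitionsPerTask)` for every possible partitions-per-task value; the function name `base_task_counts` is hypothetical and this is not the Druid implementation, which is written in Java.

```python
def base_task_counts(partition_count: int) -> list[int]:
    """Distinct task counts, assuming one candidate per partitions-per-task ratio."""
    counts = set()
    for partitions_per_task in range(1, partition_count + 1):
        # Integer ceiling division: ceil(partition_count / partitions_per_task).
        counts.add(-(-partition_count // partitions_per_task))
    return sorted(counts)


candidates = base_task_counts(400)
print(candidates[-4:])  # the four largest candidates
```

Under this assumption the four largest candidates for 400 partitions are 100, 134, 200, and 400, so the last step doubles the task count with nothing in between.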

This PR adds deterministic intermediate candidates so the cost model has finer-grained options, without changing the cost model itself.

Introduced intermediate valid task counts for large gaps

CostBasedAutoScaler.computeValidTaskCounts now post-processes the generated candidate list: after the base partitions-per-task candidates are produced and sorted, every adjacent pair whose gap exceeds MAX_CANDIDATE_GAP (100) is split with intermediate candidates at INTERPOLATION_FRACTIONS = {0.33, 0.66}, rounded to the nearest integer.
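The post-processing step described above could look roughly like the following Python sketch. It assumes the fractions are applied to the gap between the two neighbouring candidates and that ties round half up; the function name `add_intermediate_counts` is hypothetical, and only the two constants are taken from the description.

```python
import math

MAX_CANDIDATE_GAP = 100                 # value stated in the PR description
INTERPOLATION_FRACTIONS = (0.33, 0.66)  # value stated in the PR description


def add_intermediate_counts(sorted_counts: list[int]) -> list[int]:
    """Insert intermediate candidates between neighbours that are too far apart."""
    if not sorted_counts:
        return []
    result = [sorted_counts[0]]
    for lower, upper in zip(sorted_counts, sorted_counts[1:]):
        gap = upper - lower
        if gap > MAX_CANDIDATE_GAP:
            for fraction in INTERPOLATION_FRACTIONS:
                # Round to the nearest integer (half up is an assumption).
                result.append(lower + math.floor(gap * fraction + 0.5))
        result.append(upper)
    return result


print(add_intermediate_counts([100, 134, 200, 400]))
```

With these assumptions, only the 200 to 400 pair exceeds the gap threshold, so 266 and 332 are added and the smaller candidates are left untouched.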

Tuned the lag amplification multiplier:
WeightedCostFunction.LAG_AMPLIFICATION_MULTIPLIER is lowered from 0.4 to 0.35 based on further testing, for a slightly more balanced high-lag recovery response.

Release note

The cost-based supervisor autoscaler now considers intermediate task counts when the candidate task counts derived from partition assignment are far apart, enabling smoother scaling for streams with large partition counts.

This PR has:

been self-reviewed.
added unit tests covering the new candidate-generation behaviour (no-gap, single-gap, multi-gap, and single-candidate cases) in CostBasedAutoScalerTest.
been tested in a test Druid cluster.

FrankChen021

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 3 of 3 changed files.

This is an automated review by Codex GPT-5.5

kfaraz · 2026-06-05T05:46:18Z

@Fly-Style, IIUC, the partition assignment is all-or-nothing by design.

Could you please elaborate on the benefits of the intermediate scaling?
e.g. in the case of 400 partitions, what would be the benefit of scaling from 200 tasks to say 300 tasks (instead of directly to 400)?

With 300 tasks, 100 tasks would be working on 2 partitions each and 200 tasks would be working on 1 partition each (and we cannot control which partitions get assigned to which tasks). So, the tasks working on 2 partitions have effectively not changed and would continue to be the ingestion bottleneck.

Please note that in a non-uniform partition assignment, the idleness cost computation would be a little more involved, since the tasks working on fewer partitions would be more idle than others on average.
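The arithmetic in this comment can be checked with a small Python sketch. It assumes partitions are spread as evenly as possible across tasks; the function name `partitions_per_task_histogram` is hypothetical and does not model which partitions land on which task.

```python
def partitions_per_task_histogram(partition_count: int, task_count: int) -> dict[int, int]:
    """Map 'partitions handled by a task' to 'number of tasks with that load'."""
    base, extra = divmod(partition_count, task_count)
    histogram = {}
    if extra:
        # 'extra' tasks each carry one partition more than the rest.
        histogram[base + 1] = extra
    if task_count - extra:
        histogram[base] = task_count - extra
    return histogram


for tasks in (200, 300, 400):
    loads = partitions_per_task_histogram(400, tasks)
    print(tasks, loads, "max load:", max(loads))
```

For 400 partitions, 300 tasks yields 100 tasks with 2 partitions and 200 tasks with 1, so the maximum per-task load is 2, the same as with 200 tasks. It drops to 1 only at 400 tasks.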

Fly-Style · 2026-06-05T06:48:15Z

@kfaraz While using this autoscaler, we saw cases where, say, 200 tasks for 400 partitions were under-provisioned and 400 tasks were over-provisioned, and adjusting the weights didn't improve the situation much.

This PR gives the autoscaler better options in such a situation, and it accepts the trade-off that some tasks will handle 2 partitions instead of 1.

jtuglu1 · 2026-06-05T23:51:01Z

At some point I wonder how complicated we really want to make this thing. IMO, auto-scalers should be sufficiently "dumb" in that their behavior is predictable enough to debug on-call. Each "edge case" behavior adds more operator/code complexity which IMO isn't the direction we should be heading.

If possible, I'd rather attack the core problem (that we cannot scale frequently without disruption) by potentially allowing "sticky" partitions to groupIds or somehow allowing tasks to dynamically get re-assigned to partitions (without shutting down). This would allow a "dumber" auto-scaler to run more frequently w/o the worry of disrupting high-throughput supervisors, as tasks could dynamically adjust to the suggested partition assignments.

Fly-Style · 2026-06-10T08:51:41Z

@jtuglu1 partially done in #19562 :)

We also have dynamic partition reassignment in mind; it is the next step.

Introduce intermediate valid task counts for big partition counts

debcad0

Fly-Style requested a review from kfaraz June 3, 2026 15:26

Fly-Style self-assigned this Jun 3, 2026

github-actions Bot added the Area - Ingestion label Jun 3, 2026

FrankChen021 reviewed Jun 4, 2026

Fly-Style wants to merge 1 commit into apache:master from Fly-Style:cba-even-valid-tasks
