feat: introduce intermediate valid task counts for big partition counts#19549
feat: introduce intermediate valid task counts for big partition counts#19549Fly-Style wants to merge 1 commit into
Conversation
FrankChen021
left a comment
There was a problem hiding this comment.
I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.
Reviewed 3 of 3 changed files.
This is an automated review by Codex GPT-5.5
|
@Fly-Style , IIUC, the partition assignment is all-or-nothing by design. Could you please elaborate on the benefits of the intermediate scaling? With 300 tasks, 100 tasks would be working on 2 partitions each and 200 tasks would be working on 1 partition each (and we cannot control which partitions get assigned to which tasks). So, the tasks working on 2 partitions have effectively not changed and would continue to be the ingestion bottleneck. Please note that in a non-uniform partition assignment, the idleness cost computation would be a little more involved, since the tasks working on fewer partitions would be more idle than others on average. |
|
@kfaraz During our usage of this autoscaler, we saw that, let's say, 200 tasks to handle 400 partitions were under-provisioned, and 400 tasks were over-provisioned, and adjusting the weights didn't improve the situation much. This PR was introduced to provide better variants to the autoscaler in such a situation, and it acknowledges of margin of some tasks would handle 2 partitions instead of 1. |
|
At some point I wonder how complicated we really want to make this thing. IMO, auto-scalers should be sufficiently "dumb" in that their behavior is predictable enough to debug on-call. Each "edge case" behavior adds more operator/code complexity which IMO isn't the direction we should be heading. If possible, I'd rather attack the core problem (that we cannot scale frequently without disruption) by potentially allowing "sticky" partitions to groupIds or somehow allowing tasks to dynamically get re-assigned to partitions (without shutting down). This would allow a "dumber" auto-scaler to run more frequently w/o the worry of disrupting high-throughput supervisors, as tasks could dynamically adjust to the suggested partition assignments. |
Description
The cost-based autoscaler derives candidate task counts from possible partitions-per-task ratios. For large partition counts these candidates can be very far apart near the top of the assignment range - e.g. for 400 partitions, the candidates scales
200 -> 400. Because the cost model only evaluates the generated candidates, it has no intermediate option to settle on, forcing coarse, all-or-nothing scaling decisions.This PR adds deterministic intermediate candidates so the cost model has finer-grained options, without changing the cost model itself.
Introduced intermediate valid task counts for large gaps
CostBasedAutoScaler.computeValidTaskCountsnow post-processes the generated candidate list: after the base partitions-per-task candidates are produced and sorted, every adjacent pair whose gap exceedsMAX_CANDIDATE_GAP(100) is split with intermediate candidates atINTERPOLATION_FRACTIONS = {0.33, 0.66}, rounded to the nearest integer.Tuned the lag amplification multiplier:
WeightedCostFunction.LAG_AMPLIFICATION_MULTIPLIERis lowered from 0.4 to 0.35 based on further testing, for a slightly more balanced high-lag recovery response.Release note
The cost-based supervisor autoscaler now considers intermediate task counts when the candidate task counts derived from partition assignment are far apart, enabling smoother scaling for streams with large partition counts.
This PR has:
multi-gap, and single-candidate cases) in CostBasedAutoScalerTest.