[Data] Fix ActorPool scaling to avoid scaling down when the input queue is empty #53009


Merged

merged 30 commits into ray-project:master on May 19, 2025

Conversation

alexeykudinkin (Contributor)

Why are these changes needed?

This is a follow-up to #52806, which inadvertently modified the autoscaling condition, allowing downscaling to happen before all inputs have completed.

Changes

  1. Fixed ActorPool to avoid downscaling when the number of enqueued blocks is 0 but not all inputs have been received yet (downscaling is only allowed once all inputs are done).
  2. Unified and streamlined the ActorPool scaling handling into a single method.
  3. Updated tests.
  4. Tidied up.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Rebased `get_actor_info` to use it;
Tidying up

Signed-off-by: Alexey Kudinkin <[email protected]>
Removed unnecessary methods;

Signed-off-by: Alexey Kudinkin <[email protected]>
Tidying up

Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
@alexeykudinkin requested a review from a team as a code owner May 14, 2025 21:07
@alexeykudinkin added the `go` (add ONLY when ready to merge, run all tests) label May 14, 2025
@alexeykudinkin requested a review from bveeramani May 14, 2025 21:10
Signed-off-by: Alexey Kudinkin <[email protected]>
Tidying up;

Signed-off-by: Alexey Kudinkin <[email protected]>
-        if op.completed() or (op_state.total_enqueued_input_bundles() == 0):
-            return False
+        if op.completed() or (
+            op._inputs_complete and op_state.total_enqueued_input_bundles() == 0
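The intent of the change, paraphrased as a standalone predicate (the helper name below is hypothetical; the accessors are the ones shown in the diff): an empty input queue only counts as drained once the operator's inputs are complete, so the pool is no longer shrunk while more bundles may still arrive.

    # Minimal sketch, not the actual autoscaler code.
    def _input_queue_is_drained(op, op_state) -> bool:
        # Only treat an empty queue as "no more work" if all inputs have
        # already been received (or the operator has finished entirely).
        return op.completed() or (
            op._inputs_complete and op_state.total_enqueued_input_bundles() == 0
        )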
Signed-off-by: Alexey Kudinkin <[email protected]>
util = self._calculate_actor_pool_util(actor_pool)
return util < self._actor_pool_scaling_down_threshold
if util >= self._actor_pool_scaling_up_threshold:
    return _AutoscalingAction.SCALE_UP
Member:

In the current implementation, we don't scale up if the previous scale up hasn't finished. I think this diff might remove that behavior?
https://github.com/anyscale/rayturbo/blob/e156a9f9424ec1e499e93892d1a33d4bfb8599a1/python/ray/anyscale/data/autoscaler/anyscale_autoscaler.py#L357-L360
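For context, a minimal sketch of the kind of guard being referred to (this is not the linked implementation; it only uses the num_pending_actors accessor that appears in the log statement elsewhere in this PR, and the function name is an assumption):

    # Hypothetical sketch of a "previous scale-up still in flight" check.
    def _previous_scale_up_finished(actor_pool) -> bool:
        # Don't request more actors while ones from the last scale-up
        # are still pending (i.e., not yet running).
        return actor_pool.num_pending_actors() == 0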

Comment on lines 173 to 180
@dataclass
class _ActorInfo:
    """Breakdown of the state of the actors used by the ``PhysicalOperator``"""

    running: int
    pending: int
    restarting: int

Member:

Nit: I felt kinda confused by these names. _ActorInfo makes me think of a single actor rather than all of the actors for an operator, and running etc. seemed like bools rather than counts.

Maybe something like ActorStateCounts or OperatorActorStats and num_running etc. might be clearer.

Contributor:

+1

Contributor Author:

I was calling it OpActorInfo initially, but it's now used inside the ActorPool. Will rename it to _ActorPoolInfo
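Roughly, a sketch of the rename (fields taken from the hunk above; the docstring wording is an assumption):

    from dataclasses import dataclass

    @dataclass
    class _ActorPoolInfo:
        """Breakdown of the states of the actors in the actor pool."""

        running: int
        pending: int
        restarting: int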

Comment on lines 575 to 581
logger.info(
    f"Scaling up actor pool by {num_actors} ("
    f"running={self.num_running_actors()}, "
    f"pending={self.num_pending_actors()}, "
    f"restarting={self.num_restarting_actors()})"
)

Member:

We autoscale actors as much as twice a second, so I'm worried these info logs might get spammy. Would debug make sense instead?

Contributor Author:

There actually should be no bouncing back and forth in the autoscaler (it just doesn't really make sense)

 - AP has high utilization (actor-wise), but
 - Op is throttled, or
 - There are still free slots available

Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
)
elif util <= self._actor_pool_scaling_down_threshold:
    if not actor_pool.can_scale_down():
        return _AutoscalingAction.NO_OP, "debounced"
Contributor:

nit, the method name can_scale_down doesn't suggest it's because of debouncing

Contributor Author:

Updated

@@ -760,20 +798,20 @@ def num_free_slots(self) -> int:
        for running_actor in self._running_actors.values()
    )

-    def _kill_inactive_actor(self) -> bool:
+    def _release_inactive_actor(self) -> bool:
Contributor:

nit: I prefer using the word "remove", because both "kill" and "release" are implementation details that can change in the future.
Also, please update the comments as well.

return (
    f"{base} (running={info.running}, restarting={info.restarting}, "
    f"pending={info.pending})"
)
Contributor:

Implement this as _ActorInfo.__str__, so the logs in scale_up/scale_down can use it as well.
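A minimal sketch of that suggestion, assuming the counts shown earlier and the _ActorPoolInfo rename discussed above:

    from dataclasses import dataclass

    @dataclass
    class _ActorPoolInfo:
        running: int
        pending: int
        restarting: int

        def __str__(self) -> str:
            # Single place that formats the counts, reusable by the
            # scale_up/scale_down log messages.
            return (
                f"running={self.running}, restarting={self.restarting}, "
                f"pending={self.pending}"
            )

A log line could then simply interpolate the object, e.g. f"Scaling up actor pool by {num_actors} ({info})".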

@@ -574,18 +577,53 @@ def current_in_flight_tasks(self) -> int:
        for actor_state in self._running_actors.values()
    )

-    def scale_up(self, num_actors: int) -> int:
+    def can_scale_down(self):
Contributor:

why not put the debouncing logic in the autoscaler?
It's also part of the autoscaling policy.

Contributor Author:

Went back and forth on it, but decided to keep it in the AP, since it owns the logic of actually scaling up and tracks the most recent actions.
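For illustration, one way a debounce owned by the actor pool could look (the field name, interval constant, and clock source are assumptions; the thread only establishes that the pool tracks its most recent scaling actions):

    import time

    class ActorPoolDebounceSketch:
        """Illustrative only: suppress a scale-down that follows too soon
        after the most recent scaling action."""

        _SCALE_DOWN_DEBOUNCE_S = 10.0  # hypothetical interval

        def __init__(self) -> None:
            self._last_scaling_ts = 0.0

        def _record_scaling_action(self) -> None:
            self._last_scaling_ts = time.monotonic()

        def can_scale_down(self) -> bool:
            return (
                time.monotonic() - self._last_scaling_ts
                >= self._SCALE_DOWN_DEBOUNCE_S
            )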


def scale_up(self, num_actors: int, *, reason: Optional[str] = None) -> int:
    logger.info(
        f"Scaling up actor pool by {num_actors} ("
Contributor:

Suggested change:
-        f"Scaling up actor pool by {num_actors} ("
+        f"Scaling up actor pool by {num_actors}, current actor counts: ("

Contributor Author:

It also includes the reason, so the message needs to be more generic.

        self,
        actor_pool: AutoscalingActorPool,
        op: "PhysicalOperator",
        op_state: "OpState",
-    ):
+    ) -> Tuple[_AutoscalingAction, Optional[str]]:
Contributor:

nit, wrap action, number, and reason as one class, and move it to autoscaling_actor_pool.py?

Contributor Author:

Planning to take this up in a follow-up PR.
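For illustration, one possible shape for that follow-up (all names are hypothetical; the thread only establishes that the action, the actor delta, and the reason would travel together):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class AutoscalingDecisionSketch:
        """Hypothetical wrapper bundling the chosen action, the number of
        actors to add or remove, and a human-readable reason for logging."""

        action: "_AutoscalingAction"  # enum from the autoscaler module
        delta: int
        reason: Optional[str] = None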

Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
@richardliaw richardliaw merged commit 05f3517 into ray-project:master May 19, 2025
5 checks passed
kenmcheng pushed a commit to kenmcheng/ray that referenced this pull request May 27, 2025
Labels: community-backlog, go (add ONLY when ready to merge, run all tests)