ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. #383

Open
fengyang95 opened this issue Feb 25, 2025 · 5 comments

Comments

@fengyang95

Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 130, in
main()
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in
lambda: hydra.run(
^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 25, in main
run_ppo(config)
File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 33, in run_ppo
ray.get(main_task.remote(config, compute_score))
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/worker.py", line 2772, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=635457, ip=127.0.0.1)
File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 126, in main_task
trainer.fit()
File "/opt/tiger/verl/verl/trainer/ppo/ray_trainer.py", line 862, in fit
val_metrics = self._validate()
^^^^^^^^^^^^^^^^
File "/opt/tiger/verl/verl/trainer/ppo/ray_trainer.py", line 631, in _validate
test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/verl/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
^^^^^^^^^^^^^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: create_colocated_worker_cls.<locals>.WorkerDict
actor_id: 51333b5b40d3feca28206af601000000
pid: 645217
name: 0o0QHzWorkerDict_0:6
namespace: 9b824d07-ee25-46ef-bdd4-4c993aab9272
ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
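
The message above only lists possible root causes (OOM kill, ray stop --force, SIGSEGV). One way to narrow it down is to ask Ray itself why the actor died. The sketch below is a diagnostic aid, not part of verl; it assumes Ray's 2.x state API (ray.util.state.list_actors) and should be run on the head node while the cluster is still up. Checking the kernel log for the OOM killer (e.g. with dmesg) is also worth doing.

# Diagnostic sketch (assumption: Ray >= 2.x state API is available).
from ray.util.state import list_actors

# detail=True asks Ray to include extra fields such as the death cause,
# which usually distinguishes an OOM kill (SIGKILL) from a segfault or a
# forced shutdown. Field names can vary slightly across Ray versions.
dead_actors = list_actors(filters=[("state", "=", "DEAD")], detail=True)
for actor in dead_actors:
    print(actor)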

@asirgogogo

same here

@ChaosCodes

I also have the same error; I find that training gets stuck in actor_rollout_compute_log_prob before the actor dies.

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2859585      C   ...Dict.actor_rollout_compute_log_prob      20304MiB |
|    1   N/A  N/A   2861969      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    2   N/A  N/A   2861970      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    3   N/A  N/A   2861971      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    4   N/A  N/A   2861972      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    5   N/A  N/A   2861973      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    6   N/A  N/A   2861974      C   ...Dict.actor_rollout_compute_log_prob      20398MiB |
|    7   N/A  N/A   2861975      C   ...Dict.actor_rollout_compute_log_prob      19918MiB |
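
If the workers hang rather than crash outright, dumping their Python stacks usually shows which call they are blocked in (for example, a collective that another rank never entered). A minimal sketch is below; it assumes py-spy is installed (pip install py-spy), which is not bundled with verl, and the PIDs are taken from the process table above. Ray's own ray stack command on the node is an alternative.

# Debugging sketch (assumption: py-spy is installed on the node).
import subprocess

# PIDs copied from the nvidia-smi listing above; replace with your own.
stuck_pids = [2859585, 2861969, 2861970]
for pid in stuck_pids:
    # py-spy dump prints the current Python stack of a running process
    # without stopping it, which shows where compute_log_prob is stuck.
    subprocess.run(["py-spy", "dump", "--pid", str(pid)], check=False)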

@yenanjing

> I also have the same error; I find that training gets stuck in actor_rollout_compute_log_prob before the actor dies.

Same problem.

@yenanjing

> I also have the same error; I find that training gets stuck in actor_rollout_compute_log_prob before the actor dies.

I fixed it by setting actor_rollout_ref.rollout.log_prob_micro_batch_size so that log_prob_micro_batch_size // world_size != 0, as mentioned in #12 (comment).
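
As a rough pre-flight check of that workaround, something like the sketch below can be run before launching; the numbers are placeholders, not verl defaults, and this is not verl's own validation code.

# Pre-launch sanity check (example values; substitute your run's settings).
world_size = 8                     # total number of rollout GPUs
log_prob_micro_batch_size = 16     # actor_rollout_ref.rollout.log_prob_micro_batch_size

per_rank = log_prob_micro_batch_size // world_size
if per_rank == 0:
    raise ValueError(
        f"log_prob_micro_batch_size={log_prob_micro_batch_size} is smaller than "
        f"world_size={world_size}; each rank would get an empty micro batch"
    )
print(f"per-rank log-prob micro batch size: {per_rank}")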

@ChaosCodes

> I fixed it by setting actor_rollout_ref.rollout.log_prob_micro_batch_size so that log_prob_micro_batch_size // world_size != 0, as mentioned in #12 (comment).

I attempted to modify log_prob_micro_batch_size, but the training still gets stuck at actor_rollout_compute_log_prob.

I suspect the issue might be related to ulysses_sequence_parallel_size. When I set ulysses_sequence_parallel_size=8, the training gets stuck at actor_rollout_compute_log_prob. However, when I set ulysses_sequence_parallel_size=4, the training no longer gets stuck at actor_rollout_compute_log_prob, but it will sometimes result in an out-of-memory (OOM) error during the actor update.
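
If ulysses_sequence_parallel_size is involved, one generic constraint worth checking is that Ulysses-style sequence parallelism shards attention heads across the sequence-parallel group, so the model's attention head count (and, for GQA models, the key/value head count) should be divisible by the sequence-parallel size, and that size should divide the world size. The sketch below checks only that generic constraint under assumed head counts; it is not verl's validation logic.

# Generic Ulysses sanity check (head counts are assumed example values;
# read the real ones from the model's config).
num_attention_heads = 32
num_key_value_heads = 8
world_size = 8
ulysses_sequence_parallel_size = 8

assert world_size % ulysses_sequence_parallel_size == 0, \
    "sequence-parallel group size must divide the world size"
for name, heads in [("attention heads", num_attention_heads),
                    ("key/value heads", num_key_value_heads)]:
    if heads % ulysses_sequence_parallel_size != 0:
        print(f"{name} ({heads}) are not divisible by "
              f"ulysses_sequence_parallel_size={ulysses_sequence_parallel_size}")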
