ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. #383

Open
fengyang95 opened this issue Feb 25, 2025 · 5 comments

Comments

@fengyang95

Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 130, in
main()
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in
lambda: hydra.run(
^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 25, in main
run_ppo(config)
File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 33, in run_ppo
ray.get(main_task.remote(config, compute_score))
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/worker.py", line 2772, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=635457, ip=127.0.0.1)
File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 126, in main_task
trainer.fit()
File "/opt/tiger/verl/verl/trainer/ppo/ray_trainer.py", line 862, in fit
val_metrics = self._validate()
^^^^^^^^^^^^^^^^
File "/opt/tiger/verl/verl/trainer/ppo/ray_trainer.py", line 631, in _validate
test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/verl/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
^^^^^^^^^^^^^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: create_colocated_worker_cls.<locals>.WorkerDict
actor_id: 51333b5b40d3feca28206af601000000
pid: 645217
name: 0o0QHzWorkerDict_0:6
namespace: 9b824d07-ee25-46ef-bdd4-4c993aab9272
ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
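
The message above only lists possible root causes (OOM kill, ray stop --force, SIGSEGV). One way to narrow it down is to ask Ray itself why the actor died. The sketch below is a diagnostic aid, not part of verl; it assumes Ray's 2.x state API (ray.util.state.list_actors) and should be run on the head node while the cluster is still up. Checking the kernel log for the OOM killer (e.g. with dmesg) is also worth doing.

# Diagnostic sketch (assumption: Ray >= 2.x state API is available).
from ray.util.state import list_actors

# detail=True asks Ray to include extra fields such as the death cause,
# which usually distinguishes an OOM kill (SIGKILL) from a segfault or a
# forced shutdown. Field names can vary slightly across Ray versions.
dead_actors = list_actors(filters=[("state", "=", "DEAD")], detail=True)
for actor in dead_actors:
    print(actor)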

@asirgogogo

same here

@ChaosCodes

I also have the same error; I find that training gets stuck in actor_rollout_compute_log_prob before the actor dies.

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2859585      C   ...Dict.actor_rollout_compute_log_prob      20304MiB |
|    1   N/A  N/A   2861969      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    2   N/A  N/A   2861970      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    3   N/A  N/A   2861971      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    4   N/A  N/A   2861972      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    5   N/A  N/A   2861973      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    6   N/A  N/A   2861974      C   ...Dict.actor_rollout_compute_log_prob      20398MiB |
|    7   N/A  N/A   2861975      C   ...Dict.actor_rollout_compute_log_prob      19918MiB |
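
If the workers hang rather than crash outright, dumping their Python stacks usually shows which call they are blocked in (for example, a collective that another rank never entered). A minimal sketch is below; it assumes py-spy is installed (pip install py-spy), which is not bundled with verl, and the PIDs are taken from the process table above. Ray's own ray stack command on the node is an alternative.

# Debugging sketch (assumption: py-spy is installed on the node).
import subprocess

# PIDs copied from the nvidia-smi listing above; replace with your own.
stuck_pids = [2859585, 2861969, 2861970]
for pid in stuck_pids:
    # py-spy dump prints the current Python stack of a running process
    # without stopping it, which shows where compute_log_prob is stuck.
    subprocess.run(["py-spy", "dump", "--pid", str(pid)], check=False)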

@yenanjing

> I also have the same error; I find that training gets stuck in actor_rollout_compute_log_prob before the actor dies.

Same problem.

@yenanjing

> I also have the same error; I find that training gets stuck in actor_rollout_compute_log_prob before the actor dies.

I fixed it by setting actor_rollout_ref.rollout.log_prob_micro_batch_size so that log_prob_micro_batch_size // world_size != 0, as mentioned in #12 (comment).
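
As a rough pre-flight check of that workaround, something like the sketch below can be run before launching; the numbers are placeholders, not verl defaults, and this is not verl's own validation code.

# Pre-launch sanity check (example values; substitute your run's settings).
world_size = 8                     # total number of rollout GPUs
log_prob_micro_batch_size = 16     # actor_rollout_ref.rollout.log_prob_micro_batch_size

per_rank = log_prob_micro_batch_size // world_size
if per_rank == 0:
    raise ValueError(
        f"log_prob_micro_batch_size={log_prob_micro_batch_size} is smaller than "
        f"world_size={world_size}; each rank would get an empty micro batch"
    )
print(f"per-rank log-prob micro batch size: {per_rank}")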

@ChaosCodes

> I fixed it by setting actor_rollout_ref.rollout.log_prob_micro_batch_size so that log_prob_micro_batch_size // world_size != 0, as mentioned in #12 (comment).

I attempted to modify log_prob_micro_batch_size, but the training still gets stuck at actor_rollout_compute_log_prob.

I suspect the issue might be related to ulysses_sequence_parallel_size. When I set ulysses_sequence_parallel_size=8, the training gets stuck at actor_rollout_compute_log_prob. However, when I set ulysses_sequence_parallel_size=4, the training no longer gets stuck at actor_rollout_compute_log_prob, but it will sometimes result in an out-of-memory (OOM) error during the actor update.
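
If ulysses_sequence_parallel_size is involved, one generic constraint worth checking is that Ulysses-style sequence parallelism shards attention heads across the sequence-parallel group, so the model's attention head count (and, for GQA models, the key/value head count) should be divisible by the sequence-parallel size, and that size should divide the world size. The sketch below checks only that generic constraint under assumed head counts; it is not verl's validation logic.

# Generic Ulysses sanity check (head counts are assumed example values;
# read the real ones from the model's config).
num_attention_heads = 32
num_key_value_heads = 8
world_size = 8
ulysses_sequence_parallel_size = 8

assert world_size % ulysses_sequence_parallel_size == 0, \
    "sequence-parallel group size must divide the world size"
for name, heads in [("attention heads", num_attention_heads),
                    ("key/value heads", num_key_value_heads)]:
    if heads % ulysses_sequence_parallel_size != 0:
        print(f"{name} ({heads}) are not divisible by "
              f"ulysses_sequence_parallel_size={ulysses_sequence_parallel_size}")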
