ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. #383
Comments
Same here.
I have the same error. Training gets stuck in actor_rollout_compute_log_prob before the actor dies.
Same problem.
I fixed it by setting actor_rollout_ref.rollout.log_prob_micro_batch_size so that log_prob_micro_batch_size // world_size != 0, as mentioned in #12 (comment).
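A minimal sketch of that check, assuming the configured micro batch size is split across ranks with integer division. The helper name `per_rank_micro_batch_size` is hypothetical and not part of verl; it only shows why a value smaller than the world size floors to zero.

```python
def per_rank_micro_batch_size(micro_batch_size: int, world_size: int) -> int:
    """Return the micro batch size each rank would get, or raise if it is zero.

    Hypothetical helper for illustration; not a verl API.
    """
    if world_size <= 0:
        raise ValueError("world_size must be a positive integer")
    # Integer division: any micro_batch_size smaller than world_size gives 0.
    per_rank = micro_batch_size // world_size
    if per_rank == 0:
        raise ValueError(
            f"log_prob_micro_batch_size={micro_batch_size} is smaller than "
            f"world_size={world_size}; each rank would get a micro batch of 0"
        )
    return per_rank
```

For example, with 8 GPUs a value of 16 passes the check, while a value of 4 raises because 4 // 8 is 0.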
I attempted to modify I suspect the issue might be related to
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 130, in <module>
main()
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 25, in main
run_ppo(config)
File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 33, in run_ppo
ray.get(main_task.remote(config, compute_score))
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/worker.py", line 2772, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=635457, ip=127.0.0.1)
File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 126, in main_task
trainer.fit()
File "/opt/tiger/verl/verl/trainer/ppo/ray_trainer.py", line 862, in fit
val_metrics = self._validate()
^^^^^^^^^^^^^^^^
File "/opt/tiger/verl/verl/trainer/ppo/ray_trainer.py", line 631, in _validate
test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/verl/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
^^^^^^^^^^^^^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: create_colocated_worker_cls.<locals>.WorkerDict
actor_id: 51333b5b40d3feca28206af601000000
pid: 645217
name: 0o0QHzWorkerDict_0:6
namespace: 9b824d07-ee25-46ef-bdd4-4c993aab9272
ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.