This repository was archived by the owner on Jan 6, 2023. It is now read-only.

Cannot reuse --rdzv_id between different elastic launches? #151

@PKUFlyingPig

Question

I followed the tutorial and used the following command to launch torchelastic:

export NUM_TRAINERS=2
python -m torchelastic.distributed.launch \
    --nnodes=1:4 \
    --nproc_per_node=$NUM_TRAINERS \
    --rdzv_id=1 \
    --rdzv_backend=etcd \
    --rdzv_endpoint=162.105.19.156:2379 \
    mnmc_ddp_launch.py

I ran the same command on two nodes, and both started successfully. But when I killed the process on one node with Ctrl-C, the other node also aborted. Here is the traceback, if it helps:

Traceback (most recent call last):
  File "mnmc_ddp_launch.py", line 119, in <module>
    main()
  File "mnmc_ddp_launch.py", line 90, in main
    outputs = net(inputs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 696, in forward
    self._sync_params()
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1222, in _sync_params
    self._distributed_broadcast_coalesced(
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL communicator was aborted.
[ERROR] 2021-06-03 19:39:40,572 api: failed (exitcode: 1) local_rank: 0 (pid: 22262) of binary: /home/zhongyinmin/anaconda3/bin/python
[ERROR] 2021-06-03 19:39:40,572 local_elastic_agent: [default] Worker group failed
[INFO] 2021-06-03 19:39:40,572 api: [default] Worker group FAILED. 3/3 attempts left; will restart worker group
[INFO] 2021-06-03 19:39:40,573 api: [default] Stopping worker group
[INFO] 2021-06-03 19:39:40,573 api: [default] Rendezvous'ing worker group
INFO 2021-06-03 19:39:40,573 Attempting to join next rendezvous
INFO 2021-06-03 19:39:40,582 Observed existing rendezvous state: {'status': 'closed', 'version': '1', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_1/rdzv/v_1/rank_1', '/torchelastic/p2p/run_1/rdzv/v_1/rank_0'], 'num_workers_waiting': 0}
INFO 2021-06-03 19:39:40,582 Rendezvous for run_id=1 was observed to be closed
{"name": "torchelastic.worker.status.FAILED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "1", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "k8s-master", "state": "FAILED", "total_run_time": 80, "rdzv_backend": "etcd", "raw_error": "Traceback (most recent call last):\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/distributed/launch.py\", line 531, in main\n    run_result = elastic_agent.run(spec.role)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 680, in run\n    result = self._invoke_run(role)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 831, in _invoke_run\n    self._restart_workers(self._worker_group)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 674, in _restart_workers\n    self._initialize_workers(worker_group)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 654, in _initialize_workers\n    self._rendezvous(worker_group)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 518, in _rendezvous\n    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n  File 
\"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 154, in next_rendezvous\n    rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 287, in rendezvous_barrier\n    return self.init_phase()\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 349, in init_phase\n    raise RendezvousClosedException()\ntorchelastic.rendezvous.api.RendezvousClosedException\n", "metadata": "{\"group_world_size\": 2, \"entry_point\": \"python\"}", "agent_restarts": 1}}
[ERROR] 2021-06-03 19:39:40,588 error_handler: {
  "message": {
    "message": "RendezvousClosedException: ",
    "extraInfo": {
      "py_callstack": "Traceback (most recent call last):\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/multiprocessing/errors/__init__.py\", line 320, in wrapper\n    return f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/distributed/launch.py\", line 531, in main\n    run_result = elastic_agent.run(spec.role)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 680, in run\n    result = self._invoke_run(role)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 831, in _invoke_run\n    self._restart_workers(self._worker_group)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 674, in _restart_workers\n    self._initialize_workers(worker_group)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 654, in _initialize_workers\n    self._rendezvous(worker_group)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 518, in _rendezvous\n    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 
154, in next_rendezvous\n    rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 287, in rendezvous_barrier\n    return self.init_phase()\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 349, in init_phase\n    raise RendezvousClosedException()\ntorchelastic.rendezvous.api.RendezvousClosedException\n",
      "timestamp": "1622720380"
    }
  }
}
Traceback (most recent call last):
  File "/home/zhongyinmin/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhongyinmin/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/distributed/launch.py", line 561, in <module>
    main()
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/multiprocessing/errors/__init__.py", line 320, in wrapper
    return f(*args, **kwargs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/distributed/launch.py", line 531, in main
    run_result = elastic_agent.run(spec.role)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py", line 126, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 680, in run
    result = self._invoke_run(role)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 831, in _invoke_run
    self._restart_workers(self._worker_group)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py", line 126, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 674, in _restart_workers
    self._initialize_workers(worker_group)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py", line 126, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 654, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py", line 126, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 518, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py", line 154, in next_rendezvous
    rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py", line 287, in rendezvous_barrier
    return self.init_phase()
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py", line 349, in init_phase
    raise RendezvousClosedException()
torchelastic.rendezvous.api.RendezvousClosedException
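
For easier reading, the rendezvous state that the agent reports just before failing can be pulled out of the log line with a few lines of Python. This is only a sketch for inspecting the pasted log: `parse_rdzv_state` is a hypothetical helper, not part of torchelastic, and it assumes the state is printed as a Python dict literal after the text "Observed existing rendezvous state: ", as it is in the log above.

```python
import ast

# Log line copied from the output above.
LOG_LINE = (
    "INFO 2021-06-03 19:39:40,582 Observed existing rendezvous state: "
    "{'status': 'closed', 'version': '1', 'participants': [0, 1], "
    "'keep_alives': ['/torchelastic/p2p/run_1/rdzv/v_1/rank_1', "
    "'/torchelastic/p2p/run_1/rdzv/v_1/rank_0'], 'num_workers_waiting': 0}"
)

MARKER = "Observed existing rendezvous state: "


def parse_rdzv_state(line):
    """Return the state dict embedded in a log line, or None if absent.

    Hypothetical helper for reading the log; not a torchelastic API.
    """
    _, found, payload = line.partition(MARKER)
    if not found:
        return None
    # The state is printed as a Python literal, so literal_eval is enough.
    return ast.literal_eval(payload)


state = parse_rdzv_state(LOG_LINE)
print(state["status"], state["participants"])
```

The parsed state shows that the rendezvous for run_id=1 is already marked as closed when the agent tries to rejoin, which is the condition reported on the next log line.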

I looked up RendezvousClosedException, and its documentation says: "This Exception is raised when a rendezvous for the specified run_id is closed. This is used to signal completion to nodes that arrive late." I don't understand what that means. When I ran the same command again with --rdzv_id still set to 1, the error appeared again, and it kept appearing until I changed --rdzv_id to another number. Can't I reuse the same --rdzv_id across different jobs?
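
The only workaround found so far, a fresh --rdzv_id for each job, could be scripted roughly as follows. This is a minimal sketch, not a torchelastic feature: `build_launch_command` is a hypothetical helper, the flags are the ones from the command above, and the random id is just one way to avoid reusing an id whose rendezvous is already closed. The id must be generated once per job and passed to every node of that job, since all nodes have to join the same rendezvous.

```python
import uuid


def build_launch_command(script, endpoint, rdzv_id=None,
                         nnodes="1:4", nproc_per_node=2):
    """Build the launch command as an argument list.

    Hypothetical helper, not part of torchelastic. If no rdzv_id is
    given, a random one is generated so that a new job does not reuse
    the id of an earlier job.
    """
    if rdzv_id is None:
        rdzv_id = uuid.uuid4().hex
    return [
        "python", "-m", "torchelastic.distributed.launch",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--rdzv_id={rdzv_id}",
        "--rdzv_backend=etcd",
        f"--rdzv_endpoint={endpoint}",
        script,
    ]


# Generate the id once per job, then use the same value on every node.
job_id = uuid.uuid4().hex
cmd = build_launch_command("mnmc_ddp_launch.py", "162.105.19.156:2379",
                           rdzv_id=job_id)
print(" ".join(cmd))
```

The printed command can be copied to each node, or the list can be passed to `subprocess.run`. Whether an old id can be made reusable again is exactly the open question here.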
