I have been trying to get AutoAttack to work reliably in distributed mode (DistributedDataParallel) in PyTorch. The program often crashes after AutoAttack has run for several batches, with the stack trace below. It happens more often the longer AutoAttack runs, i.e., when robust accuracy is high and many samples make it through to the Square attack.
I realize this may be a system-specific, PyTorch-related issue, but I am curious whether anyone else here has hit a similar error and perhaps has a fix. My current workaround is simply to repeat the evaluation until a run gets lucky and finishes successfully. Obviously this is not ideal and wastes a lot of time and resources, as some evaluations need 5-10 repeats to succeed.
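The repeat-until-lucky workaround can at least be automated instead of relaunched by hand. A minimal sketch (the torchrun command line and script name are hypothetical placeholders for the actual evaluation):

```python
import subprocess
import sys


def run_with_retries(cmd, max_retries=10):
    """Re-launch cmd until it exits cleanly.

    The c10d rendezvous crash is nondeterministic, so simply
    re-running the same command often succeeds. Returns the
    number of attempts it took.
    """
    for attempt in range(1, max_retries + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
    raise RuntimeError(f"command still failing after {max_retries} attempts")


if __name__ == "__main__":
    # Hypothetical evaluation command; replace with your own script/args.
    run_with_retries(["torchrun", "--nproc_per_node=2", "eval_autoattack.py"])
```

This still wastes the work of each failed run, but it removes the babysitting.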
Another workaround is to run the evaluation in non-distributed mode. The error never happens outside of distributed mode and may be specific to the c10d backend.
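When falling back to single-process evaluation, the same script can detect whether it was launched under torchrun (which exports rank-related environment variables to each worker) and skip the DDP setup accordingly. A sketch, assuming the standard torchrun-provided variables:

```python
import os


def launched_by_torchrun() -> bool:
    """torchrun (torch.distributed.elastic) exports RANK, WORLD_SIZE,
    and LOCAL_RANK to each worker process; a plain `python script.py`
    invocation has none of them set.
    """
    return all(k in os.environ for k in ("RANK", "WORLD_SIZE", "LOCAL_RANK"))
```

The evaluation entry point can then branch on this check, wrapping the model in DistributedDataParallel only when the variables are present.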
Some system info:
Happens with PyTorch 1.9-1.11
CUDA 11.0 and 11.3
2 V100 GPUs at a time
Launched with torchrun
...
initial accuracy: 75.00%
apgd-ce - 1/1 - 6 out of 48 successfully perturbed
robust accuracy after APGD-CE: 65.62% (total time 26.3 s)
apgd-t - 1/1 - 1 out of 42 successfully perturbed
robust accuracy after APGD-T: 64.06% (total time 95.2 s)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 64626 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 64627 closing signal SIGTERM
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'v100-xgcp.internal_64618_0' has failed to shutdown the rendezvous 'dcd635f2-70f3-47cd-941b-fec1c751acd3' due to an error of type RendezvousConnectionError.
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{
"message": {
"message": "RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.",
"extraInfo": {
"py_callstack": "Traceback (most recent call last):\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 113, in _call_store\n return getattr(self._store, store_op)(*args, **kwargs)\nRuntimeError: Broken pipe\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 345, in wrapper\n return f(*args, **kwargs)\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py\", line 724, in main\n run(args)\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py\", line 715, in run\n elastic_launch(\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py\", line 131, in __call__\n return launch_agent(self._config, self._entrypoint, list(args))\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py\", line 236, in launch_agent\n result = agent.run()\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py\", line 709, in run\n result = self._invoke_run(role)\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py\", line 881, in _invoke_run\n num_nodes_waiting = rdzv_handler.num_nodes_waiting()\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 1079, in num_nodes_waiting\n self._state_holder.sync()\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 408, in sync\n 
get_response = self._backend.get_state()\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 73, in get_state\n base64_state: bytes = self._call_store(\"get\", self._key)\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 115, in _call_store\n raise RendezvousConnectionError(\ntorch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.\n",
"timestamp": "1648190455"
}
}
}
Traceback (most recent call last):
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/chawins/miniconda3/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
result = agent.run()
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 881, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1079, in num_nodes_waiting
self._state_holder.sync()
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 408, in sync
get_response = self._backend.get_state()
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
WARNING:torch.distributed.run:
I've never tried to use AA in distributed mode so far. Anyway, thanks for letting me know; I'll get back to you if I face the same issue or find a fix.