I have been trying to get AutoAttack to work reliably in distributed mode (DistributedDataParallel) in PyTorch. The program often crashes after AutoAttack has run for several batches, with the stack trace below. It happens more often the longer AutoAttack runs, i.e., when robust accuracy is high and many samples make it through to the Square attack.
I realize this may be a system-specific, PyTorch-related issue, but I am curious whether anyone else here has hit a similar error and perhaps has a fix. My current workaround is simply to repeat the evaluation until a run gets lucky and finishes successfully. Obviously this is not ideal and wastes a lot of time and resources, as some evaluations need 5-10 repeats to succeed.
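The repeat-until-lucky workaround can at least be automated instead of relaunched by hand. A minimal sketch (the torchrun command line and script name are hypothetical placeholders for the actual evaluation):

```python
import subprocess
import sys


def run_with_retries(cmd, max_retries=10):
    """Re-launch cmd until it exits cleanly.

    The c10d rendezvous crash is nondeterministic, so simply
    re-running the same command often succeeds. Returns the
    number of attempts it took.
    """
    for attempt in range(1, max_retries + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
    raise RuntimeError(f"command still failing after {max_retries} attempts")


if __name__ == "__main__":
    # Hypothetical evaluation command; replace with your own script/args.
    run_with_retries(["torchrun", "--nproc_per_node=2", "eval_autoattack.py"])
```

This still wastes the work of each failed run, but it removes the babysitting.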
Another workaround is to run the evaluation in non-distributed mode. The error never happens outside of distributed mode and may be specific to the c10d backend.
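When falling back to single-process evaluation, the same script can detect whether it was launched under torchrun (which exports rank-related environment variables to each worker) and skip the DDP setup accordingly. A sketch, assuming the standard torchrun-provided variables:

```python
import os


def launched_by_torchrun() -> bool:
    """torchrun (torch.distributed.elastic) exports RANK, WORLD_SIZE,
    and LOCAL_RANK to each worker process; a plain `python script.py`
    invocation has none of them set.
    """
    return all(k in os.environ for k in ("RANK", "WORLD_SIZE", "LOCAL_RANK"))
```

The evaluation entry point can then branch on this check, wrapping the model in DistributedDataParallel only when the variables are present.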
Some system info:
Happens with PyTorch 1.9-1.11
CUDA 11.0 and 11.3
2 V100 GPUs at a time
Launched with torchrun
...
initial accuracy: 75.00%
apgd-ce - 1/1 - 6 out of 48 successfully perturbed
robust accuracy after APGD-CE: 65.62% (total time 26.3 s)
apgd-t - 1/1 - 1 out of 42 successfully perturbed
robust accuracy after APGD-T: 64.06% (total time 95.2 s)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 64626 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 64627 closing signal SIGTERM
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'v100-xgcp.internal_64618_0' has failed to shutdown the rendezvous 'dcd635f2-70f3-47cd-941b-fec1c751acd3' due to an error of type RendezvousConnectionError.
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{
"message": {
"message": "RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.",
"extraInfo": {
"py_callstack": "Traceback (most recent call last):\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 113, in _call_store\n return getattr(self._store, store_op)(*args, **kwargs)\nRuntimeError: Broken pipe\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 345, in wrapper\n return f(*args, **kwargs)\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py\", line 724, in main\n run(args)\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py\", line 715, in run\n elastic_launch(\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py\", line 131, in __call__\n return launch_agent(self._config, self._entrypoint, list(args))\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py\", line 236, in launch_agent\n result = agent.run()\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py\", line 709, in run\n result = self._invoke_run(role)\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py\", line 881, in _invoke_run\n num_nodes_waiting = rdzv_handler.num_nodes_waiting()\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 1079, in num_nodes_waiting\n self._state_holder.sync()\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 408, in sync\n 
get_response = self._backend.get_state()\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 73, in get_state\n base64_state: bytes = self._call_store(\"get\", self._key)\n File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 115, in _call_store\n raise RendezvousConnectionError(\ntorch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.\n",
"timestamp": "1648190455"
}
}
}
Traceback (most recent call last):
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/chawins/miniconda3/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
result = agent.run()
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 881, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1079, in num_nodes_waiting
self._state_holder.sync()
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 408, in sync
get_response = self._backend.get_state()
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
WARNING:torch.distributed.run:
I've never tried to use AA in distributed mode so far. Anyway, thanks for letting me know; I'll get back to you if I face the same issue or find a fix.