Support for `--standalone` for single-node multi-GPU jobs (https://pytorch.org/docs/stable/elastic/run.html#single-node-multi-worker) might be of great interest, to avoid concurrent jobs on the same machine trying to bind to the same port. For example:

```
accelerate launch --num_processes 2 --num_machines 1 train.py --lr 1e-5 & \
accelerate launch --num_processes 2 --num_machines 1 train.py --lr 1e-3
```

results in the following error:

```
ConnectionError: Tried to launch distributed communication on port `29500`, but another process is utilizing it. Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.
```
When using `--main_process_port 0`, both jobs seem to loop forever. Workarounds such as `--main_process_port $((29500 + $RANDOM % 1000))` do resolve the conflict, but they are not straightforward.
`torchrun` provides `--standalone` for this specific purpose (pytorch/pytorch#107734, https://github.com/pytorch/pytorch/blob/960c3bff98a1a8b0f3c68eec764a507b7aaa63c2/torch/distributed/run.py#L893-L907). So how about propagating this argument to `torch.distributed.run`?

accelerate/src/accelerate/commands/launch.py (line 793 in 74381f9)
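For comparison, this is the behavior being requested: with `torchrun`, `--standalone` starts a throwaway localhost rendezvous (on an automatically chosen free port since pytorch/pytorch#107734), so two single-node jobs can run side by side without any manual port handling. Same placeholder `train.py` as above:

```
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py --lr 1e-5 & \
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py --lr 1e-3
```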