Support --standalone for concurrent single node multi-GPU jobs #3175

Open
Olive-Z opened this issue Oct 16, 2024 · 0 comments

Olive-Z commented Oct 16, 2024

Supporting --standalone for single-node multi-GPU jobs (https://pytorch.org/docs/stable/elastic/run.html#single-node-multi-worker) would be of great interest: it would prevent concurrent jobs on the same machine from trying to bind to the same port. For example:

accelerate launch --num_processes 2 --num_machines 1 train.py --lr 1e-5 & \
accelerate launch --num_processes 2 --num_machines 1 train.py --lr 1e-3

results in the following error:

ConnectionError: Tried to launch distributed communication on port `29500`, but another process is utilizing it. Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.

When using --main_process_port 0, both jobs seem to loop forever.
Workarounds such as --main_process_port $((29500 + $RANDOM % 1000)) do resolve the conflict, but they are not straightforward (and a random offset can still collide).
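A more reliable interim workaround (a sketch, not part of accelerate itself) is to ask the OS for a genuinely free port before launching, by binding to port 0 and reading back the assigned port. Note there is still a small race window between releasing the port and accelerate re-binding it:

```python
# Ask the kernel for an unused TCP port by binding to port 0.
# Caveat: the port is released before accelerate re-binds it, so a
# (rare) race with another process is still possible.
import socket

def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))          # port 0 = let the OS choose
        return s.getsockname()[1]          # the port the OS picked

if __name__ == "__main__":
    print(find_free_port())
```

Usage, e.g.: accelerate launch --main_process_port $(python find_free_port.py) --num_processes 2 --num_machines 1 train.py --lr 1e-5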
torchrun provides --standalone for exactly this purpose (pytorch/pytorch#107734, https://github.com/pytorch/pytorch/blob/960c3bff98a1a8b0f3c68eec764a507b7aaa63c2/torch/distributed/run.py#L893-L907).

So how about propagating this argument to torch.distributed.run?

distrib_run.run(args)
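As a rough illustration of the proposal (hypothetical code, not accelerate's actual implementation — the function names build_parser and to_torchrun_args are made up for this sketch), the launcher would only need to accept the flag and forward it unchanged, since torchrun already knows how to handle it:

```python
# Hypothetical sketch of forwarding a --standalone flag from
# `accelerate launch` to torch.distributed.run, which already supports it.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="accelerate launch")
    parser.add_argument(
        "--standalone",
        action="store_true",
        help="Single-node job with an automatically chosen free rendezvous "
             "port (forwarded verbatim to torch.distributed.run).",
    )
    return parser

def to_torchrun_args(args: argparse.Namespace) -> list[str]:
    # torchrun's --standalone implies a local c10d rendezvous on a free port,
    # so no --main_process_port juggling is needed by the caller.
    forwarded = []
    if args.standalone:
        forwarded.append("--standalone")
    return forwarded

if __name__ == "__main__":
    ns = build_parser().parse_args(["--standalone"])
    print(to_torchrun_args(ns))
```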
