Support --standalone for concurrent single node multi-GPU jobs #3175

Open
Olive-Z opened this issue Oct 16, 2024 · 0 comments

Olive-Z commented Oct 16, 2024

Supporting --standalone for single-node multi-GPU jobs (https://pytorch.org/docs/stable/elastic/run.html#single-node-multi-worker) would be of great interest: it would prevent concurrent jobs on the same machine from trying to bind to the same port. For example:

accelerate launch --num_processes 2 --num_machines 1 train.py --lr 1e-5 & \
accelerate launch --num_processes 2 --num_machines 1 train.py --lr 1e-3

results in the following error:

ConnectionError: Tried to launch distributed communication on port `29500`, but another process is utilizing it. Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.

When using --main_process_port 0, both jobs seem to loop forever.
Workarounds such as --main_process_port $((29500 + $RANDOM % 1000)) do resolve the conflict, but they are not straightforward (and a random offset can still collide).
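A more reliable interim workaround (a sketch, not part of accelerate itself) is to ask the OS for a genuinely free port before launching, by binding to port 0 and reading back the assigned port. Note there is still a small race window between releasing the port and accelerate re-binding it:

```python
# Ask the kernel for an unused TCP port by binding to port 0.
# Caveat: the port is released before accelerate re-binds it, so a
# (rare) race with another process is still possible.
import socket

def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))          # port 0 = let the OS choose
        return s.getsockname()[1]          # the port the OS picked

if __name__ == "__main__":
    print(find_free_port())
```

Usage, e.g.: accelerate launch --main_process_port $(python find_free_port.py) --num_processes 2 --num_machines 1 train.py --lr 1e-5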
torchrun provides --standalone for exactly this purpose (pytorch/pytorch#107734, https://github.com/pytorch/pytorch/blob/960c3bff98a1a8b0f3c68eec764a507b7aaa63c2/torch/distributed/run.py#L893-L907).

So how about propagating this argument to torch.distributed.run?

distrib_run.run(args)
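As a rough illustration of the proposal (hypothetical code, not accelerate's actual implementation — the function names build_parser and to_torchrun_args are made up for this sketch), the launcher would only need to accept the flag and forward it unchanged, since torchrun already knows how to handle it:

```python
# Hypothetical sketch of forwarding a --standalone flag from
# `accelerate launch` to torch.distributed.run, which already supports it.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="accelerate launch")
    parser.add_argument(
        "--standalone",
        action="store_true",
        help="Single-node job with an automatically chosen free rendezvous "
             "port (forwarded verbatim to torch.distributed.run).",
    )
    return parser

def to_torchrun_args(args: argparse.Namespace) -> list[str]:
    # torchrun's --standalone implies a local c10d rendezvous on a free port,
    # so no --main_process_port juggling is needed by the caller.
    forwarded = []
    if args.standalone:
        forwarded.append("--standalone")
    return forwarded

if __name__ == "__main__":
    ns = build_parser().parse_args(["--standalone"])
    print(to_torchrun_args(ns))
```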
