My servers used for multi-node training do not have ssh. How can I launch multi-node training using the torchrun command? #1203

Open
dingning97 opened this issue Apr 23, 2024 · 2 comments
Labels
feature request New feature or request

Comments

@dingning97

dingning97 commented Apr 23, 2024

The machines I use for multi-node training do not allow the ssh service.
How can I launch multi-node training using only the basic torchrun command (torch.distributed.launch)?

The servers I use do not have Slurm, and I found that both OpenMPI and pdsh rely on ssh.
So I have run out of the options listed in this repo's README for starting a multi-node training job.

@dingning97 dingning97 added the feature request New feature or request label Apr 23, 2024
@WKX933

WKX933 commented Apr 30, 2024

I also encountered the same problem. Have you found a solution?

@Quentin-Anthony
Member

This can be resolved by adding torchrun as a possible deepspeed multinode runner. We're targeting adding this within the next week or two.
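
For context, here is a minimal sketch of what an ssh-free launch looks like at the plain PyTorch level: torchrun is started by hand (or by any job scheduler) on every node, each with its own `--node_rank`, and all nodes point at the same master address and port. The helper name `build_torchrun_command`, the script name `train.py`, the address `10.0.0.1`, the port `29500`, and the 2-node, 8-GPU layout are assumptions for illustration only. Whether this repo's training entry point can be started this way is not confirmed in this thread.

```python
import shlex


def build_torchrun_command(node_rank, nnodes, nproc_per_node, master_addr,
                           master_port=29500, script="train.py", script_args=()):
    """Build the torchrun argv to run locally on one node (hypothetical helper).

    Every node runs the same command; only --node_rank differs.
    `script` and `script_args` are placeholders, not this repo's entry point.
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of machines
        f"--nproc_per_node={nproc_per_node}",  # worker processes (GPUs) per machine
        f"--node_rank={node_rank}",            # index of this machine, 0-based
        f"--master_addr={master_addr}",        # address of the rank-0 machine
        f"--master_port={master_port}",        # free TCP port on the rank-0 machine
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Print the command to paste into a shell on each of two assumed nodes.
    for rank in range(2):
        cmd = build_torchrun_command(rank, nnodes=2, nproc_per_node=8,
                                     master_addr="10.0.0.1")
        print(f"node {rank}: {shlex.join(cmd)}")
```

No ssh is involved because each node starts its own workers locally. The nodes only need to reach the master address over the chosen port.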
