
wrong global rank when trying multi-nodes #6454

What have you set for MASTER_ADDR and MASTER_PORT? They have to refer to one of the two machines you are using. For example, if I have two nodes like this:

IP: 512.124.134.4
IP: 512.124.136.8

And I want 512.124.134.4 to be my master node.

On both machines, I'd need to run something like MASTER_ADDR=512.124.134.4 MASTER_PORT=4500 python train.py, with the same values on each.
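If it helps with debugging, here is a rough sketch of a check you could put at the top of train.py to confirm both variables actually reached the process on each node. The helper name read_rendezvous is made up for illustration and is not part of any library; it only reads the environment and does not try to connect to anything.

```python
import os


def read_rendezvous(env=None):
    """Return (address, port) from MASTER_ADDR / MASTER_PORT, or raise if unset.

    Hypothetical helper for illustration only, not a library API.
    """
    env = os.environ if env is None else env
    # Collect whichever of the two variables is missing or empty.
    missing = [k for k in ("MASTER_ADDR", "MASTER_PORT") if not env.get(k)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    port = int(env["MASTER_PORT"])
    if not 0 < port < 65536:
        raise ValueError(f"MASTER_PORT out of range: {port}")
    return env["MASTER_ADDR"], port


if __name__ == "__main__":
    # Print on every node; the output should be identical on both machines.
    print(read_rendezvous())
```

Running this on each machine and comparing the printed pair is a quick way to spot a node that was launched with a different address or port.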

Let me know if this helps! If it does work, we should also update the docs :)

Answer selected by jungwhank