NODE_RANK causes DDP jobs to hang at initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
#5798
Unanswered
ajtao asked this question in DDP / multi-GPU / multi-node
Hello,
In my compute cluster, all PyTorch Lightning code hangs when using more than one GPU.
It hangs right at `initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8`.
Some relevant stats:
I have found that training does work if I unset NODE_RANK.
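A minimal sketch of that workaround, assuming the Lightning 1.x Trainer arguments (`gpus=...`, `accelerator="ddp"`); the model and dataset are hypothetical stand-ins, and it is assumed that NODE_RANK is read when the Trainer is constructed, so popping it beforehand is early enough:

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset

import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Tiny synthetic dataset, just enough to drive a smoke-test run."""

    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class BoringModel(pl.LightningModule):
    """Minimal model used only to exercise DDP initialization."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        # Any scalar works as a loss for this smoke test.
        return self(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # Workaround described above: drop NODE_RANK before the Trainer is built,
    # so Lightning does not treat this single-node launch as a torchelastic /
    # multi-node one. (Assumption: the variable is only read at Trainer
    # construction time.)
    os.environ.pop("NODE_RANK", None)

    trainer = pl.Trainer(gpus=8, accelerator="ddp", max_epochs=1)
    trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=8))
```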
By instrumenting the pytorch-lightning code, I have observed that `use_torchelastic_ddp` is selected for all ranks.
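For anyone trying to reproduce this observation without patching Lightning itself, a quick dump of the environment variables that commonly drive the choice between a plain DDP launch and a torchelastic one can be run on each rank before the Trainer is created. The exact set of variables Lightning inspects varies by version, so this list is an assumption:

```python
import os

# Variables that typically influence cluster-environment / DDP selection.
# Treat this list as an assumption; the exact logic depends on the
# pytorch-lightning version installed.
DDP_ENV_VARS = (
    "NODE_RANK",
    "GROUP_RANK",
    "LOCAL_RANK",
    "RANK",
    "WORLD_SIZE",
    "MASTER_ADDR",
    "MASTER_PORT",
)

for name in DDP_ENV_VARS:
    print(f"{name}={os.environ.get(name, '<unset>')}")
```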
My questions:
What's your environment?
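If it helps, one way to gather the usual environment details is PyTorch's built-in collector plus the installed Lightning version; this is just a convenience snippet, not something specific to this issue:

```python
# Prints PyTorch, CUDA, and OS details in the usual bug-report format,
# plus the installed Lightning version.
import pytorch_lightning as pl
from torch.utils import collect_env

print(f"pytorch_lightning: {pl.__version__}")
collect_env.main()
```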