NODE_RANK causes DDP jobs to hang at initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
#5798
Unanswered
ajtao asked this question in DDP / multi-GPU / multi-node
Hello,
In my compute cluster, all PyTorch Lightning code hangs when using more than one GPU.
It hangs right at `initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8`.
Some relevant stats:
I have found that training does work if I unset NODE_RANK.
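A minimal sketch of that workaround, assuming the Lightning 1.x Trainer arguments (`gpus=...`, `accelerator="ddp"`); the model and dataset are hypothetical stand-ins, and it is assumed that NODE_RANK is read when the Trainer is constructed, so popping it beforehand is early enough:

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset

import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Tiny synthetic dataset, just enough to drive a smoke-test run."""

    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class BoringModel(pl.LightningModule):
    """Minimal model used only to exercise DDP initialization."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        # Any scalar works as a loss for this smoke test.
        return self(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # Workaround described above: drop NODE_RANK before the Trainer is built,
    # so Lightning does not treat this single-node launch as a torchelastic /
    # multi-node one. (Assumption: the variable is only read at Trainer
    # construction time.)
    os.environ.pop("NODE_RANK", None)

    trainer = pl.Trainer(gpus=8, accelerator="ddp", max_epochs=1)
    trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=8))
```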
By instrumenting the pytorch-lightning code, I have observed that `use_torchelastic_ddp` is selected for all ranks.
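For anyone trying to reproduce this observation without patching Lightning itself, a quick dump of the environment variables that commonly drive the choice between a plain DDP launch and a torchelastic one can be run on each rank before the Trainer is created. The exact set of variables Lightning inspects varies by version, so this list is an assumption:

```python
import os

# Variables that typically influence cluster-environment / DDP selection.
# Treat this list as an assumption; the exact logic depends on the
# pytorch-lightning version installed.
DDP_ENV_VARS = (
    "NODE_RANK",
    "GROUP_RANK",
    "LOCAL_RANK",
    "RANK",
    "WORLD_SIZE",
    "MASTER_ADDR",
    "MASTER_PORT",
)

for name in DDP_ENV_VARS:
    print(f"{name}={os.environ.get(name, '<unset>')}")
```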
My questions:
What's your environment?
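If it helps, one way to gather the usual environment details is PyTorch's built-in collector plus the installed Lightning version; this is just a convenience snippet, not something specific to this issue:

```python
# Prints PyTorch, CUDA, and OS details in the usual bug-report format,
# plus the installed Lightning version.
import pytorch_lightning as pl
from torch.utils import collect_env

print(f"pytorch_lightning: {pl.__version__}")
collect_env.main()
```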