extra process when running ddp across multiple GPUs #9864
Unanswered
ChanganVR asked this question in DDP / multi-GPU / multi-node
Replies: 2 comments 5 replies
-
I experienced a similar issue and also encountered one extra process taking around 1GB of GPU memory for every additional GPU used during training. The issue was solved by setting auto_select_gpus=False when initializing the Trainer class.
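For reference, a minimal sketch of that workaround (the toy module, data, and the other Trainer arguments are just illustrative assumptions; the `auto_select_gpus` flag and `accelerator="ddp"` correspond to PyTorch Lightning 1.x):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Minimal stand-in module, only to make the Trainer call concrete."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)


train_loader = DataLoader(
    TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32
)

# Explicitly disabling auto_select_gpus is what resolved the extra ~1GB
# process per GPU in the commenter's setup (PyTorch Lightning 1.x flags).
trainer = pl.Trainer(
    gpus=2,
    accelerator="ddp",
    auto_select_gpus=False,
)
trainer.fit(ToyModel(), train_loader)
```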
-
I now also experience an unwanted phenomenon where gpu:0 gets an extra process for each GPU used in multi-GPU training (see image). Does anyone know why this might occur?
-
Hi there,
When I run DDP across multiple GPUs, I always see additional processes for each additional GPU. For example, when I use 2 GPUs, each GPU has one process for training plus one extra process that takes about 1GB. My understanding is that this extra process handles gradient synchronization. Is this behavior expected? And is there a way to avoid the extra processes, because they really limit the ability to use more than 8 GPUs?
Below is the utilization of GPU memory when I use two GPUs:
Below is the utilization of GPU memory when I use eight GPUs:
Any feedback or suggestions are appreciated. Thank you!
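As an illustration (not from the original post), the per-GPU process listing behind the memory screenshots above can be captured with nvidia-smi's compute-apps query; the snippet below is a sketch and assumes nvidia-smi is on the PATH:

```python
import subprocess

# List the compute processes resident on each GPU with their memory footprint,
# to spot the extra ~1GB processes described above.
output = subprocess.check_output(
    [
        "nvidia-smi",
        "--query-compute-apps=pid,gpu_uuid,used_memory",
        "--format=csv,noheader",
    ],
    text=True,
)
print(output)
```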