Emulating multiple devices with a single GPU #8630
-
Hello, I have a single GPU, but I would like to spawn multiple replicas on that single GPU and train a model with DDP. Of course, each replica would have to use a smaller batch size in order to fit in memory. (For my use case, I am not interested in having a single replica with a large batch size.) I tried passing such a configuration to the `Trainer`, but in the end it crashed. Please, is there any way to split a single GPU into multiple replicas with Lightning?

P.S.: Ray has really nice support for fractional GPUs: https://docs.ray.io/en/master/using-ray-with-gpus.html#fractional-gpus. I've never used them with Lightning, but maybe it could be a workaround?
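(For reference, the fractional-GPU usage from the linked page looks roughly like this — plain Ray, not integrated with Lightning; the task body is just illustrative:)

```python
# Sketch of Ray fractional GPUs: two tasks share one physical GPU,
# each reserving half of it via num_gpus=0.5.
import os
import ray

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.5)  # each task gets a 0.5-GPU share of the same device
def use_gpu():
    # Ray sets CUDA_VISIBLE_DEVICES so the task sees its assigned GPU
    return os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get([use_gpu.remote(), use_gpu.remote()]))
```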
-
Hmm, interesting use case. AFAIU it is not possible, at least with the `nccl` backend. See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html: NCCL expects each rank of a communicator to use its own CUDA device, so two DDP processes cannot share a single GPU.

It probably can be done if you write custom gradient-syncing logic, which moves the gradients to RAM before syncing and syncs them over a CPU backend such as `gloo`.
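A minimal sketch of what such a sync could look like, assuming a `gloo` process group has already been initialized (the function name `sync_grads_via_cpu` and the averaging are illustrative, not an existing Lightning hook):

```python
import torch
import torch.distributed as dist

def sync_grads_via_cpu(model: torch.nn.Module, group=None):
    """All-reduce gradients on CPU so several processes can share one GPU."""
    world_size = dist.get_world_size(group)
    for param in model.parameters():
        if param.grad is None:
            continue
        cpu_grad = param.grad.detach().cpu()           # move gradient off the GPU
        dist.all_reduce(cpu_grad, op=dist.ReduceOp.SUM, group=group)
        param.grad.copy_(cpu_grad.div_(world_size))    # average and copy back
```

You would call it between `loss.backward()` and `optimizer.step()` in each process.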
-
For reference: it seems to be possible when the backend is `gloo` instead of `nccl`. See the discussion here: #8630 (reply in thread).
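An untested minimal sketch of what that could look like with plain `torch.distributed` (two processes both pinned to `cuda:0`, `gloo` backend; the address, port, and toy model are placeholders):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"          # placeholder port
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1).to("cuda:0")  # every rank uses the same GPU
    ddp_model = DDP(model)                       # gradient sync runs over gloo
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(4, 10, device="cuda:0")      # small per-replica batch
    ddp_model(x).sum().backward()
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)        # two replicas on one GPU
```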