OoM issue with multiple gpus using Distributed Data Parallel (DDP) training #1650

SebastienThibert · 2024-09-20T14:49:39Z

When I run this example runs on multiple gpus using Distributed Data Parallel (DDP) training on AWS SageMaker with 4 GPUS and a batch_size of 8192, I got a OoM issue despite the 96GiB capacity:

Tried to allocate 4.00 GiB. GPU 2 has a total capacity of 21.99 GiB of which 1.21 GiB is free. Including non-PyTorch memory, this process has 20.77 GiB memory in use.

The text was updated successfully, but these errors were encountered:

guarin · 2024-09-20T14:55:14Z

Hi, batch size 8192 is quite big even for 4 GPUs (the original paper used batch size 4096 on 128 TPUs). Are you using CIFAR (32x32) or normal ImageNet sized images (224x224)?

SebastienThibert · 2024-09-20T15:06:19Z

I use exactly the code from the example so 32x32 I think cordialement, Sébastien Thibert Le ven. 20 sept. 2024, 16:55, guarin ***@***.***> a écrit :

…

Hi, batch size 8192 is quite big even for 4 GPUs (the original paper used batch size 4096 on 128 TPUs). Are you using CIFAR (32x32) or normal ImageNet sized images (224x224)? — Reply to this email directly, view it on GitHub <#1650 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ATLKMZ4EBK56DGUVEFFUNMDZXQZORAVCNFSM6AAAAABOSGOIIGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNRTHEZDKNJXGE> . You are receiving this because you authored the thread.Message ID: ***@***.***>

guarin · 2024-09-20T15:29:20Z

I just tested it on 4x24GB GPUs and it indeed fails with OOM. I had to reduce the batch size to 4096 for it to succeed. I think this is expected. Please note that the batch size is per GPU as it is set directly in the dataloader. You can train with larger batch sizes if you set precision="16-mixed" to enable half precision.

SebastienThibert · 2024-09-20T17:43:14Z

Ok, I thought we could split the batch on all the GPUs. So any other tips to increase the batch size ?

SebastienThibert changed the title ~~oom issue with multiple gpus using Distributed Data Parallel (DDP) training~~ OoM issue with multiple gpus using Distributed Data Parallel (DDP) training Sep 20, 2024

SauravMaheshkar added the question label Sep 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OoM issue with multiple gpus using Distributed Data Parallel (DDP) training #1650

OoM issue with multiple gpus using Distributed Data Parallel (DDP) training #1650

SebastienThibert commented Sep 20, 2024 •

edited

Loading

guarin commented Sep 20, 2024

SebastienThibert commented Sep 20, 2024 via email

guarin commented Sep 20, 2024 •

edited

Loading

SebastienThibert commented Sep 20, 2024

OoM issue with multiple gpus using Distributed Data Parallel (DDP) training #1650

OoM issue with multiple gpus using Distributed Data Parallel (DDP) training #1650

Comments

SebastienThibert commented Sep 20, 2024 • edited Loading

guarin commented Sep 20, 2024

SebastienThibert commented Sep 20, 2024 via email

guarin commented Sep 20, 2024 • edited Loading

SebastienThibert commented Sep 20, 2024

SebastienThibert commented Sep 20, 2024 •

edited

Loading

guarin commented Sep 20, 2024 •

edited

Loading