OOM issue with multiple GPUs using Distributed Data Parallel (DDP) training #1650

Open · SebastienThibert opened this issue Sep 20, 2024 · 4 comments

SebastienThibert commented Sep 20, 2024

When I run this example on multiple GPUs using Distributed Data Parallel (DDP) training on AWS SageMaker with 4 GPUs and a batch_size of 8192, I get an OOM error despite the 96 GiB total capacity:

Tried to allocate 4.00 GiB. GPU 2 has a total capacity of 21.99 GiB of which 1.21 GiB is free. Including non-PyTorch memory, this process has 20.77 GiB memory in use.
guarin (Contributor) commented Sep 20, 2024

Hi, batch size 8192 is quite big even for 4 GPUs (the original paper used batch size 4096 on 128 TPUs). Are you using CIFAR-sized images (32x32) or normal ImageNet-sized images (224x224)?

SebastienThibert (Author) commented Sep 20, 2024 via email

guarin (Contributor) commented Sep 20, 2024

I just tested it on 4x24GB GPUs and it indeed fails with OOM. I had to reduce the batch size to 4096 for it to succeed. I think this is expected. Please note that the batch size is per GPU as it is set directly in the dataloader. You can train with larger batch sizes if you set precision="16-mixed" to enable half precision.
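
For reference, here is a minimal sketch of where those two settings live, assuming the example uses a PyTorch Lightning `Trainer` with a plain `DataLoader`; `dataset` and `model` below are placeholders for the example's own dataset and LightningModule:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

# Under DDP the batch size set on the dataloader is per GPU,
# so 4 GPUs x 4096 = 16384 samples per optimizer step in total.
dataloader = DataLoader(
    dataset,            # placeholder: the example's training dataset
    batch_size=4096,
    shuffle=True,
    drop_last=True,
    num_workers=8,
)

trainer = pl.Trainer(
    max_epochs=100,
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    precision="16-mixed",  # mixed precision roughly halves activation memory
)
trainer.fit(model, dataloader)  # placeholder: the example's LightningModule
```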

SebastienThibert (Author) commented:

OK, I thought the batch would be split across all the GPUs. So, any other tips to increase the batch size?
