DDP number of training iterations independent of #GPUs? #6337
Unanswered
timothybrooks asked this question in DDP / multi-GPU / multi-node
Replies: 2 comments 3 replies
-
Might this be related to my post, where each GPU loads the whole dataset, which would explain why the number of iterations is constant in your case? Edit: I see you already reacted to it. :)
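For reference, one way to check which situation applies is to compare per-process batch counts with and without a DistributedSampler. The sketch below uses a toy TensorDataset with illustrative sizes and a hard-coded world size; in a real DDP run the sampler is configured from the process group.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy stand-in for the real dataset; sizes are illustrative only.
dataset = TensorDataset(torch.randn(100_000, 8))
batch_size = 64

# Without a DistributedSampler every process iterates over the full dataset,
# so the batch count per epoch is the same no matter how many GPUs are used.
plain_loader = DataLoader(dataset, batch_size=batch_size)
print("batches per process, no sampler:", len(plain_loader))    # ceil(100000 / 64)

# With a DistributedSampler each rank receives ~1/num_replicas of the samples,
# so the per-process batch count shrinks with the number of GPUs.
# (num_replicas and rank are hard-coded here only to show the arithmetic.)
sampler = DistributedSampler(dataset, num_replicas=8, rank=0, shuffle=True)
dist_loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
print("batches per process, 8 replicas:", len(dist_loader))     # ceil(ceil(100000 / 8) / 64)
```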
2 replies
-
How are you setting the number of GPUs? You have to pass it to the Trainer(), otherwise it won't have any effect.
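For example (a minimal sketch, not taken from the thread: the module and data are placeholders, and the argument names depend on the Lightning version, since this discussion predates the `devices=`/`strategy=` API):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class ToyModel(pl.LightningModule):
    """Tiny module, only here to make the Trainer call concrete."""
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).pow(2).mean()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

train_loader = DataLoader(TensorDataset(torch.randn(1024, 8)), batch_size=64)

# Lightning 1.x style (current when this thread was opened): the GPU count has
# to be passed to the Trainer, otherwise training stays on a single device.
trainer = pl.Trainer(gpus=8, accelerator="ddp", max_epochs=1)

# Newer releases express the same configuration as, e.g.:
# trainer = pl.Trainer(devices=8, accelerator="gpu", strategy="ddp", max_epochs=1)

trainer.fit(ToyModel(), train_loader)
```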
1 reply
-
Hello--
I am using DDP with anywhere from 1 to 8 GPUs. The PL documentation explains that it will wrap my training DataLoader in a DistributedSampler, which gives each process a unique partition of the dataset. Since it is also my understanding that the batch size is per-GPU, I would expect the total number of iterations to decrease proportionally to the number of GPUs. However, the number of training batch iterations is constant (in my case ~100k) regardless of how many GPUs I use. It is equal to `dataset_size / batch_size`, while I expected it to be `dataset_size / (batch_size * num_gpus)`. Am I missing something here? Why isn't PL partitioning the dataset such that each process sees a unique subset?
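To make the expectation concrete, here is the arithmetic being assumed (the dataset size below is illustrative, chosen only so that `dataset_size / batch_size` lands at the ~100k figure mentioned above):

```python
import math

dataset_size = 6_400_000   # illustrative value: dataset_size / batch_size = 100,000
batch_size = 64            # per-GPU batch size

for num_gpus in (1, 2, 4, 8):
    # Under a DistributedSampler each process sees ~dataset_size / num_gpus samples,
    # so iterations per epoch should shrink proportionally with the GPU count.
    per_process_samples = math.ceil(dataset_size / num_gpus)
    expected_iterations = math.ceil(per_process_samples / batch_size)
    print(f"{num_gpus} GPU(s): expected ~{expected_iterations} iterations per epoch")
```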