DDP number of training iterations independent of #GPUs? #6337
Unanswered
timothybrooks asked this question in DDP / multi-GPU / multi-node

Hello--
I am using DDP with anywhere from 1 to 8 GPUs. The PL documentation explains that it will wrap my training DataLoader in a DistributedSampler, which gives each process a unique partition of the dataset. In that case, since it is also my understanding that the batch size is per GPU, I would expect the total number of iterations to decrease proportionally to the number of GPUs. However, the number of training batch iterations is constant (~100k in my case) regardless of how many GPUs I use. It is equal to `dataset_size / batch_size`, while I expected it to be `dataset_size / (batch_size * num_gpus)`.
Am I missing something here? Why isn't PL partitioning the dataset such that each process sees a unique subset?
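For concreteness, here is a minimal sketch (plain PyTorch, with hypothetical dataset and batch sizes) of the two counts being compared: without a `DistributedSampler` every process iterates the full dataset, while with one each rank only sees roughly `dataset_size / num_gpus` samples.

```python
# Minimal sketch (assumption: plain PyTorch, hypothetical sizes) of the
# per-process iteration counts with and without a DistributedSampler.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.zeros(100_000, 1))  # hypothetical dataset of 100k samples
batch_size = 32
num_gpus = 8

# Without a DistributedSampler, every process iterates the full dataset:
full_loader = DataLoader(dataset, batch_size=batch_size)
print(len(full_loader))  # ceil(100_000 / 32) == 3125 batches per process

# With a DistributedSampler, each rank gets ~1/num_gpus of the samples,
# so the per-rank number of batches shrinks accordingly:
sampler = DistributedSampler(dataset, num_replicas=num_gpus, rank=0)
ddp_loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
print(len(ddp_loader))   # ceil(ceil(100_000 / 8) / 32) == 391 batches per rank
```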
Replies: 2 comments 3 replies
- May it be related to my post, i.e. that each GPU is loading the whole dataset, so that the number of iterations is constant in your case? Edit: I see that you had already reacted to it. :)
2 replies
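A quick way to check whether each process really iterates the full dataset is to print the per-process batch count once training starts. This is a hypothetical diagnostic hook added to the existing LightningModule; `trainer.num_training_batches` is the attribute the Lightning 1.x progress bar reads, so the exact name may differ across versions.

```python
# Hypothetical diagnostic: print how many batches this process will actually run.
# If DDP partitioning is active, the count should shrink as GPUs are added;
# if every GPU loads the whole dataset, it stays at dataset_size / batch_size.
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def on_train_start(self):
        print(f"global_rank={self.global_rank}: "
              f"{self.trainer.num_training_batches} training batches per epoch")
```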
- How are you setting the number of GPUs? You have to pass it to the Trainer(), otherwise it won't have any effect.
1 reply
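For completeness, a minimal sketch of passing the GPU count to the Trainer. The argument names are version-dependent (roughly `gpus=` / `accelerator='ddp'` in the Lightning 1.2 era of this discussion, `devices=` / `strategy='ddp'` in later releases), and `model` / `train_loader` are assumed to be defined elsewhere.

```python
# Minimal sketch: the number of GPUs and the DDP mode are arguments of the
# Trainer itself; otherwise training runs in a single process.
import pytorch_lightning as pl

# Lightning ~1.2 style (contemporary with this discussion):
trainer = pl.Trainer(gpus=8, accelerator="ddp", max_epochs=10)

# Later releases use different argument names (assumption: 1.7+ / 2.x):
# trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp", max_epochs=10)

trainer.fit(model, train_loader)  # model and train_loader defined elsewhere
```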