Memory issue when training in DDP mode #18525
Unanswered
happysyp000
asked this question in DDP / multi-GPU / multi-node
Hi, I have a training dataset of about 80 GB that is saved in a pandas DataFrame. My machine has 4 GPUs and around 300 GB of CPU memory. Training on 1 GPU works fine, but when I train on 4 GPUs in "ddp" mode, CPU memory usage comes dangerously close to overflowing (around 320 GB rather than 80 GB).

My datamodule is like the following:

My impression is that ddp mode creates 4 copies of the datamodule, one per GPU process, and therefore loads the dataframe 4 times, causing the memory issue. I also tried moving the loading of the dataframe into the init function of the datamodule, but the memory issue still occurs.

Since all 4 GPUs train on the same dataframe, I'd like to find a way to load it only once and keep the CPU memory from overflowing. Is there any suggestion as to what I can do?

Thanks.

Replies: 3 comments

- same issue

-

- Here is a similar question with an answer. I think it is helpful.
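One common way to address the pattern described in the question is to avoid holding the DataFrame in each DDP process at all: convert it once to a memory-mappable file on disk, then have every process open it with `mmap_mode="r"`. The OS page cache backs all processes with the same physical pages, so resident memory stays near one copy instead of one per GPU. This is only a minimal sketch under assumptions (numeric-only data, helper names like `dataframe_to_memmap` are hypothetical); in Lightning, the one-time conversion would naturally go in `prepare_data` (which runs in a single process) and the memory-mapped open in `setup` (which runs in every process).

```python
# Sketch: share one on-disk copy of the data across DDP processes via a
# memory-mapped numpy file instead of keeping a pandas DataFrame per process.
import os
import tempfile

import numpy as np
import pandas as pd


def dataframe_to_memmap(df: pd.DataFrame, path: str) -> None:
    """Run once (e.g. in LightningDataModule.prepare_data, which by default
    executes in a single process): dump the numeric DataFrame to a .npy file."""
    np.save(path, df.to_numpy(dtype=np.float32))


def open_memmap(path: str) -> np.ndarray:
    """Run in every process (e.g. in LightningDataModule.setup): open the array
    memory-mapped, so no full in-memory copy is made per process."""
    return np.load(path, mmap_mode="r")


# Toy demonstration with a small frame standing in for the 80 GB dataset.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})
path = os.path.join(tempfile.mkdtemp(), "train.npy")
dataframe_to_memmap(df, path)

arr = open_memmap(path)
print(arr.shape)         # (3, 2)
print(float(arr[0, 1]))  # 4.0
```

A `Dataset.__getitem__` can then slice rows out of the memory-mapped array on demand; only the pages actually touched are pulled into RAM, and they are shared between the 4 DDP processes.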