Memory issue when training in DDP mode #18525
Unanswered
happysyp000
asked this question in DDP / multi-GPU / multi-node
Hi, I have a training dataset of about 80 GB that is saved in a pandas DataFrame. My machine has 4 GPUs and around 300 GB of CPU memory. Training on 1 GPU works fine, but when I train on 4 GPUs in "ddp" mode, CPU memory usage comes dangerously close to overflowing (around 320 GB rather than 80 GB).

My datamodule is like the following:

My impression is that ddp mode creates 4 copies of the datamodule, one per GPU process, and therefore loads the dataframe 4 times, causing the memory issue. I also tried moving the loading of the dataframe into the init function of the datamodule, but the memory issue still occurs.

Since all 4 GPUs train on the same dataframe, I'd like to find a way to load it only once and keep the CPU memory from overflowing. Is there any suggestion as to what I can do?

Thanks.

Replies: 3 comments

- same issue

-

- Here is a similar question with an answer. I think it is helpful.
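One common way to address the pattern described in the question is to avoid holding the DataFrame in each DDP process at all: convert it once to a memory-mappable file on disk, then have every process open it with `mmap_mode="r"`. The OS page cache backs all processes with the same physical pages, so resident memory stays near one copy instead of one per GPU. This is only a minimal sketch under assumptions (numeric-only data, helper names like `dataframe_to_memmap` are hypothetical); in Lightning, the one-time conversion would naturally go in `prepare_data` (which runs in a single process) and the memory-mapped open in `setup` (which runs in every process).

```python
# Sketch: share one on-disk copy of the data across DDP processes via a
# memory-mapped numpy file instead of keeping a pandas DataFrame per process.
import os
import tempfile

import numpy as np
import pandas as pd


def dataframe_to_memmap(df: pd.DataFrame, path: str) -> None:
    """Run once (e.g. in LightningDataModule.prepare_data, which by default
    executes in a single process): dump the numeric DataFrame to a .npy file."""
    np.save(path, df.to_numpy(dtype=np.float32))


def open_memmap(path: str) -> np.ndarray:
    """Run in every process (e.g. in LightningDataModule.setup): open the array
    memory-mapped, so no full in-memory copy is made per process."""
    return np.load(path, mmap_mode="r")


# Toy demonstration with a small frame standing in for the 80 GB dataset.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})
path = os.path.join(tempfile.mkdtemp(), "train.npy")
dataframe_to_memmap(df, path)

arr = open_memmap(path)
print(arr.shape)         # (3, 2)
print(float(arr[0, 1]))  # 4.0
```

A `Dataset.__getitem__` can then slice rows out of the memory-mapped array on demand; only the pages actually touched are pulled into RAM, and they are shared between the 4 DDP processes.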