Batch size increase when using FSDP, rather than reducing the memory usage #16855

Luofan-KK · 2023-02-23T15:47:28Z

Luofan-KK
Feb 23, 2023

I was told that Fully Sharded Training shards the entire model across all available GPUs. So I guess given a fixed batch size, when available GPUs increase, the memory usage at each GPU will decrease.

But in my practice, when I using more GPU, the actual batch size increase: actual_batch_size= num_GPUs * given_batch_size, rather than sharding model into fine grained. I want to tune a huge model, and this must cause OOM.

I've searched similar questions. And there is an example. But how can I determine the batch size?

Following is my code:


class Model_pl(pl.LightningModule):
    def __init__(self,hidden_dim=256,nhead=8,num_layers=4,class_num=class_num):
        super().__init__()
        self.save_hyperparameters()
        self.m1=Module1(hidden_dim=hidden_dim,nhead=nhead,num_layers=num_layers)
        self.m2= Module2(hidden_dim=hidden_dim,nhead=nhead,class_num=class_num)
    def forward(self,x):
        x=self.m1(x)
        x=self.m2(x)
        return x
    def training_step(self, batch, batch_idx):
        raw_value,pad=batch
        oh_value=nnF.one_hot(raw_value.to(torch.int64),num_classes=class_num)
        processed_value,mask_matrix=generateMask(raw_value)
        pred_class=self.forward(processed_value)
        loss=focal_loss(pred_class,oh_value,mask_matrix)
        return loss
    def configure_optimizers(self,lr=1e-6,l2=1e-6):
        optimizer = torch.optim.Adam(self.trainer.model.parameters(), lr=lr,weight_decay=l2,amsgrad=True)
        return optimizer

if __name__=='__main__':
    model= Model_pl(hidden_dim=560,num_layers=12)
    dataset=MyDataset(results)
    loader=torch.utils.data.DataLoader(dataset,batch_size=2,shuffle=True,collate_fn=collate_fn1)
    strategy=pl.strategies.DDPFullyShardedNativeStrategy(cpu_offload=True)
    trainer=pl.Trainer(accelerator='gpu', devices=[1,2,3,4,5],
        amp_level="O2",amp_backend='apex',
        max_epochs=2,
        strategy=strategy,
    )
    trainer.fit(model= model, train_dataloaders=loader)

Thanks

olly-writes-code · 2024-07-31T18:17:54Z

olly-writes-code
Jul 31, 2024

Also noticing this. I would be curious why the batch size increases when I'd rather it stay at the prescribed value in the data loader

0 replies

olly-writes-code · 2024-07-31T20:00:39Z

olly-writes-code
Jul 31, 2024

Digging into the docs - I think this can be resolved by setting (in Trainer)

use_distributed_sampler = False

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Batch size increase when using FSDP, rather than reducing the memory usage #16855

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments

{{title}}

{{title}}

Select a reply

Batch size increase when using FSDP, rather than reducing the memory usage #16855

Luofan-KK Feb 23, 2023

Replies: 2 comments

olly-writes-code Jul 31, 2024

olly-writes-code Jul 31, 2024

Luofan-KK
Feb 23, 2023

olly-writes-code
Jul 31, 2024

olly-writes-code
Jul 31, 2024