How to manage submodule device manually? #15740
Replies: 4 comments
-
Bumping this once because I'm still hoping there might be a good, framework-friendly way to do this before I have to implement some sort of Frankenstein solution.
-
I'm also interested in this. I'm able to set the devices during the call to the LightningModule's `training_step`, so this bypasses the transfers Lightning does at the beginning of `fit`. However, I'm running into issues when I try to resume from a checkpoint, because my optimizer apparently keeps its internal state on the CPU while my (first) training step moves the model to my two GPUs. I'm unsure what to do at the moment. Can anything elegant/"proper" be done here?
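A minimal sketch of one common workaround pattern for the optimizer-state problem described above, not something taken from this thread: it assumes a 1.x-era Trainer where `self.trainer.optimizers` holds the raw torch optimizers, and the hook choice would have to match wherever the submodule actually gets moved.

```python
import torch

def on_train_start(self):
    # Hypothetical fix-up after a checkpoint restore: push every optimizer
    # state tensor (e.g. Adam's exp_avg buffers) onto the same device as the
    # parameter it belongs to.
    for optimizer in self.trainer.optimizers:
        for param, state in optimizer.state.items():
            for key, value in state.items():
                if torch.is_tensor(value):
                    state[key] = value.to(param.device)
```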
-
I believe model parallelism is all you need. To use Sharded Training, you first need to install FairScale using the command below.
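A sketch of what that setup typically looked like in 1.x-era PyTorch Lightning; the install command and the Trainer flags here are assumptions, not quoted from this reply:

```python
# Assumed install step:
#   pip install fairscale
import pytorch_lightning as pl

# Sharded training shards optimizer state and gradients across GPUs; it is
# enabled through the Trainer strategy rather than by moving submodules by hand.
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_sharded")
```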
-
Final bump before I go and implement this myself - it would be really cool if Lightning supported this.

Edit: For anyone else who wants to do something like this, here's how I did it for myself, by overriding `on_fit_start`:

```python
def on_fit_start(self):
    device_count = torch.cuda.device_count()
    if self.device.type == 'cuda' and device_count >= 2:
        print("Found multiple GPUs, using separate device for definition encoder")
        def_encoder_device = (self.device.index + 1) % device_count
        self.bi_encoder.definition_encoder.cuda(def_encoder_device)
```

The only other thing to be careful of is that you have to set the trainer to use only a single CUDA device (i.e. set `devices=[0]`) in order to prevent it from automatically trying to shard the model.
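For completeness, a hedged sketch of the Trainer setup this implies; the `devices=[0]` detail comes from the comment above, while the module name and the rest of the configuration are assumptions:

```python
import pytorch_lightning as pl

model = BiEncoderLightningModule()  # hypothetical module containing both encoders

# Pin the Trainer to a single CUDA device so Lightning only moves the module to
# cuda:0; on_fit_start() above then relocates the definition encoder to another
# GPU by hand.
trainer = pl.Trainer(accelerator="gpu", devices=[0])
trainer.fit(model)
```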
-
I have a PyTorch Lightning module that has two big transformer encoders, which make completely separate forward passes; the outputs are then combined to produce a final result - a standard bi-encoder. I'd like to move just one of those encoders onto a separate GPU during training so that I can train with bigger encoders. Given the pseudocode below, I'm wondering what the best way is to move just `encoder2` onto a separate device.

I looked for related discussions, but the closest I could find was this question, which talks mostly about using a DeepSpeed integration to solve the problem. I'm really hoping for a simpler approach that just lets me place one encoder on a separate GPU, and I'm fine with having to write code in my `forward()` to make sure the input/output tensors get moved to the proper devices.

Obviously calling `encoder2.to(device)` will move the encoder, but I know that Lightning moves the whole module automatically at some point during training, and I'd rather not end up with an implementation that has me working against the Lightning trainer to force `encoder2` onto a separate device or to move it back and forth.

In short, is there a PyTorch Lightning-approved way to have submodules on a separate device? I looked around for a specific callback that deals with device management, but couldn't find one.
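A rough sketch of the kind of module described above, not the asker's original pseudocode: the class name, the combination step, and the batch layout are all assumed, with `encoder2` kept as named in the post, and `forward()` showing the kind of manual tensor movement mentioned there (assuming `encoder2` has already been placed on a second GPU).

```python
import torch
import pytorch_lightning as pl

class BiEncoderModule(pl.LightningModule):
    def __init__(self, encoder1, encoder2):
        super().__init__()
        self.encoder1 = encoder1  # stays on the Trainer-managed device
        self.encoder2 = encoder2  # meant to live on a second GPU

    def forward(self, batch1, batch2):
        # encoder1 runs on whatever device Lightning moved the module to.
        out1 = self.encoder1(batch1)

        # Run encoder2 on its own device, then bring the result back so the
        # two encodings can be combined.
        enc2_device = next(self.encoder2.parameters()).device
        out2 = self.encoder2(batch2.to(enc2_device)).to(out1.device)

        # Combine the two encodings into the final score.
        return torch.sum(out1 * out2, dim=-1)
```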