Question: Pytorch Lightning + CGX? #4

Open
bjaeger1 opened this issue May 31, 2023 · 3 comments
Comments

@bjaeger1

How can I use cgx with PyTorch Lightning? Their website does not list cgx as a strategy argument that can be passed to the PyTorch Lightning trainer. However, it is possible to create a custom strategy (see: https://lightning.ai/docs/pytorch/stable/extensions/strategy.html).

@ilmarkov
Member

I think DDPStrategy has a process_group_backend argument. You can try specifying cgx there, since it is later passed to torch.distributed.init_process_group. Note that you also have to import torch_cgx.
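
A minimal sketch of the suggestion above, assuming a two-GPU single-node setup and that importing torch_cgx is what makes the "cgx" backend name available to torch.distributed. The helper names (ddp_strategy_kwargs, build_trainer) and the accelerator/devices values are illustrative, not part of CGX or Lightning.

```python
def ddp_strategy_kwargs(backend: str = "cgx") -> dict:
    """Keyword arguments for DDPStrategy.

    Lightning forwards process_group_backend to
    torch.distributed.init_process_group.
    """
    return {"process_group_backend": backend}


def build_trainer(devices: int = 2):
    """Create a Trainer that uses DDP with the CGX backend (hypothetical helper)."""
    # Imported lazily so this file can be loaded on a machine without GPUs.
    import torch_cgx  # noqa: F401  (assumed to register the "cgx" backend on import)
    import pytorch_lightning as pl

    strategy = pl.strategies.DDPStrategy(**ddp_strategy_kwargs())
    return pl.Trainer(accelerator="gpu", devices=devices, strategy=strategy)
```

Calling build_trainer() and then trainer.fit(...) as usual should be all that changes in the training script; the rest of the Lightning code stays the same.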

@bjaeger1
Author

bjaeger1 commented Jun 2, 2023

For reference:
strategy=pl.strategies.DDPStrategy(process_group_backend="cgx") works!
Also, when building Open MPI with ./configure ..., make sure to set the flag --with-cuda=/usr/local/cuda to enable CUDA-aware support in Open MPI.
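
One way to check whether an installed Open MPI was actually built with CUDA support is to inspect the output of ompi_info. The sketch below assumes ompi_info is on PATH and reports the mpi_built_with_cuda_support parameter in its parsable output; the function names are illustrative.

```python
import shutil
import subprocess

# Key reported by `ompi_info --parsable --all` for CUDA-aware builds.
CUDA_KEY = "mpi_built_with_cuda_support:value:"


def parse_cuda_support(ompi_info_output: str) -> bool:
    """Return True if the ompi_info output reports CUDA support as true."""
    for line in ompi_info_output.splitlines():
        if CUDA_KEY in line:
            return line.strip().endswith(":true")
    return False  # key not found: treat as not CUDA-aware


def openmpi_has_cuda_support() -> bool:
    """Run ompi_info (if available) and parse its output."""
    if shutil.which("ompi_info") is None:
        return False
    result = subprocess.run(
        ["ompi_info", "--parsable", "--all"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_cuda_support(result.stdout)


if __name__ == "__main__":
    print("CUDA-aware Open MPI:", openmpi_has_cuda_support())
```

Running this inside the Docker image before training gives a quick signal on whether the --with-cuda flag took effect.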

@bjaeger1
Author

bjaeger1 commented Jun 2, 2023

Update: the training works fine, but there is an error (see the message below) that prevents the Docker container from shutting down after training and logging the model.

print(trainer.world_size, trainer.local_rank, trainer.global_rank, trainer.node_rank)
prints 2 0 0 0 for the first GPU and 2 1 1 0 for the second GPU, which looks correct.

Error message:

File "/mlflow/projects/code/src/train.py", line 335, in train
trainer.fit(model, datamodule = ds_loader)  

File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(

File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)

File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)

File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1049, in _run
self.__setup_profiler()

File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1509, in __setup_profiler
self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)

File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1826, in log_dir
dirpath = self.strategy.broadcast(dirpath)

File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 314, in broadcast
torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)

File "/usr/local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1840, in broadcast_object_list
object_list[i] = _tensor_to_object(obj_view, obj_size)

File "/usr/local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1532, in _tensor_to_object
return _unpickler(io.BytesIO(buf)).load()
EOFError: Ran out of input

@ilmarkov do you have any idea what the problem could be?
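
As a reading aid for the last frame of the traceback: the standard pickle module raises "EOFError: Ran out of input" when the byte buffer it is given is empty or ends before a complete object has been read. One possible interpretation, offered as an assumption rather than a diagnosis, is that the receiving rank got an empty or incomplete buffer from broadcast_object_list. The sketch below reproduces only the pickle behaviour with the standard library; try_unpickle is a hypothetical helper and no distributed code is involved.

```python
import io
import pickle


def try_unpickle(buf: bytes) -> str:
    """Unpickle a byte buffer and describe the outcome as a string."""
    try:
        obj = pickle.Unpickler(io.BytesIO(buf)).load()
        return f"ok: {obj!r}"
    except EOFError as exc:
        return f"EOFError: {exc}"
    except pickle.UnpicklingError as exc:
        return f"UnpicklingError: {exc}"


if __name__ == "__main__":
    # A complete payload, like the log directory path Lightning broadcasts.
    print(try_unpickle(pickle.dumps("/tmp/logs")))
    # An empty buffer triggers the same message as in the traceback.
    print(try_unpickle(b""))
```

If that interpretation holds, calling torch.distributed.broadcast_object_list directly with the cgx backend, outside Lightning, might help isolate whether the backend delivers the object payload correctly.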
