
The correct way to enable multi-GPU training #8

Open
sleeplessai opened this issue Nov 3, 2021 · 2 comments


@sleeplessai

Hi, @kwea123

I am running some experiments with this MVSNet implementation because of its clear and simple PyTorch Lightning wrapping.
To speed up training, I trained the model on 3 GPUs on my server, but an error came up when I simply set the hyperparameter --gpu_num to 3. PyTorch Lightning raised this message:

You seem to have configured a sampler in your DataLoader. This will be replaced by `DistributedSampler` since `replace_sampler_ddp` is True and you are using distributed training. Either remove the sampler from your DataLoader or set `replace_sampler_ddp=False` if you want to use your custom sampler.

To solve this, I modified the train.py code by setting these parameters in the PL Trainer:

trainer = Trainer(# ...
                  gpus=hparams.num_gpus,
                  replace_sampler_ddp=False,
                  distributed_backend='ddp' if hparams.num_gpus > 1 else None,
                  # ...
                  )

The model trains fine after this configuration change.

Is this the correct way to enable multi-GPU training?
For some reason, I cannot install NVIDIA Apex on my current server.
Should I use SyncBatchNorm for this model implementation, and if so, how?
Does training without SyncBN hurt performance?
If I should use it, which is preferred: `nn.SyncBatchNorm.convert_sync_batchnorm()` or PyTorch Lightning's `sync_batchnorm` option in the Trainer configuration?
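For reference, the manual conversion route is a one-liner in plain PyTorch. A minimal sketch, using a toy `nn.Sequential` standing in for the real MVSNet backbone (any model containing BatchNorm layers is converted the same way):

```python
import torch.nn as nn

# Toy model standing in for the MVSNet backbone (assumed structure);
# any module tree containing BatchNorm layers works the same way.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(inplace=True),
)

# Recursively replace every BatchNorm layer with SyncBatchNorm,
# copying over the existing weights and running statistics.
# Under DDP, the converted layers compute batch statistics across all GPUs.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# model[1] is now an nn.SyncBatchNorm instance with num_features=8.
```

Note that the converted layers only actually synchronize when running under a distributed process group; on a single process they behave like ordinary BatchNorm.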

Thanks a lot. 😊

@Geo-Tell

Geo-Tell commented Dec 3, 2021

Hello, have you solved the problem? @sleeplessai

@sleeplessai
Author

sleeplessai commented Dec 4, 2021

Hi, @geovsion.
Yes, I solved multi-GPU training by specifying the num_gpus property for the PL Trainer and adding SyncBatchNorm support. For this, I updated the main packages: PL to 0.9.0 and PyTorch to 1.6.0.
Since the author didn't give a quick reply, I manually forked the original repo to
sleeplessai/mvsnet2_pl for future maintenance. The code has been tested on a 3-GPU cluster node and works well.
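A sketch of how the pieces described above might fit together in train.py (a config fragment, not the fork's exact code; `system` and `hparams` are assumed to come from the repo's training script, and `sync_batchnorm` / `distributed_backend` follow the PL 0.9-era Trainer API):

```python
from pytorch_lightning import Trainer

# Hypothetical sketch; `system` (the LightningModule) and `hparams`
# are assumed to be defined elsewhere in train.py.
trainer = Trainer(
    max_epochs=hparams.num_epochs,
    gpus=hparams.num_gpus,
    # Use DDP only when more than one GPU is requested.
    distributed_backend='ddp' if hparams.num_gpus > 1 else None,
    # Keep the repo's custom sampler instead of an injected DistributedSampler.
    replace_sampler_ddp=False,
    # Synchronize BatchNorm statistics across GPUs.
    sync_batchnorm=True,
)
trainer.fit(system)
```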
