Multiprocessing doesn't work! #15880
Unanswered
vadinabronin asked this question in DDP / multi-GPU / multi-node
Has no one asked this question before? I have a trainer:

trainer = pl.Trainer(
    accelerator='gpu',
    devices=2,
    max_epochs=cfg.epoch,
    logger=pl.loggers.CSVLogger(save_dir="logs/"),
    precision=16,
    strategy='dp',
)

a PyTorch Lightning module (not shared here because it is too big), and a function that computes the distillation feature-map loss (the function is declared outside the module). The dp strategy does not keep the tensors used inside that function on the same device, and I get this exception when training the model:
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 65, in forward
output = super().forward(*inputs, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 79, in forward
output = self.module.training_step(*inputs, **kwargs)
File "/tmp/ipykernel_17/687536527.py", line 43, in training_step
distill_loss = self.atloss.forward()
File "/tmp/ipykernel_17/125274374.py", line 57, in forward
teacher_feature_map)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
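A minimal sketch of what is probably going wrong, based only on the names visible in the traceback (`atloss`, `teacher_feature_map`); the class body here is hypothetical. If the loss object is a plain Python class holding a tensor that was created on cuda:0, DataParallel has no way to copy that tensor to the replica running on cuda:1:

```python
import torch


class ATLoss:
    """Plain Python object holding a pre-computed teacher feature map.

    Because this is not an nn.Module, DataParallel never replicates it,
    so the tensor it holds stays on whichever device created it.
    """

    def __init__(self, teacher_feature_map: torch.Tensor):
        self.teacher_feature_map = teacher_feature_map  # pinned to cuda:0

    def forward(self, student_feature_map: torch.Tensor) -> torch.Tensor:
        # On the replica running on cuda:1, student_feature_map lives on
        # cuda:1 while self.teacher_feature_map is still on cuda:0, which
        # raises: "Expected all tensors to be on the same device ..."
        return (student_feature_map - self.teacher_feature_map).pow(2).mean()
```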
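One way to make this work under dp, again as a sketch under the same assumptions (the real `ATLoss` signature may differ): make the loss an `nn.Module`, register the teacher tensor as a buffer so each replica receives a copy on its own GPU, and align devices explicitly inside `forward` as a fallback:

```python
import torch
import torch.nn as nn


class ATLoss(nn.Module):
    """Distillation loss as an nn.Module so dp can replicate its state."""

    def __init__(self, teacher_feature_map: torch.Tensor):
        super().__init__()
        # A registered buffer is moved/copied together with the module,
        # so each DataParallel replica gets it on its own GPU.
        self.register_buffer("teacher_feature_map", teacher_feature_map)

    def forward(self, student_feature_map: torch.Tensor) -> torch.Tensor:
        # Defensive fallback: align devices explicitly in case the loss
        # was not replicated (e.g. it is not a child of the module).
        teacher = self.teacher_feature_map.to(student_feature_map.device)
        return (student_feature_map - teacher).pow(2).mean()
```

For the buffer to be replicated, the loss must be assigned as an attribute of the LightningModule (e.g. `self.atloss = ATLoss(...)` in `__init__`). Alternatively, sidestep dp entirely with `strategy="ddp"`, which Lightning generally recommends over dp.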
-
@vadinabronin did you manage to find a solution? I am facing the same problem.