Trainer super slow on single node, 2 GPU SLURM #773
Replies: 5 comments
-
Some more information: In the example above, I was using …
Edit: These are my SLURM parameters:
-
@LucFrachon that's really weird.
-
DP works fine as long as I specify 1 node and 2 tasks per node. But then I lose the ability to train across several nodes, which was the primary reason why I adapted my code to use PL...
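To make the two setups concrete, here is a rough sketch of the corresponding Trainer calls. Argument names have changed across Lightning versions (around 0.5.x the node count argument was `nb_gpu_nodes` rather than `num_nodes`), so treat this as illustrative rather than the poster's actual code:

```python
from pytorch_lightning import Trainer

# Single node, 2 GPUs with DataParallel -- the configuration reported to work.
trainer_dp = Trainer(gpus=2, distributed_backend="dp")

# Multi-node DDP -- the configuration needed to scale past one node.
# Requires matching SLURM settings (one task per GPU on each node).
trainer_ddp = Trainer(gpus=2, num_nodes=2, distributed_backend="ddp")
```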
-
@LucFrachon sorry to hear that. It sounds like the bottleneck isn't Lightning? If that ends up being an issue, happy to reopen. Try our new profiler?
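For anyone finding this later: in the Lightning versions that ship the built-in profiler (it arrived in releases later than the 0.5.x used in this thread, so an upgrade may be needed), enabling it is roughly a one-liner. A minimal sketch:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.profiler import AdvancedProfiler

# `profiler=True` enables the simple profiler: a per-hook timing summary
# reported once training finishes.
trainer = Trainer(gpus=2, profiler=True)

# The advanced profiler wraps each training hook in cProfile, giving a
# function-level breakdown closer to the cProfile output in the question.
trainer = Trainer(gpus=2, profiler=AdvancedProfiler())
```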
-
Didn't know about the new profiler, thanks, I'll give it a try.
-
Hi all,
I'm running neural architecture search code on a SLURM cluster (1 node, 2 GPUs). I previously ran a version of this code that didn't use Pytorch-Lightning, and it ran fine on the cluster, but since I adapted it to use PL (to benefit from multi-node training), it only runs well on my laptop (with toy examples). On the cluster it is incredibly slow.
The code needs to train hundreds of neural nets, but each takes several orders of magnitude longer to train on the cluster than on my laptop's 1050ti!
Running cProfile, I got the following results:
As you can see, the issue seems to be with `spawn.py`, which itself is called by `connection.py`, and calls `{method 'poll' of 'select.poll' objects}`. I did some research and it seems that these are part of Python's built-in multiprocessing library, but I have no idea why they are taking so much time. Does anyone have any recommendations?
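For reference, a minimal way to produce a cProfile report like the one described, assuming a hypothetical `main()` training entry point:

```python
import cProfile
import pstats

def main():
    # Hypothetical stand-in for the actual training entry point.
    pass

# Profile the run, dump the stats to a file, and print the 20 most expensive
# calls by cumulative time. Multiprocessing overhead shows up here as entries
# such as spawn.py, connection.py and the select.poll method.
cProfile.run("main()", "train.prof")
pstats.Stats("train.prof").sort_stats("cumulative").print_stats(20)
```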
What's your environment?
Linux / SLURM task manager
Miniconda, Pytorch 1.3.1, PL 0.5.3.2