Trainer super slow on single node, 2 GPU SLURM #773
Replies: 5 comments
-
Some more information: In the example above, I was using …
Edit: These are my SLURM parameters:
-
@LucFrachon that's really weird.
-
DP works fine as long as I specify 1 node and 2 tasks per node. But then I lose the ability to train across several nodes, which was the primary reason why I adapted my code to use PL...
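To make the two setups concrete, here is a rough sketch of the corresponding Trainer calls. Argument names have changed across Lightning versions (around 0.5.x the node count argument was `nb_gpu_nodes` rather than `num_nodes`), so treat this as illustrative rather than the poster's actual code:

```python
from pytorch_lightning import Trainer

# Single node, 2 GPUs with DataParallel -- the configuration reported to work.
trainer_dp = Trainer(gpus=2, distributed_backend="dp")

# Multi-node DDP -- the configuration needed to scale past one node.
# Requires matching SLURM settings (one task per GPU on each node).
trainer_ddp = Trainer(gpus=2, num_nodes=2, distributed_backend="ddp")
```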
-
@LucFrachon sorry to hear that. It sounds like the bottleneck isn't Lightning? If that ends up being an issue, happy to reopen. Try our new profiler?
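For anyone finding this later: in the Lightning versions that ship the built-in profiler (it arrived in releases later than the 0.5.x used in this thread, so an upgrade may be needed), enabling it is roughly a one-liner. A minimal sketch:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.profiler import AdvancedProfiler

# `profiler=True` enables the simple profiler: a per-hook timing summary
# reported once training finishes.
trainer = Trainer(gpus=2, profiler=True)

# The advanced profiler wraps each training hook in cProfile, giving a
# function-level breakdown closer to the cProfile output in the question.
trainer = Trainer(gpus=2, profiler=AdvancedProfiler())
```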
-
Didn't know about the new profiler, thanks, I'll give it a try.
-
Hi all,
I'm running neural architecture search code on a SLURM cluster (1 node, 2 GPUs). I previously ran a version of this code that didn't use Pytorch-Lightning, and it ran fine on the cluster, but since I adapted it to use PL (to benefit from multi-node training), it only runs well on my laptop (with toy examples). On the cluster it is incredibly slow.
The code needs to train hundreds of neural nets, but each takes several orders of magnitude longer to train on the cluster than on my laptop's 1050ti!
Running cProfile, I got the following results:
As you can see, the issue seems to be with `spawn.py`, which itself is called by `connection.py`, and calls `{method 'poll' of 'select.poll' objects}`. I did some research and it seems that these are part of Python's built-in multiprocessing library, but I have no idea why they are taking so much time. Does anyone have any recommendations?
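For reference, a minimal way to produce a cProfile report like the one described, assuming a hypothetical `main()` training entry point:

```python
import cProfile
import pstats

def main():
    # Hypothetical stand-in for the actual training entry point.
    pass

# Profile the run, dump the stats to a file, and print the 20 most expensive
# calls by cumulative time. Multiprocessing overhead shows up here as entries
# such as spawn.py, connection.py and the select.poll method.
cProfile.run("main()", "train.prof")
pstats.Stats("train.prof").sort_stats("cumulative").print_stats(20)
```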
What's your environment?
Linux / SLURM task manager
Miniconda, Pytorch 1.3.1, PL 0.5.3.2