Problem in multi-node training #7275
-
Hello pytorch-lightning community,

My training hangs when training on multiple nodes; on a single node with multiple GPUs it runs fine :/

The job submission file has the corresponding line:

`srun --ntasks=8 python3 coolModel.py 2>&1 | tee log.train`

I attach the output and the code below.

Cheers,

```python
import pytorch_lightning as pl

class database(pl.LightningDataModule):
    ...

class CoolModel(pl.LightningModule):
    ...

if __name__ == '__main__':
    ...
```
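For context, here is a minimal self-contained sketch of the kind of script described above, assuming pytorch_lightning >= 1.7. The class names mirror the post, but the bodies, the random dataset, and the `devices`/`num_nodes` values are illustrative placeholders, not the author's code. The key constraint is that `num_nodes * devices` must equal the number of `srun` tasks (8 in the command above):

```python
# Illustrative sketch only -- dataset, model body and device counts are made up.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class database(pl.LightningDataModule):
    def train_dataloader(self):
        # Random placeholder data so the sketch runs end to end.
        x, y = torch.randn(256, 32), torch.randn(256, 1)
        return DataLoader(TensorDataset(x, y), batch_size=32)


class CoolModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == '__main__':
    # 4 nodes x 2 GPUs = 8 processes, i.e. it must match `srun --ntasks=8`.
    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="gpu",
        devices=2,
        num_nodes=4,
        strategy="ddp",
    )
    trainer.fit(CoolModel(), datamodule=database())
```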
-
Hello Nikos,

Do you have 8 GPUs in the node? I think it must match `gres`. Don't you also need to specify how many tasks per node in the SBATCH directive? [1]

Also, I notice some unsupported Trainer arguments in your script. It should be:

`trainer = Trainer(max_epochs=1, gpus=[0, 1, 2, 3, 4, 5, 6, 7], num_nodes=4)`

Make sure this script actually runs on CPU first before going to the cluster 😅

Totally no slurm expert here, just looking at your script with one eye closed.

[1] https://pytorch-lightning.readthedocs.io/en/latest/clouds/cluster.html#slurm-managed-cluster
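A quick way to check that the SLURM allocation matches what the Trainer expects is to run a tiny probe with the same `srun` line as the training script. This is just a sketch; it prints the SLURM variables that Lightning's SLURM integration typically reads, so a missing `--ntasks-per-node` or a wrong `--gres` shows up immediately:

```python
# Hypothetical sanity probe: launch with the same `srun ...` line as the training job.
# Each task prints the SLURM variables used to derive global/local rank and world size.
import os

for var in (
    "SLURM_JOB_ID",
    "SLURM_NNODES",
    "SLURM_NTASKS",
    "SLURM_NTASKS_PER_NODE",
    "SLURM_NODEID",
    "SLURM_PROCID",
    "SLURM_LOCALID",
):
    print(f"{var}={os.environ.get(var, '<unset>')}", flush=True)
```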
-
I want to revisit this discussion because I find myself in a similar situation that I can't get out of.
With my setup I get the following error, although the first 4 ranks (GPUs) have been initialized as follows. Can you help me?
-
I am using torchrun and getting the same error when I use:

```bash
srun torchrun \
    --nnodes $SLURM_NNODES \
    --nproc_per_node $SLURM_NTASKS_PER_NODE \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29500 \
    train_multi_ddp.py
```

whereas this one runs perfectly fine:

```bash
srun torchrun \
    --nnodes 2 \
    --nproc_per_node 1 \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29500 \
    train_multi_ddp.py
```

I also verified the variables. The SLURM script configuration is:

```bash
#SBATCH -N 2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=5
#SBATCH --gres=gpu:1
#SBATCH --job-name=multi_ddp
#SBATCH --output=train_multi_ddp.out
#SBATCH --error=train_multi_ddp.err
#SBATCH --partition=gpu
```
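One way to narrow this down is to launch a tiny probe through the exact same `srun torchrun ...` command before the real training script. Below is a minimal sketch (the file name `probe_ddp.py` is hypothetical), assuming torchrun is the launcher so `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` are set; the `gloo` backend is used so it also runs where no GPU is visible. If `cuda_devices` stays at 1 while `--nproc_per_node` is larger, the `--gres` / `--ntasks-per-node` request does not provide enough GPUs per node for the processes being spawned, which is one common reason the first launch fails while the one-process-per-node launch works:

```python
# Hypothetical probe script (e.g. probe_ddp.py), launched via `srun torchrun ... probe_ddp.py`.
# Each worker reports its rank, local rank, world size and how many GPUs it can see.
import os
import torch
import torch.distributed as dist

# torchrun exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, so the default env:// init works.
dist.init_process_group(backend="gloo")
print(
    f"rank={dist.get_rank()} "
    f"local_rank={os.environ.get('LOCAL_RANK')} "
    f"world_size={dist.get_world_size()} "
    f"cuda_devices={torch.cuda.device_count()}",
    flush=True,
)
dist.destroy_process_group()
```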