Problem in multi-node training #7275
-
Hello pytorch-lightning community,

My training hangs when training on multiple nodes; on a single node with multiple GPUs it runs fine :/

The job submission file has the corresponding line:

`srun --ntasks=8 python3 coolModel.py 2>&1 | tee log.train`

I attach the output and the code below.

Cheers,

```python
import pytorch_lightning as pl

class database(pl.LightningDataModule):
    ...

class CoolModel(pl.LightningModule):
    ...

if __name__ == '__main__':
    ...
```
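For context, here is a minimal self-contained sketch of the kind of script described above, assuming pytorch_lightning >= 1.7. The class names mirror the post, but the bodies, the random dataset, and the `devices`/`num_nodes` values are illustrative placeholders, not the author's code. The key constraint is that `num_nodes * devices` must equal the number of `srun` tasks (8 in the command above):

```python
# Illustrative sketch only -- dataset, model body and device counts are made up.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class database(pl.LightningDataModule):
    def train_dataloader(self):
        # Random placeholder data so the sketch runs end to end.
        x, y = torch.randn(256, 32), torch.randn(256, 1)
        return DataLoader(TensorDataset(x, y), batch_size=32)


class CoolModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == '__main__':
    # 4 nodes x 2 GPUs = 8 processes, i.e. it must match `srun --ntasks=8`.
    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="gpu",
        devices=2,
        num_nodes=4,
        strategy="ddp",
    )
    trainer.fit(CoolModel(), datamodule=database())
```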
-
Hello Nikos,

Do you have 8 GPUs in the node? I think it must match `gres`. Don't you also need to specify how many tasks per node in the SBATCH directive? [1]

Also, I notice some unsupported Trainer arguments in your script. It should be:

`trainer = Trainer(max_epochs=1, gpus=[0, 1, 2, 3, 4, 5, 6, 7], num_nodes=4)`

Make sure this script actually runs on CPU first before going to the cluster 😅

Totally no slurm expert here, just looking at your script with one eye closed.

[1] https://pytorch-lightning.readthedocs.io/en/latest/clouds/cluster.html#slurm-managed-cluster
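A quick way to check that the SLURM allocation matches what the Trainer expects is to run a tiny probe with the same `srun` line as the training script. This is just a sketch; it prints the SLURM variables that Lightning's SLURM integration typically reads, so a missing `--ntasks-per-node` or a wrong `--gres` shows up immediately:

```python
# Hypothetical sanity probe: launch with the same `srun ...` line as the training job.
# Each task prints the SLURM variables used to derive global/local rank and world size.
import os

for var in (
    "SLURM_JOB_ID",
    "SLURM_NNODES",
    "SLURM_NTASKS",
    "SLURM_NTASKS_PER_NODE",
    "SLURM_NODEID",
    "SLURM_PROCID",
    "SLURM_LOCALID",
):
    print(f"{var}={os.environ.get(var, '<unset>')}", flush=True)
```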
-
I want to revisit this discussion because I find myself in a similar situation that I can't get out of.
With my setup I get the following error, although the first 4 ranks (GPUs) have been initialized as follows. Can you help me?
-
I am using torchrun and getting the same error when I use:

```bash
srun torchrun \
    --nnodes $SLURM_NNODES \
    --nproc_per_node $SLURM_NTASKS_PER_NODE \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29500 \
    train_multi_ddp.py
```

whereas this one runs perfectly fine:

```bash
srun torchrun \
    --nnodes 2 \
    --nproc_per_node 1 \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29500 \
    train_multi_ddp.py
```

I also verified the variables. The SLURM script configuration is:

```bash
#SBATCH -N 2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=5
#SBATCH --gres=gpu:1
#SBATCH --job-name=multi_ddp
#SBATCH --output=train_multi_ddp.out
#SBATCH --error=train_multi_ddp.err
#SBATCH --partition=gpu
```
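One way to narrow this down is to launch a tiny probe through the exact same `srun torchrun ...` command before the real training script. Below is a minimal sketch (the file name `probe_ddp.py` is hypothetical), assuming torchrun is the launcher so `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` are set; the `gloo` backend is used so it also runs where no GPU is visible. If `cuda_devices` stays at 1 while `--nproc_per_node` is larger, the `--gres` / `--ntasks-per-node` request does not provide enough GPUs per node for the processes being spawned, which is one common reason the first launch fails while the one-process-per-node launch works:

```python
# Hypothetical probe script (e.g. probe_ddp.py), launched via `srun torchrun ... probe_ddp.py`.
# Each worker reports its rank, local rank, world size and how many GPUs it can see.
import os
import torch
import torch.distributed as dist

# torchrun exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, so the default env:// init works.
dist.init_process_group(backend="gloo")
print(
    f"rank={dist.get_rank()} "
    f"local_rank={os.environ.get('LOCAL_RANK')} "
    f"world_size={dist.get_world_size()} "
    f"cuda_devices={torch.cuda.device_count()}",
    flush=True,
)
dist.destroy_process_group()
```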