
Multi-Node Training Fails with NCCL Communication Errors on NVIDIA DGX Cloud #3426

mahdip72 opened this issue Mar 7, 2025 · 0 comments

mahdip72 commented Mar 7, 2025

Hi,

I’m encountering an issue when running multi-node training with the Hugging Face Accelerate library on NVIDIA DGX Cloud. The setup works perfectly for multi-GPU training on a single node, but fails when extended to multiple nodes because of NCCL communication errors between the nodes. This is critical for us, as 90% of our projects rely on Accelerate for distributed training.

Environment

  • Platform: NVIDIA DGX Cloud
  • Library: Hugging Face Accelerate (latest version as of March 2025)
  • Backend: PyTorch with NCCL
  • Slurm Cluster Setup: 2 nodes, 2 GPUs per node (A100 80GB)
  • Scheduler: Slurm

Problem

When running the attached script on two nodes, the job fails with NCCL errors indicating that the nodes cannot communicate with each other; the main error in the log points to NCCL initialization failing. The full error log is attached below.

Interestingly, the same script runs successfully on a single node with multiple GPUs, suggesting the issue is specific to multi-node setups.
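
To help narrow this down, here is a minimal torch.distributed sanity check I can run with the same launcher (a sketch on my side, not part of the failing script; it assumes the launcher exports RANK, WORLD_SIZE and LOCAL_RANK, as accelerate launch/torchrun normally do). If this also fails across two nodes, the issue would be in the NCCL/network layer rather than in Accelerate itself:

import os
import torch
import torch.distributed as dist

def main():
    # The launcher (accelerate launch / torchrun) provides RANK, WORLD_SIZE,
    # LOCAL_RANK, MASTER_ADDR and MASTER_PORT, so init can read them from env.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A single all_reduce is enough to exercise the inter-node NCCL transport.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce ok, value={t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()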

Steps to Reproduce

  1. Use the attached Python script (accelerate_test.py).
  2. Submit the job using the Slurm batch script below.
  3. Observe the failure in the output log.

I am testing with this simple training loop written with Accelerate:

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

def main():
    # Initialize the accelerator. This will handle multi-node, multi-GPU setup.
    accelerator = Accelerator()
    
    # Display the total number of GPUs being used (only from the main process)
    if accelerator.is_main_process:
        print(f"Total number of GPUs being used: {accelerator.num_processes}")
    
    # Create a simple model (e.g., a two-layer MLP)
    model = nn.Sequential(
        nn.Linear(10, 50),
        nn.ReLU(),
        nn.Linear(50, 1)
    )
    
    # Set up an optimizer
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    
    # Create a dummy dataset: 1000 samples, 10 features each
    x = torch.randn(1000, 10)
    y = torch.randn(1000, 1)
    dataset = TensorDataset(x, y)
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    
    # Prepare everything for distributed training (model, optimizer, dataloader)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    
    # Training loop
    num_epochs = 5
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        model.train()
        for batch in dataloader:
            inputs, targets = batch
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = nn.functional.mse_loss(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            epoch_loss += loss.item()
        
        avg_loss = epoch_loss / len(dataloader)
        # Only the local main process prints, so each node reports once
        # instead of every rank
        if accelerator.is_local_main_process:
            print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

if __name__ == '__main__':
    main()

Here is my Slurm script, inspired by this example:

#!/bin/bash

#SBATCH -D .
#SBATCH --exclusive
#SBATCH --gpus-per-node=2
#SBATCH --ntasks-per-node=1
#SBATCH --partition=gpu
#SBATCH --nodes 2
#SBATCH --time 00-20:00:00
#SBATCH --job-name test

module load nvhpc-pmix/24.3
module load cuda/11.8

# Activate the environment
source /home/mis-kvsmps/environments/joint_training/bin/activate

export GPUS_PER_NODE=2

# Get the rendezvous host (first node in the allocation)
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export RDZV_PORT=29530

echo "Rendezvous Node IP: $head_node_ip"

export NCCL_DEBUG=INFO

export LAUNCHER="accelerate launch \
    --num_processes $((SLURM_NNODES * GPUS_PER_NODE)) \
    --num_machines $SLURM_NNODES \
    --rdzv_backend c10d \
    --main_process_ip $head_node_ip \
    --main_process_port $RDZV_PORT \
    "

export SCRIPT="accelerate_test.py"
export CMD="$LAUNCHER $SCRIPT"
srun $CMD

Here is the error log I got:

Rendezvous Node IP: dgx02
Total number of GPUs being used: 4
dgx02:219957:219957 [0] NCCL INFO Bootstrap : Using ibp12s0:100.126.5.2<0>
dgx02:219957:219957 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
dgx02:219957:219957 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
dgx02:219957:219957 [0] NCCL INFO NET/Plugin: Using internal network plugin.
dgx02:219957:219957 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.21.5+cuda11.0
dgx02:219958:219958 [1] NCCL INFO cudaDriverVersion 12020
dgx02:219958:219958 [1] NCCL INFO Bootstrap : Using ibp12s0:100.126.5.2<0>
dgx02:219958:219958 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
dgx02:219958:219958 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
dgx02:219958:219958 [1] NCCL INFO NET/Plugin: Using internal network plugin.
dgx02:219957:220018 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/RoCE [6]mlx5_6:1/IB [7]mlx5_7:1/IB [8]mlx5_8:1/IB [9]mlx5_9:1/IB [10]mlx5_10:1/IB [11]mlx5_11:1/RoCE [RO]; OOB ibp12s0:100.126.5.2<0>
dgx02:219957:220018 [0] NCCL INFO Using non-device net plugin version 0
dgx02:219957:220018 [0] NCCL INFO Using network IB
dgx02:219958:220019 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/RoCE [6]mlx5_6:1/IB [7]mlx5_7:1/IB [8]mlx5_8:1/IB [9]mlx5_9:1/IB [10]mlx5_10:1/IB [11]mlx5_11:1/RoCE [RO]; OOB ibp12s0:100.126.5.2<0>
dgx02:219958:220019 [1] NCCL INFO Using non-device net plugin version 0
dgx02:219958:220019 [1] NCCL INFO Using network IB
dgx03:2619044:2619044 [0] NCCL INFO cudaDriverVersion 12020
dgx03:2619044:2619044 [0] NCCL INFO Bootstrap : Using ibp12s0:100.126.5.3<0>
dgx03:2619044:2619044 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
dgx03:2619044:2619044 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
dgx03:2619044:2619044 [0] NCCL INFO NET/Plugin: Using internal network plugin.
dgx03:2619045:2619045 [1] NCCL INFO cudaDriverVersion 12020
dgx03:2619045:2619045 [1] NCCL INFO Bootstrap : Using ibp12s0:100.126.5.3<0>
dgx03:2619045:2619045 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
dgx03:2619045:2619045 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
dgx03:2619045:2619045 [1] NCCL INFO NET/Plugin: Using internal network plugin.
dgx03:2619044:2619065 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp12s0:100.126.5.3<0>
dgx03:2619044:2619065 [0] NCCL INFO Using non-device net plugin version 0
dgx03:2619044:2619065 [0] NCCL INFO Using network IB
dgx03:2619045:2619066 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp12s0:100.126.5.3<0>
dgx03:2619045:2619066 [1] NCCL INFO Using non-device net plugin version 0
dgx03:2619045:2619066 [1] NCCL INFO Using network IB

dgx03:2619044:2619065 [0] misc/socket.cc:533 NCCL WARN socketPollConnect: Connect to 100.126.5.2<40395> returned 113(No route to host) errno 115(Operation now in progress)

dgx03:2619045:2619066 [1] misc/socket.cc:533 NCCL WARN socketPollConnect: Connect to 100.126.5.2<40395> returned 113(No route to host) errno 115(Operation now in progress)
dgx03:2619044:2619065 [0] NCCL INFO misc/socket.cc:570 -> 2
dgx03:2619045:2619066 [1] NCCL INFO misc/socket.cc:570 -> 2
dgx03:2619044:2619065 [0] NCCL INFO misc/socket.cc:621 -> 2
dgx03:2619045:2619066 [1] NCCL INFO misc/socket.cc:621 -> 2
dgx03:2619044:2619065 [0] NCCL INFO bootstrap.cc:285 -> 2
dgx03:2619045:2619066 [1] NCCL INFO bootstrap.cc:285 -> 2
dgx03:2619044:2619065 [0] NCCL INFO init.cc:1534 -> 2
dgx03:2619045:2619066 [1] NCCL INFO init.cc:1534 -> 2
dgx03:2619045:2619066 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
dgx03:2619044:2619065 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
dgx03:2619045:2619045 [1] NCCL INFO group.cc:418 -> 2
dgx03:2619045:2619045 [1] NCCL INFO init.cc:1929 -> 2
dgx03:2619044:2619044 [0] NCCL INFO group.cc:418 -> 2
dgx03:2619044:2619044 [0] NCCL INFO init.cc:1929 -> 2
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/mis-kvsmps/projects/JointTraining/accelerate_test.py", line 53, in <module>
[rank2]:     main()
[rank2]:   File "/home/mis-kvsmps/projects/JointTraining/accelerate_test.py", line 31, in main
[rank2]:     model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
[rank2]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1339, in prepare
[rank2]:     result = tuple(
[rank2]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank2]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank2]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank2]:     return self.prepare_model(obj, device_placement=device_placement)
[rank2]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1469, in prepare_model
[rank2]:     model = torch.nn.parallel.DistributedDataParallel(
[rank2]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank2]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank2]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank2]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank2]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank2]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank2]: Last error:
[rank2]: socketPollConnect: Connect to 100.126.5.2<40395> returned 113(No route to host) errno 115(Operation now in progress)
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/mis-kvsmps/projects/JointTraining/accelerate_test.py", line 53, in <module>
[rank3]:     main()
[rank3]:   File "/home/mis-kvsmps/projects/JointTraining/accelerate_test.py", line 31, in main
[rank3]:     model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
[rank3]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1339, in prepare
[rank3]:     result = tuple(
[rank3]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank3]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank3]:     return self.prepare_model(obj, device_placement=device_placement)
[rank3]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1469, in prepare_model
[rank3]:     model = torch.nn.parallel.DistributedDataParallel(
[rank3]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank3]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank3]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank3]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank3]: Last error:
[rank3]: socketPollConnect: Connect to 100.126.5.2<40395> returned 113(No route to host) errno 115(Operation now in progress)
W0306 16:29:16.786000 2618773 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2619044 closing signal SIGTERM
E0306 16:29:17.150000 2618773 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 2619045) of binary: /home/mis-kvsmps/environments/joint_training/bin/python
Traceback (most recent call last):
  File "/home/mis-kvsmps/environments/joint_training/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1163, in launch_command
    multi_gpu_launcher(args)
  File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/commands/launch.py", line 792, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
accelerate_test.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-03-06_16:29:16
  host      : dgx03.cm.cluster
  rank      : 3 (local_rank: 1)
  exitcode  : 1 (pid: 2619045)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: dgx03: task 1: Exited with exit code 1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 15424 ON dgx02 CANCELLED AT 2025-03-06T16:35:38 ***
slurmstepd: error: *** STEP 15424.0 ON dgx02 CANCELLED AT 2025-03-06T16:35:38 ***
W0306 16:35:38.113000 219705 torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0306 16:35:38.114000 219705 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 219957 closing signal SIGTERM
W0306 16:35:38.115000 219705 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 219958 closing signal SIGTERM
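
From the NCCL trace above, the concrete failure is a plain TCP connect from dgx03 to 100.126.5.2 (the ibp12s0 address on dgx02) during NCCL bootstrap, which returns "No route to host". In case it helps the discussion, this is the kind of interface pinning I can experiment with on my side (a sketch only; the interface name below is a placeholder, and picking an interface that the nodes can actually reach each other on is exactly what I am unsure about):

import os

# Assumption / experiment, not a confirmed fix: force NCCL's bootstrap and
# socket traffic onto a specific, mutually reachable interface. NCCL reads
# these variables when the communicator is created, so setting them at the
# top of accelerate_test.py (before Accelerator()) should take effect.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # placeholder interface name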

Could someone take a look at this and help identify what’s going wrong with the multi-node setup? Any guidance on resolving the NCCL communication issue would be greatly appreciated, as we’re heavily reliant on Accelerate for our projects on NVIDIA DGX Cloud.

Thanks!
