
Multi-Node Training Fails with NCCL Communication Errors on NVIDIA DGX Cloud #3426

mahdip72 opened this issue Mar 7, 2025 · 0 comments

mahdip72 commented Mar 7, 2025

Hi,

I’m encountering an issue when running multi-node training with the Hugging Face Accelerate library on NVIDIA DGX Cloud. The setup works perfectly for multi-GPU training on a single node, but fails when extended to multiple nodes because of NCCL communication errors between the nodes. This is critical for us, as 90% of our projects rely on Accelerate for distributed training.

Environment

  • Platform: NVIDIA DGX Cloud
  • Library: Hugging Face Accelerate (latest version as of March 2025)
  • Backend: PyTorch with NCCL
  • Slurm Cluster Setup: 2 nodes, 2 GPUs per node (A100 80GB)
  • Scheduler: Slurm

Problem

When running the attached script on two nodes, the job fails with NCCL errors indicating that the nodes cannot communicate with each other; the main error in the log points to NCCL initialization failing. The full error log is attached below.

Interestingly, the same script runs successfully on a single node with multiple GPUs, suggesting the issue is specific to multi-node setups.
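
To help narrow this down, here is a minimal torch.distributed sanity check I can run with the same launcher (a sketch on my side, not part of the failing script; it assumes the launcher exports RANK, WORLD_SIZE and LOCAL_RANK, as accelerate launch/torchrun normally do). If this also fails across two nodes, the issue would be in the NCCL/network layer rather than in Accelerate itself:

import os
import torch
import torch.distributed as dist

def main():
    # The launcher (accelerate launch / torchrun) provides RANK, WORLD_SIZE,
    # LOCAL_RANK, MASTER_ADDR and MASTER_PORT, so init can read them from env.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A single all_reduce is enough to exercise the inter-node NCCL transport.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce ok, value={t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()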

Steps to Reproduce

  1. Use the attached Python script (accelerate_test.py).
  2. Submit the job using the Slurm batch script below.
  3. Observe the failure in the output log.

I am testing with this simple training loop written with Accelerate:

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

def main():
    # Initialize the accelerator. This will handle multi-node, multi-GPU setup.
    accelerator = Accelerator()
    
    # Display the total number of GPUs being used (only from the main process)
    if accelerator.is_main_process:
        print(f"Total number of GPUs being used: {accelerator.num_processes}")
    
    # Create a simple model (e.g., a two-layer MLP)
    model = nn.Sequential(
        nn.Linear(10, 50),
        nn.ReLU(),
        nn.Linear(50, 1)
    )
    
    # Set up an optimizer
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    
    # Create a dummy dataset: 1000 samples, 10 features each
    x = torch.randn(1000, 10)
    y = torch.randn(1000, 1)
    dataset = TensorDataset(x, y)
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    
    # Prepare everything for distributed training (model, optimizer, dataloader)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    
    # Training loop
    num_epochs = 5
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        model.train()
        for batch in dataloader:
            inputs, targets = batch
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = nn.functional.mse_loss(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            epoch_loss += loss.item()
        
        avg_loss = epoch_loss / len(dataloader)
        # Only the local main process prints, so each node reports once
        # instead of every rank
        if accelerator.is_local_main_process:
            print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

if __name__ == '__main__':
    main()

Here is my Slurm script, inspired by this example:

#!/bin/bash

#SBATCH -D .
#SBATCH --exclusive
#SBATCH --gpus-per-node=2
#SBATCH --ntasks-per-node=1
#SBATCH --partition=gpu
#SBATCH --nodes 2
#SBATCH --time 00-20:00:00
#SBATCH --job-name test

module load nvhpc-pmix/24.3
module load cuda/11.8

# Activate the environment
source /home/mis-kvsmps/environments/joint_training/bin/activate

export GPUS_PER_NODE=2

# Get the rendezvous host (first node in the allocation)
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export RDZV_PORT=29530

echo "Rendezvous Node IP: $head_node_ip"

export NCCL_DEBUG=INFO

export LAUNCHER="accelerate launch \
    --num_processes $((SLURM_NNODES * GPUS_PER_NODE)) \
    --num_machines $SLURM_NNODES \
    --rdzv_backend c10d \
    --main_process_ip $head_node_ip \
    --main_process_port $RDZV_PORT \
    "

export SCRIPT="accelerate_test.py"
export CMD="$LAUNCHER $SCRIPT"
srun $CMD

Here is the error log I got:

Rendezvous Node IP: dgx02
Total number of GPUs being used: 4
dgx02:219957:219957 [0] NCCL INFO Bootstrap : Using ibp12s0:100.126.5.2<0>
dgx02:219957:219957 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
dgx02:219957:219957 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
dgx02:219957:219957 [0] NCCL INFO NET/Plugin: Using internal network plugin.
dgx02:219957:219957 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.21.5+cuda11.0
dgx02:219958:219958 [1] NCCL INFO cudaDriverVersion 12020
dgx02:219958:219958 [1] NCCL INFO Bootstrap : Using ibp12s0:100.126.5.2<0>
dgx02:219958:219958 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
dgx02:219958:219958 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
dgx02:219958:219958 [1] NCCL INFO NET/Plugin: Using internal network plugin.
dgx02:219957:220018 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/RoCE [6]mlx5_6:1/IB [7]mlx5_7:1/IB [8]mlx5_8:1/IB [9]mlx5_9:1/IB [10]mlx5_10:1/IB [11]mlx5_11:1/RoCE [RO]; OOB ibp12s0:100.126.5.2<0>
dgx02:219957:220018 [0] NCCL INFO Using non-device net plugin version 0
dgx02:219957:220018 [0] NCCL INFO Using network IB
dgx02:219958:220019 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/RoCE [6]mlx5_6:1/IB [7]mlx5_7:1/IB [8]mlx5_8:1/IB [9]mlx5_9:1/IB [10]mlx5_10:1/IB [11]mlx5_11:1/RoCE [RO]; OOB ibp12s0:100.126.5.2<0>
dgx02:219958:220019 [1] NCCL INFO Using non-device net plugin version 0
dgx02:219958:220019 [1] NCCL INFO Using network IB
dgx03:2619044:2619044 [0] NCCL INFO cudaDriverVersion 12020
dgx03:2619044:2619044 [0] NCCL INFO Bootstrap : Using ibp12s0:100.126.5.3<0>
dgx03:2619044:2619044 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
dgx03:2619044:2619044 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
dgx03:2619044:2619044 [0] NCCL INFO NET/Plugin: Using internal network plugin.
dgx03:2619045:2619045 [1] NCCL INFO cudaDriverVersion 12020
dgx03:2619045:2619045 [1] NCCL INFO Bootstrap : Using ibp12s0:100.126.5.3<0>
dgx03:2619045:2619045 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
dgx03:2619045:2619045 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
dgx03:2619045:2619045 [1] NCCL INFO NET/Plugin: Using internal network plugin.
dgx03:2619044:2619065 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp12s0:100.126.5.3<0>
dgx03:2619044:2619065 [0] NCCL INFO Using non-device net plugin version 0
dgx03:2619044:2619065 [0] NCCL INFO Using network IB
dgx03:2619045:2619066 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp12s0:100.126.5.3<0>
dgx03:2619045:2619066 [1] NCCL INFO Using non-device net plugin version 0
dgx03:2619045:2619066 [1] NCCL INFO Using network IB

dgx03:2619044:2619065 [0] misc/socket.cc:533 NCCL WARN socketPollConnect: Connect to 100.126.5.2<40395> returned 113(No route to host) errno 115(Operation now in progress)

dgx03:2619045:2619066 [1] misc/socket.cc:533 NCCL WARN socketPollConnect: Connect to 100.126.5.2<40395> returned 113(No route to host) errno 115(Operation now in progress)
dgx03:2619044:2619065 [0] NCCL INFO misc/socket.cc:570 -> 2
dgx03:2619045:2619066 [1] NCCL INFO misc/socket.cc:570 -> 2
dgx03:2619044:2619065 [0] NCCL INFO misc/socket.cc:621 -> 2
dgx03:2619045:2619066 [1] NCCL INFO misc/socket.cc:621 -> 2
dgx03:2619044:2619065 [0] NCCL INFO bootstrap.cc:285 -> 2
dgx03:2619045:2619066 [1] NCCL INFO bootstrap.cc:285 -> 2
dgx03:2619044:2619065 [0] NCCL INFO init.cc:1534 -> 2
dgx03:2619045:2619066 [1] NCCL INFO init.cc:1534 -> 2
dgx03:2619045:2619066 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
dgx03:2619044:2619065 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
dgx03:2619045:2619045 [1] NCCL INFO group.cc:418 -> 2
dgx03:2619045:2619045 [1] NCCL INFO init.cc:1929 -> 2
dgx03:2619044:2619044 [0] NCCL INFO group.cc:418 -> 2
dgx03:2619044:2619044 [0] NCCL INFO init.cc:1929 -> 2
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/mis-kvsmps/projects/JointTraining/accelerate_test.py", line 53, in <module>
[rank2]:     main()
[rank2]:   File "/home/mis-kvsmps/projects/JointTraining/accelerate_test.py", line 31, in main
[rank2]:     model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
[rank2]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1339, in prepare
[rank2]:     result = tuple(
[rank2]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank2]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank2]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank2]:     return self.prepare_model(obj, device_placement=device_placement)
[rank2]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1469, in prepare_model
[rank2]:     model = torch.nn.parallel.DistributedDataParallel(
[rank2]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank2]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank2]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank2]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank2]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank2]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank2]: Last error:
[rank2]: socketPollConnect: Connect to 100.126.5.2<40395> returned 113(No route to host) errno 115(Operation now in progress)
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/mis-kvsmps/projects/JointTraining/accelerate_test.py", line 53, in <module>
[rank3]:     main()
[rank3]:   File "/home/mis-kvsmps/projects/JointTraining/accelerate_test.py", line 31, in main
[rank3]:     model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
[rank3]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1339, in prepare
[rank3]:     result = tuple(
[rank3]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank3]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank3]:     return self.prepare_model(obj, device_placement=device_placement)
[rank3]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1469, in prepare_model
[rank3]:     model = torch.nn.parallel.DistributedDataParallel(
[rank3]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank3]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank3]:   File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank3]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank3]: Last error:
[rank3]: socketPollConnect: Connect to 100.126.5.2<40395> returned 113(No route to host) errno 115(Operation now in progress)
W0306 16:29:16.786000 2618773 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2619044 closing signal SIGTERM
E0306 16:29:17.150000 2618773 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 2619045) of binary: /home/mis-kvsmps/environments/joint_training/bin/python
Traceback (most recent call last):
  File "/home/mis-kvsmps/environments/joint_training/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1163, in launch_command
    multi_gpu_launcher(args)
  File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/commands/launch.py", line 792, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
accelerate_test.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-03-06_16:29:16
  host      : dgx03.cm.cluster
  rank      : 3 (local_rank: 1)
  exitcode  : 1 (pid: 2619045)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: dgx03: task 1: Exited with exit code 1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 15424 ON dgx02 CANCELLED AT 2025-03-06T16:35:38 ***
slurmstepd: error: *** STEP 15424.0 ON dgx02 CANCELLED AT 2025-03-06T16:35:38 ***
W0306 16:35:38.113000 219705 torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0306 16:35:38.114000 219705 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 219957 closing signal SIGTERM
W0306 16:35:38.115000 219705 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 219958 closing signal SIGTERM
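
From the NCCL trace above, the concrete failure is a plain TCP connect from dgx03 to 100.126.5.2 (the ibp12s0 address on dgx02) during NCCL bootstrap, which returns "No route to host". In case it helps the discussion, this is the kind of interface pinning I can experiment with on my side (a sketch only; the interface name below is a placeholder, and picking an interface that the nodes can actually reach each other on is exactly what I am unsure about):

import os

# Assumption / experiment, not a confirmed fix: force NCCL's bootstrap and
# socket traffic onto a specific, mutually reachable interface. NCCL reads
# these variables when the communicator is created, so setting them at the
# top of accelerate_test.py (before Accelerator()) should take effect.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # placeholder interface name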

Could someone take a look at this and help identify what’s going wrong with the multi-node setup? Any guidance on resolving the NCCL communication issue would be greatly appreciated, as we’re heavily reliant on Accelerate for our projects on NVIDIA DGX Cloud.

Thanks!
