Hi,
I’m encountering an issue when running multi-node training with the Hugging Face Accelerate library on NVIDIA DGX Cloud. The setup works perfectly for multi-GPU training on a single node, but fails with NCCL communication errors as soon as it is extended to multiple nodes. This is critical for us, as 90% of our projects rely on Accelerate for distributed training.
Environment
Platform: NVIDIA DGX Cloud
Library: Hugging Face Accelerate (latest version as of March 2025)
Problem
When running the attached script on two nodes, the job fails with NCCL-related errors indicating a communication failure between the nodes. The main error in the log points to NCCL initialization issues; the full error log is attached below.
Interestingly, the same script runs successfully on a single node with multiple GPUs, suggesting the issue is specific to multi-node setups.
Steps to Reproduce
Use the attached Python script (accelerate_test.py).
Submit the job using the Slurm batch script below (see the submission sketch after this list).
Observe the failure in the output log.
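Concretely, submission is just sbatch on the batch script; the filename here is only a placeholder for the Slurm script shown further down:

sbatch accelerate_multinode.sbatch   # placeholder filename for the Slurm script below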
I am testing with this simple training loop written using Accelerate:
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator


def main():
    # Initialize the accelerator. This will handle multi-node, multi-GPU setup.
    accelerator = Accelerator()

    # Display the total number of GPUs being used (only from the main process)
    if accelerator.is_main_process:
        print(f"Total number of GPUs being used: {accelerator.num_processes}")

    # Create a simple model (e.g., a two-layer MLP)
    model = nn.Sequential(
        nn.Linear(10, 50),
        nn.ReLU(),
        nn.Linear(50, 1)
    )

    # Set up an optimizer
    optimizer = optim.SGD(model.parameters(), lr=0.001)

    # Create a dummy dataset: 1000 samples, 10 features each
    x = torch.randn(1000, 10)
    y = torch.randn(1000, 1)
    dataset = TensorDataset(x, y)
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

    # Prepare everything for distributed training (model, optimizer, dataloader)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    # Training loop
    num_epochs = 5
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        model.train()
        for batch in dataloader:
            inputs, targets = batch
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = nn.functional.mse_loss(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            epoch_loss += loss.item()
        avg_loss = epoch_loss / len(dataloader)

        # Only the main process prints to avoid duplicate outputs across nodes
        if accelerator.is_local_main_process:
            print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")


if __name__ == '__main__':
    main()
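For comparison, the working single-node run is launched directly with accelerate launch, roughly like this (a sketch, assuming 2 GPUs on one node):

accelerate launch --multi_gpu --num_processes 2 accelerate_test.py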
Here is my Slurm script, inspired by this example:

#!/bin/bash
#SBATCH -D .
#SBATCH --exclusive
#SBATCH --gpus-per-node=2
#SBATCH --ntasks-per-node=1
#SBATCH --partition=gpu
#SBATCH --nodes 2
#SBATCH --time 00-20:00:00
#SBATCH --job-name test
module load nvhpc-pmix/24.3
module load cuda/11.8
# Activate the environment
source /home/mis-kvsmps/environments/joint_training/bin/activate
export GPUS_PER_NODE=2
# Get the rendezvous host (first node in the allocation)
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export RDZV_PORT=29530
echo "Rendezvous Node IP: $head_node_ip"
export NCCL_DEBUG=INFO

export LAUNCHER="accelerate launch \
    --num_processes $((SLURM_NNODES * GPUS_PER_NODE)) \
    --num_machines $SLURM_NNODES \
    --rdzv_backend c10d \
    --main_process_ip $head_node_ip \
    --main_process_port $RDZV_PORT \
    "
export SCRIPT="accelerate_test.py"
export CMD="$LAUNCHER$SCRIPT"
srun $CMD
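For clarity, with SLURM_NNODES=2 and GPUS_PER_NODE=2, the $CMD that srun runs on each node expands to roughly the following (a sketch; dgx02 is the head node resolved from the allocation and 29530 is RDZV_PORT):

accelerate launch \
    --num_processes 4 \
    --num_machines 2 \
    --rdzv_backend c10d \
    --main_process_ip dgx02 \
    --main_process_port 29530 \
    accelerate_test.py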
Here is the error log I got:
Rendezvous Node IP: dgx02
Total number of GPUs being used: 4
dgx02:219957:219957 [0] NCCL INFO Bootstrap : Using ibp12s0:100.126.5.2<0>
dgx02:219957:219957 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
dgx02:219957:219957 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
dgx02:219957:219957 [0] NCCL INFO NET/Plugin: Using internal network plugin.
dgx02:219957:219957 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.21.5+cuda11.0
dgx02:219958:219958 [1] NCCL INFO cudaDriverVersion 12020
dgx02:219958:219958 [1] NCCL INFO Bootstrap : Using ibp12s0:100.126.5.2<0>
dgx02:219958:219958 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
dgx02:219958:219958 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
dgx02:219958:219958 [1] NCCL INFO NET/Plugin: Using internal network plugin.
dgx02:219957:220018 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/RoCE [6]mlx5_6:1/IB [7]mlx5_7:1/IB [8]mlx5_8:1/IB [9]mlx5_9:1/IB [10]mlx5_10:1/IB [11]mlx5_11:1/RoCE [RO]; OOB ibp12s0:100.126.5.2<0>
dgx02:219957:220018 [0] NCCL INFO Using non-device net plugin version 0
dgx02:219957:220018 [0] NCCL INFO Using network IB
dgx02:219958:220019 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/RoCE [6]mlx5_6:1/IB [7]mlx5_7:1/IB [8]mlx5_8:1/IB [9]mlx5_9:1/IB [10]mlx5_10:1/IB [11]mlx5_11:1/RoCE [RO]; OOB ibp12s0:100.126.5.2<0>
dgx02:219958:220019 [1] NCCL INFO Using non-device net plugin version 0
dgx02:219958:220019 [1] NCCL INFO Using network IB
dgx03:2619044:2619044 [0] NCCL INFO cudaDriverVersion 12020
dgx03:2619044:2619044 [0] NCCL INFO Bootstrap : Using ibp12s0:100.126.5.3<0>
dgx03:2619044:2619044 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
dgx03:2619044:2619044 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
dgx03:2619044:2619044 [0] NCCL INFO NET/Plugin: Using internal network plugin.
dgx03:2619045:2619045 [1] NCCL INFO cudaDriverVersion 12020
dgx03:2619045:2619045 [1] NCCL INFO Bootstrap : Using ibp12s0:100.126.5.3<0>
dgx03:2619045:2619045 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
dgx03:2619045:2619045 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
dgx03:2619045:2619045 [1] NCCL INFO NET/Plugin: Using internal network plugin.
dgx03:2619044:2619065 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp12s0:100.126.5.3<0>
dgx03:2619044:2619065 [0] NCCL INFO Using non-device net plugin version 0
dgx03:2619044:2619065 [0] NCCL INFO Using network IB
dgx03:2619045:2619066 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp12s0:100.126.5.3<0>
dgx03:2619045:2619066 [1] NCCL INFO Using non-device net plugin version 0
dgx03:2619045:2619066 [1] NCCL INFO Using network IB
dgx03:2619044:2619065 [0] misc/socket.cc:533 NCCL WARN socketPollConnect: Connect to 100.126.5.2<40395> returned 113(No route to host) errno 115(Operation now in progress)
dgx03:2619045:2619066 [1] misc/socket.cc:533 NCCL WARN socketPollConnect: Connect to 100.126.5.2<40395> returned 113(No route to host) errno 115(Operation now in progress)
dgx03:2619044:2619065 [0] NCCL INFO misc/socket.cc:570 -> 2
dgx03:2619045:2619066 [1] NCCL INFO misc/socket.cc:570 -> 2
dgx03:2619044:2619065 [0] NCCL INFO misc/socket.cc:621 -> 2
dgx03:2619045:2619066 [1] NCCL INFO misc/socket.cc:621 -> 2
dgx03:2619044:2619065 [0] NCCL INFO bootstrap.cc:285 -> 2
dgx03:2619045:2619066 [1] NCCL INFO bootstrap.cc:285 -> 2
dgx03:2619044:2619065 [0] NCCL INFO init.cc:1534 -> 2
dgx03:2619045:2619066 [1] NCCL INFO init.cc:1534 -> 2
dgx03:2619045:2619066 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
dgx03:2619044:2619065 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
dgx03:2619045:2619045 [1] NCCL INFO group.cc:418 -> 2
dgx03:2619045:2619045 [1] NCCL INFO init.cc:1929 -> 2
dgx03:2619044:2619044 [0] NCCL INFO group.cc:418 -> 2
dgx03:2619044:2619044 [0] NCCL INFO init.cc:1929 -> 2
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/mis-kvsmps/projects/JointTraining/accelerate_test.py", line 53, in <module>
[rank2]: main()
[rank2]: File "/home/mis-kvsmps/projects/JointTraining/accelerate_test.py", line 31, in main
[rank2]: model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
[rank2]: File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1339, in prepare
[rank2]: result = tuple(
[rank2]: File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank2]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank2]: File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank2]: return self.prepare_model(obj, device_placement=device_placement)
[rank2]: File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1469, in prepare_model
[rank2]: model = torch.nn.parallel.DistributedDataParallel(
[rank2]: File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank2]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank2]: File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank2]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank2]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank2]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank2]: Last error:
[rank2]: socketPollConnect: Connect to 100.126.5.2<40395> returned 113(No route to host) errno 115(Operation now in progress)
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/mis-kvsmps/projects/JointTraining/accelerate_test.py", line 53, in <module>
[rank3]: main()
[rank3]: File "/home/mis-kvsmps/projects/JointTraining/accelerate_test.py", line 31, in main
[rank3]: model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
[rank3]: File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1339, in prepare
[rank3]: result = tuple(
[rank3]: File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank3]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]: File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank3]: return self.prepare_model(obj, device_placement=device_placement)
[rank3]: File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1469, in prepare_model
[rank3]: model = torch.nn.parallel.DistributedDataParallel(
[rank3]: File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank3]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank3]: File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank3]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank3]: Last error:
[rank3]: socketPollConnect: Connect to 100.126.5.2<40395> returned 113(No route to host) errno 115(Operation now in progress)
W0306 16:29:16.786000 2618773 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2619044 closing signal SIGTERM
E0306 16:29:17.150000 2618773 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 2619045) of binary: /home/mis-kvsmps/environments/joint_training/bin/python
Traceback (most recent call last):
File "/home/mis-kvsmps/environments/joint_training/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1163, in launch_command
multi_gpu_launcher(args)
File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/accelerate/commands/launch.py", line 792, in multi_gpu_launcher
distrib_run.run(args)
File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/mis-kvsmps/environments/joint_training/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
accelerate_test.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-03-06_16:29:16
host : dgx03.cm.cluster
rank : 3 (local_rank: 1)
exitcode : 1 (pid: 2619045)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: dgx03: task 1: Exited with exit code 1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 15424 ON dgx02 CANCELLED AT 2025-03-06T16:35:38 ***
slurmstepd: error: *** STEP 15424.0 ON dgx02 CANCELLED AT 2025-03-06T16:35:38 ***
W0306 16:35:38.113000 219705 torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0306 16:35:38.114000 219705 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 219957 closing signal SIGTERM
W0306 16:35:38.115000 219705 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 219958 closing signal SIGTERM
Could someone take a look at this and help identify what’s going wrong with the multi-node setup? Any guidance on resolving the NCCL communication issue would be greatly appreciated, as we’re heavily reliant on Accelerate for our projects on NVIDIA DGX Cloud.
Thanks!