Error with nccl_mpi_all_reduce on multinode system #97
@mpatwary, can you help with this?
It looks like you are using the right command, and I think the problem is unrelated to nccl_mpi_all_reduce. Do the other MPI implementations, like ring_all_reduce and osu_allreduce, run well? I suspect the problem could be the setup. Does any other MPI code run well on your system?
Hi @mpatwary, my system has 2 nodes, each with 4 P100 GPUs (8 GPUs total), connected using InfiniBand. I was wondering how mpirun communicates between the nodes to implement the distributed benchmark. ring_all_reduce and osu_allreduce throw errors when I compile the DeepBench benchmarks:
Compilation:
Normal outputs and errors:
I have recompiled and run it again with 4 and 8 GPUs, but now I get the error below:
bin/nccl_mpi_all_reduce: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[41026,1],0]
Looks like the code is not getting the path to the MPI lib directory. You can try exporting that, e.g. as in the sketch below.
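A minimal sketch (the install prefix is a placeholder; point it at whatever directory actually contains libmpi.so.40 on your system):

export LD_LIBRARY_PATH=/path/to/openmpi/lib:$LD_LIBRARY_PATH
mpirun --allow-run-as-root -np 8 bin/nccl_mpi_all_reduce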
@mpatwary, thanks for your prompt reply. I exported that and got other errors. My system has 2 nodes, each with 4 P100 GPUs (8 GPUs total), connected using InfiniBand; I was wondering how mpirun communicates between the nodes to implement the distributed benchmark. It looks like the command mpirun --allow-run-as-root -np 8 bin/nccl_mpi_all_reduce only considers the host node; my understanding is that mpirun should receive the -H flag with the IB addresses of both servers (I tried this option but got errors too). Can you share the command line you have used to run DeepBench nccl_mpi_all_reduce on a multi-node, multi-GPU system? Here is the error I am getting using just the 4 GPUs of the host server:
Primary job terminated normally, but 1 process returned a non-zero exit code.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[36721,1],0]
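For reference, a typical Open MPI multi-node launch looks roughly like this (a sketch, not a command confirmed for DeepBench; node1 and node2 are placeholder hostnames, :4 requests 4 slots per node, and -x forwards the named environment variables to the remote ranks):

mpirun --allow-run-as-root -np 8 -H node1:4,node2:4 -x LD_LIBRARY_PATH -x PATH bin/nccl_mpi_all_reduce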
I have a problem here as well. The normal single version works fine, and all other MPI applications are working. But I get this here:
NCCL MPI AllReduce
Hi all,
What is the command line to run nccl_mpi_all_reduce on a multi-node system (2 nodes with 4 GPUs each)? I am getting the error below when running this command:
WARNING: There is at least one non-excluded OpenFabrics device found, but there are no active ports detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your cables, subnet manager configuration, etc. The openib BTL will be ignored for this job.
Local host: C4-1
terminate called after throwing an instance of 'std::runtime_error'
what(): Failed to set cuda device
When running only with 4 ranks, I get this output:
WARNING: There is at least one non-excluded OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: C4-1
NCCL MPI AllReduce
Num Ranks: 4
[C4130-1:04094] 3 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[C4130-1:04094] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
100000 400000 0.148489 0.148565
3097600 12390400 2.63694 2.63695
4194304 16777216 3.57147 3.57148
6553600 26214400 5.59742 5.59744
16777217 67108868 81.9391 81.9396
38360000 153440000 32.6457 32.6462
Thanks
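(Side note on the recurring warning: it just means Open MPI found an InfiniBand device with no usable active port and will fall back to another transport. If that is expected, the openib BTL can be excluded explicitly; this is standard Open MPI MCA syntax, with the rank count here only as an example:

mpirun --allow-run-as-root -np 4 --mca btl ^openib bin/nccl_mpi_all_reduce

Setting --mca orte_base_help_aggregate 0, as the log itself suggests, shows every rank's copy of the help message instead of aggregating them.)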