first commit

mayooot committed Jan 10, 2024
1 parent c2bb939 commit c2cd9cc
Showing 42 changed files with 4,650 additions and 2 deletions.
14 changes: 14 additions & 0 deletions Dockerfile
@@ -0,0 +1,14 @@
FROM nvcr.io/nvidia/pytorch:23.11-py3

WORKDIR /

# change the download source of apt, comment it out if you are abroad
COPY sources.list /etc/apt/sources.list
RUN apt-get update && \
apt-get install -y openssh-server vim curl inetutils-ping net-tools telnet lsof

COPY start.sh /start.sh
COPY sshd_config /etc/ssh/sshd_config
COPY nccl-tests /nccl-tests

CMD ["/bin/bash", "start.sh"]
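
`start.sh` is copied into the image but its contents are not shown in this hunk. Below is a minimal sketch of what such an entrypoint plausibly does, based on the `PORT` and `PASS` variables described in the README (an assumption, not the actual script from this commit):

~~~shell
#!/bin/bash
# Hypothetical entrypoint sketch; not the actual start.sh from this commit.
PORT=${PORT:-12345}            # sshd listen port, overridable with -e PORT=...
PASS=${PASS:-12345}            # root password, overridable with -e PASS=...

echo "root:${PASS}" | chpasswd                              # set the root password
sed -i "s/^#\?Port .*/Port ${PORT}/" /etc/ssh/sshd_config   # point sshd at the chosen port
mkdir -p /run/sshd && /usr/sbin/sshd                        # start sshd

sleep infinity                                              # keep the container running
~~~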
75 changes: 73 additions & 2 deletions README.md
@@ -1,2 +1,73 @@
# build-nccl-tests-with-pytorch
This is a dockerfile to build PyTorch executing NCCL-Tests.
# Build-NCCL-Tests-With-PyTorch

![license](https://img.shields.io/hexpm/l/plug.svg)
[![docker](https://img.shields.io/docker/pulls/mayooot/nccl-tests-with-pytorch.svg)](https://hub.docker.com/r/mayooot/nccl-tests-with-pytorch)

# Overview

Build [NCCL-Tests](https://github.com/NVIDIA/nccl-tests) and configure sshd in a PyTorch container to help you test NCCL faster!

PyTorch container version: 23.11 (base image `nvcr.io/nvidia/pytorch:23.11-py3`)

# Quick Start

~~~shell
docker pull mayooot/nccl-tests-with-pytorch:v0.0.1
~~~

# Build From Source

~~~shell
git clone https://github.com/mayooot/build-nccl-tests-with-pytorch
cd build-nccl-tests-with-pytorch

docker build -t nccl-tests-with-pytorch:latest .
~~~

# Usage

The default value for both `PORT` and `PASS` is `12345`; you can override them with `-e`.

In addition, you need to mount the host's `id_rsa` and `id_rsa.pub` into the container.

~~~shell
docker run --name foo \
-d -it \
--network=host \
-e PORT=1998 -e PASS=P@88w0rd \
-v /tmp/id_rsa:/root/.ssh/id_rsa \
-v /tmp/id_rsa.pub:/root/.ssh/id_rsa.pub \
--gpus all --shm-size=1g \
--cap-add=IPC_LOCK --device=/dev/infiniband \
mayooot/nccl-tests-with-pytorch:v0.0.1
~~~
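
To sanity-check the container (assuming the container name `foo` and port `1998` from the example above; `<host_ip>` is a placeholder):

~~~shell
docker ps --filter name=foo                  # container should be running
docker exec foo netstat -ltnp | grep 1998    # sshd should be listening; net-tools is installed in the image
ssh -p 1998 root@<host_ip> hostname          # log in from another node with the PASS password or your key
~~~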

The NCCL-Tests source code and executables are located in `/nccl-tests`. The following shows how to use them, taking `all_reduce_perf` as an example.

Before running `all_reduce_perf`, you need to set up key-based SSH access between the cluster nodes.

~~~shell
ssh-copy-id -p 1998 root@all_cluster_ip
~~~
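
With several nodes, a key pair can be generated once and its public key copied to each node's container; a sketch, with the IP list as a placeholder:

~~~shell
# generate a key pair if one does not exist yet
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# copy the public key to the sshd running in every node's container
for ip in cluster_ip1 cluster_ip2; do
    ssh-copy-id -p 1998 root@"$ip"
done
~~~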

Replace `cluster_ip1,cluster_ip2,...` in `--host` with the real IP addresses of your cluster nodes.

~~~shell
docker exec -it foo bash

cd /nccl-tests

mpirun --allow-run-as-root \
-mca plm_rsh_args "-p 1998" \
-x NCCL_DEBUG=INFO \
-x NCCL_IB_HCA=mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17 \
--host cluster_ip1,cluster_ip2,... \
./build/all_reduce_perf \
-b 1G -e 4G -f 2 -g 8
~~~
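
Alternatively, the nodes can be listed in a hostfile instead of `--host`. A sketch with placeholder IPs and one process per node (8 GPUs each, so 16 ranks in total); additional NCCL variables such as `NCCL_IB_HCA` can be forwarded with `-x` just as in the command above:

~~~shell
cat > /root/mpi_hostfile <<'EOF'
cluster_ip1 slots=1
cluster_ip2 slots=1
EOF

mpirun --allow-run-as-root \
    -np 2 --hostfile /root/mpi_hostfile \
    -mca plm_rsh_args "-p 1998" \
    -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 1G -e 4G -f 2 -g 8
~~~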

# Contribute

Feel free to open issues and pull requests. Any feedback is highly appreciated!
27 changes: 27 additions & 0 deletions nccl-tests/LICENSE.txt
@@ -0,0 +1,27 @@

Copyright (c) 2016-2017, NVIDIA CORPORATION. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION, nor the names of their
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

23 changes: 23 additions & 0 deletions nccl-tests/Makefile
@@ -0,0 +1,23 @@
#
# Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
#
# See LICENCE.txt for license information
#

BUILDDIR ?= build
override BUILDDIR := $(abspath $(BUILDDIR))

.PHONY: all clean

default: src.build

TARGETS=src

all: ${TARGETS:%=%.build}
clean: ${TARGETS:%=%.clean}

%.build:
${MAKE} -C $* build BUILDDIR=${BUILDDIR}

%.clean:
${MAKE} -C $* clean BUILDDIR=${BUILDDIR}
72 changes: 72 additions & 0 deletions nccl-tests/README.md
@@ -0,0 +1,72 @@
# NCCL Tests

These tests check both the performance and the correctness of [NCCL](http://github.com/nvidia/nccl) operations.

## Build

To build the tests, just type `make`.

If CUDA is not installed in /usr/local/cuda, you may specify CUDA\_HOME. Similarly, if NCCL is not installed in /usr, you may specify NCCL\_HOME.

```shell
$ make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
```

NCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set MPI=1 and set MPI\_HOME to the path where MPI is installed.

```shell
$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
```

## Usage

NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of processes is managed by MPI and is therefore not passed to the tests as an argument. The total number of ranks (=CUDA devices) will be equal to (number of processes)\*(number of threads)\*(number of GPUs per thread).

### Quick examples

Run on 8 GPUs (`-g 8`), scanning from 8 bytes to 128 MB :
```shell
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```

Run with MPI on 10 processes (potentially on multiple nodes) with 4 GPUs each, for a total of 40 GPUs:
```shell
$ mpirun -np 10 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
```
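
As a further illustration of the rank count (a hypothetical layout), 2 processes with 2 threads each and 2 GPUs per thread give 2\*2\*2 = 8 ranks:
```shell
$ mpirun -np 2 ./build/all_reduce_perf -t 2 -g 2 -b 8 -e 128M -f 2
```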

### Performance

See the [Performance](doc/PERFORMANCE.md) page for an explanation of the reported numbers, in particular the "busbw" column.

### Arguments

All tests support the same set of arguments; a combined invocation example follows the list :

* Number of GPUs
  * `-t,--nthreads <num threads>` number of threads per process. Default : 1.
  * `-g,--ngpus <GPUs per thread>` number of gpus per thread. Default : 1.
* Sizes to scan
  * `-b,--minbytes <min size in bytes>` minimum size to start with. Default : 32M.
  * `-e,--maxbytes <max size in bytes>` maximum size to end at. Default : 32M.
  * Increments can be either fixed or a multiplication factor. Only one of those should be used.
    * `-i,--stepbytes <increment size>` fixed increment between sizes. Default : 1M.
    * `-f,--stepfactor <increment factor>` multiplication factor between sizes. Default : disabled.
* NCCL operations arguments
  * `-o,--op <sum/prod/min/max/avg/all>` Specify which reduction operation to perform. Only relevant for reduction operations like Allreduce, Reduce or ReduceScatter. Default : Sum.
  * `-d,--datatype <nccltype/all>` Specify which datatype to use. Default : Float.
  * `-r,--root <root/all>` Specify which root to use. Only for operations with a root like broadcast or reduce. Default : 0.
* Performance
  * `-n,--iters <iteration count>` number of iterations. Default : 20.
  * `-w,--warmup_iters <warmup iteration count>` number of warmup iterations (not timed). Default : 5.
  * `-m,--agg_iters <aggregation count>` number of operations to aggregate together in each iteration. Default : 1.
  * `-a,--average <0/1/2/3>` Report performance as an average across all ranks (MPI=1 only). <0=Rank0,1=Avg,2=Min,3=Max>. Default : 1.
* Test operation
  * `-p,--parallel_init <0/1>` use threads to initialize NCCL in parallel. Default : 0.
  * `-c,--check <check iteration count>` perform count iterations, checking correctness of results on each iteration. This can be quite slow on large numbers of GPUs. Default : 1.
  * `-z,--blocking <0/1>` Make NCCL collective blocking, i.e. have CPUs wait and sync after each collective. Default : 0.
  * `-G,--cudagraph <num graph launches>` Capture iterations as a CUDA graph and then replay specified number of times. Default : 0.
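
For example, combining several of the arguments above in one run (values are arbitrary): scan from 1 MB to 1 GB on 4 GPUs, 50 timed iterations after 10 warm-up iterations, summing floats, with correctness checking disabled:
```shell
$ ./build/all_reduce_perf -b 1M -e 1G -f 2 -g 4 -n 50 -w 10 -o sum -d float -c 0
```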

## Copyright

NCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2021, NVIDIA CORPORATION. All rights reserved.

Binary file added nccl-tests/build/all_gather_perf
Binary file added nccl-tests/build/all_reduce_perf
Binary file added nccl-tests/build/alltoall_perf
Binary file added nccl-tests/build/broadcast_perf
Binary file added nccl-tests/build/gather_perf
Binary file added nccl-tests/build/hypercube_perf
Binary file added nccl-tests/build/reduce_perf
Binary file added nccl-tests/build/reduce_scatter_perf
Binary file added nccl-tests/build/scatter_perf
Binary file added nccl-tests/build/sendrecv_perf
Binary file added nccl-tests/build/timer.o
Binary file added nccl-tests/build/verifiable/verifiable.o
144 changes: 144 additions & 0 deletions nccl-tests/doc/PERFORMANCE.md
@@ -0,0 +1,144 @@
# Performance reported by NCCL tests

NCCL tests report the average operation time in ms, and two bandwidths in GB/s : algorithm bandwidth and bus bandwidth. This page explains what those numbers mean and what you should expect depending on the hardware used.

# Time

Time is useful with small sizes, to measure the constant overhead (or latency) associated with operations.

On large sizes, the time becomes linear with the size (since it is roughly equal to overhead + size / bw), so it no longer measures just the latency but is dominated by the size divided by the bandwidth.

Therefore, on large sizes, it makes more sense to look at the bandwidth.

# Bandwidth

## Algorithm bandwidth

Algorithm bandwidth uses the most common formula for bandwidth : size (_S_) / time (_t_). It is useful for computing how much time any large operation would take : simply divide the size of the operation by the algorithm bandwidth.

`algbw = S/t`

## Bus bandwidth

While the algorithm bandwidth makes sense for point-to-point operations like Send/Receive, it is not always helpful for measuring the speed of collective operations, since the theoretical peak algorithm bandwidth is not equal to the hardware peak bandwidth and usually depends on the number of ranks.
Most benchmarks only provide time measurements, which are hard to interpret for large sizes. Others also provide algorithm bandwidth, but that bandwidth varies with the number of ranks (and decreases as the number of ranks increases).

To provide a number which reflects how optimally the hardware is used, NCCL tests introduce the notion of "Bus Bandwidth" ("busbw" column in the tests output).
This number is obtained by applying a formula to the algorithm bandwidth to reflect the speed of inter-GPU communication.
This bus bandwidth can then be compared with the hardware peak bandwidth, independently of the number of ranks used.

The formula depends on the collective operation.

### AllReduce

An allreduce operation performs, for each element of the N arrays (input i_X and output o_X, each located on rank X), the following operation :

`o_0 = o_1 = o_2 = ... = o_{n-1} = i_0 + i_1 + i_2 + ... + i_{n-1}`

**Note : this is independent of the algorithm used (ring, tree, or other) as long as they use point-to-point operations (send/receive).**

A ring would do that operation in an order which follows the ring :

`i_0 + i_1 + ... + i_{n-1} -> o_{n-1} -> o_0 -> o_1 -> .. -> o_{n-2}`

A tree would do it hierarchically :

`(((((i_{n-1} + i_{n-2}) + (i_{n-3} + i_{n-4})) + ... + (i_1 + i_0)))) -> o_0 -> (o_{n/2} -> (o_{3n/4} ...))`

In all cases, we need n-1 additions and n assignments for each element. Since every step is on a different rank except potentially one (the last input and the first output),
we need 2(n-1) data transfers (times the number of elements) to perform an allReduce operation.

Considering that each rank has a bandwidth to the outside world of _B_, the time to perform an allReduce operation of _S_ elements is at best :

`t = (S*2*(n-1)) / (n*B)`

Indeed, we have _S_ elements, 2*(n-1) operations per element, and _n_ links of bandwidth _B_ to perform them.
Reordering the equation, we find that

`t = (S/B) * (2*(n-1)/n)`

Therefore, to get an AllReduce bandwidth measurement which we can compare to the hardware peak bandwidth, we compute :

`B = S/t * (2*(n-1)/n) = algbw * (2*(n-1)/n)`
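
For example, with hypothetical numbers: an allReduce across n = 8 ranks moving S = 4 GB in t = 50 ms gives `algbw = 4 GB / 0.05 s = 80 GB/s` and `busbw = 80 * 2*(8-1)/8 = 140 GB/s`, which can be compared directly with the peak bandwidth of each GPU's links.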

### ReduceScatter

The ReduceScatter operation only requires performing the addition part of the allReduce operation :

`o_K = i_0 + i_1 + i_2 + ... + i_{n-1}`

With K being the rank which gets the final result (K = offset/recvsize).

The perfect reduceScatter time with a rank bandwidth of B would therefore be :

`t = S*(n-1) / (B*n)`

And the Bus Bandwidth is therefore computed as :

`B = S/t * (n-1)/n = algbw * (n-1)/n`

Note that here, S is the size in bytes of the total array, which for NCCL is equal to `recvcount*sizeof(datatype)*n` as the `recvcount` argument is the count per rank.

### AllGather

The AllGather operation only requires performing the assignment part of the allReduce operation :

`o_0 = o_1 = o_2 = ... = o_{n-1} = i_K`

With K being the rank where the data originates from (K = offset/sendsize).

The perfect allGather time with a rank bandwidth of B would therefore be :

`t = S*(n-1) / (B*n)`

And the Bus Bandwidth is therefore computed as :

`B = S/t * (n-1)/n = algbw * (n-1)/n`

Note that here, S is the size in bytes of the total array, which for NCCL is equal to `sendcount*sizeof(datatype)*n` as the `sendcount` argument is the count per rank.

### Broadcast

The broadcast operation representation is similar to allGather :

`o_0 = o_1 = o_2 = ... = o_{n-1} = i_R`

R being the root of the operation.

However, in this case, since the i_R input is not evenly distributed on the ranks, we cannot use all N links to perform the transfer operations.
Indeed, *all* data has to get out of the root rank, hence the bottleneck is on the root rank which only has B as capacity to get data out :

`t = S/B`

And :

`B = S/t`

### Reduce

The reduce operation performs :

`o_R = i_0 + i_1 + i_2 + ... + i_{n-1}`

R being the root of the operation.

Similarly to broadcast, all data need to be sent to the root, hence :

`t = S/B`

And :

`B = S/t`

### Summary

To obtain a bus bandwidth which should be independent of the number of ranks _n_, we apply a correction factor to the algorithm bandwidth :

* AllReduce : 2*(_n_-1)/_n_
* ReduceScatter : (_n_-1)/_n_
* AllGather : (_n_-1)/_n_
* Broadcast : 1
* Reduce : 1

The bus bandwidth should reflect the speed of the hardware bottleneck : NVLink, PCI, QPI, or network.