ComScribe is a tool that identifies communication among all GPU-GPU and CPU-GPU pairs in a single-node multi-GPU system.
You will need the following programs:
- Python: ComScribe is a Python script. It uses several packages listed in requirements.txt, which you can install via the command:
  pip3 install -r requirements.txt
- nvprof: ComScribe parses the outputs of NVIDIA's profiler nvprof, which is a light-weight command-line profiler available since CUDA 5.
No further installation is required.
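Before running ComScribe, it can help to confirm that both prerequisites are reachable from your shell. The following is a minimal sketch and not part of ComScribe itself: the helper name `missing_prerequisites` is hypothetical, and it only checks that the executables are on PATH, not which versions they are.

```python
import shutil


def missing_prerequisites(tools=("python3", "nvprof")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Report the result without aborting, so the check is safe to run anywhere.
missing = missing_prerequisites()
if missing:
    print("Missing from PATH: " + ", ".join(missing))
else:
    print("python3 and nvprof were both found on PATH.")
```

If nvprof is reported missing, it is usually a sign that the CUDA toolkit's bin directory is not on PATH.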
To obtain the communication matrices of your application (app):
- Put comscribe.py in the same directory as app.
- python3 comscribe.py -g <num_gpus> -s log|linear -i <cmd_to_run>
  -g tells the tool how many GPUs will be used. Note that if the application to be run also requires such a parameter, it must be specified explicitly in the input command (see -i below).
  -s can be log for log scale or linear for linear scale in the output figures.
  -i takes the input command as a string, such as: -i './app --foo 20 --bar 5'
- The communication matrix for a communication type is only generated if that type is detected; e.g., if there are no Unified Memory transfers, there will be no output regarding Unified Memory transfers. For the types of communication detected, the generated figures are saved as PDF files in the directory of the script.
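The invocation described above can also be assembled programmatically, for example when scripting several runs. This is a minimal sketch under the assumption that comscribe.py accepts exactly the -g, -s and -i options as documented here. `build_comscribe_command` is a hypothetical helper, not something ComScribe provides.

```python
import shlex


def build_comscribe_command(num_gpus, app_cmd, scale="linear", script="comscribe.py"):
    """Build the argument list for one ComScribe run (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if scale not in ("log", "linear"):
        raise ValueError("scale must be 'log' or 'linear'")
    # The application command is passed to -i as ONE string, so any GPU-count
    # option the application needs has to be included inside app_cmd itself.
    return ["python3", script, "-g", str(num_gpus), "-s", scale, "-i", app_cmd]


cmd = build_comscribe_command(4, "./app --foo 20 --bar 5", scale="log")
# shlex.join (Python 3.8+) renders the list as a copy-pasteable shell command.
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would launch the run. Keeping the arguments as a list avoids shell quoting problems with the -i string.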
We have used our tool on an NVIDIA V100 DGX2 system with up to 16 GPUs, using CUDA v10.0.130, for the following benchmarks:
- NVIDIA Monte Carlo Simulation of 2D Ising-GPU | GitHub
- NVIDIA Multi-GPU Jacobi Solver | GitHub
- Comm|Scope
  - Full-Duplex | GitHub
  - Full-Duplex with Unified Memory | GitHub
  - Half-Duplex with peer access | GitHub
  - Half-Duplex without peer access | GitHub
  - Zero-copy Memory (both Read and Write benchmarks) | GitHub
Note: To run a Comm|Scope benchmark with a fixed number of iterations (e.g. 100), replace its registration in the benchmark's source code with:
benchmark::RegisterBenchmark(...)->SMALL_ARGS()->Iterations(100);
- MGBench | GitHub
python3 comscribe.py -g 4 -i './scope --benchmark_filter="Comm_ZeroCopy_GPUToGPU_Read.*18.*" -n 0' -s log
This gives the bar chart for zero-copy memory transfers.
python3 comscribe.py -g 4 -i './scope --benchmark_filter="Comm_Demand_Duplex_GPUGPU.*18.*"' -s linear
This gives two matrices: bytes transferred (left) and number of transfers made (right).
python3 comscribe.py -g 4 -i './fullduplex' -s linear
This gives two matrices: bytes transferred (left) and number of transfers made (right).
python3 comscribe.py -g 4 -i './cuIsing -y 32768 -x 65536 -n 128 -p 16 -d 4 -t 1.5' -s log
This gives two matrices: bytes transferred (left) and number of transfers made (right).
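To make the two kinds of matrices concrete, the sketch below accumulates a list of transfers into a bytes matrix and a transfer-count matrix. It illustrates the idea only and is not ComScribe's implementation, which derives its data from nvprof output. The `(src, dst, nbytes)` tuple format, the convention that index 0 is the host CPU, and the name `build_matrices` are all assumptions made for this example.

```python
def build_matrices(transfers, num_gpus):
    """Accumulate (src, dst, nbytes) transfers into two square matrices.

    Assumed layout for this sketch: index 0 is the host CPU and indices
    1..num_gpus are the GPUs, so each matrix is (num_gpus + 1) square.
    """
    size = num_gpus + 1
    bytes_matrix = [[0] * size for _ in range(size)]
    count_matrix = [[0] * size for _ in range(size)]
    for src, dst, nbytes in transfers:
        bytes_matrix[src][dst] += nbytes  # total bytes moved from src to dst
        count_matrix[src][dst] += 1       # number of individual transfers
    return bytes_matrix, count_matrix


# Made-up transfers: two GPU1 -> GPU2 copies and one CPU -> GPU1 copy.
example = [(1, 2, 1024), (1, 2, 2048), (0, 1, 512)]
bytes_matrix, count_matrix = build_matrices(example, num_gpus=2)
for row in bytes_matrix:
    print(row)
```

Rows are sources and columns are destinations, so a pair that only sends in one direction shows up as an asymmetric entry.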
To be published as: Akhtar, P., Tezcan, E., Qararyah, F.M., and Unat, D. "ComScribe: Identifying Intra-node GPU Communication"