- DeepBench
- Types of Operations
- Training Benchmark
- Inference Benchmark
- Supported Ops & Precision
- Results
- Get Involved
- Getting the Code
The primary purpose of DeepBench is to benchmark operations that are important to deep learning on different hardware platforms. Although the fundamental computations behind deep learning are well understood, the way they are used in practice can be surprisingly diverse. For example, a matrix multiplication may be compute-bound, bandwidth-bound, or occupancy-bound, based on the size of the matrices being multiplied and the kernel implementation. Because every deep learning model uses these operations with different parameters, the optimization space for hardware and software targeting deep learning is large and underspecified.
DeepBench attempts to answer the question, "Which hardware provides the best performance on the basic operations used for deep neural networks?". We specify these operations at a low level, suitable for use in hardware simulators for groups building new processors targeted at deep learning. DeepBench includes operations and workloads that are important to both training and inference.
The Deep Learning eco system consists of several different pieces. We wanted to highlight where DeepBench fits into this eco system. The diagram below describes the software and hardware components involved with deep learning. At the very top, deep learning frameworks like Baidu's PaddlePaddle, Theano, TensorFlow, Torch etc. All these frameworks allow deep learning researchers to build models. They include basic building blocks like layers which can be connected in different ways to create a model. In order to train the deep learning models, the frameworks work with underlying neural network libraries such as NVIDIA's cuDNN and Intel's MKL. These libraries implement operations such as matrix multiply that are important to deep learning models. Finally, the models are trained on hardware like NVIDIA GPUs or Intel's Xeon Phi processor.
DeepBench uses the neural network libraries to benchmark the performance of basic operations on different hardware. It does not work with deep learning frameworks or deep learning models built for applications. We cannot measure the time required to train an entire model using DeepBench. The performance characteristics of models built for different applications are very different from each other. Therefore, we are benchmarking the underlying operations involved in a deep learning model. Benchmarking these operations will help raise awareness amongst hardware vendors and software developers about the bottlenecks in deep learning training and inference.
DeepBench consists of a set of basic operations (dense matrix
multiplies, convolutions and communication) as well as some recurrent
layer types. There are Excel spreadsheets (DeepBenchKernels_train.xlsx
&
DeepBenchKernels_inference.xlsx
) in this repository that describes all
of the sizes for training and inference respectively.
For training, both forward and backward operations are tested. The precision requirements for training and inference are discussed in the sections below.
We will use vendor supplied libraries even if faster independent libraries exist or faster results have been published. Most users will default to the vendor supplied libraries and as such the vendor supplied libraries are most representative of users' experience.
DeepBench includes training results for seven hardware platforms, NVIDIA's
TitanX, M40, P40, TitanX Pascal, TitanXp, 1080 Ti, P100 and Intel's Knights
Landing. Inference results are included for three server platforms, NVIDIA's
TitanX Pascal, P40, P100, TitanXp and 1080 Ti. Inference results are also included
for three mobile devices iPhone 6 &7, RaspBerry Pi 3. We provide an overview of the
results and all results are available in the results
folder. We will
gladly accept pull requests for new hardware platforms.
Dense matrix multiplies exist in almost all deep neural networks today. They are used to implement fully connected layers and vanilla RNNs and are building blocks for other types of recurrent layers. Sometimes they are also used as a quick way to implement novel layer types for which custom code doesn't exist.
When performing the GEMM operation A * B = C
, either or both of A
and B
can be optionally transposed. Common terminology to describe a matrix problem
is the triple (M, N, K), which describes the sizes of the matrices involved,
and the “op” which tells us which matrices (if any) are transposed. The figure below
describes how the triple (M, N, K) correspond to the sizes of the matrices being multiplied.
The variant where both matrices
are transposed is not used in neural networks. The other three
variants are used, but they need not be implemented as a call to
SGEMM
with those transpose descriptors. Sometimes it can be faster
to perform an in-place transpose followed by the appropriate
multiplication and a transpose back. Such optimizations should be
detailed in the spreadsheet.
The constant coefficients alpha and beta should both be 1.0 so that no work is elided.
Convolutions make up the vast majority of flops in networks that operate on images and videos and form important parts of networks such as speech and natural language modeling, thus making them perhaps the single most important layer from a performance perspective.
Convolutions have 4 or 5 dimensional inputs and outputs giving rise to a large number of possible orderings for these dimensions. For the first version of the benchmark we are only concerned with performance in NCHW format i.e. data is presented in image, feature maps, rows and columns.
There are many techniques for computing convolutions that are optimal for different sizes of the filter and image, including: direct, matrix multiply based, FFT based, and Winograd based approaches. In the first version of this benchmark, we are not concerned about the accuracy of the different approaches since the general consensus is that 32-bit floating point is accurate enough for each of them. We have noted the approach used for each size in the spreadsheet.
Recurrent layers are usually made up of some combination of the above operations and also simpler operations such as unary or binary operations which aren't very compute intensive and generally constitute a small percentage of overall runtime. However, the GEMM and convolution operations are relatively small in recurrent layers, so the cost of these smaller operations can become significant. This is especially true if there is a high fixed overhead associated with starting a computation. It is also possible to use alternate storage formats for the recurrent matrices because the cost of converting to a new storage format can be amortized over the many steps of the recurrent computation. If this is done, the time to convert to and from the custom format should be included in the overall time.
These factors lead to many optimization possibilities both within a time step and across a sequence of time steps such that measuring the raw performance of the operations is not necessarily representative of the performance of an entire recurrent layer. In this benchmark we focus on only one recurrent layer, even though there are even more optimization opportunities if one considers stacks of them.
The calculation of the inputs should not be included in the time for the recurrent layer calculation since it can be calculated as one large multiply and then consumed by the actual recurrent calculation. So in: h_t = g(Wx_t + Uh_t-1) the time for the calculation of Wx_t for all t should not be included in the time for the recurrent layer.
The backward calculation should calculate the updates with respect to the weights but not the inputs. All the recurrent work is done to calculate the weight updates, so calculating the updates with respect to the inputs as well just obscures what we are trying to measure.
DeepBench includes support for three types of recurrent cells; vanilla RNNs, LSTMs and GRUs. The non-linearity for vanilla RNNs should be a ReLU. The internal non-linearities of the LSTM should be the standard operations - sigmoid for the gates and tanh for the activations. The LSTM should not have peephole connections. The internal of the GRU should be a sigmoid for reset and update gates. The output gate non linearity should be a ReLU.
Neural networks today are often trained across multiple GPUs or even multiple systems, each with multiple GPUs. There are two main categories of techniques for doing this: synchronous and asynchronous. Synchronous techniques rely on keeping the parameters on all instances of the model synchronized, usually by making sure all instances of the model have the same copy of the gradients before taking an optimization step. The Message Passing Interface (MPI) primitive usually used to perform this operation is called All-Reduce. There are many ways to implement All-Reduce based on the number of ranks, the size of the data, and the topology of the network. This benchmark places no constraints on the implementation other than that it should be deterministic. Asynchronous methods are quite varied and in this version of the benchmark we will not be attempting to test these methods.
In order to evaluate All-Reduce, we use the following libraries and benchmarks:
The NCCL library can be build without MPI (for single node) and with MPI (for multinode) as shown in https://github.com/NVIDIA/nccl-tests. We therefore have two versions of NCCL for the single node in the experiments. For multinode experiments, we use only NCCL with MPI, the benchmark from OSU, and Baidu's Allreduce implementation. We report the shortest latency achieved from all implementations for each configuration.
Intel(R) Machine Learning Scaling Library (Intel(R) MLSL) is a library providing an efficient implementation of communication patterns used in deep learning. In order to evaluate All-Reduce performance, we use All-Reduce benchmark from OSU.
Each node has two CPU sockets (dual root topology), and each socket has a PCIe root complex. For each socket there are two PLX switches that are each connected to the CPU socket via 16 lanes of PCIe v3. There are two GPUs on each PLX switch. All pairs of GPUs communicate simultaneously over 16 lanes of PCIe v3. The two CPU sockets are connected via Intel QPI. The interconnect across nodes is InfiniBand FDR. The figure below shows a schematic diagram of one our nodes, where all devices connected by the same PCI root complex are encapsulated in a dotted box. In our experiments, P100, TitanX Maxwell and M40 were such systems.
Each node has one CPU socket (single root topology) with two PLX switches, each switch are connected to 5 GPUs. The communication among the GPUs in the same PLX switch traverses through the PLX switch only, whereas the communication to any GPU connected to the other PLX switch requires traversal both PLX switches along with the connecting PCIe bridge. In our experiments, TitanX Pascal, and 1080Ti were such systems.
The blocking All-Reduce latency is measured on Intel Xeon Phi processor 7250 on Intel’s internal Endeavor cluster with Intel® Omni-Path Architecture (Intel® OPA) series 100 fabric with fat-tree topology, using Intel MPI 2017 Update 3 and Intel MLSL 2017 Update 2 Preview.
The training benchmark includes support for all the operations discussed
above. The DeepBenchKernels_train.xlsx
file contains the entire list of
kernels for the training benchmark.
While training deep learning models, most researchers typically use single precision floating point numbers for all compute kernels. Academic research has demonstrated that reduced precision training works for several different models trained on limited datasets. In our experience, we’ve found that 16 bit half precision floating point numbers are sufficient to train large deep learning models on large datasets reliably. Training with half precision numbers allows hardware vendors to better utilize the available computing power. In addition, the parameters require half the total storage for the entire model.
DeepBench specifies the minimum precision requirements for training. We are specifying precision for multiply and add for all the operations. The minimum precision for multiplication and addition is set to 16 and 32 bits respectively. None of the currently available hardware supports 16 bit multiply and 32 bit accumulate. We will accept results on any hardware platform that satisfies this minimum precision requirement. All results will include the precision that is used for the benchmark.
Benchmarking inference is a very challenging problem. There are many applications that have been enabled by deep learning and each of them have their unique performance characteristics and requirements. We selected applications for benchmarking that receive high user traffic. We are also including kernels from deep learning models that are used across several different applications.
For the inference kernels, we cover the same set of operations as the training set i.e.
matrix multiply, convolution and recurrent operations. The kernels have some differences
from the training counterparts. In the next few sections, we discuss the changes needed
to benchmark inference workloads. The DeepBenchKernels_inference.xlsx
file contains
the complete list of kernels for the training benchmark.
Large scale real world applications such as image search, language translation and speech recognition are typically deployed on servers located in data centers. The client sends the request over the internet which is processed on the remote server hosting the deep learning model. The remote server is typically a powerful machine consisting of many processors. The memory and compute capabilities are large enough to host very large deep learning models. The downside of deploying the model on the server is the latency depends on the network bandwidth between the client and the server. It also requires the user to be connected to the internet. In order to address these issues, several models are being deployed on end devices. On-device deployment enables deep learning models to have lower latency and are always available regardless of internet connectivity. However, these models need to be smaller in order to fit within the power and memory constraints of mobile and embedded devices.
In DeepBench, we measure the performance of inference kernels on both server and mobile
platforms. Hardware vendors or users can
run the appropriate benchmarks and add their results to the repository. We provide an overview
of the results below and detailed results are available in the results/inference
folder.
We will gladly accept pull requests for new hardware platforms.
In order to meet latency requirements of user requests, most internet applications process requests individually as they arrive at the data center. This makes for a straightforward application where each request is handled by a single thread. However, this is inefficient for two reasons. Processing requests individually makes the operation bandwidth bound as the processor needs to load weights of the network. This makes it harder for processor to effectively utilize the on chip caches. Secondly, the amount of parallelism that can be exploited to classify one request is limited, making it difficult to exploit SIMD or multicore parallelism. RNNs are especially challenging to deploy because evaluating RNNs sample by sample relies on matrix vector multiplication, which are bandwidth bound and difficult to parallelize.
To overcome these issues, we built a batching scheduler called Batch Dispatch which assembles streams of data from user requests into batches before performing forward propagation on these batches. In this case, there is a tradeoff between increased batch size, and consequently improved efficiency, and increased latency. The more we buffer user requests to assemble a large batch, the longer users must wait for their results. This places constraints on the amount of batching we can perform.
In practice, we’ve seen that batching requests up to 4 or 5 seems to work well for efficiency and latency for data center deployment. In the case of deployment on devices, the batch size is limited to 1.
Deep Neural networks are trained using single precision or half precision floating point numbers. The precision requirements for inference are significantly lower than training. Several different models can deployed with 8 bit representations for inference with little or no loss in accuracy compared to their floating point models. Therefore, for inference kernels, we’re specifying the minimum precision for multiplication and accumulation of 8 and 32 bits respectively. Since all hardware platforms may not support this precision requirement, we will accept results for any platform that satisfies this minimum requirement. All results will include the precision used for the benchmark.
To benchmark matrix multiplication with 8 bit inputs for ARM processors, we use the Gemmlowp library. Convolution kernels from the ARM Compute Library are used for convolution benchmark. The ARM Compute library only supports single precision convolutions. Low precision convolution support should be available shortly. The ARM Compute library doesn’t have any support for RNNs. Therefore, DeepBench does not include RNN results for ARM devices. We welcome contributions from other libraries that support RNN operations for ARM devices.
For server deployment, we use the cudNN library and cuBLAS library for Nvidia GPUs. For Nvidia GPUs, RNN kerenels only support single precision and results are reported with the same. More details regarding which ops are supported on different processors can be found in later sections.
A sparse neural network is one where most of the weights of the neural network are zero. These zero weights don’t contribute in determining the prediction of the neural network. Sparse neural networks reduce memory and computation footprint which enables deep learning models to be deployed on mobile devices. Inference performance of RNNs is dominated by the memory bandwidth of the hardware, since most of the work is simply reading in the parameters at every time step. Moving from a dense calculation to a sparse one comes with a penalty, but if the sparsity factor is large enough, then the smaller amount of data required by the sparse routines becomes a win.
The more powerful server class processors used in data centers can generally perform inference quickly enough to serve one user, but in the data center performance per dollar is very important. Techniques such as sparsity that allow models to be evaluated faster enable more users to be served per GPU increasing the effective performance per dollar.
There has been a lot of progress in developing sparse neural networks in the past couple of years. DeepBench includes sparse matrix vector and sparse matrix multiply kernels. Based on our research, we’ve learnt that neural networks with 90 to 95% sparsity can achieve relatively good performance compared to their dense baselines. However, current implementations of sparse matrix multiply are optimized for much higher sparsity (around 99% or higher). By including sparse kernels, we’re hoping to incentivize hardware vendors and software developers to build libraries that provide good performance for sparsity in the range of 90~95%.
We use the Eigen library to benchmark sparse operations on ARM devices. For GPU benchmarks, we use the cuSparse library from Nvidia.
Many inference applications have real time latency requirements. For example, speech interfaces require speech recognition models to return a result without a delay that is noticeable to a user. DeepBench kernels can be used as a starting point to measure the best case latency of individual operations. However, measuring full system latency is outside the scope of this release of DeepBench, given the focus on basic operations rather than complete applications. For example, a complete application running on a mobile device might need to modify the power state of the system when starting up. In another example, a complete server application might have a significant latency component that is determined by a user’s network connection to the server. We may consider addressing operation latency in a future version of DeepBench.
In this section, we document the support for the various operations across precisions for different processors. As far as possible, we pick the precision that closely matches the minimum required precision. The precision requirements are stated below again. However, there are cases where we need to benchmark higher precision operations. The tables below highlight which operations are benchmarked for each processor.
Minimum Precision for training: 16 bit multiply, 32 bit accumulate
Minimum Precision for inference: 8 bit multiply, 32 bit accumulate
Single precision results are available for 5 Nvidia GPUs and Intel's Xeon Phi processor. None of the available processors support 16 bit multiplication and 32 bit addition. Instead, we benchmark Nvidia's Psuedo FP16 mode where inputs/outputs are 16 bit but the compute is still in single precision. Support for mixed precision training is available in upcoming hardware processors.
Processor | Single precision | FP16 inputs/FP32 math | FP16 inputs / Mixed Precision Math |
---|---|---|---|
Nvidia TitanX Maxwell | GEMM, Conv, RNN | ||
Nvidia Tesla M40 | GEMM, Conv, RNN | ||
Nvidia 1080Ti | GEMM, Conv, RNN | ||
Nvidia TitanX Pascal | GEMM, Conv, RNN | Â Â Â Â Â Â Â Â Â Â Â | |
Nvidia TitanXp | GEMM, Conv, RNN | ||
Nvidia Tesla P100 | GEMM, Conv, RNN | GEMM, Conv, RNN | |
Nvidia Tesla V100 | GEMM, Conv, RNN | GEMM, Conv, RNN | |
Intel Xeon Phi 7250 | GEMM, Conv |
The GEMM and convolution benchmark are run with 8 bit multiplication and 32 bit accumulate on NVIDIA processors. However, NVIDIA GPUs don't support all input sizes for this precision mode. Input sizes have to be a multiple of 4 to run in this precision mode. We have padded inputs dimensions to be multiples of 4 for all kernels. The cost of padding and discarding extra outputs is small compared to the cost of the operation. The results spreadsheet indicates which of the kernels required padding. Sparse operations and Recurrent kernel results are reported in single precision since the relevant libraries don't support low precision.
Processor | Single Precision | Int8 multiply/32 bit accumulate |
---|---|---|
Nvidia 1080Ti | RNN, Sparse GEMM | GEMM, Conv |
Nvidia TitanX Pascal | RNN, Sparse GEMM | GEMM, Conv |
Nvidia TitanXp | RNN, Sparse GEMM | GEMM, Conv |
The table below describes the inference device kernel results available on different processors, ops and precision. We don't have any results for RNNs since no ARM libraries support RNNs. ARM Compute library is not yet supported on the iPhone.
Processor         | Single Precision       | Int8 inputs/32 bit math |
---|---|---|
Raspberry Pi 3 | Conv | GEMM, Sparse GEMM |
iPhone 6 | GEMM, Sparse GEMM | |
iPhone 7 | GEMM, Sparse GEMM |
In this section, we are documenting the performance for a few operations.
These are picked at random and are only meant to demonstrate the performance for a few applications.
The results below only include the time and TeraFLOPS for the fastest processor for the particular operation and parameters. The full results can be found in the results
folder.
The precision used for benchmarking the training and inference processors is listed at the top of the results file.
Training results can be found in the results/training
folder which contains the following files:
DeepBench_IA_KNL7250.xlsx
: Training results on Intel's Xeon Phi ProcessorDeepBench_NV_TitanX.xlsx
: Training results on NVIDIA's TitanX GPUsDeepBench_NV_M40.xlsx
: Training results on NVIDIA's M40 GPUsDeepBench_NV_TitanX_Pascal.xlsx
: Training results on NVIDIA's TitanX Pascal GPUDeepBench_NV_TitanXp.xlsx
: Training results on NVIDIA's TitanXp Pascal GPUDeepBench_NV_1080Ti.xlxs
: Training results on NVIDIA's 1080 Ti GPUDeepBench_NV_P100.xlsx
: Training results on NVIDIA's P100 GPUDeepBench_NV_V100.xlsx
: Training results on NVIDIA's V100 GPU
Detailed inference results can be found in the results/inference
folder which contains the following files:
server/DeepBench_NV_TitanXp.xlsx
: Inference results on NVIDIA's TitanXp GPUsserver/DeepBench_NV_TitanXp.xlsx
: Inference results on NVIDIA's TitanXp Pascal GPUserver/DeepBench_NV_1080Ti.xlxs
: Inference results on NVIDIA's 1080 Ti GPUdevice/DeepBench_iPhone_7.xlsx
: Inference results on iPhone 7device/DeepBench_iPhone_6.xlsx
: Inference results on iPhone 6device/DeepBench_Raspberry_Pi_3.xlsx
: Inference results on Raspberry Pi 3
The software libraries (e.g. cuDNN, OpenMPI) used to benchmark performance are mentioned in each of Excel workbooks in Specs
sheet.
Please feel free to ask us any clarifying questions.
Results on more hardware platforms will be added once they are available. We welcome contributions from all hardware vendors.
Kernel | A Transpose | B Transpose | Application | Time (ms) | TeraFLOPS | Processor |
---|---|---|---|---|---|---|
M=1760, N=128, K=1760 | N | N | Speech Recognition | 0.07 | 10.72 | Tesla V100 Mixed Precision |
M=7860, N=64, K=2560 | N | N | Speech Recognition | 0.10 | 25.94 | Tesla V100 Mixed Precision |
M=2560, N=64, K=2560 | N | N | Speech Recognition | 0.08 | 10.11 | Tesla V100 Mixed Precision |
M=5124, N=9124, K=2560 | T | N | Speech Recognition | 8.73 | 27.43 | Tesla V100 Mixed Precision |
M=3072, N=128, K=1024 | T | N | Speech Recognition | 0.04 | 18.73 | Tesla V100 Mixed Precision |
Input Size | Filter Size | # of Filters | Padding (h, w) | Stride (h, w) | Application | Total Time (ms) | Fwd TeraFLOPS | Processor |
---|---|---|---|---|---|---|---|---|
W = 700, H = 161, C = 1, N = 32 | R = 5, S = 20 | 32 | 0, 0 | 2, 2 | Speech Recognition | 1.53 | 7.75 | Tesla V100 FP32 |
W = 54, H = 54, C = 64, N = 8 | R = 3, S = 3 | 64 | 1, 1 | 1, 1 | Face Recognition | 0.55 | 10.12 | Tesla V100 FP32 |
W = 224, H = 224, C = 3, N = 16 | R = 3, S = 3 | 64 | 1, 1 | 1, 1 | Computer Vision | 2.40 | 1.40 | Tesla V100 FP32 |
W = 7, H = 7, C = 512, N = 16 | R = 3, S = 3 | 512 | 1, 1 | 1, 1 | Computer Vision | 0.70 | 14.56 | Tesla V100 Mixed Precision |
W = 28, H = 28, C = 192, N = 16 | R = 5, S = 5 | 32 | 2, 2 | 1, 1 | Computer Vision | 0.93 | 16.90 | Tesla V100 FP32 |
The recurrent op kernels are only run on NVIDIA hardware.
Hidden Units | Batch Size | TimeSteps | Recurrent Type | Application | Total Time (ms) | Fwd TeraFLOPS | Processor |
---|---|---|---|---|---|---|---|
1760 | 16 | 50 | Vanilla | Speech Recognition | 6.75 | 1.46 | Tesla V100 FP32 |
2560 | 32 | 50 | Vanilla | Speech Recognition | 11.48 | 3.43 | Tesla V100 Mixed Precision |
1024 | 128 | 25 | LSTM | Machine Translation | 6.46 | 12.41 | Tesla V100 Mixed Precision |
2816 | 32 | 1500 | GRU | Speech Recognition | 591.02 | 10.45 | Tesla V100 Mixed Precision |
Size (# of floats) | Number of Processors | Application | Time (ms) | Bandwidth (GB/s) | Processor |
---|---|---|---|---|---|
16777216 | 8 | Speech Recognition | 8.66 | 61.99 | Xeon Phi 7250 with Intel® Omni-Path |
16777216 | 16 | Speech Recognition | 14.72 | 72.94 | Xeon Phi 7250 with Intel® Omni-Path |
16777216 | 32 | Speech Recognition | 19 | 113.03 | Xeon Phi 7250 with Intel® Omni-Path |
64500000 | 32 | Speech Recognition | 76.68 | 107.67 | Xeon Phi 7250 with Intel® Omni-Path |
The next few sections provide a few results for GEMM, Convolution and Recurrent operations for inference kernels on server platforms. Results on Intel platforms should be available shortly.
Kernel | Application | Results (ms) | TeraFLOPS | Processor |
---|---|---|---|---|
M=5124, N=700, K=2048 | Speech Recognition | 0.46 | 31.94 | 1080 Ti |
M=35, N=700, K=2048 | Speech Recognition | 0.05 | 2.09 | 1080 Ti |
M=3072, N=3000, K=1024 | Speech Recognition | 0.49 | 38.36 | Titan Xp |
M=512, N=6000, K=2816 | Speech Recognition | 0.43 | 40.71 | Titan Xp |
Kernel | Sparsity | Application | Results (ms) | Speedup wrt dense | TeraFLOPS | Processor |
---|---|---|---|---|---|---|
M=7680, N=1, K=2560 | 0.95 | Speech Recognition | 0.03 | 6.56 | 1.10 | 1080 Ti |
M=7680, N=2, K=2560 | 0.95 | Speech Recognition | 0.04 | 5.93 | 1.74 | 1080 Ti |
M=7680, N=1500, K=2560 | 0.95 | Speech Recognition | 29.81 | 0.16 | 1.88 | TitanXp |
M=10752, N=1, K=3584 | 0.9 | Speech Recognition | 0.1 | 4 | 0.72 | TitanXp |
Input Size | Filter Size | # of Filters | Padding (h, w) | Stride (h, w) | Application | Time (ms) | TeraFLOPS | Processor |
---|---|---|---|---|---|---|---|---|
W = 341, H = 79, C = 32, N = 4 | R = 5, S = 10 | 32 | 0,0 | 2,2 | Speech Recognition | 0.29 | 9.03 | TitanXp |
W = 224, H = 224, C = 3, N = 1 | R = 7, S = 7 | 64 | 3, 3 | 2, 2 | Computer Vision | 0.14 | 1.64 | TitanXp |
W = 56, H = 56, C = 256, N = 1 | R = 1, S = 1 | 128 | 0, 0 | 2, 2 | Computer Vision | 0.015 | 3.43 | TitanX Pascal |
W = 7, H = 7, C = 512, N = 2 | R = 1, S = 1 | 2048 | 0, 0 | 1, 1 | Computer Vision | 0.018 | 11.42 | 1080 Ti |
Hidden Units | Batch Size | TimeSteps | Recurrent Type | Application | Results (ms) | Fwd TeraFLOPS | Processor |
---|---|---|---|---|---|---|---|
1536 | 4 | 50 | LSTM | Language Modelling | 6.93 | 0.55 | TitanXp |
256 | 4 | 150 | LSTM | Character Language Modelling | 1.63 | 0.19 | 1080 Ti |
2816 | 1 | 1500 | GRU | Speech Recognition | 350.62 | 0.20 | TitanXp |
2560 | 2 | 375 | GRU | Speech Recognition | 75.02 | 0.39 | TitanXp |
Kernel | Application | Results (ms) | GigaFLOPS | Processor |
---|---|---|---|---|
M=5124, N=700, K=2048 | Speech Recognition | 212.84 | 69.03 | iPhone 7 |
M=35, N=700, K=2048 | Speech Recognition | 1.94 | 51.69 | iPhone 7 |
M=3072, N=1500, K=1024 | Speech Recognition | 136.63 | 69.07 | iPhone 7 |
Kernel | Sparsity | Application | Results (ms) | Speedup wrt dense | GigaFLOPS | Processor |
---|---|---|---|---|---|---|
M=7680, N=1, K=2560 | 0.95 | Speech Recognition | 1.01 | 15.55 | 18.55 | iPhone 7 |
M=7680, N=1500, K=2560 | 0.95 | Speech Recognition | 1677.36 | 5.46 | 16.70 | iPhone 7 |
M=7680, N=1, K=2560 | 0.9 | Speech Recognition | 2.1 | 8.02 | 8.41 | iPhone 7 |
Input Size | Filter Size | # of Filters | Padding (h, w) | Stride (h, w) | Application | Time (ms) | GigaFLOPS | Processor |
---|---|---|---|---|---|---|---|---|
W = 112, H = 112, C = 64, N = 1 | R = 1, S = 1 | 64 | 0, 0 | 1, 1 | Computer Vision | 670.75 | 0.15 | Raspberry Pi 3 |
W = 56, H = 56, C = 256, N = 1 | R = 1, S = 1 | 128 | 0, 0 | 2, 2 | Computer Vision | 185.87 | 0.28 | Raspberry Pi 3 |
W = 7, H = 7, C = 512, N = 1 | R = 1, S = 1 | 2048 | 0, 0 | 1, 1 | Computer Vision | 735.28 | 0.14 | Raspberry Pi 3 |
We welcome contributions from the community to DeepBench. You can contribute in two ways:
- Deep Learning Researchers/Engineers: If you are deep learning researcher or engineer working on a new deep learning application, you may have different operations and/or workloads involved in training your model. We are interested in learning more about the underlying operations that are adversely impacting the performance (speed) of your model. Please contribute these operations and workloads!
- Hardware Vendors: We would gladly accept contributions from other hardware vendors. We're open to accepting benchmark results from large companies or smaller startups building hardware for training deep learning models. Please contribute benchmark results for your hardware!
To get the code, simply clone the github repo
git clone https://github.com/baidu-research/DeepBench
In order to build the benchmarks, you will need to specify the following paths:
MPI_PATH: Path to MPI library. The benchmarks have been tested with OpenMPI version 1.10.2.
CUDA_PATH: Path to CUDA library. The benchmarks have been tested with version 7.5.18.
CUDNN_PATH: Path to CUDNN library. The benchmarks have been tested with version 5.0.
NCCL_PATH: Path to NCCL library. NCCL library is available at https://github.com/NVIDIA/nccl. The benchmarks have been tested with commit b3a9e1333d9e2e1b8553b5843ba1ba4f7c79739d
To build all the benchmarks, please use the following command:
cd code/
make CUDA_PATH=<cuda_path> CUDNN_PATH=<cudnn_path> MPI_PATH=<mpi_path> NCCL_PATH=<nccl_path>
For distributions that split their MPI headers and libraries (e.g. RHEL, Fedora, CentOS) into separate directories you should also specify the path to the include files:
MPI_INCLUDE_PATH=<mpi_include_path>
You need to build the code for the appropriate architecture. By default, the architecture version is set to 5.2. This works for the TitanX and Tesla M40 GPU. In order build the benchmark for another architecture (such as Pascal with version 6.1), please append the following variable to the make
command:
ARCH=sm_61 ## Just an example for Pascal architecture
In some cases, it may be useful to generate benchmarking executables for multiple architectures. For example, some systems may have multiple graphics processors with different architectures installed. The NVIDIA compiler (nvcc) supports the generation of "fat binaries" that contain intermediate and compiled code for multiple target architectures. To compile for multiple architectures, add a comma separated list of architectures to the make
command line.
ARCH=sm_30,sm_32,sm_35,sm_50,sm_52,sm_60,sm_61,sm_62,sm_70 # Everything since Kepler!
Note that compilation for multiple architectures will take longer than compilation for a single architecture. Also, not all CUDA versions support all architectures. For example, support for sm_60 (and later) require CUDA 8 or later.
For inference problems with int8
precision, the convolution and gemm kernels need to be padded to be multiples of 4. By default, the kernels are padded and results are reported with padding. To disable padding, please use the following build option. When padding is disabled, the benchmark numbers aren't reported for the kernels that aren't supported.
make gemm PAD_KERNELS=0
make conv PAD_KERNELS=0
In order to use Tensor Cores on NVIDIA's V100 processor, you need to use CUDA 9.0 and cudNN 7.0 or higher. Using the correct libraries, add the following option to the make command:
make USE_TENSOR_CORES=1 ARCH=sm_70
Convolution operations running Tensor Cores need input and output channels to be a multiple of 8. The benchmarks currently pad the input channels to be a multiple of 8 and report padded numbers.
Once compilation completes successfully, the executables will be
generated in the bin
directory. Before executing the benchmarks, it
is important to set your LD_LIBRARY_PATH
correctly. For bash shells,
please use:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<cuda_path>:<cudnn_path>:<mpi_path>:<nccl_path>
The GEMM, convolution, recurrent op and sparse GEMM benchmarks can be run by calling the respective executables. Here is some of the output from the GEMM benchmark:
~/DeepBench/code$ bin/gemm_bench
Running training benchmark
Times
----------------------------------------------------------------------------------------
m n k a_t b_t precision time (usec)
1760 16 1760 0 0 float 180
1760 32 1760 0 0 float 182
1760 64 1760 0 0 float 247
1760 128 1760 0 0 float 318
By default, the benchmarks are run with training problems. The default precision for benchmarking is determined based on the CUDA and cudnn library versions. The mode (inference or training) and precision can be specified on the command line using:
bin/gemm_bench <inference|train> <int8|float|half>
Each of the benchmark files includes a note indicating which precision is supported for different GPUs.
To execute the NCCL single All-Reduce benchmark, you need to specify the number of GPUs as an argument. Please note that the number of GPUs must not be greater than the number of GPUs visible in your system.
bin/nccl_single_all_reduce <num_gpus>
The NCCL MPI All-Reduce benchmark can be run using mpirun
as shown below:
mpirun -np <num_ranks> bin/nccl_mpi_all_reduce
num_ranks
cannot be greater than the number of GPUs in the system.
The osu_allreduce
benchmark can be executed using mpirun as follows:
mpirun -np <num_processes> bin/osu_allreduce
The osu_allreduce
benchmark can be run with more processes than
GPUs. However, all our experiments were conducted with each process
running on a single GPU.
In order to build the benchmarks, you will need to specify the following paths:
MPI_PATH: Path to MPI library. The benchmarks have been tested with OpenMPI version 2.0.1.
CUDA_PATH: Path to CUDA library. The benchmarks have been tested with version 8.0.61.
BAIDU_ALLREDUCE_PATH: Path to Baidu's allreduce implementation, which is avaiable at https://github.com/baidu-research/baidu-allreduce/.
To build all the benchmarks, please use the following command:
cd code/
make CUDA_PATH=<cuda_path> MPI_PATH=<mpi_path> BAIDU_ALLREDUCE_PATH=<baidu_allreduce_path>
For distributions that split their MPI headers and libraries (e.g. RHEL, Fedora, CentOS) into separate directories you should also specify the path to the include files:
MPI_INCLUDE_PATH=<mpi_include_path>
Please set the ARCH paramter for appropriate architecture as discussed above in the NVIDIA Benchmarks section.
Once compilation completes successfully, the executables will be
generated in the bin
directory. Before executing the benchmarks, it
is important to set your LD_LIBRARY_PATH
correctly. For bash shells,
please use:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<cuda_path>:<mpi_path>:<baidu_allreduce_path>
The Baidu All-Reduce benchmark can be run using mpirun
as shown below:
mpirun -np <num_ranks> bin/ring_all_reduce
num_ranks
is used as the total number of GPUs in the system.
Source all the Intel tools (icc, mkl, mpi) into the path
source <icc_installdir>/bin/compilervars.sh intel64
source <mkl_installdir>/bin/mklvars.sh intel64
source <impi_installdir>/bin/mpivars.sh intel64
source <mlsl_installdir>/intel64/bin/mlslvars.sh
Running the Intel GEMM benchmark (MKL 2017)
code/intel/sgemm/run_mkl_sgemm_ia.sh
Running the Intel convolution benchmark (MKL 2017 and libxsmm (open source KNL optimized convolution implementation))
code/intel/convolution/run_conv_ia.sh
The Intel All-Reduce benchmarks use the standard OSU benchmark compiled/running with Intel MPI or with Intel MLSL.
In order to build the Intel All-Reduce benchmarks, you will need to specify the following paths:
MPI_PATH: Path to Intel MPI library ($I_MPI_ROOT by default). The benchmarks have been tested with Intel MPI 2017 Update 3.
MLSL_PATH: Path to Intel MLSL library ($MLSL_ROOT by default). The benchmarks have been tested with Intel MLSL 2017 Update 2 Preview.
and use "Makefile_ia" makefile.
For example (building with default paths):
make -f Makefile_ia all
Running the Intel All-Reduce benchmarks:
code/osu_allreduce/run_allreduce_ia.sh <hostfile> <allreduce_binary>
There are 2 possible values for <allreduce_binary>:
- osu_allreduce - benchmark for blocking All-Reduce over MPI
- mlsl_osu_allreduce - benchmark for blocking All-Reduce over MLSL
The performance of blocking All-Reduce over MLSL is reported in DeepBench result files.
For example, to run All-Reduce benchmark over MLSL create hostfile with one hostname per line and run script as following:
code/osu_allreduce/run_allreduce_ia.sh <hostfile> mlsl_osu_allreduce
Script will run benchmark on different scales (2, 4, 8, 16, 32 nodes) and on DeepBench specific message sizes. Benchmark will report average latency metric.
For example, benchmark output on 32 KNL/OPA nodes:
# Size Avg Latency(ms)
100000 0.31
3097600 3.59
4194304 4.67
6553600 7.17
16777217 16.80
38360000 56.65
64500000 75.77
The ARM benchmarks in DeepBench are compiled and run on 64 bit ARM v8 processors.
The Makefile
in the code/arm
folder only supports this processor. In order to benchmark
other processors, you will have to modify the Makefile
to support them.
The ARM GEMM benchmark uses the Gemmlowp library
for int8
kernels. This library is included as a submodule in the DeepBench repository.
To build and run the benchmark, simply run:
./run_gemm_bench.sh
The ARM Convolution benchmark uses the ARM Compute Library. To build the benchmark, you need to specify the include and lib paths for ARM compute library:
ARM_COMPUTE_INCLUDE_PATH: Path to ARM Compute Library
ARM_COMPUTE_LIB_PATH: Path to ARM Compute library binary
To build and run the benchmark, please use:
make conv ARM_COMPUTE_INCLUDE_PATH=<path> ARM_COMPUTE_LIB_PATH=<path>
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path to arm compute library binary>
bin/conv_bench
The Sparse GEMM Benchmark uses the Eigen library. To build the benchmark, you need to download the eigen library and specify the path:
EIGEN_PATH: path to Eigen library
To compile and run the benchmark, please use the following command:
make sparse EIGEN_PATH=<path>
bin/sparse_bench