title | author | origin | date
---|---|---|---
C++ on GPUs done right? | Peter Steinbach | Max Planck Institute of Molecular Cell Biology and Genetics, Dresden | Meeting C++, December 05, 2015
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
[/column]
[column,class="col-xs-6"]
[/column]
[/columns]
© [Khronos Group](https://www.khronos.org/news/logos/)

Feel free to reply, discuss, inform, correct, ... [github.com/psteinb/meetingcpp2015](https://github.com/psteinb/meetingcpp2015)
. . .
If not stated otherwise, the slides and all their code are licensed under the
Creative Commons Attribution 4.0 International License.
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
![](img/800px-MPI-CBG_building_outside_4pl.jpg)

*Scientific Software Engineer* at Max Planck Institute of Molecular Cell Biology and Genetics
[/column]
[column,class="col-xs-4"]
- Dresden, Saxony
- 450 staff
- founded 2001
- cell biology
- genomics
- systems biology
[/column]
[/columns]
Accelerating scientific algorithms on GPUs (multi-GB datasets, a lot of FFTs)

- Architecture
- What can you use today
- What can you use tomorrow
Massively Parallel Programming { data-background="img/Titan_render.png" data-background-size="800px" style="margin-top: 20%"}
Data obtained from Top500.org
[columns,class="row"]
[column,class="col-xs-4"]
Nvidia Tesla ![](img/Nvidia-Tesla-K80_x400.jpg)

GPU without Graphics
[/column]
[column,class="col-xs-4"]
AMD FirePro ![](img/amd-firepro-s9150-server-graphics_x400.png)

GPU without Graphics
[/column]
[column,class="col-xs-4"]
Intel MIC ![](img/xeon_phi_x400.jpg)

Not Covered Today!
[/column]
[/columns]
[columns,class="row vertical-align"]
[column,class="col-xs-4"]
_Nvidia Tesla K80_ ![](img/Nvidia-Tesla-K80_x400.jpg)[/column]
[column,class="col-xs-4"]
_AMD FirePro S9170_ ![](img/amd-firepro-s9150-server-graphics_x400.png)[/column]
[column,class="col-xs-4"]
Intel Xeon Phi 5110P ![](img/xeon_phi_x400.jpg)[/column]
[/columns]
[columns,class="row vertical-align"]
[column,class="col-xs-4"]
- 2x GK210 chipsets
- 2x 12 GB GDDR5
- 2x 288 GB/s to RAM
- 8.7 TFlops SP
- 2.9 TFlops DP
[/column]
[column,class="col-xs-4"]
- 1x Grenada XT
- 32 GB GDDR5 RAM
- 320 GB/s to RAM
- 5.2 TFlops SP
- 2.6 TFlops DP
[/column]
[column,class="col-xs-4"]
- 62x x86 CPUs
- 8 GB GDDR5 RAM
- 320 GB/s to RAM
- 2.1 TFlops SP
- 1.1 TFlops DP
[/column]
[/columns]
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
Tuna ![](img/1024px-Thunnus_orientalis_Osaka_Kaiyukan_Aquarium_cropped_x400.jpg) (fast, single, versatile)[/column]
[column,class="col-xs-6"]
Forage Fish ![](img/1024px-School_of_Pterocaesio_chrysozona_in_Papua_New_Guinea_1_x400.jpg) (small, many, use wakefield of neighbor)[/column]
[/columns]
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
CPU ![](img/Central-Processing-Unit_x400.jpeg)[/column]
[column,class="col-xs-6"]
GPU ![](img/Nvidia-Tesla-K80_x400.jpg)[/column]
[/columns]
Note: Will use Nvidia Kepler as the GPGPU example. Nvidia K40: 15 streaming multiprocessors (SMX), 12 GB of GDDR5 DRAM.
[columns,class="row vertical-align"]
[column,class="col-xs-2"]
Warp
[/column]
[column,class="col-xs-4"]
[/column]
[column,class="col-xs-8"]
- smallest unit of concurrency: 32 threads
- thread = single CUDA core
- all threads execute same program
[/column]
[/columns]
. . .
[columns,class="row vertical-align"]
[column,class="col-xs-2"]
Block
[/column]
[column,class="col-xs-4"]
[/column]
[column,class="col-xs-8"]
- can synchronize (barriers)
- can exchange data (common "shared" memory, etc.)
[/column]
[/columns]
. . .
[columns,class="row vertical-align"]
[column,class="col-xs-2"]
Grid
[/column]
[column,class="col-xs-4"]
[/column]
[column,class="col-xs-8"]
- grids/blocks serve as work distribution/sharing mechanism on device (occupancy)
- blocks dispatched to SMX (in turn run warps)
[/column]
[/columns]
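To make the hierarchy concrete, a minimal CUDA sketch (not from the talk; the kernel name, `device_data` and the launch configuration are illustrative):

```
__global__ void kernel(float* data) {
  // position of this thread within the whole grid
  std::size_t id = blockIdx.x*blockDim.x + threadIdx.x;
  // threads 0..31 of a block form the first warp, 32..63 the second, ...
  std::size_t warp_in_block = threadIdx.x / 32;
  data[id] = static_cast<float>(warp_in_block);
}

// host side: a grid of 128 blocks, each with 256 threads (= 8 warps)
kernel<<<128, 256>>>(device_data);
```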
- global memory access: 200-400 ticks per warp
- fp32 add/mul/fma: 32 per tick per warp
. . .
[columns,class="row vertical-align"]
[column,class="col-xs-12"]
[/column]
[/columns]
- hide (memory) latency by pipelining active warps
- device kernels: arithmetic intensity needs to be high
  (number of arithmetic operations > number of load/store operations)
[column,class="col-xs-12"]
[/column]
[/columns]
Keep data in place as long as possible!
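A back-of-the-envelope sketch (not from the slides) of why that ratio matters, using the per-chip K80 figures from the hardware table and the vector_sum kernel used throughout this talk:

```
// vector_sum: a[i] = scale*a[i] + b[i]
// per element:    2 flops (1 mul + 1 add, or 1 fma)
// memory traffic: load a[i] + load b[i] + store a[i] = 3 x 4 byte
// arithmetic intensity: 2 flops / 12 byte ~ 0.17 flops/byte
//
// one K80 chip: ~4.35 TFlops SP at 288 GB/s, so it needs roughly
// 4350/288 ~ 15 flops/byte to keep its ALUs busy
// => vector_sum is memory bound by roughly two orders of magnitude
```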
**Bad: Non-Coalesced Memory Access**
[columns,class="row vertical-align"]
[column,class="col-xs-12"]
- every thread accesses different cache line at random
- warp has to be replayed 31 times to complete 1 instruction
[/column]
[/columns]
. . .
**Good: Coalesced Memory Access**
[column,class="col-xs-12"]
[/column]
[/columns]
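A hedged sketch of the two access patterns (kernel names illustrative):

```
// good: consecutive threads touch consecutive addresses,
// the warp's 32 loads merge into a few wide transactions
__global__ void coalesced(float* a) {
  std::size_t i = blockIdx.x*blockDim.x + threadIdx.x;
  a[i] += 1.f;   // threads 0..31 access a[0]..a[31]
}

// bad: every thread jumps by `stride` elements,
// up to 32 separate transactions (replays) per warp
__global__ void strided(float* a, std::size_t stride) {
  std::size_t i = (blockIdx.x*blockDim.x + threadIdx.x) * stride;
  a[i] += 1.f;
}
```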
- **GPUs are complicated beasts**
- massive parallel compute power (per Watt)
- massive ways to kill performance
What can you use today? { data-background="img/1024px-San_Francisco_Haight_Str_Guitar_Shop.jpg" style="color: black; margin: 0;margin-top: -100px;" }
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
![](img/800x_warning-42657_1280.png)[/column]
[column,class="col-xs-6"]
- good tools are rare and almost never portable
. . .
**Use a Library!**
[/column]
[/columns]
[columns,class="row"]
[column,class="col-xs-6"]
**Vendor sponsored**[/column]
[column,class="col-xs-6"]
**Open Source**[/column]
[/columns]
[columns,class="row"]
[column,class="col-xs-3 text-right"]
[CUDA](https://developer.nvidia.com/gpu-accelerated-libraries) based

- cuBLAS
- cuFFT
- cuDNN
- cuSparse
- cuSolver
- cuRAND
- ...
[/column]
[column,class="col-xs-3"]
[OpenCL](http://developer.amd.com/tools-and-sdks/opencl-zone/acl-amd-compute-libraries/) based

- clBLAS
- clFFT
- clSparse
- clRNG
- ...
[/column]
[column,class="col-xs-6"]
Multi-Purpose: [Alpaka](https://github.com/ComputationalRadiationPhysics/alpaka), [ArrayFire](https://github.com/arrayfire/arrayfire), [VexCL](https://github.com/ddemidov/vexcl), [ViennaCL](http://viennacl.sourceforge.net/), ...
[/column]
[/columns]
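To illustrate the library route: the scaled vector sum from the next slide can be expressed with two cuBLAS calls. A hedged sketch, assuming device pointers `device_a`/`device_b` and the sizes `vsize`/`scale` from the CUDA example that follows:

```
#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);

// a = scale*a, then a = 1*b + a  (both run entirely on the device)
cublasSscal(handle, vsize, &scale, device_a, 1);
const float one = 1.f;
cublasSaxpy(handle, vsize, &one, device_b, 1, device_a, 1);

cublasDestroy(handle);
```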
```
void vector_sum(std::vector<float>& a, float scale,
                const std::vector<float>& b) {
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] = scale*a[i] + b[i];
}
```
[columns,class="row vertical-align"]
[column,class="col-xs-8"]
**C**ompute **U**nified **D**evice **A**rchitecture ([Nvidia CUDA Zone](https://developer.nvidia.com/cuda-zone))[/column]
[column,class="col-xs-4"]
![](img/NVIDIA-CUDA.jpg)[/column]
[/columns]
- freeware tool suite, GPU library package and low/high-level API(s)
- CUDA platform supports C and C++ with a proprietary compiler
- binaries run on Nvidia hardware only
- source code split into host and device part
  - host: C++11 and STL supported
  - device: tiny subset of C++11
    (no exceptions, no iostream, no virtual inheritance, no STL)
```
int main(/*..*/){//..
  std::vector<float> host_a(vsize,1.f);
  std::vector<float> host_b(vsize,2.f);

  float * device_a=nullptr, *device_b=nullptr;
  cudaMalloc(&device_a, vsize_byte);
  cudaMalloc(&device_b, vsize_byte);

  cudaMemcpy(device_a, &host_a[0], vsize_byte,
             cudaMemcpyHostToDevice);
  cudaMemcpy(device_b, &host_b[0], vsize_byte,
             cudaMemcpyHostToDevice);
```
```
//above main
__global__ void vector_sum(std::size_t _size,
                           float _scale, float* _a, float* _b){
  std::size_t index = blockIdx.x*blockDim.x + threadIdx.x;
  if (index < _size)
    _a[index] = _scale*_a[index] + _b[index];
}
```
```
  //in main: dispatch to device
  vector_sum<<<(vsize+255)/256, 256>>>(vsize,
                                       scale,
                                       device_a,
                                       device_b);
  //transfer memory back
  cudaMemcpy(&host_a[0], device_a, vsize_byte,
             cudaMemcpyDeviceToHost);
  //clean-up
  cudaFree(device_a);
  cudaFree(device_b);
  return 0;
}
```
[columns,class="row"]
[column,class="col-xs-6 text-success"]
-
free and working
-
CUDA comes with a ton of tools
(debugger, profiler, libraries, ...) -
CUDA comes with a ton of examples
-
very flexible (device instrinsics, locked memory handling, ...)
-
nVidia very active in porting scientific applications
-
nVidia very active C++ standardisation (Parallelism TS)
[/column]
. . .
[column,class="col-xs-6 text-warning"]
-
plain C API
(memory allocation, error handling, asynchronous calls, ...) -
grid dispatch is error prone
(code repetition in index calculation) -
compiler is sometimes hard to come by (using boost, OpenMP interoperability)
-
__keyword__
disrupt design (redundancy, maintainability)
[/column]
[/columns]
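The "plain C API" item in practice: every runtime call returns a `cudaError_t` that has to be checked by hand. A minimal sketch (the `CUDA_CHECK` macro is illustrative, not part of CUDA):

```
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                     \
  do {                                                       \
    cudaError_t err = (call);                                \
    if (err != cudaSuccess) {                                \
      std::fprintf(stderr, "%s:%d %s\n", __FILE__, __LINE__, \
                   cudaGetErrorString(err));                 \
      std::exit(EXIT_FAILURE);                               \
    }                                                        \
  } while (0)

CUDA_CHECK(cudaMalloc(&device_a, vsize_byte));
```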
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
**Open C**ompute **L**anguage ([khronos.org/opencl](https://www.khronos.org/opencl/))[/column]
[column,class="col-xs-4 bg-primary"]
_No Logo due to Apple's Copyright_[/column]
[/columns]
- open, royalty-free standard for cross-platform, parallel programming
- designed to run on CPUs, GPUs, FPGAs, DSPs, ...
- maintained by the non-profit technology consortium Khronos Group
- source code split into host and device part
  - host: C/C++ based API (lower level than CUDA)
  - device: C11-derived language (OpenCL 2.0)
```
const char *kernelSource =
"__kernel void vecAdd( __global float *a,       \n"
"                      __global float *b,       \n"
"                      __global float *c,       \n"
"                      const unsigned int n)    \n"
"{                                              \n"
"    int id = get_global_id(0);                 \n"
"                                               \n"
"    //Make sure we do not go out of bounds     \n"
"    if (id < n)                                \n"
"        c[id] = a[id] + b[id];                 \n"
"}                                              \n";
```
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
_parallel algorithms library which resembles the C++ Standard Template Library (STL)_[/column]
[column,class="col-xs-4"]
![](img/thrust_logo.png) [thrust.github.io](http://thrust.github.io/)[/column]
[/columns]
- open source (Apache v2 license)
- interoperability with CUDA, TBB and OpenMP (possible backends)
- high-level interface compared to CUDA/OpenCL
```
#include <thrust/functional.h>

struct saxpy_functor
  : public thrust::binary_function<float,float,float>
{
  const float a;
  saxpy_functor(float _a) : a(_a) {}

  __host__ __device__
  float operator()(const float& x,
                   const float& y) const {
    return a * x + y;
  }
};
```
```
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>

int main(/*..*/){//..
  thrust::host_vector<float>   host_a(N,1.f);
  thrust::host_vector<float>   host_b(N,2.f);
  thrust::device_vector<float> dev_a = host_a;
  thrust::device_vector<float> dev_b = host_b;

  thrust::transform(dev_a.begin(), dev_a.end(),
                    dev_b.begin(),
                    dev_a.begin(),
                    saxpy_functor(scale));
}
```
[columns,class="row"]
[column,class="col-xs-6 text-success"]
* C++ and STL for GPUs (and CPUs)!-
container and algorithm API well thought through
-
code becomes readable/maintainable (at least for a C++Dev)
-
algorithms can be dispatched from device kernels as well
-
many examples, active community
[/column]
. . .
[column,class="col-xs-6 text-warning"]
* loss of flexibility:* host-device i/o (pinned memory allocator considered experimental)
* index information only available through kernel built-ins
* grid distpatch of kernel by thrust library (occupancy)
-
kernel optimisations = CUDA
(CUB library?) -
C++11, C++17 ?
[/column]
[/columns]
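On the "kernel optimisations = CUDA (CUB)" point: CUB provides tuned, reusable primitives inside hand-written kernels, i.e. you are back in CUDA territory. A hedged sketch (kernel name illustrative) of a block-wide reduction:

```
#include <cub/cub.cuh>

__global__ void block_sum(const float* in, float* out) {
  // one tuned, block-wide reduction for 256 threads per block
  typedef cub::BlockReduce<float, 256> BlockReduce;
  __shared__ typename BlockReduce::TempStorage temp;

  float value = in[blockIdx.x*blockDim.x + threadIdx.x];
  float sum   = BlockReduce(temp).Sum(value); // valid in thread 0 only
  if (threadIdx.x == 0)
    out[blockIdx.x] = sum;
}
```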
![by [axepalace.com](http://axepalace.com/media/catalog/product/cache/1/image/9df78eab33525d08d6e5fb8d27136e95/d/v/dv020_jpg_jumbo_h70433.001_black_flat3.jpg)](img/baseguitar.jpg)
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
**H**eterogenous **C**ompute **C**ompiler ([bitbucket.org/multicoreware/hcc](https://bitbucket.org/multicoreware/hcc/wiki/Home))[/column]
[column,class="col-xs-4"]
meant for APU ![](img/apu_comic.gif) **A**ll-purpose G**PU**s[/column]
[/columns]
- single-source C++ compiler (for CPU, GPU and APU targets)
- supports C++AMP 1.2, HC, OpenMP 4, C++1x
- currently being ported to discrete GPUs
- very young project, presented in Kona
```
#include <amp.h>
#include <vector>

using namespace concurrency;

void amp_sum(std::vector<float>& _va,
             const std::vector<float>& _vb,
             float _scale){
  extent<1> ext_a(_va.size()), ext_b(_vb.size());
  array_view<float, 1>       view_a(ext_a, _va);
  array_view<const float, 1> view_b(ext_b, _vb);

  parallel_for_each(view_a.get_extent(),
                    [=](index<1> idx) restrict(amp)
                    {
                      view_a[idx] = view_a[idx]*_scale + view_b[idx];
                    });
  view_a.synchronize();
}
```
[columns,class="row"]
[column,class="col-xs-6 text-success"]
-
API focusses on problem-solving and flexibility
-
API appears to be lightweight (array views)
-
multiple binary backends (SPIR-V, OpenCL, ...)
-
multiple hardware backends (CPU, GPU, APU)
-
homogenous C++ source code
-
function continuations supported
future1.then(future2)//..
[/column]
. . .
[column,class="col-xs-6 text-warning"]
-
young project, API still fluid (
concurrency::
=>hc::
) -
no tooling yet (debugger, profiler, ...)
-
performance yield unclear
-
combined API for integrated and discrete GPUs
-
HSA/AMD road map unclear
[/column]
[/columns]
![](img/Elektrofryingpan_landscape.jpg) 1932, by [Museum of Making Music at English Wikipedia](https://commons.wikimedia.org/wiki/File:Elektrofryingpan.jpg)
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
**Open M**ulti-**P**rocessing ([openmp.org](http://openmp.org/))[/column]
[column,class="col-xs-4"]
![](img/OpenMP_logo.png)[/column]
[/columns]
```
void vector_sum(int size, float scale,
                float * restrict a, float * restrict b) {
  #pragma omp target map(to:b[0:size], size, scale) map(a[0:size])
  {
    #pragma omp parallel for
    for (int i=0; i<size; i++) {
      a[i] = a[i] * scale + b[i];
    }
  }
}
```
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
**Open Acc**elerator ([openacc.org](http://openacc.org/))[/column]
[column,class="col-xs-4"]
![](img/openacc_logo.jpg)[/column]
[/columns]
```
void vector_sum(int size, float scale, float *a, float *b) {
  #pragma acc parallel copy(a[0:size]) copyin(b[0:size])
  #pragma acc loop
  for (int i = 0; i < size; ++i)
    a[i] = scale*a[i] + b[i];
}
```
[columns,class="row"]
[column,class="col-xs-6 text-success"]
-
OpenMP is (already) a success story
(why not OpenACC as well) -
dream: one-line injection and code is fast
-
strong industrial support (tooling)
-
GPU: perfect fit for upgrading legacy code or prototyping
[/column]
. . .
[column,class="col-xs-6 text-warning"]
-
OpenMP works well on shared memory CPUs
-
(discrete) GPUs have different architecture than CPUs
-
language in a language ??
-
OpenACC, OpenMP dichotomy (will users/applications loose?)
[/column]
[/columns]
![](img/1024px-Ukulele-electro-acoustic.jpg) by [Alno](https://commons.wikimedia.org/wiki/File:Ukulele-electro-acoustic.JPG)

What can you use tomorrow { data-background="img/touchscreen-guitar.jpg" }
- OpenCL wrapper (Boost.Compute) enabling vendor-independent parallel algorithms
- conceptually very similar to thrust/bolt
```
#include <boost/compute.hpp>
namespace compute = boost::compute;

compute::device gpu = compute::system::default_device();
compute::context ctx(gpu);
compute::command_queue queue(ctx, gpu);

compute::vector<float> device_a(host_a.size(), ctx);//etc..
compute::copy(host_a.begin(), host_a.end(),
              device_a.begin(), queue);//etc..

compute::transform(device_a.begin(), device_a.end(),
                   device_b.begin(),
                   device_a.begin(),
                   compute::plus<float>(), queue);
```
Take away: SPIR-V promising, SYCL very similar to boost.compute
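For flavour, a sketch of the same vector_sum in SYCL, assuming a 2015-era SYCL 1.2 implementation (class names follow the provisional spec; not from the talk):

```
#include <CL/sycl.hpp>

void sycl_sum(std::vector<float>& a, float scale,
              std::vector<float>& b) {
  using namespace cl::sycl;
  queue q;                               // default device
  range<1> n(a.size());
  buffer<float,1> buf_a(a.data(), n);
  buffer<float,1> buf_b(b.data(), n);

  q.submit([&](handler& cgh) {
    auto va = buf_a.get_access<access::mode::read_write>(cgh);
    auto vb = buf_b.get_access<access::mode::read>(cgh);
    cgh.parallel_for<class vector_sum>(n, [=](id<1> i) {
      va[i] = scale*va[i] + vb[i];
    });
  });
}   // buffer destructors synchronise results back to the host
```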
```
vector_sum<<<(vsize+255)/256, 256>>>(/*..*/);
```
```
launch(vector_sum, /*..*/);
```
. . .
```
auto f1 = bulk_async(par(n), [=](parallel_agent &self)
{
  int i = self.index();
  z[i] = a * x[i] + y[i];
});

auto f2 = bulk_then(f1, par(n), other_work);
auto f3 = bulk_then(f1, par(n), more_work);

when_all(f2, f3).wait();
```
[columns,class="row vertical-align"]
[column,class="col-xs-6"]
Published already: Parallelism TS, Concurrency TS
[/column]
[column,class="col-xs-4"]
[![](img/logo-cpp.jpg)](https://isocpp.org)[/column]
[/columns]
```
transform(std::experimental::parallel::par,
          std::begin(a), std::end(a),
          std::begin(b),
          std::begin(a),
          [=](float a, float b) {
            return scale*a + b;
          });
```
. . .
```
future<int> f1 = copy_to_device();

future<int> f2 = f1.then([](future<int> f) {
                   return start_compute();
                 });

future<int> f3 = f2.then([](future<int> f) {
                   return copy_to_host();
                 });
```
- better API to coordinate asynchronous transfers and computations
- future: use `(a)wait`/`then` and friends to express data dependencies
- support by compiler vendors needed
- in production: almost dominated by C99
- on the horizon: performant, flexible and maintainable C++ APIs emerging
- GPUs today convert workstations into compute clusters, and clusters into supercomputers!
- GPU architecture is complex: obtaining maximum performance is challenging
- accelerators are a must on the road to exascale/performance
[columns,class="row"]
[column,class="col-xs-4"]
MPI CBG / Scionics Computer Innovations GmbH
[/column]
[column,class="col-xs-8 text-left"]
Robert Haase, Ian Henry, Benoit Lombardot, Jeff Oegema
[/column]
[/columns]
[columns,class="row"]
[column,class="col-xs-4"]
[/column]
[column,class="col-xs-8 text-left"]
Guido Juckeland, Thomas Karnagel, René Widera, Erik Zenker
[/column]
[/columns]
[columns,class="row"]
[column,class="col-xs-4"]
[/column]
[column,class="col-xs-8 text-left"]
Greg Stoner, Ben Sander, Chan SiuChi; Jack Chung
[/column]
[/columns]
[columns,class="row"]
[column,class="col-xs-4"]
[/column]
[column,class="col-xs-8 text-left"]
Axel Köhler
[/column]
[/columns]
![](img/Sleeping_students.jpg) **For Questions, Comments, Complaints, Compliments, ... **