---
title: C++ on GPUs done right?
author: Peter Steinbach
origin: Max Planck Institute of Molecular Cell Biology and Genetics, Dresden
date: Meeting C++, December 05, 2015
---

Disclaimer(s)

No OpenGL/Vulkan here!

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

[/column]

[column,class="col-xs-6"]

[/column]

[/columns]

© [Khronos Group](https://www.khronos.org/news/logos/)

This is Open-Source!

Feel free to reply, discuss, inform, correct, ...

github.com/psteinb/meetingcpp2015

. . .

 

If not stated otherwise, the slides and all their code are licensed under

Creative Commons Attribution 4.0 International License

Who am I?

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

![](img/800px-MPI-CBG_building_outside_4pl.jpg) *Scientific Software Engineer* at Max Planck Institute of Molecular Cell Biology and Genetics

[/column]

[column,class="col-xs-4"]

  • Dresden, Saxony
  • 450 staff
  • founded 2001
  • cell biology
  • genomics
  • systems biology

[/column]

[/columns]

What do I do?

Accelerating scientific algorithms on GPUs (multi-GB datasets, a lot of FFTs)

Outline

1. Massively Parallel Programming
  1. Architecture

  2. What can you use today

  3. What can you use tomorrow

Massively Parallel Programming { data-background="img/Titan_render.png" data-background-size="800px" style="margin-top: 20%"}

Yet Another Hype?

Data obtained from Top500.org

Vendor Options

[columns,class="row"]

[column,class="col-xs-4"]

Nvidia Tesla ![](img/Nvidia-Tesla-K80_x400.jpg)

GPU without Graphics

[/column]

[column,class="col-xs-4"]

AMD FirePro ![](img/amd-firepro-s9150-server-graphics_x400.png)

GPU without Graphics

[/column]

[column,class="col-xs-4"]

Intel MIC ![](img/xeon_phi_x400.jpg)

Not Covered Today!

[/column]

[/columns]

Vendor flagships

[columns,class="row vertical-align"]

[column,class="col-xs-4"]

_Nvidia Tesla K80_ ![](img/Nvidia-Tesla-K80_x400.jpg)

[/column]

[column,class="col-xs-4"]

_AMD FirePro S9170_ ![](img/amd-firepro-s9150-server-graphics_x400.png)

[/column]

[column,class="col-xs-4"]

Intel Xeon Phi 5110P ![](img/xeon_phi_x400.jpg)

[/column]

[/columns]

 

[columns,class="row vertical-align"]

[column,class="col-xs-4"]

  • 2x GK210 chipsets
  • 2x 12 GB GDDR5
  • 2x 288 GB/s to RAM
  • 8.7 TFlops SP
  • 2.9 TFlops DP

[/column]

[column,class="col-xs-4"]

  • 1x Grenada XT
  • 32 GB GDDR5 RAM
  • 320 GB/s to RAM
  • 5.2 TFlops SP
  • 2.6 TFlops DP

[/column]

[column,class="col-xs-4"]

  • 62x x86 CPUs
  • 8 GB GDDR5 RAM
  • 320 GB/s to RAM
  • 2.1 TFlops SP
  • 1.1 TFlops DP

[/column]

[/columns]

Architecture { data-background="img/nvidia_kepler_die_shot.jpg" }

{ data-background="img/islay_1024px.png" data-background-size="800px" }

{ data-background="img/islay_annotated_1024px.png" data-background-size="800px" }

Food Hunt

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

Tuna ![](img/1024px-Thunnus_orientalis_Osaka_Kaiyukan_Aquarium_cropped_x400.jpg) (fast, single, versatile)

[/column]

[column,class="col-xs-6"]

Forage Fish ![](img/1024px-School_of_Pterocaesio_chrysozona_in_Papua_New_Guinea_1_x400.jpg) (small, many, use wakefield of neighbor)

[/column]

[/columns]

The same principle on die

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

CPU ![](img/Central-Processing-Unit_x400.jpeg)

[/column]

[column,class="col-xs-6"]

GPU ![](img/Nvidia-Tesla-K80_x400.jpg)

[/column]

[/columns]

Note: Will use Nvidia Kepler as GPGPU example.

A more in-depth look

Nvidia K40: 15 Streaming Multiprocessors (SMX), 12 GB of GDDR5 DRAM

Kepler SMX Close-up

CUDA core: 192 fp32 ops / clock

SIMT Execution

[columns,class="row vertical-align"]

[column,class="col-xs-2"]

Warp

[/column]

[column,class="col-xs-4"]

[/column]

[column,class="col-xs-8"]

  • smallest unit of concurrency: 32 threads
  • thread = single CUDA core
  • all threads execute same program

[/column]

[/columns]

. . .

[columns,class="row vertical-align"]

[column,class="col-xs-2"]

Block

[/column]

[column,class="col-xs-4"]

[/column]

[column,class="col-xs-8"]

  • can synchronize (barriers)
  • can exchange data (common "shared" memory, etc.)

[/column]

[/columns]

. . .

[columns,class="row vertical-align"]

[column,class="col-xs-2"]

Grid

[/column]

[column,class="col-xs-4"]

[/column]

[column,class="col-xs-8"]

  • grids/blocks serve as work distribution/sharing mechanism on device (occupancy)
  • blocks dispatched to SMX (in turn run warps); a minimal launch sketch follows below

[/column]

[/columns]
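To make the warp/block/grid hierarchy concrete, a minimal launch-geometry sketch (my own, not from the slides; `my_kernel` and `n` are placeholders):

```
std::size_t n = 1 << 20;               // number of data elements
dim3 block(256);                       // 256 threads per block = 8 warps
dim3 grid((n + block.x - 1)/block.x);  // enough blocks to cover all n elements

my_kernel<<<grid, block>>>(/*...*/);   // blocks are distributed over the SMXs,
                                       // each SMX executes its blocks warp by warp
```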

Hiding Memory Latency

  • Kepler:

    • global memory access: 200-400 ticks per warp

    • fp32 add/mul/fma: 32 per tick per warp

. . .

 

[columns,class="row vertical-align"]

[column,class="col-xs-12"]

[/column]

[/columns]

  • hide (memory) latency by pipelining active warps

{ data-background="img/1024px-unmarked_holes.jpg" }

Compute > Memory Access

 

  • device kernels

    • arithmetic complexity needs to be high

    • number of arithmetic operations > number of load/store operations (a back-of-the-envelope example follows below)
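As a back-of-the-envelope example (my numbers, not part of the original slide): the vector sum used later performs 2 flops per element (one multiply, one add) while moving 12 bytes (two 4-byte loads, one 4-byte store), i.e. roughly 0.17 flops per byte. A K40 with about 4.3 TFlops SP and 288 GB/s would need on the order of 15 flops per byte to be compute-bound, so such a kernel is firmly limited by memory access.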

Data Locality

[columns,class="row vertical-align"]

[column,class="col-xs-12"]

[/column]

[/columns]

Keep data put as long as possible!

Memory Access

Bad: Non-Coalesced Memory Access

[columns,class="row vertical-align"]

[column,class="col-xs-12"]

  • every thread accesses different cache line at random
  • warp has to be replayed 31 times to complete 1 instruction

[/column]

[/columns]

. . .

**Good: Coalesced Memory Access**

[columns,class="row vertical-align"]

[column,class="col-xs-12"]

[/column]

[/columns]
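As a small code illustration (a sketch, not from the slides), the difference is purely the indexing pattern; consecutive threads of a warp should touch consecutive addresses:

```
// two ways for a warp to copy floats
__global__ void coalesced(const float* in, float* out, std::size_t n){
  std::size_t i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = in[i];   // threads 0..31 read consecutive addresses:
                      // one (or two) transactions per warp
}

__global__ void strided(const float* in, float* out,
                        std::size_t n, std::size_t stride){
  std::size_t i = (blockIdx.x*blockDim.x + threadIdx.x)*stride;
  if (i < n)
    out[i] = in[i];   // neighbouring threads hit different cache lines:
                      // up to 32 transactions (replays) per warp
}
```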

Summary Architecture

  • **GPUs are complicated beasts**

  • massively parallel compute power (per Watt)

  • massive ways to kill performance

What can you use today? { data-background="img/1024px-San_Francisco_Haight_Str_Guitar_Shop.jpg" style="color: black; margin: 0;margin-top: -100px;" }

A Word of Warning!

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

![](img/800x_warning-42657_1280.png)

[/column]

[column,class="col-xs-6"]

  • 32 threads is the minimum
  • good tools are rare and almost never portable

. . .

**Use a Library!**

[/column]

[/columns]

Use Libraries!

[columns,class="row"]

[column,class="col-xs-6"]

**Vendor sponsored**

[/column]

[column,class="col-xs-6"]

**Open Source**

[/column]

[/columns]

[columns,class="row"]

[column,class="col-xs-3 text-right"]

[CUDA](https://developer.nvidia.com/gpu-accelerated-libraries) based
  • cuBLAS
  • cuFFT
  • cuDNN
  • cuSparse
  • cuSolver
  • cuRAND
    ...

[/column]

[column,class="col-xs-3"]

[OpenCL](http://developer.amd.com/tools-and-sdks/opencl-zone/acl-amd-compute-libraries/) based
  • clBLAS
  • clFFT
  • clSparse
  • clRNG
    ...

[/column]

[column,class="col-xs-6"]

* Multi-Purpose: [Alpaka](https://github.com/ComputationalRadiationPhysics/alpaka), [ArrayFire](https://github.com/arrayfire/arrayfire), [VexCL](https://github.com/ddemidov/vexcl), [ViennaCL](http://viennacl.sourceforge.net/), ...

[/column]

[/columns]
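To give the library route a concrete flavour (a sketch of my own, not part of the original deck): the baseline vector sum of the next slide is essentially BLAS saxpy, so with cuBLAS it shrinks to two calls once the data lives on the device (`device_a`, `device_b` as in the CUDA slides below):

```
#include <cublas_v2.h>

// a = scale*a + b via cuBLAS; error checking omitted for brevity
void vector_sum_cublas(int n, float scale,
                       float* device_a, const float* device_b){
  cublasHandle_t handle;
  cublasCreate(&handle);

  const float one = 1.f;
  cublasSscal(handle, n, &scale, device_a, 1);             // a = scale*a
  cublasSaxpy(handle, n, &one, device_b, 1, device_a, 1);  // a = 1*b + a

  cublasDestroy(handle);
}
```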

Baseline Example

```
void vector_sum(std::vector<float>& a, float scale,
                const std::vector<float>& b) {
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] = scale*a[i] + b[i];
}
```

CUDA Overview

[columns,class="row vertical-align"]

[column,class="col-xs-8"]

**C**ompute **U**nified **D**evice **A**rchitecture ([Nvidia CUDA Zone](https://developer.nvidia.com/cuda-zone))

[/column]

[column,class="col-xs-4"]

![](img/NVIDIA-CUDA.jpg)

[/column]

[/columns]

  • freeware tool suite, GPU library package and low/high level API(s)

  • CUDA platform supports C and C++ with proprietary compiler

  • binaries run on Nvidia hardware only

  • source code split into host and device part

    • host : C++11 and STL supported

    • device: tiny subset of C++11
      (no exceptions, no iostream, no virtual inheritance, no STL)

 

1. Declare and allocate host and device memory.
2. Initialize host data.
3. Transfer data from the host to the device.
4. Execute one or more kernels (vector sum).
5. Transfer results from the device to the host.

CUDA Code: Mem Init

int main(/*..*/){//..
  std::vector<float> host_a(vsize,1.f);
  std::vector<float> host_b(vsize,2.f);

  float * device_a=nullptr, *device_b=nullptr;
  cudaMalloc(&device_a, vsize_byte); 
  cudaMalloc(&device_b, vsize_byte);

  cudaMemcpy(device_a, &host_a[0], vsize_byte,
             cudaMemcpyHostToDevice);
  cudaMemcpy(device_b, &host_b[0], vsize_byte,
			 cudaMemcpyHostToDevice);

CUDA Code: Compute

//above main
__global__ void vector_sum(std::size_t _size,
			   float _scale, float* _a, float* _b){
  std::size_t index = blockIdx.x*blockDim.x + threadIdx.x;
  if (index < _size)
    _a[index] = _scale*_a[index] + _b[index];
}

//in main: dispatch to device
vector_sum<<<(vsize+255)/256, 256>>>(vsize,
									 host_d,
									 device_a,
									 device_b);

CUDA Code: Mem TX + Clean-up

  //transfer memory back
  cudaMemcpy(&host_a[0], device_a, vsize_byte,
             cudaMemcpyDeviceToHost);

  //clean-up
  cudaFree(device_a);
  cudaFree(device_b);
  return 0;
}

CUDA Wrap-up

[columns,class="row"]

[column,class="col-xs-6 text-success"]

  • free and working

  • CUDA comes with a ton of tools
    (debugger, profiler, libraries, ...)

  • CUDA comes with a ton of examples

  • very flexible (device intrinsics, locked memory handling, ...)

  • nVidia very active in porting scientific applications

  • nVidia very active in C++ standardisation (Parallelism TS)

[/column]

. . .

[column,class="col-xs-6 text-warning"]

  • plain C API
    (memory allocation, error handling, asynchronous calls, ...;
    a small error-checking helper is common practice, sketched below)

  • grid dispatch is error prone
    (code repetition in index calculation)

  • compiler is sometimes hard to get along with (boost, OpenMP interoperability)

  • `__keyword__` qualifiers disrupt design (redundancy, maintainability)

[/column]

[/columns]
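One consequence of the "plain C API" point above: every runtime call returns an error code that the snippets silently ignore. A minimal checking helper (my own sketch, not from the slides) keeps that manageable:

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// wrap every cudaMalloc/cudaMemcpy/... call and abort on failure
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
      std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",          \
                   cudaGetErrorString(err), __FILE__, __LINE__);  \
      std::exit(EXIT_FAILURE);                                    \
    }                                                             \
  } while (0)

// usage: CUDA_CHECK(cudaMalloc(&device_a, vsize_byte));
```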

CUDA is like ... { data-background="img/1024px-Taylor415_acoustic.jpg" }

OpenCL

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

**Open C**omputing **L**anguage ([khronos.org/opencl](https://www.khronos.org/opencl/))

[/column]

[column,class="col-xs-4 bg-primary"]

_No Logo due to Apple's Copyright_

[/column]

[/columns]

 

  • open, royalty-free standard for cross-platform, parallel programming

  • designed to run on CPUs, GPUs, FPGAs, DSPs, ...

  • maintained by non-profit technology consortium Khronos Group

  • source code split into host and device part

    • host : C/C++ based API (lower level than CUDA)

    • device: C11 derived language (OpenCL 2.0)

OpenCL Kernel

const char *kernelSource =                     "\n" \
"__kernel void vecAdd(  __global float *a,      \n" \
"                       __global float *b,      \n" \
"                       __global float *c,      \n" \
"                       const unsigned int n)   \n" \
"{                                              \n" \
"    int id = get_global_id(0);                 \n" \
"                                               \n" \
"    //Make sure we do not go out of bounds     \n" \
"    if (id < n)                                \n" \
"        c[id] = a[id] + b[id];                 \n" \
"}                                              \n" \
                                               "\n" ;
from [www.olcf.ornl.gov](https://www.olcf.ornl.gov/tutorials/opencl-vector-addition/)
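For comparison with the CUDA version: a compressed, untested sketch of what the host side roughly looks like (error handling omitted; `h_a`, `h_b`, `h_c`, `n` and `bytes` assumed to exist), which is why the API counts as lower level than CUDA:

```
cl_int err;
cl_platform_id platform;  clGetPlatformIDs(1, &platform, NULL);
cl_device_id   device;    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, &err);
cl_mem d_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, &err);
cl_mem d_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);
clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, bytes, h_a, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0, bytes, h_b, 0, NULL, NULL);

cl_program program = clCreateProgramWithSource(ctx, 1, &kernelSource, NULL, &err);
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
cl_kernel  kernel  = clCreateKernel(program, "vecAdd", &err);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
clSetKernelArg(kernel, 3, sizeof(unsigned int), &n);

size_t localSize = 64, globalSize = ((n + localSize - 1)/localSize)*localSize;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, bytes, h_c, 0, NULL, NULL);
```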

OpenCL is like ...

![by [Kasra Ganjavi](https://en.wikipedia.org/wiki/File:Andy_McKee,_January_2008.jpg)](img/Andy_McKee_January_2008.jpg)

thrust

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

_parallel algorithms library which resembles the C++ Standard Template Library (STL)_

[/column]

[column,class="col-xs-4"]

![](img/thrust_logo.png) [thrust.github.io](http://thrust.github.io/)

[/column]

[/columns]

 

  • open source (Apache v2 license)

  • interoperability with CUDA, TBB and OpenMP (possible backends)

  • high level interface compared to CUDA/OpenCL

thrust Code: Functor

struct saxpy_functor :
public thrust::binary_function<float,float,float>
{
    const float a;
    saxpy_functor(float _a) : a(_a) {}

	__host__ __device__
    float operator()(const float& x,
					 const float& y
					) const { 
            return a * x + y;
        }
};

thrust Code: GPU dispatch

int main(//...){//..

  thrust::host_vector<float> host_a(N,1.f);
  thrust::host_vector<float> host_b(N,2.f);

  thrust::device_vector<float> dev_a = host_a;
  thrust::device_vector<float> dev_b = host_b;

  thrust::transform(dev_a.begin(),dev_a.end(), 
					dev_b.begin(),
				    dev_a.begin(),
				    saxpy_functor(scale));	
}

thrust Wrap-up

[columns,class="row"]

[column,class="col-xs-6 text-success"]

  • C++ and STL for GPUs (and CPUs)!

  • container and algorithm API well thought through

  • code becomes readable/maintainable (at least for a C++Dev)

  • algorithms can be dispatched from device kernels as well

  • many examples, active community

[/column]

. . .

[column,class="col-xs-6 text-warning"]

  • loss of flexibility:

    • host-device I/O (pinned memory allocator considered experimental)

    • index information only available through kernel built-ins
      (a counting_iterator sketch follows below)

    • grid dispatch of kernel by thrust library (occupancy)

  • kernel optimisations = CUDA
    (CUB library?)

  • C++11, C++17 ?
[/column]

[/columns]
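Regarding the index-information point above: the usual thrust workaround is a `counting_iterator` driving `for_each`; a minimal sketch (my own example, not from the slides):

```
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>

// functor that receives the element index instead of the element itself
struct indexed_op {
  float* data;
  __host__ __device__ void operator()(int i) const { data[i] *= i; }
};

void scale_by_index(thrust::device_vector<float>& v){
  thrust::for_each(thrust::counting_iterator<int>(0),
                   thrust::counting_iterator<int>(static_cast<int>(v.size())),
                   indexed_op{thrust::raw_pointer_cast(v.data())});
}
```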

thrust is like ...

![by [axeplace.com](http://axepalace.com/media/catalog/product/cache/1/image/9df78eab33525d08d6e5fb8d27136e95/d/v/dv020_jpg_jumbo_h70433.001_black_flat3.jpg)](img/baseguitar.jpg)

HCC

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

**H**eterogeneous **C**ompute **C**ompiler ([bitbucket.org/multicoreware/hcc](https://bitbucket.org/multicoreware/hcc/wiki/Home))

[/column]

[column,class="col-xs-4"]

meant for APU ![](img/apu_comic.gif) **A**ll-purpose G**PU**s

[/column]

[/columns]

  • single source C++ compiler (for CPU, GPU and APU targets)

  • supports C++AMP 1.2, HC, OpenMP 4, C++1x

  • currently being ported to discrete GPUs

  • very young project presented in Kona

HCC Vector Sum (C++AMP)

using namespace concurrency;

void amp_sum(vector<float>& _va,
			 const vector<float>& _vb,
			 float _scale){
		 
  extent<1> ext_a(_va.size()),ext_b(_vb.size());

  array_view<float, 1> view_a(ext_a,_va); 
  array_view<const float, 1> view_b(ext_b,_vb); 

HCC continued

  parallel_for_each(view_a.get_extent(),
		    [=](index<1> idx) restrict(amp)
		    {
		      view_a[idx] = view_a[idx]*_scale + view_b[idx];
		    }
		    );

  view_a.synchronize();
}

HCC Wrap-up

[columns,class="row"]

[column,class="col-xs-6 text-success"]

  • API focusses on problem-solving and flexibility

  • API appears to be lightweight (array views)

  • multiple binary backends (SPIR-V, OpenCL, ...)

  • multiple hardware backends (CPU, GPU, APU)

  • homogeneous C++ source code

  • function continuations supported

future1.then(future2)//..

[/column]

. . .

[column,class="col-xs-6 text-warning"]

  • young project, API still fluid (concurrency:: => hc::)

  • no tooling yet (debugger, profiler, ...)

  • performance yield unclear

  • combined API for integrated and discrete GPUs

  • HSA/AMD road map unclear

[/column]

[/columns]

HC is like ...

![](img/Elektrofryingpan_landscape.jpg) 1932, by [Museum of Making Music at English Wikipedia](https://commons.wikimedia.org/wiki/File:Elektrofryingpan.jpg)

Pragma based approaches

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

**Open M**ulti-**P**rocessing ([openmp.org](http://openmp.org/))

[/column]

[column,class="col-xs-4"]

![](img/OpenMP_logo.png)

[/column]

[/columns]

void vector_sum(int size, float scale,
				float * restrict a, float * restrict b) {
	#pragma omp target map(to:b[0:size], size, scale) map(a[0:size])
	{
		#pragma omp parallel for
		for (int i=0; i<size; i++) {
			a[i] = a[i] * scale + b[i];
		}
	}
}
accelerator target since version 4.0 ([gcc 5.0+](https://gcc.gnu.org/wiki/Offloading), [icc 16+](https://software.intel.com/en-us/intel-parallel-studio-xe), [ENZO2016](http://www.pathscale.com/enzo))

Pragmas continued

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

**Open Acc**elerator ([openacc.org](http://openacc.org/))

[/column]

[column,class="col-xs-4"]

![](img/openacc_logo.jpg)

[/column]

[/columns]

void vector_sum(int size, float scale, float *a, float *b) {

	#pragma acc parallel copy(a[0:size]) copyin(b[0:size])
	#pragma acc loop
	for (int i = 0; i < size; ++i)
		a[i] = scale*a[i] + b[i];
	
}
(partially available in [gcc 5.0+](https://gcc.gnu.org/wiki/Offloading), fully in [pgi](https://www.pgroup.com/resources/accel.htm) & [ENZO2016](http://www.pathscale.com/enzo) compiler)

Pragma Wrap-up

[columns,class="row"]

[column,class="col-xs-6 text-success"]

  • OpenMP is (already) a success story
    (why not OpenACC as well)

  • dream: one-line injection and code is fast

  • strong industrial support (tooling)

  • GPU: perfect fit for upgrading legacy code or prototyping

[/column]

. . .

[column,class="col-xs-6 text-warning"]

  • OpenMP works well on shared memory CPUs

  • (discrete) GPUs have different architecture than CPUs

  • language in a language ??

  • OpenACC, OpenMP dichotomy (will users/applications lose?)

[/column]

[/columns]

Pragmas are like ...

![](img/1024px-Ukulele-electro-acoustic.jpg) by [Alno](https://commons.wikimedia.org/wiki/File:Ukulele-electro-acoustic.JPG)

What can you use tomorrow { data-background="img/touchscreen-guitar.jpg" }

Boost.Compute

  • not yet part of the boost library
  • OpenCL wrapper enabling vendor-independent parallel algorithms

  • conceptually very similar to thrust/bolt

	compute::device gpu = compute::system::default_device();
	compute::context ctx(gpu);
	compute::command_queue queue(ctx, gpu);

	compute::vector<float> device_a(a.size(), ctx);//etc..
	compute::copy(host_a.begin(), host_a.end(),
		device_a.begin(), queue);//etc..

	compute::transform(device_a.begin(), device_a.end(),
		device_b.begin(), device_a.begin(),
		compute::plus<float>(), queue);

OpenCL tomorrow

![](img/khronos_road_map.png) from [SIGGRAPH Asia 11/2015](https://www.khronos.org/assets/uploads/developers/library/2015-sigasia/SIGGRAPH-Asia_Nov15.pdf)

Take away: SPIR-V promising, SYCL very similar to boost.compute
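Since SYCL keeps coming up: a rough single-source sketch of the vector sum in SYCL 1.2 style (my own, untested against a real implementation):

```
#include <CL/sycl.hpp>
#include <vector>

void vector_sum(std::vector<float>& a, float scale,
                const std::vector<float>& b){
  using namespace cl::sycl;

  buffer<float, 1> buf_a(a.data(), range<1>(a.size()));
  buffer<float, 1> buf_b(b.data(), range<1>(b.size()));

  queue q;                                  // default device selection
  q.submit([&](handler& cgh){
    auto va = buf_a.get_access<access::mode::read_write>(cgh);
    auto vb = buf_b.get_access<access::mode::read>(cgh);
    cgh.parallel_for<class vsum>(range<1>(a.size()), [=](id<1> i){
      va[i] = scale*va[i] + vb[i];
    });
  });
}  // buf_a synchronizes back into a when it goes out of scope
```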

CUDA tomorrow

vector_sum<<<(vsize+255)/256, 256>>>(/*..*/);

launch(vector_sum, /*..*/);
from GTC2015 (03/2015)

. . .

auto f1 = bulk_async(par(n), [=](parallel_agent &self)
						    {
							  int i = self.index();
							  z[i] = a * x[i] + y[i];
							});
 
auto f2 = bulk_then(f1, par(n), other_work);
auto f3 = bulk_then(f1, par(n), more_work);
when_all(f2, f3).wait();
from SC15 (11/2015, [agency](https://github.com/jaredhoberock/agency))

C++17

[columns,class="row vertical-align"]

[column,class="col-xs-6"]

Published already:

Parallelism TS, Concurrency TS

[/column]

[column,class="col-xs-4"]

[![](img/logo-cpp.jpg)](https://isocpp.org)

[/column]

[/columns]

transform(	std::experimental::parallel::par,
			std::begin(a), std::end(a),
			std::begin(b),
			std::begin(a),
			[&](float a, const float& b) {
				return scale*a + b;
			});
**vendors plan to support this with**
``` std::parallel::cuda, std::parallel::opencl ```

My C++17 GPU excitement

. . .

future<int> f1 = copy_to_device();
future<int> f2 = f1.then([](future<int> f) {
                    future<int> f3 = start_compute();
                    return f3;
					});
future<int> f3 = f2.then([](future<int> f){
					return copy_to_host();
					}
					);
taken from the Concurrency TS
  • better API to coordinate asynchronous transfers and computations

  • future: use (a)wait/then and friends to express data dependencies

  • support by compiler vendors needed

Summary

C++ on GPUs done right?

  • in production: almost dominated by C99

  • on the horizon: performant, flexible and maintainable C++ APIs emerging

GPUs are there to stay

 

GPUs today convert workstations to compute clusters, and clusters to supercomputers!

 

  • GPU architecture is complex: obtaining max. performance is challenging

  • accelerators are a must on the road to exascale/performance

Acknowledgements

[columns,class="row"]

[column,class="col-xs-4"]

MPI CBG / Scionics Computer Innovations GmbH

[/column]

[column,class="col-xs-8 text-left"]

Robert Haase, Ian Henry, Benoit Lombardot, Jeff Oegema

[/column]

[/columns]

[columns,class="row"]

[column,class="col-xs-4"]

GPU Center of Excellence

[/column]

[column,class="col-xs-8 text-left"]

Guido Juckeland, Thomas Karnagel, René Widera, Erik Zenker

[/column]

[/columns]

[columns,class="row"]

[column,class="col-xs-4"]

AMD; Multicoreware

[/column]

[column,class="col-xs-8 text-left"]

Greg Stoner, Ben Sander, Chan SiuChi; Jack Chung

[/column]

[/columns]

[columns,class="row"]

[column,class="col-xs-4"]

nVidia

[/column]

[column,class="col-xs-8 text-left"]

Axel Köhler

[/column]

[/columns]

Thank you!

![](img/Sleeping_students.jpg)

{style="font-size: 1.5em"}

**For Questions, Comments, Complaints, Compliments, ...**

github.com/psteinb/meetingcpp2015