Octo Tiger GPU TODO Items
- March 1, 2019 – Submissions open
- April 2, 2019 – Abstracts submissions deadline
- April 10, 2019 – Full paper deadline (No extensions)
- May 11, 2019 – Reviews sent
- May 25, 2019 – Resubmissions deadline
- June 15, 2019 – Notifications sent
- July 12, 2019 – Major revision deadline
- August 9, 2019 – Major revision notifications sent
- August 28, 2019 – Final paper deadline
Conversion of the FMM interaction kernels to a Struct-of-Arrays (SoA) data-structure and stencil-based interactions (Gregor, David)
The FMM interaction methods have proven to be the most compute-intensive parts of the code. They comprise four kernels: Multipole-Multipole, Monopole-Monopole, Monopole-Multipole and Multipole-Monopole.
Switching from an Array-of-Structs data-structure to a Struct-of-Arrays structure, and using a stencil instead of an interaction list, should improve cache efficiency, avoid gather/scatter operations and eventually enable us to move the kernels to the GPU.
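As a rough illustration of the intended data layout, here is a minimal SoA sketch; multipole_soa and NUM_COMPONENTS are hypothetical names, not the actual Octotiger classes:

```cpp
// Hypothetical sketch of an SoA container for multipole data.
// Instead of an AoS layout (std::vector<multipole>), each coefficient gets its
// own contiguous array, so a kernel can load one component for many cells with
// a single vectorized (or, on the GPU, coalesced) access.
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t NUM_COMPONENTS = 20;  // assumed number of multipole coefficients

class multipole_soa {
public:
    explicit multipole_soa(std::size_t num_cells) : num_cells_(num_cells) {
        for (auto& component : data_)
            component.resize(num_cells);
    }

    std::size_t size() const { return num_cells_; }

    // Contiguous access to component c for all cells.
    double* component(std::size_t c) { return data_[c].data(); }

    // Convert one AoS entry (e.g. a cell from the interaction list) into SoA form.
    template <typename AoSEntry>
    void set(std::size_t cell, AoSEntry const& entry) {
        for (std::size_t c = 0; c < NUM_COMPONENTS; ++c)
            data_[c][cell] = entry[c];
    }

private:
    std::size_t num_cells_;
    std::array<std::vector<double>, NUM_COMPONENTS> data_;
};
```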
- Create SoA data-structure class
- Create stencil that includes all interactions from the interaction list for one element
- Create conversion method from Array-of-Structs (AoS) to Struct-of-Arrays to convert the data of a node and its neighbors
- Create CPU Multipole-Multipole (M2M) interaction kernel that uses the new data-structure and the stencil
- Integrate new M2M kernel and the conversion methods into compute fmm
- Test and optimize (L1 blocking?) the new M2M interaction kernel
- Split M2M kernel into RHO and non-RHO versions to avoid branching
- Performance analysis of the new kernel compared to the old AoS M2M interaction kernel
- Create data-structure conversion method for Monopole-Monopole (P2P) interactions that only loads relevant data
- Create additional stencil for monopole interactions
- Create CPU SoA Monopole-Monopole (P2P) interaction kernel that uses the stencil
- Integrate SoA P2P conversion method and kernel
- Test and optimize the new P2P SoA kernel (again, L1 blocking?)
- Create data-structure conversion method for Monopole-Multipole (P2M) interactions that only loads relevant data
- Create CPU Monopole-Multipole (P2M) interaction kernel that uses the stencil
- Integrate SoA P2M conversion method and kernel
- Test and optimize the new P2M SoA kernel
- Add early exit conditions to the P2M kernel
- Create data-structure conversion method for Multipole-Monopole (M2P) interactions that only loads relevant data
- Create CPU Multipole-Monopole (M2P) interaction kernel that uses the stencil
- Integrate SoA M2P conversion method and kernel
- Test and optimize the new M2P SoA kernel
- Combine the M2M and M2P kernels into a single kernel
- Create CLI parameters to switch between the new SoA stencil kernels and the old AoS interaction-list kernels
- Verify single-node results for multiple scenarios
- Verify results for distributed communication
- Fix distributed communication for the SoA kernels
- Count FLOPs done by each SoA kernel
- Estimate memory requirement of each SoA kernel (too big for L2?)
- Gather runtime results on KNL and Xeon Silver for the Oxford paper
- Gather data for runtime comparison between new and old kernels on multiple platforms
- Analyse new hotspots
- Move SoA conversion buffers into thread-local memory to avoid constant reallocation (and to lower memory requirements)
The new SoA kernels can now be ported to the GPU using Cuda.
- One kernel does not provide enough work to keep the complete GPU busy, thus we want to launch multiple kernels simultaneously using Cuda streams.
- Synchronization of the results should be done with HPX futures, which enables us to do other work on the current HPX worker thread instead of just waiting for the results of a Cuda kernel.
- We want to fully use both GPUs as well as the CPU at the same time. Kernels should be launched on the GPU unless all Cuda streams of the device are already busy, in which case we launch them on the CPU. For this we need a CPU/GPU scheduler for each HPX worker thread that manages its Cuda streams and decides where to launch each kernel.
- All calls to the Cuda interface should be asynchronous (or avoided during the FMM part) in order to keep the CPU free to work on other kernels.
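A minimal sketch of the intended launch pattern, using plain Cuda runtime calls: the kernel body is a placeholder and std::async merely stands in for the HPX future that the real integration would use for synchronization.

```cpp
// Sketch: overlap several independent kernel launches using CUDA streams.
// interaction_kernel_stub and launch_async are placeholders; in Octotiger the
// completion would be tied to an HPX future so the worker thread stays free.
#include <cuda_runtime.h>
#include <future>
#include <vector>

__global__ void interaction_kernel_stub(double* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0;  // placeholder for the real FMM interaction
}

std::future<void> launch_async(double* device_data, int n, cudaStream_t stream) {
    interaction_kernel_stub<<<(n + 127) / 128, 128, 0, stream>>>(device_data, n);
    // Stand-in for an HPX future: a helper thread that blocks on the stream.
    return std::async(std::launch::async,
                      [stream] { cudaStreamSynchronize(stream); });
}

int main() {
    constexpr int num_streams = 4;
    constexpr int n = 1 << 20;
    std::vector<cudaStream_t> streams(num_streams);
    std::vector<double*> buffers(num_streams);
    for (int s = 0; s < num_streams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(reinterpret_cast<void**>(&buffers[s]), n * sizeof(double));
    }
    std::vector<std::future<void>> done;
    for (int s = 0; s < num_streams; ++s)  // kernels in different streams may overlap
        done.push_back(launch_async(buffers[s], n, streams[s]));
    for (auto& f : done)
        f.get();                           // synchronize via futures
    for (int s = 0; s < num_streams; ++s) {
        cudaFree(buffers[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```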
- Fix all bugs preventing Octotiger from compiling with clang (for cuda clang)
- Adapt buildscripts (that build all dependencies: Boost, HPX, Vc, ...) to use cuda clang
- Get John's minimal example that uses HPX futures with cuda to compile
- Create template version of the Multipole SoA kernel that works with either doubles (for GPU) or Vc types (for CPU); see the sketch after this list
- Create Multipole cuda kernel (calling the template SoA kernel with double)
- Add Multipole interface skeleton that will call either the M2M CPU kernel or the future Cuda kernel
- Add CPU/GPU communication that moves all SoA data required by the M2M kernel to the GPU
- Create Cuda M2M GPU kernel that calls the double version of the templated M2M SoA kernel
- Integrate new M2M cuda kernel
- Test a small scenario with a single thread and the M2M cuda kernel to verify correctness
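A minimal sketch of the "one templated kernel body, two instantiations" idea referenced above; interaction_core and its arithmetic are placeholders rather than the real multipole math, and the Vc usage is only indicated in the comments.

```cpp
// Sketch: the computational core is templated on the scalar type, so the CPU
// path can instantiate it with Vc::double_v (vectorized) and the CUDA path
// with plain double (one cell per thread).
#include <cstdio>

template <typename T>
#ifdef __CUDACC__
__host__ __device__
#endif
inline T interaction_core(T const& m_partner, T const& dx, T const& dy, T const& dz) {
    // Placeholder for the real multipole expansion terms.
    T r2 = dx * dx + dy * dy + dz * dz;
    return m_partner / r2;
}

int main() {
    // Scalar (GPU-style) instantiation. On the CPU the same template would be
    // instantiated with Vc::double_v and fed via load()/store() on the SoA arrays.
    double contribution = interaction_core(1.0, 0.5, 0.5, 0.5);
    std::printf("%f\n", contribution);
}
```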
Use Cuda streams and HPX futures to run multiple GPU kernels at the same time / Add CPU/GPU scheduler (Gregor)
- Create basic scheduler class; each one should manage multiple cuda streams and their associated buffers
- Adapt the cuda M2M kernel to use the scheduler-managed buffers
- Adapt the cuda M2M kernel to be synchronized with HPX futures (associated with the used cuda stream)
- Test running multiple GPU kernels at once using one scheduler managing multiple streams
- Add CPU/GPU scheduling capabilities: run on the CPU if all Cuda streams are busy
- Augment the SoA data-structure with an allocator for pinned memory (see the allocator sketch after this list)
- Add "staging area" for SoA data in pinned memory for each managed Cuda stream
- Make all CPU <-> GPU data transfers asynchronous
- Move all cuda mallocs to either the beginning of the program (scheduler buffers) or the regridding phases (result buffers)
- Add CLI argument for the number of cuda streams managed per HPX locality
- Run a test using all CPU cores and a P100 (with 48 streams) on one compute node to check concurrent kernel launches with nvvp
- Create template version of the P2P SoA kernel, again for both double and Vc types
- Create the P2P cuda kernel itself (calling the template SoA kernel)
- Add interface skeleton for launching the kernel on either the CPU or the GPU (using the scheduler)
- Add the required buffers and pinned staging areas for monopole data to the scheduler
- Integrate the Cuda P2P kernel call (with HPX futures for synchronization) into the interface
- Integrate the interface into compute fmm
- Test correctness and runtime with the new P2P kernel activated
- Add CLI argument for the number of desired Cuda streams per GPU
- Allocate Cuda streams on different GPUs depending on the cuda streams per locality and the cuda streams per GPU
- Adapt the CPU/GPU scheduler to utilize multiple GPUs (by using streams on different GPUs)
- Test on a shared-memory node with eight 1080 Ti GPUs
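A minimal sketch of the pinned-memory allocator referenced in the staging-area item above; pinned_allocator and pinned_buffer are hypothetical names. The point is that cudaHostAlloc-backed host buffers are page-locked, which is required for cudaMemcpyAsync to overlap with compute.

```cpp
// Sketch of a pinned-memory allocator for the SoA staging buffers.
#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

template <typename T>
struct pinned_allocator {
    using value_type = T;

    pinned_allocator() = default;
    template <typename U>
    pinned_allocator(pinned_allocator<U> const&) noexcept {}

    T* allocate(std::size_t n) {
        void* ptr = nullptr;
        if (cudaHostAlloc(&ptr, n * sizeof(T), cudaHostAllocDefault) != cudaSuccess)
            throw std::bad_alloc();
        return static_cast<T*>(ptr);
    }
    void deallocate(T* ptr, std::size_t) noexcept { cudaFreeHost(ptr); }
};

template <typename T, typename U>
bool operator==(pinned_allocator<T> const&, pinned_allocator<U> const&) { return true; }
template <typename T, typename U>
bool operator!=(pinned_allocator<T> const&, pinned_allocator<U> const&) { return false; }

// Usage: each component array of the SoA staging area becomes a pinned vector.
using pinned_buffer = std::vector<double, pinned_allocator<double>>;
```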
Not yet done, since this is the least important of the FMM kernels (~2-3% of the runtime). However, it should be simple to do after the other kernels are completed.
- Create templated version of the P2M SoA kernel, again for both double and Vc types
- Create CPU/GPU interface
- Integrate and test correctness
- Divide the Cuda P2P kernel into multiple blocks in order to keep more SMs busy
- Divide the Cuda Multipole kernel into multiple blocks in order to keep more SMs busy
- Reorder statements in the Multipole kernel to divide it into multiple kernels (and reduce the required registers); currently in a side branch and not yet used
- Move the stencil and indicator constants into Cuda constant memory for the Multipole kernel (see the constant-memory sketch after this list)
- Move the stencil and four-array constants into Cuda constant memory for the P2P kernel
- Add blocking to the P2P kernel to utilize shared memory and reduce global memory accesses
- Add blocking to the Multipole kernel to utilize shared memory and reduce global memory accesses
- Reduce the work items of the P2P kernel and increase the number of blocks instead
- Restructure the memory access pattern in the P2P kernel to use multiple loops with masks instead of the stencil
- Restructure the memory access pattern in the Multipole kernel to use multiple loops with masks instead of the stencil
- Reduce the work items of the Multipole kernel and increase the number of blocks instead
- Analyse runtime behaviour for a varying number of streams
- Do a distributed run on Piz Daint and test the runtime behaviour of the implementation with multiple compute nodes/GPUs
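A minimal sketch of the constant-memory idea referenced above; STENCIL_SIZE, the int3 element type and the kernel stub are assumptions, not the real Octotiger stencil.

```cpp
// Sketch: keep the read-only interaction stencil in CUDA constant memory so
// that all threads of a warp read it through the constant cache instead of
// global memory.
#include <cuda_runtime.h>

constexpr int STENCIL_SIZE = 1024;               // assumed size, not the real stencil length
__constant__ int3 device_stencil[STENCIL_SIZE];  // lives in constant memory

// One-time upload (e.g. at program start or after regridding):
void upload_stencil(const int3* host_stencil) {
    cudaMemcpyToSymbol(device_stencil, host_stencil, STENCIL_SIZE * sizeof(int3));
}

__global__ void p2p_kernel_stub(const double* mono, double* result, int cells_per_dim) {
    // Every thread walks the same stencil entries; these broadcasts are served
    // by the constant cache.
    for (int s = 0; s < STENCIL_SIZE; ++s) {
        int3 offset = device_stencil[s];
        (void)offset;  // placeholder: the real kernel would index its neighbor here
    }
    (void)mono; (void)result; (void)cells_per_dim;
}
```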
According to the profiling, this is the second most expensive part of Octotiger and should be moved to the GPU.
- Create unit test for this method (with a fixed input/output comparison?)
- Move the iii loop from the innermost to the outermost position (we will parallelize over this loop); see the restructuring sketch after this list
- Move if statements from the inner loop to the outer loop wherever possible
- Copy the required arrays to the GPU asynchronously (use the existing scheduler/helper class)
- Successively move parts of the method to the GPU and verify correctness each time
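A minimal sketch of the intended restructuring on a hypothetical loop nest; the array layout, field/cell counts and the "work" are placeholders, not the real reconstruct method.

```cpp
// Sketch: the iii loop becomes the outermost loop (the one we parallelize
// over), and conditions that do not depend on the inner index are hoisted out.
#include <vector>

void reconstruct_restructured(std::vector<double>& u, bool apply_limiter,
                              int num_cells, int num_fields) {
    // before (schematically):
    //   for (int f = 0; f < num_fields; ++f)
    //       for (int iii = 0; iii < num_cells; ++iii)
    //           if (apply_limiter) { ...work on u[f * num_cells + iii]... }
    //
    // after: iii is outermost and the invariant branch is hoisted.
    if (!apply_limiter)
        return;
    for (int iii = 0; iii < num_cells; ++iii)      // parallelize over iii
        for (int f = 0; f < num_fields; ++f)
            u[f * num_cells + iii] *= 0.5;         // placeholder for the real work

}

int main() {
    std::vector<double> u(3 * 8, 1.0);             // 3 fields, 8 cells (placeholders)
    reconstruct_restructured(u, true, 8, 3);
}
```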
Less expensive than the reconstruct method, but it still accounts for about 2% of the runtime, so it should be moved to the GPU as well.
- Create unit test for this method (with a fixed input/output comparison?)
- Parallelize over the k, j, i loops: replace them with one loop and calculate k, j, i in each iteration (see the index-flattening sketch at the end of this list)
- Create unvectorized (template) version of roe_fluxes (that can be called in cuda)
- Copy the required arrays to the GPU asynchronously (use the existing scheduler/helper class)
- Create a basic version of the GPU kernel
- Identify its share of the total runtime
- Identify the submethod that takes the most time
- Parallelization should probably happen over the R_N3 loop (if the method has a significant enough share of the total runtime)
- This is a critical part, as we do not yet have access to Summit. We may especially need support from @brycelelbach here.
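A minimal sketch of the loop flattening referenced in the k, j, i item above; the per-dimension extents are placeholders.

```cpp
// Sketch: collapse the k, j, i loops into one flat loop so that a single CUDA
// thread (or a single loop iteration) handles one cell.
#include <cstdio>

constexpr int NK = 8, NJ = 8, NI = 8;   // assumed per-dimension extents

int main() {
    for (int idx = 0; idx < NK * NJ * NI; ++idx) {
        // recover k, j, i from the flat index
        int k = idx / (NJ * NI);
        int j = (idx / NI) % NJ;
        int i = idx % NI;
        if (idx == 0 || idx == NK * NJ * NI - 1)
            std::printf("idx=%d -> (k=%d, j=%d, i=%d)\n", idx, k, j, i);
    }
}
```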