Octo Tiger GPU TODO Items
- March 1, 2019 – Submissions open
- April 2, 2019 – Abstracts submissions deadline
- April 10, 2019 – Full paper deadline (No extensions)
- May 11, 2019 – Reviews sent
- May 25, 2019 – Resubmissions deadline
- June 15, 2019 – Notifications sent
- July 12, 2019 – Major revision deadline
- August 9, 2019 – Major revision notifications sent
- August 28, 2019 – Final paper deadline
Conversion of the FMM interaction kernels to a Struct-of-Arrays (SoA) data-structure and stencil-based interactions (Gregor, David)
The FMM interaction methods have proven to be the most compute-intensive parts of the code. They comprise four kernels: Multipole-Multipole, Monopole-Monopole, Monopole-Multipole and Multipole-Monopole.
Switching from an Array-of-Structs data-structure to a Struct-of-Arrays structure, and using a stencil instead of an interaction list, should improve cache efficiency, avoid gather/scatter operations and eventually enable us to move the kernels to the GPU.
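As a rough illustration of the intended data layout, here is a minimal SoA sketch; multipole_soa and NUM_COMPONENTS are hypothetical names, not the actual Octotiger classes:

```cpp
// Hypothetical sketch of an SoA container for multipole data.
// Instead of an AoS layout (std::vector<multipole>), each coefficient gets its
// own contiguous array, so a kernel can load one component for many cells with
// a single vectorized (or, on the GPU, coalesced) access.
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t NUM_COMPONENTS = 20;  // assumed number of multipole coefficients

class multipole_soa {
public:
    explicit multipole_soa(std::size_t num_cells) : num_cells_(num_cells) {
        for (auto& component : data_)
            component.resize(num_cells);
    }

    std::size_t size() const { return num_cells_; }

    // Contiguous access to component c for all cells.
    double* component(std::size_t c) { return data_[c].data(); }

    // Convert one AoS entry (e.g. a cell from the interaction list) into SoA form.
    template <typename AoSEntry>
    void set(std::size_t cell, AoSEntry const& entry) {
        for (std::size_t c = 0; c < NUM_COMPONENTS; ++c)
            data_[c][cell] = entry[c];
    }

private:
    std::size_t num_cells_;
    std::array<std::vector<double>, NUM_COMPONENTS> data_;
};
```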
- Create SoA data-structure class
- Create stencil that includes all interactions from the interaction list for one element
- Create conversion method from Array-of-Structs (AoS) to Struct-of-Arrays to convert the data of a node and its neighbors
- Create CPU Multipole-Multipole (M2M) interaction kernel that uses the new data-structure and the stencil
- Integrate new M2M kernel and the conversion methods into compute fmm
- Test and optimize (L1 blocking?) the new M2M interaction kernel
- Split M2M kernel into RHO and non-RHO versions to avoid branching
- Performance analysis of the new kernel compared to the old AoS M2M interaction kernel
- Create data-structure conversion method for Monopole-Monopole (P2P) interactions that only loads relevant data
- Create additional stencil for monopole interactions
- Create CPU SoA Monopole-Monopole (P2P) interaction kernel that uses the stencil
- Integrate SoA P2P conversion method and kernel
- Test and optimize the new P2P SoA kernel (again, L1 blocking?)
- Create data-structure conversion method for Monopole-Multipole (P2M) interactions that only loads relevant data
- Create CPU Monopole-Multipole (P2M) interaction kernel that uses the stencil
- Integrate SoA P2M conversion method and kernel
- Test and optimize the new P2M SoA kernel
- Add early exit conditions to the P2M kernel
- Create data-structure conversion method for Multipole-Monopole (M2P) interactions that only loads relevant data
- Create CPU Multipole-Monopole (M2P) interaction kernel that uses the stencil
- Integrate SoA M2P conversion method and kernel
- Test and optimize the new M2P SoA kernel
- Combine the M2M and M2P kernels into a single kernel
- Create CLI parameters to switch between the new SoA stencil kernels and the old AoS interaction-list kernels
- Verify single-node results for multiple scenarios
- Verify results for distributed communication
- Fix distributed communication for the SoA kernels
- Count FLOPs done by each SoA kernel
- Estimate memory requirement of each SoA kernel (too big for L2?)
- Gather runtime results on KNL and Xeon Silver for the Oxford paper
- Gather data for runtime comparison between new and old kernels on multiple platforms
- Analyse new hotspots
- Move SoA conversion buffers into thread-local memory to avoid constant reallocation (and to lower memory requirements)
The new SoA kernels can now be ported to the GPU using Cuda.
- One kernel does not provide enough work to keep the complete GPU busy, thus we want to launch multiple kernels simultaneously using Cuda streams.
- Synchronization of the results should be done with HPX futures, which enables us to do other work on the current HPX worker thread instead of just waiting for the results of a Cuda kernel.
- We want to fully use both GPUs as well as the CPU at the same time. Kernels should be launched on the GPU unless all Cuda streams of the device are already busy, in which case we launch them on the CPU. For this we need a CPU/GPU scheduler for each HPX worker thread that manages its Cuda streams and decides where to launch each kernel.
- All calls to the Cuda interface should be asynchronous (or avoided during the FMM part) in order to keep the CPU free to work on other kernels.
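A minimal sketch of the intended launch pattern, using plain Cuda runtime calls: the kernel body is a placeholder and std::async merely stands in for the HPX future that the real integration would use for synchronization.

```cpp
// Sketch: overlap several independent kernel launches using CUDA streams.
// interaction_kernel_stub and launch_async are placeholders; in Octotiger the
// completion would be tied to an HPX future so the worker thread stays free.
#include <cuda_runtime.h>
#include <future>
#include <vector>

__global__ void interaction_kernel_stub(double* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0;  // placeholder for the real FMM interaction
}

std::future<void> launch_async(double* device_data, int n, cudaStream_t stream) {
    interaction_kernel_stub<<<(n + 127) / 128, 128, 0, stream>>>(device_data, n);
    // Stand-in for an HPX future: a helper thread that blocks on the stream.
    return std::async(std::launch::async,
                      [stream] { cudaStreamSynchronize(stream); });
}

int main() {
    constexpr int num_streams = 4;
    constexpr int n = 1 << 20;
    std::vector<cudaStream_t> streams(num_streams);
    std::vector<double*> buffers(num_streams);
    for (int s = 0; s < num_streams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(reinterpret_cast<void**>(&buffers[s]), n * sizeof(double));
    }
    std::vector<std::future<void>> done;
    for (int s = 0; s < num_streams; ++s)  // kernels in different streams may overlap
        done.push_back(launch_async(buffers[s], n, streams[s]));
    for (auto& f : done)
        f.get();                           // synchronize via futures
    for (int s = 0; s < num_streams; ++s) {
        cudaFree(buffers[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```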
- Fix all bugs preventing Octotiger from compiling with clang (for cuda clang)
- Adapt buildscripts (that build all dependencies: Boost, HPX, Vc, ...) to use cuda clang
- Get John's minimal example that uses HPX futures with cuda to compile
- Create template version of the Multipole SoA kernel that works with either doubles (for GPU) or Vc types (for CPU); see the sketch after this list
- Create Multipole cuda kernel (calling the template SoA kernel with double)
- Add Multipole interface skeleton that will call either the M2M CPU kernel or the future Cuda kernel
- Add CPU/GPU communication that moves all SoA data required by the M2M kernel to the GPU
- Create Cuda M2M GPU kernel that calls the double version of the templated M2M SoA kernel
- Integrate new M2M cuda kernel
- Test a small scenario with a single thread and the M2M cuda kernel to verify correctness
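A minimal sketch of the "one templated kernel body, two instantiations" idea referenced above; interaction_core and its arithmetic are placeholders rather than the real multipole math, and the Vc usage is only indicated in the comments.

```cpp
// Sketch: the computational core is templated on the scalar type, so the CPU
// path can instantiate it with Vc::double_v (vectorized) and the CUDA path
// with plain double (one cell per thread).
#include <cstdio>

template <typename T>
#ifdef __CUDACC__
__host__ __device__
#endif
inline T interaction_core(T const& m_partner, T const& dx, T const& dy, T const& dz) {
    // Placeholder for the real multipole expansion terms.
    T r2 = dx * dx + dy * dy + dz * dz;
    return m_partner / r2;
}

int main() {
    // Scalar (GPU-style) instantiation. On the CPU the same template would be
    // instantiated with Vc::double_v and fed via load()/store() on the SoA arrays.
    double contribution = interaction_core(1.0, 0.5, 0.5, 0.5);
    std::printf("%f\n", contribution);
}
```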
Use Cuda streams and HPX futures to run multiple GPU kernels at the same time / Add CPU/GPU scheduler (Gregor)
- Create basic scheduler class; each one should manage multiple cuda streams and their associated buffers
- Adapt the cuda M2M kernel to use the scheduler-managed buffers
- Adapt the cuda M2M kernel to be synchronized with HPX futures (associated with the used cuda stream)
- Test running multiple GPU kernels at once using one scheduler managing multiple streams
- Add CPU/GPU scheduling capabilities: run on the CPU if all Cuda streams are busy
- Augment the SoA data-structure with an allocator for pinned memory (see the allocator sketch after this list)
- Add "staging area" for SoA data in pinned memory for each managed Cuda stream
- Make all CPU <-> GPU data transfers asynchronous
- Move all cuda mallocs to either the beginning of the program (scheduler buffers) or the regridding phases (result buffers)
- Add CLI argument for the number of cuda streams managed per HPX locality
- Run a test using all CPU cores and a P100 (with 48 streams) on one compute node to check concurrent kernel launches with nvvp
- Create template version of the P2P SoA kernel, again for both double and Vc types
- Create the P2P cuda kernel itself (calling the template SoA kernel)
- Add interface skeleton for launching the kernel on either the CPU or the GPU (using the scheduler)
- Add the required buffers and pinned staging areas for monopole data to the scheduler
- Integrate the Cuda P2P kernel call (with HPX futures for synchronization) into the interface
- Integrate the interface into compute fmm
- Test correctness and runtime with the new P2P kernel activated
- Add CLI argument for the number of desired Cuda streams per GPU
- Allocate Cuda streams on different GPUs depending on the cuda streams per locality and the cuda streams per GPU
- Adapt the CPU/GPU scheduler to utilize multiple GPUs (by using streams on different GPUs)
- Test on a shared-memory node with eight 1080 Ti GPUs
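A minimal sketch of the pinned-memory allocator referenced in the staging-area item above; pinned_allocator and pinned_buffer are hypothetical names. The point is that cudaHostAlloc-backed host buffers are page-locked, which is required for cudaMemcpyAsync to overlap with compute.

```cpp
// Sketch of a pinned-memory allocator for the SoA staging buffers.
#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

template <typename T>
struct pinned_allocator {
    using value_type = T;

    pinned_allocator() = default;
    template <typename U>
    pinned_allocator(pinned_allocator<U> const&) noexcept {}

    T* allocate(std::size_t n) {
        void* ptr = nullptr;
        if (cudaHostAlloc(&ptr, n * sizeof(T), cudaHostAllocDefault) != cudaSuccess)
            throw std::bad_alloc();
        return static_cast<T*>(ptr);
    }
    void deallocate(T* ptr, std::size_t) noexcept { cudaFreeHost(ptr); }
};

template <typename T, typename U>
bool operator==(pinned_allocator<T> const&, pinned_allocator<U> const&) { return true; }
template <typename T, typename U>
bool operator!=(pinned_allocator<T> const&, pinned_allocator<U> const&) { return false; }

// Usage: each component array of the SoA staging area becomes a pinned vector.
using pinned_buffer = std::vector<double, pinned_allocator<double>>;
```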
Not yet done, since this is the least important of the FMM kernels (~2-3% of the runtime). However, it should be simple to do after the other kernels are completed.
- Create templated version of the P2M SoA kernel, again for both double and Vc types
- Create CPU/GPU interface
- Integrate and test correctness
- Divide the Cuda P2P kernel into multiple blocks in order to keep more SMs busy
- Divide the Cuda Multipole kernel into multiple blocks in order to keep more SMs busy
- Reorder statements in the Multipole kernel to divide it into multiple kernels (and reduce the required registers); currently in a side branch and not yet used
- Move the stencil and indicator constants into Cuda constant memory for the Multipole kernel (see the constant-memory sketch after this list)
- Move the stencil and four-array constants into Cuda constant memory for the P2P kernel
- Add blocking to the P2P kernel to utilize shared memory and reduce global memory accesses
- Add blocking to the Multipole kernel to utilize shared memory and reduce global memory accesses
- Reduce the work items of the P2P kernel and increase the number of blocks instead
- Restructure the memory access pattern in the P2P kernel to use multiple loops with masks instead of the stencil
- Restructure the memory access pattern in the Multipole kernel to use multiple loops with masks instead of the stencil
- Reduce the work items of the Multipole kernel and increase the number of blocks instead
- Analyse runtime behaviour for a varying number of streams
- Do a distributed run on Piz Daint and test the runtime behaviour of the implementation with multiple compute nodes/GPUs
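A minimal sketch of the constant-memory idea referenced above; STENCIL_SIZE, the int3 element type and the kernel stub are assumptions, not the real Octotiger stencil.

```cpp
// Sketch: keep the read-only interaction stencil in CUDA constant memory so
// that all threads of a warp read it through the constant cache instead of
// global memory.
#include <cuda_runtime.h>

constexpr int STENCIL_SIZE = 1024;               // assumed size, not the real stencil length
__constant__ int3 device_stencil[STENCIL_SIZE];  // lives in constant memory

// One-time upload (e.g. at program start or after regridding):
void upload_stencil(const int3* host_stencil) {
    cudaMemcpyToSymbol(device_stencil, host_stencil, STENCIL_SIZE * sizeof(int3));
}

__global__ void p2p_kernel_stub(const double* mono, double* result, int cells_per_dim) {
    // Every thread walks the same stencil entries; these broadcasts are served
    // by the constant cache.
    for (int s = 0; s < STENCIL_SIZE; ++s) {
        int3 offset = device_stencil[s];
        (void)offset;  // placeholder: the real kernel would index its neighbor here
    }
    (void)mono; (void)result; (void)cells_per_dim;
}
```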
According to the profiling, this is the second most expensive part of Octotiger and should be moved to the GPU.
- Create unit test for this method (with a fixed input/output comparison?)
- Move the iii loop from the innermost to the outermost position (we will parallelize over this loop); see the restructuring sketch after this list
- Move if statements from the inner loop to the outer loop wherever possible
- Copy the required arrays to the GPU asynchronously (use the existing scheduler/helper class)
- Successively move parts of the method to the GPU and verify correctness each time
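A minimal sketch of the intended restructuring on a hypothetical loop nest; the array layout, field/cell counts and the "work" are placeholders, not the real reconstruct method.

```cpp
// Sketch: the iii loop becomes the outermost loop (the one we parallelize
// over), and conditions that do not depend on the inner index are hoisted out.
#include <vector>

void reconstruct_restructured(std::vector<double>& u, bool apply_limiter,
                              int num_cells, int num_fields) {
    // before (schematically):
    //   for (int f = 0; f < num_fields; ++f)
    //       for (int iii = 0; iii < num_cells; ++iii)
    //           if (apply_limiter) { ...work on u[f * num_cells + iii]... }
    //
    // after: iii is outermost and the invariant branch is hoisted.
    if (!apply_limiter)
        return;
    for (int iii = 0; iii < num_cells; ++iii)      // parallelize over iii
        for (int f = 0; f < num_fields; ++f)
            u[f * num_cells + iii] *= 0.5;         // placeholder for the real work

}

int main() {
    std::vector<double> u(3 * 8, 1.0);             // 3 fields, 8 cells (placeholders)
    reconstruct_restructured(u, true, 8, 3);
}
```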
Less expensive than the reconstruct method, but it still accounts for about 2% of the runtime, so it should be moved to the GPU as well.
- Create unit test for this method (with a fixed input/output comparison?)
- Parallelize over the k, j, i loops: replace them with one loop and calculate k, j, i in each iteration (see the index-flattening sketch at the end of this list)
- Create unvectorized (template) version of roe_fluxes (that can be called in cuda)
- Copy the required arrays to the GPU asynchronously (use the existing scheduler/helper class)
- Create a basic version of the GPU kernel
- Identify its share of the total runtime
- Identify the submethod that takes the most time
- Parallelization should probably happen over the R_N3 loop (if the method has a significant enough share of the total runtime)
- This is a critical part, as we do not yet have access to Summit. We may especially need support from @brycelelbach here.
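A minimal sketch of the loop flattening referenced in the k, j, i item above; the per-dimension extents are placeholders.

```cpp
// Sketch: collapse the k, j, i loops into one flat loop so that a single CUDA
// thread (or a single loop iteration) handles one cell.
#include <cstdio>

constexpr int NK = 8, NJ = 8, NI = 8;   // assumed per-dimension extents

int main() {
    for (int idx = 0; idx < NK * NJ * NI; ++idx) {
        // recover k, j, i from the flat index
        int k = idx / (NJ * NI);
        int j = (idx / NI) % NJ;
        int i = idx % NI;
        if (idx == 0 || idx == NK * NJ * NI - 1)
            std::printf("idx=%d -> (k=%d, j=%d, i=%d)\n", idx, k, j, i);
    }
}
```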