WIP: Add GPU aware MPI support in cannon algorithm #647

gsitaram · 2023-01-06T22:39:46Z

If CUDA or HIP backend is enabled, --WITH-DBCSR-G2G CMake option enables the following:

Use device arrays in MPI transfers in the cannon algorithm implementation.
Move norms calculation to GPU
Transfer the matrix images once to GPU and transpose the B matrix once. All MPI Isend and Irecv calls will transfer the data in device buffers instead of host buffers.
MPI transfers and multiplications remain overlapped
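
For readers unfamiliar with the communication pattern being changed, here is a minimal single-process sketch of Cannon's algorithm. It is not the DBCSR implementation: the name `cannon_multiply` and the `Grid` type are hypothetical, each grid cell holds one scalar instead of a matrix block, and the in-memory ring shifts stand in for the `MPI_Isend`/`MPI_Irecv` pairs that, with the G2G option, would be handed device buffers instead of host buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One scalar per grid cell stands in for the matrix block owned by one rank.
using Grid = std::vector<std::vector<double>>;

// Single-process simulation of Cannon's algorithm on a q x q grid.
// The in-memory shifts below stand in for the MPI Isend/Irecv pairs.
Grid cannon_multiply(const Grid& A, const Grid& B) {
  const std::size_t q = A.size();
  assert(B.size() == q);  // both operands must live on the same square grid
  Grid a(q, std::vector<double>(q)), b(q, std::vector<double>(q));
  Grid c(q, std::vector<double>(q, 0.0));

  // Initial alignment: row i of A moves left by i, column j of B moves up by j.
  for (std::size_t i = 0; i < q; ++i)
    for (std::size_t j = 0; j < q; ++j) {
      a[i][j] = A[i][(j + i) % q];
      b[i][j] = B[(i + j) % q][j];
    }

  for (std::size_t step = 0; step < q; ++step) {
    // Local multiply-accumulate on every simulated rank.
    for (std::size_t i = 0; i < q; ++i)
      for (std::size_t j = 0; j < q; ++j) c[i][j] += a[i][j] * b[i][j];

    // Ring shift: A one position left, B one position up.
    Grid a_next(a), b_next(b);
    for (std::size_t i = 0; i < q; ++i)
      for (std::size_t j = 0; j < q; ++j) {
        a_next[i][j] = a[i][(j + 1) % q];
        b_next[i][j] = b[(i + 1) % q][j];
      }
    a.swap(a_next);
    b.swap(b_next);
  }
  return c;
}
```

In a real run the shift of step k would be posted before the local multiplication of step k finishes, which is the overlap the last item of the list refers to. The simulation keeps the two phases sequential for clarity.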

Requirements:

An MPI implementation that supports GPU aware communication.

Need help with the following:

The OpenCL build fails with "undefined reference to c_calculate_norms" error, @hfp, could you help me fix it for the OpenCL backend?
The ROCm build fails with undefined references in OpenMPI. I am guessing we have to fix the CI build to include ROCm aware support when building UCX that OpenMPI depends on.
I need to include AMD copyright lines in each file modified, but this does not seem acceptable for the pre-commit script. Please let me know how to get around this.

jenkins-cscs · 2023-01-06T22:43:22Z

Can one of the admins verify this patch?

codecov · 2023-01-06T22:49:36Z

Codecov Report

Attention: 64 lines in your changes are missing coverage. Please review.

Comparison is base (5d92807) 67.0% compared to head (b76c8bd) 66.8%.
Report is 129 commits behind head on develop.

❗ Current head b76c8bd differs from pull request most recent head 0c37025. Consider uploading reports for the commit 0c37025 to get more accurate results

Files	Patch %	Lines
src/mm/dbcsr_mm_common.F	0.0%	55 Missing ⚠️
src/mm/dbcsr_mm.F	25.0%	9 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##           develop    #647     +/-   ##
=========================================
- Coverage     67.0%   66.8%   -0.2%     
=========================================
  Files          105     105             
  Lines        29121   29188     +67     
=========================================
+ Hits         19521   19523      +2     
- Misses        9600    9665     +65

Flag	Coverage Δ
unittests	`66.8% <4.4%> (-0.2%)`	⬇️
with-blas	`66.8% <4.4%> (-0.2%)`	⬇️
with-libxsmm	`66.2% <4.7%> (-0.2%)`	⬇️
with-mpi	`66.9% <4.4%> (-0.2%)`	⬇️
with-openmp	`65.7% <4.4%> (-0.2%)`	⬇️
without-mpi	`66.0% <4.4%> (-0.2%)`	⬇️
without-openmp	`65.8% <4.4%> (-0.2%)`	⬇️

hfp · 2023-01-08T13:19:09Z

> The OpenCL build fails with "undefined reference to c_calculate_norms" error, @hfp, could you help me fix it for the OpenCL backend?

Yes, I will help with this. I think it should not be a blocker for merge.

Btw, is there a way to perform atomic FP-add on (certain) AMD GPUs using OpenCL? It seems the CUDA based code path does that but everything I tried in OpenCL did not work for me like C11 atomics, legacy builtins, or even cheating by guessing a prototype function and calling it. I found some doc for inline assembly in OpenCL but I am a bit hesitant to adopt it (like reading all the arch specific doc for AMD GPUs). For NVidia btw I am using PTX inline assembly in OpenCL.

hfp · 2023-01-08T14:22:20Z

Regarding the failing check of file headers, I propose to put copyright lines in front of the LICENSE file like this, and to stick with generic

! Copyright (C) by the DBCSR developers group - All rights reserved                                !

Otherwise we are back to the business of duplicating author notes as part of the source code. I think authorship is recorded as part of repository's metadata and no one is "left behind". As an extension, I also propose to drop AUTHORS file.

A general overhaul is to have a file extension .md for LICENSE (and perhaps similar files).

( sorry the proposal is not exactly related and can be perhaps a separate PR )

gsitaram · 2023-01-09T15:33:26Z

> > The OpenCL build fails with "undefined reference to c_calculate_norms" error, @hfp, could you help me fix it for the OpenCL backend?
>
> Yes, I will help with this. I think it should not be a blocker for merge.

Thank you so much.

> Btw, is there a way to perform atomic FP-add on (certain) AMD GPUs using OpenCL? It seems the CUDA based code path does that but everything I tried in OpenCL did not work for me like C11 atomics, legacy builtins, or even cheating by guessing a prototype function and calling it. I found some doc for inline assembly in OpenCL but I am a bit hesitant to adopt it (like reading all the arch specific doc for AMD GPUs). For NVidia btw I am using PTX inline assembly in OpenCL.

I will find out more details and inform you.

gsitaram · 2023-01-09T15:37:01Z

> Regarding the failing check of file headers, I propose to put copyright lines in front of the LICENSE file like this, and to stick with generic
>
> ! Copyright (C) by the DBCSR developers group - All rights reserved                                !
>
> Otherwise we are back to the business of duplicating author notes as part of the source code. I think authorship is recorded as part of repository's metadata and no one is "left behind". As an extension, I also propose to drop AUTHORS file.
>
> A general overhaul is to have a file extension .md for LICENSE (and perhaps similar files).
>
> ( sorry the proposal is not exactly related and can be perhaps a separate PR )

I like this proposal of having a separate LICENSE file. If all DBCSR maintainers agree, I can make the required changes (keep only the DBCSR Developers Group copyright line in each file, and add a new license file with the entire text of the license and copyright lines above that.)

gsitaram · 2023-01-09T21:00:29Z

@hfp, it does not seem like there are atomic add operations in OpenCL for FP type data. You may have to use compare-and-swap instead to accomplish the same. I found this blog showing an example, but it is not recent.
I see that in the newer OpenCL (3.0) spec, the atomic_* operations are deprecated, and we have to use a combination of memory order and fence operations, but I am not very familiar with this set of functions. If I find any recent examples somewhere, I'll be sure to point you to them.
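
To make the compare-and-swap suggestion concrete, here is a minimal host-side sketch in C++ rather than OpenCL. The helper names (`float_to_bits`, `bits_to_float`, `atomic_add_float_cas`) are hypothetical. The idea is to store the float as its 32-bit pattern and retry an integer compare-exchange until no other writer intervened. An OpenCL kernel would presumably follow the same loop with an integer builtin such as `atomic_cmpxchg`, but that variant is not shown or tested here.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits wide");

// Reinterpret a float as its raw 32-bit pattern (and back) without aliasing issues.
std::uint32_t float_to_bits(float v) {
  std::uint32_t bits;
  std::memcpy(&bits, &v, sizeof bits);
  return bits;
}

float bits_to_float(std::uint32_t bits) {
  float v;
  std::memcpy(&v, &bits, sizeof v);
  return v;
}

// Atomically add `value` to the float stored as raw bits in `target`.
// Emulates a floating-point atomic add with an integer compare-and-swap loop
// and returns the value that was stored before the addition.
float atomic_add_float_cas(std::atomic<std::uint32_t>& target, float value) {
  std::uint32_t old_bits = target.load(std::memory_order_relaxed);
  for (;;) {
    const float old_val = bits_to_float(old_bits);
    const std::uint32_t new_bits = float_to_bits(old_val + value);
    // On failure, old_bits is refreshed with the current contents and we retry.
    if (target.compare_exchange_weak(old_bits, new_bits, std::memory_order_relaxed)) {
      return old_val;
    }
  }
}
```

The loop only makes progress guarantees for the system as a whole, so under heavy contention a single caller may retry many times. That is the usual cost of emulating an FP atomic this way.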

* Added c_calculate_norms prototype to ACC/LIBSMM interface/header.
* Stub implementation for OpenCL.
* Adjusted rules to compile calculate_norms.cpp as CUDA translation unit.
* Separated CFLAGS and DFLAGS. Allow unsupported host-compiler (nvcc).
* Makefile/OpenCL: improved warning level.
* Fixed potential warnings about dereferencing type-punned pointer will break strict-aliasing rules.

hfp · 2023-01-20T13:19:50Z

Please allow me to share my suggestions for this PR:

Let's decide on the license banner etc solely based on input from @alazzaro. I would avoid surveying all/past authors. Also, the suggested change is not about dropping anyone's contribution; it's fully recorded in the GitHub repository and we only avoid duplicating this information along with exposing potentially outdated contact data.
I adjusted upstream master to provide a stub implementation of c_calculate_norms for the OpenCL backend. I hope I got the return code right, so that it bails out at runtime and hopefully falls back to the host as long as c_calculate_norms is not implemented with its own OpenCL kernel. So, please consider rebasing onto or merging the upstream master.
I suggest rethinking the file extension of calculate_norms.cpp. I believe .cu might require fewer changes in the build system(s). However, this is not too important.
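
The stub-plus-fallback idea from the second point can be sketched as follows. Everything in this block is hypothetical: the status codes, the function names, the argument list (which is not the real `c_calculate_norms` signature), and the assumption that a block's norm is the sum of its squared entries stored in single precision. It only illustrates how a nonzero return code from a backend without a kernel can route the caller to a host implementation.

```cpp
#include <cstddef>

// Hypothetical status codes; the real backend's codes may differ.
constexpr int kNormsOk = 0;
constexpr int kNormsNotImplemented = 1;

// Hypothetical stand-in for a backend without a norms kernel:
// it touches nothing and reports that the operation is unavailable.
int device_norms_stub(const double*, std::size_t, const int*, const int*, float*) {
  return kNormsNotImplemented;
}

// Host reference, assuming the norm of a block is the sum of its squared entries.
void host_norms(const double* data, std::size_t nblks, const int* offsets,
                const int* nelems, float* norms) {
  for (std::size_t blk = 0; blk < nblks; ++blk) {
    double sum = 0.0;
    for (int k = 0; k < nelems[blk]; ++k) {
      const double v = data[offsets[blk] + k];
      sum += v * v;
    }
    norms[blk] = static_cast<float>(sum);
  }
}

// Try the device path first and fall back to the host on any nonzero status.
// Returns true if the device produced the norms, false if the host did.
bool calculate_block_norms(const double* data, std::size_t nblks, const int* offsets,
                           const int* nelems, float* norms) {
  if (device_norms_stub(data, nblks, offsets, nelems, norms) == kNormsOk) {
    return true;
  }
  host_norms(data, nblks, offsets, nelems, norms);
  return false;
}
```

With this shape the caller never needs to know which backend is compiled in, at the price of one extra call when the device path is missing.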

In my integration test (#649), I changed including some header files in calculate_norms.cpp from:

#if defined(__CUDA)
#  include "acc/cuda/acc_cuda.h"
#elif defined(__HIP)
#  include "acc/hip/acc_hip.h"
#endif
#include "libsmm_acc_init.h"

... to the following:

#if defined(__CUDA)
#  include "../cuda/acc_cuda.h"
#elif defined(__HIP)
#  include "../hip/acc_hip.h"
#endif
#include "libsmm_acc_init.h"

gsitaram · 2023-01-20T21:21:15Z

@hfp, I merged the master into this branch and changed the path to those header files as you suggested.
As for the name of the file, hipcc takes .cpp extension, and nvcc can compile the .cpp file with the option -x cuda. So, I didn't change it. If the changes introduced to the build system are not acceptable, please feel free to change. After I got it to build, I didn't touch it.
I'll discuss with Alfio and see how to proceed with the license blob.
Thanks for your help in getting the opencl backend to accept these changes.

alazzaro · 2023-01-25T11:39:35Z

@gsitaram Thanks for this PR and sorry for the late reply (I would like to thank @hfp for his contribution too).

I'm putting here some of the topics/comments:

I definitely agree with @hfp's proposal, so we can change our LICENSE file. @dev-zero comments?
Understand how we can test it. Currently we cannot test it on GitHub, so we can use the CI on Daint.
Move file calculate_norms.cpp to cuda_hip directory

I will add some specific comments in the files.

src/acc/libsmm_acc/calculate_norms.cpp

src/acc/libsmm_acc/libsmm_acc.cpp

src/mm/dbcsr_mm_common.F

gsitaram · 2023-01-31T20:05:34Z

@alazzaro , please check if the previous commit reflects our discussion about avoiding the g2g path for all data types other than real_8.

hfp · 2023-02-01T08:44:53Z

I think Gina was right with the previous location of c_calculate_norms (under src/acc/libsmm_acc). That's the "library" also hosting transpose and multiply (in fact it's also CUDA/HIP-only). Another sign of calculate_norms.cpp belonging to LIBSMM_ACC is the latter only contains infrastructure-code like mem, stream, error, etc., and never contained calculation routines. Further, calculate_norms.cpp depends on libsmm_acc_init.h (acc_get_gpu_warp_size), which is now awkward after moving it to cuda_hip.

alazzaro · 2024-01-24T13:40:21Z

thanks @gsitaram !

hfp added a commit to hfp/dbcsr that referenced this pull request Jan 20, 2023

ocl: adjustments in anticipation of cp2k#647

072e897

* Added c_calculate_norms prototype to ACC/LIBSMM interface/header. * Stub implementation for OpenCL.

hfp added a commit to hfp/dbcsr that referenced this pull request Jan 20, 2023

ocl: test integration with cp2k#647

24e77ee

* Fixed including header file in calculate_norms.cpp.

alazzaro mentioned this pull request Jan 20, 2023

ocl: test integration with #647 #649

Closed

alazzaro self-assigned this Jan 25, 2023

hfp added a commit to hfp/dbcsr that referenced this pull request Jan 26, 2023

Anticipate suggested changes of PR cp2k#647 (cp2k#647 (comment)).

c7b82ef

alazzaro reviewed Jan 31, 2023

src/acc/libsmm_acc/calculate_norms.cpp (outdated)

src/acc/libsmm_acc/calculate_norms.cpp (outdated)

src/acc/libsmm_acc/libsmm_acc.cpp

src/mm/dbcsr_mm_common.F

hfp added a commit to hfp/dbcsr that referenced this pull request Feb 1, 2023

Accommodate changes (cp2k#647).

785e71c

alazzaro force-pushed the dbcsr_g2g branch 2 times, most recently from a04f96b to d368160 Compare July 10, 2023 21:17

alazzaro changed the title ~~Add GPU aware MPI support in cannon algorithm~~ WIP: Add GPU aware MPI support in cannon algorithm Jul 11, 2023

alazzaro force-pushed the dbcsr_g2g branch 7 times, most recently from 63a4d0d to 8610601 Compare July 11, 2023 09:37

alazzaro force-pushed the develop branch 6 times, most recently from 1ba0f0b to 6261a60 Compare July 12, 2023 20:53

gsitaram and others added 17 commits January 24, 2024 13:36

Add GPU aware MPI support in cannon algorithm with norms calculation in GPU

7f9fa98

Remove unused variable

8b1bdb8

Change path to header files

0f0e627

Move calculate_norms.cpp to cuda_hip directory

2a29d76

Changes to reflect review comments

404386e

Use G2G cannon algorithm only for real_8 data type

5e9a904

[pre-commit.ci] auto fixes from pre-commit.com hooks

c48285d

for more information, see https://pre-commit.ci

Allow multiple headers

ae102b7

Move calculate norms interface

f89ed5e

Update to MPI F08

694a69b

Move ROCM build test to Release

7579b99

Update ROCM container

5146f15

Update docker containers to ubuntu 22.04

c839c68

Force to remove HIP_ARCHITECTURE for linking

b2510ca

[pre-commit.ci] auto fixes from pre-commit.com hooks

3b00d8f

for more information, see https://pre-commit.ci

Avoid LTO warning on type mismatch

4f028dd

Revert changes on linux test

d88a3b7

alazzaro force-pushed the dbcsr_g2g branch from 4e1d916 to d88a3b7 Compare January 24, 2024 12:37

alazzaro added 3 commits January 24, 2024 13:53

Rocm test to use Release build

e331ec6

Update github actions

a569fc7

Enable test on G2G and rocm compilation

0c37025

alazzaro merged commit 3bc658f into cp2k:develop Jan 24, 2024
21 checks passed

alazzaro pushed a commit that referenced this pull request Jan 24, 2024

Add GPU aware MPI support in cannon algorithm (#647)

2af40bb

Add GPU aware MPI support in cannon algorithm with norms calculation in GPU
