Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add GPU aware MPI support in cannon algorithm #647

Merged
merged 20 commits into from
Jan 24, 2024

Conversation

gsitaram
Copy link
Contributor

@gsitaram gsitaram commented Jan 6, 2023

If CUDA or HIP backend is enabled, --WITH-DBCSR-G2G CMake option enables the following:

  • Use device arrays in MPI transfers in the cannon algorithm implementation.
  • Move norms calculation to GPU
  • Transfer the matrix images once to GPU and transpose the B matrix once. All MPI Isend and Irecv calls will transfer the data in device buffers instead of host buffers.
  • MPI transfers and multiplications remain overlapped

Requirements:

  • An MPI implementation that supports GPU aware communication.

Need help with the following:

  • The OpenCL build fails with "undefined reference to c_calculate_norms" error, @hfp, could you help me fix it for the OpenCL backend?
  • The ROCm build fails with undefined references in OpenMPI. I am guessing we have to fix the CI build to include ROCm aware support when building UCX that OpenMPI depends on.
  • I need to include AMD copyright lines in each file modified, but this does not seem acceptable for the pre-commit script. Please let me know how to get around this.

@jenkins-cscs
Copy link
Collaborator

Can one of the admins verify this patch?

@codecov
Copy link

codecov bot commented Jan 6, 2023

Codecov Report

Attention: 64 lines in your changes are missing coverage. Please review.

Comparison is base (5d92807) 67.0% compared to head (b76c8bd) 66.8%.
Report is 129 commits behind head on develop.

❗ Current head b76c8bd differs from pull request most recent head 0c37025. Consider uploading reports for the commit 0c37025 to get more accurate results

Files Patch % Lines
src/mm/dbcsr_mm_common.F 0.0% 55 Missing ⚠️
src/mm/dbcsr_mm.F 25.0% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##           develop    #647     +/-   ##
=========================================
- Coverage     67.0%   66.8%   -0.2%     
=========================================
  Files          105     105             
  Lines        29121   29188     +67     
=========================================
+ Hits         19521   19523      +2     
- Misses        9600    9665     +65     
Flag Coverage Δ
unittests 66.8% <4.4%> (-0.2%) ⬇️
with-blas 66.8% <4.4%> (-0.2%) ⬇️
with-libxsmm 66.2% <4.7%> (-0.2%) ⬇️
with-mpi 66.9% <4.4%> (-0.2%) ⬇️
with-openmp 65.7% <4.4%> (-0.2%) ⬇️
without-mpi 66.0% <4.4%> (-0.2%) ⬇️
without-openmp 65.8% <4.4%> (-0.2%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hfp
Copy link
Member

hfp commented Jan 8, 2023

The OpenCL build fails with "undefined reference to c_calculate_norms" error, @hfp, could you help me fix it for the OpenCL backend?

Yes, I will help with this. I think it should not be a blocker for merge.


Btw, is there a way to perform atomic FP-add on (certain) AMD GPUs using OpenCL? It seems the CUDA based code path does that but everything I tried in OpenCL did not work for me like C11 atomics, legacy builtins, or even cheating by guessing a prototype function and calling it. I found some doc for inline assembly in OpenCL but I am a bit hesitant to adopt it (like reading all the arch specific doc for AMD GPUs). For NVidia btw I am using PTX inline assembly in OpenCL.

@hfp
Copy link
Member

hfp commented Jan 8, 2023

Regarding the failing check of file headers, I propose to put copyright lines in front of the LICENSE file like this, and to stick with generic

! Copyright (C) by the DBCSR developers group - All rights reserved                                !

Otherwise we are back to the business of duplicating author notes as part of the source code. I think authorship is recorded as part of repository's metadata and no one is "left behind". As an extension, I also propose to drop AUTHORS file.

A general overhaul is to have a file extension .md for LICENSE (and perhaps similar files).

( sorry the proposal is not exactly related and can be perhaps a separate PR )

@gsitaram
Copy link
Contributor Author

gsitaram commented Jan 9, 2023

The OpenCL build fails with "undefined reference to c_calculate_norms" error, @hfp, could you help me fix it for the OpenCL backend?

Yes, I will help with this. I think it should not be a blocker for merge.

Thank you so much.

Btw, is there a way to perform atomic FP-add on (certain) AMD GPUs using OpenCL? It seems the CUDA based code path does that but everything I tried in OpenCL did not work for me like C11 atomics, legacy builtins, or even cheating by guessing a prototype function and calling it. I found some doc for inline assembly in OpenCL but I am a bit hesitant to adopt it (like reading all the arch specific doc for AMD GPUs). For NVidia btw I am using PTX inline assembly in OpenCL

I will find out more details and inform you.

@gsitaram
Copy link
Contributor Author

gsitaram commented Jan 9, 2023

Regarding the failing check of file headers, I propose to put copyright lines in front of the LICENSE file like this, and to stick with generic

! Copyright (C) by the DBCSR developers group - All rights reserved                                !

Otherwise we are back to the business of duplicating author notes as part of the source code. I think authorship is recorded as part of repository's metadata and no one is "left behind". As an extension, I also propose to drop AUTHORS file.

A general overhaul is to have a file extension .md for LICENSE (and perhaps similar files).

( sorry the proposal is not exactly related and can be perhaps a separate PR )

I like this proposal of having a separate LICENSE file. If all DBCSR maintainers agree, I can make the required changes (keep only the DBCSR Developers Group copyright line in each file, and add a new license file with the entire text of the license and copyright lines above that.)

@gsitaram
Copy link
Contributor Author

gsitaram commented Jan 9, 2023

@hfp, it does not seem like there are atomic add operations in OpenCL for FP type data. You may have to use compare and swap instead to accomplish the same. I found this blog showing an example, but this is not recent.
I see that in the newer OpenCL (3.0) spec, the atomic_* operations are deprecated, and we have to use a combination of memory order and fence operations.. but I am not very familiar with this set of functions. If I find any recent examples somewhere, I'll be sure to point those to you.

hfp added a commit to hfp/dbcsr that referenced this pull request Jan 20, 2023
* Added c_calculate_norms prototype to ACC/LIBSMM interface/header.
* Stub implementation for OpenCL.
hfp added a commit that referenced this pull request Jan 20, 2023
* Added c_calculate_norms prototype to ACC/LIBSMM interface/header.
* Stub implementation for OpenCL.

* Adjusted rules to compile calculate_norms.cpp as CUDA translation unit.
* Separated CFLAGS and DFLAGS. Allow unsupported host-compiler (nvcc).
* Makefile/OpenCL: improved warning level.

* Fixed potential warnings about dereferencing type-punned pointer will break strict-aliasing rules.
hfp added a commit to hfp/dbcsr that referenced this pull request Jan 20, 2023
* Fixed including header file in calculate_norms.cpp.
@hfp
Copy link
Member

hfp commented Jan 20, 2023

Please allow me to share my suggestions for this PR:

  • Let's decide on the license banner etc solely based on input from @alazzaro. I would avoid surveying all/past authors. Also, the suggested change is not about dropping anyone's contribution; it's fully recorded in the GitHub repository and we only avoid duplicating this information along with exposing potentially outdated contact data.
  • I adjusted upstream master to provide a stub implementation of c_calculate_norms for the OpenCL backend. I hope I got the return code correct that allows to bail-out at runtime and to hopefully fall-back to host as long as c_calculate_norms is not implemented with an own OpenCL kernel. So, please consider to rebase or merge the upstream master.
  • I suggest rethinking the file extension of calculate_norms.cpp. I believe .cu might require less changes in the build system(s). However, this is not too important.

In my integration test (#649), I changed including some header files in calculate_norms.cpp from:

#if defined(__CUDA)
#  include "acc/cuda/acc_cuda.h"
#elif defined(__HIP)
#  include "acc/hip/acc_hip.h"
#endif
#include "libsmm_acc_init.h"

... to the following:

#if defined(__CUDA)
#  include "../cuda/acc_cuda.h"
#elif defined(__HIP)
#  include "../hip/acc_hip.h"
#endif
#include "libsmm_acc_init.h"

@gsitaram
Copy link
Contributor Author

@hfp, I merged the master into this branch and changed the path to those header files as you suggested.
As for the name of the file, hipcc takes .cpp extension, and nvcc can compile the .cpp file with the option -x cuda. So, I didn't change it. If the changes introduced to the build system are not acceptable, please feel free to change. After I got it to build, I didn't touch it.
I'll discuss with Alfio and see how to proceed with the license blob.
Thanks for your help in getting the opencl backend to accept these changes.

@alazzaro alazzaro self-assigned this Jan 25, 2023
@alazzaro
Copy link
Member

alazzaro commented Jan 25, 2023

@gsitaram Thanks for this PR and sorry for the late reply (I would like to thank @hfp for his contribution too).

I'm putting here some of the topics/comments:

  • Definitely I agree with @hfp proposal, so we can change our LICENSE file. @dev-zero comments?
  • Understand how we can test it. Currently we cannot on github, so we can use the CI on Daint
  • Move file calculate_norms.cpp to cuda_hip directory

I will add some specific comments in the files.

hfp added a commit to hfp/dbcsr that referenced this pull request Jan 26, 2023
src/acc/libsmm_acc/calculate_norms.cpp Outdated Show resolved Hide resolved
src/acc/libsmm_acc/calculate_norms.cpp Outdated Show resolved Hide resolved
src/acc/libsmm_acc/libsmm_acc.cpp Show resolved Hide resolved
src/mm/dbcsr_mm_common.F Show resolved Hide resolved
@gsitaram
Copy link
Contributor Author

@alazzaro , please check if the previous commit reflects our discussion about avoiding the g2g path for all data types other than real_8.

hfp added a commit to hfp/dbcsr that referenced this pull request Feb 1, 2023
@hfp
Copy link
Member

hfp commented Feb 1, 2023

I think Gina was right with the previous location of c_calculate_norms (under src/acc/libsmm_acc). That's the "library" also hosting transpose and multiply (in fact it's also CUDA/HIP-only). Another sign of calculate_norms.cpp belonging to LIBSMM_ACC is the latter only contains infrastructure-code like mem, stream, error, etc., and never contained calculation routines. Further, calculate_norms.cpp depends on libsmm_acc_init.h (acc_get_gpu_warp_size), which is now awkward after moving it to cuda_hip.

@alazzaro alazzaro force-pushed the dbcsr_g2g branch 2 times, most recently from a04f96b to d368160 Compare July 10, 2023 21:17
@alazzaro alazzaro changed the title Add GPU aware MPI support in cannon algorithm WIP: Add GPU aware MPI support in cannon algorithm Jul 11, 2023
@alazzaro alazzaro force-pushed the dbcsr_g2g branch 7 times, most recently from 63a4d0d to 8610601 Compare July 11, 2023 09:37
@alazzaro alazzaro force-pushed the develop branch 6 times, most recently from 1ba0f0b to 6261a60 Compare July 12, 2023 20:53
@alazzaro alazzaro merged commit 3bc658f into cp2k:develop Jan 24, 2024
21 checks passed
@alazzaro
Copy link
Member

thanks @gsitaram !

alazzaro pushed a commit that referenced this pull request Jan 24, 2024
Add GPU aware MPI support in cannon algorithm with norms calculation in GPU
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants