Generic kernel #812
Conversation
For the record, I've disabled Daint CI for the moment (Daint will disappear soon).
It turns out a generic kernel is hard to make without the proper tuning procedure (see https://storage.googleapis.com/cp2k-ci/run-cp2k-cuda-pascal-47248377_report.txt for some of the failing kernels). Since we do test all generated kernels, the new proposed workflow is:

1. Use the tuned kernel when its parameters are available.
2. Otherwise, use the generic kernel (up to the size limit).
3. Otherwise, fall back to the CPU.
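A minimal host-side sketch of this triage, assuming the three cases above; `select_kernel`, `have_tuned_parameters`, and the limit of 50 (taken from later in this thread) are hypothetical names and values, not the actual DBCSR interface:

```cuda
#include <cstdio>

enum KernelChoice { TUNED, GENERIC, CPU_FALLBACK };

// Stub for the lookup into the table of autotuned parameters.
bool have_tuned_parameters(int m, int n, int k) {
  (void)m; (void)n; (void)k;
  return false;
}

KernelChoice select_kernel(int m, int n, int k) {
  if (have_tuned_parameters(m, n, k)) return TUNED;   // case 1: tuned kernel
  if (m <= 50 && n <= 50 && k <= 50) return GENERIC;  // case 2: generic kernel
  return CPU_FALLBACK;                                // case 3: CPU fallback
}

int main(void) {
  printf("55x55x55 -> choice %d\n", select_kernel(55, 55, 55)); // 2 (CPU_FALLBACK)
  return 0;
}
```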
As far as I can see, only 56 tests are failing in the CP2K-CI on P100 (apparently with large kernels, e.g. …). In the future, I would consider switching to cuBLAS for case 3. For the moment, I think this is a step forward.
I think this is due to an implementation that is inappropriate for larger kernels, e.g., too many registers used. I believe this goes along with the long JIT compilation of such kernels, i.e., the compiler works very hard to avoid or implement spilling of the excess registers. A way to circumvent this is to implement a maximum size inside the kernel and to branch into a different flavor for larger kernels.
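To make the "branch into a different flavor" idea concrete, here is a minimal CUDA sketch; the kernel bodies and the threshold are invented for illustration and far simpler than DBCSR's actual batched kernels:

```cuda
#include <algorithm>

// Flavor for small blocks: one thread per C element, accumulator in a register.
__global__ void multiply_small(const double* a, const double* b, double* c,
                               int m, int n, int k) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= m * n) return;
  const int row = i / n, col = i % n;
  double sum = 0.0;
  for (int p = 0; p < k; ++p) sum += a[row * k + p] * b[p * n + col];
  c[i] = sum;
}

// Flavor for large blocks: grid-stride loop, deliberately register-frugal.
__global__ void multiply_large(const double* a, const double* b, double* c,
                               int m, int n, int k) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < m * n;
       i += gridDim.x * blockDim.x) {
    const int row = i / n, col = i % n;
    double sum = 0.0;
    for (int p = 0; p < k; ++p) sum += a[row * k + p] * b[p * n + col];
    c[i] = sum;
  }
}

// The branch itself is cheap: one host-side comparison per launch.
void launch(const double* a, const double* b, double* c, int m, int n, int k) {
  const int kMaxSmall = 23;  // illustrative threshold, not a tuned value
  const int threads = 256;
  const int blocks = (m * n + threads - 1) / threads;
  if (std::max({m, n, k}) <= kMaxSmall)
    multiply_small<<<blocks, threads>>>(a, b, c, m, n, k);
  else
    multiply_large<<<blocks, threads>>>(a, b, c, m, n, k);
}
```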
Yes, agree. Though, I have not implemented calling MKL for GPUs in the case of the OpenCL backend. However, the OpenCL backend validates all kernel sizes up to the static maximum we have set for all GPUs.
As a note for other readers, a big part of the penalty is due to the data already having been uploaded to the GPU, rather than CPU performance being slow per se.
Yep, your analysis is correct. I've decided to add a limit on the kernel size (any dimension > 50 falls back to the CPU). Otherwise, the compiler cannot compile the kernel (we get an error that the PTX cannot be loaded).
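For reference, a hedged sketch of where that failure surfaces and how the limit can guard it; this uses the CUDA driver API directly and the function names are illustrative, not DBCSR's actual code:

```cuda
#include <cuda.h>
#include <cstdio>

// Reject block sizes above the limit before any JIT is attempted.
bool within_generic_limit(int m, int n, int k) {
  return m <= 50 && n <= 50 && k <= 50;
}

// Load JIT-compiled PTX; this is the step that fails with
// "cannot load the PTX" for oversized kernels, so treat it as a
// recoverable error and fall back rather than aborting.
bool jit_load(const char* ptx, CUmodule* module) {
  const CUresult rc = cuModuleLoadData(module, ptx);
  if (rc != CUDA_SUCCESS) {
    const char* msg = nullptr;
    cuGetErrorString(rc, &msg);
    fprintf(stderr, "PTX load failed: %s\n", msg ? msg : "unknown error");
    return false;
  }
  return true;
}
```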
My speculation is that the JIT will mostly fail for large kernels. The entire procedure needs more checking. As a side note, it is risky to use kernels tuned on previous architectures unless we test them on the new one.
... and you can be very happy about getting this error. The worst case is that it compiles for a long time (much longer than normal) and produces broken code. Hats off to NVIDIA's toolchain for knowing when it failed!
Yes, there is a remaining risk, but it is not necessarily attributable to JIT compilation; it's generally the same toolchain as the offline compiler. However, larger kernels really need an implementation that intrinsically limits the register usage; the branch to decide about its use is cheap.
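As an illustration of limiting register usage inside the kernel itself (not DBCSR's actual code): the `__launch_bounds__` qualifier tells the compiler the maximum block size and a minimum residency, which caps the registers it may allocate per thread; `nvcc`'s `-maxrregcount` flag is the coarser, whole-compilation-unit alternative.

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// with at most 128 threads per block and at least 4 resident blocks per SM,
// the compiler must keep per-thread register use low enough to fit,
// spilling to local memory instead of failing outright.
__global__ void __launch_bounds__(128, 4)
generic_large(const double* a, const double* b, double* c, int m, int n, int k) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < m * n;
       i += gridDim.x * blockDim.x) {
    const int row = i / n, col = i % n;
    double sum = 0.0;
    for (int p = 0; p < k; ++p) sum += a[row * k + p] * b[p * n + col];
    c[i] = sum;
  }
}
```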
The CUDA/HIP backend uses several different kernel implementations. Did you choose only one of them for the generic kernel? My guess is that if all of them went into the generic kernel (along with appropriate conditions), then you would not see failing kernels. The "appropriate" conditions, however, can be tricky. It can help to "learn" from the tuned cases when to select each flavor; e.g., the criterion might not be "the size" (like M*N*K) but rather just M or N, or a combination of them, or something else. You will end up hard-coding some basic rules, and then we can actually throw away the whole offline prediction ;-)
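To illustrate what such hard-coded rules could look like (the flavor names and thresholds here are entirely invented):

```cuda
enum class Flavor { Tiny, Small, Medium, Large };

// Hand-written selection distilled from tuned cases; note the rules key on
// individual dimensions, not only on the product m*n*k.
Flavor pick_flavor(int m, int n, int k) {
  if (m * n <= 64)                   return Flavor::Tiny;
  if (m <= 16 || n <= 16)            return Flavor::Small;  // skinny blocks
  if (m <= 50 && n <= 50 && k <= 50) return Flavor::Medium;
  return Flavor::Large;
}
```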
So, I went for the most used kernel type ("medium") and tried to figure out the dependencies on the other parameters (tile_m, tile_n, threads, grouping, w, v). I tried several branches depending on the size, but it is hard (especially for rectangular blocks). In particular, for large kernels (55x55x55, for example), every combination fails; I ended up running the autotuning, and almost all generated kernel combinations fail, so we cannot put this in production. Now I'm testing an extreme case: always use the generic kernel. Let's see what the CP2K-CI tells us...
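For context, the knobs in question could be bundled like this; the structure mirrors the parameter names above, but every default below is a placeholder, not a value chosen in this PR:

```cuda
// Parameter set for the generic "medium" flavor; all values illustrative.
struct LaunchParams {
  int tile_m, tile_n;  // per-thread output tile
  int threads;         // threads per block
  int grouping;        // stack entries processed per block
  int w, v;            // further tiling parameters (real meanings elided)
};

// Hypothetical heuristic: derive the parameters from the block shape
// instead of keeping a single fixed set for all sizes.
LaunchParams generic_medium_params(int m, int n, int k) {
  LaunchParams p;
  p.tile_m = (m >= 24) ? 4 : 2;
  p.tile_n = (n >= 24) ? 4 : 2;
  p.threads = 128;
  p.grouping = 16;
  p.w = 8;
  p.v = 8;
  (void)k;  // k left unused in this toy heuristic
  return p;
}
```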
OK, the generic kernel passes all tests in the CP2K-CI (see https://storage.googleapis.com/cp2k-ci/run-cp2k-cuda-pascal-48ac3b6d_report.txt). The wrong result pre-dates this PR.
This PR introduces a generic (untuned) kernel for the ACC, used when the tuned kernel is not present. This pushes the computation to the ACC (previously it was falling back to the CPU, with a big performance penalty).
The output changes accordingly, e.g.: