Generic kernel #812
Conversation
For the record, I've disabled Daint CI for the moment (Daint will disappear soon).
It turns out a generic kernel is hard to make without the proper tuning procedure (see https://storage.googleapis.com/cp2k-ci/run-cp2k-cuda-pascal-47248377_report.txt for some of the failing kernels). Since we do test all generated kernels, the new proposed workflow is:

1. Use the tuned kernel when its parameters are available.
2. Otherwise, use the generic kernel (up to the size limit).
3. Otherwise, fall back to the CPU.
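A minimal host-side sketch of this triage, assuming the three cases above; `select_kernel`, `have_tuned_parameters`, and the limit of 50 (taken from later in this thread) are hypothetical names and values, not the actual DBCSR interface:

```cuda
#include <cstdio>

enum KernelChoice { TUNED, GENERIC, CPU_FALLBACK };

// Stub for the lookup into the table of autotuned parameters.
bool have_tuned_parameters(int m, int n, int k) {
  (void)m; (void)n; (void)k;
  return false;
}

KernelChoice select_kernel(int m, int n, int k) {
  if (have_tuned_parameters(m, n, k)) return TUNED;   // case 1: tuned kernel
  if (m <= 50 && n <= 50 && k <= 50) return GENERIC;  // case 2: generic kernel
  return CPU_FALLBACK;                                // case 3: CPU fallback
}

int main(void) {
  printf("55x55x55 -> choice %d\n", select_kernel(55, 55, 55)); // 2 (CPU_FALLBACK)
  return 0;
}
```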
As far as I can see, only 56 tests are failing in the CP2K-CI on P100 (apparently with large kernels, e.g. …). In the future, I would consider switching to cuBLAS for case 3. For the moment, I think this is a step forward.
I think this is due to an implementation that is inappropriate for larger kernels, e.g., too many registers used. I believe this goes along with the long JIT compilation of such kernels, i.e., the compiler works very hard to avoid or implement spilling of the excess registers. A way to circumvent this is to implement a maximum size inside the kernel and to branch into a different flavor for larger kernels.
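To make the "branch into a different flavor" idea concrete, here is a minimal CUDA sketch; the kernel bodies and the threshold are invented for illustration and far simpler than DBCSR's actual batched kernels:

```cuda
#include <algorithm>

// Flavor for small blocks: one thread per C element, accumulator in a register.
__global__ void multiply_small(const double* a, const double* b, double* c,
                               int m, int n, int k) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= m * n) return;
  const int row = i / n, col = i % n;
  double sum = 0.0;
  for (int p = 0; p < k; ++p) sum += a[row * k + p] * b[p * n + col];
  c[i] = sum;
}

// Flavor for large blocks: grid-stride loop, deliberately register-frugal.
__global__ void multiply_large(const double* a, const double* b, double* c,
                               int m, int n, int k) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < m * n;
       i += gridDim.x * blockDim.x) {
    const int row = i / n, col = i % n;
    double sum = 0.0;
    for (int p = 0; p < k; ++p) sum += a[row * k + p] * b[p * n + col];
    c[i] = sum;
  }
}

// The branch itself is cheap: one host-side comparison per launch.
void launch(const double* a, const double* b, double* c, int m, int n, int k) {
  const int kMaxSmall = 23;  // illustrative threshold, not a tuned value
  const int threads = 256;
  const int blocks = (m * n + threads - 1) / threads;
  if (std::max({m, n, k}) <= kMaxSmall)
    multiply_small<<<blocks, threads>>>(a, b, c, m, n, k);
  else
    multiply_large<<<blocks, threads>>>(a, b, c, m, n, k);
}
```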
Yes, agree. Though, I have not implemented calling MKL for GPUs in the case of the OpenCL backend. However, the OpenCL backend validates all kernel sizes up to the static maximum we have set for all GPUs.
As a note for other readers, a big part of the penalty is due to the data already having been uploaded to the GPU, rather than CPU performance being slow per se.
Yep, your analysis is correct. I've decided to add a limit on the kernel size (any dimension > 50 falls back to the CPU). Otherwise, the compiler cannot compile the kernel (we get an error that the PTX cannot be loaded).
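For reference, a hedged sketch of where that failure surfaces and how the limit can guard it; this uses the CUDA driver API directly and the function names are illustrative, not DBCSR's actual code:

```cuda
#include <cuda.h>
#include <cstdio>

// Reject block sizes above the limit before any JIT is attempted.
bool within_generic_limit(int m, int n, int k) {
  return m <= 50 && n <= 50 && k <= 50;
}

// Load JIT-compiled PTX; this is the step that fails with
// "cannot load the PTX" for oversized kernels, so treat it as a
// recoverable error and fall back rather than aborting.
bool jit_load(const char* ptx, CUmodule* module) {
  const CUresult rc = cuModuleLoadData(module, ptx);
  if (rc != CUDA_SUCCESS) {
    const char* msg = nullptr;
    cuGetErrorString(rc, &msg);
    fprintf(stderr, "PTX load failed: %s\n", msg ? msg : "unknown error");
    return false;
  }
  return true;
}
```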
My speculation is that the JIT will mostly fail for large kernels. The entire procedure needs more checking. As a side note, it is risky to use kernels tuned on previous architectures unless we test them on the new one.
... and you can be very happy about getting this error. The worst case is that it compiles for a long time (much longer than normal) and produces broken code. Hats off to NVIDIA's toolchain for knowing when it failed!
Yes, there is a remaining risk, but it is not necessarily attributable to JIT compilation; it's generally the same toolchain as the offline compiler. However, larger kernels really need an implementation that intrinsically limits the register usage; the branch to decide about its use is cheap.
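As an illustration of limiting register usage inside the kernel itself (not DBCSR's actual code): the `__launch_bounds__` qualifier tells the compiler the maximum block size and a minimum residency, which caps the registers it may allocate per thread; `nvcc`'s `-maxrregcount` flag is the coarser, whole-compilation-unit alternative.

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// with at most 128 threads per block and at least 4 resident blocks per SM,
// the compiler must keep per-thread register use low enough to fit,
// spilling to local memory instead of failing outright.
__global__ void __launch_bounds__(128, 4)
generic_large(const double* a, const double* b, double* c, int m, int n, int k) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < m * n;
       i += gridDim.x * blockDim.x) {
    const int row = i / n, col = i % n;
    double sum = 0.0;
    for (int p = 0; p < k; ++p) sum += a[row * k + p] * b[p * n + col];
    c[i] = sum;
  }
}
```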
The CUDA/HIP backend uses several different kernel implementations. Did you choose only one of them for the generic kernel? My guess is that if all of them went into the generic kernel (along with appropriate conditions), then you would not see failing kernels. The "appropriate" conditions, however, can be tricky. It can help to "learn" from the tuned cases when to select each flavor; e.g., the criterion might not be "the size" (like M*N*K) but rather just M or N, or a combination of them, or something else. You will end up hard-coding some basic rules, and then we can actually throw away the whole offline prediction ;-)
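To illustrate what such hard-coded rules could look like (the flavor names and thresholds here are entirely invented):

```cuda
enum class Flavor { Tiny, Small, Medium, Large };

// Hand-written selection distilled from tuned cases; note the rules key on
// individual dimensions, not only on the product m*n*k.
Flavor pick_flavor(int m, int n, int k) {
  if (m * n <= 64)                   return Flavor::Tiny;
  if (m <= 16 || n <= 16)            return Flavor::Small;  // skinny blocks
  if (m <= 50 && n <= 50 && k <= 50) return Flavor::Medium;
  return Flavor::Large;
}
```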
So, I went for the most used kernel type ("medium") and tried to figure out the dependencies on the other parameters (tile_m, tile_n, threads, grouping, w, v). I tried several branches depending on the size, but it is hard (especially for rectangular blocks). In particular, for large kernels (55x55x55, for example), every combination fails; I ended up running the autotuning, and almost all generated kernel combinations fail, so we cannot put this in production. Now I'm testing an extreme case: always use the generic kernel. Let's see what the CP2K-CI tells us...
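For context, the knobs in question could be bundled like this; the structure mirrors the parameter names above, but every default below is a placeholder, not a value chosen in this PR:

```cuda
// Parameter set for the generic "medium" flavor; all values illustrative.
struct LaunchParams {
  int tile_m, tile_n;  // per-thread output tile
  int threads;         // threads per block
  int grouping;        // stack entries processed per block
  int w, v;            // further tiling parameters (real meanings elided)
};

// Hypothetical heuristic: derive the parameters from the block shape
// instead of keeping a single fixed set for all sizes.
LaunchParams generic_medium_params(int m, int n, int k) {
  LaunchParams p;
  p.tile_m = (m >= 24) ? 4 : 2;
  p.tile_n = (n >= 24) ? 4 : 2;
  p.threads = 128;
  p.grouping = 16;
  p.w = 8;
  p.v = 8;
  (void)k;  // k left unused in this toy heuristic
  return p;
}
```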
OK, the generic kernel passes all tests in the CP2K-CI (see https://storage.googleapis.com/cp2k-ci/run-cp2k-cuda-pascal-48ac3b6d_report.txt). The wrong result pre-dates this PR.
This PR introduces a generic (untuned) kernel for the ACC, used when the tuned kernel is not present. This pushes the computation to the ACC (previously it was falling back to the CPU, with a big performance penalty).
The output changes accordingly, e.g.: