# Introduction

This file provides instructions for:
- Running smoke (functional, unoptimized) tests for the ALP/Dense sequential reference backend (aka alp_reference);
- Running optimized performance tests of the ALP/Dense sequential reference backend with dispatch to BLAS (aka alp_dispatch);
- Running optimized performance tests of the ALP/Dense shared-memory backend with dispatch to BLAS (aka alp_omp).

# Performance Tests

These tests have been executed:
- On a Kunpeng 920 node, using 1 core for the sequential reference and alp_dispatch tests and 64 cores for the alp_omp tests;
- Compiling with GCC 9.4.0;
- Linking against KunpengBLAS from the Kunpeng BoostKit 22.0.RC1 and against the netlib LAPACK linked to the same BLAS library;
- All tests report their runtime in milliseconds after the _time (ms, ...)_ lines printed on screen.

In our evaluation we extracted the _Kunpeng BoostKit 22.0.RC1_ into a `BLAS_ROOT` folder (the `usr/local/kml` directory extracted from the `boostkit-kml-1.6.0-1.aarch64.rpm` package). `BLAS_ROOT` should contain the `include/kblas.h` header file and the `lib/kblas/{locking, nolocking, omp, pthread}/libkblas.so` libraries.
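As a quick sanity check of this layout, the following sketch (assuming `BLAS_ROOT` is already exported, for example as in the snippet below) verifies that the expected header and libraries are in place:

```
# Sanity check: the KML header and the four libkblas.so variants should all exist.
ls "$BLAS_ROOT/include/kblas.h"
for impl in locking nolocking omp pthread; do
  ls "$BLAS_ROOT/lib/kblas/$impl/libkblas.so"
done
```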

If no system LAPACK library can be found by the compiler, `LAPACK_LIB` (containing the `liblapack.{a,so}` library) and `LAPACK_INCLUDE` (containing the `lapacke.h` header file) have to be set appropriately and provided to CMake, for example by exporting them as follows:

```
# The root folder where this branch is cloned.
export ALP_SOURCE="$(realpath ../)"
# The build folder from which these steps are run.
export ALP_BUILD="$(pwd)"
# The KML installation folder.
# For example, the "usr/local/kml" directory extracted from the "boostkit-kml-1.6.0-1.aarch64.rpm" package.
#export BLAS_ROOT="/path/to/kunpengblas/boostkit-kml-1.6.0.aarch64/usr/local/kml"
# The lib folder of the LAPACK library.
#export LAPACK_LIB="/path/to/lapack/netlib/build/lib"
# The include folder of the LAPACK library.
# Must include the C/C++ LAPACKE interface.
#export LAPACK_INCLUDE="/path/to/lapack/netlib/lapack-3.9.1/LAPACKE/include/"

if [ -z ${BLAS_ROOT+x} ] || [ -z ${LAPACK_LIB+x} ] || [ -z ${LAPACK_INCLUDE+x} ]; then
  echo "Please define BLAS_ROOT, LAPACK_LIB, and LAPACK_INCLUDE variables."
fi
```

In particular, we assume the availability of the C/C++ LAPACKE interface and, for all tests below, that no system LAPACK/BLAS libraries are available.

Assuming this branch is cloned in the `ALP_SOURCE` folder, all instructions provided below should be run from the `$ALP_SOURCE/build` folder.
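For example, a minimal sketch that creates and enters the build folder (assuming `ALP_SOURCE` is exported as above):

```
# Create the out-of-source build folder and make it the working directory
# for all the commands that follow.
mkdir -p "$ALP_SOURCE/build"
cd "$ALP_SOURCE/build"
```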

An analogous [script-like](alpdense.sh) version of this page is available in the ALP root directory of this branch. You may run it directly (**note:** always make sure to first customize the export commands above for your environment) as follows:

```
bash ../alpdense.sh
```

or follow the instructions on this page step by step.
# Source Code Location

Assuming this branch is cloned in the `ALP_SOURCE` folder, all ALP/Dense include files are located in the `$ALP_SOURCE/include/alp` folder:
- In particular, all the pre-implemented algorithms are located in `$ALP_SOURCE/include/alp/algorithms`;
- The reference, dispatch, and omp backends are located in `$ALP_SOURCE/include/alp/reference`, `$ALP_SOURCE/include/alp/dispatch`, and `$ALP_SOURCE/include/alp/omp`, respectively.

All tests discussed below are collected in the `$ALP_SOURCE/tests/smoke` and `$ALP_SOURCE/tests/performance` folders. The folder `$ALP_SOURCE/tests/unit` contains additional unit tests not discussed on this page.

# Dependencies

For all tests below, the standard ALP dependencies are required (a quick availability check is sketched after this list):
- LibNUMA: -lnuma
- Standard math library: -lm
- POSIX threads: -lpthread
- OpenMP: -fopenmp in the case of GCC
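A quick way to verify that these libraries are visible to the linker (a sketch, assuming `ldconfig` is available on the system) is:

```
# Check that libnuma, libm, and libpthread are known to the dynamic linker.
for lib in libnuma libm libpthread; do
  ldconfig -p | grep -m 1 "$lib" || echo "$lib not found in the linker cache"
done
```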

# Sequential Smoke Tests (Functional, Unoptimized)

We collect the following smoke tests associated with the ALP/Dense reference backend:
- Basic targets:
  - General matrix-matrix multiplication ([source](tests/smoke/alp_gemm.cpp))
  - Householder tridiagonalization of a real symmetric/complex Hermitian matrix ([source](tests/smoke/alp_zhetrd.cpp))
  - Divide-and-conquer eigensolver for tridiagonal, real symmetric matrices ([source](tests/smoke/alp_stedc.cpp))
  - Eigensolver for real symmetric matrices ([source](tests/smoke/alp_zheevd.cpp))
  - Householder QR decomposition of a real/complex general matrix ([source](tests/smoke/alp_zgeqrf.cpp))
- Challenge targets:
  - Triangular linear system solve using backsubstitution of an upper triangular, real/complex matrix ([source](tests/smoke/alp_backsubstitution.cpp))
  - Triangular linear system solve using forward substitution of a lower triangular, real/complex matrix ([source](tests/smoke/alp_forwardsubstitution.cpp))
  - Cholesky decomposition of a symmetric/Hermitian positive definite matrix ([source](tests/smoke/alp_cholesky.cpp))
  - Householder LU decomposition of a real/complex general matrix ([source](tests/smoke/alp_zgetrf.cpp))
  - Inverse of a symmetric/Hermitian positive definite matrix ([source](tests/smoke/alp_potri.cpp))
  - Singular value decomposition of a real/complex general matrix ([source](tests/smoke/alp_zgesvd.cpp))

These tests are collected and run as ALP smoke tests.
From `$ALP_SOURCE/build` run:

```
cmake -DWITH_ALP_REFERENCE_BACKEND=ON -DCMAKE_INSTALL_PREFIX=./install $ALP_SOURCE || ( echo "test failed" && exit 1 )
make smoketests_alp -j$(nproc)
```
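To run the resulting binaries individually, a hypothetical sketch (assuming the build emits them under `tests/smoke/` with names derived from the source files listed above; adjust the glob to your actual build layout) is:

```
# Run every smoke-test binary found under tests/smoke/ and stop at the first failure.
for t in tests/smoke/alp_*; do
  if [ -x "$t" ]; then
    echo "Running $t"
    "$t" || { echo "test failed"; break; }
  fi
done
```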

# Sequential Cholesky Decomposition Tests (Optimized)

Here we compare our ALP Cholesky implementation, based on the alp_dispatch backend, against the LAPACK `potrf` routine.

From the `$ALP_SOURCE/build` folder run the following commands:

```
cmake -DKBLAS_ROOT="$BLAS_ROOT" -DWITH_ALP_DISPATCH_BACKEND=ON -DCMAKE_INSTALL_PREFIX=./install $ALP_SOURCE || ( echo "test failed" && exit 1 )
make install -j$(nproc) || ( echo "test failed" && exit 1 )
```

## LAPACK-Based Test

To compile and run the LAPACK-based Cholesky test (not ALP code), run the following commands:
```
install/bin/grbcxx -b alp_dispatch -o cholesky_lapack_reference.exe $ALP_SOURCE/tests/performance/lapack_cholesky.cpp $LAPACK_LIB/liblapack.a -I$LAPACK_INCLUDE -lgfortran || ( echo "test failed" && exit 1 )
./cholesky_lapack_reference.exe -n 1024 -repeat 10 || ( echo "test failed" && exit 1 )
```
In our tests, we executed `./cholesky_lapack_reference.exe` with matrix sizes (`-n` flag) in the range [400, 3000] in steps of 100.
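This sweep can be scripted, for example, as follows (a sketch reproducing the sizes stated above):

```
# Sweep matrix sizes from 400 to 3000 in steps of 100, 10 repetitions each.
for n in $(seq 400 100 3000); do
  ./cholesky_lapack_reference.exe -n $n -repeat 10 || { echo "test failed"; break; }
done
```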

## ALP-Based Test (Dispatch Sequential Building Blocks to Optimized BLAS)

Some facts about this test:
- The algorithm is a blocked variant of Cholesky with block size BS = 64 (as done in LAPACK).
- It recursively requires an unblocked version of the same algorithm (of size BSxBS), which does not dispatch to LAPACK.
- All BLAS functions needed by the algorithm are dispatched to the external BLAS library. In particular, as a proof of concept of what ALP could offer in terms of performance if its primitives could be efficiently generated/optimized (e.g., via our envisioned MLIR-based backend for delayed compilation), it dispatches the triangular solve and the fused `foldl`+`mxm` operations.

```
make test_alp_cholesky_perf_alp_dispatch -j$(nproc) || ( echo "test failed" && exit 1 )
tests/performance/alp_cholesky_perf_alp_dispatch -n 1024 -repeat 10 || ( echo "test failed" && exit 1 )
```
As for the LAPACK-based test, we executed `tests/performance/alp_cholesky_perf_alp_dispatch` with matrix sizes (`-n` flag) in the range [400, 3000] in steps of 100.
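The corresponding sweep can be scripted in the same way (a sketch with the same sizes as for the LAPACK-based test):

```
# Sweep matrix sizes from 400 to 3000 in steps of 100, 10 repetitions each.
for n in $(seq 400 100 3000); do
  tests/performance/alp_cholesky_perf_alp_dispatch -n $n -repeat 10 || { echo "test failed"; break; }
done
```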

**Note:** A consistent comparison should use the same BLAS library in both the LAPACK-based and the ALP-based tests.

# Shared-Memory Parallel `mxm` Tests (Optimized)

Here we compare the `mxm` implementation of our ALP shared-memory backend (alp_omp) against the BLAS `gemm` routine.
`mxm` is an in-place ALP primitive that computes C = C + A*B, with matrices of conforming sizes.

Our shared-memory backend currently supports only square thread grids (although the methodology is not limited to that in general). For this reason, in the tests below we run both the BLAS-based and the ALP-based tests using 64 threads. To ensure a fair comparison, we link with the `omp` version of KunpengBLAS.

You can compile with the `omp` version of KunpengBLAS by additionally providing the `-DKBLAS_IMPL=omp` flag when calling cmake. However, this should be built in a different directory from the other BLAS-based builds, as follows:
```
CWD=$(pwd)
ompbuild="build_with_omp_blas"
rm -rf $ompbuild && mkdir $ompbuild && cd $ompbuild
cmake -DKBLAS_ROOT="$BLAS_ROOT" -DKBLAS_IMPL=omp -DWITH_ALP_OMP_BACKEND=ON -DCMAKE_INSTALL_PREFIX=./install $ALP_SOURCE || ( echo "test failed" && exit 1 )
make install -j$(nproc) || ( echo "test failed" && exit 1 )
```

## `gemm`-Based BLAS Test

From the `$ompbuild` folder run:
```
install/bin/grbcxx -b alp_dispatch -o blas_mxm.exe $ALP_SOURCE/tests/performance/blas_mxm.cpp -lgfortran || ( echo "test failed" && exit 1 )
OMP_NUM_THREADS=64 ./blas_mxm.exe -n 1024 -repeat 10 || ( echo "test failed" && exit 1 )
cd $CWD
```
In our tests, we executed `./blas_mxm.exe` with matrix sizes (`-n` flag) from 1024 to 10240 in steps of 1024.
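A sketch of the full sweep (run from the `$ompbuild` folder, i.e. before the `cd $CWD` above):

```
# Sweep matrix sizes from 1024 to 10240 in steps of 1024, 10 repetitions each,
# using 64 OpenMP threads for the BLAS gemm.
for n in $(seq 1024 1024 10240); do
  OMP_NUM_THREADS=64 ./blas_mxm.exe -n $n -repeat 10 || { echo "test failed"; break; }
done
```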

## ALP-Based Test (Dispatch Sequential Building Blocks to Optimized BLAS)

Some facts about this test:
- The ALP `mxm` shared-memory implementation is based on a [2.5D matrix multiplication algorithm](https://netlib.org/lapack/lawnspdf/lawn248.pdf);
- In this test we execute with a 3D thread grid of size 4x4x4;
- We set `OMP_NUM_THREADS=64` and fix `GOMP_CPU_AFFINITY="0-15 24-39 48-63 72-87"` to reflect the cores and NUMA topology of the node;
- The algorithm allocates memory using a 2D block-cyclic layout with blocks of size 128x128;
- Each sequential block-level `mxm` (128x128x128) is dispatched to the selected BLAS library.

From the `$ompbuild` folder (where the alp_omp backend is configured) run:

```
make test_alp_mxm_perf_alp_omp -j$(nproc) || ( echo "test failed" && exit 1 )
GOMP_CPU_AFFINITY="0-15 24-39 48-63 72-87" OMP_NUM_THREADS=64 tests/performance/alp_mxm_perf_alp_omp -n 1024 -repeat 10 || ( echo "test failed" && exit 1 )
```
As for the gemm-based test, we executed `tests/performance/alp_mxm_perf_alp_omp` with matrix sizes (`-n` flag) from 1024 to 10240 in steps of 1024.
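The full sweep can be scripted analogously (a sketch that keeps the thread count and pinning used above):

```
# Sweep matrix sizes from 1024 to 10240 in steps of 1024, 10 repetitions each,
# with the same OpenMP thread pinning as for the single run above.
for n in $(seq 1024 1024 10240); do
  GOMP_CPU_AFFINITY="0-15 24-39 48-63 72-87" OMP_NUM_THREADS=64 \
    tests/performance/alp_mxm_perf_alp_omp -n $n -repeat 10 || { echo "test failed"; break; }
done
```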