Commit c4c2a8a

Author: Kent Knox
Commit message: Updating README.md for v0.6.0.x release
Parent commit: c686d7f

1 file changed: 24 additions, 142 deletions

README.md

````diff
@@ -1,93 +1,32 @@
 # rocBLAS
-A BLAS implementation on top of AMD's Radeon Open Compute [ROCm][] runtime and toolchains. rocBLAS is implemented in
-the [HIP][] programming language and optimized for AMD's latest discrete GPUs.
+A BLAS implementation on top of AMD's Radeon Open Compute [ROCm][] runtime and toolchains. rocBLAS is implemented in the [HIP][] programming language and optimized for AMD's latest discrete GPUs.
 
-## rocBLAS Wiki
-The [wiki][] has helpful information about building the rocBLAS library, samples and tests.
+## Installing pre-built packages
+Download pre-built packages either from [ROCm's package servers](https://rocm.github.io/install.html#installing-from-amd-rocm-repositories) or by clicking the github releases tab and manually downloading, which could be newer. Release notes are available for each release on the releases tab.
+* `sudo apt update && sudo apt install rocblas`
+
+## Quickstart rocBLAS build
 
-## Building rocBLAS
 #### Bash helper build script (Ubuntu only)
 The root of this repository has a helper bash script `install.sh` to build and install rocBLAS on Ubuntu with a single command. It does not take a lot of options and hard-codes configuration that can be specified through invoking cmake directly, but it's a great way to get started quickly and can serve as an example of how to build/install. A few commands in the script need sudo access, so it may prompt you for a password.
 * `./install -h` -- shows help
-* `./install -id` -- common install flags
-
-### Manual build (all supported platforms)
-The build infrastructure for rocBLAS is based on [Cmake](https://cmake.org/) v3.5. This is the version of cmake available on ROCm supported platforms. Examples of installing cmake:
-* Ubuntu: `sudo apt install cmake-qt-gui`
-* Fedora: `sudo dnf install cmake-gui`
-
-### Library
-The rocBLAS library has one dependency named [Tensile](https://github.com/ROCmSoftwarePlatform/Tensile), which supplies the high-performance implementation of xGEMM. Tensile is downloaded by cmake during library configuration and automatically configured as part of the build, so no further action is required by the user to set it up. Tensile is predominately written in python2.7 (not python3), so it does bring python dependencies which can easily be installed with distro package managers. The rocBLAS library contains both host and device code, so the HCC compiler must be specified during cmake configuration to properly initialize build tools. Example steps to build rocBLAS:
-
-#### (One time only)
-* Ubuntu: `sudo apt install python2.7 python-yaml`
-* Fedora: `sudo dnf install python PyYAML`
-
-#### Configure and build steps
-```bash
-mkdir -p [ROCBLAS_BUILD_DIR]/release
-cd [ROCBLAS_BUILD_DIR]/release
-# Default install location is in /opt/rocm, define -DCMAKE_INSTALL_PREFIX=<path> to specify other
-# Default build config is 'Release', define -DCMAKE_BUILD_TYPE=<config> to specify other
-CXX=/opt/rocm/bin/hcc ccmake [ROCBLAS_SOURCE]
-make -j$(nproc)
-sudo make install # sudo required if installing into system directory such as /opt/rocm
-```
+* `./install -id` -- build library, build dependencies and install (-d flag only needs to be passed once on a system)
 
-### rocBLAS clients
-The repository contains source for clients that serve as samples, tests and benchmarks. Clients source can be found in the clients subdir.
+## Manual build (all supported platforms)
+If you use a distro other than Ubuntu, or would like more control over the build process, the [rocblas build wiki](https://github.com/RadeonOpenCompute/rocBLAS/wiki/Build) has helpful information on how to configure cmake and manually build.
 
-### Dependencies (only necessary for rocBLAS clients)
-The rocBLAS samples have no external dependencies, but our unit test and benchmarking applications do. These clients introduce the following dependencies:
-1. [boost](http://www.boost.org/)
-2. [lapack](https://github.com/Reference-LAPACK/lapack-release)
-* lapack itself brings a dependency on a fortran compiler
-3. [googletest](https://github.com/google/googletest)
+### Functions supported
+A list of [exported functions](https://github.com/RadeonOpenCompute/rocBLAS/wiki/exported-functions) from rocblas can be found on the wiki
 
-Linux distros typically have an easy installation mechanism for boost through the native package manager.
-
-* Ubuntu: `sudo apt install libboost-program-options-dev`
-* Fedora: `sudo dnf install boost-program-options`
-
-Unfortunately, googletest and lapack are not as easy to install. Many distros do not provide a googletest package with pre-compiled libraries, and the lapack packages do not have the necessary cmake config files for cmake to configure linking the cblas library. rocBLAS provide a cmake script that builds the above dependencies from source. This is an optional step; users can provide their own builds of these dependencies and help cmake find them by setting the CMAKE_PREFIX_PATH definition. The following is a sequence of steps to build dependencies and install them to the cmake default /usr/local.
-
-#### (optional, one time only)
-```bash
-mkdir -p [ROCBLAS_BUILD_DIR]/release/deps
-cd [ROCBLAS_BUILD_DIR]/release/deps
-ccmake -DBUILD_BOOST=OFF [ROCBLAS_SOURCE]/deps # assuming boost is installed through package manager as above
-make -j$(nproc) install
-```
-
-Once dependencies are available on the system, it is possible to configure the clients to build. This requires a few extra cmake flags to the library cmake configure script. If the dependencies are not installed into system defaults (like /usr/local ), you should pass the CMAKE_PREFIX_PATH to cmake to help find them.
-* `-DCMAKE_PREFIX_PATH="<semicolon separated paths>"`
-```bash
-# Default install location is in /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path> to specify other
-CXX=/opt/rocm/bin/hcc ccmake -DBUILD_CLIENTS_TESTS=ON -DBUILD_CLIENTS_BENCHMARKS=ON [ROCBLAS_SOURCE]
-make -j$(nproc)
-sudo make install # sudo required if installing into system directory such as /opt/rocm
-```
-
-#### CUDA build errata
-rocBLAS is written with HiP kernels, so it should build and run on CUDA platforms. However, currently the cmake infrastructure is broken
-with a CUDA backend. However, a BLAS marshalling library that presents a common interface for both ROCm and CUDA backends can be found with [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS).
-
-## Migrating libraries to ROCm from OpenCL
-[clBLAS][] demonstrated significant performance benefits of data parallel (GPU) computation when applied to solving dense
-linear algebra problems, but OpenCL primarily remains in the domain of expert programmers. The ROCm model introduces a
-single source paradigm for integrating device and host code together in a single source file, thereby simplifying the
-entire development process for heterogeneous computing. Compilers will get smarter, catching errors at compile/build time
-and native profilers/debuggers will better integrate into the development process. As AMD simplifies the
-programming model with ROCm (using HCC and HIP), it is the intent of this library to expose that simplified programming
-model to end users.
-
-## rocBLAS interface
-In general, the rocBLAS interface is compatible with Legacy [Netlib BLAS][] and the cuBLAS-v2 API, with
-the explicit exception that Legacy BLAS does not have handle. The cuBLAS' cublasHandle_t is replaced
+## rocBLAS interface examples
+In general, the rocBLAS interface is compatible with CPU oriented [Netlib BLAS][] and the cuBLAS-v2 API, with
+the explicit exception that traditional BLAS interfaces do not accept handles. The cuBLAS' cublasHandle_t is replaced
 with rocblas_handle everywhere. Thus, porting a CUDA application which originally calls the cuBLAS API
 to a HIP application calling rocBLAS API should be relatively straightforward. For example, the rocBLAS
 SGEMV interface is
 
+### GEMV API
+
 ```c
 rocblas_status
 rocblas_sgemv(rocblas_handle handle,
````
````diff
@@ -100,67 +39,7 @@ rocblas_sgemv(rocblas_handle handle,
 float* y, rocblas_int incy);
 ```
 
-rocBLAS assumes matrices A and vectors x, y are allocated in GPU memory space filled with data. Users are
-responsible for copying data from/to the host and device memory. HIP provides memcpy style API's to facilitate data
-management.
-
-## Rules for obtaining the rocBLAS API from Legacy BLAS
-1. The Legacy BLAS routine name is changed to lower case, and prefixed by rocblas_.
-
-2. A first argument rocblas_handle handle is added to all rocBlas functions.
-
-3. Input arguments are declared with the const modifier.
-
-4. Character arguments are replaced with enumerated types defined in rocblas_types.h.
-They are passed by value on the host.
-
-5. Array arguments are passed by reference on the device.
-
-6. Scalar arguments are passed by value on the host with the following two exceptions:
-
-* Scalar values alpha and beta are passed by reference on either the host or
-the device. The rocBLAS functions will check to see it the value is on
-the device. If this is true, it is used, else the value on the host is
-used.
-
-* Where Legacy BLAS functions have return values, the return value is instead
-added as the last function argument. It is returned by reference on either
-the host or the device. The rocBLAS functions will check to see it the value
-is on the device. If this is true, it is used, else the value is returned
-on the host. This applies to the following functions: xDOT, xDOTU, xNRM2,
-xASUM, IxAMAX, IxAMIN.
-
-7. The return value of all functions is rocblas_status, defined in rocblas_types.h. It is
-used to check for errors.
-
-
-#### Additional notes
-
-* The rocBLAS library is LP64, so rocblas_int arguments are 32 bit and rocblas_long
-arguments are 64 bit.
-
-* rocBLAS uses column-major storage for 2D arrays, and 1 based indexing for
-the functions xMAX and xMIN. This is the same as Legacy BLAS and cuBLAS.
-If you need row-major and 0 based indexing (used in C language arrays)
-download the [CBLAS](http://www.netlib.org/blas/#_cblas) file cblas.tgz.
-Look at the CBLAS functions that provide a thin interface to Legacy BLAS. They
-convert from row-major, 0 based, to column-major, 1 based. This is done by
-swapping the order of function arguments. It is not necessary to transpose
-matrices.
-
-* The auxiliary functions rocblas_set_pointer and rocblas_get_pointer are used
-to set and get the value of the state variable rocblas_pointer_mode. This
-variable is not used, it is added for compatibility with cuBLAS. rocBLAS
-will check if your scalar argument passed by reference is on the device.
-If this is true it will pass by reference on the device, else it passes
-by reference on the host.
-
-## Asynchronous API
-Except a few routines (like TRSM) having memory allocation inside preventing asynchronicity, most of the library routines
-(like BLAS-1 SCAL, BLAS-2 GEMV, BLAS-3 GEMM) are configured to operate in asynchronous fashion with respect to CPU,
-meaning that these library function calls return immediately.
-
-## Batched and strided GEMM API
+### Batched and strided GEMM API
 rocBLAS GEMM can process matrices in batches with regular strides. There are several permutations of these API's, the
 following is an example that takes everything
 
````
````diff
@@ -178,14 +57,17 @@ rocblas_sgemm_strided_batched(
 rocblas_int batch_count )
 ```
 
+rocBLAS assumes matrices A and vectors x, y are allocated in GPU memory space filled with data. Users are
+responsible for copying data from/to the host and device memory. HIP provides memcpy style API's to facilitate data
+management.
 
-
-[wiki]: https://github.com/RadeonOpenCompute/rocBLAS/wiki
+## Asynchronous API
+Except a few routines (like TRSM) having memory allocation inside preventing asynchronicity, most of the library routines
+(like BLAS-1 SCAL, BLAS-2 GEMV, BLAS-3 GEMM) are configured to operate in asynchronous fashion with respect to CPU,
+meaning these library functions return immediately.
 
 [ROCm]: https://github.com/RadeonOpenCompute/ROCm
 
 [HIP]: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/
 
 [Netlib BLAS]: http://www.netlib.org/blas/
-
-[clBLAS]: https://github.com/clMathLibraries/clBLAS
````
