Commit
0 parents
commit afe6ffa
Showing 227 changed files with 85,211 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
--- | ||
Language: Cpp | ||
BasedOnStyle: google | ||
ColumnLimit: 120 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
*.ipynb linguist-documentation | ||
*.html linguist-documentation | ||
*.tpp linguist-language=C++ | ||
*.cpp linguist-language=C++ | ||
*.hpp linguist-language=C++ | ||
*.h linguist-language=C | ||
*.c linguist-language=C | ||
*.cu linguist-language=Cuda | ||
*.cuh linguist-language=Cuda |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
# Defaults for CUDA | ||
*.i | ||
*.ii | ||
*.gpu | ||
*.ptx | ||
*.cubin | ||
*.fatbin | ||
|
||
# Ignore the build and lib dirs | ||
build/* | ||
lib/* | ||
!build/README.MD | ||
|
||
# Ignore any executables | ||
bin/* | ||
!bin/README.MD | ||
|
||
# Ignore Mac specific files | ||
.DS_Store | ||
|
||
# Ignore notebook at root | ||
.ipynb_checkpoints | ||
.notebook | ||
__pycache__ | ||
|
||
# Ignore matrix files | ||
*.mtx | ||
*.mtx.gz | ||
!res/test/* | ||
|
||
# Ignore evaluation outputs | ||
*.out | ||
evaluations/download/* | ||
evaluations/out/ | ||
!evaluations/out/*.MD | ||
!evaluations/download/*.MD | ||
evaluations/all/* | ||
!evaluations/slurms/* | ||
|
||
|
||
.vscode/c_cpp_properties.json | ||
.ght | ||
misc/cudasample_simpleOccupancy/simpleOccupancy | ||
misc/cudasample_simpleOccupancy/simpleOccupancy.o | ||
misc/cudasample_simpleOccupancy/NsightEclipse.xml |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
[style] | ||
based_on_style = google | ||
indent_width: 2 | ||
column_limit = 120 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
--tool=memcheck | ||
--track-origins=yes | ||
--suppressions=./diagnostic/valgrind.supp | ||
--leak-check=full | ||
--show-leak-kinds=all | ||
--log-file=./logs/valgrind.log |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
MIT License | ||
|
||
Copyright (c) 2021 Erhan Tezcan | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,86 @@ | ||
## Permutations | ||
|
||
Suppose you have the following linear system. | ||
|
||
``` | ||
a_1,1 * x_1 + a_1,2 * x_2 + a_1,3 * x_3 = y_1 | ||
a_2,1 * x_1 + a_2,2 * x_2 + a_2,3 * x_3 = y_2 | ||
a_3,1 * x_1 + a_3,2 * x_2 + a_3,3 * x_3 = y_3 | ||
``` | ||
|
||
This is shown as a matrix-vector multiplication of the form: | ||
|
||
``` | ||
| a_1,1 a_1,2 a_1,3 | | x_1 | | a_1,1 * x_1 + a_1,2 * x_2 + a_1,3 * x_3 | | y_1 | | ||
| a_2,1 a_2,2 a_2,3 | * | x_2 | = | a_2,1 * x_1 + a_2,2 * x_2 + a_2,3 * x_3 | = | y_2 | | ||
| a_3,1 a_3,2 a_3,3 | | x_3 | | a_3,1 * x_1 + a_3,2 * x_2 + a_3,3 * x_3 | | y_3 | | ||
``` | ||
|
||
Consider the following **row permutation** `[1 -> 2 , 2 -> 3 , 3 -> 1]`. You can't permute `x` here in a solver, because the columns stay in place. However, you can't update `x` from `y` directly either, so you have to restore the original order of `y` during the update phase. | ||
|
||
``` | ||
| a_3,1 a_3,2 a_3,3 | | x_1 | | a_3,1 * x_1 + a_3,2 * x_2 + a_3,3 * x_3 | | y_3 | | ||
| a_1,1 a_1,2 a_1,3 | * | x_2 | = | a_1,1 * x_1 + a_1,2 * x_2 + a_1,3 * x_3 | = | y_1 | | ||
| a_2,1 a_2,2 a_2,3 | | x_3 | | a_2,1 * x_1 + a_2,2 * x_2 + a_2,3 * x_3 | | y_2 | | ||
``` | ||
|
||
Now consider the same permutation but as a **symmetric permutation**. If you now permute `x`, you can update the solutions directly in Jacobi, although for SpMV it does not change the result. | ||
|
||
``` | ||
| a_3,3 a_3,1 a_3,2 | | x_3 | | a_3,3 * x_3 + a_3,1 * x_1 + a_3,2 * x_2 | | y_3 | | ||
| a_1,3 a_1,1 a_1,2 | * | x_1 | = | a_1,3 * x_3 + a_1,1 * x_1 + a_1,2 * x_2 | = | y_1 | | ||
| a_2,3 a_2,1 a_2,2 | | x_2 | | a_2,3 * x_3 + a_2,1 * x_1 + a_2,2 * x_2 | | y_2 | | ||
``` | ||
|
||
To summarize: | ||
|
||
- Normally: | ||
- `y = Ax` | ||
- `x = (y - Ex) / d` | ||
- Row Permuted: | ||
- `y' = A'x` | ||
- `x = ((y' - E'x) / d')'` | ||
- Row + Column Permuted (Symmetric): | ||
- `y' = A'x'` | ||
- `x' = (y' - E'x') / d'` | ||
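The three cases in the summary can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the original notes: dense list-of-lists matrices instead of CSR, `y` read as the right-hand side of the Jacobi update, and a hypothetical list `src` where `src[k]` is the original index of the row that ends up at position `k` (so `[1 -> 2 , 2 -> 3 , 3 -> 1]` becomes `src = [2, 0, 1]` with 0-based indices). All function names are made up for this sketch.

```python
def jacobi_step(A, y, x):
    """Plain Jacobi sweep: x_i <- (y_i - sum_{j != i} a_ij * x_j) / a_ii."""
    n = len(A)
    return [
        (y[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def jacobi_step_row_permuted(A, y, x, src):
    """Rows of A and y are permuted, x is not.

    Row k of the permuted system is original row src[k], so its result
    belongs to unknown src[k] and has to be scattered back ("ordered back").
    """
    n = len(A)
    A_p = [A[s] for s in src]  # A': rows reordered, columns untouched
    y_p = [y[s] for s in src]  # y': reordered like the rows
    x_new = [0.0] * n
    for k in range(n):
        i = src[k]  # the original diagonal of this row sits in column i
        off_diag = sum(A_p[k][j] * x[j] for j in range(n) if j != i)
        x_new[i] = (y_p[k] - off_diag) / A_p[k][i]
    return x_new


def jacobi_step_symmetric(A, y, x, src):
    """Rows and columns of A, plus y and x, all use the same permutation.

    The diagonal stays on the diagonal, so the plain sweep works as-is and
    returns x' in permuted order.
    """
    A_p = [[A[si][sj] for sj in src] for si in src]
    y_p = [y[s] for s in src]
    x_p = [x[s] for s in src]
    return jacobi_step(A_p, y_p, x_p)


# Small diagonally dominant example (values are arbitrary).
A = [[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]]
y = [1.0, 2.0, 3.0]
x = [0.5, 0.5, 0.5]
src = [2, 0, 1]

plain = jacobi_step(A, y, x)
row_only = jacobi_step_row_permuted(A, y, x, src)
symmetric = jacobi_step_symmetric(A, y, x, src)
```

In this sketch `row_only` matches `plain` only because of the explicit scatter through `src`, while `symmetric[k]` is the updated value of unknown `src[k]` and needs no reordering between sweeps.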
|
||
### Swapping Vectors for Cardiac | ||
|
||
In the presence of `x32`, `x64` and `y64`, we have 4 options for swapping: | ||
|
||
1. **Naive**: Call a `copying` kernel which reads `y64` and writes it to both `x32` and `x64`. | ||
2. **X64 Cast**: Remove `x32` from the kernel altogether, and cast the value of `x64` to `float` at runtime. | ||
3. **X32 Copy**: At the end of the `SpMV` kernel, as you write the result to `y64`, write it to `x32` too. If you swap the pointers now as you normally do, `x32` will also have the swapped values. | ||
4. A hybrid of **1** and **3**: we swap `x64` and `y64` by pointers, but call a copy kernel on `x32` just before that. This turned out to be faster than **3**, and it is what we use. | ||
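As a rough host-side analogy for option 4, the sketch below uses Python's standard `array` module, with `array('f')` standing in for the single-precision buffer and `array('d')` for the double-precision ones. It is not the CUDA implementation: the function name `hybrid_swap` is hypothetical, the loop stands in for the copy kernel, and returning the references in a different order stands in for the pointer swap.

```python
from array import array


def hybrid_swap(x32, x64, y64):
    """Option 4 sketch: copy y64 into x32, then swap x64 and y64 by reference.

    Returns the new (x32, x64, y64) triple. Only x32 is written element by
    element; the two double-precision buffers just trade places.
    """
    for i, value in enumerate(y64):
        x32[i] = value  # array('f') narrows the value to single precision
    return x32, y64, x64


x32 = array("f", [0.0, 0.0, 0.0])
x64 = array("d", [1.0, 2.0, 3.0])
y64 = array("d", [0.5, 0.25, 0.1])

old_x64, old_y64 = x64, y64
x32, x64, y64 = hybrid_swap(x32, x64, y64)
```

After the call, `x64` refers to the buffer that held the latest result, `y64` refers to the old input buffer (free to be overwritten by the next iteration), and `x32` holds a single-precision copy of the same result.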
|
||
### V100 Specs | ||
|
||
- Compute Capability: 7.0 | ||
- Warp Size: 32 Threads | ||
- Max Warps / SM: 64 | ||
- Max Thread Blocks / SM: 32 | ||
- Max Thread Block Size: 1024 | ||
|
||
- SMs: 80 | ||
- TPCs: 40 | ||
- FP32 Cores / SM: 64 | ||
- FP64 Cores / SM: 32 | ||
- Tensor Cores / SM: 8 | ||
|
||
- Peak FP32 TFLOPS: 15.7 | ||
- Peak FP64 TFLOPS: 7.8 | ||
- Peak Tensor TFLOPS: 125 | ||
|
||
- L1 Cache Line Size: 128 B | ||
- L2 Cache Line Size: 32 B | ||
- L2 Cache Size: 6144 KB | ||
- Shared Memory Size / SM: Configurable up to 96 KB | ||
|
||
GPU cache lines are 128 bytes and are aligned. Try to make all memory accesses by warps touch the minimum number of cache lines. | ||
See [here](https://forums.developer.nvidia.com/t/cache-line-size-of-l1-and-l2/24907) for more. Also check Ch. 5.2 of CUDA Handbook. | ||
|
||
Also see NVIDIA [docs](https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/memorystatisticscaches.htm) for caches: For memory cached in both L1 and L2, if every thread in a warp loads a 4-byte value from sparse locations which miss in L1 cache, each thread will incur one 128-byte L1 transaction and four 32-byte L2 transactions. This will cause the load instruction to reissue 32 times more than if the values would be adjacent and cache-aligned. If bandwidth between caches becomes a bottleneck, rearranging data or algorithms to access the data more uniformly can alleviate the problem. | ||
|
||
Another [link](https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/memorystatisticsglobal.htm): A L1 cache line is 128 bytes and maps to a 128 byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 (cached loads using the generic data path) are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only (uncached loads using the generic data path) are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses. | ||
|
||
Note that `128 bytes = 32 floats = 16 doubles`. If a warp accesses fewer elements than that (e.g. for a short row in CSR Vector SpMV), part of each fetched cache line goes unused and we might have worse performance. |
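The arithmetic behind that note can be checked with a small Python sketch. It assumes the array base is 128-byte aligned and that a warp reads one contiguous run of elements (for example the values of a single CSR row); the helper names `lines_touched` and `utilization` are hypothetical and only serve this illustration.

```python
CACHE_LINE_BYTES = 128  # L1 cache line size from the specs above


def lines_touched(start_index, count, elem_bytes):
    """Number of aligned 128-byte lines covered by `count` contiguous elements
    that start at element `start_index` of a 128-byte-aligned array."""
    if count == 0:
        return 0
    first_line = (start_index * elem_bytes) // CACHE_LINE_BYTES
    last_line = ((start_index + count) * elem_bytes - 1) // CACHE_LINE_BYTES
    return last_line - first_line + 1


def utilization(start_index, count, elem_bytes):
    """Fraction of the fetched bytes that the warp actually uses."""
    fetched = lines_touched(start_index, count, elem_bytes) * CACHE_LINE_BYTES
    return (count * elem_bytes) / fetched


full_float_warp = lines_touched(0, 32, 4)   # 32 floats, aligned
full_double_row = lines_touched(0, 16, 8)   # 16 doubles, aligned
misaligned = lines_touched(1, 32, 4)        # same 32 floats, shifted by one
short_row_use = utilization(0, 4, 8)        # a CSR row with only 4 doubles
```

Under these assumptions, a row of 4 doubles uses a quarter of the single line it fetches, and shifting a full 32-float access by one element makes it straddle two lines.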
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
#!/usr/bin/make -f | ||
|
||
# Target and directories | ||
SRCDIR := src | ||
BUILDDIR := build | ||
TGTDIR := bin | ||
TARGET := spmv | ||
|
||
# Compilers and flags | ||
CC := gcc | ||
CPPC := g++ | ||
NVCC := nvcc | ||
CUDA_KERNEL_CHECK_FLAG ?= -DCUDA_CHECK_KERNELS=1 | ||
MYGPU_ARCH ?= sm_70 | ||
|
||
# Prepare files | ||
SOURCES_C := $(shell find $(SRCDIR) -type f -name '*.c') | ||
OBJECTS_C := $(patsubst $(SRCDIR)/%,$(BUILDDIR)/%,$(SOURCES_C:.c=.o)) | ||
SOURCES_CPP := $(shell find $(SRCDIR) -type f -name '*.cpp') | ||
OBJECTS_CPP := $(patsubst $(SRCDIR)/%,$(BUILDDIR)/%,$(SOURCES_CPP:.cpp=.o)) | ||
SOURCES_CU := $(shell find $(SRCDIR) -type f -name '*.cu') | ||
OBJECTS_CU := $(patsubst $(SRCDIR)/%,$(BUILDDIR)/%,$(SOURCES_CU:.cu=.o)) | ||
SOURCES := $(SOURCES_C) $(SOURCES_CPP) $(SOURCES_CU) | ||
OBJECTS := $(OBJECTS_C) $(OBJECTS_CPP) $(OBJECTS_CU) | ||
|
||
# Flags | ||
CCFLAGS := -O3 -fopenmp -std=c99 -Wno-unused-result $(JACOBI_ITERS_FLAG) $(CUDA_KERNEL_CHECK_FLAG) # -g | ||
CPPCFLAGS := -O3 -fopenmp --std=c++11 -Wno-unused-result -fno-exceptions -Wall -Wextra $(JACOBI_ITERS_FLAG) $(CUDA_KERNEL_CHECK_FLAG) # -g | ||
NVCCFLAGS := -ccbin g++ -O3 -Xcompiler -fopenmp -Xcompiler -Wno-unused-result -Xcompiler -fno-exceptions -Xcompiler -Wall $(JACOBI_ITERS_FLAG) $(CUDA_KERNEL_CHECK_FLAG) # -Xcompiler -g -g -G | ||
ARCHFLAGS := -arch=$(MYGPU_ARCH) -Wno-deprecated-gpu-targets | ||
LDFLAGS := -lrt -lm -lcudart -fopenmp -lhsl_mc64 -lgfortran | ||
# NOTE: Be careful with the order of libraries above | ||
# NOTE: -g option for valgrind to track lines | ||
|
||
# Directories | ||
LIB := -Llib -L$(CUDA_PATH)/lib64 -L/usr/local/lib -L$(HOME)/lib | ||
INC := -Iinclude -I$(CUDA_PATH)/include -Itemplates | ||
|
||
# First rule | ||
all: $(TGTDIR)/$(TARGET) | $(TGTDIR) | ||
|
||
# Linking | ||
$(TGTDIR)/$(TARGET): $(OBJECTS) | $(BUILDDIR) | ||
$(CC) $^ -o $(TGTDIR)/$(TARGET) $(INC) $(LDFLAGS) $(LIB) | ||
|
||
# C compilations | ||
$(BUILDDIR)/%.o: $(SRCDIR)/%.c include/*.h | ||
$(CC) $(CCFLAGS) $(INC) -c -o $@ $< | ||
|
||
# CPP compilations | ||
$(BUILDDIR)/%.o: $(SRCDIR)/%.cpp include/*.hpp templates/*.tpp | ||
$(CPPC) $(CPPCFLAGS) $(INC) -c -o $@ $< | ||
|
||
# CUDA compilations | ||
$(BUILDDIR)/%.o: $(SRCDIR)/%.cu include/*.cuh templates/*.tpp | ||
$(NVCC) $(NVCCFLAGS) $(ARCHFLAGS) $(INC) -c -o $@ $< | ||
|
||
# Objects directory | ||
$(BUILDDIR): | ||
@mkdir -p $(BUILDDIR) | ||
|
||
# Target directory | ||
$(TGTDIR): | ||
@mkdir -p $(TGTDIR) | ||
|
||
# Cleaning | ||
clean: | ||
$(RM) -r $(BUILDDIR)/*.o | ||
|
||
# Diagnostic | ||
show: | ||
@echo "Sources: $(SOURCES)" | ||
@echo "Objects: $(OBJECTS)" | ||
@echo "CUDA HOME: $(CUDA_PATH)" | ||
@echo "Target arch: $(MYGPU_ARCH)" | ||
|
||
# Code distribution overall | ||
cloc: | ||
cloc . | ||
|
||
# Clean and make again | ||
again: | ||
@$(MAKE) clean && $(MAKE) | ||
|
||
.PHONY: all clean show cloc again |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
# Exploring CSR-based Mixed and Multi-Precision SpMV for GPUs | ||
|
||
Submitted to EuroPar'22 as "Erhan Tezcan, Tuğba Torun, Fahrican Koşar, Kamer Kaya and Didem Unat, _Mixed and Multi-Precision SpMV for GPUs with Row-wise Precision Selection_" | ||
|
||
## Building | ||
|
||
Use `make` to build the CUDA binary at `bin/spmv`. The compiler uses `-arch=sm_70` for NVIDIA V100, but you can change that to suit your own GPU with a `MYGPU_ARCH` environment variable, e.g. `export MYGPU_ARCH=sm_50`. We have used `cuda/11.2`, `python/3.7.4` and `gcc/9.3.0` to compile our program and run Python scripts. You also need to install and compile the [`HSL_MC64`](https://www.hsl.rl.ac.uk/catalogue/mc64.html) static library with `gfortran`. | ||
|
||
## File Structure | ||
|
||
The file structure of this project is as follows: | ||
|
||
- `batch` has shell scripts for cluster commands, such as queueing a job. | ||
- `bin` for binary executables. | ||
- `build` for build files. | ||
- `diagnostic` has several scripts to check the program via Valgrind, `cuda-memcheck` etc. | ||
- `evaluations` is where we store the execution output. This is later read by Python scripts to make plots. | ||
- `img` stores the output from Python files, such as plot images. | ||
- `include` has header files. | ||
- `logs` has log outputs, generally from the diagnostic tools. | ||
- `res` has resources, such as MatrixMarket files. | ||
- `scripts` has a variety of Python scripts, mostly for plotting and automated running of the code. | ||
- `src` has the source files. | ||
- `templates` has the source files for template functions. | ||
|
||
## Running | ||
|
||
The `Makefile` will create a binary called `spmv` under the `bin` folder within the same directory, with object files under `build`. Run the executable with the `-h` or `--help` option to see usage. | ||
|
||
## Batches | ||
|
||
For both `kuacc` and `simula` under `batches` we have the following: | ||
|
||
- `final_experiment.sh` runs the final experiments, as used for the paper. | ||
- `spmv_all.sh` runs SpMV test on all matrices (from `allpruned` index). | ||
- `_srun_gpu.sh` asks for an interactive shell with one Tesla V100. | ||
- `_check_queue.sh` checks the queue for my jobs. | ||
- `_load_modules.sh` loads necessary modules. _does not work sometimes_ | ||
|
||
## Matrix Resources | ||
|
||
Matrices are stored under `res` folder, with the following scripts: | ||
|
||
- `download.sh <MatrixMarketURL>` downloads the matrix from the given URL. See [SuiteSparse](https://sparse.tamu.edu/). | ||
- `download-from-md.sh <path>` downloads the matrices that appear in the provided Markdown file. | ||
- `generate.sh` under `architect` generates a specific set of matrices using the `architect.py` script. | ||
- `parsehtml.sh <path-to-html> <output-name>` parses an HTML from <http://yifanhu.net/GALLERY/GRAPHS/search.html> to create an index file. | ||
|
||
## Diagnostics | ||
|
||
The scripts below are under the `diagnostic` folder: | ||
|
||
- `eval_architect.sh` uses `evaluator.py` on matrices under `res/architect`. | ||
- `eval_res.sh` uses `evaluator.py` on matrices under `res`. | ||
- `cudamemcheck.sh` runs `cuda-memcheck` with a matrix under `res/architect`. | ||
- `valgrind.sh <matrix>` runs `valgrind` for the provided matrix. | ||
- `nvprof.sh <matrix>` profiles SpMV kernels for the provided matrix. | ||
- `run_random.sh` selects a random matrix under `res` and runs it. | ||
|
||
## Scripts | ||
|
||
Stored under `scripts` folder: | ||
|
||
- `architect.py` creates random MatrixMarket matrices. | ||
- `evaluator.py` runs the binary and parses its outputs to create plots. Saves the resulting dictionary to a file. | ||
- `exporter.py` reads a dictionary output by `evaluator.py` and exports `csv` files. | ||
- `interpreter.py` reads a dictionary output by `evaluator.py` and plots stuff. | ||
- `interpret.ipynb` a notebook to plot the results from another evaluation output. | ||
- `analyser.py` analyses a specific matrix with Python. | ||
- `plots.py` helper functions for plotting. | ||
- `utility.py` utility functions. | ||
- `prints.py` helper functions for printing. | ||
|
||
`plottype` folder has generic plotting functions such as bar, heatmap, density etc. and `plotspecial` folder has specific plots. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
#!/bin/bash | ||
squeue -u etezcan19 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
#!/bin/bash | ||
module load cuda/11.2 gcc/9.3.0 python/3.7.4 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
#!/bin/bash | ||
srun -A users --partition=ai --qos=ai --account=ai -n1 --gres=gpu:tesla_v100:1 --pty $SHELL | ||
|
||
# -w ai12 (for a specific node) | ||
# srun -N 1 -n1 -p short --qos=users --gres=gpu:1 -w ai12 --pty $SHELL |