migrate
erhant committed Apr 13, 2022
0 parents commit afe6ffa
Showing 227 changed files with 85,211 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .clang-format
@@ -0,0 +1,4 @@
---
Language: Cpp
BasedOnStyle: google
ColumnLimit: 120
9 changes: 9 additions & 0 deletions .gitattributes
@@ -0,0 +1,9 @@
*.ipynb linguist-documentation
*.html linguist-documentation
*.tpp linguist-language=C++
*.cpp linguist-language=C++
*.hpp linguist-language=C++
*.h linguist-language=C
*.c linguist-language=C
*.cu linguist-language=Cuda
*.cuh linguist-language=Cuda
45 changes: 45 additions & 0 deletions .gitignore
@@ -0,0 +1,45 @@
# Defaults for CUDA
*.i
*.ii
*.gpu
*.ptx
*.cubin
*.fatbin

# Ignore the build and lib dirs
build/*
lib/*
!build/README.MD

# Ignore any executables
bin/*
!bin/README.MD

# Ignore Mac specific files
.DS_Store

# Ignore notebook at root
.ipynb_checkpoints
.notebook
__pycache__

# Ignore matrix files
*.mtx
*.mtx.gz
!res/test/*

# Ignore evaluation outputs
*.out
evaluations/download/*
evaluations/out/
!evaluations/out/*.MD
!evaluations/download/*.MD
evaluations/all/*
!evaluations/slurms/*


.vscode/c_cpp_properties.json
.ght
misc/cudasample_simpleOccupancy/simpleOccupancy
misc/cudasample_simpleOccupancy/simpleOccupancy.o
misc/cudasample_simpleOccupancy/NsightEclipse.xml
4 changes: 4 additions & 0 deletions .style.yapf
@@ -0,0 +1,4 @@
[style]
based_on_style = google
indent_width = 2
column_limit = 120
6 changes: 6 additions & 0 deletions .valgrindrc
@@ -0,0 +1,6 @@
--tool=memcheck
--track-origins=yes
--suppressions=./diagnostic/valgrind.supp
--leak-check=full
--show-leak-kinds=all
--log-file=./logs/valgrind.log
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2021 Erhan Tezcan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
86 changes: 86 additions & 0 deletions MISC.MD
@@ -0,0 +1,86 @@
## Permutations

Suppose you have the following linear system.

```
a_1,1 * x_1 + a_1,2 * x_2 + a_1,3 * x_3 = y_1
a_2,1 * x_1 + a_2,2 * x_2 + a_2,3 * x_3 = y_2
a_3,1 * x_1 + a_3,2 * x_2 + a_3,3 * x_3 = y_3
```

This is shown as a matrix-vector multiplication of the form:

```
| a_1,1 a_1,2 a_1,3 | | x_1 | | a_1,1 * x_1 + a_1,2 * x_2 + a_1,3 * x_3 | | y_1 |
| a_2,1 a_2,2 a_2,3 | * | x_2 | = | a_2,1 * x_1 + a_2,2 * x_2 + a_2,3 * x_3 | = | y_2 |
| a_3,1 a_3,2 a_3,3 | | x_3 | | a_3,1 * x_1 + a_3,2 * x_2 + a_3,3 * x_3 | | y_3 |
```

Consider the following **row permutation** `[1 -> 2, 2 -> 3, 3 -> 1]`. In a solver you cannot permute `x` here, because the column order is unchanged. However, you also cannot update `x` from `y` directly, so you have to order `y` back during the update phase.

```
| a_3,1 a_3,2 a_3,3 | | x_1 | | a_3,1 * x_1 + a_3,2 * x_2 + a_3,3 * x_3 | | y_3 |
| a_1,1 a_1,2 a_1,3 | * | x_2 | = | a_1,1 * x_1 + a_1,2 * x_2 + a_1,3 * x_3 | = | y_1 |
| a_2,1 a_2,2 a_2,3 | | x_3 | | a_2,1 * x_1 + a_2,2 * x_2 + a_2,3 * x_3 | | y_2 |
```

Now consider the same permutation but as a **symmetric permutation**. If you now permute `x` as well, you can update the solutions directly in Jacobi; for a standalone SpMV the result is unchanged, just in permuted order.

```
| a_3,3 a_3,1 a_3,2 | | x_3 | | a_3,3 * x_3 + a_3,1 * x_1 + a_3,2 * x_2 | | y_3 |
| a_1,3 a_1,1 a_1,2 | * | x_1 | = | a_1,3 * x_3 + a_1,1 * x_1 + a_1,2 * x_2 | = | y_1 |
| a_2,3 a_2,1 a_2,2 | | x_2 | | a_2,3 * x_3 + a_2,1 * x_1 + a_2,2 * x_2 | | y_2 |
```

To summarize:

- Normally:
- `y = Ax`
- `x = (y - Ex) / d`
- Row Permuted:
- `y' = A'x`
- `x = ((y' - E'x) / d')'`
- Row + Column Permuted (Symmetric):
- `y' = A'x'`
- `x' = (y' - E'x') / d'`
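
The summary above can be checked numerically. Below is a minimal NumPy sketch (with a made-up 3×3 matrix) showing that a row permutation leaves `y` in permuted order while `x` stays fixed, whereas a symmetric permutation lets `x` be permuted consistently:

```python
import numpy as np

# Toy 3x3 system and the permutation [1 -> 2, 2 -> 3, 3 -> 1]
# (0-indexed: old row i moves to position p[i], so p = [1, 2, 0]).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])
x = np.array([1.0, 2.0, 3.0])
p = np.array([1, 2, 0])
q = np.argsort(p)                 # inverse permutation

y = A @ x                         # unpermuted reference: y = Ax

# Row permutation: only the rows move, x stays in place.
A_row = A[q, :]                   # A_row[p[i]] == A[i]
y_row = A_row @ x
assert np.allclose(y_row, y[q])   # y comes out permuted; order it back to update x

# Symmetric permutation: rows AND columns move, so x can be permuted too.
A_sym = A[q][:, q]                # A_sym[p[i], p[j]] == A[i, j]
x_sym = x[q]
y_sym = A_sym @ x_sym
assert np.allclose(y_sym, y[q])   # same permuted values; x' updates directly
```

This mirrors the summary: in the row-permuted case `y'` must be reordered before updating `x`, while in the symmetric case `x'` and `y'` live in the same permuted ordering.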

### Swapping Vectors for Cardiac

In the presence of `x32`, `x64` and `y64`, we have the following options for swapping:

1. **Naive**: Call a `copying` kernel which reads `y64` and writes it to both `x32` and `x64`.
2. **X64 Cast**: Remove `x32` from the kernel altogether, and cast the value of `x64` to `float` at runtime.
3. **X32 Copy**: At the end of the `SpMV` kernel, as you write the result to `y64`, write it to `x32` too. If you then swap the pointers as you normally do, `x32` will also have the swapped values.
4. **Hybrid**: A hybrid of **1** and **3**: swap `x64` and `y64` by pointers, but call a copy kernel on `x32` just before that. This turned out to be faster than **3**, and is what we use.
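
A host-side sketch of the hybrid option, under stated assumptions: NumPy arrays stand in for device buffers, a matrix multiply stands in for the SpMV kernel, and Python name rebinding stands in for the device pointer swap. All names here are illustrative, not from the codebase.

```python
import numpy as np

n = 8
A = np.random.rand(n, n)          # stands in for the sparse matrix
x64 = np.random.rand(n)           # double-precision iterate
y64 = np.empty(n)                 # double-precision SpMV output
x32 = x64.astype(np.float32)      # single-precision copy used by FP32 rows

def spmv_step(A, x64, y64):
    np.matmul(A, x64, out=y64)    # stands in for the SpMV kernel

spmv_step(A, x64, y64)

# Hybrid swap: first run the copy kernel on x32 only...
x32[:] = y64.astype(np.float32)
# ...then swap x64 and y64 by "pointer" (name rebinding), no data movement.
x64, y64 = y64, x64

assert np.allclose(x64, A @ y64)               # x64 now holds the new iterate
assert np.allclose(x32, x64.astype(np.float32))  # x32 stayed consistent
```

Only `x32` is physically copied each iteration; the two double-precision buffers just exchange roles.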

### V100 Specs

- Compute Capability: 7.0
- Warp Size: 32 Threads
- Max Warps / SM: 64
- Max Thread Blocks / SM: 32
- Max Thread Block Size: 1024

- SMs: 80
- TPCs: 40
- FP32 Cores / SM: 64
- FP64 Cores / SM: 32
- Tensor Cores / SM: 8

- Peak FP32 TFLOPS: 15.7
- Peak FP64 TFLOPS: 7.8
- Peak Tensor TFLOPS: 125

- L1 Cache Line Size: 128 B
- L2 Cache Line Size: 32 B
- L2 Cache Size: 6144 KB
- Shared Memory Size / SM: Configurable up to 96 KB

GPU cache lines are 128 bytes and are aligned. Try to make all memory accesses by warps touch the minimum number of cache lines.
See [here](https://forums.developer.nvidia.com/t/cache-line-size-of-l1-and-l2/24907) for more. Also check Ch. 5.2 of CUDA Handbook.

Also see NVIDIA [docs](https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/memorystatisticscaches.htm) for caches: For memory cached in both L1 and L2, if every thread in a warp loads a 4-byte value from sparse locations which miss in L1 cache, each thread will incur one 128-byte L1 transaction and four 32-byte L2 transactions. This will cause the load instruction to reissue 32 times more than if the values would be adjacent and cache-aligned. If bandwidth between caches becomes a bottleneck, rearranging data or algorithms to access the data more uniformly can alleviate the problem.

Another [link](https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/memorystatisticsglobal.htm): A L1 cache line is 128 bytes and maps to a 128 byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 (cached loads using the generic data path) are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only (uncached loads using the generic data path) are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.

Note that `128 bytes = 32 floats = 16 doubles`. If a warp accesses fewer elements than that (e.g., a short row in CSR-Vector SpMV), we may get worse performance.
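
To make the transaction arithmetic above concrete, here is a small illustrative helper (not from the codebase) that counts how many 32-byte L2 segments a warp of 4-byte loads touches:

```python
def l2_transactions(addresses, segment=32):
    """Count distinct 32-byte segments touched by a set of byte addresses."""
    return len({addr // segment for addr in addresses})

# Coalesced: 32 adjacent 4-byte floats starting at an aligned address.
coalesced = [4 * i for i in range(32)]   # bytes 0, 4, ..., 124
# Scattered: each lane hits a different 128-byte cache line.
scattered = [128 * i for i in range(32)]

print(l2_transactions(coalesced))   # 4  (one 128 B line = 4 x 32 B segments)
print(l2_transactions(scattered))   # 32 (one segment per lane)
```

This reproduces the 32x reissue factor quoted from the NVIDIA docs: the same 128 bytes of useful data cost 4 L2 transactions when adjacent and aligned, but 32 when scattered.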
85 changes: 85 additions & 0 deletions Makefile
@@ -0,0 +1,85 @@
#!/usr/bin/make -f

# Target and directories
SRCDIR := src
BUILDDIR := build
TGTDIR := bin
TARGET := spmv

# Compilers and flags
CC := gcc
CPPC := g++
NVCC := nvcc
CUDA_KERNEL_CHECK_FLAG ?= -DCUDA_CHECK_KERNELS=1
MYGPU_ARCH ?= sm_70

# Prepare files
SOURCES_C := $(shell find $(SRCDIR) -type f -name '*.c')
OBJECTS_C := $(patsubst $(SRCDIR)/%,$(BUILDDIR)/%,$(SOURCES_C:.c=.o))
SOURCES_CPP := $(shell find $(SRCDIR) -type f -name '*.cpp')
OBJECTS_CPP := $(patsubst $(SRCDIR)/%,$(BUILDDIR)/%,$(SOURCES_CPP:.cpp=.o))
SOURCES_CU := $(shell find $(SRCDIR) -type f -name '*.cu')
OBJECTS_CU := $(patsubst $(SRCDIR)/%,$(BUILDDIR)/%,$(SOURCES_CU:.cu=.o))
SOURCES := $(SOURCES_C) $(SOURCES_CPP) $(SOURCES_CU)
OBJECTS := $(OBJECTS_C) $(OBJECTS_CPP) $(OBJECTS_CU)

# Flags
CCFLAGS := -O3 -fopenmp -std=c99 -Wno-unused-result $(JACOBI_ITERS_FLAG) $(CUDA_KERNEL_CHECK_FLAG) # -g
CPPCFLAGS := -O3 -fopenmp --std=c++11 -Wno-unused-result -fno-exceptions -Wall -Wextra $(JACOBI_ITERS_FLAG) $(CUDA_KERNEL_CHECK_FLAG) # -g
NVCCFLAGS := -ccbin g++ -O3 -Xcompiler -fopenmp -Xcompiler -Wno-unused-result -Xcompiler -fno-exceptions -Xcompiler -Wall $(JACOBI_ITERS_FLAG) $(CUDA_KERNEL_CHECK_FLAG) # -Xcompiler -g -g -G
ARCHFLAGS := -arch=$(MYGPU_ARCH) -Wno-deprecated-gpu-targets
LDFLAGS := -lrt -lm -lcudart -fopenmp -lhsl_mc64 -lgfortran
# NOTE: Be careful with the order of libraries above
# NOTE: -g option for valgrind to track lines

# Directories
LIB := -Llib -L$(CUDA_PATH)/lib64 -L/usr/local/lib -L$(HOME)/lib
INC := -Iinclude -I$(CUDA_PATH)/include -Itemplates

# First rule
all: $(TGTDIR)/$(TARGET) | $(TGTDIR)

# Linking
$(TGTDIR)/$(TARGET): $(OBJECTS) | $(TGTDIR)
	$(CC) $^ -o $(TGTDIR)/$(TARGET) $(INC) $(LDFLAGS) $(LIB)

# C compilations
$(BUILDDIR)/%.o: $(SRCDIR)/%.c include/*.h | $(BUILDDIR)
	$(CC) $(CCFLAGS) $(INC) -c -o $@ $<

# CPP compilations
$(BUILDDIR)/%.o: $(SRCDIR)/%.cpp include/*.hpp templates/*.tpp | $(BUILDDIR)
	$(CPPC) $(CPPCFLAGS) $(INC) -c -o $@ $<

# CUDA compilations
$(BUILDDIR)/%.o: $(SRCDIR)/%.cu include/*.cuh templates/*.tpp | $(BUILDDIR)
	$(NVCC) $(NVCCFLAGS) $(ARCHFLAGS) $(INC) -c -o $@ $<

# Objects directory
$(BUILDDIR):
	@mkdir -p $(BUILDDIR)

# Target directory
$(TGTDIR):
	@mkdir -p $(TGTDIR)

# Cleaning
clean:
	$(RM) -r $(BUILDDIR)/*.o

# Diagnostic
show:
	@echo "Sources: $(SOURCES)"
	@echo "Objects: $(OBJECTS)"
	@echo "CUDA HOME: $(CUDA_PATH)"
	@echo "Target arch: $(MYGPU_ARCH)"

# Code distribution overall
cloc:
	cloc .

# Clean and make again
again:
	@make clean && make

.PHONY: all clean show cloc again
74 changes: 74 additions & 0 deletions README.MD
@@ -0,0 +1,74 @@
# Exploring CSR-based Mixed and Multi-Precision SpMV for GPUs

Submitted to EuroPar'22 as "Erhan Tezcan, Tuğba Torun, Fahrican Koşar, Kamer Kaya and Didem Unat, _Mixed and Multi-Precision SpMV for GPUs with Row-wise Precision Selection_"

## Building

Use `make` to build the CUDA binary at `bin/spmv`. The compiler uses `-arch=sm_70` for the NVIDIA V100, but you can change that to suit your own GPU via the `MYGPU_ARCH` environment variable, e.g. `export MYGPU_ARCH=sm_50`. We have used `cuda/11.2`, `python/3.7.4` and `gcc/9.3.0` to compile our program and run the Python scripts. You also need to install and compile the [`HSL_MC64`](https://www.hsl.rl.ac.uk/catalogue/mc64.html) static library with `gfortran`.

## File Structure

The file structure of this project is as follows:

- `batch` has shell scripts for cluster commands, such as queueing a job.
- `bin` for binary executables.
- `build` for build files.
- `diagnostic` has several scripts to check the program via Valgrind, `cuda-memcheck`, etc.
- `evaluations` is where we store the execution output. This is later read by Python scripts to make plots.
- `img` stores the output from Python files, such as plot images.
- `include` has header files.
- `logs` has log outputs, generally from the diagnostic tools.
- `res` has resources, such as MatrixMarket files.
- `scripts` has a variety of Python scripts, mostly for plotting and automated running of the code.
- `src` has the source files.
- `templates` has the source files for template functions.

## Running

The `Makefile` will create a binary called `spmv` under the `bin` folder within the same directory, with object files under `build`. Run the executable with the `-h` or `--help` option to see usage.

## Batches

For both `kuacc` and `simula` under `batch` we have the following:

- `final_experiment.sh` runs the final experiments, as used for the paper.
- `spmv_all.sh` runs SpMV test on all matrices (from `allpruned` index).
- `_srun_gpu.sh` asks for an interactive shell with one Tesla V100.
- `_check_queue.sh` checks the queue for my jobs.
- `_load_modules.sh` loads the necessary modules. _It does not always work._

## Matrix Resources

Matrices are stored under `res` folder, with the following scripts:

- `download.sh <MatrixMarketURL>` downloads the matrix from the given URL. See [SuiteSparse](https://sparse.tamu.edu/).
- `download-from-md.sh <path>` downloads the matrices that appear in the provided Markdown file.
- `generate.sh` under `architect` generates a specific set of matrices using the `architect.py` script.
- `parsehtml.sh <path-to-html> <output-name>` parses an HTML from <http://yifanhu.net/GALLERY/GRAPHS/search.html> to create an index file.

## Diagnostics

The scripts below are under the `diagnostic` folder:

- `eval_architect.sh` uses `evaluator.py` on matrices under `res/architect`.
- `eval_res.sh` uses `evaluator.py` on matrices under `res`.
- `cudamemcheck.sh` runs `cuda-memcheck` with a matrix under `res/architect`.
- `valgrind.sh <matrix>` runs `valgrind` for the provided matrix.
- `nvprof.sh <matrix>` profiles SpMV kernels for the provided matrix.
- `run_random.sh` selects a random matrix under `res` and runs it.

## Scripts

Stored under `scripts` folder:

- `architect.py` creates random MatrixMarket matrices.
- `evaluator.py` runs the binary and parses its outputs to create plots. Saves the resulting dictionary to a file.
- `exporter.py` reads a dictionary output by `evaluator.py` and exports `csv` files.
- `interpreter.py` reads a dictionary output by `evaluator.py` and plots the results.
- `interpret.ipynb` a notebook to plot the results from another evaluation output.
- `analyser.py` analyses a specific matrix with Python.
- `plots.py` helper functions for plotting.
- `utility.py` utility functions.
- `prints.py` helper functions for printing.

The `plottype` folder has generic plotting functions such as bar, heatmap, density, etc., and the `plotspecial` folder has specific plots.
2 changes: 2 additions & 0 deletions batch/kuacc/_check_queue.sh
@@ -0,0 +1,2 @@
#!/bin/bash
squeue -u etezcan19
2 changes: 2 additions & 0 deletions batch/kuacc/_load_modules.sh
@@ -0,0 +1,2 @@
#!/bin/bash
module load cuda/11.2 gcc/9.3.0 python/3.7.4
5 changes: 5 additions & 0 deletions batch/kuacc/_srun_gpu.sh
@@ -0,0 +1,5 @@
#!/bin/bash
srun -A users --partition=ai --qos=ai --account=ai -n1 --gres=gpu:tesla_v100:1 --pty $SHELL

# -w ai12 (for a specific node)
# srun -N 1 -n1 -p short --qos=users --gres=gpu:1 -w ai12 --pty $SHELL