migrate
erhant committed Apr 13, 2022
0 parents commit afe6ffa
Showing 227 changed files with 85,211 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .clang-format
@@ -0,0 +1,4 @@
---
Language: Cpp
BasedOnStyle: google
ColumnLimit: 120
9 changes: 9 additions & 0 deletions .gitattributes
@@ -0,0 +1,9 @@
*.ipynb linguist-documentation
*.html linguist-documentation
*.tpp linguist-language=C++
*.cpp linguist-language=C++
*.hpp linguist-language=C++
*.h linguist-language=C
*.c linguist-language=C
*.cu linguist-language=Cuda
*.cuh linguist-language=Cuda
45 changes: 45 additions & 0 deletions .gitignore
@@ -0,0 +1,45 @@
# Defaults for CUDA
*.i
*.ii
*.gpu
*.ptx
*.cubin
*.fatbin

# Ignore the build and lib dirs
build/*
lib/*
!build/README.MD

# Ignore any executables
bin/*
!bin/README.MD

# Ignore Mac specific files
.DS_Store

# Ignore notebook at root
.ipynb_checkpoints
.notebook
__pycache__

# Ignore matrix files
*.mtx
*.mtx.gz
!res/test/*

# Ignore evaluation outputs
*.out
evaluations/download/*
evaluations/out/
!evaluations/out/*.MD
!evaluations/download/*.MD
evaluations/all/*
!evaluations/slurms/*


.vscode/c_cpp_properties.json
.ght
misc/cudasample_simpleOccupancy/simpleOccupancy
misc/cudasample_simpleOccupancy/simpleOccupancy.o
misc/cudasample_simpleOccupancy/NsightEclipse.xml
4 changes: 4 additions & 0 deletions .style.yapf
@@ -0,0 +1,4 @@
[style]
based_on_style = google
indent_width = 2
column_limit = 120
6 changes: 6 additions & 0 deletions .valgrindrc
@@ -0,0 +1,6 @@
--tool=memcheck
--track-origins=yes
--suppressions=./diagnostic/valgrind.supp
--leak-check=full
--show-leak-kinds=all
--log-file=./logs/valgrind.log
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2021 Erhan Tezcan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
86 changes: 86 additions & 0 deletions MISC.MD
@@ -0,0 +1,86 @@
## Permutations

Suppose you have the following linear system.

```
a_1,1 * x_1 + a_1,2 * x_2 + a_1,3 * x_3 = y_1
a_2,1 * x_1 + a_2,2 * x_2 + a_2,3 * x_3 = y_2
a_3,1 * x_1 + a_3,2 * x_2 + a_3,3 * x_3 = y_3
```

This is shown as a matrix-vector multiplication of the form:

```
| a_1,1 a_1,2 a_1,3 | | x_1 | | a_1,1 * x_1 + a_1,2 * x_2 + a_1,3 * x_3 | | y_1 |
| a_2,1 a_2,2 a_2,3 | * | x_2 | = | a_2,1 * x_1 + a_2,2 * x_2 + a_2,3 * x_3 | = | y_2 |
| a_3,1 a_3,2 a_3,3 | | x_3 | | a_3,1 * x_1 + a_3,2 * x_2 + a_3,3 * x_3 | | y_3 |
```

Consider the following **row permutation** `[1 -> 2, 2 -> 3, 3 -> 1]`. In a solver you cannot permute `x` here, because the column order is unchanged. However, you also cannot update `x` from `y` directly, so you have to order `y` back during the update phase.

```
| a_3,1 a_3,2 a_3,3 | | x_1 | | a_3,1 * x_1 + a_3,2 * x_2 + a_3,3 * x_3 | | y_3 |
| a_1,1 a_1,2 a_1,3 | * | x_2 | = | a_1,1 * x_1 + a_1,2 * x_2 + a_1,3 * x_3 | = | y_1 |
| a_2,1 a_2,2 a_2,3 | | x_3 | | a_2,1 * x_1 + a_2,2 * x_2 + a_2,3 * x_3 | | y_2 |
```

Now consider the same permutation but as a **symmetric permutation**. If you now permute `x` as well, you can update the solutions directly in Jacobi; for a standalone SpMV the result is unchanged, just in permuted order.

```
| a_3,3 a_3,1 a_3,2 | | x_3 | | a_3,3 * x_3 + a_3,1 * x_1 + a_3,2 * x_2 | | y_3 |
| a_1,3 a_1,1 a_1,2 | * | x_1 | = | a_1,3 * x_3 + a_1,1 * x_1 + a_1,2 * x_2 | = | y_1 |
| a_2,3 a_2,1 a_2,2 | | x_2 | | a_2,3 * x_3 + a_2,1 * x_1 + a_2,2 * x_2 | | y_2 |
```

To summarize:

- Normally:
- `y = Ax`
- `x = (y - Ex) / d`
- Row Permuted:
- `y' = A'x`
- `x = ((y' - E'x) / d')'`
- Row + Column Permuted (Symmetric):
- `y' = A'x'`
- `x' = (y' - E'x') / d'`
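
The summary above can be checked numerically. Below is a minimal NumPy sketch (with a made-up 3×3 matrix) showing that a row permutation leaves `y` in permuted order while `x` stays fixed, whereas a symmetric permutation lets `x` be permuted consistently:

```python
import numpy as np

# Toy 3x3 system and the permutation [1 -> 2, 2 -> 3, 3 -> 1]
# (0-indexed: old row i moves to position p[i], so p = [1, 2, 0]).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])
x = np.array([1.0, 2.0, 3.0])
p = np.array([1, 2, 0])
q = np.argsort(p)                 # inverse permutation

y = A @ x                         # unpermuted reference: y = Ax

# Row permutation: only the rows move, x stays in place.
A_row = A[q, :]                   # A_row[p[i]] == A[i]
y_row = A_row @ x
assert np.allclose(y_row, y[q])   # y comes out permuted; order it back to update x

# Symmetric permutation: rows AND columns move, so x can be permuted too.
A_sym = A[q][:, q]                # A_sym[p[i], p[j]] == A[i, j]
x_sym = x[q]
y_sym = A_sym @ x_sym
assert np.allclose(y_sym, y[q])   # same permuted values; x' updates directly
```

This mirrors the summary: in the row-permuted case `y'` must be reordered before updating `x`, while in the symmetric case `x'` and `y'` live in the same permuted ordering.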

### Swapping Vectors for Cardiac

In the presence of `x32`, `x64` and `y64`, we have the following options for swapping:

1. **Naive**: Call a `copying` kernel which reads `y64` and writes it to both `x32` and `x64`.
2. **X64 Cast**: Remove `x32` from the kernel altogether, and cast the value of `x64` to `float` at runtime.
3. **X32 Copy**: At the end of the `SpMV` kernel, as you write the result to `y64`, write it to `x32` too. If you then swap the pointers as you normally do, `x32` will also have the swapped values.
4. **Hybrid**: A hybrid of **1** and **3**: swap `x64` and `y64` by pointers, but call a copy kernel on `x32` just before that. This turned out to be faster than **3**, and is what we use.
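
A host-side sketch of the hybrid option, under stated assumptions: NumPy arrays stand in for device buffers, a matrix multiply stands in for the SpMV kernel, and Python name rebinding stands in for the device pointer swap. All names here are illustrative, not from the codebase.

```python
import numpy as np

n = 8
A = np.random.rand(n, n)          # stands in for the sparse matrix
x64 = np.random.rand(n)           # double-precision iterate
y64 = np.empty(n)                 # double-precision SpMV output
x32 = x64.astype(np.float32)      # single-precision copy used by FP32 rows

def spmv_step(A, x64, y64):
    np.matmul(A, x64, out=y64)    # stands in for the SpMV kernel

spmv_step(A, x64, y64)

# Hybrid swap: first run the copy kernel on x32 only...
x32[:] = y64.astype(np.float32)
# ...then swap x64 and y64 by "pointer" (name rebinding), no data movement.
x64, y64 = y64, x64

assert np.allclose(x64, A @ y64)               # x64 now holds the new iterate
assert np.allclose(x32, x64.astype(np.float32))  # x32 stayed consistent
```

Only `x32` is physically copied each iteration; the two double-precision buffers just exchange roles.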

### V100 Specs

- Compute Capability: 7.0
- Warp Size: 32 Threads
- Max Warps / SM: 64
- Max Thread Blocks / SM: 32
- Max Thread Block Size: 1024

- SMs: 80
- TPCs: 40
- FP32 Cores / SM: 64
- FP64 Cores / SM: 32
- Tensor Cores / SM: 8

- Peak FP32 TFLOPS: 15.7
- Peak FP64 TFLOPS: 7.8
- Peak Tensor TFLOPS: 125

- L1 Cache Line Size: 128 B
- L2 Cache Line Size: 32 B
- L2 Cache Size: 6144 KB
- Shared Memory Size / SM: Configurable up to 96 KB

GPU cache lines are 128 bytes and are aligned. Try to make all memory accesses by warps touch the minimum number of cache lines.
See [here](https://forums.developer.nvidia.com/t/cache-line-size-of-l1-and-l2/24907) for more. Also check Ch. 5.2 of CUDA Handbook.

Also see NVIDIA [docs](https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/memorystatisticscaches.htm) for caches: For memory cached in both L1 and L2, if every thread in a warp loads a 4-byte value from sparse locations which miss in L1 cache, each thread will incur one 128-byte L1 transaction and four 32-byte L2 transactions. This will cause the load instruction to reissue 32 times more than if the values would be adjacent and cache-aligned. If bandwidth between caches becomes a bottleneck, rearranging data or algorithms to access the data more uniformly can alleviate the problem.

Another [link](https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/memorystatisticsglobal.htm): A L1 cache line is 128 bytes and maps to a 128 byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 (cached loads using the generic data path) are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only (uncached loads using the generic data path) are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.

Note that `128 bytes = 32 floats = 16 doubles`. If a warp accesses fewer elements than that (e.g., a short row in CSR-Vector SpMV), we may get worse performance.
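
To make the transaction arithmetic above concrete, here is a small illustrative helper (not from the codebase) that counts how many 32-byte L2 segments a warp of 4-byte loads touches:

```python
def l2_transactions(addresses, segment=32):
    """Count distinct 32-byte segments touched by a set of byte addresses."""
    return len({addr // segment for addr in addresses})

# Coalesced: 32 adjacent 4-byte floats starting at an aligned address.
coalesced = [4 * i for i in range(32)]   # bytes 0, 4, ..., 124
# Scattered: each lane hits a different 128-byte cache line.
scattered = [128 * i for i in range(32)]

print(l2_transactions(coalesced))   # 4  (one 128 B line = 4 x 32 B segments)
print(l2_transactions(scattered))   # 32 (one segment per lane)
```

This reproduces the 32x reissue factor quoted from the NVIDIA docs: the same 128 bytes of useful data cost 4 L2 transactions when adjacent and aligned, but 32 when scattered.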
85 changes: 85 additions & 0 deletions Makefile
@@ -0,0 +1,85 @@
#!/usr/bin/make -f

# Target and directories
SRCDIR := src
BUILDDIR := build
TGTDIR := bin
TARGET := spmv

# Compilers and flags
CC := gcc
CPPC := g++
NVCC := nvcc
CUDA_KERNEL_CHECK_FLAG ?= -DCUDA_CHECK_KERNELS=1
MYGPU_ARCH ?= sm_70

# Prepare files
SOURCES_C := $(shell find $(SRCDIR) -type f -name '*.c')
OBJECTS_C := $(patsubst $(SRCDIR)/%,$(BUILDDIR)/%,$(SOURCES_C:.c=.o))
SOURCES_CPP := $(shell find $(SRCDIR) -type f -name '*.cpp')
OBJECTS_CPP := $(patsubst $(SRCDIR)/%,$(BUILDDIR)/%,$(SOURCES_CPP:.cpp=.o))
SOURCES_CU := $(shell find $(SRCDIR) -type f -name '*.cu')
OBJECTS_CU := $(patsubst $(SRCDIR)/%,$(BUILDDIR)/%,$(SOURCES_CU:.cu=.o))
SOURCES := $(SOURCES_C) $(SOURCES_CPP) $(SOURCES_CU)
OBJECTS := $(OBJECTS_C) $(OBJECTS_CPP) $(OBJECTS_CU)

# Flags
CCFLAGS := -O3 -fopenmp -std=c99 -Wno-unused-result $(JACOBI_ITERS_FLAG) $(CUDA_KERNEL_CHECK_FLAG) # -g
CPPCFLAGS := -O3 -fopenmp --std=c++11 -Wno-unused-result -fno-exceptions -Wall -Wextra $(JACOBI_ITERS_FLAG) $(CUDA_KERNEL_CHECK_FLAG) # -g
NVCCFLAGS := -ccbin g++ -O3 -Xcompiler -fopenmp -Xcompiler -Wno-unused-result -Xcompiler -fno-exceptions -Xcompiler -Wall $(JACOBI_ITERS_FLAG) $(CUDA_KERNEL_CHECK_FLAG) # -Xcompiler -g -g -G
ARCHFLAGS := -arch=$(MYGPU_ARCH) -Wno-deprecated-gpu-targets
LDFLAGS := -lrt -lm -lcudart -fopenmp -lhsl_mc64 -lgfortran
# NOTE: Be careful with the order of libraries above
# NOTE: -g option for valgrind to track lines

# Directories
LIB := -Llib -L$(CUDA_PATH)/lib64 -L/usr/local/lib -L$(HOME)/lib
INC := -Iinclude -I$(CUDA_PATH)/include -Itemplates

# First rule
all: $(TGTDIR)/$(TARGET) | $(TGTDIR)

# Linking
$(TGTDIR)/$(TARGET): $(OBJECTS) | $(TGTDIR)
	$(CC) $^ -o $(TGTDIR)/$(TARGET) $(INC) $(LDFLAGS) $(LIB)

# C compilations
$(BUILDDIR)/%.o: $(SRCDIR)/%.c include/*.h | $(BUILDDIR)
	$(CC) $(CCFLAGS) $(INC) -c -o $@ $<

# CPP compilations
$(BUILDDIR)/%.o: $(SRCDIR)/%.cpp include/*.hpp templates/*.tpp | $(BUILDDIR)
	$(CPPC) $(CPPCFLAGS) $(INC) -c -o $@ $<

# CUDA compilations
$(BUILDDIR)/%.o: $(SRCDIR)/%.cu include/*.cuh templates/*.tpp | $(BUILDDIR)
	$(NVCC) $(NVCCFLAGS) $(ARCHFLAGS) $(INC) -c -o $@ $<

# Objects directory
$(BUILDDIR):
	@mkdir -p $(BUILDDIR)

# Target directory
$(TGTDIR):
	@mkdir -p $(TGTDIR)

# Cleaning
clean:
	$(RM) -r $(BUILDDIR)/*.o

# Diagnostic
show:
	@echo "Sources: $(SOURCES)"
	@echo "Objects: $(OBJECTS)"
	@echo "CUDA HOME: $(CUDA_PATH)"
	@echo "Target arch: $(MYGPU_ARCH)"

# Code distribution overall
cloc:
	cloc .

# Clean and make again
again:
	@make clean && make

.PHONY: all clean show cloc again
74 changes: 74 additions & 0 deletions README.MD
@@ -0,0 +1,74 @@
# Exploring CSR-based Mixed and Multi-Precision SpMV for GPUs

Submitted to EuroPar'22 as "Erhan Tezcan, Tuğba Torun, Fahrican Koşar, Kamer Kaya and Didem Unat, _Mixed and Multi-Precision SpMV for GPUs with Row-wise Precision Selection_"

## Building

Use `make` to build the CUDA binary at `bin/spmv`. The compiler uses `-arch=sm_70` for the NVIDIA V100, but you can change that to suit your own GPU via the `MYGPU_ARCH` environment variable, e.g. `export MYGPU_ARCH=sm_50`. We have used `cuda/11.2`, `python/3.7.4` and `gcc/9.3.0` to compile our program and run the Python scripts. You also need to install and compile the [`HSL_MC64`](https://www.hsl.rl.ac.uk/catalogue/mc64.html) static library with `gfortran`.

## File Structure

The file structure of this project is as follows:

- `batch` has shell scripts for cluster commands, such as queueing a job.
- `bin` for binary executables.
- `build` for build files.
- `diagnostic` has several scripts to check the program via Valgrind, `cuda-memcheck`, etc.
- `evaluations` is where we store the execution output. This is later read by Python scripts to make plots.
- `img` stores the output from Python files, such as plot images.
- `include` has header files.
- `logs` has log outputs, generally from the diagnostic tools.
- `res` has resources, such as MatrixMarket files.
- `scripts` has a variety of Python scripts, mostly for plotting and automated running of the code.
- `src` has the source files.
- `templates` has the source files for template functions.

## Running

The `Makefile` will create a binary called `spmv` under the `bin` folder within the same directory, with object files under `build`. Run the executable with the `-h` or `--help` option to see usage.

## Batches

For both `kuacc` and `simula` under `batch` we have the following:

- `final_experiment.sh` runs the final experiments, as used for the paper.
- `spmv_all.sh` runs SpMV test on all matrices (from `allpruned` index).
- `_srun_gpu.sh` asks for an interactive shell with one Tesla V100.
- `_check_queue.sh` checks the queue for my jobs.
- `_load_modules.sh` loads the necessary modules. _It does not always work._

## Matrix Resources

Matrices are stored under `res` folder, with the following scripts:

- `download.sh <MatrixMarketURL>` downloads the matrix from the given URL. See [SuiteSparse](https://sparse.tamu.edu/).
- `download-from-md.sh <path>` downloads the matrices that appear in the provided Markdown file.
- `generate.sh` under `architect` generates a specific set of matrices using the `architect.py` script.
- `parsehtml.sh <path-to-html> <output-name>` parses an HTML from <http://yifanhu.net/GALLERY/GRAPHS/search.html> to create an index file.

## Diagnostics

The scripts below are under the `diagnostic` folder:

- `eval_architect.sh` uses `evaluator.py` on matrices under `res/architect`.
- `eval_res.sh` uses `evaluator.py` on matrices under `res`.
- `cudamemcheck.sh` runs `cuda-memcheck` with a matrix under `res/architect`.
- `valgrind.sh <matrix>` runs `valgrind` for the provided matrix.
- `nvprof.sh <matrix>` profiles SpMV kernels for the provided matrix.
- `run_random.sh` selects a random matrix under `res` and runs it.

## Scripts

Stored under `scripts` folder:

- `architect.py` creates random MatrixMarket matrices.
- `evaluator.py` runs the binary and parses its outputs to create plots. Saves the resulting dictionary to a file.
- `exporter.py` reads a dictionary output by `evaluator.py` and exports `csv` files.
- `interpreter.py` reads a dictionary output by `evaluator.py` and plots the results.
- `interpret.ipynb` a notebook to plot the results from another evaluation output.
- `analyser.py` analyses a specific matrix with Python.
- `plots.py` helper functions for plotting.
- `utility.py` utility functions.
- `prints.py` helper functions for printing.

The `plottype` folder has generic plotting functions such as bar, heatmap, density, etc., and the `plotspecial` folder has specific plots.
2 changes: 2 additions & 0 deletions batch/kuacc/_check_queue.sh
@@ -0,0 +1,2 @@
#!/bin/bash
squeue -u etezcan19
2 changes: 2 additions & 0 deletions batch/kuacc/_load_modules.sh
@@ -0,0 +1,2 @@
#!/bin/bash
module load cuda/11.2 gcc/9.3.0 python/3.7.4
5 changes: 5 additions & 0 deletions batch/kuacc/_srun_gpu.sh
@@ -0,0 +1,5 @@
#!/bin/bash
srun -A users --partition=ai --qos=ai --account=ai -n1 --gres=gpu:tesla_v100:1 --pty $SHELL

# -w ai12 (for a specific node)
# srun -N 1 -n1 -p short --qos=users --gres=gpu:1 -w ai12 --pty $SHELL