Stefan Jokić, Lucas Cosier, Theresa Wakonig, Olivier Becker
This project was done as part of the Advanced Systems Lab course at ETH in 2021. To see contributions, take a look at the final report 36_report.pdf
.
This repository contains fast C implementations of algorithms for the computation of Shapley values (SV) in the context of k-NN classifiers. In particular, the baseline implementations as well as optimized versions (both scalar and SIMD using AVX2 intrinsics) of the following two algorithms presented in the PVLDB paper "Efficient task-specific data valuation for nearest neighbor algorithms" by R. Jia et al. are included:
- Exact computation of Shapley values (alg1)
- MC approximation of Shapley values (alg2)
The reference Python implementations are contained in the python-baseline
directory and originate from the Github repositories published by the original authors of the paper mentioned above:
- github.com/AI-secure/KNN-PVLDB/ (Algorithm 1: Exact SV computation)
- github.com/sunblaze-ucb/data-valuation/ (Algorithm 2: MC approximation)
The root directory that this README is contained in includes a Makefile. By running make
in this current root directory, make
will be called recursively in all subdirectories of src
and benchmarking
.
We refer to ~
as the root directory of this repository that this README is contained in. In the following list, we describe the structure of the root directory.
-
The
src
directory contains two further directoriesalg1
andalg2
, which in turn contain the reference implementations asalg1.h
andalg2.h
, respectively. Each of these directories also includes a testing infrastructure (testing_infra.c
) that allows to run the algorithms with a given input size and prints the output to the console. This is used to check the correctness of the result by comparing it with the output of the Python reference implementation using the testing script~/testing/testing.sh
. It also contains an infrastructure to measure the runtimes of the subroutines that make up the algorithms (bottlenecks.c
), so as to identify bottlenecks. The included makefile will compile both of these C files. Each of the directoriesalg1
andalg2
also contain a directory namedopt
, which contains all optimized versions of the algorithm and which are further divded intoscalar
andvectorized
and contain their own READMEs describing which optimizations are applied (e.g. the README inalg1/vectorized/
will specify that opt1 refers to the version of algorithm 1 that includes the optimized l2 norm with AVX intrinsics). The structure of the directory for a given optimized version (e.g.alg1/vectorized/opt1
) is similar to that of the baselines: they contain a makefile, the optimized implementation (optX.h
), the testing infrastructure, an infrastructure to identify bottlenecks, and a runtime benchmarking infrastructure. They also contain shell scripts for running correctness tests (testing.sh
) by comparing the output with the C baseline implementation, benchmarking runtimes (benchmarking.sh
) and running a roofline analysis (roofline_analysis.sh
) which requires Intel Advisor to be installed. Some scripts may require you to runmake
first. For easier automation, the~/src
directory contains shell scripts that run the aforementioned scripts for benchmarking and roofline analysis in the nested directories of the optimized implementations:benchmark_all_opts.sh
androofline_analysis_all_opts.sh
, respectively. The scriptroofline_analysis.sh
runs the roofline analysis for the baseline implementations. The~/src/alg1
,~/src/alg2
as well as all the directories containing optimized versions (i.e.algX/scalar/optY
oralgX/vectorized/optY
) contain aresults
androofline_analysis
folder with the .csv files containing the results of the runtime benchmarks and roofline analysis, respectively, for different input sizes. Note that the results of the roofline analysis may only be opened with Intel Advisor. The results of the bottlenecks analyses are also stored as .csv files in~/src/algX/opt/scalar/bottlenecks
and~/src/algX/opt/vectorized/bottlenecks
. -
The
optimizations
directory contains optimizations performed on the different algorithms, but in isolation. This means that, contrary to e.g.alg1/scalar/opt1
which contains the entire algorithm 1 along with the optimized l2-norm, the directory~/optimizations/l2norm
contains an optimized implementation of the l2norm, as well as an infrastructure to measure the runtime, speedup (w.r.t. a baseline) and correctness of the l2norm in isolation, i.e. excluding the rest of the algorithm. -
For checking the correctness of our C baselines, the
testing
directory contains a scripttesting.sh
which will run the C baseline implementations as well as the Python reference implementations of both algorithms and store their output in .csv files. The contents of the these .csv files is then compared using a Python script (test_compare.py
) and the implementation is only deemed correct if for all valuesnp.allclose(python_impl, c_impl, 1e-07, 1e-06)
returns true. Note thatpython_impl
andc_impl
refer to the arrays containing the outputs of the Python implementation and C implementation, respectively. -
For benchmarking our C baselines, the directory
benchmarking
contains a scriptbenchmarking.sh
that will measure the runtimes of both algorithms for different input sizes using RDTSC and subsequently store the results in theresults
directory. We provide you with our own measurements (using the Intel i7 11700k) in theresults_final
directory. The scriptbenchmarking_diff_K.sh
can be used to benchmark runtimes with a varying k (for the k-NN) and store the results inresults_diff_K
, however, we did not end up using such benchmarks for our final report. -
The
include
directory contains various C header files that are included in the implementations for algorithms 1 and 2. For instance, themat.h
header implements a struct that represents a matrix, i.e. contains fields that describe the size of the matrix as well as an array that contains its data, orheap.h
which implements a max-heap that is used in algorithm 2. -
As mentioned previously, the
python-baseline
directory contains the reference Python implementations of algorithms 1 and 2. The source GitHub repositories of these implementations is mentioned at the beginning of this README. -
The
visualizations
directory contains various Python scripts for plotting the runtimes, speedup (w.r.t. baseline) and roofline models. It also includes Jupyter notebooks for plotting the runtimes of different subroutines of the two algorithms in order to identify bottlenecks. The plots used in the report can be found invisualizations/plots
. -
Lastly, the
data
directory, which is currently not on this repository as it has been added to our .gitignore, contains the data set used for running the correctness tests of our implementations. More precisely, this data set contains a subset of features extracted from the CIFAR-10 image data set using a ResNet50 CNN and the associated labels (from 0 to 9). Note that the data set is quite sparse. When running the test for different input sizes, we simply take different subsets of this data set. You can download the data set that we used here (ETH polybox) and extract it in thedata
directory (within the root directory). The values are stored in single-precision floating point.