
Fast C implementations for the computation of Shapley values (SV) in the context of k-NN classifiers


sjokic/FastShapleyKNN

 
 


FAST COMPUTATION OF SHAPLEY VALUES FOR NEAREST NEIGHBOR ALGORITHMS

Stefan Jokić, Lucas Cosier, Theresa Wakonig, Olivier Becker

This project was done as part of the Advanced Systems Lab course at ETH in 2021. To see contributions, take a look at the final report 36_report.pdf.

This repository contains fast C implementations of algorithms for the computation of Shapley values (SV) in the context of k-NN classifiers. In particular, the baseline implementations as well as optimized versions (both scalar and SIMD using AVX2 intrinsics) of the following two algorithms presented in the PVLDB paper "Efficient task-specific data valuation for nearest neighbor algorithms" by R. Jia et al. are included:

  • Exact computation of Shapley values (alg1)
  • Monte Carlo (MC) approximation of Shapley values (alg2)

The reference Python implementations are contained in the python-baseline directory and originate from the GitHub repositories published by the original authors of the paper mentioned above.

Makefile

The root directory containing this README includes a Makefile. Running make there invokes make recursively in all subdirectories of src and benchmarking.

Repository structure

In the following list, ~ refers to the root directory of this repository, i.e. the directory containing this README.

  • The src directory contains two further directories, alg1 and alg2, which in turn contain the reference implementations as alg1.h and alg2.h, respectively. Each of these directories also includes a testing infrastructure (testing_infra.c) that runs the algorithms on a given input size and prints the output to the console; this is used to check correctness by comparing the result with the output of the Python reference implementation via the testing script ~/testing/testing.sh. Each directory also contains an infrastructure to measure the runtimes of the subroutines that make up the algorithms (bottlenecks.c), so as to identify bottlenecks. The included makefile compiles both of these C files. The directories alg1 and alg2 each contain a directory named opt, which holds all optimized versions of the algorithm, further divided into scalar and vectorized; these carry their own READMEs describing which optimizations are applied (e.g. the README in alg1/vectorized/ specifies that opt1 refers to the version of algorithm 1 with the l2 norm optimized using AVX intrinsics). The directory of a given optimized version (e.g. alg1/vectorized/opt1) is structured like the baselines: it contains a makefile, the optimized implementation (optX.h), the testing infrastructure, an infrastructure to identify bottlenecks, and a runtime benchmarking infrastructure. It also contains shell scripts for running correctness tests (testing.sh), which compare the output with the C baseline implementation, for benchmarking runtimes (benchmarking.sh), and for running a roofline analysis (roofline_analysis.sh), which requires Intel Advisor to be installed. Some scripts may require you to run make first. 
For easier automation, the ~/src directory contains shell scripts that run the aforementioned benchmarking and roofline-analysis scripts in the nested directories of the optimized implementations: benchmark_all_opts.sh and roofline_analysis_all_opts.sh, respectively. The script roofline_analysis.sh runs the roofline analysis for the baseline implementations. ~/src/alg1, ~/src/alg2, as well as all directories containing optimized versions (i.e. algX/scalar/optY or algX/vectorized/optY), contain results and roofline_analysis folders with .csv files holding the results of the runtime benchmarks and the roofline analysis, respectively, for different input sizes. Note that the roofline-analysis results can only be opened with Intel Advisor. The results of the bottleneck analyses are also stored as .csv files in ~/src/algX/opt/scalar/bottlenecks and ~/src/algX/opt/vectorized/bottlenecks.

  • The optimizations directory contains optimizations of individual subroutines in isolation. Unlike e.g. alg1/scalar/opt1, which contains the entire algorithm 1 along with the optimized l2 norm, the directory ~/optimizations/l2norm contains an optimized implementation of the l2 norm alone, together with an infrastructure to measure its runtime, speedup (w.r.t. a baseline), and correctness in isolation, i.e. excluding the rest of the algorithm.

  • For checking the correctness of our C baselines, the testing directory contains a script testing.sh that runs the C baseline implementations as well as the Python reference implementations of both algorithms and stores their output in .csv files. The contents of these .csv files are then compared using a Python script (test_compare.py); the C implementation is deemed correct only if np.allclose(python_impl, c_impl, 1e-07, 1e-06) returns true, where python_impl and c_impl are the arrays containing the outputs of the Python implementation and C implementation, respectively.

  • For benchmarking our C baselines, the benchmarking directory contains a script benchmarking.sh that measures the runtimes of both algorithms for different input sizes using RDTSC and stores the results in the results directory. We provide our own measurements (taken on an Intel Core i7-11700K) in the results_final directory. The script benchmarking_diff_K.sh can be used to benchmark runtimes with a varying k (for the k-NN) and store the results in results_diff_K; however, we did not end up using these benchmarks in our final report.

  • The include directory contains various C header files that are included by the implementations of algorithms 1 and 2. For instance, mat.h implements a struct representing a matrix (fields describing its dimensions plus an array holding its data), and heap.h implements a max-heap used in algorithm 2.

  • As mentioned previously, the python-baseline directory contains the reference Python implementations of algorithms 1 and 2. The source GitHub repositories of these implementations are mentioned at the beginning of this README.

  • The visualizations directory contains various Python scripts for plotting the runtimes, speedup (w.r.t. baseline) and roofline models. It also includes Jupyter notebooks for plotting the runtimes of different subroutines of the two algorithms in order to identify bottlenecks. The plots used in the report can be found in visualizations/plots.

  • Lastly, the data directory, which is not tracked in this repository (it is listed in our .gitignore), contains the data set used for the correctness tests of our implementations. More precisely, this data set contains a subset of features extracted from the CIFAR-10 image data set using a ResNet50 CNN, along with the associated labels (0 to 9). Note that the data set is quite sparse. When running tests for different input sizes, we simply take different subsets of this data set. You can download the data set that we used here (ETH polybox) and extract it into the data directory (within the root directory). The values are stored as single-precision floating point.
