This repository contains the source code to reproduce the results and analysis of the paper:
Towards Quantifying the Effect of Datasets for Benchmarking: A Look at Tabular Machine Learning
Ravin Kohli, Matthias Feurer, Bernd Bischl, Katharina Eggensperger, Frank Hutter
Data-centric Machine Learning Research (DMLR) Workshop at ICLR 2024
The code is provided as-is and we will neither maintain it nor provide bug fixes.
git clone https://github.com/automl/dmlr-iclr24-datasets-for-benchmarking
cd dmlr-iclr24-datasets-for-benchmarking
conda create -n tabular_data_experiments python=3.10
conda activate tabular_data_experiments
conda install swig
# Install for usage
pip install .
# Install for development
make install-dev
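After installing, a quick sanity check of the environment can save debugging time later. The following is a minimal sketch, not part of the repository: it assumes the installed package is importable under the name `tabular_data_experiments` (taken from the conda environment name, which may differ from the actual package name), and it checks that the interpreter is Python 3.10, as requested in the `conda create` command. The names `check_environment`, `ASSUMED_PACKAGE`, and `EXPECTED_PYTHON` are hypothetical helpers introduced for illustration.

```python
import importlib.util
import sys

# Assumption: the package name matches the conda environment name.
ASSUMED_PACKAGE = "tabular_data_experiments"
# Matches `python=3.10` in the conda create command.
EXPECTED_PYTHON = (3, 10)


def check_environment(package: str = ASSUMED_PACKAGE) -> dict:
    """Report whether the interpreter version and the package install look right."""
    return {
        # Compare only (major, minor), ignoring the patch version.
        "python_ok": sys.version_info[:2] == EXPECTED_PYTHON,
        # find_spec returns None when a top-level package cannot be located.
        "package_found": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

Run it inside the activated conda environment. If `package_found` is reported as `no`, the package may simply be installed under a different name than the one assumed here.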
Our code is heavily inspired by the great source code published alongside the paper "Why do tree-based models still outperform deep learning on tabular data?" by Leo Grinsztajn, Edouard Oyallon, and Gael Varoquaux.
The raw data can be found here.
We provide the following notebooks for visualization:
Contains code that creates the table used throughout the paper.
Contains code that creates the figures used throughout the paper.