How well do LLMs reason over tabular data, really?

This repository contains the benchmark suite and replication package for our paper "How well do LLMs reason over tabular data, really?". It allows you to reproduce our benchmark results and evaluate language models on their table reasoning capabilities.

Read our paper for a detailed analysis of the benchmark results and findings.

Testing models

To replicate the results from our paper or test newly available models, follow these steps:

Prerequisites

Before installing our TabReasBench package, you need to have Ollama installed and the required model pulled:

1. Install Ollama from ollama.com
2. Pull the required model:

ollama pull qwen2.5:32b
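
To confirm that the pull succeeded before starting a long benchmark run, you can query the local Ollama server. The sketch below is not part of TabReasBench. It assumes Ollama is listening on its default address (http://localhost:11434) and uses its /api/tags endpoint, which lists locally available models. The helper names model_names and is_model_pulled are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama runs locally on its default port.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_names(tags_payload: dict) -> set[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return {entry["name"] for entry in tags_payload.get("models", [])}


def is_model_pulled(model: str, url: str = OLLAMA_TAGS_URL) -> bool:
    """Return True if `model` is listed by the local Ollama server."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return model in model_names(json.load(response))
```

Calling is_model_pulled("qwen2.5:32b") should return True once the pull has completed. It raises a urllib.error.URLError if the Ollama server is not running.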

Hardware Requirements

- For running the benchmarks: any GPU that can run Ollama models
- For evaluating results: a GPU with at least 20 GB of VRAM (required for qwen2.5:32b, which is used as the LLM-as-a-judge)

Installation

Clone the repository and install TabReasBench from source with pip:

git clone https://github.com/trl-lab/tabular-robustness/
cd tabular-robustness
pip install .

Running the Benchmarks

To replicate our benchmark results, run:

tabreasbench --model qwen2.5:32b --output_dir benchmark_results

This will:

- Run all benchmarks (base, missing, and shuffle) across different scales
- Evaluate the results using qwen2.5:32b as the judge model
- Generate aggregated results and LaTeX tables
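
If you want to benchmark several models in sequence, the command above can be driven from a short script. This is a minimal sketch, not part of the package. It uses only the --model and --output_dir flags shown above and assumes that --model accepts any model tag already pulled in Ollama. Giving each model its own output directory is a precaution in case aggregated files are overwritten between runs. The function names and the directory layout are hypothetical.

```python
import subprocess


def build_command(model: str, output_dir: str) -> list[str]:
    """Build the tabreasbench invocation documented in this README."""
    return ["tabreasbench", "--model", model, "--output_dir", output_dir]


def run_benchmarks(models: list[str], output_root: str = "benchmark_results") -> None:
    """Run the benchmark once per model, each in its own subdirectory."""
    for model in models:
        # Assumption: a per-model directory avoids clobbering aggregated files.
        output_dir = f"{output_root}/{model.replace(':', '_')}"
        # check=True stops the loop if any run exits with an error.
        subprocess.run(build_command(model, output_dir), check=True)
```

For example, run_benchmarks(["qwen2.5:32b"]) would write its results under benchmark_results/qwen2.5_32b.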

The results will be organized in the specified output directory:

output_dir/
├── raw_results/
│   └── results_qwen2.5_32b.csv           # Raw model outputs and ground truth
├── evaluated_results/
│   └── results_qwen2.5_32b_evaluated.csv # Results with correctness evaluation
└── aggregated_results/
    ├── overall_summary.csv                # Overall performance metrics
    ├── detailed_results.csv               # Per-dataset performance breakdown
    ├── overall_summary.tex                # LaTeX table of overall results
    └── detailed_results.tex               # LaTeX table of detailed results
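
For downstream analysis it can help to compute these file locations from the model name. The sketch below is an illustration based only on the tree above. It assumes the file names are derived by replacing the colon in the model tag with an underscore (qwen2.5:32b becomes qwen2.5_32b), which matches the example but is not confirmed for other models. The helper result_paths is hypothetical.

```python
from pathlib import Path


def result_paths(output_dir: str, model: str) -> dict[str, Path]:
    """Map each result file in the documented layout to its expected path."""
    # Assumption: ':' in the model tag is replaced by '_' in file names.
    safe_name = model.replace(":", "_")
    root = Path(output_dir)
    aggregated = root / "aggregated_results"
    return {
        "raw": root / "raw_results" / f"results_{safe_name}.csv",
        "evaluated": root / "evaluated_results" / f"results_{safe_name}_evaluated.csv",
        "overall_summary": aggregated / "overall_summary.csv",
        "detailed_results": aggregated / "detailed_results.csv",
    }
```

The returned paths can be passed to any CSV reader. The column names inside the files are not documented here, so inspect the header row before relying on specific fields.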

Citation

If you use our test code in your research, please cite our paper:

@article{wolff2025well,
  title={How well do LLMs reason over tabular data, really?},
  author={Wolff, Cornelius and Hulsebos, Madelon},
  journal={arXiv preprint arXiv:2505.07453},
  year={2025}
}

Plain text citation:

Wolff, C., & Hulsebos, M. (2025). How well do LLMs reason over tabular data, really? arXiv preprint arXiv:2505.07453.

License

This project is licensed under the MIT License - see the LICENSE file for details.
