DafnyBench: A Benchmark for Formal Software Verification

Dataset & code for our paper DafnyBench: A Benchmark for Formal Software Verification

Dataset is also available for download on 🤗 Hugging Face.

Overview 📊

DafnyBench is the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification, with over 750 Dafny programs.

Usage 💻

Dataset: The dataset for DafnyBench (782 programs) can be found in the DafnyBench directory, which contains the ground_truth set & the hints_removed set (with compiler hints, i.e. annotations, removed).
Evaluation: Evaluate LLMs on DafnyBench by asking models to fill in the missing hints in a test file from the hints_removed set and checking whether the reconstructed program can be verified by Dafny. Please refer to the eval directory.
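
The verification check described above can be sketched in Python as follows. This is a minimal illustration, not the code in the eval directory: the helper names `verify_command` and `is_verified` are hypothetical, the `dafny verify <file>` form assumes a Dafny 4 style CLI, and treating exit status 0 as "verified" is an assumption you should confirm against your installed version.

```python
import subprocess
from pathlib import Path


def verify_command(dafny_path: str, program: Path) -> list[str]:
    """Build the argument list for one verification run.

    Assumes the Dafny 4 CLI form `dafny verify <file>`; older releases
    use different flags, so adjust this for your installation.
    """
    return [dafny_path, "verify", str(program)]


def is_verified(dafny_path: str, program: Path, timeout_s: int = 60) -> bool:
    """Return True if Dafny exits with status 0 before the timeout."""
    try:
        result = subprocess.run(
            verify_command(dafny_path, program),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A run that does not finish in time is counted as not verified.
        return False
    return result.returncode == 0
```

Keeping the command construction separate from the subprocess call makes it easy to swap in different flags if your Dafny version needs them.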

Set Up for Evaluation 🔧

Install Dafny on your machine by following this tutorial
Clone & cd into this repository
Set up environment by running the following lines:

python -m venv stats
source stats/bin/activate
pip install -r requirements.txt
cd eval

Set up environment variable for the root directory:

export DAFNYBENCH_ROOT=

Set up environment variable for path to Dafny executable on your machine (for example, /opt/homebrew/bin/Dafny):

export DAFNY_PATH=

If you're evaluating an LLM through API access, set up your API key. For example:

export OPENAI_API_KEY=
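
Before starting a run, it may help to confirm that the variables above are actually set. The snippet below is a small sketch for that purpose; the helper `missing_vars` is hypothetical and not part of this repository. `OPENAI_API_KEY` is left out of the default list because it is only needed for API-based evaluation.

```python
import os

# Variables named in the setup steps; extend this tuple if you use an API key.
REQUIRED_VARS = ("DAFNYBENCH_ROOT", "DAFNY_PATH")


def missing_vars(env, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Typical use: pass the real process environment.
# problems = missing_vars(os.environ)
# if problems:
#     raise SystemExit("Missing environment variables: " + ", ".join(problems))
```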

You can choose to evaluate an LLM on a single test program, such as:

python fill_hints.py --model "gpt-4o" --test_file "Clover_abs_no_hints.dfy" --feedback_turn 3 --dafny_path "$DAFNY_PATH"

or evaluate on the entire dataset:

export model_to_eval='gpt-4o'
./run_eval.sh
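
If you want to evaluate only a subset of programs, one option is to generate the single-file command shown earlier for each test file. The sketch below does only that; it is not a description of what run_eval.sh does internally, and the helper name `fill_hints_commands` is hypothetical. The flags are the ones used in the single-file example above.

```python
def fill_hints_commands(test_files, model, dafny_path, feedback_turn=3):
    """Return one fill_hints.py argument list per test file name.

    `test_files` is any iterable of file names, for example
    "Clover_abs_no_hints.dfy". Names are sorted for a stable order.
    """
    commands = []
    for name in sorted(test_files):
        commands.append([
            "python", "fill_hints.py",
            "--model", model,
            "--test_file", name,
            "--feedback_turn", str(feedback_turn),
            "--dafny_path", dafny_path,
        ])
    return commands
```

Each returned list can be passed to `subprocess.run` from inside the eval directory, so that failures in one program do not stop the rest of the subset.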

Contents 📁

DafnyBench
- A collection of 782 Dafny programs. Each program has a ground_truth version that is fully verified with Dafny & a hints_removed version that has hints (i.e. annotations) removed
eval
- Contains scripts to evaluate LLMs on DafnyBench
results
- results_summary - Dataframes that summarize LLMs' success on every test program
- reconstructed_files - LLM outputs with hints filled back in
- analysis - Contains a notebook for analyzing the results
