Dataset & code for our paper DafnyBench: A Benchmark for Formal Software Verification
Dataset is also available for download on 🤗 Hugging Face.
DafnyBench is the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification, with over 750 Dafny programs.
- Dataset: The dataset for DafnyBench (with 782 programs) could be found in the
DafnyBenchdirectory, which contains theground_truthset & thehints_removedset (with compiler hints, i.e. annoataions, removed). - Evaluation: Evaluate LLMs on DafnyBench by asking models to fill in missing hints in a test file from the
hints_removedset and checking if the reconstructed program could be verified by Dafny. Please refer to theevaldirectory.
- Install Dafny on your machine by following this tutorial
- Clone &
cdinto this repository - Set up environment by running the following lines:
python -m venv stats
source stats/bin/activate
pip install -r requirements.txt
cd eval
- Set up environment variable for the root directory:
export DAFNYBENCH_ROOT=
- Set up environment variable for path to Dafny executable on your machine (for example,
/opt/homebrew/bin/Dafny):
export DAFNY_PATH=
- If you're evaluating an LLM through API access, set up API key. For example:
export OPENAI_API_KEY=
- You can choose to evaluate an LLM on a single test program, such as:
python fill_hints.py --model "gpt-4o" --test_file "Clover_abs_no_hints.dfy" --feedback_turn 3 --dafny_path "$DAFNY_PATH"
or evaluate on the entire dataset:
export model_to_eval='gpt-4o'
./run_eval.sh
DafnyBench- A collection of 782 Dafny programs. Each program has a
ground_truthversion that is fully verified with Dafny & ahints_removedversion that has hints (i.e. annotations) removed
- A collection of 782 Dafny programs. Each program has a
eval- Contains scripts to evaluate LLMs on DafnyBench
resultsresults_summary- Dataframes that summarize LLMs' success on every test programreconstructed_files- LLM outputs with hints filled back inanalysis- Contains a notebook for analyzing the results
