PropertyEval

A repository of property-based tests for thorough benchmarking of LLM code generation.

Usage

The /tests directory contains directories labeled from 0 to 163, each containing a strategy.py file that holds the Hypothesis strategy for the corresponding problem from the HumanEval dataset. Each directory also contains an __init__.py file so that the strategies can be imported as modules. Each module exposes its strategy through the strategy attribute. The strategies can be used as follows.

from hypothesis import given, strategies

@given(strategies.tuples(*st))
def test_property(args):
    # call functions as f(*args)
    # for example, assert f(*args) == ground_truth(*args)
    ...  # placeholder body so the function is valid Python

Here, st is the imported strategy. One way to import it is with the importlib module.

import importlib

humaneval_id = 129  # example: any problem ID from 0 to 163
st_module = importlib.import_module(f"test.{humaneval_id}.strategy")
st = st_module.strategy
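
The two snippets above can be combined into one equivalence test. The following is a minimal sketch, not the repository's own harness: strategy is a hypothetical stand-in for st_module.strategy, and ground_truth and candidate are toy functions invented for illustration. The max_examples value follows footnote 2.

```python
from hypothesis import given, settings, strategies as st

# Hypothetical stand-in for `st_module.strategy`: one strategy per argument.
strategy = (
    st.integers(min_value=0, max_value=100),
    st.integers(min_value=0, max_value=100),
)


def ground_truth(a, b):
    """Toy canonical solution (assumption, not from HumanEval)."""
    return a + b


def candidate(a, b):
    """Toy generated solution to be checked against the ground truth."""
    return sum((a, b))


@settings(max_examples=1000)  # example count taken from footnote 2
@given(st.tuples(*strategy))
def test_property(args):
    # The candidate must agree with the ground truth on every generated input.
    assert candidate(*args) == ground_truth(*args)
```

The test can be collected by pytest or run directly by calling test_property(), which raises an AssertionError with a shrunk counterexample if the two functions disagree.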

Contribution

Thoroughness

We show that Property-Based Testing (PBT) can improve the thoroughness of programming benchmarks by leveraging the canonical solutions included in these benchmarks. For the HumanEval dataset, adequate property-based tests cannot be generated automatically with rule-based tools, so we carefully construct these tests manually. We show that PBT lets us synthesize test cases as thorough as those generated using type-aware mutations in Liu et al.'s EvalPlus¹. Our approach can also be easily adapted to other contexts.

Dataset

We share our full set of property-based tests as a complementary resource to existing manual and synthesized test suites.

Examples

A non-trivial strategy.

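# HumanEval 129: minPath
from hypothesis.strategies import composite, integers, lists, permutations
# MAX_SEQUENCE_LEN and MAX_INT are shared limits, presumably those in limits/limits.py

@composite
def create_grid(draw, n_st=integers(min_value=2, max_value=MAX_SEQUENCE_LEN)):
    n = draw(n_st)
    # draw an n x n placeholder grid, then overwrite every cell below
    grid = draw(lists(lists(integers(), min_size=n, max_size=n), min_size=n, max_size=n))
    perm = draw(permutations(range(1, n**2 + 1)))
    # fill grid with perm so that each of 1..n*n appears exactly once
    for i in range(n):
        for j in range(n):
            grid[i][j] = perm[i*n + j]
    return grid

grid = create_grid()
k = integers(min_value=1, max_value=MAX_INT)
strategy = grid, k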
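
The invariant that this strategy is meant to guarantee (an n x n grid holding each integer from 1 to n*n exactly once) can itself be checked with a property-based test. The sketch below uses a simplified variant of create_grid that builds the rows directly from the permutation. MAX_SEQUENCE_LEN = 5 is an assumed value chosen for illustration, and the repository's limit may differ.

```python
from hypothesis import given, settings
from hypothesis.strategies import composite, integers, permutations

MAX_SEQUENCE_LEN = 5  # assumed bound for this sketch only


@composite
def create_grid(draw, n_st=integers(min_value=2, max_value=MAX_SEQUENCE_LEN)):
    n = draw(n_st)
    perm = draw(permutations(range(1, n**2 + 1)))
    # Slice the flat permutation into n rows of n cells each.
    return [list(perm[i * n:(i + 1) * n]) for i in range(n)]


@settings(max_examples=200)
@given(create_grid())
def test_grid_is_valid(grid):
    n = len(grid)
    assert n >= 2
    # Every row has exactly n cells.
    assert all(len(row) == n for row in grid)
    # The cells are exactly the integers 1..n*n, each used once.
    flat = [cell for row in grid for cell in row]
    assert sorted(flat) == list(range(1, n * n + 1))
```

Building the rows by slicing avoids drawing a placeholder grid that is overwritten immediately, which should leave the set of possible grids unchanged.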

Examples of additional constraints on the input space. Here, we have restricted the alphabet and introduced bounds on the lengths of strings and lists.

import re
from hypothesis.strategies import lists, text
# MAX_SEQUENCE_LEN is a shared limit, presumably the one in limits/limits.py

# HumanEval 134: check_if_last_char_is_a_letter
txt = text(alphabet='abcde0123 ')
strategy = txt

# HumanEval 143: words_in_sentence
# parentheses are required for the chained calls to span several lines
sentence = (
    text(alphabet="a ", min_size=1, max_size=100)
    .map(lambda s: re.sub(r"\s+", " ", s))
    .filter(lambda s: not (s.startswith(" ") or s.endswith(" ")))
)
strategy = sentence

# HumanEval 158: find_max
words = lists(text(alphabet='abc', max_size=MAX_SEQUENCE_LEN), min_size=1, max_size=MAX_SEQUENCE_LEN)
strategy = words
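
In these examples the strategy attribute is sometimes a tuple of strategies (HumanEval 129) and sometimes a single strategy (HumanEval 134, 143, 158). A single strategy cannot be unpacked with *st, so a harness may need to normalize both forms first. The as_tuple helper below is a hypothetical name, not part of the repository, and how the repository handles this case is not shown here.

```python
from hypothesis import given, strategies as st


def as_tuple(strategy):
    """Wrap a lone strategy in a 1-tuple; leave tuples of strategies as they are."""
    return strategy if isinstance(strategy, tuple) else (strategy,)


# Single-strategy form, as in the HumanEval 134 example.
single = st.text(alphabet="abcde0123 ")

# Tuple form, as in the HumanEval 129 example (contents are illustrative).
pair = (st.text(alphabet="abc"), st.integers(min_value=1, max_value=10))


@given(st.tuples(*as_tuple(single)))
def test_single(args):
    # One strategy yields a 1-tuple of arguments.
    assert len(args) == 1 and isinstance(args[0], str)


@given(st.tuples(*as_tuple(pair)))
def test_pair(args):
    # Two strategies yield a 2-tuple, in the same order.
    assert len(args) == 2
    assert isinstance(args[0], str) and 1 <= args[1] <= 10
```

With this wrapper, the function under test can always be called as f(*args), whichever form the module uses.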

Automation

For the MBPP dataset, we demonstrate that these tests can be generated largely automatically with GPT-3.5, using few-shot prompts based on some of our manually constructed tests. This shows that our approach can be easily scaled to other datasets.
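
The prompts actually used are not shown in this README. Purely as an illustration of the few-shot idea, the sketch below assembles a prompt from (problem, strategy) pairs followed by the target problem. The section markers, the instruction sentence, and the names FEW_SHOT and build_prompt are all assumptions.

```python
# Hypothetical few-shot pool: (problem description, hand-written strategy source).
FEW_SHOT = [
    (
        "find_max(words): return the word with the most unique characters.",
        "words = lists(text(alphabet='abc', max_size=MAX_SEQUENCE_LEN), "
        "min_size=1, max_size=MAX_SEQUENCE_LEN)\nstrategy = words",
    ),
    (
        "check_if_last_char_is_a_letter(txt): check the last character of txt.",
        "txt = text(alphabet='abcde0123 ')\nstrategy = txt",
    ),
]


def build_prompt(examples, target_problem):
    """Join solved examples and the unsolved target into one prompt string."""
    parts = ["Write a Hypothesis strategy that generates valid inputs for each problem."]
    for problem, strategy_src in examples:
        parts.append(f"### Problem\n{problem}\n### Strategy\n{strategy_src}")
    # The target problem ends with an empty strategy section for the model to fill in.
    parts.append(f"### Problem\n{target_problem}\n### Strategy\n")
    return "\n\n".join(parts)


prompt = build_prompt(FEW_SHOT, "min_cost(cost, m, n): minimum cost path to cell (m, n).")
```

The resulting string would then be sent to the model, and the returned strategy would still need to be validated against the ground-truth solution before use.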

Warning

This is a work in progress, but some preliminary results are available here.

Evaluation

The /humaneval_groundtruth directory contains canonical solutions to HumanEval problems, adapted from the ground truth solutions provided with EvalPlus v0.1.0. The results from the equivalence tests on code samples for 84 (model, size, temperature) combinations provided with EvalPlus v0.1.0 are available in evaldata.csv. The script for executing this benchmark is a modified fork² of the EvalPlus script.
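
Footnote 2 mentions that the fork estimates pass@k for the property-based tests. Assuming it follows the standard unbiased estimator introduced with HumanEval (n samples per problem, of which c pass), the computation can be sketched as follows. The function name and signature are illustrative, not the fork's API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for one problem, 3 of which pass all property-based tests.
estimate = pass_at_k(n=10, c=3, k=1)
```

Here a sample counts as passing only if it agrees with the ground truth on every generated example, so a stricter test suite can only lower c and therefore the estimate.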

The limits/limits.py file contains several standardized limits for the strategies. The limits/fuzzer.py script runs fuzz tests on all HumanEval ground-truth solutions with the strategies in order to validate these limits.
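
The contents of limits/fuzzer.py are not reproduced here. As a rough sketch of the idea, a limit can be validated by running a ground-truth solution on generated inputs, letting any exception surface, and recording how long each call takes. MAX_SEQUENCE_LEN = 100 and the ground_truth stand-in are assumptions made for this example.

```python
import time

from hypothesis import given, settings, strategies as st

MAX_SEQUENCE_LEN = 100  # hypothetical limit; the value in limits/limits.py may differ


def ground_truth(xs):
    """Stand-in for a canonical HumanEval solution."""
    return sorted(xs)


timings = []  # seconds taken by each ground-truth call


# The deadline is disabled so that slow inputs are measured instead of failing the run.
@settings(max_examples=200, deadline=None)
@given(st.lists(st.integers(), max_size=MAX_SEQUENCE_LEN))
def fuzz(xs):
    start = time.perf_counter()
    ground_truth(xs)  # an exception here is reported by Hypothesis with a shrunk input
    timings.append(time.perf_counter() - start)


fuzz()
print(f"{len(timings)} runs, slowest {max(timings) * 1000:.3f} ms")
```

If the slowest call approaches the 200 ms deadline mentioned in footnote 2, the limit is probably too generous for that problem.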

Footnotes

¹ Jiawei Liu et al. “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation”. In: arXiv preprint arXiv:2305.01210 (2023).
² We forked EvalPlus and modified the evaluation script so that it also evaluates code samples with PropertyEval's property-based tests, in addition to the Base and Base + Extra test cases. We also modified the existing pipeline to estimate pass@k for PropertyEval's property-based tests. The fork is available as EvalPlusPro. Some points to note are as follows.
   1. The property-based tests are executed with 1000 examples, using @settings(max_examples=1000).
   2. Instead of the time limits enforced by EvalPlus, we use the default deadline of 200 ms that comes with Hypothesis.
