The use of GenAI for coding can be separated into two types of tasks: proposal and refinement. The initial request yields a proposal, and the subsequent calls refine that solution. Most calls to GenAI for coding are refinement calls. The refinement cycle is as follows: the developer runs the proposed solution, encounters a bug, pastes the relevant output back to the model, gets a new suggestion, tries the new solution, and iterates until the program runs.
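A minimal sketch of this cycle, where `ask_llm` is a hypothetical callable (prompt in, code out) standing in for whatever model client is used:

```python
import subprocess
import sys
import tempfile

def run_code(code: str) -> tuple[bool, str]:
    """Run a candidate solution; report success and any error output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr

def refine(ask_llm, task: str, max_rounds: int = 5) -> str:
    """`ask_llm` is a hypothetical prompt -> code callable."""
    code = ask_llm(task)  # proposal call
    for _ in range(max_rounds):
        ok, err = run_code(code)  # try the suggestion
        if ok:
            return code  # the program runs; stop iterating
        # refinement call: paste the error output back to the model
        code = ask_llm(f"{task}\n\nThis attempt failed with:\n{err}\nPlease fix it.")
    return code
```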
Given the importance of refinement calls, we produce a novel benchmark dataset of buggy code, Buggy DS-1000, to test the ability of LLMs to fix bugs. To create the dataset, we start with DS-1000, a popular benchmark for data science tasks with a comprehensive evaluation suite and relatively low false positive and false negative rates (~5.7% each). We then introduce a variety of non-trivial errors into the code. Our current method introduces the bugs deterministically.
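One way to inject bugs deterministically is to rewrite the solution's AST; the sketch below flips `<` to `<=` in comparisons. This particular rule is illustrative and not necessarily one of the mutations the generator applies:

```python
import ast

class LtToLte(ast.NodeTransformer):
    """Deterministically rewrite `<` to `<=` in every comparison."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op
                    for op in node.ops]
        return node

src = "valid = [x for x in data if x < threshold]"
buggy = ast.unparse(LtToLte().visit(ast.parse(src)))
print(buggy)  # valid = [x for x in data if x <= threshold]
```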
In the near future, we plan to support an agentic LLM flow in which one LLM proposes bugs, another attempts to fix them, and we keep only the bug proposals that cannot be fixed in one shot.
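A rough sketch of that planned flow, assuming "one shot" means a single fix attempt; `propose_bug`, `fix_bug`, and `passes_tests` are hypothetical callables, not part of the current tool:

```python
def harvest_hard_bugs(solutions, propose_bug, fix_bug, passes_tests):
    """Keep only the bug proposals the fixer LLM cannot repair in one shot.
    All three callables are hypothetical placeholders."""
    kept = []
    for solution in solutions:
        buggy = propose_bug(solution)   # proposer LLM injects a bug
        attempt = fix_bug(buggy)        # fixer LLM gets one attempt
        if not passes_tests(attempt):   # the one-shot fix failed
            kept.append(buggy)          # so the bug is hard enough to keep
    return kept
```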
```bash
# Basic installation
pip install -e .

# With OpenAI support
pip install -e ".[openai]"

# With Ollama support
pip install -e ".[ollama]"

# With all features
pip install -e ".[openai,ollama]"
```
```bash
python run.py --output-dir outputs/ --model ollama --model-name qwen2.5-coder:14b --num-samples 1 --bugs-per-problem 3
```
The tool loads the DS-1000 dataset directly from Hugging Face (`xlangai/DS-1000`) and caches it locally for faster subsequent runs.
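To inspect the dataset outside the tool, it can be loaded the same way with the `datasets` library (assuming the upstream release's `test` split):

```python
from datasets import load_dataset

# Downloaded once, then served from the local Hugging Face cache
# (~/.cache/huggingface by default).
ds = load_dataset("xlangai/DS-1000", split="test")
print(len(ds), ds.column_names)  # problem count and available fields
```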
The generator can introduce the following types of bugs (an illustrative example follows the list):
- Logic errors in conditionals or calculations
- Off-by-one errors in loops or indexing
- Incorrect usage of library APIs
- Wrong parameters in function calls
- Missing checks for edge cases
- Type errors or conversions
- Variable scope issues
- Incorrect or conflicting imports
- Memory leaks or unbounded resource usage
- Race conditions or concurrency issues
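For example, an off-by-one error of the kind listed above might look like this (a hand-written illustration, not actual generator output):

```python
import numpy as np

def rolling_mean(a: np.ndarray, w: int) -> np.ndarray:
    # Correct bound: range(len(a) - w + 1)
    # Off-by-one bug: the final window is silently dropped
    return np.array([a[i:i + w].mean() for i in range(len(a) - w)])
```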
For each problem in the dataset, the generator does the following (an example output layout is sketched after the list):
- Saves the original problem as JSON
- Saves the original solution code
- Creates a buggy version of the solution
- Creates a `bug_metadata.json` file with information about the introduced bugs
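A hypothetical layout for one problem; apart from `bug_metadata.json`, the directory and file names are illustrative:

```
outputs/
└── problem_0/
    ├── problem.json         # original problem as JSON
    ├── solution.py          # original solution code
    ├── buggy_solution.py    # buggy version of the solution
    └── bug_metadata.json    # details of the introduced bugs
```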
The metadata file contains:
- A list of introduced bugs, each with (see the reading sketch after this list):
  - Bug type
  - Description
  - Line numbers where changes were made
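The metadata might be consumed like this; the field names (`bugs`, `type`, `description`, `lines`) are assumptions based on the list above, so check a generated file for the exact schema:

```python
import json

with open("outputs/problem_0/bug_metadata.json") as f:
    meta = json.load(f)

# Field names are assumed from the list above, not a guaranteed schema.
for bug in meta["bugs"]:
    print(f"{bug['type']} at lines {bug['lines']}: {bug['description']}")
```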