QuixBugs Benchmark

LLM Agent Benchmark

Original Dataset: https://github.com/jkoppel/QuixBugs

This repository includes tools for benchmarking LLM agents (like Claude Code) on program repair tasks. The benchmark suite supports both Python and Java programs and allows you to:

  • Fix programs using LLM agents with minimal one-line changes
  • Run automated testing on fixed programs
  • Score results based on test success and code change minimality
  • Compare performance across different models and approaches

Latest Results:

  • Python: Claude Code achieved 86.5 ± 2.0% success rate with minimal one-line fixes
  • Java: Claude Code achieved 75.0 ± 3.06% success rate with minimal one-line fixes

Quick Start - Python:

python3 experiments_claude_code/orchestrator_claude_code.py --language python && \
python3 run_all_python_tests.py --program-folder experiments_claude_code/fixed_python_programs && \
python3 test_result_scorer.py experiments_claude_code/fixed_python_programs/test_results_*.json

Quick Start - Java:

python3 experiments_claude_code/orchestrator_claude_code.py --language java && \
python3 run_all_java_tests.py --program-folder experiments_claude_code/fixed_java_programs && \
python3 test_result_scorer.py experiments_claude_code/fixed_java_programs/test_results_*.json

LLM Agent QuixBugs Benchmarking Tools

1. problem_solver_claude_code.py

Fix a single program using Claude Code. Supports both Python and Java.

Python Example:

python3 experiments_claude_code/problem_solver_claude_code.py \
    --buggy-program-folder python_programs \
    --fixed-program-folder experiments_claude_code/fixed_python_programs \
    --program-name bitcount \
    --language python

Java Example:

python3 experiments_claude_code/problem_solver_claude_code.py \
    --program-name BITCOUNT \
    --language java

2. orchestrator_claude_code.py

Fix all 40 programs automatically, with progress tracking so an interrupted run can be resumed (see the sketch after the commands below). Supports both languages.

Python:

python3 experiments_claude_code/orchestrator_claude_code.py --language python

Java:

python3 experiments_claude_code/orchestrator_claude_code.py --language java
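
The repository's own resume logic isn't shown here, but the usual pattern behind a resumable orchestrator is to persist the list of completed programs and skip them on the next invocation. A minimal sketch of that pattern (the progress file name and the fix_one helper are hypothetical, not this repository's actual code):

import json
from pathlib import Path

PROGRESS = Path("progress.json")  # hypothetical progress file

def load_done() -> set:
    """Return the set of program names already fixed in earlier runs."""
    return set(json.loads(PROGRESS.read_text())) if PROGRESS.exists() else set()

def orchestrate(programs, fix_one):
    """Call fix_one(name) for each program not yet marked as done."""
    done = load_done()
    for name in programs:
        if name in done:
            continue  # fixed in a previous run; skipped on resume
        fix_one(name)  # e.g. run the single-program solver
        done.add(name)
        PROGRESS.write_text(json.dumps(sorted(done)))  # persist after each success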

Multiple Runs (for experiments):

# Run multiple experiments with different run names
python3 experiments_claude_code/orchestrator_claude_code.py --language java --run_name run_0
python3 experiments_claude_code/orchestrator_claude_code.py --language java --run_name run_1

# Test each run's results
python3 run_all_java_tests.py --program-folder experiments_claude_code/fixed_java_programs_run_0
python3 run_all_java_tests.py --program-folder experiments_claude_code/fixed_java_programs_run_1
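
To script several runs end to end, a small driver can chain the two commands above. This sketch assumes, as the folder names above suggest, that --run_name run_N writes fixes to fixed_java_programs_run_N:

import subprocess

for i in range(2):  # run_0 and run_1, matching the commands above
    run = f"run_{i}"
    # Fix all programs for this run, then test the run's output folder.
    subprocess.run(
        ["python3", "experiments_claude_code/orchestrator_claude_code.py",
         "--language", "java", "--run_name", run],
        check=True,  # stop if the orchestrator fails
    )
    subprocess.run(
        ["python3", "run_all_java_tests.py",
         "--program-folder", f"experiments_claude_code/fixed_java_programs_{run}"],
        check=True,
    )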

3. run_all_python_tests.py (Python)

Test fixed Python programs and generate results with diff analysis.

python3 run_all_python_tests.py --program-folder experiments_claude_code/fixed_python_programs

4. run_all_java_tests.py (Java)

Test fixed Java programs using Gradle. Features:

  • Automatic compilation and testing with JUnit
  • Support for buggy, correct, and fixed program versions
  • HTML test report parsing for accurate test counts (see the sketch after this list)
  • Diff analysis against correct versions
  • Handles package declarations and dependencies
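
As a rough illustration of the report-parsing step (an assumption about Gradle's standard HTML report layout, not this repository's actual code), the summary counters can be pulled out of index.html like so:

import re
from pathlib import Path

def parse_gradle_summary(index_html: Path) -> dict:
    """Read test/failure counts from a Gradle HTML test report.

    Assumes Gradle's standard layout, where each summary box looks like
    <div class="infoBox" id="tests"><div class="counter">6</div>...</div>.
    """
    html = index_html.read_text()
    counts = {}
    for box in ("tests", "failures"):
        m = re.search(rf'id="{box}">\s*<div class="counter">\s*(\d+)', html)
        counts[box] = int(m.group(1)) if m else 0
    return counts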

Test Fixed Java Programs:

python3 run_all_java_tests.py --program-folder experiments_claude_code/fixed_java_programs

Test Specific Programs:

python3 run_all_java_tests.py --programs BITCOUNT GCD QUICKSORT --program-folder experiments_claude_code/fixed_java_programs

Options:

  • --program-folder: Folder containing programs to test
  • --programs: Specific programs to test (default: all 40)
  • --timeout: Test timeout in seconds (default: 10)
  • --output: Output JSON file path
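
For example, to test three programs with a longer timeout and a custom results path (the output filename here is hypothetical):

python3 run_all_java_tests.py \
    --programs BITCOUNT GCD QUICKSORT \
    --program-folder experiments_claude_code/fixed_java_programs \
    --timeout 30 \
    --output custom_test_results.json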

5. test_result_scorer.py

Score results against two criteria: all tests pass and exactly one line was changed. Works for both Python and Java results.

python3 test_result_scorer.py experiments_claude_code/fixed_python_programs/test_results_2025_08_30_081207.json
# or
python3 test_result_scorer.py experiments_claude_code/fixed_java_programs/test_results_2025_08_30_073722.json
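
As a rough sketch of these criteria (not the repository's actual implementation), the changed-line count can be taken from a unified diff between the buggy and fixed sources:

import difflib

def count_changed_lines(buggy_src: str, fixed_src: str) -> int:
    """Count lines that differ between the buggy and fixed sources."""
    removed = added = 0
    for line in difflib.unified_diff(
            buggy_src.splitlines(), fixed_src.splitlines(), lineterm=""):
        if line.startswith("-") and not line.startswith("---"):
            removed += 1  # line deleted or altered by the fix
        elif line.startswith("+") and not line.startswith("+++"):
            added += 1  # line added or altered by the fix
    return max(removed, added)

def is_success(all_tests_pass: bool, buggy_src: str, fixed_src: str) -> bool:
    # Success = every test passes AND exactly one line changed.
    return all_tests_pass and count_changed_lines(buggy_src, fixed_src) == 1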

Complete Workflow Examples

Python Workflow:

# 1. Fix all Python programs
python3 experiments_claude_code/orchestrator_claude_code.py --language python

# 2. Test the fixed programs
python3 run_all_python_tests.py --program-folder experiments_claude_code/fixed_python_programs

# 3. Score the results
python3 test_result_scorer.py experiments_claude_code/fixed_python_programs/test_results_*.json

# 4. View the scored results
cat experiments_claude_code/fixed_python_programs/test_results_*_scored.json

Java Workflow:

# 1. Fix all Java programs
python3 experiments_claude_code/orchestrator_claude_code.py --language java

# 2. Test the fixed programs
python3 run_all_java_tests.py --program-folder experiments_claude_code/fixed_java_programs

# 3. Score the results
python3 test_result_scorer.py experiments_claude_code/fixed_java_programs/test_results_*.json

# 4. View the scored results
cat experiments_claude_code/fixed_java_programs/test_results_*_scored.json

Original Benchmark

For detailed information about the original QuixBugs benchmark, test usage, and program structure, see the original QuixBugs repository.

About

A multi-lingual program repair benchmark set based on the Quixey Challenge.