A Framework to Evaluate LLM-Generated Code via Real-World Problem Statements
Welcome to Efficiency Benchmark for LLM-Generated Code, a project designed to evaluate the performance of large language models (LLMs) like GPT-4, CodeLlama, and DeepSeek in solving algorithmic programming tasks. While existing benchmarks such as HumanEval and MBPP focus primarily on correctness, this project introduces a novel approach by incorporating efficiency metrics like execution time and memory consumption into the evaluation process.
The benchmark is tailored for competitive programming scenarios, where both correctness and optimality are critical. By leveraging curated datasets, stress test cases, and expert human-written solutions, this project aims to push the boundaries of automated code generation evaluation.
Diverse Dataset: Includes 16 algorithmic categories (e.g., Graphs, DP, Trees) sourced from platforms like LeetCode and online contests.
Stress Testing: Test cases range from basic functionality to large-scale inputs (Level 0 to Level 3).
Efficiency Metrics: Measures execution time (ms) and memory usage (MB) alongside functional correctness.
Dynamic Evaluation: Runs LLM-generated solutions in sandboxed environments for secure and reproducible testing (see the sandbox sketch after this list).
Scoring System: Combines correctness and efficiency into a unified score for fair comparison.
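For illustration only, a sandboxed run can be approximated by executing a candidate solution in a separate process with a wall-clock time limit; the script name and the 5-second limit below are assumptions, not the repository's actual harness.

```python
# Illustrative sketch, not the project's actual harness: run a generated
# solution in an isolated subprocess with a wall-clock time limit.
# "solution.py" and the 5-second limit are assumptions.
import subprocess

def run_sandboxed(solution_path: str, stdin_data: str, timeout_s: float = 5.0):
    try:
        result = subprocess.run(
            ["python", solution_path],
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return result.stdout, result.returncode
    except subprocess.TimeoutExpired:
        return None, "timeout"

# Example: output, status = run_sandboxed("solution.py", "3\n1 2 3\n")
```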
This project introduces a carefully structured benchmark to evaluate the performance of LLM-generated code. The design emphasizes diversity, scalability, and robustness to ensure comprehensive testing.
Includes 16 task types such as Dynamic Programming, Graph Algorithms, Linked Lists, Sliding Window, Trees, Backtracking, and more.
Covers a wide range of programming paradigms and problem-solving techniques.
Test Case Levels:
- Level 0: Basic functionality with small inputs (e.g., n ≤ 10).
- Level 1: Edge cases with moderate inputs (e.g., n ≤ 100).
- Level 2: Large inputs for scalability testing (e.g., n ≤ 10^4).
- Level 3: Stress tests with maximum constraints (e.g., n ≤ 10^6).
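As a rough illustration, the four levels can be thought of as a mapping from level to an input-size cap; the structure below is hypothetical and only mirrors the bounds listed above, not the repository's actual test-case format.

```python
# Hypothetical representation of the stress-test levels; the bounds mirror the
# list above, but the repository's actual test-case format may differ.
TEST_LEVELS = {
    0: {"description": "basic functionality", "max_n": 10},
    1: {"description": "edge cases", "max_n": 100},
    2: {"description": "scalability", "max_n": 10**4},
    3: {"description": "stress / maximum constraints", "max_n": 10**6},
}
```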
Problems are curated from LeetCode and competitive programming platforms such as Codeforces and CodeChef. Ground-truth solutions are written by expert humans to serve as optimal-performance baselines.
- Functional Correctness: Pass/fail rate across all test cases.
- Execution Time: Measured in milliseconds using Python's time.time().
- Memory Usage: Reported in megabytes (MB), with analysis possible via Python's tracemalloc.
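As a minimal sketch of how these measurements can be taken with the APIs mentioned above (the repository's compute script may implement this differently; `solve` and `test_input` are placeholders):

```python
# Minimal measurement sketch using time.time() and tracemalloc, as cited above.
# `solve` and `test_input` are placeholders, not identifiers from this repo.
import time
import tracemalloc

def measure(solve, test_input):
    tracemalloc.start()
    start = time.time()
    output = solve(test_input)
    elapsed_ms = (time.time() - start) * 1000
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return output, elapsed_ms, peak_bytes / (1024 * 1024)  # time in ms, memory in MB
```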
Combines correctness and efficiency into a unified score: `Score = Functional Correctness × Efficiency Score`
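One possible reading of this formula, with an assumed efficiency normalization (reference time divided by measured time, capped at 1.0), is sketched below; the repository may define the efficiency score differently.

```python
# Hedged sketch: the efficiency normalization here is an assumption, not the
# repository's documented formula.
def unified_score(pass_rate: float, ref_time_ms: float, measured_time_ms: float) -> float:
    if measured_time_ms <= 0:
        return 0.0
    efficiency = min(1.0, ref_time_ms / measured_time_ms)
    return pass_rate * efficiency

# Example: passing 90% of tests while running twice as slow as the human
# reference would give 0.9 * 0.5 = 0.45.
```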
This benchmark design ensures that solutions are evaluated not only for their correctness but also for their ability to handle large-scale inputs efficiently, simulating real-world competitive programming scenarios.
You will need access to publicly available LLM models on Hugging Face.
- Clone this repository to your local machine:
git clone https://github.com/Megh-Zyke/Efficiency-Benchmark.git
- Install the dependencies needed to run this repository:
pip install -r requirements.txt
- From the src directory, run openSourceLLMoutput.py to generate inference outputs from your chosen LLM.
- Make sure your Hugging Face access token is configured and that you have been granted access to the models you want to use.
python src/openSourceLLMoutput.py
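If the script expects an authenticated session, one common way to supply your token is via `huggingface_hub` (an assumption; the script may instead read a token from an environment variable or a prior `huggingface-cli login`):

```python
# Assumption: authenticate with your Hugging Face access token before running
# the inference script. The token string below is a placeholder.
from huggingface_hub import login

login(token="hf_your_access_token_here")
```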
- Inference outputs are saved to the data directory. Next, run the compute script from the src directory:
python src/compute.py
Results will be stored in the results directory with all the necessary information.
This project is a work in progress. It’s built with the idea that anyone can jump in, explore, and make it better. Whether you're a student experimenting with LLMs, someone prepping for coding interviews, or just curious about how models handle real-world problems, you’re more than welcome to contribute.
- Add More Problems
- Got a cool coding problem or something tricky from LeetCode/CodeChef? Add it in—test cases and all!
- Improve the Solutions
- See a better way to solve a problem? Optimize or refactor existing ground-truth implementations.
- Expand the Test Cases
- Think of edge cases or large inputs that might break the model? Throw them in—we want to see where things fail!
- Suggest Better Metrics
- If you have ideas on how to score models better—faster code, lower memory, cleaner logic—we’re all ears.
Feel free to fork the repo, experiment, and open a pull request whenever you’re ready. No contribution is too small.
