A Framework to Evaluate LLM-Generated Code via Real-World Problem Statements
Welcome to Efficiency Benchmark for LLM-Generated Code, a project designed to evaluate the performance of large language models (LLMs) like GPT-4, CodeLlama, and DeepSeek in solving algorithmic programming tasks. While existing benchmarks such as HumanEval and MBPP focus primarily on correctness, this project introduces a novel approach by incorporating efficiency metrics like execution time and memory consumption into the evaluation process.
The benchmark is tailored for competitive programming scenarios, where both correctness and optimality are critical. By leveraging curated datasets, stress test cases, and expert human-written solutions, this project aims to push the boundaries of automated code generation evaluation.
Diverse Dataset: Includes 16 algorithmic categories (e.g., Graphs, DP, Trees) sourced from platforms like LeetCode and online contests.
Stress Testing: Test cases range from basic functionality to large-scale inputs (Level 0 to Level 3).
Efficiency Metrics: Measures execution time (ms) and memory usage (MB) alongside functional correctness.
Dynamic Evaluation: Runs LLM-generated solutions in sandboxed environments for secure and reproducible testing (see the sandbox sketch after this list).
Scoring System: Combines correctness and efficiency into a unified score for fair comparison.
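For illustration only, a sandboxed run can be approximated by executing a candidate solution in a separate process with a wall-clock time limit; the script name and the 5-second limit below are assumptions, not the repository's actual harness.

```python
# Illustrative sketch, not the project's actual harness: run a generated
# solution in an isolated subprocess with a wall-clock time limit.
# "solution.py" and the 5-second limit are assumptions.
import subprocess

def run_sandboxed(solution_path: str, stdin_data: str, timeout_s: float = 5.0):
    try:
        result = subprocess.run(
            ["python", solution_path],
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return result.stdout, result.returncode
    except subprocess.TimeoutExpired:
        return None, "timeout"

# Example: output, status = run_sandboxed("solution.py", "3\n1 2 3\n")
```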
This project introduces a carefully structured benchmark to evaluate the performance of LLM-generated code. The design emphasizes diversity, scalability, and robustness to ensure comprehensive testing.
Includes 16 task types such as Dynamic Programming, Graph Algorithms, Linked Lists, Sliding Window, Trees, Backtracking, and more.
Covers a wide range of programming paradigms and problem-solving techniques.
Test Case Levels:
- Level 0: Basic functionality with small inputs (e.g., n ≤ 10).
- Level 1: Edge cases with moderate inputs (e.g., n ≤ 100).
- Level 2: Large inputs for scalability testing (e.g., n ≤ 10^4).
- Level 3: Stress tests with maximum constraints (e.g., n ≤ 10^6).
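As a rough illustration, the four levels can be thought of as a mapping from level to an input-size cap; the structure below is hypothetical and only mirrors the bounds listed above, not the repository's actual test-case format.

```python
# Hypothetical representation of the stress-test levels; the bounds mirror the
# list above, but the repository's actual test-case format may differ.
TEST_LEVELS = {
    0: {"description": "basic functionality", "max_n": 10},
    1: {"description": "edge cases", "max_n": 100},
    2: {"description": "scalability", "max_n": 10**4},
    3: {"description": "stress / maximum constraints", "max_n": 10**6},
}
```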
Problems are curated from LeetCode and competitive programming platforms such as Codeforces and CodeChef. Ground-truth solutions are written by expert humans to serve as optimal-performance baselines.
- Functional Correctness: Pass/fail rate across all test cases.
- Execution Time: Measured in milliseconds using Python's time.time().
- Memory Usage: Reported in megabytes (MB), with analysis possible via Python's tracemalloc.
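As a minimal sketch of how these measurements can be taken with the APIs mentioned above (the repository's compute script may implement this differently; `solve` and `test_input` are placeholders):

```python
# Minimal measurement sketch using time.time() and tracemalloc, as cited above.
# `solve` and `test_input` are placeholders, not identifiers from this repo.
import time
import tracemalloc

def measure(solve, test_input):
    tracemalloc.start()
    start = time.time()
    output = solve(test_input)
    elapsed_ms = (time.time() - start) * 1000
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return output, elapsed_ms, peak_bytes / (1024 * 1024)  # time in ms, memory in MB
```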
Combines correctness and efficiency into a unified score: `Score = Functional Correctness × Efficiency Score`
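One possible reading of this formula, with an assumed efficiency normalization (reference time divided by measured time, capped at 1.0), is sketched below; the repository may define the efficiency score differently.

```python
# Hedged sketch: the efficiency normalization here is an assumption, not the
# repository's documented formula.
def unified_score(pass_rate: float, ref_time_ms: float, measured_time_ms: float) -> float:
    if measured_time_ms <= 0:
        return 0.0
    efficiency = min(1.0, ref_time_ms / measured_time_ms)
    return pass_rate * efficiency

# Example: passing 90% of tests while running twice as slow as the human
# reference would give 0.9 * 0.5 = 0.45.
```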
This benchmark design ensures that solutions are evaluated not only for their correctness but also for their ability to handle large-scale inputs efficiently, simulating real-world competitive programming scenarios.
You will need access to publicly available LLM models on Hugging Face.
- Clone this repository to your local machine:
git clone https://github.com/Megh-Zyke/Efficiency-Benchmark.git
- Install the dependencies needed to run this repository:
pip install -r requirements.txt
- From the src directory, run openSourceLLMoutput.py to generate inference outputs from your chosen LLM.
- Make sure your Hugging Face access token is configured and that you have been granted access to the models you want to use.
python src/openSourceLLMoutput.py
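If the script expects an authenticated session, one common way to supply your token is via `huggingface_hub` (an assumption; the script may instead read a token from an environment variable or a prior `huggingface-cli login`):

```python
# Assumption: authenticate with your Hugging Face access token before running
# the inference script. The token string below is a placeholder.
from huggingface_hub import login

login(token="hf_your_access_token_here")
```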
- Inference outputs are saved to the data directory. Next, run the compute script from the src directory:
python src/compute.py
Results will be stored in the results directory with all the necessary information.
This project is a work in progress. It’s built with the idea that anyone can jump in, explore, and make it better. Whether you're a student experimenting with LLMs, someone prepping for coding interviews, or just curious about how models handle real-world problems, you’re more than welcome to contribute.
- Add More Problems
- Got a cool coding problem or something tricky from LeetCode/CodeChef? Add it in—test cases and all!
- Improve the Solutions
- See a better way to solve a problem? Optimize or refactor existing ground-truth implementations.
- Expand the Test Cases
- Think of edge cases or large inputs that might break the model? Throw them in—we want to see where things fail!
- Suggest Better Metrics
- If you have ideas on how to score models better—faster code, lower memory, cleaner logic—we’re all ears.
Feel free to fork the repo, experiment, and open a pull request whenever you’re ready. No contribution is too small.
