openbench

Provider-agnostic, open-source evaluation infrastructure for language models 🚀

openbench provides standardized, reproducible benchmarking for LLMs across 30+ evaluation suites (and growing) spanning knowledge, math, reasoning, coding, science, reading comprehension, health, long-context recall, and graph reasoning, with first-class support for your own local evals to preserve privacy. It works with any model provider: Groq, OpenAI, Anthropic, Cohere, Google, AWS Bedrock, Azure, local models via Ollama, Hugging Face, and 30+ other providers.

🚧 Alpha Release

We're building in public! This is an alpha release - expect rapid iteration. The first stable release is coming soon.

Features

🎯 35+ Benchmarks: MMLU, GPQA, HumanEval, SimpleQA, competition math (AIME, HMMT), SciCode, GraphWalks, and more
🔧 Simple CLI: bench list, bench describe, bench eval (also available as openbench), -M/-T flags for model/task args, --debug mode for eval-retry, experimental benchmarks with --alpha flag
🏗️ Built on inspect-ai: Industry-standard evaluation framework
📊 Extensible: Easy to add new benchmarks and metrics
🤖 Provider-agnostic: Works with 30+ model providers out of the box
🛠️ Local Eval Support: Run private benchmarks with bench eval <path>
📤 Hugging Face Integration: Push evaluation results directly to Hugging Face datasets

🏃 Speedrun: Evaluate a Model in 60 Seconds

Prerequisite: Install uv

# Create a virtual environment and install openbench (30 seconds)
uv venv
source .venv/bin/activate
uv pip install openbench

# Set your API key (any provider!)
export GROQ_API_KEY=your_key  # or OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.

# Run your first eval (30 seconds)
bench eval mmlu --model groq/llama-3.3-70b-versatile --limit 10

# That's it! 🎉 Check results in ./logs/ or view them in an interactive UI:
bench view

Optional Plugins

Some benchmark suites ship as standalone plugins so they can iterate independently from the core distribution. Install them alongside openbench with uv pip and they will automatically appear in bench list via the plugin entry point system.

openbench-cyber: adds the CTI-Bench family plus CyBench (agentic CTF challenges). This plugin ships real exploit code and forensics artifacts that routinely trigger anti-malware scanners, so we require a deliberate, manual install after you read the security guidance.
- Install explicitly: uv pip install "openbench-cyber @ git+https://github.com/groq/openbench-cyber.git@d93522ba70392cdceddb83f762c78a68923e70da"
- Review the plugin README for sandbox requirements and risk acknowledgements before using it.

Using Different Providers

# Groq (blazing fast!)
bench eval gpqa_diamond --model groq/meta-llama/llama-4-maverick-17b-128e-instruct

# OpenAI
bench eval humaneval --model openai/o3-2025-04-16

# Anthropic
bench eval simpleqa --model anthropic/claude-sonnet-4-20250514

# Google
bench eval mmlu --model google/gemini-2.5-pro

# Local models with Ollama
bench eval musr --model ollama/llama3.1:70b

# Helicone AI Gateway
bench eval mmlu --model helicone/gpt-4o

# Hugging Face Inference Providers
bench eval mmlu --model huggingface/gpt-oss-120b:groq

# OpenRouter
bench eval gpqa_diamond --model openrouter/deepseek/deepseek-chat-v3.1

# 30+ providers supported - see full list below

Supported Providers

openbench supports 30+ model providers through Inspect AI. Set the appropriate API key environment variable and you're ready to go:

Provider	Environment Variable	Example Model String
AI21 Labs	`AI21_API_KEY`	`ai21/model-name`
Anthropic	`ANTHROPIC_API_KEY`	`anthropic/model-name`
AWS Bedrock	AWS credentials	`bedrock/model-name`
Azure	`AZURE_OPENAI_API_KEY`	`azure/<deployment-name>`
Baseten	`BASETEN_API_KEY`	`baseten/model-name`
Cerebras	`CEREBRAS_API_KEY`	`cerebras/model-name`
Cohere	`COHERE_API_KEY`	`cohere/model-name`
Crusoe	`CRUSOE_API_KEY`	`crusoe/model-name`
DeepInfra	`DEEPINFRA_API_KEY`	`deepinfra/model-name`
Friendli	`FRIENDLI_TOKEN`	`friendli/model-name`
Google	`GOOGLE_API_KEY`	`google/model-name`
Groq	`GROQ_API_KEY`	`groq/model-name`
Helicone	`HELICONE_API_KEY`	`helicone/model-name`
Hugging Face	`HF_TOKEN`	`huggingface/model-name`
Hyperbolic	`HYPERBOLIC_API_KEY`	`hyperbolic/model-name`
Lambda	`LAMBDA_API_KEY`	`lambda/model-name`
MiniMax	`MINIMAX_API_KEY`	`minimax/model-name`
Mistral	`MISTRAL_API_KEY`	`mistral/model-name`
Moonshot	`MOONSHOT_API_KEY`	`moonshot/model-name`
Nebius	`NEBIUS_API_KEY`	`nebius/model-name`
Nous Research	`NOUS_API_KEY`	`nous/model-name`
Novita AI	`NOVITA_API_KEY`	`novita/model-name`
Ollama	None (local)	`ollama/model-name`
OpenAI	`OPENAI_API_KEY`	`openai/model-name`
OpenRouter	`OPENROUTER_API_KEY`	`openrouter/model-name`
Parasail	`PARASAIL_API_KEY`	`parasail/model-name`
Perplexity	`PERPLEXITY_API_KEY`	`perplexity/model-name`
Reka	`REKA_API_KEY`	`reka/model-name`
SambaNova	`SAMBANOVA_API_KEY`	`sambanova/model-name`
SiliconFlow	`SILICONFLOW_API_KEY`	`siliconflow/model-name`
Together AI	`TOGETHER_API_KEY`	`together/model-name`
Vercel AI Gateway	`AI_GATEWAY_API_KEY`	`vercel/creator-name/model-name`
W&B Inference	`WANDB_API_KEY`	`wandb/model-name`
vLLM	None (local)	`vllm/model-name`
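
As a rough pre-flight check, the table above can be turned into a small lookup that reports which environment variable a model string needs. This is only a sketch: `PROVIDER_ENV_VARS`, `required_env_var`, and `is_key_configured` are hypothetical helper names, not part of openbench, the mapping copies just a subset of the table, and it assumes the provider is the first `/`-separated segment of the model string.

```python
import os
from typing import Mapping, Optional

# Subset of the provider table above: provider prefix -> API key variable.
# None marks local providers that need no key. (Illustrative, not exhaustive.)
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "groq": "GROQ_API_KEY",
    "huggingface": "HF_TOKEN",
    "ollama": None,
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "vllm": None,
}


def required_env_var(model: str) -> Optional[str]:
    """Return the API key variable for a model string, or None if no key is needed.

    Assumes the provider is the first '/'-separated segment of the model string.
    """
    provider = model.split("/", 1)[0]
    if provider not in PROVIDER_ENV_VARS:
        raise KeyError(f"provider {provider!r} is not in this illustrative subset")
    return PROVIDER_ENV_VARS[provider]


def is_key_configured(model: str, environ: Mapping[str, str] = os.environ) -> bool:
    """True if the provider needs no key or its key variable is set and non-empty."""
    var = required_env_var(model)
    return var is None or bool(environ.get(var))


if __name__ == "__main__":
    for model in ("groq/llama-3.3-70b-versatile", "ollama/llama3.1:70b"):
        print(model, "->", required_env_var(model), "configured:", is_key_configured(model))
```

Running a check like this before a long evaluation can surface a missing key early, rather than partway through the run.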

Available Benchmarks

Here are the currently available benchmarks. For an up-to-date list use bench list.

Note: Benchmark names are case-sensitive in the CLI.

Category	Benchmarks
Knowledge	MMLU (57 subjects), MMLU-Pro, GPQA (graduate-level), SuperGPQA (285 disciplines), TUMLU (9 languages), OpenBookQA, HLE (Humanity's Last Exam - 2,500 questions from 1,000+ experts), HLE_text (text-only version)
Coding	HumanEval (164 problems), MBPP, SciCode (alpha), GMCQ, JSONSchemaBench, Exercism (code agent eval across 5 languages)
Math	AIME 2023-2025, HMMT Feb 2023-2025, BRUMO 2025, MATH (competition-level problems), MATH-500 (challenging subset), MGSM (multilingual grade school math), MGSM_en (English), MGSM_latin (5 languages), MGSM_non_latin (6 languages), OTIS Mock AIME 2024-2025
Reasoning	SimpleQA (factuality), MuSR, MuSR murder_mysteries, MuSR object_placements, MuSR team_allocation, DROP (discrete reasoning over paragraphs), GraphWalks (multi-hop reasoning), BrowseComp (browsing agents), MMMU, MMMU_MCQ, MMMU_OPEN, MMMU_PRO, MMMU_PRO_VISION, MMMU subsets: accounting, agriculture, architecture_and_engineering, art, art_theory, basic_medical_science, biology, chemistry, clinical_medicine, design, diagnostics_and_laboratory_medicine, electronics, energy_and_power, finance, geography, history, literature, manage, marketing, materials, math, mechanical_engineering, music, pharmacy, physics, psychology, public_health, sociology
Long Context	OpenAI MRCR (multiple needle retrieval), OpenAI MRCR_2n (2 needle), OpenAI MRCR_4n (4 needle), OpenAI MRCR_8n (8 needle)
Healthcare	HealthBench (open-ended healthcare eval), HealthBench_hard (challenging variant), HealthBench_consensus (consensus variant)
Cybersecurity (requires `openbench-cyber` plugin)	CTI-Bench ATE (MITRE ATT&CK technique extraction), CTI-Bench MCQ (knowledge questions on CTI standards and best practices), CTI-Bench RCM (CVE to CWE vulnerability mapping), CTI-Bench VSP (CVSS score calculation), cybench (40 tasks from CTF competitions)
Community	ClockBench, DetailBench
MCP	LiveMCPBench (70 MCP servers and 527 tools)
Jailbreak	SafeMT_M2S (single turn conversion of SafeMT_Attack_600), CoSafe_M2S (single turn conversion of Cosafe_300), MHJ_M2S (single turn conversion of MHJ)

Configuration

# Set your API keys
export GROQ_API_KEY=your_key
export HF_TOKEN=your_key
export OPENAI_API_KEY=your_key  # Optional
export HELICONE_API_KEY=your_key  # For Helicone AI Gateway
export OPENROUTER_API_KEY=your_key  # For OpenRouter

# Set default model
export BENCH_MODEL=groq/openai/gpt-oss-20b

Commands and Options

For a complete list of all commands and options, run: bench --help

Command	Description
`bench` or `openbench`	Show main menu with available commands
`bench list`	List available evaluations, models, and flags
`bench eval <benchmark>`	Run benchmark evaluation on a model
`bench eval-retry`	Retry a failed evaluation
`bench view`	View logs from previous benchmark runs
`bench eval <path>`	Run your local/private evals built with Inspect AI
`bench cache`	Manage OpenBench caches (info/ls/clear)

Cache Command

The bench cache command helps manage OpenBench's caches, particularly for LiveMCPBench. It provides three subcommands:

# Show cache information and sizes
bench cache info

# List all cache contents
bench cache ls

# List a specific cache with tree view
bench cache ls --type livemcpbench --tree

# Clear specific cache completely
bench cache clear --type livemcpbench --all

All cache data is stored under ~/.openbench. The cache command helps you monitor and manage this storage.
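
If you want to monitor that directory from a script, a minimal sketch along these lines may help. It does not call or reproduce bench cache info; it simply walks ~/.openbench with the standard library, and the helper names `directory_size_bytes` and `format_size` are hypothetical.

```python
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root does not exist)."""
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def format_size(num_bytes: int) -> str:
    """Render a byte count with binary units (B, KiB, MiB, GiB)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GiB"


if __name__ == "__main__":
    # Cache location as stated above; adjust if your setup differs.
    cache_root = Path.home() / ".openbench"
    print(cache_root, format_size(directory_size_bytes(cache_root)))
```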

Key `eval` Command Common Configuration Options

Option	Environment Variable	Default	Description
`-M <args>`	None	None	Pass model/provider-specific arguments (e.g., `-M only=groq`)
`-T <args>`	None	None	Pass task-specific arguments to the benchmark
`--model`	`BENCH_MODEL`	`groq/openai/gpt-oss-20b`	Model(s) to evaluate
`--epochs`	`BENCH_EPOCHS`	`1`	Number of epochs to run each evaluation
`--epochs-reducer`	`BENCH_EPOCHS_REDUCER`	None	Reducer(s) applied when aggregating epoch scores.
`--max-connections`	`BENCH_MAX_CONNECTIONS`	`10`	Maximum parallel requests to model
`--temperature`	`BENCH_TEMPERATURE`	`0.6`	Model temperature
`--top-p`	`BENCH_TOP_P`	`1.0`	Model top-p
`--max-tokens`	`BENCH_MAX_TOKENS`	`None`	Maximum tokens for model response
`--seed`	`BENCH_SEED`	`None`	Seed for deterministic generation
`--limit`	`BENCH_LIMIT`	`None`	Limit evaluated samples (number or start,end)
`--logfile`	`BENCH_OUTPUT`	`None`	Output file for results
`--sandbox`	`BENCH_SANDBOX`	`None`	Environment to run evaluation (local/docker)
`--timeout`	`BENCH_TIMEOUT`	`10000`	Timeout for each API request (seconds)
`--display`	`BENCH_DISPLAY`	`None`	Display type (full/conversation/rich/plain/none)
`--reasoning-effort`	`BENCH_REASONING_EFFORT`	`None`	Reasoning effort level (low/medium/high)
`--json`	None	`False`	Output results in JSON format
`--log-format`	`BENCH_LOG_FORMAT`	`eval`	Output logging format (eval/json)
`--hub-repo`	`BENCH_HUB_REPO`	`None`	Push results to a Hugging Face Hub dataset
`--keep-livemcp-root`	`BENCH_KEEP_LIVEMCP_ROOT`	`False`	Allow preservation of root data after livemcpbench eval runs
`--code-agent`	`BENCH_CODE_AGENT`	`opencode`	Select code agent for exercism tasks
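
The table pairs most flags with an environment variable and a default. The sketch below illustrates one plausible way such a value could be resolved, and how a `--limit` value written as a number or as start,end could be parsed. It assumes the common CLI convention that an explicit flag wins over the environment variable, which wins over the default; openbench's actual resolution logic is not shown here, and `resolve_option` and `parse_limit` are hypothetical names.

```python
from typing import Mapping, Optional, Tuple, Union


def resolve_option(
    cli_value: Optional[str],
    env_name: Optional[str],
    default: Optional[str],
    environ: Mapping[str, str],
) -> Optional[str]:
    """Pick the effective value: CLI flag, then environment variable, then default.

    This precedence is an assumption based on common CLI behaviour.
    """
    if cli_value is not None:
        return cli_value
    if env_name is not None and env_name in environ:
        return environ[env_name]
    return default


def parse_limit(raw: str) -> Union[int, Tuple[int, int]]:
    """Parse a --limit style value: a single number, or 'start,end' as a pair."""
    if "," in raw:
        start, end = (int(part) for part in raw.split(",", 1))
        return (start, end)
    return int(raw)


if __name__ == "__main__":
    env = {"BENCH_TEMPERATURE": "0.2"}
    # No flag given, so the environment variable overrides the 0.6 default.
    print(resolve_option(None, "BENCH_TEMPERATURE", "0.6", env))
    print(parse_limit("10"), parse_limit("5,20"))
```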

Grader Information

Some benchmarks use a grader model to score the model's performance. This requires an additional API key for the grader model.

To run these benchmarks, you'll need to export your OPENAI_API_KEY:

export OPENAI_API_KEY=your_openai_key

The following benchmarks use a grader model:

Benchmark	Default Grader Model
`simpleqa`	`openai/gpt-4.1-2025-04-14`
`hle`	`openai/o3-mini-2025-01-31`
`hle_text`	`openai/o3-mini-2025-01-31`
`browsecomp`	`openai/gpt-4.1-2025-04-14`
`healthbench`	`openai/gpt-4.1-2025-04-14`
`math`	`openai/gpt-4-turbo-preview`
`math_500`	`openai/gpt-4-turbo-preview`
`detailbench`	`gpt-5-mini-2025-08-07`
`livemcpbench`	`openai/gpt-4.1-mini-2025-04-14`
`otis_mock_aime`	`openai/gpt-4.1-mini-2025-04-14`

Building Your Own Evals

openbench is built on Inspect AI. To create custom evaluations, check out their excellent documentation.

Quick Eval: Run from Path

For one-off or private evaluations, point openbench directly at your eval:

bench eval /path/to/my_eval.py --model groq/llama-3.3-70b-versatile

Plugin System: Distribute as Packages

openbench supports a plugin system via Python entry points. Package your benchmarks and distribute them independently:

# pyproject.toml
[project.entry-points."openbench.benchmarks"]
my_benchmark = "my_pkg.metadata:get_benchmark_metadata"

After pip install my-benchmark-package, your benchmark appears in bench list and works with all CLI commands. Perfect for:

- Sharing benchmarks across teams
- Versioning evaluations independently
- Overriding built-in benchmarks with custom implementations
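
To check which plugin packages are registered in the current environment, you can inspect the `openbench.benchmarks` entry-point group with the standard library. This is a sketch that only reads packaging metadata; it does not load the plugins or show how openbench itself consumes them, and `installed_benchmark_plugins` is a hypothetical helper name.

```python
from importlib.metadata import entry_points
from typing import Dict

GROUP = "openbench.benchmarks"  # entry-point group named in the pyproject.toml snippet


def installed_benchmark_plugins(group: str = GROUP) -> Dict[str, str]:
    """Map each entry-point name in the group to its 'module:attr' target."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are filtered with select().
        selected = eps.select(group=group)
    else:
        # Older Pythons return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    plugins = installed_benchmark_plugins()
    if not plugins:
        print("no plugins registered under", GROUP)
    for name, target in sorted(plugins.items()):
        print(f"{name} -> {target}")
```

An empty result usually means no plugin package is installed in the active virtual environment.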

📖 Full guide: Extending openbench

Exporting Logs to Hugging Face

openbench can export logs to a Hugging Face Hub dataset. This is useful if you want to share your results with the community or use them for further analysis.

export HF_TOKEN=<your-huggingface-token>

bench eval mmlu --model groq/llama-3.3-70b-versatile --limit 10 --hub-repo <your-username>/openbench-logs

This will export the logs to a Hugging Face Hub dataset with the name openbench-logs.

FAQ

How does openbench differ from Inspect AI?

openbench provides:

- Reference implementations of 20+ major benchmarks with consistent interfaces
- Shared utilities for common patterns (math scoring, multi-language support, etc.)
- Curated scorers that work across different eval types
- CLI tooling optimized for running standardized benchmarks

Think of it as a benchmark library built on Inspect's excellent foundation.

Why not just use Inspect AI, lm-evaluation-harness, or lighteval?

Different tools for different needs! openbench focuses on:

- Shared components: Common scorers, solvers, and datasets across benchmarks reduce code duplication
- Clean implementations: Each eval is written for readability and reliability
- Developer experience: Simple CLI, consistent patterns, easy to extend

We built openbench because we needed evaluation code that was easy to understand, modify, and trust. It's a curated set of benchmarks built on Inspect AI's excellent foundation.

How can I run `bench` outside of the `uv` environment?

If you want bench to be available outside of uv, you can run the following command:

uv run pip install -e .

I'm running into an issue when downloading a dataset from HuggingFace - how do I fix it?

Some evaluations require you to log in to Hugging Face before the dataset can be downloaded. If bench prompts you to log in, or throws "gated" errors, set the HF_TOKEN environment variable:

export HF_TOKEN="<HUGGINGFACE_TOKEN>"

For details, see the Hugging Face documentation on authentication.

Development

For development work, you'll need to clone the repository:

# Clone the repo
git clone https://github.com/groq/openbench.git
cd openbench

# Set up with uv
uv venv && uv sync --dev
source .venv/bin/activate

# CRITICAL: Install pre-commit hooks (CI will fail without this!)
pre-commit install

# Run tests
pytest

⚠️ IMPORTANT: You MUST run pre-commit install after setup or CI will fail on your PRs!

Contributing

We welcome contributions! Please see our Contributing Guide for detailed instructions on:

- Setting up the development environment
- Adding new benchmarks and model providers
- Code style and testing requirements
- Submitting issues and pull requests

Reproducibility Statement

As the authors of openbench, we strive to implement this tool's evaluations as faithfully as possible with respect to the original benchmarks themselves.

However, it is expected that developers may observe numerical discrepancies between openbench's scores and the reported scores from other sources.

These numerical differences can be attributed to many reasons, including (but not limited to) minor variations in the model prompts, different model quantization or inference approaches, and repurposing benchmarks to be compatible with the packages used to develop openbench.

As a result, openbench results are meant to be compared with openbench results, not as a universal one-to-one comparison with every external result. For meaningful comparisons, ensure you are using the same version of openbench.
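
One way to make like-for-like comparisons easier is to store the installed package versions next to your scores. The sketch below uses only the standard library; it assumes the distributions are named `openbench` (as in the install command) and `inspect-ai`, and `package_version` and `run_provenance` are hypothetical helper names rather than openbench features.

```python
import json
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Sequence


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def run_provenance(
    packages: Sequence[str] = ("openbench", "inspect-ai"),
) -> Dict[str, Optional[str]]:
    """Collect versions worth recording alongside a set of scores."""
    return {name: package_version(name) for name in packages}


if __name__ == "__main__":
    # Save this next to your results so later comparisons can confirm
    # that both runs used the same openbench version.
    print(json.dumps(run_provenance(), indent=2))
```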

We encourage developers to identify areas of improvement and we welcome open source contributions to openbench.

Acknowledgments

This project would not be possible without:

- Inspect AI - The incredible evaluation framework that powers openbench
- EleutherAI's lm-evaluation-harness - Pioneering work in standardized LLM evaluation
- Hugging Face's lighteval - Excellent evaluation infrastructure

Citation

@software{openbench,
  title = {openbench: Provider-agnostic, open-source evaluation infrastructure for language models},
  author = {Sah, Aarush},
  year = {2025},
  url = {https://openbench.dev}
}

License

MIT

Built with ❤️ by Aarush Sah and the Groq team
