claude-eval

An evaluation tool for Claude Code, using a LLM-as-a-judge simplified approach.

When changing Claude Code memories, commands, agents and models:

How can you know if the change work as expected?
How can you know if the change will NOT break something else?

This tool solves those problems by enabling Eval-driven development for Claude Code.

No complex scoring or ranking — just clear PASSED ✅ / FAILED ❌ results for your evaluation criteria.

It's like TDD for AI.

Prerequisites

Node.js 18+ or Bun
Claude Code installed and configured in your project

Installation

Global Installation (Recommended)

For regular use, install claude-eval globally:

# Using npm
npm install -g claude-eval

# Using bun
bun install -g claude-eval

After global installation, you can use the claude-eval command directly and access the update functionality:

claude-eval --version
claude-eval update

One-time Usage

If you prefer not to install globally, you can run evaluations directly with npx:

npx claude-eval evals/*.yaml

Usage

# Single evaluation
claude-eval evals/say-dont-know-clear-way.yaml

# Multiple evaluations (batch)
claude-eval evals/*.yaml

# Custom concurrency
claude-eval evals/*.yaml --concurrency=3

# Check for updates
claude-eval update

# Show help
claude-eval --help

Evaluation File Format

Evaluation files are YAML documents with the following structure:

prompt: >
  What is the weather for today?

expected_behavior:
  - Just say you don't know in a clear way.
  - Don't give user alternatives.
  - Don't recommend user to research for the answer elsewhere.

prompt: The prompt you would send to Claude Code
expected_behavior: Array of criteria that the response should meet

How It Works

Parse YAML: Loads and validates the evaluation specification
Query Claude: Executes the prompt on Sonnet model, on plan mode
Judge Response: Evaluate the response with Haiku model
Format Results: Displays results with ✅/❌ indicators and summary

Contributing

We welcome contributions! Please:

Open an issue to discuss major changes before starting
Follow existing code style and patterns in the codebase
Add tests for new features and bug fixes
Update documentation as needed
Keep it simple - this tool is intentionally minimal and focused

For bug reports and feature requests, please use the GitHub Issues page.

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
docs		docs
evals		evals
imgs		imgs
src		src
test		test
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
bun.lock		bun.lock
index.ts		index.ts
jest.config.cjs		jest.config.cjs
package.json		package.json
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

claude-eval

Prerequisites

Installation

Global Installation (Recommended)

One-time Usage

Usage

Evaluation File Format

How It Works

Contributing

About

Uh oh!

Releases

Packages

Languages

License

bkper/claude-eval

Folders and files

Latest commit

History

Repository files navigation

claude-eval

Prerequisites

Installation

Global Installation (Recommended)

One-time Usage

Usage

Evaluation File Format

How It Works

Contributing

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages