An evaluation tool for Claude Code that uses a simplified LLM-as-a-judge approach.
When changing Claude Code memories, commands, agents and models:
- How can you know if the change works as expected?
- How can you know if the change will NOT break something else?
This tool solves those problems by enabling Eval-driven development for Claude Code.
No complex scoring or ranking — just clear PASSED ✅ / FAILED ❌ results for your evaluation criteria.
It's like TDD for AI.
- Node.js 18+ or Bun
- Claude Code installed and configured in your project
For regular use, install claude-eval globally:
# Using npm
npm install -g claude-eval
# Using bun
bun install -g claude-eval
After global installation, you can use the claude-eval command directly and access the update functionality:
claude-eval --version
claude-eval update
If you prefer not to install globally, you can run evaluations directly with npx:
npx claude-eval evals/*.yaml
# Single evaluation
claude-eval evals/say-dont-know-clear-way.yaml
# Multiple evaluations (batch)
claude-eval evals/*.yaml
# Custom concurrency
claude-eval evals/*.yaml --concurrency=3
# Check for updates
claude-eval update
# Show help
claude-eval --help
Evaluation files are YAML documents with the following structure:
prompt: >
What is the weather for today?
expected_behavior:
- Just say you don't know in a clear way.
- Don't give user alternatives.
- Don't recommend user to research for the answer elsewhere.
- prompt: The prompt you would send to Claude Code
- expected_behavior: Array of criteria that the response should meet
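To make the expected shape concrete, here is a minimal TypeScript sketch of a check on an already-parsed evaluation file. The names `EvalSpec` and `validateEvalSpec` are hypothetical and are not taken from claude-eval's source; the real tool may validate differently. The sketch assumes both fields are required and non-empty.

```typescript
// Hypothetical type for an evaluation file after YAML parsing.
interface EvalSpec {
  prompt: string;
  expected_behavior: string[];
}

// Checks that a parsed value has the two documented fields and returns a
// cleaned copy. Throws a descriptive error otherwise.
// Assumption: both fields are required and must be non-empty.
function validateEvalSpec(value: unknown): EvalSpec {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    throw new Error("Evaluation file must be a YAML mapping");
  }
  const { prompt, expected_behavior } = value as Record<string, unknown>;

  if (typeof prompt !== "string" || prompt.trim() === "") {
    throw new Error("`prompt` must be a non-empty string");
  }
  if (
    !Array.isArray(expected_behavior) ||
    expected_behavior.length === 0 ||
    !expected_behavior.every((c) => typeof c === "string" && c.trim() !== "")
  ) {
    throw new Error("`expected_behavior` must be a non-empty array of strings");
  }

  // A folded scalar (`prompt: >`) keeps one trailing newline, so trim it.
  return {
    prompt: prompt.trim(),
    expected_behavior: expected_behavior.map((c: string) => c.trim()),
  };
}
```

Passing the object a YAML parser would produce for the example above returns the same prompt without its trailing newline, while an object missing `expected_behavior` throws.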
- Parse YAML: Loads and validates the evaluation specification
- Query Claude: Runs the prompt on the Sonnet model in plan mode
- Judge Response: Evaluates the response with the Haiku model
- Format Results: Displays results with ✅/❌ indicators and summary
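The steps above can be sketched roughly as follows. This is an illustration, not claude-eval's actual implementation: `Deps`, `runEvaluation`, and `formatResults` are hypothetical names, the model calls are injected as plain functions so the flow can be followed without any SDK, and the sketch assumes each criterion is judged separately and that an evaluation passes only when every criterion passes.

```typescript
interface EvalSpec {
  prompt: string;
  expected_behavior: string[];
}

interface CriterionResult {
  criterion: string;
  passed: boolean;
}

// Hypothetical stand-ins for the two model calls. In the real tool these
// would be the Sonnet query and the Haiku judge.
interface Deps {
  query: (prompt: string) => Promise<string>;
  judge: (response: string, criterion: string) => Promise<boolean>;
}

// Query once, then judge the single response against each criterion.
async function runEvaluation(spec: EvalSpec, deps: Deps): Promise<CriterionResult[]> {
  const response = await deps.query(spec.prompt);
  const results: CriterionResult[] = [];
  for (const criterion of spec.expected_behavior) {
    results.push({ criterion, passed: await deps.judge(response, criterion) });
  }
  return results;
}

// One line per criterion, followed by an overall verdict.
// Assumption: the overall result is PASSED only if all criteria passed.
function formatResults(results: CriterionResult[]): string {
  const lines = results.map((r) => `${r.passed ? "✅" : "❌"} ${r.criterion}`);
  const passedCount = results.filter((r) => r.passed).length;
  const verdict = passedCount === results.length ? "PASSED ✅" : "FAILED ❌";
  lines.push(`${verdict} (${passedCount}/${results.length} criteria met)`);
  return lines.join("\n");
}
```

Injecting `query` and `judge` keeps the sketch testable with stubs, which is the same idea the tool applies to Claude Code itself: fix the inputs, then assert on the behavior.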
We welcome contributions! Please:
- Open an issue to discuss major changes before starting
- Follow existing code style and patterns in the codebase
- Add tests for new features and bug fixes
- Update documentation as needed
- Keep it simple - this tool is intentionally minimal and focused
For bug reports and feature requests, please use the GitHub Issues page.