An evaluation framework for measuring how accurately LLM agents complete Auth0 integration tasks. It runs each task across multiple configurations - from a single LLM call with no tools, to a full agentic loop with MCP servers and skills - and compares the results so you can see exactly where each investment pays off.
We use auth0-evals to measure how well AI agents integrate Auth0 across our SDKs, MCP servers, and skills - and to track how those scores improve as we invest in better documentation, tooling, and agent experiences. The results power the Auth0 Agent Experience page.
Note
We develop auth0-evals in public for our own internal use. It is not intended for external use cases, and we provide no support, guarantees, or stability commitments for anyone building on top of it. You're welcome to read it, learn from it, provide feedback, and use it - but do so at your own risk.
- @a0/eval - CLI (a0-eval), agent runners (Claude Code, Copilot, Gemini CLI), scoring, and result persistence.
- @a0/eval-core - Framework core - config loader, eval discovery, grader engine, workspace lifecycle, type definitions.
- @a0/eval-graders - Grader factory functions (contains, notContains, matches, judge) and the GraderLevel enum.
- @a0/eval-reporter - Generates HTML reports from scored results.
- auth0-evals - The Auth0 eval suite - task prompts, graders, scaffolds, and configuration.
Running evals requires Node.js 24+ and Docker (used to sandbox agent runs).
Before running the evals, install the dependencies and build the packages:
npm install
npm run build

Configure the .env file in the apps/auth0-evals directory with your LLM API key and GitHub token (if needed):
cp apps/auth0-evals/.env.example apps/auth0-evals/.env
# set LLM_API_KEY in apps/auth0-evals/.env
# add GH_TOKEN if running evals that use gh CLI calls (e.g. android_quickstart): gh auth token

Then, you can run a specific eval like this:
# Run a single eval in baseline mode
npm run evals -- --eval react_quickstart --mode baseline
# Generate an HTML report
npm run report

Each eval defines a prompt (the task an LLM must complete) and graders (pass/fail checks against the generated code). The framework can run the prompt across 5 configurations:
| Configuration | What it tests |
|---|---|
| baseline | Single LLM call, no tools - training-data knowledge only |
| agent | Full agentic loop with file/shell tools |
| agent+skills | Agent + skill files injected into context |
| agent+mcp | Agent + MCP server tools |
| agent+mcp+skills | Agent + MCP + skills combined |
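To make the prompt-plus-graders idea concrete, here is a minimal, self-contained TypeScript sketch of pass/fail checks run against generated code. The names contains, notContains, and matches mirror the factory names listed for @a0/eval-graders, but the types, signatures, and the runGraders helper are assumptions made for illustration, not the package's actual API. The sample output string is also invented.

```typescript
// Hypothetical shapes for illustration only; the real @a0/eval-graders
// API may differ.
type GraderResult = { name: string; pass: boolean };
type Grader = (output: string) => GraderResult;

// Passes when the generated code includes the given substring.
const contains = (needle: string): Grader => (output) => ({
  name: `contains(${needle})`,
  pass: output.includes(needle),
});

// Passes when the generated code does NOT include the given substring.
const notContains = (needle: string): Grader => (output) => ({
  name: `notContains(${needle})`,
  pass: !output.includes(needle),
});

// Passes when the generated code matches the regular expression.
const matches = (pattern: RegExp): Grader => (output) => ({
  name: `matches(${pattern.source})`,
  pass: pattern.test(output),
});

// Hypothetical helper: runs every grader against one piece of output.
function runGraders(output: string, graders: Grader[]): GraderResult[] {
  return graders.map((grade) => grade(output));
}

// Invented sample of what an agent might generate for a React task.
const generated = `
import { Auth0Provider } from "@auth0/auth0-react";
const domain = import.meta.env.VITE_AUTH0_DOMAIN;
`;

const results = runGraders(generated, [
  contains("Auth0Provider"),
  notContains("localStorage"),
  matches(/from\s+["']@auth0\/auth0-react["']/),
]);

const passed = results.filter((r) => r.pass).length;
console.log(`${passed}/${results.length} graders passed`);
```

Keeping each grader a pure function of the output makes the same checks reusable across every configuration, so score differences reflect the configuration rather than the check.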
The delta between configurations tells you where to invest:
- baseline → agent - value of tool access alone
- agent → agent+skills - value of skills investment
- agent → agent+mcp - value of MCP server
- agent+mcp+skills - full compound effect
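The comparisons listed above can be sketched as a small TypeScript calculation. This is an illustrative sketch, not code from the framework: the PassRates shape, the computeDeltas helper, and the percentages are all hypothetical placeholders, and measuring agent+mcp+skills against agent (rather than baseline) is an assumption.

```typescript
// Configuration names are taken from the README; everything else here
// is hypothetical and for illustration only.
type Configuration =
  | "baseline"
  | "agent"
  | "agent+skills"
  | "agent+mcp"
  | "agent+mcp+skills";

// Pass rate per configuration, in whole percentage points.
type PassRates = Record<Configuration, number>;

interface Delta {
  from: Configuration;
  to: Configuration;
  label: string;
  change: number;
}

// Which pairs to compare, and what each difference is meant to show.
// Assumption: the compound effect is measured relative to "agent".
const comparisons: Array<[Configuration, Configuration, string]> = [
  ["baseline", "agent", "tool access alone"],
  ["agent", "agent+skills", "skills investment"],
  ["agent", "agent+mcp", "MCP server"],
  ["agent", "agent+mcp+skills", "full compound effect"],
];

function computeDeltas(rates: PassRates): Delta[] {
  return comparisons.map(([from, to, label]) => ({
    from,
    to,
    label,
    change: rates[to] - rates[from],
  }));
}

// Placeholder numbers, NOT real auth0-evals results.
const exampleRates: PassRates = {
  baseline: 40,
  agent: 60,
  "agent+skills": 75,
  "agent+mcp": 70,
  "agent+mcp+skills": 85,
};

const deltas = computeDeltas(exampleRates);
for (const d of deltas) {
  const sign = d.change >= 0 ? "+" : "";
  console.log(`${d.from} -> ${d.to}: ${sign}${d.change} points (${d.label})`);
}
```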
Agent runs are scored across 8 dimensions covering process and output quality, and the scores are written to a JSON results file. See packages/eval for CLI documentation and scoring details.
- packages/eval/README.md - CLI usage, configuration, runners, scoring methodology
- apps/auth0-evals/README.md - Auth0 eval suite, available evals, how to add new ones
- docs/ADDING_EVALS.md - Full guide to writing evals
- docs/SCORING_METHODOLOGY.md - Scoring philosophy and dimension details
- docs/TESTING_SKILLS.md - How to test skills locally
npm install # install all workspace dependencies
npm run build # compile all packages
npm test # run tests across all packages
npm run lint # lint
npm run format # format with Prettier

Requires Node.js 24+ and Docker (for sandboxed agent runs).
We appreciate feedback and contributions to this repo! Before you get started, please read Auth0's general contribution guidelines.
To provide feedback or report a bug, please raise an issue on our issue tracker.
Please do not report security vulnerabilities on the public GitHub issue tracker. The Responsible Disclosure Program details the procedure for disclosing security issues.

Auth0 is an easy-to-implement, adaptable authentication and authorization platform. To learn more, check out Why Auth0?
This project is licensed under the Apache 2.0 license. See the LICENSE file for more info.

