Skip to content

auth0/auth0-evals

auth0-evals

An evaluation framework for measuring how accurately LLM agents complete Auth0 integration tasks. It runs each task across multiple configurations - from a single LLM call with no tools, to a full agentic loop with MCP servers and skills - and compares the results so you can see exactly where each investment pays off.

License

What it's used for

We use auth0-evals to measure how well AI agents integrate Auth0 across our SDKs, MCP servers, and skills - and to track how those scores improve as we invest in better documentation, tooling, and agent experiences. The results power the Auth0 Agent Experience page.

Auth0 Agent Experience

Note

We develop auth0-evals in public for our own internal use. It is not intended for external use cases, and we provide no support, guarantees, or stability commitments for anyone building on top of it. You're welcome to read it, learn from it, provide feedback, and use it - but do so at your own risk.

Packages

  • @a0/eval - CLI (a0-eval), agent runners (Claude Code, Copilot, Gemini CLI), scoring, and result persistence.
  • @a0/eval-core - Framework core - config loader, eval discovery, grader engine, workspace lifecycle, type definitions.
  • @a0/eval-graders - Grader factory functions (contains, notContains, matches, judge) and the GraderLevel enum.
  • @a0/eval-reporter - Generates HTML reports from scored results.
  • auth0-evals - The Auth0 eval suite - task prompts, graders, scaffolds, and configuration.

Running Evals

Running evals requires Node.js 24+ and Docker (used to sandbox agent runs).

Before running the evals, ensure you install the dependencies and build the packages:

npm install
npm run build

Configure the .env file in the apps/auth0-evals directory with your LLM API key and GitHub token (if needed):

cp apps/auth0-evals/.env.example apps/auth0-evals/.env
# set LLM_API_KEY in apps/auth0-evals/.env
# add GH_TOKEN if running evals that use gh CLI calls (e.g. android_quickstart): gh auth token

Then, you can run a specific eval like this:

# Run a single eval in baseline mode
npm run evals -- --eval react_quickstart --mode baseline

# Generate an HTML report
npm run report

How it works

Each eval defines a prompt (the task an LLM must complete) and graders (pass/fail checks against the generated code). The framework can run the prompt across 5 configurations:

Configuration What it tests
baseline Single LLM call, no tools - training-data knowledge only
agent Full agentic loop with file/shell tools
agent+skills Agent + skill files injected into context
agent+mcp Agent + MCP server tools
agent+mcp+skills Agent + MCP + skills combined

The delta between configurations tells you where to invest:

  • baseline → agent - value of tool access alone
  • agent → agent+skills - value of skills investment
  • agent → agent+mcp - value of MCP server
  • agent+mcp+skills - full compound effect

Agent runs are scored across 8 dimensions (process + output quality) into a JSON results file. See packages/eval for CLI documentation and scoring details.

Documentation

Development

npm install       # install all workspace dependencies
npm run build     # compile all packages
npm test          # run tests across all packages
npm run lint      # lint
npm run format    # format with Prettier

Requires Node.js 24+ and Docker (for sandboxed agent runs).

Feedback

Contributing

We appreciate feedback and contribution to this repo! Before you get started, please read Auth0's general contribution guidelines.

Raise an issue

To provide feedback or report a bug, please raise an issue on our issue tracker.

Vulnerability Reporting

Please do not report security vulnerabilities on the public GitHub issue tracker. The Responsible Disclosure Program details the procedure for disclosing security issues.

What is Auth0?

Auth0 Logo

Auth0 is an easy to implement, adaptable authentication and authorization platform. To learn more check out Why Auth0?

This project is licensed under the Apache 2.0 license. See the LICENSE file for more info.

About

Benchmark AI agent accuracy on Auth0 integration tasks.

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors