skillgym

Benchmark coding-agent skills by running real agent sessions and asserting on normalized execution reports.

Why it's useful

When you evaluate agent skills manually, it is hard to tell whether the agent actually selected the right skill, used it at the right time, and behaved correctly end to end. skillgym gives you a repeatable way to run real sessions, preserve session artifacts, verify outcomes with TypeScript assertions, and catch token regressions with snapshots.

Supported runners

OpenCode CLI
Codex CLI

Quick start

Install skillgym in the project where you want to benchmark agent behavior, using whichever package manager that project uses (run one of the following):

npm install --save-dev skillgym
yarn add --dev skillgym
pnpm add --save-dev skillgym
bun add --dev skillgym

Create skillgym.config.ts in your project root, or in a parent directory that the suite can discover upward:

import type { SkillGymConfig } from "skillgym";

const config: SkillGymConfig = {
  run: {
    cwd: ".",
    outputDir: "./.skillgym-results",
    reporter: "standard",
    schedule: "serial",
  },
  defaults: {
    timeoutMs: 120_000,
  },
  runners: {
    "open-main": {
      agent: {
        type: "opencode",
        model: "openai/gpt-5",
      },
    },
    "code-main": {
      agent: {
        type: "codex",
        model: "gpt-5",
      },
    },
  },
};

export default config;

Create a suite file such as ./skillgym/basic-suite.ts:

import type { TestSuite } from "skillgym";
import { assert } from "skillgym";

const suite: TestSuite = [
  {
    id: "always-passes",
    prompt: "Say only: skillgym ready",
    assert(report, ctx) {
      assert.match(ctx.finalOutput(), /skillgym ready/);
    },
  },
];

export default suite;

Run the suite with the package manager you use in that project:

npx skillgym run ./skillgym/basic-suite.ts
yarn skillgym run ./skillgym/basic-suite.ts
pnpm exec skillgym run ./skillgym/basic-suite.ts
bunx skillgym run ./skillgym/basic-suite.ts

View CLI help:

npx skillgym help

By default, skillgym uses the built-in standard reporter.

TypeScript config, suite, and reporter modules work out of the box on Node >=22.18.0 using Node's built-in TypeScript stripping.

TypeScript runtime limitations:

.ts, .mts, and .cts modules are supported
.tsx is not supported
runtime tsconfig path aliases are not supported
use explicit file extensions in relative imports, for example ./helpers.js
use import type for type-only imports
TypeScript features that need code generation, such as enum, are not supported by default
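
Since enum needs code generation, a common erasable substitute is a const object plus a union type derived from it. The sketch below is illustrative only: WorkspaceMode and isWorkspaceMode are hypothetical names, not skillgym exports, and the values simply mirror the workspace modes described in the Workspaces section.

```typescript
// Erasable alternative to `enum`: a plain object marked `as const`
// plus a union type derived from its values. Type stripping removes
// only the annotations, so the remaining JavaScript runs unchanged.
// `WorkspaceMode` is a hypothetical name used for illustration.
const WorkspaceMode = {
  Shared: "shared",
  Isolated: "isolated",
} as const;

// Resolves to the union "shared" | "isolated".
type WorkspaceMode = (typeof WorkspaceMode)[keyof typeof WorkspaceMode];

// Runtime guard for values that arrive as plain strings.
function isWorkspaceMode(value: string): value is WorkspaceMode {
  return (Object.values(WorkspaceMode) as string[]).includes(value);
}
```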

What you need to run a suite

a skillgym.config.* file with a non-empty runners map
at least one configured runner with agent.type and agent.model
the corresponding CLI installed and working in your environment
a suite file that exports test cases

Config is discovered upward from the suite file directory. CLI flags override config values.
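
Upward discovery can be pictured as a walk from the suite directory toward the filesystem root. The following is a minimal sketch under stated assumptions, not skillgym's actual implementation: findConfigUpward and listDirSafe are hypothetical helpers, and matching any file whose name starts with skillgym.config. is a simplification of whatever extensions skillgym really accepts.

```typescript
import { readdirSync } from "node:fs";
import { dirname, join, resolve } from "node:path";

// List a directory's entries, treating missing or unreadable
// directories as empty so the walk can continue upward.
function listDirSafe(dir: string): string[] {
  try {
    return readdirSync(dir);
  } catch {
    return [];
  }
}

// Walk from `startDir` toward the root and return the first file whose
// name starts with "skillgym.config.", or undefined if none is found.
// `listDir` is injectable so the walk can be exercised without real files.
function findConfigUpward(
  startDir: string,
  listDir: (dir: string) => string[] = listDirSafe,
): string | undefined {
  let dir = resolve(startDir);
  for (;;) {
    const match = listDir(dir).find((name) => name.startsWith("skillgym.config."));
    if (match !== undefined) return join(dir, match);
    const parent = dirname(dir);
    if (parent === dir) return undefined; // reached the filesystem root
    dir = parent;
  }
}
```

Under this sketch, a suite in ./skillgym/ would pick up a config in the project root, and a closer config in a nested package would win over one further up.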

Each runner must specify its model in runners.<id>.agent.model. Use agent.model rather than commandArgs to select the agent model, especially for Codex, where --model must be passed to codex exec rather than to the outer launcher.

Runners

A runner is one configured agent target. It tells skillgym which CLI to launch and which model to use for a run.

Each test case runs once per selected runner. For example, 3 cases and 2 runners produce 6 executions.

Configuration

Most important config properties:

run.cwd: working directory used for shared-workspace runs
run.outputDir: where artifacts, reports, and preserved workspaces are written
run.reporter: built-in standard reporter or a custom reporter module path
run.schedule: execution scheduling mode for case x runner pairs
run.workspace: default workspace mode for the suite
defaults.timeoutMs: default per-case timeout
runners.<id>.agent.type: which agent integration to use, currently opencode or codex
runners.<id>.agent.model: model passed to that runner
snapshots: token regression baseline configuration

The execution unit is one case x runner pair. skillgym expands the suite into those pairs, runs them according to run.schedule, and writes artifacts for each execution.
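
The expansion into case x runner pairs can be sketched as a simple cross product. Execution and expandMatrix are hypothetical names for illustration, and the cases-first ordering is an assumption rather than documented behavior.

```typescript
// One execution is one (case, runner) pair.
interface Execution {
  caseId: string;
  runnerId: string;
}

// Expand cases and runners into the full execution matrix.
// Iterating cases in the outer loop is an assumption for illustration.
function expandMatrix(caseIds: string[], runnerIds: string[]): Execution[] {
  return caseIds.flatMap((caseId) =>
    runnerIds.map((runnerId) => ({ caseId, runnerId })),
  );
}

const executions = expandMatrix(
  ["case-a", "case-b", "case-c"],
  ["open-main", "code-main"],
);
console.log(executions.length); // 6
```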

run.schedule controls execution order:

serial: run every case/runner pair in declaration order
parallel: start all selected case/runner pairs concurrently
isolated-by-runner: keep each runner on its own serial queue while different runners may overlap

serial is the default. parallel maximizes overlap across the full matrix. isolated-by-runner is a middle ground when you want each runner to stay ordered internally but still allow different runners to overlap.
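
One way to picture the three modes is as a grouping of executions into queues: items inside a queue run one after another, while separate queues may overlap. This is a sketch of the semantics described above, not skillgym's scheduler; planQueues and the Execution shape are hypothetical.

```typescript
type Schedule = "serial" | "parallel" | "isolated-by-runner";

interface Execution {
  caseId: string;
  runnerId: string;
}

// Group executions into queues. Each queue runs in order;
// different queues are allowed to run at the same time.
function planQueues(executions: Execution[], schedule: Schedule): Execution[][] {
  switch (schedule) {
    case "serial":
      // A single queue: nothing overlaps.
      return executions.length > 0 ? [executions] : [];
    case "parallel":
      // One queue per execution: everything may overlap.
      return executions.map((execution) => [execution]);
    case "isolated-by-runner": {
      // One queue per runner: ordered within a runner, overlapping across runners.
      const byRunner = new Map<string, Execution[]>();
      for (const execution of executions) {
        const queue = byRunner.get(execution.runnerId) ?? [];
        queue.push(execution);
        byRunner.set(execution.runnerId, queue);
      }
      return [...byRunner.values()];
    }
  }
}
```

For 2 cases and 2 runners, this grouping yields 1 queue for serial, 4 for parallel, and 2 for isolated-by-runner.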

Concurrent schedules do not copy or isolate the workspace by themselves. Overlapping runs can still interact through shared filesystem state and live runner output unless you use isolated workspaces. Codex and OpenCode runtime state is isolated per run under each artifact directory.

Workspaces

A workspace is the directory where an execution runs.

skillgym supports two workspace modes:

shared: run directly in one real directory
isolated: create a fresh temporary workspace per case x runner execution

Use shared when you want the agent to work against your real project checkout. Use isolated when you want clean filesystem state per execution or need to prepare each run from a template.

You can configure workspaces in skillgym.config.* with run.workspace, or per suite with a named workspace export. Suite-level workspace config overrides config-level run.workspace.

Isolated workspace example in a suite:

export const workspace = {
  mode: "isolated",
  templateDir: "./fixtures/base-project",
  bootstrap: {
    command: "npm",
    args: ["install"],
  },
};

In isolated mode, each execution gets its own workspace. templateDir copies a starter project into that workspace, and bootstrap runs before the agent starts. Successful isolated runs are cleaned up; failed ones are preserved under outputDir/workspaces for debugging.
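
The lifecycle described above (fresh directory, template copy, bootstrap, cleanup on success) can be sketched with Node's standard library. This is an assumption-laden illustration, not skillgym's code: createIsolatedWorkspace, finishWorkspace, and the directory prefix are hypothetical, and the sketch leaves a failed workspace in place instead of moving it under outputDir/workspaces.

```typescript
import { spawnSync } from "node:child_process";
import { cpSync, existsSync, mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Mirrors the shape of the suite-level workspace export shown above.
interface WorkspaceOptions {
  templateDir?: string;
  bootstrap?: { command: string; args: string[] };
}

// Create a fresh temporary workspace, copy the template into it,
// then run the bootstrap command inside it before the agent starts.
function createIsolatedWorkspace(options: WorkspaceOptions): string {
  const workspaceDir = mkdtempSync(join(tmpdir(), "skillgym-workspace-"));
  if (options.templateDir !== undefined) {
    if (!existsSync(options.templateDir)) {
      throw new Error(`templateDir not found: ${options.templateDir}`);
    }
    // Copies the template's contents into the new workspace directory.
    cpSync(options.templateDir, workspaceDir, { recursive: true });
  }
  if (options.bootstrap !== undefined) {
    const result = spawnSync(options.bootstrap.command, options.bootstrap.args, {
      cwd: workspaceDir,
      stdio: "inherit",
    });
    if (result.status !== 0) {
      throw new Error(`bootstrap exited with status ${result.status}`);
    }
  }
  return workspaceDir;
}

// Remove the workspace after a passing run; keep it after a failure
// so it can be inspected.
function finishWorkspace(workspaceDir: string, passed: boolean): void {
  if (passed) rmSync(workspaceDir, { recursive: true, force: true });
}
```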

See Workspaces for the full workspace reference.

Assertions

assert extends Node's node:assert/strict helpers, so standard methods like assert.ok, assert.equal, and assert.match still work.

Built-in grouped assertions cover:

assert.skills.*
assert.commands.*
assert.fileReads.*
assert.toolCalls.*
assert.output.*

Example:

import { assert } from "skillgym";

assert.skills.has(report, "find-skills");
assert.skills.notHas(report, "upgrading-expo");
assert.commands.includes(report, "npx skills find");
assert.commands.notIncludes(report, "npm install");
assert.fileReads.includes(report, /find-skills\/SKILL\.md$/);
assert.fileReads.notIncludes(report, /upgrading-expo\/SKILL\.md$/);
assert.toolCalls.has(report, {
  tool: "skill",
  where: (args) => (args as { name?: string })?.name === "find-skills",
});
assert.output.notEmpty(report);

See the assertion reference.

Snapshots

Snapshot checks can fail runs when token usage regresses beyond a configured tolerance. Pass --update-snapshots to record or refresh the baseline:

npx skillgym run ./examples/basic-suite.ts --update-snapshots
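
Conceptually, a tolerance check compares current token usage against a stored baseline. The sketch below assumes a percentage-based tolerance. isTokenRegression and its parameters are hypothetical, and skillgym's actual snapshot format and comparison rules may differ (see the snapshot guide).

```typescript
// Returns true when `current` exceeds `baseline` by more than
// `tolerancePercent`. Hypothetical helper for illustration only.
function isTokenRegression(
  baseline: number,
  current: number,
  tolerancePercent: number,
): boolean {
  const allowed = baseline * (1 + tolerancePercent / 100);
  return current > allowed;
}

console.log(isTokenRegression(1000, 1080, 10)); // false: 8% above baseline
console.log(isTokenRegression(1000, 1200, 10)); // true: 20% above baseline
```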

See the snapshot guide.

Example suites

The skill selection suite targets a real installed skill (find-skills) and checks that the runner loads it before invoking npx skills find.

npx skillgym run ./examples/skill-selection-suite.ts

The workspace isolation suite demonstrates isolated workspace setup with a template directory and bootstrap command:

npx skillgym run ./examples/workspace-isolation-suite.ts
