Benchmark coding-agent skills by running real agent sessions and asserting on normalized execution reports.
When you evaluate agent skills manually, it is hard to tell whether the agent actually selected the right skill, used it at the right time, and behaved correctly end to end. skillgym gives you a repeatable way to run real sessions, preserve session artifacts, verify outcomes with TypeScript assertions, and catch token regressions with snapshots.
Supported agent CLIs:
- OpenCode CLI
- Codex CLI
Install skillgym in the project where you want to benchmark agent behavior:
npm install --save-dev skillgym
yarn add --dev skillgym
pnpm add --save-dev skillgym
bun add --dev skillgym

Create skillgym.config.ts in your project root, or in any parent directory of the suite file, where upward discovery can find it:
import type { SkillGymConfig } from "skillgym";

const config: SkillGymConfig = {
  run: {
    cwd: ".",
    outputDir: "./.skillgym-results",
    reporter: "standard",
    schedule: "serial",
  },
  defaults: {
    timeoutMs: 120_000,
  },
  // Runner ids that contain a hyphen must be quoted to be valid object keys.
  runners: {
    "open-main": {
      agent: {
        type: "opencode",
        model: "openai/gpt-5",
      },
    },
    "code-main": {
      agent: {
        type: "codex",
        model: "gpt-5",
      },
    },
  },
};

export default config;

Create a suite file such as ./skillgym/basic-suite.ts:
import type { TestSuite } from "skillgym";
import { assert } from "skillgym";

const suite: TestSuite = [
  {
    id: "always-passes",
    prompt: "Say only: skillgym ready",
    assert(report, ctx) {
      // Passes when the agent's final output contains the expected phrase.
      assert.match(ctx.finalOutput(), /skillgym ready/);
    },
  },
];

export default suite;

Run the suite with the package manager you use in that project:
npx skillgym run ./skillgym/basic-suite.ts
yarn skillgym run ./skillgym/basic-suite.ts
pnpm exec skillgym run ./skillgym/basic-suite.ts
bunx skillgym run ./skillgym/basic-suite.ts

View CLI help:
npx skillgym help

By default, skillgym uses the built-in standard reporter.
TypeScript config, suite, and reporter modules work out of the box on Node >=22.18.0 using Node's built-in TypeScript stripping.
TypeScript runtime limitations:
- .ts, .mts, and .cts modules are supported
- .tsx is not supported
- runtime tsconfig path aliases are not supported
- use explicit file extensions in relative imports, for example ./helpers.js
- use import type for type-only imports
- TypeScript features that need code generation, such as enum, are not supported by default
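Because enum needs code generation, one common workaround under type stripping is a const object paired with a derived union type. The sketch below is only an illustration: the RunStatus name and its values are hypothetical and not part of skillgym.

```typescript
// Hypothetical status values for illustration; not a skillgym API.
// A plain object with `as const` is ordinary JavaScript at runtime,
// so it survives type stripping, unlike `enum`.
const RunStatus = {
  Passed: "passed",
  Failed: "failed",
} as const;

// Union of the object's values: "passed" | "failed".
// This declaration is erased entirely when types are stripped.
type RunStatus = (typeof RunStatus)[keyof typeof RunStatus];

function describeStatus(status: RunStatus): string {
  return status === RunStatus.Passed ? "ok" : "needs attention";
}

console.log(describeStatus(RunStatus.Failed)); // needs attention
```

The object gives you named members at runtime and the type alias gives you exhaustive checking at compile time, which covers most uses of a string enum.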
To run a suite, you need:
- a skillgym.config.* file with a non-empty runners map
- at least one configured runner with agent.type and agent.model
- the corresponding CLI installed and working in your environment
- a suite file that exports test cases
Config is discovered upward from the suite file directory. CLI flags override config values.
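To make upward discovery concrete, here is a minimal sketch of how such a search could work. It is not skillgym's implementation: findConfigUpward and the ListDir type are hypothetical names, and picking the alphabetically first match when a directory holds several skillgym.config.* files is an assumption of this sketch.

```typescript
import { existsSync, readdirSync } from "node:fs";
import { dirname, join, resolve } from "node:path";

// Lists file names in a directory. Injectable so the search can be exercised
// without touching the real filesystem.
type ListDir = (dir: string) => string[];

const listDirOnDisk: ListDir = (dir) =>
  existsSync(dir) ? readdirSync(dir) : [];

// Hypothetical helper: walk from the suite file's directory toward the
// filesystem root and return the first skillgym.config.* file found.
function findConfigUpward(
  suiteFile: string,
  listDir: ListDir = listDirOnDisk,
): string | undefined {
  let dir = dirname(resolve(suiteFile));
  for (;;) {
    const match = listDir(dir)
      .filter((name) => name.startsWith("skillgym.config."))
      .sort()[0];
    if (match !== undefined) return join(dir, match);

    const parent = dirname(dir);
    if (parent === dir) return undefined; // reached the filesystem root
    dir = parent;
  }
}
```

Calling findConfigUpward("./skillgym/basic-suite.ts") would check ./skillgym first, then each parent directory in turn, which is why a config in the project root is found for suites in nested folders.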
Every runner must set its model in runners.<id>.agent.model.
Use agent.model instead of commandArgs when you need to select the agent model, especially for Codex where --model must be passed to codex exec rather than the outer launcher.
A runner is one configured agent target. It tells skillgym which CLI to launch and which model to use for a run.
Each test case runs once per selected runner. For example, 3 cases and 2 runners produce 6 executions.
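The case-by-runner expansion can be sketched as a nested loop. The Execution type and expandMatrix function are illustrative names, not skillgym exports, and the case-major ordering shown here is an assumption.

```typescript
// Illustrative type only; skillgym's real case and runner shapes are richer.
interface Execution {
  caseId: string;
  runnerId: string;
}

// Produce one execution per case x runner pair.
function expandMatrix(caseIds: string[], runnerIds: string[]): Execution[] {
  const executions: Execution[] = [];
  for (const caseId of caseIds) {
    for (const runnerId of runnerIds) {
      executions.push({ caseId, runnerId });
    }
  }
  return executions;
}

const executions = expandMatrix(
  ["case-a", "case-b", "case-c"],
  ["open-main", "code-main"],
);
console.log(executions.length); // 6
```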
Most important config properties:
- run.cwd: working directory used for shared-workspace runs
- run.outputDir: where artifacts, reports, and preserved workspaces are written
- run.reporter: built-in standard reporter or a custom reporter module path
- run.schedule: execution scheduling mode for case x runner pairs
- run.workspace: default workspace mode for the suite
- defaults.timeoutMs: default per-case timeout
- runners.<id>.agent.type: which agent integration to use, currently opencode or codex
- runners.<id>.agent.model: model passed to that runner
- snapshots: token regression baseline configuration
The execution unit is one case x runner pair. skillgym expands the suite into those pairs, runs them according to run.schedule, and writes artifacts for each execution.
run.schedule controls execution order:
- serial: run every case/runner pair in declaration order
- parallel: start all selected case/runner pairs concurrently
- isolated-by-runner: keep each runner on its own serial queue while different runners may overlap
serial is the default. parallel maximizes overlap across the full matrix. isolated-by-runner is a middle ground when you want each runner to stay ordered internally but still allow different runners to overlap.
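One way to picture the three modes is as a set of queues: items within a queue run in order, and separate queues may overlap. The sketch below is a conceptual model under that assumption, not skillgym's scheduler; planQueues and Pair are hypothetical names.

```typescript
type Schedule = "serial" | "parallel" | "isolated-by-runner";

interface Pair {
  caseId: string;
  runnerId: string;
}

// Each inner array is a queue that runs in order; separate queues may overlap.
function planQueues(pairs: Pair[], schedule: Schedule): Pair[][] {
  switch (schedule) {
    case "serial":
      // One queue holding every pair.
      return [pairs.slice()];
    case "parallel":
      // One queue per pair, so everything may start at once.
      return pairs.map((pair) => [pair]);
    case "isolated-by-runner": {
      // One queue per runner, preserving the order of that runner's pairs.
      const byRunner = new Map<string, Pair[]>();
      for (const pair of pairs) {
        const queue = byRunner.get(pair.runnerId) ?? [];
        queue.push(pair);
        byRunner.set(pair.runnerId, queue);
      }
      return Array.from(byRunner.values());
    }
  }
}
```

For 2 cases and 2 runners, this model yields 1 queue of 4 pairs for serial, 4 queues of 1 pair for parallel, and 2 queues of 2 pairs for isolated-by-runner.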
Concurrent schedules do not copy or isolate the workspace by themselves. Overlapping runs may still interact through the same filesystem state and live runner output unless you use isolated workspaces. Codex and OpenCode runtime state is isolated per run under each artifact directory.
A workspace is the directory where an execution runs.
skillgym supports two workspace modes:
- shared: run directly in one real directory
- isolated: create a fresh temporary workspace per case x runner execution
Use shared when you want the agent to work against your real project checkout. Use isolated when you want clean filesystem state per execution or need to prepare each run from a template.
You can configure workspaces in skillgym.config.* with run.workspace, or per suite with a named workspace export. Suite-level workspace config overrides config-level run.workspace.
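The precedence rule can be sketched as a simple fallback chain. WorkspaceConfig and resolveWorkspace are hypothetical names; the sketch also assumes that a suite-level export replaces the config-level value wholesale rather than merging with it, and that shared is the fallback when neither is set. Neither assumption is confirmed by this document.

```typescript
// Minimal shape based on the fields mentioned in this document; the real
// type may have more fields.
interface WorkspaceConfig {
  mode: "shared" | "isolated";
  templateDir?: string;
}

// Suite-level workspace wins over config-level run.workspace.
function resolveWorkspace(
  suiteWorkspace: WorkspaceConfig | undefined,
  configWorkspace: WorkspaceConfig | undefined,
): WorkspaceConfig {
  return suiteWorkspace ?? configWorkspace ?? { mode: "shared" };
}

const resolved = resolveWorkspace(
  { mode: "isolated", templateDir: "./fixtures/base-project" },
  { mode: "shared" },
);
console.log(resolved.mode); // isolated
```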
Isolated workspace example in a suite:
export const workspace = {
  mode: "isolated",
  templateDir: "./fixtures/base-project",
  bootstrap: {
    command: "npm",
    args: ["install"],
  },
};

In isolated mode, each execution gets its own workspace. The starter project in templateDir is copied into that workspace, and the bootstrap command runs before the agent starts. Successful isolated runs are cleaned up; failed ones are preserved under outputDir/workspaces for debugging.
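The isolated lifecycle described above can be sketched as follows. This is a hypothetical model, not skillgym's implementation: runIsolated is an invented name, the bootstrap and execute callbacks stand in for the bootstrap command and the agent run, and the sketch simply leaves a failed workspace in the temp directory instead of moving it under outputDir/workspaces.

```typescript
import { cpSync, mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface IsolatedResult {
  workspace: string;
  passed: boolean;
  preserved: boolean;
}

function runIsolated(
  templateDir: string | undefined,
  bootstrap: (workspace: string) => void,
  execute: (workspace: string) => boolean,
): IsolatedResult {
  // Fresh temporary directory for this one execution.
  const workspace = mkdtempSync(join(tmpdir(), "isolated-run-"));

  if (templateDir !== undefined) {
    // Copy the starter project into the new workspace.
    cpSync(templateDir, workspace, { recursive: true });
  }

  bootstrap(workspace); // runs before the agent starts
  const passed = execute(workspace);

  if (passed) {
    // Successful runs are cleaned up.
    rmSync(workspace, { recursive: true, force: true });
  }
  // Failed runs stay on disk so they can be inspected.
  return { workspace, passed, preserved: !passed };
}
```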
See Workspaces for the full workspace reference.
assert extends Node's node:assert/strict helpers, so standard methods like assert.ok, assert.equal, and assert.match still work.
Built-in grouped assertions cover:
- assert.skills.*
- assert.commands.*
- assert.fileReads.*
- assert.toolCalls.*
- assert.output.*
Example:
import { assert } from "skillgym";
assert.skills.has(report, "find-skills");
assert.skills.notHas(report, "upgrading-expo");
assert.commands.includes(report, "npx skills find");
assert.commands.notIncludes(report, "npm install");
assert.fileReads.includes(report, /find-skills\/SKILL\.md$/);
assert.fileReads.notIncludes(report, /upgrading-expo\/SKILL\.md$/);
assert.toolCalls.has(report, {
  tool: "skill",
  where: (args) => (args as { name?: string })?.name === "find-skills",
});
assert.output.notEmpty(report);

See the assertion reference.
Snapshot checks can fail runs when token usage regresses beyond a configured tolerance.
npx skillgym run ./examples/basic-suite.ts --update-snapshots

See the snapshot guide.
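As a rough illustration of a tolerance-based regression check, consider the sketch below. The function name, its parameters, and the relative-tolerance rule are all assumptions; skillgym's actual snapshot format and comparison logic are described in the snapshot guide, not here.

```typescript
// Hypothetical check: flag a regression when current usage grows beyond the
// baseline by more than the given fraction.
function exceedsTolerance(
  baselineTokens: number,
  currentTokens: number,
  tolerance: number, // for example, 0.1 allows up to 10% growth
): boolean {
  return currentTokens > baselineTokens * (1 + tolerance);
}

console.log(exceedsTolerance(1000, 1050, 0.1)); // false
console.log(exceedsTolerance(1000, 1200, 0.1)); // true
```

Under this model, rerunning with --update-snapshots would correspond to replacing the stored baseline with the current measurement.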
The skill selection suite targets a real installed skill (find-skills) and checks that the runner loads it before invoking npx skills find.
npx skillgym run ./examples/skill-selection-suite.ts

The workspace isolation suite demonstrates isolated workspace setup with a template directory and bootstrap command:
npx skillgym run ./examples/workspace-isolation-suite.ts