Conversation

@willccbb (Member) commented Jan 16, 2026

Motivation

  • Narrow the per-environment config to only the fields that come from TOML (env_id, env_args, num_examples, rollouts_per_example) so TOML / env pyproject settings are decoupled from run-wide options.
  • Allow multiple environments per run while keeping run-level settings (model, sampling, concurrency, saving) shared and overridable from the CLI.
  • Simplify configuration precedence and make CLI overrides for run-level options unambiguous.

Description

  • Introduced a new EvalRunConfig pydantic model and reduced EvalConfig to env-only fields (env_id, env_args, num_examples, rollouts_per_example) in verifiers/types.py; a sketch of the split follows this list.
  • Updated evaluation runner signatures and behavior: run_evaluation now accepts (env_config: EvalConfig, run_config: EvalRunConfig), run_multi_evaluation accepts an EvalRunConfig, and result-path construction now takes both configs via get_eval_results_path(run_config, env_config) in verifiers/utils/path_utils.py and verifiers/utils/eval_utils.py.
  • Reworked CLI resolution in verifiers/scripts/eval.py: per-env TOML settings build EvalConfig objects, while run-level settings (model, endpoints, sampling args, concurrency, saving, headers, etc.) are collected into a single EvalRunConfig constructed from CLI flags and an endpoint-registry lookup.
  • Updated docs/evaluation.md and docs/reference.md to reflect the new TOML scope (per-env fields only) and the new EvalRunConfig type and precedence: TOML controls per-env fields; the CLI controls run-level options.
  • Adjusted imports, logging messages, and save logic to use the new config split, preserving existing behavior for parallel multi-env runs.
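
A minimal sketch of the split described above, using only the field names given in these bullets; the run-level field names and defaults shown for EvalRunConfig (api_base_url, sampling_args, max_concurrent, save_results) are illustrative assumptions, not the actual verifiers/types.py definitions:

```python
# Hypothetical sketch of the config split; not the actual verifiers definitions.
from pydantic import BaseModel, Field


class EvalConfig(BaseModel):
    """Per-environment settings, sourced from TOML or env pyproject."""
    env_id: str
    env_args: dict = Field(default_factory=dict)
    num_examples: int = -1
    rollouts_per_example: int = 1


class EvalRunConfig(BaseModel):
    """Run-wide settings shared by all environments, sourced from CLI flags."""
    model: str
    api_base_url: str | None = None   # assumed field name
    sampling_args: dict = Field(default_factory=dict)
    max_concurrent: int = 32          # assumed default
    save_results: bool = False


def run_evaluation(env_config: EvalConfig, run_config: EvalRunConfig):
    """Signature from the description above; body elided."""
    ...
```

The point of the split is that an EvalConfig can be built purely from a TOML table or env pyproject settings, while a single EvalRunConfig is built once from CLI flags and shared across every environment in the run.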

Testing

  • No automated tests were executed as part of this change (no pytest or CI run was requested).

Codex Task


Note

Enables parallel evaluation of multiple environments and clarifies config boundaries between per-env and run-level settings.

  • Introduces EvalEnvConfig, EvalModelConfig, EvalSaveConfig, and EvalRunConfig; EvalConfig now nests env, model, and save
  • Refactors the CLI (verifiers/scripts/eval.py) to accept env_id_or_path (single, comma-separated, or TOML), build multiple EvalConfigs, and apply clear precedence (per-eval TOML → TOML defaults → CLI → env defaults → global)
  • Adds a TOML loader/validator (load_toml_config), detection (is_toml_config), and a run executor, run_evaluations, that executes evals in parallel (see the sketch after this list)
  • Updates result-path construction to get_eval_results_path(run_config, eval_config) and prints per-eval results plus aggregate event-loop lag metrics
  • Extends docs (docs/evaluation.md, docs/reference.md) with multi-env usage, the TOML schema, and precedence; adds sample configs under configs/evals/ and a local debug config
  • Adds comprehensive CLI/TOML tests (tests/test_eval_cli.py); minor logging tweak in EventLoopLagMonitor
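
A rough sketch of how the detection, loading, and parallel-execution pieces named above could fit together; the [[eval]] and [defaults] table names, the async shape, and the run_evaluation_async helper are assumptions for illustration, not the actual implementation:

```python
# Hypothetical sketch of TOML detection/loading and the parallel run executor.
import asyncio
import tomllib
from pathlib import Path


def is_toml_config(env_id_or_path: str) -> bool:
    # Treat any .toml path as a config file rather than an env id.
    return env_id_or_path.endswith(".toml")


def load_toml_config(path: str) -> list[dict]:
    # Return one dict of per-env settings per [[eval]] table, with [defaults]
    # merged in underneath; these table names are assumed, not confirmed.
    with Path(path).open("rb") as f:
        data = tomllib.load(f)
    defaults = data.get("defaults", {})
    return [{**defaults, **eval_table} for eval_table in data.get("eval", [])]


async def run_evaluation_async(env_config, run_config):
    """Placeholder for a single-env evaluation; body elided."""
    ...


async def run_evaluations(eval_configs, run_config):
    # Execute all per-env evaluations concurrently, sharing run-level settings.
    return await asyncio.gather(
        *(run_evaluation_async(cfg, run_config) for cfg in eval_configs)
    )
```

Usage-wise, eval.py would call is_toml_config on env_id_or_path, expand a TOML file into multiple per-env configs, and hand the whole list plus one run config to run_evaluations.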

Written by Cursor Bugbot for commit 1ee8aa7. This will update automatically on new commits.

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

@willccbb closed this Jan 21, 2026