Conversation

@willccbb (Member) commented Jan 16, 2026

Motivation

  • Narrow the per-environment config to only the fields that come from TOML (env_id, env_args, num_examples, rollouts_per_example) so TOML / env pyproject settings are decoupled from run-wide options.
  • Allow multiple environments per run while keeping run-level settings (model, sampling, concurrency, saving) shared and overridable from the CLI.
  • Simplify configuration precedence and make CLI overrides for run-level options unambiguous.

Description

  • Introduced a new EvalRunConfig pydantic model and reduced EvalConfig to env-only fields (env_id, env_args, num_examples, rollouts_per_example) in verifiers/types.py; a sketch of the split follows this list.
  • Updated evaluation runner signatures and behavior: run_evaluation now accepts (env_config: EvalConfig, run_config: EvalRunConfig), run_multi_evaluation accepts an EvalRunConfig, and result-path construction now takes both configs via get_eval_results_path(run_config, env_config) in verifiers/utils/path_utils.py and verifiers/utils/eval_utils.py.
  • Reworked CLI resolution in verifiers/scripts/eval.py: per-env TOML settings build EvalConfig objects, while run-level settings (model, endpoints, sampling args, concurrency, saving, headers, etc.) are collected into a single EvalRunConfig constructed from CLI flags and an endpoint-registry lookup.
  • Updated docs/evaluation.md and docs/reference.md to reflect the new TOML scope (per-env fields only) and the new EvalRunConfig type and precedence: TOML controls per-env fields; the CLI controls run-level options.
  • Adjusted imports, logging messages, and save logic to use the new config split, preserving existing behavior for parallel multi-env runs.
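
A minimal sketch of the split described above, using only the field names given in these bullets; the run-level field names and defaults shown for EvalRunConfig (api_base_url, sampling_args, max_concurrent, save_results) are illustrative assumptions, not the actual verifiers/types.py definitions:

```python
# Hypothetical sketch of the config split; not the actual verifiers definitions.
from pydantic import BaseModel, Field


class EvalConfig(BaseModel):
    """Per-environment settings, sourced from TOML or env pyproject."""
    env_id: str
    env_args: dict = Field(default_factory=dict)
    num_examples: int = -1
    rollouts_per_example: int = 1


class EvalRunConfig(BaseModel):
    """Run-wide settings shared by all environments, sourced from CLI flags."""
    model: str
    api_base_url: str | None = None   # assumed field name
    sampling_args: dict = Field(default_factory=dict)
    max_concurrent: int = 32          # assumed default
    save_results: bool = False


def run_evaluation(env_config: EvalConfig, run_config: EvalRunConfig):
    """Signature from the description above; body elided."""
    ...
```

The point of the split is that an EvalConfig can be built purely from a TOML table or env pyproject settings, while a single EvalRunConfig is built once from CLI flags and shared across every environment in the run.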

Testing

  • No automated tests were executed as part of this change (no pytest or CI run was requested).

Codex Task


Note

Enables parallel evaluation of multiple environments and clarifies config boundaries between per-env and run-level settings.

  • Introduces EvalEnvConfig, EvalModelConfig, EvalSaveConfig, and EvalRunConfig; EvalConfig now nests env, model, and save
  • Refactors the CLI (verifiers/scripts/eval.py) to accept env_id_or_path (single, comma-separated, or TOML), build multiple EvalConfigs, and apply clear precedence (per-eval TOML → TOML defaults → CLI → env defaults → global)
  • Adds a TOML loader/validator (load_toml_config), detection (is_toml_config), and a run executor, run_evaluations, that executes evals in parallel (see the sketch after this list)
  • Updates result-path construction to get_eval_results_path(run_config, eval_config) and prints per-eval results plus aggregate event-loop lag metrics
  • Extends docs (docs/evaluation.md, docs/reference.md) with multi-env usage, the TOML schema, and precedence; adds sample configs under configs/evals/ and a local debug config
  • Adds comprehensive CLI/TOML tests (tests/test_eval_cli.py); minor logging tweak in EventLoopLagMonitor
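
A rough sketch of how the detection, loading, and parallel-execution pieces named above could fit together; the [[eval]] and [defaults] table names, the async shape, and the run_evaluation_async helper are assumptions for illustration, not the actual implementation:

```python
# Hypothetical sketch of TOML detection/loading and the parallel run executor.
import asyncio
import tomllib
from pathlib import Path


def is_toml_config(env_id_or_path: str) -> bool:
    # Treat any .toml path as a config file rather than an env id.
    return env_id_or_path.endswith(".toml")


def load_toml_config(path: str) -> list[dict]:
    # Return one dict of per-env settings per [[eval]] table, with [defaults]
    # merged in underneath; these table names are assumed, not confirmed.
    with Path(path).open("rb") as f:
        data = tomllib.load(f)
    defaults = data.get("defaults", {})
    return [{**defaults, **eval_table} for eval_table in data.get("eval", [])]


async def run_evaluation_async(env_config, run_config):
    """Placeholder for a single-env evaluation; body elided."""
    ...


async def run_evaluations(eval_configs, run_config):
    # Execute all per-env evaluations concurrently, sharing run-level settings.
    return await asyncio.gather(
        *(run_evaluation_async(cfg, run_config) for cfg in eval_configs)
    )
```

Usage-wise, eval.py would call is_toml_config on env_id_or_path, expand a TOML file into multiple per-env configs, and hand the whole list plus one run config to run_evaluations.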

Written by Cursor Bugbot for commit 1ee8aa7. This will update automatically on new commits.

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

@willccbb closed this Jan 21, 2026