multi-env evals config #734

mikasenghaas · 2026-01-15T15:38:37Z

Description

This PR implements evaluating multiple environments in parallel via vf-eval. For more details check the updated docs.

This PR is mainly concerned with the config system. Cosmetic updates will be shipped separately, e.g., see #735.

Examples

By default, we still evaluate a single env with no changes to the interface

uv run vf-eval gsm8k -n5 -r3

To configure multi-environment evaluation, specify a comma-separated list of env IDs:

uv run vf-eval gsm8k,alphabet-sort -n5 -r3

Note that all environments use their default configuration. Since CLI arguments apply to every environment, values can only be changed for all environments at once. For more fine-grained configuration, see below.
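As a rough illustration of the two input forms (env IDs versus a config path), the positional argument could be classified along these lines. `resolve_targets` is a hypothetical helper written for this note, not a function from the PR, and the ".toml suffix means config" rule is an assumption.

```python
def resolve_targets(env_id_or_config: str) -> tuple[str, list[str]]:
    """Classify the positional argument (hypothetical helper, not the PR's code).

    Returns ("config", [path]) for a .toml path, otherwise ("envs", [ids...]).
    """
    # Assumption: anything ending in .toml is treated as a config file path.
    if env_id_or_config.endswith(".toml"):
        return "config", [env_id_or_config]
    # Split on commas, dropping surrounding whitespace and empty entries.
    env_ids = [part.strip() for part in env_id_or_config.split(",") if part.strip()]
    if not env_ids:
        raise ValueError("expected at least one environment id")
    return "envs", env_ids


print(resolve_targets("gsm8k,alphabet-sort"))
```

A single env ID is just the one-element case of the comma-separated list, which is why the default interface stays unchanged.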

To configure multi-environment evaluation with (potentially) different arguments for each environment, specify a path to a TOML config file:

uv run vf-eval configs/evals/debug.toml -n5 -r3

# configs/evals/debug.toml
[[env]]
id = "gsm8k"
num_examples = 1
rollouts_per_example = 1

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

Note

Enables running multiple environments in one invocation with config-driven control, plus install utilities and extensive tests/docs.

Multi-env evals: New [[eval]] TOML configs (configs/eval/*.toml) parsed by load_toml_config() with validation and precedence; CLI positional becomes env_id_or_config (single env or .toml path)
Parallel execution: New EvalRunConfig and run_evaluations() to execute multiple EvalConfigs concurrently; event-loop lag monitoring moved to the multi-run flow
CLI defaults/refactor: Centralized defaults, header handling, sampling-args merge precedence, and endpoint resolution; prints per-eval results; checks Hub envs via check_hub_env_installed() before running
Install utilities: New verifiers.utils.install_utils (Hub/local/repo installers, package checks, ID parsing) and vf-install rewritten to use it
Docs: docs/evaluation.md expanded with multi-env usage, TOML schema, and configuration precedence
Type/cleanup: Added EvalRunConfig; minor casting/log-level tweaks in rlm_env.py and async_utils.py
Tests: New suites for CLI/TOML parsing and install utils; coverage for precedence, defaults, and error cases
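The concurrent execution described in the summary can be pictured with plain `asyncio`. This is only a sketch: `EvalSpec`, `run_one`, and `run_all` are hypothetical stand-ins, and the real `EvalConfig` fields and `run_evaluations()` signature are not shown on this page.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class EvalSpec:  # hypothetical stand-in for the PR's EvalConfig
    env_id: str
    num_examples: int
    rollouts_per_example: int


async def run_one(spec: EvalSpec) -> dict:
    # Placeholder for a real evaluation; yields control like an I/O-bound call.
    await asyncio.sleep(0)
    return {
        "env_id": spec.env_id,
        "rollouts": spec.num_examples * spec.rollouts_per_example,
    }


async def run_all(specs: list[EvalSpec]) -> list[dict]:
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(run_one(s) for s in specs))


results = asyncio.run(
    run_all([EvalSpec("gsm8k", 5, 3), EvalSpec("alphabet-sort", 5, 3)])
)
for result in results:
    print(result)
```

Because `gather` preserves input order, per-eval results can be printed in the same order the environments were listed.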

Written by Cursor Bugbot for commit 9252a96. This will update automatically on new commits.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-01-21T04:52:19Z

verifiers/utils/install_utils.py

+
+def is_hub_env(env_id: str) -> bool:
+    """Check if env_id refers to a Hub environment (has owner/ prefix)."""
+    return "/" in env_id and not env_id.startswith("./") and not env_id.startswith("/")


Mismatch between is_hub_env and parse_env_id causes unhandled crash

Medium Severity

The is_hub_env function accepts any string containing / (that doesn't start with ./ or /), but parse_env_id requires exactly two parts when split by /. An input like "a/b/c" passes is_hub_env but causes parse_env_id to raise an unhandled ValueError. Both check_hub_env_installed and install_from_hub call parse_env_id after is_hub_env returns True without catching this exception, causing the CLI to crash with a traceback instead of a helpful error message.

Additional Locations (1)

verifiers/utils/install_utils.py#L94-L99
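The mismatch can be reproduced in isolation. In the sketch below, `is_hub_env` is copied from the diff above, while `parse_env_id` is reconstructed from the review's description only (exactly two parts when split on `/`), so its body is an assumption. `is_hub_env_strict` is one possible fix, not necessarily the one that was merged.

```python
def is_hub_env(env_id: str) -> bool:
    """As quoted in the review: any '/' that is not a path-like prefix counts."""
    return "/" in env_id and not env_id.startswith("./") and not env_id.startswith("/")


def parse_env_id(env_id: str) -> tuple[str, str]:
    """Assumed behavior: require exactly 'owner/name', else raise ValueError."""
    parts = env_id.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"invalid Hub env id: {env_id!r} (expected 'owner/name')")
    return parts[0], parts[1]


def is_hub_env_strict(env_id: str) -> bool:
    """One possible fix: accept only ids that parse_env_id would accept."""
    if env_id.startswith(("./", "/")):
        return False
    try:
        parse_env_id(env_id)
    except ValueError:
        return False
    return True


# "a/b/c" passes the loose check but would crash the parser.
print(is_hub_env("a/b/c"), is_hub_env_strict("a/b/c"))
```

Having the check delegate to the parser keeps the two in sync. Alternatively, callers such as `check_hub_env_installed` could catch the `ValueError` and print a readable error instead of a traceback.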

mikasenghaas mentioned this pull request Jan 15, 2026

eval tui #735

Merged

13 tasks

mikasenghaas requested a review from willccbb January 15, 2026 17:22

mikasenghaas marked this pull request as ready for review January 15, 2026 17:23

mikasenghaas changed the title from "multi-env evals" to "multi-env evals config" Jan 15, 2026

cursor bot reviewed Jan 15, 2026

verifiers/utils/eval_utils.py (outdated, resolved)

verifiers/scripts/eval.py (outdated, resolved)

verifiers/scripts/eval.py (outdated, resolved)

cursor bot reviewed Jan 15, 2026

verifiers/utils/eval_utils.py (resolved)

mikasenghaas added 19 commits January 19, 2026 10:42

simple multi eval scaffolding via toml config

746ea9c

add debug config

8be1b4a

demote to debug log

138d30e

move around logs

b43e337

fix tests

64e68f5

support comma-separated list

8e9f335

fix precedence

f84ac0e

minor

681ebfb

fix schema validation

f7118f6

minor fix

8a1da80

update tests

1a1f278

add unit tests

e49c648

revert pbar desc

9716fc4

update docs

26416b8

typo

27b65fa

fix mutation

3979923

validation for env ids

ce63f9b

fix resolution issue

ad47e3f

move debug config

6c047c9

mikasenghaas force-pushed the multi-env-evals branch from c4d690d to 6c047c9 Compare January 19, 2026 11:02

mikasenghaas mentioned this pull request Jan 19, 2026

env server/client #744

Draft

19 tasks

better err msg for wrong toml path

8e80b0c

remove some failing tests

354d597

willccbb reviewed Jan 20, 2026

docs/evaluation.md (outdated, resolved)

willccbb added 2 commits January 20, 2026 16:16

fixes, streamlining CLI vs config

cae4604

streamline config vs CLI, add vf-install cmds for user/env-id

f1edfe6

willccbb added 2 commits January 20, 2026 19:55

bugbot fixes

bef6a3d

configs/evals -> configs/eval

57d2886

better error message for [eval]

f7395c7

better request exception handling

9252a96

willccbb approved these changes Jan 21, 2026

willccbb merged commit 780bb21 into main Jan 21, 2026
6 checks passed

cursor bot reviewed Jan 21, 2026
