@mikasenghaas mikasenghaas commented Jan 15, 2026

Description

This PR implements evaluating multiple environments in parallel via vf-eval. For more details check the updated docs.

This PR is mainly concerned with the config system. Cosmetic updates will be shipped separately, e.g., see #735.

Examples

By default, we still evaluate a single env, with no changes to the interface:

uv run vf-eval gsm8k -n5 -r3

To configure multi-environment evaluation, specify a comma-separated list of env ids:

uv run vf-eval gsm8k,alphabet-sort -n5 -r3

Note that all environments use their default configuration. Since CLI arguments apply to all environments, values can only be changed for all environments at once. For more fine-grained configuration, see below.
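
The comma-separated form can be pictured as a fan-out of the same CLI values to every environment. The following is a minimal sketch of that idea, not the PR's actual code; `expand_env_ids` and the dictionary keys are hypothetical names chosen for illustration.

```python
def expand_env_ids(arg: str, num_examples: int, rollouts_per_example: int) -> list[dict]:
    """Hypothetical helper: split a comma-separated env id list and
    apply the same CLI values to every environment."""
    env_ids = [part.strip() for part in arg.split(",") if part.strip()]
    return [
        {
            "id": env_id,
            "num_examples": num_examples,
            "rollouts_per_example": rollouts_per_example,
        }
        for env_id in env_ids
    ]


# Mirrors `vf-eval gsm8k,alphabet-sort -n5 -r3`: both envs get n=5, r=3.
configs = expand_env_ids("gsm8k,alphabet-sort", num_examples=5, rollouts_per_example=3)
```

Because the values are shared, changing `-n` or `-r` on the command line changes them for every environment in the list.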

To configure multi-environment evaluation with (potentially) different arguments for each environment, specify a path to a TOML config file:

uv run vf-eval configs/evals/debug.toml -n5 -r3
# configs/local/vf-eval/debug.toml
[[env]]
id = "gsm8k"
num_examples = 1
rollouts_per_example = 1

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Enables running multiple environments in one invocation with config-driven control, plus install utilities and extensive tests/docs.

  • Multi-env evals: New [[eval]] TOML configs (configs/eval/*.toml) parsed by load_toml_config() with validation and precedence; CLI positional becomes env_id_or_config (single env or .toml path)
  • Parallel execution: New EvalRunConfig and run_evaluations() to execute multiple EvalConfigs concurrently; event-loop lag monitoring moved to the multi-run flow
  • CLI defaults/refactor: Centralized defaults, header handling, sampling-args merge precedence, and endpoint resolution; prints per-eval results; checks Hub envs via check_hub_env_installed() before running
  • Install utilities: New verifiers.utils.install_utils (Hub/local/repo installers, package checks, ID parsing) and vf-install rewritten to use it
  • Docs: docs/evaluation.md expanded with multi-env usage, TOML schema, and configuration precedence
  • Type/cleanup: Added EvalRunConfig; minor casting/log-level tweaks in rlm_env.py and async_utils.py
  • Tests: New suites for CLI/TOML parsing and install utils; coverage for precedence, defaults, and error cases
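
The concurrent execution described in the summary can be sketched with `asyncio.gather`. This is only an illustration under stated assumptions: `run_one` and `run_all` are hypothetical stand-ins, not the PR's `run_evaluations()`, and the evaluation itself is replaced by a short sleep.

```python
import asyncio


async def run_one(env_id: str, num_examples: int) -> dict:
    """Stand-in for a single evaluation; sleeps instead of calling a model."""
    await asyncio.sleep(0.01)
    return {"env_id": env_id, "num_examples": num_examples}


async def run_all(configs: list[dict]) -> list[dict]:
    """Run every evaluation concurrently; results keep the input order."""
    return await asyncio.gather(*(run_one(**cfg) for cfg in configs))


results = asyncio.run(
    run_all(
        [
            {"env_id": "gsm8k", "num_examples": 5},
            {"env_id": "alphabet-sort", "num_examples": 5},
        ]
    )
)
```

`asyncio.gather` returns results in the order the awaitables were passed, which makes it straightforward to print per-eval results next to their configs.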

Written by Cursor Bugbot for commit 9252a96.

@mikasenghaas mikasenghaas mentioned this pull request Jan 15, 2026
13 tasks
@mikasenghaas mikasenghaas requested a review from willccbb January 15, 2026 17:22
@mikasenghaas mikasenghaas marked this pull request as ready for review January 15, 2026 17:23
@mikasenghaas mikasenghaas changed the title multi-env evals multi-env evals config Jan 15, 2026
@mikasenghaas mikasenghaas mentioned this pull request Jan 19, 2026
19 tasks
@willccbb willccbb merged commit 780bb21 into main Jan 21, 2026
6 checks passed
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


def is_hub_env(env_id: str) -> bool:
    """Check if env_id refers to a Hub environment (has owner/ prefix)."""
    return "/" in env_id and not env_id.startswith("./") and not env_id.startswith("/")

Mismatch between is_hub_env and parse_env_id causes unhandled crash

Medium Severity

The is_hub_env function accepts any string containing / (that doesn't start with ./ or /), but parse_env_id requires exactly two parts when split by /. An input like "a/b/c" passes is_hub_env but causes parse_env_id to raise an unhandled ValueError. Both check_hub_env_installed and install_from_hub call parse_env_id after is_hub_env returns True without catching this exception, causing the CLI to crash with a traceback instead of a helpful error message.
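
The mismatch can be reproduced in a few lines. In the sketch below, `is_hub_env` is the function quoted in the review; `parse_env_id` is an assumed reimplementation based only on the description above (exactly two parts when split by `/`), and `is_hub_env_strict` is a hypothetical tightened check, one possible way to make the two functions agree rather than the fix that was actually adopted.

```python
def is_hub_env(env_id: str) -> bool:
    """As quoted in the review: any non-path string containing '/'."""
    return "/" in env_id and not env_id.startswith("./") and not env_id.startswith("/")


def parse_env_id(env_id: str) -> tuple[str, str]:
    """Assumed behaviour: require exactly 'owner/name'."""
    parts = env_id.split("/")
    if len(parts) != 2:
        raise ValueError(f"expected 'owner/name', got {env_id!r}")
    return parts[0], parts[1]


def is_hub_env_strict(env_id: str) -> bool:
    """Hypothetical tightened check: exactly two non-empty segments."""
    if env_id.startswith(("./", "/")):
        return False
    parts = env_id.split("/")
    return len(parts) == 2 and all(parts)


# "a/b/c" passes the loose check but cannot be parsed.
loose_accepts = is_hub_env("a/b/c")
try:
    parse_env_id("a/b/c")
    parse_failed = False
except ValueError:
    parse_failed = True
```

An alternative would be to keep `is_hub_env` as is and catch the `ValueError` at the call sites, turning it into a user-facing error message.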
