[DRAFT] Env Client/Server refactor proposal #740

willccbb · 2026-01-17T06:47:01Z

Description

Enables multi-processing for env workers

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

Note

Introduces a new results schema and multiprocessing evaluation pipeline.

Define RolloutResult and change GenerateOutputs to rollouts + metadata; update docs, printing, dataset conversion, and save utilities to consume rollout-centric outputs
Refactor Environment.generate/evaluate to build outputs via type_utils (state_to_result, build_generate_outputs) and sort/persist consistently; remove legacy parallel-list fields
Add multiprocessing env workers (verifiers.workers.{client,server,types}), export EnvClient/EnvServer, and migrate run_evaluation to use workers; add --num-workers and wire through EvalConfig
Update RL trainer to read trajectories/timing from rollouts; adjust CLI, tests, and minor utilities (message/data utils, rubric, math_rubric import) accordingly

Written by Cursor Bugbot for commit 10ed830.

cursor

Cursor Bugbot has reviewed your changes and found 5 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

cursor · 2026-01-17T06:53:43Z

tests/test_gym_env.py


    res = env.evaluate_sync(client=client, model="mock")
-    st = res["state"][0]
+    st = res["results"][0]


Test accesses removed reward key on GenerateOutputs

High Severity

The assertion res["reward"] == [1.0] accesses a key that no longer exists. After the refactor, GenerateOutputs only contains rollouts and metadata. Rewards are now accessed via individual rollout results like res["rollouts"][0].get("reward").
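A minimal sketch of the rollout-centric access pattern this comment describes. The `outputs` dict is a hypothetical stand-in for the refactored `GenerateOutputs`; only the `rollouts` and `metadata` keys and the `.get("reward")` access come from the comment, and `rollout_rewards` is an illustrative helper, not part of the PR.

```python
from typing import Any, Optional

# Hypothetical stand-in for the refactored GenerateOutputs: only the
# "rollouts" and "metadata" keys are taken from the review comment.
outputs: dict[str, Any] = {
    "rollouts": [{"reward": 1.0}, {"reward": 0.0}, {}],
    "metadata": {},
}


def rollout_rewards(outputs: dict[str, Any]) -> list[Optional[float]]:
    """Collect per-rollout rewards; a rollout without a reward yields None."""
    return [rollout.get("reward") for rollout in outputs["rollouts"]]


rewards = rollout_rewards(outputs)
```

Under this assumption, a test that previously asserted on `res["reward"]` would assert on the per-rollout values instead.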

verifiers/workers/client.py

verifiers/utils/eval_utils.py

verifiers/workers/server.py

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.

verifiers/workers/server.py

verifiers/utils/type_utils.py

docs/reference.md

verifiers/utils/eval_utils.py

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

verifiers/workers/client.py

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-01-17T08:57:22Z

verifiers/rl/trainer/orchestrator.py

                completion_logprobs.append(tokens["completion_logprobs"])
                advantages.append(step["advantage"])

-        # Build rewards_dict from rollout-level data (for logging only)


KeyError when accessing optional advantage field in trajectory

High Severity

The orchestrator accesses step["advantage"] directly, but RolloutResultTrajectoryStep has total=False making advantage an optional key. When state_to_result converts a trajectory step, it only includes advantage if it's not None. If scoring hasn't run or uses dummy_score_rollout (which doesn't set advantage), the key will be absent and this line will raise KeyError instead of returning None as the old code did.

Additional Locations (1)

verifiers/utils/type_utils.py#L102-L104
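The failure mode above can be reproduced in isolation. In this sketch, `TrajectoryStep` is a reduced, hypothetical stand-in for `RolloutResultTrajectoryStep` (only `total=False` and the optional `advantage` key come from the comment), and `.get` is one possible way to restore the old "return None" behavior, not necessarily the fix the PR would adopt.

```python
from typing import Optional, TypedDict


class TrajectoryStep(TypedDict, total=False):
    # Hypothetical, reduced stand-in: with total=False every key is optional.
    reward: float
    advantage: float


def step_advantage(step: TrajectoryStep) -> Optional[float]:
    """Return the advantage if present, else None (no KeyError)."""
    return step.get("advantage")


steps: list[TrajectoryStep] = [
    {"reward": 1.0, "advantage": 0.5},
    {"reward": 0.0},  # e.g. scoring has not run, so "advantage" is absent
]
advantages = [step_advantage(s) for s in steps]
```

Direct indexing (`step["advantage"]`) on the second step would raise `KeyError`, which is the behavior the comment flags.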

mikasenghaas

im not super convinced that ipc is the right protocol to use here, because the env client "owns" the env workers. this should work for the current prime-rl/vf-eval use cases, but im wondering if it might be too restrictive for potential future use cases, i.e. i could see it becoming desirable to really "host" an env server that multiple clients can talk to, which would be hard to do with ipc.

also my gut feeling tells me that in order to make this maintainable we should aim to mirror the regular environment api as closely as possible (almost like a pass-through). the end goal would be that

env = vf.load_environment(env_id, **env_args)

can be 1-1 replaced with

env_client = EnvClient(env_id, **env_args)

im not sure this is fully realistic, but imo the ideal state would be a contract/interface general enough to work in-proc and across arbitrary protocols.
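One way to picture the "pass-through" contract suggested here is a shared structural interface that both the in-proc environment and the client satisfy. Everything in this sketch is hypothetical: `EnvLike`, `InProcEnv`, and `PassThroughClient` are illustrative names, and the single `run_group` method stands in for whatever the real `Environment` API exposes.

```python
import asyncio
from typing import Any, Protocol


class EnvLike(Protocol):
    """Hypothetical contract shared by in-proc envs and clients."""

    async def run_group(
        self, group_inputs: list[dict[str, Any]]
    ) -> list[dict[str, Any]]: ...


class InProcEnv:
    """Toy in-process environment that scores every input with 1.0."""

    async def run_group(self, group_inputs):
        return [{"example_id": x["example_id"], "reward": 1.0} for x in group_inputs]


class PassThroughClient:
    """Forwards calls to a backend; a real client would use a transport instead."""

    def __init__(self, backend: EnvLike) -> None:
        self._backend = backend

    async def run_group(self, group_inputs):
        return await self._backend.run_group(group_inputs)


async def evaluate(env: EnvLike, inputs: list[dict[str, Any]]):
    # Caller code depends only on the contract, not on where the env runs.
    return await env.run_group(inputs)


inputs = [{"example_id": 0}, {"example_id": 1}]
direct = asyncio.run(evaluate(InProcEnv(), inputs))
via_client = asyncio.run(evaluate(PassThroughClient(InProcEnv()), inputs))
```

Because `evaluate` only sees `EnvLike`, swapping the in-proc object for a client changes nothing on the caller side, which is the property the comment asks for.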

mikasenghaas · 2026-01-17T15:55:39Z

docs/reference.md

     ### GenerateOutputs

     ```python
     class GenerateOutputs(TypedDict):


ahhh i love this!! finally row order:))

mikasenghaas · 2026-01-17T16:03:50Z

verifiers/workers/types.py

+    """
+
+    group_inputs: list[RolloutInput]
+    example_id: int


should prob name this request_id and construct it internally (not part of the public-facing api), because the same example_id might be in flight multiple times (we actually had that bug in prime-rl)
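A small sketch of why a separately generated request id avoids the collision described here. `GroupRequest` and the `pending` map are hypothetical; only the idea of an internally constructed `request_id` alongside `example_id` comes from the comment, and `uuid4` is just one reasonable way to generate it.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class GroupRequest:
    # Hypothetical request type: example_id identifies the data row,
    # request_id identifies this particular in-flight request.
    example_id: int
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)


# Two concurrent requests for the same example must not overwrite each other.
pending: dict[str, GroupRequest] = {}
for req in (GroupRequest(example_id=7), GroupRequest(example_id=7)):
    pending[req.request_id] = req
```

Keying `pending` by `example_id` instead would leave a single entry, so one of the two responses could not be routed back to its caller.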

mikasenghaas · 2026-01-17T16:05:54Z

verifiers/workers/client.py

+        # Request dataset and metadata from first worker
+        first_worker.send_request(MetadataRequest(num_examples=num_examples))
+        response = first_worker.recv_response(timeout=120)
+        if response is None or not isinstance(response, MetadataResponse):
+            raise RuntimeError("Failed to get metadata from worker")


i thought the point of the dataset builder pattern was that we dont have to transport the dataset via ipc. would really like to try to avoid this op. i feel like any env worker may or may not build the dataset (by default they don't build it, but this should be configurable from the client)
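A rough sketch of the configurable behavior suggested here, assuming each worker receives a builder callable rather than the dataset itself. `Worker`, `build_rows`, and the `should_build` flag are hypothetical names for illustration and do not reflect the PR's actual worker code.

```python
from typing import Callable, Optional

Dataset = list[dict[str, str]]


def build_rows() -> Dataset:
    """Hypothetical dataset builder; a real one would load data locally."""
    return [{"question": f"q{i}"} for i in range(3)]


class Worker:
    """Toy worker that only materializes the dataset when asked to."""

    def __init__(self, builder: Callable[[], Dataset], should_build: bool = False) -> None:
        self._builder = builder
        # Default: do not build, so nothing dataset-sized lives in the worker.
        self.dataset: Optional[Dataset] = builder() if should_build else None


lazy_worker = Worker(build_rows)
eager_worker = Worker(build_rows, should_build=True)
```

Since only the builder (or a reference to it) crosses the process boundary, the dataset never has to be serialized over ipc.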

mikasenghaas · 2026-01-17T16:08:11Z

verifiers/workers/client.py

+            if not future.done():
+                future.set_result(response.results)
+
+    async def run_groups(


why expose run_groups here instead of run_group? imo it would be desirable to mirror the regular Environment methods as closely as possible with the EnvClient, so that an EnvClient can just be hot-swapped in for an in-proc Environment
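To illustrate the point, a batch method can be layered on top of a single-group method without being part of the core contract. The functions below are hypothetical stand-ins with toy bodies; they only show that `run_groups` reduces to concurrent `run_group` calls.

```python
import asyncio
from typing import Any


async def run_group(group_inputs: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Toy single-group call; a real one would execute rollouts."""
    await asyncio.sleep(0)  # yield control, as a real async call would
    return [{"example_id": x["example_id"]} for x in group_inputs]


async def run_groups(groups: list[list[dict[str, Any]]]) -> list[list[dict[str, Any]]]:
    """Batch helper expressed purely in terms of run_group."""
    return await asyncio.gather(*(run_group(g) for g in groups))


groups = [[{"example_id": 0}], [{"example_id": 1}, {"example_id": 1}]]
results = asyncio.run(run_groups(groups))
```

`asyncio.gather` returns results in the order the awaitables were passed, so group order is preserved.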

willccbb · 2026-01-20T00:32:51Z

Gonna close this + work off your other PR, can revisit ideas from it as needed.

DRAFT checkpoint -- env client/server refactor

1a2d34e

cursor bot reviewed Jan 17, 2026

DRAFT checkpoint -- env client/server refactor

1ca407a

cursor bot reviewed Jan 17, 2026

verifiers/workers/server.py

verifiers/utils/type_utils.py (outdated)

docs/reference.md

verifiers/utils/eval_utils.py (outdated)

fix metrics saving

56e8a5b

cursor bot reviewed Jan 17, 2026

verifiers/workers/client.py

fix bugbot

10ed830

cursor bot reviewed Jan 17, 2026

mikasenghaas reviewed Jan 17, 2026

willccbb closed this Jan 20, 2026
