-
Notifications
You must be signed in to change notification settings - Fork 472
[DRAFT] Env Client/Server refactor proposal #740
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 5 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
tests/test_gym_env.py
Outdated
|
|
||
| res = env.evaluate_sync(client=client, model="mock") | ||
| st = res["state"][0] | ||
| st = res["results"][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test accesses removed reward key on GenerateOutputs
High Severity
The assertion res["reward"] == [1.0] accesses a key that no longer exists. After the refactor, GenerateOutputs only contains rollouts and metadata. Rewards are now accessed via individual rollout results like res["rollouts"][0].get("reward").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| completion_logprobs.append(tokens["completion_logprobs"]) | ||
| advantages.append(step["advantage"]) | ||
|
|
||
| # Build rewards_dict from rollout-level data (for logging only) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KeyError when accessing optional advantage field in trajectory
High Severity
The orchestrator accesses step["advantage"] directly, but RolloutResultTrajectoryStep has total=False making advantage an optional key. When state_to_result converts a trajectory step, it only includes advantage if it's not None. If scoring hasn't run or uses dummy_score_rollout (which doesn't set advantage), the key will be absent and this line will raise KeyError instead of returning None as the old code did.
Additional Locations (1)
mikasenghaas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im not super convinced that ipc is the right protocol to use here. because the env client "owns" the env workers. this should work for the current prime-rl/vf-eval usecases but im wondering if it might not be too restrictive for potential future usecases, ie. i could see it become desirable to really "host" an env server, that multiple clients can talk which would be hard to do with ipc.
also my gut feeling tells me that in order to make this maintainable we should aim to mirror the regular environment api as closely as possible (almost like a pass-through). the end goal would be that
env = vf.load_environment(env_id, **env_args)can be 1-1 replaced with
env_client = EnvClient(env_id, **env_args)im not sure this is fully realistic but imo would be the ideal state to have a contract/interface general enough to work in-proc and across arbitrary protocols.
| ### GenerateOutputs | ||
|
|
||
| ```python | ||
| class GenerateOutputs(TypedDict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahhh i love this!! finally row order:))
| """ | ||
|
|
||
| group_inputs: list[RolloutInput] | ||
| example_id: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should prob name this request_id and construct internally (not part of public-facing api) because the same example_id might be in-flight multiple times (we acc had that bug in prime-rl)
| # Request dataset and metadata from first worker | ||
| first_worker.send_request(MetadataRequest(num_examples=num_examples)) | ||
| response = first_worker.recv_response(timeout=120) | ||
| if response is None or not isinstance(response, MetadataResponse): | ||
| raise RuntimeError("Failed to get metadata from worker") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought the point of the dataset builder pattern was that we dont have to transport the dataset via ipc. would really like to try to avoid this op. i feel like any env worker may or may not build the dataset (by default, they don't build it but should be configurable from client)
| if not future.done(): | ||
| future.set_result(response.results) | ||
|
|
||
| async def run_groups( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why expose run_groups here instead of run_group, imo it would be desirable to mirror the regular Environment methods as closely as possible with the EnvClient so that an EnvClient can just be hotswapped in for an in-proc Environment
|
Gonna close this + work off your other PR, can revisit ideas from it as needed. |
Description
Enables multi-processing for env workers
Type of Change
Testing
uv run pytestlocally.Checklist
Additional Notes
Note
Introduces a new results schema and multiprocessing evaluation pipeline.
RolloutResultand changeGenerateOutputstorollouts + metadata; update docs, printing, dataset conversion, and save utilities to consume rollout-centric outputsEnvironment.generate/evaluateto build outputs viatype_utils(state_to_result,build_generate_outputs) and sort/persist consistently; remove legacy parallel-list fieldsverifiers.workers.{client,server,types}), exportEnvClient/EnvServer, and migraterun_evaluationto use workers; add--num-workersand wire throughEvalConfigrollouts; adjust CLI, tests, and minor utilities (message/data utils, rubric, math_rubric import) accordinglyWritten by Cursor Bugbot for commit 10ed830. This will update automatically on new commits. Configure here.