optional DatasetBuilder pattern #739

willccbb · 2026-01-16T22:51:18Z

Description

Allows passing a Callable[[], Dataset] in place of a Dataset for dataset / eval_dataset. The dataset is built lazily the first time get_dataset / get_eval_dataset is called.
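A minimal sketch of the lazy-loading behavior described above, not the actual verifiers implementation. `LazyEnv`, `build`, and `calls` are hypothetical names, and a plain list of dicts stands in for a real `Dataset` so the example has no dependencies.

```python
from typing import Callable, Union

# Stand-in for a real Dataset type so the sketch runs without dependencies.
Dataset = list[dict]
# Assumed shape of the builder alias: a zero-argument callable returning a Dataset.
DatasetBuilder = Callable[[], Dataset]


class LazyEnv:
    """Hypothetical environment that accepts either a Dataset or a builder."""

    def __init__(self, dataset: Union[Dataset, DatasetBuilder, None] = None):
        self._dataset = None
        self._builder = None
        if callable(dataset):
            self._builder = dataset  # defer construction until first access
        else:
            self._dataset = dataset  # raw Dataset: available immediately

    def get_dataset(self) -> Dataset:
        # Build on first access, then reuse the cached result.
        if self._dataset is None and self._builder is not None:
            self._dataset = self._builder()
        if self._dataset is None:
            raise ValueError("no dataset was provided")
        return self._dataset


calls = []  # records every invocation of the builder


def build() -> Dataset:
    calls.append(1)
    return [{"question": "2+2", "answer": "4"}]


env = LazyEnv(dataset=build)
built_before_access = len(calls)  # nothing has been built at construction time
first = env.get_dataset()
second = env.get_dataset()  # served from the cache, builder is not called again
```

Caching the built dataset on the instance is one possible design choice here; it keeps repeated `get_dataset` calls cheap at the cost of holding the data in memory.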

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

Note

Adds optional lazy dataset loading to environments, enabling deferred construction of Datasets via a DatasetBuilder callable.

Core: Environment now accepts Dataset | DatasetBuilder for dataset/eval_dataset, with new build_dataset()/build_eval_dataset() and formatting via _format_dataset_source; get_*_dataset triggers builds; preserves eager build for raw Datasets. Stores map_kwargs and validates message_type earlier.
Group/Trainer: EnvGroup concatenation now builds sub-env datasets via env.build_dataset()/build_eval_dataset(); RL orchestrator filters with env.get_dataset() instead of accessing env.dataset directly.
Types/Exports: Add vf.DatasetBuilder type alias and export from verifiers.__init__.
Docs: Add "Lazy Loading with DatasetBuilder" section to docs/environments.md and environments/AGENTS.md; expand RLMEnv details.
Example env: Refactor environments/alphabet_sort to use get_dataset_builder, extract helpers, and return builder-backed SortingEnv; bump version to 0.1.11.
Misc: Minor typing/casts in data_utils.py; small eval input simplification.
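The builder-returning helper and the group concatenation mentioned in the list above could look roughly like this. It is a sketch under assumptions: only the names `build_dataset` and `get_dataset_builder` come from the summary; their signatures, the `SubEnv` class, `concat_group_datasets`, and the `task` field are hypothetical, and lists of dicts stand in for `Dataset`.

```python
from typing import Callable

Dataset = list[dict]                    # stand-in for a real Dataset type
DatasetBuilder = Callable[[], Dataset]  # assumed shape of the alias


def get_dataset_builder(num_examples: int, seed: int = 0) -> DatasetBuilder:
    """Return a closure that captures its config and builds rows only when called."""

    def build() -> Dataset:
        return [{"prompt": f"item-{seed}-{i}"} for i in range(num_examples)]

    return build


class SubEnv:
    """Hypothetical sub-environment that holds a builder instead of a Dataset."""

    def __init__(self, name: str, builder: DatasetBuilder):
        self.name = name
        self._builder = builder

    def build_dataset(self) -> Dataset:
        return self._builder()


def concat_group_datasets(envs: list[SubEnv]) -> Dataset:
    """Build each sub-env dataset and tag every row with the env it came from."""
    rows: Dataset = []
    for env in envs:
        for row in env.build_dataset():
            rows.append({**row, "task": env.name})
    return rows


group = [
    SubEnv("sort", get_dataset_builder(2)),
    SubEnv("math", get_dataset_builder(1, seed=7)),
]
combined = concat_group_datasets(group)
```

Returning a closure from a helper keeps `load_environment`-style setup cheap: the configuration is captured up front, while the actual rows are produced only when something asks for them.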

Written by Cursor Bugbot for commit 714d8b0. This will update automatically on new commits.

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.


verifiers/envs/environment.py

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


verifiers/envs/environment.py

mikasenghaas

nice, i like the pattern a lot. can you confirm my understanding: directly passing a Dataset will build it eagerly, which means for such envs we will still "double"-load datasets, e.g. on the prime-rl orchestrator (which needs access to the dataset for sampling) and on each env worker. but with this pattern we can use the builder, in which case the env workers would never have to build the dataset and it only exists on the orchestrator?

i'm still a little worried about the arbitrary code execution part of load_environment, which may have unpredictable side effects (e.g. wiki-search creating a locked local db instance), but i'm not sure there's anything we can do about it? maybe just move such init into a special place (analogous to setup_state, but global rather than per-rollout)? in this world, setting up global state in load_environment would just be undefined behavior?

verifiers/rl/trainer/orchestrator.py

verifiers/envs/environment.py

willccbb · 2026-01-17T18:34:04Z

For now this doesn't deduplicate behavior; we can handle that at the env client/server level.

It just defers dataset-specific preprocessing until get_dataset is first called, so usage patterns that call it on only one of many replicas (e.g. getting data from a master worker and sending rollout requests to all workers) need only one load in total.
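The single-load usage pattern described here can be illustrated with a small simulation. This is a sketch under assumptions, not code from the PR: `Replica`, `expensive_builder`, and `rollout` are hypothetical, and a list of dicts stands in for `Dataset`.

```python
from typing import Callable

Dataset = list[dict]  # stand-in for a real Dataset type

build_calls = []  # one entry per time the expensive builder actually runs


def expensive_builder() -> Dataset:
    build_calls.append(1)
    return [{"prompt": "sort: c b a"}]


class Replica:
    """Hypothetical env replica that builds its dataset only on first access."""

    def __init__(self, builder: Callable[[], Dataset]):
        self._builder = builder
        self._dataset = None

    def get_dataset(self) -> Dataset:
        if self._dataset is None:
            self._dataset = self._builder()
        return self._dataset

    def rollout(self, row: dict) -> str:
        # Rollouts operate on rows sent by the master; they never touch get_dataset.
        return f"rollout for {row['prompt']}"


# Every replica receives the same builder, but constructing them builds nothing.
replicas = [Replica(expensive_builder) for _ in range(4)]
master, workers = replicas[0], replicas[1:]

rows = master.get_dataset()  # the only place the dataset is built
results = [worker.rollout(row) for worker in workers for row in rows]
```

If every replica called `get_dataset` itself, the builder would still run once per replica, which matches the point that deduplication is left to the env client/server level.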

optional DatasetBuilder pattern

9583efe

willccbb requested a review from mikasenghaas January 16, 2026 22:51

cursor bot reviewed Jan 16, 2026

verifiers/envs/environment.py (outdated)

verifiers/envs/environment.py

verifiers/envs/environment.py (outdated)

env group, docs, fixes

563daf9

cursor bot reviewed Jan 17, 2026

verifiers/envs/environment.py (outdated)

fix RLTrainer

e04399b

mikasenghaas reviewed Jan 17, 2026

verifiers/rl/trainer/orchestrator.py (outdated)

verifiers/envs/environment.py (outdated)

mikasenghaas mentioned this pull request Jan 19, 2026

env server/client #744

Draft

19 tasks

willccbb added 2 commits January 19, 2026 13:47

simplify dataset builder

8a0291c

rename map_kwargs

714d8b0

samsja approved these changes Jan 20, 2026


willccbb merged commit ea7eaa8 into main Jan 20, 2026
6 checks passed


4 participants
