-
Notifications
You must be signed in to change notification settings - Fork 475
optional DatasetBuilder pattern #739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
mikasenghaas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, i like the pattern a lot. can you confirm my understanding: directly passing Dataset will eagerly build it, this means for such envs we will still "double" load datasets on e.g. the prime-rl orchestrator (which needs access to the dataset for sampling reasons) + on each env worker. but with this pattern we can use the builder class in which case the env workers would never have to build the dataset and it only exists on the orchestrator
im still a little worried abt the arbirary code execution part of load_environment which may have unpredictable side-effects (e.g. wiki-search creating a locked local db instance) but im not sure there's anything we can do abt it? maybe just move such init into a special space (analogous to setup_state but not on a per-rollout basis but globally?). in this world, setting up global state in load_environment would just be undefined behavior?
|
For now this doesn't deduplicate behavior, we can handle that at env client/server level. It's just allowing deferral of dataset-specific preprocessing to whenever get_dataset is first called, so usage patterns which only call it for one of many replicas (like only getting data from a master worker + sending rollout requests to all workers) only need one loading total. |
Description
Allows passing a Callable[[], Dataset] in place of Dataset for
dataset/eval_dataset, lazy-loads whenget_dataset/get_eval_datasetis first called.Type of Change
Testing
uv run pytestlocally.Checklist
Additional Notes
Note
Adds optional lazy dataset loading to environments, enabling deferred construction of
Datasets via aDatasetBuildercallable.Environmentnow acceptsDataset | DatasetBuilderfordataset/eval_dataset, with newbuild_dataset()/build_eval_dataset()and formatting via_format_dataset_source;get_*_datasettriggers builds; preserves eager build for rawDatasets. Storesmap_kwargsand validatesmessage_typeearlier.EnvGroupconcatenation now builds sub-env datasets viaenv.build_dataset()/build_eval_dataset(); RLorchestratorfilters withenv.get_dataset()instead of accessingenv.datasetdirectly.vf.DatasetBuildertype alias and export fromverifiers.__init__.docs/environments.mdandenvironments/AGENTS.md; expandRLMEnvdetails.environments/alphabet_sortto useget_dataset_builder, extract helpers, and return builder-backedSortingEnv; bump version to0.1.11.data_utils.py; small eval input simplification.Written by Cursor Bugbot for commit 714d8b0. This will update automatically on new commits. Configure here.