add stem sdg pipeline #1010
base: main

Conversation
Kipok left a comment
Consider adding Slurm tests for a simplified use case of this pipeline to ensure nothing is broken in the future.
nemo_skills/training/data_preparation_utils/config/stem_sft.yaml (outdated; resolved)
| - ["mmlu", "test"] | ||
| - ["mmlu-pro", "test"] | ||
| - ["gpqa", "diamond"] | ||
| model: /hf_models/Qwen2.5-32B-Instruct |
Consider using Qwen/Qwen2.5-32B-Instruct here and everywhere else to avoid the extra step of manually downloading the model.
I am constantly hitting a Hugging Face rate limit when using the HF model name instead of a local path.
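If the Hub name is the blocker, a middle ground is a one-time pre-download so configs can still point at a local path. A minimal sketch, assuming `huggingface_hub` is installed and `/hf_models` is the shared model directory used in these configs:

```python
# One-time pre-download; after this, configs keep using the local path and
# no Hub requests (or rate limits) happen at pipeline run time.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-32B-Instruct",
    local_dir="/hf_models/Qwen2.5-32B-Instruct",
)
```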
recipes/opensciencereasoning/configs/SDG_pipeline/gpt-oss-seed-data_with_gt.yaml (outdated; resolved)
recipes/opensciencereasoning/configs/SDG_pipeline/gpt-oss_with_gt_with_tool.yaml (outdated; resolved)
@jiacheng-xu, let's add the convert_to_qwen_from_messages and bucket-qwen stages into the base config
@jiacheng-xu, can we move the specific changes you have here into settings?
```python
)

# write to disk
with open(
```
@jiacheng-xu, it is not necessary to create a separate config for each dataset: there is an `--override` parameter you can use to apply the necessary updates to the current config, plus some prepared settings (e.g., for the MCQ/open-question generations) that can be applied with the `--settings` flag. How about converting the metadata into a list of settings and overrides for each dataset? That way, a YAML file per dataset is not needed.
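A minimal sketch of this suggestion; the dataset names, setting names, and override keys below are hypothetical, only `--settings` and `--override` come from the comment above, and the exact CLI shape is an assumption:

```python
# Hypothetical per-dataset metadata replacing one YAML file per dataset.
DATASET_META = {
    "dataset_a": {"settings": ["mcq_4_options"], "overrides": ["input_file=/data/a.jsonl"]},
    "dataset_b": {"settings": ["without_gt", "python_enabled"], "overrides": ["input_file=/data/b.jsonl"]},
}

for name, meta in DATASET_META.items():
    settings = ",".join(meta["settings"])
    overrides = " ".join(f"--override {o}" for o in meta["overrides"])
    # Each dataset becomes a launch of the same base config with tweaks applied.
    print(f"python pipeline/sdg_pipeline.py --settings {settings} {overrides}")
```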
```python
)


def bucket_qwen(cluster, expname, run_after, stage_config, **kwargs):
```
@jiacheng-xu, it looks the same as the `bucket` function. Can we just reuse it?
Adding @rimashahbazyan to this conversation
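If the two functions really are identical, the reuse could be as simple as registering both stage names against one implementation. A sketch, assuming the stage-registry dict quoted further down (the `STAGES` name and `bucket` stub are hypothetical):

```python
def bucket(cluster, expname, run_after, stage_config, **kwargs):
    ...  # existing implementation

# Both stage names point at the same function; no duplicated body to maintain.
STAGES = {
    "bucket": bucket,
    "bucket-qwen": bucket,
}
```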
```python
# Only include tools if ADD_TOOLS flag is set
tools = None
if ADD_TOOLS:
    if True:
```
@jiacheng-xu why "if True"?
Adding @rimashahbazyan to this conversation
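Assuming the `if True:` level is a debugging leftover with no other logic, the block presumably collapses to the following sketch (the branch body is a hypothetical stand-in):

```python
ADD_TOOLS = False  # set elsewhere in the real script

# Only include tools if the ADD_TOOLS flag is set; the redundant level is gone.
tools = None
if ADD_TOOLS:
    tools = []  # populate tool definitions here
```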
```python
def convert_to_qwen_from_messages(cluster, expname, run_after, stage_config, **kwargs):
    input_file = stage_config["input_file"]
    output_dir = stage_config["output_dir"]
    output_file = f"{output_dir}/final_result.jsonl"
```
```python
output_file = f"{output_dir}/{OUTPUT_FILE}.jsonl"
```
```python
)


def convert_to_qwen_from_messages(cluster, expname, run_after, stage_config, **kwargs):
```
@jiacheng-xu, is it convenient to convert from the messages format? The pipeline does not have to go stage-by-stage; if it is better to convert from the original file or the prepared one, that can be implemented.
Adding @rimashahbazyan to this conversation
It’s safer to keep the conversion from the messages format. If the model changes later, we’ll only need to update the conversion logic from that specific model to the messages format, while the rest of the conversions can remain the same. Otherwise, we risk introducing bugs, since these conversions are not straightforward to implement.
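To illustrate the pivot this reply describes (the record content is made up): every model-specific format converts to and from one canonical messages structure, so swapping the model only touches one converter.

```python
# Canonical messages-format record; all other formats convert via this shape.
record = {
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ]
}
# model_x -> messages: one new converter per model
# messages -> qwen:    shared, stays unchanged when the model changes
```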
| "convert_to_messages_format": convert_to_messages_format, | ||
| "bucket": bucket, | ||
| "convert_to_qwen_from_messages": convert_to_qwen_from_messages, | ||
| "remove_unused_fields": remove_unused_fields, |
@jiacheng-xu, are you sure we need this stage? Usually, it is better to include metadata in the final file so you can downsample, compute count statistics, and so on over the resulting data without needing to join it with other files.
Adding @rimashahbazyan to this conversation
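A small sketch of why metadata in the final file pays off: per-field statistics need no join with earlier stage outputs. The field name `topic` and the file name are illustrative.

```python
import json
from collections import Counter

# Count samples per topic directly from the final file, no joins needed.
counts = Counter()
with open("final_result.jsonl") as f:
    for line in f:
        counts[json.loads(line).get("topic", "unknown")] += 1
print(counts.most_common(10))
```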
```python
# ----------------------------------------------------------------------------
MODEL_NAME = "/hf_models/Qwen2.5-32B-Instruct"
_TOKENIZER = None  # type: ignore
ADD_TOOLS = False  # Will be set in main() if input filename contains 'with-tool'
```
@jiacheng-xu, I would avoid relying on the file name.
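One way to drop the filename sniffing is an explicit flag; a sketch with a hypothetical `--add-tools` option:

```python
import argparse

# The caller states the intent directly instead of encoding it in the
# input filename ("with-tool").
parser = argparse.ArgumentParser()
parser.add_argument("--add-tools", action="store_true",
                    help="Include tool definitions in the output.")
args = parser.parse_args()
ADD_TOOLS = args.add_tools
```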
```yaml
solution_key: ${output_key}
test_cases:
  - { input: { generation: "1 + 2 + 3 + 4 = 10" }, output: { generation: "1 + 2 + 3 + 4 = 10" } }
# TODO: implement fractional arithmetic
```
Is this needed?
```markdown
This folder provides templates, prompts, and scripts for the automated pipeline that powers the OpenScience data refresh. The pipeline launches distributed jobs through [`pipeline/sdg_pipeline.py`](pipeline/sdg_pipeline.py) and covers the full lifecycle: solution generation, ground-truth extraction, difficulty scoring, and topic labeling.

## Config Layout
- **Base pipeline**: [`configs/pipelines/base.yaml`](configs/pipelines/base.yaml) describes the default open-question flow with ground-truth answers available, no tool usage, and the boxed prompt.
```
Could you link the boxed prompt? Will the pipeline handle updating from boxed to something like the HLE prompt as the default, i.e. no boxed answer, which requires a judge?
For the difficulty estimation, it currently supports only boxed-like prompts. For solution generation, it should work (with proper modification of the config), but the predicted_answer for every sample will be empty, so the majority-voting part (in cases where the expected_answer is not set) will consequently work incorrectly.
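A toy illustration of the failure mode described above: with nothing boxed to extract, every sample "agrees" on the empty string, so the vote carries no signal.

```python
from collections import Counter

# predicted_answer extracted from non-boxed generations is always empty.
answers = ["", "", ""]
majority, votes = Counter(answers).most_common(1)[0]
print(repr(majority), votes)  # '' 3  -> a meaningless "majority"
```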
```markdown
- `python_enabled` — enable python-tool prompting and sandbox execution.
- `mcq_4_options` — switch to the [`eval/aai/mcq-4choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-4choices.yaml) prompt for generation.
- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
```
What does the "metadata-only flow" include?
Decontamination, topics, difficulty
Renamed and added a reference to the section with the description
```markdown
- `mcq_4_options` — switch to the [`eval/aai/mcq-4choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-4choices.yaml) prompt for generation.
- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
- `seed_data_postprocess` — keep only the generation → filtering → SFT preparation stages for reasoning above existing seed data.
```
It is not entirely clear what you mean here. How about listing all possible steps/stages that the pipeline can do somewhere at the top? It might make defining these flags easier.
I moved the description of the stages to the top.
```markdown
- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
- `seed_data_postprocess` — keep only the generation → filtering → SFT preparation stages for reasoning above existing seed data.
- `multiple_prompts` - allow the usage of multiple prompts for the generation.
```
Could you elaborate here? E.g., "enables the use of multiple prompts (including distinct preambles and varied output formats) during the generation process to address prompt sensitivity." It is not clear what additional prompts/output formats are used here, or how to specify them.
This part of the README specifies it at a high level; there is a detailed explanation below: https://github.com/NVIDIA-NeMo/Skills/blob/02d1aff80650b677add866c9e9b76ac110f532ac/recipes/opensciencereasoning/sdg_pipeline/README.md#using-the-multiple_prompts-setting
Added the reference to the section
```markdown
- **Base pipeline**: [`configs/pipelines/base.yaml`](configs/pipelines/base.yaml) describes the default open-question flow with ground-truth answers available, no tool usage, and the boxed prompt.
- **Settings overrides** (under [`configs/settings/`](configs/settings/)) layer small, reusable tweaks. Reference them with or without the `.yaml` suffix:
  - `without_gt` — route the pipeline through solution generation + majority voting to estimate ground truth answer.
  - `python_enabled` — enable python-tool prompting and sandbox execution.
```
Any specifics here? E.g., what is the python-tool prompt, and where is it defined?
Here is the link to all the parameters for each setting: [`configs/settings/`](configs/settings/) (it is in the README).
```markdown
# OpenScienceReasoning Pipeline Quickstart
This folder provides templates, prompts, and scripts for the automated pipeline that powers the OpenScience data refresh. The pipeline launches distributed jobs through [`pipeline/sdg_pipeline.py`](pipeline/sdg_pipeline.py) and covers the full lifecycle: solution generation, ground-truth extraction, difficulty scoring, and topic labeling.

## Config Layout
```
Any assumptions on the incoming files, like the .jsonl format? Any particular fields needed?
Yes, it is described in the How to use section: https://github.com/NVIDIA-NeMo/Skills/blob/02d1aff80650b677add866c9e9b76ac110f532ac/recipes/opensciencereasoning/sdg_pipeline/README.md#how-to-use
I have added the reference to the section in the filter_problem stage description
```python
print(page.content[:500])  # First 500 characters of content
print(page.links[:10])  # First 10 linked pages
```
@jiacheng-xu - let's update this one to the final version you're working on