add stem sdg pipeline #1010
base: main

Conversation
Kipok left a comment
Consider adding Slurm tests for a simplified use case of this pipeline to ensure nothing is broken in the future.
nemo_skills/training/data_preparation_utils/config/stem_sft.yaml (outdated; resolved)
| - ["mmlu", "test"] | ||
| - ["mmlu-pro", "test"] | ||
| - ["gpqa", "diamond"] | ||
| model: /hf_models/Qwen2.5-32B-Instruct |
Consider using Qwen/Qwen2.5-32B-Instruct here and everywhere else to avoid the extra step of manually downloading the model.
I am constantly hitting a Hugging Face rate limit when using the HF model name instead of a local path.
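If the Hub name is the blocker, a middle ground is a one-time pre-download so configs can still point at a local path. A minimal sketch, assuming `huggingface_hub` is installed and `/hf_models` is the shared model directory used in these configs:

```python
# One-time pre-download; after this, configs keep using the local path and
# no Hub requests (or rate limits) happen at pipeline run time.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-32B-Instruct",
    local_dir="/hf_models/Qwen2.5-32B-Instruct",
)
```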
recipes/opensciencereasoning/configs/SDG_pipeline/gpt-oss-seed-data_with_gt.yaml (outdated; resolved)
recipes/opensciencereasoning/configs/SDG_pipeline/gpt-oss_with_gt_with_tool.yaml (outdated; resolved)
@jiacheng-xu, let's add the convert_to_qwen_from_messages and bucket-qwen stages into the base config
@jiacheng-xu, can we move the specific changes you have here into settings?
```python
)

# write to disk
with open(
```
@jiacheng-xu, it is not necessary to create a separate config for each dataset: there is an `--override` parameter you can use to apply the necessary updates to the current config, plus some prepared settings (e.g., for the MCQ/open-question generations) that can be applied with the `--settings` flag. How about converting the metadata into a list of settings and overrides for each dataset? That way, a YAML file per dataset is not needed.
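A minimal sketch of this suggestion; the dataset names, setting names, and override keys below are hypothetical, only `--settings` and `--override` come from the comment above, and the exact CLI shape is an assumption:

```python
# Hypothetical per-dataset metadata replacing one YAML file per dataset.
DATASET_META = {
    "dataset_a": {"settings": ["mcq_4_options"], "overrides": ["input_file=/data/a.jsonl"]},
    "dataset_b": {"settings": ["without_gt", "python_enabled"], "overrides": ["input_file=/data/b.jsonl"]},
}

for name, meta in DATASET_META.items():
    settings = ",".join(meta["settings"])
    overrides = " ".join(f"--override {o}" for o in meta["overrides"])
    # Each dataset becomes a launch of the same base config with tweaks applied.
    print(f"python pipeline/sdg_pipeline.py --settings {settings} {overrides}")
```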
```python
)


def bucket_qwen(cluster, expname, run_after, stage_config, **kwargs):
```
@jiacheng-xu, it looks the same as the `bucket` function. Can we just reuse it?
Adding @rimashahbazyan to this conversation
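If the two functions really are identical, the reuse could be as simple as registering both stage names against one implementation. A sketch, assuming the stage-registry dict quoted further down (the `STAGES` name and `bucket` stub are hypothetical):

```python
def bucket(cluster, expname, run_after, stage_config, **kwargs):
    ...  # existing implementation

# Both stage names point at the same function; no duplicated body to maintain.
STAGES = {
    "bucket": bucket,
    "bucket-qwen": bucket,
}
```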
```python
# Only include tools if ADD_TOOLS flag is set
tools = None
if ADD_TOOLS:
    if True:
```
@jiacheng-xu why "if True"?
Adding @rimashahbazyan to this conversation
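Assuming the `if True:` level is a debugging leftover with no other logic, the block presumably collapses to the following sketch (the branch body is a hypothetical stand-in):

```python
ADD_TOOLS = False  # set elsewhere in the real script

# Only include tools if the ADD_TOOLS flag is set; the redundant level is gone.
tools = None
if ADD_TOOLS:
    tools = []  # populate tool definitions here
```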
```python
def convert_to_qwen_from_messages(cluster, expname, run_after, stage_config, **kwargs):
    input_file = stage_config["input_file"]
    output_dir = stage_config["output_dir"]
    output_file = f"{output_dir}/final_result.jsonl"
```
```python
output_file = f"{output_dir}/{OUTPUT_FILE}.jsonl"
```
```python
)


def convert_to_qwen_from_messages(cluster, expname, run_after, stage_config, **kwargs):
```
@jiacheng-xu, is it convenient to convert from the messages format? The pipeline does not have to go stage-by-stage; if it is better to convert from the original file or the prepared one, that can be implemented.
Adding @rimashahbazyan to this conversation
It’s safer to keep the conversion from the messages format. If the model changes later, we’ll only need to update the conversion logic from that specific model to the messages format, while the rest of the conversions can remain the same. Otherwise, we risk introducing bugs, since these conversions are not straightforward to implement.
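To illustrate the pivot this reply describes (the record content is made up): every model-specific format converts to and from one canonical messages structure, so swapping the model only touches one converter.

```python
# Canonical messages-format record; all other formats convert via this shape.
record = {
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ]
}
# model_x -> messages: one new converter per model
# messages -> qwen:    shared, stays unchanged when the model changes
```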
| "convert_to_messages_format": convert_to_messages_format, | ||
| "bucket": bucket, | ||
| "convert_to_qwen_from_messages": convert_to_qwen_from_messages, | ||
| "remove_unused_fields": remove_unused_fields, |
@jiacheng-xu, are you sure we need this stage? Usually, it is better to include metadata in the final file so you can downsample, compute count statistics, and so on over the resulting data without needing to join it with other files.
Adding @rimashahbazyan to this conversation
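A small sketch of why metadata in the final file pays off: per-field statistics need no join with earlier stage outputs. The field name `topic` and the file name are illustrative.

```python
import json
from collections import Counter

# Count samples per topic directly from the final file, no joins needed.
counts = Counter()
with open("final_result.jsonl") as f:
    for line in f:
        counts[json.loads(line).get("topic", "unknown")] += 1
print(counts.most_common(10))
```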
```python
# ----------------------------------------------------------------------------
MODEL_NAME = "/hf_models/Qwen2.5-32B-Instruct"
_TOKENIZER = None  # type: ignore
ADD_TOOLS = False  # Will be set in main() if input filename contains 'with-tool'
```
@jiacheng-xu, I would avoid relying on the file name.
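One way to drop the filename sniffing is an explicit flag; a sketch with a hypothetical `--add-tools` option:

```python
import argparse

# The caller states the intent directly instead of encoding it in the
# input filename ("with-tool").
parser = argparse.ArgumentParser()
parser.add_argument("--add-tools", action="store_true",
                    help="Include tool definitions in the output.")
args = parser.parse_args()
ADD_TOOLS = args.add_tools
```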
```yaml
solution_key: ${output_key}
test_cases:
  - { input: { generation: "1 + 2 + 3 + 4 = 10" }, output: { generation: "1 + 2 + 3 + 4 = 10" } }
# TODO: implement fractional arithmetic
```
Is this needed?
```markdown
This folder provides templates, prompts, and scripts for the automated pipeline that powers the OpenScience data refresh. The pipeline launches distributed jobs through [`pipeline/sdg_pipeline.py`](pipeline/sdg_pipeline.py) and covers the full lifecycle: solution generation, ground-truth extraction, difficulty scoring, and topic labeling.

## Config Layout
- **Base pipeline**: [`configs/pipelines/base.yaml`](configs/pipelines/base.yaml) describes the default open-question flow with ground-truth answers available, no tool usage, and the boxed prompt.
```
Could you link the boxed prompt? Will the pipeline handle updating from boxed to something like the HLE prompt as the default, i.e. no boxed answer, which requires a judge?
For the difficulty estimation, it currently supports only boxed-like prompts. For solution generation, it should work (with proper modification of the config), but the predicted_answer for every sample will be empty, so the majority-voting part (in cases where the expected_answer is not set) will consequently work incorrectly.
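A toy illustration of the failure mode described above: with nothing boxed to extract, every sample "agrees" on the empty string, so the vote carries no signal.

```python
from collections import Counter

# predicted_answer extracted from non-boxed generations is always empty.
answers = ["", "", ""]
majority, votes = Counter(answers).most_common(1)[0]
print(repr(majority), votes)  # '' 3  -> a meaningless "majority"
```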
```markdown
- `python_enabled` — enable python-tool prompting and sandbox execution.
- `mcq_4_options` — switch to the [`eval/aai/mcq-4choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-4choices.yaml) prompt for generation.
- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
```
What does the "metadata-only flow" include?
Decontamination, topics, difficulty
Renamed and added a reference to the section with the description
```markdown
- `mcq_4_options` — switch to the [`eval/aai/mcq-4choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-4choices.yaml) prompt for generation.
- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
- `seed_data_postprocess` — keep only the generation → filtering → SFT preparation stages for reasoning above existing seed data.
```
It is not entirely clear what you mean here. How about listing all possible steps/stages that the pipeline can do somewhere at the top? It might make defining these flags easier.
I moved the description of the stages to the top.
```markdown
- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
- `seed_data_postprocess` — keep only the generation → filtering → SFT preparation stages for reasoning above existing seed data.
- `multiple_prompts` - allow the usage of multiple prompts for the generation.
```
Could you elaborate here? E.g., "enables the use of multiple prompts (including distinct preambles and varied output formats) during the generation process to address prompt sensitivity." It is not clear what additional prompts/output formats are used here, or how to specify them.
This part of the README specifies it at a high level; there is a detailed explanation below: https://github.com/NVIDIA-NeMo/Skills/blob/02d1aff80650b677add866c9e9b76ac110f532ac/recipes/opensciencereasoning/sdg_pipeline/README.md#using-the-multiple_prompts-setting
Added the reference to the section
```markdown
- **Base pipeline**: [`configs/pipelines/base.yaml`](configs/pipelines/base.yaml) describes the default open-question flow with ground-truth answers available, no tool usage, and the boxed prompt.
- **Settings overrides** (under [`configs/settings/`](configs/settings/)) layer small, reusable tweaks. Reference them with or without the `.yaml` suffix:
  - `without_gt` — route the pipeline through solution generation + majority voting to estimate ground truth answer.
  - `python_enabled` — enable python-tool prompting and sandbox execution.
```
Any specifics here? E.g., what is the python-tool prompt, and where is it defined?
Here is the link to all the parameters for each setting: [`configs/settings/`](configs/settings/) (it is in the README).
```markdown
# OpenScienceReasoning Pipeline Quickstart
This folder provides templates, prompts, and scripts for the automated pipeline that powers the OpenScience data refresh. The pipeline launches distributed jobs through [`pipeline/sdg_pipeline.py`](pipeline/sdg_pipeline.py) and covers the full lifecycle: solution generation, ground-truth extraction, difficulty scoring, and topic labeling.

## Config Layout
```
Any assumptions on the incoming files, like the .jsonl format? Any particular fields needed?
Yes, it is described in the How to use section: https://github.com/NVIDIA-NeMo/Skills/blob/02d1aff80650b677add866c9e9b76ac110f532ac/recipes/opensciencereasoning/sdg_pipeline/README.md#how-to-use
I have added the reference to the section in the filter_problem stage description
```python
print(page.content[:500])  # First 500 characters of content
print(page.links[:10])  # First 10 linked pages
```
@jiacheng-xu - let's update this one to the final version you're working on