Conversation

@dgtm777 dgtm777 commented Oct 30, 2025

No description provided.

Signed-off-by: dgitman <[email protected]>
dgtm777 and others added 6 commits October 30, 2025 12:40
Signed-off-by: dgitman <[email protected]>
Signed-off-by: Rima Shahbazyan <[email protected]>
Signed-off-by: dgitman <[email protected]>
Signed-off-by: dgitman <[email protected]>
Signed-off-by: dgitman <[email protected]>
Collaborator

@Kipok Kipok left a comment

Consider adding Slurm tests for a simplified use case of this pipeline to ensure nothing is broken in the future.

- ["mmlu", "test"]
- ["mmlu-pro", "test"]
- ["gpqa", "diamond"]
model: /hf_models/Qwen2.5-32B-Instruct
Collaborator

Consider using Qwen/Qwen2.5-32B-Instruct here and everywhere else to avoid the extra step of manually downloading the model.

Collaborator Author

I constantly hit a Hugging Face rate limit when using the HF model name instead of a local path.

Signed-off-by: dgitman <[email protected]>
@dgtm777 dgtm777 requested a review from ekmb November 11, 2025 15:28
@ekmb ekmb requested a review from jiacheng-xu November 11, 2025 17:57
Jiacheng Xu and others added 2 commits November 12, 2025 10:46
Signed-off-by: Jiacheng Xu <[email protected]>

# Conflicts:
#	recipes/opensciencereasoning/sdg_pipeline/configs/pipelines/populate_configs.py
@dgtm777 dgtm777 requested a review from Kipok November 13, 2025 09:17
@dgtm777 dgtm777 enabled auto-merge (squash) November 13, 2025 09:20
Jiacheng Xu added 6 commits November 18, 2025 13:52
Signed-off-by: Jiacheng Xu <[email protected]>
@dgtm777 dgtm777 disabled auto-merge November 20, 2025 09:01
Collaborator Author

@jiacheng-xu, let's add the convert_to_qwen_from_messages and bucket-qwen stages into the base config

Collaborator Author

@jiacheng-xu, can we move specific changes you have here into settings?

)

# write to disk
with open(
Collaborator Author

@jiacheng-xu it is not necessary to create a separate config for each dataset. There is an --override parameter you can use to change the current config with the necessary updates, and some prepared settings (e.g., for the mcq/openq generations) can be applied with the --settings flag. How about converting the metadata into a list of settings and overrides for each dataset? That way, there is no need to create a YAML file for each dataset.
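The override mechanism being suggested can be sketched as a dotted-key merge over a base config. This is a hypothetical helper, not the project's actual implementation; the config keys and values below are illustrative only:

```python
import copy

def apply_overrides(config: dict, overrides: dict) -> dict:
    """Return a copy of config with dotted-key overrides applied,
    e.g. {"generation.prompt": "..."} updates config["generation"]["prompt"]."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        node = result
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return result

# One base config plus per-dataset overrides instead of one YAML per dataset
base = {"generation": {"model": "/hf_models/Qwen2.5-32B-Instruct", "prompt": "boxed"}}
mcq_variant = apply_overrides(base, {"generation.prompt": "eval/aai/mcq-4choices"})
```

The deep copy keeps the shared base config untouched, so each dataset's variant is derived independently.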

)


def bucket_qwen(cluster, expname, run_after, stage_config, **kwargs):
Collaborator Author

@dgtm777 dgtm777 Nov 20, 2025

@jiacheng-xu It looks the same as the bucket function. Can we just use it?

Adding @rimashahbazyan to this conversation

# Only include tools if ADD_TOOLS flag is set
tools = None
if ADD_TOOLS:
if True:
Collaborator Author

@dgtm777 dgtm777 Nov 20, 2025

@jiacheng-xu why "if True"?

Adding @rimashahbazyan to this conversation

def convert_to_qwen_from_messages(cluster, expname, run_after, stage_config, **kwargs):
input_file = stage_config["input_file"]
output_dir = stage_config["output_dir"]
output_file = f"{output_dir}/final_result.jsonl"
Collaborator Author

@dgtm777 dgtm777 Nov 20, 2025

output_file = f"{output_dir}/{OUTPUT_FILE}.jsonl"

)


def convert_to_qwen_from_messages(cluster, expname, run_after, stage_config, **kwargs):
Collaborator Author

@dgtm777 dgtm777 Nov 20, 2025

@jiacheng-xu is it convenient to convert from the messages format? The pipeline does not have to go stage by stage; if it is better to convert the original file or the prepared one, that can be implemented.

Adding @rimashahbazyan to this conversation

Collaborator

It’s safer to keep the conversion from the messages format. If the model changes later, we’ll only need to update the conversion from the messages format to that specific model’s format, while the rest of the conversions can remain the same. Otherwise, we risk introducing bugs, since these conversions are not very straightforward to implement.
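The layering argued for above can be sketched as two functions (names and fields here are illustrative, not the project's actual code): every source is first converted to a model-agnostic messages format, and only the final messages-to-model rendering is model-specific, so a model swap touches one function.

```python
def to_messages(sample: dict) -> list[dict]:
    # Model-agnostic intermediate representation; field names are illustrative
    return [
        {"role": "user", "content": sample["problem"]},
        {"role": "assistant", "content": sample["generation"]},
    ]

def messages_to_qwen(messages: list[dict]) -> str:
    # Model-specific rendering (ChatML-style tags, as used by Qwen chat models);
    # only this function needs to change if the target model changes
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    )

text = messages_to_qwen(to_messages({"problem": "2 + 2?", "generation": "4"}))
```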

"convert_to_messages_format": convert_to_messages_format,
"bucket": bucket,
"convert_to_qwen_from_messages": convert_to_qwen_from_messages,
"remove_unused_fields": remove_unused_fields,
Collaborator Author

@jiacheng-xu, are you sure we need this stage? Usually, it is better to include metadata in the final file so you can downsample, compute count statistics, and so on, on the resulting data without needing to join it with other files.

Adding @rimashahbazyan to this conversation

# ----------------------------------------------------------------------------
MODEL_NAME = "/hf_models/Qwen2.5-32B-Instruct"
_TOKENIZER = None # type: ignore
ADD_TOOLS = False # Will be set in main() if input filename contains 'with-tool'
Collaborator Author

@jiacheng-xu I would avoid relying on the file name.
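One way to avoid the filename dependency is an explicit CLI flag. This is a sketch; the flag name is hypothetical and not part of the project's actual CLI:

```python
import argparse

# An explicit flag instead of inferring behavior from the input filename
parser = argparse.ArgumentParser()
parser.add_argument(
    "--add-tools",
    action="store_true",
    help="include tool definitions in the prompt",
)
args = parser.parse_args(["--add-tools"])  # example invocation
ADD_TOOLS = args.add_tools
```

With `action="store_true"` the flag defaults to False, matching the current module-level default, but the caller states the intent explicitly rather than encoding it in a path.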

solution_key: ${output_key}
test_cases:
- { input: { generation: "1 + 2 + 3 + 4 = 10" }, output: { generation: "1 + 2 + 3 + 4 = 10" } }
# TODO: implement fractional arithmetic
Collaborator

Is this needed?

Collaborator Author

This folder provides templates, prompts, and scripts for the automated pipeline that powers the OpenScience data refresh. The pipeline launches distributed jobs through [`pipeline/sdg_pipeline.py`](pipeline/sdg_pipeline.py) and covers the full lifecycle: solution generation, ground-truth extraction, difficulty scoring, and topic labeling.

## Config Layout
- **Base pipeline**: [`configs/pipelines/base.yaml`](configs/pipelines/base.yaml) describes the default open-question flow with ground-truth answers available, no tool usage, and the boxed prompt.
Collaborator

Could you link the boxed prompt? Will the pipeline handle an update from boxed to something like the HLE prompt as the default, i.e., no boxed answer, which requires a judge?

Collaborator Author

For the difficulty estimation, it currently supports only boxed-like prompts. For solution generations, it should work (with proper modification of the config), but the predicted_answer for every sample will be empty, and the majority-voting part (in cases where the expected_answer is not set) will consequently work incorrectly.

- `python_enabled` — enable python-tool prompting and sandbox execution.
- `mcq_4_options` — switch to the [`eval/aai/mcq-4choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-4choices.yaml) prompt for generation.
- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
Collaborator

what does the "metadata-only flow" include?

Collaborator Author

Decontamination, topics, and difficulty.

Collaborator Author

Renamed and added a reference to the section with the description

- `mcq_4_options` — switch to the [`eval/aai/mcq-4choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-4choices.yaml) prompt for generation.
- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
- `seed_data_postprocess` — keep only the generation → filtering → SFT preparation stages for reasoning above existing seed data.
Collaborator

Not entirely clear what you mean here. How about listing all possible steps/stages that the pipeline can do somewhere at the top? It might make defining these flags easier.

Collaborator Author

I moved the description of the stages to the top.

- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
- `seed_data_postprocess` — keep only the generation → filtering → SFT preparation stages for reasoning above existing seed data.
- `multiple_prompts` - allow the usage of multiple prompts for the generation.
Collaborator

Could you elaborate here? E.g.: "enables the use of multiple prompts (including distinct preambles and varied output formats) during the generation process to address prompt sensitivity."

It is not clear what additional prompts/output formats are used here, or how to specify them.
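One way `multiple_prompts` could work, as a hedged sketch (the prompt texts and the round-robin assignment scheme here are invented for illustration, not taken from the pipeline):

```python
import itertools

# Illustrative prompt variants; the real ones would live in the prompt configs
PROMPTS = [
    "Solve the problem and put the final answer in \\boxed{}.",
    "Think step by step, then state the final answer on the last line.",
]

def assign_prompts(samples: list[dict]) -> list[dict]:
    """Round-robin a prompt variant onto each sample to reduce prompt sensitivity."""
    return [
        {**sample, "prompt": prompt}
        for sample, prompt in zip(samples, itertools.cycle(PROMPTS))
    ]

out = assign_prompts([{"id": 1}, {"id": 2}, {"id": 3}])
```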

Collaborator Author

Added the reference to the section

- **Base pipeline**: [`configs/pipelines/base.yaml`](configs/pipelines/base.yaml) describes the default open-question flow with ground-truth answers available, no tool usage, and the boxed prompt.
- **Settings overrides** (under [`configs/settings/`](configs/settings/)) layer small, reusable tweaks. Reference them with or without the `.yaml` suffix:
- `without_gt` — route the pipeline through solution generation + majority voting to estimate ground truth answer.
- `python_enabled` — enable python-tool prompting and sandbox execution.
Collaborator

any specifics here? e.g. what is python-tool prompt? where is it defined?

Collaborator Author

Here is the link to all the parameters for each setting: configs/settings/ (it is in the README).

# OpenScienceReasoning Pipeline Quickstart
This folder provides templates, prompts, and scripts for the automated pipeline that powers the OpenScience data refresh. The pipeline launches distributed jobs through [`pipeline/sdg_pipeline.py`](pipeline/sdg_pipeline.py) and covers the full lifecycle: solution generation, ground-truth extraction, difficulty scoring, and topic labeling.

## Config Layout
Collaborator

any assumptions on the incoming files, like .jsonl format? any particular fields needed?

Collaborator Author

@dgtm777 dgtm777 Nov 22, 2025

I have added the reference to the section in the filter_problem stage description

print(page.content[:500]) # First 500 characters of content
print(page.links[:10]) # First 10 linked pages
Collaborator

@jiacheng-xu - let's update this one to the final version you're working on
