Support megatron tokenization for post training datasets #1018

Merged
kevalmorabia97 merged 2 commits into main from kmorabia/post-training-tokenization on Mar 13, 2026
Conversation


@kevalmorabia97 kevalmorabia97 commented Mar 10, 2026

What does this PR do?

Update `megatron_preprocess_data.py` to support applying a chat template when tokenizing chat-based post-training datasets.

Usage

# Add a code snippet demonstrating how to use this

Testing

  • Tokenized Nemotron-Post-Training-Dataset-v2 (~2B tokens for stem + chat + math + code splits)
  • Running distillation on a pruned Nano v2 7B

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ❌
  • Did you update Changelog?: N/A

Summary by CodeRabbit

  • Bug Fixes

    • Added runtime logging for data truncation to improve visibility
    • Improved handling of chat-formatted conversation data in list form
    • Eliminated duplicate log messages during data encoding
  • Documentation

    • Expanded usage examples and docstrings to cover chat-format datasets
  • Tests

    • Added parameterized tests to validate multiple dataset scenarios and tokenization settings

@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner March 10, 2026 19:30

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

Updates megatron preprocessing to apply per-key chat-template rendering for list-valued JSON keys with per-key logging suppression, adds logging for document and sequence truncation (reporting original vs new lengths), adjusts length accounting to use post-processing/truncation values, and parameterizes an HF-dataset test to run two dataset scenarios.

Changes

Cohort / File(s) Summary
Megatron preprocessing
modelopt/torch/utils/plugins/megatron_preprocess_data.py
Adds per-key chat_template handling for list-valued JSON keys (uses tokenizer.apply_chat_template with optional tools), introduces _chat_template_logged class-level set to avoid duplicate log messages, logs document truncation (original vs new lengths) and final sequence/token lengths, and changes length accounting to accumulate using post-truncation text and encoded lengths.
Parameterized tests
tests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py
Replaces a single HF-dataset test with a parameterized test (pytest.mark.parametrize) that supplies hf_dataset, hf_split, and json_keys; updates test signature and invocation to pass hf_max_samples_per_split=10, json_keys, new tokenizer_name_or_path (Qwen/Qwen3-0.6B), and reduces max_sequence_length from 512 to 32. Also adds top-level pytest import.
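A rough reconstruction of the parameterized test described above; the dataset names, splits, and keys are taken from this PR's discussion, and the test body is elided:

```python
import pytest


@pytest.mark.parametrize(
    ("hf_dataset", "hf_split", "json_keys"),
    [
        # Plain-text dataset: tokenize the "text" field directly.
        ("nanotron/minipile_100_samples", "train", ["text"]),
        # Chat-format dataset: "messages" is a list of turns, so the
        # chat template is applied before tokenization.
        ("HuggingFaceTB/everyday-conversations-llama3.1-2k", "test_sft", ["messages"]),
    ],
)
def test_megatron_preprocess_data(hf_dataset, hf_split, json_keys, tmp_path):
    # The real test downloads the split, tokenizes with Qwen/Qwen3-0.6B,
    # and passes hf_max_samples_per_split=10 and max_sequence_length=32.
    ...
```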

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and clearly describes the main change: adding support for chat template tokenization in megatron for post-training datasets.
Security Anti-Patterns ✅ Passed Pull request contains no security anti-patterns including unsafe torch.load, numpy.load with allow_pickle, hardcoded trust_remote_code, eval/exec calls, nosec comments, or non-permissive dependencies.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (2)

119-131: Consider handling unexpected value types gracefully.

The code assumes value is either a list (chat format) or a string (raw text). If a JSON key contains an unexpected type (e.g., dict, int, None), line 131 assigns it directly to text, which will fail at _Encoder.tokenizer.encode(text) with an unclear error.

Consider adding explicit type validation:

🛡️ Suggested defensive check
             if isinstance(value, list):
                 if key not in _Encoder._chat_template_logged:
                     _Encoder._chat_template_logged.add(key)
                     print(f"Applying chat_template to '{key}' key")
                 kwargs = {}
                 tools = data.get("tools")
                 if tools:
                     kwargs["tools"] = tools
                 text = _Encoder.tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
-            else:
+            elif isinstance(value, str):
                 text = value
+            else:
+                raise TypeError(f"Expected list or str for key '{key}', got {type(value).__name__}")

135-146: Inconsistent truncation logging and potential noise.

Document truncation is logged (lines 137-138), but sequence truncation logging is commented out (line 146). Additionally, the commented-out code references original_length which would be undefined in that context since it was overwritten at line 135.

For large datasets with many long documents, the document truncation logging could produce excessive output. Consider:

  1. Making truncation logging consistent (either log both or neither)
  2. Using a counter and logging summary statistics at the end, rather than per-document
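The summary-statistics approach suggested here could look roughly like this; the class and counter names are hypothetical, not from the PR:

```python
class TruncationStats:
    """Accumulate truncation counts and log once at the end,
    instead of printing per document (hypothetical sketch)."""

    def __init__(self, max_sequence_length):
        self.max_sequence_length = max_sequence_length
        self.total_docs = 0
        self.num_seq_truncated = 0
        self.total_truncated_tokens = 0

    def record(self, num_tokens):
        """Record one encoded document's pre-truncation token count."""
        self.total_docs += 1
        if num_tokens > self.max_sequence_length:
            self.num_seq_truncated += 1
            self.total_truncated_tokens += num_tokens - self.max_sequence_length

    def summary(self):
        return (
            f"Truncated {self.num_seq_truncated}/{self.total_docs} documents "
            f"({self.total_truncated_tokens} tokens dropped) "
            f"to max_sequence_length={self.max_sequence_length}"
        )
```

The encoder would call `record()` per document and print `summary()` once after preprocessing, keeping output constant-size regardless of dataset size.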
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Line 90: The class-level `_chat_template_logged` set cannot deduplicate logs across worker processes, because each `multiprocessing.Pool` worker gets its own copy of it. Either move the "Applying chat_template..." logging into the parent process before dispatching tasks (so deduplication happens once), or replace the in-process set with a process-shared structure created from `multiprocessing.Manager()` (e.g. a `Manager().dict()` used as a set) so that all Pool workers share the same dedupe state.



📥 Commits

Reviewing files that changed from the base of the PR and between 695c8e8 and 4726723.

📒 Files selected for processing (1)
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py


codecov bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.12%. Comparing base (bc87981) to head (d07375e).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1018      +/-   ##
==========================================
+ Coverage   70.11%   70.12%   +0.01%     
==========================================
  Files         221      221              
  Lines       25459    25459              
==========================================
+ Hits        17851    17854       +3     
+ Misses       7608     7605       -3     

☔ View full report in Codecov by Sentry.

@AAnoosheh (Contributor):

Can we add a test in test_megatron_preprocess_data.py?

@kevalmorabia97 (Author):

Can we add a test in test_megatron_preprocess_data.py?

Do you know any small dataset we can use for test here?

@AAnoosheh (Contributor):

The test_sft split of this dataset is very small: https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/test_sft

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/post-training-tokenization branch from 4726723 to d07375e Compare March 13, 2026 11:43
@kevalmorabia97 kevalmorabia97 enabled auto-merge (squash) March 13, 2026 11:43
@kevalmorabia97 (Author):

The test_sft split of this dataset is very small: https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/test_sft

Added unit test. PTAL


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1



📥 Commits

Reviewing files that changed from the base of the PR and between 4726723 and d07375e.

📒 Files selected for processing (2)
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py
  • tests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py

Comment on lines 78 to 81
"""Test megatron_preprocess_data with dataset download, --append_eod and --max_sequence_length.

Downloads nanotron/minipile_100_samples train split from Hugging Face and tokenizes it.
"""

⚠️ Potential issue | 🟡 Minor

Update the docstring to reflect parameterized test behavior.

The docstring still references only "nanotron/minipile_100_samples train split", but the test is now parameterized to run against multiple datasets including the chat-based everyday-conversations dataset.

📝 Proposed fix
-    """Test megatron_preprocess_data with dataset download, --append_eod and --max_sequence_length.
-
-    Downloads nanotron/minipile_100_samples train split from Hugging Face and tokenizes it.
-    """
+    """Test megatron_preprocess_data with dataset download, --append_eod and --max_sequence_length.
+
+    Downloads parameterized HuggingFace datasets and tokenizes them with the specified json_keys.
+    Tests both plain text and chat-based datasets.
+    """

@kevalmorabia97 kevalmorabia97 merged commit d0abca7 into main Mar 13, 2026
40 checks passed
@kevalmorabia97 kevalmorabia97 deleted the kmorabia/post-training-tokenization branch March 13, 2026 12:32