Support megatron tokenization for post training datasets #1018

Merged
kevalmorabia97 merged 2 commits into main from kmorabia/post-training-tokenization on Mar 13, 2026
Conversation


@kevalmorabia97 kevalmorabia97 commented Mar 10, 2026

What does this PR do?

Update `megatron_preprocess_data.py` to support applying a chat template when tokenizing chat-based post-training datasets.

Usage

# Add a code snippet demonstrating how to use this

Testing

  • Tokenized Nemotron-Post-Training-Dataset-v2 (~2B tokens for stem + chat + math + code splits)
  • Running distillation on a pruned Nano v2 7B

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ❌
  • Did you update Changelog?: N/A

Summary by CodeRabbit

  • Bug Fixes

    • Added runtime logging for data truncation to improve visibility
    • Improved handling of chat-formatted conversation data in list form
    • Eliminated duplicate log messages during data encoding
  • Documentation

    • Expanded usage examples and docstrings to cover chat-format datasets
  • Tests

    • Added parameterized tests to validate multiple dataset scenarios and tokenization settings

@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner March 10, 2026 19:30

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

Updates megatron preprocessing to apply per-key chat-template rendering for list-valued JSON keys with per-key logging suppression, adds logging for document and sequence truncation (reporting original vs new lengths), adjusts length accounting to use post-processing/truncation values, and parameterizes an HF-dataset test to run two dataset scenarios.

Changes

Cohort / File(s) Summary
Megatron preprocessing
modelopt/torch/utils/plugins/megatron_preprocess_data.py
Adds per-key chat_template handling for list-valued JSON keys (uses tokenizer.apply_chat_template with optional tools), introduces _chat_template_logged class-level set to avoid duplicate log messages, logs document truncation (original vs new lengths) and final sequence/token lengths, and changes length accounting to accumulate using post-truncation text and encoded lengths.
Parameterized tests
tests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py
Replaces a single HF-dataset test with a parameterized test (pytest.mark.parametrize) that supplies hf_dataset, hf_split, and json_keys; updates test signature and invocation to pass hf_max_samples_per_split=10, json_keys, new tokenizer_name_or_path (Qwen/Qwen3-0.6B), and reduces max_sequence_length from 512 to 32. Also adds top-level pytest import.
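A rough reconstruction of the parameterized test described above; the dataset names, splits, and keys are taken from this PR's discussion, and the test body is elided:

```python
import pytest


@pytest.mark.parametrize(
    ("hf_dataset", "hf_split", "json_keys"),
    [
        # Plain-text dataset: tokenize the "text" field directly.
        ("nanotron/minipile_100_samples", "train", ["text"]),
        # Chat-format dataset: "messages" is a list of turns, so the
        # chat template is applied before tokenization.
        ("HuggingFaceTB/everyday-conversations-llama3.1-2k", "test_sft", ["messages"]),
    ],
)
def test_megatron_preprocess_data(hf_dataset, hf_split, json_keys, tmp_path):
    # The real test downloads the split, tokenizes with Qwen/Qwen3-0.6B,
    # and passes hf_max_samples_per_split=10 and max_sequence_length=32.
    ...
```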

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and clearly describes the main change: adding support for chat template tokenization in megatron for post-training datasets.
Security Anti-Patterns ✅ Passed Pull request contains no security anti-patterns including unsafe torch.load, numpy.load with allow_pickle, hardcoded trust_remote_code, eval/exec calls, nosec comments, or non-permissive dependencies.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (2)

119-131: Consider handling unexpected value types gracefully.

The code assumes value is either a list (chat format) or a string (raw text). If a JSON key contains an unexpected type (e.g., dict, int, None), line 131 assigns it directly to text, which will fail at _Encoder.tokenizer.encode(text) with an unclear error.

Consider adding explicit type validation:

🛡️ Suggested defensive check
             if isinstance(value, list):
                 if key not in _Encoder._chat_template_logged:
                     _Encoder._chat_template_logged.add(key)
                     print(f"Applying chat_template to '{key}' key")
                 kwargs = {}
                 tools = data.get("tools")
                 if tools:
                     kwargs["tools"] = tools
                 text = _Encoder.tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
-            else:
+            elif isinstance(value, str):
                 text = value
+            else:
+                raise TypeError(f"Expected list or str for key '{key}', got {type(value).__name__}")

135-146: Inconsistent truncation logging and potential noise.

Document truncation is logged (lines 137-138), but sequence truncation logging is commented out (line 146). Additionally, the commented-out code references original_length which would be undefined in that context since it was overwritten at line 135.

For large datasets with many long documents, the document truncation logging could produce excessive output. Consider:

  1. Making truncation logging consistent (either log both or neither)
  2. Using a counter and logging summary statistics at the end, rather than per-document
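The summary-statistics approach suggested here could look roughly like this; the class and counter names are hypothetical, not from the PR:

```python
class TruncationStats:
    """Accumulate truncation counts and log once at the end,
    instead of printing per document (hypothetical sketch)."""

    def __init__(self, max_sequence_length):
        self.max_sequence_length = max_sequence_length
        self.total_docs = 0
        self.num_seq_truncated = 0
        self.total_truncated_tokens = 0

    def record(self, num_tokens):
        """Record one encoded document's pre-truncation token count."""
        self.total_docs += 1
        if num_tokens > self.max_sequence_length:
            self.num_seq_truncated += 1
            self.total_truncated_tokens += num_tokens - self.max_sequence_length

    def summary(self):
        return (
            f"Truncated {self.num_seq_truncated}/{self.total_docs} documents "
            f"({self.total_truncated_tokens} tokens dropped) "
            f"to max_sequence_length={self.max_sequence_length}"
        )
```

The encoder would call `record()` per document and print `summary()` once after preprocessing, keeping output constant-size regardless of dataset size.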
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Line 90: The class-level `_chat_template_logged` set cannot deduplicate logs across worker processes, because each `multiprocessing.Pool` worker gets its own copy of it. Either move the "Applying chat_template..." logging into the parent process before dispatching tasks (so deduplication happens once), or replace the in-process set with a process-shared structure created from `multiprocessing.Manager()` (e.g. a `Manager().dict()` used as a set) so that all Pool workers share the same dedupe state.



📥 Commits

Reviewing files that changed from the base of the PR and between 695c8e8 and 4726723.

📒 Files selected for processing (1)
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py


codecov bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.12%. Comparing base (bc87981) to head (d07375e).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1018      +/-   ##
==========================================
+ Coverage   70.11%   70.12%   +0.01%     
==========================================
  Files         221      221              
  Lines       25459    25459              
==========================================
+ Hits        17851    17854       +3     
+ Misses       7608     7605       -3     

☔ View full report in Codecov by Sentry.

@AAnoosheh (Contributor):

Can we add a test in test_megatron_preprocess_data.py?

@kevalmorabia97 (Author):

Can we add a test in test_megatron_preprocess_data.py?

Do you know any small dataset we can use for test here?

@AAnoosheh (Contributor):

The test_sft split of this dataset is very small: https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/test_sft

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/post-training-tokenization branch from 4726723 to d07375e Compare March 13, 2026 11:43
@kevalmorabia97 kevalmorabia97 enabled auto-merge (squash) March 13, 2026 11:43
@kevalmorabia97 (Author):

The test_sft split of this dataset is very small: https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/test_sft

Added unit test. PTAL


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1



📥 Commits

Reviewing files that changed from the base of the PR and between 4726723 and d07375e.

📒 Files selected for processing (2)
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py
  • tests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py

Comment on lines 78 to 81
"""Test megatron_preprocess_data with dataset download, --append_eod and --max_sequence_length.

Downloads nanotron/minipile_100_samples train split from Hugging Face and tokenizes it.
"""

⚠️ Potential issue | 🟡 Minor

Update the docstring to reflect parameterized test behavior.

The docstring still references only "nanotron/minipile_100_samples train split", but the test is now parameterized to run against multiple datasets including the chat-based everyday-conversations dataset.

📝 Proposed fix
-    """Test megatron_preprocess_data with dataset download, --append_eod and --max_sequence_length.
-
-    Downloads nanotron/minipile_100_samples train split from Hugging Face and tokenizes it.
-    """
+    """Test megatron_preprocess_data with dataset download, --append_eod and --max_sequence_length.
+
+    Downloads parameterized HuggingFace datasets and tokenizes them with the specified json_keys.
+    Tests both plain text and chat-based datasets.
+    """

@kevalmorabia97 kevalmorabia97 merged commit d0abca7 into main Mar 13, 2026
40 checks passed
@kevalmorabia97 kevalmorabia97 deleted the kmorabia/post-training-tokenization branch March 13, 2026 12:32