NVIDIA · kevalmorabia97 · Jun 5, 2026 · Jun 2, 2026 · Jun 4, 2026 · Jun 4, 2026
@@ -87,7 +87,7 @@ jobs:
     with:
       docker_image: "nvcr.io/nvidia/nemo:26.04"
       example: megatron_bridge
-      timeout_minutes: 30
+      timeout_minutes: 45
       pip_install_extras: "[hf,puzzletron,dev-test]"
       runner: ${{ startsWith(github.ref, 'refs/heads/pull-request/') && 'linux-amd64-gpu-rtxpro6000-latest-1' || 'linux-amd64-gpu-rtxpro6000-latest-2' }}
 

diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -29,6 +29,7 @@ Changelog
 - Add offline DFlash speculative decoding training. Train the draft module from pre-computed base-model hidden states dumped by ``examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py``; base-model transformer layers are deleted after conversion to save memory. Controlled by the auto-derived ``dflash_offline`` flag on ``DFlashConfig`` (derived from ``data_args.offline_data_path``). The dump scripts now share ``collect_hidden_states/common.py`` for aux-layer selection (``--aux-layers eagle|dflash|<list>``) and optional assistant-token ``loss_mask`` for answer-only-loss training.
 - Add support for ``active_params`` (for MoE models) and ``memory_mb`` constraints in Minitron pruning on top of existing ``params`` constraint. You can also provide multiple constraints. See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details. The underlying utility functions ``mcore_param_count``, ``mcore_memory_footprint_mb``, and ``print_mcore_model_stats`` in ``modelopt.torch.nas.plugins.megatron_model_stats`` are also available for standalone use to compute parameter counts and memory footprints (weights + KV-cache + Mamba state) for any Megatron-Core model.
 - Add Minitron pruning support for Megatron-Bridge Gemma3 models.
+- Add quantization examples for the Megatron-Bridge framework: post-training quantization (`quantize.py <https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/megatron_bridge/quantize.py>`_), export to a deployable HuggingFace checkpoint (`export.py <https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/megatron_bridge/export.py>`_), and Quantization Aware Distillation (extending the existing `distill.py <https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/megatron_bridge/distill.py>`_).
 - Add end-to-end tutorial for Minitron pruning + two-phase distillation (80B @ 8K + 20B @ 32K long-context = 100B tokens) + FP8 PTQ + vLLM deployment for Nemotron-3-Nano-30B-A3B-BF16 (MoE + Mamba-Transformer hybrid) → Pruned 22B/A3.0B active params, along with data blend preparation steps (with tool-calling data) and detailed pruning / data-blend / long-context ablations. See `examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/>`_ for details.
 - Add ``--cast_mxfp4_to_nvfp4`` flag to ``examples/llm_ptq/hf_ptq.py`` for closed-form, bit-exact MXFP4 → NVFP4 weight conversion. Supports the GPT-OSS family (``openai/gpt-oss-20b``, ``openai/gpt-oss-120b``). See `examples/llm_ptq/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#mxfp4--nvfp4-cast-for-gpt-oss>`__ for usage.
 - DeepSeek PTQ (``examples/deepseek/ptq.py``) now defaults to native top-k calibration with post-hoc per-layer peer-max sync of expert ``input_quantizer.amax``; the all-experts path is preserved behind ``--calib_all_experts``.
@@ -43,7 +44,6 @@ Changelog
 - Add mixed-precision FP8 + NVFP4 export for Megatron-Core: per-layer ``quant_algo`` recorded under ``quantized_layers`` in ``hf_quant_config.json``, PP-aware ``kv_cache_dtype`` gather, fused-QKV exclude split into per-HF-name ``q/k/v_proj`` entries.
 - Add Nemotron-3-Super-120B-A12B PTQ recipes ``modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`` (MSE-mixed) and ``super-nvfp4-max-calib.yaml`` (max-calib mixed): NVFP4 W4A4 routed experts + FP8 per-tensor shared experts / Mamba in/out_proj + FP8 KV cache.
 - Add quantized ``nn.Embedding`` support. ``nn.Embedding`` is now registered in ``QuantModuleRegistry`` and exposes ``weight_quantizer`` (embedding table), ``output_quantizer`` (lookup activations), and a permanently disabled ``input_quantizer`` placeholder — embedding inputs are integer indices and cannot be fake-quantized, so direct ``enable*()`` calls raise. ``export_hf_checkpoint`` packs quantized embedding weights alongside Linear layers. Embedding quantizers are opt-in (``parent_class: nn.Embedding`` disabled by default).
-- Add post-training quantization (PTQ) example for the Megatron-Bridge framework: ``examples/megatron_bridge/quantize.py`` calibrates an HF model (via ``--quant_cfg`` alias / full config name or a ``--recipe`` YAML, with optional KV-cache quant, weight-only, compression, and MoE expert-ratio calibration) and saves a Megatron checkpoint (tensor / pipeline / expert parallelism supported), and ``examples/megatron_bridge/export.py`` converts that checkpoint to a deployable HuggingFace (unified) checkpoint for TensorRT-LLM / vLLM / SGLang. See `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for details.
 - Refactor ``llm_qat`` example with unified YAML-based configuration and flexible dataset blending.
   ``ModelOptArgParser`` adds ``--config`` YAML support with CLI overrides and auto-generates ``ARGUMENTS.md`` from dataclass definitions.
   Dataset blending (``configs/dataset/blend.yaml``) supports HuggingFace datasets, local JSON/JSONL/Parquet files, and weighted multi-source blends.

@@ -143,7 +143,7 @@ Loss balancers:
 
 ## Knowledge Distillation (KD) in NVIDIA Megatron-Bridge Framework
 
-Checkout the stand-alone distillation script in the [examples/megatron_bridge/](../megatron_bridge/README.md).
+Check out [examples/megatron_bridge/](../megatron_bridge/README.md) for stand-alone example scripts for KD with Megatron-Bridge, which is generally more performant than the Hugging Face scripts.
 
 ## Knowledge Distillation (KD) in NVIDIA Megatron-LM Framework
 

@@ -262,7 +262,7 @@ This functionality is currently in beta and has been tested on `nvidia/NVIDIA-Ne
 
 ### Megatron-Bridge Example Script
 
-Please refer to [examples/megatron_bridge/README.md](../megatron_bridge/README.md) for example scripts for PTQ with Megatron-Bridge.
+Please refer to [examples/megatron_bridge/README.md](../megatron_bridge/README.md) for example scripts for PTQ / QAD with Megatron-Bridge, which is generally more performant than the Hugging Face scripts.
 
 ### Megatron-LM Example Script
 

@@ -1,5 +1,7 @@
 # QAD Training Scripts
 
+> **Deprecated:** These scripts are deprecated and will be removed in the next release. Please migrate to the [megatron_bridge QAD example](../megatron_bridge/README.md#quantization-aware-distillation-qad), which provides a simpler Python-based interface and better model coverage.
+
 Quantization-Aware Distillation (QAD) training scripts for language models using Megatron-LM. These scripts enable training quantized (e.g., NVFP4) student models with knowledge distillation from full-precision teacher models.
 
 > **Note:** For Hugging Face LLM QAD, see the [LLM QAT QAD section](../llm_qat/README.md#end-to-end-qad-example).
@@ -53,7 +55,7 @@ See [Megatron-LM ModelOpt examples](https://github.com/NVIDIA/Megatron-LM/tree/m
    ```bash
    # For MoE models
    cp configs/qwen3-30b-a3b-instruct-2507-moe_template.conf configs/my-experiment.conf
-   
+
    # For Dense models
    cp configs/qwen3-8b_template.conf configs/my-experiment.conf
    ```

@@ -87,12 +87,16 @@ python export.py --pyt_ckpt_path qwen3-8b-qad-nvfp4 --export_path qwen3-8b-qad-d
 
 Exported checkpoints can be deployed on [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [vLLM](https://github.com/vllm-project/vllm), or [SGLang](https://github.com/sgl-project/sglang). See [llm_ptq/README.md](../llm_ptq/README.md#deployment) for deployment instructions. For quick accuracy evaluation without exporting, see [Native Fake-Quantized Evaluation](#native-fake-quantized-evaluation).
 
-> **Note:** To see the full QAT flow in a single script (quantize + train + save), see [simple_qat_train.py](simple_qat_train.py):
+> [!NOTE]
+> To see the full QAT flow in a single script (quantize + train + save), see [simple_qat_train.py](simple_qat_train.py):
 >
 > ```sh
 > python simple_qat_train.py --model-path meta-llama/Llama-3.2-3B --recipe general/ptq/nvfp4_default-kv_fp8
 > ```
 
+> [!TIP]
+> Megatron-Bridge is generally more performant than the Hugging Face scripts. See [examples/megatron_bridge/README.md](../megatron_bridge/README.md) for PTQ / QAD example scripts.
+
 ## Background
 
 ### What is QAT?

@@ -8,12 +8,16 @@ This directory contains examples of using Model Optimizer with [NeMo Megatron-Br
 | :------------: | :------------: | :------------: |
 | Pre-Requisites | Development environment setup | \[[Link](#pre-requisites)\] |
 | Post-Training Quantization | Quantizing a model | \[[Link](#post-training-quantization)\] |
+| Sanity-Check Generation | Quick generation check with vLLM | \[[Link](#sanity-check-generation)\] |
 | Distillation | Distilling a pruned or quantized model | \[[Link](#distillation)\] |
 | Pruning | Pruning a model using Minitron algorithm | \[[Link](#pruning)\] |
 | Resources | Extra links to relevant resources | \[[Link](#resources)\] |
 
 </div>
 
+> [!TIP]
+> Check out the [Nemotron-3-Nano-30B-A3B pruning + distillation (with data blend prep) + quantization tutorial](../pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md) for a complete end-to-end workflow using Megatron-Bridge!
+
 ## Pre-Requisites
 
 Running these examples requires many additional dependencies to be installed (e.g., Megatron-Bridge, Megatron-core, etc.), hence we strongly recommend directly using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.04`) which has all the dependencies installed.
@@ -63,33 +67,47 @@ This section shows how to quantize a HuggingFace model using ModelOpt in the Meg
 1. [quantize.py](quantize.py) applies post-training quantization (PTQ) with calibration and saves a **Megatron checkpoint** (with ModelOpt state). Tensor / pipeline / expert parallelism are all supported, and the checkpoint can be reloaded for further training (Quantization Aware Training / Quantization Aware Distillation).
 2. [export.py](export.py) converts that Megatron checkpoint to a **HuggingFace (unified) checkpoint** that deploys directly with TensorRT-LLM, vLLM, or SGLang.
 
-`quantize.py` supports the following formats via `--quant_cfg` (e.g. `fp8`, `nvfp4`, `int8_sq`, `int4_awq`, `w4a8_awq`, ...). You can also pass any full config name exposed by ModelOpt (e.g. `FP8_DEFAULT_CFG`) or a YAML `--recipe` (e.g. `general/ptq/fp8_default-kv_fp8`, authoritative for quant_cfg + algorithm + KV-cache). KV-cache quantization can be enabled on top via `--kv_cache_quant` (e.g. `fp8`, `nvfp4`).
+`quantize.py` supports common formats via `--quant_cfg` aliases (e.g. `fp8`, `nvfp4`, `int8_sq`, `int4_awq`, `w4a8_awq`, ...). You can also pass any full config name exposed by ModelOpt (e.g. `NVFP4_DEFAULT_CFG`) or a YAML `--recipe` (e.g. `general/ptq/nvfp4_default-kv_fp8`, authoritative for quant_cfg + algorithm + KV-cache). KV-cache quantization can be enabled on top via `--kv_cache_quant` (e.g. `fp8`, `nvfp4`).
 
-**Step 1 — quantize** Qwen3-8B to FP8 on 2 GPUs (Tensor Parallelism = 2) using 1024 samples from [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) for calibration:
+**Step 1 — quantize** Qwen3-8B to NVFP4 on 2 GPUs (Tensor Parallelism = 2), calibrating on 1024 samples from the default dataset (a mix of [`cnn_dailymail`](https://huggingface.co/datasets/abisee/cnn_dailymail) and [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2)) with sequence length 4096:
 
 ```bash
 torchrun --nproc_per_node 2 quantize.py \
     --hf_model_name_or_path Qwen/Qwen3-8B \
-    --quant_cfg fp8 \
+    --quant_cfg nvfp4 \
     --tp_size 2 \
-    --export_megatron_path /tmp/Qwen3-8B-FP8-megatron
+    --calib_batch_size 1 \
+    --seq_length 4096 \
+    --export_megatron_path /tmp/Qwen3-8B-NVFP4-megatron
 ```
 
 **Step 2 — export** the Megatron checkpoint to a deployable HuggingFace checkpoint:
 
 ```bash
-torchrun --nproc_per_node 1 export.py \
+torchrun --nproc_per_node 2 export.py \
     --hf_model_name_or_path Qwen/Qwen3-8B \
-    --megatron_path /tmp/Qwen3-8B-FP8-megatron \
-    --export_unified_hf_path /tmp/Qwen3-8B-FP8-hf
+    --megatron_path /tmp/Qwen3-8B-NVFP4-megatron \
+    --pp_size 2 \
+    --export_unified_hf_path /tmp/Qwen3-8B-NVFP4-hf
 ```
 
 > [!NOTE]
 > The HuggingFace unified exporter does not gather tensor-parallel-sharded weights. Use `--pp_size` on `export.py` to shard a large model with pipeline parallelism across GPUs for export.
 
+> [!TIP]
+> To recover the accuracy lost during quantization, fine-tune the quantized Megatron checkpoint (from step 1) with [Quantization Aware Distillation (QAD)](#quantization-aware-distillation-qad) before running the step 2 export.
+
 To see the full usage for advanced configurations, run `torchrun --nproc_per_node 1 quantize.py --help` (or `export.py --help`).
 
-For Quantization scripts covering VLMs, QAT, and resuming quantized checkpoints, see the Megatron-Bridge repository [here](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/quantization).
+For VLM (vision-language model) quantization, see the Megatron-Bridge repository [here](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/quantization).
+
+## Sanity-Check Generation
+
+[generate_vllm.py](generate_vllm.py) runs a quick generation check on a unified HuggingFace checkpoint using vLLM. vLLM auto-detects the ModelOpt quantization from the exported `hf_quant_config.json`, so no extra quant flags are needed:
+
+```bash
+python generate_vllm.py --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 --trust_remote_code
+```
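Before launching vLLM, it can help to confirm what the exporter actually recorded. The following is a minimal sketch, not part of this PR: it assumes only that `hf_quant_config.json` stores the quantization algorithm under keys named `quant_algo` (the key name used in the changelog entry on mixed-precision export) and makes no other assumption about the file's schema. The helper names `find_quant_algos` and `summarize_quant_config` are hypothetical.

```python
import json
from pathlib import Path


def find_quant_algos(node, path=""):
    """Yield (json_path, value) for every `quant_algo` key found at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key == "quant_algo":
                yield child, value
            else:
                yield from find_quant_algos(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_quant_algos(item, f"{path}[{index}]")


def summarize_quant_config(checkpoint_dir):
    """Return all `quant_algo` entries from <checkpoint_dir>/hf_quant_config.json."""
    config_path = Path(checkpoint_dir) / "hf_quant_config.json"
    with config_path.open() as f:
        return dict(find_quant_algos(json.load(f)))
```

Calling `summarize_quant_config("/tmp/Qwen3-8B-NVFP4-hf")` on the checkpoint exported in step 2 would return a mapping from JSON path to the recorded algorithm. An empty mapping suggests the export did not write any quantization metadata under that key.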
 
 ## Distillation
 
@@ -99,6 +117,8 @@ This can be used stand-alone or after [Pruning](#pruning) / [Post-Training Quant
 
 The [distill.py](distill.py) script supports both standard HuggingFace checkpoints and [Puzzletron AnyModel](../puzzletron/README.md) checkpoints as student/teacher inputs. Just pass the checkpoint path via `--student_hf_path` / `--teacher_hf_path`. The distilled model is saved to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.
 
+To distill a student whose weights live in a **Megatron checkpoint** (e.g. a quantized checkpoint from [quantize.py](quantize.py) for [Quantization Aware Distillation](#quantization-aware-distillation-qad), or a pruned checkpoint), additionally pass `--student_megatron_path` — `--student_hf_path` is still required to build the student architecture.
+
 ### Data Preparation
 
 The distillation script expects pre-tokenized data in Megatron's binary format (`.bin` / `.idx` files).
@@ -159,6 +179,28 @@ torchrun --nproc_per_node 8 distill.py \
     --output_dir /tmp/test_distill
 ```
 
+### Quantization Aware Distillation (QAD)
+
+To recover the accuracy lost during [Post-Training Quantization](#post-training-quantization), distill the quantized model (student) from the original, unquantized model (teacher). Pass the quantized **Megatron checkpoint** produced by `quantize.py` via `--student_megatron_path` (the ModelOpt quantizers are restored automatically, so distillation trains the fake-quantized student), while `--student_hf_path` provides the student architecture and `--teacher_hf_path` points to the original unquantized model. We also use a smaller learning rate for QAD:
+
+```bash
+torchrun --nproc_per_node 8 distill.py \
+    --tp_size 8 \
+    --teacher_hf_path Qwen/Qwen3-8B \
+    --student_hf_path Qwen/Qwen3-8B \
+    --student_megatron_path /tmp/Qwen3-8B-NVFP4-megatron \
+    --data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document \
+    --data_path_to_cache /path/to/cache/dataset_indices_qwen3 \
+    --seq_length 8192 \
+    --gbs 768 \
+    --train_iters 1000 \
+    --lr 1e-5 \
+    --min_lr 5e-6 \
+    --output_dir /output/qwen3_8b_nvfp4_qad
+```
+
+The distilled checkpoint retains the ModelOpt quantization state, so it can be converted to a deployable HuggingFace checkpoint with [export.py](export.py) (point `--megatron_path` at `<output_dir>/checkpoints`), exactly like the PTQ checkpoint in [step 2 above](#post-training-quantization).
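As a hedged illustration of that last step, the sketch below assembles the `export.py` command for the QAD output, reusing only the flags shown in the step 2 example and the `--output_dir` from the QAD command. The export directory name and the choice of 2 GPUs with `--pp_size 2` are assumptions carried over for illustration; adjust them to your setup.

```python
import shlex
from pathlib import PurePosixPath

# --output_dir used in the QAD example above.
qad_output_dir = PurePosixPath("/output/qwen3_8b_nvfp4_qad")
# Hypothetical location for the exported HuggingFace checkpoint.
export_dir = "/tmp/Qwen3-8B-NVFP4-QAD-hf"

export_cmd = [
    "torchrun", "--nproc_per_node", "2", "export.py",
    "--hf_model_name_or_path", "Qwen/Qwen3-8B",
    # distill.py saves the distilled model under <output_dir>/checkpoints.
    "--megatron_path", str(qad_output_dir / "checkpoints"),
    "--pp_size", "2",
    "--export_unified_hf_path", export_dir,
]

# Print a shell-ready command line that can be pasted into a terminal.
print(shlex.join(export_cmd))
```

Building the command as a list keeps each flag and value paired, which makes it easy to pass to `subprocess.run(export_cmd, check=True)` from a driver script instead of printing it.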
+
 ### Slurm Usage
 
 To run the distillation script on a Slurm cluster for multi-node training, you just need use `python` instead of `torchrun` and set the number of nodes using `#SBATCH --nodes=<num_nodes>` clause in your Slurm script.