
Fix MoE quantization tests and add sync_expert_weight_amax option #1158

Open

jenchen13 wants to merge 3 commits into main from jennifchen/fix_moe_tests_v2

Conversation

Contributor

@jenchen13 jenchen13 commented Apr 1, 2026

Fix MoE quantization tests to ensure QuantSequentialMLP and QuantTEGroupedMLP produce identical outputs when SequentialMLP has both weight and input quantizer amax sync enabled.
By default, SequentialMLP syncs input quantizer scales but not weight scales, so the outputs of QuantSequentialMLP and QuantTEGroupedMLP differ. When weight scales are also synced, the amax values and outputs of both modules should be identical.

What does this PR do?

Type of change: Bug fix

  • Add sync_expert_weight_amax flag to MaxCalibConfig to optionally sync weight quantizer amax across SequentialMLP experts (matches TEGroupedMLP)
  • Update layer_sync_moe_local_experts_amax to accept sync_weight_amax param

Usage

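A minimal sketch of enabling the new option through the calibration algorithm config. The base FP8_DEFAULT_CFG config, the model, and the forward_loop calibration function below are illustrative placeholders, not part of this PR; only the "sync_expert_weight_amax" key is what this change adds.

import copy

import modelopt.torch.quantization as mtq

# Start from an existing quantization config; FP8 is only an example base config.
quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)

# Enable weight-quantizer amax sync across SequentialMLP experts during max calibration.
quant_cfg["algorithm"] = {"method": "max", "sync_expert_weight_amax": True}

# `model` is a Megatron MoE model and `forward_loop(model)` runs calibration batches through it.
model = mtq.quantize(model, quant_cfg, forward_loop)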

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Added a calibration option (default: off) to optionally synchronize weight-quantizer amax values across local experts in MoE models, aligning sequential-expert calibration with grouped-expert behavior.
  • Tests

    • Updated and expanded quantization tests to cover grouped vs sequential MoE calibration, the new expert-weight synchronization option, and adjusted test parametrizations and skip logic accordingly.

- Pass num_experts and moe_grouped_gemm to TE spec in get_mcore_gpt_model
- Fix partial arg ordering in test_te_grouped_vs_sequential_quantize
- Add sync_expert_weight_amax flag to MaxCalibConfig to optionally sync
  weight quantizer amax across SequentialMLP experts (matches TEGroupedMLP)
- Update sync_moe_expert_amax helper to accept sync_weight_amax param
- Add debug prints for grouped vs sequential amax comparison

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@jenchen13 jenchen13 requested a review from a team as a code owner April 1, 2026 19:58
@jenchen13 jenchen13 requested a review from sugunav14 April 1, 2026 19:58
Contributor

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough


Added a boolean calibration option sync_expert_weight_amax (default False) and propagated it from MaxCalibConfig through max_calibrate() into MoE layer sync methods and the sync_moe_expert_amax() utility to optionally include weight-quantizer amax in local-expert synchronization.

Changes

  • Calibration Configuration (modelopt/torch/quantization/config.py): Added MaxCalibConfig.sync_expert_weight_amax: bool = False with metadata describing scope and behavior for MoE SequentialMLP experts.
  • Calibration Pipeline (modelopt/torch/quantization/model_calib.py): Extended the max_calibrate() signature to accept sync_expert_weight_amax: bool = False and pass the flag into layer-level MoE sync calls.
  • MoE Plugins (modelopt/torch/quantization/plugins/megatron.py, modelopt/torch/quantization/plugins/huggingface.py): Updated _MegatronSequentialMLP.layer_sync_moe_local_experts_amax(...) and _QuantSparseMoe.layer_sync_moe_local_experts_amax(...) to accept sync_weight_amax (default False) and forward it to the core sync utility.
  • Core Sync Utility (modelopt/torch/quantization/utils/core_utils.py): Updated sync_moe_expert_amax(experts, sync_weight_amax=False) to optionally include weight_quantizer.amax in collection/synchronization and adjusted collection/fallback logic for modules with/without amax.
  • Tests / Quantization Behavior (tests/gpu_megatron/torch/quantization/plugins/test_megatron.py): Removed use_te plumbing, switched TE selection to transformer_impl, adjusted MoE sharded/state-dict and TEGrouped-vs-sequential quantization tests, and extended local-expert amax sync tests to exercise sync_weight_amax.

Sequence Diagram(s)

sequenceDiagram
    participant Config as Config (MaxCalibConfig)
    participant Calibrator as max_calibrate()
    participant MoELayer as layer_sync_moe_local_experts_amax()
    participant SyncUtil as sync_moe_expert_amax()
    participant Experts as Local Experts

    Config->>Calibrator: provide sync_expert_weight_amax flag
    Calibrator->>MoELayer: call layer_sync_moe_local_experts_amax(sync_weight_amax=flag)
    MoELayer->>SyncUtil: call sync_moe_expert_amax(experts, sync_weight_amax=flag)
    SyncUtil->>Experts: collect amax (input_quantizer[, weight_quantizer if flag])
    SyncUtil-->>MoELayer: return synchronized amax values
    MoELayer-->>Calibrator: complete local-expert sync
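To make the flow above concrete, here is a rough Python sketch of the core sync step. It is illustrative only and is not the actual sync_moe_expert_amax implementation; in particular it omits the fallback calibration for weight quantizers that never collected amax, which is discussed in the review comments below.

import torch


def sync_moe_expert_amax_sketch(experts, sync_weight_amax: bool = False) -> None:
    """Illustrative sketch: share the element-wise max amax across local experts."""
    amax_dict: dict[str, torch.Tensor] = {}
    # Pass 1: collect the maximum amax per quantizer name across all experts.
    for expert in experts:
        for name, module in expert.named_modules():
            include = name.endswith("input_quantizer") or (
                sync_weight_amax and name.endswith("weight_quantizer")
            )
            if not include or getattr(module, "amax", None) is None:
                continue
            prev = amax_dict.get(name)
            amax_dict[name] = module.amax if prev is None else torch.maximum(prev, module.amax)
    # Pass 2: write the shared maximum back so every expert uses identical scales.
    for expert in experts:
        for name, module in expert.named_modules():
            if name in amax_dict:
                module.amax = amax_dict[name]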

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 41.67%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title directly summarizes the main changes: fixing MoE quantization tests and adding the sync_expert_weight_amax option, which are the primary objectives of this changeset.
  • Security Anti-Patterns (✅ Passed): The pull request introduces only configuration and parameter additions with no new security anti-patterns.


Contributor

github-actions bot commented Apr 1, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1158/

Built to branch gh-pages at 2026-04-02 21:55 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/utils/core_utils.py (1)

533-558: ⚠️ Potential issue | 🟠 Major

Populate missing weight amax before taking the cross-expert max.

When sync_weight_amax=True, Line 548 can assign a shared value to an expert that never collected weight stats, so Lines 552-557 no longer run its weight-only fallback. If that unrouted expert has the largest weights, it never contributes to the synced max, and SequentialMLP can still diverge from TEGroupedMLP. Calibrate missing weight quantizers first, or run a second max-reduction after the fallback.

💡 One way to preserve TEGroupedMLP semantics
-    amax_dict: dict[str, torch.Tensor] = {}
+    from ..model_calib import max_calibrate
+
+    if sync_weight_amax:
+        for expert in experts:
+            for name, module in expert.named_modules():
+                if name.endswith("weight_quantizer") and module.is_enabled and module.amax is None:
+                    weight = expert.state_dict().get(name.replace("weight_quantizer", "weight"))
+                    if weight is not None:
+                        max_calibrate(module, lambda m, w=weight: m(w), distributed_sync=False)
+
+    amax_dict: dict[str, torch.Tensor] = {}
@@
-    from ..model_calib import max_calibrate
-
     for expert in experts:
         for name, module in expert.named_modules():
             if name.endswith("weight_quantizer") and module.is_enabled and module.amax is None:
                 weight = expert.state_dict().get(name.replace("weight_quantizer", "weight"))
                 if weight is not None:
                     max_calibrate(module, lambda m, w=weight: m(w), distributed_sync=False)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/core_utils.py` around lines 533 - 558, The
current loop computes amax_dict and assigns shared amax to weight_quantizer
modules before running the per-expert weight-only fallback (max_calibrate),
causing experts that never collected weight stats to be skipped; change the
order so missing weight amax values are calibrated first (using max_calibrate on
modules where name.endswith("weight_quantizer") and module.is_enabled and
module.amax is None, retrieving weight via expert.state_dict()) and only then
compute the cross-expert maximum into amax_dict and assign it, or alternatively
retain the existing first max-reduction but add a second pass that recomputes
amax_dict (torch.maximum reduction) after the fallback; reference
functions/classes to update: TensorQuantizer, amax_dict, max_calibrate,
sync_weight_amax, and the expert.named_modules() loops.
🧹 Nitpick comments (1)
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py (1)

638-676: Pin the no-token expert case in this test.

sync_expert_weight_amax only changes behavior when some SequentialMLP experts never collect weight amax. With random routing and batch_size=8, this helper can still touch every local expert, so the regression can slip through. Force at least one expert to receive zero tokens, or compare the synced weight amax tensors against TEGroupedMLP directly. As per coding guidelines, "tests/**/*.py: Write tests using pytest for all new features and examples".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu_megatron/torch/quantization/plugins/test_megatron.py` around lines
638 - 676, The test can miss the no-token-expert case because random routing
with batch_size=8 can touch every expert; update the test around
TEGroupedMLP/SequentialMLP (where copy_weights_from_grouped_to_non_grouped,
mtq.quantize, seq_quant_cfg, and forward are used) to force at least one
SequentialMLP expert to receive zero tokens before quantization (e.g., make the
gating deterministic or craft inputs so only a subset of the num_moe_experts=4
experts are selected) so sync_expert_weight_amax behavior is exercised;
alternatively, after quantizing both models compare the synced expert weight
amax tensors from sequential_moe_model directly against the TEGroupedMLP amaxs
to ensure they match.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 133-136: The call to
layer_sync_moe_local_experts_amax(sync_weight_amax=sync_expert_weight_amax) is
incompatible with some MoE implementations (e.g., the HuggingFace plugin) that
do not accept the sync_weight_amax kwarg; update the call site in the loop over
model.named_modules() to detect the method signature or fall back at runtime
(e.g., use inspect.signature(module.layer_sync_moe_local_experts_amax) to check
for a sync_weight_amax parameter or try calling with the kwarg and on TypeError
call without it) so the call works for both implementations, leaving the
HuggingFace and Megatron implementations unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ce8a1169-4247-4e70-95e3-e0d28bbc2ce2

📥 Commits

Reviewing files that changed from the base of the PR and between 09b3c0b and 4900377.

📒 Files selected for processing (5)
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/plugins/megatron.py
  • modelopt/torch/quantization/utils/core_utils.py
  • tests/gpu_megatron/torch/quantization/plugins/test_megatron.py

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)

556-556: Add explicit type hints on the updated sync hook signature.

Line 556 introduces sync_weight_amax without type annotation. Please annotate it (and return type) for mypy consistency.

Suggested diff
-    def layer_sync_moe_local_experts_amax(self, sync_weight_amax=False):
+    def layer_sync_moe_local_experts_amax(self, sync_weight_amax: bool = False) -> None:

As per coding guidelines "Ensure type hints are properly annotated for static type checking with mypy".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` at line 556, The new hook
method layer_sync_moe_local_experts_amax has an unannotated parameter and
missing return type; add explicit type hints for sync_weight_amax (e.g., bool)
and the method return type (likely None) to satisfy mypy. Locate the method
layer_sync_moe_local_experts_amax in the HuggingFace quantization plugin and
update its signature to include the parameter annotation and a return annotation
consistent with other sync hooks in the file (use bool for sync_weight_amax and
-> None if it returns nothing).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fc0ff616-d05f-4d63-9b3b-f2ee1462617c

📥 Commits

Reviewing files that changed from the base of the PR and between 4900377 and fd5c48e.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/plugins/huggingface.py


codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.54%. Comparing base (c37c74f) to head (d56d7ca).
⚠️ Report is 10 commits behind head on main.

Files with missing lines: modelopt/torch/quantization/plugins/megatron.py (patch 0.00%, 2 lines missing ⚠️)

❗ There is a different number of reports uploaded between BASE (c37c74f) and HEAD (d56d7ca): HEAD has 1 upload fewer than BASE (BASE: 1, HEAD: 0).
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1158       +/-   ##
===========================================
- Coverage   70.20%   54.54%   -15.67%     
===========================================
  Files         230      348      +118     
  Lines       26098    39781    +13683     
===========================================
+ Hits        18322    21697     +3375     
- Misses       7776    18084    +10308     
Flag unit: coverage 54.54% <80.00%> (Δ ?)

Flags with carried forward coverage won't be shown.


Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py (1)

667-669: Preserve existing algorithm options when enabling sync_expert_weight_amax.

Line 668 replaces the entire algorithm config, which can silently drop other algorithm fields if callers pass a richer config. Prefer merging instead of overwriting.

♻️ Proposed refactor
-    seq_quant_cfg = copy.deepcopy(quant_cfg)
-    seq_quant_cfg["algorithm"] = {"method": "max", "sync_expert_weight_amax": True}
+    seq_quant_cfg = copy.deepcopy(quant_cfg)
+    algo_cfg = seq_quant_cfg.get("algorithm", {})
+    if isinstance(algo_cfg, str):
+        algo_cfg = {"method": algo_cfg}
+    else:
+        algo_cfg = copy.deepcopy(algo_cfg)
+    algo_cfg["method"] = "max"
+    algo_cfg["sync_expert_weight_amax"] = True
+    seq_quant_cfg["algorithm"] = algo_cfg
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu_megatron/torch/quantization/plugins/test_megatron.py` around lines
667 - 669, The test currently overwrites the entire algorithm dict
(seq_quant_cfg["algorithm"] = {...}), which can drop existing algorithm fields
from quant_cfg; instead merge the new option into the existing algorithm dict so
other options are preserved—e.g., obtain the original algorithm from
seq_quant_cfg (or quant_cfg), update it with
{"method":"max","sync_expert_weight_amax":True}, assign that merged dict back to
seq_quant_cfg["algorithm"], and then call mtq.quantize(sequential_moe_model,
seq_quant_cfg, forward).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a5410045-2e76-452d-a309-ae5994d24b6a

📥 Commits

Reviewing files that changed from the base of the PR and between fd5c48e and d56d7ca.

📒 Files selected for processing (1)
  • tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
