Refactor Qwen3.5 MoE quantization to use _QuantFunctionalMixin#1170

Open
cjluo-nv wants to merge 3 commits into main from chenjiel/refactor_qwen35

Conversation

Collaborator

@cjluo-nv commented Apr 2, 2026

Summary

  • Refactors _QuantQwen35MoeExperts from a QuantModule with a custom forward to _QuantFunctionalMixin, keeping the original HF forward unmodified (a single fused F.linear plus chunk instead of two separate matmuls per expert)
  • Adds per-expert quantizer ModuleLists with expert index recovery via storage offset, preserving per-expert calibration granularity
  • Adds _export_qwen35_experts in moe_utils.py to split fused 3D params into per-expert named tensors at export time, reusing _export_quantized_weight for all quantization formats
  • Moves Qwen3_5MoeSparseMoeBlock to the fused gate_up_proj/down_proj expert linear names group in layer_utils.py
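The interception pattern described above can be sketched as below. This is a minimal, self-contained illustration rather than ModelOpt's actual code: FakeQuant, Experts, and the patching context manager are hypothetical stand-ins; only the storage-offset trick for recovering the expert index and the F.linear interception mirror what the PR describes.

```python
# Hypothetical sketch of the _QuantFunctionalMixin approach; names are
# illustrative, not ModelOpt's real API.
import torch
import torch.nn.functional as F
from contextlib import contextmanager


class FakeQuant(torch.nn.Module):
    """Stand-in per-expert quantizer: records amax, passes the tensor through."""

    def __init__(self):
        super().__init__()
        self.amax = None

    def forward(self, x):
        self.amax = x.abs().amax()
        return x


class Experts(torch.nn.Module):
    """Fused experts: one 3D weight [E, out, in]; HF-style forward kept as-is."""

    def __init__(self, num_experts, in_f, out_f):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(num_experts, out_f, in_f))
        self.weight_quantizers = torch.nn.ModuleList(
            FakeQuant() for _ in range(num_experts)
        )

    def _expert_idx(self, w):
        # Each expert slice is a view sharing storage with the fused parameter,
        # so storage_offset() // per-expert element count yields the expert index.
        per_expert = self.weight[0].numel()
        return w.storage_offset() // per_expert

    @contextmanager
    def _patched_linear(self):
        # Temporarily intercept F.linear (process-wide, so for a sketch only).
        orig = F.linear

        def quantized_linear(x, w, bias=None):
            idx = self._expert_idx(w)
            return orig(x, self.weight_quantizers[idx](w), bias)

        F.linear = quantized_linear
        try:
            yield
        finally:
            F.linear = orig

    def forward(self, x, expert_idx):
        # The original per-expert math is untouched; quantization happens
        # inside the intercepted F.linear call.
        with self._patched_linear():
            return F.linear(x, self.weight[expert_idx])
```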

Test plan

  • Run MoE quantization unit tests: python -m pytest tests/unit/torch/quantization/plugins/test_sparse_moe.py -x
  • Run export tests: python -m pytest tests/gpu/torch/export/ -x
  • Verify exported checkpoint naming matches experts.{E}.gate_proj.weight convention
  • Verify no regression on Qwen3 MoE (non-3.5) quantization

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Corrected Qwen3.5 MoE block expert detection logic.
  • New Features

    • Added quantized export support for Qwen3.5 Mixture of Experts models with per-expert quantization buffers.
  • Improvements

    • Optimized MoE expert quantization using functional interception for improved efficiency.

cjluo-nv added 2 commits April 2, 2026 18:06
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
cjluo-nv requested review from a team as code owners April 2, 2026 18:18
cjluo-nv requested review from meenchen and sychen52 April 2, 2026 18:18
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1170/

Built to branch gh-pages at 2026-04-02 18:28 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@coderabbitai
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough

Walkthrough

This change implements Qwen3.5 MoE expert quantization and export by updating expert classification, adding specialized expert splitting and quantization logic, integrating new export paths in the unified export pipeline, and replacing the decomposition-based quantization approach with a functional wrapper that intercepts linear operations.

Changes

  • Expert Linear Name Detection — modelopt/torch/export/layer_utils.py: Moved Qwen3_5MoeSparseMoeBlock from the Qwen-style unfused mapping to the fused gate_up_proj/down_proj mapping, aligning its classification with GptOssMoE-type modules.
  • Expert Quantization & Splitting — modelopt/torch/export/moe_utils.py: Added the _export_qwen35_experts function to decompose fused Qwen3.5 MoE weights into per-expert submodules, export quantized weights with per-expert scales, apply fallback quantization logic for uncalibrated weights, and clean up the original fused parameters.
  • Unified Export Integration — modelopt/torch/export/unified_export_hf.py: Integrated Qwen3.5 MoE expert export into _process_quantized_modules and _export_transformers_checkpoint to call the new export function and skip redundant per-expert processing.
  • Functional Quantization Wrapper — modelopt/torch/quantization/plugins/huggingface.py: Replaced per-expert decomposition with a _QuantQwen35MoeExperts functional wrapper that intercepts torch.nn.functional.linear calls, extracts expert indices from fused weights, and applies per-expert quantization without materializing intermediate submodules.

Sequence Diagram

sequenceDiagram
    participant Exporter as Unified Exporter
    participant MoeModule as Qwen3.5 MoE Module
    participant SplitLogic as Expert Splitting Logic
    participant QuantLogic as Quantization Logic
    participant Storage as Module Storage

    Exporter->>MoeModule: Identify QuantQwen3_5MoeExperts
    Exporter->>SplitLogic: Call _export_qwen35_experts()
    
    SplitLogic->>MoeModule: Access fused gate_up_proj & down_proj
    SplitLogic->>SplitLogic: Decompose fused weights per expert
    
    loop For each expert slice
        SplitLogic->>QuantLogic: Export quantized weight & scales
        QuantLogic->>QuantLogic: Apply per-channel amax fallback
        QuantLogic->>QuantLogic: Compute amax if uncalibrated
    end
    
    SplitLogic->>Storage: Register per-expert submodules
    SplitLogic->>Storage: Remove fused parameters
    SplitLogic->>Exporter: Return with per-expert structure
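The splitting step in the diagram can be sketched roughly as below. This assumes a simple layout in which each expert's gate and up halves are stacked along the first dimension of its gate_up_proj slice; the real fused layout in Qwen3.5 (and the helper name split_fused_experts) may differ, and per-expert quantizer scales are omitted.

```python
# Illustrative sketch of splitting fused MoE weights into per-expert named
# tensors at export time. Shapes and the experts.{E}.gate_proj.weight naming
# follow the PR description; the helper itself is hypothetical.
import torch


def split_fused_experts(gate_up_proj: torch.Tensor, down_proj: torch.Tensor):
    """gate_up_proj: [E, 2*I, H]; down_proj: [E, H, I] -> per-expert tensors."""
    num_experts, two_i, _ = gate_up_proj.shape
    inter = two_i // 2
    tensors = {}
    for e in range(num_experts):
        # Split the fused slice into its gate and up halves (assumed stacked).
        gate, up = gate_up_proj[e].split(inter, dim=0)
        tensors[f"experts.{e}.gate_proj.weight"] = gate.contiguous()
        tensors[f"experts.{e}.up_proj.weight"] = up.contiguous()
        tensors[f"experts.{e}.down_proj.weight"] = down_proj[e].contiguous()
    return tensors
```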

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check — Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — Passed: the title accurately summarizes the main change, refactoring the Qwen3.5 MoE quantization implementation to use _QuantFunctionalMixin instead of custom forward logic.
  • Docstring Coverage — Passed: no functions found in the changed files to evaluate docstring coverage; check skipped.
  • Security Anti-Patterns — Passed: the pull request complies with all security coding practices outlined in SECURITY.md; no unsafe patterns detected.



Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

coderabbitai bot left a comment


🧹 Nitpick comments (2)
modelopt/torch/export/moe_utils.py (1)

105-117: Amax slicing logic is correct but inconsistent with line 130.

The proportional slicing for per-channel amax is mathematically correct. However, line 117 sets w_quantizer._amax (the internal attribute), while line 130 sets w_quantizer.amax (the property). Consider using the property setter consistently for proper validation:

-               w_quantizer._amax = amax[slice_start:slice_end].contiguous()
+               w_quantizer.amax = amax[slice_start:slice_end].contiguous()

This ensures any property-level validation in TensorQuantizer.amax.setter is applied uniformly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/moe_utils.py` around lines 105 - 117, The per-channel
amax slice currently assigns directly to the internal attribute
w_quantizer._amax (in the block that checks hasattr(w_quantizer, "_amax")),
which bypasses any validation in the TensorQuantizer.amax property; change this
to assign via the property (e.g., set w_quantizer.amax =
sliced_amax.contiguous()) instead of writing to _amax so the
TensorQuantizer.amax.setter runs consistently with the later code that uses
w_quantizer.amax.
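A minimal sketch of the fix this comment suggests, with TensorQuantizer stubbed down to just the amax property (the real class lives in modelopt.torch.quantization and has far more behavior), and slice_expert_amax as a hypothetical helper showing the proportional per-channel slicing:

```python
# Sketch: slice a fused per-channel amax for one expert and assign it via
# the amax property so setter-side validation runs, as the review suggests.
import torch


class TensorQuantizer:
    """Minimal stub: amax is a property, mirroring the review's point that
    assignment should go through the setter, not the _amax attribute."""

    def __init__(self):
        self._amax = None

    @property
    def amax(self):
        return self._amax

    @amax.setter
    def amax(self, value):
        # Stand-in for whatever validation the real setter performs.
        assert torch.all(value >= 0), "amax must be non-negative"
        self._amax = value


def slice_expert_amax(fused_amax: torch.Tensor, expert_idx: int, num_experts: int):
    """Take expert_idx's proportional slice of a fused per-channel amax."""
    per_expert = fused_amax.numel() // num_experts
    start = expert_idx * per_expert
    q = TensorQuantizer()
    # Assign through the property, not q._amax, so the setter runs.
    q.amax = fused_amax[start:start + per_expert].contiguous()
    return q
```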
modelopt/torch/quantization/plugins/huggingface.py (1)

805-828: Consider thread-safety implications of the toggle mechanism.

The toggle state (_down_proj_linear, _current_expert_idx) is instance-level mutable state accessed during F.linear interception. If the same module instance is used concurrently (e.g., in data-parallel training without proper synchronization), the toggle could become inconsistent across threads.

This is likely fine for typical inference/calibration workloads (single-threaded forward), but worth noting for future maintainers if concurrent usage becomes a requirement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 805 - 828,
The toggle state used in functionals_to_replace via the nested _quantized_linear
(specifically instance fields _down_proj_linear and _current_expert_idx) is
mutable and not thread-safe; replace the instance-level toggle with a
thread-local or per-call state to avoid race conditions when F.linear is
intercepted concurrently. Concretely, change _quantized_linear to use a
threading.local() or local context object (created outside or on the stack)
keyed per-thread/call to store the down-proj boolean and current expert index
(instead of _down_proj_linear/_current_expert_idx), or protect access with a
lightweight Lock around reads/writes; update uses of
_get_expert_idx_from_gate_up, gate_up_proj_input_quantizers,
gate_up_proj_weight_quantizers, down_proj_input_quantizers and
down_proj_weight_quantizers to read/write the thread-local or locked state so
concurrent forwards don’t clobber each other.
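A sketch of the thread-local alternative suggested here; ExpertCallState and its method names are hypothetical, illustrating only the pattern of moving the toggle off the module instance so concurrent forwards keep independent state:

```python
# Sketch: keep the down-proj toggle and current expert index in
# threading.local rather than on the module instance.
import threading


class ExpertCallState:
    def __init__(self):
        self._tls = threading.local()

    def _state(self):
        # Lazily initialize per-thread fields on first access in each thread.
        if not hasattr(self._tls, "down_proj"):
            self._tls.down_proj = False
            self._tls.expert_idx = None
        return self._tls

    def begin_expert(self, idx: int):
        """Reset state for a new expert's gate_up -> down call pair."""
        s = self._state()
        s.expert_idx = idx
        s.down_proj = False

    def toggle(self):
        """Flip between gate_up (False) and down_proj (True) interceptions."""
        s = self._state()
        s.down_proj = not s.down_proj
        return s.down_proj

    @property
    def expert_idx(self):
        return self._state().expert_idx
```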

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1a839b2a-9e97-4d74-a751-5dd420978867

📥 Commits

Reviewing files that changed from the base of the PR and between 665cc63 and 59d10d9.

📒 Files selected for processing (4)
  • modelopt/torch/export/layer_utils.py
  • modelopt/torch/export/moe_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/plugins/huggingface.py

@codecov

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 11.70213% with 83 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.74%. Comparing base (00c002f) to head (59d10d9).
⚠️ Report is 3 commits behind head on main.

Files with missing lines:
  • modelopt/torch/export/moe_utils.py — patch 3.70%, 52 lines missing ⚠️
  • modelopt/torch/quantization/plugins/huggingface.py — patch 18.18%, 27 lines missing ⚠️
  • modelopt/torch/export/unified_export_hf.py — patch 33.33%, 4 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1170      +/-   ##
==========================================
+ Coverage   74.27%   75.74%   +1.47%     
==========================================
  Files         349      349              
  Lines       39846    39886      +40     
==========================================
+ Hits        29594    30212     +618     
+ Misses      10252     9674     -578     
Flag coverage Δ:
  • examples: 43.87% <11.70%> (+4.81%) ⬆️
  • gpu: 57.03% <9.57%> (-0.23%) ⬇️
  • unit: 54.48% <8.51%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

