Refactor Qwen3.5 MoE quantization to use _QuantFunctionalMixin#1170

Open
cjluo-nv wants to merge 3 commits into main from chenjiel/refactor_qwen35

Conversation

Collaborator

@cjluo-nv commented Apr 2, 2026

Summary

  • Refactors _QuantQwen35MoeExperts from a QuantModule with a custom forward to _QuantFunctionalMixin, keeping the original HF forward unmodified (a single fused F.linear plus chunk instead of two separate matmuls per expert)
  • Adds per-expert quantizer ModuleLists with expert index recovery via storage offset, preserving per-expert calibration granularity
  • Adds _export_qwen35_experts in moe_utils.py to split fused 3D params into per-expert named tensors at export time, reusing _export_quantized_weight for all quantization formats
  • Moves Qwen3_5MoeSparseMoeBlock to the fused gate_up_proj/down_proj expert linear names group in layer_utils.py
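The interception pattern described above can be sketched as below. This is a minimal, self-contained illustration rather than ModelOpt's actual code: FakeQuant, Experts, and the patching context manager are hypothetical stand-ins; only the storage-offset trick for recovering the expert index and the F.linear interception mirror what the PR describes.

```python
# Hypothetical sketch of the _QuantFunctionalMixin approach; names are
# illustrative, not ModelOpt's real API.
import torch
import torch.nn.functional as F
from contextlib import contextmanager


class FakeQuant(torch.nn.Module):
    """Stand-in per-expert quantizer: records amax, passes the tensor through."""

    def __init__(self):
        super().__init__()
        self.amax = None

    def forward(self, x):
        self.amax = x.abs().amax()
        return x


class Experts(torch.nn.Module):
    """Fused experts: one 3D weight [E, out, in]; HF-style forward kept as-is."""

    def __init__(self, num_experts, in_f, out_f):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(num_experts, out_f, in_f))
        self.weight_quantizers = torch.nn.ModuleList(
            FakeQuant() for _ in range(num_experts)
        )

    def _expert_idx(self, w):
        # Each expert slice is a view sharing storage with the fused parameter,
        # so storage_offset() // per-expert element count yields the expert index.
        per_expert = self.weight[0].numel()
        return w.storage_offset() // per_expert

    @contextmanager
    def _patched_linear(self):
        # Temporarily intercept F.linear (process-wide, so for a sketch only).
        orig = F.linear

        def quantized_linear(x, w, bias=None):
            idx = self._expert_idx(w)
            return orig(x, self.weight_quantizers[idx](w), bias)

        F.linear = quantized_linear
        try:
            yield
        finally:
            F.linear = orig

    def forward(self, x, expert_idx):
        # The original per-expert math is untouched; quantization happens
        # inside the intercepted F.linear call.
        with self._patched_linear():
            return F.linear(x, self.weight[expert_idx])
```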

Test plan

  • Run MoE quantization unit tests: python -m pytest tests/unit/torch/quantization/plugins/test_sparse_moe.py -x
  • Run export tests: python -m pytest tests/gpu/torch/export/ -x
  • Verify exported checkpoint naming matches experts.{E}.gate_proj.weight convention
  • Verify no regression on Qwen3 MoE (non-3.5) quantization

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Corrected Qwen3.5 MoE block expert detection logic.
  • New Features

    • Added quantized export support for Qwen3.5 Mixture of Experts models with per-expert quantization buffers.
  • Improvements

    • Optimized MoE expert quantization using functional interception for improved efficiency.

cjluo-nv added 2 commits April 2, 2026 18:06
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
cjluo-nv requested review from a team as code owners April 2, 2026 18:18
cjluo-nv requested review from meenchen and sychen52 April 2, 2026 18:18
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1170/

Built to branch gh-pages at 2026-04-02 18:28 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@coderabbitai
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough

Walkthrough

This change implements Qwen3.5 MoE expert quantization and export by updating expert classification, adding specialized expert splitting and quantization logic, integrating new export paths in the unified export pipeline, and replacing the decomposition-based quantization approach with a functional wrapper that intercepts linear operations.

Changes

  • Expert Linear Name Detection — modelopt/torch/export/layer_utils.py: Moved Qwen3_5MoeSparseMoeBlock from the Qwen-style unfused mapping to the fused gate_up_proj/down_proj mapping, aligning its classification with GptOssMoE-type modules.
  • Expert Quantization & Splitting — modelopt/torch/export/moe_utils.py: Added the _export_qwen35_experts function to decompose fused Qwen3.5 MoE weights into per-expert submodules, export quantized weights with per-expert scales, apply fallback quantization logic for uncalibrated weights, and clean up the original fused parameters.
  • Unified Export Integration — modelopt/torch/export/unified_export_hf.py: Integrated Qwen3.5 MoE expert export into _process_quantized_modules and _export_transformers_checkpoint to call the new export function and skip redundant per-expert processing.
  • Functional Quantization Wrapper — modelopt/torch/quantization/plugins/huggingface.py: Replaced per-expert decomposition with a _QuantQwen35MoeExperts functional wrapper that intercepts torch.nn.functional.linear calls, extracts expert indices from fused weights, and applies per-expert quantization without materializing intermediate submodules.

Sequence Diagram

sequenceDiagram
    participant Exporter as Unified Exporter
    participant MoeModule as Qwen3.5 MoE Module
    participant SplitLogic as Expert Splitting Logic
    participant QuantLogic as Quantization Logic
    participant Storage as Module Storage

    Exporter->>MoeModule: Identify QuantQwen3_5MoeExperts
    Exporter->>SplitLogic: Call _export_qwen35_experts()
    
    SplitLogic->>MoeModule: Access fused gate_up_proj & down_proj
    SplitLogic->>SplitLogic: Decompose fused weights per expert
    
    loop For each expert slice
        SplitLogic->>QuantLogic: Export quantized weight & scales
        QuantLogic->>QuantLogic: Apply per-channel amax fallback
        QuantLogic->>QuantLogic: Compute amax if uncalibrated
    end
    
    SplitLogic->>Storage: Register per-expert submodules
    SplitLogic->>Storage: Remove fused parameters
    SplitLogic->>Exporter: Return with per-expert structure
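The splitting step in the diagram can be sketched roughly as below. This assumes a simple layout in which each expert's gate and up halves are stacked along the first dimension of its gate_up_proj slice; the real fused layout in Qwen3.5 (and the helper name split_fused_experts) may differ, and per-expert quantizer scales are omitted.

```python
# Illustrative sketch of splitting fused MoE weights into per-expert named
# tensors at export time. Shapes and the experts.{E}.gate_proj.weight naming
# follow the PR description; the helper itself is hypothetical.
import torch


def split_fused_experts(gate_up_proj: torch.Tensor, down_proj: torch.Tensor):
    """gate_up_proj: [E, 2*I, H]; down_proj: [E, H, I] -> per-expert tensors."""
    num_experts, two_i, _ = gate_up_proj.shape
    inter = two_i // 2
    tensors = {}
    for e in range(num_experts):
        # Split the fused slice into its gate and up halves (assumed stacked).
        gate, up = gate_up_proj[e].split(inter, dim=0)
        tensors[f"experts.{e}.gate_proj.weight"] = gate.contiguous()
        tensors[f"experts.{e}.up_proj.weight"] = up.contiguous()
        tensors[f"experts.{e}.down_proj.weight"] = down_proj[e].contiguous()
    return tensors
```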

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check — Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — Passed: the title accurately summarizes the main change, refactoring the Qwen3.5 MoE quantization implementation to use _QuantFunctionalMixin instead of custom forward logic.
  • Docstring Coverage — Passed: no functions found in the changed files to evaluate docstring coverage; check skipped.
  • Security Anti-Patterns — Passed: the pull request complies with all security coding practices outlined in SECURITY.md; no unsafe patterns detected.



Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

coderabbitai bot left a comment


🧹 Nitpick comments (2)
modelopt/torch/export/moe_utils.py (1)

105-117: Amax slicing logic is correct but inconsistent with line 130.

The proportional slicing for per-channel amax is mathematically correct. However, line 117 sets w_quantizer._amax (the internal attribute), while line 130 sets w_quantizer.amax (the property). Consider using the property setter consistently for proper validation:

-               w_quantizer._amax = amax[slice_start:slice_end].contiguous()
+               w_quantizer.amax = amax[slice_start:slice_end].contiguous()

This ensures any property-level validation in TensorQuantizer.amax.setter is applied uniformly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/moe_utils.py` around lines 105 - 117, The per-channel
amax slice currently assigns directly to the internal attribute
w_quantizer._amax (in the block that checks hasattr(w_quantizer, "_amax")),
which bypasses any validation in the TensorQuantizer.amax property; change this
to assign via the property (e.g., set w_quantizer.amax =
sliced_amax.contiguous()) instead of writing to _amax so the
TensorQuantizer.amax.setter runs consistently with the later code that uses
w_quantizer.amax.
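A minimal sketch of the fix this comment suggests, with TensorQuantizer stubbed down to just the amax property (the real class lives in modelopt.torch.quantization and has far more behavior), and slice_expert_amax as a hypothetical helper showing the proportional per-channel slicing:

```python
# Sketch: slice a fused per-channel amax for one expert and assign it via
# the amax property so setter-side validation runs, as the review suggests.
import torch


class TensorQuantizer:
    """Minimal stub: amax is a property, mirroring the review's point that
    assignment should go through the setter, not the _amax attribute."""

    def __init__(self):
        self._amax = None

    @property
    def amax(self):
        return self._amax

    @amax.setter
    def amax(self, value):
        # Stand-in for whatever validation the real setter performs.
        assert torch.all(value >= 0), "amax must be non-negative"
        self._amax = value


def slice_expert_amax(fused_amax: torch.Tensor, expert_idx: int, num_experts: int):
    """Take expert_idx's proportional slice of a fused per-channel amax."""
    per_expert = fused_amax.numel() // num_experts
    start = expert_idx * per_expert
    q = TensorQuantizer()
    # Assign through the property, not q._amax, so the setter runs.
    q.amax = fused_amax[start:start + per_expert].contiguous()
    return q
```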
modelopt/torch/quantization/plugins/huggingface.py (1)

805-828: Consider thread-safety implications of the toggle mechanism.

The toggle state (_down_proj_linear, _current_expert_idx) is instance-level mutable state accessed during F.linear interception. If the same module instance is used concurrently (e.g., in data-parallel training without proper synchronization), the toggle could become inconsistent across threads.

This is likely fine for typical inference/calibration workloads (single-threaded forward), but worth noting for future maintainers if concurrent usage becomes a requirement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 805 - 828,
The toggle state used in functionals_to_replace via the nested _quantized_linear
(specifically instance fields _down_proj_linear and _current_expert_idx) is
mutable and not thread-safe; replace the instance-level toggle with a
thread-local or per-call state to avoid race conditions when F.linear is
intercepted concurrently. Concretely, change _quantized_linear to use a
threading.local() or local context object (created outside or on the stack)
keyed per-thread/call to store the down-proj boolean and current expert index
(instead of _down_proj_linear/_current_expert_idx), or protect access with a
lightweight Lock around reads/writes; update uses of
_get_expert_idx_from_gate_up, gate_up_proj_input_quantizers,
gate_up_proj_weight_quantizers, down_proj_input_quantizers and
down_proj_weight_quantizers to read/write the thread-local or locked state so
concurrent forwards don’t clobber each other.
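A sketch of the thread-local alternative suggested here; ExpertCallState and its method names are hypothetical, illustrating only the pattern of moving the toggle off the module instance so concurrent forwards keep independent state:

```python
# Sketch: keep the down-proj toggle and current expert index in
# threading.local rather than on the module instance.
import threading


class ExpertCallState:
    def __init__(self):
        self._tls = threading.local()

    def _state(self):
        # Lazily initialize per-thread fields on first access in each thread.
        if not hasattr(self._tls, "down_proj"):
            self._tls.down_proj = False
            self._tls.expert_idx = None
        return self._tls

    def begin_expert(self, idx: int):
        """Reset state for a new expert's gate_up -> down call pair."""
        s = self._state()
        s.expert_idx = idx
        s.down_proj = False

    def toggle(self):
        """Flip between gate_up (False) and down_proj (True) interceptions."""
        s = self._state()
        s.down_proj = not s.down_proj
        return s.down_proj

    @property
    def expert_idx(self):
        return self._state().expert_idx
```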

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1a839b2a-9e97-4d74-a751-5dd420978867

📥 Commits

Reviewing files that changed from the base of the PR and between 665cc63 and 59d10d9.

📒 Files selected for processing (4)
  • modelopt/torch/export/layer_utils.py
  • modelopt/torch/export/moe_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/plugins/huggingface.py

@codecov

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 11.70213% with 83 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.74%. Comparing base (00c002f) to head (59d10d9).
⚠️ Report is 3 commits behind head on main.

Files with missing lines:
  • modelopt/torch/export/moe_utils.py — patch 3.70%, 52 lines missing ⚠️
  • modelopt/torch/quantization/plugins/huggingface.py — patch 18.18%, 27 lines missing ⚠️
  • modelopt/torch/export/unified_export_hf.py — patch 33.33%, 4 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1170      +/-   ##
==========================================
+ Coverage   74.27%   75.74%   +1.47%     
==========================================
  Files         349      349              
  Lines       39846    39886      +40     
==========================================
+ Hits        29594    30212     +618     
+ Misses      10252     9674     -578     
Flag coverage Δ:
  • examples: 43.87% <11.70%> (+4.81%) ⬆️
  • gpu: 57.03% <9.57%> (-0.23%) ⬇️
  • unit: 54.48% <8.51%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

