Refactor Qwen3.5 MoE quantization to use _QuantFunctionalMixin#1170
Conversation
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
📝 Walkthrough

This change implements Qwen3.5 MoE expert quantization and export by updating expert classification, adding specialized expert splitting and quantization logic, integrating new export paths into the unified export pipeline, and replacing the decomposition-based quantization approach with a functional wrapper that intercepts linear operations.
Sequence DiagramsequenceDiagram
participant Exporter as Unified Exporter
participant MoeModule as Qwen3.5 MoE Module
participant SplitLogic as Expert Splitting Logic
participant QuantLogic as Quantization Logic
participant Storage as Module Storage
Exporter->>MoeModule: Identify QuantQwen3_5MoeExperts
Exporter->>SplitLogic: Call _export_qwen35_experts()
SplitLogic->>MoeModule: Access fused gate_up_proj & down_proj
SplitLogic->>SplitLogic: Decompose fused weights per expert
loop For each expert slice
SplitLogic->>QuantLogic: Export quantized weight & scales
QuantLogic->>QuantLogic: Apply per-channel amax fallback
QuantLogic->>QuantLogic: Compute amax if uncalibrated
end
SplitLogic->>Storage: Register per-expert submodules
SplitLogic->>Storage: Remove fused parameters
SplitLogic->>Exporter: Return with per-expert structure
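The "functional wrapper that intercepts linear operations" idea behind `_QuantFunctionalMixin` can be illustrated with a minimal, self-contained sketch. All names here (`F`, `fake_quant`, `replace_function`, `quantizing`) are illustrative stand-ins, not modelopt or PyTorch APIs: the module's own forward is left untouched, and the linear entry point it calls is temporarily swapped for a quantizing wrapper.

```python
import contextlib

class F:
    """Stand-in for torch.nn.functional; the real code intercepts F.linear."""
    @staticmethod
    def linear(x, w):
        return x * w  # toy scalar "linear"

def fake_quant(v, step=0.5):
    """Toy fake-quantization: snap a value onto a uniform grid."""
    return round(v / step) * step

@contextlib.contextmanager
def replace_function(owner, name, make_wrapper):
    """Swap owner.<name> for a wrapped version inside the context.

    This mirrors the functional-mixin idea: the unmodified forward keeps
    calling F.linear, but inside the context the call is routed through
    quantizers first, and the original is restored afterwards."""
    original = getattr(owner, name)
    setattr(owner, name, make_wrapper(original))
    try:
        yield
    finally:
        setattr(owner, name, original)

def quantizing(original):
    def quantized_linear(x, w):
        # quantize input and weight before the underlying linear op
        return original(fake_quant(x), fake_quant(w))
    return quantized_linear

with replace_function(F, "linear", quantizing):
    quant_out = F.linear(1.2, 0.9)   # inputs snapped to the 0.5 grid
plain_out = F.linear(1.2, 0.9)       # original behavior restored
```

The key design point is that quantization is attached at the call site rather than by rewriting the forward, so a single fused `F.linear` + chunk path stays intact.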
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ Passed checks (4 passed)
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
🧹 Nitpick comments (2)
modelopt/torch/export/moe_utils.py (1)
105-117: Amax slicing logic is correct but inconsistent with line 130.

The proportional slicing for per-channel amax is mathematically correct. However, line 117 sets `w_quantizer._amax` (the internal attribute), while line 130 sets `w_quantizer.amax` (the property). Consider using the property setter consistently for proper validation:

```diff
- w_quantizer._amax = amax[slice_start:slice_end].contiguous()
+ w_quantizer.amax = amax[slice_start:slice_end].contiguous()
```

This ensures any property-level validation in `TensorQuantizer.amax.setter` is applied uniformly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/moe_utils.py` around lines 105-117: the per-channel amax slice currently assigns directly to the internal attribute `w_quantizer._amax` (in the block that checks `hasattr(w_quantizer, "_amax")`), which bypasses any validation in the `TensorQuantizer.amax` property; change this to assign via the property (e.g., `w_quantizer.amax = sliced_amax.contiguous()`) instead of writing to `_amax`, so `TensorQuantizer.amax.setter` runs consistently with the later code that uses `w_quantizer.amax`.

modelopt/torch/quantization/plugins/huggingface.py (1)

805-828: Consider thread-safety implications of the toggle mechanism.

The toggle state (`_down_proj_linear`, `_current_expert_idx`) is instance-level mutable state accessed during `F.linear` interception. If the same module instance is used concurrently (e.g., in data-parallel training without proper synchronization), the toggle could become inconsistent across threads.

This is likely fine for typical inference/calibration workloads (single-threaded forward), but worth noting for future maintainers if concurrent usage becomes a requirement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 805-828: the toggle state used in `functionals_to_replace` via the nested `_quantized_linear` (specifically the instance fields `_down_proj_linear` and `_current_expert_idx`) is mutable and not thread-safe; replace the instance-level toggle with thread-local or per-call state to avoid race conditions when `F.linear` is intercepted concurrently. Concretely, change `_quantized_linear` to use a `threading.local()` (or a per-call context object) to store the down-proj boolean and current expert index, or protect access with a lightweight `Lock` around reads/writes; update uses of `_get_expert_idx_from_gate_up`, `gate_up_proj_input_quantizers`, `gate_up_proj_weight_quantizers`, `down_proj_input_quantizers`, and `down_proj_weight_quantizers` to read/write that state so concurrent forwards don't clobber each other.
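The thread-local variant suggested above can be sketched as follows. This is a minimal illustration, not the actual modelopt code; the function names are invented, and the fields mirror the review's `_down_proj_linear`/`_current_expert_idx`:

```python
import threading

class _ExpertCallState(threading.local):
    """Per-thread toggle state for intercepted linear calls.

    Each thread gets its own copy of these fields, so concurrent
    forwards cannot clobber each other's expert index or down-proj flag."""
    def __init__(self):
        self.down_proj_linear = False
        self.current_expert_idx = -1

_state = _ExpertCallState()

def on_gate_up_linear(expert_idx):
    # gate_up_proj runs first: remember which expert and arm the toggle
    _state.current_expert_idx = expert_idx
    _state.down_proj_linear = True

def on_down_proj_linear():
    # down_proj runs next: consume this thread's toggle and expert index
    idx, armed = _state.current_expert_idx, _state.down_proj_linear
    _state.down_proj_linear = False
    return idx, armed
```

Because the state lives in a `threading.local`, the gate-up/down-proj pairing stays consistent per thread without any lock on the hot path.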
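The per-channel amax reslicing discussed in the first nitpick can be sketched in plain Python. `ToyQuantizer` below is a toy stand-in for modelopt's `TensorQuantizer` (its validation logic is invented for illustration): each expert's quantizer receives its proportional slice of the fused amax, assigned through the property so setter-side validation always runs.

```python
class ToyQuantizer:
    """Toy stand-in for a TensorQuantizer with a validating amax setter."""
    @property
    def amax(self):
        return self._amax

    @amax.setter
    def amax(self, value):
        # setter-side validation that a direct write to _amax would skip
        if not value or any(v < 0 for v in value):
            raise ValueError("amax must be non-empty and non-negative")
        self._amax = list(value)

num_experts, channels_per_expert = 4, 8
# fused per-channel amax covering all experts back to back
fused_amax = [float(i) for i in range(num_experts * channels_per_expert)]

quantizers = [ToyQuantizer() for _ in range(num_experts)]
for idx, q in enumerate(quantizers):
    start = idx * channels_per_expert
    end = start + channels_per_expert
    # assign via the property (q.amax = ...), not q._amax = ...,
    # so the validation above runs for every expert slice
    q.amax = fused_amax[start:end]
```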
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1a839b2a-9e97-4d74-a751-5dd420978867
📒 Files selected for processing (4)

- modelopt/torch/export/layer_utils.py
- modelopt/torch/export/moe_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/plugins/huggingface.py
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1170      +/-   ##
==========================================
+ Coverage   74.27%   75.74%   +1.47%
==========================================
  Files         349      349
  Lines       39846    39886      +40
==========================================
+ Hits        29594    30212     +618
+ Misses      10252     9674     -578
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Summary

- Migrated `_QuantQwen35MoeExperts` from a `QuantModule` with a custom forward to `_QuantFunctionalMixin`, keeping the original HF forward unmodified (single fused `F.linear` + chunk instead of two separate matmuls per expert)
- Per-expert quantizers are kept in `ModuleList`s with expert index recovery via storage offset, preserving per-expert calibration granularity
- Added `_export_qwen35_experts` in `moe_utils.py` to split fused 3D params into per-expert named tensors at export time, reusing `_export_quantized_weight` for all quantization formats
- Added `Qwen3_5MoeSparseMoeBlock` to the fused `gate_up_proj`/`down_proj` expert linear names group in `layer_utils.py`

Test plan

- `python -m pytest tests/unit/torch/quantization/plugins/test_sparse_moe.py -x`
- `python -m pytest tests/gpu/torch/export/ -x`
- Verified exported tensors follow the `experts.{E}.gate_proj.weight` naming convention

🤖 Generated with Claude Code
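Conceptually, the export-time split performed by `_export_qwen35_experts` can be sketched with plain-Python lists standing in for tensors. The shapes and layout here are assumptions for illustration only; the tensor names follow the `experts.{E}.gate_proj.weight` convention mentioned in the test plan.

```python
# Hypothetical sketch: split a fused 3D gate_up_proj parameter of shape
# [num_experts][hidden][2 * intermediate] into per-expert gate/up tensors.
num_experts, hidden, intermediate = 2, 3, 2
gate_up = [
    [[e * 100 + h * 10 + c for c in range(2 * intermediate)] for h in range(hidden)]
    for e in range(num_experts)
]

state_dict = {}
for e in range(num_experts):
    # chunk the last dim: first half -> gate_proj, second half -> up_proj
    gate = [row[:intermediate] for row in gate_up[e]]
    up = [row[intermediate:] for row in gate_up[e]]
    state_dict[f"experts.{e}.gate_proj.weight"] = gate
    state_dict[f"experts.{e}.up_proj.weight"] = up
```

After the split, the fused parameter can be dropped and each named slice exported through the regular quantized-weight path, which is what lets all quantization formats reuse the same code.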