[NVBug 6045859]Fix export support for Qwen3VL MoE experts#1164

Open
shengliangxu wants to merge 4 commits into main from shengliangx/qwen3vlmoe-export

Conversation

Collaborator

@shengliangxu shengliangxu commented Apr 1, 2026

What does this PR do?

Fix HF checkpoint export support for Qwen3-VL MoE models (e.g. Qwen/Qwen3-VL-30B-A3B-Instruct).

Previously, running hf_ptq.py on Qwen3-VL MoE models failed during export_hf_checkpoint with:

NotImplementedError: MoE model with experts type 'QuantQwen3VLMoeTextExperts' is not supported in export.

Root cause: _QuantQwen3VLMoeTextExperts stored expert weights as flat nn.ModuleLists (one per projection type), making the module non-iterable. The export code requires sub_module.experts to be iterable to handle input quantizer amax and gate/up amax sync.

Fix: Refactor _QuantQwen3VLMoeTextExperts to use per-expert module containers, matching the established _QuantQwen35MoeExperts pattern:

  • Add _Qwen3VLMoeExpertModule container class with gate_proj, up_proj, down_proj
  • Register experts as numbered children producing state dict keys like experts.{id}.gate_proj.weight (standard Qwen3 MoE naming, compatible with vLLM)
  • Implement __len__/__iter__/__getitem__ for iterability
  • Add Qwen3VLMoeTextSparseMoeBlock to get_expert_linear_names in layer_utils.py
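The container pattern the bullets above describe can be sketched in plain Python (torch-free stand-ins with hypothetical names, so the shapes of the classes are easy to see) to show the iterability contract the export code relies on:

```python
# Illustrative sketch of the per-expert container pattern described above.
# Plain Python stand-ins replace torch.nn modules; class names mirror the
# PR but everything here is hypothetical, not the actual implementation.

class _ExpertModule:
    """Stand-in for _Qwen3VLMoeExpertModule (gate/up/down projections)."""

    def __init__(self, hidden_size, expert_dim):
        self.gate_proj = ("linear", hidden_size, expert_dim)
        self.up_proj = ("linear", hidden_size, expert_dim)
        self.down_proj = ("linear", expert_dim, hidden_size)


class _QuantExperts:
    """Stand-in for _QuantQwen3VLMoeTextExperts after the refactor."""

    def __init__(self, num_experts, hidden_size, expert_dim):
        # Register experts as numbered children, which yields state-dict
        # keys like "experts.{id}.gate_proj.weight" (standard Qwen3 MoE
        # naming, compatible with vLLM).
        self.experts = {
            i: _ExpertModule(hidden_size, expert_dim) for i in range(num_experts)
        }

    def __len__(self):
        return len(self.experts)

    def __iter__(self):
        return iter(self.experts.values())

    def __getitem__(self, idx):
        return self.experts[idx]


experts = _QuantExperts(num_experts=4, hidden_size=8, expert_dim=16)
assert len(experts) == 4
# Iterable per-expert access is exactly what export_hf_checkpoint needs
# for input-quantizer amax and gate/up amax sync.
assert all(hasattr(e, "gate_proj") for e in experts)
```

The previous layout (one flat ModuleList per projection type) offered no per-expert object to iterate over, which is why export raised NotImplementedError.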

Files changed:

  • modelopt/torch/quantization/plugins/huggingface.py — refactored expert module structure
  • modelopt/torch/export/layer_utils.py — added Qwen3VLMoe to expert linear name mapping

Testing

  • [x] FP8 PTQ + export on Qwen/Qwen3-VL-30B-A3B-Instruct:
    python examples/llm_ptq/hf_ptq.py \
      --pyt_ckpt_path=Qwen/Qwen3-VL-30B-A3B-Instruct \
      --export_path=<output_dir> \
      --qformat=fp8 --calib_size=8 --batch_size=1
    
  • [x] Verify exported checkpoint loads in vLLM

Summary by CodeRabbit

  • New Features

    • Added support for an additional Qwen3 Vision Language Model Mixture of Experts variant.
  • Improvements

    • Restructured the Mixture of Experts expert modules so quantized checkpoints export correctly.

@copy-pr-bot

copy-pr-bot bot commented Apr 1, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Contributor

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

Added recognition of a new Qwen3 MoE expert module type (Qwen3VLMoeTextSparseMoeBlock) to the export layer utilities. Refactored the quantization plugin's expert container structure to use a unified _Qwen3VLMoeExpertModule wrapper instead of separate ModuleLists, with updated forward pass routing and container protocol support.

Changes

Changes by cohort / file:

  • MoE Expert Type Recognition — modelopt/torch/export/layer_utils.py
    Extended get_expert_linear_names() to recognize the Qwen3VLMoeTextSparseMoeBlock module type and return the standard Qwen expert linear layer names.
  • Expert Container Refactoring — modelopt/torch/quantization/plugins/huggingface.py
    Restructured _QuantQwen3VLMoeTextExperts to use a new _Qwen3VLMoeExpertModule container per expert instead of three parallel ModuleLists. Updated weight registration and forward-pass indexing, and added __len__, __iter__, and __getitem__ for container protocol support.
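The layer_utils.py side of the change amounts to one extra name-based branch in the expert-name lookup. A hedged sketch (the real function lives in modelopt/torch/export/layer_utils.py; the dispatch below is illustrative, not the actual code):

```python
# Hedged sketch of the get_expert_linear_names() extension described above.
# The real implementation may dispatch differently; matching on the
# module's class name here is an assumption for illustration.

def get_expert_linear_names(module) -> list[str]:
    """Return the linear-layer attribute names of one MoE expert."""
    qwen_like = (
        "Qwen3MoeSparseMoeBlock",
        "Qwen3VLMoeTextSparseMoeBlock",  # branch added by this PR
    )
    if type(module).__name__ in qwen_like:
        # Standard Qwen expert projection names.
        return ["gate_proj", "up_proj", "down_proj"]
    raise NotImplementedError(f"Unsupported MoE block: {type(module).__name__}")


class Qwen3VLMoeTextSparseMoeBlock:  # dummy stand-in for the HF class
    pass


print(get_expert_linear_names(Qwen3VLMoeTextSparseMoeBlock()))
# -> ['gate_proj', 'up_proj', 'down_proj']
```

These names line up with the per-expert containers registered in huggingface.py, so export walks experts.{id}.{name}.weight uniformly.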

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 62.50%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Security Anti-Patterns — ✅ Passed. The pull request introduces only class refactoring for Qwen3-VL MoE expert module handling, with no security anti-patterns detected.
  • Title check — ✅ Passed. The title clearly describes the main change: fixing export support for Qwen3VL MoE experts.



@github-actions
Contributor

github-actions bot commented Apr 1, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1164/

Built to branch gh-pages at 2026-04-03 22:52 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 20.00000% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.53%. Comparing base (df80a0f) to head (e949e33).

Files with missing lines:
  • modelopt/torch/quantization/plugins/huggingface.py — patch 20.00%, 20 lines missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1164       +/-   ##
===========================================
- Coverage   74.76%   63.53%   -11.24%     
===========================================
  Files         351      351               
  Lines       40072    40084       +12     
===========================================
- Hits        29961    25468     -4493     
- Misses      10111    14616     +4505     
Flag coverage:
  • examples — 40.27% <20.00%> (-0.03%) ⬇️
  • gpu — 18.76% <20.00%> (-38.47%) ⬇️
  • unit — 54.74% <20.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.


@shengliangxu shengliangxu force-pushed the shengliangx/qwen3vlmoe-export branch from daf0144 to de55e8a on April 2, 2026 21:37
…ontainers

Qwen3VLMoeTextExperts stored expert weights as flat ModuleLists
(gate_proj, up_proj, down_proj), making the module non-iterable. The HF
export code requires `sub_module.experts` to be iterable, causing a
NotImplementedError during `export_hf_checkpoint`.

Refactor _QuantQwen3VLMoeTextExperts to use per-expert module
containers (matching the _QuantQwen35MoeExperts pattern):

- Add _Qwen3VLMoeExpertModule container class
- Register experts as numbered children (experts.{id}.gate_proj.weight)
- Implement __len__/__iter__/__getitem__ for iterability
- Add Qwen3VLMoeSparseMoeBlock to get_expert_linear_names

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
@shengliangxu shengliangxu reopened this Apr 2, 2026
@shengliangxu shengliangxu marked this pull request as ready for review April 2, 2026 23:28
@shengliangxu shengliangxu requested review from a team as code owners April 2, 2026 23:28
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)

690-701: Consider unifying with _Qwen35MoeExpertModule.

This class is nearly identical to _Qwen35MoeExpertModule (lines 792-803), differing only in parameter naming (hidden_size vs hidden_dim). Consider creating a single reusable expert module class to reduce code duplication.

♻️ Proposed unified expert module
-class _Qwen3VLMoeExpertModule(nn.Module):
-    """Container for a single Qwen3VL MoE expert's linear layers.
-
-    Produces the naming pattern: experts.{id}.gate_proj.weight
-    (consistent with standard Qwen3 MoE per-expert module structure).
-    """
-
-    def __init__(self, hidden_size: int, expert_dim: int):
-        super().__init__()
-        self.gate_proj = nn.Linear(hidden_size, expert_dim, bias=False)
-        self.up_proj = nn.Linear(hidden_size, expert_dim, bias=False)
-        self.down_proj = nn.Linear(expert_dim, hidden_size, bias=False)
+class _QwenMoeExpertModule(nn.Module):
+    """Container for a single Qwen MoE expert's linear layers.
+
+    Produces the naming pattern: experts.{id}.gate_proj.weight
+    (consistent with standard Qwen MoE per-expert module structure).
+    Reusable for Qwen3VL, Qwen3.5, and similar variants.
+    """
+
+    def __init__(self, hidden_dim: int, expert_dim: int):
+        super().__init__()
+        self.gate_proj = nn.Linear(hidden_dim, expert_dim, bias=False)
+        self.up_proj = nn.Linear(hidden_dim, expert_dim, bias=False)
+        self.down_proj = nn.Linear(expert_dim, hidden_dim, bias=False)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 690 - 701,
These two nearly identical classes (_Qwen3VLMoeExpertModule and
_Qwen35MoeExpertModule) should be replaced with one reusable expert module (e.g.
_QwenMoeExpertModule) that accepts the common parameters (hidden_dim/hidden_size
-> hidden_dim, expert_dim) and exposes gate_proj, up_proj, down_proj with the
same naming pattern (experts.{id}.gate_proj.weight); update all instantiations
that referenced _Qwen3VLMoeExpertModule and _Qwen35MoeExpertModule to use the
new class and normalize the parameter name to hidden_dim to remove duplication
while preserving behavior and bias=False on the Linear layers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d09e8137-0fe2-4503-be18-2045798557e7

📥 Commits

Reviewing files that changed from the base of the PR and between 18ddcb7 and 998e258.

📒 Files selected for processing (2)
  • modelopt/torch/export/layer_utils.py
  • modelopt/torch/quantization/plugins/huggingface.py

@shengliangxu shengliangxu changed the title Add export support for Qwen3VL MoE experts with ModuleList linear layers Fix export support for Qwen3VL MoE experts Apr 3, 2026
@shengliangxu shengliangxu requested a review from cjluo-nv April 3, 2026 17:55
@shengliangxu shengliangxu changed the title Fix export support for Qwen3VL MoE experts [NVBug 6045859]Fix export support for Qwen3VL MoE experts Apr 3, 2026