
[Calibration] Add MoE Calibration Context #1596


Open · wants to merge 22 commits into main

Conversation


@dsikka dsikka commented Jun 25, 2025

Summary:

  • Introduce a moe_calibration_context which, during calibration, replaces MoE blocks with custom modules that are needed to properly calibrate all experts by ensuring each receives data
  • The context can be optionally enabled through a new calibrate_moe_context argument; when set to True, the replacements are applied for the duration of calibration
  • Modules are replaced with the new definitions in the prepare folder (shared with replace_modules_for_calibration)
  • This enables a second pathway for calibrating MoEs and other models that require updates to their modules to be compatible with llm-compressor:
  1. Replacing modules during calibration
  2. Replacing modules permanently (as done by replace_modules_for_calibration, previously called prepare_for_calibration).
  • Similar to replace_modules_for_calibration, a dictionary defining the replacements has been added: moe_context (see the sketch below)
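
For illustration, a minimal sketch of how the context and the moe_context registry might fit together. The dispatch-by-class-name shape follows the PR diff; the registry keys, the DeepseekV3 entry, and run_calibration are assumptions/placeholders:

    from contextlib import ExitStack

    from transformers import PreTrainedModel

    # Registry mapping model class names to context-update functions,
    # analogous to the dictionary used by replace_modules_for_calibration.
    # (Keys and the DeepseekV3 entry are assumed for illustration.)
    moe_context = {
        "Qwen3MoeForCausalLM": update_qwen3_moe,
        "DeepseekV3ForCausalLM": update_deepseekv3,
    }

    def moe_calibration_context(model: PreTrainedModel, stack: ExitStack):
        # Dispatch on the model class; the handler registers temporary
        # module patches on the ExitStack, all reverted when it exits.
        cls_name = model.__class__.__name__
        if cls_name in moe_context:
            moe_context[cls_name](model, stack)

    # Usage during calibration:
    with ExitStack() as stack:
        moe_calibration_context(model, stack)
        run_calibration(model, dataloader)  # hypothetical calibration loop
    # original module definitions are restored here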

Testing

  • Tested with a Qwen/Qwen3-30B-A3B NVFP4 example; the example has also been added to the examples folder (a condensed sketch follows)
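
A condensed sketch of what that example plausibly looks like; the dataset choice and ignore list are assumptions, while the NVFP4 scheme and the calibrate_moe_context flag come from this PR:

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = QuantizationModifier(
        targets="Linear", scheme="NVFP4", ignore=["lm_head"]
    )

    oneshot(
        model="Qwen/Qwen3-30B-A3B",
        dataset="open_platypus",        # assumed calibration dataset
        recipe=recipe,
        num_calibration_samples=512,
        calibrate_moe_context=True,     # swap MoE modules for calibration only
    )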

Next Steps:

  • Definitions for the updated MoE modules are hardcoded at the moment - we want to expand them and add additional parameters to allow more control over the MoE forward pass, such as through the parameters defined in [MoE] Add MoE calibration options #1593 - this is especially important if we find that a certain configuration results in optimal calibration
  • We may find it easier to refactor the calibration arguments into their own pydantic model rather than putting everything under the dataset args


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @dsikka, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces specialized support for calibrating Mixture-of-Experts (MoE) models within the llmcompressor framework. It enables model-specific adjustments during the calibration process, which is crucial for accurately quantizing these complex architectures. The changes ensure that MoE models like Qwen3 and DeepseekV3 can be properly handled, improving the overall effectiveness of quantization for these models.

Highlights

  • MoE Calibration Context: Introduced a new moe_calibration_context mechanism to apply model-specific modifications during the calibration phase for Mixture-of-Experts (MoE) models. This allows for specialized handling required by MoE architectures during quantization.
  • Model-Specific MoE Handling: Implemented specific context updates for Qwen3 MoE models (patching the top_k attribute of MLP modules) and DeepseekV3 models (replacing MLP modules with a specialized version) to ensure proper calibration behavior for these architectures (see the sketch after this list).
  • Pipeline Integration: Integrated the calibrate_moe_context flag into the oneshot entrypoint and both the Independent and Sequential calibration pipelines. This enables conditional application of the MoE-specific calibration logic during the overall quantization process.
  • Qwen3 MoE Example: Added a new example script (examples/quantization_w4a4_fp4/qwen_30b_a2b.py) demonstrating how to quantize a Qwen3-30B-A3B MoE model using the new calibration context and the NVFP4 scheme.
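
To make the Qwen3 handling above concrete, a hedged sketch of the top_k patch; the patch_attr import path and the config attribute name are assumptions:

    from contextlib import ExitStack

    from compressed_tensors.utils import patch_attr  # assumed import path

    def update_qwen3_moe(model, stack: ExitStack):
        # Route every token to every expert during calibration by raising
        # top_k; patch_attr restores the original value when the stack exits.
        num_experts = model.config.num_experts  # assumed config attribute
        for _, module in model.named_modules():
            if module.__class__.__name__ == "Qwen3MoeDecoderLayer":
                stack.enter_context(patch_attr(module.mlp, "top_k", num_experts))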

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a calibration context for Mixture-of-Experts (MoE) models, which is a great addition for handling this model architecture during quantization. The changes involve adding logic to activate all experts during calibration for supported models like Qwen3 and DeepseekV3, and plumbing this feature through the oneshot workflow.

I've identified a critical issue in the implementation that will cause crashes for non-MoE models. I've also pointed out a high-severity issue related to a hardcoded feature flag and a few medium-severity issues regarding code clarity and robustness. Addressing these points will significantly improve the quality and stability of this new feature.

@dsikka dsikka added the ready When a PR is ready for review label Jul 3, 2025
@dsikka dsikka marked this pull request as ready for review July 3, 2025 16:29
@kylesayrs kylesayrs left a comment

Is the plan to use this for llama4 as well, or will that be a separate function?

    def update_qwen3_moe(model, stack):
        for _, module in model.named_modules():
            cls_name = module.__class__.__name__
            if cls_name == "Qwen3MoeDecoderLayer":

Could you use something like this pattern for matching? This way things don't break if the parent's structure changes, and we can also share matching logic between replacements

    for name, module in model.named_modules():
        cls_name = module.__class__.__name__
        if cls_name in replacements:
            new_module = replacements[cls_name](module)
            replace_module(model, name, new_module)

@dsikka dsikka Jul 9, 2025

I think if we want to use patch_attr, in order to follow the other context set-up functionality, we need both the parent and the child, so it would still require setting "mlp". I think replace_modules finds the parent for you when replacing the module.

We could potentially expand patch_attr to follow that pattern.
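
For illustration, a hedged sketch of the difference being discussed; decoder_layer and replacement_mlp are placeholders and the signatures are assumed:

    # patch_attr takes the parent module plus the attribute name, so the
    # caller must spell out "mlp"; the swap is reverted when the stack exits.
    stack.enter_context(patch_attr(decoder_layer, "mlp", replacement_mlp))

    # replace_module-style helpers take a dotted module name and resolve
    # the parent internally, so no attribute name is needed at the call site.
    replace_module(model, "model.layers.0.mlp", replacement_mlp)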

}


def moe_calibration_context(model: PreTrainedModel, stack):

Can you show what it's like to pass additional calibration options (moe_calibrate_all_experts, moe_calibrate_gated_acts), if these are still options we want to supply to researchers/users?

@dsikka (Author)
I left a small comment, but see my other comment below.

@@ -0,0 +1,69 @@
import torch

May want to add the HF copyright and a note about the amendments

@dsikka (Author)

I added the HF copyright - do you have a reference for what type of note should be made about the amendments?


dsikka commented Jul 7, 2025

> Is the plan to use this for llama4 as well, or will that be a separate function?

I think for Llama4 we may want to change the structure permanently, in which case we'd want to use replace_modules_for_calibration so that we can also compress it correctly post-calibration.

@brian-dellabetta brian-dellabetta left a comment

A couple of nits:

    @@ -117,6 +117,16 @@ class DatasetArguments(CustomDatasetArguments):
            default=512,
            metadata={"help": "Number of samples to use for one-shot calibration"},
        )
        calibrate_moe_context: bool = field(

Shouldn't this basically always be on? Is there ever a case where a user shouldn't use this?

@dsikka dsikka Jul 14, 2025

When you want to call prepare_for_calibration and permanently change the module definition, as opposed to changing it only for the duration of calibration.

@brian-dellabetta

Hm, but there's no conflict between prepare_for_calibration and calibrate_moe_context, right? I think it'd also look a little confusing to calibrate an MoE model, but explicitly call

prepare_for_calibration(my_moe_model)
oneshot(my_moe_model, calibrate_moe_context=False)

@dsikka dsikka Jul 15, 2025

Would it be better to just remove it and always run with calibrate_moe_context=True?

We require prepare_for_calibration to be applied explicitly; the idea was for enabling this MoE calibration context to work the same way.

But yes, there is no conflict between the two. I think we can technically run deepseek with both, but I haven't tested it with the context.
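
For reference, a hedged sketch contrasting the two pathways discussed in this thread; argument details are assumptions:

    # Pathway 1: swap MoE modules only for the duration of calibration.
    oneshot(model=my_moe_model, recipe=recipe, calibrate_moe_context=True)

    # Pathway 2: swap modules permanently via replace_modules_for_calibration
    # (previously prepare_for_calibration), so the model also compresses
    # correctly after calibration.
    replace_modules_for_calibration(my_moe_model)
    oneshot(model=my_moe_model, recipe=recipe)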

dsikka added 4 commits July 14, 2025 20:19