[Calibration] Add MoE Calibration Context #1596
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @dsikka, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces specialized support for calibrating Mixture-of-Experts (MoE) models within the llmcompressor
framework. It enables model-specific adjustments during the calibration process, which is crucial for accurately quantizing these complex architectures. The changes ensure that MoE models like Qwen3 and DeepseekV3 can be properly handled, improving the overall effectiveness of quantization for these models.
Highlights
- MoE Calibration Context: Introduced a new `moe_calibration_context` mechanism to apply model-specific modifications during the calibration phase for Mixture-of-Experts (MoE) models. This allows for the specialized handling required by MoE architectures during quantization.
- Model-Specific MoE Handling: Implemented specific context updates for Qwen3 MoE models (patching the `top_k` attribute of MLP modules) and DeepseekV3 models (replacing MLP modules with a specialized version) to ensure proper calibration behavior for these architectures.
- Pipeline Integration: Integrated the `calibrate_moe_context` flag into the `oneshot` entrypoint and both the `Independent` and `Sequential` calibration pipelines. This enables conditional application of the MoE-specific calibration logic during the overall quantization process.
- Qwen3 MoE Example: Added a new example script (`examples/quantization_w4a4_fp4/qwen_30b_a2b.py`) demonstrating how to quantize a Qwen3-30B-A3B MoE model using the new calibration context and the `NVFP4` scheme (a rough sketch of such a script follows this list).
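The example script itself is not included in this excerpt, so here is a minimal sketch of what such a script could look like. Treat it as an assumption-laden illustration rather than the contents of `examples/quantization_w4a4_fp4/qwen_30b_a2b.py`: the calibration dataset, ignore list, sample counts, and output path are all placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to the NVFP4 scheme, leaving the LM head and the
# MoE router gates unquantized (assumed ignore list).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

oneshot(
    model=model,
    dataset="ultrachat_200k",       # assumed calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    calibrate_moe_context=True,     # enable the temporary MoE calibration context
)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```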
Code Review
This pull request introduces a calibration context for Mixture-of-Experts (MoE) models, which is a great addition for handling this model architecture during quantization. The changes involve adding logic to activate all experts during calibration for supported models like Qwen3 and DeepseekV3, and plumbing this feature through the `oneshot` workflow.
I've identified a critical issue in the implementation that will cause crashes for non-MoE models. I've also pointed out a high-severity issue related to a hardcoded feature flag and a few medium-severity issues regarding code clarity and robustness. Addressing these points will significantly improve the quality and stability of this new feature.
Is the plan to use this for llama4 as well, or will that be a separate function?
```python
def update_qwen3_moe(model, stack):
    for _, module in model.named_modules():
        cls_name = module.__class__.__name__
        if cls_name == "Qwen3MoeDecoderLayer":
```
Could you use something like this pattern for matching? This way things don't break if the parent's structure changes, and we can also share matching logic between replacements
```python
for name, module in model.named_modules():
    cls_name = module.__class__.__name__
    if cls_name in replacements:
        new_module = replacements[cls_name](module)
        replace_module(model, name, new_module)
```
I think if we want to use `patch_attr` in order to follow the other context set-up functionality, we need both the parent and the child, so it would still require setting "mlp" - I think `replace_modules` finds the parent for you when replacing the module.
We could potentially expand `patch_attr` to follow that pattern.
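For context, here is a minimal sketch of what a `patch_attr`-style context manager does and why the parent object is needed (the attribute is set on the parent, e.g. a decoder layer's `mlp`). This is an illustrative helper under stated assumptions, not the project's actual implementation.

```python
import contextlib


@contextlib.contextmanager
def patch_attr(obj, attr: str, value):
    # Temporarily set obj.<attr> = value; the original value is restored on
    # exit, so the patch only lasts for the duration of the calibration context.
    original = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield
    finally:
        setattr(obj, attr, original)


# Usage requires the parent that owns the attribute, e.g. patching a decoder
# layer's "mlp" submodule (hypothetical names):
#   with patch_attr(decoder_layer, "mlp", calibration_mlp):
#       run_calibration_forward_passes()
```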
```python
}


def moe_calibration_context(model: PreTrainedModel, stack):
```
Can you show what it's like to pass additional calibration options (`moe_calibrate_all_experts`, `moe_calibrate_gated_acts`), if these are still options we want to supply to research/users?
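For illustration, one hypothetical way such options could be threaded through the context is sketched below. The option names come from this comment, and `moe_context` stands in for the per-architecture dispatch dictionary described in the PR summary; none of this is the PR's actual signature.

```python
from contextlib import ExitStack

from transformers import PreTrainedModel

# Stand-in for the per-architecture dispatch dict described in the PR summary.
moe_context: dict = {}


def moe_calibration_context(
    model: PreTrainedModel,
    stack: ExitStack,
    moe_calibrate_all_experts: bool = True,
    moe_calibrate_gated_acts: bool = True,
):
    # Look up the update function for this architecture and forward the options;
    # architectures without an entry (non-MoE models) are left untouched.
    update_fn = moe_context.get(model.__class__.__name__)
    if update_fn is not None:
        update_fn(
            model,
            stack,
            calibrate_all_experts=moe_calibrate_all_experts,
            calibrate_gated_acts=moe_calibrate_gated_acts,
        )
```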
I left a small comment but see other comment below.
```
@@ -0,0 +1,69 @@
import torch
```
May want to add the HF copyright and a note about the amendments
I added the HF copyright - do you have a reference for what type of note should be made about amendments?
I think for Llama4, we may want to change the structure permanently, in which case we'd want to use the `replace_modules_for_calibration` pathway instead.
couple nits
```
@@ -117,6 +117,16 @@ class DatasetArguments(CustomDatasetArguments):
        default=512,
        metadata={"help": "Number of samples to use for one-shot calibration"},
    )
    calibrate_moe_context: bool = field(
```
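The field is truncated in the diff excerpt above; a plausible completion is sketched below. The default value and help text are assumptions inferred from the discussion that follows, not the merged code.

```python
    calibrate_moe_context: bool = field(
        default=False,
        metadata={
            "help": (
                "If True, apply temporary MoE-specific modifications to the "
                "model for the duration of calibration; the original module "
                "definitions are restored afterwards."
            )
        },
    )
```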
Shouldn't this basically always be on? Is there ever a case where a user shouldn't use this?
When you want to call `prepare_for_calibration` and permanently change the module definition, as opposed to only for the duration of calibration.
Hm, but there's no conflict between `prepare_for_calibration` and `calibrate_moe_context`, right? I think it'd also look a little confusing to calibrate an MoE model, but explicitly call:

```python
prepare_for_calibration(my_moe_model)
oneshot(my_moe_model, calibrate_moe_context=False)
```
Would it be better to just remove it and always run with `calibrate_moe_context=True`?

We require `prepare_for_calibration` to be explicitly applied; the idea was to do the same with this MoE calibration context being enabled.
But yes, there is no conflict between the two. I think we can technically run DeepSeek with both, but I haven't tested it with the context.
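For clarity, a hedged illustration of the two flows contrasted in this thread (names follow the PR summary, where `replace_modules_for_calibration` was previously `prepare_for_calibration`; import paths and surrounding arguments are placeholders):

```python
from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration  # assumed import path

# `dataset` and `recipe` below are placeholders for whatever calibration
# dataset and quantization recipe are in use.

# Flow 1: permanently swap in calibration-friendly MoE modules before oneshot;
# the model keeps the replaced module definitions after calibration.
model = replace_modules_for_calibration(model)
oneshot(model=model, dataset=dataset, recipe=recipe)

# Flow 2: patch MoE modules only for the duration of calibration; the original
# module definitions are restored once calibration finishes.
oneshot(model=model, dataset=dataset, recipe=recipe, calibrate_moe_context=True)
```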
Force-pushed from e850a7c to 528cdc8
Summary:
- Adds `moe_calibration_context`, which during calibration replaces MoE blocks with custom modules that are needed to properly calibrate all experts requiring data
- Adds a `calibrate_moe_context` argument which, if set to True, will enable the context
- The new MoE module definitions live in the `prepare` folder (shared with `replace_modules_for_calibration`, previously called `prepare_for_calibration`)
- Similar to `replace_modules_for_calibration`, a dictionary defining the replacement has been added: `moe_context`

Testing:
- Ran the `Qwen/Qwen3-30B-A3B` NVFP4 example and added the example to the folder as well

Next Steps: