quant blocked fp8 #4018
base: main
Conversation
skip_patterns = [
    'lm_head',
    'embed_tokens',
    'mlp.gate',  # sparse MOE router gate
    'vision_model',  # non-HF InternVL, vision part
    'mlp1',  # non-HF InternVL, projector
    'mlp2',  # non-HF InternVL-Flash, projector
    'vision_tower',  # HF InternVL, vision part
    'multi_modal_projector',  # HF InternVL, projector
]
modules_to_not_convert = []
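For illustration only, a hypothetical sketch of how such patterns could be expanded into modules_to_not_convert; the helper name and substring-matching rule are assumptions, not the PR's actual code.

import torch.nn as nn

def collect_skipped_modules(model: nn.Module, skip_patterns: list) -> list:
    """Collect names of modules whose qualified name matches any skip pattern."""
    skipped = []
    for name, _module in model.named_modules():
        if any(pattern in name for pattern in skip_patterns):
            skipped.append(name)
    return skipped

# e.g. modules_to_not_convert = collect_skipped_modules(model, skip_patterns)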
These configurations are model-specific. We should adopt a more maintainable approach.
I checked the vLLM FP8 compressor example and noticed that the ignored patterns are indeed model-specific. Currently, these patterns are passed as an input argument named ignore in the quantization recipe:
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8
How about we also expose this as a configurable input argument, allowing users to define their own ignore patterns as needed?
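For reference, a rough sketch of how the linked llm-compressor example passes these patterns through the recipe's ignore argument; exact import paths and arguments may differ across llm-compressor versions, and the model id below is a placeholder.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# The model-specific skip/ignore patterns live in the recipe, not in library code.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)
oneshot(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model id
    recipe=recipe,
    output_dir="Qwen2.5-7B-Instruct-FP8-Dynamic",
)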
@RunningLeon As discussed with @CUHKSZzxy, we propose a new --skip-pattern config.py option for custom skip patterns, used alongside lmdeploy's internal defaults. What's your opinion?
Personally, if we only need to pass skip patterns, a config file is not necessary.
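For illustration, passing patterns directly (without a config file) could look like the hypothetical sketch below; the flag name and parsing are assumptions, not the PR's actual interface.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--skip-pattern',
    action='append',
    default=None,
    help='extra module-name patterns to exclude from FP8 quantization, '
         'merged with the internal defaults',
)

# e.g. --skip-pattern vision_tower --skip-pattern mlp.gate
args = parser.parse_args(['--skip-pattern', 'vision_tower', '--skip-pattern', 'mlp.gate'])
print(args.skip_pattern)  # ['vision_tower', 'mlp.gate']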
""" | ||
tensor: torch.Tensor | ||
scale: torch.Tensor | ||
weight_scale_inv: torch.Tensor |
Changing scale to weight_scale_inv might affect w8a8 quantized model inference.
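For context, a minimal sketch of how a per-block weight_scale_inv is typically consumed at dequantization time, assuming a DeepSeek-style blocked-FP8 layout with 128x128 blocks; this is illustrative, not this PR's exact kernel.

import torch

def dequant_blocked_fp8(weight: torch.Tensor,
                        weight_scale_inv: torch.Tensor,
                        block: int = 128) -> torch.Tensor:
    """Dequantize an fp8 weight of shape [out, in] using per-block scales
    of shape [ceil(out/block), ceil(in/block)]."""
    out_f, in_f = weight.shape
    # Broadcast each block scale over its block x block tile, then crop to
    # the weight shape (which may not be a multiple of the block size).
    scale = weight_scale_inv.repeat_interleave(block, dim=0)
    scale = scale.repeat_interleave(block, dim=1)[:out_f, :in_f]
    return weight.to(torch.float32) * scale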
@RunningLeon @grimoire any good ideas?
Usage
NOTE: We can use either the pytorch or turbomind backend for FP8 inference. Here we take the pytorch backend as an example.
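A minimal inference sketch with the pytorch backend; the model path below is a placeholder for a blocked-FP8 quantized checkpoint.

from lmdeploy import pipeline, PytorchEngineConfig

# Placeholder path: point this at a blocked-FP8 quantized checkpoint.
pipe = pipeline('InternVL3_5-8B-FP8',
                backend_config=PytorchEngineConfig())
print(pipe(['Introduce yourself briefly.']))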
Accuracy
Dataset: OCRBench
Model: InternVL3.5-8B (FP8), InternVL3_5-30B-A3B (FP8)
Tested with VLMEvalKit.
Checklist
- Check whether the weight_scale_inv modification affects other quant methods / modules