Conversation


@CUHKSZzxy CUHKSZzxy commented Sep 29, 2025

Usage

  1. Quantize

lmdeploy lite blocked_fp8 ${model_path} --work-dir ${quantized_model_path} --quant-dtype fp8

  2. Test case

NOTE: Either the PyTorch or the TurboMind backend can be used for FP8 inference. The PyTorch backend is used as an example below.

from lmdeploy import pipeline, PytorchEngineConfig

model_path = "OpenGVLab/InternVL3_5-8B-FP8"

if __name__ == '__main__':
    engine_config = PytorchEngineConfig(tp=1)
    pipe = pipeline(model_path, backend_config=engine_config)
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)
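
For the TurboMind backend, the same pipeline call works with TurbomindEngineConfig in place of PytorchEngineConfig (a minimal sketch mirroring the example above):

from lmdeploy import pipeline, TurbomindEngineConfig

model_path = "OpenGVLab/InternVL3_5-8B-FP8"

if __name__ == '__main__':
    # Same FP8 checkpoint as above; only the backend config changes.
    engine_config = TurbomindEngineConfig(tp=1)
    pipe = pipeline(model_path, backend_config=engine_config)
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)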

Accuracy

Dataset: OCRBench
Models: InternVL3.5-8B (FP8), InternVL3_5-30B-A3B (FP8)

| Backend   | InternVL3.5-8B | InternVL3.5-8B-FP8 | InternVL3_5-30B-A3B | InternVL3_5-30B-A3B-FP8 |
| --------- | -------------- | ------------------ | ------------------- | ----------------------- |
| TurboMind | 84.3           | 84.1               | 88.8                | 88.4                    |
| PyTorch   | 84.3           | 84.2               | 88.7                | 88.1                    |

Tested with VLMEvalKit.

Checklist

  • Align the quantization config with Qwen3 / InternS1 FP8
  • Add documentation for blocked FP8
  • Verify the FP8 model accuracy
  • Fix quantization for MoE models
  • Check whether the weight_scale_inv modification affects other quant methods / modules

@CUHKSZzxy CUHKSZzxy marked this pull request as ready for review September 30, 2025 04:35
@lvhan028 lvhan028 added the enhancement New feature or request label Oct 7, 2025
Comment on lines +44 to +54
skip_patterns = [
'lm_head',
'embed_tokens',
'mlp.gate', # sparse MOE router gate
'vision_model', # non-HF InternVL, vision part
'mlp1', # non-HF InternVL, projector
'mlp2', # non-HF InternVL-Flash, projector
'vision_tower', # HF InternVL, vision part
'multi_modal_projector', # HF InternVL, projector
]
modules_to_not_convert = []
Collaborator

These configurations are model-specific. We should adopt a more maintainable approach.

Collaborator Author

I checked the vLLM llm-compressor FP8 examples and noticed that the ignore patterns are indeed model-specific. Currently, these patterns are passed as an input argument named ignore in the quantization recipe.

https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8

https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w8a8_fp8/qwen2vl_example.py#L20

How about we also expose this as a configurable input argument, allowing users to define their own ignore patterns as needed?
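
For reference, the ignore argument in those llm-compressor examples looks roughly like this (paraphrased from the linked qwen2vl_example.py; the exact patterns are model-specific and may change upstream):

from llmcompressor.modifiers.quantization import QuantizationModifier

# Model-specific modules are excluded via the `ignore` argument of the recipe.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*"],
)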

Collaborator

@RunningLeon As discussed with @CUHKSZzxy, we propose adding a new --skip-pattern option (backed by a config.py file) for custom skip patterns, alongside lmdeploy's internal defaults.
What's your opinion?
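
A rough sketch of the idea (purely illustrative; the option name, helper, and defaults shown here are not an existing lmdeploy API):

# Hypothetical CLI usage:
#   lmdeploy lite blocked_fp8 ${model_path} --skip-pattern skip_patterns.py
DEFAULT_SKIP_PATTERNS = ['lm_head', 'embed_tokens', 'mlp.gate']

def resolve_skip_patterns(user_patterns=None):
    """Merge user-supplied skip patterns with lmdeploy's internal defaults."""
    patterns = list(DEFAULT_SKIP_PATTERNS)
    for pattern in user_patterns or []:
        if pattern not in patterns:
            patterns.append(pattern)
    return patterns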

Collaborator

Personally, if we only need to pass skip patterns, a config file is not necessary.

"""
tensor: torch.Tensor
scale: torch.Tensor
weight_scale_inv: torch.Tensor
Collaborator

Changing scale to weight_scale_inv might affect w8a8 quantized model inference.
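
One way to keep both paths working, sketched here purely as an illustration (the function and key names are hypothetical, not lmdeploy internals), is to accept either key when loading and map it to a single attribute:

import torch

def load_weight_scale(state_dict: dict, prefix: str) -> torch.Tensor:
    """Return the weight scale for a module, accepting either the legacy
    'scale' key (w8a8) or the new 'weight_scale_inv' key (blocked FP8)."""
    for key in (f'{prefix}.weight_scale_inv', f'{prefix}.scale'):
        if key in state_dict:
            return state_dict[key]
    raise KeyError(f'No weight scale found for {prefix}')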

Collaborator

@RunningLeon @grimoire any good ideas?
