Checklist
2. Please use English, otherwise it will be closed.
Motivation
We have been using SmoothQuant (int8, W8A8) quantization on A100 GPUs with TensorRT-LLM and recently tested it with vLLM as well. The performance is good: speed, memory usage, and accuracy are all better than fp16 or other quantization methods.
Could SGLang also support this kind of quantization on A100 machines? My team is very eager to see it.
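For reference, here is a minimal sketch of the SmoothQuant idea we are referring to (not SGLang/vLLM/TensorRT-LLM code; the alpha value, tensor shapes, and per-tensor int8 scales are illustrative assumptions): activation outliers are migrated into the weights with a per-channel scale, after which both activations and weights quantize to int8 with little accuracy loss.

```python
import torch

def smooth_and_quantize(x, w, alpha=0.5):
    # x: activations [tokens, in_features], w: weights [out_features, in_features]
    act_max = x.abs().amax(dim=0).clamp(min=1e-5)      # per-input-channel activation range
    wgt_max = w.abs().amax(dim=0).clamp(min=1e-5)      # per-input-channel weight range
    s = act_max.pow(alpha) / wgt_max.pow(1.0 - alpha)   # SmoothQuant migration scale
    x_s, w_s = x / s, w * s                             # (x/s) @ (s*w)^T == x @ w^T, unchanged

    def to_int8(t):
        scale = t.abs().max() / 127.0                   # simple per-tensor symmetric scale
        return (t / scale).round().clamp(-128, 127).to(torch.int8), scale

    (xq, sx), (wq, sw) = to_int8(x_s), to_int8(w_s)
    # real W8A8 kernels do the matmul in int8; emulated in float here for clarity
    return (xq.float() @ wq.float().t()) * (sx * sw)

x, w = torch.randn(4, 8), torch.randn(16, 8)
print((smooth_and_quantize(x, w) - x @ w.t()).abs().max())  # small quantization error
```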
Thanks
Related resources
No response
Thank you so much for your reply. Is this W8A8 feature going to be SmoothQuant? If so, when do you expect it to be available? @zhyncs @HandH1998 @ispobock
AWQ and GPTQ are W4A16. @HandH1998 and @ispobock are collaborating on W8A8 support.