
Does Marlin support zero-point quantization? #5

Open
casper-hansen opened this issue Jan 20, 2024 · 7 comments

Comments

@casper-hansen

Dear creators of Marlin,

What a huge performance boost these kernels can bring! I’m super excited about this, as the open-source community has been lacking kernels that scale.

To my question: does Marlin support zero-point quantization like we normally get from AutoGPTQ or AutoAWQ?

Best wishes
Casper

@RonanKMcGovern

+1 - it would be great to have Marlin speed with AWQ perplexity

@fergusbarratt

fergusbarratt commented Feb 8, 2024

+1 - Marlin's great, it would be amazing to have AWQ support

@dalistarh

I am a bit confused by this issue. Have you compared the PPL of Marlin models to AWQ?
Looking at the AWQ paper, I see a Wiki2 PPL of 5.60 for LLaMA2-7B AWQ g128.
The LLaMA2-7B GPTQ model released by Elias, which uses roughly the same parameters, has PPL 5.27 (for a base PPL of 5.12).

@casper-hansen
Author

Marlin used a different method for measuring perplexity, so unfortunately the two can’t be compared directly.

@dalistarh

dalistarh commented Feb 8, 2024

Well, my point is that the above post seems to be assuming that the AWQ PPL is better than that of the GPTQ version used by Marlin. This might not be the case.

@efrantar
Member

efrantar commented Feb 8, 2024

Hi,

In general, my experience is that when GPTQ is tuned and configured properly (e.g., when it also uses grid-clipping), results are extremely similar to AWQ. That being said, Marlin is a general fp16xint4 matmul kernel; at the moment it supports symmetric linear quantization, either column-wise or at groupsize 128, with fp16 scales. It does not matter how the quantized weights are produced: they could come from GPTQ, AWQ, ZeroQuant or any other quantization method, as long as they follow Marlin's format. I think fixing the zero-point to 8 should cause AWQ to produce Marlin-compatible weights?
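For illustration, here is a minimal PyTorch sketch of the symmetric scheme described above: 4-bit values stored as unsigned integers with an implicit zero-point of 8 and fp16 scales, per group of 128 or column-wise. The function names and round-to-nearest choice are illustrative only, not Marlin's actual packing code.

```python
import torch

def quantize_symmetric_int4(w: torch.Tensor, groupsize: int = 128):
    """Symmetric 4-bit quantization with fp16 scales (a sketch, not Marlin's packed layout).

    w has shape (in_features, out_features); in_features must be divisible by
    groupsize. Pass groupsize = in_features for column-wise (per-output) scales.
    """
    in_features, out_features = w.shape
    g = w.float().reshape(in_features // groupsize, groupsize, out_features)
    # Symmetric: choose the scale so the largest magnitude in each group maps to +/-7.
    scale = (g.abs().amax(dim=1, keepdim=True) / 7.0).to(torch.float16)
    q = torch.clamp(torch.round(g / scale.float()), -8, 7)
    # Store as unsigned 4-bit values with an implicit zero-point of 8 -- the same
    # values an asymmetric quantizer would produce if its zero-point were fixed to 8.
    q_u4 = (q + 8).to(torch.uint8)
    return q_u4.reshape(in_features, out_features), scale.squeeze(1)

def dequantize_symmetric_int4(q_u4, scale, groupsize: int = 128):
    in_features, out_features = q_u4.shape
    g = q_u4.reshape(in_features // groupsize, groupsize, out_features).float()
    return ((g - 8) * scale.unsqueeze(1).float()).reshape(in_features, out_features)

w = torch.randn(4096, 4096, dtype=torch.float16)
q, s = quantize_symmetric_int4(w)
err = (dequantize_symmetric_int4(q, s) - w.float()).abs().mean().item()
print(f"mean abs reconstruction error: {err:.4f}")
```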

Currently, Marlin does not support zero points. With moderately sized groups and grid-clipping (as used by AWQ or our improved GPTQ implementation), the difference between symmetric and asymmetric quantization seemed very small in my tests, maybe <= 0.01 PPL. Zero points stored in fp16 should not be too hard to support, but are probably not worth it from an accuracy standpoint (one could use a smaller symmetric groupsize instead). Quantized zero points may bring marginal gains in some cases, but are likely a bit tricky to support without any efficiency drop (the current version already requires quite some care to avoid unfavorable instruction ordering by the compiler in the main loop when using groups).
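To get a rough feel for the symmetric-vs-asymmetric gap at a given groupsize, a toy round-to-nearest comparison might look like the sketch below (reconstruction error rather than perplexity, and without grid-clipping, so only indicative; the helper name is hypothetical):

```python
import torch

def quant_error_4bit(w: torch.Tensor, groupsize: int, asymmetric: bool) -> float:
    """Mean abs reconstruction error of 4-bit group quantization (toy round-to-nearest)."""
    rows, cols = w.shape
    g = w.float().reshape(rows // groupsize, groupsize, cols)
    if asymmetric:
        lo = g.amin(dim=1, keepdim=True)
        hi = g.amax(dim=1, keepdim=True)
        scale = (hi - lo).clamp(min=1e-8) / 15.0
        zero = torch.round(-lo / scale)                    # per-group zero-point
        q = torch.clamp(torch.round(g / scale) + zero, 0, 15)
        deq = (q - zero) * scale
    else:
        scale = g.abs().amax(dim=1, keepdim=True) / 7.0
        q = torch.clamp(torch.round(g / scale), -8, 7)
        deq = q * scale
    return (deq - g).abs().mean().item()

w = torch.randn(4096, 4096)
for gs in (128, 64):
    sym = quant_error_4bit(w, gs, asymmetric=False)
    asym = quant_error_4bit(w, gs, asymmetric=True)
    print(f"groupsize {gs:3d}: symmetric {sym:.5f}   asymmetric {asym:.5f}")
```

Halving the groupsize in the symmetric case is the "smaller symmetric groupsize instead" trade-off mentioned above.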

@mobicham

mobicham commented Apr 4, 2024

Thanks for the amazing work, @efrantar!

Regarding the zero-point, it is actually very important to have one, especially at low bit-widths. In fact, the zero-point is more important than the scaling; that is why methods like HQQ optimize for the zero-point.

To give you some perspective on why the zero-point is important, I ran two experiments with HQQ+ on wikitext (Llama2-7B, 2-bit quantization, context-size 1024):

  • Use a group-size of -1 for the scaling and a group-size of 8 for the zero-point: ppl 6.272
  • Use a group-size of 8 for the scaling and a group-size of -1 for the zero-point: ppl 216.0079

If the group-size for the scaling and the zero-point is the same, it shouldn't be too difficult to add, I think.
Really looking forward to a version that supports lower group-sizes as well!
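
To make the two configurations above concrete, here is a toy sketch of 2-bit asymmetric quantization with independent group-sizes for the scale and the zero-point. It uses plain round-to-nearest rather than HQQ's actual optimization, and it reports reconstruction error rather than the PPL numbers quoted above, so it is purely illustrative; the helper names are hypothetical.

```python
import torch

def expand_per_group(stat, x, gs):
    """Compute a per-group statistic along dim 0 and broadcast it back to x's shape."""
    rows, cols = x.shape
    gs = rows if gs == -1 else gs
    g = x.reshape(rows // gs, gs, cols)
    return stat(g).expand(-1, gs, -1).reshape(rows, cols)

def asym_quant_error(w, gs_scale, gs_zero, bits=2):
    """2-bit asymmetric round-to-nearest with independent group-sizes for scale and zero-point.

    A group-size of -1 means one value per output column over all rows (no grouping),
    mirroring the configurations discussed above. Plain RTN, no HQQ-style optimization.
    """
    qmax = 2 ** bits - 1
    w = w.float()
    # Scale from the min/max range at its own granularity.
    w_min = expand_per_group(lambda g: g.amin(dim=1, keepdim=True), w, gs_scale)
    w_max = expand_per_group(lambda g: g.amax(dim=1, keepdim=True), w, gs_scale)
    scale = ((w_max - w_min) / qmax).clamp(min=1e-8)
    # Zero-point at its own granularity: shift each zero-group so its minimum of
    # w / scale maps to quantization level 0.
    zero = -expand_per_group(lambda g: g.amin(dim=1, keepdim=True), w / scale, gs_zero)
    q = torch.clamp(torch.round(w / scale + zero), 0, qmax)
    deq = (q - zero) * scale
    return (deq - w).abs().mean().item()

w = torch.randn(4096, 4096)
print("scale group -1, zero group  8:", asym_quant_error(w, gs_scale=-1, gs_zero=8))
print("scale group  8, zero group -1:", asym_quant_error(w, gs_scale=8, gs_zero=-1))
```

The second configuration clamps many values against the coarse zero-point, which is the same effect that drives the much worse perplexity reported above.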
