
Does Marlin support zero-point quantization? #5

Open
casper-hansen opened this issue Jan 20, 2024 · 7 comments

Comments

@casper-hansen

Dear creators of Marlin,

What a huge performance boost these kernels can bring! I’m super excited about this, as the open-source community has been lacking kernels that scale.

To my question: does Marlin support zero-point quantization like we normally get from AutoGPTQ or AutoAWQ?

Best wishes
Casper

@RonanKMcGovern

+1 - it would be great to have Marlin speed with AWQ perplexity

@fergusbarratt

fergusbarratt commented Feb 8, 2024

+1 - Marlin's great, it would be amazing to have AWQ support

@dalistarh

I am a bit confused by this issue. Have you compared the PPL of Marlin models to AWQ?
Looking at the AWQ paper, I see a Wiki2 PPL of 5.60 for LLaMA2-7B AWQ g128.
The LLaMA2-7B GPTQ model released by Elias, which uses roughly the same parameters, has PPL 5.27 (for a base PPL of 5.12).

@casper-hansen
Author

Marlin used a different method for measuring perplexity, so unfortunately the two can’t be compared directly.

@dalistarh

dalistarh commented Feb 8, 2024

Well, my point is that the above post seems to be assuming that the AWQ PPL is better than that of the GPTQ version used by Marlin. This might not be the case.

@efrantar
Member

efrantar commented Feb 8, 2024

Hi,

In general, my experience is that when GPTQ is tuned and configured properly (e.g., when it also uses grid-clipping), results are extremely similar to AWQ. That being said, Marlin is a general fp16xint4 matmul kernel; at the moment it supports symmetric linear quantization, either column-wise or at groupsize 128, with fp16 scales. It does not matter how the quantized weights are produced: they could come from GPTQ, AWQ, ZeroQuant or any other quantization method, as long as they follow Marlin's format. I think fixing the zero-point to 8 should cause AWQ to produce Marlin-compatible weights?
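For illustration, here is a minimal PyTorch sketch of the symmetric scheme described above: 4-bit values stored as unsigned integers with an implicit zero-point of 8 and fp16 scales, per group of 128 or column-wise. The function names and round-to-nearest choice are illustrative only, not Marlin's actual packing code.

```python
import torch

def quantize_symmetric_int4(w: torch.Tensor, groupsize: int = 128):
    """Symmetric 4-bit quantization with fp16 scales (a sketch, not Marlin's packed layout).

    w has shape (in_features, out_features); in_features must be divisible by
    groupsize. Pass groupsize = in_features for column-wise (per-output) scales.
    """
    in_features, out_features = w.shape
    g = w.float().reshape(in_features // groupsize, groupsize, out_features)
    # Symmetric: choose the scale so the largest magnitude in each group maps to +/-7.
    scale = (g.abs().amax(dim=1, keepdim=True) / 7.0).to(torch.float16)
    q = torch.clamp(torch.round(g / scale.float()), -8, 7)
    # Store as unsigned 4-bit values with an implicit zero-point of 8 -- the same
    # values an asymmetric quantizer would produce if its zero-point were fixed to 8.
    q_u4 = (q + 8).to(torch.uint8)
    return q_u4.reshape(in_features, out_features), scale.squeeze(1)

def dequantize_symmetric_int4(q_u4, scale, groupsize: int = 128):
    in_features, out_features = q_u4.shape
    g = q_u4.reshape(in_features // groupsize, groupsize, out_features).float()
    return ((g - 8) * scale.unsqueeze(1).float()).reshape(in_features, out_features)

w = torch.randn(4096, 4096, dtype=torch.float16)
q, s = quantize_symmetric_int4(w)
err = (dequantize_symmetric_int4(q, s) - w.float()).abs().mean().item()
print(f"mean abs reconstruction error: {err:.4f}")
```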

Currently, Marlin does not support zero points. With moderately sized groups and grid-clipping (as used by AWQ or our improved GPTQ implementation), the difference between symmetric and asymmetric quantization seemed very small in my tests, maybe <= 0.01 PPL. Zero points stored in fp16 should not be too hard to support, but are probably not worth it from an accuracy standpoint (one could use a smaller symmetric groupsize instead). Quantized zero points may bring marginal gains in some cases, but are likely a bit tricky to support without any efficiency drop (the current version already requires quite some care to avoid unfavorable instruction ordering by the compiler in the main loop when using groups).
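To get a rough feel for the symmetric-vs-asymmetric gap at a given groupsize, a toy round-to-nearest comparison might look like the sketch below (reconstruction error rather than perplexity, and without grid-clipping, so only indicative; the helper name is hypothetical):

```python
import torch

def quant_error_4bit(w: torch.Tensor, groupsize: int, asymmetric: bool) -> float:
    """Mean abs reconstruction error of 4-bit group quantization (toy round-to-nearest)."""
    rows, cols = w.shape
    g = w.float().reshape(rows // groupsize, groupsize, cols)
    if asymmetric:
        lo = g.amin(dim=1, keepdim=True)
        hi = g.amax(dim=1, keepdim=True)
        scale = (hi - lo).clamp(min=1e-8) / 15.0
        zero = torch.round(-lo / scale)                    # per-group zero-point
        q = torch.clamp(torch.round(g / scale) + zero, 0, 15)
        deq = (q - zero) * scale
    else:
        scale = g.abs().amax(dim=1, keepdim=True) / 7.0
        q = torch.clamp(torch.round(g / scale), -8, 7)
        deq = q * scale
    return (deq - g).abs().mean().item()

w = torch.randn(4096, 4096)
for gs in (128, 64):
    sym = quant_error_4bit(w, gs, asymmetric=False)
    asym = quant_error_4bit(w, gs, asymmetric=True)
    print(f"groupsize {gs:3d}: symmetric {sym:.5f}   asymmetric {asym:.5f}")
```

Halving the groupsize in the symmetric case is the "smaller symmetric groupsize instead" trade-off mentioned above.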

@mobicham

mobicham commented Apr 4, 2024

Thanks for the amazing work, @efrantar!

Regarding the zero-point, it is actually very important to have one, especially at low bit-widths. In fact, the zero-point is more important than the scaling; that is why methods like HQQ optimize for the zero-point.

To give you some perspective on why the zero-point is important, I ran two experiments with HQQ+ on wikitext (Llama2-7B, 2-bit quantization, context-size 1024):

  • Use a group-size of -1 for the scaling and a group-size of 8 for the zero-point: ppl 6.272
  • Use a group-size of 8 for the scaling and a group-size of -1 for the zero-point: ppl 216.0079

If the group-size for the scaling and the zero-point is the same, it shouldn't be too difficult to add, I think.
Really looking forward to a version that supports lower group-sizes as well!
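
To make the two configurations above concrete, here is a toy sketch of 2-bit asymmetric quantization with independent group-sizes for the scale and the zero-point. It uses plain round-to-nearest rather than HQQ's actual optimization, and it reports reconstruction error rather than the PPL numbers quoted above, so it is purely illustrative; the helper names are hypothetical.

```python
import torch

def expand_per_group(stat, x, gs):
    """Compute a per-group statistic along dim 0 and broadcast it back to x's shape."""
    rows, cols = x.shape
    gs = rows if gs == -1 else gs
    g = x.reshape(rows // gs, gs, cols)
    return stat(g).expand(-1, gs, -1).reshape(rows, cols)

def asym_quant_error(w, gs_scale, gs_zero, bits=2):
    """2-bit asymmetric round-to-nearest with independent group-sizes for scale and zero-point.

    A group-size of -1 means one value per output column over all rows (no grouping),
    mirroring the configurations discussed above. Plain RTN, no HQQ-style optimization.
    """
    qmax = 2 ** bits - 1
    w = w.float()
    # Scale from the min/max range at its own granularity.
    w_min = expand_per_group(lambda g: g.amin(dim=1, keepdim=True), w, gs_scale)
    w_max = expand_per_group(lambda g: g.amax(dim=1, keepdim=True), w, gs_scale)
    scale = ((w_max - w_min) / qmax).clamp(min=1e-8)
    # Zero-point at its own granularity: shift each zero-group so its minimum of
    # w / scale maps to quantization level 0.
    zero = -expand_per_group(lambda g: g.amin(dim=1, keepdim=True), w / scale, gs_zero)
    q = torch.clamp(torch.round(w / scale + zero), 0, qmax)
    deq = (q - zero) * scale
    return (deq - w).abs().mean().item()

w = torch.randn(4096, 4096)
print("scale group -1, zero group  8:", asym_quant_error(w, gs_scale=-1, gs_zero=8))
print("scale group  8, zero group -1:", asym_quant_error(w, gs_scale=8, gs_zero=-1))
```

The second configuration clamps many values against the coarse zero-point, which is the same effect that drives the much worse perplexity reported above.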
