add QuiP quant support #217

waters222 · 2023-12-07T02:21:28Z

This is draft PR for adding QuiP quant into ExllamaV2.
Original QuiP Repo

Works:

right it it can load pre-quant model into model and generate token.
since origin model does not have tokenizer.model file etc. so it needs to copy HF model file into this.
Adding new cuda ext.

Ppl performance benchmark

using dataset: [wikitext-2-v1_validation_0000.parquet]
(https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-v1/validation)
sample cmd

python test_inference.py -m /media/storage/models/Llama-2-7b-E8P-2Bit -ed /media/storage/wikitext/wikitext-2-v1_validation_0000.parquet

Model	Performance
2Bit
Llama-2-7b-E8P-2Bit	8.7339
Llama2-7b-exl2-2.5bpw	8.0745
Llama-2-13b-E8P-2Bit	7.1207
Llama2-13b-exl2-2.5bpw	7.2741
Llama-2-70b-E8P-2Bit	6.2192
Llama2-70b-exl2-2.5bpw	5.8270
4Bit
Llama-2-7b-HI-4Bit-Packed	6.0748
Llama2-7b-exl2-4.0bpw	6.0300
Llama-2-13b-HI-4Bit-Packed	7.4169
Llama2-13b-exl2-4.0bpw	5.4905

inference example

7B 2bit E8P

Once upon a time, the 2018 version of Windows 10 will be available in preview and then it will get into beta mode. everybody will install windows update preview and they will see the next release of windows 10 and the next feature will come to life for the first time, but this is not yet.
The 365 Days of Windows 10 is a big part of Microsoft’s plan. Microsoft has been working on the future of Windows for over three years, including Windows Server, Office, Xbox Live etc The new version of Windows 10 is called the cloud, so you can access all

 -- Response generated in 6.75 seconds, 128 tokens, 18.98 tokens/second (includes prompt eval.)

13B 2bit E8P

Once upon a time, when I was in the fifth grade and at an age where you can do what you want.
I don’t know why my teacher had to write it so many times that day but there is something about this sentence that makes me feel like we all need someone else beside us. I am not saying I am a bad person who doesn’t have any friends or anything, I just find myself really alone sometimes, despite of having plenty of people around me. So, even though I might be one of those lucky ones with no friends at all (as long as they are not the worst people ever), I can never den

 -- Response generated in 9.53 seconds, 128 tokens, 13.43 tokens/second (includes prompt eval.)

70B 2bit E8P

Once upon a time, the only way to get a book signed was in person. It could be awk anymore!
I think it'holm very special that you can do both of those things simultaneously!
A few people are still using paper and pen for their signatures, but most of them are using digital signatures as well nowadays because they don't need to write anything down anymore!
Most authors use electronic pens these days so that we don't have any problems signing our books or getting them back from customers when needed (and trust me- there will always be someone who wants us). But if an author does want his

-- Response generated in 18.82 seconds, 128 tokens, 6.80 tokens/second (includes prompt eval.)

CyberTimon · 2023-12-07T18:59:35Z

Very cool. Thank you for making this, will be awesome when it's done 👍

KnutJaegersberg · 2024-01-26T10:23:27Z

Would like to have this feature :)

tau0-deltav · 2024-02-10T07:17:10Z

It might be worth nothing that most of the what QuIP# appears to be achieving seems to be acheived better by llama.cpp's new quantization schema. the dedicated 2 bit quants (IQ2_XS and XXS, 2.03 and 2.3 BPW - as opposed to being more like 3 bits) are very strong.*

There's now an optimisation scheme involved but it's clearly not a 'generate FP64 hessians for a week (no really that's what they suggested - for 6k ctx) on your grandma's Threadripper X cluster' - it's much more like the 'discover the most important weights by throwing words at the decoder' - schema we're familiar with here.

ikawrakow here is well worth a read. I'll link this technical discussion I found instead of the unsightly spat with one of the Cornell team.

I'm long past a claim to being any sort of computer scientist but I'd like to hope EXL2's inheritance from GPTQ (quantize-then-calibrate as GPTQ's design goal, flexibility in weight assignments added by EXL2 from that) could make EXL2 itself a better use for the methods used here than QuIP#. Those imatrix files are bigger than exl2's measurements but 25mb isn't exactly out of reach here, compared to 2mb?

(...Could these just convert directly? Probably not without the E8 enumeration support but I do wonder what exactly is in a GGUF that isn't in a EXL2 or vice-versa.)

*IQ3_XXS is Absolutely Robust it's scary. It's very new. It's making me plug in a 3070 Ti that I ought to sell.

waters222 added 3 commits December 6, 2023 21:12

move on to testing

6f8be52

done

b622674

stuck on generation text

9ab22f9

waters222 changed the title ~~[Draft] Trying to add QuiP quant into inference~~ add QuiP quant into inference Dec 7, 2023

waters222 changed the title ~~add QuiP quant into inference~~ add QuiP quant support Dec 7, 2023

fix some bugs and revert version

0e5f322

waters222 mentioned this pull request Dec 7, 2023

Low Ppl benchmark results Cornell-RelaxML/quip-sharp#9

Closed

waters222 added 2 commits December 7, 2023 05:57

linting

0f7e970

linting:

0330335

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add QuiP quant support #217

add QuiP quant support #217

waters222 commented Dec 7, 2023 •

edited

Loading

CyberTimon commented Dec 7, 2023

KnutJaegersberg commented Jan 26, 2024

tau0-deltav commented Feb 10, 2024 •

edited

Loading

add QuiP quant support #217

Are you sure you want to change the base?

add QuiP quant support #217

Conversation

waters222 commented Dec 7, 2023 • edited Loading

Ppl performance benchmark

inference example

7B 2bit E8P

13B 2bit E8P

70B 2bit E8P

CyberTimon commented Dec 7, 2023

KnutJaegersberg commented Jan 26, 2024

tau0-deltav commented Feb 10, 2024 • edited Loading

waters222 commented Dec 7, 2023 •

edited

Loading

tau0-deltav commented Feb 10, 2024 •

edited

Loading