Quantize: specify each major tensor quant in CLI for common LLMs #8917
Draft: Nexesenex wants to merge 10 commits into ggerganov:master from Nexesenex:lcpp_pr_specific_quants
Commits on Aug 7, 2024
- Quantize: specify each major tensor quant in CLI for common LLMs (4a95bd5)

This PR simply replicates the per-tensor custom quantization CLI feature brought by Ikawrakow for the token embeddings and output tensors in ggerganov#6239 to:
  - attn_q.weight
  - attn_k.weight
  - attn_v.weight
  - attn_qkv.weight
  - attn_output.weight
  - ffn_gate
  - ffn_down
  - ffn_up

This allows LlamaCPP users to easily tailor their chosen quant strategy to their needs, but ALSO lets GPU users easily requantize a quant that is "a bit too big" for their VRAM. For example, a nice Miqu 70b Q5_K_M (for which no FP16 weights are available beyond dequants of the Q5_K_M) falls just short of the VRAM of a pair of 3090s. And one is French, like me, so Miqu is one of one's main local models. Requanting the Q5_K_M into... Q5_K_M, BUT with all the ffn_down and attn_v.weight tensors specified as Q5_K, and attn_q.weight specified as Q4_K_M, might save you approximately 1.5 GB without degrading the quality too much. That means 1.3-1.4 GB of additional context (yummy with FA and KV cache) and, let's say, 100-200 MB of additional compute cache with a reasonable BLAS batch size in MMQ.

But also: the unspecified tensors won't be requantized, because LlamaCPP just copies a tensor rather than requantizing it when the quant specified for that tensor in the chosen strategy is the same as the source. So one can enjoy the original Miqu quant of those tensors rather than a dequant/requant. And that's just one example; I think many LCPP users could enjoy this feature for their own needs.

It remains quite basic, though: this PR doesn't support hybrid quantization of a tensor (for example, a fraction of the layers in the upper quant, from layer 0 onwards, or the "more_bits" calculation devised by Ikawrakow to create intervals of different quants, e.g. 1 layer out of every 3 quantized with the superior quant).

CLI example, for a full q4_0 quant equivalent to a pure quant but specified tensor by tensor:

`llama-quantize --allow-requantize --imatrix Q:\iMatrix\Sheared\princeton-nlp_Sheared-LLaMA-2.7B-AR-b1924-Q8_0.iMatrix_Wiki_c32_ch500.dat --output-tensor-type q4_0 --token-embedding-type q4_0 --attn-q-type q4_0 --attn-k-type q4_0 --attn-v-type q4_0 --attn-output-type q4_0 --ffn-gate-type q4_0 --ffn-down-type q4_0 --ffn-up-type q4_0 D:\text-generation-webui\models\Q8_0\princeton-nlp_Sheared-LLaMA-2.7B-AR-b1924-Q8_0.gguf D:\text-generation-webui\models\princeton-nlp_Sheared-LLaMA-2.7B-AR-b228N.iMatrix_Wiki_c32_ch500-Q5_K_M.gguf Q5_K_M`
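A possible sketch of the Miqu requant described above (file names and imatrix path are hypothetical, and it is assumed the new per-tensor flags accept the same ggml type names, such as q5_K, as the existing --token-embedding-type flag):

`llama-quantize --allow-requantize --imatrix miqu-1-70b.imatrix.dat --attn-q-type q4_K --attn-v-type q5_K --ffn-down-type q5_K miqu-1-70b-Q5_K_M.gguf miqu-1-70b-Q5_K_M-trimmed.gguf Q5_K_M`

With the target ftype equal to the source (Q5_K_M), the unspecified tensors resolve to the same type and, per the copy behavior described above, are copied rather than requantized; only attn_q.weight, attn_v.weight, and ffn_down would be rewritten.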
- 28a41e7
- 867e352
- 259c5f3
- 60d11d0
- fc4ed23
Commits on Aug 9, 2024
- bd575f0
- Create a Custom Quantization Scheme (CQS) FTYPE (23198ce)
  And integrate it in the tensors quantization tree. (A usage sketch follows the commit list below.)
- f547b52
Commits on Oct 10, 2024
- ed78de2
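As a usage sketch for the Custom Quantization Scheme (CQS) FTYPE from 23198ce (hypothetical model paths, and assuming CQS is passed as the target ftype name in the usual positional slot, with the per-tensor flags supplying the actual types):

`llama-quantize --imatrix model.imatrix.dat --token-embedding-type q4_K --output-tensor-type q6_K --attn-v-type q5_K --ffn-down-type q5_K model-F16.gguf model-CQS.gguf CQS`

As the commit message suggests, a dedicated ftype gives the CLI overrides their own entry in the tensor quantization tree, rather than piggybacking on an existing mix such as Q5_K_M.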