
Quantize: specify each major tensor quant in CLI for common LLMs #8917

Draft
wants to merge 10 commits into base: master

Commits on Aug 7, 2024

  1. Quantize: specify each major tensor quant in CLI for common LLMs

    This PR simply replicates the per-tensor custom quantization CLI feature brought by Ikawrakow for the token embeddings and output tensors in ggerganov#6239, extending it to the following tensors (the corresponding flags are sketched just after this list):
    - attn_q.weight
    - attn_k.weight
    - attn_v.weight
    - attn_qkv.weight
    - attn_output.weight
    - ffn_gate
    - ffn_down
    - ffn_up
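
    For reference, the new flags mirror the naming of the existing `--token-embedding-type` and `--output-tensor-type` options, as used in the example command further below; the `attn_qkv` flag name does not appear in that example and is my assumption by analogy:

    ```
    --attn-q-type       # attn_q.weight
    --attn-k-type       # attn_k.weight
    --attn-v-type       # attn_v.weight
    --attn-qkv-type     # attn_qkv.weight (assumed name, not shown in the example below)
    --attn-output-type  # attn_output.weight
    --ffn-gate-type     # ffn_gate
    --ffn-down-type     # ffn_down
    --ffn-up-type       # ffn_up
    ```

    Each flag takes a ggml quant type name as its value (e.g. `--attn-v-type q5_K`), just like the two pre-existing options.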
    
    This is meant to let LlamaCPP users easily tailor their chosen quant strategy to their needs, but ALSO to let GPU users easily requantize a quant that is "a bit too big" for their VRAM.
    
    For example, a nice Miqu 70b Q5_K_M (for which no FP16 weights are available beyond dequants of the Q5_K_M) falls just short of fitting in the VRAM of a pair of 3090s.
    And when one is French, like me, Miqu is one of one's main local models.
    
    Requanting the Q5_K_M into... Q5_K_M, BUT with all the ffn_down and attn_v.weight tensors specified in Q5_K, and the attn_q.weight specified in Q4_K (as it would be in a Q4_K_M quant), might save you approximately 1.5GB without degrading the quality too much (see the sketch just below).
    That means 1.3-1.4GB of additional context (yummy with FA and KV cache) and, say, 100-200MB of additional compute buffer with a reasonable BLAS batch size in MMQ.
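
    As a sketch, such a requant invocation could look like the following (file names are placeholders and the type spellings follow ggml's type names; this is illustrative, not taken from the PR itself):

    ```
    llama-quantize --allow-requantize \
      --attn-v-type q5_K --ffn-down-type q5_K --attn-q-type q4_K \
      miqu-1-70b.Q5_K_M.gguf miqu-1-70b.requant.Q5_K_M.gguf Q5_K_M
    ```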
    
    But also: the unspecified tensors won't be requantized, because LlamaCPP simply copies a tensor rather than requantizing it when the type picked by the chosen strategy is the same as the source type.
    So one can enjoy the original Miqu quant of those tensors rather than a dequant/requant.
    
    And that's just an example.
    
    I think that many LCPP users could enjoy this feature for their own needs.
    
    This, even if it remains quite basic:
    this PR doesn't support hybrid quantization of a tensor (for example, quantizing a fraction of the layers (from layer 0 onwards) with the upper quant, or the "more_bits" calculus devised by Ikawrakow to create intervals of different quants, e.g. one layer out of every three quantized with the superior quant).
    
    CLI example: `llama-quantize --allow-requantize --imatrix Q:\iMatrix\Sheared\princeton-nlp_Sheared-LLaMA-2.7B-AR-b1924-Q8_0.iMatrix_Wiki_c32_ch500.dat --output-tensor-type q4_0 --token-embedding-type q4_0 --attn-q-type q4_0 --attn-k-type q4_0 --attn-v-type q4_0 --attn-output-type q4_0 --ffn-gate-type q4_0 --ffn-down-type q4_0 --ffn-up-type q4_0 D:\text-generation-webui\models\Q8_0\princeton-nlp_Sheared-LLaMA-2.7B-AR-b1924-Q8_0.gguf D:\text-generation-webui\models\princeton-nlp_Sheared-LLaMA-2.7B-AR-b228N.iMatrix_Wiki_c32_ch500-Q5_K_M.gguf Q5_K_M`, which produces a full q4_0 quant equivalent to a pure q4_0 quant, but specified tensor by tensor.
    Nexesenex committed Aug 7, 2024
    4a95bd5
  2. 28a41e7
  3. trailing whitespace

    Nexesenex authored Aug 7, 2024
    867e352
  4. 259c5f3
  5. trailing whitespaces

    Nexesenex authored Aug 7, 2024
    60d11d0
  6. fc4ed23

Commits on Aug 9, 2024

  1. bd575f0
  2. Create a Custom Quantization Scheme (CQS) FTYPE

    And integrate it into the tensor quantization tree.
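
    Going by this commit message, the CQS FTYPE is presumably selected like any other target ftype on the command line, with the per-tensor flags supplying the actual types; the exact spelling `CQS` and this invocation are my assumption, not confirmed by this listing:

    ```
    llama-quantize --attn-v-type q5_K --ffn-down-type q4_K \
      model-f16.gguf model-CQS.gguf CQS
    ```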
    Nexesenex committed Aug 9, 2024
    23198ce
  3. Fix little mistakes

    Nexesenex committed Aug 9, 2024
    f547b52

Commits on Oct 10, 2024

  1. ed78de2