
Quantize: specify each major tensor quant in CLI for common LLMs #8917

Draft
wants to merge 10 commits into base: master

Commits on Aug 7, 2024

  1. Quantize: specify each major tensor quant in CLI for common LLMs

    This PR simply replicates the per-tensor custom quantization CLI feature brought by Ikawrakow for the token embeddings and output tensors in ggerganov#6239, extending it to the following tensors (the corresponding flags are sketched just after this list):
    - attn_q.weight
    - attn_k.weight
    - attn_v.weight
    - attn_qkv.weight
    - attn_output.weight
    - ffn_gate
    - ffn_down
    - ffn_up
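
    For reference, the new flags mirror the naming of the existing `--token-embedding-type` and `--output-tensor-type` options, as used in the example command further below; the `attn_qkv` flag name does not appear in that example and is my assumption by analogy:

    ```
    --attn-q-type       # attn_q.weight
    --attn-k-type       # attn_k.weight
    --attn-v-type       # attn_v.weight
    --attn-qkv-type     # attn_qkv.weight (assumed name, not shown in the example below)
    --attn-output-type  # attn_output.weight
    --ffn-gate-type     # ffn_gate
    --ffn-down-type     # ffn_down
    --ffn-up-type       # ffn_up
    ```

    Each flag takes a ggml quant type name as its value (e.g. `--attn-v-type q5_K`), just like the two pre-existing options.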
    
    This is meant to let LlamaCPP users easily tailor their chosen quant strategy to their needs, but ALSO to let GPU users easily requantize a quant that is "a bit too big" for their VRAM.
    
    For example, a nice Miqu 70b Q5_K_M (for which no FP16 weights are available beyond dequants of the Q5_K_M) falls just short of fitting in the VRAM of a pair of 3090s.
    And when one is French, like me, Miqu is one of one's main local models.
    
    Requanting the Q5_K_M into... Q5_K_M, BUT with all the ffn_down and attn_v.weight tensors specified in Q5_K, and the attn_q.weight specified in Q4_K (as it would be in a Q4_K_M quant), might save you approximately 1.5GB without degrading the quality too much (see the sketch just below).
    That means 1.3-1.4GB of additional context (yummy with FA and KV cache) and, say, 100-200MB of additional compute buffer with a reasonable BLAS batch size in MMQ.
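
    As a sketch, such a requant invocation could look like the following (file names are placeholders and the type spellings follow ggml's type names; this is illustrative, not taken from the PR itself):

    ```
    llama-quantize --allow-requantize \
      --attn-v-type q5_K --ffn-down-type q5_K --attn-q-type q4_K \
      miqu-1-70b.Q5_K_M.gguf miqu-1-70b.requant.Q5_K_M.gguf Q5_K_M
    ```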
    
    But also: the unspecified tensors won't be requantized, because LlamaCPP simply copies a tensor rather than requantizing it when the type picked by the chosen strategy is the same as the source type.
    So one can enjoy the original Miqu quant of those tensors rather than a dequant/requant.
    
    And that's just an example.
    
    I think that many LCPP users could enjoy this feature for their own needs.
    
    This, even if it remains quite basic:
    this PR doesn't support hybrid quantization of a tensor (for example, quantizing a fraction of the layers (from layer 0 onwards) with the upper quant, or the "more_bits" calculus devised by Ikawrakow to create intervals of different quants, e.g. one layer out of every three quantized with the superior quant).
    
    CLI example: `llama-quantize --allow-requantize --imatrix Q:\iMatrix\Sheared\princeton-nlp_Sheared-LLaMA-2.7B-AR-b1924-Q8_0.iMatrix_Wiki_c32_ch500.dat --output-tensor-type q4_0 --token-embedding-type q4_0 --attn-q-type q4_0 --attn-k-type q4_0 --attn-v-type q4_0 --attn-output-type q4_0 --ffn-gate-type q4_0 --ffn-down-type q4_0 --ffn-up-type q4_0 D:\text-generation-webui\models\Q8_0\princeton-nlp_Sheared-LLaMA-2.7B-AR-b1924-Q8_0.gguf D:\text-generation-webui\models\princeton-nlp_Sheared-LLaMA-2.7B-AR-b228N.iMatrix_Wiki_c32_ch500-Q5_K_M.gguf Q5_K_M`, which produces a full q4_0 quant equivalent to a pure q4_0 quant, but specified tensor by tensor.
    Nexesenex committed Aug 7, 2024
    4a95bd5
  2. 28a41e7
  3. trailing whitespace

    Nexesenex authored Aug 7, 2024
    867e352
  4. 259c5f3
  5. trailing whitespaces

    Nexesenex authored Aug 7, 2024
    60d11d0
  6. fc4ed23

Commits on Aug 9, 2024

  1. bd575f0
  2. Create a Custom Quantization Scheme (CQS) FTYPE

    And integrate it into the tensor quantization tree.
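
    Going by this commit message, the CQS FTYPE is presumably selected like any other target ftype on the command line, with the per-tensor flags supplying the actual types; the exact spelling `CQS` and this invocation are my assumption, not confirmed by this listing:

    ```
    llama-quantize --attn-v-type q5_K --ffn-down-type q4_K \
      model-f16.gguf model-CQS.gguf CQS
    ```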
    Nexesenex committed Aug 9, 2024
    23198ce
  3. Fix little mistakes

    Nexesenex committed Aug 9, 2024
    f547b52

Commits on Oct 10, 2024

  1. ed78de2