Showing 2 changed files with 82 additions and 85 deletions.
@@ -1,83 +1,80 @@

LLAMAFILE-QUANTIZE(1)       General Commands Manual      LLAMAFILE-QUANTIZE(1)

NAME
     llamafile-quantize — large language model quantizer

SYNOPSIS
     llamafile-quantize [flags...] model-f32.gguf [model-quant.gguf] type
          [nthreads]

DESCRIPTION
     llamafile-quantize converts large language model weights from the
     float32 or float16 formats into smaller data types from 2 to 8 bits
     in size.

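     For instance, a minimal sketch of an invocation (the file names are
     illustrative, not mandated by this manual):

           # quantize float16/float32 weights to the recommended Q6_K format
           llamafile-quantize model-f32.gguf model-q6k.gguf Q6_K
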
OPTIONS
     The following flags are available:

     --allow-requantize
             Allows requantizing tensors that have already been quantized.
             Warning: this can severely reduce quality compared to
             quantizing from 16-bit or 32-bit weights.

     --leave-output-tensor
             Leaves output.weight un(re)quantized. This increases model
             size, but may also increase quality, especially when
             requantizing.

     --pure  Disables k-quant mixtures and quantizes all tensors to the
             same type.

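     As a sketch of how these flags combine (file names assumed for
     illustration), a requantization run that preserves the output tensor
     might look like:

           llamafile-quantize --allow-requantize --leave-output-tensor \
               model-q8.gguf model-q4km.gguf Q4_K_M
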
[1mARGUMENTS[0m | ||
The following positional arguments are accepted: | ||
|
||
[4mmodel-f32.gguf[0m | ||
Is the input file, which contains the unquantized model weights | ||
in either the float32 or float16 format. | ||
|
||
[4mmodel-quant.gguf[0m | ||
Is the output file, which will contain quantized weights in the | ||
desired format. If this path isn't specified, it'll default to | ||
[inp path]/ggml-model-[ftype].gguf. | ||
|
||
[4mtype[24m Is the desired quantization format, which may be the integer id | ||
of a supported quantization type, or its name. See the quanti‐ | ||
zation types section below for acceptable formats. | ||
|
||
[4mnthreads[0m | ||
Number of threads to use during computation (default: nproc/2) | ||
|
||
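     Putting the arguments together (file names illustrative), omitting
     the output path relies on the documented default, while the trailing
     number sets the thread count explicitly:

           # with no output path, the result lands at
           # [inp path]/ggml-model-[ftype].gguf as described above
           llamafile-quantize model-f32.gguf Q5_K_M 4
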
QUANTIZATION TYPES
     The following quantization types are available. This table shows the
     ID of each quantization format, its name, the file size of 7B model
     weights that use it, and finally the perplexity increase (quality
     loss) it introduces, as measured by the llamafile-perplexity tool
     averaged over 128 chunks with the TinyLLaMA 1.1B v1.0 Chat model.
     Rows are ordered by how recommended the quantization format is for
     general usage.

     -   18  Q6_K    5.6gb  +0.0446 ppl  (q6 kawrakow)
     -    7  Q8_0    7.2gb  +0.0022 ppl  (q8 gerganov)
     -    1  F16      14gb  +0.0000 ppl  (best but biggest)
     -    8  Q5_0    4.7gb  +0.0817 ppl  (q5 gerganov zero)
     -   17  Q5_K_M  4.8gb  +0.0836 ppl  (q5 kawrakow medium)
     -   16  Q5_K_S  4.7gb  +0.1049 ppl  (q5 kawrakow small)
     -   15  Q4_K_M  4.1gb  +0.3132 ppl  (q4 kawrakow medium)
     -   14  Q4_K_S  3.9gb  +0.3408 ppl  (q4 kawrakow small)
     -   13  Q3_K_L  3.6gb  +0.5736 ppl  (q3 kawrakow large)
     -   12  Q3_K_M  3.3gb  +0.7612 ppl  (q3 kawrakow medium)
     -   11  Q3_K_S  3.0gb  +1.3834 ppl  (q3 kawrakow small)
     -   10  Q2_K    2.6gb  +4.2359 ppl  (tiniest hallucinates most)
     -   32  BF16     14gb  +0.0000 ppl  (canonical but cpu/cuda only)
     -    0  F32      27gb   9.0952 ppl  (reference point)
     -    2  Q4_0    3.9gb  +0.3339 ppl  (legacy)
     -    3  Q4_1    4.3gb  +0.4163 ppl  (legacy)
     -    9  Q5_1    5.1gb  +0.1091 ppl  (legacy)
     -   12  Q3_K    alias for Q3_K_M
     -   15  Q4_K    alias for Q4_K_M
     -   17  Q5_K    alias for Q5_K_M
     -  COPY         Only copies tensors; no quantizing.

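     Because the type argument accepts either the integer id or the name
     from the table above, these two hypothetical invocations request the
     same format:

           llamafile-quantize model-f32.gguf model-q6k.gguf 18
           llamafile-quantize model-f32.gguf model-q6k.gguf Q6_K
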
SEE ALSO
     llamafile(1), llamafile-imatrix(1), llamafile-perplexity(1),
     llava-quantize(1), zipalign(1), unzip(1)

Llamafile Manual               December 5, 2023          LLAMAFILE-QUANTIZE(1)

EMBEDFILE(1)                General Commands Manual               EMBEDFILE(1)

NAME
     embedfile - large language model quantizer

SYNOPSIS
     embedfile [flags...] model-f32.gguf [model-quant.gguf] type [nthreads]

DESCRIPTION
     embedfile converts large language model weights from the float32 or
     float16 formats into smaller data types from 2 to 8 bits in size.

OPTIONS
     The following flags are available:

     --allow-requantize
             Allows requantizing tensors that have already been quantized.
             Warning: this can severely reduce quality compared to
             quantizing from 16-bit or 32-bit weights.

     --leave-output-tensor
             Leaves output.weight un(re)quantized. This increases model
             size, but may also increase quality, especially when
             requantizing.

     --pure  Disables k-quant mixtures and quantizes all tensors to the
             same type.

ARGUMENTS
     The following positional arguments are accepted:

     model-f32.gguf
             The input file, which contains the unquantized model weights
             in either the float32 or float16 format.

     model-quant.gguf
             The output file, which will contain the quantized weights in
             the desired format. If this path isn't specified, it defaults
             to [inp path]/ggml-model-[ftype].gguf.

     type    The desired quantization format, which may be the integer id
             of a supported quantization type, or its name. See the
             QUANTIZATION TYPES section below for acceptable formats.

     nthreads
             The number of threads to use during computation (default:
             nproc/2; see the sketch below).

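     A hypothetical way to pass the documented nproc/2 default explicitly
     from a shell (assuming a Linux system where nproc is available):

           embedfile model-f32.gguf model-q5km.gguf Q5_K_M $(($(nproc) / 2))
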
QUANTIZATION TYPES
     The following quantization types are available. This table shows the
     ID of each quantization format, its name, the file size of 7B model
     weights that use it, and finally the perplexity increase (quality
     loss) it introduces, as measured by the llamafile-perplexity tool
     averaged over 128 chunks with the TinyLLaMA 1.1B v1.0 Chat model.
     Rows are ordered by how recommended the quantization format is for
     general usage.

     -   18  Q6_K    5.6gb  +0.0446 ppl  (q6 kawrakow)
     -    7  Q8_0    7.2gb  +0.0022 ppl  (q8 gerganov)
     -    1  F16      14gb  +0.0000 ppl  (best but biggest)
     -    8  Q5_0    4.7gb  +0.0817 ppl  (q5 gerganov zero)
     -   17  Q5_K_M  4.8gb  +0.0836 ppl  (q5 kawrakow medium)
     -   16  Q5_K_S  4.7gb  +0.1049 ppl  (q5 kawrakow small)
     -   15  Q4_K_M  4.1gb  +0.3132 ppl  (q4 kawrakow medium)
     -   14  Q4_K_S  3.9gb  +0.3408 ppl  (q4 kawrakow small)
     -   13  Q3_K_L  3.6gb  +0.5736 ppl  (q3 kawrakow large)
     -   12  Q3_K_M  3.3gb  +0.7612 ppl  (q3 kawrakow medium)
     -   11  Q3_K_S  3.0gb  +1.3834 ppl  (q3 kawrakow small)
     -   10  Q2_K    2.6gb  +4.2359 ppl  (tiniest hallucinates most)
     -   32  BF16     14gb  +0.0000 ppl  (canonical but cpu/cuda only)
     -    0  F32      27gb   9.0952 ppl  (reference point)
     -    2  Q4_0    3.9gb  +0.3339 ppl  (legacy)
     -    3  Q4_1    4.3gb  +0.4163 ppl  (legacy)
     -    9  Q5_1    5.1gb  +0.1091 ppl  (legacy)
     -   12  Q3_K    alias for Q3_K_M
     -   15  Q4_K    alias for Q4_K_M
     -   17  Q5_K    alias for Q5_K_M
     -  COPY         Only copies tensors; no quantizing.

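     As a sketch, the COPY type listed above rewrites the tensors without
     quantizing them (the output name here is an assumption):

           embedfile model-f32.gguf model-copy.gguf COPY
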
SEE ALSO
     llamafile(1), llamafile-imatrix(1), llamafile-perplexity(1),
     llava-quantize(1), zipalign(1), unzip(1)

Llamafile Manual               December 5, 2023                   EMBEDFILE(1)