
Llama-Quantize : Layers quantized in the wrong order, thus damaging the variable bits tensor quants scheme consistency. #9005

Closed
Nexesenex opened this issue Aug 12, 2024 · 1 comment
Labels
bug-unconfirmed, medium severity, stale

Comments

Nexesenex (Contributor) commented Aug 12, 2024

What happened?

On master b3573, when quantizing Gemma 9b-it, the tensors are quantized in the wrong order.

Right now, because of the jump from layer 7 to layer 10 without layer 7's FFN tensors having been quantized in between, not only is the layer quantization order broken, but so is the correlation between ffn_down Q6_K and attn_v Q6_K: from layer 7 onward, some layers end up with ffn_down Q6_K and attn_v Q5_K, and others with ffn_down Q5_K and attn_v Q6_K.
This gives us suboptimal quants for a given BPW.

I expect the tensors to be quantized in the right order.

This way, the Q5_K_M quant, as well as the other schemes using `use_more_bits(i_layer, n_layer)` to pick a variable quant for ffn_down in conjunction with `use_more_bits(qs.i_attention_wv, qs.n_attention_wv)` to pick a variable quant for attn_v.weight, can stay optimal.
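To make the mechanism concrete, here is a minimal, self-contained sketch of why an out-of-order pass breaks that correlation. The `use_more_bits` condition is reproduced from memory and the one-block counter drift is a hypothetical stand-in for the reordering shown in the log below; the authoritative logic lives in `llama_tensor_get_type` in llama.cpp:

```cpp
// Sketch only, not llama.cpp source. It shows why counter-based selection of the
// "extra bits" layers stops lining up between attn_v and ffn_down once the tensors
// are no longer visited in layer order.
#include <cstdio>

static bool use_more_bits(int i_layer, int n_layers) {
    return i_layer < n_layers/8 || i_layer >= 7*n_layers/8 || (i_layer - n_layers/8) % 3 == 2;
}

int main() {
    const int n_layers = 42; // Gemma 9b block count (consistent with the 464 tensors in the log)

    for (int blk = 0; blk < n_layers; ++blk) {
        // In the intended in-order pass, both tensors of block `blk` are evaluated with the
        // same index, so the Q6_K bump lands on the same blocks for attn_v and ffn_down.
        const bool attn_v_q6 = use_more_bits(blk, n_layers);

        // If the ffn_down counter lags one block behind (hypothetical drift starting at
        // block 7, standing in for the out-of-order pass shown in the log), the bump is
        // evaluated with a shifted index and the two selections drift apart.
        const int  i_ffn_down  = blk > 7 ? blk - 1 : blk;
        const bool ffn_down_q6 = use_more_bits(i_ffn_down, n_layers);

        if (attn_v_q6 != ffn_down_q6) {
            std::printf("blk.%d: attn_v -> %s, ffn_down -> %s  <-- mismatch\n",
                        blk, attn_v_q6 ? "q6_K" : "q5_K", ffn_down_q6 ? "q6_K" : "q5_K");
        }
    }
    return 0;
}
```

With the counters in sync the loop prints nothing; with the drift it flags every block where ffn_down and attn_v end up at different bit widths, which is exactly the inconsistency described above.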

Name and Version

main: build = 3573 (2589292)
main: built with MSVC 19.29.30154.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

[  45/ 464]                  blk.3.attn_v.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q6_K .. size =    14.00 MiB ->     5.74 MiB
[  46/ 464]               blk.4.attn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  47/ 464]                blk.4.ffn_down.weight - [14336,  3584,     1,     1], type =   bf16, converting to q6_K .. size =    98.00 MiB ->    40.20 MiB
[  48/ 464]                blk.4.ffn_gate.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  49/ 464]                  blk.4.ffn_up.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  50/ 464]     blk.4.post_attention_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  51/ 464]           blk.4.post_ffw_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  52/ 464]                blk.4.ffn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  53/ 464]                  blk.4.attn_k.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  54/ 464]             blk.4.attn_output.weight - [ 4096,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  55/ 464]                  blk.4.attn_q.weight - [ 3584,  4096,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  56/ 464]                  blk.4.attn_v.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q6_K .. size =    14.00 MiB ->     5.74 MiB
[  57/ 464]               blk.5.attn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  58/ 464]                blk.5.ffn_down.weight - [14336,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  59/ 464]                blk.5.ffn_gate.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  60/ 464]                  blk.5.ffn_up.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  61/ 464]     blk.5.post_attention_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  62/ 464]           blk.5.post_ffw_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  63/ 464]                blk.5.ffn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  64/ 464]                  blk.5.attn_k.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  65/ 464]             blk.5.attn_output.weight - [ 4096,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  66/ 464]                  blk.5.attn_q.weight - [ 3584,  4096,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  67/ 464]                  blk.5.attn_v.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  68/ 464]               blk.6.attn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  69/ 464]                blk.6.ffn_down.weight - [14336,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  70/ 464]                blk.6.ffn_gate.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  71/ 464]                  blk.6.ffn_up.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  72/ 464]     blk.6.post_attention_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  73/ 464]           blk.6.post_ffw_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  74/ 464]                blk.6.ffn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  75/ 464]                  blk.6.attn_k.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  76/ 464]             blk.6.attn_output.weight - [ 4096,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  77/ 464]                  blk.6.attn_q.weight - [ 3584,  4096,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  78/ 464]                  blk.6.attn_v.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  79/ 464]                blk.7.ffn_gate.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  80/ 464]                  blk.7.ffn_up.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  81/ 464]                  blk.7.attn_k.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  82/ 464]             blk.7.attn_output.weight - [ 4096,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  83/ 464]                  blk.7.attn_q.weight - [ 3584,  4096,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  84/ 464]                  blk.7.attn_v.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q6_K .. size =    14.00 MiB ->     5.74 MiB
[  85/ 464]              blk.10.attn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  86/ 464]               blk.10.ffn_down.weight - [14336,  3584,     1,     1], type =   bf16, converting to q6_K .. size =    98.00 MiB ->    40.20 MiB
[  87/ 464]               blk.10.ffn_gate.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  88/ 464]                 blk.10.ffn_up.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  89/ 464]    blk.10.post_attention_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  90/ 464]          blk.10.post_ffw_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  91/ 464]               blk.10.ffn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  92/ 464]                 blk.10.attn_k.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  93/ 464]            blk.10.attn_output.weight - [ 4096,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  94/ 464]                 blk.10.attn_q.weight - [ 3584,  4096,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  95/ 464]                 blk.10.attn_v.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  96/ 464]              blk.11.attn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB

This issue was closed because it has been inactive for 14 days since being marked as stale.
