
Llama-Quantize : Layers quantized in the wrong order, thus damaging the variable bits tensor quants scheme consistency. #9005

Closed
Nexesenex opened this issue Aug 12, 2024 · 1 comment
Labels
bug-unconfirmed, medium severity, stale

Comments

Nexesenex (Contributor) commented Aug 12, 2024

What happened?

On master b3573, when quantizing Gemma 9b-it, the tensors are quantized in the wrong order.

Right now, because of the jump from layer 7 to layer 10 without layer 7's FFN tensors having been quantized in between, not only is the layer quantization order broken, but so is the correlation between ffn_down Q6_K and attn_v Q6_K: from layer 7 onward, some layers end up with ffn_down Q6_K and attn_v Q5_K, and others with ffn_down Q5_K and attn_v Q6_K.
This gives us suboptimal quants for a given BPW.

I expect the tensors to be quantized in the right order.

This way, the Q5_K_M quant, as well as the other schemes using `use_more_bits(i_layer, n_layer)` to pick a variable quant for ffn_down in conjunction with `use_more_bits(qs.i_attention_wv, qs.n_attention_wv)` to pick a variable quant for attn_v.weight, can stay optimal.
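To make the mechanism concrete, here is a minimal, self-contained sketch of why an out-of-order pass breaks that correlation. The `use_more_bits` condition is reproduced from memory and the one-block counter drift is a hypothetical stand-in for the reordering shown in the log below; the authoritative logic lives in `llama_tensor_get_type` in llama.cpp:

```cpp
// Sketch only, not llama.cpp source. It shows why counter-based selection of the
// "extra bits" layers stops lining up between attn_v and ffn_down once the tensors
// are no longer visited in layer order.
#include <cstdio>

static bool use_more_bits(int i_layer, int n_layers) {
    return i_layer < n_layers/8 || i_layer >= 7*n_layers/8 || (i_layer - n_layers/8) % 3 == 2;
}

int main() {
    const int n_layers = 42; // Gemma 9b block count (consistent with the 464 tensors in the log)

    for (int blk = 0; blk < n_layers; ++blk) {
        // In the intended in-order pass, both tensors of block `blk` are evaluated with the
        // same index, so the Q6_K bump lands on the same blocks for attn_v and ffn_down.
        const bool attn_v_q6 = use_more_bits(blk, n_layers);

        // If the ffn_down counter lags one block behind (hypothetical drift starting at
        // block 7, standing in for the out-of-order pass shown in the log), the bump is
        // evaluated with a shifted index and the two selections drift apart.
        const int  i_ffn_down  = blk > 7 ? blk - 1 : blk;
        const bool ffn_down_q6 = use_more_bits(i_ffn_down, n_layers);

        if (attn_v_q6 != ffn_down_q6) {
            std::printf("blk.%d: attn_v -> %s, ffn_down -> %s  <-- mismatch\n",
                        blk, attn_v_q6 ? "q6_K" : "q5_K", ffn_down_q6 ? "q6_K" : "q5_K");
        }
    }
    return 0;
}
```

With the counters in sync the loop prints nothing; with the drift it flags every block where ffn_down and attn_v end up at different bit widths, which is exactly the inconsistency described above.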

Name and Version

main: build = 3573 (2589292)
main: built with MSVC 19.29.30154.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

[  45/ 464]                  blk.3.attn_v.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q6_K .. size =    14.00 MiB ->     5.74 MiB
[  46/ 464]               blk.4.attn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  47/ 464]                blk.4.ffn_down.weight - [14336,  3584,     1,     1], type =   bf16, converting to q6_K .. size =    98.00 MiB ->    40.20 MiB
[  48/ 464]                blk.4.ffn_gate.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  49/ 464]                  blk.4.ffn_up.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  50/ 464]     blk.4.post_attention_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  51/ 464]           blk.4.post_ffw_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  52/ 464]                blk.4.ffn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  53/ 464]                  blk.4.attn_k.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  54/ 464]             blk.4.attn_output.weight - [ 4096,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  55/ 464]                  blk.4.attn_q.weight - [ 3584,  4096,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  56/ 464]                  blk.4.attn_v.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q6_K .. size =    14.00 MiB ->     5.74 MiB
[  57/ 464]               blk.5.attn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  58/ 464]                blk.5.ffn_down.weight - [14336,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  59/ 464]                blk.5.ffn_gate.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  60/ 464]                  blk.5.ffn_up.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  61/ 464]     blk.5.post_attention_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  62/ 464]           blk.5.post_ffw_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  63/ 464]                blk.5.ffn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  64/ 464]                  blk.5.attn_k.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  65/ 464]             blk.5.attn_output.weight - [ 4096,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  66/ 464]                  blk.5.attn_q.weight - [ 3584,  4096,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  67/ 464]                  blk.5.attn_v.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  68/ 464]               blk.6.attn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  69/ 464]                blk.6.ffn_down.weight - [14336,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  70/ 464]                blk.6.ffn_gate.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  71/ 464]                  blk.6.ffn_up.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  72/ 464]     blk.6.post_attention_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  73/ 464]           blk.6.post_ffw_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  74/ 464]                blk.6.ffn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  75/ 464]                  blk.6.attn_k.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  76/ 464]             blk.6.attn_output.weight - [ 4096,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  77/ 464]                  blk.6.attn_q.weight - [ 3584,  4096,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  78/ 464]                  blk.6.attn_v.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  79/ 464]                blk.7.ffn_gate.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  80/ 464]                  blk.7.ffn_up.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  81/ 464]                  blk.7.attn_k.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  82/ 464]             blk.7.attn_output.weight - [ 4096,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  83/ 464]                  blk.7.attn_q.weight - [ 3584,  4096,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  84/ 464]                  blk.7.attn_v.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q6_K .. size =    14.00 MiB ->     5.74 MiB
[  85/ 464]              blk.10.attn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  86/ 464]               blk.10.ffn_down.weight - [14336,  3584,     1,     1], type =   bf16, converting to q6_K .. size =    98.00 MiB ->    40.20 MiB
[  87/ 464]               blk.10.ffn_gate.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  88/ 464]                 blk.10.ffn_up.weight - [ 3584, 14336,     1,     1], type =   bf16, converting to q5_K .. size =    98.00 MiB ->    33.69 MiB
[  89/ 464]    blk.10.post_attention_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  90/ 464]          blk.10.post_ffw_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  91/ 464]               blk.10.ffn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[  92/ 464]                 blk.10.attn_k.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  93/ 464]            blk.10.attn_output.weight - [ 4096,  3584,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  94/ 464]                 blk.10.attn_q.weight - [ 3584,  4096,     1,     1], type =   bf16, converting to q5_K .. size =    28.00 MiB ->     9.62 MiB
[  95/ 464]                 blk.10.attn_v.weight - [ 3584,  2048,     1,     1], type =   bf16, converting to q5_K .. size =    14.00 MiB ->     4.81 MiB
[  96/ 464]              blk.11.attn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB

This issue was closed because it has been inactive for 14 days since being marked as stale.
