[Perf] [CPU] eliminate redundant memory access in group query attention #13319
Conversation
Signed-off-by: ZelinMa557 <[email protected]>
Just tested this out of curiosity: Qwen 3 degrades in quality (ignores …)
Hi, thanks for your reply! It should not break compatibility with older models in theory, but there might be small bugs in my implementation. In my test it works with Qwen 2.5 7B. Can you tell me the Qwen 3 model size you used for testing? I will test both Qwen 3 and Mistral to debug.
I've tested both the 8B and the 4B models in Q6; both worked correctly without this PR. Mistral Small 2 is in Q5_K_L and works correctly on main too.
Thanks, I have reproduced the same problem. I will try to fix it.
Signed-off-by: ZelinMa557 <[email protected]>
I have fixed the bug. Are there any scripts to format the code locally? This PR cannot pass the code lint check right now.
Thank you! I've already deleted the Qwen models, unfortunately, but Mistral Small 2 generates text correctly now. I'll test it a bit more with other models, but so far it seems to be fixed. On an i7-8700 with Mistral Small 3 (the 24B one, q4_k_m) I get 2.08 t/s with this PR vs 1.97 t/s on current main.
Signed-off-by: ZelinMa557 <[email protected]>
Hm, I opened your PR in my editor and noticed some trailing whitespace. Here's a patch to fix the change, if you can't find the line:

From 3c7b2ed48acfcb5a9c06846ed0b548b3e48707af Mon Sep 17 00:00:00 2001
From: Excigma <[email protected]>
Date: Mon, 12 May 2025 15:46:03 +1200
Subject: [PATCH] style: remove trailing whitespace
---
ggml/src/ggml-cpu/ops.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 250b6abc..a1481d9e 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -7055,7 +7055,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(
const float * pq = (const float *) ((char *) q->data + (iq1*nbq1 + (iq2 + i_gqa)*nbq2 + iq3*nbq3));
q_to_vec_dot(pq, Q_q[i_gqa], DK);
-
+
const uint32_t h = iq2 + i_gqa;
slope[i_gqa] = (max_bias > 0.0f) ? h < n_head_log2 ? powf(m0, h + 1) : powf(m1, 2*(h - n_head_log2) + 1) : 1.0f;
}
--
2.49.0
Modern LLMs (Llama 3, Qwen 2.5, etc.) usually use group query attention (GQA), which significantly reduces the memory usage of the KV cache. Group query attention means that the query rows of neighboring query heads share the KV rows of the same KV head, so we can reorder the loop so that the query heads of a group are processed together and each KV row is read once per group instead of once per query head (see the sketch below). This improves the spatial locality of memory access. However, the original implementation of the CPU flash attention kernel did not take this into account, and this PR improves it.
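Roughly, the reordering can be sketched as below. This is an illustrative standalone example, not the actual ggml kernel: the names (Rows, scores_per_head, scores_per_kv_row) and the use of plain float vectors are assumptions for the sketch; the real kernel in ggml/src/ggml-cpu/ops.cpp operates on quantized/F16 rows via q_to_vec_dot.

// Q      : the query rows of one GQA group (the heads that share a KV head)
// K      : the key rows of that KV head, one per cached position
// scores : per-head attention logits over the KV positions
#include <cstddef>
#include <vector>

using Rows = std::vector<std::vector<float>>;

// Before: every query head walks the whole KV cache on its own,
// so each K row is streamed from memory once per query head.
void scores_per_head(const Rows &Q, const Rows &K, Rows &scores) {
    for (size_t h = 0; h < Q.size(); ++h) {          // query heads (outer)
        for (size_t ik = 0; ik < K.size(); ++ik) {   // KV positions (inner)
            float s = 0.0f;
            for (size_t d = 0; d < K[ik].size(); ++d) {
                s += Q[h][d] * K[ik][d];
            }
            scores[h][ik] = s;
        }
    }
}

// After: the KV position is the outer loop, so each K row is read once
// and reused by all query heads of the group while it is still in cache.
void scores_per_kv_row(const Rows &Q, const Rows &K, Rows &scores) {
    for (size_t ik = 0; ik < K.size(); ++ik) {       // KV positions (outer)
        for (size_t h = 0; h < Q.size(); ++h) {      // query heads (inner)
            float s = 0.0f;
            for (size_t d = 0; d < K[ik].size(); ++d) {
                s += Q[h][d] * K[ik][d];
            }
            scores[h][ik] = s;
        }
    }
}

The point of making the KV position the outer loop is that decode is dominated by memory bandwidth, so reading each K row once per group instead of once per query head is presumably where the decode speed-up reported below comes from.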
This is my test command:
The master branch result:
With the optimization, the result is:
We can see a slight speed-up in prefill and a 25% speed-up in decode!
Further work:
My test environment: