
CUDA: fix insufficient buffer clearing for MMQ #10032

Merged

Conversation

JohannesGaessler
Collaborator

Fixes #10011.
Follow-up to #10021.

As of right now a quantized K cache works correctly with an FP16 model but not with a quantized model. The problem is that the buffer clearing I added in #10021 is insufficient: it only cleared the end of the buffer, but because multiple matrices are copied into the buffer at different regions, the padding of each region also needs to be cleared. For an FP16 model this is not an issue because the leftover buffer data is never NaN unless one of the FP16 values is itself NaN. For quantized data, however, the integer data can encode a non-finite value.
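To illustrate the idea (this is a minimal standalone sketch with hypothetical buffer layout, sizes, and variable names, not the actual ggml-cuda code): when several matrices are copied into padded regions of one device buffer, clearing only the tail of the buffer leaves the padding inside each region uninitialized, so the whole buffer (or at least every region's padding) has to be cleared before the copies.

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <cstddef>

int main() {
    // Hypothetical layout: n_mat quantized matrices of mat_bytes each are
    // copied into a buffer whose per-matrix regions are padded to padded_bytes.
    const int    n_mat        = 4;
    const size_t mat_bytes    = 1000; // actual size of each quantized matrix
    const size_t padded_bytes = 1024; // region size after padding/alignment

    int8_t * buf = nullptr;
    cudaMalloc((void **) &buf, n_mat*padded_bytes);

    // Insufficient: clearing only the memory past the last copied matrix
    // leaves the padding between regions as garbage, which interpreted as
    // quantized integer data can decode to non-finite values.
    // cudaMemset(buf + (n_mat - 1)*padded_bytes + mat_bytes, 0, padded_bytes - mat_bytes);

    // Sufficient: clear the entire buffer so the padding of every region is zero.
    cudaMemset(buf, 0, n_mat*padded_bytes);

    for (int i = 0; i < n_mat; ++i) {
        // Copy each quantized matrix into its padded region (host data omitted here):
        // cudaMemcpy(buf + i*padded_bytes, src[i], mat_bytes, cudaMemcpyHostToDevice);
    }

    cudaFree(buf);
    return 0;
}
```

Whether to clear the whole buffer or only each region's padding is a performance trade-off; the sketch takes the simpler whole-buffer route.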

JohannesGaessler added the Review Complexity: Low label (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) on Oct 24, 2024
JohannesGaessler merged commit 167a515 into ggerganov:master on Oct 24, 2024
53 checks passed
Labels
Review Complexity: Low
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: K cache without FA goes NaN on Llama 3.1.
2 participants