
CUDA: fix insufficient buffer clearing for MMQ #10032

Merged

Conversation

JohannesGaessler
Collaborator

Fixes #10011.
Follow-up to #10021.

As of right now a quantized K cache works correctly with an FP16 model but not with a quantized model. The problem is that the buffer clearing I added in #10021 is insufficient: it only cleared the end of the buffer, but because multiple matrices are copied into the buffer at different regions, the padding of each region also needs to be cleared. For an FP16 model this is not an issue because the leftover buffer data is never NaN unless one of the FP16 values is itself NaN. For quantized data, however, the integer data can encode a non-finite value.
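To illustrate the idea (this is a minimal standalone sketch with hypothetical buffer layout, sizes, and variable names, not the actual ggml-cuda code): when several matrices are copied into padded regions of one device buffer, clearing only the tail of the buffer leaves the padding inside each region uninitialized, so the whole buffer (or at least every region's padding) has to be cleared before the copies.

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <cstddef>

int main() {
    // Hypothetical layout: n_mat quantized matrices of mat_bytes each are
    // copied into a buffer whose per-matrix regions are padded to padded_bytes.
    const int    n_mat        = 4;
    const size_t mat_bytes    = 1000; // actual size of each quantized matrix
    const size_t padded_bytes = 1024; // region size after padding/alignment

    int8_t * buf = nullptr;
    cudaMalloc((void **) &buf, n_mat*padded_bytes);

    // Insufficient: clearing only the memory past the last copied matrix
    // leaves the padding between regions as garbage, which interpreted as
    // quantized integer data can decode to non-finite values.
    // cudaMemset(buf + (n_mat - 1)*padded_bytes + mat_bytes, 0, padded_bytes - mat_bytes);

    // Sufficient: clear the entire buffer so the padding of every region is zero.
    cudaMemset(buf, 0, n_mat*padded_bytes);

    for (int i = 0; i < n_mat; ++i) {
        // Copy each quantized matrix into its padded region (host data omitted here):
        // cudaMemcpy(buf + i*padded_bytes, src[i], mat_bytes, cudaMemcpyHostToDevice);
    }

    cudaFree(buf);
    return 0;
}
```

Whether to clear the whole buffer or only each region's padding is a performance trade-off; the sketch takes the simpler whole-buffer route.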

JohannesGaessler added the Review Complexity: Low label (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) on Oct 24, 2024
JohannesGaessler merged commit 167a515 into ggerganov:master on Oct 24, 2024
53 checks passed
Labels
Review Complexity: Low
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: K cache without FA goes NaN on Llama 3.1.
2 participants