[Feat]: quantized KV cache and flash attention #79

mseri · 2024-11-03T12:23:18Z

Description
Flash attention and quantized kv stores are both supported by llama.cpp.

These features allow much larger contexts with a drastically reduced memory footprint, which would be quite convenient given the limited resources on a phone.

With a q8 quantized KV cache, the context takes roughly half the memory with barely any effect on quality (q4 takes roughly a quarter of the memory, but I noticed degradation in my tests).
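As a rough sketch of the arithmetic behind these figures: the KV cache stores one K and one V vector per layer, per KV head, per context position, so its size scales linearly with the bytes used per element. The snippet below assumes ggml's `q8_0` and `q4_0` block layouts (32 values plus one fp16 scale per block) as the formats meant by "q8" and "q4" above. The model shape is a made-up example, not taken from any model discussed here, and `kv_cache_bytes` is a hypothetical helper, not a llama.cpp function.

```python
# Bytes per cached element for each cache type.
# q8_0 and q4_0 pack 32 values per block together with one 2-byte fp16 scale.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values (32 bytes) + 2-byte scale
    "q4_0": 18 / 32,  # 32 4-bit values (16 bytes) + 2-byte scale
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate the KV cache size, assuming K and V share one cache type."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)

f16_size = kv_cache_bytes(**shape, cache_type="f16")
for cache_type in BYTES_PER_ELEMENT:
    size = kv_cache_bytes(**shape, cache_type=cache_type)
    print(f"{cache_type}: {size / 2**20:.0f} MiB ({size / f16_size:.0%} of f16)")
```

Because of the per-block scale, the quantized caches come out slightly above exactly one half and one quarter of the f16 size.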

The feature could be implemented by adding two optional parameters: a checkbox for flash attention (required for KV quantization) and a dropdown to select the quantization for both the K and V stores, with f16 (the current default), q8, and q4 as options.
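A minimal sketch of how those two options might be validated before a context is created, assuming the dependency described above (a quantized KV cache needs flash attention). `ContextSettings` and `validate` are hypothetical names for illustration only, not part of the app or of llama.cpp, and the cache type strings assume llama.cpp's `f16`, `q8_0`, and `q4_0` naming.

```python
from dataclasses import dataclass

# Cache types offered in the dropdown (assumed names).
CACHE_TYPES = ("f16", "q8_0", "q4_0")


@dataclass
class ContextSettings:
    """Hypothetical container for the two proposed options."""
    flash_attn: bool = False   # the checkbox
    cache_type: str = "f16"    # the dropdown, applied to both K and V


def validate(settings):
    """Return a list of problems; an empty list means the settings are usable."""
    problems = []
    if settings.cache_type not in CACHE_TYPES:
        problems.append(f"unknown cache type {settings.cache_type!r}")
    # Anything other than f16 counts as quantized and needs flash attention.
    if settings.cache_type != "f16" and not settings.flash_attn:
        problems.append("quantized KV cache requires flash attention")
    return problems


print(validate(ContextSettings()))                                    # defaults
print(validate(ContextSettings(flash_attn=False, cache_type="q8_0")))  # rejected
print(validate(ContextSettings(flash_attn=True, cache_type="q8_0")))   # accepted
```

In a UI, the same rule could instead disable the dropdown until the checkbox is ticked, which avoids showing an error at all.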

a-ghorbani · 2024-11-30T23:05:18Z

Adding the options should be straightforward, I guess. But I was wondering: which mobile devices support flash attention?

sammcj · 2024-12-07T20:28:50Z

This would significantly improve the experience of running models on edge devices, as it would unlock either larger, more useful context sizes or larger-parameter models with the same context size, at around the same memory usage.

I recently wrote a short blog post about qKV after getting it enabled in Ollama (which uses llama.cpp). It includes a calculator for estimating the memory savings, which you might find interesting: https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama

mseri added the enhancement New feature or request label Nov 3, 2024

mseri changed the title ~~[Feat]: quantized KV cache and/or flash attention~~ [Feat]: quantized KV cache and flash attention Nov 8, 2024

a-ghorbani mentioned this issue Dec 26, 2024

[Feat] Expose attn, batch, ubatch, cach_type_kv settings to the UI and bench results #148

Merged

7 tasks

a-ghorbani closed this as completed in #148 Dec 26, 2024
