This would significantly improve the experience of running models on edge devices: at roughly the same memory usage it would unlock either larger, more useful context sizes or larger-parameter models at the same context size.
Description
Flash attention and quantized kv stores are both supported by llama.cpp.
Together they allow much larger contexts with a drastically reduced memory footprint, which would be especially valuable given the limited resources on a phone.
A quantized KV cache at q8 roughly halves the memory needed for the context with barely any effect on quality (q4 uses about a quarter of the memory, but I noticed degradation in my tests).
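For a sense of scale, here is a small self-contained calculation of the KV-cache footprint under each cache type. The layer count and K/V width are assumptions chosen to resemble a typical 8B GQA model, not measurements from any particular build, and the q8_0/q4_0 sizes are approximate (they include the per-block scale overhead, so they land slightly above 1/2 and 1/4 of f16).

```cpp
// Rough per-context KV-cache size for the three cache types.
// Model dimensions below are illustrative assumptions.
#include <cstdio>

int main() {
    const double n_layer   = 32;      // transformer layers (assumed)
    const double n_embd_kv = 1024;    // K/V width per layer (assumed, GQA)
    const double n_ctx     = 8192;    // context length

    const double elems = 2 /* K and V */ * n_layer * n_ctx * n_embd_kv;

    // approximate bytes per element: f16 = 2, q8_0 ~ 34/32, q4_0 ~ 18/32
    const double f16  = elems * 2.0;
    const double q8_0 = elems * 34.0 / 32.0;
    const double q4_0 = elems * 18.0 / 32.0;

    std::printf("f16 : %.2f GiB\n", f16  / (1 << 30));  // ~1.00 GiB
    std::printf("q8_0: %.2f GiB\n", q8_0 / (1 << 30));  // ~0.53 GiB
    std::printf("q4_0: %.2f GiB\n", q4_0 / (1 << 30));  // ~0.28 GiB
}
```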
The feature could be implemented by adding two optional parameters: a checkbox for flash attention (required for KV quantization) and a dropdown to select the quantization type for both the K and V stores: f16 (the current default), q8, and q4.
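A minimal sketch of how those two options could map onto llama.cpp's context parameters. The field names (`flash_attn`, `type_k`, `type_v`) are the ones exposed by recent `llama.h` revisions and have changed shape across releases, so treat them as assumptions and check the header of the bundled llama.cpp; the `KvCacheSettings` struct is purely hypothetical UI state.

```cpp
// Sketch: wiring a flash-attention checkbox and a KV-quantization dropdown
// into llama.cpp context parameters. Field names depend on the llama.cpp version.
#include "llama.h"

// Hypothetical UI state: a checkbox plus a dropdown value (f16 / q8_0 / q4_0).
struct KvCacheSettings {
    bool      flash_attention = false;
    ggml_type kv_type         = GGML_TYPE_F16;  // current default
};

llama_context_params make_ctx_params(const KvCacheSettings &ui) {
    llama_context_params p = llama_context_default_params();
    p.flash_attn = ui.flash_attention;  // required for a quantized KV cache
    p.type_k     = ui.kv_type;          // quantization of the K store
    p.type_v     = ui.kv_type;          // quantization of the V store
    return p;
}
```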