Replies: 1 comment 1 reply
-
mul_mat_q is enabled by compiling with cuBLAS and passing the --mul-mat-q flag on the command line. In the latest llama.cpp versions (not merged yet) mul_mat_q is the default, so the flag no longer works. And yes, it's faster and saves quite a lot of VRAM.
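For reference, a minimal sketch of the build-and-run steps described above (the model path and layer count are placeholders; whether the flag is accepted depends on the llama.cpp version):

```sh
# Build llama.cpp with cuBLAS support (Makefile build; cmake works too)
make clean && LLAMA_CUBLAS=1 make

# Run with the quantized mul_mat_q kernels enabled; on builds where they
# are already the default, the flag can be dropped
./main -m ./models/model.gguf --n-gpu-layers 35 --mul-mat-q -p "Hello"
```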
-
I am compiling as per the README for cuBLAS but would like to try the mul_mat_q kernels to compare speeds. From what I gather, these kernels are implemented using OpenBLAS?
Does this mean I have to compile a separate llama-cpp-python for each backend and uninstall them in between, or can I compile one build with both cuBLAS and OpenBLAS?
Will the --mul-mat-q flag also work with a cuBLAS-only build?
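For context, a sketch of how a cuBLAS-only llama-cpp-python build is typically produced (following the README convention; the exact CMake variable and pip flags may differ between versions):

```sh
# Rebuild llama-cpp-python against the cuBLAS backend only
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
  pip install --force-reinstall --no-cache-dir llama-cpp-python
```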