Replies: 1 comment 1 reply
-
mul_mat_q is enabled by compiling with cuBLAS and passing the --mul-mat-q flag on the command line. In the latest llama.cpp versions (not merged yet) mul_mat_q is the default, so the flag no longer works. And yes, it's faster and saves quite a lot of VRAM.
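For reference, a minimal sketch of the build-and-run steps described above (the model path and layer count are placeholders; whether the flag is accepted depends on the llama.cpp version):

```sh
# Build llama.cpp with cuBLAS support (Makefile build; cmake works too)
make clean && LLAMA_CUBLAS=1 make

# Run with the quantized mul_mat_q kernels enabled; on builds where they
# are already the default, the flag can be dropped
./main -m ./models/model.gguf --n-gpu-layers 35 --mul-mat-q -p "Hello"
```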
-
I am compiling as per the README for cuBLAS but would like to try the mul_mat_q kernels to compare speeds. From what I gather, these kernels are implemented using OpenBLAS?
Does this mean I have to compile a separate llama-cpp-python for each backend and uninstall them in between, or can I compile one build with both cuBLAS and OpenBLAS?
Will the --mul-mat-q flag also work with a cuBLAS-only build?
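For context, a sketch of how a cuBLAS-only llama-cpp-python build is typically produced (following the README convention; the exact CMake variable and pip flags may differ between versions):

```sh
# Rebuild llama-cpp-python against the cuBLAS backend only
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
  pip install --force-reinstall --no-cache-dir llama-cpp-python
```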