Hi, I'm currently evaluating the possibility of computing the matmul in `ggml_compute_forward_mat_mul_f32` (and later the quantized versions) with intrinsic functions for matrix multiplication.
The Power10 processor has a dedicated MMA unit (matrix math accelerator) for many precision types (fp64/fp32/fp16/bf16/i8/i4).
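To make the idea concrete, here is a minimal sketch of the kind of tile kernel I have in mind (a simplified illustration, not the actual branch code: `gemm_tile_4x4` and the leading-dimension parameters are made up for the example, while the `__builtin_mma_*` calls are the GCC/Clang POWER10 built-ins):

```c
// Minimal sketch, not the actual branch code. One MMA accumulator
// (__vector_quad) holds a 4x4 fp32 tile, and __builtin_mma_xvf32gerpp does a
// rank-1 (outer product) update, so K updates accumulate a 4x4 block of C = A*B.
// Build with e.g.: gcc -O2 -mcpu=power10
#include <altivec.h>

static void gemm_tile_4x4(const float *a, const float *b, float *c,
                          int K, int lda, int ldb, int ldc) {
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);                 // zero the 4x4 accumulator

    for (int k = 0; k < K; ++k) {
        // gather column k of a 4-row panel of A, and row k of a 4-column panel of B
        float col[4] = { a[0*lda + k], a[1*lda + k], a[2*lda + k], a[3*lda + k] };
        vector float va = vec_xl(0, col);
        vector float vb = vec_xl(0, &b[k*ldb]);
        // acc[i][j] += va[i] * vb[j]
        __builtin_mma_xvf32gerpp(&acc, (vector unsigned char)va,
                                       (vector unsigned char)vb);
    }

    vector float rows[4];
    __builtin_mma_disassemble_acc(rows, &acc);     // extract the four result rows
    for (int i = 0; i < 4; ++i)                    // (row order may need adjusting
        vec_xst(rows[i], 0, &c[i*ldc]);            //  depending on endianness)
}
```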
However, after implementing the first version I saw a huge performance decrease, and on closer inspection it turned out that most of the computation is matrix-vector multiplication.
If I got that right, `ne11` and `ne00` are the sizes to look at: while `ne00` has a "good" size of 4096 or larger, `ne11` seems to almost always be 1 (in very few cases it is 8 at the beginning). My question might be very naive, but is there a way to get matrices instead of vectors to improve this computation?
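For context, this is my simplified mental model of what the f32 mul mat boils down to (a naive reconstruction with illustrative names, not the actual ggml code): when `ne11 == 1` the outer loop runs once and the whole operation degenerates into `ne01` independent dot products, i.e. a matrix-vector product with no 2D tile for the MMA unit to work on.

```c
// Naive reconstruction, not the actual ggml code: src0 is the weight matrix
// (ne01 rows of length ne00), src1 holds the activations (ne11 rows of length
// ne00). During token-by-token generation ne11 == 1.
static void mul_mat_f32_naive(const float *src0, const float *src1, float *dst,
                              int ne00, int ne01, int ne11) {
    for (int i11 = 0; i11 < ne11; ++i11) {        // rows of src1; 1 at decode time
        for (int i01 = 0; i01 < ne01; ++i01) {    // rows of src0 (output elements)
            float sum = 0.0f;
            for (int i00 = 0; i00 < ne00; ++i00)  // dot product of length ne00 (e.g. 4096)
                sum += src0[i01*ne00 + i00] * src1[i11*ne00 + i00];
            dst[i11*ne01 + i01] = sum;
        }
    }
}
```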
Here is a sample comparison between the branches with an 8-core CPU running the llama 7b_f32 model:
| SMT level | Model  | t/s master | t/s MMA branch |
|-----------|--------|------------|----------------|
| SMT=8     | 7b_f32 | 4.75       | <0.01          |
| SMT=4     | 7b_f32 | 4.7        | <0.01          |
| SMT=2     | 7b_f32 | 3.27       | <0.01          |
| SMT=1     | 7b_f32 | 2.79       | <0.01          |
- SMT is simultaneous multithreading, which allows multiple instruction streams (threads) to run concurrently on the same physical processor.
- t/s is the tokens per second from the `eval_time`.
Thanks a lot in advance for your feedback!