Hi, I'm currently evaluating the possibility of computing the matmul in `ggml_compute_forward_mat_mul_f32` (and later the quantized versions) with intrinsic functions for matrix multiplication.
The Power10 processor has a dedicated MMA unit (matrix math accelerator) for many precision types (fp64/fp32/fp16/bf16/i8/i4).
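To make the idea concrete, here is a minimal sketch of the kind of tile kernel I have in mind (a simplified illustration, not the actual branch code: `gemm_tile_4x4` and the leading-dimension parameters are made up for the example, while the `__builtin_mma_*` calls are the GCC/Clang POWER10 built-ins):

```c
// Minimal sketch, not the actual branch code. One MMA accumulator
// (__vector_quad) holds a 4x4 fp32 tile, and __builtin_mma_xvf32gerpp does a
// rank-1 (outer product) update, so K updates accumulate a 4x4 block of C = A*B.
// Build with e.g.: gcc -O2 -mcpu=power10
#include <altivec.h>

static void gemm_tile_4x4(const float *a, const float *b, float *c,
                          int K, int lda, int ldb, int ldc) {
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);                 // zero the 4x4 accumulator

    for (int k = 0; k < K; ++k) {
        // gather column k of a 4-row panel of A, and row k of a 4-column panel of B
        float col[4] = { a[0*lda + k], a[1*lda + k], a[2*lda + k], a[3*lda + k] };
        vector float va = vec_xl(0, col);
        vector float vb = vec_xl(0, &b[k*ldb]);
        // acc[i][j] += va[i] * vb[j]
        __builtin_mma_xvf32gerpp(&acc, (vector unsigned char)va,
                                       (vector unsigned char)vb);
    }

    vector float rows[4];
    __builtin_mma_disassemble_acc(rows, &acc);     // extract the four result rows
    for (int i = 0; i < 4; ++i)                    // (row order may need adjusting
        vec_xst(rows[i], 0, &c[i*ldc]);            //  depending on endianness)
}
```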
However, after implementing the first version I saw a huge performance decrease, and on closer inspection it turned out that most of the computation is matrix-vector multiplication.
If I got that right, `ne11` and `ne00` are the sizes to look at: while `ne00` has a "good" size of 4096 or larger, `ne11` seems to almost always be 1 (in very few cases it is 8 at the beginning). My question might be very naive, but is there a way to get matrices instead of vectors to improve this computation?
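For context, this is my simplified mental model of what the f32 mul mat boils down to (a naive reconstruction with illustrative names, not the actual ggml code): when `ne11 == 1` the outer loop runs once and the whole operation degenerates into `ne01` independent dot products, i.e. a matrix-vector product with no 2D tile for the MMA unit to work on.

```c
// Naive reconstruction, not the actual ggml code: src0 is the weight matrix
// (ne01 rows of length ne00), src1 holds the activations (ne11 rows of length
// ne00). During token-by-token generation ne11 == 1.
static void mul_mat_f32_naive(const float *src0, const float *src1, float *dst,
                              int ne00, int ne01, int ne11) {
    for (int i11 = 0; i11 < ne11; ++i11) {        // rows of src1; 1 at decode time
        for (int i01 = 0; i01 < ne01; ++i01) {    // rows of src0 (output elements)
            float sum = 0.0f;
            for (int i00 = 0; i00 < ne00; ++i00)  // dot product of length ne00 (e.g. 4096)
                sum += src0[i01*ne00 + i00] * src1[i11*ne00 + i00];
            dst[i11*ne01 + i01] = sum;
        }
    }
}
```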
Here is a sample comparison between the branches with an 8-core CPU running the llama 7b_f32 model:
| SMT level | Model  | t/s master | t/s MMA branch |
|-----------|--------|------------|----------------|
| SMT=8     | 7b_f32 | 4.75       | <0.01          |
| SMT=4     | 7b_f32 | 4.7        | <0.01          |
| SMT=2     | 7b_f32 | 3.27       | <0.01          |
| SMT=1     | 7b_f32 | 2.79       | <0.01          |
- SMT is simultaneous multithreading, which allows multiple instruction streams (threads) to run concurrently on the same physical processor.
- t/s is the tokens per second from the `eval_time`.
Thanks a lot in advance for your feedback!