Introduction of gemm4xN and gemmMx4 for Q4_0 and Q8_0 for better performance results #8908

Srihari-mcw · 2024-08-07T13:40:15Z

This PR introduces the templated functions gemm4xN and gemmMx4 for the GEMM functions of the relevant dimensions for Q4_0 and Q8_0.
The functions use _mm_cvtph_ps to convert the delta values to FP32 precision and then _mm_mul_ps to multiply them. Loops are unrolled so that the resulting delta products can be extracted and used directly.
These changes improve the performance of the Q4_0 and Q8_0 quantizations, especially in prompt processing.
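As a rough illustration of the delta handling described above, the sketch below widens four FP16 deltas to FP32 and multiplies them by one broadcast delta. It is not the code from this PR: the names `half_to_float` and `delta_products` are hypothetical, the layout of four A-side deltas against one B-side delta is an assumption, and a scalar fallback is included only so the sketch also builds without F16C support.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#if defined(__F16C__)
#include <immintrin.h>
#endif

// Scalar IEEE 754 half -> float conversion (hypothetical helper),
// used when the F16C instructions are not available at compile time.
inline float half_to_float(std::uint16_t h) {
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    const std::uint32_t exp  = (h >> 10) & 0x1Fu;
    std::uint32_t mant = h & 0x3FFu;
    std::uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;  // signed zero
        } else {
            // Subnormal half: shift the mantissa until the implicit bit appears.
            std::uint32_t shift = 0;
            while ((mant & 0x400u) == 0) { mant <<= 1; ++shift; }
            mant &= 0x3FFu;
            bits = sign | ((113u - shift) << 23) | (mant << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);  // infinity or NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);  // rebias 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Multiply four FP16 deltas (raw bits) by one FP16 delta (raw bits).
// Hypothetical helper; the real kernels operate on quantized block structs.
inline std::array<float, 4> delta_products(const std::uint16_t a_d[4], std::uint16_t b_d) {
    std::array<float, 4> out{};
#if defined(__F16C__)
    // Load the four 16-bit halves into the low 64 bits and widen them to FP32.
    const __m128 da = _mm_cvtph_ps(_mm_loadl_epi64(reinterpret_cast<const __m128i *>(a_d)));
    // Broadcast the single delta to every lane before widening it.
    const __m128 db = _mm_cvtph_ps(_mm_set1_epi16(static_cast<short>(b_d)));
    _mm_storeu_ps(out.data(), _mm_mul_ps(da, db));
#else
    const float db = half_to_float(b_d);
    for (int i = 0; i < 4; ++i) out[i] = half_to_float(a_d[i]) * db;
#endif
    return out;
}
```

Computing the four products in one step means each scale factor can later be picked out of `out` and applied to the matching accumulator, which is presumably what the loop unrolling in the PR makes possible.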

GCC Linux :

Meta Llama2 7B model:

Q4_0 Model :

model	size	params	backend	threads	test	t/s	speedup	Commit id	Notes
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	6	pp 512	43.79 ± 0.08		7e72aa74	Base commit
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	6	pp 512	59.37 ± 0.08	35.58%	cdf3a251	Commit with PR changes
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	6	tg 128	14.65 ± 0.01		7e72aa74	Base commit
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	6	tg 128	14.51 ± 0.00	-0.96%	cdf3a251	Commit with PR changes

Q8_0 Model :

model	size	params	backend	threads	test	t/s	speedup	Commit id	Notes
llama 7B Q8_0	6.67 GiB	6.74 B	CPU	6	pp 512	56.87 ± 0.06		7e72aa74	Base commit
llama 7B Q8_0	6.67 GiB	6.74 B	CPU	6	pp 512	68.03 ± 0.13	19.69%	cdf3a251	Commit with PR changes
llama 7B Q8_0	6.67 GiB	6.74 B	CPU	6	tg 128	8.12 ± 0.00		7e72aa74	Base commit
llama 7B Q8_0	6.67 GiB	6.74 B	CPU	6	tg 128	8.12 ± 0.00	0.00%	cdf3a251	Commit with PR changes

Mistral-7B-Instruct-v0.3 model:

Q4_0 Model :

model	size	params	backend	threads	test	t/s	speedup	Commit id	Notes
llama 7B Q4_0	3.83 GiB	7.25 B	CPU	6	pp 512	40.96 ± 0.05		7e72aa74	Base commit
llama 7B Q4_0	3.83 GiB	7.25 B	CPU	6	pp 512	55.71 ± 0.11	36.01%	cdf3a251	Commit with PR changes
llama 7B Q4_0	3.83 GiB	7.25 B	CPU	6	tg 128	13.81 ± 0.01		7e72aa74	Base commit
llama 7B Q4_0	3.83 GiB	7.25 B	CPU	6	tg 128	13.66 ± 0.00	-1.09%	cdf3a251	Commit with PR changes

Q8_0 Model :

model	size	params	backend	threads	test	t/s	speedup	Commit id	Notes
llama 7B Q8_0	7.17 GiB	7.25 B	CPU	6	pp 512	53.34 ± 0.04		7e72aa74	Base commit
llama 7B Q8_0	7.17 GiB	7.25 B	CPU	6	pp 512	63.64 ± 0.07	19.31%	cdf3a251	Commit with PR changes
llama 7B Q8_0	7.17 GiB	7.25 B	CPU	6	tg 128	7.59 ± 0.00		7e72aa74	Base commit
llama 7B Q8_0	7.17 GiB	7.25 B	CPU	6	tg 128	7.60 ± 0.00	0.13%	cdf3a251	Commit with PR changes

GCC Version = 12.3

The PR was tested on an AMD Raphael 7600X, which supports the following flags by default :

Original Unquantized Models :

Llama2 7B : https://huggingface.co/meta-llama/Llama-2-7b
Mistral 7B Instruct v0.3 : https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

Srihari-mcw · 2024-08-08T00:59:30Z

PR #8908 was also tested on an AMD Ryzen Threadripper PRO 5995WX machine. The test results are given below, along with the supported flags and other details.

Performance Results on AMD Ryzen Threadripper PRO 5995WX

GCC Linux :

Mistral-7B-Instruct-v0.3 model:

Q4_0 Model :

model	size	params	backend	threads	test	t/s	speedup	Commit id	Notes
llama 7B Q4_0	3.83 GiB	7.25 B	CPU	64	pp 512	189.30 ± 0.31		7e72aa74	Base commit
llama 7B Q4_0	3.83 GiB	7.25 B	CPU	64	pp 512	210.26 ± 0.32	11.07%	cdf3a251	Commit with PR changes
llama 7B Q4_0	3.83 GiB	7.25 B	CPU	64	tg 128	33.74 ± 0.04		7e72aa74	Base commit
llama 7B Q4_0	3.83 GiB	7.25 B	CPU	64	tg 128	33.77 ± 0.05	0.09%	cdf3a251	Commit with PR changes

Q8_0 Model :

model	size	params	backend	threads	test	t/s	speedup	Commit id	Notes
llama 7B Q8_0	7.17 GiB	7.25 B	CPU	64	pp 512	214.93 ± 0.25		7e72aa74	Base commit
llama 7B Q8_0	7.17 GiB	7.25 B	CPU	64	pp 512	241.85 ± 0.47	12.53%	cdf3a251	Commit with PR changes
llama 7B Q8_0	7.17 GiB	7.25 B	CPU	64	tg 128	19.83 ± 0.01		7e72aa74	Base commit
llama 7B Q8_0	7.17 GiB	7.25 B	CPU	64	tg 128	19.74 ± 0.00	0.13%	cdf3a251	Commit with PR changes

GCC Version = 12.3

The machine supports the following flags by default :

Original Unquantized Models :

Llama2 7B : https://huggingface.co/meta-llama/Llama-2-7b
Mistral 7B Instruct v0.3 : https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

ggerganov

I observe a 10%-15% PP speed improvement on a Ryzen 9 5950X using Gemma 2 2B models. Perplexity is the same.

Add loop unrolled 4xN and MX4 dimension GEMM functions with parallel delta multiplication

cdf3a25

Srihari-mcw mentioned this pull request Aug 7, 2024

Introduce Q8_0 and Q4_0 with Bf16 delta values #7497

Open

mofosyne added the "Review Complexity : Medium" label Aug 8, 2024

slaren approved these changes Aug 30, 2024

ggerganov approved these changes Aug 31, 2024

ggerganov merged commit ea5d747 into ggerganov:master Aug 31, 2024
53 checks passed

netrunnereve mentioned this pull request Sep 11, 2024

IQ4_NL sgemm + Q4_0 AVX optimization #9422

Merged

4 tasks
