Llama 3 speed #585

Open
freQuensy23-coder opened this issue Aug 4, 2024 · 2 comments

Comments

freQuensy23-coder commented Aug 4, 2024

I tested the speed of Llama models using exllama and noticed that the speed of 8B models is much slower than 7B (although this is not the case with other inference engines). Can you tell me what the problem might be?

A100 80GB

| Model | tokens/s | tokens in first second | symbols/s |
| --- | --- | --- | --- |
| suzume-llama-3-8B-multilingual-gptq | 63.96 ± 7.37 | 56.48 ± 16.76 | 96.78 ± 14.76 |
| Swallow-7b-instruct-v0.1-gptq | 194.56 ± 22.83 | 166.34 ± 40.23 | 165.04 ± 37.06 |
| shisa-v1-llama3-8b-gptq | 60.01 ± 9.03 | 58.92 ± 14.76 | 90.45 ± 15.50 |
turboderp (Owner) commented

GPTQ doesn't quantize the output layer, so with the much larger vocabulary of Llama3, the output tensor alone is about 750 MB larger. This adds a significant amount of latency.
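
For a rough sense of scale, here is a back-of-the-envelope sketch (assuming the published configs: 4096 hidden size for both models, a 32000-token vocabulary for Llama 2 7B vs. 128256 for Llama 3 8B, and fp16 storage for the unquantized output layer):

```python
# Back-of-the-envelope size of the unquantized output (lm_head) matrix in fp16.
# Assumes hidden_size = 4096 for both models and 2 bytes per weight.
def lm_head_bytes(vocab_size: int, hidden_size: int = 4096, bytes_per_weight: int = 2) -> int:
    return vocab_size * hidden_size * bytes_per_weight

llama2_7b = lm_head_bytes(32_000)    # ~250 MiB
llama3_8b = lm_head_bytes(128_256)   # ~1000 MiB

print(f"Llama 2 7B lm_head: {llama2_7b / 2**20:.0f} MiB")
print(f"Llama 3 8B lm_head: {llama3_8b / 2**20:.0f} MiB")
print(f"difference:         {(llama3_8b - llama2_7b) / 2**20:.0f} MiB")  # ~752 MiB, i.e. roughly 750 MB
```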

Llama3 also uses GQA, which means you have a similar amount of computation for attention with somewhat lower VRAM overhead, but then the weights that would be assigned to keys/values are instead added to the MLP inner layer, making that slower instead.
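
A per-layer parameter count illustrates the shift (a sketch using the standard published configs: both models have hidden size 4096 and 32 query heads with head dim 128; Llama 2 7B has 32 KV heads and an 11008-wide MLP, while Llama 3 8B has 8 KV heads and a 14336-wide MLP):

```python
# Per-layer parameter counts for the attention projections vs. the MLP,
# using the published model configs (see assumptions in the text above).
def layer_params(hidden, n_heads, n_kv_heads, intermediate, head_dim=128):
    q = hidden * n_heads * head_dim          # q_proj
    k = hidden * n_kv_heads * head_dim       # k_proj (shrinks under GQA)
    v = hidden * n_kv_heads * head_dim       # v_proj (shrinks under GQA)
    o = n_heads * head_dim * hidden          # o_proj
    mlp = 3 * hidden * intermediate          # gate/up/down projections
    return q + k + v + o, mlp

for name, cfg in {
    "Llama 2 7B": (4096, 32, 32, 11008),
    "Llama 3 8B": (4096, 32, 8, 14336),
}.items():
    attn, mlp = layer_params(*cfg)
    print(f"{name}: attention {attn / 1e6:.1f}M, MLP {mlp / 1e6:.1f}M params per layer")
# Llama 2 7B: attention 67.1M, MLP 135.3M
# Llama 3 8B: attention 41.9M, MLP 176.2M
```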

I don't know that this accounts for the entire difference you're seeing, but certainly some of it. You could try with EXL2 models which do have quantized output layers, perhaps. I would need to know more about the hardware setup before I could attempt to reproduce it.
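
For reference, a rough sketch of timing an EXL2 quant with exllamav2's simple generator, adapted from the repository's example scripts (the model path and sampler values are placeholders, and the exact API may differ between versions):

```python
# Sketch: measure generation speed for an EXL2-quantized model with exllamav2.
import time

from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_dir = "/models/Llama-3-8B-Instruct-exl2-4.0bpw"  # placeholder path

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

generator.warmup()
max_new_tokens = 256
start = time.time()
generator.generate_simple("The quick brown fox", settings, max_new_tokens)
elapsed = time.time() - start
print(f"{max_new_tokens / elapsed:.1f} tokens/s")
```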

Dan-wanna-M commented

I think quantized output layers definitely make a difference. I am benchmarking my constrained decoding library, and even with an A5000 I obtained a 1.5x speedup in comparison to @freQuensy23-coder's benchmark.
