Llama 3 speed #585
I tested the speed of Llama models using exllama and noticed that the 8B models run much slower than the 7B ones (this isn't the case with other inference backends). Can you tell me what the problem might be?
A100 80GB
GPTQ doesn't quantize the output layer, so with Llama 3's much larger vocabulary the output tensor alone is about 750 MB larger. That adds a significant amount of latency. Llama 3 also uses GQA, which means a similar amount of computation for attention with somewhat lower VRAM overhead, but the weights that would have gone to the key/value projections are instead added to the MLP inner layer, making that part slower instead. I don't know that this accounts for the entire difference you're seeing, but certainly some of it. You could perhaps try EXL2 models, which do have quantized output layers. I would need to know more about the hardware setup before I could attempt to reproduce it.
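To put rough numbers on both effects, here is a back-of-the-envelope sketch using the dimensions from the released model configs (hidden size 4096 for both models, vocabularies 32,000 vs 128,256, MLP intermediate sizes 11,008 vs 14,336); the FP16 assumption for the unquantized output layer is an assumption of this sketch, not something stated above:

```python
# Rough sizing of the two effects described above, assuming FP16
# (2 bytes/weight) for the unquantized output layer and the published
# config dimensions for Llama 2 7B and Llama 3 8B.

hidden = 4096  # hidden size, same for both models

def mib(n_params):
    # FP16 parameter count -> size in MiB
    return n_params * 2 / 2**20

# 1) Output (lm_head) layer: GPTQ leaves it unquantized, and Llama 3's
#    vocabulary is 4x larger, so the tensor grows by roughly 750 MiB.
vocab_l2, vocab_l3 = 32_000, 128_256
print(f"Llama 2 lm_head: {mib(vocab_l2 * hidden):6.0f} MiB")   # ~250 MiB
print(f"Llama 3 lm_head: {mib(vocab_l3 * hidden):6.0f} MiB")   # ~1002 MiB
print(f"difference:      {mib((vocab_l3 - vocab_l2) * hidden):6.0f} MiB")

# 2) GQA: Llama 3 shrinks the K/V projections (8 KV heads instead of 32)
#    but widens the MLP (14336 vs 11008), shifting compute into the MLP.
def mlp_params(intermediate):
    # gate, up, and down projections of the SwiGLU MLP
    return 3 * hidden * intermediate

print(f"MLP params/layer, Llama 2 7B: {mlp_params(11_008)/1e6:.0f}M")  # ~135M
print(f"MLP params/layer, Llama 3 8B: {mlp_params(14_336)/1e6:.0f}M")  # ~176M
```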
I think quantized output layers definitely make a difference. I am benchmarking my constrained decoding library, and even on an A5000 I got a 1.5x speedup compared to @freQuensy23-coder's benchmark.
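For reference, a minimal sketch of how such a tokens-per-second comparison can be measured; `generate_fn` is a hypothetical placeholder for whatever generator call is being benchmarked, not an exllama API:

```python
import time

def tokens_per_second(generate_fn, prompt, n_tokens=128, warmup=1, runs=3):
    """Time a generation call and report decode throughput.

    generate_fn is a hypothetical stand-in: substitute the actual
    generator call (e.g. from an exllama/EXL2 setup) being benchmarked.
    """
    for _ in range(warmup):  # warm up kernels and caches before timing
        generate_fn(prompt, max_new_tokens=n_tokens)
    start = time.perf_counter()
    for _ in range(runs):
        generate_fn(prompt, max_new_tokens=n_tokens)
    elapsed = time.perf_counter() - start
    return runs * n_tokens / elapsed
```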