simplify formula and make overhead static at 20% #82

Merged · 2 commits · Nov 19, 2023
24 changes: 15 additions & 9 deletions blog/2023-11-16-calculating-gpu-memory-for-llm.md
@@ -1,6 +1,6 @@
---
slug: calculating-gpu-memory-for-llm
title: "Calculating GPU memory for LLMs"
title: "Calculating GPU memory for serving LLMs"
authors:
- name: Sam Stoelinga
title: Engineer
@@ -15,7 +15,7 @@ the Large Language Model.

The formula is simple:
$$
M = \dfrac{(P * 4B)}{ (32 / Q)} + O
M = \dfrac{(P * 4B)}{ (32 / Q)} * 1.2
$$
| Symbol | Description |
| ----------- | ----------- |
@@ -24,28 +24,34 @@ $$
| 4B | 4 bytes, the memory used for each parameter at full (32-bit) precision |
| 32 | There are 32 bits in 4 bytes |
| Q | The number of bits used for loading the model, e.g. 16 bits, 8 bits, or 4 bits. |
| O | Overhead of loading additional things in GPU memory. E.g. input or batches |
| 1.2 | Represents a 20% overhead for additional things loaded into GPU memory, e.g. inputs and batches. |
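
To make the formula concrete, here is a minimal Python sketch of the same calculation; the function name `estimate_serving_memory_gb` is just an illustrative helper, not something defined in this post:

```python
def estimate_serving_memory_gb(params_billion: float, quant_bits: int) -> float:
    """Estimate the GPU memory (GB) needed to serve a model.

    params_billion -- P, the number of parameters in billions
    quant_bits     -- Q, the number of bits used to load the model (16, 8, or 4)
    """
    bytes_per_param_fp32 = 4                      # 4B: four bytes per parameter
    weights_gb = params_billion * bytes_per_param_fp32 / (32 / quant_bits)
    return weights_gb * 1.2                       # 1.2: the 20% overhead

# For example, a 7B model loaded in 16-bit:
print(f"{estimate_serving_memory_gb(7, 16):.1f} GB")  # 16.8 GB
```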

Now let's try out some examples.

### GPU memory required for serving Llama 70B
Let's try it out for Llama 70B that we will load in 16 bit with 10GB overhead.
Let's try it out for Llama 70B, which we will load in 16-bit precision.
The model has 70 billion parameters.
$$
\dfrac{70 * 4 \mathrm{bytes}}{32 / 16} + 10\mathrm{GB} = 150\mathrm{GB}
\dfrac{70 * 4 \mathrm{bytes}}{32 / 16} * 1.2 = 168\mathrm{GB}
$$
That's quite a lot of memory. A single A100 80GB wouldn't be enough, but 2x A100 80GB GPUs
should be enough to serve the Llama 2 70B model in 16-bit mode.
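
For readers who want to check the arithmetic in code, here is the same 16-bit calculation spelled out step by step (plain Python; the variable names are just illustrative):

```python
params_billion = 70      # P: Llama 2 70B
bytes_per_param = 4      # 4B
quant_bits = 16          # Q: loading in 16-bit

weights_gb = params_billion * bytes_per_param / (32 / quant_bits)  # 140 GB of weights
total_gb = weights_gb * 1.2                                        # add the 20% overhead
print(f"{total_gb:.0f} GB")                                        # 168 GB
```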

How to further reduce GPU memory required for Llama 2 70B? Quantization is a method to reduce the memory footprint. Quantization is able to do this by reducing the precision of the model's parameters from floating-point to lower-bit representations, such as 8-bit integers. This process significantly decreases the memory and computational requirements, enabling more efficient deployment of the model, particularly on devices with limited resources. However, it requires careful management to maintain the model's performance, as reducing precision can potentially impact the accuracy of the outputs.
**How to further reduce GPU memory required for Llama 2 70B?**

Quantization is a method to reduce the memory footprint. It does this by reducing the precision of the model's parameters from floating-point to lower-bit representations, such as 8-bit integers. This significantly decreases the memory and computational requirements, enabling more efficient deployment of the model, particularly on devices with limited resources. However, it requires careful management to maintain the model's performance, as reducing precision can impact the accuracy of the outputs.

In general, the consensus seems to be that 8-bit quantization achieves performance similar to 16-bit. However, 4-bit quantization can have a noticeable impact on model performance.
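
If you want to try quantized loading in practice, one common route (an assumption here, not something this post prescribes) is bitsandbytes quantization via the `transformers` library. The sketch below assumes `transformers`, `accelerate`, and `bitsandbytes` are installed and that you have access to the gated `meta-llama/Llama-2-70b-hf` checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization config (bitsandbytes); compute still happens in fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "meta-llama/Llama-2-70b-hf"  # gated checkpoint, requires access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # spread layers across the available GPUs
)
```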

Let's do another example where we use 4 bit quantization of Llama 2 70B and 1GB overhead:
Let's do another example where we use **4-bit quantization of Llama 2 70B**:
$$
\dfrac{70 * 4 \mathrm{bytes}}{32 / 4} + 1\mathrm{GB} = 36\mathrm{GB}
\dfrac{70 * 4 \mathrm{bytes}}{32 / 4} * 1.2 = 42\mathrm{GB}
$$
This is something you could easily run on 2 x L4 24GB GPUs.
This is something you could run on 2 x L4 24GB GPUs.
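
And the matching arithmetic check for the 4-bit case, assuming a naive even split of the weights across two 24GB L4 GPUs:

```python
params_billion = 70
quant_bits = 4           # Q: 4-bit quantization

total_gb = params_billion * 4 / (32 / quant_bits) * 1.2  # 42 GB
per_gpu_gb = total_gb / 2                                # naive split across 2 x L4
print(f"total: {total_gb:.0f} GB, per GPU: {per_gpu_gb:.0f} GB (an L4 has 24 GB)")
```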

### Relevant tools and resources
1. [Tool for checking how many GPUs you need for a specific model](https://huggingface.co/spaces/Vokturz/can-it-run-llm)
2. [Transformer Math 101](https://blog.eleuther.ai/transformer-math/)

Got more questions? Don't hesitate to join our Discord and ask away.

Binary file modified static/img/llm-gpu-mem-formula.png