Q8 or unquantized cache with what context length for llama 3.1-8b 5.0 bpw exl2? #575
I'm sorry, but I don't think I really understand the question?
I'll elaborate: consider Llama 3 and Llama 3.1 8B for comparison, with the same exl2 quantization.
I don't quite understand the logic here. L3.1 has the exact same architecture as L3, so it uses the same amount of VRAM per token of context. So 128k context with L3.1 will use 16 times as much VRAM as 8k context with L3. The quantization level of the weights also does not affect the VRAM cost of context.
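To make the 16x concrete, here's a minimal arithmetic sketch (the per-token figure assumes the usual Llama 3 / 3.1 8B shapes of 32 layers, 8 KV heads, and head dim 128, with an FP16 cache; these are not measured exllamav2 numbers):

```python
# FP16 KV cache: 2 (K and V) * 32 layers * 8 KV heads * 128 head_dim * 2 bytes
per_token_bytes = 2 * 32 * 8 * 128 * 2        # 131072 bytes = 128 KiB per token

print(per_token_bytes * 8 * 1024 / 2**30)     # 8k context:   1.0 GiB
print(per_token_bytes * 128 * 1024 / 2**30)   # 128k context: 16.0 GiB, i.e. 16x
```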
I don't think they added any VRAM-usage-reducing tech in 3.1; it'll use more VRAM if you run 128k context. You can check how much a model uses with this.
Yeah, I don't think there is any such thing as VRAM reduction here. The model architecture is the same, and longer context with a larger cache requires more VRAM. I don't know why OP thinks it's the other way around.
Let me put it another way; consider this scenario. Since I could set the cache max seq length to 64k for a model that only supports a context length of 8k, does that mean the model's own increased context length is of no use here, and additionally doesn't affect VRAM usage? Or is my understanding of context length and cache wrong?
The VRAM usage is determined entirely by the model architecture, cache size, and cache type. The model's supported context length has no influence on VRAM usage.
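As a sketch of that point (same assumed Llama 3 / 3.1 8B shapes as above; the real exllamav2 quantized cache adds some overhead for scales, so treat these as approximate lower bounds):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_element: int) -> int:
    # K and V each hold num_layers * num_kv_heads * head_dim values per token.
    elements_per_token = 2 * num_layers * num_kv_heads * head_dim
    return elements_per_token * seq_len * bits_per_element // 8

# Llama 3 and 3.1 share the same architecture, so the numbers are identical;
# only seq_len (cache size) and bits_per_element (cache type) move the total.
for bits, name in [(16, "FP16"), (8, "Q8"), (4, "Q4")]:
    gib = kv_cache_bytes(32, 8, 128, 64 * 1024, bits) / 2**30
    print(f"{name} cache @ 64k tokens: ~{gib:.1f} GiB")
```

which prints roughly 8.0, 4.0, and 2.0 GiB respectively.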
Claude has an answer for me. I guess I would be setting the max length to 16k or 32k for my use case; BTW, could it be 20k too? From Claude's answer: You're absolutely right, and you've touched on some important nuances. Let's break this down:
In these cases, Llama 3.1's ability to handle up to 128k tokens would provide a significant advantage in maintaining coherence, accuracy, and relevance throughout the task. To sum up, while the VRAM usage might be similar if you set the same cache size, the longer-context model (Llama 3.1) offers significant advantages in handling extended contexts, which becomes crucial for certain types of tasks and interactions. The benefits may not be apparent in shorter interactions but become increasingly important as the context length grows.
My query directly relates to this: Llama 3 8B (8k context) inferred with a Q4 cache at 64k max seq length used to take around 8788 MB of GPU VRAM. Now, does the increased context length of 3.1 save us any VRAM? If it does, what context length should I set, and with what kind of cache?
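For a rough, hypothetical cross-check of that ~8788 MB figure (every number below is an assumption, not measured exllamav2 behaviour): 5.0 bpw weights for an ~8B-parameter model come to about 4.7 GiB and a Q4 cache at 64k to about 2 GiB, with the remainder being activations and runtime overhead; none of it shrinks when moving to 3.1 at the same cache settings.

```python
weights_gib = 8.03e9 * 5.0 / 8 / 2**30                         # 5.0 bpw exl2 weights: ~4.7 GiB
q4_cache_gib = (2 * 32 * 8 * 128) * 64 * 1024 * 4 / 8 / 2**30  # Q4 cache @ 64k: ~2.0 GiB
print(weights_gib + q4_cache_gib)                              # ~6.7 GiB, plus runtime overhead
```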