
Q8 or unquantized cache with what context length for llama 3.1-8b 5.0 bpw exl2? #575

Open · lovebeatz opened this issue on Jul 27, 2024 · 8 comments

@lovebeatz commented Jul 27, 2024

My query relates to Llama 3 8B (8k native context): running it with a Q4 cache at 64k max seq length used to take around 8788 MB of GPU VRAM. Does the increased context length of 3.1 now save us VRAM? If it does, what context length should I put in, and with what kind of cache?

@turboderp (Owner)

I'm sorry, but I don't think I really understand the question?

@lovebeatz (Author)

I'll elaborate. Consider Llama 3 and Llama 3.1 8B for comparison, at the same exl2 quantization. Llama 3 has an 8k context length, and a Q4 cache could take it to 64k with a ~87xx MB VRAM requirement. Would Llama 3.1, which offers a 128k context length, come with lower VRAM usage? If that's possible, with what cache choice is it possible? And more specifically, what should max_seq_len be set to for that cache, since we won't need exllamav2 to add any more context length on top of it?

@DocShotgun

Don’t quite understand the logic here.

L3.1 has the exact same architecture as L3, so it uses the same amount of VRAM per token of context. So 128k context of L3.1 will use 16 times as much VRAM as 8k context of L3. Quantization size of the weights also does not affect the VRAM cost of context.
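
To put rough numbers on that, here's a quick back-of-the-envelope sketch (assuming Llama 3 8B's published architecture of 32 layers and 8 KV heads with head dim 128; the Q8/Q4 factors are approximations that ignore per-block scale overhead):

```python
# Rough KV-cache size estimate for Llama 3 / 3.1 8B (identical architecture).
# Assumed: 32 layers, 8 KV heads (GQA), head_dim 128. Quantized-cache sizes
# are approximate and ignore the small overhead of scales/zero-points.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2.0):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem  # 2x for K and V

GiB = 1024 ** 3
for ctx in (8 * 1024, 32 * 1024, 64 * 1024, 128 * 1024):
    fp16 = kv_cache_bytes(ctx, bytes_per_elem=2.0)   # unquantized FP16 cache
    q8   = kv_cache_bytes(ctx, bytes_per_elem=1.0)   # ~Q8 cache
    q4   = kv_cache_bytes(ctx, bytes_per_elem=0.5)   # ~Q4 cache
    print(f"{ctx // 1024}k tokens: FP16 {fp16 / GiB:.2f} GiB | Q8 ~{q8 / GiB:.2f} GiB | Q4 ~{q4 / GiB:.2f} GiB")
```

That works out to roughly 1 GiB of FP16 cache at 8k and roughly 16 GiB at 128k, which is where the 16x figure comes from. The weights (5.0 bpw here) are a fixed cost on top of that.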

@ghost commented Jul 27, 2024

I don't think they added any VRAM-reducing tech in 3.1; it'll use more VRAM if you run it at 128k.

You can check how much a model uses with this; just be sure to find a reupload of 3.1 to use with it.

@remichu-ai

Yeah, I don't think there's any such thing as VRAM reduction from a longer supported context and a bigger cache. The model architecture is the same, and a higher context and cache requires more VRAM. Not sure why OP thinks it's the other way around.

@lovebeatz (Author)

Let me put it another way:
supported_context_length=8k, cache set to 64k
supported_context_length=128k, cache set to 128k
supported_context_length=128k, cache set to 32k

Consider these scenarios: since I could set the cache max seq length to 64k for a model that only supports an 8k context length, is the model's own increased context length of no use here, and does it additionally not affect the VRAM usage?

Or is my understanding of context length and cache wrong?

@DocShotgun commented Jul 28, 2024

The VRAM usage is determined entirely by the model architecture, cache size, and cache type. The model's supported context length has no influence on VRAM usage.
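
For example, with the exllamav2 Python API the cache is sized by whatever max_seq_len you pass in, not by the model card. A minimal sketch (the model path is a placeholder, and the Q8 cache class requires a reasonably recent exllamav2 build):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q8

# Placeholder path to a 5.0 bpw exl2 quant of Llama 3.1 8B
config = ExLlamaV2Config("/models/Llama-3.1-8B-Instruct-5.0bpw-exl2")
config.max_seq_len = 32768   # the context you actually want, not the 128k the model supports

model = ExLlamaV2(config)

# Cache VRAM cost = architecture x this max_seq_len x cache type (Q8 here),
# regardless of the model's advertised 128k context.
cache = ExLlamaV2Cache_Q8(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache)
```

Swap in ExLlamaV2Cache (FP16) or ExLlamaV2Cache_Q4 to trade cache precision against VRAM at the same 32k.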

@lovebeatz (Author)

Claude has an answer for me; I guess I'll be setting the max length to 16k or 32k for my use case. BTW, could it be 20k too?

You're absolutely right, and you've touched on some important nuances. Let's break this down:

1. Processing time:
   If both models are set to a 64k max_seq_length cache, the processing time for sequences up to 8k tokens would be very similar. However, for longer sequences:
   - Llama 3 (8k context) would need to use techniques like sliding windows or chunking to process longer contexts, which could increase processing time and potentially reduce quality.
   - Llama 3.1 (128k context) could process the longer sequence in one go, potentially being faster and maintaining better coherence.
2. Quality differences:
   You're correct that there would likely be quality issues for Llama 3 when dealing with very long contexts. The key differences are:
   - Context understanding: Llama 3.1 can "see" the entire context up to 128k tokens, allowing for better overall understanding and coherence.
   - Information retrieval: Llama 3.1 can directly access information from much earlier in the context, which Llama 3 might "forget" due to its shorter context window.
3. Edge for longer context support:
   The advantages of longer context support become apparent in scenarios like:
   - Long document analysis
   - Summarizing extensive conversations or documents
   - Maintaining coherence over very long interactions
   - Tasks requiring integration of information spread across a large context
4. Differences in shorter conversations:
   You're correct that for conversations or tasks under 8k tokens, the difference between the two models would be minimal in terms of context handling. The primary differences in such cases would be:
   - Model improvements: Llama 3.1 might have general improvements in its architecture or training, independent of context length.
   - Potential for longer responses: Even in shorter conversations, Llama 3.1 has more "room" to generate longer responses if needed.
5. Visibility of differences:
   You're spot on that the real differences would become apparent in scenarios with very long contexts, like:
   - Analyzing a long research paper
   - Maintaining context over a day-long chat interaction
   - Summarizing an entire book
   - Complex coding tasks involving large codebases

In these cases, Llama 3.1's ability to handle up to 128k tokens would provide a significant advantage in maintaining coherence, accuracy, and relevance throughout the task.

To sum up, while the VRAM usage might be similar if you set the same cache size, the longer context model (Llama 3.1) offers significant advantages in handling extended contexts, which becomes crucial for certain types of tasks and interactions. The benefits may not be apparent in shorter interactions but become increasingly important as the context length grows.
