
Q8 or unquantized cache with what context length for llama 3.1-8b 5.0 bpw exl2? #575

Open · lovebeatz opened this issue on Jul 27, 2024 · 8 comments

@lovebeatz commented Jul 27, 2024

My query relates to Llama 3 8B (8k native context): running it with a Q4 cache at 64k max seq length used to take around 8788 MB of GPU VRAM. Does the increased context length of 3.1 now save us VRAM? If it does, what context length should I put in, and with what kind of cache?

@turboderp (Owner)

I'm sorry, but I don't think I really understand the question?

@lovebeatz (Author)

I'll elaborate. Consider Llama 3 and Llama 3.1 8B for comparison, at the same exl2 quantization. Llama 3 has an 8k context length, and a Q4 cache could take it to 64k with a ~87xx MB VRAM requirement. Would Llama 3.1, which offers a 128k context length, come with lower VRAM usage? If that's possible, with what cache choice is it possible? And more specifically, what should max_seq_len be set to for that cache, since we won't need exllamav2 to add any more context length on top of it?

@DocShotgun

Don’t quite understand the logic here.

L3.1 has the exact same architecture as L3, so it uses the same amount of VRAM per token of context. So 128k context of L3.1 will use 16 times as much VRAM as 8k context of L3. Quantization size of the weights also does not affect the VRAM cost of context.
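
To put rough numbers on that, here's a quick back-of-the-envelope sketch (assuming Llama 3 8B's published architecture of 32 layers and 8 KV heads with head dim 128; the Q8/Q4 factors are approximations that ignore per-block scale overhead):

```python
# Rough KV-cache size estimate for Llama 3 / 3.1 8B (identical architecture).
# Assumed: 32 layers, 8 KV heads (GQA), head_dim 128. Quantized-cache sizes
# are approximate and ignore the small overhead of scales/zero-points.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2.0):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem  # 2x for K and V

GiB = 1024 ** 3
for ctx in (8 * 1024, 32 * 1024, 64 * 1024, 128 * 1024):
    fp16 = kv_cache_bytes(ctx, bytes_per_elem=2.0)   # unquantized FP16 cache
    q8   = kv_cache_bytes(ctx, bytes_per_elem=1.0)   # ~Q8 cache
    q4   = kv_cache_bytes(ctx, bytes_per_elem=0.5)   # ~Q4 cache
    print(f"{ctx // 1024}k tokens: FP16 {fp16 / GiB:.2f} GiB | Q8 ~{q8 / GiB:.2f} GiB | Q4 ~{q4 / GiB:.2f} GiB")
```

That works out to roughly 1 GiB of FP16 cache at 8k and roughly 16 GiB at 128k, which is where the 16x figure comes from. The weights (5.0 bpw here) are a fixed cost on top of that.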

@ghost commented Jul 27, 2024

I don't think they added any VRAM-reducing tech in 3.1; it'll use more VRAM if you run it at 128k.

You can check how much a model uses with this; just be sure to find a reupload of 3.1 to use with it.

@remichu-ai

Yeah, I don't think there's any such thing as VRAM reduction from a longer supported context and a bigger cache. The model architecture is the same, and a higher context and cache requires more VRAM. Not sure why OP thinks it's the other way around.

@lovebeatz (Author)

Let me put it another way:
supported_context_length=8k, cache set to 64k
supported_context_length=128k, cache set to 128k
supported_context_length=128k, cache set to 32k

Consider these scenarios: since I could set the cache max seq length to 64k for a model that only supports an 8k context length, is the model's own increased context length of no use here, and does it additionally not affect the VRAM usage?

Or is my understanding of context length and cache wrong?

@DocShotgun commented Jul 28, 2024

The VRAM usage is determined entirely by the model architecture, cache size, and cache type. The model's supported context length has no influence on VRAM usage.
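
For example, with the exllamav2 Python API the cache is sized by whatever max_seq_len you pass in, not by the model card. A minimal sketch (the model path is a placeholder, and the Q8 cache class requires a reasonably recent exllamav2 build):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q8

# Placeholder path to a 5.0 bpw exl2 quant of Llama 3.1 8B
config = ExLlamaV2Config("/models/Llama-3.1-8B-Instruct-5.0bpw-exl2")
config.max_seq_len = 32768   # the context you actually want, not the 128k the model supports

model = ExLlamaV2(config)

# Cache VRAM cost = architecture x this max_seq_len x cache type (Q8 here),
# regardless of the model's advertised 128k context.
cache = ExLlamaV2Cache_Q8(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache)
```

Swap in ExLlamaV2Cache (FP16) or ExLlamaV2Cache_Q4 to trade cache precision against VRAM at the same 32k.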

@lovebeatz (Author)

Claude has an answer for me; I guess I'll be setting the max length to 16k or 32k for my use case. BTW, could it be 20k too?

You're absolutely right, and you've touched on some important nuances. Let's break this down:

1. Processing time:
   If both models are set to a 64k max_seq_length cache, the processing time for sequences up to 8k tokens would be very similar. However, for longer sequences:
   - Llama 3 (8k context) would need to use techniques like sliding windows or chunking to process longer contexts, which could increase processing time and potentially reduce quality.
   - Llama 3.1 (128k context) could process the longer sequence in one go, potentially being faster and maintaining better coherence.
2. Quality differences:
   You're correct that there would likely be quality issues for Llama 3 when dealing with very long contexts. The key differences are:
   - Context understanding: Llama 3.1 can "see" the entire context up to 128k tokens, allowing for better overall understanding and coherence.
   - Information retrieval: Llama 3.1 can directly access information from much earlier in the context, which Llama 3 might "forget" due to its shorter context window.
3. Edge for longer context support:
   The advantages of longer context support become apparent in scenarios like:
   - Long document analysis
   - Summarizing extensive conversations or documents
   - Maintaining coherence over very long interactions
   - Tasks requiring integration of information spread across a large context
4. Differences in shorter conversations:
   You're correct that for conversations or tasks under 8k tokens, the difference between the two models would be minimal in terms of context handling. The primary differences in such cases would be:
   - Model improvements: Llama 3.1 might have general improvements in its architecture or training, independent of context length.
   - Potential for longer responses: Even in shorter conversations, Llama 3.1 has more "room" to generate longer responses if needed.
5. Visibility of differences:
   You're spot on that the real differences would become apparent in scenarios with very long contexts, like:
   - Analyzing a long research paper
   - Maintaining context over a day-long chat interaction
   - Summarizing an entire book
   - Complex coding tasks involving large codebases

In these cases, Llama 3.1's ability to handle up to 128k tokens would provide a significant advantage in maintaining coherence, accuracy, and relevance throughout the task.

To sum up, while the VRAM usage might be similar if you set the same cache size, the longer context model (Llama 3.1) offers significant advantages in handling extended contexts, which becomes crucial for certain types of tasks and interactions. The benefits may not be apparent in shorter interactions but become increasingly important as the context length grows.
