
Question about long input and difference between streaming-llm and dense attention. #41

Closed
hxs91 opened this issue Oct 16, 2023 · 2 comments


@hxs91

hxs91 commented Oct 16, 2023

Thank you for your nice work. I have read issue #33, and thank you for your patient explanation of the difference between StreamingLLM and dense attention. Based on your answer, I have further questions:

  1. As you mentioned in FAQ 3, I guess StreamingLLM processes long input in a truncation manner. But if I don't mind the expensive computation and handle the long input in a StreamingLLM manner (recomputation from the very beginning of the text to the end, with attention sinks, window by window), will it theoretically perform better than truncation? Did you perform any experiments? (A sketch contrasting the two cache policies follows this list.)

  2. For a model with a large context size (say 16K), we can still run StreamingLLM with a shorter context size (say 4K). Have you compared dense attention at 16K against StreamingLLM at 4K? Theoretically, streaming-4K should perform worse than dense-16K; what is the gap? This is important if one wants to use StreamingLLM to approximate the performance of a larger window size.
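For concreteness, here is a minimal sketch of the two cache policies I am comparing in question 1 (this is not the repo's actual API; the function names and the `n_sink`/`window` defaults are illustrative). Plain truncation keeps only the most recent tokens, while StreamingLLM-style eviction additionally pins the first few "attention sink" tokens. Both assume a per-layer KV cache shaped `[batch, heads, seq_len, head_dim]`:

```python
import torch

def truncate_kv(k, v, max_len):
    # Plain truncation: keep only the most recent max_len positions.
    return k[:, :, -max_len:], v[:, :, -max_len:]

def streaming_evict_kv(k, v, n_sink=4, window=2044):
    # StreamingLLM-style eviction: pin the first n_sink "attention sink"
    # tokens, keep the most recent `window` tokens, and drop the middle.
    if k.size(2) <= n_sink + window:
        return k, v  # nothing to evict yet
    k = torch.cat([k[:, :, :n_sink], k[:, :, -window:]], dim=2)
    v = torch.cat([v[:, :, :n_sink], v[:, :, -window:]], dim=2)
    return k, v
```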

@Guangxuan-Xiao
Collaborator

Hello,

Thank you for your thoughtful questions. Let's delve into the details:

  1. Regarding streaming-llm processing with truncation vs. re-computation:

    • In our paper, we have results that touch on this topic. The baseline you're referring to is the "sliding window with re-computation." As Figure 3 of our paper shows, StreamingLLM's perplexity is in line with this baseline, so StreamingLLM performs comparably when handling long inputs in the manner you described (a minimal sketch of this baseline appears after this list).
  2. Comparison between dense attention at 16k and StreamingLLM at 4k:

    • Firstly, dense attention is designed to function within its pre-training range. So, for the example you provided, it operates within the 0-16K range.
    • Within the [0, 4k] range, dense attention and StreamingLLM have equivalent perplexity scores.
    • For the range [4K, 16K], dense attention is likely to outperform StreamingLLM because it retains information from all previous tokens and thus has a broader context. The gap should be most apparent when the relevant information has been evicted from StreamingLLM's window.
    • Beyond the 16K mark (i.e., [16K, ∞)), the dense attention model won't be operational, whereas StreamingLLM will continue to function.
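For reference, a minimal sketch of the "sliding window with re-computation" baseline mentioned in point 1. This is not the paper's evaluation code; it assumes a Hugging Face-style causal LM whose forward call takes `input_ids` and returns `.logits`, and the `window` default is illustrative:

```python
import torch

@torch.no_grad()
def sliding_window_recompute_nll(model, input_ids, window=4096):
    # "Sliding window with re-computation": for every position t, re-run
    # the model from scratch on the `window` tokens ending at t and score
    # token t. Perplexity stays low, but attention costs O(T * window^2)
    # per sequence; StreamingLLM instead reuses its (sink + recent) KV
    # cache at roughly O(T * window).
    nlls = []
    for t in range(1, input_ids.size(1)):
        ctx = input_ids[:, max(0, t - window) : t + 1]   # recomputed context
        logits = model(ctx).logits[:, -2, :]             # prediction for token t
        nlls.append(torch.nn.functional.cross_entropy(logits, input_ids[:, t]))
    return torch.stack(nlls).mean()  # exp() of this value is the perplexity
```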

I hope this helps answer your questions.

Guangxuan

@hxs91
Author

hxs91 commented Oct 18, 2023

@Guangxuan-Xiao Got it, thank you for the answer. By the way, I hope to see some quantitative results on the gap in question 2. :)
