Question about evaluation results and demo #39

Closed
hsm1997 opened this issue Oct 13, 2023 · 2 comments

Comments

hsm1997 commented Oct 13, 2023

  1. I found the concept of "window attention" confusing. In Figure 1 there are two types of window attention: (b) naive window attention and (c) recompute window attention. Figure 3 shows that (c) behaves close to StreamingLLM in perplexity, but Table 1 says that "window" attention has poor perplexity, so I guess Table 1 uses (b)? Table 5 likewise says that "window" attention fails on the ARC benchmark, so I guess that is also (b)? Then Figure 10 says the speedup is benchmarked against (c). Could you benchmark all results with both "window attention" methods so the comparison is fair? Or did I miss something?
  2. Looking at your demo video and https://github.com/mit-han-lab/streaming-llm/blob/main/examples/run_streaming_llama.py , I don't quite understand why the model generates erroneous tokens (when "model performance breaks") if streaming is not enabled. Since the prompts are actually processed by the model one by one (#L63), I would expect the model to either run out of memory or keep generating good tokens. Where do the erroneous tokens come from?
  3. What is the exact pipeline of the ARC evaluation (Table 5)? Does the model "process q1 -> generate a1 with the evicted past cache of q1 -> process q2 with the evicted past cache of q1 and a1 -> generate a2 with the evicted past cache of q1, a1, and q2 -> ..." (which is what run_streaming_llama.py does), or "process [q1, q2, q3, ..., qn] -> generate [a1, a2, a3, ..., an]"?

Thanks in advance!

Guangxuan-Xiao (Collaborator) commented Oct 13, 2023

Thank you for taking a closer look at our paper and for raising these questions. Let me address them in detail:

  1. Window Attention Clarification: There are not two types of window attention. What you call c-recompute-window-attention is actually the "sliding window with re-computation" baseline, which is not window attention. For an in-depth explanation of how sliding window with re-computation works, please see this issue. Essentially, sliding window with re-computation is dense attention applied to truncated text. Both Table 1 and Table 5 use window attention, while Figure 10 specifically uses sliding window with re-computation; the distinction is stated explicitly in the respective captions. (The first code sketch after this list contrasts the two.)

  2. Streaming and Token Generation: While prompts are indeed processed individually, when streaming is disabled all prior inputs are retained in the KV cache. Consequently, once the length of the cached text surpasses the model's pre-trained chunk size, model performance degrades and it starts generating erroneous tokens. (The second sketch after this list illustrates this.)

  3. Evaluation Pipeline: The exact evaluation sequence follows the interleaved pattern [q1, a1, q2, a2, ...]; the second sketch below outlines this kind of loop.
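
To make the distinction in point 1 concrete, here is a minimal sketch, not code from this repo, assuming a Hugging Face-style decoder whose `past_key_values` is a per-layer tuple of `(k, v)` tensors shaped `[batch, heads, seq_len, head_dim]`; the function names and signatures are mine:

```python
import torch

@torch.no_grad()
def window_attention_step(model, next_token, past_key_values, cache_size):
    """Window attention (Tables 1 and 5): reuse the cached KV states and
    evict the oldest positions once the cache exceeds `cache_size`.
    Cheap per token, but the evicted initial tokens are gone, which is
    what hurts perplexity and ARC accuracy."""
    out = model(input_ids=next_token, past_key_values=past_key_values, use_cache=True)
    # Keep only the most recent `cache_size` positions in every layer's K/V.
    trimmed = [(k[:, :, -cache_size:], v[:, :, -cache_size:])
               for k, v in out.past_key_values]
    return out.logits[:, -1], trimmed

@torch.no_grad()
def sliding_window_recompute_step(model, all_ids, window_size):
    """Sliding window with re-computation (Figure 10): no cache is reused.
    For every generated token, the last `window_size` tokens are re-encoded
    from scratch, i.e. dense attention over truncated text -- good perplexity,
    but O(window_size^2) work per generated token, hence the slowdown."""
    out = model(input_ids=all_ids[:, -window_size:], use_cache=False)
    return out.logits[:, -1]
```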
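
And here is a hedged sketch of the multi-round loop behind points 2 and 3, under the same Hugging Face-style assumptions as above. `evict_to_sinks_plus_window` is a placeholder for the cache eviction (keep a few initial tokens plus a recent window), not the repo's actual API, and positional re-indexing inside the cache is omitted; the only difference between the streaming and non-streaming settings is whether this eviction happens:

```python
import torch

def evict_to_sinks_plus_window(past, n_sink=4, recent=1020):
    """Placeholder eviction: keep the first `n_sink` and the most recent
    `recent` KV positions in every layer (illustrative only)."""
    trimmed = []
    for k, v in past:
        if k.size(2) <= n_sink + recent:
            trimmed.append((k, v))
        else:
            trimmed.append((
                torch.cat([k[:, :, :n_sink], k[:, :, -recent:]], dim=2),
                torch.cat([v[:, :, :n_sink], v[:, :, -recent:]], dim=2),
            ))
    return trimmed

@torch.no_grad()
def multi_round(model, tokenizer, questions, streaming=True, max_new_tokens=128):
    past = None
    for q in questions:                 # process q1, generate a1, then q2, a2, ...
        ids = tokenizer(q, return_tensors="pt").input_ids
        answer = []
        for _ in range(max_new_tokens):
            out = model(input_ids=ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            if streaming:
                past = evict_to_sinks_plus_window(past)
            # With streaming=False nothing is ever evicted: the cache keeps
            # growing across rounds, and once it exceeds the pre-trained
            # context length the model starts emitting erroneous tokens
            # ("performance breaks") before memory actually runs out.
            ids = out.logits[:, -1].argmax(dim=-1, keepdim=True)
            if ids.item() == tokenizer.eos_token_id:
                break
            answer.append(ids.item())
        yield tokenizer.decode(answer)
```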

I hope these clarifications answer your questions. Feel free to let me know if there's anything else I can help with.

Guangxuan

hsm1997 closed this as completed Oct 15, 2023

hsm1997 (Author) commented Oct 15, 2023

Thanks again for your reply! It really helped me understand things better.
