Confused about the four attention mechanisms and their performance mentioned in the paper #33
Thank you for your interest in our paper; I appreciate your insightful questions. Here are my clarifications:
I hope this addresses your confusion. Thanks.
I'd like to clarify that the issue isn't related to absolute vs. relative positional encoding. Our current results were obtained with models that use relative position encodings, such as Llama and MPT. The core of the matter is whether the context keys have been computed with or without preceding tokens. Attention sinks refer to keys computed without prior tokens, so such keys are present in the sliding-window-with-recomputation baseline. In the window attention baseline, however, all context keys are computed from numerous preceding tokens, and the model is trained to recognize that these aren't attention sinks.
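To make the distinction concrete, here is a minimal sketch of how the two baselines obtain their context keys. This is not the repository's code; `encode_window`, `append_kv`, and `decode_step` are hypothetical helpers used only to show which prefix each key is computed from.

```python
def sliding_window_with_recomputation(model, tokens, window_size):
    # Re-encode the last `window_size` tokens from scratch at every step.
    # The first token of the window is encoded with *no* preceding tokens,
    # so its key looks like the initial-token keys the model saw during
    # training and can serve as an attention sink.
    window = tokens[-window_size:]
    keys, values = model.encode_window(window)      # fresh keys, empty prefix
    return model.decode_step(keys, values)

def window_attention(model, tokens, window_size, kv_cache):
    # Reuse cached keys/values and simply evict the oldest entries.
    # Every cached key was computed while attending to many preceding tokens,
    # so no key in the cache resembles an initial token, and the model has
    # no attention sink to dump excess attention onto.
    kv_cache = kv_cache[-window_size:]
    keys, values = model.append_kv(tokens[-1], kv_cache)
    return model.decode_step(keys, values)
```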
Got it! Thanks for your kind reply.
@weizhenhuan Hey zhenhuan, I am also interested in Figure 1 and spent some time figuring it out. I put my thoughts in #42, and I'd appreciate it if you could correct me if I've made any mistakes. Thank you!
Hi Guangxuan, could you give some insight on the attention sink for the 1st token? If we have 3 existing tokens (a, b, c), the joint probability when generating the 4th token d factorizes as p(a, b, c, d) = p(a) * p(b|a) * p(c|a, b) * p(d|a, b, c), so token d does depend on the first token. However, I do not see why the attention score on a should be larger than on the rest, as shown in your early 2020 paper. I did a quick test using the GPT-2 decoder, and the attention score of the first token was not the highest either. The question is: will the attention score of the 1st token always be the highest among all? If not, why is removing the 1st token a problem?
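For anyone who wants to repeat that quick GPT-2 check, here is a minimal, runnable sketch with Hugging Face transformers (the prompt is arbitrary; note that the paper observes the sink pattern mainly in layers beyond the bottom two, and it is weaker on short inputs like this one):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog and keeps on running"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
# Check how much the *last* token attends to the *first* token in each layer.
for i, attn in enumerate(out.attentions):
    last_row = attn[0, :, -1, :].mean(dim=0)  # average over heads
    print(f"layer {i:2d}: attn to token 0 = {last_row[0]:.3f}, "
          f"max attn = {last_row.max():.3f}")
```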
This is very interesting. In fact, I suppose the main difference between the "initial tokens" and the "middle tokens" is the positional embedding. For RoPE-style positional embeddings, the attention contribution decays with relative distance, so the information from a far-away token almost vanishes when the context is super long and the relative positional index is very large. From this perspective, maybe SOTA positional embedding schemes like RoPE-NTK, YaRN, etc. would bridge the gap between SWA and SWA-recompute to a certain extent.
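For reference, here is a minimal sketch of the standard RoPE rotation (the generic formulation, not any particular model's implementation): each pair of query/key dimensions is rotated by an angle proportional to the absolute position, so the attention logit between two tokens depends on their relative distance, with the long-distance decay argued in the original RoPE paper.

```python
import torch

def apply_rope(x, positions, base=10000.0):
    # x: (seq_len, dim) query or key vectors, dim must be even.
    # positions: (seq_len,) integer positions.
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions.float()[:, None] * inv_freq[None, :]   # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                        # paired dimensions
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                                 # back to (seq_len, dim)
```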
I have a similar feeling: how does the model tell which token is the initial one? Positional embedding is highly possible. So I am wondering whether they tried to re-assign the first token in the sliding window the positional information of an initial token, to make it an "initial token".
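That is essentially what the paper describes for StreamingLLM: relative positions are assigned within the rolling cache rather than within the original text, so the retained sink tokens always sit at positions 0, 1, 2, ... A rough sketch of the idea (the function and parameter names here are illustrative, not from the repository):

```python
def streaming_cache(stream_len, num_sinks=4, recent_window=1020):
    # Token indices (into the original stream) kept in the KV cache:
    # the first `num_sinks` tokens plus the most recent `recent_window` tokens.
    kept = list(range(min(num_sinks, stream_len)))
    kept += list(range(max(num_sinks, stream_len - recent_window), stream_len))
    # Positions used for the rotary embedding are re-assigned *within the
    # cache*, not taken from the original stream, so the sink tokens keep
    # positions 0..3 no matter how long the stream has run.
    positions = list(range(len(kept)))
    return kept, positions

# e.g. after 10,000 tokens the cache holds stream tokens [0..3] and
# [8980..9999], but they are embedded at cache positions 0..1023.
```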
Nice idea, and it really works well! Thanks for your nice work. But I have some questions. The paper mentions four attention mechanisms: dense attention fails because the output grows longer than the training length and thus mismatches the lengths seen during training; window attention fails because it evicts the initial tokens' KV cache; but for sliding window with re-computation and streaming attention, I still have some questions (see the sketch contrasting the four cache policies below).
Thanks for your nice work again. Hope to get a reply.
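For readers skimming this thread, here is a rough sketch contrasting the four cache policies compared in the paper (illustrative pseudocode only; `cached_kv` and `recompute_kv` are hypothetical placeholders, W is the window size, S the number of attention sinks):

```python
def dense_attention(tokens):
    # Keep and attend to everything: O(T^2) cost, and quality degrades once
    # the sequence exceeds the pretraining length.
    return cached_kv(tokens)

def window_attention(tokens, W):
    # Reuse cached KV for the most recent W tokens only; once the initial
    # tokens' KV is evicted, perplexity blows up.
    return cached_kv(tokens)[-W:]

def sliding_window_with_recomputation(tokens, W):
    # Rebuild the KV for the last W tokens from scratch at every step:
    # strong quality (the window's first token is re-encoded like an initial
    # token), but each decoding step costs O(W^2).
    return recompute_kv(tokens[-W:])

def streaming_attention(tokens, W, S=4):
    # StreamingLLM: keep the first S tokens' cached KV as attention sinks
    # plus a rolling window of recent tokens; O(W) per step, stable perplexity.
    return cached_kv(tokens[:S]) + cached_kv(tokens[-(W - S):])
```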