question about re-computation #51
Comments
Hi, thank you for your interest in our work! The "Sliding Window with Re-computation" method is commonly used in language modeling evaluations to calculate perplexity given a certain number of context tokens. Hugging Face provides a comprehensive tutorial on this topic. You can check it out here: https://huggingface.co/docs/transformers/perplexity. Best,
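For reference, here is a condensed sketch of that evaluation loop, closely following the linked Hugging Face perplexity tutorial. GPT-2 and the hyperparameters (`max_length`, `stride`) are illustrative choices; with `stride = 1` every step re-encodes the whole window from scratch, which is exactly the "re-computation" baseline discussed here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

encodings = tokenizer("some long evaluation text ...", return_tensors="pt")

max_length = model.config.n_positions  # window size L (1024 for GPT-2)
stride = 1                             # slide one token at a time: full re-computation
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in range(0, seq_len, stride):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc   # number of new tokens scored this step
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100    # ignore pure-context tokens in the loss

    with torch.no_grad():
        # The whole window is re-encoded from scratch here, with no KV cache,
        # which is what makes this baseline O(L^2) per step.
        outputs = model(input_ids, labels=target_ids)
        nlls.append(outputs.loss * trg_len)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
print(ppl)
```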
Hi Guangxuan. Could you please explain how sliding window with recomputation leads to O(TL^2) complexity? In my opinion, computing the keys and values for the first L-1 tokens should only depend on the weight matrices and the dimension of the KVs. How does this computation take O(L^2) per token? I don't see a need to compute attention scores for the initial L-1 tokens, which is what would take O(L^2). Thank you.
@rituraj2847 Your argument would have been valid had there been only one layer in the transformer, i.e. we would only need a forward pass for each token through the K and V weight matrices [O(L)]. Unfortunately, the output of the attention block of the first layer (attained through (Q @ K.T) @ V) is the input embedding of the second layer. Hence, each time you change the positional encoding of the tokens, it takes O(L^2) ops to run one generation iteration for a new token, since we need to recompute self-attention for all the tokens in the window. Since the sequence length is T, the overall complexity becomes O(TL^2). Only for the last layer can we avoid computing self-attention for all the initial tokens; there, we only need the attention of the initial tokens applied to the query vector of the last token. I also had the same question as you and had to chalk out my understanding of transformers on a piece of paper in order to understand this.
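A hypothetical sketch of one generation loop under this scheme may make the counting clearer. `forward_full` is an illustrative stand-in for a full forward pass with no KV cache, not an API from this repository.

```python
def generate_with_recomputation(model, prompt_ids, num_new_tokens, L):
    """Sliding window with re-computation: every new token re-encodes the last L tokens."""
    ids = list(prompt_ids)
    for _ in range(num_new_tokens):       # T generation steps in total
        window = ids[-L:]                 # last L tokens, re-positioned starting from 0
        # The positions of every token in the window have shifted, so the hidden
        # states of every layer must be recomputed; self-attention over L tokens
        # costs O(L^2) per step, and no cached keys/values can be reused.
        logits = model.forward_full(window)   # hypothetical full-recompute forward
        ids.append(int(logits[-1].argmax()))
    return ids                            # overall cost: O(T * L^2)
```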
Thank you. Understood :)
|
Hello. After reading the paper and the code, I have a question about the Sliding Window w/ Re-computation method: is there any code or paper about this method for reference? Thank you a lot.