
Suggestion to Implement Additional Efficient Transformer Variants #1

rajveer43 opened this issue Aug 24, 2024 · 1 comment

rajveer43 commented Aug 24, 2024

Description:

Hello! I appreciate the excellent work on benchmarking Performer and Longformer against the base Transformer. I’d like to propose the implementation of additional efficient Transformer variants to further extend the benchmarking scope. This could provide a more comprehensive comparison and serve as a valuable resource for the community.

Suggested Models:

Reformer:

Description:

Reformer introduces two key innovations: locality-sensitive hashing (LSH) attention, which reduces the attention complexity from O(N^2) to O(N log N), and reversible residual layers, which reduce memory consumption during training.

Reference Paper:

Reformer: The Efficient Transformer

Implementation Considerations: Implementing the LSH attention mechanism and reversible layers within the current framework could provide significant memory and time savings, especially for long sequences.
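To make the LSH idea concrete, here is a hypothetical, heavily simplified PyTorch sketch (not from this repository): a single hash round assigns vectors to buckets via random rotations, and attention is masked to positions in the same bucket. For clarity it still materializes the full score matrix, so the real O(N log N) savings would require the bucket sorting and chunking from the paper; the function names and parameters are illustrative only.

```python
# Minimal sketch of LSH-bucketed attention in the spirit of Reformer.
# Assumptions: single hash round, no reversible layers, no causal mask,
# no bucket balancing; the mask only illustrates the restricted attention pattern.
import torch
import torch.nn.functional as F

def lsh_hash(x, n_buckets, seed=0):
    """Assign each vector to a bucket via random rotations (angular LSH)."""
    d = x.shape[-1]
    gen = torch.Generator().manual_seed(seed)
    # Random projection; buckets come from the argmax over [Rx, -Rx].
    r = torch.randn(d, n_buckets // 2, generator=gen)
    proj = x @ r                                    # (seq_len, n_buckets / 2)
    return torch.argmax(torch.cat([proj, -proj], dim=-1), dim=-1)

def lsh_attention(q, k, v, n_buckets=8):
    """Attend only among positions that fall into the same hash bucket."""
    buckets = lsh_hash(k, n_buckets)                # (seq_len,)
    scores = q @ k.t() / q.shape[-1] ** 0.5         # (seq_len, seq_len)
    same_bucket = buckets.unsqueeze(0) == buckets.unsqueeze(1)
    scores = scores.masked_fill(~same_bucket, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = torch.randn(16, 32)   # Reformer ties the query and key projections
v = torch.randn(16, 32)
print(lsh_attention(q, k, v).shape)   # torch.Size([16, 32])
```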

Linformer:

Description:

Linformer approximates the self-attention mechanism with linear complexity by projecting the key and value matrices to lower dimensions. This makes the attention computation linear with respect to the sequence length.

Reference Paper:

Linformer: Self-Attention with Linear Complexity

Implementation Considerations: The key challenge will be reducing the dimensionality of the key and value matrices effectively without compromising the model's performance.

BigBird:

Description:

BigBird uses a combination of global, local, and random attention mechanisms to handle sequences of up to thousands of tokens efficiently. It’s especially beneficial for tasks like long document classification.

Reference Paper:

Big Bird: Transformers for Longer Sequences

Implementation Considerations: Adapting the attention mechanism to combine global, local, and random attention will be critical, since that combination is what lets the model process longer sequences efficiently.
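A minimal sketch of how the three patterns could be combined into a single boolean connectivity mask, assuming PyTorch (the function name and parameters are hypothetical). Real BigBird implementations use blocked sparse kernels rather than masking a dense score matrix, so this only illustrates the attention pattern, not the speedup:

```python
# Sketch of a BigBird-style sparse attention mask combining
# global, sliding-window, and random connections.
import torch

def bigbird_mask(seq_len, window=3, n_global=2, n_random=2, seed=0):
    gen = torch.Generator().manual_seed(seed)
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    # Sliding-window (local) attention around each position.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    # Global tokens attend everywhere and are attended to by every position.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    # A few random key positions per query.
    for i in range(seq_len):
        rand_keys = torch.randint(0, seq_len, (n_random,), generator=gen)
        mask[i, rand_keys] = True
    return mask

mask = bigbird_mask(seq_len=16)
scores = torch.randn(16, 16).masked_fill(~mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)
print(mask.float().mean())   # fraction of key positions each query attends to
```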

Synthesizer:

Description:

Synthesizer replaces the dot-product self-attention mechanism with synthetic attention: the attention weights are either predicted from each token individually via a feed-forward projection (Dense Synthesizer) or taken from a learned, input-independent matrix (Random Synthesizer), aiming to simplify the attention computation while maintaining performance.

Reference Paper:

Synthesizer: Rethinking Self-Attention in Transformer Models

Implementation Considerations: Implementing synthetic attention mechanisms would provide an interesting comparison to traditional dot-product attention, especially in terms of performance and computational cost.
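A minimal PyTorch sketch of the two Synthesizer variants, assuming a fixed maximum sequence length; the class names are made up for this example and both heads are single-headed for brevity:

```python
# Sketch of Dense and Random Synthesizer attention: the attention matrix
# is synthesized without any query-key dot products.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizer(nn.Module):
    """Each token predicts its own row of attention weights from its features."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.to_weights = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, seq_len)
        )
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                              # x: (batch, seq_len, dim)
        attn = F.softmax(self.to_weights(x), dim=-1)   # (batch, seq_len, seq_len)
        return attn @ self.to_v(x)

class RandomSynthesizer(nn.Module):
    """A single learned attention matrix shared across all inputs."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.attn_logits = nn.Parameter(torch.randn(seq_len, seq_len))
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        attn = F.softmax(self.attn_logits, dim=-1)     # (seq_len, seq_len)
        return attn @ self.to_v(x)

x = torch.randn(2, 32, 64)
print(DenseSynthesizer(64, 32)(x).shape, RandomSynthesizer(64, 32)(x).shape)
```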

Looking forward to your thoughts on this!

shashank3009 (Collaborator) commented

Hi Rajveer, thank you for your feedback and for sharing these other variants. We will surely add implementations of other efficient Transformer variants to this repository, and you are more than welcome to contribute to the same.
