
Suggestion to Implement Additional Efficient Transformer Variants #1

rajveer43 opened this issue Aug 24, 2024 · 1 comment

rajveer43 commented Aug 24, 2024

Description:

Hello! I appreciate the excellent work on benchmarking Performer and Longformer against the base Transformer. I’d like to propose the implementation of additional efficient Transformer variants to further extend the benchmarking scope. This could provide a more comprehensive comparison and serve as a valuable resource for the community.

Suggested Models:

Reformer:

Description:

Reformer introduces two key innovations: locality-sensitive hashing (LSH) attention, which reduces the attention complexity from O(N^2) to O(N log N), and reversible residual layers, which reduce memory consumption during training.

Reference Paper:

Reformer: The Efficient Transformer

Implementation Considerations: Implementing the LSH attention mechanism and reversible layers within the current framework could provide significant memory and time savings, especially for long sequences.
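To make the LSH idea concrete, here is a hypothetical, heavily simplified PyTorch sketch (not from this repository): a single hash round assigns vectors to buckets via random rotations, and attention is masked to positions in the same bucket. For clarity it still materializes the full score matrix, so the real O(N log N) savings would require the bucket sorting and chunking from the paper; the function names and parameters are illustrative only.

```python
# Minimal sketch of LSH-bucketed attention in the spirit of Reformer.
# Assumptions: single hash round, no reversible layers, no causal mask,
# no bucket balancing; the mask only illustrates the restricted attention pattern.
import torch
import torch.nn.functional as F

def lsh_hash(x, n_buckets, seed=0):
    """Assign each vector to a bucket via random rotations (angular LSH)."""
    d = x.shape[-1]
    gen = torch.Generator().manual_seed(seed)
    # Random projection; buckets come from the argmax over [Rx, -Rx].
    r = torch.randn(d, n_buckets // 2, generator=gen)
    proj = x @ r                                    # (seq_len, n_buckets / 2)
    return torch.argmax(torch.cat([proj, -proj], dim=-1), dim=-1)

def lsh_attention(q, k, v, n_buckets=8):
    """Attend only among positions that fall into the same hash bucket."""
    buckets = lsh_hash(k, n_buckets)                # (seq_len,)
    scores = q @ k.t() / q.shape[-1] ** 0.5         # (seq_len, seq_len)
    same_bucket = buckets.unsqueeze(0) == buckets.unsqueeze(1)
    scores = scores.masked_fill(~same_bucket, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = torch.randn(16, 32)   # Reformer ties the query and key projections
v = torch.randn(16, 32)
print(lsh_attention(q, k, v).shape)   # torch.Size([16, 32])
```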

Linformer:

Description:

Linformer approximates the self-attention mechanism with linear complexity by projecting the key and value matrices to lower dimensions. This makes the attention computation linear with respect to the sequence length.

Reference Paper:

Linformer: Self-Attention with Linear Complexity

Implementation Considerations: The key challenge will be reducing the dimensionality of the key and value matrices effectively without compromising the model's performance.

BigBird:

Description:

BigBird uses a combination of global, local, and random attention mechanisms to handle sequences of up to thousands of tokens efficiently. It’s especially beneficial for tasks like long document classification.

Reference Paper:

Big Bird: Transformers for Longer Sequences

Implementation Considerations: Adapting the attention mechanism to combine global, local, and random attention will be critical, since that combination is what lets the model process longer sequences efficiently.
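A minimal sketch of how the three patterns could be combined into a single boolean connectivity mask, assuming PyTorch (the function name and parameters are hypothetical). Real BigBird implementations use blocked sparse kernels rather than masking a dense score matrix, so this only illustrates the attention pattern, not the speedup:

```python
# Sketch of a BigBird-style sparse attention mask combining
# global, sliding-window, and random connections.
import torch

def bigbird_mask(seq_len, window=3, n_global=2, n_random=2, seed=0):
    gen = torch.Generator().manual_seed(seed)
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    # Sliding-window (local) attention around each position.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    # Global tokens attend everywhere and are attended to by every position.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    # A few random key positions per query.
    for i in range(seq_len):
        rand_keys = torch.randint(0, seq_len, (n_random,), generator=gen)
        mask[i, rand_keys] = True
    return mask

mask = bigbird_mask(seq_len=16)
scores = torch.randn(16, 16).masked_fill(~mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)
print(mask.float().mean())   # fraction of key positions each query attends to
```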

Synthesizer:

Description:

Synthesizer replaces the dot-product self-attention mechanism with synthetic attention: the attention weights are either predicted from each token individually via a feed-forward projection (Dense Synthesizer) or taken from a learned, input-independent matrix (Random Synthesizer), aiming to simplify the attention computation while maintaining performance.

Reference Paper:

Synthesizer: Rethinking Self-Attention in Transformer Models

Implementation Considerations: Implementing synthetic attention mechanisms would provide an interesting comparison to traditional dot-product attention, especially in terms of performance and computational cost.
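A minimal PyTorch sketch of the two Synthesizer variants, assuming a fixed maximum sequence length; the class names are made up for this example and both heads are single-headed for brevity:

```python
# Sketch of Dense and Random Synthesizer attention: the attention matrix
# is synthesized without any query-key dot products.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizer(nn.Module):
    """Each token predicts its own row of attention weights from its features."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.to_weights = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, seq_len)
        )
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                              # x: (batch, seq_len, dim)
        attn = F.softmax(self.to_weights(x), dim=-1)   # (batch, seq_len, seq_len)
        return attn @ self.to_v(x)

class RandomSynthesizer(nn.Module):
    """A single learned attention matrix shared across all inputs."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.attn_logits = nn.Parameter(torch.randn(seq_len, seq_len))
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        attn = F.softmax(self.attn_logits, dim=-1)     # (seq_len, seq_len)
        return attn @ self.to_v(x)

x = torch.randn(2, 32, 64)
print(DenseSynthesizer(64, 32)(x).shape, RandomSynthesizer(64, 32)(x).shape)
```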

Looking forward to your thoughts on this!

shashank3009 (Collaborator) commented

Hi Rajveer, thank you for your feedback and for sharing these other variants. We will surely add implementations of other efficient Transformer variants to this repository, and you are more than welcome to contribute to the same.
