Sparse Transformers in PyTorch: limited attention span and projection onto a smaller space
Linformer paper: https://arxiv.org/abs/2006.04768
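
As a rough illustration of the "projection onto a smaller space" idea from the Linformer paper, the sketch below projects keys and values along the sequence axis from length n down to a fixed k before attention is computed. This is not this repository's code. The class name `LinformerSelfAttention`, the single-head layout, the parameter names and the random initialisation are all assumptions made for the example.

```python
import math

import torch
import torch.nn as nn


class LinformerSelfAttention(nn.Module):
    """Hypothetical single-head attention with keys/values projected along the sequence axis."""

    def __init__(self, dim, seq_len, k):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Learned projections that shrink the sequence axis from seq_len to k
        # (initialisation scheme is an assumption for this sketch).
        self.proj_k = nn.Parameter(torch.randn(seq_len, k) / math.sqrt(k))
        self.proj_v = nn.Parameter(torch.randn(seq_len, k) / math.sqrt(k))
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (batch, seq_len, dim)
        q = self.to_q(x)
        k = self.to_k(x)
        v = self.to_v(x)
        # Project keys and values: (batch, seq_len, dim) -> (batch, k, dim)
        k = torch.einsum("bnd,nk->bkd", k, self.proj_k)
        v = torch.einsum("bnd,nk->bkd", v, self.proj_v)
        # Scores are (batch, seq_len, k) rather than (batch, seq_len, seq_len).
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v  # (batch, seq_len, dim)
```

Because the projection matrices have a fixed first dimension, this sketch only accepts inputs whose length equals `seq_len`.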
Limited attention span transformers: these simply limit the maximum attention distance, using sparse tensors. Note: sparse tensor support in PyTorch is still a work in progress, so this may not work with all versions.
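
To make the idea of limiting the maximum attention distance concrete, here is a minimal sketch that uses a dense banded mask instead of sparse tensors, so it runs without relying on PyTorch's sparse support. It shows the attention pattern only and gives none of the memory savings a sparse implementation would. The function name `local_attention` and the `span` parameter are hypothetical and not taken from this repository.

```python
import torch


def local_attention(q, k, v, span):
    """Attention in which position i only attends to positions j with |i - j| <= span.

    Dense-mask stand-in for a sparse implementation (assumption for this sketch).
    q, k, v: tensors of shape (..., seq_len, dim).
    """
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    idx = torch.arange(n, device=q.device)
    # True wherever the query/key distance exceeds the allowed span.
    too_far = (idx[None, :] - idx[:, None]).abs() > span
    # Masked positions get zero weight after the softmax.
    scores = scores.masked_fill(too_far, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Each position always attends to itself, so no row of the score matrix is fully masked and the softmax stays well defined for any `span >= 0`.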