📖A curated list of Awesome LLM Inference Papers with codes, such as FlashAttention, PagedAttention, Parallelism etc. 🎉🎉
Updated Nov 25, 2024
📚Tensor/CUDA Cores, 📖150+ CUDA kernels, 🔥🔥a toy-hgemm library with WMMA, MMA and CuTe (reaching 99%~100%+ of cuBLAS TFLOPS 🎉🎉).
Shush is an app that deploys a WhisperV3 model with Flash Attention v2 on Modal and makes requests to it via a Next.js app.
Triton implementation of FlashAttention-2 that adds support for custom masks.
Performance benchmarks of the C++ interfaces of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios.
Flash Attention implementation with multiple backend support and sharding. This module provides a flexible implementation of Flash Attention with support for different backends (GPU, TPU, CPU) and platforms (Triton, Pallas, JAX).
Uses the powerful WhisperS2T and CTranslate2 libraries to batch-transcribe multiple files.
Poplar implementation of FlashAttention for IPU
Toy Flash Attention implementation in PyTorch.
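The core idea behind such toy implementations is tiling with an online softmax: the K/V sequence is processed block by block while running statistics (row-wise max and normalizer) are updated, so the full N×N score matrix is never materialized. A minimal NumPy sketch of this technique (not taken from any repo listed here; names and block size are illustrative):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference: softmax(Q K^T / sqrt(d)) V with the full score matrix.
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def flash_attention(Q, K, V, block=4):
    # Tiled attention with an online softmax: iterate over K/V blocks,
    # keeping a running max (m), normalizer (l) and unnormalized output (o).
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    o = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row-wise max of scores
    l = np.zeros(n)           # running softmax normalizer
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                  # scores for this block only
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)             # rescale previous statistics
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=-1)
        o = o * alpha[:, None] + P @ Vj
        m = m_new
    return o / l[:, None]
```

The output matches the naive computation exactly (up to floating-point error), which is what makes the tiled form a drop-in replacement; real kernels additionally fuse these steps on-chip to avoid HBM round-trips.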
Transcribe audio in minutes with OpenAI's WhisperV3 and Flash Attention v2 + Transformers without relying on third-party providers and APIs. Host it yourself or try it out.