I am reading these papers:
✅ LLaMA: Open and Efficient Foundation Language Models
✅ Llama 2: Open Foundation and Fine-Tuned Chat Models
✔️ OPT: Open Pre-trained Transformer Language Models
✅ Attention Is All You Need
✅ Root Mean Square Layer Normalization
✅ GLU Variants Improve Transformer
✅ RoFormer: Enhanced Transformer with Rotary Position Embedding
✅ Self-Attention with Relative Position Representations
✔️ BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
✔️ To Fold or Not to Fold: a Necessary and Sufficient Condition on Batch-Normalization Layers Folding
✅ Fast Transformer Decoding: One Write-Head is All You Need
✅ GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
✔️ PaLM: Scaling Language Modeling with Pathways
✅ Understand the concept of the dot product of two matrices.
✅ Understand the concept of autoregressive language models.
✅ Understand the concept of attention computation.
✅ Understand the workings of the Byte-Pair Encoding (BPE) algorithm and tokenizer (sketch below).
✅ Read and implement the workings of the SentencePiece library and tokenizer.
✅ Understand the concept of tokenization, input ids and embedding vectors.
✅ Understand & implement the concept of positional encoding (sketch below).
✅ Understand the concept of single-head self-attention.
✅ Understand the concept of scaled dot-product attention (sketch below).
✅ Understand & implement the concept of multi-head attention (sketch below).
✅ Understand & implement the concept of layer normalization.
✅ Understand the concept of masked multi-head attention & the softmax layer.
✅ Understand and implement the concept of RMSNorm and its difference from LayerNorm (sketch below).
✅ Understand the concept of internal covariate shift.
✅ Understand the concept and implementation of a feed-forward network with ReLU activation.
✅ Understand the concept and implementation of a feed-forward network with SwiGLU activation (sketch below).
✅ Understand the concept of absolute positional encoding.
✅ Understand the concept of relative positional encoding.
✅ Understand and implement the rotary positional embedding (sketch below).
✅ Understand and implement the transformer architecture.
✅ Understand and implement the original Llama (1) architecture.
✅ Understand the concept of multi-query attention with a single KV projection.
✅ Understand and implement grouped query attention from scratch (sketch below).
✅ Understand and implement the concept of the KV cache (sketch below).
✅ Understand and implement the Llama2 architecture.
✅ Test the Llama2 implementation using the checkpoints from Meta.
✅ Download the Llama2 checkpoints and inspect the inference code and how it works.
✔️ Documentation of the Llama2 implementation and repo.
✅ Work on the implementation of enabling and disabling the KV cache.
✅ Add the attention mask when disabling the KV cache in Llama2.
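Below are a few minimal sketches of the concepts marked "(sketch below)" above. They are illustrative only; the names, shapes, and hyper-parameters are my own assumptions, not the actual repo code. First, one training step of the BPE algorithm: count adjacent symbol pairs over a frequency-weighted toy corpus and pick the most frequent pair, which would be merged into a new token.

```python
from collections import Counter

def most_frequent_pair(word_freqs):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0]

# Toy corpus: words pre-split into characters, with an end-of-word marker </w>.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
print(most_frequent_pair(corpus))  # (('w', 'e'), 8) -> merge "w" + "e" into a new symbol "we"
```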
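A sketch of the absolute sinusoidal positional encoding from "Attention Is All You Need" (the original Transformer's scheme; Llama replaces this with RoPE, sketched further below).

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # added to the token embeddings before the first layer

print(sinusoidal_positional_encoding(10, 16).shape)  # torch.Size([10, 16])
```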
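Scaled dot-product attention for a single head, with an optional causal mask (the mask convention, 0 = blocked, is an assumption of this sketch).

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k). Scores are scaled by sqrt(d_k) so the
    # softmax does not saturate as the head dimension grows.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 are blocked (set to -inf before the softmax).
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Causal (masked) self-attention: each position attends only to itself and earlier positions.
batch, seq_len, d_model = 2, 8, 16
x = torch.randn(batch, seq_len, d_model)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(scaled_dot_product_attention(x, x, x, mask=causal_mask).shape)  # torch.Size([2, 8, 16])
```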
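Multi-head attention: project into several smaller heads, attend in parallel, concatenate, and project back. Module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head self-attention sketch: split d_model into n_heads heads of size d_head."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        b, t, _ = x.shape
        # (b, t, d_model) -> (b, n_heads, t, d_head)
        q = self.wq(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v        # (b, n_heads, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, -1)    # concatenate the heads
        return self.wo(out)

mha = MultiHeadAttention(d_model=64, n_heads=8)
print(mha(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```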
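RMSNorm next to LayerNorm: RMSNorm only rescales by the root mean square, with no mean subtraction and no bias term, which is what Llama applies before each sub-layer.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm (Zhang & Sennrich, 2019): rescale by the root mean square only."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain, like LayerNorm's

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

x = torch.randn(2, 5, 16)
print(nn.LayerNorm(16)(x).shape)  # torch.Size([2, 5, 16])
print(RMSNorm(16)(x).shape)       # torch.Size([2, 5, 16])
```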
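The position-wise feed-forward network in two variants: the original ReLU form and the SwiGLU form used by Llama, where silu(W1 x) gates a parallel projection W3 x element-wise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForwardReLU(nn.Module):
    """Position-wise FFN from the original Transformer: Linear -> ReLU -> Linear."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(F.relu(self.w1(x)))

class FeedForwardSwiGLU(nn.Module):
    """SwiGLU FFN ("GLU Variants Improve Transformer"): gated hidden activation."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # up projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 4, 32)
print(FeedForwardReLU(32, 128)(x).shape)    # torch.Size([2, 4, 32])
print(FeedForwardSwiGLU(32, 128)(x).shape)  # torch.Size([2, 4, 32])
```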
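Rotary positional embeddings via complex multiplication, in the spirit of Meta's reference code (the function names and shapes here are my own sketch).

```python
import torch

def precompute_freqs_cis(head_dim: int, seq_len: int, theta: float = 10000.0):
    # One rotation angle m * theta_j per position m and per pair of dimensions j.
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    m = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(m, freqs)                       # (seq_len, head_dim // 2)
    return torch.polar(torch.ones_like(angles), angles)  # complex e^{i * angle}

def apply_rope(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim). Consecutive pairs of dimensions are
    # treated as complex numbers and rotated by a position-dependent angle.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rotated = x_complex * freqs_cis[None, :, None, :]    # broadcast over batch & heads
    return torch.view_as_real(rotated).flatten(-2).type_as(x)

q = torch.randn(2, 8, 4, 16)  # (batch, seq_len, n_heads, head_dim)
freqs_cis = precompute_freqs_cis(head_dim=16, seq_len=8)
print(apply_rope(q, freqs_cis).shape)  # torch.Size([2, 8, 4, 16])
```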
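Grouped query attention: fewer KV heads than query heads, with each KV head shared by a group of query heads. n_kv_heads = 1 recovers multi-query attention; n_kv_heads = n_heads recovers standard multi-head attention. This is a self-attention-only sketch without RoPE or a KV cache.

```python
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    """GQA sketch: n_heads query heads share n_kv_heads key/value heads."""

    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.d_head = d_model // n_heads
        self.wq = nn.Linear(d_model, n_heads * self.d_head, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.wo = nn.Linear(n_heads * self.d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        # Repeat each KV head so it lines up with its group of query heads.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = torch.softmax(scores, dim=-1) @ v
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))

gqa = GroupedQueryAttention(d_model=64, n_heads=8, n_kv_heads=2)
print(gqa(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```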
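A KV-cache sketch for autoregressive decoding: keys and values of already-processed tokens are kept around, so each step only projects the new token and attends against the cached sequence. The class and method names are illustrative, not the repo's API.

```python
import torch

class KVCache:
    """Minimal per-layer KV cache: preallocated buffers plus a running length."""

    def __init__(self, max_seq_len: int, n_heads: int, head_dim: int, batch_size: int = 1):
        self.k = torch.zeros(batch_size, n_heads, max_seq_len, head_dim)
        self.v = torch.zeros(batch_size, n_heads, max_seq_len, head_dim)
        self.len = 0

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, n_heads, t_new, head_dim) for the newly generated token(s).
        t_new = k_new.size(2)
        self.k[:, :, self.len:self.len + t_new] = k_new
        self.v[:, :, self.len:self.len + t_new] = v_new
        self.len += t_new
        # Return everything cached so far; attention for the new token runs against this.
        return self.k[:, :, :self.len], self.v[:, :, :self.len]

cache = KVCache(max_seq_len=32, n_heads=4, head_dim=8)
for step in range(3):                   # decode one token at a time
    k_t = torch.randn(1, 4, 1, 8)
    v_t = torch.randn(1, 4, 1, 8)
    k_all, v_all = cache.update(k_t, v_t)
    print(step, k_all.shape)            # the sequence dimension grows: 1, 2, 3
```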
✅ LLAMA: OPEN AND EFFICIENT LLM NOTES
✅ UNDERSTANDING KV CACHE
✅ GROUPED QUERY ATTENTION (GQA)
🔗 pytorch-llama - PyTorch implementation of LLaMA by Umar Jamil.
🔗 pytorch-transformer - PyTorch implementation of the Transformer by Umar Jamil.
🔗 llama - Facebook's LLaMA implementation.
🔗 tensor2tensor - Google's Transformer implementation.
🔗 rmsnorm - RMSNorm implementation.
🔗 roformer - Rotary Transformer (RoFormer) implementation.
🔗 xformers - Facebook's library of optimized Transformer building blocks.
✅ Understanding SentencePiece ([Under][Standing][_Sentence][Piece])
✅ SwiGLU: GLU Variants Improve Transformer (2020)