From ee1aca5c2b8d2bef2b633edb61a031c606b4fc6c Mon Sep 17 00:00:00 2001
From: Madjakul
Date: Wed, 13 Nov 2024 19:32:24 +0100
Subject: [PATCH] Course 5: Inference

---
 README.md                     |  2 +-
 markdown/course5_inference.md | 46 +++++++++++++++++------------------
 2 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index 132e328..b4fdc20 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
2. Tokenization ([slides](https://github.com/NathanGodey/AdvancedNLP/raw/main/slides/pdf/course2_tokenization.pdf) / [lab session](https://colab.research.google.com/drive/1xEKz_1LcnkfcEenukIGCrk-Nf_5Hb19s?usp=sharing))
3. Language Modeling ([slides](https://github.com/NathanGodey/AdvancedNLP/raw/main/slides/pdf/course3_lm.pdf) / [lab session](https://colab.research.google.com/drive/1QmVOWC1oB206PmOBn8j0EF54laSh3BBd?usp=sharing))
4. NLP without 2048 GPUs ([slides](https://github.com/NathanGodey/AdvancedNLP/raw/main/slides/pdf/course4_efficiency.pdf) / lab session)
-5. Language Models at Inference Time ([slides](https://raw.githubusercontent.com/NathanGodey/AdvancedNLP/main/slides/pdf/course5_inference.pdf) / lab session)
+5. Language Models at Inference Time ([slides](https://raw.githubusercontent.com/NathanGodey/AdvancedNLP/main/slides/pdf/course5_inference.pdf) / [lab session](https://colab.research.google.com/drive/13Q1WVHDvmFX4pDQ9pSr0KrggBnPtBSPX?usp=sharing))
6. Handling the Risks of Language Models ([slides](https://github.com/NathanGodey/AdvancedNLP/raw/main/slides/pdf/course6_risks.pdf) / lab session)
7. Advanced NLP tasks ([slides](https://github.com/NathanGodey/AdvancedNLP/raw/main/slides/pdf/course7advanced.pdf) / lab session)
8. Domain-specific NLP ([slides](https://github.com/NathanGodey/AdvancedNLP/raw/main/slides/pdf/course8_specific.pdf) / lab session)
diff --git a/markdown/course5_inference.md b/markdown/course5_inference.md
index a5e5aa9..9cbbd79 100644
--- a/markdown/course5_inference.md
+++ b/markdown/course5_inference.md
@@ -2,7 +2,7 @@
theme: gaia
_class: lead
paginate: true
-title: "Course 4: Efficient NLP"
+title: "Course 5: LMs at Inference Time"
backgroundColor: #fff
marp: true
---
@@ -12,7 +12,7 @@ marp: true

---

## Introduction

@@ -30,8 +30,8 @@ Scaling language models (LMs) is the go-to solution to achieve greater performan

### Background

-- Evidently, the more you scale, the more compute you need at inference.
-- Hardware cost can hinder LLMs useless if no optimization is done.
+- The more you scale, the more compute you need at inference.
+- Hardware costs can render LLMs unusable if no optimization is done.
- Not all optimization techniques are born equal...

**What are the different responses to the trade-off between an LLM's performance and its throughput?**

@@ -64,7 +64,7 @@

### Prompt pruning: when KV caching is not enough

-Attention matrices need to be calculated for every token constituing an LLM's prompt, leading to latency.
+Attention matrices need to be calculated for every token constituting an LLM's prompt, leading to latency.

- On LLaMa2-70b models, given a long prompt, 23% of the total generation time is accounted for by the time to first token (TTFT).
- KV caching is of no use in that context...

@@ -76,7 +76,7 @@ How to reduce that TTFT with minimum performance loss?

### Prompt pruning: when KV caching is not enough

-When does KV cachin comes into play?
+When does KV caching come into play?

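To make the role of the cache concrete, here is a minimal sketch of KV caching for a single attention head, using NumPy only. The toy dimensions, the random projection matrices, and the way the attention output is fed back as the next "embedding" are illustrative assumptions, not any real model's API; the point is that the cache removes recomputation for every generated token but cannot shorten the first, full-prompt pass that dominates the TTFT.

```python
# A minimal sketch of KV caching in a single attention head (toy setup, NumPy only).
import numpy as np

d = 16                                              # head dimension (toy value)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def prefill(prompt_embs):
    """First forward pass: attention over the WHOLE prompt (this is the TTFT cost)."""
    Q, K, V = prompt_embs @ W_q, prompt_embs @ W_k, prompt_embs @ W_v
    scores = softmax(Q @ K.T / np.sqrt(d))          # (T, T): quadratic in prompt length
    out = scores @ V
    return out[-1], (K, V)                          # cache K and V for reuse

def decode_step(new_emb, cache):
    """Later steps: one query only; keys/values of past tokens come from the cache."""
    K, V = cache
    q, k, v = new_emb @ W_q, new_emb @ W_k, new_emb @ W_v
    K, V = np.vstack([K, k]), np.vstack([V, v])     # append the new token to the cache
    scores = softmax(q @ K.T / np.sqrt(d))          # (1, T+1): linear in context length
    return scores @ V, (K, V)

prompt = rng.normal(size=(512, d))                  # a "long prompt" of 512 toy embeddings
last, cache = prefill(prompt)                       # KV caching cannot skip this pass
for _ in range(5):                                  # generation reuses the cache
    # in a real LM the next input would be the embedding of the sampled token;
    # reusing `last` just keeps the sketch self-contained
    last, cache = decode_step(last.reshape(1, d), cache)
print(cache[0].shape)                               # (517, 16): 512 prompt + 5 generated
```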
@@ -98,7 +98,7 @@ Not all tokens are useful to understand/answer the prompt.

How to effectively choose tokens to prune out?

-Transformer's attention represent more abstract concept as the compution is done deeper in its layers [3].
+Transformer's attention represents more abstract concepts as the computation is done deeper in its layers [3].

The last attention matrices play an important role in the decision boundaries computed by a transformer-based LM [4].

@@ -142,7 +142,7 @@ Drawbacks:

An **LLM** can **predict multiple tokens in a single forward pass**:

-- **Speculative decoding** [5] allow an LLM to **"guess" future tokens** while generating curent tokens, **all within a single forward pass**.
+- **Speculative decoding** [5] allows an LLM to **"guess" future tokens** while generating current tokens, **all within a single forward pass**.
- By running a draft model to predict multiple tokens, the main model (larger) only has to verify the predicted tokens for "correctness".

---

@@ -162,7 +162,7 @@ An **LLM** can **predict multiple tokens in a single forward pass** :

### Speculative decoding

-The main model just verifies that the distribution $q(x)$, computed by the assistant is not to far from the distribution $p(x)$ it computes within a forward pass.
+The main model just verifies that the distribution $q(x)$, computed by the assistant, is not too far from the distribution $p(x)$ it computes within a forward pass.

The expected number of tokens generated within one loop of speculative decoding can be theoretically formulated as:

$$
E(\#generated\_tokens) = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
$$

-Which is # forward passes' reduction factor.
+This is also the reduction factor in the number of forward passes (a toy sketch of the accept/reject loop behind this formula is given at the end of this document).

---

@@ -186,7 +186,7 @@ The expected number of tokens generated via speculative decoding as a function o

### Speculative decoding

-In order **to take the most out of speculative decoding**, the distance between **$q(x)$ and $p(x)$ need to be minimal**.
+To get the most out of **speculative decoding**, the distance between **$q(x)$ and $p(x)$ needs to be minimal**.

How to reduce the distance between $q(x)$ and $p(x)$ when the assistant model is smaller?

@@ -202,7 +202,7 @@ How to reduce the distance between $q(x)$ and $p(x)$ when the assistance model i

### Speculative decoding

Speculative decoding comes with two drawbacks:

- Loading two models in memory
-- Making sure the assistant model output a token distribution as close as possible to the main model
+- Making sure the assistant model outputs a token distribution as close as possible to the main model's

---

Why not let the main model do the speculation itself?

-**Transformer models** are believed to be **over-parametrized** and the **last layers specialized** on computing the decision boundaries **before projecting on the LM head**. Maybe we can make **each layer able to project on the LM head**, thus skipping layers [6] and allowing for an **early exit** at inference [7].
+**Transformer models** are believed to be **over-parameterized**, with the **last layers specialized** in computing the decision boundaries **before projecting onto the LM head**. Maybe we can make **each layer able to project onto the LM head**, thus skipping layers [6] and allowing for an **early exit** at inference [7].

---

### Retrieval augmented generation (at inference)

-- Although conditioned on retrieved knowledge, output may be an hallucination.
-- Most of RAG's performance depend on the chunking method and the retriever.
+- Although conditioned on retrieved knowledge, the output may still be a hallucination.
+- Most of RAG's performance depends on the chunking method and the retriever.

---

### Test time compute

-The goal is to **allocate more compute at inference** to **"natively" incorporate chain-of-thoughts** like decoding.
+The goal is to **allocate more compute at inference** to **"natively" incorporate chain-of-thought-like decoding**.

-The hypothesis is that **models have good reasoning capabilities** but standard **decoding processes hinder it**.
+The hypothesis is that **models have good reasoning capabilities** but standard **decoding processes hinder them**.

---

@@ -391,7 +391,7 @@ A **reward model (verifier)** selects the **best answer** based on a **systemati

**Modifying the proposal distribution**:

-**Reinforcement learning-like techniques** where a **model learns to refin its own answer** to reach the optimal one: look at **ReST** [12] and **STaR** [11].
+**Reinforcement learning-like techniques** where a **model learns to refine its own answer** to reach the optimal one: look at **ReST** [12] and **STaR** [11].

Unlike standard decoding, **the model can backtrack to previous steps**.

---

@@ -410,7 +410,7 @@ Unlinke standard decoding, **the model can backtrack to previous steps**.

### Test time compute

Takeaways (DeepMind's scaling laws):

- Small models (<10b) are better at answering easy questions when given more TTC than pretraining compute.
-- Disminishing return on larger models with more TTC than pretraining compute.
+- Diminishing returns on larger models with more TTC than pretraining compute.

---

@@ -451,9 +451,9 @@ Divide one FFN network with $M$ parameters into $N$ experts with $M' = \frac{M}{

### Mixture of experts

-- Reduced computational the training and inference since we only need to run $1/N$th of the FFN weights.
-- Instable during training: can strugle to generalized, thus prone to overfitting.
-- Load balancing is curcial: we do not want a subset of experts to be under-utilized.
+- Reduced computation during training and inference since we only need to run $1/N$th of the FFN weights.
+- Unstable during training: can struggle to generalize, thus prone to overfitting.
+- Load balancing is crucial: we do not want a subset of experts to be under-utilized.

---

@@ -505,7 +505,7 @@ $$

---

## Questions?
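The speculative decoding formula above can be checked empirically. Below is a toy sketch of the accept/reject loop from [5], assuming stand-in "draft" and "target" models that are just fixed categorical distributions over a four-token vocabulary (so here $\alpha = \sum_x \min(p(x), q(x)) = 0.9$); it illustrates the mechanism, not a production implementation.

```python
# A toy sketch of the speculative decoding accept/reject loop.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 4
GAMMA = 4                                   # number of tokens the draft guesses per loop

def draft_dist(ctx):                        # q(x): cheap assistant model (stand-in)
    return np.array([0.4, 0.3, 0.2, 0.1])

def target_dist(ctx):                       # p(x): large main model (stand-in)
    return np.array([0.5, 0.25, 0.15, 0.1])

def speculative_step(ctx):
    """One loop: the draft proposes GAMMA tokens, the target verifies them."""
    proposals, q_probs = [], []
    for _ in range(GAMMA):                  # cheap, sequential draft generation
        q = draft_dist(ctx + proposals)
        tok = rng.choice(VOCAB, p=q)
        proposals.append(tok)
        q_probs.append(q[tok])

    accepted = []
    for i, tok in enumerate(proposals):     # verification (conceptually one target pass)
        p = target_dist(ctx + accepted)
        if rng.random() < min(1.0, p[tok] / q_probs[i]):
            accepted.append(tok)            # token kept: the distributions agree "enough"
        else:
            # rejection: resample from the residual max(p - q, 0) distribution
            residual = np.clip(p - draft_dist(ctx + accepted), 0, None)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB, p=residual))
            break
    else:
        # all GAMMA guesses accepted: the target's pass yields one extra token for free
        accepted.append(rng.choice(VOCAB, p=target_dist(ctx + accepted)))
    return accepted

lengths = [len(speculative_step([])) for _ in range(10_000)]
print(np.mean(lengths))                     # empirical E[#generated tokens] per loop
```

With $\gamma = 4$ and $\alpha = 0.9$, the printed average should be close to $(1 - 0.9^{5})/(1 - 0.9) \approx 4.1$ tokens per loop, matching the expression given in the speculative decoding slide.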
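On the test-time-compute side, a minimal sketch of best-of-$N$ search against a verifier, one of the simplest ways to spend extra inference compute; `generate_candidates` and `verifier_score` are hypothetical toy functions standing in for an LM sampler and a reward model, not real APIs.

```python
# A minimal sketch of best-of-N test-time compute with a verifier (toy functions).
import numpy as np

rng = np.random.default_rng(0)

def generate_candidates(prompt, n=8):
    """Stand-in for sampling N reasoning paths / answers from the same LM."""
    return [f"{prompt} -> answer {rng.integers(0, 4)}" for _ in range(n)]

def verifier_score(prompt, answer):
    """Stand-in for a reward model scoring each candidate; higher is better."""
    return -abs(hash(answer) % 7 - 3) + rng.normal(scale=0.1)

def best_of_n(prompt, n=8):
    candidates = generate_candidates(prompt, n)          # extra inference compute...
    scores = [verifier_score(prompt, c) for c in candidates]
    return candidates[int(np.argmax(scores))]            # ...but only one answer is kept

print(best_of_n("What is 17 * 3?"))
```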
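Finally, for the mixture-of-experts slide, a minimal NumPy sketch of an MoE feed-forward layer with top-1 routing; the sizes, the linear router, and the absence of load-balancing losses or capacity limits are simplifying assumptions.

```python
# A minimal sketch of a mixture-of-experts FFN layer with top-1 routing (NumPy, toy sizes).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 32, 4                       # each expert holds ~1/N of the FFN
router = rng.normal(size=(d_model, n_experts))
experts = [
    {"w1": rng.normal(size=(d_model, d_ff // n_experts)),
     "w2": rng.normal(size=(d_ff // n_experts, d_model))}
    for _ in range(n_experts)
]

def moe_ffn(x):
    """x: (tokens, d_model). Each token is processed by a single routed expert."""
    logits = x @ router                                   # routing scores per expert
    gates = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    choice = gates.argmax(-1)                             # top-1 expert per token
    out = np.zeros_like(x)
    for e, params in enumerate(experts):                  # only 1/N of FFN weights run per token
        mask = choice == e
        if mask.any():
            h = np.maximum(x[mask] @ params["w1"], 0)     # ReLU FFN of the chosen expert
            out[mask] = gates[mask, e, None] * (h @ params["w2"])
    return out, np.bincount(choice, minlength=n_experts)

y, expert_load = moe_ffn(rng.normal(size=(16, d_model)))
print(y.shape, expert_load)
```

The returned `expert_load` makes the load-balancing concern visible: if routing collapses onto a few experts, the remaining ones are wasted capacity.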