diff --git a/README.md b/README.md
index b4fdc20..39f3898 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@
 6. Handling the Risks of Language Models ([slides](https://github.com/NathanGodey/AdvancedNLP/raw/main/slides/pdf/course6_risks.pdf) / lab session)
 7. Advanced NLP tasks ([slides](https://github.com/NathanGodey/AdvancedNLP/raw/main/slides/pdf/course7advanced.pdf) / lab session)
 8. Domain-specific NLP ([slides](https://github.com/NathanGodey/AdvancedNLP/raw/main/slides/pdf/course8_specific.pdf) / lab session)
-9. Multilingual NLP ([slides](https://github.com/NathanGodey/AdvancedNLP/raw/main/slides/Course%205%20-%20Multilingual%20NLP.pdf) / lab session)
+9. Multilingual NLP ([slides](https://github.com/NathanGodey/AdvancedNLP/blob/main/slides/pdf/Course%209%20-%20Multilingual%20NLP.pdf) / [lab session](https://colab.research.google.com/drive/11TX-q-hAdFiSeMVqFp1VCXhi_Ifoj8Rp?usp=sharing))
 10. Multimodal NLP ([slides](https://docs.google.com/presentation/d/1K2DgnPSOGXB1hQ4FZoUU-5ppJ4dn_sLC41Ecwmxi2Zk/edit?usp=sharing) / lab session)
 
 ## Evaluation
diff --git a/imgs/course4/bfloat.png b/imgs/course4/bfloat.png
new file mode 100644
index 0000000..9fcfafb
Binary files /dev/null and b/imgs/course4/bfloat.png differ
diff --git a/imgs/course4/flashattn_banner.png b/imgs/course4/flashattn_banner.png
new file mode 100644
index 0000000..d138650
Binary files /dev/null and b/imgs/course4/flashattn_banner.png differ
diff --git a/imgs/course4/quantization.png b/imgs/course4/quantization.png
new file mode 100644
index 0000000..f582866
Binary files /dev/null and b/imgs/course4/quantization.png differ
diff --git a/imgs/course4/tensor_parallel.png b/imgs/course4/tensor_parallel.png
new file mode 100644
index 0000000..e7f323a
Binary files /dev/null and b/imgs/course4/tensor_parallel.png differ
diff --git a/markdown/course3_lm.md b/markdown/course3_lm.md
index 6b828f4..a81431e 100644
--- a/markdown/course3_lm.md
+++ b/markdown/course3_lm.md
@@ -477,7 +477,7 @@ $$
 ---
 ### Decoders - Inference speed
 * For greedy decoding without prefix:
-  * $n$ passes with sequences of length $n$
+  * $n$ passes with sequences of length $1\leq t \leq n$
   * Each pass is $O(n^2)$
   * Complexity: $O(n^3)$
 * Other decoding are more costly
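The corrected bullet changes the per-pass length from a fixed $n$ to $1 \leq t \leq n$; the $O(n^3)$ total quoted on the slide still follows, as this worked check (added here for clarity, not taken from the slides) shows:

$$\sum_{t=1}^{n} O(t^2) \;=\; O\!\left(\frac{n(n+1)(2n+1)}{6}\right) \;=\; O(n^3)$$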
diff --git a/markdown/course4_efficiency.md b/markdown/course4_efficiency.md
index ad0c381..2316bf1 100644
--- a/markdown/course4_efficiency.md
+++ b/markdown/course4_efficiency.md
@@ -86,6 +86,16 @@ $$
 * `float16`: reduces memory usage, good with V100-gen GPUs
 * `bfloat16`: more stability, but only usable with A100-gen GPUs
 
+---
+### Training LMs - (b)float16
+<center><img src="../imgs/course4/bfloat.png"></center>
+
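To make the new (b)float16 slide concrete, here is a minimal mixed-precision sketch in PyTorch, assuming a CUDA device; the toy model, batch, and learning rate are illustrative placeholders, not course code:

```python
import torch

# Toy stand-ins for a real LM, optimizer and batch.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")

# Matrix multiplications run in bfloat16 inside autocast;
# the master weights stay in float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()

loss.backward()        # gradients land in the float32 parameter dtype
optimizer.step()
optimizer.zero_grad()
```

Because `bfloat16` keeps the float32 exponent range, no loss scaling (`GradScaler`) is needed, which is the stability advantage the slide refers to; with `float16` a scaler is usually required.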
+---
+### Training LMs - Efficient implementations
+- FlashAttention (Dao et al. 2022)
+<center><img src="../imgs/course4/flashattn_banner.png"></center>
+
 ---
 ### Training LMs - Efficient implementations
 - FlashAttention (Dao et al. 2022)
@@ -93,7 +103,7 @@ $$
 
 ---
 ### Training LMs - Efficient implementations
-- FlashAttention2 (Dao et al. 2023)
+- FlashAttention 2 & 3 (Dao et al. 2023; Shah et al. 2024)
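As a usage-level companion to the FlashAttention slides: PyTorch's fused scaled-dot-product attention can dispatch to a FlashAttention kernel on supported GPUs, avoiding materialising the $n \times n$ attention matrix. A minimal sketch with assumed toy shapes, not code from the course:

```python
import torch
import torch.nn.functional as F

# (batch, heads, sequence length, head dim) in bfloat16 on GPU
q = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused kernel: memory grows linearly with sequence length instead of
# quadratically, since the full attention matrix is never stored.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 16, 4096, 64])
```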
 ---
@@ -158,6 +168,10 @@ $$
 ### Training LMs - FSDP
 
+---
+### Training LMs - FSDP
+<center><img src="../imgs/course4/tensor_parallel.png"></center>
+
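To accompany the FSDP slides, a minimal sketch of sharding a toy model with PyTorch's `FullyShardedDataParallel`; it assumes a `torchrun` launch with one GPU per rank, and the model sizes and hyperparameters are illustrative:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")          # env vars provided by torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# Each rank keeps only a shard of parameters, gradients and optimizer state;
# full parameters are gathered on demand during forward and backward.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
```

Launched with e.g. `torchrun --nproc_per_node=8 train.py`; DeepSpeed ZeRO, covered on the next slide, shards training state along the same lines.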
 ---
 ### Training LMs - DeepSpeed
 - Similar to FSDP:
@@ -210,6 +224,11 @@ $$ Q_{i_4}(0.3) \neq 0$$
 
 ---
+### Quantization
+<center><img src="../imgs/course4/quantization.png"></center>
+
+---
+
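To make the quantization slides concrete, a toy illustration of what quantization does to a weight matrix: symmetric absmax rounding to `int8` and the round-trip error it introduces. This sketches the general idea only; it is neither GPTQ nor the slides' $Q_i$ notation:

```python
import torch

def absmax_quantize(w: torch.Tensor):
    """Symmetric int8 quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = w.abs().max() / 127
    q = torch.round(w / scale).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)
q, scale = absmax_quantize(w)
w_hat = q.float() * scale               # dequantized weights
print((w - w_hat).abs().max().item())   # worst-case rounding error, about scale / 2
```

Storing `int8` weights plus a single scale cuts memory by roughly 4x compared to `float32`.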
 ### LM quantization
 - GPTQ (Frantar et al. 2023)
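As background for the GPTQ bullet above, the layer-wise objective that such post-training methods minimise, written in the same notation as the hunk that follows ($W$ the original weights, $\hat{W}$ the quantized ones, $X$ calibration data); this restates the standard formulation rather than quoting the slides:

$$\hat{W} \;=\; \arg\min_{\hat{W}\,\in\,\text{quantization grid}} \;\bigl\lVert W X - \hat{W} X \bigr\rVert_2^2$$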
@@ -285,7 +304,7 @@ where $W$ is a weight matrix to quantize into $\hat{W}$, and $X$ are data points
 
 ---
-### Sheared Llama (Xia et al. 2023)
+### Pruning - Sheared Llama (Xia et al. 2023)
 * Remove weights that minimize loss increase
 * Continue the pretraining of the obtained reduced model
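As a toy illustration of the first bullet (drop the weights whose removal hurts the loss least), a magnitude-pruning sketch in PyTorch; note that Sheared LLaMA itself learns structured masks over layers, heads, and hidden dimensions rather than pruning individual weights, so this is only the simplest possible stand-in:

```python
import torch

def magnitude_prune(linear: torch.nn.Linear, sparsity: float = 0.5):
    """Zero out the smallest-magnitude weights, a crude proxy for
    'weights whose removal increases the loss the least'."""
    w = linear.weight.data
    threshold = w.abs().flatten().kthvalue(int(sparsity * w.numel())).values
    mask = w.abs() > threshold
    w.mul_(mask)
    return mask  # keep the mask to hold pruned weights at zero later

layer = torch.nn.Linear(1024, 1024)
mask = magnitude_prune(layer, sparsity=0.5)
print(1 - mask.float().mean().item())  # about half of the weights are now zero
```

The second bullet then corresponds to continuing pretraining of the reduced model, which this sketch leaves out.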

diff --git a/slides/course3_lm.html b/slides/course3_lm.html
index 7220ec4..682fdbf 100644
--- a/slides/course3_lm.html
+++ b/slides/course3_lm.html
@@ -573,7 +573,7 @@
 [regenerated Marp HTML for the "Decoders - Inference speed" slide; markup omitted]