|
* `float16`: halves memory use; well supported from V100-generation GPUs onward, but its narrow exponent range makes training prone to overflow/underflow
* `bfloat16`: same exponent range as `float32`, hence much more numerical stability, but only supported natively from A100-generation GPUs onward

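A quick way to see the trade-off (a NumPy sketch; bfloat16 is simulated by truncating the float32 mantissa, since NumPy has no native bfloat16 dtype):

```python
import numpy as np

def to_bfloat16(x: np.ndarray) -> np.ndarray:
    # Keep float32's 8-bit exponent, round the mantissa down to 7 bits:
    # that is exactly the bfloat16 layout (upper 16 bits of a float32).
    bits = x.astype(np.float32).view(np.uint32)
    rounded = bits + np.uint32(0x7FFF) + ((bits >> 16) & np.uint32(1))  # round-to-nearest-even
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([3.0e38, 1.0 + 2**-10], dtype=np.float32)
print(x.astype(np.float16))  # 3e38 overflows float16 (max ~65504) -> inf
print(to_bfloat16(x))        # bfloat16 keeps float32's range, at lower precision
```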
|
---
### Training LMs - (b)float16
<br>
<center><img width="1000px" src="../imgs/course4/bfloat.png"></center>

---
### Training LMs - Efficient implementations
- FlashAttention (Dao et al. 2022)
<center><img width="1000px" src="../imgs/course4/flashattn_banner.jpeg"/></center>

---
### Training LMs - Efficient implementations
- FlashAttention 2 (Dao, 2023) & 3 (Shah et al., 2024)
<center><img width="600px" src="../imgs/course4/flash2.png"/></center>

---
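The core trick FlashAttention builds on can be sketched in a few lines: stream over key/value blocks with an online softmax, so the full (seq × seq) score matrix is never materialized. This is a NumPy illustration of the idea only; the real kernels fuse these steps into SRAM-resident GPU tiles:

```python
import numpy as np

def naive_attention(q, k, v):
    # Reference implementation: materializes the full n x n score matrix.
    s = q @ k.T / np.sqrt(q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=2):
    # FlashAttention-style streaming: visit K/V in blocks, maintain a
    # running max and running softmax denominator, rescale the partial output.
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    running_max = np.full(n, -np.inf)
    running_sum = np.zeros(n)
    for i in range(0, k.shape[0], block):
        kb, vb = k[i:i+block], v[i:i+block]
        s = q @ kb.T / np.sqrt(d)                  # scores for this block only
        new_max = np.maximum(running_max, s.max(axis=1))
        scale = np.exp(running_max - new_max)      # rescale previous statistics
        p = np.exp(s - new_max[:, None])
        running_sum = running_sum * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ vb
        running_max = new_max
    return out / running_sum[:, None]
```

Both functions return the same result; only the memory footprint differs.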
|
|
### Training LMs - FSDP
<center><img width="1000px" src="../imgs/course4/fsdp.png"/></center>

---
### Training LMs - FSDP
<center><img width="1000px" src="../imgs/course4/tensor_parallel.png"/></center>

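The idea behind FSDP/ZeRO-3 sharding can be sketched with plain NumPy: each rank stores only a shard of the parameters and the full tensor is materialized just-in-time via an all-gather. `shard_params` and `all_gather` here are illustrative stand-ins for the real collective ops, not the PyTorch API:

```python
import numpy as np

def shard_params(flat_params: np.ndarray, world_size: int):
    # Each rank stores ~1/world_size of the parameters (FSDP / ZeRO-3 idea).
    return np.array_split(flat_params, world_size)

def all_gather(shards):
    # Reassembled just-in-time before a layer's forward/backward pass,
    # then freed again, so persistent parameter memory per rank stays sharded.
    return np.concatenate(shards)

params = np.arange(10, dtype=np.float32)
shards = shard_params(params, world_size=4)
print([s.size for s in shards])  # per-rank storage: [3, 3, 2, 2]
print(np.array_equal(all_gather(shards), params))
```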
---
### Training LMs - DeepSpeed
- Similar to FSDP:
|
|

---

### Quantization
<center><img width="800px" src="../imgs/course4/quantization.png"/></center>

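Before the specific methods, the baseline recipe is round-to-nearest. A NumPy sketch of symmetric absmax int8 quantization (methods like GPTQ refine *which* weights to round and in what order; this is only the naive baseline):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric absmax quantization: map [-max|w|, max|w|] onto [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(8).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.dtype, err)  # int8 storage; reconstruction error bounded by scale / 2
```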
---

### LM quantization
- GPTQ (Frantar et al. 2023)
<center><img width="900px" src="../imgs/course4/gptq.png"/></center>
|
|

---

### Pruning - Sheared Llama (Xia et al. 2023)
* Remove the weights whose removal least increases the loss <center><img width="1000px" src="../imgs/course4/sheared_llama.png"/></center>
* Continue pretraining the resulting smaller model to recover performance

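Sheared Llama learns structured pruning masks chosen to minimize the loss increase; as a much simpler point of comparison, classic magnitude pruning just drops the smallest weights. A NumPy sketch, illustrative only:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    # Zero out the fraction `sparsity` of weights with smallest magnitude,
    # a crude proxy for "weights whose removal least increases the loss".
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w).ravel())[k]
    return np.where(np.abs(w) >= threshold, w, 0.0)

w = np.random.default_rng(0).standard_normal((4, 4))
pruned = magnitude_prune(w, sparsity=0.5)
print((pruned == 0).mean())  # half of the weights removed
```

Unlike this sketch, Sheared Llama prunes whole structural units (heads, layers, hidden dimensions), so the reduced model actually runs faster on hardware.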
|
|