📖 **CUDA-Learn-Notes**: 🎉 CUDA/C++ notes / tech blog: **fp32, fp16/bf16, fp8/int8**, flash_attn, sgemm, sgemv, warp/block reduce, dot prod, elementwise, softmax, layernorm, rmsnorm, hist, etc. 👉 News: most of my time is now focused on **LLM/VLM/Diffusion** inference. Please check 📖[Awesome-LLM-Inference](https://github.com/DefTruth/Awesome-LLM-Inference), 📖[Awesome-SD-Inference](https://github.com/DefTruth/Awesome-SD-Inference) and 📖[CUDA-Learn-Notes](https://github.com/DefTruth/CUDA-Learn-Notes) for more details.

<img width="1438" alt="image" src="https://github.com/user-attachments/assets/0c5e5125-586f-43fa-8e8b-e2c61c1afbbe">
## 0x00 📖 CUDA Kernel Index (Frequently Asked in Interviews)
- `/` = not supported yet.
- ✔️ = implemented and verified to work.
- ❔ = planned, but not coming soon (likely a few weeks out).
- **workflow**: custom **CUDA** kernel implementation -> **Torch** Python binding -> run tests.

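The workflow above can be sketched end to end with a minimal example. This is a hypothetical, illustrative kernel (an elementwise add) rather than one of the kernels listed in the table; the function names are my own:

```cuda
#include <torch/extension.h>
#include <cuda_runtime.h>

// Step 1: a custom CUDA kernel, one thread per element.
__global__ void elementwise_add_f32_kernel(const float* a, const float* b,
                                           float* c, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) c[idx] = a[idx] + b[idx];
}

// Step 2: a Torch-facing launcher, exposed to Python via pybind11
// (torch/extension.h pulls in the pybind11 headers).
void elementwise_add_f32(torch::Tensor a, torch::Tensor b, torch::Tensor c) {
  const int n = a.numel();
  dim3 block(256);
  dim3 grid((n + 255) / 256);
  elementwise_add_f32_kernel<<<grid, block>>>(
      a.data_ptr<float>(), b.data_ptr<float>(), c.data_ptr<float>(), n);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("elementwise_add_f32", &elementwise_add_f32);
}
```

Step 3 is then a small Python test that builds and loads the extension (e.g. with `torch.utils.cpp_extension.load`) and compares the kernel's output against the PyTorch reference `a + b`.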
|📖 cuda kernel| 📖 elem dtype| 📖 acc dtype| 📖 docs | 📖 level |
|:---|:---|:---|:---|:---|
| ✔️ [elementwise_f32_kernel](./elementwise/elementwise.cu)|f32|/|[link](./elementwise/)|⭐️|
| ✔️ [elementwise_f32x4_kernel](./elementwise/elementwise.cu)|f32|/|[link](./elementwise/)|⭐️|
| ✔️ [elementwise_f16_kernel](./elementwise/elementwise.cu)|f16|/|[link](./elementwise/)|⭐️|
| ✔️ [elementwise_f16x2_kernel](./elementwise/elementwise.cu)|f16|/|[link](./elementwise/)|⭐️|
| ✔️ [histogram_i32_kernel](./histogram/histogram.cu)|i32|/|[link](./histogram/)|⭐️|
| ✔️ [histogram_i32x4_kernel](./histogram/histogram.cu)|i32|/|[link](./histogram/)|⭐️|
| ✔️ [sigmoid_f32_kernel](./sigmoid/sigmoid.cu)|f32|/|[link](./sigmoid/)|⭐️|
| ✔️ [sigmoid_f32x4_kernel](./sigmoid/sigmoid.cu)|f32|/|[link](./sigmoid/)|⭐️|
| ✔️ [relu_f32_kernel](./relu/relu.cu)|f32|/|[link](./relu/)|⭐️|
| ✔️ [relu_f32x4_kernel](./relu/relu.cu)|f32|/|[link](./relu/)|⭐️|
| ✔️ [relu_f16_kernel](./relu/relu.cu)|f16|/|[link](./relu/)|⭐️|
| ✔️ [relu_f16x2_kernel](./relu/relu.cu)|f16|/|[link](./relu/)|⭐️|
| ✔️ [warp_reduce_f32/f16/bf16_kernel](./reduce/block_all_reduce.cu)|f16/bf16/f32|f16/bf16/f32|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_reduce_f32_kernel](./reduce/block_all_reduce.cu)|f32|f32|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_f32_f32_kernel](./reduce/block_all_reduce.cu)|f32|f32|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_f32x4_f32_kernel](./reduce/block_all_reduce.cu)|f32|f32|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_f16_f16_kernel](./reduce/block_all_reduce.cu)|f16|f16|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_f16_f32_kernel](./reduce/block_all_reduce.cu)|f16|f32|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_f16x2_f16_kernel](./reduce/block_all_reduce.cu)|f16|f16|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_f16x2_f32_kernel](./reduce/block_all_reduce.cu)|f16|f32|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_bf16_bf16_kernel](./reduce/block_all_reduce.cu)|bf16|bf16|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_bf16_f32_kernel](./reduce/block_all_reduce.cu)|bf16|f32|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_bf16x2_bf16_kernel](./reduce/block_all_reduce.cu)|bf16|bf16|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_bf16x2_f32_kernel](./reduce/block_all_reduce.cu)|bf16|f32|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_fp8_e4m3_f16_kernel](./reduce/block_all_reduce.cu)|fp8_e4m3|f16|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_fp8_e5m2_f16_kernel](./reduce/block_all_reduce.cu)|fp8_e5m2|f16|[link](./reduce/)|⭐️⭐️|
| ✔️ [block_all_reduce_sum_i8_i32_kernel](./reduce/block_all_reduce.cu)|i8|i32|[link](./reduce/)|⭐️⭐️|
| ✔️ [dot_product_f32_kernel](./dot-product/dot_product.cu)|f32|f32|[link](./dot-product/)|⭐️⭐️|
| ✔️ [dot_product_f32x4_kernel](./dot-product/dot_product.cu)|f32|f32|[link](./dot-product/)|⭐️⭐️|
| ✔️ [dot_product_f16_f32_kernel](./dot-product/dot_product.cu)|f16|f32|[link](./dot-product/)|⭐️⭐️|
| ✔️ [dot_product_f16x2_f32_kernel](./dot-product/dot_product.cu)|f16|f32|[link](./dot-product/)|⭐️⭐️|
| ✔️ [softmax_f32_kernel (grid level memory fence)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️|
| ✔️ [softmax_f32x4_kernel (grid level memory fence)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️|
| ✔️ [softmax_f32_kernel (per token)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️|
| ✔️ [softmax_f32x4_kernel (per token)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️|
| ✔️ [safe_softmax_f32_kernel (per token)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️|
| ✔️ [safe_softmax_f32x4_kernel (per token)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️|
| ✔️ [layer_norm_f32_kernel (per token)](./layer-norm/layer_norm.cu)|f32|f32|[link](./layer-norm/)|⭐️⭐️|
| ✔️ [layer_norm_f32x4_kernel (per token)](./layer-norm/layer_norm.cu)|f32|f32|[link](./layer-norm/)|⭐️⭐️|
| ❔ [layer_norm_f16_kernel (per token)](./layer-norm/layer_norm.cu)|f16|f16|❔|⭐️⭐️|
| ❔ [layer_norm_f16x2_kernel (per token)](./layer-norm/layer_norm.cu)|f16|f16|❔|⭐️⭐️|
| ✔️ [rms_norm_f32_kernel (per token)](./rms-norm/rms_norm.cu)|f32|f32|[link](./rms-norm/)|⭐️⭐️|
| ✔️ [rms_norm_f32x4_kernel (per token)](./rms-norm/rms_norm.cu)|f32|f32|[link](./rms-norm/)|⭐️⭐️|
| ❔ [rms_norm_f16_kernel (per token)](./rms-norm/rms_norm.cu)|f16|f16|❔|⭐️⭐️|
| ❔ [rms_norm_f16x2_kernel (per token)](./rms-norm/rms_norm.cu)|f16|f16|❔|⭐️⭐️|
| ✔️ [sgemm_sliced_k_f32_kernel](./sgemm/sgemm.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
| ✔️ [sgemm_t_8x8_sliced_k_f32x4_kernel](./sgemm/sgemm.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
| ❔ [hgemm_sliced_k_f16_f32_kernel](./hgemm)|f16|f32|❔|⭐️⭐️⭐️|
| ❔ [hgemm_t_tile_sliced_k_f16x2_f32_kernel](./hgemm)|f16|f32|❔|⭐️⭐️⭐️|
| ✔️ [sgemv_k32_f32_kernel](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
| ✔️ [sgemv_k128_f32x4_kernel](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
| ✔️ [sgemv_k16_f32_kernel](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
| ❔ [hgemv_k32_f16_kernel](./hgemv)|f16|f16|❔|⭐️⭐️⭐️|
| ❔ [hgemv_k128_f16x2_kernel](./hgemv)|f16|f16|❔|⭐️⭐️⭐️|
| ❔ [hgemv_k16_f16_kernel](./hgemv)|f16|f16|❔|⭐️⭐️⭐️|
| ✔️ [flash_attn_1_fwd_f32_kernel](./flash-attn/flash_attn_1_fwd_f32.cu)|f32|f32|[link](./flash-attn)|⭐️⭐️⭐️|
| ❔ [flash_attn_2_fwd_f32_kernel](./flash-attn/flash_attn_2_fwd_f32.cu)|f32|f32|[link](./flash-attn)|⭐️⭐️⭐️|
| ❔ [flash_attn_2_fwd_f16_kernel](./flash-attn/flash_attn_2_fwd_f32.cu)|f16|f32|[link](./flash-attn)|⭐️⭐️⭐️|
| ❔ [flash_attn_2_fwd_bf16_kernel](./flash-attn/flash_attn_2_fwd_f32.cu)|bf16|f32|[link](./flash-attn)|⭐️⭐️⭐️|
| ✔️ [hard_nms (cpp only)](./nms/nms.cc)|f32|/|❔|⭐️|
| ✔️ [notes v1 (deprecated)](./notes-v1.cu)|f32|f32|/|⭐️|
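As one example of the patterns indexed above, a warp-level sum reduction is typically built from shuffle intrinsics, and the block-level all-reduce stacks a shared-memory stage on top of it. The sketch below is my own illustration of this general pattern, not the exact code in `./reduce/block_all_reduce.cu`, and it assumes the block size is a multiple of the 32-thread warp size:

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Butterfly (xor) shuffle reduction: after 5 halving steps every
// lane in the warp holds the full 32-lane sum.
__device__ __forceinline__ float warp_reduce_sum_f32(float val) {
  #pragma unroll
  for (int mask = WARP_SIZE >> 1; mask >= 1; mask >>= 1)
    val += __shfl_xor_sync(0xffffffff, val, mask);
  return val;
}

// Block-level sum: each warp reduces, lane 0 of each warp writes its
// partial sum to shared memory, then the first warp reduces the partials.
__device__ float block_reduce_sum_f32(float val) {
  __shared__ float smem[32];  // one slot per warp (<= 32 warps per block)
  const int lane = threadIdx.x % WARP_SIZE;
  const int warp = threadIdx.x / WARP_SIZE;
  const int num_warps = blockDim.x / WARP_SIZE;

  val = warp_reduce_sum_f32(val);
  if (lane == 0) smem[warp] = val;
  __syncthreads();

  val = (lane < num_warps) ? smem[lane] : 0.0f;
  if (warp == 0) val = warp_reduce_sum_f32(val);
  return val;  // the full block sum, valid in warp 0
}
```

The same two-stage structure underlies the per-token softmax, layernorm, and rmsnorm kernels, which all need a row-wide sum or max before the elementwise step.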

## 0x01 📖 Blog Index

### 📖 LLM | Multimodal | Diffusion | Inference Optimization (written by me)

|📖 Type-Title|📖 Author|
| [[cutlass教程][入门]📖cutlass 软件架构](https://zhuanlan.zhihu.com/p/678915618)|@JoeNomad|
| [[cutlass教程][入门]📖CUTLASS 基础介绍](https://zhuanlan.zhihu.com/p/671324125)|@进击的Killua|
| [[cutlass教程][入门]📖乱谈CUTLASS GTC2020 SLIDES](https://zhuanlan.zhihu.com/p/674693873)|@zzk again|
| [[cutlass教程][深入]📖cutlass block swizzle 和 tile iterator](https://zhuanlan.zhihu.com/p/679929705)|@JoeNomad|
| [[cutlass教程][深入]📖cutlass bank conflict free 的shared memory layout](https://zhuanlan.zhihu.com/p/681966685)|@JoeNomad|
| [[cutlass教程][深入]📖cutlass 多级流水线](https://zhuanlan.zhihu.com/p/687397095)|@JoeNomad|
| [[GPU指令集架构][精解]📖NVidia GPU指令集架构-前言](https://zhuanlan.zhihu.com/p/686198447)|@reed|

💡 Note: these articles are excellent and I have learned a great deal from them. PRs recommending more great articles are welcome!

## ©️License

GNU General Public License v3.0