
Commit be6ba18

[Refactor][5/N] CUDA Learn Notes refactor Part-5 (xlite-dev#15)

1 parent d616f29


63 files changed: +2678 additions, −188 deletions

`.gitignore` — 3 additions, 0 deletions

```diff
@@ -12,3 +12,6 @@ __pycache__
 *.engine
 *.pt
 *.pth
+*.nsys*
+*.sqlite
+*.engine
```

`LICENSE` — 1 addition, 1 deletion (the replaced line is textually identical; the change is most likely adding a missing newline at end of file)

```diff
@@ -671,4 +671,4 @@ into proprietary programs. If your program is a subroutine library, you
 may consider it more useful to permit linking proprietary applications with
 the library. If this is what you want to do, use the GNU Lesser General
 Public License instead of this License. But first, please read
-<https://www.gnu.org/licenses/why-not-lgpl.html>.
+<https://www.gnu.org/licenses/why-not-lgpl.html>.
```

`README.md` — 73 additions, 74 deletions

```diff
@@ -11,10 +11,80 @@
 
 📖**CUDA-Learn-Notes**: 🎉CUDA/C++ notes / tech blog: **fp32, fp16/bf16, fp8/int8**, flash_attn, sgemm, sgemv, warp/block reduce, dot prod, elementwise, softmax, layernorm, rmsnorm, hist etc. 👉News: Most of my time now is focused on **LLM/VLM/Diffusion** Inference. Please check 📖[Awesome-LLM-Inference](https://github.com/DefTruth/Awesome-LLM-Inference) ![](https://img.shields.io/github/stars/DefTruth/Awesome-LLM-Inference.svg?style=social), 📖[Awesome-SD-Inference](https://github.com/DefTruth/Awesome-SD-Inference) ![](https://img.shields.io/github/stars/DefTruth/Awesome-SD-Inference.svg?style=social) and 📖[CUDA-Learn-Notes](https://github.com/DefTruth/CUDA-Learn-Notes) ![](https://img.shields.io/github/stars/DefTruth/CUDA-Learn-Notes.svg?style=social) for more details.
 
-## 0x00 📖 Blog index
-
 <img width="1438" alt="image" src="https://github.com/user-attachments/assets/0c5e5125-586f-43fa-8e8b-e2c61c1afbbe">
 
+## 0x00 📖 CUDA kernel index (common interview questions)
+- / = not supported yet.
+- ✔️ = known to work and already supported.
+- ❔ = planned, but not coming soon; maybe a few weeks out.
+- **workflow**: custom **CUDA** kernel impl -> **Torch** python binding -> run tests.
+
+|📖 cuda kernel| 📖 elem dtype| 📖 acc dtype| 📖 docs | 📖 level |
+|:---|:---|:---|:---|:---|
+| ✔️ [elementwise_f32_kernel](./elementwise/elementwise.cu)|f32|/|[link](./elementwise/)|⭐️|
+| ✔️ [elementwise_f32x4_kernel](./elementwise/elementwise.cu)|f32|/|[link](./elementwise/)|⭐️|
+| ✔️ [elementwise_f16_kernel](./elementwise/elementwise.cu)|f16|/|[link](./elementwise/)|⭐️|
+| ✔️ [elementwise_f16x2_kernel](./elementwise/elementwise.cu)|f16|/|[link](./elementwise/)|⭐️|
+| ✔️ [histogram_i32_kernel](./histogram/histogram.cu)|i32|/|[link](./histogram/)|⭐️|
+| ✔️ [histogram_i32x4_kernel](./histogram/histogram.cu)|i32|/|[link](./histogram/)|⭐️|
+| ✔️ [sigmoid_f32_kernel](./sigmoid/sigmoid.cu)|f32|/|[link](./sigmoid/)|⭐️|
+| ✔️ [sigmoid_f32x4_kernel](./sigmoid/sigmoid.cu)|f32|/|[link](./sigmoid/)|⭐️|
+| ✔️ [relu_f32_kernel](./relu/relu.cu)|f32|/|[link](./relu/)|⭐️|
+| ✔️ [relu_f32x4_kernel](./relu/relu.cu)|f32|/|[link](./relu/)|⭐️|
+| ✔️ [relu_f16_kernel](./relu/relu.cu)|f16|/|[link](./relu/)|⭐️|
+| ✔️ [relu_f16x2_kernel](./relu/relu.cu)|f16|/|[link](./relu/)|⭐️|
+| ✔️ [warp_reduce_f32/f16/bf16_kernel](./reduce/block_all_reduce.cu)|f16/bf16/f32|f16/bf16/f32|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_reduce_f32_kernel](./reduce/block_all_reduce.cu)|f32|f32|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_f32_f32_kernel](./reduce/block_all_reduce.cu)|f32|f32|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_f32x4_f32_kernel](./reduce/block_all_reduce.cu)|f32|f32|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_f16_f16_kernel](./reduce/block_all_reduce.cu)|f16|f16|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_f16_f32_kernel](./reduce/block_all_reduce.cu)|f16|f32|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_f16x2_f16_kernel](./reduce/block_all_reduce.cu)|f16|f16|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_f16x2_f32_kernel](./reduce/block_all_reduce.cu)|f16|f32|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_bf16_bf16_kernel](./reduce/block_all_reduce.cu)|bf16|bf16|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_bf16_f32_kernel](./reduce/block_all_reduce.cu)|bf16|f32|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_bf16x2_bf16_kernel](./reduce/block_all_reduce.cu)|bf16|bf16|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_bf16x2_f32_kernel](./reduce/block_all_reduce.cu)|bf16|f32|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_fp8_e4m3_f16_kernel](./reduce/block_all_reduce.cu)|fp8_e4m3|f16|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_fp8_e5m2_f16_kernel](./reduce/block_all_reduce.cu)|fp8_e5m2|f16|[link](./reduce/)|⭐️⭐️|
+| ✔️ [block_all_reduce_sum_i8_i32_kernel](./reduce/block_all_reduce.cu)|i8|i32|[link](./reduce/)|⭐️⭐️|
+| ✔️ [dot_product_f32_kernel](./dot-product/dot_product.cu)|f32|f32|[link](./dot-product/)|⭐️⭐️|
+| ✔️ [dot_product_f32x4_kernel](./dot-product/dot_product.cu)|f32|f32|[link](./dot-product/)|⭐️⭐️|
+| ✔️ [dot_product_f16_f32_kernel](./dot-product/dot_product.cu)|f16|f32|[link](./dot-product/)|⭐️⭐️|
+| ✔️ [dot_product_f16x2_f32_kernel](./dot-product/dot_product.cu)|f16|f32|[link](./dot-product/)|⭐️⭐️|
+| ✔️ [softmax_f32_kernel (grid level memory fence)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️|
+| ✔️ [softmax_f32x4_kernel (grid level memory fence)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️|
+| ✔️ [softmax_f32_kernel (per token)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️|
+| ✔️ [softmax_f32x4_kernel (per token)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️|
+| ✔️ [safe_softmax_f32_kernel (per token)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️|
+| ✔️ [safe_softmax_f32x4_kernel (per token)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️|
+| ✔️ [layer_norm_f32_kernel (per token)](./layer-norm/layer_norm.cu)|f32|f32|[link](./layer-norm/)|⭐️⭐️|
+| ✔️ [layer_norm_f32x4_kernel (per token)](./layer-norm/layer_norm.cu)|f32|f32|[link](./layer-norm/)|⭐️⭐️|
+|[layer_norm_f16_kernel (per token)](./layer-norm/layer_norm.cu)|f16|f16||⭐️⭐️|
+|[layer_norm_f16x2_kernel (per token)](./layer-norm/layer_norm.cu)|f16|f16||⭐️⭐️|
+| ✔️ [rms_norm_f32_kernel (per token)](./rms-norm/rms_norm.cu)|f32|f32|[link](./rms-norm/)|⭐️⭐️|
+| ✔️ [rms_norm_f32x4_kernel (per token)](./rms-norm/rms_norm.cu)|f32|f32|[link](./rms-norm/)|⭐️⭐️|
+|[rms_norm_f16_kernel (per token)](./rms-norm/rms_norm.cu)|f16|f16||⭐️⭐️|
+|[rms_norm_f16x2_kernel (per token)](./rms-norm/rms_norm.cu)|f16|f16||⭐️⭐️|
+| ✔️ [sgemm_sliced_k_f32_kernel](./sgemm/sgemm.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
+| ✔️ [sgemm_t_8x8_sliced_k_f32x4_kernel](./sgemm/sgemm.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
+|[hgemm_sliced_k_f16_f32_kernel](./hgemm)|f16|f32||⭐️⭐️⭐️|
+|[hgemm_t_tile_sliced_k_f16x2_f32_kernel](./hgemm)|f16|f32||⭐️⭐️⭐️|
+| ✔️ [sgemv_k32_f32_kernel](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
+| ✔️ [sgemv_k128_f32x4_kernel](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
+| ✔️ [sgemv_k16_f32_kernel](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
+|[hgemv_k32_f16_kernel](./hgemv)|f16|f16||⭐️⭐️⭐️|
+|[hgemv_k128_f16x2_kernel](./hgemv)|f16|f16||⭐️⭐️⭐️|
+|[hgemv_k16_f16_kernel](./hgemv)|f16|f16||⭐️⭐️⭐️|
+| ✔️ [flash_attn_1_fwd_f32_kernel](./flash-attn/flash_attn_1_fwd_f32.cu)|f32|f32|[link](./flash-attn)|⭐️⭐️⭐️|
+|[flash_attn_2_fwd_f32_kernel](./flash-attn/flash_attn_2_fwd_f32.cu)|f32|f32|[link](./flash-attn)|⭐️⭐️⭐️|
+|[flash_attn_2_fwd_f16_kernel](./flash-attn/flash_attn_2_fwd_f32.cu)|f16|f32|[link](./flash-attn)|⭐️⭐️⭐️|
+|[flash_attn_2_fwd_bf16_kernel](./flash-attn/flash_attn_2_fwd_f32.cu)|bf16|f32|[link](./flash-attn)|⭐️⭐️⭐️|
+| ✔️ [hard_nms cpp only](./nms/nms.cc)|f32|/||⭐️|
+| ✔️ [notes v1(deprecated)](./notes-v1.cu)|f32|f32|/|⭐️|
+
+## 0x01 📖 Blog index
+
 ### 📖 LLM | Multimodal | Diffusion | Inference optimization (written by me)
 
 |📖 Type - Title|📖 Author|
```
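The kernel table above lists both plain `softmax` and `safe_softmax` per-token variants. As a reference for the distinction, here is a minimal NumPy sketch (my illustration, not code from this commit): the "safe" version subtracts each row's max before exponentiating, so `exp` never sees a large positive argument and cannot overflow.

```python
import numpy as np

def naive_softmax(x: np.ndarray) -> np.ndarray:
    """Per-token (row-wise) softmax; overflows for large logits."""
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def safe_softmax(x: np.ndarray) -> np.ndarray:
    """Subtract the row max first, so exp() only sees values <= 0."""
    m = x.max(axis=-1, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([[1000.0, 1001.0, 1002.0]])  # large logits
print(naive_softmax(x))  # exp(1000) overflows to inf -> nan
print(safe_softmax(x))   # well-defined probabilities summing to 1
```

The same max-subtraction trick is what the per-token `safe_softmax_*` CUDA kernels implement, with the row max found by a warp/block reduce.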
```diff
@@ -125,7 +195,7 @@
 | [[cutlass tutorial][intro]📖cutlass software architecture](https://zhuanlan.zhihu.com/p/678915618)|@JoeNomad|
 | [[cutlass tutorial][intro]📖Introduction to CUTLASS basics](https://zhuanlan.zhihu.com/p/671324125)|@进击的Killua|
 | [[cutlass tutorial][intro]📖Rambling about CUTLASS GTC2020 SLIDES](https://zhuanlan.zhihu.com/p/674693873)|@zzk again|
-| [[cutlass tutorial][advanced]📖cutlass block swizzle and tile iterator(@JoeNomad)](https://zhuanlan.zhihu.com/p/679929705)|@JoeNomad|
+| [[cutlass tutorial][advanced]📖cutlass block swizzle and tile iterator](https://zhuanlan.zhihu.com/p/679929705)|@JoeNomad|
 | [[cutlass tutorial][advanced]📖cutlass bank-conflict-free shared memory layout](https://zhuanlan.zhihu.com/p/681966685)|@JoeNomad|
 | [[cutlass tutorial][advanced]📖cutlass multi-stage pipelining](https://zhuanlan.zhihu.com/p/687397095)|@JoeNomad|
 | [[GPU ISA][in depth]📖NVidia GPU ISA - preface](https://zhuanlan.zhihu.com/p/686198447)|@reed|
```
```diff
@@ -150,77 +220,6 @@
 
 💡Note: these community articles are excellent and I learned a lot from them. PRs recommending more great articles are welcome!
 
-## 0x01 📖 CUDA kernel index (common interview questions)
-<div id="kernellist"></div>
-
-- / = not supported yet.
-- ✔️ = known to work and already supported.
-- ❔ = planned, but not coming soon; maybe a few weeks out.
-- **workflow**: custom **CUDA** kernel impl -> **Torch** python binding -> run tests.
-
-|📖 cuda kernel| 📖 elem dtype| 📖 acc dtype| 📖 docs |
-|:---|:---|:---|:---|
-| ✔️ [sgemm_sliced_k_f32_kernel](./sgemm/sgemm.cu)|f32|f32||
-| ✔️ [sgemm_t_tile_sliced_k_f32x4_kernel](./sgemm/sgemm.cu)|f32|f32||
-|[hgemm_sliced_k_f16_f32_kernel](./sgemm/sgemm.cu)|f16|f32||
-|[hgemm_t_tile_sliced_k_f16x2_f32_kernel](./sgemm/sgemm.cu)|f16|f32||
-| ✔️ [sgemv_k32_f32_kernel](./sgemv/sgemv.cu)|f32|f32||
-| ✔️ [sgemv_k128_f32x4_kernel](./sgemv/sgemv.cu)|f32|f32||
-| ✔️ [sgemv_k16_f32_kernel](./sgemv/sgemv.cu)|f32|f32||
-|[hgemv_k32_f16_kernel](./sgemv/sgemv.cu)|f16|f16||
-|[hgemv_k128_f16x2_kernel](./sgemv/sgemv.cu)|f16|f16||
-|[hgemv_k16_f16_kernel](./sgemv/sgemv.cu)|f16|f16||
-| ✔️ [warp_reduce_f32/f16/bf16_kernel](./reduce/block_all_reduce.cu)|f16/bf16/f32|f16/bf16/f32|[link](./reduce/)|
-| ✔️ [block_reduce_f32_kernel](./reduce/block_all_reduce.cu)|f32|f32|[link](./reduce/)|
-| ✔️ [block_all_reduce_sum_f32_f32_kernel](./reduce/block_all_reduce.cu)|f32|f32|[link](./reduce/)|
-| ✔️ [block_all_reduce_sum_f32x4_f32_kernel](./reduce/block_all_reduce.cu)|f32|f32|[link](./reduce/)|
-| ✔️ [block_all_reduce_sum_f16_f16_kernel](./reduce/block_all_reduce.cu)|f16|f16|[link](./reduce/)|
-| ✔️ [block_all_reduce_sum_f16_f32_kernel](./reduce/block_all_reduce.cu)|f16|f32|[link](./reduce/)|
-| ✔️ [block_all_reduce_sum_f16x2_f16_kernel](./reduce/block_all_reduce.cu)|f16|f16|[link](./reduce/)|
-| ✔️ [block_all_reduce_sum_f16x2_f32_kernel](./reduce/block_all_reduce.cu)|f16|f32|[link](./reduce/)|
-| ✔️ [block_all_reduce_sum_bf16_bf16_kernel](./reduce/block_all_reduce.cu)|bf16|bf16|[link](./reduce/)|
-| ✔️ [block_all_reduce_sum_bf16_f32_kernel](./reduce/block_all_reduce.cu)|bf16|f32|[link](./reduce/)|
-| ✔️ [block_all_reduce_sum_bf16x2_bf16_kernel](./reduce/block_all_reduce.cu)|bf16|bf16|[link](./reduce/)|
-| ✔️ [block_all_reduce_sum_bf16x2_f32_kernel](./reduce/block_all_reduce.cu)|bf16|f32|[link](./reduce/)|
-| ✔️ [block_all_reduce_sum_fp8_e4m3_f16_kernel](./reduce/block_all_reduce.cu)|fp8_e4m3|f16|[link](./reduce/)|
-|[block_all_reduce_sum_i8_i32_kernel](./reduce/block_all_reduce.cu)|i8|i32|[link](./reduce/)|
-| ✔️ [dot_product_f32_kernel](./dot-product/dot_product.cu)|f32|f32||
-| ✔️ [dot_product_f32x4_kernel](./dot-product/dot_product.cu)|f32|f32||
-|[dot_product_f16_f16_kernel](./dot-product/dot_product.cu)|f16|f16||
-|[dot_product_f16x2_f16_kernel](./dot-product/dot_product.cu)|f16|f16||
-|[dot_product_f16_f32_kernel](./dot-product/dot_product.cu)|f16|f32|/||
-|[dot_product_f16x2_f32_kernel](./dot-product/dot_product.cu)|f16|f32|/||
-| ✔️ [elementwise_f32_kernel](./elementwise/elementwise.cu)|f32|/|/||
-| ✔️ [elementwise_f32x4_kernel](./elementwise/elementwise.cu)|f32|/|/||
-|[elementwise_f16_kernel](./elementwise/elementwise.cu)|f16|/|/||
-|[elementwise_f16x2_kernel](./elementwise/elementwise.cu)|f16|/|/||
-| ✔️ [histogram_i32_kernel](./histogram/histogram.cu)|i32|/|/||
-| ✔️ [histogram_i32x4_kernel](./histogram/histogram.cu)|i32|/|/||
-| ✔️ [softmax_f32_kernel (grid level memory fence)](./softmax/softmax.cu)|f32|f32||
-| ✔️ [softmax_f32x4_kernel (grid level memory fence)](./softmax/softmax.cu)|f32|f32||
-|[softmax_f32x4_kernel (per token)](./softmax/softmax.cu)|f32|f32||
-|[safe_softmax_f32x4_kernel (per token)](./softmax/softmax.cu)|f32|f32||
-| ✔️ [sigmoid_f32_kernel](./sigmoid/sigmoid.cu)|f32|/||
-| ✔️ [sigmoid_f32x4_kernel](./sigmoid/sigmoid.cu)|f32|/||
-| ✔️ [relu_f32_kernel](./relu/relu.cu)|f32|/||
-| ✔️ [relu_f32x4_kernel](./relu/relu.cu)|f32|/||
-|[relu_f16_kernel](./relu/relu.cu)|f16|/||
-|[relu_f16x2_kernel](./relu/relu.cu)|f16|/||
-| ✔️ [layer_norm_f32_kernel (per token)](./layer-norm/layer_norm.cu)|f32|f32||
-| ✔️ [layer_norm_f32x4_kernel (per token)](./layer-norm/layer_norm.cu)|f32|f32||
-|[layer_norm_f16_kernel (per token)](./layer-norm/layer_norm.cu)|f16|f16||
-|[layer_norm_f16x2_kernel (per token)](./layer-norm/layer_norm.cu)|f16|f16||
-| ✔️ [rms_norm_f32_kernel (per token)](./rms-norm/rms_norm.cu)|f32|f32||
-| ✔️ [rms_norm_f32x4_kernel (per token)](./rms-norm/rms_norm.cu)|f32|f32||
-|[rms_norm_f16_kernel (per token)](./rms-norm/rms_norm.cu)|f16|f16||
-|[rms_norm_f16x2_kernel (per token)](./rms-norm/rms_norm.cu)|f16|f16||
-| ✔️ [flash_attn_1_fwd_f32_kernel](./flash-attn/flash_attn_1_fwd_f32.cu)|f32|f32|[link](./flash-attn)|
-|[flash_attn_2_fwd_f32_kernel](./flash-attn/flash_attn_2_fwd_f32.cu)|f32|f32|[link](./flash-attn)|
-|[flash_attn_2_fwd_f16_kernel](./flash-attn/flash_attn_2_fwd_f32.cu)|f16|f32|[link](./flash-attn)|
-|[flash_attn_2_fwd_bf16_kernel](./flash-attn/flash_attn_2_fwd_f32.cu)|bf16|f32|[link](./flash-attn)|
-| ✔️ [hard_nms cpp only](./nms/nms.cc)|f32|/||
-| ✔️ [notes v1(deprecated)](./notes-v1.cu)|f32|f32|/|
-
 ## ©️License
 GNU General Public License v3.0
 
```
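Throughout the kernel table, each entry carries both an `elem dtype` and an `acc dtype` (e.g. `block_all_reduce_sum_f16_f32` reads f16 elements but sums them in f32). A minimal NumPy sketch of why the accumulator width matters (my illustration, not repo code; the CUDA kernels do the summation with warp/block shuffles):

```python
import numpy as np

def reduce_sum(x: np.ndarray, acc_dtype) -> float:
    """Sequentially sum elements into an accumulator of acc_dtype,
    mimicking the elem-dtype / acc-dtype split in the kernel table."""
    acc = acc_dtype(0)
    for v in x:
        acc = acc_dtype(acc + acc_dtype(v))
    return float(acc)

x = np.ones(4096, dtype=np.float16)   # exact sum is 4096
sum_f16 = reduce_sum(x, np.float16)   # f16 accumulator stalls at 2048,
                                      # where the f16 spacing exceeds 1.0
sum_f32 = reduce_sum(x, np.float32)   # f32 accumulator stays exact
print(sum_f16, sum_f32)
```

This is why the f16/bf16/fp8 kernels in the table usually offer an f32 (or at least f16) accumulator variant: the low-precision accumulator silently drops small addends once the running sum grows large.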

`cuda-slides/.gitignore` — 9 additions, 2 deletions

```diff
@@ -7,7 +7,14 @@
 build
 *.whl
 tmp
-__pycache__
 *.onnx
+*.pb
+*.pbtxt
+*.pt
+*.pth
 *.engine
-
+*.bin
+*.nsys
+*.nvvp
+*.nsys*
+*.sqlite
```

`cutlass/.gitignore` — 11 additions, 1 deletion

```diff
@@ -7,4 +7,14 @@
 build
 *.whl
 tmp
-
+*.onnx
+*.pb
+*.pbtxt
+*.pt
+*.pth
+*.engine
+*.bin
+*.nsys
+*.nvvp
+*.nsys*
+*.sqlite
```

`dot-product/README.md` — 33 additions, 0 deletions (new file)

````markdown
# Dot Product

## 0x00 Overview

Contents:

- [X] dot_prod_f32_acc_with_f32_kernel
- [X] dot_prod_f32x4_acc_with_f32_kernel (float4 vectorized version)
- [X] dot_prod_f16_acc_with_f32_kernel (fp16 version with fp32 accumulator)
- [X] dot_prod_f16x2_acc_with_f32_kernel (vectorized fp16 version with fp32 accumulator)
- [X] PyTorch bindings

## Testing

```bash
# Build for the Ada architecture only; if unset, all architectures
# are compiled by default, which takes much longer.
export TORCH_CUDA_ARCH_LIST=Ada
python3 dot_product.py
```

Output:

```bash
--------------------------------------------------------------------------------
out_f32f32: -88.81410217 , time:0.01135945ms
out_f32x4f32: -88.81417847 , time:0.01171017ms
out_f32f32_th: -88.81379700 , time:0.01147819ms
--------------------------------------------------------------------------------
out_f16f32: -88.62890625 , time:0.01113868ms
out_f16x2f32: -88.65764618 , time:0.01108241ms
out_f16f16_th: -88.75000000 , time:0.01112628ms
--------------------------------------------------------------------------------
```
````
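In that output, the three f32 results agree to about four decimals near −88.814, while the f16 variants land around −88.63 to −88.75: rounding the inputs to fp16 changes the result even when accumulation stays in fp32. A rough NumPy sketch of that effect (illustrative only; the variable names are mine, not from `dot_product.py`):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)
b = rng.standard_normal(4096).astype(np.float32)

# f32 elements, f32 accumulation -- the reference result.
ref = float(np.dot(a, b))

# f16 elements, f32 accumulation, mirroring dot_prod_f16_acc_with_f32_kernel:
# the wide accumulator removes summation error, so the remaining gap
# comes from rounding each element to fp16 before the multiply.
a16 = a.astype(np.float16).astype(np.float32)
b16 = b.astype(np.float16).astype(np.float32)
f16f32 = float(np.dot(a16, b16))

print(ref, f16f32)  # close, but differing in the low decimal places
```

The `out_f16f16_th` row drifts further still, since PyTorch's f16 reference there also accumulates in f16.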
