[WIP][DNR][Pipeliner] Enable automatic loop fusion #5726

Draft · wants to merge 18 commits into base: main

Conversation

@Mogball (Collaborator) commented on Jan 28, 2025

Performance of 09-persistent-matmul.py on H100.
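
For context (not the code in this PR), here is a minimal plain-Python sketch of what "loop fusion" means for a persistent matmul: the outer per-tile loop and the inner K loop are flattened into one loop, so the pipeliner sees a single loop body and can overlap the epilogue of one tile with the prologue of the next. The helper names (`work`, `persistent_nested`, `persistent_fused`) are purely illustrative.

```python
# Illustrative sketch only; not the pass implementation.

def work(tile, k):
    # Stand-in for "load one K-slice of A/B and accumulate a dot product".
    return tile * 10 + k

def persistent_nested(num_tiles, k_iters):
    out = []
    for tile in range(num_tiles):           # persistent outer loop over output tiles
        acc = 0
        for k in range(k_iters):            # inner accumulation loop over K
            acc += work(tile, k)
        out.append(acc)                      # per-tile epilogue (store the C tile)
    return out

def persistent_fused(num_tiles, k_iters):
    out, acc = [], 0
    for i in range(num_tiles * k_iters):     # single fused loop
        tile, k = divmod(i, k_iters)         # recover (tile, k) from the flat index
        if k == 0:
            acc = 0                          # prologue of a new tile
        acc += work(tile, k)
        if k == k_iters - 1:
            out.append(acc)                  # epilogue of the finished tile
    return out

assert persistent_nested(4, 8) == persistent_fused(4, 8)
```
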

Before (2 runs)

root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py 
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
273.146 4025.362 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 283.506 2666.310 cublas [M=8192, N=8192, K=512]
│  └─ nan 2666.310 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 223.326 307.709 matmul_kernel [M=8192, N=8192, K=512]
├─ 259.293 265.027 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 238.500 288.133 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 258.738 265.594 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.529 232.531 torch [M=8192, N=8192, K=512]
   └─ nan 232.531 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 223.33 Max: 295.53)
█ 288.31 - 295.53
█ 273.87 - 288.31
█ 259.43 - 273.87
█ 244.99 - 259.43
█ 230.55 - 244.99
█ 223.33 - 230.55

name User code    ◀  Only in left graph    ▶  Only in right graph
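
As a sanity check on the metric, the tflop16/s numbers above can be related back to per-kernel time using the conventional 2·M·N·K FLOP count for a GEMM of this shape (a back-of-the-envelope sketch; the 283.506 figure is the cublas entry from the first run above):

```python
# Back-of-the-envelope check relating tflop16/s to kernel time,
# assuming the standard 2*M*N*K FLOP count for the benchmarked GEMM shape.
M, N, K = 8192, 8192, 512
flops = 2 * M * N * K                        # ~6.87e10 FLOPs per matmul call
tflops_reported = 283.506                    # cublas entry from the first run
time_ms = flops / (tflops_reported * 1e12) * 1e3
print(f"{flops / 1e12:.4f} TFLOP per call -> ~{time_ms:.3f} ms per kernel launch")
```
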

root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py 
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
273.367 4022.105 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 284.284 2659.011 cublas [M=8192, N=8192, K=512]
│  └─ nan 2659.011 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 221.823 309.795 matmul_kernel [M=8192, N=8192, K=512]
├─ 254.755 269.748 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 240.774 285.411 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 259.109 265.214 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.100 232.868 torch [M=8192, N=8192, K=512]
   └─ nan 232.868 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 221.82 Max: 295.10)
█ 287.77 - 295.10
█ 273.12 - 287.77
█ 258.46 - 273.12
█ 243.81 - 258.46
█ 229.15 - 243.81
█ 221.82 - 229.15

name User code    ◀  Only in left graph    ▶  Only in right graph

After (2 runs)

root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py 
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
274.040 4012.227 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 285.369 2648.904 cublas [M=8192, N=8192, K=512]
│  └─ nan 2648.904 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 217.548 315.881 matmul_kernel [M=8192, N=8192, K=512]
├─ 262.312 261.976 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 244.740 280.785 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 255.113 269.368 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 292.108 235.253 torch [M=8192, N=8192, K=512]
   └─ nan 235.253 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 217.55 Max: 292.11)
█ 284.65 - 292.11
█ 269.74 - 284.65
█ 254.83 - 269.74
█ 239.92 - 254.83
█ 225.00 - 239.92
█ 217.55 - 225.00

name User code    ◀  Only in left graph    ▶  Only in right graph

root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py 
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
274.997 3998.267 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 285.498 2647.706 cublas [M=8192, N=8192, K=512]
│  └─ nan 2647.706 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 217.884 315.394 matmul_kernel [M=8192, N=8192, K=512]
├─ 262.534 261.755 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 246.617 278.649 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 262.525 261.764 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.007 232.942 torch [M=8192, N=8192, K=512]
   └─ nan 232.942 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 217.88 Max: 295.01)
█ 287.29 - 295.01
█ 271.87 - 287.29
█ 256.45 - 271.87
█ 241.02 - 256.45
█ 225.60 - 241.02
█ 217.88 - 225.60

name User code    ◀  Only in left graph    ▶  Only in right graph

@Mogball (Collaborator, Author) commented on Jan 28, 2025

Blackwell baseline

~/code/triton$ python python/tutorials/09-persistent-matmul.py 
TMA benchmarks will be running with experimental grid constant TMA descriptor.
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
813.843 1351.012 ROOT
├─ nan 0.022 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.028 _ZN2at6native54_GLOBAL__N__ee6f1694_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET1_T2_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_f6float4S9_SO_SH_EEvSJ_SL_RKT3_T4_EUlifE_EEvlNS_15PhiloxCudaStateESK_SL_
├─ 880.461 858.544 cublas [M=8192, N=8192, K=512]
│  └─ nan 858.544 cutlass3x_sm100_tensorop_s256x256x16gemm_f16_f16_f32_f16_f16_256x256x64_0_tnn_align8_2sm_bias_f16_relu
├─ 549.243 125.117 matmul_kernel [M=8192, N=8192, K=512]
├─ 761.246 90.272 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 508.185 135.225 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 1011.275 67.953 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 930.519 73.851 torch [M=8192, N=8192, K=512]
   └─ nan 73.851 cutlass3x_sm100_tensorop_s256x256x16gemm_f16_f16_f32_f16_f16_256x256x64_0_tnn_align8_2sm_bias_f16_relu

Legend (Metric: tflop16/s (inc) Min: 508.18 Max: 1011.28)
█ 960.97 - 1011.28
█ 860.35 - 960.97
█ 759.73 - 860.35
█ 659.11 - 759.73
█ 558.49 - 659.11
█ 508.18 - 558.49

name User code    ◀  Only in left graph    ▶  Only in right graph

(.venv) jeffniu@larissa:~/code/triton$ python python/tutorials/09-persistent-matmul.py 
TMA benchmarks will be running with experimental grid constant TMA descriptor.
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
812.170 1353.795 ROOT
├─ nan 0.023 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.028 _ZN2at6native54_GLOBAL__N__ee6f1694_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET1_T2_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_f6float4S9_SO_SH_EEvSJ_SL_RKT3_T4_EUlifE_EEvlNS_15PhiloxCudaStateESK_SL_
├─ 879.339 859.639 cublas [M=8192, N=8192, K=512]
│  └─ nan 859.639 cutlass3x_sm100_tensorop_s256x256x16gemm_f16_f16_f32_f16_f16_256x256x64_0_tnn_align8_2sm_bias_f16_relu
├─ 548.474 125.292 matmul_kernel [M=8192, N=8192, K=512]
├─ 744.885 92.255 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 507.848 135.315 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 1007.384 68.216 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 941.010 73.027 torch [M=8192, N=8192, K=512]
   └─ nan 73.027 cutlass3x_sm100_tensorop_s256x256x16gemm_f16_f16_f32_f16_f16_256x256x64_0_tnn_align8_2sm_bias_f16_relu

Legend (Metric: tflop16/s (inc) Min: 507.85 Max: 1007.38)
█ 957.43 - 1007.38
█ 857.52 - 957.43
█ 757.62 - 857.52
█ 657.71 - 757.62
█ 557.80 - 657.71
█ 507.85 - 557.80

name User code    ◀  Only in left graph    ▶  Only in right graph

@Mogball (Collaborator, Author) commented on Jan 30, 2025

...after!

~/code/triton$ python python/tutorials/09-persistent-matmul.py 
TMA benchmarks will be running with experimental grid constant TMA descriptor.
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
815.022 1349.057 ROOT
├─ nan 0.022 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.028 _ZN2at6native54_GLOBAL__N__ee6f1694_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET1_T2_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_f6float4S9_SO_SH_EEvSJ_SL_RKT3_T4_EUlifE_EEvlNS_15PhiloxCudaStateESK_SL_
├─ 880.667 858.343 cublas [M=8192, N=8192, K=512]
│  └─ nan 858.343 cutlass3x_sm100_tensorop_s256x256x16gemm_f16_f16_f32_f16_f16_256x256x64_0_tnn_align8_2sm_bias_f16_relu
├─ 550.935 124.732 matmul_kernel [M=8192, N=8192, K=512]
├─ 766.602 89.642 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 509.761 134.807 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 1018.712 67.457 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 928.320 74.026 torch [M=8192, N=8192, K=512]
   └─ nan 74.026 cutlass3x_sm100_tensorop_s256x256x16gemm_f16_f16_f32_f16_f16_256x256x64_0_tnn_align8_2sm_bias_f16_relu

Legend (Metric: tflop16/s (inc) Min: 509.76 Max: 1018.71)
█ 967.82 - 1018.71
█ 866.03 - 967.82
█ 764.24 - 866.03
█ 662.45 - 764.24
█ 560.66 - 662.45
█ 509.76 - 560.66

name User code    ◀  Only in left graph    ▶  Only in right graph

~/code/triton$ python python/tutorials/09-persistent-matmul.py 
TMA benchmarks will be running with experimental grid constant TMA descriptor.
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅ 
812.881 1352.611 ROOT
├─ nan 0.023 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.028 _ZN2at6native54_GLOBAL__N__ee6f1694_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET1_T2_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_f6float4S9_SO_SH_EEvSJ_SL_RKT3_T4_EUlifE_EEvlNS_15PhiloxCudaStateESK_SL_
├─ 874.728 864.171 cublas [M=8192, N=8192, K=512]
│  └─ nan 864.171 cutlass3x_sm100_tensorop_s256x256x16gemm_f16_f16_f32_f16_f16_256x256x64_0_tnn_align8_2sm_bias_f16_relu
├─ 551.067 124.703 matmul_kernel [M=8192, N=8192, K=512]
├─ 761.798 90.207 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 510.810 134.530 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 1015.370 67.679 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 964.209 71.270 torch [M=8192, N=8192, K=512]
   └─ nan 71.270 cutlass3x_sm100_tensorop_s256x256x16gemm_f16_f16_f32_f16_f16_256x256x64_0_tnn_align8_2sm_bias_f16_relu

Legend (Metric: tflop16/s (inc) Min: 510.81 Max: 1015.37)
█ 964.91 - 1015.37
█ 864.00 - 964.91
█ 763.09 - 864.00
█ 662.18 - 763.09
█ 561.27 - 662.18
█ 510.81 - 561.27

name User code    ◀  Only in left graph    ▶  Only in right graph
