
fused out correction in CP #1248

Open · wants to merge 18 commits into main

Conversation

xiaoyao0115

Description

Fuses the multiple kernels in the out correction computation of attention under context parallelism (CP) into a single kernel, reducing kernel launch overhead.
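
For context, the out correction step combines the partial attention outputs from each CP step: every element of the final output is a sum of the per-step outputs, each rescaled by exp(lse_per_step - lse), where lse is the final log-sum-exp. A minimal sketch of that math as one fused pass (illustrative only; the names, flat indexing, and pointer-array argument are assumptions, not the PR's actual kernel):

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One thread per output element; each thread accumulates across all CP steps,
// so the whole correction is a single launch instead of one launch per step.
template <typename dtype>
__global__ void fused_out_correction_sketch(dtype *out, dtype *const *out_per_step,
                                            const float *lse, float *const *lse_per_step,
                                            int num_steps, int dim_per_head, size_t total) {
  size_t elem = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (elem >= total) return;
  size_t row = elem / dim_per_head;  // the (token, head) row that owns this element's lse
  float acc = 0.f;
  for (int s = 0; s < num_steps; ++s) {
    // Rescale each step's partial softmax output into the global softmax scale.
    acc += static_cast<float>(out_per_step[s][elem]) * expf(lse_per_step[s][row] - lse[row]);
  }
  out[elem] = static_cast<dtype>(acc);
}
```

The unfused path launches one correction kernel per CP step, so fusing removes cp_size - 1 launches per attention forward.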

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • [✓] New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: xiaoyao0115 <[email protected]>
@xrennvidia self-requested a review October 14, 2024 08:09
transformer_engine/pytorch/csrc/extensions/attention.cu
const at::Tensor &lse, const std::vector<at::Tensor> &lse_per_step,
const at::Tensor &cu_seqlens, std::string qkv_format, int cp_size,
int rank, bool causal, bool softmax_lse_in_packed_format);

Collaborator

Put these two functions close to the other CP helper functions.

Author

ok

Collaborator

why don't you fix this?

Author

I will fix this.

rank,
causal,
softmax_lse_in_packed_format,
)
Collaborator

Can we merge these two functions into one? I see they are almost the same.
I think you do not necessarily need softmax_lse_ in the CUDA code? It's just a view of softmax_lse, so softmax_lse should be enough.

Author

OK, they are merged, and softmax_lse_ has been removed from the CUDA code.

Collaborator

Lots of the arguments are the same and only the first one is different? The duplicated code should be removed.

Author

ok, thanks for your suggestion

transformer_engine/pytorch/csrc/extensions/attention.cu
const at::Tensor &lse, const std::vector<at::Tensor> &lse_per_step,
const at::Tensor &cu_seqlens, std::string qkv_format, int cp_size,
int rank, bool causal, bool softmax_lse_in_packed_format);

Collaborator

why don't you fix this?

)
elif qkv_format == "bshd":
tex.fused_out_correction(
out.view(out.shape[-4], -1, *out.shape[-2:]),
Collaborator

why out.shape[-4]? I guess you mean batch size, which should be out.shape[0]?

Author

Yes, I will fix this.

rank,
causal,
softmax_lse_in_packed_format,
)
Collaborator

Lots of the arguments are the same and only the first one is different? The duplicated code should be removed.

m.def("fused_out_correction", &fused_out_correction,
"fused out correction after qkv calculation without lse_",
py::call_guard<py::gil_scoped_release>());

// Other granular functions
Collaborator

Move this next to the other THD helper functions.

Author

OK.

@@ -1222,6 +1222,152 @@ std::vector<at::Tensor> fused_attn_bwd(
return {dQ, dK, dV, dBias};
}

/***************************************************************************************************
* Support THD(including SBHD and BSHD) format for Context Parallel: Fused out correction in forward
**************************************************************************************************/
Collaborator

just say "Support BSHD, SBHD, and THD formats for Context Parallel: Fused out correction in forward"

Author

ok

num_heads = out.size(1);
dim_per_head = out.size(2);
batch = cu_seqlens.size(0) - 1;
if (softmax_lse_in_packed_format) {
Collaborator

The current CP implementation uses varlen_fwd only, so softmax_lse_in_packed_format can be True for the SBHD format as well; you cannot put this if-else statement under the THD format only.

Author

Yes, I understand that softmax_lse_in_packed_format can be true not only in the THD format. However, in the THD format, if softmax_lse_in_packed_format is true, lse_seqlen needs special handling, whereas in the SBHD and BSHD formats it does not.
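
For readers following along: in the packed layout the LSE is stored per token rather than per (batch, position), so for THD the row length has to come from the total token count in cu_seqlens rather than a fixed max_seqlen. A hypothetical illustration of the two index computations (the shapes and names here are assumed for illustration, not taken from the PR):

```cuda
// Assumed layouts, for illustration only:
//   packed:   lse[num_heads][total_tokens]      (total_tokens = cu_seqlens[batch])
//   unpacked: lse[batch][num_heads][max_seqlen]
__device__ size_t lse_index(bool packed, int batch_idx, int head, int pos_in_seq,
                            size_t token_global, int num_heads, int max_seqlen,
                            size_t total_tokens) {
  if (packed) {
    // THD must derive total_tokens from cu_seqlens at runtime; SBHD/BSHD can
    // still use a fixed sequence length even when the LSE is packed.
    return static_cast<size_t>(head) * total_tokens + token_global;
  }
  return (static_cast<size_t>(batch_idx) * num_heads + head) * max_seqlen + pos_in_seq;
}
```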

@@ -102,58 +102,187 @@ __global__ void thd_lse_kernel(lse_dtype *lse, float *half_lse, int *cu_seqlens,
}

/***************************************************************************************************
* Support THD format for Context Parallel: Out correction in forward
* Support THD(including SBHD and BSHD) format for Context Parallel: Out correction in forward
Collaborator

change this to "Support BSHD, SBHD, and THD formats for Context Parallel: Out correction in forward"

Author

ok

constexpr int max_tensors = 64;
TensorList<max_tensors> tensors;

for (int i = 0; i < cp_size; i += max_tensors) {
Collaborator

What's the particular reason to have this for loop? Why can't the fused kernel handle cp_size > 64 in a single launch?

Author

When cp_size is very large, this CUDA kernel suffers performance degradation due to the excessive number of kernel arguments (a total of 2 * cp_size + 2 tensor addresses). Therefore, following the implementation in https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/csrc/multi_tensor_apply.cuh, I adopted a batched approach that processes the steps in chunks when cp_size is too large.
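
In other words, the kernel receives its per-step pointers by value in a fixed-capacity struct (kernel arguments live in limited constant-memory space), so the steps are processed in chunks of at most 64. A sketch of that pattern, loosely modeled on multi_tensor_apply.cuh; this TensorList and the launch loop are assumptions, not the exact PR code:

```cuda
#include <algorithm>
#include <vector>

constexpr int max_tensors = 64;

// Fixed-capacity argument struct passed to the kernel by value, so the
// argument size stays bounded no matter how large cp_size grows.
template <typename dtype, int capacity>
struct TensorList {
  dtype *out_per_step[capacity];
  float *lse_per_step[capacity];
  int num_tensors;
};

template <typename dtype>
void launch_in_chunks(const std::vector<dtype *> &outs, const std::vector<float *> &lses,
                      int cp_size) {
  for (int i = 0; i < cp_size; i += max_tensors) {
    TensorList<dtype, max_tensors> tensors;
    tensors.num_tensors = std::min(max_tensors, cp_size - i);
    for (int j = 0; j < tensors.num_tensors; ++j) {
      tensors.out_per_step[j] = outs[i + j];
      tensors.lse_per_step[j] = lses[i + j];
    }
    // One launch per chunk of up to max_tensors steps, e.g.:
    // fused_out_correction_kernel<<<grid, block, smem, stream>>>(tensors, ...);
  }
}
```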

dtype *cur_out_per_step = out_per_step + idx_per_step;
for (int j = lane_id; j < num_loops_per_head; j += tile_size) {
size_t idx_out;
size_t idx_lse;
Collaborator

Move idx_out and idx_lse into the next level of the inner loop.

Author

OK.


if constexpr (out_format == QKVFormat::TH) {
for (int i = threadIdx.x; i <= batch; i += blockDim.x) {
cu_seqlens_s[i] = cu_seqlens[i];
Collaborator

why is this needed?

Author

We need this to initialize TensorFormat when the THD format is applied.
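
To unpack that a bit: in the THD layout a flat token index has to be mapped back to its owning sequence via cu_seqlens, and staging cu_seqlens in shared memory makes that lookup cheap for every thread. A hypothetical sketch of the lookup such a kernel would run after the copy above (the function name is illustrative):

```cuda
// Find the sequence that owns a flat token index, i.e. the largest i with
// cu_seqlens_s[i] <= token. cu_seqlens_s holds batch + 1 cumulative offsets
// already copied into shared memory, as in the loop quoted above.
__device__ int seq_id_of_token(const int *cu_seqlens_s, int batch, int token) {
  int lo = 0, hi = batch;
  while (lo < hi) {
    int mid = (lo + hi + 1) / 2;
    if (cu_seqlens_s[mid] <= token) lo = mid;
    else hi = mid - 1;
  }
  return lo;
}
```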
