
[nvidia] Support passing TMA descriptors by-value #4498

Merged
embg merged 15 commits into triton-lang:main from grid_const_dev
Aug 19, 2024

Conversation

Contributor
@embg embg commented Aug 9, 2024

Motivation

Currently, Triton passes TMA descriptors by-ref through global memory. This causes both performance and correctness problems, described below.

There are two possible solutions: create the descriptor on-device, or create it on the host and pass it to the kernel by-value.

Because of the tricky memory model for TMA descriptors on H100, creating a descriptor on-device requires moving data back and forth from L2 cache. This is relatively expensive (hundreds of cycles at least) and requires the user or compiler to correctly insert release/acquire fences.

In some cases, there is no way to avoid creating the descriptor on-device. But for many use cases, it is perfectly fine to set up the descriptor on the host and pass it by-value, avoiding both the performance and correctness issues. This PR implements the by-value functionality.

User-level API

Whenever the user provides a kernel param which implements the method tma_desc_cpu_ptr(), Triton will lower that argument to a __grid_constant__ by-value param. The existing helper methods create_[1d/2d]_tma_descriptor were modified to return such a type, so existing code does not need any changes to take advantage of the new feature.
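For illustration, here is a minimal sketch of what such a kernel param could look like. Everything except the tma_desc_cpu_ptr() hook (the class name, the buffer handling) is hypothetical, not code from this PR:

```python
import torch

TMA_DESC_SIZE = 128  # CUtensorMap is a 128-byte opaque struct (see below)

class HostTmaDescriptor:
    """Hypothetical kernel param: any object exposing tma_desc_cpu_ptr()
    is lowered by Triton to a __grid_constant__ by-value param."""

    def __init__(self):
        # Host-side buffer holding the descriptor bytes, filled on the host
        # (e.g. via cuTensorMapEncodeTiled through the CUDA driver API).
        self.desc = torch.empty(TMA_DESC_SIZE, dtype=torch.uint8, device="cpu")

    def tma_desc_cpu_ptr(self):
        # The runtime reads the 128 descriptor bytes from this host pointer
        # and passes them to the kernel by value.
        return self.desc.data_ptr()
```

Since create_[1d/2d]_tma_descriptor already return such an object, kernels using those helpers pick up the by-value path automatically.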

Implementation details

When a kernel param with tma_desc_cpu_ptr() is detected, we attach an attribute to that param at the TTIR level. The attribute is passed through to TTGIR. When lowering TTGIR to LLIR, we use code ported from Mosaic (jax-ml/jax#22175) to set up the correct LLVM attributes. The runtime is also modified to pass by-value TMA descriptors properly.
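A hedged sketch of the runtime-side detection, assuming the dispatch looks roughly like this (the helper name and tuple encoding are made up; only the tma_desc_cpu_ptr() hook comes from this PR):

```python
def map_kernel_arg(arg):
    # Hypothetical helper: decide how one kernel argument is passed.
    if hasattr(arg, "tma_desc_cpu_ptr"):
        # Tag the param so TTIR gets the attribute, which is propagated to
        # TTGIR; TTGIR-to-LLIR lowering then emits nvvm.grid_constant plus a
        # 64-byte llvm.align, and the launcher copies the 128 descriptor
        # bytes from this host pointer into the kernel's param space.
        return ("tma_desc_by_value", arg.tma_desc_cpu_ptr())
    return ("regular", arg)
```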

Limitations

This feature is currently broken when compiling an IRSource directly (which is useful for editing IR and re-compiling). Fixing it would require updating some regexes that infer the function signature from the IR (sketched below). IRSource compilation still works fine for kernels that do not use the new feature.

Once the approach I'm taking here is reviewed, I plan to fix that limitation, either in this PR or in a follow-up PR.
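To make the limitation concrete, here is an illustrative Python sketch of the kind of regex-based signature inference involved. The actual regexes in the compiler differ; the point is that a pattern like this knows nothing about by-value TMA params:

```python
import re

ir_text = '''
tt.func public @kernel(%arg0: !tt.ptr<f32>, %arg1: i32) {
  tt.return
}
'''

# Recover the kernel name and argument types from the IR text alone.
m = re.search(r"tt\.func\s+(?:public\s+)?@(\w+)\(([^)]*)\)", ir_text)
name, raw_args = m.group(1), m.group(2)
arg_types = [a.split(":", 1)[1].strip() for a in raw_args.split(",")]
print(name, arg_types)  # -> kernel ['!tt.ptr<f32>', 'i32']
```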

@@ -42,7 +42,8 @@ def _build(name, src, srcdir, library_dirs, include_dirs, libraries):
     py_include_dir = sysconfig.get_paths(scheme=scheme)["include"]
     custom_backend_dirs = set(os.getenv(var) for var in ('TRITON_CUDACRT_PATH', 'TRITON_CUDART_PATH'))
     include_dirs = include_dirs + [srcdir, py_include_dir, *custom_backend_dirs]
-    cc_cmd = [cc, src, "-O3", "-shared", "-fPIC", "-o", so]
+    # for -Wno-psabi, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111047
+    cc_cmd = [cc, src, "-O3", "-shared", "-fPIC", "-Wno-psabi", "-o", so]
Collaborator
what is causing the extra warning?

Contributor Author

GCC doesn't like the CUtensorMap struct. This is called out in the CUDA C++ Programming Guide as a false warning:

When passing the tensor map as a parameter, some versions of the GCC C++ compiler issue the warning “the ABI for passing parameters with 64-byte alignment has changed in GCC 4.6”. This warning can be ignored.

I don't think it can be suppressed inline via a pragma; it has to be suppressed on the command line: https://godbolt.org/z/f5n5crhjG

Collaborator
@htyu htyu left a comment

Thanks for the good work. LGTM in general. Left a couple of minor comments.

llvmFuncOp.setArgAttr(i, "nvvm.grid_constant",
mlir::UnitAttr::get(llvmFuncOp.getContext()));
llvmFuncOp.setArgAttr(i, "llvm.align",
mlir::IntegerAttr::get(i32_type, 64));
Collaborator

Is 64 a required alignment value?

Contributor Author

Yes. Here is the definition of CUtensorMap in <cuda.h>:

typedef struct CUtensorMap_st {
    alignas(64)
    unsigned long long opaque[16];
} CUtensorMap;
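So the struct is 16 × 8 = 128 bytes with 64-byte alignment, and the whole 128 bytes are copied into the kernel's parameter space. A quick way to sanity-check the size from Python (a ctypes sketch, not code from this PR; note that older ctypes versions cannot express the 64-byte alignment, which is why the C side needs alignas(64)):

```python
import ctypes

class CUtensorMap(ctypes.Structure):
    # Mirrors the <cuda.h> layout: 16 x unsigned long long = 128 opaque bytes.
    _fields_ = [("opaque", ctypes.c_ulonglong * 16)]

assert ctypes.sizeof(CUtensorMap) == 128  # full descriptor passed by value
```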

Collaborator
@ThomasRaoux ThomasRaoux left a comment

LGTM once the other comments are addressed

@embg embg marked this pull request as ready for review August 15, 2024 03:49
@embg embg requested a review from ptillet as a code owner August 15, 2024 03:49
@embg embg merged commit c25f684 into triton-lang:main Aug 19, 2024
6 checks passed
@embg embg deleted the grid_const_dev branch August 19, 2024 18:26
facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Aug 22, 2024
Summary:
This PR follows [a recent PR in Triton](triton-lang/triton#4498) that supports passing TMA descriptors by-value using `__grid_constant__`.

In this PR, we update the kernel `_attn_fwd_inner` to support the new feature in Triton. To support auto-tuning, we implement a helper class that wraps the TMA operations used during auto-tuning and the computations in the kernel, respectively.
In addition, the benchmark program now checks whether the installed Triton version supports the new feature. If it does not, the helper class falls back to the old way of handling TMA.

The change has been tested against the Triton that ships with the standard conda installation of PyTorch, as well as against a recent Triton build that includes the above PR.

Command for testing and experiment results:
Before removing fences: P1541573348
After removing fences: P1541736645
1) CUDA_VISIBLE_DEVICES=5, old tma: 138.476
2) CUDA_VISIBLE_DEVICES=5, new tma, with fences: 152 - 164
3) CUDA_VISIBLE_DEVICES=5, new tma, after removing fences: 168.0
4) CUDA_VISIBLE_DEVICES=5, no tma: 187.881

The result is still behind the no-TMA baseline; we can investigate further.

Pull Request resolved: #2428

Reviewed By: embg

Differential Revision: D61668142

Pulled By: sfzhu93

fbshipit-source-id: d08bab147c6b2197f73447ee8f30ede877e712ca