
[Offload] Allow CUDA Kernels to use arbitrarily large shared memory #145963


Open · wants to merge 3 commits into main
1 change: 1 addition & 0 deletions offload/plugins-nextgen/cuda/dynamic_cuda/cuda.cpp
@@ -31,6 +31,7 @@ DLWRAP(cuDeviceGet, 2)
DLWRAP(cuDeviceGetAttribute, 3)
DLWRAP(cuDeviceGetCount, 1)
DLWRAP(cuFuncGetAttribute, 3)
DLWRAP(cuFuncSetAttribute, 3)

// Device info
DLWRAP(cuDeviceGetName, 3)
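
For context, DLWRAP is the shim that lets the plugin call into the CUDA driver without linking against it at build time: each entry declares the arity of a driver entry point and generates a stub that forwards to a symbol resolved from the driver library at runtime. The sketch below only illustrates that pattern; it is not the actual macro from the offload tree, resolveDriverSymbol() is a hypothetical helper standing in for the dlopen/dlsym machinery, and the CUDA types are assumed to come from the plugin's dynamic_cuda/cuda.h mirror.

// Illustrative sketch: what a DLWRAP-style forwarding stub for
// cuFuncSetAttribute could expand to.
#include <dlfcn.h>

static void *resolveDriverSymbol(const char *Name) {
  // Open the driver library once and look up the requested symbol.
  static void *Handle = dlopen("libcuda.so.1", RTLD_NOW | RTLD_GLOBAL);
  return Handle ? dlsym(Handle, Name) : nullptr;
}

CUresult cuFuncSetAttribute(CUfunction Func, CUfunction_attribute Attr,
                            int Value) {
  using FnTy = CUresult (*)(CUfunction, CUfunction_attribute, int);
  static FnTy Fn =
      reinterpret_cast<FnTy>(resolveDriverSymbol("cuFuncSetAttribute"));
  if (!Fn)
    return CUDA_ERROR_NOT_INITIALIZED;
  return Fn(Func, Attr, Value);
}
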
2 changes: 2 additions & 0 deletions offload/plugins-nextgen/cuda/dynamic_cuda/cuda.h
@@ -258,6 +258,7 @@ typedef enum CUfunction_attribute_enum {

typedef enum CUfunction_attribute_enum {
CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK = 0,
CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES = 8,
} CUfunction_attribute;

typedef enum CUctx_flags_enum {
@@ -295,6 +296,7 @@ CUresult cuDeviceGet(CUdevice *, int);
CUresult cuDeviceGetAttribute(int *, CUdevice_attribute, CUdevice);
CUresult cuDeviceGetCount(int *);
CUresult cuFuncGetAttribute(int *, CUfunction_attribute, CUfunction);
CUresult cuFuncSetAttribute(CUfunction, CUfunction_attribute, int);

// Device info
CUresult cuDeviceGetName(char *, int, CUdevice);
13 changes: 13 additions & 0 deletions offload/plugins-nextgen/cuda/src/rtl.cpp
@@ -160,6 +160,9 @@ struct CUDAKernelTy : public GenericKernelTy {
private:
/// The CUDA kernel function to execute.
CUfunction Func;
/// The maximum amount of dynamic shared memory per thread group. By default,
/// this is set to 48 KB.
mutable uint32_t MaxDynCGroupMemLimit = 49152;
Contributor:

This shouldn't be mutable; we probably want to initialize it at kernel creation using the correct value from cuFuncGetAttribute. Alternatively, we could just check that value every time we launch a kernel, though I don't know how much overhead that would add.

Making it mutable keeps it up to date, I suppose, so it would avoid redundant work if we call the function multiple times with a different opt-in. However, I'd say that's required for correctness, because theoretically a user could use an API to modify it manually, so it's probably best to just play it safe.
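
A minimal sketch of that suggestion, assuming it runs once in the kernel's init path after Func is set up; this is simplified illustration, not the actual rtl.cpp code, and the surrounding error plumbing is elided:

// Query the driver for the kernel's current dynamic shared memory limit
// instead of hard-coding the 48 KB default (hypothetical init-time code).
int CurrentLimit = 0;
CUresult Res = cuFuncGetAttribute(
    &CurrentLimit, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, Func);
if (auto Err = Plugin::check(Res, "Error in cuFuncGetAttribute: %s"))
  return Err;
MaxDynCGroupMemLimit = CurrentLimit;
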

Member:

We don't want to check every time, and we can't fight the user. They could also change the context or other stuff; that's not our concern. The only real question is whether we want to go back to a lower value or keep the high-water mark. Going back might benefit performance for those launches, but I'd stick with the high-water mark for now. Wrt. mutable: we need some solution, I'm open to suggestions, and mutable isn't the worst here, honestly.

};

/// Class wrapping a CUDA stream reference. These are the objects handled by the
Expand Down Expand Up @@ -1302,6 +1305,16 @@ Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
if (GenericDevice.getRPCServer())
GenericDevice.Plugin.getRPCServer().Thread->notify();

// If this launch needs more dynamic shared memory than the current limit
// allows, raise the limit first.
if (MaxDynCGroupMem >= MaxDynCGroupMemLimit) {
Contributor:

Is this behavior the same as CUDA's? I thought CUDA users have to set it explicitly?

Member:

They do. But OpenMP has no such means. We need to handle it transparently.
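
For comparison, this is the opt-in a CUDA user has to perform explicitly before a launch that uses more than the 48 KB default; a rough sketch with the runtime API, where the kernel name and sizes are placeholders rather than anything from this patch:

#include <cuda_runtime.h>

// Hypothetical kernel that uses a large dynamic shared memory buffer.
__global__ void bigSmemKernel(float *Out) {
  extern __shared__ float Scratch[];
  Scratch[threadIdx.x] = static_cast<float>(threadIdx.x);
  __syncthreads();
  Out[threadIdx.x] = Scratch[blockDim.x - 1 - threadIdx.x];
}

void launchWithLargeSmem(float *Out) {
  const int SmemBytes = 96 * 1024; // more than the 48 KB default
  // Explicit opt-in: without this call the launch below fails on most GPUs.
  cudaFuncSetAttribute(bigSmemKernel,
                       cudaFuncAttributeMaxDynamicSharedMemorySize, SmemBytes);
  bigSmemKernel<<<1, 256, SmemBytes>>>(Out);
}
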

Contributor:

Hmm, is this code dedicated to OpenMP? I know it is for now, but we might want to consider checking who the actual "user" is here.

Contributor:

This is part of why I think we should remove more OpenMP logic from the core library. It would be better to make this guarantee at the libomptarget layer, since we likely want to maintain opt-in behavior. Worst case we use yet another environment variable.

Member:

> Hmm, is this code dedicated to OpenMP?

What other code path would trigger this right now?

> This is part of why I think we should remove more OpenMP logic from the core library. It would be better to make this guarantee at the libomptarget layer, since we likely want to maintain opt-in behavior. Worst case we use yet another environment variable.

We can't do this in libomptarget. The kernel handle is not there, or we start to unravel the abstraction layers. As for the other users: from all I can tell, this is equally useful for SYCL and for users that would use the new API "manually", since they can't call anything else either. Only for users coming from CUDA/HIP, which is not actually supported right now, would this not be necessary. Let's keep the restructuring somewhat down to earth. If this, and other things, become a problem for other users, we will add "option flags" that guard the logic. Without evidence for that, nor evidence that this isn't more generally useful here, let's not try to force separation for separation's sake.

CUresult AttrResult = cuFuncSetAttribute(
Func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, MaxDynCGroupMem);
Plugin::check(
AttrResult,
"Error in cuLaunchKernel while setting the memory limits: %s");
MaxDynCGroupMemLimit = MaxDynCGroupMem;
}

CUresult Res = cuLaunchKernel(Func, NumBlocks[0], NumBlocks[1], NumBlocks[2],
NumThreads[0], NumThreads[1], NumThreads[2],
MaxDynCGroupMem, Stream, nullptr, Config);
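
With this change in the plugin, an OpenMP offload user should be able to request a large dynamic allocation and have the kernel's limit raised transparently at launch. A rough sketch of such usage, assuming the LLVM ompx_dyn_cgroup_mem clause and the llvm_omp_target_dynamic_shared_alloc() extension from omp.h, neither of which is part of this diff:

#include <omp.h>

// Hypothetical usage: ask for 96 KB of dynamic shared memory for the target
// region. Before this patch, anything above the 48 KB default would fail at
// kernel launch on NVIDIA GPUs.
void doubleOnDevice(float *Data, int N) {
  const unsigned Bytes = 96 * 1024;
#pragma omp target map(tofrom : Data[0 : N]) ompx_dyn_cgroup_mem(Bytes)
  {
    // LLVM extension: returns the team's dynamic shared memory buffer.
    float *Scratch = (float *)llvm_omp_target_dynamic_shared_alloc();
#pragma omp parallel for
    for (int I = 0; I < N; ++I) {
      // Stage through shared memory (assumes N floats fit in the buffer).
      Scratch[I] = Data[I];
      Data[I] = 2.0f * Scratch[I];
    }
  }
}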