[Offload] Allow CUDA Kernels to use arbitrarily large shared memory #145963
```diff
@@ -160,6 +160,9 @@ struct CUDAKernelTy : public GenericKernelTy {
 private:
   /// The CUDA kernel function to execute.
   CUfunction Func;
+  /// The maximum amount of dynamic shared memory per thread group. By default,
+  /// this is set to 48 KB.
+  mutable uint32_t MaxDynCGroupMemLimit = 49152;
 };

 /// Class wrapping a CUDA stream reference. These are the objects handled by the
```
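The hard-coded 49152 matches the default per-block dynamic shared memory limit on current NVIDIA GPUs; anything larger requires an explicit opt-in on the kernel function. As background, here is a minimal, illustrative sketch (not part of the patch; device index and printing are arbitrary) of how the default and opt-in limits can be queried through the CUDA driver API:

```cpp
#include <cstdio>
#include <cuda.h>

// Query the default and opt-in per-block shared memory limits for device 0.
// Illustrative only; the plugin hard-codes the 48 KB default instead.
int main() {
  cuInit(0);
  CUdevice Dev;
  cuDeviceGet(&Dev, 0);

  int DefaultLimit = 0, OptInLimit = 0;
  cuDeviceGetAttribute(&DefaultLimit,
                       CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, Dev);
  cuDeviceGetAttribute(&OptInLimit,
                       CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN,
                       Dev);

  // On recent architectures DefaultLimit is 49152 (48 KB), while the opt-in
  // limit can be considerably higher (e.g. well above 200 KB on Hopper).
  std::printf("default: %d bytes, opt-in: %d bytes\n", DefaultLimit, OptInLimit);
  return 0;
}
```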
In `launchImpl`, the limit is then raised on demand whenever a launch requests more dynamic shared memory than the current limit allows:

```diff
@@ -1302,6 +1305,16 @@ Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
   if (GenericDevice.getRPCServer())
     GenericDevice.Plugin.getRPCServer().Thread->notify();

+  // In case we require more memory than the current limit.
+  if (MaxDynCGroupMem >= MaxDynCGroupMemLimit) {
+    CUresult AttrResult = cuFuncSetAttribute(
+        Func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, MaxDynCGroupMem);
+    Plugin::check(
+        AttrResult,
+        "Error in cuLaunchKernel while setting the memory limits: %s");
+    MaxDynCGroupMemLimit = MaxDynCGroupMem;
+  }
+
   CUresult Res = cuLaunchKernel(Func, NumBlocks[0], NumBlocks[1], NumBlocks[2],
                                 NumThreads[0], NumThreads[1], NumThreads[2],
                                 MaxDynCGroupMem, Stream, nullptr, Config);
```

Review thread on the new limit check (`if (MaxDynCGroupMem >= MaxDynCGroupMemLimit)`):

- Is this behavior the same as in CUDA? I thought CUDA users have to set it explicitly?
- They do, but OpenMP has no such means, so we need to handle it transparently.
- Hmm, is this code dedicated to OpenMP? I know it is for now, but we might want to consider checking who the actual "user" is here.
- This is part of why I think we should remove more OpenMP logic from the core library. It would be better to make this guarantee at the …
- What other code path would trigger this right now?
- We can't do this in libomptarget; the kernel handle is not there, or we start to unravel the abstraction layers. As for the other users: from all I can tell, this is equally useful for SYCL and for users driving the new API "manually", since they can't call anything else either. Only for users coming from CUDA/HIP, which is not actually supported right now, would this be unnecessary. Let's keep the restructuring somewhat down to earth. If this, and other things, become a problem for other users, we will add "option flags" that guard the logic. Without evidence for that, and without evidence that this isn't more generally useful here, let's not force separation for separation's sake.
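For context on that exchange: a CUDA application raises the limit itself with `cuFuncSetAttribute`/`cudaFuncSetAttribute` before launching, which is exactly what the patch now does on the user's behalf. An OpenMP program has no per-kernel attribute call to make; it can only request the memory, for example through LLVM's `ompx_dyn_cgroup_mem` extension clause. The sketch below shows such a request; the clause spelling and the device-side allocator name follow LLVM's documented OpenMP extensions as I recall them, so treat them as assumptions to verify:

```cpp
#include <cstdint>

// Hypothetical example: request 64 KB of dynamic "cgroup" (team-local) shared
// memory for a target region. Without this patch the launch would fail on
// NVIDIA GPUs, since 64 KB exceeds the default 48 KB limit.
constexpr uint32_t DynSharedBytes = 64 * 1024;

void scale(float *Data, int N) {
#pragma omp target teams ompx_dyn_cgroup_mem(DynSharedBytes)                  \
    map(tofrom : Data[0 : N])
  {
    // Kernel body. The dynamic buffer is handed out by the device runtime,
    // e.g. via llvm_omp_target_dynamic_shared_alloc() (name per LLVM's
    // OpenMP extension docs; shown in a comment to keep the sketch minimal).
    // float *Scratch =
    //     static_cast<float *>(llvm_omp_target_dynamic_shared_alloc());
  }
}
```

With the patch, a request above 48 KB like this one should launch as long as the hardware's opt-in limit allows it, instead of failing inside `cuLaunchKernel`.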
Review thread on making `MaxDynCGroupMemLimit` mutable:

- This shouldn't be mutable. We probably want to initialize it at kernel creation with the correct value from `cuFuncGetAttributes`. Alternatively, we could check that value every time we launch a kernel, though I don't know how much overhead that would add.
- Making it mutable keeps it up to date, I suppose, so it avoids redundant work when the kernel is launched multiple times with a different opt-in. However, I'd say re-checking is required for correctness, because a user could theoretically modify the attribute through another API, so it's probably best to play it safe.
- We don't want to check every time, and we can't fight the user. They could also change the context or other things; that's not our concern. The only real question is whether we want to go back to a lower value afterwards or keep the high-water mark. Going back might benefit performance for those launches, but I'd stick with the high-water mark for now. W.r.t. `mutable`: we need some solution, and I'm open to suggestions, but `mutable` isn't the worst option here, honestly.
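For reference, a minimal sketch of the initialization suggested in the first comment: query the kernel's current `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` once when the kernel is created instead of hard-coding 48 KB. The surrounding struct and the `init` hook are illustrative only and do not mirror the plugin's actual classes:

```cpp
#include <cstdint>
#include <cuda.h>

// Hypothetical kernel-creation-time setup: read the function's current
// dynamic shared memory limit from the driver rather than assuming 49152.
struct KernelLimit {
  CUfunction Func = nullptr;
  uint32_t MaxDynCGroupMemLimit = 0;

  CUresult init(CUfunction F) {
    Func = F;
    int CurrentLimit = 0;
    // The attribute reports the dynamic shared memory size cuLaunchKernel
    // will accept without a prior cuFuncSetAttribute call.
    CUresult Res = cuFuncGetAttribute(
        &CurrentLimit, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, Func);
    if (Res == CUDA_SUCCESS)
      MaxDynCGroupMemLimit = static_cast<uint32_t>(CurrentLimit);
    return Res;
  }
};
```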