[FEA] Provide a way to set the maximum dynamic shared memory size #94

Open
Gui-Yom opened this issue Dec 19, 2024 · 6 comments
Labels: feature request

Comments

Gui-Yom commented Dec 19, 2024

Is your feature request related to a problem? Please describe.
There is, to my knowledge, no way to increase the default 48 KB limit on dynamic shared memory. This restricts the kernels we can develop with the Numba CUDA JIT.

Describe the solution you'd like
A method on CUDADispatcher, a parameter on the JIT decorator, or an automatic call to cudaFuncSetAttribute (or its driver API equivalent) when needed.

Describe alternatives you've considered
I am not aware of any alternatives. I am not sure how I could get a pointer to the CUDA function in order to call cudaFuncSetAttribute myself.

Additional context
NVIDIA/nvmath-python provides a way to use the cuFFTDx device-side APIs from Numba CUDA kernels. cuFFTDx requires a large amount of shared memory for larger FFT sizes.

Gui-Yom added the feature request label on Dec 19, 2024
gmarkall (Collaborator) commented

Thanks for the request! In the past I've been able to set these attributes by going through some APIs that weren't necessarily public, which could serve as a workaround for your use case. Let me see if I can find a way to do this that you can use until such a feature is implemented.

gmarkall (Collaborator) commented

Here's an example of a workaround:

from numba import cuda
from numba.cuda.cudadrv import drvapi, enums
from numba.cuda.cudadrv.driver import driver
import numpy as np

# Setup - add a binding for cuFuncSetAttribute:
#
# CUresult cuFuncSetAttribute(
#     CUfunction hfunc,
#     CUfunction_attribute attrib,
#     int value)
cfsa_name = 'cuFuncSetAttribute'
cfsa_args = (drvapi.c_int,
             drvapi.cu_function,
             drvapi.cu_function_attribute,
             drvapi.c_int)
drvapi.API_PROTOTYPES[cfsa_name] = cfsa_args


# Kernel eagerly compiled (because of the signature in the jit decorator) so
# that we can obtain the cufunc and set the maximum dynamic shared memory size
# prior to our attempt to launch it

@cuda.jit("void(float64[::1])")
def k(data):
    data[0] = 1


def set_max_dynamic_shared_memory(function, nbytes):
    """Set the maximum dynamic shared memory size for all overloads of a given
    function."""
    attrib = enums.CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES

    for sig, kernel in function.overloads.items():
        cufunc = kernel._codelibrary.get_cufunc()
        driver.cuFuncSetAttribute(cufunc.handle, attrib, nbytes)

        result = cufunc.read_func_attr(attrib)
        print(f"Max dynamic shared memory set to: {result}")


# Set max dynamic shared memory, with a little headroom
set_max_dynamic_shared_memory(k, 50000)

# Launch with 49153 bytes of dynamic shared memory - one byte more than the
# default 48K limit. The launch configuration is
# [griddim, blockdim, stream, dynamic shared memory size in bytes].
k[1, 1, 0, 49153](np.zeros(1))

The workaround patches cuFuncSetAttribute() into the driver's API bindings. We eagerly compile the kernel by providing a signature, which ensures we have a cufunc on the device to set attributes for, and then we set the maximum dynamic shared memory size for all of the overloads (compiled variants) of the kernel.

On my system the above example fails to launch if I comment out the call to set_max_dynamic_shared_memory().

The above example works when using Numba's built-in ctypes binding; it may need adjusting if the NVIDIA CUDA Python bindings are in use (I'm not sure yet whether that is the case with nvmath-python). If you run into an issue, let me know and I should be able to revise the example.
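
For reference, a rough sketch of what the same attribute-setting step might look like when the NVIDIA CUDA Python bindings are in use (untested; it assumes Numba is configured to use those bindings, so the cufunc handle is a cuda-python CUfunction object rather than a ctypes value):

# Sketch only - assumes the NVIDIA CUDA Python bindings are in use
# (e.g. NUMBA_CUDA_USE_NVIDIA_BINDING=1), so cufunc.handle is a
# cuda-python CUfunction rather than a ctypes handle.
from cuda import cuda as cuda_driver


def set_max_dynamic_shared_memory_nv(function, nbytes):
    """Set the maximum dynamic shared memory size for all overloads of a
    given function via the cuda-python driver bindings."""
    attrib = cuda_driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES
    for sig, kernel in function.overloads.items():
        cufunc = kernel._codelibrary.get_cufunc()
        (err,) = cuda_driver.cuFuncSetAttribute(cufunc.handle, attrib, nbytes)
        if err != cuda_driver.CUresult.CUDA_SUCCESS:
            raise RuntimeError(f"cuFuncSetAttribute failed: {err}")

Usage would mirror the ctypes version above, e.g. set_max_dynamic_shared_memory_nv(k, 50000) before launching.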

gmarkall (Collaborator) commented

Work in progress towards the feature implementation is in https://github.com/gmarkall/numba-cuda/tree/set-max-dynshared. So far I have only added an API for it to the cufunc, but it really needs a more user-facing API; e.g., as suggested in the issue, either a parameter for the JIT decorator or an automatic call.

I think I'd lean towards a parameter in the JIT decorator (roughly as sketched below), because an automatic call at kernel launch time would be on the critical path for launching kernels and could potentially slow them down.
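
Purely as an illustration of that direction (this keyword does not exist yet; the parameter name max_dynamic_shared_memory is a placeholder, not an implemented API), usage could look something like:

from numba import cuda
import numpy as np


# Hypothetical sketch only - the "max_dynamic_shared_memory" keyword is an
# assumption about what such a decorator parameter might be called; it is
# not currently part of the numba-cuda API.
@cuda.jit("void(float64[::1])", max_dynamic_shared_memory=50000)
def k(data):
    data[0] = 1


# Launch with one byte more dynamic shared memory than the default 48K limit
k[1, 1, 0, 49153](np.zeros(1))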

Gui-Yom (Author) commented Dec 19, 2024

Thanks, that was quick! I can confirm the workaround works.

gmarkall (Collaborator) commented

Thanks for confirming! I'll keep this issue open until a more user-friendly mechanism for it has been added to numba-cuda.

nvlcambier commented

FYI, we have some helpers in the nvmath-python samples that solve exactly this problem.

Maybe they can help as a temporary solution.
