[FEA] Provide a way to set the maximum dynamic shared memory size #94

Open
Gui-Yom opened this issue Dec 19, 2024 · 6 comments
Labels: feature request

Comments

Gui-Yom commented Dec 19, 2024

Is your feature request related to a problem? Please describe.
There is, to my knowledge, no way to increase the default 48 KB limit on dynamic shared memory. This restricts the kernels we can develop with the Numba CUDA JIT.

Describe the solution you'd like
A method on CUDADispatcher, a parameter on the JIT decorator, or an automatic call to cudaFuncSetAttribute (or its driver API equivalent) when needed.

Describe alternatives you've considered
I am not aware of any alternatives. I am not sure how I could get a pointer to the CUDA function in order to call cudaFuncSetAttribute myself.

Additional context
NVIDIA/nvmath-python provides a way to use the cuFFTDx device-side APIs from Numba CUDA kernels. cuFFTDx requires a large amount of shared memory for larger FFT sizes.

Gui-Yom added the feature request label on Dec 19, 2024
gmarkall (Collaborator) commented

Thanks for the request! In the past I've been able to set these attributes by going through some APIs that weren't necessarily public, which could serve as a workaround for your use case. Let me see if I can find a way to do this that you can use until such a feature is implemented.

gmarkall (Collaborator) commented

Here's an example of a workaround:

from numba import cuda
from numba.cuda.cudadrv import drvapi, enums
from numba.cuda.cudadrv.driver import driver
import numpy as np

# Setup - add a binding for cuFuncSetAttribute:
#
# CUresult cuFuncSetAttribute(
#     CUfunction hfunc,
#     CUfunction_attribute attrib,
#     int value)
cfsa_name = 'cuFuncSetAttribute'
cfsa_args = (drvapi.c_int,
             drvapi.cu_function,
             drvapi.cu_function_attribute,
             drvapi.c_int)
drvapi.API_PROTOTYPES[cfsa_name] = cfsa_args


# Kernel eagerly compiled (because of the signature in the jit decorator) so
# that we can obtain the cufunc and set the maximum dynamic shared memory size
# prior to our attempt to launch it

@cuda.jit("void(float64[::1])")
def k(data):
    data[0] = 1


def set_max_dynamic_shared_memory(function, nbytes):
    """Set the maximum dynamic shared memory size for all overloads of a given
    function."""
    attrib = enums.CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES

    for sig, kernel in function.overloads.items():
        cufunc = kernel._codelibrary.get_cufunc()
        driver.cuFuncSetAttribute(cufunc.handle, attrib, nbytes)

        result = cufunc.read_func_attr(attrib)
        print(f"Max dynamic shared memory set to: {result}")


# Set max dynamic shared memory, with a little headroom
set_max_dynamic_shared_memory(k, 50000)

# Launch with 49153 bytes of dynamic shared memory - one byte more than the
# default 48K limit. The launch configuration is
# [griddim, blockdim, stream, dynamic shared memory size in bytes].
k[1, 1, 0, 49153](np.zeros(1))

The workaround patches cuFuncSetAttribute() into the driver's API bindings. We eagerly compile the kernel by providing a signature, which ensures we have a cufunc on the device to set attributes for, and then we set the maximum dynamic shared memory size for all of the overloads (compiled variants) of the kernel.

On my system the above example fails to launch if I comment out the call to set_max_dynamic_shared_memory().

The above example works when using Numba's built-in ctypes binding; it may need adjusting if the NVIDIA CUDA Python bindings are in use (I'm not sure yet whether that is the case with nvmath-python). If you run into an issue, let me know and I should be able to revise the example.
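
For reference, a rough sketch of what the same attribute-setting step might look like when the NVIDIA CUDA Python bindings are in use (untested; it assumes Numba is configured to use those bindings, so the cufunc handle is a cuda-python CUfunction object rather than a ctypes value):

# Sketch only - assumes the NVIDIA CUDA Python bindings are in use
# (e.g. NUMBA_CUDA_USE_NVIDIA_BINDING=1), so cufunc.handle is a
# cuda-python CUfunction rather than a ctypes handle.
from cuda import cuda as cuda_driver


def set_max_dynamic_shared_memory_nv(function, nbytes):
    """Set the maximum dynamic shared memory size for all overloads of a
    given function via the cuda-python driver bindings."""
    attrib = cuda_driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES
    for sig, kernel in function.overloads.items():
        cufunc = kernel._codelibrary.get_cufunc()
        (err,) = cuda_driver.cuFuncSetAttribute(cufunc.handle, attrib, nbytes)
        if err != cuda_driver.CUresult.CUDA_SUCCESS:
            raise RuntimeError(f"cuFuncSetAttribute failed: {err}")

Usage would mirror the ctypes version above, e.g. set_max_dynamic_shared_memory_nv(k, 50000) before launching.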

gmarkall (Collaborator) commented

Work in progress towards the feature implementation is in https://github.com/gmarkall/numba-cuda/tree/set-max-dynshared. So far I have only added an API for it to the cufunc, but it really needs a more user-facing API; e.g., as suggested in the issue, either a parameter for the JIT decorator or an automatic call.

I think I'd lean towards a parameter in the JIT decorator (roughly as sketched below), because an automatic call at kernel launch time would be on the critical path for launching kernels and could potentially slow them down.
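
Purely as an illustration of that direction (this keyword does not exist yet; the parameter name max_dynamic_shared_memory is a placeholder, not an implemented API), usage could look something like:

from numba import cuda
import numpy as np


# Hypothetical sketch only - the "max_dynamic_shared_memory" keyword is an
# assumption about what such a decorator parameter might be called; it is
# not currently part of the numba-cuda API.
@cuda.jit("void(float64[::1])", max_dynamic_shared_memory=50000)
def k(data):
    data[0] = 1


# Launch with one byte more dynamic shared memory than the default 48K limit
k[1, 1, 0, 49153](np.zeros(1))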

Gui-Yom (Author) commented Dec 19, 2024

Thanks, that was quick! I can confirm the workaround works.

gmarkall (Collaborator) commented

Thanks for confirming! I'll keep this issue open until a more user-friendly mechanism for it has been added to numba-cuda.

nvlcambier commented

FYI, we have some helpers in the nvmath-python samples that solve exactly this problem.

Maybe they can help as a temporary solution.
