
Use dynamic address of global and closure variables #15

Draft · wants to merge 3 commits into base: main
Conversation

@dlee992 commented Jul 13, 2024

Fixes numba/numba#9084.

Subtasks:

  • Change make_constant_array for the CUDA target: copied from Graham's POC, guarded by USE_NV_BINDING.
  • Consider making copies of data to/from the device at launch time for global and closure variables on the host.
  • Clarify in the CUDA documentation that const copies are never made automatically, and point to the const array constructor for explicit construction.
  • Add tests for both the global and the closure variable cases.

Subtasks to discuss before writing the code:

  • Related to point 2 above: should users be allowed to write to these global or closure variables inside a kernel? Since the POC still uses make_constant_array, it forces the underlying array to be read-only.
  • Any other memory management considerations as required (e.g. across multiple devices, avoiding leaks when making implicit copies from the host, etc.).

Subtasks to be handled in Numba core:

  • Follow the NVVM IR rules: do not use . in global identifiers (numba/numba#9642).
  • Disable caching for kernels that reference globals and closure variables. The docstring for BaseContext.add_dynamic_addr() suggests that adding a dynamic address disables caching, but this does not seem to be the case for CUDA at least.
  • Fix a Numba bug in the lowering registry order between builtin_registry and cudaimpl.registry.

@gmarkall added the "2 - In Progress" label Jul 15, 2024
@dlee992 commented Jul 16, 2024

Per the Numba office hours, I will treat global/closure variables as read-only in this PR.

@dlee992 changed the base branch from main to develop July 16, 2024 23:33
@dlee992 commented Jul 17, 2024

Ha, interesting: after largely changing the behaviour of make_constant_array in the CUDA target, 5 tests for the original const array support break.

For example, numba.cuda.tests.cudadrv.test_linker.TestLinker.test_get_const_mem_size fails, since make_constant_array also services const array creation.

FAIL: test_get_const_mem_size (numba.cuda.tests.cudadrv.test_linker.TestLinker.test_get_const_mem_size)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dli/numba-cuda/numba_cuda/numba/cuda/tests/cudadrv/test_linker.py", line 262, in test_get_const_mem_size
    self.assertGreaterEqual(const_mem_size, CONST1D.nbytes)
AssertionError: 0 not greater than or equal to 80

So we need to keep the legacy behaviour (copying and creating a CUDA const array when the user asks for one), and use a flag to mark whether the call is intentional const-array creation, i.e. whether it comes from the const.array_like family.

@dlee992 commented Jul 17, 2024

It seems Numba core treats ir.Const, ir.Global, and ir.FreeVar in the same way:

    def lower_assign(self, ty, inst):
        value = inst.value
        # In nopython mode, closure vars are frozen like globals
        if isinstance(value, (ir.Const, ir.Global, ir.FreeVar)):
            res = self.context.get_constant_generic(self.builder, ty,
                                                    value.value)
            self.incref(ty, res)
            return res

I think we need a strategy to distinguish globals from constants. A direct way is to add a get_global_generic method and register different implementations for different targets, but that would require a fair number of changes in Numba core. The branch above would then become something like:

        if isinstance(value, ir.Const):
            res = self.context.get_constant_generic(self.builder, ty,
                                                    value.value)
            self.incref(ty, res)
            return res

        if isinstance(value, (ir.Global, ir.FreeVar)):
            res = self.context.get_global_generic(self.builder, ty,
                                                    value.value)
            self.incref(ty, res)
            return res

Do you have a better idea? @gmarkall

@gmarkall:
Do you have a better idea? @gmarkall

I'm thinking maybe we can get around this by editing the lowering of constants for arrays in the CUDA target. Presently all globals that are arrays will be lowered by the lowering in arrayobj.py: https://github.com/numba/numba/blob/fca98287a8fc9da1825ec4e6b570ef8eeaf4605c/numba/np/arrayobj.py#L3282-L3287

Perhaps if we register a @lower_constant(types.Array) with the CUDA target we can put the implementation there instead, and leave make_constant_array unchanged in the CUDA target.

To ensure that we then still get a constant array in the CUDA target when we call cuda.const.array_like(), we'd have to then change its lowering so that it calls make_constant_array() instead of doing nothing as it presently does:

@lower(cuda.const.array_like, types.Array)
def cuda_const_array_like(context, builder, sig, args):
    # This is a no-op because CUDATargetContext.make_constant_array already
    # created the constant array.
    return args[0]

What do you think? Could this work?

@gmarkall added the "4 - Waiting on author" label and removed the "2 - In Progress" label Jul 19, 2024
@dlee992 commented Jul 21, 2024

Presently all globals that are arrays will be lowered by the lowering in arrayobj.py

Agreed.

Perhaps if we register a @lower_constant(types.Array) with the CUDA target

  1. I checked the definition and usage of lower_constant; it doesn't provide a way to specify a target.
  2. If we register a new @lower_constant(types.Array), Numba appends it at the end of the registry list.
  3. But when Numba looks up a lower_constant implementation for types.Array, it finds the old one first and uses it directly.
  4. To make the new definition take effect, we would have to pay close attention to import/definition order, which is too fragile.

In theory, globals and constants should each have their own lowering path, but they currently share one due to Numba's legacy design choices.

What do you think?

@gmarkall commented Jul 22, 2024

I think you can do

from numba.cuda.cudaimpl import lower_constant

to get a decorator that registers lowerings for the CUDA target context only, which is how I believe we avoid these clashes when using the low-level API at present.
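The per-target registry idea can be sketched with a minimal stub (this is not Numba's actual Registry class, just an illustration of why separate registries avoid decorator clashes):

```python
# Minimal stub of per-target lowering registries: each target keeps its
# own registry, so a CUDA-specific lower_constant does not clash with
# the builtin one at registration time.

class Registry:
    def __init__(self):
        self.constants = {}

    def lower_constant(self, ty):
        def decorator(fn):
            self.constants[ty] = fn
            return fn
        return decorator

builtin_registry = Registry()   # stand-in for numba.core.imputils
cuda_registry = Registry()      # stand-in for numba.cuda.cudaimpl

@builtin_registry.lower_constant("Array")
def constant_array_default(context, builder, ty, val):
    return "numba.np.arrayobj implementation"

@cuda_registry.lower_constant("Array")
def constant_array_cuda(context, builder, ty, val):
    return "numba.cuda.cudaimpl implementation"

# Each registration lives in its own registry; the clash only appears
# later, when both registries are installed into a single context
# (the ordering issue discussed further down this thread).
print(builtin_registry.constants["Array"](None, None, None, None))
print(cuda_registry.constants["Array"](None, None, None, None))
```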

@dlee992 commented Jul 22, 2024

Oh, yes! Thanks for pointing this out. I hadn't noticed that CUDA creates its own Registry instance:

registry = Registry()
lower = registry.lower
lower_attr = registry.lower_getattr
lower_constant = registry.lower_constant

@dlee992 commented Jul 22, 2024

I added a new definition for @lower_constant(types.Array) in cudaimpl.py locally (not pushed yet), but right now CUDATargetContext loads the old definition from arrayobj.py first, which leaves the new definition unused.

We need to reorder these imports to let the CUDA additional registries be called before the Numba default registries.

def load_additional_registries(self):
    # side effect of import needed for numba.cpython.*, the builtins
    # registry is updated at import time.
    from numba.cpython import numbers, tupleobj, slicing     # noqa: F401
    from numba.cpython import rangeobj, iterators, enumimpl  # noqa: F401
    from numba.cpython import unicode, charseq               # noqa: F401
    from numba.cpython import cmathimpl
    from numba.misc import cffiimpl
    from numba.np import arrayobj                            # noqa: F401
    from numba.np import npdatetime                          # noqa: F401
    from . import (
        cudaimpl, printimpl, libdeviceimpl, mathimpl, vector_types
    )

Update: reordering the imports and the install_registry calls still seems to be insufficient.

@dlee992 commented Jul 23, 2024

It turns out the root cause is this function:

https://github.com/numba/numba/blob/fca98287a8fc9da1825ec4e6b570ef8eeaf4605c/numba/core/base.py#L56-L64

    def _select_compatible(self, sig):
        """
        Select all compatible signatures and their implementation.
        """
        out = {}
        for ver_sig, impl in self.versions:
            if self._match_arglist(ver_sig, sig):
                out[ver_sig] = impl
        return out

I added some prints inside and after the loop:

ver_sig: (<class 'numba.core.types.npytypes.Array'>,), module: numba.cuda.cudaimpl impl: <function constant_array at 0x14a5f40a62a0>
ver_sig: (<class 'numba.core.types.npytypes.Array'>,), module: numba.np.arrayobj impl: <function constant_array at 0x14a5f41eafc0>
out: {(<class 'numba.core.types.npytypes.Array'>,): <function constant_array at 0x14a5f41eafc0>}

So if the TargetContext has multiple lower_constant definitions for the same type signature, it chooses the last one in the registry, and in the current registry order the default definition comes last.
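The last-one-wins behaviour can be reproduced in a few lines that mirror _select_compatible (stand-in functions replace the real lowering implementations; _match_arglist is simplified to an equality check):

```python
# Mirror of _select_compatible: when several registered implementations
# match the same signature, building the dict in registration order
# means the last match silently overwrites the earlier ones.

def select_compatible(versions, sig):
    out = {}
    for ver_sig, impl in versions:
        if ver_sig == sig:  # stand-in for the real _match_arglist
            out[ver_sig] = impl
    return out

def cuda_constant_array():
    return "numba.cuda.cudaimpl"

def default_constant_array():
    return "numba.np.arrayobj"

# CUDA-specific implementation registered first, default one second,
# matching the registry order observed in the prints above:
versions = [
    (("Array",), cuda_constant_array),
    (("Array",), default_constant_array),
]

chosen = select_compatible(versions, ("Array",))
print(chosen[("Array",)]())  # the default wins because it came last
```

This is why installing the CUDA-specific registry after the builtin one (rather than before) makes the CUDA implementation win.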

Need to reorder these imports to let CUDA additional registries be called before numba default registries.

On the contrary, we need to ensure the CUDA-specific registries are installed after the default registries, so that the CUDA-specific implementation is chosen.

@dlee992 commented Jul 23, 2024

If I apply this change locally, the registry finally works as expected and the CUDA-specific definition for @lower_constant(types.Array) is found:

diff --git a/numba_cuda/numba/cuda/target.py b/numba_cuda/numba/cuda/target.py
index be9487c..fd8b2b0 100644
--- a/numba_cuda/numba/cuda/target.py
+++ b/numba_cuda/numba/cuda/target.py
@@ -10,6 +10,7 @@ from numba.core.base import BaseContext
 from numba.core.callconv import BaseCallConv, MinimalCallConv
 from numba.core.typing import cmathdecl
 from numba.core import datamodel
+from numba.core.imputils import builtin_registry
 
 from .cudadrv import nvvm
 from numba.cuda import codegen, nvvmutils, ufuncs
@@ -109,6 +110,7 @@ class CUDATargetContext(BaseContext):
         # fix for #8940
         from numba.np.unsafe import ndarray # noqa F401
 
+        self.install_registry(builtin_registry)
         self.install_registry(cudaimpl.registry)
         self.install_registry(cffiimpl.registry)
         self.install_registry(printimpl.registry)
@@ -117,6 +119,10 @@ class CUDATargetContext(BaseContext):
         self.install_registry(mathimpl.registry)
         self.install_registry(vector_types.impl_registry)
 
+    def refresh(self):
+        self.load_additional_registries()
+        self.typing_context.refresh()
+
     def codegen(self):
         return self._internal_codegen

This looks like a Numba bug in the lowering registry order between builtin_registry and cudaimpl.registry.
@gmarkall What do you think?

@gmarkall added the "4 - Waiting on reviewer" label and removed the "4 - Waiting on author" label Jul 27, 2024
@gmarkall:

This should be a Numba bug in the lower registry order between builtin_registry and cudaimpl.registry
@gmarkall What do you think?

I think you're right; thanks for working out how the order of implementations is considered. Until this point, I don't think anybody actually knew how Numba decided which implementation is used.

@gmarkall:

@dlee992 Would you like to push your local changes, and we'll see how they go on CI? Or do you still have local outstanding issues with the test suite?

@gmarkall added the "4 - Waiting on author" label and removed the "4 - Waiting on reviewer" label Jul 27, 2024
copy-pr-bot commented Jul 28, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@dlee992 commented Jul 28, 2024

I pushed my workaround taking care of the registry ordering.

However, I am unsure how to make make_constant_array work as before, since its last argument ary has changed from a real Python ndarray to an ir.LoadInstr coming from the new @lower_constant(types.Array).
I have to admit I'm not good at generating LLVM IR through the llvmlite interface. It seems we need to recover the underlying array from this ir.LoadInstr somehow. Perhaps I'd better read the LLVM IR reference docs again.

In theory, globals and constants should each have their own lowering path, but they currently share one due to Numba's legacy design choices.

BTW, this kind of goes back to that issue. If we chose different lowering paths for ir.Global and cuda.const.array_like from the very beginning, it would simplify the lowering parts.

@gmarkall added the "4 - Waiting on reviewer" label and removed the "4 - Waiting on author" label Jul 31, 2024
@gmarkall gmarkall changed the base branch from develop to main October 21, 2024 14:49
@gmarkall:

Retargeted to main, as I'm moving everything off develop (which is no longer needed).

@gmarkall added the "0 - Blocked" label Nov 29, 2024
@gmarkall:

Adding the "Blocked" label, until we figure out a way we can make this work.

@gmarkall removed the "4 - Waiting on reviewer" label Nov 29, 2024
Linked issue: CUDA kernels should not make const copies of global and closure variables