
Use dynamic address of global and closure variables #15

Draft · wants to merge 3 commits into base: main
Conversation

@dlee992 commented Jul 13, 2024

Fixes numba/numba#9084.

Subtasks:

  • Change make_constant_array for the CUDA target: copied from Graham's POC, guarded by USE_NV_BINDING.
  • Consider making copies of data to/from the device at launch time for global and closure variables on the host.
  • Clarify in the CUDA documentation that const copies are never made automatically, and point to the const array constructor for explicit construction.
  • Add tests for both the global and the closure variable cases.

Subtasks to discuss before writing the code:

  • Related to point 2 above: should users be allowed to write to these global or closure variables inside a kernel? Since the POC still uses make_constant_array, it forces the underlying array to be read-only.
  • Any other memory management considerations as required (e.g. across multiple devices, avoiding leaks when making implicit copies from the host, etc.).

Subtasks to be handled in Numba core:

  • Follow the NVVM IR rules: do not use . in global identifiers (numba/numba#9642).
  • Disable caching for kernels that reference globals and closure variables. The docstring for BaseContext.add_dynamic_addr() suggests that adding a dynamic address disables caching, but this does not seem to be the case for CUDA at least.
  • Fix a Numba bug in the lowering registry order between builtin_registry and cudaimpl.registry.

@gmarkall added the "2 - In Progress" label Jul 15, 2024
@dlee992 commented Jul 16, 2024

Per the Numba office hours, I will treat global/closure variables as read-only in this PR.

@dlee992 changed the base branch from main to develop July 16, 2024 23:33
@dlee992 commented Jul 17, 2024

Ha, interesting: after largely changing the behaviour of make_constant_array in the CUDA target, 5 tests for the original const array support break.

For example, numba.cuda.tests.cudadrv.test_linker.TestLinker.test_get_const_mem_size fails, since make_constant_array also services const array creation.

FAIL: test_get_const_mem_size (numba.cuda.tests.cudadrv.test_linker.TestLinker.test_get_const_mem_size)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dli/numba-cuda/numba_cuda/numba/cuda/tests/cudadrv/test_linker.py", line 262, in test_get_const_mem_size
    self.assertGreaterEqual(const_mem_size, CONST1D.nbytes)
AssertionError: 0 not greater than or equal to 80

So we need to keep the legacy behaviour (copying and creating a CUDA const array when the user asks for one), and use a flag to mark whether the call is intentional const-array creation, i.e. whether it comes from the const.array_like family.

@dlee992 commented Jul 17, 2024

It seems Numba core treats ir.Const, ir.Global, and ir.FreeVar in the same way:

    def lower_assign(self, ty, inst):
        value = inst.value
        # In nopython mode, closure vars are frozen like globals
        if isinstance(value, (ir.Const, ir.Global, ir.FreeVar)):
            res = self.context.get_constant_generic(self.builder, ty,
                                                    value.value)
            self.incref(ty, res)
            return res

I think we need a strategy to distinguish globals from constants. A direct way is to add a get_global_generic method and register different implementations for different targets, but that would require a fair number of changes in Numba core. The branch above would then become something like:

        if isinstance(value, ir.Const):
            res = self.context.get_constant_generic(self.builder, ty,
                                                    value.value)
            self.incref(ty, res)
            return res

        if isinstance(value, (ir.Global, ir.FreeVar)):
            res = self.context.get_global_generic(self.builder, ty,
                                                    value.value)
            self.incref(ty, res)
            return res

Do you have a better idea? @gmarkall

@gmarkall:
Do you have a better idea? @gmarkall

I'm thinking maybe we can get around this by editing the lowering of constants for arrays in the CUDA target. Presently all globals that are arrays will be lowered by the lowering in arrayobj.py: https://github.com/numba/numba/blob/fca98287a8fc9da1825ec4e6b570ef8eeaf4605c/numba/np/arrayobj.py#L3282-L3287

Perhaps if we register a @lower_constant(types.Array) with the CUDA target we can put the implementation there instead, and leave make_constant_array unchanged in the CUDA target.

To ensure that we then still get a constant array in the CUDA target when we call cuda.const.array_like(), we'd have to then change its lowering so that it calls make_constant_array() instead of doing nothing as it presently does:

@lower(cuda.const.array_like, types.Array)
def cuda_const_array_like(context, builder, sig, args):
    # This is a no-op because CUDATargetContext.make_constant_array already
    # created the constant array.
    return args[0]

What do you think? Could this work?

@gmarkall added the "4 - Waiting on author" label and removed the "2 - In Progress" label Jul 19, 2024
@dlee992 commented Jul 21, 2024

Presently all globals that are arrays will be lowered by the lowering in arrayobj.py

Agreed.

Perhaps if we register a @lower_constant(types.Array) with the CUDA target

  1. I checked the definition and usage of lower_constant; it doesn't provide a way to specify a target.
  2. If we register a new @lower_constant(types.Array), Numba appends it at the end of the registry list.
  3. But when Numba looks up a lower_constant implementation for types.Array, it finds the old one first and uses it directly.
  4. To make the new definition take effect, we would have to pay close attention to import/definition order, which is too fragile.

In theory, globals and constants should each have their own lowering path, but they currently share one due to Numba's legacy design choices.

What do you think?

@gmarkall commented Jul 22, 2024

I think you can do

from numba.cuda.cudaimpl import lower_constant

to get a decorator that registers lowerings for the CUDA target context only, which is how I believe we avoid these clashes when using the low-level API at present.
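The per-target registry idea can be sketched with a minimal stub (this is not Numba's actual Registry class, just an illustration of why separate registries avoid decorator clashes):

```python
# Minimal stub of per-target lowering registries: each target keeps its
# own registry, so a CUDA-specific lower_constant does not clash with
# the builtin one at registration time.

class Registry:
    def __init__(self):
        self.constants = {}

    def lower_constant(self, ty):
        def decorator(fn):
            self.constants[ty] = fn
            return fn
        return decorator

builtin_registry = Registry()   # stand-in for numba.core.imputils
cuda_registry = Registry()      # stand-in for numba.cuda.cudaimpl

@builtin_registry.lower_constant("Array")
def constant_array_default(context, builder, ty, val):
    return "numba.np.arrayobj implementation"

@cuda_registry.lower_constant("Array")
def constant_array_cuda(context, builder, ty, val):
    return "numba.cuda.cudaimpl implementation"

# Each registration lives in its own registry; the clash only appears
# later, when both registries are installed into a single context
# (the ordering issue discussed further down this thread).
print(builtin_registry.constants["Array"](None, None, None, None))
print(cuda_registry.constants["Array"](None, None, None, None))
```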

@dlee992 commented Jul 22, 2024

Oh, yes! Thanks for pointing this out. I hadn't noticed that CUDA creates its own Registry instance:

registry = Registry()
lower = registry.lower
lower_attr = registry.lower_getattr
lower_constant = registry.lower_constant

@dlee992 commented Jul 22, 2024

I added a new definition for @lower_constant(types.Array) in cudaimpl.py locally (not pushed yet), but right now CUDATargetContext loads the old definition from arrayobj.py first, which leaves the new definition unused.

We need to reorder these imports to let the CUDA additional registries be called before the Numba default registries.

def load_additional_registries(self):
    # side effect of import needed for numba.cpython.*, the builtins
    # registry is updated at import time.
    from numba.cpython import numbers, tupleobj, slicing     # noqa: F401
    from numba.cpython import rangeobj, iterators, enumimpl  # noqa: F401
    from numba.cpython import unicode, charseq               # noqa: F401
    from numba.cpython import cmathimpl
    from numba.misc import cffiimpl
    from numba.np import arrayobj                            # noqa: F401
    from numba.np import npdatetime                          # noqa: F401
    from . import (
        cudaimpl, printimpl, libdeviceimpl, mathimpl, vector_types
    )

Update: reordering the imports and the install_registry calls still seems to be insufficient.

@dlee992 commented Jul 23, 2024

It turns out the root cause is this function:

https://github.com/numba/numba/blob/fca98287a8fc9da1825ec4e6b570ef8eeaf4605c/numba/core/base.py#L56-L64

    def _select_compatible(self, sig):
        """
        Select all compatible signatures and their implementation.
        """
        out = {}
        for ver_sig, impl in self.versions:
            if self._match_arglist(ver_sig, sig):
                out[ver_sig] = impl
        return out

I added some prints inside and after the loop:

ver_sig: (<class 'numba.core.types.npytypes.Array'>,), module: numba.cuda.cudaimpl impl: <function constant_array at 0x14a5f40a62a0>
ver_sig: (<class 'numba.core.types.npytypes.Array'>,), module: numba.np.arrayobj impl: <function constant_array at 0x14a5f41eafc0>
out: {(<class 'numba.core.types.npytypes.Array'>,): <function constant_array at 0x14a5f41eafc0>}

So if the TargetContext has multiple lower_constant definitions for the same type signature, it chooses the last one in the registry, and in the current registry order the default definition comes last.
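The last-one-wins behaviour can be reproduced in a few lines that mirror _select_compatible (stand-in functions replace the real lowering implementations; _match_arglist is simplified to an equality check):

```python
# Mirror of _select_compatible: when several registered implementations
# match the same signature, building the dict in registration order
# means the last match silently overwrites the earlier ones.

def select_compatible(versions, sig):
    out = {}
    for ver_sig, impl in versions:
        if ver_sig == sig:  # stand-in for the real _match_arglist
            out[ver_sig] = impl
    return out

def cuda_constant_array():
    return "numba.cuda.cudaimpl"

def default_constant_array():
    return "numba.np.arrayobj"

# CUDA-specific implementation registered first, default one second,
# matching the registry order observed in the prints above:
versions = [
    (("Array",), cuda_constant_array),
    (("Array",), default_constant_array),
]

chosen = select_compatible(versions, ("Array",))
print(chosen[("Array",)]())  # the default wins because it came last
```

This is why installing the CUDA-specific registry after the builtin one (rather than before) makes the CUDA implementation win.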

Need to reorder these imports to let CUDA additional registries be called before numba default registries.

On the contrary, we need to ensure the CUDA-specific registries are installed after the default registries, so that the CUDA-specific implementation is chosen.

@dlee992 commented Jul 23, 2024

If I apply this change locally, the registry finally works as expected and the CUDA-specific definition for @lower_constant(types.Array) is found:

diff --git a/numba_cuda/numba/cuda/target.py b/numba_cuda/numba/cuda/target.py
index be9487c..fd8b2b0 100644
--- a/numba_cuda/numba/cuda/target.py
+++ b/numba_cuda/numba/cuda/target.py
@@ -10,6 +10,7 @@ from numba.core.base import BaseContext
 from numba.core.callconv import BaseCallConv, MinimalCallConv
 from numba.core.typing import cmathdecl
 from numba.core import datamodel
+from numba.core.imputils import builtin_registry
 
 from .cudadrv import nvvm
 from numba.cuda import codegen, nvvmutils, ufuncs
@@ -109,6 +110,7 @@ class CUDATargetContext(BaseContext):
         # fix for #8940
         from numba.np.unsafe import ndarray # noqa F401
 
+        self.install_registry(builtin_registry)
         self.install_registry(cudaimpl.registry)
         self.install_registry(cffiimpl.registry)
         self.install_registry(printimpl.registry)
@@ -117,6 +119,10 @@ class CUDATargetContext(BaseContext):
         self.install_registry(mathimpl.registry)
         self.install_registry(vector_types.impl_registry)
 
+    def refresh(self):
+        self.load_additional_registries()
+        self.typing_context.refresh()
+
     def codegen(self):
         return self._internal_codegen

This looks like a Numba bug in the lowering registry order between builtin_registry and cudaimpl.registry.
@gmarkall What do you think?

@gmarkall added the "4 - Waiting on reviewer" label and removed the "4 - Waiting on author" label Jul 27, 2024
@gmarkall:

This should be a Numba bug in the lower registry order between builtin_registry and cudaimpl.registry
@gmarkall What do you think?

I think you're right; thanks for working out how the order of implementations is considered. Until this point, I don't think anybody actually knew how Numba decided which implementation is used.

@gmarkall:

@dlee992 Would you like to push your local changes, and we'll see how they go on CI? Or do you still have local outstanding issues with the test suite?

@gmarkall added the "4 - Waiting on author" label and removed the "4 - Waiting on reviewer" label Jul 27, 2024
copy-pr-bot commented Jul 28, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@dlee992 commented Jul 28, 2024

I pushed my workaround taking care of the registry ordering.

However, I am unsure how to make make_constant_array work as before, since its last argument ary has changed from a real Python ndarray to an ir.LoadInstr coming from the new @lower_constant(types.Array).
I have to admit I'm not good at generating LLVM IR through the llvmlite interface. It seems we need to recover the underlying array from this ir.LoadInstr somehow. Perhaps I'd better read the LLVM IR reference docs again.

In theory, globals and constants should each have their own lowering path, but they currently share one due to Numba's legacy design choices.

BTW, this kind of goes back to that issue. If we chose different lowering paths for ir.Global and cuda.const.array_like from the very beginning, it would simplify the lowering parts.

@gmarkall added the "4 - Waiting on reviewer" label and removed the "4 - Waiting on author" label Jul 31, 2024
@gmarkall gmarkall changed the base branch from develop to main October 21, 2024 14:49
@gmarkall:

Retargeted to main, as I'm moving everything off develop (which is no longer needed).

@gmarkall added the "0 - Blocked" label Nov 29, 2024
@gmarkall:

Adding the "Blocked" label, until we figure out a way we can make this work.

@gmarkall removed the "4 - Waiting on reviewer" label Nov 29, 2024
Linked issue: CUDA kernels should not make const copies of global and closure variables