[FEA] Add support for __stcs and __ldcs intrinsics #42

Open
gmarkall opened this issue Aug 9, 2024 · 1 comment
Labels
feature request New feature or request

gmarkall commented Aug 9, 2024

This is to satisfy use cases that involve streaming loads and stores.

gmarkall added the "feature request" label on Aug 9, 2024

gmarkall commented Aug 9, 2024

A quick prototype / proof-of-concept:

from llvmlite import ir
from numba import config, cuda, types
from numba.core import cgutils
from numba.core.extending import intrinsic
from numba.core.errors import NumbaTypeError

import numpy as np

config.DUMP_ASSEMBLY = True


@intrinsic
def ldcs(typingctx, base):
    if not isinstance(base, types.Array) or base.dtype != types.float16:
        msg = f"ldcs operates on float16 arrays. Got type {base}"
        raise NumbaTypeError(msg)
    signature = types.float16(base)

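    def codegen(context, builder, sig, args):
        # The float16 value is handled as a raw 16-bit integer
        int16 = ir.IntType(16)
        int16_ptr = int16.as_pointer()
        ldcs_type = ir.FunctionType(int16, [int16_ptr])
        # Cache-streaming global load: $0 is the 16-bit result ("=h"),
        # $1 is the 64-bit address ("l")
        ldcs = ir.InlineAsm(ldcs_type, "ld.global.cs.b16 $0, [$1];", "=h, l")

        # Extract the data pointer from the array structure
        base = cgutils.create_struct_proxy(sig.args[0])(context, builder,
                                                        value=args[0]).data
        return builder.call(ldcs, [base])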

    return signature, codegen


@cuda.jit
def f(r, x):
    r[0] = ldcs(x)


x = cuda.device_array(1, np.float16)
r = cuda.device_array(1, np.float16)
f[1, 1](r, x)

which produces the following PTX:

{
        ...
	cvta.to.global.u64 	%rd3, %rd2;
	// begin inline asm
	ld.global.cs.b16 %rs1, [%rd1];
	// end inline asm
	st.global.u16 	[%rd3], %rs1;
	ret;

}

The API still needs to accept an index into the array, rather than only accessing the first element of the passed array (similar to how the atomics take an array and an index).
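The prototype hard-codes the 16-bit load, while the issue also asks for the store counterpart. Below is a minimal sketch of how the PTX text and constraint strings for both directions and a few operand widths might be generated. The helper name `cs_inline_asm` and the width table are hypothetical and not part of Numba; the sketch only builds strings and does not address the indexing question.

```python
# Hypothetical helper, not part of Numba: builds the PTX text and the LLVM
# inline-asm constraint string for cache-streaming global loads and stores.

# NVPTX inline-asm register constraints keyed by operand width in bits.
_REG_CONSTRAINT = {16: "h", 32: "r", 64: "l"}


def cs_inline_asm(op, bitwidth):
    """Return (ptx, constraints) for a streaming load ("ld") or store ("st")."""
    if bitwidth not in _REG_CONSTRAINT:
        raise ValueError(f"unsupported bit width: {bitwidth}")
    reg = _REG_CONSTRAINT[bitwidth]
    if op == "ld":
        # $0 is the loaded value (output), $1 is the 64-bit address.
        return f"ld.global.cs.b{bitwidth} $0, [$1];", f"={reg},l"
    if op == "st":
        # No outputs: $0 is the 64-bit address, $1 is the value to store.
        return f"st.global.cs.b{bitwidth} [$0], $1;", f"l,{reg}"
    raise ValueError(f"unknown op: {op!r}")


print(cs_inline_asm("ld", 16))
print(cs_inline_asm("st", 32))
```

The returned pair could be passed to `ir.InlineAsm` in place of the literals used in the prototype. For the store, the call would presumably also need `side_effect=True`, since it produces no result that would otherwise keep it alive.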

gmarkall added this to the v0.0.19 milestone on Oct 21, 2024
gmarkall modified the milestones: v0.0.20, v0.0.21, v0.0.22 on Dec 4, 2024