
Commit 80b9989

crtrott, masterleinad, and tpadioleau authored
Add KOKKOS_FUNCTION etc. documentation (#578)
* Add KOKKOS_FUNCTION etc. documentation
* Address Damien's comments
* Document KOKKOS_RELOCATABLE_FUNCTION
* Fix typo.

Co-authored-by: Thomas Padioleau <[email protected]>

---------

Co-authored-by: Daniel Arndt <[email protected]>
Co-authored-by: Thomas Padioleau <[email protected]>
1 parent 2c67f41 commit 80b9989

2 files changed

+240
-0
lines changed


docs/source/API/core/Macros.rst

Lines changed: 8 additions & 0 deletions
@@ -1,6 +1,14 @@
11
Macros
22
======
33

4+
Function Markup Macros
5+
----------------------
6+
7+
.. toctree::
8+
:maxdepth: 1
9+
10+
macros-special/host_device_macros.rst
11+
412
Version Macros
513
--------------
614

Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@
1+
2+
``Function Annotation Macros``
3+
==============================
4+
5+
.. role:: cpp(code)
6+
:language: cpp
7+
8+
Defined in header ``<Kokkos_Macros.hpp>``
9+
10+
Usage:
11+
12+
.. code-block:: cpp
13+
14+
KOKKOS_FUNCTION void foo();
15+
KOKKOS_INLINE_FUNCTION void foo();
16+
KOKKOS_FORCEINLINE_FUNCTION void foo();
17+
KOKKOS_RELOCATABLE_FUNCTION void foo();
18+
auto l = KOKKOS_LAMBDA(int i) { ... };
19+
auto l = KOKKOS_CLASS_LAMBDA(int i) { ... };
20+
21+
These macros deal with the management of split compilation for device and host code.
22+
They fulfill the same purpose as the ``__host__ __device__`` markup in CUDA and HIP.
23+
Generally only functions marked with one of these macros can be used inside of parallel
24+
Kokkos code - i.e. all code executed in parallel algorithms must be marked up by one
25+
of these macros.
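
As a rough illustration of the mechanism, the following sketch shows how markup macros of this kind could expand differently depending on the compiler. The names ``MY_FUNCTION`` and ``MY_INLINE_FUNCTION`` are hypothetical stand-ins, not the definitions in ``<Kokkos_Macros.hpp>``, and the compiler checks are an assumption made for this example.

```cpp
// Hypothetical stand-ins for illustration only; NOT the Kokkos definitions.
#if defined(__CUDACC__) || defined(__HIPCC__)
// Device-capable compiler: emit both a host and a device version.
#define MY_FUNCTION __host__ __device__
#define MY_INLINE_FUNCTION __host__ __device__ inline
#else
// Plain host compiler: the markup disappears (or reduces to inline).
#define MY_FUNCTION
#define MY_INLINE_FUNCTION inline
#endif

// Non-templated free function: needs inline to be safely defined in a header.
MY_INLINE_FUNCTION int square(int x) { return x * x; }

// Templated free function: implicitly inline, so the plain markup suffices.
template <class T>
MY_FUNCTION T twice(T v) { return v + v; }
```

With a host-only compiler both macros reduce to ordinary C++, so the same source builds with or without a device backend.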
26+
27+
``KOKKOS_FUNCTION``
28+
-------------------
29+
30+
This macro is the equivalent of ``__host__ __device__`` markup in CUDA and HIP.
31+
Use it primarily on inline-defined member functions of classes and templated
32+
free functions:
33+
34+
.. code-block:: cpp
35+
36+
class Foo {
37+
public:
38+
// inline defined constructor
39+
KOKKOS_FUNCTION Foo() { ... };
40+
41+
// inline defined member function
42+
template<class T>
43+
KOKKOS_FUNCTION void bar() const { ... }
44+
};
45+
46+
template<class T>
47+
KOKKOS_FUNCTION void foo(T v) { ... }
48+
49+
50+
``KOKKOS_INLINE_FUNCTION``
51+
--------------------------
52+
53+
This macro is the equivalent of ``__host__ __device__ inline`` markup in CUDA and HIP.
54+
Use it primarily for non-templated free functions:
55+
56+
.. code-block:: cpp
57+
58+
KOKKOS_INLINE_FUNCTION void foo() {}
59+
60+
Note that it is NOT a bug to use this macro for inline-defined member functions of classes, or
61+
templated free functions. It is simply redundant, since such functions are implicitly inline.
62+
63+
``KOKKOS_FORCEINLINE_FUNCTION``
64+
-------------------------------
65+
66+
This macro is the equivalent of ``__host__ __device__`` markup in CUDA and HIP, but also uses
67+
compiler-dependent hints (if available) to force inlining.
68+
This can help with some functions which are often used, but it may also hurt compilation time,
69+
as well as runtime performance due to code-bloat. In some instances using ``KOKKOS_FORCEINLINE_FUNCTION``
70+
excessively can even cause compilation errors when compiler-specific inlining limits are exceeded.
71+
Use this macro only in conjunction with extensive performance testing.
72+
73+
.. code-block:: cpp
74+
75+
class Foo {
76+
public:
77+
KOKKOS_FORCEINLINE_FUNCTION
78+
Foo() { ... };
79+
80+
template<class T>
81+
KOKKOS_FORCEINLINE_FUNCTION
82+
void bar() const { ... }
83+
};
84+
85+
template<class T>
86+
KOKKOS_FORCEINLINE_FUNCTION
87+
void foo(T v) { ... }
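
To make "compiler-dependent hints" more concrete, here is a minimal sketch of a force-inline macro for host compilers. ``MY_FORCEINLINE`` is a hypothetical name and the selection logic is an assumption for illustration; it is not how Kokkos defines ``KOKKOS_FORCEINLINE_FUNCTION``. The attributes themselves (``__forceinline`` for MSVC, ``__attribute__((always_inline))`` for GCC and Clang) are standard compiler extensions.

```cpp
// Hypothetical macro for illustration; not the Kokkos definition.
#if defined(_MSC_VER)
#define MY_FORCEINLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define MY_FORCEINLINE inline __attribute__((always_inline))
#else
// No known hint available: fall back to a plain inline request.
#define MY_FORCEINLINE inline
#endif

// Small, frequently called helper: a plausible candidate for forced inlining.
MY_FORCEINLINE double axpy(double a, double x, double y) { return a * x + y; }

// Hot loop that calls the helper once per element.
double sum_axpy(const double* x, const double* y, int n, double a) {
  double s = 0.0;
  for (int i = 0; i < n; ++i) s += axpy(a, x[i], y[i]);
  return s;
}
```

Whether forcing the inline actually helps depends on the compiler and the call site, which is why measuring before and after is advisable.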
88+
89+
``KOKKOS_RELOCATABLE_FUNCTION``
90+
-------------------------------
91+
92+
This macro is the equivalent of ``__host__ __device__`` markup in CUDA and HIP, and ``SYCL_EXTERNAL`` in SYCL.
93+
Use it for free functions that are compiled in one compilation unit but called from Kokkos
94+
parallel constructs defined in a different compilation unit.
95+
96+
.. code-block:: cpp
97+
98+
// functor.cpp
99+
#include <Kokkos_Macros.hpp>
100+
101+
KOKKOS_RELOCATABLE_FUNCTION void count_even(const long i, long& lcount) {
102+
lcount += (i % 2) == 0;
103+
}
104+
105+
.. code-block:: cpp
106+
107+
// main.cpp
108+
#include <Kokkos_Core.hpp>
109+
110+
KOKKOS_RELOCATABLE_FUNCTION void count_even(const long i, long& lcount);
111+
112+
int main(int argc, char* argv[]) {
113+
Kokkos::ScopeGuard scope_guard(argc, argv);
114+
const long n = 100;  // problem size; value chosen only for illustration
long count = 0;
115+
Kokkos::parallel_reduce(
116+
n, KOKKOS_LAMBDA(const long i, long& lcount) { count_even(i, lcount); },
117+
count);
118+
}
119+
120+
Note that this macro can only be used if Kokkos was configured with only host execution spaces
121+
or if relocatable device code support was explicitly enabled for the CUDA, HIP, or SYCL backend.
122+
123+
``KOKKOS_LAMBDA``
124+
-----------------
125+
126+
This macro provides a default capture clause and host-device markup for lambdas. It is the equivalent of
127+
``[=] __host__ __device__`` in CUDA and HIP.
128+
It is used when creating C++ lambdas to be passed to Kokkos parallel dispatch mechanisms such as
129+
``parallel_for``, ``parallel_reduce`` and ``parallel_scan``.
130+
131+
.. code-block:: cpp
132+
133+
void foo(...) {
134+
...
135+
parallel_for("Name", N, KOKKOS_LAMBDA(int i) {
136+
...
137+
});
138+
...
139+
parallel_reduce("Name", N, KOKKOS_LAMBDA(int i, double& v) {
140+
...
141+
}, result);
142+
...
143+
}
144+
145+
.. warning:: Do not use ``KOKKOS_LAMBDA`` inside functions marked as ``KOKKOS_FUNCTION`` etc. or within a lambda marked with ``KOKKOS_LAMBDA``. Specifically do not use ``KOKKOS_LAMBDA`` to define lambdas for nested parallel calls. CUDA does not support that. Use plain C++ syntax instead: ``[=] (int i) {...}``.
146+
147+
.. warning:: When creating lambdas inside of class member functions you may need to use ``KOKKOS_CLASS_LAMBDA`` instead.
148+
149+
``KOKKOS_CLASS_LAMBDA``
150+
-----------------------
151+
152+
This macro provides a default capture clause and host-device markup for lambdas created inside of class member functions. It is the equivalent of
153+
``[=, *this] __host__ __device__`` in CUDA and HIP, capturing the parent class by value instead of by reference.
154+
155+
.. code-block:: cpp
156+
157+
class Foo {
158+
public:
159+
Foo() { ... };
160+
int data;
161+
162+
KOKKOS_FUNCTION void print_data() const {
163+
printf("Data: %i\n",data);
164+
}
165+
void bar() const {
166+
parallel_for("Name", N, KOKKOS_CLASS_LAMBDA(int i) {
167+
...
168+
print_data();
169+
printf("%i %i\n",i,data);
170+
});
171+
}
172+
};
173+
174+
Note: If one wants to avoid capturing a copy of the entire class in the lambda, one has to create local
175+
copies of any accessed data members, and cannot use non-static member functions inside the lambda:
176+
177+
.. code-block:: cpp
178+
179+
class Foo {
180+
public:
181+
Foo() { ... };
182+
int data;
183+
184+
KOKKOS_FUNCTION void print_data() const {
185+
printf("Data: %i\n",data);
186+
}
187+
void bar() const {
188+
int data_copy = data;
189+
parallel_for("Name", N, KOKKOS_LAMBDA(int i) {
190+
...
191+
// can't call member functions
192+
// print_data();
193+
// use the copy of data
194+
printf("%i %i\n",i,data_copy);
195+
});
196+
}
197+
};
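
The difference between capturing the object by value and capturing only the ``this`` pointer can be seen in plain C++17 without any Kokkos code. The sketch below is host-only, and the ``Counter`` class with its member names is hypothetical. It contrasts ``[=, *this]``, the capture that ``KOKKOS_CLASS_LAMBDA`` is described as equivalent to, with a ``[this]`` capture.

```cpp
// Hypothetical class for illustration; requires C++17 for the *this capture.
struct Counter {
  int data = 1;

  // Copies the whole object into the closure, so later changes to the
  // original object are not visible through the lambda.
  auto by_value() const {
    return [=, *this]() { return data; };
  }

  // Captures only the this pointer, so the closure reads the live object.
  // On a device backend such a pointer would typically refer to host memory.
  auto by_pointer() const {
    return [this]() { return data; };
  }
};
```

If ``data`` is modified after both lambdas are created, the ``by_value`` closure still returns the old value, while the ``by_pointer`` closure returns the new one.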
198+
199+
200+
``KOKKOS_DEDUCTION_GUIDE``
201+
--------------------------
202+
203+
This macro is used to annotate user-defined deduction guides.
204+
205+
206+
.. code-block:: cpp
207+
208+
template<class T, size_t N>
209+
class Foo {
210+
T data[N];
211+
public:
212+
template<class ... Args>
213+
KOKKOS_FUNCTION
214+
Foo(Args ... args):data{static_cast<T>(args)...} {}
215+
216+
KOKKOS_FUNCTION void print(int i) const {
217+
printf("%i\n",static_cast<int>(data[i]));
218+
}
219+
};
220+
221+
template<class T, class ... Args>
222+
KOKKOS_DEDUCTION_GUIDE
223+
Foo(T, Args...) -> Foo<T, 1+sizeof...(Args)>;
224+
225+
void bar() {
226+
Kokkos::parallel_for(1, KOKKOS_LAMBDA(int) {
227+
Foo f(1, 2., 3.2f);
228+
f.print(0);
229+
f.print(1);
230+
f.print(2);
231+
});
232+
}
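
To see what the deduction guide in the example resolves to, the following host-only sketch repeats the class without the Kokkos macros and checks the deduced type at compile time. The ``get`` accessor and the ``second_element`` helper are hypothetical additions for this illustration.

```cpp
#include <cstddef>
#include <type_traits>

template <class T, std::size_t N>
class Foo {
  T data[N];

 public:
  template <class... Args>
  Foo(Args... args) : data{static_cast<T>(args)...} {}

  // Hypothetical accessor, added only so the stored values can be inspected.
  T get(std::size_t i) const { return data[i]; }
};

// Same guide as in the example above, minus the Kokkos annotation:
// T comes from the first argument, N from the total argument count.
template <class T, class... Args>
Foo(T, Args...) -> Foo<T, 1 + sizeof...(Args)>;

inline int second_element() {
  Foo f(1, 2., 3.2f);
  // The first argument is an int and there are three arguments in total.
  static_assert(std::is_same_v<decltype(f), Foo<int, 3>>,
                "guide should deduce Foo<int, 3>");
  return f.get(1);
}
```

Without the guide, ``T`` and ``N`` could not be deduced from the constructor arguments, because neither appears in the constructor's parameter list.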
