-
Notifications
You must be signed in to change notification settings - Fork 99
List Functor level Interface in Batched Functions
Kokkos Batched APIs are designed and implemented for a specific application needs. Most of functions described here are used to create a block tridiagonal line solver where the blocksizes are tiny. This implies that computational intensity is not so high and we assume that the data fits into a cache. Thus, loop unrolling (Algo::Blocked
) can be benefitial for some architectures. To the contrary, one may consider the Algo::Unblocked
with a team to increase fine-grained parallelism on Cuda.
/// [template]MemberType: member type derived from a team policy
/// [in]tiny: tiny scalar value to perturb diagonals of A
/// [in/out]A: 2d view
int SerialAddRadial::invoke(const Scalartype tiny, const AViewType &A);
int TeamAddRadial<MemberType>::invoke(const MemberType &member,
const Scalartype tiny, const AViewType &A);
This perturbs diagonal values to be away from zeros: A = A + diag(epsilon*sign(diag(A)))
.
KokkosBatched_AddRadial_Decl.hpp
KokkosBatched_AddRadial_Impl.hpp
/// [template]MemberType: member type derived from a team policy
/// [template]TransType: transpose of A; Trans::NoTranspose, Trans::Transpose
/// [in]A: 1d or 2d view
/// [out]B: 1d or 2d view
int SerialCopy<TransType>::invoke(const AViewType &A, const BViewType &B);
int TeamCopy<MemberType,TransType>::invoke(const MemberType &member,
const AViewType &A, const BViewType &B);
Copy the elements of A
into B
performing B=op(A)
.
KokkosBatched_Copy_Decl.hpp
KokkosBatched_Copy_Impl.hpp
/// [template]MemberType: member type derived from a team policy
/// [in]alpha: scalar value
/// [in/out]A: 1d or 2d view
int SerialScale::invoke(const ScalarType alpha, const AViewType &A);
int TeamScale::invoke(const MemberType &member,
const ScalarType alpha, const AViewType &A);
Scales the values of A
with a scalar alpha
.
KokkosBatched_Scale_Decl.hpp
KokkosBatched_Scale_Impl.hpp
/// [template]MemberType: member type derived from a team policy
/// [in]alpha: scalar value
/// [out]A: 1d or 2d view
int SerialSet::invoke(const ScalarType alpha, const AViewType &A);
int TeamSet::invoke(const MemberType &member,
const ScalarType alpha, const AViewType &A);
Set the values of A
with a scalar alpha
.
KokkosBatched_Set_Decl.hpp
KokkosBatched_Set_Impl.hpp
/// [template]MemberType: member type derived from a team policy
/// [template]TransType: transpose of A; Trans::NoTranspose, Trans::Transpose
/// [template]AlgoType: Unblocked, Blocked, CompatMKL
/// [in]alpha: scalar value
/// [in]A: 2d view
/// [in]x: 1d view
/// [in]beta: scalar value
/// [in/out]y: 1d view
int SerialGemv<TransType,AlgoType>
::invoke(const ScalarType alpha, const AViewType &A, const xViewType &x,
const ScalarType beta, const yViewType &y);
int TeamGemv<MemberType,TransType,AlgoType>
::invoke(const MemberType &member,
const ScalarType alpha, const AViewType &A, const xViewType &x,
const ScalarType beta, const yViewType &y);
Performs a general matrix-vector multiplication: y = beta y + alpha op(A) x
.
KokkosBatched_Gemv_Decl.hpp
KokkosBatched_Gemv_Serial_Impl.hpp
KokkosBatched_Gemv_Team_Impl.hpp
/// [template]MemberType: member type derived from a team policy
/// [template]UploType: indicates either upper triangular or lower triangular; Uplo::Upper, Uplo::Lower
/// [template]TransType: transpose of A; Trans::NoTranspose, Trans::Transpose
/// [template]DiagType: diagonals; Diag::Unit or Diag::NonUnit
/// [template]AlgoType: Unblocked, Blocked, CompatMKL
/// [in]alpha: scalar value
/// [in]A: 2d view
/// [in]x: 1d view
/// [in]beta: scalar value
/// [in/out]y: 1d view
int SerialTrsv<UploType,TransType,DiagType,AlgoType>
::invoke(const ScalarType alpha, const AViewType &A, const xViewType &b);
int TeamGemv<MemberType,TransType,DiagType,AlgoType>
::invoke(const MemberType &member,
const ScalarType alpha, const AViewType &A, const xViewType &b);
Performs triangular solve operations: b = op(A) x
and b
is overwritten as A^{-1}b
. The DiagType
indicates that it considers diagonals as unit values or not.
KokkosBatched_Trsv_Decl.hpp
KokkosBatched_Trsv_Serial_Impl.hpp
KokkosBatched_Trsv_Team_Impl.hpp
/// [template]MemberType: member type derived from a team policy
/// [template]TransType: transpose of A; Trans::NoTranspose, Trans::Transpose
/// [template]AlgoType: Unblocked, Blocked, CompatMKL
/// [in]alpha: scalar value
/// [in]A: 2d view
/// [in]x: 1d view
/// [in]beta: scalar value
/// [in/out]y: 1d view
int SerialGemm<ATransType,BTransType,AlgoType>
::invoke(const ScalarType alpha, const AViewType &A, const BViewType &B,
const ScalarType beta, const cViewType &C);
int TeamGemm<MemberType,ATransType,BTransType,AlgoType>
::invoke(const MemberType &member,
const ScalarType alpha, const AViewType &A, const BViewType &B,
const ScalarType beta, const CViewType &C);
Performs general matrix-matrix multiplications: C = beta C + alpha op(A) op(B)
.
KokkosBatched_Gemm_Decl.hpp
KokkosBatched_Gemm_Serial_Impl.hpp
KokkosBatched_Gemm_Team_Impl.hpp
/// [template]MemberType: member type derived from a team policy
/// [template]AlgoType: Unblocked, Blocked, CompatMKL
/// [in/out]A: 2d view
/// [in]tiny: a magnitude scalar value compatible to the value type of A
int SerialLU<AlgoType>
::invoke(const AViewType &A,
const MagnitudeScalarType tiny = 0);
int TeamLU<MemberType,AlgoType>
::invoke(const MemberType &member,
const AViewType &A,
const MagnitudeScalarType tiny = 0);
Performs LU factorization without pivoting. Static pivots are applied with a tiny
value to avoid division by a zero. This LU factorization is not in general. However, it is useful for computing small block matrices; for instance, the routine is used to contruct a block Jacobi preconditioner.
KokkosBatched_LU_Decl.hpp
KokkosBatched_LU_Serial_Impl.hpp
KokkosBatched_LU_Team_Impl.hpp
/// [template]MemberType: member type derived from a team policy
/// [template]SideType: Side::Left or Side::Right
/// [template]UploType: Uplo::Upper or Uplo::Lower
/// [template]TransType: Trans::NoTranspose or Trans::Transpose
/// [template]DiagType: Diag::Unit or Diag::NonUnit
/// [template]AlgoType: Unblocked, Blocked, CompatMKL
/// [in]alpha: a scalar value
/// [in]A: 2d view
/// [in/out]B: 2d view
int SerialTrsm<SideType,UploType,TransType,DiagType,AlgoType>
::invoke(const ScalarType alpha, const AViewType &A, const BViewType &B);
int TeamLU<MemberType,SideType,UploType,TransType,DiagType,AlgoType>
::invoke(const MemberType &member,
const ScalarType alpha, const AViewType &A, const BViewType);
Perform triangular solve with multiple right-hand side. With Side::Left
, it solves op(A) X = alpha B
and B
is overwritten by alpha op(A)^{-1} B
. With Side::Right
, it solves X op(A) = alpha B
and B
is overwritten by alpha B op(A)^{-1}
.
KokkosBatched_Trsm_Decl.hpp
KokkosBatched_Trsm_Serial_Impl.hpp
KokkosBatched_Trsm_Team_Impl.hpp
/// [template]SideType: Side::Left or Side::Right
/// [template]UploType: Uplo::Upper or Uplo::Lower
/// [template]TransType: Trans::NoTranspose or Trans::Transpose
/// [template]DiagType: Diag::Unit or Diag::NonUnit
/// [template]AlgoType: Unblocked
/// [in]alpha: a scalar value
/// [in]A: 2d view
/// [in/out]B: 2d view
int SerialTrmm<SideType,UploType,TransType,DiagType,AlgoType>
::invoke(const ScalarType alpha, const AViewType &A, const BViewType &B);
Solves triangular matrix multiply with multiple right-hand sides. With Side::Left
, it solves B = alpha * op(A) * B
. With Side::Right
, it solves B = alpha * B * op(A)
. op is non-transpose, transpose, or conjugate transpose if trans is "n", "t", or "c", respectivley.
KokkosBatched_Trmm_Decl.hpp
KokkosBatched_Trmm_Serial_Impl.hpp
/// [template]UploType: Uplo::Upper or Uplo::Lower
/// [template]DiagType: Diag::Unit or Diag::NonUnit
/// [template]AlgoType: Unblocked
/// [in/out]A: 2d view
int SerialTrtri<UploType,DiagType,AlgoType>
::invoke(const AViewType &A);
Finds the inverse of the triangular matrix, A. A = inv(A).
KokkosBatched_Trtri_Decl.hpp
KokkosBatched_Trtri_Serial_Impl.hpp
/// [template] AViewType
/// [template] UViewType
/// [template] VtViewType
/// [template] SViewType
/// [template] WViewType
/// [in] A: 2D view (m x n).
/// Will be overwritten by routine and contents after return are undefined.
/// [out] U: 2D view (m x m).
/// Will contain left singular vectors (in columns).
/// [out] s: 1D view (min(m, n)).
/// Will contain singular values.
/// [out] Vt: 2D view (n x n).
/// Will contain right singular vectors, transposed (in rows).
/// W: 1D work view (length max(m, n) or greater).
/// Must be contiguous. Contents undefined after return.
/// Unlike LAPACK, there's no faster path for when it's larger than the minimum.
int KokkosBatched::SerialSVD::invoke(
KokkosBatched::SVD_USV_Tag,
const AViewType &A,
const UViewType &U,
const SViewType &s,
const VtViewType &Vt,
const WViewType &W);
Computes the full singular value decomposition (SVD) of a general matrix A. On output, A == U * diag(s) * Vt
. U
and Vt
will be orthogonal. s
will contain the singular values: nonnegative values in descending order. Note that Vt (V^T) is the transposed right singular vectors, like in LAPACK's dgesvd interface.
/// [template] AViewType
/// [template] SViewType
/// [template] WViewType
/// [in] A: 2D view (m x n).
/// Will be overwritten by routine and contents after return are undefined.
/// [out] s: 1D view (min(m, n)).
/// Will contain singular values.
/// W: 1D work view (length at least max(m, n)).
/// Must be contiguous. Contents undefined after return.
int KokkosBatched::SerialSVD::invoke(
KokkosBatched::SVD_S_Tag,
const AViewType &A,
const SViewType &s,
const WViewType &W);
Computes the singular value decomposition (SVD) of a general matrix A, producing just the singular values (not vectors). Otherwise, the same as the full version above.
KokkosBatched::SerialSVD::invoke(KokkosBatched::SVD_USV_Tag(), A, U, sigma, Vt, work);
KokkosBatched::SerialSVD::invoke(KokkosBatched::SVD_S_Tag(), A, sigma, work);
KokkosBatched_SVD_Decl.hpp
SAND2021-11374 O
/// [template]ValueType: float, double, Kokkos::complex<float> and Kokkos::complex<double>
/// [template]MemorySpace: Kokkos::HostSpace, Kokkos::CudaSpace and Kokkos::CudaUVMSpace
enum DefaultVectorLength<ValueType,MemorySpace>::value;
For a different architecture, this enum provides a compile time SIMD length. For instance, Intel compiler flag __AVX512F__
set the length a) 16 for float b) 8 for double c) 8 for complex float and d) 4 for complex double. With a CUDA memory space, the default length is set as 16 for all value types. As the vectorization on a GPU is explicit and flexible, a user must understand the best SIMD length for different GPUs.
/// [template]T: value type
/// [template]l: SIMD length
struct Vector<SIMD<T>,l>
This struct includes a static array of values with a given size. The struct is specialized with a certain vector register type if the architecture supports the corresponding vector instructions. For more detailed information, please see the file KokkosBatched_Vector_SIMD.hpp
. As the vector type and instructions are dependent on hardware, someone wants to test this type with a new architecture. For such a case, one can provide a new set of SIMD type by creating a different tag type instead of SIMD. The SIMD type comes with full arithmetic overloading (e.g., +,-,*,/, etc.) and basic math functions (e.g., sqrt,cbrt,log,exp,pow,etc.). For a full description, see KokkosBatched_Vector_SIMD_Arith/Math.hpp
.
One tricky part in writin a generic routine with this SIMD type is logical and comparison operators. As it is impossible to write a generic routine with if
statement using a SIMD type (each array element needs to behave differently as a result of comparison), we provide overloaded operators returning a Vector<SIMD<bool>,l
. Then, the array of bools can be used as a predicate in a sequence of operations. A simple example is shown to explain the usage.
/// We want to add some values for only positive values in the SIMD array
double alpha = some_value;
Vecotr<SIMD<double>,4> a = random_value(); // set random values
const auto is_positive = a > 0; // construct a predicate
a += is_positive*alpha; // add the some_value for positive member of a
For more complex use cases, we provide the following conditional assign statements and collective functions.
/// Performs conditional assignment: cond ? if_true_value : if_false_value
/// [template]T: value type
/// [template]l: SIMD length
/// [in]cond: predicate
/// [in]if_true_value: values to be assigned for a true bool in the prediate
/// [in]if_false_value: values to be assigned for a false bool in the predicate
Vector<SIMD<T>,l> conditional_assign(const Vector<SIMD<bool>,l> cond,
const Vector<SIMD<T>,l> if_true_value,
const Vector<SIMD<T>,l> if_false_value);
/// Performs reduction with a given binary operator
/// [template]T: value type
/// [template]BinaryOp: a binary functor providing func(a,b) interface
/// [in]val: input value
/// [in]func: binary functor
/// [return]: r_val = func(...func(func(val[0],val[1]),val[2])...,val[l-1])
T reduce(const Vector<SIMD<T>,l> val, const BinaryOp &func);
/// Returns true if all values of the input SIMD type are true
bool is_all_true(const Vector<SIMD<bool>,l> cond);
/// Returns true if any values of the input SIMD type are true
bool is_any_true(const Vector<SIMD<bool>,l> cond);
/// Returns minimum value of the input SIMD type
T min(const Vector<SIMD<T>,l> val);
/// returns maximum value in the input SIMD type
T max(const Vector<SIMD<T>,l> val);
See more details in KokkosBatched_Vector_SIMD_Misc.hpp
.
KokkosBatched_Vector.hpp
KokkosBatched_Vector_SIMD_Arith.hpp
KokkosBatched_Vector_SIMD.hpp
KokkosBatched_Vector_SIMD_Logical.hpp
KokkosBatched_Vector_SIMD_Math.hpp
KokkosBatched_Vector_SIMD_Misc.hpp
KokkosBatched_Vector_SIMD_Relation.hpp
SAND2018-14051 W