
AddLower, PairwiseAdd/Sub and MaskedAbsOr operations #2405

Open · wants to merge 2 commits into master
Conversation

mazimkhan

Adding special arithmetic operations to arm_sve-inl.h and generic_ops-inl.h:

  • AddLower adds the first lanes of the two input vectors and passes through the remaining lanes of vector a unchanged.
  • PairwiseAdd adds consecutive pairs of elements in each input vector and interleaves the resulting lanes.
  • PairwiseSub subtracts consecutive pairs of elements in each input vector and interleaves the resulting lanes.
  • PairwiseAdd128 adds consecutive pairs of elements in each input vector and packs the results into 128-bit blocks, with the results from vector a in the lower half of each block and the results from vector b in the upper half.
  • PairwiseSub128 subtracts consecutive pairs of elements in each input vector and packs the results into 128-bit blocks, with the results from vector a in the lower half of each block and the results from vector b in the upper half.

Tests have been added for the operations.

The instruction matrix in g3doc/instruction_matrix.pdf may need to be updated, but it appears to have been generated manually.


google-cla bot commented Dec 11, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

@jan-wassenberg
Member

Thanks for adding! Are you able to sign the CLA?
No worries about instruction_matrix, that was an initial draft for R&D. The main documentation is quick_reference, which you have updated.

On the naming: Lower typically references the lower half. How about we rename AddLower to AddLane, similar to GetLane?

One concern about constexpr for Pairwise128Indices: this seems to require C++17, right?
Any thoughts on whether the op should only be provided for recent C++, or whether a fallback for older standards is possible?
(perhaps hardcoding tables for each sizeof(T)=1..8?)

@mazimkhan
Author

Thanks @jan-wassenberg for the quick feedback. We will address your comments.

Regarding the CLA: someone from our org signed it and should have included my team's email IDs (including mine) in the CLA. For some reason that has not worked; let me check whether we need to do more on our end.

@jan-wassenberg
Member

FYI the CLA check does mention the email "Author: @mazimkhan <az*****an​@cambridgeconsultants.com>" which looks correct.

@johnplatts
Contributor

For 16-byte I16/U16/I32/U32/F32 vectors on SSSE3/SSE4/AVX2/AVX3/AVX10, PairwiseAdd128(d, a, b) is equivalent to _mm_hadd_*(a, b) and PairwiseSub128(d, a, b) is equivalent to _mm_hsub_*(a, b).

For 32-byte I16/U16/I32/U32/F32 vectors on AVX2/AVX3/AVX10, PairwiseAdd128(d, a, b) is equivalent to _mm256_hadd_*(a, b) and PairwiseSub128(d, a, b) is equivalent to _mm256_hsub_*(a, b).

@johnplatts
Contributor

Here is an improvement to the implementation of PairwiseAdd128/PairwiseSub128:

namespace detail {

// detail::BlockwiseConcatOddEven(d, v) returns the even lanes of each block of
// v followed by the odd lanes of v
#if HWY_TARGET_IS_NEON || HWY_TARGET_IS_SVE || HWY_TARGET == HWY_RVV
template <class D, HWY_IF_T_SIZE_ONE_OF_D(D, (1 << 1) | (1 << 2)),
          HWY_IF_V_SIZE_GT_D(D, 8)>
static HWY_INLINE HWY_MAYBE_UNUSED Vec<D> BlockwiseConcatOddEven(D d,
                                                                 Vec<D> v) {
#if HWY_TARGET == HWY_RVV
  const ScalableTag<uint64_t, HWY_MAX(HWY_POW2_D(D), 0)> du64;
#else
  const Repartition<uint64_t, decltype(d)> du64;
#endif

  const auto evens = ConcatEven(d, v, v);
  const auto odds = ConcatOdd(d, v, v);
  return ResizeBitCast(d, InterleaveWholeLower(ResizeBitCast(du64, evens),
                                               ResizeBitCast(du64, odds)));
}

#else  // !(HWY_TARGET_IS_NEON || HWY_TARGET_IS_SVE || HWY_TARGET == HWY_RVV)

template <class D, HWY_IF_T_SIZE_D(D, 1), HWY_IF_V_SIZE_GT_D(D, 8)>
static HWY_INLINE HWY_MAYBE_UNUSED Vec<D> BlockwiseConcatOddEven(D d,
                                                                 Vec<D> v) {
#if HWY_TARGET == HWY_SSE2
  const RebindToUnsigned<decltype(d)> du;
  const RebindToSigned<RepartitionToWide<decltype(du)> > dw;

  const auto vu = BitCast(du, v);
  return BitCast(
      d, OrderedDemote2To(du, PromoteEvenTo(dw, vu), PromoteOddTo(dw, vu)));
#else
  const Repartition<uint8_t, decltype(d)> du8;
  const auto idx =
      BitCast(d, Dup128VecFromValues(du8, 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7,
                                     9, 11, 13, 15));
  return TableLookupBytes(v, idx);
#endif
}

template <class D, HWY_IF_T_SIZE_D(D, 2), HWY_IF_V_SIZE_GT_D(D, 8)>
static HWY_INLINE HWY_MAYBE_UNUSED Vec<D> BlockwiseConcatOddEven(D d,
                                                                 Vec<D> v) {
#if HWY_TARGET == HWY_SSE2
  const RebindToSigned<decltype(d)> di;
  const RepartitionToWide<decltype(di)> dw;
  const auto vi = BitCast(di, v);
  return BitCast(
      d, OrderedDemote2To(di, PromoteEvenTo(dw, vi), PromoteOddTo(dw, vi)));
#else
  const Repartition<uint8_t, decltype(d)> du8;
  const auto idx = BitCast(d, Dup128VecFromValues(du8, 0, 1, 4, 5, 8, 9, 12, 13,
                                                  2, 3, 6, 7, 10, 11, 14, 15));
  return TableLookupBytes(v, idx);
#endif
}
#endif  // HWY_TARGET_IS_NEON || HWY_TARGET_IS_SVE || HWY_TARGET == HWY_RVV

template <class D, HWY_IF_T_SIZE_D(D, 4), HWY_IF_V_SIZE_GT_D(D, 8)>
static HWY_INLINE HWY_MAYBE_UNUSED Vec<D> BlockwiseConcatOddEven(D d,
                                                                 Vec<D> v) {
#if HWY_TARGET_IS_NEON || HWY_TARGET_IS_SVE
  const Repartition<uint64_t, decltype(d)> du64;
  const auto evens = ConcatEven(d, v, v);
  const auto odds = ConcatOdd(d, v, v);
  return BitCast(
      d, InterleaveWholeLower(BitCast(du64, evens), BitCast(du64, odds)));
#else
  (void)d;
  return Per4LaneBlockShuffle<3, 1, 2, 0>(v);
#endif
}

template <class D, HWY_IF_T_SIZE_D(D, 8), HWY_IF_V_SIZE_GT_D(D, 8)>
static HWY_INLINE HWY_MAYBE_UNUSED Vec<D> BlockwiseConcatOddEven(D /*d*/,
                                                                 Vec<D> v) {
  return v;
}

}  // namespace detail

// Pairwise add with output in 128-bit blocks of a and b.
template <class D, HWY_IF_V_SIZE_GT_D(D, 8)>
HWY_API Vec<D> PairwiseAdd128(D d, Vec<D> a, Vec<D> b) {
  return detail::BlockwiseConcatOddEven(d, PairwiseAdd(d, a, b));
}

// Pairwise sub with output in 128-bit blocks of a and b.
template <class D, HWY_IF_V_SIZE_GT_D(D, 8)>
HWY_API Vec<D> PairwiseSub128(D d, Vec<D> a, Vec<D> b) {
  return detail::BlockwiseConcatOddEven(d, PairwiseSub(d, a, b));
}
