
AddLower, PairwiseAdd/Sub and MaskedAbsOr operations #2405

Open · wants to merge 2 commits into master
Conversation

mazimkhan

Adding special arithmetic operations to arm_sve-inl.h and generic_ops-inl.h:

  • AddLower adds the first lanes of the two input vectors and passes through the remaining lanes of vector a unchanged.
  • PairwiseAdd adds consecutive pairs of elements in each input vector and interleaves the resulting lanes.
  • PairwiseSub subtracts consecutive pairs of elements in each input vector and interleaves the resulting lanes.
  • PairwiseAdd128 adds consecutive pairs of elements in each input vector and packs the results into 128-bit blocks, with the results from vector a in the lower half of each block and the results from vector b in the upper half.
  • PairwiseSub128 subtracts consecutive pairs of elements in each input vector and packs the results into 128-bit blocks, with the results from vector a in the lower half of each block and the results from vector b in the upper half.

Tests have been added for the operations.

The instruction matrix in g3doc/instruction_matrix.pdf may need to be updated, but it appears to have been generated manually.


google-cla bot commented Dec 11, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

@jan-wassenberg
Member

Thanks for adding! Are you able to sign the CLA?
No worries about instruction_matrix, that was an initial draft for R&D. The main documentation is quick_reference, which you have updated.

On the naming: Lower typically references the lower half. How about we rename AddLower to AddLane, similar to GetLane?

One concern about constexpr for Pairwise128Indices: this seems to require C++17, right?
Any thoughts on whether the op should only be provided for recent C++, or whether a fallback for older standards is possible?
(perhaps hardcoding tables for each sizeof(T)=1..8?)

@mazimkhan
Author

Thanks @jan-wassenberg for the quick feedback. We will address your comments.

Regarding the CLA: someone from our org signed it and should have included my team's email IDs (including mine) in the CLA. For some reason that has not worked; let me check whether we need to do more on our end.

@jan-wassenberg
Member

FYI the CLA check does mention the email "Author: @mazimkhan <az*****an​@cambridgeconsultants.com>" which looks correct.

@johnplatts
Contributor

For 16-byte I16/U16/I32/U32/F32 vectors on SSSE3/SSE4/AVX2/AVX3/AVX10, PairwiseAdd128(d, a, b) is equivalent to _mm_hadd_*(a, b) and PairwiseSub128(d, a, b) is equivalent to _mm_hsub_*(a, b).

For 32-byte I16/U16/I32/U32/F32 vectors on AVX2/AVX3/AVX10, PairwiseAdd128(d, a, b) is equivalent to _mm256_hadd_*(a, b) and PairwiseSub128(d, a, b) is equivalent to _mm256_hsub_*(a, b).

@johnplatts
Contributor

Here is an improvement to the implementation of PairwiseAdd128/PairwiseSub128:

namespace detail {

// detail::BlockwiseConcatOddEven(d, v) returns the even lanes of each block of
// v followed by the odd lanes of v
#if HWY_TARGET_IS_NEON || HWY_TARGET_IS_SVE || HWY_TARGET == HWY_RVV
template <class D, HWY_IF_T_SIZE_ONE_OF_D(D, (1 << 1) | (1 << 2)),
          HWY_IF_V_SIZE_GT_D(D, 8)>
static HWY_INLINE HWY_MAYBE_UNUSED Vec<D> BlockwiseConcatOddEven(D d,
                                                                 Vec<D> v) {
#if HWY_TARGET == HWY_RVV
  const ScalableTag<uint64_t, HWY_MAX(HWY_POW2_D(D), 0)> du64;
#else
  const Repartition<uint64_t, decltype(d)> du64;
#endif

  const auto evens = ConcatEven(d, v, v);
  const auto odds = ConcatOdd(d, v, v);
  return ResizeBitCast(d, InterleaveWholeLower(ResizeBitCast(du64, evens),
                                               ResizeBitCast(du64, odds)));
}

#else  // !(HWY_TARGET_IS_NEON || HWY_TARGET_IS_SVE || HWY_TARGET == HWY_RVV)

template <class D, HWY_IF_T_SIZE_D(D, 1), HWY_IF_V_SIZE_GT_D(D, 8)>
static HWY_INLINE HWY_MAYBE_UNUSED Vec<D> BlockwiseConcatOddEven(D d,
                                                                 Vec<D> v) {
#if HWY_TARGET == HWY_SSE2
  const RebindToUnsigned<decltype(d)> du;
  const RebindToSigned<RepartitionToWide<decltype(du)> > dw;

  const auto vu = BitCast(du, v);
  return BitCast(
      d, OrderedDemote2To(du, PromoteEvenTo(dw, vu), PromoteOddTo(dw, vu)));
#else
  const Repartition<uint8_t, decltype(d)> du8;
  const auto idx =
      BitCast(d, Dup128VecFromValues(du8, 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7,
                                     9, 11, 13, 15));
  return TableLookupBytes(v, idx);
#endif
}

template <class D, HWY_IF_T_SIZE_D(D, 2), HWY_IF_V_SIZE_GT_D(D, 8)>
static HWY_INLINE HWY_MAYBE_UNUSED Vec<D> BlockwiseConcatOddEven(D d,
                                                                 Vec<D> v) {
#if HWY_TARGET == HWY_SSE2
  const RebindToSigned<decltype(d)> di;
  const RepartitionToWide<decltype(di)> dw;
  const auto vi = BitCast(di, v);
  return BitCast(
      d, OrderedDemote2To(di, PromoteEvenTo(dw, vi), PromoteOddTo(dw, vi)));
#else
  const Repartition<uint8_t, decltype(d)> du8;
  const auto idx = BitCast(d, Dup128VecFromValues(du8, 0, 1, 4, 5, 8, 9, 12, 13,
                                                  2, 3, 6, 7, 10, 11, 14, 15));
  return TableLookupBytes(v, idx);
#endif
}
#endif  // HWY_TARGET_IS_NEON || HWY_TARGET_IS_SVE || HWY_TARGET == HWY_RVV

template <class D, HWY_IF_T_SIZE_D(D, 4), HWY_IF_V_SIZE_GT_D(D, 8)>
static HWY_INLINE HWY_MAYBE_UNUSED Vec<D> BlockwiseConcatOddEven(D d,
                                                                 Vec<D> v) {
#if HWY_TARGET_IS_NEON || HWY_TARGET_IS_SVE
  const Repartition<uint64_t, decltype(d)> du64;
  const auto evens = ConcatEven(d, v, v);
  const auto odds = ConcatOdd(d, v, v);
  return BitCast(
      d, InterleaveWholeLower(BitCast(du64, evens), BitCast(du64, odds)));
#else
  (void)d;
  return Per4LaneBlockShuffle<3, 1, 2, 0>(v);
#endif
}

template <class D, HWY_IF_T_SIZE_D(D, 8), HWY_IF_V_SIZE_GT_D(D, 8)>
static HWY_INLINE HWY_MAYBE_UNUSED Vec<D> BlockwiseConcatOddEven(D /*d*/,
                                                                 Vec<D> v) {
  return v;
}

}  // namespace detail

// Pairwise add with output in 128-bit blocks of a and b.
template <class D, HWY_IF_V_SIZE_GT_D(D, 8)>
HWY_API Vec<D> PairwiseAdd128(D d, Vec<D> a, Vec<D> b) {
  return detail::BlockwiseConcatOddEven(d, PairwiseAdd(d, a, b));
}

// Pairwise sub with output in 128-bit blocks of a and b.
template <class D, HWY_IF_V_SIZE_GT_D(D, 8)>
HWY_API Vec<D> PairwiseSub128(D d, Vec<D> a, Vec<D> b) {
  return detail::BlockwiseConcatOddEven(d, PairwiseSub(d, a, b));
}
