Choosing NEON over SVE when fixed size vectors are used where possible #2060

Ryo-not-rio · 2024-04-04T16:17:36Z

I've noticed quite a severe performance hit when writing highway code using fixed-size vectors whose size is smaller than the number of lanes available in SVE. This occurred when porting NEON code written for 128-bit vectors to highway on an SVE machine with 256-bit SVE vectors. Would it be possible for highway to choose NEON vectors for fixed-size vectors where the specified size is smaller than or equal to 128 bits?

johnplatts · 2024-04-04T20:01:04Z

It is possible to bitcast SVE vectors to NEON vectors and vice versa on GCC and Clang releases that have support for the arm_neon_sve_bridge.h header, including Clang 14 and later and GCC 14 and later.

On compilers that support the arm_neon_sve_bridge.h header, a uint8x16_t vector can be bitcast to a svuint8_t vector with svset_neonq_u8(svundef_u8(), v), and a svuint8_t vector can be bitcast to a uint8x16_t vector with svget_neonq_u8(v).
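A minimal sketch of the two bitcasts described above, for illustration only. The macro HAVE_NEON_SVE_BRIDGE and the helper names NeonToSve, SveToNeon and RoundTrip16 are hypothetical and not part of Highway. The bridge path is only compiled when SVE is enabled and the header is found; otherwise the helper falls back to a plain byte copy so the snippet still builds on other targets.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical feature check: use the bridge only when SVE is enabled at
// compile time and the bridge header is actually available.
#if defined(__ARM_FEATURE_SVE) && defined(__ARM_NEON) && defined(__has_include)
#if __has_include(<arm_neon_sve_bridge.h>)
#include <arm_neon.h>
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>
#define HAVE_NEON_SVE_BRIDGE 1
#endif
#endif
#ifndef HAVE_NEON_SVE_BRIDGE
#define HAVE_NEON_SVE_BRIDGE 0
#endif

#if HAVE_NEON_SVE_BRIDGE
// NEON -> SVE: places v in the low 128 bits of an otherwise undefined vector.
inline svuint8_t NeonToSve(uint8x16_t v) {
  return svset_neonq_u8(svundef_u8(), v);
}

// SVE -> NEON: extracts the low 128 bits.
inline uint8x16_t SveToNeon(svuint8_t v) { return svget_neonq_u8(v); }
#endif

// Copies 16 bytes from `in` to `out`. Returns true if the copy went through
// a NEON -> SVE -> NEON round trip, false if the fallback memcpy was used.
inline bool RoundTrip16(const std::uint8_t* in, std::uint8_t* out) {
#if HAVE_NEON_SVE_BRIDGE
  const uint8x16_t n = vld1q_u8(in);
  const svuint8_t s = NeonToSve(n);
  vst1q_u8(out, SveToNeon(s));
  return true;
#else
  std::memcpy(out, in, 16);
  return false;
#endif
}
```

Because only the low 128 bits are defined after svset_neonq_u8 on an svundef_u8() vector, any SVE operation applied in between should be predicated to the first 16 bytes.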

It is also possible to re-implement the HWY_SVE2_128 target on top of the fixed-size vector, mask, and tuple types in arm_neon-inl.h (which are wrappers around fixed-size NEON vectors) instead of the SVE scalable vector, mask, and tuple types in arm_sve-inl.h. This works on compilers that support the arm_neon_sve_bridge.h header, because full SVE vectors are exactly 16 bytes on the HWY_SVE2_128 target.

johnplatts · 2024-04-04T21:26:57Z

Here is a link to a Compiler Explorer snippet that demonstrates the use of the ARM NEON SVE Bridge intrinsics (which are defined in the arm_neon_sve_bridge.h header) to convert between NEON vectors and SVE vectors on the HWY_NEON target:
https://godbolt.org/z/EK8h36Err

jan-wassenberg · 2024-04-05T08:38:31Z

If I understand correctly, the issue is that we use FixedTag<uint32_t, 4>, which on SVE requires Load/Store etc. to do extra work to restrict each operation to 128 bits.

+1 to John's comment that SVE2_128 would work when running on Neoverse V2, but I think this use case is running on V1 which actually has 256-bit vectors.

I don't have experience with the SVE/NEON bridge, that sounds interesting. But perhaps I don't fully understand the use case. If we are porting from NEON code, why not just use the NEON target? Is the issue that dynamic dispatch chooses SVE, even though for this use case NEON would be better?

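If so, we can either set HWY_DISABLED_TARGETS to the SVE targets (for example HWY_SVE | HWY_SVE2 | HWY_SVE_256 | HWY_SVE2_128) so that HWY_NEON or HWY_NEON_WITHOUT_AES is chosen, or call hwy::DisableTargets at runtime to influence the dynamic dispatch.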
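As a toy illustration of how a disabled-targets mask can steer dispatch toward NEON, here is a minimal sketch. It does not use Highway: the bit constants and ChooseTarget are hypothetical, and the rule that a lower bit means a more preferred target is an assumption of this model. In real code the mask would be supplied through HWY_DISABLED_TARGETS or hwy::DisableTargets.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical target bits; in this model a lower bit is a preferred target.
constexpr std::int64_t kSve256 = 1 << 0;
constexpr std::int64_t kSve = 1 << 1;
constexpr std::int64_t kNeon = 1 << 2;
constexpr std::int64_t kScalar = 1 << 3;

// Isolates the lowest set bit of x (x must be nonzero).
constexpr std::int64_t LowestBit(std::int64_t x) { return x & -x; }

// Returns the most preferred target that is supported and not disabled,
// or kScalar if nothing else remains.
constexpr std::int64_t ChooseTarget(std::int64_t supported,
                                    std::int64_t disabled) {
  return (supported & ~disabled) == 0 ? kScalar
                                      : LowestBit(supported & ~disabled);
}
```

With every target supported and nothing disabled, the model picks kSve256. Passing kSve256 | kSve as the disabled mask makes it fall through to kNeon, which is the effect wanted for code ported from 128-bit NEON.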

Ryo-not-rio · 2024-04-05T08:48:03Z

Yes, the use case is running on V1, where scalable vectors are used in some parts of the code and fixed-size vectors in other parts. We haven't tested dynamic dispatch, only static dispatch, but even with dynamic dispatch I imagine there is currently no way to use NEON vectors for some parts of the code and SVE for others. Is this understanding correct, or is there actually a way to specify this?

jan-wassenberg · 2024-04-05T09:48:37Z

hm, if the code is isolated and not alternating between SVE/NEON in the same function or source file, it is easy to compile one source file with SVE disabled (so it would use NEON on Arm), and the other one not.

I suppose we could compile both NEON and SVE in the SVE target, and whenever the N in Simd<T, N, kPow2> is <= 16/sizeof(T), only enable the NEON functions. This would probably require quite a few updates to the SFINAE conditions in both files, disabling SVE for small vectors, and disabling NEON for non-capped.

Ryo-not-rio · 2024-04-05T10:28:24Z

I think that would be the ideal solution, but for now, how would one specify whether to use NEON or SVE on a per-function basis? I don't envision mixing NEON and SVE within one function, so a way to specify it per function would most likely be enough.

jan-wassenberg · 2024-04-05T12:04:08Z

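It can work like this (NeonType and SveType are placeholders for the respective vector types):

// Selected when the vector described by D is at most 16 bytes.
template <class D, HWY_IF_V_SIZE_LE_D(D, 16), typename T = TFromD<D>>
NeonType Func(D d) { return NeonType(); }

// Selected when the vector described by D is larger than 16 bytes.
template <class D, HWY_IF_V_SIZE_GT_D(D, 16), typename T = TFromD<D>>
SveType Func(D d) { return SveType(); }

For functions that do not take a D = Simd<T, N, kPow2>, we rely on normal C++ overloading, because NeonType and SveType are not the same type.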
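The overload selection above can be tried without Highway. The following is a minimal sketch in standard C++ under stated assumptions: Tag, NeonVec, SveVec, Zero and Name are hypothetical stand-ins, and std::enable_if takes the place of HWY_IF_V_SIZE_LE_D / HWY_IF_V_SIZE_GT_D. It is not how Highway implements its targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for a tag such as Simd<T, N, kPow2>.
template <typename T, std::size_t N>
struct Tag {
  static constexpr std::size_t kBytes = N * sizeof(T);
};

// Hypothetical stand-ins for the NEON and SVE vector wrapper types.
struct NeonVec {};
struct SveVec {};

// Chosen when the tag describes at most 16 bytes (fits one NEON register).
template <class D, typename std::enable_if<(D::kBytes <= 16)>::type* = nullptr>
NeonVec Zero(D) {
  return NeonVec();
}

// Chosen when the tag describes more than 16 bytes.
template <class D, typename std::enable_if<(D::kBytes > 16)>::type* = nullptr>
SveVec Zero(D) {
  return SveVec();
}

// Functions without a tag argument need no SFINAE: the distinct vector
// types select the overload.
inline const char* Name(NeonVec) { return "neon"; }
inline const char* Name(SveVec) { return "sve"; }
```

For example, Zero(Tag<std::uint32_t, 4>()) describes 16 bytes and returns a NeonVec, while Zero(Tag<std::uint32_t, 8>()) describes 32 bytes and returns an SveVec. Name then resolves by ordinary overloading on the returned type.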
