-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Add Neon implementation of std::swap_ranges #5819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@microsoft-github-policy-service agree company="Arm" |
|
Hopefully I’ve done something reasonable here - please let me know if you would prefer a different approach when adding new implementations to vector_algorithms.cpp The performance numbers are below - looks good apart from the case of size(1), which I think is due to the added overhead of the function call and the conditional checks now
|
Add an implementation of std::swap_ranges using Neon intrinsics.
d3e9269 to
7e0a46c
Compare
This comment was marked as resolved.
This comment was marked as resolved.
stl/src/vector_algorithms.cpp
Outdated
| _Left = vld1_lane_u32(static_cast<uint32_t*>(_First1), _Left, 0); | ||
| _Right = vld1_lane_u32(static_cast<uint32_t*>(_First2), _Right, 0); | ||
| vst1_lane_u32(static_cast<uint32_t*>(_First1), _Right, 0); | ||
| vst1_lane_u32(static_cast<uint32_t*>(_First2), _Left, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just confirming, no change requested: This is accessing _First1 and _First2 as uint32_t*, but they aren't necessarily 4-byte aligned. Is this cromulent?
We have test coverage for this scenario so I think we're good:
STL/tests/std/tests/VSO_0000000_vector_algorithms/test.cpp
Lines 1119 to 1123 in 1118f37
| // also test unaligned input | |
| const auto endOffset = min(static_cast<ptrdiff_t>(dataCount), attempts + 1); | |
| assert( | |
| right.begin() + (endOffset - 1) == swap_ranges(left.begin() + 1, left.begin() + endOffset, right.begin())); | |
| last_known_good_swap_ranges(leftCopy.begin() + 1, leftCopy.begin() + endOffset, rightCopy.begin()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see this was discussed on the discord also, but for the record here - yes this is fine as the LD1 lane instructions this will produce allow unaligned loads.
|
Thanks @hazzlim, this is awesome! 😻 I pushed some commits, can you rerun your benchmark measurements? We finally have ARM64 runtime testing in PR checks, but I won't be able to gather perf measurements myself until Feb 2026-ish. The changes to override |
|
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
Nice, thanks for doing this! Not sure how I missed that we could remove the loops for I will re-run the perf and report, it may well be unobservable but I think it should also improve the LDP generation so that's a win :) |
Add an implementation of std::swap_ranges using Neon intrinsics.