Conversation

@hazzlim hazzlim commented Oct 31, 2025

Add an implementation of std::swap_ranges using Neon intrinsics.
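
For readers unfamiliar with the technique, here is a minimal sketch of the core idea (illustrative only, not the PR's actual code; the real implementation in vector_algorithms.cpp also handles other element widths, sub-16-byte tails, and threshold checks):

```cpp
// Illustrative sketch: swap two byte ranges 16 bytes at a time with Neon
// 128-bit loads/stores, falling back to a scalar loop for the tail. The PR's
// actual code in vector_algorithms.cpp is structured differently.
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

void swap_ranges_neon_sketch(std::uint8_t* first1, std::uint8_t* first2, std::size_t count) {
    std::size_t i = 0;
    for (; i + 16 <= count; i += 16) {
        const uint8x16_t left  = vld1q_u8(first1 + i); // unaligned accesses are fine on AArch64
        const uint8x16_t right = vld1q_u8(first2 + i);
        vst1q_u8(first1 + i, right);
        vst1q_u8(first2 + i, left);
    }
    for (; i < count; ++i) { // remaining 0-15 bytes
        const std::uint8_t tmp = first1[i];
        first1[i] = first2[i];
        first2[i] = tmp;
    }
}
```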

@hazzlim hazzlim requested a review from a team as a code owner October 31, 2025 11:02
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Oct 31, 2025

hazzlim commented Oct 31, 2025

@microsoft-github-policy-service agree company="Arm"


hazzlim commented Oct 31, 2025

Hopefully I’ve done something reasonable here - please let me know if you would prefer a different approach when adding new implementations to vector_algorithms.cpp

The performance numbers are below - they look good apart from the size(1) case, which I think is due to the added overhead of the function call and the conditional checks now that std::swap_ranges is no longer inlined into the benchmark. (A rough sketch of the benchmark shape follows the tables.)

| Benchmark | MSVC Speedup |
| --- | --- |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/1` | 0.5x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/5` | 0.9x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/15` | 1.9x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/26` | 4.1x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/38` | 4.8x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/60` | 7.3x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/125` | 12.3x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/800` | 21.9x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/3000` | 22.4x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/9000` | 22.5x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/1` | 0.5x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/5` | 1x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/15` | 1.9x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/26` | 3.9x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/38` | 4.9x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/60` | 7.5x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/125` | 11.8x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/800` | 14.6x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/3000` | 15.1x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/9000` | 14.7x |

| Benchmark | clang-cl Speedup |
| --- | --- |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/1` | 0.6x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/5` | 1.1x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/15` | 1.5x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/26` | 1.4x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/38` | 1.4x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/60` | 1.5x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/125` | 1.5x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/800` | 1x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/3000` | 1x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/9000` | 1x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/1` | 0.6x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/5` | 1.2x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/15` | 1.5x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/26` | 1x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/38` | 1.2x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/60` | 1.4x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/125` | 1.4x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/800` | 1x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/3000` | 1x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/9000` | 1.1x |
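
For context on how these numbers are gathered: the `/N` suffix is the element count, and the allocator parameter controls buffer alignment. A rough, hypothetical sketch of what such a Google Benchmark case can look like (names and structure are assumptions, not the repository's actual benchmark source, and the allocator parameter is omitted here):

```cpp
// Hypothetical sketch of a swap_ranges benchmark case; not the STL repo's
// actual benchmark code. Each Arg(N) corresponds to a /N row above.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

#include <benchmark/benchmark.h>

template <class T>
void std_swap_ranges(benchmark::State& state) {
    const auto size = static_cast<std::size_t>(state.range(0));
    std::vector<T> left(size, T{1});
    std::vector<T> right(size, T{2});
    for (auto _ : state) {
        benchmark::DoNotOptimize(left.data());
        benchmark::DoNotOptimize(right.data());
        std::swap_ranges(left.begin(), left.end(), right.begin());
    }
}

BENCHMARK_TEMPLATE(std_swap_ranges, std::uint8_t)
    ->Arg(1)->Arg(5)->Arg(15)->Arg(26)->Arg(38)->Arg(60)->Arg(125)->Arg(800)->Arg(3000)->Arg(9000);

BENCHMARK_MAIN();
```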

@StephanTLavavej StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture labels Oct 31, 2025
AlexGuteniev: This comment was marked as resolved.

hazzlim: This comment was marked as resolved.

@StephanTLavavej StephanTLavavej self-assigned this Oct 31, 2025
Comment on lines 166 to 169
```cpp
_Left  = vld1_lane_u32(static_cast<uint32_t*>(_First1), _Left, 0);
_Right = vld1_lane_u32(static_cast<uint32_t*>(_First2), _Right, 0);
vst1_lane_u32(static_cast<uint32_t*>(_First1), _Right, 0);
vst1_lane_u32(static_cast<uint32_t*>(_First2), _Left, 0);
```
Member

Just confirming, no change requested: This is accessing _First1 and _First2 as uint32_t*, but they aren't necessarily 4-byte aligned. Is this cromulent?

We have test coverage for this scenario so I think we're good:

```cpp
// also test unaligned input
const auto endOffset = min(static_cast<ptrdiff_t>(dataCount), attempts + 1);
assert(
    right.begin() + (endOffset - 1) == swap_ranges(left.begin() + 1, left.begin() + endOffset, right.begin()));
last_known_good_swap_ranges(leftCopy.begin() + 1, leftCopy.begin() + endOffset, rightCopy.begin());
```

Author

I can see this was discussed on the Discord as well, but for the record here: yes, this is fine, because the LD1 (single lane) instructions this will produce allow unaligned loads.
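
For anyone following along, here is a small standalone illustration (not the PR's code) of swapping a 4-byte chunk at deliberately unaligned addresses with the lane intrinsics; the pointer casts mirror the snippet above and are only meant to show the instruction-level behavior:

```cpp
// Standalone illustration, assuming AArch64 + arm_neon.h: LD1/ST1 (single
// lane) tolerate unaligned addresses, so a 4-byte swap at an odd offset works.
// The real code operates on void* pointers inside vector_algorithms.cpp; this
// is only a demonstration, not the PR's code.
#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int main() {
    std::uint8_t a[8] = {0, 1, 2, 3, 4, 0, 0, 0};
    std::uint8_t b[8] = {0, 9, 8, 7, 6, 0, 0, 0};

    // a + 1 and b + 1 are not 4-byte aligned.
    uint32x2_t left  = vdup_n_u32(0);
    uint32x2_t right = vdup_n_u32(0);
    left  = vld1_lane_u32(reinterpret_cast<std::uint32_t*>(a + 1), left, 0);
    right = vld1_lane_u32(reinterpret_cast<std::uint32_t*>(b + 1), right, 0);
    vst1_lane_u32(reinterpret_cast<std::uint32_t*>(a + 1), right, 0);
    vst1_lane_u32(reinterpret_cast<std::uint32_t*>(b + 1), left, 0);

    std::printf("%d %d %d %d\n", a[1], a[2], a[3], a[4]); // 9 8 7 6
    std::printf("%d %d %d %d\n", b[1], b[2], b[3], b[4]); // 1 2 3 4
}
```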

@StephanTLavavej
Member

Thanks @hazzlim, this is awesome! 😻 I pushed some commits, can you rerun your benchmark measurements? We finally have ARM64 runtime testing in PR checks, but I won't be able to gather perf measurements myself until Feb 2026-ish. The changes to override /Os for ARM64, and to eliminate unnecessary loops, should improve performance but perhaps by an unobservable amount.

@StephanTLavavej StephanTLavavej removed their assignment Nov 4, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Nov 4, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Nov 4, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.


hazzlim commented Nov 5, 2025

> Thanks @hazzlim, this is awesome! 😻 I pushed some commits, can you rerun your benchmark measurements? We finally have ARM64 runtime testing in PR checks, but I won't be able to gather perf measurements myself until Feb 2026-ish. The changes to override /Os for ARM64, and to eliminate unnecessary loops, should improve performance but perhaps by an unobservable amount.

Nice, thanks for doing this! Not sure how I missed that we could remove the loops for len < 64, nice one 😺

I will re-run the perf numbers and report back. It may well be unobservable, but I think it should also improve the LDP generation, so that's a win :)
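
(For readers: a rough, hypothetical sketch of what "removing the loops" for small lengths can look like, i.e. straight-line size checks instead of looping, which also hands the compiler adjacent loads/stores it can fuse into LDP/STP. This is an illustration under those assumptions, not the PR's actual code.)

```cpp
// Hypothetical illustration of straight-line handling for lengths below 64
// bytes: at most one 32-byte step and one 16-byte step instead of a loop.
// Adjacent vld1q loads at consecutive addresses are LDP candidates; the real
// code in vector_algorithms.cpp differs in details.
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

void swap_small(std::uint8_t* p1, std::uint8_t* p2, std::size_t count) { // assumes count < 64
    if (count >= 32) {
        const uint8x16_t a0 = vld1q_u8(p1);
        const uint8x16_t a1 = vld1q_u8(p1 + 16); // adjacent load -> LDP candidate
        const uint8x16_t b0 = vld1q_u8(p2);
        const uint8x16_t b1 = vld1q_u8(p2 + 16);
        vst1q_u8(p1, b0);
        vst1q_u8(p1 + 16, b1);
        vst1q_u8(p2, a0);
        vst1q_u8(p2 + 16, a1);
        p1 += 32;
        p2 += 32;
        count -= 32;
    }
    if (count >= 16) {
        const uint8x16_t a = vld1q_u8(p1);
        const uint8x16_t b = vld1q_u8(p2);
        vst1q_u8(p1, b);
        vst1q_u8(p2, a);
        p1 += 16;
        p2 += 16;
        count -= 16;
    }
    // The remaining < 16 bytes are handled with narrower accesses (8/4/2/1) in
    // the real code; a simple byte loop keeps this sketch self-contained.
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint8_t tmp = p1[i];
        p1[i] = p2[i];
        p2[i] = tmp;
    }
}
```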
