Conversation

@hazzlim hazzlim commented Oct 31, 2025

Add an implementation of std::swap_ranges using Neon intrinsics.
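
For readers unfamiliar with the technique, here is a minimal sketch of the core idea (illustrative only, not the PR's actual code; the real implementation in vector_algorithms.cpp also handles other element widths, sub-16-byte tails, and threshold checks):

```cpp
// Illustrative sketch: swap two byte ranges 16 bytes at a time with Neon
// 128-bit loads/stores, falling back to a scalar loop for the tail. The PR's
// actual code in vector_algorithms.cpp is structured differently.
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

void swap_ranges_neon_sketch(std::uint8_t* first1, std::uint8_t* first2, std::size_t count) {
    std::size_t i = 0;
    for (; i + 16 <= count; i += 16) {
        const uint8x16_t left  = vld1q_u8(first1 + i); // unaligned accesses are fine on AArch64
        const uint8x16_t right = vld1q_u8(first2 + i);
        vst1q_u8(first1 + i, right);
        vst1q_u8(first2 + i, left);
    }
    for (; i < count; ++i) { // remaining 0-15 bytes
        const std::uint8_t tmp = first1[i];
        first1[i] = first2[i];
        first2[i] = tmp;
    }
}
```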

@hazzlim hazzlim requested a review from a team as a code owner October 31, 2025 11:02
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Oct 31, 2025

hazzlim commented Oct 31, 2025

@microsoft-github-policy-service agree company="Arm"


hazzlim commented Oct 31, 2025

Hopefully I’ve done something reasonable here - please let me know if you would prefer a different approach when adding new implementations to vector_algorithms.cpp

The performance numbers are below - they look good apart from the size(1) case, which I think is due to the added overhead of the function call and the conditional checks now that std::swap_ranges is no longer inlined into the benchmark. (A rough sketch of the benchmark shape follows the tables.)

| Benchmark | MSVC Speedup |
| --- | --- |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/1` | 0.5x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/5` | 0.9x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/15` | 1.9x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/26` | 4.1x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/38` | 4.8x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/60` | 7.3x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/125` | 12.3x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/800` | 21.9x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/3000` | 22.4x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/9000` | 22.5x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/1` | 0.5x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/5` | 1x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/15` | 1.9x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/26` | 3.9x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/38` | 4.9x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/60` | 7.5x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/125` | 11.8x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/800` | 14.6x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/3000` | 15.1x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/9000` | 14.7x |

| Benchmark | clang-cl Speedup |
| --- | --- |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/1` | 0.6x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/5` | 1.1x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/15` | 1.5x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/26` | 1.4x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/38` | 1.4x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/60` | 1.5x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/125` | 1.5x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/800` | 1x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/3000` | 1x |
| `std_swap_ranges<uint8_t, highly_aligned_allocator>/9000` | 1x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/1` | 0.6x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/5` | 1.2x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/15` | 1.5x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/26` | 1x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/38` | 1.2x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/60` | 1.4x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/125` | 1.4x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/800` | 1x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/3000` | 1x |
| `std_swap_ranges<uint8_t, not_highly_aligned_allocator>/9000` | 1.1x |
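
For context on how these numbers are gathered: the `/N` suffix is the element count, and the allocator parameter controls buffer alignment. A rough, hypothetical sketch of what such a Google Benchmark case can look like (names and structure are assumptions, not the repository's actual benchmark source, and the allocator parameter is omitted here):

```cpp
// Hypothetical sketch of a swap_ranges benchmark case; not the STL repo's
// actual benchmark code. Each Arg(N) corresponds to a /N row above.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

#include <benchmark/benchmark.h>

template <class T>
void std_swap_ranges(benchmark::State& state) {
    const auto size = static_cast<std::size_t>(state.range(0));
    std::vector<T> left(size, T{1});
    std::vector<T> right(size, T{2});
    for (auto _ : state) {
        benchmark::DoNotOptimize(left.data());
        benchmark::DoNotOptimize(right.data());
        std::swap_ranges(left.begin(), left.end(), right.begin());
    }
}

BENCHMARK_TEMPLATE(std_swap_ranges, std::uint8_t)
    ->Arg(1)->Arg(5)->Arg(15)->Arg(26)->Arg(38)->Arg(60)->Arg(125)->Arg(800)->Arg(3000)->Arg(9000);

BENCHMARK_MAIN();
```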

@StephanTLavavej StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture labels Oct 31, 2025
AlexGuteniev: This comment was marked as resolved.

hazzlim: This comment was marked as resolved.

@StephanTLavavej StephanTLavavej self-assigned this Oct 31, 2025
Comment on lines 166 to 169
```cpp
_Left  = vld1_lane_u32(static_cast<uint32_t*>(_First1), _Left, 0);
_Right = vld1_lane_u32(static_cast<uint32_t*>(_First2), _Right, 0);
vst1_lane_u32(static_cast<uint32_t*>(_First1), _Right, 0);
vst1_lane_u32(static_cast<uint32_t*>(_First2), _Left, 0);
```
Member

Just confirming, no change requested: This is accessing _First1 and _First2 as uint32_t*, but they aren't necessarily 4-byte aligned. Is this cromulent?

We have test coverage for this scenario so I think we're good:

```cpp
// also test unaligned input
const auto endOffset = min(static_cast<ptrdiff_t>(dataCount), attempts + 1);
assert(
    right.begin() + (endOffset - 1) == swap_ranges(left.begin() + 1, left.begin() + endOffset, right.begin()));
last_known_good_swap_ranges(leftCopy.begin() + 1, leftCopy.begin() + endOffset, rightCopy.begin());
```

Author

I can see this was discussed on the Discord as well, but for the record here: yes, this is fine, because the LD1 (single lane) instructions this will produce allow unaligned loads.
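
For anyone following along, here is a small standalone illustration (not the PR's code) of swapping a 4-byte chunk at deliberately unaligned addresses with the lane intrinsics; the pointer casts mirror the snippet above and are only meant to show the instruction-level behavior:

```cpp
// Standalone illustration, assuming AArch64 + arm_neon.h: LD1/ST1 (single
// lane) tolerate unaligned addresses, so a 4-byte swap at an odd offset works.
// The real code operates on void* pointers inside vector_algorithms.cpp; this
// is only a demonstration, not the PR's code.
#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int main() {
    std::uint8_t a[8] = {0, 1, 2, 3, 4, 0, 0, 0};
    std::uint8_t b[8] = {0, 9, 8, 7, 6, 0, 0, 0};

    // a + 1 and b + 1 are not 4-byte aligned.
    uint32x2_t left  = vdup_n_u32(0);
    uint32x2_t right = vdup_n_u32(0);
    left  = vld1_lane_u32(reinterpret_cast<std::uint32_t*>(a + 1), left, 0);
    right = vld1_lane_u32(reinterpret_cast<std::uint32_t*>(b + 1), right, 0);
    vst1_lane_u32(reinterpret_cast<std::uint32_t*>(a + 1), right, 0);
    vst1_lane_u32(reinterpret_cast<std::uint32_t*>(b + 1), left, 0);

    std::printf("%d %d %d %d\n", a[1], a[2], a[3], a[4]); // 9 8 7 6
    std::printf("%d %d %d %d\n", b[1], b[2], b[3], b[4]); // 1 2 3 4
}
```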

@StephanTLavavej
Member

Thanks @hazzlim, this is awesome! 😻 I pushed some commits, can you rerun your benchmark measurements? We finally have ARM64 runtime testing in PR checks, but I won't be able to gather perf measurements myself until Feb 2026-ish. The changes to override /Os for ARM64, and to eliminate unnecessary loops, should improve performance but perhaps by an unobservable amount.

@StephanTLavavej StephanTLavavej removed their assignment Nov 4, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Nov 4, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Nov 4, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.


hazzlim commented Nov 5, 2025

> Thanks @hazzlim, this is awesome! 😻 I pushed some commits, can you rerun your benchmark measurements? We finally have ARM64 runtime testing in PR checks, but I won't be able to gather perf measurements myself until Feb 2026-ish. The changes to override /Os for ARM64, and to eliminate unnecessary loops, should improve performance but perhaps by an unobservable amount.

Nice, thanks for doing this! Not sure how I missed that we could remove the loops for len < 64, nice one 😺

I will re-run the perf numbers and report back. It may well be unobservable, but I think it should also improve the LDP generation, so that's a win :)
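
(For readers: a rough, hypothetical sketch of what "removing the loops" for small lengths can look like, i.e. straight-line size checks instead of looping, which also hands the compiler adjacent loads/stores it can fuse into LDP/STP. This is an illustration under those assumptions, not the PR's actual code.)

```cpp
// Hypothetical illustration of straight-line handling for lengths below 64
// bytes: at most one 32-byte step and one 16-byte step instead of a loop.
// Adjacent vld1q loads at consecutive addresses are LDP candidates; the real
// code in vector_algorithms.cpp differs in details.
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

void swap_small(std::uint8_t* p1, std::uint8_t* p2, std::size_t count) { // assumes count < 64
    if (count >= 32) {
        const uint8x16_t a0 = vld1q_u8(p1);
        const uint8x16_t a1 = vld1q_u8(p1 + 16); // adjacent load -> LDP candidate
        const uint8x16_t b0 = vld1q_u8(p2);
        const uint8x16_t b1 = vld1q_u8(p2 + 16);
        vst1q_u8(p1, b0);
        vst1q_u8(p1 + 16, b1);
        vst1q_u8(p2, a0);
        vst1q_u8(p2 + 16, a1);
        p1 += 32;
        p2 += 32;
        count -= 32;
    }
    if (count >= 16) {
        const uint8x16_t a = vld1q_u8(p1);
        const uint8x16_t b = vld1q_u8(p2);
        vst1q_u8(p1, b);
        vst1q_u8(p2, a);
        p1 += 16;
        p2 += 16;
        count -= 16;
    }
    // The remaining < 16 bytes are handled with narrower accesses (8/4/2/1) in
    // the real code; a simple byte loop keeps this sketch self-contained.
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint8_t tmp = p1[i];
        p1[i] = p2[i];
        p2[i] = tmp;
    }
}
```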
