Conversation

@valadaptive
Contributor

Depends on #159. Working on this is what sent me down that rabbit hole in the first place.

Progress towards #29, implementing the first lane-shuffling operations.

This PR adds an operation that concatenates two vectors and then takes a window of the concatenation. In other words, it takes two n-element vectors and a "window shift" of s, and returns the last n - s elements of the first vector concatenated with the first s elements of the second. This is like the vext family on ARM or alignr on x86.

This can be used to implement "shift items" or "rotate items" operations: provide a zero vector for one operand to get "shift" behavior, or provide the same operand twice to get "rotate" behavior.
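A scalar model may make the semantics concrete. This is a hypothetical reference implementation, not the crate's actual code; the name `slide_ref` is illustrative only:

```rust
/// Scalar model of `slide`: concatenate `a` and `b`, then take the
/// N-element window starting at index SHIFT (so SHIFT = 0 returns `a`
/// and SHIFT = N would return `b`).
fn slide_ref<const SHIFT: usize, const N: usize>(a: [u32; N], b: [u32; N]) -> [u32; N] {
    core::array::from_fn(|i| {
        let idx = i + SHIFT;
        if idx < N { a[idx] } else { b[idx - N] }
    })
}

fn main() {
    let a = [1, 2, 3, 4];
    let b = [5, 6, 7, 8];
    assert_eq!(slide_ref::<2, 4>(a, b), [3, 4, 5, 6]);
    // "shift items": zero vector as the second operand
    assert_eq!(slide_ref::<1, 4>(a, [0; 4]), [2, 3, 4, 0]);
    // "rotate items": same operand twice
    assert_eq!(slide_ref::<1, 4>(a, a), [2, 3, 4, 1]);
}
```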

There are two variants of this operation: one that operates over the full width of the vector and one that operates within 128-bit blocks. Even on AVX2, _mm256_alignr_epi8 operates within 128-bit lanes, and it takes some extra permutes to make a full-width version, so I think it makes sense to provide a per-block version. This will also be the case when I implement fully-general swizzles.
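To illustrate the difference between the two variants, here is a hypothetical scalar model of both on a 32-byte vector (2 x 128-bit blocks); the function names are illustrative, not the crate's API:

```rust
const BLOCK: usize = 16; // one 128-bit block, in bytes

// Full-width variant: a single window over the whole 32-byte concatenation.
fn slide_full<const S: usize>(a: [u8; 32], b: [u8; 32]) -> [u8; 32] {
    core::array::from_fn(|i| if i + S < 32 { a[i + S] } else { b[i + S - 32] })
}

// Per-block variant: each 16-byte block slides independently, pairing block
// k of `a` with block k of `b` (roughly what a single in-lane AVX2
// `_mm256_alignr_epi8` gives you, without extra cross-lane permutes).
fn slide_per_block<const S: usize>(a: [u8; 32], b: [u8; 32]) -> [u8; 32] {
    core::array::from_fn(|i| {
        let (blk, j) = (i / BLOCK, i % BLOCK);
        if j + S < BLOCK {
            a[blk * BLOCK + j + S]
        } else {
            b[blk * BLOCK + j + S - BLOCK]
        }
    })
}

fn main() {
    let a: [u8; 32] = core::array::from_fn(|i| i as u8);
    let b: [u8; 32] = core::array::from_fn(|i| 100 + i as u8);
    // The variants agree while the window stays inside a block, but diverge
    // once it crosses a 128-bit boundary:
    assert_eq!(slide_full::<2>(a, b)[14], 16); // pulls from a's second block
    assert_eq!(slide_per_block::<2>(a, b)[14], 100); // pulls from b's first block
}
```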

The shift amount is provided as a const generic argument, since the underlying intrinsics also expose it that way. In many cases, we need to do math on that const generic argument before passing it to the intrinsic: we might need to convert it from a usize to an i32, divide it by the number of bytes per scalar element, wrap it modulo 16, and so on. Rust doesn't let us do this yet, so I've added "faux-dynamic" versions of the alignr/vext intrinsics, implemented as a huge match statement with one arm for each of the 16 byte shift amounts. Since we inline everything, these should be evaluated at compile time.
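The dispatch pattern described here might look roughly like the following. This is a portable stand-in (the real code dispatches to `vextq_u8`/`_mm_alignr_epi8`-style intrinsics, and the helper names here are hypothetical):

```rust
// Stand-in for an intrinsic whose shift must be a const-generic literal
// (like `vextq_u8::<N>` on NEON or `_mm_alignr_epi8::<N>` on x86).
#[inline(always)]
fn ext_const<const N: usize>(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    core::array::from_fn(|i| if i + N < 16 { a[i + N] } else { b[i + N - 16] })
}

// "Faux-dynamic" wrapper: Rust can't yet compute on a const generic (e.g.
// divide the shift by the scalar width) and pass the result back as another
// const generic, so the adjusted amount is routed through a match. Since
// every caller passes a compile-time-constant `shift` and everything is
// #[inline(always)], the match is expected to constant-fold away.
#[inline(always)]
fn dyn_ext_128(a: [u8; 16], b: [u8; 16], shift: usize) -> [u8; 16] {
    match shift {
        0 => a,
        1 => ext_const::<1>(a, b),
        2 => ext_const::<2>(a, b),
        // ...one arm per byte shift amount; 3..=14 elided in this sketch...
        15 => ext_const::<15>(a, b),
        _ => unreachable!("shift must be < 16"),
    }
}

fn main() {
    let a: [u8; 16] = core::array::from_fn(|i| i as u8); // 0..=15
    let b: [u8; 16] = core::array::from_fn(|i| 16 + i as u8); // 16..=31
    assert_eq!(dyn_ext_128(a, b, 2)[13], 15); // last byte taken from `a`
    assert_eq!(dyn_ext_128(a, b, 2)[14], 16); // first byte taken from `b`
}
```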

I haven't yet confirmed that this generates the LLVM IR we expect on all targets. The codegen seems to produce shufflevectors on x86 and AArch64 (I haven't looked at WebAssembly), but all the functions go through some level of indirection via vectorize or call_once or something, so it's hard to match them up with what they're supposed to be doing.

I'm not fully tied to the name "slide". It's hard to find a good name for this operation. x86's "alignr" makes me think of memory alignment, and ARM's "vext" (officially "vector extract", though it reads like "vector extend") sounds like you're just combining two vectors into a wider one.

@valadaptive force-pushed the rotato branch 3 times, most recently from 403ac25 to 9e2efa5 on December 18, 2025 at 01:31
@valadaptive
Contributor Author

All the merge conflicts are now resolved, since I've rebased this on top of the codegen rework. I'm using this operation in my video effect project to implement an IIR filter.

@LaurenzV (Collaborator) left a comment

I haven't tried to fully understand all of the logic of the added (helper) methods, but I've gotten a good overview and overall it seems fine to add. However, I do have some comments/remarks.

let a = mask64x4::from_slice(simd, &[1, 2, 3, 4]);
let b = mask64x4::from_slice(simd, &[5, 6, 7, 8]);
assert_eq!(*a.slide::<0>(b), [1, 2, 3, 4]);
assert_eq!(*a.slide::<2>(b), [3, 4, 5, 6]); // crosses block
Collaborator

Don't all slide tests cross the block? Just wondering why we add a comment here but not to the others. 🤔

Collaborator

Do we really need all these tests? I feel like the ones in mod.rs should be enough?

let method_ident = Ident::new(self.method, Span::call_site());
let sig_inner = match &self.sig {
OpSig::Splat | OpSig::LoadInterleaved { .. } | OpSig::StoreInterleaved { .. } => {
OpSig::Splat => {
Collaborator

Why was Splat changed here as well?

),
];

pub(crate) fn base_trait_ops() -> Vec<Op> {
Collaborator

Small thing, but as far as I can see this method is only used in one place, so returning an iterator here should be sufficient?

match self {
Self::Splat
| Self::LoadInterleaved { .. }
Self::Splat => &["simd", "val"],
Collaborator

Again, was it on purpose that splat was changed here?

/// expected to be constant in practice, so the match statement will be optimized out. This exists because
/// Rust doesn't currently let you do math on const generics.
#[inline(always)]
unsafe fn dyn_vext_128(a: uint8x16_t, b: uint8x16_t, shift: usize) -> uint8x16_t {
Collaborator

If we already have an unsafe block inside, we don't need it on the function itself, no?
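For context on this question: whether the fn-level `unsafe` is redundant depends on whether the function has a caller-facing safety contract (such as a required target feature). A minimal illustration, unrelated to this PR's actual code:

```rust
#![deny(unsafe_op_in_unsafe_fn)]

/// Hypothetical example. The `unsafe fn` states a caller obligation
/// (`p` must be valid for reads); the inner `unsafe` block is where that
/// obligation is actually relied upon. Under the `unsafe_op_in_unsafe_fn`
/// lint (warn-by-default in the 2024 edition), an `unsafe fn` body no
/// longer acts as an implicit unsafe block, so both can be meaningful.
unsafe fn read_first(p: *const u8) -> u8 {
    // SAFETY: the caller guarantees `p` is valid for reads.
    unsafe { *p }
}

fn main() {
    let x = 7u8;
    // SAFETY: `&x` is a valid pointer for the duration of the call.
    let v = unsafe { read_first(&x) };
    assert_eq!(v, 7);
}
```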

let from_bytes = generic_op_name("cvt_from_bytes", vec_ty);

let byte_shift = if scalar_bytes == 1 {
quote! { SHIFT }
Collaborator

In this case we probably wouldn't need to call the dyn methods either but can just call the intrinsic, right?

let blocks_idx = 0..num_blocks;

// Unroll the construction of the blocks. I tried using `array::from_fn`, but the compiler thought the
// closure was too big and didn't inline it.
Collaborator

Even if you annotate the closure with #[inline(always)]?

#[doc = r" Concatenates `b` and `a` (each 1 x __m256i = 2 blocks) and extracts 2 blocks starting at byte offset"]
#[doc = r" `shift_bytes`. Extracts from [b : a] (b in low bytes, a in high bytes), matching alignr semantics."]
#[inline(always)]
unsafe fn cross_block_alignr_256x1(a: __m256i, b: __m256i, shift_bytes: usize) -> __m256i {
Collaborator

Same as in a different location, we probably don't need to mark the function as unsafe if we use an unsafe block inside?

#[doc = r" Concatenates `b` and `a` (each N blocks) and extracts N blocks starting at byte offset `shift_bytes`."]
#[doc = r" Extracts from [b : a] (b in low bytes, a in high bytes), matching `alignr` semantics."]
#[inline(always)]
unsafe fn cross_block_alignr_128x4(
Collaborator

I'm fine leaving it this way for now, but I'm wondering whether this is really going to be faster than just using the fallback approach? Have you done any benchmarks on that? (Also for 128x2 and 256x2 in AVX2) Especially because cross_block_slide_blocks_at does quite a bit of work and is called 4 times.

Collaborator

(Also applies to NEON and WASM, basically anywhere we polyfill a larger vector width than the base one.)
