
Conversation

@valadaptive
Contributor

Depends on #170. The PR stack is getting a bit large.

When looking into gradient rendering in Vello, I noticed that we often perform many reads from the gradient color LUT. Each of those reads requires a bounds check.

The indices into this LUT come from vector types, so theoretically, we should be able to elide all the bounds checks like this:

```rust
let mut indices: u32x16<S> = [...];
assert!(!self.lut.is_empty());
assert!(self.lut.len() <= u32::MAX as usize);
indices = indices.min((self.lut.len() - 1) as u32);
```

Unfortunately, the compiler still does not recognize that the indices are guaranteed to be in-bounds. To elide the bounds checks, we need to perform the min operation on every index individually, after converting it to a usize.
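As a hypothetical illustration of that per-index workaround (this is not the crate's actual code; `lookup_all` and its shape are invented for the example), clamping each index after converting it to `usize` lets the optimizer prove the access is in bounds:

```rust
// Clamp each index individually so LLVM can see that every access
// is within `lut` and drop the per-access bounds check.
fn lookup_all(lut: &[u32], indices: [u32; 16]) -> [u32; 16] {
    assert!(!lut.is_empty());
    let max = lut.len() - 1;
    let mut out = [0u32; 16];
    for (o, i) in out.iter_mut().zip(indices) {
        // (i as usize).min(max) is always < lut.len(), so the
        // slice bounds check here can be elided.
        *o = lut[(i as usize).min(max)];
    }
    out
}
```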

We can avoid this by introducing a "gather" operation here which ensures all the indices are valid (panicking if the source slice is empty and clamping the indices), then doing a bunch of unchecked accesses in a loop. There's an equivalent "scatter" operation which does the same thing but for writes.

These operations work on arbitrary slice types and gather returns an array. The only vector type involved is the one that holds all the indices.

This PR intentionally does not introduce any operations that gather into/scatter from vector types. That also means that no hardware gather/scatter instructions are used right now, and the performance benefit comes solely from avoiding bounds checks. I'm not sure if we should be reserving the names gather and scatter for operations that work on vectors.

```rust
type Gathered<T> = [T; #len];

#[inline(always)]
fn gather<T: Copy>(self, src: &[T]) -> Self::Gathered<T> {
```
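A minimal standalone sketch of the clamp-then-unchecked pattern described above (illustrative names, not the PR's exact API or implementation):

```rust
// Panic on an empty source, clamp every index to the last valid
// position, then use unchecked reads that the clamp has made safe.
fn gather_clamped<T: Copy, const N: usize>(indices: [u32; N], src: &[T]) -> [T; N] {
    assert!(!src.is_empty(), "gather from an empty slice");
    let max = src.len() - 1;
    indices.map(|i| {
        // SAFETY: the index is clamped to 0..=max, which is in bounds.
        unsafe { *src.get_unchecked((i as usize).min(max)) }
    })
}
```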
Collaborator


I'm wondering whether it makes sense to add a debug assertion that no index exceeds the slice length? This way it's still possible to detect bugs instead of silently clamping to the largest index. But then we would have to remove the guarantee from the documentation that it will always be clamped.
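The suggested debug assertion might look something like this (a hedged sketch of the alternative design, not code from the PR; `debug_check_indices` is an invented name):

```rust
// Flag out-of-range indices in debug builds; in release builds
// debug_assert! compiles to nothing, so clamping still applies.
fn debug_check_indices<T>(indices: &[u32], src: &[T]) {
    debug_assert!(
        indices.iter().all(|&i| (i as usize) < src.len()),
        "gather index out of bounds"
    );
}
```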

@valadaptive
Contributor Author

Before merging, we should figure out what's going on with the Vello use case this was intended for. The theory is that unchecked indexing improves performance (I noticed a ton of bounds checks in the generated ARM assembly). Indeed, using unchecked indexing directly improves performance:

> btw, I tried the inlining you suggested on main again. If I do that, performance drops as mentioned (from around 470 ms to 520 ms). However, if I replace the LUT check with an unchecked access in the inlined version, it goes down to 460 ms! If I do the same on main (just changing to unchecked), I get the same performance, so I think the compiler is for some reason able to optimize some of the index checks away on current main, but it stops working once you inline the iterator... even though that doesn't make much sense.

But the "gather" version is 525 ms, slower than not using it.

Collaborator

@LaurenzV left a comment


LGTM, but I think it would be good to have some tests as well, especially for the edge cases.

```rust
    self.simd.#min_method(self, ((src.len() - 1) as Self::Element).simd_into(self.simd))
};

let inbounds = &*inbounds;
```
Collaborator


I don't think this one's necessary? Same for the other two functions.

Collaborator


(I'm talking about the `let inbounds = &*inbounds;` line.)

@valadaptive
Contributor Author

I've added tests for the scatter/gather ops (well, I generated them, but I looked over them and they cover all the edge cases). Some of these are `#[should_panic]` tests.

I've also made some supporting changes to the `simd_test` macro: it now just copies all input attributes to the generated test functions (I've removed the individual `ignore` functions since they've been no-ops for a while), and if it detects a `#[should_panic]` attribute on a test function, it runs the test using the fallback implementation even if the target CPU features aren't supported.

Surprisingly, `cargo-nextest` can actually run `#[should_panic]` tests even on targets like `wasm32-wasip1` that have `panic=abort`, since each test is run in its own process.
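For illustration, a hedged sketch of what one such edge-case test could look like (the names `gather_from` and `gather_empty_slice_panics` are invented here, not the PR's actual test code):

```rust
// A gather stand-in that panics on an empty source slice and
// clamps indices otherwise.
fn gather_from<T: Copy, const N: usize>(indices: [u32; N], src: &[T]) -> [T; N] {
    assert!(!src.is_empty(), "gather from an empty slice");
    let max = src.len() - 1;
    indices.map(|i| src[(i as usize).min(max)])
}

// Edge case: gathering from an empty slice must panic rather than
// return garbage.
#[test]
#[should_panic(expected = "empty slice")]
fn gather_empty_slice_panics() {
    let empty: &[u32] = &[];
    let _ = gather_from([0u32; 4], empty);
}
```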

I also noticed that I forgot to add `SimdGather` and `SimdScatter` bounds to the native-width `u8s`, `u16s`, and `u32s` types, so I did that as well.

Finally, I removed the unnecessary `&*inbounds`. I thought these were necessary for being able to call `get_unchecked` on SIMD types directly, but I think the `Deref` impl takes care of that.

@valadaptive valadaptive requested a review from LaurenzV December 26, 2025 12:44