
Bulk processed pixel conversions #2383

Open
FlareFlo opened this issue Nov 18, 2024 · 12 comments


@FlareFlo
Contributor

Adding an internal API to bulk-process pixel conversions in a SIMD/SWAR-friendly way would speed up many image ops in this crate. I would like to hear some feedback on the general concept and its applications for the future.

Draft

I'm willing to take a shot at implementing these APIs as far as possible; I have done some bikeshedding and benchmarks, as seen here. I observed a 3× speedup when converting a 16 MB f32 buffer to u8 while taking these new constraints into account.
Generally, most conversion APIs would need to be extended like this, though this new API would not be mandatory to implement, as a scalar fallback should always be possible.
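
To make the shape concrete, here is a minimal sketch of such a defaultable extension method (trait and method names are hypothetical, not the crate's actual internals):

```rust
// Hypothetical internal trait: a bulk method with a scalar default, so
// implementing the fast path is never mandatory.
trait FromPrimitiveBulk<T: Copy>: Sized {
    fn from_primitive(component: T) -> Self;

    /// Default implementation: plain scalar loop. SIMD/SWAR-friendly
    /// implementations can override this to process whole slices at once.
    fn from_primitive_bulk(input: &[T], output: &mut [Self]) {
        for (out, inp) in output.iter_mut().zip(input) {
            *out = Self::from_primitive(*inp);
        }
    }
}
```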

@HeroicKatora
Member

HeroicKatora commented Nov 18, 2024

I appreciate the extension for minimalism alone, very nice if that default'able extension method to an internal trait would work out to 3× speedups for some of the higher-level operations. In general, yes, anything that can be designed for batching but isn't deserves a critical look.

What's the main hold-up for turning your linked implementation sketch and commit into a PR?

@FlareFlo
Contributor Author

> I appreciate the extension for minimalism alone, very nice if that default'able extension method to an internal trait would work out to 3× speedups for some of the higher-level operations. In general, yes, anything that can be designed for batching but isn't deserves a critical look.
>
> What's the main hold-up for turning your linked implementation sketch and commit into a PR?

The holdup is calling this method internally, as most APIs only call from_color 3-4 times (once per channel or alpha) per pixel at a time.
Right now I'm attempting to further modify upstream APIs to support calling into these new bulk functions.
Once that is complete, I'll draft a PR and get feedback on some of the decisions that have to be made.
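
Schematically, the status-quo call pattern looks something like this (a simplified sketch, not the exact internals):

```rust
// Sketch of the status quo: one `from_color` call per pixel, each
// converting 3-4 channels, so the optimizer never sees the whole buffer
// as a single bulk operation.
for (to, from) in output.pixels_mut().zip(input.pixels()) {
    to.from_color(from);
}
```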

@FlareFlo
Contributor Author

An example issue would be `FromColor`, as already mentioned:

```rust
fn from_color_bulk(output: &mut [Self], input: &[Rgb<f32>])
where
    Self: Sized,
{
    // Assumes `Rgb<f32>` is layout-compatible with `[f32; 3]`:
    // three channels per pixel, no padding.
    let input: &[f32] = unsafe {
        std::slice::from_raw_parts(input.as_ptr() as *const f32, input.len() * 3)
    };

    // Same goes for the output buffer; this assumes `Self` stores
    // exactly one `u8` per pixel.
    let output: &mut [u8] = unsafe {
        std::slice::from_raw_parts_mut(output.as_mut_ptr() as *mut u8, output.len())
    };

    u8::from_bulk_primitive(input, output);
}
```

Here I have to somehow turn a buffer of `Rgb<f32>` into `&[f32]` (and the same goes for the output buffer). This current 'solution' is just slapped-together unsafe (assuming `Rgb` is `repr(transparent)`, among other things).
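
One alternative to hand-rolled `from_raw_parts` would be `bytemuck`, though that hinges on `Rgb<T>` implementing `Pod`/`Zeroable`, which it does not today, so this is purely a hypothetical sketch:

```rust
use image::Rgb;

// Hypothetical: only compiles if `Rgb<f32>` and `Rgb<u8>` implemented
// `bytemuck::Pod`. bytemuck then verifies size and alignment at the cast
// site instead of relying on hand-written length math.
fn flatten<'a, 'b>(
    input: &'a [Rgb<f32>],
    output: &'b mut [Rgb<u8>],
) -> (&'a [f32], &'b mut [u8]) {
    (bytemuck::cast_slice(input), bytemuck::cast_slice_mut(output))
}
```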

@fintelia
Contributor

I'm supportive of this investigation, though we should make sure to benchmark carefully and to double-check that the same performance gains aren't possible with the current interfaces.

I poked around in godbolt a bit with the `from_primitive` method from your PR. It didn't vectorize cleanly as written, but switching from the `round` function to `+ 0.5` before casting seems to mostly get it there. There's a subtle difference between the SIMD and non-SIMD round methods that may be the cause: on `f32`, ties round away from zero, while `SimdFloat` rounds them towards zero.

Another point to mention is that `std::simd` is currently nightly-only, so we can't use it except behind a non-default feature flag. And even there, we'd want a substantial performance difference to justify the added overhead of dealing with unstable (and potentially changing) library methods.
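
For reference, the `+ 0.5` rewrite described above looks like this (a sketch assuming the input is already normalized to [0, 1]):

```rust
// Sketch of the `+ 0.5` rewrite. For non-negative inputs this rounds ties
// away from zero, matching `f32::round`, while staying friendly to
// auto-vectorization; `as` saturates out-of-range values and NaN.
#[inline]
fn f32_to_u8(x: f32) -> u8 {
    (x * 255.0 + 0.5) as u8
}
```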

@fintelia
Contributor

So far we've avoided ever dealing with `&[Rgb<f32>]` or similar slices, so I'm not thrilled about changing that. For now, using unsafe casts seems fine, but I'd want to see performance numbers and consider alternatives before making a final decision.

@FlareFlo
Contributor Author

> I'm supportive of this investigation, though we should make sure to benchmark carefully and to double-check that the same performance gains aren't possible with the current interfaces.
>
> I poked around in godbolt a bit with the `from_primitive` method from your PR. It didn't vectorize cleanly as written, but switching from the `round` function to `+ 0.5` before casting seems to mostly get it there. There's a subtle difference between the SIMD and non-SIMD round methods that may be the cause: on `f32`, ties round away from zero, while `SimdFloat` rounds them towards zero.
>
> Another point to mention is that `std::simd` is currently nightly-only, so we can't use it except behind a non-default feature flag. And even there, we'd want a substantial performance difference to justify the added overhead of dealing with unstable (and potentially changing) library methods.

This is why I'm still toying around with all of this.
For a possible merge I would use portable-simd.
I'll try to get this implemented into a working state and present a proper benchmark to compare any possible gains.

@FlareFlo
Contributor Author

FlareFlo commented Nov 18, 2024

Some simple preliminary (and unscientific) benchmarks show a significant improvement for a 4096² pixel image (f32 -> u8), which is the motivating case for my efforts.
[two benchmark screenshots]
(AVX-512 CPU for the native run)

@awxkee
Contributor

awxkee commented Nov 18, 2024

SIMD will be much better here on x86, since Rust puts a lot of effort into f32 -> u8. With NEON the conversion is done in one instruction and there will be no difference; improvements will only show up on x86, where the f32 -> u8 conversion has no hardware saturation and Rust makes additional checks.

You can check how much time Rust spends ensuring the conversion can't cause UB on x86:

```rust
#[inline(always)]
fn from_primitive(float: f32) -> u8 {
    unsafe {
        (normalize_float(float, u8::MAX as f32) + 0.5).to_int_unchecked::<u8>()
    }
}
```

On aarch64, hand-written SIMD versions will very likely be worse than Rust's own codegen.

Other things in the loop should not add real overhead.
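
For contrast, here is the checked variant that safe code gets with a plain `as` cast (reusing the hypothetical `normalize_float` helper from the snippet above):

```rust
#[inline(always)]
fn from_primitive_checked(float: f32) -> u8 {
    // Safe `as` saturates: NaN maps to 0 and out-of-range values clamp to
    // u8::MIN / u8::MAX; those guarantees are the extra checks on x86.
    (normalize_float(float, u8::MAX as f32) + 0.5) as u8
}
```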


@FlareFlo
Contributor Author

> SIMD will be much better here on x86, since Rust puts a lot of effort into f32 -> u8. With NEON the conversion is done in one instruction and there will be no difference; improvements will only show up on x86, where the f32 -> u8 conversion has no hardware saturation and Rust makes additional checks.
>
> You can check how much time Rust spends ensuring the conversion can't cause UB on x86:
>
> ```rust
> #[inline(always)]
> fn from_primitive(float: f32) -> u8 {
>     unsafe {
>         (normalize_float(float, u8::MAX as f32) + 0.5).to_int_unchecked::<u8>()
>     }
> }
> ```
>
> On aarch64, hand-written SIMD versions will very likely be worse than Rust's own codegen.
>
> Other things in the loop should not add real overhead.

I observed that the compiler can actually emit pretty fast SIMD code for scalar implementations, given that proper slice sizes are used (such as casting Rgb). So it might be enough to just provide scalar bulk ops and let the compiler take the wheel.
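
For illustration, a bulk op written as a plain scalar loop over flat channel slices (a sketch assuming input normalized to [0, 1]) that LLVM can typically auto-vectorize on its own:

```rust
// Plain scalar loop over flat channel slices; asserting equal lengths up
// front lets LLVM elide per-element bounds checks and vectorize the body.
fn convert_channels(dst: &mut [u8], src: &[f32]) {
    assert_eq!(dst.len(), src.len());
    for (d, s) in dst.iter_mut().zip(src) {
        *d = (s.clamp(0.0, 1.0) * 255.0 + 0.5) as u8;
    }
}
```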

@FlareFlo
Contributor Author

I believe this should be packaged into three stages of PR work, which I'll explain below and would like to hear some thoughts on. I'll end each stage with the progress I have made in that direction.

1. Define the bulk API such that any conversion could use it, but only implement (or rather, keep) the trait-provided default impl (scalar).
   This change should be ready to PR in less than a week if I find a few free hours.
2. Create specialized implementations for specific `Pixel` types using scalar implementations, applying unsafe where necessary, to enable the compiler to see past `Pixel` types when optimizing.
   Possible in the near future, but requires manual auditing and additional testing (compared to my slapped-together POC).
3. Investigate whether we even need explicit SIMD on individual channel-to-channel conversions, as I have observed the compiler/LLVM producing quite fast SIMD from scalar implementations on its own; a rough sketch follows below.
   Should be very doable once step 2 is available where SIMD makes sense (f32 -> u8/u16), though the question regarding nightly requirements for `std::simd` needs to be answered.
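
To make stage 3 concrete, an explicit `std::simd` version could look roughly like this (nightly-only, which is exactly the open question; a sketch assuming input normalized to [0, 1]):

```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

// Convert normalized f32 samples to u8, 16 lanes at a time, with a scalar
// tail. `Simd::cast` saturates like scalar `as`.
fn simd_f32_to_u8(dst: &mut [u8], src: &[f32]) {
    assert_eq!(dst.len(), src.len());
    let mut i = 0;
    while i + 16 <= src.len() {
        let v = f32x16::from_slice(&src[i..]);
        let scaled = v * f32x16::splat(255.0) + f32x16::splat(0.5);
        let bytes: u8x16 = scaled.cast();
        bytes.copy_to_slice(&mut dst[i..i + 16]);
        i += 16;
    }
    for (d, s) in dst[i..].iter_mut().zip(&src[i..]) {
        *d = (*s * 255.0 + 0.5) as u8;
    }
}
```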

@fintelia
Contributor

fintelia commented Nov 20, 2024

I think we should start by adding a benchmark of the user-facing functionality that we want to optimize. In this case, I believe that would mean measuring either `ImageBuffer::convert` or one/several of the `DynamicImage::into_XXX` methods. This will serve as a baseline and enable us to evaluate the impact of subsequent changes.
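
For instance, a minimal criterion harness along those lines might look like this (the target method and image size here are illustrative assumptions, not a settled choice):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use image::{DynamicImage, Rgb32FImage};

// Baseline benchmark of the user-facing conversion path (4096x4096,
// f32 -> u8), measured before any bulk-API changes land.
fn bench_convert(c: &mut Criterion) {
    let img = DynamicImage::ImageRgb32F(Rgb32FImage::new(4096, 4096));
    c.bench_function("rgb32f_to_rgb8", |b| b.iter(|| img.to_rgb8()));
}

criterion_group!(benches, bench_convert);
criterion_main!(benches);
```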

From there, the next step would be a proof-of-concept implementation showing the overall performance difference that is possible. It is also an opportunity to experiment with whether there are other ways to get similar performance gains.

If the results from the previous step are compelling, then we can figure out the right sequence to merge things in. Your proposed sequence of first defining the API, then adding specialized implementations, and following up with further optimizations where beneficial seems broadly reasonable. This is also where we'd have to answer questions about the usage of unsafe (ideally we'd avoid it unless absolutely necessary) and explicit SIMD.
