
Bulk processed pixel conversions #2383

Open
FlareFlo opened this issue Nov 18, 2024 · 12 comments


@FlareFlo
Contributor

Adding an internal API to bulk-process pixel conversions in a SIMD/SWAR-friendly way would speed up many image ops in this crate. I would like to hear some feedback on the general concept and its applications for the future.

Draft

I'm willing to take a shot at implementing these APIs as far as possible; I have done some bikeshedding and benchmarks, as seen here. I observed a 3× speedup when converting a 16 MB f32 buffer to u8 while taking these new constraints into account.
Generally, most conversion APIs would need to be extended like this, though this new API would not be mandatory to implement, as a scalar fallback should always be possible.
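
To make the shape concrete, here is a minimal sketch of such a defaultable extension method (trait and method names are hypothetical, not the crate's actual internals):

```rust
// Hypothetical internal trait: a bulk method with a scalar default, so
// implementing the fast path is never mandatory.
trait FromPrimitiveBulk<T: Copy>: Sized {
    fn from_primitive(component: T) -> Self;

    /// Default implementation: plain scalar loop. SIMD/SWAR-friendly
    /// implementations can override this to process whole slices at once.
    fn from_primitive_bulk(input: &[T], output: &mut [Self]) {
        for (out, inp) in output.iter_mut().zip(input) {
            *out = Self::from_primitive(*inp);
        }
    }
}
```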

@HeroicKatora
Member

HeroicKatora commented Nov 18, 2024

I appreciate the extension for minimalism alone, very nice if that default'able extension method to an internal trait would work out to 3× speedups for some of the higher-level operations. In general, yes, anything that can be designed for batching but isn't deserves a critical look.

What's the main hold-up for turning your linked implementation sketch and commit into a PR?

@FlareFlo
Contributor Author

> I appreciate the extension for minimalism alone, very nice if that default'able extension method to an internal trait would work out to 3× speedups for some of the higher-level operations. In general, yes, anything that can be designed for batching but isn't deserves a critical look.
>
> What's the main hold-up for turning your linked implementation sketch and commit into a PR?

The holdup is calling this method internally, as most APIs only call from_color 3-4 times (once per channel or alpha) per pixel at a time.
Right now I'm attempting to further modify upstream APIs to support calling into these new bulk functions.
Once that is complete, I'll draft a PR and get feedback on some of the decisions that have to be made.
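
Schematically, the status-quo call pattern looks something like this (a simplified sketch, not the exact internals):

```rust
// Sketch of the status quo: one `from_color` call per pixel, each
// converting 3-4 channels, so the optimizer never sees the whole buffer
// as a single bulk operation.
for (to, from) in output.pixels_mut().zip(input.pixels()) {
    to.from_color(from);
}
```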

@FlareFlo
Contributor Author

An example issue would be `FromColor`, as already mentioned:

```rust
fn from_color_bulk(output: &mut [Self], input: &[Rgb<f32>])
where
    Self: Sized,
{
    // Assumes `Rgb<f32>` is layout-compatible with `[f32; 3]`:
    // three channels per pixel, no padding.
    let input: &[f32] = unsafe {
        std::slice::from_raw_parts(input.as_ptr() as *const f32, input.len() * 3)
    };

    // Same goes for the output buffer; this assumes `Self` stores
    // exactly one `u8` per pixel.
    let output: &mut [u8] = unsafe {
        std::slice::from_raw_parts_mut(output.as_mut_ptr() as *mut u8, output.len())
    };

    u8::from_bulk_primitive(input, output);
}
```

Here I have to somehow turn a buffer of `Rgb<f32>` into `&[f32]` (and the same goes for the output buffer). This current 'solution' is just slapped-together unsafe (assuming `Rgb` is `repr(transparent)`, among other things).
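
One alternative to hand-rolled `from_raw_parts` would be `bytemuck`, though that hinges on `Rgb<T>` implementing `Pod`/`Zeroable`, which it does not today, so this is purely a hypothetical sketch:

```rust
use image::Rgb;

// Hypothetical: only compiles if `Rgb<f32>` and `Rgb<u8>` implemented
// `bytemuck::Pod`. bytemuck then verifies size and alignment at the cast
// site instead of relying on hand-written length math.
fn flatten<'a, 'b>(
    input: &'a [Rgb<f32>],
    output: &'b mut [Rgb<u8>],
) -> (&'a [f32], &'b mut [u8]) {
    (bytemuck::cast_slice(input), bytemuck::cast_slice_mut(output))
}
```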

@fintelia
Contributor

I'm supportive of this investigation, though we should make sure to benchmark carefully and to double-check that the same performance gains aren't possible with the current interfaces.

I poked around in godbolt a bit with the `from_primitive` method from your PR. It didn't vectorize cleanly as written, but switching from the `round` function to `+ 0.5` before casting seems to mostly get it there. There's a subtle difference between the SIMD and non-SIMD round methods that may be the cause: on `f32`, ties round away from zero, while `SimdFloat` rounds them towards zero.

Another point to mention is that `std::simd` is currently nightly-only, so we can't use it except behind a non-default feature flag. And even there, we'd want a substantial performance difference to justify the added overhead of dealing with unstable (and potentially changing) library methods.
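
For reference, the `+ 0.5` rewrite described above looks like this (a sketch assuming the input is already normalized to [0, 1]):

```rust
// Sketch of the `+ 0.5` rewrite. For non-negative inputs this rounds ties
// away from zero, matching `f32::round`, while staying friendly to
// auto-vectorization; `as` saturates out-of-range values and NaN.
#[inline]
fn f32_to_u8(x: f32) -> u8 {
    (x * 255.0 + 0.5) as u8
}
```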

@fintelia
Contributor

So far we've avoided ever dealing with `&[Rgb<f32>]` or similar slices, so I'm not thrilled about changing that. For now, using unsafe casts seems fine, but I'd want to see performance numbers and consider alternatives before making a final decision.

@FlareFlo
Contributor Author

> I'm supportive of this investigation, though we should make sure to benchmark carefully and to double-check that the same performance gains aren't possible with the current interfaces.
>
> I poked around in godbolt a bit with the `from_primitive` method from your PR. It didn't vectorize cleanly as written, but switching from the `round` function to `+ 0.5` before casting seems to mostly get it there. There's a subtle difference between the SIMD and non-SIMD round methods that may be the cause: on `f32`, ties round away from zero, while `SimdFloat` rounds them towards zero.
>
> Another point to mention is that `std::simd` is currently nightly-only, so we can't use it except behind a non-default feature flag. And even there, we'd want a substantial performance difference to justify the added overhead of dealing with unstable (and potentially changing) library methods.

This is why I'm still toying around with all of this.
For a possible merge I would use portable-simd.
I'll try to get this implemented into a working state and present a proper benchmark to compare any possible gains.

@FlareFlo
Contributor Author

FlareFlo commented Nov 18, 2024

Some simple preliminary (and unscientific) benchmarks show a significant improvement for a 4096² pixel image (f32 -> u8), which is the motivating case for my efforts.
[two benchmark screenshots]
(AVX-512 CPU for the native run)

@awxkee
Contributor

awxkee commented Nov 18, 2024

SIMD will be much better here on x86, since Rust puts a lot of effort into f32 -> u8. With NEON the conversion is done in one instruction and there will be no difference; improvements will only show up on x86, where the f32 -> u8 conversion has no hardware saturation and Rust makes additional checks.

You can check how much time Rust spends ensuring the conversion can't cause UB on x86:

```rust
#[inline(always)]
fn from_primitive(float: f32) -> u8 {
    unsafe {
        (normalize_float(float, u8::MAX as f32) + 0.5).to_int_unchecked::<u8>()
    }
}
```

On aarch64, hand-written SIMD versions will very likely be worse than Rust's own codegen.

Other things in the loop should not add real overhead.
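
For contrast, here is the checked variant that safe code gets with a plain `as` cast (reusing the hypothetical `normalize_float` helper from the snippet above):

```rust
#[inline(always)]
fn from_primitive_checked(float: f32) -> u8 {
    // Safe `as` saturates: NaN maps to 0 and out-of-range values clamp to
    // u8::MIN / u8::MAX; those guarantees are the extra checks on x86.
    (normalize_float(float, u8::MAX as f32) + 0.5) as u8
}
```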


@FlareFlo
Contributor Author

> SIMD will be much better here on x86, since Rust puts a lot of effort into f32 -> u8. With NEON the conversion is done in one instruction and there will be no difference; improvements will only show up on x86, where the f32 -> u8 conversion has no hardware saturation and Rust makes additional checks.
>
> You can check how much time Rust spends ensuring the conversion can't cause UB on x86:
>
> ```rust
> #[inline(always)]
> fn from_primitive(float: f32) -> u8 {
>     unsafe {
>         (normalize_float(float, u8::MAX as f32) + 0.5).to_int_unchecked::<u8>()
>     }
> }
> ```
>
> On aarch64, hand-written SIMD versions will very likely be worse than Rust's own codegen.
>
> Other things in the loop should not add real overhead.

I observed that the compiler can actually emit pretty fast SIMD code for scalar implementations, given that proper slice sizes are used (such as casting Rgb). So it might be enough to just provide scalar bulk ops and let the compiler take the wheel.
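
For illustration, a bulk op written as a plain scalar loop over flat channel slices (a sketch assuming input normalized to [0, 1]) that LLVM can typically auto-vectorize on its own:

```rust
// Plain scalar loop over flat channel slices; asserting equal lengths up
// front lets LLVM elide per-element bounds checks and vectorize the body.
fn convert_channels(dst: &mut [u8], src: &[f32]) {
    assert_eq!(dst.len(), src.len());
    for (d, s) in dst.iter_mut().zip(src) {
        *d = (s.clamp(0.0, 1.0) * 255.0 + 0.5) as u8;
    }
}
```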

@FlareFlo
Contributor Author

I believe this should be packaged into three stages of PR work, which I'll explain below and would like to hear some thoughts on. I'll end each stage with the progress I have made in that direction.

1. Define the bulk API such that any conversion could use it, but only implement (or rather, keep) the trait-provided default impl (scalar).
   This change should be ready to PR in less than a week if I find a few free hours.
2. Create specialized implementations for specific `Pixel` types using scalar implementations, applying unsafe where necessary, to enable the compiler to see past `Pixel` types when optimizing.
   Possible in the near future, but requires manual auditing and additional testing (compared to my slapped-together POC).
3. Investigate whether we even need explicit SIMD on individual channel-to-channel conversions, as I have observed the compiler/LLVM producing quite fast SIMD from scalar implementations on its own; a rough sketch follows below.
   Should be very doable once step 2 is available where SIMD makes sense (f32 -> u8/u16), though the question regarding nightly requirements for `std::simd` needs to be answered.
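
To make stage 3 concrete, an explicit `std::simd` version could look roughly like this (nightly-only, which is exactly the open question; a sketch assuming input normalized to [0, 1]):

```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

// Convert normalized f32 samples to u8, 16 lanes at a time, with a scalar
// tail. `Simd::cast` saturates like scalar `as`.
fn simd_f32_to_u8(dst: &mut [u8], src: &[f32]) {
    assert_eq!(dst.len(), src.len());
    let mut i = 0;
    while i + 16 <= src.len() {
        let v = f32x16::from_slice(&src[i..]);
        let scaled = v * f32x16::splat(255.0) + f32x16::splat(0.5);
        let bytes: u8x16 = scaled.cast();
        bytes.copy_to_slice(&mut dst[i..i + 16]);
        i += 16;
    }
    for (d, s) in dst[i..].iter_mut().zip(&src[i..]) {
        *d = (*s * 255.0 + 0.5) as u8;
    }
}
```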

@fintelia
Contributor

fintelia commented Nov 20, 2024

I think we should start by adding a benchmark of the user-facing functionality that we want to optimize. In this case, I believe that would mean measuring either `ImageBuffer::convert` or one/several of the `DynamicImage::into_XXX` methods. This will serve as a baseline and enable us to evaluate the impact of subsequent changes.
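
For instance, a minimal criterion harness along those lines might look like this (the target method and image size here are illustrative assumptions, not a settled choice):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use image::{DynamicImage, Rgb32FImage};

// Baseline benchmark of the user-facing conversion path (4096x4096,
// f32 -> u8), measured before any bulk-API changes land.
fn bench_convert(c: &mut Criterion) {
    let img = DynamicImage::ImageRgb32F(Rgb32FImage::new(4096, 4096));
    c.bench_function("rgb32f_to_rgb8", |b| b.iter(|| img.to_rgb8()));
}

criterion_group!(benches, bench_convert);
criterion_main!(benches);
```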

From there, the next step would be a proof-of-concept implementation showing the overall performance difference that is possible. It is also an opportunity to experiment with whether there are other ways to get similar performance gains.

If the results from the previous step are compelling, then we can figure out the right sequence to merge things in. Your proposed sequence of first defining the API, then adding specialized implementations, and following up with further optimizations where beneficial seems broadly reasonable. This is also where we'd have to answer questions about the usage of unsafe (ideally we'd avoid it unless absolutely necessary) and explicit SIMD.
