
Resizing images is slow #2223

Open
Shnatsel opened this issue May 1, 2024 · 35 comments
Labels
kind: slow Not wrong, but still unusable

Comments

@Shnatsel
Contributor

Shnatsel commented May 1, 2024

The image resizing routines in image seem to be really slow compared to the competition.

For example, the fast_image_resize crate's benchmarks show that image is 5x slower than libvips and fast_image_resize's fallback implementation, and 20x slower than fast_image_resize's AVX2 implementation. fast_image_resize also performs runtime CPU feature detection, so its SIMD paths kick in without any special build flags.

I have run the RGB benchmarks locally against image v0.25.1 to reproduce the results, and I can confirm that they are accurate on my machine - the performance gap is real:

|            | Nearest |  Box  | Bilinear | Bicubic | Lanczos3 |
|------------|:-------:|:-----:|:--------:|:-------:|:--------:|
| image      |  21.10  |   -   |  61.47   | 103.41  |  145.23  |
| resize     |    -    | 21.92 |  43.90   |  81.38  |  120.40  |
| libvips    |  5.64   | 46.89 |  15.42   |  23.99  |  31.24   |
| fir rust   |  0.20   | 7.27  |  11.57   |  20.53  |  29.96   |
| fir sse4.1 |  0.20   | 3.23  |   4.52   |  7.91   |  11.45   |
| fir avx2   |  0.21   | 2.32  |   2.90   |  5.04   |   8.25   |
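For context on why the Nearest row is an order of magnitude cheaper than the others: nearest-neighbor just copies the closest source pixel, with no convolution at all. A minimal sketch (hypothetical helper, not any of the implementations benchmarked above):

```rust
/// Nearest-neighbor resize of a packed RGB (3 bytes/pixel) buffer.
/// Illustrative only; real implementations precompute the index maps.
fn resize_nearest_rgb(src: &[u8], sw: usize, sh: usize, dw: usize, dh: usize) -> Vec<u8> {
    let mut dst = vec![0u8; dw * dh * 3];
    for y in 0..dh {
        let sy = (y * sh) / dh; // map destination row to nearest source row
        for x in 0..dw {
            let sx = (x * sw) / dw; // map destination column to nearest source column
            let s = (sy * sw + sx) * 3;
            let d = (y * dw + x) * 3;
            dst[d..d + 3].copy_from_slice(&src[s..s + 3]);
        }
    }
    dst
}
```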
@Shnatsel
Contributor Author

Shnatsel commented May 1, 2024

It's tempting to just use fast_image_resize - or at least its fallback implementation, and perhaps make all the unsafe SIMD intrinsics an opt-in feature like we did with rav1e and its nasm feature.

The drawback is that there seems to be some unsafe code in the fallback implementation that would need to be audited. Also, it seems the crate has never been fuzzed.

@slckl

slckl commented Aug 16, 2024

Would it make sense to sacrifice a bit of perf and make the fast_image_resize fallback implementation safe and port it over?

After a quick look it seems like it could still yield decent perf improvements here.

@Shnatsel
Contributor Author

Yes, that sounds like a very good idea!

@Shnatsel Shnatsel added the kind: slow Not wrong, but still unusable label Sep 13, 2024
@awxkee
Contributor

awxkee commented Oct 15, 2024

I could do performant rescaling without unsafe in a separate crate, if that sounds interesting.
Performance should be only slightly slower than, or equal to, the scalar implementation of fir, if everything goes well.

Here I already have scaling with performance close to fast-image-resize, so it should not be too hard to make a new scalar implementation without unsafety.

@Shnatsel
Contributor Author

Yes, having fast and zero-unsafe resizing would be very beneficial. Then image could either depend on the crate, or the code could be copied over if need be.

For integer types you should be able to get close to the performance of explicit SIMD using autovectorization. f32 is tricky because the compiler won't autovectorize float arithmetic by default (reordering floating-point operations changes the results, so LLVM won't do it without permission), but these kinds of images are also not very common, so I'm not too concerned about them being slower right now; eventually we'll be able to make them fast once std::simd is stabilized.
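To illustrate the autovectorization point: an integer inner loop that widens samples to i32 and applies fixed-point weights gives LLVM a straightforward, branch-free pattern to vectorize. A sketch with hypothetical names; the Q8.8 weights summing to 256 are an assumption, not anyone's actual kernel layout:

```rust
/// One output sample of a horizontal convolution over u8 values.
/// Widen to i32, multiply by fixed-point weights, narrow once at the end.
fn convolve_u8(samples: &[u8], weights: &[i32]) -> u8 {
    // Assumption: weights are Q8.8 fixed point and sum to 256.
    let acc: i32 = samples
        .iter()
        .zip(weights)
        .map(|(&s, &w)| s as i32 * w)
        .sum();
    // Round, shift back down, and clamp to the u8 range.
    ((acc + 128) >> 8).clamp(0, 255) as u8
}
```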

@awxkee
Contributor

awxkee commented Oct 15, 2024

This type of convolution is very unlikely to be auto-vectorized correctly, especially for the very common u8 data type.

I'd say that, when relying on auto-vectorization, convolution of an RGBA f32 image might sometimes perform the same as, or even better than, the u8 data type.

Ok, I think I'll do it.

One question I'd like to ask first: do you expect this to use rayon, to be single-threaded, or to have rayon behind a feature flag?

@Shnatsel
Contributor Author

I think it's best to stick to a single-threaded implementation for now.

The scalar fallback path in fast-image-resize downscales a 4K image in 20ms, which is already very fast, so I don't think more optimization for large images is needed. On the flip side, using rayon would make resizing lots of small images slower. It would also hurt batch workloads that resize multiple images at once.

@awxkee
Contributor

awxkee commented Oct 16, 2024

So, here are some intermediate results with #![forbid(unsafe_code)] against the fast-image-resize scalar implementation, for the NASA 8-bit 4928x3279 image downscaled by 2x.

Seems like a good deal for the price of safety.
However, with each further 2x of downscaling the gap will increase, since the kernel size grows and the bounds checks on slices multiply sharply.
That said, I don't think downscaling an image by 10-20-100x is a case worth worrying about.

Tests were made on a laptop, so about 10-15% of variance in the timings should be expected.

Pic scale RGB: Lanczos 3
time: [68.415 ms 68.923 ms 69.444 ms]

Fast image resize RGB: Lanczos 3
time: [43.627 ms 44.255 ms 45.004 ms]

Pic scale RGBA: Lanczos 3
time: [60.059 ms 60.103 ms 60.155 ms]

Fast image resize RGBA: Lanczos 3
time: [50.734 ms 51.478 ms 52.298 ms]

Also tests with optional rayon since it was interesting to try it out.

Pic scale RGB: Lanczos 3
time: [9.1101 ms 9.1725 ms 9.2402 ms]

Fast image resize RGB: Lanczos 3
time: [8.5962 ms 8.7137 ms 8.8374 ms]

Pic scale RGBA: Lanczos 3
time: [9.3803 ms 9.4747 ms 9.5712 ms]

Fast image resize RGBA: Lanczos 3
time: [10.355 ms 10.624 ms 10.924 ms]

@awxkee
Contributor

awxkee commented Oct 16, 2024

Also, I'm compiling on aarch64, where NEON is always available, so LLVM can optimize accordingly.
On trickier architectures such as x86 without target-cpu=native, the behaviour might change significantly.

@Shnatsel
Contributor Author

That looks like a great improvement!

As for bounds checks, I've written a very detailed article about avoiding them without unsafe code:
https://shnatsel.medium.com/how-to-avoid-bounds-checks-in-rust-without-unsafe-f65e618b4c1e
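One of the techniques from that article, as a quick sketch: take the sub-slice once before the loop so the compiler can prove every index is in range and drop the per-element checks. Hypothetical helper for illustration:

```rust
/// Sum one row of samples with the bounds check hoisted out of the loop.
/// The single slice operation below panics if the range is out of bounds,
/// which lets the optimizer elide the check on every iteration.
fn row_sum(data: &[u8], row_start: usize, width: usize) -> u32 {
    // One check here instead of one per element.
    let row = &data[row_start..row_start + width];
    row.iter().map(|&v| v as u32).sum()
}
```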

@awxkee
Contributor

awxkee commented Oct 16, 2024

Yup, I've read that, thanks.

I think I already know most of it :)

The current state, in single-threaded mode on aarch64 on the same image:

Pic scale RGBA: Lanczos 3
                        time:   [35.462 ms 35.788 ms 36.122 ms]

Fast image resize RGBA: Lanczos 3
                        time:   [49.988 ms 50.339 ms 50.852 ms]

Pic scale RGB: Lanczos 3
                        time:   [46.607 ms 46.942 ms 47.275 ms]

Fast image resize RGB: Lanczos 3
                        time:   [44.320 ms 44.941 ms 45.676 ms]

Pic scale LA: Lanczos 3 time:   [37.117 ms 37.350 ms 37.595 ms]

Fast image resize LA: Lanczos 3
                        time:   [43.058 ms 43.459 ms 43.912 ms]

Pic scale Plane: Lanczos 3
                        time:   [26.315 ms 26.593 ms 26.892 ms]

Fast image resize Plane: Lanczos 3
                        time:   [19.720 ms 19.928 ms 20.137 ms]

Planar 8-bit images suffer a little; I'll probably stop there, because they are not very common.
I'll check x86 and f32 + u16 next; hopefully everything will go just fine.

@awxkee
Contributor

awxkee commented Oct 17, 2024

So, here are the final results.

Surprisingly, I didn't expect forbid(unsafe) could sometimes turn out this well.

Tests on x86 were made without explicitly enabling SIMD extensions; enabling them, or using target-cpu=native, should add another ~15% of performance. Also, mine is not the best x86 laptop, so some notable variance in the timings should be expected anyway.
Since this depends entirely on compiler decisions, performance will also shift with the compiler version, in both directions.

I'll publish the final crate to crates.io soon; then you can integrate it, or take some parts of it if interested.

Here and here there are still a few bottlenecks with bounds checks, but at the moment this is the best I have achieved. If there are ideas on how to completely avoid these bounds checks, that could add another ~10-15%.

If you wish to see something specific there, let me know.

@awxkee
Contributor

awxkee commented Oct 17, 2024

The package is on crates.io; you can pick everything you need from that crate.

Hope that helps.

@Shnatsel
Contributor Author

I've tried out pic-scale-safe in wondermagick, here's the commit: Shnatsel/wondermagick@cfbb975

@awxkee I could not find a way to scale a grayscale+alpha u16 image, which is one of the formats that the image crate supports. There is resize_plane8_with_alpha but no resize_plane16_with_alpha, as far as I can tell? That's blocking the switch for me.

For the RGB and RGBA images I've tried on the thumbnailing workload (Nearest to 5x the target size, then Lanczos3 down to the requested size), the difference in end-to-end benchmarks is within 1% of fast_image_resize, so I'd really like to switch over from FIR once the missing function is implemented!

Also the boilerplate I had to write to wire things up was non-intuitive and quite unpleasant, so I'd really like to get this into image proper rather than require users to use the external crate and wire things up manually.

@awxkee
Contributor

awxkee commented Oct 20, 2024

Glad to hear it helps!

@awxkee I could not find a way to scale a grayscale+alpha u16 image, which is one of the formats that the image crate supports. There is resize_plane8_with_alpha but no resize_plane16_with_alpha, as far as I can tell? That's blocking the switch for me.

Will be done.

Also the boilerplate I had to write to wire things up was non-intuitive and quite unpleasant, so I'd really like to get this into image proper rather than require users to use the external crate and wire things up manually.

I'll think about that to make this convenient for everyone.

I often do it this way because most users work in one or two pixel formats and that's all; it avoids the sometimes-crazy codegen and the compile time of building all formats at once. However, yes, for those interested in all of them at once this is not so good.

I've tried out pic-scale-safe in wondermagick, here's the commit: Shnatsel/wondermagick@cfbb975

Are you intentionally not using a linear or perceptual colorspace? For scaling, blurring, and almost any other convolution op except maybe edge detection, it is much preferred, since scaling in a gamma-encoded space completely erases high intensities and makes dark tones even darker; the more you downscale or blur, the more the distortion grows.

There is also an article on how that works in imagemagick.

@awxkee
Contributor

awxkee commented Oct 20, 2024

Here is an example of how non-linearity changes the result. It is blurred rather than scaled, but the approach is the same.
It's best to open these in separate tabs to see the diff, or use some software for that.

blurred_stack_linear
blurred_stack_non_linear

@awxkee
Contributor

awxkee commented Oct 20, 2024

Planar16 has been added in version 0.1.1.
A convenience method is already written here.
It is intentionally kept separate.

@awxkee
Contributor

awxkee commented Oct 20, 2024

Also, please note: scaling, like blur and other general rendering ops, requires premultiplied alpha. Since the image crate always stores unpremultiplied data, you must almost always pass the flag saying you want alpha premultiplied first.

Premultiplication is not really cheap either, so it may be reasonable to do a fast check of whether premultiplication is necessary at all: that is, whether alpha is non-constant or not equal to the bit-depth maximum.

Here is the method, or I can integrate it into the library:

// `AsPrimitive` comes from the num-traits crate.
fn has_non_constant_cap_alpha<
    V: Copy + PartialEq + BitXor<V, Output = V> + AddAssign + 'static,
    const ALPHA_CHANNEL_INDEX: usize,
    const CHANNELS: usize,
>(
    store: &[V],
    width: usize,
    bit_depth: u32,
) -> bool
where
    i32: AsPrimitive<V>,
{
    assert!(ALPHA_CHANNEL_INDEX < CHANNELS);
    assert!(CHANNELS <= 4);
    if store.is_empty() {
        return false;
    }
    // The bit-depth maximum, e.g. 255 for 8-bit images.
    let max: V = ((1i32 << bit_depth) - 1).as_();
    if max != store[ALPHA_CHANNEL_INDEX] {
        return true;
    }
    let zero: V = 0.as_();
    let mut row_sums: V = zero;
    for row in store.chunks_exact(width * CHANNELS) {
        for color in row.chunks_exact(CHANNELS) {
            // XOR with max is zero only when alpha == max, so any
            // divergence leaves a nonzero residue in the running sum.
            row_sums += color[ALPHA_CHANNEL_INDEX].bitxor(max);
        }
        // One branch per row instead of one per pixel.
        if row_sums != zero {
            return true;
        }
    }
    row_sums != zero
}

@awxkee
Contributor

awxkee commented Oct 20, 2024

Here in the wrapper, premultiplication is done; however, the check above is not.

@Shnatsel
Contributor Author

Thank you!

Are you intentionally do not using linear or perceptual colorspace?

Yes. In imagemagick you have to pass flags to explicitly remap the image to linear colorspace and back. I am aware that resizing in linear color space is more correct, and I will add support for explicit flags for it later.

As for alpha pre-multiplication for other image ops - yes, we do have issues with blur due to non-premultiplied alpha. That is something that we would very much like to fix in image.

I see that wondermagick also diverges with imagemagick reference when resizing due to not using premultiplied alpha. I'll go ahead and fix that. I'll measure the results and hopefully that'll help inform how explicit we want to make alpha premultiplication in the API.

@fintelia
Contributor

fintelia commented Oct 20, 2024

Premultiplied alpha adds its own challenges. If you want to export as PNG or whatever, then you'll need to undo the premultiplication before saving. But unfortunately premultiplied alpha isn't fully reversible: the color data is lost for fully transparent areas, and precision is sharply degraded in nearly transparent areas.

@Shnatsel
Contributor Author

That is a good point. And that is why I think premultiplied alpha should not be the default. But both blurring and resizing already alter pixel values significantly, so I don't think loss of precision in nearly transparent areas would be a problem for these operations.

I guess we could store alpha-premultiplied images in a higher precision to avoid precision loss? For example, we could expand u8 pixels to u16 simply by bit-shifting them (equivalent to multiplying by 256), and then dividing by the originally-u8 alpha channel would still keep precision nearly intact. And I guess u16 pixels would have to be upsampled to f32. But that would require 2x more memory and would be quite slow due to the sheer amount of memory we need to read and write, so I don't think it's worth it.
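That widening idea can be sketched for a single u8 channel like so. Hypothetical helpers; the exact rounding choices are assumptions, not an existing API:

```rust
/// Premultiply a u8 channel into u16 with 8 guard bits of headroom:
/// (color << 8) * alpha / 255, computed in u32 to avoid overflow.
fn premultiply_wide(color: u8, alpha: u8) -> u16 {
    (((color as u32) << 8) * alpha as u32 / 255) as u16
}

/// Undo the premultiplication, then drop the guard bits with rounding.
/// With the extra headroom the round trip is nearly lossless for any
/// nonzero alpha; at alpha == 0 the color is unrecoverable regardless.
fn unpremultiply_wide(premul: u16, alpha: u8) -> u8 {
    if alpha == 0 {
        return 0;
    }
    let wide = premul as u32 * 255 / alpha as u32;
    ((wide + 128) >> 8).min(255) as u8
}
```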

@awxkee
Contributor

awxkee commented Oct 20, 2024

That is a good point. And that is why I think premultiplied alpha should not be the default. But both blurring and resizing already alter pixel values significantly, so I don't think loss of precision in nearly transparent areas would be a problem for these operations.

I guess we could store alpha-premultiplied images in a higher precision to avoid precision loss? For example, we could expand u8 pixels to u16 simply by bit-shifting them (equivalent to multiplying by 256), and then dividing by the originally-u8 alpha channel would still keep precision nearly intact. And I guess u16 pixels would have to be upsampled to f32. But that would require 2x more memory and would be quite slow due to the sheer amount of memory we need to read and write, so I don't think it's worth it.

This will at least double the processing cost.

If you want to preserve precision to any reasonable degree you have to go with f32, which means about 3x the processing cost and 4x the memory usage (and if a swap file gets involved, or on a device with slow RAM such as mobile, that adds a further significant drawback to processing speed). But precision would then be almost fully preserved: maybe 10-50 ULP lost in processing, which doesn't matter for common media data. Frankly, for a common media image even a 30000 ULP precision loss is okay :) it changes almost nothing.

@awxkee
Contributor

awxkee commented Oct 20, 2024

I think 99.99% of users just want a correctly blurred/scaled image rather than preserved precision for a pixel with intensity 1 and alpha 1, which of course isn't possible unless you're in floating point. For that case, it could simply be noted somewhere that a precise version exists that preserves the data as much as possible.

@fintelia
Contributor

The cases I worry about are resizing solid color images and either having the pixel values in the output image slightly off, or having fully transparent regions get their red/green/blue channels all set to zero.

@awxkee
Contributor

awxkee commented Oct 20, 2024

For solid images with constant alpha, premultiplication can be avoided via the method above, by checking whether alpha is constant and at its maximum. That is the easy part.

@Shnatsel
Contributor Author

I mean, the color of fully transparent regions doesn't matter since they are fully transparent, and preserving it only causes bloat in the final PNG file. That's why e.g. GIMP usually preprocesses the file to strip that color when saving as PNG.

For solid color images with no alpha variation, the proposed check on whether premultiplication is needed would indeed take care of it. And for an image with a solid color and a gradient in the alpha channel, the precision loss would be exactly identical to what you'd get when blending it anyway.

I agree that the loss of precision is unfortunate, and it seems that there is no solution that's both entirely correct and efficient, only trade-offs. But I think it should be less bad than the color bleed artifacts I'm getting when resizing the test image from #2324. That's also the option both imagemagick and GIMP picked.

@awxkee
Contributor

awxkee commented Oct 20, 2024

For the second case, it seems you'd need to keep two versions of the image if you want to preserve it.
For example, if you have RGBA you can premultiply one copy, drop its alpha channel, scale the RGB, and preserve the other.

It's not really clear what to do next with that.

Perhaps, if you want to store the original rescaled pixels, you'd have to transfer the alpha channel from one image to the other, but that will be incorrect almost every time.

@fintelia
Contributor

I've lost track of what the actual proposal is at this point. Is the idea that DynamicImage::resize would apply premultiplication, resize the image, and then reverse the premultiplication? But only if it was judged necessary by some heuristic?

@Shnatsel
Contributor Author

I'm just going to wire it up to wondermagick tonight so I can show you the code and so we could all easily try it out. Talk is cheap.

@Shnatsel
Contributor Author

Shnatsel commented Oct 20, 2024

Okay, I've wired up pre-multiplication to wondermagick on commit Shnatsel/wondermagick@3964554

For comparison, the version using pic-scale-safe without premultiplication is on commit Shnatsel/wondermagick@99a60a3

You can check them out and run tests on them.

How it works

If the image has an alpha channel:

  1. Check if the entire image has the same alpha value. If yes, skip premultiplication entirely. This lets us avoid the loss of precision when premultiplication isn't necessary for correctness.
  2. In the resize operation, premultiply the alpha on the input
  3. Run the resize operation
  4. Un-premultiply the alpha on the output
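
Steps 2 and 4 above can be sketched for a single u8 RGBA pixel as follows (hypothetical helpers, not the wondermagick code; the rounding is an assumption):

```rust
/// Step 2: scale each color channel by alpha/255 with rounding.
fn premultiply_rgba(px: &mut [u8; 4]) {
    let a = px[3] as u16;
    for c in &mut px[..3] {
        *c = ((*c as u16 * a + 127) / 255) as u8;
    }
}

/// Step 4: divide the color channels back out by alpha, with rounding.
/// Fully transparent pixels are left as-is: their color is undefined.
fn unpremultiply_rgba(px: &mut [u8; 4]) {
    let a = px[3] as u16;
    if a == 0 {
        return;
    }
    for c in &mut px[..3] {
        *c = ((*c as u16 * 255 + a / 2) / a).min(255) as u8;
    }
}
```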

What this fixes

I tested this image that contains only white and transparency: https://commons.wikimedia.org/wiki/File:Transparent_Clouds_Map_Without_Asia.png

Command used for testing: target/release/wm-convert Transparent_Clouds_Map_Without_Asia.png -scale 500 ~/out.png

If resized without pre-multiplication, the black from fully transparent pixels bleeds into the clouds, and darkens the image. This is clearly incorrect. View this image on a white background to see that it now darkens it:

out

And this is the correct result when using premultiplied alpha:

out

The behavior with premultiplied alpha matches both imagemagick and GIMP.

You can also clearly see color bleed on my synthetic testcase from #2324

What this doesn't break

Here's a test image filled with color #160803, and with a pretty low alpha value:

mostly-transparent

Thanks to the heuristic for not premultiplying by alpha when alpha is constant, this image survives completely unscathed and the exact color information is preserved after resizing:

out

What's left to do

The current check for constant alpha is slow. Here's a faster version, but I couldn't get the generics to work out because Primitive doesn't require BitXor and AddAssign:

Code that fails to compile
#[must_use]
fn has_constant_alpha<P, Container>(img: &ImageBuffer<P, Container>) -> bool
where
    P: Pixel + 'static,
    Container: std::ops::Deref<Target = [P::Subpixel]>,
{
    let first_pixel_alpha = match img.pixels().next() {
        Some(pixel) => pixel.channels().last().unwrap(), // there doesn't seem to be a better way to retrieve the alpha channel :(
        None => return true, // empty input image
    };
    // A naive, slower implementation that branches on every pixel: 
    //
    // img.pixels().map(|pixel| pixel.channels().last().unwrap()).all(|alpha| alpha == first_pixel_alpha);
    //
    // Instead of doing that we scan every row first with cheap arithmetic instructions
    // and only compare the sum of divergences on every row, which should be 0
    let mut sum_of_diffs = Default::default(); // primitive numbers default to 0
    for row in img.rows() {
        row.for_each(|pixel| {
            let alpha = pixel.channels().last().unwrap(); // there doesn't seem to be a better way to retrieve the alpha channel :(
            sum_of_diffs += alpha.bitxor(first_pixel_alpha);
        });
        if sum_of_diffs != Default::default() {
            return false
        }
    }

    true
}

I guess I'll just have to rewrite it in terms of a match on DynamicImage because I couldn't get the generic version to work. Also, ideally we need a wider accumulator than the pixel size for integers, because otherwise there's a 1 in 256 chance of an incorrect result due to overflow for 8-bit images.
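For the concrete u8 RGBA case, the wider accumulator closes the overflow hole like this. Illustrative sketch with a hypothetical name, not the generic version discussed above:

```rust
/// Check whether every pixel of a packed u8 RGBA buffer has the same alpha.
/// The per-row sum of XOR residues is kept in u32, so it cannot wrap back
/// to zero the way an 8-bit accumulator can (the 1-in-256 failure mode).
fn has_constant_alpha_rgba8(data: &[u8], width: usize) -> bool {
    let Some(&first_alpha) = data.get(3) else {
        return true; // empty image: trivially constant
    };
    for row in data.chunks_exact(width * 4) {
        // Branch-free inner loop; one comparison per row.
        let diffs: u32 = row
            .chunks_exact(4)
            .map(|px| (px[3] ^ first_alpha) as u32)
            .sum();
        if diffs != 0 {
            return false;
        }
    }
    true
}
```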

Where do we go from here

Once I optimize the alpha check I'd like to upstream this resizing method into image, on account of it being both faster and more correct. I'd also like to add the same pre-multiplication logic to the blur operations on RGBA images.

I do not want to expose pre-multiplication through the public API. Combined with the check we do for the same alpha in the entire image, exposing these details would make the API quite unwieldy. The drawback is that if someone wants to downscale an image and then immediately blur it, we would do some unnecessary work, but that's the price I'm willing to pay for a much saner API.

The open questions about switching to this resize method are:

  1. Would depending on pic-scale-safe as an external crate be acceptable? @awxkee would you be OK with adding image maintainers as committers on the repository and publishers to the crate on crates.io?
  2. The pic-scale-safe crate currently has a single dependency, num-traits. Is that an issue? Actually, image already depends on num-traits, so it's not a new dependency and shouldn't matter.

@awxkee
Contributor

awxkee commented Oct 21, 2024

Would depending on pic-scale-safe as an external crate be acceptable? @awxkee would you be OK with adding image maintainers as committers on the repository and publishers to the crate on crates.io?

Sure; I pretty much made it for the image crate anyway.

The current check for constant alpha is slow. Here's a faster version, but I couldn't get the generics to work out because Primitive doesn't require BitXor and AddAssign:

If BitXor can't be added as a trait bound, it's better to always premultiply: medium- and low-tier, and often even high-tier, x86 CPUs have such bad branch prediction that the check definitely won't be worth it (and don't forget that, on not-the-best RAM, iterating over a big image and reading something like ~200-500MB while doing nothing else is not cheap by itself).

I guess I'll just have to rewrite it in terms of a match on DynamicImage because I couldn't get the generic version to work, and also ideally we need a wider accumulator than pixel size for integers because otherwise for there's a 1 in 256 chance of an incorrect result due to overflow for 8-bit images.

Sure, that's my mistake: the accumulator should be u32. For f32 you'll more likely need another trick, since alpha varies more; even when the variation is negligible, summing f32 bit representations can overflow a u32 quickly. u64 should work if treating anything above EPSILON as non-constant is acceptable; as for me, a check against at least EPSILON might be preferable. So other trait bounds may also be required.

Something like this:

    for color in row.chunks_exact(CHANNELS) {
        row_sums += (color[ALPHA_CHANNEL_INDEX] - max).abs();
    }
    if row_sums > f32::EPSILON {
        return true;
    }

Or check by ULP's

    let ones_bits = 1f32.to_bits();

    for color in row.chunks_exact(CHANNELS) {
        row_ulp += (color[ALPHA_CHANNEL_INDEX].to_bits() as i64 - ones_bits as i64).abs() as u32;
    }
    // Some noticeable ULP, for ex. 500
    if row_ulp > 500 {
        return true;
    }

pic-scale-safe crate currently has a single dependency, num-traits. Is that an issue? Should it be removed, and is it feasible to do so?

Without num-traits the code base would permanently grow to roughly 10x its size in near-identical code to maintain, which is a real burden. You can see here what that looks like when you have 20-30k lines of code all doing the same thing.

@fintelia I can also provide a safe replacement for dcv if that's reasonable. Here is the message from the PR.

@awxkee
Contributor

awxkee commented Oct 21, 2024

Check if the entire image has the same alpha value. If yes, skip premultiplication entirely. This lets us avoid the loss of precision when premultiplication isn't necessary for correctness.

Premultiplication can also be skipped entirely for Nearest Neighbor interpolation, because it makes no difference there: pixels are copied, not blended.

@Shnatsel
Contributor Author

Good suggestion, thanks! I've applied it in wondermagick.

I also got the optimized alpha check to compile by consulting an LLM and using some cursed generic bounds. So as of Shnatsel/wondermagick@05d6916 I believe the implementation is complete, and the performance is representative of what it would be like if integrated into image.

@awxkee
Contributor

awxkee commented Oct 25, 2024

As I'm helping with resizing, I decided to check this one out.

So, the Box filter is definitely the class named resizeAreaFast_Invoker: it clearly averages pixels over the area when downscaling.

Here is the statement where they say that the Area method does not exist for upscaling, so they simulate it with bilinear scaling.

And here they compute the coefficients for simulating area with bilinear, using some strange weights.

So if solving this would help you, I can reproduce this behavior in the lib.
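
For an integer downscale factor, that area averaging boils down to a plain box mean over each k×k source block. A sketch for a single-channel u8 plane (hypothetical helper, not the OpenCV code):

```rust
/// Area downscale of a single-channel u8 plane by an integer factor k:
/// each output pixel is the rounded average of the k*k block it covers.
fn downscale_area(src: &[u8], w: usize, h: usize, k: usize) -> Vec<u8> {
    let (dw, dh) = (w / k, h / k);
    let mut dst = vec![0u8; dw * dh];
    for y in 0..dh {
        for x in 0..dw {
            // Sum the k*k source block in u32 to avoid overflow.
            let mut sum = 0u32;
            for sy in y * k..y * k + k {
                for sx in x * k..x * k + k {
                    sum += src[sy * w + sx] as u32;
                }
            }
            let area = (k * k) as u32;
            dst[y * dw + x] = ((sum + area / 2) / area) as u8;
        }
    }
    dst
}
```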
