
Decoding animated WebP is 4x slower than libwebp-sys + webp-animation #119

Open
Shnatsel opened this issue Oct 22, 2024 · 15 comments

@Shnatsel
Contributor

In image v0.25.4 and image-webp v0.2.0, decoding the attached animated WebP is 4x slower than using libwebp-sys + webp-animation: sample.zip

image

```rust
use std::error::Error;
use image::{codecs::webp::WebPDecoder, AnimationDecoder, ImageReader};

fn main() -> Result<(), Box<dyn Error>> {
    let input = std::env::args().nth(1).unwrap();
    let reader = ImageReader::open(input)?.into_inner();
    let decoder = WebPDecoder::new(reader)?;
    let mut iter = decoder.into_frames();
    while let Some(_frame) = iter.next() {}
    Ok(())
}
```

hyperfine results:

```
  Time (mean ± σ):     411.9 ms ±   3.9 ms    [User: 372.2 ms, System: 38.8 ms]
  Range (min … max):   407.6 ms … 416.8 ms    10 runs
```

libwebp-sys + webp-animation

```rust
use std::error::Error;
use webp_animation::prelude::*;

fn main() -> Result<(), Box<dyn Error>> {
    let input = std::env::args().nth(1).unwrap();

    let buffer = std::fs::read(input).unwrap();
    let decoder = Decoder::new(&buffer).unwrap();
    let mut iter = decoder.into_iter();
    while let Some(_frame) = iter.next() {}
    Ok(())
}
```

hyperfine results:

```
  Time (mean ± σ):      95.9 ms ±   0.4 ms    [User: 128.7 ms, System: 7.4 ms]
  Range (min … max):    95.3 ms …  96.7 ms    30 runs
```

Analysis

webp-animation shows a bit of multi-threading on the profile, with user time exceeding total wall-clock time, but even accounting for that, image-webp is 3x slower.

Breakdown of where the time is spent in image, recorded by samply: https://share.firefox.dev/4fc3utg

The greatest contributors seem to be image_webp::vp8::Vp8Decoder::decode_frame (48%), image_webp::extended::do_alpha_blending (20%), image_webp::vp8::Frame::fill_rgba (16%).

Within decode_frame the biggest contributor is image_webp::vp8::Vp8Decoder::read_coefficients (12% self time, 32% total time), and the code of that function looks like it could be optimized further to reduce bounds checks, etc. #71 is also relevant, but only accounts for 20% of the total time.

@Shnatsel
Contributor Author

I looked at the profile and the associated code a bit. The two low-hanging optimization opportunities are:

  1. Applying the YUV->RGB optimization from "Refactor fill_rgb to fill 2 pixels at a time" (#13) to the RGBA codepath as well
  2. The alpha blending currently uses an f64 to blend every u8. No wonder that's slow: it kills vectorization. We should make it operate on integers (u16 might be enough?) and use chunks_exact() so the compiler can vectorize it.
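The second item could look roughly like this. A minimal sketch, assuming a straight src-over-dst blend of a single channel; `blend_channel`, `blend_rows`, and the rounding helper are illustrative names, not image-webp's actual code:

```rust
/// Round-to-nearest division by 255 in plain integer arithmetic.
/// The maximum input is 255 * 255, and 255 * 255 + 127 fits in u16,
/// so no widening is needed.
fn div_by_255_rounding(v: u16) -> u8 {
    ((v + 127) / 255) as u8
}

/// Blend one channel of `src` over `dst` with the given alpha,
/// entirely in u16 arithmetic instead of f64.
fn blend_channel(src: u8, dst: u8, alpha: u8) -> u8 {
    let a = alpha as u16;
    div_by_255_rounding(src as u16 * a + dst as u16 * (255 - a))
}

/// Iterating over fixed-size chunks gives the compiler a chance to
/// auto-vectorize the inner loop.
fn blend_rows(src: &[u8], dst: &mut [u8], alpha: u8) {
    for (s, d) in src.chunks_exact(8).zip(dst.chunks_exact_mut(8)) {
        for i in 0..8 {
            d[i] = blend_channel(s[i], d[i], alpha);
        }
    }
}
```

The `(v + 127) / 255` form still contains a division, so this only sketches the integer data flow; the division-free tricks discussed later in the thread would replace it.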

@Shnatsel
Contributor Author

Paper on fast alpha blending without divisions: https://arxiv.org/pdf/2202.02864

@awxkee

awxkee commented Dec 15, 2024

Paper on fast alpha blending without divisions: https://arxiv.org/pdf/2202.02864

Here is a ready-made method if needed: https://github.com/awxkee/pic-scale-safe/blob/7f1066e006aac5eea4c1474f5bbe81cae837b2b7/src/alpha.rs#L43

Division by alpha, if it is required, is much less straightforward.

@Shnatsel
Contributor Author

I've attempted to optimize alpha blending by performing it in u16 instead of f64. I got the primitives working (rounding integer division both by 255 and by an arbitrary u8), but when I assemble the entire thing from them, it falls apart. Pretty early, too: calculating the resulting alpha is already wrong, and we don't even get to the interesting part of blending the RGB values.

This is what I've got so far: https://github.com/image-rs/image-webp/compare/main...Shnatsel:image-webp:faster-alpha-blending?expand=1

By the way, @awxkee, in your div_by_255 the .min(255) part is completely unnecessary. Instead of outputting 256 it already outputs 0, so you don't need any clamping. My branch includes an exhaustive correctness test for that function; you can try it for yourself. I kept the .min in my branch out of an abundance of caution, but I am convinced that it does nothing.
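For reference, the two primitives look roughly like this. This is my formulation of the general idea, not the exact code from the linked branch:

```rust
/// Round-to-nearest division by an arbitrary nonzero denominator:
/// add half the denominator before dividing.
fn rounding_div(n: u32, d: u32) -> u32 {
    (n + d / 2) / d
}

/// Division-free round(n / 255), exact for n <= 255 * 255, which covers
/// the whole domain alpha blending produces -- hence no .min(255) clamp.
fn div_by_255(n: u16) -> u8 {
    let n = n as u32 + 128;
    ((n + (n >> 8)) >> 8) as u8
}

fn main() {
    // An exhaustive correctness check in the spirit of the one mentioned
    // above: compare against the f64 reference over every u8 product.
    for a in 0u16..=255 {
        for b in 0u16..=255 {
            let n = a * b;
            let reference = (n as f64 / 255.0).round() as u8;
            assert_eq!(div_by_255(n), reference);
        }
    }
}
```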

@awxkee

awxkee commented Dec 18, 2024

min was accidentally added when I was trying to force auto-vectorization. You're correct that the division is exact, so clamping is unnecessary.

I think your rounding_div will be incredibly slow, since it performs two integer divisions at a time.
Maybe it's worth generating a reciprocal table, since the denominator range is small ([1, 255]), and performing Montgomery multiplication instead?
This is what LLVM does when dividing by a constant (https://godbolt.org/z/qnE5E8Eo3): it replaces the division with a multiplication of the form b = (a * R) >> k.
Additionally, both divisions here share the same denominator, so each step would require only one read from the reciprocal table.
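The reciprocal-table idea could be sketched like this. This is my own illustration of multiply-and-shift division by a table of precomputed reciprocals; the names and table layout are assumptions, not the code from the gist:

```rust
const SHIFT: u32 = 24;

/// ceil(2^24 / d). For d in 1..=255 the approximation error e = m*d - 2^24
/// is at most 254, and n * e < 2^24 for every n < 2^16, so
/// (n * m) >> 24 equals n / d exactly over the whole input range.
fn reciprocal(d: u64) -> u64 {
    ((1u64 << SHIFT) + d - 1) / d
}

/// One multiply and one shift replace an integer division.
fn div_by(n: u32, m: u64) -> u32 {
    ((n as u64 * m) >> SHIFT) as u32
}

fn main() {
    // Build the 255-entry table once; every subsequent division by the
    // same denominator reuses a single table read.
    let table: Vec<u64> = (1..=255u64).map(reciprocal).collect();
    for d in 1u32..=255 {
        let m = table[(d - 1) as usize];
        for n in [0u32, 1, 254, 255, 256, 65534, 65535] {
            assert_eq!(div_by(n, m), n / d);
        }
    }
}
```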

@awxkee

awxkee commented Dec 18, 2024

It is also possible to add a special branch for power-of-two denominators, for even faster division:

```rust
fn is_power_of_two(n: u32) -> bool {
    n != 0 && (n & (n - 1)) == 0
}

fn power_of_two(n: u32) -> u32 {
    n.trailing_zeros()
}
```

And then perform `value >> power_of_two(n)`.

@awxkee

awxkee commented Dec 18, 2024

Here is a table to try out: https://gist.github.com/awxkee/b8df92e4f97346fb56ae91dc4dcca779
You can benchmark whether it's worth it on your CPU; I'm actually not sure that it'll be better.

@Shnatsel
Contributor Author

Thank you! I'll benchmark that and dig deeper into the performance of these things once we actually have a working alpha blending routine. Right now I'm not even sure that u16 is enough for alpha blending to work, so I don't want to over-invest in optimizing something I might throw away later.

@Shnatsel
Contributor Author

Okay, I checked how libwebp does it, and they actually do it in u32 rather than u16:

https://github.com/webmproject/libwebp/blob/e4f7a9f0c7c9fbfae1568bc7fa5c94b989b50872/src/demux/anim_decode.c#L215-L267

We should probably just port that.
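In rough outline, the linked routine composites non-premultiplied RGBA in u32 arithmetic. The following is my restatement of that approach for illustration, with my own names, using plain exact division rather than libwebp's approximations; it is not the actual port:

```rust
/// Composite a non-premultiplied RGBA `src` pixel over `dst`.
fn blend_pixel(src: [u8; 4], dst: [u8; 4]) -> [u8; 4] {
    let src_a = src[3] as u32;
    if src_a == 255 {
        return src; // fully opaque source replaces the destination
    }
    let dst_a = dst[3] as u32;
    // Weight of the destination after the source is composited over it.
    let dst_factor = dst_a * (255 - src_a) / 255;
    let out_a = src_a + dst_factor;
    if out_a == 0 {
        return [0, 0, 0, 0];
    }
    let mut out = [0u8; 4];
    for i in 0..3 {
        // Per-channel weighted average; the trailing division by out_a is
        // the expensive step the reciprocal tricks in this thread target.
        out[i] = ((src[i] as u32 * src_a + dst[i] as u32 * dst_factor) / out_a) as u8;
    }
    out[3] = out_a as u8;
    out
}
```

Since out_a can be as small as 1, the final division cannot use a fixed constant, which is why the accuracy-versus-speed trade-offs discussed in this thread arise.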

@Shnatsel
Contributor Author

Shnatsel commented Dec 18, 2024

I've ported the libwebp algorithm. It is really inaccurate at low alpha levels, but nobody is going to notice that anyway. It gives an 8% end-to-end performance boost on this sample.

https://github.com/Shnatsel/image-webp/blob/e9a8a6710db5d4af193312ad4983681068bc5534/src/alpha_blending.rs#L5-L52

My u16 conversion of the f64 algorithm is more accurate than the libwebp blending routine, but it is also a lot slower, even slower than the f64 version. However, the loop it sits in is really not conducive to vectorization. If I can get the compiler to vectorize it and process 8 values at once with vector instructions, that might change.

Realistically though, the libwebp blending is still going to be faster, if we're willing to take the accuracy hit at very low alpha values.

@awxkee

awxkee commented Dec 18, 2024

You could perhaps replace their approximate division by 255 with an exact division? I think that is the main source of inaccuracy.
Division via a reciprocal table should also work nicely here, because you can perform 3 exact divisions with a single table read. I'm not sure about the speed, though; that really needs to be measured.

@Shnatsel
Contributor Author

I turned an assert! into a debug_assert!, and that must have unlocked some huge optimizations, because decoding is now 16% faster end-to-end; the alpha blending function itself must be ~5x faster.

https://github.com/Shnatsel/image-webp/blob/c0cfcb1d2afbb581e8a5b25614d6f9cecfb497d5/src/alpha_blending.rs#L6-L59

I also re-added support for big-endian targets. C doesn't have a [u8; 4] type, so libwebp has to do cursed endianness-dependent bit shifts instead.

@Shnatsel
Contributor Author

@awxkee I've replaced libwebp's division approximation with your div_by_255 and got improved precision without sacrificing performance!

Combined with the image_webp::vp8::Frame::fill_rgba optimization in #122, we're now 27% faster end-to-end on this image compared to where we started.

@awxkee

awxkee commented Dec 19, 2024

Actually, if you're using u32 instead of u16, there is a shorter route for division by 255: (a * 32897) >> 23.
The u16 method makes more sense for SIMD vectors with u16 lanes.

@Shnatsel
Contributor Author

That method results in a less precise approximation of the floating-point division; I'm seeing a greater divergence from the floating-point reference. I believe the trick with the other div_by_255 is that it also rounds to nearest instead of always rounding down.
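To make the floor-versus-round difference concrete, here is my sketch (not code from either branch): the u32 constant computes an exact floor, while biasing by 128 and using the add-and-shift form rounds to nearest, matching the f64 reference.

```rust
/// Exact floor(a / 255) for every a < 2^16: 32897 = ceil(2^23 / 255),
/// and the accumulated error stays below one over this input range.
fn div255_floor(a: u32) -> u32 {
    (a * 32897) >> 23
}

/// Round-to-nearest variant: bias by 128 first, then apply the
/// add-and-shift form; exact for a <= 255 * 255.
fn div255_round(a: u32) -> u32 {
    let a = a + 128;
    (a + (a >> 8)) >> 8
}

fn main() {
    for a in 0u32..65536 {
        assert_eq!(div255_floor(a), a / 255);
    }
    for a in 0u32..=65025 {
        assert_eq!(div255_round(a), ((a as f64) / 255.0).round() as u32);
    }
    // The two diverge whenever the true quotient's fraction is 0.5 or more:
    assert_eq!(div255_floor(128), 0);
    assert_eq!(div255_round(128), 1);
}
```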
