
Vectorized Paeth filtering (multiple pixels at the same time) #549

Open · IJzerbaard opened this issue Dec 15, 2024 · 6 comments

@IJzerbaard commented Dec 15, 2024

I've seen the other issue about vectorized Paeth filtering; this is not that.

It's possible to apply the Paeth filter to multiple pixels at once. Not to 4 sequential pixels, which would be the neatest case for SIMD, but to 4 anti-diagonal pixels. Those 4 anti-diagonal pixels can be collected into a vector relatively efficiently by loading 4 rows of pixels, staggered so that the first row has an x-offset of 3 pixels relative to the last row, and then transposing those 4 rows (similar to _MM_TRANSPOSE4_PS, but with integer vectors). That produces 4 such sets of 4 anti-diagonal pixels for the price of 8 shuffles, and after applying the filter another 8 shuffles are needed to un-transpose them (plus some additional shuffles to update the "top", a.k.a. B, vector between columns). All that shuffling is not free, but in my tests it was well worth doing.

Some diagrams:

[diagram: paeth_load_data]

[diagram: paeth_transposed_data]
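
To make the shuffle count concrete, here is a minimal sketch of the 4x4 integer transpose in Rust, using SSE2 intrinsics from std::arch (the function name and signature are mine, not taken from any existing decoder):

```rust
use std::arch::x86_64::*;

/// 4x4 transpose of 32-bit pixels, the integer analogue of
/// _MM_TRANSPOSE4_PS. On input, r0..r3 each hold 4 RGBA pixels from
/// 4 staggered rows; on output, each vector holds one anti-diagonal
/// group of 4 pixels. Exactly 8 shuffles, as described above.
#[target_feature(enable = "sse2")]
unsafe fn transpose_4x4_u32(
    r0: __m128i, r1: __m128i, r2: __m128i, r3: __m128i,
) -> (__m128i, __m128i, __m128i, __m128i) {
    let t0 = _mm_unpacklo_epi32(r0, r1); // r0[0] r1[0] r0[1] r1[1]
    let t1 = _mm_unpacklo_epi32(r2, r3); // r2[0] r3[0] r2[1] r3[1]
    let t2 = _mm_unpackhi_epi32(r0, r1); // r0[2] r1[2] r0[3] r1[3]
    let t3 = _mm_unpackhi_epi32(r2, r3); // r2[2] r3[2] r2[3] r3[3]
    (
        _mm_unpacklo_epi64(t0, t1), // column 0: r0[0] r1[0] r2[0] r3[0]
        _mm_unpackhi_epi64(t0, t1), // column 1
        _mm_unpacklo_epi64(t2, t3), // column 2
        _mm_unpackhi_epi64(t2, t3), // column 3
    )
}
```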

I've already worked out how to do this and integrate it into a PNG decoder, in the context of the (C++) stb_image decoder. Filtering 4 or 8 rows at the same time does not fit very naturally with the "possibly different filter for each row" nature of the PNG format, but it is still possible. The approach requires finding 4 or 8 adjacent rows that all use the Paeth filter type, but that appears to be common enough not to invalidate it, and I did get some real-world wins on real files. Not on every file, obviously.

Though I primarily described how to filter a 32bpp format, a 24bpp format could use the same technique by expanding (on demand, right before the transpose) every 12 bytes to 16 with pshufb.
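
For illustration, that 12-to-16-byte expansion could look something like the following (SSSE3; a sketch under my assumptions, not code from the benchmark):

```rust
use std::arch::x86_64::*;

/// Expand 4 packed RGB pixels (12 bytes) into 4 RGBX pixels
/// (16 bytes) so the 32-bit transpose can be reused for 24bpp
/// images. A mask byte of -1 (0xFF) has its high bit set, which
/// makes pshufb zero that lane, so each X byte comes out as 0.
#[target_feature(enable = "ssse3")]
unsafe fn expand_rgb_to_rgbx(packed: __m128i) -> __m128i {
    let mask = _mm_setr_epi8(
        0, 1, 2, -1,   // pixel 0
        3, 4, 5, -1,   // pixel 1
        6, 7, 8, -1,   // pixel 2
        9, 10, 11, -1, // pixel 3
    );
    _mm_shuffle_epi8(packed, mask)
}
```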

This approach requires unsafe code in order to use platform-specific intrinsics, without which I see no way to implement the transposes (autovectorization does not like to do that sort of thing) or the pshufb required for 24bpp images; that may not fit the goals of this library. Porting this approach to NEON may be possible; I'm not a NEON expert, and I'm not sure whether it would be worth doing, given the typical 2-cycle latency of NEON operations and the fairly latency-sensitive nature of this filter (especially the 4-row version; 8 rows has more ILP to hide that latency). Using Rust's portable SIMD abstraction may be possible (it's still marked experimental).
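
For reference, the per-lane predictor itself maps cleanly onto that experimental portable SIMD API; a minimal sketch (nightly portable_simd, lane width chosen by me so the intermediate sums cannot overflow):

```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

/// Branchless Paeth predictor on 8 lanes at once, following the
/// spec's tie-breaking order (a, then b, then c). Inputs are bytes
/// widened to i16 so that a + b - c cannot overflow.
fn paeth_predict(a: i16x8, b: i16x8, c: i16x8) -> i16x8 {
    let pa = (b - c).abs();         // |p - a| where p = a + b - c
    let pb = (a - c).abs();         // |p - b|
    let pc = (a + b - c - c).abs(); // |p - c|
    let use_a = pa.simd_le(pb) & pa.simd_le(pc);
    let use_b = pb.simd_le(pc);
    use_a.select(a, use_b.select(b, c))
}
```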

The Avg filter may be vectorized with a similar approach; even Up and Sub could be, but that would be overkill. Filters could also be combined, using a vector-select to pick the right filter on a per-row basis, but that adds a bunch of overhead which I expect is not worth it.
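
For what it's worth, that per-row select would look roughly like this (building on the paeth_predict sketch above; the mask argument is a stand-in for whatever per-row filter bookkeeping the decoder keeps):

```rust
/// Sketch of combining filters with a vector select: compute more
/// than one predictor and blend per lane. `row_is_avg` would be
/// derived from each row's filter byte; the redundant work in the
/// unselected lanes is exactly the overhead mentioned above.
fn select_predict(a: i16x8, b: i16x8, c: i16x8, row_is_avg: mask16x8) -> i16x8 {
    let paeth = paeth_predict(a, b, c);   // from the sketch above
    let avg = (a + b) >> i16x8::splat(1); // Avg predictor: floor((a + b) / 2)
    row_is_avg.select(avg, paeth)
}
```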

@Shnatsel (Contributor)
This crate outputs row by row in a streaming fashion, so we cannot easily integrate approaches that require multiple rows to be present at once.

However, zune-png in Rust and WUFFS (written in a custom language) do not have this restriction and output the entire file at once. It would be a lot easier to integrate this technique into those implementations.

@kornelski (Contributor) commented Dec 18, 2024

Having the next_row() API isn't a deal-breaker. The decoder already has scratch buffers for handling interlacing. It could use buffering for filtering too.

The question is whether multi-row filters are worth the extra cost.


I guess this could be done with no extra copies when using fdeflate. Deflate back-references only reach back the window size, at most 32 KB, so decompression could run 32 KB ahead of filtering, and filtering could be done in place.
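
A rough sketch of that pipelining idea, with stub functions standing in for fdeflate and the unfiltering code (none of this is the crate's actual API; it only illustrates the 32 KB invariant):

```rust
/// Deflate back-references reach at most 32 KB behind the output cursor.
const WINDOW: usize = 32 * 1024;

/// Stub standing in for fdeflate: appends up to `chunk` freshly
/// inflated bytes to `buf`, returning how many it produced.
fn inflate_some(buf: &mut Vec<u8>, chunk: usize, total: usize) -> usize {
    let take = chunk.min(total - buf.len());
    buf.resize(buf.len() + take, 0);
    take
}

/// Stub standing in for unfiltering one row in place.
fn unfilter_row(_row: &mut [u8]) {}

fn decode(total_len: usize, row_bytes: usize) {
    let mut buf: Vec<u8> = Vec::with_capacity(total_len);
    let mut filtered = 0; // bytes already unfiltered in place
    loop {
        let produced = inflate_some(&mut buf, 4096, total_len);
        // Rows more than WINDOW bytes behind the inflate frontier can
        // no longer be back-referenced, so rewriting them in place is
        // safe even though unfiltering changes their contents.
        while buf.len() - filtered > WINDOW + row_bytes {
            unfilter_row(&mut buf[filtered..filtered + row_bytes]);
            filtered += row_bytes;
        }
        if produced == 0 {
            break;
        }
    }
    // Drain the remaining rows once decompression has finished.
    while filtered + row_bytes <= buf.len() {
        unfilter_row(&mut buf[filtered..filtered + row_bytes]);
        filtered += row_bytes;
    }
}
```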

@Shnatsel (Contributor)

@IJzerbaard could you share the stb_image integration so we could plug it into https://github.com/fintelia/corpus-bench and see how this affects performance on real-world corpora?

@HeroicKatora (Member)

Filtering multiple rows was previously considered, but none of the ideas I recall had a lot of promise, so it wasn't focused on. Everything except the Chromium-motivated public next_row should be highly amenable to it, though. The situation is different for this Paeth design compared to other algorithmic speedups, since most of filtering is memory-bound by now.

If I read the above sketch correctly, the interleaving of different rows would also allow loading some data into SIMD registers once per 4 rows instead of once per row, which effectively reduces the required bandwidth. The Up filter would benefit in a similar fashion, and the corpus should reveal the real-world incidence of eligible rows. I think the largest cost is indeed the size of the unfiltering buffer; I'm not sure how much overall performance depends on it fitting into a small cache. There may be a performance cliff on some hardware with particularly wide images. The best way to find out whether that concern is relevant is, of course, benchmarking, as suggested.

@fintelia (Contributor)

Something along these lines was previously attempted in #377. It didn't yield a performance improvement at the time, but some of the scaffolding around accumulating multiple Paeth rows might be helpful if we want to try again.

@IJzerbaard (Author)

The code I had integrated into stb_image is not really fit for sharing, but I have a benchmark comparing just libpng's Paeth filter against an 8-rows-at-a-time filter: https://gitlab.com/-/snippets/4781267
