
Vectorized Paeth filtering (multiple pixels at the same time) #549

Open · IJzerbaard opened this issue Dec 15, 2024 · 6 comments

@IJzerbaard commented Dec 15, 2024

I've seen the other issue about vectorized Paeth filtering; this is not that.

It's possible to apply the Paeth filter to multiple pixels at once. Not to 4 sequential pixels, which would be the neatest case for SIMD, but to 4 anti-diagonal pixels. Those 4 anti-diagonal pixels can be collected into a vector relatively efficiently by loading 4 rows of pixels, staggered so that the first row has an x-offset of 3 pixels relative to the last row, and then transposing those 4 rows (similar to _MM_TRANSPOSE4_PS, but with integer vectors). That produces 4 such sets of 4 anti-diagonal pixels for the price of 8 shuffles, and after applying the filter another 8 shuffles are needed to un-transpose them (plus some additional shuffles to update the "top", a.k.a. B, vector between columns). All that shuffling is not free, but in my tests it was well worth doing.

Some diagrams:

[diagram: paeth_load_data]

[diagram: paeth_transposed_data]
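
To make the shuffle count concrete, here is a minimal sketch of the 4x4 integer transpose in Rust, using SSE2 intrinsics from std::arch (the function name and signature are mine, not taken from any existing decoder):

```rust
use std::arch::x86_64::*;

/// 4x4 transpose of 32-bit pixels, the integer analogue of
/// _MM_TRANSPOSE4_PS. On input, r0..r3 each hold 4 RGBA pixels from
/// 4 staggered rows; on output, each vector holds one anti-diagonal
/// group of 4 pixels. Exactly 8 shuffles, as described above.
#[target_feature(enable = "sse2")]
unsafe fn transpose_4x4_u32(
    r0: __m128i, r1: __m128i, r2: __m128i, r3: __m128i,
) -> (__m128i, __m128i, __m128i, __m128i) {
    let t0 = _mm_unpacklo_epi32(r0, r1); // r0[0] r1[0] r0[1] r1[1]
    let t1 = _mm_unpacklo_epi32(r2, r3); // r2[0] r3[0] r2[1] r3[1]
    let t2 = _mm_unpackhi_epi32(r0, r1); // r0[2] r1[2] r0[3] r1[3]
    let t3 = _mm_unpackhi_epi32(r2, r3); // r2[2] r3[2] r2[3] r3[3]
    (
        _mm_unpacklo_epi64(t0, t1), // column 0: r0[0] r1[0] r2[0] r3[0]
        _mm_unpackhi_epi64(t0, t1), // column 1
        _mm_unpacklo_epi64(t2, t3), // column 2
        _mm_unpackhi_epi64(t2, t3), // column 3
    )
}
```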

I've already worked out how to do this and integrate it into a PNG decoder, in the context of the (C++) stb_image decoder. Filtering 4 or 8 rows at the same time does not fit very naturally with the "possibly different filter for each row" nature of the PNG format, but it is still possible. The approach requires finding 4 or 8 adjacent rows that all use the Paeth filter type, but that appears to be common enough not to invalidate it, and I did get some real-world wins on real files. Not on every file, obviously.

Though I primarily described how to filter a 32bpp format, a 24bpp format could use the same technique by expanding (on demand, right before the transpose) every 12 bytes to 16 with pshufb.
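
For illustration, that 12-to-16-byte expansion could look something like the following (SSSE3; a sketch under my assumptions, not code from the benchmark):

```rust
use std::arch::x86_64::*;

/// Expand 4 packed RGB pixels (12 bytes) into 4 RGBX pixels
/// (16 bytes) so the 32-bit transpose can be reused for 24bpp
/// images. A mask byte of -1 (0xFF) has its high bit set, which
/// makes pshufb zero that lane, so each X byte comes out as 0.
#[target_feature(enable = "ssse3")]
unsafe fn expand_rgb_to_rgbx(packed: __m128i) -> __m128i {
    let mask = _mm_setr_epi8(
        0, 1, 2, -1,   // pixel 0
        3, 4, 5, -1,   // pixel 1
        6, 7, 8, -1,   // pixel 2
        9, 10, 11, -1, // pixel 3
    );
    _mm_shuffle_epi8(packed, mask)
}
```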

This approach requires unsafe code in order to use platform-specific intrinsics, without which I see no way to implement the transposes (autovectorization does not like to do that sort of thing) or the pshufb required for 24bpp images; that may not fit the goals of this library. Porting this approach to NEON may be possible; I'm not a NEON expert, and I'm not sure whether it would be worth doing, given the typical 2-cycle latency of NEON operations and the fairly latency-sensitive nature of this filter (especially the 4-row version; 8 rows has more ILP to hide that latency). Using Rust's portable SIMD abstraction may be possible (it's still marked experimental).
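
For reference, the per-lane predictor itself maps cleanly onto that experimental portable SIMD API; a minimal sketch (nightly portable_simd, lane width chosen by me so the intermediate sums cannot overflow):

```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

/// Branchless Paeth predictor on 8 lanes at once, following the
/// spec's tie-breaking order (a, then b, then c). Inputs are bytes
/// widened to i16 so that a + b - c cannot overflow.
fn paeth_predict(a: i16x8, b: i16x8, c: i16x8) -> i16x8 {
    let pa = (b - c).abs();         // |p - a| where p = a + b - c
    let pb = (a - c).abs();         // |p - b|
    let pc = (a + b - c - c).abs(); // |p - c|
    let use_a = pa.simd_le(pb) & pa.simd_le(pc);
    let use_b = pb.simd_le(pc);
    use_a.select(a, use_b.select(b, c))
}
```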

The Avg filter may be vectorized with a similar approach; even Up and Sub could be, but that would be overkill. Filters could also be combined, using a vector-select to pick the right filter on a per-row basis, but that adds a bunch of overhead which I expect is not worth it.
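
For what it's worth, that per-row select would look roughly like this (building on the paeth_predict sketch above; the mask argument is a stand-in for whatever per-row filter bookkeeping the decoder keeps):

```rust
/// Sketch of combining filters with a vector select: compute more
/// than one predictor and blend per lane. `row_is_avg` would be
/// derived from each row's filter byte; the redundant work in the
/// unselected lanes is exactly the overhead mentioned above.
fn select_predict(a: i16x8, b: i16x8, c: i16x8, row_is_avg: mask16x8) -> i16x8 {
    let paeth = paeth_predict(a, b, c);   // from the sketch above
    let avg = (a + b) >> i16x8::splat(1); // Avg predictor: floor((a + b) / 2)
    row_is_avg.select(avg, paeth)
}
```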

@Shnatsel (Contributor)
This crate outputs row by row in a streaming fashion, so we cannot easily integrate approaches that require multiple rows to be present at once.

However, zune-png in Rust and WUFFS (written in a custom language) do not have this restriction and output the entire file at once. It would be a lot easier to integrate this technique into those implementations.

@kornelski (Contributor) commented Dec 18, 2024

Having the next_row() API isn't a deal-breaker. The decoder already has scratch buffers for handling interlacing. It could use buffering for filtering too.

The question is whether multi-row filters are worth the extra cost.


I guess this could be done with no extra copies when using fdeflate. Deflate back-references only reach back the window size, at most 32 KB, so decompression could run 32 KB ahead of filtering, and filtering could be done in place.
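
A rough sketch of that pipelining idea, with stub functions standing in for fdeflate and the unfiltering code (none of this is the crate's actual API; it only illustrates the 32 KB invariant):

```rust
/// Deflate back-references reach at most 32 KB behind the output cursor.
const WINDOW: usize = 32 * 1024;

/// Stub standing in for fdeflate: appends up to `chunk` freshly
/// inflated bytes to `buf`, returning how many it produced.
fn inflate_some(buf: &mut Vec<u8>, chunk: usize, total: usize) -> usize {
    let take = chunk.min(total - buf.len());
    buf.resize(buf.len() + take, 0);
    take
}

/// Stub standing in for unfiltering one row in place.
fn unfilter_row(_row: &mut [u8]) {}

fn decode(total_len: usize, row_bytes: usize) {
    let mut buf: Vec<u8> = Vec::with_capacity(total_len);
    let mut filtered = 0; // bytes already unfiltered in place
    loop {
        let produced = inflate_some(&mut buf, 4096, total_len);
        // Rows more than WINDOW bytes behind the inflate frontier can
        // no longer be back-referenced, so rewriting them in place is
        // safe even though unfiltering changes their contents.
        while buf.len() - filtered > WINDOW + row_bytes {
            unfilter_row(&mut buf[filtered..filtered + row_bytes]);
            filtered += row_bytes;
        }
        if produced == 0 {
            break;
        }
    }
    // Drain the remaining rows once decompression has finished.
    while filtered + row_bytes <= buf.len() {
        unfilter_row(&mut buf[filtered..filtered + row_bytes]);
        filtered += row_bytes;
    }
}
```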

@Shnatsel (Contributor)

@IJzerbaard could you share the stb_image integration so we could plug it into https://github.com/fintelia/corpus-bench and see how this affects performance on real-world corpora?

@HeroicKatora (Member)

Filtering multiple rows was previously considered, but none of the ideas I recall had a lot of promise, so it wasn't focused on. Everything except the Chromium-motivated public next_row should be highly amenable to it, though. The situation is different for this Paeth design compared to other algorithmic speedups, since most of filtering is memory-bound by now.

If I read the above sketch correctly, the interleaving of different rows would also allow loading some data into SIMD registers once per 4 rows instead of once per row, which effectively reduces the required bandwidth. The Up filter would benefit in a similar fashion, and the corpus should reveal the real-world incidence of eligible rows. I think the largest cost is indeed the size of the unfiltering buffer; I'm not sure how much overall performance depends on it fitting into a small cache. There may be a performance cliff on some hardware with particularly wide images. The best way to find out whether that concern is relevant is, of course, benchmarking, as suggested.

@fintelia (Contributor)

Something along these lines was previously attempted in #377. It didn't yield a performance improvement at the time, but some of the scaffolding around accumulating multiple Paeth rows might be helpful if we want to try again.

@IJzerbaard (Author)

The code I had integrated into stb_image is not really fit for sharing, but I have a benchmark comparing just libpng's Paeth filter against an 8-rows-at-a-time filter: https://gitlab.com/-/snippets/4781267
