-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorized Paeth filtering (multiple pixels at the same time) #549
Comments
This crate outputs row by row in a streaming fashion. So we cannot easily integrate approaches that require multiple rows to be present at once. However, zune-png in Rust and WUFFS (written in a custom language) do not have this restriction and output the entire file at once. It would be a lot easier to integrate this technique into those implementations. |
Having the The question is whether multi-row filters are worth the extra cost? I guess this could be done with no extra copies when using |
@IJzerbaard could you share the stb_image integration so we could plug it into https://github.com/fintelia/corpus-bench and see how this affects performance on real-world corpora? |
Filtering multiple rows was previously considered but none of the ideas I recall had a lot of promise, so it wasn't focussed on. The whole thing except for Chromium motivated public If I read the above sketch correctly, then the interlacing of different rows would also allow loading some data into SIMD registers once per 4 rows, instead of once per row. This effectively reduces the required bandwidth. The |
Something along these lines was previously attempted in #377. It wasn't able to get a performance improvement at the time, but some of the scaffolding around accumulating multiple paeth rows might be helpful if we want to try again. |
The thing that I had integrated into STB is not really fit for sharing, but I have a benchmark where I compare just libpng's Paeth filter vs an 8-rows-at-a-time filter: https://gitlab.com/-/snippets/4781267 |
I've seen the other issue about vectorized Paeth filtering, this is not that.
It's possible to apply the Paeth filter to multiple pixels at once. Not on 4 sequential pixels, which would be the neatest case for SIMD, but 4 anti-diagonal pixels. 4 anti-diagonal pixels can be collected together in a vector relatively efficiently by loading 4 rows of pixels, staggered so that the first row has an x-offset of 3 pixels compared to the last row, then transpose those 4 rows (similar to _MM_TRANSPOSE4_PS, but with integer vectors). That produces 4 of those sets of 4 anti-diagonal pixels for the price of 8 shuffles, and after applying the filter another 8 shuffles are needed to un-transpose them (some additional shuffles are needed to update the "top" aka B vector between columns). All that shuffling is not free but in my tests it was well worth doing.
Some diagrams:
I've already worked out how to do this and integrate it into a PNG decoder in the context of the (C++) stb_image PNG decoder. Though filtering 4 or 8 rows at the same time does not fit very naturally onto the "possibly different filter for each row" nature of the PNG format, it is still possible. There is a requirement to find 4 or 8 adjacent rows that all have the Paeth filter type, but that appears to be common enough to not invalidate the approach, and I did get some real-world wins on real files. Not on every file, obviously.
Though I described primarily how to filter a 32bpp format, a 24bpp format could use the same technique by expanding (on demand, right before the transpose) every 12 bytes to 16 with
pshufb
.This approach requires unsafe code in order to use platform-specific intrinsics, without which I see no way to implement the transposes (autovectorization does not like to do that sort of thing) nor the
pshufb
required for 24bpp images, which may not fit with the goals of this library. Porting this approach to NEON may be possible, I'm not a NEON expert and I'm not sure whether it would be worth doing (given the typical latency of 2 cycles for NEON operations, and the fairly latency-sensitive nature of this filter - especially the 4-row version, 8-rows has more ILP to hide that latency). Using Rusts portable SIMD abstraction may be possible (it's still marked experimental).The Avg filter may be vectorized with a similar approach, even Up and Sub could be but that's overkill. Filters could be combined, using vector-select to pick the right filter on a per-row basis, but that adds a bunch of overhead which I expect is not worth it.
The text was updated successfully, but these errors were encountered: