-
Notifications
You must be signed in to change notification settings - Fork 635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow using zune-png
for PNG decoding
#1852
Comments
We should try to understand how the performance compares between |
The trade-off is that |
It would be very surprising to me if streaming is the cause of difference in performance. deflate should require no more than 64kB buffer size to make the available data virtually indistinguishable. Any even larger buffers and I would expect the only necessary difference to be cache effects and the cost of storing temporary state (which is constant size? some buffered bits and a pointer to where in the stream we are, what code table is active) in between the data passes. I'm not aware of any form of cache, lookup table, or algorithmic improvement that works uniquely for infinite lookahead here. It's at least counterintuitive to identify streaming. |
The gist of the algorithm is that it assumes the output size is always sufficient to use large copies and other SIMD-like optimizations, without dropping down to the slow loop near the end of the buffer. It might be possible to create a streaming implementation of DEFLATE with similar performance characteristics, but AFAIK it hasn't been done so far. There's a detailed write-up of how WUFFS achieves high performance, a lot of which boils down to "spend more time in fast loops and not near the end of the buffer" which sounds very similar to what |
The relevant bit in the WUFFS write-up is Running off a cliff which neatly explains the difference, but the entire thing is worth a read.
|
I'd love to see the I'm confident that
|
One advantage of the streaming approach is that we don't have to do separate passes over the image for each step of decoding (file read, decompression, unfiltering, and de-interlacing). Only benchmarking can say which side of the tradeoff is better |
I've also been wanting to expand |
Could you point me to the fancy logic to decode multiple literals? Now I'm curious! |
https://github.com/image-rs/fdeflate/blob/main/src/decompress.rs#L364 Each lookup into the decoding tables produces two data bytes from data_table and an advance_bytes value that the number of input bits and output bytes to advance by. By using 12-bit indexes, it is very common that multiple literals will be output from a single table lookup. As an added bonus, since the input buffer is required to be zero-initialized we can support advancing by 3+ input literals as long as all after the first two are zero. |
image-rs/image-png#254 is also low-hanging fruit for |
The difference is in the following things.
2.Defiltering. The de-filtering code has SSE paths, no amount of coaxing the compiler will get it to generate good SSE code(I've tried).
I'll elaborate on point 3 down below |
PS: allow me to do this incrementally, I do have other things that need my attention. But here is one.
Your code Vs mine Mine is longer because I choose to an interesting formatting convention, but the gist is yours does bounds check per code, and tries to do both Luma, and RGB into one function, mine separates them. The resulting code difference can be seen at https://godbolt.org/z/e44M37xz8 Yours does it in place, while mine moves from old buffer to a new buffer, yours should be faster than mine. |
Curiously, this optimization was considered for |
While they probably won't make it match zune, there were some recent performance improvements in miniz_oxide so should help a bit with the performance of image-png once it is updated to use miniz_oxide 0.7.1. As for the compression side, I could misremember but I think miniz_oxide via flate2 still ends up using a double buffer (aka outputting to one buffer and then from that to a different buffer) which could probably be avoided. |
I've now got an experimental branch of |
Is it possible if I can add it to benchmark suite? |
Feel free to experiment with it. I made https://github.com/fintelia/image-png/tree/finflate-github with just the |
Yea there is some good improvement there, congratulations.
Very good indeed! |
Just to clarify from my side: The competition in decoder performance is quite awesome to see. Small and large steps and iteratively identifying the currently slowest steps and weak points respectively. For me this pretty clearly shows that being able to experiment is highly useful. And really no matter individual performance results here there should be a mechanism to add a respectively new alternative backend and choose between them for the sake of comparison. That would also for me imply: the optimal mechanism should not be a pure compile-time choice. Features allow some dependency minimization but they should not be exclusive as that just increases cycle time etc. Compare this to mechanism for SIMD-flavor. In all use cases we iterated clearly towards runtime choices instead of a cfg-choice, because Rust allows this and it just works. |
Update on this: many of the optimizations from *I'd caution against reading too much into a single number: both decoders have files they're comparatively faster at decoding and it is hard to gather a representative sample of how performance compares on average. But the linked benchmark does at least show that the perf gap has considerably narrowed. |
With the additional performance improvements done in image-rs/image-png#416, the It seems switching to |
congratulations on all those involved 🎉🎉 wondering if the optimizations are live on crates.io or a git branch with all of them, would love to run all of them. |
I believe all non-API-breaking changes are now in master for the There is one more PR outstanding that shows a performance improvement but requires an API change: image-rs/image-png#421 |
zune-png
is a very fast PNG implementation in Rust.It's safe code except for SIMD in filters, which is the same policy as
image-rs
crates use.It's in the final stages of testing, with just one known issue remaining on a very obscure type of image, and is being continuously fuzzed on CI. I gave it the full real-world testing treatment, which found a fair amount of bugs and I'm confident we'll squash the remaining ones soon.
Advantages:
Drawbacks:
png
crate will remain as a dependencyfdeflate
See also: #1845 for supporting
zune-jpeg
The text was updated successfully, but these errors were encountered: