
Remove Reader::scratch_buffer field + resulting breaking API changes. #421

Closed
anforowicz wants to merge 26 commits into image-rs:next from anforowicz:scratch-buffer-removal

Conversation

anforowicz
Contributor

PTAL?

This mostly follows what we've discussed in #417.

If we merge this, then some coordination may be required for releasing a new version of the png crate and then updating the image crate:

  • I am happy to volunteer to update the image crate after a new version of the png crate is released. OTOH, I am not sure how to get notified when this happens?
  • I am not sure when exactly it would be desirable to release a new version of the png crate. It may be worth waiting until some additional changes get merged:
    • additional performance-related improvements (e.g. PRs that stem from the "copy avoidance series of commits")
    • additional performance-related breaking-changes (e.g. I think that Read => BufRead is on the table - I want to try measuring this after flushing out other changes from the "copy avoidance series of commits")

@anforowicz
Contributor Author

Hmmm... after having already submitted the PR, I now started to wonder:

  • Maybe the remaining API should just be called next_row (rather than next_interlaced_row)

  • Maybe I should simplify the description in CHANGES.md to something like:

      ```
      * Breaking API changes:
         - Removing the `Row` and `InterlacedRow` structs
         - Removing the `Reader::next_interlaced_row` method
         - Changing the signature of the `Reader::next_row` method
      ```
    

@fintelia
Contributor

fintelia commented Nov 2, 2023

I would like to eventually make this API change, but for now I wonder if it would be simpler to make next_frame call self.next_pass() and self.next_interlaced_row_impl() directly. The scratch buffer variable would still exist, but it would only be allocated/used if the user specifically requested rows one-by-one.
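A toy sketch of that idea (names and shapes are hypothetical simplifications, not the crate's actual internals): the scratch buffer stays unallocated unless the caller actually uses the row-by-row path, while whole-frame decoding writes straight into the caller's buffer.

```
// Hypothetical simplification of the suggestion above.
struct Reader {
    scratch_buffer: Vec<u8>, // stays empty until `next_row` is first called
    row_len: usize,
}

impl Reader {
    fn next_frame(&mut self, buf: &mut [u8]) {
        // Decode directly into the caller's buffer; the scratch
        // buffer is never touched on this path.
        for row in buf.chunks_exact_mut(self.row_len) {
            decode_one_row_into(row);
        }
    }

    fn next_row(&mut self) -> &[u8] {
        // Allocate lazily, only for callers that want rows one-by-one.
        if self.scratch_buffer.is_empty() {
            self.scratch_buffer = vec![0; self.row_len];
        }
        decode_one_row_into(&mut self.scratch_buffer);
        &self.scratch_buffer
    }
}

fn decode_one_row_into(_row: &mut [u8]) { /* stand-in for the actual decoding */ }
```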

@anforowicz
Contributor Author

One motivation for the changes in this PR is to improve the performance of image::codecs::png::PngReader, which calls into next_row. Changes to next_frame wouldn't help with that.

It seems that sometimes decoding row-by-row (or at least in smaller-than-frame chunks) may be desirable if pixels need some kind of post-processing (e.g. applying gamma, alpha premultiplication, etc.) and the user of the png crate wants to do such post-processing while the image data is in L1 cache. image::codecs::png::PngReader is compatible with that idea, but next_frame would add considerably larger L1 cache pressure. ("seems" because I don't have hard data and/or benchmarks to confirm this. But it does seem like a valid idea. I think.)
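For illustration, a minimal sketch of that row-by-row pattern against the current next_row API (the function name and look-up table are illustrative stand-ins for whatever per-pixel post-processing the caller needs):

```
use std::fs::File;
use std::io::BufReader;

// Decode row by row and post-process each row right away, while its
// bytes are likely still in L1 cache.
fn decode_with_lut(path: &str, lut: &[u8; 256]) -> Result<Vec<u8>, png::DecodingError> {
    let decoder = png::Decoder::new(BufReader::new(File::open(path)?));
    let mut reader = decoder.read_info()?;
    let mut out = Vec::with_capacity(reader.output_buffer_size());
    while let Some(row) = reader.next_row()? {
        // `row.data()` borrows the decoder's internal scratch buffer
        // (the buffer this PR proposes to remove).
        out.extend(row.data().iter().map(|&b| lut[b as usize]));
    }
    Ok(out)
}
```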

@anforowicz
Contributor Author

anforowicz commented Nov 2, 2023

Is the main concern the desire to avoid breaking changes?

One way to address this concern may be to instead add a new API (next_row2?) and mark the old APIs (structs and methods) with the #[deprecated] attribute.
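A minimal sketch of that non-breaking route (all names and bodies hypothetical):

```
pub struct Row<'a>(pub &'a [u8]);

pub struct Reader {
    scratch_buffer: Vec<u8>,
}

impl Reader {
    #[deprecated(note = "use `next_row2`, which fills a caller-provided buffer")]
    pub fn next_row(&mut self) -> Option<Row<'_>> {
        Some(Row(&self.scratch_buffer)) // stand-in body
    }

    pub fn next_row2(&mut self, _row: &mut [u8]) -> Option<()> {
        Some(()) // stand-in body
    }
}
```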

OTOH, IMO some breaking API changes may be required for performance one way or another. So maybe another way to proceed is to wait until those other breaking changes have PRs with measurements that show their impact for the noncompressed benches and then land all the breaking changes around the same time. (For example, I plan to send the Read => BufRead changes after first flushing out 2 less controversial PRs - so, in a week or two maybe, depending on how the other PRs are received :-))

@fintelia
Contributor

fintelia commented Nov 3, 2023

The goal is to bundle all the breaking changes we'd like to make into a single 0.18 release, rather than have a couple breaking releases in rapid succession. But that means that breaking changes will take longer to land than non-breaking changes. So if there are pieces that can be split out, it likely makes sense to merge those first.

These changes generally look good to me, though, so I'm fine with letting them linger until we're ready to do a breaking release.

@anforowicz
Contributor Author

The goal is to bundle all the breaking changes we'd like to make into a single 0.18 release, rather than have a couple breaking releases in rapid succession.
...
These changes generally look good to me though, so I'm fine with letting them linger until we're ready to do a breaking release

Ack. That sounds good and is totally reasonable.

Just as an FYI, let me point out that I currently think I may want to make the following 3 breaking changes (sketched as signatures after the list):

  • This PR (Reader::next_row taking &mut [u8] instead of returning &[u8])
  • Requiring that the reader passed to the png crate is already BufRead and not just Read (this is in the spirit of copy avoidance but wasn't explicitly covered in my recent measurements). Like this PR, that other PR will probably also require follow-up changes in the image crate.
  • Changing ZlibStream::decompress (and ZlibStream::finish_compressed_chunks) so it takes &mut [u8] instead of taking and appending to &mut Vec<u8>. This has been a part of the copy-avoidance-series-of-commits and has been covered by my recent measurements. This is technically a breaking API change, but this is a fairly low-level API that I think/hope shouldn't see that much usage (or at least should see less usage than the other changes above).
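Roughly, as before/after signatures (simplified and illustrative; error types and exact parameter lists elided, and item 1's "after" shape is the one floated later in this thread):

```
// 1. `Reader::next_row` fills a caller-provided buffer instead of
//    returning a row borrowed from the internal scratch buffer:
//    before: fn next_row(&mut self) -> Result<Option<Row<'_>>, DecodingError>
//    after:  fn next_row(&mut self, row: &mut [u8]) -> Result<Option<InterlaceInfo>, DecodingError>
//
// 2. The decoder requires buffered input:
//    before: impl<R: Read> Decoder<R> { /* ... */ }
//    after:  impl<R: BufRead> Decoder<R> { /* ... */ }
//
// 3. `ZlibStream::decompress` (and `ZlibStream::finish_compressed_chunks`)
//    write into a slice instead of appending to a `Vec`:
//    before: fn decompress(&mut self, data: &[u8], out: &mut Vec<u8>) -> /* ... */
//    after:  fn decompress(&mut self, data: &[u8], out: &mut [u8]) -> /* ... */
```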

I think that it's probably desirable to first flush out 3-4 other PRs with non-breaking changes (starting with the noncompressed benchmarks). Flushing those out will help to highlight/magnify the relative improvement from the breaking changes (at least from the last 2; the current PR isn't directly covered by the benchmarks in the png crate and I am hesitant to expand the scope/shape of the benchmarks just to justify the PR).

@anforowicz force-pushed the scratch-buffer-removal branch from f4c1a10 to 7150f4d on November 15, 2023 17:31
@anforowicz
Contributor Author

Status update:

fintelia and others added 19 commits January 28, 2024 17:44
This commit moves `expand_paletted_into_rgb8` and
`expand_paletted_into_rgba8` (and their unit tests) into a separate
`transform/palette.rs` module.  This prepares room for encapsulating
extra complexity in this module in follow-up commits, where we will
start to precompute and memoize some data when creating a `TransformFn`.

This commit just moves the code around - it should have no impact on
correctness or performance.
The `PLTE` chunk's size should be a multiple of 3 (since it contains RGB
entries - 3 bytes per entry).

Additionally, taking 10000 samples in the `bench_create_fn` benchmarks
is a bit excessive after memoization.
This commit changes the `TransformFn` type alias from `fn(...)` into
`Box<dyn Fn(...)>`.  This allows the `TransformFn` to store some
precomputed, memoized state that we plan to add in follow-up commits.

In theory this commit may have negative performance impact, but in the
grand scheme of things it disappears into the measurement noise.  In
particular, when there is no state, `Box` shouldn't allocate.
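Roughly, the shape of that change (argument list abbreviated; illustrative rather than the crate's exact alias):

```
// Before: a plain function pointer, which cannot carry state.
// type TransformFn = fn(input: &[u8], output: &mut [u8]);

// After: a boxed closure, which can capture state precomputed when the
// transform is created. Boxing a closure with no captures is a
// zero-sized value, so `Box::new` does not actually allocate for it.
type TransformFn = Box<dyn Fn(&[u8], &mut [u8]) + Send + Sync>;

fn create_transform_fn(rgba_palette: [[u8; 4]; 256]) -> TransformFn {
    // The closure memoizes the palette built once at creation time.
    Box::new(move |input, output| {
        for (&idx, px) in input.iter().zip(output.chunks_exact_mut(4)) {
            px.copy_from_slice(&rgba_palette[idx as usize]);
        }
    })
}
```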
Before this commit `expand_paletted_into_rgba8` would:

* Perform 2 lookups - `palette.get(i)` and `trns.get(i)`
* Check via `unwrap_or` if `i` was within the bounds of `palette`/`trns`

This commit introduces `create_rgba_palette` which combines `palette`
and `trns` into a fixed-size `[[u8;4]; 256]` look-up table (called
`rgba_palette` in the code).  After this commit
`expand_paletted_into_rgba8` only needs to perform a single look-up and
doesn't need to check the bounds.  This helps to improve the expansion
time by 60+%:

- expand_paletted(exec)/trns=yes/src_bits=4/src_size=5461:
  [-60.208% -60.057% -59.899%] (p = 0.00 < 0.05)
- expand_paletted(exec)/trns=yes/src_bits=8/src_size=5461:
  [-77.520% -77.407% -77.301%] (p = 0.00 < 0.05)

`expand_paletted_into_rgb8` performs only a single lookup before and
after this commit, but avoiding bounds checks still helps to improve the
expansion time by ~12%:

- expand_paletted(exec)/trns=no/src_bits=4/src_size=5461:
  [-12.357% -12.005% -11.664%] (p = 0.00 < 0.05)
- expand_paletted(exec)/trns=no/src_bits=8/src_size=5461:
  [-13.135% -12.584% -12.092%] (p = 0.00 < 0.05)

Understandably, this commit regresses the time of `create_transform_fn`.
Future commits will reduce this regression by a factor of 2-4:

- expand_paletted(ctor)/plte=256/trns=256:
  [+3757.2% +3763.8% +3770.5%] (p = 0.00 < 0.05)
- expand_paletted(ctor)/plte=224/trns=32:
  [+3807.3% +3816.2% +3824.6%] (p = 0.00 < 0.05)
- expand_paletted(ctor)/plte=16/trns=1:
  [+1672.0% +1675.0% +1678.1%] (p = 0.00 < 0.05)
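A minimal sketch of the look-up-table construction described above (illustrative, not the commit's exact code):

```
// Fold the `PLTE` (RGB) and `tRNS` (alpha) chunks into a single
// fixed-size RGBA table, so expansion does one lookup per pixel.
fn create_rgba_palette(palette: &[u8], trns: &[u8]) -> [[u8; 4]; 256] {
    let mut rgba_palette = [[0, 0, 0, 0xFF]; 256];
    for (entry, rgb) in rgba_palette.iter_mut().zip(palette.chunks_exact(3)) {
        entry[..3].copy_from_slice(rgb);
    }
    for (entry, &alpha) in rgba_palette.iter_mut().zip(trns.iter()) {
        entry[3] = alpha;
    }
    rgba_palette
}

fn expand_paletted_into_rgba8(input: &[u8], output: &mut [u8], rgba_palette: &[[u8; 4]; 256]) {
    // A `u8` index can never exceed 255, so indexing the 256-entry
    // table needs no bounds check.
    for (&idx, px) in input.iter().zip(output.chunks_exact_mut(4)) {
        px.copy_from_slice(&rgba_palette[idx as usize]);
    }
}
```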
Before this commit `expand_into_rgb8` would copy 3 bytes at a time into
the output.  After this commit it copies 4 bytes at a time (possibly
clobbering pixels that will be populated during the next iteration -
this is ok).  This improved the performance as follows:

expand_paletted(exec)/trns=no/src_bits=8/src_size=5461
time:   [-23.852% -23.593% -23.319%] (p = 0.00 < 0.05)
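A sketch of that over-write trick (illustrative):

```
// Expand palette indices to RGB, copying 4 bytes per pixel; the 4th
// byte spills into the next pixel's slot and is overwritten on the
// next iteration. The final pixel gets a plain 3-byte copy so we
// never write past the end of `output`.
fn expand_paletted_into_rgb8(input: &[u8], output: &mut [u8], rgba_palette: &[[u8; 4]; 256]) {
    assert_eq!(output.len(), input.len() * 3);
    if let Some((&last, rest)) = input.split_last() {
        for (i, &idx) in rest.iter().enumerate() {
            output[i * 3..i * 3 + 4].copy_from_slice(&rgba_palette[idx as usize]);
        }
        let n = rest.len() * 3;
        output[n..n + 3].copy_from_slice(&rgba_palette[last as usize][..3]);
    }
}
```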
This improves the performance as follows:

- expand_paletted(ctor)/plte=256/trns=256
  [-40.581% -40.396% -40.211%] (p = 0.00 < 0.05)
- expand_paletted(ctor)/plte=224/trns=32
  [-24.070% -23.840% -23.592%] (p = 0.00 < 0.05)

Small palettes are mostly unaffected:

- expand_paletted(ctor)/plte=16/trns=1
  [-0.2525% +0.0338% +0.3239%] (p = 0.81 > 0.05)
Remove remaining uses of miniz_oxide for decoding
waywardmonkeys and others added 7 commits June 29, 2024 11:39
…te-usages

Remove usages of `extern crate`
This commit is desirable because it avoids copying image data across
intermediate buffers.  This commit was motivated by the data gathered in
image-rs#416 (comment)
The commit results in the following performance gains seen in the
recently introduced `row-by-row/128x128-4k-idat` benchmark:

- time: [-18.401% -17.807% -17.192%] (p = 0.00 < 0.05)
- time: [-9.4276% -8.8789% -8.2860%] (p = 0.00 < 0.05)
- time: [-12.389% -11.780% -11.181%] (p = 0.00 < 0.05)

Fixes image-rs#417
@anforowicz force-pushed the scratch-buffer-removal branch from 7150f4d to 29e90cc on July 17, 2024 17:03
@anforowicz
Contributor Author

@fintelia, can you PTAL again?

Process notes

I see that this PR has been rebased on top of https://github.com/image-rs/image-png/tree/next, but this branch is 24 commits behind https://github.com/image-rs/image-png/tree/master, and this leaves GitHub quite confused. In particular, it is not quite correct that in this PR “anforowicz wants to merge 26 commits” (as GitHub says at the top) - I just want to merge the 2 commits here: master...anforowicz:image-png:scratch-buffer-removal: b7d0c06 and 29e90cc. So, maybe you want to rebase on top of the master branch again?

I am also not sure what the right process is for landing breaking API changes. I went ahead and bumped the version to 0.18.0 in CHANGES.md and Cargo.toml, but please shout if I shouldn’t be doing that just yet.

Performance Impact

Performance of next_row matters mostly in scenarios that require some post-processing (e.g. doing a gamma correction or an rgba8 => bgra8 conversion). Reasons for post-processing after decoding each row (rather than after decoding the whole image) include 1) incremental/progressive/partial decoding scenarios (this matters both for interlaced and non-interlaced images) and 2) the desire to do the post-processing while the row pixels are still in L1 cache. The performance gains are mostly limited to scenarios where post-processing can happen in place (otherwise, the memory copy would still happen as part of the post-processing). Assuming that rgba8 images can be post-processed in place, this covers around 54% (rgba8 images) to 89% (rgba8 + indexed images) of scenarios involving web images (percentages are based on the data here).

This PR adds a benchmark that covers next_row of a generated, non-compressed image (i.e. focusing on parsing and memory copies instead of decompression overhead). The PR improves the benchmark results by 8-17% (see the commit description for more details).

In theory the changes in this PR can also result in a small performance gain in some non-incremental decoding scenarios in next_frame - avoiding an extra copy of 7th-pass pixels of interlaced images (decoding these pixels directly into the target buffer instead of calling adam7::expand_pass). I haven’t implemented these changes, because interlaced images seem to only happen in 1-2% of web images (see the data here).

Breaking Changes

At this point this PR is the only breaking change that I plan to do (given that the BufReader-related changes don’t give clear performance gains as discussed elsewhere). I think this means that the earlier comment “to bundle all the breaking changes” doesn’t apply anymore.

I note that a while ago the image crate stopped calling the next_row function (see the commit here). This reduces the impact/scope of the proposed breaking changes (although image is just one crate).

If the breaking API changes are nevertheless concerning, then we can also keep the old APIs (maybe marking them as #[deprecated]) and introduce a separate new API. If we want to do that, then we’d have to decide what to call the new API:

  • The old APIs are pub fn next_row(&mut self) -> Result<Option<Row<'_>>, DecodingError> and pub fn next_interlaced_row ….
  • Maybe the new API can look like pub fn read_row(&mut self, row: &mut [u8]) -> Result<Option<InterlaceInfo>, DecodingError> (i.e. naming it read_row instead of next_row). I don’t quite like the asymmetry between next_frame and read_row, but this should work (see the usage sketch below).
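For concreteness, a sketch of the proposed read_row in use (read_row and its signature are the proposal quoted above; they do not exist in the released crate):

```
fn decode_rows<R: std::io::BufRead>(
    mut reader: png::Reader<R>,
) -> Result<(), png::DecodingError> {
    let mut row = vec![0u8; reader.output_line_size(reader.info().width)];
    while let Some(_interlace_info) = reader.read_row(&mut row)? {
        // `row` now holds one decoded row; post-process it in place
        // here, while it is still warm in cache.
    }
    Ok(())
}
```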

@anforowicz anforowicz mentioned this pull request Jul 17, 2024
@fintelia
Contributor

I'm supportive of this PR, but I'm too burned out right now to review everything and manage a major release. As far as the BufRead changes, the last image crate release did add a BufRead bound for decoders so at least that aspect no longer applies.

@anforowicz
Contributor Author

anforowicz commented Aug 16, 2024

I'm supportive of this PR, but I'm too burned out right now to review everything and manage a major release. As far as the BufRead changes, the last image crate release did add a BufRead bound for decoders so at least that aspect no longer applies.

Ack. My thinking is gradually evolving toward keeping Reader.next_row for now:

  • This avoids a breaking change. And even if we wanted to take the breaking changes, it probably would be desirable to initially mark next_row as #[deprecated] and only remove it later (maybe after another major release?).
  • Reader.next_row has reasonable performance if the caller unconditionally post-processes and copies (because in this case no memory copy is saved). And in such a scenario, having Reader.next_row is a nice convenience.

So, now I am leaning toward abandoning the current PR and introducing a separate, new API instead in #493.

WDYT?

@kornelski kornelski added this to the 0.18 milestone Dec 8, 2024
@fintelia fintelia deleted the branch image-rs:next December 20, 2024 21:11
@fintelia fintelia closed this Dec 20, 2024