Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance: Require BufRead instead of just Read for inputs. #427

Closed
wants to merge 3 commits into from

Conversation

anforowicz
Copy link
Contributor

PTAL?

@fintelia
Copy link
Contributor

fintelia commented Nov 27, 2023

This PR changes the decoder from pulling fixed size 32KB chunks of input data to having the caller push arbitrary sized chunks of data into the decoder. The default for BufReader is 8KB, so that'll likely be a common choice for users who don't have the full image data already loaded into memory. That makes me a bit nervous since it adds a performance parameter out of our control and means there will be edge cases around small buffers that are hard to handle

From testing locally on the QOI benchmark suite, I'm seeing a 1-2% performance regression with a (default sized) BufReader<&[u8]> over a plain &[u8]. I don't expect anyone would actually use that type, but BufReader<File> is probably pretty common and may approximate it.

I wonder if the performance gains you've seen on this were caused by avoiding the 32KB allocation+memset on very small images? If so, we could think about changes to control the size of the buffer based on the image size (likely by switching away from internally using a BufReader, and instead having a standalone buffer for reading IDAT chunks)

@anforowicz
Copy link
Contributor Author

This PR changes the decoder from pulling fixed size 32KB chunks of input data to having the caller push arbitrary sized chunks of data into the decoder. The default for BufReader is 8KB, so that'll likely be a common choice for users who don't have the full image data already loaded into memory. That makes me a bit nervous since it adds a performance parameter out of our control and means there will be edge cases around small buffers that are hard to handle

I wonder if adding a benchmark that uses BufReader would lower the risk a little bit? For example, we could add one additional benchmark testcase: noncompressed-128x128-with-default-bufreader.png. WDYT?

From testing locally on the QOI benchmark suite, I'm seeing a 1-2% performance regression with a (default sized) BufReader<&[u8]> over a plain &[u8]. I don't expect anyone would actually use that type, but BufReader<File> is probably pretty common and may approximate it.

I think your experiment supports merging this PR:

  • Your experiment shows that an extra BufReader<...> leads to a performance regression (you say that when you compare BufReader<&[u8]> vs &[u8] the version with BufReader is slower by 1-2%)
  • Your experiment does not approximate what would happen for BufReader<File> (since there is no BufReader<...> in &[u8]). For such approximation, I think that you can try comparing BufReader<u8> with 32kB of buffer (the old default) vs BufReader<u8> with 8kB of buffer (the new default from std).

I wonder if the performance gains you've seen on this were caused by avoiding the 32KB allocation+memset on very small images?

I think the data from the commit message suggests the opposite - the performance gains were much more pronounced for bigger images. I speculate that this is because the savings come from avoiding copying all of the image data, rather than coming from the initial, fixed-size data or memset. According to the commit message the savings looked as follows:

  • noncompressed-8x8.png:
    • [-2.2881% -1.4801% -0.4110%] (p = 0.00 < 0.05)
    • [-7.5687% -7.2013% -6.8838%] (p = 0.00 < 0.05)
  • noncompressed-128x128.png:
    • [-12.495% -12.132% -11.760%] (p = 0.00 < 0.05)
    • [-10.597% -10.230% -9.8399%] (p = 0.00 < 0.05)

@fintelia
Copy link
Contributor

Sorry, I should have been more clear. The performance regression I found only happens with this PR but not on the main branch. That was what concerned me.

Running a different experiment, I tried making this patch to the benchmarks:

--- a/benches/decoder.rs
+++ b/benches/decoder.rs
@@ -48,7 +48,7 @@ fn bench_file(c: &mut Criterion, data: Vec<u8>, name: String) {
     group.throughput(Throughput::Bytes(info.buffer_size() as u64));
     group.bench_with_input(name, &data, |b, data| {
         b.iter(|| {
-            let decoder = Decoder::new(data.as_slice());
+            let decoder = Decoder::new(std::io::BufReader::new(data.as_slice()));
             let mut decoder = decoder.read_info().unwrap();
             decoder.next_frame(&mut image).unwrap();
         })

I then ran each benchmark against the main branch and again against this PR:

decode/kodim17.png
                        time:   [+1.6160% +1.8703% +2.1665%] (p = 0.00 < 0.05)
                        thrpt:  [-2.1206% -1.8359% -1.5903%]

decode/kodim07.png
                        time:   [+1.3616% +1.6080% +1.9906%] (p = 0.00 < 0.05)
                        thrpt:  [-1.9517% -1.5826% -1.3433%]

decode/kodim02.png
                        time:   [+2.9415% +3.2626% +3.5050%] (p = 0.00 < 0.05)
                        thrpt:  [-3.3863% -3.1595% -2.8574%]

decode/kodim23.png
                        time:   [-0.3615% +0.5706% +1.3460%] (p = 0.21 > 0.05)
                        thrpt:  [-1.3281% -0.5674% +0.3629%]

decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png
                        time:   [+6.5478% +6.8271% +7.1205%] (p = 0.00 < 0.05)
                        thrpt:  [-6.6472% -6.3908% -6.1454%]

decode/Transparency.png time:
                        time:   [+1.7512% +2.0166% +2.3704%] (p = 0.00 < 0.05)
                        thrpt:  [-2.3156% -1.9768% -1.7211%]

decode/generated-png:noncompressed-8x8.png
                        time:   [-0.0576% +0.1127% +0.2824%] (p = 0.20 > 0.05)
                        thrpt:  [-0.2816% -0.1125% +0.0576%]

decode/generated-png:noncompressed-128x128.png
                        time:   [+21.104% +21.693% +22.380%] (p = 0.00 < 0.05)
                        thrpt:  [-18.287% -17.826% -17.427%]

I'll also note I'm somewhat biased here. This API change would have some ripple effects for parts of the main image crate's API, probably requiring that crate to either unconditionally wrap PNG input streams in a BufReader or require all users of all decoders to adhere to the BufRead trait bound. If possible, I'd rather find a way to avoid that without sacrificing performance

@anforowicz
Copy link
Contributor Author

@fintelia - could you please upload your benchmarking setup to a github repo + point out the hashes of the before/after commits? This would help me understand what is being compared in your experiment.

Is it maybe like this?:

  • Before: origin/master (i.e. f80dfe9) + benches/decoder.rs changes from your comment above (resulting in 2 BufReaders: first 8kB buffering and then 32kB buffering)
  • After: this PR (i.e. f24d4ff) + benches/decoder.rs changes from your comment above (only 1 BufReader with 8kB buffer)

This API change would have some ripple effects for parts of the main image crate's API, probably requiring that crate to either unconditionally wrap PNG input streams in a BufReader or require all users of all decoders to adhere to the BufRead trait bound.

Yes, this is true. I think I'd prefer to require all users of image::codecs::png::PngDecoder to adhere to the BufRead trait bound (but not necessarily "all decoders").

I am also biased - the speed of decoding of in-memory PNGs is most important to me, because this is the form of PNG input in Chromium's //ui/gfx and in Chromium's Blink (in fact, if you squint your eyes, then blink::SegmentReader::GetSomeData is a little bit like BufReader::fill_buf, although there is no equivalent of BufReader::consume and blink::PngImageReader has to track "consumption" by managing the position argument / read_offset_ field).


I think that understanding the performance impact of this PR is important, so let's continue the discussion here. At the same time, maybe the 2 PRs at #429 and #428 are less controversial and maybe this is where we can focus the review efforts for now?

Landing those other 2 PRs first may be desirable to:

  • Magnify the impact of the BufReader changes (and make it more easily measurable/detectable) by landing the other improvements first. This should help to magnify any negative as well as positive impact.
  • Bundle the breaking changes together, as you've suggested elsewhere

@anforowicz
Copy link
Contributor Author

BTW, one additional argument for not merging this PR yet is that so far Chromium has not landed any code that uses the png crate. Hopefully this will happen soon, but I can understand the desire to prioritize the experience of existing consumers of the png crate (over the hypothetical, future experience of aspirational consumers - like Chromium).

Still, I am curious about the benchmarking results above, and would like to debug and understand them better. (Although as I said before, this seems lower priority than discussing the other 2, less controversial PRs.)

@anforowicz anforowicz force-pushed the bufread-for-performance branch from f24d4ff to e2eb3f8 Compare December 17, 2023 17:36
@anforowicz anforowicz force-pushed the bufread-for-performance branch from e2eb3f8 to c67d90e Compare January 3, 2024 23:15
@anforowicz
Copy link
Contributor Author

For full transparency, let me share some recent observations. I have rebased this PR on top of the latest changes, but when rerunning the benchmarks I've realized that for some testcases (e.g. noncompressed, 2048x2048 image, split across 64kB IDAT chunks) I observe a significant regression. The binary size and the number of instructions goes down (6550545993 instructions => 6358944886 instructions - 3% improvement), but the stalled backend cycles increase from 5.37% backend cycles idle to 40.61% backend cycles idle (which translates into a regression of the overall runtime). I think that getting rid of the intermediate BufReader means that decompression can consume much bigger chunks of input at a time (whole IDAT - 64kB, instead of the size of the intermediate BufReader - 32kB) and I am guessing that this change is somehow negatively impacting hardware-driven memory prefetching. I don't yet understand what is happening exactly, what actions we can take, and what it means for this PR.

@fintelia
Copy link
Contributor

fintelia commented Jan 4, 2024

Any chance your CPU has a 64KB L1 cache? The current approach does three copies with size=32KB: input -> BufReader -> out_buffer -> data_stream, which should stay inside the L1 cache.

This PR changes it to two copies input -> out_buffer -> data_stream, but in the process makes the copy size depend on the underlying image's IDAT sizes. Which with 64KB IDATs means that the working set for each copy becomes 128KB

@anforowicz
Copy link
Contributor Author

Let me convert this PR to a "draft", so that it won't get accidentally merged before we understand the performance impact better. There are multiple factors at play, so I think that (instead of continuing the discussion here) I'll try to post something to #416.

This commit makes a breaking API change - it changes the `R: Read`
constraint (in `Decoder` and `Reader` structs) to the `R: BufRead`
constraint.  This helps performance by avoiding copying the input data
into an additional, intermediate `BufReader` that used to be stored in
the (internal) `ReadDecoder::reader` field (after these changes that
field is `R` rather than `BufReader`).  In particular, some input types
(e.g. when decoding from a `&[u8]`) already implement `BufRead` and for
such types it is wasteful to introduce additional buffering via
`BufReader`.

The impact of the change is significant, but relatively small - this
means that it mostly shows up in `noncompressed...` benchmarks which
magnify the cost of code paths that are not related to `fdeflate` nor
`unfilter`.  Impact on benchmark runtime looks as follows (run once, and
then rerun after compiling before+after with a fresh nightly `rustc`):

* kodim02.png:
    - No change in performance detected (p = 0.08 > 0.05)
    - [+1.3713% +1.7241% +2.0960%] (p = 0.00 < 0.05)
* kodim07.png:
    - [-1.1466% -0.6693% -0.2705%] (p = 0.00 < 0.05)
    - No change in performance detected. (p = 0.35 > 0.05)
* kodim17.png:
    - [-2.3062% -1.2878% +0.1746%] (p = 0.05 < 0.05)
    - [-2.7355% -1.9939% -0.7986%] (p = 0.00 < 0.05)
* kodim23.png:
    - No change in performance detected. (p = 0.51 > 0.05)
    - [-1.4834% -1.0648% -0.6692%] (p = 0.00 < 0.05)
* Lohengrin...png:
    - [-2.0606% -1.7935% -1.4756%] (p = 0.00 < 0.05)
    - [-4.2412% -3.6723% -3.0327%] (p = 0.00 < 0.05)
* Transparency.png:
    - [+1.4991% +1.8812% +2.3429%] (p = 0.00 < 0.05)
    - [-0.7939% -0.5746% -0.3590%] (p = 0.00 < 0.05)
* noncompressed-8x8.png:
    - [-2.2881% -1.4801% -0.4110%] (p = 0.00 < 0.05)
    - [-7.5687% -7.2013% -6.8838%] (p = 0.00 < 0.05)
* noncompressed-128x128.png:
    - [-12.495% -12.132% -11.760%] (p = 0.00 < 0.05)
    - [-10.597% -10.230% -9.8399%] (p = 0.00 < 0.05)
Similarily to
image-rs@1636b55
this commit tries to ensure that the working set fits into the L1 cache.
Before this commit, the whole `ZlibStream::out_buffer` could be filled
out and this buffer is potentially bigger than the typical 32kB of the
L1 cache.

After this commit, `MAX_INCREMENTAL_DECOMPRESSION_SIZE` limits how many
bytes can be written to `ZlibStream::out_buffer` in a single call to
`fdeflate::Decompressor::read`.
After this commit the size of the initial allocation of
`ZlibStream::out_buffer` should be big enough to avoid having to grow
the capacity of the buffer backing this vector of bytes.  Removing this
overhead is expected to have a positive impact on runtime performance.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants