Reduce copying and allocations #422

Merged 1 commit into image-rs:master on Nov 9, 2023

Conversation

@fintelia (Contributor) commented Nov 3, 2023

This eliminates Reader::prev and adds special handling of unfiltering for the first row.

It has minimal impact overall on the QOI corpus. However, when restricted to QOI corpus images under 64 KB, the average decoding speed goes from 244 MP/s to 436 MP/s on my machine. Edit: I think I messed up the benchmarks, but I still see a modest perf improvement.

cc: @anforowicz

This eliminates Reader::prev and adds special handling of unfiltering for the first row.
@anforowicz (Contributor) left a comment

Nice - thanks!

Some notes:

  • There is a choice here between 1) special-casing the handling of an empty previous row in fn unfilter (what this PR does) and 2) keeping some zeros at the beginning of data_stream. The latter would result in smaller code size and wouldn't require a branch at the beginning of unfilter. But the difference between 1 and 2 is probably small, and the current PR is an improvement, so there is probably no point in overthinking this aspect.

  • There is also a choice between 1) having a single data_stream buffer that gets compacted from time to time (what this PR does) and 2) having two buffers and avoiding compaction. The latter can be implemented by 2a) always keeping rowlen bytes in prev and curr and swapping them (implemented here) or 2b) having two longer buffers (implemented here). I think it may be worth comparing all 3 approaches (e.g. testing/benchmarking them using the noncompressed test inputs) to see how big the cost of compaction is; a minimal sketch of the swap approach follows this list.
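
For illustration, here is a minimal sketch of option 2a; the Rows struct and advance method are hypothetical names, not the crate's actual API:

    use std::mem;

    /// Hypothetical holder for the two fixed-size row buffers (option 2a).
    struct Rows {
        prev: Vec<u8>, // unfiltered previous row, `rowlen - 1` bytes
        curr: Vec<u8>, // row currently being unfiltered
    }

    impl Rows {
        /// Once `curr` has been unfiltered, promote it to the previous row.
        /// `mem::swap` exchanges the two allocations in O(1), so no bytes
        /// are copied and no compaction is ever needed.
        fn advance(&mut self) {
            mem::swap(&mut self.prev, &mut self.curr);
            // `curr` now owns the old `prev` allocation and will simply be
            // overwritten with the next row's data.
        }
    }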

@@ -756,17 +757,21 @@ impl<R: Read> Reader<R> {
}

// Get a reference to the current row and point scan_start to the next one.

nit: Probably need to remove the remaining reference to scan_start here

-            self.current.drain(..self.scan_start).for_each(drop);
-            self.scan_start = 0;
+            if self.prev_start > 0 {
+                self.data_stream.drain(..self.prev_start).for_each(drop);

The drain / for_each(drop) approach seems complex; I wonder if the compiler is able to optimize it as well as a more direct move, e.g. something like:

            self.data_stream.copy_within(self.prev_start.., 0);
            self.data_stream.truncate(self.data_stream.len() - self.prev_start);

I can't really read assembly well, but the drain approach seems to at least result in longer assembly: https://godbolt.org/z/z48q1xej5

OTOH, maybe this feedback doesn't belong in this PR and we should fix it in a separate PR? (e.g. drain-based compaction is also used in zlib.rs)
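
As a standalone illustration (not the crate's code), the two compaction strategies should leave identical buffer contents; this toy example uses a hypothetical prev_start index:

    fn main() {
        let prev_start = 3;

        // Approach 1: drain-based compaction, as in this PR.
        let mut a: Vec<u8> = (0..10).collect();
        a.drain(..prev_start).for_each(drop);

        // Approach 2: explicit move + truncate, as suggested above.
        let mut b: Vec<u8> = (0..10).collect();
        let len = b.len();
        b.copy_within(prev_start.., 0);
        b.truncate(len - prev_start);

        assert_eq!(a, b); // both keep exactly the bytes from index 3 on
    }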

@@ -444,8 +444,7 @@ impl<R: Read> Reader<R> {
return Err(DecodingError::LimitsExceeded);
}

-        self.prev.clear();
-        self.prev.resize(self.subframe.rowlen, 0);
+        self.prev_start = self.current_start;

optional, feel free to ignore: there is a bit of code duplication here (e.g. new line 710 looks the same), and I wonder if the code would be more readable if this single line were hoisted into a well-named separate function (maybe reset_prev_row)? A sketch of such a helper follows:
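
A minimal sketch of the suggested helper; RowState is an illustrative stand-in for the decoder state, and reset_prev_row is just the hypothetical name floated above:

    /// Illustrative stand-in for the decoder fields touched by the helper
    /// (field names match this PR).
    struct RowState {
        prev_start: usize,
        current_start: usize,
    }

    impl RowState {
        /// `prev_start == current_start` is the sentinel for "no previous
        /// row", so resetting simply collapses the previous-row region.
        fn reset_prev_row(&mut self) {
            self.prev_start = self.current_start;
        }
    }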

data_stream: Vec<u8>,
/// Index in `data_stream` where the previous row starts.
prev_start: usize,
/// Index in `data_stream` where the current row starts.
current_start: usize,

I wonder if it would be useful to point out that:

  • prev_start points at pixel data only: rowlen - 1 bytes, with no filter byte, but current_start points at a filter kind byte followed by pixel data (potentially incomplete: at most rowlen bytes in total).
  • prev_start == current_start indicates that there is no previous row.
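
For example, those invariants could be recorded as field docs plus a debug assertion; this is an illustrative sketch built around the fields above, not existing crate code:

    struct DataStream {
        data_stream: Vec<u8>,
        /// Start of the previous row's pixel data: `rowlen - 1` bytes,
        /// with no filter byte. Equal to `current_start` when there is
        /// no previous row.
        prev_start: usize,
        /// Start of the current row: one filter kind byte followed by up
        /// to `rowlen - 1` bytes of (possibly incomplete) pixel data.
        current_start: usize,
    }

    impl DataStream {
        /// Sanity-checks the layout described in the field docs.
        fn check_invariants(&self, rowlen: usize) {
            debug_assert!(self.prev_start <= self.current_start);
            // The previous row is either absent or a full row of pixels.
            debug_assert!(
                self.prev_start == self.current_start
                    || self.current_start - self.prev_start == rowlen - 1
            );
            debug_assert!(self.current_start <= self.data_stream.len());
        }
    }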

@@ -452,6 +461,82 @@ pub(crate) fn unfilter(
*curr = curr.wrapping_add(above);
}
}
Avg if previous.is_empty() => match tbpp {
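
For context, when previous is empty the Average filter's above term is zero for every byte, so unfiltering degenerates to adding half of the left neighbor. A minimal standalone sketch (a hypothetical free function, with bpp standing in for the byte count that tbpp expands to):

    /// Unfilter one Avg-filtered row when there is no previous row
    /// (every `above` byte is implicitly zero).
    fn unfilter_avg_first_row(bpp: usize, row: &mut [u8]) {
        // The first `bpp` bytes have no left neighbor and stay as-is.
        for i in bpp..row.len() {
            // PNG Average: Raw(x) = Avg(x) + floor((left + above) / 2).
            let left = row[i - bpp];
            row[i] = row[i].wrapping_add(left / 2);
        }
    }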

I wonder if the roundtrip_ascending_previous_line and roundtrip tests (below in the same source file) could be duplicated and/or tweaked to explicitly cover round-tripping when there is no previous row? Something like the sketch below:
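
A hedged sketch of what such a test might look like; the filter / unfilter helpers, FilterType, and BytesPerPixel are assumed from the snippets above, and the exact signatures may differ:

    #[test]
    fn roundtrip_no_previous_row() {
        // An empty `previous` slice is the sentinel for "no previous row".
        let previous: Vec<u8> = Vec::new();
        let original: Vec<u8> = (0..128).map(|i| (i * 7) as u8).collect();

        for &method in &[
            FilterType::NoFilter,
            FilterType::Sub,
            FilterType::Up,
            FilterType::Avg,
            FilterType::Paeth,
        ] {
            let mut current = original.clone();
            filter(method, BytesPerPixel::Four, &previous, &mut current);
            unfilter(method, BytesPerPixel::Four, &previous, &mut current);
            // Unfiltering must undo filtering even without a previous row.
            assert_eq!(current, original);
        }
    }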

@anforowicz (Contributor)

> it may be worth comparing all 3 approaches

Not sure if the data below helps or just stirs the pot unnecessarily :-)

  • prev-row-copy-avoidance--double-long-buffer (see here)

    • noncompressed-8x8.png: change in time:
      [-3.4699% -3.2960% -3.1177%] (p = 0.00 < 0.05)
    • noncompressed-128x128.png: change in time:
      [-1.7766% -1.3460% -0.9357%] (p = 0.00 < 0.05)
  • prev-row-copy-avoidance--fintelia (see here)

    • noncompressed-8x8.png: change in time:
      [-5.5093% -5.3152% -5.0900%] (p = 0.00 < 0.05)
    • noncompressed-128x128.png: change in time:
      [-5.2180% -4.8080% -4.3969%] (p = 0.00 < 0.05)
  • prev-row-copy-avoidance--swap-prev-and-curr (see here)

    • noncompressed-8x8.png: change in time:
      [+16.659% +16.859% +17.074%] (p = 0.00 < 0.05)
    • noncompressed-128x128.png: change in time:
      [-17.995% -17.647% -17.318%] (p = 0.00 < 0.05)

I am very surprised by the wild changes from prev-row-copy-avoidance--swap-prev-and-curr.

I still worry about the cost of compaction (i.e. the calls to drain in Reader::next_raw_interlaced_row and ZlibStream::transfer_finished_data), but maybe we should just merge this PR and move on.

@fintelia merged commit 310edea into image-rs:master on Nov 9, 2023. 19 checks passed.
@anforowicz (Contributor)

@fintelia, I just wanted to double-check that you've seen all of my comments above. (The performance comparison is probably safe to ignore, but I was wondering about my comments/suggestions about unit test coverage and about tweaking code comments that refer to the deleted/renamed scan_start field.)

@fintelia (Contributor, Author) commented Nov 9, 2023

Somehow I didn't see those comments or your changes on #420. I'm not exactly sure what happened, but I'll address them when I have a chance.
