Reduce copying and allocations #422

Merged 1 commit into image-rs:master on Nov 9, 2023

Conversation

@fintelia (Contributor) commented Nov 3, 2023

This eliminates Reader::prev and adds special handling of unfiltering for the first row.

It has minimal impact overall on the QOI corpus. However, when restricted to QOI corpus images under 64 KB, the average decoding speed goes from 244 MP/s to 436 MP/s on my machine. Edit: I think I messed up the benchmarks, but I still see a modest perf improvement.

cc: @anforowicz

This eliminates Reader::prev and adds special handling of unfiltering for the first row.
@anforowicz (Contributor) left a comment

Nice - thanks!

Some notes:

  • There is a choice here between 1) special-casing the handling of an empty previous row in fn unfilter (what this PR does) and 2) keeping some zeros at the beginning of data_stream. The latter would result in smaller code size and wouldn't require a branch at the beginning of unfilter. But the difference between 1 and 2 is probably small, and the current PR is an improvement, so there is probably no point in overthinking this aspect.

  • There is also a choice between 1) having a single data_stream buffer that gets compacted from time to time (what this PR does) and 2) having two buffers and avoiding compaction. The latter can be implemented by 2a) always keeping rowlen bytes in prev and curr and swapping them (implemented here) or 2b) having two longer buffers (implemented here). I think it may be worth comparing all 3 approaches (e.g. testing/benchmarking them using the noncompressed test inputs) to see how big the cost of compaction is; a minimal sketch of the swap approach follows this list.
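
For illustration, here is a minimal sketch of option 2a; the Rows struct and advance method are hypothetical names, not the crate's actual API:

    use std::mem;

    /// Hypothetical holder for the two fixed-size row buffers (option 2a).
    struct Rows {
        prev: Vec<u8>, // unfiltered previous row, `rowlen - 1` bytes
        curr: Vec<u8>, // row currently being unfiltered
    }

    impl Rows {
        /// Once `curr` has been unfiltered, promote it to the previous row.
        /// `mem::swap` exchanges the two allocations in O(1), so no bytes
        /// are copied and no compaction is ever needed.
        fn advance(&mut self) {
            mem::swap(&mut self.prev, &mut self.curr);
            // `curr` now owns the old `prev` allocation and will simply be
            // overwritten with the next row's data.
        }
    }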

@@ -756,17 +757,21 @@ impl<R: Read> Reader<R> {
}

// Get a reference to the current row and point scan_start to the next one.

nit: Probably need to remove the remaining reference to scan_start here

-            self.current.drain(..self.scan_start).for_each(drop);
-            self.scan_start = 0;
+            if self.prev_start > 0 {
+                self.data_stream.drain(..self.prev_start).for_each(drop);

The drain / for_each(drop) approach seems complex; I wonder if the compiler is able to optimize it as well as a more direct move, e.g. something like:

            self.data_stream.copy_within(self.prev_start.., 0);
            self.data_stream.truncate(self.data_stream.len() - self.prev_start);

I can't really read assembly well, but the drain approach seems to at least result in longer assembly: https://godbolt.org/z/z48q1xej5

OTOH, maybe this feedback doesn't belong in this PR and we should fix it in a separate PR? (e.g. drain-based compaction is also used in zlib.rs)
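
As a standalone illustration (not the crate's code), the two compaction strategies should leave identical buffer contents; this toy example uses a hypothetical prev_start index:

    fn main() {
        let prev_start = 3;

        // Approach 1: drain-based compaction, as in this PR.
        let mut a: Vec<u8> = (0..10).collect();
        a.drain(..prev_start).for_each(drop);

        // Approach 2: explicit move + truncate, as suggested above.
        let mut b: Vec<u8> = (0..10).collect();
        let len = b.len();
        b.copy_within(prev_start.., 0);
        b.truncate(len - prev_start);

        assert_eq!(a, b); // both keep exactly the bytes from index 3 on
    }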

@@ -444,8 +444,7 @@ impl<R: Read> Reader<R> {
return Err(DecodingError::LimitsExceeded);
}

-        self.prev.clear();
-        self.prev.resize(self.subframe.rowlen, 0);
+        self.prev_start = self.current_start;

optional, feel free to ignore: there is a bit of code duplication here (e.g. new line 710 looks the same), and I wonder if the code would be more readable if this single line were hoisted into a well-named separate function (maybe reset_prev_row)? A sketch of such a helper follows:
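
A minimal sketch of the suggested helper; RowState is an illustrative stand-in for the decoder state, and reset_prev_row is just the hypothetical name floated above:

    /// Illustrative stand-in for the decoder fields touched by the helper
    /// (field names match this PR).
    struct RowState {
        prev_start: usize,
        current_start: usize,
    }

    impl RowState {
        /// `prev_start == current_start` is the sentinel for "no previous
        /// row", so resetting simply collapses the previous-row region.
        fn reset_prev_row(&mut self) {
            self.prev_start = self.current_start;
        }
    }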

data_stream: Vec<u8>,
/// Index in `data_stream` where the previous row starts.
prev_start: usize,
/// Index in `data_stream` where the current row starts.
current_start: usize,

I wonder if it would be useful to point out that:

  • prev_start points at pixel data only: rowlen - 1 bytes, with no filter byte, but current_start points at a filter kind byte followed by pixel data (potentially incomplete: at most rowlen bytes in total).
  • prev_start == current_start indicates that there is no previous row.
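
For example, those invariants could be recorded as field docs plus a debug assertion; this is an illustrative sketch built around the fields above, not existing crate code:

    struct DataStream {
        data_stream: Vec<u8>,
        /// Start of the previous row's pixel data: `rowlen - 1` bytes,
        /// with no filter byte. Equal to `current_start` when there is
        /// no previous row.
        prev_start: usize,
        /// Start of the current row: one filter kind byte followed by up
        /// to `rowlen - 1` bytes of (possibly incomplete) pixel data.
        current_start: usize,
    }

    impl DataStream {
        /// Sanity-checks the layout described in the field docs.
        fn check_invariants(&self, rowlen: usize) {
            debug_assert!(self.prev_start <= self.current_start);
            // The previous row is either absent or a full row of pixels.
            debug_assert!(
                self.prev_start == self.current_start
                    || self.current_start - self.prev_start == rowlen - 1
            );
            debug_assert!(self.current_start <= self.data_stream.len());
        }
    }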

@@ -452,6 +461,82 @@ pub(crate) fn unfilter(
*curr = curr.wrapping_add(above);
}
}
Avg if previous.is_empty() => match tbpp {
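
For context, when previous is empty the Average filter's above term is zero for every byte, so unfiltering degenerates to adding half of the left neighbor. A minimal standalone sketch (a hypothetical free function, with bpp standing in for the byte count that tbpp expands to):

    /// Unfilter one Avg-filtered row when there is no previous row
    /// (every `above` byte is implicitly zero).
    fn unfilter_avg_first_row(bpp: usize, row: &mut [u8]) {
        // The first `bpp` bytes have no left neighbor and stay as-is.
        for i in bpp..row.len() {
            // PNG Average: Raw(x) = Avg(x) + floor((left + above) / 2).
            let left = row[i - bpp];
            row[i] = row[i].wrapping_add(left / 2);
        }
    }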

I wonder if the roundtrip_ascending_previous_line and roundtrip tests (below in the same source file) could be duplicated and/or tweaked to explicitly cover round-tripping when there is no previous row? Something like the sketch below:
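
A hedged sketch of what such a test might look like; the filter / unfilter helpers, FilterType, and BytesPerPixel are assumed from the snippets above, and the exact signatures may differ:

    #[test]
    fn roundtrip_no_previous_row() {
        // An empty `previous` slice is the sentinel for "no previous row".
        let previous: Vec<u8> = Vec::new();
        let original: Vec<u8> = (0..128).map(|i| (i * 7) as u8).collect();

        for &method in &[
            FilterType::NoFilter,
            FilterType::Sub,
            FilterType::Up,
            FilterType::Avg,
            FilterType::Paeth,
        ] {
            let mut current = original.clone();
            filter(method, BytesPerPixel::Four, &previous, &mut current);
            unfilter(method, BytesPerPixel::Four, &previous, &mut current);
            // Unfiltering must undo filtering even without a previous row.
            assert_eq!(current, original);
        }
    }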

@anforowicz (Contributor)

> it may be worth comparing all 3 approaches

Not sure if the data below helps or just stirs the pot unnecessarily :-)

  • prev-row-copy-avoidance--double-long-buffer (see here)

    • noncompressed-8x8.png: change in time:
      [-3.4699% -3.2960% -3.1177%] (p = 0.00 < 0.05)
    • noncompressed-128x128.png: change in time:
      [-1.7766% -1.3460% -0.9357%] (p = 0.00 < 0.05)
  • prev-row-copy-avoidance--fintelia (see here)

    • noncompressed-8x8.png: change in time:
      [-5.5093% -5.3152% -5.0900%] (p = 0.00 < 0.05)
    • noncompressed-128x128.png: change in time:
      [-5.2180% -4.8080% -4.3969%] (p = 0.00 < 0.05)
  • prev-row-copy-avoidance--swap-prev-and-curr (see here)

    • noncompressed-8x8.png: change in time:
      [+16.659% +16.859% +17.074%] (p = 0.00 < 0.05)
    • noncompressed-128x128.png: change in time:
      [-17.995% -17.647% -17.318%] (p = 0.00 < 0.05)

I am very surprised by the wild changes from prev-row-copy-avoidance--swap-prev-and-curr.

I still worry about the cost of compaction (i.e. the calls to drain in Reader::next_raw_interlaced_row and ZlibStream::transfer_finished_data), but maybe we should just merge this PR and move on.

@fintelia merged commit 310edea into image-rs:master on Nov 9, 2023. 19 checks passed.
@anforowicz (Contributor)

@fintelia, I just wanted to double-check that you've seen all of my comments above. (The performance comparison is probably safe to ignore, but I was wondering about my comments/suggestions about unit test coverage and about tweaking code comments that refer to the deleted/renamed scan_start field.)

@fintelia (Contributor, Author) commented Nov 9, 2023

Somehow I didn't see those comments or your changes on #420. I'm not exactly sure what happened, but I'll address them when I have a chance.
