
API Evolution #534

Open
fintelia opened this issue Nov 15, 2024 · 12 comments

Comments

@fintelia
Contributor

In response to another issue, I started writing up some thoughts on what I'd do if I were designing this crate's API from scratch, but realized it would probably be a better fit as a standalone discussion...

One difference would be the use of Seek to expose a more flexible API. The default behavior would be to composite frames, with an opt-out to get the raw frame contents. There would also probably be a clearer distinction between low-level and high-level decoding, though I'm not 100% sure how I'd handle the interaction between the two.

// METADATA

/// Create reader and populate IHDR
fn new<R: BufRead + Seek>(r: R) -> Self;

/// Set memory limit for decoding.
fn set_limits(&mut self, limits: Limits);

/// Read forward in the PNG until the first IDAT. Decode small metadata chunks, but
/// only store the byte offsets of large chunks like iCCP or eXIf.
fn read_metadata(&mut self) -> Result<Info, DecodingError>;

/// Seek to the iCCP chunk and return its contents.
fn icc_profile(&mut self) -> Result<Vec<u8>, DecodingError>;

// HIGH LEVEL DECODING

/// Read the static image / first frame of the animation.
fn read_image(&mut self, buf: &mut [u8]) -> Result<(), DecodingError>;

/// Read the next frame, starting from the first frame regardless of whether `read_image` has been called.
fn read_frame(&mut self, buf: &mut [u8]) -> Result<(), DecodingError>;

/// Same as now.
fn next_frame_info(&mut self) -> Result<&FrameControl, DecodingError>;

/// Seek back to the first frame of the animation.
fn reset_animation(&mut self) -> Result<(), DecodingError>;

/// Causes `read_frame` to return raw frames. Otherwise `read_frame` does compositing.
fn disable_compositing(&mut self);

// LOW LEVEL DECODING

/// Get info about the next interlaced row that will be read
fn next_row_info(&mut self) -> Result<RowInfo, DecodingError>;

/// Read the next row.
fn next_interlaced_row(&mut self, buf: &mut [u8]) -> Result<(), DecodingError>;
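
For illustration, a minimal usage sketch of the proposed API. The `Decoder` type name, `Info::buffer_size()`, and the error handling are assumptions; this is just how a caller might drive the signatures above:

// Hypothetical usage of the API sketched above; `Decoder`, `Info`, `Limits`,
// and `DecodingError` are assumed names, not the crate's current types.
fn decode_first_frame(data: &[u8]) -> Result<Vec<u8>, DecodingError> {
    let mut decoder = Decoder::new(std::io::Cursor::new(data));
    decoder.set_limits(Limits::default());

    let info = decoder.read_metadata()?;
    let _icc = decoder.icc_profile(); // seeks back to the iCCP chunk, if present

    let mut buf = vec![0; info.buffer_size()];
    decoder.read_image(&mut buf)?; // composited by default; see disable_compositing
    Ok(buf)
}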
@anforowicz
Contributor

One difference would be the use of Seek to expose a more flexible API.

This sounds good. &[u8] alone doesn't implement Seek, but one can always wrap it in Cursor<...>, so this is probably ok.

FWIW all inputs that I've worked with so far can implement Seek in one form or another (&[u8], SkStream, blink::SegmentReader). And all of them have been backed by in-memory buffers, so in theory it should be possible to wrap all of them in BufRead, although some currently don't expose such an interface (SkStream), so requiring BufRead may be a bit tricky.
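
A tiny illustration of the Cursor wrapping (std only, nothing crate-specific):

use std::io::{BufRead, Cursor, Seek, SeekFrom};

// &[u8] on its own doesn't implement Seek, but Cursor<&[u8]> is BufRead + Seek
// over the same in-memory bytes, so slice-backed inputs fit a BufRead + Seek bound.
fn example() -> std::io::Result<()> {
    let mut r = Cursor::new(&b"\x89PNG\r\n\x1a\n"[..]);
    r.seek(SeekFrom::Start(4))?;
    let rest = r.fill_buf()?; // remaining bytes, no copy
    assert_eq!(rest, b"\r\n\x1a\n");
    Ok(())
}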

The default behavior would be to composite frames, with an opt-out to get the raw frame contents.

This seems a bit tricky:

  • BlendOp may indeed be possible if client code can pass in a "previous" frame (scare quotes, because it is previous frames after applying dispose ops). OTOH, the client may want to postpone BlendOp until after client-level post-processing (e.g. I assume that blending may give slightly different results depending on whether it is applied before or after color-profile/gamma-correction; please shout if this assumption is wrong - I don't have much experience with images/codecs).
  • One interesting observation is that some of the proposed enhancements (e.g. blending) are somewhat codec-independent. It may be worth considering having a separate crate that codecs can reuse.
  • Doing DisposeOp within the png crate seems undesirable for clients that may want to do their own buffer management (e.g. to discard older animation frames when under memory pressure).

One idea that we may want to consider (I don't feel strongly about this) would be to offer 3 levels of API: StreamingDecoder (low-level, push), Reader (low/medium-level, pull via Read) and then a new level which does more higher-level things (blendop, disposeop, buffer management for animations, maybe color profile handling).

Just my 2 cents...

@fintelia
Contributor Author

And all of them have been backed by in-memory buffers, so in theory it should be possible to wrap all of them in BufRead, although some currently don't expose/require such interface (SkStream) so requiring BufRead may be a bit tricky.

Worst case, they could be wrapped in a BufReader, so it wouldn't be any worse than now. But it does seem a bit unfortunate that some interfaces wouldn't be able to directly expose their internal buffers.

The default behavior would be to composite frames, with an opt-out to get the raw frame contents.

This seems a bit tricky:
BlendOp may indeed be possible if client code can pass in a "previous" frame [...]

The idea is that the Reader would internally track the canvas and apply blend+dispose ops to it. That's how the image-webp crate currently works. For a sophisticated user like Chrome, this is all unnecessary. They would just call a disable method and deal with compositing and buffer management externally. However, my estimate is that most users are far less sophisticated. They just want the frame contents and any compositing or buffer management is just added hassle (or more likely they pull in the main image crate because they don't want to handle those aspects).
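
To make the intent concrete, here is a rough sketch of the canvas bookkeeping such a Reader might do internally. It assumes 8-bit RGBA and made-up parameter names; it is not the actual image-webp or png code:

// Composite one raw frame onto a full-size RGBA canvas kept inside the Reader.
fn blend_frame_onto_canvas(
    canvas: &mut [u8],   // canvas_width * canvas_height * 4 bytes, RGBA
    canvas_width: usize,
    frame: &[u8],        // raw frame region, width * height * 4 bytes
    (x_offset, y_offset): (usize, usize),
    (width, height): (usize, usize),
    blend_over: bool,    // true for BlendOp::Over, false for BlendOp::Source
) {
    for y in 0..height {
        for x in 0..width {
            let src = 4 * (y * width + x);
            let dst = 4 * ((y_offset + y) * canvas_width + x_offset + x);
            if !blend_over {
                // BlendOp::Source: replace the canvas pixel outright.
                canvas[dst..dst + 4].copy_from_slice(&frame[src..src + 4]);
            } else {
                // BlendOp::Over: standard non-premultiplied "over" compositing.
                let sa = frame[src + 3] as f32 / 255.0;
                let da = canvas[dst + 3] as f32 / 255.0;
                let oa = sa + da * (1.0 - sa);
                for c in 0..3 {
                    let s = frame[src + c] as f32;
                    let d = canvas[dst + c] as f32;
                    let o = if oa > 0.0 { (s * sa + d * da * (1.0 - sa)) / oa } else { 0.0 };
                    canvas[dst + c] = o.round() as u8;
                }
                canvas[dst + 3] = (oa * 255.0).round() as u8;
            }
        }
    }
    // DisposeOp handling (clearing the region to transparent black, or restoring
    // a saved copy of the canvas) would run before the *next* frame is drawn.
}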

One interesting observation is that some of the proposed enhancements (e.g. blending) are somewhat codec-independent. It may be worth considering having a separate crate that codecs can reuse.

This is something that I wasn't really aware of before hearing about how Chrome has a common abstraction. Each of the format specifications has its own description of how compositing works, and I never did a detailed comparison between them to realize they were all the same. It could totally make sense to spin this logic into its own crate.

@kornelski
Contributor

I'm not a fan of APIs that write to &mut [u8]:

  • If I'm reading the whole frame into memory, I usually just want pixels in Vec<Rgba>. Giving a slice is an extra step.

  • There's the efficient-ish vec![0; len], but it doesn't handle OOM. try_reserve() + resize(len, 0) is slower, because it can't get zero pages from the allocator. Allocation of bitmaps is one place where OOM can realistically happen, so &mut [u8] isn't zero copy, because I'm copying zeros first. Writing to &mut [MaybeUninit<u8>] unfortunately isn't safe, but it could be used internally to support both [u8] and Vec outputs. (See the sketch after this list.)

  • When the API requires space for the whole frame, I can't do stream processing. For example, I may want to convert color space, or decode the image at 1/nth resolution for a thumbnail. In these cases streaming line by line avoids allocating the whole image twice. However, when the user is allowed to provide a buffer for less than the whole image, then keeping the previous line for filtering without copying becomes tricky. The user would have to literally provide the previous line :/
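
As a concrete illustration of that second bullet (plain std code, not png crate API), the fallible path has to write the zeros itself:

use std::collections::TryReserveError;

// vec![0u8; len] can go through alloc_zeroed and get zero pages from the OS,
// but it aborts the process on allocation failure. The fallible alternative
// handles OOM, yet resize() has to write the zeros by hand, so handing the
// decoder a pre-zeroed &mut [u8] is already not "zero copy".
fn alloc_pixel_buffer(len: usize) -> Result<Vec<u8>, TryReserveError> {
    let mut buf = Vec::new();
    buf.try_reserve_exact(len)?;
    buf.resize(len, 0);
    Ok(buf)
}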

@kornelski
Contributor

kornelski commented Nov 17, 2024

What about unknown chunks? I imagine there would be some visitor to collect them, along with their position. Needed to transcode arbitrary PNGs.

In next_frame_info you can give an owned FrameControl. The library will want to reset the frame to default anyway, so you can avoid lending, and mem::take it.
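
A minimal sketch of that mem::take idea, using stand-in types; the field and method names here are hypothetical, not the crate's current ones:

#[derive(Default)]
struct FrameControl { /* fcTL fields */ }

struct Reader { frame_control: FrameControl }

impl Reader {
    fn next_frame_info(&mut self) -> FrameControl {
        // The reader resets its copy for the next frame anyway, so returning
        // ownership via mem::take avoids lending out a reference.
        std::mem::take(&mut self.frame_control)
    }
}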

disable_compositing makes the decoder stateful. I think it's safe to assume that users will know what they want from the very beginning. There could be specific constructors, like new_still_image and new_animation (lots of uses don't support animations).

client may want to postpone BlendOp until after client-level post-processing

The client is not allowed to do this. Blending happens inside the decoder, on raw pixel intensities: https://www.w3.org/TR/png-3/#13Alpha-channel-processing

@kornelski
Contributor

Currently the StreamingDecoder needs to be able to parse one byte at a time, which makes parsing basic things like u32s or IHDR much more complicated than if it could simply call read_exact().

All of this complication is because StreamingDecoder::update() exists, and may get a buffer of any size. How about dropping the update() method?

It would be replaced with something else that lets it get/read a certain number of bytes when it needs it. It could be &dyn BufRead, since that still supports io::Read and can be used to read_exact() 4 bytes or the whole IHDR.

Alternatively, the current_chunk.raw_bytes buffering mechanism could be generalized to collect a required size if the buf input isn't large enough for the current state.
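
For illustration, a sketch of what chunk-header parsing could look like against a &mut dyn BufRead, instead of the byte-at-a-time state machine. CRC checking and the crate's real error types are elided; this is not the current code:

use std::io::{self, BufRead, Read};

// Read the 8-byte chunk header (length + type) with read_exact.
fn read_chunk_header(r: &mut dyn BufRead) -> io::Result<(u32, [u8; 4])> {
    let mut len = [0u8; 4];
    r.read_exact(&mut len)?; // chunk length, big-endian
    let mut chunk_type = [0u8; 4];
    r.read_exact(&mut chunk_type)?; // e.g. b"IHDR"
    Ok((u32::from_be_bytes(len), chunk_type))
}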

@kornelski
Contributor

Is StreamingDecoder::reset() valuable? It can recycle one or two 32KB buffers.

OTOH it's probably already buggy, since it doesn't reset all the fields used in the constructor, and that may be a recurring theme with it.

@fintelia
Contributor Author

fintelia commented Dec 22, 2024

I have been wondering whether we should make StreamingDecoder private in the 0.18 release so we can try out alternatives to the byte-at-a-time decoding strategy. The hard part is figuring out whether anyone is using it / how disruptive it would be.

Edit: I've got a branch where I've been exploring using BufRead + read_exact for decoding. This is the main decoding logic.

@kornelski
Contributor

Yeah, something like this.

I'm concerned that R: BufRead could easily duplicate parsing functions by monomorphisation, so maybe use &dyn BufRead or a custom dyn trait/callback?

I'd go further: instead of loading u32s, I'd try to load whole chunks. It should be possible to use the bufreader.fill_buf() slice directly when it's large enough for entire chunks, falling back to copying only if the buf is too small. When it's backed by a Vec<u8>, the whole file would be available without buffering, and even the default 8K buffer could be large enough for most chunks.

@kornelski
Contributor

kornelski commented Dec 22, 2024

Oh, I can do that even without changing the API!

Part of the StreamingDecoder state could be how many contiguous bytes it needs before proceeding to the next state, and StreamingDecoder::update(buf) could have an early check that if buf.len() is large enough, it goes through to the state machine with borrowed data, and if it isn't large enough, then it copies to a scratch buffer, and borrows the scratch buffer in the state machine.
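
Roughly like this (a sketch with made-up field names, not the existing StreamingDecoder):

struct StreamingDecoder2 {
    needed: usize,    // contiguous bytes the next state needs
    scratch: Vec<u8>, // only used when the caller's buf is too small
}

impl StreamingDecoder2 {
    // Returns the number of bytes consumed from `buf`.
    fn update(&mut self, buf: &[u8]) -> usize {
        if self.scratch.is_empty() && buf.len() >= self.needed {
            // Fast path: run the state machine directly on borrowed data.
            let n = self.needed;
            self.run_state_machine(&buf[..n]);
            n
        } else {
            // Slow path: accumulate into the scratch buffer until it holds
            // exactly `needed` bytes, then run on the scratch buffer.
            let take = (self.needed - self.scratch.len()).min(buf.len());
            self.scratch.extend_from_slice(&buf[..take]);
            if self.scratch.len() == self.needed {
                let data = std::mem::take(&mut self.scratch);
                self.run_state_machine(&data);
            }
            take
        }
    }

    fn run_state_machine(&mut self, _data: &[u8]) {
        // parse the bytes and set `self.needed` for the next state
    }
}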

@anforowicz
Contributor

StreamingDecoder::update(buf) could have an early check that if buf.len() is large enough, it goes through to the state machine with borrowed data, and if it isn't large enough, then it copies to a scratch buffer, and borrows the scratch buffer in the state machine.

Interesting idea. The main challenge here is that as long as the scratch buffer is not empty, it needs to be consumed first. So once we start using the scratch buffer, it's tricky to stop. ZlibStream used to have an in_buffer that was similar to the idea here, and removing it led to some code simplifications + some small performance gains.

It should be possible to use bufreader.fill_buf() slice directly when it's large enough, for entire chunks, falling back to copying only if the buf is too small.

You probably already realize this, but let me point it out explicitly - we may need to consume small input chunks even if using a BufReader with a big buffer. This is because it is always possible that the first byte of a u32 (e.g. chunk type) will start at the very end of the BufReader's buffer - this is discussed at https://graphallthethings.com/posts/better-buf-read. OTOH this problem can be worked around by ignoring the standard BufRead / BufReader and instead working with BetterBufReader - e.g. we can guarantee that StreamingDecoder gets at least (say) 4 bytes at a time by replacing ReadDecoder::reader with BetterBufReader.

@kornelski
Contributor

The main challenge here is that as long as the scratch buffer is not empty, it needs to be consumed first

I'm not planning to support such buffering model. What I'm proposing is an interface where asking for n contiguous bytes is a commitment to immediately consume exactly n bytes, no more, no less. Un-reading is not supported. Peeking is not supported. It's like having only read_exact(), but one that returns a slice instead of copying into one. This way you can never end up in a situation where you have 1 byte left in the scratch buffer and 1000 bytes in the input.

Such buffering interface is not usable for inflate, but it should be fine for the PNG structure where all sizes needed can be known ahead of time. The scratch buffer can be filled up exactly to the level it needs to be, used up entirely every time, and leave nothing behind.
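
A sketch of that interface with made-up names. From the caller's point of view, asking for n bytes consumes exactly n bytes (no peeking, no un-reading); the inner-buffer consume is just deferred to the next call to keep the borrow checker happy, which is one possible way to do it, not necessarily how it would actually be implemented:

use std::io::{self, BufRead, Read};

struct ExactReader<R> {
    inner: R,
    scratch: Vec<u8>,
    pending: usize, // bytes lent straight out of the inner buffer last call
}

impl<R: BufRead> ExactReader<R> {
    fn take_exact(&mut self, n: usize) -> io::Result<&[u8]> {
        // Finish consuming whatever the previous call handed out.
        self.inner.consume(self.pending);
        self.pending = 0;

        if self.inner.fill_buf()?.len() >= n {
            // Fast path: the inner buffer already holds n contiguous bytes,
            // so return a borrowed slice without copying.
            self.pending = n;
            let buf = self.inner.fill_buf()?;
            return Ok(&buf[..n]);
        }
        // Slow path: the bytes straddle the inner buffer boundary; copy them
        // into the scratch buffer, which is used up entirely every time.
        self.scratch.resize(n, 0);
        self.inner.read_exact(&mut self.scratch)?;
        Ok(self.scratch.as_slice())
    }
}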

@anforowicz
Contributor

The main challenge here is that as long as the scratch buffer is not empty, it needs to be consumed first

I'm not planning to support such buffering model. What I'm proposing is an interface where asking for n contiguous bytes is a commitment to immediately consume exactly n bytes, no more, no less. Un-reading is not supported. Peeking is not supported. It's like having only read_exact(), but one that returns a slice instead of copying into one. This way you can never end up in a situation where you have 1 byte left in the scratch buffer and 1000 bytes in the input.

Such buffering interface is not usable for inflate, but it should be fine for the PNG structure where all sizes needed can be known ahead of time. The scratch buffer can be filled up exactly to the level it needs to be, used up entirely every time, and leave nothing behind.

Thank you for explaining. This does address my concern (especially the "used up entirely every time" part).

OTOH this problem can be worked around by ignoring the standard BufRead / BufReader and instead working with BetterBufReader - e.g. we can guarantee that StreamingDecoder gets at least (say) 4 bytes at a time by replacing ReadDecoder::reader with BetterBufReader.

FWIW I tried to implement my idea - see the commits here. The main idea has 2 parts (implemented in the 1st commit):

  1. When StreamingDecoder returns Ok((0, Decoded::Nothing)), it means "need a bigger input buf to make progress".
  2. Replace ReadDecoder::reader with a custom buffer that understands the meaning of Ok((0, Decoded::Nothing)).

This way we can remove all other buffers from StreamingDecoder (e.g. U32::bytes and U32::accumulated_count, as well as ChunkState::raw_bytes, removed in the 2nd commit). (I assume that the idea in the previous comment wouldn't remove those buffers, but maybe consolidating them into a single, unified buffer is still desirable.)

And as a follow-up, in the 3rd commit I changed how the first bytes (including the PNG signature) are handled - instead of two U32s, I just ask for 16 bytes.
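
For illustration, a rough sketch of the driver side of that protocol. `StreamingDecoder`, `Decoded`, and `DecodingError` are the crate's names, but the grow-on-(0, Decoded::Nothing) buffer management here is an assumption for illustration, not the code from the linked commits:

use std::io::Read;

fn pump(
    decoder: &mut StreamingDecoder,
    reader: &mut impl Read,
    image_data: &mut Vec<u8>,
) -> Result<(), DecodingError> {
    let mut buf = vec![0u8; 8 * 1024];
    let mut filled = 0;
    loop {
        let (consumed, decoded) = decoder.update(&buf[..filled], image_data)?;
        if matches!(decoded, Decoded::ImageEnd) {
            return Ok(());
        }
        if consumed == 0 && matches!(decoded, Decoded::Nothing) {
            // "Need a bigger input buf to make progress": grow the buffer if
            // it's already full, then top it up from the reader.
            if filled == buf.len() {
                buf.resize(buf.len() * 2, 0);
            }
            let n = reader.read(&mut buf[filled..])?;
            if n == 0 {
                // Truncated input; real code would return a proper error here.
                return Ok(());
            }
            filled += n;
        } else {
            // Drop the consumed prefix, keep the remainder for the next call.
            buf.copy_within(consumed..filled, 0);
            filled -= consumed;
        }
    }
}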

I like how stream.rs got simplified. OTOH, I got performance regressions in the benchmarks and I don't know how to fix them... :-( Part of the performance regression is that my BufReader2 doesn't use MaybeUninit (which would require unsafe), which (I think) means that it spends extra time initializing the buffer. But this doesn't explain away the whole performance regression.
