feat: add offset codec
#36
base: main
|
Maybe call this |
|
This could support both prefix and suffix. |
|
another name idea, if prefix and suffix are added: |
|
I may need to add another parameter to control how existing chunks are handled. Should the codec open the chunk, seek to the offset, and keep the existing header, or should it always write zeros or the defined prefix? My current thought is that when no prefix is defined, zeros should be prepended if the chunk does not exist, but the header of an existing chunk should be preserved. If a prefix is specified, then that prefix should always be written as the header, even if the chunk exists. Some implementations may find the write behavior problematic if the underlying storage does not allow seeking before writing. In this case, a read may be necessary before writing.
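To make the proposed write rule concrete, here is a minimal Python sketch. The function name `write_chunk_with_header` and its arguments are hypothetical and not part of any Zarr implementation. It also assumes that an existing chunk shorter than the offset is treated like a missing chunk, which the discussion above does not settle.

```python
from typing import Optional


def write_chunk_with_header(
    existing: Optional[bytes],
    payload: bytes,
    offset: int,
    prefix: Optional[bytes] = None,
) -> bytes:
    """Return the full stored bytes for a chunk (hypothetical helper).

    existing: current stored bytes of the chunk, or None if it does not exist.
    payload:  encoded bytes that belong after the header.
    offset:   header length in bytes.
    prefix:   optional static header; must be exactly `offset` bytes long.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative in this sketch")
    if prefix is not None:
        # A configured prefix always wins, even if the chunk already exists.
        if len(prefix) != offset:
            raise ValueError("prefix length must equal offset")
        header = prefix
    elif existing is not None and len(existing) >= offset:
        # No prefix configured: preserve the header of the existing chunk.
        # This is the read-before-write step some stores would need.
        header = existing[:offset]
    else:
        # No prefix and no usable existing header: prepend zeros.
        # Assumption: a too-short existing chunk is handled like a missing one.
        header = bytes(offset)
    return header + payload
```

Because the no-prefix branch has to read the old header, a store that cannot seek before writing would have to fetch at least the first `offset` bytes of the chunk first.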
I have proposed another extension called "suffix" that is a chunk key encoding. I also note the name collision between the
Appending a suffix could be another codec. The CRC32c checksum codec could be considered as a specific form of an appending codec where the last four bytes could be ignored as they are not needed to interpret the preceding bytes. Alternatively, we could make it so that a "negative" offset refers to bytes to skip at the end of a byte sequence. In this case, we may need to rename the "prefix" parameter to "padding". To add both a header and a footer to a file we would just use two offset codecs, one with a positive offset and one with a negative offset.
A suffix would be able to better support other foreign image file formats. In particular, I was thinking about PNG files where you would need a terminal IEND chunk. Another case might be ZIP shards.
This might overlap with the sharding codec in that we are effectively defining the byte offset and nbytes for a single chunk shard. An alternative here would be to consider extensions to the sharding codec where we define the index either in an external key (another file) or have a fixed index common to all shards defined in the codec configuration. One difference from the sharding codec is that we define a byte offset from the beginning and an offset from the end rather than the size of the "inset". In composition with the sharding-indexed codec, this would also allow the shard index to not have to exist at the exact beginning or end of the file. |
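The signed-offset idea can be sketched in Python as follows. This is only an illustration under assumptions: the names `signed_offset_encode`, `signed_offset_decode`, and the `padding` argument are hypothetical, and the placeholder bytes `b"HEAD"` and `b"TAIL"` do not correspond to any real file format.

```python
from typing import Optional


def signed_offset_encode(
    payload: bytes, offset: int, padding: Optional[bytes] = None
) -> bytes:
    """Add abs(offset) ignored bytes: in front if offset >= 0, at the end otherwise."""
    n = abs(offset)
    pad = bytes(n) if padding is None else padding  # default padding is zeros
    if len(pad) != n:
        raise ValueError("padding length must equal abs(offset)")
    return pad + payload if offset >= 0 else payload + pad


def signed_offset_decode(stored: bytes, offset: int) -> bytes:
    """Drop abs(offset) bytes from the front (offset >= 0) or the end (offset < 0)."""
    n = abs(offset)
    if len(stored) < n:
        raise ValueError("stored bytes are shorter than the offset")
    if offset >= 0:
        return stored[offset:]
    return stored[: len(stored) - n]


# Header and footer together: apply the codec twice, once per sign.
with_footer = signed_offset_encode(b"data", -4, b"TAIL")
stored = signed_offset_encode(with_footer, 4, b"HEAD")

# Decoding undoes the two steps in reverse order.
recovered = signed_offset_decode(signed_offset_decode(stored, 4), -4)
```

In this sketch the two instances commute, since one only touches the front of the byte sequence and the other only touches the end.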
|
Another name I just thought of for a combined prefix and suffix codec is "byte_range". We could then support syntax similar to the HTTP Range header: |
|
Re name collision: technically it is fine since they are different namespaces, and indeed we already have a collision between the
Re preserving existing prefix/suffix content: that is outside of the existing capabilities of a codec. That would likely require changes to codec APIs in existing implementations. |
|
I'm currently leaning towards supporting both prefix and suffix but not at the same time within one instance of the codec. In order to add ignored bytes (padding) to the beginning and end one would need to apply the codec twice as follows. |
Add an offset codec. This defines a byte offset to seek to before decoding bytes.
… 0x00, of length equal to the byte offset.
… prefix configuration parameter could be used to set a static header of length equal to the byte offset.
Proposed Applications:
… prefix configuration parameter.
… suffix chunk-key-encoding to add a .tiff file extension to the chunks.
Importantly, the skipped bytes are not intended to have any meaning for the Zarr decoding process itself.
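As a rough illustration of the description above, the core round trip might look like the following Python sketch. The names `offset_encode` and `offset_decode` are hypothetical and do not come from any Zarr library. The sketch assumes the header defaults to 0x00 bytes when no prefix is given.

```python
from typing import Optional


def offset_encode(payload: bytes, offset: int, prefix: Optional[bytes] = None) -> bytes:
    """Prepend `offset` header bytes: the static prefix if given, else zeros."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if prefix is None:
        header = bytes(offset)  # `offset` bytes of 0x00
    elif len(prefix) != offset:
        raise ValueError("prefix length must equal offset")
    else:
        header = prefix
    return header + payload


def offset_decode(stored: bytes, offset: int) -> bytes:
    """Skip the first `offset` bytes; their content is never interpreted."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if len(stored) < offset:
        raise ValueError("stored bytes are shorter than the offset")
    return stored[offset:]
```

Decoding depends only on the offset, not on the header content, which matches the point that the skipped bytes carry no meaning for Zarr decoding.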