
Conversation

@mkitti (Contributor) commented Oct 29, 2025

Add an offset codec. This defines a byte offset to seek to before decoding bytes.

  • When reading, a leading number of bytes is ignored.
  • When writing, a sequence of bytes is prepended:
    • By default, this could just be a sequence of null bytes, 0x00, of length equal to the byte offset.
    • A prefix configuration parameter could be used to set a static header of length equal to the byte offset.
    • Implementations may allow for dynamic header generation depending on the input bytes to the codec.
    • If a chunk exists, leave the existing skipped bytes as is.

Proposed Applications:

  • Decode N5 chunks by skipping over the N5 header. Optionally, include a static N5 header as the prefix configuration parameter.
  • Write chunks as TIFF files. Combine with the suffix chunk-key-encoding to add a .tiff file extension to the chunks.

Importantly, the skipped bytes are not intended to have any meaning for the Zarr decoding process itself.
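
For concreteness, here is a minimal Python sketch of the behavior described above; the class name, constructor, and prefix handling are illustrative assumptions, not part of the proposal:

class OffsetCodec:
    """Hypothetical bytes-to-bytes codec: skip `offset` leading bytes when
    decoding; prepend `prefix` (or null bytes) when encoding."""

    def __init__(self, offset: int, prefix: bytes | None = None):
        if prefix is not None and len(prefix) != offset:
            raise ValueError("prefix must be exactly `offset` bytes long")
        self.offset = offset
        self.prefix = prefix

    def decode(self, data: bytes) -> bytes:
        # Reading: the leading `offset` bytes are ignored.
        return data[self.offset:]

    def encode(self, data: bytes) -> bytes:
        # Writing: prepend the static prefix, or null bytes by default.
        header = self.prefix if self.prefix is not None else b"\x00" * self.offset
        return header + data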

@jbms (Contributor) commented Oct 29, 2025

Maybe call this skip_bytes or padding_bytes?

@jbms (Contributor) commented Oct 29, 2025

This could support both prefix and suffix.

@d-v-b (Contributor) commented Oct 29, 2025

another name idea, if prefix and suffix are added: inset

@mkitti (Contributor, Author) commented Oct 29, 2025

I may need to add another parameter to control how to handle existing chunks. Should the codec open the chunk, seek to the offset, and use the existing header OR should it always write zeros or the defined prefix?

My current thought is that the lack of a defined prefix means that zeros should be prepended if a chunk does not exist, but that the header of an existing chunk should be preserved. If there is a specified prefix, then that prefix should always be written as the header, even if the chunk exists.

Some implementations may find the write behavior problematic if the underlying storage does not allow seeking before writing. In this case, a read may be necessary before writing.

> Maybe call this skip_bytes or padding_bytes?

> another name idea, if prefix and suffix are added: inset

skip_bytes would probably make sense for the codec as currently written, but that name reflects the reading perspective rather than the writing one. When writing a new chunk, there are not yet any bytes to "skip". Thus, I would lean towards padding_bytes or inset.

I have proposed another extension called "suffix" that is a chunk key encoding. I also note the name collision between the conditional codec and the optional data type. Perhaps we need extension naming conventions to avoid such collisions, or is the distinct nature of the extensions sufficient?

> This could support both prefix and suffix.

Appending a suffix could be another codec. The CRC32c checksum codec could be considered a specific form of an appending codec, where the last four bytes can be ignored since they are not needed to interpret the preceding bytes.

Alternatively, we could make it so that a "negative" offset refers to bytes to skip at the end of a byte sequence. In this case, we may need to rename the "prefix" parameter to "padding". To add both a header and a footer to a file we would just use two offset codecs, one with a positive offset and one with a negative offset.
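
As an illustration of the signed-offset idea (a sketch only; the function name is mine), decoding would slice from either end depending on the sign:

def strip_padding(data: bytes, offset: int) -> bytes:
    # Positive offset: skip `offset` leading bytes.
    # Negative offset: skip |offset| trailing bytes.
    return data[offset:] if offset >= 0 else data[:offset]

# Header and footer via two offset codecs:
raw = b"\x00" * 120 + b"payload" + b"\x00" * 96
assert strip_padding(strip_padding(raw, 120), -96) == b"payload"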

A suffix would be able to better support other foreign image file formats. In particular, I was thinking about PNG files where you would need a terminal IEND chunk. Another case might be ZIP shards.

This might overlap with the sharding codec in that we are effectively defining the byte offset and nbytes for a single chunk shard. An alternative here would be to consider extensions to the sharding codec where we define the index either in an external key (another file) or have a fixed index common to all shards defined in the codec configuration.

One difference from the sharding codec is that we define a byte offset from the beginning and an offset from the end rather than the size of the "inset". In composition with the sharding-indexed codec, this would also allow the shard index to not have to exist at the exact beginning or end of the file.

@mkitti (Contributor, Author) commented Oct 29, 2025

Another name I just thought of for a combined prefix and suffix codec is "byte_range". We could then support syntax similar to the HTTP Range header:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Range

Range: <unit>=<range-start>-
Range: <unit>=<range-start>-<range-end>
Range: <unit>=<range-start>-<range-end>, …, <range-startN>-<range-endN>
Range: <unit>=-<suffix-length>
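
Purely as an illustration of how such a configuration might look (neither the codec name nor the field names are settled), a byte_range codec whose payload starts at byte 120 could be written as:

{
    "name": "byte_range",
    "configuration": {
        "range": "bytes=120-"
    }
}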

@jbms (Contributor) commented Oct 29, 2025

Re name collision: technically it is fine since they are different namespaces, and indeed we already have a collision between the bytes codec and the bytes data type, which are quite unrelated and in fact incompatible, but it would be better to avoid collisions to reduce confusion.

Re preserving existing prefix/suffix content: that is outside of the existing capabilities of a codec. That would likely require changes to codec APIs in existing implementations.

@mkitti (Contributor, Author) commented Oct 30, 2025

I'm currently leaning towards supporting both prefix and suffix, but not at the same time within one instance of the codec. To add ignored bytes (padding) to both the beginning and the end, one would need to apply the codec twice, as follows.

"codecs": [
    { "name": "bytes" },
    {
        "name": "pad",
        "configuration": {
            "location": "beginning",
            "nbytes": 120,
            "padding": <Base64 encoded string>
        }
    },
    {
        "name": "pad",
        "configuration": {
            "location": "end",
            "nbytes": 96,
            "padding": <Base64 encoded string>
        }
    }
]
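
Assuming pad is a bytes-to-bytes codec, a reader would unwrap this chain in reverse order; a rough Python sketch:

def decode_chunk(raw: bytes) -> bytes:
    body = raw[:-96]   # undo the "end" pad codec (96 ignored trailing bytes)
    body = body[120:]  # undo the "beginning" pad codec (120 ignored leading bytes)
    return body        # remaining bytes are handed to the "bytes" codec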
