
Conversation

@mkitti (Contributor) commented Oct 29, 2025

Add an offset codec. This defines a byte offset to seek to before decoding bytes.

  • When reading, a leading number of bytes is ignored.
  • When writing, a sequence of bytes is prepended:
    • By default, this could just be a sequence of null bytes, 0x00, of length equal to the byte offset.
    • A prefix configuration parameter could be used to set a static header of length equal to the byte offset.
    • Implementations may allow for dynamic header generation depending on the input bytes to the codec.
    • If a chunk exists, leave the existing skipped bytes as is.

Proposed Applications:

  • Decode N5 chunks by skipping over the N5 header. Optionally, include a static N5 header as the prefix configuration parameter.
  • Write chunks as TIFF files. Combine with the suffix chunk-key-encoding to add a .tiff file extension to the chunks.

Importantly, the skipped bytes are not intended to have any meaning for the Zarr decoding process itself.
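
For concreteness, here is a minimal Python sketch of the behavior described above; the class name, constructor, and prefix handling are illustrative assumptions, not part of the proposal:

class OffsetCodec:
    """Hypothetical bytes-to-bytes codec: skip `offset` leading bytes when
    decoding; prepend `prefix` (or null bytes) when encoding."""

    def __init__(self, offset: int, prefix: bytes | None = None):
        if prefix is not None and len(prefix) != offset:
            raise ValueError("prefix must be exactly `offset` bytes long")
        self.offset = offset
        self.prefix = prefix

    def decode(self, data: bytes) -> bytes:
        # Reading: the leading `offset` bytes are ignored.
        return data[self.offset:]

    def encode(self, data: bytes) -> bytes:
        # Writing: prepend the static prefix, or null bytes by default.
        header = self.prefix if self.prefix is not None else b"\x00" * self.offset
        return header + data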

@jbms (Contributor) commented Oct 29, 2025

Maybe call this skip_bytes or padding_bytes?

@jbms (Contributor) commented Oct 29, 2025

This could support both prefix and suffix.

@d-v-b (Contributor) commented Oct 29, 2025

another name idea, if prefix and suffix are added: inset

@mkitti (Contributor, Author) commented Oct 29, 2025

I may need to add another parameter to control how to handle existing chunks. Should the codec open the chunk, seek to the offset, and use the existing header OR should it always write zeros or the defined prefix?

My current thought is that the lack of a defined prefix means that zeros should be prepended if a chunk does not exist, but that the header of an existing chunk should be preserved. If there is a specified prefix, then that prefix should always be written as the header, even if the chunk exists.

Some implementations may find the write behavior problematic if the underlying storage does not allow seeking before writing. In this case, a read may be necessary before writing.

> Maybe call this skip_bytes or padding_bytes?

> another name idea, if prefix and suffix are added: inset

skip_bytes would probably make sense for the codec as currently written, but that name reflects the reading perspective rather than the writing one. When writing a new chunk, there are not yet any bytes to "skip". Thus, I would lean towards padding_bytes or inset.

I have proposed another extension called "suffix" that is a chunk key encoding. I also note the name collision between the conditional codec and the optional data type. Perhaps we need extension naming conventions to avoid such collisions, or is the distinct nature of the extensions sufficient?

> This could support both prefix and suffix.

Appending a suffix could be another codec. The CRC32c checksum codec could be considered a specific form of an appending codec, where the last four bytes can be ignored since they are not needed to interpret the preceding bytes.

Alternatively, we could make it so that a "negative" offset refers to bytes to skip at the end of a byte sequence. In this case, we may need to rename the "prefix" parameter to "padding". To add both a header and a footer to a file we would just use two offset codecs, one with a positive offset and one with a negative offset.
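
As an illustration of the signed-offset idea (a sketch only; the function name is mine), decoding would slice from either end depending on the sign:

def strip_padding(data: bytes, offset: int) -> bytes:
    # Positive offset: skip `offset` leading bytes.
    # Negative offset: skip |offset| trailing bytes.
    return data[offset:] if offset >= 0 else data[:offset]

# Header and footer via two offset codecs:
raw = b"\x00" * 120 + b"payload" + b"\x00" * 96
assert strip_padding(strip_padding(raw, 120), -96) == b"payload"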

A suffix would be able to better support other foreign image file formats. In particular, I was thinking about PNG files where you would need a terminal IEND chunk. Another case might be ZIP shards.

This might overlap with the sharding codec in that we are effectively defining the byte offset and nbytes for a single chunk shard. An alternative here would be to consider extensions to the sharding codec where we define the index either in an external key (another file) or have a fixed index common to all shards defined in the codec configuration.

One difference from the sharding codec is that we define a byte offset from the beginning and an offset from the end rather than the size of the "inset". In composition with the sharding-indexed codec, this would also allow the shard index to not have to exist at the exact beginning or end of the file.

@mkitti (Contributor, Author) commented Oct 29, 2025

Another name I just thought of for a combined prefix and suffix codec is "byte_range". We could then support syntax similar to the HTTP Range header:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Range

Range: <unit>=<range-start>-
Range: <unit>=<range-start>-<range-end>
Range: <unit>=<range-start>-<range-end>, …, <range-startN>-<range-endN>
Range: <unit>=-<suffix-length>
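
Purely as an illustration of how such a configuration might look (neither the codec name nor the field names are settled), a byte_range codec whose payload starts at byte 120 could be written as:

{
    "name": "byte_range",
    "configuration": {
        "range": "bytes=120-"
    }
}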

@jbms (Contributor) commented Oct 29, 2025

Re name collision: technically it is fine since they are different namespaces, and indeed we already have a collision between the bytes codec and the bytes data type, which are quite unrelated and in fact incompatible, but it would be better to avoid collisions to reduce confusion.

Re preserving existing prefix/suffix content: that is outside of the existing capabilities of a codec. That would likely require changes to codec APIs in existing implementations.

@mkitti (Contributor, Author) commented Oct 30, 2025

I'm currently leaning towards supporting both prefix and suffix, but not at the same time within one instance of the codec. To add ignored bytes (padding) to both the beginning and the end, one would need to apply the codec twice, as follows.

"codecs": [
    { "name": "bytes" },
    {
        "name": "pad",
        "configuration": {
            "location": "beginning",
            "nbytes": 120,
            "padding": <Base64 encoded string>
        }
    },
    {
        "name": "pad",
        "configuration": {
            "location": "end",
            "nbytes": 96,
            "padding": <Base64 encoded string>
        }
    }
]
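
Assuming pad is a bytes-to-bytes codec, a reader would unwrap this chain in reverse order; a rough Python sketch:

def decode_chunk(raw: bytes) -> bytes:
    body = raw[:-96]   # undo the "end" pad codec (96 ignored trailing bytes)
    body = body[120:]  # undo the "beginning" pad codec (120 ignored leading bytes)
    return body        # remaining bytes are handed to the "bytes" codec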
