a codec simplification plan #3162

Open
@d-v-b

I propose that this code should work:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "zarr",
#     "pytest"
# ]
# ///

import zarr
from numcodecs import GZip
import pytest

compressors = (GZip(), zarr.codecs.GzipCodec())
zarr_formats = (2,3)
@pytest.mark.parametrize('compressor', compressors)
@pytest.mark.parametrize('zarr_format', zarr_formats)
def test(compressor, zarr_format):
    x = zarr.create_array(
        {},
        shape=(10,), 
        dtype='uint8', 
        compressors=compressor, 
        zarr_format=zarr_format)

if __name__ == '__main__':
    pytest.main([__file__, f'-c {__file__}'])
../../.cache/uv/environments-v2/test-48d946355ee2ef42/lib/python3.11/site-packages/zarr/core/array.py:4723: TypeError
==================================================================================================== short test summary info ====================================================================================================
FAILED  /home/bennettd/dev/zarr-python::test[2-compressor1] - ValueError: Invalid compressor. Expected None, a numcodecs.abc.Codec, or a dict representation of a numcodecs.abc.Codec. Got <class 'zarr.codecs.gzip.GzipCodec'> instead.
FAILED  /home/bennettd/dev/zarr-python::test[3-compressor0] - TypeError: 'GZip' object is not iterable

Ignoring the details of the errors here, they exist because our codec handling is weird: it requires separate objects (zarr.codecs.GzipCodec for v3, numcodecs.GZip for v2) to express the same thing (gzip compression).

Here is a proposal to fix this:

  • We add methods to our codec base class that enable it to handle both zarr v2 and zarr v3 metadata. This means the exact same codec class can be used for zarr v2 or zarr v3. This is how the new dtypes work and I think it's a good design.
  • We define a protocol, in this repo, that models the structure of the numcodecs codec abstract base class. That protocol would look like this:
from collections.abc import Mapping
from typing import ClassVar, Protocol, Self, runtime_checkable

# ArrayLike is assumed here to be numpy.typing.ArrayLike
from numpy.typing import ArrayLike

@runtime_checkable
class Numcodec(Protocol):
    codec_id: ClassVar[str]

    def encode(self, buf: ArrayLike) -> ArrayLike:
        ...

    def decode(self, buf: ArrayLike) -> ArrayLike:
        ...

    # this return type enables typed dicts
    def get_config(self) -> Mapping[str, object]:
        ...

    @classmethod
    def from_config(cls, data: Mapping[str, object]) -> Self:
        ...
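
To illustrate why a runtime-checkable protocol is useful here, the sketch below restates the protocol in simplified form (buffers typed as Any so it needs only the standard library) and checks it against a toy codec. ToyGzip is a hypothetical class made up for this example; it is not numcodecs.GZip and inherits from nothing.

```python
import gzip
from collections.abc import Mapping
from typing import Any, ClassVar, Protocol, runtime_checkable


@runtime_checkable
class Numcodec(Protocol):
    # Simplified restatement of the protocol above; buffers are typed as Any.
    codec_id: ClassVar[str]

    def encode(self, buf: Any) -> Any: ...
    def decode(self, buf: Any) -> Any: ...
    def get_config(self) -> Mapping[str, object]: ...
    @classmethod
    def from_config(cls, data: Mapping[str, object]) -> "Numcodec": ...


class ToyGzip:
    """Hypothetical codec: no numcodecs import, no base class."""

    codec_id = "gzip"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def encode(self, buf: bytes) -> bytes:
        return gzip.compress(bytes(buf), compresslevel=self.level)

    def decode(self, buf: bytes) -> bytes:
        return gzip.decompress(bytes(buf))

    def get_config(self) -> dict[str, object]:
        return {"id": self.codec_id, "level": self.level}

    @classmethod
    def from_config(cls, data: Mapping[str, object]) -> "ToyGzip":
        return cls(level=int(data["level"]))


codec = ToyGzip(level=5)
is_codec = isinstance(codec, Numcodec)         # structural check, no inheritance
is_not_codec = isinstance(object(), Numcodec)  # a plain object lacks the members
roundtrip = codec.decode(codec.encode(b"zarr" * 4))
```

Note that isinstance against a runtime-checkable protocol only verifies that the members exist; it does not check signatures, so real wrapping code would still need to handle misbehaving objects.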

We then define routines, like the ones in @brokkoli71's PR, that automatically wrap user input: any object that implements numcodecs.abc.Codec gets wrapped in the respective zarr-python codec class. Because create_array takes separate filters, serializer, compressors kwargs, we know which codec class (array-array, array-bytes, bytes-bytes) is the correct output for wrapping.
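
A minimal sketch of that wrapping step, under stated assumptions: Wrapped, normalize, and implements_numcodec are hypothetical names invented for this example, not zarr-python APIs, and the attribute check stands in for an isinstance check against the protocol. It also accepts a single codec where a sequence is expected, which is the case that currently raises "'GZip' object is not iterable".

```python
from dataclasses import dataclass
from typing import Any


def implements_numcodec(obj: Any) -> bool:
    """Duck-typed stand-in for an isinstance check against the protocol."""
    names = ("codec_id", "encode", "decode", "get_config")
    return all(hasattr(obj, name) for name in names)


@dataclass(frozen=True)
class Wrapped:
    """Hypothetical wrapper; `kind` records which codec category it fills."""

    kind: str
    inner: Any


def _as_tuple(value: Any) -> tuple[Any, ...]:
    # Accept None, a single codec, or a list/tuple of codecs.
    if value is None:
        return ()
    if isinstance(value, (list, tuple)):
        return tuple(value)
    return (value,)


def _wrap(obj: Any, kind: str) -> Any:
    if implements_numcodec(obj):
        return Wrapped(kind=kind, inner=obj)
    return obj  # assumed to already be a native codec


def normalize(filters: Any = None, serializer: Any = None, compressors: Any = None) -> dict[str, tuple[Any, ...]]:
    # The keyword argument a codec arrived through decides its category.
    return {
        "filters": tuple(_wrap(c, "array-array") for c in _as_tuple(filters)),
        "serializer": tuple(_wrap(c, "array-bytes") for c in _as_tuple(serializer)),
        "compressors": tuple(_wrap(c, "bytes-bytes") for c in _as_tuple(compressors)),
    }


class DummyCodec:
    """Hypothetical numcodecs-style object used only to exercise the sketch."""

    codec_id = "dummy"

    def encode(self, buf: Any) -> Any:
        return buf

    def decode(self, buf: Any) -> Any:
        return buf

    def get_config(self) -> dict[str, object]:
        return {"id": "dummy"}


result = normalize(compressors=DummyCodec())
```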

We can also define class methods on the Codec base class that enable construction of the codec from an implements-numcodec python object.
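
As a rough sketch of how one class could serve both formats and be built from a numcodecs-style object: GzipSketch, from_numcodec, and to_json below are hypothetical names chosen for illustration, not the current zarr-python API. The two metadata shapes follow the zarr v2 convention (a flat dict with an "id" key) and the zarr v3 convention (a "name" plus a "configuration" dict).

```python
from dataclasses import dataclass
from typing import Any, Literal


@dataclass(frozen=True)
class GzipSketch:
    """Hypothetical codec class; not zarr.codecs.GzipCodec."""

    level: int = 5

    @classmethod
    def from_numcodec(cls, codec: Any) -> "GzipSketch":
        # Hypothetical constructor: read the config of a numcodecs-style object.
        config = dict(codec.get_config())
        if config.get("id") != "gzip":
            raise ValueError(f"expected a gzip codec, got {config.get('id')!r}")
        return cls(level=int(config["level"]))

    def to_json(self, zarr_format: Literal[2, 3]) -> dict[str, Any]:
        if zarr_format == 2:
            # v2 metadata: flat dict keyed by "id"
            return {"id": "gzip", "level": self.level}
        if zarr_format == 3:
            # v3 metadata: "name" plus nested "configuration"
            return {"name": "gzip", "configuration": {"level": self.level}}
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")


class NumcodecLikeGzip:
    """Hypothetical stand-in for a numcodecs-style gzip codec."""

    codec_id = "gzip"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def get_config(self) -> dict[str, object]:
        return {"id": "gzip", "level": self.level}


codec = GzipSketch.from_numcodec(NumcodecLikeGzip(level=7))
v2_meta = codec.to_json(zarr_format=2)
v3_meta = codec.to_json(zarr_format=3)
```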

This will allow zarr-python to eventually drop the numcodecs requirement entirely if we see fit, without any compatibility loss. Given the anemic maintenance of numcodecs, I see this as a very good thing.

There is one remaining concern -- how to handle codecs defined in the zarr v3 spec. For example, this case:

from numcodecs import GZip
import zarr
zarr.create_array(..., compressors=GZip(), zarr_format=3)

Here we should inspect the codec_id attribute of the user-provided codec and see if that codec is one of the core codecs enshrined in the spec. If so, we should replace the user-provided codec with the zarr-python codec that implements the spec definition. This ensures that, even if something changes about the codec configuration in numcodecs, zarr-python does not propagate invalid metadata. Users who want to circumvent this behavior can either subclass the zarr-python codec classes, or use a lower-level array constructor that doesn't do these checks.
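
The substitution rule could be sketched roughly as follows. Everything named here (SpecGzip, GenericWrapper, CORE_CODECS, resolve_for_v3) is hypothetical and exists only for illustration, and the registry lists a single entry rather than the full set of core codecs.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class SpecGzip:
    """Hypothetical stand-in for the spec-defined gzip codec."""

    level: int

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "SpecGzip":
        return cls(level=int(config["level"]))


@dataclass(frozen=True)
class GenericWrapper:
    """Hypothetical pass-through wrapper for codecs that are not in the spec."""

    inner: Any


# Hypothetical registry: codec_id -> builder for the spec-defined codec.
CORE_CODECS: dict[str, Callable[[dict[str, Any]], Any]] = {
    "gzip": SpecGzip.from_config,
}


def resolve_for_v3(codec: Any) -> Any:
    builder = CORE_CODECS.get(codec.codec_id)
    if builder is None:
        # Not a core codec: keep the user's object behind a generic wrapper.
        return GenericWrapper(codec)
    # Core codec: rebuild it from its config so the metadata follows the spec.
    return builder(dict(codec.get_config()))


class GzipLike:
    """Hypothetical numcodecs-style gzip object."""

    codec_id = "gzip"

    def get_config(self) -> dict[str, object]:
        return {"id": "gzip", "level": 3}


class DeltaLike:
    """Hypothetical numcodecs-style codec with no spec-defined counterpart here."""

    codec_id = "delta"

    def get_config(self) -> dict[str, object]:
        return {"id": "delta", "dtype": "<i4"}


replaced = resolve_for_v3(GzipLike())
kept = resolve_for_v3(DeltaLike())
```

Rebuilding from the config, rather than holding on to the user's object, is what keeps a change in the numcodecs configuration from leaking into the stored metadata.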

Together these changes will achieve the following goals:
