Description
I propose that this code should work:
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "zarr",
# "pytest"
# ]
# ///
import zarr
from numcodecs import GZip
import pytest
compressors = (GZip(), zarr.codecs.GzipCodec())
zarr_formats = (2, 3)

@pytest.mark.parametrize('compressor', compressors)
@pytest.mark.parametrize('zarr_format', zarr_formats)
def test(compressor, zarr_format):
    x = zarr.create_array(
        {},
        shape=(10,),
        dtype='uint8',
        compressors=compressor,
        zarr_format=zarr_format)

if __name__ == '__main__':
    pytest.main([__file__, f'-c {__file__}'])
Instead, it currently fails:
../../.cache/uv/environments-v2/test-48d946355ee2ef42/lib/python3.11/site-packages/zarr/core/array.py:4723: TypeError
==== short test summary info ====
FAILED /home/bennettd/dev/zarr-python::test[2-compressor1] - ValueError: Invalid compressor. Expected None, a numcodecs.abc.Codec, or a dict representation of a numcodecs.abc.Codec. Got <class 'zarr.codecs.gzip.GzipCodec'> instead.
FAILED /home/bennettd/dev/zarr-python::test[3-compressor0] - TypeError: 'GZip' object is not iterable
Ignoring the details of the errors here, they exist because our codec handling is weird: it requires separate objects (zarr.codecs.GzipCodec for v3, numcodecs.GZip for v2) to express the same thing (gzip compression).
Here is a proposal to fix this:
- We add methods to our codec base class that enable it to handle both zarr v2 and zarr v3 metadata. This means the exact same codec class can be used for zarr v2 or zarr v3. This is how the new dtypes work and I think it's a good design.
- We define a protocol, in this repo, that models the structure of the numcodecs codec abstract base class. That protocol would look like this:
from collections.abc import Mapping
from typing import ClassVar, Protocol, Self, runtime_checkable

from numpy.typing import ArrayLike  # assumed source of ArrayLike

@runtime_checkable
class Numcodec(Protocol):
    codec_id: ClassVar[str]

    def encode(self, buf: ArrayLike) -> ArrayLike:
        ...

    def decode(self, buf: ArrayLike) -> ArrayLike:
        ...

    def get_config(self) -> Mapping[str, object]:  # this return type enables typed dicts
        ...

    @classmethod
    def from_config(cls, data: Mapping[str, object]) -> Self:
        ...
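To illustrate what the protocol buys us, here is a minimal sketch using only the standard library. The protocol is restated with buffers typed as `Any` so the snippet has no numpy dependency, and `Reverse` is a made-up toy codec, not something from numcodecs or zarr-python. The point is that a class passes the `isinstance` check purely by having the right attributes, without inheriting from anything.

```python
from collections.abc import Mapping
from typing import Any, ClassVar, Protocol, runtime_checkable


@runtime_checkable
class Numcodec(Protocol):
    """Stand-in for the protocol above, with buffers typed as Any."""

    codec_id: ClassVar[str]

    def encode(self, buf: Any) -> Any: ...

    def decode(self, buf: Any) -> Any: ...

    def get_config(self) -> Mapping[str, object]: ...

    @classmethod
    def from_config(cls, data: Mapping[str, object]) -> "Numcodec": ...


class Reverse:
    """Hypothetical toy codec; note that it has no base class."""

    codec_id = "reverse"

    def encode(self, buf: bytes) -> bytes:
        return bytes(buf)[::-1]

    def decode(self, buf: bytes) -> bytes:
        return bytes(buf)[::-1]

    def get_config(self) -> dict[str, object]:
        return {"id": self.codec_id}

    @classmethod
    def from_config(cls, data: Mapping[str, object]) -> "Reverse":
        return cls()


codec = Reverse()
# Structural check: no import of numcodecs is needed to recognise the codec.
is_numcodec = isinstance(codec, Numcodec)
roundtrip = codec.decode(codec.encode(b"zarr"))
```

One caveat: a `runtime_checkable` protocol only checks that the members exist, not that their signatures match, so the check is a cheap filter rather than a guarantee.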
We then define routines, like the ones in @brokkoli71's PR, that automatically wrap user-provided objects implementing numcodecs.abc.Codec in the respective zarr-python codec class. Because create_array takes separate filters, serializer, and compressors kwargs, we know which codec class (array-array, array-bytes, bytes-bytes) is the correct output for wrapping.
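A rough sketch of that dispatch, assuming three wrapper classes exist. Every name here (`ArrayArrayWrapper`, `ArrayBytesWrapper`, `BytesBytesWrapper`, `wrap_for_kwarg`, `FakeGZip`) is hypothetical and chosen for illustration; none of it is existing zarr-python API, and the real routines may look quite different.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical wrapper classes, one per codec category.
@dataclass(frozen=True)
class ArrayArrayWrapper:
    codec: Any


@dataclass(frozen=True)
class ArrayBytesWrapper:
    codec: Any


@dataclass(frozen=True)
class BytesBytesWrapper:
    codec: Any


# The keyword the user passed the codec under decides the wrapper.
WRAPPER_FOR_KWARG = {
    "filters": ArrayArrayWrapper,
    "serializer": ArrayBytesWrapper,
    "compressors": BytesBytesWrapper,
}

# Attributes we treat as the mark of a numcodecs-style codec (an assumption).
NUMCODEC_ATTRS = ("codec_id", "encode", "decode", "get_config")


def wrap_for_kwarg(kwarg: str, codec: Any) -> Any:
    """Wrap numcodecs-style objects; pass anything else through unchanged."""
    if kwarg not in WRAPPER_FOR_KWARG:
        raise ValueError(f"unknown codec keyword: {kwarg!r}")
    if not all(hasattr(codec, name) for name in NUMCODEC_ATTRS):
        return codec
    return WRAPPER_FOR_KWARG[kwarg](codec)


class FakeGZip:
    """Toy stand-in for numcodecs.GZip so the sketch needs no dependencies."""

    codec_id = "gzip"

    def encode(self, buf):
        return buf

    def decode(self, buf):
        return buf

    def get_config(self):
        return {"id": "gzip", "level": 1}


wrapped = wrap_for_kwarg("compressors", FakeGZip())
```

The same object passed as `filters` would end up in `ArrayArrayWrapper` instead, which is exactly the information the separate kwargs give us for free.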
We can also define class methods on the Codec base class that enable construction of the codec from any Python object that implements the Numcodec protocol.
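As a sketch of how one class could serve both formats and also be built from a numcodecs-style object: `GzipSketch`, `to_json`, and `from_numcodec` are hypothetical names, and `FakeGZip` is a toy stand-in for `numcodecs.GZip`. The two metadata layouts follow the usual shape of gzip metadata in zarr v2 (`id` plus config keys) and zarr v3 (`name` plus `configuration`).

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class GzipSketch:
    """Hypothetical single gzip codec class usable for zarr v2 and v3."""

    level: int

    def to_json(self, zarr_format: int) -> dict[str, Any]:
        """Serialize to the metadata layout of the requested zarr format."""
        if zarr_format == 2:
            return {"id": "gzip", "level": self.level}
        if zarr_format == 3:
            return {"name": "gzip", "configuration": {"level": self.level}}
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")

    @classmethod
    def from_numcodec(cls, codec: Any) -> "GzipSketch":
        """Build the codec from any object exposing a numcodecs-style get_config()."""
        config = dict(codec.get_config())
        codec_id = config.pop("id", None)
        if codec_id != "gzip":
            raise ValueError(f"expected a gzip codec, got id={codec_id!r}")
        return cls(level=int(config["level"]))


class FakeGZip:
    """Toy stand-in for numcodecs.GZip."""

    codec_id = "gzip"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def get_config(self) -> dict[str, Any]:
        return {"id": self.codec_id, "level": self.level}


codec = GzipSketch.from_numcodec(FakeGZip(level=3))
v2_meta = codec.to_json(zarr_format=2)
v3_meta = codec.to_json(zarr_format=3)
```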
This will allow zarr-python to eventually drop the numcodecs requirement entirely if we see fit, without any compatibility loss. Given the anemic maintenance of numcodecs, I see this as a very good thing.
There is one remaining concern -- how to handle codecs defined in the zarr v3 spec. For example, this case:
from numcodecs import GZip
import zarr
zarr.create_array(..., compressors=GZip(), zarr_format=3)
Here we should inspect the codec_id attribute of the user-provided codec and see if that codec is one of the core codecs enshrined in the spec. If so, we should replace the user-provided codec with the one defined in the zarr spec. This ensures that, even if something changes about the codec configuration in numcodecs, zarr-python does not propagate invalid metadata. Users who want to circumvent this behavior can either subclass the zarr-python codec classes, or use a lower-level array constructor that doesn't do these checks.
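A minimal sketch of that replacement step, assuming a registry keyed by codec_id. `SpecGzip`, `CORE_CODECS`, and `normalize_for_v3` are hypothetical names, and `FakeGZip` is a toy stand-in whose `future_flag` key simulates a configuration field that numcodecs might add later. Only fields the spec codec declares are carried over, so the unknown key never reaches the metadata.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class SpecGzip:
    """Hypothetical stand-in for the gzip codec as defined by the zarr v3 spec."""

    level: int


# Hypothetical registry: codec_id of a numcodecs-style codec -> spec codec class.
CORE_CODECS = {"gzip": SpecGzip}


def normalize_for_v3(codec: Any) -> Any:
    """Swap a user-provided codec for the spec-defined one when its id is a core codec."""
    core_cls = CORE_CODECS.get(getattr(codec, "codec_id", None))
    if core_cls is None:
        return codec  # not a core codec: leave it alone
    allowed = {f.name for f in fields(core_cls)}
    # Keep only the fields the spec codec knows about.
    kwargs = {k: v for k, v in codec.get_config().items() if k in allowed}
    return core_cls(**kwargs)


class FakeGZip:
    """Toy stand-in for numcodecs.GZip with an extra, made-up config key."""

    codec_id = "gzip"

    def get_config(self) -> dict[str, Any]:
        return {"id": "gzip", "level": 3, "future_flag": True}


class FakeCustom:
    """Toy codec whose id is not a core codec."""

    codec_id = "my-custom-codec"

    def get_config(self) -> dict[str, Any]:
        return {"id": self.codec_id}


normalized = normalize_for_v3(FakeGZip())
custom = FakeCustom()
untouched = normalize_for_v3(custom)
```

Silently dropping unknown keys is only one possible policy; raising an error on them would be a stricter alternative.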
Together these changes will achieve the following goals:
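- allow users to use numcodecs-compatible codecs with zarr v3
- simplify our codebase
- protect users from surprising numcodecs changes that can cause compatibility issues (examples: "Adds checksum flag to zstd codec" numcodecs#519, "(feat): typesize declared with constructor for Blosc" numcodecs#713). Numcodecs does not offer any compatibility guarantees for its codecs. Zarr python needs to be stricter.
- weaken our reliance on numcodecs, potentially allowing us to declare it an optional dependency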
numcodecs#713). Numcodecs does not offer any compatibility guarantees for its codecs. Zarr python needs to be stricter. - weaken our reliance on numcodecs, potentially allowing us to declare it an optional dependency