-
| cc @zarr-developers/python-core-devs | 
-
| 
 um, shards??? | 
-
| 
 I would be in favour of either a tuple-of-tuples (one tuple per dimension, à la dask), or an array with the same number of dimensions as the zarr array, where chunks[(i, j, k)] would contain the shape of the chunk at position i along dim 0, j along dim 1, and k along dim 2. Both forms carry the same information, so the former is probably better; we could provide function(s) for converting between the various forms. (And potentially, having chunks return a simple tuple-of-int for the most common use case of uniform chunk size might not be a bad idea. But I understand that it complicates the API significantly.) | 
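A minimal sketch of the conversion mentioned in this comment, in plain Python. The function names `per_axis_to_grid` and `uniform_chunk_shape` are hypothetical and not part of zarr-python or dask; the sketch only assumes the per-axis form is a tuple of tuples of ints.

```python
from itertools import product


def per_axis_to_grid(per_axis):
    """Expand per-axis chunk sizes into {chunk index: chunk shape}.

    Hypothetical helper for illustration; not a zarr-python API.
    """
    index_ranges = [range(len(sizes)) for sizes in per_axis]
    return {
        idx: tuple(sizes[i] for sizes, i in zip(per_axis, idx))
        for idx in product(*index_ranges)
    }


def uniform_chunk_shape(per_axis):
    """Return one chunk shape if each axis uses a single size, else None."""
    if all(len(set(sizes)) == 1 for sizes in per_axis):
        return tuple(sizes[0] for sizes in per_axis)
    return None


per_axis = ((10, 20), (10, 20))
grid = per_axis_to_grid(per_axis)
# grid[(i, j)] is the shape of the chunk at position i along dim 0, j along dim 1
print(grid[(1, 0)])
print(uniform_chunk_shape(per_axis))
print(uniform_chunk_shape(((5, 5, 5), (3, 3))))
```

Going the other way (from the explicit mapping back to per-axis sizes) only works when the grid really is rectilinear, which is one more argument for treating the per-axis form as the primary one.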
-
| 
 TIL 😅 Very handy! I actually have ~no experience with the filters and codecs APIs so I don't have useful input here... But seeing a diversity of example uses would help me form some ideas about common APIs... | 
-
| For the record, ZEP0003 proposes chunks in the per-axis list style, as dask uses. If we want to change that, we should do that by updating the ZEP (and accept it). I would vote to keep the current structure for simplicity, and because it's what dask uses (one of the primary consumers) and because it leaves the possibility of mixing regular and variable-length chunking for the axes of a given array. | 
-
| 
 This was essentially a convention and there was no technical difference between filters and compressors - the compressor object would be evaluated just the same as a filter (and can be used as one), as if it were the last item in the filter list. I think we probably even have a "silent encoder" of numpy->bytes if the output of the last in the chain isn't already 1D, and a "silent decoder" if decoding yields bytes of the right number to fit in the output array, but I'm not certain on these points. The inputs and outputs are also "array-like" (i.e., python buffer protocol, effectively), so the real difference is the dimensionality requirement of the encoded and decoded data on each codec. That isn't defined anywhere except by trying it and in the codec documentation. Nevertheless, I think we can characterise all existing codecs and put them in the three V3 codec categories. There are not that many! For some unknown third party codec appearing in entrypoints or registered at runtime (imagecodecs?), if the author doesn't have the time to provide the characterisation, we have options: 
 | 
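As a rough illustration of what "characterising" codecs into the three V3 categories could look like, here is a sketch that tags codec names with a category and checks the ordering rule (any number of array -> array codecs, exactly one array -> bytes codec, then any number of bytes -> bytes codecs). The table `KNOWN_KINDS` and the function `check_codec_order` are hypothetical; the table is hand-written for a few codec names from the Zarr v3 specs and is not read from any registry.

```python
from enum import Enum


class CodecKind(Enum):
    ARRAY_ARRAY = 0  # chunk <-> chunk
    ARRAY_BYTES = 1  # chunk <-> byte array
    BYTES_BYTES = 2  # byte array <-> byte array


# Illustrative, hand-maintained characterisation (an assumption, not an API).
KNOWN_KINDS = {
    "transpose": CodecKind.ARRAY_ARRAY,
    "bytes": CodecKind.ARRAY_BYTES,
    "sharding_indexed": CodecKind.ARRAY_BYTES,
    "gzip": CodecKind.BYTES_BYTES,
    "blosc": CodecKind.BYTES_BYTES,
    "zstd": CodecKind.BYTES_BYTES,
    "crc32c": CodecKind.BYTES_BYTES,
}


def check_codec_order(names):
    """Return the kind of each codec, or raise if the pipeline is malformed."""
    # An uncharacterised codec raises KeyError here.
    kinds = [KNOWN_KINDS[name] for name in names]
    ranks = [kind.value for kind in kinds]
    if ranks != sorted(ranks):
        raise ValueError("codecs must be ordered array->array, array->bytes, bytes->bytes")
    if kinds.count(CodecKind.ARRAY_BYTES) != 1:
        raise ValueError("exactly one array->bytes codec is required")
    return kinds


print(check_codec_order(["transpose", "bytes", "gzip"]))
```

A third-party codec with no entry in the table fails loudly with a `KeyError`, which is only one of several possible policies for uncharacterised codecs.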
-
| 
 I think the answer to this (partly) comes from whether v3 aims to be as backwards compatible with v2 as possible, or is being seen as a big break and a chance to introduce (well documented, but breaking) API changes. There seems to be a lot of back and forth on this in various issues, and it seems to me like it would be helpful for the core devs to discuss and make a decision on this. | 
-
I don't think we have settled the overall shape of the top-level Array object in zarr-python 3.x. This should be a priority for stable release. So, I'm opening this discussion so people can suggest ideas / brainstorm / vent about how the Array object should look and behave in v3. I will open with a summary of the specific challenges we need to solve, and some ideas I have for each one.

The Array API in 3.x

Here's an annotated outline of the shape of the Array object in v3 today. I'm just going to enumerate the properties of the class and ignore the methods for now. You can see the source code for this object here.

v2-only attributes
The following attributes are present in the zarr-python 2.x Array class (source) but not present in the 3.x Array class. The reasons for these not being implemented in 3.x vary from "we haven't figured out v3 semantics for this" (filters, compressor, synchronizer), to "we haven't gotten around to it yet" (write_empty_chunks), and also "this triggers traversal over all the chunks and might not be a good idea for an array attribute" (nchunks_initialized).

- chunk_store
- compressor
- filters
- synchronizer
- itemsize
- nbytes
- nbytes_stored
- cdata_shape
- nchunks
- nchunks_initialized
- is_view
- oindex
- vindex
- blocks
- write_empty_chunks
- meta_array

I'm happy to discuss the v3 future for any one of these attributes. We may need to spin those discussions out into separate issues.
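Some of these attributes are cheap to derive on a regular chunk grid, which may help when deciding which ones are worth carrying over. The sketch below assumes the 2.x meanings as I understand them (cdata_shape is the number of chunks per axis, nchunks is their product, nbytes is the uncompressed size). The helper name `regular_grid_summary` is hypothetical and not part of zarr-python.

```python
import math


def regular_grid_summary(shape, chunks, itemsize):
    """Derive a few v2-style attributes for a regular chunk grid.

    Hypothetical helper for illustration; not a zarr-python API.
    """
    # Number of chunks along each axis, rounding up for partial edge chunks.
    cdata_shape = tuple(-(-s // c) for s, c in zip(shape, chunks))
    return {
        "cdata_shape": cdata_shape,
        "nchunks": math.prod(cdata_shape),
        # Uncompressed size of the whole array in bytes.
        "nbytes": math.prod(shape) * itemsize,
    }


summary = regular_grid_summary(shape=(100, 50), chunks=(30, 20), itemsize=8)
print(summary)
# {'cdata_shape': (4, 3), 'nchunks': 12, 'nbytes': 40000}
```

By contrast, nchunks_initialized and nbytes_stored cannot be computed from metadata alone, since they depend on what is actually in the store.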
specific challenges
I will enumerate some specific challenges with the Array API that we need to solve in 3.x.

sharding
Zarr V3 introduces the possibility of creating sharded chunks, i.e. chunks that contain subchunks, each of which can be addressed as a contiguous byte range within the chunk. If you are reading from a sharded array, you will want to iterate over the subchunks. This means we need to make this property of an array simple to specify when creating an array, and simple to access when an array is already created.
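To make "iterate over the subchunks" concrete, here is a small sketch in plain Python that enumerates the regions covered by each subchunk inside one shard. The function name `subchunk_slices` is hypothetical and not a zarr-python API, and the sketch assumes the subchunk shape evenly divides the shard shape.

```python
from itertools import product


def subchunk_slices(shard_shape, subchunk_shape):
    """Yield one tuple of slices per subchunk inside a single shard.

    Hypothetical helper for illustration; not a zarr-python API.
    """
    if any(s % c for s, c in zip(shard_shape, subchunk_shape)):
        raise ValueError("subchunk shape must evenly divide the shard shape")
    # Start offsets of the subchunks along each axis.
    starts = [range(0, s, c) for s, c in zip(shard_shape, subchunk_shape)]
    for origin in product(*starts):
        yield tuple(slice(o, o + c) for o, c in zip(origin, subchunk_shape))


regions = list(subchunk_slices(shard_shape=(4, 4), subchunk_shape=(2, 2)))
print(len(regions))
print(regions[0])
```

Each yielded tuple can be used directly to index an in-memory array holding the decoded shard.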
Neither of these things is true today. We do not have an Array attribute that conveys the subchunk size. Instead, here is how you would get the subchunks of an array:

How should we make this subchunk information specifiable and accessible from the Array object? The simple solution would be to add an Array.subchunks attribute that uses the get_subchunks routine I sketched out, and to add a subchunks keyword argument to Array.create. Maybe people have other ideas, or proposals for a better name than "subchunks". Note that this subchunks property is defined inside the array serialization / compression routines, which are also specified in Array.create, so adding a subchunks keyword argument for array creation would impact other parts of the Array.create API.

chunks and chunk_shape

Array.create specifies the same information with two keyword arguments: chunks is a v2-specific argument, and chunk_shape is a v3-specific argument. We should pick one, and use it for both v2 and v3. But see the next point.

chunk grids
Zarr v2 uses a regular chunk grid, which means that all chunks are the same size, so a single chunk shape is a complete description of the chunk grid. Hence, in 2.x, chunks is just a tuple of ints. But in Zarr v3, the chunk grid is an extension point and there is an active proposal to add support for a rectilinear chunk grid, i.e. a chunk grid where the chunks do not all have the same shape. In this case, there are two ways to specify the chunk shapes: an explicit list of chunk shapes, [(10, 10), (10, 20), (20, 10), (20, 20)], or a list of chunk sizes per axis, [(10, 20), (10, 20)].

So for zarr-python 3.x, we should have a plan for what the chunks attribute will look like for rectilinear chunk grids. We could also consider solving the sharding problem with the chunks attribute, e.g. by defining an object with specific attributes for chunks (shards) and subchunks.

serialization: compressor, filters, codecs
Zarr V2 metadata defines filters (a collection of chunk <-> chunk transformations) and compressor (a single chunk <-> byte array transformation).

Zarr V3 metadata instead has a single codecs attribute, which is a structured list that may contain some number of chunk <-> chunk transformations (ArrayArrayCodec), must contain one and only one chunk <-> byte array transformation (termed ArrayBytesCodec), and may contain some number of byte array <-> byte array transformations (BytesBytesCodec).

- ArrayArrayCodec ArrayBytesCodec and BytesBytesCodec.
- ArrayBytesCodec, which contains its own collection of codecs.

We have a few challenges for the v3 API. Each one of these is a potential discussion point:

- list[filter], compressor) and v3 array serialization (list[ArrayArrayCodec], ArrayBytesCodec[subcodecs], list[BytesBytesCodec]) with the same API?
- ShardingCodec. We need to provide a uniform interface to this information.
- Array.create takes all the codecs in a single list, and it isn't obvious how to construct that list so that sharding happens. I don't think users will be happy with this (see the previous points about sharding).

discussion
I am curious to hear what the community thinks about any of these points. The Array object has perhaps the most user contact of any class in zarr-python. It's imperative that we end up with a design that most users can be happy with (or, a design that they are not unhappy with).