Virtual Dataset Workflow Tracking Issue #197
This is so awesome, thank you for open sourcing your work and the impressive documentation/issue tracking! Just wanted to share the snippet below that works for me, since there have been some changes on those branches since this code was posted. In particular, only …

```python
import xarray as xr
from virtualizarr import open_virtual_dataset
from virtualizarr.writers.icechunk import dataset_to_icechunk

url = 's3://met-office-atmospheric-model-data/global-deterministic-10km/20221001T0000Z/20221001T0000Z-PT0000H00M-CAPE_mixed_layer_lowest_500m.nc'
so = dict(anon=True, default_fill_cache=False, default_cache_type="none")

# create a virtual xarray dataset
ds = open_virtual_dataset(url, reader_options={'storage_options': so}, indexes={})

# create an icechunk store
from icechunk import IcechunkStore, StorageConfig, StoreConfig, VirtualRefConfig

storage = StorageConfig.filesystem('ukmet')
store = IcechunkStore.create(storage=storage, mode="w", config=StoreConfig(
    virtual_ref_config=VirtualRefConfig.s3_anonymous(region='eu-west-2'),
))

# use virtualizarr to write the dataset to icechunk
dataset_to_icechunk(ds, store)

# commit to save progress
store.commit(message="Initial commit")

# open it back up
ds = xr.open_zarr(store, zarr_version=3, consolidated=False)

# plot!
ds.atmosphere_convective_available_potential_energy.plot()
```
Thanks @maxrjones!! I updated the code sample up top to match, just to make sure it's all on the same page.
Icechunk support was merged to VirtualiZarr main! zarr-developers/VirtualiZarr#256 I updated the top post with the latest instructions.

Edit: And released!! https://virtualizarr.readthedocs.io/en/latest/generated/virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_icechunk.html#virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_icechunk
I listed out a current breakdown of the work to be done in kerchunk here if anyone is interested in helping to drive this effort forward!
I wonder, do we have examples of supermassive iced datasets yet, with millions of references? I wanted to see how the msgpack format stacks up against kerchunk's parquet format, particularly the ability to only load partitions of the reference data.
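As a rough illustration of why partitioned reference storage matters at that scale (this is a toy sketch, not icechunk's actual msgpack layout or kerchunk's real parquet schema; the variable names and byte ranges are made up):

```python
import json
from collections import defaultdict

# Toy kerchunk-style references: "<variable>/<chunk index>" ->
# [url, byte offset, length] into the original archival file.
refs = {
    "temp/0.0": ["s3://bucket/data.nc", 0, 100],
    "temp/0.1": ["s3://bucket/data.nc", 100, 100],
    "salt/0.0": ["s3://bucket/data.nc", 200, 100],
}

def partition_refs(refs):
    """Group references by variable so each group can be serialized and
    fetched independently, instead of as one monolithic reference blob."""
    parts = defaultdict(dict)
    for key, ref in refs.items():
        parts[key.split("/", 1)[0]][key] = ref
    return dict(parts)

# One serialized partition per variable: a reader that only needs "salt"
# downloads and parses just that partition, not all million references.
blobs = {var: json.dumps(p) for var, p in partition_refs(refs).items()}
print(sorted(blobs))
```

With millions of references, a single monolithic file forces every reader to pay the full download-and-parse cost up front; the per-variable (or per-chunk-range) grouping is what makes partial loads possible.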
numcodecs 0.14.0 is out with included support for zarr 3 codecs using the … The last piece of this puzzle is getting kerchunk fully working with zarr 3 stores, which is a work in progress.
Great! Would you mind submitting a PR to VirtualiZarr to change this dependency?
I tried 100 million virtual references in #401, which kind of already works. (Which is surprising given how no effort has gone into optimizing anything yet!)
(This was done in zarr-developers/VirtualiZarr#301)
Since icechunk has upgraded to use zarr-python 3.0, I think most recent versions of icechunk (> alpha 7) don't work with VirtualiZarr. I have been using a custom branch for icechunk work to get around this until we can fully migrate VirtualiZarr to zarr-python 3.

Am I correct in stating that the current instructions at the top will no longer work, because icechunk > alpha 7 will break virtualizarr's icechunk writer?
Correct. And if you create an outdated icechunk store it can't be used with newer versions. |
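The failure mode described here is a format-version mismatch between writer and reader. Conceptually it behaves like the sketch below (illustrative only — this is not icechunk's real metadata layout or API, and `spec_version` is a made-up field name):

```python
SUPPORTED_SPEC_VERSION = 2  # hypothetical on-disk format this reader understands

def open_store(metadata: dict) -> dict:
    """Refuse stores written under a different spec version, the same way a
    newer icechunk rejects a repo created by an outdated alpha release."""
    found = metadata.get("spec_version")
    if found != SUPPORTED_SPEC_VERSION:
        raise ValueError(
            f"store spec version {found} != supported {SUPPORTED_SPEC_VERSION}; "
            "re-create the store with a matching writer"
        )
    return metadata

open_store({"spec_version": 2})  # a store written by a matching version opens fine
```

The practical consequence is the one stated above: there is no in-place upgrade path for a store written by an old pre-release, so it has to be re-created with a current writer.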
As of today everything works! Closing this issue as complete; will update other docs to reflect the same shortly.
In order to create and use virtual datasets with Python, users will want to use kerchunk and virtualizarr. These are just starting down the path to zarr 3 and icechunk compatibility. This issue will be used to track progress and relevant PRs:

- zarr-python v3 compatibility: fsspec/kerchunk#516

All of this can be installed with pip. However, we need to install in three steps for now to avoid version conflicts. This assumes also having fsspec, s3fs, and h5 installed.

With all of this installed, HDF5 virtual datasets currently work like this (see the working snippet shared in the comments above):
Updated 2/4/2025