Virtual Dataset Workflow Tracking Issue #197

Closed
5 tasks done
mpiannucci opened this issue Oct 12, 2024 · 11 comments
Labels
virtual references 👻 Involves virtual kerchunk/virtualizarr chunk references

Comments

@mpiannucci
Contributor

mpiannucci commented Oct 12, 2024

In order to create and use virtual datasets with Python, users will want to use kerchunk and VirtualiZarr. These are just starting down the path to zarr 3 and icechunk compatibility. This issue will be used to track progress and relevant PRs:

All of this can be installed with pip. However, for now the packages need to be installed in separate steps to avoid version conflicts:

pip install icechunk xarray VirtualiZarr kerchunk

This also assumes that fsspec, s3fs, and the HDF5 libraries are installed:

pip install fsspec s3fs h5py h5netcdf

With all of this installed, HDF5 virtual datasets currently work like this:

from datetime import datetime, timezone
import icechunk
import xarray as xr
import virtualizarr

url = 's3://met-office-atmospheric-model-data/global-deterministic-10km/20250204T0000Z/20250204T0000Z-PT0000H00M-pressure_at_mean_sea_level.nc'
so = dict(anon=True, default_fill_cache=False, default_cache_type="none")

# create virtualizarr dataset
vds = virtualizarr.open_virtual_dataset(url, reader_options={'storage_options': so}, indexes={})

# create an icechunk repo that can read virtual chunks from eu-west-region with anonymous access
storage = icechunk.local_filesystem_storage("./ukmet")
config = icechunk.RepositoryConfig.default()

config.set_virtual_chunk_container(icechunk.VirtualChunkContainer("s3", "s3://", icechunk.s3_store(region="eu-west-2")))
credentials = icechunk.containers_credentials(s3=icechunk.s3_credentials(anonymous=True))

repo = icechunk.Repository.create(storage, config, credentials)

# create a session, and write to a group inside it using virtualizarr
session = repo.writable_session("main")
vds.virtualize.to_icechunk(session.store, group="msl", last_updated_at=datetime.now(timezone.utc))

# commit to save progress
session.commit("Add msl pressure")

# open it back up
ds = xr.open_zarr(session.store, group="msl", zarr_format=3, consolidated=False, decode_times=False)
ds

# plot!
ds.air_pressure_at_sea_level.plot()
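For intuition about what the workflow above commits: a virtual dataset stores references (file location, byte offset, length) into the original netCDF/HDF5 file rather than copying chunk bytes. The sketch below is a toy, stdlib-only illustration of resolving such a reference; the `ChunkRef` class and the sample file are made up for illustration and are not the actual icechunk or kerchunk reference format.

```python
import os
import tempfile
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkRef:
    """A toy virtual chunk reference: where the chunk's bytes live."""
    path: str
    offset: int
    length: int

def read_chunk(ref: ChunkRef) -> bytes:
    # Resolve a reference by reading only the referenced byte range,
    # leaving the rest of the source file untouched.
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)

# Stand-in for a netCDF/HDF5 file that contains chunk data after a header.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"HEADER--chunk0bytes--chunk1bytes")
    source = f.name

ref = ChunkRef(path=source, offset=8, length=11)
data = read_chunk(ref)
print(data)  # b'chunk0bytes'
os.unlink(source)
```

A real reader does the same thing with ranged S3 GETs against the original object, which is why no data is duplicated into the icechunk store.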


Updated 2/4/2025

@maxrjones

maxrjones commented Oct 17, 2024

This is so awesome, thank you for open sourcing your work and the impressive documentation/issue tracking!

Just wanted to share the snippet below that works for me, since there have been some changes on those branches since this code was posted. In particular, only dataset_to_icechunk is available, and storage_options is required for successful execution. Also, if it's helpful for anyone working on JupyterHubs, quay.io/developmentseed/warp-resample-profiling:eac145edd638 has all the dependencies installed in the order you specified.

import xarray as xr
from virtualizarr import open_virtual_dataset
from virtualizarr.writers.icechunk import dataset_to_icechunk

url = 's3://met-office-atmospheric-model-data/global-deterministic-10km/20221001T0000Z/20221001T0000Z-PT0000H00M-CAPE_mixed_layer_lowest_500m.nc'
so = dict(anon=True, default_fill_cache=False, default_cache_type="none")

# create xarray dataset
ds = open_virtual_dataset(url, reader_options={'storage_options': so}, indexes={})

# create an icechunk store
from icechunk import IcechunkStore, StorageConfig, StoreConfig, VirtualRefConfig
storage = StorageConfig.filesystem('ukmet')
store = IcechunkStore.create(storage=storage, mode="w", config=StoreConfig(
    virtual_ref_config=VirtualRefConfig.s3_anonymous(region='eu-west-2'),
))

# use virtualizarr to write the dataset to icechunk
dataset_to_icechunk(ds, store)

# commit to save progress
store.commit(message="Initial commit")

# open it back up
ds = xr.open_zarr(store, zarr_version=3, consolidated=False)

# plot!
ds.atmosphere_convective_available_potential_energy.plot()

@mpiannucci
Contributor Author

Thanks @maxrjones!! I updated the code sample up top to match, just to make sure it's all on the same page.

@mpiannucci
Contributor Author

mpiannucci commented Oct 22, 2024

Icechunk support was merged to VirtualiZarr main! zarr-developers/VirtualiZarr#256

I updated the top post with the latest instructions

Edit: And released!! https://virtualizarr.readthedocs.io/en/latest/generated/virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_icechunk.html#virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_icechunk

@mpiannucci
Contributor Author

mpiannucci commented Oct 23, 2024

I listed out a current breakdown of the work to be done in kerchunk here if anyone is interested in helping to drive this effort forward!

@martindurant

I wonder, do we have examples of supermassive iced datasets yet, with millions of references? I wanted to see how the msgpack format stacks up against kerchunk's parquet format, particularly the ability to only load partitions of the reference data.
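As a toy illustration of the partitioned-loading idea being asked about here (this is not kerchunk's actual parquet layout or icechunk's msgpack format): if references are serialized in per-variable partitions, a reader only has to deserialize the partition for the variable it is accessing, instead of parsing millions of references up front. JSON stands in for the real on-disk encodings, and the variable names and URLs are made up.

```python
import json

# Toy reference store: one serialized partition per variable. Each entry
# maps a chunk key to [url, offset, length].
partitions = {
    var: json.dumps({f"{var}/{i}": ["s3://bucket/file.nc", i * 4096, 4096]
                     for i in range(1000)})
    for var in ("air_pressure_at_sea_level", "air_temperature")
}

def load_refs(var: str) -> dict:
    # Only the requested partition is parsed; the other partitions stay
    # as unparsed strings, which is the point of partitioned layouts.
    return json.loads(partitions[var])

refs = load_refs("air_temperature")
print(len(refs))                     # 1000
print(refs["air_temperature/0"])     # ['s3://bucket/file.nc', 0, 4096]
```

With millions of references, the win comes from skipping both the I/O and the deserialization cost of partitions that are never touched.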

@TomNicholas TomNicholas added the virtual references 👻 Involves virtual kerchunk/virtualizarr chunk references label Nov 7, 2024
@mpiannucci
Contributor Author

mpiannucci commented Nov 13, 2024

numcodecs 0.14.0 is out with support for zarr 3 codecs using the numcodecs. prefix. I have updated the installation instructions in the OP.

The last piece to this puzzle is getting kerchunk fully working with zarr 3 stores which is a work in progress

@TomNicholas
Contributor

numcodecs 0.14.0 is out with included support for zarr 3 codecs using the numcodecs. prefix. I have updated the installation instructions in the op.

Great! Would you mind submitting a PR to VirtualiZarr to change this dependency?

@TomNicholas
Contributor

I wonder, do we have examples of supermassive iced datasets yet, with millions of references?

I tried 100 million virtual references in #401, which kind of already works (surprising, given that no effort has gone into optimizing anything yet!)

Great! Would you mind submitting a PR to VirtualiZarr to change this dependency?

(This was done in zarr-developers/VirtualiZarr#301)

@abarciauskas-bgse
Contributor

Since icechunk has upgraded to use zarr-python 3.0, I think the most recent versions of icechunk (>alpha 7) don't work with VirtualiZarr. I have been using a custom branch for icechunk work to get around this until we can fully migrate VirtualiZarr to zarr-python 3. Am I correct in stating that the current instructions at the top

pip install icechunk xarray VirtualiZarr

will no longer work because icechunk>alpha7 will break virtualizarr's icechunk writer?

@mpiannucci
Contributor Author

Correct. And if you create a store with an outdated icechunk version, it can't be used with newer versions.
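Until the VirtualiZarr migration to zarr-python 3 lands, pinning a known-compatible pair of versions avoids this breakage. A hedged requirements fragment; the exact version specifiers below are assumptions and should be verified against the icechunk and VirtualiZarr changelogs rather than copied as-is:

```text
# requirements.txt -- pin icechunk at or before the last alpha known to
# work with VirtualiZarr's icechunk writer (version strings illustrative;
# check the release notes before using)
icechunk<=0.1.0a7
virtualizarr
xarray
```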

@mpiannucci
Contributor Author

As of today everything works! Closing this issue as complete; I will update other docs to reflect the same shortly.
