Replies: 5 comments 8 replies
-
And one question I've been fiddling with for a while now... can L2 data be ARCO by design, so that when we integrate it there is less work for libraries like Xarray? Some ideas: if we ask data providers to use cloud-optimized HDF5, put an Icechunk store on top of the files (the cool kids call it a "data lake"), and georeference these chunks (see zarr-developers/geozarr-spec#4), then we could have the best of both worlds. I'm probably describing what @rabernat is building with Earthmover, but starting with common-sense L2 products instead of having to rechunk and ingest the whole collection.
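For a rough idea of what that "best of both worlds" could feel like on the consumer side, here is a minimal sketch, assuming the Icechunk Python API and a hypothetical DAAC-published store of virtual references (bucket and prefix names are made up):

```python
# Hypothetical consumer-side sketch: a DAAC-published Icechunk store of virtual
# references over cloud-optimized HDF5 granules, opened lazily with Xarray.
import icechunk
import xarray as xr

storage = icechunk.s3_storage(
    bucket="example-daac-bucket",                     # hypothetical
    prefix="collections/some-l2-product/icechunk",    # hypothetical
    from_env=True,  # pick up AWS credentials from the environment
)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session(branch="main")

# One lazy dataset for the whole collection; the chunks still live in the
# original HDF5 files (reading them may additionally require configuring
# credentials for Icechunk's virtual chunk containers).
ds = xr.open_zarr(session.store, consolidated=False)
subset = ds.sel(time=slice("2024-01-01", "2024-02-01"))
```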
-
I fully agree, but as a bit of an outsider I am unclear what earthaccess is trying to be. Is it a universal portal? Is it a data catalog? Is it a search tool? Is it a data aggregation pipeline? I personally think the north star should be to access all NASA data as a set of datacubes.
One vision for earthaccess is the creator of such public datacubes (i.e. Icechunk stores) from whatever complex system of files exists inside NASA. This is analogous to Pangeo-Forge, and the earthaccess community could maintain recipes for going from granules & DMR++ to Icechunk stores. However we probably shouldn't literally use Pangeo-Forge, as that predates the existence of Icechunk/VirtualiZarr/Cubed. Another vision for earthaccess would be a search tool over the top of such datacubes, one that translates granule-level knowledge into datacube-level search results. But that idea is more related to the aims of FROST (see in particular the general problem of creating org-specific catalogs over general data stores, TomNicholas/FROST#5). If the recipes are created on-demand instead of pre-defined, then the catalog and the search start to blend into one, and that relates more to what @ayushnag did.
-
@kylebarron joined the earthaccess hacking hour today (July 8th), and we talked about Obstore bringing more streamlined integrations with cloud storage. Just a small recap: earthaccess implements three access patterns to data (downloading granules locally, streaming over HTTPS, and direct in-region S3 access). With Obstore, something like this:
```python
store = obstore.S3Store(auth=NASAEarthdataCredentials())
file_handler = store.open("s3://nasa-cumulus/some_file.h5")
ds = xr.open_dataset(file_handler, engine="h5netcdf")
```

With that we would be mostly there; caching would be the other feature we need, since we know the caching type impacts performance when opening HDF5 vs. NetCDF vs. COGs, etc.
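For the caching point, here is a small sketch of what explicit fsspec caching looks like today, assuming earthaccess's fsspec HTTPS helper (the helper name and granule URL are illustrative):

```python
# Illustrative sketch of tuned fsspec caching: "blockcache" keeps recently-read
# blocks in memory, which matters a lot for HDF5/NetCDF files that issue many
# small reads for internal metadata.
import earthaccess
import xarray as xr

earthaccess.login()
fs = earthaccess.get_fsspec_https_session()  # fsspec filesystem with Earthdata auth

url = "https://data.example-daac.earthdata.nasa.gov/some_file.h5"  # hypothetical URL
with fs.open(url, cache_type="blockcache", block_size=8 * 1024 * 1024) as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)
```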
I think it would be really helpful to have some time to discuss this with the wider community: what is the best path for these integrations, and what's the lowest-hanging fruit that will deliver the best performance and ease of use for NASA data? cc. @abarciauskas-bgse @sharkinsspatial @maxrjones @weiji14 @TomNicholas @jhkennedy @danielfromearth @DeanHenze
-
Just wanted to ping this thread here. I have been prototyping the generation of virtual Icechunk stores here. The basic workflow is to build virtual references for each granule with VirtualiZarr and commit them to an Icechunk store.
This works right now! While we still need to think about details, we are seeing some significant performance improvements for this example compared to using xarray with fsspec and dask to open the files. I am currently trying to see how we can simplify the authentication part (at least for the specific case where both the Icechunk store and the files it points to are located in buckets that can be authenticated via earthaccess) in this discussion. Would love to get some feedback over there.
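For readers who have not seen the prototype, here is a rough sketch of what such a virtual-Icechunk workflow can look like, assuming earthaccess's `open_virtual_mfdataset` (from the DMR++/VirtualiZarr tutorial linked earlier) and the Icechunk/VirtualiZarr Python APIs; exact names, accessors, and signatures differ between releases, and the collection short name is only an example:

```python
# Sketch: build a virtual dataset from DMR++ metadata for a set of granules and
# commit the references to an Icechunk repository.
import earthaccess
import icechunk

earthaccess.login()
results = earthaccess.search_data(short_name="MUR-JPL-L4-GLOB-v4.1", count=30)  # example collection

# load=False keeps the arrays as virtual references instead of reading the data.
vds = earthaccess.open_virtual_mfdataset(
    results,
    load=False,
    concat_dim="time",
    coords="minimal",
    compat="override",
)

# Write the references into a local Icechunk repo (object storage works the same way).
storage = icechunk.local_filesystem_storage("/tmp/example-virtual-icechunk")
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")
vds.virtualize.to_icechunk(session.store)  # accessor name varies across VirtualiZarr versions
session.commit("add virtual references for the first 30 granules")
```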
-
@betolink Apologies, I probably should have jumped onto this thread a few months ago 😆. I would definitely love to have some wider discussion about how some of these converging efforts can best serve the community 🚀. For reference, I have a bit of a higher-level view of where these efforts overlap and where I believe they should go.

**NASA Collection Level datacubes**

As @jbusecke described, we've been working on generating larger datacubes for high-value collections. We generate Icechunk stores with virtual references pointing to the chunks stored in archival HDF5/NetCDF4 files in DAAC buckets using VirtualiZarr. As new data is published to the collection, its virtual reference is appended to the Icechunk store. The concept here is that for many high-value gridded datasets, having a Zarr interface into the entire collection allows users to lazily select time ranges and AOIs of interest for analysis through a single well-defined interface. Additionally, engineering teams building applications on top of this data store no longer need to understand querying CMR for data search or working with lower-level file access. As @TomNicholas stated previously, I personally don't envision building and maintaining collection datacubes for all NASA datasets, but instead focusing on high-value, commonly used datasets in the short term. As a final note, this model works very well for regularly gridded datasets with identical coverage for each timestep; it is less efficient for spatiotemporally sparse observational datasets (HLS, for instance). More on this in a moment.

**Earthaccess**

Earthaccess has already demonstrated massive value by reducing the friction of common CMR search logic. Its additional xarray opening capabilities provide a great user experience for users interacting with smaller subsets of data. As for employing virtual datasets within earthaccess: while having pre-generated DMR++ indexes for granules reduces the overhead of opening a virtual dataset, I think there are many cases where the cost of scanning a granule directly is acceptable. TLDR, I feel we should focus our efforts with Earthaccess on continuing to provide a great CMR abstraction and an easy file-opening interface for smaller subsets of data for analysis users.

**Non-gridded data**

As mentioned above, providing collection-level datacube interfaces through Icechunk/Zarr is a good path forward for regularly gridded data. But what is the story for non-gridded data? Earthaccess provides a good experience for many of these datasets at a smaller scale now. The biggest issue we see with current systems is the disconnect between metadata management and file/chunk management. In both the CMR and STAC ecosystems we are managing metadata search and access with one system, while data access requires a host of different format-specific solutions. In an ideal world this would be a single "store", similar to the collection-level Icechunk stores described above. We're currently investigating new options for storing virtual references (and native chunks) and their associated granule metadata in a single store that would provide SQL-like query access and Zarr-like array access in a single interface. The focus of this effort would be supporting irregularly gridded data like swath and Level 2 datasets. Hopefully we'll have some further updates on this soon.
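To make the "append as new data is published" idea concrete, here is a minimal sketch, assuming VirtualiZarr's Icechunk writer supports appending along a dimension (accessor and keyword names vary by version; the bucket, prefix, and short name are hypothetical):

```python
# Sketch: append virtual references for newly-published granules to an existing
# collection-level Icechunk store along the time dimension.
import earthaccess
import icechunk

new_granules = earthaccess.search_data(
    short_name="EXAMPLE-GRIDDED-COLLECTION",          # hypothetical
    temporal=("2025-07-01", "2025-07-02"),
)
new_vds = earthaccess.open_virtual_mfdataset(
    new_granules, load=False, concat_dim="time", coords="minimal", compat="override"
)

storage = icechunk.s3_storage(bucket="example-bucket", prefix="collection-cube", from_env=True)
repo = icechunk.Repository.open(storage)
session = repo.writable_session("main")
new_vds.virtualize.to_icechunk(session.store, append_dim="time")  # names vary by VirtualiZarr version
session.commit("append 2025-07-01 granules")
```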
-
earthaccess' main goal is to simplify access to NASA data; IMO, second to that is to enable scientific workflows. This goes beyond just providing authenticated file objects for the data granules. I think we have a good opportunity to start playing around (more intentionally) with technologies like VirtualiZarr, Icechunk, and even Pangeo-Forge.
I think the future of data access at scale will require features present in these packages (and then some), and earthaccess can be the bridge between how a file is accessed and these tools. An example:
A researcher needs data that happens to be L2, and the first attempt is the obvious one: search for the granules, open them, and load everything with xarray (a minimal version is sketched below), which turns out to be painfully slow.
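A minimal sketch of that first attempt, with a hypothetical L2 collection short name standing in for whatever the researcher actually needs:

```python
# Naive first attempt: search CMR, open the remote granules, and hand the
# file-like objects to xarray.
import earthaccess
import xarray as xr

earthaccess.login()
results = earthaccess.search_data(
    short_name="SOME-L2-COLLECTION",        # hypothetical
    temporal=("2024-01-01", "2024-01-07"),
    count=20,
)
files = earthaccess.open(results)

# Even opening these lazily can take a very long time: every HDF5 metadata
# request is a separate round trip to the archive, and for many L2 products the
# default combine will not even succeed without a pre-process step (see below).
ds = xr.open_mfdataset(files, engine="h5netcdf")
```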
The first thing earthaccess can do is provide users with guidance on why this is slow: a series of warnings about opening remote archival data formats. Next, we should enable the use of fsspec caching strategies and default to `blockcache` or `first`, a long, long overdue change in `earthaccess.open()`. Then we should also mention, "Consider opening the data using a virtual reference" (if available). earthaccess recently released this great work from @ayushnag and @TomNicholas, see: https://earthaccess.readthedocs.io/en/latest/tutorials/dmrpp-virtualizarr/, and if the references are not available, we could guide users on how to build them with `earthaccess.consolidate_metadata()`, especially if they are going to be using a particular set of results in a reproducible workflow.

Second attempt (from the docs):
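Roughly along the lines of the linked tutorial; the exact `earthaccess.open_virtual_mfdataset` signature and the collection short name here are illustrative:

```python
# Second attempt: build the dataset from DMR++ virtual references instead of
# opening every HDF5 file, avoiding most of the per-granule metadata reads.
import earthaccess

earthaccess.login()
results = earthaccess.search_data(
    short_name="SOME-L2-COLLECTION",        # hypothetical
    temporal=("2024-01-01", "2024-01-07"),
)

def preprocess(ds):
    # Give each granule a dimension to concatenate on; for many L2 products
    # this is the orbit (or the granule time stamp).
    return ds.expand_dims("orbit")

ds = earthaccess.open_virtual_mfdataset(
    results,
    load=True,             # return a readable (lazily indexed) dataset, not raw references
    preprocess=preprocess,
    concat_dim="orbit",
    coords="minimal",
    compat="override",
)
```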
This already works! 🚀 But in most cases, if we need multiple granules, it requires some a priori knowledge about the data; this is why we have to use a pre-process function on each granule we open. Here is another idea: a community-maintained collection of pre-process functions that can be used through earthaccess, so that when we load many files we increase the chances of opening a valid data cube with Xarray. In this case the function just expands the dimensions to concatenate on the orbit, but in many cases it will validate, drop, add, or modify variables and attributes.
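One possible shape for that idea, entirely hypothetical (the registry and collection names are made up for illustration):

```python
# Hypothetical community-maintained registry: map collection short names to the
# pre-process function that makes their granules concatenation-friendly.
from typing import Callable
import xarray as xr

PREPROCESSORS: dict[str, Callable[[xr.Dataset], xr.Dataset]] = {}

def register(short_name: str):
    """Register a pre-process function for a given collection short name."""
    def wrapper(func: Callable[[xr.Dataset], xr.Dataset]):
        PREPROCESSORS[short_name] = func
        return func
    return wrapper

@register("SOME-L2-COLLECTION")  # hypothetical collection
def expand_orbit(ds: xr.Dataset) -> xr.Dataset:
    # Make each granule concatenable along a new "orbit" dimension; other
    # collections might validate, drop, add, or rename variables here instead.
    return ds.expand_dims("orbit")

# earthaccess (or the user) could then look up the right function by short name:
# preprocess = PREPROCESSORS.get(short_name)
```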
Next, as more DAACs start building these references, earthaccess can easily retrieve them so we can use them directly in Xarray.
Perhaps DAACs will build references per year, per area of interest, or even a cube for the whole dataset. That is something we'll have to coordinate so we can identify what is available and how we can present it to the user. Compatible collections could be labeled as "Virtualizable".
Lastly, we should try to package this set of access patterns so their use is intuitive and simple.
Also tagging @DeanHenze