Replies: 5 comments 8 replies
-
And one question I've been fiddling with for a while now... can L2 data be ARCO by design, so that when we integrate it there is less work for libraries like Xarray? Some ideas: if we ask data providers to use cloud-optimized HDF5, put an Icechunk store on top of the files (the cool kids call it a "data lake"), and georeference these chunks (see zarr-developers/geozarr-spec#4), then we could have the best of both worlds. I'm probably describing what @rabernat is building with Earthmover, but starting with common-sense L2 products instead of having to rechunk and ingest the whole collection.
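For a rough idea of what that "best of both worlds" could feel like on the consumer side, here is a minimal sketch, assuming the Icechunk Python API and a hypothetical DAAC-published store of virtual references (bucket and prefix names are made up):

```python
# Hypothetical consumer-side sketch: a DAAC-published Icechunk store of virtual
# references over cloud-optimized HDF5 granules, opened lazily with Xarray.
import icechunk
import xarray as xr

storage = icechunk.s3_storage(
    bucket="example-daac-bucket",                     # hypothetical
    prefix="collections/some-l2-product/icechunk",    # hypothetical
    from_env=True,  # pick up AWS credentials from the environment
)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session(branch="main")

# One lazy dataset for the whole collection; the chunks still live in the
# original HDF5 files (reading them may additionally require configuring
# credentials for Icechunk's virtual chunk containers).
ds = xr.open_zarr(session.store, consolidated=False)
subset = ds.sel(time=slice("2024-01-01", "2024-02-01"))
```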
-
I fully agree, but as a bit of an outsider I am unclear what earthaccess is trying to be. Is it a universal portal? Is it a data catalog? Is it a search tool? Is it a data aggregation pipeline? I personally think the north star should be to access all NASA data as a set of datacubes.
One vision for earthaccess is the creator of such public datacubes (i.e. Icechunk stores) from whatever complex system of files exists inside NASA. This is analogous to Pangeo-Forge, and the earthaccess community could maintain recipes for going from granules & DMR++ to Icechunk stores. However we probably shouldn't literally use Pangeo-Forge, as that predates the existence of Icechunk/VirtualiZarr/Cubed. Another vision for earthaccess would be a search tool over the top of such datacubes, one that translates granule-level knowledge into datacube-level search results. But that idea is more related to the aims of FROST (see in particular the general problem of creating org-specific catalogs over general data stores, TomNicholas/FROST#5). If the recipes are created on-demand instead of pre-defined, then the catalog and the search start to blend into one, and that relates more to what @ayushnag did.
-
@kylebarron joined the earthaccess hacking hour today (July 8th), and we talked about Obstore bringing more streamlined integrations with cloud storage. Just a small recap: earthaccess implements three access patterns to data (downloading granules locally, streaming over HTTPS, and direct in-region S3 access). With Obstore, something like this:
```python
store = obstore.S3Store(auth=NASAEarthdataCredentials())
file_handler = store.open("s3://nasa-cumulus/some_file.h5")
ds = xr.open_dataset(file_handler, engine="h5netcdf")
```

With that we would be mostly there; caching would be the other feature we need, since we know the caching type impacts performance when opening HDF5 vs. NetCDF vs. COGs, etc.
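For the caching point, here is a small sketch of what explicit fsspec caching looks like today, assuming earthaccess's fsspec HTTPS helper (the helper name and granule URL are illustrative):

```python
# Illustrative sketch of tuned fsspec caching: "blockcache" keeps recently-read
# blocks in memory, which matters a lot for HDF5/NetCDF files that issue many
# small reads for internal metadata.
import earthaccess
import xarray as xr

earthaccess.login()
fs = earthaccess.get_fsspec_https_session()  # fsspec filesystem with Earthdata auth

url = "https://data.example-daac.earthdata.nasa.gov/some_file.h5"  # hypothetical URL
with fs.open(url, cache_type="blockcache", block_size=8 * 1024 * 1024) as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)
```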
I think it would be really helpful to have some time to discuss this with the wider community: what is the best path for these integrations, and what's the lowest-hanging fruit that will deliver the best performance and ease of use for NASA data? cc. @abarciauskas-bgse @sharkinsspatial @maxrjones @weiji14 @TomNicholas @jhkennedy @danielfromearth @DeanHenze
-
Just wanted to ping this thread here. I have been prototyping the generation of virtual Icechunk stores here. The basic workflow is to build virtual references for each granule with VirtualiZarr and commit them to an Icechunk store.
This works right now! While we still need to think about details, we are seeing some significant performance improvements for this example compared to using xarray with fsspec and dask to open the files. I am currently trying to see how we can simplify the authentication part (at least for the specific case where both the Icechunk store and the files it points to are located in buckets that can be authenticated via earthaccess) in this discussion. Would love to get some feedback over there.
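For readers who have not seen the prototype, here is a rough sketch of what such a virtual-Icechunk workflow can look like, assuming earthaccess's `open_virtual_mfdataset` (from the DMR++/VirtualiZarr tutorial linked earlier) and the Icechunk/VirtualiZarr Python APIs; exact names, accessors, and signatures differ between releases, and the collection short name is only an example:

```python
# Sketch: build a virtual dataset from DMR++ metadata for a set of granules and
# commit the references to an Icechunk repository.
import earthaccess
import icechunk

earthaccess.login()
results = earthaccess.search_data(short_name="MUR-JPL-L4-GLOB-v4.1", count=30)  # example collection

# load=False keeps the arrays as virtual references instead of reading the data.
vds = earthaccess.open_virtual_mfdataset(
    results,
    load=False,
    concat_dim="time",
    coords="minimal",
    compat="override",
)

# Write the references into a local Icechunk repo (object storage works the same way).
storage = icechunk.local_filesystem_storage("/tmp/example-virtual-icechunk")
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")
vds.virtualize.to_icechunk(session.store)  # accessor name varies across VirtualiZarr versions
session.commit("add virtual references for the first 30 granules")
```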
-
@betolink Apologies, I probably should have jumped onto this thread a few months ago 😆. I would definitely love to have some wider discussion about how some of these converging efforts can best serve the community 🚀. For reference, I have a bit of a higher-level view of where these efforts overlap and where I believe they should go.

**NASA Collection Level datacubes**

As @jbusecke described, we've been working on generating larger datacubes for high-value collections. We generate Icechunk stores with virtual references pointing to the chunks stored in archival HDF5/NetCDF4 files in DAAC buckets using VirtualiZarr. As new data is published to the collection, its virtual reference is appended to the Icechunk store. The concept here is that for many high-value gridded datasets, having a Zarr interface into the entire collection allows users to lazily select time ranges and AOIs of interest for analysis through a single well-defined interface. Additionally, engineering teams building applications on top of this data store no longer need to understand querying CMR for data search or working with lower-level file access. As @TomNicholas stated previously, I personally don't envision building and maintaining collection datacubes for all NASA datasets, but instead focusing on high-value, commonly used datasets in the short term. As a final note, this model works very well for regularly gridded datasets with identical coverage for each timestep; it is less efficient for spatiotemporally sparse observational datasets (HLS, for instance). More on this in a moment.

**Earthaccess**

Earthaccess has already demonstrated massive value by reducing the friction of common CMR search logic. Its additional xarray opening capabilities provide a great user experience for users interacting with smaller subsets of data. As for employing virtual datasets within earthaccess: while having pre-generated DMR++ indexes for granules reduces the overhead of opening a virtual dataset, I think there are many cases where the cost of scanning a granule directly is acceptable. TLDR, I feel we should focus our efforts with Earthaccess on continuing to provide a great CMR abstraction and an easy file-opening interface for smaller subsets of data for analysis users.

**Non-gridded data**

As mentioned above, providing collection-level datacube interfaces through Icechunk/Zarr is a good path forward for regularly gridded data. But what is the story for non-gridded data? Earthaccess provides a good experience for many of these datasets at a smaller scale now. The biggest issue we see with current systems is the disconnect between metadata management and file/chunk management. In both the CMR and STAC ecosystems we are managing metadata search and access with one system, while data access requires a host of different format-specific solutions. In an ideal world this would be a single "store", similar to the collection-level Icechunk stores described above. We're currently investigating new options for storing virtual references (and native chunks) and their associated granule metadata in a single store that would provide SQL-like query access and Zarr-like array access in a single interface. The focus of this effort would be supporting irregularly gridded data like swath and Level 2 datasets. Hopefully we'll have some further updates on this soon.
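To make the "append as new data is published" idea concrete, here is a minimal sketch, assuming VirtualiZarr's Icechunk writer supports appending along a dimension (accessor and keyword names vary by version; the bucket, prefix, and short name are hypothetical):

```python
# Sketch: append virtual references for newly-published granules to an existing
# collection-level Icechunk store along the time dimension.
import earthaccess
import icechunk

new_granules = earthaccess.search_data(
    short_name="EXAMPLE-GRIDDED-COLLECTION",          # hypothetical
    temporal=("2025-07-01", "2025-07-02"),
)
new_vds = earthaccess.open_virtual_mfdataset(
    new_granules, load=False, concat_dim="time", coords="minimal", compat="override"
)

storage = icechunk.s3_storage(bucket="example-bucket", prefix="collection-cube", from_env=True)
repo = icechunk.Repository.open(storage)
session = repo.writable_session("main")
new_vds.virtualize.to_icechunk(session.store, append_dim="time")  # names vary by VirtualiZarr version
session.commit("append 2025-07-01 granules")
```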
-
earthaccess' main goal is to simplify access to NASA data; IMO, second to that is to enable scientific workflows. This goes beyond just providing authenticated file objects for the data granules. I think we have a good opportunity to start playing around (more intentionally) with technologies like VirtualiZarr, Icechunk, and even Pangeo-Forge.
I think the future of data access at scale will require features present in these packages (and then some), and earthaccess can be the bridge between how a file is accessed and these tools. An example:
A researcher needs data that happens to be L2, and the first attempt is the obvious one: search for the granules, open them, and load everything with xarray (a minimal version is sketched below), which turns out to be painfully slow.
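A minimal sketch of that first attempt, with a hypothetical L2 collection short name standing in for whatever the researcher actually needs:

```python
# Naive first attempt: search CMR, open the remote granules, and hand the
# file-like objects to xarray.
import earthaccess
import xarray as xr

earthaccess.login()
results = earthaccess.search_data(
    short_name="SOME-L2-COLLECTION",        # hypothetical
    temporal=("2024-01-01", "2024-01-07"),
    count=20,
)
files = earthaccess.open(results)

# Even opening these lazily can take a very long time: every HDF5 metadata
# request is a separate round trip to the archive, and for many L2 products the
# default combine will not even succeed without a pre-process step (see below).
ds = xr.open_mfdataset(files, engine="h5netcdf")
```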
The first thing earthaccess can do is provide users with guidance on why this is slow: a series of warnings about opening remote archival data formats. Next, we should enable the use of fsspec caching strategies and default to `blockcache` or `first`, a long, long overdue change in `earthaccess.open()`. Then we should also mention, "Consider opening the data using a virtual reference" (if available). earthaccess recently released this great work from @ayushnag and @TomNicholas, see: https://earthaccess.readthedocs.io/en/latest/tutorials/dmrpp-virtualizarr/, and if the references are not available, we could guide users on how to build them with `earthaccess.consolidate_metadata()`, especially if they are going to be using a particular set of results in a reproducible workflow.

Second attempt (from the docs):
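Roughly along the lines of the linked tutorial; the exact `earthaccess.open_virtual_mfdataset` signature and the collection short name here are illustrative:

```python
# Second attempt: build the dataset from DMR++ virtual references instead of
# opening every HDF5 file, avoiding most of the per-granule metadata reads.
import earthaccess

earthaccess.login()
results = earthaccess.search_data(
    short_name="SOME-L2-COLLECTION",        # hypothetical
    temporal=("2024-01-01", "2024-01-07"),
)

def preprocess(ds):
    # Give each granule a dimension to concatenate on; for many L2 products
    # this is the orbit (or the granule time stamp).
    return ds.expand_dims("orbit")

ds = earthaccess.open_virtual_mfdataset(
    results,
    load=True,             # return a readable (lazily indexed) dataset, not raw references
    preprocess=preprocess,
    concat_dim="orbit",
    coords="minimal",
    compat="override",
)
```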
This already works! 🚀 But in most cases, if we need multiple granules, it requires some a priori knowledge about the data; this is why we have to use a pre-process function on each granule we open. Here is another idea: a community-maintained collection of pre-process functions that can be used through earthaccess, so that when we load many files we increase the chances of opening a valid data cube with Xarray. In this case the function just expands the dimensions to concatenate on the orbit, but in many cases it will validate, drop, add, or modify variables and attributes.
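One possible shape for that idea, entirely hypothetical (the registry and collection names are made up for illustration):

```python
# Hypothetical community-maintained registry: map collection short names to the
# pre-process function that makes their granules concatenation-friendly.
from typing import Callable
import xarray as xr

PREPROCESSORS: dict[str, Callable[[xr.Dataset], xr.Dataset]] = {}

def register(short_name: str):
    """Register a pre-process function for a given collection short name."""
    def wrapper(func: Callable[[xr.Dataset], xr.Dataset]):
        PREPROCESSORS[short_name] = func
        return func
    return wrapper

@register("SOME-L2-COLLECTION")  # hypothetical collection
def expand_orbit(ds: xr.Dataset) -> xr.Dataset:
    # Make each granule concatenable along a new "orbit" dimension; other
    # collections might validate, drop, add, or rename variables here instead.
    return ds.expand_dims("orbit")

# earthaccess (or the user) could then look up the right function by short name:
# preprocess = PREPROCESSORS.get(short_name)
```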
Next, as more DAACs start building these references, earthaccess can easily retrieve them so we can use them directly in Xarray.
Perhaps DAACs will build references per year, per area of interest, or even a cube for the whole dataset. That is something we'll have to coordinate so we can identify what is available and how we can present it to the user. Compatible collections could be labeled as "Virtualizable".
Lastly, we should try to package this set of access patterns so their use is intuitive and simple.
Also tagging @DeanHenze