ENH: Different backends, external catalog formats #622

aulemahal · 2023-07-07T21:38:09Z

Is your feature request related to a problem? Please describe.
At Ouranos, we use intake-esm to catalog our on-premise data. There are a few types of datasets that produce enormous catalog files, which are then slow and heavy to manipulate in-memory with pandas and intake-esm. (The biggest culprit is the data from our RCM that has a single netCDF file for each variable and month, and there's a good supply of simulations...)

Adjacent problem: intake-esm supports having a list of variables in the variable column, but CSV has no clean way to represent that, so hacky solutions have to be used.
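To make the workaround concrete, here is a minimal sketch of the usual trick: the list is stored in the CSV as a quoted string and parsed back with `ast.literal_eval`. The file contents, column names and the `read_catalog` helper are made up for illustration. With pandas, the same idea can be expressed through the `converters` argument of `read_csv`.

```python
import ast
import csv
import io

# Hypothetical catalog: the "variable" column holds a stringified Python list.
CSV_TEXT = """path,variable,frequency
/data/sim1_197001.nc,"['tas', 'pr']",mon
/data/sim1_197002.nc,"['tasmax']",mon
"""


def read_catalog(text, list_columns=("variable",)):
    """Read CSV rows, turning stringified lists back into tuples."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in list_columns:
            # literal_eval only accepts Python literals, so it is safer than eval.
            row[col] = tuple(ast.literal_eval(row[col]))
        rows.append(row)
    return rows


catalog = read_catalog(CSV_TEXT)
```

It works, but every reader and writer of the file has to know which columns are encoded this way, which is why it feels hacky.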

Describe the solution you'd like
It could be interesting to have a choice of catalog backend instead of only pandas DataFrames built from CSVs.

For example, polars provides a few performance improvements over pandas. It can "scan" a CSV instead of reading it into memory, which at least accelerates the creation of the catalog.

Alternatively, a real database could be more interesting than a CSV if it would avoid loading all the lines in memory. Rather, each search call could return a real SQL query.
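As a rough sketch of what "search returns a SQL query" could look like, the snippet below uses the standard-library `sqlite3` module. The `build_query` helper, the `catalog` table and its columns are all hypothetical, and nothing here is existing intake-esm API.

```python
import sqlite3


def build_query(table, **filters):
    """Build a parameterized SELECT from column=value(s) filters.

    Values go through "?" placeholders. Table and column names are
    interpolated directly, so they must come from trusted code.
    """
    clauses, params = [], []
    for column, value in filters.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        placeholders = ", ".join("?" for _ in values)
        clauses.append(f'"{column}" IN ({placeholders})')
        params.extend(values)
    where = " AND ".join(clauses) if clauses else "1"
    return f'SELECT path FROM "{table}" WHERE {where}', params


# Tiny in-memory catalog standing in for a real on-disk database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (path TEXT, variable TEXT, frequency TEXT)")
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?)",
    [
        ("/data/sim1_tas_197001.nc", "tas", "mon"),
        ("/data/sim1_pr_197001.nc", "pr", "mon"),
        ("/data/sim1_tas_19700101.nc", "tas", "day"),
    ],
)

sql, params = build_query("catalog", variable="tas", frequency=["mon"])
paths = [row[0] for row in conn.execute(sql, params)]
```

Only the matching rows ever reach Python; the full table stays in the database.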

Or dask's DataFrame?

In any case, I think the first step would be to generalize ESMCatalogModel so that it can be subclassed for different types of backend. I'm not sure what the minimum API would be though. And I also don't know how this backend choice could be managed in the ESM collection spec itself.
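To give the discussion something to poke at, here is one guess at what a minimal backend interface might look like. `CatalogBackend`, `ListBackend` and the three methods are purely hypothetical names, not the current `ESMCatalogModel` API, and the real minimum is probably larger.

```python
from abc import ABC, abstractmethod


class CatalogBackend(ABC):
    """Hypothetical minimal interface a table backend would implement."""

    @abstractmethod
    def search(self, **query):
        """Return a new backend restricted to the matching rows."""

    @abstractmethod
    def unique(self, column):
        """Return the sorted unique values of a column."""

    @abstractmethod
    def __len__(self):
        """Return the number of rows (assets) in the catalog."""


class ListBackend(CatalogBackend):
    """Toy in-memory backend over a list of dicts, for illustration only."""

    def __init__(self, rows):
        self._rows = list(rows)

    def search(self, **query):
        def matches(row):
            # A scalar filter value is treated as a one-element list.
            return all(
                row[col] in (val if isinstance(val, (list, tuple, set)) else [val])
                for col, val in query.items()
            )

        return ListBackend(r for r in self._rows if matches(r))

    def unique(self, column):
        return sorted({row[column] for row in self._rows})

    def __len__(self):
        return len(self._rows)


backend = ListBackend(
    [
        {"path": "a.nc", "variable": "tas", "source": "RCM1"},
        {"path": "b.nc", "variable": "pr", "source": "RCM1"},
        {"path": "c.nc", "variable": "tas", "source": "RCM2"},
    ]
)
subset = backend.search(variable="tas")
```

A pandas, polars or SQL backend would then implement the same three methods, and the rest of the code would only talk to the abstract class.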

Describe alternatives you've considered
Waiting longer for my current code to run is the most common alternative I've used ;).

To have lighter in-memory DataFrames, we pass a series of dtypes to read_csv_kwargs in our main code. See some code in xscen. But that's only doable there because the column names are kinda fixed within the context of the package. The "category" dtype drastically reduces the size of columns with a lot of repetition. pyarrow is useful for string columns as well.

Notice also the hacky code (above) that parses the lists of variables.
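For readers who have not tried the dtype trick, here is a small self-contained illustration with a synthetic catalog. The column names and sizes are made up. In intake-esm, the same `dtype` mapping would go through `read_csv_kwargs`.

```python
import io

import pandas as pd

# Synthetic catalog: unique paths, but highly repetitive metadata columns.
rows = [f"/data/sim_{i:05d}.nc,{'tas' if i % 2 else 'pr'},mon" for i in range(10_000)]
csv_text = "path,variable,frequency\n" + "\n".join(rows)

# Default dtypes versus categorical dtypes for the repetitive columns.
plain = pd.read_csv(io.StringIO(csv_text))
compact = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"variable": "category", "frequency": "category"},
)

plain_bytes = int(plain.memory_usage(deep=True).sum())
compact_bytes = int(compact.memory_usage(deep=True).sum())
```

Comparing `plain_bytes` and `compact_bytes` shows the saving. The gain depends on how repetitive each column is, so a column of unique paths would not benefit.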

Additional context
I guess there are two distinct things in my suggestions:

Allowing more input formats than CSV for external catalogs (sql, parquet, etc)
Allowing a different table backend for potential performance improvement (pandas, dask, polars, sql, etc)

Sadly, I don't have time to work on this myself. However, if this issue gains momentum and is of interest for more than just my group, my organization might be willing to invest some resources, most likely through an internship.


mgrover1 · 2024-05-02T12:57:26Z

Revisiting this @aulemahal - I think coordinating with @nocollier's work on intake-esgf would be helpful. Within that package, the core idea is implementing different indexes (catalogs in the intake-esm space), each with a set of methods: search, get_file_info, and from_tracking_ids. The package currently works with Solr databases, as well as the Globus-hosted Elasticsearch index.

At the ESGF conference last week, we brought up the desire to coordinate efforts between intake-esm and intake-esgf, mainly making the two catalogs cross-compatible. Perhaps it would be easier to set up a call to discuss the next steps here? Coordination on this effort would be great!

aulemahal · 2024-05-09T17:58:26Z

Hi @mgrover1! I would be available for a video call. I can't promise much development time from our side, but at least I can pitch in with ideas and discussion.

mgrover1 · 2024-05-09T18:50:32Z

Is there a day/time that works best for you next week?

aulemahal · 2024-05-09T19:04:19Z

Anytime Tuesday 3-6pm, Wednesday 3-6pm or Thursday 9am-4pm. (EDT, UTC-04)
