ENH: Different backends, external catalog formats #622

Open
aulemahal opened this issue Jul 7, 2023 · 4 comments

Comments

@aulemahal
Contributor

Is your feature request related to a problem? Please describe.
At Ouranos, we use intake-esm to catalog our on-premise data. There are a few types of datasets that produce enormous catalog files, which are then slow and heavy to manipulate in-memory with pandas and intake-esm. (The biggest culprit is the data from our RCM that has a single netCDF file for each variable and month, and there's a good supply of simulations...)

Adjacent problem: intake-esm supports having a list of variables in the variable columns, but CSV has no clean way to represent lists, so hacky workarounds have to be used.
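
To make the list problem concrete, here is a minimal sketch of the usual kind of workaround, using only the standard library: the list is written as a Python literal inside a quoted CSV cell and parsed back with `ast.literal_eval`. The column names, paths, and the `read_catalog` helper are made up for the illustration and are not intake-esm's actual code.

```python
import ast
import csv
import io

# Hypothetical catalog: the "variable" cell holds a Python-literal list,
# quoted so that the commas inside it are not read as column separators.
CATALOG_CSV = '''path,variable,frequency
/data/sim1/day.zarr,"['tas', 'pr']",day
/data/sim1/mon.zarr,"['tasmax']",mon
'''


def read_catalog(text, list_columns=("variable",)):
    """Read CSV rows, turning stringified lists back into tuples."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in list_columns:
            # literal_eval only accepts Python literals, unlike eval().
            row[col] = tuple(ast.literal_eval(row[col]))
        rows.append(row)
    return rows


rows = read_catalog(CATALOG_CSV)
# A search on a list-valued column becomes a membership test.
matches = [r["path"] for r in rows if "pr" in r["variable"]]
```

The catch is that every reader of the file has to know which columns need this treatment, which is exactly what a format with native list types (parquet, for instance) would avoid.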

Describe the solution you'd like
It could be interesting to have a choice of catalog backend instead of only pandas DataFrames read from CSVs.

For example, polars provides a few performance improvements over pandas. In particular, it can "scan" a CSV instead of reading it into memory, which at least speeds up the creation of the catalog.

Alternatively, a real database could be more interesting than a CSV if it would avoid loading all the rows into memory. Rather, each search call could return a real SQL query.
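
A rough sketch of what "search as a SQL query" could look like, using the standard-library `sqlite3` module. The table layout, the `ALLOWED_COLUMNS` set, and this `search` function are all assumptions for the sake of the example; nothing like this exists in intake-esm today.

```python
import sqlite3

# Hypothetical catalog table; column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (path TEXT, variable TEXT, frequency TEXT)")
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?)",
    [
        ("/data/sim1/tas_200001.nc", "tas", "day"),
        ("/data/sim1/pr_200001.nc", "pr", "day"),
        ("/data/sim1/tas_mon.nc", "tas", "mon"),
    ],
)

ALLOWED_COLUMNS = {"path", "variable", "frequency"}


def search(conn, **query):
    """Translate keyword filters into a parameterized SELECT."""
    unknown = set(query) - ALLOWED_COLUMNS
    if unknown:
        raise KeyError(f"unknown catalog columns: {sorted(unknown)}")
    sql = "SELECT path FROM catalog"
    if query:
        # Column names come from the allow-list; values are bound parameters.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in query)
    sql += " ORDER BY path"
    return [row[0] for row in conn.execute(sql, tuple(query.values()))]


paths = search(conn, variable="tas", frequency="day")
```

Only the matching rows ever reach Python here; the full table stays in the database.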

Or dask's DataFrame?

In any case, I think the first step would be to generalize ESMCatalogModel so that it can be subclassed for different types of backend. I'm not sure what the minimum API would be though. And I also don't know how this backend choice could be managed in the ESM collection spec itself.
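
As a starting point for discussion, a minimal sketch of what such an interface might look like. `CatalogBackend`, `ListBackend`, and the three methods are hypothetical names chosen for this example; they do not reflect the current `ESMCatalogModel` API, and the real minimum API would likely need more (serialization, aggregation keys, etc.).

```python
from abc import ABC, abstractmethod


class CatalogBackend(ABC):
    """Hypothetical minimal interface; not part of intake-esm."""

    @abstractmethod
    def search(self, **query):
        """Return a new backend restricted to the matching rows."""

    @abstractmethod
    def unique(self, column):
        """Return the sorted distinct values of a column."""

    @abstractmethod
    def __len__(self):
        """Return the number of catalog entries."""


class ListBackend(CatalogBackend):
    """Toy in-memory backend storing rows as a list of dicts."""

    def __init__(self, rows):
        self._rows = list(rows)

    def search(self, **query):
        keep = [
            row for row in self._rows
            if all(row.get(col) == val for col, val in query.items())
        ]
        return ListBackend(keep)

    def unique(self, column):
        return sorted({row[column] for row in self._rows})

    def __len__(self):
        return len(self._rows)


cat = ListBackend([
    {"path": "a.nc", "variable": "tas", "frequency": "day"},
    {"path": "b.nc", "variable": "pr", "frequency": "day"},
    {"path": "c.nc", "variable": "tas", "frequency": "mon"},
])
subset = cat.search(variable="tas")
```

A pandas, polars, dask, or SQL backend would then be another subclass, and the rest of the code would only rely on the abstract methods.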

Describe alternatives you've considered
Waiting longer for my current code to run is the most common alternative I've used ;).

To have lighter in-memory DataFrames, we pass a series of dtypes to read_csv_kwargs in our main code. See some code in xscen. But that's only doable there because the column names are kinda fixed within the context of the package. The "category" dtype drastically reduces the size of columns with a lot of repetition. pyarrow is useful for string columns as well.

Notice also the hacky code (above) that parses the lists of variables.
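
As a rough illustration of the dtype effect, here is a small self-contained comparison, assuming pandas is installed. The catalog content is synthetic and the column names are made up; the actual savings depend on how repetitive the real columns are.

```python
import io

import pandas as pd

# Synthetic catalog with highly repetitive columns (values are made up).
lines = ["path,variable,frequency"]
for i in range(1000):
    lines.append(f"/data/sim/file_{i:04d}.nc,{'tas' if i % 2 else 'pr'},day")
csv_text = "\n".join(lines)

plain = pd.read_csv(io.StringIO(csv_text))
compact = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"variable": "category", "frequency": "category"},
)

# deep=True counts the string payloads, not just the pointers.
cols = ["variable", "frequency"]
plain_bytes = plain[cols].memory_usage(deep=True).sum()
compact_bytes = compact[cols].memory_usage(deep=True).sum()
```

The same `dtype` mapping is what we hand over through `read_csv_kwargs`.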

Additional context
I guess there are two distinct things in my suggestions:

  1. Allowing more input formats than CSV for external catalogs (sql, parquet, etc)
  2. Allowing a different table backend for potential performance improvement (pandas, dask, polars, sql, etc)

Sadly, I don't have time to work on this myself. However, if this issue gains momentum and is of interest for more than just my group, my organization might be willing to invest some resources, most likely through an internship.

@mgrover1
Collaborator

mgrover1 commented May 2, 2024

Revisiting this @aulemahal - I think coordinating with @nocollier's work on intake-esgf would be helpful. Within that package, the core idea is implementing different indexes (catalogs in the intake-esm space) that share a set of methods: search, get_file_info, and from_tracking_ids. The package currently works with Solr databases, as well as the Globus-hosted Elasticsearch index.

At the ESGF conference last week, we brought up the desire to coordinate efforts between intake-esm and intake-esgf, mainly making the two catalogs cross-compatible. Perhaps it would be easier to set up a call to discuss the next steps here? Coordination on this effort would be great!

@aulemahal
Contributor Author

Hi @mgrover1! I would be available for a video call. I can't promise much development time from our side, but at least I can pitch in with ideas and discussion.

@mgrover1
Collaborator

mgrover1 commented May 9, 2024

Is there a day/time that works best for you next week?

@aulemahal
Contributor Author

Anytime Tuesday 3-6pm, Wednesday 3-6pm or Thursday 9am-4pm (EDT, UTC-04).
