ENH: Different backends, external catalog formats #622
Comments
Revisiting this @aulemahal: I think coordinating with @nocollier's work on intake-esgf would be helpful. Within that package, the core idea is implementing different indexes (catalogs in the intake-esm space) that have a set of methods. At the ESGF conference last week, we brought up the desire to coordinate efforts between intake-esm and intake-esgf, mainly making the two catalogs cross-compatible. Perhaps it would be easier to set up a call to discuss the next steps here? Coordination on this effort would be great!
Hi @mgrover1! I would be available for a video call. I can't promise much development time from our side, but at least I can pitch in with ideas and discussion.
Is there a day/time that works best for you next week?
Anytime Tuesday 3-6pm, Wednesday 3-6pm, or Thursday 9am-4pm (EDT, UTC-04).
Is your feature request related to a problem? Please describe.
At Ouranos, we use `intake-esm` to catalog our on-premise data. A few types of datasets produce enormous catalog files, which are then slow and heavy to manipulate in memory with pandas and intake-esm. (The biggest culprit is the data from our RCM, which has a single netCDF file for each variable and month, and there's a good supply of simulations...)
Adjacent problem: intake-esm supports having a list of variables in the variable column, but that's not cleanly implemented in CSVs, so hacky solutions have to be used.
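To make the list-in-a-CSV problem concrete, here is a minimal sketch of one common workaround: store the list as its Python literal text and parse it back with a `converters` entry in `pandas.read_csv`. The file paths, column names, and values are made up for illustration, and this is not the code used in xscen.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row catalog; the "variable" column holds a list written as text.
CSV_TEXT = '''path,variable
/data/sim1_1990-01.nc,"['tas', 'pr']"
/data/sim1_1990-02.nc,"['tas']"
'''

# ast.literal_eval turns the quoted text back into a real Python list per cell.
df = pd.read_csv(io.StringIO(CSV_TEXT), converters={"variable": ast.literal_eval})

# With real lists in the column, membership can be tested row by row.
has_pr = df["variable"].apply(lambda names: "pr" in names)
paths_with_pr = df[has_pr]["path"].tolist()
print(paths_with_pr)
```

This works, but every cell goes through a Python-level parse, which is part of why it feels hacky and scales poorly on very large catalogs.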
Describe the solution you'd like
It could be interesting to have a choice of catalog backend instead of only pandas DataFrames from CSVs.
For example, `polars` provides a few performance improvements over pandas. For instance, it can "scan" a CSV instead of reading it into memory, which at least accelerates the creation of the catalog.
Alternatively, a real database could be more interesting than a CSV if it would avoid loading all the lines in memory. Rather, each `search` call could return a real SQL query.
Or dask's DataFrame?
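As a rough illustration of the database idea, the sketch below keeps a catalog in SQLite and translates keyword criteria into a parameterized SQL query, so only matching rows are ever pulled into Python. The table layout, the `build_catalog` and `search` helpers, and the sample rows are all hypothetical; none of this is part of intake-esm.

```python
import sqlite3

# Hypothetical catalog rows: (path, variable, frequency).
ROWS = [
    ("/data/sim1_tas_1990-01.nc", "tas", "mon"),
    ("/data/sim1_pr_1990-01.nc", "pr", "mon"),
    ("/data/sim1_tas_1990.nc", "tas", "yr"),
]

# Only these column names may appear in a query (guards the f-string below).
ALLOWED_COLUMNS = {"path", "variable", "frequency"}


def build_catalog(rows):
    """Create an in-memory SQLite table holding the catalog rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE catalog (path TEXT, variable TEXT, frequency TEXT)")
    conn.executemany("INSERT INTO catalog VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def search(conn, **criteria):
    """Return the paths matching all column=value criteria, sorted by path."""
    unknown = set(criteria) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unknown catalog columns: {sorted(unknown)}")
    sql = "SELECT path FROM catalog"
    if criteria:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in criteria)
    sql += " ORDER BY path"
    # Values are bound as parameters; the database does the filtering.
    return [row[0] for row in conn.execute(sql, tuple(criteria.values()))]


conn = build_catalog(ROWS)
matches = search(conn, variable="tas", frequency="mon")
print(matches)
```

An on-disk database file would behave the same way while keeping the full catalog out of memory; the in-memory connection is used here only to keep the example self-contained.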
In any case, I think the first step would be to generalize `ESMCatalogModel` so that it can be subclassed for different types of backend. I'm not sure what the minimum API would be, though. I also don't know how this backend choice could be managed in the ESM collection spec itself.
Describe alternatives you've considered
Waiting longer for my current code to run is the most common alternative I've used ;).
To have lighter in-memory DataFrames, we pass a series of `dtypes` to `read_csv_kwargs` in our main code. See some code in xscen. But that's only doable there because the column names are kinda fixed within the context of the package. The "category" dtype drastically reduces the size of columns with a lot of repetition. `pyarrow` is useful for string columns as well.
Notice also the hacky code (above) that parses the lists of variables.
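The effect of the "category" dtype can be checked with plain pandas. The sketch below builds a small synthetic catalog (hypothetical column names and values) and compares memory use with and without a `dtype` mapping, the same kind of mapping that would be passed through `read_csv_kwargs`.

```python
import io

import pandas as pd

# Synthetic catalog: 1000 files, with highly repetitive "variable" and "frequency" columns.
lines = ["path,variable,frequency"]
for i in range(1000):
    variable = ("tas", "pr")[i % 2]
    lines.append(f"/data/sim_{i:04d}.nc,{variable},mon")
csv_text = "\n".join(lines)

# Default parsing: repetitive columns are stored as one string per row.
plain = pd.read_csv(io.StringIO(csv_text))

# Category parsing: each distinct value is stored once, rows hold small integer codes.
dtypes = {"variable": "category", "frequency": "category"}
compact = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

plain_bytes = int(plain.memory_usage(deep=True).sum())
compact_bytes = int(compact.memory_usage(deep=True).sum())
print(f"default dtypes: {plain_bytes} bytes, category dtypes: {compact_bytes} bytes")
```

The saving only applies to columns with few distinct values; a column such as `path`, where every row is unique, gains nothing from being categorical.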
Additional context
I guess there are two distinct things in my suggestions:
Sadly, I don't have time to work on this myself. However, if this issue gains momentum and is of interest for more than just my group, my organization might be willing to invest some resources, most likely through an internship.