Polars backend #705

Open
charles-turner-1 opened this issue Mar 5, 2025 · 2 comments

@charles-turner-1
Collaborator

Is your feature request related to a problem? Please describe.
Some of our catalogues can take a long time to load, which has a negative impact on user experience.

Describe the solution you'd like
I've previously had a lot of success speeding up dataframe operations by doing everything in Polars and only converting back to a pandas dataframe where necessary. I've done a very basic test, and with next to no work this gives a ~2.5x speed increase: see the diff & screencap below.

```diff
-                df = pd.read_csv(
-                    cat.catalog_file,
-                    storage_options=storage_options,
-                    **read_csv_kwargs,
-                )
-            else:
-                df = pd.DataFrame(cat.catalog_dict)
+                read_csv_kwargs.pop('converters', None)  # Hack, different args for polars
+                df = pl.read_csv(
+                    cat.catalog_file,
+                    storage_options=storage_options,
+                    **read_csv_kwargs,
+                ).to_pandas()
+            else:
+                df = pl.DataFrame(cat.catalog_dict).to_pandas()
```
[Screencap: timing comparison showing the ~2.5x speed increase]

Making this work would require some code changes (it currently breaks a few tests), but I'm fairly confident that I could do this relatively quickly, and that it wouldn't take a great deal of effort to push the performance benefit from ~2.5x to ~10-100x.

I would advocate doing any actual dataframe operations in Polars, and only transforming back to a pandas dataframe when the user asks for one, to avoid any user-facing changes - roughly the shape sketched below.
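
A minimal sketch of that shape, purely for illustration - the class and method names here are hypothetical, not the existing API:

```python
import pandas as pd
import polars as pl


class CatalogFrame:
    """Hypothetical wrapper: keep the catalogue as a Polars DataFrame internally,
    and only materialise a pandas DataFrame when the user asks for one."""

    def __init__(self, df: pl.DataFrame):
        self._df = df  # internal representation stays in Polars

    @classmethod
    def from_csv(cls, path: str, **read_csv_kwargs) -> "CatalogFrame":
        return cls(pl.read_csv(path, **read_csv_kwargs))

    def search(self, **query) -> "CatalogFrame":
        # All filtering happens in Polars; simple equality match per column.
        predicate = pl.all_horizontal(
            [pl.col(name) == value for name, value in query.items()]
        )
        return CatalogFrame(self._df.filter(predicate))

    @property
    def df(self) -> pd.DataFrame:
        # The conversion cost is only paid at the user-facing boundary.
        return self._df.to_pandas()
```

The only place pandas appears is the `df` property, so the conversion cost is paid once, at the boundary the user actually sees.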

NB. Polars has no dependencies, so this has the additional benefit of not making solving the environment a gigantic pain.

Is this something people would be interested in?

@marc-white
Collaborator

How different are the interfaces to pandas and Polars? Would "ripping off the bandaid" and forcing the users to start going in with Polars instead be a worthwhile effort?

I'd also be interested in:

  • What's the time penalty for converting between Polars and pandas, and how does that scale with data size?
  • What happens when a user gets one thing out of the datastore (as a pandas frame), and then passes it back in to the datastore as a comparator/selector for something else? Will that then need to be converted back into Polars internally?

@charles-turner-1
Collaborator Author

How different are the interfaces to pandas and Polars? Would "ripping off the bandaid" and forcing the users to start going in with Polars instead be a worthwhile effort?

They're really quite different - people say Polars is much more similar to Spark (and dplyr, I think?). I've never used either, but by and large Polars syntax is much more verbose and, IMO, much more suitable for infrastructure code, but less suitable for just playing around. It's plausible that's just because I only really ever use it when pandas is insufficient, so the syntax isn't quite as familiar.
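
For a flavour of the difference, here's the same trivial filter written both ways (made-up example data):

```python
import pandas as pd
import polars as pl

data = {"model": ["ACCESS-OM2", "ACCESS-ESM1-5"], "realm": ["ocean", "atmos"]}

# pandas: terse, boolean-mask style
pdf = pd.DataFrame(data)
ocean_pd = pdf[pdf["realm"] == "ocean"]

# Polars: expression based - more verbose, but more explicit
pldf = pl.DataFrame(data)
ocean_pl = pldf.filter(pl.col("realm") == "ocean")
```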

I'd also be interested in:

  • What's the time penalty for converting between Polars and pandas, and how does that scale with data size?

My experience is that it's generally pretty small. It does seem to scale roughly linearly with data size, but the time pandas takes to read the data also grows linearly, with a larger coefficient - i.e. as the dataset gets larger, it becomes more and more worthwhile to read with Polars and convert to pandas where necessary.
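
If anyone wants to sanity-check that on something catalogue-sized, a rough timing sketch (synthetic data, so treat the numbers as indicative only):

```python
import time

import numpy as np
import pandas as pd
import polars as pl

# Synthetic catalogue-ish table; bump n_rows to probe how the cost scales.
n_rows = 1_000_000
pdf = pd.DataFrame(
    {
        "path": [f"/some/dir/file_{i}.nc" for i in range(n_rows)],
        "variable": np.random.choice(["temp", "salt", "uvel"], n_rows),
    }
)

start = time.perf_counter()
pldf = pl.from_pandas(pdf)  # pandas -> Polars
back = pldf.to_pandas()     # Polars -> pandas
print(f"round-trip conversion: {time.perf_counter() - start:.3f}s for {n_rows:,} rows")
```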

  • What happens when a user gets one thing out of the datastore (as a pandas frame), and then passes it back in to the datastore as a comparator/selector for something else? Will that then need to be converted back into Polars internally?

I think moving all the internals to Polars and then only generating a pandas dataframe when the user asks for one would actually be the fastest way to do things. We'd need to convert back and forth into/from Polars dataframes, but those operations are pretty cheap.
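
For the case where a user passes a pandas frame back in, a hedged sketch of how that could be handled internally (the helper name is made up):

```python
import pandas as pd
import polars as pl


def select_matching(internal: pl.DataFrame, user_frame: pd.DataFrame, on: str) -> pl.DataFrame:
    """Hypothetical helper: the user hands back a pandas frame they pulled out earlier;
    convert it once, then do the actual filtering in Polars."""
    selector = pl.from_pandas(user_frame)
    # Semi-join keeps internal rows whose `on` value appears in the user's selection.
    return internal.join(selector.select(on), on=on, how="semi")
```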
