Polars backend #705

Open
charles-turner-1 opened this issue Mar 5, 2025 · 2 comments

@charles-turner-1
Collaborator

Is your feature request related to a problem? Please describe.
Some of our catalogues can take a long time to load, which has a negative impact on user experience.

Describe the solution you'd like
I've previously had a lot of success speeding up dataframe operations by doing everything in Polars and only converting back to a pandas dataframe where necessary. I've done a very basic test, and with next to no work this gives a ~2.5x speed increase: see the diff & screencap below.

```diff
-                df = pd.read_csv(
-                    cat.catalog_file,
-                    storage_options=storage_options,
-                    **read_csv_kwargs,
-                )
-            else:
-                df = pd.DataFrame(cat.catalog_dict)
+                read_csv_kwargs.pop('converters', None)  # Hack, different args for polars
+                df = pl.read_csv(
+                    cat.catalog_file,
+                    storage_options=storage_options,
+                    **read_csv_kwargs,
+                ).to_pandas()
+            else:
+                df = pl.DataFrame(cat.catalog_dict).to_pandas()
```
[Screencap: timing comparison showing the ~2.5x speed increase]

Making this work would require some code changes (it currently breaks a few tests), but I'm fairly confident that I could do this relatively quickly, and that it wouldn't take a great deal of effort to push the performance benefit from ~2.5x to ~10-100x.

I would advocate doing any actual dataframe operations in Polars, and only transforming back to a pandas dataframe when the user asks for one, to avoid any user-facing changes - roughly the shape sketched below.
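
A minimal sketch of that shape, purely for illustration - the class and method names here are hypothetical, not the existing API:

```python
import pandas as pd
import polars as pl


class CatalogFrame:
    """Hypothetical wrapper: keep the catalogue as a Polars DataFrame internally,
    and only materialise a pandas DataFrame when the user asks for one."""

    def __init__(self, df: pl.DataFrame):
        self._df = df  # internal representation stays in Polars

    @classmethod
    def from_csv(cls, path: str, **read_csv_kwargs) -> "CatalogFrame":
        return cls(pl.read_csv(path, **read_csv_kwargs))

    def search(self, **query) -> "CatalogFrame":
        # All filtering happens in Polars; simple equality match per column.
        predicate = pl.all_horizontal(
            [pl.col(name) == value for name, value in query.items()]
        )
        return CatalogFrame(self._df.filter(predicate))

    @property
    def df(self) -> pd.DataFrame:
        # The conversion cost is only paid at the user-facing boundary.
        return self._df.to_pandas()
```

The only place pandas appears is the `df` property, so the conversion cost is paid once, at the boundary the user actually sees.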

NB. Polars has no dependencies, so this has the additional benefit of not making solving the environment a gigantic pain.

Is this something people would be interested in?

@marc-white
Collaborator

How different are the interfaces to pandas and Polars? Would "ripping off the bandaid" and forcing the users to start going in with Polars instead be a worthwhile effort?

I'd also be interested in:

  • What's the time penalty for converting between Polars and pandas, and how does that scale with data size?
  • What happens when a user gets one thing out of the datastore (as a pandas frame), and then passes it back in to the datastore as a comparator/selector for something else? Will that then need to be converted back into Polars internally?

@charles-turner-1
Collaborator Author

How different are the interfaces to pandas and Polars? Would "ripping off the bandaid" and forcing the users to start going in with Polars instead be a worthwhile effort?

They're really quite different - people say Polars is much more similar to Spark (and dplyr, I think?). I've never used either, but by and large Polars syntax is much more verbose and, IMO, much more suitable for infrastructure code, but less suitable for just playing around. It's plausible that's just because I only really ever use it when pandas is insufficient, so the syntax isn't quite as familiar.
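
For a flavour of the difference, here's the same trivial filter written both ways (made-up example data):

```python
import pandas as pd
import polars as pl

data = {"model": ["ACCESS-OM2", "ACCESS-ESM1-5"], "realm": ["ocean", "atmos"]}

# pandas: terse, boolean-mask style
pdf = pd.DataFrame(data)
ocean_pd = pdf[pdf["realm"] == "ocean"]

# Polars: expression based - more verbose, but more explicit
pldf = pl.DataFrame(data)
ocean_pl = pldf.filter(pl.col("realm") == "ocean")
```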

I'd also be interested in:

  • What's the time penalty for converting between Polars and pandas, and how does that scale with data size?

My experience is that it's generally pretty small. It does seem to scale roughly linearly with data size, but the time pandas takes to read the data also grows linearly, with a larger coefficient - i.e. as the dataset gets larger, it becomes more and more worthwhile to read with Polars and convert to pandas where necessary.
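
If anyone wants to sanity-check that on something catalogue-sized, a rough timing sketch (synthetic data, so treat the numbers as indicative only):

```python
import time

import numpy as np
import pandas as pd
import polars as pl

# Synthetic catalogue-ish table; bump n_rows to probe how the cost scales.
n_rows = 1_000_000
pdf = pd.DataFrame(
    {
        "path": [f"/some/dir/file_{i}.nc" for i in range(n_rows)],
        "variable": np.random.choice(["temp", "salt", "uvel"], n_rows),
    }
)

start = time.perf_counter()
pldf = pl.from_pandas(pdf)  # pandas -> Polars
back = pldf.to_pandas()     # Polars -> pandas
print(f"round-trip conversion: {time.perf_counter() - start:.3f}s for {n_rows:,} rows")
```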

  • What happens when a user gets one thing out of the datastore (as a pandas frame), and then passes it back in to the datastore as a comparator/selector for something else? Will that then need to be converted back into Polars internally?

I think moving all the internals to Polars and then only generating a pandas dataframe when the user asks for one would actually be the fastest way to do things. We'd need to convert back and forth into/from Polars dataframes, but those operations are pretty cheap.
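
For the case where a user passes a pandas frame back in, a hedged sketch of how that could be handled internally (the helper name is made up):

```python
import pandas as pd
import polars as pl


def select_matching(internal: pl.DataFrame, user_frame: pd.DataFrame, on: str) -> pl.DataFrame:
    """Hypothetical helper: the user hands back a pandas frame they pulled out earlier;
    convert it once, then do the actual filtering in Polars."""
    selector = pl.from_pandas(user_frame)
    # Semi-join keeps internal rows whose `on` value appears in the user's selection.
    return internal.join(selector.select(on), on=on, how="semi")
```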
