Polars backend #705
Comments
How different are the interfaces to pandas and Polars? Would "ripping off the bandaid" and forcing the users to start going in with Polars instead be a worthwhile effort? I'd also be interested in:
They're really quite different: people say polars is much more similar to Spark (and dplyr, I think?). I've never used either, but by and large polars syntax is much more verbose and, IMO, much more suitable for infrastructure code but less suitable for just playing around. It's plausible that's just because I only really ever use it when pandas is insufficient, so the syntax isn't quite as familiar.
My experience is that it's generally pretty small. It does seem to scale roughly linearly with data size, but the time taken to read data with pandas grows linearly too, with a greater coefficient, i.e. as the dataset gets larger, it becomes more and more worthwhile to read with polars and convert to pandas where necessary.
I think moving all the internals to polars and then only generating a pandas dataframe when the user asks for a dataframe would actually be the fastest way to do things. We'd need to convert back and forth into/from polars dataframes, but those operations are pretty cheap.
Is your feature request related to a problem? Please describe.
Some of our catalogues can take a long time to load - this has a negative impact on user experience.
Describe the solution you'd like
I've previously had a lot of success speeding dataframe operations up by doing everything in Polars, and then converting back to a pandas dataframe where necessary. I've done a very basic test, and with next to no work, this gives a ~2.5x speed increase: see diff & screencap.
Making this work would require some code changes (it currently breaks a few tests), but I'm fairly confident that I could do this relatively quickly, and that it wouldn't take a great deal of effort to take the performance benefit from ~2.5x to ~10-100x.
I would advocate doing any actual dataframe operations in polars, and then transforming back to a pandas dataframe only when the user asks for one, to avoid any user-facing changes.
NB. Polars has no dependencies, so this has the additional benefit of not making solving the environment a gigantic pain.
Is this something people would be interested in?