Regression/summarizer performance collapse? #74

Open
dgrnbrg opened this issue Sep 25, 2019 · 2 comments
Comments

dgrnbrg commented Sep 25, 2019

Hello, I am using many regressions in parallel over a single call to summarize. I've noticed that if I run ~20 regressions on a dataset with 5M rows, it seems to take 45-60 minutes to summarize. If I run a single regression on a similarly-sized dataset, however, it only takes a minute or two to summarize. What kinds of performance characteristics should I expect, and how can I avoid this kind of performance collapse?

Thank you!

icexelloss (Member) commented Sep 26, 2019

Hi @dgrnbrg! When you say in parallel, are you running multiple regressions on a summarize call, i.e.,

df.summarize([regression1, regression2...])

Or are you calling df.summarize(regressionX) from multiple Python threads?

dgrnbrg (Author) commented Oct 1, 2019

Hey @icexelloss! I am running multiple (20-ish) regressions on a summarize call. I found that it's very fast if I run 4-6 regressions per call, but the performance hits a cliff at some point. This is also on full calls to summarize, so I don't think it's a streaming windows thing.
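
For reference, here is a minimal sketch of the batching workaround implied by the observation above: split the list of regressions into groups of about five and issue one summarize call per group. The helper names `chunked` and `summarize_in_batches` are hypothetical and not part of the library; the only thing assumed about `df` is that `df.summarize` accepts a list of summarizers, as in the `df.summarize([regression1, regression2...])` form quoted earlier. Whether this avoids the slowdown on a real 5M-row dataset has not been verified, and it does not explain why the cliff occurs.

```python
from typing import Any, Iterator, List, Sequence


def chunked(items: Sequence[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def summarize_in_batches(df: Any, summarizers: Sequence[Any], batch_size: int = 5) -> List[Any]:
    """Call df.summarize once per batch and return the per-batch results in order.

    Hypothetical helper: it assumes only that `df.summarize` accepts a list of
    summarizers. Combining the per-batch results is left to the caller.
    """
    return [df.summarize(batch) for batch in chunked(summarizers, batch_size)]
```

With 20 regressions and `batch_size=5` this makes four summarize calls, so each call stays inside the 4-6 range that was reported as fast. Each call is a separate pass over the data, so the total cost is roughly four single-pass runs rather than one.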
