Regression/summarizer performance collapse? #74

dgrnbrg · 2019-09-25T14:36:58Z

Hello, I am running many regressions in parallel within a single call to summarize. I've noticed that if I run ~20 regressions on a dataset with 5M rows, the summarize call takes 45-60 minutes. If I run a single regression on a similarly sized dataset, however, it takes only a minute or two. What performance characteristics should I expect, and how can I avoid this kind of performance collapse?

Thank you!

icexelloss · 2019-09-26T17:14:17Z

Hi @dgrnbrg! When you say "in parallel", are you running multiple regressions in a single summarize call, i.e.,

df.summarize([regression1, regression2...])

or are you calling df.summarize(regressionX) from multiple Python threads?

dgrnbrg · 2019-10-01T14:22:14Z

Hey @icexelloss! I am running multiple (20-ish) regressions on a summarize call. I found that it's very fast if I run 4-6 regressions per call, but the performance hits a cliff at some point. This is also on full calls to summarize, so I don't think it's a streaming windows thing.
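
For readers who hit the same cliff, the batching workaround described above (a few regressions per summarize call rather than ~20) could be sketched roughly as follows. This is only an illustration, not something confirmed by the library maintainers: `chunked` and `summarize_in_batches` are hypothetical helper names, the default batch size of 5 is simply taken from the "4-6 regressions per call" observation, and the only assumption about `df.summarize` is that it accepts a list of summarizers, as in the call quoted earlier in the thread.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")  # a summarizer object (e.g. one regression)
R = TypeVar("R")  # whatever a single summarize call returns


def chunked(items: Sequence[S], size: int) -> List[List[S]]:
    """Split items into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def summarize_in_batches(
    summarize: Callable[[List[S]], R],
    summarizers: Sequence[S],
    batch_size: int = 5,
) -> List[R]:
    """Call `summarize` once per batch and return the per-batch results.

    Hypothetical helper: `summarize` is expected to be something like
    `df.summarize`, called with a list of summarizers.
    """
    return [summarize(batch) for batch in chunked(summarizers, batch_size)]


# Intended usage (assumption, not run here):
#   results = summarize_in_batches(df.summarize, regressions, batch_size=5)
```

Each element of the returned list is the result of one summarize call, so the per-batch results would still need to be combined by the caller. Whether this actually avoids the slowdown in a given setup is untested here.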
