
profiling package and results for pandas extensions #11

Merged: 12 commits merged into master from profiling, Apr 3, 2019

Conversation

@jesteria (Member) commented Apr 2, 2019

Branch summary

The package below was used to generate the following performance results for the Ohio-Pandas integration extensions to DataFrame – pg_copy_to & pg_copy_from.

Results summary

In general, all solutions involving the Ohio extensions require significantly less RAM; and, (except for the buffer_size=1 case, which is no longer the default), Ohio extensions are also significantly faster.

Unlike the Pandas built-in methods to_sql and read_sql, a custom PostgreSQL COPY via io.StringIO is competitive – but still significantly slower and more RAM-intensive than the Ohio extensions. (And the Ohio extensions have the added benefit of being one-liners – simple method invocations, "built-in" to DataFrame once the integration is loaded.)
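(Roughly, the comparison looks like this – a minimal sketch, not the profiling package's exact code; the engine DSN and table name are placeholders, and the table is assumed to exist:)

import io

import pandas as pd
import sqlalchemy

import ohio.ext.pandas  # noqa: F401 – loading this registers the DataFrame extensions

engine = sqlalchemy.create_engine('postgresql:///mydb')  # placeholder DSN
df = pd.DataFrame({'x': range(10), 'y': range(10)})

# Ohio extension: a one-liner, streaming to COPY without first
# materializing the whole frame as a string
df.pg_copy_to('my_table', engine)

# StringIO alternative: encode the entire frame to CSV in memory,
# then hand the buffer to COPY (here via psycopg2's copy_expert)
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False)
buffer.seek(0)

conn = engine.raw_connection()
with conn.cursor() as cursor:
    cursor.copy_expert('COPY my_table FROM STDIN WITH CSV', buffer)
conn.commit()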

Profiling

The profiling package, prof, is executable as a package, (like a module):

$ python -m prof …
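(It's executable this way because the package includes a __main__.py – hypothetically, something as simple as:)

# prof/__main__.py – executed by `python -m prof`
from . import main  # assumed entry point

if __name__ == '__main__':
    main()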

I added a management command wrapper largely for added documentation and cleanliness:

$ manage profile …

Generally, I used variations of the following command:

$ time manage profile - --random --count 3 --plot ./plots/ ../dirtyduck/risks_aggregation.csv | tee profile-$(date +%s).txt

I.e.:

  • Profile everything 3 times
  • Plot averages with error bars
  • Use some dirtyduck data – (it doesn't really matter what the CSV is, but it should be big; the package was tested against something like this, with lots of floats and ints, and one as_of_dates column)
  • It's probably best to randomize the order in which things are profiled
  • It takes a long time to run, so GNU time reports the total
  • Store console (debugging) output in a file

Results

These take a really long time to run; so, while you could just run it as above and get plots for each group, I generally filtered down to whichever of the two groups I was trying to test.

Copy from DataFrame to database table

$ time manage profile - --tag-filter "to database" --random --count 3 --plot ./plots/ ../dirtyduck/risks_aggregation.csv | tee profile-$(date +%s).txt

[plot: time & memory for all profilers, copy from DataFrame to database]

Basically, to_sql is a lost cause, (even playing with the multi parameter). However, the StringIO method comes close; and, at this scale, it's unclear just how close.
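(For reference, the to_sql variants profiled look roughly like this – 'my_table' and engine are placeholders:)

# default method: pandas INSERTs via the DBAPI's executemany
df.to_sql('my_table', engine, if_exists='append', index=False)

# method='multi' batches rows into multi-row INSERT statements,
# with chunksize bounding each batch (cf. pandas_to_sql_multi_100)
df.to_sql('my_table', engine, if_exists='append', index=False,
          method='multi', chunksize=100)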

So, ignoring to_sql:

$ time manage profile - --tag-filter "to database" --filter "^(copy|ohio)" --random --count 3 --plot ./plots/ ../dirtyduck/risks_aggregation.csv | tee profile-$(date +%s).txt

[plot: time & memory, copy from DataFrame to database – to_sql excluded]

Now we can see more clearly what was apparent in the previous plot: the StringIO version is significantly slower. (This makes sense, since it has to encode all the data into a string before beginning to COPY it into the database.)

The error bars on RAM usage are a mess, however. It doesn't help that I'm comparing a ton of different configurations of Ohio, (including unreleased ones, which this profiling has almost certainly proven to be unnecessary).

These RAM usage measurements were tough to make seem "true." Running all profilers, by default, in a subprocess made a huge difference; but, for whatever reason, you can still get wide variance – perhaps because I'm just running these on my laptop, where, even at three trials, one weird run can throw things off.
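(The gist of the subprocess isolation, as a minimal sketch – not the package's actual implementation:)

import multiprocessing
import resource

def _run_and_measure(target, results):
    target()
    # peak resident set size of this (child) process – KB on Linux
    results.put(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

def profile_memory(target):
    """Run `target` in a fresh process, so its allocations can't
    pollute – or be polluted by – the parent's heap."""
    results = multiprocessing.Queue()
    process = multiprocessing.Process(target=_run_and_measure,
                                      args=(target, results))
    process.start()
    process.join()
    return results.get()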

In conclusion, I'd say that Ohio – especially with the current default configuration – does fare better on RAM usage, as well. And, that's just with this particular dataset. Unlike the StringIO version's, the overhead of Ohio's solution shouldn't increase linearly with the size of the data.

Copy from database table to DataFrame

$ time manage profile - --tag-filter "to dataframe" --random --count 3 --plot ./plots/ ../dirtyduck/risks_aggregation.csv | tee profile-$(date +%s).txt

[plot: time & memory for all profilers, copy from database to DataFrame]

Pandas – read_sql – is again a huge outlier in slowness and RAM usage.

$ time manage profile - -f "(ohio|stringio)" --tag-filter "to dataframe" --random --count 3 --plot ./plots/ ../dirtyduck/risks_aggregation.csv | tee profile-$(date +%s).txt

[plot: time & memory, copy from database to DataFrame – ohio & stringio only]

Unlike in the previous context, here StringIO does much worse on RAM. And, again, it's significantly slower, (if not tremendously so).

The Ohio solutions all do about the same, (and it's unclear which of them is definitely best).


All that said, I'm happy to run more trials. And, with this package, others are welcome to do the same! 😸

* profilers for copying dataframe to database
* profilers moved to subpackage
* profiler group tagging
* profilers executed in subprocess to ensure memory measurements
* automatic plotting of memory & time results for each profiler group
* randomize profiler order in group
* run multiple trials
* filter by profiler function name and/or by group tag
* removed in-development artifacts (txt, yaml, png)
* made management command wrapper handle arguments and errors correctly
@jesteria jesteria requested review from saleiro and thcrock April 2, 2019 23:41
prof/util.py Outdated


def histogram(stream):
    counts = collections.defaultdict(int)
@thcrock Apr 3, 2019

Is histogram here just reimplementing collections.Counter? Although I usually don't use it this way it appears it can take an iterable in the constructor and I would guess does exactly what histogram is doing here.

https://docs.python.org/3/library/collections.html#collections.Counter
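(For illustration – Counter accepts any iterable and tallies element occurrences, just like the defaultdict version:)

>>> from collections import Counter
>>> Counter('mississippi')
Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})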

@jesteria (Member, Author)

Cool, thanks – I didn't see this added in Python 3. (It's not hard to do, but it did seem like something that belonged there.)

    return f"contextmanager({self.func})"


class contextmanager(_ComposingMixin, Wrapper):

Can this explain why it's a better contextmanager?

@jesteria (Member, Author)

Yep, that was an oversight – documentation added (here in util and in tool).

@thcrock commented Apr 3, 2019

It feels like your gigantic PR introductory text belongs somewhere more persistently visible, like a markdown file in the repo linked to from the README. Technically you are describing changes, but also so much more! I think PR introductory text works better for things that are useful context to the reviewer, but the audience for this text seems like potential users.

jesteria added 2 commits April 3, 2019 13:46
This profiler permits a proper comparison to pg_copy_to, (and exposes
potential opportunities for improvement).
@jesteria (Member, Author) commented Apr 3, 2019

@thcrock I agree! And I fully intend to publish these results in some such way. I just want to be sure of what we have, here, first.

@jesteria (Member, Author) commented Apr 3, 2019

All right I have:

  • moved some experimental stuff into dependent branch experiment/profiling
  • cleaned up some utility stuff and added missing doc strings
  • replaced custom histogram with collections.Counter
  • added a profiler of DataFrame > StringIO > COPY that goes through a DataFrame accessor, such that we're comparing apples-to-apples

The results of the latter are interesting.

[plot: time & memory, DataFrame → StringIO → COPY profilers]

As we saw before, the "generic" version, which uses df.to_csv(string_io), is way slow. It has similar RAM usage; but, that's misleading, since there might be some weird (and/or bad) RAM usage going on when Pandas initiates the extension.

The novelty of the new profiler, which goes through a Pandas extension – df.stringio_copy_to – is that it should get no advantages with RAM. And it's ludicrous how much RAM it does use. (And, again, we can expect this to increase as the payload scales.)

What's interesting is that the StringIO accessor is as fast as it is. It's just using csv.writer.
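(Hypothetically, such an accessor could be wired up via pandas' register_dataframe_accessor – a callable class behaves like a method; the csv.writer core is the relevant part:)

import csv
import io

import pandas as pd

@pd.api.extensions.register_dataframe_accessor('stringio_copy_to')
class StringIOCopyTo:
    def __init__(self, df):
        self._df = df

    def __call__(self, table_name, cursor):
        # csv.writer over plain tuples, rather than DataFrame.to_csv –
        # per the plot above, dramatically faster
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerows(self._df.itertuples(index=False, name=None))
        buffer.seek(0)
        cursor.copy_expert(f'COPY {table_name} FROM STDIN WITH CSV', buffer)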

So …

  1. DataFrame.to_csv(StringIO) might be terrible, (at least without special configuration).
  2. We're still gaining what we set out to gain from pg_copy_to – low, stable RAM usage, and a huge speed increase (thanks to COPY). The speed differential between pg_copy_to and the StringIO version isn't huge, (and the former is hugely preferable for its RAM usage). But, for future speed improvement, we might consider things like replacing PipeTextIO in pg_copy_to with another ohio utility, (as noted in that method's comments). (Or, an asyncio implementation of PipeTextIO might help, as well. See the sketch below for the gist of PipeTextIO.)
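(A stripped-down sketch of the PipeTextIO idea – not ohio's actual implementation: a producer thread writes text through a bounded queue while the consumer reads, so the full payload is never held in memory at once.)

import queue
import threading

_DONE = object()

class SimplePipeText:
    """Stripped-down sketch of the PipeTextIO idea (not ohio's code)."""

    def __init__(self, producer, buffer_size=10):
        self._queue = queue.Queue(maxsize=buffer_size)
        self._done = False
        thread = threading.Thread(target=self._produce, args=(producer,))
        thread.daemon = True
        thread.start()

    def _produce(self, producer):
        producer(self)           # the producer repeatedly calls self.write(...)
        self._queue.put(_DONE)

    def write(self, text):
        self._queue.put(text)    # blocks (backpressure) when the reader lags

    def read(self, size=-1):     # `size` ignored in this sketch
        if self._done:
            return ''
        chunk = self._queue.get()
        if chunk is _DONE:
            self._done = True
            return ''
        return chunk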

@jesteria (Member, Author) commented Apr 3, 2019

I've added an issue, #12, for exploring speed improvement in pg_copy_to.

And we already have #4 for exploring an asyncio implementation in PipeTextIO.

I just kicked off another profiling run, in hopes of getting nice clean plots for pg_copy_to, comparing it to to_sql and to the off-the-cuff StringIO extension noted above, (excluding more experimental trials).

I should probably do the same for pg_copy_from.

(Unfortunately, as fast as my laptop is, it'll take a while. Of course, this is mostly the fault of Pandas to_sql.)


$ time manage profile - --filter "^(pandas|copy|ohio_pg_copy_to_100{,2}$)" --tag-filter "to database" --random --count 3 --plot ./plots/ ../dirtyduck/risks_aggregation.csv | tee ./logs/profile-$(date +%s).txt
[input] size data (rows x columns): 896677 x 83 (74424191)

*********************************** trial 0 ***********************************

----------------------- copy from dataframe to database -----------------------

[ohio_pg_copy_to_10] begin: pg_copy_to(buffer_size=10) {DataFrame → PipeTextIO → COPY}
[ohio_pg_copy_to_10] time (s): 80.74
[ohio_pg_copy_to_10] memory used (mb): 1212 → 3268 (2056 added)
[ohio_pg_copy_to_10] memory overhead (mb): 3268
[ohio_pg_copy_to_10] count table (rows): 896677

[free] memory (mb): 99 → 99 (0 freed)

[ohio_pg_copy_to_1000] begin: pg_copy_to(buffer_size=1000) {DataFrame → PipeTextIO → COPY}
[ohio_pg_copy_to_1000] time (s): 76.47
[ohio_pg_copy_to_1000] memory used (mb): 1212 → 3241 (2029 added)
[ohio_pg_copy_to_1000] memory overhead (mb): 3241
[ohio_pg_copy_to_1000] count table (rows): 896677

[free] memory (mb): 99 → 99 (0 freed)

[pandas_to_sql] begin: pandas.DataFrame.to_sql
[pandas_to_sql] time (s): 720.1
[pandas_to_sql] memory used (mb): 1212 → 8089 (6877 added)
[pandas_to_sql] memory overhead (mb): 8089
[pandas_to_sql] count table (rows): 896677

[free] memory (mb): 99 → 99 (0 freed)

[pandas_to_sql_multi_100] begin: pandas.DataFrame.to_sql(chunksize=100, method='multi')
…

@thcrock left a comment

The code looks good to me. There's probably a better way to brand these in the future. You now have multiple 'like a python standard library but more advanced' things put into Ohio. Some of them are hard to find here if somebody were to want to use them elsewhere (who goes combing through util.py files?)

You could:

  1. Go the toolz route and make boring 'itertoolz' 'functoolz' in the same repo (maybe it wouldn't be called ohio in that case) https://github.com/pytoolz/toolz
  2. Milk the 'ohio' thing into ones that make no sense: 'ohcontext', 'ohfunc'
  3. Expand each of them into its own thing: 'contextliberal', 'funktools'
  4. Something better than any of these

But this works and is fine for now.

@jesteria (Member, Author) commented Apr 3, 2019

@thcrock Hah, sure. I was thinking the same – it might be nice to put some of these "toolz" elsewhere; but, I'm not too worried about that for the moment.

Certainly I won't just put stuff unrelated to io into the public interface of ohio – the work here is just about the repository's profiling package, prof. (If people find prof/util.py on Github and want to use it, cool; but, there's zero expectation of this. It's an internal module, which setup.py doesn't see.)

Perhaps the profiling framework should be split out, such that what's here is more of a stub; but, same story.

Once we want to reuse this stuff, there's work to do. But that's a welcome alternative to restarting from scratch. I think it's better to just make use of it here first, and see, (rather than to make library atomicity a prerequisite).

Anyway, thanks 👍

@jesteria jesteria merged commit 1f208d9 into master Apr 3, 2019
@jesteria jesteria deleted the profiling branch April 3, 2019 20:57
@jesteria jesteria mentioned this pull request Apr 10, 2019
jesteria added a commit that referenced this pull request Apr 11, 2019
basic documentation

**Changes**

* rounded out library docstrings, such that the heart of the (API) documentation is there
* integrated with Sphinx (and autodoc)
  * such that standard documentation can be generated from docstrings, (tho HTML currently unused)
  * submitted patches to the extension [sphinx-contrib.restbuilder](sphinx-contrib/restbuilder#8), such that Sphinx can also build RST docs. (It's a little surprising that Sphinx can't do this on its own – or that its build process can't be intercepted to retrieve RST-formatted output before it's parsed into an abstract tree – considering its input is RST, albeit Sphinx- and autodoc-extended RST. Nonetheless, this appeared to be the easiest way forward. Markdown is another obvious target, but it doesn't cleanly support all the directives you get from RST … and Github won't render README.html.)
  * scripted a development command to generate HTML docs and README.rst

In the end, I don't love [how the auto-generated README.rst looks](https://github.com/dssg/ohio/tree/docs), at least not as Github renders it; but, for something like this, I think it's fine, and keeps API documentation in the code.

A discussion of [benchmarking the Pandas extension](#11) is included in `doc/index.rst` (and from there included in built documentation).

resolves #1