
profiling package and results for pandas extensions #11

Merged: 12 commits merged into master from profiling, Apr 3, 2019

Conversation

@jesteria (Member) commented Apr 2, 2019

Branch summary

The package below was used to generate the following performance results for the Ohio-Pandas integration extensions to DataFrame – pg_copy_to & pg_copy_from.

Results summary

In general, all solutions involving the Ohio extensions require significantly less RAM; and, (except for the buffer_size=1 case, which is no longer the default), Ohio extensions are also significantly faster.

Unlike the Pandas built-in methods to_sql and read_sql, a custom PostgreSQL COPY via io.StringIO is competitive – but still significantly slower and more RAM-intensive than the Ohio extensions. (And the Ohio extensions have the added benefit of being one-liners – simple method invocations, "built-in" to DataFrame once the integration is loaded.)
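(Roughly, the comparison looks like this – a minimal sketch, not the profiling package's exact code; the engine DSN and table name are placeholders, and the table is assumed to exist:)

import io

import pandas as pd
import sqlalchemy

import ohio.ext.pandas  # noqa: F401 – loading this registers the DataFrame extensions

engine = sqlalchemy.create_engine('postgresql:///mydb')  # placeholder DSN
df = pd.DataFrame({'x': range(10), 'y': range(10)})

# Ohio extension: a one-liner, streaming to COPY without first
# materializing the whole frame as a string
df.pg_copy_to('my_table', engine)

# StringIO alternative: encode the entire frame to CSV in memory,
# then hand the buffer to COPY (here via psycopg2's copy_expert)
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False)
buffer.seek(0)

conn = engine.raw_connection()
with conn.cursor() as cursor:
    cursor.copy_expert('COPY my_table FROM STDIN WITH CSV', buffer)
conn.commit()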

Profiling

The profiling package, prof, is executable as a package, (like a module):

$ python -m prof …
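(It's executable this way because the package includes a __main__.py – hypothetically, something as simple as:)

# prof/__main__.py – executed by `python -m prof`
from . import main  # assumed entry point

if __name__ == '__main__':
    main()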

I added a management command wrapper largely for added documentation and cleanliness:

$ manage profile …

Generally, I used variations of the following command:

$ time manage profile - --random --count 3 --plot ./plots/ ../dirtyduck/risks_aggregation.csv | tee profile-$(date +%s).txt

I.e.:

  • Profile everything 3 times
  • Plot averages with error bars
  • Use some dirtyduck data – (it doesn't really matter what the CSV is, but it should be big; the package was tested against something like this, with lots of floats and ints, and one as_of_dates column)
  • It's probably best to randomize the order in which things are profiled
  • It takes a long time to run, so GNU time reports the total
  • Store console (debugging) output in a file

Results

These take a really long time to run; so, while you could just run it as above and get plots for each group, I generally filtered down to whichever of the two groups I was trying to test.

Copy from DataFrame to database table

$ time manage profile - --tag-filter "to database" --random --count 3 --plot ./plots/ ../dirtyduck/risks_aggregation.csv | tee profile-$(date +%s).txt

[plot: time & memory for all profilers, copy from DataFrame to database]

Basically, to_sql is a lost cause, (even playing with the multi parameter). However, the StringIO method comes close; and, at this scale, it's unclear just how close.
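(For reference, the to_sql variants profiled look roughly like this – 'my_table' and engine are placeholders:)

# default method: pandas INSERTs via the DBAPI's executemany
df.to_sql('my_table', engine, if_exists='append', index=False)

# method='multi' batches rows into multi-row INSERT statements,
# with chunksize bounding each batch (cf. pandas_to_sql_multi_100)
df.to_sql('my_table', engine, if_exists='append', index=False,
          method='multi', chunksize=100)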

So, ignoring to_sql:

$ time manage profile - --tag-filter "to database" --filter "^(copy|ohio)" --random --count 3 --plot ./plots/ ../dirtyduck/risks_aggregation.csv | tee profile-$(date +%s).txt

[plot: time & memory, copy from DataFrame to database – to_sql excluded]

Now we can see more clearly what was apparent in the previous plot: the StringIO version is significantly slower. (This makes sense, since it has to encode all the data into a string before beginning to COPY it into the database.)

The error bars on RAM usage are a mess, however. It doesn't help that I'm comparing a ton of different configurations of Ohio, (including unreleased ones, which this profiling has almost certainly proven to be unnecessary).

These RAM usage measurements were tough to make seem "true." Running all profilers, by default, in a subprocess made a huge difference; but, for whatever reason, you can still get wide variance – perhaps because I'm just running these on my laptop, where, even at three trials, one weird run can throw things off.
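(The gist of the subprocess isolation, as a minimal sketch – not the package's actual implementation:)

import multiprocessing
import resource

def _run_and_measure(target, results):
    target()
    # peak resident set size of this (child) process – KB on Linux
    results.put(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

def profile_memory(target):
    """Run `target` in a fresh process, so its allocations can't
    pollute – or be polluted by – the parent's heap."""
    results = multiprocessing.Queue()
    process = multiprocessing.Process(target=_run_and_measure,
                                      args=(target, results))
    process.start()
    process.join()
    return results.get()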

In conclusion, I'd say that Ohio – especially with the current default configuration – does fare better on RAM usage, as well. And, that's just with this particular dataset. Unlike the StringIO version's, the overhead of Ohio's solution shouldn't increase linearly with the size of the data.

Copy from database table to DataFrame

$ time manage profile - --tag-filter "to dataframe" --random --count 3 --plot ./plots/ ../dirtyduck/risks_aggregation.csv | tee profile-$(date +%s).txt

[plot: time & memory for all profilers, copy from database to DataFrame]

Pandas – read_sql – is again a huge outlier in slowness and RAM usage.

$ time manage profile - -f "(ohio|stringio)" --tag-filter "to dataframe" --random --count 3 --plot ./plots/ ../dirtyduck/risks_aggregation.csv | tee profile-$(date +%s).txt

[plot: time & memory, copy from database to DataFrame – ohio & stringio only]

Unlike in the previous context, here StringIO does much worse on RAM. And, again, it's significantly slower, (if not tremendously so).

The Ohio solutions all do about the same, (and it's unclear which of them is definitely best).


All that said, I'm happy to run more trials. And, with this package, others are welcome to do the same! 😸

* profilers for copying dataframe to database
* profilers moved to subpackage
* profiler group tagging
* profilers executed in subprocess to ensure memory measurements
* automatic plotting of memory & time results for each profiler group
* randomize profiler order in group
* run multiple trials
* filter by profiler function name and/or by group tag
* removed in-development artifacts (txt, yaml, png)
* made management command wrapper handle arguments and errors correctly
@jesteria jesteria requested review from saleiro and thcrock April 2, 2019 23:41
prof/util.py Outdated


def histogram(stream):
    counts = collections.defaultdict(int)
@thcrock Apr 3, 2019

Is histogram here just reimplementing collections.Counter? Although I usually don't use it this way it appears it can take an iterable in the constructor and I would guess does exactly what histogram is doing here.

https://docs.python.org/3/library/collections.html#collections.Counter
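(For illustration – Counter accepts any iterable and tallies element occurrences, just like the defaultdict version:)

>>> from collections import Counter
>>> Counter('mississippi')
Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})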

@jesteria (Member, Author)

Cool, thanks – I didn't see this added in Python 3. (It's not hard to do, but it did seem like something that belonged there.)

    return f"contextmanager({self.func})"


class contextmanager(_ComposingMixin, Wrapper):

Can this explain why it's a better contextmanager?

@jesteria (Member, Author)

Yep, that was an oversight – documentation added (here in util and in tool).

@thcrock commented Apr 3, 2019

It feels like your gigantic PR introductory text belongs somewhere more persistently visible, like a markdown file in the repo linked to from the README. Technically you are describing changes, but also so much more! I think PR introductory text works better for things that are useful context to the reviewer, but the audience for this text seems like potential users.

jesteria added 2 commits April 3, 2019 13:46
This profiler permits a proper comparison to pg_copy_to, (and exposes
potential opportunities for improvement).
@jesteria (Member, Author) commented Apr 3, 2019

@thcrock I agree! And I fully intend to publish these results in some such way. I just want to be sure of what we have, here, first.

@jesteria (Member, Author) commented Apr 3, 2019

All right I have:

  • moved some experimental stuff into dependent branch experiment/profiling
  • cleaned up some utility stuff and added missing doc strings
  • replaced custom histogram with collections.Counter
  • added a profiler of DataFrame > StringIO > COPY that goes through a DataFrame accessor, such that we're comparing apples-to-apples

The results of the latter are interesting.

[plot: time & memory, DataFrame → StringIO → COPY profilers]

As we saw before, the "generic" version, which uses df.to_csv(string_io), is way slow. It has similar RAM usage; but, that's misleading, since there might be some weird (and/or bad) RAM usage going on when Pandas initiates the extension.

The novelty of the new profiler, which goes through a Pandas extension – df.stringio_copy_to – is that it should get no advantages with RAM. And it's ludicrous how much RAM it does use. (And, again, we can expect this to increase as the payload scales.)

What's interesting is that the StringIO accessor is as fast as it is. It's just using csv.writer.
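(Hypothetically, such an accessor could be wired up via pandas' register_dataframe_accessor – a callable class behaves like a method; the csv.writer core is the relevant part:)

import csv
import io

import pandas as pd

@pd.api.extensions.register_dataframe_accessor('stringio_copy_to')
class StringIOCopyTo:
    def __init__(self, df):
        self._df = df

    def __call__(self, table_name, cursor):
        # csv.writer over plain tuples, rather than DataFrame.to_csv –
        # per the plot above, dramatically faster
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerows(self._df.itertuples(index=False, name=None))
        buffer.seek(0)
        cursor.copy_expert(f'COPY {table_name} FROM STDIN WITH CSV', buffer)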

So …

  1. DataFrame.to_csv(StringIO) might be terrible, (at least without special configuration).
  2. We're still gaining what we set out to gain from pg_copy_to – low, stable RAM usage, and a huge speed increase (thanks to COPY). The speed differential between pg_copy_to and the StringIO version isn't huge, (and the former is hugely preferable for its RAM usage). But, for future speed improvement, we might consider things like replacing PipeTextIO in pg_copy_to with another ohio utility, (as noted in that method's comments). (Or, an asyncio implementation of PipeTextIO might help, as well. See the sketch below for the gist of PipeTextIO.)
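(A stripped-down sketch of the PipeTextIO idea – not ohio's actual implementation: a producer thread writes text through a bounded queue while the consumer reads, so the full payload is never held in memory at once.)

import queue
import threading

_DONE = object()

class SimplePipeText:
    """Stripped-down sketch of the PipeTextIO idea (not ohio's code)."""

    def __init__(self, producer, buffer_size=10):
        self._queue = queue.Queue(maxsize=buffer_size)
        self._done = False
        thread = threading.Thread(target=self._produce, args=(producer,))
        thread.daemon = True
        thread.start()

    def _produce(self, producer):
        producer(self)           # the producer repeatedly calls self.write(...)
        self._queue.put(_DONE)

    def write(self, text):
        self._queue.put(text)    # blocks (backpressure) when the reader lags

    def read(self, size=-1):     # `size` ignored in this sketch
        if self._done:
            return ''
        chunk = self._queue.get()
        if chunk is _DONE:
            self._done = True
            return ''
        return chunk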

@jesteria (Member, Author) commented Apr 3, 2019

I've added an issue, #12, for exploring speed improvement in pg_copy_to.

And we already have #4 for exploring an asyncio implementation in PipeTextIO.

I just kicked off another profiling run, in hopes of getting nice clean plots for pg_copy_to, comparing it to to_sql and to the off-the-cuff StringIO extension noted above, (excluding more experimental trials).

I should probably do the same for pg_copy_from.

(Unfortunately, as fast as my laptop is, it'll take a while. Of course, this is mostly the fault of Pandas to_sql.)


$ time manage profile - --filter "^(pandas|copy|ohio_pg_copy_to_100{,2}$)" --tag-filter "to database" --random --count 3 --plot ./plots/ ../dirtyduck/risks_aggregation.csv | tee ./logs/profile-$(date +%s).txt
[input] size data (rows x columns): 896677 x 83 (74424191)

*********************************** trial 0 ***********************************

----------------------- copy from dataframe to database -----------------------

[ohio_pg_copy_to_10] begin: pg_copy_to(buffer_size=10) {DataFrame → PipeTextIO → COPY}
[ohio_pg_copy_to_10] time (s): 80.74
[ohio_pg_copy_to_10] memory used (mb): 1212 → 3268 (2056 added)
[ohio_pg_copy_to_10] memory overhead (mb): 3268
[ohio_pg_copy_to_10] count table (rows): 896677

[free] memory (mb): 99 → 99 (0 freed)

[ohio_pg_copy_to_1000] begin: pg_copy_to(buffer_size=1000) {DataFrame → PipeTextIO → COPY}
[ohio_pg_copy_to_1000] time (s): 76.47
[ohio_pg_copy_to_1000] memory used (mb): 1212 → 3241 (2029 added)
[ohio_pg_copy_to_1000] memory overhead (mb): 3241
[ohio_pg_copy_to_1000] count table (rows): 896677

[free] memory (mb): 99 → 99 (0 freed)

[pandas_to_sql] begin: pandas.DataFrame.to_sql
[pandas_to_sql] time (s): 720.1
[pandas_to_sql] memory used (mb): 1212 → 8089 (6877 added)
[pandas_to_sql] memory overhead (mb): 8089
[pandas_to_sql] count table (rows): 896677

[free] memory (mb): 99 → 99 (0 freed)

[pandas_to_sql_multi_100] begin: pandas.DataFrame.to_sql(chunksize=100, method='multi')
…

@thcrock left a comment

The code looks good to me. There's probably a better way to brand these in the future. You now have multiple 'like a python standard library but more advanced' things put into Ohio. Some of them are hard to find here if somebody were to want to use them elsewhere (who goes combing through util.py files?)

You could:

  1. Go the toolz route and make boring 'itertoolz' 'functoolz' in the same repo (maybe it wouldn't be called ohio in that case) https://github.com/pytoolz/toolz
  2. Milk the 'ohio' thing into ones that make no sense: 'ohcontext', 'ohfunc'
  3. Expand each of them into its own thing: 'contextliberal', 'funktools'
  4. Something better than any of these

But this works and is fine for now.

@jesteria (Member, Author) commented Apr 3, 2019

@thcrock Hah, sure. I was thinking the same – it might be nice to put some of these "toolz" elsewhere; but, I'm not too worried about that for the moment.

Certainly I won't just put stuff unrelated to io into the public interface of ohio – the work here is just about the repository's profiling package, prof. (If people find prof/util.py on Github and want to use it, cool; but, there's zero expectation of this. It's an internal module, which setup.py doesn't see.)

Perhaps the profiling framework should be split out, such that what's here is more of a stub; but, same story.

Once we want to reuse this stuff, there's work to do. But that's a welcome alternative to restarting from scratch. I think it's better to just make use of it here first, and see, (rather than to make library atomicity a prerequisite).

Anyway, thanks 👍

@jesteria jesteria merged commit 1f208d9 into master Apr 3, 2019
@jesteria jesteria deleted the profiling branch April 3, 2019 20:57
@jesteria jesteria mentioned this pull request Apr 10, 2019
jesteria added a commit that referenced this pull request Apr 11, 2019
basic documentation

**Changes**

* rounded out library docstrings, such that the heart of the (API) documentation is there
* integrated with Sphinx (and autodoc)
  * such that standard documentation can be generated from docstrings, (tho HTML currently unused)
  * submitted patches to the extension [sphinx-contrib.restbuilder](sphinx-contrib/restbuilder#8), such that Sphinx can also build RST docs. (It's a little surprising that Sphinx can't do this on its own – or that its build process can't be intercepted to retrieve RST-formatted output before it's parsed into an abstract tree – considering its input is RST, albeit Sphinx- and autodoc-extended RST. Nonetheless, this appeared to be the easiest way forward. Markdown is another obvious target, but it doesn't cleanly support all the directives you get from RST … and Github won't render README.html.)
  * scripted a development command to generate HTML docs and README.rst

In the end, I don't love [how the auto-generated README.rst looks](https://github.com/dssg/ohio/tree/docs), at least not as Github renders it; but, for something like this, I think it's fine, and keeps API documentation in the code.

A discussion of [benchmarking the Pandas extension](#11) is included in `doc/index.rst` (and from there included in built documentation).

resolves #1