Documentation #1

Open
TomAugspurger opened this issue Nov 2, 2016 · 8 comments

@TomAugspurger (Member) commented Nov 2, 2016

This is a sketch for some sections of documentation that should go in the README.

What to test?

Ideally, benchmarks measure how long our projects (dask, distributed) spend doing something, not the time spent in the underlying libraries they're built on. We want to limit the variance across runs to just the code we control.

For example, I suspect (self.data.a > 0).compute() is not a great benchmark. My guess (without having profiled) is that the .compute() call takes the majority of the time, and most of that would be spent in pandas / NumPy. (I need to profile all of these. I'm reading through dask now to find places where dask itself is doing a lot of work.)
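
As a rough illustration (the class, sizes, and method names below are made up, not an agreed-on benchmark), we could separate graph construction, which is dask's own work, from the computation that mostly runs in pandas:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd


    class FilterSuite:
        # Hypothetical benchmark separating dask's graph-building work from
        # the pandas/NumPy work triggered by .compute().
        def setup(self):
            pdf = pd.DataFrame({"a": np.arange(100000)})
            self.data = dd.from_pandas(pdf, npartitions=10)

        def time_build_graph(self):
            # Lazy: only constructs the dask graph; no pandas work happens.
            self.data.a > 0

        def time_compute(self):
            # Includes the pandas work; most of this time is outside dask.
            (self.data.a > 0).compute()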

Benchmarking New Code

If you're writing an optimization, say, you can benchmark it by

  • writing a benchmark that exercises your optimization and placing it in benchmarks/
  • setting the repo field in asv.conf.json to the path of your dask / distributed repository on your local file system
  • running asv continuous -f 1.1 upstream/master HEAD (optionally passing -b <regex> to filter to just your benchmark); see the sketch after this list
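
Concretely, that workflow might look like the following (the repository path and the benchmark regex are placeholders):

    # asv.conf.json (excerpt): point "repo" at your local checkout, e.g.
    #     "repo": "/path/to/your/dask",

    # compare your branch against upstream/master, running only benchmarks
    # whose names match the regex
    asv continuous -f 1.1 -b my_optimization upstream/master HEAD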

Naming Conventions

Directory Structure

This repository contains benchmarks for several dask-related projects.
Each project needs its own benchmark directory because asv is built around
one configuration file (asv.conf.json) and benchmark suite per repository.
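
For example, a layout along these lines (directory names are illustrative) gives each project its own asv.conf.json and benchmark suite:

    dask-benchmarks/
        dask/
            asv.conf.json
            benchmarks/
        distributed/
            asv.conf.json
            benchmarks/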

@pitrou (Member) commented Nov 2, 2016

When benchmarking local changes, I also find asv dev to be very useful. Not sure it needs to be mentioned in the README, though.

@pitrou (Member) commented Nov 3, 2016

I think we should also have guidelines for benchmarks:

  • have individual time_xxx methods take on the order of 100-300 ms if possible (obviously some workloads will need more), so that asv can repeat the method several times and output a stable minimum; see the sketch after this list
  • perhaps choose worker counts so as to minimize variability?
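
A sketch of what the first point could look like in practice (the sequence size is a guess and would need tuning so that one call lands in the 100-300 ms range):

    import dask.bag as db


    class MapSuite:
        # Hypothetical benchmark: the input is sized so a single call takes
        # roughly 100-300 ms, letting asv repeat it and report a stable minimum.
        def setup(self):
            self.bag = db.from_sequence(range(100000), npartitions=100)

        def time_map_sum(self):
            self.bag.map(lambda x: x + 1).sum().compute()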

@pitrou (Member) commented Nov 3, 2016

Another issue: which timer function should be used? asv's default timer may not be adequate:
https://asv.readthedocs.io/en/latest/writing_benchmarks.html#timing

Should we measure CPU time or wallclock time? IMHO we should measure wallclock time: if dask or distributed schedules tasks inefficiently and doesn't make full use of the CPU, it's a problem that should appear in the benchmark results.
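
If we do settle on wall-clock time, asv lets a benchmark override its timer (see the timing docs linked above); a minimal sketch, assuming we set it per benchmark class:

    import timeit

    from dask import delayed


    def inc(x):
        return x + 1


    class SchedulingSuite:
        # Measure wall-clock rather than CPU time, so inefficient scheduling
        # (idle CPUs) shows up in the results instead of being hidden.
        timer = timeit.default_timer

        def time_delayed_sum(self):
            delayed(sum)([delayed(inc)(i) for i in range(1000)]).compute()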

@danielballan commented

@TomAugspurger I'm interested in helping with this, partly as a way to become more familiar with the dask API. Is there anything in particular you would prefer me to target, to start?

@TomAugspurger (Member, Author) commented

@danielballan great, thanks! I'm guessing that @mrocklin, @jcrist, and Antoine have the most knowledge on which parts of dask would be best to benchmark.

My current thinking is that we'll have two kinds of benchmarks. The first kind is higher-level benchmarks that hit things like top-level methods on dask.array, dask.bag, and dask.dataframe. The second kind is benchmarks for "internal" methods in places like https://github.com/dask/dask/blob/master/dask/optimize.py.

I think the first kind will be easier to write benchmarks for as you learn the library (that's true for me anyway. ATM I have no idea how to write a good benchmark for something in dask.optimize).

@mrocklin (Member) commented Nov 3, 2016

I agree with @TomAugspurger's classification of high-level external benchmarks and internal ones.

I also agree that high-level external benchmarks are probably both the more useful and the more approachable. Actually, I'm curious if, as with all things, we can steal from Pandas a bit here. Are there benchmarks in Pandas that are appropriate to take?

There are some extreme things we can test as well, such as doing groupby-applies with small dask dataframes with 1000 partitions, or calling

delayed(sum)([delayed(inc)(i) for i in range(1000)]).compute(get=...)

These should be good to stress the administrative side.
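
A sketch of the groupby-apply case (sizes and the meta argument are illustrative):

    import pandas as pd
    import dask.dataframe as dd


    class ManyPartitionGroupby:
        # Hypothetical stress test: a tiny dataframe split into 1000 partitions,
        # so the measured time is dominated by dask's administrative overhead.
        def setup(self):
            pdf = pd.DataFrame({"key": range(1000), "value": range(1000)})
            self.ddf = dd.from_pandas(pdf, npartitions=1000)

        def time_groupby_apply(self):
            self.ddf.groupby("key").value.apply(
                lambda s: s.sum(), meta=("value", "int64")
            ).compute()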

@pitrou (Member) commented Dec 5, 2016

Another question: I see a couple of existing benchmarks parameterize on the get function (multiprocessing.get, threaded.get, etc.). Is this useful/desired? What are we trying to achieve here?
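
For reference, the pattern in question looks roughly like this in the dask API of the time (simplified; the scheduler list and workload are placeholders):

    import dask.bag as db
    import dask.multiprocessing
    import dask.threaded


    class GetSuite:
        # Parameterized over the scheduler ("get") function, as some existing
        # benchmarks in this repository do.
        params = [dask.threaded.get, dask.multiprocessing.get]
        param_names = ["get"]

        def setup(self, get):
            self.bag = db.from_sequence(range(1000), npartitions=10)

        def time_count(self, get):
            self.bag.count().compute(get=get)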

@TomAugspurger (Member, Author) commented

@pitrou for a bit, I was thinking these benchmarks could be helpful for users to see the overall performance characteristics of the various backends across different workloads. In hindsight it's probably best to keep this strictly for devs.

I'll send along a PR to remove those when I get a chance. Been swamped lately.
