
benchmarks: No way to get reproducible results #203

Open
olajep opened this issue Jul 21, 2015 · 10 comments

@olajep (Member) commented Jul 21, 2015

Variance between runs is way too high.

I guess we could wrap the function in a for loop and use the lowest measurement; that would certainly improve things.

But I think we'd be better off using performance counters instead.
Epiphany isn't affected by this, since we already use CTIMERs there.

PAPI seems to have the cross-platform support we need:
http://icl.cs.utk.edu/papi/

@lfochamon (Contributor) commented:

@olajep What do you think about timing several runs of the functions? Most of them run in the nanosecond range, so running each one, say, 1000 times would improve the precision of the measurements. Also, repeating the whole benchmark a few times is a good idea to get some statistics; maybe report minimum, median, and maximum run times. What I mean is something like:

for (i = 0; i < 100; i++) {
    item_preface(&data[i], ...);
    for (j = 0; j < 1000; j++) {
        fun();
    }
    item_done(&data[i], ...);
}

PAPI looks cool... Maybe we could take only the ARM/x86 parts (I see they don't support Windows anymore, not sure if that would be an issue).

@olajep (Member, Author) commented Jul 22, 2015

@lfochamon

> @olajep What do you think about timing several runs of the functions? Most of them run in the ns, maybe running them 1000 times (for example) would improve the precision of the measurements.

Yes, that will certainly improve things. I did some testing now, and even with 32000 iterations there can be a 20 percent difference between two runs (most are pretty close, however), which is a lot better than before (>100%!). But 32000 iterations takes way too long for several functions: benchmarking one of the image functions takes almost two minutes.

> Also, running the benchmark a few times is a good idea to get some statistics. Maybe report minimum, median and maximum run times.

I'm not so sure about that; we should only care about the lowest measurement. Everything else is noise (e.g., context switches, some other process evicting our data from the L2 cache, ...).

> What I mean is something like:
>
>     for (i = 0; i < 100; i++) {
>         item_preface(&data[i], ...);
>         for (j = 0; j < 1000; j++) {
>             fun();
>         }
>         item_done(&data[i], ...);
>     }
Yup, we need the loop, no question about that. The question is how many iterations.

> PAPI looks cool... Maybe we could take only the ARM/x86 parts (I see they don't support Windows anymore, not sure if that would be an issue).

I believe that benchmarking in clock cycles instead will give more stable results.
We also need to size the benchmarks so that they:
i) fit in cache (we don't want to benchmark the memory subsystem), and
ii) are short enough per call that they're not always preempted by the kernel.

I think that should do it.

Cheers,
Ola

@lfochamon (Contributor) commented:

@olajep Hmmm... I thought you could maybe limit the time instead of the number of iterations. But it might complicate things more than it solves, I don't know. My idea was (in retrospect, maybe not a really good one):

volatile int i = 0; /* volatile so the compiler doesn't optimize the loops away */
volatile int j = 0;

/* Run the benchmark until MAX_TIME has elapsed, counting iterations. */
item_preface(&data, item);
while (data->end - data->start < MAX_TIME) {
    item->benchmark(&spec);
    item_done(&data, &spec, item->name);
    i++;
}

/* Measure the overhead of an empty loop with the same iteration count... */
loop_time = platform_clock();
while (i - j > 0)
    j++;
loop_time = platform_clock() - loop_time;

/* ...and subtract it from the measurement. */
data->end -= loop_time;

Maybe clock count is the way to go.

@mansourmoufid (Contributor) commented:

The higher the resolution of the timing, the fewer measurements you need to make. The Parallella has performance counters, right? And PAPI supports Linux on ARM. So it sounds like the right solution. 👍

@mansourmoufid (Contributor) commented:

On ARM, PAPI uses the perf subsystem of the Linux kernel. If you want, you can use perf directly. The SUPERCOP benchmark software does this (look in the file supercop-20141124/cpucycles/perfevent.c). But the perf API is not as nice as PAPI.

@lfochamon (Contributor) commented:

The Linux/ARM timers from PAPI appear to use clock_gettime or gettimeofday (depending on what's available). Clock cycles are estimated by multiplying the time in usec by the clock frequency (viz. linux_timer.c lines 288 and 260). I could be wrong (someone should check), but at least in this case it doesn't seem to help much. The ARM PMU can't be read from user space by default. x86 provides a time stamp counter readable from assembly (rdtsc), and that's what PAPI uses.

I guess for ARM there is no easy way around gettimeofday/clock_gettime or perf_event. Only testing will tell if there's a big difference...

@mansourmoufid (Contributor) commented:

I haven't used PAPI yet, but it should be possible to check at run-time whether it supports hardware counters, by checking if PAPI_num_counters returns zero or a negative value. But first you need perf support in the kernel; on Debian you get that with the linux-tools-$(uname -r) package. I'll try it tomorrow.

@lfochamon (Contributor) commented:

@eliteraspberries I haven't used it either, so take what I say with a grain of salt. I was checking out the source to see how they access the ARM performance counters from user space, but it seems they don't (like you said, only compiling and running PAPI_num_counters will tell for sure).

To me, perf appears to be the way to go on ARM/Linux (using perf_event_open as per http://web.eece.maine.edu/~vweaver/projects/perf_events/programming.html) and rdtsc inline assembly for x86. I'll see if I can contribute something more concrete, but I'm completely swamped for the next 2 weeks.

@Adamszk commented Jul 30, 2015

I measure my code's performance, such as speed, and the timing is occasionally off. I believe 10 iterations is the minimum and 100 is enough; 1000 is a bit too much. An average (or mean) with plus/minus deviation will suffice for one column of speed data. Below is example code in C++ I used to time my code's performance:

    #include <cmath>   // erff
    #include <ctime>   // std::clock, CLOCKS_PER_SEC

    std::clock_t start = std::clock();
    float answer1 = erff(w); // standard algorithm here
    double duration = (std::clock() - start) / (double) CLOCKS_PER_SEC;

@lfochamon (Contributor) commented:

I have tested an inline assembly solution for x86 processors that have rdtsc. To get some statistics, I timed 3 functions (taking 10 ms, 100 ms, and 1 s) using clock(), rdtsc, and QueryPerformanceCounter() from the Windows API (I'm on Windows, so no gettime()). All statistics are calculated over 200 independent measurements, compiled with -O3, and are summarized in the table and plots below.

All methods hit the same average measurement (some with more precision than others). For fast functions, the variance is considerably better using rdtsc or QueryPerformanceCounter(), which suggests that maybe by using the perf_event_open method on Linux we could avoid the inline assembly altogether. I'll try to get back to you guys in a few weeks with tests on that (if no one has done it before). When functions hit the 1 s mark, though, the measurement variances are basically the same. The precision on sub-microsecond measurements is not great, but I haven't tested tens of microseconds yet.

@olajep I pushed the solution using rdtsc (the method used by PAPI) to a branch on my fork (benchmark.c), but I'm still new to Autotools and have no idea how to use autoconf macros to define __x86_64__ (64-bit processors), __i386__ (32-bit processors), and CPU_FREQ (CPU clock frequency).
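For what it's worth, __x86_64__ and __i386__ are predefined by the compiler itself, so plain #ifdef in the source should work without any autoconf help. If configure-time detection is still wanted, a rough configure.ac sketch could use AC_CANONICAL_HOST (CPU_FREQ has no portable autoconf detection and would have to stay a user-supplied value, so it is left out here):

    AC_CANONICAL_HOST
    AS_CASE([$host_cpu],
      [x86_64], [AC_DEFINE([HAVE_RDTSC], [1], [rdtsc is available])],
      [i?86],   [AC_DEFINE([HAVE_RDTSC], [1], [rdtsc is available])])

This is only a sketch; HAVE_RDTSC is a made-up macro name, not something the project defines.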

| Length | Method  | Mean       | Variance     | Ratio to rdtsc |
|--------|---------|------------|--------------|----------------|
| 1000ms | clock() | 1.60234500 | 1.520198e-05 | 1.748116       |
| 1000ms | rdtsc   | 1.60229626 | 8.696210e-06 | 1.000000       |
| 1000ms | WinQPC  | 1.60208964 | 1.735461e-05 | 1.9956522      |
| 100ms  | clock() | 0.16021000 | 3.433065e-06 | 6.507617       |
| 100ms  | rdtsc   | 0.16018014 | 5.275457e-07 | 1.000000       |
| 100ms  | WinQPC  | 0.16020998 | 6.624749e-07 | 1.2557677      |
| 10ms   | clock() | 0.01605000 | 4.168342e-06 | 37.581361      |
| 10ms   | rdtsc   | 0.01610825 | 1.109151e-07 | 1.000000       |
| 10ms   | WinQPC  | 0.01608080 | 6.404925e-08 | 0.5774617      |

[Plot: distributions of the 200 measurements for each method]
