-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathccgrid19.tex
999 lines (929 loc) · 57.1 KB
/
ccgrid19.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
% Created 2019-03-25 Mon 14:11
% Intended LaTeX compiler: pdflatex
\documentclass[conference]{IEEEtran}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{colortbl}
\usepackage{xcolor}
\usepackage{url}
\usepackage{listings}
%\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{multirow}
\usepackage{caption}
\usepackage{hyperref}
\usepackage{booktabs}
\usepackage{array}
\usepackage{relsize}
\usepackage{bm}
\usepackage{wasysym}
\usepackage{ragged2e}
\usepackage{todonotes}
\usepackage{tabularx}
\usepackage{boxedminipage}
\usepackage[all]{nowidow}
\lstset{ %
backgroundcolor={},
basicstyle=\ttfamily\scriptsize,
breakatwhitespace=true,
breaklines=true,
captionpos=n,
extendedchars=true,
frame=n,
rulecolor=\color{black},
showspaces=false,
showstringspaces=false,
showtabs=false,
stepnumber=2,
stringstyle=\color{gray},
tabsize=2,
}
\renewcommand*{\UrlFont}{\ttfamily\smaller\relax}
\makeatletter
\def\maketag@@@#1{\hbox{\m@th\normalfont\normalsize#1}}
\makeatother
\graphicspath{{./img/}}
\renewcommand*{\UrlFont}{\ttfamily\smaller\relax}
\author{\IEEEauthorblockN{\scalebox{.95}{Pedro Bruel\IEEEauthorrefmark{1}\IEEEauthorrefmark{2},
Steven Quinito Masnada\IEEEauthorrefmark{3},
Brice Videau\IEEEauthorrefmark{1},
Arnaud Legrand\IEEEauthorrefmark{1},
Jean-Marc Vincent\IEEEauthorrefmark{1},
Alfredo Goldman\IEEEauthorrefmark{2}}}
\smallskip
\IEEEauthorblockA{\begin{minipage}[t]{.21\linewidth}\centering\IEEEauthorrefmark{2}University of São Paulo \\ São Paulo, Brazil\\
\small\{phrb, gold\}@ime.usp.br\null\vspace{-15pt}\end{minipage}\hfill
\begin{minipage}[t]{.27\linewidth}\centering \IEEEauthorrefmark{3}University of Grenoble Alpes \\ Inria, CNRS, Grenoble INP, LJK \\ 38000 Grenoble, France\\
\small [email protected]\null\vspace{-15pt}\end{minipage}\hfill
\begin{minipage}[t]{.42\linewidth}\centering\IEEEauthorrefmark{1}University of Grenoble Alpes \\ CNRS, Inria, Grenoble INP, LIG \\ 38000 Grenoble, France\\
\small\{arnaud.legrand, brice.videau, jean-marc.vincent\}@imag.fr\null\vspace{-15pt}\end{minipage}}}
\date{\today}
\title{Autotuning under Tight Budget Constraints: \\ A Transparent Design of Experiments Approach}
\hypersetup{
pdfauthor={},
pdftitle={Autotuning under Tight Budget Constraints: \\ A Transparent Design of Experiments Approach},
pdfkeywords={},
pdfsubject={},
pdfcreator={Emacs 26.1 (Org mode 9.2.2)},
pdflang={English}}
\begin{document}
\maketitle
\begin{abstract}
A large amount of resources is spent writing, porting, and optimizing scientific
and industrial High Performance Computing applications, which makes autotuning
techniques fundamental to lower the cost of leveraging the improvements on
execution time and power consumption provided by the latest software and
hardware platforms. Despite the need for economy, most autotuning techniques
still require large budgets of costly experimental measurements to provide good
results, while rarely providing exploitable knowledge after optimization. The
contribution of this paper is a user-transparent autotuning technique based on
Design of Experiments that operates under tight budget constraints by
significantly reducing the measurements needed to find good optimizations. Our
approach enables users to make informed decisions on which optimizations to
pursue and when to stop. We present an experimental evaluation of our approach
and show it is capable of leveraging user decisions to find the best global
configuration of a GPU Laplacian kernel using half of the measurement budget
used by other common autotuning techniques. We show that our approach is also
capable of finding speedups of up to \boldmath\(50\times\), compared to \texttt{gcc}'s \texttt{-O3}, for
some kernels from the SPAPT benchmark suite, using up to \boldmath\(10\times\) fewer
measurements than random sampling.
\end{abstract}
\section{Introduction}
\label{sec:org3961b98}
Optimizing code for objectives such as performance and power consumption is
fundamental to the success and cost-effectiveness of industrial and scientific
endeavors in High Performance Computing (HPC). A considerable amount of highly
specialized time and effort is spent in porting and optimizing code for GPUs,
FPGAs and other hardware accelerators. Experts are also needed to leverage
bleeding edge software improvements in compilers, languages, libraries and
frameworks. The objective of techniques for the automatic configuration and
optimization of HPC applications, or \emph{autotuning}, is to decrease the cost and
time needed to adopt efficient hardware and software. Typical autotuning
targets include algorithm selection, source-to-source transformations and
compiler configuration.
Autotuning can be studied as a search problem where the objective is to minimize
software or hardware metrics. The exploration of the search spaces defined by
code and compiler configurations and optimizations presents interesting
challenges. Such spaces grow exponentially with the number of parameters, and
are also difficult to explore extensively due to the often prohibitive costs of
hardware utilization, program compilation and execution times. Developing
autotuning strategies capable of producing good optimizations while minimizing
resource utilization is therefore essential. The capability of acquiring
knowledge about an optimization problem is also crucial in an autotuning
strategy, since this knowledge can decrease the cost of subsequent optimizations
of the same application or for the same hardware.
It is common and usually effective to use search meta-heuristics such as genetic
algorithms and simulated annealing in autotuning. These strategies attempt to
exploit local search space properties, but are generally incapable of exploiting
global structures. Seymour \emph{et al.}~\cite{seymour2008comparison},
Knijnenburg \emph{et al.}~\cite{knijnenburg2003combined}, and Balaprakash \emph{et
al.}~\cite{balaprakash2011can,balaprakash2012experimental} report that
these strategies are not more effective than a naive uniform random sample of
the search space, and usually rely on a large number of measurements or restarts
to achieve performance improvements. Search strategies based on gradient descent
are also commonly used in autotuning, and also rely on a large number of
measurements, but their effectiveness diminishes significantly in search spaces
with complex local structures. Automated machine learning autotuning
strategies~\cite{beckingsale2017apollo,falch2017machine,balaprakash2016automomml}
are promising for building models for predicting important parameters, but still
rely on a sizable data set for training.
In summary, search strategies based on meta-heuristics, gradient descent and
machine learning require a large number of measurements to be effective, and are
usually incapable of providing knowledge about search spaces to users. Since
these strategies are not transparent, at the end of each autotuning session it
is difficult to decide if and where further exploration is warranted, and often
impossible to know which parameters are responsible for the observed
improvements. After exploring a search space, deducing any of the space's global
properties confidently is impossible, since the space was automatically explored
with unknown biases.
The contribution of this paper is an autotuning strategy that leverages existing
knowledge about a problem by using an initial performance model that is refined
iteratively using performance measurements, statistical analysis, and user
input. Our strategy places a heavy weight on decreasing autotuning costs by
using a \emph{Design of Experiments} (DoE) methodology to minimize the number of
experiments needed to find optimizations. Each iteration uses \emph{Analysis of
Variance} (ANOVA) tests and \emph{linear model regressions} to identify promising
subspaces and parameter significance. An architecture- and problem-specific
performance model is built iteratively and with user input, which enables making
informed decisions about which regions of the search space are worth exploring.
We evaluate the performance of our approach by optimizing a Laplacian Kernel for
GPUs, where the search space, the global optimum, and a performance model
approximation are known. The budget of measurements was tightly constrained on
this experiment. Speedups and budget utilization reduction achieved by our
approach on this setting motivated a more comprehensive performance evaluation.
We chose the \emph{Search Problems in Automatic Performance Tuning}
(SPAPT)~\cite{balaprakash2012spapt} benchmark suite for this evaluation, where
we obtained diverse results. Out of the 17 SPAPT kernels benchmarked, no speedup
could be found for three kernels, but uniform random sampling performed well on
all others. For eight of the kernels, our approach found speedups of up to
\(50\times\), compared to \texttt{gcc}'s \texttt{-O3} with no code transformations, while using up to
\(10\times\) fewer measurements than random sampling.
The rest of this paper is organized as follows. Section~\ref{sec:org9b6520b} presents
related work on source-to-source transformation, which is the main optimization
target in SPAPT kernels, on autotuning systems and on search space exploration
strategies. Section~\ref{sec:orgb6938f2} discusses the
implementation of our approach in detail. Section~\ref{sec:orgd53cbad}
discusses the DoE, ANOVA, and linear regression methodology we used to develop
our approach. Section~\ref{sec:orgeace0ee} presents the results on the
performance evaluation on the GPU Laplacian Kernel and on the SPAPT benchmark
suite. Section~\ref{sec:orgad88b73} discusses our conclusions and future work.
\section{Background}
\label{sec:org9b6520b}
This section presents the background and related work on source-to-source
transformation, autotuning systems and search space exploration strategies.
\subsection{Source-to-Source Transformation}
\label{sec:orge8340fb}
Our approach can be applied to any autotuning domain that expresses optimization
as a search problem, although the performance evaluations we present in
Section~\ref{sec:orgeace0ee} were obtained in the domain of source-to-source
transformation. Several frameworks, compilers and autotuners provide tools to
generate and optimize architecture-specific
code~\cite{hartono2009annotation,videau2017boast,tiwari2009scalable,yi2007poet,ansel2009petabricks}.
We used BOAST~\cite{videau2017boast} and Orio~\cite{hartono2009annotation} to
perform source-to-source transformations targeting parallelization on CPUs and
GPUs, vectorization, loop transformations such as tiling and unrolling, and data
structure size and copying.
\subsection{Autotuning}
\label{sec:orga21fad2}
John Rice's Algorithm Selection framework~\cite{rice1976algorithm} is the
precursor of autotuners in various problem domains. In 1997, the PHiPAC
system~\cite{bilmes1997optimizing} used code generators and search scripts to
automatically generate high performance code for matrix multiplication. Since
then, systems approached different domains with a variety of strategies.
Dongarra \emph{et al.}~\cite{dongarra1998automatically} introduced the ATLAS
project, that optimizes dense matrix multiplication routines. The
OSKI~\cite{vuduc2005oski} library provides automatically tuned kernels for
sparse matrices. The FFTW~\cite{frigo1998fftw} library provides tuned C
subroutines for computing the Discrete Fourier Transform.
Periscope~\cite{gerndt2010automatic} is a distributed online autotuner for
parallel systems and single-node performance. In an effort to provide a common
representation of multiple parallel programming models, the INSIEME compiler
project~\cite{jordan2012multi} implements abstractions for OpenMP, MPI and
OpenCL, and generates optimized parallel code for heterogeneous multi-core
architectures.
A different approach is to combine generic search algorithms and problem
representation data structures in a single system that enables the
implementation of autotuners for different domains. The
PetaBricks~\cite{ansel2009petabricks} project provides a language, compiler and
autotuner, enabling the definition and selection of multiple algorithms for the
same problem. The ParamILS framework~\cite{hutter2009paramils} applies
stochastic local search algorithms to algorithm configuration and parameter
tuning. The OpenTuner framework~\cite{ansel2014opentuner} provides ensembles of
techniques that search the same space in parallel, while exploration is managed
by a multi-armed bandit strategy.
\subsection{Search Space Exploration Strategies}
\label{sec:org9d0e77e}
\begin{figure}[b]
\centering
\includegraphics[width=.95\columnwidth]{./img/sampling_comparison.pdf}
\caption{\label{fig:orgbae3568}
Exploration of the search space, using a fixed budget of 50 points. The red ``\(+\)'' represents the best point found by each strategy, and ``\(\times\)''s denote neighborhood exploration}
\end{figure}
Figure~\ref{fig:orgbae3568} shows the contour of a search space defined by a
function of the form \(z = x^2 + y^2 + \varepsilon\), where \(\varepsilon\) is a local perturbation, and
the exploration of that search space by six different strategies. In such a
simple search space, even a uniform random sample can find points close to the
optimum, despite not exploiting geometry. A Latin
Hypercube~\cite{carnell2018lhs} sampling strategy covers the search space more
evenly, but still does not exploit the space's geometry. Strategies based on
neighborhood exploration such as simulated annealing and gradient descent can
exploit local structures, but may get trapped in local minima. Their performance
is strongly dependent on search starting point. These strategies do not leverage
global search space structure, or provide exploitable knowledge after
optimization.
Measurement of the kernels optimized on the performance evaluations in
Section~\ref{sec:orgeace0ee} can exceed 20 minutes, including the time of code
transformation, compilation, and execution. Measurements in other problem
domains can take much longer to complete. This strengthens the motivation to
consider search space exploration strategies capable of operating under tight
budget constraints. These strategies have been developed and improved by
statisticians for a long time, and can be grouped under the DoE term.
The D-Optimal sampling strategies shown on the two rightmost bottom panels of
Figure~\ref{fig:orgbae3568} are based on the DoE methodology, and leverage
previous knowledge about search spaces for an efficient exploration. These
strategies provide transparent analyses that enable focusing on interesting
subspaces. In the next section we describe our approach to autotuning based on
the DoE methodology.
\section{Autotuning with Design of Experiments}
\label{sec:orgb6938f2}
An \emph{experimental design} determines a selection of experiments whose objective is
to identify the relationships between \emph{factors} and \emph{responses}. While factors and
responses can refer to different concrete entities in other domains, in computer
experiments factors can be configuration parameters for algorithms and
compilers, and responses can be the execution time or memory consumption of a
program. Each possible value of a factor is called a \emph{level}. The \emph{effect} of a
factor on the measured response, without the factor's \emph{interactions} with other
factors, is the \emph{main effect} of that factor. Experimental designs can be
constructed with different goals, such as identifying the main effects or
building an analytical model for the response.
In this section we discuss in detail our iterative DoE approach to autotuning.
Figure~\ref{fig:org45b5375} presents an overview of our approach, with
numbered steps. In step 1 we define the factors and levels that compose the
search space of the target problem, in step 2 we select an initial performance
model, and in step 3 we generate an experimental design. We run the experiments
in step 4 and then, as we discuss in the next section, we identify significant
factors with an ANOVA test in step 5. This enables selecting and fitting a new
performance model in steps 6 and 7. The new model is used in step 8 for
predicting levels for each significant factor. We then go back to step 3,
generating a new design for the new problem subspace with the remaining factors.
Informed decisions made by the user at each step guide the outcome of each
iteration.
\begin{figure}[b]\vspace{-.5cm}
\centering
\includegraphics[width=.95\columnwidth]{./img/doe_anova_strategy.pdf}
\caption{\label{fig:org45b5375}
Overview of the DoE approach to autotuning proposed in this paper}
\end{figure}
Step 1 of our approach is to define target factors and which of their levels are
worth exploring. Then, the user must select an initial performance model in
step 2. Compilers typically expose many 2-level factors in the form of
configuration flags, and the performance model for a single flag can only be a
linear term, since there are only 2 values to measure. Interactions between
flags and numerical factors such as block sizes in CUDA programs or loop
unrolling amounts are also common. Deciding which levels to include for these
kinds of factors requires more careful analysis. For example, if we suspect the
performance model has a quadratic term for a certain factor, the design should
include at least three factor levels. The ordering between the levels of other
compiler parameters, such as \texttt{-O(0,1,2,3)}, is not obviously translated
to a number. Factors like these are named \emph{categorical}, and must be treated
differently when constructing designs in step 3 and analyzing results in step 5.
We decided to use D-Optimal designs because their construction techniques enable
mixing categorical and numerical factors in the same screening design, while
biasing sampling according to the performance model. This enables the autotuner
to exploit global search space structures if we use the right model. When
constructing a D-Optimal design in step 3 the user can require that specific
points in the search space are included, or that others are not. Algorithms for
constructing D-Optimal designs are capable of adapting to these requirements by
optimizing a starting design. Before settling on D-Optimal designs, we explored
other design construction techniques such as the
Plackett-Burman~\cite{plackett1946design} screening designs shown in the next
section, the \emph{contractive replacement} technique of
Addelman-Kempthorne~\cite{addelman1961some} and the \emph{direct generation} algorithm
by Grömping and Fontana~\cite{ulrike2018algorithm}. These techniques have strong
requirements on design size and level mixing, so we opted for a more flexible
technique that would enable exploring a more comprehensive class of autotuning
problems.
After the design is constructed in step 3, we run each selected experiment in
step 4. This step can run in parallel since experiments are independent. Not all
target programs run successfully in their entire input range, making runtime
failures common in this step. The user can decide whether to construct a new
design using the successfully completed experiments or to continue to the
analysis step if enough experiments succeed.
After running the ANOVA test in step 5, the user should apply domain knowledge
to analyze the ANOVA table and determine which factors are significant. Certain
factors might not appear significant and should not be included in the
regression model. Selecting the model after the ANOVA test in step 6 also
benefits from domain knowledge.
A central assumption of ANOVA is the \emph{homoscedasticity} of the response, which can
be interpreted as requiring the observed error on measurements to be independent
of factor levels and of the number of measurements. Fortunately, there are
statistical tests and corrections for lack of homoscedasticity. Our approach
uses the homoscedasticity check and correction by power transformations from the
\texttt{car} package~\cite{fox2011car} of the \texttt{R} language.
We fit the selected model to our design's data in step 7, and use the fitted
model in step 8 to find levels that minimize the response. The choice of the
method used to find these levels depends on factor types and on the complexity
of the model and search space. If factors have discrete levels, neighborhood
exploration might be needed to find levels that minimize the response around
predicted levels. Constraints might put predicted levels on an undefined or
invalid region on the search space. This presents challenge, because the borders
of valid regions would have to be explored.
In step 8 we also fix factor levels to those predicted to achieve best
performance. The user can also decide the level of trust placed on the
prediction at this step, by keeping other levels available. In step 8 we perform
a reduction of problem dimension by eliminating factors and decreasing the size
of the search space. If we identified significant parameters correctly, we will
have restricted further search to better regions. In the next section we present
a simple fictional application our approach that illustrates the fundamentals of
the DoE methodology, screening designs and D-Optimal designs.
\section{Design of Experiments}
\label{sec:orgd53cbad}
In this section we first present the assumptions of a traditional DoE
methodology using an example of \emph{2-level screening designs}, an efficient way to
identify main effects. We then discuss techniques for the construction of
efficient designs for factors with arbitrary numbers and types of levels, and
present \emph{D-Optimal} designs, the technique used in this paper.
\subsection{Screening \& Plackett-Burman Designs}
\label{sec:org35859df}
Screening designs identify parsimoniously the main effects of 2-level factors in
the initial stages of studying a problem. While interactions are not considered
at this stage, identifying main effects early enables focusing on a smaller set
of factors on subsequent experiments. A specially efficient design construction
technique for screening designs was presented by Plackett and
Burman~\cite{plackett1946design} in 1946, and is available in the \texttt{FrF2}
package~\cite{gromping2014frf2} of the \texttt{R} language~\cite{team2018rlanguage}.
Despite having strong restrictions on the number of factors supported,
Plackett-Burman designs enable the identification of main effects of \(n\) factors
with \(n + 1\) experiments. Factors may have many levels, but Plackett-Burman
designs can only be constructed for 2-level factors. Therefore, before
constructing a Plackett-Burman design we must identify \emph{high} and \emph{low} levels for
each factor.
Assuming a linear relationship between factors and the response is fundamental
for running ANOVA tests using a Plackett-Burman design. Consider the linear
relationship
\vspace{-2pt}
\begin{equation}
\mathbf{Y} = \bm{\beta}\mathbf{X} + \varepsilon\text{,}
\label{eq:linear_assumption}
\vspace{-2pt}
\end{equation}\noindent
where \(\varepsilon\) is the error term, \(\mathbf{Y}\) is the observed response, \(\mathbf{X}
= \left\{1, x_1,\dots,x_n\right\}\) is the set of \(n\) 2-level factors, and
\(\bm{\beta} = \left\{\beta_0,\dots,\beta_n\right\}\) is the set with the \emph{intercept} \(\beta_0\)
and the corresponding \emph{model coefficients}. ANOVA tests can rigorously compute the
significance of each factor, we can think of that intuitively by noting that
less significant factors will have corresponding values in \(\bm{\beta}\) close to
zero.
The next example illustrates the screening methodology. Suppose we wish to
minimize a performance metric \(Y\) of a problem with factors \(x_1,\dots,x_8\)
assuming values in \(\left\{-1, -0.8, -0.6, \dots, 0.6, 0.8, 1\right\}\). Each \(y_i \in
Y\) is defined as
\vspace{-2pt}
\begin{align}
\label{eq:real_model}
y_i = & -1.5x_1 + 1.3x_3 + 3.1x_5 + \\
& -1.4x_7 + 1.35x_8^2 + 1.6x_3x_5 + \varepsilon\text{.} \nonumber
\vspace{-2pt}
\end{align}\noindent
Suppose that, for the purpose of this example, the computation is done by a very
expensive black-box procedure. Note that factors \(\{x_2,x_4,x_6\}\) have no
contribution to the response, and we can think of the error term \(\varepsilon\) as
representing not only noise, but our uncertainty regarding the model. Higher
amplitudes of \(\varepsilon\) might make isolating factors with low significance harder to
justify.
To study this problem efficiently we decided to construct a Plackett-Burman
design, which minimizes the experiments needed to identify significant factors.
The analysis of this design will enable decreasing the dimension of the problem.
Table~\ref{tab:plackett} presents the Plackett-Burman design we generated. It
contains high and low values, chosen to be \(-1\) and \(1\), for the factors
\(x_1,\dots,x_8\), and the observed response \(\mathbf{Y}\). It is a required step to
add the 3 ``dummy'' factors \(d_1,\dots,d_3\) to complete the 12 columns needed to
construct a Plackett-Burman design for 8 factors~\cite{plackett1946design}.
% latex table generated in R 3.5.1 by xtable 1.8-2 package
% Mon Oct 15 18:04:00 2018
\begin{table}[b]
\centering
\caption{Randomized Plackett-Burman design for factors $x_1, \dots, x_8$, using 12 experiments and ``dummy'' factors $d_1, \dots, d_3$, and computed response $\mathbf{Y}$}
\label{tab:plackett}
\begingroup\scriptsize
\begin{tabular}{cccccccccccc}
\toprule
$x_1$ & $x_2$ & $x_3$ & $x_4$ & $x_5$ & $x_6$ & $x_7$ & $x_8$ & $d_1$ & $d_2$ & $d_3$ & $Y$ \\
\midrule
1 & -1 & 1 & 1 & 1 & -1 & -1 & -1 & 1 & -1 & 1 & 13.74 \\
-1 & 1 & -1 & 1 & 1 & -1 & 1 & 1 & 1 & -1 & -1 & 10.19 \\
-1 & 1 & 1 & -1 & 1 & 1 & 1 & -1 & -1 & -1 & 1 & 9.22 \\
1 & 1 & -1 & 1 & 1 & 1 & -1 & -1 & -1 & 1 & -1 & 7.64 \\
1 & 1 & 1 & -1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 & 8.63 \\
-1 & 1 & 1 & 1 & -1 & -1 & -1 & 1 & -1 & 1 & 1 & 11.53 \\
-1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 & 1 & 1 & 1 & 2.09 \\
1 & 1 & -1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 & 1 & 9.02 \\
1 & -1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 & 1 & 1 & 10.68 \\
1 & -1 & 1 & 1 & -1 & 1 & 1 & 1 & -1 & -1 & -1 & 11.23 \\
-1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & 5.33 \\
-1 & -1 & 1 & -1 & 1 & 1 & -1 & 1 & 1 & 1 & -1 & 14.79 \\
\bottomrule
\end{tabular}
\endgroup
\end{table}
So far, we have performed steps 1, 2, and 3 from
Figure~\ref{fig:org45b5375}. We use our initial assumption in
Equation~\eqref{eq:linear_assumption} to identify the most significant factors by
performing an ANOVA test, which is step 5 from
Figure~\ref{fig:org45b5375}. The results are shown in
Table~\ref{tab:anova_linear}, where the \emph{significance} of each factor is
interpreted from the F-test and \(\text{Pr}(>\text{F})\) values.
Table~\ref{tab:anova_linear} uses ``\(*\)'', as is convention in the \texttt{R}
language, to represent the significance values for each factor.
We see on Table~\ref{tab:anova_linear} that factors \(\left\{x_3,x_5,x_7,x_8\right\}\)
have at least one ``\(*\)'' of significance. For the purpose of this example, this
is sufficient reason to include them in our linear model for the next step. We
decide as well to discard factors \(\left\{x_2,x_4,x_6\right\}\) from our model, due
to their low significance. We see that factor \(x_1\) has a significance mark of
``\(\cdot\)'', but comparing F-test and \(\text{Pr}(>\text{F})\) values we decide that
they are fairly smaller than the values of factors that had no significance, and
we keep this factor.
% latex table generated in R 3.5.1 by xtable 1.8-2 package
% Mon Oct 15 18:04:10 2018
\begin{table}[t]
\centering
\caption{Shortened ANOVA table for the fit of the naive model, with significance intervals from the \texttt{R} language}
\label{tab:anova_linear}
\begingroup\small
\begin{tabular}{lrrl}
\toprule
& F value & $\text{Pr}(>\text{F})$ & Signif. \\
\midrule
$x_1$ & 8.382 & 0.063 & $\cdot$ \\
$x_2$ & 0.370 & 0.586 & \\
$x_3$ & 80.902 & 0.003 & $**$ \\
$x_4$ & 0.215 & 0.675 & \\
$x_5$ & 46.848 & 0.006 & $**$ \\
$x_6$ & 5.154 & 0.108 & \\
$x_7$ & 13.831 & 0.034 & $*$ \\
$x_8$ & 59.768 & 0.004 & $**$ \\
\bottomrule
\end{tabular}
\endgroup
\end{table}
Moving forward to steps 6, 7, and 8 in Figure~\ref{fig:org45b5375}, we will
build a linear model using factors \(\left\{x_1,x_3,x_5,x_7,x_8\right\}\), fit the
model using the values of \(Y\) we obtained when running our design, and use model
coefficients to predict the levels of each factor that minimize the real
response. We can do that because these factors are numerical, even though only
discrete values are allowed.
We now proceed to the prediction step, where we wish to identify the levels of
factors \(\left\{x_1,x_3,x_5,x_7,x_8\right\}\) that minimize our fitted model, without
running any new experiments. This can be done by, for example, using a gradient
descent algorithm or finding the point where the derivative of the function
given by the linear regression equals to zero.
Table~\ref{tab:linear_prediction_comparison} compares the prediction for \(Y\) from
our linear model with the selected factors \(\left\{x_1,x_3,x_5,x_7,x_8\right\}\) with
the actual global minimum \(Y\) for this problem. Note that factors
\(\left\{x_2,x_4,x_6\right\}\) are included for the global minimum. This happens here
because of the error term \(\varepsilon\), but could also be interpreted as due to model
uncertainty.
% latex table generated in R 3.5.1 by xtable 1.8-2 package
% Mon Oct 15 18:04:58 2018
\begin{table}[b]
\centering
\caption{Comparison of the response $Y$ predicted by the linear model and the true global minimum. Factors used in the model are bolded}
\label{tab:linear_prediction_comparison}
\begingroup\footnotesize
\begin{tabularx}{\columnwidth}{lrlrlrlrrr}
\toprule
& $\bm{x_1}$ & $x_2$ & $\bm{x_3}$ & $x_4$ & $\bm{x_5}$ & $x_6$ & $\bm{x_7$} & $\bm{x_8}$ & $Y$ \\
\midrule
Lin. & -1.0 & -- & -1.0 & -- & -1.0 & -- & 1.0 & -1.0 & -1.046 \\
Min. & 1.0 & -0.2 & -1.0 & 0.6 & -1.0 & 0.4 & 0.8 & 0.0 & -9.934 \\
\bottomrule
\end{tabularx}
\endgroup
\end{table}
Using 12 measurements and a simple linear model, the predicted best value of \(Y\)
was around \(10\times\) larger than the global optimum. Note that the model predicted
the correct levels for \(x_3\) and \(x_5\), and almost for \(x_7\). The linear model
predicted wrong levels for \(x_1\), perhaps due to this factor's interaction with
\(x_3\), and for \(x_8\). Arguably, it would be impossible to predict the correct
level for \(x_8\) using this linear model, since a quadratic term composes the true
formula of \(Y\). As we showed in Figure~\ref{fig:orgbae3568}, a D-Optimal
design using a linear model could detect the significance of a quadratic term,
but the resulting regression will often lead to the wrong level.
We can improve upon this result if we introduce some information about the
problem and use a more flexible design construction technique. Next, we will
discuss the construction of efficient designs using problem-specific formulas
and continue the optimization of our example.
\subsection{D-Optimal Designs}
\label{sec:org5f2b90b}
The application of DoE to autotuning problems requires design construction
techniques that support factors of arbitrary types and number of levels.
Autotuning problems typically combine factors such as binary flags, integer and
floating point numerical values, and unordered enumerations of abstract values.
Because Plackett-Burman designs only support 2-level factors, we had to restrict
factor levels to interval extremities in our example. We have seen that this
restriction makes it difficult to measure the significance of quadratic terms.
We will now show how to optimize our example further by using \emph{D-Optimal designs},
which increase the number of levels we can efficiently screen for and enables
detecting the significance of more complex terms.
To construct a D-Optimal design it is necessary to choose an initial model,
which can be done based on previous experiments or on expert knowledge of the
problem. Once a model is selected, algorithmic construction is performed by
searching for the set of experiments that minimizes \emph{D-Optimality}, a measure of
the \emph{variance} of the \emph{estimators} for the \emph{regression coefficients} associated with
the selected model. This search is usually done by swapping experiments from the
current candidate design with experiments from a pool of possible experiments,
according to certain rules, until some stopping criterion is met. In the
example in this section and in our approach we use Fedorov's
algorithm~\cite{fedorov1972theory} for constructing D-Optimal designs,
implemented in \texttt{R} in the \texttt{AlgDesign} package~\cite{wheeler2014algdesign}.
In our example, suppose that in addition to using our previous screening results
we decide to hire an expert in our problem's domain. The expert confirms our
initial assumptions that the factor \(x_1\) should be included in our model since
it is usually significant for this kind of problem and has a strong interaction
with factor \(x_3\). She also mentions we should replace the linear term for \(x_8\)
by a quadratic term for this factor.
Using our previous screening and the domain knowledge provided by our expert, we
choose a new performance model and use it to construct a D-Optimal design using
Fedorov's algorithm. Since we need enough degrees of freedom to fit our model,
we construct the design with 12 experiments shown in Table~\ref{tab:d_optimal}.
Note that the design includes levels \(-1\), \(0\), and \(1\) for factor \(x_8\). The
design will sample from different regions of the search space due to the
quadratic term, as was shown in Figure~\ref{fig:orgbae3568}.
% latex table generated in R 3.5.1 by xtable 1.8-2 package
% Mon Oct 15 18:05:40 2018
\begin{table}[t]
\centering
\caption{D-Optimal design constructed for the factors $\left\{x_1,x_3,x_5,x_7,x_8\right\}$ and computed response $Y$}
\label{tab:d_optimal}
\begingroup\footnotesize
\begin{tabular}{rrrrrr}
\toprule
$x_1$ & $x_3$ & $x_5$ & $x_7$ & $x_8$ & $Y$ \\
\midrule
-1.0 & -1.0 & -1.0 & -1.0 & -1.0 & 2.455 \\
-1.0 & 1.0 & 1.0 & -1.0 & -1.0 & 6.992 \\
1.0 & -1.0 & -1.0 & 1.0 & -1.0 & -7.776 \\
1.0 & 1.0 & 1.0 & 1.0 & -1.0 & 4.163 \\
1.0 & 1.0 & -1.0 & -1.0 & 0.0 & 0.862 \\
-1.0 & 1.0 & 1.0 & -1.0 & 0.0 & 5.703 \\
1.0 & -1.0 & -1.0 & 1.0 & 0.0 & -9.019 \\
-1.0 & -1.0 & 1.0 & 1.0 & 0.0 & 2.653 \\
-1.0 & -1.0 & -1.0 & -1.0 & 1.0 & 1.951 \\
1.0 & -1.0 & 1.0 & -1.0 & 1.0 & 0.446 \\
-1.0 & 1.0 & -1.0 & 1.0 & 1.0 & -2.383 \\
1.0 & 1.0 & 1.0 & 1.0 & 1.0 & 4.423 \\
\bottomrule
\end{tabular}
\endgroup
\end{table}
We now fit this model using the results of the experiments in our design.
Table~\ref{tab:correct_fit} shows the model fit table and compares the estimated
and real model coefficients. This example illustrates that the DoE approach can
achieve close model estimations using few resources, provided the approach is
able to use user input to identify significant factors, and knowledge about the
problem domain to tweak the model.
% latex table generated in R 3.5.1 by xtable 1.8-2 package
% Mon Oct 15 18:06:01 2018
\begin{table}[b]
\centering
\caption{Correct model fit comparing real and estimated coefficients, with significance intervals from the \texttt{R} language}
\label{tab:correct_fit}
\begingroup\small
\begin{tabular}{lrrrrl}
\toprule
& Real & Estimated & t value & Pr$(>|\text{t}|)$ & Signif. \\
\midrule
Intercept & 0.000 & 0.050 & 0.305 & 0.776 & \\
$x_1$ & -1.500 & -1.452 & -14.542 & 0.000 & *** \\
$x_3$ & 1.300 & 1.527 & 15.292 & 0.000 & *** \\
$x_5$ & 3.100 & 2.682 & 26.857 & 0.000 & *** \\
$x_7$ & -1.400 & -1.712 & -17.141 & 0.000 & *** \\
$x_8$ & 0.000 & -0.175 & -1.516 & 0.204 & \\
$x_8^2$ & 1.350 & 1.234 & 6.180 & 0.003 & ** \\
$x_1x_3$ & 1.600 & 1.879 & 19.955 & 0.000 & *** \\
\bottomrule
\end{tabular}
\endgroup
\end{table}
Table~\ref{tab:prediction_comparisons} compares the global minimum of this
example with the predictions made by our initial linear model from the screening
step, and our improved model. Using screening, D-Optimal designs, and domain
knowledge, we found an optimization within \(10\%\) of the global optimum while
computing \(Y\) only 24 times. We were able to do that by first reducing the
problem's dimension when we eliminated insignificant factors in the screening
step. We then constructed a more careful exploration of this new problem
subspace, aided by domain knowledge provided by an expert. Note that we could
have reused some of the 12 experiments from the previous step to reduce the size
of the new design further.
% latex table generated in R 3.5.1 by xtable 1.8-2 package
% Mon Oct 15 18:06:07 2018
\begin{table}[ht]
\centering
\caption{Comparison of the response $Y$ predicted by our models and the true global minimum. Factors used in the models are bolded}
\label{tab:prediction_comparisons}
\begingroup\footnotesize
\begin{tabular}{lrlrlrlrrr}
\toprule
& $\bm{x_1}$ & $x_2$ & $\bm{x_3}$ & $x_4$ & $\bm{x_5}$ & $x_6$ & $\bm{x_7$} & $\bm{x_8}$ & $Y$ \\
\midrule
Quad. & 1.0 & -- & -1.0 & -- & -1.0 & -- & 1.0 & 0.0 & -9.019 \\
Lin. & -1.0 & -- & -1.0 & -- & -1.0 & -- & 1.0 & -1.0 & -1.046 \\
Min. & 1.0 & -0.2 & -1.0 & 0.6 & -1.0 & 0.4 & 0.8 & 0.0 & -9.934 \\
\bottomrule
\end{tabular}
\endgroup
\end{table}
We are able to explain the performance improvements we obtained in each step of
the process, because we finish steps with a performance model and a performance
prediction. Each factor is included or removed using information obtained in
statistical tests, or expert knowledge. If we need to optimize this problem
again, for a different architecture or with larger input, we could start
exploring the search space with a less naive model. We could also continue the
optimization of this problem by exploring levels of factors
\(\left\{x_2,x_4,x_6\right\}\). The significance of these factors could now be
detectable by ANOVA tests since the other factors are now fixed. If we still
cannot identify any significant factor, it might be advisable to spend the
remaining budget using another exploration strategy such as uniform random or
latin hypercube sampling.
The process of screening for factor significance using ANOVA and fitting a new
model using acquired knowledge is equivalent to steps 5, 6, and 7 in
Figure~\ref{fig:org45b5375}. In the next section we evaluate the performance
of our DoE approach in two scenarios.
\section{Performance Evaluation}
\label{sec:orgeace0ee}
In this section we present performance evaluations of our approach in two
scenarios that differ on search space size and complexity.
\vspace{-5pt}
\subsection{GPU Laplacian Kernel}
\label{sec:org40a0a1b}
We first evaluated the performance of our approach in a Laplacian Kernel
implemented using BOAST~\cite{videau2017boast} and targeting the \emph{Nvidia K40c}
GPU. The objective was to minimize the \emph{time to compute each pixel} by finding the
best level combination for the factors listed in Table
\ref{tab:org5c5e9b6}. Considering only factors and levels, the size of the
search space is \(1.9\times10^5\), but removing points that fail at runtime yields a
search space of size \(2.3\times10^4\). The complete search space took 154 hours to be
evaluated on \emph{Debian Jessie}, using an \emph{Intel Xeon E5-2630v2} CPU, \texttt{gcc}
version \texttt{4.8.3} and \emph{Nvidia} driver version \texttt{340.32}.
We applied domain knowledge to construct the following initial performance model:
\vspace{-2pt}
\\\begin{minipage}{\linewidth}\scriptsize
\begin{align}
\label{eq:gpu_laplacian_performance_model}
\texttt{time\_per\_pixel} \sim & \scriptsize\; \texttt{y\_component\_number} + \frac{1}{\texttt{y\_component\_number}} \; + \nonumber \\
& \scriptsize\; \texttt{load\_overlap} + \texttt{temporary\_size} \; + \nonumber \\
& \scriptsize\; \texttt{vector\_length} + \texttt{lws\_y} + \frac{1}{\texttt{lws\_y}} \; + \\
& \scriptsize\; \texttt{elements\_number} + \texttt{threads\_number} \; + \nonumber \\
& \scriptsize\; \frac{1}{\texttt{elements\_number}} + \frac{1}{\texttt{threads\_number}}\text{.} \nonumber
\end{align}
\vspace{2pt}
\end{minipage}
This performance model was used by the Iterative Linear Model (LM) algorithm and
by our D-Optimal Design approach (DLMT). LM is almost identical to our approach,
described Section~\ref{sec:orgb6938f2}, but it uses a
fixed-size random sample of the search space instead of generating D-Optimal
designs. We compared the performance of our approach with the following
algorithms: uniform Random Sampling (RS); Latin Hypercube Sampling (LHS); Greedy
Search (GS); Greedy Search with Restart (GSR); and Genetic Algorithm (GA). Each
algorithm performed \emph{at most} 125 measurements over 1000 repetitions, without user
intervention.
\begin{table}[t]
\caption{\label{tab:org5c5e9b6}
Parameters of the Laplacian Kernel}
\centering
\scriptsize
\begin{tabular}{llp{0.37\columnwidth}}
\toprule
Factor & Levels & Short Description\\
\midrule
\texttt{vector\_length} & \(2^0,\dots,2^4\) & Size of support arrays\\
\texttt{load\_overlap} & \textit{true}, \textit{false} & Load overlaps in vectorization\\
\texttt{temporary\_size} & \(2,4\) & Byte size of temporary data\\
\texttt{elements\_number} & \(1,\dots,24\) & Size of equal data splits\\
\texttt{y\_component\_number} & \(1,\dots,6\) & Loop tile size\\
\texttt{threads\_number} & \(2^5,\dots,2^{10}\) & Size of thread groups\\
\texttt{lws\_y} & \(2^0,\dots,2^{10}\) & Block size in \(y\) dimension\\
\bottomrule
\end{tabular}
\end{table}
Since we measured the entire valid search space, we could use the \emph{slowdown}
relative to the \emph{global minimum} to compare algorithm performance.
Table~\ref{tab:gpu_laplacian_compare_budget} shows the mean, minimum and maximum
slowdowns in comparison to the global minimum, for each algorithm. It also shows
the mean and maximum budget used by each algorithm.
Figure~\ref{fig:org6f6944c} presents histograms with the count
of the slowdowns found by each of the 1000 repetitions. Arrows point the maximum
slowdown found by each algorithm. Note that GS's maximum slowdown was left out
of range to help the comparison between the other algorithms.
% latex table generated in R 3.5.1 by xtable 1.8-2 package
% Thu Oct 18 15:00:23 2018
\begin{table}[b]
\centering
\caption{Slowdown and budget used by 7 optimization methods on the Laplacian Kernel, using a budget of 125 points with 1000 repetitions}
\label{tab:gpu_laplacian_compare_budget}
\begingroup\footnotesize
\begin{tabular}{lrrrrr}
\toprule
& Mean & Min. & Max. & Mean Budget & Max. Budget \\
\midrule
RS & 1.10 & 1.00 & 1.39 & 120.00 & 120.00 \\
LHS & 1.17 & 1.00 & 1.52 & 98.92 & 125.00 \\
GS & 6.46 & 1.00 & 124.76 & 22.17 & 106.00 \\
GSR & 1.23 & 1.00 & 3.16 & 120.00 & 120.00 \\
GA & 1.12 & 1.00 & 1.65 & 120.00 & 120.00 \\
LM & 1.02 & 1.01 & 3.77 & 119.00 & 119.00 \\
\rowcolor{red!25}DLMT & 1.01 & 1.01 & 1.01 & 54.84 & 56.00 \\
\bottomrule
\end{tabular}
\endgroup
\end{table}
All algorithms performed relatively well in this kernel, with only GS not being
able to find slowdowns smaller than 4\(\times\) in some runs. As expected, other search
algorithms had results similar to RS. LM was able to find slowdowns close to the
global minimum on most runs, but some runs could not find slowdowns smaller than
\(4\times\). Our approach reached a slowdown of 1\% from the global minimum \emph{in all} of
the 1000 runs while using \emph{at most} fewer than half of the allotted budget.
We implemented a simple approach for the prediction step in this problem,
choosing the best value of our fitted models on the complete set of valid level
combinations. This was possible for this problem since all valid combinations
were known. For problems were the search space is too large to be generated, we
would have to either adapt this step and run the prediction on a sample or
minimize the model using the differentiation strategies mentioned in
Section~\ref{sec:org35859df}.
\begin{figure}[t]\vspace{-.5cm}
\centering
\includegraphics[width=.9\columnwidth]{./img/comparison_histogram.pdf}
\caption{\label{fig:org6f6944c}
Distribution of slowdowns in relation to the global minimum for 7 optimization methods on the Laplacian Kernel, using a budget of 125 points over 1000 repetitions \vspace{-.5cm}}
\end{figure}
This kernel provided ideal conditions for using our approach, where the
performance model is approximately known and the complete valid search space is
small enough to be used for prediction. The global minimum also appears to not
be isolated in a region of points with bad performance, since our approach was
able to exploit space geometry. We will now present a performance evaluation of
our approach in a larger and more comprehensive benchmark.
\subsection{SPAPT Benchmark Suite}
\label{sec:orge74ca1f}
The SPAPT~\cite{balaprakash2012spapt} benchmark suite provides parametrized CPU
kernels from different HPC domains. The kernels shown in Table~\ref{tab:orgd14f3ce}
are implemented using the code annotation and transformation tools provided by
Orio~\cite{hartono2009annotation}. Search space sizes are larger than in the
Laplacian Kernel example. Kernel factors are either integers in an interval,
such as loop unrolling and register tiling amounts, or binary flags that control
parallelization and vectorization.
\begin{figure*}[p]
\centering
\includegraphics[width=\textwidth]{./img/iteration_best_comparison.pdf}
\caption{\label{fig:org0536b62}
Cost of best points found on each run, and the iteration where they were found. RS and DLMT found no speedups with similar budgets for kernels marked with ``[0]'' and \emph{blue} headers, and similar speedups with similar budgets for kernels marked with ``[=]'' and \emph{orange} headers. DLMT found similar speedups using smaller budgets for kernels marked with ``[+]'' \emph{green} headers. Ellipses delimit an estimate of where 95\% of the underlying distribution lies}
\end{figure*}
\begin{figure*}[p]
\centering
\includegraphics[width=\textwidth]{./img/split_histograms.pdf}
\caption{\label{fig:orgc8b1732}
Histograms of explored search spaces, showing the real count of measured configurations. Kernels are grouped in the same way as in Figure~\ref{fig:org0536b62}. DLMT spent fewer measurements than RS in configurations with smaller speedups or with slowdowns, even for kernels in the orange group. DLMT also spent more time exploring configurations with larger speedups}
\end{figure*}
We used the Random Sampling (RS) implementation available in Orio and integrated
an implementation of our approach (DLMT) to the system. We omitted the other
Orio algorithms because other studies using SPAPT
kernels~\cite{balaprakash2011can,balaprakash2012experimental} showed that their
performance is similar to RS regarding budget usage. The global minima are not
known for any of the problems, and search spaces are too large to allow complete
measurements. Therefore, we used the performance of each application compiled
with \texttt{gcc}'s \texttt{-O3}, with no code transformations, as a \emph{baseline}
for computing the \emph{speedups} achieved by each strategy. We performed 10 autotuning
repetitions for each kernel using RS and DLMT, using a budget of \emph{at most} 400
measurements. DLMT was allowed to perform only 4 of the iterations shown in
Figure~\ref{fig:org45b5375}. Experiments were performed using
Grid5000~\cite{balouek2013adding}, on \emph{Debian Jessie}, using an \emph{Intel Xeon
E5-2630v3} CPU and \texttt{gcc} version \texttt{6.3.0}.
\begin{table}[t]
\caption{\label{tab:orgd14f3ce}
Kernels from the SPAPT benchmark used in this evaluation}
\centering
\scriptsize
\begin{tabular}{llll}
\toprule
Kernel & Operation & Factors & Size\\
\midrule
\texttt{atax} & Matrix transp. \& vector mult. & 18 & \(2.6 \times 10^{16}\)\\
\texttt{dgemv3} & Scalar, vector \& matrix mult. & 49 & \(3.8 \times 10^{36}\)\\
\texttt{gemver} & Vector mult. \& matrix add. & 24 & \(2.6 \times 10^{22}\)\\
\texttt{gesummv} & Scalar, vector, \& matrix mult. & 11 & \(5.3 \times 10^{9}\)\\
\texttt{hessian} & Hessian computation & 9 & \(3.7 \times 10^{7}\)\\
\texttt{mm} & Matrix multiplication & 13 & \(1.2 \times 10^{12}\)\\
\texttt{mvt} & Matrix vector product \& transp. & 12 & \(1.1 \times 10^{9}\)\\
\texttt{tensor} & Tensor matrix mult. & 20 & \(1.2 \times 10^{19}\)\\
\texttt{trmm} & Triangular matrix operations & 25 & \(3.7 \times 10^{23}\)\\
\texttt{bicg} & Subkernel of BiCGStab & 13 & \(3.2 \times 10^{11}\)\\
\texttt{lu} & LU decomposition & 14 & \(9.6 \times 10^{12}\)\\
\texttt{adi} & Matrix sub., mult., \& div. & 20 & \(6.0 \times 10^{15}\)\\
\texttt{jacobi} & 1-D Jacobi computation & 11 & \(5.3 \times 10^{9}\)\\
\texttt{seidel} & Matrix factorization & 15 & \(1.3 \times 10^{14}\)\\
\texttt{stencil3d} & 3-D stencil computation & 29 & \(9.7 \times 10^{27}\)\\
\texttt{correlation} & Correlation computation & 21 & \(4.5 \times 10^{17}\)\\
\bottomrule
\end{tabular}
\end{table}
The time to measure each kernel varied from a few seconds to up to 20 minutes.
In testing, some transformations caused the compiler to enter an internal
optimization process that did not stop for over 12 hours. We did not study why
these cases delayed for so long, and implemented an execution timeout of 20
minutes, considering cases that took longer than that to compile to be runtime
failures.
Similar to the previous example, we automated factor elimination based on ANOVA
tests so that a comprehensive evaluation could be performed. We also did not
tailor initial performance models, which were the same for all kernels. Initial
models had a linear term for each factor with two or more levels, plus quadratic
and cubic terms for factors with sufficient levels. Although automation and
identical initial models might have limited the improvements at each step of our
application, our results show that it still succeeded in decreasing the budget
needed to find significant speedups for some kernels.
Figure~\ref{fig:org0536b62} presents the \emph{speedup} found by each run of
RS and DLMT, plotted against the algorithm \emph{iteration} where that speedup was
found. We divided the kernels into 3 groups according to the results. The group
where no algorithm found any speedups contains 3 kernels and is marked with
``[0]'' and \emph{blue} headers. The group where both algorithms found similar
speedups, in similar iterations, contains 6 kernels and is marked with ``[=]''
and \emph{orange} headers. The group where DLMT found similar speedups using a
significantly smaller budget than RS contains 8 kernels and is marked with
``[+]'' and \emph{green} headers. Ellipses delimit an estimate of where 95\% of the
underlying distribution lies, and a dashed line marks the \texttt{-03} baseline.
In comparison to RS, our approach significantly decreased the average number of
iterations needed to find speedups for the 8 kernels in the green group.
Figure~\ref{fig:orgc8b1732} shows the search space exploration performed by RS
and DLMT. It uses the same color groups as Figure~\ref{fig:org0536b62},
and shows the distribution of the speedups that were found during all
repetitions of the experiments. Histogram areas corresponding to DLMT are
usually smaller because it always stopped at 4 iterations, while RS always
performed 400 measurements. This is particularly visible in \texttt{lu}, \texttt{mvt}, and
\texttt{jacobi}. We also observe that the quantity of configurations with high speedups
found by DLMT is higher, even for kernels on the orange group. This is
noticeable in \texttt{gemver}, \texttt{bicgkernel}, \texttt{mm} and \texttt{tensor}, and means that DLMT spent less
of the budget exploring configurations with small speedups or slowdowns, in
comparison with RS.
Analyzing the significant performance parameters identified by our automated
approach for every kernel, we were able to identify interesting relationships
between parameters and performance. In \texttt{bicgkernel}, for example, DLTM identified
a linear relationship for OpenMP and scalar replacement optimizations, and
quadratic relationships between register and cache tiling, and loop
unrolling. This is an example of the transparency in the optimization process
that can be achieved with a DoE approach.
Our approach used a generic initial performance model for all kernels, but since
it iteratively eliminates factors and model terms based on ANOVA tests, it was
still able to exploit global search space structures for kernels in the orange
and green groups. Even in this automated setting, the results with SPAPT kernels
illustrate the ability our approach has to reduce the budget needed to find good
speedups by efficiently exploring search spaces.
\section{Conclusion}
\label{sec:orgad88b73}
The contribution of this paper is a transparent DoE approach for program
autotuning under tight budget constraints. We discussed the underlying concepts
that enable our approach to reduce significantly the measurement budget needed
to find good optimizations consistently over different kernels exposing
configuration parameters of source-to-source transformations. We have made
efforts to make our results, figures and analyses reproducible by hosting all
our scripts and data publicly~\cite{bruel2018ccgrid19}.
Our approach outperformed six other search heuristics, always finding a slowdown
of 1\% from the global optimum of the search space defined by the optimization of
a Laplacian kernel for GPUs, while using at most half of the allotted budget. In
a more comprehensive evaluation, using kernels from the SPAPT benchmark, our
approach was able to find the same speedups as RS while using up to \(10\times\) fewer
measurements. We showed that our approach explored search spaces more
efficiently, even for kernels where it performed similarly to random sampling.
We presented a completely automated version of our approach in this paper so
that we could perform a thorough performance evaluation on comprehensive
benchmarks. Despite using the same generic performance model for all kernels,
our approach was able to find good speedups by eliminating insignificant model
terms at each iteration. This means that our approach can still improve the
performance of applications using unspecialized models that incorporate only
general knowledge about algorithm performance. We would incur some budget
overhead in this case while insignificant terms are removed.
In future work we will explore the impact of user input and expert knowledge in
the selection of the initial performance model and in the subsequent elimination
of factors using ANOVA tests. We expect that tailored initial performance models
and assisted factor elimination will improve the solutions found by our approach
and decrease the budget needed to find them.
Our current strategy eliminates completely from the model the factors with low
significance detected by ANOVA tests. In future work we will also explore the
effect of adding random experiments with randomized factor levels. We expect
this will decrease the impact of removing factors wrongly detected to have low
significance.
Decreasing the number of experiments needed to find optimizations is a desirable
property for autotuners in problem domains other than source-to-source
transformation. We intend to evaluate the performance of our approach in domains
such as High-Level Synthesis and compiler configuration for FPGAs, where search
spaces can get as large as \(10^{126}\), and where we already have some
experience~\cite{bruel2017autotuninghls}.
\section*{Acknowledgment}
\label{sec:org3a3ce9b}
Experiments presented in this paper were carried out using the Grid'5000
testbed, supported by a scientific interest group hosted by Inria and including
CNRS, RENATER and several Universities as well as other organizations. This
work was partly funded by CAPES, \emph{Coordenação de Aperfeiçoamento de Pessoal de
Nível Superior}, Brazil, funding code 001.
\bibliographystyle{IEEEtran}
\bibliography{references}
\end{document}