\newpage
\section{Parallelism and Dependence Theory}
\subsection{Exercises}
\begin{problem}
Consider the following loop nest:
\begin{lstlisting}[language=C,frame=single]
for (i = 1; i < 1000; i++)
    for (j = 1; j < 1000; j++)
        A[i][j] = A[i+1][j-1];
\end{lstlisting}
To parallelize this loop as much as possible, we may pick a processor space and map each iteration (i, j)
to a selected processor in that space, using an affine transformation. Consider the data dependences in the
statement of the loop body. What space-partition constraints do they imply? What is the maximum dimension of
the processor space? What constraints on the affine mapping from iterations to processors are implied? Identify
from the list below the one mapping that
\begin{enumerate}
\item satisfies the space-partition constraints, and
\item provides the maximum possible degree of parallelism.
\end{enumerate}
\begin{itemize}
\item a.
$$
\begin{bmatrix}
1 & 1
\end{bmatrix}
\times
\begin{bmatrix}
i \\
j
\end{bmatrix}
+
\begin{bmatrix}
2
\end{bmatrix}
$$
\item b.
$$
\begin{bmatrix}
3 & 0
\end{bmatrix}
\times
\begin{bmatrix}
i \\
j
\end{bmatrix}
+
\begin{bmatrix}
5
\end{bmatrix}
$$
\item c.
$$
\begin{bmatrix}
1 & 0
\end{bmatrix}
\times
\begin{bmatrix}
i \\
j
\end{bmatrix}
+
\begin{bmatrix}
2
\end{bmatrix}
$$
\item d.
$$
\begin{bmatrix}
4 & -4
\end{bmatrix}
\times
\begin{bmatrix}
i \\
j
\end{bmatrix}
+
\begin{bmatrix}
9
\end{bmatrix}
$$
\item {\color{red} Hint: The data-dependence constraints say that iterations which touch the same array element must map to the same processor. Notice that A[i,j] is written at iteration (i, j) and read at iteration (i', j'), where i = i' + 1 and j = j' - 1. Thus, for example, the iteration with i = 7 and j = 4 must be assigned to the same processor as the iteration with i = 6 and j = 5.}
\end{itemize}
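As a short sketch of how the hint translates into algebra: writing the candidate affine mapping as $C_1 i + C_2 j + c$, the requirement that iteration $(i, j)$, which writes A[i,j], and iteration $(i-1, j+1)$, which reads it, be assigned to the same processor becomes
$$
C_1 i + C_2 j + c = C_1 (i-1) + C_2 (j+1) + c ,
$$
which can hold for all $(i, j)$ only if $C_1 = C_2$.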
\end{problem}
\begin{problem}
Consider the following fully permutable loop nest:
\begin{lstlisting}[language=C,frame=single]
for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
        A[i][j] = A[i+1][j+1];
\end{lstlisting}
A time partition for the statement that forms the body of the loop nest is a matrix and vector of the form
$$
\begin{bmatrix}
x & y
\end{bmatrix}
\times
\begin{bmatrix}
i \\
j
\end{bmatrix}
+
\begin{bmatrix}
z
\end{bmatrix}
$$
that satisfies the time-partition constraints for the loop. Determine what these constraints are, and what they say about the possible values of x, y, and z. Then, identify the time partition in the list below that DOES NOT satisfy the constraints.
\begin{itemize}
\item a.
$$
\begin{bmatrix}
3 & 2
\end{bmatrix}
\times
\begin{bmatrix}
i \\
j
\end{bmatrix}
+
\begin{bmatrix}
200
\end{bmatrix}
$$
\item b.
$$
\begin{bmatrix}
1 & 1
\end{bmatrix}
\times
\begin{bmatrix}
i \\
j
\end{bmatrix}
+
\begin{bmatrix}
0
\end{bmatrix}
$$
\item c.
$$
\begin{bmatrix}
1 & 0
\end{bmatrix}
\times
\begin{bmatrix}
i \\
j
\end{bmatrix}
+
\begin{bmatrix}
100
\end{bmatrix}
$$
\item d.
$$
\begin{bmatrix}
-5 & 3
\end{bmatrix}
\times
\begin{bmatrix}
i \\
j
\end{bmatrix}
+
\begin{bmatrix}
1000
\end{bmatrix}
$$
\item {\color{red} Hint: The data dependences for this problem are based on the fact that A[i+1,j+1] is read during iteration (i, j) and the same element is written later, at iteration (i+1, j+1). Thus, the time-partition constraints require that the time assigned to iteration (i, j) is no greater than the time assigned to iteration (i+1, j+1).}
\end{itemize}
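As a short sketch of how the hint constrains the coefficients: for the time partition $x i + y j + z$, requiring that iteration $(i, j)$ is scheduled no later than iteration $(i+1, j+1)$ gives
$$
x i + y j + z \leq x (i+1) + y (j+1) + z ,
$$
which simplifies to $x + y \geq 0$, with no constraint on $z$.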
\end{problem}
\subsection{DATA DEPENDENCE}
\begin{figure}[H]
\centering
\includegraphics[width=0.3\textwidth]{p197.png}
\caption{An example with statements $S_1$--$S_4$.}
\label{fig:p197}
\end{figure}
\begin{itemize}
\item Flow (true) dependence: a statement $S_i$ precedes a statement $S_j$ in execution and $S_i$ computes a data value that $S_j$ uses; written $S_i \delta^{+} S_j$. In Figure \ref{fig:p197}, $S_1 \delta^{+} S_2$ and $S_2 \delta^{+} S_4$.
\item Anti dependence: a statement $S_i$ precedes a statement $S_j$ in execution and $S_i$ uses a data value that $S_j$ computes; written $S_i \delta^{a} S_j$. In Figure \ref{fig:p197}, $S_2 \delta^{a} S_3$.
\item Output dependence: a statement $S_i$ precedes a statement $S_j$ in execution and $S_i$ computes a data value that $S_j$ also computes; written $S_i \delta^{o} S_j$. In Figure \ref{fig:p197}, $S_1 \delta^{o} S_3$ and $S_3 \delta^{o} S_4$.
\end{itemize}
The dependence is said to flow from $S_i$ to $S_j$ because $S_i$ precedes $S_j$ in execution; $S_i$ is called the source of the dependence and $S_j$ its sink.
The only ``true'' dependence is flow dependence; it represents the actual flow of data in the program.
The other types of dependence are caused by programming style and may be eliminated by renaming, as seen in Figure \ref{fig:p198} and in the short example below.
\begin{figure}[H]
\centering
\includegraphics[width=0.3\textwidth]{p198.png}
\caption{Dependences other than flow dependence are eliminated by renaming in the code of Figure \ref{fig:p197}.}
\label{fig:p198}
\end{figure}
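The three kinds of dependence, and the effect of renaming, can also be illustrated with a small hypothetical straight-line fragment (this is not the code of Figure \ref{fig:p197}; statement names are for illustration only):
\begin{lstlisting}[language=C,frame=single]
x = a + b;    /* S1: writes x                                    */
y = x * 2;    /* S2: reads x: flow dependence S1 -> S2           */
x = c + 1;    /* S3: rewrites x: output dependence S1 -> S3 and
                 anti dependence S2 -> S3 (S2 reads x before S3) */
z = x + y;    /* S4: reads x and y: flow dependences S3 -> S4
                 and S2 -> S4                                    */
\end{lstlisting}
Renaming the second definition of x (for example, writing x1 in S3 and reading x1 in S4) removes the anti and output dependences while leaving the flow dependences intact.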
\subsection{DATA DEPENDENCE DIRECTIONS}
\subsubsection{$=$}
\begin{figure}[H]
\centering
\includegraphics[width=0.3\textwidth]{p199.png}
\caption{A loop-independent dependence (direction $=$).}
\label{fig:p199}
\end{figure}
In Figure \ref{fig:p199}, there is an instance of $S_1$ that precedes an instance of $S_2$ in
execution, and $S_1$ produces data that $S_2$ consumes.
$S_1$ is the {\color{red}source} of the dependence; $S_2$ is the {\color{red}sink} of the
dependence.
The dependence flows between instances of statements in the
same iteration ({\color{red}loop-independent} dependence).
The number of iterations between source and sink (the dependence
distance) is 0. The {\color{red}dependence direction} is {\color{red}$=$}:
$S_1 \delta^{+}_{=} S_2$ or $S_1 \delta^{+}_{0} S_2$.
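A minimal hypothetical loop with this pattern (not necessarily the code of Figure \ref{fig:p199}) looks as follows:
\begin{lstlisting}[language=C,frame=single]
for (i = 0; i < n; i++) {
    a[i] = b[i] + 1;   /* S1: writes a[i]                        */
    c[i] = a[i] * 2;   /* S2: reads a[i] in the same iteration:
                          loop-independent, distance 0, "="      */
}
\end{lstlisting}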
\subsubsection{$<$}
\begin{figure}[H]
\centering
\includegraphics[width=0.3\textwidth]{p200.png}
\caption{A loop-carried dependence with distance 1 (direction $<$).}
\label{fig:p200}
\end{figure}
In Figure \ref{fig:p200}, there is an instance of $S_1$ that precedes an instance of $S_2$ in
execution, and $S_1$ produces data that $S_2$ consumes.
$S_1$ is the source of the dependence; $S_2$ is the sink of the
dependence.
The dependence flows between instances of statements in
different iterations (loop-carried dependence).
The dependence distance is 1. The direction is positive ($<$):
$S_1 \delta^{+}_{<} S_2$ or $S_1 \delta^{+}_{1} S_2$.
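A hypothetical loop with a distance-1, loop-carried flow dependence (not necessarily the code of Figure \ref{fig:p200}):
\begin{lstlisting}[language=C,frame=single]
for (i = 1; i < n; i++) {
    a[i] = b[i] + 1;     /* S1: writes a[i]                      */
    c[i] = a[i-1] * 2;   /* S2: reads a[i-1], written one
                            iteration earlier: loop-carried,
                            distance 1, direction "<"            */
}
\end{lstlisting}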
\subsubsection{$>$}
\begin{figure}[H]
\centering
\includegraphics[width=0.3\textwidth]{p201.png}
\caption{}
\label{fig:p201}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.3\textwidth]{p202.png}
\caption{}
\label{fig:p202}
\end{figure}
An instance of $S_1$ precedes another instance of $S_1$ in
execution, and $S_1$ produces data that $S_1$ itself consumes,
so $S_1$ is both the source and the sink of the dependence.
The dependence is loop-carried.
The dependence distance is $(1,-1)$, so the direction vector
is $(<,>)$: $S_1 \delta^{+}_{(1,-1)} S_1$.
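A hypothetical doubly nested loop with this dependence pattern (not necessarily the code of Figures \ref{fig:p201} and \ref{fig:p202}):
\begin{lstlisting}[language=C,frame=single]
for (i = 1; i < n; i++)
    for (j = 0; j < m - 1; j++)
        /* the value written at iteration (i-1, j+1) is read at
           iteration (i, j): distance (1,-1), direction (<, >)   */
        a[i][j] = a[i-1][j+1] + 1;
\end{lstlisting}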
\subsection{LOOP TRANSFORMATIONS USING DIRECTION VECTORS}
\subsubsection{Loop Parallelization}
The iterations of a loop may be executed in parallel with one
another if and only if no dependences are carried by that loop.
A dependence is carried by the outermost loop whose entry in the
direction vector is not ``=''; a loop is therefore safe to run in
parallel when, for every dependence, either its direction for that
loop is ``='' or the dependence is already carried by an enclosing
loop (see the example after Figure \ref{fig:p204}).
\begin{figure}[H]
\centering
\includegraphics[width=0.3\textwidth]{p203.png}
\caption{Iterations of loop j must be executed sequentially, but the
iterations of loop i may be executed in parallel. Outer loop parallelism.}
\label{fig:p203}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.3\textwidth]{p204.png}
\caption{Iterations of loop i must be executed sequentially, but the
iterations of loop j may be executed in parallel. Inner loop parallelism.}
\label{fig:p204}
\end{figure}
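A concrete sketch (a hypothetical loop, assuming an OpenMP annotation for illustration; it is not the loop of Figure \ref{fig:p203}): the only dependence below has distance $(0,1)$, direction $(=,<)$, so it is carried by the j loop, and the iterations of the i loop are independent.
\begin{lstlisting}[language=C,frame=single]
/* The i loop carries no dependence and may run in parallel;
   each i iteration still executes its j loop sequentially.      */
#pragma omp parallel for private(j)
for (i = 0; i < n; i++)
    for (j = 1; j < m; j++)
        a[i][j] = a[i][j-1] + b[i][j];
\end{lstlisting}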
\subsubsection{Loop Interchange}
It is legal to interchange, block, or tile loops when all
dependence direction vectors remain lexicographically positive
after the transformation; that is, in every permuted direction
vector the outermost direction other than ``='' must still be
``$<$'' (positive), as in the example below.
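A hypothetical sketch: the single dependence below has distance $(1,1)$, direction $(<,<)$, which stays lexicographically positive when the loops are interchanged, so the interchange is legal.
\begin{lstlisting}[language=C,frame=single]
/* Original order: direction vector (<, <).                      */
for (i = 1; i < n; i++)
    for (j = 1; j < m; j++)
        a[i][j] = a[i-1][j-1] + 1;

/* After interchange the direction vector, read in (j, i) order,
   is still (<, <), i.e., lexicographically positive.            */
for (j = 1; j < m; j++)
    for (i = 1; i < n; i++)
        a[i][j] = a[i-1][j-1] + 1;
\end{lstlisting}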
\subsection{DATA DEPENDENCE DECISION ALGORITHM}
\subsubsection{Lamport's Test}
Lamport’s Test is used when there is a single index variable
in the subscript expressions, and when the coefficients of
the index variable in both expressions are the same.
\[
A( \cdots , b * i + c_1 , \cdots ) = \cdots
\]
\[
\cdots = A( \cdots , b * i + c_2 , \cdots )
\]
The dependence problem: do there exist $i_1$ and $i_2$, with
\(L_i \leq i_1 \leq U_i\) and \(L_i \leq i_2 \leq U_i\), such that
\[
b * i_1 + c_1 = b * i_2 + c_2
\]
or, equivalently,
\[
i_2 - i_1 = \frac{c_1-c_2}{b} \, ?
\]
There is an integer solution if and only if \( \frac{c_1-c_2}{b} \) is an integer.
If so, the dependence distance is \( d = \frac{c_1-c_2}{b} \), and a dependence exists provided \( |d| \leq U_i - L_i \).
\begin{itemize}
\item $d > 0$: true dependence.
\item $d = 0$: loop independent dependence.
\item $d < 0$: anti dependence.
\end{itemize}
\begin{figure}[H]
\centering
\includegraphics[width=0.3\textwidth]{p205.png}
\caption{}
\label{fig:p205}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.3\textwidth]{p206.png}
\caption{}
\label{fig:p206}
\end{figure}
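The test is mechanical enough to state as code. The sketch below assumes two references $A[b * i + c_1]$ (write) and $A[b * i + c_2]$ (read) in a single loop with bounds $L_i$ to $U_i$; the function name and interface are illustrative only.
\begin{lstlisting}[language=C,frame=single]
#include <stdlib.h>

/* Lamport's Test for A[b*i + c1] (write) and A[b*i + c2] (read)
   in a loop with bounds L..U.  Returns 1 and stores the distance
   in *dist when a dependence exists, 0 otherwise.               */
int lamport_test(int b, int c1, int c2, int L, int U, int *dist)
{
    if (b == 0)
        return 0;              /* the index must really appear   */
    if ((c1 - c2) % b != 0)
        return 0;              /* (c1-c2)/b is not an integer    */
    int d = (c1 - c2) / b;     /* candidate dependence distance  */
    if (abs(d) > U - L)
        return 0;              /* distance exceeds the loop span */
    *dist = d;                 /* d > 0: true, d = 0: loop-
                                  independent, d < 0: anti       */
    return 1;
}
\end{lstlisting}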
\subsubsection{GCD Test}
The greatest common divisor (GCD) test is used in compiler theory,
in loop optimization and loop dependence analysis, to test for
dependences between loop statements~\cite{GCDtestW13:online}.
It is difficult to analyze array references at compile time to
determine whether two references can refer to the same address.
A simple test that is sufficient to prove the absence of a dependence
is the GCD test. It is based on the observation that if a loop-carried
dependence exists between $X[a*i + b]$ and $X[c*i + d]$ (where $X$ is
the array; $a$, $b$, $c$, and $d$ are integers; and $i$ is the loop
variable), then $\gcd(a, c)$ must divide $(d - b)$. If it does not,
no dependence is possible; if it does, a dependence may or may not
exist, so the test is conservative. The test assumes that the loop is
normalized: written so that the loop index starts at 1 and is
incremented by 1 in every iteration. For example, in the following
loop, $a=2$, $b=3$, $c=2$, $d=0$; $\gcd(a,c)=2$ and $(d-b)=-3$.
Since 2 does not divide $-3$, no dependence is possible.
\begin{lstlisting}[language=C,frame=single,label=lst:expression1]
for (i = 1; i <= 100; i++)
{
    X[2*i+3] = X[2*i] + 50;
}
\end{lstlisting}
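The test is also easy to write down as code; the following sketch (function names are illustrative, not from any particular library) applies it to the loop above:
\begin{lstlisting}[language=C,frame=single]
#include <stdio.h>

/* Euclid's algorithm; the result is returned as a positive value. */
static int gcd(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a < 0 ? -a : a;
}

/* GCD test for X[a*i + b] and X[c*i + d] in a normalized loop.
   Returns 0 when a dependence is impossible, 1 when a dependence
   may exist (the test is conservative).  Assumes a and c are not
   both zero.                                                     */
int gcd_test(int a, int b, int c, int d)
{
    return (d - b) % gcd(a, c) == 0;
}

int main(void)
{
    /* The loop above: a = 2, b = 3, c = 2, d = 0.
       gcd(2, 2) = 2 does not divide -3, so this prints 0.        */
    printf("%d\n", gcd_test(2, 3, 2, 0));
    return 0;
}
\end{lstlisting}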