\section{Instruction Scheduling}
\subsection{Introduction}
In previous chapters, we talked about what we called
machine-independent optimizations: for example, partial redundancy
elimination, dead-code elimination, and constant propagation and
folding. These are good for improving code no matter what machine you
are running on; it is always good to eliminate work from the code. Now
we turn to machine-dependent optimizations, where you need to know a
little bit of information about the machine you are targeting. The goal
of the machine-independent optimizations is simply to eliminate work.
For register allocation, the point is to make it less expensive to
access data: it is much cheaper to use data in registers than to go to
memory on the stack all the time.
What is the point of instruction scheduling? With instruction
scheduling, we are not getting rid of work. We assume that the earlier
optimization passes have already eliminated as much work as possible;
instead, our goal is to take that fixed amount of work and perform it
faster. How is that possible? The basic answer is that we want to
execute things in parallel. If we want to squeeze the same amount of
work into less time, that essentially means overlapping things. Say
there is a sequence of instructions assigning values to a, b, and c.
With instruction scheduling we still have to do all of this work, but
if we can do the work simultaneously, that allows us to get our answer
faster.
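
As a minimal sketch of this idea in C (the function and variable names
are mine, for illustration only): the assignments in the first function
are independent, so a scheduler is free to overlap them, while those in
the second form a chain that must run one step at a time.

\begin{verbatim}
int independent(int x, int y, int z) {
    /* No assignment reads a result produced by another, so a
     * scheduler may overlap or reorder these freely. */
    int a = x * 2;
    int b = y * 3;
    int c = z * 4;
    return a + b + c;
}

int dependent(int x) {
    /* Each assignment reads the previous result, forming a chain
     * that must effectively execute one step at a time. */
    int a = x * 2;
    int b = a * 3;
    int c = b * 4;
    return c;
}
\end{verbatim}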
Parallelism occurs in different forms in modern machines, and I am
going to talk a little bit about each of them. The first two forms are
pipelining and superscalar processing; exploiting those types of
parallelism is, for historical reasons, what we call instruction
scheduling. Taking full advantage of parallel processing on multiple
cores with multiple threads is what we call automatic parallelization,
which we will talk about later.
At a high level, if you think about the time that it takes to execute
an instruction, the idea behind pipelining is to break up that time
into different stages. Some processors, for example, break execution
into five stages. You fetch an instruction, grabbing it from memory;
you decode it, figuring out what it is and reading any registers it is
going to read from; you execute it; then, if it is a memory access, you
load from or store to memory; and finally, you may have a result that
you need to put into the register file. These are five steps that
instructions typically go through, and the key thing is that they use
independent pieces of hardware as they flow through these different
stages of execution. Since the stages use different pieces of hardware,
rather than executing instructions one after another, we can overlap
them in time, so that at any moment we have several different
instructions in flight, each using a different piece of hardware.
Notice that at one moment in time we can be using the hardware
associated with all five stages at once, because those are independent
pieces of hardware that happen to be running different instructions.
This is one way to create parallelism.
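
To make the overlap concrete, here is a sketch of five instructions
flowing through the classic five-stage pipeline (IF = fetch, ID =
decode and register read, EX = execute, MEM = memory access, WB = write
back). From cycle 5 on, all five pieces of hardware are busy, each with
a different instruction.

\begin{verbatim}
cycle:  1    2    3    4    5    6    7    8    9
i1:     IF   ID   EX   MEM  WB
i2:          IF   ID   EX   MEM  WB
i3:               IF   ID   EX   MEM  WB
i4:                    IF   ID   EX   MEM  WB
i5:                         IF   ID   EX   MEM  WB
\end{verbatim}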
Should we simply make pipelines deeper and deeper? That becomes less
and less attractive, because there is overhead between pipeline stages:
you need latch registers that capture the state passed between pipe
stages. This is why you do not see hundred-stage pipelines. In fact, if
you make a pipeline stage shorter than the time it takes to do just a
basic arithmetic operation, it is not particularly clear what the point
of pipelining at such a tiny granularity would be. The length of a
pipeline stage can affect the clock rate, and there was once a lot of
focus on increasing clock rates; the Pentium 4 processor, for example,
had a 20-stage pipeline.
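
A rough back-of-the-envelope model (the notation here is mine, not from
the lecture) shows why the returns diminish. If we split a fixed
execution latency $T_{\mathrm{exec}}$ into $N$ stages, with a fixed
latch overhead $T_{\mathrm{latch}}$ between stages, the clock period is
bounded by
\[
  T_{\mathrm{clock}} \;\ge\; \frac{T_{\mathrm{exec}}}{N} + T_{\mathrm{latch}}.
\]
As $N$ grows, the first term shrinks toward zero while
$T_{\mathrm{latch}}$ stays fixed, so beyond some depth the latch
overhead dominates and the clock rate barely improves.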
Pipelining is one form of parallelism. Another way to make processors
run faster is not just to break up the execution of one instruction
into independent stages, but to actually allow multiple instructions to
proceed side by side through the pipeline stages. In order to do this,
you need to create more hardware; in particular, you need more ALUs.
Focus on the execution stage, where we do an arithmetic operation. In
reality, this involves using a piece of hardware called an ALU. The
idea is: what if we created multiple ALUs, say two of them, or it could
be more than that? Then instead of just executing one instruction
through the ALU, we can potentially execute two instructions side by
side. The requirement, though, is that they have to be independent
instructions; they cannot depend on each other.
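
Here is a small C sketch of what that independence requirement means
for a hypothetical two-ALU machine; the cycle and ALU assignments in
the comments describe one plausible pairing, not the behavior of any
specific processor.

\begin{verbatim}
int pairs(int x, int y) {
    int a = x + 1;  /* cycle 1, ALU 0                               */
    int b = y + 2;  /* cycle 1, ALU 1: independent of a, so it can
                       issue in the same cycle                      */
    int c = a + b;  /* cycle 2: reads both a and b, so it has to
                       wait for them                                */
    return c;
}
\end{verbatim}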
In the original pipeline, there is only one instruction in a given pipe
stage at a given time. In a superscalar pipeline, there can be multiple
instructions in the same pipe stage at the same time. The idea is that
we create even more parallelism by allowing more instructions to
proceed simultaneously, although this requires even more independence
between the instructions, at least among the ones going side by side.
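
Compare the earlier pipeline diagram with this sketch of a two-wide
superscalar pipeline, where two independent instructions occupy the
same pipe stage in the same cycle:

\begin{verbatim}
cycle:  1    2    3    4    5    6
i1:     IF   ID   EX   MEM  WB
i2:     IF   ID   EX   MEM  WB      (same stage as i1 each cycle)
i3:          IF   ID   EX   MEM  WB
i4:          IF   ID   EX   MEM  WB (same stage as i3 each cycle)
\end{verbatim}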
Because of pipelining and superscalar processing, we need parallel
instructions that do not have tight data dependencies, so that we can
execute them simultaneously. Think of the instruction stream as the
dynamic path through the code. The ideal for instruction scheduling is
that, through some magic, we take all these instructions and figure out
a way to cram them together into as little time as possible. In the
absolute extreme, instead of doing n instructions in n cycles, what if
we could cram all of them into one cycle? Is this possible?
There are three major things that can constrain our ability to achieve
that ideal: hardware resources, data dependencies, and control
dependencies.
First, hardware resources. Modern processors have finite resources, and
there are often constraints on how those resources can be used. Three
kinds of constraints affect scheduling. First of all, a processor can
only issue a certain number of instructions every clock cycle; that
might be, for example, four instructions. The next type of constraint
is that there is a fixed mixture of functional units, such as a certain
number of ALUs. So even if you can issue four instructions, you cannot
issue an arbitrary mix of them: there are constraints based on the type
of instruction you are trying to execute. Finally, another type of
limitation is that some functional units are not fully pipelined: once
you start an instruction down such a piece of hardware, you cannot
necessarily start a new one on the very next clock cycle. Integer
divide is a typical example. Some operations keep a piece of hardware
busy for a while, so there might be a restriction on how soon you can
start the next instruction.
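
To make these three constraints concrete, here is a small C sketch of
the per-cycle check a scheduler might perform before placing an
instruction. All of the numbers are hypothetical, made up for
illustration: the issue width of four, the unit counts, and the
eight-cycle divide recovery time do not describe any real processor.
The sketch also simplifies by keeping one shared recovery timer per
unit kind.

\begin{verbatim}
#include <stdbool.h>

enum unit { ALU, MEMPORT, DIVIDER };  /* functional unit kinds */

#define ISSUE_WIDTH 4  /* constraint 1: max instructions issued per cycle */

/* Constraint 2: the fixed mix of functional units. */
static const int unit_count[] = { 2, 1, 1 };  /* 2 ALUs, 1 mem port, 1 divider */

/* Constraint 3: cycles before a unit kind can accept another
 * instruction; the divider is not pipelined, so it stays busy. */
static const int recovery[] = { 1, 1, 8 };

static int issued_this_cycle;  /* instructions issued so far this cycle */
static int units_used[3];      /* units of each kind claimed this cycle */
static int busy_until[3];      /* first cycle each unit kind is free    */

/* Can an instruction needing unit `u` issue in cycle `cycle`? */
bool can_issue(enum unit u, int cycle) {
    return issued_this_cycle < ISSUE_WIDTH   /* an issue slot is left    */
        && units_used[u] < unit_count[u]     /* a unit of this kind free */
        && cycle >= busy_until[u];           /* unit has recovered       */
}

/* Record that such an instruction was issued in cycle `cycle`. */
void issue(enum unit u, int cycle) {
    issued_this_cycle++;
    units_used[u]++;
    busy_until[u] = cycle + recovery[u];
}

/* Reset the per-cycle counters when the schedule advances a cycle. */
void next_cycle(void) {
    issued_this_cycle = 0;
    units_used[ALU] = units_used[MEMPORT] = units_used[DIVIDER] = 0;
}
\end{verbatim}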