VLIW Architecture

2. VLIW Processors
4. Loop Unrolling
5. Trace Scheduling
What is Good with Superscalars?
Binary compatibility:
If functional units are added in a new version of the architecture, or some other improvements have been made to the architecture (without changing the instruction set), old programs can benefit from the additional potential of parallelism.
Why?
Because the new hardware will issue the old instruction sequence in a
more efficient way.
What is Bad with Superscalars?
Very complex:
Much hardware is needed for run-time detection of parallelism. There is a limit to how far we can go with this technique.
Power consumption can be very large!
The Alternative: VLIW Processors
Parallelism is detected and operations are packed into long instruction words by the compiler. At execution, after one instruction has been fetched, all the corresponding operations are issued in parallel.
[Figure: the very long instruction word. One instruction contains several operation slots (op1, op2, op3, op4, ...); a slot for which no parallel operation is available remains empty.]
[Figure: VLIW processor organization. An instruction fetch unit and an instruction decode unit feed the execution unit, a set of functional units FU-1, FU-2, ..., FU-n, all connected to the register files.]
Advantages with VLIW Processors
Simpler hardware:
No additional sophisticated hardware is needed to detect parallelism at run time, as in superscalars.
Power consumption is reduced, compared to superscalars.
Good compilers can detect parallelism based on global analysis of the whole program (no instruction window problem).
Problems with VLIW Processors
Large number of registers needed in order to keep all FUs active (to store operands and results).
Large data transport capacity is needed between FUs and the register file, and between register files and memory.
High bandwidth is needed between the instruction cache and the fetch unit.
Example: one instruction with 7 operations, each 24 bits ⇒ 168 bits/instruction.
Large code size, partially because unused operation slots waste bits in the instruction word.
Incompatibility of binary code.
For example:
If additional FUs are introduced in a new version of the processor ⇒ the number of operations possible to execute in parallel is increased ⇒ the instruction word changes ⇒ old binary code cannot be run on this processor.
An Example
Consider the following code in C:
for (i=959; i >= 0; i--)
    x[i] = x[i] + s;
The delay for a double word load is one additional clock cycle.
The delay for a floating point operation is two additional clock cycles.
The VLIW code for one iteration of the loop:

cycle 1:  LDD F0,(R1)
cycle 2:  (empty)
cycle 3:  ADF F4,F0,F2
cycle 4:  SBI R1,R1,#8
cycle 5:  (empty)
cycle 6:  STD 8(R1),F4    BGEZ R1,Loop
One iteration takes 6 cycles. The whole loop takes 960*6 = 5760 cycles.
Almost no parallelism there.
Most of the fields in the instructions are empty.
We have two completely empty cycles.
Loop Unrolling
Let us rewrite the previous example, unrolling two iterations of the loop:
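A sketch of the correspondingly rewritten C source (reconstructed here to match the schedule below; two iterations are merged into one loop body):

for (i=959; i >= 0; i-=2){
    x[i] = x[i] + s;
    x[i-1] = x[i-1] + s;
}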
cycle 1:  LDD F0,(R1)      LDD F6,-8(R1)
cycle 2:  (empty)
cycle 3:  ADF F4,F0,F2     ADF F8,F6,F2
cycle 4:  SBI R1,R1,#16
cycle 5:  (empty)
cycle 6:  STD 16(R1),F4    STD 8(R1),F8    BGEZ R1,Loop
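Assuming the schedule above, one pass now covers two iterations of the original loop and still takes 6 cycles, so the whole loop needs 480*6 = 2880 cycles instead of 5760.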
Let us unroll three iterations in our example:
for (i=959; i >= 0; i-=3){
    x[i] = x[i] + s;
    x[i-1] = x[i-1] + s;
    x[i-2] = x[i-2] + s;
}
Unrolling can be carried further; in the schedule below, eight iterations of the original loop are merged into one pass:

cycle 1:  LDD F0,(R1)       LDD F6,-8(R1)
cycle 2:  LDD F10,-16(R1)   LDD F14,-24(R1)
cycle 3:  LDD F18,-32(R1)   LDD F22,-40(R1)   ADF F4,F0,F2     ADF F8,F6,F2
cycle 4:  LDD F26,-48(R1)   LDD F30,-56(R1)   ADF F12,F10,F2   ADF F16,F14,F2
cycle 5:                                      ADF F20,F18,F2   ADF F24,F22,F2
cycle 6:  STD (R1),F4       STD -8(R1),F8     ADF F28,F26,F2   ADF F32,F30,F2
cycle 7:  STD -16(R1),F12   STD -24(R1),F16
cycle 8:  STD -32(R1),F20   STD -40(R1),F24   SBI R1,R1,#64
cycle 9:  STD 16(R1),F28    STD 8(R1),F32     BGEZ R1,Loop
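Assuming the schedule above, one pass (eight iterations of the original loop) takes 9 cycles, so the whole loop needs (960/8)*9 = 1080 cycles, compared to 5760 for the non-unrolled version.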
A good compiler has to find the optimal level of unrolling for each loop.
Loop unrolling increases the memory space needed to store the program.
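As an illustration of this trade-off (not taken from the lecture; the function and variable names are invented), here is a generic loop of the same kind unrolled by four in C. The epilogue loop, needed when the iteration count is not a multiple of the unroll factor, is part of the extra code:

/* Sketch: x[i] += s unrolled by a factor of 4.
   The main body handles four elements per pass; the epilogue loop
   picks up the remaining 0..3 elements and adds to the code size. */
void add_scalar_unrolled(double *x, double s, int n)
{
    int i = 0;
    for (; i + 3 < n; i += 4) {
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
    for (; i < n; i++)   /* epilogue for the leftover iterations */
        x[i] += s;
}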
Trace Scheduling
Trace scheduling is another technique used in compilers in order to exploit parallelism across conditional branches.
It consists of two steps:
1. Trace selection
2. Instruction scheduling
Example:
if (c != 0)
    b = a / c;
else {
    b = 0;
    h = 0;
}
f = g + h;
This (for an ordinary processor) would be compiled to:

      LD R0, c        R0 ← c          ;(load word)
      BZ R0,Else
      LD R1, a        R1 ← a          ;(load integer)
      DV R1,R1,R0     R1 ← R1 / R0    ;(divide integer)
      ST b,R1         b ← R1          ;(store word)
      BR Next
Else: STI b,#0        b ← 0
      STI h,#0        h ← 0
Next: LD R0, g        R0 ← g          ;(load word)
      LD R1, h        R1 ← h          ;(load word)
      AD R1,R1,R0     R1 ← R1 + R0    ;(add integer)
      ST f,R1         f ← R1          ;(store word)
End:
Trace selection: the most likely path through the code is selected as a trace; here this is the path taken when c != 0:

      LD R0, c
      BZ R0,Else
      LD R1, a
      DV R1,R1,R0
      ST b,R1
Next: LD R0, g
      LD R1, h
      AD R1,R1,R0
      ST f,R1

Instruction scheduling: the operations of the trace are compacted into long instruction words, as if the conditional branch were not there:

      LD R0,c        LD R1,a
      LD R2,g        LD R3,h        BZ R0,Else
      DV R1,R1,R0
Next: ST b,R1        AD R3,R3,R2
      ST f,R3        BR End

The code for the entire sequence is produced by using the schedule generated for the selected trace.
In the example, the loads of g and h are moved up, from the Next sequence (into R2 and R3), before the conditional branch. Since operations have been moved across the branch, compensation code has to be added on the Else path.
That’s the correct code:

      LD R0,c        LD R1,a
      LD R2,g        LD R3,h        BZ R0,Else
      DV R1,R1,R0
Next: ST b,R1        AD R3,R3,R2
      ST f,R3        BR End
Else: STI R1,#0      STI h,#0
      STI R3,#0      BR Next
End:

On the Else path the compensation code sets R1 and R3 to 0 and stores 0 into h, so that the operations at Next (ST b,R1; AD R3,R3,R2; ST f,R3) still produce the required values: b = 0, h = 0, f = g.
At program execution the correct path will always be taken (of course!); however, if it is not the one selected by the compiler, execution will be slower because of the compensation code.
Some VLIW Processors
Examples of successful VLIW processors:
TriMedia of Philips
TMS320C6x of Texas Instruments
The Itanium Architecture
The Itanium is not a pure VLIW architecture, but many of its features are typical
for VLIW processors.
General Organization
[Figure: the instruction fetch unit and the instruction decode & control unit feed the functional units (FUs); the FUs are connected to 128 integer registers, 128 floating-point registers, 64 predicate registers, and to memory.]
Instruction Format
128 bits: | Operation 1 | Operation 2 | Operation 3 | Template |
3 operations/instruction word (40 bits/operation).
This does not mean that max. 3 operations can be executed in parallel! The three operations in the instruction are not necessarily parallel!
The template (8 bits) indicates what can be executed in parallel:
The encoding in the template shows which of the operations in the instruction can be executed in parallel.
The template also connects to neighbouring instructions ⇒ operations from different instructions can be executed in parallel.
The template provides high flexibility and avoids some of the problems with classical VLIW processors:
Operations in one instruction do not necessarily have to be parallel ⇒ no slots have to be left empty when no parallel operation is available.
The number of parallel operations is not restricted by the instruction size ⇒ processor generations can have different numbers of functional units without changing the instruction format ⇒ binary compatibility.
If, according to the template, there are more parallel operations than functional units available ⇒ the processor takes them sequentially.
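Purely to illustrate the grouping idea (the encoding below is invented for this sketch and is not the real Itanium template encoding), here is a toy decoder in C that assumes template bit i set means "operation i starts a new parallel group":

#include <stdint.h>
#include <stdio.h>

/* Toy sketch of template-driven grouping (hypothetical encoding):
   bit i of the template set => operation i starts a new group;
   operations inside the same group may be issued in parallel,
   groups are issued one after the other. */
typedef struct {
    uint64_t op[3];   /* the three 40-bit operation slots */
    uint8_t  tmpl;    /* 8-bit template (only 3 bits used here) */
} bundle;

static void print_groups(const bundle *b)
{
    int group = 0;
    for (int i = 0; i < 3; i++) {
        if (b->tmpl & (1u << i))
            group++;                        /* start a new group */
        printf("operation %d -> group %d\n", i + 1, group);
    }
}

int main(void)
{
    bundle b = { {0, 0, 0}, 0x04 };  /* ops 1 and 2 parallel, op 3 alone */
    print_groups(&b);
    return 0;
}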
Predicated Execution
[Each 40-bit operation in the 128-bit instruction word can carry a predicate field.]
Any operation can refer to a predicate register <Pi>, where i is the number of a predicate register (between 0 and 63).
This means that the respective operation is to be committed (the results made visible) only when the respective predicate is true (the predicate register gets value 1).
If the predicate value is known when the operation is issued, the operation is executed only if this value is true.
If the predicate is not known at that moment, the operation is started; if the predicate turns out to be false, the operation is discarded.
If no predicate register is mentioned, the operation is executed and committed unconditionally.
Predicate assignment:
Predicate registers are set by compare operations; typically a compare sets a pair of predicate registers, one to the outcome of the condition and the other to its complement.
Branch Predication
The idea is: let instructions from both branches go on in parallel, before the
branch condition has been evaluated. The hardware (predicated execution)
takes care that only those instructions are committed which correspond to
the right branch.
Branch predication is not branch prediction:
Branch prediction: guess which branch is taken and then go along that one; if the guess was wrong, undo all the work.
Branch predication: both branches are started, and when the condition is known (the predicate registers are set) the right instructions are committed; all the others are discarded.
Example:
if (a && b)
    j = j + 1;
else {
    if (c)
        k = k + 1;
    else
        k = k - 1;
    m = k * 5;
}
i = i + 1;

Assumptions: the values are stored in registers, as follows:
a: R0; b: R1; j: R2; c: R3; k: R4; m: R5; i: R6.

This sequence (for an ordinary processor) would be compiled to:

    BZ R0, L1       branch if a == 0
    BZ R1, L1       branch if b == 0
    ADI R2, R2,#1   R2 ← R2 + 1 ;(integer)
    BR L4
L1: BZ R3, L2       branch if c == 0
    ADI R4, R4,#1   R4 ← R4 + 1 ;(integer)
    BR L3
L2: SBI R4, R4,#1   R4 ← R4 - 1 ;(integer)
L3: MPI R5, R4,#5   R5 ← R4 * 5 ;(integer)
L4: ADI R6, R6,#1   R6 ← R6 + 1 ;(integer)
With predicated execution, the compiler can plan all these instructions to be issued in parallel, except for the operations that update k ((5) and (6)) together with the operation that computes m = k * 5 ((7)), which are data-dependent.
Instructions can be started before the particular predicate on which they depend is known. When the predicate becomes known, the particular instruction will or will not be committed.
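To make the principle concrete (an illustration only, not the lecture's predicated machine code; the function name is invented), the example can be written in a branch-free, if-converted form in C, where p1..p4 play the role of predicate registers and decide which results are committed:

/* Sketch: emulating predicated execution of the example in C.
   All predicates are computed first; every assignment is then
   "committed" only when its predicate is true. */
void predicated_example(int a, int b, int c,
                        int *j, int *k, int *m, int *i)
{
    int p1 = (a != 0) && (b != 0);   /* then-branch of if (a && b) */
    int p2 = !p1;                    /* else-branch                */
    int p3 = p2 && (c != 0);         /* guards k = k + 1           */
    int p4 = p2 && (c == 0);         /* guards k = k - 1           */

    *j = p1 ? *j + 1 : *j;
    *k = p3 ? *k + 1 : *k;
    *k = p4 ? *k - 1 : *k;
    *m = p2 ? *k * 5 : *m;           /* uses the updated k         */
    *i = *i + 1;                     /* executed unconditionally   */
}

The two updates of k and the computation of m must still be kept in order (the data dependence mentioned above); all the other assignments could be issued in parallel on a predicated VLIW machine.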