
Lecture 5

Pipelining + Hazards

Pipelining Basics
EE/CS520 Comp. Archi.
9/12/2012

Fundamental Execution Cycle
Instruction Fetch: Obtain instruction from instruction memory
Instruction Decode: Determine required actions and instruction size
Operand Fetch: Locate and obtain operand data
Execute: Compute result value or status or condition
Result Store: Deposit results in storage for later use
Next Instruction: Determine successor instruction

How to Improve Performance
The basic idea is to reduce the execution time
  Increase the clock frequency
  Work in parallel on multiple data
    Parallelism
  Serialize the operations like an assembly line
    Pipelining

Example: Car Assembly Line
[Figure: one car passing through tasks T1 T2 T3 T4 T5 T6 along the time axis]
One worker doing all the work
Latency: 6 time units
Throughput: 1 car every 6 time units

How to increase the production?
Dedicate one worker for each elementary task
[Figure: successive cars overlapped in time, each passing through tasks T1 T2 T3 T4 T5 T6]
Work divided among 6 workers
Latency: 6 time units
Throughput: 1 car every 1 time unit
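
The two assembly-line slides can be checked with a small calculation. This is only an illustrative sketch: the function names are made up for this note, and it assumes six tasks of one time unit each, as in the figures.

```python
# Hypothetical helper names; assumes 6 tasks of 1 time unit each, as on the slides.
TASKS = 6        # T1..T6
TASK_TIME = 1    # time units per task

def single_worker_time(n_cars: int) -> int:
    """One worker finishes a whole car before starting the next one."""
    return n_cars * TASKS * TASK_TIME

def assembly_line_time(n_cars: int) -> int:
    """One worker per task: the first car takes the full latency,
    then one more car leaves the line every time unit."""
    if n_cars == 0:
        return 0
    return TASKS * TASK_TIME + (n_cars - 1) * TASK_TIME

print(single_worker_time(10))   # 60 time units
print(assembly_line_time(10))   # 15 time units
```

The latency of a single car is 6 time units in both cases; only the rate at which finished cars appear changes.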

Processor Pipelining
A technique whereby multiple instructions are overlapped in execution
Takes advantage of the parallelism that exists among the different actions needed to execute an instruction
Processor cycle: time required by one instruction to move ahead one pipeline stage
  Since all stages are synchronized, the slowest stage determines the processor cycle
  Normally it is 1 clock cycle (sometimes 2)

Processor Pipelining
Von Neumann execution cycle:
  IF: Instruction Fetch
  ID: Instruction Decode
  Ex: Instruction Execute
  WB: Write Back the Result
Nonpipelined controller: one instruction takes 4 cc (IF, ID, Ex, WB at 1 cc each) and the next instruction starts only after it has finished
Pipelined controller: instructions i, i+1, i+2, i+3 start 1 cc apart and overlap
  i     IF ID Ex WB
  i+1      IF ID Ex WB
  i+2         IF ID Ex WB
  i+3            IF ID Ex WB
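
The staggered diagram above can be generated for any number of instructions. The following is a minimal sketch, assuming the 4-stage IF/ID/Ex/WB cycle from the slide and no stalls; `total_cycles` and `timing_diagram` are hypothetical names chosen for this note.

```python
STAGES = ["IF", "ID", "Ex", "WB"]  # the 4-stage cycle from the slide

def total_cycles(n_instructions: int, n_stages: int = len(STAGES)) -> int:
    """Cycles to finish n instructions when a new one starts every cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1

def timing_diagram(n_instructions: int) -> list[str]:
    """One text row per instruction; instruction k is shifted right by k cycles."""
    rows = []
    for k in range(n_instructions):
        label = "i" if k == 0 else f"i+{k}"
        cells = ["  "] * k + STAGES      # k empty cycles, then the stages
        rows.append(f"{label:<4}" + " ".join(cells))
    return rows

for row in timing_diagram(4):
    print(row)
print(total_cycles(4))  # 7 cycles pipelined, versus 4 * 4 = 16 nonpipelined
```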

Processor Pipelining
Further dividing tasks into subtasks:
  1) We decrease the cycle time
  2) We increase the no. of stages in the pipeline
[Figure: 8-stage pipeline in which each of IF, ID, Ex, WB is split into two 0.5 cc substages; instructions i, i+1, i+2, i+3, i+4, i+5 start 0.5 cc apart]
$$\text{Time per instruction}_{\text{pipelined machine}} = \frac{\text{Time per instruction}_{\text{unpipelined machine}}}{\text{Number of pipeline stages}}$$

MIPS 5-Stage Pipeline
Inst. Fetch (IF):
  Send the PC to inst. memory and fetch the current inst.
Inst. Decode/Reg. Fetch (ID):
  Decode the inst.
  Read the required src. registers
  Do equality test on regs (for possible branch)
  Compute possible branch address by adding sign-extended offset to PC
  Sign-extended immediate is also calculated

MIPS 5-Stage Pipeline
Execute (EX): 3 different types
  Memory reference: ALU adds base register and offset to form the effective address
  Reg-reg inst.: ALU performs the required arith/logic operation on src regs
  Reg-imm inst.: ALU performs the same on the src reg and sign-extended imm value

MIPS 5-Stage Pipeline
Mem. Access (MEM):
  If load, read data memory from the address calculated in EX
  If store, write src reg to memory
Write Back (WB):
  Reg-reg or load type
  Write the result to reg file

MIPS 5-Stage Pipeline
[Figure: five instructions in program order (vertical axis: Instr. Order), each passing through Ifetch, Reg, ALU, DMem, Reg; horizontal axis: Time (clock cycles), Cycle 1 to Cycle 7]

MIPS 5-Stage Pipeline
[Figure: same pipeline diagram, instructions in program order passing through Ifetch, Reg, ALU, DMem, Reg; Time (clock cycles), Cycle 1 to Cycle 7]
Insertion of pipelining registers to avoid interstage interference

Basic Performance Issues
Pipelining increases the throughput, but the latency of an individual instruction remains unchanged
  In fact, it increases slightly due to control overhead
  This puts a limit on the practical depth of a pipeline
  We can't afford a huge latency on a single instruction if it passes through (say) 100 stages of a pipeline
Subdividing pipeline stages decreases the per-stage execution time
  But to what extent?
  Dictated by the pipeline register delay + clock skew

Basic Performance Issues
Pipeline register delay:
  Setup time needed by a pipeline register for its input to become stable before it can be written
Clock skew:
  The same clock signal arriving at different parts of the design with different phases is known as skew

Pipelining: Example
Unpipelined processor
  Clock cycle = 1 ns
  Pipeline overhead = 0.2 ns

  Op type    Freq.   Exe. time
  ALU ops    40%     4 CC
  Branches   20%     4 CC
  Mem ops    40%     5 CC

Speedup when pipelined?
Avg. inst execution time = clock cycle x avg. CPI
  = 1 ns x ((0.4 + 0.2) x 4 + 0.4 x 5)
  = 4.4 ns
Avg. execution time when pipelined = 1 ns + 0.2 ns = 1.2 ns
Speedup = 4.4 / 1.2 = 3.7 times
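
The arithmetic of this example can be reproduced in a few lines. The sketch below uses only the numbers from the slide; the names `MIX`, `unpipelined_time_ns`, `pipelined_time_ns` and `speedup` are hypothetical, and it assumes, as the slide does, that the pipelined machine completes one instruction per (clock cycle + overhead).

```python
CLOCK_NS = 1.0      # unpipelined clock cycle
OVERHEAD_NS = 0.2   # pipeline overhead added to the cycle when pipelined

# op type -> (frequency, execution time in clock cycles), from the table
MIX = {"ALU": (0.4, 4), "Branch": (0.2, 4), "Mem": (0.4, 5)}

def unpipelined_time_ns() -> float:
    """Average instruction time = clock cycle x average CPI."""
    avg_cpi = sum(freq * cycles for freq, cycles in MIX.values())
    return CLOCK_NS * avg_cpi

def pipelined_time_ns() -> float:
    """Ideal pipeline: one instruction per stretched clock cycle."""
    return CLOCK_NS + OVERHEAD_NS

def speedup() -> float:
    return unpipelined_time_ns() / pipelined_time_ns()

print(round(unpipelined_time_ns(), 2))  # 4.4
print(round(speedup(), 1))              # 3.7
```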

Pipeline Hazards

Major Hurdles to Pipelining
A simple pipeline would work just fine if
  All the instructions were independent of each other
  Does not happen in real life!
Hazards prevent the next inst's execution during its designated clock cycle
  Structural hazards: attempt to use the same hardware to do two different things at once
  Data hazards: instruction depends on the result of a prior instruction still in the pipeline
  Control hazards: arise from the pipelining of branches that change the PC

Structural Hazards
Some combinations of instructions can't be executed due to resource conflicts
Examples:
  If a funct. unit is not fully pipelined, such as a multiplier or divider
  If a resource has not been duplicated enough, such as a reg file that has only one read port while the pipeline needs two reads in one cycle
  A single shared memory for insts and data

Structural Hazards
[Figure: Load, Inst 1, Inst 2, Inst 3, Inst 4 in program order, each passing through Ifetch, Reg, ALU, DMem, Reg; Cycle 1 to Cycle 7]

Solution to Structural Hazards
[Figure: Load, Inst 1, Inst 2, Stall, Inst 3 in program order; the Stall row is a row of Bubbles inserted between Inst 2 and Inst 3; Time (clock cycles), Cycle 1 to Cycle 7]

Why Allow Structural Hazards?
Reduction of overall cost
  A 1-port memory is much cheaper than a 2-port memory
  Both in silicon area and power consumption
Key point: if a structural hazard is rare, it may not be worth the cost to avoid it

Data Hazards
Occur when the pipeline changes the order of read/write accesses to operands as compared to unpipelined execution
Example:
  DADD R1,R2,R3
  DSUB R4,R1,R5
  AND  R6,R1,R7
  OR   R8,R1,R9
  XOR  R10,R1,R11

Data Hazards
[Figure: DADD R1,R2,R3; DSUB R4,R1,R5; AND R6,R1,R7; OR R8,R1,R9; XOR R10,R1,R11 in program order, each passing through Ifetch, Reg, ALU, DMem, Reg; Cycle 1 to Cycle 7]

Three Generic Data Hazards
Read After Write (RAW)
Inst J tries to read an operand before Inst I writes it
  I: add r1,r2,r3
  J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
Write After Read (WAR)
Inst J writes an operand before Inst I reads it
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Caused by an antidependence
  This results from reuse of the name r1
Can't happen in the MIPS 5-stage pipeline:
  All instructions take 5 stages
  Reads are always in stage 2
  Writes are always in stage 5

Three Generic Data Hazards
Write After Write (WAW)
Inst J writes an operand before Inst I writes it
  I: mul r1,r4,r3
  J: add r1,r2,r3
  K: sub r6,r1,r7
Caused by an output dependence
  Due to reuse of the name r1
Can't happen in the MIPS 5-stage pipeline:
  All insts take 5 stages, and writes are always in stage 5
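
The three definitions can be expressed as a check on register names. The following sketch is illustrative only: `Inst` and `hazards` are hypothetical names, and the function reports potential hazards (dependences between an earlier instruction I and a later instruction J); whether a dependence actually becomes a hazard depends on the pipeline, as noted above for WAR and WAW on the MIPS 5-stage pipeline.

```python
from typing import NamedTuple

class Inst(NamedTuple):
    op: str        # mnemonic, e.g. "add"
    dst: str       # destination register
    srcs: tuple    # source registers

def hazards(i: Inst, j: Inst) -> set:
    """Potential hazards between earlier instruction i and later instruction j."""
    found = set()
    if i.dst in j.srcs:
        found.add("RAW")   # j reads what i writes
    if j.dst in i.srcs:
        found.add("WAR")   # j writes what i reads
    if i.dst == j.dst:
        found.add("WAW")   # both write the same register
    return found

# The I/J pairs from the three slides
print(hazards(Inst("add", "r1", ("r2", "r3")), Inst("sub", "r4", ("r1", "r3"))))
print(hazards(Inst("sub", "r4", ("r1", "r3")), Inst("add", "r1", ("r2", "r3"))))
print(hazards(Inst("mul", "r1", ("r4", "r3")), Inst("add", "r1", ("r2", "r3"))))
```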

Solution: RAW Data Hazards
One solution: the compiler must check the dependences
If needed, it adds NOP instructions (stalls):
  i1: add R2,R1,R3    # R2 := R1 + R3
  i2: NOP
  i3: NOP
  i4: sub R7,R2,3     # R7 := R2 - 3
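
The compiler pass described above can be sketched as follows. This is a simplified illustration, not the lecture's algorithm: `insert_nops` and `MIN_DISTANCE` are hypothetical names, and the required distance of 3 instruction slots between producer and consumer is an assumption (no forwarding, register file written and then read within the same cycle) chosen because it reproduces the two NOPs on the slide.

```python
MIN_DISTANCE = 3  # assumption: a consumer must issue >= 3 slots after its producer

def insert_nops(program):
    """program: list of (text, dst, srcs); returns instruction texts with NOPs added."""
    out = []          # emitted instruction texts, one per issue slot
    last_write = {}   # register -> slot index of its most recent writer
    for text, dst, srcs in program:
        # Earliest slot at which every source register is safe to read
        earliest = len(out)
        for reg in srcs:
            if reg in last_write:
                earliest = max(earliest, last_write[reg] + MIN_DISTANCE)
        out.extend(["NOP"] * (earliest - len(out)))
        if dst is not None:
            last_write[dst] = len(out)   # slot this instruction will occupy
        out.append(text)
    return out

program = [
    ("add R2,R1,R3", "R2", ("R1", "R3")),
    ("sub R7,R2,3", "R7", ("R2",)),
]
print(insert_nops(program))
# ['add R2,R1,R3', 'NOP', 'NOP', 'sub R7,R2,3']
```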

Pipeline Performance with Stalls
$$\text{Speedup} = \frac{\text{Average inst. time unpipelined}}{\text{Average inst. time pipelined}} = \frac{\text{CPI unpipelined} \times \text{Clock cycle time unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle time pipelined}}$$
Assume the clock cycle time remains unchanged:
$$\text{Speedup} = \frac{\text{CPI unpipelined}}{\text{CPI pipelined}}$$
CPI pipelined = Ideal CPI + Stall CPI = 1 + Stall CPI
$$\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Stall CPI}}$$

Pipeline Performance with Stalls
Assume every inst takes the same no. of cycles = the no. of pipeline stages (depth of pipeline), e.g. IF, ID, Ex, WB = 4 cc
CPI unpipelined = Pipeline depth
Hence,
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Stall CPI}}$$
Ideally, if there are no stalls, then
Speedup = Pipeline depth
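
The final formula is easy to evaluate for a few cases. In this minimal sketch the function name `pipeline_speedup` and the sample values (depth 5, stall CPI 0.25) are assumptions for illustration only; they do not come from the slides.

```python
def pipeline_speedup(depth: int, stall_cpi: float = 0.0) -> float:
    """Speedup = pipeline depth / (1 + stall CPI), under the slide's assumptions:
    unchanged clock cycle time and CPI unpipelined = pipeline depth."""
    return depth / (1.0 + stall_cpi)

print(pipeline_speedup(5))         # 5.0, no stalls: speedup equals the depth
print(pipeline_speedup(5, 0.25))   # 4.0, a quarter stall cycle per instruction
```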

Solution: Data Forwarding
  DADD R1,R2,R3
  DSUB R4,R1,R5
  AND  R6,R1,R7
  OR   R8,R1,R9
  XOR  R10,R1,R11
Key insight: the result is not really needed by DSUB until after it is produced by DADD
Basic idea:
  Can the result be moved from the pipeline register where DADD stores it to where DSUB needs it? If yes, forward the result
  The result from the EX/MEM and MEM/WB pipeline registers is fed back to the ALU inputs
  If we detect that the previous ALU operation's dst register is the same as the current operation's src register, the forwarding control logic selects the forwarded value
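
The forwarding control decision described above can be sketched as a small selection function. This is a simplified model, not the actual control logic: `forward_source` and its string return values are hypothetical, a destination of `None` stands for an instruction that writes no register, and giving EX/MEM priority over MEM/WB (the more recent result wins) is an assumption that matches the usual textbook design.

```python
def forward_source(src_reg, ex_mem_dst, mem_wb_dst):
    """Pick where an ALU input for `src_reg` should come from.

    ex_mem_dst / mem_wb_dst: destination register of the instruction whose
    result currently sits in that pipeline register, or None if it writes none.
    """
    if src_reg == "R0":
        return "REGFILE"   # MIPS R0 always reads as zero, so never forward it
    if ex_mem_dst == src_reg:
        return "EX/MEM"    # result of the immediately preceding instruction
    if mem_wb_dst == src_reg:
        return "MEM/WB"    # result of the instruction two slots earlier
    return "REGFILE"       # no dependence on in-flight results

# DSUB R4,R1,R5 in EX while DADD R1,R2,R3 is in EX/MEM
print(forward_source("R1", ex_mem_dst="R1", mem_wb_dst=None))   # EX/MEM
# AND R6,R1,R7 in EX: DSUB (dst R4) in EX/MEM, DADD (dst R1) in MEM/WB
print(forward_source("R1", ex_mem_dst="R4", mem_wb_dst="R1"))   # MEM/WB
```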

Data Forwarding
[Figure: DADD R1,R2,R3; DSUB R4,R1,R5; AND R6,R1,R7; OR R8,R1,R9; XOR R10,R1,R11 in program order, each passing through Ifetch, Reg, ALU, DMem, Reg; Cycle 1 to Cycle 7]

Data Forwarding
[Figure: datapath with forwarding. Labels: NextPC, Register File, Immediate, ID/EX, ALU, EX/MEM, Data Memory, MEM/WB, and muxes]

Data Forwarding
  DADD R1,R2,R3
  LD   R4,0(R1)
  SD   R4,12(R1)
[Figure: DADD R1,R2,R3; LD R4,0(R1); SD R4,12(R1) in program order, each passing through Ifetch, Reg, ALU, DMem, Reg; Cycle 1 to Cycle 7]
SD immediately following LD needs data forwarding from the MEM/WB pipeline register to the DMem input as well

Data Forwarding
[Figure: the forwarding datapath with an additional mux. Labels: NextPC, Register File, Immediate, ID/EX, ALU, EX/MEM, Data Memory, MEM/WB, and muxes]

Data Forwarding
Can we completely avoid data hazards using forwarding?
NO: there will be Load-Use Delays (LUD) in the code

Data Hazards Requiring Stalls
  LD   R1,0(R2)
  DSUB R4,R1,R5
  AND  R6,R1,R7
  OR   R8,R1,R9
[Figure: LD, DSUB, AND, OR in program order, each passing through Ifetch, Reg, ALU, DMem, Reg; Cycle 1 to Cycle 7]
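
A load-use delay can be detected by looking at adjacent instruction pairs. The sketch below is a simplified illustration: `load_use_stalls` is a hypothetical name, each instruction is a tuple (op, dst, srcs) where srcs are the registers needed at the start of EX, and it assumes the 5-stage pipeline with full forwarding, in which only the instruction immediately after a load has to stall for one cycle.

```python
def load_use_stalls(program):
    """Count 1-cycle stalls that remain even with forwarding.

    program: list of (op, dst, srcs) in program order.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dst, _ = prev
        _, _, cur_srcs = cur
        # Loaded data is available only after MEM, too late for the next EX
        if prev_op == "LD" and prev_dst in cur_srcs:
            stalls += 1
    return stalls

program = [
    ("LD",   "R1", ("R2",)),
    ("DSUB", "R4", ("R1", "R5")),
    ("AND",  "R6", ("R1", "R7")),
    ("OR",   "R8", ("R1", "R9")),
]
print(load_use_stalls(program))  # 1: only DSUB must wait for the load
```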

Control (Branch) Hazards
Can cause greater performance degradation than data hazards
One stall cycle for every branch causes a 10% to 30% performance loss
  Depends on the branch inst frequency
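
The 10% to 30% range follows from the stall formula if branches make up 10% to 30% of the instructions. This is a minimal sketch under that assumption; `cpi_with_branch_stalls` is a hypothetical name, and an ideal CPI of 1 with no other stalls is assumed.

```python
def cpi_with_branch_stalls(branch_freq: float, stall_cycles: int = 1,
                           ideal_cpi: float = 1.0) -> float:
    """CPI = ideal CPI + branch frequency x stall cycles per branch."""
    return ideal_cpi + branch_freq * stall_cycles

for freq in (0.10, 0.20, 0.30):
    cpi = cpi_with_branch_stalls(freq)
    print(f"{freq:.0%} branches: CPI = {cpi:.2f}")
```

With one stall cycle per branch, the CPI rises from 1 to between 1.1 and 1.3 over this range of branch frequencies.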

Recall Branches
Taken branch
  If the branch changes the PC to its target address
Untaken branch
  If the branch does not change the PC, usual PC + 4
Not sure if a branch is taken or not until the end of ID
  Results in a branch stall
