Lecture 2: Review of Instruction Set Design and Pipelining

1
Lecture 2: Review of Instruction Set Design and Pipelining
  • Dr. Ben Juurlink
  • Modern Computer Architectures
  • Fall 2001

2
Computer Architecture Is
  • the attributes of a computing system as seen
    by the programmer, i.e., the conceptual structure
    and functional behavior, as distinct from the
    organization of the data flows and controls, the
    logic design, and the physical implementation.
  • Amdahl, Blaauw, and Brooks, 1964

3
Instruction Set Architecture (ISA)
[Layer diagram: software on top, hardware below, with the instruction set as the interface between them]
4
Interface Design
  • A good interface
  • Lasts through many implementations (portability,
    compatibility)
  • Is used in many different ways (generality)
  • Provides convenient functionality to higher
    levels
  • Permits an efficient implementation at lower
    levels

[Diagram: a single interface serving successive implementations (imp 1, imp 2, imp 3) and many uses over time]
5
Evolution of Instruction Sets
Single Accumulator (EDSAC, 1950)
Accumulator + Index Registers (Manchester Mark I, IBM 700 series, 1953)
Separation of Programming Model from Implementation:
  High-level Language Based (B5000, 1963)
  Concept of a Family (IBM 360, 1964)
General Purpose Register Machines:
  Load/Store Architecture (CDC 6600, Cray-1, 1963-76)
  Complex Instruction Sets (VAX, Intel 432, 1977-80)
RISC (MIPS, SPARC, HP-PA, IBM RS6000, PowerPC, ... 1987)
VLIW/EPIC? (IA-64, ... 1999)
6
Evolution of Instruction Sets
  • Major advances in computer architecture are
    typically associated with landmark instruction
    set designs
  • Ex: Stack vs. GPR (System 360)
  • Design decisions must take into account
  • technology
  • machine organization
  • programming languages
  • compiler technology
  • operating systems
  • And they in turn influence these

7
A "Typical" RISC
  • 32-bit fixed format instructions (3 formats)
  • 32 32-bit GPRs (R0 contains zero; double-precision
    FP operands take a register pair)
  • 3-address, reg-reg arithmetic instructions
  • Single addressing mode for load/store: base +
    displacement
  • no indirection
  • Simple branch conditions
  • Delayed branch
  • Note: x86/Pentium can be classified as CISC, but
    complex instructions are translated to several
    micro-operations which are executed by hardware
  • instruction set is CISC
  • implementation is RISC

see SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM
PowerPC, CDC 6600, CDC 7600, Cray-1,
Cray-2, Cray-3
8
Example: MIPS (≈ DLX)
Register-Register:  Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0]
Register-Immediate: Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]
Branch:             Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]
Jump / Call:        Op [31:26] | target [25:0]
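Because every instruction is 32 bits wide with fields in fixed positions, decoding is just a
few shifts and masks. A minimal sketch in C of decoding the register-register format above
(the struct and function names are illustrative, not from the lecture):

    #include <stdint.h>
    #include <stdio.h>

    /* Fields of the register-register format shown above. */
    struct rr_fields {
        unsigned op;   /* bits 31..26 */
        unsigned rs1;  /* bits 25..21 */
        unsigned rs2;  /* bits 20..16 */
        unsigned rd;   /* bits 15..11 */
        unsigned opx;  /* bits 10..0  */
    };

    static struct rr_fields decode_rr(uint32_t instr)
    {
        struct rr_fields f;
        f.op  = (instr >> 26) & 0x3F;   /*  6 bits */
        f.rs1 = (instr >> 21) & 0x1F;   /*  5 bits */
        f.rs2 = (instr >> 16) & 0x1F;   /*  5 bits */
        f.rd  = (instr >> 11) & 0x1F;   /*  5 bits */
        f.opx =  instr        & 0x7FF;  /* 11 bits */
        return f;
    }

    int main(void)
    {
        /* 0x00430820 is an arbitrary 32-bit example word. */
        struct rr_fields f = decode_rr(0x00430820u);
        printf("op=%u rs1=%u rs2=%u rd=%u opx=%u\n", f.op, f.rs1, f.rs2, f.rd, f.opx);
        return 0;
    }

Because the register fields sit in the same place in every format, the register file can be
read in parallel with the rest of the decode, which the pipeline below exploits.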
9
MIPS Characteristics
  • All instructions have the same length (can be
    wasteful of memory)
  • Only load/store instructions access memory,
    arithmetic instructions operate on registers
  • Only one addressing mode for load/store:
    base (register) + displacement (immediate)
  • Only branch-if-equal (beq) and branch-if-not-equal
    (bne) instructions; no branch-if-less-than (blt)
    or branch-if-greater-than-or-equal (bge), etc.
    (see the sketch after this list)
  • Next instruction after branch is always executed
    (delayed branch)
  • Why? Pipelining! (next)
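A comparison branch such as blt is synthesized from the instructions that do exist. A
minimal sketch, assuming a set-on-less-than instruction (slt) as in MIPS; the register
names are illustrative:

    # branch to L if r2 < r3
    slt  r1, r2, r3     # r1 = 1 if r2 < r3, else r1 = 0
    bne  r1, r0, L      # taken when r1 != 0, i.e. when r2 < r3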

10
Pipelining: It's Natural!
  • Laundry Example
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

11
Sequential Laundry
[Timing diagram, 6 PM to midnight: the four loads run back to back, each taking 30 min wash + 40 min dry + 20 min fold]
  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would
    laundry take?

12
Pipelined Laundry: Start work ASAP
[Timing diagram, 6 PM to 9:30 PM: washing, drying, and folding of the four loads overlap]
  • Pipelined laundry takes 3.5 hours for 4 loads
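The times follow directly from the stage lengths: sequential laundry takes 4 × (30 + 40 + 20)
= 360 min = 6 hours, while pipelined laundry takes 30 + 4 × 40 + 20 = 210 min = 3.5 hours, a
speedup of 360 / 210 ≈ 1.7. With many loads the finish time is dominated by the 40-minute
dryer, so the speedup approaches 90 / 40 ≈ 2.25.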

13
Pipelining Lessons
  • Pipelining doesn't help the latency of a single
    task, it helps the throughput of the entire workload
  • Pipeline rate is limited by the slowest pipeline stage
  • Multiple tasks operate simultaneously
  • Potential speedup = number of pipeline stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to fill the pipeline and time to drain it
    reduce speedup

14
Computer Pipelines
  • Execute billions of instructions, so throughput
    is what matters
  • MIPS/DLX desirable features:
  • all instructions the same length ⇒ can start
    fetching the next instruction before the current
    one is decoded
  • registers located in the same place in the
    instruction format ⇒ can read registers before
    the instruction is decoded
  • memory operands only in loads or stores ⇒ we
    don't have to fetch data from memory before the
    instruction is executed

15
5 Steps of MIPS/DLX Datapath
[Datapath diagram: instruction fetch, register file (read register 1/2, write register, write data, read data), ALU (zero, ALU result), data memory (address, write data, read data), 16-to-32-bit sign extend, and result multiplexer]
16
Pipelined MIPS/DLX Datapath
[Pipeline stages, left to right: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back]
17
Visualizing Pipelining (Figure 3.3, Page 133)
[Pipeline diagram: instruction order (vertical) versus time in clock cycles (horizontal)]
18
Summary
  • Five pipeline stages
  • Instruction fetch
  • Instruction decode and register fetch
    (speculative!)
  • Execute (arithmetic instr or calc. branch
    condition) / address calculation (load/store
    instr.)
  • Memory access (only for load/store) / calculate
    branch target
  • Write-back result to register file
  • Problem: suppose that memory operands could also
    appear in arithmetic instructions, as in
  • addm R1,R2,4(R3)   (R1 ← R2 + Mem[R3+4])
  • How would this affect the pipeline?

Solution: stages 3 and 4 would expand to an
address stage, a memory stage, and then an execute
stage, giving IF, ID, ADDR, MEM, EX, WB: six stages
instead of five.
19
It's Not That Easy...
  • Limits to pipelining: hazards prevent the next
    instruction from executing during its designated
    clock cycle
  • Structural hazards: HW cannot support this
    combination of instructions (a single person to
    fold and put clothes away)
  • Data hazards: an instruction depends on the result
    of a prior instruction still in the pipeline
    (missing sock)
  • Control hazards: branch instructions change the
    sequential order of instruction execution (for
    very dirty laundry we have to check the water
    temperature setting)

20
One Memory Port / Structural Hazards (Figure 3.6, Page 142)
[Pipeline diagram: a Load followed by Instr 1-4; with a single memory port, the Load's MEM stage conflicts with a later instruction's IF stage]
21
One Memory Port / Structural Hazards (Figure 3.7, Page 143)
[Pipeline diagram: the same sequence with a stall (bubble) inserted so the instruction fetch no longer conflicts with the Load's memory access]
22
Data Hazards (Figure 3.9, Page 147)
[Pipeline diagram, stages IF, ID/RF, EX, MEM, WB, for the sequence:]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
23
Three Generic Data Hazards
  • InstrI followed by InstrJ
  • Read After Write (RAW): InstrJ tries to read an
    operand before InstrI writes it

i: add r1,r2,r3
j: add r4,r1,r5
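In the five-stage pipeline this is a timing problem: i writes r1 to the register file in its
WB stage (cycle 5), but j wants to read r1 in its ID stage (cycle 3); without forwarding or a
stall, j would read the stale value.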
24
Three Generic Data Hazards
  • InstrI followed by InstrJ
  • i: add r1,r2,r3
  • j: add r2,r4,r5
  • Write After Read (WAR): InstrJ writes an operand
    before InstrI reads it
  • InstrI gets the wrong operand
  • Can't happen in the DLX 5-stage pipeline because:
  • All instructions take 5 stages, and
  • Reads are always in stage 2, and
  • Writes are always in stage 5

25
Three Generic Data Hazards
  • InstrI followed by InstrJ
  • Write After Write (WAW): InstrJ tries to write an
    operand before InstrI writes it
  • Leaves the wrong result (InstrI's, not InstrJ's)
  • Can't happen in the DLX 5-stage pipeline because:
  • All instructions take 5 stages, and
  • Writes are always in stage 5
  • Will see WAR and WAW in superscalar designs

i: add r1,r2,r3
j: add r1,r4,r5
26
Forwarding to Avoid Data Hazards: Register-file forwarding
[Pipeline diagram: register-file forwarding for the sequence:]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
27
Forwarding to Avoid Data Hazards
[Pipeline diagram: forwarding paths for the sequence:]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
28
Hardware for Forwarding (Figure 3.20, Page 161)
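The forwarding unit compares the destination registers held in the EX/MEM and MEM/WB pipeline
latches with the source registers of the instruction currently in EX. A minimal sketch in C of
that comparison for one ALU input (the struct and names are illustrative, not the book's
notation):

    #include <stdbool.h>
    #include <stdio.h>

    /* The slice of pipeline-latch state that matters for forwarding (illustrative). */
    struct latch { bool reg_write; unsigned rd; };

    /* Where should the first ALU operand (source register rs1) come from?   */
    /* 0 = register file, 1 = forward from EX/MEM, 2 = forward from MEM/WB.  */
    static int forward_a(unsigned rs1, struct latch ex_mem, struct latch mem_wb)
    {
        if (ex_mem.reg_write && ex_mem.rd != 0 && ex_mem.rd == rs1)
            return 1;                 /* newest value: result one stage ahead    */
        if (mem_wb.reg_write && mem_wb.rd != 0 && mem_wb.rd == rs1)
            return 2;                 /* value that is about to be written back  */
        return 0;                     /* no hazard: read the register file       */
    }

    int main(void)
    {
        struct latch ex_mem = { true, 1 };   /* e.g. add r1,r2,r3 now one stage ahead */
        struct latch mem_wb = { true, 5 };
        printf("%d\n", forward_a(1, ex_mem, mem_wb));  /* prints 1: forward the add's r1 */
        return 0;
    }

The rd != 0 test keeps the hardwired zero register from being forwarded; the EX/MEM result
takes priority because it holds the most recent value of the register.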
29
Data Hazard Even with Forwarding: Use after load
[Pipeline diagram: the loaded value of r1 is not available until after MEM, too late to forward to the immediately following sub, for the sequence:]
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
30
Hazard Detection Unit: Hardware must detect the hazard
and stall the pipeline (insert a pipeline bubble)
[Pipeline diagram: a one-cycle bubble after the load, for the sequence:]
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
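The load-use case is detected in the decode stage: if the instruction in EX is a load and its
destination matches a source register of the instruction being decoded, the pipeline is
stalled for one cycle. A minimal sketch in C (the names are illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    /* True when the instruction in ID must stall behind a load in EX. */
    static bool load_use_stall(bool ex_is_load, unsigned ex_rd,
                               unsigned id_rs1, unsigned id_rs2)
    {
        return ex_is_load && ex_rd != 0 &&
               (ex_rd == id_rs1 || ex_rd == id_rs2);
    }

    int main(void)
    {
        /* lw r1,0(r2) in EX, sub r4,r1,r6 in ID: a bubble must be inserted. */
        printf("%s\n", load_use_stall(true, 1, 1, 6) ? "stall" : "go");
        return 0;
    }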
31
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c
    d = e - f
assuming a, b, c, d, e, and f are in memory.

Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd
  • Fast code
  • LW Rb,b
  • LW Rc,c
  • LW Re,e
  • ADD Ra,Rb,Rc
  • LW Rf,f
  • SW a,Ra
  • SUB Rd,Re,Rf
  • SW d,Rd
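With the one-cycle load-use delay from the previous slides, the slow code stalls twice (the
ADD waits on LW Rc and the SUB waits on LW Rf); the reordered fast code places another
instruction between each load and its use, removing both stalls and saving two cycles.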

32
Control Hazard on Branches: Three-Stage Stall
33
Branch Stall Impact
  • If CPI = 1, 30% branches, 3-cycle stall ⇒ new CPI
    = 1.9! (arithmetic below)
  • Two-part solution:
  • Determine whether the branch is taken or not sooner, AND
  • Compute the taken-branch address earlier
  • MIPS/DLX branches test whether a register is = 0 or ≠ 0
  • MIPS/DLX solution: reduce the branch penalty by
  • Moving the zero test to the ID/RF stage
  • Adding an adder to calculate the new PC in the ID/RF stage
  • Now 1 clock-cycle penalty for a branch versus 3
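The CPI figures are straightforward: new CPI = 1 + 0.30 × 3 = 1.9 with the 3-cycle stall, and
1 + 0.30 × 1 = 1.3 once the penalty is reduced to 1 cycle (still assuming every branch pays
the full penalty).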

34
Pipelined DLX Datapath (Figure 3.22, Page 163)
[Datapath diagram, stages left to right: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc. | Memory Access | Write Back]
This is the correct 1-cycle-latency implementation!
35
Pipelined DLX/MIPS Datapath
This is the correct 1-cycle branch-delay
implementation.
36
Five Branch Hazard Alternatives
  • 1: Stall until the branch direction is clear
  • 2: Predict Branch Not Taken
  • Execute successor instructions in sequence
  • Squash instructions in the pipeline if the branch is
    actually taken
  • Advantage of late pipeline state update
  • 47% of DLX branches are not taken on average
  • PC+4 already calculated, so use it to get the next
    instruction
  • 3: Predict Branch Taken
  • 53% of DLX branches are taken on average
  • But the branch target address has not yet been
    calculated in DLX
  • DLX still incurs a 1-cycle branch penalty
  • Other machines: branch target known before outcome

37
Five Branch Hazard Alternatives
  • 4: Delayed Branch
  • Define the branch to take place AFTER a following
    instruction
  • branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n      (branch delay of n cycles)
    branch target if taken
  • A 1-slot delay allows a proper decision and branch
    target address calculation in the 5-stage pipeline
  • MIPS/DLX uses this
  • 5: Branch Prediction: predict the outcome of a
    branch based on its history (next lecture)

38
Delayed Branch
  • Where to get instructions to fill the branch delay
    slot?
  • From before the branch instruction (example below)
  • From the target address: only valuable when the
    branch is taken. Furthermore, it must be safe to
    execute the instruction when the branch goes the
    other way, e.g., it writes a register that is not
    used on the other branch direction.
  • From the fall-through: only valuable when the
    branch is not taken
  • Cancelling or nullifying branches allow more
    slots to be filled.
  • The branch encodes the direction it was predicted.
    If the prediction is correct, the delay-slot instr.
    executes normally; if not, it is turned into a nop.
  • Compiler effectiveness for a single branch delay
    slot:
  • Fills about 60% of branch delay slots
  • About 80% of instr. executed in branch delay
    slots are useful in computation
  • About 50% (60% x 80%) of slots usefully filled
  • Delayed-branch downsides:
  • Deeper pipelines (the P4 has 20 stages!) mean
    longer branch delays, which make it harder to fill
    the delay slots
  • Architecturally visible. From a MIPS R10000
    (superscalar) paper: "Delayed branches are no
    longer needed, but are maintained for
    compatibility."
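A minimal sketch of filling the slot from before the branch, in MIPS-style assembly (the
register names are illustrative): the add does not affect the branch condition, so it can be
moved into the delay slot and still executes on both paths, exactly as it did originally.

    Before scheduling:
        add r1, r2, r3
        beq r4, r0, L
        nop                  # unfilled delay slot

    After scheduling:
        beq r4, r0, L
        add r1, r2, r3       # fills the delay slot; runs whether or not the branch is taken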

39
Pipelining Summary
  • Just overlap tasks; easy if the tasks are
    independent
  • Ideal speedup = number of pipeline stages
  • Hazards limit performance:
  • Structural: need more HW resources
  • Data (RAW, WAR, WAW): need forwarding, compiler
    scheduling
  • Control: delayed branch, prediction
  • MIPS/DLX instruction set designed with pipelining
    in mind
  • All instructions same length
  • Load/store architecture
  • Registers located in same place in instruction
    format

40
Outlook
  • Next lecture
  • Advanced Pipelining
  • Dynamic Branch Prediction
  • Instruction-Level Parallelism
  • Multiple Issue (superscalar and VLIW) processors
  • Textbook chapter 4.