Lecture 2: Review of Instruction Set Design and Pipelining

1
Lecture 2: Review of Instruction Set Design and Pipelining
  • Dr. Ben Juurlink
  • Modern Computer Architectures
  • Fall 2001

2
Computer Architecture Is
  • the attributes of a computing system as seen
    by the programmer, i.e., the conceptual structure
    and functional behavior, as distinct from the
    organization of the data flows and controls, the
    logic design, and the physical implementation.
  • Amdahl, Blaauw, and Brooks, 1964

3
Instruction Set Architecture (ISA)
[Layer diagram: software on top, hardware below, with the instruction set as the interface between them]
4
Interface Design
  • A good interface
  • Lasts through many implementations (portability,
    compatibility)
  • Is used in many different ways (generality)
  • Provides convenient functionality to higher
    levels
  • Permits an efficient implementation at lower
    levels

[Diagram: a single interface serving successive implementations (imp 1, imp 2, imp 3) and many uses over time]
5
Evolution of Instruction Sets
Single Accumulator (EDSAC, 1950)
Accumulator + Index Registers (Manchester Mark I, IBM 700 series, 1953)
Separation of Programming Model from Implementation:
  High-level Language Based (B5000, 1963)
  Concept of a Family (IBM 360, 1964)
General Purpose Register Machines:
  Load/Store Architecture (CDC 6600, Cray-1, 1963-76)
  Complex Instruction Sets (VAX, Intel 432, 1977-80)
RISC (MIPS, SPARC, HP-PA, IBM RS6000, PowerPC, ... 1987)
VLIW/EPIC? (IA-64, ... 1999)
6
Evolution of Instruction Sets
  • Major advances in computer architecture are
    typically associated with landmark instruction
    set designs
  • Ex: Stack vs. GPR (System 360)
  • Design decisions must take into account
  • technology
  • machine organization
  • programming languages
  • compiler technology
  • operating systems
  • And they in turn influence these

7
A "Typical" RISC
  • 32-bit fixed format instructions (3 formats)
  • 32 32-bit GPRs (R0 contains zero; double-precision
    FP operands take a register pair)
  • 3-address, reg-reg arithmetic instructions
  • Single addressing mode for load/store: base +
    displacement
  • no indirection
  • Simple branch conditions
  • Delayed branch
  • Note: x86/Pentium can be classified as CISC, but
    complex instructions are translated to several
    micro-operations which are executed by hardware
  • instruction set is CISC
  • implementation is RISC

see SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM
PowerPC, CDC 6600, CDC 7600, Cray-1,
Cray-2, Cray-3
8
Example: MIPS (≈ DLX)
Register-Register:  Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0]
Register-Immediate: Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]
Branch:             Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]
Jump / Call:        Op [31:26] | target [25:0]
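Because every instruction is 32 bits wide with fields in fixed positions, decoding is just a
few shifts and masks. A minimal sketch in C of decoding the register-register format above
(the struct and function names are illustrative, not from the lecture):

    #include <stdint.h>
    #include <stdio.h>

    /* Fields of the register-register format shown above. */
    struct rr_fields {
        unsigned op;   /* bits 31..26 */
        unsigned rs1;  /* bits 25..21 */
        unsigned rs2;  /* bits 20..16 */
        unsigned rd;   /* bits 15..11 */
        unsigned opx;  /* bits 10..0  */
    };

    static struct rr_fields decode_rr(uint32_t instr)
    {
        struct rr_fields f;
        f.op  = (instr >> 26) & 0x3F;   /*  6 bits */
        f.rs1 = (instr >> 21) & 0x1F;   /*  5 bits */
        f.rs2 = (instr >> 16) & 0x1F;   /*  5 bits */
        f.rd  = (instr >> 11) & 0x1F;   /*  5 bits */
        f.opx =  instr        & 0x7FF;  /* 11 bits */
        return f;
    }

    int main(void)
    {
        /* 0x00430820 is an arbitrary 32-bit example word. */
        struct rr_fields f = decode_rr(0x00430820u);
        printf("op=%u rs1=%u rs2=%u rd=%u opx=%u\n", f.op, f.rs1, f.rs2, f.rd, f.opx);
        return 0;
    }

Because the register fields sit in the same place in every format, the register file can be
read in parallel with the rest of the decode, which the pipeline below exploits.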
9
MIPS Characteristics
  • All instructions have the same length (can be
    wasteful of memory)
  • Only load/store instructions access memory,
    arithmetic instructions operate on registers
  • Only one addressing mode for load/store:
    base (register) + displacement (immediate)
  • Only branch-if-equal (beq) and branch-if-not-equal
    (bne) instructions; no branch-if-less-than (blt)
    or branch-if-greater-than-or-equal (bge), etc.
    (see the sketch after this list)
  • Next instruction after branch is always executed
    (delayed branch)
  • Why? Pipelining! (next)
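A comparison branch such as blt is synthesized from the instructions that do exist. A
minimal sketch, assuming a set-on-less-than instruction (slt) as in MIPS; the register
names are illustrative:

    # branch to L if r2 < r3
    slt  r1, r2, r3     # r1 = 1 if r2 < r3, else r1 = 0
    bne  r1, r0, L      # taken when r1 != 0, i.e. when r2 < r3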

10
Pipelining: It's Natural!
  • Laundry Example
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

11
Sequential Laundry
[Timing diagram, 6 PM to midnight: the four loads run back to back, each taking 30 min wash + 40 min dry + 20 min fold]
  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would
    laundry take?

12
Pipelined Laundry: Start work ASAP
[Timing diagram, 6 PM to 9:30 PM: washing, drying, and folding of the four loads overlap]
  • Pipelined laundry takes 3.5 hours for 4 loads
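The times follow directly from the stage lengths: sequential laundry takes 4 × (30 + 40 + 20)
= 360 min = 6 hours, while pipelined laundry takes 30 + 4 × 40 + 20 = 210 min = 3.5 hours, a
speedup of 360 / 210 ≈ 1.7. With many loads the finish time is dominated by the 40-minute
dryer, so the speedup approaches 90 / 40 ≈ 2.25.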

13
Pipelining Lessons
  • Pipelining doesn't help the latency of a single
    task, it helps the throughput of the entire workload
  • Pipeline rate is limited by the slowest pipeline stage
  • Multiple tasks operate simultaneously
  • Potential speedup = number of pipeline stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to fill the pipeline and time to drain it
    reduce speedup

14
Computer Pipelines
  • Execute billions of instructions, so throughput
    is what matters
  • MIPS/DLX desirable features:
  • all instructions the same length ⇒ can start
    fetching the next instruction before the current
    one is decoded
  • registers located in the same place in the
    instruction format ⇒ can read registers before
    the instruction is decoded
  • memory operands only in loads or stores ⇒ we
    don't have to fetch data from memory before the
    instruction is executed

15
5 Steps of MIPS/DLX Datapath
[Datapath diagram: instruction fetch, register file (read register 1/2, write register, write data, read data), ALU (zero, ALU result), data memory (address, write data, read data), 16-to-32-bit sign extend, and result multiplexer]
16
Pipelined MIPS/DLX Datapath
[Pipeline stages, left to right: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back]
17
Visualizing Pipelining (Figure 3.3, Page 133)
[Pipeline diagram: instruction order (vertical) versus time in clock cycles (horizontal)]
18
Summary
  • Five pipeline stages
  • Instruction fetch
  • Instruction decode and register fetch
    (speculative!)
  • Execute (arithmetic instr or calc. branch
    condition) / address calculation (load/store
    instr.)
  • Memory access (only for load/store) / calculate
    branch target
  • Write-back result to register file
  • Problem: suppose that memory operands could also
    appear in arithmetic instructions, as in
  • addm R1,R2,4(R3)   (R1 ← R2 + Mem[R3+4])
  • How would this affect the pipeline?

Solution: stages 3 and 4 would expand to an
address stage, a memory stage, and then an execute
stage, giving IF, ID, ADDR, MEM, EX, WB: six stages
instead of five.
19
It's Not That Easy...
  • Limits to pipelining: hazards prevent the next
    instruction from executing during its designated
    clock cycle
  • Structural hazards: HW cannot support this
    combination of instructions (a single person to
    fold and put clothes away)
  • Data hazards: an instruction depends on the result
    of a prior instruction still in the pipeline
    (missing sock)
  • Control hazards: branch instructions change the
    sequential order of instruction execution (for
    very dirty laundry we have to check the water
    temperature setting)

20
One Memory Port / Structural Hazards (Figure 3.6, Page 142)
[Pipeline diagram: a Load followed by Instr 1-4; with a single memory port, the Load's MEM stage conflicts with a later instruction's IF stage]
21
One Memory Port / Structural Hazards (Figure 3.7, Page 143)
[Pipeline diagram: the same sequence with a stall (bubble) inserted so the instruction fetch no longer conflicts with the Load's memory access]
22
Data Hazards (Figure 3.9, Page 147)
[Pipeline diagram, stages IF, ID/RF, EX, MEM, WB, for the sequence:]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
23
Three Generic Data Hazards
  • InstrI followed by InstrJ
  • Read After Write (RAW): InstrJ tries to read an
    operand before InstrI writes it

i: add r1,r2,r3
j: add r4,r1,r5
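In the five-stage pipeline this is a timing problem: i writes r1 to the register file in its
WB stage (cycle 5), but j wants to read r1 in its ID stage (cycle 3); without forwarding or a
stall, j would read the stale value.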
24
Three Generic Data Hazards
  • InstrI followed by InstrJ
  • i: add r1,r2,r3
  • j: add r2,r4,r5
  • Write After Read (WAR): InstrJ writes an operand
    before InstrI reads it
  • InstrI gets the wrong operand
  • Can't happen in the DLX 5-stage pipeline because:
  • All instructions take 5 stages, and
  • Reads are always in stage 2, and
  • Writes are always in stage 5

25
Three Generic Data Hazards
  • InstrI followed by InstrJ
  • Write After Write (WAW): InstrJ tries to write an
    operand before InstrI writes it
  • Leaves the wrong result (InstrI's, not InstrJ's)
  • Can't happen in the DLX 5-stage pipeline because:
  • All instructions take 5 stages, and
  • Writes are always in stage 5
  • Will see WAR and WAW in superscalar designs

i: add r1,r2,r3
j: add r1,r4,r5
26
Forwarding to Avoid Data Hazards: Register-file forwarding
[Pipeline diagram: register-file forwarding for the sequence:]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
27
Forwarding to Avoid Data Hazards
[Pipeline diagram: forwarding paths for the sequence:]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
28
Hardware for Forwarding (Figure 3.20, Page 161)
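The forwarding unit compares the destination registers held in the EX/MEM and MEM/WB pipeline
latches with the source registers of the instruction currently in EX. A minimal sketch in C of
that comparison for one ALU input (the struct and names are illustrative, not the book's
notation):

    #include <stdbool.h>
    #include <stdio.h>

    /* The slice of pipeline-latch state that matters for forwarding (illustrative). */
    struct latch { bool reg_write; unsigned rd; };

    /* Where should the first ALU operand (source register rs1) come from?   */
    /* 0 = register file, 1 = forward from EX/MEM, 2 = forward from MEM/WB.  */
    static int forward_a(unsigned rs1, struct latch ex_mem, struct latch mem_wb)
    {
        if (ex_mem.reg_write && ex_mem.rd != 0 && ex_mem.rd == rs1)
            return 1;                 /* newest value: result one stage ahead    */
        if (mem_wb.reg_write && mem_wb.rd != 0 && mem_wb.rd == rs1)
            return 2;                 /* value that is about to be written back  */
        return 0;                     /* no hazard: read the register file       */
    }

    int main(void)
    {
        struct latch ex_mem = { true, 1 };   /* e.g. add r1,r2,r3 now one stage ahead */
        struct latch mem_wb = { true, 5 };
        printf("%d\n", forward_a(1, ex_mem, mem_wb));  /* prints 1: forward the add's r1 */
        return 0;
    }

The rd != 0 test keeps the hardwired zero register from being forwarded; the EX/MEM result
takes priority because it holds the most recent value of the register.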
29
Data Hazard Even with Forwarding: Use after load
[Pipeline diagram: the loaded value of r1 is not available until after MEM, too late to forward to the immediately following sub, for the sequence:]
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
30
Hazard Detection Unit: Hardware must detect the hazard
and stall the pipeline (insert a pipeline bubble)
[Pipeline diagram: a one-cycle bubble after the load, for the sequence:]
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
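The load-use case is detected in the decode stage: if the instruction in EX is a load and its
destination matches a source register of the instruction being decoded, the pipeline is
stalled for one cycle. A minimal sketch in C (the names are illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    /* True when the instruction in ID must stall behind a load in EX. */
    static bool load_use_stall(bool ex_is_load, unsigned ex_rd,
                               unsigned id_rs1, unsigned id_rs2)
    {
        return ex_is_load && ex_rd != 0 &&
               (ex_rd == id_rs1 || ex_rd == id_rs2);
    }

    int main(void)
    {
        /* lw r1,0(r2) in EX, sub r4,r1,r6 in ID: a bubble must be inserted. */
        printf("%s\n", load_use_stall(true, 1, 1, 6) ? "stall" : "go");
        return 0;
    }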
31
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c
    d = e - f
assuming a, b, c, d, e, and f are in memory.

Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd
  • Fast code
  • LW Rb,b
  • LW Rc,c
  • LW Re,e
  • ADD Ra,Rb,Rc
  • LW Rf,f
  • SW a,Ra
  • SUB Rd,Re,Rf
  • SW d,Rd
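With the one-cycle load-use delay from the previous slides, the slow code stalls twice (the
ADD waits on LW Rc and the SUB waits on LW Rf); the reordered fast code places another
instruction between each load and its use, removing both stalls and saving two cycles.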

32
Control Hazard on Branches: Three-Stage Stall
33
Branch Stall Impact
  • If CPI = 1, 30% branches, 3-cycle stall ⇒ new CPI
    = 1.9! (arithmetic below)
  • Two-part solution:
  • Determine whether the branch is taken or not sooner, AND
  • Compute the taken-branch address earlier
  • MIPS/DLX branches test whether a register is = 0 or ≠ 0
  • MIPS/DLX solution: reduce the branch penalty by
  • Moving the zero test to the ID/RF stage
  • Adding an adder to calculate the new PC in the ID/RF stage
  • Now 1 clock-cycle penalty for a branch versus 3
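The CPI figures are straightforward: new CPI = 1 + 0.30 × 3 = 1.9 with the 3-cycle stall, and
1 + 0.30 × 1 = 1.3 once the penalty is reduced to 1 cycle (still assuming every branch pays
the full penalty).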

34
Pipelined DLX Datapath (Figure 3.22, Page 163)
[Datapath diagram, stages left to right: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc. | Memory Access | Write Back]
This is the correct 1-cycle-latency implementation!
35
Pipelined DLX/MIPS Datapath
This is the correct 1-cycle branch-delay
implementation.
36
Five Branch Hazard Alternatives
  • 1: Stall until the branch direction is clear
  • 2: Predict Branch Not Taken
  • Execute successor instructions in sequence
  • Squash instructions in the pipeline if the branch is
    actually taken
  • Advantage of late pipeline state update
  • 47% of DLX branches are not taken on average
  • PC+4 already calculated, so use it to get the next
    instruction
  • 3: Predict Branch Taken
  • 53% of DLX branches are taken on average
  • But the branch target address has not yet been
    calculated in DLX
  • DLX still incurs a 1-cycle branch penalty
  • Other machines: branch target known before outcome

37
Five Branch Hazard Alternatives
  • 4: Delayed Branch
  • Define the branch to take place AFTER a following
    instruction
  • branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n      (branch delay of n cycles)
    branch target if taken
  • A 1-slot delay allows a proper decision and branch
    target address calculation in the 5-stage pipeline
  • MIPS/DLX uses this
  • 5: Branch Prediction: predict the outcome of a
    branch based on its history (next lecture)

38
Delayed Branch
  • Where to get instructions to fill the branch delay
    slot?
  • From before the branch instruction (example below)
  • From the target address: only valuable when the
    branch is taken. Furthermore, it must be safe to
    execute the instruction when the branch goes the
    other way, e.g., it writes a register that is not
    used on the other branch direction.
  • From the fall-through: only valuable when the
    branch is not taken
  • Cancelling or nullifying branches allow more
    slots to be filled.
  • The branch encodes the direction it was predicted.
    If the prediction is correct, the delay-slot instr.
    executes normally; if not, it is turned into a nop.
  • Compiler effectiveness for a single branch delay
    slot:
  • Fills about 60% of branch delay slots
  • About 80% of instr. executed in branch delay
    slots are useful in computation
  • About 50% (60% x 80%) of slots usefully filled
  • Delayed-branch downsides:
  • Deeper pipelines (the P4 has 20 stages!) mean
    longer branch delays, which make it harder to fill
    the delay slots
  • Architecturally visible. From a MIPS R10000
    (superscalar) paper: "Delayed branches are no
    longer needed, but are maintained for
    compatibility."
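A minimal sketch of filling the slot from before the branch, in MIPS-style assembly (the
register names are illustrative): the add does not affect the branch condition, so it can be
moved into the delay slot and still executes on both paths, exactly as it did originally.

    Before scheduling:
        add r1, r2, r3
        beq r4, r0, L
        nop                  # unfilled delay slot

    After scheduling:
        beq r4, r0, L
        add r1, r2, r3       # fills the delay slot; runs whether or not the branch is taken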

39
Pipelining Summary
  • Just overlap tasks; easy if the tasks are
    independent
  • Ideal speedup = number of pipeline stages
  • Hazards limit performance:
  • Structural: need more HW resources
  • Data (RAW, WAR, WAW): need forwarding, compiler
    scheduling
  • Control: delayed branch, prediction
  • MIPS/DLX instruction set designed with pipelining
    in mind
  • All instructions same length
  • Load/store architecture
  • Registers located in same place in instruction
    format

40
Outlook
  • Next lecture
  • Advanced Pipelining
  • Dynamic Branch Prediction
  • Instruction-Level Parallelism
  • Multiple Issue (superscalar and VLIW) processors
  • Textbook chapter 4.