CPSC 318 Computer Structures, Lecture 14: Pipelined Execution, Part II

1
CPSC 318 Computer Structures, Lecture 14
Pipelined Execution - Part II

Dr. Son Vuong
(vuong@cs.ubc.ca)
March 11, 2004

2
Review (1/3)

Optimal Pipeline
Each stage executes part of an instruction
each clock cycle.
One instruction finishes during each clock cycle.
On average, instructions execute far more quickly.
What makes this work?
Similarities between instructions allow us to use
the same stages for all instructions (generally).
Each stage takes about the same amount of time as
all the others: little wasted time.
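
The claim that one instruction finishes per clock cycle can be checked with a little arithmetic. The sketch below is an idealized model only: it assumes no hazards, equal-length stages, and at least one instruction, and the function names are made up for this illustration.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction runs through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """The first instruction needs n_stages cycles to fill the pipeline;
    every later instruction finishes one cycle after its predecessor."""
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Ideal speedup of the pipelined machine (assumes n_instructions >= 1)."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


for n in (1, 5, 1000):
    print(n, pipelined_cycles(n, 5), round(speedup(n, 5), 2))
```

For long instruction streams the speedup approaches the number of stages, which is why balanced stages matter.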

3
Review (2/3)

Pipelining is a Big Idea: a widely used concept
What makes it less than perfect?
Structural hazards: suppose we had only one
cache? => Need more HW resources
Control hazards: need to worry about branch
instructions? => Delayed branch
Data hazards: an instruction depends on a
previous instruction?

4
Review (3/3): 5 Steps in 5-Stage Pipeline

1) IFetch: Fetch Instruction, Increment PC
2) Decode: Decode Instruction, Read Registers
3) Execute:
   Mem-ref: Calculate Address
   Arith-log: Perform Operation
4) Memory:
   Load: Read Data from Memory
   Store: Write Data to Memory
5) Write Back: Write Data to Register

5
Pipelined Execution Representation

Every instruction must take same number of steps,
also called pipeline stages.
One clock cycle per pipeline stage
500 MHz => 500 million clock cycles / second
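
The representation described above (one stage per clock cycle, each instruction starting one cycle after the previous one) can be sketched as a small chart generator. This is an illustration under ideal conditions with no stalls; the stage abbreviations and function names are assumptions, not taken from the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # abbreviations assumed for the 5 stages


def pipeline_chart(n_instructions):
    """Return one row per instruction; row[c] names the stage occupied
    in clock cycle c, or '' when the instruction is not in the pipeline."""
    total_cycles = len(STAGES) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i enters the pipeline in cycle i
        rows.append(row)
    return rows


def cycle_time_ns(clock_mhz):
    """Length of one clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz


for i, row in enumerate(pipeline_chart(3)):
    print(f"instr {i}: " + " ".join(f"{cell:>3}" for cell in row))
print("cycle time at 500 MHz:", cycle_time_ns(500), "ns")
```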

6
Structural Hazard: Registers?
Can read and write registers simultaneously
7
Data Hazards (1/2)

Consider the following sequence of instructions

8
Data Hazards (2/2)
Dependencies backwards in time are hazards
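
A dependency "backwards in time" means a later instruction reads a register before the earlier instruction has written it back. The sketch below flags such read-after-write pairs. The instruction sequence from the slide did not survive in this transcript, so the four-instruction program here is a hypothetical stand-in, and the `Instr` type and `window` parameter are assumptions. The default window of 2 assumes a 5-stage pipeline without forwarding in which the register file can be written and read in the same cycle.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


def raw_dependencies(program, window=2):
    """Return (producer_index, consumer_index, register) for every read that
    follows the write by at most `window` instructions."""
    deps = []
    for i, producer in enumerate(program):
        if producer.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest in program[j].srcs:
                deps.append((i, j, producer.dest))
            if program[j].dest == producer.dest:
                break  # register overwritten; later reads depend on the new writer
    return deps


# Hypothetical program: three instructions read $t0 right after it is written.
program = [
    Instr("add", "$t0", ("$t1", "$t2")),
    Instr("sub", "$t4", ("$t0", "$t3")),
    Instr("and", "$t5", ("$t0", "$t6")),
    Instr("or", "$t7", ("$t0", "$t8")),
]
print(raw_dependencies(program))
```

Under these assumptions the third reader is far enough behind the writer that it sees the new value without any help.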
9
Data Hazard Solution: Forwarding

Forward result from one stage to another

The hazard for the "or" instruction is solved by the register hardware
10
Data Hazard: Loads (1/4)

Dependencies backwards in time are hazards

Can't solve with forwarding
Must stall instruction dependent on load, then
forward (more hardware)

11
Data Hazard: Loads (2/4)

Hardware must stall pipeline
Called "interlock"

12
Data Hazard: Loads (3/4)

Instruction slot after a load is called the "load
delay slot"
If that instruction uses the result of the load,
then the hardware interlock will stall it for one
cycle.
If the compiler puts an unrelated instruction in
that slot, then no stall
Letting the hardware stall the instruction in the
delay slot is equivalent to putting a nop in the
slot (except that the latter uses more code space)

13
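Data Hazard: Loads (4/4)

Stall is equivalent to nop

lw  $t0, 0($t1)    # load $t0 from memory
nop                # fills the load delay slot
sub $t3, $t0, $t2  # first use of the loaded $t0
and $t5, $t0, $t4  # second use of the loaded $t0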
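
The load delay slot rule above can be sketched as a stall counter: one stall whenever the instruction immediately after a load reads the loaded register. This is a simplified model assuming a 5-stage pipeline with full forwarding; the tuple format `(op, dest, srcs)` and the function name are assumptions made for this illustration.

```python
def load_use_stalls(program):
    """Count one-cycle interlock stalls: the instruction directly after an
    lw stalls if it reads the register the lw writes."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "lw" and prev_dest in curr_srcs:
            stalls += 1
    return stalls


# Without anything in the delay slot, the hardware interlock stalls sub once.
hazard = [
    ("lw", "$t0", ("$t1",)),
    ("sub", "$t3", ("$t0", "$t2")),
    ("and", "$t5", ("$t0", "$t4")),
]

# With a nop in the delay slot (as on the slide), no stall is needed.
with_nop = [hazard[0], ("nop", None, ())] + hazard[1:]

print(load_use_stalls(hazard), load_use_stalls(with_nop))
```

Both versions take the same number of cycles; the nop version simply spends an instruction word where the hardware would have inserted a bubble.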
14
Historical Trivia

First MIPS design did not interlock and stall on
load-use data hazard
Real reason behind the name MIPS: Microprocessor
without Interlocked Pipeline Stages
Wordplay on the acronym for Millions of
Instructions Per Second, also called MIPS

15
Example: Nondelayed vs. Delayed Branch
(Figure: side-by-side panels titled Nondelayed Branch and Delayed Branch.)
16
Control Hazard: Branching

Notes on Branch-Delay Slot
Worst-Case Scenario: can always put a no-op in
the branch-delay slot
Better Case: can find an instruction preceding
the branch which can be placed in the
branch-delay slot without affecting flow of the
program
Compiler can usually find such an instruction
at least 50% of the time
Jumps also have a delay slot
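
The worst-case and better-case notes above can be sketched as a very conservative slot filler. It only considers the instruction directly before the branch and moves it into the slot when it does not write a register the branch reads; otherwise it falls back to a nop. Real compilers search further back. The tuple format `(op, dest, srcs)` and the function name are assumptions, and the input is assumed to be a basic block ending in a single branch.

```python
NOP = ("nop", None, ())


def fill_branch_delay_slot(block):
    """Return a new instruction list with the branch-delay slot filled.

    block: list of (op, dest, srcs) tuples whose last entry is the branch.
    """
    *body, branch = block
    if body:
        candidate = body[-1]
        _, cand_dest, _ = candidate
        _, _, branch_srcs = branch
        writes_branch_input = cand_dest is not None and cand_dest in branch_srcs
        if not writes_branch_input:
            # Better case: the preceding instruction executes in the slot.
            return body[:-1] + [branch, candidate]
    # Worst case: nothing safe to move, so pad with a no-op.
    return body + [branch, NOP]


movable = [
    ("add", "$t0", ("$t1", "$t2")),
    ("sub", "$t3", ("$t4", "$t5")),
    ("beq", None, ("$t0", "$zero")),
]
not_movable = [
    ("add", "$t0", ("$t1", "$t2")),
    ("beq", None, ("$t0", "$zero")),
]
print(fill_branch_delay_slot(movable))
print(fill_branch_delay_slot(not_movable))
```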

17
Advanced Pipelining Concepts

Out-of-order Execution
Superscalar execution
State-of-the-art microprocessors: Pentium III and
Pentium 4

18
Review Pipeline Hazard: Stall is dependency
(Figure: laundry pipeline timeline from 6 PM to 2 AM; horizontal axis: Time, vertical axis: Task Order, loads A to F.)

A depends on D: stall since folder tied up

19
Out-of-Order Laundry: Don't Wait
(Figure: laundry pipeline timeline from 6 PM to 2 AM with 30-minute stages; horizontal axis: Time, vertical axis: Task Order, loads A to F.)

A depends on D: rest continue; need more
resources to allow out-of-order
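
The difference between stalling everything and letting the rest continue can be sketched with a toy issue model. It is heavily simplified: one instruction issues per cycle, and each instruction has a fixed earliest cycle at which its operands are ready (the `ready` list, a hypothetical input). The function names are assumptions for this illustration.

```python
def in_order_issue(ready):
    """Issue strictly in program order; a waiting instruction blocks all
    instructions behind it. Returns the issue cycle of each instruction."""
    cycle = -1
    issued = []
    for r in ready:
        cycle = max(cycle + 1, r)
        issued.append(cycle)
    return issued


def out_of_order_issue(ready):
    """Each cycle, issue the oldest instruction whose operands are ready,
    letting younger independent instructions pass a stalled one."""
    issued = [None] * len(ready)
    remaining = len(ready)
    cycle = 0
    while remaining:
        for i, r in enumerate(ready):
            if issued[i] is None and r <= cycle:
                issued[i] = cycle
                remaining -= 1
                break  # single issue: at most one instruction per cycle
        cycle += 1
    return issued


# Instruction 1 must wait until cycle 3 for its data; the others are ready at once.
ready = [0, 3, 0, 0]
print(in_order_issue(ready))      # instruction 1 holds up 2 and 3
print(out_of_order_issue(ready))  # 2 and 3 slip ahead of instruction 1
```

The extra bookkeeping needed to track which instructions are ready is one example of the additional resources that out-of-order execution requires.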

20
Superscalar Laundry: Parallel per stage
(Figure: laundry pipeline timeline from 6 PM to 2 AM; horizontal axis: Time, vertical axis: Task Order.)

More resources, HW to match mix of parallel
tasks?

21
Superscalar Laundry: Mismatch Mix
(Figure: laundry pipeline timeline from 6 PM to 2 AM with 30-minute stages; horizontal axis: Time, vertical axis: Task Order; loads labeled light clothing, dark clothing, light clothing.)

Task mix underutilizes extra resources
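
The effect of a mismatched mix can be sketched numerically. The model below assumes each kind of task has its own dedicated units, ignores dependencies between tasks, and expects a non-empty task list whose kinds all appear in `units`; the names and the example mixes are hypothetical.

```python
from collections import Counter
from math import ceil


def issue_cycles(tasks, units):
    """Cycles needed when each cycle can start at most units[kind] tasks
    of each kind; the most oversubscribed kind sets the total."""
    counts = Counter(tasks)
    return max(ceil(counts[kind] / units[kind]) for kind in counts)


def utilization(tasks, units):
    """Fraction of available unit-slots that actually did work."""
    cycles = issue_cycles(tasks, units)
    return len(tasks) / (cycles * sum(units.values()))


units = {"light": 1, "dark": 1}            # one machine per kind of load
balanced = ["light", "dark", "light", "dark"]
mismatch = ["light", "light", "light", "dark"]

print(issue_cycles(balanced, units), utilization(balanced, units))
print(issue_cycles(mismatch, units), utilization(mismatch, units))
```

With the balanced mix both units stay busy; with the mismatched mix the second unit idles for most of the run.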

22
Intel Internals

Hardware below the instruction set is called the
"microarchitecture"
Pentium Pro, Pentium II, Pentium III all based on
the same microarchitecture (1994)
Improved clock rate, increased cache size
Pentium 4 has a new microarchitecture (2000)

23
Dynamic Scheduling in Pentium III

Q: How to pipeline 1- to 17-byte 80x86
instructions?
A: It doesn't pipeline 80x86 instructions
Decode unit translates the Intel instructions
into 72-bit micro-operations (similar to MIPS)
Many instructions translate to 1 to 4
micro-operations
14 clocks in total pipeline

24
Dynamic Scheduling in Pentium III

Parameter                          80x86   microops
Max. instructions issued/clock     3       6
Max. instr. complete exec./clock   -       5
Max. instr. committed/clock        -       3

25
Pentium III Pipeline: 14 stages total

8 stages are used for in-order instruction fetch,
decode, and issue
Takes 1 clock cycle to determine length of 80x86
instructions + 2 more to create the
micro-operations (µops)
3 stages are used for out-of-order execution in
one of 5 separate functional units
3 stages are used for instruction commit (a.k.a.
graduation or completion)

(Figure: pipeline block diagram.)
Instr Fetch: 16 B/clk
Instr Decode: 3 instr/clk
Renaming: 3 µops/clk
Execution units (5)
Graduation: 3 µops/clk
26
Pentium 4: Still translates into micro-ops

Instruction cache holds micro-operations instead
of 80x86 instructions!
No decode stages of 80x86 on cache hit
Called trace cache (TC)
Clock rates
Pentium III 1.2 GHz v. Pentium 4 2.0 GHz
14-stage pipeline v. 24-stage pipeline
Caches
Pentium III: L1I 16 KB, L1D 16 KB, L2 256 KB
Pentium 4: L1I 12K µops, L1D 8 KB, L2 256 KB
Block size: PIII 32 B v. P4 128 B
Faster memory bus: 400 MHz v. 133 MHz

27
Reading Quiz

1. Does the Pentium III execute instructions similar to
MIPS?
2. What is the danger of out-of-order execution
(executing instructions after a data hazard while
waiting for its resolution)?

28
And in Conclusion... (1/1)

Pipeline challenge is hazards
Forwarding helps with many data hazards
Delayed branch helps with control hazard in
5-stage pipeline
More aggressive performance: superscalar,
out-of-order execution
Pentium 4 translates 80x86 instructions into
MIPS-like micro-ops
Pentium 4: long pipeline to increase clock
frequency; does it really help performance?
Macintosh PowerPC @ 500 MHz vs.
Intel Pentium 4 @ 2 GHz
vs. AMD Athlon @ 1.6 GHz?
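
One way to reason about the closing question is the standard performance equation: execution time = instruction count x cycles per instruction (CPI) / clock rate. The sketch below applies it with made-up numbers. The CPI values are purely hypothetical, chosen to show that a higher clock rate alone does not decide the outcome; they are not measurements of any of the processors named above.

```python
def execution_time_s(instruction_count, cpi, clock_hz):
    """Execution time in seconds from the performance equation."""
    return instruction_count * cpi / clock_hz


INSTRUCTIONS = 1e9  # hypothetical program length

# Hypothetical CPIs: the deeper pipeline is assumed to lose more cycles to hazards.
slow_clock = execution_time_s(INSTRUCTIONS, cpi=1.0, clock_hz=500e6)
fast_clock = execution_time_s(INSTRUCTIONS, cpi=4.0, clock_hz=2e9)

print(slow_clock, fast_clock)  # equal under these assumed CPIs
```

Whether a longer pipeline helps in practice depends on how much the added stalls and flushes raise the CPI relative to the gain in clock rate.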