Title: CPSC 318 Computer Structures Lecture 14 Pipelined Execution Part II
1 CPSC 318 Computer Structures Lecture 14
Pipelined Execution - Part II
- Dr. Son Vuong
- (vuong_at_cs.ubc.ca)
- March 11, 2004
2 Review (1/3)
- Optimal Pipeline
- Each stage is executing part of an instruction each clock cycle.
- One instruction finishes during each clock cycle.
- On average, execute far more quickly.
- What makes this work?
- Similarities between instructions allow us to use same stages for all instructions (generally).
- Each stage takes about the same amount of time as all others: little wasted time.
3 Review (2/3)
- Pipelining is a Big Idea: a widely used concept
- What makes it less than perfect?
- Structural hazards: suppose we had only one cache? => Need more HW resources
- Control hazards: need to worry about branch instructions? => Delayed branch
- Data hazards: an instruction depends on a previous instruction?
4 Review (3/3): 5 Steps in 5-Stage Pipeline
- 1) IFetch: Fetch Instruction, Increment PC
- 2) Decode Instruction, Read Registers
- 3) Execute: Mem-ref: Calculate Address; Arith-log: Perform Operation
- 4) Memory: Load: Read Data from Memory; Store: Write Data to Memory
- 5) Write Back: Write Data to Register
5 Pipelined Execution Representation
- Every instruction must take same number of steps, also called pipeline "stages".
- One clock cycle per pipeline stage
- 500 MHz => 500 million clock cycles / second
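The timing implied by this slide (one clock cycle per stage, a new instruction entering every cycle in the ideal case) can be sketched in Python. This is only an illustration under stated assumptions: the function names and the 1000-instruction workload are hypothetical, and hazards are ignored entirely.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a clock rate given in MHz."""
    return 1000.0 / freq_mhz


def pipelined_cycles(n_instr: int, n_stages: int = 5) -> int:
    """Ideal pipeline, no hazards: the first instruction needs n_stages
    cycles, then one more instruction finishes every cycle."""
    if n_instr == 0:
        return 0
    return n_stages + (n_instr - 1)


def unpipelined_cycles(n_instr: int, n_stages: int = 5) -> int:
    """Each instruction runs through all stages before the next one starts."""
    return n_instr * n_stages


if __name__ == "__main__":
    n = 1000  # hypothetical workload size, not from the lecture
    p, u = pipelined_cycles(n), unpipelined_cycles(n)
    print(f"clock period at 500 MHz: {clock_period_ns(500)} ns")
    print(f"pipelined: {p} cycles, unpipelined: {u} cycles")
    print(f"speedup: {u / p:.2f}x")
```

For a long instruction stream the speedup approaches the number of stages, which is why equal-length stages matter.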
6 Structural Hazard: Registers?
Can read and write registers simultaneously
7 Data Hazards (1/2)
- Consider the following sequence of instructions
8 Data Hazards (2/2)
Dependencies backwards in time are hazards
9 Data Hazard Solution: Forwarding
- Forward result from one stage to another
- "or" hazard solved by register hardware
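A minimal Python sketch of how a dependence might be classified in a 5-stage pipeline follows. Several things here are assumptions, not taken from the slides: the instruction sequence is a stand-in (the slide's own sequence did not survive in this text), the producer is an ALU instruction, the register file writes in the first half of a cycle and reads in the second half, and the names `Instr` and `classify` are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def classify(program: List[Instr]) -> List[Tuple[int, int, str, str]]:
    """For every source register, find the most recent earlier writer and
    report how the dependence is satisfied (producer assumed to be ALU op)."""
    results = []
    last_writer: Dict[str, int] = {}  # register -> index of latest writer
    for j, ins in enumerate(program):
        for reg in ins.srcs:
            if reg in last_writer:
                i = last_writer[reg]
                distance = j - i
                if distance <= 2:
                    how = "forward"        # result not yet written back
                elif distance == 3:
                    how = "register file"  # write and read share one cycle
                else:
                    how = "none needed"    # value already in the register
                results.append((i, j, reg, how))
        if ins.dest is not None:
            last_writer[ins.dest] = j
    return results


# Assumed example sequence (illustrative only).
PROGRAM = [
    Instr("add", "$t0", ("$t1", "$t2")),
    Instr("sub", "$t4", ("$t0", "$t3")),
    Instr("and", "$t5", ("$t0", "$t6")),
    Instr("or",  "$t7", ("$t0", "$t8")),
    Instr("xor", "$t9", ("$t0", "$t10")),
]

if __name__ == "__main__":
    for producer, consumer, reg, how in classify(PROGRAM):
        print(f"{PROGRAM[consumer].op} reads {reg} from "
              f"{PROGRAM[producer].op}: {how}")
```

In this assumed sequence the two instructions right after the `add` need forwarding, the third is covered by the register hardware, and the fourth needs nothing special.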
10 Data Hazard: Loads (1/4)
- Dependencies backwards in time are hazards
- Can't solve with forwarding
- Must stall instruction dependent on load, then forward (more hardware)
11 Data Hazard: Loads (2/4)
- Hardware must stall pipeline
- Called "interlock"
12 Data Hazard: Loads (3/4)
- Instruction slot after a load is called the "load delay slot"
- If that instruction uses the result of the load, then the hardware interlock will stall it for one cycle.
- If the compiler puts an unrelated instruction in that slot, then no stall
- Letting the hardware stall the instruction in the delay slot is equivalent to putting a nop in the slot (except the latter uses more code space)
13 Data Hazard: Loads (4/4)
- Stall is equivalent to nop
lw $t0, 0($t1)
nop
sub $t3,$t0,$t2
and $t5,$t0,$t4
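The equivalence between a one-cycle stall and a nop can be sketched in Python as a pass that inserts a nop whenever the instruction right after a load uses the loaded register. This is a simplified model, not the actual interlock hardware; `Instr`, `NOP`, and `insert_load_delay_nops` are hypothetical names.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


NOP = Instr("nop", None, ())


def insert_load_delay_nops(program: List[Instr]) -> List[Instr]:
    """Insert a nop after a load when the next instruction reads the
    register being loaded (the load-use hazard)."""
    out: List[Instr] = []
    for ins in program:
        prev = out[-1] if out else None
        if prev is not None and prev.op == "lw" and prev.dest in ins.srcs:
            out.append(NOP)  # same effect as a one-cycle hardware stall
        out.append(ins)
    return out


# The sequence from the slide, without the nop.
PROGRAM = [
    Instr("lw",  "$t0", ("$t1",)),
    Instr("sub", "$t3", ("$t0", "$t2")),
    Instr("and", "$t5", ("$t0", "$t4")),
]

if __name__ == "__main__":
    for ins in insert_load_delay_nops(PROGRAM):
        print(ins.op, ins.dest or "", *ins.srcs)
```

If the compiler had placed an unrelated instruction directly after the `lw`, the pass would insert nothing, which corresponds to the no-stall case.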
14 Historical Trivia
- First MIPS design did not interlock and stall on load-use data hazard
- Real reason for name behind MIPS: Microprocessor without Interlocked Pipeline Stages
- Word play on acronym for Millions of Instructions Per Second, also called MIPS
15 Example: Nondelayed vs. Delayed Branch
Nondelayed Branch
Delayed Branch
16 Control Hazard: Branching
- Notes on Branch-Delay Slot
- Worst-Case Scenario: can always put a no-op in the branch-delay slot
- Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program
- Compiler can usually find such an instruction at least 50% of the time
- Jumps also have a delay slot
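The worst case and better case above can be sketched as a very small Python pass. It is deliberately simplified and rests on assumptions not in the slides: it only looks at the single instruction immediately before the branch, it treats the code as straight-line (no labels pointing at the moved instruction), and it handles only `beq`, `bne`, and `j`. All names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


NOP = Instr("nop", None, ())
BRANCH_OPS = {"beq", "bne", "j"}  # assumed subset of control instructions


def fill_branch_delay_slots(program: List[Instr]) -> List[Instr]:
    """Put one instruction after every branch or jump: the preceding
    instruction if the branch does not read its result, otherwise a nop."""
    out: List[Instr] = []
    last_is_slot = False  # True if out[-1] already sits in a delay slot
    for ins in program:
        if ins.op not in BRANCH_OPS:
            out.append(ins)
            last_is_slot = False
            continue
        prev = out[-1] if out and not last_is_slot else None
        if prev is not None and prev.dest not in ins.srcs:
            # Better case: the branch does not depend on prev, so move it.
            out.pop()
            out.extend([ins, prev])
        else:
            # Worst case: nothing safe to move, so fill with a no-op.
            out.extend([ins, NOP])
        last_is_slot = True
    return out


if __name__ == "__main__":
    example = [
        Instr("add", "$t0", ("$t1", "$t2")),
        Instr("sub", "$t3", ("$t4", "$t5")),
        Instr("beq", None, ("$t0", "$zero")),
    ]
    print([i.op for i in fill_branch_delay_slots(example)])
```

A real compiler searches further back and can also pull instructions from the branch target or fall-through path; this sketch only shows the basic dependence check.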
17 Advanced Pipelining Concepts
- Out-of-order Execution
- Superscalar execution
- State-of-the-Art Microprocessors: Pentium III and Pentium 4
18 Review Pipeline Hazard: Stall is dependency
[Figure: laundry timeline, Time axis from 6 PM to 2 AM, Task Order A to F]
- A depends on D; stall since folder tied up
19 Out-of-Order Laundry: Don't Wait
[Figure: laundry timeline, Time axis from 6 PM to 2 AM in 30-minute steps, Task Order A to F]
- A depends on D; rest continue; need more resources to allow out-of-order
20 Superscalar Laundry: Parallel per stage
[Figure: laundry timeline, Time axis from 6 PM to 2 AM, Task Order including D, E, F]
- More resources, HW to match mix of parallel tasks?
21 Superscalar Laundry: Mismatch Mix
[Figure: laundry timeline, Time axis from 6 PM to 2 AM in 30-minute steps, Task Order: (light clothing), (dark clothing), (light clothing)]
- Task mix underutilizes extra resources
22 Intel Internals
- Hardware below instruction set called "microarchitecture"
- Pentium Pro, Pentium II, Pentium III all based on same microarchitecture (1994)
- Improved clock rate, increased cache size
- Pentium 4 has new microarchitecture (2000)
23 Dynamic Scheduling in Pentium III
- Q: How to pipeline 1 to 17 byte 80x86 instructions?
- It doesn't pipeline 80x86 instructions
- Decode unit translates the Intel instructions into 72-bit micro-operations (similar to MIPS instructions)
- Many instructions translate to 1 to 4 micro-operations
- 14 clocks in total pipeline
24 Dynamic Scheduling in Pentium III
- Max. instructions issued/clock: 3 (80x86), 6 (microops)
- Max. instr. complete exec./clock: 5 (microops)
- Max. instr. committed/clock: 3 (microops)
25 Pentium III Pipeline: 14 stages total
- 8 stages are used for in-order instruction fetch, decode, and issue
- Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations (uops)
- 3 stages are used for out-of-order execution in one of 5 separate functional units
- 3 stages are used for instruction commit (a.k.a. graduation or complete)
[Figure: Instr Fetch (16 B/clk) -> Instr Decode (3 instr/clk) -> Renaming (3 uops/clk) -> Execution units (5) -> Graduation (3 uops/clk)]
26 Pentium 4: Still translates into micro-ops
- Instruction Cache holds micro-operations instead of 80x86 instructions!
- no decode stages of 80x86 on cache hit
- called "trace cache" (TC)
- Clock rates
- Pentium III 1.2 GHz v. Pentium 4 2.0 GHz
- 14 stage pipeline vs. 24 stage pipeline
- Caches
- Pentium III: L1I 16 KB, L1D 16 KB, L2 256 KB
- Pentium 4: L1I 12K uops, L1D 8 KB, L2 256 KB
- Block size: PIII 32 B v. P4 128 B
- Faster memory bus: P4 400 MHz v. PIII 133 MHz
27 Reading Quiz
- 1. Pentium III executes instructions similar to MIPS?
- 2. What is the danger of out-of-order execution (executing after a data hazard while waiting for resolution)?
28 And in Conclusion... (1/1)
- Pipeline challenge is hazards
- Forwarding helps with many data hazards
- Delayed branch helps with control hazard in 5 stage pipeline
- More aggressive performance: superscalar, out-of-order execution
- Pentium 4 translates 80x86 instructions into MIPS-like micro-operations
- Pentium 4: long pipeline to increase clock frequency; does it really help performance?
- Macintosh PowerPC @ 500 MHz vs. Intel Pentium 4 @ 2 GHz vs. AMD Athlon @ 1.6 GHz?
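One way to reason about the closing question is the standard relation CPU time = instruction count x CPI / clock rate. The Python sketch below uses the clock rates from the slide, but the CPI values and the instruction count are purely hypothetical placeholders chosen to make a point. They are not measurements of any of these processors.

```python
def cpu_time_seconds(instr_count: float, cpi: float, clock_hz: float) -> float:
    """Execution time from instruction count, average cycles per
    instruction (CPI), and clock rate in Hz."""
    return instr_count * cpi / clock_hz


if __name__ == "__main__":
    instr_count = 1e9  # hypothetical program size, assumed equal on both

    # Clock rates are from the slide; CPI values are made up for illustration.
    fast_clock = cpu_time_seconds(instr_count, cpi=2.0, clock_hz=2e9)
    slow_clock = cpu_time_seconds(instr_count, cpi=0.5, clock_hz=500e6)

    print(f"2 GHz machine, CPI 2.0:   {fast_clock} s")
    print(f"500 MHz machine, CPI 0.5: {slow_clock} s")
```

With these made-up numbers the two machines tie, which illustrates that clock frequency alone does not settle the comparison: stalls from hazards in a longer pipeline raise CPI, and different instruction sets also change the instruction count.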