Transcript and Presenter's Notes

Title: CPSC 318 Computer Structures, Lecture 14: Pipelined Execution Part II


1
CPSC 318 Computer Structures Lecture 14
Pipelined Execution - Part II
  • Dr. Son Vuong
  • (vuong@cs.ubc.ca)
  • March 11, 2004

2
Review (1/3)
  • Optimal Pipeline
  • Each stage is executing part of an instruction
    each clock cycle.
  • One instruction finishes during each clock cycle.
  • On average, instructions execute far more quickly.
  • What makes this work?
  • Similarities between instructions allow us to use
    same stages for all instructions (generally).
  • Each stage takes about the same amount of time as
    all the others, so there is little wasted time.

3
Review (2/3)
  • Pipelining is a Big Idea: a widely used concept
  • What makes it less than perfect?
  • Structural hazards: suppose we had only one
    cache? ⇒ Need more HW resources
  • Control hazards: need to worry about branch
    instructions? ⇒ Delayed branch
  • Data hazards: an instruction depends on a
    previous instruction? ⇒ Forwarding

4
Review (3/3): 5 Steps in 5-Stage Pipeline
  • 1) IFetch: Fetch Instruction, Increment PC
  • 2) Decode: Decode Instruction, Read Registers
  • 3) Execute: Mem-ref: Calculate Address;
    Arith-log: Perform Operation
  • 4) Memory: Load: Read Data from Memory;
    Store: Write Data to Memory
  • 5) Write Back: Write Data to Register
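As an added illustration (not on the original slide; the register choice is arbitrary), here is how one load instruction maps onto the five stages:

lw $t0, 0($t1)    # 1) IFetch:     fetch the lw, increment PC
                  # 2) Decode:     decode the lw, read $t1 from the register file
                  # 3) Execute:    calculate the address 0 + $t1
                  # 4) Memory:     read the data word at that address
                  # 5) Write Back: write the loaded word into register $t0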

5
Pipelined Execution Representation
  • Every instruction must take the same number of steps,
    also called pipeline stages.
  • One clock cycle per pipeline stage
  • 500 MHz ⇒ 500 million clock cycles / second
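  • (Added worked example) At 500 MHz each clock cycle, and
    therefore each pipeline stage, lasts 1 / (500 × 10⁶) s = 2 ns.
  • One instruction then has a latency of 5 × 2 ns = 10 ns, but
    once the pipeline is full, one instruction completes every 2 ns.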

6
Structural Hazard: Registers?
Can read and write registers simultaneously (the register file writes in the first half of the clock cycle and reads in the second half)
7
Data Hazards (1/2)
  • Consider the following sequence of instructions
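The sequence itself appeared in a figure that this transcript does not preserve; a representative reconstruction (register choices are illustrative, the original may differ):

add $t0, $t1, $t2   # writes $t0 back to the register file in cycle 5
sub $t4, $t0, $t3   # wants $t0 in its EX stage (cycle 3): hazard
and $t5, $t0, $t6   # wants $t0 in cycle 4: hazard
or  $t7, $t0, $t8   # reads $t0 in cycle 5, the same cycle it is written back
xor $t9, $t0, $s0   # reads $t0 after write-back: no hazard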

8
Data Hazards (2/2)
Dependencies backwards in time are hazards
9
Data Hazard Solution: Forwarding
  • Forward result from one stage to another

(The or-instruction hazard is solved by the register-file hardware: write in the first half of the clock cycle, read in the second half)
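A minimal sketch of what forwarding does (illustrative registers, not the original figure):

add $t0, $t1, $t2   # the result of $t1 + $t2 sits at the ALU output at the end of EX
sub $t4, $t0, $t3   # forwarding routes that ALU output straight into sub's
                    # EX stage in the next cycle, so no stall is needed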
10
Data Hazard Loads (1/4)
  • Dependencies backwards in time are hazards
  • Can't solve with forwarding alone
  • Must stall instruction dependent on load, then
    forward (more hardware)

11
Data Hazard Loads (2/4)
  • Hardware must stall pipeline
  • Called interlock
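A small sketch of why the interlock is needed (same registers as the lw/sub example shown on a later slide):

lw  $t0, 0($t1)     # $t0 is only available at the end of MEM (cycle 4)
sub $t3, $t0, $t2   # needs $t0 at the start of its EX stage (cycle 4);
                    # forwarding cannot go backwards in time, so the interlock
                    # stalls sub for one cycle and then forwards MEM → EX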

12
Data Hazard Loads (3/4)
  • The instruction slot after a load is called the
    load delay slot
  • If that instruction uses the result of the load,
    then the hardware interlock will stall it for one
    cycle.
  • If the compiler puts an unrelated instruction in
    that slot, then no stall
  • Letting the hardware stall the instruction in the
    delay slot is equivalent to putting a nop in the
    slot (except that the latter uses more code space)

13
Data Hazard Loads (4/4)
  • Stall is equivalent to nop

lw  $t0, 0($t1)
nop
sub $t3, $t0, $t2
and $t5, $t0, $t4
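By contrast, a hedged sketch of the compiler filling the load delay slot with an unrelated instruction (a variant of the code above; here the and no longer depends on $t0, so it can be moved up):

lw  $t0, 0($t1)
and $t5, $t6, $t4   # independent instruction fills the load delay slot
sub $t3, $t0, $t2   # $t0 is ready (forwarded from MEM): no stall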
14
Historical Trivia
  • First MIPS design did not interlock and stall on
    load-use data hazard
  • Real reason behind the name MIPS: Microprocessor
    without Interlocked Pipeline Stages
  • Word play on the acronym for Millions of
    Instructions Per Second, also called MIPS

15
Example: Nondelayed vs. Delayed Branch
(Figure: side-by-side code for a Nondelayed Branch and a Delayed Branch; a reconstruction follows.)
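A hedged reconstruction of the kind of code the figure showed (the exact instructions in the original slide may differ):

Nondelayed branch:
        or  $8, $9, $10
        add $1, $2, $3
        sub $4, $5, $6
        beq $1, $4, Exit
        xor $10, $1, $11
Exit:

Delayed branch (the or is moved into the branch-delay slot):
        add $1, $2, $3
        sub $4, $5, $6
        beq $1, $4, Exit
        or  $8, $9, $10     # executed whether or not the branch is taken
        xor $10, $1, $11
Exit: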
16
Control Hazard: Branching
  • Notes on the Branch-Delay Slot
  • Worst-Case Scenario: can always put a no-op in
    the branch-delay slot
  • Better Case: can find an instruction preceding
    the branch which can be placed in the
    branch-delay slot without affecting the flow of
    the program
  • The compiler can usually find such an instruction
    at least 50% of the time
  • Jumps also have a delay slot

17
Advanced Pipelining Concepts
  • Out-of-order Execution
  • Superscalar execution
  • State-of-the-art microprocessors: Pentium III and
    Pentium 4

18
Review Pipeline Hazard: Stall if Dependency
(Figure: laundry pipeline timeline, tasks in task order vs. time from 6 PM to 2 AM.)
  • A depends on D; stall since the folder is tied up

19
Out-of-Order Laundry: Don't Wait
(Figure: laundry timeline with 30-minute stages, tasks A-F in task order vs. time from 6 PM to 2 AM.)
  • A depends on D; the rest continue; need more
    resources to allow out-of-order execution

20
Superscalar Laundry: Parallel per Stage
(Figure: laundry timeline with duplicated resources so several tasks move through each stage in parallel, vs. time from 6 PM to 2 AM.)
  • More resources, HW to match the mix of parallel
    tasks?

21
Superscalar Laundry: Mismatch Mix
(Figure: laundry timeline with 30-minute stages; task loads labeled (light clothing), (dark clothing), (light clothing), vs. time from 6 PM to 2 AM.)
  • Task mix underutilizes extra resources

22
Intel Internals
  • Hardware below the instruction set is called the
    "microarchitecture"
  • Pentium Pro, Pentium II, and Pentium III are all
    based on the same microarchitecture (1994)
  • Improved clock rate, increased cache size
  • Pentium 4 has new microarchitecture (2000)

23
Dynamic Scheduling in Pentium III
  • Q: How to pipeline 1- to 17-byte 80x86
    instructions?
  • It doesn't pipeline 80x86 instructions
  • The decode unit translates the Intel instructions
    into 72-bit micro-operations (similar to MIPS
    instructions)
  • Many instructions translate to 1 to 4
    micro-operations
  • 14 clocks in the total pipeline

24
Dynamic Scheduling in Pentium III
  Parameter                             80x86   micro-ops
  Max. instructions issued/clock          3         6
  Max. instr. completing exec./clock                5
  Max. instr. committed/clock                       3

25
Pentium III Pipeline: 14 stages total
  • 8 stages are used for in-order instruction fetch,
    decode, and issue
  • Takes 1 clock cycle to determine the length of an
    80x86 instruction, and 2 more to create the
    micro-operations (µops)
  • 3 stages are used for out-of-order execution in
    one of 5 separate functional units
  • 3 stages are used for instruction commit (a.k.a.
    graduation or completion)

(Pipeline diagram: Instr Fetch 16 B/clk → Instr Decode 3 instr/clk → Renaming 3 µops/clk → Execution units (5) → Graduation 3 µops/clk)
26
Pentium 4: Still translates into micro-ops
  • Instruction cache holds micro-operations instead
    of 80x86 instructions!
  • no 80x86 decode stages on a cache hit
  • called the trace cache (TC)
  • Clock rates:
  • Pentium III 1.2 GHz vs. Pentium 4 2.0 GHz
  • 14-stage pipeline vs. 24-stage pipeline
  • Caches:
  • Pentium III: L1I 16 KB, L1D 16 KB, L2 256 KB
  • Pentium 4: L1I 12K µops, L1D 8 KB, L2 256 KB
  • Block size: PIII 32 B vs. P4 128 B
  • Faster memory bus: 400 MHz vs. 133 MHz

27
Reading Quiz
  • 1. Does the Pentium III execute instructions
    similar to MIPS?
  • 2. What is the danger of out-of-order execution
    (executing past a data hazard while waiting for
    its resolution)?

28
And in Conclusion.. 1/1
  • The pipeline challenge is hazards
  • Forwarding helps with many data hazards
  • Delayed branch helps with the control hazard in the
    5-stage pipeline
  • More aggressive performance: superscalar,
    out-of-order execution
  • Pentium 4 translates 80x86 into MIPS-like
    micro-instructions
  • Pentium 4 uses a long pipeline to increase clock
    frequency; does it really help performance?
  • Macintosh PowerPC @ 500 MHz vs.
    Intel Pentium 4 @ 2 GHz
    vs. AMD Athlon @ 1.6 GHz?