OMSE 510: Computing Foundations 4: The CPU


1
OMSE 510 Computing Foundations 4: The CPU!
  • Chris Gilmore <grimjack@cs.pdx.edu>
  • Systems Software Lab
  • Portland State University/OMSE

2
Today
  • Caches
  • DLX Assembly
  • CPU Overview


3
Introduction to RISC
  • Reduced Instruction Set Computer
  • 1975: John Cocke, IBM 801
  • IBM started working on a RISC-type computer in
    1975 without calling it by that name
  • Used as an I/O processor for IBM mainframes
  • Patterson and Hennessy
  • The term RISC was first introduced by Patterson
    and Ditzel in 1980
  • Produced the first RISC chips in the early 1980s
  • RISC I and RISC II from Berkeley, and MIPS from
    Stanford

4
RISC Chips
  • RISC II
  • Had 39 instructions, 2 addressing modes, and 3
    data types
  • 234 combinations (39 × 2 × 3)
  • Compared to the VAX: 304 instructions, 16
    addressing modes, 14 data types
  • 68,096 combinations (304 × 16 × 14)
  • Found that:
  • Compiled programs were 30% larger than on CISC
    (VAX 11/780)
  • Ran up to 5 times faster than the 68000
  • Assembler-to-compiler ratio (execution time of the
    assembler program divided by the exec time of the
    compiled version):
  • Ratio < 50% for CISC
  • ≈ 90% for RISC

5
RISC Definition
  • 1. Single cycle operation
  • 2. Load / store design
  • 3. Hardwired control unit
  • 4. Few instructions and addressing modes
  • 5. Fixed instruction format
  • 6. More compile time effort to avoid pipeline
    penalties

6
Disadvantages of CISC
  • Large, complicated, and time-consuming
    instruction set
  • Complex CU to decode and execute
  • Not necessarily faster than a sequence of several
    RISC instructions
  • Complexity of the CISC CU
  • A large number of design errors
  • Longer design time
  • Too large a choice for the compiler
  • Very difficult to design the optimal compiler
  • Does not always yield the most efficient code
  • Specialized to fit certain HLL instructions
  • May be redundant for another HLL
  • Relatively low cost/benefit factor

7
The Advantage of RISC
  • RISC and VLSI realization
  • Relatively small and simple control unit (CU)
    hardware
  • CU share of chip area: RISC I 6%, RISC II 10%,
    MC68020 68%
  • Higher chance of fitting other features on a chip
  • Can fit a large number of CPU registers
  • Enhances the throughput and HLL support
  • Increase the regularization factor

8
The Advantage of RISC
  • RISC and Computing Speed
  • Faster decoding process
  • Small instruction set, addressing mode, fixed
    instruction format
  • Reduced memory accesses
  • A large number of CPU registers permits R-R
    operations
  • Faster Parameter passing
  • Register windows in RISC I and RISC II
  • Streamlined instruction handling
  • All instructions have the same length
  • All execute in one cycle
  • Suitable for the pipelined implementation

9
The Advantage of RISC
  • RISC and design costs and reliability
  • Shorter time to design and reduction of overall
    design costs
  • Reduce the probability that the end product will
    be obsolete
  • Reduced number of design errors
  • Virtual Memory Management System enhancement
  • Instructions will not cross word boundaries and
    can't wind up on two separate pages

10
The Advantage of RISC
  • RISC and HLL Support
  • Shorter and simpler compiler
  • Usually only a single choice rather than several
    choices as in CISC
  • Large Number of CPU registers
  • More efficient code optimization
  • Fast Parameter Passing between procedures
  • register windows
  • Reduced burden on compiler writer

11
The Disadvantage and Criticism of RISC(80s)
  • RISC code tends to be longer
  • Extra burden on the machine and assembly language
    programmer
  • Several instructions required per single CISC
    instruction
  • More memory locations needed for their storage
  • Floating-point support and VMM support

12
RISC Characteristics
  • Pipelined operation
  • Compiler responsible for pipeline conflict
    resolution
  • Delayed branch
  • Delayed load

13
Question 1 Why do microcoding?
  • If simple instructions could execute at a very
    high clock rate
  • If you could even write compilers to produce
    microinstructions
  • If most programs use simple instructions and
    addressing modes
  • If microcode is kept in RAM instead of ROM so as
    to fix bugs
  • If same memory used for control memory could be
    used instead as cache for macroinstructions
  • Then why not skip instruction interpretation by a
    microprogram and simply compile directly into
    lowest language of machine? (microprogramming is
    overkill when ISA matches datapath 1-1)

14
Pipelining is Natural!
  • Laundry Example
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

15
Sequential Laundry
[Figure: timeline from 6 PM to midnight; the four
loads run back-to-back, each taking 30 min wash +
40 min dry + 20 min fold]
  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would
    laundry take?

16
Pipelined Laundry Start work ASAP
[Figure: same timeline; each load's wash overlaps the
previous load's dry, so all four loads finish by
9:30 PM]
  • Pipelined laundry takes 3.5 hours for 4 loads
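
As a sanity check on these two numbers, here is a
minimal C sketch (mine, not from the slides) that
evaluates both schedules using the stage times given
earlier:

    #include <stdio.h>

    int main(void) {
        int wash = 30, dry = 40, fold = 20; /* stage times in minutes (from the slides) */
        int loads = 4;

        /* Sequential: each load finishes completely before the next starts. */
        int sequential = loads * (wash + dry + fold);  /* 4 * 90 = 360 min = 6 h */

        /* Pipelined: the first load takes the full 90 minutes; after that, one
           load finishes every 40 minutes (the slowest stage, the dryer). */
        int pipelined = (wash + dry + fold) + (loads - 1) * dry;  /* 90 + 120 = 210 min = 3.5 h */

        printf("sequential: %d min, pipelined: %d min\n", sequential, pipelined);
        return 0;
    }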

17
Pipelining Lessons
  • Pipelining doesn't help the latency of a single
    task; it helps the throughput of the entire
    workload
  • Pipeline rate is limited by the slowest pipeline
    stage
  • Multiple tasks operate simultaneously using
    different resources
  • Potential speedup = number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to fill the pipeline and time to drain it
    reduce speedup
  • Stall for dependences

18
Execution Cycle
  • Instruction fetch: obtain instruction from program
    storage
  • Instruction decode: determine required actions and
    instruction size
  • Operand fetch: locate and obtain operand data
  • Execute: compute result value or status
  • Result store: deposit results in storage for later
    use
  • Next instruction: determine successor instruction
19
The Five Stages of Load
[Figure: a load instruction moving through Cycles 1-5]
  • Ifetch: fetch the instruction from the instruction
    memory
  • Reg/Dec: register fetch and instruction decode
  • Exec: calculate the memory address
  • Mem: read the data from the data memory
  • Wr: write the data back to the register file

20
Note: these 5 stages were there all along!
Fetch
Decode
Execute
Memory
Write-back
21
Pipelining
  • Improve performance by increasing throughput
  • Ideal speedup is number of stages in the
    pipeline. Do we achieve this?

22
Basic Idea
  • What do we need to add to split the datapath into
    stages?

23
Graphically Representing Pipelines
  • Can help with answering questions like
  • how many cycles does it take to execute this
    code?
  • what is the ALU doing during cycle 4?
  • use this representation to help understand
    datapaths

24
Conventional Pipelined Execution Representation
[Figure: instructions flowing diagonally through the
pipeline stages over time, in program order]
25
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock diagrams. Single-cycle implementation:
Load, then Store, one long cycle each, with wasted
slack after the shorter Store. Multiple-cycle
implementation: Cycles 1-10 shared by Load, Store,
and an R-type. Pipelined implementation: Load, Store,
and R-type overlapped, one starting per cycle.]
26
Why Pipeline?
  • Suppose we execute 100 instructions
  • Single-cycle machine:
  • 45 ns/cycle × 1 CPI × 100 inst = 4500 ns
  • Multicycle machine:
  • 10 ns/cycle × 4.6 CPI (due to inst mix) × 100
    inst = 4600 ns
  • Ideal pipelined machine:
  • 10 ns/cycle × (1 CPI × 100 inst + 4 cycle drain)
    = 1040 ns
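
The three products above can be checked mechanically;
a small C sketch (mine, not part of the deck):

    #include <stdio.h>

    int main(void) {
        int n = 100;                              /* instructions, from the slide */
        double single    = 45.0 * 1.0 * n;        /* 45 ns/cycle, CPI 1   -> 4500 ns */
        double multi     = 10.0 * 4.6 * n;        /* 10 ns/cycle, CPI 4.6 -> 4600 ns */
        double pipelined = 10.0 * (1.0 * n + 4);  /* 4 extra cycles to drain -> 1040 ns */
        printf("%.0f ns, %.0f ns, %.0f ns\n", single, multi, pipelined);
        return 0;
    }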

27
Why Pipeline? Because we can!
[Figure: five instructions (Inst 0-4) overlapped in
the pipeline, one starting per clock cycle]
28
Can pipelining get us into trouble?
  • Yes: pipeline hazards
  • Structural hazards: attempt to use the same
    resource two different ways at the same time
  • E.g., a combined washer/dryer would be a
    structural hazard, or the folder busy doing
    something else (watching TV)
  • Control hazards: attempt to make a decision before
    the condition is evaluated
  • E.g., washing football uniforms and needing the
    proper detergent level: must see the result after
    the dryer before starting the next load
  • Branch instructions
  • Data hazards: attempt to use an item before it is
    ready
  • E.g., one sock of a pair in the dryer and one in
    the washer: can't fold until the sock gets from
    the washer through the dryer
  • An instruction depends on the result of a prior
    instruction still in the pipeline
  • Can always resolve hazards by waiting
  • Pipeline control must detect the hazard
  • Take action (or delay action) to resolve hazards

29
Single Memory is a Structural Hazard
[Figure: pipeline diagram of a Load followed by four
instructions sharing a single memory; one
instruction's Mem stage collides with a later
instruction's instruction fetch]
Detection is easy in this case! (right half
highlight means read, left half write)
30
Structural Hazards limit performance
  • Example: with 1.3 memory accesses per instruction
    and only one memory access per cycle,
  • average CPI ≥ 1.3
  • otherwise the memory resource would be more than
    100% utilized

31
Control Hazard Solution 1 Stall
  • Stall: wait until the decision is clear
  • Impact: 2 lost cycles (i.e., 3 clock cycles per
    branch instruction) ⇒ slow
  • Move the decision to the end of decode:
  • saves 1 cycle per branch

32
Control Hazard Solution 2 Predict
  • Predict: guess one direction, then back up if
    wrong
  • Impact: 0 lost cycles per branch instruction if
    right, 1 if wrong (right 50% of the time)
  • Need to "squash" and restart the following
    instruction if wrong
  • Produces a CPI on branches of (1 × 0.5 + 2 × 0.5)
    = 1.5
  • Total CPI might then be 1.5 × 0.2 + 1 × 0.8 = 1.1
    (with 20% branches)
  • More dynamic scheme: keep a history per branch
    (≈ 90% right)
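
The CPI arithmetic above generalizes; here is a
minimal C sketch (mine; the 20% branch frequency and
50% accuracy come from the slide):

    #include <stdio.h>

    /* Average CPI when branches take extra cycles and everything else is CPI 1. */
    static double total_cpi(double branch_frac, double branch_cpi) {
        return branch_frac * branch_cpi + (1.0 - branch_frac) * 1.0;
    }

    int main(void) {
        double branch_cpi = 1 * 0.5 + 2 * 0.5;   /* right half the time: 1.5 */
        printf("branch CPI %.2f, total CPI %.2f\n",
               branch_cpi, total_cpi(0.20, branch_cpi));  /* 1.50, 1.10 */
        return 0;
    }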

33
Control Hazard Solution 3 Delayed Branch
  • Delayed branch: redefine branch behavior (the
    branch takes effect after the next instruction)
  • Impact: 0 clock cycles per branch instruction if
    the compiler can find an instruction to put in the
    slot (≈ 50% of the time)
  • As we launch more instructions per clock cycle,
    this becomes less useful

34
Delayed/Predicted Branch
  • Where do we get instructions to fill the branch
    delay slot?
  • From before the branch instruction
  • From the target address: only valuable when the
    branch is taken
  • From the fall-through path: only valuable when the
    branch is not taken
  • Cancelling branches allow more slots to be filled
  • Compiler effectiveness for a single branch delay
    slot:
  • Fills about 60% of branch delay slots
  • About 80% of instructions executed in branch delay
    slots are useful in computation
  • About 50% (60% × 80%) of slots are usefully filled
  • Delayed-branch downside: 7-8 stage pipelines and
    multiple instructions issued per clock
    (superscalar) make the delay slot less useful

35
Data Hazard on r1
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
36
Data Hazard on r1
  • Dependencies backwards in time are hazards

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB):
add r1,r2,r3 followed by sub r4,r1,r3, and r6,r1,r7,
or r8,r1,r9, xor r10,r1,r11; the later instructions
read r1 before add writes it back]
37
Data Hazard Solution
  • Forward results from one stage to another
  • The "or" is OK even without forwarding if we
    define register-file write (first half-cycle) and
    read (second half-cycle) properly

[Figure: the same pipeline diagram with forwarding
paths from add's EX/MEM/WB stages to the EX stages of
the following instructions]
38
Forwarding (or Bypassing) What about Loads?
  • Dependencies backwards in time are
    hazards
  • Can't solve with forwarding
  • Must delay/stall instruction dependent on loads

[Figure: lw r1,0(r2) followed by sub r4,r1,r3; the
loaded value is available only after MEM, one cycle
after sub's EX needs it]
39
Forwarding (or Bypassing): What about Loads?
  • Dependencies backwards in time are
    hazards
  • Can't solve with forwarding
  • Must delay/stall instruction dependent on loads

[Figure: the same diagram with a one-cycle stall
(bubble) inserted so sub r4,r1,r3 reads r1 after the
load's MEM stage]
40
Detecting Control Signals
41
Conflicts/Problems
  • I-cache and D-cache are accessed in the same
    cycle: it helps to implement them separately
  • Registers are read and written in the same cycle:
    easy to deal with if register read/write time
    equals cycle time/2 (else, use bypassing)
  • Branch target changes only at the end of the
    second stage -- what do you do in the meantime?
  • Data between stages get latched into registers
    (overhead that increases latency per instruction)

42
Control Hazards
  • Simple techniques to handle control hazard stalls:
  • For every branch, introduce a stall cycle (note:
    every 6th instruction is a branch!)
  • Assume the branch is not taken and start fetching
    the next instruction; if the branch is taken, need
    hardware to cancel the effect of the wrong-path
    instruction
  • Fetch the next instruction (branch delay slot) and
    execute it anyway; if the instruction turns out to
    be on the correct path, useful work was done; if
    the instruction turns out to be on the wrong path,
    hopefully program state is not lost

43
Slowdowns from Stalls
  • Perfect pipelining with no hazards → an
    instruction completes every cycle (total cycles =
    number of instructions)
  • → speedup = increase in clock speed = number of
    pipeline stages
  • With hazards and stalls, some cycles (= stall
    time) go by during which no instruction completes,
    and then the stalled instruction completes
  • Total cycles = number of instructions + stall
    cycles
  • Slowdown because of stalls = 1 / (1 + stall
    cycles per instr)
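
Combining the slide's two formulas, pipeline speedup
over the unpipelined machine is (number of stages) ×
1 / (1 + stall cycles per instruction). A minimal C
sketch (the 5-stage/0.5-stall numbers are my example,
not the slide's):

    #include <stdio.h>

    static double speedup(int stages, double stalls_per_instr) {
        return stages / (1.0 + stalls_per_instr);
    }

    int main(void) {
        /* e.g., a 5-stage pipeline with 0.5 stall cycles per instruction */
        printf("speedup = %.2fx\n", speedup(5, 0.5));  /* 5 / 1.5 = 3.33x */
        return 0;
    }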

44
Control and Datapath: Split the state diagram into 5
pieces
  • Fetch: IR ← Mem[PC]; PC ← PC + 4
  • Decode: A ← R[rs]; B ← R[rt]
  • Execute: S ← A + B, or S ← A + SX, or
    S ← A or ZX; for branches, if Cond then
    PC ← PC + SX
  • Memory: M ← Mem[S], or Mem[S] ← B
  • Write-back: R[rd] ← S, R[rd] ← M, or R[rt] ← S
[Figure: datapath pieces -- Next PC/Equal, PC/Inst.
Mem/IR, Reg. File, Exec, Mem Access/Data Mem]
45
Three Generic Data Hazards
  • Instr_I followed by Instr_J
  • Read After Write (RAW): Instr_J tries to read an
    operand before Instr_I writes it

46
Three Generic Data Hazards
  • Instr_I followed by Instr_J
  • Write After Read (WAR): Instr_J tries to write an
    operand before Instr_I reads it
  • Gets the wrong operand
  • Can't happen in the DLX 5-stage pipeline because:
  • All instructions take 5 stages, and
  • Reads are always in stage 2, and
  • Writes are always in stage 5

47
Three Generic Data Hazards
  • Instr_I followed by Instr_J
  • Write After Write (WAW): Instr_J tries to write an
    operand before Instr_I writes it
  • Leaves the wrong result (Instr_I's, not Instr_J's)
  • Can't happen in the DLX 5-stage pipeline because:
  • All instructions take 5 stages, and
  • Writes are always in stage 5
  • Can have WAR and WAW in more complicated pipes

48
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
    LW   Rb, b
    LW   Rc, c
    ADD  Ra, Rb, Rc
    SW   a, Ra
    LW   Re, e
    LW   Rf, f
    SUB  Rd, Re, Rf
    SW   d, Rd

Fast code:
    LW   Rb, b
    LW   Rc, c
    LW   Re, e
    ADD  Ra, Rb, Rc
    LW   Rf, f
    SW   a, Ra
    SUB  Rd, Re, Rf
    SW   d, Rd
49
Summary Pipelining
  • Reduce CPI by overlapping many instructions
  • Average throughput of approximately 1 CPI with
    fast clock
  • Utilize capabilities of the Datapath
  • start next instruction while working on the
    current one
  • limited by length of longest stage (plus
    fill/flush)
  • detect and resolve hazards
  • What makes it easy?
  • all instructions are the same length
  • just a few instruction formats
  • memory operands appear only in loads and stores
  • What makes it hard?
  • structural hazards suppose we had only one
    memory
  • control hazards need to worry about branch
    instructions
  • data hazards an instruction depends on a
    previous instruction

50
Some Issues for your consideration
  • Won't be tested
  • We'll talk about modern processors and what's
    really hard
  • Exception handling
  • Trying to improve performance with out-of-order
    execution, etc.
  • Trying to get CPI < 1 (superscalar execution)

51
Superscalar Execution
  • Throwing more hardware at the problem
  • Instruction-level parallelism (ILP)
  • Multiple functional units
  • E.g., multiple ALUs
  • Add a, b, c
  • Add d, e, f
  • Can get CPI < 1!

52
Out-of-order execution
  • Idea: it's best if we keep all functional units
    busy
  • Can sometimes reorder computations to take
    advantage of functional units that are otherwise
    idle
  • Automatically do reordering like we did 4 slides
    ago!

53
Register Renaming
  • Internally rename registers to allow for better
    ILP
  • Add a, b, c
  • Add b, c, d (the write to b conflicts with the
    read of b above; renaming removes this WAR hazard,
    as sketched below)
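
A minimal illustration, using C variables in place of
registers (my example, not from the deck): the second
add writes b, which the first add still reads, so the
two cannot be reordered; writing the result under a
fresh name removes the conflict.

    #include <stdio.h>

    int main(void) {
        int b = 1, c = 2, d = 3, a;

        /* Original: WAR hazard on b -- the write must not move above the read. */
        a = b + c;       /* reads b */
        b = c + d;       /* writes b */

        /* Renamed: the new value gets a fresh name (b2), so the two adds
           share no storage and may execute in either order. Later uses of
           "b" are redirected to b2. */
        int b0 = 1;      /* the original b value, for the renamed version */
        int a_r = b0 + c;
        int b2  = c + d;

        printf("a=%d b=%d a_r=%d b2=%d\n", a, b, a_r, b2);
        return 0;
    }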

54
Hyperthreading/Multicore
  • Hyperthreading
  • >1 virtual CPU per physical CPU
  • Multi-core
  • >1 actual CPU per die

55
Integrated Circuits Costs
IC cost = (Die cost + Testing cost + Packaging cost)
          / Final test yield

Die cost = Wafer cost / (Dies per wafer × Die yield)

Dies per wafer =
    π × (Wafer_diam / 2)² / Die_Area
  − π × Wafer_diam / sqrt(2 × Die_Area)
  − Test dies

Die yield = Wafer yield ×
    (1 + Defects_per_unit_area × Die_Area / α)^(−α)

Die cost goes roughly with (die area)^4
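
A minimal C sketch of this cost model. The wafer
diameter (15 cm), α = 3, wafer yield 1.0, and zero
test dies are my assumptions, chosen so the output
roughly reproduces the 386DX row of the table on the
next slide; none of them are given in the deck.

    #include <math.h>
    #include <stdio.h>

    static double dies_per_wafer(double wafer_diam_cm, double die_area_cm2,
                                 double test_dies) {
        double r = wafer_diam_cm / 2.0;
        return M_PI * r * r / die_area_cm2
             - M_PI * wafer_diam_cm / sqrt(2.0 * die_area_cm2)
             - test_dies;
    }

    static double die_yield(double wafer_yield, double defects_per_cm2,
                            double die_area_cm2, double alpha) {
        return wafer_yield * pow(1.0 + defects_per_cm2 * die_area_cm2 / alpha,
                                 -alpha);
    }

    int main(void) {
        /* 386DX: $900 wafer, 1.0 defect/cm2, 43 mm2 = 0.43 cm2 */
        double dpw   = dies_per_wafer(15.0, 0.43, 0.0);  /* ~360 dies */
        double yield = die_yield(1.0, 1.0, 0.43, 3.0);   /* ~0.67     */
        printf("dies/wafer %.0f, yield %.2f, die cost $%.2f\n",
               dpw, yield, 900.0 / (dpw * yield));       /* ~$3.7 vs. $4 in the table */
        return 0;
    }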
56
Real World Examples
Chip         Metal   Line        Wafer  Defects  Area   Dies/  Yield  Die
             layers  width (µm)  cost   /cm2     (mm2)  wafer         cost
386DX        2       0.90        $900   1.0      43     360    71%    $4
486DX2       3       0.80        $1200  1.0      81     181    54%    $12
PowerPC 601  4       0.80        $1700  1.3      121    115    28%    $53
HP PA 7100   3       0.80        $1300  1.0      196    66     27%    $73
DEC Alpha    3       0.70        $1500  1.2      234    53     19%    $149
SuperSPARC   3       0.70        $1700  1.6      256    48     13%    $272
Pentium      3       0.80        $1500  1.5      296    40     9%     $417

From "Estimating IC Manufacturing Costs," by Linley
Gwennap, Microprocessor Report, August 2, 1993, p. 15

57
Midterm Questions
  • Examples:
  • List and describe 3 types of DRAM
  • What are the relative advantages/disadvantages of
    RISC/CISC?
  • Why do we have a memory hierarchy?
  • Using your choice of assembly, write a (commented)
    routine that computes the nth Fibonacci number.
  • Why do CPUs have registers?
  • Describe how a 3-disk RAID-5 system works


58
Midterm Questions
  • More examples:
  • What are the differences between a synchronous
    and an asynchronous bus? What are the relative
    advantages/disadvantages?
  • List and describe techniques to improve cache
    miss rate, reduce cache miss penalty, and reduce
    cache hit times


59
Topics for further study
  • The following slides will not be covered in class
    or on tests.

60
Multicycle Instructions
61
Effects of Multicycle Instructions
  • Structural hazards if the unit is not fully
    pipelined (e.g., a divider)
  • Frequent RAW hazard stalls
  • Potentially multiple writes to the register file
    in a cycle
  • WAW hazards because of out-of-order instr
    completion
  • Imprecise exceptions because of out-of-order instr
    completion

62
Precise Exceptions
  • On an exception:
  • Must save the PC of the instruction where the
    program must resume
  • All instructions after that PC that might be in
    the pipeline must be converted to NOPs (other
    instructions continue to execute and may raise
    exceptions of their own)
  • Temporary program state not in memory (in other
    words, registers) has to be stored in memory
  • Potential problems if a later instruction has
    already modified memory or registers
  • A processor that fulfils all the above conditions
    is said to provide precise exceptions (useful for
    debugging and, of course, correctness)

63
Dealing with these Effects
  • Multiple writes to the register file: increase the
    number of ports, stall one of the writers during
    ID, or stall one of the writers during WB (the
    stall will propagate)
  • WAW hazards: detect the hazard during ID and stall
    the later instruction
  • Imprecise exceptions: buffer the results if they
    complete early, or save more pipeline state so
    that you can return to exactly the state that you
    left

64
ILP
  • Instruction-level parallelism: overlap among
    instructions via pipelining or multiple
    instruction execution
  • What determines the degree of ILP?
  • Dependences: a property of the program
  • Hazards: a property of the pipeline

65
Types of Dependences
  • Data dependences: an instr produces a result for
    another (true dependence; results in RAW hazards
    in a pipeline)
  • Name dependences: two instrs that use the same
    names (anti- and output dependences; result in WAR
    and WAW hazards in a pipeline)
  • Control dependences: an instruction's execution
    depends on the result of a branch; re-ordering
    should preserve exception behavior and dataflow

66
An Out-of-Order Processor Implementation

Original code:
    R1 ← R1 + R2
    R2 ← R1 + R3
    BEQZ R2
    R3 ← R1 + R2
    R1 ← R3 + R2

After decode and rename:
    T1 ← R1 + R2
    T2 ← T1 + R3
    BEQZ T2
    T4 ← T1 + T2
    T5 ← T4 + T2

[Figure: branch prediction and instr fetch feed the
instr fetch queue (Instr 1-6); decode/rename; issue
queue (IQ); three ALUs; reorder buffer (ROB) with
tags T1-T6; register file R1-R32. Results are written
to the ROB and tags broadcast to the IQ.]
67
Design Details - I
  • Instructions enter the pipeline in order
  • No need for branch delay slots if prediction
    happens in time
  • Instructions leave the pipeline in order: all
    instructions that enter also get placed in the
    ROB; the process of an instruction leaving the ROB
    (in order) is called commit
  • An instruction commits only if it and all
    instructions before it have completed successfully
    (without an exception)
  • To preserve precise exceptions, a result is
    written into the register file only when the
    instruction commits; until then, the result is
    saved in a temporary register in the ROB
68
Design Details - II
  • Instructions get renamed and placed in the issue
    queue
  • Some operands are available (T1-T6; R1-R32), while
    others are being produced by instructions in
    flight (T1-T6)
  • As instructions finish, they write results into
    the ROB (T1-T6) and broadcast the operand tag
    (T1-T6) to the issue queue; instructions now know
    whether their operands are ready
  • When a ready instruction issues, it reads its
    operands from T1-T6 and R1-R32 and executes
    (out-of-order execution)
  • Can you have WAW or WAR hazards? By using more
    names (T1-T6), name dependences can be avoided

69
Design Details - III
  • If instr-3 raises an exception, wait until it
    reaches the top of the ROB; at this point, R1-R32
    contain results for all instructions up to
    instr-3; save registers, save the PC of instr-3,
    and service the exception
  • If a branch is a mispredict, flush all
    instructions after the branch and start on the
    correct path; mispredicted instrs will not have
    updated registers (the branch cannot commit until
    it has completed, and the flush happens as soon as
    the branch completes)
  • Potential problems?

70
Managing Register Names
Temporary values are stored in the register file and
not the ROB
Logical registers: R1-R32; physical registers: P1-P64
At the start, R1-R32 can be found in P1-P32.
Instructions stop entering the pipeline when P64 is
assigned.

Original code:
    R1 ← R1 + R2
    R2 ← R1 + R3
    BEQZ R2
    R3 ← R1 + R2

Renamed code:
    P33 ← P1 + P2
    P34 ← P33 + P3
    BEQZ P34
    P35 ← P33 + P34

What happens on commit?
71
The Commit Process
  • On commit, no copy is required
  • The register map table is updated: the committed
    value of R1 is now in P33 and not P1; on an
    exception, P33 is copied to memory, not P1
  • An instruction in the issue queue need not modify
    its input operand when the producer commits
  • When instruction-1 commits, we no longer have any
    use for P1; it is put in a free pool, and a new
    instruction can now enter the pipeline → for every
    instr that commits, a new instr can enter the
    pipeline → the number of in-flight instrs is a
    constant = the number of extra (rename) registers

72
The Alpha 21264 Out-of-Order Implementation

Original code:
    R1 ← R1 + R2
    R2 ← R1 + R3
    BEQZ R2
    R3 ← R1 + R2
    R1 ← R3 + R2

After decode and rename:
    P33 ← P1 + P2
    P34 ← P33 + P3
    BEQZ P34
    P35 ← P33 + P34
    P36 ← P35 + P34

[Figure: branch prediction and instr fetch feed the
instr fetch queue (Instr 1-6); decode/rename via a
register map table (R1→P1, R2→P2, ...); issue queue
(IQ); three ALUs; reorder buffer (ROB); register file
P1-P64. Results are written to the register file and
tags broadcast to the IQ.]
73
Lecture 11: Advanced Static ILP
  • Topics: loop unrolling, software pipelining
    (Section 4.4)

74
Loop Dependences
  • If a loop only has dependences within an
    iteration, the loop is considered parallel →
    multiple iterations can be executed together so
    long as order within an iteration is preserved
  • If a loop has dependences across iterations, it is
    not parallel, and these dependences are referred
    to as loop-carried
  • Not all loop-carried dependences imply a lack of
    parallelism
  • Parallel loops are especially desirable in a
    multiprocessor system

75
Examples
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
No dependences

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}
S2 depends on S1 in the same iteration; S1 depends on
S1 from the previous iteration; S2 depends on S2 from
the previous iteration

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}
S1 depends on S2 from the previous iteration

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i-3] + s;         /* S1 */
S1 depends on S1 from 3 previous iterations; referred
to as a recursion; dependence distance 3: limited
parallelism
76
Finding Dependences the GCD Test
  • Do A[a*i + b] and A[c*i + d] refer to the same
    element?
  • Restrict ourselves to affine array indices
    (expressible as a*i + b, where i is the loop index
    and a and b are constants)
  • Example of a non-affine index: x[y[i]]
  • For a dependence to exist, there must be two
    indices j and k, both within the loop bounds, such
    that
  • a*j + b = c*k + d
  • a*j - c*k = d - b
  • G = GCD(a, c)
  • (a*j/G - c*k/G) = (d - b)/G
  • If (d - b)/G is not an integer, the initial
    equality cannot be true
77
Static vs. Dynamic ILP
Original loop:
Loop: L.D    F0, 0(R1)      ; F0 = array element
      ADD.D  F4, F0, F2     ; add scalar
      S.D    F4, 0(R1)      ; store result
      DADDUI R1, R1, -8     ; decrement address pointer
      BNE    R1, R2, Loop   ; branch if R1 != R2

Statically unrolled loop:
Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, -32
      S.D    F12, 16(R1)
      BNE    R1, R2, Loop
      S.D    F16, 8(R1)

Large-window dynamic out-of-order processor (the
hardware sees the original loop repeated):
      L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop
      L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop
      L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      ..
78
Dynamic ILP
As fetched (every iteration reuses F0, F4, R1):
      L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop
      (repeated for each iteration)

Renamed (each iteration gets fresh registers):
      L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R3, R1, -8
      BNE    R3, R2, Loop
      L.D    F6, 0(R3)
      ADD.D  F8, F6, F2
      S.D    F8, 0(R3)
      DADDUI R4, R3, -8
      BNE    R4, R2, Loop
      L.D    F10, 0(R4)
      ADD.D  F12, F10, F2
      S.D    F12, 0(R4)
      DADDUI R5, R4, -8
      BNE    R5, R2, Loop
      L.D    F14, 0(R5)
      ADD.D  F16, F14, F2
      S.D    F16, 0(R5)
      DADDUI R6, R5, -8
      BNE    R6, R2, Loop
79
Dynamic ILP
The renamed code of the previous slide, annotated
with issue cycles. One new iteration starts issuing
every cycle:

Issue cycle per iteration
(L.D, ADD.D, S.D, DADDUI, BNE):
  iteration 1:  1, 3, 6, 1, 3
  iteration 2:  2, 4, 7, 2, 4
  iteration 3:  3, 5, 8, 3, 5
  iteration 4:  4, 6, 9, 4, 6
80
Loop Pipeline
[Figure: successive iterations (L.D, ADD.D, S.D,
DADDUI, BNE) staggered like pipeline stages; in
steady state one iteration is in its L.D, another in
its ADD.D, and another in its S.D at the same time]
81
Statically Unrolled Loop
Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      L.D    F18, -32(R1)
      ADD.D  F4, F0, F2
      L.D    F22, -40(R1)
      ADD.D  F8, F6, F2
      L.D    F26, -48(R1)
      ADD.D  F12, F10, F2
      L.D    F30, -56(R1)
      ADD.D  F16, F14, F2
      L.D    F34, -64(R1)
      ADD.D  F20, F18, F2
      S.D    F4, 0(R1)
      L.D    F38, -72(R1)
      ADD.D  F24, F22, F2
      S.D    F8, -8(R1)
      ...
      S.D    F12, 16(R1)
      ...
      S.D    F16, 8(R1)
      DADDUI R1, R1, -32
      S.D    ...
      BNE    R1, R2, Loop
      S.D    ...
82
Static Vs. Dynamic
[Figure: two plots of new iterations completed vs.
cycles -- one for dynamic ILP, one for static ILP]
  • What if I doubled the number of resources in
    each processor?
  • What if I unrolled the loop and executed it on a
    dynamic ILP processor?

83
Static vs. Dynamic
  • Dynamic: because of the loop index, at most one
    iteration can start every cycle (even fewer if
    there are resource constraints); in other words,
    we have a pipeline with a throughput of one
    iteration per cycle!
  • Static: by eliminating the loop index, each
    iteration is independent → as many loops can start
    in a cycle as there are resources; however, after
    a while, we don't start any more iterations; thus,
    loop unrolling provides a brief steady state,
    where an iteration starts/finishes every cycle,
    and the rest is start-up/wind-down for each
    unrolled loop

84
Software Pipeline?!
[Figure: the staggered-iterations diagram again; one
vertical slice contains the S.D of one iteration, the
ADD.D of the next, and the L.D of the one after --
the body of the software-pipelined loop]
85
Software Pipelining
Before software pipelining:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop

After software pipelining:
Loop: S.D    F4, 16(R1)    ; store the result from two iterations ago
      ADD.D  F4, F0, F2    ; add for the previous iteration's load
      L.D    F0, 0(R1)     ; load for the current iteration
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop

  • Advantages: achieves nearly the same effect as
    loop unrolling, but without the code expansion; an
    unrolled loop may have inefficiencies at the start
    and end of each iteration, while a sw-pipelined
    loop is almost always in steady state; a
    sw-pipelined loop can also be unrolled to reduce
    loop overhead
  • Disadvantages: does not reduce loop overhead; may
    require more registers

86
Loop Dependences
  • If a loop only has dependences within an
    iteration, the loop is considered parallel →
    multiple iterations can be executed together so
    long as order within an iteration is preserved
  • If a loop has dependences across iterations, it is
    not parallel, and these dependences are referred
    to as loop-carried
  • Not all loop-carried dependences imply a lack of
    parallelism
  • Parallel loops are especially desirable in a
    multiprocessor system

87
Examples
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
No dependences

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}
S2 depends on S1 in the same iteration; S1 depends on
S1 from the previous iteration; S2 depends on S2 from
the previous iteration

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}
S1 depends on S2 from the previous iteration

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i-3] + s;         /* S1 */
S1 depends on S1 from 3 previous iterations; referred
to as a recursion; dependence distance 3: limited
parallelism
88
Constructing Parallel Loops
If loop-carried dependences are not cyclic (S1
depending on S1 is cyclic), loops can be restructured
to be parallel

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}
S1 depends on S2 from the previous iteration

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];          /* S3 */
    A[i+1] = A[i+1] + B[i+1];      /* S4 */
}
B[101] = C[100] + D[100];
S4 depends on S3 of the same iteration
89
Finding Dependences the GCD Test
  • Do A[a*i + b] and A[c*i + d] refer to the same
    element?
  • Restrict ourselves to affine array indices
    (expressible as a*i + b, where i is the loop index
    and a and b are constants)
  • Example of a non-affine index: x[y[i]]
  • For a dependence to exist, there must be two
    indices j and k, both within the loop bounds, such
    that
  • a*j + b = c*k + d
  • a*j - c*k = d - b
  • G = GCD(a, c)
  • (a*j/G - c*k/G) = (d - b)/G
  • If (d - b)/G is not an integer, the initial
    equality cannot be true

90
Predication
  • A branch within a loop can be problematic to
    schedule
  • Control dependences are a problem because of the
    need to re-fetch on a mispredict
  • For short loop bodies, control dependences can be
    converted to data dependences by using
    predicated/conditional instructions

91
Predicated or Conditional Instructions
  • The instruction has an additional operand that
    determines whether the instr completes or gets
    converted into a no-op
  • Example: lwc R1, 0(R2), R3
    (load-word-conditional)
  • will load the word at address (R2) into R1 if R3
    is non-zero
  • if R3 is zero, the instruction becomes a no-op
  • Replaces a control dependence with a data
    dependence (branches disappear); may need register
    copies for the condition or for values used by
    both directions

Original code:
    if (R1 == 0)
        R2 = R2 + R4;
    else {
        R6 = R3 + R5;
        R4 = R2 + R3;
    }

Predicated code:
    R7 = !R1
    R8 = R2
    R2 = R2 + R4    (predicated on R7)
    R6 = R3 + R5    (predicated on R1)
    R4 = R8 + R3    (predicated on R1)
92
Complications
  • Each instruction has one more input operand: more
    register ports/bypassing
  • If the branch condition is not known, the
    instruction stalls (remember, these are in-order
    processors)
  • Some implementations allow the instruction to
    continue without the branch condition and
    squash/complete later in the pipeline: wasted work
  • Increases register pressure and activity on
    functional units
  • Does not help if the branch condition takes a
    while to evaluate

93
Support for Speculation
  • In general, when we re-order instructions,
    register renaming can ensure we do not violate
    register data dependences
  • However, we need hardware support
  • to ensure that an exception is raised at the
    correct point
  • to ensure that we do not violate memory
    dependences

[Figure: an instruction stream "st ... br ... ld",
where the ld may be hoisted above the br and the st]
94
Detecting Exceptions
  • Some exceptions require that the program be
    terminated (memory protection violation), while
    other exceptions require execution to resume (page
    faults)
  • For a speculative instruction, in the latter case,
    servicing the exception only implies potential
    performance loss
  • In the former case, you want to defer servicing
    the exception until you are sure the instruction
    is not speculative
  • Note that a speculative instruction needs a
    special opcode to indicate that it is speculative

95
Program-Terminate Exceptions
  • When a speculative instruction experiences an
    exception, instead of servicing it, it writes a
    special NotAThing (NAT) value in the destination
    register
  • If a non-speculative instruction reads a NAT, it
    flags the exception and the program terminates (it
    may not be desirable that the error is caused by
    an array access, but the core dump happens two
    procedures later)
  • Alternatively, an instruction (the sentinel) in
    the speculative instruction's original location
    checks the register value and initiates recovery

96
Memory Dependence Detection
  • If a load is moved before a preceding store, we
    must ensure that the store writes to a
    non-conflicting address; else, the load has to
    re-execute
  • When the speculative load issues, it stores its
    address in a table (the Advanced Load Address
    Table, ALAT, in the IA-64)
  • If a store finds its address in the ALAT, it
    indicates that a violation occurred for that
    address
  • A special instruction (the sentinel) in the load's
    original location checks to see whether the
    address had a violation and re-executes the load
    if necessary

97
Dynamic Vs. Static ILP
  • Static ILP:
  • + The compiler finds parallelism → no
    scoreboarding → higher clock speeds and lower
    power
  • + The compiler knows what is next → better global
    schedule
  • - The compiler cannot react to dynamic events
    (cache misses)
  • - Cannot re-order instructions unless you provide
    hardware and extra instructions to detect
    violations (eats into the low complexity/power
    argument)
  • - Static branch prediction is poor → even
    statically scheduled processors use hardware
    branch predictors
  • - Building an optimizing compiler is easier said
    than done
  • A comparison of the Alpha, Pentium 4, and Itanium
    (statically scheduled IA-64 architecture) shows
    that the Itanium is not much better in terms of
    performance, clock speed, or power

98
Summary
  • Topics: scheduling, loop unrolling, software
    pipelining, predication, violations while
    re-ordering instructions
  • Static ILP is a great approach for handling
    embedded domains
  • For the high-performance domain, designers have
    added many frills, bells, and whistles to eke out
    additional performance, while compromising
    power/complexity