1
CSE 502 Graduate Computer Architecture Lec 15
MidTerm Review
  • Larry Wittie
  • Computer Science, StonyBrook University
  • http://www.cs.sunysb.edu/~cse502 and ~lw
  • Slides adapted from David Patterson, UC-Berkeley
    cs252-s06

2
Review Some Basic Unit Definitions
  • Kilobyte (KB) = 2^10 (1,024) or 10^3 (1,000 or
    thousand) Bytes (a 500-page book)
  • Megabyte (MB) = 2^20 (1,048,576) or 10^6
    (million) Bytes (1 wall of 1000 books)
  • Gigabyte (GB) = 2^30 (1,073,741,824) or 10^9
    (billion) Bytes (a 1000-wall library)
  • Terabyte (TB) = 2^40 (1.100 x 10^12) or 10^12
    (trillion) Bytes (1000 big libraries)
  • Petabyte (PB) = 2^50 (1.126 x 10^15) or 10^15
    (quadrillion) Bytes (½ hr of satellite data)
  • Exabyte (EB) = 2^60 (1.153 x 10^18) or 10^18
    (quintillion) Bytes (40 days of 1 satellite's
    data)
  • Remember that 8 bits = 1 Byte
  • millisec (ms) = 10^-3 (a thousandth of a)
    second; light goes 300 kilometers
  • microsec (µs) = 10^-6 (a millionth of a)
    second; light goes 300 meters
  • nanosec (ns) = 10^-9 (a billionth of a)
    second; light goes 30 cm, 1 foot
  • picosec (ps) = 10^-12 (a trillionth of a)
    second; light goes 300 µm, 6 hairs
  • femtosec (fs) = 10^-15 (one quadrillionth of a)
    second; light goes 300 nm, 1 cell
  • attosec (as) = 10^-18 (one quintillionth of a)
    second; light goes 0.3 nm, 1 atom

3
CSE 502 Graduate Computer Architecture Lec 1-2
- Introduction
  • Larry Wittie
  • Computer Science, StonyBrook University
  • http://www.cs.sunysb.edu/~cse502 and ~lw
  • Slides adapted from David Patterson, UC-Berkeley
    cs252-s06

4
Crossroads: Uniprocessor Performance
From Hennessy and Patterson, Computer
Architecture: A Quantitative Approach, 4th
edition, October, 2006
  • VAX: 25%/year, 1978 to 1986
  • RISC + x86: 52%/year, 1986 to 2002
  • RISC + x86: ??%/year, 2002 to 2006

5
1) Taking Advantage of Parallelism
  • Increasing throughput of server computer via
    multiple processors or multiple disks
  • Detailed HW design
  • Carry-lookahead adders use parallelism to speed
    up computing sums from linear to logarithmic in
    the number of bits per operand
  • Multiple memory banks searched in parallel in
    set-associative caches
  • Pipelining: overlap instruction execution to
    reduce the total time to complete an instruction
    sequence.
  • Not every instruction depends on its immediate
    predecessor => executing instructions
    completely/partially in parallel is possible
  • Classic 5-stage pipeline: 1) Instruction Fetch
    (Ifetch), 2) Register Read (Reg), 3) Execute
    (ALU), 4) Data Memory Access (Dmem), 5)
    Register Write (Reg)

6
Pipelined Instruction Execution Is Faster
7
Limits to Pipelining
  • Hazards prevent the next instruction from executing
    during its designated clock cycle
  • Structural hazards: attempt to use the same
    hardware to do two different things at once
  • Data hazards: instruction depends on result of
    prior instruction still in the pipeline
  • Control hazards: caused by delay between the
    fetching of instructions and decisions about
    changes in control flow (branches and jumps)

[Pipeline diagram: instruction order vs. time (clock cycles)]
8
2) The Principle of Locality => Caches
  • The Principle of Locality
  • Programs access a relatively small portion of the
    address space at any instant of time.
  • Two Different Types of Locality
  • Temporal Locality (Locality in Time) If an item
    is referenced, it will tend to be referenced
    again soon (e.g., loops, reuse)
  • Spatial Locality (Locality in Space) If an item
    is referenced, items whose addresses are close by
    tend to be referenced soon (e.g., straight-line
    code, array access)
  • For 30 years, HW has relied on locality for
    memory perf.

9
Levels of the Memory Hierarchy
Staging transfer units between levels, with typical capacity / access time / cost:
  • CPU Registers: 100s Bytes, 300-500 ps (0.3-0.5 ns);
    prog./compiler moves 1-8 byte Instr. Operands
  • L1 and L2 Cache: 10s-100s K Bytes, 1 ns - 10 ns, ~$1000s/GByte;
    cache cntlr moves 32-64 byte Blocks (L1) and 64-128 byte Blocks (L2)
  • Main Memory: G Bytes, 80 ns - 200 ns, ~$100/GByte;
    OS moves 4K-8K byte Pages
  • Disk: 10s T Bytes, 10 ms (10,000,000 ns), ~$0.25/GByte;
    user/operator moves Mbyte Files
  • Tape Vault: semi-infinite, sec-min, ~$1/GByte
Upper levels are faster; lower levels are larger.
10
3) Focus on the Common Case: Make the Frequent Case
Fast and the Rest Right
  • Common sense guides computer design
  • Since it's engineering, common sense is valuable
  • In making a design trade-off, favor the frequent
    case over the infrequent case
  • E.g., the instruction fetch and decode unit is used more
    frequently than the multiplier, so optimize it first
  • E.g., if a database server has 50 disks per
    processor, storage dependability dominates system
    dependability, so optimize it 1st
  • The frequent case is often simpler and can be done
    faster than the infrequent case
  • E.g., overflow is rare when adding 2 numbers, so
    improve performance by optimizing the more common
    case of no overflow
  • May slow down overflow, but overall performance is
    improved by optimizing for the normal case
  • What is the frequent case and how much is performance
    improved by making that case faster => Amdahl's Law

11
4) Amdahl's Law - Partial Enhancement Limits
  Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
  Best to ever achieve: Speedup_max = 1 / (1 - Fraction_enhanced)
  • Example: An I/O bound server gets a new CPU that
    is 10X faster, but 60% of server time is spent
    waiting for I/O.

A 10X faster CPU allures, but the server is only
1.6X faster.
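Plugging the slide's numbers into Amdahl's Law (the CPU-bound fraction that is sped up is 1 - 0.60 = 0.40):
  Speedup_overall = 1 / ((1 - 0.4) + 0.4/10) = 1 / 0.64 ≈ 1.56,
which is the roughly 1.6X quoted above.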
12
5) Processor performance equation
  • CPU time = Inst Count x CPI x Clock Cycle time

                   Inst Count   CPI    Clock Cycle
    Program            X
    Compiler           X        (X)
    Inst. Set          X         X
    Organization                 X         X
    Technology                             X
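As an illustrative example (numbers assumed, not from the slides): a program
executing 10^9 instructions with CPI = 1.5 on a 2 GHz clock (0.5 ns cycle)
needs CPU time = 10^9 x 1.5 x 0.5 ns = 0.75 s.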

13
What Determines a Clock Cycle?
[Diagram: combinational logic between latches/registers, clocked each cycle]
  • At transition edge(s) of each clock pulse, state
    devices sample and save their present input
    signals
  • Past: 1 cycle = time for signals to pass through ~10
    levels of gates
  • Today: determined by numerous time-of-flight
    issues + gate delays
  • clock propagation, wire lengths, drivers

14
Latency Lags Bandwidth (over the last ~20 yrs)
  • Performance Milestones (latency gain, bandwidth gain):
  • Processor: 286, 386, 486, Pentium, Pentium
    Pro, Pentium 4 (21x, 2250x)
  • Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s
    (16x, 1000x)
  • Memory Module: 16-bit plain DRAM, Page Mode DRAM,
    32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  • Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x,
    143x)

(Latency = simple operation w/o contention,
BW = best case)
15
Summary of Technology Trends
  • For disk, LAN, memory, and microprocessor,
    bandwidth improves by more than the square of
    latency improvement
  • In the time that bandwidth doubles, latency
    improves by no more than 1.2X to 1.4X
  • Lag of gains for latency vs bandwidth probably
    even larger in real systems, as bandwidth gains
    multiplied by replicated components
  • Multiple processors in a cluster or even on a
    chip
  • Multiple disks in a disk array
  • Multiple memory modules in a large memory
  • Simultaneous communication in switched local area
    networks (LANs)
  • HW and SW developers should innovate assuming
    Latency Lags Bandwidth
  • If everything improves at the same rate, then
    nothing really changes
  • When rates vary, good designs require real
    innovation

16
Define and quantify power (1/2)
  • For CMOS chips, traditional dominant energy use
    has been in switching transistors, called dynamic
    power
  • Dynamic power = 1/2 x Capacitive load x Voltage^2 x Frequency switched
  • For mobile devices, energy is a better metric
  • For a fixed task, slowing clock rate (the
    switching frequency) reduces power, but not
    energy
  • Capacitive load is a function of the number of
    transistors connected to an output and the
    technology, which determines the capacitance of
    wires and transistors
  • Dropping voltage helps both, so ICs went from 5V
    to 1V
  • To save energy & dynamic power, most CPUs now
    turn off the clock of inactive modules (e.g. Fltg.
    Pt. Arith. Unit)
  • If a 15% voltage reduction causes a 15% reduction
    in frequency, what is the impact on dynamic
    power?
  • New power / old power = 0.85^2 x 0.85 = 0.85^3 = 0.614 => ~39%
    reduction
  • Because leakage current flows even when a
    transistor is off, static power is now important too

17
Define and quantify dependability (2/3)
  • Module reliability = measure of continuous
    service accomplishment (or time to failure).
  • Mean Time To Failure (MTTF) measures Reliability
  • Failures In Time (FIT) = 1/MTTF, the failure rate
  • Usually reported as failures per billion hours of
    operation
  • Definition: Performance
  • Performance is in units of things-done per second
  • bigger is better
  • If we are primarily concerned with response time
  • "X is N times faster than Y" means
    Speedup N = the BIG Time / the little time
18
And in conclusion ...
  • Computer Science is at a crossroads from
    sequential to parallel computing
  • Salvation requires innovation in many fields,
    including computer architecture
  • An architect must track & extrapolate technology
  • Bandwidth in disks, DRAM, networks, and
    processors improves by at least as much as the
    square of the improvement in Latency
  • Quantify dynamic and static power
  • Capacitance x Voltage^2 x frequency; Energy vs.
    power
  • Quantify dependability
  • Reliability (MTTF, FIT), Availability (99.9...%)
  • Quantify and summarize performance
  • Ratios, Geometric Mean, Multiplicative Standard
    Deviation
  • Read Chapter 1, then Appendix A

19
CSE 502 Graduate Computer Architecture Lec 3-5
Performance Instruction Pipelining Review
  • Larry Wittie
  • Computer Science, StonyBrook University
  • http://www.cs.sunysb.edu/~cse502 and ~lw
  • Slides adapted from David Patterson, UC-Berkeley
    cs252-s06

20
A "Typical" RISC ISA
  • 32-bit fixed format instruction (3 formats)
  • 32 32-bit GPRs (R0 contains zero; DP values take a pair)
  • 3-address, reg-reg arithmetic instruction
  • Single address mode for load/store:
    base + displacement
  • no indirection (since it needs another memory
    access)
  • Simple branch conditions (e.g., single bit: 0 or
    not?)
  • (Delayed branch - ineffective in deep pipelines)

see SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM
PowerPC, CDC 6600, CDC 7600, Cray-1,
Cray-2, Cray-3
21
Example: MIPS Instruction Formats

Register-Register (R format): Arithmetic operations
  Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0]

Register-Immediate (I format): All immediate arithmetic ops
  Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]

Branch (I format): Moderate-relative-distance conditional branches
  Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]

Jump / Call (J format): Long-distance jumps
  Op [31:26] | target [25:0]
22
5-Stage MIPS Datapath (has pipeline latches)
Figure A.3, Page A-9
[Datapath diagram: Instruction Fetch -> Instr. Decode / Reg. Fetch ->
 Execute / Addr. Calc -> Memory Access -> Write Back; Next PC / Next SEQ PC
 muxes, register file reads RS1/RS2, sign-extended Imm, Zero? test,
 Data Memory, and the WB data path to destination register RD]
  • Data stationary control:
  • local decode for each instruction phase /
    pipeline stage

23
Code Speedup Equation for Pipelining

  Speedup = Ideal CPI x Pipeline depth / (Ideal CPI + Pipeline stall cycles per instruction)

For the simple RISC pipeline, Ideal CPI = 1, so
  Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
24
Data Hazard on Register R1 (If No
Forwarding)Figure A.6, Page A-17
[Diagram: instruction order vs. time (clock cycles)]
No forwarding is needed for the register file itself, since registers are
written in the 1st half of a cycle and read in the 2nd half.
25
Three Generic Data Hazards
  • Read After Write (RAW): InstrJ tries to read an
    operand before InstrI writes it
  • Caused by a "Dependence" (in compiler
    nomenclature). This hazard results from an
    actual need for communicating a new data value.

I: add r1,r2,r3
J: sub r4,r1,r3
26
Three Generic Data Hazards
  • Write After Read (WAR): InstrJ writes an operand
    before InstrI reads it
  • Called an anti-dependence by compiler
    writers. This results from reuse of the name
    "r1".
  • Cannot happen in MIPS 5 stage pipeline because
  • All instructions take 5 stages, and
  • Register reads are always in stage 2, and
  • Register writes are always in stage 5

27
Three Generic Data Hazards
  • Write After Write (WAW): InstrJ writes an operand
    before InstrI writes it.
  • Called an output dependence by compiler
    writers. This also results from the reuse of the name
    "r1".
  • Cannot happen in MIPS 5 stage pipeline because
  • All instructions take 5 stages, and
  • Register writes are always in stage 5
  • Will see WAR and WAW in more complicated pipes

28
Forwarding to Avoid Data Hazard
Figure A.7, Page A-19
Forwarding of ALU outputs is needed as ALU inputs 1 -
2 cycles later.
Forwarding of LW MEM outputs to SW MEM or ALU
inputs 1 or 2 cycles later.
[Diagram: instruction order vs. time (clock cycles)]
No forwarding is needed for the register file itself, since registers are
written in the 1st half of a cycle and read in the 2nd half.
29
HW Datapath Changes (in red) for Forwarding
Figure A.23, Page A-37
[Datapath diagram: extra muxes and paths added at the ID/EX, EX/MEM, and
 MEM/WR pipeline latches:
 - forward ALU or MEM (LW Data Memory) results 2 cycles back to the ALU inputs
 - forward the ALU output 1 cycle back to the ALU inputs
 - forward MEM 1 cycle to the SW MEM input]
What circuit detects and resolves this hazard?
30
Forwarding Avoids ALU-ALU and LW-SW Data Hazards
Figure A.8, Page A-20
[Diagram: instruction order vs. time (clock cycles)]
31
LW-ALU Data Hazard Even with Forwarding
Figure A.9, Page A-21
[Diagram: instruction order vs. time (clock cycles)]
No forwarding is needed for the register file itself, since registers are
written in the 1st half of a cycle and read in the 2nd half.
32
Data Hazard Even with Forwarding
(Similar to Figure A.10, Page A-21)
[Diagram: instruction order vs. time (clock cycles); the LW-ALU dependence
 forces a one-cycle Bubble before the ALU stage of the following instruction]
No forwarding is needed for the register file itself, since registers are
written in the 1st half of a cycle and read in the 2nd half.

Instruction order:
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

How is this hazard detected?
33
Software Scheduling to Avoid Load Hazards
Try producing fast code with no stalls for
  a = b + c
  d = e - f
assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc   <- stall before ADD (load-use on Rc)
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf   <- stall before SUB (load-use on Rf)
  SW  d,Rd

  • Fast code (no stalls)
  • LW Rb,b
  • LW Rc,c
  • LW Re,e
  • ADD Ra,Rb,Rc
  • LW Rf,f
  • SW a,Ra
  • SUB Rd,Re,Rf
  • SW d,Rd
Compiler optimizes for performance. Hardware
checks for safety.
34
5-Stage MIPS Datapath (has pipeline latches)
Figure A.3, Page A-9
[Datapath diagram: Instruction Fetch -> Instr. Decode / Reg. Fetch ->
 Execute / Addr. Calc -> Memory Access -> Write Back, as on the earlier slide]
  • Simple design: put branch completion in stage 4
    (Mem)

35
Control Hazard on Branch - Three Cycle Stall
[Pipeline diagram: the branch is resolved in MEM (stage 4); the three
 instructions behind it have already entered IF and ID/RF]
What do you do with the 3 instructions in
between? How do you do it? Where is the commit?
36
Branch Stall Impact if Commit in Stage 4
  • If CPI = 1 and 15% of instructions are branches,
    stalling 3 cycles => new CPI = 1 + 0.15 x 3 = 1.45!
  • Two-part solution:
  • Determine sooner whether the branch is taken or not, AND
  • Compute the taken-branch address earlier
  • MIPS branch tests if register = 0 or != 0
  • MIPS Solution:
  • Move zero_test to the ID/RF (Instr Decode / Register
    Fetch) stage (stage 2, not stage 4 = MEM)
  • Add an extra adder to calculate the new PC (Program
    Counter) in the ID/RF stage
  • Result is a 1 clock cycle penalty for branch versus
    3 when decided in MEM

37
Pipelined MIPS Datapath
Figure A.24, page A-38
[Datapath diagram: the branch-target adder and Zero? test are moved into the
 Instr. Decode / Reg. Fetch stage, so the branch target (Next SEQ PC) is
 selected one stage earlier]
The fast_branch design needs a longer stage-2
cycle time, so the clock is slower for all stages.
  • Interplay of instruction set design and cycle
    time.

38
Four Branch Hazard Alternatives
  • 1: Stall until branch direction is clear
  • 2: Predict Branch Not Taken
  • Execute the next instructions in sequence
  • PC+4 already calculated, so use it to get the next
    instruction
  • Nullify bad instructions in the pipeline if the branch is
    actually taken
  • Nullify is easier since pipeline state updates are
    late (MEM, WB)
  • 47% of MIPS branches are not taken on average
  • 3: Predict Branch Taken
  • 53% of MIPS branches are taken on average
  • But MIPS has not yet calculated the branch target address
  • MIPS still incurs a 1 cycle branch penalty
  • Other machines: branch target known before outcome

39
Four Branch Hazard Alternatives
  • 4: Delayed Branch
  • Define the branch to take place AFTER a following
    instruction
      branch instruction
      sequential successor_1
      sequential successor_2
      ........
      sequential successor_n     <- branch delay of length n
      branch target if taken
  • A 1-slot delay allows a proper decision and branch
    target address in the 5-stage pipeline
  • MIPS 1st used this (Later versions of MIPS did
    not pipeline deeper)
40
And In Conclusion Control and Pipelining
  • Quantify and summarize performance
  • Ratios, Geometric Mean, Multiplicative Standard
    Deviation
  • F&P: Benchmarks age, disks fail, single-point
    failure
  • Control via State Machines and Microprogramming
  • Just overlap tasks; easy if tasks are independent
  • Speedup ≈ Pipeline Depth if ideal CPI is 1
  • Hazards limit performance on computers
  • Structural: need more HW resources
  • Data (RAW, WAR, WAW): need forwarding, compiler
    scheduling
  • Control: delayed branch or branch
    (taken/not-taken) prediction
  • Exceptions and interrupts add complexity
  • Next time: Read Appendix C
  • No class Tuesday 9/29/09, when Monday classes
    will run.

41
CSE 502 Graduate Computer Architecture Lec 6-7
Memory Hierarchy Review
  • Larry Wittie
  • Computer Science, StonyBrook University
  • http://www.cs.sunysb.edu/~cse502 and ~lw
  • Slides adapted from David Patterson, UC-Berkeley
    cs252-s06

42
Since 1980, CPU has outpaced DRAM ...
Q. How do architects address this gap?
A. Put smaller, faster cache memories between
CPU and DRAM. Create a memory hierarchy.
[Chart: performance (1/latency) vs. year, 1980-2000, log scale 1-1000:
 CPU improves 60% per yr (2X in 1.5 yrs); DRAM improves 9% per yr (2X in 10 yrs)]
43
1977 DRAM faster than microprocessors
44
Memory Hierarchy: Apple iMac G5 ('07)

  Level             Reg     L1 Inst  L1 Data  L2      DRAM    Disk
  Size              1K      64K      32K      512K    256M    80G
  Latency (cycles)  1       3        3        11      88      ~10^7
  Latency (time)    0.6 ns  1.9 ns   1.9 ns   6.9 ns  55 ns   12 ms

Goal: Illusion of large, fast, cheap memory.
Let programs address a memory space that scales
to the disk size, at a speed that is usually
nearly as fast as register access.
45
iMac's PowerPC 970 (G5): All caches on-chip
46
The Principle of Locality
  • The Principle of Locality:
  • Programs access a relatively small portion of the
    address space at any instant of time.
  • Two Different Types of Locality:
  • Temporal Locality (Locality in Time): If an item
    is referenced, it will tend to be referenced
    again soon (e.g., loops, reuse)
  • Spatial Locality (Locality in Space): If an item
    is referenced, items whose addresses are close by
    tend to be referenced soon (e.g., straight-line
    code, array access)
  • For the last 15 years, HW has relied on locality for
    speed

Locality is a property of programs which is
exploited in machine design.
47
Programs with locality cache well ...
[Plot: memory address (one dot per access) vs. time, from Donald J. Hatfield,
 Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems
 Journal 10(3): 168-192 (1971)]
48
Memory Hierarchy Terminology
  • Hit: data appears in some block in the upper
    level (example: Block X)
  • Hit Rate: the fraction of memory accesses found
    in the upper level
  • Hit Time: time to access the upper level, which
    consists of
  • RAM access time + time to determine hit/miss
  • Miss: data needs to be retrieved from a block in
    the lower level (Block Y)
  • Miss Rate = 1 - (Hit Rate)
  • Miss Penalty: time to replace a block in the
    upper level
  • + time to deliver the block to the upper level
  • Hit Time << Miss Penalty (500 instructions on
    the 21264!)

49
Cache Measures
  • Hit rate: fraction found in that level
  • So high that we usually talk about the Miss rate
  • Miss rate fallacy: miss rate is to average memory
    access time as MIPS is to CPU performance - a
    misleading substitute for the real metric
  • Average memory-access time
    = Hit time + Miss rate x Miss penalty (ns or clocks)
  • Miss penalty: time to replace a block from the lower
    level, including time to replace in the CPU
  • replacement time: time to make upper-level room
    for the block
  • access time: time to the lower level
    = f(latency to lower level)
  • transfer time: time to transfer the block
    = f(BW between upper & lower levels)
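For example (illustrative numbers, not from the slide): with a 1 ns hit time,
a 5% miss rate, and a 20 ns miss penalty, average memory-access time
= 1 + 0.05 x 20 = 2 ns.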

50
4 Questions for Memory Hierarchy
  • Q1 Where can a block be placed in the upper
    level? (Block placement)
  • Q2 How is a block found if it is in the upper
    level? (Block identification)
  • Q3 Which block should be replaced on a miss?
    (Block replacement)
  • Q4 What happens on a write? (Write strategy)

51
Q1: Where can a block be placed in the upper
level?
  • Block 12 placed in an 8-block cache:
  • Fully associative, direct mapped, 2-way set
    associative
  • S.A. Mapping = Block Number Modulo (Number of
    Sets)

Direct Mapped: (12 mod 8) = 4
2-Way Assoc: (12 mod 4) = 0
Fully Mapped: anywhere
[Diagram: the 8-block cache and memory, with block 12 highlighted]
52
Q2: How is a block found if it is in the upper-level cache?
Bit fields in the memory address used to access the cache word:
  18b tag | 8b index (256 entries/cache) | 4b (16 wds/block) | 2b (4 Bytes/wd)
  • Bits: (One-way) Direct Mapped Cache,
    Data Capacity 16KB
    = 256 entries x 512 bits/block / 8
  • Index => cache entry:
  • location of all possible blocks
  • Tag for each block:
  • no need to check index, block-offset
  • Increasing associativity
  • shrinks index, expands tag size
[Diagram: the same address bit fields as used for Virtual Memory
 (page number / page offset) and for a Cache Block (tag / index / offsets)]
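A minimal C sketch (not from the slides) of how the 18/8/4/2 bit split above
carves a 32-bit address into cache fields; the example address and variable
names are illustrative:

  #include <stdint.h>
  #include <stdio.h>

  /* 18-bit tag | 8-bit index | 4-bit word-in-block | 2-bit byte-in-word */
  int main(void) {
      uint32_t addr     = 0x12345678u;           /* example address              */
      unsigned byte_off = addr & 0x3u;           /* bits 1:0                     */
      unsigned word_off = (addr >> 2) & 0xFu;    /* bits 5:2                     */
      unsigned index    = (addr >> 6) & 0xFFu;   /* bits 13:6, 256 cache entries */
      unsigned tag      = addr >> 14;            /* bits 31:14, 18-bit tag       */
      printf("tag=%x index=%u word=%u byte=%u\n", tag, index, word_off, byte_off);
      return 0;
  }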
53
Q3: Which block to replace after a miss? (After
start-up, the cache is nearly always full)
  • Easy if Direct Mapped (only 1 block = 1 way per
    index)
  • If Set Associative or Fully Associative, must
    choose
  • Random (Ran): easy to
    implement, but not best if only 2-way
  • LRU (Least Recently Used): LRU is best, but hard
    to implement if > 8-way
  • Also other LRU approximations better than Random
  • Miss Rates (%) for 3 Cache Sizes & Associativities:

      Associativity   2-way         4-way         8-way
      Data Size       LRU    Ran    LRU    Ran    LRU    Ran
      16 KB           5.2    5.7    4.7    5.3    4.4    5.0
      64 KB           1.9    2.0    1.5    1.7    1.4    1.5
      256 KB          1.15   1.17   1.13   1.13   1.12   1.12

  • Random picks => same low miss rate as LRU for
    large caches

54
Q4: Write policy: What happens on a write?

                                         Write-Through              Write-Back
  Policy                                 Data written to the        Write new data only to the
                                         cache block is also        cache; update the lower level
                                         written to the next        just before a written block
                                         lower-level memory         leaves the cache (is removed)
  Debugging                              Easier                     Harder
  Can read misses force writes?          No                         Yes (may slow reads)
  Do repeated writes touch lower level?  Yes, memory busier         No

Additional option: let writes to an un-cached
address allocate a new cache line
(write-allocate), else just Write-Through.
55
Write Buffers for Write-Through Caches
Q. Why a write buffer?
A. So the CPU does not stall for writes.
Q. Why a buffer, why not just one register?
A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue
for the write buffer?
A. Yes! Drain the buffer before the next read, or check
buffer addresses before a read-miss.
56
5 Basic Cache Optimizations
  • Reducing Miss Rate
  • 1. Larger Block size (reduces Compulsory, "cold",
    misses)
  • 2. Larger Cache size (reduces Capacity misses)
  • 3. Higher Associativity (reduces Conflict misses)
    (and multiprocessors have cache Coherence misses)
    (4 Cs)
  • Reducing Miss Penalty
  • 4. Multilevel Caches: total miss rate = product of the
    local miss rates at each level
  • Reducing Hit Time (minimal cache latency)
  • 5. Giving Reads Priority over Writes, since the CPU is
    waiting
  • Read completes before earlier writes in the
    write buffer
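For example (illustrative numbers, not from the slide): if the L1 local miss
rate is 5% and the L2 local miss rate is 20%, the global miss rate to memory
is 0.05 x 0.20 = 1% of all accesses.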

57
The Limits of Physical Addressing
"Simple addressing" method of archaic pre-1978
computers:
[Diagram: CPU connected directly to Memory by physical address lines A0-A31
 and data lines D0-D31]
Machine language programs had to be aware of the
machine organization
No way to prevent a program from accessing any
machine resource
58
Solution Add a Layer of Indirection
[Diagram: the CPU issues Virtual Addresses (A0-A31); an Address Translation
 unit maps them to Physical Addresses (A0-A31) of Main Memory; data (D0-D31)
 passes through unchanged]
All user programs run in a standardized virtual
address space starting at zero.
Needs fast(!) Address Translation hardware,
managed by the operating system (OS), that maps each
virtual address to a physical memory address.
Hardware supports modern OS features: Memory
protection, Address translation, Sharing
59
Three Advantages of Virtual Memory
  • Translation
  • Program can be given consistent view of memory,
    even though physical memory is scrambled (pages
    of programs in any order in physical RAM)
  • Makes multithreading reasonable (now used a lot!)
  • Only the most important part of each program
    (the Working Set) must be in physical memory at
    any one time.
  • Contiguous structures (like stacks) use only as
    much physical memory as necessary, yet still can
    grow later as needed without recopying.
  • Protection (most important now)
  • Different threads (or processes) protected from
    each other.
  • Different pages can be given special behavior
  • (Read Only, Invisible to user programs, Not
    cached).
  • Kernel and OS data are protected from access by
    User programs
  • Very important for protection from malicious
    programs
  • Sharing
  • Can map the same physical page to multiple
    users (Shared memory)

60
Details of Page Table
[Diagram: a virtual address = virtual page number + byte offset; the page
 table maps each virtual page to a physical frame; the byte offset is the
 same in the VA and PA]
  • The page table maps virtual page numbers to physical
    frames (PTE = Page Table Entry)
  • Virtual memory => treats main memory ≈ a cache for disk
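A minimal C sketch (not from the slides) of the mapping a one-level page
table performs, assuming 4 KB pages and a 32-bit address; the table and names
are illustrative (valid bits, multi-level tables, and the TLB of later slides
are omitted):

  #include <stdint.h>

  #define PAGE_BITS 12                      /* 4 KB pages: byte offset is the same in VA and PA */
  #define NUM_PAGES (1u << 20)              /* 2^20 virtual pages in a 32-bit address space     */

  static uint32_t page_table[NUM_PAGES];    /* each PTE holds the physical frame number         */

  uint32_t translate(uint32_t va) {
      uint32_t vpn    = va >> PAGE_BITS;              /* virtual page number     */
      uint32_t offset = va & ((1u << PAGE_BITS) - 1); /* byte offset within page */
      uint32_t frame  = page_table[vpn];              /* PTE lookup              */
      return (frame << PAGE_BITS) | offset;           /* physical address        */
  }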

61
All page tables may not fit in memory!
A table for 4KB pages for a 32-bit address
space (max 4GB) has 1M entries.
Each process needs its own address space tables!
The top-level table is wired (stays) in main memory.
Only a subset of the 1024 second-level tables are
in main memory; the rest are on disk or unallocated.
62
MIPS Address Translation How does it work?
[Diagram: the CPU issues Virtual Addresses (A0-A31); a Translation Look-Aside
 Buffer maps virtual page numbers to physical page numbers on the way to
 Memory; data (D0-D31) returns directly]
The TLB also contains protection bits for each virtual
address.
Fast common case: if the virtual address is in the TLB, the
process has permission to read/write it.
63
Can TLB translation overlap cache indexing?
[Diagram: the virtual address = Virtual Page Number + Page Offset; the
 physical address = Physical Page Number (used as the cache Tag) + Index +
 Byte Select; if the cache index bits come only from the page offset, the TLB
 lookup and the cache indexing can proceed in parallel]
A. Inflexibility. The size of the cache is limited by the page
size.
64
Problems With Overlapped TLB Access
Overlapped access only works so long as the
address bits used to index into the cache
do not change as the result of VA
translation. This usually limits overlapping to
small caches, large page sizes, or highly
n-way set-associative caches if you want a large-
capacity cache.
Example: suppose everything is the
same except that the cache is increased to
8 KBytes instead of 4 KB.
[Diagram: the virtual address has a 20-bit virtual page number and a 12-bit
 displacement; with the larger cache, one cache-index bit falls inside the
 virtual page number, so it is changed by VA translation but is needed for
 the cache lookup]
Solutions: go to 8 KByte page sizes;
go to a 2-way set-associative cache; or SW
guarantee VA[13] = PA[13]
[Diagram: a 2-way set-associative cache with 1K sets (10-bit index) keeps the
 index inside the page offset]
65
Can the CPU use virtual addresses for the cache?
[Diagram: the CPU sends Virtual Addresses (A0-A31) to a Virtual Cache; a
 Translation Look-Aside Buffer (TLB) converts them to Physical Addresses for
 Main Memory; data (D0-D31) flows between all three]
Only use the TLB on a cache miss!
Downside: a subtle, fatal problem. What is it?
(Aliasing)
A. Synonym problem. If two address spaces share a
physical frame, data may be in the cache twice.
Maintaining consistency is a nightmare.
66
Summary 1/3 The Cache Design Space
  • Several interacting dimensions
  • cache size
  • block size
  • associativity
  • replacement policy
  • write-through vs write-back
  • write allocation
  • The optimal choice is a compromise
  • depends on access characteristics
  • workload
  • use (I-cache, D-cache, TLB)
  • depends on technology / cost
  • Simplicity often wins

[Diagram: design-space sketches of how a "Good"-to-"Bad" metric varies from
 Less to More of Factor A / Factor B, for Cache Size, Associativity, and
 Block Size]
67
Summary 2/3 Caches
  • The Principle of Locality
  • Programs access a relatively small portion of the
    address space at any instant of time.
  • Temporal Locality Locality in Time
  • Spatial Locality Locality in Space
  • Three Major Uniprocessor Categories of Cache
    Misses
  • Compulsory Misses sad facts of life. Example
    cold start misses.
  • Capacity Misses increase cache size
  • Conflict Misses increase cache size and/or
    associativity. Nightmare Scenario ping pong
    effect!
  • Write Policy Write Through vs. Write Back
  • Today CPU time is a function of (ops, cache
    misses) vs. just f(ops) Increasing performance
    affects Compilers, Data structures, and
    Algorithms

68
Summary 3/3 TLB, Virtual Memory
  • Page tables map virtual address to physical
    address
  • TLBs are important for fast translation
  • TLB misses are significant in processor
    performance
  • funny times, as most systems cannot access all of
    2nd level cache without TLB misses!
  • Caches, TLBs, Virtual Memory all understood by
    examining how they deal with 4 questions 1)
    Where can block be placed?2) How is block found?
    3) What block is replaced on miss? 4) How are
    writes handled?
  • Today VM allows many processes to share single
    memory without having to swap all processes to
    disk today VM protection is more important than
    memory hierarchy benefits, but computers are
    still insecure
  • Short in-class openbook quiz on appendices A-C
    Chapter 1 near start of next (9/24) class. Bring
    a calculator.
  • (Please put your best email address on your
    exam.)

69
CSE 502 Graduate Computer Architecture Lec 8-10
Instruction Level Parallelism
  • Larry Wittie
  • Computer Science, StonyBrook University
  • http://www.cs.sunysb.edu/~cse502 and ~lw
  • Slides adapted from David Patterson, UC-Berkeley
    cs252-s06

70
Recall from Pipelining Review
  • Pipeline CPI = Ideal pipeline CPI + Structural
    Stalls + Data Hazard Stalls + Control Stalls
  • Ideal pipeline CPI: measure of the maximum
    performance attainable by the implementation
  • Structural hazards: HW cannot support this
    combination of instructions
  • Data hazards: instruction depends on result of
    prior instruction still in the pipeline
  • Control hazards: caused by delay between the
    fetching of instructions and decisions about
    changes in control flow (branches and jumps)

71
Instruction Level Parallelism
  • Instruction-Level Parallelism (ILP): overlap the
    execution of instructions to improve performance
  • 2 approaches to exploit ILP
  • 1) Rely on hardware to help discover and exploit
    the parallelism dynamically (e.g., Pentium 4, AMD
    Opteron, IBM Power) , and
  • 2) Rely on software technology to find
    parallelism, statically at compile-time (e.g.,
    Itanium 2)
  • Next 3 lectures on this topic

72
Instruction-Level Parallelism (ILP)
  • Basic Block (BB) ILP is quite small
  • BB: a straight-line code sequence with no
    branches in except to the entry and no branches
    out except at the exit
  • average dynamic branch frequency of 15% to 25% => only 4
    to 7 instructions execute between a pair of
    branches
  • other problem: instructions in a BB are likely to
    depend on each other
  • To obtain substantial performance enhancements,
    we must exploit ILP across multiple basic blocks
  • Simplest: loop-level parallelism to exploit
    parallelism among iterations of a loop. E.g.,
      for (j=0; j<1000; j=j+1)
          x[j+1] = x[j+1] + y[j+1];
      for (i=0; i<1000; i=i+4) {
          x[i+1] = x[i+1] + y[i+1];  x[i+2] = x[i+2] + y[i+2];
          x[i+3] = x[i+3] + y[i+3];  x[i+4] = x[i+4] + y[i+4];
      }
  • // Vector HW can make this run much faster.

73
Loop-Level Parallelism
  • Exploit loop-level parallelism to find run-time
    parallelism by unrolling loops either via
  • dynamic branch prediction by CPU hardware or
  • static loop unrolling by a compiler
  • (Other ways vectors parallelism - covered
    later)
  • Determining instruction dependence is critical to
    Loop Level Parallelism
  • If two instructions are
  • parallel, they can execute simultaneously in a
    pipeline of arbitrary depth without causing any
    stalls (assuming no structural hazards)
  • dependent, they are not parallel and must be
    executed in order, although they may often be
    partially overlapped

74
ILP and Data Dependencies, Hazards
  • HW/SW must preserve program order give the same
    results as if instructions were executed
    sequentially in the original order of the source
    program
  • Dependences are a property of programs
  • The presence of a dependence indicates the
    potential for a hazard, but the existence of an
    actual hazard and the length of any stall are
    properties of the pipeline
  • Importance of the data dependencies
  • 1) Indicate the possibility of a hazard
  • 2) Determine the order in which results must be
    calculated
  • 3) Set upper bounds on how much parallelism can
  • possibly be exploited
  • HW/SW goal exploit parallelism by preserving
    program order only where it affects the outcome
    of the program

75
Name Dependence 1 Anti-dependence
  • Name dependence when two instructions use the
    same register or memory location, called a name,
    but no data flow between the instructions using
    that name there are 2 versions of name
    dependence, which may cause WAR and WAW hazards,
    if a name such as r1 is reused
  • 1. InstrJ may wrongly write operand r1 before
    InstrI reads it
  • This anti-dependence of compiler writers may
    cause a Write After Read (WAR) hazard in a
    pipeline.
  • 2. InstrJ may wrongly write operand r1 before
    InstrI writes it
  • This output dependence of compiler writers may
    cause a Write After Write (WAW) hazard in a
    pipeline.
  • Instructions with a name dependence can execute
    simultaneously if one name is changed by a
    compiler or by register-renaming in HW.

76
Carefully Violate Control Dependencies
  • Every instruction is control dependent on some
    set of branches, and, in general, these control
    dependencies must be preserved to preserve
    program order
      if p1 { S1; }
      if p2 { S2; }
  • S1 is control dependent on proposition p1, and S2
    is control dependent on p2 but not on p1.
  • Control dependence need not always be preserved
  • Control dependences can be violated by executing
    instructions that should not have been, if doing
    so does not affect program results
  • Instead, two properties critical to program
    correctness are
  • exception behavior and
  • data flow

77
Exception Behavior Is Important
  • Preserving exception behavior => any changes in
    instruction execution order must not change how
    exceptions are raised in the program (=> no new
    exceptions)
  • Example:
      DADDU R2,R3,R4
      BEQZ  R2,L1
      LW    R1,-1(R2)
    L1: ...
  • (Assume branches are not delayed)
  • What is the problem with moving LW before BEQZ?
  • Array overflow: what if R2 = 0, so the address -1(R2)
    is out of program memory bounds?

78
Data Flow Of Values Must Be Preserved
  • Data flow: the actual flow of data values from
    instructions that produce results to those that
    consume them
  • branches make the flow dynamic (since we know the
    details only at runtime); must determine which
    instruction is the supplier of data
  • Example:
      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R5,R6
    L:  OR  R7,R1,R8
  • OR input R1 depends on which of DADDU or DSUBU?
    Must preserve data flow on execution

79
FP Loop: Where are the Hazards?
  • for (i=1000; i>0; i=i-1)
  •     x[i] = x[i] + s;
  • First translate into MIPS code:
  • - To simplify the loop end, assume 8 is the lowest
    address, F2 = s, and R1 starts with the address of
    x[1000]
  • Loop: L.D    F0,0(R1)   ; F0 = vector element x[i]
  •       ADD.D  F4,F0,F2   ; add scalar from F2 = s
  •       S.D    0(R1),F4   ; store result back into x[i]
  •       DADDUI R1,R1,-8   ; decrement pointer by 8B (DblWd)
  •       BNEZ   R1,Loop    ; branch if R1 != zero

80
FP Loop Showing Stalls
  1 Loop: L.D    F0,0(R1)   ; F0 = vector element
  2        stall
  3        ADD.D  F4,F0,F2   ; add scalar in F2
  4        stall
  5        stall
  6        S.D    0(R1),F4   ; store result
  7        DADDUI R1,R1,-8   ; decrement pointer 8B (DW)
  8        stall              ; assume cannot forward to branch
  9        BNEZ   R1,Loop    ; branch if R1 != zero

  Stalls between producing and using instruction:
    FP ALU op   -> Another FP ALU op : 3
    FP ALU op   -> Store double      : 2
    Load double -> FP ALU op         : 1
    Load double -> Store double      : 0
    Integer op  -> Integer op        : 0

  • Loop runs every 9 clock cycles. How do we reorder the code to
    minimize stalls?

81
Revised FP Loop Minimizing Stalls
Original 9-cycle-per-loop code:
  1 Loop: L.D    F0,0(R1)   ; F0 = vector element
  2        stall
  3        ADD.D  F4,F0,F2   ; add scalar in F2
  4        stall
  5        stall
  6        S.D    0(R1),F4   ; store result
  7        DADDUI R1,R1,-8   ; decrement pointer 8B
  8        stall              ; assume cannot forward to branch
  9        BNEZ   R1,Loop    ; branch if R1 != zero

Revised 7-cycle code:
  1 Loop: L.D    F0,0(R1)
  2        DADDUI R1,R1,-8
  3        ADD.D  F4,F0,F2
  4        stall
  5        stall
  6        S.D    8(R1),F4   ; altered offset 0 => 8 when moved past DADDUI
  7        BNEZ   R1,Loop

Swap DADDUI and S.D; change the address offset of S.D.

  Stalls between producing and using instruction:
    FP ALU op   -> Another FP ALU op : 3
    FP ALU op   -> Store double      : 2
    Load double -> FP ALU op         : 1
    Load double -> Store double      : 0
    Integer op  -> Integer op        : 0

  • The loop takes 7 clock cycles, but just 3 are for
    execution (L.D, ADD.D, S.D), 4 for loop overhead.
    How do we make it faster?

82
Unroll Loop Four Times (the straightforward way
gives 7 => 6.75 cycles per iteration)

  1  Loop: L.D    F0,0(R1)
  3         ADD.D  F4,F0,F2       ; 1 cycle stall after L.D
  6         S.D    0(R1),F4       ; 2 cycles stall after ADD.D; drop DADDUI & BNEZ
  7         L.D    F6,-8(R1)
  9         ADD.D  F8,F6,F2
  12        S.D    -8(R1),F8      ; drop DADDUI & BNEZ
  13        L.D    F10,-16(R1)
  15        ADD.D  F12,F10,F2
  18        S.D    -16(R1),F12    ; drop DADDUI & BNEZ
  19        L.D    F14,-24(R1)
  21        ADD.D  F16,F14,F2
  24        S.D    -24(R1),F16
  25        DADDUI R1,R1,-32      ; alter decrement to 4 x 8
  27        BNEZ   R1,LOOP        ; 1 cycle stall after DADDUI

  Four iterations take 27 clock cycles, or 6.75 per
  iteration (Assumes R1 is a multiple of 4)
  • How do we rewrite the loop to minimize stalls?
83
Unrolled Loop That Minimizes (0) Stalls

  1  Loop: L.D    F0,0(R1)
  2         L.D    F6,-8(R1)
  3         L.D    F10,-16(R1)
  4         L.D    F14,-24(R1)
  5         ADD.D  F4,F0,F2
  6         ADD.D  F8,F6,F2
  7         ADD.D  F12,F10,F2
  8         ADD.D  F16,F14,F2
  9         S.D    0(R1),F4
  10        S.D    -8(R1),F8
  11        S.D    -16(R1),F12
  12        DADDUI R1,R1,-32
  13        S.D    8(R1),F16      ; 8 - 32 = -24
  14        BNEZ   R1,LOOP

  Four iterations take 14 clock cycles, or 3.5 per loop.
  • For comparison, the unrolled-but-unscheduled version
    on the previous slide takes 27 cycles (6.75 per loop).

84
Loop Unrolling Detail - Strip Mining
  • Do not usually know upper bound of loop
  • Suppose it is n, and we would like to unroll the
    loop to make k copies of the body
  • Instead of a single unrolled loop, we generate a
    pair of consecutive loops
  • 1st executes (n mod k) times and has a body that
    is the original loop called strip mining
    of a loop
  • 2nd is the unrolled body surrounded by an outer
    loop that iterates ( n/k ) times
  • For large values of n, most of the execution time
    will be spent in the n/k unrolled loops
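A small C sketch (not from the slides) of the strip-mining transformation
just described, with k = 4; the function, array, and scalar names are
illustrative:

  /* x[i] = x[i] * s for i = 0 .. n-1, with the main loop unrolled by k = 4 */
  void scale(float *x, float s, int n) {
      int i;
      int leftover = n % 4;               /* 1st loop: executes (n mod k) times      */
      for (i = 0; i < leftover; i++)
          x[i] = x[i] * s;
      for (; i < n; i += 4) {             /* 2nd loop: unrolled body, runs n/k times */
          x[i]     = x[i]     * s;
          x[i + 1] = x[i + 1] * s;
          x[i + 2] = x[i + 2] * s;
          x[i + 3] = x[i + 3] * s;
      }
  }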

85
Five Loop Unrolling Decisions
  • Requires understanding how one instruction
    depends on another and how the instructions can
    be changed or reordered given the dependences
  • Determine if loop unrolling can be useful by
    finding that loop iterations are independent
    (except for loop maintenance code)
  • Use different registers to avoid unnecessary
    constraints forced by using the same registers
    for different computations
  • Eliminate the extra test and branch instructions
    and adjust the loop termination and iteration
    increment/decrement code
  • Determine that loads and stores in unrolled loop
    can be interchanged by observing that loads and
    stores from different iterations are independent
  • Transformation requires analyzing memory
    addresses and finding that no pairs refer to the
    same address
  • Schedule (reorder) the code, preserving any
    dependences needed to yield the same result as
    the original code

86
Three Limits to Loop Unrolling
  • Decrease in amount of overhead amortized with
    each extra unrolling
  • Amdahl's Law
  • Growth in code size
  • For larger loops, size is a concern if it
    increases the instruction cache miss rate
  • Register pressure potential shortfall in
    registers created by aggressive unrolling and
    scheduling
  • If not possible to allocate all live values to
    registers, code may lose some or all of the
    advantages of loop unrolling
  • Software pipelining is an older compiler
    technique to unroll loops systematically.
  • Loop unrolling reduces the impact of branches on
    pipelines another way is branch prediction.

87
Compiler Software-Pipelining of a V = S x V Loop
Software pipelining structure tolerates the long
latencies of FltgPt operations. l.s, mul.s, and
s.s are single-precision (SP) floating-pt. Load,
Multiply, Store. At start, r1 = addr V(0),
r2 = addr V(last) + 4, f0 = scalar SP fltg
multiplier. Instructions in the iteration box are in
reverse order, from different iterations. If there are
separate FltgPt function boxes for L, M, S, the loop can
overlap S M L triples. "Bg" marks the prologue
starting the iterated code; "En" marks the epilogue to
finish the code.

Original loop (the slide shows three consecutive iterations to
illustrate the overlap):
  Lp: l.s   f2,0(r1)
      mul.s f4,f0,f2
      s.s   f4,0(r1)
      addi  r1,r1,4
      bne   r1,r2,Lp

Software-pipelined version:
  Bg: addi  r1,r1,8
      l.s   f2,-8(r1)
      mul.s f4,f0,f2
      l.s   f2,-4(r1)
  Lp: s.s   f4,-8(r1)
      mul.s f4,f0,f2
      l.s   f2,0(r1)
      addi  r1,r1,4
      bne   r1,r2,Lp
  En: s.s   f4,-4(r1)
      mul.s f4,f0,f2
      s.s   f4,0(r1)

[Diagram: iterations 1-6 vs. time; each iteration's L(oad), M(ultiply),
 S(tore) overlap those of the neighboring iterations]
88
Dynamic (Run-Time) Branch Prediction
  • Why does prediction work?
  • Underlying algorithm has regularities
  • Data that is being operated on has regularities
  • Instruction sequences have redundancies that are
    artifacts of way that humans/compilers solve
    problems
  • Is dynamic branch prediction better than static
    prediction?
  • Seems to be
  • There are a small number of important branches in
    programs which have dynamic behavior
  • Performance = f(accuracy, cost of misprediction)
  • Branch History Table: lower bits of PC address
    index a table of 1-bit values
  • Says whether or not the branch was taken last time
  • No address check
  • Problem: a 1-bit BHT will cause two mispredictions
    per loop (the average loop runs 9 iterations
    before exit):
  • End-of-loop case, when it exits instead of
    looping as before
  • First time through the loop on the next pass through the
    code, when it predicts exit instead of looping

89
Dynamic Branch Prediction With 2 Bits
  • Solution: 2-bit scheme where we change the prediction
    only if we get a misprediction twice
  • Red: stop, not taken
  • Green: go, taken
  • Adds hysteresis to the decision-making process
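A minimal C sketch (not from the slides) of one 2-bit saturating counter with
the hysteresis described above; the encoding (0-1 = predict not taken,
2-3 = predict taken) is an illustrative assumption:

  #include <stdbool.h>

  typedef unsigned char counter2;            /* holds 0..3 */

  bool predict_taken(counter2 c) {           /* states 2 and 3 predict taken */
      return c >= 2;
  }

  counter2 update(counter2 c, bool taken) {  /* saturating move toward the actual outcome */
      if (taken  && c < 3) return c + 1;
      if (!taken && c > 0) return c - 1;
      return c;                              /* the prediction flips only after two mispredictions */
  }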

90
Branch History Table (BHT) Accuracy
  • Mispredict because either:
  • Made the wrong guess for that branch, or
  • Got the branch history of the wrong branch when indexing
    the table (same low address bits used for index).
[Chart: misprediction rates of a 4096-entry BHT on integer and
 floating-point benchmarks]
91
Correlated Branch Prediction
  • Idea: record the m most recently executed branches
    as taken or not taken, and use that pattern to
    select the proper n-bit branch history table
  • Global Branch History: m-bit shift register
    keeping the Taken/Not-Taken status of the last m branches
    anywhere in the program.
  • In general, an (m,n) predictor means: use the record of
    the last m global branches to select between 2^m local
    branch history tables, each with n-bit counters
  • Thus, the old 2-bit BHT is a (0,2) predictor
  • Each entry in the table has m n-bit predictors.
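A small C sketch (not from the slides) of an (m,n) = (2,2) predictor as
defined above: a 2-bit global history selects one of four 2-bit counters in
each table entry. The table size, PC hashing, and names are illustrative
assumptions:

  #include <stdbool.h>

  #define ENTRIES 1024u
  static unsigned char bht[ENTRIES][4];       /* 4 two-bit counters per entry, one per history value */
  static unsigned global_hist;                /* taken/not-taken status of the last 2 branches       */

  bool predict_branch(unsigned pc) {
      unsigned idx = (pc >> 2) % ENTRIES;     /* low branch-address bits index the table */
      return bht[idx][global_hist & 3u] >= 2; /* global history selects the counter      */
  }

  void train_branch(unsigned pc, bool taken) {
      unsigned idx = (pc >> 2) % ENTRIES;
      unsigned char *c = &bht[idx][global_hist & 3u];
      if (taken  && *c < 3) (*c)++;           /* 2-bit saturating update                 */
      if (!taken && *c > 0) (*c)--;
      global_hist = ((global_hist << 1) | (taken ? 1u : 0u)) & 3u;  /* shift in outcome  */
  }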

92
Correlating Branch Predictors
A (2,2) predictor with 16 sets of four 2-bit
predictions: the behavior of the most recent 2 branches
selects between the four predictions for the next branch,
updating just that prediction.
[Diagram: 4 bits of the branch address select one of 16 sets; the 2-bit
 global branch history selects one of the four 2-bit predictors in the set
 to produce the prediction]
93
Accuracy of Different Schemes
[Bar chart: frequency of mispredictions (0%-18%) on SPEC89 benchmarks
 nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, expresso, eqntott, li,
 comparing a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a
 1024-entry (2,2) BHT]
94
Tournament Predictors
  • Multilevel branch predictor
  • Use n-bit saturating counter to choose between
    predictors
  • Usual choice is between global and local
    predictors

95
Comparing Predictors (Fig. 2.8)
  • Advantage: the tournament predictor can select the
    right predictor for a particular branch
  • Particularly crucial for integer benchmarks.
  • A typical tournament predictor will select the
    global predictor almost 40% of the time for the
    SPEC Integer benchmarks and less than 15% of the
    time for the SPEC FP benchmarks

[Chart: misprediction rates vs. predictor size - about 6.8% for a local
 2-bit predictor, 3.7% for a correlating predictor, 2.6% for a tournament
 predictor]
96
Branch Target Buffers (BTB)
  • Branch target calculation is costly and stalls
    the instruction fetch one or more cycles.
  • BTB stores branch PCs and target PCs the same way
    as caches store addresses and data blocks.
  • The PC of a branch is sent to the BTB
  • When a match is found the corresponding Predicted
    target PC is returned
  • If the branch is predicted to be Taken,
    instruction fetch continues at the returned
    predicted PC
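A minimal C sketch (not from the slides) of the BTB lookup just described;
the direct-mapped organization, table size, and field names are illustrative
assumptions:

  #include <stdbool.h>
  #include <stdint.h>

  #define BTB_ENTRIES 512u

  struct btb_entry {
      uint32_t branch_pc;      /* full branch PC stored, checked like a cache tag */
      uint32_t target_pc;      /* predicted target PC                             */
      bool     valid;
  };

  static struct btb_entry btb[BTB_ENTRIES];

  /* Predicted next fetch PC: the stored target on a BTB hit when the branch
     is predicted taken, otherwise just PC + 4. */
  uint32_t next_fetch_pc(uint32_t pc, bool predicted_taken) {
      struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
      if (e->valid && e->branch_pc == pc && predicted_taken)
          return e->target_pc;
      return pc + 4;
  }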

97
Branch Target Buffers
98
Dynamic Branch Prediction Summary
  • Prediction is becoming an important part of execution
  • Branch History Table: 2 bits for loop accuracy
  • Correlation: recently executed branches are
    correlated with the next branch
  • Either different branches (GA)
  • Or different executions of the same branches (PA)
  • Tournament predictors take the insight to the next level,
    by using multiple predictors
  • usually one based on global information and one
    based on local information, and combining them
    with a selector
  • In 2006, tournament predictors using ~30K bits
    are in processors like the Power5 and Pentium 4
  • Branch Target Buffer: include branch address &
    prediction

99
Advantages of Dynamic Scheduling
  • Dynamic scheduling - hardware rearranges the
    instruction execution to reduce stalls while
    maintaining data flow and exception behavior
  • It handles cases in which dependences were
    unknown at compile time
  • it allows the processor to tolerate unpredictable
    delays such as cache misses, by executing other
    code while waiting for the miss to resolve
  • It allows code compiled for one pipeline to run
    efficiently on a different pipeline
  • It simplifies the compiler
  • Hardware speculation, a technique with
    significant performance advantages, builds on
    dynamic scheduling (next lecture)

100
HW Schemes Instruction Parallelism
  • Key idea: Allow instructions behind a stall to
    proceed:
      DIVD F0,F2,F4
      ADDD F10,F0,F8
      SUBD F12,F8,F14
  • Enables out-of-order execution and allows
    out-of-order completion (e.g., SUBD before slow
    DIVD)
  • In a dynamically scheduled pipeline, all
    instructions still pass through issue stage in
    order (in-order issue)
  • Will distinguish when an instruction begins
    execution from when it completes execution
    between the two times, the instruction is in
    execution
  • Note Dynamic scheduling creates WAR and WAW
    hazards and makes exception handling harder

101
Dynamic Scheduling Step 1
  • Simple pipeline had only one stage to check both
    structural and data hazards Instruction Decode
    (ID), also called Instruction Issue
  • Split the ID pipe stage of simple 5-stage
    pipeline into 2 stages to make a 6-stage
    pipeline
  • Issue: decode instructions, check for structural
    hazards
  • Read operands: wait until no data hazards, then
    read operands

102
A Dynamic Algorithm: Tomasulo's
  • For the IBM 360/91 (before caches!)
  • => Long memory latency
  • Goal: High performance without special compilers
  • The small number of floating point registers (4 in
    the 360) prevented interesting compiler scheduling of
    operations
  • This led Tomasulo to try to figure out how
    to effectively get more registers: renaming in
    hardware!
  • Why Study a 1966 Computer?
  • The descendants of this have flourished!
  • Alpha 21264, Pentium 4, AMD Opteron, Power 5, ...

103
Tomasulo Algorithm
  • Control buffers distributed with Function Units
    (FU)
  • FU buffers called reservation stations have
    pending operands
  • Registers in instructions replaced by values or
    pointers to reservation stations(RSs) called
    register renaming
  • Renaming avoids WAR, WAW hazards
  • More reservation stations than registers, so can
    do optimizations compilers cannot do without
    access to the additional internal registers, the
    reservation stations.
  • Results from RSs as leave each FU sent to waiting
    RSs, not through registers, but over a Common
    Data Bus that broadcasts results to all FUs and
    their waiting RSs
  • Avoids RAW hazards by executing an instruction
    only when its operands are available
  • Load and Stores treated as FUs with RSs as well
  • Integer instructions can go past branches
    (predict taken), allowing FP ops beyond basic
    block in FP queue

104
Tomasulo Organization
[Diagram: the FP Ops Queue and FP Registers feed instructions and operands to
 reservation stations (Add1-Add3 for the FP adders, Mult1-Mult2 for the FP
 multipliers); Load Buffers (Load1-Load6) bring data from memory and Store
 Buffers send results to memory; all results are broadcast on the Common Data
 Bus (CDB) to the waiting reservation stations, registers, and store buffers]
105
Three Stages of Tomasulo Algorithm
  • 1. Issue: get an instruction from the FP Op Queue
  • If a reservation station is free (no structural
    hazard), control issues the instr & sends operands
    (renames registers).
  • 2. Execute: operate on operands (EX)
  • When both operands are ready, start to execute; if
    not ready, watch the Common Data Bus for the result
  • 3. Write result: finish execution (WB)
  • Write on the Common Data Bus to all awaiting units;
    mark the reservation station available
  • Normal data bus: data + destination ("go to" bus)
  • Common data bus: data + source ("come from" bus)
  • 64 bits of data + 4 bits of Functional Unit
    source address
  • Write if it matches the expected Functional Unit
    (produces result)
  • Does the broadcast
  • Example speeds after start of EX: 2 clocks for LD;
    3 for Fl. Pt. +,-; 10 for x; 40 for /.

106
Reservation Station Components
  • Op: Operation to perform in the unit (e.g., + or -)
  • Vj, Vk: Value of the source operands for Op
  • Each store buffer has a V field, for the result
    to be stored
  • Qj, Qk: Reservation stations producing the source
    registers (value to be written)
  • Note: Qj,Qk = 0 => ready
  • Store buffers only have Qi, for the RS producing the
    result
  • Busy: Indicates the reservation station or FU is
    busy
  • Register result status: Indicates which
    functional unit will write each register, if one
    exists. Blank when no pending instructions
    will write that register.
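A C struct sketch (not from the slides) of the per-reservation-station and
per-register state listed above; the types and field names are illustrative,
not the 360/91's actual encoding:

  #include <stdbool.h>

  struct reservation_station {
      bool   busy;      /* this RS (and its function unit) has a pending instruction     */
      int    op;        /* operation to perform on the operands                          */
      double Vj, Vk;    /* values of the source operands, once they are known            */
      int    Qj, Qk;    /* RS numbers that will produce each source; 0 => value is ready */
  };

  struct register_status {
      int Qi;           /* RS number that will write this register; 0 => no pending writer */
  };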

107
Why Can Tomasulo Overlap Iterations Of Loops?
  • Register renaming
  • Multiple iterations use different physical
    destinations for registers (dynamic loop
    unrolling).
  • Reservation stations
  • Permit instruction issue to advance past integer
    control flow operations
  • Also buffer old values of registers - totally
    avoiding the WAR stall
  • Other perspective: Tomasulo builds a data-flow
    dependency graph on the fly

108
Tomasulo's Scheme: Two Major Advantages
  • Distribution of the hazard detection logic
  • distributed reservation stations and the CDBus
  • If multiple instructions waiting on single result
    and each instruction has other operand, then
    instructions can be released simultaneously by
    broadcast on CDB
  • If a centralized register file were used, the
    units would have to read their results from the
    registers when register buses are available
  • Elimination of stalls for WAW and WAR hazards

109
Tomasulo Drawbacks
  • Complexity
  • delays of 360/91, MIPS 10000, Alpha 21264, IBM
    PPC 620 (in CAAQA 2/e, before it was in
    silicon!)
  • Many associative stores (CDB) at high speed
  • Performance limited by Common Data Bus
  • Each CDB must go to multiple ...