1
CSE 502 Graduate Computer Architecture Lec 15
MidTerm Review
  • Larry Wittie
  • Computer Science, StonyBrook University
  • http://www.cs.sunysb.edu/~cse502 and ~lw
  • Slides adapted from David Patterson, UC-Berkeley
    cs252-s06

2
Review Some Basic Unit Definitions
  • Kilobyte (KB) = 2^10 (1,024) or 10^3 (1,000 or
    thousand) Bytes (a 500-page book)
  • Megabyte (MB) = 2^20 (1,048,576) or 10^6
    (million) Bytes (1 wall of 1000 books)
  • Gigabyte (GB) = 2^30 (1,073,741,824) or 10^9
    (billion) Bytes (a 1000-wall library)
  • Terabyte (TB) = 2^40 (1.100 x 10^12) or 10^12
    (trillion) Bytes (1000 big libraries)
  • Petabyte (PB) = 2^50 (1.126 x 10^15) or 10^15
    (quadrillion) Bytes (½ hr of satellite data)
  • Exabyte (EB) = 2^60 (1.153 x 10^18) or 10^18
    (quintillion) Bytes (40 days of 1 satellite's
    data)
  • Remember that 8 bits = 1 Byte
  • millisec (ms) = 10^-3 (a thousandth of a)
    second; light goes 300 kilometers
  • microsec (µs) = 10^-6 (a millionth of a)
    second; light goes 300 meters
  • nanosec (ns) = 10^-9 (a billionth of a)
    second; light goes 30 cm, 1 foot
  • picosec (ps) = 10^-12 (a trillionth of a)
    second; light goes 300 µm, 6 hairs
  • femtosec (fs) = 10^-15 (one quadrillionth of a)
    second; light goes 300 nm, 1 cell
  • attosec (as) = 10^-18 (one quintillionth of a)
    second; light goes 0.3 nm, 1 atom

3
CSE 502 Graduate Computer Architecture Lec 1-2
- Introduction
  • Larry Wittie
  • Computer Science, StonyBrook University
  • http://www.cs.sunysb.edu/~cse502 and ~lw
  • Slides adapted from David Patterson, UC-Berkeley
    cs252-s06

4
Crossroads: Uniprocessor Performance
From Hennessy and Patterson, Computer
Architecture: A Quantitative Approach, 4th
edition, October, 2006
  • VAX: 25%/year, 1978 to 1986
  • RISC + x86: 52%/year, 1986 to 2002
  • RISC + x86: ??%/year, 2002 to 2006

5
1) Taking Advantage of Parallelism
  • Increasing throughput of server computer via
    multiple processors or multiple disks
  • Detailed HW design
  • Carry-lookahead adders use parallelism to speed
    up computing sums from linear to logarithmic in
    the number of bits per operand
  • Multiple memory banks searched in parallel in
    set-associative caches
  • Pipelining: overlap instruction execution to
    reduce the total time to complete an instruction
    sequence.
  • Not every instruction depends on its immediate
    predecessor => executing instructions
    completely/partially in parallel is possible
  • Classic 5-stage pipeline: 1) Instruction Fetch
    (Ifetch), 2) Register Read (Reg), 3) Execute
    (ALU), 4) Data Memory Access (Dmem), 5)
    Register Write (Reg)

6
Pipelined Instruction Execution Is Faster
7
Limits to Pipelining
  • Hazards prevent the next instruction from executing
    during its designated clock cycle
  • Structural hazards: attempt to use the same
    hardware to do two different things at once
  • Data hazards: instruction depends on result of
    prior instruction still in the pipeline
  • Control hazards: caused by delay between the
    fetching of instructions and decisions about
    changes in control flow (branches and jumps)

[Pipeline diagram: instruction order vs. time (clock cycles)]
8
2) The Principle of Locality => Caches
  • The Principle of Locality
  • Programs access a relatively small portion of the
    address space at any instant of time.
  • Two Different Types of Locality
  • Temporal Locality (Locality in Time) If an item
    is referenced, it will tend to be referenced
    again soon (e.g., loops, reuse)
  • Spatial Locality (Locality in Space) If an item
    is referenced, items whose addresses are close by
    tend to be referenced soon (e.g., straight-line
    code, array access)
  • For 30 years, HW has relied on locality for
    memory perf.

9
Levels of the Memory Hierarchy
Staging transfer units between levels, with typical capacity / access time / cost:
  • CPU Registers: 100s Bytes, 300-500 ps (0.3-0.5 ns);
    prog./compiler moves 1-8 byte Instr. Operands
  • L1 and L2 Cache: 10s-100s K Bytes, 1 ns - 10 ns, ~$1000s/GByte;
    cache cntlr moves 32-64 byte Blocks (L1) and 64-128 byte Blocks (L2)
  • Main Memory: G Bytes, 80 ns - 200 ns, ~$100/GByte;
    OS moves 4K-8K byte Pages
  • Disk: 10s T Bytes, 10 ms (10,000,000 ns), ~$0.25/GByte;
    user/operator moves Mbyte Files
  • Tape Vault: semi-infinite, sec-min, ~$1/GByte
Upper levels are faster; lower levels are larger.
10
3) Focus on the Common Case: Make the Frequent Case
Fast and the Rest Right
  • Common sense guides computer design
  • Since it's engineering, common sense is valuable
  • In making a design trade-off, favor the frequent
    case over the infrequent case
  • E.g., the instruction fetch and decode unit is used more
    frequently than the multiplier, so optimize it first
  • E.g., if a database server has 50 disks per
    processor, storage dependability dominates system
    dependability, so optimize it 1st
  • The frequent case is often simpler and can be done
    faster than the infrequent case
  • E.g., overflow is rare when adding 2 numbers, so
    improve performance by optimizing the more common
    case of no overflow
  • May slow down overflow, but overall performance is
    improved by optimizing for the normal case
  • What is the frequent case and how much is performance
    improved by making that case faster => Amdahl's Law

11
4) Amdahl's Law - Partial Enhancement Limits
  Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
  Best to ever achieve: Speedup_max = 1 / (1 - Fraction_enhanced)
  • Example: An I/O bound server gets a new CPU that
    is 10X faster, but 60% of server time is spent
    waiting for I/O.

A 10X faster CPU allures, but the server is only
1.6X faster.
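Plugging the slide's numbers into Amdahl's Law (the CPU-bound fraction that is sped up is 1 - 0.60 = 0.40):
  Speedup_overall = 1 / ((1 - 0.4) + 0.4/10) = 1 / 0.64 ≈ 1.56,
which is the roughly 1.6X quoted above.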
12
5) Processor performance equation
  • CPU time = Inst Count x CPI x Clock Cycle time

                   Inst Count   CPI    Clock Cycle
    Program            X
    Compiler           X        (X)
    Inst. Set          X         X
    Organization                 X         X
    Technology                             X
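As an illustrative example (numbers assumed, not from the slides): a program
executing 10^9 instructions with CPI = 1.5 on a 2 GHz clock (0.5 ns cycle)
needs CPU time = 10^9 x 1.5 x 0.5 ns = 0.75 s.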

13
What Determines a Clock Cycle?
[Diagram: combinational logic between latches/registers, clocked each cycle]
  • At transition edge(s) of each clock pulse, state
    devices sample and save their present input
    signals
  • Past: 1 cycle = time for signals to pass through ~10
    levels of gates
  • Today: determined by numerous time-of-flight
    issues + gate delays
  • clock propagation, wire lengths, drivers

14
Latency Lags Bandwidth (over the last ~20 yrs)
  • Performance Milestones (latency gain, bandwidth gain):
  • Processor: 286, 386, 486, Pentium, Pentium
    Pro, Pentium 4 (21x, 2250x)
  • Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s
    (16x, 1000x)
  • Memory Module: 16-bit plain DRAM, Page Mode DRAM,
    32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  • Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x,
    143x)

(Latency = simple operation w/o contention,
BW = best case)
15
Summary of Technology Trends
  • For disk, LAN, memory, and microprocessor,
    bandwidth improves by more than the square of
    latency improvement
  • In the time that bandwidth doubles, latency
    improves by no more than 1.2X to 1.4X
  • Lag of gains for latency vs bandwidth probably
    even larger in real systems, as bandwidth gains
    multiplied by replicated components
  • Multiple processors in a cluster or even on a
    chip
  • Multiple disks in a disk array
  • Multiple memory modules in a large memory
  • Simultaneous communication in switched local area
    networks (LANs)
  • HW and SW developers should innovate assuming
    Latency Lags Bandwidth
  • If everything improves at the same rate, then
    nothing really changes
  • When rates vary, good designs require real
    innovation

16
Define and quantify power (1/2)
  • For CMOS chips, traditional dominant energy use
    has been in switching transistors, called dynamic
    power
  • Dynamic power = 1/2 x Capacitive load x Voltage^2 x Frequency switched
  • For mobile devices, energy is a better metric
  • For a fixed task, slowing clock rate (the
    switching frequency) reduces power, but not
    energy
  • Capacitive load is a function of the number of
    transistors connected to an output and the
    technology, which determines the capacitance of
    wires and transistors
  • Dropping voltage helps both, so ICs went from 5V
    to 1V
  • To save energy & dynamic power, most CPUs now
    turn off the clock of inactive modules (e.g. Fltg.
    Pt. Arith. Unit)
  • If a 15% voltage reduction causes a 15% reduction
    in frequency, what is the impact on dynamic
    power?
  • New power / old power = 0.85^2 x 0.85 = 0.85^3 = 0.614 => ~39%
    reduction
  • Because leakage current flows even when a
    transistor is off, static power is now important too

17
Define and quantify dependability (2/3)
  • Module reliability = measure of continuous
    service accomplishment (or time to failure).
  • Mean Time To Failure (MTTF) measures Reliability
  • Failures In Time (FIT) = 1/MTTF, the failure rate
  • Usually reported as failures per billion hours of
    operation
  • Definition: Performance
  • Performance is in units of things-done per second
  • bigger is better
  • If we are primarily concerned with response time
  • "X is N times faster than Y" means
    Speedup N = the BIG Time / the little time
18
And in conclusion ...
  • Computer Science is at a crossroads from
    sequential to parallel computing
  • Salvation requires innovation in many fields,
    including computer architecture
  • An architect must track & extrapolate technology
  • Bandwidth in disks, DRAM, networks, and
    processors improves by at least as much as the
    square of the improvement in Latency
  • Quantify dynamic and static power
  • Capacitance x Voltage^2 x frequency; Energy vs.
    power
  • Quantify dependability
  • Reliability (MTTF, FIT), Availability (99.9...%)
  • Quantify and summarize performance
  • Ratios, Geometric Mean, Multiplicative Standard
    Deviation
  • Read Chapter 1, then Appendix A

19
CSE 502 Graduate Computer Architecture Lec 3-5
Performance Instruction Pipelining Review
  • Larry Wittie
  • Computer Science, StonyBrook University
  • http://www.cs.sunysb.edu/~cse502 and ~lw
  • Slides adapted from David Patterson, UC-Berkeley
    cs252-s06

20
A "Typical" RISC ISA
  • 32-bit fixed format instruction (3 formats)
  • 32 32-bit GPRs (R0 contains zero; DP values take a pair)
  • 3-address, reg-reg arithmetic instruction
  • Single address mode for load/store:
    base + displacement
  • no indirection (since it needs another memory
    access)
  • Simple branch conditions (e.g., single bit: 0 or
    not?)
  • (Delayed branch - ineffective in deep pipelines)

see SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM
PowerPC, CDC 6600, CDC 7600, Cray-1,
Cray-2, Cray-3
21
Example: MIPS Instruction Formats

Register-Register (R format): Arithmetic operations
  Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0]

Register-Immediate (I format): All immediate arithmetic ops
  Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]

Branch (I format): Moderate-relative-distance conditional branches
  Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]

Jump / Call (J format): Long-distance jumps
  Op [31:26] | target [25:0]
22
5-Stage MIPS Datapath (has pipeline latches)
Figure A.3, Page A-9
[Datapath diagram: Instruction Fetch -> Instr. Decode / Reg. Fetch ->
 Execute / Addr. Calc -> Memory Access -> Write Back; Next PC / Next SEQ PC
 muxes, register file reads RS1/RS2, sign-extended Imm, Zero? test,
 Data Memory, and the WB data path to destination register RD]
  • Data stationary control:
  • local decode for each instruction phase /
    pipeline stage

23
Code Speedup Equation for Pipelining

  Speedup = Ideal CPI x Pipeline depth / (Ideal CPI + Pipeline stall cycles per instruction)

For the simple RISC pipeline, Ideal CPI = 1, so
  Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
24
Data Hazard on Register R1 (If No
Forwarding)Figure A.6, Page A-17
[Diagram: instruction order vs. time (clock cycles)]
No forwarding is needed for the register file itself, since registers are
written in the 1st half of a cycle and read in the 2nd half.
25
Three Generic Data Hazards
  • Read After Write (RAW): InstrJ tries to read an
    operand before InstrI writes it
  • Caused by a "Dependence" (in compiler
    nomenclature). This hazard results from an
    actual need for communicating a new data value.

I: add r1,r2,r3
J: sub r4,r1,r3
26
Three Generic Data Hazards
  • Write After Read (WAR): InstrJ writes an operand
    before InstrI reads it
  • Called an anti-dependence by compiler
    writers. This results from reuse of the name
    "r1".
  • Cannot happen in MIPS 5 stage pipeline because
  • All instructions take 5 stages, and
  • Register reads are always in stage 2, and
  • Register writes are always in stage 5

27
Three Generic Data Hazards
  • Write After Write (WAW): InstrJ writes an operand
    before InstrI writes it.
  • Called an output dependence by compiler
    writers. This also results from the reuse of the name
    "r1".
  • Cannot happen in MIPS 5 stage pipeline because
  • All instructions take 5 stages, and
  • Register writes are always in stage 5
  • Will see WAR and WAW in more complicated pipes

28
Forwarding to Avoid Data Hazard
Figure A.7, Page A-19
Forwarding of ALU outputs is needed as ALU inputs 1 -
2 cycles later.
Forwarding of LW MEM outputs to SW MEM or ALU
inputs 1 or 2 cycles later.
[Diagram: instruction order vs. time (clock cycles)]
No forwarding is needed for the register file itself, since registers are
written in the 1st half of a cycle and read in the 2nd half.
29
HW Datapath Changes (in red) for Forwarding
Figure A.23, Page A-37
[Datapath diagram: extra muxes and paths added at the ID/EX, EX/MEM, and
 MEM/WR pipeline latches:
 - forward ALU or MEM (LW Data Memory) results 2 cycles back to the ALU inputs
 - forward the ALU output 1 cycle back to the ALU inputs
 - forward MEM 1 cycle to the SW MEM input]
What circuit detects and resolves this hazard?
30
Forwarding Avoids ALU-ALU and LW-SW Data Hazards
Figure A.8, Page A-20
[Diagram: instruction order vs. time (clock cycles)]
31
LW-ALU Data Hazard Even with Forwarding
Figure A.9, Page A-21
[Diagram: instruction order vs. time (clock cycles)]
No forwarding is needed for the register file itself, since registers are
written in the 1st half of a cycle and read in the 2nd half.
32
Data Hazard Even with Forwarding
(Similar to Figure A.10, Page A-21)
[Diagram: instruction order vs. time (clock cycles); the LW-ALU dependence
 forces a one-cycle Bubble before the ALU stage of the following instruction]
No forwarding is needed for the register file itself, since registers are
written in the 1st half of a cycle and read in the 2nd half.

Instruction order:
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

How is this hazard detected?
33
Software Scheduling to Avoid Load Hazards
Try producing fast code with no stalls for
  a = b + c
  d = e - f
assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc   <- stall before ADD (load-use on Rc)
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf   <- stall before SUB (load-use on Rf)
  SW  d,Rd

  • Fast code (no stalls)
  • LW Rb,b
  • LW Rc,c
  • LW Re,e
  • ADD Ra,Rb,Rc
  • LW Rf,f
  • SW a,Ra
  • SUB Rd,Re,Rf
  • SW d,Rd
Compiler optimizes for performance. Hardware
checks for safety.
34
5-Stage MIPS Datapath (has pipeline latches)
Figure A.3, Page A-9
[Datapath diagram: Instruction Fetch -> Instr. Decode / Reg. Fetch ->
 Execute / Addr. Calc -> Memory Access -> Write Back, as on the earlier slide]
  • Simple design: put branch completion in stage 4
    (Mem)

35
Control Hazard on Branch - Three Cycle Stall
[Pipeline diagram: the branch is resolved in MEM (stage 4); the three
 instructions behind it have already entered IF and ID/RF]
What do you do with the 3 instructions in
between? How do you do it? Where is the commit?
36
Branch Stall Impact if Commit in Stage 4
  • If CPI = 1 and 15% of instructions are branches,
    stalling 3 cycles => new CPI = 1 + 0.15 x 3 = 1.45!
  • Two-part solution:
  • Determine sooner whether the branch is taken or not, AND
  • Compute the taken-branch address earlier
  • MIPS branch tests if register = 0 or != 0
  • MIPS Solution:
  • Move zero_test to the ID/RF (Instr Decode / Register
    Fetch) stage (stage 2, not stage 4 = MEM)
  • Add an extra adder to calculate the new PC (Program
    Counter) in the ID/RF stage
  • Result is a 1 clock cycle penalty for branch versus
    3 when decided in MEM

37
Pipelined MIPS Datapath
Figure A.24, page A-38
[Datapath diagram: the branch-target adder and Zero? test are moved into the
 Instr. Decode / Reg. Fetch stage, so the branch target (Next SEQ PC) is
 selected one stage earlier]
The fast_branch design needs a longer stage-2
cycle time, so the clock is slower for all stages.
  • Interplay of instruction set design and cycle
    time.

38
Four Branch Hazard Alternatives
  • 1: Stall until branch direction is clear
  • 2: Predict Branch Not Taken
  • Execute the next instructions in sequence
  • PC+4 already calculated, so use it to get the next
    instruction
  • Nullify bad instructions in the pipeline if the branch is
    actually taken
  • Nullify is easier since pipeline state updates are
    late (MEM, WB)
  • 47% of MIPS branches are not taken on average
  • 3: Predict Branch Taken
  • 53% of MIPS branches are taken on average
  • But MIPS has not yet calculated the branch target address
  • MIPS still incurs a 1 cycle branch penalty
  • Other machines: branch target known before outcome

39
Four Branch Hazard Alternatives
  • 4: Delayed Branch
  • Define the branch to take place AFTER a following
    instruction
      branch instruction
      sequential successor_1
      sequential successor_2
      ........
      sequential successor_n     <- branch delay of length n
      branch target if taken
  • A 1-slot delay allows a proper decision and branch
    target address in the 5-stage pipeline
  • MIPS 1st used this (Later versions of MIPS did
    not pipeline deeper)
40
And In Conclusion Control and Pipelining
  • Quantify and summarize performance
  • Ratios, Geometric Mean, Multiplicative Standard
    Deviation
  • F&P: Benchmarks age, disks fail, single-point
    failure
  • Control via State Machines and Microprogramming
  • Just overlap tasks; easy if tasks are independent
  • Speedup ≈ Pipeline Depth if ideal CPI is 1
  • Hazards limit performance on computers
  • Structural: need more HW resources
  • Data (RAW, WAR, WAW): need forwarding, compiler
    scheduling
  • Control: delayed branch or branch
    (taken/not-taken) prediction
  • Exceptions and interrupts add complexity
  • Next time: Read Appendix C
  • No class Tuesday 9/29/09, when Monday classes
    will run.

41
CSE 502 Graduate Computer Architecture Lec 6-7
Memory Hierarchy Review
  • Larry Wittie
  • Computer Science, StonyBrook University
  • http://www.cs.sunysb.edu/~cse502 and ~lw
  • Slides adapted from David Patterson, UC-Berkeley
    cs252-s06

42
Since 1980, CPU has outpaced DRAM ...
Q. How do architects address this gap?
A. Put smaller, faster cache memories between
CPU and DRAM. Create a memory hierarchy.
[Chart: performance (1/latency) vs. year, 1980-2000, log scale 1-1000:
 CPU improves 60% per yr (2X in 1.5 yrs); DRAM improves 9% per yr (2X in 10 yrs)]
43
1977 DRAM faster than microprocessors
44
Memory Hierarchy: Apple iMac G5 ('07)

  Level             Reg     L1 Inst  L1 Data  L2      DRAM    Disk
  Size              1K      64K      32K      512K    256M    80G
  Latency (cycles)  1       3        3        11      88      ~10^7
  Latency (time)    0.6 ns  1.9 ns   1.9 ns   6.9 ns  55 ns   12 ms

Goal: Illusion of large, fast, cheap memory.
Let programs address a memory space that scales
to the disk size, at a speed that is usually
nearly as fast as register access.
45
iMac's PowerPC 970 (G5): All caches on-chip
46
The Principle of Locality
  • The Principle of Locality:
  • Programs access a relatively small portion of the
    address space at any instant of time.
  • Two Different Types of Locality:
  • Temporal Locality (Locality in Time): If an item
    is referenced, it will tend to be referenced
    again soon (e.g., loops, reuse)
  • Spatial Locality (Locality in Space): If an item
    is referenced, items whose addresses are close by
    tend to be referenced soon (e.g., straight-line
    code, array access)
  • For the last 15 years, HW has relied on locality for
    speed

Locality is a property of programs which is
exploited in machine design.
47
Programs with locality cache well ...
[Plot: memory address (one dot per access) vs. time, from Donald J. Hatfield,
 Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems
 Journal 10(3): 168-192 (1971)]
48
Memory Hierarchy Terminology
  • Hit: data appears in some block in the upper
    level (example: Block X)
  • Hit Rate: the fraction of memory accesses found
    in the upper level
  • Hit Time: time to access the upper level, which
    consists of
  • RAM access time + time to determine hit/miss
  • Miss: data needs to be retrieved from a block in
    the lower level (Block Y)
  • Miss Rate = 1 - (Hit Rate)
  • Miss Penalty: time to replace a block in the
    upper level
  • + time to deliver the block to the upper level
  • Hit Time << Miss Penalty (500 instructions on
    the 21264!)

49
Cache Measures
  • Hit rate: fraction found in that level
  • So high that we usually talk about the Miss rate
  • Miss rate fallacy: miss rate is to average memory
    access time as MIPS is to CPU performance - a
    misleading substitute for the real metric
  • Average memory-access time
    = Hit time + Miss rate x Miss penalty (ns or clocks)
  • Miss penalty: time to replace a block from the lower
    level, including time to replace in the CPU
  • replacement time: time to make upper-level room
    for the block
  • access time: time to the lower level
    = f(latency to lower level)
  • transfer time: time to transfer the block
    = f(BW between upper & lower levels)
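For example (illustrative numbers, not from the slide): with a 1 ns hit time,
a 5% miss rate, and a 20 ns miss penalty, average memory-access time
= 1 + 0.05 x 20 = 2 ns.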

50
4 Questions for Memory Hierarchy
  • Q1 Where can a block be placed in the upper
    level? (Block placement)
  • Q2 How is a block found if it is in the upper
    level? (Block identification)
  • Q3 Which block should be replaced on a miss?
    (Block replacement)
  • Q4 What happens on a write? (Write strategy)

51
Q1: Where can a block be placed in the upper
level?
  • Block 12 placed in an 8-block cache:
  • Fully associative, direct mapped, 2-way set
    associative
  • S.A. Mapping = Block Number Modulo (Number of
    Sets)

Direct Mapped: (12 mod 8) = 4
2-Way Assoc: (12 mod 4) = 0
Fully Mapped: anywhere
[Diagram: the 8-block cache and memory, with block 12 highlighted]
52
Q2: How is a block found if it is in the upper-level cache?
Bit fields in the memory address used to access the cache word:
  18b tag | 8b index (256 entries/cache) | 4b (16 wds/block) | 2b (4 Bytes/wd)
  • Bits: (One-way) Direct Mapped Cache,
    Data Capacity 16KB
    = 256 entries x 512 bits/block / 8
  • Index => cache entry:
  • location of all possible blocks
  • Tag for each block:
  • no need to check index, block-offset
  • Increasing associativity
  • shrinks index, expands tag size
[Diagram: the same address bit fields as used for Virtual Memory
 (page number / page offset) and for a Cache Block (tag / index / offsets)]
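A minimal C sketch (not from the slides) of how the 18/8/4/2 bit split above
carves a 32-bit address into cache fields; the example address and variable
names are illustrative:

  #include <stdint.h>
  #include <stdio.h>

  /* 18-bit tag | 8-bit index | 4-bit word-in-block | 2-bit byte-in-word */
  int main(void) {
      uint32_t addr     = 0x12345678u;           /* example address              */
      unsigned byte_off = addr & 0x3u;           /* bits 1:0                     */
      unsigned word_off = (addr >> 2) & 0xFu;    /* bits 5:2                     */
      unsigned index    = (addr >> 6) & 0xFFu;   /* bits 13:6, 256 cache entries */
      unsigned tag      = addr >> 14;            /* bits 31:14, 18-bit tag       */
      printf("tag=%x index=%u word=%u byte=%u\n", tag, index, word_off, byte_off);
      return 0;
  }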
53
Q3: Which block to replace after a miss? (After
start-up, the cache is nearly always full)
  • Easy if Direct Mapped (only 1 block = 1 way per
    index)
  • If Set Associative or Fully Associative, must
    choose
  • Random (Ran): easy to
    implement, but not best if only 2-way
  • LRU (Least Recently Used): LRU is best, but hard
    to implement if > 8-way
  • Also other LRU approximations better than Random
  • Miss Rates (%) for 3 Cache Sizes & Associativities:

      Associativity   2-way         4-way         8-way
      Data Size       LRU    Ran    LRU    Ran    LRU    Ran
      16 KB           5.2    5.7    4.7    5.3    4.4    5.0
      64 KB           1.9    2.0    1.5    1.7    1.4    1.5
      256 KB          1.15   1.17   1.13   1.13   1.12   1.12

  • Random picks => same low miss rate as LRU for
    large caches

54
Q4: Write policy: What happens on a write?

                                         Write-Through              Write-Back
  Policy                                 Data written to the        Write new data only to the
                                         cache block is also        cache; update the lower level
                                         written to the next        just before a written block
                                         lower-level memory         leaves the cache (is removed)
  Debugging                              Easier                     Harder
  Can read misses force writes?          No                         Yes (may slow reads)
  Do repeated writes touch lower level?  Yes, memory busier         No

Additional option: let writes to an un-cached
address allocate a new cache line
(write-allocate), else just Write-Through.
55
Write Buffers for Write-Through Caches
Q. Why a write buffer?
A. So the CPU does not stall for writes.
Q. Why a buffer, why not just one register?
A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue
for the write buffer?
A. Yes! Drain the buffer before the next read, or check
buffer addresses before a read-miss.
56
5 Basic Cache Optimizations
  • Reducing Miss Rate
  • 1. Larger Block size (reduces Compulsory, "cold",
    misses)
  • 2. Larger Cache size (reduces Capacity misses)
  • 3. Higher Associativity (reduces Conflict misses)
    (and multiprocessors have cache Coherence misses)
    (4 Cs)
  • Reducing Miss Penalty
  • 4. Multilevel Caches: total miss rate = product of the
    local miss rates at each level
  • Reducing Hit Time (minimal cache latency)
  • 5. Giving Reads Priority over Writes, since the CPU is
    waiting
  • Read completes before earlier writes in the
    write buffer
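For example (illustrative numbers, not from the slide): if the L1 local miss
rate is 5% and the L2 local miss rate is 20%, the global miss rate to memory
is 0.05 x 0.20 = 1% of all accesses.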

57
The Limits of Physical Addressing
"Simple addressing" method of archaic pre-1978
computers:
[Diagram: CPU connected directly to Memory by physical address lines A0-A31
 and data lines D0-D31]
Machine language programs had to be aware of the
machine organization
No way to prevent a program from accessing any
machine resource
58
Solution Add a Layer of Indirection
[Diagram: the CPU issues Virtual Addresses (A0-A31); an Address Translation
 unit maps them to Physical Addresses (A0-A31) of Main Memory; data (D0-D31)
 passes through unchanged]
All user programs run in a standardized virtual
address space starting at zero.
Needs fast(!) Address Translation hardware,
managed by the operating system (OS), that maps each
virtual address to a physical memory address.
Hardware supports modern OS features: Memory
protection, Address translation, Sharing
59
Three Advantages of Virtual Memory
  • Translation
  • Program can be given consistent view of memory,
    even though physical memory is scrambled (pages
    of programs in any order in physical RAM)
  • Makes multithreading reasonable (now used a lot!)
  • Only the most important part of each program
    (the Working Set) must be in physical memory at
    any one time.
  • Contiguous structures (like stacks) use only as
    much physical memory as necessary, yet still can
    grow later as needed without recopying.
  • Protection (most important now)
  • Different threads (or processes) protected from
    each other.
  • Different pages can be given special behavior
  • (Read Only, Invisible to user programs, Not
    cached).
  • Kernel and OS data are protected from access by
    User programs
  • Very important for protection from malicious
    programs
  • Sharing
  • Can map the same physical page to multiple
    users (Shared memory)

60
Details of Page Table
[Diagram: a virtual address = virtual page number + byte offset; the page
 table maps each virtual page to a physical frame; the byte offset is the
 same in the VA and PA]
  • The page table maps virtual page numbers to physical
    frames (PTE = Page Table Entry)
  • Virtual memory => treats main memory ≈ a cache for disk
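A minimal C sketch (not from the slides) of the mapping a one-level page
table performs, assuming 4 KB pages and a 32-bit address; the table and names
are illustrative (valid bits, multi-level tables, and the TLB of later slides
are omitted):

  #include <stdint.h>

  #define PAGE_BITS 12                      /* 4 KB pages: byte offset is the same in VA and PA */
  #define NUM_PAGES (1u << 20)              /* 2^20 virtual pages in a 32-bit address space     */

  static uint32_t page_table[NUM_PAGES];    /* each PTE holds the physical frame number         */

  uint32_t translate(uint32_t va) {
      uint32_t vpn    = va >> PAGE_BITS;              /* virtual page number     */
      uint32_t offset = va & ((1u << PAGE_BITS) - 1); /* byte offset within page */
      uint32_t frame  = page_table[vpn];              /* PTE lookup              */
      return (frame << PAGE_BITS) | offset;           /* physical address        */
  }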

61
All page tables may not fit in memory!
A table for 4KB pages for a 32-bit address
space (max 4GB) has 1M entries.
Each process needs its own address space tables!
The top-level table is wired (stays) in main memory.
Only a subset of the 1024 second-level tables are
in main memory; the rest are on disk or unallocated.
62
MIPS Address Translation How does it work?
[Diagram: the CPU issues Virtual Addresses (A0-A31); a Translation Look-Aside
 Buffer maps virtual page numbers to physical page numbers on the way to
 Memory; data (D0-D31) returns directly]
The TLB also contains protection bits for each virtual
address.
Fast common case: if the virtual address is in the TLB, the
process has permission to read/write it.
63
Can TLB translation overlap cache indexing?
[Diagram: the virtual address = Virtual Page Number + Page Offset; the
 physical address = Physical Page Number (used as the cache Tag) + Index +
 Byte Select; if the cache index bits come only from the page offset, the TLB
 lookup and the cache indexing can proceed in parallel]
A. Inflexibility. The size of the cache is limited by the page
size.
64
Problems With Overlapped TLB Access
Overlapped access only works so long as the
address bits used to index into the cache
do not change as the result of VA
translation. This usually limits overlapping to
small caches, large page sizes, or highly
n-way set-associative caches if you want a large-
capacity cache.
Example: suppose everything is the
same except that the cache is increased to
8 KBytes instead of 4 KB.
[Diagram: the virtual address has a 20-bit virtual page number and a 12-bit
 displacement; with the larger cache, one cache-index bit falls inside the
 virtual page number, so it is changed by VA translation but is needed for
 the cache lookup]
Solutions: go to 8 KByte page sizes;
go to a 2-way set-associative cache; or SW
guarantee VA[13] = PA[13]
[Diagram: a 2-way set-associative cache with 1K sets (10-bit index) keeps the
 index inside the page offset]
65
Can the CPU use virtual addresses for the cache?
[Diagram: the CPU sends Virtual Addresses (A0-A31) to a Virtual Cache; a
 Translation Look-Aside Buffer (TLB) converts them to Physical Addresses for
 Main Memory; data (D0-D31) flows between all three]
Only use the TLB on a cache miss!
Downside: a subtle, fatal problem. What is it?
(Aliasing)
A. Synonym problem. If two address spaces share a
physical frame, data may be in the cache twice.
Maintaining consistency is a nightmare.
66
Summary 1/3 The Cache Design Space
  • Several interacting dimensions
  • cache size
  • block size
  • associativity
  • replacement policy
  • write-through vs write-back
  • write allocation
  • The optimal choice is a compromise
  • depends on access characteristics
  • workload
  • use (I-cache, D-cache, TLB)
  • depends on technology / cost
  • Simplicity often wins

[Diagram: design-space sketches of how a "Good"-to-"Bad" metric varies from
 Less to More of Factor A / Factor B, for Cache Size, Associativity, and
 Block Size]
67
Summary 2/3 Caches
  • The Principle of Locality
  • Programs access a relatively small portion of the
    address space at any instant of time.
  • Temporal Locality Locality in Time
  • Spatial Locality Locality in Space
  • Three Major Uniprocessor Categories of Cache
    Misses
  • Compulsory Misses sad facts of life. Example
    cold start misses.
  • Capacity Misses increase cache size
  • Conflict Misses increase cache size and/or
    associativity. Nightmare Scenario ping pong
    effect!
  • Write Policy Write Through vs. Write Back
  • Today CPU time is a function of (ops, cache
    misses) vs. just f(ops) Increasing performance
    affects Compilers, Data structures, and
    Algorithms

68
Summary 3/3 TLB, Virtual Memory
  • Page tables map virtual address to physical
    address
  • TLBs are important for fast translation
  • TLB misses are significant in processor
    performance
  • funny times, as most systems cannot access all of
    2nd level cache without TLB misses!
  • Caches, TLBs, Virtual Memory all understood by
    examining how they deal with 4 questions 1)
    Where can block be placed?2) How is block found?
    3) What block is replaced on miss? 4) How are
    writes handled?
  • Today VM allows many processes to share single
    memory without having to swap all processes to
    disk today VM protection is more important than
    memory hierarchy benefits, but computers are
    still insecure
  • Short in-class openbook quiz on appendices A-C
    Chapter 1 near start of next (9/24) class. Bring
    a calculator.
  • (Please put your best email address on your
    exam.)

69
CSE 502 Graduate Computer Architecture Lec 8-10
Instruction Level Parallelism
  • Larry Wittie
  • Computer Science, StonyBrook University
  • http://www.cs.sunysb.edu/~cse502 and ~lw
  • Slides adapted from David Patterson, UC-Berkeley
    cs252-s06

70
Recall from Pipelining Review
  • Pipeline CPI = Ideal pipeline CPI + Structural
    Stalls + Data Hazard Stalls + Control Stalls
  • Ideal pipeline CPI: measure of the maximum
    performance attainable by the implementation
  • Structural hazards: HW cannot support this
    combination of instructions
  • Data hazards: instruction depends on result of
    prior instruction still in the pipeline
  • Control hazards: caused by delay between the
    fetching of instructions and decisions about
    changes in control flow (branches and jumps)

71
Instruction Level Parallelism
  • Instruction-Level Parallelism (ILP): overlap the
    execution of instructions to improve performance
  • 2 approaches to exploit ILP
  • 1) Rely on hardware to help discover and exploit
    the parallelism dynamically (e.g., Pentium 4, AMD
    Opteron, IBM Power) , and
  • 2) Rely on software technology to find
    parallelism, statically at compile-time (e.g.,
    Itanium 2)
  • Next 3 lectures on this topic

72
Instruction-Level Parallelism (ILP)
  • Basic Block (BB) ILP is quite small
  • BB: a straight-line code sequence with no
    branches in except to the entry and no branches
    out except at the exit
  • average dynamic branch frequency of 15% to 25% => only 4
    to 7 instructions execute between a pair of
    branches
  • other problem: instructions in a BB are likely to
    depend on each other
  • To obtain substantial performance enhancements,
    we must exploit ILP across multiple basic blocks
  • Simplest: loop-level parallelism to exploit
    parallelism among iterations of a loop. E.g.,
      for (j=0; j<1000; j=j+1)
          x[j+1] = x[j+1] + y[j+1];
      for (i=0; i<1000; i=i+4) {
          x[i+1] = x[i+1] + y[i+1];  x[i+2] = x[i+2] + y[i+2];
          x[i+3] = x[i+3] + y[i+3];  x[i+4] = x[i+4] + y[i+4];
      }
  • // Vector HW can make this run much faster.

73
Loop-Level Parallelism
  • Exploit loop-level parallelism to find run-time
    parallelism by unrolling loops either via
  • dynamic branch prediction by CPU hardware or
  • static loop unrolling by a compiler
  • (Other ways vectors parallelism - covered
    later)
  • Determining instruction dependence is critical to
    Loop Level Parallelism
  • If two instructions are
  • parallel, they can execute simultaneously in a
    pipeline of arbitrary depth without causing any
    stalls (assuming no structural hazards)
  • dependent, they are not parallel and must be
    executed in order, although they may often be
    partially overlapped

74
ILP and Data Dependencies, Hazards
  • HW/SW must preserve program order give the same
    results as if instructions were executed
    sequentially in the original order of the source
    program
  • Dependences are a property of programs
  • The presence of a dependence indicates the
    potential for a hazard, but the existence of an
    actual hazard and the length of any stall are
    properties of the pipeline
  • Importance of the data dependencies
  • 1) Indicate the possibility of a hazard
  • 2) Determine the order in which results must be
    calculated
  • 3) Set upper bounds on how much parallelism can
  • possibly be exploited
  • HW/SW goal exploit parallelism by preserving
    program order only where it affects the outcome
    of the program

75
Name Dependence 1 Anti-dependence
  • Name dependence when two instructions use the
    same register or memory location, called a name,
    but no data flow between the instructions using
    that name there are 2 versions of name
    dependence, which may cause WAR and WAW hazards,
    if a name such as r1 is reused
  • 1. InstrJ may wrongly write operand r1 before
    InstrI reads it
  • This anti-dependence of compiler writers may
    cause a Write After Read (WAR) hazard in a
    pipeline.
  • 2. InstrJ may wrongly write operand r1 before
    InstrI writes it
  • This output dependence of compiler writers may
    cause a Write After Write (WAW) hazard in a
    pipeline.
  • Instructions with a name dependence can execute
    simultaneously if one name is changed by a
    compiler or by register-renaming in HW.

76
Carefully Violate Control Dependencies
  • Every instruction is control dependent on some
    set of branches, and, in general, these control
    dependencies must be preserved to preserve
    program order
      if p1 { S1; }
      if p2 { S2; }
  • S1 is control dependent on proposition p1, and S2
    is control dependent on p2 but not on p1.
  • Control dependence need not always be preserved
  • Control dependences can be violated by executing
    instructions that should not have been, if doing
    so does not affect program results
  • Instead, two properties critical to program
    correctness are
  • exception behavior and
  • data flow

77
Exception Behavior Is Important
  • Preserving exception behavior => any changes in
    instruction execution order must not change how
    exceptions are raised in the program (=> no new
    exceptions)
  • Example:
      DADDU R2,R3,R4
      BEQZ  R2,L1
      LW    R1,-1(R2)
    L1: ...
  • (Assume branches are not delayed)
  • What is the problem with moving LW before BEQZ?
  • Array overflow: what if R2 = 0, so the address -1(R2)
    is out of program memory bounds?

78
Data Flow Of Values Must Be Preserved
  • Data flow: the actual flow of data values from
    instructions that produce results to those that
    consume them
  • branches make the flow dynamic (since we know the
    details only at runtime); must determine which
    instruction is the supplier of data
  • Example:
      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R5,R6
    L:  OR  R7,R1,R8
  • OR input R1 depends on which of DADDU or DSUBU?
    Must preserve data flow on execution

79
FP Loop: Where are the Hazards?
  • for (i=1000; i>0; i=i-1)
  •     x[i] = x[i] + s;
  • First translate into MIPS code:
  • - To simplify the loop end, assume 8 is the lowest
    address, F2 = s, and R1 starts with the address of
    x[1000]
  • Loop: L.D    F0,0(R1)   ; F0 = vector element x[i]
  •       ADD.D  F4,F0,F2   ; add scalar from F2 = s
  •       S.D    0(R1),F4   ; store result back into x[i]
  •       DADDUI R1,R1,-8   ; decrement pointer by 8B (DblWd)
  •       BNEZ   R1,Loop    ; branch if R1 != zero

80
FP Loop Showing Stalls
  1 Loop: L.D    F0,0(R1)   ; F0 = vector element
  2        stall
  3        ADD.D  F4,F0,F2   ; add scalar in F2
  4        stall
  5        stall
  6        S.D    0(R1),F4   ; store result
  7        DADDUI R1,R1,-8   ; decrement pointer 8B (DW)
  8        stall              ; assume cannot forward to branch
  9        BNEZ   R1,Loop    ; branch if R1 != zero

  Stalls between producing and using instruction:
    FP ALU op   -> Another FP ALU op : 3
    FP ALU op   -> Store double      : 2
    Load double -> FP ALU op         : 1
    Load double -> Store double      : 0
    Integer op  -> Integer op        : 0

  • Loop runs every 9 clock cycles. How do we reorder the code to
    minimize stalls?

81
Revised FP Loop Minimizing Stalls
Original 9-cycle-per-loop code:
  1 Loop: L.D    F0,0(R1)   ; F0 = vector element
  2        stall
  3        ADD.D  F4,F0,F2   ; add scalar in F2
  4        stall
  5        stall
  6        S.D    0(R1),F4   ; store result
  7        DADDUI R1,R1,-8   ; decrement pointer 8B
  8        stall              ; assume cannot forward to branch
  9        BNEZ   R1,Loop    ; branch if R1 != zero

Revised 7-cycle code:
  1 Loop: L.D    F0,0(R1)
  2        DADDUI R1,R1,-8
  3        ADD.D  F4,F0,F2
  4        stall
  5        stall
  6        S.D    8(R1),F4   ; altered offset 0 => 8 when moved past DADDUI
  7        BNEZ   R1,Loop

Swap DADDUI and S.D; change the address offset of S.D.

  Stalls between producing and using instruction:
    FP ALU op   -> Another FP ALU op : 3
    FP ALU op   -> Store double      : 2
    Load double -> FP ALU op         : 1
    Load double -> Store double      : 0
    Integer op  -> Integer op        : 0

  • The loop takes 7 clock cycles, but just 3 are for
    execution (L.D, ADD.D, S.D), 4 for loop overhead.
    How do we make it faster?

82
Unroll Loop Four Times (the straightforward way
gives 7 => 6.75 cycles per iteration)

  1  Loop: L.D    F0,0(R1)
  3         ADD.D  F4,F0,F2       ; 1 cycle stall after L.D
  6         S.D    0(R1),F4       ; 2 cycles stall after ADD.D; drop DADDUI & BNEZ
  7         L.D    F6,-8(R1)
  9         ADD.D  F8,F6,F2
  12        S.D    -8(R1),F8      ; drop DADDUI & BNEZ
  13        L.D    F10,-16(R1)
  15        ADD.D  F12,F10,F2
  18        S.D    -16(R1),F12    ; drop DADDUI & BNEZ
  19        L.D    F14,-24(R1)
  21        ADD.D  F16,F14,F2
  24        S.D    -24(R1),F16
  25        DADDUI R1,R1,-32      ; alter decrement to 4 x 8
  27        BNEZ   R1,LOOP        ; 1 cycle stall after DADDUI

  Four iterations take 27 clock cycles, or 6.75 per
  iteration (Assumes R1 is a multiple of 4)
  • How do we rewrite the loop to minimize stalls?
83
Unrolled Loop That Minimizes (0) Stalls

  1  Loop: L.D    F0,0(R1)
  2         L.D    F6,-8(R1)
  3         L.D    F10,-16(R1)
  4         L.D    F14,-24(R1)
  5         ADD.D  F4,F0,F2
  6         ADD.D  F8,F6,F2
  7         ADD.D  F12,F10,F2
  8         ADD.D  F16,F14,F2
  9         S.D    0(R1),F4
  10        S.D    -8(R1),F8
  11        S.D    -16(R1),F12
  12        DADDUI R1,R1,-32
  13        S.D    8(R1),F16      ; 8 - 32 = -24
  14        BNEZ   R1,LOOP

  Four iterations take 14 clock cycles, or 3.5 per loop.
  • For comparison, the unrolled-but-unscheduled version
    on the previous slide takes 27 cycles (6.75 per loop).

84
Loop Unrolling Detail - Strip Mining
  • Do not usually know upper bound of loop
  • Suppose it is n, and we would like to unroll the
    loop to make k copies of the body
  • Instead of a single unrolled loop, we generate a
    pair of consecutive loops
  • 1st executes (n mod k) times and has a body that
    is the original loop called strip mining
    of a loop
  • 2nd is the unrolled body surrounded by an outer
    loop that iterates ( n/k ) times
  • For large values of n, most of the execution time
    will be spent in the n/k unrolled loops
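A small C sketch (not from the slides) of the strip-mining transformation
just described, with k = 4; the function, array, and scalar names are
illustrative:

  /* x[i] = x[i] * s for i = 0 .. n-1, with the main loop unrolled by k = 4 */
  void scale(float *x, float s, int n) {
      int i;
      int leftover = n % 4;               /* 1st loop: executes (n mod k) times      */
      for (i = 0; i < leftover; i++)
          x[i] = x[i] * s;
      for (; i < n; i += 4) {             /* 2nd loop: unrolled body, runs n/k times */
          x[i]     = x[i]     * s;
          x[i + 1] = x[i + 1] * s;
          x[i + 2] = x[i + 2] * s;
          x[i + 3] = x[i + 3] * s;
      }
  }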

85
Five Loop Unrolling Decisions
  • Requires understanding how one instruction
    depends on another and how the instructions can
    be changed or reordered given the dependences
  • Determine if loop unrolling can be useful by
    finding that loop iterations are independent
    (except for loop maintenance code)
  • Use different registers to avoid unnecessary
    constraints forced by using the same registers
    for different computations
  • Eliminate the extra test and branch instructions
    and adjust the loop termination and iteration
    increment/decrement code
  • Determine that loads and stores in unrolled loop
    can be interchanged by observing that loads and
    stores from different iterations are independent
  • Transformation requires analyzing memory
    addresses and finding that no pairs refer to the
    same address
  • Schedule (reorder) the code, preserving any
    dependences needed to yield the same result as
    the original code

86
Three Limits to Loop Unrolling
  • Decrease in amount of overhead amortized with
    each extra unrolling
  • Amdahl's Law
  • Growth in code size
  • For larger loops, size is a concern if it
    increases the instruction cache miss rate
  • Register pressure potential shortfall in
    registers created by aggressive unrolling and
    scheduling
  • If not possible to allocate all live values to
    registers, code may lose some or all of the
    advantages of loop unrolling
  • Software pipelining is an older compiler
    technique to unroll loops systematically.
  • Loop unrolling reduces the impact of branches on
    pipelines another way is branch prediction.

87
Compiler Software-Pipelining of a V = S x V Loop
Software pipelining structure tolerates the long
latencies of FltgPt operations. l.s, mul.s, and
s.s are single-precision (SP) floating-pt. Load,
Multiply, Store. At start, r1 = addr V(0),
r2 = addr V(last) + 4, f0 = scalar SP fltg
multiplier. Instructions in the iteration box are in
reverse order, from different iterations. If there are
separate FltgPt function boxes for L, M, S, the loop can
overlap S M L triples. "Bg" marks the prologue
starting the iterated code; "En" marks the epilogue to
finish the code.

Original loop (the slide shows three consecutive iterations to
illustrate the overlap):
  Lp: l.s   f2,0(r1)
      mul.s f4,f0,f2
      s.s   f4,0(r1)
      addi  r1,r1,4
      bne   r1,r2,Lp

Software-pipelined version:
  Bg: addi  r1,r1,8
      l.s   f2,-8(r1)
      mul.s f4,f0,f2
      l.s   f2,-4(r1)
  Lp: s.s   f4,-8(r1)
      mul.s f4,f0,f2
      l.s   f2,0(r1)
      addi  r1,r1,4
      bne   r1,r2,Lp
  En: s.s   f4,-4(r1)
      mul.s f4,f0,f2
      s.s   f4,0(r1)

[Diagram: iterations 1-6 vs. time; each iteration's L(oad), M(ultiply),
 S(tore) overlap those of the neighboring iterations]
88
Dynamic (Run-Time) Branch Prediction
  • Why does prediction work?
  • Underlying algorithm has regularities
  • Data that is being operated on has regularities
  • Instruction sequences have redundancies that are
    artifacts of way that humans/compilers solve
    problems
  • Is dynamic branch prediction better than static
    prediction?
  • Seems to be
  • There are a small number of important branches in
    programs which have dynamic behavior
  • Performance = f(accuracy, cost of misprediction)
  • Branch History Table: lower bits of PC address
    index a table of 1-bit values
  • Says whether or not the branch was taken last time
  • No address check
  • Problem: a 1-bit BHT will cause two mispredictions
    per loop (the average loop runs 9 iterations
    before exit):
  • End-of-loop case, when it exits instead of
    looping as before
  • First time through the loop on the next pass through the
    code, when it predicts exit instead of looping

89
Dynamic Branch Prediction With 2 Bits
  • Solution: 2-bit scheme where we change the prediction
    only if we get a misprediction twice
  • Red: stop, not taken
  • Green: go, taken
  • Adds hysteresis to the decision-making process
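A minimal C sketch (not from the slides) of one 2-bit saturating counter with
the hysteresis described above; the encoding (0-1 = predict not taken,
2-3 = predict taken) is an illustrative assumption:

  #include <stdbool.h>

  typedef unsigned char counter2;            /* holds 0..3 */

  bool predict_taken(counter2 c) {           /* states 2 and 3 predict taken */
      return c >= 2;
  }

  counter2 update(counter2 c, bool taken) {  /* saturating move toward the actual outcome */
      if (taken  && c < 3) return c + 1;
      if (!taken && c > 0) return c - 1;
      return c;                              /* the prediction flips only after two mispredictions */
  }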

90
Branch History Table (BHT) Accuracy
  • Mispredict because either:
  • Made the wrong guess for that branch, or
  • Got the branch history of the wrong branch when indexing
    the table (same low address bits used for index).
[Chart: misprediction rates of a 4096-entry BHT on integer and
 floating-point benchmarks]
91
Correlated Branch Prediction
  • Idea: record the m most recently executed branches
    as taken or not taken, and use that pattern to
    select the proper n-bit branch history table
  • Global Branch History: m-bit shift register
    keeping the Taken/Not-Taken status of the last m branches
    anywhere in the program.
  • In general, an (m,n) predictor means: use the record of
    the last m global branches to select between 2^m local
    branch history tables, each with n-bit counters
  • Thus, the old 2-bit BHT is a (0,2) predictor
  • Each entry in the table has m n-bit predictors.
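A small C sketch (not from the slides) of an (m,n) = (2,2) predictor as
defined above: a 2-bit global history selects one of four 2-bit counters in
each table entry. The table size, PC hashing, and names are illustrative
assumptions:

  #include <stdbool.h>

  #define ENTRIES 1024u
  static unsigned char bht[ENTRIES][4];       /* 4 two-bit counters per entry, one per history value */
  static unsigned global_hist;                /* taken/not-taken status of the last 2 branches       */

  bool predict_branch(unsigned pc) {
      unsigned idx = (pc >> 2) % ENTRIES;     /* low branch-address bits index the table */
      return bht[idx][global_hist & 3u] >= 2; /* global history selects the counter      */
  }

  void train_branch(unsigned pc, bool taken) {
      unsigned idx = (pc >> 2) % ENTRIES;
      unsigned char *c = &bht[idx][global_hist & 3u];
      if (taken  && *c < 3) (*c)++;           /* 2-bit saturating update                 */
      if (!taken && *c > 0) (*c)--;
      global_hist = ((global_hist << 1) | (taken ? 1u : 0u)) & 3u;  /* shift in outcome  */
  }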

92
Correlating Branch Predictors
A (2,2) predictor with 16 sets of four 2-bit
predictions: the behavior of the most recent 2 branches
selects between the four predictions for the next branch,
updating just that prediction.
[Diagram: 4 bits of the branch address select one of 16 sets; the 2-bit
 global branch history selects one of the four 2-bit predictors in the set
 to produce the prediction]
93
Accuracy of Different Schemes
[Bar chart: frequency of mispredictions (0%-18%) on SPEC89 benchmarks
 nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, expresso, eqntott, li,
 comparing a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a
 1024-entry (2,2) BHT]
94
Tournament Predictors
  • Multilevel branch predictor
  • Use n-bit saturating counter to choose between
    predictors
  • Usual choice is between global and local
    predictors

95
Comparing Predictors (Fig. 2.8)
  • Advantage: the tournament predictor can select the
    right predictor for a particular branch
  • Particularly crucial for integer benchmarks.
  • A typical tournament predictor will select the
    global predictor almost 40% of the time for the
    SPEC Integer benchmarks and less than 15% of the
    time for the SPEC FP benchmarks

[Chart: misprediction rates vs. predictor size - about 6.8% for a local
 2-bit predictor, 3.7% for a correlating predictor, 2.6% for a tournament
 predictor]
96
Branch Target Buffers (BTB)
  • Branch target calculation is costly and stalls
    the instruction fetch one or more cycles.
  • BTB stores branch PCs and target PCs the same way
    as caches store addresses and data blocks.
  • The PC of a branch is sent to the BTB
  • When a match is found the corresponding Predicted
    target PC is returned
  • If the branch is predicted to be Taken,
    instruction fetch continues at the returned
    predicted PC
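A minimal C sketch (not from the slides) of the BTB lookup just described;
the direct-mapped organization, table size, and field names are illustrative
assumptions:

  #include <stdbool.h>
  #include <stdint.h>

  #define BTB_ENTRIES 512u

  struct btb_entry {
      uint32_t branch_pc;      /* full branch PC stored, checked like a cache tag */
      uint32_t target_pc;      /* predicted target PC                             */
      bool     valid;
  };

  static struct btb_entry btb[BTB_ENTRIES];

  /* Predicted next fetch PC: the stored target on a BTB hit when the branch
     is predicted taken, otherwise just PC + 4. */
  uint32_t next_fetch_pc(uint32_t pc, bool predicted_taken) {
      struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
      if (e->valid && e->branch_pc == pc && predicted_taken)
          return e->target_pc;
      return pc + 4;
  }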

97
Branch Target Buffers
98
Dynamic Branch Prediction Summary
  • Prediction is becoming an important part of execution
  • Branch History Table: 2 bits for loop accuracy
  • Correlation: recently executed branches are
    correlated with the next branch
  • Either different branches (GA)
  • Or different executions of the same branches (PA)
  • Tournament predictors take the insight to the next level,
    by using multiple predictors
  • usually one based on global information and one
    based on local information, and combining them
    with a selector
  • In 2006, tournament predictors using ~30K bits
    are in processors like the Power5 and Pentium 4
  • Branch Target Buffer: include branch address &
    prediction

99
Advantages of Dynamic Scheduling
  • Dynamic scheduling - hardware rearranges the
    instruction execution to reduce stalls while
    maintaining data flow and exception behavior
  • It handles cases in which dependences were
    unknown at compile time
  • it allows the processor to tolerate unpredictable
    delays such as cache misses, by executing other
    code while waiting for the miss to resolve
  • It allows code compiled for one pipeline to run
    efficiently on a different pipeline
  • It simplifies the compiler
  • Hardware speculation, a technique with
    significant performance advantages, builds on
    dynamic scheduling (next lecture)

100
HW Schemes Instruction Parallelism
  • Key idea: Allow instructions behind a stall to
    proceed:
      DIVD F0,F2,F4
      ADDD F10,F0,F8
      SUBD F12,F8,F14
  • Enables out-of-order execution and allows
    out-of-order completion (e.g., SUBD before slow
    DIVD)
  • In a dynamically scheduled pipeline, all
    instructions still pass through issue stage in
    order (in-order issue)
  • Will distinguish when an instruction begins
    execution from when it completes execution
    between the two times, the instruction is in
    execution
  • Note Dynamic scheduling creates WAR and WAW
    hazards and makes exception handling harder

101
Dynamic Scheduling Step 1
  • Simple pipeline had only one stage to check both
    structural and data hazards Instruction Decode
    (ID), also called Instruction Issue
  • Split the ID pipe stage of simple 5-stage
    pipeline into 2 stages to make a 6-stage
    pipeline
  • Issue: decode instructions, check for structural
    hazards
  • Read operands: wait until no data hazards, then
    read operands

102
A Dynamic Algorithm: Tomasulo's
  • For the IBM 360/91 (before caches!)
  • => Long memory latency
  • Goal: High performance without special compilers
  • The small number of floating point registers (4 in
    the 360) prevented interesting compiler scheduling of
    operations
  • This led Tomasulo to try to figure out how
    to effectively get more registers: renaming in
    hardware!
  • Why Study a 1966 Computer?
  • The descendants of this have flourished!
  • Alpha 21264, Pentium 4, AMD Opteron, Power 5, ...

103
Tomasulo Algorithm
  • Control buffers distributed with Function Units
    (FU)
  • FU buffers called reservation stations have
    pending operands
  • Registers in instructions replaced by values or
    pointers to reservation stations(RSs) called
    register renaming
  • Renaming avoids WAR, WAW hazards
  • More reservation stations than registers, so can
    do optimizations compilers cannot do without
    access to the additional internal registers, the
    reservation stations.
  • Results from RSs as leave each FU sent to waiting
    RSs, not through registers, but over a Common
    Data Bus that broadcasts results to all FUs and
    their waiting RSs
  • Avoids RAW hazards by executing an instruction
    only when its operands are available
  • Load and Stores treated as FUs with RSs as well
  • Integer instructions can go past branches
    (predict taken), allowing FP ops beyond basic
    block in FP queue

104
Tomasulo Organization
[Diagram: the FP Ops Queue and FP Registers feed instructions and operands to
 reservation stations (Add1-Add3 for the FP adders, Mult1-Mult2 for the FP
 multipliers); Load Buffers (Load1-Load6) bring data from memory and Store
 Buffers send results to memory; all results are broadcast on the Common Data
 Bus (CDB) to the waiting reservation stations, registers, and store buffers]
105
Three Stages of Tomasulo Algorithm
  • 1. Issue: get an instruction from the FP Op Queue
  • If a reservation station is free (no structural
    hazard), control issues the instr & sends operands
    (renames registers).
  • 2. Execute: operate on operands (EX)
  • When both operands are ready, start to execute; if
    not ready, watch the Common Data Bus for the result
  • 3. Write result: finish execution (WB)
  • Write on the Common Data Bus to all awaiting units;
    mark the reservation station available
  • Normal data bus: data + destination ("go to" bus)
  • Common data bus: data + source ("come from" bus)
  • 64 bits of data + 4 bits of Functional Unit
    source address
  • Write if it matches the expected Functional Unit
    (produces result)
  • Does the broadcast
  • Example speeds after start of EX: 2 clocks for LD;
    3 for Fl. Pt. +,-; 10 for x; 40 for /.

106
Reservation Station Components
  • Op: Operation to perform in the unit (e.g., + or -)
  • Vj, Vk: Value of the source operands for Op
  • Each store buffer has a V field, for the result
    to be stored
  • Qj, Qk: Reservation stations producing the source
    registers (value to be written)
  • Note: Qj,Qk = 0 => ready
  • Store buffers only have Qi, for the RS producing the
    result
  • Busy: Indicates the reservation station or FU is
    busy
  • Register result status: Indicates which
    functional unit will write each register, if one
    exists. Blank when no pending instructions
    will write that register.
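A C struct sketch (not from the slides) of the per-reservation-station and
per-register state listed above; the types and field names are illustrative,
not the 360/91's actual encoding:

  #include <stdbool.h>

  struct reservation_station {
      bool   busy;      /* this RS (and its function unit) has a pending instruction     */
      int    op;        /* operation to perform on the operands                          */
      double Vj, Vk;    /* values of the source operands, once they are known            */
      int    Qj, Qk;    /* RS numbers that will produce each source; 0 => value is ready */
  };

  struct register_status {
      int Qi;           /* RS number that will write this register; 0 => no pending writer */
  };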

107
Why Can Tomasulo Overlap Iterations Of Loops?
  • Register renaming
  • Multiple iterations use different physical
    destinations for registers (dynamic loop
    unrolling).
  • Reservation stations
  • Permit instruction issue to advance past integer
    control flow operations
  • Also buffer old values of registers - totally
    avoiding the WAR stall
  • Other perspective: Tomasulo builds a data-flow
    dependency graph on the fly

108
Tomasulo's Scheme: Two Major Advantages
  • Distribution of the hazard detection logic
  • distributed reservation stations and the CDBus
  • If multiple instructions waiting on single result
    and each instruction has other operand, then
    instructions can be released simultaneously by
    broadcast on CDB
  • If a centralized register file were used, the
    units would have to read their results from the
    registers when register buses are available
  • Elimination of stalls for WAW and WAR hazards

109
Tomasulo Drawbacks
  • Complexity
  • delays of 360/91, MIPS 10000, Alpha 21264, IBM
    PPC 620 (in CAAQA 2/e, before it was in
    silicon!)
  • Many associative stores (CDB) at high speed
  • Performance limited by Common Data Bus
  • Each CDB must go to multiple ...