CENG 450 Computer Systems and Architecture Lecture 15 - PowerPoint PPT Presentation

About This Presentation
Title:

CENG 450 Computer Systems and Architecture Lecture 15

Description:

Computer Systems and Architecture Lecture 15 Amirali Baniasadi amirali_at_ece.uvic.ca – PowerPoint PPT presentation

Number of Views:211
Avg rating:3.0/5.0
Slides: 32
Provided by: Shin174
Category:

less

Transcript and Presenter's Notes

Title: CENG 450 Computer Systems and Architecture Lecture 15


1
CENG 450Computer Systems and
ArchitectureLecture 15
  • Amirali Baniasadi
  • amirali_at_ece.uvic.ca

2
Announcements
  • Last Quiz scheduled for March 31st.

3
Cache Write Policy Write Through versus Write
Back
  • Cache read is much easier to handle than cache
    write
  • Instruction cache is much easier to design than
    data cache
  • Cache write
  • How do we keep data in the cache and memory
    consistent?
  • Two options
  • Write Back write to cache only. Write the cache
    block to memory when that cache block is being
    replaced on a cache miss.
  • Need a dirty bit for each cache block
  • Greatly reduce the memory bandwidth requirement
  • Control can be complex
  • Write Through write to cache and memory at the
    same time.
  • Isnt memory too slow for this?

4
Write Buffer for Write Through
Cache
Processor
DRAM
Write Buffer
  • A Write Buffer is needed between the Cache and
    Memory
  • Processor writes data into the cache and the
    write buffer
  • Memory controller write contents of the buffer
    to memory
  • Write buffer is just a FIFO
  • Typical number of entries 4
  • Works fine if Store frequency (w.r.t. time) ltlt
    1 / DRAM write cycle
  • Memory system designers nightmare
  • Store frequency (w.r.t. time) -gt 1 / DRAM
    write cycle
  • Write buffer saturation

5
Write Buffer Saturation
Cache
Processor
DRAM
Write Buffer
  • Store frequency (w.r.t. time) -gt 1 / DRAM
    write cycle
  • If this condition exist for a long period of time
    (CPU cycle time too quick and/or too many store
    instructions in a row)
  • Store buffer will overflow no matter how big you
    make it
  • The CPU Cycle Time lt DRAM Write Cycle Time
  • Solution for write buffer saturation
  • Use a write back cache
  • Install a second level (L2) cache

Cache
L2 Cache
Processor
DRAM
Write Buffer
6
Improving Cache Performance
  • Average Memory Access Time Hit Time Miss Rate
    Miss Penalty
  • 1. Reduce the miss rate,
  • 2. Reduce the miss penalty, or
  • 3. Reduce the time to hit in the cache.

7
1. Reduce Misses via Larger Block Size
8
2. Reduce Misses via Higher Associativity
  • 21 Cache Rule
  • Miss Rate DM cache size N Miss Rate 2-way cache
    size N/2
  • Beware Execution time is only final measure!
  • Will Clock Cycle time increase?
  • Hill 1988 suggested hit time for 2-way vs.
    1-way external cache 10, internal 2

9
3. Reducing Misses via a Victim Cache
  • How to combine fast hit time of direct mapped
    yet still avoid conflict misses?
  • Add buffer to place data discarded from cache
  • Jouppi 1990 4-entry victim cache removed 20
    to 95 of conflicts for a 4 KB direct mapped data
    cache
  • Used in Alpha, HP machines

DATA
TAGS
One Cache line of Data
Tag and Comparator
One Cache line of Data
Tag and Comparator
One Cache line of Data
Tag and Comparator
One Cache line of Data
Tag and Comparator
To Next Lower Level In
Hierarchy
10
4. Reducing Misses via Pseudo-Associativity
  • How to combine fast hit time of Direct Mapped and
    have the lower conflict misses of 2-way
    associative cache?
  • Divide cache on a miss, check other half of
    cache to see if the data is there, if so have a
    pseudo-hit (slow hit)
  • Drawback Difficult to build a CPU pipeline if
    hit may take either 1 or 2 cycles
  • Better for caches not tied directly to processor
    (L2)
  • Used in MIPS R1000 L2 cache, similar in UltraSPARC

Hit Time
Miss Penalty
Pseudo Hit Time
Time
11
5. Reducing Misses by Compiler Optimizations
  • Instructions
  • Not discussed here.
  • Data
  • Merging Arrays improve spatial locality by
    single array of compound elements vs. 2 arrays
  • Loop Interchange change nesting of loops to
    access data in order stored in memory
  • Loop Fusion Combine 2 independent loops that
    have same looping and some variables overlap

12
Merging Arrays Example
  • / Before 2 sequential arrays /
  • int valSIZE
  • int keySIZE
  • / After 1 array of structures /
  • struct merge
  • int val
  • int key
  • struct merge merged_arraySIZE
  • Reducing conflicts between val key improve
    spatial locality

13
Loop Interchange Example
  • / Before /
  • for (k 0 k lt 100 k k1)
  • for (j 0 j lt 100 j j1)
  • for (i 0 i lt 5000 i i1)
  • xij 2 xij
  • / After /
  • for (k 0 k lt 100 k k1)
  • for (i 0 i lt 5000 i i1)
  • for (j 0 j lt 100 j j1)
  • xij 2 xij
  • Sequential accesses instead of striding through
    memory every 100 words improved spatial locality

14
Loop Fusion Example
  • / Before /
  • for (i 0 i lt N i i1)
  • for (j 0 j lt N j j1)
  • aij 1/bij cij
  • for (i 0 i lt N i i1)
  • for (j 0 j lt N j j1)
  • dij aij cij
  • / After /
  • for (i 0 i lt N i i1)
  • for (j 0 j lt N j j1)
  • aij 1/bij cij
  • dij aij cij
  • 2 misses per access to a c vs. one miss per
    access improve spatial locality

15
Summary of Compiler Optimizations (by hand)
16
Summary Miss Rate Reduction
  • 3 Cs Compulsory, Capacity, Conflict
  • 1. Reduce Misses via Larger Block Size
  • 2. Reduce Misses via Higher Associativity
  • 3. Reducing Misses via Victim Cache
  • 4. Reducing Misses via Pseudo-Associativity
  • 5. Reducing Misses by Compiler Optimizations

17
Improving Cache Performance
  • 1. Reduce the miss rate,
  • 2. Reduce the miss penalty, or
  • 3. Reduce the time to hit in the cache.

18
1. Reduce Miss Penalty Early Restart and
Critical Word First
  • Dont wait for full block to be loaded before
    restarting CPU
  • Early restartAs soon as the requested word of
    the block arrives, send it to the CPU and let the
    CPU continue execution
  • Critical Word FirstRequest the missed word first
    from memory and send it to the CPU as soon as it
    arrives let the CPU continue execution while
    filling the rest of the words in the block. Also
    called wrapped fetch and requested word first
  • Generally useful only when cache line gt bus width
  • Spatial locality a problem tend to want next
    sequential word, so not clear if benefit by early
    restart

block
19
2. Reduce Miss Penalty Non-blocking Caches to
reduce stalls on misses
  • Non-blocking cache or lockup-free cache allow
    data cache to continue to supply cache hits
    during a miss
  • hit under miss reduces the effective miss
    penalty by working during miss vs. ignoring CPU
    requests
  • hit under multiple miss or miss under miss
    may further lower the effective miss penalty by
    overlapping multiple misses
  • Significantly increases the complexity of the
    cache controller as there can be multiple
    outstanding memory accesses
  • Pentium Pro allows 4 outstanding memory misses

20
3 Use a multi-level cache
  • L2 Equations
  • AMAT Hit TimeL1 Miss RateL1 x Miss
    PenaltyL1
  • Miss PenaltyL1 Hit TimeL2 Miss RateL2 x Miss
    PenaltyL2
  • AMAT Hit TimeL1
  • Miss RateL1 x (Hit TimeL2 Miss RateL2 x
    Miss PenaltyL2)
  • Definitions
  • Local miss rate misses in this cache divided by
    the total number of memory accesses to this cache
    (Miss rateL2)
  • Global miss ratemisses in this cache divided by
    the total number of memory accesses generated by
    the CPU (Miss RateL1 x Miss RateL2)
  • Global Miss Rate is what matters

21
Reducing Misses Which apply to L2 Cache?
  • Reducing Miss Rate
  • 1. Reduce Misses via Larger Block Size
  • 2. Reduce Conflict Misses via Higher
    Associativity
  • 3. Reducing Conflict Misses via Victim Cache
  • 4. Reducing Conflict Misses via
    Pseudo-Associativity
  • 5. Reducing Capacity/Conf. Misses by Compiler
    Optimizations

22
L2 cache block size A.M.A.T.
  • 32KB L1, 8 byte path to memory

23
Reducing Miss Penalty Summary
  • Three techniques
  • Early Restart and Critical Word First on miss
  • Non-blocking Caches (Hit under Miss, Miss under
    Miss)
  • Second Level Cache
  • Can be applied recursively to Multilevel Caches
  • Danger is that time to DRAM will grow with
    multiple levels in between

24
Summary The Cache Design Space
  • Several interacting dimensions
  • cache size
  • block size
  • associativity
  • replacement policy
  • write-through vs. write-back
  • The optimal choice is a compromise
  • depends on access characteristics
  • workload
  • use (I-cache, D-cache, TLB)
  • depends on technology / cost
  • Simplicity often wins

Cache Size
Associativity
Block Size
Bad
Factor A
Factor B
Good
Less
More
25
IBM POWER4 Memory Hierarchy
4 cycles to load to a floating point
register 128-byte blocks divided into 32-byte
sectors
L1(Instr.) 64 KB Direct Mapped
L1(Data) 32 KB 2-way, FIFO
write allocate 14 cycles to load to a
floating point register 128-byte blocks
L2(Instr. Data) 1440 KB, 3-way,
pseudo-LRU (shared by two processors)
L3(Instr. Data) 128 MB 8-way (shared by two
processors)
? 340 cycles 512-byte blocks divided into
128-byte sectors
26
Intel Itanium Processor
L1(Data) 16 KB, 4-way dual-ported write through
L1(Instr.) 16 KB 4-way
32-byte blocks 2 cycles
64-byte blocks write allocate 12 cycles
L2 (Instr. Data) 96 KB, 6-way
4 MB (on package, off chip)
64-byte blocks 128 bits bus at 800 MHz (12.8
GB/s) 20 cycles
27
3rd Generation Itanium
  • 1.5 GHz
  • 410 million transistors
  • 6MB 24-way set associative L3 cache
  • 6-level copper interconnect, 0.13 micron
  • 130W (i.e. lasts 17s on an AA NiCd)

28
Cache performance
29
Impact on Performance
  • Suppose a processor executes at
  • Clock Rate 1 GHz (1 ns per cycle), Ideal (no
    misses) CPI 1.1
  • 50 arith/logic, 30 ld/st, 20 control
  • Suppose that 10 of memory operations get 100
    cycle miss penalty
  • Suppose that 1 of instructions get same miss
    penalty

78 of the time the proc is stalled waiting for
memory!
30
Example Harvard Architecture
  • Unified vs. Separate ID (Harvard)
  • 16KB ID Inst miss rate0.64, Data miss
    rate6.47
  • 32KB unified Aggregate miss rate1.99
  • Which is better (ignore L2 cache)?
  • Assume 33 data ops ? 75 accesses from
    instructions (1.0/1.33)
  • hit time1, miss time50
  • Note that data hit has 1 stall for unified cache
    (only one port)
  • AMATHarvard75x(10.64x50)25x(16.47x50)
    2.05
  • AMATUnified75x(11.99x50)25x(111.99x50)
    2.24

31
Summary
  • The Principle of Locality
  • Program access a relatively small portion of the
    address space at any instant of time.
  • Temporal Locality Locality in Time
  • Spatial Locality Locality in Space
  • Three Major Categories of Cache Misses
  • Compulsory Misses sad facts of life. Example
    cold start misses.
  • Conflict Misses increase cache size and/or
    associativity. Nightmare Scenario ping pong
    effect!
  • Capacity Misses increase cache size
  • Write Policy
  • Write Through need a write buffer. Nightmare
    WB saturation
  • Write Back control can be complex
  • Cache Performance
Write a Comment
User Comments (0)
About PowerShow.com