Chapter 5: Memory Hierarchy Design - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Chapter 5: Memory Hierarchy Design


1
Chapter 5 Memory Hierarchy Design
  • Yirng-An Chen
  • Dept. of CIS
  • Computer Architecture
  • Fall, 2000

2
Computer System
3
Who Cares About the Memory Hierarchy?
  • Processor Only Thus Far in Course
  • CPU cost/performance, ISA, Pipelined Execution
  • CPU-DRAM Gap
  • 1980: no cache in µproc; 1995: 2-level cache on
    chip (1989: first Intel µproc with a cache on chip)

[Chart: processor-memory performance gap, 1980-2000. CPU performance ("Moore's Law") grows ~60%/yr; DRAM performance grows ~7%/yr; the gap grows ~50%/yr.]
4
Levels in a typical memory hierarchy
Level (reference type)        size          speed    cost       block size
registers (register ref.)     200 B         3 ns                4 B
cache (cache ref.)            32 KB - 4 MB  6 ns     NT?/MB     8 B
memory (memory ref.)          128 MB        100 ns   NT50/MB    4 KB
disk (disk memory ref.)       20 GB         10 ms    NT0.9/MB
Cache and virtual memory sit between adjacent levels; going down the hierarchy: larger, slower, cheaper.
5
Sources of Memory References
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
v = sum;

Memory Layout
[Figure: instructions I0-I9 at successive word addresses 0x0FC, 0x100, 0x104, ...; array elements a[0], a[1], ... at 0x400, 0x404, ...; v at 0x7A4]

Abstract Version of Machine Code
I0:        sum <-- 0
I1:        ap  <-- &a[0]
I2:        i   <-- 0
I3:        if (i >= n) goto done
I4: loop:  t   <-- *ap
I5:        sum <-- sum + t
I6:        ap  <-- ap + 4
I7:        i   <-- i + 1
I8:        if (i < n) goto loop
I9: done:  v   <-- sum

  • Memory addresses in bytes
  • Each instruction / data word = 4 bytes
  • Instruction sequences & data arrays laid out as
    contiguous memory blocks
6
Locality of reference
  • Principle of Locality
  • programs tend to reuse data and instructions near
    those they have used recently.
  • temporal locality: recently referenced items are
    likely to be referenced in the near future.
  • spatial locality: items with nearby addresses
    tend to be referenced close together in time.

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
v = sum;
  • Locality in Example
  • Data
  • Reference array elements in succession (spatial)
  • Instruction
  • Reference instructions in sequence (spatial)
  • Cycle through loop repeatedly (temporal)

7
Accessing data in a memory hierarchy
Between any two levels, memory divided into
blocks. Data moves between levels on demand, in
block-sized chunks. Upper-level blocks a subset
of lower-level blocks.
Access word w in block a (hit)
Access word v in block b (miss)
[Figure: block a is already resident in the upper level, so the access to word w hits; block b is only in the lower level, so the access to word v misses and block b is copied up]
Locality + "smaller hardware is faster" => memory hierarchy
8
Four ?s for Memory Hierarchy Designers
  • Q1: Where can a block be placed in the upper
    level? (Block placement)
  • Fully associative, set associative, direct mapped
  • Q2: How is a block found if it is in the upper
    level? (Block identification)
  • Tag/Block
  • Q3: Which block should be replaced on a miss?
    (Block replacement)
  • Random, LRU
  • Q4: What happens on a write? (Write strategy)
  • Write back or write through (with write buffer)

9
Address spaces
[Figure: the 32 five-bit addresses, 00000 through 11111]
An n-bit address defines an address space of 2^n
items: 0, ..., 2^n - 1.
Address space for n = 5
10
Partitioning address spaces
Key idea: partitioning the address bits
partitions the address space. In general, an
address is partitioned into fields of t (tag), s (set
index), and b (block offset) bits, e.g.,

address = [ t | s | b ] = [ tag | set index | offset ]

An address belongs to one of 2^s equivalence classes (sets),
where each set consists of 2^t blocks of
addresses, and each block consists of 2^b
addresses. The s bits uniquely identify an
equivalence class. The t bits uniquely identify
each block in the equivalence class. The b bits
define the offset of an address within a block
(block offset).
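As a concrete illustration, here is a minimal C sketch of this partitioning (the helper name split_address and the example values are mine, not from the slides):

#include <stdio.h>

/* Split an address into tag, set index, and block offset,
 * given s set-index bits and b block-offset bits. */
static void split_address(unsigned addr, int s, int b,
                          unsigned *tag, unsigned *set, unsigned *offset)
{
    *offset = addr & ((1u << b) - 1);           /* low b bits       */
    *set    = (addr >> b) & ((1u << s) - 1);    /* next s bits      */
    *tag    = addr >> (s + b);                  /* remaining t bits */
}

int main(void)
{
    unsigned tag, set, offset;
    /* address 10110 with t = 1, s = 3, b = 1, as on the next slide */
    split_address(0x16, 3, 1, &tag, &set, &offset);
    printf("tag=%u set=%u offset=%u\n", tag, set, offset);  /* tag=1 set=3 offset=0 */
    return 0;
}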
11
Partitioning address spaces
[Figure: the 5-bit address space partitioned with t = 1, s = 3, b = 1; example address 1|011|0]
2^s = 8 sets of blocks, 2^t = 2 blocks/set, 2^b = 2
addresses/block.
Example address 10110: block (tag) = 1, set = 011, offset = 0.
12
Partitioning address spaces
[Figure: the 5-bit address space partitioned with t = 2, s = 2, b = 1; example address 10|11|0]
2^s = 4 sets of blocks, 2^t = 4 blocks/set, 2^b = 2
addresses/block.
Example address 10110: block (tag) = 10, set = 11, offset = 0.
13
Partitioning address spaces
[Figure: the 5-bit address space partitioned with t = 3, s = 1, b = 1; example address 101|1|0 falls in set 1]
2^s = 2 sets of blocks, 2^t = 8 blocks/set, 2^b = 2
addresses/block.
Example address 10110: block (tag) = 101, set = 1, offset = 0.
14
Partitioning address spaces
[Figure: the 5-bit address space partitioned with t = 4, s = 0, b = 1; example address 1011|0 falls in the single set (set ø)]
2^s = 1 set of blocks, 2^t = 16 blocks/set, 2^b = 2
addresses/block.
Example address 10110: block (tag) = 1011, offset = 0.
15
Basic cache organization
Cache: C = S x E x B bytes
Address space: N = 2^n bytes
Address: n = t + s + b bits = [ tag (t) | set index (s) | block offset (b) ]
S = 2^s sets; E blocks per set; B = 2^b bytes per cache block (cache line)
E describes the associativity: how many blocks of a
set can reside in the cache simultaneously
16
Direct mapped cache (E 1)
N = 16 byte addresses (n = 4)
cache size C = 8 data bytes; line size B = 2^b =
2 bytes/line
Address fields: t = 1 (tag, x), s = 2 (set index, xx), b = 1 (block offset, x)
[Figure: the 16 addresses 0000 through 1111]
direct mapped cache:
E = 1 entry/set
S = 2^s = 4 sets (00, 01, 10, 11)
1. Determine the set from the middle bits. 2. If something is
already there, knock it out. 3. Put the new block in the
cache.
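A rough C sketch of exactly this lookup procedure for the toy configuration above (S = 4 sets, E = 1 entry/set, B = 2 bytes/line); the struct and function names are mine:

#include <stdbool.h>
#include <stdio.h>

#define S       4      /* sets            */
#define B       2      /* bytes per line  */
#define S_BITS  2      /* log2(S)         */
#define B_BITS  1      /* log2(B)         */

struct line { bool valid; unsigned tag; unsigned char data[B]; };
static struct line cache[S];

/* Returns true on a hit; on a miss, installs the new tag
 * (knocking out whatever was there) and returns false. */
static bool cache_access(unsigned addr)
{
    unsigned set = (addr >> B_BITS) & (S - 1);      /* middle s bits   */
    unsigned tag = addr >> (S_BITS + B_BITS);       /* high t bits     */
    if (cache[set].valid && cache[set].tag == tag)
        return true;
    cache[set].valid = true;                        /* miss: replace   */
    cache[set].tag = tag;
    /* (a real cache would also fetch B bytes of data from memory here) */
    return false;
}

int main(void)
{
    unsigned trace[] = {0, 1, 13, 8, 0};   /* address trace used on the next slide */
    for (int i = 0; i < 5; i++)
        printf("%2u %s\n", trace[i], cache_access(trace[i]) ? "hit" : "miss");
    return 0;
}

Running this reproduces the miss/hit pattern shown in the simulation that follows (0 miss, 1 hit, 13 miss, 8 miss, 0 miss).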
17
Direct Mapped Cache Simulation
N = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1
entry/set. Address trace (reads): 0 [0000], 1
[0001], 13 [1101], 8 [1000], 0 [0000]
(1) 0 [0000] miss: set 00 <- {v=1, tag=0, data m[1] m[0]}
(2) 13 [1101] miss: set 10 <- {v=1, tag=1, data m[13] m[12]}; set 00 still holds m[1] m[0]
(3) 8 [1000] miss: set 00 <- {v=1, tag=1, data m[9] m[8]}; set 10 still holds m[13] m[12]
(4) 0 [0000] miss: set 00 <- {v=1, tag=0, data m[1] m[0]}; set 10 still holds m[13] m[12]
(The access to address 1 hits in set 00.)
18
E-way Set-Associative Cache
N = 16 addresses (n = 4)
Cache size C = 8 data bytes; line size B = 2^b =
2 bytes
Address fields: t = 2 (tag, xx), s = 1 (set index, x), b = 1 (block offset, x)
[Figure: the 16 addresses 0000 through 1111]
2-way set associative cache:
E = 2 entries/set
S = 2^1 = 2 sets
19
2-Way Set Associative Simulation
N = 16 addresses, B = 2 bytes/line, S = 2 sets, E = 2
entries/set. Address trace (reads): 0 [0000], 1
[0001], 13 [1101], 8 [1000], 0 [0000]
0 (miss)
13 (miss)
8 (miss, LRU replacement)
0 (miss, LRU replacement)
20
Fully associative cache
N = 16 addresses (n = 4)
cache size C = 8 data bytes; line size B = 2^b =
2 bytes/line
Address fields: t = 3 (tag, xxx), s = 0, b = 1 (block offset, x)
[Figure: the 16 addresses 0000 through 1111]
fully associative cache:
E = 4 entries/set
S = 2^s = 1 set
21
Fully Associative Cache Simulation
N = 16 addresses, B = 2 bytes/line, S = 1 set, E = 4
entries/set. Address trace (reads): 0 [0000], 1
[0001], 13 [1101], 8 [1000], 0 [0000]
Address fields: t = 3 (tag, xxx), s = 0, b = 1 (block offset, x)
(1) 0 (miss): entry 0 <- {v=1, tag=000, data m[1] m[0]}
(2) 13 (miss): entry 1 <- {v=1, tag=110, data m[13] m[12]}
(3) 8 (miss): entry 2 <- {v=1, tag=100, data m[9] m[8]}
(The remaining accesses, 1 and the final 0, hit in the single set.)
22
Replacement Algorithms
  • When a block is fetched, which block in the
    target set should be replaced?
  • Usage based algorithms
  • Least recently used (LRU)
  • replace the block that has been referenced least
    recently
  • hard to implement
  • Non-usage based algorithms
  • First-in First-out (FIFO)
  • treat the set as a circular queue, replace block
    at head of queue.
  • easy to implement
  • Random (RAND)
  • replace a random block in the set
  • even easier to implement

23
Implementing LRU
Create an E x E bit matrix for each set (only
E(E-1) bits are actually needed). When block i is referenced, set
row i and clear column i. The LRU block is the
row with all zeros: all other blocks have been
referenced more recently than this one. Example
trace (E = 4): 1 2 3 4 3 2 1
[Figure: the E x E bit matrix after each reference in the trace 1 2 3 4 3 2 1]
Setting a row: "my reference is most recent."
Clearing a column: "I was referenced after you were."
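A small C sketch of this bit-matrix scheme (names are mine; it uses a full E x E matrix rather than the E(E-1)-bit optimization):

#include <stdio.h>

#define E 4

static int lru[E][E];   /* lru[i][j] == 1 means block i was referenced after block j */

static void reference(int i)
{
    for (int j = 0; j < E; j++) lru[i][j] = 1;   /* set row i      */
    for (int j = 0; j < E; j++) lru[j][i] = 0;   /* clear column i */
}

static int lru_victim(void)
{
    for (int i = 0; i < E; i++) {                /* row of all zeros = LRU block */
        int sum = 0;
        for (int j = 0; j < E; j++) sum += lru[i][j];
        if (sum == 0) return i;
    }
    return 0;   /* not reached once every block has been touched */
}

int main(void)
{
    int trace[] = {1, 2, 3, 4, 3, 2, 1};         /* block numbers 1..4 -> indices 0..3 */
    for (int t = 0; t < 7; t++) reference(trace[t] - 1);
    printf("LRU block: %d\n", lru_victim() + 1);  /* prints 4 for this trace */
    return 0;
}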
24
Miss Rates
  • Tested on a VAX using 16-byte blocks.
  • The replacement strategy is critical for small caches;
    it does not make a lot of difference for large ones.
  • Trends: more-way associative, larger cache size

25
Write Strategies
  • Write Policy
  • What happens when processor writes to the cache?
  • write through
  • information is written to the block in cache and
    memory.
  • memory always consistent with cache
  • Can overwrite cache entry
  • write back
  • information is written only to the block in the cache;
    the modified block is written to memory only when it is
    replaced (a small sketch of both policies follows this list)
  • requires a dirty bit for each block
  • To remove dirty block from cache, must write back
    to main memory
  • memory not always consistent with cache
  • Write Buffer
  • Common optimization for write-through caches
  • Overlaps memory updates with processor execution
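A toy C sketch contrasting the two policies on a write hit (my own names; it assumes a write-back line carries the dirty bit described above, and that mem points at the block's copy in memory):

#include <stdbool.h>
#include <string.h>

struct line { bool valid, dirty; unsigned tag; unsigned char data[32]; };

/* Write-through: update the cache line and memory together. */
void write_through(struct line *l, int off, unsigned char v, unsigned char *mem)
{
    l->data[off] = v;
    mem[off] = v;                  /* memory stays consistent with the cache */
}

/* Write-back: update only the cache line and mark it dirty;
 * memory is updated later, when the line is evicted. */
void write_back(struct line *l, int off, unsigned char v)
{
    l->data[off] = v;
    l->dirty = true;
}

void evict(struct line *l, unsigned char *mem)
{
    if (l->dirty)                  /* a dirty block must be written back first */
        memcpy(mem, l->data, sizeof l->data);
    l->valid = false;
    l->dirty = false;
}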

26
Allocation Strategies
  • On a write miss, is the block loaded from memory
    into the cache?
  • Write Allocate
  • Block is loaded into cache on a write miss.
  • Usually used with write back
  • No-Write Allocate (Write Around)
  • Block is not loaded into cache on a write miss
  • Usually used with write through

27
Alpha 21064 direct mapped data cache
34-bit address; 256 blocks; 32 bytes/block
28
Write merging
A write buffer that does not do write merging
A write buffer that does write merging
29
2-way set associative Cache
  • Cache size 8192 bytes; block size 8 bytes;
    2-way set associative; random replacement; write-through
    with a 1-word write buffer; no write allocate.

30
Cache Performance
  • Average memory access time = Hit time + Miss rate
    x Miss penalty (a worked example follows this list)
  • CPU time = (CPU execution clock cycles + Memory
    stall clock cycles) x Clock cycle time
  • Memory stall clock cycles = Reads x Read miss
    rate x Read miss penalty + Writes x Write miss
    rate x Write miss penalty
  • CPU time = IC x (CPI_execution + Memory accesses per
    instruction x Miss rate x Miss penalty) x Clock
    cycle time
  • Trend: CPI and CCT are being reduced => cache performance is
    more important than ever!
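A quick worked example with made-up numbers (not from the slide): with hit time = 1 cycle, miss rate = 5%, and miss penalty = 20 cycles, AMAT = 1 + 0.05 x 20 = 2 cycles. If each instruction makes 1.3 memory accesses and CPI_execution = 1.5, the effective CPI becomes 1.5 + 1.3 x 0.05 x 20 = 2.8.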

31
Cache Performance Metrics
  • Average mem access time = Hit time + Miss rate x
    Miss penalty
  • Miss Rate
  • fraction of memory references not found in the cache
    (misses/references)
  • Typical numbers: 5-10% for L1, 1-2% for L2
  • Hit Time
  • time to deliver a block in the cache to the
    processor (includes time to determine whether the
    block is in the cache)
  • Typical numbers
  • 1 clock cycle for L1
  • 3-8 clock cycles for L2
  • Miss Penalty
  • additional time required because of a miss
  • Typically 10-30 cycles for main memory

32
Instruction and Data cache unified?
  • Miss rates for instruction, data, and unified
    caches (unified => structural hazards?)
  • 75% (100/(100+26+9)) instruction references;
    25% data references (26% loads, 9% stores)
  • Question: Which is better? Split or unified
    cache?
  • Miss rate? Memory access time?
  • Assumptions: 1-cycle hit, 50-cycle miss penalty,
    2-cycle load/store hit for unified caches (why more?)

33
Reducing Misses
  • Classifying Misses: the 3 Cs
  • Compulsory: the first access to a block is not in
    the cache, so the block must be brought into the
    cache. Also called cold-start misses or first-
    reference misses. (Misses in even an infinite
    cache.)
  • Capacity: if the cache cannot contain all the
    blocks needed during execution of a program,
    capacity misses will occur due to blocks being
    discarded and later retrieved. (Misses in a fully
    associative cache of size X.)
  • Conflict: if the block-placement strategy is set
    associative or direct mapped, conflict misses (in
    addition to compulsory + capacity misses) will
    occur because a block can be discarded and later
    retrieved if too many blocks map to its set. Also
    called collision misses or interference
    misses. (Misses in an N-way associative cache of size X.)

34
3Cs Absolute Miss Rate (SPEC92)
miss rate of a 1-way associative cache of size X ≈
miss rate of a 2-way associative cache of size X/2
[Figure: 3Cs absolute miss rate vs. cache size; the conflict component shrinks as associativity grows]
35
How Can We Reduce Misses?
  • 3 Cs Compulsory, Capacity, Conflict
  • In all cases, assume total cache size not
    changed
  • What happens if
  • 1) Change Block Size Which of 3Cs is obviously
    affected?
  • 2) Change Associativity Which of 3Cs is
    obviously affected?
  • 3) Change Compiler Which of 3Cs is obviously
    affected?

36
1. Reduce Misses via Larger Block Size
Conflict misses
Compulsory misses
Capacity misses
37
2. Reduce Misses via Higher Associativity
  • 2:1 Cache Rule:
  • Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way cache
    of size N/2
  • Beware: execution time is the only final measure!
  • Will clock cycle time increase?
  • Hill [1988] suggested hit time for 2-way vs.
    1-way: external cache +10%, internal +2%

38
Avg. Memory Access Time vs. Miss Rate
  • Example: assume CCT = 1.10x for 2-way, 1.12x for
    4-way, 1.14x for 8-way vs. the CCT of a direct-mapped cache
  • Cache Size Associativity
  • (KB) 1-way 2-way 4-way 8-way
  • 1 2.33 2.15 2.07 2.01
  • 2 1.98 1.86 1.76 1.68
  • 4 1.72 1.67 1.61 1.53
  • 8 1.46 1.48 1.47 1.43
  • 16 1.29 1.32 1.32 1.32
  • 32 1.20 1.24 1.25 1.27
  • 64 1.14 1.20 1.21 1.23
  • 128 1.10 1.17 1.18 1.20
  • (Red means A.M.A.T. not improved by more
    associativity)

39
3. Reducing Misses via a Victim Cache
  • How to combine fast hit time of direct mapped yet
    still avoid conflict misses?
  • Add buffer to place data discarded from cache
  • Jouppi [1990]: a 4-entry victim cache removed 20%
    to 95% of conflicts for a 4 KB direct mapped data
    cache

[Figure: the victim cache sits between the direct-mapped cache (tag/data, CPU address and data in/out) and the lower-level memory, alongside the write buffer]
40
4. Reducing Miss Rate: Pseudo-Associative Caches
  • How to combine the fast hit time of direct mapped with
    the lower conflict misses of a 2-way SA cache?
  • Divide the cache: on a miss, check the other half of the
    cache to see if the block is there; if so, it is a pseudo-hit
    (slow hit).
  • Drawback: the CPU pipeline is hard if a hit can take
    different numbers of cycles (hit or slow hit?).
  • Better for caches not tied directly to the processor.

41
5. Reducing Misses by Hardware Prefetching of
Instructions & Data
  • E.g., instruction prefetching:
  • Alpha 21064 fetches 2 blocks on a miss
  • The extra block is placed in a stream buffer
  • On a miss, check the stream buffer
  • Works with data blocks too:
  • Jouppi [1990]: 1 data stream buffer caught 25% of misses
    from a 4KB cache; 4 streams caught 43%
  • Palacharla & Kessler [1994]: for scientific
    programs, 8 streams caught 50% to 70% of misses
    from two 64KB, 4-way set associative caches
  • Prefetching relies on having extra memory
    bandwidth that can be used without penalty

42
6. Reducing Misses by Software Prefetching Data
  • Data prefetch:
  • Load data into a register (HP PA-RISC loads)
  • Cache prefetch: load into the cache (MIPS IV,
    PowerPC, SPARC)
  • Special prefetching instructions cannot cause
    faults: a form of speculative execution
  • Issuing prefetch instructions takes time:
  • Is the cost of prefetch issues < the savings in reduced
    misses?
  • Higher superscalar reduces the difficulty of issue
    bandwidth

43
7. Reducing Misses by Compiler Optimizations
  • McFarling [1989] reduced cache misses by 75% on an
    8KB direct mapped cache with 4-byte blocks, in
    software
  • Instructions:
  • Reorder procedures in memory so as to reduce
    conflict misses
  • Profiling to look at conflicts (using tools they
    developed)
  • Data:
  • Merging arrays: improve spatial locality by using a
    single array of compound elements vs. 2 arrays
  • Loop interchange: change the nesting of loops to
    access data in the order stored in memory
  • Loop fusion: combine 2 independent loops that
    have the same looping and some variables overlap
  • Blocking: improve temporal locality by accessing
    blocks of data repeatedly vs. going down whole
    columns or rows

44
Merging Arrays Example
  • /* Before: 2 sequential arrays */
  • int val[SIZE];
  • int key[SIZE];
  • /* After: 1 array of structures */
  • struct merge {
  •   int val;
  •   int key;
  • };
  • struct merge merged_array[SIZE];
  • Reducing conflicts between val & key; improves
    spatial locality

45
Loop Interchange Example
  • /* Before */
  • for (k = 0; k < 100; k = k+1)
  •   for (j = 0; j < 100; j = j+1)
  •     for (i = 0; i < 5000; i = i+1)
  •       x[i][j] = 2 * x[i][j];
  • /* After */
  • for (k = 0; k < 100; k = k+1)
  •   for (i = 0; i < 5000; i = i+1)
  •     for (j = 0; j < 100; j = j+1)
  •       x[i][j] = 2 * x[i][j];
  • Sequential accesses instead of striding through
    memory every 100 words; improved spatial locality

46
Loop Fusion Example
  • /* Before */
  • for (i = 0; i < N; i = i+1)
  •   for (j = 0; j < N; j = j+1)
  •     a[i][j] = 1/b[i][j] * c[i][j];
  • for (i = 0; i < N; i = i+1)
  •   for (j = 0; j < N; j = j+1)
  •     d[i][j] = a[i][j] + c[i][j];
  • /* After */
  • for (i = 0; i < N; i = i+1)
  •   for (j = 0; j < N; j = j+1) {
  •     a[i][j] = 1/b[i][j] * c[i][j];
  •     d[i][j] = a[i][j] + c[i][j];
  •   }
  • 2 misses per access to a & c vs. one miss per
    access; improves temporal locality

47
Blocking Example
  • /* Before */
  • for (i = 0; i < N; i = i+1)
  •   for (j = 0; j < N; j = j+1) {
  •     sum = 0.0;
  •     for (k = 0; k < N; k = k+1)
  •       sum = sum + a[i][k] * b[k][j];
  •     c[i][j] = sum;
  •   }
  • Two inner loops:
  • Read all NxN elements of b
  • Read N elements of 1 row of a repeatedly
  • Write N elements of 1 row of c
  • Capacity misses are a function of N & cache size:
  • if 3 NxN arrays fit => no capacity misses; otherwise ...
  • Idea: compute on a BxB submatrix that fits

48
Interactions Between Program Cache
  • Major Cache Effects to Consider
  • Total cache size
  • Try to keep heavily used data in highest level
    cache
  • Block size (sometimes referred to as line size)
  • Exploit spatial locality
  • Example application:
  • Multiply n x n matrices
  • O(n^3) total operations
  • Accesses
  • n reads per source element
  • n values summed per destination
  • But may be able to hold in register

Variable sum held in register
/* ijk */
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    sum = 0.0;
    for (k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }

49
Layout of Arrays in Memory
Memory Layout
  • C arrays are allocated in row-major order
  • Each row in contiguous memory locations
  • Stepping through columns in a row:
  • for (i = 0; i < n; i++)
  •   sum += a[0][i];
  • Accesses successive elements
  • For block size > 8 bytes, get spatial locality
  • Cold-start miss rate = 8/B
  • Stepping through rows in a column:
  • for (i = 0; i < n; i++)
  •   sum += a[i][0];
  • Accesses distant elements
  • No spatial locality
  • Cold-start miss rate = 1
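For example (assuming the 32-byte blocks used on the next slide and 8-byte doubles): stepping along a row misses once every 32/8 = 4 elements, a miss rate of 8/32 = 0.25, while stepping down a column misses on every access (miss rate = 1).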

[Figure: row-major memory layout of a: a[0][0] at 0x80000, a[0][1] at 0x80008, ..., a[0][255] at 0x807F8; a[1][0] at 0x80800, ..., a[1][255] at 0x80FF8; ...; a[255][255] at 0xFFFF8]
50
Miss Rate Analysis
  • Assume
  • Block size 32B (big enough for 4 doubles)
  • n is very large
  • Approximate 1/n as 0.0
  • Cache not even big enough to hold multiple rows
  • Analysis Method
  • Look at access pattern by inner loop

51
Matrix multiplication (ijk)
/* ijk */
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    sum = 0.0;
    for (k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }

[Figure: inner loop sweeps a row of A (i,*) row-wise and a column of B (*,j); C is fixed at (i,j)]
  • Approx. Miss Rates
  • a b c
  • 0.25 1.0 0.0

52
Matrix multiplication (jik)
/* jik */
for (j = 0; j < n; j++)
  for (i = 0; i < n; i++) {
    sum = 0.0;
    for (k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }

[Figure: inner loop sweeps a row of A (i,*) and a column of B (*,j); C is fixed at (i,j)]
  • Approx. Miss Rates
  • a b c
  • 0.25 1.0 0.0

53
Matrix multiplication (kij)
/* kij */
for (k = 0; k < n; k++)
  for (i = 0; i < n; i++) {
    r = a[i][k];
    for (j = 0; j < n; j++)
      c[i][j] += r * b[k][j];
  }

[Figure: inner loop sweeps a row of B (k,*) and a row of C (i,*); A is fixed at (i,k)]
  • Approx. Miss Rates
  • a b c
  • 0.0 0.25 0.25

54
Matrix multiplication (ikj)
/* ikj */
for (i = 0; i < n; i++)
  for (k = 0; k < n; k++) {
    r = a[i][k];
    for (j = 0; j < n; j++)
      c[i][j] += r * b[k][j];
  }

[Figure: inner loop sweeps a row of B (k,*) and a row of C (i,*); A is fixed at (i,k)]
  • Approx. Miss Rates
  • a b c
  • 0.0 0.25 0.25

55
Matrix multiplication (jki)
/* jki */
for (j = 0; j < n; j++)
  for (k = 0; k < n; k++) {
    r = b[k][j];
    for (i = 0; i < n; i++)
      c[i][j] += a[i][k] * r;
  }

[Figure: inner loop sweeps a column of A (*,k) and a column of C (*,j); B is fixed at (k,j)]
  • Approx. Miss Rates
  • a b c
  • 1.0 0.0 1.0

56
Matrix multiplication (kji)
/* kji */
for (k = 0; k < n; k++)
  for (j = 0; j < n; j++) {
    r = b[k][j];
    for (i = 0; i < n; i++)
      c[i][j] += a[i][k] * r;
  }

[Figure: inner loop sweeps a column of A (*,k) and a column of C (*,j); B is fixed at (k,j)]
  • Approx. Miss Rates
  • a b c
  • 1.0 0.0 1.0

57
Summary of Matrix Multiplication
ijk (2 loads, 0 stores, 1.25 misses/iter):
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    sum = 0.0;
    for (k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }

jik (2 loads, 0 stores, 1.25 misses/iter):
for (j = 0; j < n; j++)
  for (i = 0; i < n; i++) {
    sum = 0.0;
    for (k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }

kij (2 loads, 1 store, 0.5 misses/iter):
for (k = 0; k < n; k++)
  for (i = 0; i < n; i++) {
    r = a[i][k];
    for (j = 0; j < n; j++)
      c[i][j] += r * b[k][j];
  }

ikj (2 loads, 1 store, 0.5 misses/iter):
for (i = 0; i < n; i++)
  for (k = 0; k < n; k++) {
    r = a[i][k];
    for (j = 0; j < n; j++)
      c[i][j] += r * b[k][j];
  }

jki (2 loads, 1 store, 2.0 misses/iter):
for (j = 0; j < n; j++)
  for (k = 0; k < n; k++) {
    r = b[k][j];
    for (i = 0; i < n; i++)
      c[i][j] += a[i][k] * r;
  }

kji (2 loads, 1 store, 2.0 misses/iter):
for (k = 0; k < n; k++)
  for (j = 0; j < n; j++) {
    r = b[k][j];
    for (i = 0; i < n; i++)
      c[i][j] += a[i][k] * r;
  }
58
Matmult Performance (Sparc20)
Multiple columns of B fit in cache?
kij/ikj: 2 loads, 1 store, 0.5 misses/iter
ijk/jik: 2 loads, 0 stores, 1.25 misses/iter
jki/kji: 2 loads, 1 store, 2.0 misses/iter
  • As matrices grow in size, exceed cache capacity
  • Different loop orderings give different
    performance
  • Cache effects
  • Whether or not can accumulate in register

59
Block Matrix Multiplication
Example: n = 8, B = 4

[ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
[ C21 C22 ] = [ A21 A22 ] x [ B21 B22 ]

Key idea: sub-blocks (i.e., Aij) can be treated
just like scalars:
C11 = A11*B11 + A12*B21
C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21
C22 = A21*B12 + A22*B22
60
Blocked Matrix Multiply (bijk)
for (jj = 0; jj < n; jj += bsize) {
  for (i = 0; i < n; i++)
    for (j = jj; j < min(jj+bsize, n); j++)
      c[i][j] = 0.0;
  for (kk = 0; kk < n; kk += bsize) {
    for (i = 0; i < n; i++)
      for (j = jj; j < min(jj+bsize, n); j++) {
        sum = 0.0;
        for (k = kk; k < min(kk+bsize, n); k++)
          sum += a[i][k] * b[k][j];
        c[i][j] += sum;
      }
  }
}
  • bsize is called the blocking factor
  • Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2

61
Blocked Matrix Multiply Analysis
  • Innermost loop pair multiplies 1 X bsize sliver
    of A times bsize X bsize block of B and
    accumulates into 1 X bsize sliver of C
  • The loop over i steps through n row slivers of A & C,
    using the same B

[Figure: innermost loop pair. Successive elements of the C sliver are updated; the row sliver of A is accessed bsize times; the B block is reused n times in succession]
62
Blocked matmult perf (Sparc20)
63
Reducing Conflict Misses by Blocking
  • Conflict misses in caches that are not fully associative, vs. blocking size

64
Summary of Compiler Optimizations to Reduce Cache
Misses
65
Improving Cache Performance
  • 1. Reduce the miss rate,
  • 2. Reduce the miss penalty, or
  • 3. Reduce the time to hit in the cache.

66
1. Reducing Miss Penalty Read Priority over
Write on Miss
  • Write-through with write buffers offers RAW
    conflicts with main memory reads on cache misses
  • If we simply wait for the write buffer to empty, we might
    increase the read miss penalty (on the old MIPS 1000, by 50%)
  • Check the write buffer contents before a read; if there are no
    conflicts, let the memory access continue
  • Write back?
  • Read miss replacing a dirty block:
  • Normal: write the dirty block to memory, and then do
    the read
  • Instead: copy the dirty block to a write buffer,
    then do the read, and then do the write
  • The CPU stalls less since it restarts as soon as the read is done

67
2. Reduce Miss Penalty Subblock Placement
  • Don't have to load the full block on a miss
  • Have valid bits per subblock to indicate validity
  • (Originally invented to reduce tag storage)

[Figure: four cache blocks with tags 100, 300, 200, 204; each block is divided into sub-blocks with per-subblock valid bits]
68
3. Reduce Miss Penalty Early Restart and
Critical Word First
  • Don't wait for the full block to be loaded before
    restarting the CPU
  • Early restart: as soon as the requested word of
    the block arrives, send it to the CPU and let the
    CPU continue execution
  • Critical word first: request the missed word first
    from memory and send it to the CPU as soon as it
    arrives; let the CPU continue execution while
    filling the rest of the words in the block. Also
    called wrapped fetch and requested word first
  • Generally useful only for large blocks

69
4. Reduce Miss Penalty Non-blocking Caches to
reduce stalls on misses
  • Non-blocking cache or lockup-free cache allow
    data cache to continue to supply cache hits
    during a miss
  • requires out-of-order execution CPU
  • hit under miss reduces the effective miss
    penalty by working during miss vs. ignoring CPU
    requests
  • hit under multiple miss or miss under miss
    may further lower the effective miss penalty by
    overlapping multiple misses
  • Significantly increases the complexity of the
    cache controller as there can be multiple
    outstanding memory accesses
  • Requires multiple memory banks (otherwise cannot
    support)

70
Value of Hit Under Miss for SPEC
[Figure: AMAT vs. "hit under n misses" (0->1, 1->2, 2->64, Base) for integer and floating-point SPEC programs]
  • 8 KB data cache, direct mapped, 32B blocks, 16-
    cycle miss
  • FP programs on average: AMAT = 0.68 -> 0.52 ->
    0.34 -> 0.26
  • Int programs on average: AMAT = 0.24 -> 0.20 ->
    0.19 -> 0.19

71
5. Miss Penalty Reduction: Second-Level Caches
  • L2 Equations:
  • AMAT = Hit Time_L1 + Miss Rate_L1 x Miss
    Penalty_L1
  • Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss
    Penalty_L2
  • AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 +
    Miss Rate_L2 x Miss Penalty_L2)
  • Definitions:
  • Local miss rate: misses in this cache divided by
    the total number of memory accesses to this cache
    (Miss Rate_L2)
  • Global miss rate: misses in this cache divided by
    the total number of memory accesses generated by
    the CPU (Miss Rate_L1 x Miss Rate_L2)
  • Global Miss Rate is what matters
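A worked example with made-up numbers: if Hit Time_L1 = 1 cycle, Miss Rate_L1 = 4%, Hit Time_L2 = 10 cycles, local Miss Rate_L2 = 25%, and Miss Penalty_L2 = 100 cycles, then Miss Penalty_L1 = 10 + 0.25 x 100 = 35 cycles, AMAT = 1 + 0.04 x 35 = 2.4 cycles, and the global L2 miss rate is 0.04 x 0.25 = 1%.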

72
L2 cache block size & A.M.A.T.
  • 32KB L1, 8 byte path to memory

73
Multi-level caches
Can have separate Icache and Dcache, or a unified
Icache/Dcache.
Level            size      speed    block size
registers        200 B     5 ns     4 B
L1 cache (SRAM)  8 KB      5 ns     16 B
L2 cache (SRAM)  1 MB      6 ns     32 B
memory (DRAM)    128 MB    100 ns   4 KB
disk             10 GB     10 ms
Going down the hierarchy: larger, slower, cheaper;
larger block size, higher associativity, more
likely to write back.
74
Alpha 21164 Hierarchy
Processor Chip
L1 data: 1-cycle latency, 8KB, direct mapped, write-through,
dual ported, 32B lines
L2 unified: 8-cycle latency, 96KB, 3-way
associative, write-back, write allocate, 32B/64B lines
L3 unified: 1M-64M, direct mapped, write-back, write
allocate, 32B or 64B lines
Main memory: up to 1TB
Regs.
L1 instruction: 8KB, direct mapped, 32B lines
  • Improving memory performance was the main design goal
  • Earlier Alphas: CPUs starved for data

75
Review Improving Cache Performance
  • 1. Reduce the miss rate,
  • 2. Reduce the miss penalty, or
  • 3. Reduce the time to hit in the cache.

76
1. Fast Hit times via Small and Simple Caches
  • Why does the Alpha 21164 have 8KB instruction and 8KB data
    caches + a 96KB second-level cache?
  • Small data cache => fast clock rate
  • Direct mapped, on chip

77
2. Fast hits by Avoiding Address Translation
  • Send the virtual address to the cache? Called a Virtually
    Addressed Cache or just Virtual Cache, vs. a
    Physical Cache
  • Every time a process is switched, logically we must
    flush the cache; otherwise we get false hits
  • Cost is the time to flush + "compulsory" misses from the
    empty cache
  • Dealing with aliases:
  • Two different virtual addresses map to the same
    physical address
  • I/O must interact with the cache, so it needs the virtual
    address
  • Solution to aliases:
  • SW guarantees that aliases agree in the bits covering the
    index field; for direct mapped they must then be unique
    (called page coloring)
  • Solution to cache flush:
  • Add a process-identifier tag that identifies the
    process as well as the address within the process: can't
    get a hit if it is the wrong process

78
3. Fast Hit Times Via Pipelined Writes
  • Pipeline the tag check and the cache update as separate
    stages; the current write's tag check overlaps the previous
    write's cache update
  • Only STORES are in the pipeline; it empties during a miss:
      Store r2, (r1)    check r1
      Add
      Sub
      Store r4, (r3)    M[r1] <- r2 & check r3
  • In shade is the "Delayed Write Buffer"; it must be
    checked on reads; either complete the write or read
    from the buffer

79
Cache Optimization Summary
  • Technique (improves MR = miss rate, MP = miss penalty, or HT = hit time; number = HW complexity)
  • Miss rate: Larger block size (0), Higher
    associativity (1), Victim caches (2), HW
    prefetching of instr/data (2), Compiler-
    controlled prefetching (3), Compiler techniques to
    reduce misses (0)
  • Miss penalty: Priority to read misses (1), Subblock
    placement (1), Early restart & critical word 1st
    (2), Non-blocking caches (3), Second-level
    caches (2)
  • Hit time: Small & simple caches (0), Avoiding address
    translation (2), Pipelining writes (1)
80
What Youve Learned About Caches?
  • 1960-1985: Speed = f(no. of operations)
  • 1990:
  • Pipelined execution & fast clock rate
  • Out-of-order execution
  • Superscalar instruction issue
  • 1998: Speed = f(non-cached memory accesses)

81
Static RAM (SRAM)
  • Fast
  • ~10 ns (1995)
  • Persistent
  • as long as power is supplied
  • no refresh required
  • Expensive
  • 6 transistors/bit
  • Stable
  • High immunity to noise and environmental
    disturbances
  • Technology for caches

82
Anatomy of an SRAM bit (cell)
Read: set bit lines high; set word line high; see which bit line goes low.
Write: set bit lines to opposite values; set word line; flip the cell to the new state.
83
Example 1-level-decode SRAM (16 x 8)
[Figure: 16 x 8 SRAM array: word lines W0-W15 select a row of memory cells; bit-line pairs b7/b7' ... b0/b0' connect to sense/write amps, driven by R/W and the input/output lines d7 ... d0]
84
Dynamic RAM (DRAM)
  • Slower than SRAM
  • access time ~70 ns (1995)
  • Non-persistent
  • every row must be accessed every 1 ms
    (refreshed)
  • Cheaper than SRAM
  • 1 transistor/bit
  • Fragile
  • electrical noise, light, radiation
  • Workhorse memory technology

85
Anatomy of a DRAM Cell
[Figure: DRAM cell anatomy: an access transistor, gated by the word line, connects the storage node (Cnode) to the bit line (CBL). Writing: drive the bit line to V and assert the word line to charge the storage node.]
86
Addressing arrays with bits
Consider an R x C array of addresses, where R =
2^r and C = 2^c. Then for each address:
row(address) = address / C = leftmost r bits of
the address; col(address) = address % C =
rightmost c bits of the address.

address = [ row (r bits) | col (c bits) ]
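A tiny C illustration of this row/column split (the variable names and example values are mine):

#include <stdio.h>

int main(void)
{
    unsigned c = 8;                            /* r = 8, c = 8: a 256 x 256 array,
                                                  as in the DRAM on the next slide */
    unsigned address = 0xABCD;                 /* 16-bit address                  */
    unsigned row = address >> c;               /* address / C: leftmost r bits    */
    unsigned col = address & ((1u << c) - 1);  /* address % C: rightmost c bits   */
    printf("row=%u col=%u\n", row, col);       /* row=171 (0xAB), col=205 (0xCD)  */
    return 0;
}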
87
Example 2-level decode DRAM (64Kx1)
[Figure: 64K x 1 DRAM. The 16-bit address A15-A0 is multiplexed: the row address is latched on RAS and drives a row decoder into a 256-row x 256-column cell array; the column address is latched on CAS and selects one bit through the column sense/write amps and the column latch and decoder; R/W, Din, Dout]
88
DRAM Operation
  • Row address (50ns):
  • Set the row address on the address lines & strobe RAS
  • The entire row is read & stored in the column latches
  • The contents of the row of memory cells are destroyed
  • Column address (10ns):
  • Set the column address on the address lines & strobe CAS
  • Access the selected bit:
  • READ: transfer from the selected column latch to Dout
  • WRITE: set the selected column latch to Din
  • Rewrite (30ns):
  • Write back the entire row
  • Timing: access time 60ns < cycle time 90ns
  • Must refresh periodically: approx. every 1 ms
  • Perform a complete memory cycle for each row
  • Handled in the background by the memory controller

89
Enhanced Performance DRAMs
  • Conventional access:
  • Row + Col
  • RAS CAS RAS CAS ...
  • Page mode:
  • Row + series of columns
  • RAS CAS CAS CAS ...
  • Gives successive bits
  • Video RAM:
  • Shift out an entire row sequentially
  • At video rate

The entire row is buffered in the column latches.
Typical performance: row access time 50ns, column
access time 10ns, cycle time 90ns, page-mode
cycle time 25ns.
90
Main Memory Background
  • Performance of main memory:
  • Latency: cache miss penalty
  • Access time: time between the request and when the word
    arrives
  • Cycle time: time between requests
  • Bandwidth: I/O & large-block miss penalty (L2)
  • Main memory is DRAM: Dynamic Random Access Memory
  • Dynamic since it needs to be refreshed periodically
    (8 ms)
  • Caches use SRAM: Static Random Access Memory
  • No refresh (6 transistors/bit vs. 1 transistor);
    size: DRAM/SRAM = 4-8x; cost/cycle time:
    SRAM/DRAM = 8-16x
  • DRAM: capacity +60%/yr, cost -30%/yr
  • 2.5X cells/area, 1.5X die size in 3 years
  • Order of importance: 1) cost/bit 2) capacity

91
Bandwidth Matching
  • Challenge
  • CPU works with short cycle times
  • DRAM (relatively) long cycle times
  • How can we provide enough bandwidth between the
    processor & memory?
  • Effect of Caching
  • Caching greatly reduces amount of traffic to main
    memory
  • But, sometimes need to move large amounts of data
    from memory into cache
  • Trends
  • Need for high bandwidth much greater for
    multimedia applications
  • Repeated operations on image data
  • Recent generation machines (e.g., Pentium II)
    greatly improve on predecessors

92
High Bandwidth Memory Systems
Solution 1: High-BW DRAM. Example: Page Mode DRAM, RAMbus.
Solution 2: Wide path between memory & cache. Example: Alpha AXP 21064: 256-bit-wide bus, L2 cache, and memory.
Solution 3: Memory bank interleaving. Example: DEC 3000.
93
Independent Memory Banks
  • Memory banks for independent accesses vs. faster
    sequential accesses
  • Multiprocessor
  • I/O
  • CPU with hit under n misses, non-blocking cache
  • Superbank: all memory active on one block
    transfer (or Bank)
  • Bank: portion within a superbank that is word
    interleaved (or Subbank)

[Figure: address fields: superbank number | superbank offset, where the superbank offset = bank number | bank offset]
94
Avoiding Bank Conflicts
  • Lots of banks:
  • int x[256][512];
  • for (j = 0; j < 512; j = j+1)
  •   for (i = 0; i < 256; i = i+1)
  •     x[i][j] = 2 * x[i][j];
  • Even with 128 banks, since 512 is a multiple of
    128, word accesses conflict
  • SW: loop interchange, or declaring the array length not a power
    of 2 (array padding; a sketch follows this list)
  • HW: prime number of banks
  • bank number = address mod number of banks
  • address within bank = address / number of banks
  • modulo & divide per memory access with a prime number of
    banks?
  • address within bank = address mod number of words in a
    bank
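A sketch of the software fix (array padding), under my assumption of 128 word-interleaved banks: padding the row length from 512 to 513 words makes successive elements of a column fall in different banks.

#define ROWS 256
#define COLS 512
#define PAD  1                       /* row length no longer a multiple of 128 */

int x[ROWS][COLS + PAD];

void update(void)
{
    /* Same computation as above; the column-order walk now rotates
     * through the banks instead of hammering a single one. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}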

95
Fast Memory Systems DRAM specific
  • Multiple CAS accesses: several names (page mode)
  • Extended Data Out (EDO): 30% faster in page mode
  • New DRAMs to address the gap: what will they cost,
    will they survive?
  • RAMBUS: startup company; reinvents the DRAM interface
  • Each chip is a module vs. a slice of memory
  • Short bus between the CPU and the chips
  • Does its own refresh
  • Variable amount of data returned
  • 1 byte / 2 ns (500 MB/s per chip)
  • Synchronous DRAM: 2 banks on chip, a clock signal
    to the DRAM, transfers synchronous to the system clock (66
    - 150 MHz)
  • Intel claims RAMBUS Direct (16 b wide) is the future of
    PC memory
  • Niche memory or main memory?
  • e.g., Video RAM for frame buffers: DRAM + fast
    serial output

96
Virtual Memory
  • Main memory can act as a cache for the secondary
    storage (disk)
  • Advantages
  • illusion of having more physical memory
  • program relocation
  • protection

97
Virtual Memory (cont)
Provides the illusion of very large memory: the sum of
the memory of many jobs can be greater than physical
memory, and the address space of each job can be larger than
physical memory. Allows available (fast and
expensive) physical memory to be very well
utilized. Simplifies memory management. Exploits the
memory hierarchy to keep average access time
low. Involves at least two storage levels: main
(RAM) and secondary (disk).
Virtual Address -- address used by the programmer.
Virtual Address Space -- collection of such addresses.
Physical Address -- address of a word in physical memory.
98
Virtual Address Spaces
Key idea: virtual and physical address spaces are
divided into equal-sized blocks known as virtual
pages and physical pages (page frames).

[Figure: the virtual address spaces of Process 1 and Process 2 (virtual addresses 0 to 2^n-1) are mapped by address translation onto physical pages in a single physical address space (physical addresses 0 to 2^m-1)]
What if the virtual address spaces are bigger
than the physical address space?
99
VM as part of the memory hierarchy
Access word w in virtual page p (hit)
Access word v in virtual page q (miss or page
fault)
[Figure: virtual pages p and q of a process. Page p is resident in a page frame of memory, so accessing word w in p hits; page q is only on disk, so accessing word v in q misses (page fault) and q must be brought into memory]
100
VM address translation
V = {0, 1, ..., n - 1}: virtual address space
M = {0, 1, ..., m - 1}: physical address space   (n > m)
MAP: V --> M U {0}: address mapping function
MAP(a) = a' if data at virtual address a is present at physical
            address a' and a' is in M
       = 0  if data at virtual address a is not present in M

[Figure: the processor issues virtual address a in name space V; the address translation mechanism either produces physical address a' into main memory, or signals a missing-item fault; the fault handler has the OS transfer the page from secondary memory]
101
VM address translation
[Figure: a 32-bit virtual address = virtual page number (bits 31-12) + page offset (bits 11-0); address translation replaces the virtual page number with a physical page number (bits 29-12), yielding the physical address]
Notice that the page offset bits don't change as
a result of translation
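A rough C sketch of this translation for the 4 KB pages in the figure (the flat page-table array and its handling of faults are simplifications of mine):

#include <stdint.h>

#define PAGE_BITS  12                     /* 4 KB pages: 12-bit page offset      */
#define NUM_VPAGES (1u << 20)             /* 2^20 virtual pages for a 32-bit VA  */

/* Toy page table: physical page number per virtual page, -1 if not resident. */
static int32_t page_table[NUM_VPAGES];

uint32_t translate(uint32_t va)
{
    uint32_t vpn = va >> PAGE_BITS;                 /* virtual page number       */
    uint32_t off = va & ((1u << PAGE_BITS) - 1);    /* page offset is unchanged  */
    int32_t  ppn = page_table[vpn];
    if (ppn < 0) {
        /* page fault: the OS would fetch the page from disk and retry */
        return 0;
    }
    return ((uint32_t)ppn << PAGE_BITS) | off;      /* physical page number + offset */
}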
102
Address translation with a page table
[Figure: the virtual page number (bits 31-12 of the virtual address) indexes the page table, located by the page table register; each entry holds valid, access, and physical page number fields. If valid = 0, the page is not in memory and a page fault exception is raised; otherwise the physical page number (bits 29-12) is concatenated with the page offset (bits 11-0) to form the physical address.]
103
Page Tables
104
Address translation with a page table (cont.)
Separate page table(s) per process.
If V = 1, the page is in main memory at the frame address
stored in the table; else the address is the location of the
page in secondary memory.
Access rights: R = read-only, R/W = read/write, X = execute
only. If the kind of access is not compatible with the
specified access rights, then
protection_violation_fault; if the valid bit is not set,
then page_fault.
Protection fault: access-rights violation; causes a trap to a
hardware, microcode, or software fault handler.
Page fault: page not resident in physical memory; also
causes a trap; usually accompanied by a
context switch: the current process is suspended
while the page is fetched from secondary storage.
105
VM design issues
  • Everything is driven by the enormous cost of misses:
  • hundreds of thousands of clocks,
  • vs. units or tens of clocks for cache misses
  • disks are high-latency, low-bandwidth devices
    (compared to memory)
  • disk performance: ~10 ms access time,
    10-33 MBytes/sec transfer rate
  • Large block sizes:
  • 4 KBytes - 16 KBytes are typical
  • amortize the high access time
  • reduce the miss rate by exploiting locality

106
VM design issues (cont)
  • Fully associative page placement
  • eliminates conflict misses
  • every miss is a killer, so worth the lower hit
    time
  • Use smart replacement algorithms
  • handle misses in software
  • miss penalty is so high anyway, no reason to
    handle in hardware
  • small improvements pay big dividends
  • Write back only:
  • disk access is too slow to afford write-through +
    a write buffer

107
Integrating VM and cache
[Figure: the CPU issues a VA; translation produces a PA, which indexes the cache; on a cache miss, main memory is accessed and the data returned]

It takes an extra memory access to translate a VA
to a PA. Why not address the cache with the VA? Aliasing
problem: 2 virtual addresses that point to the
same physical page. Result: two cache blocks
for one physical location. Solution: index the
cache with low-order VA bits that don't change
during translation (requires small caches or OS
support such as page coloring).
108
Speeding up translation with a TLB
A translation lookaside buffer (TLB) is a small,
usually fully associative cache, that maps
virtual page numbers to physical page numbers.
[Figure: the CPU issues a VA; on a TLB hit the PA goes straight to the cache; on a TLB miss the full translation is performed first; cache hits return data, cache misses go to main memory]
109
Address translation with a TLB
[Figure: the virtual page number (bits 31-12) is looked up in the TLB, whose entries hold valid, dirty, tag, and physical page number fields; on a TLB hit the physical address (tag, index, byte offset) is formed and looked up in the cache (valid, tag, data), producing a cache hit and the data]
110
Alpha AXP 21064 TLB
page size: 8 KB; block size: 1 PTE (8 bytes); hit
time: 1 clock; miss penalty: 20 clocks; TLB size:
ITLB 8 PTEs, DTLB 32 PTEs; replacement: random (but not
last used); placement: fully associative

[Figure: each TLB entry holds V <1>, R <2>, W <2>, Tag <30>, and Physical address <21>. The page-frame address <30> is compared against all tags; the matching entry's physical address <21> passes through a 32:1 mux and is concatenated with the low-order 13 bits of the address (the page offset) to form the 34-bit physical address.]
111
Mapping an Alpha 21064 virtual address
[Figure: the 21064 virtual address = seg0/seg1 selector (000...0 or 111...1), three 10-bit level fields (Level1, Level2, Level3), and a 13-bit page offset. The page table base register locates the L1 page table; each level field indexes a page table of 1K PTEs (8 KB; PTE size = 8 bytes) to reach the next level; the L3 page table entry supplies the 21-bit physical page number, which is concatenated with the 13-bit page offset to form the physical address into main memory.]
112
(No Transcript)
113
Alpha 21164 Chip Photo
  • Microprocessor Report 9/12/94
  • Caches
  • L1 data
  • L1 instruction
  • L2 unified
  • TLB
  • Branch history

114
Alpha Memory Performance Miss Rates of SPEC92
[Figure: SPEC92 miss rates for the 8K L1 instruction cache, 8K L1 data cache, and 2M L2 cache. Representative points: I miss = 6%, D miss = 32%, L2 miss = 10%; I miss = 2%, D miss = 13%, L2 miss = 0.6%; I miss = 1%, D miss = 21%, L2 miss = 0.3%.]
115
Alpha CPI Components
  • Instruction stall: branch mispredict (green)
  • Data cache (blue); instruction cache (yellow);
    L2 (pink); other: compute, register conflicts,
    structural conflicts

116
Modern Systems