1
Chapter Seven: Large and Fast: Exploiting Memory Hierarchy
2
Memories: Review
  • SRAM
  • value is stored on a pair of inverting gates
  • very fast, but takes up more space than DRAM (4 to
    6 transistors per bit)
  • DRAM
  • value is stored as a charge on a capacitor (must be
    refreshed)
  • very small, but slower than SRAM (by a factor of 5 to
    10)

3
Exploiting Memory Hierarchy
  • Users want large and fast memories! SRAM access
    times are 2 - 25 ns at a cost of $100 to $250 per
    Mbyte. DRAM access times are 60 - 120 ns at a cost of
    $5 to $10 per Mbyte. Disk access times are 10 to
    20 million ns at a cost of $0.10 to $0.20 per Mbyte.
  • Try and give it to them anyway
  • build a memory hierarchy

[Figure (1997): The memory hierarchy. Level 1 is closest to the CPU; Level 2 through Level n are successively farther away. Both the distance from the CPU in access time and the size of the memory at each level increase as we move down the hierarchy.]
4
Locality
  • A principle that makes having a memory hierarchy
    a good idea
  • temporal locality: if an item is referenced, it
    will tend to be referenced again soon
  • spatial locality: nearby items will tend to be
    referenced soon
  • Why does code have locality? (see the sketch after
    this list)
  • Our initial focus: two levels (upper, lower)
  • block: minimum unit of data
  • hit: data requested is in the upper level
  • miss: data requested is not in the upper level
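As a rough illustration (not from the slides) of why code has locality, consider a simple C loop: the scalar sum and the loop instructions are reused on every iteration (temporal locality), and the array elements are touched in sequential order (spatial locality).

    /* Sketch: why typical code exhibits locality. */
    long sum_array(const int *a, int n)
    {
        long sum = 0;                  /* sum and i reused every iteration (temporal) */
        for (int i = 0; i < n; i++)
            sum += a[i];               /* a[0], a[1], ... accessed in order (spatial) */
        return sum;
    }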

5
Cache
  • Two issues:
  • How do we know if a data item is in the cache?
  • If it is, how do we find it?
  • Our first example:
  • block size is one word of data
  • "direct mapped"

For each item of data at the lower level, there
is exactly one location in the cache where it
might be, so lots of items at the lower level
share locations in the upper level.
6
Direct Mapped Cache
  • Mapping: cache location = address modulo the number
    of blocks in the cache

7
Example
  • Memory: 32 words => 5 address bits
  • Cache: 8 words => 3 index bits
  • Memory addresses 00001, 01001, 10001, 11001 all map to
    cache location 001 (use the lower log2(8) = 3 bits,
    equivalent to mod 8)
  • Similarly, 00101, 01101, 10101, 11101 all map to
    cache location 101 (see the sketch below)
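A minimal sketch of this mapping in C (hypothetical helper; assumes the number of cache blocks is a power of two, so the modulo is simply the low-order address bits):

    #include <stdint.h>

    /* Direct-mapped placement: index = (block address) mod (number of cache blocks).
       With 8 blocks, word addresses 00001, 01001, 10001, 11001 (binary) all map
       to index 001, i.e. the low log2(8) = 3 bits of the address. */
    unsigned cache_index(uint32_t word_address, unsigned num_blocks)
    {
        return word_address % num_blocks;   /* == word_address & (num_blocks - 1) */
    }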

8
Tags
  • The mapping from memory location to cache
    location is straightforward (a many-to-one
    mapping), but how about the other way around
    (a one-to-many mapping)?
  • In other words, how do we know whether the data
    in the cache corresponds to a requested word?
  • We need to add a set of tags to the cache. These
    tags contain the address information required to
    identify whether a hit occurs.
  • Tags contain the information we throw away in the
    many-to-one mapping, i.e., the upper portion of
    the address.
  • For example, we need 2-bit tags in Figure 7.5.
  • We also need an additional bit, called the valid
    bit, to indicate whether a cache block has valid
    information (the cache is empty at start-up).
  • Figure 7.6 illustrates the contents of an 8-word
    direct-mapped cache as it responds to a series of
    requests from the processor.

9
Direct Mapped Cache
  • For MIPS:
  • What kind of locality are we taking
    advantage of?

[Figure: A direct-mapped MIPS cache (address shown with bit positions). The 32-bit address splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0). The index selects one of 1024 entries (0, 1, 2, ..., 1021, 1022, 1023), each holding a valid bit, a 20-bit tag, and 32 bits of data; a hit occurs when the valid bit is set and the stored tag matches the address tag.]
10
Cache Size
  • Each cache entry = block (data) size + tag size + valid
    field size
  • Assuming a 32-bit byte address, a direct-mapped
    cache of 2^n words with one-word (4-byte)
    blocks requires 2^n x (32 + (32 - n - 2) + 1)
    = 2^n x (63 - n) bits
  • Example: How many total bits are required for a
    direct-mapped cache with 64 KB of data and
    one-word blocks, assuming a 32-bit address? 64 KB
    = 16 K words = 2^14 words = 2^14 blocks. Each
    block has 32 bits of data plus a tag, which is
    32 - 14 - 2 = 16 bits, plus a valid bit. Total cache size
    = 2^14 x (32 + (32 - 14 - 2) + 1) = 2^14 x 49 = 784 x
    2^10 = 784 Kbits = 98 KB. (A sketch of this calculation
    appears below.)
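A small sketch of the same formula in C (hypothetical function; assumes a 32-bit byte address, one-word blocks, and a direct-mapped cache of 2^n blocks):

    #include <stdint.h>

    /* Total bits = 2^n x (32 data + (32 - n - 2) tag + 1 valid) = 2^n x (63 - n).
       For n = 14 (64 KB of data): 2^14 x 49 = 784 Kbits = 98 KB. */
    uint64_t direct_mapped_cache_bits(unsigned n)
    {
        uint64_t blocks = 1ULL << n;
        return blocks * (32 + (32 - n - 2) + 1);
    }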

11
Hits vs. Misses
  • Read hits
  • this is what we want!
  • Read misses
  • stall the CPU, fetch the block from memory, deliver
    it to the cache, restart the access
  • different from a pipeline stall, where some
    instructions continue executing while others
    are stalled
  • Write hits
  • can replace the data in both cache and memory
    (write-through)
  • write the data to the cache and to a write buffer
    (write buffer)
  • write the data only into the cache, updating memory
    later (write-back)
  • Write misses
  • read the entire block into the cache, then write
    the word

12
Dealing with an instruction cache miss
  • Send the original PC value (current PC - 4) to
    the memory.
  • Instruct the main memory to perform a read and
    wait for the memory to complete its access.
  • Write the cache entry.
  • Restart the instruction execution at the first
    step, which will refetch the instruction, this
    time finding it in the cache.
  • What about a data cache miss?

13
DECStation 3100 Cache
  • One instruction cache, one data cache
  • Each cache is 64 KB, or 16K words, with a one-word
    block
  • Uses a write-through scheme, writing the data into
    both the memory and the cache.

[Figure: The DECStation 3100 cache (address shown with bit positions). The 32-bit address splits into a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2), and a 2-bit byte offset (bits 1-0). The index selects one of 16K entries, each holding a valid bit, a 16-bit tag, and 32 bits of data; a tag match on a valid entry signals a hit.]
14
Direct Mapped Cache
  • Taking advantage of spatial locality with multiword
    blocks: Cache block = (Block address) mod (Number of
    cache blocks)

15
Mapping an Address to a Multiword Cache Block
  • Consider a cache with 64 blocks and a block size
    of 16 bytes. What block number does byte address
    1200 map to?
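Working through the arithmetic: the block address is 1200 / 16 = 75, and 75 mod 64 = 11, so byte address 1200 maps to cache block 11 (the block holding byte addresses 1200 through 1215).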

16
Miss Rate versus Block Size

[Figure: Miss rate (%) versus block size (4, 16, 64, and 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.]
17
Hardware Issues
  • Make reading multiple words easier by using banks
    of memory
  • It can get a lot more complicated...

18
Performance
  • Increasing the block size tends to decrease the miss
    rate
  • Use split (instruction and data) caches because there
    is more spatial locality in code

[Figure: Miss rate (%) versus block size (4 to 256 bytes) for cache sizes of 1 KB to 256 KB, repeated from slide 16.]
19
Performance
  • Simplified model: execution time = (execution
    cycles + stall cycles) x cycle time, where stall
    cycles = # of instructions x miss ratio x miss
    penalty (see the sketch below)
  • Two ways of improving performance:
  • decreasing the miss ratio
  • decreasing the miss penalty
  • What happens if we increase block size?
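A sketch of the simplified model as a C helper (hypothetical names; all inputs are assumptions supplied by the caller, not measured values):

    /* execution time = (execution cycles + stall cycles) x cycle time
       stall cycles   = instruction count x miss ratio x miss penalty  */
    double exec_time(double exec_cycles, double instr_count,
                     double miss_ratio, double miss_penalty, double cycle_time)
    {
        double stall_cycles = instr_count * miss_ratio * miss_penalty;
        return (exec_cycles + stall_cycles) * cycle_time;
    }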

20
Calculating Cache Performance
  • Assume an instruction cache miss rate for gcc of
    2% and a data cache miss rate of 4%. If a machine
    has a CPI of 2 without any memory stalls and the
    miss penalty is 40 cycles for all misses,
    determine how much faster the machine would run
    with a perfect cache that never missed. (Use the
    instruction frequencies for gcc from Fig. 4.54;
    a sketch of the calculation follows.)
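A sketch of the calculation, assuming (per the gcc data in Fig. 4.54) that loads and stores are roughly 36% of the instruction mix: instruction-miss stalls = I x 2% x 40 = 0.80 I cycles; data-miss stalls = I x 36% x 4% x 40, roughly 0.58 I cycles; so the CPI with stalls is about 2 + 1.4 = 3.4, and the machine with a perfect cache is about 3.4 / 2 = 1.7 times faster.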

21
Cache Performance with Increased Clock Rate
  • Relative cache penalties increase as machine
    becomes faster!

22
Direct-mapped vs Fully-Associative Schemes
  • In a direct-mapped scheme, there is a direct
    mapping from any block address in memory to a
    single location in the upper level of the
    hierarchy.
  • There exist other possibilities. For example, a
    block can be placed in any location in the cache.
    This is called the fully associative scheme.
  • In a fully associative scheme, we need to search
    all entries in the cache to find a given
    block.
  • Additional hardware (a comparator per entry) is
    required. Thus a fully associative scheme is
    practical only for caches with a small number of
    blocks.
  • Between direct-mapped and fully associative is
    the set-associative scheme.

23
Set-Associative Scheme
  • In a set-associative cache, there are a fixed
    number of locations (at least 2) where each block
    can be placed.
  • A set-associative cache with n locations for a
    block is called an n-way set-associative cache.
  • An n-way set-associative cache consists of a number
    of sets, each of which consists of n blocks.
  • The set containing a memory block = (Block
    number) modulo (Number of sets in the cache)
  • We then need to search within the set to find the
    block (a sketch follows).
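A minimal sketch in C of locating a block in an n-way set-associative cache (hypothetical structure and names; real hardware compares the n tags in parallel rather than in a loop):

    #include <stdbool.h>
    #include <stdint.h>

    struct line { bool valid; uint32_t tag; /* data omitted */ };

    /* ways = blocks per set (n); num_sets = number of sets in the cache. */
    bool cache_hit(const struct line *lines, unsigned num_sets, unsigned ways,
                   uint32_t block_number, uint32_t tag)
    {
        unsigned set = block_number % num_sets;        /* select the set */
        for (unsigned w = 0; w < ways; w++)            /* search the set */
            if (lines[set * ways + w].valid && lines[set * ways + w].tag == tag)
                return true;
        return false;
    }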

24
Location of Memory Block -- Example
  • Given a memory block whose address is 12
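A sketch of the answer, assuming (as in the textbook's accompanying figure) a cache with eight blocks: in a direct-mapped cache, block 12 can go only in location 12 mod 8 = 4; in a two-way set-associative cache (four sets), it can go in either block of set 12 mod 4 = 0; in a fully associative cache, it can go in any of the eight locations.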

25
Decreasing miss ratio with associativity
  • Note, however, that hit time is increased with
    increasing degree of associativity.

26
Locating a Block in the Cache
  • Each block in a set-associative cache includes an
    address tag that gives the block address.
  • The index value is used to select the set
    containing the address of interest.
  • Block offset is the address of the desired data
    within the block.
  • What is the value of block offset in a
    direct-mapped cache?

[Address fields: Tag | Index | Block Offset]
27
Size of Tags vs Set Associativity
  • If the total cache size is kept the same,
    increasing the degree of associativity increases
    the number of blocks per set, which is the number
    of parallel comparisons required.
  • Each increase by a factor of two in associativity
    doubles the number of blocks per set and halves
    the number of sets.
  • Thus each increase by a factor of two in
    associativity decreases the size of the index by
    1 bit and increases the size of the tag by 1 bit.

28
An implementation (4-way set associative)
29
Example
  • Assume a cache of 4K blocks and a 32-bit address
  • 4K = 2^12
  • For a direct-mapped scheme, we need 32 - 12 = 20 bits
    for a tag. Total # of tags = 4K. Thus the total
    number of tag bits = 20 x 4K = 80 Kbits.
  • For a 2-way set-associative scheme, we need 32 - 11 = 21
    bits for a tag. Total # of tags = 2 x 2K = 4K,
    thus the total number of tag bits = 21 x 4K = 84
    Kbits.
  • For a 4-way set-associative scheme, we need 32 - 10 = 22
    bits for a tag. Total # of tags = 4 x 1K = 4K,
    thus the total number of tag bits = 22 x 4K = 88
    Kbits.
  • For a fully associative scheme, we need 32 bits
    for a tag. Total # of tags = 4K x 1 = 4K, thus the
    total number of tag bits = 32 x 4K = 128 Kbits.
  • The choice among direct-mapped, set-associative
    or fully associative mapping depends on the cost
    of a miss versus the cost of implementing
    associativity, both in time and in extra
    hardware.

30
Choosing Which Block to Replace
  • In an associative cache, we have a choice of
    where to place the requested block, and hence a
    choice of which block to replace when a miss
    occurs.
  • The most commonly used scheme is least recently
    used (LRU).
  • The block replaced is the one that has been
    unused for the longest time.
  • Again, extra hardware is needed to keep track of
    the usage.

31
Performance
[Figure: Miss rate (%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes of 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, and 128 KB.]
32
Decreasing miss penalty with multilevel caches
  • Add a second-level cache
  • often the primary cache is on the same chip as the
    processor
  • use SRAMs to add another cache above primary
    memory (DRAM)
  • the miss penalty goes down if the data is in the
    2nd-level cache
  • Example (worked below):
  • CPI of 1.0 on a 500 MHz machine with a 5% miss
    rate and 200 ns DRAM access
  • Adding a 2nd-level cache with a 20 ns access time
    decreases the miss rate to main memory to 2%
  • Using multilevel caches:
  • try to optimize the hit time on the 1st-level
    cache
  • try to optimize the miss rate on the 2nd-level
    cache
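A sketch of the arithmetic: at 500 MHz the cycle time is 2 ns, so the miss penalty to DRAM is 200 ns / 2 ns = 100 cycles, and the effective CPI is 1.0 + 5% x 100 = 6.0. With the 2nd-level cache, a primary miss that hits in the 2nd level costs 20 ns / 2 ns = 10 cycles, giving CPI = 1.0 + 5% x 10 + 2% x 100 = 3.5, so the machine is 6.0 / 3.5, or about 1.7 times, faster.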

33
Virtual Memory Motivations
  • To allow efficient and safe sharing of memory
    among multiple programs.
  • To remove the programming burdens of a small,
    limited amount of main memory.

34
Virtual Memory
  • Main memory can act as a cache for the secondary
    storage (disk)
  • Advantages
  • illusion of having more physical memory
  • program relocation
  • protection

35
Pages: virtual memory blocks
  • Page fault: the data is not in memory; retrieve
    it from disk
  • huge miss penalty, thus pages should be fairly
    large (e.g., 4 KB)
  • reducing page faults is important (LRU is worth
    the price)
  • faults can be handled in software instead of
    hardware
  • using write-through is too expensive, so we use
    write-back

36
Placing a Page and Finding It Again
  • We want the ability to use a clever and flexible
    replacement scheme.
  • We want to reduce the page fault rate.
  • Fully associative placement serves our purposes.
  • But a full search is impractical, so we locate
    pages by using a table that indexes the
    memory => the page table (which resides in memory).
  • Each program has its own page table, which maps
    the virtual address space of that program to main
    memory (a lookup sketch follows).
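A minimal sketch in C of a page-table lookup (hypothetical types and names; assumes 4 KB pages, i.e. a 12-bit page offset):

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_OFFSET_BITS 12                         /* 4 KB pages */

    struct pte { bool valid; uint32_t physical_page; };

    /* Translate a virtual address with a per-program page table.
       Returns false when the valid bit is off: a page fault, which the
       operating system handles by bringing the page in from disk. */
    bool translate(const struct pte *page_table, uint32_t vaddr, uint32_t *paddr)
    {
        uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;    /* virtual page number */
        uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);

        if (!page_table[vpn].valid)
            return false;                               /* page fault */

        *paddr = (page_table[vpn].physical_page << PAGE_OFFSET_BITS) | offset;
        return true;
    }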

37
Page Table Register
[Figure: Address translation with a page table. The page table register points to the start of the page table. The virtual address splits into a 20-bit virtual page number (bits 31-12) and a 12-bit page offset (bits 11-0); the virtual page number indexes the page table to obtain an 18-bit physical page number, which is concatenated with the page offset to form the physical address (bits 29-0).]
38
Process
  • The page table, together with the program counter
    and the registers, specifies the state of a
    program.
  • If we want to allow another program to use the
    CPU, we must save this state.
  • We often refer to this state as a process.
  • A process is considered active when it is in
    possession of the CPU.

39
Dealing With Page Faults
  • When the valid bit for a virtual page is off, a
    page fault occurs.
  • The operating system takes over, and the transfer
    is done with the exception mechanism.
  • The OS must find the page in the next level of
    hierarchy, and decide where to place the
    requested page in the main memory.
  • LRU policy is often used.

40
Page Tables
41
Page Tables

42
Making Address Translation Fast
  • A cache for address translations: the
    translation-lookaside buffer (TLB)

[Figure: The TLB acts as a cache on the page table. TLB entries map virtual page numbers to locations in physical memory; page table entries hold either a physical page or a disk address, and pages not in memory reside in disk storage.]
43
Integrating VM, TLBs and Caches
[Figure: Integrating the TLB and the cache. The virtual page number (address bits 31-12) is looked up in the TLB, whose entries hold valid and dirty bits, a tag, and a physical page number. On a TLB hit, the physical page number is combined with the page offset to form the physical address; the physical address tag and index are then used to access the cache.]
44
TLBs and caches
45
Implementing Protection with Virtual Memory
  • The OS takes care of this.
  • Hardware needs to provide at least three
    capabilities:
  • support at least two modes that indicate whether
    the running process is a user process or an OS
    process (kernel process, supervisor process,
    executive process)
  • provide a portion of the CPU state that a user
    process can read but not write
  • provide mechanisms whereby the CPU can go from
    user mode to supervisor mode and back

46
A Common Framework for Memory Hierarchies
  • Question 1: Where can a block be placed?
  • Question 2: How is a block found?
  • Question 3: Which block should be replaced on a
    cache miss?
  • Question 4: What happens on a write?

47
The Three Cs
  • Compulsory misses (the first access to a block can
    never hit)
  • Capacity misses (the cache cannot hold all the blocks
    the program needs)
  • Conflict misses (multiple blocks compete for the same
    set in a direct-mapped or set-associative cache)

48
Modern Systems
  • Very complicated memory systems

49
Some Issues
  • Processor speeds continue to increase very
    fast, much faster than either DRAM or disk
    access times
  • Design challenge: dealing with this growing
    disparity
  • Trends:
  • synchronous SRAMs (provide a burst of data)
  • redesign DRAM chips to provide higher bandwidth
    or processing
  • restructure code to increase locality
  • use prefetching (make the cache visible to the ISA)