Memory Subsystem and Cache - PowerPoint PPT Presentation

1
Memory Subsystem and Cache
  • Adapted from lecture notes of Dr. Patterson and
    Dr. Kubiatowicz of UC Berkeley

2
The Big Picture
[Figure: the five classic components of a computer: Processor (Control and Datapath), Memory, Input, Output]
3
Technology Trends
              Capacity          Speed (latency)
  Logic:      2x in 3 years     2x in 3 years
  DRAM:       4x in 3 years     2x in 10 years
  Disk:       4x in 3 years     2x in 10 years

  DRAM generations:
  Year    Size      Cycle Time
  1980    64 Kb     250 ns
  1983    256 Kb    220 ns
  1986    1 Mb      190 ns
  1989    4 Mb      165 ns
  1992    16 Mb     145 ns
  1995    64 Mb     120 ns

  Over these generations, capacity grew about 1000:1 while speed improved only about 2:1.
4
Technology Trends (cont'd)
Processor-DRAM Memory Gap (latency)
[Figure: performance vs. year, 1980-2000, log scale (1 to 1000). CPU ("Moore's Law"): µProc +60%/yr. (2X/1.5 yr). DRAM ("Less' Law?"): +9%/yr. (2X/10 yrs). The Processor-Memory Performance Gap grows 50% per year.]
5
The Goal: Large, Fast, Cheap Memory!
  • Fact:
  • Large memories are slow
  • Fast memories are small
  • How do we create a memory that is large, cheap,
    and fast (most of the time)?
  • Hierarchy
  • Parallelism

6
  • By taking advantage of the principle of locality:
  • Present the user with as much memory as is
    available in the cheapest technology.
  • Provide access at the speed offered by the
    fastest technology.

[Figure: the memory hierarchy, from the Processor outward: Registers (in the Datapath, with Control), On-Chip Cache, Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk). Speed (ns): 1, 10, 100, 10,000,000 (10 ms), 10,000,000,000 (10 sec). Size (bytes): 100s, Ks, Ms, Gs, Ts.]
7
Today's Situation
  • Rely on caches to bridge the
    microprocessor-DRAM performance gap
  • Time of a full cache miss, in instructions
    executed:
  • 1st Alpha (7000): 340 ns / 5.0 ns = 68 clks x 2,
    or 136 instructions
  • 2nd Alpha (8400): 266 ns / 3.3 ns = 80 clks x 4,
    or 320 instructions
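The miss-cost arithmetic above can be sketched in a few lines (the latencies and issue widths are the slide's figures; the function name is mine):

```python
# Full cache-miss cost, measured in instructions the processor could have
# executed in the meantime. Truncate to whole clocks, as the slide does.
def miss_cost_in_instructions(miss_ns, cycle_ns, issue_width):
    clocks = int(miss_ns / cycle_ns)   # clocks spent waiting on the miss
    return clocks * issue_width        # instructions forgone in that time

print(miss_cost_in_instructions(340, 5.0, 2))  # 1st Alpha (7000): 136
print(miss_cost_in_instructions(266, 3.3, 4))  # 2nd Alpha (8400): 320
```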

8
Memory Hierarchy (1/4)
  • Processor
  • executes programs
  • runs on the order of nanoseconds to picoseconds
  • needs to access code and data for programs:
    where are these?
  • Disk
  • HUGE capacity (virtually limitless)
  • VERY slow: runs on the order of milliseconds
  • so how do we account for this gap?

9
Memory Hierarchy (2/4)
  • Memory (DRAM)
  • smaller than disk (not limitless capacity)
  • contains a subset of the data on disk: basically
    the portions of programs that are currently being run
  • much faster than disk, so memory accesses don't slow
    down the processor quite as much
  • Problem: memory is still too slow (hundreds of
    nanoseconds)
  • Solution: add more layers (caches)

10
Memory Hierarchy (3/4)
[Figure: pyramid of levels; levels closer to the processor are labeled Higher, levels farther away are labeled Lower]
11
Memory Hierarchy (4/4)
  • If a level is closer to the Processor, it must be:
  • smaller
  • faster
  • a subset of all levels farther away (it contains the
    most recently used data)
  • Each level farther from the Processor contains at
    least all the data in the levels above it
  • Lowest Level (usually disk) contains all
    available data

12
Analogy: Library
  • You're writing a term paper (Processor) at a
    table in Evans
  • Evans Library is equivalent to disk
  • essentially limitless capacity
  • very slow to retrieve a book
  • Table is memory
  • smaller capacity: you must return a book when
    the table fills up
  • easier and faster to find a book there once
    you've already retrieved it

13
Analogy: Library (cont'd)
  • Open books on the table are the cache
  • smaller capacity: only very few open books fit
    on the table; again, when the table fills up, you
    must close a book
  • much, much faster to retrieve data
  • Illusion created: the whole library is open on the
    tabletop
  • Keep as many recently used books open on the table as
    possible, since they are likely to be used again
  • Also keep as many books on the table as possible,
    since that is faster than going to the library

14
Memory Hierarchy Basics
  • Disk contains everything.
  • When the Processor needs something, bring it into
    all the levels of memory closer to the Processor.
  • Cache contains copies of data in memory that are
    being used.
  • Memory contains copies of data on disk that are
    being used.
  • Entire idea is based on Temporal Locality: if we
    use it now, we'll want to use it again soon (a
    Big Idea)

15
Caches: Why Does it Work?
  • Temporal Locality (Locality in Time)
  • => Keep most recently accessed data items closer
    to the processor
  • Spatial Locality (Locality in Space)
  • => Move blocks consisting of contiguous words to
    the upper levels

16
Cache Design Issues
  • How do we organize the cache?
  • Where does each memory address map to? (Remember
    that the cache is a subset of memory, so multiple
    memory addresses can map to the same cache location.)
  • How do we know which elements are in cache?
  • How do we quickly locate them?

17
Direct Mapped Cache
  • In a direct-mapped cache, each memory address is
    associated with one possible block within the
    cache
  • Therefore, we only need to look in a single
    location in the cache for the data if it exists
    in the cache
  • Block is the unit of transfer between cache and
    memory

18
Direct Mapped Cache (cont'd)
  • Cache Location 0 can be occupied by data from:
  • Memory location 0, 4, 8, ...
  • In general: any memory location that is a multiple
    of 4
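The placement rule is just a modulo. A minimal sketch, assuming (as the slide's "0, 4, 8, ..." example implies) a cache with 4 one-byte locations:

```python
# Direct-mapped placement: each memory location maps to exactly one
# cache location, namely address mod (number of cache locations).
NUM_CACHE_LOCATIONS = 4   # assumed from the slide's example

def cache_location(mem_address):
    return mem_address % NUM_CACHE_LOCATIONS

# Memory locations 0, 4, 8, 12 (multiples of 4) all share cache location 0;
# location 13 lands in cache location 1.
print([cache_location(a) for a in (0, 4, 8, 12, 13)])  # [0, 0, 0, 0, 1]
```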

19
Issues with Direct Mapped Cache
  • Since multiple memory addresses map to the same
    cache index, how do we tell which one is in there?
  • What if we have a block size > 1 byte?
  • Result: divide the memory address into three fields

20
Example of a direct mapped cache
  • For a 2^N byte cache:
  • The uppermost (32 - N) bits are always the Cache
    Tag
  • The lowest M bits are the Byte Select (Block Size
    = 2^M)

[Figure: a 32-bit block address split into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, ex. 0x01), and Byte Select (bits 4-0, ex. 0x00). The tag is stored as part of the cache state. The cache array has rows 0-31, each holding a Valid Bit, a Cache Tag, and 32 bytes of Cache Data: row 0 holds Bytes 0-31; row 1 (tag 0x50 shown) holds Bytes 32-63; ...; row 31 holds Bytes 992-1023.]
21
Terminology
  • All fields are read as unsigned integers.
  • Index: specifies the cache index (which row of
    the cache we should look in)
  • Offset: once we've found the correct block, specifies
    which byte within the block we want
  • Tag: the remaining bits after offset and index
    are determined; these are used to distinguish
    between all the memory addresses that map to the
    same location

22
Terminology (cont'd)
  • Hit: data appears in some block in the upper
    level (example: Block X)
  • Hit Rate: the fraction of memory accesses found in
    the upper level
  • Hit Time: time to access the upper level, which
    consists of:
  • RAM access time + time to determine hit/miss
  • Miss: data needs to be retrieved from a block in
    the lower level (Block Y)
  • Miss Rate: 1 - (Hit Rate)
  • Miss Penalty: time to replace a block in the
    upper level
  • + time to deliver the block to the processor
  • Hit Time << Miss Penalty

[Figure: the processor exchanges Block X with Upper Level Memory, which in turn exchanges Block Y with Lower Level Memory.]
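These terms combine into the usual average memory access time formula. This is not on the slide, but it follows directly from the definitions above:

```python
# Average memory access time (AMAT): every access pays the hit time, and
# the fraction that miss additionally pays the miss penalty.
def amat_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    return hit_time_ns + miss_rate * miss_penalty_ns

# Illustrative numbers (assumed): 1 ns hit, 25% miss rate, 40 ns penalty.
print(amat_ns(1.0, 0.25, 40.0))  # 11.0
```

Because Hit Time << Miss Penalty, even a small miss rate dominates the average.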
23
How is the hierarchy managed?
  • Registers <-> Memory
  • by the compiler (programmer?)
  • Cache <-> Memory
  • by the hardware
  • Memory <-> Disks
  • by the hardware and operating system (virtual
    memory)
  • by the programmer (files)

24
Example
  • Suppose we have 16 KB of data in a direct-mapped
    cache with 4-word blocks
  • Determine the size of the tag, index, and offset
    fields if we're using a 32-bit architecture
  • Offset
  • need to specify the correct byte within a block
  • a block contains 4 words = 16 bytes = 2^4 bytes
  • need 4 bits to specify the correct byte

25
Example (cont'd)
  • Index (index into an array of blocks)
  • need to specify the correct row in the cache
  • the cache contains 16 KB = 2^14 bytes
  • a block contains 2^4 bytes (4 words)
  • rows/cache = blocks/cache (since there's
    one block/row)
    = (bytes/cache) / (bytes/row)
    = 2^14 bytes/cache / 2^4 bytes/row
    = 2^10 rows/cache
  • need 10 bits to specify this many rows

26
Example (cont'd)
  • Tag: use the remaining bits as the tag
  • tag length = mem addr length - offset - index
    = 32 - 4 - 10 bits
    = 18 bits
  • so the tag is the leftmost 18 bits of the memory
    address
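The three field sizes just derived can be checked mechanically. A small sketch with the example's constants:

```python
# 16 KB direct-mapped cache, 16-byte (4-word) blocks, 32-bit addresses.
CACHE_BYTES  = 16 * 1024    # 2^14
BLOCK_BYTES  = 16           # 2^4
ADDRESS_BITS = 32

offset_bits = (BLOCK_BYTES - 1).bit_length()            # log2(16) = 4
rows        = CACHE_BYTES // BLOCK_BYTES                # 2^14 / 2^4 = 2^10 rows
index_bits  = (rows - 1).bit_length()                   # 10
tag_bits    = ADDRESS_BITS - index_bits - offset_bits   # 32 - 10 - 4 = 18

print(offset_bits, index_bits, tag_bits)  # 4 10 18
```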

27
Accessing data in cache
  • Ex.: 16 KB of data, direct-mapped, 4-word blocks
  • Read 4 addresses:
  • 0x00000014, 0x0000001C, 0x00000034, 0x00008014
  • Memory values are shown in a table of Address (hex)
    vs. Value of Word [figure omitted]
  • only the cache/memory level of the hierarchy is
    considered
28
Accessing data in cache (cont'd)
  • 4 Addresses:
  • 0x00000014, 0x0000001C, 0x00000034, 0x00008014
  • 4 Addresses divided (for convenience) into Tag,
    Index, Byte Offset fields:

    Tag (18 bits)        Index (10 bits)   Offset (4 bits)
    000000000000000000   0000000001        0100    (0x00000014)
    000000000000000000   0000000001        1100    (0x0000001C)
    000000000000000000   0000000011        0100    (0x00000034)
    000000000000000010   0000000001        0100    (0x00008014)
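The same split can be done with shifts and masks. A sketch using the example's 18/10/4 field widths:

```python
OFFSET_BITS, INDEX_BITS = 4, 10   # field widths from the example

def split(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)                 # low 4 bits
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1) # next 10 bits
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)              # remaining 18
    return tag, index, offset

for a in (0x00000014, 0x0000001C, 0x00000034, 0x00008014):
    print(hex(a), split(a))
# 0x14 -> (0, 1, 4); 0x1c -> (0, 1, 12); 0x34 -> (0, 3, 4); 0x8014 -> (2, 1, 4)
```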
29
16 KB Direct Mapped Cache, 16B blocks
  • Valid bit determines whether anything is stored
    in that row (when the computer is initially turned on,
    all entries are invalid)

[Figure: the empty 1024-row cache array, addressed by Index; every Valid bit is 0.]
30
Read 0x00000014 000 0..001 0100
  • 000000000000000000 0000000001 0100

[Figure: the Tag, Index, and Offset fields of the address shown above the cache array.]
31
So we read block 1 (0000000001)
  • 000000000000000000 0000000001 0100

[Figure: the Index field (0000000001) selects row 1 of the cache.]
32
No valid data
  • 000000000000000000 0000000001 0100

[Figure: row 1's Valid bit is 0, so this is a miss.]
33
So load that data into cache, setting tag, valid
  • 000000000000000000 0000000001 0100

[Figure: row 1 now holds Valid = 1, Tag = 0, and data words a, b, c, d; all other rows are still invalid.]
34
Read from cache at offset, return word b
  • 000000000000000000 0000000001 0100

[Figure: the Offset (0100, byte 4) selects word b within row 1.]
35
Read 0x0000001C 000 0..001 1100
  • 000000000000000000 0000000001 1100

[Figure: cache state unchanged; row 1 holds Valid = 1, Tag = 0, data a, b, c, d.]
36
Data valid, tag OK, so read offset return word d
  • 000000000000000000 0000000001 1100

[Figure: the Offset (1100, byte 12) selects word d within row 1.]
37
Read 0x00000034 000 0..011 0100
  • 000000000000000000 0000000011 0100

[Figure: cache state unchanged; row 1 holds Valid = 1, Tag = 0, data a, b, c, d.]
38
So read block 3
  • 000000000000000000 0000000011 0100

[Figure: the Index field (0000000011) selects row 3 of the cache.]
39
No valid data
  • 000000000000000000 0000000011 0100

[Figure: row 3's Valid bit is 0, so this is a miss.]
40
Load that cache block, return word f
  • 000000000000000000 0000000011 0100

[Figure: row 3 now holds Valid = 1, Tag = 0, and data words e, f, g, h; row 1 still holds a, b, c, d.]
41
Read 0x00008014 010 0..001 0100
  • 000000000000000010 0000000001 0100

[Figure: cache holds row 1 (Tag 0: a, b, c, d) and row 3 (Tag 0: e, f, g, h).]
42
So read Cache Block 1, Data is Valid
  • 000000000000000010 0000000001 0100

[Figure: the Index field selects row 1; its Valid bit is 1.]
43
Cache Block 1 Tag does not match (0 != 2)
  • 000000000000000010 0000000001 0100

[Figure: row 1's stored Tag is 0, but the address's Tag field is 2: a mismatch, so this is a miss.]
44
Miss, so replace block 1 with new data and tag
  • 000000000000000010 0000000001 0100

[Figure: row 1 now holds Valid = 1, Tag = 2, and data words i, j, k, l; row 3 is unchanged (e, f, g, h).]
45
And return word j
  • 000000000000000010 0000000001 0100

[Figure: the Offset (0100, byte 4) selects word j within row 1.]
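The whole walkthrough can be replayed with a tiny simulator. A sketch: the tag-per-row dictionary stands in for the valid bit and tag array, and data handling is omitted.

```python
OFFSET_BITS, INDEX_BITS = 4, 10   # 16-byte blocks, 1024 rows

def fields(addr):
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag   = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index

tags = {}     # row -> stored tag; a row absent from the dict has Valid = 0
results = []
for addr in (0x00000014, 0x0000001C, 0x00000034, 0x00008014):
    tag, index = fields(addr)
    hit = tags.get(index) == tag   # hit only if valid AND the tags match
    tags[index] = tag              # on a miss, load the block and set its tag
    results.append("hit" if hit else "miss")

print(results)  # ['miss', 'hit', 'miss', 'miss'], matching slides 30-45
```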
46
Things to Remember
  • We would like to have the capacity of disk at the
    speed of the processor; unfortunately, this is not
    feasible.
  • So we create a memory hierarchy:
  • each successively higher level contains the most
    used data from the next lower level
  • Exploit temporal and spatial locality