Title: ECE3055 Computer Architecture and Operating Systems, Lecture 8: Memory Subsystem
1 ECE3055 Computer Architecture and Operating Systems, Lecture 8: Memory Subsystem
- Prof. Hsien-Hsin Sean Lee
- School of Electrical and Computer Engineering
- Georgia Institute of Technology
2 Performance matters
- Consider a basic 5-stage pipeline design
- CPU runs at 1ns cycle time (1GHz)
- Main memory runs at 100ns access time
- How is performance?
- Fetch A (100 cycles)
- Decode A (1 cycle), Fetch B (100 cycles)
- Ex A (1 cy), Dec B (1 cy), Fetch C (100 cycles)
- Effective 1 instr per 100 cycles --> 10MHz CPU
- Latency is killing the system
- Problem is only getting worse
- CPU speeds grow much faster than DRAM speeds
- DRAM bandwidth is improving well
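The arithmetic on this slide can be checked with a minimal sketch. The function name and parameters are illustrative assumptions, and the model assumes every instruction has to wait for its own fetch, so the fetch latency alone sets the throughput.

```python
def effective_rate_hz(clock_hz: float, cycles_per_instruction: float) -> float:
    """Instructions completed per second when each one costs the given cycles."""
    return clock_hz / cycles_per_instruction

# 1 GHz core with a 100 ns memory: every fetch costs 100 cycles.
print(effective_rate_hz(1e9, 100))  # 10000000.0, i.e. a "10MHz CPU"
# Same core when every fetch takes a single cycle.
print(effective_rate_hz(1e9, 1))    # 1000000000.0
```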
3 Typical Solution
- With the large gap in CPU vs. DRAM speeds, we must have a solution: cache ("slave memories")
- Ultra-fast, small local memory
- Cache runs at 1ns access time...
- Fetch A (1 cycle)
- Decode A (1 cycle), Fetch B (1 cycle)
- Ex A (1 cy), Dec B (1 cy), Fetch C (1 cy)
- Effective 1 instr per 1 cyc --> 1GHz CPU
- Special terms
- hit
- miss
- rates (hit/miss)
4 Memories: Two basic types
- SRAM
- value is stored on a pair of inverting gates
- very fast but takes up more space than DRAM (4 to 6 transistors)
- DRAM
- value is stored as a charge on a capacitor (must be refreshed)
- very small but slower than SRAM (factor of 5 to 10)
- ignoring new technologies on the horizon
5 Memory Photos
- Intel Paxville (dual core), 90nm: 8-way 2MB L2 for each core
- 240-pin DDR2 DRAM
- Intel Itanium2, .13µm: 24-way 6MB L3
6 Exploiting Memory Hierarchy
- Users want large and fast memories! For example:
- SRAM access times are 700ps-1ns (1-3 cycles)
- DRAM access times are 60-100ns (100-250 cycles)
- Disk access times are 1 million ns (3M cycles)
- Try and give it to them anyway
- build a memory hierarchy
7 Model of Memory Hierarchy
8 P4 Prescott w/ 2MB L2 (90nm)
- Prescott runs very fast (3.4 GHz)
- 2MB L2 unified cache
- 12K trace cache (think I-cache)
- 16KB data cache
- Where is the cache?
- What about the similar blocks?
- Why the visual differences?
- Why is it square?
- What's with the colors?
- Check this out
- www.chip-architect.com
9 Interfacing Processors and Peripherals
- I/O design is affected by many factors (expandability, resilience)
- Performance: access latency, throughput, the connection between devices and the system, the memory hierarchy, the operating system
- A variety of different users (e.g., banks, supercomputers, engineers)
10 I/O Devices
- Very diverse devices
- behavior (i.e., input vs. output)
- partner (who is at the other end?)
- data rate
11 I/O Example: Disk Drives
- To access data:
- seek: position head over the proper track (3.5-10 ms avg.)
- rotational latency: wait for desired sector (0.5 / RPM)
- transfer: grab the data (one or more sectors), 2 to 15 MB/sec
- not considering disk buffer hits (100-320 MB/s)
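As a rough sketch, the three components of a disk access can be added up as follows. The function name and the example drive (5 ms seek, 6000 RPM, one 512-byte sector, 10 MB/s) are hypothetical values chosen only to fall inside the ranges quoted on the slide.

```python
def disk_access_ms(seek_ms: float, rpm: float, bytes_requested: int,
                   transfer_mb_per_s: float) -> float:
    """Average access time = seek + half a rotation + transfer, in milliseconds."""
    rotational_ms = 0.5 / rpm * 60_000   # half a revolution; minutes -> ms
    transfer_ms = bytes_requested / (transfer_mb_per_s * 1e6) * 1e3
    return seek_ms + rotational_ms + transfer_ms

# Hypothetical drive: 5 ms seek, 6000 RPM, one 512-byte sector at 10 MB/s.
print(round(disk_access_ms(5.0, 6000, 512, 10.0), 2))  # about 10.05 ms
```

In this example the seek and the rotational latency dominate; the transfer itself contributes only about 0.05 ms.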
12 Locality
- A principle that makes having a memory hierarchy a good idea
- Temporal locality: if an item is referenced, it will tend to be referenced again soon
- Spatial locality: nearby items will tend to be referenced soon
- Why does code have locality? What about data locality?
- Our initial focus: two levels (upper, lower)
- block: minimum unit of data
- hit: data requested is in the upper level
- miss: data requested is not in the upper level
13 Cache
- Two issues
- How do we know if a data item is in the cache?
- If it is, how do we find it?
- Our first example
- block size is one word of data
- "direct mapped"
For each item of data at the lower level, there is exactly one location in the cache where it might be; that is, lots of items at the lower level share locations in the upper level.
14 Direct Mapped Cache
- Mapping: address is modulo the number of blocks in the cache
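A minimal sketch of this mapping, assuming an 8-block cache and block (not byte) addresses; the names `NUM_BLOCKS` and `cache_index` are illustrative only.

```python
NUM_BLOCKS = 8  # assumed cache size in blocks

def cache_index(block_address: int, num_blocks: int = NUM_BLOCKS) -> int:
    """Direct-mapped placement: block address modulo the number of blocks."""
    return block_address % num_blocks

# Block addresses 1, 9, 17 and 25 all compete for cache location 1.
print([cache_index(b) for b in (1, 9, 17, 25)])  # [1, 1, 1, 1]
```

When the number of blocks is a power of two, the modulo is simply the low-order bits of the block address.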
15 Direct Mapped Cache
- For MIPS
- What kind of locality are we taking advantage of?
16-27 Example DM, 8-Entry, 4B
Instruction sequence: lw 0, 24 (a); lw 1, 28 (b); sw 2, 60 (c); sw 3, 188 (d)
Main Memory (System): a at address 24, b at 28, c at 60, d at 188
[Figure: 8-entry direct-mapped cache, one row per index 0-7, with columns Valid, Tag, and Data (4 bytes); all valid bits start at 0]
First access, lw 0, 24:
- 24 is 0001 1000
- 4-byte block, drop low 2 bits for byte offset! Only matters for byte-addressable systems
- Next log2(8) = 3 bits (block address mod 8) give the index: 110 = 6
- The remaining upper bits are the tag: 000
- Entry 6 is invalid, so this is a miss: copy a into entry 6, set its tag to 000, then set its valid bit to 1
Cache state afterwards:
Index  Valid  Tag  Data (4 bytes)
0      0
1      0
2      0
3      0
4      0
5      0
6      1      000  a copy
7      0
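The address split used in this example can be sketched in a few lines. The function name `split_address` and its defaults (8 entries, 4-byte blocks, byte-addressable memory, power-of-two sizes) are assumptions made for illustration.

```python
def split_address(addr: int, num_entries: int = 8, block_bytes: int = 4):
    """Return (tag, index, byte_offset) for a direct-mapped cache.

    Assumes num_entries and block_bytes are powers of two.
    """
    offset_bits = block_bytes.bit_length() - 1   # log2(4) = 2
    index_bits = num_entries.bit_length() - 1    # log2(8) = 3
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_entries - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

for addr in (24, 28, 60, 188):
    tag, index, offset = split_address(addr)
    print(f"{addr:3d} = {addr:08b}: tag {tag:03b}, index {index}, offset {offset}")
```

Addresses 28, 60 and 188 all yield index 7 with different tags, which is why the tag comparison is needed to tell them apart.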
28-31 Example DM, 8-Entry, 4B
Second access, lw 1, 28:
- 28 is 0001 1100
- Byte offset 00, index 111 = 7, tag 000
- Entry 7 is invalid, so this is a miss: copy b into entry 7, set its tag to 000, then set its valid bit to 1
Cache state afterwards: entry 6 holds a copy (valid 1, tag 000), entry 7 holds b copy (valid 1, tag 000); entries 0-5 are invalid
32-36 Example DM, 8-Entry, 4B
Third access, sw 2, 60:
- 60 is 0011 1100
- Byte offset 00, index 111 = 7, tag 001
- Entry 7 already holds b copy with tag 000. It's valid! How to tell it's the wrong address?
- The tags don't match (001 vs. 000)! It's not what we want to access!
- Replace the entry: clear the valid bit of entry 7, bring c into entry 7 with tag 001, then set the valid bit to 1
Cache state afterwards: entry 6 holds a copy (valid 1, tag 000), entry 7 holds c copy (valid 1, tag 001); entries 0-5 are invalid
37-40 Example DM, 8-Entry, 4B
Q: What if the machine is only word-addressable?
- 60 is 0011 1100
- There is no byte offset to drop, so the low 3 bits (100 = 4) become the index and the upper 5 bits (00111) become the tag
- Tag is 2 bits larger, otherwise same (note index/data change!)
[Figure: cache state from the byte-addressable case, with entry 6 holding a copy (tag 000) and entry 7 holding c copy (tag 001)]
41-51 Example DM, 8-Entry, 4B
Q: What about writing back to memory?
- After sw 2, 60 the cache holds c NEW in entry 7 (tag 001) while main memory still holds c OLD at address 60
- Do we update memory now? Or later? Assume later...
Fourth access, sw 3, 188:
- 188 is 1011 1100
- Byte offset 00, index 111 = 7, tag 101; entry 7 is valid but its tag is 001
- Now what? How do we know to write back?
- Need extra state! The dirty bit! The cache gains a Dirty column: entry 7 (c NEW) has dirty = 1, entry 6 (a copy) has dirty = 0
- Write c NEW back to address 60 in main memory, then clear the valid and dirty bits of entry 7
- Bring d into entry 7 with tag 101 and set the valid bit to 1 (dirty is still 0)
- Perform the store: entry 7 now holds d NEW with dirty = 1, while main memory still holds d OLD at address 188
Final cache state:
Index  Valid  Dirty  Tag  Data (4 bytes)
0      0      0
1      0      0
2      0      0
3      0      0
4      0      0
5      0      0
6      1      0      000  a copy
7      1      1      101  d NEW
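The whole walk-through can be replayed with a small simulation. This is only a sketch under stated assumptions: the class and field names are invented here, memory is modeled as a dictionary keyed by block-aligned byte address, and the policy is write-back with write-allocate (a store miss first loads the block).

```python
from dataclasses import dataclass

@dataclass
class Line:
    valid: bool = False
    dirty: bool = False
    tag: int = 0
    data: str = ""

class WriteBackCache:
    """Direct-mapped, write-back, write-allocate cache (illustrative sketch)."""

    def __init__(self, memory: dict, num_entries: int = 8, block_bytes: int = 4):
        self.memory = memory
        self.num_entries = num_entries
        self.block_bytes = block_bytes
        self.lines = [Line() for _ in range(num_entries)]

    def _locate(self, addr: int):
        block = addr // self.block_bytes          # drop the byte offset
        return block % self.num_entries, block // self.num_entries  # index, tag

    def _fill(self, addr: int) -> Line:
        """Return the line for addr, handling a miss (and write-back) if needed."""
        index, tag = self._locate(addr)
        line = self.lines[index]
        if line.valid and line.tag == tag:
            return line                           # hit
        if line.valid and line.dirty:             # evicting a modified block
            old_addr = (line.tag * self.num_entries + index) * self.block_bytes
            self.memory[old_addr] = line.data     # write it back first
        self.lines[index] = Line(True, False, tag, self.memory[addr])
        return self.lines[index]

    def load(self, addr: int) -> str:
        return self._fill(addr).data

    def store(self, addr: int, value: str) -> None:
        line = self._fill(addr)
        line.data, line.dirty = value, True       # memory is not updated yet

memory = {24: "a", 28: "b", 60: "c OLD", 188: "d OLD"}
cache = WriteBackCache(memory)
cache.load(24)
cache.load(28)
cache.store(60, "c NEW")    # replaces b (clean); memory still holds "c OLD"
cache.store(188, "d NEW")   # replaces dirty c NEW, which is written back first
print(memory[60], "|", memory[188], "|", cache.lines[7].data)
```

After the four accesses, memory holds c NEW at address 60 but still d OLD at 188, and entry 7 is valid and dirty with tag 5 (binary 101).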
52 DM Thoughts
- Trade-Offs
- Write-back or Write-Through?
- Write-Alloc or No-Write-Alloc?
- How does Tag change with # of Entries?
- How does minimum machine word size impact tag?
- What kind of locality are we taking advantage of?
53 Direct Mapped Cache
- Taking advantage of spatial locality
54 Hits vs. Misses
- Read hits
- this is what we want!
- Read misses
- stall the CPU, fetch block from memory, deliver to cache, restart
- Write hits
- can replace data in cache and memory (write-through)
- write the data only into the cache (write-back the cache later)
- Write misses
- read the entire block into the cache, then write the word?
55 Hardware Issues
- Make reading multiple words easier by using banks of memory
- It can get a lot more complicated...
56 Performance
- Increasing the block size tends to decrease miss rate
- Use split caches because there is more spatial locality in code
57 Performance
- Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss ratio × miss penalty
- Two ways of improving performance
- decreasing the miss ratio
- decreasing the miss penalty
- What happens if we increase block size?
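The simplified model translates directly into code. The function names and the example workload (one million instructions at one cycle each, 2% miss ratio, 100-cycle penalty, 1 ns cycle) are hypothetical values picked for illustration.

```python
def stall_cycles(instructions: int, miss_ratio: float, miss_penalty: float) -> float:
    """stall cycles = # of instructions x miss ratio x miss penalty"""
    return instructions * miss_ratio * miss_penalty

def execution_time(exec_cycles: float, stalls: float, cycle_time_s: float) -> float:
    """execution time = (execution cycles + stall cycles) x cycle time"""
    return (exec_cycles + stalls) * cycle_time_s

# Hypothetical workload: 1M instructions, 1 cycle each, 1 ns cycle time.
n = 1_000_000
base = execution_time(n, stall_cycles(n, 0.02, 100), 1e-9)        # about 3 ms
lower_ratio = execution_time(n, stall_cycles(n, 0.01, 100), 1e-9)  # about 2 ms
lower_penalty = execution_time(n, stall_cycles(n, 0.02, 50), 1e-9) # about 2 ms
print(base, lower_ratio, lower_penalty)
```

Halving either the miss ratio or the miss penalty removes the same number of stall cycles in this model. A larger block size typically pulls the two factors in opposite directions: the miss ratio tends to fall while the miss penalty grows, because more data must be transferred per miss.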
58 Decreasing miss ratio with associativity
- Compared to direct mapped, give a series of references that
- results in a lower miss ratio using a 2-way set associative cache
- results in a higher miss ratio using a 2-way set associative cache
- (assuming least recently used replacement strategy)
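One possible answer to this exercise can be checked with a small miss counter. The sketch below assumes a 4-block cache, block addresses as references, and LRU replacement within each set; `count_misses` and the two reference sequences are illustrative choices, not the only valid ones.

```python
from collections import OrderedDict

def count_misses(block_refs, num_blocks: int, ways: int) -> int:
    """Count misses in a set-associative cache with LRU replacement.

    ways=1 gives a direct-mapped cache. block_refs are block addresses.
    """
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]   # oldest entry first
    misses = 0
    for block in block_refs:
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)          # mark as most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)     # evict the least recently used block
            s[block] = True
    return misses

better = [0, 4, 0, 4]        # 0 and 4 collide in the direct-mapped cache
worse = [0, 2, 4, 0, 2, 4]   # three blocks cycle through one 2-entry set
for refs in (better, worse):
    print(count_misses(refs, 4, 1), count_misses(refs, 4, 2))
# prints "4 2" and then "5 6"
```

In the second sequence the direct-mapped cache keeps block 2 in its own entry, while the 2-way cache maps all three blocks to the same set and LRU evicts each block just before it is reused.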
59 An implementation
60 Set-Associative Cache
- Multiple cache blocks (lines) can be allocated into the same set
- When the set is full, some block must be evicted from the cache
- Need to consider the locality
- Replacement policy
- Last-In First-Out (LIFO), like a stack
- Random
- First-In First-Out (FIFO)
- Least Recently Used (LRU)
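61 Least Recently Used (LRU)
[Figure: LRU stack of four blocks A, B, C, D with positions labeled MRU, MRU-1, LRU+1, LRU, updated step by step by the sequence Access C, Access D, Access E, Access C, Access G]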
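The access sequence on this slide can be traced with a short sketch. It assumes a full 4-entry set whose initial order is A, B, C, D from MRU to LRU; that ordering is a guess, since the figure itself did not survive, and the function name `access` is illustrative.

```python
def access(stack, block):
    """Update an LRU stack (index 0 = MRU, last = LRU) of a full set.

    Returns the evicted block, or None on a hit.
    """
    evicted = None
    if block in stack:
        stack.remove(block)       # hit: pull the block out of its position
    else:
        evicted = stack.pop()     # miss: the LRU block at the bottom is replaced
    stack.insert(0, block)        # the accessed block becomes the MRU
    return evicted

stack = ["A", "B", "C", "D"]      # assumed initial order, MRU first
for block in ["C", "D", "E", "C", "G"]:
    evicted = access(stack, block)
    note = f", evicted {evicted}" if evicted else ""
    print(f"Access {block}: {stack}{note}")
```

Under that assumed starting order, E evicts B, G evicts A, and the final order from MRU to LRU is G, C, E, D.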
62 Performance
63 Decreasing miss penalty with multilevel caches
- Add a second level cache
- often primary cache is on the same chip as the processor
- use SRAMs to add another cache above primary memory (DRAM)
- miss penalty goes down if data is in 2nd level cache
- Example
- CPI of 1.0 on a 5GHz machine for no cache miss
- The same machine with 1st level cache (L1) with a 2% miss rate per instruction, and a 100ns DRAM access (what is the CPI?)
- Adding 2nd level cache with 5ns access time (including L1 access time) decreases miss rate per instruction to 0.5%; what is the speedup over the machine with only L1?
- Using multilevel caches
- try and optimize the hit time on the 1st level cache
- try and optimize the miss rate on the 2nd level cache
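A worked version of the example, as a sketch. It reads the miss rates as 2% and 0.5% per instruction and assumes the usual textbook accounting: every L1 miss pays the L2 access time, and references that also miss in L2 additionally pay the full DRAM access. The function names are illustrative.

```python
def cycles(access_ns: float, clock_ghz: float) -> float:
    """Convert an access time in ns to clock cycles."""
    return access_ns * clock_ghz

def cpi_l1_only(base_cpi, l1_miss_rate, dram_cycles):
    return base_cpi + l1_miss_rate * dram_cycles

def cpi_with_l2(base_cpi, l1_miss_rate, l2_cycles, l2_miss_rate, dram_cycles):
    # l2_miss_rate is the per-instruction rate of references that reach DRAM.
    return base_cpi + l1_miss_rate * l2_cycles + l2_miss_rate * dram_cycles

CLOCK_GHZ = 5.0
dram = cycles(100, CLOCK_GHZ)   # 500 cycles
l2 = cycles(5, CLOCK_GHZ)       # 25 cycles

one_level = cpi_l1_only(1.0, 0.02, dram)             # 1 + 10 = about 11
two_level = cpi_with_l2(1.0, 0.02, l2, 0.005, dram)  # 1 + 0.5 + 2.5 = about 4
print(one_level, two_level, one_level / two_level)   # speedup about 2.75
```

Under these assumptions the CPI drops from roughly 11 to roughly 4, a speedup of about 2.75 over the machine with only L1.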
64 Virtual Memory
- Main memory can act as a cache for the secondary storage (disk)
- Advantages
- illusion of having more physical memory
- program relocation
- protection
65 Pages: virtual memory blocks
- Page faults: the data is not in memory, retrieve it from disk
- huge miss penalty, thus pages should be fairly large (e.g., 4KB)
- reducing page faults is important (LRU is worth the price)
- can handle the faults in software instead of hardware
- using write-through is too expensive so we use writeback
66 Page Tables
67 Page Tables
68 Making Address Translation Fast
- A cache for address translations: translation lookaside buffer (TLB)
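The idea can be sketched as a lookup that consults a small translation cache before the page table. Everything here is a simplification for illustration: the TLB is an unbounded dictionary with no replacement policy, the page table is a flat dictionary, pages are 4KB, and the mapping values are hypothetical.

```python
PAGE_SIZE = 4096  # 4KB pages

def translate(vaddr, tlb, page_table):
    """Translate a virtual address; returns (physical_address, tlb_hit)."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # virtual page number, page offset
    hit = vpn in tlb
    if not hit:
        if vpn not in page_table:
            raise LookupError(f"page fault on virtual page {vpn}")
        tlb[vpn] = page_table[vpn]           # cache the translation for next time
    return tlb[vpn] * PAGE_SIZE + offset, hit

page_table = {0: 7, 1: 3}   # hypothetical virtual page -> physical page mapping
tlb = {}
print(translate(4100, tlb, page_table))  # (12292, False): TLB miss, page table walk
print(translate(4200, tlb, page_table))  # (12392, True): same page, TLB hit
```

The second access lands in the same virtual page as the first, so the translation is served from the TLB without touching the page table.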
69 TLBs and caches
70 Modern Systems
- Very complicated memory systems
71 Some Issues
- Processor speeds continue to increase very fast, much faster than either DRAM or disk access times
- Design challenge: dealing with this growing disparity
- Trends
- synchronous SRAMs (provide a burst of data)
- redesign DRAM chips to provide higher bandwidth or processing
- restructure code to increase locality
- use prefetching (make cache visible to ISA)