

1
THE MEMORY HIERARCHY
  • Jehan-François Pâris
  • jfparis@uh.edu

2
Chapter Organization
  • Technology overview
  • Caches
  • Cache associativity, write through and write
    back,
  • Virtual memory
  • Page table organization, the translation
    lookaside buffer (TLB), page fault handling,
    memory protection
  • Virtual machines
  • Cache consistency

3
TECHNOLOGY OVERVIEW
4
Dynamic RAM
  • Standard solution for main memory since 70's
  • Replaced magnetic core memory
  • Bits are stored on capacitors
  • Charged state represents a one
  • Capacitors discharge
  • Must be dynamically refreshed
  • Achieved by accessing each cell several thousand
    times each second

5
Dynamic RAM
(Diagram: a DRAM cell — row select and column select lines, an nMOS transistor, a storage capacitor, and ground.)
6
The role of the nMOS transistor
Not on the exam
  • Normally, no current can go from the source to
    the drain
  • When the gate is positive with respect to the
    ground, electrons are attracted to the gate (the
    "field effect")and current can go through

7
Magnetic disks
(Diagram: a disk drive — platter, servo, arm, and R/W head.)
8
Magnetic disk (I)
  • Data are stored into circular tracks
  • Tracks are partitioned into a variable number of
    fixed-size sectors
  • If disk drive has more than one platter, all
    tracks corresponding to the same position of the
    R/W head form a cylinder

9
Magnetic disk (II)
  • Disk spins at a speed varying between
  • 5,400 rpm (laptops) and
  • 15,000 rpm (Seagate Cheetah X15, …)
  • Accessing data requires
  • Positioning the head on the right track
  • Seek time
  • Waiting for the data to reach the R/W head
  • On the average half a rotation

10
Disk access times
  • Dominated by seek time and rotational delay
  • We try to reduce seek times by placing all data
    that are likely to be accessed together on
    nearby tracks or same cylinder
  • Cannot do as much for rotational delay
  • On the average half a rotation

11
Average rotational delay
RPM Delay (ms)
5400 5.6
7200 4.2
10,000 3.0
15,000 2.0
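A quick check of this table (not part of the original slides): half a rotation at R rpm takes 30/R seconds, so

    for rpm in (5400, 7200, 10000, 15000):
        delay_ms = 30.0 / rpm * 1000     # half a rotation, in milliseconds
        print(rpm, round(delay_ms, 1))   # 5.6, 4.2, 3.0, 2.0 ms, as above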
12
Overall performance
  • Disk access times are still dominated by
    rotational latency
  • Were 8-10 ms in the late 70's when rotational
    speeds were 3,000 to 3,600 RPM
  • Disk capacities and maximum transfer rates have
    done much better
  • Pack many more tracks per platter
  • Pack many more bits per track

13
The internal disk controller
  • Printed circuit board attached to disk drive
  • As powerful as the CPU of a personal computer of
    the early 80's
  • Functions include
  • Speed buffering
  • Disk scheduling

14
Reliability issues
  • Disk drives have more reliability issues than
    most other computer components
  • Moving parts eventually wear
  • Infant mortality
  • Would be too costly to produce perfect magnetic
    surfaces
  • Disks have bad blocks

15
Disk failure rates
  • Failure rates follow a bathtub curve
  • High infantile mortality
  • Low failure rate during useful life
  • Higher failure rates as disks wear out

16
Disk failure rates (II)
(Diagram: bathtub curve of failure rate versus time — high infantile mortality, a low failure rate during useful life, then rising failures during wearout.)
17
Disk failure rates (III)
  • Infant mortality effect can last for months for
    disk drives
  • Cheap ATA disk drives seem to age less gracefully
    than SCSI drives

18
MTTF
  • Disk manufacturers advertise very high Mean Times
    To Fail (MTTF) for their products
  • 500,000 to 1,000,000 hours, that is, 57 to 114
    years
  • Does not mean that a disk will last that long!
  • Means that disks will fail at an average rate of
    one failure per 500,000 to 1,000,000 hours during
    their useful life

19
More MTTF Issues (I)
  • Manufacturers' claims are not supported by solid
    experimental evidence
  • Obtained by submitting disks to a stress test at
    high temperature and extrapolating results to
    ideal conditions
  • Procedure raises many issues

20
More MTTF Issues (II)
  • Failure rates observed in the field are much
    higher
  • Can go up to 8 to 9 percent per year
  • Corresponding MTTFs are 11 to 12.5 years
  • If we have 100 disks and a MTTF of 12.5 years,
    we can expect an average of 8 disk failures per
    year

21
Bad blocks (I)
  • Also known as
  • Irrecoverable read errors
  • Latent sector errors
  • Can be caused by
  • Defects in magnetic substrate
  • Problems during last write

22
Bad blocks (II)
  • Disk controller uses redundant encoding that can
    detect and correct many errors
  • When internal disk controller detects a bad block
  • Marks it as unusable
  • Remaps logical block address of bad block to
    spare sectors
  • Each disk is extensively tested during a burn-in
    period before being released

23
The memory hierarchy (I)
Level Device Access Time
1 Fastest registers (2 GHz CPU) 0.5 ns
2 Main memory 10-60 ns
3 Secondary storage (disk) 7 ms
4 Mass storage (CD-ROM library) a few seconds
24
The memory hierarchy (II)
  • To make sense of these numbers, let us consider
    an analogy

25
Writing a paper (I)
Level Resource Access Time
1 Open book on desk 1 s
2 Book on desk
3 Book in library
4 Book far away
26
Writing a paper (II)
Level Resource Access Time
1 Open book on desk 1 s
2 Book on desk 20-140 s
3 Book in library
4 Book far away
27
Writing a paper (III)
Level Resource Access Time
1 Open book on desk 1 s
2 Book on desk 20-140 s
3 Book in library 162 days
4 Book far away
28
Writing a paper (IV)
Level Resource Access Time
1 Open book on desk 1 s
2 Book on desk 20-140 s
3 Book in library 162 days
4 Book far away 63 years
29
Major issues
  • Huge gaps between
  • CPU speeds and SDRAM access times
  • SDRAM access times and disk access times
  • Both problems have very different solutions
  • Gap between CPU speeds and SDRAM access times
    handled by hardware
  • Gap between SDRAM access times and disk access
    times handled by combination of software and
    hardware

30
Why?
  • Having hardware handle an issue
  • Complicates hardware design
  • Offers a very fast solution
  • Standard approach for very frequent actions
  • Letting software handle an issue
  • Cheaper
  • Has a much higher overhead
  • Standard approach for less frequent actions

31
Will the problem go away?
  • It will become worse
  • RAM access times are not improving as fast as CPU
    power
  • Disk access times are limited by rotational speed
    of disk drive

32
What are the solutions?
  • To bridge the CPU/DRAM gap
  • Interposing between the CPU and the DRAM smaller,
    faster memories that cache the data that the CPU
    currently needs
  • Cache memories
  • Managed by the hardware and invisible to the
    software (OS included)

33
What are the solutions?
  • To bridge the DRAM/disk drive gap
  • Storing in main memory the data blocks that are
    currently accessed (I/O buffer)
  • Managing memory space and disk space as a single
    resource (Virtual memory)
  • I/O buffer and virtual memory are managed by the
    OS and invisible to the user processes

34
Why do these solutions work?
  • Locality principle
  • Spatial locality: at any time a process only
    accesses a small portion of its address space
  • Temporal locality: this subset does not change
    too frequently

35
Can we think of examples?
  • The way we write programs
  • The way we act in everyday life

36
CACHING
37
The technology
  • Caches use faster static RAM (SRAM)
  • Similar organization to that of D flip-flops
  • Can have
  • Separate caches for instructions and data
  • Great for pipelining
  • A unified cache

38
A little story (I)
  • Consider a closed-stack library
  • Customers bring book requests to circulation desk
  • Librarians go to stack to fetch requested book
  • Solution is used in national libraries
  • Costlier than open-stack approach
  • Much better control of assets

39
A little story (II)
  • Librarians have noted that some books get asked
    again and again
  • Want to put them closer to the circulation desk
  • Would result in much faster service
  • The problem is how to locate these books
  • They will not be at the right location!

40
A little story (III)
  • Librarians come with a great solution
  • They put behind the circulation desk shelves with
    100 book slots numbered from 00 to 99
  • Each slot is a home for the most recently
    requested book that has a call number whose
    last two digits match the slot number
  • 3141593 can only go in slot 93
  • 1234567 can only go in slot 67

41
A little story (IV)
Customer: The call number of the book I need is 3141593
Librarian: Let me see if it's in bin 93
42
A little story (V)
  • To let the librarian do her job, each slot must
    contain either
  • Nothing or
  • A book and its reference number
  • There are many books whose reference number ends
    in 93 (or in any other two given digits)

43
A little story (VI)
Customer: Could I get this time the book whose call number is 4444493?
Librarian: Sure
44
A little story (VII)
  • This time the librarian will
  • Go to bin 93
  • Find it contains a book with a different call
    number
  • She will
  • Bring back that book to the stacks
  • Fetch the new book

45
Basic principles
  • Assume we want to store in a faster memory 2^n
    words that are currently accessed by the CPU
  • Can be instructions or data or even both
  • When the CPU will need to fetch an instruction or
    load a word into a register
  • It will look first into the cache
  • Can have a hit or a miss

46
Cache hits
  • Occur when the requested word is found in the
    cache
  • Cache avoided a memory access
  • CPU can proceed

47
Cache misses
  • Occur when the requested word is not found in the
    cache
  • Will need to access the main memory
  • Will bring the new word into the cache
  • Must make space for it by expelling one of the
    cache entries
  • Need to decide which one

48
Handling writes (I)
  • When CPU has to store the contents of a register
    into main memory
  • Write will update the cache
  • If the modified word is already in the cache
  • Everything is fine
  • Otherwise
  • Must make space for it by expelling one of the
    cache entries

49
Handling writes (II)
  • Two ways to handle writes
  • Write through
  • Each write updates both the cache and the main
    memory
  • Write back
  • Writes are not propagated to the main memory
    until the updated word is expelled from the cache

50
Handling writes (II)
  • Write through
  • Write back

(Diagram: with write through, the CPU updates both the cache and RAM; with write back, the CPU updates the cache and RAM is updated later.)
51
Pros and cons
  • Write through
  • Ensures that memory is always up to date
  • Expelled cache entries can be overwritten
  • Write back
  • Faster writes
  • Complicates cache expulsion procedure
  • Must write back cache entries that have been
    modified in the cache

52
Picking the right solution
  • Caches use write through
  • Provides simpler cache expulsions
  • Can minimize write-through overhead with
    additional circuitry
  • I/O Buffers and virtual memory usewrite back
  • Write-through overhead would be too high

53
A better write through (I)
  • Add a small buffer to speed up write performance
    of write-through caches
  • At least four words
  • Holds modified data until they are written into
    main memory
  • Cache can proceed as soon as data are written
    into the write buffer

54
A better write through (II)
  • Write through
  • Better write through

(Diagram: with the improved write through, the CPU updates the cache and a write buffer, and the write buffer later updates RAM.)
55
A very basic cache
  • Has 2^n entries
  • Each entry contains
  • A word (4 bytes)
  • Its RAM address
  • Sole way to identify the word
  • A bit indicating whether the cache entry contains
    something useful

56
A very basic cache (I)
Actual caches are much bigger
57
A very basic cache (II)
58
Comments (I)
  • The cache organization we have presented is
    nothing but the hardware implementation of a hash
    table
  • Each entry has
  • a key: the word address
  • a value: the word contents plus a valid bit

59
Comments (II)
  • The hash function is
  • h(k) = (k/4) mod N
  • where k is the key and N is the cache size
  • Can be computed very fast
  • Unlike conventional hash tables, this
    organization has no provision for handling
    collisions
  • Use expulsion to resolve collisions
    (see the sketch below)
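To make the hash-table analogy concrete, here is a minimal Python sketch (an illustration only, not the actual hardware) of a direct-mapped cache of N one-word entries that resolves collisions by expulsion:

    N = 8                       # number of cache entries (2^n)
    cache = [None] * N          # each entry holds (address, word) or nothing

    def lookup(address, memory):
        # memory is assumed to be a dict mapping word addresses to word values
        index = (address // 4) % N              # hash function h(k) = (k/4) mod N
        entry = cache[index]
        if entry is not None and entry[0] == address:
            return entry[1]                     # cache hit
        word = memory[address]                  # cache miss: go to main memory
        cache[index] = (address, word)          # expel whatever was in that slot
        return word

Two addresses that differ by a multiple of 4N bytes hash to the same index and keep expelling each other, which is exactly the collision problem discussed on the following slides.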

60
Managing the cache
  • Each word fetched into the cache can occupy a
    single cache location
  • Specified by bits n + 1 to 2 of its address
  • Two words with the same bits n + 1 to 2 cannot be
    in the cache at the same time
  • Happens whenever the addresses of the two words
    differ by K × 2^(n+2)

61
Example
  • Assume the cache can contain 8 words
  • If word 48 is in the cache it will be stored at
    cache index (48/4) mod 8 = 12 mod 8 = 4
  • In our case 2^(n+2) = 2^(3+2) = 32
  • The only possible cache index for word 80 would
    be (80/4) mod 8 = 20 mod 8 = 4
  • Same for words 112, 144, 176, …

62
Managing the cache
  • Each word fetched into cache can occupy a single
    cache location
  • Specified by n1 to 2 bits of its address
  • Two words with the same n1 to 2 bitscannot be
    at the same time in the cache
  • Happens whenever the addresses of the two words
    differ by K 2n2

63
Saving cache space
  • We do not need to store the whole address of each
    word in the cache
  • Bits 1 and 0 will always be zero
  • Bits n + 1 to 2 can be inferred from the cache
    index
  • If the cache has 8 entries, these are bits 4 to 2
  • Will only store in the tag the remaining bits of
    the address

64
A very basic cache (III)
Cache uses bits 4 to 2 of word address
65
Storing a new word in the cache
  • Location of the new word's entry will be obtained
    from the LSB of the word address
  • Discard the 2 LSB
  • Always zero for a well-aligned word
  • Remove the n next LSB for a cache of size 2^n
  • They give the cache index

(Address layout: MSB of word address | n next LSB | 00)
66
Accessing a word in the cache (I)
  • Start with word address
  • Remove the two least significant bits
  • Always zero

Word address
67
Accessing a word in the cache (II)
  • Split the remainder of the address into
  • The n least significant bits
  • The word's index in the cache
  • The cache tag

(Address layout, after dropping the two LSB: cache tag | n LSB)
68
Towards a better cache
  • Our cache takes into account temporal locality of
    accesses
  • Repeated accesses to the same location
  • But not their spatial locality
  • Accesses to neighboring locations
  • Cache space is poorly used
  • Need 26 + 1 bits of overhead to store 32 bits of
    data

69
Multiword cache (I)
  • Each cache entry will contain a block of 2, 4,
    8, … words with consecutive addresses
  • Will require words to be well aligned
  • A pair of words should start at an address that is
    a multiple of 2 × 4 = 8
  • A group of four words should start at an address
    that is a multiple of 4 × 4 = 16

70
Multiword cache (II)
(Diagram: each multiword cache entry holds a tag and a block of contents.)
71
Multiword cache (III)
  • Has 2^n entries, each containing 2^m words
  • Each entry contains
  • 2^m words
  • A tag
  • A bit indicating whether the cache entry contains
    useful data

72
Storing a new word in the cache
  • Location of new word entry will be obtained from
    LSB of word address
  • Discard the 2 + m LSB
  • Always zero for a well-aligned group of words
  • Take the n next LSB for a cache of size 2^n

(Address layout: MSB of address | n next LSB | 2 + m LSB)
73
Example
  • Assume
  • Cache can contain 8 entries
  • Each block contains 2 words
  • Words 48 and 52 belong to the same block
  • If word 48 is in the cache it will be stored at
    cache index (48/8) mod 8 = 6 mod 8 = 6
  • If word 52 is in the cache it will be stored at
    the same cache index: (52/8) mod 8 = 6 mod 8 = 6
    (see the sketch below)
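A small Python sketch of this index and tag computation (with assumed parameters mirroring the example: 2^n entries holding blocks of 2^m four-byte words); not part of the original slides:

    def index_and_tag(address, n=3, m=1, word_size=4):
        # Cache with 2**n entries, each holding a block of 2**m words
        block_size = (2 ** m) * word_size            # 8 bytes per block here
        index = (address // block_size) % (2 ** n)
        tag = address // (block_size * 2 ** n)
        return index, tag

    print(index_and_tag(48))   # (6, 0)
    print(index_and_tag(52))   # (6, 0): same block, same cache index as word 48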

74
Selecting the right block size
  • Larger block sizes improve the performance of the
    cache
  • Allows us to exploit spatial locality
  • Three limitations
  • Spatial locality effect less pronounced if block
    size exceeds 128 bytes
  • Too many collisions in very small caches
  • Large blocks take more time to be fetched into
    the cache

75
(No Transcript)
76
Collision effect in small cache
  • Consider a 4 KB cache
  • If the block size is 16 B, that is, 4 words, the
    cache will have 256 blocks
  • If the block size is 128 B, that is, 32 words, the
    cache will have 32 blocks
  • Too many collisions

77
Problem
  • Consider a very small cache with 8 entries and a
    block size of 8 bytes (2 words)
  • Which words will be fetched into the cache when
    the CPU accesses the words at addresses 32, 48, 60
    and 80?
  • How will these words be stored in the cache?

78
Solution (I)
  • Since the block size is 8 bytes
  • The 3 LSB of the address are used to address one
    of the 8 bytes in a block
  • Since the cache holds 8 blocks,
  • The next 3 LSB of the address are used as the
    cache index
  • As a result, the tag has 32 − 3 − 3 = 26 bits

79
Solution (II)
  • Consider the words at address 32
  • Cache index is (32/2^3) mod 2^3 = (32/8) mod 8 = 4
  • Block tag is 32/2^6 = 32/64 = 0

Row 4: Tag 0, bytes 32 33 34 35 36 37 38 39
80
Solution (III)
  • Consider the words at address 48
  • Cache index is (48/8) mod 8 = 6
  • Block tag is 48/64 = 0

Row 6: Tag 0, bytes 48 49 50 51 52 53 54 55
81
Solution (IV)
  • Consider the words at address 60
  • Cache index is (60/8) mod 8 = 7
  • Block tag is 60/64 = 0

Row 7: Tag 0, bytes 56 57 58 59 60 61 62 63
82
Solution (V)
  • Consider the words at address 80
  • Cache index is (80/8) mod 8 = 10 mod 8 = 2
  • Block tag is 80/64 = 1

Row 2: Tag 1, bytes 80 81 82 83 84 85 86 87
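As a check (not from the original slides), the same computation for all four accesses, assuming the 8-entry cache with 8-byte blocks of the problem:

    for a in (32, 48, 60, 80):
        block_size, entries = 8, 8
        index = (a // block_size) % entries          # cache row
        tag = a // (block_size * entries)            # block tag
        base = (a // block_size) * block_size        # first byte of the block
        print(a, index, tag, list(range(base, base + 8)))
    # 32 -> row 4, tag 0, bytes 32-39
    # 48 -> row 6, tag 0, bytes 48-55
    # 60 -> row 7, tag 0, bytes 56-63
    # 80 -> row 2, tag 1, bytes 80-87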
83
Set-associative caches (I)
  • Can be seen as 2, 4, or 8 caches attached together
  • Reduces collisions

84
Set-associative caches (II)
85
Set-associative caches (III)
  • Advantage
  • We take care of more collisions
  • Like a hash table with a fixed bucket size
  • Results in lower miss rates than direct-mapped
    caches
  • Disadvantage
  • Slower access
  • Best solution if miss penalty is very big

86
Fully associative caches
  • The dream!
  • A block can occupy any index position in the
    cache
  • Requires an associative memory
  • Content-addressable
  • Like our brain!
  • It remains a dream

87
Designing RAM to support caches
  • RAM connected to CPU through a "bus"
  • Clock rate much slower than CPU clock rate
  • Assume that a RAM access takes
  • 1 bus clock cycle to send the address
  • 15 bus clock cycles to initiate a read
  • 1 bus clock cycle to send a word of data

88
Designing RAM to support caches
  • Assume
  • Cache block size is 4 words
  • One-word bank of DRAM
  • Fetching a cache block would take
  • 1 + 4 × 15 + 4 × 1 = 65 bus clock cycles
  • Transfer rate is 0.25 byte/bus cycle
  • Awful!

89
Designing RAM to support caches
  • Could
  • Double bus width (from 32 to 64 bits)
  • Have a two-word bank of DRAM
  • Fetching a cache block would take
  • 1 + 2 × 15 + 2 × 1 = 33 bus clock cycles
  • Transfer rate is 0.48 byte/bus cycle
  • Much better
  • Costly solution

90
Designing RAM to support caches
  • Could
  • Have an interleaved memory organization
  • Four one-word banks of DRAM
  • A 32-bit bus

(Diagram: four one-word RAM banks — bank 0 to bank 3 — attached to a 32-bit bus.)
91
Designing RAM to support caches
  • Can do the 4 accesses in parallel
  • Must still transmit the block 32 bits by 32 bits
  • Fetching a cache block would take
  • 1 + 15 + 4 × 1 = 20 bus clock cycles
  • Transfer rate is 0.80 byte/bus cycle
  • Even better
  • Much cheaper than having a 64-bit bus
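A quick recomputation of the three organizations discussed above (assuming a 4-word, 16-byte cache block and the bus latencies given on slide 87); a sanity check, not part of the original slides:

    block_bytes = 16                             # 4 words of 4 bytes
    send_addr, dram_read, send_word = 1, 15, 1   # bus cycles

    one_word_bank = send_addr + 4 * dram_read + 4 * send_word   # 65 cycles
    wide_bank_64  = send_addr + 2 * dram_read + 2 * send_word   # 33 cycles
    interleaved_4 = send_addr + dram_read + 4 * send_word       # 20 cycles

    for name, cycles in [("one-word bank", one_word_bank),
                         ("two-word bank, 64-bit bus", wide_bank_64),
                         ("four interleaved banks", interleaved_4)]:
        print(name, cycles, round(block_bytes / cycles, 2), "bytes/cycle")
    # 65 cycles -> 0.25, 33 cycles -> 0.48, 20 cycles -> 0.80 bytes/cycle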

92
ANALYZING CACHE PERFORMANCE
93
Memory stalls
  • Can divide CPU time into
  • NEXEC clock cycles spent executing instructions
  • NMEM_STALLS cycles spent waiting for memory
    accesses
  • We have
  • CPU time (NEXEC NMEM_STALLS)TCYCLE

94
Memory stalls
  • We assume that
  • cache access times can be neglected
  • most CPU cycles spent waiting for memory accesses
    are caused by cache misses
  • Distinguishing between read stalls and write
    stalls
  • N_MEM_STALLS = N_RD_STALLS + N_WR_STALLS

95
Read stalls
  • Fairly simple
  • N_RD_STALLS = N_MEM_RD × Read miss rate × Read
    miss penalty

96
Write stalls (I)
  • Two causes of delays
  • Must fetch missing blocks before updating them
  • We update at most 8 bytes of the block!
  • Must take into account cost of write through
  • Buffering delays depend on the proximity of
    writes, not on the number of cache misses
  • They occur when writes come too close to each other

97
Write stalls (II)
  • We have
  • N_WR_STALLS = N_WRITES × Write miss rate
    × Write miss penalty
    + N_WR_BUFFER_STALLS
  • In practice, very few buffer stalls if the buffer
    contains at least four words

98
Global impact
  • We have
  • N_MEM_STALLS = N_MEM_ACCESSES × Cache miss rate
    × Cache miss penalty
  • and also
  • N_MEM_STALLS = N_INSTRUCTIONS × (N_MISSES/instruction)
    × Cache miss penalty

99
Example
  • Miss rate of the instruction cache is 2 percent
  • Miss rate of the data cache is 4 percent
  • In the absence of memory stalls, each instruction
    would take 2 cycles
  • Miss penalty is 100 cycles
  • 36 percent of instructions access the main memory
  • How many cycles are lost due to cache misses?

100
Solution (I)
  • Impact of instruction cache misses
  • 0.02 × 100 = 2 cycles/instruction
  • Impact of data cache misses
  • 0.36 × 0.04 × 100 = 1.44 cycles/instruction
  • Total impact of cache misses
  • 2 + 1.44 = 3.44 cycles/instruction

101
Solution (II)
  • Average number of cycles per instruction
  • 2 + 3.44 = 5.44 cycles/instruction
  • Fraction of time wasted
  • 3.44/5.44 ≈ 63 percent
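A small sketch of this computation (not on the original slides), which can be reused for the practice problem on the next slide:

    def memory_stall_fraction(i_miss, d_miss, base_cpi, penalty, mem_fraction):
        # stall cycles per instruction from instruction and data cache misses
        stalls = i_miss * penalty + mem_fraction * d_miss * penalty
        cpi = base_cpi + stalls
        return stalls, cpi, stalls / cpi

    print(memory_stall_fraction(0.02, 0.04, 2, 100, 0.36))
    # (3.44, 5.44, 0.63...): 63 percent of the time is wasted
    print(memory_stall_fraction(0.03, 0.05, 2, 100, 0.40))
    # (5.0, 7.0, 0.71...): the 71 percent quoted for the next problem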

102
Problem
  • Redo the example with the following data
  • Miss rate of the instruction cache is 3 percent
  • Miss rate of the data cache is 5 percent
  • In the absence of memory stalls, each instruction
    would take 2 cycles
  • Miss penalty is 100 cycles
  • 40 percent of instructions access the main memory

103
Solution
  • The fraction of time wasted to memory stalls is
    71 percent

104
Average memory access time
  • Some authors call it AMAT
  • T_AVERAGE = T_CACHE + f × T_MISS
  • where f is the cache miss rate
  • Times can be expressed
  • In nanoseconds
  • In number of cycles

105
Example
  • A cache has a hit rate of 96 percent
  • Accessing data
  • In the cache requires one cycle
  • In the memory requires 100 cycles
  • What is the average memory access time?

106
Solution
  • Miss rate = 1 − Hit rate = 0.04
  • Applying the formula
  • T_AVERAGE = 1 + 0.04 × 100 = 5 cycles

107
Impact of a better hit rate
  • What would be the impact of improving the hit
    rate of the cache from 96 to 98 percent?

108
Solution
  • New miss rate = 1 − New hit rate = 0.02
  • Applying the formula
  • T_AVERAGE = 1 + 0.02 × 100 = 3 cycles

When the hit rate is above 80 percent, small
improvements in the hit rate will result in a much
better miss rate
109
Examples
  • Old hit rate = 80 percent, new hit rate = 90
    percent
  • The miss rate goes from 20 to 10 percent!
  • Old hit rate = 94 percent, new hit rate = 98
    percent
  • The miss rate goes from 6 to 2 percent!

110
In other words
It's the miss rate, stupid!
111
Improving cache hit rate
  • Two complementary techniques
  • Using set-associative caches
  • Must check tags of all blocks with the same index
    values
  • Slower
  • Have fewer collisions
  • Fewer misses
  • Use a cache hierarchy

112
A cache hierarchy (I)
(Diagram: CPU → L1; L1 misses → L2; L2 misses → L3; L3 misses → RAM.)
113
A cache hierarchy
  • Topmost cache
  • Optimized for speed, not miss rate
  • Rather small
  • Uses a small block size
  • As we go down the hierarchy
  • Cache sizes increase
  • Block sizes increase
  • Cache associativity level increases

114
Example
  • Cache miss rate per instruction is 2 percent
  • In the absence of memory stalls, each instruction
    would take one cycle
  • Cache miss penalty is 100 ns
  • Clock rate is 4 GHz
  • How many cycles are lost due to cache misses?

115
Solution (I)
  • Duration of a clock cycle
  • 1/(4 GHz) = 0.25 × 10^-9 s = 0.25 ns
  • Cache miss penalty
  • 100 ns = 400 cycles
  • Total impact of cache misses
  • 0.02 × 400 = 8 cycles/instruction

116
Solution (II)
  • Average number of cycles per instruction
  • 1 + 8 = 9 cycles/instruction
  • Fraction of time wasted
  • 8/9 ≈ 89 percent

117
Example (cont'd)
  • How much faster would the processor be if we
    added an L2 cache that
  • Has a 5 ns access time
  • Would reduce the miss rate to main memory to 0.5
    percent?
  • We will see later how to get that

118
Solution (I)
  • L2 cache access time
  • 5 ns = 20 cycles
  • Impact of cache misses per instruction
  • L1 cache misses + L2 cache misses =
    0.02 × 20 + 0.005 × 400 = 0.4 + 2.0 = 2.4
    cycles/instruction
  • Average number of cycles per instruction
  • 1 + 2.4 = 3.4 cycles/instruction

119
Solution (II)
  • Fraction of time wasted
  • 2.4/3.4 ≈ 71 percent
  • CPU speedup
  • 9/3.4 ≈ 2.6
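A quick check of these figures (not on the original slides), assuming the 4 GHz clock and 100 ns memory penalty of the previous example:

    cycle_time_ns = 0.25                     # 4 GHz clock
    l2_access   = 5 / cycle_time_ns          # 20 cycles
    mem_penalty = 100 / cycle_time_ns        # 400 cycles

    stalls = 0.02 * l2_access + 0.005 * mem_penalty   # 0.4 + 2.0 = 2.4
    cpi = 1 + stalls                                  # 3.4 cycles/instruction
    print(round(stalls / cpi, 2))   # about 0.71 of the time is wasted
    print(round(9 / cpi, 1))        # speedup of about 2.6 over the L1-only case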

120
How to get the 0.005 miss rate
  • The wanted miss rate corresponds to a combined
    cache hit rate of 99.5 percent
  • Let H1 be the hit rate of the L1 cache and H2 the
    hit rate of the second cache
  • The combined hit rate of the cache hierarchy is
    H = H1 + (1 − H1) × H2

121
How to get the 0.005 miss rate
  • We have 0.995 = 0.98 + 0.02 × H2
  • H2 = (0.995 − 0.98)/0.02 = 0.75
  • Quite feasible!
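The same result in a few lines of Python (a check, not part of the slides):

    H1, H_target = 0.98, 0.995
    H2 = (H_target - H1) / (1 - H1)
    print(round(H2, 2))   # 0.75: a 75 percent L2 hit rate is enough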

122
Can we do better? (I)
  • Keep 98 percent hit rate for L1 cache
  • Raise hit rate of L2 cache to 85 percent
  • The L2 cache is now slower: 6 ns
  • Impact of cache misses per instruction
  • L1 cache misses + L2 cache misses =
    0.02 × 24 + 0.02 × 0.15 × 400 = 0.48 + 1.2 = 1.68
    cycles/instruction

123
The verdict
  • Fraction of time wasted
  • 1.68/2.68 ≈ 63 percent
  • CPU speedup
  • 9/2.68 ≈ 3.36

124
Would a faster L2 cache help?
  • Redo the example assuming
  • Hit rate of the L1 cache is still 98 percent
  • A new, faster L2 cache
  • Access time reduced to 3 ns
  • Hit rate only 50 percent

125
The verdict
  • Fraction of time wasted
  • 81 percent
  • CPU speedup
  • 1.72

The new L2 cache with a lower access time but a
higher miss rate performs much worse than the
original L2 cache
126
Cache replacement policy
  • Not an issue in direct mapped caches
  • We have no choice!
  • An issue in set-associative caches
  • Best policy is least recently used (LRU)
  • Expels from the cache a block in the same set as
    the incoming block
  • Pick block that has not been accessed for the
    longest period of time

127
Implementing LRU policy
  • Easy when each set contains two blocks
  • We attach to each block a use bit that is
  • Set to 1 when the block is accessed
  • Reset to 0 when the other block is accessed
  • We expel block whose use bit is 0
  • Much more complicated for higher associativity
    levels

128
REALIZATIONS
129
Caching in a multicore organization
  • Multicore organizations often involve multiple
    chips
  • Say four chips with four cores per chip
  • Have a cache hierarchy on each chip
  • L1, L2, L3
  • Some caches are private, others are shared
  • Accessing a cache on a chip is much faster than
    accessing a cache on another chip

130
AMD 16-core system (I)
  • AMD 16-core system
  • Sixteen cores on four chips
  • Each core has a 64-KB L1 and a 512-KB L2 cache
  • Each chip has a 2-MB shared L3 cache

131
(Figure: cache access costs in the AMD 16-core system, labeled X/Y where X is the latency in cycles and Y is the bandwidth in bytes/cycle.)
132
AMD 16-core system (II)
  • Observe that access times are non-uniform
  • Takes more time to access L1 or L2 cache of
    another core than accessing shared L3 cache
  • Takes more time to access caches in another chip
    than local caches
  • Access times and bandwidths depend on the chip
    interconnect topology

133
VIRTUAL MEMORY
134
Main objective (I)
  • To allow programmers to write programs that
    reside
  • partially in main memory
  • partially on disk

135
Main objective (II)
(Diagram: two address spaces, each partly resident in main memory and partly on disk.)
136
Motivation
  • Most programs do not access their whole address
    space at the same time
  • Compilers go through several phases
  • Lexical analysis
  • Preprocessing (C, C++)
  • Syntactic analysis
  • Semantic analysis

137
Advantages (I)
  • VM allows programmers to write programs that
    would not otherwise fit in main memory
  • They will run although much more slowly
  • Very important in 70's and 80's
  • VM allows OS to allocate the main memory much
    more efficiently
  • Do not waste precious memory space
  • Still important today

138
Advantages
  • VM lets programmers use
  • Sparsely populated
  • Very large address spaces

(Diagram: a large, sparsely populated virtual address space holding data, code, stack and library segments.)
139
Sparsely populated address spaces
  • Let programmers put different items apart from
    each other
  • Code segment
  • Data segment
  • Stack
  • Shared library
  • Mapped files

Wait until you take 4330 to study this
140
Big difference with caching
  • Miss penalty is much bigger
  • Around 5 ms
  • Assuming a memory access time of 50 ns, 5 ms
    equals 100,000 memory accesses
  • For caches, miss penalty was around100 cycles

141
Consequences
  • Will use much larger block sizes
  • Blocks, here called pages, measure 4 KB, 8 KB, …,
    with 4 KB being an unofficial standard
  • Will use fully associative mapping to reduce
    misses, here called page faults
  • Will use write back to reduce disk accesses
  • Must keep track of modified (dirty) pages in
    memory

142
Virtual memory
  • Combines two big ideas
  • Non-contiguous memory allocation: processes are
    allocated page frames scattered all over the main
    memory
  • On-demand fetch: process pages are brought into
    main memory when they are accessed for the first
    time
  • MMU takes care of almost everything

143
Main memory
  • Divided into fixed-size page frames
  • Allocation units
  • Sizes are powers of 2 (512 B, …, 4 KB, …)
  • Properly aligned
  • Numbered 0 , 1, 2, . . .

(Diagram: main memory divided into page frames numbered 0 to 8.)
144
Program address space
  • Divided into fixed-size pages
  • Same sizes as page frames
  • Properly aligned
  • Also numbered 0 , 1, 2, . . .

(Diagram: a program address space divided into pages numbered 0 to 7.)
145
The mapping
  • Will allocate non contiguous page frames to the
    pages of a process

146
The mapping
Page Number Frame number
0 0
1 4
2 2
147
The mapping
  • Assuming 1KB pages and page frames

Virtual Addresses Physical Addresses
0 to 1,023 0 to 1,023
1,024 to 2,047 4,096 to 5,119
2,048 to 3,071 2,048 to 3,071
148
The mapping
  • Observing that 2^10 is written in binary as a one
    followed by ten zeroes
  • We will write 0-0 for ten zeroes and 1-1 for ten
    ones

Virtual Addresses Physical Addresses
0000-0 to 0001-1 0000-0 to 0001-1
0010-0 to 0011-1 1000-0 to 1001-1
0100-0 to 0101-1 0100-0 to 0101-1
149
The mapping
  • The ten least significant bits of the address do
    not change

Virtual Addresses Physical Addresses
000 0-0 to 000 1-1 000 0-0 to 000 1-1
001 0-0 to 001 1-1 100 0-0 to 100 1-1
010 0-0 to 010 1-1 010 0-0 to 010 1-1
150
The mapping
  • Must only map page numbers into page frame numbers

Page number Page frame number
000 000
001 100
010 010
151
The mapping
  • Same in decimal

Page number Page frame number
0 0
1 4
2 2
152
The mapping
  • Since page numbers are always in sequence, they
    are redundant

Page number Page frame number
0 0
1 4
2 2
153
The algorithm
  • Assume the page size is 2^p
  • Remove the p least significant bits from the
    virtual address to obtain the page number
  • Use the page number to find the corresponding page
    frame number in the page table
  • Append the p least significant bits of the virtual
    address to the page frame number to get the
    physical address (see the sketch below)
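Not part of the original slides: a minimal Python sketch of this algorithm, assuming a page size of 2^p bytes and a page table represented as a plain list indexed by page number (in reality the MMU does this in hardware):

    def translate(virtual_address, page_table, p=10):
        page_size = 1 << p
        page_number = virtual_address >> p            # drop the p offset bits
        offset = virtual_address & (page_size - 1)    # keep them as the offset
        frame_number = page_table[page_number]        # page table lookup
        return (frame_number << p) | offset           # frame number + offset

    # With 1 KB pages and the mapping 0 -> 0, 1 -> 4, 2 -> 2 used above,
    # virtual address 1500 lies in page 1 and maps into frame 4:
    print(translate(1500, [0, 4, 2]))   # 4572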

154
Realization
155
The offset
  • Offset contains all bits that remain unchanged
    through the address translation process
  • Function of page size

Page size Offset
1 KB 10 bits
2 KB 11 bits
4KB 12 bits
156
The page number
  • Contains other bits of virtual address
  • Assuming 32-bit addresses

Page size Offset Page number
1 KB 10 bits 22 bits
2 KB 11 bits 21 bits
4KB 12 bits 20 bits
157
Internal fragmentation
  • Each process now occupies an integer number of
    pages
  • Actual process space is not a round number
  • Last page of a process is rarely full
  • On the average, half a page is wasted
  • Not a big issue
  • Internal fragmentation

158
On-demand fetch (I)
  • Most processes terminate without having accessed
    their whole address space
  • Code handling rare error conditions, . . .
  • Other processes go to multiple phases during
    which they access different parts of their
    address space
  • Compilers

159
On-demand fetch (II)
  • VM systems do not fetch whole address space of a
    process when it is brought into memory
  • They fetch individual pages on demand when they
    get accessed the first time
  • Page miss or page fault
  • When memory is full, they expel from memory pages
    that are not currently in use

160
On-demand fetch (III)
  • The pages of a process that are not in main
    memory reside on disk
  • In the executable file for the program being
    run for the pages in the code segment
  • In a special swap area for the data pages that
    were expelled from main memory

161
On-demand fetch (IV)
Main memory
Code
Data
Disk
Executable
Swap area
162
On-demand fetch (V)
  • When a process tries to access data that are not
    present in main memory
  • MMU hardware detects that the page is missing and
    causes an interrupt
  • Interrupt wakes up page fault handler
  • Page fault handler puts process in waiting state
    and brings missing page in main memory

163
Advantages
  • VM systems use main memory more efficiently than
    other memory management schemes
  • Give to each process more or less what it needs
  • Process sizes are not limited by the size of main
    memory
  • Greatly simplifies program organization

164
Sole disadvantage
  • Bringing pages from disk is a relatively slow
    operation
  • Takes milliseconds while memory accesses take
    nanoseconds
  • Ten thousand times to hundred thousand times
    slower

165
The cost of a page fault
  • Let
  • Tm be the main memory access time
  • Td the disk access time
  • f the page fault rate
  • Ta the average access time of the VM
  • Ta = (1 − f) Tm + f (Tm + Td) = Tm + f Td

166
Example
  • Assume Tm = 50 ns and Td = 5 ms

f Mean memory access time
10^-3 50 ns + 5 ms/10^3 = 5,050 ns
10^-4 50 ns + 5 ms/10^4 = 550 ns
10^-5 50 ns + 5 ms/10^5 = 100 ns
10^-6 50 ns + 5 ms/10^6 = 55 ns
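The table can be reproduced directly from the formula Ta = Tm + f Td (a check, not part of the slides):

    Tm = 50           # main memory access time, in ns
    Td = 5_000_000    # disk access time: 5 ms expressed in ns
    for f in (1e-3, 1e-4, 1e-5, 1e-6):
        print(f, Tm + f * Td, "ns")   # 5050, 550, 100, 55 ns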
167
Conclusion
  • Virtual memory works best when page fault rate
    is less than a page fault per 100,000
    instructions

168
Locality principle (I)
  • A process that would access its pages in a
    totally unpredictable fashion would perform very
    poorly in a VM system unless all its pages are in
    main memory

169
Locality principle (II)
  • Process P accesses randomly a very large array
    consisting of n pages
  • If m of these n pages are in main memory, the
    page fault frequency of the process will be
    (n − m)/n
  • Must switch to another algorithm

170
Tuning considerations
  • In order to achieve an acceptable performance, a
    VM system must ensure that each process has in
    main memory all the pages it is currently
    referencing
  • When this is not the case, the system performance
    will quickly collapse

171
First problem
  • A virtual memory system has
  • 32 bit addresses
  • 8 KB pages
  • What are the sizes of the
  • Page number field?
  • Offset field?

172
Solution (I)
  • Step 1: Convert the page size to a power of 2:
    8 KB = 2^--- B
  • Step 2: The exponent is the length of the offset
    field

173
Solution (II)
  • Step 3: Size of the page number field = Address
    size − Offset size. Here 32 − ___ = ___ bits
  • Highlight the text in the box to see the answers

13 bits for the offset and 19 bits for the page
number
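A short Python check of this kind of computation (an assumed helper, not part of the slides):

    import math

    def field_sizes(page_size_bytes, address_bits=32):
        offset_bits = int(math.log2(page_size_bytes))       # 8 KB -> 13 bits
        return offset_bits, address_bits - offset_bits      # remaining bits

    print(field_sizes(8 * 1024))   # (13, 19): 13-bit offset, 19-bit page number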
174
PAGE TABLE REPRESENTATION
175
Page table entries
  • A page table entry (PTE) contains
  • A page frame number
  • Several special bits
  • Assuming 32-bit addresses, all fit into four bytes

176
The special bits (I)
  • Valid bit: 1 if the page is in main memory, 0
    otherwise
  • Missing bit: 1 if the page is not in main memory,
    0 otherwise
  • They serve the same function but use different
    conventions

177
The special bits (II)
  • Dirty bit: 1 if the page has been modified since
    it was brought into main memory, 0 otherwise
  • A dirty page must be saved in the process swap
    area on disk before being expelled from main
    memory
  • A clean page can be immediately expelled

178
The special bits (III)
  • Page-referenced bit: 1 if the page has been
    recently accessed, 0 otherwise
  • Often simulated in software

179
Where to store page tables
  • Use a three-level approach
  • Store parts of page table
  • In high-speed registers located in the MMU: the
    translation lookaside buffer (TLB)
    (good solution)
  • In main memory (bad solution)
  • On disk (ugly solution)

180
The translation lookaside buffer
  • Small high-speed memory
  • Contains fixed number of PTEs
  • Content-addressable memory
  • Entries include page frame number and page number

181
Realizations (I)
  • TLB of the Intrinsity FastMATH
  • 32-bit addresses
  • 4 KB pages
  • Fully associative TLB with 16 entries
  • Each entry occupies 64 bits
  • 20 bits for page number
  • 20 bits for page frame number
  • Valid bit, dirty bit, …

182
Realizations (II)
  • TLB of ULTRA SPARC III
  • 64-bit addresses
  • Maximum program size is 2^44 bytes, that is, 16 TB
  • Supported page sizes are 4 KB, 16KB, 64 KB, 4MB
    ("superpages")

183
Realizations (III)
  • TLB of ULTRA SPARC III
  • Dual direct-mapping (?) TLB
  • 64 entries for code pages
  • 64 entries for data pages
  • Each entry occupies 64 bits
  • Page number and page frame number
  • Context
  • Valid bit, dirty bit, …

184
The context (I)
  • Conventional TLBs contain the PTE's for a
    specific address space
  • Must be flushed each time the OS switches from
    the current process to a new process
  • Frequent action in any modern OS
  • Introduces a significant time penalty

185
The context (II)
  • UltraSPARC III architecture adds to TLB entries a
    context identifying a specific address space
  • Page mappings from different address spaces can
    coexist in the TLB
  • A TLB hit now requires a match for both page
    number and context
  • Eliminates the need to flush the TLB

186
TLB misses
  • When a PTE cannot be found in the TLB, a TLB
    miss is said to occur
  • TLB misses can be handled
  • By the computer firmware
  • Cost of miss is one extra memory access
  • By the OS kernel
  • Cost of miss is two context switches

187
Letting SW handle TLB misses
  • As in other exceptions, must save current value
    of PC in EPC register
  • Must also assert the exception by the end of the
    clock cycle during which the memory access occurs
  • In MIPS, must prevent the WB cycle from occurring
    after the MEM cycle that generated the exception

188
Example
  • Consider the instruction
  • lw $1, 0($2)
  • If the word at address 0($2) is not in the TLB, we
    must prevent any update of $1

189
Performance implications
  • When TLB misses are handled by the firmware,
    they are very cheap
  • A TLB hit rate of 99 percent is very good; the
    average access cost will be
  • Ta = 0.99 × Tm + 0.01 × 2Tm = 1.01 Tm
  • This is less true if TLB misses are handled by the
    kernel

190
Storing the rest of the page table
  • PTs are too large to be stored in main memory
  • Will store active part of the PT in main memory
  • Other entries on disk
  • Three solutions
  • Linear page tables
  • Multilevel page tables
  • Hashed page tables

191
Storing the rest of the page table
  • We will review these solutions even though page
    table organizations are an operating system topic

192
Linear page tables (I)
  • Store PT in virtual memory (VMS solution)
  • Very large page tables need more than 2 levels (3
    levels on MIPS R3000)

193
Linear page tables (II)
(Diagram: the page table is stored in virtual memory; parts of it and of other page tables are mapped into physical memory.)
194
Linear page tables (III)
  • Assuming a page size of 4KB,
  • Each page of virtual memory requires 4 bytes of
    physical memory
  • Each PT maps 4GB of virtual addresses
  • A PT will occupy 4MB
  • Storing these 4MB in virtual memory will require
    4KB of physical memory

195
Multi-level page tables (I)
  • PT is divided into
  • A master index that always remains in main memory
  • Sub indexes that can be expelled

196
Multi-level page tables (II)
(Diagram: the page number part of the virtual address is split into a primary index and a secondary index; the primary index selects an entry in the master index, the secondary index selects a frame address in the sub-index, and the unchanged offset is appended to that frame number to form the physical address.)
197
Multi-level page tables (III)
  • Especially suited for a page size of 4 KB and
    32-bit virtual addresses
  • Will allocate
  • 10 bits of the address for the first level,
  • 10 bits for the second level, and
  • 12 bits for the offset.
  • The master index and the sub-indexes will all have
    2^10 entries and occupy 4 KB
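A small Python sketch of that 10/10/12 split of a 32-bit virtual address (illustrative only, not part of the original slides):

    def split_virtual_address(va):
        offset    = va & 0xFFF           # low 12 bits
        secondary = (va >> 12) & 0x3FF   # next 10 bits: index into the sub-index
        primary   = (va >> 22) & 0x3FF   # top 10 bits: index into the master index
        return primary, secondary, offset

    print(split_virtual_address(0x12345678))   # (72, 837, 1656)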

198
Hashed page tables (I)
  • Only contain pages that are in main memory
  • PTs are much smaller
  • Also known as inverted page tables

199
Hashed page table (II)
PN = page number, PFN = page frame number
200
Selecting the right page size
  • Increasing the page size
  • Increases the length of the offset
  • Decreases the length of the page number
  • Reduces the size of page tables
  • Fewer entries
  • Increases internal fragmentation
  • 4KB seems to be a good choice

201
MEMORY PROTECTION
202
Objective
  • Unless we have an isolated single-user system, we
    must prevent users from
  • Accessing
  • Deleting
  • Modifying
  • the address spaces of other processes, including
    the kernel

203
Historical considerations
  • Earlier operating systems for personal computers
    did not have any protection
  • They were single-user machines
  • They typically ran one program at a time
  • Windows 2000, Windows XP, Vista and MacOS X are
    protected

204
Memory protection (I)
  • VM ensures that processes cannot access page
    frames that are not referenced in their page
    table.
  • Can refine control by distinguishing among
  • Read access
  • Write access
  • Execute access
  • Must also prevent processes from modifying their
    own page tables

205
Dual-mode CPU
  • Require a dual-mode CPU
  • Two CPU modes
  • Privileged mode or executive mode that allows
    CPU to execute all instructions
  • User mode that allows CPU to execute only safe
    unprivileged instructions
  • The state of the CPU is determined by a special bit

206
Switching between states
  • User mode will be the default mode for all
    programs
  • Only the kernel can run in supervisor mode
  • Switching from user mode to supervisor mode is
    done through an interrupt
  • Safe because the jump address is at a
    well-defined location in main memory

207
Memory protection (II)
  • Has additional advantages
  • Prevents programs from corrupting address spaces
    of other programs
  • Prevents programs from crashing the kernel
  • Not true for device drivers which are inside the
    kernel
  • Required part of any multiprogramming system

208
INTEGRATING CACHES AND VM
209
The problem
  • In a VM system, each byte of memory has two
    addresses
  • A virtual address
  • A physical address
  • Should cache tags contain virtual addresses or
    physical addresses?

210
Discussion
  • Using virtual addresses
  • Directly available
  • Bypass TLB
  • Cache entries specific to a given address space
  • Must flush caches when the OS selects another
    process
  • Using physical addresses
  • Must first access the TLB
  • Cache entries not specific to a given address
    space
  • Do not have to flush caches when the OS selects
    another process

211
The best solution
  • Let the cache use physical addresses
  • No need to flush the cache at each context switch
  • TLB access delay is tolerable

212
Processing a memory access (I)
    if virtual address in TLB:
        get physical address
    else:
        create TLB miss exception
        break

I use Python because it is very compact:
hetland.org/writing/instant-python.html
213
Processing a memory access (II)
    if read_access:
        while data not in cache:
            stall
        deliver data to CPU
    else:
        write_access

Continues on next page
214
Processing a memory access (III)
    if write_access_OK:
        while data not in cache:
            stall
        write data into cache
        update dirty bit
        put data and address in write buffer
    else:  # illegal access
        create TLB miss exception

215
More Problems (I)
  • A virtual memory system has a virtual address
    space of 4 Gigabytes and a page size of 4
    Kilobytes. Each page table entry occupies 4
    bytes.

216
More Problems (II)
  • How many bits are used for the byte offset?
  • Since 4 KB = 2^___, the byte offset will use ___
    bits.
  • Highlight the text in the box to see the answer

Since 4 KB = 2^12 bytes, the byte offset uses 12 bits
217
More Problems (III)
  • How many bits are used for the page number?
  • Since 4 GB = 2^__, we will have __-bit virtual
    addresses. Since the byte offset occupies ___ of
    these __ bits, __ bits are left for the page
    number.

The page number uses 20 bits of the address
218
More Problems (IV)
  • What is the maximum number of page table entries
    in a page table?
  • Address space / Page size = 2^__ / 2^__ = 2^___
    PTEs.

2^20 page table entries
219
More problems (VI)
  • A computer has 32 bit addresses and a page size
    of one kilobyte.
  • How many bits are used to represent the page
    number?
  • ___ bits
  • What is the maximum number of entries in a
    process page table?
  • 2^___ entries

220
Answer
  • As 1 KB = 2^10 bytes, the byte offset occupies
    10 bits
  • The page number uses the remaining 22 bits of the
    address

221
Some review questions
  • Why are TLB entries 64-bit wide while page table
    entries only require 32 bits?
  • What would be the main disadvantage of a virtual
    memory system lacking a dirty bit?
  • What is the big limitation of VM systems that
    cannot prevent processes from executing the
    contents of any arbitrary page in their address
    space?

222
Answers
  • We need extra space for storing the page number
  • It would have to write back to disk all pages
    that it expels even when they were not modified
  • It would make the system less secure

223
VIRTUAL MACHINES
224
Key idea
  • Let different operating systems run at the same
    time on a single computer
  • Windows, Linux and Mac OS
  • A real-time OS and a conventional OS
  • A production OS and a new OS being tested

225
How it is done
  • A hypervisor /VM monitor defines two or more
    virtual machines
  • Each virtual machine has
  • Its own virtual CPU
  • Its own virtual physical memory
  • Its own virtual disk(s)

226
The virtualization process
Hypervisor
227
Reminder
  • In a conventional OS,
  • Kernel executes in privileged/supervisor mode
  • Can do virtually everything
  • User processes execute in user mode
  • Cannot modify their page tables
  • Cannot execute privileged instructions

228
(Diagram: user processes run in user mode and enter the kernel, which runs in privileged mode, through system calls.)
229
Two virtual machines
(Diagram: the kernels of the two virtual machines run in user mode on top of the hypervisor, which runs in privileged mode.)
230
Explanations (II)
  • Whenever the kernel of a VM issues a privileged
    instruction, an interrupt occurs
  • The hypervisor takes control and does the
    physical equivalent of what the VM attempted to do
  • Must convert virtual RAM addresses into physical
    RAM addresses
  • Must convert virtual disk block addresses into
    physical block addresses

231
Translating a block address
(Diagram: the VM kernel asks to access block x, y of its virtual disk; the hypervisor translates the request and accesses block v, w of the actual disk.)
232
Handling I/Os
  • Difficult task because
  • Wide variety of devices
  • Some devices may be shared among several VMs
  • Printers
  • Shared disk partition
  • Want to let Linux and Windows access the same
    files

233
Virtual Memory Issues
  • Each VM kernel manages its own memory