Title: CS1104 Help Session I: Memory, Semester II 2001/02
1. CS1104 Help Session I: Memory, Semester II 2001/02
- Colin Tan,
- S15-04-05,
- Ctank_at_comp.nus.edu.sg
2. Memory
- Memory can be visualized as a stack of pigeon holes. Current computers have about 128,000,000 pigeon holes.
- Each pigeon hole is given a number, starting from 0. This number is called an address.
- Each pigeon hole contains either data (e.g. numbers you want to add together) or an instruction (e.g. add two numbers).
3. Memory
- Memory locations 0 to 3 contain instructions; locations 4 to 6 contain data.
- Note: In reality, instructions are also encoded as numbers!
4. Addresses
- As mentioned, each pigeon hole has a number identifying it, called an address.
- When the CPU requires an instruction, it sends the instruction's address to memory, and the memory returns the instruction at that address.
- E.g. at IF, the CPU will send 0 to memory, and the memory returns li $t1, 5.
- At MEM, the CPU will send 6 to memory, and memory returns 10.
- At WB, the CPU writes 10 back to $t1.
5. Addressing Bits
- Computers work only in binary.
- Hence the addresses generated in the previous example are also in binary!
- In general, to address a maximum of n memory locations, you will need m = log2(n) bits in your address.
- Conversely, if you have m bits in your address, you can access a maximum of 2^m memory locations.
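- As a quick illustration of the m = log2(n) relationship above, here is a minimal C sketch (the function name bits_needed is ours, not from the slides):

    #include <stdio.h>

    /* Number of address bits m needed for n locations: smallest m with 2^m >= n. */
    static unsigned bits_needed(unsigned long long n) {
        unsigned m = 0;
        while ((1ULL << m) < n)
            m++;
        return m;
    }

    int main(void) {
        /* 128,000,000 pigeon holes need 27 bits (2^27 = 134,217,728). */
        printf("%u\n", bits_needed(128000000ULL));
        /* Conversely, 27 address bits reach at most 2^27 locations. */
        printf("%llu\n", 1ULL << 27);
        return 0;
    }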
6. Memory Hierarchy
- Motivation:
- Not all memory is created equal.
- Cheap memory => slow
- Fast memory => expensive
- DRAM: 70 ns access time, $1/MByte
- SRAM: 8 ns access time, $50/MByte
- So, you can choose either:
- Fast but very small memory, OR
- Large but very slow memory.
7. Memory Hierarchy
- The memory hierarchy gives you a third option:
- Large, but very fast memory
- Though slower than the expensive memory mentioned earlier.
8. Locality
- Locality is a particular type of behavior exhibited by running programs.
- Spatial locality: if a memory location has been accessed, it is very likely its neighbor will also be accessed.
- Temporal locality: if a memory location has been accessed, it is very likely that it will be accessed again sometime soon.
9. Locality - Example
- Consider the following program:

    for (i = 0; i < 10; i++)
        a[i] = b[i] + c[i];
10. Locality - Example
- In memory it will look like this: [figure in the original slides: the compiled loop laid out in consecutive memory locations]
11. Locality - Example
- Tracing the execution of the program: [figure in the original slides: execution trace of the loop]
12. Locality - Example
- Focusing only on the addresses of the fetched instructions, we see that the instructions are fetched from addresses:
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, ...
- Here we see both:
- Spatial locality (e.g. after location 0 is accessed, location 1 is accessed, then 2, etc.)
- Temporal locality (e.g. location 2 is accessed 10 times!)
13. Effect of Locality
- Locality means that in the short run, out of all the memory you have (perhaps up to 128,000,000 pigeon holes!), only a very small number of locations are actually being accessed!
- In our example, for ten iterations only memory locations 2 to 10 are accessed, out of 128,000,000 possible locations!
- What if we had a tiny amount of very fast (but expensive!) memory and kept these locations in that fast memory?
- We could speed up access times dramatically!!
- This is the idea behind caches.
14. How Do Caches Help?
- The average time to access memory (AMAT) is given by:
- AMAT = hit_rate × Tcache + miss_rate × (Tmemory + Tcache)
- Tcache = time to read the cache (8 ns for an SRAM cache)
- Tmemory = time to read main memory (70 ns for DRAM)
- miss_rate = probability of not finding what we want in the cache.
- Because of locality, miss_rate is very small
- Typically about 3% to 5%.
- Here, our AMAT = 0.95 × 8 ns + 0.05 × (70 + 8) ns = 11.5 ns
- Our AMAT is about 43% slower than pure SRAM cache memory (11.5 ns vs. 8 ns).
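- A minimal C sketch of this AMAT calculation (variable names are ours):

    #include <stdio.h>

    int main(void) {
        double t_cache  = 8.0;   /* ns, SRAM cache read time */
        double t_memory = 70.0;  /* ns, DRAM read time       */
        double miss     = 0.05;  /* 5% miss rate             */

        /* AMAT = hit_rate * Tcache + miss_rate * (Tmemory + Tcache) */
        double amat = (1.0 - miss) * t_cache + miss * (t_memory + t_cache);
        printf("AMAT = %.1f ns\n", amat);  /* prints 11.5 ns */
        return 0;
    }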
15. How Do Caches Help?
- What about cost?
- Let's consider:
- A system with 32 MB of DRAM memory and 512 KB of SRAM cache.
- Cost is $1/MB for DRAM, and $50/MB for SRAM.
- If we had 32 MB of SRAM, access time is 8 ns, but the cost will be $1,600.
- With 32 MB of DRAM, the cost is $32, but access time is 70 ns!
- But with 32 MB of DRAM and 512 KB (1/2 MB) of SRAM, the cost will be $32 + (512/1024) × $50 = $57!
16. How Do Caches Help?
- So with pure SRAM, we can have an 8 ns average access time at $1,600.
- With pure DRAM, our memory will cost $32, but all accesses will take 70 ns!
- With DRAM memory and an SRAM cache, we can have an 11.5 ns average access time at $57.
- So for a performance drop of 43%, we have a cost saving of >2700%!
- Hence caches give us a large memory size (32 MB) at close to the cost of the DRAM technology ($57 vs. $32), but at close to the speed of the expensive SRAM technology (11.5 ns vs. 8 ns).
17. Cache Architecture
- Caches consist of blocks (or lines). Each block stores data from memory.
- The block allocation problem:
- Given data from an address A, how do we decide which block of the cache that data should go to?
18. The Block Allocation Problem
- 3 possible solutions:
- Data from each address A goes to a fixed block:
- Direct mapped cache
- Data from each address A can go to any block:
- Fully associative cache
- Data from address A goes to a fixed set of blocks:
- Data may be put into any block within that set.
- Set associative cache
19. Direct Mapped Caches
- The value of a portion of the memory address is used to decide which block to send the data to:

    Address A: [ Tag | Block Index | Block Offset | Byte Offset ]

- The Block Index portion is used to decide which block data from this address should go to.
20. Example
- The number of bits in the block index is log2(N), where N is the total number of blocks.
- For a 4-block cache, the block index portion of the address will be 2 bits, and these 2 bits can take on the values 00, 01, 10 or 11.
- The exact value of these 2 bits determines which block the data for that address will go to.
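- A small C sketch of how a controller might extract these fields for a 4-block, 1-word-per-block cache (the masks and names are our assumptions, following the field layout above):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr = 0x0000001C;               /* example 32-bit byte address */

        uint32_t byte_offset = addr & 0x3;        /* bits [1:0]: 4 bytes/word */
        uint32_t block_index = (addr >> 2) & 0x3; /* bits [3:2]: 4 blocks     */
        uint32_t tag         = addr >> 4;         /* remaining upper bits     */

        printf("tag=%u index=%u byte=%u\n", tag, block_index, byte_offset);
        return 0;
    }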
21. Direct Mapped Addressing E.g.
- Show how addresses generated by the MIPS CPU will be divided into byte offset, block offset, block index and tag portions for the following cases:
- i) Block size = 1 word, 128 blocks
- ii) Block size = 4 words, 64 blocks
- All MIPS addresses are 32-bit byte addresses (i.e. they address individual bytes in a word).
22. Case I
- Block size = 1 word, 128 blocks: byte offset = 2 bits, block offset = log2(1) = 0 bits, block index = log2(128) = 7 bits, tag = 32 - 7 - 0 - 2 = 23 bits.
23. Case II
- Block size = 4 words, 64 blocks: byte offset = 2 bits, block offset = log2(4) = 2 bits, block index = log2(64) = 6 bits, tag = 32 - 6 - 2 - 2 = 22 bits.
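- A short C sketch that reproduces both cases from the log2 formulas above (function and variable names are ours):

    #include <stdio.h>

    /* log2 for exact powers of two */
    static unsigned lg2(unsigned n) {
        unsigned b = 0;
        while (n >>= 1) b++;
        return b;
    }

    static void fields(unsigned words_per_block, unsigned num_blocks) {
        unsigned byte_off  = 2;                  /* 4 bytes per MIPS word */
        unsigned block_off = lg2(words_per_block);
        unsigned index     = lg2(num_blocks);
        unsigned tag       = 32 - index - block_off - byte_off;
        printf("tag=%u index=%u block_off=%u byte_off=%u\n",
               tag, index, block_off, byte_off);
    }

    int main(void) {
        fields(1, 128);  /* Case I:  tag=23 index=7 block_off=0 byte_off=2 */
        fields(4, 64);   /* Case II: tag=22 index=6 block_off=2 byte_off=2 */
        return 0;
    }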
24. Example
- The value of the two block index bits determines which block the data will go to, following the scheme shown below: [figure in the original slides: index values 00-11 mapped to cache blocks 0-3]
25. Solving Direct-Mapped Cache Problems
- Question 7.7
- Basic formula:
- Blk_Addr = floor(word_address / words_per_block) mod N
- N here is the total number of blocks in the cache.
- This is the mathematical version of taking the value of the Block Index bits from the address.
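- The same formula as a small C helper (a sketch; the name blk_addr is ours):

    #include <stdio.h>

    /* Blk_Addr = floor(word_address / words_per_block) mod N */
    static unsigned blk_addr(unsigned word_address,
                             unsigned words_per_block,
                             unsigned n_blocks) {
        return (word_address / words_per_block) % n_blocks;
    }

    int main(void) {
        /* e.g. word address 22, 1 word per block, 8 blocks -> block 6 */
        printf("%u\n", blk_addr(22, 1, 8));
        return 0;
    }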
26. A Complication: Multiple Word Blocks
- Single word blocks do not support spatial locality.
- Spatial locality: the likelihood of accessing the neighbor of a piece of data that was just accessed is high.
- But with single word blocks, none of the neighbors are in the cache!
- All accesses to neighbors that were not accessed before will miss!
27. An Example: Question 7.8
28. Accessing Individual Words
- In our example, each block has 4 words.
- But we always access memory 1 word at a time! (e.g. lw)
- Use the Block Offset to specify which of the 4 words in a block we want to read:

    Address A: [ Tag | Block Index | Block Offset | Byte Offset ]
29. The Block Offset
- Number of block offset bits = log2(M), where M is the number of words per block.
- For our example, M = 4, so the number of block offset bits is 2.
- These two bits can take on the values 00, 01, 10 and 11.
- Note that for single word blocks, the number of block offset bits is log2(1), which is 0. I.e. there are no block offset bits for single-word blocks.
- These values determine exactly which word within a block address A is referring to.
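- A small C sketch extracting all four fields for the Case II layout (4 words per block, 64 blocks; the masks and names are ours):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr = 0x00400C74;              /* example MIPS byte address */

        uint32_t byte_off  = addr & 0x3;         /* 2 bits: byte within word   */
        uint32_t block_off = (addr >> 2) & 0x3;  /* 2 bits: word within block  */
        uint32_t index     = (addr >> 4) & 0x3F; /* 6 bits: which of 64 blocks */
        uint32_t tag       = addr >> 10;         /* remaining 22 bits          */

        printf("tag=%u index=%u block=%u byte=%u\n",
               tag, index, block_off, byte_off);
        return 0;
    }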
30. Who Am I? The Purpose of the Tag
- Many different addresses may map to the same block, e.g. (in the original figure, three addresses are shown with their Block Index portions highlighted):
- All 3 addresses are different, but all map to block 00010010.
31. Disambiguation
- We need a way to disambiguate the situation.
- Otherwise how do we know that the data in block x actually comes from address A and not from another address A' that has the same block index bit value?
- The portion of the address A to the left of the Block Index can be used for disambiguation.
- This portion is called the tag, and the tag for address A is stored in the cache together with address A's data.
32. The Tag
- When we access the cache, the Tag portion and Block Index portion of address A are extracted.
- The Block Index portion tells the cache controller which block of the cache to look at.
- The Tag portion is compared against the tag stored in the block. If the tags match, we have a cache hit, and the data is read from the cache.
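- A minimal C sketch of this hit check for a direct-mapped cache (the struct layout and names are our assumptions; the valid bit is our addition, marking blocks that have been filled):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_BLOCKS 4            /* the 4-block example from earlier slides */

    struct block {
        bool     valid;             /* has this block been filled? (our addition) */
        uint32_t tag;               /* tag stored alongside the data              */
        uint32_t data;              /* one word per block, for simplicity         */
    };

    struct block cache[NUM_BLOCKS];

    /* Returns true on a cache hit; *word receives the data. */
    bool lookup(uint32_t addr, uint32_t *word) {
        uint32_t index = (addr >> 2) % NUM_BLOCKS;  /* Block Index bits       */
        uint32_t tag   = addr >> 4;                 /* bits left of the index */

        if (cache[index].valid && cache[index].tag == tag) {
            *word = cache[index].data;              /* tags match: cache hit  */
            return true;
        }
        return false;                               /* cache miss             */
    }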
33. Accessing Individual Bytes
- MIPS addresses are byte addresses, and actually index individual bytes rather than words.
- Each MIPS word consists of 4 bytes.
- The byte offset tells us exactly which byte within a word we are referring to.
34. Advantages & Disadvantages of Direct Mapped Caches
- Advantages:
- Simple to implement
- Fast performance
- Less time to detect a cache hit => less time to get data from the cache => faster performance
- Disadvantages:
- Poor temporal locality.
- Many addresses may map to the same block.
- The next time address A is accessed, its block may have been replaced by the contents of another address A'.
35. Improving Temporal Locality: The Fully Associative Cache
- In the fully associative cache, data from an address A can go to any block in the cache.
- In practice, data will go into the first available cache block.
- When the cache is full, a replacement policy is invoked to choose which block of the cache to throw out.
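- A hedged C sketch of a fully associative lookup: every block's tag must be compared (hardware does this in parallel with one comparator per block; the names here are ours):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_BLOCKS 8

    struct fa_block {
        bool     valid;
        uint32_t tag;      /* the whole address above the offset bits */
        uint32_t data;
    };

    struct fa_block fa_cache[NUM_BLOCKS];

    /* Software model of a fully associative lookup: check every block. */
    bool fa_lookup(uint32_t addr, uint32_t *word) {
        uint32_t tag = addr >> 2;                /* word address used as tag   */
        for (int i = 0; i < NUM_BLOCKS; i++) {   /* hardware: all in parallel  */
            if (fa_cache[i].valid && fa_cache[i].tag == tag) {
                *word = fa_cache[i].data;
                return true;
            }
        }
        return false;  /* miss: a replacement policy (e.g. LRU) picks a victim */
    }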
36. Advantages and Disadvantages: Fully Associative Cache
- Advantages:
- Good temporal locality properties
- Flexible block placement allows smart replacement policies, such that blocks that are likely to be referenced again will not be replaced, e.g. LRU, LFU.
- Disadvantages:
- Complex and too expensive for large caches
- Each block needs a comparator to check the tag.
- With 8192 blocks, we need 8192 comparators!
37. A Compromise: Set Associative Caches
- Represents a compromise between direct-mapped and fully associative caches.
- The cache is divided into sets of blocks.
- An address A is mapped directly to a set, using a scheme similar to that of direct mapped caches.
- Once the set has been determined, the data from A may be stored in any block within the set - fully associative within a set!
38. Set Associative Cache
- An n-way set associative cache has n blocks per set.
- For example, a 16-block cache implemented as a 2-way set associative cache has 2 blocks per set, and a total of 8 sets.
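- A short C sketch of the 2-way, 8-set example: the set is chosen like a direct-mapped index, then the search within the set is fully associative (all names are ours):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS 8
    #define WAYS     2              /* 2-way: 16 blocks in total */

    struct sa_block {
        bool     valid;
        uint32_t tag;
        uint32_t data;
    };

    struct sa_block sa_cache[NUM_SETS][WAYS];

    bool sa_lookup(uint32_t addr, uint32_t *word) {
        uint32_t set = (addr >> 2) % NUM_SETS;  /* direct mapping to a set     */
        uint32_t tag = (addr >> 2) / NUM_SETS;  /* remaining bits form the tag */

        for (int way = 0; way < WAYS; way++) {  /* associative search in set   */
            if (sa_cache[set][way].valid && sa_cache[set][way].tag == tag) {
                *word = sa_cache[set][way].data;
                return true;
            }
        }
        return false;
    }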
39. Advantages and Disadvantages: Set Associative Cache
- Advantages:
- Almost as simple to build as a direct-mapped cache.
- Only n comparators are needed for an n-way set associative cache. For 2-way set-associative, only 2 comparators are needed to compare tags.
- Supports temporal locality by having full associativity within a set.
40. Advantages and Disadvantages: Set Associative Cache
- Disadvantages:
- Not as good as a fully-associative cache in supporting temporal locality.
- For LRU schemes, because of the small associativity, it is actually possible to have a 0% hit rate for temporally local data.
- E.g. if our accesses are A1 A2 A3 A1 A2 A3, and if A1, A2 and A3 map to the same 2-way set, then the hit rate is 0%, as subsequent accesses replace previous accesses under the LRU scheme.
41. Multi-level Cache
- Let the first level of cache (closest to the CPU) be called L1, and the next level L2.
- Let Phit_L1 be the hit rate of L1, Tcache_L1 the cache access time of L1, and Tmiss_L1 the miss penalty of L1.
- AMAT of L1 = Phit_L1 × Tcache_L1 + (1 - Phit_L1) × Tmiss_L1
- What is Tmiss_L1?
- If L1 misses, then we will attempt to get the data from L2. Hence Tmiss_L1 is actually just the AMAT of L2!
- Let Phit_L2 be the hit rate of L2, Tcache_L2 the cache access time of L2, and Tmiss_L2 the miss penalty of L2.
42. Multilevel Cache
- Tmiss_L1 = AMAT_L2 = Phit_L2 × Tcache_L2 + (1 - Phit_L2) × Tmiss_L2
- Substitute this back and we get:
- AMAT of L1 = Phit_L1 × Tcache_L1 + (1 - Phit_L1) × (Phit_L2 × Tcache_L2 + (1 - Phit_L2) × Tmiss_L2)
- Tmiss_L2 is of course the time taken to access the slow DRAM memory.
- What if we had an L3 cache?
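- A small C sketch of the substituted two-level formula (the numbers below are illustrative assumptions, not from the slides):

    #include <stdio.h>

    /* AMAT_L1 = Phit_L1*Tcache_L1
              + (1-Phit_L1)*(Phit_L2*Tcache_L2 + (1-Phit_L2)*Tmiss_L2) */
    static double amat2(double p1, double t1, double p2, double t2, double t_mem) {
        double amat_l2 = p2 * t2 + (1.0 - p2) * t_mem;  /* Tmiss_L1 = AMAT of L2 */
        return p1 * t1 + (1.0 - p1) * amat_l2;
    }

    int main(void) {
        /* Illustrative: 95% L1 hits at 2 ns, 90% L2 hits at 8 ns, 70 ns DRAM. */
        printf("AMAT = %.2f ns\n", amat2(0.95, 2.0, 0.90, 8.0, 70.0));
        return 0;
    }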
43. Other Problems
44. Virtual Memory Motivation
- Drive space is very, very cheap
- Typically about 2 cents per megabyte.
- It would be ideal if we could set aside a portion of drive space to be used as memory.
- Unfortunately disk drives are very slow
- The fastest access times are about 10 ms, orders of magnitude slower than SRAM and DRAM (which are measured in nanoseconds).
- Idea: use drive space as memory, and main memory to cache the drive space!
- This is the idea behind virtual memory.
45. Main Idea
- Virtual memory (residing on disk) is cached by main memory.
- Main memory is cached by the system cache.
- All memory transfers take place only between consecutive levels (e.g. VM to main memory, main memory to cache).
46. Cache vs. VM
- The concept behind VM is almost identical to the concept behind caches.
- But different terminology!

    Cache term     VM term
    Block          Page
    Cache miss     Page fault

- Caches are implemented completely in hardware. VM is implemented in software, with hardware support from the CPU.
- Cache speeds up main memory access, while main memory speeds up VM access.
47. Technical Issues of VM
- It is relatively cheap to remedy cache misses
- The miss penalty is essentially the time taken to access main memory (around 60-80 ns).
- The pipeline freezes for about 60-80 cycles.
- Page faults are EXPENSIVE!
- The page fault penalty is the time taken to access the disk.
- This may take 50 ms or more, depending on the speed of the disk and the I/O bus.
- That wastes millions of processor cycles!
48. Virtual Memory Design
- Because page-miss penalties are so heavy, it is not practical to implement direct-mapped or set-associative architectures
- These have poorer hit rates.
- Main memory caching of VM is always fully associative.
- This gives a 1% or 2% improvement in hit rate over direct-mapped or set-associative designs.
- But with heavy page-miss penalties, a 1% improvement is A LOT!
- It is also relatively cheap to implement full associativity in software.
49. Summary
- Memory can be thought of as pigeon holes where the CPU stores instructions and data.
- Each pigeon hole (memory location) is given a number called its address.
- Memory technology can be cheap and slow (DRAM) or fast and expensive (SRAM).
- Locality allows us to use a small amount of fast, expensive memory to store parts of the cheap and slow memory, improving performance.
- Caches are organized into blocks.
50. Summary
- The mapping between memory addresses and blocks can be accomplished by:
- Directly mapping a memory location to a cache block (direct mapped)
- Slotting a memory location into any block (fully associative)
- Mapping a memory location to a set of blocks, then slotting it into any block within the set (set associative)
- Virtual memory uses disk space as main memory, DRAM main memory as a cache for the disk memory, and SRAM as a cache for the DRAM.