Memory Hierarchy Design

Transcript and Presenter's Notes


1
Memory Hierarchy Design
2
Outline
  • Introduction
  • Cache Basics
  • Cache Performance
  • Reducing Cache Miss Penalty
  • Reducing Cache Miss Rate
  • Reducing Hit Time
  • Main Memory and Organizations
  • Memory Technology
  • Virtual Memory
  • Conclusion

3
Many Levels in Memory Hierarchy
  • Pipeline registers (invisible only to high-level-language programmers)
  • Register file (invisible only to high-level-language programmers)
  • 1st-level cache (on-chip); caches are usually made invisible to the programmer, even assembly programmers
  • 2nd-level cache (on same MCM as CPU); there can also be a 3rd (or more) cache level here
  • Physical memory (usually mounted on same board as CPU)
  • Virtual memory (on hard disk, often in same enclosure as CPU)
  • Disk files (on hard disk, often in same enclosure as CPU)
  • Network-accessible disk files (often in the same building as the CPU)
  • Tape backup/archive system (often in the same building as the CPU)
  • Data warehouse: a robotically-accessed room full of shelves of tapes (usually on the same planet as the CPU)
The cache levels through physical and virtual memory are our focus in chapter 5.
4
Simple Hierarchy Example
  • Note the many orders-of-magnitude changes in characteristics between levels

5
CPU vs. Memory Performance Trends
Relative performance (vs. 1980) as a function of year: CPU performance grew 35%/year, then 55%/year (after 1986), while memory performance grew only 7%/year.
6
Outline
  • Introduction
  • Cache Basics
  • Cache Performance
  • Reducing Cache Miss Penalty
  • Reducing Cache Miss Rate
  • Reducing Hit Time
  • Main Memory and Organizations
  • Memory Technology
  • Virtual Memory
  • Conclusion

7
Cache Basics
  • A cache is a (hardware-managed) storage, intermediate in size, speed, and cost-per-bit between the programmer-visible registers and main physical memory.
  • The cache itself may be SRAM or fast DRAM.
  • There may be more than one level of cache.
  • Basis for caches to work: the Principle of Locality
  • When a location is accessed, it and nearby locations are likely to be accessed again soon.
  • Temporal locality: the same location is likely to be accessed again soon.
  • Spatial locality: nearby locations are likely to be accessed soon.
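
To see both kinds of locality at once, consider a simple array sum (an illustrative sketch, not from the slides):

    /* Both kinds of locality in one loop. */
    long sum_array(const int *a, int n) {
        long sum = 0;                /* sum and i are reused every
                                        iteration: temporal locality */
        for (int i = 0; i < n; i++)
            sum += a[i];             /* consecutive addresses share
                                        cache blocks: spatial locality */
        return sum;
    }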

8
Four Basic Questions
  • Consider the levels in a memory hierarchy.
  • Use a block as the unit of data transfer, to exploit the Principle of Locality.
  • Blocks are transferred between cache levels, and between cache and main memory.
  • A level's design is described by four behaviors:
  • Block Placement
  • Where can a new block be placed in the level?
  • Block Identification
  • How is a block found if it is in the level?
  • Block Replacement
  • Which existing block should be replaced if necessary?
  • Write Strategy
  • How are writes to the block handled?

9
Block Placement Schemes
10
Direct-Mapped Placement
  • A block can go into only one frame in the cache
  • Determined by the block's address (in memory space)
  • The frame number is usually given by some low-order bits of the block address.
  • This can also be expressed as:
  • (Frame number) = (Block address) mod (Number of frames (sets) in cache)
  • Note that in a direct-mapped cache,
  • block placement and replacement are both completely determined by the address of the new block that is to be accessed.
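
A minimal sketch of the resulting address decomposition, assuming power-of-two block size and frame count (the sizes are illustrative):

    #include <stdint.h>

    #define BLOCK_SIZE 64u    /* bytes per block -> 6 offset bits */
    #define NUM_FRAMES 512u   /* frames in cache -> 9 index bits  */

    /* Split a memory address into offset, frame number, and tag. */
    static inline uint32_t offset_of(uint64_t addr) { return addr % BLOCK_SIZE; }
    static inline uint32_t frame_of(uint64_t addr)  { return (addr / BLOCK_SIZE) % NUM_FRAMES; }
    static inline uint64_t tag_of(uint64_t addr)    { return addr / ((uint64_t)BLOCK_SIZE * NUM_FRAMES); }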

11
Direct-Mapped Identification
(Diagram: the address is split into tag, frame number, and offset fields. The frame number decodes a row select that picks one block frame; that frame's stored tag is compared against the address tag, the offset drives the mux select for the data word, and a tag match signals a hit.)
12
Fully-Associative Placement
  • One alternative to direct-mapped:
  • Allow the block to fill any empty frame in the cache.
  • How do we then locate the block later?
  • Can associate each stored block with a tag
  • Identifies the block's location in cache.
  • When the block is needed, treat the cache as an associative memory, using the tag to match all frames in parallel, to pull out the appropriate block.
  • Another alternative to direct-mapped is placement under full program control.
  • A register file can be viewed as a small programmer-controlled cache (with 1-word blocks).

13
Fully-Associative Identification
(Diagram: the address is split into just a block address and an offset; the block address is compared in parallel against the tags of all block frames, the offset drives the mux select, and a match signals a hit and selects the data word.)
  • Note that, compared to direct-mapped:
  • More address bits have to be stored with each block frame.
  • A comparator is needed for each frame, to do the parallel associative lookup.
14
Set-Associative Placement
  • The block address determines not a single frame, but a frame set (several frames, grouped together).
  • (Frame set) = (Block address) mod (Number of frame sets)
  • The block can be placed associatively anywhere within that frame set.
  • If there are n frames in each frame set, the scheme is called n-way set-associative.
  • Direct-mapped = 1-way set-associative.
  • Fully associative: there is only 1 frame set.

15
Set-Associative Identification
(Diagram: the address is split into tag, set number, and offset; the set number selects one of 4 separate sets, the tags within the selected set are compared in parallel, the offset drives the mux select, and a match signals a hit and selects the data word.)
  • Intermediate between direct-mapped and fully-associative in the number of tag bits that must be associated with cache frames.
  • Still need a comparator for each frame (but only those in one set need be activated).
16
Cache Size Equation
  • A simple equation for the size of a cache:
  • (Cache size) = (Block size) × (Number of sets) × (Set associativity)
  • Can relate to the sizes of the various address fields:
  • (Block size) = 2^(# of offset bits)
  • (Number of sets) = 2^(# of index bits)
  • (# of tag bits) = (# of memory address bits) - (# of index bits) - (# of offset bits)

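A worked instance, using the Alpha 21264 cache parameters that appear later in these slides (64KB, 2-way, 64-byte blocks, 512 sets, 44-bit physical addresses):

    (Cache size) = 64 × 512 × 2 = 65,536 bytes = 64KB
    (# of offset bits) = log2(64) = 6
    (# of index bits) = log2(512) = 9
    (# of tag bits) = 44 - 9 - 6 = 29
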
17
Replacement Strategies
  • Which block do we replace when a new block comes in (on a cache miss)?
  • Direct-mapped: there's only one choice!
  • Associative (fully- or set-):
  • If any frame in the set is empty, pick one of those.
  • Otherwise, there are many possible strategies:
  • Random: simple, fast, and fairly effective
  • Least-Recently Used (LRU), and approximations thereof
  • Requires bits to record replacement info; e.g., 4-way has 4! = 24 permutations, so 5 bits are needed to encode the MRU-to-LRU ordering
  • FIFO: replace the oldest block.

18
Write Strategies
  • Most accesses are reads, not writes
  • Especially if instruction reads are included
  • Optimize for reads; read performance matters most
  • Direct-mapped can return the value before the valid check
  • Writes are more difficult
  • Can't write to cache until we know the right block
  • The object written may have various sizes (1-8 bytes)
  • When to synchronize cache with memory?
  • Write-through: write to cache and to memory
  • Prone to stalls due to high bandwidth requirements
  • Write-back: write to memory upon replacement
  • Memory may be out of date

19
Another Write Strategy
  • Maintain a FIFO queue (write buffer) of cache frames (e.g., can use a doubly-linked list)
  • Meanwhile, take items from the head of the queue and write them to memory as fast as the bus can handle
  • Reads might take priority, or have a separate bus
  • Advantage: write stalls are minimized, while keeping memory as up-to-date as possible

20
Write Miss Strategies
  • What do we do on a write to a block that's not in the cache?
  • Two main strategies (neither stalls the processor):
  • Write-allocate (fetch on write): cache the block.
  • No write-allocate (write around): just write to memory.
  • Write-back caches tend to use write-allocate.
  • Write-through caches tend to use no-write-allocate.
  • A dirty bit indicates that a write-back is needed in the write-back strategy.

21
Example: Alpha 21264
  • 64KB, 2-way, 64-byte block, 512 sets
  • 44 physical address bits

22
Instruction vs. Data Caches
  • Instructions and data have different patterns of temporal and spatial locality
  • Also, instructions are generally read-only
  • Can have separate instruction and data caches
  • Advantages
  • Doubles bandwidth between CPU and memory hierarchy
  • Each cache can be optimized for its pattern of locality
  • Disadvantages
  • Slightly more complex design
  • Can't dynamically adjust the cache space taken up by instructions vs. data

23
I/D Split and Unified Caches
Size     I-Cache   D-Cache   Unified Cache
8KB      8.16      44.0      63.0
16KB     3.82      40.9      51.0
32KB     1.36      38.4      43.3
64KB     0.61      36.9      39.4
128KB    0.30      35.3      36.2
256KB    0.02      32.6      32.9
  • Misses per 1000 accesses
  • Much lower instruction miss rate than data miss rate

24
Outline
  • Introduction
  • Cache Basics
  • Cache Performance
  • Reducing Cache Miss Penalty
  • Reducing Cache Miss Rate
  • Reducing Hit Time
  • Main Memory and Organizations
  • Memory Technology
  • Virtual Memory
  • Conclusion

25
Cache Performance Equations
  • Memory stall cycles per program (blocking cache)
  • CPU time formula
  • More cache performance equations will be given later!
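
The equations themselves did not survive the transcript; the standard forms from the textbook are:

    Memory stall cycles = IC × (Memory accesses / Instruction) × (Miss rate) × (Miss penalty)
    CPU time = IC × (CPI_execution + Memory stall cycles per instruction) × (Clock cycle time)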

26
Cache Performance Example
  • Ideal CPI = 2.0, memory references per instruction = 1.5, cache size = 64KB, miss penalty = 75ns, hit time = 1 clock cycle
  • Compare the performance of two caches:
  • Direct-mapped (1-way): cycle time = 1ns, miss rate = 1.4%
  • 2-way: cycle time = 1.25ns, miss rate = 1.0%
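
A worked sketch of the comparison (assuming, as in the textbook version of this example, that the base CPI is charged at each design's own cycle time):

    AMAT(1-way) = 1.0 + 0.014 × 75 = 2.05 ns
    AMAT(2-way) = 1.25 + 0.010 × 75 = 2.00 ns
    CPU time(1-way) = IC × (2.0 × 1.0 + 1.5 × 0.014 × 75) = 3.58 × IC ns
    CPU time(2-way) = IC × (2.0 × 1.25 + 1.5 × 0.010 × 75) = 3.63 × IC ns

The 2-way design has the lower AMAT, yet the 1-way design wins on CPU time: the faster cycle speeds up all instructions, not just memory accesses.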

27
Out-Of-Order Processor
  • Define a new miss penalty that accounts for overlap
  • Compute the memory latency and the overlapped latency
  • Example (from the previous slide):
  • Assume 30% of the 75ns penalty can be overlapped, but with the longer (1.25ns) cycle on the 1-way design due to OOO
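
Continuing the same numbers as a worked sketch:

    Effective miss penalty = 0.7 × 75 = 52.5 ns
    AMAT(1-way, OOO) = 1.25 + 0.014 × 52.5 = 1.99 ns
    CPU time(1-way, OOO) = IC × (2.0 × 1.25 + 1.5 × 0.014 × 52.5) = 3.60 × IC ns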

28
Cache Performance Improvement
  • Consider the cache performance equation:
  • (Average memory access time) = (Hit time) + (Miss rate) × (Miss penalty)
  • The product (Miss rate) × (Miss penalty) is the amortized miss penalty.
  • It follows that there are three basic targets for improving cache performance, addressed in four ways:
  • Reducing miss penalty (5.4)
  • Reducing miss rate (5.5)
  • Reducing miss penalty/rate via parallelism (5.6)
  • Reducing hit time (5.7)
  • Note that by Amdahl's Law, there will be diminishing returns from reducing only hit time or amortized miss penalty by itself, instead of both together.

29
Cache Performance Improvement
  • Reduce miss penalty:
  • Multilevel caches; critical word first and early restart; priority to read misses; merging write buffer; victim cache
  • Reduce miss rate:
  • Larger block size; larger cache size; higher associativity; way prediction and pseudo-associative caches; compiler optimizations
  • Reduce miss penalty/rate via parallelism:
  • Non-blocking caches; hardware prefetching; compiler-controlled prefetching
  • Reduce hit time:
  • Small and simple caches; avoiding address translation when indexing the cache; pipelined cache access; trace caches

30
Outline
  • Introduction
  • Cache Basics
  • Cache Performance
  • Reducing Cache Miss Penalty
  • Reducing Cache Miss Rate
  • Reducing Hit Time
  • Main Memory and Organizations
  • Memory Technology
  • Virtual Memory
  • Conclusion

31
Multi-Level Caches
  • What is more important: faster caches or larger caches?
  • (Average memory access time) = Hit time(L1) + Miss rate(L1) × Miss penalty(L1)
  • Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2)
  • Can plug the 2nd equation into the first:
  • (Average memory access time) = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))

32
Multi-level Cache Terminology
  • Local miss rate
  • The miss rate of one hierarchy level by itself.
  • = (# of misses at that level) / (# of accesses to that level)
  • e.g., Miss rate(L1), Miss rate(L2)
  • Global miss rate
  • The miss rate of a whole group of hierarchy levels
  • = (# of accesses coming out of that group, to lower levels) / (# of accesses into that group)
  • Generally this is the product of the miss rates at each level in the group.
  • Global L2 miss rate = Miss rate(L1) × Local miss rate(L2)
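
A quick numeric sketch (illustrative values): if 1000 references access the L1 and 40 of them miss (local L1 miss rate 4%), and 20 of those 40 also miss in the L2 (local L2 miss rate 50%), then the global L2 miss rate is 20/1000 = 2% = 4% × 50%.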

33
Effect of 2-level Caching
  • L2 size is usually much bigger than L1
  • Provides a reasonable hit rate
  • Decreases the miss penalty of the 1st-level cache
  • May increase the L2 miss penalty
  • Multiple-level cache inclusion property:
  • Inclusive cache: L1 is a subset of L2; simplifies the cache coherence mechanism; effective cache size = L2
  • Exclusive cache: L1 and L2 are disjoint; increases effective cache size = L1 + L2
  • Enforcing the inclusion property requires backward invalidation on L2 replacement

34
L2 Cache Performance
  1. Global cache miss rate is similar to the single-cache miss rate
  2. Local miss rate is not a good measure for secondary caches

35
Early Restart, Critical Word First
  • Early restart
  • Don't wait for the entire block to fill
  • Resume the CPU as soon as the requested word is fetched
  • Critical word first
  • Also called wrapped fetch or requested word first
  • Fetch the requested word from memory first
  • Resume the CPU
  • Then transfer the rest of the cache block
  • Most beneficial if the block size is large
  • Commonly used in processors today

36
Read Misses Take Priority
  • The processor must wait on a read, not on a write
  • The miss penalty is higher for reads to begin with, so there is more benefit from reducing the read miss penalty
  • A write buffer can queue values to be written
  • Until the memory bus is not busy with reads
  • Careful about the memory consistency issue:
  • What if we want to read a block that is in the write buffer?
  • Wait for the write, then read the block from memory
  • Better: read the block out of the write buffer.
  • Dirty-block replacement when reading:
  • Write the old block, read the new block: delays the read.
  • Old block to buffer, read new, write old: better!

37
Sub-block Placement
  • Larger blocks have smaller tags (faster matching, less tag storage)
  • Smaller blocks have a lower miss penalty
  • Compromise solution:
  • Use a large block size for tagging purposes
  • Use a small block size for transfer purposes
  • How? Valid bits associated with sub-blocks.

(Diagram: block frames with one tag per large block and a valid bit per sub-block.)
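
A hypothetical sketch of the bookkeeping, assuming 4 sub-blocks of 16 bytes per tagged block (all names and sizes are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define SUBBLOCKS 4

    /* One frame: a single tag covers the whole block, but each
       sub-block is valid independently, so a miss only needs to
       transfer the missing sub-block. */
    struct frame {
        uint32_t tag;
        bool     sub_valid[SUBBLOCKS];
        uint8_t  data[SUBBLOCKS][16];
    };

    /* A hit needs both a tag match and the sub-block's valid bit. */
    static bool hit(const struct frame *f, uint32_t tag, unsigned sub) {
        return f->tag == tag && f->sub_valid[sub];
    }
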
38
Merging Write Buffer
  • A mechanism to help reduce write stalls
  • On a write to memory, the block address and the data to be written are placed in a write buffer.
  • The CPU can continue immediately
  • Unless the write buffer is full.
  • Write merging:
  • If the same block is written again before it has been flushed to memory, the old contents are replaced with the new contents.
  • Care must be taken not to violate memory consistency and proper write ordering

39
Write Merging Example
40
Victim Cache
  • Small extra cache
  • Holds blocks overflowing from the occasional
    overfull frame set.
  • Very effective for reducing conflict misses.
  • Can be checked in parallel with main cache
  • Insignificant increase to hit time.

41
Outline
  • Introduction
  • Cache Basics
  • Cache Performance
  • Reducing Cache Miss Penalty
  • Reducing Cache Miss Rate
  • Reducing Hit Time
  • Main Memory and Organizations
  • Memory Technology
  • Virtual Memory
  • Conclusion

42
Three Types of Misses
  • Compulsory
  • During a program, the very first access to a
    block will not be in the cache (unless
    pre-fetched)
  • Capacity
  • The working set of blocks accessed by the program
    is too large to fit in the cache
  • Conflict
  • Unless cache is fully associative, sometimes
    blocks may be evicted too early because too many
    frequently-accessed blocks map to the same
    limited set of frames.

43
Misses by Type
  • Conflict misses are significant in a
    direct-mapped cache.
  • From direct-mapped to 2-way helps as much as
    doubling cache size.
  • Going from direct-mapped to 4-way is better
    than doubling cache size.

44
(Figure: the same miss-type data, shown as a fraction of total misses.)
45
Larger Block Size
  • Keep cache size and associativity constant
  • Reduces compulsory misses
  • Due to spatial locality
  • More accesses are to a pre-fetched block
  • Increases capacity misses
  • More unused locations are pulled into the cache
  • May increase conflict misses (slightly)
  • Fewer sets may mean more blocks contending per set
  • Depends on the pattern of addresses accessed
  • Increases miss penalty: longer block transfers

46
Block Size Effect
Miss rate actually goes up if the block is too large relative to the cache size.
47
Larger Caches
  • Keep block size, set size, etc. constant
  • No effect on compulsory misses.
  • The block still won't be there on its 1st access!
  • Reduces capacity misses
  • More capacity!
  • Reduces conflict misses (in general)
  • Working blocks are spread out over more frame sets
  • Fewer blocks map to a set on average
  • Less chance that the number of active blocks that map to a given set exceeds the set size.
  • But increases hit time! (And cost.)

48
Higher Associativity
  • Keep cache size and block size constant
  • Decreases the number of sets
  • No effect on compulsory misses
  • No effect on capacity misses
  • By definition, these are misses that would happen anyway in a fully-associative cache
  • Decreases conflict misses
  • Blocks in an active set are less likely to be evicted early
  • For set sizes smaller than capacity
  • Can increase hit time (slightly)
  • Direct-mapped is fastest
  • n-way associative lookup is a bit slower for larger n

49
Performance Comparison
  • Assume:
  • 4KB cache; 1-way miss rate = 9.8%, 4-way miss rate = 7.1%

50
Higher Set-Associativity
Cache Size 1-way 2-way 4-way 8-way
4KB 3.44 3.25 3.22 3.28
8KB 2.69 2.58 2.55 2.62
16KB 2.23 2.40 2.46 2.53
32KB 2.06 2.30 2.37 2.45
64KB 1.92 2.14 2.18 2.25
128KB 1.52 1.84 1.92 2.00
256KB 1.32 1.66 1.74 1.82
512KB 1.20 1.55 1.59 1.66
  • Higher associativity increases the cycle time
  • The table shows the average memory access time
  • 1-way is better in most cases

51
Way Prediction
  • Keep way-prediction information in each set, to predict which block in the set will be accessed next
  • Only one tag is matched in the first cycle; on a mispredict, the other blocks are examined
  • Beneficial in two aspects:
  • Fast data access: access the data without knowing the tag comparison results
  • Low power: only a single tag is matched when the majority of predictions are correct
  • Different systems use variations of the concept

52
Pseudo-Associative Caches
  • Essentially this is 2-way set-associative, but
    with sequential (rather than parallel) lookups.
  • Fast hit time if first frame checked is right.
  • An occasional slow hit if an earlier conflict had
    moved the block to its backup location.

53
Pseudo-Associative Caches
  • Placement
  • Place block b in frame (b mod n).
  • Identification
  • Look for block b first in frame (b mod n), then in its secondary location ((b + n/2) mod n), i.e., flip the most-significant index bit. If found there, the primary and secondary blocks are swapped.
  • May maintain an MRU bit to reduce the search and for better replacement.
  • Replacement
  • The block in frame (b mod n) is moved to the secondary location ((b + n/2) mod n). (The block there is flushed.)
  • Write strategy
  • Any desired write strategy can be used.
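
A minimal sketch of the frame calculation, assuming the frame count n is a power of two (names are illustrative):

    #include <stdint.h>

    /* Primary and secondary frames for block b in an n-frame
       pseudo-associative cache (n must be a power of two). */
    static inline uint32_t primary(uint32_t b, uint32_t n)   { return b % n; }
    static inline uint32_t secondary(uint32_t b, uint32_t n) { return (b + n/2) % n; }
    /* Adding n/2 mod n flips the most-significant index bit. */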

54
Compiler Optimizations
  • Reorganize code to improve its locality properties.
  • The hardware designer's favorite solution:
  • Requires no new hardware!
  • Various cache-aware techniques (sketches follow this list):
  • Merging arrays
  • Loop interchange
  • Loop fusion
  • Blocking (in multidimensional arrays)
  • Other source-to-source transformation techniques
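
A loop-interchange sketch (illustrative code): in a row-major C array, making the column index the inner loop yields unit-stride accesses, so each fetched cache block is fully used.

    /* Before interchange: for (j...) { for (i...) x[i][j] *= 2.0; }
       strides N doubles between consecutive accesses. */
    void scale(int N, double x[N][N]) {
        /* After interchange: the inner loop walks consecutive
           addresses, fully using each cache block. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] *= 2.0;
    }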

55
Loop Blocking Matrix Multiply
(The slide shows the matrix-multiply code before and after blocking; a reconstruction follows.)
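
The transcript lost the slide's code, so here is a hedged reconstruction of the standard blocked matrix multiply; B is the blocking factor, chosen so that the tiles being reused fit in the cache:

    /* Blocked (tiled) matrix multiply: C += A * Z, all N x N.
       Illustrative sketch of the classic transformation. */
    void matmul_blocked(int N, int B, double A[N][N],
                        double Z[N][N], double C[N][N]) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B && j < N; j++) {
                        double r = 0.0;
                        for (int k = kk; k < kk + B && k < N; k++)
                            r += A[i][k] * Z[k][j];  /* reuses a B x B
                                                        tile of Z while
                                                        it is in cache */
                        C[i][j] += r;
                    }
    }
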
56
Effect of Compiler Optimizations
57
Outline
  • Introduction
  • Cache Basics
  • Cache Performance
  • Reducing Cache Miss Penalty
  • Reducing Cache Miss Rate
  • Reducing Hit Time
  • Main Memory and Organizations
  • Memory Technology
  • Virtual Memory
  • Conclusion

(This copy of the outline highlights the current topic: reducing miss penalty/rate via parallelism.)
58
Non-blocking Caches
  • Also known as lockup-free cache, or "hit under miss"
  • While a miss is being processed,
  • Allow other cache lookups to continue anyway
  • Useful in dynamically scheduled CPUs
  • Other instructions may be waiting in the load queue
  • Reduces the effective miss penalty
  • Useful CPU work fills the miss-penalty "delay slot"
  • "Hit under multiple miss", "miss under miss":
  • Extend the technique to allow multiple misses to be queued up, while still processing new hits

59
Non-blocking Caches
60
Hardware Prefetching
  • When memory is idle, speculatively fetch some blocks before the CPU first asks for them!
  • Simple heuristic: fetch 1 or more blocks that are consecutive to the last one(s) fetched
  • Often, the extra blocks are placed in a special stream buffer so as not to conflict with actual active blocks in the cache; otherwise the prefetch may pollute the cache
  • Prefetching can reduce misses considerably
  • Speculative fetches should be low-priority
  • Use only otherwise-unused memory bandwidth
  • Energy-inefficient (like all speculation)

61
Compiler-Controlled Prefetching
  • Insert special instructions to load addresses from memory well before they are needed.
  • Design choices: register vs. cache target; faulting vs. nonfaulting
  • Requires semantic invisibility and a non-blocking cache
  • Can considerably reduce misses
  • Can also cause extra conflict misses
  • By replacing a block before it is completely used
  • Can also delay valid accesses (tying up the bus)
  • Should be low-priority: can be pre-empted by a real access
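
A sketch of the idea using GCC's __builtin_prefetch intrinsic (the prefetch distance of 16 iterations is an illustrative tuning choice):

    /* Prefetch array data a fixed distance ahead of its use. */
    long sum_with_prefetch(const long *a, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)   /* rw=0 (read), locality=1 (low reuse) */
                __builtin_prefetch(&a[i + 16], 0, 1);
            sum += a[i];
        }
        return sum;
    }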

62
Outline
  • Introduction
  • Cache Basics
  • Cache Performance
  • Reducing Cache Miss Penalty
  • Reducing Cache Miss Rate
  • Reducing Hit Time
  • Main Memory and Organizations
  • Memory Technology
  • Virtual Memory
  • Conclusion

63
Small and Simple Caches
  • Make the cache smaller to improve hit time
  • Or (probably better), add a new, smaller "L0" cache between the existing L1 cache and the CPU.
  • Keep the L1 cache on the same chip as the CPU
  • Physically close to the functional units that access it
  • Keep the L1 design simple, e.g., direct-mapped
  • Avoids multiple tag comparisons
  • The tag can be compared after the data cache fetch
  • Reduces effective hit time

64
Access Time in a CMOS Cache
65
Avoid Address Translation
  • In systems with virtual address spaces, virtual addresses must be mapped to physical addresses.
  • If cache blocks are indexed/tagged with physical addresses, we must do this translation before we can do the cache lookup. Long hit time!
  • Solution: access the cache using the virtual address. Call this a virtual cache.
  • Drawback: cache flush on context switch
  • Can fix by tagging blocks with process IDs (PIDs)
  • Another problem: aliasing, i.e., two virtual addresses mapped to the same real address
  • Fix with anti-aliasing or page coloring

66
Benefit of PID Tags in Virtual Cache
(Figure: miss rate under three regimes: purging the virtual cache on each context switch, tagging blocks with PIDs, and no context switching at all.)
67
Pipelined Cache Access
  • Pipeline cache access so that
  • The effective latency of a first-level cache hit can be multiple clock cycles
  • Fast cycle time and slow hits
  • Hit time is 1 cycle for the Pentium, 2 for the Pentium III, and 4 for the Pentium 4
  • Increases the number of pipeline stages
  • Higher penalty on mispredicted branches
  • More cycles from issue of the load to use of the data
  • In reality:
  • This increases instruction bandwidth rather than decreasing the actual latency of a cache hit

68
Trace Caches
  • Supply enough instructions per cycle without dependencies
  • Finding ILP beyond 4 instructions per cycle
  • Don't limit the instructions in a static cache block to spatial locality
  • Find a dynamic sequence of instructions, including taken branches
  • NetBurst (P4) uses trace caches
  • Addresses are no longer aligned.
  • The same instruction may be stored more than once
  • If it is part of multiple traces

69
Summary of Cache Optimizations
70
Outline
  • Introduction
  • Cache Basics
  • Cache Performance
  • Reducing Cache Miss Penalty
  • Reducing Cache Miss Rate
  • Reducing Hit Time
  • Main Memory and Organizations for Improving
    Performance
  • Memory Technology
  • Virtual Memory
  • Conclusion

71
Wider Main Memory
72
Simple Interleaved Memory
  • Adjacent words are found in different memory banks
  • Banks can be accessed in parallel
  • Overlap the latencies of accessing each word
  • Can use a narrow bus
  • To return the accessed words sequentially
  • Fits well with sequential access
  • e.g., the words of a cache block
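
A minimal sketch of low-order interleaving, assuming a power-of-two number of banks (the count is illustrative):

    #include <stdint.h>

    #define NUM_BANKS 4u   /* illustrative bank count */

    /* Low-order interleaving: consecutive word addresses rotate
       through the banks, so a cache-block fill hits every bank
       and the access latencies overlap. */
    static inline uint32_t bank_of(uint64_t word_addr)     { return word_addr % NUM_BANKS; }
    static inline uint64_t row_in_bank(uint64_t word_addr) { return word_addr / NUM_BANKS; }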

73
Independent Memory Banks
  • Original motivation for memory banks:
  • Higher bandwidth by interleaving sequential accesses
  • Banks also allow multiple independent accesses
  • Each bank requires separate address/data lines
  • Non-blocking caches allow the CPU to proceed beyond a cache miss
  • Allowing multiple simultaneous cache misses
  • Possible only with memory banks

74
Outline
  • Introduction
  • Cache Basics
  • Cache Performance
  • Reducing Cache Miss Penalty
  • Reducing Cache Miss Rate
  • Reducing Hit Time
  • Main Memory and Organizations
  • Memory Technology
  • Virtual Memory
  • Conclusion

75
Main Memory
  • Bandwidth: bytes read or written per unit time
  • Latency: described by
  • Access time: delay between initiation and completion
  • For reads: from presenting the address until the result is ready.
  • Cycle time: minimum interval between separate requests to memory.
  • Address lines: a separate bus CPU→Mem to carry addresses.
  • RAS (Row Access Strobe)
  • First half of the address, sent first.
  • CAS (Column Access Strobe)
  • Second half of the address, sent second.

76
RAS vs. CAS
DRAM bit-cell array:
  1. RAS selects a row
  2. Parallel readout of all row data
  3. CAS selects a column to read
  4. The selected bit is written to the memory bus
77
Typical DRAM Organization
(Diagram: a 256-Mbit DRAM; the 28-bit address is split into 14 low bits and 14 high bits for column and row selection.)
78
Types of Memory
  • DRAM (Dynamic Random Access Memory)
  • Cell design needs only 1 transistor per bit stored.
  • Cell charges leak away and may dynamically (over time) drift from their initial levels.
  • Requires periodic refreshing to correct drift
  • e.g., every 8 ms
  • Time spent refreshing is kept to <5% of bandwidth
  • SRAM (Static Random Access Memory)
  • Cell voltages are statically (unchangingly) tied to power-supply references. No drift, no refresh.
  • But needs 4-6 transistors per bit.
  • DRAM: 4-8x larger (denser), 8-16x slower, 8-16x cheaper per bit

79
Amdahl/Case Rule
  • Memory size (and I/O bandwidth) should grow linearly with CPU speed
  • Typical: 1 MB main memory and 1 Mbps I/O bandwidth per 1 MIPS of CPU performance.
  • It then takes a fairly constant 8 seconds to scan the entire memory (if memory bandwidth = I/O bandwidth, 4 bytes/load, 1 load per 4 instructions, and no latency problem)
  • Moore's Law:
  • DRAM size doubles every 18 months (up 60%/yr)
  • Tracks processor speed improvements
  • Unfortunately, DRAM latency has only decreased 7%/year. Latency is a big deal.

80
Some DRAM Trend Data
Since 1998, the rate of increase in chip capacity has slowed to 2x per 2 years: 128 Mb in 1998, 256 Mb in 2000, 512 Mb in 2002.
81
ROM and Flash
  • ROM (Read-Only Memory)
  • Nonvolatile; contents are protected (cannot be overwritten)
  • Flash
  • Nonvolatile RAM
  • NVRAMs require no power to maintain state
  • Reading flash is near DRAM speeds
  • Writing is 10-100x slower than DRAM
  • Frequently used for upgradeable embedded software
  • Used in embedded processors

82
DRAM Variations
  • SDRAM: Synchronous DRAM
  • DRAM internal operation is synchronized by a clock signal provided on the memory bus
  • Double Data Rate (DDR) uses both clock edges
  • RDRAM: Rambus (Inc.) DRAM
  • Proprietary DRAM interface technology:
  • On-chip interleaving / multi-bank technology
  • A high-speed packet-switched (split-transaction) bus interface
  • Byte-wide interface, synchronous, dual-rate
  • Licensed to many chip and CPU makers
  • Higher bandwidth, but more costly than generic SDRAM
  • DRDRAM: Direct RDRAM (2nd ed. spec.)
  • Separate row and column address/command buses
  • Higher bandwidth (18-bit data, more banks, faster clock)

83
Outline
  • Introduction
  • Cache Basics
  • Cache Performance
  • Reducing Cache Miss Penalty
  • Reducing Cache Miss Rate
  • Reducing Hit Time
  • Main Memory and Organizations
  • Memory Technology
  • Virtual Memory
  • Conclusion

84
Virtual Memory
The addition of the virtual memory mechanism complicates cache access.
85
Paging vs. Segmentation
  • Paged segments: each segment holds an integral number of pages, making replacement easy while each segment can still be treated as a unit

86
Four Important Questions
  • Where to place a block in main memory?
  • The operating system takes care of it
  • A miss (page fault) takes very long, so placement is fully associative
  • How to find a block in main memory?
  • A page table is used
  • The offset is concatenated when paging is used
  • The offset is added when segmentation is used.
  • Which block to replace when needed?
  • Obviously, LRU is used to minimize page faults
  • What happens on a write?
  • Magnetic disks take millions of cycles to access.
  • Always write back (using a dirty bit).

87
Addressing Virtual Memories
88
Fast Address Calculation
  • Page tables are very large
  • Kept in main memory
  • Two memory accesses for one read or write!
  • Remember the last translation
  • Reuse it if the next address is on the same page
  • Exploit the principle of locality:
  • If accesses have locality, the address translations should also have locality
  • Keep the address translations in a cache:
  • Translation lookaside buffer (TLB)
  • The tag part stores the virtual page number and the data part stores the physical page number.
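
A toy sketch of a direct-mapped TLB lookup (all names and sizes are illustrative; real TLBs are usually highly associative):

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64u
    #define PAGE_BITS   13u    /* 8KB pages, as on the Alpha */

    struct tlb_entry { uint64_t vpn, pfn; bool valid; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Translate a virtual address; true on a TLB hit. On a miss,
       the page table must be walked and the TLB refilled. */
    static bool tlb_lookup(uint64_t vaddr, uint64_t *paddr) {
        uint64_t vpn = vaddr >> PAGE_BITS;
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
        if (!e->valid || e->vpn != vpn)
            return false;
        *paddr = (e->pfn << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
        return true;
    }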

89
TLB Example Alpha 21264
(Figure: the 21264 TLB; its address-space number field plays the same role as a PID tag.)
90
A Memory Hierarchy Example
91
Protection of Virtual Memory
  • Maintain two registers:
  • Base
  • Bound
  • For each access, check:
  • base < address < bound
  • Provide two modes:
  • User
  • OS (kernel, supervisor, executive)
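
A minimal sketch of the check (illustrative; real hardware performs it on every access):

    #include <stdbool.h>
    #include <stdint.h>

    /* Base-and-bound protection: an access is legal only if the
       address lies between the base and bound registers. */
    static bool access_ok(uint64_t addr, uint64_t base, uint64_t bound) {
        return base < addr && addr < bound;
    }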

92
Alpha-21264 Virtual Addr. Mapping
Supports both segmentation and paging
93
Outline
  • Introduction
  • Cache Basics
  • Cache Performance
  • Reducing Cache Miss Penalty
  • Reducing Cache Miss Rate
  • Reducing Hit Time
  • Main Memory and Organizations
  • Memory Technology
  • Virtual Memory
  • Conclusion

94
Design of Memory Hierarchies
  • Superscalar CPUs: the number of ports to the cache
  • Speculative execution and the memory system
  • Combining the instruction cache with fetch and decode
  • Caches in embedded systems!
  • Real-time vs. power constraints
  • I/O and consistency of cached data
  • The cache coherence problem

95
The Cache Coherency Problem
96
Alpha 21264 Memory Hierarchy