CPE 431/531 Chapter 7 - Large and Fast: Exploiting Memory Hierarchy

1
CPE 431/531 Chapter 7 - Large and Fast
Exploiting Memory Hierarchy
  • Dr. Rhonda Kay Gaede
  • UAH

2
7.1 Introduction
  • Programmers always want unlimited amounts of fast
    memory. Caches give that illusion.
  • Principle of Locality
  • Temporal Locality - ____________________________

  • ____________________________
  • Spatial Locality - ___________________________
    _

  • ____________________________
  • Build a memory hierarchy.

3
7.1 Introduction - Cache Terminology
  • Data is copied between only two levels at a time.
  • The minimum data unit is a ________.
  • If the data appears in the upper level, this
    situation is called a _______. The data not
    appearing in the upper level is called a ________.

4
7.1 Introduction - More Terminology
  • The _____________ is the fraction of memory
    accesses found in the upper level.
  • The ______________ is the fraction of memory
    accesses not found in the upper level.
  • _____________ is the time to access the upper
    level of the memory hierarchy.
  • The _________________ is the time to replace a
    block in the upper level with the corresponding
    block from the lower level, plus the time to
    deliver this block to the processor.

5
7.2 The Basics of Caches - Burning Questions
  • How do we know whether a data item is in the
    cache?
  • If it is, how do we find it?
  • The simplest scheme is that each item can be
    placed in exactly one place _______________.
  • Mapping

6
7.2 The Basics of Caches - Accessing a Cache
  • Address reference sequence (decimal): 22, 26,
    22, 26, 16, 3, 16, 18

7
7.2 The Basics of Caches - Mapping Implemented
in Hardware
8
7.2 The Basics of Caches - Total Storage Required
  • Example
  • How many total bits are required for a
    direct-mapped cache with 16 KB of data and
    four-word blocks, assuming a 32-bit address?
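One way to check the arithmetic (a short Python sketch; the block and bit counts below follow the standard worked answer for this geometry):

```python
# Geometry: 16 KB of data, four-word (16-byte) blocks, 32-bit byte address.
words_per_block = 4
bytes_per_block = 4 * words_per_block          # 16 bytes per block
num_blocks = (16 * 1024) // bytes_per_block    # 1024 blocks
index_bits = 10                                # log2(1024)
offset_bits = 4                                # 2 word-offset + 2 byte-offset bits
tag_bits = 32 - index_bits - offset_bits       # 18 tag bits
bits_per_entry = words_per_block * 32 + tag_bits + 1   # data + tag + valid = 147
total_bits = num_blocks * bits_per_entry       # 150528 bits, about 147 Kbits
print(total_bits, tag_bits)  # 150528 18
```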

9
7.2 The Basics of Caches - Mapping an Address to
a Multiword Cache Block
  • Consider a cache with 64 blocks and a block size
    of 16 bytes. What block number does byte address
    1200 map to?
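The mapping is (block address) mod (number of blocks in the cache):

```python
# Byte address -> block number for a direct-mapped cache.
BLOCK_SIZE = 16   # bytes per block
NUM_BLOCKS = 64
addr = 1200
block_address = addr // BLOCK_SIZE         # 1200 / 16 = 75
block_number = block_address % NUM_BLOCKS  # 75 mod 64 = 11
print(block_number)  # 11
```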

10
7.2 The Basics of Caches - Miss Rate versus
Block Size
11
7.2 The Basics of Caches - Handling Cache Misses
  • Instruction Cache Miss
  • 1. Send the original PC value (current PC - 4)
    to the memory.
  • 2. Instruct main memory to perform a read and
    wait for the memory to complete its access.
  • 3. Write the cache entry, putting the data from
    memory in the data portion of the entry, writing
    the upper bits of the address into the tag field,
    and turning the valid bit on.
  • 4. Restart the instruction execution at the
    first step, which will refetch the instruction,
    this time finding it in the cache.
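The four steps above can be sketched as a toy model (the cache geometry, memory contents, and helper names here are illustrative, not from the slides):

```python
BLOCK_SIZE = 16   # bytes per block
NUM_BLOCKS = 4    # tiny direct-mapped instruction cache

def tag_of(addr):
    # Bits above the index and offset fields
    return addr // (BLOCK_SIZE * NUM_BLOCKS)

def handle_instruction_miss(cache, memory, pc):
    miss_addr = pc - 4                          # step 1: original PC (current PC - 4)
    base = (miss_addr // BLOCK_SIZE) * BLOCK_SIZE
    block = memory[base:base + BLOCK_SIZE]      # step 2: read the block from main memory
    index = (miss_addr // BLOCK_SIZE) % NUM_BLOCKS
    cache[index] = {"valid": True,              # step 3: fill valid bit, tag, and data
                    "tag": tag_of(miss_addr),
                    "data": block}
    return miss_addr                            # step 4: caller restarts the fetch here

memory = bytes(range(256))                      # stand-in for main memory
cache = {}
restart_pc = handle_instruction_miss(cache, memory, pc=0x44)
entry = cache[(restart_pc // BLOCK_SIZE) % NUM_BLOCKS]
print(entry["valid"], entry["tag"])  # True 1
```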

12
7.2 The Basics of Caches - Handling Writes
  • Suppose on a store instruction, we wrote the data
    into only the data cache (and not
    __________________).
  • Then the cache and main memory are said to be
    _____________
  • Solution: _______________
  • Problem: ___________________
  • Solution: ________________
  • Alternative: __________________________________

13
7.2 The Basics of Caches - An Example Cache: The
Intrinsity FastMATH Processor
The Intrinsity FastMATH processor is a fast
embedded processor that uses the MIPS
architecture and a simple cache implementation.
It has a 12-stage pipeline. When operating at
peak speed, the processor can request both an
instruction word and a data word on every clock
cycle. Separate instruction and data caches are
used, each with 4K words and 16-word blocks. For
writes, the FastMATH offers both write-through
and write-back, letting the OS decide.
14
7.2 The Basics of Caches - Designing Memory
Systems to Support Caches
  • Given
  • 1 memory bus clock cycle to send the address
  • 15 memory bus clock cycles for each DRAM access
    initiated
  • 1 memory bus clock cycle to send a word of data
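With these timings, the miss penalty for fetching a block over a one-word-wide memory works out as below (the four-word block size is an assumption carried over from the earlier cache examples):

```python
# Miss penalty for a four-word block over a one-word-wide memory.
addr_cycles = 1          # send the address once
words = 4                # one DRAM access and one transfer per word
access_cycles = 15
transfer_cycles = 1
miss_penalty = addr_cycles + words * access_cycles + words * transfer_cycles
print(miss_penalty)  # 1 + 60 + 4 = 65 cycles
```

Wider or interleaved memory organizations reduce this penalty by overlapping or widening the accesses.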

15
7.3 Measuring and Improving Cache Performance -
Introduction
  • CPU time = (_______________________ +
    __________________) x ______________
  • Memory-stall clock cycles = ________________ +
    ______________
  • Read-stall cycles
  • Write-stall cycles
  • Memory-stall clock cycles
  • Calculating Cache Performance
  • Imiss = 2%, Dmiss = 4%, CPI = 2, miss penalty
    = 100 cycles, SPECint frequency of loads and stores
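Plugging the numbers in (the 36% load/store frequency is an assumption taken from the SPECint2000 figure this example is usually paired with; the slide leaves it unstated):

```python
cpi_base     = 2.0
imiss_rate   = 0.02
dmiss_rate   = 0.04
miss_penalty = 100
ls_frequency = 0.36   # assumed fraction of instructions that are loads/stores

inst_stall = imiss_rate * miss_penalty                 # 2.0 cycles/instruction
data_stall = ls_frequency * dmiss_rate * miss_penalty  # 1.44 cycles/instruction
cpi_stall  = cpi_base + inst_stall + data_stall
print(round(cpi_stall, 2))  # 5.44
```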

16
7.3 Measuring and Improving Cache Performance -
Impact of Increased Clock Rate
  • Suppose the processor in the previous example
    doubles its clock rate, making the miss penalty
    ________.
  • Total miss cycles per instruction _______.
  • Total CPI
  • Performance with fast clock compared to
    performance with slow clock
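Carrying the same assumptions forward (36% loads/stores, base CPI 2): the absolute memory time is unchanged, so doubling the clock doubles the miss penalty in cycles:

```python
miss_penalty = 200    # 100 cycles at the old clock -> 200 at the doubled clock
miss_cycles = 0.02 * miss_penalty + 0.36 * 0.04 * miss_penalty  # 6.88 cycles/inst
cpi_fast = 2.0 + miss_cycles                                    # 8.88
cpi_slow = 5.44                                                 # from the previous example
speedup = cpi_slow / (cpi_fast / 2)   # fast clock ticks twice as often
print(round(speedup, 2))  # 1.23
```

Doubling the clock rate yields well under a 2x speedup because memory stalls now dominate.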

17
7.3 Measuring and Improving Cache Performance -
Flexible Placement
  • One Extreme - direct mapped -
  • Middle Range - set associative -
  • Other Extreme - fully associative -
  • Set associative mapping

18
7.3 Measuring and Improving Cache Performance -
Set Associativity
  • Conceptual View
  • Pseudo-Implementation View

19
7.3 Measuring and Improving Cache Performance -
Misses and Associativity
  • Look at three small caches (four one-word
    blocks)
  • a. fully associative
  • b. two-way set associative
  • c. direct mapped
  • Address sequence: 0, 8, 0, 6, 8
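A small simulator makes the miss counts concrete (LRU replacement where associativity allows a choice):

```python
# Count misses for a block-address sequence in a set-associative cache.
def count_misses(addrs, num_sets, ways):
    sets = [[] for _ in range(num_sets)]   # each set is an LRU list, MRU last
    misses = 0
    for a in addrs:
        s = sets[a % num_sets]
        if a in s:
            s.remove(a)                    # hit: refresh LRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # evict the least recently used block
        s.append(a)
    return misses

seq = [0, 8, 0, 6, 8]
print(count_misses(seq, 4, 1))  # direct mapped:      5 misses
print(count_misses(seq, 2, 2))  # two-way set assoc.: 4 misses
print(count_misses(seq, 1, 4))  # fully associative:  3 misses
```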

20
7.3 Measuring and Improving Cache Performance -
Locating a Block in the Cache
21
7.3 Measuring and Improving Cache Performance -
Tag Size Considerations
  • Size of Tags versus Set Associativity
  • For a cache with 4K blocks, a 32-bit address,
    and __ bits for block and byte offsets, find the
    # of sets and # of tag bits for 1-way, 2-way,
    4-way, and fully associative organizations

22
7.3 Measuring and Improving Cache Performance -
Multilevel Caches
  • Example
  • CPIbase = 1.0, CR = 5 GHz
  • Mem access = 100 ns, L1 inst miss rate = 2%
  • L2 access = 5 ns, L2 miss rate = 0.5%
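These figures work out as follows (a quick Python check; at 5 GHz a cycle is 0.2 ns, so nanoseconds convert to cycles by multiplying by 5):

```python
clock_ghz = 5
mem_penalty = 100 * clock_ghz   # 100 ns -> 500 cycles to main memory
l2_penalty  = 5 * clock_ghz     # 5 ns   -> 25 cycles to the L2 cache

cpi_one_level = 1.0 + 0.02 * mem_penalty                       # 1 + 10   = 11.0
cpi_two_level = 1.0 + 0.02 * l2_penalty + 0.005 * mem_penalty  # 1+0.5+2.5 = 4.0
print(round(cpi_one_level, 1), round(cpi_two_level, 1))  # 11.0 4.0
```

So the second-level cache makes the processor about 11.0 / 4.0 = 2.75 times faster.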

23
7.4 Virtual Memory - Introduction
  • The main memory can act as a cache for the
    secondary storage.
  • Historically, two motivations for virtual memory
  • _________________________________________
  • __________________________________________
  • Virtual memory implements the ____________ of a
    program's address space to ________________.
  • This translation process enforces ______________
    of a program's address space from other programs.

24
7.4 Virtual Memory - Terminology
  • A virtual memory ______________ is called a
    _______.
  • A virtual memory _________ is called a ______
  • _________.
  • Each virtual address is translated to a physical
    address.
  • This process is called ______________________.

25
7.4 Virtual Memory - Facts
  • A virtual address is broken into a virtual page
    number and a page offset.
  • A page fault takes millions of cycles to process.
  • Pages should be large enough to amortize the high
    access time, though embedded systems are going
    smaller.
  • Fully associative placement of pages is allowed.
  • Page faults can be handled in software.
  • Virtual memory uses write-back.
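The virtual-page-number/page-offset split mentioned above can be sketched as follows (the 32-bit address width and 4 KB page size are assumed example values, not fixed by the slide):

```python
# Split a 32-bit virtual address into virtual page number and page offset.
PAGE_OFFSET_BITS = 12                      # 4 KB pages
vaddr = 0x00403A2C
vpn    = vaddr >> PAGE_OFFSET_BITS         # upper bits: virtual page number
offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)  # low 12 bits: offset in page
print(hex(vpn), hex(offset))  # 0x403 0xa2c
```

Translation replaces the VPN with a physical page number; the offset passes through unchanged.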

26
7.4 Virtual Memory - Mapping
  • Pages are located by using a full table that
    indexes memory.
  • Each program has its own page table.
  • The page table register points to the beginning
    of the page table.
  • A full table is too expensive, so hierarchical
    page tables are used.
  • The state of a process consists of the page
    table register, program counter, and registers.

27
7.4 Virtual Memory - Page Faults
  • The operating system manages page replacement.
  • The operating system usually creates the space on
    disk for all of the pages of a process when it
    creates the process; this space is called
    _________.
  • A data structure records where each page is
    stored on disk. Another data structure tracks
    which processes and which virtual addresses use
    each physical page.
  • On a page fault, the ____________
  • ____ page is evicted.
  • Consider 10, 12, 9, 7, 11, 10, then 8
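Tracing the reference string above under LRU replacement (the usual answer for the blank) with five page frames (an assumed frame count for illustration):

```python
# LRU page replacement on a reference string.
def lru_trace(refs, frames):
    resident = []                 # recency order, most recently used last
    evicted = []
    for p in refs:
        if p in resident:
            resident.remove(p)    # hit: refresh recency
        elif len(resident) == frames:
            evicted.append(resident.pop(0))  # fault with full memory: evict LRU
        resident.append(p)
    return resident, evicted

resident, evicted = lru_trace([10, 12, 9, 7, 11, 10, 8], frames=5)
print(evicted)   # [12] -> 12 is least recently used when 8 arrives
print(resident)  # [9, 7, 11, 10, 8]
```

Note that 10 survives even though it arrived first, because it was re-referenced just before 8.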

28
7.4 Virtual Memory - Making Address Translation
Fast: The TLB
  • With virtual memory, you need two memory
    accesses, one extra for the translation.
  • Add a cache to keep track of recent
    translations. It's called a translation
    lookaside buffer (TLB).
  • A TLB miss may or may not be a page fault.
  • TLB characteristics
  • size ________________
  • block size ___________
  • hit time ______________
  • miss penalty ___________
  • miss rate ______________

29
7.4 Virtual Memory - The Intrinsity FastMATH TLB
30
7.4 Virtual Memory - Intrinsity FastMATH TLB and
Cache: Processing a Read or a Write-Through
31
7.4 Virtual Memory - Overall Operation of a
Memory Hierarchy
32
7.4 Virtual Memory - Implementing Protection with
Virtual Memory
  • Hardware Must Provide
  • At least two modes
  • _________________
  • _________________
  • Provide a portion of the processor state that a
    user process can read but not write
  • ______________________________________________
  • Provide mechanisms whereby the processor can move
    between modes
  • _________________________
  • _________________________________

33
7.4 Virtual Memory - Handling Page Faults and
TLB Misses
  • TLB Miss: Exception handler at a special address
  • ________________________________________
  • ________________________________________
  • Page Fault: Exception handler at a general address
  • ___________________________
  • ___________________________
  • ___________________________
  • During exception processing, we must make sure
    __________________, so we may __________
    exceptions for a period of time

34
7.5 A Common Framework for Memory Hierarchies -
Question 1
  • Question 1: Where Can a Block be Placed?
  • Answer: A range of associativities is possible.
  • Advantage: Increasing associativity decreases
    miss rates.
  • Disadvantage: Increasing associativity increases
    cost and access time.

35
7.5 A Common Framework for Memory Hierarchies -
Question 2
  • Question 2: How is a Block Found?
  • Answers
  • Caches - Small degrees of associativity are used
    because high degrees are costly
  • Virtual Memory - Full associativity makes sense
    because
  • Misses are very expensive
  • Software can implement sophisticated replacement
    schemes
  • A full map can be easily indexed
  • Large items mean a small number of mappings

36
7.5 A Common Framework for Memory Hierarchies -
Question 3
  • Question 3: Which Block Should be Replaced on a
    Cache Miss?
  • Answers
  • Cache
  • Random
  • LRU
  • Virtual Memory
  • LRU

37
7.5 A Common Framework for Memory Hierarchies -
Question 4
  • Question 4: What Happens on a Write?
  • Answers
  • Caches - write-back is the strategy of the future
  • Virtual memory - always uses write-back

38
7.5 A Common Framework for Memory Hierarchies -
The Three Cs
  • Cache misses occur in three categories
  • Compulsory misses -
  • Capacity misses -
  • Conflict misses -
  • The challenge in designing memory hierarchies is
    that every change that potentially improves the
    miss rate can also negatively affect overall
    performance.

39
7.6 Real Stuff - The Pentium P4 and Opteron
Memory Hierarchies
  • Both have secondary caches on the main processor
    die.

40
7.6 Real Stuff - The Pentium P4 and Opteron
Memory Hierarchies
41
7.7 Fallacies and Pitfalls
  • Pitfall: Forgetting to account for byte
    addressing or the cache block size in simulating
    a cache.
  • Pitfall: Ignoring memory system behavior when
    writing programs or when generating code in a
    compiler.
  • Pitfall: Using average memory access time to
    evaluate the memory hierarchy of an out-of-order
    processor.
  • Pitfall: Extending an address space by adding
    segments on top of an unsegmented address space.

42
7.8 Concluding Remarks
  • Because CPU speeds continue to increase faster
    than either DRAM access times or disk access
    times, memory will increasingly be the factor
    that limits performance.

43
7.8 Concluding Remarks - Developments
  • Hardware
  • Changes in cache capabilities
  • Software
  • Restructuring loops
  • Compiler-directed prefetching