CPE 431/531 Chapter 7 - Large and Fast: Exploiting Memory Hierarchy

1
CPE 431/531 Chapter 7 - Large and Fast
Exploiting Memory Hierarchy
  • Dr. Rhonda Kay Gaede
  • UAH

2
7.1 Introduction
  • Programmers always want unlimited amounts of fast
    memory. Caches give that illusion.
  • Principle of Locality
  • Temporal Locality - ____________________________

  • ____________________________
  • Spatial Locality - ___________________________
    _

  • ____________________________
  • Build a memory hierarchy.

3
7.1 Introduction - Cache Terminology
  • Data is copied between only two levels at a time.
  • The minimum data unit is a ________.
  • If the data appears in the upper level, this
    situation is called a _______. The data not
    appearing in the upper level is called a ________.

4
7.1 Introduction - More Terminology
  • The _____________ is the fraction of memory
    accesses found in the upper level.
  • The ______________ is the fraction of memory
    accesses not found in the upper level.
  • _____________ is the time to access the upper
    level of the memory hierarchy.
  • The _________________ is the time to replace a
    block in the upper level with the corresponding
    block from the lower level, plus the time to
    deliver this block to the processor.

5
7.2 The Basics of Caches - Burning Questions
  • How do we know whether a data item is in the
    cache?
  • If it is, how do we find it?
  • The simplest scheme is that each item can be
    placed in exactly one place _______________.
  • Mapping

6
7.2 The Basics of Caches - Accessing a Cache
  • Address reference sequence (decimal): 22, 26,
    22, 26, 16, 3, 16, 18

7
7.2 The Basics of Caches - Mapping Implemented
in Hardware
8
7.2 The Basics of Caches - Total Storage Required
  • Example
  • How many total bits are required for a
    direct-mapped cache with 16 KB of data and
    four-word blocks, assuming a 32-bit address?
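One way to check the arithmetic (a short Python sketch; the block and bit counts below follow the standard worked answer for this geometry):

```python
# Geometry: 16 KB of data, four-word (16-byte) blocks, 32-bit byte address.
words_per_block = 4
bytes_per_block = 4 * words_per_block          # 16 bytes per block
num_blocks = (16 * 1024) // bytes_per_block    # 1024 blocks
index_bits = 10                                # log2(1024)
offset_bits = 4                                # 2 word-offset + 2 byte-offset bits
tag_bits = 32 - index_bits - offset_bits       # 18 tag bits
bits_per_entry = words_per_block * 32 + tag_bits + 1   # data + tag + valid = 147
total_bits = num_blocks * bits_per_entry       # 150528 bits, about 147 Kbits
print(total_bits, tag_bits)  # 150528 18
```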

9
7.2 The Basics of Caches - Mapping an Address to
a Multiword Cache Block
  • Consider a cache with 64 blocks and a block size
    of 16 bytes. What block number does byte address
    1200 map to?
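The mapping is (block address) mod (number of blocks in the cache):

```python
# Byte address -> block number for a direct-mapped cache.
BLOCK_SIZE = 16   # bytes per block
NUM_BLOCKS = 64
addr = 1200
block_address = addr // BLOCK_SIZE         # 1200 / 16 = 75
block_number = block_address % NUM_BLOCKS  # 75 mod 64 = 11
print(block_number)  # 11
```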

10
7.2 The Basics of Caches - Miss Rate versus
Block Size
11
7.2 The Basics of Caches - Handling Cache Misses
  • Instruction Cache Miss
  • 1. Send the original PC value (current PC - 4)
    to the memory.
  • 2. Instruct main memory to perform a read and
    wait for the memory to complete its access.
  • 3. Write the cache entry, putting the data from
    memory in the data portion of the entry, writing
    the upper bits of the address into the tag field,
    and turning the valid bit on.
  • 4. Restart the instruction execution at the
    first step, which will refetch the instruction,
    this time finding it in the cache.
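The four steps above can be sketched as a toy model (the cache geometry, memory contents, and helper names here are illustrative, not from the slides):

```python
BLOCK_SIZE = 16   # bytes per block
NUM_BLOCKS = 4    # tiny direct-mapped instruction cache

def tag_of(addr):
    # Bits above the index and offset fields
    return addr // (BLOCK_SIZE * NUM_BLOCKS)

def handle_instruction_miss(cache, memory, pc):
    miss_addr = pc - 4                          # step 1: original PC (current PC - 4)
    base = (miss_addr // BLOCK_SIZE) * BLOCK_SIZE
    block = memory[base:base + BLOCK_SIZE]      # step 2: read the block from main memory
    index = (miss_addr // BLOCK_SIZE) % NUM_BLOCKS
    cache[index] = {"valid": True,              # step 3: fill valid bit, tag, and data
                    "tag": tag_of(miss_addr),
                    "data": block}
    return miss_addr                            # step 4: caller restarts the fetch here

memory = bytes(range(256))                      # stand-in for main memory
cache = {}
restart_pc = handle_instruction_miss(cache, memory, pc=0x44)
entry = cache[(restart_pc // BLOCK_SIZE) % NUM_BLOCKS]
print(entry["valid"], entry["tag"])  # True 1
```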

12
7.2 The Basics of Caches - Handling Writes
  • Suppose on a store instruction, we wrote the data
    into only the data cache (and not
    __________________).
  • Then the cache and main memory are said to be
    _____________
  • Solution: _______________
  • Problem: ___________________
  • Solution: ________________
  • Alternative: __________________________________

13
7.2 The Basics of Caches - An Example Cache: The
Intrinsity FastMATH Processor
The Intrinsity FastMATH processor is a fast
embedded processor that uses the MIPS
architecture and a simple cache implementation.
It has a 12-stage pipeline. When operating at
peak speed, the processor can request both an
instruction word and a data word on every clock
cycle. Separate instruction and data caches are
used, each with 4K words and 16-word blocks. For
writes, the FastMATH offers both write-through
and write-back, letting the OS decide.
14
7.2 The Basics of Caches - Designing Memory
Systems to Support Caches
  • Given
  • 1 memory bus clock cycle to send the address
  • 15 memory bus clock cycles for each DRAM access
    initiated
  • 1 memory bus clock cycle to send a word of data
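With these timings, the miss penalty for fetching a block over a one-word-wide memory works out as below (the four-word block size is an assumption carried over from the earlier cache examples):

```python
# Miss penalty for a four-word block over a one-word-wide memory.
addr_cycles = 1          # send the address once
words = 4                # one DRAM access and one transfer per word
access_cycles = 15
transfer_cycles = 1
miss_penalty = addr_cycles + words * access_cycles + words * transfer_cycles
print(miss_penalty)  # 1 + 60 + 4 = 65 cycles
```

Wider or interleaved memory organizations reduce this penalty by overlapping or widening the accesses.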

15
7.3 Measuring and Improving Cache Performance -
Introduction
  • CPU time = (_______________________ +
    __________________) x ______________
  • Memory-stall clock cycles = ________________ +
    ______________
  • Read-stall cycles
  • Write-stall cycles
  • Memory-stall clock cycles
  • Calculating Cache Performance
  • Imiss = 2%, Dmiss = 4%, CPI = 2, miss penalty
    = 100 cycles, SPECint frequency of loads and stores
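Plugging the numbers in (the 36% load/store frequency is an assumption taken from the SPECint2000 figure this example is usually paired with; the slide leaves it unstated):

```python
cpi_base     = 2.0
imiss_rate   = 0.02
dmiss_rate   = 0.04
miss_penalty = 100
ls_frequency = 0.36   # assumed fraction of instructions that are loads/stores

inst_stall = imiss_rate * miss_penalty                 # 2.0 cycles/instruction
data_stall = ls_frequency * dmiss_rate * miss_penalty  # 1.44 cycles/instruction
cpi_stall  = cpi_base + inst_stall + data_stall
print(round(cpi_stall, 2))  # 5.44
```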

16
7.3 Measuring and Improving Cache Performance -
Impact of Increased Clock Rate
  • Suppose the processor in the previous example
    doubles its clock rate, making the miss penalty
    ________.
  • Total miss cycles per instruction _______.
  • Total CPI
  • Performance with fast clock compared to
    performance with slow clock
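Carrying the same assumptions forward (36% loads/stores, base CPI 2): the absolute memory time is unchanged, so doubling the clock doubles the miss penalty in cycles:

```python
miss_penalty = 200    # 100 cycles at the old clock -> 200 at the doubled clock
miss_cycles = 0.02 * miss_penalty + 0.36 * 0.04 * miss_penalty  # 6.88 cycles/inst
cpi_fast = 2.0 + miss_cycles                                    # 8.88
cpi_slow = 5.44                                                 # from the previous example
speedup = cpi_slow / (cpi_fast / 2)   # fast clock ticks twice as often
print(round(speedup, 2))  # 1.23
```

Doubling the clock rate yields well under a 2x speedup because memory stalls now dominate.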

17
7.3 Measuring and Improving Cache Performance -
Flexible Placement
  • One Extreme - direct mapped -
  • Middle Range - set associative -
  • Other Extreme - fully associative -
  • Set associative mapping

18
7.3 Measuring and Improving Cache Performance -
Set Associativity
  • Conceptual View
  • Pseudo-Implementation View

19
7.3 Measuring and Improving Cache Performance -
Misses and Associativity
  • Look at three small caches (four one-word
    blocks)
  • a. fully associative
  • b. two-way set associative
  • c. direct mapped
  • Address sequence: 0, 8, 0, 6, 8
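A small simulator makes the miss counts concrete (LRU replacement where associativity allows a choice):

```python
# Count misses for a block-address sequence in a set-associative cache.
def count_misses(addrs, num_sets, ways):
    sets = [[] for _ in range(num_sets)]   # each set is an LRU list, MRU last
    misses = 0
    for a in addrs:
        s = sets[a % num_sets]
        if a in s:
            s.remove(a)                    # hit: refresh LRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # evict the least recently used block
        s.append(a)
    return misses

seq = [0, 8, 0, 6, 8]
print(count_misses(seq, 4, 1))  # direct mapped:      5 misses
print(count_misses(seq, 2, 2))  # two-way set assoc.: 4 misses
print(count_misses(seq, 1, 4))  # fully associative:  3 misses
```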

20
7.3 Measuring and Improving Cache Performance -
Locating a Block in the Cache
21
7.3 Measuring and Improving Cache Performance -
Tag Size Considerations
  • Size of Tags versus Set Associativity
  • For a cache with 4K blocks, a 32-bit address,
    and __ bits for block and byte offsets, find the
    # of sets and # of tag bits for 1-way, 2-way,
    4-way, and fully associative organizations

22
7.3 Measuring and Improving Cache Performance -
Multilevel Caches
  • Example
  • CPIbase = 1.0, CR = 5 GHz
  • Mem access = 100 ns, L1 inst miss rate = 2%
  • L2 access = 5 ns, L2 miss rate = 0.5%
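These figures work out as follows (a quick Python check; at 5 GHz a cycle is 0.2 ns, so nanoseconds convert to cycles by multiplying by 5):

```python
clock_ghz = 5
mem_penalty = 100 * clock_ghz   # 100 ns -> 500 cycles to main memory
l2_penalty  = 5 * clock_ghz     # 5 ns   -> 25 cycles to the L2 cache

cpi_one_level = 1.0 + 0.02 * mem_penalty                       # 1 + 10   = 11.0
cpi_two_level = 1.0 + 0.02 * l2_penalty + 0.005 * mem_penalty  # 1+0.5+2.5 = 4.0
print(round(cpi_one_level, 1), round(cpi_two_level, 1))  # 11.0 4.0
```

So the second-level cache makes the processor about 11.0 / 4.0 = 2.75 times faster.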

23
7.4 Virtual Memory - Introduction
  • The main memory can act as a cache for the
    secondary storage.
  • Historically, two motivations for virtual memory
  • _________________________________________
  • __________________________________________
  • Virtual memory implements the ____________ of a
    program's address space to ________________.
  • This translation process enforces ______________
    of a program's address space from other programs.

24
7.4 Virtual Memory - Terminology
  • A virtual memory ______________ is called a
    _______.
  • A virtual memory _________ is called a ______
  • _________.
  • Each virtual address is translated to a physical
    address.
  • This process is called ______________________.

25
7.4 Virtual Memory - Facts
  • A virtual address is broken into a virtual page
    number and a page offset.
  • A page fault takes millions of cycles to process.
  • Pages should be large enough to amortize the high
    access time, though embedded systems are going
    smaller.
  • Fully associative placement of pages is allowed.
  • Page faults can be handled in software.
  • Virtual memory uses write-back.
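The virtual-page-number/page-offset split mentioned above can be sketched as follows (the 32-bit address width and 4 KB page size are assumed example values, not fixed by the slide):

```python
# Split a 32-bit virtual address into virtual page number and page offset.
PAGE_OFFSET_BITS = 12                      # 4 KB pages
vaddr = 0x00403A2C
vpn    = vaddr >> PAGE_OFFSET_BITS         # upper bits: virtual page number
offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)  # low 12 bits: offset in page
print(hex(vpn), hex(offset))  # 0x403 0xa2c
```

Translation replaces the VPN with a physical page number; the offset passes through unchanged.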

26
7.4 Virtual Memory - Mapping
  • Pages are located by using a full table that
    indexes memory.
  • Each program has its own page table.
  • The page table register points to the beginning
    of the page table.
  • A full table is too expensive, so hierarchical
    page tables are used.
  • The state of a process consists of the page
    table register, program counter, and registers.

27
7.4 Virtual Memory - Page Faults
  • The operating system manages page replacement.
  • The operating system usually creates the space on
    disk for all of the pages of a process when it
    creates the process; this space is called
    _________.
  • A data structure records where each page is
    stored on disk. Another data structure tracks
    which processes and which virtual addresses use
    each physical page.
  • On a page fault, the ____________
  • ____ page is evicted.
  • Consider 10, 12, 9, 7, 11, 10, then 8
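Tracing the reference string above under LRU replacement (the usual answer for the blank) with five page frames (an assumed frame count for illustration):

```python
# LRU page replacement on a reference string.
def lru_trace(refs, frames):
    resident = []                 # recency order, most recently used last
    evicted = []
    for p in refs:
        if p in resident:
            resident.remove(p)    # hit: refresh recency
        elif len(resident) == frames:
            evicted.append(resident.pop(0))  # fault with full memory: evict LRU
        resident.append(p)
    return resident, evicted

resident, evicted = lru_trace([10, 12, 9, 7, 11, 10, 8], frames=5)
print(evicted)   # [12] -> 12 is least recently used when 8 arrives
print(resident)  # [9, 7, 11, 10, 8]
```

Note that 10 survives even though it arrived first, because it was re-referenced just before 8.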

28
7.4 Virtual Memory - Making Address Translation
Fast: The TLB
  • With virtual memory, you need two memory
    accesses, one extra for the translation.
  • Add a cache to keep track of recent
    translations. It's called a translation
    lookaside buffer (TLB).
  • A TLB miss may or may not be a page fault.
  • TLB characteristics
  • size ________________
  • block size ___________
  • hit time ______________
  • miss penalty ___________
  • miss rate ______________

29
7.4 Virtual Memory - The Intrinsity FastMATH TLB
30
7.4 Virtual Memory - Intrinsity FastMATH TLB and
Cache: Processing a Read or a Write-Through
31
7.4 Virtual Memory - Overall Operation of a
Memory Hierarchy
32
7.4 Virtual Memory - Implementing Protection with
Virtual Memory
  • Hardware Must Provide
  • At least two modes
  • _________________
  • _________________
  • Provide a portion of the processor state that a
    user process can read but not write
  • ______________________________________________
  • Provide mechanisms whereby the processor can move
    between modes
  • _________________________
  • _________________________________

33
7.4 Virtual Memory - Handling Page Faults and
TLB Misses
  • TLB Miss: Exception handler at a special address
  • ________________________________________
  • ________________________________________
  • Page Fault: Exception handler at a general address
  • ___________________________
  • ___________________________
  • ___________________________
  • During exception processing, we must make sure
    __________________, so we may __________
    exceptions for a period of time

34
7.5 A Common Framework for Memory Hierarchies -
Question 1
  • Question 1: Where Can a Block be Placed?
  • Answer: A range of associativities is possible.
  • Advantage: Increasing associativity decreases
    miss rates.
  • Disadvantage: Increasing associativity increases
    cost and access time.

35
7.5 A Common Framework for Memory Hierarchies -
Question 2
  • Question 2: How is a Block Found?
  • Answers
  • Caches - Small degrees of associativity are used
    because high degrees are costly
  • Virtual Memory - Full associativity makes sense
    because
  • Misses are very expensive
  • Software can implement sophisticated replacement
    schemes
  • A full map can be easily indexed
  • Large items mean a small number of mappings

36
7.5 A Common Framework for Memory Hierarchies -
Question 3
  • Question 3: Which Block Should be Replaced on a
    Cache Miss?
  • Answers
  • Cache
  • Random
  • LRU
  • Virtual Memory
  • LRU

37
7.5 A Common Framework for Memory Hierarchies -
Question 4
  • Question 4: What Happens on a Write?
  • Answers
  • Caches - write-back is the strategy of the future
  • Virtual memory - always uses write-back

38
7.5 A Common Framework for Memory Hierarchies -
The Three Cs
  • Cache misses occur in three categories
  • Compulsory misses -
  • Capacity misses -
  • Conflict misses -
  • The challenge in designing memory hierarchies is
    that every change that potentially improves the
    miss rate can also negatively affect overall
    performance.

39
7.6 Real Stuff - The Pentium P4 and Opteron
Memory Hierarchies
  • Both have secondary caches on the main processor
    die.

40
7.6 Real Stuff - The Pentium P4 and Opteron
Memory Hierarchies
41
7.7 Fallacies and Pitfalls
  • Pitfall: Forgetting to account for byte
    addressing or the cache block size in simulating
    a cache.
  • Pitfall: Ignoring memory system behavior when
    writing programs or when generating code in a
    compiler.
  • Pitfall: Using average memory access time to
    evaluate the memory hierarchy of an out-of-order
    processor.
  • Pitfall: Extending an address space by adding
    segments on top of an unsegmented address space.

42
7.8 Concluding Remarks
  • Because CPU speeds continue to increase faster
    than either DRAM access times or disk access
    times, memory will increasingly be the factor
    that limits performance.

43
7.8 Concluding Remarks - Developments
  • Hardware
  • Changes in cache capabilities
  • Software
  • Restructuring loops
  • Compiler-directed prefetching