Improving Cache Performance

Transcript and Presenter's Notes
2
Improving Cache Performance
AMAT = Hit time + (Miss rate × Miss penalty)
  • Four categories of optimisation
  • Reduce miss rate
  • Reduce miss penalty
  • Reduce miss rate or miss penalty using
    parallelism
  • Reduce hit time
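
As a worked example (with assumed, illustrative numbers): for a
1-cycle hit time, 5% miss rate and 20-cycle miss penalty,
AMAT = 1 + 0.05 × 20 = 2 cycles. Halving the miss rate gives 1.5
cycles, and halving the miss penalty gives the same; each term is a
separate lever on average access time.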

3
5.5. Reducing Miss Rate
  • Three sources of misses
  • Compulsory
  • cold start misses
  • Capacity
  • Cache is full
  • Conflict
  • Set is full/block is occupied

Increase block size
Increase size of cache
Increase degree of associativity
4
Larger Block Size
  • Bigger blocks reduce compulsory misses
  • Spatial locality
  • BUT
  • Increased miss penalty
  • More data to transfer
  • Possibly increased overall miss rate
  • More conflict and capacity misses as there are
    fewer blocks
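
A worked example of this trade-off, again with assumed numbers:
suppose doubling the block size cuts the miss rate from 5% to 4%
(fewer compulsory misses) but raises the miss penalty from 20 to 30
cycles (more data per transfer). With a 1-cycle hit time, AMAT moves
from 1 + 0.05 × 20 = 2.0 cycles to 1 + 0.04 × 30 = 2.2 cycles: a net
loss despite the lower miss rate.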

5
Effect of Block Size
6
Larger Caches
  • Reduces capacity misses
  • Increases hit time and cost

7
Higher Associativity
  • Miss rates improve with higher associativity
  • Two rules of thumb
  • 8-way set associative caches are almost as
    effective as fully associative
  • But much simpler!
  • 2:1 cache rule
  • A direct mapped cache of size N has about the
    same miss rate as a 2-way set associative cache
    of size N/2

8
Way Prediction
  • Set-associative cache predicts which block will
    be needed on next access to the set
  • Only one tag check is done
  • If mispredicted the whole set must be checked
  • E.g. Alpha 21264 instruction cache
  • Prediction rate > 85%
  • Correct prediction: 1-cycle hit
  • Misprediction: 3 cycles

9
Pseudo-Associative Caches
  • Check a direct mapped cache for a hit as usual
  • If it misses, check a second block
  • Invert MSB of index
  • One fast and one slow hit time
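
A minimal C sketch of the second-probe index calculation, assuming a
1024-set direct-mapped cache with 32-byte blocks (sizes and names are
illustrative, not from any particular design):

    #include <stdint.h>

    #define OFFSET_BITS 5            /* assumed 32-byte blocks */
    #define INDEX_BITS  10           /* assumed 1024 sets */

    /* Primary probe uses the normal direct-mapped index. */
    static uint32_t primary_index(uint32_t addr) {
        return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    }

    /* On a miss, probe a second "pseudo-set" whose index has the
       most significant index bit inverted. */
    static uint32_t secondary_index(uint32_t index) {
        return index ^ (1u << (INDEX_BITS - 1));
    }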

10
Compiler Optimisations
  • Compilers can optimise code to minimise miss
    rates
  • Reordering procedures
  • Aligning basic blocks with cache blocks
  • Reorganising array element accesses
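
For instance, reorganising array accesses often means putting the
stride-1 loop innermost. A small C sketch (array name and size are
illustrative):

    #define N 1024
    double a[N][N];                  /* C arrays are row-major */

    /* Column-order traversal: consecutive accesses are N doubles
       apart, so almost every access can miss. */
    double sum_columns(void) {
        double s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    /* Row-order traversal: consecutive accesses share cache blocks,
       exploiting spatial locality. */
    double sum_rows(void) {
        double s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }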

11
5.6. Reduce Miss Rate or Miss Penalty via
Parallelism
  • Three techniques that overlap instruction
    execution with memory access

12
Nonblocking caches
  • Dynamic scheduling allows CPU to continue with
    other instructions while waiting for data
  • Nonblocking cache allows other cache accesses to
    continue while waiting for data

13
Hardware Prefetching
  • Fetch data/instructions before they are requested
    by the processor
  • Either into cache or another buffer
  • Particularly useful for instructions
  • High degree of spatial locality
  • UltraSPARC III
  • Special prefetch cache for data
  • Increases effectiveness by about four times

14
Compiler Prefetching
  • Compiler inserts prefetch instructions
  • Two types
  • Prefetch register value
  • Prefetch data cache block
  • Can be faulting or non-faulting
  • Cache continues as normal while data is prefetched
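
In C this is usually reached through a compiler builtin; GCC and
Clang, for example, provide __builtin_prefetch, which emits the
target's non-faulting prefetch instruction where one exists. The
prefetch distance below (16 elements ahead) is an assumed tuning
value:

    double sum(const double *a, long n) {
        double s = 0;
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                /* args: address, 0 = read, 0 = low temporal locality */
                __builtin_prefetch(&a[i + 16], 0, 0);
            s += a[i];
        }
        return s;
    }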

15
SPARC V9
  • Prefetch

prefetch [rs1 + rs2], fcn
prefetch [rs1 + imm13], fcn

fcn (prefetch function):
  0  Prefetch for several reads
  1  Prefetch for one read
  2  Prefetch for several writes
  3  Prefetch for one write
  4  Prefetch page
16
5.7. Reducing Hit Time
  • Critical
  • Often affects CPU clock cycle time

17
Small, simple caches
  • Small usually equals fast in hardware
  • A small cache may reside on the processor chip
  • Decreases communication
  • Compromise: tags on chip, data separate
  • Direct mapped
  • Data can be read in parallel with tag checking
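
A sketch of the address split for an assumed 16 kB direct-mapped
cache with 32-byte blocks (512 blocks, hence 9 index bits). In
hardware the data array is read with the index field while the tag
field is compared, which is what keeps the direct-mapped hit time
short:

    #include <stdint.h>

    #define OFFSET_BITS 5    /* 32-byte blocks */
    #define INDEX_BITS  9    /* 16 kB / 32 B = 512 blocks */

    static uint32_t cache_offset(uint32_t a) {
        return a & ((1u << OFFSET_BITS) - 1);
    }
    static uint32_t cache_index(uint32_t a) {
        return (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    }
    static uint32_t cache_tag(uint32_t a) {
        return a >> (OFFSET_BITS + INDEX_BITS);
    }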

18
Avoiding address translation
  • Physical caches
  • Use physical addresses
  • Address translation must happen before cache
    lookup
  • Virtual caches
  • Use virtual addresses
  • Protection issues
  • High context switching overhead

19
Virtual caches
  • Minimising context switch overhead
  • Add process-identifier tag to cache
  • Multiple virtual addresses may refer to a single
    physical address
  • Hardware enforces anti-aliasing
  • Software requires the less significant address bits to be the
    same

20
Avoiding address translation (cont.)
  • Choice of page size
  • Bigger than cache index + offset
  • Address translation and tag lookup can happen in
    parallel
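
As a worked example with assumed sizes: an 8 kB direct-mapped cache
with 32-byte blocks needs 8 index bits plus 5 offset bits, 13 bits in
all. With 8 kB (or larger) pages the page offset is at least 13 bits,
so the index and offset bits are identical in the virtual and
physical address, and the cache can be indexed while the TLB
translates the page number.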

21
Pipelining cache access
  • Split cache access into several stages
  • Impacts on branch and load delays

22
Trace caches
  • Blocks follow program flow rather than spatial
    locality!
  • Branch prediction is taken into account by cache
  • Intel NetBurst microarchitecture
  • Complicates address mapping
  • Minimises wasted space within blocks

23
Cache Optimisation Summary
  • Cache optimisation is very complex
  • Improving one factor may have a negative impact
    on another

24
5.8. Main Memory
  • Latency and bandwidth are both important
  • Latency is composed of two factors
  • Access time
  • Cycle time
  • Two main technologies
  • DRAM
  • SRAM

25
5.10. Virtual Memory
  • Physical memory is divided into blocks
  • Allocated to processes
  • Provides protection
  • Allows swapping to disk
  • Simplifies loading
  • Historically
  • Overlays
  • Programmer controlled swapping

26
Terminology
  • Block
  • Page
  • Segment
  • Miss
  • Page fault
  • Address fault
  • Memory mapping (address translation)
  • Virtual address → physical address

27
Characteristics
  • Block size
  • 4 kB – 64 kB
  • Hit time
  • 50 – 150 cycles
  • Miss penalty
  • 1 000 000 – 10 000 000 cycles
  • Miss rate
  • 0.000 01 – 0.001

28
Categorising VM Systems
  • Fixed block size
  • Pages
  • Variable block size
  • Segments
  • Difficult replacement
  • Hybrid approaches
  • Paged segments
  • Multiple page sizes (2^n × smallest)

29
Q1 Block placement?
  • Anywhere in memory
  • Fully associative
  • Minimises miss rate

30
Q2 Block identification?
  • Page/segment number gives physical page address
  • Paging: offset is concatenated
  • Segments: offset is added
  • Uses a page table
  • One entry per page in the virtual address space
  • To save space: an inverted page table
  • One entry per page in physical memory
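
A minimal C sketch of paged translation with a simple (non-inverted)
page table; the page size, entry layout and fault handler are
illustrative assumptions:

    #include <stdint.h>

    #define PAGE_BITS 13                    /* assumed 8 kB pages */

    typedef struct { uint32_t frame; unsigned valid : 1; } pte_t;

    extern pte_t page_table[];              /* one entry per virtual page */
    extern uint64_t handle_page_fault(uint64_t va);  /* hypothetical OS hook */

    uint64_t translate(uint64_t va) {
        uint64_t vpn    = va >> PAGE_BITS;  /* virtual page number */
        uint64_t offset = va & ((1u << PAGE_BITS) - 1);
        if (!page_table[vpn].valid)
            return handle_page_fault(va);
        /* paging: frame number with the offset concatenated */
        return ((uint64_t)page_table[vpn].frame << PAGE_BITS) | offset;
    }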

31
Q3 Block replacement?
  • Least-recently used (LRU)
  • Minimises miss rate
  • Hardware provides a use bit or reference bit
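
True LRU is too expensive at this scale, so the reference bit
supports an approximation. A minimal clock-style sketch of victim
selection (frame count and data structures are illustrative):

    #define NFRAMES 1024
    static unsigned char use_bit[NFRAMES];  /* set by hardware on access */
    static int hand;

    /* Sweep the frames: clear set bits (second chance) and pick the
       first frame whose bit is already clear. */
    int choose_victim(void) {
        for (;;) {
            int f = hand;
            hand = (hand + 1) % NFRAMES;
            if (!use_bit[f])
                return f;
            use_bit[f] = 0;
        }
    }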

32
Q4 Write strategy?
  • Write back
  • With a dirty bit

You won't become famous by being the first to
try write-through!
33
Fast Address Translation
  • Page tables are big
  • Stored in memory themselves
  • Two memory accesses for every datum!
  • Principle of locality
  • Cache recent translations
  • Translation look-aside buffer (TLB), or
    translation buffer (TB)
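
A sketch of the lookup order, using a direct-mapped software model of
a TLB for brevity (real TLBs, like the fully associative one on the
next slide, are hardware structures; names and sizes here are
illustrative):

    #include <stdint.h>

    #define TLB_ENTRIES 64

    typedef struct { uint64_t vpn; uint32_t frame; int valid; } tlb_entry_t;
    static tlb_entry_t tlb[TLB_ENTRIES];

    /* Returns 1 on a hit; on a miss the caller walks the page table
       in memory and refills this entry with the new translation. */
    int tlb_lookup(uint64_t vpn, uint32_t *frame) {
        tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];
        if (e->valid && e->vpn == vpn) {
            *frame = e->frame;
            return 1;
        }
        return 0;
    }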

34
Alpha 21264 TLB
35
Selecting a Page Size
  • Big
  • Smaller page table
  • Allows parallel cache access
  • Efficient disk transfers
  • Reduces TLB misses
  • Small
  • Less memory wastage (internal fragmentation)
  • Quicker process startup

36
Putting it ALL Together!
  • SPARC Revisited

37
Two SPARCs
  • SuperSPARC
  • 1992
  • 32-bit superscalar design
  • UltraSPARC
  • Late 1990s
  • 64-bit design
  • Graphics support (VIS)

38
UltraSPARC
  • Four-way superscalar execution
  • Two integer ALUs
  • FP unit
  • Five functional units
  • Graphics unit

39
Pipeline
  • 9 stages
  • Fetch
  • Decode
  • Grouping
  • Execution
  • Cache access
  • Load miss
  • Integer pipe wait (for FP/graphics pipelines)
  • Trap resolution
  • Writeback

40
Branch Handling
  • Dynamic branch prediction
  • Two bit scheme
  • Every second instruction in cache has prediction
    bits (predicts up to 2048 branches)
  • 88% success rate (integer)
  • Target prediction
  • Fetches from predicted path
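
A generic C sketch of the two-bit scheme (the table size matches the
2048 branches above; this is the textbook scheme, not the exact
UltraSPARC hardware):

    #define SLOTS 2048
    static unsigned char ctr[SLOTS];  /* 0-1 predict not taken, 2-3 taken */

    int predict_taken(unsigned slot) { return ctr[slot] >= 2; }

    /* Saturating update: a branch must go the "wrong" way twice in a
       row before the prediction flips. */
    void train(unsigned slot, int taken) {
        if (taken) { if (ctr[slot] < 3) ctr[slot]++; }
        else       { if (ctr[slot] > 0) ctr[slot]--; }
    }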

41
FPU
  • Five functional units
  • Add
  • Multiply
  • Divide/square root
  • Two graphics units (add and multiply)
  • Mostly fully pipelined (latency 3 cycles)
  • Except divide and square root (not pipelined,
    latency is 22 cycles for 64-bit)

42
Memory Hierarchy
  • On-chip instruction and data caches
  • Data
  • 16kB direct-mapped, write-through
  • Instructions
  • 16kB 2-way set associative
  • Both virtually addressed
  • External cache
  • Up to 4MB

43
Virtual Memory
  • 64-bit virtual addresses → 44-bit physical
    addresses
  • TLB
  • 64 entry, fully-associative cache

44
Multimedia Support (VIS)
  • Integrated with FPU
  • Partitioned operations
  • Multiple smaller values in 64-bits
  • Video compression instructions
  • E.g. motion estimation instruction replaces 48
    simple instructions for MPEG compression
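
The idea of a partitioned operation, sketched portably in C: four
independent 16-bit adds carried in one 64-bit word, which is what a
VIS partitioned add performs in a single instruction (an illustration
of the concept, not VIS itself):

    #include <stdint.h>

    uint64_t padd16(uint64_t a, uint64_t b) {
        uint64_t r = 0;
        for (int lane = 0; lane < 4; lane++) {
            uint16_t x = (uint16_t)(a >> (16 * lane));
            uint16_t y = (uint16_t)(b >> (16 * lane));
            /* each lane wraps independently; no carry between lanes */
            r |= (uint64_t)(uint16_t)(x + y) << (16 * lane);
        }
        return r;
    }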

45
The End!