1
EECS 470
  • Cache Systems
  • Lecture 13
  • Coverage Chapter 5

2
Cache Design 101
Memory pyramid (smallest/fastest at the top, largest/slowest at the base):
  • Registers (100s of bytes): 1 cycle access (early in pipeline)
  • L1 cache: 1-3 cycle access
  • L2 cache: 6-15 cycle access
  • Main memory: 50-300 cycle access
  • Disk: millions of cycles per access!
3
Direct-mapped cache
[Figure: a small memory (block addresses 00000-11110) feeding a 4-line direct-mapped cache; each line holds valid (V), dirty (d), tag, and data fields. The example address 01101 splits into a 2-bit tag (01), a 2-bit line index (10), and a 1-bit block offset (1); the sketch below works this split in code.]
  • Compulsory miss: first reference to a memory block
  • Capacity miss: working set doesn't fit in the cache
  • Conflict miss: working set maps to the same cache line
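As a concrete aid, here is a minimal sketch (not from the slides) of the address split the figure illustrates; the field widths (1-bit offset, 2-bit index, 2-bit tag) match the slide's 5-bit example address.

```c
#include <stdio.h>
#include <stdint.h>

/* Field widths from the slide: a 5-bit address splits into a
   2-bit tag, a 2-bit line index, and a 1-bit block offset. */
#define OFFSET_BITS 1
#define INDEX_BITS  2

int main(void) {
    uint32_t addr   = 0x0D;  /* 01101, the slide's example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    /* Prints tag=1 (binary 01), index=2 (binary 10), offset=1 */
    printf("tag=%u index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```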
4
2-way set associative cache
[Figure: the same memory and total cache capacity, now organized as 2 sets of two ways; each entry still holds V, d, tag, and data. The example address 01101 now splits into a larger (3-bit) tag, a 1-bit set index, and an unchanged block offset.]
Rule of thumb: increasing associativity decreases conflict misses. A 2-way set-associative cache has about the same hit rate as a direct-mapped cache of twice the size.
5
Effects of Varying Cache Parameters
  • Total cache size = block size × sets × associativity (see the sketch after this list)
  • Positives
  • Should decrease miss rate
  • Negatives
  • May increase hit time
  • Increased area requirements
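To make the formula concrete, here is a small sketch (the 32 KB / 64 B / 4-way parameters are made up, not from the slides) that derives the set count and the address-field widths from the three parameters:

```c
#include <stdio.h>

int main(void) {
    unsigned total = 32 * 1024, block = 64, assoc = 4; /* assumed values */
    /* total = block size * sets * associativity, solved for sets */
    unsigned sets = total / (block * assoc);
    unsigned offset_bits = 0, index_bits = 0;
    for (unsigned b = block; b > 1; b >>= 1) offset_bits++; /* log2(block) */
    for (unsigned s = sets;  s > 1; s >>= 1) index_bits++;  /* log2(sets)  */
    /* Prints sets=128 offset_bits=6 index_bits=7 tag_bits=19 */
    printf("sets=%u offset_bits=%u index_bits=%u tag_bits=%u\n",
           sets, offset_bits, index_bits, 32 - offset_bits - index_bits);
    return 0;
}
```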

6
Effects of Varying Cache Parameters
  • Bigger block size
  • Positives
  • Exploits spatial locality; reduces compulsory misses
  • Reduce tag overhead (bits)
  • Reduce transfer overhead (address, burst data
    mode)
  • Negatives
  • Fewer blocks for a given size; increases conflict misses
  • Increase miss transfer time (multi-cycle
    transfers)
  • Wasted bandwidth for non-spatial data

7
Effects of Varying Cache Parameters
  • Increasing associativity
  • Positives
  • Reduces conflict misses
  • Low-associativity caches can have pathological behavior (very high miss rate)
  • Negatives
  • Increased hit time
  • More hardware requirements (comparators, muxes,
    bigger tags)
  • Diminishing improvements past 4- or 8-way

8
Effects of Varying Cache Parameters
  • Replacement Strategy (for associative caches)
  • LRU: intuitive; difficult to implement with high associativity; worst-case performance can occur (N+1 element array)
  • Random / pseudo-random: easy to implement; performance close to LRU at high associativity (see the sketch below)
  • Optimal: replace the block whose next reference is farthest in the future; hard to implement
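Since true LRU is hard at high associativity, real designs often approximate it with tree pseudo-LRU. Below is a minimal sketch for one 4-way set using 3 bits; the encoding is illustrative, not any particular processor's.

```c
#include <stdint.h>

/* Tree pseudo-LRU for one 4-way set: b0 picks a half, b1/b2 pick a
   way within the left/right half.  A 0 bit points at the colder side. */
typedef struct { uint8_t b0, b1, b2; } plru4_t;

/* On every access (hit or fill) to way w (0..3),
   flip the bits on the path to point AWAY from w. */
static void plru4_touch(plru4_t *p, int w) {
    if (w < 2) { p->b0 = 1; p->b1 = (w == 0); }  /* left half is now hot */
    else       { p->b0 = 0; p->b2 = (w == 2); }  /* right half is now hot */
}

/* On a miss, follow the bits toward the colder side to pick a victim. */
static int plru4_victim(const plru4_t *p) {
    if (p->b0 == 0) return p->b1 ? 1 : 0;   /* victim in left half */
    else            return p->b2 ? 3 : 2;   /* victim in right half */
}
```

Only 3 bits per set are needed (versus tracking a full ordering), which is why pseudo-LRU stays cheap as associativity grows.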

9
Other Cache Design Decisions
  • Write Policy: how to deal with write misses?
  • Write-through / no-allocate
  • Total traffic = read misses × block size + writes (both traffic formulas are worked in the sketch after this list)
  • Common for L1 caches backed by an L2 (esp. on-chip)
  • Write-back / write-allocate
  • Needs a dirty bit to determine whether the cached data differs from memory
  • Total traffic = (read misses + write misses) × block size + dirty-block evictions × block size
  • Common for L2 caches (memory bandwidth limited)
  • Variation: write-validate
  • Write-allocate without fetch-on-write
  • Needs a sub-block cache with valid bits for each word/byte
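A back-of-envelope comparison of the two traffic formulas. All counts here are made up, as are the 4 bytes per write-through store and the dirty-eviction fraction; only the formulas come from the slide.

```c
#include <stdio.h>

int main(void) {
    double reads = 1e6, writes = 5e5;                    /* assumed counts */
    double read_miss  = 0.05 * reads;                    /* assumed 5% miss */
    double write_miss = 0.05 * writes;
    double dirty_evict = 0.5 * (read_miss + write_miss); /* assumed fraction */
    double block = 64;                                   /* bytes per block */

    /* Write-through / no-allocate: fetch on read misses, every store goes
       through (modeled as 4 bytes each). */
    double wt = read_miss * block + writes * 4;
    /* Write-back / write-allocate: fetch on all misses, write back dirty
       evictions. */
    double wb = (read_miss + write_miss) * block + dirty_evict * block;

    printf("write-through: %.0f bytes\nwrite-back:    %.0f bytes\n", wt, wb);
    return 0;
}
```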

10
Other Cache Design Decisions
  • Write Buffering
  • Delay writes until bandwidth available
  • Put them in a FIFO buffer (see the sketch below)
  • Only stall on write if buffer is full
  • Use bandwidth for reads first (since they have
    latency problems)
  • Important for write-through caches since write
    traffic is frequent
  • Write-back buffer
  • Holds evicted (dirty) lines for Write-back caches
  • Also allows reads to have priority on the L2 or
    memory bus.
  • Usually only needs a small buffer

Ref: Eager Writeback Caches
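A minimal sketch of such a FIFO write buffer (sizes and types are illustrative): stores enqueue, the bus drains entries when reads are not using it, and the core stalls only when the queue is full.

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 8   /* assumed depth; real buffers are similarly small */

typedef struct { uint32_t addr; uint32_t data; } wb_entry_t;
typedef struct { wb_entry_t q[WB_ENTRIES]; int head, tail, count; } write_buffer_t;

/* Core side: enqueue a store.  Returns false when full (store stalls). */
static bool wb_push(write_buffer_t *b, uint32_t addr, uint32_t data) {
    if (b->count == WB_ENTRIES) return false;
    b->q[b->tail] = (wb_entry_t){addr, data};
    b->tail = (b->tail + 1) % WB_ENTRIES;
    b->count++;
    return true;
}

/* Bus side: drain the oldest entry when reads are not using the bus. */
static bool wb_drain(write_buffer_t *b, wb_entry_t *out) {
    if (b->count == 0) return false;
    *out = b->q[b->head];
    b->head = (b->head + 1) % WB_ENTRIES;
    b->count--;
    return true;
}
```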
11
Adding a Victim cache
[Figure: a direct-mapped L1 array (V/d/tag/data) alongside a 4-line fully associative victim cache; the references 11010011 and 01010011 map to the same L1 line, illustrating how the victim cache catches the conflict.]
  • Small victim cache adds associativity to hot lines
  • Blocks evicted from the direct-mapped cache go to the victim cache
  • Tag compares are made against both the direct-mapped cache and the victim cache
  • Victim hits cause lines to swap between L1 and the victim cache (see the lookup sketch below)
  • Not very useful for associative L1 caches

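A sketch of the lookup path described above. The structures are illustrative, and for brevity the victim entries are matched on the same tag field; a real victim cache stores the full block address, since it is fully associative.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool v; uint32_t tag; /* data omitted */ } line_t;

#define L1_LINES 8   /* assumed sizes */
#define VC_LINES 4

static line_t l1[L1_LINES], vc[VC_LINES];

static bool lookup(uint32_t tag, uint32_t index) {
    /* Primary probe of the direct-mapped array. */
    if (l1[index].v && l1[index].tag == tag)
        return true;
    /* Probe every victim entry; a hit swaps the two lines so the
       hot line moves back into L1. */
    for (int i = 0; i < VC_LINES; i++) {
        if (vc[i].v && vc[i].tag == tag) {
            line_t tmp = l1[index];
            l1[index] = vc[i];
            vc[i] = tmp;
            return true;
        }
    }
    return false;  /* true miss: fill L1, evict the old L1 line into vc */
}
```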
12
Hash-Rehash Cache
[Figure: a direct-mapped array (V/d/tag/data) and the reference stream 11010011, 01010011, 11010011; the first reference installs tag 110 at its primary hash location.]
13
Hash-Rehash Cache
[Figure: the next reference, 01000011, misses at its primary hash location and misses again at the rehash location ("Miss, Rehash miss"), raising the question of where to allocate.]
14
Hash-Rehash Cache
[Figure: after the double miss ("Miss, Rehash miss"), the line is fetched and allocated.]
15
Hash-Rehash Cache
[Figure: a later reference, 11000011, misses at its primary hash location but hits at the rehash location ("Miss, Rehash Hit!").]
16
Hash-Rehash Cache
  • Calculating performance (see the expression below)
  • Primary hit time (same as a normal direct-mapped cache)
  • Rehash hit time (sequential tag lookups)
  • Block swap time?
  • Hit rate comparable to a 2-way set-associative cache
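Putting those components together, a hedged expression for the average access time, where $h_1$ is the primary hit rate, $h_2$ the rehash hit rate, $t_{rehash}$ the extra time for the second probe, and $t_{miss}$ the miss penalty (the symbols are assumed, not from the slide):

$$t_{avg} = t_{hit} + (1 - h_1)\,t_{rehash} + (1 - h_1 - h_2)\,t_{miss}$$

A primary hit costs $t_{hit}$; a rehash hit additionally pays $t_{rehash}$; a full miss pays all three terms. Block swap time would add a further term proportional to the rehash hit rate.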

17
Compiler support for caching
  • Array Merging (array of structs vs. 2 arrays)
  • Loop interchange (row vs. column access; sketched below)
  • Structure padding and alignment (malloc)
  • Cache conscious data placement
  • Pack working set into same line
  • Map to non-conflicting addresses if packing is impossible
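A minimal C illustration of loop interchange (the array size and loop bodies are made up): the first version strides down the columns of a row-major array, so consecutive iterations touch addresses N*sizeof(double) apart and tend to miss; interchanging the loops makes the traversal sequential.

```c
#include <stdio.h>

#define N 1024
static double a[N][N];

/* Column-order traversal: poor spatial locality in a row-major array. */
static double sum_column_order(void) {
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* After loop interchange: row-order traversal walks memory
   sequentially, so most accesses hit in an already-fetched line. */
static double sum_row_order(void) {
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_column_order(), sum_row_order());
    return 0;
}
```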

18
Prefetching
  • Already done: bring in an entire line, assuming spatial locality
  • Extend this: next-line prefetch
  • Bring in the next block in memory as well as the missed line (very good for I-cache)
  • Software prefetch (sketched below)
  • Loads to R0 have no data dependency
  • Aggressive/speculative prefetch useful for L2
  • Speculative prefetch problematic for L1
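A sketch of software prefetching using the GCC/Clang __builtin_prefetch intrinsic (a real builtin; the 16-element prefetch distance is an assumption that would need tuning). Like the slide's "load to R0", the prefetch has no data dependency and cannot fault.

```c
#include <stddef.h>

void scale(float *x, size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)                            /* assumed distance */
            __builtin_prefetch(&x[i + 16], 1, 1);  /* write, low temporal locality */
        x[i] *= k;
    }
}
```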

19
Calculating the Effects of Latency
  • Does a cache miss reduce performance?
  • It depends on whether there are critical
    instructions waiting for the result

20
Calculating the Effects of Latency
  • It depends on whether critical resources are held
    up
  • Blocking: when a miss occurs, all later references to the cache must wait. This is a resource conflict.
  • Non-blocking: allows later references to access the cache while the miss is being processed.
  • Generally there is some limit to how many
    outstanding misses can be bypassed.

21
P4 Overview (Todd's slides)
  • Latest IA-32 processor from Intel
  • Equipped with the full set of IA-32 SIMD operations
  • First flagship architecture since the P6 microarchitecture
  • Pentium 4 ISA = Pentium III ISA + SSE2
  • SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating-point operations, plus prefetch instructions

22
Comparison Between Pentium III and Pentium 4
23
Trace Cache
  • Primary instruction cache in P4 architecture
  • Stores 12K decoded µops
  • On a miss, instructions are fetched from L2
  • Trace predictor connects traces
  • Trace cache removes
  • Decode latency after mispredictions
  • Decode power for all pre-decoded instructions

24
Execution Pipeline
25
Store and Load Scheduling
  • Out-of-order store and load operations
  • Stores are always in program order
  • 48 loads and 24 stores could be in flight
  • Store/load buffers are allocated at the
    allocation stage
  • Total 24 store buffers and 48 load buffers

26
Data Stream of Pentium 4 Processor
27
On-chip Caches
  • L1 instruction cache (Trace Cache)
  • L1 data cache
  • L2 unified cache
  • All caches use a pseudo-LRU replacement algorithm
  • Parameters

28
L1 Data Cache
  • Non-blocking
  • Supports up to 4 outstanding load misses
  • Load latency
  • 2-clock for integer
  • 6-clock for floating-point
  • 1 Load and 1 Store per clock
  • Load speculation
  • Assume the access will hit the cache
  • Replays dependent instructions when a miss is detected

29
L2 Cache
  • Non-blocking
  • Load latency
  • Net load access latency of 7 cycles
  • Bandwidth
  • 1 load and 1 store in one cycle
  • New cache operations may begin every 2 cycles
  • 256-bit wide bus between L1 and L2
  • 48 GB per second @ 1.5 GHz (256 bits = 32 bytes per transfer; 32 B × 1.5 G transfers/s = 48 GB/s)

30
L2 Cache Data Prefetcher
  • Hardware prefetcher monitors the reference
    patterns
  • Brings in cache lines automatically
  • Attempts to fetch 256 bytes ahead of current
    access
  • Prefetches for up to 8 simultaneous, independent streams