Title: Non-Uniform Cache Architectures for Wire Delay Dominated Caches
1. Non-Uniform Cache Architectures for Wire Delay Dominated Caches
- Abhishek Desai
- Bhavesh Mehta
- Devang Sachdev
- Gilles Muller
2. Plan
- Motivation
- What is NUCA
- UCA and ML-UCA
- Static NUCA
- Dynamic NUCA
- Simulation Results
3. Motivation
- Bigger L2 and L3 caches are needed
- Programs are larger
- SMT requires a large cache for spatial locality
- Bandwidth demands on the package have increased
- Smaller process technologies permit more bits per mm²
- Wire delays dominate in large caches
- The bulk of the access time is spent routing to and from the banks, not in the bank accesses themselves
4. What is NUCA?
- Data residing closer to the processor is accessed much faster than data residing physically farther from the processor
- Example: in a 16MB on-chip L2 cache built in 50nm process technology, the closest bank could be accessed in 4 cycles, while an access to the farthest bank might take 47 cycles
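The latency gradient in the example above can be sketched with a toy model. This is illustrative only: the per-bank and per-hop cycle counts below are assumed parameters, not the paper's measured 4/47-cycle figures.

```python
# Toy model of non-uniform access time: total latency is a fixed bank
# access cost plus a routing cost proportional to the bank's Manhattan
# distance from the cache controller at (0, 0). All constants here are
# illustrative assumptions, not values from the paper.
def access_cycles(row, col, bank_cycles=3, hop_cycles=1):
    return bank_cycles + hop_cycles * (row + col)

# In a 16x16 bank array, the closest bank is far cheaper than the farthest:
closest = access_cycles(0, 0)      # bank next to the controller
farthest = access_cycles(15, 15)   # diagonally opposite corner
```

Under this model the same cache has a spread of latencies rather than one worst-case latency, which is the property NUCA exploits.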
5. UCA and ML-UCA
- UCA: avg. access time 255 cycles, 1 bank, 16MB, 50nm
- ML-UCA: avg. access time 11/41 cycles, 8/32 banks, 16MB, 50nm
(Figure: UCA vs. ML-UCA bank organization)
6. Static-NUCA-1
- S-NUCA-1: avg. access time 34 cycles, 32 banks, 16MB, 50nm, wire area overhead 20.9%
7. S-NUCA-1 cache design
8. Static-NUCA-2
- S-NUCA-2: avg. access time 24 cycles, 32 banks, 16MB, 50nm, channel area overhead 5.9%
9. S-NUCA-2 cache design
(Figure: bank array with address bus and sense amplifiers)
10. Dynamic-NUCA
- D-NUCA: avg. access time 18 cycles, 256 banks, 16MB, 50nm
11. Management of Data in D-NUCA
- Mapping: how are data mapped to the banks, and in which banks can a datum reside?
- Search: how is the set of possible locations searched to find a line?
- Movement: under what conditions should data be migrated from one bank to another?
12. Simple Mapping (implemented)
(Figure: simple mapping — the memory controller sits at the cache edge; 8 bank sets form columns, and one set's ways 1-4 map to the 4 banks of its column)
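The simple mapping can be sketched as an index computation. This is a hedged sketch: the 8 bank sets and 4 ways match the figure, while the line size and sets-per-bank values are assumed parameters.

```python
# Simple mapping sketch: the low bits of the line index select one of 8
# bank sets (columns); each of the 4 banks in that column holds one way,
# so a line's candidate locations are exactly the banks of its column.
# line_bytes and sets_per_bank are illustrative assumptions.
def simple_map(addr, num_bank_sets=8, line_bytes=64, sets_per_bank=512):
    line = addr // line_bytes
    bank_set = line % num_bank_sets                   # which column to search
    set_in_bank = (line // num_bank_sets) % sets_per_bank
    return bank_set, set_in_bank
```

The appeal of this scheme is that the bank set is computable directly from the address bits, with no lookup structure; its drawback, noted in the paper's alternatives, is that it ties a line's latency range to its address.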
13. Fair and Shared Mapping
(Figures: fair mapping and shared mapping, each drawn with the memory controller at the cache edge)
14. Searching Cached Lines
- Incremental search
- Multicast search (implemented)
- Limited multicast
- Partitioned multicast
- Smart search
- ss-performance
- ss-energy
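The first two policies can be contrasted in a sketch. The helper names are hypothetical; each bank is modeled as a set of resident tags, and cost is counted in bank probes rather than cycles.

```python
# Incremental search: probe the candidate banks closest-first and stop at
# the first hit; cheap in probes (energy), but a hit in a far bank pays
# the full serialized probe sequence.
def incremental_search(banks, tag):
    probes = 0
    for bank in banks:                 # ordered closest to farthest
        probes += 1
        if tag in bank:
            return True, probes
    return False, probes

# Multicast search (the implemented policy): probe all candidate banks at
# once, so hit latency is that of the hitting bank alone, at the energy
# cost of probing every bank.
def multicast_search(banks, tag):
    probes = len(banks)                # all candidate banks probed in parallel
    return any(tag in bank for bank in banks), probes
```

The paper's limited and partitioned multicast variants sit between these two extremes, multicasting to only a subset of the candidate banks.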
15. Dynamic Movement of Lines
- LRU line furthest and MRU line closest
- One-bank promotion on a hit (implemented)
- Policy on a miss
- Which line is evicted? The line in the furthest (slowest) bank (implemented)
- Where is the new line placed? The closest (fastest) bank, or the furthest (slowest) bank (implemented)
- What happens to the victim line? Zero-copy policy (implemented) or one-copy policy
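The policies marked "implemented" combine into a short sketch. The representation is assumed: each set's candidate banks form a list of tags ordered fastest bank first.

```python
# D-NUCA movement sketch using the "implemented" policies above: on a hit,
# swap the line one bank closer to the processor (one-bank promotion); on
# a miss, place the incoming line in the slowest bank, dropping the
# occupant without relocation (zero-copy policy).
def access(chain, tag):
    if tag in chain:
        i = chain.index(tag)
        if i > 0:                      # promote one bank toward the CPU
            chain[i - 1], chain[i] = chain[i], chain[i - 1]
        return True                    # hit
    chain[-1] = tag                    # evict the slowest bank's occupant
    return False                       # miss
```

Repeated hits gradually migrate hot lines into the fastest banks, which is how D-NUCA approximates "MRU closest, LRU furthest" without a global LRU ordering.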
16. Advantages of D-NUCA over ML-UCA
- D-NUCA does not enforce inclusion, preventing redundant copies of the same line
- In ML-UCA the faster level may not match the working-set size of an application, either being too large and thus slow, or too small and thus incurring misses
17. Configuration for simulation
- Used the sim-alpha simulator and CACTI
- Simple mapping
- Multicast search
- One-bank promotion on each hit
- Replacement policy that chooses the block in the slowest bank as the victim on a miss
18. Hit Rate Distribution for D-NUCA
19. Simulation results: integer benchmarks
20. Simulation results: FP benchmarks
21. Summary
- D-NUCA has the following advantages:
- Low access latency
- Technology scalability
- Performance stability
- Flattens the memory hierarchy