1
Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures
Rajeev Balasubramonian, School of Computing, University of Utah
July 1st 2004
2
Billion-Transistor Chips
  • Partitioned architectures: small computational units connected by a communication fabric
  • Small computational units with limited functionality → fast clocks, low design effort, low power
  • Numerous computational units → high parallelism

3
The Communication Bottleneck
  • Wire delays do not scale down at the same rate as logic delays [Agarwal, ISCA'00] [Ho, Proc. IEEE'01]
  • 30-cycle delay to go across the chip in 10 years
  • 1-cycle inter-hop latency in the RAW prototype [Taylor, ISCA'04]

4
Cache Design
Centralized Cache
[Figure: single centralized L1D bank; address transfer 6 cyc + RAM access 6 cyc + data transfer 6 cyc = 18-cycle access (12 cycles for communication)]
5
Cache Design
Centralized Cache vs. Decentralized Cache
[Figure: a single centralized L1D bank (address transfer 6 cyc + RAM access 6 cyc + data transfer 6 cyc = 18-cycle access, 12 cycles for communication) contrasted with L1D banks replicated close to the clusters]
6
Research Goals
  • Identify bottlenecks in cache access
  • Design cluster prefetch, a latency-hiding mechanism
  • Evaluate and compare centralized and decentralized designs

7
Outline
  • Motivation
  • Evaluation platform
  • Cluster prefetch
  • Centralized vs. decentralized caches
  • Conclusions

8
Clustered Microarchitectures
  • Centralized front-end
  • Dynamically steered (based on dependences and cluster load)
  • Out-of-order issue and 1-cycle bypass within a cluster
  • Hierarchical interconnect

[Figure: clustered microarchitecture with centralized instruction fetch, an L1D cache with its LSQ, and clusters connected by crossbar and ring interconnects]
9
Simulation Parameters
  • Simplescalar-based simulator
  • In-flight instruction window of 480
  • 16 clusters, each with 60 registers, 30 issue queue entries, and one FU of each kind
  • Inter-cluster latencies between 2 and 10 cycles
  • Primary focus on SPEC-FP programs
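For concreteness, a minimal sketch of how this processor model might be captured in a simulator configuration; the struct and field names are hypothetical, not taken from the actual Simplescalar-based simulator.

  // Hypothetical config sketch for the processor model described above.
  struct ClusterModelConfig {
    int num_clusters     = 16;   // clusters on chip
    int regs_per_cluster = 60;   // physical registers per cluster
    int iq_per_cluster   = 30;   // issue queue entries per cluster
    int fus_per_kind     = 1;    // one functional unit of each kind per cluster
    int window_size      = 480;  // in-flight instruction window
    int min_hop_latency  = 2;    // inter-cluster latency range (cycles)
    int max_hop_latency  = 10;
  };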

10
Steps Involved in Cache Access
[Figure: steps in a cache access: instruction fetch, dispatch, effective address computation in the cluster, effective address transfer to the LSQ, memory disambiguation, L1D RAM access, and data transfer back to the cluster]
11
Lifetime of a Load
12
Load Address Prediction
[Figure: baseline load timeline: dispatch at cycle 0, effective address transferred from the cluster to the LSQ at cycle 27, cache access at cycle 68, data transferred back to the cluster at cycle 94]
13
Load Address Prediction
[Figure: the same load with an address predictor at the L1D: the cache access starts at cycle 0 and data reaches the cluster by cycle 26, while the effective address arriving at the LSQ at cycle 27 verifies the prediction (baseline: cache access at cycle 68, data transfer at cycle 94)]
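A small worked calculation tying the two timelines together; the cycle counts are the ones shown in the figure, and the constant names are illustrative only.

  // Illustrative cycle accounting for the load timelines above.
  // Baseline: the cache access cannot start until the effective address
  // reaches the LSQ, so data returns at cycle 94.
  // With prediction: the L1D-side predictor starts the access at dispatch,
  // so data reaches the cluster by cycle 26, before the real effective
  // address even arrives at the LSQ (cycle 27) to verify the prediction.
  constexpr int kBaselineDataReturn  = 94;
  constexpr int kPredictedDataReturn = 26;
  constexpr int kCyclesHidden = kBaselineDataReturn - kPredictedDataReturn;  // 68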
14
Memory Dependence Speculation
  • To allow early cache access, loads must issue before resolving earlier store addresses
  • High-confidence store address predictions are employed for disambiguation
  • Stores that have never forwarded results within the LSQ are ignored
  • Cluster prefetch: the combination of load address prediction and memory dependence speculation (sketched below)
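A minimal sketch of the early-issue check implied by these rules, assuming the LSQ can see, for each older store, whether its address is resolved, whether a high-confidence predicted address is available, and whether that store has ever forwarded data; the function and field names are hypothetical.

  #include <cstdint>
  #include <vector>

  // One program-order-earlier store, as seen by the LSQ.
  struct OlderStore {
    bool resolved = false;        // actual address known
    uint64_t addr = 0;            // valid if resolved
    bool predicted = false;       // high-confidence address prediction available
    uint64_t predicted_addr = 0;  // valid if predicted
    bool ever_forwarded = false;  // has this store ever forwarded via the LSQ?
  };

  // May this load access the cache before all older store addresses resolve?
  bool may_issue_early(uint64_t load_addr, const std::vector<OlderStore>& older) {
    for (const OlderStore& st : older) {
      if (st.resolved) {
        if (st.addr == load_addr) return false;            // real conflict
      } else if (st.predicted) {
        if (st.predicted_addr == load_addr) return false;  // predicted conflict
      } else if (st.ever_forwarded) {
        return false;  // unknown address and a history of forwarding: wait
      }
      // else: a store that has never forwarded is ignored for speculation
    }
    return true;
  }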

15
Implementation Details
  • Centralized table that maintains stride and last address; the stride is determined by five consecutive accesses and cleared in case of five mispredicts (see the sketch below)
  • Separate centralized table that maintains a single bit per entry to indicate stores that pose conflicts
  • Each mispredict flushes all subsequent instrs
  • Storage overhead: 18KB
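A minimal sketch of such a stride-based address predictor, assuming the table is indexed by load PC; the class layout, table size, and exact confidence policy are illustrative rather than the paper's precise design.

  #include <cstdint>
  #include <unordered_map>

  struct PredictorEntry {
    uint64_t last_addr = 0;  // last effective address seen for this load
    int64_t  stride = 0;     // current stride hypothesis
    int hits = 0;            // consecutive accesses that confirmed the stride
    int mispredicts = 0;     // consecutive mispredictions while confident
  };

  class StridePredictor {
    std::unordered_map<uint64_t, PredictorEntry> table_;  // indexed by load PC
   public:
    // Predict only once five consecutive accesses have confirmed the stride.
    bool predict(uint64_t pc, uint64_t& predicted_addr) const {
      auto it = table_.find(pc);
      if (it == table_.end() || it->second.hits < 5) return false;
      predicted_addr = it->second.last_addr + it->second.stride;
      return true;
    }

    // Train on the actual effective address once it is known.
    void update(uint64_t pc, uint64_t actual_addr) {
      auto [it, fresh] = table_.try_emplace(pc);
      PredictorEntry& e = it->second;
      int64_t observed = int64_t(actual_addr) - int64_t(e.last_addr);
      if (fresh) {
        // First access for this PC: nothing to compare against yet.
      } else if (observed == e.stride) {
        if (e.hits < 5) ++e.hits;  // stride confirmed by another access
        e.mispredicts = 0;
      } else if (e.hits < 5) {
        e.stride = observed;       // still learning: adopt the new stride
        e.hits = 1;
      } else if (++e.mispredicts >= 5) {
        e = PredictorEntry{};      // confident entry mispredicted five times: clear
      }
      e.last_addr = actual_addr;
    }
  };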

16
Performance Results
Overall IPC improvement: 21%
17
Results Analysis
  • Roughly half the programs improved IPC by >8%
  • Load address prediction rate: 65%
  • Store address prediction rate: 79%
  • Stores likely to not pose conflicts: 59%
  • Avg. number of mispredicts: 12K per 100M instrs

18
Decentralized Cache
[Figure: L1D bank replicated next to each cluster's LSQ]
  • Replicated cache banks
  • Loads do not travel far
  • Stores and cache refills are broadcast
  • Memory disambiguation is not accelerated
  • Overheads: interconnect for broadcast and cache refill, power for redundant writes, distributed LRU, etc. (a sketch of the replicated organization follows)
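A minimal sketch of the replicated organization, assuming each group of clusters has its own identical copy of the L1D; the types and granularity (a word-level map instead of real cache lines) are simplifications for illustration.

  #include <array>
  #include <cstdint>
  #include <unordered_map>

  constexpr int kBanks = 4;                              // replicated L1D banks
  using Bank = std::unordered_map<uint64_t, uint64_t>;   // addr -> data (simplified)

  struct ReplicatedL1D {
    std::array<Bank, kBanks> banks;

    // Loads hit the bank nearest to the requesting cluster: a short, local trip.
    uint64_t load(int local_bank, uint64_t addr) {
      return banks[local_bank][addr];
    }

    // Stores (and cache refills) are broadcast so every replica stays identical,
    // which costs interconnect bandwidth and redundant write power.
    void store(uint64_t addr, uint64_t data) {
      for (Bank& b : banks) b[addr] = data;
    }
  };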

19
Comparing Centralized and Decentralized
[Figure: centralized and decentralized cache organizations side by side]
IPCs without cluster prefetch: 1.43 and 1.52
IPCs with cluster prefetch: 1.73 and 1.79
20
Sensitivity Analysis
  • Results verified for processor models with varying resources and interconnect latencies
  • Evaluations on SPEC-Int: the address prediction rate is only 38% → modest speedups
  • twolf (7%), parser (9%)
  • crafty, gcc, vpr (3-4%)
  • rest (< 2%)

21
Related Work
  • Modest speedups with decentralized caches: Racunas and Patt [ICS'03] for dynamic clustered processors, Gibert et al. [MICRO'02] for VLIW clustered processors
  • Gibert et al. [MICRO'03]: compiler-managed L0 buffers for critical data

22
Conclusions
  • Address prediction and memory dependence speculation can hide the latency to cache banks: prediction rate of 66% for SPEC-FP and IPC improvement of 21%
  • Additional benefits from decentralization are modest
  • Future work: build better predictors, study the impact on power consumption [WCED'04]
