Instruction Based Memory Distance Analysis and its Application to Optimization - PowerPoint PPT Presentation

1 / 35

About This Presentation

Title:

Instruction Based Memory Distance Analysis and its Application to Optimization

Description:

Instruction Based Memory Distance Analysis and its Application to Optimization Changpeng Fang Steve Carr Soner nder Zhenlin Wang Motivation Widening gap between ... – PowerPoint PPT presentation

Number of Views:123

Avg rating:3.0/5.0

Slides: 36

Provided by: StevenM176

Category:

more less

Transcript and Presenter's Notes

Title: Instruction Based Memory Distance Analysis and its Application to Optimization

1
Instruction Based Memory Distance Analysis and
its Application to Optimization

Changpeng Fang
Steve Carr
Soner Önder
Zhenlin Wang

2
Motivation

Widening gap between processor and memory speed
memory wall
Static compiler analysis has limited capability
regular array references only
index arrays
integer code
Reuse distance prediction across program inputs
number of distinct memory locations accessed
between two references to the same memory
location
applicable to more than just regular scientific
code
locality as a function of data size
predictable on whole program and per instruction
basis for scientific codes

3
Motivation

Memory distance
A dynamic quantifiable distance in terms of
memory reference between tow access to the same
memory location.
reuse distance
access distance
value distance
Is memory distance predictable across both
integer and floating-point codes?
predict miss rates
predict critical instructions
identify instructions for load speculation

4
Related Work

Reuse distance
Mattson, et al. 70
Sugamar and Abraham 94
Beyls and DHollander 02
Ding and Zhong 03
Zhong, Dropsho and Ding 03
Shen, Zhong and Ding 04
Fang, Carr, Önder and Wang 04
Marin and Mellor-Crummey 04
Load speculation
Moshovos and Sohi 98
Chyrsos and Emer 98
Önder and Gupta 02

5
Background

Memory distance
can use any granularity (cache line, address,
etc.)
either forward or backward
represented as a pattern
Represent memory distance as a pattern
divide consecutive ranges into intervals
we use powers of 2 up to 1K and then 1K intervals
Data size
the largest reuse distance for an input set
characterize reuse distance as a function of the
data size
Given two sets of patterns for two runs, can we
predict a third set of patterns given its data
size?

6
Background

Let be the distance of the ith bin in the
first pattern and be that of the second
pattern. Given the data sizes s1 and s2 we can
fit the memory distances using
Given ci, ei, and fi, we can predict the memory
distance of another input set with its data size

7
Instruction Based Memory Distance Analysis

How can we represent the memory distance of an
instruction?
For each active interval, we record 4 words of
data
min, max, mean, frequency
Some locality patterns cross interval boundaries
merge adjacent intervals, i and i 1, if
merging process stops when a minimum frequency is
found
needed to make reuse distance predictable
The set of merged intervals make up memory
distance patterns

8
Merging Example
9
What do we do with patterns?

Verify that we can predict patterns given two
training runs
coverage
accuracy
Predict miss rates for instructions
Predict loads that may be speculated

10
Prediction Coverage

Prediction coverage indicates the percentage of
instructions whose memory distance can be
predicted
appears in both training runs
access pattern appears in both runs and memory
distance does not decrease with increase in data
size (spatial locality)
same number of intervals in both runs
Called a regular pattern
For each instruction, we predict its ith pattern
by
curve fitting the ith pattern of both training
runs
applying the fitting function to construct a new
min, max and mean for the third run
Simple, fast prediction

11
Prediction Accuracy

An instructions memory distance is correctly
predicted if all of its patterns are predicted
correctly
predicted and observed patterns fall in same
interval
or, given two patterns A and B such thatB.min ?
A.max ? B.max

12
Experimental Methodology

Use 11 CFP2000 and 11 CINT2000 benchmarks
others dont compile correctly
Use ATOM to collect reuse distance statistics
Use test and train data sets for training runs
Evaluation based on dynamic weighting
Report reuse distance prediction accuracy
value and access very similar

13
Reuse Distance Prediction
Suite Patterns Patterns Coverage Accuracy
Suite constant linear Coverage Accuracy
CFP2000 85.1 7.7 93.0 97.6
CINT2000 81.2 5.1 91.6 93.8
14
Coverage issues

Reasons for no coverage
instruction does not appear in at least one test
run
reuse distance of test is larger than train
number of patterns does not remain constant in
both training runs

Suite Reason 1 Reason 2 Reason 3
CFP2000 4.2 0.3 2.5
CINT2000 2.2 4.4 1.8
15
Prediction Details

Other patterns
183.equake has 13.6 square root patterns
200.sixtrack, 186.crafty all constant (no data
size change)
Low coverage
189.lucas 31 of static memory operations do
not appear in training runs
164.gzip the test reuse distance greater than
train reuse distance
cache-line alignment

16
Number of Patterns
Suite 1 2 3 4 ?5
CFP2000 81.8 10.5 4.8 1.4 1.5
CINT2000 72.3 10.9 7.6 4.6 5.3
17
Miss Rate Prediction

Predict a miss for a reference if the backward
reuse distance is greater than the cache size.
neglects conflict misses
Accurate miss rate prediction

18
Miss Rate Prediction Methodology

Three miss-rate prediction schemes
TCS test cache simulation
Use the actual miss rates from running the
program on a the test data for the reference data
miss rates
RRD reference reuse distance
Use the actual reuse distance of the reference
data set to predict the miss rate for the
reference data set
An upper bound on using reuse distance
PRD predicted reuse distance
Use the predicted reuse distance for the
reference data set to predict the miss rate.

19
Cache Configurations
config no. L1 L2 L2
1 32K, fully assoc. 1M fully assoc.
2 34 32K, 2-way 1M 8-way4-way2-way
20
L1 Miss Rate Prediction Accuracy
Suite PRD RRD TCS
CFP2000 97.5 98.4 95.1
CINT2000 94.4 96.7 93.9
21
L2 Miss Rate Accuracy
Suite 2-way 2-way 2-way Fully Associative Fully Associative Fully Associative
Suite PRD RRD TCS PRD RRD TCS
CFP2000 91 93 87 97 99.9 91
CINT2000 91 95 87 94 99.9 89
22
Critical Instructions

Given reuse distance for an instruction
Can we determine which instructions are critical
in terms of cache performance?
An instruction is critical if it is in the set of
instructions that generate the most L2 cache
misses
Those top miss-rate instructions whose cumulative
total misses account for 95 of the misses in a
program.
Use the execution frequency of one training run
to determine the relative contribution number of
misses for each instruction
Compare the actual critical instructions with
predicted
Use cache configuration 2

23
Critical Instruction Prediction
Suite PRD RRD TCS pred act
CPF2000 92 98 51 1.66 1.67
CINT2000 89 98 53 0.94 0.97
24
Critical Instruction Patterns
Suite 1 2 3 4 ?5
CFP2000 22.1 38.4 20.0 12.8 6.7
CINT2000 18.7 14.5 25.5 22.5 18
25
Miss Rate Discussion

PRD performs better than TCS when data size is a
factor
TCS performs better when data size doesnt change
much and there are conflict misses
PRD is much better at identifying the critical
instructions than TCS
these instructions should be targets of
optimization

26
Memory Disambiguation

Load speculation
Can a load safely be issued prior to a preceding
store?
Use a memory distance to predict the likelihood
that a store to the same address has not finished
Access distance
The number of memory operations between a store
to and load from the same address
Correlated to instruction distance and window
size
Use only two runs
If access distance not constant, use the access
distance of larger of two data sets as a lower
bound on access distance

27
When to Speculate

Definitely no
access distance less than threshold
Definitely yes
access distance greater than threshold
Threshold lies between intervals
compute predicted mis-speculation frequency
(PMSF)
speculate is PMSF lt 5
When threshold does not intersect intervals
total of frequencies that lie below the threshold
Otherwise

28
Value-based Prediction

Memory dependence only if addresses and values
match
store a1, v1store a2, v2store a3, v3load
a4, v4
Can move ahead if a1a2a3a4, v2v3 and v1?v2
The access distance of a load to the first store
in a sequence of stores storing the same value is
called the value distance

29
Experimental Design

SPEC CPU2000 programs
SPEC CFP2000
171.swim, 172.mgrid, 173.applu, 177.mesa,
179.art, 183.equake, 188.ammp, 301.apsi
SPEC CINT2000
164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty,
197.parser, 253.perlbmk, 300.twolf
Compile with gcc 2.7.2 O3
Comparison
Access distance, value distance
Store set with 16KB table, also with values
Perfect disambiguation

30
Micro-architecture
issue width 8
fetch width 8
retire width 16
window size 128
load/store queue 128
functional units 8
fetch multiblock gshare
data cache perfect
memory ports 2
Operation Latency
load 2
int division 8
int multiply 4
other int 1
float multiply 4
float addition 3
float division 8
other float 2
31
IPC and Mis-speculation
Suite Access Distance Store Set 16KB Table Perfect
CFP2000 3.21 3.37 3.71
CINT2000 2.90 3.22 3.35
Suite Mis-speculation Rate Mis-speculation Rate Speculated Loads Speculated Loads
Suite Access Store Set Access Store Set
CFP2000 2.36 0.07 57.2 62.0
CINT2000 2.33 0.08 26.9 34.7
32
Value-based Disambiguation
Suite Value Distance Store Set 16KB Value
CFP2000 3.34 3.55
CINT2000 3.00 3.23
Suite Mis-speculation Rate Speculated Loads
CFP2000 1.22 59.3
CINT2000 1.55 27.6
33
Cache Model
Suite Access Store Set 16K
CFP2000 1.55 1.61
CINT2000 1.53 1.60
Suite Value Store Set 16K
CFP2000 1.59 1.63
CINT2000 1.55 1.65
34
Summary

Over 90 of memory operations can have reuse
distance predicted with a 97 and 93 accuracy,
for floating-point and integer programs,
respectively
We can accurately predict miss rates for
floating-point and integer codes
We can identify 92 of the instructions that
cause 95 of the L2 misses
Access- and value-distance-based memory
disambiguation are competitive with best hardware
techniques without a hardware table

35
Future Work