Methodologies for Performance Simulation of Super-scalar OOO processors


Provided by: anan176

Learn more at: https://home.engineering.iastate.edu

1
Methodologies for Performance Simulation of
Super-scalar OOO processors

Srinivas Neginhal
Anantharaman Kalyanaraman
CprE 585 Survey Project

2
Architectural Simulators

Explore Design Space
Evaluate existing hardware, or predict
performance of proposed hardware
Designer has control

Functional Simulators: Model architecture
(programmer's focus), e.g., sim-fast, sim-safe
Performance Simulators: Model microarchitecture
(designer's focus), e.g., cycle-by-cycle
(sim-outorder)
3
Simulation Issues

Real applications take too long for
cycle-by-cycle simulation!
Vast design space
Design Parameters
code properties, value prediction, dynamic
instruction distance, basic block size,
instruction fetch mechanisms, etc.
Architectural metrics
IPC/ILP, cache miss rate, branch prediction
accuracy, etc.
Find design flaws; provide design improvements
Correctness and accuracy of simulation results
Need a fast and robust simulation methodology!

4
Two Simulation Methodologies

HLS
Hybrid: Statistical and Symbolic
REF:
HLS: Combining Statistical and Symbolic
Simulation to Guide Microprocessor Designs. M.
Oskin, F. T. Chong and M. Farrens. Proc. ISCA,
pp. 71-82, 2000.

BBDA
Basic block distribution analysis
REF:
Basic Block Distribution Analysis to Find
Periodic Behavior and Simulation Points in
Applications. T. Sherwood, E. Perelman and B.
Calder. Proc. PACT, 2001.

5
HLS: An Overview

A hybrid processor simulator

[Diagram: HLS combines a Statistical Model with Symbolic Execution to produce performance contours spanned by design space parameters]
What can be achieved? Explore design changes in
architectures and compilers that would be
impractical to simulate using conventional
simulators
6
HLS: Main Idea
[Diagram: large application code goes through Statistical Profiling to produce synthetically generated code/data (instruction stream, data stream), which drives Structural Simulation of FUs and issue pipeline units. Tool labels in the diagram: sim-fast, sim-outorder]

Code characteristics:
Basic block size
Dynamic instruction distance
Instruction mix

Architecture metrics:
Cache behavior
Branch prediction accuracy
7
Statistical Code Generation

Each synthetic instruction contains the
following parameters based on the statistical
profile
Functional unit requirements
Dynamic instruction distances
Cache behavior
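
As a rough illustration of the idea on this slide, the sketch below draws synthetic instructions from a statistical profile. The profile values, the field names, and the `SyntheticInstruction` type are hypothetical placeholders chosen for this example; they are not figures or data structures from the HLS paper.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Hypothetical profile: the numbers are illustrative, not measured.
PROFILE = {
    "instruction_mix": {"int_alu": 0.5, "load": 0.25, "store": 0.1, "branch": 0.15},
    "did_histogram": {1: 0.4, 2: 0.3, 4: 0.2, 8: 0.1},  # dynamic instruction distance
    "l1_hit_rate": 0.95,
}


@dataclass
class SyntheticInstruction:
    functional_unit: str      # functional unit class the instruction needs
    did: int                  # distance (in instructions) to the producer it depends on
    l1_hit: Optional[bool]    # cache outcome for memory operations, None otherwise


def sample(distribution, rng):
    """Draw one key from a {value: probability} table."""
    keys = list(distribution)
    weights = [distribution[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate(profile, count, seed=0):
    """Generate `count` synthetic instructions that follow the profile statistics."""
    rng = random.Random(seed)
    instructions = []
    for _ in range(count):
        unit = sample(profile["instruction_mix"], rng)
        did = sample(profile["did_histogram"], rng)
        is_memory_op = unit in ("load", "store")
        hit = (rng.random() < profile["l1_hit_rate"]) if is_memory_op else None
        instructions.append(SyntheticInstruction(unit, did, hit))
    return instructions
```

The generated stream carries no real program semantics; each instruction only records the properties a structural simulator would need to schedule it.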

8
HLS: Correctness and Accuracy

Validate HLS against SimpleScalar (use IPC)
For varying combinations of design parameters
Run original benchmark code on SimpleScalar (use
sim-outorder)
Run statistically generated code on HLS
Compare SimpleScalar IPC vs. HLS IPC
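
The comparison step can be sketched as a relative-error check per design configuration. The function names, the configuration labels, and the tolerance argument are assumptions made for this example; the IPC values would come from the two simulators.

```python
def relative_ipc_error(reference_ipc, hls_ipc):
    """Relative difference between the HLS IPC and the reference (cycle-by-cycle) IPC."""
    return abs(hls_ipc - reference_ipc) / reference_ipc


def within_tolerance(results, tolerance):
    """results maps a configuration label to (reference_ipc, hls_ipc).

    Returns, per configuration, whether the relative error is within tolerance.
    """
    return {
        config: relative_ipc_error(reference, hls) <= tolerance
        for config, (reference, hls) in results.items()
    }
```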

9
Validation: Single- and Multi-value correlations
IPC vs. L1-cache hit rate
For SPECint95, HLS errors are within 5-7% of the
cycle-by-cycle results!
10
HLS: Code Properties, Basic Block Size vs.
L1-Cache Hit Rate
Inferred Correlation: Increasing basic block size
helps only when L1 cache hit rate is >96% or <82%
11
HLS: Value Prediction
GOAL: Break true dependencies
Stall penalty for mispredict vs. value prediction
knowledge
DID vs. value predictability
12
HLS: Superscalar Issue Width vs. Dynamic
Instruction Distance

Inferred Correlation:
DID and issue width are highly correlated,
especially as both start to increase

13
HLS: Conclusions

Low error rate only on the SPECint95 benchmark suite.
High error rates on SPECfp95 and STREAM
benchmarks
Findings by R. H. Bell et al., 2004
Reason:
Instruction-level granularity for workload
Recommended Improvement:
Basic block-level granularity

14
Basic Block Distribution Analysis

Basic Block Distribution Analysis to Find
Periodic Behavior and Simulation Points in
Applications.
T. Sherwood, E. Perelman and B. Calder.
Proc. PACT. 2001.

15
Introduction

Goal:
To capture large-scale program behavior in
significantly reduced simulation time.
Approach:
Find a representative subset of the full program.
Find an ideal place to simulate, given a specific
number of instructions that can be simulated.
Accurately estimate confidence in the simulation
point.

[Figure: program execution timeline showing the initialization phase, the period, and the simulation points]
16
Program Behavior

Program behavior has ramifications on
architectural techniques.
Program behavior is different in different parts
of execution:
Initialization
Cyclic behavior (periodic)
Cyclic behavior is not representative of all
programs.
It is the common case for compute-bound applications.

17
BBDA: Basics

Fast profiling is used to determine the number of
times a basic block executes.
Behavior of the program is directly related to
the code that it is executing.
Profiling gives a basic block fingerprint for
that particular interval of time.
The interval chosen is ideally representative
of the full execution of the program.
Profiling information is collected in intervals
of 100 million instructions.

18
Basic Block Vector (BBV)
[Figure: BBV for interval i, plotting execution frequency for each basic block B1, B2, ..., BD]
D: Total number of basic blocks in the program
code

BBV: Fingerprint of an interval
Varying size intervals
A BBV collected over an interval of N times 100
million instructions is a BBV of duration N.
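
A minimal sketch of BBV collection follows, assuming the profiler emits a trace of `(block_id, instruction_count)` pairs in execution order. That trace format and the function names are assumptions for this example; a real profiler would use 100 million instructions as the interval length.

```python
def collect_bbvs(block_trace, num_blocks, interval_len):
    """Count basic block executions per interval of about interval_len instructions.

    block_trace: iterable of (block_id, instruction_count) pairs in execution order.
    Returns a list of BBVs, each a list of num_blocks execution counts.
    """
    bbvs = []
    counts = [0] * num_blocks
    executed = 0
    for block_id, size in block_trace:
        counts[block_id] += 1
        executed += size
        if executed >= interval_len:
            # Interval is full: store its fingerprint and start a new one.
            bbvs.append(counts)
            counts = [0] * num_blocks
            executed = 0
    if executed:
        bbvs.append(counts)  # trailing partial interval
    return bbvs


def merge(bbvs, n):
    """Combine consecutive groups of n BBVs into BBVs of duration n."""
    return [
        [sum(column) for column in zip(*bbvs[i:i + n])]
        for i in range(0, len(bbvs), n)
    ]
```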

19
Target BBV

BBVs are normalized:
Each element is divided by the sum of all elements.
Target BBV:
BBV for the entire execution of the program.
Objective:
Find a BBV of smallest duration similar to
Target BBV.
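
Normalization and the target BBV can be sketched as below. The function names are chosen for this example, and the per-interval BBVs are assumed to be plain lists of execution counts of equal length.

```python
def normalize(bbv):
    """Divide each element by the sum of all elements."""
    total = sum(bbv)
    return [x / total for x in bbv] if total else list(bbv)


def target_bbv(bbvs):
    """BBV for the entire execution: element-wise sum of all interval BBVs, normalized."""
    return normalize([sum(column) for column in zip(*bbvs)])
```

After normalization every BBV sums to 1, so intervals of different durations can be compared directly.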

20
Basic Block Vector Difference

Difference between BBVs
Euclidean Distance
Manhattan Distance

Conservative Measure
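
For reference, the two standard distance definitions are written out below for two normalized BBVs of dimension D. The symbols a, b, ED, and MD are notation chosen here for illustration and are not taken from the slides.

```latex
% a and b are normalized BBVs with D elements each
\[
  \mathrm{ED}(a, b) = \sqrt{\sum_{i=1}^{D} (a_i - b_i)^2},
  \qquad
  \mathrm{MD}(a, b) = \sum_{i=1}^{D} \lvert a_i - b_i \rvert
\]
```

Because each normalized BBV has non-negative elements that sum to 1, the Manhattan distance between two of them lies between 0 and 2.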
21
Basic Block Difference Graph

Plot of how well each individual interval in the
program compares to the target BBV
For each interval of 100 million instructions, we
create a BBV and calculate its difference from
target BBV
Used to
Find the end of initialization phase
Find the period for the program

22
Basic Block Difference Graph
23
Initialization

Initialization is not trivial.
Important to simulate representative sections of
the initialization code.
Detection of the end of the initialization phase
is important.
Initialization Difference Graph:
Initial Representative Signal (IRS): first quarter of
the BB difference graph (BBDG).
Slide it across the BB difference graph.
The difference is calculated at each point for the first
half of the BBDG.
When IRS reaches the end of the initialization
stage on the BB difference graph, the difference
is maximized.
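
A minimal sketch of this sliding comparison is given below. It assumes the difference at each offset is the sum of absolute differences between the signal and the graph window under it; the slides do not state the exact measure, so that choice and the function names are assumptions.

```python
def sliding_difference(graph, signal, max_offset):
    """Difference between the signal and the graph window at each offset.

    Assumption: the difference is the sum of absolute point-wise differences.
    """
    n = len(signal)
    last = min(max_offset, len(graph) - n)
    return [
        sum(abs(graph[offset + i] - signal[i]) for i in range(n))
        for offset in range(last + 1)
    ]


def estimate_init_end(graph):
    """Offset (in intervals) at which the sliding difference is largest."""
    signal = graph[: max(1, len(graph) // 4)]          # first quarter of the graph
    diffs = sliding_difference(graph, signal, len(graph) // 2)
    return max(range(len(diffs)), key=diffs.__getitem__)
```

For a graph such as `[1.0, 1.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]`, where the first two intervals differ strongly from the rest, the estimate is interval 2.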

24
Initialization
25
Period

Period Difference Graph:
Period Representative Signal (PRS):
Part of the BBDG, starting from the end of
initialization to one quarter of the length of program
execution.
Slide it across half the BBDG.
The distance between the minimum Y-axis points is the
period.
Using larger durations of a BBV creates a BBDG
that emphasizes larger periods.
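
The period search can be sketched in the same way. This example assumes the signal is a window one quarter of the graph long that starts at the end of initialization, and that the difference is the sum of absolute point-wise differences; both are interpretations made for this sketch, as are the function names.

```python
def period_difference_graph(graph, init_end):
    """Slide a signal taken after initialization across half of the graph."""
    signal = graph[init_end: init_end + max(1, len(graph) // 4)]
    n = len(signal)
    last = min(len(graph) // 2, len(graph) - init_end - n)
    return [
        sum(abs(graph[init_end + offset + i] - signal[i]) for i in range(n))
        for offset in range(last + 1)
    ]


def estimate_period(graph, init_end):
    """Distance between the first two minima of the period difference graph.

    Offset 0 always matches the signal exactly, so it counts as the first minimum.
    Returns None when no second minimum is found.
    """
    pdg = period_difference_graph(graph, init_end)
    minima = [0] + [
        o for o in range(1, len(pdg) - 1)
        if pdg[o] < pdg[o - 1] and pdg[o] <= pdg[o + 1]
    ]
    return minima[1] - minima[0] if len(minima) > 1 else None
```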

26
Period
27
Summary of Results

Compared with cycle-by-cycle simulation of the
full program:
IPC differed by 5%.
Most of the other metrics match up closely.

28
Characterizing Program Behavior Through Clustering

Automatically Characterizing Large Scale Program
Behavior.
T. Sherwood, E. Perelman, G. Hamerly and B.
Calder.
ASPLOS 2002

29
Clustering Approach
[Diagram: N BBVs go through Clustering to form clusters 1, 2, ..., K, which yield multiple simulation points P1, P2, ..., Pk]
30
Clustering (k-means)

Goal is to divide a set of points into groups
such that points within each group are similar to
one another by a desired metric.
Input: N points in D-dimensional space
Output: A partition of k clusters
Algorithm:
1. Randomly choose k points as centroids
(initialization)
2. Compute cluster membership of each point based on
its distance from each centroid
3. Compute new centroid for each cluster
4. Iterate steps 2 and 3 until convergence

Runtime complexity affected by the curse of
dimensionality
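
The algorithm on this slide can be sketched in plain Python as follows. Squared Euclidean distance, the fixed random seed, and the iteration cap are choices made for this example rather than details from the slides.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Partition points into k clusters; returns (assignment, centroids)."""
    rng = random.Random(seed)
    # Randomly choose k points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: memberships no longer change
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

Each distance computation costs time proportional to the number of dimensions, which is one reason the BBVs are reduced in dimension before clustering.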
31
Dimension Reduction Technique

Random Projection
Reduces the dimension of the BBVs to 15
Dimension Selection
Dimension Reduction
Random Linear Projection.
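
A random linear projection can be sketched as a multiplication by a fixed random matrix. Drawing the matrix entries uniformly from [-1, 1] and fixing the seed are assumptions made for this example; only the target dimension of 15 comes from the slide.

```python
import random


def random_projection_matrix(source_dim, target_dim=15, seed=0):
    """Build a source_dim x target_dim matrix of random values in [-1, 1]."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(target_dim)]
        for _ in range(source_dim)
    ]


def project(bbv, matrix):
    """Multiply a 1 x D BBV by a D x target_dim matrix."""
    target_dim = len(matrix[0])
    return [
        sum(x * row[j] for x, row in zip(bbv, matrix))
        for j in range(target_dim)
    ]
```

The same matrix has to be reused for every BBV of a program so that the projected vectors remain comparable.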

32
BBDA: Conclusions

BBDA provides better sensitivity and lower
performance variation in phases.
Other related work, such as the instruction working
set technique, provides higher stability.
For further evaluation of different techniques,
refer to:
Comparing Program Phase Detection Techniques.
A. S. Dhodapkar and J. E. Smith.

33
Related Work