Methodologies for Performance Simulation of Super-scalar OOO processors - PowerPoint PPT Presentation

1
Methodologies for Performance Simulation of
Super-scalar OOO processors
  • Srinivas Neginhal
  • Anantharaman Kalyanaraman
  • CprE 585 Survey Project

2
Architectural Simulators
  • Explore the design space
  • Evaluate existing hardware, or predict the
    performance of proposed hardware
  • The designer has control
  • Functional simulators model the architecture
    (programmer's focus), e.g., sim-fast, sim-safe
  • Performance simulators model the
    microarchitecture (designer's focus), e.g.,
    cycle-by-cycle simulation with sim-outorder

3
Simulation Issues
  • Real applications take too long to simulate
    cycle by cycle
  • Vast design space
  • Design parameters: code properties, value
    prediction, dynamic instruction distance, basic
    block size, instruction fetch mechanisms, etc.
  • Architectural metrics: IPC/ILP, cache miss rate,
    branch prediction accuracy, etc.
  • Find design flaws; provide design improvements
  • Correctness and accuracy of simulation results
  • Need a fast and robust simulation methodology

4
Two Simulation Methodologies
  • HLS: hybrid statistical and symbolic simulation
  • Reference: M. Oskin, F. T. Chong and M. Farrens.
    HLS: Combining Statistical and Symbolic
    Simulation to Guide Microprocessor Designs.
    Proc. ISCA, pp. 71-82, 2000.
  • BBDA: basic block distribution analysis
  • Reference: T. Sherwood, E. Perelman and B.
    Calder. Basic Block Distribution Analysis to
    Find Periodic Behavior and Simulation Points in
    Applications. Proc. PACT, 2001.

5
HLS: An Overview
  • A hybrid processor simulator
  • [Figure: HLS combines a statistical model with
    symbolic execution; its output is performance
    contours spanned by design space parameters]
  • What can be achieved? Explore design changes in
    architectures and compilers that would be
    impractical to simulate using conventional
    simulators

6
HLS: Main Idea
  • [Figure: large application code -> statistical
    profiling -> synthetically generated code/data
    (instruction stream, data stream) -> structural
    simulation of functional units (FUs) and issue
    pipeline units]
  • Code characteristics: basic block size, dynamic
    instruction distance, instruction mix
  • Architecture metrics: cache behavior, branch
    prediction accuracy
  • Figure labels: sim-fast, statistical profiling,
    sim-outorder, structural simulation

7
Statistical Code Generation
  • Each synthetic instruction carries the
    following parameters, based on the statistical
    profile:
  • Functional unit requirements
  • Dynamic instruction distances
  • Cache behavior
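
The generation step above can be sketched as sampling from a profile. This is a minimal illustration, not the HLS implementation: the profile values, the field names, and the `generate` helper are all hypothetical, and each parameter is sampled independently for simplicity.

```python
import random
from dataclasses import dataclass


@dataclass
class SyntheticInstruction:
    functional_unit: str  # functional unit class the instruction needs
    did: int              # dynamic instruction distance to its producer
    l1_hit: bool          # cache behavior (only meaningful for loads/stores)


# Hypothetical profile values for illustration only (not measured data).
PROFILE = {
    "instruction_mix": {"int_alu": 0.50, "load": 0.25, "store": 0.10, "branch": 0.15},
    "did_histogram": {1: 0.4, 2: 0.3, 4: 0.2, 8: 0.1},
    "l1_hit_rate": 0.95,
}


def sample(dist, rng):
    """Draw one key from a {value: probability} table."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate(profile, n, seed=0):
    """Generate n synthetic instructions from the statistical profile."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n):
        fu = sample(profile["instruction_mix"], rng)
        did = sample(profile["did_histogram"], rng)
        hit = rng.random() < profile["l1_hit_rate"]
        stream.append(SyntheticInstruction(fu, did, hit))
    return stream


stream = generate(PROFILE, 1000)
```

A fixed seed keeps the synthetic stream reproducible, which makes it easier to compare runs across design points.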

8
HLS: Correctness and Accuracy
  • Validate HLS against SimpleScalar, using IPC as
    the metric
  • For varying combinations of design parameters:
  • Run the original benchmark code on SimpleScalar
    (sim-outorder)
  • Run the statistically generated code on HLS
  • Compare SimpleScalar IPC vs. HLS IPC
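
The comparison in the last step can be expressed as a relative IPC error per configuration. The sketch below assumes that definition of error; the configuration names and IPC values are hypothetical placeholders, not results from the study.

```python
def ipc_error(reference_ipc, estimated_ipc):
    """Relative error of the estimated IPC against the reference IPC."""
    return abs(estimated_ipc - reference_ipc) / reference_ipc


# Hypothetical (reference IPC, estimated IPC) pairs for illustration only.
runs = {
    "config_a": (1.50, 1.44),
    "config_b": (2.00, 2.10),
}

errors = {name: ipc_error(ref, est) for name, (ref, est) in runs.items()}
for name, err in errors.items():
    print(f"{name}: {err:.1%}")
```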

9
Validation: Single- and Multi-value Correlations
  • IPC vs. L1-cache hit rate
  • For SPECint95, HLS errors are within 5-7% of the
    cycle-by-cycle results
10
HLS Code Properties: Basic Block Size vs.
L1-Cache Hit Rate
  • Inferred correlation: increasing basic block
    size helps only when the L1 cache hit rate is
    >96% or <82%
11
HLS: Value Prediction
  • Goal: break true dependencies
  • Stall penalty for mispredict vs. value prediction
    knowledge
  • DID vs. value predictability
12
HLS: Superscalar Issue Width vs. Dynamic
Instruction Distance
  • Inferred correlation: dynamic instruction
    distance (DID) and issue width are highly
    correlated, especially as both start to increase

13
HLS: Conclusions
  • Low error rate only on the SPECint95 benchmark
    suite; high error rates on the SPECfp95 and
    STREAM benchmarks (findings by R. H. Bell et
    al., 2004)
  • Reason: the workload is modeled at
    instruction-level granularity
  • Recommended improvement: basic block-level
    granularity

14
Basic Block Distribution Analysis
  • Basic Block Distribution Analysis to Find
    Periodic Behavior and Simulation Points in
    Applications.
  • T. Sherwood, E. Perelman and B. Calder.
  • Proc. PACT. 2001.

15
Introduction
  • Goal: capture large-scale program behavior in
    significantly reduced simulation time
  • Approach:
  • Find a representative subset of the full program
  • Find an ideal place to simulate, given a specific
    number of instructions one has to simulate
  • Accurate confidence estimation of the simulation
    point

[Figure: program execution timeline marking the
initialization phase, the period, and the
simulation points]
16
Program Behavior
  • Program behavior has ramifications for
    architectural techniques.
  • Program behavior differs across different parts
    of execution: initialization, then cyclic
    (periodic) behavior.
  • Cyclic behavior is not representative of all
    programs, but it is the common case for
    compute-bound applications.

17
BBDA Basics
  • Fast profiling is used to determine the number of
    times a basic block executes.
  • Behavior of the program is directly related to
    the code that it is executing.
  • Profiling gives a basic block fingerprint for
    that particular interval of time.
  • The interval chosen is ideally representative
    of the full execution of the program.
  • Profiling information is collected in intervals
    of 100 million instructions.

18
Basic Block Vector (BBV)
  • BBV for interval i: (B1, B2, ..., BD), where Bx is
    the frequency of basic block x in interval i
  • D: total number of basic blocks in the program
    code
  • A BBV is the fingerprint of an interval
  • Intervals can vary in size: a BBV collected over
    an interval of N times 100 million instructions
    is a BBV of duration N
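
As a rough sketch of how per-interval BBVs might be collected, the code below walks a basic-block trace and emits one frequency vector per interval. The trace format, a sequence of hypothetical `(block_id, instruction_count)` pairs, and the tiny interval length are assumptions chosen for illustration; the slides use intervals of 100 million instructions.

```python
from collections import Counter


def collect_bbvs(trace, num_blocks, interval_len):
    """Return one basic block vector (execution counts) per interval.

    trace: iterable of (block_id, instruction_count) in execution order.
    """
    bbvs, counts, executed = [], Counter(), 0
    for block_id, n_instr in trace:
        counts[block_id] += 1          # one more execution of this block
        executed += n_instr
        if executed >= interval_len:   # interval is full: emit its BBV
            bbvs.append([counts[b] for b in range(num_blocks)])
            counts, executed = Counter(), 0
    if executed:                       # trailing partial interval
        bbvs.append([counts[b] for b in range(num_blocks)])
    return bbvs


# Toy trace with 3 basic blocks and an interval length of 10 instructions.
trace = [(0, 4), (1, 6), (0, 4), (2, 6), (1, 3)]
bbvs = collect_bbvs(trace, num_blocks=3, interval_len=10)
```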
19
Target BBV
  • BBVs are normalized: each element is divided by
    the sum of all elements
  • Target BBV: the BBV for the entire execution of
    the program
  • Objective: find a BBV of the smallest duration
    that is similar to the target BBV
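
A minimal sketch of normalization and of the target BBV follows. It assumes the target BBV can be formed by summing the raw per-interval counts and then normalizing, which matches "the BBV for the entire execution"; the helper names are illustrative.

```python
def normalize(bbv):
    """Divide each element by the sum of all elements."""
    total = sum(bbv)
    return [x / total for x in bbv] if total else list(bbv)


def target_bbv(interval_bbvs):
    """BBV of the whole execution: element-wise sum over intervals, normalized."""
    summed = [sum(column) for column in zip(*interval_bbvs)]
    return normalize(summed)


intervals = [[1, 1, 0], [1, 0, 1], [0, 1, 0]]
target = target_bbv(intervals)
```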

20
Basic Block Vector Difference
  • Difference between BBVs
  • Euclidean Distance
  • Manhattan Distance

Conservative Measure
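
Both distance measures are straightforward to compute on normalized BBVs. The sketch below uses illustrative function names; note that for two normalized vectors the Manhattan distance lies between 0 (identical) and 2 (no basic blocks in common).

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


a = [0.5, 0.5, 0.0]
b = [0.5, 0.0, 0.5]
d_euclid = euclidean_distance(a, b)
d_manhattan = manhattan_distance(a, b)
```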
21
Basic Block Difference Graph
  • Plot of how well each individual interval in the
    program compares to the target BBV
  • For each interval of 100 million instructions, we
    create a BBV and calculate its difference from
    target BBV
  • Used to find the end of the initialization phase
    and the period of the program
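
The difference graph can be sketched as one distance value per interval. This example assumes Manhattan distance and a target BBV built from the summed interval counts; both choices, and the toy data, are illustrative rather than taken from the paper.

```python
def normalize(bbv):
    total = sum(bbv)
    return [x / total for x in bbv] if total else list(bbv)


def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def difference_graph(interval_bbvs):
    """Distance of each interval's normalized BBV from the target BBV."""
    target = normalize([sum(column) for column in zip(*interval_bbvs)])
    return [manhattan_distance(normalize(v), target) for v in interval_bbvs]


# Toy data: the first interval behaves differently from the rest.
bbvs = [[9, 1, 0], [0, 5, 5], [0, 5, 5], [0, 5, 5]]
bbdg = difference_graph(bbvs)
```

In this toy case the first interval sits far from the target while the remaining intervals are close to it, which is the kind of shape an initialization phase would produce.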

22
Basic Block Difference Graph
23
Initialization
  • Initialization is not trivial.
  • Important to simulate representative sections of
    the initialization code.
  • Detection of the end of the initialization phase
    is important.
  • Initialization difference graph:
  • Initial Representative Signal (IRS): the first
    quarter of the BB difference graph (BBDG).
  • Slide the IRS across the BBDG.
  • The difference is calculated at each point for
    the first half of the BBDG.
  • When the IRS reaches the end of the
    initialization stage on the BBDG, the difference
    is maximized.
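
One way to read the sliding procedure above is sketched here. The difference measure (sum of absolute differences between the IRS and the window under it) and the choice of the first offset that reaches the maximum are assumptions made for this sketch; the slides do not specify them.

```python
def initialization_difference_graph(bbdg):
    """Slide the IRS (first quarter of the BBDG) over the first half of the BBDG."""
    n = len(bbdg)
    width = max(1, n // 4)
    irs = bbdg[:width]
    return [
        sum(abs(bbdg[offset + i] - irs[i]) for i in range(width))
        for offset in range(n // 2)
    ]


def end_of_initialization(bbdg):
    """First offset at which the IRS differs most from the signal under it."""
    idg = initialization_difference_graph(bbdg)
    return idg.index(max(idg))


# Toy BBDG: 4 initialization intervals followed by steady-state behavior.
bbdg = [1.0] * 4 + [0.2] * 12
init_end = end_of_initialization(bbdg)
```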

24
Initialization
25
Period
  • Period difference graph:
  • Period Representative Signal (PRS): part of the
    BBDG, starting from the end of initialization to
    1/4 of the length of program execution.
  • Slide the PRS across half the BBDG.
  • The distance between the minimum Y-axis points is
    the period.
  • Using larger BBV durations creates a BBDG that
    emphasizes larger periods.
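
A possible sketch of the period search follows. It assumes the PRS spans a quarter of the BBDG length starting at the end of initialization, uses a sum of absolute differences, and treats offsets within a small tolerance of the global minimum as the "minimum points". These are simplifications for illustration; noisy real data would need a more robust minimum detector.

```python
def period_difference_graph(bbdg, init_end):
    """Slide the PRS across (up to) half of the BBDG, starting at init_end."""
    n = len(bbdg)
    width = max(1, n // 4)
    prs = bbdg[init_end:init_end + width]
    n_offsets = min(n // 2, n - init_end - width + 1)
    return [
        sum(abs(bbdg[init_end + offset + i] - prs[i]) for i in range(width))
        for offset in range(n_offsets)
    ]


def estimate_period(pdg, tol=1e-9):
    """Smallest gap between offsets where the difference is at its minimum."""
    lowest = min(pdg)
    minima = [i for i, d in enumerate(pdg) if d - lowest <= tol]
    gaps = [b - a for a, b in zip(minima, minima[1:])]
    return min(gaps) if gaps else None


# Toy BBDG: 2 initialization intervals, then a pattern repeating every 3.
bbdg = [1.0, 1.0] + [0.1, 0.5, 0.3] * 10
pdg = period_difference_graph(bbdg, init_end=2)
period = estimate_period(pdg)
```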

26
Period
27
Summary of Results
  • Compared with cycle-by-cycle simulation of the
    full program.
  • IPC differed by 5%
  • Most of the other metrics match up closely.

28
Characterizing Program Behavior Through Clustering
  • Automatically Characterizing Large Scale Program
    Behavior.
  • T. Sherwood, E. Perelman, G. Hamerly and B.
    Calder.
  • ASPLOS 2002

29
Clustering Approach
[Figure: N BBVs -> clustering -> k clusters
(1, 2, ..., K) -> multiple simulation points
(P1, P2, ..., Pk)]
30
Clustering (k-means)
  • Goal: divide a set of points into groups such
    that points within each group are similar to one
    another by a desired metric
  • Input: N points in D-dimensional space
  • Output: a partition into k clusters
  • Algorithm:
  • Step 1: randomly choose k points as centroids
    (initialization)
  • Step 2: compute the cluster membership of each
    point based on its distance from each centroid
  • Step 3: compute the new centroid of each cluster
  • Step 4: iterate steps 2 and 3 until convergence
  • Runtime complexity is affected by the curse of
    dimensionality
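
The four steps above can be sketched in plain Python. This is a minimal illustration using squared Euclidean distance and a fixed random seed, both of which are choices made for the example; production use would normally rely on a tuned library implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly choose k points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop when memberships no longer change (convergence).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centroids = kmeans(points, k=2)
```

Each iteration computes N times k distances over D dimensions, which is why reducing D before clustering pays off.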
31
Dimension Reduction Technique
  • Random projection: reduces the dimension of the
    BBVs to 15
  • Dimension selection
  • Dimension reduction: random linear projection
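
A random linear projection multiplies each D-dimensional BBV by a fixed D x 15 random matrix. In the sketch below, drawing matrix entries uniformly from [-1, 1] is an assumption of the example, as are the function names.

```python
import random


def random_projection_matrix(d, target_dim=15, seed=0):
    """D x target_dim matrix with entries drawn uniformly from [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(target_dim)] for _ in range(d)]


def project(bbv, matrix):
    """Project a D-dimensional BBV down to len(matrix[0]) dimensions."""
    return [
        sum(x * row[j] for x, row in zip(bbv, matrix))
        for j in range(len(matrix[0]))
    ]


# Toy BBV with D = 100 basic blocks, projected down to 15 dimensions.
D = 100
bbv = [1.0 / D] * D
matrix = random_projection_matrix(D)
reduced = project(bbv, matrix)
```

The same matrix must be reused for every BBV so that distances between projected vectors remain comparable.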

32
BBDA Conclusions
  • BBDA provides better sensitivity and lower
    performance variation in phases
  • Other related work such as instruction working
    set technique provides higher stability
  • For further evaluation of different techniques,
    refer to "Comparing Program Phase Detection
    Techniques" by A. S. Dhodapkar and J. E. Smith

33
Related Work
  • Find smaller representative inputs: KleinOsowski
    et al., 2000.
  • Fast forwarding and checkpointing: Haskins and
    Skadron, 2002.
  • Simulation points based: Lafage et al., 2000.
  • Statistical simulation: Oskin et al., 2000.
  • Trace-driven approach for statistical simulation:
    Carl et al., 1998.