The Mark-II Performance Simulator for VIRAM-1 Gagan Prakash, Brian Gaeke CS 252 Spring 2001 http://www-inst.eecs.berkeley.edu/~brg/vsimII brg@eecs.berkeley.edu gagpcool@hkn.eecs.berkeley.edu

1
The Mark-II Performance Simulator for VIRAM-1
Gagan Prakash, Brian Gaeke
CS 252 Spring 2001
http://www-inst.eecs.berkeley.edu/~brg/vsimII
brg@eecs.berkeley.edu
gagpcool@hkn.eecs.berkeley.edu
2
Problems in Performance Simulation
(Simulation) Runtime Does Matter!
  Current performance simulator: > 1500X slowdown
  Many problems now assume "normal" VIRAM-1 chip config
    When simulator was designed, "normal" chip not known
    Lots of parameters no longer needed
  Many computation-intensive datasets cannot be simulated
Software architecture of current simulator non-portable
  Current simulator no longer maintained
  Out of date with respect to simpler (functional) simulator
  Can't use today's machines (180 MHz fastest simulation machine)
3
Solutions for Simulation
Trace-based cycle-level simulator
  Traces from actively-maintained functional simulator
  No more version skew!
Emphasis on portability
  Faster simulation machines => faster simulations
Streamline parametrization
  Make it look like the "normal" VIRAM-1 chip
  Restrict parametrization
Potential pitfall: traces are huge
  Support compressed traces
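The "support compressed traces" point can be illustrated with a minimal sketch, assuming a line-oriented text trace and gzip compression. Neither assumption comes from the slides: the record format, the ".gz" suffix convention, and the function names below are hypothetical. The idea is only that the reader decompresses while streaming, so a large trace never has to exist uncompressed on disk or in memory.

```python
import gzip
from typing import Iterator, List


def read_trace(path: str) -> Iterator[str]:
    """Yield one trace record per line, decompressing on the fly if needed."""
    # Hypothetical convention: a ".gz" suffix marks a gzip-compressed trace.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="ascii") as trace:
        for line in trace:
            record = line.strip()
            if record:  # skip blank lines
                yield record


def write_compressed_trace(path: str, records: List[str]) -> None:
    """Write records as a gzip-compressed, line-oriented trace (for demos)."""
    with gzip.open(path, "wt", encoding="ascii") as trace:
        for record in records:
            trace.write(record + "\n")
```

Because `read_trace` is a generator, a consumer can process one record at a time regardless of the total trace size.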
4
High-Level Simulator Architecture
Design of simulator mirrors design of VIRAM-1 chip
Software Units
  Lexer
  Parser
  Performance Analyzer
  Control Unit and queue
  Issue Unit and queue
  Functional Units
5
Low-Level Simulator Architecture
Functional Units (FUs)
  Memory Functional Unit
    Memory system
    Translation system
  Flag Functional Unit
  Arithmetic Functional Units
    1 Int+FP, 1 Int only
  Element group queues
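As a rough illustration of a simulator whose structure mirrors the chip, the sketch below models an issue queue feeding per-unit queues, with every unit stepped once per cycle. It is not the Mark-II code: the class, the latencies, the one-issue-per-cycle rule, and the operation names are all assumptions made for the example, and hazards and interlocks are ignored entirely.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Iterable, List, Tuple


@dataclass
class FunctionalUnit:
    name: str
    latency: int  # cycles one operation occupies the unit (made-up value)
    queue: Deque[str] = field(default_factory=deque)
    busy_until: int = 0  # first cycle at which the unit is free again
    executed: List[str] = field(default_factory=list)

    def step(self, cycle: int) -> None:
        """Start the next queued operation if the unit is free this cycle."""
        if self.queue and cycle >= self.busy_until:
            op = self.queue.popleft()
            self.busy_until = cycle + self.latency
            self.executed.append(op)


def simulate(trace: Iterable[Tuple[str, str]],
             units: Dict[str, FunctionalUnit]) -> int:
    """Return the number of cycles needed to drain the trace."""
    issue_queue = deque(trace)  # (unit_name, op) pairs in program order
    cycle = 0
    while issue_queue or any(
        u.queue or u.busy_until > cycle for u in units.values()
    ):
        # Issue stage: hand at most one operation per cycle to its unit.
        if issue_queue:
            unit_name, op = issue_queue.popleft()
            units[unit_name].queue.append(op)
        # Execute stage: every unit advances by one cycle.
        for unit in units.values():
            unit.step(cycle)
        cycle += 1
    return cycle
```

With a two-cycle "mem" unit and a one-cycle "arith" unit, the trace load, add, store drains in four cycles under these assumptions.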
6
Wall clock time to simulate
7
Peak simulator memory usage
Measured resident size using ps (pages actually touched)
8
Predicted cycle count
Measuring inner loops only
  Benchmark    Percent Difference
  Update       13
  Transitive   5
  Pointer      17
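The slide does not say how the percent difference was computed or which cycle count served as the baseline. One common definition, shown here purely as an assumption, is the absolute difference relative to a reference count.

```python
def percent_difference(predicted: float, reference: float) -> float:
    """Absolute difference between two cycle counts, as a percentage of the reference."""
    if reference == 0:
        raise ValueError("reference cycle count must be nonzero")
    return 100.0 * abs(predicted - reference) / abs(reference)
```

For instance, a hypothetical prediction of 1130 cycles against a reference of 1000 cycles gives a difference of 13 percent under this definition.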
9
Project Successes
Useful parametrizations!
  Lanes, banks/subbanks, memory size
Reduced simulator memory size
Lots of simple optimizations
  Don't simulate empty queues
  Retire no-ops early
Reduced implementation complexity
  7,500 LOC vs 117,000 LOC
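The two optimizations named above can be sketched as follows. This is one possible reading, not the authors' implementation: the "nop" mnemonic, the function names, and the choice to retire a no-op at dispatch time are all hypothetical.

```python
from collections import deque
from typing import Deque, List

NOP = "nop"  # hypothetical mnemonic for a no-op trace record


def dispatch(op: str, fu_queue: Deque[str], retired: List[str]) -> bool:
    """Send op to a functional-unit queue, retiring no-ops immediately.

    Returns True if the op was queued for execution, False if it was
    retired early without occupying any functional unit.
    """
    if op == NOP:
        retired.append(op)
        return False
    fu_queue.append(op)
    return True


def step_queues(queues: List[Deque[str]]) -> int:
    """Advance one cycle; return how many queues actually did work."""
    active = 0
    for queue in queues:
        if not queue:  # empty queue: nothing to simulate this cycle
            continue
        queue.popleft()
        active += 1
    return active
```

Both checks are cheap, and in a loop that runs once per simulated cycle they avoid per-queue bookkeeping for units that have nothing to do.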
10
Project "Not-So-Successes"
Cycle-level simulation
  Memory FU resolves hazards per-element-group
  Element groups from many instructions in any cycle
  Interlocks between memory unit and other FUs
  Control/issue unit simulations basically trivial
Trace size
  Small traces range from 50 - 250 megs
  Simulator spends 70 - 95% of time in I/O
Memory system
  Implementor information starvation!
  Memory bandwidth numbers are unavailable
  TLB undocumented
  Scalar core????
11
Conclusions and Future Work
Program dependent average analysis
  Multiple idealized models
    Each with a queue model and a few typical kernels
  Could enable multicycle simulation
  You need a general simulator to enable this, though
Cut the fat out of the old simulator
  Port it to other platforms?
Exception modeling untouched
  "We still don't have an OS"
  Software-managed TLB effects unknown
Is this simulation really better? (Hennessy)
12
What We Learned
Leverage Existing Work First!
  Why rewrite when you can port, extend, or document?
Need extremely detailed docs to write simulator
  A good simulator can be documentation
  Need access to random notes, not just theses
  Emphasize leaving behind good docs when you graduate?
Devising good approximations for complex HW is a black art
  But approximations are indispensable
  Trading off accuracy vs. complexity
Experiment with compilers and standard libraries
  Portability and efficiency
13
Backup Slides
14
Why We Ditched Multicycle Simulation
1. Finding register file structural hazards requires per-cycle
  Suppose full pipeline...
  Every cycle, some FU is doing a reg read
  Could cause structural hazard w/ first memory unit stage
2. Memory unit must be synched with other FUs
  Memory unit controls other units' stalls
  To figure out whether other units can go ahead
  Need all the details of memory unit state per pipeline stage
3. Added overhead of multicycle -> Amdahl's Law
  Simplify implementation by always assuming single cycle
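The Amdahl's Law argument in point 3 can be made concrete. Elsewhere the slides report 70 - 95% of simulator time in I/O, which leaves roughly 5 - 30% for simulation proper. Assuming, as an interpretation rather than something the slides state, that multicycle simulation would only accelerate that non-I/O portion, the overall gain is bounded as below. The 2x local speedup in the usage note is a made-up figure.

```python
def amdahl_speedup(fraction_improved: float, local_speedup: float) -> float:
    """Overall speedup when only a fraction of the runtime is accelerated.

    fraction_improved: share of total runtime that benefits (0.0 to 1.0).
    local_speedup: speedup factor applied to that share (must be > 0).
    """
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)
```

With 30% of time in simulation and a hypothetical 2x gain from multicycle stepping, `amdahl_speedup(0.30, 2.0)` is about 1.18; with only 5% in simulation it is about 1.03. Under these assumptions the added complexity buys very little.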
15
Compiler Effects
Fallacy: The compiler that understands the language better produces the faster code.
Stepanov Abstraction Penalty Benchmark
  Measures speedup of C++ library algos/data abstractions versus naive (FORTRAN-like) hand coded loops
  On same floating point vector kernel
You pay 2.3x in runtime for using a smarter compiler
16
Library Effects
Pitfall: Relying on standard library for programmer efficiency.
Surprises in profile for early version
  Lib calls (string) and object constructors???
  When you are dealing with 200MB traces you want to be I/O bound.
Workaround: Don't use objects
  Make everything extern "C" ...
  Use C strcpy instead of C++ string::assign
Result: Time in I/O reduced from 95% to 70%
17
Ideal Simulator Construction Experience
User selectable multiple levels of detail
Having a detailed understanding of processor first
  Access to documentation, notes
  Information about design decisions
A better mix of C and C++
Well defined input format (parsing traces is Evil)
Component framework for simulator construction
  Standardize interface between pieces
  RTL, coarse-grained cycle, queueing theory, memory interface, hand hacked...
[Diagram labels: custom, RTL, queue model, queue model]