Title: A Performance Estimator for Parallel Hierarchical Memory Systems -- PetaSIM
1A Performance Estimator for Parallel Hierarchical
Memory Systems -- PetaSIM
- Yuhong Wen and Geoffrey C. Fox
- Northeast Parallel Architecture Center (NPAC)
- Syracuse University
- wen,gcf_at_npac.syr.edu
2Outline
- Performance Estimation Process
- PetaSIM Motivation and Idea
- Helps design computer architectures and applications
- User-friendly Java applet interface
- Performance Specification Language Features
- Design and Implementation of PetaSIM
- Experiment Results
3Why Performance Prediction?
- Application complexities
- A lot of processors required
- Large amount of data involved
- Time-consuming processing
- Performance prediction to speed up
- new parallel computer architecture design
- application model design
4Performance Prediction Approaches
- Concept design level performance prediction
- aims to provide a quick, roughly correct performance prediction at the early stage of model design
- PetaSIM works at this level
- Detailed performance prediction
- provides detailed information about a given application running on a specific computer system -- simulation
5PetaSIM Motivation
- PetaSIM was designed to allow qualitative performance estimates, where in particular the machine design is easy to change
- Applications are to be derived by hand or by automatic generation from Maryland Application Emulators
- Special attention to support of hierarchical memory machines and data-intensive applications
- Supports simulation of pure data-parallel applications and of compositions of linked modules
6Peta-Computing Hierarchy
[Figure: Peta-computing hierarchy. Labels: Full Heterogeneous MetaProblem; Module; Components; Task Parallelism; Aggregate; Data Parallelism; Simulate; Loosely Synchronize Computation; Splitting into Lower Level Memory Hierarchy; PetaSIM; Real Computing]
7Performance Prediction Model
Application Domain
Software / Operating System Domain
Hardware Domain
Multi Domain model
8Three Domain Performance Prediction
- Application Domain
- extracts the data aggregates
- gives abstract data movement and computation behavior
- Software / Operating System Domain
- provides the methods for task processing, memory management, communication and parallel file access
- Hardware Domain
- provides the model of processor and memory components, including cache
9Performance Specification Language
Performance prediction is complex because of
- the many different kinds of applications
- the different kinds of parallel architectures
It is therefore very important to design a general performance specification language (PSL) that represents all the features of the different aspects of the performance prediction process. PetaSIM takes an initial step toward suggesting the characteristics of such a PSL.
10Performance Specification Language
Application Domain
- the size of each data block
- the number of data blocks
- the number of operations on each data block
- the data distribution model
- the data processing sequence / flow of the data blocks -- the application algorithm
11Performance Specification Language
Software / Operating System Domain
- the memory management approach
- the cache management approach
- the parallel task scheduling method
- the parallel file access pattern
- the computation/communication overlap approach
12Performance Specification Language
Hardware Domain
- Computing capability of each processor, which includes the CPU speed and the bandwidth
- memory size and cache size
- architecture of each processing node
- interconnection topology of the parallel machine, which provides the information on communication between the processors
13PetaSIM Estimator and Emulator
[Figure: PetaSIM estimator and emulator diagram. Labels: Applications; Hand-Coded Applications; UMD Emulators; Execution Script; Dataset, Distribution; Nodeset, Linkset; PetaSIM; Performance Estimation]
14Emulators
- Extract the application's computational and data access patterns
- A simplified version of the real application that contains all the necessary communication, computation and I/O characteristics
- Less accurate than the full application, but more robust
- Fast performance prediction for rapid prototyping
15PetaSIM Design
- We define an object structure for the computer (including the network) and the data
- Architecture Description
- nodeset and linkset
- (describe the architecture memory hierarchy)
- Data Description
- dataset and distribution
- Application Description
- execution script
- System / Software Description
16Architecture Description
- A nodeset is a collection of entities with types such as
- memory, with cache and disks
- CPU, where results can be calculated
- pathway, such as a bus, switch or network
- A linkset connects nodesets together in various ways
17Application Description
- An application consists of dataset objects
- dataset implementation is controlled by the distribution objects
- application behavior is represented by the execution script
- a set of command statements
- data movements
- data computation
- synchronization
18Nodeset Object Structure
- Name: one per nodeset object
- type: choose from memory, cache, disk, CPU, pathway
- number: number of members of this nodeset in the architecture
- grainsize: size in bytes of each member of this nodeset (for memory, cache, disk)
- bandwidth: maximum bandwidth allowed in any one member of this nodeset
- floatspeed: CPU's floating-point calculation speed
- calculate(): method used by a CPU nodeset to perform computation
- cacherule: controls persistence of data in a memory or cache
- portcount: number of ports on each member of the nodeset
- portname: ports connected to a linkset
- portlink: name of the linkset connecting to this port
- nodeset_member_list: list of nodeset members in this nodeset (for nodeset member identification)
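The nodeset fields listed above could be held in a small record type. The following is a minimal Python sketch, not PetaSIM's actual data structure: the class name `Nodeset`, the default member names, the units, and the `calculate` cost formula (operation count divided by `floatspeed`) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The five nodeset types named on the slide.
NODESET_TYPES = {"memory", "cache", "disk", "CPU", "pathway"}


@dataclass
class Nodeset:
    """Hypothetical record mirroring the nodeset fields on the slide."""
    name: str
    type: str
    number: int                          # members of this nodeset
    grainsize: Optional[int] = None      # bytes per member (memory, cache, disk)
    bandwidth: Optional[float] = None    # assumed unit: bytes per second
    floatspeed: Optional[float] = None   # assumed unit: flop per second
    cacherule: Optional[str] = None
    portcount: int = 0
    portname: List[str] = field(default_factory=list)
    portlink: List[str] = field(default_factory=list)
    nodeset_member_list: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.type not in NODESET_TYPES:
            raise ValueError(f"unknown nodeset type: {self.type!r}")
        if not self.nodeset_member_list:
            # Assumed naming scheme for member identification.
            self.nodeset_member_list = [
                f"{self.name}[{i}]" for i in range(self.number)
            ]

    def calculate(self, flops: float) -> float:
        """Assumed cost model: seconds = floating-point operations / floatspeed."""
        if self.type != "CPU" or not self.floatspeed:
            raise ValueError("calculate() needs a CPU nodeset with a floatspeed")
        return flops / self.floatspeed
```

For example, `Nodeset(name="cpu", type="CPU", number=4, floatspeed=1e8).calculate(2e8)` gives an estimate of 2.0 seconds under this assumed model.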
19Linkset Object Structure
- Name: one per linkset object
- type: choose from updown, across
- nodesetbegin: name of the initial nodeset joined by this linkset
- nodesetend: name of the final nodeset joined by this linkset
- topology: used for across networks to specify linkage between members of a single nodeset
- duplex: choose from full or half
- number: number of members of this linkset in the architecture
- latency: time to send a zero-length message across any member of the linkset
- bandwidth: maximum bandwidth allowed in any link of this linkset
- send(): method that calculates the cost of sending a message across the linkset
- distribution: name of the geometric distribution controlling this linkset
- linkset_member_list: list of linkset members in this linkset (for linkset member identification)
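The slide defines `latency` as the time for a zero-length message and `bandwidth` as the per-link maximum, but does not give the formula used by `send()`. The sketch below assumes the common linear model (latency plus message size divided by bandwidth); the class name `Linkset`, the units, and the formula itself are illustrative assumptions, not PetaSIM's documented behavior.

```python
from dataclasses import dataclass


@dataclass
class Linkset:
    """Hypothetical record with a subset of the linkset fields on the slide."""
    name: str
    type: str            # "updown" or "across"
    nodesetbegin: str    # name of the initial nodeset
    nodesetend: str      # name of the final nodeset
    number: int          # members of this linkset
    latency: float       # assumed unit: seconds for a zero-length message
    bandwidth: float     # assumed unit: bytes per second
    duplex: str = "full"

    def send(self, nbytes: int) -> float:
        """Assumed linear cost model: latency + nbytes / bandwidth."""
        if nbytes < 0:
            raise ValueError("message size must be non-negative")
        return self.latency + nbytes / self.bandwidth
```

With `latency=1e-6` and `bandwidth=1e8`, a zero-length message costs exactly the latency, and a 1000-byte message costs about 11 microseconds under this model.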
20Dataset Object Structure
- Name: one per dataset object
- type: choose from grid1dim, grid2dim, grid3dim; specifies the type of dataset
- bytesperunit: number of bytes in each unit
- floatsperunit: update cost as a floating-point arithmetic count
- operationsperunit: operations in each unit
- update(): method that updates a given dataset which is contained in a CPU nodeset, with a grainsize controlled by the last memory nodeset visited
- transmit(): method that calculates the cost of transmitting the dataset between memory levels, either communication or movement up and down the hierarchy
- Methods can use other parameters or be custom
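To make the roles of `bytesperunit` and `floatsperunit` concrete, the sketch below estimates an update cost for one grain of data. It is an illustration under stated assumptions: the class name `Dataset`, the method signatures, and both cost formulas are hypothetical, since the slide says only that the methods exist and may be customized.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical record mirroring the dataset fields on the slide."""
    name: str
    type: str               # grid1dim, grid2dim or grid3dim
    bytesperunit: int       # bytes in each unit
    floatsperunit: int      # floating-point operations to update one unit
    operationsperunit: int = 0

    def update(self, grainsize: int, floatspeed: float) -> float:
        """Assumed cost: units held in one grain * floatsperunit / floatspeed."""
        units = grainsize // self.bytesperunit
        return units * self.floatsperunit / floatspeed

    def transmit(self, nbytes: int, latency: float, bandwidth: float) -> float:
        """Assumed linear cost of moving nbytes between two memory levels."""
        return latency + nbytes / bandwidth
```

For instance, a grain of 8000 bytes with 8 bytes per unit holds 1000 units; at 4 floating-point operations per unit and an assumed `floatspeed` of 1e6, the update estimate is about 0.004 seconds.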
21Execution Script
- Currently a few primitives which stress (unlike most languages) movement of data through memory hierarchies
- send DATAFAMILY from MEM-LEVEL-L to MEM-LEVEL-K
- These reference object names for data and memory nodesets
- move DATAFAMILY from MEM-LEVEL-L to MEM-LEVEL-K
- Use distribution DISTRIBUTION from MEM-LEVEL-L to MEM-LEVEL-K
- compute DATAFAMILY-A, DATAFAMILY-B, on MEM-LEVEL-L
- synchronize (synchronizes all processors --- loosely synchronous barrier)
- loop operation
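The primitives above are simple enough to parse line by line. The following Python sketch turns script lines into command records; the exact concrete syntax (case-insensitive keywords, whitespace separation, one statement per line) is an assumption, and the loop primitive is left out because the slide does not show its form.

```python
import re
from typing import Dict, List

# One pattern per primitive shown on the slide (assumed concrete syntax).
PATTERNS = [
    ("send", re.compile(
        r"^send\s+(?P<data>\S+)\s+from\s+(?P<src>\S+)\s+to\s+(?P<dst>\S+)$", re.I)),
    ("move", re.compile(
        r"^move\s+(?P<data>\S+)\s+from\s+(?P<src>\S+)\s+to\s+(?P<dst>\S+)$", re.I)),
    ("use_distribution", re.compile(
        r"^use\s+distribution\s+(?P<dist>\S+)\s+from\s+(?P<src>\S+)\s+to\s+(?P<dst>\S+)$",
        re.I)),
    ("compute", re.compile(
        r"^compute\s+(?P<data>.+?)\s+on\s+(?P<level>\S+)$", re.I)),
    ("synchronize", re.compile(r"^synchronize$", re.I)),
]


def parse_line(line: str) -> Dict[str, object]:
    """Parse one script statement into a command dictionary."""
    text = line.strip()
    for op, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            cmd: Dict[str, object] = {"op": op, **match.groupdict()}
            if op == "compute":
                # "A, B," becomes ["A", "B"]; a trailing comma is tolerated.
                names = str(cmd["data"]).split(",")
                cmd["data"] = [n.strip() for n in names if n.strip()]
            return cmd
    raise ValueError(f"unrecognized statement: {line!r}")


def parse_script(script: str) -> List[Dict[str, object]]:
    """Parse a whole script, skipping blank lines."""
    return [parse_line(l) for l in script.splitlines() if l.strip()]
```

A script such as `move grid from disk to memory` followed by `compute gridA, gridB, on cpu` and `synchronize` yields three records that an estimator could then cost one at a time.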
22PetaSIM Estimation Schedule
- Each nodeset member has its own usage control to record when a dataset arrives and when to send it out to the next nodeset member
- Each linkset member has its own usage control to record at what times the linkset member is free or occupied
- Data-driven model: Ready --> Go (first come, first served)
- Supports both data-parallel mode and individual operation on each nodeset or linkset member
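The usage control described above can be illustrated with a record of when a single linkset member next becomes free. This is a minimal sketch of first-come, first-served scheduling under assumptions: the names `UsageControl` and `schedule_transfers` are hypothetical, every transfer is given the same duration, and ordering by ready time stands in for "first come".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UsageControl:
    """Tracks when one nodeset or linkset member next becomes free."""
    free_at: float = 0.0

    def reserve(self, ready_time: float, duration: float) -> Tuple[float, float]:
        """Start when the data is ready and the member is free; return (start, end)."""
        start = max(ready_time, self.free_at)
        self.free_at = start + duration
        return start, self.free_at


def schedule_transfers(ready_times: List[float], duration: float) -> List[Tuple[float, float]]:
    """Serve transfers over one link member in order of readiness."""
    link = UsageControl()
    return [link.reserve(ready, duration) for ready in sorted(ready_times)]
```

With ready times 0.0, 1.0 and 5.0 and a duration of 2.0, the second transfer waits until the link is free at 2.0, while the third starts immediately at 5.0 because the link has been idle since 4.0.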
23Architecture of PetaSIM
C Simulator
Multi-User Java Server
Standard Java Applet Client
Standard Java Applet Client
24PetaSIM Experiments
Typical SP2 configuration of nodeset and linkset
components
25Pathfinder Performance Estimation Results
26Pathfinder Estimation Results II
27Titan Estimation Results (Fixed)
28VMScope Estimation Results
29PetaSIM Features
- Accurate estimation
- Friendly user interface
- Easy to modify the architecture design
- Easy to monitor the effect of the design change
- Fast Estimation
- Detailed performance estimation
- Provides detailed usage of each individual nodeset and linkset member in the memory hierarchy
30Compare with some other Simulators
- Different simulation approach
- PetaSIM does not actually run the application; it estimates from the execution script (operation abstraction)
- PetaSIM runs on a single processor
- Similar performance estimation results
- PetaSIM can easily deal with different kinds of computer architecture
- PetaSIM can give detailed information on any part of the architecture
31PetaSIM Current Progress Summary
- Architecture Description (nodeset and linkset)
- Application Description (dataset and execution script)
- Link to Application Emulators
- Jacobi hand-written example
- Pathfinder, Titan, VMScope real applications (generated by UMD's Emulators)
- Easily modified architecture and application descriptions
- Fast and relatively accurate performance estimation (PetaSIM running on a single processor)
- Java applet based user interface
- Data-parallel model and individual control
32Possible Future Work
- Richer set of applications using standard benchmarks and DoD MSTAR
- Relate the object model to those used in seamless interfaces / metacomputing, i.e. to efforts to establish a (distributed) object model for computation
- Review the very simple execution script -- should we add more complex primitives or regard application emulators as this complex script?
- Binary format (compiled PetaSIM) of the architecture and application description (ASCII format will make the execution script very large)
- Translation tool from ASCII format to binary format (to retain the friendly user interface)
- Upgrade the performance evaluation model
- Run the performance simulation in parallel (i.e. PetaSIM running on multiple processors)
33PetaSIM Web-Site URL
http://kopernik.npac.syr.edu:4096/petasim/V1.0/PetaSIM.html
-- PetaSIM Java applet front-end user interface and demo
-- Related PetaSIM documents
34Interface of PetaSIM Client