A Performance Estimator for Parallel Hierarchical Memory Systems -- PetaSIM

Learn more at: http://www.new-npac.org

1
A Performance Estimator for Parallel Hierarchical
Memory Systems -- PetaSIM
  • Yuhong Wen and Geoffrey C. Fox
  • Northeast Parallel Architecture Center (NPAC)
  • Syracuse University
  • {wen,gcf}@npac.syr.edu

2
Outline
  • Performance Estimation Process
  • PetaSIM Motivation and Idea
  • Helping to design computer architectures and
    applications
  • User-friendly Java applet interface
  • Performance Specification Language Features
  • Design and Implementation of PetaSIM
  • Experiment Results

3
Why Performance Prediction?
  • Application complexities
  • A lot of processors required
  • Large amount of data involved
  • Time-consuming processing
  • Performance prediction speeds up
  • the design of new parallel computer architectures
  • the design of application models

4
Performance Prediction Approaches
  • Concept design level performance prediction
  • aims to provide a quick, roughly correct
    performance prediction at the early stage of
    model design
  • PetaSIM works at this level
  • Detailed performance prediction
  • provides detailed information about a given
    application running on a specific computer system
    -- Simulation

5
PetaSIM Motivation
  • PetaSIM was designed to allow qualitative
    performance estimates in a setting where the
    machine design is particularly easy to change
  • Application descriptions are derived by hand or
    generated automatically from the Maryland
    Application Emulators
  • Special attention to support of hierarchical
    memory machines and data-intensive applications
  • Supports simulation of pure data-parallel
    applications and of compositions of linked modules

6
Peta-Computing Hierarchy
[Figure labels: Full Heterogeneous MetaProblem; Module; Components; Task Parallelism; Aggregate; Data Parallelism; Simulate; Loosely Synchronize Computation; Splitting into Lower Level Memory Hierarchy; PetaSIM; Real Computing]

7
Performance Prediction Model
[Figure: multi-domain model with three domains: Application Domain; Software / Operating System Domain; Hardware Domain]

8
Three Domain Performance Prediction
  • Application Domain
  • extracts the data aggregates
  • gives an abstract description of data movement
    and computation behavior
  • Software / Operating System Domain
  • provides the methods for task processing, memory
    management, communication and parallel file
    access
  • Hardware Domain
  • provides the model of processor and memory
    components, including cache

9
Performance Specification Language
Performance prediction is complex because of
  • the many different kinds of applications
  • the different kinds of parallel architectures

It is therefore very important to design a general
performance specification language (PSL) that
represents all the features of the different
aspects of the performance process. PetaSIM
is an initial step that suggests the
characteristics of such a PSL.

10
Performance Specification Language
Application Domain
  • the size of each data block
  • the number of data blocks
  • the number of data operations in each data block
  • the data distribution model
  • the data processing sequence / flow of the data
    blocks -- the application algorithm

11
Performance Specification Language
Software / Operating System Domain
  • the memory management approach
  • the cache management approach
  • the parallel task scheduling method
  • the parallel file access pattern
  • the approach to overlapping computation and
    communication

12
Performance Specification Language
Hardware Domain
  • computing capability of each processor, which
    includes the CPU speed and the bandwidth
  • memory size and cache size
  • architecture of each processing node
  • interconnection topology of the parallel
    machine, which provides the information about
    communication between the processors

13
PetaSIM Estimator Emulator
[Figure labels: Applications; Hand Code Applications; UMD Emulators; Execution Script; Dataset and Distribution; Nodeset and Linkset; PetaSIM; Performance Estimation]

14
Emulators
  • Extract the application's computational and data
    access patterns
  • A simplified version of the real application
    that contains all the necessary communication,
    computation and I/O characteristics
  • Less accurate than the full application, but more
    robust
  • Fast performance prediction for rapid prototyping

15
PetaSIM Design
  • We define an object structure for the computer
    (including network) and the data
  • Architecture Description
  • nodeset and linkset
  • (describe the memory hierarchy of the architecture)
  • Data Description
  • dataset and distribution
  • Application Description
  • execution script
  • System / Software Description

16
Architecture Description
  • A nodeset is a collection of entities with types
    such as
  • memory (including cache and disks)
  • CPU, where results can be calculated
  • pathway, such as a bus, switch or network
  • A linkset connects nodesets together in various
    ways

17
Application Description
  • An application consists of dataset objects
  • Dataset implementation is controlled by the
    distribution objects
  • Application behavior is represented by the
    execution script
  • a set of command statements
  • data movements
  • data computation
  • synchronization

18
Nodeset Object Structure
  • Name: one per nodeset object
  • type: one of memory, cache, disk, CPU,
    pathway
  • number: number of members of this nodeset in the
    architecture
  • grainsize: size in bytes of each member of this
    nodeset (for memory, cache, disk)
  • bandwidth: maximum bandwidth allowed in any one
    member of this nodeset
  • floatspeed: the CPU's floating-point calculation speed
  • calculate(): method used by a CPU nodeset to
    perform computation
  • cacherule: controls persistence of data in a
    memory or cache
  • portcount: number of ports on each member of the
    nodeset
  • portname: ports connected to a linkset
  • portlink: name of the linkset connected to this
    port
  • nodeset_member_list: list of nodeset members in
    this nodeset (for nodeset member identification)
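
The nodeset fields listed above can be sketched as a small Python record. This is only an illustration, not PetaSIM's actual implementation (the slides describe a C simulator): the units for `bandwidth` and `floatspeed`, the `ports` dictionary that combines `portname` and `portlink`, and the cost formula inside `calculate()` are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Type names taken from the slide.
NODESET_TYPES = {"memory", "cache", "disk", "CPU", "pathway"}


@dataclass
class Nodeset:
    name: str
    type: str                # one of NODESET_TYPES
    number: int              # members of this nodeset in the architecture
    grainsize: int = 0       # bytes per member (memory, cache, disk)
    bandwidth: float = 0.0   # assumed unit: bytes per second
    floatspeed: float = 0.0  # assumed unit: floating-point operations per second
    cacherule: str = ""      # possible values are not given in the slides
    ports: Dict[str, str] = field(default_factory=dict)  # portname -> portlink
    nodeset_member_list: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.type not in NODESET_TYPES:
            raise ValueError(f"unknown nodeset type: {self.type}")
        if not self.nodeset_member_list:
            # Hypothetical naming scheme for member identification.
            self.nodeset_member_list = [
                f"{self.name}[{i}]" for i in range(self.number)
            ]

    @property
    def portcount(self) -> int:
        return len(self.ports)

    def calculate(self, float_ops: float) -> float:
        """Assumed cost model: seconds = operations / floatspeed."""
        if self.type != "CPU":
            raise TypeError("calculate() applies to CPU nodesets only")
        return float_ops / self.floatspeed


cpu = Nodeset("cpu", "CPU", number=4, floatspeed=1e8,
              ports={"up": "cpu_to_mem"})
seconds = cpu.calculate(2e8)  # time for one member to do 2e8 operations
```

Here `cpu_to_mem` is a made-up linkset name; in a full description it would match the name of a linkset object.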

19
Linkset Object Structure
  • Name: one per linkset object
  • type: one of updown, across
  • nodesetbegin: name of the initial nodeset joined by
    this linkset
  • nodesetend: name of the final nodeset joined by this
    linkset
  • topology: used for across networks to specify
    linkage between members of a single nodeset
  • duplex: one of full or half
  • number: number of members of this linkset in the
    architecture
  • latency: time to send a zero-length message across
    any member of the linkset
  • bandwidth: maximum bandwidth allowed in any link
    of this linkset
  • send(): method that calculates the cost of sending a
    message across the linkset
  • distribution: name of the geometric distribution
    controlling this linkset
  • linkset_member_list: list of linkset members in
    this linkset (for linkset member identification)
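
As an illustration of how `latency` and `bandwidth` could combine inside `send()`, the sketch below uses the common linear cost model, time = latency + bytes / bandwidth. The slides do not state the formula PetaSIM uses, so this model, the units, and the example names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Linkset:
    name: str
    type: str          # "updown" or "across"
    nodesetbegin: str  # name of the initial nodeset
    nodesetend: str    # name of the final nodeset
    number: int        # members of this linkset in the architecture
    latency: float     # seconds to send a zero-length message
    bandwidth: float   # assumed unit: bytes per second
    duplex: str = "full"

    def send(self, nbytes: int) -> float:
        """Assumed linear cost model for one message on one link member."""
        return self.latency + nbytes / self.bandwidth


# Hypothetical link between a memory nodeset and a disk nodeset.
link = Linkset("mem_to_disk", "updown", "memory", "disk",
               number=4, latency=0.001, bandwidth=1_000_000.0)
cost = link.send(500_000)  # latency plus half a second of transfer time
```

With this model a zero-length message costs exactly the latency, which matches the slide's definition of `latency`.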

20
Dataset Object Structure
  • Name: one per dataset object
  • type: one of grid1dim, grid2dim, grid3dim;
    specifies the type of dataset
  • bytesperunit: number of bytes in each unit
  • floatsperunit: update cost as a floating-point
    arithmetic count
  • operationsperunit: operations in each unit
  • update(): method that updates a given dataset, which
    is contained in a CPU nodeset, with a grainsize
    controlled by the last memory nodeset visited
  • transmit(): method that calculates the cost of
    transmitting the dataset between memory levels,
    either as communication or as movement up and down
    the hierarchy
  • Methods can use other parameters or be custom
21
Execution Script
  • Currently a few primitives which stress (unlike
    most languages) the movement of data through
    memory hierarchies
  • send DATAFAMILY from MEM-LEVEL-L to MEM-LEVEL-K
  • These reference the object names of data and
    memory nodesets
  • move DATAFAMILY from MEM-LEVEL-L to MEM-LEVEL-K
  • use distribution DISTRIBUTION from MEM-LEVEL-L to
    MEM-LEVEL-K
  • compute DATAFAMILY-A, DATAFAMILY-B, on
    MEM-LEVEL-L
  • synchronize (synchronizes all processors;
    a loosely synchronous barrier)
  • loop operation
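
To make the primitives above concrete, here is a minimal sketch of a parser for script lines of the forms shown on the slide. The regular expressions, the tuple output format, and the sample names (`grid`, `halo`, `disk`, `memory`, `cpu`) are hypothetical; the loop primitive is left out because the slide does not give its syntax.

```python
import re

# One pattern per primitive, following the forms shown on the slide.
PATTERNS = [
    ("send", re.compile(r"^send\s+(\S+)\s+from\s+(\S+)\s+to\s+(\S+)$", re.I)),
    ("move", re.compile(r"^move\s+(\S+)\s+from\s+(\S+)\s+to\s+(\S+)$", re.I)),
    ("use_distribution",
     re.compile(r"^use\s+distribution\s+(\S+)\s+from\s+(\S+)\s+to\s+(\S+)$",
                re.I)),
    ("compute", re.compile(r"^compute\s+(.+?),?\s+on\s+(\S+)$", re.I)),
    ("synchronize", re.compile(r"^synchronize$", re.I)),
]


def parse_line(line: str) -> tuple:
    """Turn one script line into a tuple (operation, arguments...)."""
    text = line.strip()
    for op, pattern in PATTERNS:
        match = pattern.match(text)
        if match is None:
            continue
        if op == "compute":
            # Split the comma-separated data families into a list.
            families = [f.strip() for f in match.group(1).split(",")
                        if f.strip()]
            return (op, families, match.group(2))
        return (op, *match.groups())
    raise ValueError(f"unrecognized script line: {line!r}")


script = """
move grid from disk to memory
send grid from memory to cpu
compute grid, halo, on cpu
synchronize
"""
parsed = [parse_line(line) for line in script.splitlines() if line.strip()]
```

Each parsed tuple names only objects (datasets, nodesets, distributions), which mirrors the slide's point that the script references object names rather than carrying the cost details itself.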

22
PetaSIM Estimation Schedule
  • Each nodeset member has its own usage control,
    which records when a dataset arrives and when it
    is sent on to the next nodeset member
  • Each linkset member has its own usage control,
    which records at what times the linkset member is
    free or occupied
  • Data-driven model: Ready --> Go (first come,
    first served)
  • Supports both data-parallel mode and individual
    operation on each nodeset or linkset member
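
The usage-control idea above can be sketched as a per-member record of when the member next becomes free, with requests served first come, first served. The class name, the `busy_time` counter, and the request format are assumptions made for this illustration.

```python
class UsageControl:
    """Tracks when one nodeset or linkset member next becomes free."""

    def __init__(self) -> None:
        self.free_at = 0.0    # time at which the member is next free
        self.busy_time = 0.0  # total occupied time, for usage reports

    def reserve(self, ready_at: float, duration: float) -> tuple:
        # Data-driven rule: start as soon as the data is ready and the
        # member is free, whichever comes later.
        start = max(ready_at, self.free_at)
        finish = start + duration
        self.free_at = finish
        self.busy_time += duration
        return start, finish


def schedule(requests, member: UsageControl) -> dict:
    """Serve (ready_at, duration, label) requests in order of readiness."""
    timeline = {}
    for ready_at, duration, label in sorted(requests):
        timeline[label] = member.reserve(ready_at, duration)
    return timeline


link = UsageControl()
timeline = schedule(
    [(0.0, 2.0, "a"), (1.0, 1.0, "b"), (5.0, 1.0, "c")], link
)
# "b" is ready at 1.0 but waits until "a" releases the link at 2.0;
# "c" arrives after the link is idle and starts immediately at 5.0.
```

Keeping `busy_time` per member is one way to obtain the per-member usage detail that the deck lists among PetaSIM's features.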

23
Architecture of PetaSIM
[Figure components: C Simulator; Multi-User Java Server; two Standard Java Applet Clients]

24
PetaSIM Experiments
Typical SP2 configuration of nodeset and linkset
components
25
Pathfinder Performance Estimation Results
26
Pathfinder Estimation Results II
27
Titan Estimation Results (Fixed)
28
VMScope Estimation Results
29
PetaSIM Features
  • Accurate estimation
  • Friendly user interface
  • Easy to modify the architecture design
  • Easy to monitor the effect of the design change
  • Fast Estimation
  • Detailed performance estimation
  • Provides detailed usage of each individual nodeset
    and linkset member in the memory hierarchy

30
Compare with some other Simulators
  • Different simulation approach
  • PetaSIM does not actually run the application; it
    estimates from the execution script (an
    abstraction of the operations)
  • PetaSIM runs on a single processor
  • Similar performance estimation results
  • PetaSIM can easily deal with different kinds of
    computer architecture
  • PetaSIM can provide detailed information about any
    part of the architecture

31
PetaSIM Current Progress Summary
  • Architecture Description (nodeset and linkset)
  • Application Description (dataset and execution
    script)
  • Link to Application Emulators
  • Jacobi hand-written example
  • Pathfinder, Titan, VMScope real applications
    (generated by UMD's Emulator)
  • Easily modified architecture and application
    descriptions
  • Fast and relatively accurate performance
    estimation (PetaSIM running on a single processor)
  • Java applet based user interface
  • Data-parallel model and individual control

32
Possible Future Work
  • Richer set of applications using standard
    benchmarks and DoD MSTAR
  • Relate object model to those used in seamless
    interfaces / metacomputing i.e. to efforts to
    establish (distributed) object model for
    computation
  • Review the very simple execution script -- should
    we add more complex primitives, or regard the
    application emulators as this complex script?
  • Binary format (compiled PetaSIM) for the
    architecture and application descriptions (ASCII
    format will make the execution script very large)
  • Translation tool from ASCII format to binary
    format (to retain the friendly user interface)
  • Upgrade performance evaluation model
  • Run performance simulation in parallel (i.e.
    PetaSIM running on multi-processors)

33
PetaSIM Web-Site URL
http://kopernik.npac.syr.edu:4096/petasim/V1.0/PetaSIM.html
  • PetaSIM Java applet front-end user interface and demo
  • Related PetaSIM documents

34
Interface of PetaSIM Client