BigSim: A Parallel Simulator for Performance
Prediction of Extremely Large Parallel Machines
  • Gengbin Zheng
  • Gunavardhan Kakulapati
  • Laxmikant V. Kale
  • University of Illinois at Urbana-Champaign

2
Motivations
  • Extremely large parallel machines are around the
    corner
  • Examples:
  • ASCI Purple (12K processors, 100 TF)
  • BlueGene/L (64K processors, 360 TF)
  • BlueGene/C (8M processors, 1 PF)
  • PF machines likely to have 100K processors (1M?)
  • Would existing parallel applications scale?
  • The machines are not available yet
  • Parallel performance is hard to model without
    actually running the program

3
BlueGene/L
4
Roadmap
  • Explore suitable programming models
  • Charm++ (message-driven)
  • MPI and its extension - AMPI (adaptive version of
    MPI)
  • Use a parallel emulator to run applications
  • Coarse-grained simulator for performance
    prediction (not hardware simulation)

5
Charm++: Object-based Programming Model
The user is only concerned with the interaction
between objects
(User View)
6
Charm++ Object-based Programming Model
  • Processor virtualization
  • Divide computation into a large number of pieces
  • Independent of the number of processors
  • Typically larger than the number of processors
  • Let the system map objects to processors

7
Charm++ for Peta-scale Machines
  • Explicit management of resources
  • This data on that processor
  • This work on that processor
  • Objects can migrate
  • Automatic, efficient resource management
  • One-sided communication
  • Asynchronous global operations (reductions, ...)

8
AMPI - MPI processor virtualization
9
Parallel Emulator
  • Actually run a parallel program
  • Emulate full machine on existing parallel
    machines
  • Based on a common low level abstraction (API)
  • Many multiprocessor nodes connected via message
    passing
  • Emulator supports Charm++/AMPI

Gengbin Zheng, Arun Singla, Joshua Unger,
Laxmikant V. Kalé, "A Parallel-Object
Programming Model for PetaFLOPS Machines and Blue
Gene/Cyclops", in NGS Program Workshop, IPDPS 2002
10
Emulation on a Parallel Machine
Emulating 8M threads on 96 ASCI-Red processors
11
Emulator Performance
  • Scalable
  • Emulating a real-world MD application on a 200K
    processor BG machine

Gengbin Zheng, Arun Singla, Joshua Unger,
Laxmikant V. Kalé, "A Parallel-Object
Programming Model for PetaFLOPS Machines and Blue
Gene/Cyclops", in NGS Program Workshop, IPDPS 2002
12
Emulator to Simulator
  • Predicting parallel performance
  • Modeling parallel performance accurately is
    challenging
  • Communication subsystem
  • Behavior of runtime system
  • Size of the machine is big

13
Performance Prediction
  • Parallel Discrete Event Simulation (PDES)
  • Each logical process (LP) has a virtual clock
  • Events are time-stamped
  • The state of an LP changes when an event arrives
    at it
  • Our emulator was extended to carry out PDES

14
Predict Parallel Components
  • How to predict parallel components?
  • Multiple resolution levels
  • Sequential component:
  • User-supplied expression
  • Performance counters
  • Instruction-level simulation
  • Parallel component:
  • Simple latency-based network model
  • Contention-based network simulation

15
Prior PDES Work
  • Conservative vs. optimistic protocols
  • Conservative (examples: DaSSF, MPI-SIM)
  • Ensure safety of processing events in a global
    fashion
  • Typically require look-ahead; high global
    synchronization overhead
  • Optimistic (examples: Time Warp, SPEEDES)
  • Each LP processes the earliest event on its own,
    undoing earlier out-of-order execution when
    causality errors occur
  • Exploits the parallelism of the simulation
    better, and is preferred

16
Why not use existing PDES?
  • Major synchronization overheads
  • Rollback/restart overhead
  • Checkpointing overhead
  • We can do better in the simulation of some
    parallel applications
  • Inherent determinacy of parallel applications
  • Most parallel programs are written to be
    deterministic (e.g., Jacobi)

17
Timestamp Correction
  • Messages should be executed in the order of their
    timestamps
  • Causality errors occur due to out-of-order
    message delivery
  • Rollback and checkpointing are necessary in
    traditional methods
  • Inherent determinacy is hidden in applications
  • Need to capture event dependencies
  • Run-time detection
  • Use the Structured Dagger language to express
    dependencies

18
Simulation of Different Applications
  • Linear-order applications
  • No wildcard MPI receives
  • Strong determinacy; no timestamp correction
    necessary
  • Reactive applications (atomic)
  • Message-driven objects
  • Methods execute as the corresponding messages
    arrive
  • Multi-dependent applications
  • MPI_Irecv with MPI_Waitall (MPI)
  • Use of Structured Dagger to capture dependencies
    (Charm++)

19
Structured-Dagger
  entry void jacobiLifeCycle() {
    for (i = 0; i < MAX_ITER; i++) {
      atomic { sendStripToLeftAndRight(); }
      overlap {
        when getStripFromLeft(Msg leftMsg)
          atomic { copyStripFromLeft(leftMsg); }
        when getStripFromRight(Msg rightMsg)
          atomic { copyStripFromRight(rightMsg); }
      }
      atomic { doWork(); /* Jacobi Relaxation */ }
    }
  }

20
Time-stamping Messages
(Figure: LP virtual timer curT)
21
Timestamp Correction
22
Architecture of BigSim Simulator
(Diagram components: Performance visualization
(Projections); Simulation output trace logs; Online
PDES engine; Charm++ Runtime; Instruction Sim (RSim,
IBM, ..); Simple Network Model; Performance counters;
Load Balancing Module; BigSim Emulator; Charm++ and
MPI applications)
23
Architecture of BigSim Simulator (with Network
Simulator)
(Diagram components: Performance visualization
(Projections); Network Simulator; Offline PDES;
BigNetSim (POSE); Simulation output trace logs;
Online PDES engine; Charm++ Runtime; Instruction Sim
(RSim, IBM, ..); Simple Network Model; Performance
counters; Load Balancing Module; BigSim Emulator;
Charm++ and MPI applications)
24
Big Network Simulation
  • Simulate network behavior: packetization,
    routing, contention, etc.
  • Incorporated with post-mortem timestamp
    correction via POSE
  • Switches are connected in a torus network

25
BigSim Validation on Lemieux
32 real processors
26
Jacobi on a 64K BG/L
27
Case Study - LeanMD
  • Molecular dynamics simulation designed for large
    machines
  • K-away cut-off parallelization
  • Benchmark er-gre with 3-away
  • 36,573 atoms
  • 1.6 million objects
  • 8-step simulation
  • Simulated 32K-processor BG machine
  • Running on 400 PSC Lemieux processors

Performance visualization tools
28
Load Imbalance
Histogram
29
Performance of BigSim
Real processors (PSC Lemieux)
30
Conclusions
  • Improved simulation efficiency by taking
    advantage of the inherent determinacy of parallel
    applications
  • The simulation techniques explored show good
    parallel scalability
  • http://charm.cs.uiuc.edu

31
Future Work
  • Improving simulation accuracy
  • Instruction-level simulator
  • Network simulator
  • Developing run-time techniques (e.g., load
    balancing) for very large machines using the
    simulator