Title: Petascale Programming with Virtual Processors: Charm++, AMPI, and Domain-Specific Frameworks
1. Petascale Programming with Virtual Processors: Charm++, AMPI, and Domain-Specific Frameworks
- Laxmikant Kale
- http://charm.cs.uiuc.edu
- Parallel Programming Laboratory
- Dept. of Computer Science
- University of Illinois at Urbana-Champaign
2. Outline
- Challenges and opportunities: character of the new machines
- Charm++ and AMPI
- Basics
- Capabilities
- Programming techniques
- Dice them fine
- VPs to the rescue
- Juggling for overlap
- Load balancing
- Scenarios and strategies
- Case studies
- Classical molecular dynamics
- Car-Parrinello ab initio MD / quantum chemistry
- Rocket simulation
- Raising the level of abstraction
- Higher-level, compiler-supported notations
- Domain-specific frameworks
- Example: unstructured-mesh (FEM) framework
3. Machines: Current, Planned, and Future
- Current
- Lemieux: 3000 processors, 750 nodes, full-bandwidth fat-tree network
- ASCI Q: similar architecture
- System X: Infiniband
- Tungsten: Myrinet
- Thunder
- Earth Simulator
- Planned
- IBM's Blue Gene/L: 65k nodes, 3D-torus topology
- Red Storm (10k procs)
- Future?
- BG/L is an example
- 1M processors!
- 0.5 MB per processor
- HPCS: 3 architectural plans
4. Some Trends: Communication
- Bisection bandwidth
- Can't scale as well with the number of processors without being expensive
- Wire-length delays
- Even on Lemieux, messages going through the highest-level switches take longer
- Two possibilities:
- Grid topologies, with near-neighbor connections
- High link speed, low bisection bandwidth
- Expensive, full-bandwidth networks
5. Trends: Memory
- Memory latencies are 100 times slower than the processor!
- This will get worse
- A solution: put more processors in
- To increase bandwidth between processors and memory
- On-chip DRAM
- In other words, a low memory-to-processor ratio
- But this can be handled with programming style
- Application viewpoint, for physical modeling:
- Given a fixed amount of run-time (4 hours or 10 days)
- Doubling spatial resolution increases CPU needs more than 2-fold (smaller time-steps)
6. Application Complexity Is Increasing
- Why?
- With more FLOPS, we need better algorithms
- It is not enough to just do more of the same
- Example: dendritic growth in materials
- Better algorithms lead to complex structure
- Example: gravitational force calculation
- Direct all-pairs: O(N²), but easy to parallelize
- Barnes-Hut: O(N log N), but more complex
- Multiple modules, dual time-stepping
- Adaptive and dynamic refinements
- Ambitious projects
- Projects with new objectives lead to dynamic behavior and multiple components
7. Specific Programming Challenges
- Explicit management of resources:
- This data on that processor
- This work on that processor
- Analogy: memory management
- We declare arrays, and malloc dynamic memory chunks as needed
- We do not specify memory addresses
- As usual, indirection is the key
- Programmer:
- This data, partitioned into these pieces
- This work, divided that way
- System: map data and work to processors
8. Virtualization: Object-Based Parallelization
- Idea: divide the computation into a large number of objects
- Let the system map objects to processors
- The user is only concerned with the interaction between objects

(Figure: the user's view of interacting objects vs. the system's implementation on processors)
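The division of labor above can be sketched in a few lines. This is an illustrative stand-in, not the Charm++ API: the user addresses objects only by index, while a runtime-owned mapping (here, a simple block distribution) decides which processor each object lives on.

```cpp
#include <vector>

// Hypothetical sketch of the "system view" of virtualization: the user
// creates many objects; the runtime, not the user, decides placement.
struct Runtime {
    int numPEs;
    explicit Runtime(int pes) : numPEs(pes) {}
    // Default mapping: block distribution of object indices onto PEs.
    // The user never calls this directly; messages are addressed to
    // object indices, and the runtime routes them.
    int homePE(int objIndex, int numObjects) const {
        int perPE = (numObjects + numPEs - 1) / numPEs;  // ceiling division
        return objIndex / perPE;
    }
};
```

Because only the runtime knows the index-to-processor map, it is free to change it (for load balancing or shrink/expand) without touching user code.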
9. Virtualization: Charm++ and AMPI
- These systems seek an optimal division of labor between the system and the programmer
- Decomposition is done by the programmer
- Everything else is automated

(Figure: a spectrum of abstraction from MPI through Charm++ to HPF, spanning decomposition, mapping, scheduling, expression, and specialization)
10. Charm++ and Adaptive MPI
- Charm++: parallel C++
- Asynchronous methods
- Object arrays
- In development for over a decade
- Basis of several parallel applications
- Runs on all popular parallel machines and clusters
- AMPI: a migration path for legacy MPI codes
- Gives them the dynamic load balancing capabilities of Charm++
- Uses Charm++ object arrays
- Minimal modifications to convert existing MPI programs
- Automated via AMPIzer
- Collaboration with David Padua
- Bindings for C, C++, and Fortran90
Both available from http://charm.cs.uiuc.edu
11. Applications (figure)
(Figure: parallel objects, adaptive runtime system, and libraries and tools at the center, surrounded by applications: protein folding, molecular dynamics, quantum chemistry (QM/MM), computational cosmology, crack propagation, space-time meshes, dendritic growth, rocket simulation)
The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE
12. Message From This Talk
- Virtualization and the associated techniques that we have been exploring for the past decade are ready and powerful enough to meet the needs of high-end parallel computing and of complex, dynamic applications
- These techniques are embodied in:
- Charm++
- AMPI
- Frameworks (structured grids, unstructured grids, particles)
- Virtualization of other coordination languages (UPC, GA, ...)
13. Acknowledgements
- Graduate students, including:
- Gengbin Zheng
- Orion Lawlor
- Milind Bhandarkar
- Terry Wilmarth
- Sameer Kumar
- Jay deSouza
- Chao Huang
- Chee Wai Lee
- Recent funding:
- NSF (NGS: Frederica Darema)
- DOE (ASCI Rocket Center)
- NIH (Molecular Dynamics)
14. Charm++ Object Arrays
- A collection of data-driven objects (aka chares)
- With a single global name for the collection
- Each member addressed by an index
- Mapping of element objects to processors is handled by the system

User's view: A[0], A[1], A[2], A[3], ...
15. Charm++ Object Arrays (continued)
- A collection of chares
- With a single global name for the collection
- Each member addressed by an index
- Mapping of element objects to processors is handled by the system

User's view: A[0], A[1], A[2], A[3], ...
System view: only some elements (e.g., A[3] and A[0]) live on any given processor
16. Chare Arrays
- Elements are data-driven objects
- Elements are indexed by a user-defined data type: sparse 1D, 2D, 3D, tree, ...
- Send messages to an index; receive messages at an element. Reductions and broadcasts across the array
- Dynamic insertion, deletion, migration: and everything still has to work!
17. Charm++ Remote Method Calls
- To call a method on a remote C++ object foo, use the local "proxy" C++ object CProxy_foo generated from the interface file:

    f[i].bar(x);    // i: array index; bar(x): method and parameters

- This results in a network message, and eventually in a call to the real object's method:

    // In another .C file
    void foo::bar(int x) { ... }
18. Charm++ Startup Process: Main

Interface (.ci) file:

    module myModule {
      array [1D] foo {
        entry foo(int problemNo);
        entry void bar(int x);
      };
      mainchare myMain {            // special startup object
        entry myMain(int argc, char **argv);
      };
    };

In a .C file:

    #include "myModule.decl.h"
    class myMain : public CBase_myMain {   // generated base class
     public:
      myMain(int argc, char **argv) {      // called at startup
        int nElements = 7, i = nElements / 2;
        CProxy_foo f = CProxy_foo::ckNew(2, nElements);
        f[i].bar(3);
      }
    };
    #include "myModule.def.h"
19. Other Features
- Broadcasts and reductions
- Runtime creation and deletion
- nD and sparse array indexing
- Library support (modules)
- Groups: per-processor objects
- Node groups: per-node objects
- Priorities: control ordering
20. AMPI: Adaptive MPI
- MPI interface, for C and Fortran, implemented on Charm++
- Multiple virtual processors per physical processor
- Implemented as user-level threads
- Very fast context switching: ~1 µs
- E.g., MPI_Recv blocks only the virtual processor, not the physical one
- Supports migration (and hence load balancing) via extensions to MPI
21. AMPI (figure)
22. AMPI
- Implemented as virtual processors (user-level migratable threads)
23. How to Write an AMPI Program
- Write your normal MPI program, and then link and run it with Charm++
- Compile and link with charmc:
- charmc -o hello hello.c -language ampi
- charmc -o hello2 hello.f90 -language ampif
- Run with charmrun:
- charmrun hello
24. How to Run an AMPI Program
- charmrun
- A portable parallel job execution script
- Specify the number of physical processors: +pN
- Specify the number of virtual MPI processes: +vpN
- Special "nodelist" file for net- versions
25. AMPI: MPI Extensions
- Process Migration
- Asynchronous Collectives
- Checkpoint/Restart
26. How to Migrate a Virtual Processor?
- Move all application state to the new processor
- Stack data
- Subroutine variables and calls
- Managed by the compiler
- Heap data
- Allocated with malloc/free
- Managed by the user
- Global variables
27. Stack Data
- The stack is used by the compiler to track function calls and provide temporary storage
- Local variables
- Subroutine parameters
- C "alloca" storage
- Most of the variables in a typical application are stack data
28. Migrate Stack Data
- Without compiler support, we cannot change a stack's address
- Because we can't change the stack's interior pointers (return frame pointer, function arguments, etc.)
- Solution: "isomalloc" addresses
- Reserve address space on every processor for every thread stack
- Use mmap to scatter the stacks in virtual memory efficiently
- The idea comes from PM2
29. Migrate Stack Data (figure)
(Figure: memory layouts of processor A and processor B, from 0x00000000 to 0xFFFFFFFF: code, globals, and heap at the bottom, with thread stacks 1-4 in identically reserved address ranges above; thread 3's stack migrates from A to B without relocation)
31. Migrate Stack Data
- Isomalloc is a completely automatic solution
- No changes needed in the application or compilers
- Just like a software shared-memory system, but with proactive paging
- But it has a few limitations
- Depends on having large quantities of virtual address space (best on 64-bit)
- 32-bit machines can only have a few GB of isomalloc stacks across the whole machine
- Depends on unportable mmap
- Which addresses are safe? (We must guess!)
- What about Windows? Blue Gene?
32. Heap Data
- Heap data is any dynamically allocated data
- C: malloc and free
- C++: new and delete
- F90: ALLOCATE and DEALLOCATE
- Arrays and linked data structures are almost always heap data
33. Migrate Heap Data
- Automatic solution: isomalloc all heap data, just like stacks!
- "-memory isomalloc" link option
- Overrides malloc/free
- No new application code needed
- Same limitations as isomalloc stacks
- Manual solution: the application moves its own heap data
- It needs to be able to size the message buffer, pack the data into the message, and unpack it on the other side
- The "pup" abstraction does all three
34. Comparison with Native MPI
- Performance
- Slightly worse without optimization
- Being improved
- Flexibility
- Works when only a small number of PEs is available
- Or when the algorithm has special requirements
Problem setup: 3D stencil calculation of size 240³, run on Lemieux. AMPI runs on any number of PEs (e.g., 19, 33, 105). Native MPI needs a cube number of PEs.
35. Benefits of Virtualization
- Software engineering
- The number of virtual processors can be independently controlled
- Separate VPs for different modules
- Message-driven execution
- Adaptive overlap of communication
- Modularity
- Predictability
- Automatic out-of-core
- Asynchronous reductions
- Dynamic mapping
- Heterogeneous clusters
- Vacate, adjust to speed, share
- Automatic checkpointing
- Change the set of processors used
- Principle of persistence
- Enables runtime optimizations
- Automatic dynamic load balancing
- Communication optimizations
- Other runtime optimizations
More info: http://charm.cs.uiuc.edu
36. Data-Driven Execution
(Figure: each processor runs a scheduler that picks the next message from its message queue and delivers it to the target object)
37. Adaptive Overlap of Communication
- With virtualization, you get data-driven execution
- There are multiple entities (objects, threads) on each processor
- No single object or thread holds up the processor
- Each one is continued when its data arrives
- No need to guess which is likely to arrive first
- So it achieves automatic and adaptive overlap of computation and communication
- This kind of data-driven idea can be used in MPI as well, using wild-card receives
- But as the program gets more complex, it gets harder to keep track of all pending communication in all the places that are doing a receive
38. Why Message-Driven Modules?
SPMD and Message-Driven Modules (from A. Gursoy, "Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance", Ph.D. thesis, Apr 1994)
39. Checkpoint/Restart
- Any long-running application must be able to save its state
- When you checkpoint an application, it uses the pup routine to store the state of all objects
- State information is saved in a directory of your choosing
- Restore also uses pup, so no additional application code is needed (pup is all you need)
40. Checkpointing a Job
- In AMPI, use MPI_Checkpoint(<dir>)
- Collective call; returns when the checkpoint is complete
- In Charm++, use CkCheckpoint(<dir>, <resume>)
- Called on one processor; calls <resume> when the checkpoint is complete
- Restarting:
- The charmrun option +restart <dir> is used to restart
- The number of processors need not be the same
41. AMPI's Collective Communication Support
- Communication operations in which all, or a large subset of, processors participate
- For example: broadcast
- Often a performance impediment
- All-to-all communication
- All-to-all personalized communication (AAPC)
- All-to-all multicast (AAM)
42. Communication Optimization
- Organize the processors in a 2D (virtual) mesh
- A message from (x1,y1) to (x2,y2) goes via (x1,y2)
- 2(√P − 1) messages per processor instead of P − 1
- But each byte travels twice on the network
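The routing rule and the message-count trade-off above can be written down directly. A small sketch (names are illustrative): each message is forwarded through the intermediate processor that shares the sender's row and the receiver's column, so per-processor message count drops from P − 1 to 2(√P − 1), at the cost of each byte crossing the network twice.

```cpp
// Sketch of the 2D virtual-mesh all-to-all strategy for a P = n*n mesh.
struct Coord { int x, y; };

// First hop: keep the sender's row coordinate, adopt the receiver's
// column coordinate; the intermediate then forwards to the destination.
Coord intermediate(Coord src, Coord dst) {
    return {src.x, dst.y};
}

// Messages each processor sends under the two schemes.
int directMessages(int n) { return n * n - 1; }   // one per destination
int meshMessages(int n)   { return 2 * (n - 1); } // one per row + column peer
```

For a 64-processor machine (n = 8) this is 14 messages per processor instead of 63, which is why the strategy wins when per-message overhead dominates.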
43. Performance Benchmark
A mystery?
44. CPU Time vs. Elapsed Time
- Time breakdown of an all-to-all operation using the mesh library
- Computation is only a small proportion of the elapsed time
- A number of optimization techniques were developed to improve collective communication performance
45. Asynchronous Collectives
- Time breakdown of a 2D FFT benchmark (ms)
- VPs implemented as threads
- Computation overlaps with the waiting time of collective operations
- Total completion time is reduced
46. Shrink/Expand
- Problem: the availability of the computing platform may change
- Fit the application onto the platform by object migration
Time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600
47. Projections: Performance Analysis Tool
48. Projections
- Projections is designed for use with a virtualized model like Charm++ or AMPI
- Instrumentation is built into the runtime system
- Post-mortem tool with highly detailed traces as well as summary formats
- Java-based visualization tool for presenting performance information
49. Trace Generation (Detailed)
- Link-time option: -tracemode projections
- In the log mode, each event is recorded in full detail (including timestamp) in an internal buffer
- The memory footprint is controlled by limiting the number of log entries
- I/O perturbation can be reduced by increasing the number of log entries
- Generates a <name>.<pe>.log file for each processor and a <name>.sts file for the entire application
- Commonly used run-time options:
- +traceroot DIR
- +logsize NUM
50. Visualization: Main Window
51. Post-Mortem Analysis Views
- Utilization graph
- Mainly useful as processor utilization against time, and time spent on specific parallel methods
- Profile: stacked graphs
- For a given period, a breakdown of the time on each processor
- Includes idle time, and message sending and receiving times
- Timeline
- upshot-like, but more detailed
- Pop-up views of method execution, message arrows, user-level events
53. Projections Views, Continued
- Histogram of method execution times
- How many method-execution instances had a time of 0-1 ms? 1-2 ms? ...
54. Projections Views, Continued
- Overview
- A fast utilization chart for the entire machine across the entire time period
56. Projections: Conclusions
- Instrumentation built into the runtime
- Easy to include in a Charm++ or AMPI program
- Working on:
- Automated analysis
- Scaling to tens of thousands of processors
- Integration with hardware performance counters
57. Multi-Run Analysis (in progress)
- Collect performance data from different runs
- On varying numbers of processors
- See which functions increase in computation time
- Algorithmic overhead
- See how the communication costs scale up
- Per processor, and in total
58. Load Balancing
59. Load Balancing Scenarios
- Dynamic creation of tasks
- Initial vs. continuous
- Coarse-grained vs. fine-grained tasks
- Master-slave
- Tree-structured
- Use seed balancers in Charm++/AMPI
- Iterative computations
- When there is a strong correlation across iterations: measurement-based load balancers
- When the correlation is weak
- When there is no correlation: use the seed balancer
60. Measurement-Based Load Balancing
- Principle of persistence
- Object communication patterns and computational loads tend to persist over time
- In spite of dynamic behavior:
- Abrupt but infrequent changes
- Slow and small changes
- Runtime instrumentation
- Measures communication volume and computation time
- Measurement-based load balancers
- Use the instrumented database periodically to make new decisions
- Many alternative strategies can use the database
61. Periodic Load-Balancing Strategies
- Stop the computation?
- Centralized strategies
- The Charm++ RTS collects data (on one processor) about the computational load and the communication for each pair
- If you are not using AMPI/Charm++, you can do the same instrumentation and data collection
- Partition the graph of objects across processors
- Take communication into account
- Point-to-point, as well as multicast over a subset
- As you map an object, add to the load on both the sending and the receiving processor
- The "red" communication is free if it is a multicast
62. Object Partitioning Strategies
- You can use graph partitioners like METIS or K-R
- BUT: these graphs are smaller, and the optimization criteria are different
- Greedy strategies
- If communication costs are low, use a simple greedy strategy:
- Sort objects by decreasing load
- Maintain processors in a heap (by assigned load)
- In each step, assign the heaviest remaining object to the least-loaded processor
- With small-to-moderate communication cost:
- Same strategy, but add communication costs as you add an object to a processor
- Always add a refinement step at the end:
- Swap work from the heaviest-loaded processor to some other processor
- Repeat a few times, or until no improvement
- Refinement-only strategies
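The simple greedy strategy above can be sketched in a few lines (communication costs and the refinement step are omitted; this is a minimal illustration, not the Charm++ balancer code): sort objects by decreasing load, keep processors in a min-heap keyed by assigned load, and give each object to the currently least-loaded processor.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Greedy load balancing: returns, for each object, the processor it is
// assigned to.
std::vector<int> greedyAssign(const std::vector<double>& objLoads,
                              int numProcs) {
    // Sort object indices by decreasing load.
    std::vector<int> order(objLoads.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return objLoads[a] > objLoads[b]; });

    // Min-heap of (assigned load, processor id).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (int p = 0; p < numProcs; ++p) heap.push({0.0, p});

    std::vector<int> assignment(objLoads.size());
    for (int obj : order) {
        Entry least = heap.top(); heap.pop();   // least-loaded processor
        assignment[obj] = least.second;
        least.first += objLoads[obj];
        heap.push(least);
    }
    return assignment;
}
```

Adding communication awareness, as the slide describes, would mean adjusting the cost of each candidate processor by the traffic the object has with objects already placed there.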
63. Object Partitioning Strategies (continued)
- When communication cost is significant:
- Still use a greedy strategy, but
- At each assignment step, choose between assigning O to the least-loaded processor and to the processor that already has the objects that communicate most with O
- Based on the degree of difference in the two metrics
- Two-stage assignments:
- In early stages, consider communication costs, as long as the processors are in the same (broad) load class
- In later stages, decide based on load
- Branch-and-bound
- Searches for the optimum, but can be stopped after a fixed time
64. Crack Propagation
Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions were obtained using METIS. Pictures: S. Breitenfeld and P. Geubelle
As the computation progresses, the crack propagates and new elements are added, leading to more complex computations in some chunks
65. Load Balancer in Action
Automatic load balancing in crack propagation:
1. Elements added
2. Load balancer invoked
3. Chunks migrated
66. Distributed Load Balancing
- Centralized strategies
- Still OK for 3000 processors for NAMD
- Distributed balancing is needed when:
- The number of processors is large, and/or
- Load variation is rapid
- Large machines:
- Need to handle locality of communication
- Topology-sensitive placement
- Need to work with scant global information
- Approximate or aggregated global information (average/max load)
- Incomplete global info (only the neighborhood)
- Work diffusion strategies (1980s work by the author and others!)
- Achieving global effects by local action
67. Other Features
- Client-server interface (CCS)
- Live visualization support
- Libraries
- Communication optimization libraries
- 2D and 3D FFTs, CG, ...
- Debugger
- Freeze/thaw
68. Scaling to PetaFLOPS Machines: Advice
- Dice them fine:
- Use a fine-grained decomposition
- Just enough to amortize the overhead
- Juggle as much as you can:
- Keep communication ops in flight for latency tolerance
- Avoid synchronizations as much as possible:
- Use asynchronous reductions
- Asynchronous collectives in general
69. Grainsize Control
- A simple definition of grainsize:
- Amount of computation per message
- Problem: short messages vs. long messages
- More realistic:
- Computation-to-communication ratio
70. Grainsize Control: Wisdom
- One may think that one should choose the largest grainsize that still generates sufficient parallelism
- In fact, one should select the smallest grainsize that will amortize the overhead
- Total CPU time: T = Tseq + (Tseq/g) × Toverhead
71. How to Avoid Barriers/Reductions
- Sometimes they can be eliminated
- With careful reasoning
- Somewhat complex programming
- When they cannot be avoided, one can often render them harmless
- Use an asynchronous reduction (not normal MPI)
- E.g., in NAMD, energies need to be computed via a reduction and output
- They are not used for anything except output
- Use an asynchronous reduction, working in the background
- When it reports to an object at the root, output it
72. Asynchronous Reductions: Jacobi
- Convergence check
- At the end of each Jacobi iteration, we do a convergence check
- Via a scalar reduction (on maxError)
- But note: each processor can maintain old data for one iteration
- So, use the result of the reduction one iteration later!
- The deposit of the reduction is separated from its result
- MPI_Ireduce(..) returns a handle (like MPI_Irecv)
- And later, MPI_Wait(handle) blocks only when you actually need the result
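The one-iteration-delayed check can be sketched sequentially. This is an illustrative stand-in for the parallel code: the maxError "deposited" at iteration i is only consumed at iteration i+1, so in the parallel version the reduction can proceed in the background; the cost is at most one harmless extra iteration after convergence.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// 1D Jacobi relaxation with a delayed convergence check. Returns the
// number of iterations performed; u holds the converged solution.
int jacobi1D(std::vector<double>& u, double tol, int maxIters) {
    double pendingError = 1e30;  // "result" of the previous reduction
    for (int iter = 0; iter < maxIters; ++iter) {
        // Consume last iteration's error, one iteration late.
        if (pendingError < tol) return iter;
        std::vector<double> next(u);
        double maxError = 0;
        for (size_t i = 1; i + 1 < u.size(); ++i) {
            next[i] = 0.5 * (u[i - 1] + u[i + 1]);
            maxError = std::max(maxError, std::fabs(next[i] - u[i]));
        }
        u = next;
        pendingError = maxError;  // "deposit" this iteration's reduction
    }
    return maxIters;
}
```

In the MPI version, the deposit would be an MPI_Ireduce at the end of each iteration and the consume an MPI_Wait on the previous iteration's handle, so the reduction's latency hides behind a full iteration of compute.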
73. Asynchronous Reductions in Jacobi
(Figure: a processor timeline with a synchronous reduction shows a gap between compute phases while waiting for the reduction; with an asynchronous reduction, the reduction overlaps the next compute phase and the gap is avoided)
74. Asynchronous or Split-Phase Interfaces
75. Case Studies: Examples of Scalability
- A series of examples:
- Where we attained scalability
- What techniques were useful
- What lessons we learned
- Molecular dynamics: NAMD
- Rocket simulation
76. Object-Based Parallelization for MD: Force Decomposition + Spatial Decomposition
- Now we have many objects to load balance
- Each diamond can be assigned to any processor
- Number of diamonds (3D): 14 × number of patches
77. Bonded Forces
- Multiple types of forces:
- Bonds (2 atoms), angles (3), dihedrals (4), ...
- Luckily, each involves atoms in neighboring patches only
- Straightforward implementation:
- Send a message to all neighbors, receive forces from them
- 26 × 2 messages per patch!
- Instead, we do:
- Send to the (7) upstream neighbors
- Each force is calculated at one patch
78. Virtualized Approach to Implementation Using Charm++
- 192 + 144 VPs
- 700 VPs
- 30,000 VPs
These 30,000 virtual processors (VPs) are mapped to real processors by the Charm++ runtime system
80. Case Study: NAMD (Molecular Dynamics)
- 1.02 TeraFLOPS
- "NAMD: Biomolecular Simulation on Thousands of Processors", J. C. Phillips, G. Zheng, S. Kumar, and L. V. Kale, Proc. of Supercomputing 2002
- Gordon Bell Award; unprecedented performance for this application
(Figure: ATPase synthase)
81. Scaling to 64K/128K Processors of BG/L
- What issues will arise?
- Communication
- Bandwidth use is more important than processor overhead
- Locality
- Global synchronizations
- Costly, but not because they take longer
- Rather, small jitters have a large impact
- Sum of max vs. max of sum
- Load imbalance is important, but low grainsize is crucial
- Critical paths gain importance
82. Electronic Structure Using CP
- Car-Parrinello method
- Based on PINY MD
- Glenn Martyna, Mark Tuckerman
- Data structures:
- A bunch of states (say 128)
- Represented as 3D arrays of coefficients in G-space, and also as 3D arrays in real space
- Real-space probability density
- S-matrix: one number for each pair of states
- For orthonormalization
- Nuclei
- Computationally:
- Transformation from G-space to real space
- Uses multiple parallel 3D FFTs
- Sums up real-space densities
- Computes energies from the density
- Computes forces
- Normalizes the G-space wave function
83. One Iteration
84. Parallel Implementation
86. Orthonormalization
- At the end of every iteration, after updating the electron configuration:
- Need to compute (from the states):
- A correlation matrix S, where S[i,j] depends on the entire data of states i and j
- Its transform T
- Update the values
- The computation of S has to be distributed:
- Compute S[i,j,p], where p is the plane number
- Sum over p to get S[i,j]
- The actual conversion from S to T is sequential
87. Orthonormalization (figure)
88. Computation/Communication Overlap
89. Phases (figure): G-space planes (integration, 1D FFT) → real-space planes (2D FFT) → rho real-space planes → forces on/by nuclei → real-space planes (2D IFFT) → G-space planes (integration, 1D IFFT), with pair-calculators for the S-matrix
94. Rocket Simulation
- Dynamic, coupled physics simulation in 3D
- Finite-element solids on an unstructured tet mesh
- Finite-volume fluids on a structured hex mesh
- Coupling every timestep via a least-squares data transfer
- Challenges:
- Multiple modules
- Dynamic behavior: burning surface, mesh adaptation
Robert Fiedler, Center for Simulation of Advanced Rockets
Collaboration with M. Heath, P. Geubelle, and others
95. Application Example: GEN2
96. Rocket Simulation via Virtual Processors
- Scalability challenges:
- Multiple independently developed modules, possibly executing concurrently
- An evolving simulation
- Changes the balance between fluid and solid
- Adaptive refinements
- Dynamic insertion of sub-scale simulation components
- Crack-driven fluid flow and combustion
- Heterogeneous (speed-wise) clusters
97. Rocket Simulation via Virtual Processors (figure)
98. AMPI and Roc* Communication
- By separating independent modules into separate sets of virtual processors, we gained the flexibility to deal with alternate formulations:
- Fluids and solids executing concurrently, OR one after the other
- Changes in the pattern of load distribution within or across modules
(Figure: Rocflo partitions assigned to virtual processors)
99. Using a Simulator
100. Performance Prediction on Large Machines
- Problem:
- How to develop a parallel application for a non-existent machine?
- How to predict the performance of applications on future machines?
- How to do performance tuning without continuous access to a large machine?
- Solution:
- Leverage virtualization
- Develop a machine emulator
- Simulator: accurate time modeling
- Run a program meant for 100,000 processors using only hundreds of processors
Originally targeted at BlueGene/Cyclops; now generalized (and used for BG/L)
101. Why Emulate?
- Allows development of parallel software
- Exposes scalability limitations in data structures, e.g.:
- O(P) arrays are OK only if P is not a million
- Software is ready before the machine is
102. How to Emulate 1M-Processor Apps
- Leverage processor virtualization
- Let each virtual processor of Charm++ stand for a real processor of the emulated machine
- Adequate if you want to emulate MPI apps on 1M processors!
103. Emulation on a Parallel Machine
(Figure: simulated multi-processor nodes hosted on the simulating (host) processors)
Emulating 8M threads on 96 ASCI-Red processors
104. How to Emulate 1M-Processor Apps (continued)
- A twist: what if you want to emulate a Charm++ app?
- E.g., 8M objects (VPs) using 1M target-machine processors?
- A little runtime trickery:
- Target processors are modeled as data structures, while the VPs run as real VPs!
(Figure: several emulated VPs attached to each modeled processor)
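The mapping arithmetic behind this emulation can be sketched directly (an illustrative stand-in, not the BigSim emulator code): each real host processor emulates a contiguous block of target processors, so locating a target processor is simple division.

```cpp
#include <algorithm>

// Sketch: mapping an emulated (target) machine onto the real (host)
// machine with a block distribution.
struct EmulationMap {
    int numTargets;  // processors of the emulated future machine
    int numHosts;    // real processors running the emulation

    int hostOf(int targetPE) const {
        int perHost = (numTargets + numHosts - 1) / numHosts;
        return targetPE / perHost;
    }
    int targetsOnHost(int host) const {
        int perHost = (numTargets + numHosts - 1) / numHosts;
        int start = host * perHost;
        int end = std::min(start + perHost, numTargets);
        return std::max(0, end - start);
    }
};
```

At the scales in the talk (1M targets on a few hundred hosts), each host carries thousands of modeled processors, which is why per-target memory must stay small.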
105. Memory Limit?
- Some applications have low memory use
- Molecular dynamics
- Some large machines may have low memory per processor
- E.g., BG/L: 256 MB for a 2-processor node
- A BG/C design: 16-32 MB for a 32-processor node
- A more general solution is still needed
- Provided by the out-of-core execution capability of Charm++
106. Message-Driven Execution and Out-of-Core Execution
- Virtualization leads to message-driven execution
- So we can prefetch data accurately: automatic out-of-core execution
107. A Success Story
- An emulation-based implementation of the lower layers, as well as Charm++ and AMPI, was completed last year
- As a result:
- The BG/L port of Charm++/AMPI was accomplished in 1-2 days
- Actually, 1-2 hours for the basic port
- 1-2 days to fix an OS-level bug that prevented user-level multi-threading
108. Emulator Performance
- Scalable
- Emulating a real-world MD application on a 200K-processor BG machine
Gengbin Zheng, Arun Singla, Joshua Unger, Laxmikant V. Kalé, "A Parallel-Object Programming Model for PetaFLOPS Machines and Blue Gene/Cyclops", NGS Program Workshop, IPDPS'02
109. Performance Prediction
- How to predict component performance? Multiple resolution levels:
- Sequential component:
- User-supplied expression / timers / performance counters / instruction-level simulation
- Communication component:
- Simple latency-based network model / contention-based network simulation
- Parallel discrete event simulation (PDES):
- A logical processor (LP) has a virtual clock
- Events are time-stamped
- The state of an LP changes when an event arrives at it
- Protocols: conservative vs. optimistic
- Conservative (examples: DaSSF, MPI-SIM)
- Optimistic (examples: Time Warp, SPEEDES)
110. Why Not Use Existing PDES?
- Major synchronization overheads
- Checkpointing overhead
- Rollback overhead
- We can do better:
- Exploit the inherent determinacy of parallel applications
- Most parallel programs are written to be deterministic
111. Categories of Applications
- Linear-order applications
- No wildcard receives
- Strong determinacy; no timestamp correction necessary
- Reactive applications (atomic)
- Message-driven objects
- Methods execute as the corresponding messages arrive
- Multi-dependent applications
- Irecvs with WaitAll
- Use of Structured Dagger to capture dependencies
Gengbin Zheng, Gunavardhan Kakulapati, Laxmikant V. Kalé, "BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines", IPDPS 2004
112. Architecture of the BigSim Simulator
(Figure: Charm++ and MPI applications run on the BigSim emulator over the Charm++ runtime, with a load balancing module; an online PDES engine, fed by an instruction simulator (RSim, IBM, ...), a simple network model, and performance counters, produces simulation output trace logs for performance visualization in Projections)
113. Architecture of the BigSim Simulator (continued)
(Figure: the same stack, with the trace logs additionally fed to an offline PDES network simulator, BigNetSim (built on POSE))
114. Big Network Simulation
- Simulates network behavior: packetization, routing, contention, etc.
- Incorporates post-mortem timestamp correction via POSE
- Currently models torus (BG/L) and fat-tree (QsNet) networks
115. BigSim Validation on Lemieux
116. Performance of BigSim
(Figure: speedup vs. number of real processors on PSC Lemieux)
117. FEM Simulation
- Simple 2D structural simulation in AMPI
- 5-million-element mesh
- 16k BG processors
- Running on only 32 PSC Lemieux processors
118. Case Study: LeanMD
- Molecular dynamics simulation designed for large machines
- K-away cutoff parallelization
- Benchmark: er-gre with 3-away
- 36,573 atoms
- 1.6 million objects, vs. 6000 in 1-away
- 8-step simulation
- 32k-processor BG machine
- Running on 400 PSC Lemieux processors
- Performance visualization tools
119. Load Imbalance
(Figure: histogram)
120. Performance Visualization
122Component Frameworks
- Motivation
- Reduce tedium of parallel programming for
commonly used paradigms - Encapsulate required parallel data structures and
algorithms - Provide easy to use interface,
- Sequential programming style preserved
- No alienating invasive constructs
- Use adaptive load balancing framework
- Component frameworks
- FEM
- Multiblock
- AMR
123. FEM Framework
- Presents a clean, almost serial interface
- Hides the parallel implementation in the runtime system
- Leaves physics and time integration to the user
- Users write code similar to sequential code
- Or easily modify sequential code
- Input: connectivity file (mesh), boundary data, and initial data
- Framework:
- Partitions the data, and
- Starts the driver for each chunk in a separate thread
- Automates communication, once the user registers the fields to be communicated
- Automatic dynamic load balancing
124. Why Use the FEM Framework?
- Makes parallelizing a serial code faster and easier
- Handles mesh partitioning
- Handles communication
- Handles load balancing (via Charm++)
- Allows extra features:
- IFEM matrix library
- NetFEM visualizer
- Collision detection library
125Serial FEM Mesh
126Partitioned Mesh
127FEM Mesh Node Communication
- Summing forces from other processors takes only one call: FEM_Update_field
- A similar call updates ghost regions
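Conceptually, a shared-node field update behaves like the following sketch (a plain-Python stand-in, not the framework's C API; the real framework performs this via registered fields and messages, and `update_shared_field` is a hypothetical name): every node on a partition boundary ends up holding the sum of contributions from all partitions that share it.

```python
# Conceptual stand-in for a shared-node field update (hypothetical code,
# not the FEM framework API): sum each shared node's contributions across
# partitions, then give every partition's copy the global total.

def update_shared_field(partitions):
    """partitions: list of dicts mapping global node id -> local value
    (e.g. a force component). Returns the partitions after the update."""
    totals = {}
    for part in partitions:
        for node, val in part.items():
            totals[node] = totals.get(node, 0.0) + val
    # every partition's copy of a shared node now holds the global sum
    return [{node: totals[node] for node in part} for part in partitions]
```

With two partitions sharing node 5, `[{1: 1.0, 5: 2.0}, {5: 3.0, 9: 4.0}]` becomes `[{1: 1.0, 5: 5.0}, {5: 5.0, 9: 4.0}]`: interior nodes are untouched, the shared node is summed.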
128FEM Framework Users: CSAR
- Rocflu fluids solver, a part of GENx
- Finite-volume fluid dynamics code
- Uses FEM ghost elements
- Author Andreas Haselbacher
Robert Fielder, Center for Simulation of Advanced
Rockets
129FEM Experience
- Previous
- 3-D volumetric/cohesive crack propagation code
- (P. Geubelle, S. Breitenfeld, et al.)
- 3-D dendritic growth fluid solidification code
- (J. Dantzig, J. Jeong)
- Adaptive insertion of cohesive elements
- Mario Zaczek, Philippe Geubelle
- Performance data
- Multi-Grain contact (in progress)
- Spandan Maiti, S. Breitenfield, O. Lawlor, P. Geubelle
- Using the FEM framework and collision detection
- NSF funded project
- Space-time meshes
- Did initial parallelization in 4 days
130Performance data ASCI Red
Mesh with 3.1 million elements
Speedup of 1155 on 1024 processors.
131Dendritic Growth
- Studies evolution of solidification microstructures using a phase-field model computed on an adaptive finite element grid
- Adaptive refinement and coarsening of the grid involves re-partitioning
Jon Dantzig et al., with O. Lawlor and others from PPL
132Overhead of Multipartitioning
Conclusion: the overhead of virtualization is small; in fact, performance often benefits, because the automatically created smaller partitions improve cache behavior.
133Parallel Collision Detection
- Detect collisions (intersections) between objects scattered across processors
- Approach, based on Charm Arrays
- Overlay regular, sparse 3D grid of voxels (boxes)
- Send objects to all voxels they touch
- Collide objects within each voxel independently and collect results
- Leave collision response to user code
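The voxel-based approach above can be sketched serially (a stand-in for the Charm-array version; the 2D setting, bounding-box representation, and function name are illustrative assumptions): overlay a regular grid, send each object to every voxel its bounding box touches, then test pairs within each voxel independently.

```python
# Serial sketch of the voxel-grid collision scheme (illustrative, not the
# parallel Charm implementation): objects are axis-aligned 2D boxes.
from itertools import combinations

def voxel_collisions(boxes, voxel_size):
    """boxes: list of (xmin, ymin, xmax, ymax). Returns set of index pairs
    whose boxes overlap."""
    voxels = {}
    # "send" each box to every voxel it touches
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        for vx in range(int(x0 // voxel_size), int(x1 // voxel_size) + 1):
            for vy in range(int(y0 // voxel_size), int(y1 // voxel_size) + 1):
                voxels.setdefault((vx, vy), []).append(i)
    hits = set()
    # each voxel is collided independently; results are collected (deduped)
    for members in voxels.values():
        for i, j in combinations(members, 2):
            a, b = boxes[i], boxes[j]
            if a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]:
                hits.add((min(i, j), max(i, j)))
    return hits
```

Because only boxes sharing a voxel are ever compared, the pairwise tests stay local, which mirrors why the parallel version communicates so little.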
134Parallel Collision Detection
- Results: 2 µs per polygon
- Good speedups to 1000s of processors
ASCI Red, 65,000 polygons per processor (scaled problem); up to 100 million polygons
- This was a significant improvement over the state of the art
- Made possible by virtualization, and
- Asynchronous, as-needed creation of voxels
- Localization of communication: a voxel is often on the same processor as the contributing polygon
135Summary
- Processor virtualization is a powerful technique
- Charm/AMPI are production quality systems
- with bells and whistles
- Can scale to petaFLOPS class machines
- Domain-specific frameworks
- Can raise the level of abstraction and promote reuse
- Unstructured Mesh framework
- Next: compiler support, new coordination mechanisms
- Software available
- http://charm.cs.uiuc.edu
136Optimizing for Communication Patterns
- The parallel-objects runtime system can observe, instrument, and measure communication patterns
- Communication is from/to objects, not processors
- Load balancers can use this to optimize object placement
- Communication libraries can optimize
- By substituting the most suitable algorithm for each operation
- Learning at runtime
V. Krishnan, MS Thesis, 1996
137Molecular Dynamics: benefits of avoiding barriers
- In NAMD
- The energy reductions were made asynchronous
- No other global barriers are used in cut-off simulations
- This came in handy when
- Running on Pittsburgh's Lemieux (3000 processors)
- The machine (and our way of using the communication layer) produced unpredictable, random delays in communication
- A send call would remain stuck for 20 ms, for example
- How did the system handle it?
- See timeline plots
138Golden Rule of Load Balancing
Fallacy: the objective of load balancing is to minimize variance in load across processors.
Example: 50,000 tasks of equal size, 500 processors.
A: All processors get 99 tasks, except the last 5, which get 199 (100+99) each.
OR, B: All processors get 101 tasks, except the last 5, which get 1 each.
Identical variance, but situation A is much worse!
Golden Rule: it is OK if a few processors idle, but avoid having processors that are overloaded with work.
Finish time = max over i (time on i-th processor), excepting data dependence and communication overhead issues.
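The slide's numbers can be checked directly: both distributions total 50,000 tasks and have identical variance, yet the finish times (the maximum per-processor load) differ badly.

```python
# Reproducing the slide's example: 50,000 equal tasks on 500 processors.
from statistics import pvariance

a = [99] * 495 + [199] * 5   # scenario A: five processors badly overloaded
b = [101] * 495 + [1] * 5    # scenario B: five processors nearly idle

assert sum(a) == sum(b) == 50000      # same total work
assert pvariance(a) == pvariance(b)   # identical variance...
finish_a, finish_b = max(a), max(b)   # ...but finish times 199 vs. 101
```

This is exactly the Golden Rule: B wastes five processors and still finishes almost twice as fast as A.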
139Amdahl's Law and grainsize
- Before we get to load balancing
- Original law
- If a program has a K% sequential section, then speedup is limited to 100/K,
- even if the rest of the program is parallelized completely
- Grainsize corollary
- If any individual piece of work takes more than K time units, and the sequential program takes Tseq,
- Speedup is limited to Tseq / K
- So
- Examine performance data via histograms to find the sizes of remappable work units
- If some are too big, change the decomposition method to make smaller units
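Both bounds from the slide are one-liners; the numbers below are illustrative, not from any particular run.

```python
# Checking the two speedup bounds on the slide (illustrative numbers).

def amdahl_limit(seq_percent):
    # a K% sequential section caps speedup at 100/K
    return 100.0 / seq_percent

def grainsize_limit(tseq, largest_grain):
    # a single piece of work taking K time units caps speedup at Tseq/K
    return tseq / largest_grain

assert amdahl_limit(5) == 20.0            # 5% sequential: at most 20x
assert grainsize_limit(10000, 100) == 100.0  # biggest grain 100 of 10,000 units
```

The corollary is why the histogram step matters: one oversized, unsplittable work unit silently caps scalability no matter how many processors are added.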
140Grainsize: LeanMD for Blue Gene/L
- BG/L is a planned IBM machine with 128k processors
- Here, we need even more objects
- Generalize hybrid decomposition scheme
- 1-away to k-away
2-away cubes are half the size.
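The effect of going from 1-away to k-away on object counts can be sketched as follows (a rough model with assumed, illustrative numbers; it is not LeanMD's actual configuration): k-away patches are cubes of side cutoff/k, so the patch count grows as k³.

```python
# Rough model of k-away decomposition (illustrative; assumes a cubic box
# whose side is a multiple of the cutoff).

def patch_count(box_side, cutoff, k):
    """Number of patches when each patch is a cube of side cutoff/k."""
    per_dim = k * box_side // cutoff
    return per_dim ** 3

one_away = patch_count(64, 16, 1)   # 4^3 = 64 patches
two_away = patch_count(64, 16, 2)   # 8^3 = 512 patches (cubes half the size)
```

Eight times as many patches per halving of the cube side, plus the growth in interacting patch pairs, is what drives object counts into the millions for 3-away runs, which is exactly the virtualization headroom a 32k-processor machine needs.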
141(figure: runs with 5,000 vps, 76,000 vps, and 256,000 vps)
142New strategy