Transcript and Presenter's Notes

Title: Parallel Computing with Data-driven Objects


1
Parallel Computing with Data-driven Objects
  • Laxmikant (Sanjay) Kale
  • Parallel Programming Laboratory
  • Department of Computer Science
  • http://charm.cs.uiuc.edu

2
Parallel Programming Laboratory
  • Funding
  • Dept. of Energy (via the Rocket Center)
  • National Science Foundation
  • National Institutes of Health
  • Group Members

Affiliated (NIH/Biophysics): Jim Phillips, Kirby Vandivoort
Milind Bhandarkar, Terry Wilmarth, Orion Lawlor, Neelam Saboo, Arun Singla, Joshua Unger, Gengbin Zheng, Jay DeSouza, Sameer Kumar, Chee Wai Lee, Karthikeyan Mahesh
3
The Parallel Programming Problem
  • Is there one?
  • We can all write MPI programs, right?
  • Several Large Machines in use
  • But
  • New complex apps with dynamic and irregular
    structure
  • Should all application scientists also be experts
    in parallel computing?

4
What makes it difficult?
  • Multiple objectives
  • Correctness, Sequential efficiency, speedups
  • Nondeterminacy affects correctness
  • Several obstacles to speedup
  • communication costs
  • Load imbalances
  • Long critical paths

5
Parallel Programming
  • Decomposition
  • Decide what to do in parallel
  • Tasks (loop iterations, functions,.. ) that can
    be done in parallel
  • Mapping
  • Which processor does each task
  • Scheduling (sequencing)
  • On each processor
  • Machine dependent expression
  • Express the above decisions for the particular
    parallel machine

6
Spectrum of Parallel Languages
[Figure: spectrum of parallel languages. Axes: the level at which work is automated (decomposition, mapping, scheduling/sequencing, machine-dependent expression) versus specialization. A parallelizing Fortran compiler automates the most; MPI leaves the most to the programmer; Charm++ sits in between, automating mapping and scheduling while leaving decomposition to the programmer.]
7
Overview
  • Context: approach and methodology
  • Molecular dynamics for biomolecules
  • Our program, NAMD
  • Basic parallelization strategy
  • NAMD performance optimizations
  • Techniques
  • Results
  • Conclusions: summary, lessons, and future work

8
Parallel Programming Laboratory
  • Objective: Enhance performance and productivity
    in parallel programming
  • For complex, dynamic applications
  • Scalable to thousands of processors
  • Theme
  • Adaptive techniques for handling dynamic behavior
  • Strategy: Look for the optimal division of labor
    between the human programmer and the system
  • Let the programmer specify what to do in parallel
  • Let the system decide when and where to run them
  • Data-driven objects as the substrate: Charm++

9
System Mapped Objects
[Figure: system-mapped objects; the programmer's numbered objects are distributed across the processors by the runtime system.]
Data Driven Execution
[Figure: two processors, each with a scheduler and a message queue; the scheduler repeatedly picks a message from its queue and invokes the corresponding object's method.]
11
Charm++
  • Parallel C++ with data-driven objects
  • Object arrays and collections
  • Asynchronous method invocation (see the sketch below)
  • Object groups
  • A global object with a representative on each PE
  • Prioritized scheduling
  • Mature, robust, portable
  • http://charm.cs.uiuc.edu
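To make the object-array and asynchronous-invocation model above concrete, here is a minimal sketch written in the style of Charm++. The module, class, and method names (hello, Main, Worker, work, done) are ours, not from the slides, and the declaration syntax has varied across Charm++ versions; treat this as an illustration rather than a verbatim recipe.

// hello.ci -- interface file: declares chares, entry methods, and a readonly proxy
mainmodule hello {
  readonly CProxy_Main mainProxy;
  mainchare Main {
    entry Main(CkArgMsg *m);
    entry void done();
  };
  array [1D] Worker {
    entry Worker();
    entry void work(int step);     // asynchronous entry method
  };
};

// hello.C -- implementation
#include "hello.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int pending;
public:
  Main(CkArgMsg *m) : pending(100) {
    delete m;
    mainProxy = thisProxy;                        // readonly initialization
    CProxy_Worker w = CProxy_Worker::ckNew(100);  // 100 objects, mapped (and remappable) by the runtime
    w.work(0);                                    // asynchronous broadcast invocation on every element
  }
  void done() { if (--pending == 0) CkExit(); }   // called back once per element
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage *m) : CBase_Worker(m) {} // constructor used when the object migrates
  void work(int step) {
    // ... do one chunk of the computation ...
    mainProxy.done();                              // asynchronous invocation back to the main chare
  }
};

#include "hello.def.h"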

12
Multi-partition Decomposition
  • Writing applications with Charm++
  • Decompose the problem into a large number of chunks
  • Implement chunks as objects
  • Or, now, as MPI threads (AMPI on top of Charm++)
  • Let Charm++ map and remap objects
  • Allow for migration of objects
  • If desired, specify potential migration points

13
Load Balancing Mechanisms
  • Re-map and migrate objects
  • Registration mechanisms facilitate migration
  • Efficient message delivery strategies
  • Efficient global operations
  • Such as reductions and broadcasts
  • Several classes of load balancing strategies
    provided
  • Incremental
  • Centralized as well as distributed
  • Measurement based

14
Principle of Persistence
  • An observation about CSE applications
  • Extension of principle of locality
  • Behavior of objects, including computational load
    and communication patterns, tends to persist over time
  • Application-induced imbalance:
  • Abrupt but infrequent, or
  • Slow and cumulative
  • Rarely: frequent, large changes
  • Our framework still deals with this case as well
  • Measurement-based strategies

15
Measurement-Based Load Balancing Strategies
  • Collect timing data for several cycles
  • Run heuristic load balancer
  • Several alternative ones
  • Robert Brunner's recent Ph.D. thesis
  • Instrumentation framework
  • Strategies
  • Performance comparisons

16
Molecular Dynamics
ApoA-I 92k Atoms
17
Molecular Dynamics and NAMD
  • MD is used to understand the structure and
    function of biomolecules
  • Proteins, DNA, membranes
  • NAMD is a production-quality MD program
  • Active use by biophysicists (published science)
  • 50,000 lines of C++ code
  • 1000 registered users
  • Features include
  • CHARMM and XPLOR compatibility
  • PME electrostatics and multiple timestepping
  • Steered and interactive simulation via VMD

18
NAMD Contributors
  • PIs
  • Laxmikant Kale, Klaus Schulten, Robert Skeel
  • NAMD Version 1
  • Robert Brunner, Andrew Dalke, Attila Gursoy,
    Bill Humphrey, Mark Nelson
  • NAMD2
  • M. Bhandarkar, R. Brunner, Justin Gullingsrud, A.
    Gursoy, N. Krawetz, J. Phillips, A. Shinozaki, K.
    Varadarajan, Gengbin Zheng, ..

Theoretical Biophysics Group, supported by NIH
19
Molecular Dynamics
  • Collection of charged atoms, with bonds
  • Newtonian mechanics
  • At each time-step (see the sketch after this list)
  • Calculate forces on each atom
  • Bonds
  • Non-bonded electrostatic and van der Waals
  • Calculate velocities and advance positions
  • 1 femtosecond time-step, millions needed!
  • Thousands of atoms (1,000 - 100,000)
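The sequential structure being parallelized is the classic time-stepping loop, sketched below in C++. The type and function names are illustrative assumptions, not NAMD code, and the simple first-order update stands in for the real integrator.

#include <cstddef>
#include <vector>

// Schematic MD time step (not NAMD code): a real integrator (e.g., velocity Verlet) differs in detail.
struct Vec3 { double x = 0, y = 0, z = 0; };

// Placeholder force routines: in a real code these evaluate bonded terms
// (bonds, angles, dihedrals) and cutoff-limited non-bonded terms.
void computeBondedForces(const std::vector<Vec3>&, std::vector<Vec3>&) {}
void computeNonbondedForces(const std::vector<Vec3>&, std::vector<Vec3>&) {}

void mdStep(std::vector<Vec3>& pos, std::vector<Vec3>& vel,
            std::vector<Vec3>& frc, const std::vector<double>& mass, double dt) {
  for (std::size_t i = 0; i < frc.size(); ++i) frc[i] = Vec3{};   // zero the forces
  computeBondedForces(pos, frc);        // forces from bonds
  computeNonbondedForces(pos, frc);     // electrostatic and van der Waals (cut-off) forces
  for (std::size_t i = 0; i < pos.size(); ++i) {
    vel[i].x += dt * frc[i].x / mass[i];   // advance velocities
    vel[i].y += dt * frc[i].y / mass[i];
    vel[i].z += dt * frc[i].z / mass[i];
    pos[i].x += dt * vel[i].x;             // advance positions
    pos[i].y += dt * vel[i].y;
    pos[i].z += dt * vel[i].z;
  }
}
// A production run repeats mdStep millions of times with dt on the order of 1 femtosecond.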

20
Cut-off Radius
  • Use of cut-off radius to reduce work
  • 8 - 14 Å
  • Far away atoms ignored! (screening effects)
  • 80-95% of the work is non-bonded force computation
  • Some simulations need faraway contributions
  • Particle-Mesh Ewald (PME)
  • Even so, cut-off based computations are
    important
  • Near-atom calculations constitute the bulk of the
    above
  • Multiple time-stepping is used: k cut-off steps,
    1 PME step
  • So (k-1) steps do just cut-off based simulation (see the sketch below)
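A hedged sketch of that schedule as a driver loop; the function names are placeholders, and real multiple-time-stepping schemes combine the short- and long-range force terms more carefully than simply alternating them.

// Schematic driver for the multiple-time-stepping schedule (illustrative only).
void doCutoffForces() {}   // short-range (cut-off) force evaluation
void doPMEForces()    {}   // long-range PME contribution
void integrate()      {}   // advance velocities and positions

void runSimulation(long totalSteps, int k) {
  for (long step = 0; step < totalSteps; ++step) {
    doCutoffForces();                 // done every step
    if (step % k == 0) doPMEForces(); // done only every k-th step
    integrate();                      // so (k-1) of every k steps are cut-off only
  }
}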

21
Scalability
  • The program should scale up to use a large number
    of processors.
  • But what does that mean?
  • An individual simulation isnt truly scalable
  • Better definition of scalability
  • If I double the number of processors, I should
    be able to retain parallel efficiency by
    increasing the problem size

22
Isoefficiency
  • Quantify scalability
  • (Work of Vipin Kumar, U. Minnesota)
  • How much increase in problem size is needed to
    retain the same efficiency on a larger machine?
  • Efficiency = sequential time / (P × parallel time)
  • Parallel time = computation + communication + idle
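Spelling this out with the slide's definitions (the arithmetic here is ours): for atom decomposition on the next slide the communication-to-computation ratio is O(N) / O(N/P) = O(P), and for force decomposition it is O(N/√P) / O(N/P) = O(√P). Both ratios are independent of N, so no increase in problem size can bring the efficiency back up; that is why they are called not scalable here. Spatial decomposition's O(1) ratio, by contrast, lets efficiency be held constant by growing the problem with the machine.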

23
Early methods
  • Atom replication
  • Each processor has data for all atoms
  • Force calculations parallelized
  • Collection of forces: O(N log P) communication
  • Computation: O(N/P)
  • Communication-to-computation ratio: O(P log P). Not scalable
  • Atom Decomposition
  • Partition the atoms array across processors
  • Nearby atoms may not be on the same processor
  • Communication: O(N) per processor
  • Ratio: O(N) / O(N/P) = O(P). Not scalable

24
Force Decomposition
  • Distribute force matrix to processors
  • Matrix is sparse, non-uniform
  • Each processor has one block
  • Communication: O(N/√P) per processor
  • Ratio: O(√P)
  • Better scalability in practice
  • Can use 100 processors
  • Plimpton; Hwang, Saltz, et al.
  • Speedups of 6 on 32 processors and 36 on 128 processors
  • Yet not scalable in the sense defined here!

25
Spatial Decomposition
  • Allocate close-by atoms to the same processor
  • Three variations possible
  • Partitioning into P boxes, 1 per processor
  • Good scalability, but hard to implement
  • Partitioning into fixed size boxes, each a little
    larger than the cut-off distance
  • Partitioning into smaller boxes
  • Communication O(N/P)
  • Communication/Computation ratio O(1)
  • So, scalable in principle

26
Ongoing work
  • Plimpton, Hendrickson
  • new spatial decomposition
  • NWChem (PNL)
  • Peter Kollman, Yong Duan et al
  • microsecond simulation
  • AMBER version (SANDER)

27
Spatial Decomposition in NAMD
But the load balancing problems are still severe
28
Hybrid Decomposition
29
FD + SD
  • Now, we have many more objects to load balance
  • Each diamond can be assigned to any processor
  • Number of diamonds (3D): 14 × number of patches

30
Bond Forces
  • Multiple types of forces
  • Bonds(2), angles(3), dihedrals (4), ..
  • Luckily, each involves atoms in neighboring
    patches only
  • Straightforward implementation
  • Send message to all neighbors,
  • receive forces from them
  • 26 × 2 messages per patch!

31
Bond Forces
  • Assume one patch per processor
  • An angle force involving atoms in patches
    (x1,y1,z1), (x2,y2,z2), and (x3,y3,z3) is calculated
    in patch (max xi, max yi, max zi); see the sketch below


[Figure: three neighboring patches A, B, C contributing atoms to one angle term.]
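A tiny C++ sketch of that ownership rule (the PatchIndex type and function name are ours):

#include <algorithm>

struct PatchIndex { int x, y, z; };

// The patch that owns an angle term spanning patches a, b, and c:
// the coordinate-wise maximum, so exactly one patch computes each term.
PatchIndex ownerPatch(const PatchIndex& a, const PatchIndex& b, const PatchIndex& c) {
  return { std::max({a.x, b.x, c.x}),
           std::max({a.y, b.y, c.y}),
           std::max({a.z, b.z, c.z}) };
}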
32
NAMD Implementation
  • Multiple objects per processor
  • Different types: patches, pairwise force computations,
    bonded force computations
  • Each may have its data ready at different times
  • Need ability to map and remap them
  • Need prioritized scheduling
  • Charm++ supports all of these

33
Load Balancing
  • Is a major challenge for this application
  • Especially for a large number of processors
  • Unpredictable workloads
  • Each diamond (force compute object) and patch
    encapsulates a variable amount of work
  • Static estimates are inaccurate
  • Very slow variations across timesteps
  • Measurement-based load balancing framework

[Figure: one compute object connected to two cells (patches).]
34
Bipartite Graph Balancing
  • Background load
  • Patches (integration, etc.) and bond-related
    forces
  • Migratable load
  • Non-bonded forces
  • Bipartite communication graph
  • Between migratable and non-migratable objects
  • Challenge
  • Balance load while minimizing communication

35
Load Balancing Strategy
Greedy variant (simplified):
  Sort compute objects (diamonds) by load
  Repeat (until all are assigned):
    S = set of all processors that
        -- are not overloaded, and
        -- generate the least new communication
    P = least loaded processor in S
    Assign the heaviest remaining compute to P

Refinement:
  Repeat:
    - Pick a compute from the most overloaded PE
    - Assign it to a suitable underloaded PE
  Until no movement
(A C++ sketch of the greedy phase follows below.)
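The C++ sketch below mirrors the greedy phase under simplifying assumptions: the Compute and Proc types, the 1.1 × average-load threshold for "overloaded", and the stubbed commCost term are all ours, not the actual NAMD load balancer.

#include <algorithm>
#include <limits>
#include <vector>

struct Compute { double load; int id; int assignedPE = -1; };
struct Proc    { double load = 0; };

double commCost(const Compute&, int /*pe*/) { return 0; }  // stub: new communication added by this choice

double averageLoad(const std::vector<Proc>& procs, const std::vector<Compute>& cs) {
  double total = 0;
  for (const auto& c : cs) total += c.load;
  return total / procs.size();
}

void greedyAssign(std::vector<Compute>& computes, std::vector<Proc>& procs) {
  const double threshold = 1.1 * averageLoad(procs, computes);   // "overloaded" cutoff (assumed)
  std::sort(computes.begin(), computes.end(),                    // heaviest computes first
            [](const Compute& a, const Compute& b) { return a.load > b.load; });
  for (auto& c : computes) {
    int best = -1;
    double bestKey = std::numeric_limits<double>::max();
    for (int pe = 0; pe < (int)procs.size(); ++pe) {
      if (procs[pe].load + c.load > threshold) continue;         // skip PEs that would overload
      // Prefer least new communication, then least loaded PE.
      double key = commCost(c, pe) * 1e6 + procs[pe].load;
      if (key < bestKey) { bestKey = key; best = pe; }
    }
    if (best < 0) {                                              // fall back: least loaded PE
      best = (int)(std::min_element(procs.begin(), procs.end(),
               [](const Proc& a, const Proc& b) { return a.load < b.load; }) - procs.begin());
    }
    c.assignedPE = best;
    procs[best].load += c.load;
  }
}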
36
(No Transcript)
37
Speedups in 1998
ApoA-I 92k atoms
38
Optimizations
  • Series of optimizations
  • Examples discussed here
  • Grainsize distributions (bimodal)
  • Integration message sending overheads
  • Several other optimizations
  • Separation of bond/angle/dihedral objects
  • Inter-patch and intra-patch
  • Prioritization
  • Local synchronization to avoid interference
    across steps

39
Grainsize and Amdahl's Law
  • A variant of Amdahls law, for objects, would be
  • The fastest time can be no shorter than the time
    for the biggest single object!
  • How did it apply to us?
  • Sequential step time was 57 seconds
  • To run on 2k processors, no object should take more
    than 28 msecs
  • Should be even shorter
  • Grainsize analysis via Projections showed that this
    was not so

40
Grainsize Analysis
Problem: some compute objects had far too much work (bimodal grainsize distribution).
Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms.
41
Grainsize Reduced
42
Performance Audit
  • Through the optimization process,
  • an audit was kept to decide where to look to
    improve performance

Component      Ideal    Actual
Total          57.04    86
nonBonded      52.44    49.77
Bonds           3.16     3.9
Integration     1.44     3.05
Overhead        0        7.97
Imbalance       0       10.45
Idle            0        9.25
Receives        0        1.61

Integration time doubled
43
Integration Overhead Analysis
Problem: integration time had doubled compared to the sequential run
44
Integration Overhead Example
  • The projections pictures showed the overhead was
    associated with sending messages.
  • Many cells were sending 30-40 messages.
  • The overhead was still too much compared with the
    cost of messages.
  • Code analysis: memory allocations!
  • An identical message was being sent to 30 processors
  • Simple multicast support was added to Charm++
  • Mainly eliminates memory allocations (and some copying); see the sketch below
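The shape of the fix, as a hedged sketch; ForceMsg, send, and multicast here are hypothetical stand-ins, not the actual Charm++ interface that was added.

#include <vector>

// Hypothetical stand-ins (not the real Charm++ multicast API).
struct ForceMsg { std::vector<double> data; };
void send(int /*pe*/, ForceMsg* /*m*/) {}                            // point-to-point send (stub)
void multicast(const std::vector<int>& /*pes*/, ForceMsg* /*m*/) {}  // one-shot multicast (stub)

void sendForces(const std::vector<int>& neighborPEs, const std::vector<double>& myData) {
  // Before: one message allocation (and copy) per destination, roughly 30 per patch per step:
  //   for (int pe : neighborPEs) send(pe, new ForceMsg{myData});
  // After: allocate the message once and let the runtime deliver it everywhere
  // (ownership of the message passes to the runtime in this model):
  multicast(neighborPEs, new ForceMsg{myData});
}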

45
Integration Overhead After Multicast
46
ApoA-I on ASCI Red
57 ms/step
47
ApoA-I on Origin 2000
48
ApoA-I on Linux Cluster
49
ApoA-I on O2K and T3E
50
ApoA-I on T3E
51
BC1 complex 200k atoms
52
BC1 on ASCI Red
58.4 GFlops
53
Lessons Learned
  • Need to downsize objects!
  • Choose smallest possible grainsize that amortizes
    overhead
  • One of the biggest challenges
  • Was getting time for performance-tuning runs on
    parallel machines

54
ApoA-I with PME on T3E
55
Future and Planned Work
  • Increased speedups on 2k-10k processors
  • Smaller grainsizes
  • Parallelizing integration further
  • New algorithms for reducing communication impact
  • New load balancing strategies
  • Further performance improvements for PME
  • With multiple timestepping
  • Needs multi-phase load balancing
  • Speedup on small molecules!
  • Interactive molecular dynamics

56
More Information
  • Charm++ and associated framework
  • http://charm.cs.uiuc.edu
  • NAMD and associated biophysics tools
  • http://www.ks.uiuc.edu
  • Both include downloadable software

57
Initial Speedup Results ASCI Red
58
Performance: Size of System
Performance data on Cray T3E
59
Performance: Various Machines
60
Charm++
  • Data Driven Objects
  • Asynchronous method invocation
  • Prioritized scheduling
  • Object Arrays
  • Object Groups
  • global object with a representative on each PE
  • Information sharing abstractions
  • readonly data
  • accumulators
  • distributed tables

61
Data Driven Execution
[Figure: data-driven execution; objects on each processor, with a scheduler per processor picking messages from its message queue.]
62
CkChareID mainhandle;

main::main(CkArgMsg *m)            // execution begins here; m carries argc/argv
{
  int i, low = 0;
  for (i = 0; i < 100; i++)
    new CProxy_piPart();           // fire off 100 piPart chares
  responders = 100;
  count = 0;
  mainhandle = thishandle;         // readonly initialization
}                                  // the scheduler takes over after the method returns

void main::results(DataMsg *msg)
{
  count += msg->count;
  if (0 == --responders) {
    CkPrintf("pi = %f \n", 4.0 * count / 100000);
    CkExit();                      // exit the scheduler after this method returns
  }
}
63
piPart::piPart()
{
  // declarations..
  CProxy_main mainproxy(mainhandle);
  srand48((long) this);
  mySamples = 100000 / 100;
  for (i = 0; i < mySamples; i++) {
    x = drand48();
    y = drand48();
    if ((x*x + y*y) < 1.0) localCount++;
  }
  DataMsg *result = new DataMsg;
  result->count = localCount;
  mainproxy.results(result);       // asynchronous invocation on the main chare
  delete this;
}
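For completeness, a plausible interface (.ci) declaration that would accompany the code above. This is our reconstruction, not shown on the slides, so details may differ from the original program (responders, count, and localCount would be ordinary C++ variables rather than interface declarations).

// pgm.ci (reconstructed sketch)
mainmodule pgm {
  readonly CkChareID mainhandle;   // set in main::main, readable everywhere afterwards
  message DataMsg;                 // carries the per-chare hit count
  mainchare main {
    entry main(CkArgMsg *m);       // execution begins here
    entry void results(DataMsg *msg);
  };
  chare piPart {
    entry piPart(void);            // samples in the constructor, then deletes itself
  };
};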
64
Group Mission and Approach
  • To enhance Performance and Productivity in
    programming complex parallel applications
  • Approach: application-oriented yet CS-centered research
  • Develop enabling technology for many apps
  • Develop, use, and test it in the context of real applications
  • Theme
  • Adaptive techniques for irregular and dynamic applications
  • Optimal division of labor between system and programmer
  • Decomposition done by programmer, everything else
    automated
  • Develop standard library for parallel programming
    of reusable components

65
Active Projects
  • Charm++ / Converse parallel infrastructure
  • Scientific/Engineering apps
  • Molecular Dynamics
  • Rocket Simulation
  • Finite Element Framework
  • Web-based interaction and monitoring
  • Faucets: anonymous compute power
  • Parallel Operations Research, discrete event simulation,
    combinatorial search

66
Charm++: Parallel C++ with Data-Driven Objects
  • Chares: dynamically balanced objects
  • Object Groups
  • global object with a representative on each PE
  • Object Arrays/ Object Collections
  • User defined indexing (1D,2D,..,quad and
    oct-tree,..)
  • System supports remapping and forwarding
  • Asynchronous method invocation
  • Prioritized scheduling
  • Mature, robust, portable
  • http://charm.cs.uiuc.edu

Data driven Execution
67
Multi-partition Decomposition
  • Idea: divide the computation into a large number
    of pieces
  • Independent of number of processors
  • Typically larger than number of processors
  • Let the system map entities to processors

68
Converse
  • Portable parallel run-time system that allows
    interoperability among parallel languages
  • Rich features to allow quick and efficient
    implementation of new parallel languages
  • Based on message-driven execution that allows
    co-existence of different control regimes
  • Support for debugging and performance analysis of
    parallel programs
  • Support for building parallel servers

69
Converse
  • Languages and paradigms implemented
  • Charm++, a parallel object-oriented language
  • Thread-safe MPI and PVM
  • Parallel Java, message-driven Perl, pC++
  • Platforms supported
  • SGI Origin 2000, IBM SP, ASCI Red, Cray T3E, Convex Exemplar
  • Workstation clusters (Solaris, HP-UX, AIX, Linux, etc.)
  • Windows NT clusters

[Figure: layered architecture; paradigms (languages, libraries) on top of Converse, on top of parallel machines.]
70
Adaptive MPI
  • A bridge between legacy MPI codes and the dynamic
    load balancing capabilities of Charm++
  • AMPI = MPI + dynamic load balancing
  • Based on Charm++ object arrays and Converse's
    migratable threads
  • Minimal modification needed to convert existing
    MPI programs (to be automated in future); see the example below
  • Bindings for C, C++, and Fortran 90
  • Currently supports most of the MPI 1.1 standard
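As an illustration of the "minimal modification" point above, an ordinary MPI program such as the following generic sketch could be built against AMPI so that each rank becomes a migratable user-level thread; the program is ours, not from the slides, and AMPI-specific build steps (and its handling of global variables) are omitted.

// A plain MPI program; under AMPI each MPI rank becomes a migratable
// user-level thread, so many ranks can share one physical processor.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int local = rank + 1, total = 0;     // each rank contributes one value
  MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) std::printf("sum over %d ranks = %d\n", size, total);

  MPI_Finalize();
  return 0;
}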

71
Converse Use in NAMD
72
Molecular Dynamics
  • Collection of charged atoms, with bonds
  • Newtonian mechanics
  • At each time-step
  • Calculate forces on each atom
  • bonds
  • non-bonded electrostatic and van der Waals
  • Calculate velocities and advance positions
  • 1 femtosecond time-step, millions needed!
  • Thousands of atoms (1,000 - 200,000)

Collaboration with Klaus Schulten, Robert Skeel
73
Spatial Decomposition in NAMD
  • Space divided into cubes
  • Forces between atoms in neighboring cubes
    computed by individual compute objects
  • Compute objects are remapped by load balancer

74
NAMD a Production-quality MD Program
  • NAMD is used by biophysicists routinely, with
    several published results
  • NIH funded collaborative effort with Profs. K.
    Schulten and R. Skeel
  • Supports full-range electrostatics
  • Parallel Particle-Mesh Ewald for periodic systems and
    fast multipole for aperiodic systems
  • Implemented in C++/Charm++
  • Supports visualization (via VMD), interactive MD,
    and a haptic interface
  • See http://www.ks.uiuc.edu
  • Part of Biophysics collaboratory

ApoLipoprotein A1
75
NAMD Scalable Performance
Sequential performance of NAMD (a C++ program) is
comparable to or better than contemporary MD
programs written in Fortran.
Gordon Bell award finalist, SC2000
Speedup of 1250 on 2048 processors of ASCI Red,
simulating BC1 with about 200k atoms (compare
with the best speedups on production-quality MD codes
by others: 170 on 256 processors). Around 10,000
varying-size objects mapped by the load balancer.
76
Rocket Simulation
  • Rocket behavior (and therefore its simulation) is
    irregular, dynamic
  • We need to deal with dynamic variations
    adaptively
  • Dynamic behavior arises from
  • Combustion: moving boundaries
  • Crack propagation
  • Evolution of the system

77
Rocket Simulation
  • Our Approach
  • Multi-partition decomposition
  • Data-driven objects (Charm++)
  • Automatic load balancing framework
  • AMPI: migration path for existing MPI/Fortran 90
    codes
  • ROCFLO, ROCSOLID, and ROCFACE

78
FEM Framework
  • Objective To make it easy to parallelize
    existing Finite Element Method (FEM) Applications
    and to quickly build new parallel FEM
    applications including those with irregular and
    dynamic behavior
  • Hides the details of parallelism: the developer
    provides only sequential callback routines (see the
    sketch after this list)
  • Embedded mesh partitioning algorithms split the mesh
    into chunks that are mapped to different
    processors (many-to-one)
  • The developer's callbacks are executed in migratable
    threads, monitored by the run-time system
  • Migration of chunks to correct load imbalance
  • Examples
  • Pressure-driven crack propagation
  • 3-D Dendritic Growth
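To give a feel for the callback style described above, here is a purely hypothetical sketch; the names (Chunk, fem_driver, framework_update_nodes) are invented for illustration and are not the FEM framework's actual API.

// Hypothetical per-chunk driver in the callback style (illustrative only).
struct Chunk {
  int numLocalElements = 0;
  // ... element connectivity and registered nodal attributes ...
};

// Assumed framework hook: updates shared nodal properties across chunk boundaries.
void framework_update_nodes(Chunk&) {}

// Sequential callback written by the application developer; the framework runs one
// such driver per chunk inside a migratable thread, so chunks can be moved between
// processors to correct load imbalance.
void fem_driver(Chunk& chunk, int timesteps) {
  for (int step = 0; step < timesteps; ++step) {
    for (int e = 0; e < chunk.numLocalElements; ++e) {
      // ... purely local element computation; no explicit messaging ...
    }
    framework_update_nodes(chunk);   // communication and reductions handled by the framework
  }
}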

79
FEM Framework Responsibilities
[Diagram: division of responsibilities]
  • FEM Application: initialize, registration of nodal attributes, loops over elements, finalize
  • FEM Framework: update of nodal properties, reductions over nodes or partitions; Partitioner and Combiner (using METIS), I/O
  • Charm++: dynamic load balancing, communication
80
Crack Propagation
  • Explicit FEM code
  • Zero-volume Cohesive Elements inserted near the
    crack
  • As the crack propagates, more cohesive elements
    added near the crack, which leads to severe load
    imbalance
  • Framework handles
  • Partitioning elements into chunks
  • Communication between chunks
  • Load Balancing

Decomposition into 16 chunks (left) and 128
chunks, 8 for each PE (right). The middle area
contains cohesive elements. Pictures: S. Breitenfeld and P. Geubelle
81
Dendritic Growth
  • Studies evolution of solidification
    microstructures using a phase-field model
    computed on an adaptive finite element grid
  • Adaptive refinement and coarsening of grid
    involves re-partitioning

Work by Prof. Jon Dantzig, Jun-ho Jeong
82
(No Transcript)
83
Anonymous Compute Power
What is needed to make this metaphor work?
  • Timeshared parallel machines in the background, with effective resource management
  • Quality of computational service: contracts/guarantees
  • Front ends that will allow agents to submit jobs on users' behalf
Computational Faucets
84
Computational Faucets
  • What does a Computational faucet do?
  • Submit requests to the grid
  • Evaluate bids and decide whom to assign work
  • Monitor applications (for performance and
    correctness)
  • Provide interface to users
  • Interacting with jobs, and monitoring behavior
  • What does it look like?

A browser!
85
Faucets: QoS
  • User specifies desired job parameters such as
    program executable name, executable platform, min
    PE, max PE, estimated CPU-seconds (for various
    PE), priority, etc.
  • User does not specify machine. Faucet software
    contacts a central server and obtains a list of
    available workstation clusters, then negotiates
    with clusters and chooses one to submit the job.
  • User can view status of clusters.
  • Planned: file transfer, user authentication, merging
    with Appspector for job monitoring (a sketch of a job request follows below)
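One hypothetical way to picture such a request as a data structure; the field names are ours, not the Faucets software's actual format.

#include <map>
#include <string>

// Illustrative job-request record for the QoS negotiation described above (hypothetical fields).
struct JobRequest {
  std::string executable;                 // program executable name
  std::string platform;                   // executable platform
  int minPE = 1;                          // minimum number of processors acceptable
  int maxPE = 1;                          // maximum number of processors usable
  std::map<int, double> cpuSecondsForPE;  // estimated CPU-seconds for various PE counts
  int priority = 0;
  // Note: no machine is named; the faucet negotiates with available clusters and picks one.
};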

[Diagram: the Faucet client and a web browser talk to a central server, which negotiates with several workstation clusters.]
86
Timeshared Parallel Machines
  • Need resource management
  • Shrink and expand individual jobs to available
    sets of processors
  • Example Machine with 100 processors
  • Job1 arrives, can use 20-150 processors
  • Assign 100 processors to it
  • Job2 arrives, can use 30-70 processors,
  • and will pay more if we meet its deadline
  • Make resource allocation decisions (e.g., shrink Job1
    to 70 processors so Job2 can run within its window)

87
Time-shared Parallel Machines
  • To bid effectively (profitably) in such an
    environment, a parallel machine must be able to
    run well-paying (important) jobs, even when it is
    already running others.
  • The system allows a suitably written Charm++/Converse
    program running on a workstation cluster to
    dynamically change the number of CPUs it is
    running on, in response to a network (CCS)
    request.
  • Works in coordination with a Cluster Manager to
    give a job as many CPU's as are available when
    there are no other jobs, while providing the
    flexibility to accept new jobs and scale down.

88
Appspector
  • Appspector provides a web interface to submitting
    and monitoring parallel jobs.
  • Submission: user specifies machine, login,
    password, program name (which must already be
    available on the target machine).
  • Jobs can be monitored from any computer with a
    web browser. Advanced program information can be
    shown on the monitoring screen using CCS.

89
BioCoRE
Goal: Simulate the process of doing research.
Provide a web-based way to virtually bring
scientists together.
  • Project Based
  • Workbench for Modeling
  • Conferences/Chat Rooms
  • Lab Notebook
  • Joint Document Preparation

http://www.ks.uiuc.edu/Research/biocore/