Transcript and Presenter's Notes

Title: Parallel Computing with Data-driven Objects


1
Parallel Computing with Data-driven Objects
  • Laxmikant (Sanjay) Kale
  • Parallel Programming Laboratory
  • Department of Computer Science
  • http://charm.cs.uiuc.edu

2
Parallel Programming Laboratory
  • Funding
  • Dept. of Energy (via the Rocket Center)
  • National Science Foundation
  • National Institutes of Health
  • Group Members

Affiliated (NIH/Biophysics): Jim Phillips, Kirby Vandivoort
Milind Bhandarkar, Terry Wilmarth, Orion Lawlor, Neelam Saboo, Arun Singla, Joshua Unger, Gengbin Zheng, Jay DeSouza, Sameer Kumar, Chee Wai Lee, Karthikeyan Mahesh
3
The Parallel Programming Problem
  • Is there one?
  • We can all write MPI programs, right?
  • Several Large Machines in use
  • But
  • New complex apps with dynamic and irregular
    structure
  • Should all application scientists also be experts
    in parallel computing?

4
What makes it difficult?
  • Multiple objectives
  • Correctness, Sequential efficiency, speedups
  • Nondeterminacy affects correctness
  • Several obstacles to speedup
  • communication costs
  • Load imbalances
  • Long critical paths

5
Parallel Programming
  • Decomposition
  • Decide what to do in parallel
  • Tasks (loop iterations, functions,.. ) that can
    be done in parallel
  • Mapping
  • Which processor does each task
  • Scheduling (sequencing)
  • On each processor
  • Machine dependent expression
  • Express the above decisions for the particular
    parallel machine

6
Spectrum of Parallel Languages
[Figure: spectrum of parallel languages. Axes: the level at which work is automated (decomposition, mapping, scheduling/sequencing, machine-dependent expression) versus specialization. A parallelizing Fortran compiler automates the most; MPI leaves the most to the programmer; Charm++ sits in between, automating mapping and scheduling while leaving decomposition to the programmer.]
7
Overview
  • Context: approach and methodology
  • Molecular dynamics for biomolecules
  • Our program, NAMD
  • Basic parallelization strategy
  • NAMD performance optimizations
  • Techniques
  • Results
  • Conclusions: summary, lessons, and future work

8
Parallel Programming Laboratory
  • Objective: Enhance performance and productivity
    in parallel programming
  • For complex, dynamic applications
  • Scalable to thousands of processors
  • Theme
  • Adaptive techniques for handling dynamic behavior
  • Strategy: Look for the optimal division of labor
    between the human programmer and the system
  • Let the programmer specify what to do in parallel
  • Let the system decide when and where to run them
  • Data-driven objects as the substrate: Charm++

9
System Mapped Objects
[Figure: system-mapped objects; the programmer's numbered objects are distributed across the processors by the runtime system.]
Data Driven Execution
[Figure: two processors, each with a scheduler and a message queue; the scheduler repeatedly picks a message from its queue and invokes the corresponding object's method.]
11
Charm++
  • Parallel C++ with data-driven objects
  • Object arrays and collections
  • Asynchronous method invocation (see the sketch below)
  • Object groups
  • A global object with a representative on each PE
  • Prioritized scheduling
  • Mature, robust, portable
  • http://charm.cs.uiuc.edu
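To make the object-array and asynchronous-invocation model above concrete, here is a minimal sketch written in the style of Charm++. The module, class, and method names (hello, Main, Worker, work, done) are ours, not from the slides, and the declaration syntax has varied across Charm++ versions; treat this as an illustration rather than a verbatim recipe.

// hello.ci -- interface file: declares chares, entry methods, and a readonly proxy
mainmodule hello {
  readonly CProxy_Main mainProxy;
  mainchare Main {
    entry Main(CkArgMsg *m);
    entry void done();
  };
  array [1D] Worker {
    entry Worker();
    entry void work(int step);     // asynchronous entry method
  };
};

// hello.C -- implementation
#include "hello.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int pending;
public:
  Main(CkArgMsg *m) : pending(100) {
    delete m;
    mainProxy = thisProxy;                        // readonly initialization
    CProxy_Worker w = CProxy_Worker::ckNew(100);  // 100 objects, mapped (and remappable) by the runtime
    w.work(0);                                    // asynchronous broadcast invocation on every element
  }
  void done() { if (--pending == 0) CkExit(); }   // called back once per element
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage *m) : CBase_Worker(m) {} // constructor used when the object migrates
  void work(int step) {
    // ... do one chunk of the computation ...
    mainProxy.done();                              // asynchronous invocation back to the main chare
  }
};

#include "hello.def.h"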

12
Multi-partition Decomposition
  • Writing applications with Charm++
  • Decompose the problem into a large number of chunks
  • Implement chunks as objects
  • Or, now, as MPI threads (AMPI on top of Charm++)
  • Let Charm++ map and remap objects
  • Allow for migration of objects
  • If desired, specify potential migration points

13
Load Balancing Mechanisms
  • Re-map and migrate objects
  • Registration mechanisms facilitate migration
  • Efficient message delivery strategies
  • Efficient global operations
  • Such as reductions and broadcasts
  • Several classes of load balancing strategies
    provided
  • Incremental
  • Centralized as well as distributed
  • Measurement based

14
Principle of Persistence
  • An observation about CSE applications
  • Extension of principle of locality
  • Behavior of objects, including computational load
    and communication patterns, tends to persist over time
  • Application-induced imbalance:
  • Abrupt but infrequent, or
  • Slow and cumulative
  • Rarely: frequent, large changes
  • Our framework still deals with this case as well
  • Measurement-based strategies

15
Measurement-Based Load Balancing Strategies
  • Collect timing data for several cycles
  • Run heuristic load balancer
  • Several alternative ones
  • Robert Brunner's recent Ph.D. thesis
  • Instrumentation framework
  • Strategies
  • Performance comparisons

16
Molecular Dynamics
ApoA-I 92k Atoms
17
Molecular Dynamics and NAMD
  • MD is used to understand the structure and
    function of biomolecules
  • Proteins, DNA, membranes
  • NAMD is a production-quality MD program
  • Active use by biophysicists (published science)
  • 50,000 lines of C++ code
  • 1000 registered users
  • Features include
  • CHARMM and XPLOR compatibility
  • PME electrostatics and multiple timestepping
  • Steered and interactive simulation via VMD

18
NAMD Contributors
  • PIs
  • Laxmikant Kale, Klaus Schulten, Robert Skeel
  • NAMD Version 1
  • Robert Brunner, Andrew Dalke, Attila Gursoy,
    Bill Humphrey, Mark Nelson
  • NAMD2
  • M. Bhandarkar, R. Brunner, Justin Gullingsrud, A.
    Gursoy, N. Krawetz, J. Phillips, A. Shinozaki, K.
    Varadarajan, Gengbin Zheng, ..

Theoretical Biophysics Group, supported by NIH
19
Molecular Dynamics
  • Collection of charged atoms, with bonds
  • Newtonian mechanics
  • At each time-step (see the sketch after this list)
  • Calculate forces on each atom
  • Bonds
  • Non-bonded electrostatic and van der Waals
  • Calculate velocities and advance positions
  • 1 femtosecond time-step, millions needed!
  • Thousands of atoms (1,000 - 100,000)
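The sequential structure being parallelized is the classic time-stepping loop, sketched below in C++. The type and function names are illustrative assumptions, not NAMD code, and the simple first-order update stands in for the real integrator.

#include <cstddef>
#include <vector>

// Schematic MD time step (not NAMD code): a real integrator (e.g., velocity Verlet) differs in detail.
struct Vec3 { double x = 0, y = 0, z = 0; };

// Placeholder force routines: in a real code these evaluate bonded terms
// (bonds, angles, dihedrals) and cutoff-limited non-bonded terms.
void computeBondedForces(const std::vector<Vec3>&, std::vector<Vec3>&) {}
void computeNonbondedForces(const std::vector<Vec3>&, std::vector<Vec3>&) {}

void mdStep(std::vector<Vec3>& pos, std::vector<Vec3>& vel,
            std::vector<Vec3>& frc, const std::vector<double>& mass, double dt) {
  for (std::size_t i = 0; i < frc.size(); ++i) frc[i] = Vec3{};   // zero the forces
  computeBondedForces(pos, frc);        // forces from bonds
  computeNonbondedForces(pos, frc);     // electrostatic and van der Waals (cut-off) forces
  for (std::size_t i = 0; i < pos.size(); ++i) {
    vel[i].x += dt * frc[i].x / mass[i];   // advance velocities
    vel[i].y += dt * frc[i].y / mass[i];
    vel[i].z += dt * frc[i].z / mass[i];
    pos[i].x += dt * vel[i].x;             // advance positions
    pos[i].y += dt * vel[i].y;
    pos[i].z += dt * vel[i].z;
  }
}
// A production run repeats mdStep millions of times with dt on the order of 1 femtosecond.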

20
Cut-off Radius
  • Use of cut-off radius to reduce work
  • 8 - 14 Å
  • Far away atoms ignored! (screening effects)
  • 80-95% of the work is non-bonded force computation
  • Some simulations need faraway contributions
  • Particle-Mesh Ewald (PME)
  • Even so, cut-off based computations are
    important
  • Near-atom calculations constitute the bulk of the
    above
  • Multiple time-stepping is used: k cut-off steps,
    1 PME step
  • So (k-1) steps do just cut-off based simulation (see the sketch below)
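A hedged sketch of that schedule as a driver loop; the function names are placeholders, and real multiple-time-stepping schemes combine the short- and long-range force terms more carefully than simply alternating them.

// Schematic driver for the multiple-time-stepping schedule (illustrative only).
void doCutoffForces() {}   // short-range (cut-off) force evaluation
void doPMEForces()    {}   // long-range PME contribution
void integrate()      {}   // advance velocities and positions

void runSimulation(long totalSteps, int k) {
  for (long step = 0; step < totalSteps; ++step) {
    doCutoffForces();                 // done every step
    if (step % k == 0) doPMEForces(); // done only every k-th step
    integrate();                      // so (k-1) of every k steps are cut-off only
  }
}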

21
Scalability
  • The program should scale up to use a large number
    of processors.
  • But what does that mean?
  • An individual simulation isnt truly scalable
  • Better definition of scalability
  • If I double the number of processors, I should
    be able to retain parallel efficiency by
    increasing the problem size

22
Isoefficiency
  • Quantify scalability
  • (Work of Vipin Kumar, U. Minnesota)
  • How much increase in problem size is needed to
    retain the same efficiency on a larger machine?
  • Efficiency = sequential time / (P × parallel time)
  • Parallel time = computation + communication + idle
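Spelling this out with the slide's definitions (the arithmetic here is ours): for atom decomposition on the next slide the communication-to-computation ratio is O(N) / O(N/P) = O(P), and for force decomposition it is O(N/√P) / O(N/P) = O(√P). Both ratios are independent of N, so no increase in problem size can bring the efficiency back up; that is why they are called not scalable here. Spatial decomposition's O(1) ratio, by contrast, lets efficiency be held constant by growing the problem with the machine.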

23
Early methods
  • Atom replication
  • Each processor has data for all atoms
  • Force calculations parallelized
  • Collection of forces: O(N log P) communication
  • Computation: O(N/P)
  • Communication-to-computation ratio: O(P log P). Not scalable
  • Atom Decomposition
  • Partition the atoms array across processors
  • Nearby atoms may not be on the same processor
  • Communication: O(N) per processor
  • Ratio: O(N) / O(N/P) = O(P). Not scalable

24
Force Decomposition
  • Distribute force matrix to processors
  • Matrix is sparse, non-uniform
  • Each processor has one block
  • Communication: O(N/√P) per processor
  • Ratio: O(√P)
  • Better scalability in practice
  • Can use 100 processors
  • Plimpton; Hwang, Saltz, et al.
  • Speedups of 6 on 32 processors and 36 on 128 processors
  • Yet not scalable in the sense defined here!

25
Spatial Decomposition
  • Allocate close-by atoms to the same processor
  • Three variations possible
  • Partitioning into P boxes, 1 per processor
  • Good scalability, but hard to implement
  • Partitioning into fixed size boxes, each a little
    larger than the cut-off distance
  • Partitioning into smaller boxes
  • Communication O(N/P)
  • Communication/Computation ratio O(1)
  • So, scalable in principle

26
Ongoing work
  • Plimpton, Hendrickson
  • new spatial decomposition
  • NWChem (PNL)
  • Peter Kollman, Yong Duan et al
  • microsecond simulation
  • AMBER version (SANDER)

27
Spatial Decomposition in NAMD
But the load balancing problems are still severe
28
Hybrid Decomposition
29
FD + SD
  • Now, we have many more objects to load balance
  • Each diamond can be assigned to any processor
  • Number of diamonds (3D): 14 × number of patches

30
Bond Forces
  • Multiple types of forces
  • Bonds(2), angles(3), dihedrals (4), ..
  • Luckily, each involves atoms in neighboring
    patches only
  • Straightforward implementation
  • Send message to all neighbors,
  • receive forces from them
  • 26 × 2 messages per patch!

31
Bond Forces
  • Assume one patch per processor
  • An angle force involving atoms in patches
    (x1,y1,z1), (x2,y2,z2), and (x3,y3,z3) is calculated
    in patch (max xi, max yi, max zi); see the sketch below


[Figure: three neighboring patches A, B, C contributing atoms to one angle term.]
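A tiny C++ sketch of that ownership rule (the PatchIndex type and function name are ours):

#include <algorithm>

struct PatchIndex { int x, y, z; };

// The patch that owns an angle term spanning patches a, b, and c:
// the coordinate-wise maximum, so exactly one patch computes each term.
PatchIndex ownerPatch(const PatchIndex& a, const PatchIndex& b, const PatchIndex& c) {
  return { std::max({a.x, b.x, c.x}),
           std::max({a.y, b.y, c.y}),
           std::max({a.z, b.z, c.z}) };
}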
32
NAMD Implementation
  • Multiple objects per processor
  • Different types: patches, pairwise force computations,
    bonded force computations
  • Each may have its data ready at different times
  • Need ability to map and remap them
  • Need prioritized scheduling
  • Charm++ supports all of these

33
Load Balancing
  • Is a major challenge for this application
  • Especially for a large number of processors
  • Unpredictable workloads
  • Each diamond (force compute object) and patch
    encapsulates a variable amount of work
  • Static estimates are inaccurate
  • Very slow variations across timesteps
  • Measurement-based load balancing framework

[Figure: one compute object connected to two cells (patches).]
34
Bipartite Graph Balancing
  • Background load
  • Patches (integration, etc.) and bond-related
    forces
  • Migratable load
  • Non-bonded forces
  • Bipartite communication graph
  • Between migratable and non-migratable objects
  • Challenge
  • Balance load while minimizing communication

35
Load Balancing Strategy
Greedy variant (simplified):
  Sort compute objects (diamonds) by load
  Repeat (until all are assigned):
    S = set of all processors that
        -- are not overloaded, and
        -- generate the least new communication
    P = least loaded processor in S
    Assign the heaviest remaining compute to P

Refinement:
  Repeat:
    - Pick a compute from the most overloaded PE
    - Assign it to a suitable underloaded PE
  Until no movement
(A C++ sketch of the greedy phase follows below.)
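The C++ sketch below mirrors the greedy phase under simplifying assumptions: the Compute and Proc types, the 1.1 × average-load threshold for "overloaded", and the stubbed commCost term are all ours, not the actual NAMD load balancer.

#include <algorithm>
#include <limits>
#include <vector>

struct Compute { double load; int id; int assignedPE = -1; };
struct Proc    { double load = 0; };

double commCost(const Compute&, int /*pe*/) { return 0; }  // stub: new communication added by this choice

double averageLoad(const std::vector<Proc>& procs, const std::vector<Compute>& cs) {
  double total = 0;
  for (const auto& c : cs) total += c.load;
  return total / procs.size();
}

void greedyAssign(std::vector<Compute>& computes, std::vector<Proc>& procs) {
  const double threshold = 1.1 * averageLoad(procs, computes);   // "overloaded" cutoff (assumed)
  std::sort(computes.begin(), computes.end(),                    // heaviest computes first
            [](const Compute& a, const Compute& b) { return a.load > b.load; });
  for (auto& c : computes) {
    int best = -1;
    double bestKey = std::numeric_limits<double>::max();
    for (int pe = 0; pe < (int)procs.size(); ++pe) {
      if (procs[pe].load + c.load > threshold) continue;         // skip PEs that would overload
      // Prefer least new communication, then least loaded PE.
      double key = commCost(c, pe) * 1e6 + procs[pe].load;
      if (key < bestKey) { bestKey = key; best = pe; }
    }
    if (best < 0) {                                              // fall back: least loaded PE
      best = (int)(std::min_element(procs.begin(), procs.end(),
               [](const Proc& a, const Proc& b) { return a.load < b.load; }) - procs.begin());
    }
    c.assignedPE = best;
    procs[best].load += c.load;
  }
}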
36
(No Transcript)
37
Speedups in 1998
ApoA-I 92k atoms
38
Optimizations
  • Series of optimizations
  • Examples discussed here
  • Grainsize distributions (bimodal)
  • Integration message sending overheads
  • Several other optimizations
  • Separation of bond/angle/dihedral objects
  • Inter-patch and intra-patch
  • Prioritization
  • Local synchronization to avoid interference
    across steps

39
Grainsize and Amdahl's Law
  • A variant of Amdahls law, for objects, would be
  • The fastest time can be no shorter than the time
    for the biggest single object!
  • How did it apply to us?
  • Sequential step time was 57 seconds
  • To run on 2k processors, no object should take more
    than 28 msecs
  • Should be even shorter
  • Grainsize analysis via Projections showed that this
    was not so

40
Grainsize Analysis
Problem: some compute objects had far too much work (bimodal grainsize distribution).
Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms.
41
Grainsize Reduced
42
Performance Audit
  • Through the optimization process,
  • an audit was kept to decide where to look to
    improve performance

Component      Ideal    Actual
Total          57.04    86
nonBonded      52.44    49.77
Bonds           3.16     3.9
Integration     1.44     3.05
Overhead        0        7.97
Imbalance       0       10.45
Idle            0        9.25
Receives        0        1.61

Integration time doubled
43
Integration Overhead Analysis
Problem: integration time had doubled compared to the sequential run
44
Integration Overhead Example
  • The projections pictures showed the overhead was
    associated with sending messages.
  • Many cells were sending 30-40 messages.
  • The overhead was still too much compared with the
    cost of messages.
  • Code analysis: memory allocations!
  • An identical message was being sent to 30 processors
  • Simple multicast support was added to Charm++
  • Mainly eliminates memory allocations (and some copying); see the sketch below
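The shape of the fix, as a hedged sketch; ForceMsg, send, and multicast here are hypothetical stand-ins, not the actual Charm++ interface that was added.

#include <vector>

// Hypothetical stand-ins (not the real Charm++ multicast API).
struct ForceMsg { std::vector<double> data; };
void send(int /*pe*/, ForceMsg* /*m*/) {}                            // point-to-point send (stub)
void multicast(const std::vector<int>& /*pes*/, ForceMsg* /*m*/) {}  // one-shot multicast (stub)

void sendForces(const std::vector<int>& neighborPEs, const std::vector<double>& myData) {
  // Before: one message allocation (and copy) per destination, roughly 30 per patch per step:
  //   for (int pe : neighborPEs) send(pe, new ForceMsg{myData});
  // After: allocate the message once and let the runtime deliver it everywhere
  // (ownership of the message passes to the runtime in this model):
  multicast(neighborPEs, new ForceMsg{myData});
}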

45
Integration Overhead After Multicast
46
ApoA-I on ASCI Red
57 ms/step
47
ApoA-I on Origin 2000
48
ApoA-I on Linux Cluster
49
ApoA-I on O2K and T3E
50
ApoA-I on T3E
51
BC1 complex 200k atoms
52
BC1 on ASCI Red
58.4 GFlops
53
Lessons Learned
  • Need to downsize objects!
  • Choose smallest possible grainsize that amortizes
    overhead
  • One of the biggest challenges
  • Was getting time for performance-tuning runs on
    parallel machines

54
ApoA-I with PME on T3E
55
Future and Planned Work
  • Increased speedups on 2k-10k processors
  • Smaller grainsizes
  • Parallelizing integration further
  • New algorithms for reducing communication impact
  • New load balancing strategies
  • Further performance improvements for PME
  • With multiple timestepping
  • Needs multi-phase load balancing
  • Speedup on small molecules!
  • Interactive molecular dynamics

56
More Information
  • Charm++ and associated framework
  • http://charm.cs.uiuc.edu
  • NAMD and associated biophysics tools
  • http://www.ks.uiuc.edu
  • Both include downloadable software

57
Initial Speedup Results ASCI Red
58
Performance: Size of System
Performance data on Cray T3E
59
Performance: Various Machines
60
Charm++
  • Data Driven Objects
  • Asynchronous method invocation
  • Prioritized scheduling
  • Object Arrays
  • Object Groups
  • global object with a representative on each PE
  • Information sharing abstractions
  • readonly data
  • accumulators
  • distributed tables

61
Data Driven Execution
[Figure: data-driven execution; objects on each processor, with a scheduler per processor picking messages from its message queue.]
62
CkChareID mainhandle;

main::main(CkArgMsg *m)            // execution begins here; m carries argc/argv
{
  int i, low = 0;
  for (i = 0; i < 100; i++)
    new CProxy_piPart();           // fire off 100 piPart chares
  responders = 100;
  count = 0;
  mainhandle = thishandle;         // readonly initialization
}                                  // the scheduler takes over after the method returns

void main::results(DataMsg *msg)
{
  count += msg->count;
  if (0 == --responders) {
    CkPrintf("pi = %f \n", 4.0 * count / 100000);
    CkExit();                      // exit the scheduler after this method returns
  }
}
63
piPart::piPart()
{
  // declarations..
  CProxy_main mainproxy(mainhandle);
  srand48((long) this);
  mySamples = 100000 / 100;
  for (i = 0; i < mySamples; i++) {
    x = drand48();
    y = drand48();
    if ((x*x + y*y) < 1.0) localCount++;
  }
  DataMsg *result = new DataMsg;
  result->count = localCount;
  mainproxy.results(result);       // asynchronous invocation on the main chare
  delete this;
}
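For completeness, a plausible interface (.ci) declaration that would accompany the code above. This is our reconstruction, not shown on the slides, so details may differ from the original program (responders, count, and localCount would be ordinary C++ variables rather than interface declarations).

// pgm.ci (reconstructed sketch)
mainmodule pgm {
  readonly CkChareID mainhandle;   // set in main::main, readable everywhere afterwards
  message DataMsg;                 // carries the per-chare hit count
  mainchare main {
    entry main(CkArgMsg *m);       // execution begins here
    entry void results(DataMsg *msg);
  };
  chare piPart {
    entry piPart(void);            // samples in the constructor, then deletes itself
  };
};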
64
Group Mission and Approach
  • To enhance Performance and Productivity in
    programming complex parallel applications
  • Approach: application-oriented yet CS-centered research
  • Develop enabling technology for many apps
  • Develop, use, and test it in the context of real applications
  • Theme
  • Adaptive techniques for irregular and dynamic applications
  • Optimal division of labor between system and programmer
  • Decomposition done by programmer, everything else
    automated
  • Develop standard library for parallel programming
    of reusable components

65
Active Projects
  • Charm++ / Converse parallel infrastructure
  • Scientific/Engineering apps
  • Molecular Dynamics
  • Rocket Simulation
  • Finite Element Framework
  • Web-based interaction and monitoring
  • Faucets: anonymous compute power
  • Parallel Operations Research, discrete event simulation,
    combinatorial search

66
Charm++: Parallel C++ with Data-Driven Objects
  • Chares: dynamically balanced objects
  • Object Groups
  • global object with a representative on each PE
  • Object Arrays/ Object Collections
  • User defined indexing (1D,2D,..,quad and
    oct-tree,..)
  • System supports remapping and forwarding
  • Asynchronous method invocation
  • Prioritized scheduling
  • Mature, robust, portable
  • http://charm.cs.uiuc.edu

Data driven Execution
67
Multi-partition Decomposition
  • Idea: divide the computation into a large number
    of pieces
  • Independent of number of processors
  • Typically larger than number of processors
  • Let the system map entities to processors

68
Converse
  • Portable parallel run-time system that allows
    interoperability among parallel languages
  • Rich features to allow quick and efficient
    implementation of new parallel languages
  • Based on message-driven execution that allows
    co-existence of different control regimes
  • Support for debugging and performance analysis of
    parallel programs
  • Support for building parallel servers

69
Converse
  • Languages and paradigms implemented
  • Charm++, a parallel object-oriented language
  • Thread-safe MPI and PVM
  • Parallel Java, message-driven Perl, pC++
  • Platforms supported
  • SGI Origin 2000, IBM SP, ASCI Red, Cray T3E, Convex Exemplar
  • Workstation clusters (Solaris, HP-UX, AIX, Linux, etc.)
  • Windows NT clusters

[Figure: layered architecture; paradigms (languages, libraries) on top of Converse, on top of parallel machines.]
70
Adaptive MPI
  • A bridge between legacy MPI codes and the dynamic
    load balancing capabilities of Charm++
  • AMPI = MPI + dynamic load balancing
  • Based on Charm++ object arrays and Converse's
    migratable threads
  • Minimal modification needed to convert existing
    MPI programs (to be automated in future); see the example below
  • Bindings for C, C++, and Fortran 90
  • Currently supports most of the MPI 1.1 standard
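As an illustration of the "minimal modification" point above, an ordinary MPI program such as the following generic sketch could be built against AMPI so that each rank becomes a migratable user-level thread; the program is ours, not from the slides, and AMPI-specific build steps (and its handling of global variables) are omitted.

// A plain MPI program; under AMPI each MPI rank becomes a migratable
// user-level thread, so many ranks can share one physical processor.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int local = rank + 1, total = 0;     // each rank contributes one value
  MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) std::printf("sum over %d ranks = %d\n", size, total);

  MPI_Finalize();
  return 0;
}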

71
Converse Use in NAMD
72
Molecular Dynamics
  • Collection of charged atoms, with bonds
  • Newtonian mechanics
  • At each time-step
  • Calculate forces on each atom
  • bonds
  • non-bonded electrostatic and van der Waals
  • Calculate velocities and advance positions
  • 1 femtosecond time-step, millions needed!
  • Thousands of atoms (1,000 - 200,000)

Collaboration with Klaus Schulten, Robert Skeel
73
Spatial Decomposition in NAMD
  • Space divided into cubes
  • Forces between atoms in neighboring cubes
    computed by individual compute objects
  • Compute objects are remapped by load balancer

74
NAMD a Production-quality MD Program
  • NAMD is used by biophysicists routinely, with
    several published results
  • NIH funded collaborative effort with Profs. K.
    Schulten and R. Skeel
  • Supports full-range electrostatics
  • Parallel Particle-Mesh Ewald for periodic systems and
    fast multipole for aperiodic systems
  • Implemented in C++/Charm++
  • Supports visualization (via VMD), interactive MD,
    and a haptic interface
  • See http://www.ks.uiuc.edu
  • Part of Biophysics collaboratory

ApoLipoprotein A1
75
NAMD Scalable Performance
Sequential performance of NAMD (a C++ program) is
comparable to or better than contemporary MD
programs written in Fortran.
Gordon Bell award finalist, SC2000
Speedup of 1250 on 2048 processors of ASCI Red,
simulating BC1 with about 200k atoms (compare
with the best speedups on production-quality MD codes
by others: 170 on 256 processors). Around 10,000
varying-size objects mapped by the load balancer.
76
Rocket Simulation
  • Rocket behavior (and therefore its simulation) is
    irregular, dynamic
  • We need to deal with dynamic variations
    adaptively
  • Dynamic behavior arises from
  • Combustion: moving boundaries
  • Crack propagation
  • Evolution of the system

77
Rocket Simulation
  • Our Approach
  • Multi-partition decomposition
  • Data-driven objects (Charm++)
  • Automatic load balancing framework
  • AMPI: migration path for existing MPI/Fortran 90
    codes
  • ROCFLO, ROCSOLID, and ROCFACE

78
FEM Framework
  • Objective To make it easy to parallelize
    existing Finite Element Method (FEM) Applications
    and to quickly build new parallel FEM
    applications including those with irregular and
    dynamic behavior
  • Hides the details of parallelism: the developer
    provides only sequential callback routines (see the
    sketch after this list)
  • Embedded mesh partitioning algorithms split the mesh
    into chunks that are mapped to different
    processors (many-to-one)
  • The developer's callbacks are executed in migratable
    threads, monitored by the run-time system
  • Migration of chunks to correct load imbalance
  • Examples
  • Pressure-driven crack propagation
  • 3-D Dendritic Growth
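To give a feel for the callback style described above, here is a purely hypothetical sketch; the names (Chunk, fem_driver, framework_update_nodes) are invented for illustration and are not the FEM framework's actual API.

// Hypothetical per-chunk driver in the callback style (illustrative only).
struct Chunk {
  int numLocalElements = 0;
  // ... element connectivity and registered nodal attributes ...
};

// Assumed framework hook: updates shared nodal properties across chunk boundaries.
void framework_update_nodes(Chunk&) {}

// Sequential callback written by the application developer; the framework runs one
// such driver per chunk inside a migratable thread, so chunks can be moved between
// processors to correct load imbalance.
void fem_driver(Chunk& chunk, int timesteps) {
  for (int step = 0; step < timesteps; ++step) {
    for (int e = 0; e < chunk.numLocalElements; ++e) {
      // ... purely local element computation; no explicit messaging ...
    }
    framework_update_nodes(chunk);   // communication and reductions handled by the framework
  }
}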

79
FEM Framework Responsibilities
[Diagram: division of responsibilities]
  • FEM Application: initialize, registration of nodal attributes, loops over elements, finalize
  • FEM Framework: update of nodal properties, reductions over nodes or partitions; Partitioner and Combiner (using METIS), I/O
  • Charm++: dynamic load balancing, communication
80
Crack Propagation
  • Explicit FEM code
  • Zero-volume Cohesive Elements inserted near the
    crack
  • As the crack propagates, more cohesive elements
    added near the crack, which leads to severe load
    imbalance
  • Framework handles
  • Partitioning elements into chunks
  • Communication between chunks
  • Load Balancing

Decomposition into 16 chunks (left) and 128
chunks, 8 for each PE (right). The middle area
contains cohesive elements. Pictures: S. Breitenfeld and P. Geubelle
81
Dendritic Growth
  • Studies evolution of solidification
    microstructures using a phase-field model
    computed on an adaptive finite element grid
  • Adaptive refinement and coarsening of grid
    involves re-partitioning

Work by Prof. Jon Dantzig, Jun-ho Jeong
82
(No Transcript)
83
Anonymous Compute Power
What is needed to make this metaphor work?
  • Timeshared parallel machines in the background, with effective resource management
  • Quality of computational service: contracts/guarantees
  • Front ends that will allow agents to submit jobs on users' behalf
Computational Faucets
84
Computational Faucets
  • What does a Computational faucet do?
  • Submit requests to the grid
  • Evaluate bids and decide whom to assign work
  • Monitor applications (for performance and
    correctness)
  • Provide interface to users
  • Interacting with jobs, and monitoring behavior
  • What does it look like?

A browser!
85
Faucets: QoS
  • User specifies desired job parameters such as
    program executable name, executable platform, min
    PE, max PE, estimated CPU-seconds (for various
    PE), priority, etc.
  • User does not specify machine. Faucet software
    contacts a central server and obtains a list of
    available workstation clusters, then negotiates
    with clusters and chooses one to submit the job.
  • User can view status of clusters.
  • Planned: file transfer, user authentication, merging
    with Appspector for job monitoring (a sketch of a job request follows below)
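One hypothetical way to picture such a request as a data structure; the field names are ours, not the Faucets software's actual format.

#include <map>
#include <string>

// Illustrative job-request record for the QoS negotiation described above (hypothetical fields).
struct JobRequest {
  std::string executable;                 // program executable name
  std::string platform;                   // executable platform
  int minPE = 1;                          // minimum number of processors acceptable
  int maxPE = 1;                          // maximum number of processors usable
  std::map<int, double> cpuSecondsForPE;  // estimated CPU-seconds for various PE counts
  int priority = 0;
  // Note: no machine is named; the faucet negotiates with available clusters and picks one.
};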

[Diagram: the Faucet client and a web browser talk to a central server, which negotiates with several workstation clusters.]
86
Timeshared Parallel Machines
  • Need resource management
  • Shrink and expand individual jobs to available
    sets of processors
  • Example Machine with 100 processors
  • Job1 arrives, can use 20-150 processors
  • Assign 100 processors to it
  • Job2 arrives, can use 30-70 processors,
  • and will pay more if we meet its deadline
  • Make resource allocation decisions (e.g., shrink Job1
    to 70 processors so Job2 can run within its window)

87
Time-shared Parallel Machines
  • To bid effectively (profitably) in such an
    environment, a parallel machine must be able to
    run well-paying (important) jobs, even when it is
    already running others.
  • The system allows a suitably written Charm++/Converse
    program running on a workstation cluster to
    dynamically change the number of CPUs it is
    running on, in response to a network (CCS)
    request.
  • Works in coordination with a Cluster Manager to
    give a job as many CPU's as are available when
    there are no other jobs, while providing the
    flexibility to accept new jobs and scale down.

88
Appspector
  • Appspector provides a web interface to submitting
    and monitoring parallel jobs.
  • Submission: user specifies machine, login,
    password, program name (which must already be
    available on the target machine).
  • Jobs can be monitored from any computer with a
    web browser. Advanced program information can be
    shown on the monitoring screen using CCS.

89
BioCoRE
Goal: Simulate the process of doing research.
Provide a web-based way to virtually bring
scientists together.
  • Project Based
  • Workbench for Modeling
  • Conferences/Chat Rooms
  • Lab Notebook
  • Joint Document Preparation

http://www.ks.uiuc.edu/Research/biocore/