Title: Parallel Programming for Particle Dynamics and Systems of Ordinary Differential Equations
1 Parallel Programming for Particle Dynamics and Systems of Ordinary Differential Equations
- Spring Semester 2005
- Geoffrey Fox
- Community Grids Laboratory
- Indiana University
- 505 N Morton
- Suite 224
- Bloomington IN
- gcf_at_indiana.edu
2 Abstract of Parallel Programming for Particle Dynamics
- This uses the simple O(N²) particle dynamics problem as a motivator to discuss the solution of ordinary differential equations
- We discuss Euler, Runge-Kutta and predictor-corrector methods
- Various message-parallel O(N²) algorithms are described with performance comments
- The initial discussion covers the types of problems that use these algorithms
- Advanced topics include fast multipole methods and their application to earthquake simulations
3 Classes of Physical Simulations
- Mathematical (numerical) formulations of simulations fall into a few key classes which have their own distinctive algorithmic and parallelism issues
- The most common formalism is that of a field theory, where quantities of interest are represented by densities defined over a 1, 2, 3 or 4 dimensional space
- Such a description could be fundamental, as in electromagnetism or relativity for the gravitational field, or approximate, as in CFD where a fluid density averages over a particle description
- Our Laplace example is of this form, where the field φ could either be fundamental (as in electrostatics) or approximate if it comes from the Euler equations for CFD
4 Applications Reducing to a Coupled Set of Ordinary Differential Equations
- Another set of models of physical systems represent them as a coupled set of discrete entities evolving over time
- Instead of φ(x,t) one gets φi(t), labeled by an index i
- Discretizing x in the continuous case leads to the discrete case, but in many cases the discrete formulation is fundamental
- Within the coupled discrete system class, one has two important approaches
- Classic time-stepped simulations -- loop over all i at fixed t, updating φi(t) to φi(t+δt)
- Discrete event simulations -- loop over all events representing changes of state of φi(t)
5 Particle Dynamics or Equivalent Problems
- Particles are sets of entities -- sometimes fixed (atoms in a crystal) or sometimes moving (galaxies in a universe)
- They are characterized by a force Fij on particle i due to particle j
- Forces are characterized by their range r: Fij(xi,xj) is zero if the distance |xi − xj| is greater than r
- Examples
- The universe
- A globular star cluster
- The atoms in a crystal vibrating under interatomic forces
- Molecules in a protein rotating and flexing under interatomic forces
- Laws of motion are typically ordinary differential equations
- Ordinary means differentiation with respect to one variable -- typically time
6 Classes of Particle Problems
- If the range r is small (as in a crystal), then one gets numerical formulations and parallel computing considerations similar to those in the Laplace example, with local communication
- We showed in the Laplace module that efficiency increases as the range of the force increases
- If r is infinite (no cut-off for the force), as in the gravitational problem, one finds rather different issues, which we will discuss in this module
- There are several non-particle problems discussed later that reduce to the long-range force problem, characterized by every entity interacting with every other entity
- Characterized by a calculation where updating entity i involves all other entities j
7 Circuit Simulations I
- An electrical or electronic network has the same structure as a particle problem, where the particles are components (transistor, resistance, inductance etc.) and the force between components i and j is nonzero if and only if i and j are linked in the circuit
- For simulations of electrical transmission networks (the electrical grid), one would naturally use classic time-stepped simulation, updating each component i from its state at time t to its state at time t+δt
- If one is simulating something like a chip, then the time-stepped approach is very wasteful, as 99.99% of the components are doing nothing (i.e. remain in the same state) at any given time step!
- Here is where discrete event simulations (DES) are useful, as one only computes where the action is
- Biological simulations are often formulated as networks where each component (say a neuron or a cell) is described by an ODE and the network couples components
8 Circuit Simulations II
- Discrete event simulations are clearly preferable on sequential machines, but parallel algorithms are hard due to the need for dynamic load balancing (events are dynamic and not uniform throughout the system) and synchronization (which events can be executed in parallel?)
- There are several important approaches to DES, of which the best known is the Time Warp method originally proposed by David Jefferson -- here one optimistically executes events in parallel and rolls back to an earlier state if this is found to be inconsistent
- Conservative methods (only execute those events you are certain cannot be impacted by earlier events) have little parallelism
- e.g. there is only one event with the lowest global time
- DES do not exhibit the classic loosely synchronous compute-communicate structure, as there is no uniform global time
- Typically, even with Time Warp, there is no scalable parallelism
9 Discrete Event Simulations
- Suppose we try to execute in parallel events E1 and E2 at times t1 and t2 with t1 < t2
- We show the timelines of several (4) objects in the system and our two events E1 and E2
- If E1 generates no interfering events, or one event E12 at a time greater than t2, then our parallel execution of E2 is consistent
- However, if E1 generates E12 before t2, then the execution of E2 has to be rolled back and E12 should be executed first
[Figure: timelines of four objects in the system plotted against time, showing events E1 and E2 and the generated events E11, E12, E21, E22]
10 Matrices and Graphs I
- Especially in cases where the force is linear in the φi(t), it is convenient to think of the force as being specified by a matrix M whose elements mij are nonzero if and only if the force between i and j is nonzero. A typical force law is Fi = Σj mij φj(t)
- In the Laplace equation example, the matrix M is sparse (most elements are zero), and this is an especially common case where one can and needs to develop efficient algorithms
- We discuss in another talk the matrix formulation in the case of partial differential equation solvers
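The linear force law above can be sketched in a few lines. This is my illustration, not code from the talk; the row-list representation of the sparse matrix M is an assumption chosen for simplicity.

```python
# Sketch of the linear force law F_i = sum_j m_ij * phi_j with a sparse M:
# each row of M is stored as a list of (j, m_ij) pairs for its nonzero entries.

def sparse_force(M_rows, phi):
    """M_rows[i]: list of (j, m_ij) pairs; returns the force on each entity i."""
    return [sum(m_ij * phi[j] for j, m_ij in row) for row in M_rows]

# Tiny example: entity 0 coupled to 1 and 2, entity 1 to 0, entity 2 isolated.
M_rows = [[(1, 2.0), (2, -1.0)],   # row 0
          [(0, 2.0)],              # row 1 (coupling back to 0)
          []]                      # row 2: no couplings
phi = [1.0, 3.0, 5.0]
print(sparse_force(M_rows, phi))   # [2*3 - 1*5, 2*1, 0] = [1.0, 2.0, 0.0]
```

Only the nonzero mij are stored and visited, which is exactly why the sparse (short-range) case admits efficient algorithms; a dense M would make every row length N.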
11 Matrices and Graphs II
- Another way of looking at these problems is as graphs G, where the nodes of the graph are labeled by the particles i, and one has an edge linking i to j if and only if the force Fij is nonzero
- In this language, long-range force problems correspond to a dense matrix M (all elements nonzero) and a fully connected graph G
[Figure: example graph with nodes labeled 1-12 and edges linking interacting particles]
12 Other N-Body Like Problems - I
- The characteristic structure of an N-body problem is an observable that depends on all pairs of entities from a set of N entities
- This structure is seen in diverse applications
- 1) Look at a database of items and calculate some form of correlation between all pairs of database entries
- 2) This was first used in studies of measurements of a "chaotic dynamical system" with points xi which are vectors of length m. Put rij = distance between xi and xj in m-dimensional space. Then the probability p(rij = r) is proportional to r^(d−1)
- where d (not equal to m) is the dynamical dimension of the system
- Calculate by forming all the rij (for i and j running over observable points from our system -- usually a time series) and accumulating them in a histogram of bins in r
- Parallel algorithm in a nutshell: store histograms replicated in all processors, distribute the vectors equally among the processors, and just pipeline the xj through the processors; as they pass through, accumulate the rij, and add the histograms together at the end
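The histogram algorithm in the last bullet can be simulated serially. A minimal sketch, assuming a round-robin distribution of points over P simulated processors; function and variable names are mine, and each unordered pair is counted twice (once from each endpoint's home slice), as in the pipeline where every xj passes every processor.

```python
# Sketch: all-pairs distance histogram via a simulated pipeline of P processors,
# each holding a slice of the points and a replicated histogram summed at the end.
import math

def histogram_pairs(points, nbins, rmax, P=4):
    slices = [points[p::P] for p in range(P)]     # distribute vectors equally
    hists = [[0] * nbins for _ in range(P)]       # replicated histograms
    for p, local in enumerate(slices):
        for xj in points:                          # every x_j passes through p
            for xi in local:
                r = math.dist(xi, xj)
                if 0 < r < rmax:                   # skip self-pairs (r == 0)
                    hists[p][int(nbins * r / rmax)] += 1
    return [sum(col) for col in zip(*hists)]       # add histograms together

pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]         # pair distances 5, 5, 10
print(histogram_pairs(pts, 12, 12.0, P=2))         # bin 5 -> 4, bin 10 -> 2
```

In the real message-parallel version the inner `for xj in points` loop is replaced by xj arriving in pipelined messages from the neighboring processor.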
13 Other N-Body Like Problems - II
- 3) The Green's function approach to simple partial differential equations gives solutions as integrals of known Green's functions times "source" or "boundary" terms
- For the simulation of earthquakes in the GEM project, the source terms are strains in the faults, and the stresses in any fault segment are the integral over the strains in all other segments
- Compared to particle dynamics, the force law is replaced by a Green's function, but in each case the total stress/force is a sum over contributions associated with the other entities in the formulation
- 4) In the so-called vortex method in CFD (Computational Fluid Dynamics), one models the Navier-Stokes equations as long-range interactions between entities which are the vortices
- 5) Chemistry uses molecular dynamics, so the particles are molecules, but the force is usually not Newton's law of gravitation but rather Van der Waals forces, which are long range but fall off faster than 1/r²
18 Euler's Method for ODEs
- Euler's method involves a simple linear extrapolation to proceed from point to point, with the ODE providing the slope
- Grid: ti = t0 + h·i, with X0 = initial value
- Xi+1 = Xi + h·f(ti, Xi)
- In practice it is not used, as it is not accurate enough
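The update rule above is a one-liner in a loop. A minimal sketch (my code, not from the slides):

```python
# Euler's method: X_{i+1} = X_i + h * f(t_i, X_i), stepped nsteps times.
import math

def euler(f, x0, t0, h, nsteps):
    t, x = t0, x0
    for _ in range(nsteps):
        x = x + h * f(t, x)   # linear extrapolation; the ODE provides the slope
        t = t + h
    return x

# Example: dX/dt = X with X(0) = 1, whose exact solution is e^t.
approx = euler(lambda t, x: x, 1.0, 0.0, 0.001, 1000)  # integrate to t = 1
print(approx, math.e)  # close to e, but only first-order accurate
```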
19 Estimate of Local Error in Euler's Method
- Let X(ti) be the numerical solution and Y(ti) the exact solution. We have the Taylor expansion Y(ti+h) = Y(ti) + h·Y'(ti) + O(h²)
- By assumption X(ti) = Y(ti), and in Euler's method we use the exact derivative at ti. Thus X(ti+h) = Y(ti+h) + O(h²), and so the local error is O(h²)
- Accumulating this error over 1/h steps gives a global error of order h
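The order-h global error can be checked numerically: halving h should roughly halve the error at a fixed final time. A sketch of that check (my code; the test problem dX/dt = X is an assumption chosen because its exact solution e^t is known):

```python
# Numerical check that Euler's global error is O(h).
import math

def euler(f, x0, h, nsteps):
    x, t = x0, 0.0
    for _ in range(nsteps):
        x += h * f(t, x)
        t += h
    return x

f = lambda t, x: x                                  # exact solution: e^t
err_h  = abs(math.e - euler(f, 1.0, 0.01,  100))    # step h = 0.01 to t = 1
err_h2 = abs(math.e - euler(f, 1.0, 0.005, 200))    # step h/2 to t = 1
print(err_h / err_h2)  # ~2: halving h halves the error, i.e. first order
```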
20 Relationship of Error to Computational Approach
- Whenever f satisfies certain smoothness conditions, there is always a sufficiently small step size h such that the difference between the real function value Yi+1 at ti+1 and the approximation Xi+1 is less than some required error magnitude ε (e.g. Burden and Faires)
- Euler's method is very quick, as there is one computation of the derivative function f at each step
- Other methods require more computation per step size h but can produce the specified error ε with fewer steps, as one can use a larger value of h. Thus Euler's method is not best.
21 Simple Example of Euler's Method
- The CSEP book (http://csep.hpcc.nectec.or.th/CSEP/) has lots of useful material. Consider a simple example from there
- With the trivial and uninteresting (as the step size is too large) choice h = 0.25, we get the values shown on the slide
[Worked-example equations shown on the original slide]
23 What's Wrong with Euler's Method?
- To extrapolate from ti to ti+1 one should (if using just a straight line) obviously best use the average slope (of the tangent) and NOT the value at the start
- This immediately gives an error smaller by a factor of h, but it isn't trivial, because you don't seem to know f(ti + 0.5h, X(ti + 0.5h))
24 Runge-Kutta Methods: Modified Euler I
- In the Runge-Kutta methods, one uses intermediate values to calculate such midpoint derivatives
- The key idea is to use an approximation for X(ti + 0.5h), as this is an argument of f, which is multiplied by h. Thus the error is (at least) one factor of h better than the approximation
- So if one wishes to do just one factor of h better than Euler, one can use Euler to estimate the value of X(ti + 0.5h)
25 Runge-Kutta Methods: Modified Euler II
- Midpoint method: k1 = f(ti, X(ti)); k2 = f(ti + 0.5h, X(ti + 0.5h)) with X(ti + 0.5h) estimated by Euler as Xi + 0.5h·k1; then Xi+1 = Xi + h·k2
- Global error is O(h²): a second-order method
[Figure: the tangent k1 at ti and the approximate tangent k2 at the midpoint, comparing the exact Yi+1 with the midpoint approximation Xi+1 over the interval from ti to ti+1]
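The midpoint step is barely more code than Euler. A sketch (mine, not from the slides), using the same test problem dX/dt = X so the improvement over Euler is visible:

```python
# Midpoint (modified Euler) method: second-order accurate.
import math

def midpoint(f, x0, h, nsteps):
    x, t = x0, 0.0
    for _ in range(nsteps):
        k1 = f(t, x)                            # slope at t_i
        k2 = f(t + 0.5 * h, x + 0.5 * h * k1)   # slope at Euler-estimated midpoint
        x += h * k2                             # step with the midpoint slope
        t += h
    return x

err = abs(math.e - midpoint(lambda t, x: x, 1.0, 0.01, 100))
print(err)  # ~4e-5, versus Euler's ~1.3e-2 at the same h = 0.01
```

The only extra cost is a second evaluation of f per step, which buys one more power of h in the global error.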
30 Time Discretization
- In ordinary differential equations, one focuses on different ways of representing dφi(t)/dt in some difference form, such as Euler's form dφi(t)/dt ≈ (φi(t+δt) − φi(t)) / δt
- Or more accurate Runge-Kutta and other formulae
- The same time discretization methods are applicable to (time derivatives in) partial differential equations involving time, and are used to approximate ∂φ(x,t)/∂t
34 The Mathematical Problem
- We have 3 long vectors of size N
- X = (X1, X2, X3, ..., XN)
- V = (V1, V2, V3, ..., VN)
- M = (M1, M2, M3, ..., MN), which is constant
- The critical equation is dVi/dt = sum over all other particles j of the force due to the j-th particle, which is proportional to Mj (Xj − Xi)/(distance between i and j)³; we term this sum Grav(X, M)
- We call the message-parallel (MPI) implementation of the calculation of Grav the function MPGrav
35 The Problem
- d(X,V)/dt = function of X, V, M
- Where each of X, V and M is an N-dimensional array
- And there are some scalars, such as the gravitational constant
36 Essential Ideas
- The array size N ranges from 100,000 (chemistry) to 100,000,000 (cosmology)
- In the simplest case of Euler's method
- X(t+dt) = X(t) + h·V(t)
- V(t+dt) = V(t) + h·Grav(vectors X and M)
- Runge-Kutta is more complicated, but this doesn't affect the essential issues for parallelism
- The X equation needs to be calculated for each component i; it takes time of order N
- The V equation also needs to be calculated for each component i, but now each component is complicated and needs a computation itself of O(N); the total time is of order N² and dominates
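The O(N) versus O(N²) split is easy to see in code. A serial sketch of the two update equations (my code; MPGrav is the talk's message-parallel version, and this plain `grav` is a 1D stand-in with G = 1):

```python
# Serial sketch of one Euler step for the N-body problem.

def grav(X, M):
    """O(N^2): acceleration on each particle, sum over j != i of
    M[j]*(X[j]-X[i]) / |X[j]-X[i]|^3 (1D, units with G = 1)."""
    N = len(X)
    A = [0.0] * N
    for i in range(N):
        for j in range(N):
            if i != j:
                d = X[j] - X[i]
                A[i] += M[j] * d / abs(d) ** 3
    return A

def euler_step(X, V, M, h):
    A = grav(X, M)                              # O(N^2) part: dominates
    Xn = [x + h * v for x, v in zip(X, V)]      # X equation: O(N)
    Vn = [v + h * a for v, a in zip(V, A)]      # V equation: O(N) once A is known
    return Xn, Vn
```

The X update and the V update proper are both O(N); all the cost, and all the communication in the parallel version, lives inside `grav`.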
37 Form of Parallel Computation
- The computation of the numerical method is inherently iterative: at each time step, the solution depends on the immediately preceding one
- At each time step, MPGrav is called (several times, as we are using Runge-Kutta)
- For each particle i, one must sum over the forces due to all other particles j
- This requires communication, as information about other particles is stored in other processors
- We will use fourth-order Runge-Kutta to integrate in time, and the program is designed around an overall routine looping over time, with parallelism hidden in the MPGrav routines and so identical between parallel and sequential versions
39 Status of Parallelism in Various N-Body Cases
- The data parallel approach is really only useful for the simple O(N²) case, and even here it is quite tricky to express the algorithm so that it is
- both space efficient and
- captures the factor of 2 savings from Newton's law of action and reaction, Fij = −Fji
- We have discussed these issues in a different foilset
- The shared memory approach is effective for a modest number of processors in both algorithms
- It is only likely to scale properly for the O(N²) case, as the compiler will find it hard to capture the clever ideas needed to make fast multipole efficient
- The message parallel approach gives you very efficient algorithms in both the O(N²) and O(N log N) cases
- The O(N²) case has very low communication
- The O(N log N) case has quite low communication
51 Relation to General Speed Up and Efficiency Analysis
- We discussed, in the case of the Jacobi example, that in many problems there is an elegant formula fcomm = constant · tcomm / (n^(1/d) · tfloat)
- d is the system information dimension, which is equal to the geometric dimension in problems like Jacobi iteration, where communication is a surface effect and calculation a volume effect
- This geometric type of formula arises in any case where we are solving a partial differential equation by a local algorithm (locality corresponds to geometry)
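The formula above can be evaluated directly. A tiny sketch (mine; the constant is taken as 1 and the n, tcomm, tfloat values are illustrative, not from the talk):

```python
# f_comm = constant * t_comm / (n^(1/d) * t_float), n = grain size per processor.
def f_comm(n, d, t_comm, t_float, const=1.0):
    return const * t_comm / (n ** (1.0 / d) * t_float)

# Surface/volume effect: at fixed grain size n, a higher information dimension d
# means a smaller n^(1/d) and hence relatively more communication overhead.
print(f_comm(n=10000, d=2, t_comm=10.0, t_float=1.0))  # 10/10000^(1/2) = 0.1
print(f_comm(n=10000, d=3, t_comm=10.0, t_float=1.0))  # 10/10000^(1/3) ~ 0.46
```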
54 Factor of 2 in N-Body Algorithm
- Symmetry of the force on particles: Fij = −Fji (Newton's law of action and reaction!)
- Only half the forces need to be computed
- Sequentially this can easily be taken into account: one just loops with a sum over particles i, then sums for the force calculation over particles j < i, calculates the algebraic form of Fij, and accumulates -- the force on i is incremented by Fij and the force on j is decremented by Fij
- The parallel version has two issues
- Firstly, one cannot use the "owner-computes" rule directly, as typically i and j are stored in different processors and one must calculate Fij in the wrong processor for either i or j (in the algorithm above, the force on the particle with the larger index is calculated in the wrong processor)
- Secondly, one must worry about load balancing, as processors responsible for low-numbered particles will have more work
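The sequential half-sum loop described above looks like this (my sketch, 1D gravity with G = 1):

```python
# Factor-of-2 force loop: compute F_ij only for j < i and apply it with both
# signs, using F_ij = -F_ji (Newton's law of action and reaction).

def forces_half(X, M):
    N = len(X)
    F = [0.0] * N
    for i in range(N):
        for j in range(i):                           # j < i: each pair once
            d = X[j] - X[i]
            f_ij = M[i] * M[j] * d / abs(d) ** 3     # force on i due to j
            F[i] += f_ij                             # force on i incremented
            F[j] -= f_ij                             # force on j decremented
    return F
```

Note the load imbalance the slide warns about: in this loop, particle i's row does i force evaluations, so a block decomposition over i gives high-numbered rows far more work, which motivates the scattered decomposition used later.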
59 Best Message Parallel N-Body Algorithm - III
- So the owner-computes rule is preserved, as the home processor for particle i accumulates the force on i and calculates its acceleration and the change in its velocity and position
- Ignoring blocking, we send a message for each particle around in a pipeline that visits each other processor
- The message contains the mass, the position, and the off-processor contribution to the force, which is incremented as the message travels through the processors
- The message returns to the home processor for each particle, and the travelling force is added into the part of the force that had been accumulated in the home processor
60 Best Message Parallel N-Body Algorithm - IV
- The key characteristic of the algorithm is rotating messages containing data about, and updating the force for, each particle
- The essential idea for efficiency is that when a message arrives in a new processor, it is used to update all the particles that it should in that processor, so the message visits each processor once and once only
- On average a message only updates 50% of the particles in each new processor
- Load balancing requires some sort of scattered decomposition, so each processor has as many high- as low-numbered particles
- Insensitivity to message latency requires that we block messages and send several at a time
- This algorithm is too hard to be found by a compiler unless you build it in
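The rotating-message scheme of slides 59-60 can be simulated in plain Python. This is my serial sketch, not real MPI code: each "processor" owns a scattered slice of particles, and the message for particle i (mass, position, accumulated force) visits every other processor once before coming home. The factor-of-2 and message-blocking optimizations are deliberately omitted for clarity.

```python
# Serial simulation of the pipelined ring of messages (1D gravity, G = 1).

def pairwise(m1, x1, m2, x2):
    d = x2 - x1
    return m1 * m2 * d / abs(d) ** 3               # force on particle 1 due to 2

def ring_forces(X, M, P):
    N = len(X)
    owner = [i % P for i in range(N)]               # scattered decomposition
    total = [0.0] * N
    for i in range(N):
        m, x, f = M[i], X[i], 0.0                   # the message for particle i
        for hop in range(1, P):                     # visit each other processor once
            p = (owner[i] + hop) % P
            for j in range(N):
                if owner[j] == p:                   # update against p's particles
                    f += pairwise(m, x, M[j], X[j])
        for j in range(N):                          # message home: add the force
            if owner[j] == owner[i] and j != i:     # accumulated among home particles
                f += pairwise(m, x, M[j], X[j])
        total[i] = f                                # owner-computes: home has it all
    return total
```

In a real MPI version the `hop` loop becomes a shift of message blocks to the neighboring processor, with all processors' messages rotating simultaneously, so the O(N²) work stays evenly spread while communication per step is only O(N/P) items.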