Title: Scheduling Many-Body Short Range MD Simulations on a Cluster of Workstations and Custom VLSI Hardware
- Sumanth J.V., David R. Swanson and Hong Jiang
- University of Nebraska-Lincoln
We thank ONR, RCF and SDI (NSF 0091900) for funding this research, and Dr. Kenji Yasuoka and Dr. Takahiro Koishi for many useful discussions.
2. Introduction
- MD is very computationally intensive.
- Fortunately, MD is highly parallelizable and can therefore be implemented efficiently on a cluster.
- Custom VLSI solutions such as the MD-GRAPE 2 are another approach, but they are limited in the kinds of potential functions they can evaluate.
- We combine these two techniques to gain the advantages of both.
3. Computational Aspects of MD
- Perform time integration of the classical equations of motion.
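The slide's equation did not survive extraction; in standard MD notation, the equations being integrated are Newton's equations of motion for N interacting atoms:

```latex
m_i \frac{d^2 \mathbf{r}_i}{dt^2} = \mathbf{F}_i = -\nabla_{\mathbf{r}_i} U(\mathbf{r}_1, \dots, \mathbf{r}_N),
\qquad i = 1, \dots, N
```

where U is the interatomic potential energy function discussed on the following slides.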
4. Potential Function
- We restrict ourselves to 2-body and 3-body potentials.
- The MD-GRAPE 2 is designed to compute only 2-body potentials.
- The cluster, however, can be programmed to evaluate any kind of potential.
- We use a combination of the cluster and the MD-GRAPE 2 board to evaluate a 3-body potential.
5. Lennard-Jones (LJ) Potential
- A very simple empirical 2-body potential.
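The potential itself is not shown in the extracted text; in its standard form, with well depth ε and length scale σ, it is:

```latex
V_{\mathrm{LJ}}(r) = 4\varepsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right]
```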
6. Reactive Bond Order (REBO) Potential
7. Simple MD Algorithm
- Execution time can be improved by using a cut-off radius, neighbor lists, the link cell method, or a combination of these.
- A cut-off radius introduces discontinuities; these can be overcome by smoothing the potential function.
- Velocity-Verlet integration.
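As an illustration, a minimal velocity-Verlet step applied to a 1-D harmonic oscillator (the oscillator and its parameters are illustrative stand-ins, not from the slides):

```python
import numpy as np

def velocity_verlet_step(r, v, a, force, mass, dt):
    """One velocity-Verlet step: update positions, then forces, then velocities."""
    r_new = r + v * dt + 0.5 * a * dt**2      # positions at t + dt
    a_new = force(r_new) / mass               # forces at the new positions
    v_new = v + 0.5 * (a + a_new) * dt        # velocities use old and new forces
    return r_new, v_new, a_new

# Illustrative system: harmonic oscillator F = -k r with k = m = 1.
force = lambda r: -r
r, v = np.array([1.0]), np.array([0.0])
a = force(r)
for _ in range(1000):
    r, v, a = velocity_verlet_step(r, v, a, force, 1.0, 0.01)

# Velocity-Verlet is symplectic, so total energy stays near its initial 0.5.
energy = 0.5 * v[0]**2 + 0.5 * r[0]**2
```

The same update applies componentwise to a full N-atom position array; only the force routine changes.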
8. Boundary Conditions
Minimum Image Convention
Periodic Boundary Conditions
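Both conventions can be sketched in a few lines (the cubic box length L and the coordinates below are illustrative):

```python
import numpy as np

L = 10.0  # illustrative cubic box length

def wrap(positions):
    """Periodic boundary conditions: fold coordinates back into [0, L)."""
    return positions % L

def minimum_image(dr):
    """Minimum image convention: replace each displacement component with
    the nearest periodic copy, so every component lies in [-L/2, L/2)."""
    return dr - L * np.round(dr / L)

# Two atoms near opposite faces are actually close through the boundary.
r1 = np.array([0.5, 0.5, 0.5])
r2 = np.array([9.5, 9.5, 9.5])
dist = np.linalg.norm(minimum_image(r1 - r2))  # ~1.73, not ~15.6
```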
9. Neighbor Lists
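The slide's content did not survive extraction; a minimal O(N²) neighbor-list build with a skin radius looks like this (radii and coordinates are illustrative):

```python
import numpy as np

def build_neighbor_list(positions, r_cut, skin=0.3):
    """List all pairs within r_cut + skin. The extra skin lets the list be
    reused for several time steps before atoms drift far enough to force
    a rebuild."""
    n = len(positions)
    r_list = r_cut + skin
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(positions[i] - positions[j]) < r_list:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors

pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
nl = build_neighbor_list(pos, r_cut=2.0)  # atom 2 has no neighbors
```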
10. Link Cell Method
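Again only the title survives; the idea can be sketched by binning atoms into cells of side at least r_cut, so that each atom's neighbor search scans only its own cell and the 26 adjacent ones (box size and positions are illustrative):

```python
import numpy as np
from collections import defaultdict

def bin_into_cells(positions, box, r_cut):
    """Assign each atom to a cubic cell of side >= r_cut; candidate neighbors
    of an atom then lie in its own cell and the 26 surrounding cells."""
    n_cells = max(1, int(box // r_cut))   # cells per side
    side = box / n_cells
    cells = defaultdict(list)
    for idx, r in enumerate(positions):
        key = tuple((r // side).astype(int) % n_cells)  # periodic wrap
        cells[key].append(idx)
    return cells, n_cells

pos = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [7.0, 7.0, 7.0]])
cells, n_cells = bin_into_cells(pos, box=10.0, r_cut=2.5)
```

This reduces the pair search from O(N²) to roughly O(N) for short-range potentials at fixed density.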
11. Parallel MD: Atom Decomposition
- Divide the N atoms into sets of N/P atoms and assign each set to one of the P processors.
- At every time step, two global communication operations are required (one to update positions and one to update forces).
- Runs in time proportional to the square of the number of atoms N.
- Very good efficiency, but long running times.
- A suitable technique if the system is dense.
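The decomposition and the two global exchanges can be sketched serially, with P "ranks" simulated in one process (a toy model, no actual MPI; sizes and the pair force are illustrative):

```python
import numpy as np

N, P = 12, 4  # atoms and simulated "processors" (illustrative sizes)
positions = np.random.default_rng(0).random((N, 3))

# Each rank owns a contiguous block of N/P atoms.
blocks = [range(p * N // P, (p + 1) * N // P) for p in range(P)]

# Global exchange 1 (allgather): every rank needs all positions.
gathered = np.concatenate([positions[list(b)] for b in blocks])

def rank_forces(own, all_pos):
    """Each rank does O(N^2 / P) pair work, filling only its own atoms' rows."""
    f = np.zeros((len(all_pos), 3))
    for i in own:
        for j in range(len(all_pos)):
            if i != j:
                d = all_pos[i] - all_pos[j]
                f[i] += d / (np.linalg.norm(d) ** 3 + 1e-12)  # toy pair force
    return f

# Global exchange 2: per-rank force contributions are combined; the rows are
# disjoint here, so summing over ranks reassembles the full force array.
forces = sum(rank_forces(list(b), gathered) for b in blocks)
```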
12. Parallel MD: Spatial Decomposition
- Divide the simulation box into domains and assign domains to processors.
- Communication is local.
- Efficiency is worse, but running time is lower.
- Works better when the system is not very dense.
- Load balancing can be performed by dynamically varying the volume of the domains.
- When the system is not very dense, the running time is nearly linear.
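Assigning atoms to domain-owning processors can be sketched as follows (the domain grid, box size, and coordinates are illustrative):

```python
import numpy as np

BOX = 10.0        # illustrative cubic box length
GRID = (2, 2, 2)  # 2 x 2 x 2 domains -> 8 processors

def domain_of(r):
    """Map an atom position to the rank owning its spatial domain."""
    cell = tuple(int(r[d] / BOX * GRID[d]) % GRID[d] for d in range(3))
    # Flatten the 3-D domain index into a rank number.
    return (cell[0] * GRID[1] + cell[1]) * GRID[2] + cell[2]

rank = domain_of(np.array([1.0, 6.0, 9.0]))  # domain (0, 1, 1)
```

Because forces are short-ranged, each rank only exchanges boundary atoms with the ranks owning adjacent domains, which is why communication stays local.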
13. Efficiency of Atom and Spatial Decomposition
(Plots: parallel efficiency for atom decomposition and for spatial decomposition.)
14. MD-GRAPE 2 for MD Simulations
- Parallel, pipelined special-purpose hardware for computing non-bonded forces.
- Bonded forces and time integration are performed on the host machine.
- Can compute forces with either the all-pairs method or the link-cell method.
- If there are more than half a million atoms in the system, they must be split into batches of at most half a million before being sent to the MD-GRAPE 2.
- Peak performance of 64 Gflops.
15. MD-GRAPE 2 Calculations
- The forces and potentials are each computed as a pairwise sum.
- The function G(x) is evaluated using segmented fourth-order polynomial interpolation.
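The two equations were lost in extraction; pairwise sums of roughly this shape are what GRAPE-class boards evaluate (a reconstruction, not the exact MD-GRAPE 2 formulation, which also carries per-pair scaling coefficients):

```latex
\mathbf{F}_i = \sum_{j} G\!\left(r_{ij}^2\right)\,\mathbf{r}_{ij},
\qquad
\Phi_i = \sum_{j} G\!\left(r_{ij}^2\right),
\qquad
\mathbf{r}_{ij} = \mathbf{r}_j - \mathbf{r}_i
```

where G is an arbitrary function realized by the board's interpolation table (with separate tables for the force and potential sums).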
16. MD-GRAPE 2 Architecture
17. Link Cell Method on MD-GRAPE 2
18. Number of Processors vs. Execution Time for the MD-GRAPE 2 link cell method and the domain decomposition method
(Plot: relative error in computing total energy with the MD-GRAPE 2.)
19. Scheduling MD on a Cluster and the MD-GRAPE 2 Simultaneously
- The REBO is a three-body potential.
- It comprises three two-body components, VR, VA and VvdW, and a three-body bond-order term Bij.
- The MD-GRAPE 2 is not capable of computing three-body potentials due to its architecture.
- Its custom function evaluation table also does not allow conditional statements in the function, a feature required to evaluate VR and VA.
- We therefore compute VvdW on the MD-GRAPE 2 and all the other components on the cluster.
20. Scheduling MD on a Cluster and the MD-GRAPE 2 Simultaneously (contd.)
- The motivation is that VvdW has a cut-off roughly twice that of the other components.
- It can nevertheless be computed efficiently on the MD-GRAPE 2 while the other components are evaluated on the cluster at the same time.
- To aid communication between the cluster and the machine hosting the MD-GRAPE 2, we implement a server.
21. Scheduling MD on a Cluster and the MD-GRAPE 2 Simultaneously (contd.)
- The server accepts a position vector and returns a partial force vector and a partial potential vector.
- They are called partial because they contain only the contributions due to the VvdW component.
- These must be added to the other contributions computed on the cluster.
22. Scheduling MD on a Cluster and the MD-GRAPE 2 Simultaneously (contd.)
- At every time step, before the parallel code begins its computations, it sends a copy of the position vector to the MD-GRAPE 2.
- The cluster and the MD-GRAPE 2 then compute partial forces/potentials simultaneously.
- When the MD-GRAPE 2 completes its computations, it returns the partial forces/potentials to the host, which sums them to give the actual forces and the total potential energy.
- The cut-off for the computations on the cluster is now 2.5 instead of the 5.5 that would be required if all components of the REBO potential were computed on it.
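The per-time-step overlap can be sketched in-process, with a worker thread standing in for the MD-GRAPE 2 server (all functions and values below are illustrative stand-ins, not the authors' implementation):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def grape_partial(positions):
    """Stand-in for the MD-GRAPE 2 server: VvdW partial forces/potential."""
    return np.zeros_like(positions), -1.0   # toy values

def cluster_partial(positions):
    """Stand-in for the cluster: VR, VA and Bij contributions (cut-off 2.5)."""
    return np.ones_like(positions), -2.0    # toy values

positions = np.random.default_rng(0).random((8, 3))

with ThreadPoolExecutor(max_workers=1) as pool:
    # 1) Ship positions to the MD-GRAPE 2 host before local work starts.
    future = pool.submit(grape_partial, positions)
    # 2) The cluster computes its components concurrently.
    f_cluster, u_cluster = cluster_partial(positions)
    # 3) Collect the MD-GRAPE 2 partials and sum all contributions.
    f_grape, u_grape = future.result()

forces = f_cluster + f_grape
potential = u_cluster + u_grape
```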
23. Scheduling MD on a Cluster and the MD-GRAPE 2 Simultaneously (contd.)
- The execution time of an MD simulation using the atom-decomposition method can be well approximated by a second-degree polynomial tc(N).
- The execution time on the MD-GRAPE 2 can also be approximated by a second-degree polynomial tg(N).
- Since the two computations overlap, the total time to run such a simulation on the cluster and the MD-GRAPE 2 simultaneously is governed by the slower of the two.
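The slide's formula is lost to extraction; given the overlap described above, it plausibly has this shape (a reconstruction; the communication term is an assumption):

```latex
t_{\mathrm{total}}(N) = \max\bigl( t_c(N),\; t_g(N) \bigr) + t_{\mathrm{comm}}(N)
```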
24. Scheduling MD on a Cluster and the MD-GRAPE 2 Simultaneously (contd.)
- The optimum number of processors can be determined by minimizing the total execution time with respect to p.
- Experimentally, we have determined an optimal p of 35 for our setup.
- With this setup, we found the speedup to gradually approach 1.4 and remain nearly constant after that.
- We used the atom-decomposition method since the system being simulated is very dense.
25. Plot of speedup when using a cluster and the MD-GRAPE 2 simultaneously vs. using a cluster alone
26. Conclusion
- At the time of writing, the cost per processor, including network, was 1500 USD.
- The cost of the MD-GRAPE 2 was 15000 USD.
- For long-range potentials it is more cost-effective to use the MD-GRAPE 2, since it takes 61 cluster CPUs to equal its performance.
- For short-range potentials it is more effective to use a cluster, since only 12 cluster CPUs are needed to match its performance.
- However, using a combination of a cluster and the MD-GRAPE 2 to solve more complex potentials can yield a significant gain.
27. Future Work
- Incorporate multiple MD-GRAPE 2 boards into the current setup.
- Schedule MD simulations on larger-scale systems with Globus.
- Custom FPGA solutions to solve more than just pair potentials.
- Use GPUs to perform MD.
- Hand-optimize energy calculations to use SSE/SSE2/SSE3 instructions for optimal performance.