Title: Comparing The Performance Of Distributed Shared Memory And Message Passing Programs Using The Hyperion Java Virtual Machine On Clusters
1. Comparing The Performance Of Distributed Shared Memory And Message Passing Programs Using The Hyperion Java Virtual Machine On Clusters
2. Overview
- For this thesis we wanted to evaluate the performance of the Hyperion distributed virtual machine, designed at UNH, compared to a preexisting parallel computing API.
- The results would indicate where Hyperion's strengths and weaknesses were and possibly validate Hyperion as a high-performance computing alternative.
3. What Is A Cluster?
- A cluster is a group of low-cost computers connected with an off-the-shelf network.
- The cluster's network is isolated from WAN data traffic, and the computers in the cluster are presented to the user as a single resource.
4. Why Use Clusters?
- Clusters are cost-effective when compared to traditional parallel systems.
- Clusters can be grown as needed.
- Software components are based on standards, allowing portable software to be designed for the cluster.
5. Cluster Computing
- Cluster computing takes advantage of the cluster by distributing the computational workload among the nodes of the cluster, thereby reducing total computation time.
- There are many programming models for distributing data throughout the cluster.
6. Distributed Shared Memory
- Distributed Shared Memory (DSM) allows the user to view the whole cluster as one resource.
- Memory is shared among the nodes: each node can access every other node's memory as if it owned it.
- Data coordination among nodes is generally hidden from the user.
7. Message Passing
- Message Passing (MP) requires explicit messages to be employed to distribute data throughout the cluster.
- The programmer must coordinate all data exchanges through a language-level MP API when designing the application.
8. Related Work: TreadMarks vs. PVM
- TreadMarks (Rice, 1995) implements a DSM model, while PVM implements an MP model. The two approaches were compared with benchmarks.
- On average, PVM was found to perform two times better than TreadMarks.
- TreadMarks suffered from the excessive messages required by the request-response communication model its DSM employed.
- TreadMarks was found to be more natural to program with, saving development time.
9. Hyperion
- Hyperion is a distributed Java Virtual Machine (JVM) designed at UNH.
- The Java language provides parallelism through its threading model. Hyperion extends this model by distributing the threads across the cluster.
- Hyperion implements the DSM model via DSM-PM2, which allows for lightweight thread creation and data distribution.
10. Hyperion, Continued
- Hyperion has a fixed memory size that it shares with all threads executing across the cluster.
- Hyperion uses page-based data distribution: if a thread accesses memory it does not have locally, a page fault occurs and the memory is transmitted from the node that owns it to the requesting node, a page at a time.
11. Hyperion, Continued
- Hyperion translates Java bytecodes into native C code.
- A native executable is generated by a native C compiler.
- The belief is that native executables are optimized by the C compiler and will benefit the application by executing faster than interpreted code.
12. Hyperion's Threads
- Threads are created in a round-robin fashion among the nodes of the cluster.
- Data is transmitted between threads via a request/response mechanism. This approach requires two messages.
- In order to respond to a request message, a response thread must be scheduled. This thread handles the request by sending back the requested data in a response message.
13. mpiJava
- mpiJava is a Java wrapper for the Message Passing Interface (MPI); a minimal example is sketched below.
- The Java Native Interface (JNI) is used to translate between Java and native code.
- We used MPICH for the native MPI implementation.
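A minimal sketch of an mpiJava program, assuming the mpiJava 1.2 style of API (MPI.Init/MPI.Finalize and communicator methods on MPI.COMM_WORLD); the class name is illustrative.

    import mpi.MPI;

    public class HelloCluster {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);                     // start the native MPI layer via JNI
            int rank = MPI.COMM_WORLD.Rank();   // this process's id
            int size = MPI.COMM_WORLD.Size();   // total number of processes
            System.out.println("process " + rank + " of " + size);
            MPI.Finalize();                     // shut down MPI
        }
    }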
14. Clusters
- The Star cluster (UNH) consists of 16 PIII 667 MHz Linux PCs on a 100 Mb Fast Ethernet network. TCP is the communication protocol.
- The Paraski cluster (France) consists of 16 PIII 1 GHz Linux PCs on a 2 Gb Myrinet network. BIP (DSM) and GM (MPI) are the communication protocols.
15. Clusters, Continued
- The implementation of MPICH on BIP was not stable in time for this thesis, so GM had to be used in place of BIP for MPICH. GM has not been ported to Hyperion, and a port would be unreasonable at this time.
- BIP performs better than GM as the message size increases.
16. BIP vs. GM Latency (Paraski)
17. DSM And MPI In Hyperion
- For consistency, mpiJava was ported into Hyperion.
- Both DSM and MPI versions of the benchmarks could then be compiled by Hyperion.
- The executables produced by Hyperion are then executed by the respective native launchers (PM2 and MPICH).
18. Benchmarks
- The Java Grande Forum (JGF) developed a suite of benchmarks to test Java implementations.
- We used two of the JGF benchmark suites: multithreaded and javaMPI.
19. Benchmarks, Continued
- Benchmarks used
- Fourier coefficient analysis
- Lower/upper matrix factorization
- Successive over-relaxation
- IDEA encryption
- Sparse matrix multiplication
- Molecular dynamics simulation
- 3D Ray Tracer
- Monte Carlo simulation (only with MPI)
20. Benchmarks And Hyperion
- The multithreaded JGF benchmarks had unacceptable performance when run out of the box.
- Each benchmark creates all of its data objects on the root node, causing all remote object access to go through this one node.
- This type of access causes a performance bottleneck on the root node, as it has to service all the requests while computing its own part of the algorithm.
- The solution was to modify the benchmarks to be cluster aware.
21. Hyperion Extensions
- Hyperion makes up for Java's limited thread data management by providing efficient reduce and broadcast mechanisms.
- Hyperion also provides a cluster-aware implementation of arraycopy.
22. Hyperion Extension: Reduce
- Reduce blocks all enrolled threads until each thread has the final result of the reduction.
- This is done by neighbor threads exchanging their data for computation, then their neighbors' neighbors, and so on, until each thread has the same answer.
- This operation is faster and scales better than performing the calculation serially: it completes in O(log P) steps for P threads. A sketch of the exchange pattern is given below.
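As an illustration of the O(log P) pattern (not Hyperion's actual reduce API), the following plain-Java sketch performs a recursive-doubling sum among P threads using a CyclicBarrier; after log2(P) rounds every thread holds the total.

    import java.util.concurrent.CyclicBarrier;

    public class LogReduce {
        static final int P = 8;                      // number of threads (power of two)
        static final double[] val = new double[P];   // each thread's running partial sum
        static final CyclicBarrier barrier = new CyclicBarrier(P);

        public static void main(String[] args) throws Exception {
            Thread[] workers = new Thread[P];
            for (int i = 0; i < P; i++) {
                final int id = i;
                workers[i] = new Thread(() -> run(id));
                workers[i].start();
            }
            for (Thread t : workers) t.join();
            System.out.println("every thread holds the sum: " + val[0]);
        }

        static void run(int id) {
            try {
                val[id] = id + 1.0;                  // this thread's local contribution
                for (int step = 1; step < P; step <<= 1) {
                    int partner = id ^ step;         // neighbor at distance 'step'
                    barrier.await();                 // all inputs for this round are written
                    double combined = val[id] + val[partner];
                    barrier.await();                 // all reads done before anyone overwrites
                    val[id] = combined;
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }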
23. Hyperion Extension: Broadcast
- The broadcast mechanism transmits the same data to all enrolled threads.
- Like reduce, data is distributed to the threads in O(log P) steps, which scales better than serial distribution of the data.
24. Hyperion Extension: Arraycopy
- The arraycopy method is part of the Java System class. The Hyperion version was extended to be cluster aware.
- If data is copied across threads, this version sends all of the data as one message instead of relying on the paging mechanism to access remote array data; a usage sketch follows.
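A minimal sketch of the usage pattern, using the standard java.lang.System.arraycopy signature; the block layout is illustrative, and under Hyperion the same call would move the whole block in a single message when the arrays live on different nodes.

    public class BlockCopy {
        // Copy this thread's block of 'local' into the shared 'result' array.
        static void copyBlock(double[] local, double[] result, int myId, int numThreads) {
            int blockLen = result.length / numThreads;   // assume an even split
            int offset = myId * blockLen;                // where this thread's block goes
            System.arraycopy(local, 0, result, offset, blockLen);
        }
    }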
25. Benchmark Modifications
- The multithreaded benchmarks had unacceptable performance.
- The benchmarks were modified in order to reduce remote object access and root-node bottlenecks.
- Techniques such as arraycopy, broadcast, and reduce were employed to improve performance.
26. Experiment
- Each benchmark was executed 50 times at each node size to provide a sample mean.
- Node sizes were 1, 2, 4, 8, and 16.
- Confidence intervals (95% level) were used to determine which version, MPI or DSM, performed better; a sketch of the interval computation follows.
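A sketch of a 95% confidence interval for a sample mean, assuming a normal-approximation interval (z = 1.96); the thesis's exact statistical procedure may differ.

    public class ConfInterval {
        // Returns {lower, upper} bounds of a 95% confidence interval for the mean.
        static double[] interval95(double[] samples) {
            int n = samples.length;
            double mean = 0.0;
            for (double x : samples) mean += x;
            mean /= n;
            double var = 0.0;
            for (double x : samples) var += (x - mean) * (x - mean);
            var /= (n - 1);                              // sample variance
            double half = 1.96 * Math.sqrt(var / n);     // half-width of the interval
            return new double[] { mean - half, mean + half };
        }
    }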
27. Results On The Star Cluster
28. Results On The Paraski Cluster
29. Fourier Coefficient Analysis
- Calculates the first 10,000 pairs of Fourier coefficients.
- Each node is responsible for calculating its portion of the coefficient array.
- Each node sends its array portion back to the root node, which accumulates the final array.
30. Fourier DSM Modifications
- The original multithreaded version required all threads to update arrays located on the root node, causing the root node to be flooded with requests.
- The modified version used arraycopy to distribute the local arrays back into the root thread's arrays.
31. Fourier mpiJava
- The mpiJava version is similar to the DSM version.
- Each process is responsible for its portion of the arrays.
- MPI_Ssend and MPI_Recv were called to distribute the array portions to the root process, as sketched below.
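A sketch of how that exchange might look in mpiJava, assuming the buffer-plus-offset Ssend/Recv signatures; the variable names, tag, and even block split are illustrative.

    import mpi.MPI;

    public class GatherToRoot {
        static final int TAG = 7;   // illustrative message tag

        // Gather each process's block of 'coeffs' onto the root process (rank 0).
        static void gather(double[] coeffs, int rank, int size) throws Exception {
            int blockLen = coeffs.length / size;     // assume an even split
            if (rank != 0) {
                // synchronous send of this process's block to the root
                MPI.COMM_WORLD.Ssend(coeffs, rank * blockLen, blockLen, MPI.DOUBLE, 0, TAG);
            } else {
                for (int src = 1; src < size; src++) {
                    // receive each remote block directly into its slot in the root array
                    MPI.COMM_WORLD.Recv(coeffs, src * blockLen, blockLen, MPI.DOUBLE, src, TAG);
                }
            }
        }
    }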
32. Fourier Results
33. Fourier Conclusions
- Most of the time in this benchmark is spent in the computation.
- Network communication does not play a significant role in the overall time.
- Both MPI and DSM perform similarly on each cluster, scaling well when more nodes are added.
34. Lower/Upper Factorization
- Solves a 500 x 500 linear system with LU factorization followed by a triangular solve.
- The factorization is parallelized, while the triangular solve is computed serially.
35. LU DSM Modifications
- The original version created the matrix on the root thread and all access was through this thread, causing performance bottlenecks.
- The benchmark was modified to use Hyperion's broadcast facility to distribute the pivot information, and arraycopy was used to coordinate the final data for the solve.
36. LU mpiJava
- MPI_Bcast is used to distribute the pivot information, as in the sketch below.
- MPI_Send and MPI_Recv are used so the root process can acquire the final matrix.
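A sketch of the pivot broadcast in mpiJava, assuming the Bcast(buffer, offset, count, type, root) signature; the pivot-row layout is illustrative.

    import mpi.MPI;

    public class PivotBroadcast {
        // Broadcast the current pivot row from the process that owns it
        // to every other process before the elimination step.
        static void sharePivot(double[] pivotRow, int ownerRank) throws Exception {
            MPI.COMM_WORLD.Bcast(pivotRow, 0, pivotRow.length, MPI.DOUBLE, ownerRank);
        }
    }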
37. LU Results
38. LU Conclusions
- While the DSM version uses a data distribution mechanism similar to the MPI version's, significant overhead is exposed when these methods are executed inside large loops.
- This overhead is minimized on the Paraski cluster due to the nature of Myrinet and BIP.
39. Successive Over-Relaxation
- Performs 100 iterations of SOR on a 1000 x 1000 grid.
- A red-black ordering mechanism allows the array rows to be distributed to nodes in blocks.
- After the initial data distribution, only neighbor rows need to be communicated during the SOR.
40. SOR DSM Modifications
- Excessive remote thread object access made it necessary to modify the benchmark.
- The modified version uses arraycopy to update neighbor rows during the SOR.
- When the SOR completes, arraycopy is used to assemble the final matrix on the root thread.
41. SOR mpiJava
- MPI_Sendrecv is used to exchange neighbor rows, as sketched below.
- MPI_Ssend and MPI_Recv are used to build the final matrix on the root process.
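A sketch of one neighbor-row exchange with mpiJava's combined send/receive, assuming the Sendrecv(sendbuf, offset, count, type, dest, sendtag, recvbuf, offset, count, type, source, recvtag) signature; the ghost-row layout is illustrative, and a matching call is assumed on the neighboring process.

    import mpi.MPI;

    public class NeighborExchange {
        static final int TAG = 11;   // illustrative message tag

        // Send this process's boundary row to the neighbor below while receiving
        // that neighbor's boundary row into our ghost row.
        static void exchange(double[] boundaryRow, double[] ghostRow, int below) throws Exception {
            MPI.COMM_WORLD.Sendrecv(boundaryRow, 0, boundaryRow.length, MPI.DOUBLE, below, TAG,
                                    ghostRow, 0, ghostRow.length, MPI.DOUBLE, below, TAG);
        }
    }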
42. SOR Results
43. SOR Conclusions
- The DSM version requires an extra barrier after neighbor rows are exchanged, due to the network reactivity problem.
- A thread must be able to service all requests in a timely fashion. If the thread is busy computing, it cannot react quickly enough to schedule the thread that services the request.
- The barrier blocks all threads until each reaches it, which guarantees that all nodes have their requested data and it is safe to continue with the computation.
44. IDEA Crypt
- Performs IDEA encryption and decryption on a 3,000,000-byte array.
- The array is divided among the nodes in a block manner.
- Each node encrypts and decrypts its portion.
- When complete, the root node collects the decrypted array for validation.
45. Crypt DSM Modifications
- The original version created the whole array on the root thread and required each remote thread to page in its portion.
- The modified version used arraycopy to distribute each thread's portion from the root thread.
- When decryption finishes, arraycopy copies the decrypted portion back to the root thread.
46. Crypt mpiJava
- The mpiJava version uses MPI_Ssend to send the array portions to the remote processes and MPI_Recv to receive them; the distribution step is sketched below.
- When complete, MPI_Ssend is used to send back each process's portion and MPI_Recv receives each portion on the root.
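A sketch of the root-side distribution of byte blocks in mpiJava, assuming the same Ssend/Recv signatures as above; the tag and even block split are illustrative.

    import mpi.MPI;

    public class ScatterBytes {
        static final int TAG = 3;   // illustrative message tag

        // Root (rank 0) sends each remote process its block of 'data';
        // remote processes receive their block into a local buffer.
        static byte[] scatter(byte[] data, int totalLen, int rank, int size) throws Exception {
            int blockLen = totalLen / size;                      // assume an even split
            byte[] local = new byte[blockLen];
            if (rank == 0) {
                System.arraycopy(data, 0, local, 0, blockLen);   // root keeps block 0
                for (int dst = 1; dst < size; dst++) {
                    MPI.COMM_WORLD.Ssend(data, dst * blockLen, blockLen, MPI.BYTE, dst, TAG);
                }
            } else {
                MPI.COMM_WORLD.Recv(local, 0, blockLen, MPI.BYTE, 0, TAG);
            }
            return local;
        }
    }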
47. Crypt Results
48. Crypt Conclusions
- Results are similar on both clusters.
- There is a slight performance problem with 4 and 8 nodes in the DSM version.
- This can be attributed to the placement of a barrier in the DSM version that causes all threads to block before computing, while the MPI version does not block.
49. Sparse Matrix Multiplication
- A 50,000 x 50,000 unstructured matrix stored in compressed-row format is multiplied over 200 iterations.
- Only the final result is communicated, as each node has its own portion of the data and the initial distribution is not timed.
50. Sparse DSM Modifications
- This benchmark originally produced excessive network traffic through remote object access.
- The modifications involved removing the remote object access from the multiplication loop and using arraycopy to distribute the final result to the root thread.
51. Sparse mpiJava
- This benchmark's only communication is an MPI_Allreduce, which performs an array reduce that leaves the result on all nodes. It is employed to obtain the final result of the multiplication, as sketched below.
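A sketch of that call in mpiJava, assuming the Allreduce(sendbuf, sendoffset, recvbuf, recvoffset, count, type, op) signature; the buffer names are illustrative.

    import mpi.MPI;

    public class SumVectors {
        // Element-wise sum of every process's 'partial' vector; after the call
        // each process holds the combined result in 'result'.
        static void allSum(double[] partial, double[] result) throws Exception {
            MPI.COMM_WORLD.Allreduce(partial, 0, result, 0, partial.length, MPI.DOUBLE, MPI.SUM);
        }
    }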
52. Sparse Results
53. Sparse Conclusions
- The results are similar on both clusters.
- The DSM version has better performance on both clusters.
- The MPI version uses the MPI_Allreduce method, which places the results on all nodes.
- This method adds extra overhead that is not present in the DSM version, where the results are simply collected on the root node.
54. Molecular Dynamics
- This benchmark is a 2048-body code that models particles interacting under a Lennard-Jones potential in a cubic spatial volume with periodic boundary conditions.
- Parallelization is provided by dividing the range of iterations over the particles among the nodes.
55. MolDyn DSM Modifications
- A significant amount of remote thread object access necessitated modifications.
- Particle forces are updated by remote threads using arraycopy to send the forces to the root thread; the root thread serially updates the forces and sends the new force array back to the remote threads via arraycopy.
56. MolDyn mpiJava
- This version uses six MPI_Allreduce calls to update various force and movement arrays. This occurs at every time step.
57. MolDyn Results
58. MolDyn Conclusions
- The DSM version must update the particle forces serially on the root thread.
- This causes each thread to block while sending its local forces to the root thread and then wait for the updated forces to be sent back.
- The MPI version uses MPI_Allreduce to compute the forces among all the processes in parallel, so each node only blocks for its neighbor force updates.
59. Ray Tracer
- A scene with 64 spheres is rendered at 150 x 150 pixels.
- Parallelization is provided by a cyclic distribution of the rows of pixels, which balances the load across nodes; a sketch of the distribution follows.
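A minimal sketch of the cyclic (round-robin) row distribution implied here; the renderRow method is a hypothetical placeholder for the per-row ray-tracing work.

    public class CyclicRows {
        // Each worker renders every numWorkers-th row, starting at its own id,
        // so expensive and cheap rows are spread evenly across workers.
        static void renderMyRows(int myId, int numWorkers, int height) {
            for (int row = myId; row < height; row += numWorkers) {
                renderRow(row);
            }
        }

        static void renderRow(int row) {
            // placeholder: trace all pixels in this row
        }
    }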
60. Ray Tracer DSM Modifications
- This benchmark was poorly designed, as it created far too many temporary objects. The DSM version was heavily modified to eliminate temporary object creation.
- Broadcast is used to transmit the array row reference to each thread.
- Once the rendering is complete, arraycopy is used to copy the row data to the root thread.
- Reduce is used to compute the pixel checksum for validation purposes.
61. Ray Tracer mpiJava
- The mpiJava version was also modified to remove temporary object creation.
- MPI_Reduce is used to generate the pixel checksum.
- MPI_Send and MPI_Recv are used to transmit the row data to the root process.
62. Ray Tracer Results
63. Ray Tracer Conclusions
- The results on both clusters are almost identical.
- Very little network communication is required; most of the time is spent rendering the scene.
64. Monte Carlo
- Uses Monte Carlo techniques to price products derived from the worth of an underlying asset.
- Generates 2,000 samples.
- Parallelization is provided by dividing the work in the principal loop of the Monte Carlo runs in a block fashion and distributing the blocks to remote nodes.
65. Monte Carlo: Lack Of DSM
- This benchmark required an unacceptable amount of memory for Hyperion to handle.
- It is embarrassingly parallel, and we have other embarrassingly parallel benchmarks.
- We felt it was unnecessary to develop a working DSM version from this large set of code.
66. Monte Carlo mpiJava
- The mpiJava version provided some insight into the mpiJava-to-Hyperion port, specifically in the object serialization portion.
- Monte Carlo relies heavily on object serialization, as Java objects, rather than primitive types, are distributed via the send and receive commands; a sketch of an object send follows.
- By creating a working version of Monte Carlo, we were able to eliminate many subtle bugs in the mpiJava-to-Hyperion port.
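A sketch of sending a serializable Java object with mpiJava, assuming the MPI.OBJECT datatype with an Object[] buffer; the Task class, tag, and field are hypothetical stand-ins for the benchmark's result objects.

    import java.io.Serializable;
    import mpi.MPI;

    public class ObjectSend {
        static final int TAG = 5;   // illustrative message tag

        // Hypothetical result object; it must be Serializable to travel as MPI.OBJECT.
        static class Task implements Serializable {
            double price;
        }

        // Workers send one Task to the root (rank 0); the root receives it.
        static void exchange(Task t, int rank, int src) throws Exception {
            Object[] buf = new Object[] { t };       // object buffers are Object arrays
            if (rank != 0) {
                MPI.COMM_WORLD.Send(buf, 0, 1, MPI.OBJECT, 0, TAG);   // serialized under the hood
            } else {
                MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.OBJECT, src, TAG);
                System.out.println("received price " + ((Task) buf[0]).price);
            }
        }
    }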
67. Monte Carlo Results
68. Monte Carlo Conclusions
- The MPI version scales well on both clusters.
- Without a DSM version, a comparison is not possible.
69. Conclusions
- The Hyperion system can compete with traditional parallel programming models.
- However, to compete, a Hyperion user cannot simply write multithreaded Java code and expect it to perform well on a cluster.
- Users must be aware of how thread creation works in Hyperion and of the effects of remote object access in order to tune their applications.
70. Conclusions, Continued
- With an application developed using Hyperion's thread management and data exchange techniques (reduce, broadcast, arraycopy), a Hyperion user can achieve competitive performance.
- We feel that methods for performing operations on groups of threads, such as those above, should be part of the Java threading API, as they could be useful even outside a parallel environment.