Comparing The Performance Of Distributed Shared Memory And Message Passing Programs Using The Hyperion Java Virtual Machine On Clusters

1
Comparing The Performance Of Distributed Shared
Memory And Message Passing Programs Using The
Hyperion Java Virtual Machine On Clusters
  • Mathew Reno

2
Overview
  • For this thesis we wanted to evaluate the
    performance of the Hyperion distributed virtual
    machine, designed at UNH, compared with a
    preexisting parallel computing API.
  • The results would indicate where Hyperion's
    strengths and weaknesses lay and possibly
    validate Hyperion as a high-performance computing
    alternative.

3
What Is A Cluster?
  • A cluster is a group of low-cost computers
    connected with an off-the-shelf network.
  • The cluster's network is isolated from WAN data
    traffic and the computers on the cluster are
    presented to the user as a single resource.

4
Why Use Clusters?
  • Clusters are cost effective when compared to
    traditional parallel systems.
  • Clusters can be grown as needed.
  • Software components are based on standards
    allowing portable software to be designed for the
    cluster.

5
Cluster Computing
  • Cluster computing takes advantage of the cluster
    by distributing computational workload among
    nodes of the cluster, thereby reducing total
    computation time.
  • There are many programming models for
    distributing data throughout the cluster.

6
Distributed Shared Memory
  • Distributed Shared Memory (DSM) allows the user
    to view the whole cluster as one resource.
  • Memory is shared among the nodes. Each node has
    access to all other nodes' memory as if it owned
    it.
  • Data coordination among nodes is generally hidden
    from the user.

7
Message Passing
  • Message Passing (MP) requires explicit messages
    to be employed to distribute data throughout the
    cluster.
  • The programmer must coordinate all data exchanges
    when designing the application through a language
    level MP API.

8
Related Work: TreadMarks Vs. PVM
  • TreadMarks (Rice, 1995) implements a DSM model
    while PVM implements a MP model. The two
    approaches were compared with benchmarks.
  • On average, PVM was found to perform two times
    better than TreadMarks.
  • TreadMarks suffered from the excessive messages
    required by the request-response communication
    model DSM employed.
  • TreadMarks was found to be more natural to
    program with, saving development time.

9
Hyperion
  • Hyperion is a distributed Java Virtual Machine
    (JVM), designed at UNH.
  • The Java language provides parallelism through
    its threading model. Hyperion extends this model
    by distributing the threads among the cluster.
  • Hyperion implements the DSM model via DSM-PM2,
    which allows for lightweight thread creation and
    data distribution.

10
Hyperion, Continued
  • Hyperion has a fixed memory size that it shares
    with all threads executing across the cluster.
  • Hyperion uses page-based data distribution: if a
    thread accesses memory it does not have locally,
    a page fault occurs and the memory is transmitted
    from the node that owns it to the requesting
    node, a page at a time.

11
Hyperion, Continued
  • Hyperion translates Java bytecodes into native C
    code.
  • A native executable is generated by a native C
    compiler.
  • The belief is that native executables are
    optimized by the C compiler and will benefit the
    application by executing faster than interpreted
    code.

12
Hyperion's Threads
  • Threads are created in a round robin fashion
    among the nodes of the cluster.
  • Data is transmitted between threads via a
    request/response mechanism. This approach
    requires two messages.
  • In order to respond to a request message, a
    response thread must be scheduled. This thread
    handles the request by sending back the requested
    data in a response message.
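The two-message exchange described above can be sketched with plain Java threads and blocking queues. This is only an illustrative single-process simulation (the `fetch` helper and queue setup are invented for this sketch); Hyperion itself moves pages between nodes over the network.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simulation of Hyperion's request/response mechanism: message 1 is the
// request, a scheduled responder thread services it, message 2 is the
// response carrying the data. The queues stand in for the network.
public class RequestResponse {
    static int fetch(int[] ownedData, int index) {
        BlockingQueue<Integer> requests = new ArrayBlockingQueue<>(1);
        BlockingQueue<Integer> responses = new ArrayBlockingQueue<>(1);
        Thread responder = new Thread(() -> {
            try {
                int i = requests.take();         // message 1: the request
                responses.put(ownedData[i]);     // message 2: the response
            } catch (InterruptedException ignored) { }
        });
        responder.start();                       // response thread is scheduled
        try {
            requests.put(index);
            int value = responses.take();
            responder.join();
            return value;
        } catch (InterruptedException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        int[] remote = {10, 20, 30};             // data owned by another "node"
        System.out.println(fetch(remote, 2));    // prints 30
    }
}
```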

13
mpiJava
  • mpiJava is a Java wrapper for the Message Passing
    Interface (MPI).
  • The Java Native Interface (JNI) is used to
    translate between Java and native code.
  • We used MPICH for the native MPI implementation.

14
Clusters
  • The Star cluster (UNH) consists of 16 PIII
    667MHz Linux PCs on a 100Mb Fast Ethernet
    network. TCP is the communication protocol.
  • The Paraski cluster (France) consists of 16
    PIII 1GHz Linux PCs on a 2Gb Myrinet network. BIP
    (DSM) and GM (MPI) are the communication
    protocols.

15
Clusters, Continued
  • The implementation of MPICH on BIP was not stable
    in time for this thesis. GM had to be used in
    place of BIP for MPICH. GM has not been ported to
    Hyperion and a port would be unreasonable at this
    time.
  • BIP performs better than GM as the message size
    increases.

16
BIP vs. GM Latency (Paraski)
17
DSM And MPI In Hyperion
  • For consistency, mpiJava was ported into
    Hyperion.
  • Both DSM and MPI versions of the benchmarks could
    be compiled by Hyperion.
  • The executables produced by Hyperion are then
    executed by the respective native launchers (PM2
    and MPICH).

18
Benchmarks
  • The Java Grande Forum (JGF) developed a suite of
    benchmarks to test Java implementations.
  • We used two of the JGF benchmark suites,
    multithreaded and javaMPI.

19
Benchmarks, Continued
  • Benchmarks used
  • Fourier coefficient analysis
  • Lower/upper matrix factorization
  • Successive over-relaxation
  • IDEA encryption
  • Sparse matrix multiplication
  • Molecular dynamics simulation
  • 3D Ray Tracer
  • Monte Carlo simulation (only with MPI)

20
Benchmarks And Hyperion
  • The multi-threaded JGF benchmarks had
    unacceptable performance when run "out of the
    box."
  • Each benchmark creates all of its data objects on
    the root node causing all remote object access to
    occur through this one node.
  • This type of access causes a performance
    bottleneck on the root node as it has to service
    all the requests while calculating its algorithm
    part.
  • The solution was to modify the benchmarks to be
    cluster aware.

21
Hyperion Extensions
  • Hyperion makes up for Java's limited thread data
    management by providing efficient reduce and
    broadcast mechanisms.
  • Hyperion also provides a cluster aware
    implementation of arraycopy.

22
Hyperion Extension: Reduce
  • Reduce blocks all enrolled threads until each
    thread has the final result of the reduce.
  • This is done by neighbor threads exchanging their
    data for computation, then their neighbors, and
    so on until each thread has the same answer.
  • This operation is faster and scales better than
    performing the calculation serially: it completes
    in O(log P) communication steps.
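The recursive-doubling ("butterfly") pattern behind an O(log P) reduce can be sketched in plain Java. This is a single-process simulation of the exchange pattern, not Hyperion's implementation:

```java
import java.util.Arrays;

// After round k, each slot holds the sum of a 2^(k+1)-wide group, so
// every participant ends up with the full result in log2(P) rounds.
public class ButterflyReduce {
    static double[] reduceSum(double[] perThread) {
        int p = perThread.length;          // assumed a power of two here
        double[] v = Arrays.copyOf(perThread, p);
        for (int stride = 1; stride < p; stride <<= 1) {
            double[] next = new double[p];
            for (int i = 0; i < p; i++) {
                // each "thread" combines with its partner at distance stride
                next[i] = v[i] + v[i ^ stride];
            }
            v = next;
        }
        return v;                          // every slot holds the same sum
    }

    public static void main(String[] args) {
        double[] local = {1, 2, 3, 4, 5, 6, 7, 8};
        double[] result = reduceSum(local);
        System.out.println(result[0]);     // 36.0 on every participant
    }
}
```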

23
Hyperion Extension: Broadcast
  • The broadcast mechanism transmits the same data
    to all enrolled threads.
  • Like reduce, data is distributed to the threads
    in O(log P) steps, which scales better than
    serial distribution of data.

24
Hyperion Extension: Arraycopy
  • The arraycopy method is part of the Java System
    class. The Hyperion version was extended to be
    cluster aware.
  • If data is copied across threads, this version
    will send all data as one message instead of
    relying on paging mechanisms to access remote
    array data.
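A standard System.arraycopy call looks like the following; under Hyperion the same call is intercepted so that a copy whose source and destination live on different nodes travels as one bulk message. The offsets and sizes below are purely illustrative:

```java
public class ArraycopyDemo {
    public static void main(String[] args) {
        double[] local = {1.0, 2.0, 3.0, 4.0};
        double[] rootBuffer = new double[8];
        // Copy this thread's 4-element block into its slice of the root
        // array, starting at offset 4. Cluster-aware, this is one message
        // instead of one page fault per remote element touched.
        System.arraycopy(local, 0, rootBuffer, 4, local.length);
        System.out.println(rootBuffer[4] + " " + rootBuffer[7]); // 1.0 4.0
    }
}
```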

25
Benchmark Modifications
  • The multithreaded benchmarks had unacceptable
    performance.
  • The benchmarks were modified in order to reduce
    remote object access and root node bottlenecks.
  • Techniques, such as arraycopy, broadcast and
    reduce were employed to improve performance.

26
Experiment
  • Each benchmark was executed 50 times at each node
    size to provide a sample mean.
  • Node sizes were 1, 2, 4, 8, and 16.
  • Confidence intervals (95% level) were used to
    determine which version, MPI or DSM, performed
    better.
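The half-width of a 95% confidence interval for a sample mean can be computed roughly as follows. This sketch uses the normal approximation (z = 1.96), which is reasonable for 50 samples; the thesis may have used a t-distribution instead:

```java
public class ConfidenceInterval {
    // Half-width of a 95% confidence interval for the mean:
    // z * (sample standard deviation) / sqrt(n).
    static double halfWidth95(double[] samples) {
        int n = samples.length;
        double mean = 0;
        for (double s : samples) mean += s;
        mean /= n;
        double var = 0;
        for (double s : samples) var += (s - mean) * (s - mean);
        var /= (n - 1);                     // unbiased sample variance
        return 1.96 * Math.sqrt(var / n);   // z * standard error
    }

    public static void main(String[] args) {
        double[] times = {10.1, 9.8, 10.3, 10.0, 9.9, 10.2}; // illustrative runtimes
        System.out.printf("mean +/- %.3f%n", halfWidth95(times));
    }
}
```

If the intervals of the DSM and MPI versions do not overlap, the one with the smaller mean can be declared faster at that node size.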

27
Results On The Star Cluster
28
Results On The Paraski Cluster
29
Fourier Coefficient Analysis
  • Calculates the first 10,000 pairs of Fourier
    coefficients.
  • Each node is responsible for calculating its
    portion of the coefficient array.
  • Each node sends back its array portion to the
    root node, which accumulates the final array.
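A block partition of the coefficient computation might look like the sketch below. The integrand f, the node count, and the rank are illustrative stand-ins, not the JGF benchmark's actual function or parameters:

```java
public class FourierBlock {
    // Trapezoidal estimate of the n-th cosine coefficient of f on [0, 2*pi].
    static double aCoeff(int n, int steps) {
        double h = 2 * Math.PI / steps, sum = 0;
        for (int i = 0; i <= steps; i++) {
            double x = i * h;
            double w = (i == 0 || i == steps) ? 0.5 : 1.0; // trapezoid weights
            sum += w * f(x) * Math.cos(n * x);
        }
        return sum * h / Math.PI;
    }

    static double f(double x) { return x; }  // stand-in integrand

    public static void main(String[] args) {
        int pairs = 16, nodes = 4, rank = 1;  // hypothetical node count and rank
        // Block partition: each node computes a contiguous slice of indices,
        // then sends its slice back to the root.
        int per = pairs / nodes;
        int lo = rank * per, hi = lo + per;   // rank 1 owns indices [4, 8)
        for (int n = lo; n < hi; n++)
            System.out.printf("a_%d = %.4f%n", n, aCoeff(n, 10000));
    }
}
```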

30
Fourier DSM Modifications
  • The original multithreaded version required all
    threads to update arrays located on the root
    node, causing the root node to be flooded with
    requests.
  • The modified version used arraycopy to distribute
    the local arrays back to the root thread's arrays.

31
Fourier mpiJava
  • The mpiJava version is similar to the DSM
    version.
  • Each process is responsible for its portion of
    the arrays.
  • MPI_Ssend and MPI_Recv were called to distribute
    the array portions to the root process.

32
Fourier Results
33
Fourier Conclusions
  • Most of the time in this benchmark is spent in
    the computation.
  • Network communication does not play a significant
    role in the overall time.
  • Both MPI and DSM perform similarly on each
    cluster, scaling well when more nodes are added.

34
Lower/Upper Factorization
  • Solves a 500 x 500 linear system with LU
    factorization followed with a triangular solve.
  • The factorization is parallelized while the
    triangular solve is computed in serial.
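The two phases being timed, LU factorization with partial pivoting followed by a triangular solve, can be sketched serially on a small system. This is a minimal illustration of the algorithm, not the benchmark's code:

```java
public class LuSolve {
    // In-place LU factorization with partial pivoting, then forward and
    // back substitution. In the benchmark the factorization loop is
    // parallelized while the triangular solve stays serial.
    static double[] solve(double[][] a, double[] b) {
        int n = a.length;
        int[] piv = new int[n];
        for (int k = 0; k < n; k++) {
            int p = k;                              // find the pivot row
            for (int i = k + 1; i < n; i++)
                if (Math.abs(a[i][k]) > Math.abs(a[p][k])) p = i;
            double[] t = a[k]; a[k] = a[p]; a[p] = t;
            piv[k] = p;
            for (int i = k + 1; i < n; i++) {       // eliminate column k
                a[i][k] /= a[k][k];
                for (int j = k + 1; j < n; j++)
                    a[i][j] -= a[i][k] * a[k][j];
            }
        }
        double[] x = b.clone();
        for (int k = 0; k < n; k++) {               // apply row swaps to b
            double t = x[k]; x[k] = x[piv[k]]; x[piv[k]] = t;
        }
        for (int i = 1; i < n; i++)                 // forward substitution (L)
            for (int j = 0; j < i; j++) x[i] -= a[i][j] * x[j];
        for (int i = n - 1; i >= 0; i--) {          // back substitution (U)
            for (int j = i + 1; j < n; j++) x[i] -= a[i][j] * x[j];
            x[i] /= a[i][i];
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] a = {{2, 1}, {1, 3}};            // 2x2 toy system
        double[] x = solve(a, new double[]{3, 5});
        System.out.printf("%.1f %.1f%n", x[0], x[1]); // 0.8 1.4
    }
}
```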

35
LU DSM Modifications
  • The original version created the matrix on the
    root thread and all access was through this
    thread, causing performance bottlenecks.
  • The benchmark was modified to use Hyperion's
    broadcast facility to distribute the pivot
    information, and arraycopy was used to coordinate
    the final data for the solve.

36
LU mpiJava
  • MPI_Bcast is used to distribute the pivot
    information.
  • MPI_Send and MPI_Recv are used so the root
    process can acquire the final matrix.

37
LU Results
38
LU Conclusions
  • While the DSM version uses a similar data
    distribution mechanism as the MPI version, there
    is significant overhead that is exposed when
    executing these methods in large loops.
  • This overhead is minimized on the Paraski cluster
    due to the nature of Myrinet and BIP.

39
Successive Over-Relaxation
  • Performs 100 iterations of SOR on a 1000 x 1000
    grid.
  • A red-black ordering mechanism allows array
    rows to be distributed to nodes in blocks.
  • After initial data distribution, only neighbor
    rows need be communicated during the SOR.
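A red-black sweep can be sketched as follows. Because each color reads only the other color's values, block-distributed rows need only exchange neighbor rows per sweep. The grid size and relaxation factor below are illustrative, not the benchmark's 1000 x 1000 configuration:

```java
public class RedBlackSor {
    // One red-black SOR sweep on an n x n grid with fixed boundaries:
    // all "red" points (i+j even) update first, then all "black" points.
    static void sweep(double[][] g, double omega) {
        int n = g.length;
        for (int color = 0; color < 2; color++)
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    if ((i + j) % 2 == color)
                        g[i][j] += omega * ((g[i - 1][j] + g[i + 1][j]
                                + g[i][j - 1] + g[i][j + 1]) / 4.0 - g[i][j]);
    }

    public static void main(String[] args) {
        int n = 8;
        double[][] g = new double[n][n];
        for (int j = 0; j < n; j++) g[0][j] = 1.0;  // fixed "hot" top edge
        for (int it = 0; it < 100; it++) sweep(g, 1.5);
        System.out.printf("%.4f%n", g[n / 2][n / 2]); // interior settles between 0 and 1
    }
}
```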

40
SOR DSM Modifications
  • Excessive remote thread object access made it
    necessary to modify the benchmark.
  • Modified version uses arraycopy to update
    neighbor rows during the SOR.
  • When the SOR completes, arraycopy is used to
    assemble the final matrix on the root thread.

41
SOR mpiJava
  • MPI_Sendrecv is used to exchange neighbor rows.
  • MPI_Ssend and MPI_Recv are used to build the
    final matrix on the root process.

42
SOR Results
43
SOR Conclusions
  • The DSM version requires an extra barrier after
    row neighbors are exchanged due to the network
    reactivity problem.
  • A thread must be able to service all requests in
    a timely fashion. If the thread is busy
    computing, it cannot react quickly enough to
    schedule the response thread.
  • The barrier will block all threads until each
    reaches the barrier, which guarantees that all
    nodes have their requested data and it is safe to
    continue with the computation.

44
IDEA Crypt
  • Performs IDEA encryption and decryption on a
    3,000,000 byte array.
  • The array is divided among nodes in a block
    manner.
  • Each node encrypts and decrypts its portion.
  • When complete, the root node collects the
    decrypted array for validation.
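The block division of the array might be computed as in this sketch. The helper name and the remainder handling are assumptions; the benchmark's exact scheme may differ:

```java
public class BlockPartition {
    // Block division of n elements over `nodes` ranks: each rank gets a
    // contiguous slice, with the first (n mod nodes) ranks taking one
    // extra element. Returns {lo, hi} as a half-open range.
    static int[] range(int n, int nodes, int rank) {
        int base = n / nodes, rem = n % nodes;
        int lo = rank * base + Math.min(rank, rem);
        int hi = lo + base + (rank < rem ? 1 : 0);
        return new int[]{lo, hi};
    }

    public static void main(String[] args) {
        int n = 3_000_000, nodes = 8;      // array size from the benchmark
        for (int r = 0; r < nodes; r++) {
            int[] b = range(n, nodes, r);
            System.out.printf("node %d: [%d, %d)%n", r, b[0], b[1]);
        }
    }
}
```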

45
Crypt DSM Modifications
  • The original version created the whole array on
    the root thread and required each remote thread
    to page in its portion.
  • The modified version used arraycopy to distribute
    each thread's portion from the root thread.
  • When decryption finishes, arraycopy copies back
    the decrypted portion to the root thread.

46
Crypt mpiJava
  • The mpiJava version uses MPI_Ssend to send the
    array portions to the remote processes and
    MPI_Recv to receive the portions.
  • When complete, MPI_Ssend is used to send back
    each process's portion and MPI_Recv receives each
    portion.

47
Crypt Results
48
Crypt Conclusions
  • Results are similar on both clusters.
  • There is a slight performance problem with 4 and
    8 nodes with the DSM version.
  • This can be attributed to the placing of a
    barrier that causes all threads to block before
    computing, in the DSM version, while the MPI
    version does not block.

49
Sparse Matrix Multiplication
  • A 50,000 x 50,000 unstructured matrix stored in
    compressed-row format is multiplied over 200
    iterations.
  • Only the final result is communicated as each
    node has its own portion of data and initial
    distribution is not timed.
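A compressed-row (CRS) matrix-vector multiply, the kernel iterated 200 times, looks like the following in outline. A tiny 3 x 3 example stands in for the 50,000 x 50,000 benchmark matrix:

```java
public class CrsSpmv {
    // y = A * x with A stored in compressed-row format: val holds the
    // nonzeros, col their column indices, rowPtr the start of each row.
    // In the benchmark each node owns a block of rows; here the whole
    // loop runs locally.
    static double[] multiply(double[] val, int[] col, int[] rowPtr, double[] x) {
        int rows = rowPtr.length - 1;
        double[] y = new double[rows];
        for (int i = 0; i < rows; i++)
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++)
                y[i] += val[k] * x[col[k]];
        return y;
    }

    public static void main(String[] args) {
        // The matrix [[2,0,1],[0,3,0],[4,0,5]] in CRS form
        double[] val = {2, 1, 3, 4, 5};
        int[] col    = {0, 2, 1, 0, 2};
        int[] rowPtr = {0, 2, 3, 5};
        double[] y = multiply(val, col, rowPtr, new double[]{1, 1, 1});
        System.out.println(y[0] + " " + y[1] + " " + y[2]); // 3.0 3.0 9.0
    }
}
```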

50
Sparse DSM Modifications
  • This benchmark originally produced excessive
    network traffic through remote object access.
  • The modifications involved removing the object
    access during the multiplication loop and using
    arraycopy to distribute the final result to the
    root thread.

51
Sparse mpiJava
  • This benchmark's only communication is an
    MPI_Allreduce, which performs an array reduce,
    leaving the result on all nodes. This is employed
    to obtain the final result of the multiplication.

52
Sparse Results
53
Sparse Conclusions
  • The results are similar on both clusters.
  • The DSM version has better performance on both
    clusters.
  • The MPI version uses the MPI_Allreduce method
    that places the results on all nodes.
  • This method adds extra overhead that is not
    present in the DSM version, where the results are
    just collected on the root node.

54
Molecular Dynamics
  • This benchmark is a 2048-body code that models
    particles interacting under a Lennard-Jones
    potential in a cubic spatial volume with periodic
    boundary conditions.
  • Parallelization is provided by dividing the range
    of iterations over the particles among nodes.
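The Lennard-Jones pair potential the benchmark evaluates is U(r) = 4ε[(σ/r)^12 − (σ/r)^6]. A minimal sketch in reduced units (ε = σ = 1, an assumption for illustration):

```java
public class LennardJones {
    // Lennard-Jones pair potential with epsilon = sigma = 1. The (1/r)^6
    // term is computed once and squared to get the (1/r)^12 term.
    static double potential(double r) {
        double s6 = Math.pow(1.0 / r, 6);
        return 4.0 * (s6 * s6 - s6);
    }

    public static void main(String[] args) {
        // The potential minimum sits at r = 2^(1/6), where U = -1.
        System.out.printf("%.4f%n", potential(Math.pow(2, 1.0 / 6))); // -1.0000
    }
}
```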

55
MolDyn DSM Modifications
  • Significant amount of remote thread object access
    necessitated modifications.
  • Particle forces are updated by remote threads
    using arraycopy to send the forces to the root
    thread and the root thread serially updates the
    forces and sends the new force array back to the
    remote threads via arraycopy.

56
MolDyn mpiJava
  • This version uses six MPI_Allreduce commands to
    update various force and movement arrays. This
    occurs at every time step.

57
MolDyn Results
58
MolDyn Conclusions
  • The DSM version must update particle forces
    serially on the root thread.
  • This causes all threads to block while sending
    their local forces to the root thread and waiting
    for the updated forces to be sent back.
  • The MPI version uses MPI_Allreduce to efficiently
    compute the forces among all the processes in
    parallel, so each node blocks only for its
    neighbors' force updates.

59
Ray Tracer
  • A scene with 64 spheres is rendered at 150 x 150
    pixels.
  • Parallelization is provided by a cyclic
    distribution for load balancing when looping over
    the rows of pixels.

60
Ray Tracer DSM Modifications
  • This benchmark was poorly designed as it created
    far too many temporary objects. The DSM version
    was heavily modified to eliminate temporary
    object creation.
  • Broadcast is used to transmit the array row
    reference to each thread.
  • Once the rendering is complete, arraycopy is used
    to copy the row data to the root thread.
  • Reduce is used to compute the pixel checksum for
    validation purposes.

61
Ray Tracer mpiJava
  • The mpiJava version was also modified to remove
    temporary object creation.
  • MPI_Reduce is used to generate the pixel
    checksum.
  • MPI_Send and MPI_Recv are used to transmit the
    row data to the root process.

62
Ray Tracer Results
63
Ray Tracer Conclusions
  • The results on both clusters are almost identical.
  • Very little network communication is required.
    Most of the time is spent rendering the scene.

64
Monte Carlo
  • Uses Monte Carlo techniques to price products
    derived from the worth of an underlying asset.
  • Generates 2,000 samples.
  • Parallelization is provided by dividing the work
    in the principal loop of the Monte Carlo runs in
    block fashion and distributing the blocks to
    remote nodes.

65
Monte Carlo: Lack Of DSM
  • This benchmark required an unacceptable amount of
    memory for Hyperion to handle.
  • It is embarrassingly parallel and we have other
    embarrassingly parallel benchmarks.
  • We felt that it was unnecessary to develop a
    working DSM version from this large set of code.

66
Monte Carlo mpiJava
  • The mpiJava version provided some insight into
    the mpiJava to Hyperion port, specifically in the
    object serialization portion.
  • Monte Carlo relies heavily on object
    serialization as Java objects are distributed via
    send and receive commands instead of primitive
    types.
  • By creating a working version of Monte Carlo, we
    were able to eliminate many subtle mpiJava to
    Hyperion port bugs.

67
Monte Carlo Results
68
Monte Carlo Conclusions
  • The MPI version scales well on both clusters.
  • Without a DSM version, a comparison is not
    possible.

69
Conclusions
  • The Hyperion system can compete with traditional
    parallel programming models.
  • However, to compete, a Hyperion user cannot
    simply write multi-threaded Java code and expect
    it to perform well on a cluster.
  • Users must be aware of how thread creation works
    in Hyperion and the effects of remote object
    access in order to tune their applications.

70
Conclusions, Continued
  • With an application developed using Hyperion's
    thread management and data exchange techniques
    (reduce, broadcast, arraycopy), a Hyperion user
    can achieve competitive performance.
  • We feel that methods for performing operations
    on groups of threads, such as the above, should
    be part of the Java threading API, as they could
    be useful even outside a parallel environment.