Title: MPJ (Message Passing in Java): The past, present, and future

Slide 1: MPJ (Message Passing in Java): The past, present, and future
- Aamir Shafi
- Distributed Systems Group
- University of Portsmouth
Slide 2: Presentation Outline
- Introduction
  - Java messaging system
  - Java NIO (New I/O package)
  - Comparison of Java with C
  - The trend towards SMP clusters
- Background and review
- MPJ design and implementation
- Performance evaluation
- Conclusions
Slide 3: Introduction
- A lot of interest in a Java messaging system
- There is no reference messaging system in pure Java
  - A reference system should follow the API defined by the MPJ specification.
- What does a Java messaging system have to offer?
  - Portability
    - Write once, run anywhere.
  - Object-oriented programming concepts
    - A higher level of abstraction for parallel programming.
  - An extensive set of API libraries
    - Avoids reinventing the wheel.
  - A multi-threaded language
    - Thread-safety mechanisms:
      - synchronized blocks,
      - wait() and notify() in the Object class.
  - Automatic memory management.
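The thread-safety mechanisms listed above can be illustrated with a minimal sketch (generic Java, not MPJ code): a one-slot mailbox where synchronized methods plus wait() and notify() from the Object class coordinate a producer and a consumer thread.

```java
// One-slot mailbox: synchronized guards the slot, wait()/notifyAll() block and
// wake threads. Illustrative only -- not taken from the MPJ source.
public class Mailbox {
    private Integer slot = null;

    public synchronized void put(int value) throws InterruptedException {
        while (slot != null) {
            wait();                 // block until the consumer empties the slot
        }
        slot = value;
        notifyAll();                // wake any thread waiting in take()
    }

    public synchronized int take() throws InterruptedException {
        while (slot == null) {
            wait();                 // block until the producer fills the slot
        }
        int value = slot;
        slot = null;
        notifyAll();                // wake any thread waiting in put()
        return value;
    }

    public static void main(String[] args) throws Exception {
        Mailbox box = new Mailbox();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 3; i++) box.put(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        for (int i = 0; i < 3; i++) sum += box.take();   // receives 0, 1, 2
        producer.join();
        System.out.println("sum = " + sum);              // prints "sum = 3"
    }
}
```

Note the `while` (not `if`) around each wait(): a woken thread must re-check its condition, since another thread may have claimed the slot first.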
Slide 4: Introduction
- But isn't Java slower than C (in terms of I/O)?
  - The traditional Java I/O package is a blocking API:
    - a separate thread is needed to handle each socket connection.
- Java New I/O package
  - Adds non-blocking I/O to the Java language
    - C select()-like functionality.
  - Direct buffers
    - Conventional Java objects are allocated on the JVM heap,
    - Direct buffers, by contrast, are allocated in native OS memory,
    - This provides faster I/O; the buffers are not subject to garbage collection.
- JIT (Just-In-Time) compilers
  - Convert bytecode into native machine code.
- Communication performance
  - Comparison of the Java NIO and C Netpipe drivers:
    - Java performs similarly to C on Fast Ethernet.
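The NIO features above can be shown in a small self-contained sketch (illustrative only; a Pipe stands in for a socket pair so it runs without a network): a direct buffer, a non-blocking channel, and a Selector registered for read events.

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class NioDemo {
    public static void main(String[] args) throws Exception {
        // Direct buffer: allocated in native OS memory, outside the JVM heap.
        ByteBuffer buf = ByteBuffer.allocateDirect(64);
        System.out.println("direct buffer: " + buf.isDirect());   // true

        Pipe pipe = Pipe.open();                 // stands in for a socket pair
        pipe.source().configureBlocking(false);  // non-blocking, select()-style
        Selector selector = Selector.open();
        // Register for read events only (registering for writes as well is one
        // known cause of selector loops spinning at 100% CPU).
        pipe.source().register(selector, SelectionKey.OP_READ);

        pipe.sink().write(ByteBuffer.wrap("hello".getBytes("US-ASCII")));

        selector.select();                       // blocks until data is readable
        for (SelectionKey key : selector.selectedKeys()) {
            if (key.isReadable()) {
                int n = ((Pipe.SourceChannel) key.channel()).read(buf);
                System.out.println("read " + n + " bytes");   // read 5 bytes
            }
        }
        selector.close();
        pipe.source().close();
        pipe.sink().close();
    }
}
```

A real device driver would register SocketChannels instead of a Pipe, but the select-then-read structure is the same.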
Slide 5: (no transcript)
Slide 6: (no transcript)
Slide 7: Introduction: Some background
- Parallel programming paradigms
  - Shared memory
    - Standard: SHMEM and, more recently, OpenMP.
    - Implementation: JOMP (Java OpenMP).
  - Distributed memory
    - Standard: Message Passing Interface (MPI).
    - Implementation: MPJ (Message Passing in Java).
  - Hybrid paradigms.
- Clusters have become a cost-effective alternative to traditional HPC hardware
- This trend towards clusters has led to the emergence of SMP clusters
  - StarBug, the DSG cluster, consists of eight dual-CPU nodes:
    - shared memory for intra-node communications,
    - distributed memory for inter-node communications.
- Thus, a framework based on a hybrid programming paradigm.
Slide 8: Aims of the project
- Research and development of a reference messaging system based on Java NIO.
- A secure runtime infrastructure to bootstrap and control MPJ processes.
- An MPJ framework for SMP clusters
  - Integrate MPJ and JOMP:
    - use MPJ for distributed memory,
    - use JOMP for shared memory.
  - Map the parallel application to the underlying hardware for optimal execution.
- Debugging, monitoring, and profiling tools.
- This talk discusses MPJ and the secure runtime, and motivates efficient execution on shared memory processors.
Slide 9: Presentation Outline
- Introduction
- Background and review
- MPJ design and implementation
- Performance evaluation
- Conclusions
Slide 10: Background and review
- This section of the talk discusses:
  - messaging systems in Java,
  - shared memory libraries in Java,
  - the runtime infrastructures.
- A detailed literature review is available in the DSG first-year technical report:
  - "A Status Report: Early Experiences with the Implementation of a Message Passing System using Java NIO"
  - http://dsg.port.ac.uk/shafia/res/papers/DSG_2.pdf
Slide 11: Messaging systems in Java
- Three approaches to building messaging systems in Java, using:
  - RMI (Remote Method Invocation)
    - A Java API that allows methods to be executed on remote objects,
    - Meant for client-server interaction,
    - Transfers primitive datatypes as objects.
  - JNI (Java Native Interface)
    - An interface that allows Java to invoke C (and other languages),
    - Not truly portable,
    - Additional copying between Java and C.
  - Sockets interface
    - Java standard I/O package,
    - Java New I/O package.
Slide 12: Using RMI
- JMPI (University of Massachusetts)
  - Cons:
    - Not active,
    - Poor performance because of RMI,
    - KaRMI was used instead of RMI
      - KaRMI runs on Myrinet.
- CCJ (Vrije Universiteit Amsterdam)
  - Supports the transfer of objects as well as basic datatypes.
  - Cons:
    - Not active,
    - Poor performance because of RMI.
- JMPP (National Chiao-Tung University).
Slide 13: Using JNI
- mpiJava (Indiana University / UoP)
  - Pros:
    - Moving towards the MPJ API specification,
    - Well supported and widely used.
  - Cons:
    - Uses JNI and a native MPI as the communication medium.
- JavaMPI (University of Westminster)
  - Cons:
    - No longer active (uses the Native Method Interface, NMI),
    - Source code not available.
- M-JavaMPI (The University of Hong Kong)
  - Supports process migration using JVMDI (JVM Debug Interface), which has been deprecated in Java 1.5
    - superseded by JVMTI (JVM Tool Interface).
Slide 14: Using the sockets interface
- MPJava (University of Maryland)
  - Pros:
    - Based on Java NIO.
  - Cons:
    - No runtime infrastructure,
    - Source code is not available.
- MPP (University of Bergen)
  - Based on Java NIO,
  - Subset of MPI functionality
  - because of a bug in the TCP/IP stack.
Slide 15: Shared memory libraries in Java
- OpenMP implementation in Java (EPCC)
  - JOMP (Java OpenMP),
  - Single JVM; starts multiple threads to match the number of processors on an SMP node.
- Efficient shared memory communications can also be implemented using the MappedByteBuffer class
  - Memory is mapped to a file,
  - The sender may lock and write to the file,
  - The reader may lock and read from the file.
- Single-JVM implementation of mpjdev
  - Threads act as processes.
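The MappedByteBuffer scheme above can be sketched in a single JVM (illustrative; in the real scheme the two mappings would live in different JVMs on the same node, with file locks serialising access):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedExchange {
    // Write an int through one mapping and read it back through a second one.
    static int roundTrip() throws Exception {
        File f = File.createTempFile("mpj-shm", ".dat");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            // "Sender" side: map the file and write a message into it.
            MappedByteBuffer sendBuf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 64);
            sendBuf.putInt(42);
            sendBuf.force();                      // flush the mapped region

            // "Receiver" side: a second mapping of the same region sees the data.
            // In the scheme above this mapping would belong to a second JVM,
            // with file locks serialising sender and reader access.
            MappedByteBuffer recvBuf = ch.map(FileChannel.MapMode.READ_ONLY, 0, 64);
            return recvBuf.getInt();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("received: " + roundTrip());
    }
}
```

FileChannel.lock() (not shown) would provide the sender/reader locking mentioned above.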
Slide 16: The runtime infrastructures
- Shell/Perl scripts
  - Most messaging systems use SSH to start remote processes (on Linux).
- SMPD (Argonne National Lab)
  - Part of MPICH-2,
  - SMPD stands for Super Multi-Purpose Daemon,
  - Different implementations for Linux and Windows.
- Java is ideal for implementing the runtime infrastructure
  - Portability: the same implementation will run on different operating systems.
Slide 17: Presentation Outline
- Introduction
- Literature review
- MPJ design and implementation
- Performance evaluation
- Conclusions
Slide 18: Design Goals
- Portability.
- Standard Java
  - Assumes no language extensions.
- High performance.
- Modular and layered architecture
  - Device drivers and other layers.
- Allows a higher level of abstraction
  - By enabling the transfer of objects.
Slide 19: The Generic Design
Slide 20: Implementation of MPJ
- Device drivers
  - Java NIO device driver (mpjdev),
  - The native MPI device driver (native mpjdev),
  - Can be swapped in and out of MPJ,
  - Similar to the device drivers in MPICH.
- MPJ point-to-point communications
  - Blocking and non-blocking,
  - Communicators, virtual topologies.
- MPJ collective communications
  - Various collective communication methods.
- Instantiation of the MPJ design (on the next slide).
Slide 21: (no transcript)
Slide 22: Java NIO device driver
- Communication protocols
  - Eager-send
    - Assumes the receiver has infinite memory,
    - For small messages (< 128 KB),
    - May incur additional copying.
  - Rendezvous
    - Exchange of control messages before the actual transmission,
    - For long messages (≥ 128 KB).
- The buffering API
  - Supports gathering/scattering of data,
  - Supports the transfer of Java objects.
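The protocol split above amounts to a threshold check on the message size. A hedged sketch (not the actual mpjdev source; names are illustrative) of how a device driver might choose:

```java
// Sketch of eager-send vs rendezvous selection, using the 128 KB threshold
// from the slide. Illustrative only -- not taken from the mpjdev source.
public class ProtocolSelect {
    static final int EAGER_LIMIT = 128 * 1024;   // threshold from the slide

    enum Protocol { EAGER_SEND, RENDEZVOUS }

    // Eager-send: push the payload immediately and let the receiver buffer it
    // (may copy the data an extra time if no matching receive is posted yet).
    // Rendezvous: exchange control messages first, then transmit the payload.
    static Protocol choose(int messageBytes) {
        return messageBytes < EAGER_LIMIT ? Protocol.EAGER_SEND
                                          : Protocol.RENDEZVOUS;
    }

    public static void main(String[] args) {
        System.out.println(choose(1024));          // EAGER_SEND
        System.out.println(choose(256 * 1024));    // RENDEZVOUS
    }
}
```

The trade-off the threshold encodes: below it, the control-message round trip of rendezvous would dominate; above it, the extra copy of eager-send would.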
Slide 23: Point-to-point and collective methods
- Point-to-point communications
  - Blocking/non-blocking methods,
  - Buffered/ready/synchronous modes of send
    - Supported by the eager-send and rendezvous protocols at the device level.
  - Communicators,
  - Virtual topologies.
- Collective communications
  - Provided as a utility to MPI programmers,
  - Gather/scatter/all-to-all/reduce/all-reduce/scan.
Slide 24: The runtime infrastructure
Slide 25: Design of the runtime infrastructure
Slide 26: Implementation of the runtime
- The administrator installs MPJDaemons
  - SSH allows the daemons to be installed remotely on Linux,
  - The admin certificate is added to every daemon's keystore (a repository of certificates).
- Using the daemons
  - The administrator adds the user certificate into the keystore,
  - The MPJRun module is used to run the parallel application.
- Copying executables from MPJRun to MPJDaemon
  - Via dynamic class loading.
- Stdout/stderr is redirected to MPJRun.
Slide 27: Implementation issues
- Issues with Java NIO
  - The "Taming the NIO circus" thread:
    - http://forum.java.sun.com/thread.jsp?forum=4&thread=459338&start=0&range=15&hilite=false&q=
  - Allocating direct buffers led to an OutOfMemoryError (a bug),
  - Selectors taking 100 percent CPU:
    - there is no need to register for write events,
    - only register for read events.
  - J2SE (Java 2 Standard Edition) 1.5 has solved many of these problems.
- MPJDaemons ran out of memory because of direct buffers
  - These buffers are not subject to garbage collection,
  - Details are shown in the coming slides,
  - Solved by starting a new JVM at the MPJDaemon.
Slide 28: Machine names where MPJDaemon will be installed
Slide 29: Installing the daemon from the initiator machine
Slide 30: First execution
Slide 31: Memory stats of one of the machines where MPJDaemon is installed and is executing an MPJ app
Slide 32: After second execution
Slide 33: After a few more executions
Slide 34: Finally, out of memory.
Slide 35: Presentation Outline
- Introduction
- Literature review
- MPJ design and implementation
- Performance evaluation
- Conclusions
Slide 36: Sequence of perf evaluation graphs
- Java NIO device driver evaluation
  - Comparing to native mpjdev (which uses MPICH by interfacing through JNI)
  - Remote nodes of a cluster
    - Transfer time (microseconds),
    - Throughput achieved (Mbits/s).
  - Same node of a cluster
    - Transfer time (microseconds),
    - Throughput achieved (Mbits/s).
  - Importance of OpenMP for shared memory communications.
- Evaluation of the eager-send and rendezvous protocols
  - Transfer time (microseconds),
  - Throughput achieved (Mbits/s).
- MPJ point-to-point evaluation
  - Remote nodes of a cluster
    - Transfer time (microseconds),
    - Throughput achieved (Mbits/s).
Slide 37:
- The Java NIO device driver (red line) performs similarly to the native mpjdev device driver,
- Latency (the time taken to transfer one byte) is 260 microseconds.
Slide 38: Sequence of perf evaluation graphs (roadmap slide repeated from Slide 36)
Slide 39:
- Throughput for both devices is 89 Mbits/s,
- Change from the eager-send to the rendezvous protocol.
Slide 40: Sequence of perf evaluation graphs (roadmap slide repeated from Slide 36)
Slide 41:
- mpjdev (the red line) is communicating through sockets (the time is dictated by memory bus bandwidth),
- native mpjdev is communicating through shared memory,
- This is a problem for SMP clusters!
Slide 42: Sequence of perf evaluation graphs (roadmap slide repeated from Slide 36)
Slide 43: (no transcript)
Slide 44: Sequence of perf evaluation graphs (roadmap slide repeated from Slide 36)
Slide 45:
- Eager-send is for small messages (< 128 KB),
- Eager-send may incur additional copying,
- The time for exchanging control messages in rendezvous dictates the communication time of small messages,
- Rendezvous is suitable for large messages (> 128 KB).
Slide 46: Sequence of perf evaluation graphs (roadmap slide repeated from Slide 36)
Slide 47: (no transcript)
Slide 48: Sequence of perf evaluation graphs (roadmap repeated; the MPJ point-to-point evaluation is now compared to MPICH and mpiJava, which uses MPICH by interfacing through JNI)
Slide 49: (no transcript)
Slide 50: Sequence of perf evaluation graphs (roadmap slide repeated from Slide 48)
Slide 51:
- MPICH (C MPI): 89 Mbits/s
- MPJ: 88 Mbits/s
- mpiJava: 82 Mbits/s
- The overhead of JNI comes into play for mpiJava.
Slide 52: Parallel matrix multiplication
- Aims
  - To test the functionality of MPJ,
  - To check the speed-up of the parallel version against the sequential version.
- The parallel version (with TP total processes)
  - Suppose matrix A and matrix B have M rows and N columns,
  - Send (M/TP) rows of matrix A to each process, along with all of matrix B, so that it can compute (M/TP) rows of the resultant matrix C,
  - Receive the (M/TP) rows of the resultant matrix C from each process.
- A trivial parallel application
  - The parallel version used eight processors on StarBug, the DSG cluster.
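The row-block decomposition described above can be sketched in plain Java, with the message-passing steps replaced by a loop (no MPJ calls; purely illustrative):

```java
import java.util.Arrays;

// Each "process" p computes rows [p*M/TP, (p+1)*M/TP) of C = A x B from its
// block of A's rows and the whole of B. Illustrative, single-process sketch.
public class BlockMatMul {
    static double[][] multiplyBlock(double[][] a, double[][] b,
                                    int rowStart, int rowEnd) {
        int n = b[0].length, k = b.length;
        double[][] c = new double[rowEnd - rowStart][n];
        for (int i = rowStart; i < rowEnd; i++)
            for (int j = 0; j < n; j++) {
                double s = 0;
                for (int x = 0; x < k; x++) s += a[i][x] * b[x][j];
                c[i - rowStart][j] = s;
            }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        int tp = 2, m = a.length;
        // Here the loop stands in for where MPJ would send the row blocks of A
        // (plus B) to TP processes and gather the row blocks of C back.
        for (int p = 0; p < tp; p++) {
            double[][] block = multiplyBlock(a, b, p * m / tp, (p + 1) * m / tp);
            System.out.println(Arrays.deepToString(block));
        }
        // prints [[19.0, 22.0]] then [[43.0, 50.0]]
    }
}
```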
Slide 53: (no transcript)
Slide 54: The default heap (64 MB) goes out of memory at this point
Slide 55: Java on Gigabit Ethernet
- The max throughput achieved by the C Netpipe driver is 900 Mbits/s,
- The max throughput achieved by the Java Netpipe driver is 680 Mbits/s,
- The Java driver should be right up there with the C driver!
Slide 56: JVM tuning options tried (graph series):
1. Aggressive heap
2. -client JVM
3. Concurrent GC
4. Concurrent I/O
5. Fixed no. of GC threads
6. Incremental GC
7. No class GC
8. Par new GC
9. GC time ratio
10. Simple (nothing)

Things only get worse with mpjdev on Gigabit Ethernet!
Slide 57: Java on Gigabit Ethernet
- The performance of Java is not satisfactory on Gigabit Ethernet
- It depends on how well the application is written
  - "well" in the sense of consuming as little memory as possible.
- Suspicions
  - Garbage collection starts using a lot of CPU,
  - This problem was identified while comparing the Netpipe C and Java drivers on Gigabit Ethernet.
- Understanding this behaviour/problem is work in progress.
Slide 58: Presentation Outline
- Introduction
- Literature review
- MPJ design and implementation
- Performance evaluation
- Conclusions
Slide 59: Summary
- There is a lot of interest in the community in implementing a reference Java messaging system.
- The overall aim of this project is to develop a framework for parallel programming in Java over SMP clusters.
- The past year has been spent developing a reference messaging system:
  - a Java NIO based device driver,
  - MPJ point-to-point and collective communications are in progress.
- A secure runtime infrastructure to execute MPJ applications over a cluster of workstations connected by a fast network.
Slide 60: Future work
- Implementation of the MPJ standard
  - Virtual topologies, communicators, derived datatypes, etc.
- Support for multi-dimensional arrays
  - Multi-dimensional arrays in Java are really arrays of arrays,
  - Inefficient and confusing for application developers.
- Support for shared memory communications, with three options:
  - JOMP,
  - a specialized device driver that runs threads as processes on an SMP node,
  - MappedByteBuffer-based shared memory communications between the two JVMs on an SMP node.
  - It is not yet clear which of these three options is the most efficient.
- Integration of JOMP and MPJ
  - It is not yet clear what the best way is to integrate JOMP and MPJ.
- Monitoring, debugging, and profiling tools.
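The arrays-of-arrays point above can be demonstrated directly: each row of a Java `double[][]` is a separately allocated object, so a "2-D" array can even be ragged. A common remedy (illustrative, not an MPJ feature) is a flattened 1-D array indexed as `i * cols + j`:

```java
// Java multi-dimensional arrays are arrays of row arrays; a flattened 1-D
// array gives one contiguous allocation and simple index arithmetic.
public class ArrayLayout {
    public static void main(String[] args) {
        double[][] nested = new double[3][4];   // 1 spine + 3 row allocations
        nested[1] = new double[7];              // rows are independent: now ragged
        System.out.println(nested[0].length + " vs " + nested[1].length); // 4 vs 7

        // Flattened layout: one allocation, better locality for numeric kernels.
        int rows = 3, cols = 4;
        double[] flat = new double[rows * cols];
        flat[2 * cols + 3] = 1.0;               // element (2, 3)
        System.out.println(flat[11]);           // 1.0
    }
}
```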
Slide 61: Conclusions
- The performance of MPJ suggests that Java is a viable option for a messaging system.
- Java NIO adds useful functionality to Java:
  - non-blocking I/O,
  - direct buffers.
- MPJ can be used as:
  - A teaching tool
    - Easy to use,
    - Lets users concentrate on message passing concepts,
    - No memory leaks.
  - A simulation tool (enables Rapid Application Development)
    - To test fault-tolerant algorithms
      - load balancing and process migration.
- The MPJ runtime infrastructure allows execution on heterogeneous systems.
Slide 62: Questions/Suggestions