Title: Platforms for HPJava: Runtime Support for Scalable Programming in Java
1. Platforms for HPJava: Runtime Support for
Scalable Programming in Java
- Sang Boem Lim
- Florida State University
- slim@csit.fsu.edu
2. Contents
- Overview of HPJava
- Library support for HPJava
- High-level APIs
- Low-level API
- Applications and performance
- Contributions
- Conclusions
3. Goals
- Our research is concerned with enabling parallel,
high-performance computation, in particular the
development of scientific software in the
network-aware programming language Java.
- Issues concerned with the implementation of the
run-time environment underlying HPJava:
- High-level APIs (e.g. Adlib)
- Low-level API for underlying communications (e.g. mpjdev)
- Adlib is the first application-level library for
the HPspmd model.
- The mpjdev API is an underlying communication
library that performs the actual communications.
- Implementations of mpjdev:
- mpiJava-based implementation
- Multithreaded implementation
- LAPI implementation
4. SPMD Parallel Computing
- SIMD: A single control unit dispatches
instructions to each processing unit.
- e.g. Illiac IV, Connection Machine-2, etc.
- introduced a new concept: distributed arrays
- MIMD: Each processor is capable of executing a
different program, independent of the other
processors.
- asynchronous, flexible, but hard to program
- e.g. Cosmic Cube, Cray T3D, IBM SP3, etc.
- SPMD: Each processor executes the same program
asynchronously. Synchronization takes place only
when processors need to exchange data.
- loosely synchronous model (between SIMD and MIMD)
- HPF: an extension of Fortran 90 to support the
data-parallel programming model on distributed-memory
parallel computers.
5. Motivation
- SPMD (Single Program, Multiple Data) programming
has been very successful for parallel computing.
- Many higher-level programming environments and
libraries assume the SPMD style as their basic
model: ScaLAPACK, DAGH, Kelp, Global Array Toolkit.
- But the library-based SPMD approach to
data-parallel programming lacks the uniformity
and elegance of HPF.
- Compared with HPF, creating distributed arrays
and accessing their local and remote elements is
clumsy and error-prone.
- Because the arrays are managed entirely in
libraries, the compiler offers little support and
no safety net of compile-time or
compiler-generated run-time checking.
- These observations motivate our introduction of
the HPspmd model: direct SPMD programming
supported by additional syntax for HPF-like
distributed arrays.
6. HPspmd
- Proposed by Fox, Carpenter, and Xiaoming Li around 1998.
- Independent processes executing the same program,
sharing elements of distributed arrays described
by special syntax.
- Processes operate directly on locally owned
elements. Explicit communication is needed in the
program to permit access to elements owned by
other processes.
- Envisaged bindings for base languages like
Fortran, C, Java, etc.
7. HPJava: Overview
- Environment for parallel programming.
- Extends Java by adding some predefined classes
and some extra syntax for dealing with
distributed arrays.
- So far the only implementation of the HPspmd model.
- An HPJava program is translated to a standard Java
program which calls communication libraries and
the parallel runtime system.
8. HPJava Example

  Procs p = new Procs2(2, 2);
  on(p) {
    Range x = new ExtBlockRange(M, p.dim(0), 1),
          y = new ExtBlockRange(N, p.dim(1), 1);
    float [[-,-]] a = new float [[x, y]];
    . . .  // Initialize edge values in a (boundary conditions)
    float [[-,-]] b = new float [[x, y]], r = new float [[x, y]];  // r = residuals
    do {
      Adlib.writeHalo(a);
      overall(i = x for 1 : N - 2)
        overall(j = y for 1 : N - 2) {
          float newA = 0.25F * (a[i - 1, j] + a[i + 1, j] +
                                a[i, j - 1] + a[i, j + 1]);
          r[i, j] = Math.abs(newA - a[i, j]);
          b[i, j] = newA;
        }
      HPspmd.copy(a, b);  // Jacobi relaxation.
    } while(Adlib.maxval(r) > EPS);
  }
9. Processes and Process Grids
- An HPJava program is started concurrently in some
set of processes.
- Processes are named through grid objects:

  Procs p = new Procs2(2, 3);

- Assumes the program is currently executing on 6 or more
processes.
- Specify execution in a particular process grid with the
on construct:

  on(p) {
    . . .
  }
10. Distributed Arrays in HPJava
- Many differences between distributed arrays and
the ordinary arrays of Java. A new kind of container
type with special syntax.
- Type signatures and constructors use double brackets
to emphasize the distinction:

  Procs2 p = new Procs2(2, 3);
  on(p) {
    Range x = new BlockRange(M, p.dim(0));
    Range y = new BlockRange(N, p.dim(1));
    float [[-,-]] a = new float [[x, y]];
    . . .
  }
11. 2-dimensional array block-distributed over p (M = N = 8)

[Figure: the 8 x 8 array a divided into blocks over the 2 x 3
process grid p -- rows 0-3 and 4-7 split over p.dim(0), columns
0-2, 3-5, and 6-7 split over p.dim(1).]
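The block decomposition in the figure follows the standard block-partition formula. The sketch below is not HPJava's actual BlockRange code; the class name `BlockBounds` and the method `bounds` are invented for illustration.

```java
// Sketch of block distribution: process `coord` out of `procs` processes
// owns global indices [lo, hi) of a range of extent n.
// Not HPJava's BlockRange implementation; just the standard formula.
public class BlockBounds {
    // Returns {lo, hi} for the block owned by process `coord`.
    public static int[] bounds(int n, int procs, int coord) {
        int blockSize = (n + procs - 1) / procs;   // ceiling division
        int lo = Math.min(coord * blockSize, n);
        int hi = Math.min(lo + blockSize, n);
        return new int[] { lo, hi };
    }

    public static void main(String[] args) {
        // M = N = 8 on a 2 x 3 grid: dim 1 splits 8 columns over 3
        // processes, giving blocks of width 3, 3, and 2 as in the figure.
        for (int c = 0; c < 3; c++) {
            int[] b = bounds(8, 3, c);
            System.out.println("proc " + c + " owns columns " + b[0] + ".." + (b[1] - 1));
        }
    }
}
```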
12. The Range hierarchy of HPJava

[Class diagram: the abstract Range class with subclasses including
BlockRange, CyclicRange, ExtBlockRange, IrregRange, CollapsedRange,
and Dimension.]
13. The overall construct
- overall: a distributed parallel loop
- General form, parameterized by an index triplet:

  overall(i = x for l : u : s) { . . . }

- i = distributed index,
- l = lower bound, u = upper bound, s = step.
- In general, a subscript used in a distributed
array element must be a distributed index in the
array's range.
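On each process, an overall loop effectively runs over only the locally held part of the triplet. The plain-Java sketch below (class and method names are invented; the real HPJava translator instead generates calls into the runtime to compute local index sets) illustrates the idea by intersecting a global triplet l : u : s with a locally owned block [lo, hi].

```java
// Rough sketch of what an overall loop reduces to on one process:
// iterate the global triplet l : u : s, but execute the body only
// for indices that fall inside the locally owned block [lo, hi].
public class OverallSketch {
    public static int countLocalIterations(int l, int u, int s, int lo, int hi) {
        int count = 0;
        for (int i = l; i <= u; i += s) {   // global triplet l : u : s
            if (i >= lo && i <= hi) {        // restrict to the local block
                count++;                     // loop body would run here
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Global triplet 1 : 6 : 1 against local block 3..5 -> 3 iterations.
        System.out.println(countLocalIterations(1, 6, 1, 3, 5)); // prints 3
    }
}
```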
14. Irregular distributed data structures
- Can be described as a distributed array of Java
arrays:

  float [[-]][] a = new float [[x]][];
  overall(i = x)
    a[i] = new float [f(i)];

[Figure: the rows of a spread over processes 0 and 1, with local
rows of sizes 4, 2, 5, and 3.]
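In plain Java, the same irregular structure is simply an array of arrays whose rows have different lengths. The helper below (class `Irregular`, invented for illustration) builds the sizes-4, 2, 5, 3 example from the figure.

```java
// Plain-Java analogue of the irregular structure above: an array of
// arrays whose row lengths vary (here sizes 4, 2, 5, 3, as in the figure).
public class Irregular {
    public static float[][] make(int[] sizes) {
        float[][] a = new float[sizes.length][];
        for (int i = 0; i < sizes.length; i++) {
            a[i] = new float[sizes[i]];   // each row gets its own length
        }
        return a;
    }

    public static void main(String[] args) {
        float[][] a = make(new int[] { 4, 2, 5, 3 });
        System.out.println(a[2].length);  // prints 5
    }
}
```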
15. Library Support for HPJava
16. Historical Adlib I
- The Adlib library was completed in the Parallel
Compiler Runtime Consortium (PCRC) project.
- This version used C++ as its implementation language.
- The initial emphasis was on High Performance Fortran (HPF).
- Initially Adlib was not meant to be a user-level
library. It was called by HPF compiler-generated
code when HPF translated the user application.
- It was developed on top of portable MPI.
- Used by two experimental HPF translators (SHPF
and PCRC HPF).
17. Historical Adlib II
- Initially HPJava used a JNI wrapper interface to
the kernel of the PCRC library.
- This implementation had limitations and disadvantages.
- Most importantly, this version made it hard and
inefficient to support Java object types.
- It had performance disadvantages because all
calls to the native Adlib had to go through JNI calls.
- It did not provide a set of gather/scatter buffer
operations to better support HPC applications.
18. Collective Communication Library
- The Java version of Adlib is the first library of its
kind developed from scratch for application-level
use in the HPspmd model.
- Borrows many ideas from the PCRC library, but for
this project we rewrote the high-level library in Java.
- It is extended to support Java Object types, to
target Java-based communication platforms, and to
use Java exception handling, making it safe for Java.
- Supports collective operations on distributed
arrays described by HPJava syntax.
- The Java version of the Adlib library is
developed on top of mpjdev. The mpjdev API can be
implemented portably on network platforms and
efficiently on parallel hardware.
19. Java version of Adlib
- This API is intended for an application-level
communication library suitable for HPJava programming.
- There are three main families of collective
operation in Adlib:
- regular collective communications
- reduction operations
- irregular communications
- The complete APIs of Java Adlib are presented
in Appendix A of my dissertation.
20. Regular Collective Communications I
- remap
- Copies the values of the elements in the source
array to the corresponding elements in the
destination array.

  void remap(T [[-]] dst, T [[-]] src)

- T stands as a shorthand for any primitive type or
Object type of Java.
- Destination and source must have the same size
and shape, but they can have any, unrelated,
distribution formats.
- Can implement a multicast if the destination has a
replicated distribution format.
- shift

  void shift(T [[-]] dst, T [[-]] src, int amount, int dimension)

- Implements a simpler pattern of communication than
the general remap.
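The semantics of remap can be pictured in plain Java: both arrays cover the same global index space, and remap copies through that space no matter how each side is blocked. The toy model below (class `RemapSketch`, invented for illustration) redistributes a 1-D array from 2 blocks to 3; the real Adlib computes intersections of the two distributions and exchanges messages instead of materializing a global array.

```java
// Toy model of remap semantics: source and destination hold the same
// global array under different block distributions, and remap copies
// element-by-element through the global index space.
public class RemapSketch {
    // Concatenate the per-process blocks back into the global view.
    public static int[] toGlobal(int[][] localBlocks, int n) {
        int[] global = new int[n];
        int g = 0;
        for (int[] block : localBlocks)
            for (int v : block) global[g++] = v;
        return global;
    }

    // Split a global array into blocks for `procs` processes.
    public static int[][] toBlocks(int[] global, int procs) {
        int blockSize = (global.length + procs - 1) / procs;
        int[][] blocks = new int[procs][];
        for (int p = 0; p < procs; p++) {
            int lo = Math.min(p * blockSize, global.length);
            int hi = Math.min(lo + blockSize, global.length);
            blocks[p] = new int[hi - lo];
            for (int k = 0; k < blocks[p].length; k++) blocks[p][k] = global[lo + k];
        }
        return blocks;
    }

    public static void main(String[] args) {
        int[] global = { 10, 11, 12, 13, 14, 15 };
        int[][] src = toBlocks(global, 2);            // blocks of 3 on 2 procs
        int[][] dst = toBlocks(toGlobal(src, 6), 3);  // "remap" onto 3 procs
        System.out.println(dst[1][0]);                 // prints 12
    }
}
```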
21. Regular Collective Communications II
- writeHalo

  void writeHalo(T [[-]] a)

- Applied to distributed arrays that have ghost
regions. It updates those regions.
- A more general form of writeHalo allows one to
specify that only a subset of the available ghost
area is to be updated:

  void writeHalo(T [[-]] a, int wlo, int whi, int mode)

- wlo, whi specify the widths at the lower and upper
ends of the bands to be updated.
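A minimal plain-Java picture of what writeHalo does, for two 1-D subdomains with one ghost cell at each end (the helper is invented, not the Adlib implementation): each ghost cell is overwritten with a copy of the neighbouring subdomain's edge element.

```java
// Sketch of a 1-D ghost-region update between two subdomains: each
// array has layout [ghostLo, interior..., ghostHi], and the ghosts
// are filled from the neighbour's edge interior cells. writeHalo
// generalizes this to multidimensional blocks and arbitrary widths.
public class HaloSketch {
    public static void writeHalo(double[] left, double[] right) {
        left[left.length - 1] = right[1];    // right neighbour's first interior cell
        right[0] = left[left.length - 2];    // left neighbour's last interior cell
    }

    public static void main(String[] args) {
        double[] left  = { 0, 1.0, 2.0, 0 };  // ghosts at indices 0 and 3
        double[] right = { 0, 3.0, 4.0, 0 };
        writeHalo(left, right);
        System.out.println(left[3] + " " + right[0]); // prints 3.0 2.0
    }
}
```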
22. Solution of Laplace equation using ghost regions

  Range x = new ExtBlockRange(M, p.dim(0), 1);
  Range y = new ExtBlockRange(N, p.dim(1), 1);
  float [[-,-]] a = new float [[x, y]];
  . . .  // Initialize values in a
  float [[-,-]] b = new float [[x, y]], r = new float [[x, y]];
  do {
    Adlib.writeHalo(a);
    overall(i = x for 1 : N - 2)
      overall(j = y for 1 : N - 2) {
        float newA = 0.25F * (a[i - 1, j] + a[i + 1, j] +
                              a[i, j - 1] + a[i, j + 1]);
        r[i, j] = Math.abs(newA - a[i, j]);
        b[i, j] = newA;
      }
    HPspmd.copy(a, b);
  } while(Adlib.maxval(r) > EPS);

[Figure: the local segments of a on processes 0 and 1, each
extended by a one-cell ghost region holding copies of the
neighbouring process's edge elements.]
23. Illustration of the effect of executing the writeHalo function

[Figure legend: physical segment of array; declared ghost region of
array segment; ghost area written by writeHalo.]
24. Other features of Adlib
- Provides reduction operations (e.g. maxval() and
sum()) and irregular communications (e.g.
gather() and scatter()).
- The complete API and implementation issues are
described in depth in my dissertation.
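A reduction such as maxval conceptually combines per-process partial results. The sketch below (invented names; Adlib performs the combination through communication rather than a local loop over blocks) reduces each local block and then merges the partial maxima.

```java
// Sketch of how a reduction like Adlib.maxval combines per-process
// partial results: each process reduces its own block, then the
// partial maxima are combined into a single global maximum.
public class MaxvalSketch {
    public static double localMax(double[] block) {
        double m = Double.NEGATIVE_INFINITY;
        for (double v : block) m = Math.max(m, v);
        return m;
    }

    public static void main(String[] args) {
        double[][] blocks = { { 0.1, 0.7 }, { 0.4, 0.9 }, { 0.2 } };
        double global = Double.NEGATIVE_INFINITY;
        for (double[] b : blocks) global = Math.max(global, localMax(b));
        System.out.println(global);  // prints 0.9
    }
}
```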
25. Other High-level APIs
- Java Grande Message-Passing Working Group
- Formed as a subset of the existing Concurrency
and Applications working group of the Java Grande Forum.
- Discussion of a common API for MPI-like Java libraries.
- To avoid confusion with standards published by
the original MPI Forum, the API was called MPJ.
- The java-mpi mailing list has about 195 subscribers.
26. mpiJava
- Implements a Java API for MPI suggested in late 1997.
- mpiJava is currently implemented as a Java
interface to an underlying MPI implementation, such
as MPICH or some other native MPI implementation.
- The interface between mpiJava and the underlying
MPI implementation is via the Java Native Interface (JNI).
- This software is available from
http://www.hpjava.org/mpiJava.html
- Around 1465 people have downloaded this software.
27. Low-level API
- One area of research is how to transfer data
between the Java program and the network while
reducing the overheads of the Java Native Interface.
- Should do this portably on network platforms and
efficiently on parallel hardware.
- We developed a low-level Java API for HPC message
passing, called mpjdev.
- The mpjdev API is a device-level communication
library. It was developed with HPJava in
mind, but it is a standalone library and could be
used by other systems.
28. mpjdev I
- Meant for library developers.
- Application-level communication libraries like the
Java version of Adlib (or potentially MPJ) can be
implemented on top of mpjdev.
- The API for mpjdev is small compared to MPI (it only
includes point-to-point communications):
- Blocking mode (like MPI_SEND, MPI_RECV)
- Non-blocking mode (like MPI_ISEND, MPI_IRECV)
- The sophisticated data types of MPI are omitted.
- Provides a flexible suite of operations for copying
data to and from the buffer (like gather- and
scatter-style operations).
- Buffer handling has similarities to the JDK 1.4 new I/O.
29. mpjdev II
- mpjdev could be implemented on top of Java
sockets in a portable network implementation,
or, on HPC platforms, through a JNI interface to a
subset of MPI.
- Currently there are three different implementations:
- The initial version was targeted to HPC
platforms, through a JNI interface to a subset of MPI.
- For SMPs, and for debugging on a single
processor, we implemented a pure-Java,
multithreaded version.
- We also developed a more system-specific mpjdev
built on the IBM SP system using LAPI.
- A Java sockets version, which will provide a more
portable network implementation, will be added
in the future.
30. HPJava communication layers

  MPJ and Java version of Adlib | Other application-level APIs
  --------------------------------------------------------------
                             mpjdev
  --------------------------------------------------------------
        Pure Java             |          Native MPI
  SMPs or networks of PCs     |  Parallel hardware (e.g. IBM SP3, Sun HPC)
31. mpiJava-based Implementation
- Assumes the C binding of native method calls to MPI
from mpiJava as the basic communication protocol.
- Can be divided into two parts:
- Java APIs (Buffer and Comm classes), which are
used to call native methods via JNI.
- C native methods that construct the message
vector and perform communication.
- For elements of Object type, the serialized data
are stored into a Java byte array,
- copying into the existing message vector if it
has space to hold the serialized data array,
- or using a separate send if the original message
vector is not large enough.
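The serialization path for Object-type elements can be sketched with standard java.io serialization (the helper names are invented; the real implementation additionally decides whether the bytes fit into the existing message vector or need a separate send).

```java
import java.io.*;

// Sketch of the serialization path described above: an Object-type
// element is flattened to a byte array for the message, then
// restored on the receiving side.
public class SerializeSketch {
    public static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);          // serialize into the byte array
        }
        return bytes.toByteArray();
    }

    public static Object fromBytes(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();        // reconstruct the object
        }
    }

    public static void main(String[] args) throws Exception {
        String received = (String) fromBytes(toBytes("hello"));
        System.out.println(received);      // prints hello
    }
}
```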
32. Multithreaded Implementation
- The processes of an HPJava program are mapped to
Java threads of a single JVM.
- This allows us to debug and demonstrate HPJava
programs without facing the ordeal of installing
MPI or running on a network.
- As a by-product, it also means we can run HPJava
programs on shared-memory parallel computers,
- e.g. high-end UNIX servers.
- Java threads of modern JVMs are usually executed
in parallel on these kinds of machines.
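The thread-per-process idea can be sketched with standard Java threads, where a "send" becomes a hand-off through a shared in-memory queue rather than a network operation (an invented example, not the mpjdev code).

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the multithreaded mpjdev idea: each "process" of the SPMD
// program is a Java thread, and a message send is a hand-off through
// a shared in-memory queue instead of a network operation.
public class ThreadsSketch {
    public static int exchange() throws InterruptedException {
        BlockingQueue<Integer> toP1 = new LinkedBlockingQueue<>();
        final int[] received = new int[1];
        Thread p0 = new Thread(() -> toP1.add(42));      // "process 0" sends
        Thread p1 = new Thread(() -> {                   // "process 1" receives
            try { received[0] = toP1.take(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        p0.start(); p1.start();
        p0.join();  p1.join();
        return received[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(exchange());   // prints 42
    }
}
```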
33. LAPI Implementation
- The Low-level Application Programming Interface
(LAPI) is a low-level communication interface for
the IBM Scalable POWERparallel (SP) supercomputer switch.
- This switch provides scalable, high-performance
communication between SP nodes.
- LAPI functions can be divided into three
characteristic groups:
- An active message infrastructure that allows programmers
to write and install their own set of handlers.
- Two Remote Memory Copy (RMC) interfaces:
- The PUT operation copies data from the address space
of the origin process into the address space of
the target process.
- The GET operation is the opposite of the PUT operation.
- We produced two different implementations of
mpjdev using LAPI:
- Active message function (LAPI_Amsend) with the GET
operation (LAPI_Get)
- Active message function (LAPI_Amsend) alone
34. LAPI Implementation: Active message and GET operation

35. LAPI Implementation: Active message
36. Applications and Performance
37. Environments
- System: IBM SP3 supercomputing system with the AIX
4.3.3 operating system and 42 nodes.
- CPU: A node has four processors (POWER3, 375 MHz)
and 2 gigabytes of shared memory.
- Network: MPI setting: shared css0 adapter with
User Space (US) communication mode.
- Java VM: IBM's JIT.
- Java compiler: IBM J2RE 1.3.1 with the -O option.
- HPF compiler: IBM xlhpf95 with the -qhot and -O3 options.
- Fortran 95 compiler: IBM xlf95 with the -O5 option.
38.
- HPJava can out-perform sequential Java by up to 17 times.
- On 36 processors HPJava can get about 79% of the
performance of HPF.
40. Multigrid
- The multigrid method is a fast algorithm for the
solution of linear and nonlinear problems. It
uses a hierarchy of grids, with restrict and
interpolate operations between the current grid
(fine grid) and restricted grids (coarse grids).
- The general strategy is:
- Make the error smooth by performing a relaxation method.
- Restrict a smoothed version of the error term
to a coarse grid, compute a correction term on
the coarse grid, then interpolate this
correction back to the original fine grid.
- Perform some steps of the relaxation method again
to improve the original approximation to the solution.
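The restrict and interpolate operations named above can be sketched in 1-D: a generic full-weighting restriction and linear interpolation (not the dissertation's actual multigrid code; the class and method names are invented).

```java
// Sketch of the two grid-transfer operations of multigrid, in 1-D:
// restriction maps a fine-grid vector to a coarse grid by weighted
// averaging; interpolation maps it back by linear interpolation.
public class GridTransfer {
    // Full weighting: coarse[j] from fine[2j-1], fine[2j], fine[2j+1].
    public static double[] restrict(double[] fine) {
        int n = (fine.length - 1) / 2 + 1;
        double[] coarse = new double[n];
        coarse[0] = fine[0];
        coarse[n - 1] = fine[fine.length - 1];
        for (int j = 1; j < n - 1; j++)
            coarse[j] = 0.25 * fine[2 * j - 1] + 0.5 * fine[2 * j] + 0.25 * fine[2 * j + 1];
        return coarse;
    }

    // Linear interpolation back to the fine grid.
    public static double[] interpolate(double[] coarse) {
        double[] fine = new double[2 * (coarse.length - 1) + 1];
        for (int j = 0; j < coarse.length; j++) fine[2 * j] = coarse[j];
        for (int i = 1; i < fine.length; i += 2)
            fine[i] = 0.5 * (fine[i - 1] + fine[i + 1]);
        return fine;
    }

    public static void main(String[] args) {
        double[] coarse = restrict(new double[] { 0, 1, 2, 3, 4 });
        System.out.println(coarse[1]);                 // prints 2.0
        System.out.println(interpolate(coarse)[1]);    // prints 1.0
    }
}
```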
41.
- Speedup is relatively modest. This seems to be
due to the complex pattern of communication in
this algorithm.
42. Speedup of HPJava Benchmarks

Multigrid Solver
  Processors    2      3      4      6      9
  512^2         1.90   2.29   2.39   2.96   3.03

2D Laplace Equation
  Processors    4      9      16     25     36
  256^2         2.67   3.73   4.67   6.22   6.22
  512^2         4.03   7.70   10.58  12.09  16.93
  1024^2        4.41   8.82   13.40  19.71  25.77

3D Diffusion Equation
  Processors    2      4      8      16     32
  32^3          1.58   2.72   3.45   4.75   5.43
  64^3          1.67   3.00   4.85   7.47   8.92
  128^3         1.80   3.31   5.76   9.98   13.88
43. HPJava with GUI
- Illustrates how HPJava can be used with a Java
graphical user interface.
- The Java multithreaded implementation of mpjdev
makes it possible for HPJava to cooperate with the Java AWT.
- For testing and demonstration of the multithreaded
version of mpjdev, we implemented a computational
fluid dynamics (CFD) code using HPJava.
- Illustrates the usage of Java objects in our
communication library.
- You can view this demonstration and source code
at http://www.hpjava.org/demo.html
45.
- Removed the graphical part of the CFD code and
did performance tests on the computational part only.
- Changed a 2-dimensional Java object distributed
array into a 3-dimensional double distributed
array to eliminate object serialization overhead.
- Used the HPC implementation of the underlying
communication library to run the code on an SP.
46. LAPI mpjdev Performance
- We found that the current version of Java thread
synchronization is not implemented with high performance.
- A Java thread takes more than five times as
long as a POSIX thread to perform the thread wait
and wake-up function calls:
- 57.49 microseconds (Java thread) vs. 10.68
microseconds (POSIX).
- This result suggests we should look for a new
architectural design for mpjdev using LAPI:
- Consider using POSIX threads, by calling C
through JNI, instead of Java threads.
48. Contributions
49. Contributions I
- HPJava
- My main contribution to HPJava was to develop the
runtime communication libraries.
- The Java version of Adlib has been developed as an
application-level communication library suitable
for data-parallel programming in Java.
- The mpjdev API has been developed as a device-level
communication library.
- Other contributions: the Type-Analyzer for
analyzing bytecode class hierarchies, some parts of
the type-checker, and the pre-translator of HPJava.
- Some applications of HPJava and full test codes
for Adlib and mpjdev were developed.
50. Contributions II
- mpiJava
- My main contribution to the mpiJava project was to add
support for direct communication of Java objects
via serialization.
- A complete set of test cases of mpiJava for Java
object types was developed.
- Maintaining mpiJava.
51. Conclusions I
- We have discussed in detail the design and
development of high-level and low-level runtime
communication libraries for HPJava.
- The Adlib API is presented as a high-level
communication library. This API is intended as an
example of an application communication library
suitable for data-parallel programming in Java.
- It fully supports Java object types as part of the
basic data types.
- The API and usage of its collective communications
were presented.
- It is based on a low-level communication library
called mpjdev.
- The mpjdev API is a device-level communication
library. It was developed with HPJava in
mind, but it is a standalone library and could be
used by other systems. The three different
implementations are:
- mpiJava-based implementation
- Multithreaded implementation
- LAPI implementation
52. Conclusions II
- Some benchmark results were presented.
- We got good performance on simple applications
without any serious optimization. For example:
- HPJava can out-perform sequential Java by up to 17 times.
- On a large number of processors HPJava can get
about 79% of the performance of HPF.
- The Laplace equation with size 1024^2 got about a
25 times speedup on 36 processors.