Title: JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support
1. JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support
- Wenzhang Zhu, Cho-Li Wang, Francis Lau
- The Systems Research Group
- Department of Computer Science and Information Systems
- The University of Hong Kong
2. HKU JESSICA Project
- JESSICA: Java-Enabled Single-System-Image Computing Architecture
- Project started in 1996. First version (JESSICA1) in 1999.
- A middleware that runs on top of the standard UNIX/Linux operating system to support parallel execution of multi-threaded Java applications in a cluster of computers.
- JESSICA hides the physical boundaries between machines and makes the cluster appear as a single computer to applications: a single-system image (SSI).
- Special feature: preemptive thread migration, which allows a thread to move freely between machines.
- Part of the RGC's Area of Excellence project in 1999-2002.
3. JESSICA Team Members
- Supervisors
  - Dr. Francis C.M. Lau
  - Dr. Cho-Li Wang
- Research Students
  - Ph.D.: Wenzhang Zhu (Thread Migration)
  - Ph.D.: WeiJian Fang (Global Heap)
  - M.Phil.: Zoe Ching Han Yu (Distributed Garbage Collection)
  - Ph.D.: Benny W. L. Cheung (Software Distributed Shared Memory)
  - Graduated: Matchy Ma (JESSICA1)
4. Outline
- Introduction to Cluster Computing
- Motivations
- Related work
- JESSICA2 features
- Performance Analysis
- Conclusion and Future Work
5. What's a cluster?
- "A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource." [IEEE TFCC]
- My definition: an HPC system that integrates mainstream commodity components to process large-scale problems => low cost, self-made, yet powerful.
6. Cluster Computer Architecture
[Figure: layered cluster architecture]
- Cluster Applications (Web, Storage, Computing, Rendering, Financing, ...)
- Programming Environment (Java, C, MPI, HPF, DSM); Management, Monitoring, and Job Scheduling
- Single System Image Infrastructure
- Availability Infrastructure
- Four nodes, each running its own OS
- High-Speed LAN (Fast/Gigabit Ethernet, SCI, Myrinet)
7. Single System Image (SSI)?
- JESSICA Project: Java-Enabled Single-System-Image Computing Architecture
- A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.
- Ultimate goal of SSI: make the cluster appear like a single machine to the user, to applications, and to the network.
- Single Entry Point, Single File System, Single Virtual Networking, Single I/O and Memory Space, Single Process Space, Single Management / Programming View
8. Top 500 computers by classification (June 2002) (Source: http://www.top500.org/)
- MPP: Massively Parallel Processor
- Constellation: e.g., cluster of HPCs
- Cluster: cluster of PCs
- SMP: Symmetric Multiprocessor
- About the TOP500 List
  - The 500 most powerful computer systems installed in the world.
  - Compiled twice a year since June 1993
  - Ranked by their performance on the LINPACK Benchmark
9. #1 Supercomputer: NEC's Earth Simulator
- Linpack: 35.86 Tflop/s
- (Tera FLOPS = 10^12 floating point operations per second, about 450 x Pentium 4 PCs)
- Interconnect: single-stage crossbar (1800 miles of cable), 83,000 copper cables, 16 GB/s cross-section bandwidth
- Area of computer: 4 tennis courts, 3 floors
- Built by NEC: 640 processor nodes, each consisting of 8 vector processors, for a total of 5120 processors, 40 TFlop/s peak, and 10 TB memory.
(Source: NEC)
10. Other Supercomputers in TOP500
- #2 and #3 Supercomputer: ASCI Q
  - 7.7 TF/s Linpack performance.
  - Los Alamos National Laboratory, U.S.
  - HP AlphaServer SC (375 x 32-way multiprocessors, total 11,968 processors), 12 terabytes of memory and 600 terabytes of disk storage
- #4: IBM ASCI White (U.S.)
  - 8,192 copper microprocessors (IBM SP POWER3); contains 160 trillion bytes (TB) of memory with more than 160 TB of IBM disk storage capacity. Linpack: 7.22 Tflops. Located at Lawrence Livermore National Laboratory.
  - 512-node, 16-way symmetric multiprocessor. Covers an area the size of two basketball courts, weighs 106 tons. 2,000 miles of copper wiring. Cost: US$110 million.
11. TOP500 Nov 2002 List
- 2 new PC clusters made the TOP 10
  - #5 is a Linux NetworX/Quadrics cluster at Lawrence Livermore National Laboratory.
  - #8 is an HPTi/Myrinet cluster at the Forecast Systems Laboratory at NOAA.
- A total of 55 Intel-based and 8 AMD-based PC clusters are in the TOP500.
- The number of clusters in the TOP500 grew again to a total of 93 systems.
12. Poor Man's Cluster
- HKU Ostrich Cluster
- 32 x 733 MHz Pentium III PCs, 384 MB memory
- Hierarchical Ethernet-based network (four 24-port Fast Ethernet switches plus one 8-port Gigabit Ethernet backbone switch)
13. Rich Man's Cluster
- Computational Plant (C-Plant cluster)
- 1536 Compaq DS10L 1U servers (466 MHz Alpha 21264 (EV6) microprocessor, 256 MB ECC SDRAM)
- Each node contains a 64-bit, 33 MHz Myrinet network interface card (1.28 Gbps) connected to a 64-port Mesh64 switch.
- 48 cabinets, each of which contains 32 nodes (48 x 32 = 1536)
14. The HKU Gideon 300 Cluster (operating in mid-Oct. 2002)
- Linpack performance: 355 Gflops, #175 in TOP500 (Nov. 2002 list)
- 300 PCs (2.0 GHz Pentium 4, 512 MB DDR memory, 40 GB disk, Linux OS) connected by a 312-port Foundry FastIron 1500 (Fast Ethernet) switch
15. Building Gideon 300
16. JESSICA2 Introduction
- Research Goal
  - High-performance Java computing using clusters
- Why Java?
  - The dominant language for server-side programming.
  - More than 2 million Java developers [CNETAsia, 06/2002]
  - Platform independent: "Compile once, run anywhere"
  - Code mobility (i.e., dynamic class loading) and data mobility (i.e., object serialization).
  - Built-in multithreading support at the language level (parallel programming using MPI, PVM, RMI, RPC, HPF, or DSM is difficult)
- Why cluster?
  - Large-scale server-side applications need high-performance multithreaded programming support
  - A cluster provides a scalable hardware platform for true parallel execution.
17. Java Virtual Machine
- Class Loader
  - Loads class files
- Interpreter
  - Executes bytecode
- Runtime Compiler
  - Converts bytecode to native code
[Figure: application class files and Java API class files enter the class loader; the bytecode is executed by the interpreter or converted by the runtime compiler into native code]
18. Threads in JVM
A Multithreaded Java Program
    // CubbyHole, Producer, and Consumer are defined elsewhere (not shown on the slide)
    public class ProducerConsumerTest {
        public static void main(String[] args) {
            CubbyHole c = new CubbyHole();
            Producer p1 = new Producer(c, 1);
            Consumer c1 = new Consumer(c, 1);
            p1.start();
            c1.start();
        }
    }
[Figure: inside the JVM, the class loader brings class files into the Java method area (code); Threads 1-3 each have their own PC and stack frames and run on the execution engine; objects live in the heap (data)]
19. Java Memory Model
- Defines memory consistency semantics in multi-threaded Java programs
  - Specifies when values must be transferred between the main memory and per-thread working memory
- There is a lock associated with each object
  - Protects critical sections
  - Maintains memory consistency between threads
- Basic Rules
  - Releasing a lock forces a flush of all writes from the working memory employed by the thread, and acquiring a lock forces a (re)load of the values of accessible fields
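The lock rules above can be illustrated with a small Java sketch. The class name `SharedCounter` and its methods are hypothetical and not part of JESSICA2. The comments use the working-memory wording of the slide; current Java specifies the same guarantee in terms of happens-before ordering between an unlock and a later lock of the same monitor.

```java
// Minimal sketch of the lock rules; SharedCounter is a hypothetical name.
final class SharedCounter {
    private final Object lock = new Object();
    private int value; // only accessed while holding `lock`

    void increment() {
        synchronized (lock) {   // acquire: (re)load shared values
            value++;
        }                       // release: flush this thread's writes
    }

    int get() {
        synchronized (lock) {
            return value;
        }
    }

    /** Runs two threads that each increment perThread times; returns the total. */
    static int runTwoThreads(int perThread) {
        SharedCounter counter = new SharedCounter();
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                counter.increment();
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return counter.get();
    }
}
```

Because every access to `value` happens inside a block synchronized on the same lock, no increment is lost and the final count equals twice `perThread`.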
20. Threads in a JVM
Java Memory Model (how to maintain memory consistency between threads)
[Figure: threads T1 and T2 each have per-thread working memory; the object's master copy lives in the heap area (main memory), alongside the garbage bin. A variable is modified in T1's working memory.]
21. Distributed Java Virtual Machine (DJVM)
JESSICA2: A distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications, with a Single System Image illusion to Java threads.
[Figure: Java threads created in a program run on top of a Global Object Space spanning four PCs, each with its own OS, connected by a high-speed network]
22. Problems in Existing DJVMs
- Mostly based on interpreters
  - Simple but slow
- Layered design using a distributed shared memory (DSM) system => can't be tightly coupled with the JVM
  - JVM runtime information can't be channeled to the DSM
  - False sharing if page-based DSM is employed
  - Page faults block the whole JVM
- Programmer specifies thread distribution => lack of transparency
  - Need to rewrite multithreaded Java applications
  - No dynamic thread distribution (preemptive thread migration) for load balancing.
23. Related Work
- Method shipping: IBM cJVM
  - Like remote method invocation (RMI): when accessing object fields, the proxy redirects the flow of execution to the node where the object's master copy is located.
  - Executed in interpreter mode.
  - Load balancing problem: affected by the object distribution.
- Page shipping: Rice U. Java/DSM, HKU JESSICA
  - Simple. GOS was supported by some page-based Distributed Shared Memory (e.g., TreadMarks, JUMP, JiaJia)
  - JVM runtime information can't be channeled to DSM.
  - Executed in interpreter mode.
- Object shipping: Hyperion, Jackal
  - Leverage some object-based DSM
  - Executed in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code
24. Related Work (Summary)
System     Thread distribution     Execution   Object sharing
cJVM       Method shipping         Intr        Proxy
Java/DSM   Manual migration        Intr        Page-based DSM
JESSICA    Transparent migration   Intr        Page-based DSM
Intr = Interpreter Mode
25. JESSICA2 Main Features
- Transparent Java thread migration
  - Runtime capturing and restoring of thread execution context.
  - No source code modification. No bytecode instrumentation (preprocessing). No new API introduced.
  - Enables dynamic load balancing on clusters
- Operates in Just-In-Time (JIT) compilation mode
- Global Object Space
  - A shared global heap spanning all cluster nodes
  - Adaptive object home migration protocol
- I/O redirection
[Figure: JESSICA2 combines transparent migration, JIT, and GOS]
26. JESSICA2 Architecture
Java Bytecode or Source Code
    public class ProducerConsumerTest {
        public static void main(String[] args) {
            CubbyHole c = new CubbyHole();
            Producer p1 = new Producer(c, 1);
            Consumer c1 = new Consumer(c, 1);
            p1.start();
            c1.start();
        }
    }
27. Transparent Thread Migration in JIT Mode
- Simple for interpreters (e.g., JESSICA)
  - The interpreter sits in the bytecode decoding loop, which can be stopped upon checking a migration flag
  - The full state of a thread is available in the data structures of the interpreter
  - No register allocation
- JIT mode execution makes things complex (JESSICA2)
  - Native code has no clear bytecode boundary
  - How to deal with machine registers?
  - How to organize the stack frames (all are in native form now)?
  - How to make extracted thread states portable and recognizable by the remote JVM?
  - How to restore the extracted states (rebuild the stack frames) and restart the execution in native form?
Need to modify the JIT compiler to instrument native code
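To make the interpreter case concrete, here is a minimal sketch of a decoding loop that checks a migration flag at each bytecode boundary and captures the thread state as a program counter plus operand stack. Everything in it is hypothetical: the three-opcode instruction set, the `ToyInterpreter` and `State` names, and the `migrateAfterSteps` parameter, which stands in for an external migration request. It is not JESSICA or Kaffe code, and it does not address the JIT-mode problems listed above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Toy stack machine; opcodes and names are illustrative assumptions.
final class ToyInterpreter {
    static final int PUSH = 0, ADD = 1, HALT = 2;

    /** Portable thread state: program counter plus operand stack. */
    static final class State {
        final int pc;
        final List<Integer> stack; // bottom element first
        State(int pc, List<Integer> stack) { this.pc = pc; this.stack = stack; }
    }

    static State initial() { return new State(0, new ArrayList<>()); }

    /**
     * Runs from `start`. Returns the captured state when the simulated
     * migration flag is raised (after migrateAfterSteps instructions;
     * a negative value means never) or when HALT is reached.
     */
    static State run(int[] code, State start, int migrateAfterSteps) {
        int pc = start.pc;
        Deque<Integer> stack = new ArrayDeque<>();
        for (int v : start.stack) stack.push(v); // rebuild stack, bottom first
        int steps = 0;
        while (true) {
            // Flag check at a bytecode boundary: state is complete here.
            boolean migrationFlag = (steps == migrateAfterSteps);
            if (migrationFlag) return capture(pc, stack);
            switch (code[pc]) {
                case PUSH: stack.push(code[pc + 1]); pc += 2; break;
                case ADD:  stack.push(stack.pop() + stack.pop()); pc += 1; break;
                case HALT: return capture(pc, stack);
                default:   throw new IllegalStateException("bad opcode at " + pc);
            }
            steps++;
        }
    }

    private static State capture(int pc, Deque<Integer> stack) {
        List<Integer> snapshot = new ArrayList<>();
        Iterator<Integer> it = stack.descendingIterator(); // bottom to top
        while (it.hasNext()) snapshot.add(it.next());
        return new State(pc, snapshot);
    }
}
```

A captured `State` can be passed to `run` again, possibly in another JVM, to resume where execution stopped. The sketch works because the interpreter keeps all thread state in its own data structures, which is exactly what native code produced by a JIT compiler does not do.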
28. Approaches
- Using JVMDI (e.g., HKU M-JavaMPI)?
  - Only the recent JDK 1.4.1 (Aug. 2002) provides full-speed debugging to support the capturing of thread status
  - Portable but too heavy
    - Needs large data structures to keep debug information
  - Using JVMDI alone can't support the full function of a DJVM
    - How to access remote objects?
    - Put a DSM under it? But you can't control the Sun JVM's memory allocation unless you get the latest JDK source code
- Our lightweight approach
  - Provide the minimum functions required to capture and restore Java threads to support Java thread migration
29. An overview of JESSICA2 Java thread migration
[Figure: a thread migrates from a source node to a destination node; components shown are the load monitor, thread scheduler, migration manager, method area, frames with PC, and GOS (heap)]
- (1) Alert (load monitor)
- (2) Stack analysis and stack capturing (source node)
- (3) Frames transferred; frame parsing and restore execution (destination node)
- (4a) Object access (GOS heap)
- (4b) Load method from NFS
30. What are those functions?
- Migration point selection
  - Delayed to the head of a loop basic block or method
- Register context handler
  - Spill dirty registers at the migration point without invalidation, so that native code can continue to use the registers
  - Use a register recovering stub in the restoring phase
- Variable type deduction
  - Spill types onto the stack using compression
- Java frame linking
  - Discover consecutive Java frames
31. Dynamic Thread State Capturing and Restoring in JESSICA2
[Figure: instrumented JIT compilation and the capture/restore paths]
- Stages shown: bytecode verifier, migration point selection, bytecode translation, intermediate code, register allocation, code generation, native code, linking and constant resolution
- Instrumentation: 1. add migration checking (cmp mflag,0; jz ...), 2. add object checking (cmp objoffset,0; jz ...), 3. add type and register spilling (mov 0x110182, slot ...)
- Capturing: native stack scanning over the native thread stack (Java frames and C frames)
- Restore: register recovering from slots to registers (mov slot1->reg1; mov slot2->reg2; ...)
- Global object access from native code
32. How to Maintain Memory Consistency in a Distributed Environment?
[Figure: threads T1-T8 and their heaps distributed across four PCs, each running its own OS, connected by a high-speed network]
33. Embedded Global Object Space (GOS)
- Main Features
  - Takes advantage of JVM runtime information for optimization (e.g., object types, accessing threads, etc.)
  - Uses a threaded I/O interface inside the JVM for communication to hide latency => non-blocking GOS access
  - OO-based to reduce false sharing
  - Home-based, compliant with the JVM Memory Model (Lazy Release Consistency)
  - Master Heap (home objects) and Cache Heap (local and cached objects): reduce object access latency
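The home-based design with a master heap and a cache heap can be sketched as follows. This is a heavily simplified illustration, not the JESSICA2 protocol: the names `ToyGos`, `Home`, and `CacheNode` are hypothetical, objects are reduced to integer values keyed by a string id, and the acquire step simply invalidates every clean cached copy instead of tracking which objects actually changed.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified home-based consistency sketch; all names are assumptions.
final class ToyGos {
    /** Master heap: holds the authoritative (home) copy of each object. */
    static final class Home {
        private final Map<String, Integer> master = new HashMap<>();

        synchronized int fetch(String id) {
            return master.getOrDefault(id, 0);
        }

        synchronized void writeBack(Map<String, Integer> updates) {
            master.putAll(updates);
        }
    }

    /** Cache heap on a node that is not the home of the objects. */
    static final class CacheNode {
        private final Home home;
        private final Map<String, Integer> cache = new HashMap<>();
        private final Map<String, Integer> dirty = new HashMap<>();

        CacheNode(Home home) { this.home = home; }

        /** Reads hit the local cache; a miss fetches from the home. */
        int read(String id) {
            return cache.computeIfAbsent(id, home::fetch);
        }

        /** Writes stay local until the next release. */
        void write(String id, int value) {
            cache.put(id, value);
            dirty.put(id, value);
        }

        /** Lock acquire: drop clean cached copies so later reads re-fetch. */
        void acquire() {
            cache.keySet().retainAll(dirty.keySet());
        }

        /** Lock release: flush local writes to the home copy. */
        void release() {
            home.writeBack(dirty);
            dirty.clear();
        }
    }
}
```

In this sketch a node keeps reading its stale cached value until it performs an acquire, which mirrors the rule that memory only has to be made consistent at synchronization points.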
34. Object Cache
35. Adaptive object home migration
- Definition
  - Home of an object: the JVM that holds the master copy of the object
- Problems
  - Cached objects need to be flushed and re-fetched from the home whenever synchronization happens
- Adaptive object home migration
  - If the number of accesses from a thread dominates the total number of accesses to an object, the object's home will be migrated to the node where the thread is running
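The migration rule above can be sketched as a per-object access counter. The slides do not say how "dominates" is decided, so the `threshold` ratio and the `minSamples` warm-up in this sketch are assumptions, as are the `HomeTracker` name and the choice to count accesses per node rather than per thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-object tracker; threshold and minSamples are assumed knobs.
final class HomeTracker {
    private int homeNode;
    private final double threshold; // share of accesses that counts as dominant
    private final int minSamples;   // accesses required before deciding
    private final Map<Integer, Integer> counts = new HashMap<>();
    private int total;

    HomeTracker(int initialHome, double threshold, int minSamples) {
        this.homeNode = initialHome;
        this.threshold = threshold;
        this.minSamples = minSamples;
    }

    /** Records one access from `node` and returns the (possibly new) home. */
    int recordAccess(int node) {
        int fromNode = counts.merge(node, 1, Integer::sum);
        total++;
        boolean dominant = total >= minSamples && fromNode > threshold * total;
        if (node != homeNode && dominant) {
            homeNode = node;  // migrate the home to the dominant accessor
            counts.clear();   // start a fresh observation window
            total = 0;
        }
        return homeNode;
    }

    int home() { return homeNode; }
}
```

For example, with `new HomeTracker(0, 0.5, 4)`, four consecutive accesses from node 1 move the home from node 0 to node 1, after which accesses from node 1 no longer need to go through a remote home at synchronization points.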
36. I/O redirection
- Timer
  - Use the time on the master node as the standard time
  - Calibrate the time on each worker node when it registers with the master node
- File I/O
  - Use half a word of the fd as the node number
  - Open file
    - For read, check the local node first, then the master node
    - For write, go to the master node
  - Read/Write
    - Go to the node specified by the node number in the fd
- Network I/O
  - Connectionless send: do it locally
  - Others: go to the master
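The file descriptor scheme can be sketched as bit packing. The slides only say that half a word of the fd carries the node number; this sketch assumes a 32-bit fd with the node number in the upper 16 bits and the node-local descriptor in the lower 16 bits, and assumes the master is node 0. `GlobalFd` and its methods are hypothetical names.

```java
// Hypothetical fd encoding; the bit layout and MASTER id are assumptions.
final class GlobalFd {
    static final int MASTER = 0;            // assumed node number of the master
    private static final int HALF_BITS = 16;
    private static final int LOW_MASK = 0xFFFF;

    /** Packs a node number and a node-local fd into one global fd. */
    static int encode(int node, int localFd) {
        if (node < 0 || node > 0x7FFF) {    // keep the global fd non-negative
            throw new IllegalArgumentException("node out of range: " + node);
        }
        if (localFd < 0 || localFd > LOW_MASK) {
            throw new IllegalArgumentException("local fd out of range: " + localFd);
        }
        return (node << HALF_BITS) | localFd;
    }

    /** Node that owns the file; read and write calls are sent here. */
    static int nodeOf(int globalFd) {
        return globalFd >>> HALF_BITS;
    }

    /** Descriptor to use on the owning node. */
    static int localFdOf(int globalFd) {
        return globalFd & LOW_MASK;
    }

    /** Open rule: reads prefer a local copy, everything else goes to the master. */
    static int nodeForOpen(boolean forWrite, boolean existsLocally, int localNode) {
        if (!forWrite && existsLocally) {
            return localNode;
        }
        return MASTER;
    }
}
```

With this layout, `GlobalFd.encode(3, 7)` yields a descriptor whose `nodeOf` is 3 and whose `localFdOf` is 7, so a later read or write can be routed without any lookup table.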
37. Experimental Setting
- Modified Kaffe Open JVM version 1.0.6
- Linux PC clusters
  - Pentium II PCs at 540 MHz (Linux 2.2.1 kernel) connected by Fast Ethernet
  - HKU Gideon 300 Cluster (RayTracing)
38. Parallel Ray Tracing on JESSICA2 (running on the 64-node Gideon 300 cluster)
- Linux 2.4.18-3 kernel (Red Hat 7.3)
- 64 nodes: 108 seconds
- 1 node: 3430 seconds (about 1 hour)
- Speedup = 4402/108 = 40.75
39. Micro Benchmarks (PI Calculation)
40. Java Grande Benchmark
41. SPECjvm98 Benchmark
M-: migration mechanism disabled; M+: migration mechanism enabled. I+: pseudo-inlining enabled; I-: pseudo-inlining disabled.
42. JESSICA2 vs JESSICA (CPI)
43. Application benchmark
44. Effect of Adaptive object home migration (SOR)
45. Conclusions
- Transparent Java thread migration in a JIT compiler enables high-performance execution of multithreaded Java applications on clusters while keeping the merits of Java
  - JVM approach => dynamic class loading
  - Just-in-Time compilation for speed
- An embedded GOS layer can take advantage of JVM runtime information to reduce communication overhead
46. Thanks
- HKU SRG
  - http://www.srg.csis.hku.hk/
- JESSICA2 Webpage
  - http://www.csis.hku.hk/clwang/projects/JESSICA2.html