JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support

1
JESSICA2: A Distributed Java Virtual Machine
with Transparent Thread Migration Support

Wenzhang Zhu, Cho-Li Wang, Francis Lau
The Systems Research Group
Department of Computer Science and Information
Systems
The University of Hong Kong

2
HKU JESSICA Project

JESSICA: Java-Enabled Single-System-Image Computing Architecture
Project started in 1996. First version (JESSICA1) in 1999.
A middleware that runs on top of the standard UNIX/Linux operating system to support parallel execution of multi-threaded Java applications in a cluster of computers.
JESSICA hides the physical boundaries between machines and makes the cluster appear as a single computer to applications: a single-system image (SSI).
Special feature: preemptive thread migration, which allows a thread to move freely between machines.
Part of the RGC's Area of Excellence project in 1999-2002.

3
JESSICA Team Members

Supervisors
Dr. Francis C.M. Lau
Dr. Cho-Li Wang
Research Students
Ph.D.: Wenzhang Zhu (Thread Migration)
Ph.D.: WeiJian Fang (Global Heap)
M.Phil.: Zoe Ching Han Yu (Distributed Garbage Collection)
Ph.D.: Benny W. L. Cheung (Software Distributed Shared Memory)
Graduated: Matchy Ma (JESSICA1)

4
Outline

Introduction to Cluster Computing
Motivations
Related Work
JESSICA2 Features
Performance Analysis
Conclusion and Future Work

5
What's a cluster?

A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource (IEEE TFCC).
My definition: an HPC system that integrates mainstream commodity components to process large-scale problems -> low cost, self-made, yet powerful.

6
Cluster Computer Architecture
Programming Environment (Java, C, MPI, HPF, DSM)
Management, Monitoring and Job Scheduling
Cluster Applications (Web, Storage, Computing, Rendering, Financing, ...)
Single System Image Infrastructure
Availability Infrastructure
OS | OS | OS | OS
Node | Node | Node | Node
High-Speed LAN (Fast/Gigabit Ethernet, SCI, Myrinet)
7
Single System Image (SSI)?

JESSICA Project: Java-Enabled Single-System-Image Computing Architecture
A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.
Ultimate goal of SSI: make the cluster appear like a single machine to the user, to applications, and to the network.

Single Entry Point, Single File System, Single
Virtual Networking, Single I/O and Memory Space,
Single Process Space, Single Management /
Programming View
8
Top 500 computers by classification (June 2002) (Source: http://www.top500.org/)
MPP: Massively Parallel Processor
Constellation: e.g., cluster of HPCs
Cluster: cluster of PCs
SMP: Symmetric Multiprocessor

About the TOP500 List
The 500 most powerful computer systems installed in the world.
Compiled twice a year since June 1993.
Ranked by their performance on the LINPACK Benchmark.

9
#1 Supercomputer: NEC's Earth Simulator

Linpack: 35.86 Tflop/s
(Tera FLOPS = 10^12 floating point operations per second, roughly 450 x Pentium 4 PCs)
Interconnect: single-stage crossbar (1800 miles of cable), 83,000 copper cables, 16 GB/s cross-section bandwidth
Area of computer: 4 tennis courts, 3 floors

Built by NEC: 640 processor nodes, each consisting of 8 vector processors, for a total of 5120 processors, 40 TFlop/s peak, and 10 TB memory.

(Source: NEC)
10
Other Supercomputers in TOP500

#2 and #3 Supercomputers: ASCI Q
7.7 TF/s Linpack performance.
Los Alamos National Laboratory, U.S.
HP AlphaServer SC (375 x 32-way multiprocessors, total 11,968 processors), 12 terabytes of memory and 600 terabytes of disk storage
#4: IBM ASCI White (U.S.)
8,192 copper microprocessors (IBM SP POWER3), and contains 160 trillion bytes (TB) of memory with more than 160 TB of IBM disk storage capacity
Linpack: 7.22 Tflop/s. Located at Lawrence Livermore National Laboratory.
512-node, 16-way symmetric multiprocessor. Covers an area the size of two basketball courts, weighs 106 tons. 2,000 miles of copper wiring. Cost: US$110 million.

11
TOP500 Nov 2002 List

2 new PC clusters made the TOP 10:
#5 is a Linux NetworX/Quadrics cluster at Lawrence Livermore National Laboratory.
#8 is an HPTi/Myrinet cluster at the Forecast Systems Laboratory at NOAA.
A total of 55 Intel based and 8 AMD based PC
clusters are in the TOP500.
The number of clusters in the TOP500 grew again
to a total of 93 systems.

12
Poor Man's Cluster

HKU Ostrich Cluster
32 x 733 MHz Pentium III PCs, 384 MB memory
Hierarchical Ethernet-based network (four 24-port Fast Ethernet switches + one 8-port Gigabit Ethernet backbone switch)

13
Rich Man's Cluster

Computational Plant (C-Plant cluster)
1536 Compaq DS10L 1U servers (466 MHz Alpha 21264 (EV6) microprocessor, 256 MB ECC SDRAM)
Each node contains a 64-bit, 33 MHz Myrinet network interface card (1.28 Gbps) connected to a 64-port Mesh64 switch.

48 cabinets, each of which contains 32 nodes (48 x 32 = 1536)

14
The HKU Gideon 300 Cluster (operating in mid-Oct. 2002)
Linpack performance: 355 Gflops; #175 in TOP500 (Nov. 2002 list)
300 PCs (2.0GHz Pentium 4, 512MB DDR mem, 40GB
disk, Linux OS) connected by a 312-port Foundry
FastIron 1500 (Fast Ethernet) switch
15
Building Gideon 300
16
JESSICA2 Introduction

Research Goal
High Performance Java Computing using Clusters
Why Java?
The dominant language for server-side
programming.
More than 2 million Java developers (CNETAsia, 06/2002)
Platform independent: "Compile once, run anywhere"
Code mobility (i.e., dynamic class loading) and data mobility (i.e., object serialization).
Built-in multithreading support at language level (parallel programming using MPI, PVM, RMI, RPC, HPF, DSM is difficult)
Why cluster?
Large-scale server-side applications need high-performance multithreaded programming support
A cluster provides a scalable hardware platform
for true parallel execution.

17
Java Virtual Machine
Class Loader: loads class files (application class files and Java API class files)
Interpreter: executes bytecode
Runtime Compiler: converts bytecode to native code
[Figure: class files -> class loader -> bytecode -> interpreter or runtime compiler -> native code]
18
Threads in JVM
A Multithreaded Java Program
public class ProducerConsumerTest {
    public static void main(String[] args) {
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();
        c1.start();
    }
}
[Figure: JVM runtime data areas. Threads 1-3, each with its own PC and stack frames; class loader and class files; Java method area (code); heap (data) holding objects; execution engine]
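
The slide shows only main(); CubbyHole, Producer and Consumer are not defined in the deck. The following is a minimal sketch of what they might look like, assuming a single-slot buffer guarded by wait()/notifyAll(). All class bodies, the ten-item workload, and the ProducerConsumerDemo helper are assumptions made for illustration.

```java
// Hypothetical supporting classes; the slides do not show them.
class CubbyHole {
    private int contents;
    private boolean available = false;

    // Blocks until a value is present, then hands it to the caller.
    public synchronized int get() throws InterruptedException {
        while (!available) {
            wait();
        }
        available = false;
        notifyAll();
        return contents;
    }

    // Blocks until the slot is empty, then stores the value.
    public synchronized void put(int value) throws InterruptedException {
        while (available) {
            wait();
        }
        contents = value;
        available = true;
        notifyAll();
    }
}

class Producer extends Thread {
    private final CubbyHole hole;

    Producer(CubbyHole hole, int id) {
        super("Producer-" + id);
        this.hole = hole;
    }

    @Override
    public void run() {
        try {
            for (int i = 0; i < 10; i++) {
                hole.put(i);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class Consumer extends Thread {
    private final CubbyHole hole;
    int sum = 0; // read by other threads only after join()

    Consumer(CubbyHole hole, int id) {
        super("Consumer-" + id);
        this.hole = hole;
    }

    @Override
    public void run() {
        try {
            for (int i = 0; i < 10; i++) {
                sum += hole.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class ProducerConsumerDemo {
    // Same wiring as the slide's main(), plus join() so the result can be read.
    static int runOnce() {
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();
        c1.start();
        try {
            p1.join();
            c1.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return c1.sum;
    }
}
```

On a DJVM such as JESSICA2 the two threads could be placed on different nodes without changing this code, which is the point of the single-system image.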
19
Java Memory Model

Defines memory consistency semantics in multi-threaded Java programs: when values must be transferred between the main memory and per-thread working memory
There is a lock associated with each object
Protect critical sections
Maintain memory consistency between threads
Basic rules
Releasing a lock forces a flush of all writes from the working memory employed by the thread, and acquiring a lock forces a (re)load of the values of accessible fields
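
The lock rules above can be illustrated with a small sketch: two threads update a shared field only while holding the same lock, so each update made before a release is seen after the next acquire. The class name SharedCounter and the runTwoThreads helper are hypothetical; the comments use the slide's flush/reload wording.

```java
class SharedCounter {
    private final Object lock = new Object();
    private int value = 0; // accessed only while holding 'lock'

    void increment() {
        synchronized (lock) { // acquire: (re)load shared values
            value++;
        }                     // release: flush the write
    }

    int read() {
        synchronized (lock) {
            return value;
        }
    }

    // Runs two threads that each increment 'perThread' times and returns the total.
    static int runTwoThreads(int perThread) {
        SharedCounter counter = new SharedCounter();
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                counter.increment();
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.read();
    }
}
```

Without the synchronized blocks the increments could be lost, because neither thread would be obliged to flush or reload the field.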

20
Threads in a JVM
Java Memory Model (how to maintain memory consistency between threads)
[Figure: threads T1 and T2, each with per-thread working memory; main memory (heap area) holds the object's master copy; a variable is modified in T1's working memory; a garbage bin is also shown]
21
Distributed Java Virtual Machine (DJVM)
JESSICA2: A distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications with a Single System Image illusion to Java threads.
[Figure: Java threads created in a program run over a Global Object Space that spans four PCs, each with its own OS, connected by a high-speed network]
22
Problems in Existing DJVMs

Mostly based on interpreters
Simple but slow
Layered design using a distributed shared memory (DSM) system -> can't be tightly coupled with the JVM
JVM runtime information can't be channeled to the DSM
False sharing if page-based DSM is employed
Page faults block the whole JVM
Programmer specifies thread distribution -> lack of transparency
Need to rewrite multithreaded Java applications
No dynamic thread distribution (preemptive thread migration) for load balancing.

23
Related Work

Method shipping: IBM cJVM
Like remote method invocation (RMI): when accessing object fields, the proxy redirects the flow of execution to the node where the object's master copy is located.
Executed in interpreter mode.
Load balancing problem: affected by the object distribution.
Page shipping: Rice U. Java/DSM, HKU JESSICA
Simple. GOS is supported by a page-based Distributed Shared Memory (e.g., TreadMarks, JUMP, JiaJia)
JVM runtime information can't be channeled to the DSM.
Executed in interpreter mode.
Object shipping: Hyperion, Jackal
Leverage an object-based DSM
Executed in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code

24
Related Work(Summary)
cJVM: Method Shipping, Intr, Proxy
Java/DSM: Manual Migration, Intr, Page-based DSM
JESSICA: Transparent Migration, Intr, Page-based DSM
(Intr = Interpreter Mode)
25
JESSICA2 Main Features
JESSICA2

Transparent Java thread migration
Runtime capturing and restoring of thread execution context.
No source code modification. No bytecode instrumentation (preprocessing). No new API introduced.
Enables dynamic load balancing on clusters
Operates in Just-In-Time (JIT) compilation mode
Global Object Space
A shared global heap spanning all cluster nodes
Adaptive object home migration protocol
I/O redirection

[Figure: JESSICA2 = transparent migration + JIT + GOS]
26
JESSICA2 Architecture
Java Bytecode or Source Code
public class ProducerConsumerTest {
    public static void main(String[] args) {
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();
        c1.start();
    }
}
27
Transparent Thread Migration in JIT Mode

Simple for interpreters (e.g., JESSICA)
The interpreter sits in the bytecode decoding loop, which can be stopped when a migration flag check succeeds
The full state of a thread is available in the interpreter's data structures
No register allocation
JIT mode execution makes things complex (JESSICA2)
Native code has no clear bytecode boundary
How to deal with machine registers?
How to organize the stack frames (all are in native form now)?
How to make extracted thread states portable and recognizable by the remote JVM?
How to restore the extracted states (rebuild the stack frames) and restart the execution in native form?

Need to modify the JIT compiler to instrument native code
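
To make the interpreter case concrete, here is a toy sketch of a decoding loop that checks a migration flag at every bytecode boundary. It is not JESSICA's interpreter: the three-opcode instruction set, the State class, and the IntPredicate standing in for the migration flag are all assumptions. It only shows why capture is easy in interpreter mode, since the whole thread state is a program counter plus an operand stack held in ordinary data structures.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntPredicate;

class TinyInterpreter {
    static final int PUSH = 0; // followed by one operand
    static final int ADD = 1;
    static final int HALT = 2;

    // Complete thread state as the interpreter sees it.
    static final class State {
        int pc = 0;
        boolean halted = false;
        final Deque<Integer> stack = new ArrayDeque<>();
    }

    // Runs until HALT, or until migrationFlag reports true at an instruction boundary.
    // The returned state can be handed to another interpreter and resumed there.
    static State run(int[] code, State s, IntPredicate migrationFlag) {
        while (!s.halted) {
            if (migrationFlag.test(s.pc)) {
                return s; // stopped cleanly between two bytecodes
            }
            int op = code[s.pc++];
            switch (op) {
                case PUSH:
                    s.stack.push(code[s.pc++]);
                    break;
                case ADD:
                    s.stack.push(s.stack.pop() + s.stack.pop());
                    break;
                case HALT:
                    s.halted = true;
                    break;
                default:
                    throw new IllegalArgumentException("unknown opcode " + op);
            }
        }
        return s;
    }
}
```

In JIT mode there is no such loop to stop and no State object to copy: the same information lives in machine registers and native stack frames, which is what the questions on this slide are about.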
28
Approaches

Using JVMDI (e.g., HKU M-JavaMPI)?
Only the recent JDK 1.4.1 (Aug. 2002) provides full-speed debugging to support the capturing of thread status
Portable but too heavy
Needs large data structures to keep debug information
JVMDI alone can't support the full functionality of a DJVM
How to access remote objects?
Put a DSM under it? But you can't control the Sun JVM's memory allocation unless you get the latest JDK source code
Our lightweight approach
Provide the minimum functions required to capture and restore Java threads to support Java thread migration

29
An overview of JESSICA2 Java thread migration

[Figure: thread migration from source node to destination node]
Source node: Load Monitor, Thread Scheduler, Migration Manager; thread with frames and PC; method area; GOS (heap)
Destination node: JVM with frames and PC; method area; GOS (heap)
(1) Alert
(2) Stack analysis, stack capturing
(3) Frames transferred; frame parsing, restore execution
(4a) Object access
(4b) Load method from NFS
30
What are those functions?

Migration point selection
Delayed to the head of a loop basic block or method
Register context handler
Spill dirty registers at the migration point without invalidation, so that native code can continue to use the registers
Use a register-recovering stub in the restoring phase
Variable type deduction
Spill types onto the stack using compression
Java frame linking
Discover consecutive Java frames
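
As a source-level analogy only, the sketch below places a migration check at the head of a loop and, when it fires, saves the live locals into a frame object from which execution can later resume. In JESSICA2 this is done by instrumented native code, not by Java source; the Frame class, the migrateAt parameter (a stand-in for the migration flag becoming set), and the summing workload are assumptions.

```java
class LoopMigrationSketch {
    // Captured state of the loop: the values a register spill would have to save.
    static final class Frame {
        final int i;
        final long sum;
        final boolean finished;

        Frame(int i, long sum, boolean finished) {
            this.i = i;
            this.sum = sum;
            this.finished = finished;
        }
    }

    // Sums i..n-1 onto 'sum'. Pass migrateAt = -1 to never stop early.
    static Frame run(int i, long sum, int n, int migrateAt) {
        while (i < n) {
            // Migration point at the loop head.
            if (i == migrateAt) {
                return new Frame(i, sum, false); // "spill" live locals
            }
            sum += i;
            i++;
        }
        return new Frame(i, sum, true);
    }

    // Rebuilds the locals from a captured frame and continues the loop.
    static Frame resume(Frame f, int n) {
        return run(f.i, f.sum, n, -1);
    }
}
```

Restricting checks to loop and method heads keeps the number of points at which state must be describable small, at the cost of delaying a migration request until the next such point.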

31
Dynamic Thread State Capturing and Restoring in
JESSICA2
[Figure: JIT compilation pipeline with migration support]
Stages shown: bytecode verifier, migration point selection, bytecode translation, intermediate code, register allocation, code generation, native code, linking and constant resolution
Instrumentation added: 1. migration checking, 2. object checking, 3. type and register spilling
Code fragments shown: "cmp mflag,0; jz ..." (migration point, invoke); "cmp objoffset,0; jz ..." (global object access); "mov 0x110182, slot ..."
(Capturing) Native stack scanning over the native thread stack (Java frames, C frames)
(Restore) Register recovering from slots: "mov slot1->reg1; mov slot2->reg2; ..."
32
How to Maintain Memory Consistency in a Distributed Environment?
[Figure: threads T1-T8 spread across four PCs, each with its own OS and heaps, connected by a high-speed network]
33
Embedded Global Object Space (GOS)

Main Features
Take advantage of JVM runtime information for optimization (e.g., object types, accessing threads, etc.)
Use a threaded I/O interface inside the JVM for communication to hide latency -> non-blocking GOS access
OO-based to reduce false sharing
Home-based, compliant with the JVM Memory Model ("Lazy Release Consistency")
Master Heap (home objects) and Cache Heap (local and cached objects): reduce object access latency
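
A heavily simplified sketch of the home-based idea follows: one node holds the master copy of each object, other nodes keep cached copies, cached writes are flushed to the home at lock release, and cached copies are discarded at lock acquire. This is not the JESSICA2 protocol; the integer object ids, integer "objects", and the HomeNode and CachingNode classes are assumptions made to keep the example small.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class GosSketch {
    // Master heap: authoritative copies, keyed by a hypothetical object id.
    static final class HomeNode {
        final Map<Integer, Integer> master = new HashMap<>();
        int fetches = 0; // counts remote fetches, i.e. network round trips

        int fetch(int id) {
            fetches++;
            return master.get(id);
        }
    }

    // Cache heap on a non-home node.
    static final class CachingNode {
        private final HomeNode home;
        private final Map<Integer, Integer> cache = new HashMap<>();
        private final Set<Integer> dirty = new HashSet<>();

        CachingNode(HomeNode home) {
            this.home = home;
        }

        int read(int id) {
            Integer v = cache.get(id);
            if (v == null) {          // cache miss: fetch from the home
                v = home.fetch(id);
                cache.put(id, v);
            }
            return v;
        }

        void write(int id, int value) {
            cache.put(id, value);
            dirty.add(id);
        }

        // Lock acquire: drop cached copies so later reads see the home's values.
        void acquire() {
            flushDirty();
            cache.clear();
        }

        // Lock release: push local writes back to the home.
        void release() {
            flushDirty();
        }

        private void flushDirty() {
            for (int id : dirty) {
                home.master.put(id, cache.get(id));
            }
            dirty.clear();
        }
    }
}
```

Between synchronization points a node may read a stale cached value, which the release-consistency rules permit; the cost is that every acquire forces re-fetches from the home.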

34
Object Cache
35
Adaptive object home migration

Definition
"Home" of an object: the JVM that holds the master copy of the object
Problems
Cached objects need to be flushed and re-fetched from the home whenever synchronization happens
Adaptive object home migration
If the number of accesses from a thread dominates the total number of accesses to an object, the object's home is migrated to the node where the thread is running
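
The rule above could be sketched as a per-object access counter. The slides do not say what "dominates" means numerically, so the 50% threshold, the minimum sample size, and the choice to count per node rather than per thread are all assumptions, as is the class name.

```java
import java.util.HashMap;
import java.util.Map;

class HomeMigrationSketch {
    static final double DOMINANCE = 0.5; // assumed threshold; not from the slides
    static final int MIN_ACCESSES = 8;   // assumed warm-up before deciding

    private int homeNode;
    private int total = 0;
    private final Map<Integer, Integer> accessesByNode = new HashMap<>();

    HomeMigrationSketch(int initialHome) {
        this.homeNode = initialHome;
    }

    int home() {
        return homeNode;
    }

    // Records one access by a thread running on 'node'; may move the home there.
    void recordAccess(int node) {
        int count = accessesByNode.merge(node, 1, Integer::sum);
        total++;
        if (node != homeNode && total >= MIN_ACCESSES && count > DOMINANCE * total) {
            homeNode = node;
            accessesByNode.clear(); // start a fresh observation window
            total = 0;
        }
    }
}
```

Once the home sits on the node of the dominant accessor, that thread works on the master copy directly and no longer pays the flush and re-fetch cost at each synchronization.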

36
I/O redirection

Timer
Use the time on the master node as the standard time
Calibrate the time on each worker node when it registers with the master node
File I/O
Use half a word of the fd as the node number
Open file
For read, check local first, then the master node
For write, go to the master node
Read/Write
Go to the node specified by the node number in the fd
Network I/O
Connectionless send: do locally
Others: go to the master
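
One possible reading of the file I/O scheme is sketched below. It assumes a 32-bit fd whose upper 16 bits carry the node number and whose lower 16 bits carry the node-local descriptor; the slides say only "half word", so the exact layout, the GlobalFd name, and the nodeForOpen helper are assumptions.

```java
class GlobalFd {
    // Packs a node number and a node-local fd into one global fd.
    static int encode(int node, int localFd) {
        if (node < 0 || node > 0x7FFF || localFd < 0 || localFd > 0xFFFF) {
            throw new IllegalArgumentException("node or fd out of range");
        }
        return (node << 16) | localFd;
    }

    static int nodeOf(int fd) {
        return fd >>> 16;
    }

    static int localFdOf(int fd) {
        return fd & 0xFFFF;
    }

    // Open policy from the slide: reads prefer a local copy, writes go to the master.
    static int nodeForOpen(boolean forWrite, boolean existsLocally, int localNode, int masterNode) {
        if (!forWrite && existsLocally) {
            return localNode;
        }
        return masterNode;
    }
}
```

For example, encode(3, 7) yields a descriptor whose later read and write calls can be routed to node 3 by looking at nodeOf(fd), without any lookup table.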

37
Experimental Setting

Modified Kaffe Open JVM version 1.0.6
Linux PC Clusters
Pentium II PCs at 540MHz (Linux 2.2.1 kernel)
Connected by Fast Ethernet
HKU Gideon 300 Cluster (RayTracing)

38
Parallel Ray Tracing on JESSICA2 (running on the 64-node Gideon 300 cluster)
Linux 2.4.18-3 kernel (Red Hat 7.3)
64 nodes: 108 seconds
1 node: 3430 seconds (about 1 hour)
Speedup: 4402/108 = 40.75
39
Micro Benchmarks
(PI Calculation)
40
Java Grande Benchmark
41
SPECjvm98 Benchmark
M-: migration mechanism disabled; M+: migration mechanism enabled.
I+: pseudo-inlining enabled; I-: pseudo-inlining disabled.
42
JESSICA2 vs JESSICA (CPI)
43
Application benchmark
44
Effect of Adaptive object home migration (SOR)
45
Conclusions

Transparent Java thread migration in a JIT compiler enables high-performance execution of multithreaded Java applications on clusters while keeping the merits of Java
JVM approach -> dynamic class loading
Just-in-Time compilation for speed
An embedded GOS layer can take advantage of JVM runtime information to reduce communication overhead

46
Thanks