JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support

1
JESSICA2 A Distributed Java Virtual Machine
with Transparent Thread Migration Support
  • Wenzhang Zhu, Cho-Li Wang, Francis Lau
  • The Systems Research Group
  • Department of Computer Science and Information
    Systems
  • The University of Hong Kong

2
HKU JESSICA Project
  • JESSICA: Java-Enabled Single-System-Image Computing Architecture
  • Project started in 1996. First version (JESSICA1)
    in 1999.
  • A middleware that runs on top of the standard
    UNIX/Linux operating system to support parallel
    execution of multi-threaded Java applications in
    a cluster of computers.
  • JESSICA hides the physical boundaries between
    machines and makes the cluster appear as a single
    computer to applications -- a single-system image
    (SSI).
  • Special feature: preemptive thread migration, which allows a thread to move freely between machines.
  • Part of the RGC's Area of Excellence project in 1999-2002.

3
JESSICA Team Members
  • Supervisors
  • Dr. Francis C.M. Lau
  • Dr. Cho-Li Wang
  • Research Students
  • Ph.D.: Wenzhang Zhu (Thread Migration)
  • Ph.D.: WeiJian Fang (Global Heap)
  • M.Phil.: Zoe Ching Han Yu (Distributed Garbage Collection)
  • Ph.D.: Benny W. L. Cheung (Software Distributed Shared Memory)
  • Graduated: Matchy Ma (JESSICA1)

4
Outline
  • Introduction to Cluster Computing
  • Motivations
  • Related work
  • JESSICA2 features
  • Performance analysis
  • Conclusion and future work

5
What's a cluster?
  • "A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource." [IEEE TFCC]
  • My definition: an HPC system that integrates mainstream commodity components to process large-scale problems → low cost, self-made, yet powerful.

6
Cluster Computer Architecture
[Diagram: cluster architecture layers, from top to bottom: Cluster Applications (web, storage, computing, rendering, financing, ...); Programming Environment (Java, C, MPI, HPF, DSM); Management, Monitoring and Job Scheduling; Single System Image Infrastructure; Availability Infrastructure; an OS on each node; nodes connected by a high-speed LAN (Fast/Gigabit Ethernet, SCI, Myrinet).]
7
Single System Image (SSI)?
  • JESSICA Project: Java-Enabled Single-System-Image Computing Architecture
  • A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.
  • Ultimate goal of SSI: make the cluster appear like a single machine to the user, to applications, and to the network.

Single Entry Point, Single File System, Single
Virtual Networking, Single I/O and Memory Space,
Single Process Space, Single Management /
Programming View
8
Top 500 computers by classification (June 2002) (Source: http://www.top500.org/)
MPP = Massively Parallel Processor; Constellation = e.g., a cluster of HPCs; Cluster = a cluster of PCs; SMP = Symmetric Multiprocessor
  • About the TOP500 List
  • the 500 most powerful computer systems installed
    in the world.
  • Compiled twice a year since June 1993
  • Ranked by their performance on the LINPACK
    Benchmark

9
#1 Supercomputer: NEC's Earth Simulator
  • Linpack: 35.86 Tflop/s
  • (Tera FLOPS = 10^12 floating-point operations per second, roughly 450 x Pentium 4 PCs)
  • Interconnect: single-stage crossbar (1,800 miles of cable), 83,000 copper cables, 16 GB/s cross-section bandwidth
  • Area of computer: 4 tennis courts, 3 floors
  • Built by NEC: 640 processor nodes, each consisting of 8 vector processors (5,120 processors in total), 40 Tflop/s peak, and 10 TB memory.

(Source: NEC)
10
Other Supercomputers in the TOP500
  • #2 and #3 Supercomputer: ASCI Q
  • 7.7 TF/s Linpack performance.
  • Los Alamos National Laboratory, U.S.
  • HP AlphaServer SC (375 x 32-way multiprocessors, 11,968 processors in total), 12 terabytes of memory and 600 terabytes of disk storage.
  • #4: IBM ASCI White (U.S.)
  • 8,192 copper microprocessors (IBM SP POWER3), 6 trillion bytes (6 TB) of memory, and more than 160 TB of IBM disk storage capacity. Linpack: 7.22 Tflop/s. Located at Lawrence Livermore National Laboratory.
  • 512-node, 16-way symmetric multiprocessor. Covers an area the size of two basketball courts, weighs 106 tons, uses 2,000 miles of copper wiring. Cost: US$110 million.

11
TOP500 Nov 2002 List
  • 2 new PC clusters made the TOP 10:
  • #5 is a Linux NetworX/Quadrics cluster at Lawrence Livermore National Laboratory.
  • #8 is an HPTi/Myrinet cluster at the Forecast Systems Laboratory at NOAA.
  • A total of 55 Intel-based and 8 AMD-based PC clusters are in the TOP500.
  • The number of clusters in the TOP500 grew again, to a total of 93 systems.

12
Poor Man's Cluster
  • HKU Ostrich Cluster
  • 32 x 733 MHz Pentium III PCs, 384 MB memory each
  • Hierarchical Ethernet-based network (four 24-port Fast Ethernet switches + one 8-port Gigabit Ethernet backbone switch)

13
Rich Man's Cluster
  • Computational Plant (C-Plant cluster)
  • 1,536 Compaq DS10L 1U servers (466 MHz Alpha 21264 (EV6) microprocessor, 256 MB ECC SDRAM)
  • Each node contains a 64-bit, 33 MHz Myrinet network interface card (1.28 Gb/s) connected to a 64-port Mesh64 switch.
  • 48 cabinets, each of which contains 32 nodes (48 x 32 = 1,536)

14
The HKU Gideon 300 Cluster (operating since mid-Oct. 2002)
Linpack performance: 355 Gflops, #175 in the TOP500 (Nov. 2002 list)
300 PCs (2.0 GHz Pentium 4, 512 MB DDR memory, 40 GB disk, Linux OS) connected by a 312-port Foundry FastIron 1500 (Fast Ethernet) switch
15
Building Gideon 300
16
JESSICA2 Introduction
  • Research goal:
  • High-performance Java computing using clusters
  • Why Java?
  • The dominant language for server-side programming.
  • More than 2 million Java developers [CNETAsia, 06/2002]
  • Platform independent: compile once, run anywhere
  • Code mobility (i.e., dynamic class loading) and data mobility (i.e., object serialization).
  • Built-in multithreading support at the language level (parallel programming using MPI, PVM, RMI, RPC, HPF, or DSM is difficult)
  • Why a cluster?
  • Large-scale server-side applications need high-performance multithreaded programming support.
  • A cluster provides a scalable hardware platform for true parallel execution.

17
Java Virtual Machine
  • Class Loader
  • Loads class files
  • Interpreter
  • Executes bytecode
  • Runtime Compiler
  • Converts bytecode to native code

[Diagram: application and Java API class files enter the class loader; the resulting bytecode is either executed directly by the interpreter or converted by the runtime compiler into native code.]
18
Threads in JVM
A Multithreaded Java Program
public class ProducerConsumerTest {
    public static void main(String[] args) {
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();
        c1.start();
    }
}
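The slide shows only the main method; the CubbyHole, Producer, and Consumer classes come from the classic Java tutorial producer-consumer example and are not on the slide. A minimal sketch, assuming the usual wait/notify-based implementation:

class CubbyHole {
    private int contents;
    private boolean available = false;

    public synchronized int get() {
        while (!available) {
            try { wait(); } catch (InterruptedException ignored) { }
        }
        available = false;
        notifyAll();                 // wake a waiting producer
        return contents;
    }

    public synchronized void put(int value) {
        while (available) {
            try { wait(); } catch (InterruptedException ignored) { }
        }
        contents = value;
        available = true;
        notifyAll();                 // wake a waiting consumer
    }
}

class Producer extends Thread {
    private final CubbyHole cubbyhole;
    private final int number;
    Producer(CubbyHole c, int number) { this.cubbyhole = c; this.number = number; }
    public void run() {
        for (int i = 0; i < 10; i++) cubbyhole.put(i);
    }
}

class Consumer extends Thread {
    private final CubbyHole cubbyhole;
    private final int number;
    Consumer(CubbyHole c, int number) { this.cubbyhole = c; this.number = number; }
    public void run() {
        for (int i = 0; i < 10; i++) cubbyhole.get();
    }
}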
[Diagram: class files enter the class loader and fill the Java method area (code); each thread (Thread 1, 2, 3) has its own PC and stack frames in the execution engine; all threads share the heap (data) of objects.]
19
Java Memory Model
  • Defines the memory consistency semantics of multi-threaded Java programs:
  • when values must be transferred between the main memory and per-thread working memory.
  • There is a lock associated with each object:
  • protects critical sections;
  • maintains memory consistency between threads.
  • Basic rule (illustrated below):
  • releasing a lock forces a flush of all writes from the working memory employed by the thread, and acquiring a lock forces a (re)load of the values of accessible fields.
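As a hedged illustration of this rule (the class and names are ours, not from the slide): both methods below synchronize on the same lock, so the release at the end of increment() flushes the write of value to main memory, and the acquire at the start of read() reloads it.

class SharedCounter {
    private int value;                    // master copy lives in main memory
    private final Object lock = new Object();

    void increment() {
        synchronized (lock) {             // acquire: (re)load accessible fields
            value++;                      // update in this thread's working memory
        }                                 // release: flush the write to main memory
    }

    int read() {
        synchronized (lock) {             // acquire: guarantees the latest flushed value is seen
            return value;
        }
    }
}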

20
Threads in a JVM
Java Memory Model: how is memory consistency maintained between threads?
[Diagram: threads T1 and T2 each have per-thread working memory; a variable modified in T1's working memory must be flushed to the master copy of its object in the main-memory heap area before T2 can observe the new value.]
21
Distributed Java Virtual Machine (DJVM)
JESSICA2: a distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications, with a Single System Image illusion to Java threads.
[Diagram: Java threads created in a program are distributed across the cluster; a Global Object Space spans the nodes, above the OS of each PC and a high-speed network.]
22
Problems in Existing DJVMs
  • Mostly based on interpreters
  • Simple but slow
  • Layered design using a distributed shared memory system (DSM) → can't be tightly coupled with the JVM
  • JVM runtime information can't be channeled to the DSM
  • False sharing if a page-based DSM is employed
  • Page faults block the whole JVM
  • Programmer specifies the thread distribution → lack of transparency
  • Need to rewrite multithreaded Java applications
  • No dynamic thread distribution (preemptive thread migration) for load balancing.

23
Related Work
  • Method shipping: IBM cJVM
  • Like remote method invocation (RMI): when accessing object fields, the proxy redirects the flow of execution to the node where the object's master copy is located.
  • Executed in interpreter mode.
  • Load balancing problem: affected by the object distribution.
  • Page shipping: Rice U. Java/DSM, HKU JESSICA
  • Simple. The GOS was supported by some page-based distributed shared memory (e.g., TreadMarks, JUMP, JiaJia).
  • JVM runtime information can't be channeled to the DSM.
  • Executed in interpreter mode.
  • Object shipping: Hyperion, Jackal
  • Leverage some object-based DSM.
  • Executed in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code.

24
Related Work(Summary)
System    | Thread distribution    | Object access    | Execution mode
cJVM      | Method shipping        | Proxy            | Interpreter
Java/DSM  | Manual migration       | Page-based DSM   | Interpreter
JESSICA   | Transparent migration  | Page-based DSM   | Interpreter
25
JESSICA2 Main Features
  • Transparent Java thread migration
  • Runtime capturing and restoring of the thread execution context.
  • No source code modification, no bytecode instrumenting (preprocessing), no new API introduced.
  • Enables dynamic load balancing on clusters.
  • Operates in Just-In-Time (JIT) compilation mode.
  • Global Object Space
  • A shared global heap spanning all cluster nodes.
  • Adaptive object home migration protocol.
  • I/O redirection.

26
JESSICA2 Architecture
[Architecture diagram: a multithreaded Java program (bytecode or source code, e.g. the ProducerConsumerTest code shown earlier) runs unmodified on the JESSICA2 DJVM spanning the cluster nodes.]
27
Transparent Thread Migration in JIT Mode
  • Simple for interpreters (e.g., JESSICA)
  • The interpreter sits in the bytecode decoding loop, which can be stopped upon checking a migration flag.
  • The full state of a thread is available in the data structures of the interpreter.
  • No register allocation.
  • JIT-mode execution makes things complex (JESSICA2)
  • Native code has no clear bytecode boundary.
  • How to deal with machine registers?
  • How to organize the stack frames (all are in native form now)?
  • How to make extracted thread states portable and recognizable by the remote JVM?
  • How to restore the extracted states (rebuild the stack frames) and restart the execution in native form?

The JIT compiler must be modified to instrument native code.
28
Approaches
  • Using JVMDI (e.g., HKU M-JavaMPI)?
  • Only the recent JDK 1.4.1 (Aug. 2002) provides full-speed debugging to support the capturing of thread status.
  • Portable but too heavy:
  • needs large data structures to keep debug information.
  • JVMDI alone can't support the full functionality of a DJVM:
  • How to access remote objects?
  • Put a DSM under it? But you can't control the Sun JVM's memory allocation unless you get the latest JDK source code.
  • Our lightweight approach
  • Provide the minimum functions required to capture and restore Java threads, to support Java thread migration.
29
An overview of JESSICA2 Java thread migration
[Diagram of a thread migration, source node to destination node: (1) the load monitor alerts the thread scheduler in the source JVM; (2) the migration manager performs stack analysis and stack capturing on the thread's frames; (3) the captured frames are shipped to the destination node, where frame parsing and execution restoration rebuild the thread; (4a) subsequent object accesses go through the GOS (heap); (4b) method code is loaded from NFS into the destination JVM's method area.]
30
What are those functions?
  • Migration point selection
  • Delayed to the head of a loop basic block or a method.
  • Register context handler
  • Spill dirty registers at the migration point without invalidation, so that native code can continue to use the registers.
  • Use a register-recovering stub in the restoring phase.
  • Variable type deduction
  • Spill types in stacks using compression.
  • Java frame linking
  • Discover consecutive Java frames.

A conceptual sketch of an instrumented migration point follows.
31
Dynamic Thread State Capturing and Restoring in
JESSICA2
[Diagram of the JIT pipeline: after the bytecode verifier, migration points are selected; during bytecode translation the compiler instruments the intermediate code to (1) add migration checking (cmp mflag, 0; jz ...), (2) add object checking (cmp objoffset, 0; jz ...) and (3) add type and register spilling (mov 0x110182, slot; ...); register allocation and code generation then emit native code. For capturing, the native thread stack (Java frames and C frames) is scanned, with linking and constant resolution, and remote data goes through global object access. For restoring, a register-recovering stub (mov slot1->reg1; mov slot2->reg2; ...) reloads the spilled slots into registers before execution resumes.]
32
How to Maintain Memory Consistency in a Distributed Environment?
[Diagram: threads T1-T8 run on four cluster nodes, each node with its own heap on its own OS/PC, connected by a high-speed network; the per-node heaps must be kept consistent.]
33
Embedded Global Object Space (GOS)
  • Main features:
  • Takes advantage of JVM runtime information for optimization (e.g., object types, accessing threads, etc.).
  • Uses a threaded I/O interface inside the JVM for communication, to hide the latency → non-blocking GOS access.
  • Object-based, to reduce false sharing.
  • Home-based, compliant with the JVM memory model (lazy release consistency).
  • Master heap (home objects) and cache heap (local and cached objects) reduce object access latency.

A sketch of this access path appears below.
34
Object Cache
35
Adaptive object home migration
  • Definition:
  • the home of an object is the JVM that holds the master copy of the object.
  • Problem:
  • cached objects need to be flushed and re-fetched from the home whenever synchronization happens.
  • Adaptive object home migration:
  • if the # of accesses from one thread dominates the total # of accesses to an object, the object's home is migrated to the node where that thread is running (a sketch of this heuristic follows).
36
I/O redirection
  • Timer
  • Use the time on the master node as the standard time.
  • Calibrate the time on each worker node when it registers with the master node.
  • File I/O
  • Use one half-word of the fd as the node number (see the sketch after this list).
  • Open:
  • For read, check the local node first, then the master node.
  • For write, go to the master node.
  • Read/write:
  • Go to the node specified by the node number in the fd.
  • Network I/O
  • Connectionless send: do it locally.
  • Everything else goes to the master node.
37
Experimental Setting
  • Modified from the open-source Kaffe JVM, version 1.0.6.
  • Linux PC clusters:
  • Pentium II PCs at 540 MHz (Linux 2.2.1 kernel), connected by Fast Ethernet.
  • HKU Gideon 300 cluster (ray tracing).

38
Parallel Ray Tracing on JESSICA2 (running on a 64-node Gideon 300 cluster)
Linux 2.4.18-3 kernel (Red Hat 7.3). 64 nodes: 108 seconds; 1 node: 3,430 seconds (about 1 hour). Speedup = 4402/108 = 40.75
39
Micro Benchmarks
(PI Calculation)
40
Java Grande Benchmark
41
SPECjvm98 Benchmark
M-: migration mechanism disabled; M+: migration mechanism enabled. I+: pseudo-inlining enabled; I-: pseudo-inlining disabled.
42
JESSICA2 vs JESSICA (CPI)
43
Application benchmark
44
Effect of Adaptive object home migration (SOR)
45
Conclusions
  • Transparent Java thread migration in a JIT compiler enables the high-performance execution of multithreaded Java applications on clusters while keeping the merits of Java:
  • the JVM approach → dynamic class loading;
  • Just-in-Time compilation for speed.
  • An embedded GOS layer can take advantage of JVM runtime information to reduce communication overhead.

46
Thanks
  • HKU SRG: http://www.srg.csis.hku.hk/
  • JESSICA2 webpage: http://www.csis.hku.hk/~clwang/projects/JESSICA2.html