Safe and Efficient Cluster Communication in Java using Explicit Memory Management

1
Safe and Efficient Cluster Communication in Java
using Explicit Memory Management
  • Chi-Chao Chang
  • Dept. of Computer Science
  • Cornell University

2
Goal
  • High-performance cluster computing with safe
    languages
  • parallel and distributed applications
  • Use off-the-shelf technologies
  • Java
  • safe, "better C"
  • write once, run everywhere
  • growing interest in high-performance
    applications (Java Grande)
  • User-level network interfaces (UNIs)
  • direct, protected access to network devices
  • prototypes: U-Net (Cornell), Shrimp (Princeton),
    FM (UIUC)
  • industry standard: Virtual Interface Architecture
    (VIA)
  • cost-effective clusters: new 256-processor
    cluster @ Cornell TC

3
Java Networking
  • Traditional front-end approach
  • pick favorite abstraction (sockets, RMI, MPI) and
    Java VM
  • write a Java front-end to custom or existing
    native libraries
  • good performance, re-use proven code
  • magic in native code, no common solution
  • Interface Java with Network Devices
  • bottom-up approach
  • minimizes amount of unverified code
  • focus on fundamental data transfer inefficiencies
    due to
  • 1. Storage safety
  • 2. Type safety

4
Outline
  • Thesis Overview
  • GC/Native heap separation, object serialization
  • Experimental Setup: VI Architecture and Marmot
  • Part I: Array Transfers
  • (1) Javia-I: Java Interface to VI Architecture
  • respects heap separation
  • (2) Jbufs: Safe and Explicit Management of
    Buffers
  • Javia-II, matrix multiplication, Active Messages
  • Part II: Object Transfers
  • (3) A Case For Specialization
  • micro-benchmarks, RMI using Javia-I/II, impact on
    application suite
  • (4) Jstreams: in-place de-serialization
  • micro-benchmarks, RMI using Javia-III, impact on
    application suite
  • Conclusions

5
(1) Storage Safety
  • Java programs are garbage-collected
  • no explicit de-allocation; GC tracks and frees
    garbage objects
  • programs are oblivious to the GC scheme used:
    non-copying (e.g. conservative) or copying
  • no control over location of objects
  • Modern network and I/O devices
  • direct DMA from/into user buffers
  • native code is necessary to interface with
    hardware devices

6
(1) Storage Safety
Result: hard separation between GC and native heaps

[Diagram: (a) Hard Separation, copy-on-demand: data is copied from the
GC heap into a pinned native-heap buffer before DMA, and GC stays ON.
(b) Optimization, pin-on-demand: GC-heap buffers are pinned in place
for DMA, with GC turned OFF.]
  • Pin-on-demand only works for send/write
    operations
  • For receive/read operations, GC must be disabled
    indefinitely...

7
(1) Storage Safety Effect
  • Best-case scenario: 10-40% hit in throughput
  • pick your favorite JVM, your fastest network
    interface, and a pair of 450MHz P-IIs with a
    commodity OS
  • pinning on demand is expensive...

8
(2) Type Safety
  • Cannot forge a reference to a Java object
  • b is an array of bytes
  • in C:
    double *data = (double *)b;
  • in Java:
    double[] data = new double[1024/8];
    for (int i = 0, off = 0; i < 1024/8; i++, off += 8) {
      int upper = (((b[off]   & 0xff) << 24) |
                   ((b[off+1] & 0xff) << 16) |
                   ((b[off+2] & 0xff) << 8)  |
                    (b[off+3] & 0xff));
      int lower = (((b[off+4] & 0xff) << 24) |
                   ((b[off+5] & 0xff) << 16) |
                   ((b[off+6] & 0xff) << 8)  |
                    (b[off+7] & 0xff));
      data[i] = Double.longBitsToDouble((((long)upper) << 32) |
                                        (lower & 0xffffffffL));
    }

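The loop above can be exercised end-to-end. Below is a minimal, self-contained sketch (class and helper names are illustrative, not from the thesis) that decodes a big-endian byte array into doubles exactly as the slide's code does, and cross-checks the result against java.nio.ByteBuffer:

```java
import java.nio.ByteBuffer;

public class DecodeDoubles {
    // Reassemble big-endian bytes into doubles, as in the slide's loop.
    static double[] manualDecode(byte[] b) {
        double[] data = new double[b.length / 8];
        for (int i = 0, off = 0; i < data.length; i++, off += 8) {
            int upper = ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
                      | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
            int lower = ((b[off + 4] & 0xff) << 24) | ((b[off + 5] & 0xff) << 16)
                      | ((b[off + 6] & 0xff) << 8) | (b[off + 7] & 0xff);
            data[i] = Double.longBitsToDouble(((long) upper << 32)
                                              | (lower & 0xffffffffL));
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] b = new byte[16];
        // ByteBuffer's default order is big-endian, matching the loop.
        ByteBuffer.wrap(b).putDouble(3.5).putDouble(-0.25);
        double[] d = manualDecode(b);
        if (d[0] != 3.5 || d[1] != -0.25)
            throw new AssertionError("decode mismatch");
        System.out.println("ok");
    }
}
```

The slide's point stands either way: without a way to reinterpret the buffer, safe Java must rebuild every double from its bytes, where C does it with a single cast.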
9
(2) Type Safety
  • Objects have meta-data
  • runtime safety checks (array-bounds, array-store,
    casts)

In C:
  struct Buffer { int len; char data[1]; };
  Buffer *b = malloc(sizeof(Buffer) + 1024);
  b->len = 1024;

In Java:
  class Buffer {
    int len; byte[] data;
    Buffer(int n) { data = new byte[n]; len = n; }
  }
  Buffer b = new Buffer(1024);
10
(2) Type Safety
  • Result Java objects need to be serialized and
    de-serialized across the network

[Diagram: the object is serialized from the GC heap into a pinned
native-heap buffer (serialize + copy + pin), then DMA'd onto the
network; GC stays ON.]
11
(2) Type Safety Effect
  • Performance hit of one order of magnitude
  • pick your favorite high-level communication
    abstraction (e.g. Remote Method Invocation)
  • pick your favorite JVM, your fastest network
    interface, and a pair of 450MHz P-IIs

12
Thesis
  • Use explicit memory management to improve Java
    communication performance
  • Jbufs: safe and explicit management of Java
    buffers
  • softens the GC/Native heap separation
  • preserves type and storage safety
  • zero-copy array transfers
  • Jstreams: extends Jbufs for optimizing
    serialization in clusters
  • zero-copy de-serialization of arbitrary objects

[Diagram: a jbuf in application memory is pinned for DMA directly;
the GC heap boundary becomes user-controlled.]
13
Outline
  • Thesis Overview
  • GC/Native heap separation, object serialization
  • Experimental Setup: Giganet cluster and Marmot
  • Part I: Array Transfers
  • (1) Javia-I: Java Interface to VI Architecture
  • respects heap separation
  • (2) Jbufs: Safe and Explicit Management of
    Buffers
  • Javia-II, matrix multiplication, Active Messages
  • Part II: Object Transfers
  • (3) A Case For Specialization
  • micro-benchmarks, RMI using Javia-I/II, impact on
    application suite
  • (4) Jstreams: in-place de-serialization
  • micro-benchmarks, RMI using Javia-III, impact on
    application suite
  • Conclusions

14
Giganet Cluster
  • Configuration
  • 8 P-II 450MHz, 128MB RAM
  • 8 1.25 Gbps Giganet GNN-1000 adapters
  • one Giganet switch
  • GNN-1000 adapter: user-level network interface
  • Virtual Interface Architecture implemented as a
    library (Win32 DLL)
  • Base-line point-to-point performance
  • 14µs r/t latency, 16µs with switch
  • over 100MBytes/s peak, 85MBytes/s with switch

15
Marmot
  • Java system from Microsoft Research
  • not a VM
  • static compiler: bytecode (.class) to x86 (.asm)
  • linker: .asm files + runtime libraries ->
    executable (.exe)
  • no dynamic loading of classes
  • most Dragon-book opts, some OO and Java-specific
    opts
  • Advantages
  • source code
  • good performance
  • two types of non-concurrent GC (copying,
    conservative)
  • native interface close enough to JNI

16
Outline
  • Thesis Overview
  • GC/Native heap separation, object serialization
  • Experimental Setup: Giganet cluster and Marmot
  • Part I: Array Transfers
  • (1) Javia-I: Java Interface to VI Architecture
  • respects heap separation
  • (2) Jbufs: Safe and Explicit Management of
    Buffers
  • Javia-II, matrix multiplication, Active Messages
  • Part II: Object Transfers
  • (3) A Case For Specialization
  • micro-benchmarks, RMI using Javia-I/II, impact on
    application suite
  • (4) Jstreams: in-place de-serialization
  • micro-benchmarks, RMI using Javia-III, impact on
    application suite
  • Conclusions

17
Javia-I
  • Basic architecture
  • respects heap separation
  • buffer mgmt in native code
  • Marmot as an off-the-shelf system
  • copying GC disabled in native code
  • primitive-array transfers only
  • Send/Recv API
  • non-blocking
  • blocking
  • bypass ring accesses
  • pin-on-demand
  • alloc-recv: allocates new array on-demand
  • cannot eliminate copying during recv

18
Javia-I Performance
Basic costs (PII-450, Windows 2000 b3): pin + unpin =
(10 + 10)µs, or ~5000 machine cycles each; Marmot:
native call 0.28µs, locks 0.25µs, array alloc 0.75µs
Latency (N = transfer size in bytes):
  16.5µs + (25ns)N   raw
  38.0µs + (38ns)N   pin(s)
  21.5µs + (42ns)N   copy(s)
  18.0µs + (55ns)N   copy(s)+alloc(r)
BW: 75% to 85% of raw for 16KByte transfers
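The fitted lines above all have the form T(N) = fixed overhead + per-byte cost × N. A small sketch (coefficients taken from the slide; class and method names are illustrative) showing how the per-byte term dominates large transfers:

```java
public class LatencyModel {
    // T(N) in microseconds: fixed overhead plus per-byte cost.
    static double latencyUs(double overheadUs, double perByteNs, int n) {
        return overheadUs + perByteNs * n / 1000.0;
    }

    public static void main(String[] args) {
        int n = 16 * 1024;                     // 16 KByte transfer
        double raw  = latencyUs(16.5, 25, n);  // raw send/recv
        double copy = latencyUs(21.5, 42, n);  // copy on send
        System.out.printf("raw %.1fus, copy %.1fus%n", raw, copy);
        // The extra ~17ns/byte of copying swamps the 5us of fixed cost.
        if (copy <= raw) throw new AssertionError();
    }
}
```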
19
jbufs
  • Goal
  • provide buffer management capabilities to Java
    without violating its safety properties
  • re-use is important: amortizes high pinning costs
  • a jbuf exposes communication buffers to Java
    programmers
  • 1. lifetime control: explicit allocation and
    de-allocation
  • 2. efficient access: direct access as
    primitive-typed arrays
  • 3. location control: safe de-allocation and
    re-use by controlling whether or not a jbuf is
    part of the GC heap
  • heap separation becomes soft and user-controlled

20
jbufs Lifetime Control
public class jbuf {
  public static jbuf alloc(int bytes);  /* allocates jbuf outside of GC heap */
  public void free() throws CannotFreeException;  /* frees jbuf if it can */
}

[Diagram: a handle in the GC heap refers to the jbuf allocated outside it.]
  • 1. jbuf allocation does not result in a Java
    reference to it
  • cannot access the jbuf from the wrapper object
  • 2. jbuf is not automatically freed if there are
    no Java references to it
  • free has to be explicitly called

21
jbufs Efficient Access
public class jbuf {  /* alloc and free omitted */
  public byte[] toByteArray() throws TypedException;  /* hands out byte[] ref */
  public int[] toIntArray() throws TypedException;    /* hands out int[] ref */
  . . .
}

[Diagram: a Java byte[] reference in the GC heap points into the jbuf.]
  • 3. (Storage Safety) jbuf remains allocated as
    long as there are array references to it
  • when can we ever free it?
  • 4. (Type Safety) jbuf cannot have two differently
    typed references to it at any given time
  • when can we ever re-use it (e.g. change its
    reference type)?

22
jbufs Location Control
public class jbuf {  /* alloc, free, toArrays omitted */
  public void unRef(CallBack cb);  /* app intends to free/re-use jbuf */
}
  • Idea: use the GC to track references
  • unRef: application claims it has no references
    into the jbuf
  • jbuf is added to the GC heap
  • GC verifies the claim and notifies the application
    through a callback
  • application can now free or re-use the jbuf
  • Required GC support: change the scope of the GC
    heap dynamically

23
jbufs Runtime Checks
[State diagram:
  alloc -> unref
  unref --to<p>Array--> ref<p>
  ref<p> --to<p>Array, GC--> ref<p>
  ref<p> --unRef--> to-be-unref<p>
  to-be-unref<p> --to<p>Array, unRef--> to-be-unref<p>
  to-be-unref<p> --GC--> unref
  unref --free--> freed]
  • Type safety: ref and to-be-unref states are
    parameterized by primitive type
  • GC transition depends on the type of garbage
    collector
  • non-copying: transition only if all refs to the
    array are dropped before GC
  • copying: transition occurs after every GC

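The checks above can be modeled as a small state machine. The following is an illustrative sketch (class and method names are hypothetical, not the thesis API) of the unref / ref&lt;p&gt; / to-be-unref&lt;p&gt; transitions, using the copying-collector rule that every GC clears a to-be-unref jbuf:

```java
// Illustrative model of the jbuf runtime checks; names are hypothetical.
public class JbufStateMachine {
    enum State { UNREF, REF, TO_BE_UNREF }
    private State state = State.UNREF;
    private Class<?> type;                 // primitive type parameter <p>

    void toArray(Class<?> t) {             // to<p>Array
        if (state != State.UNREF && type != t)
            throw new IllegalStateException("already typed as " + type);
        if (state == State.UNREF) { state = State.REF; type = t; }
        // in REF and TO_BE_UNREF the state is unchanged (self-loops)
    }

    void unRef() {                         // application drops its claim
        if (state == State.REF) state = State.TO_BE_UNREF;
    }

    void gc() {                            // copying collector ran
        if (state == State.TO_BE_UNREF) { state = State.UNREF; type = null; }
    }

    void free() {
        if (state != State.UNREF)
            throw new IllegalStateException("cannot free: still referenced");
    }

    public static void main(String[] args) {
        JbufStateMachine j = new JbufStateMachine();
        j.toArray(int.class);              // unref -> ref<int>
        try { j.toArray(byte.class); throw new AssertionError(); }
        catch (IllegalStateException expected) { }   // type safety enforced
        j.unRef();                         // ref<int> -> to-be-unref<int>
        j.gc();                            // to-be-unref<int> -> unref
        j.free();                          // now legal
        System.out.println("ok");
    }
}
```

In the real system the gc() transition is driven by the collector verifying the application's unRef claim, not by the application itself.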
24
Javia-II
  • Exploiting jbufs
  • explicit pinning/unpinning of jbufs
  • only non-blocking send/recvs

25
Javia-II Performance
Basic jbuf costs: allocation 1.2µs, toArray 0.8µs,
unRef 2.3µs, GC degradation 1.2µs/jbuf
Latency (N = transfer size):
  16.5µs + (0.025µs)N   raw
  20.5µs + (0.025µs)N   jbufs
  38.0µs + (0.038µs)N   pin(s)
  21.5µs + (0.042µs)N   copy(s)
BW within 1% of raw
26
MM Communication
  • pMM over Javia-II/jbufs spends at least 25% less
    time in communication for 256x256 matrices on 8
    processors

27
MM Overall
  • Cache effects: better communication performance
    does not always translate to better overall
    performance

28
Active Messages
class First extends AMHandler {
  private int first;
  void handler(AMJbuf buf, ...) {
    int[] tmp = buf.toIntArray();
    first = tmp[0];
  }
}

class Enqueue extends AMHandler {
  private Queue q;
  void handler(AMJbuf buf, ...) {
    int[] tmp = buf.toIntArray();
    q.enq(tmp);
  }
}
  • Exercising Jbufs
  • user supplies a list of jbufs
  • upon message arrival
  • jbuf passed to handler
  • unRef is invoked after handler invocation
  • if pool is empty, reclaim existing ones
  • copying deferred to GC-time only if needed

29
AM Performance
  • Latency: about 15µs higher than Javia
  • synch access to buffer pool, endpoint header,
    flow-control checks, handler id lookup
  • BW within 10% of peak for 16KByte messages

30
Jbufs Experience
  • Efficient access through arrays is useful
  • no indirect access via method invocation
  • promotes code re-use of large numerical kernels
  • leverages compiler infrastructure for eliminating
    safety checks
  • Limitations
  • still not as flexible as C buffers
  • stale references may confuse programmers
  • Discussed in thesis
  • the necessity of explicit de-allocation
  • implementation of Jbufs in Marmot's copying
    collector
  • impact on conservative and generational collectors
  • extensions to JNI to allow portable
    implementations of Jbufs

31
Outline
  • Thesis Overview
  • GC/Native heap separation, object serialization
  • Experimental Setup: VI Architecture and Marmot
  • Part I: Array Transfers
  • (1) Javia-I: Java Interface to VI Architecture
  • respects heap separation
  • (2) Jbufs: Safe and Explicit Management of
    Buffers
  • Javia-II, matrix multiplication, Active Messages
  • Part II: Object Transfers
  • (3) A Case For Specialization on Homogeneous
    Clusters
  • micro-benchmarks, RMI using Javia-I/II, impact on
    application suite
  • (4) Jstreams: in-place de-serialization
  • micro-benchmarks, RMI using Javia-III, impact on
    application suite
  • Conclusions

32
Object Serialization and RMI
  • Standard JOS Protocol
  • heavy-weight class descriptors are serialized
    along with objects
  • type-checking: classes need not be equal, just
    compatible
  • protocol allows for user extensions
  • Remote Method Invocation
  • object-oriented version of Remote Procedure Call
  • relies on JOS for argument passing
  • actual parameter object can be a sub-class of the
    formal parameter class

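The class-descriptor overhead is easy to observe with the standard java.io serialization machinery. A minimal sketch (class and helper names are illustrative) measuring how much larger the wire image is than the payload:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class JosOverhead {
    // Size in bytes of the standard JOS wire image of an object.
    static int serializedSize(Serializable o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        int size = serializedSize(new byte[16]);
        // The stream header plus the byte[] class descriptor make the
        // wire image considerably larger than the 16 payload bytes.
        System.out.println("payload=16, serialized=" + size);
        if (size <= 16) throw new AssertionError();
    }
}
```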
[Diagram: readObject copies the incoming message from the network into
newly allocated objects in the GC heap.]
33
JOS Costs
  • 1. overheads in the tens or hundreds of µs
  • send/recv overheads 3µs, memcpy of 500 bytes
    0.8µs
  • 2. double[] 50% more expensive than byte[] of
    similar size
  • 3. overheads grow as object sizes grow

34
Impact of Marmot
  • Impact of Marmot's optimizations
  • method inlining: up to 66% improvement (already
    deployed)
  • no synchronization whatsoever: up to 21%
    improvement
  • no safety checks whatsoever: up to 15% combined
  • Better compilation technology unlikely to reduce
    overheads substantially

35
Impact on RMI
  • Order of magnitude worse than Javia-I/II
  • round-trip latency drops to about 30µs in a null
    RMI (no JOS!)
  • peak bandwidth of 22MBytes/s, about 25% of raw

36
Impact on Applications
  • A case for specializing serialization for cluster
    applications
  • overheads an order of magnitude higher than
    send/recv and memcpy
  • RMI performance degraded by one order of
    magnitude
  • 5-15% estimated impact on applications
  • old adage: specialize for the common case

37
Optimizing De-serialization
  • In-place object de-serialization
  • specialization for homogeneous clusters and JVMs
  • Goal
  • eliminate copying and allocation of objects
  • Challenges
  • preserve the integrity of the receiving JVM
  • permit de-serialization of arbitrary Java objects
    with unrestricted usage and without special
    annotations
  • independent of a particular GC scheme

[Diagram: writeObject places the object directly into the stream;
the receiver reads it in place, GC heap to GC heap across the network.]
38
Jstreams write
public class Jstream extends Jbuf {
  public void writeObject(Object o)  /* serializes o onto the stream */
    throws TypedException, ReferencedException;
  public void writeClear()  /* clears the stream for writing */
    throws TypedException, ReferencedException;
}
  • writeObject
  • deep-copy of objects, maintaining in-memory layout
  • deals with cyclic data structures
  • swizzles pointers: offsets to a base address
  • replaces object meta-data with a 64-bit class
    descriptor
  • optimization: primitive-typed arrays in jbufs are
    not copied

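Pointer swizzling can be illustrated with a toy flat stream: on write, references are replaced by offsets into the buffer; on read, the offsets are walked in place with no per-node allocation or copying. This sketch is illustrative only (the names and layout are not the thesis wire format):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Swizzle {
    // Layout per node in the flat stream: [value, nextOffset], -1 = null.
    static int[] write(List<Integer> xs) {
        int[] buf = new int[xs.size() * 2];
        for (int i = 0; i < xs.size(); i++) {
            buf[2 * i] = xs.get(i);
            // swizzle: the "next" reference becomes an offset into buf
            buf[2 * i + 1] = (i + 1 < xs.size()) ? 2 * (i + 1) : -1;
        }
        return buf;
    }

    // "readObject": follow offsets in place instead of real references.
    static List<Integer> read(int[] buf) {
        List<Integer> out = new ArrayList<>();
        for (int off = 0; off != -1; off = buf[off + 1]) out.add(buf[off]);
        return out;
    }

    public static void main(String[] args) {
        int[] stream = write(Arrays.asList(1, 2, 3));
        if (!read(stream).equals(Arrays.asList(1, 2, 3)))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```

A real in-place scheme additionally rewrites class descriptors back to meta-data and bounds-checks every unswizzled offset, as the next slide describes.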
39
Jstreams read
public class Jstream extends Jbuf {
  public Object readObject() throws TypedException;  /* de-serialization */
  public boolean isJstream(Object o);  /* checks if o resides in the stream */
}
  • readObject
  • replaces class descriptors with meta-data
  • unswizzles pointers, array-bounds checking
  • after the first readObject, the jstream is added
    to the GC heap
  • tracks references coming out of read objects
  • unRef: user is willing to free or re-use

40
jstreams Runtime Checks
  • Modification to Javia-II: prevent DMA from
    clobbering de-serialized objects
  • receive posts not allowed if the jstream is in
    read mode
  • no changes to the Javia-II architecture

41
jstream Performance
  • De-serialization costs constant w.r.t. object
    size
  • 2.6µs for arrays, 3.3µs per list element

42
jstream Impact on RMI
  • 4-byte round-trip latency of 45µs (25µs higher
    than Javia-II)
  • 52MBytes/s for 16KByte arguments

43
jstream Impact on Applications
  • 3-10% improvement in SOR, EM3D, FFT
  • 10% hit in pMM performance
  • over 22,000 incoming RMIs, 1000 jstreams in the
    receive pool, 26 garbage collections, 15% of
    total execution time in GC
  • generational collection would alleviate GC costs
    substantially
  • receive pool size is hard to tune: tradeoffs
    between GC and locality

44
Jstreams Experience
  • Implementation of readObject and writeObject
    integrated into the JVM
  • protocol is JVM-specific
  • native implementation is faster
  • Limitations
  • not as flexible as Java streams: cannot read and
    write at the same time
  • no extensible wire protocols
  • Discussed in thesis
  • implementation of Jstreams in Marmot's copying
    collector
  • support for polymorphic RMI: minor changes to the
    stub compiler
  • JNI extensions to allow portable
    implementations of Jstreams

45
Related Work
  • Microsoft J-Direct
  • pinned arrays defined using source-level
    annotations
  • JIT produces code to redirect array access:
    expensive
  • Berkeley's Jaguar: efficient code generation with
    JIT extensions
  • security concern: JIT hacks may break Java or
    byte-code safety
  • Custom JVMs
  • many tricks are possible (e.g. pinned array
    factories, pinned and non-pinned heaps, etc.) but
    depend on a particular GC scheme
  • Jbufs isolates the minimal support needed from
    the GC
  • Memory management
  • Safe Regions (Gay and Aiken): reference counting,
    no GC
  • Fast serialization and RMI
  • KaRMI (Karlsruhe): fixed JOS, ground-up RMI
    implementation
  • Manta (Vrije U): fast RMI, but a Java dialect

46
Summary
  • Use of explicit memory management to improve Java
    communication performance in clusters
  • softens the GC/Native heap separation
  • preserves type and storage safety
  • independent of GC scheme
  • jbufs: zero-copy array transfers
  • jstreams: zero-copy de-serialization of arbitrary
    objects
  • Framework for building communication software and
    applications in Java
  • Javia-I/II
  • parallel matrix multiplication
  • Jam active messages
  • Java RMI
  • cluster applications: TSP, IDA, SOR, EM3D, FFT,
    and MM
