Safe and Efficient Cluster Communication in Java using Explicit Memory Management

1
Safe and Efficient Cluster Communication in Java
using Explicit Memory Management
  • Chi-Chao Chang
  • Dept. of Computer Science
  • Cornell University

2
Goal
  • High-performance cluster computing with safe
    languages
  • parallel and distributed applications
  • Use off-the-shelf technologies
  • Java
  • safe, "better C"
  • write once, run everywhere
  • growing interest in high-performance
    applications (Java Grande)
  • User-level network interfaces (UNIs)
  • direct, protected access to network devices
  • prototypes: U-Net (Cornell), Shrimp (Princeton),
    FM (UIUC)
  • industry standard: Virtual Interface Architecture
    (VIA)
  • cost-effective clusters: new 256-processor
    cluster @ Cornell TC

3
Java Networking
  • Traditional front-end approach
  • pick favorite abstraction (sockets, RMI, MPI) and
    Java VM
  • write a Java front-end to custom or existing
    native libraries
  • good performance, re-use proven code
  • magic in native code, no common solution
  • Interface Java with Network Devices
  • bottom-up approach
  • minimizes amount of unverified code
  • focus on fundamental data transfer inefficiencies
    due to
  • 1. Storage safety
  • 2. Type safety

4
Outline
  • Thesis Overview
  • GC/Native heap separation, object serialization
  • Experimental Setup: VI Architecture and Marmot
  • Part I: Array Transfers
  • (1) Javia-I: Java Interface to VI Architecture
  • respects heap separation
  • (2) Jbufs: Safe and Explicit Management of
    Buffers
  • Javia-II, matrix multiplication, Active Messages
  • Part II: Object Transfers
  • (3) A Case For Specialization
  • micro-benchmarks, RMI using Javia-I/II, impact on
    application suite
  • (4) Jstreams: in-place de-serialization
  • micro-benchmarks, RMI using Javia-III, impact on
    application suite
  • Conclusions

5
(1) Storage Safety
  • Java programs are garbage-collected
  • no explicit de-allocation; GC tracks and frees
    garbage objects
  • programs are oblivious to the GC scheme used:
    non-copying (e.g. conservative) or copying
  • no control over location of objects
  • Modern network and I/O devices
  • direct DMA from/into user buffers
  • native code is necessary to interface with
    hardware devices

6
(1) Storage Safety
Result: hard separation between GC and native heaps

[Diagram: (a) Hard Separation, copy-on-demand: data is copied from the
GC heap into a pinned native-heap buffer before DMA, and GC stays ON.
(b) Optimization, pin-on-demand: GC-heap buffers are pinned in place
for DMA, with GC turned OFF.]
  • Pin-on-demand only works for send/write
    operations
  • For receive/read operations, GC must be disabled
    indefinitely...

7
(1) Storage Safety Effect
  • Best-case scenario: 10-40% hit in throughput
  • pick your favorite JVM, your fastest network
    interface, and a pair of 450MHz P-IIs with a
    commodity OS
  • pinning on demand is expensive...

8
(2) Type Safety
  • Cannot forge a reference to a Java object
  • b is an array of bytes
  • in C:
    double *data = (double *)b;
  • in Java:
    double[] data = new double[1024/8];
    for (int i = 0, off = 0; i < 1024/8; i++, off += 8) {
      int upper = (((b[off]   & 0xff) << 24) |
                   ((b[off+1] & 0xff) << 16) |
                   ((b[off+2] & 0xff) << 8)  |
                    (b[off+3] & 0xff));
      int lower = (((b[off+4] & 0xff) << 24) |
                   ((b[off+5] & 0xff) << 16) |
                   ((b[off+6] & 0xff) << 8)  |
                    (b[off+7] & 0xff));
      data[i] = Double.longBitsToDouble((((long)upper) << 32) |
                                        (lower & 0xffffffffL));
    }

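The loop above can be exercised end-to-end. Below is a minimal, self-contained sketch (class and helper names are illustrative, not from the thesis) that decodes a big-endian byte array into doubles exactly as the slide's code does, and cross-checks the result against java.nio.ByteBuffer:

```java
import java.nio.ByteBuffer;

public class DecodeDoubles {
    // Reassemble big-endian bytes into doubles, as in the slide's loop.
    static double[] manualDecode(byte[] b) {
        double[] data = new double[b.length / 8];
        for (int i = 0, off = 0; i < data.length; i++, off += 8) {
            int upper = ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
                      | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
            int lower = ((b[off + 4] & 0xff) << 24) | ((b[off + 5] & 0xff) << 16)
                      | ((b[off + 6] & 0xff) << 8) | (b[off + 7] & 0xff);
            data[i] = Double.longBitsToDouble(((long) upper << 32)
                                              | (lower & 0xffffffffL));
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] b = new byte[16];
        // ByteBuffer's default order is big-endian, matching the loop.
        ByteBuffer.wrap(b).putDouble(3.5).putDouble(-0.25);
        double[] d = manualDecode(b);
        if (d[0] != 3.5 || d[1] != -0.25)
            throw new AssertionError("decode mismatch");
        System.out.println("ok");
    }
}
```

The slide's point stands either way: without a way to reinterpret the buffer, safe Java must rebuild every double from its bytes, where C does it with a single cast.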
9
(2) Type Safety
  • Objects have meta-data
  • runtime safety checks (array-bounds, array-store,
    casts)

In C:
  struct Buffer { int len; char data[1]; };
  Buffer *b = malloc(sizeof(Buffer) + 1024);
  b->len = 1024;

In Java:
  class Buffer {
    int len; byte[] data;
    Buffer(int n) { data = new byte[n]; len = n; }
  }
  Buffer b = new Buffer(1024);
10
(2) Type Safety
  • Result Java objects need to be serialized and
    de-serialized across the network

[Diagram: the object is serialized from the GC heap into a pinned
native-heap buffer (serialize + copy + pin), then DMA'd onto the
network; GC stays ON.]
11
(2) Type Safety Effect
  • Performance hit of one order of magnitude
  • pick your favorite high-level communication
    abstraction (e.g. Remote Method Invocation)
  • pick your favorite JVM, your fastest network
    interface, and a pair of 450MHz P-IIs

12
Thesis
  • Use explicit memory management to improve Java
    communication performance
  • Jbufs: safe and explicit management of Java
    buffers
  • softens the GC/Native heap separation
  • preserves type and storage safety
  • zero-copy array transfers
  • Jstreams: extends Jbufs for optimizing
    serialization in clusters
  • zero-copy de-serialization of arbitrary objects

[Diagram: a jbuf in application memory is pinned for DMA directly;
the GC heap boundary becomes user-controlled.]
13
Outline
  • Thesis Overview
  • GC/Native heap separation, object serialization
  • Experimental Setup: Giganet cluster and Marmot
  • Part I: Array Transfers
  • (1) Javia-I: Java Interface to VI Architecture
  • respects heap separation
  • (2) Jbufs: Safe and Explicit Management of
    Buffers
  • Javia-II, matrix multiplication, Active Messages
  • Part II: Object Transfers
  • (3) A Case For Specialization
  • micro-benchmarks, RMI using Javia-I/II, impact on
    application suite
  • (4) Jstreams: in-place de-serialization
  • micro-benchmarks, RMI using Javia-III, impact on
    application suite
  • Conclusions

14
Giganet Cluster
  • Configuration
  • 8 P-II 450MHz, 128MB RAM
  • 8 1.25 Gbps Giganet GNN-1000 adapters
  • one Giganet switch
  • GNN-1000 adapter: user-level network interface
  • Virtual Interface Architecture implemented as a
    library (Win32 DLL)
  • Base-line point-to-point performance
  • 14µs r/t latency, 16µs with switch
  • over 100MBytes/s peak, 85MBytes/s with switch

15
Marmot
  • Java system from Microsoft Research
  • not a VM
  • static compiler: bytecode (.class) to x86 (.asm)
  • linker: .asm files + runtime libraries ->
    executable (.exe)
  • no dynamic loading of classes
  • most Dragon-book opts, some OO and Java-specific
    opts
  • Advantages
  • source code
  • good performance
  • two types of non-concurrent GC (copying,
    conservative)
  • native interface close enough to JNI

16
Outline
  • Thesis Overview
  • GC/Native heap separation, object serialization
  • Experimental Setup: Giganet cluster and Marmot
  • Part I: Array Transfers
  • (1) Javia-I: Java Interface to VI Architecture
  • respects heap separation
  • (2) Jbufs: Safe and Explicit Management of
    Buffers
  • Javia-II, matrix multiplication, Active Messages
  • Part II: Object Transfers
  • (3) A Case For Specialization
  • micro-benchmarks, RMI using Javia-I/II, impact on
    application suite
  • (4) Jstreams: in-place de-serialization
  • micro-benchmarks, RMI using Javia-III, impact on
    application suite
  • Conclusions

17
Javia-I
  • Basic architecture
  • respects heap separation
  • buffer mgmt in native code
  • Marmot as an off-the-shelf system
  • copying GC disabled in native code
  • primitive-array transfers only
  • Send/Recv API
  • non-blocking
  • blocking
  • bypass ring accesses
  • pin-on-demand
  • alloc-recv: allocates new array on-demand
  • cannot eliminate copying during recv

18
Javia-I Performance
Basic costs (PII-450, Windows 2000 b3): pin + unpin =
(10 + 10)µs, or ~5000 machine cycles each; Marmot:
native call 0.28µs, locks 0.25µs, array alloc 0.75µs
Latency (N = transfer size in bytes):
  16.5µs + (25ns)N   raw
  38.0µs + (38ns)N   pin(s)
  21.5µs + (42ns)N   copy(s)
  18.0µs + (55ns)N   copy(s)+alloc(r)
BW: 75% to 85% of raw for 16KByte transfers
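The fitted lines above all have the form T(N) = fixed overhead + per-byte cost × N. A small sketch (coefficients taken from the slide; class and method names are illustrative) showing how the per-byte term dominates large transfers:

```java
public class LatencyModel {
    // T(N) in microseconds: fixed overhead plus per-byte cost.
    static double latencyUs(double overheadUs, double perByteNs, int n) {
        return overheadUs + perByteNs * n / 1000.0;
    }

    public static void main(String[] args) {
        int n = 16 * 1024;                     // 16 KByte transfer
        double raw  = latencyUs(16.5, 25, n);  // raw send/recv
        double copy = latencyUs(21.5, 42, n);  // copy on send
        System.out.printf("raw %.1fus, copy %.1fus%n", raw, copy);
        // The extra ~17ns/byte of copying swamps the 5us of fixed cost.
        if (copy <= raw) throw new AssertionError();
    }
}
```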
19
jbufs
  • Goal
  • provide buffer management capabilities to Java
    without violating its safety properties
  • re-use is important: amortizes high pinning costs
  • a jbuf exposes communication buffers to Java
    programmers
  • 1. lifetime control: explicit allocation and
    de-allocation
  • 2. efficient access: direct access as
    primitive-typed arrays
  • 3. location control: safe de-allocation and
    re-use by controlling whether or not a jbuf is
    part of the GC heap
  • heap separation becomes soft and user-controlled

20
jbufs Lifetime Control
public class jbuf {
  public static jbuf alloc(int bytes);  /* allocates jbuf outside of GC heap */
  public void free() throws CannotFreeException;  /* frees jbuf if it can */
}

[Diagram: a handle in the GC heap refers to the jbuf allocated outside it.]
  • 1. jbuf allocation does not result in a Java
    reference to it
  • cannot access the jbuf from the wrapper object
  • 2. jbuf is not automatically freed if there are
    no Java references to it
  • free has to be explicitly called

21
jbufs Efficient Access
public class jbuf {  /* alloc and free omitted */
  public byte[] toByteArray() throws TypedException;  /* hands out byte[] ref */
  public int[] toIntArray() throws TypedException;    /* hands out int[] ref */
  . . .
}

[Diagram: a Java byte[] reference in the GC heap points into the jbuf.]
  • 3. (Storage Safety) jbuf remains allocated as
    long as there are array references to it
  • when can we ever free it?
  • 4. (Type Safety) jbuf cannot have two differently
    typed references to it at any given time
  • when can we ever re-use it (e.g. change its
    reference type)?

22
jbufs Location Control
public class jbuf {  /* alloc, free, toArrays omitted */
  public void unRef(CallBack cb);  /* app intends to free/re-use jbuf */
}
  • Idea: use the GC to track references
  • unRef: application claims it has no references
    into the jbuf
  • jbuf is added to the GC heap
  • GC verifies the claim and notifies the application
    through a callback
  • application can now free or re-use the jbuf
  • Required GC support: change the scope of the GC
    heap dynamically

23
jbufs Runtime Checks
[State diagram:
  alloc -> unref
  unref --to<p>Array--> ref<p>
  ref<p> --to<p>Array, GC--> ref<p>
  ref<p> --unRef--> to-be-unref<p>
  to-be-unref<p> --to<p>Array, unRef--> to-be-unref<p>
  to-be-unref<p> --GC--> unref
  unref --free--> freed]
  • Type safety: ref and to-be-unref states are
    parameterized by primitive type
  • GC transition depends on the type of garbage
    collector
  • non-copying: transition only if all refs to the
    array are dropped before GC
  • copying: transition occurs after every GC

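The checks above can be modeled as a small state machine. The following is an illustrative sketch (class and method names are hypothetical, not the thesis API) of the unref / ref&lt;p&gt; / to-be-unref&lt;p&gt; transitions, using the copying-collector rule that every GC clears a to-be-unref jbuf:

```java
// Illustrative model of the jbuf runtime checks; names are hypothetical.
public class JbufStateMachine {
    enum State { UNREF, REF, TO_BE_UNREF }
    private State state = State.UNREF;
    private Class<?> type;                 // primitive type parameter <p>

    void toArray(Class<?> t) {             // to<p>Array
        if (state != State.UNREF && type != t)
            throw new IllegalStateException("already typed as " + type);
        if (state == State.UNREF) { state = State.REF; type = t; }
        // in REF and TO_BE_UNREF the state is unchanged (self-loops)
    }

    void unRef() {                         // application drops its claim
        if (state == State.REF) state = State.TO_BE_UNREF;
    }

    void gc() {                            // copying collector ran
        if (state == State.TO_BE_UNREF) { state = State.UNREF; type = null; }
    }

    void free() {
        if (state != State.UNREF)
            throw new IllegalStateException("cannot free: still referenced");
    }

    public static void main(String[] args) {
        JbufStateMachine j = new JbufStateMachine();
        j.toArray(int.class);              // unref -> ref<int>
        try { j.toArray(byte.class); throw new AssertionError(); }
        catch (IllegalStateException expected) { }   // type safety enforced
        j.unRef();                         // ref<int> -> to-be-unref<int>
        j.gc();                            // to-be-unref<int> -> unref
        j.free();                          // now legal
        System.out.println("ok");
    }
}
```

In the real system the gc() transition is driven by the collector verifying the application's unRef claim, not by the application itself.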
24
Javia-II
  • Exploiting jbufs
  • explicit pinning/unpinning of jbufs
  • only non-blocking send/recvs

25
Javia-II Performance
Basic jbuf costs: allocation 1.2µs, toArray 0.8µs,
unRef 2.3µs, GC degradation 1.2µs/jbuf
Latency (N = transfer size):
  16.5µs + (0.025µs)N   raw
  20.5µs + (0.025µs)N   jbufs
  38.0µs + (0.038µs)N   pin(s)
  21.5µs + (0.042µs)N   copy(s)
BW within 1% of raw
26
MM Communication
  • pMM over Javia-II/jbufs spends at least 25% less
    time in communication for 256x256 matrices on 8
    processors

27
MM Overall
  • Cache effects: better communication performance
    does not always translate to better overall
    performance

28
Active Messages
class First extends AMHandler {
  private int first;
  void handler(AMJbuf buf, ...) {
    int[] tmp = buf.toIntArray();
    first = tmp[0];
  }
}

class Enqueue extends AMHandler {
  private Queue q;
  void handler(AMJbuf buf, ...) {
    int[] tmp = buf.toIntArray();
    q.enq(tmp);
  }
}
  • Exercising Jbufs
  • user supplies a list of jbufs
  • upon message arrival
  • jbuf passed to handler
  • unRef is invoked after handler invocation
  • if pool is empty, reclaim existing ones
  • copying deferred to GC-time only if needed

29
AM Performance
  • Latency: about 15µs higher than Javia
  • synch access to buffer pool, endpoint header,
    flow-control checks, handler id lookup
  • BW within 10% of peak for 16KByte messages

30
Jbufs Experience
  • Efficient access through arrays is useful
  • no indirect access via method invocation
  • promotes code re-use of large numerical kernels
  • leverages compiler infrastructure for eliminating
    safety checks
  • Limitations
  • still not as flexible as C buffers
  • stale references may confuse programmers
  • Discussed in thesis
  • the necessity of explicit de-allocation
  • implementation of Jbufs in Marmot's copying
    collector
  • impact on conservative and generational collectors
  • extensions to JNI to allow portable
    implementations of Jbufs

31
Outline
  • Thesis Overview
  • GC/Native heap separation, object serialization
  • Experimental Setup: VI Architecture and Marmot
  • Part I: Array Transfers
  • (1) Javia-I: Java Interface to VI Architecture
  • respects heap separation
  • (2) Jbufs: Safe and Explicit Management of
    Buffers
  • Javia-II, matrix multiplication, Active Messages
  • Part II: Object Transfers
  • (3) A Case For Specialization on Homogeneous
    Clusters
  • micro-benchmarks, RMI using Javia-I/II, impact on
    application suite
  • (4) Jstreams: in-place de-serialization
  • micro-benchmarks, RMI using Javia-III, impact on
    application suite
  • Conclusions

32
Object Serialization and RMI
  • Standard JOS Protocol
  • heavy-weight class descriptors are serialized
    along with objects
  • type-checking: classes need not be equal, just
    compatible
  • protocol allows for user extensions
  • Remote Method Invocation
  • object-oriented version of Remote Procedure Call
  • relies on JOS for argument passing
  • actual parameter object can be a sub-class of the
    formal parameter class

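The class-descriptor overhead is easy to observe with the standard java.io serialization machinery. A minimal sketch (class and helper names are illustrative) measuring how much larger the wire image is than the payload:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class JosOverhead {
    // Size in bytes of the standard JOS wire image of an object.
    static int serializedSize(Serializable o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        int size = serializedSize(new byte[16]);
        // The stream header plus the byte[] class descriptor make the
        // wire image considerably larger than the 16 payload bytes.
        System.out.println("payload=16, serialized=" + size);
        if (size <= 16) throw new AssertionError();
    }
}
```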
[Diagram: readObject copies the incoming message from the network into
newly allocated objects in the GC heap.]
33
JOS Costs
  • 1. overheads in the tens or hundreds of µs
  • send/recv overheads 3µs, memcpy of 500 bytes
    0.8µs
  • 2. double[] 50% more expensive than byte[] of
    similar size
  • 3. overheads grow as object sizes grow

34
Impact of Marmot
  • Impact of Marmot's optimizations
  • method inlining: up to 66% improvement (already
    deployed)
  • no synchronization whatsoever: up to 21%
    improvement
  • no safety checks whatsoever: up to 15% combined
  • Better compilation technology unlikely to reduce
    overheads substantially

35
Impact on RMI
  • Order of magnitude worse than Javia-I/II
  • round-trip latency drops to about 30µs in a null
    RMI (no JOS!)
  • peak bandwidth of 22MBytes/s, about 25% of raw

36
Impact on Applications
  • A case for specializing serialization for cluster
    applications
  • overheads an order of magnitude higher than
    send/recv and memcpy
  • RMI performance degraded by one order of
    magnitude
  • 5-15% estimated impact on applications
  • old adage: specialize for the common case

37
Optimizing De-serialization
  • In-place object de-serialization
  • specialization for homogeneous clusters and JVMs
  • Goal
  • eliminate copying and allocation of objects
  • Challenges
  • preserve the integrity of the receiving JVM
  • permit de-serialization of arbitrary Java objects
    with unrestricted usage and without special
    annotations
  • independent of a particular GC scheme

[Diagram: writeObject places the object directly into the stream;
the receiver reads it in place, GC heap to GC heap across the network.]
38
Jstreams write
public class Jstream extends Jbuf {
  public void writeObject(Object o)  /* serializes o onto the stream */
    throws TypedException, ReferencedException;
  public void writeClear()  /* clears the stream for writing */
    throws TypedException, ReferencedException;
}
  • writeObject
  • deep-copy of objects, maintaining in-memory layout
  • deals with cyclic data structures
  • swizzles pointers: offsets to a base address
  • replaces object meta-data with a 64-bit class
    descriptor
  • optimization: primitive-typed arrays in jbufs are
    not copied

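Pointer swizzling can be illustrated with a toy flat stream: on write, references are replaced by offsets into the buffer; on read, the offsets are walked in place with no per-node allocation or copying. This sketch is illustrative only (the names and layout are not the thesis wire format):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Swizzle {
    // Layout per node in the flat stream: [value, nextOffset], -1 = null.
    static int[] write(List<Integer> xs) {
        int[] buf = new int[xs.size() * 2];
        for (int i = 0; i < xs.size(); i++) {
            buf[2 * i] = xs.get(i);
            // swizzle: the "next" reference becomes an offset into buf
            buf[2 * i + 1] = (i + 1 < xs.size()) ? 2 * (i + 1) : -1;
        }
        return buf;
    }

    // "readObject": follow offsets in place instead of real references.
    static List<Integer> read(int[] buf) {
        List<Integer> out = new ArrayList<>();
        for (int off = 0; off != -1; off = buf[off + 1]) out.add(buf[off]);
        return out;
    }

    public static void main(String[] args) {
        int[] stream = write(Arrays.asList(1, 2, 3));
        if (!read(stream).equals(Arrays.asList(1, 2, 3)))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```

A real in-place scheme additionally rewrites class descriptors back to meta-data and bounds-checks every unswizzled offset, as the next slide describes.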
39
Jstreams read
public class Jstream extends Jbuf {
  public Object readObject() throws TypedException;  /* de-serialization */
  public boolean isJstream(Object o);  /* checks if o resides in the stream */
}
  • readObject
  • replaces class descriptors with meta-data
  • unswizzles pointers, array-bounds checking
  • after the first readObject, the jstream is added
    to the GC heap
  • tracks references coming out of read objects
  • unRef: user is willing to free or re-use

40
jstreams Runtime Checks
  • Modification to Javia-II: prevent DMA from
    clobbering de-serialized objects
  • receive posts not allowed if the jstream is in
    read mode
  • no changes to the Javia-II architecture

41
jstream Performance
  • De-serialization costs constant w.r.t. object
    size
  • 2.6µs for arrays, 3.3µs per list element

42
jstream Impact on RMI
  • 4-byte round-trip latency of 45µs (25µs higher
    than Javia-II)
  • 52MBytes/s for 16KByte arguments

43
jstream Impact on Applications
  • 3-10% improvement in SOR, EM3D, FFT
  • 10% hit in pMM performance
  • over 22,000 incoming RMIs, 1000 jstreams in the
    receive pool, 26 garbage collections, 15% of
    total execution time in GC
  • generational collection would alleviate GC costs
    substantially
  • receive pool size is hard to tune: tradeoffs
    between GC and locality

44
Jstreams Experience
  • Implementation of readObject and writeObject
    integrated into the JVM
  • protocol is JVM-specific
  • native implementation is faster
  • Limitations
  • not as flexible as Java streams: cannot read and
    write at the same time
  • no extensible wire protocols
  • Discussed in thesis
  • implementation of Jstreams in Marmot's copying
    collector
  • support for polymorphic RMI: minor changes to the
    stub compiler
  • JNI extensions to allow portable
    implementations of Jstreams

45
Related Work
  • Microsoft J-Direct
  • pinned arrays defined using source-level
    annotations
  • JIT produces code to redirect array access:
    expensive
  • Berkeley's Jaguar: efficient code generation with
    JIT extensions
  • security concern: JIT hacks may break Java or
    byte-code safety
  • Custom JVMs
  • many tricks are possible (e.g. pinned array
    factories, pinned and non-pinned heaps, etc.) but
    depend on a particular GC scheme
  • Jbufs isolates the minimal support needed from
    the GC
  • Memory management
  • Safe Regions (Gay and Aiken): reference counting,
    no GC
  • Fast serialization and RMI
  • KaRMI (Karlsruhe): fixed JOS, ground-up RMI
    implementation
  • Manta (Vrije U): fast RMI, but a Java dialect

46
Summary
  • Use of explicit memory management to improve Java
    communication performance in clusters
  • softens the GC/Native heap separation
  • preserves type and storage safety
  • independent of GC scheme
  • jbufs: zero-copy array transfers
  • jstreams: zero-copy de-serialization of arbitrary
    objects
  • Framework for building communication software and
    applications in Java
  • Javia-I/II
  • parallel matrix multiplication
  • Jam active messages
  • Java RMI
  • cluster applications: TSP, IDA, SOR, EM3D, FFT,
    and MM
