On the Conformance of a Cluster Implementation of the Java Memory Model - PowerPoint PPT Presentation

Loading...

PPT – On the Conformance of a Cluster Implementation of the Java Memory Model PowerPoint presentation | free to download - id: 6cc92-ZDc1Z



Loading


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation
Title:

On the Conformance of a Cluster Implementation of the Java Memory Model

Description:

collection of machines connected by network. commercial distributed-memory parallel ... Jackal (Vrije Univ. Amsterdam) high-performance computing. Hyperion ... – PowerPoint PPT presentation

Number of Views:25
Avg rating:3.0/5.0
Slides: 59
Provided by: christin86
Learn more at: http://www.cs.unh.edu
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: On the Conformance of a Cluster Implementation of the Java Memory Model


1
On the Conformance of a Cluster Implementation of
the Java Memory Model
  • Philip J. Hatcher
  • University of New Hampshire
  • Philip.Hatcher_at_unh.edu

2
Collaborators
  • ENS-Lyon
  • Gabriel Antoniu, Luc Bougé, Raymond Namyst

3
Cluster Environment
  • collection of machines connected by network
  • commercial distributed-memory parallel computer
  • build-your-own parallel computer

4
Cluster Implementation of Java
  • Single JVM running on cluster.
  • Nodes of cluster are transparent.
  • Multithreaded applications exploit multiple
    processors of cluster.

5
Examples
  • Java/DSM (Rice Univ. - Houston)
  • transparent heterogeneous computing
  • cJVM (IBM Research - Haifa)
  • scalable Java server
  • Jackal (Vrije Univ. Amsterdam)
  • high-performance computing

6
Hyperion
  • Cluster implementation of Java developed at the
    Univ. of New Hampshire.
  • Currently built on top of the PM2 distributed,
    multithreaded runtime environment from ENS-Lyon.

7
Motivation
  • Use Java as is for high-performance computing
  • support computationally intensive applications
  • utilize parallel computing hardware

8
Why Java?
  • Explicitly parallel!
  • includes a threaded programming model
  • Relaxed memory model
  • consistency model aids an implementation on
    distributed-memory parallel computers

9
Java Threads
  • Threads are objects.
  • The class java/lang/Thread contains all of the
    methods for initializing, running, suspending,
    querying and destroying threads.

10
java/lang/Thread methods
  • Thread() - constructor for thread object.
  • start() - start the thread executing.
  • run() - method invoked by start.
  • stop(), suspend(), resume(), join(), yield().
  • setPriority().

11
Java Synchronization
  • Java uses monitors, which protect a region of
    code by allowing only one thread at a time to
    execute it.
  • Monitors utilize locks.
  • There is a lock associated with each object.

12
synchronized keyword
  • synchronized ( Exp ) Block
  • public class Q synchronized void put()

13
java/lang/Object methods
  • wait() - the calling thread, which must hold the
    lock for the object, is placed in a wait set
    associated with the object. The lock is then
    released.
  • notify() - an arbitrary thread in the wait set of
    this object is awakened and then competes again
    to get lock for object.
  • notifyall() - all waiting threads awakened.

14
Shared-Memory Model
  • Java threads execute in a virtual shared memory.
  • All threads are able to access all objects.
  • But threads may not access each others stacks.

15
Java Memory Consistency
  • A variant of release consistency.
  • Threads can keep locally cached copies of
    objects.
  • Consistency is provided by requiring that
  • a thread's object cache be flushed upon entry to
    a monitor.
  • local modifications made to cached objects be
    transmitted to the central memory when a thread
    exits a monitor.

16
General Hyperion Overview
Runtime libraries
17
The Hyperion Run-Time System
  • Collection of modules to allow plug-and-play
    implementations
  • inter-node communication
  • threads
  • memory and synchronization
  • etc

18
Thread and Object Allocation
  • Currently, threads are allocated to processors in
    round-robin fashion.
  • Currently, an object is allocated to the
    processor that holds the thread that is creating
    the object.
  • Currently, DSM-PM2 is used to implement the Java
    memory model.

19
Hyperion Internal Structure
20
PM2 A Distributed, Multithreaded Runtime
Environment
  • Thread library Marcel
  • User-level
  • Supports SMP
  • POSIX-like
  • Preemptive thread migration
  • Communication library Madeleine
  • Portable BIP, SISCI/SCI, MPI, TCP, PVM
  • Efficient

21
DSM-PM2 Architecture
  • DSM comm
  • send page request
  • send page
  • send invalidate request
  • DSM page manager
  • set/get page owner
  • set/get page access
  • add/remove to/from copyset
  • ...

DSM-PM2
PM2
22
Hyperions DSM API
  • loadIntoCache
  • invalidateCache
  • updateMainMemory
  • get
  • put

23
DSM Implementation
  • Node-level caches.
  • Page-based and home-based protocol.
  • Use page faults to detect remote objects.
  • Log modifications made to remote objects.
  • Each node allocates objects from a different
    range of the virtual address space.

24
Using Page Faults details
  • An object reference is the address of the base of
    the object.
  • loadIntoCache does nothing.
  • DSM-PM2 is used to handle page faults generated
    by the get/put primitives.

25
More details
  • When an object is allocated, its address is
    appended to a list attached to the page that
    contains its header.
  • When a page is loaded on a remote node, the list
    is used to turn on the header bit for all object
    headers on the page.
  • The put primitive checks the header bit to see if
    a modification should be logged.
  • updateMainMemory sends the logged changes to the
    home node.

26
Benchmarking
  • Two Linux 2.2 clusters
  • twelve 200 MHz Pentium Pro processors connected
    by Myrinet switch and using BIP.
  • six 450 MHz Pentium II processors connected by a
    SCI network and using SISCI.
  • gcc 2.7.2.3 with -O6

27
Pi (50M intervals)
28
Jacobi (1024x1024)
29
Traveling Salesperson (17 cities)
30
All-pairs Shortest Path (2K nodes)
31
Barnes-Hut (16K bodies)
32
Correctness
  • Does the Hyperion approach fully support the Java
    Memory Model?
  • Hyperion follows the operational specification of
    the JMM with two exceptions
  • page granularity
  • node-level caches

33
Java Memory Model the actors
Threads lock unlock load store use assign
Main Memory lock unlock read write
34
Entry to Monitor
lock load use
lock read
lock load use
read lock
35
Exit from Monitor
assign store unlock
write unlock
assign store unlock
unlock write
36
Serializability
  • Main memory actions are serializable for a single
    variable or lock.

37
Page Granularity
  • Hyperion fetches remote memory locations with
    page granularity.
  • always OK to perform load/read actions early.

38
Node-level Caches
  • All threads on a node share a cache.
  • Cache contains values being accessed whose homes
    are on other nodes.
  • Values whose homes are on this node are directly
    accessed.
  • as if every use is immediately preceded by a
    load/read and every assign is immediately
    followed by a store/write.

39
Node-level Caches
  • If one thread invalidates the cache, the cache is
    invalidated for all threads.
  • always OK to do extra load/read actions.
  • If one thread updates main memory, all threads
    do.
  • always OK to do early store/write actions.

40
Node-level Caches
  • If one thread performs a load/read, then all
    threads see the result.
  • as if all threads perform the load/read.
  • If one thread performs an assign, then other
    threads see the result before the subsequent
    store/write.
  • hmmm...

41
Implementation Traces
  • load/read across cluster implemented by
    request-send message pair.
  • store/write across cluster implemented by
    transmit-receive message pair.

42
An Example
Home(x) send x (value 7) to Node 0 send x (7) to
Node 1 receive x (17) from Node 0 receive x
(19) from Node 1
Node 0 request x T0 use x (7) T1 assign x
17 T2 use x (17) transmit x (17)
Node 1 request x T3 use x (7) T4 assign x
19 T5 use x (19) transmit x (19)
43
Serialization
Main Memory read x (value 7) for T0 read x (7)
for T3 write x (17) for T1 read x (17) for
T2 write x (19) for T4 read x (19) for T5
44
Algorithm
  • The model-satisfying serialization can always be
    constructed from the node-level traces of the
    Hyperion actions.
  • Merge node traces, controlled by the pairing of
    request-send and transmit-receive actions.
  • Correct, if memory actions at the home node are
    serializable.

45
Therefore
  • Node-level caches conform to the JMM.
  • Simplify implementation.
  • Reduce memory consumption.
  • Facilitate pre-fetching.
  • However, invalidate one, invalidate all.

46
Implementation Difficulties
  • Node-level concurrency makes the implementation
    tricky. (duh!)
  • concurrent page fetches
  • cache invalidate while page fetch pending?
  • thread reading page as it is being installed

47
Java API Additions?
  • These would be desirable
  • barrier synchronizations
  • data reductions
  • query cluster configuration
  • i.e. number of nodes
  • Is this cheating?
  • no longer as is Java?

48
API implementation
  • Careful implementation of API extensions can
    lessen potential cost of invalidate one,
    invalidate all.
  • Implementation of barrier should only invalidate
    local cache once all threads have reached barrier.

49
Level of Transparency
  • Consider the current Hyperion thread/object
    allocation strategies
  • not mandated by Java Language Spec
  • might be superceded by smarter run-time lib
  • but, still good guidelines for programmer?
  • i.e. if I didnt create the object, it might be
    expensive to access.
  • not unreasonable to expect user to be aware that
    there might be an extended memory hierarchy.

50
Final Thoughts
  • Java is extremely attractive vehicle for
    programming parallel computers.
  • Can acceptable performance be obtained?
  • Scalar execution
  • DSM implementation
  • Application locality
  • Scalability

51
More Final Thoughts
  • Extended API for threads required.
  • Transparent implementations are necessary.
  • JMM specification under review but it will remain
    based upon release consistency, which is
    important.

52
Support for Reflection
  • Object headers point to virtual method table,
    which also contains a pointer to a class database
    structure.
  • The classDB contains all information pertinent to
    class introspection.
  • The classDBs are initialized at program start-up
    using the tree of static class dependences rooted
    at the class that contains the main method to be
    called.

53
Static Initialization
  • Static initializers are executed in a postorder
    traversal of the static class dependence tree.
  • violates the JLS as static initializers should
    not execute until the class is first accessed.

54
Class Loading
  • Not currently supported.
  • But conceptually possible.
  • dynamically compile and then dynamically link in
    the new class.
  • many details to be worked out, however.

55
java2c Details
  • Maps JVM stack locations to C locals.
  • relies on gcc to map these to registers
  • Interface method invocation is implemented via a
    table of pairs
  • (interface VMT, implementation VMT)
  • Exceptions are implemented via setjmp and
    longjmp.
  • Null references are detected by segfault handler.

56
Future Work
  • Continue to investigate DSM protocols
  • eliminate check in put primitive
  • compare object-based and page-based protocols
  • Implement message aggregation
  • send mods for multiple pages in one message
  • prefetch pages in groups

57
Future Work
  • Utilize multiprocessor nodes
  • true node-level concurrency puts more demands on
    the implementation
  • java2IC
  • java2c has been modified to generate a
    (three-address) intermediate code
  • want to connect to SUIF, or some other compiler
    toolset

58
Future Work
  • Further benchmarking
  • converting SPLASH-2 applications to Java
  • Java API
  • minimal support now (and still JDK 1.1)
  • is there a clean API implementation?
  • evaluate extensions barriers, etc.
  • distributed garbage collection
About PowerShow.com