Cluster Computing with Java Threads

Provided by: christin86
Learn more at: https://www.cs.unh.edu

1
Cluster Computing with Java Threads
  • Philip J. Hatcher
  • University of New Hampshire
  • Philip.Hatcher_at_unh.edu

2
Collaborators
  • UNH/Hyperion
  • Mark MacBeth and Keith McGuigan
  • ENS-Lyon/DSM-PM2
  • Gabriel Antoniu, Luc Bougé and Raymond Namyst

3
Focus
  • Use Java as is for high-performance computing
  • support computationally intensive applications
  • utilize parallel computing hardware

4
Outline
  • Our Vision
  • Java Threads
  • The PM2 Run-time Environment
  • Hyperion Java Threads on Clusters
  • Evaluation
  • Related Work
  • Conclusions

5
Why Java?
  • Soon to be ubiquitous!
  • use of Java is growing very rapidly
  • Designed for portability
  • develop programs on your desktop
  • run programs on a distant cluster

6
Why Java?
  • Explicitly parallel!
  • includes a threaded programming model
  • Relaxed memory model
  • consistency model aids an implementation on
    distributed-memory parallel computers

7
Unique Opportunity
  • Use Java to bring parallelism to the masses
  • Let's not miss it!
  • But, programmers will not accept syntax or model
    changes

8
Open Question
  • Parallelism via Java access to distributed-computing
    techniques?
  • e.g. RMI (remote method invocation)
  • Or, parallelism via Java threads?

9
That is, ...
  • Does a user prefer to view a cluster as a
    collection of distinct machines?
  • Or, does a user prefer to view a cluster as a
    black box that will simply run Java code faster?

10
Are you in a box?

11
Or, are you thinking outside of the box?

12
Climb out of the box!
  • Use Java threads as is to program clusters of
    computers.
  • Program for the threaded Java virtual machine.
  • Allow the implementation to handle the details of
    executing in a cluster.

13
Java Threads
  • Threads are objects.
  • The class java/lang/Thread contains all of the
    methods for initializing, running, suspending,
    querying and destroying threads.

14
java/lang/Thread methods
  • Thread() - constructor for thread object.
  • start() - start the thread executing.
  • run() - method invoked by start.
  • stop(), suspend(), resume(), join(), yield().
  • setPriority().
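
The constructor, start(), run() and join() methods listed above can be illustrated with a minimal sketch. The class and method names here (ThreadBasics, Worker, sumOfSquares) are hypothetical and do not come from the presentation.

```java
// Minimal sketch of the Thread lifecycle: construct, start, run, join.
final class ThreadBasics {
    // Hypothetical worker: each thread computes the square of its id.
    static final class Worker extends Thread {
        private final int id;
        int result;

        Worker(int id) {
            this.id = id;
        }

        @Override
        public void run() {          // invoked on the new thread after start()
            result = id * id;
        }
    }

    static int sumOfSquares(int numThreads) {
        Worker[] workers = new Worker[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Worker(i);
            workers[i].start();      // begins executing run() concurrently
        }
        int sum = 0;
        for (Worker w : workers) {
            try {
                w.join();            // wait for the worker to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            sum += w.result;         // safe to read after join()
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(4));
    }
}
```

Reading `result` only after join() avoids the need for extra synchronization in this small example.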

15
Java Synchronization
  • Java uses monitors, which protect a region of
    code by allowing only one thread at a time to
    execute it.
  • Monitors utilize locks.
  • There is a lock associated with each object.

16
synchronized keyword
  • synchronized ( Exp ) Block
  • public class Q { synchronized void put() { ... } }

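
Both forms of the synchronized keyword can be shown side by side in a short sketch. SyncCounter and its members are hypothetical names chosen for illustration only.

```java
// Sketch: a synchronized method (locks `this`) and a synchronized block
// (locks an explicit object) protecting two separate counters.
final class SyncCounter {
    private final Object lock = new Object();
    private int methodCount = 0;
    private int blockCount = 0;

    synchronized void incrementWithMethod() {   // method form
        methodCount++;
    }

    void incrementWithBlock() {
        synchronized (lock) {                   // block form: synchronized ( Exp ) Block
            blockCount++;
        }
    }

    synchronized int methodCount() {
        return methodCount;
    }

    int blockCount() {
        synchronized (lock) {
            return blockCount;
        }
    }

    // Runs `threads` threads, each incrementing both counters `perThread` times.
    static int[] run(int threads, int perThread) {
        SyncCounter counter = new SyncCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    counter.incrementWithMethod();
                    counter.incrementWithBlock();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new int[] { counter.methodCount(), counter.blockCount() };
    }
}
```

Without the locks, the increments from different threads could interleave and some updates could be lost.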
17
java/lang/Object methods
  • wait() - the calling thread, which must hold the
    lock for the object, is placed in a wait set
    associated with the object. The lock is then
    released.
  • notify() - an arbitrary thread in the wait set of
    this object is awakened and then competes again
    to get lock for object.
  • notifyAll() - all waiting threads are awakened.
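
As a rough illustration of wait() and notifyAll(), the class Q mentioned on the previous slide could be fleshed out as a one-slot queue. The presentation does not show the body of Q, so the fields, the get() method and the transferAndSum helper below are assumptions.

```java
// Sketch: a one-slot queue using monitor methods of java.lang.Object.
final class Q {
    private int slot;
    private boolean full = false;

    synchronized void put(int v) throws InterruptedException {
        while (full) {
            wait();          // releases the lock on this Q while waiting
        }
        slot = v;
        full = true;
        notifyAll();         // wake any thread waiting in get()
    }

    synchronized int get() throws InterruptedException {
        while (!full) {
            wait();
        }
        full = false;
        notifyAll();         // wake any thread waiting in put()
        return slot;
    }

    // Hypothetical demo: a producer thread puts 1..n, the caller sums them.
    static int transferAndSum(int n) {
        Q q = new Q();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) {
                    q.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < n; i++) {
                sum += q.get();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }
}
```

The wait() calls sit inside while loops because an awakened thread must compete for the lock again and the condition may have changed by the time it runs.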

18
Shared-Memory Model
  • Java threads execute in a virtual shared memory.
  • All threads are able to access all objects.
  • But threads may not access each other's stacks.

19
Java Memory Consistency
  • A variant of release consistency.
  • Threads can keep locally cached copies of
    objects.
  • Consistency is provided by requiring that:
  • a thread's object cache be flushed upon entry to
    a monitor.
  • local modifications made to cached objects be
    transmitted to the central memory when a thread
    exits a monitor.
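
The two rules above can be modelled with a toy, single-threaded simulation. This is only a sketch of the idea, not how a JVM or Hyperion implements it; CachedView, its maps and the demo scenario are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: each "thread" has a CachedView over a shared central memory.
final class CachedView {
    private final Map<String, Integer> mainMemory;               // central memory
    private final Map<String, Integer> cache = new HashMap<>();  // local copies
    private final Map<String, Integer> dirty = new HashMap<>();  // unsent local writes

    CachedView(Map<String, Integer> mainMemory) {
        this.mainMemory = mainMemory;
    }

    int read(String field) {
        Integer v = cache.get(field);
        if (v == null) {                       // miss: fetch from central memory
            v = mainMemory.getOrDefault(field, 0);
            cache.put(field, v);
        }
        return v;
    }

    void write(String field, int value) {
        cache.put(field, value);
        dirty.put(field, value);
    }

    void monitorEnter() {                      // rule 1: flush the cache on entry
        cache.clear();
        cache.putAll(dirty);                   // assumption: keep our own pending writes
    }

    void monitorExit() {                       // rule 2: transmit local modifications
        mainMemory.putAll(dirty);
        dirty.clear();
    }

    // Returns what a reader sees before the writer exits, after it exits
    // (still stale), and after the reader itself enters a monitor.
    static int[] demo() {
        Map<String, Integer> memory = new HashMap<>();
        CachedView writer = new CachedView(memory);
        CachedView reader = new CachedView(memory);
        writer.monitorEnter();
        writer.write("x", 1);
        int beforeExit = reader.read("x");
        writer.monitorExit();
        int stale = reader.read("x");
        reader.monitorEnter();
        int fresh = reader.read("x");
        reader.monitorExit();
        return new int[] { beforeExit, stale, fresh };
    }
}
```

In this model a reader that never enters a monitor may keep seeing its cached value, which is why shared data should be accessed under synchronization.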

20
PM2: A Distributed, Multithreaded Runtime Environment
  • Thread library: Marcel
  • User-level
  • Supports SMP
  • POSIX-like
  • Preemptive thread migration
  • Communication library: Madeleine
  • Portable: BIP, SISCI/SCI, MPI, TCP, PVM
  • Efficient
  • Efficient

21
DSM-PM2 Architecture
  • DSM comm
  • send page request
  • send page
  • send invalidate request
  • DSM page manager
  • set/get page owner
  • set/get page access
  • add/remove to/from copyset
  • ...

[Figure not reproduced; surviving labels: DSM-PM2, PM2]

22
DSM-PM2 Performance
  • SCI cluster has 450 MHz Pentium II nodes
  • Myrinet cluster has 200 MHz Pentium Pro nodes

23
Hyperion
  • Executes threaded Java programs on clusters.
  • Built on top of PM2 and DSM-PM2.
  • Provides both portability and efficiency

24
Reversing the Bytecode Stream
  • Conventionally, users pull bytecode to their
    machines for local execution.
  • Our vision
  • users develop their high-performance Java
    programs using the Java toolset on their desktop.
  • they then push the resulting bytecode to a
    Hyperion server for high-performance cycles.

25
Supporting High Performance
  • Utilizes a bytecode-to-C translator.
  • Parallel execution via spreading of Java threads
    across nodes of the cluster.
  • Java threads implemented as lightweight threads
    using PM2 library.

26
Compiling Java
  • Hyperion designed for computationally intensive
    applications, so small overhead of translating
    bytecode is not important.
  • Translating to C allows us to leverage the native
    C compiler and optimizer.

27
General Hyperion Overview
[Figure not reproduced; surviving label: Runtime libraries]

28
The Hyperion Run-Time System
  • Collection of modules to allow plug-and-play
    implementations
  • inter-node communication
  • threads
  • memory and synchronization
  • etc

29
Hyperion Internal Structure

30
Thread and Object Allocation
  • Currently, threads are allocated to processors in
    round-robin fashion.
  • Currently, an object is allocated to the
    processor that holds the thread that is creating
    the object.
  • Currently, DSM-PM2 is used to implement the Java
    memory model.
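
Round-robin thread placement can be sketched in a few lines. The slide does not say how threads are indexed, so numbering threads in creation order from 0 is an assumption, as are the names RoundRobinPlacement, nodeForThread and threadsPerNode.

```java
// Sketch: assign the i-th created thread to node (i mod numNodes).
final class RoundRobinPlacement {
    static int nodeForThread(int threadIndex, int numNodes) {
        if (threadIndex < 0 || numNodes <= 0) {
            throw new IllegalArgumentException("need threadIndex >= 0 and numNodes > 0");
        }
        return threadIndex % numNodes;
    }

    // Counts how many of numThreads threads land on each node.
    static int[] threadsPerNode(int numThreads, int numNodes) {
        int[] counts = new int[numNodes];
        for (int t = 0; t < numThreads; t++) {
            counts[nodeForThread(t, numNodes)]++;
        }
        return counts;
    }
}
```

With 64 threads and 8 nodes, as in the evaluation described later, this scheme would place 8 threads on each node.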

31
Hyperion's DSM API
  • loadIntoCache
  • invalidateCache
  • updateMainMemory
  • get
  • put

32
DSM Implementation
  • Node-level caches.
  • Page-based and home-based protocol.
  • Log modifications made to remote objects.
  • Use explicit in-line checks in get/put.
  • Each node allocates objects from a different
    range of the virtual address space.

33
Details
  • Objects are aligned on 64-byte boundaries.
  • An object reference is the address of the base of
    the object.
  • The bottom 6 bits of the ref can be used to store
    the node number of the object's home.
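
Because a 64-byte-aligned address always has its low 6 bits equal to zero, those bits are free to carry a node number from 0 to 63. The following sketch shows the bit arithmetic using a Java long as a stand-in for a native address; TaggedRef and its methods are hypothetical, and Hyperion itself does this in generated C.

```java
// Sketch: pack a home-node number into the low 6 bits of an aligned address.
final class TaggedRef {
    static final int TAG_BITS = 6;                       // 2^6 = 64-byte alignment
    static final long TAG_MASK = (1L << TAG_BITS) - 1;   // 0x3F

    static long encode(long alignedAddress, int homeNode) {
        if ((alignedAddress & TAG_MASK) != 0) {
            throw new IllegalArgumentException("address is not 64-byte aligned");
        }
        if (homeNode < 0 || homeNode > TAG_MASK) {
            throw new IllegalArgumentException("node number must fit in 6 bits");
        }
        return alignedAddress | homeNode;
    }

    static int homeNode(long ref) {
        return (int) (ref & TAG_MASK);
    }

    static long address(long ref) {
        return ref & ~TAG_MASK;
    }

    // An object is remote when its home node differs from the local node.
    static boolean isRemote(long ref, int localNode) {
        return homeNode(ref) != localNode;
    }
}
```

For example, encode(0x1000, 5) yields 0x1005, from which homeNode recovers 5 and address recovers 0x1000.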

34
More details
  • loadIntoCache checks the 6 bits to see if an
    object is remote.
  • If so, and if not already locally cached, DSM-PM2
    is used to load the page(s) containing the
    object.
  • When a remote object is cached, a bit is turned
    on in its header.

35
Yet more details
  • The put primitive checks the header bit to see if
    a modification should be logged.
  • updateMainMemory sends the logged changes to the
    home node.
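
The interplay of loadIntoCache, put and updateMainMemory described on these slides can be approximated with a toy model. It replaces pages, addresses and DSM-PM2 messaging with plain Java objects, so every class, field and signature below is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the inline-check protocol; not Hyperion's actual code.
final class InlineCheckDsm {
    static final class Obj {
        final int homeNode;
        int value;
        boolean cachedBit;   // header bit: set on cached copies of remote objects
        Obj homeCopy;        // stand-in for locating the home copy

        Obj(int homeNode, int value) {
            this.homeNode = homeNode;
            this.value = value;
            this.homeCopy = this;
        }
    }

    static final class LogEntry {
        final Obj home;
        final int value;

        LogEntry(Obj home, int value) {
            this.home = home;
            this.value = value;
        }
    }

    static final class Node {
        final int id;
        private final Map<Obj, Obj> cache = new HashMap<>();
        private final List<LogEntry> log = new ArrayList<>();

        Node(int id) {
            this.id = id;
        }

        // Local objects are used directly; remote ones are copied once.
        Obj loadIntoCache(Obj home) {
            if (home.homeNode == id) {
                return home;
            }
            return cache.computeIfAbsent(home, h -> {
                Obj copy = new Obj(h.homeNode, h.value);
                copy.cachedBit = true;
                copy.homeCopy = h;
                return copy;
            });
        }

        int get(Obj o) {
            return o.value;
        }

        // Inline check: only writes to cached remote copies are logged.
        void put(Obj o, int value) {
            o.value = value;
            if (o.cachedBit) {
                log.add(new LogEntry(o.homeCopy, value));
            }
        }

        // Send logged changes to the home copies.
        void updateMainMemory() {
            for (LogEntry e : log) {
                e.home.value = e.value;
            }
            log.clear();
        }

        int pendingLogEntries() {
            return log.size();
        }
    }
}
```

In this model a write made on a remote node stays invisible at the home copy until updateMainMemory runs, matching the flush-on-monitor-exit rule described earlier.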

36
Evaluation
  • Minimal-cost map-coloring application.
  • Branch-and-bound algorithm.
  • 64 threads, each with its own priority queue.
  • Current best solution is shared.
  • Problem size: 29 eastern-most states of USA with
    4 colors of differing costs.
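
To make the problem concrete, here is a small serial branch-and-bound sketch for minimal-cost map coloring. It uses depth-first search with a single best-cost bound rather than the 64 threads and priority queues of the benchmark, and it assumes non-negative color costs; MapColoring and minCost are hypothetical names.

```java
import java.util.Arrays;

// Sketch: serial depth-first branch-and-bound for minimal-cost coloring.
final class MapColoring {
    // Returns the minimum total cost, or Integer.MAX_VALUE if no valid
    // coloring exists. adjacent[i][j] is true when regions i and j touch.
    static int minCost(boolean[][] adjacent, int[] colorCost) {
        int[] color = new int[adjacent.length];
        Arrays.fill(color, -1);
        int[] best = { Integer.MAX_VALUE };   // current best solution
        search(0, 0, color, adjacent, colorCost, best);
        return best[0];
    }

    private static void search(int region, int costSoFar, int[] color,
                               boolean[][] adjacent, int[] colorCost, int[] best) {
        if (costSoFar >= best[0]) {
            return;                           // bound: cannot beat the best so far
        }
        if (region == color.length) {
            best[0] = costSoFar;              // complete, cheaper coloring found
            return;
        }
        for (int c = 0; c < colorCost.length; c++) {
            boolean allowed = true;
            for (int other = 0; other < region; other++) {
                if (adjacent[region][other] && color[other] == c) {
                    allowed = false;          // neighbor already uses this color
                    break;
                }
            }
            if (allowed) {
                color[region] = c;            // branch
                search(region + 1, costSoFar + colorCost[c],
                       color, adjacent, colorCost, best);
                color[region] = -1;
            }
        }
    }
}
```

In a parallel version, the `best` bound is the value that would be shared among threads, so that a good solution found by one thread prunes the search in the others.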

37
Experimental Setting
  • Two Linux 2.2 clusters
  • eight 200 MHz Pentium Pro processors connected by
    Myrinet switch and using MPI over BIP.
  • four 450 MHz Pentium II processors connected by a
    SCI network and using SISCI.
  • gcc 2.7.2.3 with -O6

38
Performance Results

39
Parallelizability

40
Baseline Performance
  • Compared serial Java to serial C for map-coloring
    application.
  • Each program has single queue, single thread.

41
Serial Java versus Serial C
  • Java v2: DSM checks disabled
  • Java v3: DSM and array-bound checks disabled
  • Executing on a single 450 MHz Pentium II

42
Inline checks are expensive!
  • Genericity of DSM-PM2 allows an alternative
    implementation.
  • Use page-fault detection rather than inline check
    to detect non-local object.

43
Using Page Faults: details
  • An object reference is the address of the base of
    the object.
  • loadIntoCache does nothing.
  • DSM-PM2 is used to handle page faults generated
    by the get/put primitives.

44
More details
  • When an object is allocated, its address is
    appended to a list attached to the page that
    contains its header.
  • When a page is loaded on a remote node, the list
    is used to turn on the header bit for all object
    headers on the page.
  • The put primitive uses the header bit in the same
    manner as inline-check version.

45
Inline Check versus Page Fault
  • IC has higher overhead for accessing objects
    (either local or locally cached).
  • PF has higher overhead (signal handling and
    memory protection) for loading a page into the
    cache.

46
IC versus PF: serial map-coloring
  • Java XX v2: DSM checks disabled
  • Java XX v3: DSM and array-bound checks disabled
  • Executing on a single 450 MHz Pentium II

47
IC versus PF: parallel map-coloring
  • Executing on 450MHz/SCI cluster.

48
Related Work
  • Java/MPI: cluster nodes are explicit
  • Java/RMI: ditto
  • Remote objects via RMI: nearly transparent
  • e.g. JavaParty, Do!
  • Distributed interpreters
  • e.g. Java/DSM, MultiJav, cJVM

49
Conclusions
  • Approach is clean: Java as is
  • Approach is promising
  • good parallelizability for map-coloring
  • need better scalar compilation
  • e.g. array bound-check removal
  • need further parallel application studies
  • are thread/object placement heuristics sufficient
    for programmers to write efficient programs?