X10 Tutorial PSC Software Productivity Study May 23 - PowerPoint PPT Presentation

Loading...

PPT – X10 Tutorial PSC Software Productivity Study May 23 PowerPoint presentation | free to download - id: 1b9223-ZDc1Z



Loading


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation
Title:

X10 Tutorial PSC Software Productivity Study May 23

Description:

Defense Advanced Research Projects Agency (DARPA) under contract No. ... Advanced ... Console. interconnect. interconnect. Front-end. Nodes. Pset 1023. Pset 0. File ... – PowerPoint PPT presentation

Number of Views:24
Avg rating:3.0/5.0
Slides: 64
Provided by: vive60
Learn more at: http://www.csm.ornl.gov
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: X10 Tutorial PSC Software Productivity Study May 23


1
X10 Tutorial PSC Software Productivity
Study May 23 27, 2005
  • Vivek Sarkar
  • IBM T.J. Watson Research Center
  • vsarkar_at_us.ibm.com
  • This work has been supported in part by the
    Defense Advanced Research Projects Agency
    (DARPA) under contract No. NBCH30390004.

2
Acknowledgments
  • X10 core team
  • Philippe Charles
  • Chris Donawa
  • Kemal Ebcioglu
  • Christian Grothoff
  • Allan Kielstra
  • Christoph von Praun
  • Vivek Sarkar
  • Vijay Saraswat
  • X10 productivity team
  • Catalina Danis
  • Christine Halverson
  • Additional contributors to PSC Productivity Study
  • David Bader
  • Bill Clark
  • Nick Nystrom
  • John Urbanic

3
Outline
  • Clocks
  • creation, registration, next, resume, drop,
    ClockUseException
  • Basic serial constructs that differ from Java
  • const, nullable, extern
  • Advanced topics
  • Value types, conditional atomic sections (when),
    general regions distributions
  • Refer to language spec for details
  • What is X10?
  • background, status
  • Basic X10 (single place)
  • async, finish, atomic
  • future, force
  • Basic X10 (arrays loops)
  • points, rectangular regions, arrays
  • for, foreach
  • Scalable X10 (multiple places)
  • places, distributions, distributed arrays,
    ateach, BadPlaceException

4
What is X10?
  • X10 is a new experimental language developed in
    the IBM PERCS project as part of the DARPA
    program on High Productivity Computing Systems
    (HPCS)
  • X10s goal is to provide a new parallel
    programming model and its embodiment in a high
    level language that
  • is more productive than current models,
  • can support higher levels of abstraction better
    than current models, and
  • can exploit the multiple levels of parallelism
    and nonuniform data access that are critical for
    obtaining scalable performance in current and
    future HPC systems,

5
X10 status and schedule
  • 6/2003 PERCS programming model concept (end of
    PERCS Phase 1)
  • 7/2004 Start of PERCS Phase 2
  • 2/2004 Kickoff of X10 as concrete embodiment of
    PERCS programming model as a new language
  • 7/2004 First draft of X10 language specification
  • 2/2005 First X10 implementation -- unoptimized
    single-VM prototype
  • Emulates distributed parallelism in a single
    process
  • This is what you will use to run X10 programs
    this week
  • 5/2005 X10 productivity study at Pittsburgh
    Supercomputing Center
  • 7/2005 Results from X10 application
    productivity studies
  • 2H2005 Revise language based on application
    productivity feedback
  • 2H2005 Start participation in High Productivity
    Language consortium?
  • 1/2006 Second X10 implementation optimized
    multi-VM prototype
  • 6/2006 Open source release of X10 reference
    implementation
  • 6/2006 Design completed for production X10
    implementation in Phase 3 (end of Phase 2)

6
Current X10 Environment Unoptimized Single-VM
Implementation
X10 source program --- must contain a class named
Foo with a public static void main(String
args) method
Foo.x10
x10c Foo.x10
x10c
X10 compiler --- translates Foo.x10 to Foo.java,
uses javac to generate Foo.class from Foo.java
X10 program translated into Java --- // line
pseudocomment in Foo.java specifies source line
mapping in Foo.x10
Foo.class
Foo.java
x10 Foo.x10
X10 extern interface
X10 Virtual Machine (JVM J2SE libraries X10
libraries X10 Multithreaded Runtime)
External DLLs
X10 Abstract Performance Metrics (event counts,
distribution efficiency)
X10 Program Output
Caveat this is a prototype implementation with
many limitations. Please be patient!
7
Examples of X10 Compiler Error Messages
Case 1 Error message identifies source file and
line number
  • 1) x10c TutError1.x10
  • TutError1.x108 Could not find field or local
    variable "evenSum".
  • for (int i 2 i lt n i 2 )
    evenSum i

  • ----
  • 2) x10c TutError2.x10
  • x10c TutError2.x10427427 unexpected
    token(s) ignored
  • 3) x10c TutError3.x10
  • x10c C\vivek\eclipse\workspace\x10\examples\Tuto
    rial\TutError3.java49
  • local variable n is accessed from within
    inner class needs to be declared
  • final

Case 1 Carats indicate column range
Case 2 Error message identifies source file,
line number, and column range
Case 3 Error message reported by Java compiler
look for line comment in .java file to identify
X10 source location
8
Future X10 Environment
Implicit parallelism, Implicit data distributions
Very High Level Languages (VHLLs), Domain
Specific Languages (DSLs)
X10 places --- abstraction of explicit control
data distribution
X10 Libraries
X10 High Level Language
Mapping of places to nodes in HPC Parallel
Environment
X10 Deployment
HPC Runtime Environment (Parallel Environment,
MPI, LAPI, )
Primitive constructs for parallelism,
communication, and synchronization
HPC Parallel System
Target system for parallel application
9
Future X10 Environment Targeting Scalable HPC
Parallel Systems
Front-end Nodes
File Servers
interconnect
. . .
Pset 0
I/O Node 0
Thick X10 VM
Full X10 VM
Functional Gigabit Ethernet
Console
. . .
interconnect
I/O Node 1023
Thick X10 VM
. . .
Pset 1023
10
Future X10 Environment Targeting Scalable HPC
Parallel Systems
Front-end Nodes
File Servers
Clusters (scale-out)
SMP
Multiple cores on a chip
Coprocessors (SPUs)
SMTs
SIMD
ILP
interconnect
. . .
Pset 0
I/O Node 0
Thick X10 VM
Full X10 VM
Functional Gigabit Ethernet
Console
. . .
interconnect
I/O Node 1023
Thick X10 VM
. . .
Pset 1023
11
X10 vs. Java
  • X10 is an extended subset of Java
  • Base language Java 1.4
  • Java 5 features (generics, metadata, etc.) are
    currently not supported in X10
  • Notable features removed from Java
  • Concurrency --- threads, synchronized, etc.
  • Java arrays replaced by X10 arrays
  • Notable features added to Java
  • Concurrency async, finish, atomic, future,
    force, foreach, ateach, clocks
  • Distribution --- points, distributions
  • X10 arrays --- multidimensional distributed
    arrays, array reductions, array initializers,
  • Serial constructs --- nullable, const, extern,
    value types
  • X10 supports both OO and non-OO programming
    paradigms

12
x10.lang standard library
  • Java package with built in classes that provide
    support for selected X10 constructs
  • Standard types
  • boolean, byte, char, double, float, int, long,
    short, String
  • x10.lang.Object -- root class for all instances
    of X10 objects
  • x10.lang.clock --- clock instances clock
    operations
  • x10.lang.dist --- distribution instances
    distribution operations
  • x10.lang.place --- place instances place
    operations
  • x10.lang.point --- point instances point
    operations
  • x10.lang.region --- region instances region
    operations
  • All X10 programs implicitly import the x10.lang.
    package, so the x10.lang prefix can be omitted
    when referring to members of x10.lang. classes
  • e.g., place.MAX_PLACES, dist.factory.block(0100,
    0100),
  • Similarly, all X10 programs also implicitly
    import the java.lang. package
  • e.g., X10 programs can use Math.min() and
    Math.max() from java.lang

13
Calling foreign functions from X10 programs
  • Java methods
  • Can be called directly from X10 programs
  • Java class will be loaded automatically as part
    of X10 program execution
  • Basic rule dont call any method that can
    perform wait/notify or related thread operations
  • Calling synchronized methods is okay
  • C functions
  • Need to use extern declaration in X10, and
    perform a System.loadLibrary() call

14
Resources available in current X10 installation
  • Readme.txt --- basic information on X10
    installation and usage
  • Limitations.txt --- list of known limitations in
    the current X10 implementation
  • etc/standard.cfg --- default configuration
    information
  • examples/ -- root directory for a number of
    working X10 example programs
  • examples/Constructs shows usage of different X10
    constructs
  • examples/Tutorial contains examples used in this
    tutorial

15
Outline
  • Clocks
  • creation, registration, next, resume, drop,
    ClockUseException
  • Basic serial constructs that differ from Java
  • const, nullable, extern
  • Advanced topics
  • Value types, conditional atomic sections (when),
    general regions distributions
  • Refer to language spec for details
  • What is X10?
  • background, status
  • Basic X10 (single place)
  • async, finish, atomic
  • future, force
  • Basic X10 (arrays loops)
  • points, rectangular regions, arrays
  • for, foreach
  • Scalable X10 (multiple places)
  • places, distributions, distributed arrays,
    ateach, BadPlaceException

16
X10 Programming Model (Single Place)
Immutable Data (I) -- final variables, value
type instances
  • Activity lightweight thread
  • Main program starts as single activity in Place 0
  • Permitted object references (pointers)
  • I? H, H ? I, I?I, H?H, S?H, S-gtI,
  • Prohibited references
  • H ? S, I ? S, S ? S
  • No data sharing permitted between parent
    activitys stack and child activitys stack
  • Single Place Memory model
  • No coherence constraints needed for I and S
    storage classes
  • Guaranteed coherence for H storage class --- all
    writes to same shared location are observed in
    same order by all activities
  • Largest deployment granularity for a single place
    is a single SMP

Shared Heap (H)
Place 0
Locally Synchronous (coherent access to
intra-place shared heap)
Activity Stacks (S)
  • Storage classes
  • Immutable Data (I)
  • Shared Heap (H)
  • Activity Stacks (S)

17
Basic X10 (Single Place)
  • Core constructs used for intra-place (shared
    memory) parallel programming
  • Async construct used to execute a statement in
    parallel as a new activity
  • Finish construct used to check for global
    termination of statement and all the activities
    that it has created
  • Atomic construct used to coordinate accesses to
    shared heap by multiple activities
  • Future construct used to evaluate an expression
    in parallel as a new activity
  • Force construct used to check for termination
    of future

18
async statement
  • async ltstmtgt
  • Parent activity creates a new child activity to
    execute ltstmtgt in the same place as the parent
    activity
  • An async statement returns immediately parent
    execution proceeds immediately to next statement
  • Any access to parents local data must be through
    final variables
  • Similar to data access rules for inner classes in
    Java
  • Example
  • public class TutAsync
  • const boxedInt oddSumnew boxedInt()
  • const boxedInt evenSumnew boxedInt()
  • public static void main(String args)
  • final int n 100
  • async for (int i1 iltn i2 ) oddSum.val
    i
  • for (int j2 jltn j2 ) evenSum.val j

Variable n must be declared as final --- its
value is passed from parent to child activity
19
finish statement
  • finish ltstmtgt
  • Execute ltstmtgt as usual, but wait until all
    activities spawned (transitively) by ltstmtgt have
    terminated before completing the execution of
    finish S
  • finish traps all exceptions thrown by activities
    spawned by S, and throws a wrapping exception
    after S has terminated.
  • Example (see TutAsync.x10)
  • . . .
  • finish
  • async for (int i1 iltn i2 )
    oddSum.val i
  • for (int j2 jltn j2 ) evenSum.val
    j
  • // Both oddSum and evenSum have been computed
    now
  • System.out.println("oddSum " oddSum.val
  • " evenSum "
    evenSum.val)
  • // main()
  • // TutAsync

Console output oddSum 2500 evenSum 2550
20
Atomic statements methods
  • atomic ltstmtgt, atomic ltmethod-declgt
  • An atomic statement/method is conceptually
    executed in a single step, while other activities
    are suspended
  • Note programmer does not manage any locks
    explicitly
  • An atomic section may not include
  • Blocking operations
  • Creation of activities
  • Example (see TutAtomic1.x10)
  • finish
  • async for (int i1 iltn i2 )
  • double r 1.0d / i atomic rSum
    r
  • for (int j2 jltn j2 )
  • double r 1.0d / j atomic rSum
    r
  • System.out.println("rSum " rSum)

Console output rSum 5.187377517639618
21
Another Example (TutAtomic2.x10)
  • public class TutAtomic2
  • const int a new boxedInt(100)
  • const int b new boxedInt(100)
  • public static atomic void incr_a() a.val
    b.val--
  • public static atomic void decr_a() a.val--
    b.val
  • public static void main(String args)
  • int sum
  • finish
  • async for (int i1 ilt10 i )
    incr_a()
  • for (int i1 ilt10 i )
    decr_a()
  • atomic sum a.val b.val
  • System.out.println("ab " sum)
  • // main()
  • // TutAtomic2

Console output ab 200
22
Future Force
  • futurelttypegt F future ltexprgt
  • Parent activity creates a new asynchronous child
    activity at ltplacegt to evaluate ltexprgt
  • lttypegt value F.force()
  • Caller blocks until return value is obtained from
    future (and all activities spawned transitively
    by ltexprgt have terminated )
  • Example (see TutFuture2.x10)
  • // Note that futureltintgt and int are different
    types
  • futureltintgt Fi future fib(10)
  • int i Fi.force()
  • // Nested future types can also be created (if
    need be)
  • futureltfutureltintgtgt FFj future
    futurefib(100)
  • futureltintgt Fj FFj.force()
  • int j Fj.force()

23
Example (TutFuture1.x10)
Example of recursive divide-and-conquer
parallelism --- calls to fib(n-1) and fib(n-2)
execute in parallel
  • public class TutFuture1
  • static int fib(final int n)
  • if ( n lt 0 ) return 0
  • else if ( n 1 ) return 1
  • else
  • futureltintgt fn_1 future fib(n-1)
  • futureltintgt fn_2 future fib(n-2)
  • return fn_1.force() fn_2.force()
  • // fib()
  • public static void main(String args)
  • System.out.println("fib(10) "
    fib(10))
  • // main()
  • // TutFuture1

24
Parallel Programming Pitfalls Deadlock
  • Deadlock occurs when parallel threads/activities
    acquire locks or perform other blocking
    operations in a sequence that creates a
    dependence cycle
  • Java example
  • Thread 0
  • synchronized (Foo.a) synchronized(Foo.b)
  • Thread 1
  • synchronized (Foo.b) synchronized(Foo.a)
  • MPI example
  • Process 0
  • MPI_Recv(recvbuf, count, MPI_REAL, 1, tag, )
  • Process 1
  • MPI_Recv(recvbuf, count, MPI_REAL, 0, tag, )

25
Parallel Programming Pitfalls Deadlock (contd.)
  • X10 guarantee
  • Any program written with async, finish, atomic,
    foreach, ateach, and clock parallel constructs
    will never deadlock
  • Unrestricted use of future and force may lead to
    deadlock (see examples/Constructs/Future/FutureDea
    dlock_MustFailTimeout.x10)
  • f1 future a1()
  • f2 future a2()
  • int a1() f2.force()
  • Int a2() f1.force()
  • Restricted use of future and force in X10 can
    preserve guaranteed freedom from deadlocks
  • Sufficient condition 1 ensure that activity
    that creates the future also performs the force()
    operation
  • Sufficient condition 2 . . .

26
Parallel Programming Pitfalls Data Races
  • A data race occurs when two (or more)
    threads/activities can access the same shared
    location in parallel such that one of the
    accesses is a write operation
  • Java example
  • Thread 0 a b--
  • Thread 1 a b--
  • Data race can violate invariant that (ab) is
    constant
  • Data race may also prevent multiple increments
    from being combined correctly
  • X10 guidelines for avoiding data races
  • Use atomic methods and blocks without worrying
    about deadlock
  • Declare data to be read-only (i.e., final or
    value type instance) whenever possible

27
Outline
  • Clocks
  • creation, registration, next, resume, drop,
    ClockUseException
  • Basic serial constructs that differ from Java
  • const, nullable, extern
  • Advanced topics
  • Value types, conditional atomic sections (when),
    general regions distributions
  • Refer to language spec for details
  • What is X10?
  • background, status
  • Basic X10 (single place)
  • async, finish, atomic
  • future, force
  • Basic X10 (arrays loops)
  • points, rectangular regions, arrays
  • for, foreach
  • Scalable X10 (multiple places)
  • places, distributions, distributed arrays,
    ateach, BadPlaceException

28
Points
  • A point is an element of an n-dimensional
    Cartesian space (ngt1) with integer-valued
    coordinates e.g., 5, 1, 2,
  • Dimensions are numbered from 0 to n-1
  • n is also referred to as the rank of the point
  • A point variable can hold values of different
    ranks e.g.,
  • point p p 1 p 2,3
  • The following operations are defined on a
    point-valued expression p1
  • p1.rank --- returns rank of point p1
  • p1.get(i) --- returns element i of point p1
  • Returns element (i mod p1.rank) if i lt 0 or i gt
    p1.rank
  • p1.lt(p2), p1.le(p2), p1.gt(p2), p1.ge(p2)
  • Returns true iff p1 is lexicographically lt, lt,
    gt, or gt p2
  • Only defined when p1.rank and p1.rank are equal

29
Example (see TutPoint.x10)
  • public class TutPoint
  • public static void main(String args)
  • point p1 1,2,3,4,5
  • point p2 1,2
  • point p3 2,1
  • System.out.println("p1 " p1 "
    p1.rank " p1.rank " p1.get(2) "
    p1.get(2))
  • System.out.println("p2 " p2 " p3
    " p3 " p2.lt(p3) " p2.lt(p3))
  • // main()
  • // TutPoint

Console output p1 1,2,3,4,5 p1.rank 5
p1.get(2) 3 p2 1,2 p3 2,1 p2.lt(p3)
true
30
Rectangular Regions
  • A rectangular region is the set of points
    contained in a rectangular subspace
  • A region variable can hold values of different
    ranks e.g.,
  • region R R 010 R -100100, -100100
    R 0-1
  • The following operations are defined on a
    region-valued expression R
  • R.rank dimensions in region R.size()
    points in region
  • R.contains(P) true if region R contains point P
  • R.contains(S) true if region R contains region
    S
  • R.equal(S) true if region R equals region S
  • R.rank(i) projection of region R on dimension i
    (a one-dimensional region)
  • R.rank(i).low() lower bound of ith dimension of
    region R
  • R.rank(i).high() upper bound of ith dimension
    of region R
  • R.ordinal(P) ordinal value of point P in region
    R
  • R.coord(N) point in region R with ordinal value
    N
  • R1 R2 region intersection (will be
    rectangular if R1 and R2 are rectangular)
  • R1 R2 union of regions R1 and R2 (may not be
    rectangular)
  • R1 R2 region difference (may not be
    rectangular)

31
Example (see TutRegion.x10)
  • public class TutRegion
  • public static void main(String args)
  • region R1 110, -100100
  • System.out.println("R1 " R1 "
    R1.rank " R1.rank " R1.size() "
    R1.size() " R1.ordinal(10,100) "
    R1.ordinal(10,100))
  • region R2 110,90100
  • System.out.println("R2 " R2 "
    R1.contains(R2) " R1.contains(R2) "
    R2.rank(1).low() " R2.rank(1).low() "
    R2.coord(0) " R2.coord(0))
  • // main()
  • // TutRegion

Console output R1 110,-100100 R1.rank
2 R1.size() 2010 R1.ordinal(10,100)
2009 R2 110,90100 R1.contains(R2) true
R2.rank(1).low() 90 R2.coord(0) 1,90
32
X10 Arrays
  • Java arrays are one-dimensional and local
  • e.g., array args in main(String args)
  • Multi-dimensional arrays are represented as
    arrays of arrays in Java
  • X10 has true multi-dimensional arrays (as in C,
    Fortran) that can be distributed (as in UPC,
    Co-Array Fortran, ZPL, Chapel, etc.)
  • Array declaration
  • T . A declares an X10 array with element type
    T
  • An array variable can hold values of different
    rank)
  • The . syntax is used to avoid confusion with
    Java arrays
  • Array creation
  • new T R creates a local rectangular X10
    array with rectangular region R as the index
    domain and T as the element (range) type
  • e.g., int. A new int 0N1, 0N1
  • Array initializers can also be specified in
    conjunction with creation (see TutArray1.x10)
  • E.g., int. A new int 110,110
    (pointi,j) return ij

33
X10 Array Operations
  • The following operations are defined on
    array-valued expression s
  • A.rank dimensions in array
  • A.region index region (domain) of array
  • AP element at point P, where P belongs to
    A.region
  • A R restriction of array onto region R
  • Useful for extracting subarrays
  • A.sum(), A.max() sum/max of elements in array
  • A1 op A2 returns result of applying a pointwise
    op on array elements, when A1.region A2. region
  • Op can include , -, , and /
  • A1 A2 disjoint union of arrays A1 and A2
    (A1.region and A2.region must be disjoint)
  • A1.overlay(A2)
  • Returns an array with region, A1.region
    A2.region, with element value A2P for all
    points P in A2.region and A1P otherwise.
  • A.distribution distribution of array A
  • Will be discussed later when we introduce X10
    places

34
Example (see TutArray1.x10)
  • public class TutArray1
  • public static void main(String args)
  • int. A new int 110,110
  • (point i,j) return ij
  • System.out.println("A.rank " A.rank
  • " A.region "
    A.region)
  • int. B A 15,15
  • System.out.println("B.max() "
    B.max())
  • // main()
  • // TutArray1

Console output A.rank 2 A.region
110,110 B.max() 10
35
Pointwise for loop
  • X10 extends Javas for loop to support sequential
    iteration over points in region R in canonical
    lexicographic order
  • for ( point p R ) . . .
  • Standard point operations can be used to extract
    individual index values from point p
  • for ( point p R ) int i p.get(0) int j
    p.get(1) . . .
  • Or an exploded syntax can be used instead of
    explicitly declaring a point variable
  • for ( point i,j R ) . . .
  • The exploded syntax declares the constituent
    variables (i, j, ) as local int variables in the
    scope of the for loop body

36
Example (see TutFor.x10)
  • public class TutFor
  • public static void main(String args)
  • region R 01,02
  • System.out.print("Points in region " R
    " ")
  • for ( point p R ) System.out.print(" "
    p)
  • System.out.println()
  • // Use exploded syntax instead
  • System.out.print("(i,j) pairs in region "
    R " ")
  • for ( pointi,j R )
  • System.out.print("(" i "," j
    ")")
  • System.out.println()
  • // main()
  • // TutFor

Console output Points in region 01,02
0,0 0,1 0,2 1,0 1,1 1,2 (i,j) pairs
in region 01,02 (0,0)(0,1)(0,2)(1,0)(1,1)(1,2
)
37
foreach loop (Parallel iteration)
  • The X10 foreach loop is similar to the pointwise
    for loop, except that each iteration executes in
    parallel as a new asynchronous activity i.e.,
  • foreach ( point p R ) S is equivalent to for
    ( point p R ) async S
  • As before, finish can be used to wait for
    termination of all foreach iterations
  • finish foreach ( pointi,j 0M-1,0N-1 ) . .
    .
  • Special case use foreach to create a
    single-dimensional parallel loop
  • foreach ( pointi 0 N-1 ) S
  • Allowing a single foreach construct to span
    multiple dimensions makes it convenient to write
    parallel matrix code that is independent of the
    underlying rank and region e.g.
  • foreach ( point p A.region ) Ap f(Bp,
    Cp, Dp)
  • Multiple foreach instances may accesses shared
    data in the same place ? use finish, atomic,
    force to avoid data races

38
Example (see TutForeach1.x10)
  • public class TutForeach1
  • public static void main(String args)
  • final int N 5
  • int. A new int1N,1N
    (pointi,j) return ij
  • // For the Ai,j F(Ai,j) case,
  • // both loops can execute in parallel
  • finish foreach ( pointi,j A.region )
  • Ai,j Ai,j 1
  • // For the Ai,j F(Ai,j-1) case,
  • // only the outer loop can execute in
    parallel
  • finish foreach ( pointi
    A.region.rank(0) )
  • for (pointj
  • (A.region.rank(1).low()1)A.region
    .rank(1).high())
  • Ai,j Ai,j-1 1

NOTE A.region.rank(0) is the same as 1N
39
Example contd. (see TutForeach1.x10)
  • // For the Ai,j F(Ai-1,j) case,
  • // only the inner loop can execute in
    parallel
  • for (pointi
  • (A.region.rank(0).low()1)A.regio
    n.rank(0).high() )
  • finish foreach ( pointj
    A.region.rank(1) )
  • Ai,j Ai-1,j 1
  • // For the Ai,j F(Ai-1,j,Ai,j-1)
    case,
  • // use loop skewing to execute the inner
    loop in parallel
  • for ( pointt 42N)
  • finish foreach ( pointj
    Math.max(2,t-N)Math.min(N,t-2))
  • int i t - j
  • System.out.print("(" i ","
    j ")")
  • Ai,j Ai-1,j Ai,j-1 1
  • System.out.println()

Console output (2,2) (3,2)(2,3) (4,2)(3,3)(2,4) (
5,2)(3,4)(4,3)(2,5) (5,3)(4,4)(3,5) (5,4)(4,5) (5,
5)
40
Outline
  • Clocks
  • creation, registration, next, resume, drop,
    ClockUseException
  • Basic serial constructs that differ from Java
  • const, nullable, extern
  • Advanced topics
  • Value types, conditional atomic sections (when),
    general regions distributions
  • Refer to language spec for details
  • What is X10?
  • background, status
  • Basic X10 (single place)
  • async, finish, atomic
  • future, force
  • Basic X10 (arrays loops)
  • points, rectangular regions, arrays
  • for, foreach
  • Scalable X10 (multiple places)
  • places, distributions, distributed arrays,
    ateach, BadPlaceException

41
Limitations of using a Single Place
Immutable Data (I) -- final variables, value
type instances
  • Largest deployment granularity for a single place
    is a single SMP
  • Smallest granularity can be a single CPU or even
    a single hardware thread
  • Single SMP is inadequate for solving problems
    with large memory and compute requirements
  • X10 solution incorporate multiple places as a
    core foundation of the X10 programming model
  • ? Enable deployment on large-scale clustered
    machines, with integrated support for intra-place
    parallelism

Shared Heap (H)
Place 0
Locally Synchronous (coherent access to
intra-place shared heap)
Activity Stacks (S)
  • Storage classes
  • Immutable Data (I)
  • Shared Heap (H)
  • Activity Stacks (S)

42
Scalable X10 using multiple places
Immutable Data (I) -- final variables, value
type instances
Partitioned Global Address Space (PGAS)
Local Heap (LH)
Local Heap (LH)
Outbound activities
Inbound activities
Globally Asynchronous
Locally Synchronous (coherent access to
intra-place shared heap)
Activity Stacks (S)
Activity Stacks (S)
. . .
Inbound activity replies
Outbound activity replies
Place 0
Place (MAX_PLACES -1)
  • Storage classes
  • Immutable Data (I)
  • PGAS
  • Local Heap (LH)
  • Remote Heap (RH)
  • Activity Stacks (S)
  • Place collection of activities objects
  • Activities and data objects do not move after
    being created
  • Scalar object, O -- maps to a single place
    specified by O.location
  • Array object, A may be local to a place or
    distributed across multiple places, as specified
    by A.distribution

43
Locality Rule
  • Any access to a mutable (shared heap) datum must
    be performed by an activity located at the place
    as the datum
  • The prohibited references are similar as before
  • LH/RH ? S, I ? S, S ? S
  • Local-to-remote (LH ? RH) and remote-to-local (RH
    ? LH) heap references are freely permitted
  • However, direct access via a remote heap
    reference is not permitted!
  • Inter-place data accesses can only be performed
    by creating remote activities (with weaker
    ordering guarantees than intra-place data
    accesses)
  • The locality rule is currently not checked by
    default. Instead, the user can perform the check
    explicitly by inserting a place cast operator as
    follows
  • (_at_ P) E checks if expression E can be evaluated
    at place P
  • If so, expression E is evaluated as usual
  • If not, a BadPlaceException is thrown

44
Activity Execution within a Place
Place
Atomic sections do not have blocking semantics
Ready Activities
Executing Activities
Inbound activities
Outbound activities
Place-local activity can only its stack (S),
place-local heap (LH), or immutable data (I)
Blocked Activities
Completed Activities
Clock
. . .
Outbound replies
Inbound replies
Future
45
Places
  • place.MAX_PLACES total number of places
  • Default value is 4
  • Can be changed by using the -NUMBER_OF_LOCAL_PLACE
    S option in x10 command
  • place.places Set of all places in an X10
    program(see java.lang.Set)
  • place.factory.place(i) place corresponding to
    index i
  • here place in which current activity is
    executing
  • ltplace-exprgt.toString() returns a string of the
    form place(id99)
  • ltplace-exprgt.id returns the id of the place

X10 Data Structures
X10 language defines mapping from X10 objects to
X10 places, and abstract performance metrics on
places
X10 Places
Future X10 deployment system will define mapping
from X10 places to system nodes not supported in
current implementation
System Nodes
46
Extension of async and future to places
  • async (P) S
  • Creates new activity to execute statement S at
    place P
  • async S is equivalent to async (here) S
  • future (P) E
  • Create new activity to evaluate expression E at
    place P
  • future E is equivalent to future (here)
    E
  • Note that here in a child activity for an
    async/future computation will refer to the place
    P at which the child activity is executing, not
    the place where the parent activity is executing
  • The goal is to specify the destination place for
    async/future activities so as to obey the
    Locality Rule e.g.,
  • async (O.location) O.x 1
  • futureltintgt F future (A.distributioni) Ai

47
Distribution mapping from region to places
  • Creating distributions (x10.lang.dist)
  • dist D1 R-gt here // local distribution maps
    region R to here
  • dist D2 dist.factory.block(R) // blocked
    distribution
  • dist D3 dist.factory.cyclic(R) // cyclic
    distribution
  • dist D4 dist.factory.unique() // identity map
    on 0MAX_PLACES-1
  • Using distributions
  • DP place to which point P is mapped by
    distribution D (assuming that P is in D.region)
  • Allocate a distributed array e.g., T. A new
    T D
  • Allocates an array with index set D.region,
    such that element AP is located at place DP
    for each point P in D.region
  • NOTE new TR for region R is equivalent to
    new TR-gthere
  • Iterating over a distribution generalization of
    foreach to ateach
  • ateach is discussed in more detail later

48
Operations defined on distributions
  • D.region source region of distribution
  • D.rank rank of D.region
  • D R region restriction for distribution D and
    region R (returns a restricted distribution)
  • D P place restriction for distribution D and
    place P (returns region mapped by D to place P)
  • D1 D2 union of distributions D1 and D2
    (assumes that D1.region and D2.region are
    disjoint)
  • D1.overlay(D2) // Overlay of D2 over D1
    asymmetric union
  • D.contains(p) true iff D.region contains point
    p
  • D R -gt P, constant distribution which maps
    entire region R to place P
  • D1 D2 distribution difference D1
    (D1.region D2.region)
  • D.distributionEfficiency() load balance
    efficiency of distribution D

49
Inter-place communication using async and future
  • Question how to assign Ai Bj, when Ai
    and and Bj may be in different places?
  • Answer 1 --- use nested asyncs!
  • finish async ( B.distributionj )
  • final int bb Bj
  • async ( A.distributioni ) Ai bb
  • Answer 2 --- use future-force and an async!
  • final int b future (B.distributionj) Bj
    .force()
  • finish async ( A.distributioni ) Ai b

50
Load Balance Efficiency
  • Consider a parallel application that is executed
    on P places
  • Let T(i) computation load mapped to place i
  • For distribution D, T(i) (D
    place.factory.place(i)).size()
  • Let Tmax max T(i) 1 lt i lt P
  • Let E SUM T(i) 1 lt i lt P / (Tmax P)
  • E is the load balance efficiency, 1/P lt E lt 1
  • E 1 is the best case ? computation load is
    perfectly balanced
  • E 1/P is the worst case ? computation load is
    placed on a single processor/place
  • Load balance efficiency is one of the key factors
    that limit speedup on a parallel machine
  • there are several other factors e.g., comm.
    synchronization overhead
  • ignoring other factors, we expect speedup to be
    lt E P
  • NOTE also try x10 DUMP_STATS_ON_EXITtrue
    to see activity and atomic counts

51
ateach loop (distributed parallel iteration)
  • The X10 ateach loop is similar to the foreach
    loop, except that each iteration executes in
    parallel at a place specified by a distribution
  • ateach ( point p D ) S is equivalent to for
    ( point p D.region ) async (Dp) S
  • As before, finish can be used to wait for
    termination of all ateach iterations
  • finish ateach( pointi dist.factory.unique()
    ) S creates one activity per place, as in an
    SPMD computation
  • ateach is a convenient construct for writing
    parallel matrix code that is independent of the
    underlying distribution e.g.,
  • ateach ( point p A.distribution ) Ap
    f(Bp, Cp, Dp)

52
Example (see TutAteach1.x10)
  • public class TutAteach1
  • public static void main(String args)
  • finish ateach( pointi
    dist.factory.unique() )
  • System.out.println("Hello from "
    i)
  • // main()
  • // TutAteach1

dist.factory.unique() maps point i in the region,
0 place.MAX_PLACES-1, to place
place.factory.place(i)
Console output Hello from 1 Hello from 0 Hello
from 3 Hello from 2
53
Example converting foreach to ateach (see
TutAteach2.x10)
foreach version // For the Ai,j
F(Ai,j) case, // both loops can
execute in parallel finish foreach (
pointi,j A.region ) Ai,j
Ai,j 1 ateach version 1 finish
ateach ( pointi,j A.distribution)
Ai,j Ai,j 1 ateach version 2 (create
only one activity per place) finish
ateach ( point p dist.factory.unique() )
for ( pointi,j A.distribution here )
Ai,j Ai,j 1
54
Example converting foreach to ateach, contd.
(see TutAteach2.x10)
foreach version // For the Ai,j
F(Ai,j-1) case, // only the outer loop
can execute in parallel finish foreach (
pointi 1N ) for ( pointj
2N ) Ai,j Ai,j-1
1 ateach version // Assume that N is a
multiple of place.MAX_PLACES finish ateach (
pointi dist.factory.block(1N) )
for ( pointj 2N )
Ai,j Ai,j-1 1
55
Outline
  • Clocks
  • creation, registration, next, resume, drop,
    ClockUseException
  • Basic serial constructs that differ from Java
  • const, nullable, extern
  • Advanced topics
  • Value types, conditional atomic sections (when),
    general regions distributions
  • Refer to language spec for details
  • What is X10?
  • background, status
  • Basic X10 (single place)
  • async, finish, atomic
  • future, force
  • Basic X10 (arrays loops)
  • points, rectangular regions, arrays
  • for, foreach
  • Scalable X10 (multiple places)
  • places, distributions, distributed arrays,
    ateach, BadPlaceException

56
X10 clocks Motivation
  • Activity coordination using finish and force() is
    accomplished by checking for activity termination
  • However, there are many cases in which a
    producer-consumer relationship exists among the
    activities, and a barrier-like coordination is
    needed without waiting for activity termination
  • The activities involved may be in the same place
    or in different places

Phase 0
Phase 1
. . .
. . .
Activity 0
Activity 1
Activity 2
57
X10 Clocks
  • clock c clock.factory.clock()
  • Allocate a clock, register current activity with
    it. Phase 0 of c starts.
  • async() clocked (c1,c2,) S
  • ateach() clocked (c1,c2,) S
  • foreach() clocked (c1,c2,) S
  • Create async activities registered on clocks c1,
    c2,
  • c.resume()
  • Nonblocking operation that signals completion of
    work by current activity for this phase of clock
    c
  • next
  • Barrier --- suspend until all clocks that the
    current activity is registered with can advance.
    c.resume() is first performed for each such
    clock, if needed.
  • Next can be viewed like a finish of all
    computations under way in the current phase of
    the clock

58
X10 Clocks (contd.)
  • c.drop()
  • Unregister with c. A terminating activity will
    implicitly drop all clocks that it is registered
    on.
  • c.registered()
  • Return true iff current activity is registered on
    clock c
  • c.dropped() returns the opposite of
    c.registered()
  • ClockUseException
  • Thrown if an activity attempts to transmit or
    operate on a clock that it is not registered on

59
Example (see TutClock1.x10)
  • finish async
  • final clock c clock.factory.clock()
  • foreach (pointi 1N) clocked (c)
  • while ( true )
  • int old_A_i Ai int
    new_A_i Math.min(Ai,Bi)
  • if ( i gt 1 ) new_A_i
    Math.min(new_A_i,Bi-1)
  • if ( i lt N ) new_A_i
    Math.min(new_A_i,Bi1)
  • Ai new_A_i
  • next
  • int old_B_i Bi int
    new_B_i Math.min(Bi,Ai)
  • if ( i gt 1 ) new_B_i
    Math.min(new_B_i,Ai-1)
  • if ( i lt N ) new_B_i
    Math.min(new_B_i,Ai1)
  • Bi new_B_i
  • next
  • if ( old_A_i new_A_i
    old_B_i new_B_i ) break
  • // while
  • // foreach
  • // finish async

Example of transmitting clock from parent to child
NOTE exiting from while loop terminates activity
for iteration i, and automatically deregisters
activity from clock
60
Outline
  • Clocks
  • creation, registration, next, resume, drop,
    ClockUseException
  • Basic serial constructs that differ from Java
  • const, nullable, extern
  • Advanced topics
  • Value types, conditional atomic sections (when),
    general regions distributions
  • Refer to language spec for details
  • What is X10?
  • background, status
  • Basic X10 (single place)
  • async, finish, atomic
  • future, force
  • Basic X10 (arrays loops)
  • points, rectangular regions, arrays
  • for, foreach
  • Scalable X10 (multiple places)
  • places, distributions, distributed arrays,
    ateach, BadPlaceException

61
nullable
  • By default, object references in X10 are not
    allowed to take on the null value
  • However, the nullable type constructor can be
    used to enable certain object references to be
    set to null, or to compare them with null e.g.,
  • T1 a
  • nullable T2 b
  • a null // Not allowed
  • b null // Allowed
  • NOTE const is simply a shorthand for static
    final

62
extern
  • X10 provides a simple mechanism for invoking
    external functions written in C
  • Currently, the C function is restricted to
    arguments with primitive types or references to
    unsafe X10 arrays
  • The X10 program must contain an external
    declaration of the C function as follows
  • static extern char doit(int a, float b)
  • and also a statement to ensure that the native
    DLL, ltdllgt.dll is loaded
  • static System.loadLibrary(ltdllgt")
  • The X10 compiler then generates a file called
    ltclassgt_x10stub.c
  • To generate the DLL, the C programmer must
    compile the C function by including the file
    jni.h in tehir C function, and must link with the
    object file obtained from ltclassgt_x10stub.c

63
Outline
  • Clocks
  • creation, registration, next, resume, drop,
    ClockUseException
  • Basic serial constructs that differ from Java
  • const, nullable, extern
  • Advanced topics
  • Value types, conditional atomic sections (when),
    general regions distributions
  • Refer to language spec for details
  • What is X10?
  • background, status
  • Basic X10 (single place)
  • async, finish, atomic
  • future, force
  • Basic X10 (arrays loops)
  • points, rectangular regions, arrays
  • for, foreach
  • Scalable X10 (multiple places)
  • places, distributions, distributed arrays,
    ateach, BadPlaceException
About PowerShow.com