Titanium: Language and Compiler Support for Grid-based Computation (PowerPoint Transcript)

1
Titanium: Language and Compiler Support for Grid-based Computation
Kathy Yelick
  • U.C. Berkeley
  • Computer Science Division

2
Titanium Group
  • Susan Graham
  • Katherine Yelick
  • Paul Hilfinger
  • Phillip Colella (LBNL)
  • Alex Aiken
  • Greg Balls (SDSC)
  • Peter McCorquodale (LBNL)
  • Andrew Begel
  • Dan Bonachea
  • Tyson Condie
  • David Gay
  • Ben Liblit
  • Chang Sun Lin
  • Geoff Pike
  • Siu Man Yau

3
Target Problems
  • Many modeling problems in astrophysics, biology, material science, and other areas require
    • An enormous range of spatial and temporal scales
  • To solve interesting problems, one needs
    • Adaptive methods
    • Large-scale parallel machines
  • Titanium is designed for methods with
    • Structured grids
    • Locally-structured grids (AMR)

4
Common Requirements
  • Algorithms for numerical PDE computations are
    • communication intensive
    • memory intensive
  • AMR makes these harder
    • more small messages
    • more complex data structures
    • most of the programming effort is debugging the boundary cases
    • locality and load balance trade-off is hard

5
A Little History
  • Most parallel programs are written using explicit parallelism, either
    • Message passing with an SPMD model
      • Usually for scientific applications with C/Fortran
      • Scales easily
    • Shared memory with threads in C or Java
      • Usually for non-scientific applications
      • Easier to program
  • Take the best features of both for Titanium

6
Titanium
  • Take the best features of threads and MPI
    • global address space like threads (programming)
    • SPMD parallelism like MPI (performance)
    • local/global distinction, i.e., layout matters (performance)
  • Based on Java, a cleaner C++
    • classes, memory management
  • Optimizing compiler
    • communication and memory optimizations
    • synchronization analysis
    • cache and other uniprocessor optimizations

7
Why Java for Scientific Computing?
  • Computational scientists use increasingly complex models
    • Popularized C++ features: classes, overloading, pointer-based data structures
  • But C++ is very complicated
    • easy to lose performance and readability
  • Java is a better C++
    • Safe: strongly typed, garbage collected
    • Much simpler to implement (research vehicle)
    • Industrial interest as well: IBM HP Java

8
Summary of Features Added to Java
  • Multidimensional arrays with iterators
  • Immutable (value) classes
  • Templates
  • Operator overloading
  • Scalable SPMD parallelism
  • Global address space
  • Checked Synchronization
  • Zone-based memory management
  • Scientific Libraries

9
Lecture Outline
  • Language and compiler support for uniprocessor performance
    • Immutable classes
    • Multidimensional arrays
    • Foreach
  • Language support for ease of programming
  • Language support for parallel computation
  • Applications and application-level libraries
  • Summary and future directions

10
Java: A Cleaner C++
  • Java is an object-oriented language
    • classes (no standalone functions) with methods
    • inheritance between classes
  • Documentation on web at java.sun.com
  • Syntax similar to C:

    class Hello {
      public static void main (String[] argv) {
        System.out.println("Hello, world!");
      }
    }

  • Safe: strongly typed, automatic memory management
  • Titanium is an (almost) strict superset

11
Java Objects
  • Primitive scalar types: boolean, double, int, etc.
    • implementations will store these on the program stack
    • access is fast -- comparable to other languages
  • Objects: user-defined and standard library
    • passed by pointer value (object sharing) into functions
    • have an implicit level of indirection (pointer to object)
    • simple model, but inefficient for small objects

(Figure: primitive values such as 2.6, 3, true are stored directly; an object with fields r = 7.1, i = 4.3 is accessed through a pointer.)
12
Java Object Example
    class Complex {
      private double real;
      private double imag;
      public Complex(double r, double i) {
        real = r; imag = i;
      }
      public Complex add(Complex c) {
        return new Complex(c.real + real, c.imag + imag);
      }
      public double getReal() { return real; }
      public double getImag() { return imag; }
    }

    Complex c = new Complex(7.1, 4.3);
    c = c.add(c);
    class VisComplex extends Complex { ... }

13
Immutable Classes in Titanium
  • For small objects, would sometimes prefer
    • to avoid the level of indirection
    • pass by value (copying of entire object)
    • especially when immutable -- fields never modified
    • extends the idea of primitive values to user-defined values
  • Titanium introduces immutable classes
    • all fields are final (implicitly)
    • cannot inherit from or be inherited by other classes
    • needs to have a 0-argument constructor
  • Note: considering allowing mutation in the future

14
Example of Immutable Classes
  • The immutable complex class is nearly the same:

    immutable class Complex {
      Complex() { real = 0; imag = 0; }   // zero-argument constructor required
      ...
    }

  • Use of immutable complex values (the "new" keyword is unchanged):

    Complex c1 = new Complex(7.1, 4.3);
    Complex c2 = new Complex(2.5, 9.0);
    c1 = c1.add(c2);

  • Similar to structs in C in terms of performance
  • Rest unchanged: no assignment to fields outside of constructors.
15
Arrays in Java
(Figure: a 2D array in Java is an array of 1D arrays.)
  • Arrays in Java are objects
  • Only 1D arrays are directly supported
  • Multidimensional arrays are slow
  • Subarrays are important in AMR (e.g., interior of
    a grid)
  • Even C and C++ don't support these well
  • Hand-coding (array libraries) can confuse
    optimizer

16
Multidimensional Arrays in Titanium
  • New multidimensional array type added to Java
    • One array may be a subarray of another, e.g., a is the interior of b, or a is all even elements of b
    • Indexed by Points (tuples of ints)
    • Constructed over a set of Points, called Rectangular Domains (RectDomains)
    • Points, Domains and RectDomains are built-in immutable classes
  • Support for AMR and other grid computations
    • domain operations: intersection, shrink, border (see the sketch below)
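
A minimal sketch of these domain operations (illustrative only; the shrink method and the "*" intersection operator follow the slides' descriptions, but exact signatures are assumptions, not taken from the Titanium spec):

    RectDomain<2> whole    = [[0,0] : [9,9]];       // a 10x10 patch, including boundary cells
    RectDomain<2> other    = [[5,5] : [14,14]];     // a neighboring patch
    RectDomain<2> interior = whole.shrink(1);       // peel one layer of boundary cells
    RectDomain<2> overlap  = whole * other;         // intersection, as on the backup Domain slide
    double [2d] u = new double[whole];
    foreach (p in interior) u[p] = 0.0;             // initialize interior points only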

17
Unordered Iteration
  • Memory hierarchy optimizations are essential
  • Compilers can sometimes do these, but it is hard in general
  • Titanium adds unordered iteration over rectangular domains:
    • foreach (p in r) { ... }
    • p is a Point
    • r is a RectDomain or Domain
  • Foreach simplifies bounds checking as well
  • Additional operations on domains to subset and transform (see the example below)
  • Note: foreach is not a parallelism construct
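
A small example of foreach over a full domain and a strided subset (illustrative; the stride syntax follows slide 19 and the backup Domain slide):

    RectDomain<2> r = [[1,1] : [10,20]];
    double [2d] a = new double[r];
    foreach (p in r) a[p] = 1.0;                       // visit every point, in no fixed order
    RectDomain<2> evens = [[2,2] : [10,20] : [2,2]];   // every other point in each dimension
    foreach (p in evens) a[p] = 2.0;                   // iterate over only the strided subset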

18
Point, RectDomain, Arrays in General
  • Points specified by a tuple of ints
  • RectDomains given by 3 points:
    • lower bound, upper bound (and optional stride)
  • Array declared by dimensions and type
  • Array created by passing a RectDomain

19
Simple Array Example
  • Matrix sum in Titanium

    Point<2> lb = [1,1];
    Point<2> ub = [10,20];
    RectDomain<2> r = [lb : ub];                   // no array allocation here
    double [2d] a = new double[r];
    double [2d] b = new double[[1,1] : [10,20]];   // syntactic sugar: domain given inline
    double [2d] c = new double[lb : ub : [1,1]];   // optional stride
    for (int i = 1; i <= 10; i++)
      for (int j = 1; j <= 20; j++)
        c[i,j] = a[i,j] + b[i,j];
20
Naïve MatMul with Titanium Arrays
    public static void matMul(double [2d] a, double [2d] b,
                              double [2d] c) {
      int n = c.domain().max()[1];   // assumes square matrix indexed from 0
      for (int i = 0; i <= n; i++)
        for (int j = 0; j <= n; j++)
          for (int k = 0; k <= n; k++)
            c[i,j] += a[i,k] * b[k,j];
    }

21
Better MatMul with Titanium Arrays
    public static void matMul(double [2d] a, double [2d] b,
                              double [2d] c) {
      foreach (ij within c.domain()) {
        double [1d] aRowi = a.slice(1, ij[1]);
        double [1d] bColj = b.slice(2, ij[2]);
        foreach (k within aRowi.domain()) {
          c[ij] += aRowi[k] * bColj[k];
        }
      }
    }

  • Current performance comparable to 3 nested loops in C
  • Future: automatic blocking for memory hierarchy (Geoff Pike's PhD thesis)

22
Array Performance Issues
  • Array representation is fast, but access methods can be slow, e.g., bounds checking, strides
  • Compiler optimizes these:
    • common subexpression elimination
    • eliminate (or hoist) bounds checking
    • strength reduction: e.g., naïve code has 1 divide per dimension for each array access (see the sketch below)
  • Currently within +/- 20% of C/Fortran for large loops
  • Future: small loop and cache optimizations
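
A generic illustration of the strength-reduction idea (not actual Titanium compiler output; data, base, strideI, and strideJ are hypothetical names): replace per-access index arithmetic for a strided 2D view with a running offset that is updated incrementally.

    // Naive: index arithmetic on every access.
    //   for (int i = 0; i < m; i++)
    //     for (int j = 0; j < n; j++)
    //       sum += data[base + i * strideI + j * strideJ];
    // Strength-reduced: update the offset instead of recomputing it.
    int rowOff = base;
    for (int i = 0; i < m; i++) {
      int off = rowOff;
      for (int j = 0; j < n; j++) {
        sum += data[off];
        off += strideJ;
      }
      rowOff += strideI;
    }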

23
Sequential Performance
Performance results are from 1998; the new IR and optimization framework is almost complete. (Performance chart not reproduced in this transcript.)
24
Lecture Outline
  • Language and compiler support for uniprocessor performance
  • Language support for ease of programming
    • Templates
    • Operator overloading (example later)
  • Language support for parallel computation
  • Applications and application-level libraries
  • Summary and future directions
25
Lecture Outline
  • Language and compiler support for uniprocessor performance
  • Language support for ease of programming
  • Language support for parallel computation
    • SPMD execution
    • Barriers and single
    • Explicit communication
    • Implicit communication (global and local references)
    • More on single
    • Synchronized methods and blocks (as in Java)
  • Applications and application-level libraries
  • Summary and future directions

26
SPMD Execution Model
  • Java programs can be run as Titanium programs, but the result will be that all processors do all the work
  • E.g., parallel hello world:

    class HelloWorld {
      public static void main (String[] argv) {
        System.out.println("Hello from proc " + Ti.thisProc());
      }
    }

  • Any non-trivial program will have communication and synchronization

27
SPMD Model
  • All processors start together and execute the same code, but not in lock-step
  • Basic control done using
    • Ti.numProcs(): total number of processors
    • Ti.thisProc(): number of the executing processor
  • Bulk-synchronous style:

    read all particles and compute forces on mine
    Ti.barrier()
    write to my particles using new forces
    Ti.barrier()

  • This is neither message passing nor data-parallel
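
A minimal concrete sketch of the same bulk-synchronous pattern (illustrative only; blockSize, force, allParticles, computeForce, and push are hypothetical names, not from the slides):

    int p = Ti.thisProc();
    // Phase 1: read all particles, compute forces for my block only.
    for (int i = p * blockSize; i < (p + 1) * blockSize; i++)
      force[i] = computeForce(i, allParticles);
    Ti.barrier();                              // everyone finishes reading and computing
    // Phase 2: write only my own particles using the new forces.
    for (int i = p * blockSize; i < (p + 1) * blockSize; i++)
      allParticles[i] = push(allParticles[i], force[i]);
    Ti.barrier();                              // safe for the next phase to read again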

28
Barriers and Single
  • A common source of bugs is barriers or other global operations inside branches or loops
    • barrier, broadcast, reduction, exchange
  • A single method is one called by all procs
    • public single static void allStep(...)
  • A single variable has the same value on all procs
    • int single timestep = 0;
  • The single annotation on methods (also called "sglobal") is optional, but useful for understanding compiler messages. (A sketch of the bug pattern follows.)
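
A hedged sketch of the bug pattern and a fix (illustrative; myCount is a hypothetical per-processor variable): if processors disagree on a branch that contains a barrier, some reach the barrier and others do not, and the program hangs.

    // Buggy: myCount may differ across processors, so they can disagree on the branch.
    if (myCount > 0) {
      Ti.barrier();          // hangs if only some processors take the branch
    }

    // Fixed: make the condition single by broadcasting one processor's value.
    int single allCount = broadcast myCount from 0;
    if (allCount > 0) {
      Ti.barrier();          // all processors agree, so all reach the barrier
    }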

29
Explicit Communication Broadcast
  • Broadcast is a one-to-all communication:
    • broadcast <value> from <processor>
  • For example:

    int count = 0;
    int allCount = 0;
    if (Ti.thisProc() == 0) count = computeCount();
    allCount = broadcast count from 0;

  • The processor number in the broadcast must be single; all constants are single.
  • The allCount variable could be declared single.

30
Example of Data Input
  • Same example, but reading from the keyboard
  • Shows use of Java exceptions

    int single count = 0;
    int allCount = 0;
    if (Ti.thisProc() == 0)
      try {
        DataInputStream kb = new DataInputStream(System.in);
        count = Integer.valueOf(kb.readLine()).intValue();
      } catch (Exception e) {
        System.err.println("Illegal Input");
      }
    allCount = broadcast count from 0;

31
Global Address Space
  • References (pointers) may be remote
    • useful in building adaptive meshes
    • easy to port shared-memory programs
    • uniform programming model across machines
  • Global pointers are more expensive than local
    • true even when data is on the same processor
    • space: (processor number, memory address)
    • dereference time: check to see if local
  • Use local declarations in critical sections

32
Example: A Distributed Data Structure
  • Data can be accessed across processor boundaries

(Figure: each processor's local_grids array points to its own grids; the all_grids array holds references to every processor's grids.)
33
Example: Setting Boundary Conditions

    foreach (l in local_grids.domain()) {
      foreach (a in all_grids.domain()) {
        local_grids[l].copy(all_grids[a]);
      }
    }

34
Explicit Communication Exchange
  • To create shared data structures
    • each processor builds its own piece
    • pieces are exchanged (for objects, just exchange pointers)
  • Exchange primitive in Titanium:

    int [1d] single allData;
    allData = new int[0 : Ti.numProcs()-1];
    allData.exchange(Ti.thisProc()*2);

  • E.g., on 4 procs, each will have its own copy of allData containing 0, 2, 4, 6

35
Building Distributed Structures
  • Distributed structures are built with exchange:

    class Boxed {
      public Boxed(int j) { val = j; }
      public int val;
    }

    Object [1d] single allData;
    allData = new Object[0 : Ti.numProcs()-1];
    allData.exchange(new Boxed(Ti.thisProc()));

36
Distributed Data Structures
  • Building distributed arrays:

    RectDomain<1> single allProcs = [0 : Ti.numProcs()-1];
    RectDomain<1> myParticleDomain = [0 : myPartCount-1];
    Particle [1d] single [1d] allParticle =
        new Particle [allProcs][1d];
    Particle [1d] myParticle =
        new Particle [myParticleDomain];
    allParticle.exchange(myParticle);

  • Now each processor has an array of pointers, one to each processor's chunk of particles

37
More on Single
  • Global synchronization needs to be controlled:

    if (this processor owns some data) {
      compute on it
      barrier
    }

  • Hence the use of single variables in Titanium
  • If a conditional or loop block contains a barrier, all processors must execute it
    • conditions in such loops, if statements, etc. must contain only single variables

38
Single Variable Example
  • Barriers and single in an N-body simulation:

    class ParticleSim {
      public static void main (String[] argv) {
        int single allTimestep = 0;
        int single allEndTime = 100;
        for (; allTimestep < allEndTime; allTimestep++) {
          // read all particles and compute forces on mine
          Ti.barrier();
          // write to my particles using new forces
          Ti.barrier();
        }
      }
    }

  • Single methods are inferred; see David Gay's work

39
Use of Global / Local
  • As seen, references (pointers) may be remote
    • easy to port shared-memory programs
  • Global pointers are more expensive than local
    • true even when data is on the same processor
  • Costs of global:
    • space (processor number + memory address)
    • dereference time (check to see if local)
  • Use local declarations in critical sections
  • May declare references as local
    • compiler will automatically infer them when possible

40
Global Address Space
  • Processes allocate locally
  • References can be passed to other processes

(Figure: Process 0 and other processes each have a local heap; lv points within Process 0's heap, gv can point across heaps.)

    class C { int val; ... }
    C gv;               // global pointer
    C local lv;         // local pointer
    if (thisProc() == 0) {
      lv = new C();
    }
    gv = broadcast lv from 0;
    gv.val = ...;       // remote write
    ... = gv.val;       // remote read
41
Local Pointer Analysis
  • Compiler can infer many uses of local
    • See Liblit's work on Local Qualification Inference
  • Data structures must be well partitioned

42
Lecture Outline
  • Language and compiler support for uniprocessor performance
  • Language support for ease of programming
  • Language support for parallel computation
  • Applications and application-level libraries
    • Gene sequencing application
    • Heart simulation
    • AMR elliptic and hyperbolic solvers
    • Scalable Poisson for infinite domains
    • Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join
  • Summary and future directions

43
Unstructured Mesh Kernel
  • EM3D: relaxation on a 3D unstructured mesh
  • Speedup on an Ultrasparc SMP
  • Simple kernel; mesh not partitioned.

44
AMR Poisson
  • Poisson solver [Semenzato, Pike, Colella]
    • 3D AMR
    • finite domain
    • variable coefficients
    • multigrid across levels
  • Performance of Titanium implementation
    • Sequential multigrid performance within +/- 20% of Fortran
    • On a fixed, well-balanced problem of 8 patches, each 72^3
    • parallel speedups of 5.5 on 8 processors

45
Scalable Poisson Solver
  • MLC for Finite-Differences by Balls and Colella
  • Poisson equation with infinite boundaries
    • arises in astrophysics, some biological systems, etc.
  • Method is scalable
    • low communication
  • Performance on SP2 (shown) and T3E
    • scaled speedups
    • nearly ideal (flat)
  • Currently 2D and non-adaptive

46
AMR Gas Dynamics
  • Developed by McCorquodale and Colella
  • Merge with Poisson underway for self-gravity
  • 2D example (3D supported)
    • Mach-10 shock on a solid surface at an oblique angle
  • Future: self-gravitating gas dynamics package

47
Distributed Array Libraries
  • There are some standard distributed array
    libraries associated with Titanium
  • They hide the details of exchange, indirection within the data structure, etc.
  • Libraries benefit from support for templates

48
Distributed Array Library Fragment
    template <class T, int single arity> public class DistArray {
      RectDomain<arity> single rd;
      T [arity d][arity d] subMatrices;
      RectDomain<arity> [arity d] single subDomains;
      ...
      /* Sets the element at p to value */
      public void set (Point<arity> p, T value) {
        getHomingSubMatrix(p)[p] = value;
      }
    }

    template DistArray<double, 2> single A =
        new template DistArray<double, 2>([[0, 0] : [aHeight, aWidth]]);

49
Immersed Boundary Method (future)
  • Immersed boundary method [Peskin, MacQueen]
  • Used in heart model, platelets, and others
  • Currently uses FFT for the Navier-Stokes solver
  • Begun effort to move the solver and the full method into Titanium

50
Implementation
  • Strategy
    • Compile Titanium into C
    • Solaris or Posix threads for SMPs
    • Lightweight communication for MPPs/clusters
  • Status: Titanium runs on
    • Solaris or Linux SMPs and uniprocessors
    • Berkeley NOW
    • SDSC Tera, SP2, T3E (also NERSC)
    • SP3 port underway

51
Using Titanium on NPACI Machines
  • Send mail to us if you are interested
    • titanium-group@cs.berkeley.edu
  • Has been installed in individual accounts
    • T3E and BH upgrade needed
  • On uniprocessors and SMPs
    • available from the Titanium home page
    • http://www.cs.berkeley.edu/projects/titanium
    • other documentation available as well

52
Calling Other Languages
  • We have built interfaces to
    • PETSc: scientific library for finite element applications
    • Metis: graph partitioning library
    • KeLP: starting work on this
  • Two issues with cross-language calls
    • accessing Titanium data structures (arrays) from C
      • possible because Titanium arrays have the same format on the inside
    • having a common message layer
      • Titanium is built on lightweight communication

53
Future Plans
  • Improved compiler optimizations for scalar code
    • large loops are currently within +/- 20% of Fortran
    • working on small loop performance
  • Packaged solvers written in Titanium
    • elliptic and hyperbolic solvers, both regular and adaptive
  • New application collaboration
    • Peskin and McQueen (NYU) with Colella (LBNL)
    • immersed boundary method, currently used for heart simulation, platelet coagulation, and others

54
Backup Slides
55
Example: Domain
  • Domains in general are not rectangular
  • Built using set operations
    • union, +
    • intersection, *
    • difference, -
  • Example is a red-black algorithm

    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> r = [lb : ub : [2, 2]];
    ...
    Domain<2> red = r + (r + [1, 1]);
    foreach (p in red) { ... }

(Figure: r is the strided set of points from (0, 0) to (6, 4); red also contains those points shifted by (1, 1), reaching (7, 5).)
56
Example using Domains and foreach
  • Gauss-Seidel red-black computation in multigrid:

    void gsrb() {
      boundary(phi);
      for (Domain<2> d = red; d != null;
           d = (d == red ? black : null)) {
        foreach (q in d)                       // unordered iteration
          res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)]) * 4
                    + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                    - 20.0 * phi[q] - k * rhs[q]) * 0.05;
        foreach (q in d) phi[q] = res[q];
      }
    }
57
Recent Progress in Titanium
  • Distributed data structures built with global refs
    • communication may be implicit, e.g., a[j] = a[i].dx
    • used extensively in AMR algorithms
  • Runtime layer optimizes
    • bulk communication
    • bulk I/O
  • Runs on
    • T3E, SP2, and Tera
  • Compiler analysis optimizes
    • global references converted to local ones when possible

58
Consistency Model
  • Titanium adopts the Java memory consistency model
  • Roughly: accesses to shared variables that are not synchronized have undefined behavior
  • Use synchronization to control access to shared variables
    • barriers
    • synchronized methods and blocks (see the sketch below)
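
A minimal sketch of a synchronized method guarding shared state (standard Java synchronization, which Titanium inherits; the Counter class is illustrative):

    class Counter {
      private int count = 0;
      public synchronized void inc() { count++; }       // one thread at a time
      public synchronized int get() { return count; }
    }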

59
Compiler Techniques Outline
  • Analysis and optimization of parallel code
    • Tolerate network latency: Split-C experience
    • Hardware trends and reordering
    • Semantics: sequential consistency
    • Cycle detection: parallel dependence analysis
    • Synchronization analysis: parallel flow analysis
  • Summary and future directions

60
Parallel Optimizations
  • Titanium compiler performs parallel optimizations
    • communication overlap and aggregation
  • Two new analyses
    • synchronization analysis: the parallel analog to control flow analysis for serial code [Gay & Aiken]
    • shared variable analysis: the parallel analog to dependence analysis [Krishnamurthy & Yelick]
  • Local qualification inference: automatically inserts local qualifiers [Liblit & Aiken]

61
Split-C Experience: Latency Overlap
  • Titanium borrowed ideas from Split-C
    • global address space
    • SPMD parallelism
  • But Split-C had non-blocking accesses built in to tolerate network latency on remote read/write
  • Also one-way communication
  • Conclusion: useful, but complicated

    int *global p;
    x := *p;            /* get */
    *p := 3;            /* put */
    sync;               /* wait for my puts/gets */

    *p :- x;            /* store */
    all_store_sync;     /* wait globally */
62
Sources of Memory/Comm. Overlap
  • Would like the compiler to introduce put/get/store
  • Hardware also reorders
    • out-of-order execution
    • write buffered with read by-pass
    • non-FIFO write buffers
    • weak memory models in general
  • Software already reorders too
    • register allocation
    • any code motion
  • System provides enforcement primitives
    • e.g., memory fence, volatile, etc.
    • tend to be heavyweight and have unpredictable performance
  • Can the compiler hide all this?

63
Semantics: Sequential Consistency
  • When compiling sequential programs, reordering is allowed:

    x = expr1;                 y = expr2;
    y = expr2;      ==>        x = expr1;

  • Valid if y is not in expr1 and x is not in expr2 (roughly)
  • When compiling parallel code, this is not a sufficient test:

    Initially flag = data = 0
    Proc A                     Proc B
    data = 1;                  while (flag != 1) { }
    flag = 1;                  ... = ...data...;
64
Cycle Detection: Dependence Analog
  • Processors define a "program order" on accesses from the same thread
    • P is the union of these total orders
  • The memory system defines an "access order" on accesses to the same variable
    • A is the access order (read/write and write/write pairs)
  • A violation of sequential consistency is a cycle in P U A
  • Intuition: time cannot flow backwards
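
Worked example (using the flag/data program on the previous slide): P contains "data = 1 before flag = 1" on Proc A and "read flag before read data" on Proc B; if the read of data were to return the stale value 0, A would contain "flag = 1 before the read of flag" and "read of data before data = 1". Following those four edges returns to the starting access, so P U A has a cycle, which is exactly the outcome sequential consistency forbids.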

65
Cycle Detection
  • Generalizes to arbitrary numbers of variables and
    processors
  • Cycles may be arbitrarily long, but it is
    sufficient to consider only cycles with 1 or 2
    consecutive stops per processor

(Figure: example cycle among the accesses write x, write y, read y, read y, write x across two processors.)
66
Static Analysis for Cycle Detection
  • Approximate P by the control flow graph
  • Approximate A by undirected dependence edges
  • Let the delay set D be all edges from P that
    are part of a minimal cycle
  • The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code)
  • Synchronization analysis is also critical

(Figure: example delay-set construction over the accesses write z, read x; write y, read x; read y, write z.)
67
Communication Optimizations
  • Implemented in a subset of C with limited pointers [Krishnamurthy, Yelick]
  • Experiments on the NOW; 3 synchronization styles
  • Future: pointer analysis and optimizations for AMR [Jeh, Yelick]

(Chart: execution time, normalized.)
68
End of Compiling Parallel Code