Title: Titanium: Language and Compiler Support for Grid-based Computation
1. Titanium: Language and Compiler Support for Grid-based Computation
Kathy Yelick
- U.C. Berkeley
- Computer Science Division
2. Titanium Group
- Susan Graham
- Katherine Yelick
- Paul Hilfinger
- Phillip Colella (LBNL)
- Alex Aiken
- Greg Balls (SDSC)
- Peter McCorquodale (LBNL)
- Andrew Begel
- Dan Bonachea
- Tyson Condie
- David Gay
- Ben Liblit
- Chang Sun Lin
- Geoff Pike
- Siu Man Yau
3. Target Problems
- Many modeling problems in astrophysics, biology, material science, and other areas require
  - Enormous range of spatial and temporal scales
- To solve interesting problems, one needs
  - Adaptive methods
  - Large scale parallel machines
- Titanium is designed for methods with
  - Structured grids
  - Locally-structured grids (AMR)
4. Common Requirements
- Algorithms for numerical PDE computations are
  - communication intensive
  - memory intensive
- AMR makes these harder
  - more small messages
  - more complex data structures
- Most of the programming effort is debugging the boundary cases
- Locality and load balance trade-off is hard
5. A Little History
- Most parallel programs are written using explicit parallelism, either
  - Message passing with an SPMD model
    - Usually for scientific applications with C/Fortran
    - Scales easily
  - Shared memory with threads in C or Java
    - Usually for non-scientific applications
    - Easier to program
- Take the best features of both for Titanium
6. Titanium
- Take the best features of threads and MPI
  - global address space like threads (programming)
  - SPMD parallelism like MPI (performance)
  - local/global distinction, i.e., layout matters (performance)
- Based on Java, a cleaner C++
  - classes, memory management
- Optimizing compiler
  - communication and memory optimizations
  - synchronization analysis
  - cache and other uniprocessor optimizations
7. Why Java for Scientific Computing?
- Computational scientists use increasingly complex models
  - Popularized C++ features: classes, overloading, pointer-based data structures
- But C++ is very complicated
  - easy to lose performance and readability
- Java is a better C++
  - Safe: strongly typed, garbage collected
  - Much simpler to implement (research vehicle)
- Industrial interest as well: IBM HP Java
8. Summary of Features Added to Java
- Multidimensional arrays with iterators
- Immutable (value) classes
- Templates
- Operator overloading
- Scalable SPMD parallelism
- Global address space
- Checked Synchronization
- Zone-based memory management
- Scientific Libraries
9. Lecture Outline
- Language and compiler support for uniprocessor performance
  - Immutable classes
  - Multidimensional Arrays
  - Foreach
- Language support for ease of programming
- Language support for parallel computation
- Applications and application-level libraries
- Summary and future directions
10. Java: A Cleaner C++
- Java is an object-oriented language
  - classes (no standalone functions) with methods
  - inheritance between classes
- Documentation on web at java.sun.com
- Syntax similar to C++:

    class Hello {
      public static void main (String[] argv) {
        System.out.println("Hello, world!");
      }
    }

- Safe: strongly typed, automatic memory management
- Titanium is (almost) a strict superset
11. Java Objects
- Primitive scalar types: boolean, double, int, etc.
  - implementations will store these on the program stack
  - access is fast -- comparable to other languages
- Objects: user-defined and standard library
  - passed by pointer value (object sharing) into functions
  - have an implicit level of indirection (pointer to)
  - simple model, but inefficient for small objects
(Figure: primitives 2.6, 3, true stored directly on the stack; a heap object with fields r = 7.1, i = 4.3 reached through a pointer.)
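The "passed by pointer value (object sharing)" bullet can be shown in plain Java; the class name `Cell` and method `update` below are illustrative, not from the slides:

```java
// Sketch of pass-by-pointer-value: the callee can mutate the shared object,
// but reassigning its own copy of the pointer does not affect the caller.
public class ShareDemo {
    static class Cell { int val; Cell(int v) { val = v; } }

    static void update(Cell c) {
        c.val = 42;        // visible to the caller: both names share one object
        c = new Cell(0);   // NOT visible: only the local copy of the pointer changes
    }

    public static void main(String[] args) {
        Cell c = new Cell(7);
        update(c);
        System.out.println(c.val);   // prints 42
    }
}
```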
12. Java Object Example

    class Complex {
      private double real;
      private double imag;
      public Complex(double r, double i) {
        real = r; imag = i;
      }
      public Complex add(Complex c) {
        return new Complex(c.real + real, c.imag + imag);
      }
      public double getReal() { return real; }
      public double getImag() { return imag; }
    }

    Complex c = new Complex(7.1, 4.3);
    c = c.add(c);
    class VisComplex extends Complex { ... }
13. Immutable Classes in Titanium
- For small objects, would sometimes prefer
  - to avoid the level of indirection
  - pass by value (copying of entire object)
  - especially when immutable -- fields never modified
    - extends the idea of primitive values to user-defined values
- Titanium introduces immutable classes
  - all fields are final (implicitly)
  - cannot inherit from or be inherited by other classes
  - needs to have a 0-argument constructor
- Note: considering allowing mutation in future
14. Example of Immutable Classes
- The immutable complex class is nearly the same:

    immutable class Complex {
      Complex() { real = 0; imag = 0; }
      ...
    }

- Use of immutable complex values:

    Complex c1 = new Complex(7.1, 4.3);
    Complex c2 = new Complex(2.5, 9.0);
    c1 = c1.add(c2);

- Similar to structs in C in terms of performance
- Zero-argument constructor required; the new keyword and the rest are unchanged. No assignment to fields outside of constructors.
15. Arrays in Java
(Figure: a Java 2d array as an array of pointers to separate 1d arrays.)
- Arrays in Java are objects
- Only 1D arrays are directly supported
- Multidimensional arrays are slow
- Subarrays are important in AMR (e.g., interior of a grid)
  - Even C and C++ don't support these well
  - Hand-coding (array libraries) can confuse the optimizer
16. Multidimensional Arrays in Titanium
- New multidimensional array added to Java
  - One array may be a subarray of another
    - e.g., a is interior of b, or a is all even elements of b
  - Indexed by Points (tuples of ints)
  - Constructed over a set of Points, called Rectangular Domains (RectDomains)
  - Points, Domains and RectDomains are built-in immutable classes
- Support for AMR and other grid computations
  - domain operations: intersection, shrink, border
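The domain operations listed above can be sketched in Titanium syntax; the grid `b`, the `restrict` call, and the exact signatures are assumptions for illustration, not from the slides:

```
// Illustrative Titanium sketch (assumed signatures): take the interior
// of a grid by shrinking its domain, then iterate over just that region.
double [2d] b = new double [[0,0] : [n,n]];
RectDomain<2> interior = b.domain().shrink(1);  // peel one boundary layer
double [2d] bInner = b.restrict(interior);      // view of b's interior
foreach (p in interior) {
  bInner[p] = 0.0;
}
```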
17. Unordered Iteration
- Memory hierarchy optimizations are essential
- Compilers can sometimes do these, but hard in general
- Titanium adds unordered iteration on rectangular domains:
    foreach (p in r) { ... }
  - p is a Point
  - r is a RectDomain or Domain
- Foreach simplifies bounds checking as well
- Additional operations on domains to subset and transform
- Note: foreach is not a parallelism construct
18. Point, RectDomain, Arrays in General
- Points specified by a tuple of ints
- RectDomains given by 3 points:
  - lower bound, upper bound (and optional stride)
- Array declared by dimensions and type
- Array created by passing a RectDomain
19. Simple Array Example

    Point<2> lb = [1,1];
    Point<2> ub = [10,20];
    RectDomain<2> r = [lb : ub];
    double [2d] a = new double [r];          // no array allocation here (above)
    double [2d] b = new double [[1,1] : [10,20]];   // syntactic sugar
    double [2d] c = new double [lb : ub : [1,1]];   // optional stride
    for (int i = 1; i <= 10; i++)
      for (int j = 1; j <= 20; j++)
        c[i,j] = a[i,j] + b[i,j];
20. Naïve MatMul with Titanium Arrays

    public static void matMul(double [2d] a, double [2d] b,
                              double [2d] c) {
      int n = c.domain().max()[1];  // assumes square
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          for (int k = 0; k < n; k++) {
            c[i,j] += a[i,k] * b[k,j];
          }
        }
      }
    }
21. Better MatMul with Titanium Arrays

    public static void matMul(double [2d] a, double [2d] b,
                              double [2d] c) {
      foreach (ij within c.domain()) {
        double [1d] aRowi = a.slice(1, ij[1]);
        double [1d] bColj = b.slice(2, ij[2]);
        foreach (k within aRowi.domain()) {
          c[ij] += aRowi[k] * bColj[k];
        }
      }
    }

- Current performance comparable to 3 nested loops in C
- Future: automatic blocking for memory hierarchy (Geoff Pike's PhD thesis)
22. Array Performance Issues
- Array representation is fast, but access methods can be slow, e.g., bounds checking, strides
- Compiler optimizes these:
  - common subexpression elimination
  - eliminate (or hoist) bounds checking
  - strength reduction: e.g., naïve code has 1 divide per dimension for each array access
- Currently within +/- 20% of C/Fortran for large loops
- Future: small loop and cache optimizations
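The strength reduction mentioned above can be sketched by hand in plain Java (the class and method names are illustrative): the per-access index arithmetic `i * ncols + j` is replaced by a running offset, so the inner loop does no multiplies.

```java
// Hand-written sketch of strength reduction on a row-major 2d array
// stored in a flat 1d buffer. Both methods compute the same sum.
public class StrengthReduce {
    static double sumNaive(double[] a, int nrows, int ncols) {
        double s = 0.0;
        for (int i = 0; i < nrows; i++)
            for (int j = 0; j < ncols; j++)
                s += a[i * ncols + j];   // multiply on every access
        return s;
    }

    static double sumReduced(double[] a, int nrows, int ncols) {
        double s = 0.0;
        int off = 0;                     // running offset replaces the multiply
        for (int i = 0; i < nrows; i++)
            for (int j = 0; j < ncols; j++, off++)
                s += a[off];
        return s;
    }

    public static void main(String[] args) {
        double[] a = {0, 1, 2, 3, 4, 5};
        System.out.println(sumNaive(a, 2, 3) == sumReduced(a, 2, 3)); // prints true
    }
}
```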
23. Sequential Performance
(Performance chart omitted.)
- Performance results from '98; new IR and optimization framework almost complete.
24. Lecture Outline
- Language and compiler support for uniprocessor performance
- Language support for ease of programming
  - Templates
  - Operator overloading (example later)
- Language support for parallel computation
- Applications and application-level libraries
- Summary and future directions
25. Lecture Outline
- Language and compiler support for uniprocessor performance
- Language support for parallel computation
  - SPMD execution
  - Barriers and single
  - Explicit Communication
  - Implicit Communication (global and local references)
  - More on Single
  - Synchronized methods and blocks (as in Java)
- Applications and application-level libraries
- Summary and future directions
26. SPMD Execution Model
- Java programs can be run as Titanium, but the result will be that all processors do all the work
- E.g., parallel hello world:

    class HelloWorld {
      public static void main (String[] argv) {
        System.out.println("Hello from proc " + Ti.thisProc());
      }
    }

- Any non-trivial program will have communication and synchronization
27. SPMD Model
- All processors start together and execute the same code, but not in lock-step
- Basic control done using
  - Ti.numProcs() -- total number of processors
  - Ti.thisProc() -- number of executing processor
- Bulk-synchronous style:

    read all particles and compute forces on mine
    Ti.barrier()
    write to my particles using new forces
    Ti.barrier()

- This is neither message passing nor data-parallel
28. Barriers and Single
- A common source of bugs is barriers or other global operations inside branches or loops
  - barrier, broadcast, reduction, exchange
- A single method is one called by all procs
  - public single static void allStep(...)
- A single variable has the same value on all procs
  - int single timestep = 0;
- The single annotation on methods (also called "sglobal") is optional, but useful for understanding compiler messages.
29. Explicit Communication: Broadcast
- Broadcast is a one-to-all communication:
    broadcast <value> from <processor>
- For example:

    int count = 0;
    int allCount = 0;
    if (Ti.thisProc() == 0) count = computeCount();
    allCount = broadcast count from 0;

- The processor number in the broadcast must be single; all constants are single.
- The allCount variable could be declared single.
30. Example of Data Input
- Same example, but reading from the keyboard
- Shows use of Java exceptions

    int single myCount = 0;
    int allCount = 0;
    if (Ti.thisProc() == 0) {
      try {
        DataInputStream kb = new DataInputStream(System.in);
        myCount = Integer.valueOf(kb.readLine()).intValue();
      } catch (Exception e) {
        System.err.println("Illegal Input");
      }
    }
    allCount = broadcast myCount from 0;
31. Global Address Space
- References (pointers) may be remote
  - useful in building adaptive meshes
  - easy to port shared-memory programs
  - uniform programming model across machines
- Global pointers are more expensive than local ones
  - True even when data is on the same processor
  - space (processor number + memory address)
  - dereference time (check to see if local)
- Use local declarations in critical sections
32. Example: A Distributed Data Structure
- Data can be accessed across processor boundaries
(Figure: each processor holds its own local_grids, while all_grids points to every processor's grids.)
33. Example: Setting Boundary Conditions

    foreach (l in local_grids.domain()) {
      foreach (a in all_grids.domain()) {
        local_grids[l].copy(all_grids[a]);
      }
    }
34. Explicit Communication: Exchange
- To create shared data structures
  - each processor builds its own piece
  - pieces are exchanged (for an object, just exchange pointers)
- Exchange primitive in Titanium:

    int [1d] single allData;
    allData = new int [0 : Ti.numProcs()-1];
    allData.exchange(Ti.thisProc() * 2);

- E.g., on 4 procs, each will have a copy of allData
35. Building Distributed Structures
- Distributed structures are built with exchange:

    class Boxed {
      public Boxed (int j) { val = j; }
      public int val;
    }

    Object [1d] single allData;
    allData = new Object [0 : Ti.numProcs()-1];
    allData.exchange(new Boxed(Ti.thisProc()));
36. Distributed Data Structures
- Building distributed arrays:

    RectDomain<1> single allProcs = [0 : Ti.numProcs()-1];
    RectDomain<1> myParticleDomain = [0 : myPartCount-1];
    Particle [1d] single [1d] allParticle =
        new Particle [allProcs][1d];
    Particle [1d] myParticle =
        new Particle [myParticleDomain];
    allParticle.exchange(myParticle);

- Now each processor has an array of pointers, one to each processor's chunk of particles
37. More on Single
- Global synchronization needs to be controlled:

    if (this processor owns some data) {
      compute on it
      barrier
    }

- Hence the use of single variables in Titanium
- If a conditional or loop block contains a barrier, all processors must execute it
  - conditions in such loops, if statements, etc. must contain only single variables
38. Single Variable Example
- Barriers and single in an N-body simulation:

    class ParticleSim {
      public static void main (String[] argv) {
        int single allTimestep = 0;
        int single allEndTime = 100;
        for (; allTimestep < allEndTime; allTimestep++) {
          // read all particles and compute forces on mine
          Ti.barrier();
          // write to my particles using new forces
          Ti.barrier();
        }
      }
    }

- Single methods inferred; see David Gay's work
39. Use of Global / Local
- As seen, references (pointers) may be remote
  - easy to port shared-memory programs
- Global pointers are more expensive than local ones
  - True even when data is on the same processor
  - Use local declarations in critical sections
- Costs of global:
  - space (processor number + memory address)
  - dereference time (check to see if local)
- May declare references as local
  - Compiler will automatically infer them when possible
40. Global Address Space
- Processes allocate locally
- References can be passed to other processes
(Figure: Process 0 and other processes, each with its own LOCAL HEAP.)

    class C { int val; ... }
    C gv;          // global pointer
    C local lv;    // local pointer
    if (thisProc() == 0) lv = new C();
    gv = broadcast lv from 0;
    gv.val = ...;
    ... = gv.val;
41. Local Pointer Analysis
- Compiler can infer many uses of local
  - See Liblit's work on Local Qualification Inference
- Data structures must be well partitioned
42. Lecture Outline
- Language and compiler support for uniprocessor performance
- Language support for ease of programming
- Language support for parallel computation
- Applications and application-level libraries
  - Gene sequencing application
  - Heart simulation
  - AMR elliptic and hyperbolic solvers
  - Scalable Poisson for infinite domains
  - Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join
- Summary and future directions
43. Unstructured Mesh Kernel
- EM3D: relaxation on a 3D unstructured mesh
- Speedup on Ultrasparc SMP
- Simple kernel: mesh not partitioned.
44. AMR Poisson
- Poisson Solver [Semenzato, Pike, Colella]
  - 3D AMR
  - finite domain
  - variable coefficients
  - multigrid across levels
- Performance of Titanium implementation:
  - Sequential multigrid performance within +/- 20% of Fortran
  - On a fixed, well-balanced problem of 8 patches, each 72^3:
    - parallel speedups of 5.5 on 8 processors
45. Scalable Poisson Solver
- MLC for finite differences by Balls and Colella
- Poisson equation with infinite boundaries
  - arises in astrophysics, some biological systems, etc.
- Method is scalable
  - Low communication
- Performance on
  - SP2 (shown) and T3E
  - scaled speedups
  - nearly ideal (flat)
- Currently 2D and non-adaptive
46. AMR Gas Dynamics
- Developed by McCorquodale and Colella
- Merge with Poisson underway for self-gravity
- 2D Example (3D supported)
  - Mach-10 shock on a solid surface at an oblique angle
- Future: self-gravitating gas dynamics package
47. Distributed Array Libraries
- There are some standard distributed array libraries associated with Titanium
- They hide the details of exchange, indirection within the data structure, etc.
- Libraries benefit from support for templates
48. Distributed Array Library Fragment

    template <class T, int single arity> public class DistArray {
      RectDomain<arity> single rd;
      T [arity d][arity d] subMatrices;
      RectDomain<arity> [arity d] single subDomains;
      ...
      /* Sets the element at p to value */
      public void set (Point<arity> p, T value) {
        getHomingSubMatrix(p)[p] = value;
      }
    }

    template DistArray<double, 2> single A = new
        template DistArray<double, 2>([[0, 0] : [aHeight, aWidth]]);
49. Immersed Boundary Method (future)
- Immersed boundary method [Peskin, McQueen]
- Used in heart model, platelets, and others
- Currently uses FFT for the Navier-Stokes solver
- Begun effort to move solver and full method into Titanium
50. Implementation
- Strategy
  - Titanium into C
  - Solaris or Posix threads for SMPs
  - Lightweight communication for MPPs/Clusters
- Status: Titanium runs on
  - Solaris or Linux SMPs and uniprocessors
  - Berkeley NOW
  - SDSC Tera, SP2, T3E (also NERSC)
  - SP3 port underway
51. Using Titanium on NPACI Machines
- Send mail to us if you are interested
  - titanium-group@cs.berkeley.edu
- Has been installed in individual accounts
  - T3E and BH upgrade needed
- On uniprocessors and SMPs
  - available from the Titanium home page
  - http://www.cs.berkeley.edu/projects/titanium
  - other documentation available as well
52. Calling Other Languages
- We have built interfaces to
  - PETSc: scientific library for finite element applications
  - Metis: graph partitioning library
  - KeLP: starting work on this
- Two issues with cross-language calls
  - accessing Titanium data structures (arrays) from C
    - possible because Titanium arrays have the same format on the inside
  - having a common message layer
    - Titanium is built on lightweight communication
53. Future Plans
- Improved compiler optimizations for scalar code
  - large loops are currently within +/- 20% of Fortran
  - working on small loop performance
- Packaged solvers written in Titanium
  - Elliptic and hyperbolic solvers, both regular and adaptive
- New application collaboration
  - Peskin and McQueen (NYU) with Colella (LBNL)
  - Immersed boundary method, currently used for heart simulation, platelet coagulation, and others
54. Backup Slides
55. Example: Domain
- Domains in general are not rectangular
- Built using set operations
  - union (+)
  - intersection (*)
  - difference (-)
- Example is a red-black algorithm:

    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> r = [lb : ub : [2, 2]];
    ...
    Domain<2> red = r + (r + [1, 1]);
    foreach (p in red) { ... }

(Figure: r spans [0,0] to [6,4] with stride [2,2]; r + [1,1] spans [1,1] to [7,5]; red is their union.)
56. Example using Domains and foreach
- Gauss-Seidel red-black computation in multigrid:

    void gsrb() {
      boundary(phi);
      for (Domain<2> d = red; d != null;
           d = (d == red ? black : null)) {
        foreach (q in d) {   // unordered iteration
          res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)]) * 4
                    + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                    - 20.0 * phi[q] - k * rhs[q]) * 0.05;
        }
        foreach (q in d) phi[q] += res[q];
      }
    }
57. Recent Progress in Titanium
- Distributed data structures built with global refs
  - communication may be implicit, e.g. a[j] = a[i].dx;
  - used extensively in AMR algorithms
- Runtime layer optimizes
  - bulk communication
  - bulk I/O
- Runs on
  - T3E, SP2, and Tera
- Compiler analysis optimizes
  - global references converted to local ones when possible
58. Consistency Model
- Titanium adopts the Java memory consistency model
- Roughly: accesses to shared variables that are not synchronized have undefined behavior
- Use synchronization to control access to shared variables
  - barriers
  - synchronized methods and blocks
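A plain-Java sketch of the synchronized-block mechanism Titanium inherits (the class name and counts are illustrative): without the synchronized block, two threads incrementing a shared counter could interleave reads and writes; with it, the result is deterministic.

```java
// Two threads each add 100000 to a shared counter under a lock.
public class SyncCounter {
    static int count = 0;
    static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = new Runnable() {
            public void run() {
                for (int i = 0; i < 100000; i++) {
                    synchronized (lock) {   // serializes the read-modify-write
                        count++;
                    }
                }
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(count);   // always 200000
    }
}
```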
59. Compiler Techniques Outline
- Analysis and optimization of parallel code
  - Tolerate network latency: Split-C experience
  - Hardware trends and reordering
  - Semantics: sequential consistency
  - Cycle detection: parallel dependence analysis
  - Synchronization analysis: parallel flow analysis
- Summary and future directions
60. Parallel Optimizations
- Titanium compiler performs parallel optimizations
  - communication overlap and aggregation
- New analyses:
  - synchronization analysis: the parallel analog to control flow analysis for serial code [Gay & Aiken]
  - shared variable analysis: the parallel analog to dependence analysis [Krishnamurthy & Yelick]
  - local qualification inference: automatically inserts local qualifiers [Liblit & Aiken]
61. Split-C Experience: Latency Overlap
- Titanium borrowed ideas from Split-C
  - global address space
  - SPMD parallelism
- But Split-C had non-blocking accesses built in to tolerate network latency on remote read/write
- Also one-way communication
- Conclusion: useful, but complicated

    int *global p;
    x := *p;          /* get */
    *p := 3;          /* put */
    sync();           /* wait for my puts/gets */
    *p :- x;          /* store */
    all_store_sync(); /* wait globally */
62. Sources of Memory/Comm. Overlap
- Would like compiler to introduce put/get/store
- Hardware also reorders
  - out-of-order execution
  - write buffered with read by-pass
  - non-FIFO write buffers
  - weak memory models in general
- Software already reorders too
  - register allocation
  - any code motion
- System provides enforcement primitives
  - e.g., memory fence, volatile, etc.
  - tend to be heavyweight and have unpredictable performance
- Can the compiler hide all this?
63. Semantics: Sequential Consistency
- When compiling sequential programs, reordering

    x = expr1;            y = expr2;
    y = expr2;    -->     x = expr1;

  is valid if y is not in expr1 and x is not in expr2 (roughly)
- When compiling parallel code, this is not a sufficient test:

    Initially flag = data = 0
    Proc A              Proc B
    data = 1;           while (flag != 1) { }
    flag = 1;           ... = ...data...;
64. Cycle Detection: Dependence Analog
- Processors define a "program order" on accesses from the same thread
  - P is the union of these total orders
- The memory system defines an "access order" on accesses to the same variable
  - A is the access order (read/write and write/write pairs)
- A violation of sequential consistency is a cycle in P U A
- Intuition: time cannot flow backwards
65. Cycle Detection
- Generalizes to arbitrary numbers of variables and processors
- Cycles may be arbitrarily long, but it is sufficient to consider only cycles with 1 or 2 consecutive stops per processor
(Figure: accesses write x, write y on one processor and read y, read y, write x on another, with edges forming a cycle.)
66. Static Analysis for Cycle Detection
- Approximate P by the control flow graph
- Approximate A by undirected "dependence" edges
- Let the "delay set" D be all edges from P that are part of a minimal cycle
- The execution order of D edges must be preserved; other P edges may be reordered (modulo usual rules about serial code)
- Synchronization analysis is also critical
(Figure: accesses write z, write y, read y on one processor and read x, read x, write z on another, with P and A edges forming cycles.)
67. Communication Optimizations
- Implemented in a subset of C with limited pointers [Krishnamurthy, Yelick]
- Experiments on the NOW; 3 synchronization styles
- Future: pointer analysis and optimizations for AMR [Jeh, Yelick]
(Chart of normalized time omitted.)
68. End of Compiling Parallel Code