1
Compiling for Parallel Machines
Kathy Yelick
  • CS264

2
Two General Research Goals
  • Correctness: help programmers eliminate bugs
  • Analysis to detect bugs statically (and
    conservatively)
  • Tools such as debuggers to help detect bugs
    dynamically
  • Performance: help make programs run faster
  • Static compiler optimizations
  • May use analyses similar to above to ensure
    compiler is correctly transforming code
  • In many areas, the open problem is determining
    which transformations should be applied when
  • Link or load-time optimizations, including object
    code translation
  • Feedback-directed optimization
  • Runtime optimization
  • For parallel machines, if you can't get good
    performance, what's the point?

3
A Little History
  • Most research on compiling for parallel machines
    is
  • automatic parallelization of serial code
  • loop-level parallelization (usually Fortran)
  • Most parallel programs are written using explicit
    parallelism, either
  • A) message passing with a single program, multiple
    data (SPMD) model, usually MPI with either Fortran
    or mixed C and Fortran, for scientific applications
  • B) shared memory with a thread and synchronization
    library in C or Java, for non-scientific
    applications
  • Option B is easier to program, but requires
    hardware support that is still unproven for more
    than 200 processors

4
Titanium Overview
  • Give programmers a global address space
  • Useful for building large complex data structures
    that are spread over the machine
  • But, don't pretend it will have uniform access
    time (i.e., not quite shared memory)
  • Use an explicit parallelism model
  • SPMD for simplicity
  • Extend a standard language with data structures
    for specific problem domain, grid-based
    scientific applications
  • Small amount of syntax added for ease of
    programming
  • General idea: build domain-specific features into
    the language and optimization framework

5
Titanium Goals
  • Performance
  • close to C/FORTRAN + MPI, or better
  • Portability
  • develop on uniprocessor, then SMP, then
    MPP/Cluster
  • Safety
  • as safe as Java, extended to parallel framework
  • Expressiveness
  • close to usability of threads
  • add minimal set of features
  • Compatibility, interoperability, etc.
  • no gratuitous departures from Java standard

6
Titanium Goals
  • Performance
  • close to C/FORTRAN + MPI, or better
  • Safety
  • as safe as Java, extended to parallel framework
  • Expressiveness
  • close to usability of threads
  • add minimal set of features
  • Compatibility, interoperability, etc.
  • no gratuitous departures from Java standard

7
Titanium
  • Take the best features of threads and MPI
  • global address space like threads (for ease of
    programming)
  • SPMD parallelism like MPI (for performance)
  • local/global distinction, i.e., layout matters
    (for performance)
  • Based on Java, a cleaner C
  • classes, memory management
  • Language is extensible through classes
  • domain-specific language extensions
  • current support for grid-based computations,
    including AMR
  • Optimizing compiler
  • communication and memory optimizations
  • synchronization analysis
  • cache and other uniprocessor optimizations

8
New Language Features
  • Scalable parallelism
  • SPMD model of execution with global address space
  • Multidimensional arrays
  • points and index sets as first-class values to
    simplify programs
  • iterators for performance
  • Checked Synchronization
  • single-valued variables and globally executed
    methods
  • Global Communication Library
  • Immutable classes
  • user-definable non-reference types for
    performance
  • Operator overloading
  • by demand from our user community
  • Semi-automated zone-based memory management
  • as safe as a garbage-collected language
  • better parallel performance and scalability

9
Lecture Outline
  • Language and compiler support for uniprocessor
    performance
  • Immutable classes
  • Multidimensional Arrays
  • foreach
  • Language support for parallel computation
  • Analysis of parallel code
  • Summary and future directions

10
Java: A Cleaner C
  • Java is an object-oriented language
  • classes (no standalone functions) with methods
  • inheritance between classes; multiple interface
    inheritance only
  • Documentation on web at java.sun.com
  • Syntax similar to C
        class Hello {
          public static void main(String[] argv) {
            System.out.println("Hello, world!");
          }
        }
  • Safe
  • Strongly typed: checked at compile time, no
    unsafe casts
  • Automatic memory management
  • Titanium is an (almost) strict superset

11
Java Objects
  • Primitive scalar types: boolean, double, int,
    etc.
  • implementations will store these on the program
    stack
  • access is fast -- comparable to other languages
  • Objects: user-defined and from the standard
    library
  • passed by pointer value (object sharing) into
    functions
  • have an implicit level of indirection (a pointer
    to the object)
  • simple model, but inefficient for small objects

[Figure: primitives (2.6, 3, true) stored directly; a
Complex object with fields r = 7.1, i = 4.3 reached
through a pointer]
12
Java Object Example
        class Complex {
          private double real;
          private double imag;
          public Complex(double r, double i) {
            real = r; imag = i;
          }
          public Complex add(Complex c) {
            return new Complex(c.real + real, c.imag + imag);
          }
          public double getReal() { return real; }
          public double getImag() { return imag; }
        }

        Complex c = new Complex(7.1, 4.3);
        c = c.add(c);

        class VisComplex extends Complex { ... }

13
Immutable Classes in Titanium
  • For small objects, would sometimes prefer
  • to avoid level of indirection
  • pass by value (copying of entire object)
  • especially when objects are immutable -- fields
    are unchangeable
  • extends the idea of primitive values (1, 4.2,
    etc.) to user-defined values
  • Titanium introduces immutable classes
  • all fields are final (implicitly)
  • cannot inherit from (extend) or be inherited by
    other classes
  • must have a 0-argument constructor, e.g.,
    Complex()
        immutable class Complex { ... }
        Complex c = new Complex(7.1, 4.3);
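  • A fuller sketch, reusing the Complex fields from
    slide 12 (the explicit 0-argument constructor here
    is one way to meet the requirement above):
        immutable class Complex {
          public double real;   // implicitly final
          public double imag;
          public Complex() { real = 0.0; imag = 0.0; }
          public Complex(double r, double i) { real = r; imag = i; }
          public Complex add(Complex c) {
            return new Complex(c.real + real, c.imag + imag);
          }
        }
        Complex c = new Complex(7.1, 4.3);  // copied by value; no indirection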

14
Arrays in Java
  • Arrays in Java are objects
  • Only 1D arrays are directly supported
  • Array bounds are checked
  • Multidimensional arrays as arrays-of-arrays are
    slow
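  • E.g., in plain Java each row is a separate object,
    so every a[i][j] costs two dependent loads plus two
    bounds checks:
        double[][] a = new double[n][n];  // one spine object + n row objects
        for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++)
            a[i][j] = 0.0;                // load a, bounds-check i, load a[i],
                                          // bounds-check j, then store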

15
Multidimensional Arrays in Titanium
  • New kind of multidimensional array added
  • Two arrays may overlap (unlike Java arrays)
  • Indexed by Points (tuple of ints)
  • Constructed over a set of Points, called Domains
  • RectDomains are a special case of domains
  • Points, Domains and RectDomains are built-in
    immutable classes
  • Support for adaptive meshes and other mesh/grid
    operations

    RectDomain<2> d = [0:n, 0:n];
    Point<2> p = [1, 2];
    double [2d] a = new double[d];
    a[0, 0] = a[9, 9];
16
Naïve MatMul with Titanium Arrays
    public static void matMul(double [2d] a, double [2d] b,
                              double [2d] c) {
      int n = c.domain().max()[1];   // assumes square
      for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
          for (int k = 0; k < n; k++)
            c[i,j] += a[i,k] * b[k,j];
    }

17
Two Performance Issues
  • In any language, uniprocessor performance is
    often dominated by memory hierarchy costs
  • algorithms that are blocked for the memory
    hierarchy (caches and registers) can be much
    faster
  • In Titanium, the representation of arrays is
    fast, but the access methods are expensive
  • need optimizations on Titanium arrays
  • common subexpression elimination
  • eliminate (or hoist) bounds checking
  • strength reduction: e.g., naïve code has 1 divide
    per dimension for each array access (sketch below)
  • See Geoff Pike's work
  • goal: competitive with C/Fortran performance, or
    better
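  • As an illustration of the strength-reduction point
    (a generic sketch over a linearized array, not
    Titanium compiler output):
        // before: index arithmetic recomputed on every access
        for (int j = 0; j < n; j++)
          sum += data[i * n + j];

        // after strength reduction: the offset advances additively
        for (int j = 0, idx = i * n; j < n; j++, idx++)
          sum += data[idx];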

18
Matrix Multiply (blocked, or tiled)
  • Consider A, B, C to be N-by-N matrices of b-by-b
    subblocks, where b = n/N is called the blocksize
        for i = 1 to N
          for j = 1 to N
            read block C(i,j) into fast memory
            for k = 1 to N
              read block A(i,k) into fast memory
              read block B(k,j) into fast memory
              C(i,j) = C(i,j) + A(i,k) * B(k,j)  // matrix multiply on blocks
            write block C(i,j) back to slow memory

[Figure: one block update, C(i,j) = C(i,j) + A(i,k) * B(k,j)]
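  • A plain-Java sketch of the same tiled loop
    (arrays-of-arrays for brevity; assumes n is a
    multiple of the block size bs):
        static void blockedMatMul(double[][] a, double[][] b,
                                  double[][] c, int n, int bs) {
          for (int i = 0; i < n; i += bs)
            for (int j = 0; j < n; j += bs)
              for (int k = 0; k < n; k += bs)
                // C(i,j) += A(i,k) * B(k,j) on one bs-by-bs block
                for (int ii = i; ii < i + bs; ii++)
                  for (int jj = j; jj < j + bs; jj++) {
                    double s = c[ii][jj];
                    for (int kk = k; kk < k + bs; kk++)
                      s += a[ii][kk] * b[kk][jj];
                    c[ii][jj] = s;
                  }
        }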
19
Memory Hierarchy Optimizations: MatMul
[Figure: speed of n-by-n matrix multiply on a Sun
Ultra-1/170, peak 330 MFlops]
20
Unordered iteration
  • Often useful to reorder iterations for caches
  • Compilers can do this for simple operations,
    e.g., matrix multiply, but it is hard in general
  • Titanium adds unordered iteration on rectangular
    domains (sketch below)
        foreach (p within r) { ... }
  • p is a Point: a new point, scoped only within the
    foreach body
  • r is a previously-declared RectDomain
  • foreach simplifies bounds checking as well
  • Additional operations on domains and arrays to
    subset and transform
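  • E.g., a sketch of initializing an array in
    whatever order the compiler chooses:
        RectDomain<2> r = [0:n, 0:n];
        double [2d] a = new double[r];
        foreach (p within r) {    // iterations may run in any order
          a[p] = 0.0;
        }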

21
Better MatMul with Titanium Arrays
    public static void matMul(double [2d] a, double [2d] b,
                              double [2d] c) {
      foreach (ij within c.domain()) {
        double [1d] aRowi = a.slice(1, ij[1]);
        double [1d] bColj = b.slice(2, ij[2]);
        foreach (k within aRowi.domain()) {
          c[ij] += aRowi[k] * bColj[k];
        }
      }
    }
  • Current compiler eliminates array overhead,
    making it comparable to C performance for 3
    nested loops
  • Automatic tiling still TBD

22
Sequential Performance
Performance results are from '98; the new IR and
optimization framework is almost complete.
23
Lecture Outline
  • Language and compiler support for uniprocessor
    performance
  • Language support for parallel computation
  • SPMD execution
  • Global and local references
  • Communication
  • Barriers and single
  • Synchronized methods and blocks (as in Java)
  • Analysis of parallel code
  • Summary and future directions

24
SPMD Execution Model
  • Java programs can be run as Titanium, but the
    result will be that all processors do all the
    work
  • E.g., parallel hello world
        class HelloWorld {
          public static void main(String[] argv) {
            System.out.println("Hello from proc " +
                               Ti.thisProc());
          }
        }
  • Any non-trivial program will have communication
    and synchronization between processors

25
SPMD Execution Model
  • A common style is compute/communicate
  • E.g., in each timestep within a fish simulation
    with gravitational attraction
  • read all fish and compute forces on mine
  • Ti.barrier()
  • write to my fish using new forces
  • Ti.barrier()

26
SPMD Model
  • All processors start together and execute the same
    code, but not in lock-step
  • Sometimes they take different branches
        if (Ti.thisProc() == 0) { /* do setup */ }
        for (all data I own) { /* compute on data */ }
  • A common source of bugs is barriers or other global
    operations inside branches or loops
  • barrier, broadcast, reduction, exchange
  • A single method is one called by all procs
        public single static void allStep() { ... }
  • A single variable has the same value on all
    procs
        int single timestep = 0;

27
SPMD Execution Model
  • Barriers and single in FishSimulation (n-body)
        class FishSim {
          public static void main(String[] argv) {
            int single allTimestep = 0;
            int single allEndTime = 100;
            for (; allTimestep < allEndTime; allTimestep++) {
              // read all fish and compute forces on mine
              Ti.barrier();
              // write to my fish using new forces
              Ti.barrier();
            }
          }
        }
  • Single qualifiers can be inferred: see David Gay's
    work

28
Global Address Space
  • Processes allocate locally
  • References can be passed to other processes

[Figure: process 0 and the other processes each have
their own local heap within one global address space]
    class C { int val; ... }
    C gv;           // global pointer
    C local lv;     // local pointer
    if (Ti.thisProc() == 0) { lv = new C(); }
    gv = broadcast lv from 0;
    gv.val = ...;   // full
    ... = gv.val;   // functionality
29
Use of Global / Local
  • Default is global
  • easier to port shared-memory programs
  • performance bugs are common: global pointers are
    more expensive
  • harder to use sequential kernels
  • Use local declarations in critical sections
  • Compiler can infer many instances of local
  • See Liblit's work on LQI (Local Qualification
    Inference)

30
Local Pointer Analysis [Liblit, Aiken]
  • Global references simplify programming, but incur
    overhead even when the data is local
  • Split-C therefore requires that global pointers be
    declared explicitly
  • Titanium pointers are global by default: easier,
    better portability
  • Automatic local qualification inference (sketched
    below)
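  • Roughly, the effect of the inference (a
    hypothetical sketch; the local cast is only valid
    when the data really resides on this processor):
        double [2d] g = ...;                           // global by default
        double [2d] local l = (double [2d] local) g;   // later accesses skip the
                                                       // global-pointer overhead
        foreach (p in l.domain()) { l[p] += 1.0; }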

31
Parallel performance
  • Speedup on Ultrasparc SMP
  • AMR largely limited by
  • current algorithm
  • problem size
  • 2 levels, with top one serial
  • Not yet optimized with local for distributed
    memory

32
Lecture Outline
  • Language and compiler support for uniprocessor
    performance
  • Language support for parallel computation
  • Analysis and Optimization of parallel code
  • Tolerating network latency: the Split-C experience
  • Hardware trends and reordering
  • Semantics: sequential consistency
  • Cycle detection: parallel dependence analysis
  • Synchronization analysis: parallel flow analysis
  • Summary and future directions

33
Split-C Experience: Latency Overlap
  • Titanium borrowed ideas from Split-C
  • global address space
  • SPMD parallelism
  • But, Split-C had non-blocking accesses built in
    to tolerate network latency on remote reads/writes
  • Also one-way communication
  • Conclusion: useful, but complicated

    int *global p;
    x := *p;            /* get */
    *p := 3;            /* put */
    sync();             /* wait for my puts/gets */
    *p :- x;            /* store */
    all_store_sync();   /* wait globally */
34
Other sources of Overlap
  • Would like the compiler to introduce puts/gets/stores
  • Hardware also reorders
  • out-of-order execution
  • write buffers with read bypass
  • non-FIFO write buffers
  • weak memory models in general
  • Software already reorders too
  • register allocation
  • any code motion
  • System provides enforcement primitives
  • e.g., memory fence, volatile, etc.
  • tend to be heavyweight and have unpredictable
    performance
  • Can the compiler hide all this?

35
Semantics: Sequential Consistency
  • When compiling sequential programs, the classic
    reordering is
        x = expr1;    ==>    y = expr2;
        y = expr2;           x = expr1;
  • Valid if y is not in expr1 and x is not in expr2
    (roughly)
  • When compiling parallel code, this is not a
    sufficient test:

    Initially flag = data = 0
        Proc A              Proc B
        data = 1;           while (flag != 1) { }
        flag = 1;           ... = ...data...;
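  • In today's Java terms, one fix is to make flag
    volatile, which forbids the reordering above (a
    sketch; volatile accesses carry their own cost):
        class Handoff {
          int data = 0;
          volatile boolean flag = false;  // data write ordered before flag write

          void procA() { data = 1; flag = true; }
          void procB() { while (!flag) { /* spin */ } int d = data; }
        }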
36
Cycle Detection: Dependence Analog
  • Processors define a program order on accesses
    from the same thread
  • P is the union of these total orders
  • The memory system defines an access order on
    accesses to the same variable
  • A is the access order (read/write and
    write/write pairs)
  • A violation of sequential consistency is a cycle in
    P U A
  • Intuition: time cannot flow backwards

37
Cycle Detection
  • Generalizes to arbitrary numbers of variables and
    processors
  • Cycles may be arbitrarily long, but it is
    sufficient to consider only cycles with 1 or 2
    consecutive stops per processor [Shasha, Snir]

[Figure: example cycle in P U A among writes and reads
of x and y across processors]
38
Static Analysis for Cycle Detection
  • Approximate P by the control flow graph
  • Approximate A by undirected dependence edges
  • Let the delay set D be all edges from P that
    are part of a minimal cycle (a toy sketch follows
    the figure below)
  • The execution order of D edges must be preserved;
    other P edges may be reordered (modulo the usual
    rules about serial code)
  • Synchronization analysis is also critical
    [Krishnamurthy]

[Figure: three processors' access pairs: write z, read x;
write y, read x; read y, write z]
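  • A toy illustration of the edge test (a sketch, not
    the compiler's algorithm; it over-approximates by
    flagging any P edge on a cycle, not only minimal
    ones):
        import java.util.*;

        class DelaySet {
          // pEdges: directed program-order edges {u, v};
          // aEdges: conflict edges, treated as undirected; n: node count
          static List<int[]> delaySet(List<int[]> pEdges,
                                      List<int[]> aEdges, int n) {
            List<List<Integer>> g = new ArrayList<>();
            for (int i = 0; i < n; i++) g.add(new ArrayList<Integer>());
            for (int[] e : pEdges) g.get(e[0]).add(e[1]);
            for (int[] e : aEdges) {
              g.get(e[0]).add(e[1]);
              g.get(e[1]).add(e[0]);
            }
            List<int[]> delay = new ArrayList<>();
            for (int[] e : pEdges)           // (u,v) is a delay edge if v can
              if (reaches(g, e[1], e[0]))    // reach u again, closing a cycle
                delay.add(e);
            return delay;
          }

          static boolean reaches(List<List<Integer>> g, int from, int to) {
            boolean[] seen = new boolean[g.size()];
            Deque<Integer> work = new ArrayDeque<Integer>();
            work.push(from);
            while (!work.isEmpty()) {
              int x = work.pop();
              if (x == to) return true;
              if (seen[x]) continue;
              seen[x] = true;
              for (int y : g.get(x)) work.push(y);
            }
            return false;
          }
        }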
39
Automatic Communication Optimization
  • Implemented in a subset of C with limited pointers
    [Krishnamurthy, Yelick]
  • Experiments on the NOW with 3 synchronization
    styles
  • Future: pointer analysis and optimizations for
    AMR [Jeh, Yelick]

40
Other Language Extensions
  • Java extensions for expressiveness and performance
  • Operator overloading
  • Zone-based memory management
  • Foreign function interface
  • The following is not yet implemented in the
    compiler
  • Parameterized types (aka templates)

41
Implementation
  • Strategy
  • compile Titanium into C
  • Solaris or Posix threads for SMPs
  • Active Messages (Split-C library) for
    communication
  • Status
  • runs on SUN Enterprise 8-way SMP
  • runs on Berkeley NOW
  • runs on the Tera (not fully tested)
  • T3E port partially working
  • SP2 port under way

42
Titanium Status
  • Titanium language definition complete.
  • Titanium compiler running.
  • Compiles for uniprocessors, NOW, Tera, T3E, SMPs,
    SP2 (under way).
  • Application developments ongoing.
  • Lots of research opportunities.

43
Future Directions
  • Super optimizers for targeted kernels
  • e.g., PHiPAC, Sparsity, FFTW, and ATLAS
  • include feedback and some runtime information
  • New application domains
  • unstructured grids (aka graphs and sparse
    matrices)
  • I/O-intensive applications such as information
    retrieval
  • Optimizing I/O as well as communication
  • uniform treatment of memory hierarchy
    optimizations
  • Performance heterogeneity from the hardware
  • related to dynamic load balancing in software
  • Reasoning about parallel code
  • correctness analysis: race condition and
    synchronization analysis
  • better analysis: aliases and threads
  • Java memory model and hiding the hardware model

44
Backup Slides
45
Point, RectDomain, Arrays in General
  • Points specified by a tuple of ints
  • RectDomains given by
  • lower bound point
  • upper bound point
  • stride point
  • Array given by RectDomain and element type

    Point<2> lb = [1, 1];
    Point<2> ub = [10, 20];
    RectDomain<2> r = [lb : ub : [2, 2]];
    double [2d] A = new double[r];
    ...
    foreach (p in A.domain()) {
      A[p] = B[2 * p + [1, 1]];
    }
46
AMR Poisson
  • Poisson Solver [Semenzato, Pike, Colella]
  • 3D AMR: finite domain, variable coefficients,
    multigrid across levels
  • Performance of Titanium implementation
  • Sequential multigrid performance within +/- 20%
    of Fortran
  • On a fixed, well-balanced problem of 8 patches
    (72^3), parallel speedups of 5.5 on 8 processors

47
Distributed Data Structures
  • Build distributed data structures
  • broadcast or exchange
        RectDomain<1> single allProcs = [0 : Ti.numProcs() - 1];
        RectDomain<1> myFishDomain = [0 : myFishCount - 1];
        Fish [1d] single [1d] allFish =
            new Fish [allProcs][1d];
        Fish [1d] myFish = new Fish [myFishDomain];
        allFish.exchange(myFish);
  • Now each processor has an array of global
    pointers, one to each processor's chunk of fish

48
Consistency Model
  • Titanium adopts the Java memory consistency model
  • Roughly: accesses to shared variables that are not
    synchronized have undefined behavior
  • Use synchronization to control access to shared
    variables.
  • barriers
  • synchronized methods and blocks
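  • E.g., a minimal sketch of guarding a shared
    counter so its accesses have defined behavior
    under this model:
        class Counter {
          private int count = 0;
          public synchronized void inc() { count++; }   // one thread at a time
          public synchronized int get() { return count; }
        }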

49
Example Domain
  • Domains in general are not rectangular
  • Built using set operations
  • union (+)
  • intersection (*)
  • difference (-)
  • Example is a red-black algorithm

    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> r = [lb : ub : [2, 2]];
    Domain<2> red = r + (r + [1, 1]);
    foreach (p in red) { ... }

[Figure: r runs from (0, 0) to (6, 4) with stride (2, 2);
r + [1, 1] runs from (1, 1) to (7, 5); red is their union]
50
Example using Domains and foreach
  • Gauss-Seidel red-black computation in multigrid

    void gsrb() {
      boundary(phi);
      for (Domain<2> d = red; d != null;
           d = (d == red ? black : null)) {
        foreach (q in d) {             // unordered iteration
          res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)]) * 4
                  + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                  - 20.0 * phi[q] - k * rhs[q]) * 0.05;
        }
        foreach (q in d) { phi[q] += res[q]; }
      }
    }
51
Applications
  • Three-D AMR Poisson Solver (AMR3D)
  • block-structured grids
  • 2000-line program
  • algorithm not yet fully implemented in other
    languages
  • tests performance and effectiveness of language
    features
  • Other 2D Poisson Solvers (under development)
  • infinite domains
  • based on the method of local corrections
  • Three-D Electromagnetic Waves (EM3D)
  • unstructured grids
  • Several smaller benchmarks