Title: Language and Compiler Support for Adaptive Mesh Refinement
1Language and Compiler Support for Adaptive Mesh
Refinement
Katherine Yelick Alex Aiken, Greg Balls, Dan
Bonachea, Phillip Colella, David Gay,
Susan Graham, Paul Hilfinger, Arvind
Krishnamurthy, Ben Liblit, Chang Sun Lin, Peter
McCorquodale, Carleton Miyamoto, Geoff Pike, Kar
Ming Tang, Siu Man Yau
2Multiscale Modeling Problems
- Many modeling problems in astrophysics, biology,
material science, and other areas require - Enormous range of spatial and temporal scales
- To solve interesting problems, one needs
- Adaptive methods
- Large scale parallel machines
- For Computational Fluid Dynamics, a popular
technique is - Adaptive Mesh Refinement (AMR)
3Block-Structured AMR
- Algorithms for many rectangular, grid-based
computations are - communication intensive
- memory intensive
- AMR makes these harder
- more small messages
- more complex data
structures - most of the programming effort is debugging the
boundary cases - locality and load balance trade-off is hard
4A Layered Attack on These Problems
- Algorithms
- Languages Titanium
- Java dialect compiled to C
- Extensions for serial programming
- Extensions for parallel programming
- Compilers
- Uniprocessor optimizations
- Communication optimizations
- Libraries
- Automatic tuning of kernels
- Architecture
5Algorithms for AMR
- Existing algorithms in Titanium
- 3D AMR Poisson solver
- 3D AMR Gas dynamics
- Domain-decomposition MLC Poisson
- Under development
- Self-gravitating gas dynamics (3D AMR)
- For stellar collapse, etc.
- Immersed boundary method (3D, non-adaptive)
- Peskin and MacQueens method for heart model,
etc. - Embedded boundaries
- Simulation of bio-MEMs devices and cellular level
modeling - All joint with Colellas group at LBNL
63D AMR Gas Dynamics
- Hyperbolic Solver McCorquodale and Colella
- Implementation of Berger-Colella algorithm
- Mesh generation algorithm included
- 2D Example (3D supported)
- Mach-10 shock on solid surface
at oblique
angle - Future Self-gravitating gas dynamics package
73D AMR Poisson
- Poisson Solver Semenzato, Pike, Colella
- finite domain
- variable
coefficients - multigrid
across levels - Currently synthetic grids, no grid generation
- Under construction
- reengineered to interface with hyperbolic solver
- including mesh generation
8MLC for Finite-Differences
- Poisson solver with infinite domains Colella,
Balls - Uses a Method of Local Corrections (MLC)
- Currently non-adaptive and 2D
- Supports only constant coefficients
- Uses 2-level, domain decomposition approach
- Fine-grid solutions are computed in parallel
- Information transferred to a coarse-grid and
solved serially - Fine-grid solutions is computed using boundary
conditions from the coarse grid - Future work includes 3D Adaptive version
9MLC for Finite-Differences
- Features of the method
- Solution is still second-order accurate
- Accuracy depends only weakly on the coarse-grid
spacing - Scalability
- No communication during fine-grid solves
- Single communication step (global all-to-all)
- coarse grid work is serial (replicated), but
relatively small - Future work extend to 3D and adaptive meshes
10Accuracy of the MLC for FD
- Solution is second-order accurate
- decreasing h by 2x, decreases error by 4x
11Error on High-Wavenumber Problem
- Charge is
- 1 charge of concentric waves
- 2 star-shaped charges.
- Largest error is where the charge is changing
rapidly. Note - discretization error
- faint decomposition error
- Run on 16 procs
12A Layered Attack on These Problems
- Algorithms
- Languages Titanium
- Java dialect compiled to C
- Extensions serial programming
- Extensions for parallel programming
- Compilers
- Uniprocessor optimizations
- Communication optimizations
- Libraries
- Automatic tuning of kernels
- Architecture
13Java for Scientific Computing
- Computational scientists work on increasingly
complex models - Popularized C features classes, overloading,
pointer-based data structures - But C is very complicated
- easy to lose performance and readability
- Java is a better C
- Safe strongly typed, garbage collected
- Much simpler to implement (research vehicle)
- Industrial interest as well IBM HP Java
14Java Objects
- Primitive scalar types boolean, double, int,
etc. - implementations store these in place
- access is fast -- comparable to other languages
- Objects user-defined and library
- passed by pointer value
- has level of indirection (pointer to) implicit
- simple model, but inefficient for small objects
2.6 3 true
Complex object
r 7.1 i 4.3
15Java Object Example
- class Complex
- private double real
- private double imag
- public Complex(double r, double i)
- real r imag i
- public Complex operator(Complex c)
- return new Complex(c.real real,
- c.imag imag)
- public double getReal return real
- public double getImag return imag
-
- Complex c new Complex(7.1, 4.3)
- c c c
- class VisComplex extends Complex ...
16Immutable Classes in Titanium
- For small objects, would sometimes prefer to
- avoid indirection, pass by copying
- especially when fields are immutable
- extends the idea of primitive values (1, 4.2,
etc.) to user-defined values - performs like structs in C
- Titanium introduces immutable classes
- all fields are implicitly final (constant)
- cannot inherit from (extend) or be inherited by
other classes - immutable class Complex ..unchanged..
- Complex c new Complex(7.1, 4.3)
17Arrays in Java
- Arrays in Java are objects
- Only 1D arrays are directly supported
- Multidimensional arrays are slow
2d array
- Subarrays are important in AMR
- e.g., interior of a grid
- Even C and C dont support these well
- Fortran has contiguous subarrays (LDA)
- Hand-coding can confuse optimizer
18Multidimensional Arrays in Titanium
- New multidimensional array added to Java
- One array may be a subarray of another
- e.g., a is interior of b, or a is all even
elements of b - Indexed by Points (tuple of ints)
- Constructed over a set of Points, called
Rectangular Domains (RectDomains) - Points, Domains and RectDomains are built-in
immutable classes - Support for AMR and other grid computations
- domain operations intersection, shrink, border
19Unordered iteration
- Memory hierarchy optimizations are essential
- Compilers can sometimes do these, but hard in
general - Titanium adds unordered iteration on rectangular
domains - foreach (p within r) ...
- p is a point
- r is a previously-declared RectDomain
- Foreach simplifies bounds checking as well
- Additional operations on domains and arrays to
subset and transform
20Laplacian Example
- Simple example of using arrays and foreach
-
- Domain interior A.domain().shrink(1)
- Point dx 1,0
- Point dy 0,1
- foreach (p in interior)
- Lp 4ap - apdx - ap-dx
- - apdy - ap-dy
-
21Sequential Performance
Ultrasparc
C/C/ FORTRAN
Java Arrays
Titanium Arrays
Overhead
DAXPY
1.4s
7
1.5s
6.8s
3D multigrid
12s
83
22s
2D multigrid
5.4s
15
6.2s
MatMul
1.8s
2.2s
22
Compares to naïve C code neither compiler does
cache blocking (yet).
22A Layered Attack on These Problems
- Algorithms
- Languages Titanium
- Java dialect compiled to C
- Extensions serial programming
- Extensions for parallel programming
- Compilers
- Uniprocessor optimizations
- Communication optimizations
- Libraries
- Automatic tuning of kernels
- Architecture
23SPMD Model
- All processor start together and execute same
code, but not in lock-step - Basic control done using
- Ti.numProcs() total number of processors
- Ti.thisProc() number of executing processor
- Bulk-synchronous style
- read all particles and compute forces on
mine - Ti.barrier()
- write to my particles using new forces
- Ti.barrier()
- This is neither message passing nor data-parallel
-
24Building Distributed Data Structures
- Broadcast is a one-to-all communication
- broadcast from
- To create shared data structures
- each processor builds its own piece
- pieces are exchanged (all-to-all)
- allData.exchange(new Boxed(Ti.thisProc())
25Global Address Space
- As seen, references (pointers) may be remote
- useful in building adaptive meshes
- easy to port shared-memory programs
- uniform programming model across machines
- Global pointers are more expensive than local
- True even when data is on the same processor
- space (processor number memory address)
- dereference time (check to see if local)
- Use local declarations in critical sections
- May declare references as local
26Global Address Space
- Processes allocate locally
- References can be passed to other processes
Other processes
Process 0
LOCAL HEAP
LOCAL HEAP
Class C int val... C gv // global
pointer C local lv // local pointer if
(thisProc() 0) lv new C() gv
broadcast lv from 0 gv.val ... ...
gv.val
27A Layered Attack on These Problems
- Algorithms
- Languages Titanium
- Java dialect compiled to C
- Extensions serial programming
- Extensions for parallel programming
- Compilers
- Uniprocessor optimizations, e.g., local
- Parallel optimizations, e.g., communication
- Libraries
- Automatic tuning of kernels
- Architecture
28Local Pointer Analysis
- Compiler can infer many uses of local
LiblitAiken - Local Qualification Inference (LQI)
- Data structures must be well partitioned
29Parallel Optimizations
- Titanium compiler performs parallel optimizations
- communication overlap and aggregation
- Two new analyses
- synchronization analysis the parallel analog to
control flow analysis for serial code - identifies code segments that may execute in
parallel - work by Gay Aiken
- shared variable analysis the parallel analog to
dependence analysis - recognize when reordering can be observed by
another processor - work by Krishnamurthy Yelick
30Semantics Sequential Consistency
- When compiling sequential programs
- Valid if y not in expr1 and x not in expr2
(roughly) - When compiling parallel code, not sufficient test.
y expr2 x expr1
Initially flag data 0 Proc A Proc
B data 1 while (flag1) flag 1
... ...data...
31Cycle Detection Dependence Analog
- Processors define a program order on accesses
from the same thread - P is the union of these total orders
- Memory system define an access order on
accesses to the same variable - A is access order (at least 1 write)
- A violation is cycle in P U A.
- Intuition time cannot flow backwards.
32Cycle Detection
- Generalizes to arbitrary numbers of variables and
processors - Cycles may be arbitrarily long, but it is
sufficient to consider only cycles with 1 or 2
consecutive stops per processor
write x write y read y
read y write
x
33Static Analysis for Cycle Detection
- Approximate P by the control flow graph
- Approximate A by undirected dependence edges
- Let the delay set D be all edges from P that
are part of a minimal cycle - The execution order of D edge must be preserved
other P edges may be reordered (modulo usual
rules about serial code) - Synchronization analysis also critical
write z read x
write y read x
read y write z
34Automatic Communication Optimization
- Prototype for C subset KrishnamurthyYelick
- Experiments on the NOW 3 synchronization styles
- Integration into production Titanium underway
35Parallel performance
- Speedup on Ultrasparc SMP
- EM3D is small kernel
- relaxation on unstructured mesh
- shows high parallel efficiency of Titanium system
- AMR speedup limited by
- small fixed mesh
- 2-levels, 9 patches
36Scalable Poisson Solver (MLC)
- Communication performance is low (
- Scaled speedup experiments are nearly ideal
(flat) - IBM SP2 at SDSC Cray t3e at NERSC
37A Layered Attack on These Problems
- Algorithms
- Languages Titanium
- Java dialect compiled to C
- Extensions serial programming
- Extensions for parallel programming
- Compilers
- Uniprocessor optimizations
- Communication optimizations
- Libraries
- Automatic tuning of kernels
- Architecture
38Sparsity
- Sparsity is a automatic kernel generator for
sparse matrix-vector multiply - Optimization problem in general
- register and cache blocking (store contiguously)
- reordering
- Choice depends on
- sparsity structure
- machine parameters without analytical models
- Can exploit context multiple right-hand-sides
39Sparsity Performance
40Sparsity Performance
41A Layered Attack on These Problems
- Algorithms
- Languages Titanium
- Java dialect compiled to C
- Extensions serial programming
- Extensions for parallel programming
- Compilers
- Uniprocessor optimizations
- Communication optimizations
- Libraries
- Automatic tuning of kernels
- Architecture
42Architecture
- IRAM Intelligent DRAM Patterson Yelick
- integrated processor in DRAM
- an end-run around the memory hierarchy?
- IRAM uses vectors for high on-chip bandwidth
- Targets multimedia applications and low power
- Media extension (MMX, VIS, ) are short vectors
- Sony Playstation uses vector co-processor
- Economics drive both DRAM and processors
- Supercomputers are built with whatever exists
- Can multimedia save supercomputers?
43Summary
- Algorithms
- Realistic AMR applications in Titanium
- Language support
- Arrays, Immutable, Overloading,
- Compiler optimizations
- Uniprocessor optimizations
- Parallel analysis
- Libraries
- Automatically generated libraries
- Architecture
- high memory bandwidth