Language and Compiler Support for Adaptive Mesh Refinement - PowerPoint PPT Presentation

1 / 43

About This Presentation

Title:

Language and Compiler Support for Adaptive Mesh Refinement

Description:

AMR & Titanium. Language and Compiler Support for Adaptive Mesh Refinement. Katherine Yelick ... AMR & Titanium. Multiscale Modeling Problems ... – PowerPoint PPT presentation

Number of Views:62

Avg rating:3.0/5.0

Slides: 44

Provided by: susanl2

Learn more at: https://people.eecs.berkeley.edu

Category:

more less

Transcript and Presenter's Notes

Title: Language and Compiler Support for Adaptive Mesh Refinement

1
Language and Compiler Support for Adaptive Mesh
Refinement
Katherine Yelick Alex Aiken, Greg Balls, Dan
Bonachea, Phillip Colella, David Gay,
Susan Graham, Paul Hilfinger, Arvind
Krishnamurthy, Ben Liblit, Chang Sun Lin, Peter
McCorquodale, Carleton Miyamoto, Geoff Pike, Kar
Ming Tang, Siu Man Yau
2
Multiscale Modeling Problems

Many modeling problems in astrophysics, biology,
material science, and other areas require
Enormous range of spatial and temporal scales
To solve interesting problems, one needs
Adaptive methods
Large scale parallel machines
For Computational Fluid Dynamics, a popular
technique is
Adaptive Mesh Refinement (AMR)

3
Block-Structured AMR

Algorithms for many rectangular, grid-based
computations are
communication intensive
memory intensive
AMR makes these harder
more small messages
more complex data
structures
most of the programming effort is debugging the
boundary cases
locality and load balance trade-off is hard

4
A Layered Attack on These Problems

Algorithms
Languages Titanium
Java dialect compiled to C
Extensions for serial programming
Extensions for parallel programming
Compilers
Uniprocessor optimizations
Communication optimizations
Libraries
Automatic tuning of kernels
Architecture

5
Algorithms for AMR

Existing algorithms in Titanium
3D AMR Poisson solver
3D AMR Gas dynamics
Domain-decomposition MLC Poisson
Under development
Self-gravitating gas dynamics (3D AMR)
For stellar collapse, etc.
Immersed boundary method (3D, non-adaptive)
Peskin and MacQueens method for heart model,
etc.
Embedded boundaries
Simulation of bio-MEMs devices and cellular level
modeling
All joint with Colellas group at LBNL

6
3D AMR Gas Dynamics

Hyperbolic Solver McCorquodale and Colella
Implementation of Berger-Colella algorithm
Mesh generation algorithm included
2D Example (3D supported)
Mach-10 shock on solid surface
at oblique
angle
Future Self-gravitating gas dynamics package

7
3D AMR Poisson

Poisson Solver Semenzato, Pike, Colella
finite domain
variable

coefficients
multigrid

across levels
Currently synthetic grids, no grid generation
Under construction
reengineered to interface with hyperbolic solver
including mesh generation

8
MLC for Finite-Differences

Poisson solver with infinite domains Colella,
Balls
Uses a Method of Local Corrections (MLC)
Currently non-adaptive and 2D
Supports only constant coefficients
Uses 2-level, domain decomposition approach
Fine-grid solutions are computed in parallel
Information transferred to a coarse-grid and
solved serially
Fine-grid solutions is computed using boundary
conditions from the coarse grid
Future work includes 3D Adaptive version

9
MLC for Finite-Differences

Features of the method
Solution is still second-order accurate
Accuracy depends only weakly on the coarse-grid
spacing
Scalability
No communication during fine-grid solves
Single communication step (global all-to-all)
coarse grid work is serial (replicated), but
relatively small
Future work extend to 3D and adaptive meshes

10
Accuracy of the MLC for FD

Solution is second-order accurate
decreasing h by 2x, decreases error by 4x

11
Error on High-Wavenumber Problem

Charge is
1 charge of concentric waves
2 star-shaped charges.
Largest error is where the charge is changing
rapidly. Note
discretization error
faint decomposition error
Run on 16 procs

12
A Layered Attack on These Problems

Algorithms
Languages Titanium
Java dialect compiled to C
Extensions serial programming
Extensions for parallel programming
Compilers
Uniprocessor optimizations
Communication optimizations
Libraries
Automatic tuning of kernels
Architecture

13
Java for Scientific Computing

Computational scientists work on increasingly
complex models
Popularized C features classes, overloading,
pointer-based data structures
But C is very complicated
easy to lose performance and readability
Java is a better C
Safe strongly typed, garbage collected
Much simpler to implement (research vehicle)
Industrial interest as well IBM HP Java

14
Java Objects

Primitive scalar types boolean, double, int,
etc.
implementations store these in place
access is fast -- comparable to other languages
Objects user-defined and library
passed by pointer value
has level of indirection (pointer to) implicit
simple model, but inefficient for small objects

2.6 3 true
Complex object
r 7.1 i 4.3
15
Java Object Example

class Complex
private double real
private double imag
public Complex(double r, double i)
real r imag i
public Complex operator(Complex c)
return new Complex(c.real real,
c.imag imag)
public double getReal return real
public double getImag return imag
Complex c new Complex(7.1, 4.3)
c c c
class VisComplex extends Complex ...

16
Immutable Classes in Titanium

For small objects, would sometimes prefer to
avoid indirection, pass by copying
especially when fields are immutable
extends the idea of primitive values (1, 4.2,
etc.) to user-defined values
performs like structs in C
Titanium introduces immutable classes
all fields are implicitly final (constant)
cannot inherit from (extend) or be inherited by
other classes
immutable class Complex ..unchanged..
Complex c new Complex(7.1, 4.3)

17
Arrays in Java

Arrays in Java are objects
Only 1D arrays are directly supported
Multidimensional arrays are slow

2d array

Subarrays are important in AMR
e.g., interior of a grid
Even C and C dont support these well
Fortran has contiguous subarrays (LDA)
Hand-coding can confuse optimizer

18
Multidimensional Arrays in Titanium

New multidimensional array added to Java
One array may be a subarray of another
e.g., a is interior of b, or a is all even
elements of b
Indexed by Points (tuple of ints)
Constructed over a set of Points, called
Rectangular Domains (RectDomains)
Points, Domains and RectDomains are built-in
immutable classes
Support for AMR and other grid computations
domain operations intersection, shrink, border

19
Unordered iteration

Memory hierarchy optimizations are essential
Compilers can sometimes do these, but hard in
general
Titanium adds unordered iteration on rectangular
domains
foreach (p within r) ...
p is a point
r is a previously-declared RectDomain
Foreach simplifies bounds checking as well
Additional operations on domains and arrays to
subset and transform

20
Laplacian Example

Simple example of using arrays and foreach
Domain interior A.domain().shrink(1)
Point dx 1,0
Point dy 0,1
foreach (p in interior)
Lp 4ap - apdx - ap-dx
- apdy - ap-dy

21
Sequential Performance
Ultrasparc
C/C/ FORTRAN
Java Arrays
Titanium Arrays
Overhead
DAXPY
1.4s
7
1.5s
6.8s
3D multigrid
12s
83
22s
2D multigrid
5.4s
15
6.2s
MatMul
1.8s
2.2s
22
Compares to naïve C code neither compiler does
cache blocking (yet).
22
A Layered Attack on These Problems

Algorithms
Languages Titanium
Java dialect compiled to C
Extensions serial programming
Extensions for parallel programming
Compilers
Uniprocessor optimizations
Communication optimizations
Libraries
Automatic tuning of kernels
Architecture

23
SPMD Model

All processor start together and execute same
code, but not in lock-step
Basic control done using
Ti.numProcs() total number of processors
Ti.thisProc() number of executing processor
Bulk-synchronous style
read all particles and compute forces on
mine
Ti.barrier()
write to my particles using new forces
Ti.barrier()
This is neither message passing nor data-parallel

24
Building Distributed Data Structures

Broadcast is a one-to-all communication
broadcast from
To create shared data structures
each processor builds its own piece
pieces are exchanged (all-to-all)
allData.exchange(new Boxed(Ti.thisProc())

25
Global Address Space

As seen, references (pointers) may be remote
useful in building adaptive meshes
easy to port shared-memory programs
uniform programming model across machines
Global pointers are more expensive than local
True even when data is on the same processor
space (processor number memory address)
dereference time (check to see if local)
Use local declarations in critical sections
May declare references as local

26
Global Address Space

Processes allocate locally
References can be passed to other processes

Other processes
Process 0
LOCAL HEAP
LOCAL HEAP
Class C int val... C gv // global
pointer C local lv // local pointer if
(thisProc() 0) lv new C() gv
broadcast lv from 0 gv.val ... ...
gv.val
27
A Layered Attack on These Problems

Algorithms
Languages Titanium
Java dialect compiled to C
Extensions serial programming
Extensions for parallel programming
Compilers
Uniprocessor optimizations, e.g., local
Parallel optimizations, e.g., communication
Libraries
Automatic tuning of kernels
Architecture

28
Local Pointer Analysis

Compiler can infer many uses of local
LiblitAiken
Local Qualification Inference (LQI)
Data structures must be well partitioned

29
Parallel Optimizations

Titanium compiler performs parallel optimizations
communication overlap and aggregation
Two new analyses
synchronization analysis the parallel analog to
control flow analysis for serial code
identifies code segments that may execute in
parallel
work by Gay Aiken
shared variable analysis the parallel analog to
dependence analysis
recognize when reordering can be observed by
another processor
work by Krishnamurthy Yelick

30
Semantics Sequential Consistency

When compiling sequential programs
Valid if y not in expr1 and x not in expr2
(roughly)
When compiling parallel code, not sufficient test.

y expr2 x expr1
Initially flag data 0 Proc A Proc
B data 1 while (flag1) flag 1
... ...data...
31
Cycle Detection Dependence Analog

Processors define a program order on accesses
from the same thread
P is the union of these total orders
Memory system define an access order on
accesses to the same variable
A is access order (at least 1 write)
A violation is cycle in P U A.
Intuition time cannot flow backwards.

32
Cycle Detection

Generalizes to arbitrary numbers of variables and
processors
Cycles may be arbitrarily long, but it is
sufficient to consider only cycles with 1 or 2
consecutive stops per processor

write x write y read y
read y write
x
33
Static Analysis for Cycle Detection

Approximate P by the control flow graph
Approximate A by undirected dependence edges
Let the delay set D be all edges from P that
are part of a minimal cycle
The execution order of D edge must be preserved
other P edges may be reordered (modulo usual
rules about serial code)
Synchronization analysis also critical

write z read x
write y read x
read y write z
34
Automatic Communication Optimization