Language and Compiler Support for Adaptive Mesh Refinement - PowerPoint PPT Presentation

1 / 43
About This Presentation
Title:

Language and Compiler Support for Adaptive Mesh Refinement

Description:

AMR & Titanium. Language and Compiler Support for Adaptive Mesh Refinement. Katherine Yelick ... AMR & Titanium. Multiscale Modeling Problems ... – PowerPoint PPT presentation

Number of Views:62
Avg rating:3.0/5.0
Slides: 44
Provided by: susanl2
Category:

less

Transcript and Presenter's Notes

Title: Language and Compiler Support for Adaptive Mesh Refinement


1
Language and Compiler Support for Adaptive Mesh
Refinement
Katherine Yelick Alex Aiken, Greg Balls, Dan
Bonachea, Phillip Colella, David Gay,
Susan Graham, Paul Hilfinger, Arvind
Krishnamurthy, Ben Liblit, Chang Sun Lin, Peter
McCorquodale, Carleton Miyamoto, Geoff Pike, Kar
Ming Tang, Siu Man Yau
2
Multiscale Modeling Problems
  • Many modeling problems in astrophysics, biology,
    material science, and other areas require
  • Enormous range of spatial and temporal scales
  • To solve interesting problems, one needs
  • Adaptive methods
  • Large scale parallel machines
  • For Computational Fluid Dynamics, a popular
    technique is
  • Adaptive Mesh Refinement (AMR)

3
Block-Structured AMR
  • Algorithms for many rectangular, grid-based
    computations are
  • communication intensive
  • memory intensive
  • AMR makes these harder
  • more small messages
  • more complex data
    structures
  • most of the programming effort is debugging the
    boundary cases
  • locality and load balance trade-off is hard

4
A Layered Attack on These Problems
  • Algorithms
  • Languages Titanium
  • Java dialect compiled to C
  • Extensions for serial programming
  • Extensions for parallel programming
  • Compilers
  • Uniprocessor optimizations
  • Communication optimizations
  • Libraries
  • Automatic tuning of kernels
  • Architecture

5
Algorithms for AMR
  • Existing algorithms in Titanium
  • 3D AMR Poisson solver
  • 3D AMR Gas dynamics
  • Domain-decomposition MLC Poisson
  • Under development
  • Self-gravitating gas dynamics (3D AMR)
  • For stellar collapse, etc.
  • Immersed boundary method (3D, non-adaptive)
  • Peskin and MacQueens method for heart model,
    etc.
  • Embedded boundaries
  • Simulation of bio-MEMs devices and cellular level
    modeling
  • All joint with Colellas group at LBNL

6
3D AMR Gas Dynamics
  • Hyperbolic Solver McCorquodale and Colella
  • Implementation of Berger-Colella algorithm
  • Mesh generation algorithm included
  • 2D Example (3D supported)
  • Mach-10 shock on solid surface
    at oblique
    angle
  • Future Self-gravitating gas dynamics package

7
3D AMR Poisson
  • Poisson Solver Semenzato, Pike, Colella
  • finite domain
  • variable

    coefficients
  • multigrid

    across levels
  • Currently synthetic grids, no grid generation
  • Under construction
  • reengineered to interface with hyperbolic solver
  • including mesh generation

8
MLC for Finite-Differences
  • Poisson solver with infinite domains Colella,
    Balls
  • Uses a Method of Local Corrections (MLC)
  • Currently non-adaptive and 2D
  • Supports only constant coefficients
  • Uses 2-level, domain decomposition approach
  • Fine-grid solutions are computed in parallel
  • Information transferred to a coarse-grid and
    solved serially
  • Fine-grid solutions is computed using boundary
    conditions from the coarse grid
  • Future work includes 3D Adaptive version

9
MLC for Finite-Differences
  • Features of the method
  • Solution is still second-order accurate
  • Accuracy depends only weakly on the coarse-grid
    spacing
  • Scalability
  • No communication during fine-grid solves
  • Single communication step (global all-to-all)
  • coarse grid work is serial (replicated), but
    relatively small
  • Future work extend to 3D and adaptive meshes

10
Accuracy of the MLC for FD
  • Solution is second-order accurate
  • decreasing h by 2x, decreases error by 4x

11
Error on High-Wavenumber Problem
  • Charge is
  • 1 charge of concentric waves
  • 2 star-shaped charges.
  • Largest error is where the charge is changing
    rapidly. Note
  • discretization error
  • faint decomposition error
  • Run on 16 procs

12
A Layered Attack on These Problems
  • Algorithms
  • Languages Titanium
  • Java dialect compiled to C
  • Extensions serial programming
  • Extensions for parallel programming
  • Compilers
  • Uniprocessor optimizations
  • Communication optimizations
  • Libraries
  • Automatic tuning of kernels
  • Architecture

13
Java for Scientific Computing
  • Computational scientists work on increasingly
    complex models
  • Popularized C features classes, overloading,
    pointer-based data structures
  • But C is very complicated
  • easy to lose performance and readability
  • Java is a better C
  • Safe strongly typed, garbage collected
  • Much simpler to implement (research vehicle)
  • Industrial interest as well IBM HP Java

14
Java Objects
  • Primitive scalar types boolean, double, int,
    etc.
  • implementations store these in place
  • access is fast -- comparable to other languages
  • Objects user-defined and library
  • passed by pointer value
  • has level of indirection (pointer to) implicit
  • simple model, but inefficient for small objects

2.6 3 true
Complex object
r 7.1 i 4.3
15
Java Object Example
  • class Complex
  • private double real
  • private double imag
  • public Complex(double r, double i)
  • real r imag i
  • public Complex operator(Complex c)
  • return new Complex(c.real real,
  • c.imag imag)
  • public double getReal return real
  • public double getImag return imag
  • Complex c new Complex(7.1, 4.3)
  • c c c
  • class VisComplex extends Complex ...

16
Immutable Classes in Titanium
  • For small objects, would sometimes prefer to
  • avoid indirection, pass by copying
  • especially when fields are immutable
  • extends the idea of primitive values (1, 4.2,
    etc.) to user-defined values
  • performs like structs in C
  • Titanium introduces immutable classes
  • all fields are implicitly final (constant)
  • cannot inherit from (extend) or be inherited by
    other classes
  • immutable class Complex ..unchanged..
  • Complex c new Complex(7.1, 4.3)

17
Arrays in Java
  • Arrays in Java are objects
  • Only 1D arrays are directly supported
  • Multidimensional arrays are slow

2d array
  • Subarrays are important in AMR
  • e.g., interior of a grid
  • Even C and C dont support these well
  • Fortran has contiguous subarrays (LDA)
  • Hand-coding can confuse optimizer

18
Multidimensional Arrays in Titanium
  • New multidimensional array added to Java
  • One array may be a subarray of another
  • e.g., a is interior of b, or a is all even
    elements of b
  • Indexed by Points (tuple of ints)
  • Constructed over a set of Points, called
    Rectangular Domains (RectDomains)
  • Points, Domains and RectDomains are built-in
    immutable classes
  • Support for AMR and other grid computations
  • domain operations intersection, shrink, border

19
Unordered iteration
  • Memory hierarchy optimizations are essential
  • Compilers can sometimes do these, but hard in
    general
  • Titanium adds unordered iteration on rectangular
    domains
  • foreach (p within r) ...
  • p is a point
  • r is a previously-declared RectDomain
  • Foreach simplifies bounds checking as well
  • Additional operations on domains and arrays to
    subset and transform

20
Laplacian Example
  • Simple example of using arrays and foreach
  • Domain interior A.domain().shrink(1)
  • Point dx 1,0
  • Point dy 0,1
  • foreach (p in interior)
  • Lp 4ap - apdx - ap-dx
  • - apdy - ap-dy

21
Sequential Performance
Ultrasparc
C/C/ FORTRAN
Java Arrays
Titanium Arrays
Overhead
DAXPY
1.4s
7
1.5s
6.8s
3D multigrid
12s
83
22s
2D multigrid
5.4s
15
6.2s
MatMul
1.8s
2.2s
22
Compares to naïve C code neither compiler does
cache blocking (yet).
22
A Layered Attack on These Problems
  • Algorithms
  • Languages Titanium
  • Java dialect compiled to C
  • Extensions serial programming
  • Extensions for parallel programming
  • Compilers
  • Uniprocessor optimizations
  • Communication optimizations
  • Libraries
  • Automatic tuning of kernels
  • Architecture

23
SPMD Model
  • All processor start together and execute same
    code, but not in lock-step
  • Basic control done using
  • Ti.numProcs() total number of processors
  • Ti.thisProc() number of executing processor
  • Bulk-synchronous style
  • read all particles and compute forces on
    mine
  • Ti.barrier()
  • write to my particles using new forces
  • Ti.barrier()
  • This is neither message passing nor data-parallel

24
Building Distributed Data Structures
  • Broadcast is a one-to-all communication
  • broadcast from
  • To create shared data structures
  • each processor builds its own piece
  • pieces are exchanged (all-to-all)
  • allData.exchange(new Boxed(Ti.thisProc())

25
Global Address Space
  • As seen, references (pointers) may be remote
  • useful in building adaptive meshes
  • easy to port shared-memory programs
  • uniform programming model across machines
  • Global pointers are more expensive than local
  • True even when data is on the same processor
  • space (processor number memory address)
  • dereference time (check to see if local)
  • Use local declarations in critical sections
  • May declare references as local

26
Global Address Space
  • Processes allocate locally
  • References can be passed to other processes

Other processes
Process 0
LOCAL HEAP
LOCAL HEAP
Class C int val... C gv // global
pointer C local lv // local pointer if
(thisProc() 0) lv new C() gv
broadcast lv from 0 gv.val ... ...
gv.val
27
A Layered Attack on These Problems
  • Algorithms
  • Languages Titanium
  • Java dialect compiled to C
  • Extensions serial programming
  • Extensions for parallel programming
  • Compilers
  • Uniprocessor optimizations, e.g., local
  • Parallel optimizations, e.g., communication
  • Libraries
  • Automatic tuning of kernels
  • Architecture

28
Local Pointer Analysis
  • Compiler can infer many uses of local
    LiblitAiken
  • Local Qualification Inference (LQI)
  • Data structures must be well partitioned

29
Parallel Optimizations
  • Titanium compiler performs parallel optimizations
  • communication overlap and aggregation
  • Two new analyses
  • synchronization analysis the parallel analog to
    control flow analysis for serial code
  • identifies code segments that may execute in
    parallel
  • work by Gay Aiken
  • shared variable analysis the parallel analog to
    dependence analysis
  • recognize when reordering can be observed by
    another processor
  • work by Krishnamurthy Yelick

30
Semantics Sequential Consistency
  • When compiling sequential programs
  • Valid if y not in expr1 and x not in expr2
    (roughly)
  • When compiling parallel code, not sufficient test.

y expr2 x expr1
Initially flag data 0 Proc A Proc
B data 1 while (flag1) flag 1
... ...data...
31
Cycle Detection Dependence Analog
  • Processors define a program order on accesses
    from the same thread
  • P is the union of these total orders
  • Memory system define an access order on
    accesses to the same variable
  • A is access order (at least 1 write)
  • A violation is cycle in P U A.
  • Intuition time cannot flow backwards.

32
Cycle Detection
  • Generalizes to arbitrary numbers of variables and
    processors
  • Cycles may be arbitrarily long, but it is
    sufficient to consider only cycles with 1 or 2
    consecutive stops per processor

write x write y read y
read y write
x
33
Static Analysis for Cycle Detection
  • Approximate P by the control flow graph
  • Approximate A by undirected dependence edges
  • Let the delay set D be all edges from P that
    are part of a minimal cycle
  • The execution order of D edge must be preserved
    other P edges may be reordered (modulo usual
    rules about serial code)
  • Synchronization analysis also critical

write z read x
write y read x
read y write z
34
Automatic Communication Optimization
  • Prototype for C subset KrishnamurthyYelick
  • Experiments on the NOW 3 synchronization styles
  • Integration into production Titanium underway

35
Parallel performance
  • Speedup on Ultrasparc SMP
  • EM3D is small kernel
  • relaxation on unstructured mesh
  • shows high parallel efficiency of Titanium system
  • AMR speedup limited by
  • small fixed mesh
  • 2-levels, 9 patches

36
Scalable Poisson Solver (MLC)
  • Communication performance is low (
  • Scaled speedup experiments are nearly ideal
    (flat)
  • IBM SP2 at SDSC Cray t3e at NERSC

37
A Layered Attack on These Problems
  • Algorithms
  • Languages Titanium
  • Java dialect compiled to C
  • Extensions serial programming
  • Extensions for parallel programming
  • Compilers
  • Uniprocessor optimizations
  • Communication optimizations
  • Libraries
  • Automatic tuning of kernels
  • Architecture

38
Sparsity
  • Sparsity is a automatic kernel generator for
    sparse matrix-vector multiply
  • Optimization problem in general
  • register and cache blocking (store contiguously)
  • reordering
  • Choice depends on
  • sparsity structure
  • machine parameters without analytical models
  • Can exploit context multiple right-hand-sides

39
Sparsity Performance
40
Sparsity Performance
41
A Layered Attack on These Problems
  • Algorithms
  • Languages Titanium
  • Java dialect compiled to C
  • Extensions serial programming
  • Extensions for parallel programming
  • Compilers
  • Uniprocessor optimizations
  • Communication optimizations
  • Libraries
  • Automatic tuning of kernels
  • Architecture

42
Architecture
  • IRAM Intelligent DRAM Patterson Yelick
  • integrated processor in DRAM
  • an end-run around the memory hierarchy?
  • IRAM uses vectors for high on-chip bandwidth
  • Targets multimedia applications and low power
  • Media extension (MMX, VIS, ) are short vectors
  • Sony Playstation uses vector co-processor
  • Economics drive both DRAM and processors
  • Supercomputers are built with whatever exists
  • Can multimedia save supercomputers?

43
Summary
  • Algorithms
  • Realistic AMR applications in Titanium
  • Language support
  • Arrays, Immutable, Overloading,
  • Compiler optimizations
  • Uniprocessor optimizations
  • Parallel analysis
  • Libraries
  • Automatically generated libraries
  • Architecture
  • high memory bandwidth
Write a Comment
User Comments (0)
About PowerShow.com