1
High Performance Parallel Programming
  • Dirk van der Knijff
  • Advanced Research Computing
  • Information Division

2
High Performance Parallel Programming
  • Lecture 19 Unstructured Mesh Decomposition

3
Last lecture
  • We were discussing methods of triangulation / tetrahedralization for generating unstructured meshes
  • We had mentioned grid-based methods and looked at the advancing front method
  • Today we will finish off mesh generation and look at mesh decomposition

4
Delaunay triangulation
  • Given a set of points in a plane, the Voronoi polygon about a point P is the set of points that are closer to, or as close to, P as to any other point in the set

(Figure: Voronoi tessellation of a point set, with the dual Delaunay triangulation and a circumcircle shown)
5
Delaunay triangulation (cont.)
  • The Delaunay triangulation is the dual of the Voronoi tessellation, and has the following properties:
  • No point is contained in the circumcircle of any triangle
  • Maximises the minimum angle over all triangular elements (note we would like to minimise the maximum...)
  • The Delaunay triangulation is unique except for degenerate distributions of points
  • in 2D: four points on a circle

6
Delaunay triangulation (cont.)
  • 3D 5 points on a sphere
  • 3D 6 points (octahedron) can give an invalid mesh
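Away from these degenerate cases, the construction is easy to try with an off-the-shelf implementation. Below is a minimal usage sketch with SciPy's Delaunay class, which wraps the Qhull library (the choice of SciPy is ours for illustration; the lecture names no package):

import numpy as np
from scipy.spatial import Delaunay

# five points in the plane; no four lie on a common circle, so the
# Delaunay triangulation is unique
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.4, 0.6]])
tri = Delaunay(pts)
print(tri.simplices)   # each row is one triangle as three vertex indices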

7
Bowyer-Watson algorithm
  • Adds points sequentially into an existing triangulation
  • To add a point:
  • find all existing triangles whose circumcircle contains the new point
  • delete these triangles to leave a cavity (always star-shaped about the new point)
  • join the new point to all the vertices of the cavity
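A compact 2D sketch of these steps, assuming floating-point coordinates and ignoring degenerate (collinear or exactly cocircular) inputs — an illustration, not the lecturer's code:

def circumcircle_contains(tri, p, pts):
    # circumcentre via the standard determinant formula; d == 0 would
    # mean the three vertices are collinear (not handled in this sketch)
    (ax, ay), (bx, by), (cx, cy) = (pts[i] for i in tri)
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax*ax + ay*ay) * (by - cy) + (bx*bx + by*by) * (cy - ay)
          + (cx*cx + cy*cy) * (ay - by)) / d
    uy = ((ax*ax + ay*ay) * (cx - bx) + (bx*bx + by*by) * (ax - cx)
          + (cx*cx + cy*cy) * (bx - ax)) / d
    return (p[0] - ux)**2 + (p[1] - uy)**2 < (ax - ux)**2 + (ay - uy)**2

def bowyer_watson(points):
    # start from a huge "super-triangle" that encloses every input point
    pts = list(points) + [(-1e6, -1e6), (1e6, -1e6), (0.0, 1e6)]
    n = len(points)
    tris = [(n, n + 1, n + 2)]
    for i in range(n):
        # 1. triangles whose circumcircle contains the new point
        bad = [t for t in tris if circumcircle_contains(t, pts[i], pts)]
        # 2. cavity boundary = edges belonging to exactly one bad triangle
        count = {}
        for a, b, c in bad:
            for e in ((a, b), (b, c), (c, a)):
                key = tuple(sorted(e))
                count[key] = count.get(key, 0) + 1
        boundary = [e for e, k in count.items() if k == 1]
        # 3. delete the bad triangles, then join the new point to the cavity
        tris = [t for t in tris if t not in bad]
        tris += [(a, b, i) for a, b in boundary]
    # discard anything still attached to the super-triangle
    return [t for t in tris if max(t) < n]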

8
Point generation
  • The Delaunay triangulation assumes that the points to be triangulated are already known
  • points can be pre-generated from the vertices of overlapping structured grids - they may need filtering and smoothing
  • points can be generated simultaneously with the triangles, to improve the quality of the triangles
  • such grids are less smooth than those from the advancing front method, but generation is much quicker
  • there can be robustness problems, especially in the initial phase, when triangles may be highly distorted

9
Boundaries
  • The Delaunay construction triangulates a set of
    points, and does not necessarily conform to
    imposed boundaries

10
Constraining / conforming
  • In 2D, the constrained Delaunay triangulation is
    well defined
  • In 3D, no constrained DT is defined, but it is
    possible to recover the boundary edges and faces
    by using face-edge swapping
  • Alternatively, it is possible to make the DT
    conform to a boundary curve / surface by adding
    enough points on the boundary to ensure that the
    DT will include the edges on the surface.

11
Face-edge swapping
12
Other methods
  • Paving: advancing layers of quadrilaterals / hexahedra
  • needs collision rules when fronts merge
  • gives good elements at surfaces
  • Whisker weaving and the spatial twist continuum
  • Hybrid methods: structured near surfaces, unstructured elsewhere
  • advancing layers (for advancing front)
  • advancing normals (for Delaunay point insertion)
  • Recursive decomposition
  • etc.

13
Unstructured mesh decomposition
  • Review: how are unstructured meshes different from regular grids?
  • Regular grids
  • are topologically Cartesian grids
  • may be represented as arrays
  • Unstructured meshes
  • have no regular structure
  • a node may be connected to an arbitrary number of neighbours
  • cannot be represented by a simple array, so a more complex data structure must be used (the implementation may still be via arrays, as in the CSR sketch below)
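For concreteness, a common array-based layout is compressed sparse row (CSR), the format that partitioning libraries such as METIS expect: one array of offsets and one array of neighbour indices. A minimal sketch:

def to_csr(neighbour_lists):
    # adjncy[xadj[v]:xadj[v+1]] is vertex v's slice of neighbours
    xadj, adjncy = [0], []
    for nbrs in neighbour_lists:
        adjncy.extend(nbrs)
        xadj.append(len(adjncy))
    return xadj, adjncy

# vertex 0 touches 1 and 2; vertices 1 and 2 each touch only 0
xadj, adjncy = to_csr([[1, 2], [0], [0]])
assert adjncy[xadj[0]:xadj[1]] == [1, 2]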

14
Static and Adaptive methods
  • Static mesh
  • operate with a fixed mesh for the entirety of the
    run
  • Adaptive mesh
  • locally refine or coarsen the mesh
  • usually do this according to some measure of the
    local error in the solution (e.g. the local
    residual)
  • Adaptive methods are applicable to
  • iterative solution problems
  • time-dependent problems

15
Parallelization
  • Decompose by dividing the mesh amongst PEs
  • The quality of the mesh decomposition has a highly significant effect on performance
  • Arriving at a good decomposition is a complex task
  • "good" may be problem- / architecture-dependent
  • A wide variety of well-established methods exist
  • Several packages / libraries provide implementations of many of the methods
  • A major practical difficulty is differences in file formats...

16
What makes a good decomposition?
  • Load balance
  • elements should be distributed across the processors so that each has an equal share of the work
  • Communication costs should be minimized
  • there should be as few elements as possible on the boundary of each sub-domain
  • each sub-domain should have as few neighbours as possible
  • Distribution should reflect the machine architecture
  • comms/calc and bandwidth/latency ratios need to be considered

17
Dynamic decomposition
  • Introduces extra complexities in parallel
  • static mesh - decomposition is done only once (as a pre-process)
  • dynamic mesh - decomposition becomes part of the calculation
  • Dynamic decomposition problems
  • the time taken may outweigh the benefit gained, so we need to check that it is worthwhile
  • the decomposition must run in parallel (and be fast)
  • to minimize cost, the decomposition must take the previous decomposition into account and make minimal changes...
  • We will only look at static decomposition

18
Dual graphs
  • To do a decomposition we need
  • a representation of the basic entities being distributed
  • an idea of how communication takes place between them
  • A dual graph, based on the mesh, can fill this role (a construction is sketched below)
  • vertices in the graph represent the entities
  • edges in the graph represent communication
  • The graph depends on how data is transferred
  • for finite elements it could be via nodes, edges or faces, so a single mesh can have many dual graphs
  • Edges (comms) or vertices (calc) may be weighted
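A sketch of the construction for a triangle mesh, taking "communication via edges" as the rule (one choice among the several just mentioned): two triangles become adjacent dual-graph vertices whenever they share a mesh edge.

from collections import defaultdict

def dual_graph(triangles):
    # map each (sorted) mesh edge to the triangles containing it
    edge_to_tris = defaultdict(list)
    for t_id, (a, b, c) in enumerate(triangles):
        for e in ((a, b), (b, c), (c, a)):
            edge_to_tris[tuple(sorted(e))].append(t_id)
    # one dual vertex per triangle; one dual edge per shared mesh edge
    adj = [set() for _ in triangles]
    for tris in edge_to_tris.values():
        if len(tris) == 2:
            t1, t2 = tris
            adj[t1].add(t2)
            adj[t2].add(t1)
    return adj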

19
Example in 2D
20
Decomposing the dual graph
  • Mesh decomposition is a graph partitioning problem
  • this problem occurs in several areas
  • much work is still being done
  • The partitioning problem is as follows
  • for an undirected graph G with vertices V and edges E, find k disjoint, (approximately) equal subsets of V such that the number of cut edges between them is minimised (a helper that scores a partition this way is sketched below)
  • the graph may be weighted
  • weights can represent work/vertex, comms/edge...
  • for weighted graphs, seek equal vertex weights for each subset and a minimisation of the sum of the weights of the cut edges
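As a concrete reading of the unweighted problem statement, a small helper (hypothetical, for illustration) that scores a candidate partition by its edge cut and load imbalance:

def evaluate(adj, part, k):
    # adj: list of neighbour sets; part[v]: sub-domain of vertex v
    cut = sum(1 for v, nbrs in enumerate(adj) for u in nbrs
              if u > v and part[u] != part[v])    # each edge counted once
    sizes = [0] * k
    for d in part:
        sizes[d] += 1
    imbalance = max(sizes) / (len(part) / k)      # 1.0 = perfect balance
    return cut, imbalance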

21
Problem complexity
  • Graph partitioning is NP-complete
  • this means that no exact solution can be found in any reasonable time for non-trivial examples
  • complete enumeration is infeasible
  • for n nodes and p sub-domains the search space is of size p^n
  • p may be in the hundreds and n in the millions
  • we must therefore resort to heuristics
  • we want an acceptable approximate solution in a reasonable time

22
Partitioning and Mapping
  • Mesh decomposition has two aspects
  • partitioning a graph
  • mapping the sub-domains to processors
  • Algorithms may address both issues or ignore the
    latter
  • whether this is an issue depends on the target machine
  • Can trade load balance against communication cost
  • may be profitable to accept a small load
    imbalance if it results in a large decrease in
    communication
  • network topology may also be relevant

23
Decomposition algorithms
  • Simple direct k-way algorithms
  • Random, Scattered and Linear Bandwidth Reduction
  • The Greedy Algorithm
  • Optimisation algorithms
  • Simulated Annealing
  • Other algorithms
  • Chained Local Optimization
  • Genetic algorithms
  • Recursive Partitioning

24
Direct k-way algorithms
  • Random partitioning
  • assign each vertex to a processor at random
  • load balance is, on average, attained
  • no account is taken of mesh connectivity
  • communication costs will be enormous
  • Scattered partitioning
  • vertices are handed out in order
  • each subsequent vertex goes to the currently smallest sub-domain
  • load balance will be good
  • but adjacent vertices will not be kept together
  • again communication suffers (both schemes are sketched below)
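Both schemes are a few lines each; minimal sketches (illustrative, not from the lecture):

import random

def random_partition(n, k):
    # each vertex is assigned to a sub-domain uniformly at random
    return [random.randrange(k) for _ in range(n)]

def scattered_partition(n, k):
    # vertices handed out in order, each to the currently smallest sub-domain
    sizes = [0] * k
    part = []
    for _ in range(n):
        d = min(range(k), key=sizes.__getitem__)
        part.append(d)
        sizes[d] += 1
    return part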

25
Linear partitioning
  • Regular domain decomposition of the vertices
  • for an unweighted graph of n vertices on p processors, give the first n/p vertices to the first sub-domain, the second n/p to the next, and so on (see the sketch below)
  • this can give surprisingly good results
  • it exploits the data locality often implicit in the vertex numbering
  • usually an artifact of automatic mesh generation
  • without refinement this is superior to random or scattered
  • quality depends on the bandwidth of the system of equations
  • this is open to improvement by standard techniques
  • elements are re-numbered to improve the adjacency matrix
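Linear partitioning is essentially one line; a sketch for the unweighted case:

import math

def linear_partition(n, p):
    # vertex v goes to sub-domain v // ceil(n/p): contiguity in the
    # numbering becomes contiguity in the decomposition
    chunk = math.ceil(n / p)
    return [v // chunk for v in range(n)]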

26
Bandwidth reduction
  • Linear decomposition onto two processors

27
Greedy Algorithm
  • Start at some point on the boundary
  • Do until finished:
  • successively bite adjacent nodes from the mesh until the sub-domain has the required number of nodes
  • then choose, from the interior boundary of the most recently completed sub-domain, the node with the fewest elements connected to it as the next starting point (a sketch follows)
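A hedged sketch of the bite-off idea, operating on a dual graph as built earlier. Picking the lowest-numbered unassigned vertex as each new seed is a simplification of the slide's "fewest connected elements" rule:

from collections import deque

def greedy_partition(adj, k):
    n = len(adj)
    part = [-1] * n                        # -1 = not yet assigned
    unassigned = set(range(n))
    target = n // k
    for d in range(k):
        want = target if d < k - 1 else len(unassigned)
        frontier = deque([min(unassigned)])
        taken = 0
        while taken < want:
            if not frontier:               # region exhausted: reseed
                frontier.append(min(unassigned))
            v = frontier.popleft()
            if part[v] != -1:
                continue
            part[v] = d                    # "bite" v into sub-domain d
            unassigned.discard(v)
            taken += 1
            frontier.extend(u for u in adj[v] if part[u] == -1)
    return part

Note that, exactly as the next slide warns, nothing here prevents a sub-domain from ending up disconnected when its frontier runs dry and the reseed jumps elsewhere.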

28
Greedy Algorithm
29
Characteristics of Greedy
  • Advantages
  • it is very fast
  • directly yields the required number of partitions
  • load balance is even
  • sub-domains generally have good aspect ratios
  • Disadvantages
  • often generates disconnected sub-domains
  • this may lead to high communication costs

30
Optimization Algorithms
  • Take a general view of the decomposition procedure
  • Specify some objective function
  • assign a scalar value H to each decomposition x
  • choose the form of H so that it is small for acceptable solutions
  • A simple form is H(x) = Hcalc(x) + m Hcomm(x)
  • m represents the relative cost of computation and communication - a physical analogy is energy
  • The problem is to minimise H w.r.t. the decomposition x
  • Hcomm may attempt to model the architecture

31
Partitioning and Mapping
  • The direct k-way algorithms ignored the mapping stage
  • In principle, optimization algorithms allow partitioning and mapping to be addressed simultaneously
  • In practice, an accurate assessment of the communication metric would depend on the network topology, the message-passing system, contention for links, etc.
  • For this reason the mapping problem is often ignored
  • Algorithms must use heuristics due to NP-completeness
  • We must ensure the solution is not trapped in a local minimum

32
Initial Decomposition
  • Optimisation algorithms require a starting point
  • Random or other trivial decomposition
  • quick and easy, but rather poor in quality (i.e. large H)
  • the optimization algorithm does all the work
  • Exploration of the landscape is done step by step
  • the time to arrive at a solution may depend strongly on where we start
  • optimization algorithms may be uncompetitive if we choose a bad starting point
  • Use optimization in combination with other techniques

33
Simulated Annealing
  • Simulates the slow cooling of a physical system
  • The key components of the algorithm are
  • an energy function or Hamiltonian, which we take to be H(x)
  • the temperature of the system, T
  • some operator D(x) which randomly proposes small changes
  • Simulated annealing works by
  • iteratively proposing changes
  • accepting or rejecting changes based on the Metropolis criterion

34
The Metropolis Criterion
  • Developed for statistical mechanics simulations
  • A change in x results in a corresponding change ΔH in H
  • We either accept or reject the proposed state x_(i+1) as follows: if ΔH <= 0, accept; if ΔH > 0, accept with probability e^(-ΔH/T)
  • Key points
  • improving or downhill changes are always accepted
  • degrading or uphill changes are sometimes accepted (a sketch combining H, D(x) and this criterion follows)
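Putting the last few slides together, a hedged end-to-end sketch: the energy combines a squared load-imbalance term with m times the edge cut (assumed forms of Hcalc and Hcomm, chosen for illustration), D(x) moves one random vertex, and acceptance follows the Metropolis criterion. Recomputing H from scratch each step keeps the sketch short; a real code would compute ΔH incrementally.

import math, random

def energy(adj, part, k, m=1.0):
    # Hcomm: edge cut; Hcalc: squared deviation from perfect balance
    cut = sum(1 for v, nbrs in enumerate(adj) for u in nbrs
              if u > v and part[u] != part[v])
    sizes = [0] * k
    for d in part:
        sizes[d] += 1
    mean = len(part) / k
    return sum((s - mean) ** 2 for s in sizes) + m * cut

def anneal(adj, k, t=2.0, alpha=0.999, steps=20000, m=1.0):
    n = len(adj)
    part = [random.randrange(k) for _ in range(n)]   # trivial starting point
    h = energy(adj, part, k, m)
    for _ in range(steps):
        v = random.randrange(n)                      # D(x): move one vertex
        old = part[v]
        part[v] = random.randrange(k)
        dh = energy(adj, part, k, m) - h
        # Metropolis: downhill always accepted, uphill with prob e^(-dH/T)
        if dh <= 0 or random.random() < math.exp(-dh / t):
            h += dh
        else:
            part[v] = old                            # reject: undo the move
        t *= alpha                                   # geometric cooling
    return part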

35
Temperature Parameter
  • How does the choice of T affect the algorithm?
  • The probability of accepting an uphill change is P = e^(-ΔH/T)
  • T small: only decreases in H are accepted - gradient descent
  • T large: any change is accepted - the whole state space is explored

36
Annealing Schedule
  • T need not remain constant during the algorithm
  • Neither extreme of temperature is satisfactory
  • high T - we explore at random but never improve the solution
  • low T - we only find local minima
  • Start with a high T and slowly lower it
  • the precise way in which this is done is called the annealing schedule
  • a simple choice might be geometric cooling, T_(k+1) = a T_k with 0 < a < 1
  • If the annealing schedule is slow enough then the probability of finding a global optimum approaches unity

37
Change of State
  • What form should the operator D(x) take?
  • D(x) may take many forms
  • move a single vertex to any other domain
  • move a single vertex to a neighbouring domain
  • move groups of adjacent vertices in a similar manner
  • Ideally we want to make large moves with a small ΔH

38
Other Algorithms
  • Chained Local Optimization
  • equivalent to Simulated Annealing with an intelligent choice of D(x)
  • Genetic Algorithms
  • based on ideas of evolution by natural selection
  • explore many regions of the landscape simultaneously
  • Recursive Partitioning
  • takes a divide-and-conquer approach
  • the bisection plane can be chosen based on a cost function (a coordinate-bisection sketch follows)
  • And many more...
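For completeness, a hedged sketch of one recursive scheme, recursive coordinate bisection: split the vertices at the median along their longest coordinate axis and recurse, dividing the target count k between the two halves.

def rcb(points, ids, k):
    # points: list of coordinate tuples; ids: vertex indices to split
    if k == 1:
        return {i: 0 for i in ids}
    # bisect along the axis with the largest extent
    axis = max(range(len(points[0])),
               key=lambda a: max(points[i][a] for i in ids)
                           - min(points[i][a] for i in ids))
    ids = sorted(ids, key=lambda i: points[i][axis])
    cut = len(ids) * (k // 2) // k              # split in proportion to k
    left = rcb(points, ids[:cut], k // 2)
    right = rcb(points, ids[cut:], k - k // 2)
    return {**left, **{i: d + k // 2 for i, d in right.items()}}

# usage sketch: part = rcb(pts, list(range(len(pts))), 8)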

39
High Performance Parallel Programming
  • Next week
  • Thursday: Prof. Kotagiri on languages
  • Friday: Dr. Applebe on Automatic Parallelization