1
High Performance Parallel Programming
  • Dirk van der Knijff
  • Advanced Research Computing
  • Information Division

2
High Performance Parallel Programming
  • Lecture 19 Unstructured Mesh Decomposition

3
Last lecture
  • We were discussing methods of triangulation / tetrahedralization for generating unstructured meshes
  • We had mentioned grid-based methods and looked at the advancing front method
  • Today we will finish off mesh generation and look at mesh decomposition

4
Delaunay triangulation
  • Given a set of points in a plane, the Voronoi polygon about a point P is the set of points that are closer to, or as close to, P as to any other point in the set

(Figure: Voronoi tessellation of a point set, with the dual Delaunay triangulation and a circumcircle shown)
5
Delaunay triangulation (cont.)
  • The Delaunay triangulation is the dual of the Voronoi tessellation, and has the following properties:
  • No point is contained in the circumcircle of any triangle
  • Maximises the minimum angle over all triangular elements (note we would like to minimise the maximum...)
  • The Delaunay triangulation is unique except for degenerate distributions of points
  • in 2D: four points on a circle

6
Delaunay triangulation (cont.)
  • 3D 5 points on a sphere
  • 3D 6 points (octahedron) can give an invalid mesh
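Away from these degenerate cases, the construction is easy to try with an off-the-shelf implementation. Below is a minimal usage sketch with SciPy's Delaunay class, which wraps the Qhull library (the choice of SciPy is ours for illustration; the lecture names no package):

import numpy as np
from scipy.spatial import Delaunay

# five points in the plane; no four lie on a common circle, so the
# Delaunay triangulation is unique
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.4, 0.6]])
tri = Delaunay(pts)
print(tri.simplices)   # each row is one triangle as three vertex indices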

7
Bowyer-Watson algorithm
  • Adds points sequentially into an existing triangulation
  • To add a point:
  • find all existing triangles whose circumcircle contains the new point
  • delete these triangles to leave a cavity (always star-shaped about the new point)
  • join the new point to all the vertices of the cavity
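A compact 2D sketch of these steps, assuming floating-point coordinates and ignoring degenerate (collinear or exactly cocircular) inputs — an illustration, not the lecturer's code:

def circumcircle_contains(tri, p, pts):
    # circumcentre via the standard determinant formula; d == 0 would
    # mean the three vertices are collinear (not handled in this sketch)
    (ax, ay), (bx, by), (cx, cy) = (pts[i] for i in tri)
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax*ax + ay*ay) * (by - cy) + (bx*bx + by*by) * (cy - ay)
          + (cx*cx + cy*cy) * (ay - by)) / d
    uy = ((ax*ax + ay*ay) * (cx - bx) + (bx*bx + by*by) * (ax - cx)
          + (cx*cx + cy*cy) * (bx - ax)) / d
    return (p[0] - ux)**2 + (p[1] - uy)**2 < (ax - ux)**2 + (ay - uy)**2

def bowyer_watson(points):
    # start from a huge "super-triangle" that encloses every input point
    pts = list(points) + [(-1e6, -1e6), (1e6, -1e6), (0.0, 1e6)]
    n = len(points)
    tris = [(n, n + 1, n + 2)]
    for i in range(n):
        # 1. triangles whose circumcircle contains the new point
        bad = [t for t in tris if circumcircle_contains(t, pts[i], pts)]
        # 2. cavity boundary = edges belonging to exactly one bad triangle
        count = {}
        for a, b, c in bad:
            for e in ((a, b), (b, c), (c, a)):
                key = tuple(sorted(e))
                count[key] = count.get(key, 0) + 1
        boundary = [e for e, k in count.items() if k == 1]
        # 3. delete the bad triangles, then join the new point to the cavity
        tris = [t for t in tris if t not in bad]
        tris += [(a, b, i) for a, b in boundary]
    # discard anything still attached to the super-triangle
    return [t for t in tris if max(t) < n]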

8
Point generation
  • The Delaunay triangulation assumes that the points to be triangulated are already known
  • points can be pre-generated from the vertices of overlapping structured grids - they may need filtering and smoothing
  • points can be generated simultaneously with the triangles, to improve the quality of the triangles
  • such grids are less smooth than those from the advancing front method, but generation is much quicker
  • there can be robustness problems, especially in the initial phase, when triangles may be highly distorted

9
Boundaries
  • The Delaunay construction triangulates a set of
    points, and does not necessarily conform to
    imposed boundaries

10
Constraining / conforming
  • In 2D, the constrained Delaunay triangulation is
    well defined
  • In 3D, no constrained DT is defined, but it is
    possible to recover the boundary edges and faces
    by using face-edge swapping
  • Alternatively, it is possible to make the DT
    conform to a boundary curve / surface by adding
    enough points on the boundary to ensure that the
    DT will include the edges on the surface.

11
Face-edge swapping
12
Other methods
  • Paving: advancing layers of quadrilaterals / hexahedra
  • needs collision rules when fronts merge
  • gives good elements at surfaces
  • Whisker weaving and the spatial twist continuum
  • Hybrid methods: structured near surfaces, unstructured elsewhere
  • advancing layers (for advancing front)
  • advancing normals (for Delaunay point insertion)
  • Recursive decomposition
  • etc.

13
Unstructured mesh decomposition
  • Review: how are unstructured meshes different from regular grids?
  • Regular grids
  • are topologically Cartesian grids
  • may be represented as arrays
  • Unstructured meshes
  • have no regular structure
  • a node may be connected to an arbitrary number of neighbours
  • cannot be represented by a simple array, so a more complex data structure must be used (the implementation may still be via arrays, as in the CSR sketch below)
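For concreteness, a common array-based layout is compressed sparse row (CSR), the format that partitioning libraries such as METIS expect: one array of offsets and one array of neighbour indices. A minimal sketch:

def to_csr(neighbour_lists):
    # adjncy[xadj[v]:xadj[v+1]] is vertex v's slice of neighbours
    xadj, adjncy = [0], []
    for nbrs in neighbour_lists:
        adjncy.extend(nbrs)
        xadj.append(len(adjncy))
    return xadj, adjncy

# vertex 0 touches 1 and 2; vertices 1 and 2 each touch only 0
xadj, adjncy = to_csr([[1, 2], [0], [0]])
assert adjncy[xadj[0]:xadj[1]] == [1, 2]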

14
Static and Adaptive methods
  • Static mesh
  • operate with a fixed mesh for the entirety of the
    run
  • Adaptive mesh
  • locally refine or coarsen the mesh
  • usually do this according to some measure of the
    local error in the solution (e.g. the local
    residual)
  • Adaptive methods are applicable to
  • iterative solution problems
  • time-dependent problems

15
Parallelization
  • Decompose by dividing the mesh amongst PEs
  • The quality of the mesh decomposition has a highly significant effect on performance
  • Arriving at a good decomposition is a complex task
  • "good" may be problem- / architecture-dependent
  • A wide variety of well-established methods exist
  • Several packages / libraries provide implementations of many of the methods
  • A major practical difficulty is differences in file formats...

16
What makes a good decomposition?
  • Load balance
  • elements should be distributed across the processors so that each has an equal share of the work
  • Communication costs should be minimized
  • there should be as few elements as possible on the boundary of each sub-domain
  • each sub-domain should have as few neighbours as possible
  • Distribution should reflect the machine architecture
  • comms/calc and bandwidth/latency ratios need to be considered

17
Dynamic decomposition
  • Introduces extra complexities in parallel
  • static mesh - decomposition is done only once (as a pre-process)
  • dynamic mesh - decomposition becomes part of the calculation
  • Dynamic decomposition problems
  • the time taken may outweigh the benefit gained, so we need to check that it is worthwhile
  • the decomposition must run in parallel (and be fast)
  • to minimize cost, the decomposition must take the previous decomposition into account and make minimal changes...
  • We will only look at static decomposition

18
Dual graphs
  • To do a decomposition we need
  • a representation of the basic entities being distributed
  • an idea of how communication takes place between them
  • A dual graph, based on the mesh, can fill this role (a construction is sketched below)
  • vertices in the graph represent the entities
  • edges in the graph represent communication
  • The graph depends on how data is transferred
  • for finite elements it could be via nodes, edges or faces, so a single mesh can have many dual graphs
  • Edges (comms) or vertices (calc) may be weighted
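A sketch of the construction for a triangle mesh, taking "communication via edges" as the rule (one choice among the several just mentioned): two triangles become adjacent dual-graph vertices whenever they share a mesh edge.

from collections import defaultdict

def dual_graph(triangles):
    # map each (sorted) mesh edge to the triangles containing it
    edge_to_tris = defaultdict(list)
    for t_id, (a, b, c) in enumerate(triangles):
        for e in ((a, b), (b, c), (c, a)):
            edge_to_tris[tuple(sorted(e))].append(t_id)
    # one dual vertex per triangle; one dual edge per shared mesh edge
    adj = [set() for _ in triangles]
    for tris in edge_to_tris.values():
        if len(tris) == 2:
            t1, t2 = tris
            adj[t1].add(t2)
            adj[t2].add(t1)
    return adj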

19
Example in 2D
20
Decomposing the dual graph
  • Mesh decomposition is a graph partitioning problem
  • this problem occurs in several areas
  • much work is still being done
  • The partitioning problem is as follows
  • for an undirected graph G with vertices V and edges E, find k disjoint, (approximately) equal subsets of V such that the number of cut edges between them is minimised (a helper that scores a partition this way is sketched below)
  • the graph may be weighted
  • weights can represent work/vertex, comms/edge...
  • for weighted graphs, seek equal vertex weights for each subset and a minimisation of the sum of the weights of the cut edges
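As a concrete reading of the unweighted problem statement, a small helper (hypothetical, for illustration) that scores a candidate partition by its edge cut and load imbalance:

def evaluate(adj, part, k):
    # adj: list of neighbour sets; part[v]: sub-domain of vertex v
    cut = sum(1 for v, nbrs in enumerate(adj) for u in nbrs
              if u > v and part[u] != part[v])    # each edge counted once
    sizes = [0] * k
    for d in part:
        sizes[d] += 1
    imbalance = max(sizes) / (len(part) / k)      # 1.0 = perfect balance
    return cut, imbalance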

21
Problem complexity
  • Graph partitioning is NP-complete
  • this means that no exact solution can be found in any reasonable time for non-trivial examples
  • complete enumeration is infeasible
  • for n nodes and p sub-domains the search space is of size p^n
  • p may be in the hundreds and n in the millions
  • we must therefore resort to heuristics
  • we want an acceptable approximate solution in a reasonable time

22
Partitioning and Mapping
  • Mesh decomposition has two aspects
  • partitioning a graph
  • mapping the sub-domains to processors
  • Algorithms may address both issues or ignore the
    latter
  • whether this is an issue depends on the target machine
  • Can trade load balance against communication cost
  • may be profitable to accept a small load
    imbalance if it results in a large decrease in
    communication
  • network topology may also be relevant

23
Decomposition algorithms
  • Simple direct k-way algorithms
  • Random, Scattered and Linear Bandwidth Reduction
  • The Greedy Algorithm
  • Optimisation algorithms
  • Simulated Annealing
  • Other algorithms
  • Chained Local Optimization
  • Genetic algorithms
  • Recursive Partitioning

24
Direct k-way algorithms
  • Random partitioning
  • assign each vertex to a processor at random
  • load balance is, on average, attained
  • no account is taken of mesh connectivity
  • communication costs will be enormous
  • Scattered partitioning
  • vertices are handed out in order
  • each subsequent vertex goes to the currently smallest sub-domain
  • load balance will be good
  • but adjacent vertices will not be kept together
  • again communication suffers (both schemes are sketched below)
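Both schemes are a few lines each; minimal sketches (illustrative, not from the lecture):

import random

def random_partition(n, k):
    # each vertex is assigned to a sub-domain uniformly at random
    return [random.randrange(k) for _ in range(n)]

def scattered_partition(n, k):
    # vertices handed out in order, each to the currently smallest sub-domain
    sizes = [0] * k
    part = []
    for _ in range(n):
        d = min(range(k), key=sizes.__getitem__)
        part.append(d)
        sizes[d] += 1
    return part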

25
Linear partitioning
  • Regular domain decomposition of the vertices
  • for an unweighted graph of n vertices on p processors, give the first n/p vertices to the first sub-domain, the second n/p to the next, and so on (see the sketch below)
  • this can give surprisingly good results
  • it exploits the data locality often implicit in the vertex numbering
  • usually an artifact of automatic mesh generation
  • without refinement this is superior to random or scattered
  • quality depends on the bandwidth of the system of equations
  • this is open to improvement by standard techniques
  • elements are re-numbered to improve the adjacency matrix
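Linear partitioning is essentially one line; a sketch for the unweighted case:

import math

def linear_partition(n, p):
    # vertex v goes to sub-domain v // ceil(n/p): contiguity in the
    # numbering becomes contiguity in the decomposition
    chunk = math.ceil(n / p)
    return [v // chunk for v in range(n)]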

26
Bandwidth reduction
  • Linear decomposition onto two processors

27
Greedy Algorithm
  • Start at some point on the boundary
  • Do until finished:
  • successively bite adjacent nodes from the mesh until the sub-domain has the required number of nodes
  • then choose, from the interior boundary of the most recently completed sub-domain, the node with the fewest elements connected to it as the next starting point (a sketch follows)
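A hedged sketch of the bite-off idea, operating on a dual graph as built earlier. Picking the lowest-numbered unassigned vertex as each new seed is a simplification of the slide's "fewest connected elements" rule:

from collections import deque

def greedy_partition(adj, k):
    n = len(adj)
    part = [-1] * n                        # -1 = not yet assigned
    unassigned = set(range(n))
    target = n // k
    for d in range(k):
        want = target if d < k - 1 else len(unassigned)
        frontier = deque([min(unassigned)])
        taken = 0
        while taken < want:
            if not frontier:               # region exhausted: reseed
                frontier.append(min(unassigned))
            v = frontier.popleft()
            if part[v] != -1:
                continue
            part[v] = d                    # "bite" v into sub-domain d
            unassigned.discard(v)
            taken += 1
            frontier.extend(u for u in adj[v] if part[u] == -1)
    return part

Note that, exactly as the next slide warns, nothing here prevents a sub-domain from ending up disconnected when its frontier runs dry and the reseed jumps elsewhere.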

28
Greedy Algorithm
29
Characteristics of Greedy
  • Advantages
  • it is very fast
  • directly yields the required number of partitions
  • load balance is even
  • sub-domains generally have good aspect ratios
  • Disadvantages
  • often generates disconnected sub-domains
  • this may lead to high communication costs

30
Optimization Algorithms
  • Take a general view of the decomposition procedure
  • Specify some objective function
  • assign a scalar value H to each decomposition x
  • choose the form of H so that it is small for acceptable solutions
  • A simple form is H(x) = Hcalc(x) + m Hcomm(x)
  • m represents the relative cost of computation and communication - a physical analogy is energy
  • The problem is to minimise H w.r.t. the decomposition x
  • Hcomm may attempt to model the architecture

31
Partitioning and Mapping
  • The direct k-way algorithms ignored the mapping stage
  • In principle, optimization algorithms allow partitioning and mapping to be addressed simultaneously
  • In practice, an accurate assessment of the communication metric would depend on the network topology, the message-passing system, contention for links, etc.
  • For this reason the mapping problem is often ignored
  • Algorithms must use heuristics due to NP-completeness
  • We must ensure the solution is not trapped in a local minimum

32
Initial Decomposition
  • Optimisation algorithms require a starting point
  • Random or other trivial decomposition
  • quick and easy, but rather poor in quality (i.e. large H)
  • the optimization algorithm does all the work
  • Exploration of the landscape is done step by step
  • the time to arrive at a solution may depend strongly on where we start
  • optimization algorithms may be uncompetitive if we choose a bad starting point
  • Use optimization in combination with other techniques

33
Simulated Annealing
  • Simulates the slow cooling of a physical system
  • The key components of the algorithm are
  • an energy function or Hamiltonian, which we take to be H(x)
  • the temperature of the system, T
  • some operator D(x) which randomly proposes small changes
  • Simulated annealing works by
  • iteratively proposing changes
  • accepting or rejecting changes based on the Metropolis criterion

34
The Metropolis Criterion
  • Developed for statistical mechanics simulations
  • A change in x results in a corresponding change ΔH in H
  • We either accept or reject the proposed state x_(i+1) as follows: if ΔH <= 0, accept; if ΔH > 0, accept with probability e^(-ΔH/T)
  • Key points
  • improving or downhill changes are always accepted
  • degrading or uphill changes are sometimes accepted (a sketch combining H, D(x) and this criterion follows)
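Putting the last few slides together, a hedged end-to-end sketch: the energy combines a squared load-imbalance term with m times the edge cut (assumed forms of Hcalc and Hcomm, chosen for illustration), D(x) moves one random vertex, and acceptance follows the Metropolis criterion. Recomputing H from scratch each step keeps the sketch short; a real code would compute ΔH incrementally.

import math, random

def energy(adj, part, k, m=1.0):
    # Hcomm: edge cut; Hcalc: squared deviation from perfect balance
    cut = sum(1 for v, nbrs in enumerate(adj) for u in nbrs
              if u > v and part[u] != part[v])
    sizes = [0] * k
    for d in part:
        sizes[d] += 1
    mean = len(part) / k
    return sum((s - mean) ** 2 for s in sizes) + m * cut

def anneal(adj, k, t=2.0, alpha=0.999, steps=20000, m=1.0):
    n = len(adj)
    part = [random.randrange(k) for _ in range(n)]   # trivial starting point
    h = energy(adj, part, k, m)
    for _ in range(steps):
        v = random.randrange(n)                      # D(x): move one vertex
        old = part[v]
        part[v] = random.randrange(k)
        dh = energy(adj, part, k, m) - h
        # Metropolis: downhill always accepted, uphill with prob e^(-dH/T)
        if dh <= 0 or random.random() < math.exp(-dh / t):
            h += dh
        else:
            part[v] = old                            # reject: undo the move
        t *= alpha                                   # geometric cooling
    return part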

35
Temperature Parameter
  • How does the choice of T affect the algorithm?
  • The probability of accepting an uphill change is P = e^(-ΔH/T)
  • T small: only decreases in H are accepted - gradient descent
  • T large: any change is accepted - the whole state space is explored

36
Annealing Schedule
  • T need not remain constant during the algorithm
  • Neither extreme of temperature is satisfactory
  • high T - we explore at random but never improve the solution
  • low T - we only find local minima
  • Start with a high T and slowly lower it
  • the precise way in which this is done is called the annealing schedule
  • a simple choice might be geometric cooling, T_(k+1) = a T_k with 0 < a < 1
  • If the annealing schedule is slow enough then the probability of finding a global optimum approaches unity

37
Change of State
  • What form should the operator D(x) take?
  • D(x) may take many forms
  • move a single vertex to any other domain
  • move a single vertex to a neighbouring domain
  • move groups of adjacent vertices in a similar manner
  • Ideally we want to make large moves with a small ΔH

38
Other Algorithms
  • Chained Local Optimization
  • equivalent to Simulated Annealing with an intelligent choice of D(x)
  • Genetic Algorithms
  • based on ideas of evolution by natural selection
  • explore many regions of the landscape simultaneously
  • Recursive Partitioning
  • takes a divide-and-conquer approach
  • the bisection plane can be chosen based on a cost function (a coordinate-bisection sketch follows)
  • And many more...
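For completeness, a hedged sketch of one recursive scheme, recursive coordinate bisection: split the vertices at the median along their longest coordinate axis and recurse, dividing the target count k between the two halves.

def rcb(points, ids, k):
    # points: list of coordinate tuples; ids: vertex indices to split
    if k == 1:
        return {i: 0 for i in ids}
    # bisect along the axis with the largest extent
    axis = max(range(len(points[0])),
               key=lambda a: max(points[i][a] for i in ids)
                           - min(points[i][a] for i in ids))
    ids = sorted(ids, key=lambda i: points[i][axis])
    cut = len(ids) * (k // 2) // k              # split in proportion to k
    left = rcb(points, ids[:cut], k // 2)
    right = rcb(points, ids[cut:], k - k // 2)
    return {**left, **{i: d + k // 2 for i, d in right.items()}}

# usage sketch: part = rcb(pts, list(range(len(pts))), 8)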

39
High Performance Parallel Programming
  • Next week
  • Thursday: Prof. Kotagiri on languages
  • Friday: Dr. Applebe on Automatic Parallelization