Title: M.S. Shephard, K.E. Jansen, A. Ovcharenko, O. Sahni, Ting Xie and Min Zhou
1 Dynamic Load Balancing Needs of Parallel Adaptive Analysis
- M.S. Shephard, K.E. Jansen, A. Ovcharenko, O. Sahni, Ting Xie and Min Zhou
- Scientific Computation Research Center
- Rensselaer Polytechnic Institute
- Outline
- Introduction to approach
- Parallel adaptive finite element analysis
- Scaling the solver
- Adaptive mesh control
- Parallel mesh structure and parallel mesh migration
- Some applications
- More parallel applications
- Mesh generation
- Adaptive multiscale
This work is supported by the DOE SciDAC program as part of the Interoperable Technologies for Advanced Petascale Simulations, and by NSF through a petascale application grant
2 Status and Issues in Parallel Adaptive Simulation
- Components needed
- Automatic mesh generators
- General mesh adaptation
- Mesh correction indication
- Parallel adaptive mesh control
- Dynamic load balancing
- Issues going forward
- Dealing with more complex parallel control functions
- New demands on dynamic load balancing
- Development of parallel adaptive applications
- Operation in a petascale environment
initial mesh
adapted mesh
3 Dealing with New Application Development
- Want to reuse methods and tools - provide enabling technology.
- A partition-based parallel computation model is assumed.
- In this case the key components can be abstracted as
- The System Model
- To account for the characteristics of the computing system - becoming more complicated at the petascale
- The Partition Model
- A simple model to which application models can be mapped for purposes of parallel adaptive computations
- The Application Model
- Accounts for how computations can be done in parallel
- Must focus on the entities associated with the computations and their interactions
- Structure of computational entities and their interactions must map to the partition model
4 System Model
- Petascale Computers
- A key driver: number of kilowatts to run and cool the computer
- Cannot afford to construct them like the clusters we all love
- There will be many cores per node, and cores and even nodes will not all be the same
- Substantial differences between machines - contrast the IBM BlueGene with Ranger (Sun machine at U. Texas)
The machine going into Argonne, as of an August 2006 presentation by Rick Stevens
5 Partition Model
- The partition is a collection of parts
- Parts
- Have a given amount of computational load
- Need to communicate with other parts in a prescribed way during the computation
- Parts are collections of objects that are of meaning to the applications
- Needs effective interactions with dynamic load balancing
6 Dynamic Load Balancing
- Zoltan Dynamic Services (http://www.cs.sandia.gov/Zoltan/)
- Supports multiple dynamic partitioners
- General control of the definition of part objects and weights supported
- Under active development with emphasis on going to petascale machines
- Focused on graph-based (or hypergraph-based) partitioners (intend to use them even when the interactions are spatial, like potential contact: simply define graph edges when things might touch in the next set of steps!)
7 Application Model
- Must
- Define the objects that will make up the parts
- Quantification of the object computation
- Determination of object-to-object dependencies
- Defining and controlling the entities
- Have to be defined at the appropriate level and be related to the data structures and control of the application
- Entities considered in the applications covered today
- Mesh entities in non-manifold FE mesh
- Collection of mesh entities to be kept in the same part
- Integration points for which unit cell evaluations are performed
- Atoms
- Chunks of space containing material
8 Petascale Adaptive FE Analysis
- Steps to get there
- 1. Be sure the fixed mesh solver scales to 100,000s of processors
- 2. Provide parallel distributed support for mesh adaptation
- 3. Construct adaptive loops in which all components run on petascale machines
- 4. Get scalability on all of it
- Status Summary
- Good progress with an implicit FE flow code
- Tools for supporting parallel mesh adaptation
- Constructing initial adaptive loops
- A way to go to petascale
9 Introduction to PHASTA
- Parallel finite element flow solver that solves both compressible and incompressible flow.
- Implicit time integration - requires the solution of very large systems of linear algebraic equations at each time step using iterative solvers.
- PHASTA and its predecessor have been parallel for over 15 years, 10 of which have been at RPI.
- Breaks the total domain into parts with roughly the same number of elements on each processor.
- Work can be characterized as requiring
- Substantial floating point operations to form the system of equations,
- Organized, substantial, and regular communication between partitions that touch each other,
- For each iteration (typically O(10) iterations per solve), a required ALL-REDUCE communication.
10 Patient-Specific Abdominal Aortic Aneurysm
- Mesh had > 50M dof
- Must be solved in 10 min.
- Implicit FE flow solve scales
Proc.    t (sec)    scale
16384    60.6       1.04
8192     131.7      0.957
4096     241.6      1.04
2048     502.3      1.00
1024     1008.7     1.00
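The slide does not define the "scale" column. The sketch below assumes it is the strong-scaling factor relative to the 1024-processor run, scale(P) = (t_1024 * 1024) / (t_P * P). Under that assumption the computed values round to the tabulated ones; the function name is illustrative only.

```python
# Strong-scaling factor relative to the smallest run (an assumed definition;
# the slide only tabulates the result):
#   scale(P) = (t_base * P_base) / (t_P * P)

# (processors, wall-clock seconds), copied from the table above
runs = [
    (1024, 1008.7),
    (2048, 502.3),
    (4096, 241.6),
    (8192, 131.7),
    (16384, 60.6),
]


def scale_factors(runs):
    """Return {processors: scale}, using the run with the fewest processors as base."""
    base_procs, base_time = min(runs)
    return {p: (base_time * base_procs) / (t * p) for p, t in runs}


factors = scale_factors(runs)
for procs, scale in sorted(factors.items()):
    print(f"{procs:6d}  {scale:.3f}")
```

A value above 1 means the run was slightly faster than perfect strong scaling from the base run would predict; a value below 1 means it was slower.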
11 Implementation of Adaptive Mesh Control
- Given the mesh size field
- Mesh modification loop
- Look at element edge lengths and shape
- If both satisfactory, continue to next element
- If not, select best modification
- Elements with edges that are too long must have edges split or swapped
- Short edges eliminated
- Continue until size and shape are satisfied or no more improvement is possible
- Determination of best mesh modification
- Select mesh modifications based on element shape properties
- Appropriate considerations of neighboring elements
- Choosing the best mesh modification
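The per-element decision in the mesh modification loop can be sketched for a single 2D triangle and a uniform requested size. The thresholds, the quality measure and the function names are assumptions made for illustration; the slides give no numerical criteria, and a real implementation would also examine neighboring elements before choosing an operation.

```python
import math

# Hypothetical thresholds; the slides do not give numerical values.
LONG_RATIO = 1.5    # edge longer than 1.5x the requested size is "too long"
SHORT_RATIO = 0.5   # edge shorter than 0.5x the requested size is "too short"
MIN_QUALITY = 0.2   # normalized shape measure below this is "poorly shaped"


def triangle_quality(a, b, c):
    """Normalized shape measure: 1 for an equilateral triangle, near 0 for a sliver."""
    twice_area = abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))
    sum_sq = math.dist(a, b) ** 2 + math.dist(b, c) ** 2 + math.dist(c, a) ** 2
    return 2.0 * math.sqrt(3.0) * twice_area / sum_sq if sum_sq > 0 else 0.0


def choose_modification(tri, size):
    """Pick one operation for a triangle, given the requested edge size."""
    a, b, c = tri
    ratios = [math.dist(p, q) / size for p, q in ((a, b), (b, c), (c, a))]
    if max(ratios) > LONG_RATIO:
        return "split"       # long edges are split
    if min(ratios) < SHORT_RATIO:
        return "collapse"    # short edges are eliminated
    if triangle_quality(a, b, c) < MIN_QUALITY:
        return "swap"        # size is acceptable but shape is not
    return "none"            # both size and shape are satisfactory


print(choose_modification(((0, 0), (1, 0), (0.5, 0.866)), size=1.0))
print(choose_modification(((0, 0), (2, 0), (1, 1.7)), size=1.0))
```

Checking size before shape is one possible ordering; the slides only say that the best modification is selected from element shape properties and the neighboring elements.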
12 Mesh Information
- A piece-wise domain decomposition over which the simulation is to be run
- A mesh data structure provides services to create and/or use the mesh data
- Each application has its own needs of mesh representation in terms of levels of entities and adjacencies used
- Hence the need for flexibility in mesh representations
- 3 approaches for mesh data structure design
- Fixed, specific mesh representation
- Reduced model: store only needed entities
- Fixed, general mesh representation
- Full model: store all entities.
- Flexible mesh representation
- Flexible Mesh Data structure
- Switch between various representations for different needs of applications
- Application specifies which entities and adjacencies it needs.
- Achieving good performance in both memory and computational cost for a wide range of applications.
13 Distributed Mesh Data Structure
- Distributed Mesh
- Mesh divided into parts for distribution on parallel computers
- Part $P_i$ consists of a set of mesh entities assigned to the $i$-th part.
- Part Object
- The basic unit to which a part ID is assigned.
- A mesh entity to be partitioned
- Mesh entities to be partitioned in the example mesh are $M_1^3$, $M_1^2$, $M_2^3$, $M_2^2$, $M_1^1$
- An EntityGroup.
- Residence Part
- Operator $P[M_i^d]$ returns the set of part IDs where $M_i^d$ exists (e.g. $P[M_1^0] = \{P_0, P_1, P_2\}$)
- Residence part of $M_i^d$ on $P_i$
- If a partition object, $P[M_i^d] = \{P_i\}$
- Otherwise, $P[M_i^d] = \bigcup P[M_j^q]$ over all $M_j^q$ with $M_i^d \in \partial(M_j^q)$
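The residence part rule can be illustrated with a small sketch: a partition object lives on exactly one part, and any other entity lives on the union of the residence parts of the entities it bounds. The dictionary layout and the names below are assumptions made for illustration, not the FMDB interface.

```python
def residence_parts(entity, part_of_object, bounded_by):
    """Return the set of part ids on which `entity` must exist.

    part_of_object: dict mapping each partition object to its single part id.
    bounded_by: dict mapping an entity to the higher-dimension entities it bounds.
    """
    if entity in part_of_object:
        # A partition object resides on exactly one part.
        return {part_of_object[entity]}
    parts = set()
    for upper in bounded_by.get(entity, ()):
        # Union over every entity that has `entity` on its boundary.
        parts |= residence_parts(upper, part_of_object, bounded_by)
    return parts


# Three faces on three different parts; vertex v0 bounds edges on all of them.
part_of_object = {"f0": 0, "f1": 1, "f2": 2}
bounded_by = {
    "e0": ["f0", "f1"],   # edge shared by faces f0 and f1
    "e1": ["f1", "f2"],   # edge shared by faces f1 and f2
    "v0": ["e0", "e1"],   # vertex on both edges
}

print(residence_parts("v0", part_of_object, bounded_by))  # all three parts
print(residence_parts("e0", part_of_object, bounded_by))  # parts 0 and 1
```

The vertex case mirrors the slide's example of an entity whose residence part set contains three parts.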
14 Entity Group
- EntityGroup
- A group of mesh entities that needs to stay together in a part during the lifetime of the EntityGroup, as defined by the needs of an application.
- Example - stack of prismatic elements in a boundary layer to support the adaptation of the layer
- Entity group rules
- Mesh entities in a group stay as a group during the lifetime of the EntityGroup
- A mesh entity can only be in a single EntityGroup, and is defined once in the EntityGroup
- EntityGroup information maintained before and after migration
- EntityGroup is dynamic as defined by the application, which can create and destroy an EntityGroup, or add/remove mesh entities in an EntityGroup
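A minimal sketch of the entity group rules, assuming a simple in-memory registry. The class and method names are hypothetical and are not the FMDB API; the point is only that single membership is enforced and that a group migrates as a unit.

```python
class EntityGroupRegistry:
    """Tracks EntityGroups and enforces single membership (hypothetical helper)."""

    def __init__(self):
        self.members = {}    # group name -> set of entity ids
        self.group_of = {}   # entity id -> group name

    def create(self, name):
        if name in self.members:
            raise ValueError(f"group {name!r} already exists")
        self.members[name] = set()

    def add(self, name, entity):
        # Rule: a mesh entity can be in only one EntityGroup, and only once.
        if entity in self.group_of:
            raise ValueError(f"{entity!r} is already in {self.group_of[entity]!r}")
        self.members[name].add(entity)
        self.group_of[entity] = name

    def remove(self, name, entity):
        self.members[name].remove(entity)
        del self.group_of[entity]

    def destroy(self, name):
        for entity in self.members.pop(name):
            del self.group_of[entity]

    def migrate(self, name, part_of, destination):
        """Move every member of the group to `destination` together."""
        for entity in self.members[name]:
            part_of[entity] = destination


registry = EntityGroupRegistry()
registry.create("boundary_layer_stack")
for prism in ("prism0", "prism1", "prism2"):
    registry.add("boundary_layer_stack", prism)

part_of = {"prism0": 0, "prism1": 0, "prism2": 0, "tet0": 0}
registry.migrate("boundary_layer_stack", part_of, destination=3)
print(part_of)
```

Group membership is untouched by `migrate`, which matches the rule that EntityGroup information is maintained before and after migration.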
15 Mesh Partition through Zoltan
- Perform mesh partition through Zoltan graph-based partitioning
- In mesh partitioning, a partition object can be either a mesh entity to be partitioned, or an EntityGroup.
Different colors represent different EntityGroups (3 EntityGroups in the 2D mesh).
Construct graph for mesh partition:
Graph nodes: objects to be partitioned (partition objects).
Graph edges: mesh edge-based dependencies between two objects.
Weights: set graph node and graph edge weights.
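The graph construction can be sketched for a 2D triangle mesh. This only builds the graph a partitioner would consume and makes no Zoltan calls. The weighting choices (node weight = number of elements in the object, edge weight = number of shared mesh edges) and all names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_partition_graph(triangles, group_of=None):
    """Build (node_weight, edge_weight) for partition objects of a triangle mesh.

    triangles: list of 3-tuples of vertex ids.
    group_of: optional dict, element index -> EntityGroup name.
    """
    group_of = group_of or {}
    # A partition object is the EntityGroup if the element belongs to one,
    # otherwise the element itself.
    obj = {t: group_of.get(t, ("elem", t)) for t in range(len(triangles))}

    node_weight = defaultdict(float)
    for t in range(len(triangles)):
        node_weight[obj[t]] += 1.0            # assumed weight: element count

    # Map each mesh edge to the elements that use it.
    elems_on_edge = defaultdict(list)
    for t, tri in enumerate(triangles):
        for u, v in combinations(sorted(tri), 2):
            elems_on_edge[(u, v)].append(t)

    edge_weight = defaultdict(float)
    for elems in elems_on_edge.values():
        for s, t in combinations(elems, 2):
            if obj[s] != obj[t]:              # no graph edge inside one object
                edge_weight[frozenset((obj[s], obj[t]))] += 1.0
    return dict(node_weight), dict(edge_weight)


# Strip of four triangles; the first two form one EntityGroup.
triangles = [(0, 1, 2), (1, 3, 2), (2, 3, 4), (3, 5, 4)]
nodes, edges = build_partition_graph(triangles, group_of={0: "G0", 1: "G0"})
print(nodes)
print(edges)
```

Weights are floats rather than integers, so an application can supply real-valued computational costs.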
16 Distributed Mesh Representation
- Functional Requirements
- Communication links
- Remote part: non-self part where an entity is duplicated
- Remote copy: the memory location of the entity duplicated on the remote part
- Efficient mechanisms to update mesh partitioning and keep the links between partitions are mandatory
- Entity ownership
- Used for operation control
- Static ownership
- Owner part of an entity is fixed to the specific partition regardless of mesh partitioning
- Not suitable for adaptive analysis due to severe load imbalance
- Dynamic ownership
- Owner part of an entity is determined dynamically depending on mesh partitioning
17 Distributed Mesh Data Structure
- Mesh Migration with Full Complete Representations
- Given a list of pairs <partition object, destination part id>
- STEP 1: Collect entities to be updated and reset their P and partition classification
- STEP 2: Determine P of partition objects and downward entities
- STEP 3: Based on P, update the partition model and collect entities to remove from each part
- STEP 4: Exchange entities and update remote copies
- STEP 5: Remove unnecessary entities collected in STEP 3
- STEP 6: Update the owner part of partition model entities
18 Flexible Distributed Mesh Data Structure
- Mesh Migration with reduced complete representation
- STEP A: collect neighboring part objects ()
- STEP B: restore downward interior entities ()
- STEP 1: collect entities to update and clear their partition classification and P
- STEP 2: determine P
- STEP 3: update partition classification and collect entities to remove
- STEP 4: create only necessary migrate-in entities in representation and update remote copies
- Do not send interior entities that will not be on the partition boundary ()
- STEP 5: remove unnecessary migrate-out entities
- STEP 6: update entity ownership
- STEP C: remove unnecessary interior entities and adjacencies ()
- savings in migration time with flexible mesh representation in parallel
- - losses in migration time with flexible mesh representation in parallel
Serial 2D mesh with MSR
Partitioned 2D mesh with MSR
19 Flexible Distributed Mesh Data Structure
- Example: 2-D mesh migration with the reduced representation
(a) Mark destination pid
(b) Step A: Get neighboring POs
(c) Step B: Restore internal ents
20 Parallel Mesh Adaptation
- Parallelization of refinement: perform on each part and synchronize at inter-part boundaries
- Parallelization of coarsening and swapping: migrate cavity (on-the-fly) and perform operation locally on one part.
- Requires update of evolving communication links between parts and dynamic mesh partitioning
21 Mesh Adaptation - Uniform Refinement
- Tests run on IBM Blue Gene/L
- Slow processors
- Fast communication
initial (265k tets)
- At 8 processors
- Initial mesh - 33K/processor
- Final mesh - 226K/processor
- At 128 processors
- Initial mesh - 2.0K/processor
- Final mesh - 16.7K/processor
adapted (2,127k tets)
- Scalability for one iteration of mesh adaptation
22 Mesh Adaptation - Refinement
- Communication time for one iteration of mesh adaptation
- Total time for one iteration of mesh adaptation
- Communication to total time ratio for one iteration of mesh adaptation
23 Mesh Adaptation - Refinement and Coarsening
- Scalability for five iterations of mesh adaptation
initial mesh (1,528k tets)
- At 8 processors
- Initial mesh - 191K/processor
- Final mesh - 241K/processor
- At 128 processors
- Initial mesh - 11.9K/processor
- Final mesh - 15.0K/processor
adapted mesh (1,926k tets)
24 Mesh Adaptation - Refinement and Coarsening
- Communication time for five iterations
- Total time for five iterations
- Communication to total time ratio for five iterations
25 Mesh Adaptation for 1 Billion Element Mesh
Mesh size field of air bubbles distributed in a tube (segment of the model)
Number of regions of adapted mesh among 16k parts
- Initial mesh: uniform, 17,179,836 mesh regions
- Adapted mesh: 160 air bubbles, 1,064,284,042 mesh regions
- Multiple predictive load balancing steps are used to make the adaptation possible
- Larger meshes possible (not out of memory) but this element count is appropriate for solver
Initial and adapted mesh at one bubble - colored by magnitude of mesh size field
26 Predictive Load Balancing
- Refinement of mesh before load balancing can lead to memory problems
- Employ predictive load balancing to avoid the problem
- Assign weights based on what will be refined
- Apply dynamic load balancing
- Refinement
- May want to do some local migration
with predictive load balancing
without predictive load balancing
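The effect of predictive weights can be sketched with a toy example. It assumes each element marked for refinement becomes 8 elements (as in uniform subdivision of a tetrahedron) and uses a greedy assignment as a stand-in for a real dynamic partitioner such as those in Zoltan. All names and numbers are illustrative.

```python
import heapq


def predicted_weight(marked, children_per_refined=8):
    """Predicted element count after refinement (8 children is an assumption)."""
    return float(children_per_refined) if marked else 1.0


def greedy_partition(weights, nparts):
    """Heaviest-first greedy assignment; a stand-in for a real partitioner."""
    heap = [(0.0, p) for p in range(nparts)]
    heapq.heapify(heap)
    part = [None] * len(weights)
    for i in sorted(range(len(weights)), key=lambda i: -weights[i]):
        load, p = heapq.heappop(heap)      # currently lightest part
        part[i] = p
        heapq.heappush(heap, (load + weights[i], p))
    return part


def max_load_after_refinement(part, marked, nparts):
    """Largest per-part element count once the marked elements are refined."""
    loads = [0.0] * nparts
    for p, m in zip(part, marked):
        loads[p] += predicted_weight(m)
    return max(loads)


# 16 elements on 2 parts; elements 0, 2, 4 and 6 will be refined.
marked = [i in (0, 2, 4, 6) for i in range(16)]

without_prediction = greedy_partition([1.0] * 16, 2)
with_prediction = greedy_partition([predicted_weight(m) for m in marked], 2)

max_without = max_load_after_refinement(without_prediction, marked, 2)
max_with = max_load_after_refinement(with_prediction, marked, 2)
print(max_without, max_with)
```

Balancing on current element counts leaves one part holding every element that is about to grow, which is the memory problem the slide describes. Balancing on predicted counts spreads that growth before refinement happens.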
27 Nodal Balance by Local Modification
- For a lightly loaded mesh (small number of regions for each process), a mesh that is well distributed (based on the number of regions) could have bad nodal balance.
- Local modification method is used to balance the number of nodes on each part.
- Region (node) ratio = number of regions (nodes) / average number of regions (nodes)
- Average number of regions for the test: 2434
- Number of parts: 1024
Node ratio before and after node balance
Region/node ratio before node balance
Before node balance
Region ratio before and after node balance
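The region and node ratios defined on this slide amount to a per-part count divided by the average over all parts. The sketch below uses made-up counts for four parts (the test on the slide used 1024 parts) to show region balance and nodal balance diverging.

```python
def balance_ratios(counts):
    """Per-part ratio: count on the part divided by the average over all parts."""
    average = sum(counts) / len(counts)
    return [c / average for c in counts]


# Hypothetical per-part counts: regions are perfectly balanced, nodes are not.
regions_per_part = [2434, 2434, 2434, 2434]
nodes_per_part = [500, 620, 480, 400]

region_ratios = balance_ratios(regions_per_part)
node_ratios = balance_ratios(nodes_per_part)

print("max region ratio:", max(region_ratios))
print("max node ratio:", max(node_ratios))
```

A maximum ratio of 1 means perfect balance for that entity type; the amount above 1 is the fractional overload of the heaviest part.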
28 Dynamic Load Balancing Needs
- Basic needs for graph-based partitioner
- Executed many times during a simulation, so it needs to scale and be efficient
- Abstraction of a graph important - definition of graph nodes and edges is application dependent
- Not hard to create edges near contact - need to determine information for mesh adaptation anyway
- Real-valued weights (not integers)
- Some additional functions needed
- Moving small numbers of entities for specific needs
- Consideration of multiple criteria (e.g., edges and nodes)
- Possible multiple levels of interacting graphs
- Graphs and interactions defined by the applications
- Important for multiphysics and/or multiscale
- Current area of research at RPI.
29 FMDB is Part of ITAPS
- ITAPS tools (http://www.itaps-scidac.org)
- A core functionality of the Interoperable Technologies for Advanced Petascale Simulations (ITAPS) meshing tools
- The only ITAPS component thus far that supports
- Geometry-based adaptive analysis
- Distributed mesh operations in parallel
- Flexible mesh representation
30 Adaptive Loop for Accelerator Design
- Complex CAD geometry
- Physics modeling by the SLAC Omega3P
- High level modeling accuracy needed
- E.g., 0.1 error in frequency predictions
- Parallel adaptive mesh control needed to provide the accuracy needed
Initial mesh (1,595 tets)
Adapted mesh (23,082,517 tets)
31 Patient Specific Vascular Surgical Planning (Stanford, RPI)
- Virtual flow facility for patient specific surgical planning
- High quality patient specific flow simulations needed quickly
- Image patient, create model, apply adaptive flow simulation
Mesh generation
Path planning
Segmentation solid modeling
Simulation
Load CT image
32 Reliable In-Time Cardiovascular Flow Simulations
- Requirements
- Must execute from image data
- Geometric domains and meshes automatically constructed
- Inflow BC from image data
- Realistic representation of outflows and materials
- Simulations must provide reliable flow results
- Meshes with millions of elements typically required
- Adaptive mesh control required to ensure accuracy
- Simulations must execute in time needed (15 minutes)
- Highly effective parallel computation required
Meshes by Simmetrix MeshSim
33 Example of Entire Process: Image, Solid, Mesh
Wilson et al., Lect. Notes Comp. Sci. 2208 (2001) 449-456
Stanford University Cardiovascular Biomechanics Research Laboratory
34 Example of Entire Process: Adaptive Meshes
35 Parallel Mesh Generation
- Consider parallel mesh generation
- Computation effort related to number of elements, but boundary elements have variable load
- Only structure known at start is the geometric model
- Calculations evolve during mesh generation
- All mesh generation steps operate in parallel
- Meshes starting from solid model
- Both structures created by the mesh generator are distributed
- Octree - used for mesh control, localizing searches, interior templates
- Mesh - topological hierarchy distributed over parts
- Mesh generation steps
- Surface mesh generation
- Octree refinement
- Template meshing of interior octants
- Meshing boundary octants
36 Abstraction of Multiscale Simulation
- Practical multiscale simulations will
- Take advantage of existing single scale tools
- Automatically execute processes
- Communicate information, accounting for transformations, between scales
- Functional components needed
- Specification and interaction of physical, mathematical and computational models
- Definition, construction and transformation of domains
- Specification of physical parameters in the form of tensors
- Specification and execution of scale linking
- Adaptive multiscale requires petascale computing
- Just starting to consider doing it in parallel
- Scaling multiscale in parallel makes scaling adaptive FEM look easy
37 Closing Remarks
- Making progress on moving adaptive simulations to petascale machines
- Solver scaling well on some machines
- Can adapt mesh in parallel on large numbers of processors
- ITAPS is developing tools to support parallel mesh-based applications
- Substantial challenges ahead of us
- Dealing with the new machines - so far our stuff does not scale as well on Ranger as it did on the Blue Gene - appear to have to code to each
- Dynamic load balancing is critical to our applications
- Need it working well on all machines
- Have additional requirements (some defined, some we are trying to define)