Title: Developing a computational infrastructure for parallel high performance FE/FVM simulations
1. Developing a computational infrastructure for parallel high performance FE/FVM simulations
- Dr. Stan Tomov
- Brookhaven National Laboratory
- August 11, 2003
2. Outline
- Motivation and overview
- Mesh generation
- Mesh partitioning and load balancing
- Code optimization
- Parallel FEM/FVM using pthreads/OpenMP/MPI
- Code organization and data structures
- Applications
- Visualization
- Extensions and future work
- Conclusions
3. Motivation
- Technological advances facilitate research requiring very large scale computations
- High computing power is needed in many FE/FVM simulations (fluid flow and transport in porous media, heat and mass transfer, elasticity, etc.)
- Higher demand for simulation accuracy → higher demand for computing power
- To meet the demand for high computing power:
  - the use of sequential machines is often insufficient (physical limitations of both system memory and processing speed) → use parallel machines
  - develop better algorithms
- Accuracy and reliability of the computational method, and efficient use of the available computing resources, are closely related to:
  - error control and adaptive mesh refinement
  - optimization
4. Motivation
- Parallel HP FE/FVM simulations (issues):
  - Choosing the solver: direct or iterative
    - sparse matrices, storage considerations, parallelization, preconditioners
  - How to parallelize:
    - extract parallelism from a sequential algorithm, or
    - develop algorithms with enhanced parallelism
    - domain decomposition / data distribution
  - Mesh generation
    - importance of finding a good mesh
    - in parallel, adaptive!
  - Data structures to maintain
    - preconditioners
5. Overview
(Overview diagram of the infrastructure, showing the building blocks MPI, OpenMP, pthreads, and OpenGL.)
6. Mesh generation
- Importance and requirements
- Sequential generators
- Triangle (2D triangular meshes)
- Netgen (3D tetrahedral meshes)
- ParaGrid
- Based on sequential generators
- Adaptively refines a starting mesh in parallel
- Provides data structures suitable for domain
decomposition and multilevel type preconditioners
7. Mesh refinement
8. Mesh partitioning
- Mesh partitioners:
  - Metis (University of Minnesota)
  - Chaco (Sandia National Laboratories)
  - Jostle (University of Greenwich, London)
- Requirements:
  - balanced number of elements and minimal interface between sub-domains
9. Load balancing (in AMR)
- For steady-state problems:
  - Algorithm 1: locally adapt the mesh (sequentially), split it using Metis, then refine uniformly in parallel
  - Algorithm 2: use error estimates as weights when splitting the mesh, then do parallel AMR (see the METIS sketch after this list)
- For transient problems:
  - Algorithm 3: ParMetis is used to check the load balance and, if needed, elements are transferred between sub-domains
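To make Algorithm 2 concrete, here is a minimal sketch of a weighted partitioning call. It assumes the current METIS 5 API (METIS_PartGraphKway; the METIS 4 interface available in 2003 differs) and a toy 4-element mesh whose dual graph is given in CSR form; the integer vertex weights stand in for a posteriori error indicators.

    // Sketch: error-weighted mesh splitting with METIS (illustrative data).
    #include <metis.h>
    #include <cstdio>
    #include <vector>

    int main() {
      // Dual graph of a 4-element strip of mesh elements, in CSR form.
      std::vector<idx_t> xadj   = {0, 1, 3, 5, 6};
      std::vector<idx_t> adjncy = {1, 0, 2, 1, 3, 2};
      idx_t nvtxs = 4, ncon = 1, nparts = 2, objval;
      // Vertex weights derived from error estimates (hypothetical values):
      // heavily refined elements get larger weights, balancing future work.
      std::vector<idx_t> vwgt = {10, 1, 1, 10};
      std::vector<idx_t> part(nvtxs);
      METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                          vwgt.data(), NULL, NULL, &nparts,
                          NULL, NULL, NULL, &objval, part.data());
      for (idx_t v = 0; v < nvtxs; ++v)
        std::printf("element %d -> sub-domain %d\n", (int)v, (int)part[v]);
      return 0;
    }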
10. Code optimization
- Main concepts:
  - locality of reference (to improve memory performance)
  - software pipelining (to improve CPU performance)
- Locality of reference (or: keep things used together close together)
  - Due to memory hierarchies: disk, network → RAM (~200 CP) → cache levels (L2: 6 CP, L1: 3 CP) → registers (0 CP); CP = clock periods, figures for an SGI Origin 2000 (MIPS R10000, 250 MHz)
- Techniques (for cache-friendly algorithms in numerical analysis):
  - Loop interchange: computing A[i][j][k] = B[i][j][k] * C[i][j][k] for i, j, k = 0..100 is about 10x faster with the i, j, k loop order than with the k, j, i order (see the sketch after this list)
  - Vertex reordering: for example the Cuthill-McKee algorithm (CG example: 1.16x faster)
  - Blocking: related to domain decomposition / data distribution
  - Fusion: merge multiple loops into one, e.g. the vector operations in CG, GMRES, etc., to improve reuse
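A minimal sketch of the loop interchange point above: in row-major C++ the last index is contiguous in memory, so the i, j, k order gives unit-stride access while the k, j, i order jumps by whole planes. The flat-array layout is illustrative.

    // Sketch: loop interchange on A[i][j][k] = B[i][j][k] * C[i][j][k]
    // (illustrative flat arrays; N = 100 as on the slide).
    #include <vector>

    const int N = 100;
    inline int idx(int i, int j, int k) { return (i * N + j) * N + k; }

    // Cache-friendly order: k is the fastest index, so access is unit stride.
    void multiply_ijk(std::vector<double>& A, const std::vector<double>& B,
                      const std::vector<double>& C) {
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          for (int k = 0; k < N; ++k)
            A[idx(i, j, k)] = B[idx(i, j, k)] * C[idx(i, j, k)];
    }

    // Interchanged order: each step jumps N*N doubles, defeating the cache.
    void multiply_kji(std::vector<double>& A, const std::vector<double>& B,
                      const std::vector<double>& C) {
      for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j)
          for (int i = 0; i < N; ++i)
            A[idx(i, j, k)] = B[idx(i, j, k)] * C[idx(i, j, k)];
    }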
11. Code optimization
- Software pipelining (SWP)
  - Machine dependent: applies when the CPU functional units are pipelined
  - Can be turned on with compiler options
  - Computing A[i][j][k] = B[i][j][k] * C[i][j][k] for i, j, k = 0..100 with SWP increased performance about 100x
  - Techniques to improve SWP: inlining, splitting/fusing, loop unrolling (see the unrolling sketch below)
- Performance monitoring and benchmarking
  - important in code optimization
  - on SGI we use ssrun, prof, and perfex
  - SGI's pmchart to monitor cluster network traffic
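As a sketch of one SWP-friendly transformation named above, manual loop unrolling exposes independent operations that the compiler can schedule into the pipelined functional units; the factor of 4 and the daxpy-style kernel are illustrative.

    // Sketch: 4-way manual unrolling of y += a*x (illustrative kernel).
    // The four independent multiply-adds per iteration give the compiler's
    // software pipeliner more operations to overlap in the functional units.
    void daxpy_unrolled(int n, double a, const double* x, double* y) {
      int i = 0;
      for (; i + 3 < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
      }
      for (; i < n; ++i)                    // remainder iterations
        y[i] += a * x[i];
    }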
12. Parallel FE/FVM with pthreads
- Pthreads are portable and simple
- Used on shared memory parallel systems
- Low level parallel programming:
  - the user has to build more complicated parallel constructs
  - not widely used in parallel FE/FVM simulations
- We use them on high performance systems that are both distributed memory parallel and shared memory parallel
extern pthread_mutex_t mlock;
extern pthread_cond_t  sync_wait;
extern int barrier_counter;
extern int number_of_threads;

/* Barrier: the first number_of_threads-1 callers decrement the counter
   and block; the last caller resets the counter and wakes the rest.
   (Spurious wakeups are ignored for brevity.) */
void pthread_barrier()
{
    pthread_mutex_lock(&mlock);
    if (barrier_counter) {
        barrier_counter--;
        pthread_cond_wait(&sync_wait, &mlock);
    }
    else {
        barrier_counter = number_of_threads - 1;
        /* broadcast (rather than signal) so every waiting thread is released */
        pthread_cond_broadcast(&sync_wait);
    }
    pthread_mutex_unlock(&mlock);
}
- We use (1) the peer model of parallelism (threads working concurrently), where (2) the main thread also handles the MPI communications (see the driver sketch below)
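A small driver sketch of the peer model, under stated assumptions: compute_on_subdomain is a hypothetical stand-in for the local FE/FVM work, and pthread_barrier is the routine defined above. The main thread participates as peer 0, which is where the MPI calls would live.

    /* Sketch: peer-model driver (hypothetical kernel; error checks omitted). */
    #include <pthread.h>

    pthread_mutex_t mlock     = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  sync_wait = PTHREAD_COND_INITIALIZER;
    int barrier_counter;
    int number_of_threads = 4;

    void pthread_barrier(void);                 /* defined above */
    void compute_on_subdomain(int rank) { /* hypothetical local work */ }

    void* peer(void* arg)
    {
        int rank = *(int*)arg;
        compute_on_subdomain(rank);  /* work on this thread's sub-domain */
        pthread_barrier();           /* all peers meet before next phase */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[4];
        int rank[4];
        barrier_counter = number_of_threads - 1;
        for (int i = 1; i < number_of_threads; ++i) {
            rank[i] = i;
            pthread_create(&tid[i], NULL, peer, &rank[i]);
        }
        rank[0] = 0;
        peer(&rank[0]);              /* main thread is peer 0; in the mixed
                                        model it also drives MPI here */
        for (int i = 1; i < number_of_threads; ++i)
            pthread_join(tid[i], NULL);
        return 0;
    }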
13. Parallel FE/FVM with OpenMP
- OpenMP is a portable and simple set of compiler directives and functions for parallel shared memory programming
- Higher level parallel programming
- Implementations are often based on pthreads
- Iterative solvers scale well
- Used, like pthreads, on mixed distributed and shared memory parallel systems
- On NUMA architectures arrays need to be properly distributed among the processors:
  - #pragma distribute, #pragma redistribute
  - #pragma distribute_reshape
- We use:
  - domain decomposition / data distribution
  - a programming model similar to MPI
  - the "one parallel region" model (fragment below; a fuller sketch follows it)
Table 3. Parallel CG on a problem of size 1024x1024. (Table data not reproduced.)
// sequential initialization ...
#pragma omp parallel
{
  int myrank = omp_get_thread_num();
  // data distribution using the first-touch rule
  S[myrank] = new Subdomain(myrank, ...);
}
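Expanding the fragment above into a self-contained sketch, with a hypothetical Subdomain class: each thread constructs and initializes its own sub-domain inside the single parallel region, so under the first-touch rule a NUMA machine places those pages in the constructing thread's local memory.

    // Sketch: one parallel region with first-touch placement (hypothetical
    // Subdomain; compile with the compiler's OpenMP flag, e.g. -fopenmp).
    #include <omp.h>
    #include <vector>

    struct Subdomain {
      int rank;
      std::vector<double> u;                            // local degrees of freedom
      Subdomain(int r, int n) : rank(r), u(n, 0.0) {}   // first touch is here
      void smooth() { for (double& v : u) v *= 0.5; }   // stand-in solver work
    };

    int main() {
      int nthreads = omp_get_max_threads();
      std::vector<Subdomain*> S(nthreads, nullptr);
      // sequential initialization ...
      #pragma omp parallel
      {
        int myrank = omp_get_thread_num();
        // constructing here places u's pages near this thread (first touch)
        S[myrank] = new Subdomain(myrank, 100000);
        S[myrank]->smooth();       // the whole solver lives in this region
        #pragma omp barrier        // phase boundary, MPI-style
      }
      for (Subdomain* s : S) delete s;
      return 0;
    }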
14. Parallel FE/FVM with MPI
- MPI is a system of functions for parallel distributed memory programming
- Parallel processes communicate by sending and receiving messages
- Domain decomposition / data distribution approach
- Usually only 6 or 7 functions are used (a dot product sketch follows this list):
  - MPI_Allreduce in computing dot products
  - MPI_Isend and MPI_Recv in computing matrix-vector products
  - MPI_Barrier: many uses
  - MPI_Bcast to broadcast sequential input
  - MPI_Comm_rank, MPI_Comm_size
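As a minimal illustration of the first item, the sketch below computes a global dot product the way a distributed CG would: each process reduces its locally owned slice, then MPI_Allreduce sums the partial results. Vector sizes and contents are illustrative.

    // Sketch: distributed dot product with MPI_Allreduce (illustrative data).
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      // Each process owns a slice of the global vectors (here: all ones).
      std::vector<double> x(1000, 1.0), y(1000, 1.0);
      double local = 0.0, global = 0.0;
      for (std::size_t i = 0; i < x.size(); ++i)
        local += x[i] * y[i];                    // local partial sum

      // Sum the partial results across all processes.
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      if (rank == 0)
        std::printf("(x, y) = %g over %d processes\n", global, size);
      MPI_Finalize();
      return 0;
    }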
15. Mixed implementations
- MPI + pthreads/OpenMP in a cluster environment
- Example: parallel CG on (1) a problem of size 314,163, run on (2) a commodity-based cluster (4 nodes, each with 2 Pentium III processors at 1 GHz; 100 Mbit or 1 Gbit network)
Table 1. MPI implementation: scalability over the two networks. (Table data not reproduced.)
Table 2. MPI implementation scalability vs. the mixed implementation (pthreads on the dual processors). (Table data not reproduced.)
16. ParaGrid code organization
17. ParaGrid data structures
- Connections between the different subdomains are kept in terms of packets (a sketch of a possible layout follows this list)
- A vertex packet is the set of all vertices shared by the same subdomains
- The subdomains sharing a packet have:
  - their own copy of the packet
  - pointers to the packet copies in the other subdomains
  - exactly one subdomain owns the packet
- Similar packets exist for edges and faces, used in:
  - refinement
  - problems with degrees of freedom on edges or faces
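A hedged sketch of what such a packet might look like in C++; the field names are hypothetical, inferred only from the bullet points above (shared vertex list, unique owner, links to the copies held by neighboring subdomains).

    // Sketch: a vertex packet (hypothetical layout inferred from the slide).
    #include <vector>

    struct VertexPacket {
      std::vector<int> vertices;    // local indices of the vertices shared by
                                    // the same set of subdomains
      std::vector<int> neighbors;   // ranks of the other subdomains holding
                                    // a copy of this packet
      std::vector<int> remote_ids;  // ids of the packet copies on those
                                    // subdomains (one per neighbor)
      int owner;                    // rank of the unique owning subdomain
      bool is_owned(int myrank) const { return owner == myrank; }
    };
    // Edge and face packets would mirror this layout, supporting refinement
    // and discretizations with degrees of freedom on edges or faces.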
18. Applications
- Generation of large, sparse linear systems of equations on massively parallel computers:
  - generated on the fly, with no need to store large meshes or linear systems
  - distributed among the processing nodes
  - used at LLNL to generate test problems for their HYPRE project (scalable software for solving such problems)
- Various FE/FVM discretizations (used at TAMU and LLNL) with applications to:
  - heat and mass transfer
  - linear elasticity
  - flow and transport in porous media
19. Applications
- A posteriori error control and AMR (at TAMU and BNL):
  - accuracy and reliability of a computational method
  - efficient use of available computational resources
- Studies in domain decomposition and multigrid preconditioners (at LLNL and TAMU)
- Studies in domain decomposition on non-matching grids (at LLNL and TAMU):
  - interior penalty discontinuous approximations
  - mortar finite element approximations
- Visualization (at LLNL, TAMU, and BNL)
- Benchmarking hardware (at BNL):
  - CPU performance
  - network traffic, etc.
20. Visualization
- Importance
- Integration of ParaGrid with visualization (not compiled together):
  - save the mesh and solution to files for later visualization, or
  - send the mesh and solution directly through sockets for visualization
- GLVis:
  - portable, based on OpenGL (also compiles with Mesa)
  - visualizes simple geometric primitives (vertices, lines, and polygons)
  - can be used as a server:
    - waits for data to be visualized
    - forks after every data set received (a server-loop sketch follows this list)
    - combines the parallel input (from ParaGrid) into a sequential visualization
- VTK based:
  - added to support volume visualization
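A minimal sketch of the fork-per-data-set server pattern described above, written with plain POSIX sockets. The port number and the render_data stub are hypothetical; GLVis's actual protocol and rendering are not reproduced.

    // Sketch: fork-per-data-set visualization server (POSIX sockets;
    // hypothetical port; rendering stub stands in for the real work).
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <csignal>

    void render_data(int fd) { /* read mesh/solution from fd and draw it */ }

    int main() {
      signal(SIGCHLD, SIG_IGN);                // auto-reap finished children
      int srv = socket(AF_INET, SOCK_STREAM, 0);
      sockaddr_in addr = {};
      addr.sin_family      = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port        = htons(19916);     // hypothetical port
      bind(srv, (sockaddr*)&addr, sizeof(addr));
      listen(srv, 8);
      for (;;) {                               // server: wait for data sets
        int conn = accept(srv, nullptr, nullptr);
        if (conn < 0) continue;
        if (fork() == 0) {                     // child handles this data set
          close(srv);
          render_data(conn);
          close(conn);
          _exit(0);
        }
        close(conn);                           // parent keeps accepting
      }
    }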
21. Visualization
(GLVis code structure and features: a diagram of abstract classes specialized for 2D scalar, 3D scalar, 2D vector, and 3D vector data visualization.)
22. Extensions and future work
- Extend and use the developed technology with other already existing HPC tools:
  - legacy FE/FVM (or simply user-specific) software
  - interfaces to external solvers (including direct solvers), preconditioners, etc.
- Extend the use to various applications:
  - electromagnetics
  - elasticity, etc.
- Tune the code to particular architectures:
  - benchmarking and optimization
  - commodity-based clusters
23. Extensions and future work
- Further develop methods and tools for adaptive error control and mesh refinement:
  - time dependent and non-linear problems
  - better study of the constants involved in the estimates
- Visualization:
  - user specific
  - GPU as coprocessor?
- Create user-friendly interfaces
24. Conclusions
- A step toward developing a computational infrastructure for parallel HPC
- Domain decomposition framework:
  - a fundamental concept/technique for parallel computing with a wide area of applications
  - needed for parallel HPC research in numerical PDEs
  - of benefit to computational researchers
- Efficient techniques are required to solve linear systems with millions of unknowns
- Finding a good mesh is essential for developing an efficient FE/FVM-based computational methodology