Developing a computational infrastructure for
parallel high performance FE/FVM simulations
  • Dr. Stan Tomov
  • Brookhaven National Laboratory
  • August 11, 2003

  • Motivation and overview
  • Mesh generation
  • Mesh partitioning and load balancing
  • Code optimization
  • Parallel FEM/FVM using pthreads/OpenMP/MPI
  • Code organization and data structures
  • Applications
  • Visualization
  • Extensions and future work
  • Conclusions

  • Technological advances facilitate research
    requiring very large scale computations
  • High computing power is needed in many FE/FVM
    simulations (fluid flow transport in porous
    media, heat mass transfer, elasticity, etc.)
  • Higher demand for simulation accuracy
    ⇒ higher demand for computing power
  • To meet the demand for high computational power:
  • the use of sequential machines is often
    insufficient (physical limitations of both system
    memory and computer processing speed)
    ⇒ use parallel machines
  • develop better algorithms
  • accuracy and reliability of the computational
  • efficient use of the available computing

Closely related to:
  • error control and adaptive mesh refinement
  • optimization
  • Parallel HP FE/FVM simulations (issues)
  • Choose the solver: direct or iterative
  • sparse matrices, storage considerations,
    parallelization, preconditioners
  • How to parallelize:
  • extract parallelism from a sequential algorithm, or
  • develop algorithms with enhanced parallelism
  • domain decomposition and data distribution
  • Mesh generation
  • Importance of finding a good mesh
  • in parallel, adaptive!
  • Data structures to maintain
  • preconditioners

MPI OpenMP pthreads OpenGL
Mesh generation
  • Importance and requirements
  • Sequential generators
  • Triangle (2D triangular meshes)
  • Netgen (3D tetrahedral meshes)
  • ParaGrid
  • Based on sequential generators
  • Adaptively refines a starting mesh in parallel
  • Provides data structures suitable for domain
    decomposition and multilevel type preconditioners

Mesh refinement
Mesh partitioning
  • Mesh partitioners
  • Metis (University of Minnesota)
  • Chaco (Sandia National Laboratories)
  • Jostle (University of Greenwich, London)
  • Requirements
  • Balance of elements and minimum interface
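
The two requirements above can be made concrete with a small sketch. It is illustrative only: the edge-list representation of the element adjacency graph and the function names `edge_cut` and `imbalance` are assumptions made for this example and are not part of the Metis, Chaco, or Jostle interfaces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Element adjacency stored as an edge list: each pair holds two neighboring
// elements. part[e] is the subdomain assigned to element e.
using Edge = std::pair<int, int>;

// Interface size: number of neighboring element pairs that end up in
// different subdomains (smaller is better).
int edge_cut(const std::vector<Edge>& edges, const std::vector<int>& part) {
    int cut = 0;
    for (const Edge& e : edges)
        if (part[e.first] != part[e.second]) ++cut;
    return cut;
}

// Load imbalance: size of the largest subdomain divided by the average
// subdomain size. A value of 1.0 means a perfectly balanced partition.
double imbalance(const std::vector<int>& part, int nparts) {
    assert(nparts > 0 && !part.empty());
    std::vector<int> count(nparts, 0);
    for (int p : part) {
        assert(p >= 0 && p < nparts);
        ++count[p];
    }
    const int largest = *std::max_element(count.begin(), count.end());
    const double average = static_cast<double>(part.size()) / nparts;
    return largest / average;
}
```

A partitioner tries to keep `imbalance` close to 1 while making `edge_cut` as small as possible; the two goals usually have to be traded off against each other.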

Load balancing (in AMR)
  • For steady state problems
  • Algorithm 1: locally adapt the mesh
    (sequentially), split using Metis, then
    refine uniformly in parallel
  • Algorithm 2: use error estimates as weights in
    splitting the mesh, then do parallel AMR
  • For transient problems
  • Algorithm 3: ParMetis is used to check the load
    balance and, if needed, elements are
    transferred between sub-domains

Code optimization
  • Main concepts
  • Locality of reference (to improve memory performance)
  • Software pipelining (to improve CPU performance)
  • Locality of reference (or keep things used
    together close together)
  • Due to memory hierarchies
  • - Disc, network → RAM (~200 CP) → cache
    levels (L2: 6 CP, L1: 3 CP) → registers (0 CP);
    data for SGI Origin 2000, MIPS R10000, 250 MHz
  • Techniques (for cache friendly algorithms in NA)
  • - Loop interchange: for i, j, k = 0..100,
    A_ijk = B_ijk C_ijk runs 10x faster than in
    the order k, j, i = 0..100
  • - Vertex reordering: for example, the
    Cuthill-McKee algorithm (CG example: 1.16x)
  • - Blocking: related to domain
    decomposition and data distribution
  • - Fusion: merge multiple loops into one, for
    example vector operations in GMRES, etc., to
    improve reuse
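
As a rough sketch of two of these techniques, the code below shows loop interchange on a row-major 3D array and the fusion of two vector loops. The `Array3` type and all function names are hypothetical, and the elementwise product is an assumption, since the operator between B and C is not legible in the transcript. No timings are claimed here; the speedups quoted on the slide are for the presenter's SGI machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense n x n x n array stored contiguously in row-major order:
// element (i, j, k) lives at index (i*n + j)*n + k, so k is the fastest index.
struct Array3 {
    std::size_t n;
    std::vector<double> data;
    explicit Array3(std::size_t size, double value = 0.0)
        : n(size), data(size * size * size, value) {}
    double& at(std::size_t i, std::size_t j, std::size_t k) {
        return data[(i * n + j) * n + k];
    }
    double at(std::size_t i, std::size_t j, std::size_t k) const {
        return data[(i * n + j) * n + k];
    }
};

// Cache-friendly order: the innermost loop walks memory with stride 1.
void multiply_ijk(Array3& a, const Array3& b, const Array3& c) {
    assert(a.n == b.n && b.n == c.n);
    const std::size_t n = a.n;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                a.at(i, j, k) = b.at(i, j, k) * c.at(i, j, k);
}

// Same result, but the innermost loop jumps n*n elements per iteration.
void multiply_kji(Array3& a, const Array3& b, const Array3& c) {
    assert(a.n == b.n && b.n == c.n);
    const std::size_t n = a.n;
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t i = 0; i < n; ++i)
                a.at(i, j, k) = b.at(i, j, k) * c.at(i, j, k);
}

// Unfused: one pass for y += alpha*x, a second pass for the squared norm of y.
double axpy_then_norm2(double alpha, const std::vector<double>& x,
                       std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
    double sum = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) sum += y[i] * y[i];
    return sum;
}

// Fused: each y[i] is reused while it is still in a register or in cache.
double axpy_norm2_fused(double alpha, const std::vector<double>& x,
                        std::vector<double>& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] += alpha * x[i];
        sum += y[i] * y[i];
    }
    return sum;
}
```

Both loop orders and both vector routines compute the same values; only the memory access pattern differs, which is what the cache hierarchy rewards or penalizes.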

Code optimization
  • Software pipelining
  • Machine dependence - if CPU functional
    units are pipelined
  • Can be turned on with compiler options:
    computing A_ijk = B_ijk C_ijk, i, j, k = 0..100,
    with SWP increased performance 100x
  • Techniques to improve SWP - inlining,
    splitting/fusing, loop unrolling
  • Performance monitoring and benchmarking
  • importance (in code optimization)
  • on SGI we use ssrun, prof, and perfex
  • SGI's pmchart to monitor cluster network traffic

Parallel FE/FVM with pthreads
  • Pthreads are portable and simple
  • Used in shared memory parallel systems
  • Low level parallel programming
  • User has to create more complicated parallel
  • not widely used in parallel FE/FVM simulations
  • We use it on HP systems that are both distributed
    memory parallel and shared memory parallel

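extern pthread_mutex_t mlock;
extern pthread_cond_t sync_wait;
extern int barrier_counter;
extern int number_of_threads;

void pthread_barrier() {
    pthread_mutex_lock(&mlock);
    if (barrier_counter)
        barrier_counter--;   /* counter is nonzero: count this arrival */
    else                     /* counter is zero: reset it for the other threads */
        barrier_counter = number_of_threads - 1;
    /* ... (the rest of the function is cut off in the transcript) */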
  • We use (1) Peer model parallelism (threads
    working concurrently)
  • (2) main thread deals with
    MPI communications

Parallel FE/FVM with OpenMP
  • OpenMP is a portable and simple set of compiler
    directives and functions for parallel shared
    memory programming
  • Higher level parallel programming
  • Implementation often based on pthreads
  • Iterative solvers scale well
  • Used, like pthreads, in mixed distributed and shared
    memory parallel systems
  • On NUMA architectures we need to have arrays
    properly distributed among the processors
  • #pragma distribute, #pragma redistribute
  • #pragma distribute_reshape
  • We use
  • domain decomposition and data distribution
  • Programming model similar to MPI
  • Model: one parallel region

Table 3. Parallel CG on a problem of size 1024x1024
// sequential initialization
#pragma omp parallel
{
    int myrank = omp_get_thread_num();
    // distribution using first touch rule
    S[myrank] = new Subdomain(myrank, ...);
}
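
To illustrate the "one parallel region, MPI-like" model, the following sketch computes a dot product (the kernel that dominates CG) inside a single parallel region, with each thread deriving its own index range from its rank. The function name `dot_spmd` and the block distribution are assumptions made for the example. The code falls back to one thread when compiled without OpenMP support.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Dot product in a single parallel region, SPMD style: each thread works on
// its own contiguous block of indices, chosen from its rank, much as an MPI
// process would work on its own subdomain.
double dot_spmd(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    double total = 0.0;
#pragma omp parallel reduction(+ : total)
    {
#ifdef _OPENMP
        const std::size_t rank = static_cast<std::size_t>(omp_get_thread_num());
        const std::size_t nthreads = static_cast<std::size_t>(omp_get_num_threads());
#else
        const std::size_t rank = 0, nthreads = 1;  // sequential fallback
#endif
        // Block distribution: thread 'rank' owns indices [begin, end).
        const std::size_t begin = n * rank / nthreads;
        const std::size_t end = n * (rank + 1) / nthreads;
        for (std::size_t i = begin; i < end; ++i)
            total += x[i] * y[i];
    }
    return total;
}
```

The `reduction` clause plays the role that `MPI_Allreduce` plays in a distributed memory code: each thread accumulates a private partial sum and the partial sums are combined when the region ends.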
Parallel FE/FVM with MPI
  • MPI is a system of functions for parallel
    distributed memory programming
  • Parallel processes communicate by sending and
    receiving messages
  • Domain decomposition and data distribution approach
  • Usually 6 or 7 functions are used
  • MPI_Allreduce: in computing dot-products
  • MPI_Isend and MPI_Recv: in computing
    matrix-vector products
  • MPI_Barrier: many uses
  • MPI_Bcast: to broadcast sequential input
  • MPI_Comm_rank, MPI_Comm_size

Mixed implementations
  • MPI and pthreads/OpenMP in a cluster environment

- Example: parallel CG on (1) a problem of size
314,163, on (2) a commodity-based cluster (4 nodes,
each node with 2 Pentium III running at 1 GHz,
100 Mbit or 1 Gbit network)
Table 1. MPI implementation scalability over the two networks.
Table 2. MPI implementation scalability vs mixed (pthreads on the dual
ParaGrid code organization
ParaGrid data structures
  • Connections between the different subdomains
    in terms of packets
  • A vertex packet is all vertices shared by the
    same subdomains
  • The subdomains sharing a packet have
  • their own packet copy
  • pointers to the packet copies in the other
    subdomains
  • only one subdomain is owner of the packet
  • Similarly for edges and faces, used in
  • refinement
  • problems with degrees of freedom in edges or
    faces
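
The packet idea can be sketched as follows. This is not ParaGrid's actual data structure: the `Packet` struct, the function `build_vertex_packets`, and the rule that the lowest-numbered subdomain owns a packet are assumptions chosen to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// A packet groups all vertices shared by exactly the same set of subdomains.
struct Packet {
    std::set<int> subdomains;   // subdomains holding a copy of this packet
    int owner;                  // assumed rule: lowest subdomain id is the owner
    std::vector<int> vertices;  // global ids of the vertices in the packet
};

// sharing[v] lists the subdomains that contain global vertex v.
// Vertices that belong to a single subdomain are not part of any packet.
std::vector<Packet> build_vertex_packets(
    const std::vector<std::set<int>>& sharing) {
    // Group vertex ids by the set of subdomains that share them.
    std::map<std::set<int>, std::vector<int>> groups;
    for (std::size_t v = 0; v < sharing.size(); ++v)
        if (sharing[v].size() > 1)
            groups[sharing[v]].push_back(static_cast<int>(v));

    std::vector<Packet> packets;
    for (const auto& group : groups) {
        Packet p;
        p.subdomains = group.first;
        p.owner = *group.first.begin();
        p.vertices = group.second;
        packets.push_back(p);
    }
    return packets;
}
```

Having a single owner per packet gives an unambiguous place to count each shared degree of freedom once, for example when summing contributions to a global dot product.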

  • Generation of large, sparse linear systems of
    equations on massively parallel computers
  • Generated on the fly, no need to store large meshes
    or linear systems
  • Distributed among processing nodes
  • Used at LLNL to generate test problems for their
    HYPRE project (scalable software for solving such
  • Various FE/FVM discretizations (used at TAMU and
    LLNL) with applications to
  • Heat and mass transfer
  • Linear elasticity
  • Flow and transport in porous media

  • A posteriori error control and AMR (at TAMU and
  • Accuracy and reliability of a computational
  • Efficient use of available computational
  • Studies in domain decomposition and multigrid
    preconditioners (at LLNL, TAMU)
  • Studies in domain decomposition on non-matching
    grids (at LLNL and TAMU)
  • interior penalty discontinuous approximations
  • mortar finite element approximations
  • Visualization (at LLNL, TAMU, and BNL)
  • Benchmarking hardware (at BNL)
  • CPU performance
  • network traffic, etc.

  • Importance
  • Integration of ParaGrid with visualization (not
    compiled together)
  • - save mesh and solution in files for later
  • - send mesh and solution directly through
    sockets for visualization
  • GLVis
  • - portable, based on OpenGL (compiled also
    with Mesa)
  • - visualize simple geometric primitives
    (vertices, lines, and polygons)
  • - can be used as a server
  • - waits for data to be visualized
  • - uses fork after every data set
  • - combines parallel input (from
    ParaGrid) into a sequential
  • VTK based
  • - added to support volume visualization

GLVis code structure and features
Abstract classes
2D scalar data visualization
3D scalar data visualization
3D vector data visualization
2D vector data visualization
Extensions and future work
  • Extend and use the technology developed with
    other already existing tools for HPC
  • Legacy FE/FVM (or just user specific) software
  • Interfaces to external solvers (including direct)
    and preconditioners, etc.
  • Extend the use to various applications
  • Electromagnetics
  • Elasticity, etc.
  • Tune the code to particular architectures
  • Benchmarking and optimization
  • Commodity-based clusters

Extensions and future work
  • Further develop methods and tools for adaptive
    error control and mesh refinement
  • Time dependent and non-linear problems
  • Better study of the constants involved in the
  • Visualization
  • User specific
  • GPU as coprocessor?
  • Create user-friendly interfaces

  • A step toward developing computational
    infrastructure for parallel HPC
  • Domain decomposition framework
  • Fundamental concept/technique for parallel
    computing with wide area of applications
  • Needed for parallel HPC research in numerical
  • Benefit to computational researchers
  • Require efficient techniques to solve linear
    systems with millions of unknowns
  • Finding a good mesh essential for developing
    efficient computational methodology based on