1
Developing a computational infrastructure for
parallel high performance FE/FVM simulations
  • Dr. Stan Tomov
  • Brookhaven National Laboratory
  • August 11, 2003

2
Outline
  • Motivation and overview
  • Mesh generation
  • Mesh partitioning and load balancing
  • Code optimization
  • Parallel FEM/FVM using pthreads/OpenMP/MPI
  • Code organization and data structures
  • Applications
  • Visualization
  • Extensions and future work
  • Conclusions

3
Motivation
  • Technological advances facilitate research
    requiring very large scale computations
  • High computing power is needed in many FE/FVM
    simulations (fluid flow and transport in porous
    media, heat and mass transfer, elasticity, etc.)
  • higher demand for simulation accuracy
    → higher demand for computing power
  • To meet the demand for high computational power:
  • the use of sequential machines is often
    insufficient (physical limitations of both system
    memory and computer processing speed)
    → use parallel machines
  • develop better algorithms
  • accuracy and reliability of the computational
    method
  • efficient use of the available computing
    resources

Closely related to:
  - error control and adaptive mesh refinement
  - optimization
4
Motivation
  • Parallel HP FE/FVM simulations (issues)
  • Choose the solver: direct or iterative
  • sparse matrices, storage considerations,
    parallelization, preconditioners
  • How to parallelize
  • extract parallelism from sequential algorithm, or
  • develop algorithms with enhanced parallelism
  • domain decomposition and data distribution
  • Mesh generation
  • Importance of finding a good mesh
  • in parallel, adaptive!
  • Data structures to maintain
  • preconditioners

5
Overview
(Overview diagram: MPI, OpenMP, pthreads, OpenGL)
6
Mesh generation
  • Importance and requirements
  • Sequential generators
  • Triangle (2D triangular meshes)
  • Netgen (3D tetrahedral meshes)
  • ParaGrid
  • Based on sequential generators
  • Adaptively refines a starting mesh in parallel
  • Provides data structures suitable for domain
    decomposition and multilevel type preconditioners

7
Mesh refinement
8
Mesh partitioning
  • Mesh partitioners
  • Metis (University of Minnesota)
  • Chaco (Sandia National Laboratories)
  • Jostle (University of Greenwich, London)
  • Requirements
  • Balance of elements and minimum interface

9
Load balancing (in AMR)
  • For steady state problems
  • Algorithm 1: locally adapt the mesh (sequentially),
    split using Metis,
    refine uniformly in parallel
  • Algorithm 2: use error estimates as weights in
    splitting the mesh, then do parallel AMR
    (see the sketch after this list)
  • For transient problems
  • Algorithm 3: ParMetis is used to check the load
    balance and, if needed, elements are
    transferred between sub-domains
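
A minimal sketch of the weighted splitting step in Algorithm 2, assuming
the METIS 5.x C API (the chain-shaped dual graph and the weight values
are illustrative, not from the slides): per-element error estimates
become vertex weights, so elements expected to need heavy refinement are
spread across the partitions.

#include <metis.h>
#include <stdio.h>

/* Partition a small element (dual) graph into two parts, using
   scaled error estimates as vertex weights. */
int main(void)
{
    idx_t nvtxs = 4, ncon = 1, nparts = 2;
    /* CSR adjacency of the dual graph: the chain 0-1-2-3 */
    idx_t xadj[]   = {0, 1, 3, 5, 6};
    idx_t adjncy[] = {1, 0, 2, 1, 3, 2};
    /* error estimates, scaled to integer weights */
    idx_t vwgt[]   = {10, 40, 40, 10};
    idx_t objval, part[4];

    if (METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy, vwgt,
                            NULL, NULL, &nparts, NULL, NULL,
                            NULL, &objval, part) != METIS_OK)
        return 1;
    for (idx_t i = 0; i < nvtxs; i++)
        printf("element %d -> part %d\n", (int)i, (int)part[i]);
    return 0;
}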

10
Code optimization
  • Main concepts
  • Locality of reference (to improve memory
    performance)
  • Software pipelining (to improve CPU performance)
  • Locality of reference (or keep things used
    together close together)
  • Due to memory hierarchies
    - Disk, network → RAM (≈200 CP) → cache
      levels (L2: 6 CP, L1: 3 CP)
      → registers (0 CP); data for SGI
      Origin 2000, MIPS R10000, 250 MHz
  • Techniques (for cache friendly algorithms in NA)
  • - Loop interchange: for i, j, k = 0..100
      do A[i][j][k] = B[i][j][k] * C[i][j][k];
      10x faster than loop order k, j, i (see
      the sketch after this list)
  • - Vertex reordering: for example, the
      Cuthill-McKee algorithm (CG example: 1.16x
      faster)
  • - Blocking: related to domain
      decomposition and data distribution
  • - Fusion: merge multiple
      loops into one, for example the vector
      operations in CG,
      GMRES, etc., to improve reuse
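
A minimal sketch of the loop interchange point above (the array size is
illustrative; the 10x figure is the slide's, measured on the SGI hardware
listed earlier): with C's row-major layout, making k the innermost index
walks memory with stride 1.

#include <stdio.h>
#define N 100

static double A[N][N][N], B[N][N][N], C[N][N][N];

int main(void)
{
    /* Cache-friendly order i, j, k: the innermost index k is the
       fastest-varying (stride-1) dimension in row-major C arrays. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                A[i][j][k] = B[i][j][k] * C[i][j][k];

    /* Order k, j, i touches memory with stride N*N in the innermost
       loop and thrashes the cache; the slide reports ~10x slower. */
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                A[i][j][k] = B[i][j][k] * C[i][j][k];

    printf("%f\n", A[0][0][0]);
    return 0;
}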

11
Code optimization
  • Software pipelining
  • Machine dependence - if CPU functional
    units are pipelined
  • Can be turned on with compiler options
  • - computing A[i][j][k] = B[i][j][k] * C[i][j][k],
      i, j, k = 0..100 with SWP
      increased performance 100x
  • Techniques to improve SWP: inlining,
    splitting/fusing, loop unrolling (see the sketch below)
  • Performance monitoring and benchmarking
  • importance (in code optimization)
  • on SGI we use ssrun, prof, and perfex
  • SGI's pmchart to monitor cluster network traffic
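
A minimal sketch of the loop unrolling technique named above (the daxpy
kernel is an illustrative choice, not from the slides): unrolling exposes
independent operations so the compiler's software pipeliner can keep the
floating-point units busy.

#include <stddef.h>

/* daxpy-style update y += alpha*x, unrolled by 4 so consecutive
   iterations are visibly independent and easier to software-pipeline. */
void daxpy_unrolled(size_t n, double alpha, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }
    for (; i < n; i++)   /* remainder iterations */
        y[i] += alpha * x[i];
}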

12
Parallel FE/FVM with pthreads
  • Pthreads are portable and simple
  • Used in shared memory parallel systems
  • Low level parallel programming
  • User has to create more complicated parallel
    constructs
  • not widely used in parallel FE/FVM simulations
  • We use them on HP systems that are both distributed
    memory parallel and shared memory parallel

#include <pthread.h>

extern pthread_mutex_t mlock;
extern pthread_cond_t  sync_wait;
extern int barrier_counter;
extern int number_of_threads;

/* Counter-based barrier: the first n-1 arriving threads block on the
   condition variable; the last one resets the counter and wakes them.
   (A production barrier would also guard against spurious wakeups.) */
void pthread_barrier()
{
    pthread_mutex_lock(&mlock);
    if (barrier_counter) {
        barrier_counter--;
        pthread_cond_wait(&sync_wait, &mlock);
    } else {
        barrier_counter = number_of_threads - 1;
        /* broadcast, not signal, so every waiting thread is released */
        pthread_cond_broadcast(&sync_wait);
    }
    pthread_mutex_unlock(&mlock);
}
  • We use (1) the peer model of parallelism (threads
    working concurrently; see the usage sketch below), and
  • (2) the main thread deals with
    MPI communications
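
A minimal usage sketch for the barrier above, in the peer model (the
worker body and the thread count are illustrative assumptions; link
together with the barrier function, which defines nothing but uses the
globals initialized here):

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t mlock     = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  sync_wait = PTHREAD_COND_INITIALIZER;
int number_of_threads = 4;
int barrier_counter   = 3;        /* number_of_threads - 1 */

void pthread_barrier(void);       /* the barrier shown above */

void *worker(void *arg)
{
    int id = *(int *)arg;
    /* ... local FE/FVM work on this thread's subdomain ... */
    pthread_barrier();            /* wait for all peers */
    printf("thread %d past barrier\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    int id[4];
    for (int i = 0; i < 4; i++) {
        id[i] = i;
        pthread_create(&tid[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}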

13
Parallel FE/FVM with OpenMP
  • OpenMP is a portable and simple set of compiler
    directives and functions for parallel shared
    memory programming
  • Higher level parallel programming
  • Implementation often based on pthreads
  • Iterative solvers scale well
  • Used, like pthreads, in mixed distributed and shared
    memory parallel systems
  • On NUMA architectures we need to have arrays
    properly distributed among the processors
  • #pragma distribute, #pragma redistribute,
  • #pragma distribute_reshape
  • We use
  • domain decomposition and data distribution
  • Programming model similar to MPI
  • Model: one parallel region

Table 3. Parallel CG on a problem of
size 1024x1024.

// sequential initialization
#pragma omp parallel
{
    int myrank = omp_get_thread_num();
    // distribution using the first-touch rule
    S[myrank] = new Subdomain(myrank, /* ... */);
}
14
Parallel FE/FVM with MPI
  • MPI is a system of functions for parallel
    distributed memory programming
  • Parallel processes communicate by sending and
    receiving messages
  • Domain decomposition and data distribution approach
  • Usually 6 or 7 functions are used
  • MPI_Allreduce: in computing dot products (see
    the sketch after this list)
  • MPI_Isend and MPI_Recv: in computing
    matrix-vector products
  • MPI_Barrier: many uses
  • MPI_Bcast: to broadcast sequential input
  • MPI_Comm_rank, MPI_Comm_size
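
A minimal sketch of the MPI_Allreduce use named in the list above (vector
length and values are illustrative): each process holds a piece of the
distributed vectors, and one reduction yields the global dot product.

#include <mpi.h>
#include <stdio.h>

/* Global dot product of two distributed vectors: combine the local
   partial sums across all processes with MPI_Allreduce. */
double dot(int nlocal, const double *x, const double *y)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < nlocal; i++)
        local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    return global;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double x[4] = {1, 1, 1, 1}, y[4] = {2, 2, 2, 2};
    double d = dot(4, x, y);      /* = 8 * number of processes */
    if (rank == 0)
        printf("dot = %f on %d processes\n", d, size);

    MPI_Finalize();
    return 0;
}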

15
Mixed implementations
  • MPI + pthreads/OpenMP in a cluster environment

- Example: parallel CG on (1) a problem of size
  314,163, on (2) a commodity-based cluster
  (4 nodes, each node with 2 Pentium III
  processors running at 1 GHz,
  100 Mbit or 1 Gbit network)

Table 1. MPI implementation:
scalability over the two networks.
Table 2. MPI implementation
scalability vs. mixed (pthreads on the dual
processors).
16
ParaGrid code organization
17
ParaGrid data structures
  • Connections between the different subdomains
  • in terms of packets
  • A vertex packet is all vertices shared by the
    same subdomains
  • The subdomains sharing a packet have
  • their own packet copy
  • pointers to the packet copies in the other
    subdomains
  • only one subdomain is owner of the packet
  • Similarly for edges and faces, used in
  • refinement
  • problems with degrees of freedom in edges or
    faces
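
A minimal C sketch of the packet data structure described above (all
names and fields are illustrative assumptions, not ParaGrid's actual
declarations): a packet records the shared vertices, the subdomains
sharing them, and the single owner.

/* A vertex packet: all vertices shared by one particular group of
   subdomains. Each sharing subdomain keeps its own copy. */
typedef struct VertexPacket {
    int  num_vertices;
    int *local_vertex_ids;     /* this subdomain's indices of the shared vertices */

    int  num_sharers;
    int *sharer_ranks;         /* the subdomains sharing this packet */
    int *remote_packet_ids;    /* handles to the packet copies held by the
                                  other subdomains (pointers in shared memory,
                                  ids/offsets across distributed memory) */

    int  owner_rank;           /* exactly one subdomain owns the packet */
} VertexPacket;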

18
Applications
  • Generation of large, sparse linear systems of
    equations on massively parallel computers
  • Generated on the fly, no need to store large meshes
    or linear systems
  • Distributed among processing nodes
  • Used at LLNL to generate test problems for their
    HYPRE project (scalable software for solving such
    problems)
  • Various FE/FVM discretizations (used at TAMU and
    LLNL) with applications to
  • Heat and mass transfer
  • Linear elasticity
  • Flow and transport in porous media

19
Applications
  • A posteriori error control and AMR (at TAMU and
    BNL)
  • Accuracy and reliability of a computational
    method
  • Efficient use of available computational
    resources
  • Studies in domain decomposition and multigrid
    preconditioners (at LLNL, TAMU)
  • Studies in domain decomposition on non-matching
    grids (at LLNL and TAMU)
  • interior penalty discontinuous approximations
  • mortar finite element approximations
  • Visualization (at LLNL, TAMU, and BNL)
  • Benchmarking hardware (at BNL)
  • CPU performance
  • network traffic, etc.

20
Visualization
  • Importance
  • Integration of ParaGrid with visualization (not
    compiled together)
  • - save mesh and solution in files for later
      visualization
  • - send mesh and solution directly through
      sockets for visualization
  • GLVis
  • - portable, based on OpenGL (also compiled
      with Mesa)
  • - visualizes simple geometric primitives
      (vertices, lines, and polygons)
  • - can be used as a server
  • - waits for data to be visualized
  • - forks after every data set
      received (see the sketch after this list)
  • - combines parallel input (from
      ParaGrid) into a sequential
      visualization
  • VTK-based visualization
  • - added to support volume visualization
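
A minimal sketch of the fork-per-data-set server pattern described above
(socket details and the port number are illustrative assumptions; this is
not GLVis source): the parent keeps accepting connections while a forked
child visualizes each received data set.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(19916);     /* illustrative port */
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 8);
    signal(SIGCHLD, SIG_IGN);                /* auto-reap children */

    for (;;) {                               /* wait for data to visualize */
        int conn = accept(srv, NULL, NULL);
        if (conn < 0)
            continue;
        if (fork() == 0) {                   /* child: one data set */
            close(srv);
            /* ... read mesh and solution from conn, then visualize ... */
            close(conn);
            _exit(0);
        }
        close(conn);                         /* parent keeps accepting */
    }
}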

21
Visualization
GLVis code structure and features
(class diagram: abstract classes, with implementations for
2D scalar, 3D scalar, 2D vector, and 3D vector
data visualization)
22
Extensions and future work
  • Extend the developed technology and use it with
    other already existing tools for HPC
  • Legacy FE/FVM (or just user specific) software
  • Interfaces to external solvers (including direct)
    and preconditioners, etc.
  • Extend the use to various applications
  • Electromagnetics
  • Elasticity, etc.
  • Tune the code to particular architectures
  • Benchmarking and optimization
  • Commodity-based clusters

23
Extensions and future work
  • Further develop methods and tools for adaptive
    error control and mesh refinement
  • Time dependent and non-linear problems
  • Better study of the constants involved in the
    estimates
  • Visualization
  • User specific
  • GPU as coprocessor?
  • Create user-friendly interfaces

24
Conclusions
  • A step toward developing a computational
    infrastructure for parallel HPC
  • Domain decomposition framework
  • Fundamental concept/technique for parallel
    computing with wide area of applications
  • Needed for parallel HPC research in numerical
    PDEs
  • Benefit to computational researchers, who require
    efficient techniques to solve linear
    systems with millions of unknowns
  • Finding a good mesh is essential for developing
    an efficient computational methodology based on
    FE/FVM