Designing and Building Parallel Programs - PowerPoint PPT Presentation

About This Presentation
Title:

Designing and Building Parallel Programs

Description:

Designing and Building Parallel Programs (c) 1995, 1996, 1997, 1998. Ian Foster Gina Goff ... Designing and Building Parallel Programs. 8. How do Real Parallel ... – PowerPoint PPT presentation

Slides: 55
Provided by: charles191
Transcript and Presenter's Notes

Title: Designing and Building Parallel Programs


1
Designing and Building Parallel Programs
A Tutorial Presented by the Department of
Defense HPC Modernization Programming Environment
and Training Program
  • (c) 1995, 1996, 1997, 1998
  • Ian Foster, Gina Goff
  • Ehtesham Hayder, Charles Koelbel

2
Outline
  • Day 1
  • Introduction to Parallel Programming
  • The OpenMP Programming Language
  • Day 2
  • Introduction to MPI
  • The PETSc Library
  • Individual consulting in afternoons

3
Outline
  • Day 1
  • Introduction to Parallel Programming
  • Parallel Computers and Algorithms
  • Understanding Performance
  • Parallel Programming Tools
  • The OpenMP Programming Language
  • Day 2
  • Introduction to MPI
  • The PETSc Library

4
Why Parallel Computing?
  • Continuing demands for higher performance
  • Physical limits on single processor performance
  • High costs of internal concurrency
  • Result is rise of multiprocessor architectures
  • And number of processors will continue to
    increase
  • Networking is another contributing factor
  • Future software must be concurrent and scalable

5
The Multicomputer: an Idealized Parallel Computer
6
Multicomputer Architecture
  • Multicomputer = nodes + network
  • Node = processor(s) + local memory
  • Access to local memory is cheap
  • 10s of cycles; does not involve network
  • Conventional memory reference
  • Access to remote memory is expensive
  • 100s or 1000s of cycles; uses network
  • Use I/O-like mechanisms (e.g., send/receive)

7
Multicomputer Cost Model
  • Cost of remote memory access/communication
    (including synchronization)
  • $T = t_s + N t_w$
  • $t_s$ = per-message cost (latency)
  • $t_w$ = per-word cost
  • $N$ = message size in words
  • Hence locality is an important property of good
    parallel algorithms
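
To make the cost model concrete, here is a minimal Python sketch of $T = t_s + N t_w$. The parameter values are hypothetical and chosen only for illustration; the function name `message_time` is not from the tutorial.

```python
def message_time(n_words, t_s, t_w):
    """Time to send one message of n_words words: T = t_s + N * t_w."""
    return t_s + n_words * t_w

# Hypothetical parameters in microseconds (not measured values).
T_S = 100.0  # per-message cost (latency)
T_W = 0.5    # per-word cost

# Sending 1000 words as one message versus as 1000 one-word messages.
one_big = message_time(1000, T_S, T_W)
many_small = 1000 * message_time(1, T_S, T_W)
print(one_big, many_small)
```

Under these assumed values the per-message cost dominates the many-small-messages case, which is one reason to batch communication and favor locality.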

8
How do Real Parallel Computers Fit the Model?
  • Major architecture types
  • Distributed memory MIMD
  • Shared memory MIMD
  • Distributed shared memory (DSM)
  • Workstation clusters and metacomputers
  • Model fits current architectures pretty well

9
Distributed Memory MIMD Multiprocessor
  • Multiple Instruction/Multiple Data
  • Processors with local memory connected by
    high-speed interconnection network
  • Typically high bandwidth, medium latency
  • Hardware support for remote memory access
  • Model breaks down when topology matters
  • Examples: Cray T3E, IBM SP

10
Shared Memory MIMD Multiprocessor
  • Processors access shared memory via bus
  • Low latency, high bandwidth
  • Bus contention limits scalability
  • Search for scalability introduces locality
  • Cache (a form of local memory)
  • Multistage architectures (some memory closer)
  • Examples: Cray T90, SGI PCA, Sun

11
Distributed Shared Memory (DSM)
  • A hybrid of distributed and shared memory
  • Small groups of processors share memory; others
    access across a scalable network
  • Low to moderate latency, high bandwidth
  • Model simplifies the multilevel hierarchy
  • Examples: SGI Origin, HP Exemplar

12
Workstation Clusters
  • Workstations connected by network
  • Cost effective
  • High latency, low to moderate bandwidth
  • Often lack integrated software environment
  • Model breaks down if connectivity limited
  • Examples: Ethernet, ATM crossbar, Myrinet

13
A Simple Parallel Programming Model
  • A parallel computation is a set of tasks
  • Each task has local data, can be connected to
    other tasks by channels
  • A task can
  • Compute using local data
  • Send to/receive from other tasks
  • Create new tasks, or terminate itself
  • A receiving task blocks until data available
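
The task/channel model above can be sketched with Python threads standing in for tasks and a `queue.Queue` standing in for a channel. This is only an illustration under those assumptions, not the tutorial's own notation; the `None` sentinel used to signal end of data is a convention introduced here.

```python
import queue
import threading

def producer(out_channel, values):
    """A task that sends each of its local values on a channel."""
    for v in values:
        out_channel.put(v)      # send
    out_channel.put(None)       # sentinel (an assumption): no more data

def consumer(in_channel, results):
    """A task that receives values and computes on its local data."""
    total = 0
    while True:
        v = in_channel.get()    # receive: blocks until data is available
        if v is None:
            break
        total += v
    results.append(total)

channel = queue.Queue()         # the channel connecting the two tasks
results = []
tasks = [
    threading.Thread(target=producer, args=(channel, [1, 2, 3, 4])),
    threading.Thread(target=consumer, args=(channel, results)),
]
for t in tasks:
    t.start()
for t in tasks:
    t.join()
print(results)
```

The consumer never polls: its `get()` call simply blocks, which mirrors the rule that a receiving task blocks until data is available.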

14
Properties
  • Concurrency is enhanced by creating multiple
    tasks
  • Scalability: More tasks than nodes
  • Locality: Access local data when possible
  • A task (with local data and subtasks) is a unit
    for modular design
  • Mapping to nodes affects performance only

15
Parallel Algorithm Design
  • Goal: Develop an efficient (parallel) solution
    to a programming problem
  • Identify sensible parallel algorithms
  • Evaluate performance and complexity
  • We present
  • A systematic design methodology
  • Some basic design techniques
  • Illustrative examples

16
A Design Methodology
  • Partition: define tasks
  • Communication: identify requirements
  • Agglomeration: enhance locality
  • Mapping: place tasks

17
Partitioning
  • Goal: identify opportunities for concurrent
    execution (define tasks: computation + data)
  • Focus on data operated on by algorithm ...
  • Then distribute computation appropriately
  • Domain decomposition
  • ... or on the operations performed
  • Then distribute data appropriately
  • Functional decomposition

18
Communication
  • Identify communication requirements
  • If computation in one task requires data located
    in another, communication is needed
  • Example: finite difference computation
  • Must communicate with each neighbor

$X_i = (X_{i-1} + 2X_i + X_{i+1})/4$
Partition creates one task per point (figure shows tasks $X_1$, $X_2$, $X_3$)
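
A minimal sequential sketch of this finite difference update follows. Each interior point needs the old values of its two neighbors, which is exactly the data a per-point task would have to receive. Holding the two boundary values fixed is an assumption made here; the slide does not say how boundaries are treated.

```python
def fd_step(x):
    """One update X_i = (X_{i-1} + 2*X_i + X_{i+1}) / 4 on interior points.

    Boundary values are held fixed (an assumption for this sketch).
    """
    new = list(x)
    for i in range(1, len(x) - 1):
        # Point i reads its own old value and both neighbors' old values.
        new[i] = (x[i - 1] + 2 * x[i] + x[i + 1]) / 4
    return new

x = [0.0, 4.0, 8.0, 4.0, 0.0]
print(fd_step(x))
```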
19
Agglomeration
  • Once tasks and communication are determined,
    agglomerate small tasks into larger tasks
  • Motivations
  • To reduce communication costs
  • If tasks cannot execute concurrently
  • To reduce software engineering costs
  • Caveats
  • May involve replicating computation or data

20
Mapping
  • Place tasks on processors, to
  • Maximize concurrency
  • Minimize communication
  • (Note the potential conflict between these two goals)
  • Techniques
  • Regular problems: agglomerate to P tasks
  • Irregular problems: use static load balancing
  • If irregular in time: dynamic load balancing

21
Example: Atmosphere Model
  • Simulate atmospheric processes
  • Conservation of momentum, mass, energy
  • Ideal gas law, hydrostatic approximation
  • Represent atmosphere state by 3-D grid
  • Periodic in two horizontal dimensions
  • $N_x \cdot N_y \cdot N_z$ points, e.g., $N_y = 50$ to $500$, $N_x = 2N_y$, $N_z = 15$ to $30$
  • Computation includes
  • Atmospheric dynamics: finite difference
  • Physics (radiation etc.): in vertical only

22
Atmosphere Model: Numerical Methods
  • Discretize the (continuous) domain by a regular
    $N_x \times N_y \times N_z$ grid
  • Store p, u, v, T, ? at every grid point
  • Approximate derivatives by finite differences
  • Leads to stencils in vertical and horizontal

23
Atmosphere Model: Partition
  • Use domain decomposition
  • Because model operates on large, regular grid
  • Can decompose in 1, 2, or 3 dimensions
  • 3-D decomposition offers greatest flexibility

24
Atmosphere Model: Communication
  • Finite difference stencil horizontally: local, regular, structured
  • Radiation calculations vertically: global, regular, structured
  • Diagnostic sums: global, regular, structured

25
Atmosphere Model: Agglomeration
  • In horizontal
  • Clump so that 4 points per task
  • Efficiency: communicate with 4 neighbors only
  • In vertical, clump all points in column
  • Performance: avoid communication
  • Modularity: reuse physics modules
  • Resulting algorithm reasonably scalable
  • $(N_x \cdot N_y)/4$ = at least 1250 tasks

26
Atmosphere Model: Mapping
  • Technique depends on load distribution
  • 1) Agglomerate to one task per processor
  • Appropriate if little load imbalance
  • 2) Extend (1) to incorporate cyclic mapping
  • Works well for diurnal cycle imbalance
  • 3) Use dynamic, local load balancing
  • Works well for unpredictable, local imbalances
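
The difference between a block mapping (option 1) and a cyclic mapping (option 2) can be illustrated with a small Python sketch. The workload below is hypothetical: it simply assumes that the first half of the columns (a stand-in for the "day side") costs three times as much as the rest. The helper names are introduced here for illustration.

```python
def block_owner(i, n_items, n_procs):
    """Processor owning item i when items are split into contiguous blocks."""
    block = -(-n_items // n_procs)  # ceiling division
    return i // block

def cyclic_owner(i, n_procs):
    """Processor owning item i when items are dealt out round-robin."""
    return i % n_procs

def loads(owner_of, work, n_procs):
    """Total work assigned to each processor under a given mapping."""
    totals = [0] * n_procs
    for i, w in enumerate(work):
        totals[owner_of(i)] += w
    return totals

# Hypothetical workload: 8 expensive columns followed by 8 cheap ones.
work = [3] * 8 + [1] * 8
P = 4
block = loads(lambda i: block_owner(i, len(work), P), work, P)
cyclic = loads(lambda i: cyclic_owner(i, P), work, P)
print(block, cyclic)
```

With this assumed workload the block mapping leaves two processors with three times the work of the others, while the cyclic mapping spreads expensive and cheap columns evenly, at the price of less locality.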

27
Modeling Performance
  • Execution time (sums are over P nodes) is
  • $T = (\sum T_{\text{comp}} + \sum T_{\text{comm}} + \sum T_{\text{idle}})/P$
  • Computation time comprises both
  • Operations required by sequential algorithm
  • Additional work, replicated work
  • Idle time due to
  • Load imbalance, and/or
  • Latency (waiting for remote data)

28
Bandwidth and Latency
  • Recall cost model
  • $T = t_s + N t_w$
  • $t_s$ = per-message cost (latency)
  • $t_w$ = per-word cost ($1/t_w$ = bandwidth)
  • $N$ = message size in words
  • Model works well for many algorithms, and on many
    computers

29
Measured Costs
30
Typical Communication Costs
  Computer          t_s     t_w
  IBM SP2            40    0.11
  Intel Paragon     121    0.07
  Meiko CS-2         87    0.08
  Sparc/Ethernet   1500    5.0
  Sparc/FDDI       1150    1.1

Times in microseconds
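
As a rough illustration, the tabulated values can be plugged into the cost model $T = t_s + N t_w$. The sketch below uses the five rows of the table as given; the message size of 1000 words is an arbitrary choice made here.

```python
# (t_s, t_w) in microseconds, copied from the table above.
COSTS = {
    "IBM SP2": (40.0, 0.11),
    "Intel Paragon": (121.0, 0.07),
    "Meiko CS-2": (87.0, 0.08),
    "Sparc/Ethernet": (1500.0, 5.0),
    "Sparc/FDDI": (1150.0, 1.1),
}

def message_time(n_words, t_s, t_w):
    """Cost model: T = t_s + N * t_w."""
    return t_s + n_words * t_w

N_WORDS = 1000  # arbitrary example message size
times = {name: message_time(N_WORDS, ts, tw) for name, (ts, tw) in COSTS.items()}
for name, t in sorted(times.items(), key=lambda item: item[1]):
    print(f"{name:15s} {t:8.1f} us")
```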
31
Example: Finite Difference
  • Finite difference computation on $N \times N \times Z$ grid
  • 9-point stencil
  • Similar to atmosphere model earlier
  • Decompose along one horizontal dimension

32
Time for Finite Difference
  • Identical computations at each grid point
  • $T_{\text{comp}} = t_c N^2 Z$ ($t_c$ is compute time/point)
  • 1-D decomposition, so each node sends $2NZ$ data to
    each of 2 neighbors if $\geq 2$ rows/node
  • $T_{\text{comm}} = P(2 t_s + 4 t_w N Z)$
  • No significant idle time if load balanced
  • $T_{\text{idle}} = 0$
  • Therefore, $T = t_c N^2 Z/P + 2 t_s + 4 t_w N Z$
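
The 1-D model can be evaluated directly. In the sketch below, $t_s$ and $t_w$ are taken from the IBM SP2 row of the cost table, while $t_c$ and the grid dimensions are assumptions chosen only for illustration. The `efficiency` helper (sequential compute time divided by $P$ times the modeled parallel time) is a standard definition added here, not something stated on the slide.

```python
def t_1d(n, z, p, t_c, t_s, t_w):
    """Modeled time for the 1-D decomposition:
    T = t_c*N^2*Z/P + 2*t_s + 4*t_w*N*Z."""
    return t_c * n * n * z / p + 2 * t_s + 4 * t_w * n * z

def efficiency(n, z, p, t_c, t_s, t_w):
    """Sequential compute time over P times the modeled parallel time."""
    return (t_c * n * n * z) / (p * t_1d(n, z, p, t_c, t_s, t_w))

T_C = 1.0               # assumed compute time per grid point (microseconds)
T_S, T_W = 40.0, 0.11   # IBM SP2 row of the cost table (microseconds)
N, Z = 128, 10          # assumed grid dimensions

for p in (1, 4, 16, 64):
    t = t_1d(N, Z, p, T_C, T_S, T_W)
    e = efficiency(N, Z, p, T_C, T_S, T_W)
    print(f"P={p:3d}  T={t:10.1f} us  efficiency={e:.3f}")
```

Because the communication term does not shrink with $P$ in this model, efficiency falls as processors are added for a fixed problem size.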

33
Using Performance Models
  • During design
  • Use models to study qualitative behavior
  • Calibrate models by measuring tc, ts, tw, etc.
  • Use calibrated models to evaluate design
    alternatives
  • During implementation
  • Compare predictions with observations
  • Relate discrepancies to implementation or model
  • Use models to guide optimization process

34
Design Alternatives: Finite Difference
  • Consider 2-D and 3-D decompositions
  • Are they ever a win?
  • If so, when?

35
Design Alternatives (2)
  • 2-D Decomposition - On a $\sqrt{P} \times \sqrt{P}$ processor grid,
    messages of size $2(N/\sqrt{P})Z$ to 4 neighbors, so
  • $T = t_c N^2 Z/P + 4(t_s + 2 t_w N Z/\sqrt{P})$
  • Good if $t_s < t_w N Z (2 - 4/\sqrt{P})$
  • 3-D Decomposition - On a $P_x \times P_y \times P_z$ grid,
  • $T = t_c N^2 Z/P + 6 t_s + 2 t_w N^2/(P_x P_y) + 4 t_w N Z/(P_x P_z) + 4 t_w N Z/(P_y P_z)$
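
The "good if" condition for the 2-D decomposition can be checked numerically against the two time formulas. The sketch below restates the 1-D model $T = t_c N^2 Z/P + 2 t_s + 4 t_w N Z$ from the earlier slide alongside the 2-D one; $t_c$ and the grid size are assumed values, and $t_s$, $t_w$ come from the IBM SP2 row of the cost table.

```python
import math

def t_1d(n, z, p, t_c, t_s, t_w):
    """1-D decomposition: T = t_c*N^2*Z/P + 2*t_s + 4*t_w*N*Z."""
    return t_c * n * n * z / p + 2 * t_s + 4 * t_w * n * z

def t_2d(n, z, p, t_c, t_s, t_w):
    """2-D decomposition: T = t_c*N^2*Z/P + 4*(t_s + 2*t_w*N*Z/sqrt(P))."""
    return t_c * n * n * z / p + 4 * (t_s + 2 * t_w * n * z / math.sqrt(p))

def two_d_wins(n, z, p, t_s, t_w):
    """Condition from the slide: t_s < t_w*N*Z*(2 - 4/sqrt(P))."""
    return t_s < t_w * n * z * (2 - 4 / math.sqrt(p))

T_C = 1.0               # assumed compute time per grid point (microseconds)
T_S, T_W = 40.0, 0.11   # IBM SP2 row of the cost table (microseconds)
N, Z = 128, 10          # assumed grid dimensions

for p in (4, 16, 64):
    faster = t_2d(N, Z, p, T_C, T_S, T_W) < t_1d(N, Z, p, T_C, T_S, T_W)
    print(p, faster, two_d_wins(N, Z, p, T_S, T_W))
```

For these assumed values the 2-D decomposition only pays off once $P$ exceeds 4, since at $P = 4$ the factor $2 - 4/\sqrt{P}$ is zero and the extra messages cost more than they save.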

36
Finding Model Discrepancies
What we have here is a failure to communicate
37
Impact of Network Topology
  • Multicomputer model assumes comm cost independent
    of location and other comms
  • Real networks are not fully connected
  • Multicomputer model can break down

(Figure panels: 2-D Mesh; Ethernet)
38
Competition for Bandwidth
  • In many cases, a bandwidth constrained model can
    give sufficiently accurate results
  • If S processes must communicate over same wire
    at same time, each has 1/S bandwidth
  • Example: finite difference on Ethernet
  • All processors share single Ethernet
  • Hence bandwidth term scaled by P
  • $T = t_c N^2 Z/P + 2 t_s + 4 t_w N Z P$
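
One consequence of the bandwidth-constrained model is that adding processors eventually makes things slower, because the communication term grows with $P$. The sketch below searches for the best $P$ by brute force. The values of $t_s$ and $t_w$ are from the Sparc/Ethernet row of the cost table; $t_c$ and the grid size are assumptions picked for illustration.

```python
def t_ethernet(n, z, p, t_c, t_s, t_w):
    """Bandwidth-constrained model:
    T = t_c*N^2*Z/P + 2*t_s + 4*t_w*N*Z*P."""
    return t_c * n * n * z / p + 2 * t_s + 4 * t_w * n * z * p

T_C = 1.0                # assumed compute time per grid point (microseconds)
T_S, T_W = 1500.0, 5.0   # Sparc/Ethernet row of the cost table (microseconds)
N, Z = 2000, 10          # assumed grid dimensions

# Try every processor count from 1 to 64 and keep the fastest.
best_p = min(range(1, 65), key=lambda p: t_ethernet(N, Z, p, T_C, T_S, T_W))
print(best_p, t_ethernet(N, Z, best_p, T_C, T_S, T_W))
```

Setting the derivative of the model with respect to $P$ to zero gives $P = \sqrt{t_c N/(4 t_w)}$, which the brute-force search should agree with for these values.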

39
Bandwidth-Constrained Model Versus Observations
Bandwidth-constrained model gives better fit
40
Tool Survey
  • High Performance Fortran (HPF)
  • Message Passing Interface (MPI)
  • Parallel Computing Forum (PCF) and OpenMP
  • Portable, Extensible Toolkit for Scientific
    Computations (PETSc)

41
High Performance Fortran (HPF)
  • A standard data-parallel language
  • CM Fortran, C*, HPC++ are related
  • Programmer specifies
  • Concurrency (concurrent operations on arrays)
  • Locality (data distribution)
  • Compiler infers
  • Mapping of computation (owner-computes rule)
  • Communication

42
HPF Example
PROGRAM hpf_finite_difference
!HPF$ PROCESSORS pr(4)
REAL x(100,100), new(100,100)
!HPF$ ALIGN new(:,:) WITH x(:,:)
!HPF$ DISTRIBUTE x(BLOCK,*) ONTO pr
new(2:99,2:99) = (x(1:98,2:99)+x(3:100,2:99)+x(2:99,1:98)+x(2:99,3:100)) / 4
diff = MAXVAL(ABS(new-x))
end
43
HPF Analysis
  • Advantages
  • High level; preserves sequential semantics
  • Standard
  • Disadvantages
  • Restricted applicability
  • Requires sophisticated compiler technology
  • Good for regular, SPMD problems

44
Message Passing Interface (MPI)
  • A standard message-passing library
  • p4, NX, Express, PARMACS are precursors
  • An MPI program defines a set of processes
  • (usually one process per node)
  • ... that communicate by calling MPI functions
  • (point-to-point and collective)
  • ... and can be constructed in a modular fashion
  • (communicators are the key)

45
MPI Example
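/* Excerpt: declarations and neighbor setup (lnbr, rnbr, local, lsize) are not shown on the slide. */
main(int argc, char *argv[])
{
  MPI_Comm com = MPI_COMM_WORLD;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(com, &np);
  /* Send edge values left and right; each receive fills the halo cell on the side it came from. */
  MPI_Send(local+1, 1, MPI_FLOAT, lnbr, 10, com);
  MPI_Recv(local+lsize+1, 1, MPI_FLOAT, rnbr, 10, com, &status);
  MPI_Send(local+lsize, 1, MPI_FLOAT, rnbr, 10, com);
  MPI_Recv(local, 1, MPI_FLOAT, lnbr, 10, com, &status);
  ldiff = maxerror(local);
  MPI_Allreduce(&ldiff, &diff, 1, MPI_FLOAT, MPI_MAX, com);
  MPI_Finalize();
}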
46
MPI Analysis
  • Advantages
  • Wide availability of efficient implementations
  • Support for modular design, code reuse
  • Disadvantages
  • Low level (parallel assembly code)
  • Less well-suited to shared-memory machines
  • Good for performance-critical codes with natural
    task modularity

47
PCF
  • Standardization (circa 1993) of shared memory
    parallelism in Fortran
  • A PCF program is multithreaded, with explicit
    synchronization between the threads and shared
    variables
  • A PCF program is divided into regions
  • Serial regions: Only the master thread executes
  • Parallel regions: Work shared by all threads

48
PCF and OpenMP
  • PCF per se was not widely implemented
  • Timing: Distributed memory became popular
  • Complexity: Many details for special cases
  • Not Invented Here (NIH) syndrome
  • Its ideas resurfaced in OpenMP
  • Primary differences are the spelling and
    low-level controls
  • Also some significant simplification (claimed to
    add scalability)

49
PCF Example (SGI Variant)
PCF standard
!$DOACROSS, LOCAL(I), SHARE(A,B,C),
!$& REDUCTION(X),
!$& IF (N.GT.1000),
!$& MP_SCHEDTYPE=DYNAMIC, CHUNK=100
      DO I = 2, N-1
        A(I) = (B(I-1)+B(I)+B(I+1)) / 3
        X = X + A(I)/C(I)
      END DO
!$DOACROSS, LOCAL(I), SHARE(D,E),
!$& MP_SCHEDTYPE=SIMPLE
      DO I = 1, N
        D(I) = SIN(E(I))
      END DO

  • REDUCTION(X): X is a summation
  • IF (N.GT.1000): conditional parallelization
  • MP_SCHEDTYPE=DYNAMIC, CHUNK=100: iterations managed
    first-come, first-served, in blocks of 100
  • MP_SCHEDTYPE=SIMPLE: iterations blocked evenly among
    threads (INTERLEAVE, GSS, RUNTIME scheduling also available)
50
OpenMP Example
!$OMP PARALLEL DO, SHARED(A,B,C),
!$OMP& REDUCTION(+:X),
!$OMP& SCHEDULE(DYNAMIC, 100)
      DO I = 2, N-1
        A(I) = (B(I-1)+B(I)+B(I+1)) / 3
        X = X + A(I)/C(I)
      END DO
!$OMP END PARALLEL DO
!$OMP PARALLEL DO, SHARED(D,E),
!$OMP& SCHEDULE(STATIC)
      DO I = 1, N
        D(I) = SIN(E(I))
      END DO
!$OMP END PARALLEL DO

  • REDUCTION(+:X): X is a summation
  • SCHEDULE(DYNAMIC, 100): iterations managed first-come,
    first-served, in blocks of 100
  • SCHEDULE(STATIC): iterations blocked evenly among threads
    (GUIDED scheduling also available)
51
PCF/OpenMP Analysis
  • Advantages
  • Convenient for shared memory, especially when
    using vendor extensions
  • Disadvantages
  • Tied strongly to shared memory
  • Few standard features for locality control
  • A good choice for shared-memory and DSM machines,
    but portability is still hard

52
Portable, Extensible Toolkit for Scientific
Computations (PETSc)
  • A higher-level approach to solving PDEs
  • Not parallel per se, but easy to use that way
  • User-level library provides
  • Linear and nonlinear solvers
  • Standard and advanced options (e.g. parallel
    preconditioners)
  • Programmer supplies
  • Application-specific set-up, data structures, PDE
    operators

53
PETSc Example
      SNES snes
      Mat J
      Vec x,F
      integer n,its,ierr
      call MatCreate(MPI_COMM_WORLD,n,n,J,ierr)
      call VecCreate(MPI_COMM_WORLD,n,x,ierr)
      call VecDuplicate(x,F,ierr)
      call SNESCreate(MPI_COMM_WORLD,SNES_NONLINEAR_EQUATIONS,snes,ierr)
      call SNESSetFunction(snes,F,EvaluateFunction,PETSC_NULL,ierr)
      call SNESSetJacobian(snes,J,EvaluateJacobian,PETSC_NULL,ierr)
      call SNESSetFromOptions(snes,ierr)
      call SNESSolve(snes,b,x,its,ierr)
      call SNESDestroy(snes,ierr)

54
PETSc Analysis
  • A rather different beast from the other tools
  • The P does not stand for Parallel
  • Most concurrency in user code, often MPI
  • Advantages
  • Easy access to advanced numerical methods
  • Disadvantages
  • Limited scope
  • Good for implicit or explicit PDE solutions