Designing and Building Parallel Programs

1
Designing and Building Parallel Programs
A Tutorial Presented by the Department of
Defense HPC Modernization Programming Environment &
Training Program

(c) 1995, 1996, 1997, 1998
Ian Foster, Gina Goff,
Ehtesham Hayder, Charles Koelbel

2
Outline

Day 1
Introduction to Parallel Programming
The OpenMP Programming Language
Day 2
Introduction to MPI
The PETSc Library
Individual consulting in afternoons

3
Outline

Day 1
Introduction to Parallel Programming
Parallel Computers & Algorithms
Understanding Performance
Parallel Programming Tools
The OpenMP Programming Language
Day 2
Introduction to MPI
The PETSc Library

4
Why Parallel Computing?

Continuing demands for higher performance
Physical limits on single processor performance
High costs of internal concurrency
Result is rise of multiprocessor architectures
And number of processors will continue to
increase
Networking is another contributing factor
Future software must be concurrent & scalable

5
The Multicomputer: an Idealized Parallel Computer
6
Multicomputer Architecture

Multicomputer = nodes + network
Node = processor(s) + local memory
Access to local memory is cheap
10s of cycles; does not involve network
Conventional memory reference
Access to remote memory is expensive
100s or 1000s of cycles; uses network
Use I/O-like mechanisms (e.g., send/receive)

7
Multicomputer Cost Model

Cost of remote memory access/communication
(including synchronization):
$T = t_s + N\,t_w$
$t_s$ = per-message cost (latency)
$t_w$ = per-word cost
$N$ = message size in words
Hence locality is an important property of good
parallel algorithms
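
As a rough illustration of why this cost model rewards locality and fewer, larger messages, the sketch below compares one message of N words with N one-word messages. The function name `message_cost` and the parameter values are hypothetical, chosen only to make the arithmetic easy to follow; they are not measurements.

```python
def message_cost(n_words, ts, tw):
    """Cost of one message under the model T = ts + N * tw."""
    return ts + n_words * tw

# Hypothetical machine parameters in arbitrary time units (not measured values).
TS, TW = 100.0, 1.0

one_big = message_cost(1000, TS, TW)         # one message carrying 1000 words
many_small = 1000 * message_cost(1, TS, TW)  # 1000 messages carrying 1 word each

print(one_big, many_small)
```

With these assumed values the per-message term dominates the small-message case, which is the behavior the model is meant to expose.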

8
How do Real Parallel Computers Fit the Model?

Major architecture types
Distributed memory MIMD
Shared memory MIMD
Distributed shared memory (DSM)
Workstation clusters and metacomputers
Model fits current architectures pretty well

9
Distributed Memory MIMD Multiprocessor

Multiple Instruction/Multiple Data
Processors with local memory connected by
high-speed interconnection network
Typically high bandwidth, medium latency
Hardware support for remote memory access
Model breaks down when topology matters
Examples: Cray T3E, IBM SP

10
Shared Memory MIMD Multiprocessor

Processors access shared memory via bus
Low latency, high bandwidth
Bus contention limits scalability
Search for scalability introduces locality
Cache (a form of local memory)
Multistage architectures (some memory closer)
Examples: Cray T90, SGI PCA, Sun

11
Distributed Shared Memory (DSM)

A hybrid of distributed and shared memory
Small groups of processors share memory; others
access it across a scalable network
Low to moderate latency, high bandwidth
Model simplifies the multilevel hierarchy
Examples: SGI Origin, HP Exemplar

12
Workstation Clusters

Workstations connected by network
Cost effective
High latency, low to moderate bandwidth
Often lack integrated software environment
Model breaks down if connectivity limited
Examples: Ethernet, ATM crossbar, Myrinet

13
A Simple Parallel Programming Model

A parallel computation is a set of tasks
Each task has local data, can be connected to
other tasks by channels
A task can
Compute using local data
Send to/receive from other tasks
Create new tasks, or terminate itself
A receiving task blocks until data available
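
The task/channel model can be mimicked on a single machine with threads and a queue. This is a minimal sketch, not the tutorial's own code: the names `producer`, `consumer`, and `channel` are hypothetical, and Python's `queue.Queue` stands in for a channel.

```python
import queue
import threading

def producer(out_channel, values):
    """Task: compute using local data, then send results down a channel."""
    for v in values:
        out_channel.put(v * v)   # send
    out_channel.put(None)        # sentinel: no more data, then the task ends

def consumer(in_channel, results):
    """Task: receive from a channel, blocking until data is available."""
    while True:
        item = in_channel.get()  # blocks until the producer has sent something
        if item is None:
            break
        results.append(item)

channel = queue.Queue()          # plays the role of a channel between two tasks
results = []
tasks = [
    threading.Thread(target=producer, args=(channel, [1, 2, 3])),
    threading.Thread(target=consumer, args=(channel, results)),
]
for t in tasks:
    t.start()
for t in tasks:
    t.join()
print(results)
```

The consumer never polls: the blocking `get` supplies the "receiving task blocks until data available" behavior.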

14
Properties

Concurrency is enhanced by creating multiple
tasks
Scalability: more tasks than nodes
Locality: access local data when possible
A task (with local data and subtasks) is a unit
for modular design
Mapping to nodes affects only performance, not the computed result

15
Parallel Algorithm Design

Goal: develop an efficient (parallel) solution
to a programming problem
Identify sensible parallel algorithms
Evaluate performance and complexity
We present
A systematic design methodology
Some basic design techniques
Illustrative examples

16
A Design Methodology

Partition: define tasks
Communication: identify requirements
Agglomeration: enhance locality
Mapping: place tasks

17
Partitioning

Goal: identify opportunities for concurrent
execution (define tasks: computation + data)
Domain decomposition: focus on the data operated on by the algorithm,
then distribute computation appropriately
Functional decomposition: focus on the operations performed,
then distribute data appropriately

18
Communication

Identify communication requirements
If computation in one task requires data located
in another, communication is needed
Example: finite difference computation
$X_i = (X_{i-1} + 2X_i + X_{i+1})/4$
Partition creates one task per point
Each task must communicate with each neighbor
[Figure: one task per grid point: X1, X2, X3]
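
To make the communication requirement concrete, the following sketch applies the finite difference update once, treating each interior point as a task that needs its two neighbors' old values. It runs sequentially; the function name `stencil_step` and the fixed boundary values are assumptions made for illustration.

```python
def stencil_step(x):
    """One update X_i = (X_{i-1} + 2*X_i + X_{i+1}) / 4 for interior points.

    Each interior point plays the role of one task. The two neighbour
    values it reads would arrive as two messages in a parallel program.
    """
    new = list(x)  # boundary values are left unchanged (an assumption)
    for i in range(1, len(x) - 1):
        left, right = x[i - 1], x[i + 1]  # data "received" from neighbours
        new[i] = (left + 2 * x[i] + right) / 4
    return new

x = [0.0, 0.0, 4.0, 0.0, 0.0]
print(stencil_step(x))
```

All reads use the old array `x`, so the result does not depend on the order in which points are updated.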
19
Agglomeration

Once tasks & communication are determined,
agglomerate small tasks into larger tasks
Motivations
To reduce communication costs
If tasks cannot execute concurrently
To reduce software engineering costs
Caveats
May involve replicating computation or data

20
Mapping

Place tasks on processors, to
Maximize concurrency
Minimize communication
(note the potential conflict between these two goals)
Techniques
Regular problems: agglomerate to P tasks
Irregular problems: use static load balancing
If irregular in time: dynamic load balancing

21
Example: Atmosphere Model

Simulate atmospheric processes
Conservation of momentum, mass, energy
Ideal gas law, hydrostatic approximation
Represent atmosphere state by 3-D grid
Periodic in two horizontal dimensions
Nx × Ny × Nz grid points, e.g., Ny = 50-500, Nx = 2Ny, Nz = 15-30
Computation includes
Atmospheric dynamics: finite difference
Physics (radiation etc.): in vertical only

22
Atmosphere Model: Numerical Methods

Discretize the (continuous) domain by a regular
Nx × Ny × Nz grid
Store p, u, v, T, ? at every grid point
Approximate derivatives by finite differences
Leads to stencils in vertical and horizontal

23
Atmosphere Model: Partition

Use domain decomposition
Because model operates on large, regular grid
Can decompose in 1, 2, or 3 dimensions
3-D decomposition offers greatest flexibility

24
Atmosphere Model: Communication

Finite difference stencil horizontally: local, regular, structured
Radiation calculations vertically: global, regular, structured
Diagnostic sums: global, regular, structured

25
Atmosphere Model: Agglomeration

In horizontal
Clump so that there are 4 points per task
Efficiency: communicate with 4 neighbors only
In vertical, clump all points in column
Performance: avoid communication
Modularity: reuse physics modules
Resulting algorithm reasonably scalable
(Nx × Ny)/4 = at least 1250 tasks
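
The task count follows from the grid sizes quoted for this example (Ny = 50-500, Nx = 2Ny). The short calculation below is only a worked check of that arithmetic; the helper name `horizontal_tasks` is hypothetical.

```python
def horizontal_tasks(nx, ny, points_per_task=4):
    """Tasks left after clumping 4 horizontal points, and every vertical
    level of each column, into a single task."""
    return (nx * ny) // points_per_task

smallest = horizontal_tasks(nx=2 * 50, ny=50)    # smallest quoted grid
largest = horizontal_tasks(nx=2 * 500, ny=500)   # largest quoted grid
print(smallest, largest)
```

The smallest grid gives the lower bound on the number of tasks; larger grids only increase the available concurrency.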

26
Atmosphere Model: Mapping

Technique depends on load distribution
1) Agglomerate to one task per processor
Appropriate if little load imbalance
2) Extend (1) to incorporate cyclic mapping
Works well for diurnal cycle imbalance
3) Use dynamic, local load balancing
Works well for unpredictable, local imbalances
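
The difference between mapping strategies (1) and (2) can be sketched with a toy workload. Everything here is an assumption made for illustration: the function names, the two-processor setup, and the per-task costs, which loosely imitate a day side that is three times as expensive as the night side.

```python
def block_mapping(n_tasks, n_procs):
    """Strategy (1): contiguous chunks of tasks per processor."""
    chunk = -(-n_tasks // n_procs)  # ceiling division
    return [i // chunk for i in range(n_tasks)]

def cyclic_mapping(n_tasks, n_procs):
    """Strategy (2): deal tasks out to processors round-robin."""
    return [i % n_procs for i in range(n_tasks)]

def loads(mapping, work, n_procs):
    """Total work assigned to each processor under a given mapping."""
    totals = [0] * n_procs
    for proc, w in zip(mapping, work):
        totals[proc] += w
    return totals

# Hypothetical per-task costs: first half expensive, second half cheap.
work = [3, 3, 3, 3, 1, 1, 1, 1]
block_loads = loads(block_mapping(8, 2), work, 2)
cyclic_loads = loads(cyclic_mapping(8, 2), work, 2)
print(block_loads, cyclic_loads)
```

The cyclic mapping evens out the load here because neighboring tasks have similar cost; the price is that neighbors now sit on different processors, so communication increases.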

27
Modeling Performance

Execution time (sums are over P nodes) is
$T = (\sum T_{comp} + \sum T_{comm} + \sum T_{idle})/P$
Computation time comprises both
Operations required by sequential algorithm
Additional work, replicated work
Idle time due to
Load imbalance, and/or
Latency (waiting for remote data)

28
Bandwidth and Latency

Recall cost model
$T = t_s + N\,t_w$
$t_s$ = per-message cost (latency)
$t_w$ = per-word cost ($1/t_w$ = bandwidth)
$N$ = message size in words
Model works well for many algorithms, and on many
computers

29
Measured Costs
30
Typical Communication Costs

Computer          ts      tw
IBM SP2           40      0.11
Intel Paragon     121     0.07
Meiko CS-2        87      0.08
Sparc/Ethernet    1500    5.0
Sparc/FDDI        1150    1.1

Times in microseconds (ts per message, tw per word)
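
One way to read these numbers is to plug them into the cost model. The sketch below does that for a 1000-word message and also reports the message size at which the per-word term equals the per-message term. The helper names and the 1000-word example size are assumptions; the (ts, tw) pairs are the ones listed for the five machines.

```python
# (ts, tw) in microseconds, as listed for each machine.
COSTS = {
    "IBM SP2": (40.0, 0.11),
    "Intel Paragon": (121.0, 0.07),
    "Meiko CS-2": (87.0, 0.08),
    "Sparc/Ethernet": (1500.0, 5.0),
    "Sparc/FDDI": (1150.0, 1.1),
}

def message_time(n_words, ts, tw):
    """Model time T = ts + N * tw for one message, in microseconds."""
    return ts + n_words * tw

def breakeven_words(ts, tw):
    """Message size at which the per-word term equals the per-message term."""
    return ts / tw

for name, (ts, tw) in COSTS.items():
    print(f"{name:15s} T(1000 words) = {message_time(1000, ts, tw):8.1f} us, "
          f"break-even at {breakeven_words(ts, tw):7.1f} words")
```

Messages much shorter than the break-even size are dominated by latency, so on every machine listed it pays to batch small messages.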
31
Example: Finite Difference

Finite difference computation on $N \times N \times Z$ grid
9-point stencil
Similar to atmosphere model earlier
Decompose along one horizontal dimension

32
Time for Finite Difference

Identical computations at each grid point
$T_{comp} = t_c N^2 Z$ ($t_c$ is compute time/point)
1-D decomposition, so each node sends 2NZ data to
each of 2 neighbors if $\geq$ 2 rows/node
$T_{comm} = P(2 t_s + 4 t_w N Z)$
No significant idle time if load balanced
$T_{idle} = 0$
Therefore, $T = t_c N^2 Z/P + 2 t_s + 4 t_w N Z$
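
The model is simple enough to evaluate directly. The sketch below codes the 1-D formula and derives a parallel efficiency from it; the function names and all parameter values are hypothetical, and efficiency is taken here as sequential compute time divided by P times the modeled parallel time.

```python
def fd_time_1d(n, z, p, tc, ts, tw):
    """Modeled time for the 1-D decomposition:
    T = tc*N^2*Z/P + 2*ts + 4*tw*N*Z (no idle time)."""
    return tc * n * n * z / p + 2 * ts + 4 * tw * n * z

def efficiency(n, z, p, tc, ts, tw):
    """Sequential compute time divided by P times the modeled parallel time."""
    return (tc * n * n * z) / (p * fd_time_1d(n, z, p, tc, ts, tw))

# Hypothetical problem size and machine parameters (arbitrary time units).
N, Z, TC, TS, TW = 128, 10, 1.0, 100.0, 0.5
for p in (4, 16, 64):
    print(p, fd_time_1d(N, Z, p, TC, TS, TW), round(efficiency(N, Z, p, TC, TS, TW), 3))
```

Because the communication term does not shrink with P in this decomposition, efficiency falls steadily as processors are added to a fixed-size problem.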

33
Using Performance Models

During design
Use models to study qualitative behavior
Calibrate models by measuring $t_c$, $t_s$, $t_w$, etc.
Use calibrated models to evaluate design
alternatives
During implementation
Compare predictions with observations
Relate discrepancies to implementation or model
Use models to guide optimization process

34
Design Alternatives: Finite Difference

Consider 2-D and 3-D decompositions
Are they ever a win?
If so, when?

35
Design Alternatives (2)

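2-D Decomposition - On a $\sqrt{P} \times \sqrt{P}$ processor grid,
messages of size $2(N/\sqrt{P})Z$ to 4 neighbors, so
$T = t_c N^2 Z/P + 4(t_s + 2 t_w N Z/\sqrt{P})$
Good (faster than 1-D) if $t_s < t_w N Z (2 - 4/\sqrt{P})$
3-D Decomposition - On a $P_x \times P_y \times P_z$ processor grid,
$T = t_c N^2 Z/P + 6 t_s + 2 t_w N^2/(P_x P_y) + 4 t_w N Z/(P_x P_z) + 4 t_w N Z/(P_y P_z)$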
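
The "good if" condition for the 2-D decomposition comes from comparing its communication term with the 1-D term 2 ts + 4 tw N Z, since the computation term is the same in both. The sketch below checks the criterion against a direct comparison; function names and parameter values are hypothetical.

```python
import math

def comm_1d(n, z, ts, tw):
    """Communication term of the 1-D model: 2*ts + 4*tw*N*Z."""
    return 2 * ts + 4 * tw * n * z

def comm_2d(n, z, p, ts, tw):
    """Communication term of the 2-D model: 4*(ts + 2*tw*N*Z/sqrt(P))."""
    return 4 * (ts + 2 * tw * n * z / math.sqrt(p))

def two_d_wins(n, z, p, ts, tw):
    """Criterion for preferring 2-D: ts < tw*N*Z*(2 - 4/sqrt(P))."""
    return ts < tw * n * z * (2 - 4 / math.sqrt(p))

# Hypothetical problem size and machine parameters (arbitrary time units).
N, Z, TS, TW = 128, 10, 100.0, 0.5
for p in (4, 16, 64):
    print(p, comm_1d(N, Z, TS, TW), comm_2d(N, Z, p, TS, TW), two_d_wins(N, Z, p, TS, TW))
```

With P = 4 the factor (2 - 4/sqrt(P)) is zero, so the 2-D decomposition cannot win under this model; it only pays off for larger P, and only when latency is small relative to the data volume.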

36
Finding Model Discrepancies
What we have here is a failure to communicate
37
Impact of Network Topology

Multicomputer model assumes communication cost is independent
of location and of other communications
Real networks are not fully connected
Multicomputer model can break down

[Figure: 2-D mesh and Ethernet network topologies]
38
Competition for Bandwidth

In many cases, a bandwidth constrained model can
give sufficiently accurate results
If S processes must communicate over same wire
at same time, each has 1/S bandwidth
Example: finite difference on Ethernet
All processors share single Ethernet
Hence bandwidth term scaled by P
$T = t_c N^2 Z/P + 2 t_s + 4 t_w N Z P$
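
A consequence of scaling the bandwidth term by P is that adding processors eventually makes the program slower. The sketch below evaluates the bandwidth-constrained formula and searches for the processor count with the lowest modeled time; the function name and all parameter values are hypothetical.

```python
def fd_time_shared_wire(n, z, p, tc, ts, tw):
    """Bandwidth-constrained model: T = tc*N^2*Z/P + 2*ts + 4*tw*N*Z*P."""
    return tc * n * n * z / p + 2 * ts + 4 * tw * n * z * p

# Hypothetical problem size and machine parameters (arbitrary time units).
N, Z, TC, TS, TW = 128, 10, 1.0, 100.0, 0.5

best_p = min(range(1, 65), key=lambda p: fd_time_shared_wire(N, Z, p, TC, TS, TW))
print(best_p, fd_time_shared_wire(N, Z, best_p, TC, TS, TW))
```

Setting the derivative of T with respect to P to zero gives P = sqrt(tc N / (4 tw)), which agrees with the brute-force search for these assumed values.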

39
Bandwidth-Constrained Model Versus Observations
Bandwidth-constrained model gives better fit
40
Tool Survey

High Performance Fortran (HPF)
Message Passing Interface (MPI)
Parallel Computing Forum (PCF) and OpenMP
Portable, Extensible Toolkit for Scientific
Computations (PETSc)

41
High Performance Fortran (HPF)

A standard data-parallel language
CM Fortran, C*, HPC++ are related
Programmer specifies
Concurrency (concurrent operations on arrays)
Locality (data distribution)
Compiler infers
Mapping of computation (owner-computes rule)
Communication

42
HPF Example
PROGRAM hpf_finite_difference
!HPF$ PROCESSORS pr(4)
  REAL x(100,100), new(100,100)
!HPF$ ALIGN new(:,:) WITH x(:,:)
!HPF$ DISTRIBUTE x(BLOCK,*) ONTO pr
  new(2:99,2:99) = (x(1:98,2:99) + x(3:100,2:99) &
                  + x(2:99,1:98) + x(2:99,3:100)) / 4
  diff = MAXVAL(ABS(new - x))
END
43
HPF Analysis

Advantages
High level; preserves sequential semantics
Standard
Disadvantages
Restricted applicability
Requires sophisticated compiler technology
Good for regular, SPMD problems

44
Message Passing Interface (MPI)

A standard message-passing library
p4, NX, Express, PARMACS are precursors
An MPI program defines a set of processes
(usually one process per node)
... that communicate by calling MPI functions
(point-to-point and collective)
... and can be constructed in a modular fashion
(communicators are the key)

45
MPI Example
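/* Fragment: declarations and the setup of local, lsize, lnbr, rnbr are omitted */
int main(int argc, char *argv[]) {
  MPI_Comm com = MPI_COMM_WORLD;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(com, &np);
  /* Send leftmost interior value left; receive right ghost cell from the right */
  MPI_Send(local+1, 1, MPI_FLOAT, lnbr, 10, com);
  MPI_Recv(local+lsize+1, 1, MPI_FLOAT, rnbr, 10, com, &status);
  /* Send rightmost interior value right; receive left ghost cell from the left */
  MPI_Send(local+lsize, 1, MPI_FLOAT, rnbr, 10, com);
  MPI_Recv(local, 1, MPI_FLOAT, lnbr, 10, com, &status);
  /* Global maximum of the per-process error */
  ldiff = maxerror(local);
  MPI_Allreduce(&ldiff, &diff, 1, MPI_FLOAT, MPI_MAX, com);
  MPI_Finalize();
  return 0;
}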
46
MPI Analysis

Advantages
Wide availability of efficient implementations
Support for modular design, code reuse
Disadvantages
Low level (parallel assembly code)
Less well-suited to shared-memory machines
Good for performance-critical codes with natural
task modularity

47
PCF

Standardization (circa 1993) of shared memory
parallelism in Fortran
A PCF program is multithreaded, with explicit
synchronization between the threads and shared
variables
A PCF program is divided into regions
Serial regions: only the master thread executes
Parallel regions: work shared by all threads

48
PCF and OpenMP

PCF per se was not widely implemented
Timing: distributed memory became popular
Complexity: many details for special cases
Not Invented Here (NIH) syndrome
Its ideas resurfaced in OpenMP
Primary differences are the spelling and
low-level controls
Also some significant simplification (claimed to
add scalability)

49
PCF Example (SGI Variant)
PCF standard

!$DOACROSS LOCAL(I), SHARE(A,B,C),
!$& REDUCTION(X),
!$& IF (N.GT.1000),
!$& MP_SCHEDTYPE=DYNAMIC, CHUNK=100
      DO I = 2, N-1
        A(I) = (B(I-1)+B(I)+B(I+1)) / 3
        X = X + A(I)/C(I)
      END DO
!$DOACROSS LOCAL(I), SHARE(D,E),
!$& MP_SCHEDTYPE=SIMPLE
      DO I = 1, N
        D(I) = SIN(E(I))
      END DO

REDUCTION(X): X is a summation
IF (N.GT.1000): conditional parallelization
MP_SCHEDTYPE=DYNAMIC, CHUNK=100: iterations managed first-come, first-served, in
blocks of 100
MP_SCHEDTYPE=SIMPLE: iterations blocked evenly among threads
(INTERLEAVE, GSS, RUNTIME scheduling also available)
50
OpenMP Example

!$OMP PARALLEL DO SHARED(A,B,C),
!$OMP& REDUCTION(+:X),
!$OMP& SCHEDULE(DYNAMIC, 100)
      DO I = 2, N-1
        A(I) = (B(I-1)+B(I)+B(I+1)) / 3
        X = X + A(I)/C(I)
      END DO
!$OMP END PARALLEL DO
!$OMP PARALLEL DO SHARED(D,E),
!$OMP& SCHEDULE(STATIC)
      DO I = 1, N
        D(I) = SIN(E(I))
      END DO
!$OMP END PARALLEL DO

REDUCTION(+:X): X is a summation
SCHEDULE(DYNAMIC, 100): iterations managed first-come, first-served, in
blocks of 100
SCHEDULE(STATIC): iterations blocked evenly among threads (GUIDED
scheduling also available)
51
PCF/OpenMP Analysis