Transcript and Presenter's Notes

Title: Comparing the performance of MPI and MPI + OpenMP for the NAS benchmark on IBM SP3


1
Comparing the performance of MPI and MPI + OpenMP
for the NAS benchmark on IBM SP3
  • Franck Cappello,
  • CNRS, LRI,
  • Université Paris-sud
  • France
  • fci@lri.fr

2
Outline
Background
  • Introduction: Clusters of Multiprocessors (CLUMPs)
  • Single Memory Model: MPI
  • Hybrid Memory Model: MPI + OpenMP

Contribution
  • Comparison of both approaches for the NAS 2.3 benchmarks on NH1 and WH2 SP nodes
  • Understanding the performance of both models
  • Common trends
  • Specific parameters
3
Clusters of Multiprocessors
  • Compaq SC: cluster of 4-way, 8-way, 16-way and 32-way SMPs
  • IBM SP: cluster of 4-way, 8-way or 16-way SMPs (ORNL 1 Tflops)
  • ASCI computers:
  • ASCI White: 512 16-way multiprocessors (cluster of SMPs)
  • ASCI Q: 375 32-way multiprocessors (cluster of NUMA clusters of SMPs)

How to program them efficiently?
4
CLUMP hardware architecture
CLUMP with a hardware hybrid memory model

[Diagram: two SMP nodes; within each node, the processors share a memory over the system interconnect (SMP shared memory) and a network interface sits on the I/O bus; nodes communicate by message passing over the network.]

How to program them efficiently?
Programming issues: single or hybrid memory model; standard or dedicated (research) API.
Performance issues: sharing the memory system; sharing the network interface(s).
5
Different programming models
Single Memory Models
  • Message passing: MPICH-PM/CLUMP (processes), BIP-SMP (processes), and some others (IBM, Compaq, SGI)
  • Software Distributed Shared Memory: SMP-Shasta (threads), Cashmere-2L (processes), TreadMarks (threads)
  • OpenMP implementation: Omni OpenMP (RWCP)

Hybrid Memory Models (message passing plus shared memory)
  • Message passing combined with a thread library: thread runtime library (e.g. POSIX threads)
  • Message passing combined with a library of multi-threaded functions: e.g. the multithreaded BLAS library on ASCI Red
  • Message passing combined with a parallelizing compiler (for shared memory): automatic or directive based, plus a runtime (e.g. OpenMP)
6
Single Memory Model MPI
MPI user space communication intra SMP
communication through shared memory
Node
Node
Memory
Memory
Intra-node communication
Inter-node communication
Processors
Processors
MPI programs run unchanged
7
Hybrid approaches: fine and coarse grain
Combining MPI and OpenMP: two ways

Fine grain (MPI calls stay outside the OpenMP parallel regions):

! MPI initialization
!$OMP PARALLEL
!$OMP DO
      do i = 1, n
         ...
      enddo
!$OMP END DO
!$OMP END PARALLEL
      call MPI_SEND(...)
      end

Coarse grain (the OpenMP parallel section encompasses the MPI calls):

! MPI initialization
!$OMP PARALLEL
      do i = 1, n
         ...
      enddo
      call MPI_SEND(...)
!$OMP END PARALLEL
      end

  • Coarse grain: redesign the algorithm to allow an OpenMP parallelization at the outermost level; threads are managed nearly individually; OpenMP parallel sections encompass MPI function calls.
  • Fine grain: potentially a lot of parallel sections; high overhead for thread management.

8
Fine grain hybrid approach
Message passing (MPI) across nodes, shared memory (OpenMP) within each node.

[Diagram: execution structure of a fine-grain hybrid code. An initialisation section (init MPI, then init of the SMP / parallel part) is followed by sequential sections (some parts may be parallelized with OpenMP), single-thread sections containing non-parallelized computation and communication, and multithreaded OpenMP sections (parallelized computation followed by a synchronization) nested inside the multinode MPI section; an ending section closes the SMP / parallel part.]

Costs: parallelization effort + overheads of the shared memory environment.
9
Parallelization effort for the fine-grain hybrid approach
  • Incremental
  • OpenMP directives are applied to computational loop nests
  • OpenMP synchronisation directives
  • Some loop nests are transformed to become parallelizable or to reach a reasonable speed-up (loop exchange, splitting, merging, etc.)
  • Some parts of the loop cores are rewritten to avoid expensive synchronizations or memory hierarchy penalties, in particular to avoid false sharing (see the sketch after this list)
  • Filling and emptying of the communication buffers are NOT parallelized
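As a concrete illustration of the loop-core rewrites mentioned above, here is a minimal sketch (not taken from the talk; the routine name, arrays and bounds are made up) of removing false sharing: instead of each thread accumulating into its own slot of a shared array (adjacent slots share a cache line), each thread accumulates into a private scalar combined through an OpenMP reduction.

! Hypothetical example, not part of the NAS code.
! Problematic pattern: partial(tid) = partial(tid) + a(k)*p(colidx(k))
! makes the cache line holding partial(...) bounce between processors.
! Rewritten with a private accumulator and a reduction:
      subroutine dot_like(n, a, p, colidx, s)
      implicit none
      integer n, colidx(n), k
      double precision a(n), p(*), s, sum
      sum = 0.d0
!$OMP PARALLEL DO REDUCTION(+:sum)
      do k = 1, n
         sum = sum + a(k)*p(colidx(k))
      enddo
!$OMP END PARALLEL DO
      s = sum
      return
      end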

[Flow chart: starting from the MPI code, profiling and the calling graph drive the loop nest selection; this produces a candidate hybrid code whose semantics and speedup are checked; if correct it becomes the final hybrid code, if false or slower the parallelization is removed or refined.]
10
Parallelizing MPI programs: an example

      subroutine conj_grad ( colidx, ... )
      ...
      do i = 1, l2npcols
         call mpi_irecv( rho, ... )
         call mpi_send( sum, ... )
         call mpi_wait( request, ... )
      enddo
!$OMP PARALLEL PRIVATE(k,sum)
!$OMP DO
      do j = 1, lastrow-firstrow+1
         sum = 0.d0
         do k = rowstr(j), rowstr(j+1)-1
            sum = sum + a(k)*p(colidx(k))
         enddo
         w(j) = sum
      enddo
!$OMP END DO
!$OMP END PARALLEL
      do i = l2npcols, 1, -1
         call mpi_irecv( ... )
         call mpi_send( ... )
         call mpi_wait( request, ... )
      enddo
      return
      end

      program cg
      call initialize_mpi
      call setup_proc_info( num_procs, ... )
      call setup_submatrix_info( l2npcols, ... )
      do it = 1, niter
         call conj_grad ( colidx, ... )
         do i = 1, l2npcols
            call mpi_irecv( norm_temp2, ... )
            call mpi_send( norm_temp1, ... )
            call mpi_wait( request, ... )
         enddo
      enddo
      call mpi_finalize(ierr)
      end

Parallelization effort: reasonable.
11
Platform hardware: NH1 and WH2 nodes

[Diagram: NH1 and WH2 node architectures; each node has its memory, an I/O bus and a network adapter; one node type connects 8 processors to memory through a crossbar, the other connects 4 processors through a bus.]

NH1 (CINES, Montpellier): 14 nodes; memory bandwidth 14.2 GB/s; processor frequency 222 MHz; network bandwidth 150 MB/s (300 MB/s bidirectional); only four windows for user-space communications.

WH2 (Poughkeepsie, US): up to 32 nodes; memory bandwidth 1.6 GB/s; processor frequency 375 MHz; network bandwidth 150 MB/s (300 MB/s bidirectional).
12
Software environment
AIX 4.3.3; XLF 6.1 (OpenMP directives). Problem with EP: the sqrt() and log() Fortran intrinsics were not reentrant with XLF 6.1 (bad scalability and unsuccessful result verification); fixed in XLF 7.1. PPE 2.4 (shared memory intra-node communications).

Compiler options for the MPI version: -O3 -qarch=pwr3 -qtune=pwr3 -qcache=auto
Compiler options for the MPI + OpenMP version: -O3 -qarch=pwr3 -qtune=pwr3 -qcache=auto -qsmp=noauto:schedule=static
Linker option for both versions: -bmaxdata:2024000000 (for some Class B benchmarks)
Environment options for the MPI version: export XLSMPOPTS=parthds=1 and export OMP_NUM_THREADS=1 (if there are calls to thread library functions)
13
MPI versus MPI + OpenMP
NAS 2.3 Class A, two different SP node types

[Charts: execution-time ratio MPI / (MPI + OpenMP) for cg, ft, mg, lu, bt and sp on 1, 2, 4 and 8 nodes; left panel 4-way NH1 nodes (222 MHz), right panel 4-way WH2 nodes (375 MHz). A ratio < 1 means better MPI performance.]

  • 1) With NH1, MPI is always better
  • 2) With WH2, results depend on the benchmark:
  • LU, SP: MPI outperforms
  • CG, FT: MPI + OpenMP has better performance

The best model depends on the node hardware
features
14
MPI versus MPI + OpenMP
NAS 2.3, two different data set sizes (Classes A and B)

[Charts: same ratio on 4-way WH2 nodes, for Class A (left) and Class B (right).]

  • 1) Classes A and B follow the same general trends
  • 2) MPI + OpenMP is slightly better for Class B:
  • CG, for any number of nodes
  • LU, BT, SP

The performance gap between the two models depends on the data set size.
15
MPI versus MPI + OpenMP: SDSC experiments

Blue Horizon IBM SP (144 NH1 nodes, mpxlf_r, version 7.0.1). Proper hybrid parallelization of the NAS 2.3 (different from ours):
  • fine grain (CG, FT, MG)
  • coarse grain (MG)

High parallelization effort. Number of loop nests parallelized for the fine-grain versions:

            SDSC   LRI
  MG        50     7
  CG        18     3
  FT        8      11

Tests with Class A and C data set sizes. Results presented at SCIComp 2000:
  • the fine-grain hybrid approach is generally worse than pure MPI
  • the coarse-grain approach for MG is comparable with pure MPI or slightly better (but the coarse-grain approach is time and effort consuming)
16
Understanding the differences between the two models (LRI experiments)

Breakdown of the execution time (measured with timer function calls around each section):
  • cost of the section that cannot be parallelized with OpenMP (initialization of the SMP part, non-parallelized computation)
  • speed-up on the parallel regions (parallelized computation and synchronization)
  • cost of sharing the communication supports

The next slides present only selected results; the paper presents the complete results.
17
Cost of the section that cannot be parallelized with OpenMP

[Chart: percentage of the execution time spent in the parallel section (1-way nodes); Pn exec. time / total exec. time on 4-way WH2 nodes, for cg, ft, lu and mg, Classes A and B, on 1 to 32 nodes.]

Maximum theoretical speedup for MPI + OpenMP:

            cg-A   cg-B   ft-A   ft-B   lu-A   lu-B   mg-A   mg-B
  Speedup   1.22   1.65   2.20   2.50   1.56   1.76   1.75   1.71
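The slide does not spell out how these bounds are obtained; presumably they follow Amdahl's law with 4 threads per 4-way node, i.e. for a parallel fraction p of the execution time

    S_max = 1 / ((1 - p) + p/4)

so that, for example, p of about 0.8 gives S_max of about 2.5 (ft-B), while p of about 0.24 gives only about 1.22 (cg-A).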
Non-parallelized sections may be the most limiting parameter for the MPI + OpenMP model (LU, MG, BT, SP)
18
Speed-up within the parallelizable sections

[Chart: parallel efficiency within the parallel regions on 4-way WH2 nodes, Class A, for cg, ft, lu and mg, MPI versus MPI + OpenMP, as a function of the number of nodes. Chart annotations: "better cache utilization" (efficiency above 1) and "memory hierarchy penalty".]
  • Parallel efficiency on the parallel regions is better for MPI (similar results for Class B)
  • There are two main reasons behind that:
  • thread management and synchronization costs
  • OpenMP cannot express multidimensional blocking, except through explicit manual loop fusion (see the sketch below)
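A minimal sketch (mine, not from the talk; routine and array names are hypothetical) of the manual loop fusion alluded to above: without a way to distribute a multidimensional iteration space directly, the two loops are fused into one and the indices are recovered inside the loop body.

! Hypothetical example, not part of the NAS code.
! With a directive on the outer loop only, just ni iterations are
! distributed among the threads:
!      do i = 1, ni
!         do j = 1, nj
!            u(j,i) = u(j,i) + f(j,i)
!         enddo
!      enddo
! Manually fused version: one loop of ni*nj iterations, indices recovered
! from the fused index, so the whole 2-D space is split among the threads:
      subroutine fused_update(ni, nj, u, f)
      implicit none
      integer ni, nj, ij, i, j
      double precision u(nj,ni), f(nj,ni)
!$OMP PARALLEL DO PRIVATE(i, j)
      do ij = 0, ni*nj - 1
         i = ij / nj + 1
         j = mod(ij, nj) + 1
         u(j,i) = u(j,i) + f(j,i)
      enddo
!$OMP END PARALLEL DO
      return
      end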

19
Cost of sharing the communication supports

[Charts: CG Class A on 4-way WH2 nodes; computation time (CG_A Comp.) and communication time (CG_A Comm.) in seconds, MPI versus MPI + OpenMP, for 1 to 32 nodes.]
  • MPI has the advantage on computation time.
  • MPI + OpenMP has the advantage on communication time:
  • N processes versus 4N processes for N nodes
  • communication bandwidth
  • Similar trends for MG and FT.
  • Non-parallelized sections may be the most limiting parameter for the MPI model (CG, FT)

20
Why communication time is higher with MPI

[Chart: CG communication time (seconds) for 1 to 32 nodes, MPI versus MPI + OpenMP.]

[Diagram: communication patterns on 4-way nodes. Pure MPI: 16 processes on 4 nodes, 8 communications per node. MPI + OpenMP: 4 processes on 4 nodes, 2 communications per node.]

1 processor versus 4 processors sending the same amount of bytes:

                            1-way      4-way internal   4-way external
  Bidirectional bandwidth   193 MB/s   360 MB/s         49 MB/s
  Latency                   17 µs      7.8 µs           37 µs
More complex communication patterns lead to more messages for the MPI version and a higher communication time on WH2.
21
Summary 1/3
Two models for executing existing MPI programs on CLUMPs: MPI and MPI + OpenMP
Two IBM SP systems, using NH1 and WH2 nodes
Two data set sizes of the NAS NPB 2.3 benchmarks
22
Summary 2/3
Selecting one model is not trivial. Several parameters must be considered:
  A) The level of shared memory parallelization of the MPI computational loop nests.
  B) The speedup efficiency on the purely parallel sections. OpenMP has several limitations: 1) thread management and synchronization costs, and 2) it cannot express multidimensional blocking (or requires code rewriting).
  C) The communication cost. For the same number of CPUs, this cost depends on how the communication patterns of the application match the communication hardware and software of the CLUMP. Generally MPI has a higher communication time.
  D) The performance balance of the main components of the nodes (CPUs, memory, network).
23
Summary 3/3
  • It is very difficult to know a priori whether the hybrid model will be useful for a particular application (100 x 10^3 lines):
  • how much gain will the hybrid approach provide?
  • how much time will it take to get a significant improvement?
  • will a hybrid implementation be efficient on next-generation parallel computers (e.g. with IBM Power4, Alpha EV7 and EV8)?

24
Perspectives
Comparison of the SMM (message passing) and HMM approaches for clusters of SMPs with a high-speed shared memory system:
  • IBM SP3 NH2
  • Compaq SC clusters
  • SGI Origin 2000 / SN1
  • IBM NUMA based on Power4 (not yet available)
  • Compaq NUMA based on the 21364 (not yet available)
for the NAS benchmarks and high performance kernels (ScaLAPACK).

We are open to collaboration with other HPC groups, to help choose the programming model and improve the performance of real-life applications.
25
For more details
  • "Performance characteristics of a network of commodity multiprocessors for the NAS benchmarks using a hybrid memory model", PACT'99, Newport Beach, CA, Oct. 1999
  • "Investigating the performance of two programming models for clusters of SMP PCs", HPCA-6, Toulouse, France, Jan. 2000
  • "Understanding performance of SMP clusters running MPI programs", Future Generation Computer Systems, Elsevier, to appear 2000
  • "MPI versus MPI+OpenMP on IBM SP for the NAS Benchmarks", Supercomputing 2000
  • fci@lri.fr and www.lri.fr/fci
26
Significance of the NAS benchmark
NAS Benchmarks 2.3: CG, MG, EP, FT, LU, SP, BT, IS. Date of availability: 1996-97.

Important characteristics:
  • they use very simple communication functions (of MPI)
  • they have been designed according to the memory hierarchy of 1996: 1 or 2 cache levels (8 kB for L1 and 64 kB for L2)
  • Class A corresponds to current workstations (memory size)
  • Class B requires a reasonable memory size on one node (2 GB)

They do not correspond to current memory hierarchy characteristics; this is the case for most existing codes. The comparison results presented here do not correspond to highly tuned codes.
27
Parallel Architectures team of LRI
Clusters and Grid group
  • Parallel environments for clusters at different scales (SAN, LAN, WAN)
  • Real-life architectures and applications
  • Performance evaluation
  • Isolating and understanding performance parameters (use of benchmarks)
  • Modeling and simulating application/architecture combinations
  • Improving application performance

28
Hardware features

[Chart: breakdown of the execution time (seconds) for CG Class B on 4-way WH2 and NH1 nodes, as a function of the number of 4-way nodes.]

Faster processors reduce the communication time.