Parallelization and Performance Expectation

Transcript and Presenter's Notes

1
Parallelization and Performance Expectation
  • Wai Yip Kwok
  • PECM

2
Outline
  • Why parallel computing
  • Parallel computer models
  • Parallelization procedure and paradigm
  • Performance measurements
  • Factors affecting scalable performance
  • Summary

3
Performance improvement
(Images: www.gasgas.com, www.pennfieldinc.com/research.inc)
  • More computers → less time needed

4
A necessity
  • One person (computer) has its own limit
  • Some jobs are too big, daunting to take on alone

Phenomenon               Time scale   Time-step count
Protein thermo.          1-10 µs      10^9
Protein kinetics         500 µs       10^11

David Klepacki, SciComp 5, 2002
(Image: www.lordoftherings.net)
5
  • Why parallel computing
  • Parallel computer models
  • Shared-memory machine
  • Distributed-memory machine
  • Parallelization procedure and paradigm
  • Performance measurements
  • Factors affecting scalable performance
  • Summary

6
Shared Memory System
[Diagram: several processors connected through a network to a single shared pool of memory]
  • Access by all processors to a single pool of
    memory
  • Simple programming

7
Distributed Memory System
[Diagram: each processor has its own local memory; processors are connected by a network]
  • Data locality
  • Communication through message passing
  • Requires more sophisticated programming

8
  • Why parallel computing
  • Parallel computer models
  • Parallelization procedure and paradigm
  • Procedure
  • OpenMP
  • Message passing interface (MPI)
  • Performance measurements
  • Factors affecting scalable performance
  • Summary

9
Parallelization Implementation
Serial code → debugging → parallel code using 1 processor → debugging and scalability measurements → parallel code using p processors
10
OpenMP
[Fork/join diagram: the master thread executes serially, spawns 3 parallel threads, the parallel threads terminate, and serial execution continues — see the sketch below]
  • Web site: http://www.openmp.org
  • Shared-memory machines only
  • Oct 31 morning
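
A minimal C sketch of this fork/join model (the loop and variable names here are illustrative, not taken from the course examples):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;
        int i;

        printf("master thread executing serially\n");

        /* The master thread spawns a team of parallel threads here;
           the loop iterations are divided among them. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < 1000000; i++)
            sum += 1.0 / (i + 1.0);

        /* The parallel threads have terminated; serial execution continues. */
        printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }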

11
Message Passing Interface (MPI)
[Diagram: data passing among processors over a network]
  • Shared- and distributed-memory machines
  • Oct 30 afternoon

12
  • Why parallel computing
  • Parallel computer models
  • Parallelization procedure and paradigm
  • Performance measurements
  • Speedup and efficiency
  • Fixed total problem size vs. fixed problem size
    per processor
  • Factors affecting scalable performance
  • Summary

13
Speedup and Efficiency
  • Definition (see the formulas below)
  • Sp: speedup
  • T1: time for a single-processor run
  • Tp: time for the equivalent calculation using p
    processors
  • Perfect speedup
  • Efficiency
  • Average speedup per processor
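
Written out, the standard definitions are:

    \[ S_p = \frac{T_1}{T_p}, \qquad \text{perfect speedup: } S_p = p, \qquad E_p = \frac{S_p}{p} \]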

14
Fixed Total-problem-size Test
ZEUS-MP CFD code by Laboratory Computational
Astrophysics, UIUC
[Log-log plot: time (s) versus number of processors (1 to ~1000) for the SGI O2k, IA-32 Linux cluster, and IA-64 Linux cluster]
15
Fixed Problem-size-per-proc Test
GEN2 Multi-physics code by Center for
Simulations of Advanced Rockets, UIUC
16
  • Why parallel computing
  • Parallel computer models
  • Parallelization procedure and paradigm
  • Performance measurements
  • Factors affecting scalable performance
  • Limits in problem size
  • Load imbalance
  • Amdahl's Law for partially-parallel code
  • Parallel overhead
  • Summary

17
Limits in Problem Size
[Plot: time (s) versus number of processors, for a small problem and a large problem]
  • More processors can lead to diminishing returns!
  • Large problems benefit from use of more processors

18
Load Imbalance
  • Load balancing: equal workload for all processors
  • Avoid idleness in any processor
  • An example: a 12 × 12 grid (Nx = 12, Ny = 12, 144 blocks)
    distributed over 100 processors
  • 44 processors get 2 blocks while 56 get only 1, so the
    busiest processors take twice as long as the lightest ones
    and parallel efficiency is at most 144/200 = 72%
19
Amdahl's Law
  • Speedup formula for a partially-parallel code (see below)
  • f: fraction of operations done serially
  • (1-f): fraction of operations done in perfect
    parallelism
  • f = 0 → Sp = p, i.e., perfect parallelism
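
In formula form (the standard statement of Amdahl's law in this notation):

    \[ S_p = \frac{1}{f + \dfrac{1-f}{p}} \;\xrightarrow{\;p \to \infty\;}\; \frac{1}{f} \]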

20
Implication of Amdahl's Law
  • Even a small f causes a large efficiency drop! (worked case below)
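
A worked case (illustrative numbers, not from the original slide): with f = 0.01 and p = 100,

    \[ S_p = \frac{1}{0.01 + 0.99/100} \approx 50, \qquad E_p \approx 50\% \]

so even a 1% serial fraction roughly halves the efficiency on 100 processors.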

21
Parallel Overhead
  • Synchronization
  • Parallel thread creation
  • Blocking threads
  • Begin/end parallel regions

22
Outline
  • Why parallel computing
  • Parallel computer models
  • Parallelization procedure and paradigm
  • Performance measurements
  • Factors affecting scalable performance
  • Summary

23
What's Coming
  • MPI presentation/lab (afternoon)
  • OpenMP presentation/lab (tomorrow morning)
  • Performance analysis

24
Parallel Performance Analysis Tools on NCSA's
Origin 2000 and Linux Clusters
  • Rick Kufrin
  • Performance Engineering

25
Goals
  • Not a tuning / optimization session
  • Single-processor ideas/techniques just as
    applicable to parallel, plus
  • Parallel overhead
  • Load-balancing
  • Cache-coherency issues
  • Message-passing and network measurements
  • This portion of the workshop is a how-to
    cookbook at NCSA as of 10/2002

26
Current Status (Oct 2002)
  • Each of the major NCSA systems is in a different
    state of maturity with respect to parallel tools
  • Origin 2000: stable. New tools/libraries are unlikely
    to be installed.
  • IA-32 PIII cluster: tools should work; if not,
    please contact the consulting office.
  • IA-64 Itanium cluster: out-of-date kernel, out-of-date
    system libraries (e.g., glibc, pthreads).
    Major impact on parallel support for tools. Work
    ongoing; expect changes.

27
Tools Changing Rapidly
  • Keep abreast of developments and new tools in
    NCSA's Datalink
  • http://www.ncsa.uiuc.edu/News/datalink
  • This talk covers only a small set of basic
    skills and tools
  • A number of tools not covered here are available
    and continually developed
  • Standard place to look on the Linux clusters:
    /usr/apps/tools

28
Models of Parallelism
  • Shared-memory
  • OpenMP
  • POSIX threads (pthreads)
  • Message-passing
  • MPI
  • Hybrid (combination of above)
  • Simple examples of each: cpi.c (from the MPICH
    distribution) and mpithreads (IBM/LLNL)

29
What To Measure?
  • Where is the program (within-CPU) spending its
    time? → profile
  • What's it doing? → hardware counters
  • Message-passing? → tracing/logging
  • Network, devices, and more → other tools

30
Where Is The Time Spent?
  • PROFILING PARALLEL APPS
  • On the Origin
  • Use SpeedShop and Perfex
  • ssusage - overall resource usage
  • ssrun - provides access to various types of data
  • perfex - application hardware performance
  • SpeedShop experiments work on parallel apps
    similarly to serial ones.
  • On the Linux clusters
  • GNU gprof, VProf (Sandia), HPM (IBM)

31
Origin Parallel Profiling
  • Tune for a single CPU, then move on to parallel
  • Use the full path of the tool
  • mpirun -np 8 /bin/perfex -e 0 -e 21 -y -mp -o
    perfdata a.out
  • mpirun -np 4 /bin/ssrun -ideal a.out
  • SpeedShop will name resulting data files
    according to how spawned
  • <executable>.<exp_type>.<id>

32
Visualizing Message-Passing
  • On the Origin
  • Vampir and the VampirTrace libraries are the most
    flexible
  • On the Linux clusters
  • The MPE (Multi Processing Environment) extensions
    to MPICH from Argonne are good choices

33
Linux Cluster Parallel Profiling
  • VProf (Curtis Janssen, Sandia)
  • The Visual Profiler: two tools (one graphical,
    one text-based) and supporting libraries
  • Three methods of collection: program counter, or
    performance counters using perfctr (IA-32) or
    PAPI
  • NCSA has an experimental version on IA-64; only PAPI
    collection is supported

34
VProf Summary Display
  • Summary window provides overall view of
    application. Can load multiple runs (e.g.,
    FLOPS, cache misses, cycles)

35
VProf Source Code Browser
  • Selecting a particular measurement will pop up a
    source code browser where measurements are
    related to the program listing.

36
CProf Text-Based Output
  • cprof -l main3 vmon.out
  • Columns correspond to the following events:
  • PAPI_FLOPS - floating-point instructions per
    second (500 events)
  • Line Summary:
  •  60.0%  /u/ncsa/rkufrin/apps/ASPCG/old/pc_jac2d_blk3.f:17
  •  21.2%  /u/ncsa/rkufrin/apps/ASPCG/old/pc_jac2d_blk3.f:30
  •  13.8%  /u/ncsa/rkufrin/apps/ASPCG/old/pc_jac2d_blk3.f:24
  •   2.2%  /u/ncsa/rkufrin/apps/ASPCG/old/matxvec2d_blk3.f:15
  •   1.4%  /u/ncsa/rkufrin/apps/ASPCG/old/cg3_blk.f:60
  •   0.8%  /u/ncsa/rkufrin/apps/ASPCG/old/pc_jac2d_blk3.f:16
  •   0.2%  /u/ncsa/rkufrin/apps/ASPCG/old/main3.f:94
  •   0.2%  /u/ncsa/rkufrin/apps/ASPCG/old/cg3_blk.f:63
  •   0.2%  /u/ncsa/rkufrin/apps/ASPCG/old/cg3_blk.f:109

Can select files, functions, lines (above), or
everything
37
Parallel Profiling With VProf
  • Recompilation with -g necessary to obtain symbol
    information
  • Call vprof routines directly (e.g.
    vmon_done_task())
  • serial codes can be left unchanged and linked
    with vmonauto_gcc.o in the VProf directory
  • Relink with the VProf library
  • -L/usr/apps/tools/lib -lvmon
  • Select measurement through environment variable
    VMON and run
  • Examine results
  • vprof a.out vmon.out ...

38
HPM Toolkit / Linux Clusters
  • Luiz DeRose (IBM ACTC)
  • Instrumentation library for Fortran, C, and C++
    applications
  • Uses PAPI interface
  • For each instrumented point in an instrumented
    program, libhpm provides
  • Total count
  • Total duration (wall clock time)
  • Hardware performance counters information
  • Derived metrics

39
LIBHPM
  • Supports MPI, OpenMP, and POSIX threads
  • Multiple instrumentation points
  • Default limit of 100
  • Can be extended with environment variable
  • Nested instrumentation
  • Provides exclusive duration for the outer points
  • Multiple calls to an instrumented point
  • Average and standard deviation is provided when
    one instrumentation point is activated multiple
    times

40
Hardware Events Selection
  • Set of hardware events to be used can be
    selected via the environment variable
    LIBHPM_EVENT_SET
  • User can also specify an event set with the file
    libHPMevents

41
libHPMevents File
  • Each line in the file specifies one hardware
    event
  • Counter number
  • Event number
  • Mnemonic
  • Description
  • libHPMevents example
  • 0 0 PAPI_FLOPS FLOPS
  • 1 5 native native event

42
LIBHPM Functions
  • C / C++
  • hpmInit(taskID, PName)
  • hpmTerminate(taskID)
  • hpmStart(instID, label)
  • hpmStop(instID)
  • hpmTstart(instID, label)
  • hpmTstop(instID)
  • Fortran
  • f_hpminit(taskID, PName)
  • f_hpmterminate(taskID)
  • f_hpmstart(instID, label)
  • f_hpmstop(instID)
  • f_hpmtstart(instID, label)
  • f_hpmtstop(instID)

43
Using LIBHPM - C and C++
  • Declaration
  • extern "C" void hpmInit( int my_ID, char *prog )
  • extern "C" void hpmTerminate( int my_ID )
  • extern "C" void hpmStart( int inst_ID, char *label )
  • extern "C" void hpmStop( int inst_ID )
  • Use
  • hpmInit( taskID, "prog" )
  • hpmStart( 1, "work" )
  • do_work()
  • hpmStart( 5, "more work" )
  • do_more_work()
  • hpmStop( 5 )
  • hpmStop( 1 )
  • hpmTerminate( taskID )

44
Using LIBHPM - Fortran
  • Declaration
  • #include <f_hpm.h>
  • Use
  • call f_hpminit( taskID, "prog" )
  • call f_hpmstart( 1, "work" )
  • do
  • call do_work()
  • call f_hpmstart( 22, "more work" )
  • call do_more_work()
  • call f_hpmstop( 22 )
  • end do
  • call f_hpmstop( 1 )
  • call f_hpmterminate( taskID )

45
Compiling / Linking with HPM
  • HPM directory: /usr/apps/tools/hpm
  • Fortran files w/HPM calls must be passed through
    the C preprocessor
  • Supply flag -fpp2 for the Intel compilers
  • Architecture-specific libraries/binaries named
    ia32 or ia64 (look in the directories), lib
    subdirectory according to compiler
  • Link with HPM and the PAPI library
  • -lhpm -lpapi
  • Don't forget to set LD_LIBRARY_PATH!

46
Multi-Threaded Support
47
HPMVIZ Displays
48
Origin MPI Tracing w/Vampir
49
Compiling/Linking for Vampir
  • VAMPIR (Origin only) can be used to trace
    programs in various ways. Basic procedure to
    link with Vampir library
  • cc -o myprog myprog.c -L/usr/apps/tools/vampirtrace/lib/lib64
    -lVT -lmpi -ldwarf -lelf -lexc -lm
  • mpirun -np 8 myprog
  • Result is a Vampir tracefile with the suffix
    .bpv that you can view from within Vampir.
  • http://www.ncsa.uiuc.edu/UserInfo/Resources/Software/Math/Vampir/

50
Jumpshot-3 (MPE)
Overall view of application - from here you can
focus in on areas of interest. Initial screen
shows the entire application.
51
Jumpshot-3 Timeline
Detailed view of messages sent and delivered
between processes. Individual MPI functions or
user-defined information can be selectively
viewed.
52
Clusters Tracing/Logging w/MPE
  • MPE libraries in /usr/apps/tools/mpe
  • Link with MPE library and logging libraries (also
    need VMI-related libs)
  • -llmpe -lmpe -lpmpich -lmpich
  • Fortran programs need additional wrapper library
    inserted at beginning
  • -lfmpich
  • Run the program as usual, producing a .clog file

53
MPE Logfile Formats
  • MPE has defined three different log file formats
  • alog - earliest of the formats, used by Upshot
  • clog - format produced by the MPE libraries
  • slog (scalable log) - format used by Jumpshot
  • Conversion utilities exist
  • clog2alog, clog2slog (/usr/apps/tools/mpe/bin)
  • The logviewer tool invokes the appropriate
    visualizer, based on suffix

54
Keeping On Top Of Things
  • Many parallel tools are being developed by
    internal/external groups and are being evaluated
    and installed on NCSA systems, e.g.
  • TAU (Oregon, LANL, Julich)
  • Paradyn (Wisconsin)
  • PerfSuite (NCSA)
  • Visit NCSA's WWW site to stay up-to-date. Ask
    questions, try things out!

55
Break
  • Time for a break

56
Parallel Tuning
  • Gregory Bauer
  • PECM

57
Performance Issues
  • Parallelism and Memory
  • Topology and Communication
  • Blocking versus non-blocking
  • One-way communication
  • Parallel I/O
  • MPI 2.0
  • Timing
  • Performance tools

58
Parallelism and Memory
  • Fine-grained versus coarse-grained approach
  • Fine: loop-level parallelism
  • Beware of overhead
  • Coarse: spatial domain decomposition
  • Data synchronization and movement
  • Shared versus distributed memory
  • Shared: lower latency, higher bandwidth
  • Distributed: higher latency, lower bandwidth

59
Parallelism and Memory
  • OpenMP is typically associated with loop level
    parallelism and shared memory
  • Coarse grained is possible
  • Spatial domain decomposition achieved by SPMD
    (single program multiple data)
  • MPI not limited to distributed systems
  • Origin MPI implemented as shared memory (SHMEM)

60
Topology
  • Origin: hypercube topology
  • [Diagram legend: each node is a dual-R10K node; nodes are connected by routers]
  • Distributed shared memory (DSM)
  • Linux clusters:
  • Myricom Myrinet switch
  • Based on a Clos network
  • Distributed memory

61
Shared Memory
  • Shared memory bottleneck
  • Competition by processors for limited throughput
    to/from main memory
  • PC multiprocessor boards: 2, 4, maybe 8 processors
  • SGI DSM alleviates the problem by mixing a
    distributed component into shared memory
  • Overhead for cache coherency etc.
  • Interconnection throughput limited

62
Communication
  • MPI communication on Linux clusters
  • Inter-node (Myrinet, Platinum/Titan)
  • Latency > 10 µs, bandwidth 125 to 250 MB/s
  • Intra-node (SHMEM, Platinum/Titan)
  • Latency < 10 µs, bandwidth 200 to 450 MB/s
  • MPI communication on Origin
  • Intra-node (SHMEM)
  • Latency < 10 µs, bandwidth 140 MB/s
  • Less communication is better

63
Origin Environment Variables
  • For optimal communication
  • MPI
  • setenv MPI_DSM_PLACEMENT firsttouch
  • setenv MPI_DSM_MUSTRUN
  • OpenMP Threads
  • setenv _DSM_PLACEMENT FIRST_TOUCH
  • setenv _DSM_MUSTRUN
  • Enables processor/memory affinity

64
Blocking versus non-blocking
  • Blocking communication does not complete until
    communication buffer is filled/emptied, forces
    synchronization.
  • Non-blocking communication allows for computation
    to proceed while communication buffer is being
    filled/emptied.
  • Blocking is conceptually clearer but typically less
    efficient

65
MPI Point-to-point
  • MPI_Send/MPI_Recv does not complete until buffer
    is empty/full (available for reuse/use).
  • Use non-blocking operations that return
    (immediately) "request handles" that can be
    waited on and queried:
  • MPI_Isend(start, count, datatype, dest, tag,
    comm, request)
    MPI_Irecv(start, count, datatype, source, tag,
    comm, request)
    MPI_Wait(request, status)
  • One can also test without waiting:
  • MPI_Test(request, flag, status)
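
A minimal C sketch of this non-blocking pattern (assumes exactly two processes; the buffer names are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, other;
        double sendval, recvval = 0.0;
        MPI_Request sreq, rreq;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;                 /* partner rank, assuming 2 processes */
        sendval = rank + 1.0;

        /* Post the non-blocking send and receive, then keep computing
           while the communication progresses. */
        MPI_Isend(&sendval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &sreq);
        MPI_Irecv(&recvval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &rreq);

        /* ... useful work that touches neither sendval nor recvval ... */

        MPI_Wait(&sreq, &status);         /* sendval may now be reused   */
        MPI_Wait(&rreq, &status);         /* recvval now holds the data  */
        printf("P:%d received %f\n", rank, recvval);

        MPI_Finalize();
        return 0;
    }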

66
OpenMP Point-to-Point
  • Non-blocking producer-consumer synchronization
  • One thread computes a result, other thread(s) read the
    result → synchronize only those two (or more) threads

      ! PRODUCER THREAD
      !$OMP SINGLE
      READY = .FALSE.
      !$OMP END SINGLE
      f(subdomain) = f(subdomain)        ! update
      READY = .TRUE.

      ! CONSUMER THREAD
      !$OMP SINGLE
      READY = .FALSE.
      !$OMP END SINGLE
      do while (.not. READY)
         ! do useful work while waiting
      enddo
      f(i) = f(i-1) + f(i) + f(i+1)
67
Collective
  • Typical collective communication synchronizes
    data exchange among processes.
  • The synchronization barrier does not scale well
    for large numbers of processors. If possible,
    reduce synchronization.
  • MPI and OpenMP: mimic non-blocking collective
    communication with point-to-point routines (see the
    sketch below)
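
As an illustration of that last point, a C sketch that mimics a non-blocking broadcast with MPI_Isend/MPI_Irecv (this is not a standard MPI routine; root, tag, and payload are illustrative):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size, i, nreqs = 0, root = 0;
        double value = 0.0;
        MPI_Request *reqs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        reqs = malloc(size * sizeof(MPI_Request));

        if (rank == root) {
            value = 3.14;
            for (i = 0; i < size; i++)    /* root posts one Isend per other rank */
                if (i != root)
                    MPI_Isend(&value, 1, MPI_DOUBLE, i, 99,
                              MPI_COMM_WORLD, &reqs[nreqs++]);
        } else {                          /* everyone else posts one Irecv */
            MPI_Irecv(&value, 1, MPI_DOUBLE, root, 99,
                      MPI_COMM_WORLD, &reqs[nreqs++]);
        }

        /* ... useful work can overlap the "broadcast" here ... */

        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
        printf("P:%d value=%f\n", rank, value);
        free(reqs);
        MPI_Finalize();
        return 0;
    }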

68
One-way
  • One process specifies sender and receiver
  • User in charge of ordering
  • MPI2.0
  • MPI_Put and MPI_Get commands
  • Additional synchronization commands
  • Provides non-blocking OpenMP-like memory access
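
A minimal C sketch of MPI-2 one-sided communication (requires at least two processes; the single-double window and the value 42.0 are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double window_buf = 0.0, local = 0.0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every process exposes one double for one-sided access. */
        MPI_Win_create(&window_buf, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);            /* synchronization: open access epoch  */
        if (rank == 0) {
            local = 42.0;
            /* Rank 0 writes into rank 1's window; rank 1 makes no matching call. */
            MPI_Put(&local, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);            /* synchronization: close access epoch */

        if (rank == 1)
            printf("P:1 received %f via MPI_Put\n", window_buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }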

69
I/O
  • MPI 1.0 and OpenMP do not contain parallel I/O
    features
  • MPI can be used in a variety of models
  • Packages with parallel I/O
  • HDF5
  • ROMIO
  • Parallel File systems
  • GPFS from IBM

70
OpenMP
  • Limitation: normal file-descriptor-based I/O will
    fight for access to the shared file pointer
  • Solution:
  • Split the file, OR use open() and mmap()
  • Operate on descriptors/segments of the file in a
    PARALLEL DO region (see the sketch below)
  • Alternatively, use a parallel file system or package
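
A C sketch of the open()/mmap() approach (the file name and the per-segment work are illustrative): every thread reads its own segment of the mapped file, so there is no shared file pointer to fight over.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "data.txt";  /* illustrative input */
        long i, lines = 0;
        struct stat sb;

        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        fstat(fd, &sb);

        /* Map the whole file read-only; every thread sees the same view. */
        char *buf = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* Each thread scans its own segment of the mapped file (here: counting lines). */
        #pragma omp parallel for reduction(+:lines)
        for (i = 0; i < (long)sb.st_size; i++)
            if (buf[i] == '\n') lines++;

        printf("%ld lines\n", lines);
        munmap(buf, sb.st_size);
        close(fd);
        return 0;
    }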

71
MPI I/O
  • Master/Slave
  • Does not rely on shared file system
  • Bottleneck at master
  • file r/w and data send/recv
  • Local Access
  • Uses faster local disk
  • Multiple copies of input/output take time to
    distribute and collect
  • Other variations or combinations

72
MPI I/O
  • MPI 2.0 Parallel I/O routines
  • Allows you to perform parallel I/O similar to the
    way messages are sent from one process to
    another.
  • Not all implementations at present implement the
    full MPI-2 I/O.
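
A minimal C sketch of the MPI-2 I/O style (the file name is illustrative): every process writes its own value at its own offset in one shared file.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double value;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        value = (double)rank;

        /* All processes open the same file collectively ... */
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* ... and each writes at its own offset, so no process waits
           on a shared file pointer. */
        MPI_File_write_at(fh, rank * sizeof(double), &value, 1,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }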

73
MPI 2.0
  • MPI 2.0 addresses
  • Dynamic process management
  • One-sided communication
  • I/O
  • Extended collective operations
  • MPI 2.0 does not address
  • Non-blocking collective operations (e.g.
    MPI_Ibcast)
  • but it does not change MPI 1.0

74
Timing
  • Use timing functions provided by language
  • C: gettimeofday
  • Fortran: cpu_time, system_clock
  • MPI provides a wall-clock timer
  • C/Fortran: MPI_WTIME
  • Uses gettimeofday
  • Wrap regions of code with timer calls

75
Example
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      if ( myid .eq. 0 ) then
         print *, 'Enter number of intervals'
         read *, n
         t1 = MPI_WTIME()
         do i = 1, numprocs-1
            call MPI_SEND(n, 1, MPI_INTEGER, i, 99, MPI_COMM_WORLD, ierr)
         enddo
      else
         call MPI_RECV(n, 1, MPI_INTEGER, 0, 99, MPI_COMM_WORLD, status, ierr)
      endif

! Calculate part of pi -- omitted for brevity

      if ( myid .ne. 0 ) then
         call MPI_SEND(mypi, 1, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, ierr)
      else
         m_pi = mypi
         do i = 1, numprocs-1
            call MPI_RECV(sum, 1, MPI_DOUBLE_PRECISION, i, 99, MPI_COMM_WORLD, status, ierr)
            m_pi = m_pi + sum
         enddo
         t2 = MPI_WTIME()
         print *, 'pi =', m_pi, ' error =', m_pi - pi, ' in ', t2 - t1, ' seconds'
      endif
      call MPI_FINALIZE(ierr)
76
Profiling
  • Use tools to characterize performance
  • Origin
  • ssrun
  • MPI: use with -mpi and mpirun
  • OpenMP: use as normal
  • mpirun
  • Use the -stats option for MPI statistics
  • Linux Cluster
  • MPE
  • gprof, vprof

77
Ssrun and MPI
  • On Origin only
  • Use with serial, OpenMP and MPI codes
  • mpirun -np 4 ssrun -mpi a.out
  • cvperf ...

        Incl.      Incl.      Incl.      Incl.
        bytes      bytes      MPI        MPI
        MPI/Sent   MPI/Recv   Send-Ops   Recv-Ops
          36         36          6          6     main
          36          0          6          0     pmpi_send_
          36         36          6          6     pi_mpi_f90b_
          36         36          6          6     __start
           0         36          0          6     pmpi_recv_

78
Ssrun and OpenMP
  • setenv OMP_NUM_THREADS 4
  • ssrun -exp usertime a.out
  • prof a.out.usertime.<id>
  • ...

    Summary of statistical callstack sampling data (usertime)--
        7      Total Samples
        0      Samples with incomplete traceback
        0.210  Accumulated Time (secs.)
       30.0    Sample interval (msecs.)
    -----------------------------------------------------------------------
    Function list, in descending order by exclusive time
    -----------------------------------------------------------------------
    index  excl.secs  excl.%   cum.%  incl.secs  incl.%  samples  procedure (dso: file, line)
      [1]      0.120   57.1%   57.1%      0.120   57.1%        4  __mpdo_MAIN__1 (pi_omp: pi_omp.f90, 22)
     [10]      0.060   28.6%   85.7%      0.060   28.6%        2  __mp_wait_for_completion (libmp.so: mp.c, 883)
      [2]      0.030   14.3%  100.0%      0.120   57.1%        4  __mp_slave_wait_for_work (libmp.so: mp_parallel_do.s, 593)
      [6]      0.000    0.0%  100.0%      0.090   42.9%        3  __start (pi_omp: crt1text.s, 103)

79
Summary
  • Keep communication to a minimum
  • Use non-blocking communication
  • Overlap communication with computation
  • MPI or OpenMP?
  • Problem dependent
  • Hybrid MPI + OpenMP

80
Lunch
  • Time for lunch

81
Parallel Programming with MPI
  • Ian Brooks
  • Consulting
  • Dave Ennis
  • (OSC)

82
Table of Contents
  • Brief History of MPI
  • MPI Program Structure
  • Message Passing
  • Point-to-Point Communications
  • Collective Communication

83
Brief History of MPI
  • What is MPI
  • MPI Forum
  • Goals and Scope of MPI
  • MPI on OSC Parallel Platforms

84
What Is MPI
  • Message Passing Interface
  • What is the message? DATA
  • Allows data to be passed between processes in a
    distributed memory environment

85
MPI Forum
  • First message-passing interface standard
  • Successor to PVM
  • Sixty people from forty different organizations
  • International representation
  • MPI 1.1 standard developed from 1992-94
  • MPI 2.0 standard developed from 1995-97
  • Standards documents:
  • http://www.mcs.anl.gov/mpi/index.html
  • http://www.mpi-forum.org/docs/docs.html
    (PostScript versions)

86
Goals and Scope of MPI
  • MPI's prime goals are:
  • To provide source-code portability
  • To allow efficient implementation
  • It also offers
  • A great deal of functionality
  • Support for heterogeneous parallel architectures
  • Acknowledgements
  • Edinburgh Parallel Computing Centre/University of
    Edinburgh for material on which this course is
    based
  • Dr. David Ennis of the Ohio Supercomputer Center
    who initially developed this course

87
MPI Program Structure
  • Handles
  • MPI Communicator
  • MPI_COMM_WORLD
  • Header files
  • MPI function format
  • Initializing MPI
  • Communicator Size
  • Process Rank
  • Exiting MPI

88
Handles
  • MPI controls its own internal data structures
  • MPI releases handles to allow programmers to
    refer to these
  • In C, handles are defined typedefs
  • In Fortran, all handles have type INTEGER

89
MPI Communicator
  • Programmer view: a group of processes that are
    allowed to communicate with each other
  • All MPI communication calls have a communicator
    argument
  • Most often you will use MPI_COMM_WORLD
  • Defined when you call MPI_Init
  • It is all of your processors...

90
MPI_COMM_WORLD Communicator
MPI_COMM_WORLD
91
Header Files
  • MPI constants and handles are defined here
  • C: #include <mpi.h>
  • Fortran: INCLUDE 'mpif.h'

92
MPI Function Format
  • C: error = MPI_Xxxxx(parameter, ...);
    MPI_Xxxxx(parameter, ...);
  • Fortran: CALL MPI_XXXXX(parameter, ..., IERROR)

93
Initializing MPI
  • Must be the first routine called (only once)
  • C: int MPI_Init(int *argc, char ***argv)
  • Fortran: CALL MPI_INIT(IERROR)
  • INTEGER IERROR

94
Communicator Size
  • How many processes are contained within a
    communicator
  • C: MPI_Comm_size(MPI_Comm comm, int *size)
  • Fortran: CALL MPI_COMM_SIZE(COMM, SIZE,
    IERROR)
  • INTEGER COMM, SIZE, IERROR

95
Process Rank
  • Process ID number within the communicator
  • Starts with zero and goes to (n-1) where n is the
    number of processes requested
  • Used to identify the source and destination of
    messages
  • C
  • MPI_Comm_rank(MPI_Comm comm, int *rank)
  • Fortran
  • CALL MPI_COMM_RANK(COMM, RANK, IERROR)
  • INTEGER COMM, RANK, IERROR

96
Exiting MPI
  • Must be called last by all processes
  • C MPI_Finalize()
  • Fortran CALL MPI_FINALIZE(IERROR)

97
Bones.c
      #include <mpi.h>
      void main(int argc, char *argv[])
      {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* ... your code here ... */
        MPI_Finalize();
      }

98
Bones.f
  • PROGRAM skeleton
  • INCLUDE 'mpif.h'
  • INTEGER ierror, rank, size
  • CALL MPI_INIT(ierror)
  • CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank,
    ierror)
  • CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
  • C your code here
  • CALL MPI_FINALIZE(ierror)
  • END

99
What's in a Message
  • Messages
  • MPI Basic Datatypes - C
  • MPI Basic Datatypes - Fortran
  • Rules and Rationale

100
Messages
  • A message contains an array of elements of some
    particular MPI datatype
  • MPI Datatypes
  • Basic types
  • Derived types
  • Derived types can be built up from basic types
  • C types are different from Fortran types

101
MPI Basic Datatypes - C
102
MPI Basic Datatypes - Fortran
103
Rules and Rationale
  • Programmer declares variables to have normal
    C/Fortran type, but uses matching MPI datatypes
    as arguments in MPI routines
  • Mechanism to handle type conversion in a
    heterogeneous collection of machines
  • General rule: the MPI datatype specified in a receive
    must match the MPI datatype specified in the send

104
Point-to-Point Communications
  • Communication Envelope
  • Received Message Count
  • Message Order Preservation
  • Sample Programs
  • Timers
  • Class Exercise Processor Ring
  • Extra Exercise 1 Ping Pong
  • Extra Exercise 2 Broadcast
  • Definitions
  • Communication Modes
  • Routine Names (blocking)
  • Sending a Message
  • Memory Mapping
  • Synchronous Send
  • Buffered Send
  • Standard Send
  • Ready Send
  • Receiving a Message
  • Wildcarding

105
Point-to-Point Communication
  • Communication between two processes
  • Source process sends message to destination
    process
  • Destination process receives the message
  • Communication takes place within a communicator
  • Destination process is identified by its rank in
    the communicator

106
Definitions
  • Completion of the communication means that
    memory locations used in the message transfer can
    be safely accessed
  • Send: the variable sent can be reused after
    completion
  • Receive: the variable received can now be used
  • MPI communication modes differ in what conditions
    are needed for completion
  • Communication modes can be blocking or
    non-blocking
  • Blocking: return from the routine implies completion
  • Non-blocking: the routine returns immediately; the user
    must test for completion

107
Communication Modes
108
Routine Names (blocking)
109
Sending a Message
  • C: int MPI_Send(void *buf, int count,
    MPI_Datatype datatype, int dest, int tag, MPI_Comm
    comm)
  • Fortran: CALL MPI_SEND(BUF, COUNT, DATATYPE,
    DEST, TAG, COMM, IERROR)
  • <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG
    INTEGER COMM, IERROR

110
Arguments
  • buf: starting address of the data to be sent
  • count: number of elements to be sent
  • datatype: MPI datatype of each element
  • dest: rank of the destination process
  • tag: message marker (set by user)
  • comm: MPI communicator of the processors involved
  • MPI_SEND(data, 500, MPI_REAL, 6, 33, MPI_COMM_WORLD,
    IERROR)

111
Memory Mapping
[Figure: how a Fortran 2-D array is stored in memory (column-major order)]
112
Synchronous Send
  • Completion criterion: completes when the message has
    been received
  • Use if you need to know that the message has been
    received
  • Sending and receiving processes synchronize
  • regardless of who is faster
  • processor idle time is probable
  • Safest communication method

113
Buffered Send
  • Completion criterion: completes when the message has been
    copied to a buffer
  • Advantage: completes immediately
  • Disadvantage: the user cannot assume there is a
    pre-allocated buffer
  • Control your own buffer space using the MPI
    routines MPI_Buffer_attach and MPI_Buffer_detach
    (see the sketch below)
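
A C sketch of a buffered send with user-controlled buffer space (assumes two processes; the payload is illustrative):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, bufsize;
        double data = 3.14, recv;
        char *buffer;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Attach buffer space large enough for one double plus MPI overhead. */
        bufsize = sizeof(double) + MPI_BSEND_OVERHEAD;
        buffer = malloc(bufsize);
        MPI_Buffer_attach(buffer, bufsize);

        if (rank == 0) {
            /* Completes as soon as the message is copied into the attached buffer. */
            MPI_Bsend(&data, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&recv, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            printf("P:1 got %f\n", recv);
        }

        /* Detach blocks until all buffered messages have been delivered. */
        MPI_Buffer_detach(&buffer, &bufsize);
        free(buffer);
        MPI_Finalize();
        return 0;
    }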

114
Standard Send
  • Completion criterion: unknown!
  • May or may not imply that the message has arrived at
    the destination
  • Don't make any assumptions (implementation
    dependent)

115
Ready Send
  • Completion criterion: completes immediately, but
    only succeeds if a matching receive has already been
    posted
  • Advantage: completes immediately
  • Disadvantage: the user must synchronize processors so
    that the receiver is ready
  • Potential for good performance

116
Receiving a Message
  • C: int MPI_Recv(void *buf, int count,
    MPI_Datatype datatype, int source, int tag,
    MPI_Comm comm, MPI_Status *status)
  • Fortran: CALL MPI_RECV(BUF, COUNT, DATATYPE,
    SOURCE, TAG, COMM, STATUS, IERROR)
  • <type> BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG
    INTEGER COMM, STATUS(MPI_STATUS_SIZE), IERROR

117
For a communication to succeed
  • Sender must specify a valid destination rank
  • Receiver must specify a valid source rank
  • The communicator must be the same
  • Tags must match
  • Receiver's buffer must be large enough

118
Wildcarding
  • Receiver can wildcard
  • To receive from any source: MPI_ANY_SOURCE
    To receive with any tag: MPI_ANY_TAG
  • The actual source and tag are returned in the
    receiver's status parameter

119
Communication Envelope
[Figure: a message envelope, like a letter — the sender's address (source), "for the attention of" (destination), and the data items inside]
120
Communication Envelope Information
  • Envelope information is returned from MPI_RECV as
    status
  • Information includes:
  • Source: status.MPI_SOURCE or status(MPI_SOURCE)
  • Tag: status.MPI_TAG or status(MPI_TAG)
  • Count: MPI_Get_count or MPI_GET_COUNT

121
Received Message Count
  • Message received may not fill receive buffer
  • count is number of elements actually received
  • C: int MPI_Get_count(MPI_Status *status,
    MPI_Datatype datatype, int *count)
  • Fortran: CALL MPI_GET_COUNT(STATUS, DATATYPE, COUNT,
    IERROR)
  • INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE
    INTEGER COUNT, IERROR

122
Message Order Preservation
communicator
  • Messages do not overtake each other
  • Example: process 0 sends two messages;
    process 2 posts two receives that match either
    message; order is preserved

123
Sample Program 1 - C
      #include <stdio.h>
      #include <stdlib.h>
      #include <mpi.h>
      /* Run with two processes */
      void main(int argc, char *argv[])
      {
        int rank, i, count;
        float data[100], value[200];
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 1) {
          for (i = 0; i < 100; i++) data[i] = i;
          MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
        } else {
          MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
                   MPI_COMM_WORLD, &status);
          printf("P:%d Got data from processor %d\n", rank, status.MPI_SOURCE);
          MPI_Get_count(&status, MPI_FLOAT, &count);
          printf("P:%d Got %d elements\n", rank, count);
          printf("P:%d value=%f\n", rank, value[55]);
        }
        MPI_Finalize();
      }

Program Output:
P: 0 Got data from processor 1
P: 0 Got 100 elements
P: 0 value=55.000000
124
Sample Program 1 - Fortran
      PROGRAM p2p
C     Run with two processes
      INCLUDE 'mpif.h'
      INTEGER err, rank, size
      real data(100)
      real value(200)
      integer status(MPI_STATUS_SIZE)
      integer count
      CALL MPI_INIT(err)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, err)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, err)
      if (rank.eq.1) then
         data = 3.0
         call MPI_SEND(data, 100, MPI_REAL, 0, 55,
     &                 MPI_COMM_WORLD, err)
      else
         call MPI_RECV(value, 200, MPI_REAL, MPI_ANY_SOURCE, 55,
     &                 MPI_COMM_WORLD, status, err)
         print *, "P:", rank, " got data from processor ",
     &            status(MPI_SOURCE)
         call MPI_GET_COUNT(status, MPI_REAL, count, err)
         print *, "P:", rank, " got ", count, " elements"
         print *, "P:", rank, " value(5)=", value(5)
      end if
      CALL MPI_FINALIZE(err)
      END

Program Output:
P: 0 Got data from processor 1
P: 0 Got 100 elements
P: 0 value(5)=3.
125
Collective Communication
  • Collective Communication
  • Barrier Synchronization
  • Broadcast
  • Scatter
  • Gather
  • Gather/Scatter Variations
  • Summary Illustration
  • Global Reduction Operations
  • Predefined Reduction Operations

  • MPI_Reduce
  • Minloc and Maxloc
  • User-defined Reduction Operators
  • Reduction Operator Functions
  • Registering a User-defined Reduction Operator
  • Variants of MPI_Reduce
  • Class Exercise: Last Ring (includes sample C and Fortran
    programs)
126
Collective Communication
  • Communications involving a group of processes
  • Called by all processes in a communicator
  • Examples
  • Broadcast, scatter, gather (Data Distribution)
  • Global sum, global maximum, etc. (Collective
    Operations)
  • Barrier synchronization

127
Characteristics of Collective Communication
  • Collective communication will not interfere with
    point-to-point communication and vice-versa
  • All processes must call the collective routine
  • Synchronization not guaranteed (except for
    barrier)
  • No non-blocking collective communication
  • No tags
  • Receive buffers must be exactly the right size

128
Barrier Synchronization
  • "Red light" for each processor: turns green when
    all processors have arrived
  • Slower than hardware barriers (example: SGI/Cray
    T3E)
  • C
  • int MPI_Barrier (MPI_Comm comm)
  • Fortran
  • CALL MPI_BARRIER (COMM,IERROR)
  • INTEGER COMM,IERROR

129
Broadcast
  • One-to-all communication: the same data is sent from the
    root process to all the others in the communicator
  • C: int MPI_Bcast(void *buffer, int count,
    MPI_Datatype datatype, int root, MPI_Comm comm)
  • Fortran: CALL MPI_BCAST(BUFFER, COUNT, DATATYPE,
    ROOT, COMM, IERROR)
  • <type> BUFFER(*)
    INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR
  • All processes must specify the same root rank and
    communicator

130
Sample Program 5 - C
      #include <mpi.h>
      #include <stdio.h>
      void main(int argc, char *argv[])
      {
        int rank;
        double param;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 5) param = 23.0;
        MPI_Bcast(&param, 1, MPI_DOUBLE, 5, MPI_COMM_WORLD);
        printf("P:%d after broadcast parameter is %f\n", rank, param);
        MPI_Finalize();
      }

Program Output:
P:0 after broadcast parameter is 23.000000
P:6 after broadcast parameter is 23.000000
P:5 after broadcast parameter is 23.000000
P:2 after broadcast parameter is 23.000000
P:3 after broadcast parameter is 23.000000
P:7 after broadcast parameter is 23.000000
P:1 after broadcast parameter is 23.000000
P:4 after broadcast parameter is 23.000000
131
Sample Program 5 - Fortran
      PROGRAM broadcast
      INCLUDE 'mpif.h'
      INTEGER err, rank, size
      real param
      CALL MPI_INIT(err)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, err)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, err)
      if (rank.eq.5) param = 23.0
      call MPI_BCAST(param, 1, MPI_REAL, 5, MPI_COMM_WORLD, err)
      print *, "P:", rank, " after broadcast param is ", param
      CALL MPI_FINALIZE(err)
      END

Program Output:
P: 1 after broadcast parameter is 23.
P: 3 after broadcast parameter is 23.
P: 4 after broadcast parameter is 23.
P: 0 after broadcast parameter is 23.
P: 5 after broadcast parameter is 23.
P: 6 after broadcast parameter is 23.
P: 7 after broadcast parameter is 23.
P: 2 after broadcast parameter is 23.
132
Scatter
  • One-to-all communication: different data sent to
    each process in the communicator (in rank order)
  • C: int MPI_Scatter(void *sendbuf, int sendcount,
    MPI_Datatype sendtype, void *recvbuf, int
    recvcount, MPI_Datatype recvtype, int root,
    MPI_Comm comm)
  • Fortran: CALL MPI_SCATTER(SENDBUF, SENDCOUNT,
    SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT,
    COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
  • sendcount is the number of elements sent to each
    process, not the total number sent
  • Send arguments are significant only at the root
    process

133
Scatter Example
134
Sample Program 6 - C
      #include <mpi.h>
      #include <stdio.h>
      void main(int argc, char *argv[])
      {
        int rank, size, i;
        double param[4], mine;
        int sndcnt, revcnt;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        revcnt = 1;
        if (rank == 3) {
          for (i = 0; i < 4; i++) param[i] = 23.0 + i;
          sndcnt = 1;
        }
        MPI_Scatter(param, sndcnt, MPI_DOUBLE, &mine, revcnt,
                    MPI_DOUBLE, 3, MPI_COMM_WORLD);
        printf("P:%d mine is %f\n", rank, mine);
        MPI_Finalize();
      }

Program Output:
P:0 mine is 23.000000
P:1 mine is 24.000000
P:2 mine is 25.000000
P:3 mine is 26.000000
135
Sample Program 6 - Fortran
      PROGRAM scatter
      INCLUDE 'mpif.h'
      INTEGER err, rank, size
      real param(4), mine
      integer sndcnt, rcvcnt
      CALL MPI_INIT(err)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, err)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, err)
      rcvcnt = 1
      if (rank.eq.3) then
         do i = 1, 4
            param(i) = 23.0 + i
         end do
         sndcnt = 1
      end if
      call MPI_SCATTER(param, sndcnt, MPI_REAL, mine, rcvcnt, MPI_REAL,
     &                 3, MPI_COMM_WORLD, err)
      print *, "P:", rank, " mine is ", mine
      CALL MPI_FINALIZE(err)

Program Output:
P: 1 mine is 25.
P: 3 mine is 27.
P: 0 mine is 24.
P: 2 mine is 26.
136
Gather
  • All-to-one communication: different data are
    collected by the root process
  • Collection is done in rank order
  • MPI_GATHER / MPI_Gather have the same arguments as the
    matching scatter routines
  • Receive arguments are only meaningful at the root
    process (see the sketch below)
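
A minimal C sketch mirroring the scatter sample above (root 3, one double per process; the limits of at least 4 and at most 64 processes are assumptions of this sketch):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, i;
        double mine, collected[64];        /* assumes at most 64 processes */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        mine = 23.0 + rank;                /* each process contributes one value */
        /* Receive arguments (collected, recvcount) matter only at the root, rank 3. */
        MPI_Gather(&mine, 1, MPI_DOUBLE, collected, 1, MPI_DOUBLE, 3, MPI_COMM_WORLD);

        if (rank == 3)
            for (i = 0; i < size; i++)
                printf("P:3 got %f from rank %d\n", collected[i], i);

        MPI_Finalize();
        return 0;
    }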

137
Gather Example
138
Gather/Scatter Variations
  • MPI_Allgather
  • MPI_Alltoall
  • No root process specified: all processes get the
    gathered or scattered data
  • Send and receive arguments are significant for all
    processes

139
Summary
140
Global Reduction Operations
  • Used to compute a result involving data
    distributed over a group of processes
  • Examples
  • Global sum or product
  • Global maximum or minimum
  • Global user-defined operation

141
Example of Global Reduction
  • Sum of all the x values is placed in result only
    on processor 0
  • C:
  • MPI_Reduce(&x, &result, 1, MPI_INT, MPI_SUM, 0,
    MPI_COMM_WORLD);
  • Fortran:
  • CALL MPI_REDUCE(x, result, 1, MPI_INTEGER, MPI_SUM,
    0, MPI_COMM_WORLD, IERROR)

142
Predefined Reduction Operations
143
General Form
  • count is the number of consecutive elements of
    sendbuf the operation is applied to (it is also the size
    of recvbuf)
  • op is an associative operator that takes two
    operands of type datatype and returns a result of
    the same type
  • C:
  • int MPI_Reduce(void *sendbuf, void *recvbuf, int
    count, MPI_Datatype datatype, MPI_Op op, int
    root, MPI_Comm comm)
  • Fortran:
  • CALL MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP,
    ROOT, COMM, IERROR)
  • <type> SENDBUF(*), RECVBUF(*)

144
MPI_Reduce
145
Minloc and Maxloc
  • Designed to compute a global minimum / maximum
    and index associated with the extreme value
  • Common application index is the processor rank
    (see sample program)
  • If more than one extreme, get the first
  • Designed to work on operands that consist of a
    value and index pair
  • MPI_Datatypes include
  • C
  • MPI_FLOAT_INT, MPI_DOUBLE_INT, MPI_LONG_INT,
    MPI_2INT, MPI_SHORT_INT, MPI_LONG_DOUBLE_INT
  • Fortran
  • MPI_2REAL, MPI_2DOUBLEPRECISION, MPI_2INTEGER

146
Sample Program 7 - C
      #include <mpi.h>
      #include <stdio.h>
      /* Run with 16 processes */
      void main(int argc, char *argv[])
      {
        int rank;
        struct {
          double value;
          int rank;
        } in, out;
        int root;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        in.value = rank + 1;
        in.rank = rank;
        root = 7;
        MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC,
                   root, MPI_COMM_WORLD);
        if (rank == root)
          printf("P:%d max=%f at rank %d\n", rank, out.value, out.rank);
        MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC,
                   root, MPI_COMM_WORLD);
        if (rank == root)
          printf("P:%d min=%f at rank %d\n", rank, out.value, out.rank);
        MPI_Finalize();
      }

Program Output:
P: 7 max=16.000000 at rank 15
P: 7 min=1.000000 at rank 0
147
Sample Program 7 - Fortran
      PROGRAM MaxMin
C
C     Run with 8 processes
C
      INCLUDE 'mpif.h'
      INTEGER err, rank, size
      integer in(2), out(2)
      CALL MPI_INIT(err)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, err)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, err)
      in(1) = rank + 1
      in(2) = rank
      call MPI_REDUCE(in, out, 1, MPI_2INTEGER, MPI_MAXLOC, 7,
     &                MPI_COMM_WORLD, err)
      if (rank.eq.7) print *, "P:", rank,
     &   " max=", out(1), " at rank ", out(2)
      call MPI_REDUCE(in, out, 1, MPI_2INTEGER, MPI_MINLOC, 2,
     &                MPI_COMM_WORLD, err)
      if (rank.eq.2) print *, "P:", rank,
     &   " min=", out(1), " at rank ", out(2)
      CALL MPI_FINALIZE(err)
      END

Program Output:
P: 2 min=1 at rank 0
P: 7 max=8 at rank 7