1
MPI TIPS AND TRICKS
  • Dr. David Cronk
  • Innovative Computing Lab
  • University of Tennessee

2
Course Outline
  • Day 1
  • Morning - Lecture
  • Point-to-point communication modes
  • Collective operations
  • Derived datatypes
  • Afternoon - Lab
  • Hands on exercises demonstrating collective
    operations and derived datatypes

3
Course Outline (cont)
  • Day 2
  • Morning - Lecture
  • Finish lecture on derived datatypes
  • MPI analysis using VAMPIR
  • Performance analysis and tuning
  • Afternoon - Lab
  • Finish Day 1 exercises
  • VAMPIR demo

4
Course Outline (cont)
  • Day 3
  • Morning - Lecture
  • MPI-I/O
  • Afternoon - Lab
  • MPI-I/O exercises

5
Point-to-Point Communication Modes
  • Standard Mode
  • blocking
  • MPI_SEND (buf, count, datatype, dest, tag, comm)
  • MPI_RECV (buf, count, datatype, source, tag,
    comm, status)
  • Generally use blocking calls ONLY if you cannot
    post the call earlier AND there is no other work
    that can be done!
  • The standard ONLY states that buffers can be
    reused once the calls return. When a blocking
    call returns is implementation dependent.
  • Blocking sends MAY block until a matching receive
    is posted. This is not required behavior, but
    the standard does not prohibit this behavior
    either. Further, a blocking send may have to
    wait for system resources such as system managed
    message buffers.
  • Be VERY careful of deadlock when using blocking
    calls!

6
Point-to-Point Communication Modes (cont)
  • Standard Mode
  • Non-blocking (immediate) sends/receives
  • MPI_ISEND (buf, count, datatype, dest, tag, comm,
    request)
  • MPI_IRECV (buf, count, datatype, source, tag,
    comm, request)
  • MPI_WAIT (request, status)
  • MPI_TEST (request, flag, status)
  • Allows communication calls to be posted early,
    which may improve performance.
  • Overlap computation and communication
  • Latency tolerance
  • Less (or no) buffering
  • MUST either complete these calls (with wait or
    test) or call MPI_REQUEST_FREE
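A minimal C sketch of this pattern (neighbor ranks, buffer names, and message sizes
are illustrative): the receive is posted early, independent work is done, and both
requests are completed before the buffers are reused.

  #include <mpi.h>

  void exchange (int left, int right, double *sendbuf, double *recvbuf,
                 int n, MPI_Comm comm)
  {
      MPI_Request reqs[2];

      /* Post the communication as early as possible */
      MPI_Irecv (recvbuf, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
      MPI_Isend (sendbuf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

      /* ... do work that does not touch sendbuf or recvbuf ... */

      /* Complete both requests before the buffers are reused */
      MPI_Waitall (2, reqs, MPI_STATUSES_IGNORE);
  }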

7
Point-to-Point Communication Modes (cont)
  • Non-standard mode communication
  • Only used by the sender! (MPI uses the push
    communication model)
  • Buffered mode - A buffer must be provided by the
    application
  • Synchronous mode - Completes only after a
    matching receive has been posted
  • Ready mode - May only be called when a matching
    receive has already been posted

8
Point-to-Point Communication Modes Buffered
  • MPI_BSEND (buf, count, datatype, dest, tag, comm)
  • MPI_IBSEND (buf, count, dtype, dest, tag, comm,
    req)
  • MPI_BUFFER_ATTACH (buff, size)
  • MPI_BUFFER_DETACH (buff, size)
  • Buffered sends do not rely on system buffers
  • The user supplies a buffer that MUST be large
    enough for all messages
  • User need not worry about calls blocking, waiting
    for system buffer space
  • The buffer is managed by MPI
  • The user MUST ensure there is no buffer overflow

9
Buffered Sends
(Slide figure: three buffered-send code variants, labeled "Seg violation",
"Buffer overflow", and "Safe", contrasting incorrect and correct buffer handling.)
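A minimal C sketch of the safe pattern, using the BUFFSIZE, b1, and b2 names from the
slide; the attached buffer must also cover MPI_BSEND_OVERHEAD for each outstanding
message.

  #include <mpi.h>
  #include <stdlib.h>

  #define BUFFSIZE 1000   /* illustrative; must hold all outstanding messages */

  void buffered_example (int dest, int tag, MPI_Comm comm)
  {
      char b1[500], b2[500];
      int size = BUFFSIZE + 2 * MPI_BSEND_OVERHEAD;
      char *buff = malloc (size);

      MPI_Buffer_attach (buff, size);       /* MPI now manages this buffer */
      MPI_Bsend (b1, 500, MPI_CHAR, dest, tag, comm);
      MPI_Bsend (b2, 500, MPI_CHAR, dest, tag, comm);
      MPI_Buffer_detach (&buff, &size);     /* blocks until buffered sends are done */
      free (buff);
  }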
10
Point-to-Point Communication Modes Synchronous
  • MPI_SSEND (buf, count, datatype, dest, tag, comm)
  • MPI_ISSEND (buf, count, dtype, dest, tag, comm,
    req)
  • Can be started (called) at any time.
  • Does not complete until a matching receive has
    been posted and the receive operation has been
    started
  • Does NOT mean the matching receive has
    completed
  • Can be used in place of sending and receiving
    acknowledgements
  • Can be more efficient when used appropriately
  • buffering may be avoided

11
Point-to-Point Communication Modes Ready Mode
  • MPI_RSEND (buf, count, datatype, dest, tag, comm)
  • MPI_IRSEND (buf, count, dtype, dest, tag, comm,
    req)
  • May ONLY be started (called) if a matching
    receive has already been posted.
  • If a matching receive has not been posted, the
    results are undefined
  • May be most efficient when appropriate
  • Removal of handshake operation
  • Should only be used with extreme caution

12
Ready Mode
(Slide figure: two code fragments labeled "UNSAFE" and "SAFE", contrasting a ready-mode
send issued before the matching receive is posted with one issued after.)
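A minimal C sketch of one safe ready-mode pattern (ranks and sizes are illustrative):
a barrier guarantees the receive is posted before MPI_Rsend is called.

  #include <mpi.h>

  void ready_mode (int rank, double *buf, int n, MPI_Comm comm)
  {
      MPI_Request req;
      MPI_Status  status;

      if (rank == 1) {
          MPI_Irecv (buf, n, MPI_DOUBLE, 0, 0, comm, &req);  /* receive posted first */
          MPI_Barrier (comm);                    /* tells rank 0 it is safe to send */
          MPI_Wait (&req, &status);
      } else if (rank == 0) {
          MPI_Barrier (comm);                    /* wait until the receive is posted */
          MPI_Rsend (buf, n, MPI_DOUBLE, 1, 0, comm);
      } else {
          MPI_Barrier (comm);                    /* other ranks just match the barrier */
      }
  }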
13
Point-to-Point Communication Modes Performance
Issues
  • Non-blocking calls are almost always the way to
    go
  • Communication can be carried out during blocking
    system calls
  • Computation and communication can be overlapped
    if there is special purpose communication
    hardware
  • Less likely to have errors that lead to deadlock
  • Standard mode is usually sufficient - but
    buffered mode can offer advantages
  • Particularly if there are frequent, large
    messages being sent
  • If the user is unsure the system provides
    sufficient buffer space
  • Synchronous mode can be more efficient if acks
    are needed
  • Also tells the system that buffering is not
    required

14
Collective Communication
  • Amount of data sent must exactly match the amount
    of data received
  • Collective routines are collective across an
    entire communicator and must be called in the
    same order from all processors within the
    communicator
  • Collective routines are all blocking
  • This simply means buffers can be re-used upon
    return
  • Collective routines return as soon as the calling
    process participation is complete
  • Does not say anything about the other processors
  • Collective routines may or may not be
    synchronizing
  • No mixing of collective and point-to-point
    communication

15
Collective Communication
  • Barrier MPI_BARRIER (comm)
  • Only collective routine which provides explicit
    synchronization
  • Returns at any processor only after all processes
    have entered the call

16
Collective Communication
  • Collective Communication Routines
  • Except broadcast, each routine has 2 variants
  • Standard variant: all messages are the same size
  • Vector variant: each item is a vector of possibly
    varying length
  • If there is a single origin or destination, it is
    referred to as the root
  • Each routine (except broadcast) has distinct send
    and receive arguments
  • Send and receive buffers must be disjoint
  • Each can use MPI_IN_PLACE, which allows the user
    to specify that data contributed by the caller is
    already in its final location.
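A minimal C sketch of MPI_IN_PLACE at the root of a gather (buffer names and sizes are
illustrative); the root's own contribution is taken from its slot of the receive buffer.

  #include <mpi.h>
  #include <stddef.h>

  /* res at the root holds nprocs*n ints, with the root's own n values
     already stored at res + root*n; non-roots hold their n values in res */
  void gather_in_place (int rank, int root, int *res, int n, MPI_Comm comm)
  {
      if (rank == root)
          MPI_Gather (MPI_IN_PLACE, n, MPI_INT, res, n, MPI_INT, root, comm);
      else
          MPI_Gather (res, n, MPI_INT, NULL, 0, MPI_INT, root, comm);
  }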

17
Collective Communication Bcast
  • MPI_BCAST (buffer, count, datatype, root, comm)
  • Strictly in place
  • MPI-1 insists on using an intra-communicator
  • MPI-2 allows use of an inter-communicator
  • REMEMBER: A broadcast need not be synchronizing.
    Returning from a broadcast tells you nothing
    about the status of the other processes involved
    in the broadcast. Furthermore, though MPI does not
    require MPI_BCAST to be synchronizing, it does not
    prohibit synchronous behavior either.

18
BCAST
(Slide figure: two MPI_BCAST code examples, labeled "OOPS!" and "THAT'S BETTER".)
19
Collective Communication Gather
  • MPI_GATHER (sendbuf, sendcount, sendtype,
    recvbuf, recvcount, recvtype, root, comm)
  • Receive arguments are only meaningful at the root
  • Each processor must send the same amount of data
  • Root can use MPI_IN_PLACE for sendbuf
  • data is assumed to be in the correct place in the
    recvbuf

MPI_GATHER
20
MPI_Gather
(Slide figure: an MPI_Gather example using int tmp[20] and int res[320], with the
two variants shown labeled "WORKS" and "A OK".)
21
Collective Communication Gatherv
  • MPI_GATHERV (sendbuf, sendcount, sendtype,
    recvbuf, recvcounts, displs, recvtype, root,
    comm)
  • Vector variant of MPI_GATHER
  • Allows a varying amount of data from each proc
  • allows root to specify where data from each proc
    goes
  • No portion of the receive buffer may be written
    more than once
  • MPI_IN_PLACE may be used by root.

22
Collective Communication Gatherv (cont)
MPI_GATHERV
23
Collective Communication Gatherv (cont)
stride = 105; root = 0;
for (i = 0; i < nprocs; i++) {
  displs[i] = i * stride;
  counts[i] = 100;
}
MPI_Gatherv (sbuff, 100, MPI_INT, rbuff, counts, displs,
             MPI_INT, root, MPI_COMM_WORLD);
24
Collective Communication Scatter
  • MPI_SCATTER (sendbuf, sendcount, sendtype,
    recvbuf, recvcount, recvtype, root, comm)
  • Opposite of MPI_GATHER
  • Send arguments only meaningful at root
  • Root can use MPI_IN_PLACE for recvbuf

MPI_SCATTER
25
MPI_SCATTER
IF (MYPE .EQ. ROOT) THEN
  OPEN (25, FILE=filename)
  READ (25, *) nprocs, nboxes
  READ (25, *) ((mat(i,j), i=1,nboxes), j=1,nprocs)
  CLOSE (25)
ENDIF
CALL MPI_BCAST (nboxes, 1, MPI_INTEGER, ROOT, MPI_COMM_WORLD, ierr)
CALL MPI_SCATTER (mat, nboxes, MPI_INTEGER, lboxes, nboxes, MPI_INTEGER,
                  ROOT, MPI_COMM_WORLD, ierr)
26
Collective Communication Scatterv
  • MPI_SCATTERV (sendbuf, scounts, displs, sendtype,
    recvbuf, recvcount, recvtype, root, comm)
  • Opposite of MPI_GATHERV
  • Send arguments only meaningful at root
  • Root can use MPI_IN_PLACE for recvbuf
  • No location of the sendbuf can be read more than
    once

27
Collective Communication Scatterv (cont)
MPI_SCATTERV
28
MPI_SCATTERV
C mnb = max number of boxes
IF (MYPE .EQ. ROOT) THEN
  OPEN (25, FILE=filename)
  READ (25, *) nprocs
  READ (25, *) (nboxes(I), I=1,nprocs)
  READ (25, *) ((mat(I,J), I=1,nboxes(J)), J=1,nprocs)
  CLOSE (25)
  DO I = 1,nprocs
    displs(I) = (I-1)*mnb
  ENDDO
ENDIF
CALL MPI_SCATTER (nboxes, 1, MPI_INTEGER, nb, 1, MPI_INTEGER, ROOT,
                  MPI_COMM_WORLD, ierr)
CALL MPI_SCATTERV (mat, nboxes, displs, MPI_INTEGER, lboxes, nb,
                   MPI_INTEGER, ROOT, MPI_COMM_WORLD, ierr)
29
Collective Communication Allgather
  • MPI_ALLGATHER (sendbuf, sendcount, sendtype,
    recvbuf, recvcount, recvtype, comm)
  • Same as MPI_GATHER, except all processors get the
    result
  • MPI_IN_PLACE may be used for sendbuf of all
    processors
  • Equivalent to a gather followed by a bcast

30
Collective Communication Allgatherv
  • MPI_ALLGATHERV (sendbuf, sendcount, sendtype,
    recvbuf, recvcounts, displs, recvtype, comm)
  • Same as MPI_GATHERV, except all processors get
    the result
  • MPI_IN_PLACE may be used for sendbuf of all
    processors
  • Equivalent to a gatherv followed by a bcast

31
Collective Communication Alltoall
(scatter/gather)
  • MPI_ALLTOALL (sendbuf, sendcount, sendtype,
    recvbuf, recvcount, recvtype, comm)

32
Collective Communication Alltoallv
  • MPI_ALLTOALLV (sendbuf, sendcounts, sdispls,
    sendtype, recvbuf, recvcounts, rdispls, recvtype,
    comm)
  • Same as MPI_ALLTOALL, but the vector variant
  • Can specify how many blocks to send to each
    processor, location of blocks to send, how many
    blocks to receive from each processor, and where
    to place the received blocks

33
Collective Communication Alltoallw
  • MPI_ALLTOALLW (sendbuf, sendcounts, sdispls,
    sendtypes, recvbuf, recvcounts, rdispls,
    recvtypes, comm)
  • Same as MPI_ALLTOALLV, except different datatypes
    can be specified for data scattered as well as
    data gathered
  • Can specify how many blocks to send to each
    processor, location of blocks to send, how many
    blocks to receive from each processor, and where
    to place the received blocks
  • Displacements are now in terms of bytes rather
    than in units of the datatypes

34
Collective Communication Reduction
  • Global reduction across all members of a group
  • Can use predefined operations or user-defined
    operations
  • Can be used on single elements or arrays of
    elements
  • Counts and types must be the same on all
    processors
  • Operations are assumed to be associative
  • User-defined operations can be different on each
    processor, but this is not recommended

35
Collective Communication Reduction (reduce)
  • MPI_REDUCE (sendbuf, recvbuf, count, datatype,
    op, root, comm)
  • recvbuf only meaningful on root
  • Combines elements (on an element by element
    basis) in sendbuf according to op
  • Results of the reduction are returned to root in
    recvbuf
  • MPI_IN_PLACE can be used for sendbuf on root

36
MPI_REDUCE
REAL a(n), b(n,m), c(m)
REAL sum(m)
DO j = 1,m
  sum(j) = 0.0
  DO i = 1,n
    sum(j) = sum(j) + a(i)*b(i,j)
  ENDDO
ENDDO
CALL MPI_REDUCE (sum, c, m, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
37
Collective Communication Reduction (cont)
  • MPI_ALLREDUCE (sendbuf, recvbuf, count, datatype,
    op, comm)
  • Same as MPI_REDUCE, except all processors get the
    result
  • MPI_REDUCE_SCATTER (sendbuf, recv_buff,
    recvcounts, datatype, op, comm)
  • Acts like it does a reduce followed by a scatterv

38
MPI_REDUCE_SCATTER
DO j = 1,nprocs
  counts(j) = n
ENDDO
DO j = 1,m
  sum(j) = 0.0
  DO i = 1,n
    sum(j) = sum(j) + a(i)*b(i,j)
  ENDDO
ENDDO
CALL MPI_REDUCE_SCATTER (sum, c, counts, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)
39
Collective Communication Prefix Reduction
  • MPI_SCAN (sendbuf, recvbuf, count, datatype, op,
    comm)
  • Performs an inclusive element-wise prefix
    reduction
  • MPI_EXSCAN (sendbuf, recvbuf, count, datatype,
    op, comm)
  • Performs an exclusive prefix reduction
  • Results are undefined at process 0

40
MPI_SCAN
MPI_SCAN (sbuf, rbuf, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD)
MPI_EXSCAN (sbuf, rbuf, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD)
For example, if each process contributes its rank in sbuf, rank 2 receives
0+1+2 = 3 from MPI_SCAN but only 0+1 = 1 from MPI_EXSCAN.
41
Collective Communication Reduction - user
defined ops
  • MPI_OP_CREATE (function, commute, op)
  • if commute is true, operation is assumed to be
    commutative
  • Function is a user defined function with 4
    arguments
  • invec - input vector
  • inoutvec - input and output vector
  • len - number of elements
  • datatype - the MPI datatype of the elements
  • Computes inoutvec[i] = invec[i] op inoutvec[i], for i = 0..len-1
  • MPI_OP_FREE (op)
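A minimal C sketch of a user-defined operation (an element-wise integer maximum,
chosen purely for illustration):

  #include <mpi.h>

  /* Computes inoutvec[i] = max(invec[i], inoutvec[i]) for i = 0..len-1 */
  void int_max (void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
  {
      int *in = (int *) invec, *inout = (int *) inoutvec;
      for (int i = 0; i < *len; i++)
          if (in[i] > inout[i]) inout[i] = in[i];
  }

  void use_op (int *vals, int *result, int n, MPI_Comm comm)
  {
      MPI_Op myop;
      MPI_Op_create (int_max, 1, &myop);      /* 1 = commutative */
      MPI_Allreduce (vals, result, n, MPI_INT, myop, comm);
      MPI_Op_free (&myop);
  }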

42
Collective Communication Performance Issues
  • Collective operations should have much better
    performance than simply sending messages directly
  • Broadcast may make use of a broadcast tree (or
    other mechanism)
  • All collective operations can potentially make
    use of a tree (or other) mechanism to improve
    performance
  • Important to use the simplest collective
    operations which still achieve the needed results
  • Use MPI_IN_PLACE whenever appropriate
  • Reduces unnecessary memory usage and redundant
    data movement

43
Derived Datatypes
  • A derived datatype is a sequence of primitive
    datatypes and displacements
  • Derived datatypes are created by building on
    primitive datatypes
  • A derived datatype's typemap is the sequence of
    (primitive type, disp) pairs that defines the
    derived datatype
  • These displacements need not be positive, unique,
    or increasing.
  • A datatype's type signature is just the sequence
    of primitive datatypes
  • A message's type signature is the type signature
    of the datatype being sent, repeated count times

44
Derived Datatypes (cont)
Typemap: {(MPI_INT, 0), (MPI_INT, 12), (MPI_INT, 16), (MPI_INT, 20), (MPI_INT, 36)}
Type signature: {MPI_INT, MPI_INT, MPI_INT, MPI_INT, MPI_INT}
In collective communication, the type signature of data sent must match the
type signature of data received!
45
Derived Datatypes (cont)
  • Lower bound: the lowest displacement of an entry
    of this datatype
  • Upper bound: relative address of the last byte
    occupied by entries of this datatype, rounded up
    to satisfy alignment requirements
  • Extent: the span from lower to upper bound
  • MPI_TYPE_GET_EXTENT (datatype, lb, extent)
  • MPI_TYPE_SIZE (datatype, size)
  • MPI_GET_ADDRESS (location, address)

46
Datatype Constructors
  • MPI_TYPE_DUP (oldtype, newtype)
  • Simply duplicates an existing type
  • Not useful to regular users
  • MPI_TYPE_CONTIGUOUS (count, oldtype, newtype)
  • Creates a new type representing count contiguous
    occurrences of oldtype
  • ex MPI_TYPE_CONTIGUOUS (2, MPI_INT, 2INT)
  • creates a new datatype 2INT which represents an
    array of 2 integers

47
CONTIGUOUS DATATYPE
P1 sends 100 integers to P2
P1:  int buff[100];
     MPI_Datatype dtype;
     ...
     MPI_Type_contiguous (100, MPI_INT, &dtype);
     MPI_Type_commit (&dtype);
     MPI_Send (buff, 1, dtype, 2, tag, MPI_COMM_WORLD);

P2:  int buff[100];
     MPI_Recv (buff, 100, MPI_INT, 1, tag, MPI_COMM_WORLD, &status);
48
Datatype Constructors (cont)
  • MPI_TYPE_VECTOR (count, blocklength, stride,
    oldtype, newtype)
  • Creates a datatype representing count regularly
    spaced occurrences of blocklength contiguous
    oldtypes
  • stride is in terms of elements of oldtype
  • ex MPI_TYPE_VECTOR (4, 2, 3, 2INT, AINT)

49
Datatype Constructors (cont)
  • MPI_TYPE_HVECTOR (count, blocklength, stride,
    oldtype, newtype)
  • Identical to MPI_TYPE_VECTOR, except stride is
    given in bytes rather than elements.
  • ex MPI_TYPE_HVECTOR (4, 2, 20, 2INT, BINT)

50
EXAMPLE
  • REAL a(100,100), b(100,100)
  • CALL MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierr)
  • CALL MPI_TYPE_SIZE (MPI_REAL, sizeofreal, ierr)
  • CALL MPI_TYPE_VECTOR (100, 1, 100, MPI_REAL,
    rowtype, ierr)
  • CALL MPI_TYPE_CREATE_HVECTOR (100, 1, sizeofreal,
    rowtype, xpose, ierr)
  • CALL MPI_TYPE_COMMIT (xpose, ierr)
  • CALL MPI_SENDRECV (a, 1, xpose, myrank, 0, b,
    100*100, MPI_REAL, myrank, 0, MPI_COMM_WORLD,
    status, ierr)

51
Datatype Constructors (cont)
  • MPI_TYPE_INDEXED (count, blocklengths, displs,
    oldtype, newtype)
  • Allows specification of non-contiguous data
    layout
  • Good for irregular problems
  • ex MPI_TYPE_INDEXED (3, lengths, displs, 2INT,
    CINT)
  • lengths = (2, 4, 3), displs = (0, 3, 8)
  • Most often, block sizes are all the same
    (typically 1)
  • MPI-2 introduced a new constructor

52
Datatype Constructors (cont)
  • MPI_TYPE_CREATE_INDEXED_BLOCK (count,
    blocklength, displs, oldtype, newtype)
  • Same as MPI_TYPE_INDEXED, except all blocks are
    the same length (blocklength)
  • ex MPI_TYPE_CREATE_INDEXED_BLOCK (7, 1, displs,
    MPI_INT, DINT)
  • displs = (1, 3, 4, 6, 9, 13, 14)

53
Datatype Constructors (cont)
  • MPI_TYPE_CREATE_HINDEXED (count, blocklengths,
    displs, oldtype, newtype)
  • Identical to MPI_TYPE_INDEXED except
    displacements are in bytes rather than elements
  • MPI_TYPE_CREATE_STRUCT (count, lengths, displs,
    types, newtype)
  • Used mainly for sending arrays of structures
  • count is number of fields in the structure
  • lengths is number of elements in each field
  • displs should be calculated (portability)

54
MPI_TYPE_CREATE_STRUCT
struct s1 { char class; double d[6]; char b[7]; };
struct s1 sarray[100];
(Slide figure: two ways of building a datatype for this struct, labeled
"Non-portable" and "Semi-portable".)
55
MPI_TYPE_CREATE_STRUCT
int i;
char c[100];
float f[3];
int a;
MPI_Aint disp[4];
int lens[4] = {1, 100, 3, 1};
MPI_Datatype types[4] = {MPI_INT, MPI_CHAR, MPI_FLOAT, MPI_INT};
MPI_Datatype stype;

MPI_Get_address (&i, &disp[0]);
MPI_Get_address (c, &disp[1]);
MPI_Get_address (f, &disp[2]);
MPI_Get_address (&a, &disp[3]);
MPI_Type_create_struct (4, lens, disp, types, &stype);
MPI_Type_commit (&stype);
MPI_Send (MPI_BOTTOM, 1, stype, ...);
56
Derived Datatypes (cont)
  • MPI_TYPE_CREATE_RESIZED (oldtype, lb, extent,
    newtype)
  • sets a new lower bound and extent for oldtype
  • Does NOT change amount of data sent in a message
  • only changes data access pattern

57
MPI_TYPE_CREATE_RESIZED
(Slide figure: the struct example rebuilt with MPI_TYPE_CREATE_RESIZED, labeled
"Really Portable".)
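A minimal C sketch of the idea, reusing struct s1 from slide 54; resizing the struct
datatype to sizeof(struct s1) makes sending an array of structs portable across
compilers and padding rules.

  #include <mpi.h>
  #include <stddef.h>   /* offsetof */

  struct s1 { char class; double d[6]; char b[7]; };
  struct s1 sarray[100];

  void build_portable_type (MPI_Datatype *newtype)
  {
      int          lens[3]  = {1, 6, 7};
      MPI_Aint     disp[3]  = { offsetof(struct s1, class),
                                offsetof(struct s1, d),
                                offsetof(struct s1, b) };
      MPI_Datatype types[3] = {MPI_CHAR, MPI_DOUBLE, MPI_CHAR};
      MPI_Datatype tmp;

      MPI_Type_create_struct (3, lens, disp, types, &tmp);
      /* Force the extent to the true struct size, including any trailing padding */
      MPI_Type_create_resized (tmp, 0, sizeof(struct s1), newtype);
      MPI_Type_commit (newtype);
      MPI_Type_free (&tmp);
      /* Now MPI_Send (sarray, 100, *newtype, ...) walks the array correctly */
  }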
58
Datatype Constructors (cont)
  • MPI_TYPE_CREATE_SUBARRAY (ndims, sizes, subsizes,
    starts, order, oldtype, newtype)
  • Creates a newtype which represents a contiguous
    subsection of an array with ndims dimensions
  • This sub-array is only contiguous conceptually,
    it may not be stored contiguously in memory!
  • Arrays are assumed to be indexed starting at
    zero!!!
  • Order must be MPI_ORDER_C or MPI_ORDER_FORTRAN
  • C programs may specify Fortran ordering, and
    vice-versa

59
Datatype Constructors Subarrays
(Slide figure: a 10 x 10 array, indices (1,1) to (10,10), with a 6 x 6 region
highlighted.)
MPI_TYPE_CREATE_SUBARRAY (2, sizes, subsizes, starts,
    MPI_ORDER_FORTRAN, MPI_INT, sarray)
sizes = (10, 10), subsizes = (6, 6), starts = (3, 3)
60
Datatype Constructors Subarrays
(Slide figure: the same 10 x 10 array, indices (1,1) to (10,10), with a 6 x 6 region
highlighted.)
MPI_TYPE_CREATE_SUBARRAY (2, sizes, subsizes, starts,
    MPI_ORDER_FORTRAN, MPI_INT, sarray)
sizes = (10, 10), subsizes = (6, 6), starts = (2, 2)
61
Datatype Constructors Darrays
  • MPI_TYPE_CREATE_DARRAY (size, rank, dims, gsizes,
    distribs, dargs, psizes, order, oldt, newtype)
  • Used with arrays that are distributed in HPF-like
    fashion on Cartesian process grids
  • Generates datatypes corresponding to the
    sub-arrays stored on each processor
  • Returns in newtype a datatype specific to the
    sub-array stored on process rank

62
Datatype Constructors (cont)
  • Derived datatypes must be committed before they
    can be used
  • MPI_TYPE_COMMIT (datatype)
  • Performs a compilation of the datatype
    description into an efficient representation
  • Derived datatypes should be freed when they are
    no longer needed
  • MPI_TYPE_FREE (datatype)
  • Does not affect datatypes derived from the freed
    datatype, or communication currently using it

63
Pack and Unpack
  • MPI_PACK (inbuf, incount, datatype, outbuf,
    outsize, position, comm)
  • MPI_UNPACK (inbuf, insize, position, outbuf,
    outcount, datatype, comm)
  • MPI_PACK_SIZE (incount, datatype, comm, size)
  • Packed messages must be sent with the type
    MPI_PACKED
  • Packed messages can be received with any matching
    datatype
  • Messages sent with ordinary (unpacked) datatypes
    can also be received with the type MPI_PACKED
  • Receives must use the type MPI_PACKED if the
    messages are to be unpacked with MPI_UNPACK

64
Pack and Unpack
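A minimal C sketch of the pack/unpack pattern (buffer size and message layout are
illustrative): a count is packed as meta-data in front of the data itself, so one
message carries both.

  #include <mpi.h>

  #define MAXBUF 4096   /* illustrative */

  void send_packed (int *data, int n, int dest, MPI_Comm comm)
  {
      char buf[MAXBUF];
      int  pos = 0;
      MPI_Pack (&n,   1, MPI_INT, buf, MAXBUF, &pos, comm);  /* meta-data first */
      MPI_Pack (data, n, MPI_INT, buf, MAXBUF, &pos, comm);
      MPI_Send (buf, pos, MPI_PACKED, dest, 0, comm);
  }

  void recv_packed (int *data, int src, MPI_Comm comm)
  {
      char buf[MAXBUF];
      int  pos = 0, n;
      MPI_Status status;
      MPI_Recv (buf, MAXBUF, MPI_PACKED, src, 0, comm, &status);
      MPI_Unpack (buf, MAXBUF, &pos, &n,   1, MPI_INT, comm);
      MPI_Unpack (buf, MAXBUF, &pos, data, n, MPI_INT, comm);
  }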
65
Derived Datatypes Performance Issues
  • May allow the user to send fewer or smaller
    messages
  • How well this works is system dependent
  • May be able to significantly reduce memory copies
  • can make I/O much more efficient
  • Data packing may be more efficient if it reduces
    the number of send operations by packing
    meta-data at the front of the message
  • This is often possible (and advantageous) for
    data layouts that are runtime dependent

66
DAY 2
  • Morning - Lecture
  • Performance analysis and tuning
  • Afternoon - Lab
  • VAMPIR demo

67
Performance analysis and tuning
  • It is typically much more difficult to debug and
    tune parallel programs
  • Programmers often have no idea where to begin
    searching for possible bottlenecks
  • A tool that allows the programmer to get a quick
    overview of the program's execution can aid the
    programmer in beginning this search

68
VAMPIR
  • Vampir is a visualization program used to
    visualize trace data generated by Vampirtrace
  • Vampirtrace is an instrumented MPI library to
    link with user code for automatic tracefile
    generation on parallel platforms

69
Vampir and Vampirtrace
70
Vampir Features
  • Tool for converting tracefile data for MPI
    programs into a variety of graphical views
  • Highly configurable
  • Timeline display with zooming and scrolling
    capabilities
  • Profiling and communications statistics
  • Source-code clickback
  • OpenMP support under development

71
Vampirtrace
  • Profiling library for MPI applications
  • Produces tracefiles that can be analyzed with the
    Vampir performance analysis tool or the Dimemas
    performance prediction tool.
  • Merely linking your application with Vampirtrace
    enables tracing of all MPI calls. On some
    platforms, calls to user-level subroutines are
    also recorded.
  • API for controlling profiling and for defining
    and tracing user-defined activities.

72
Running and Analyzing Vampirtrace-instrumented
Programs
  • Programs linked with Vampirtrace are started in
    the same way as ordinary MPI programs.
  • Use Vampir to analyze the resulting tracefile.
  • Uses the configuration file (VAMPIR2.cnf) in
    $HOME/.VAMPIR_defaults
  • Can copy $VAMPIR_ROOT/etc/VAMPIR2.cnf
  • If no configuration file is available, VAMPIR
    will create one with default values

73
Getting Started
If your path is set up correctly, simply enter "vampir".
To open a trace file, select File followed by Open Tracefile, then select the
trace file to open or enter a known tracefile name.
Once the tracefile is loaded, VAMPIR starts with a global timeline display.
This is the analysis starting point.
74
Vampir Global Timeline Display
75
Global Timeline Display
  • Context menu is activated with a right mouse
    click inside any display window
  • Zoom in by selecting start of desired region,
    left click held, drag mouse to end of desired
    region and release
  • Can zoom in to unlimited depth
  • Step out of zooms from context menu

76
Zoomed Global Timeline Display
77
Displays
  • Activity Charts
  • Default is pie chart, but can also use
    histograms or table mode
  • Can select different activities to be shown
  • Can hide some activities
  • Can change scale in histograms

Timeline Activity Chart Summary Chart Message
Statistics File I/O Statistics Parallelism
78
Global Activity Chart Display (single file)
79
Global Activity Chart Display (modular)
80
Global Activity Chart with All Symbols Displayed
81
Global Activity Chart with MPI Activities
Displayed
82
Global Activity Chart with Application Activity
Displayed
83
Global Activity Chart with Application Activity
Displayed
84
Global Activity Chart with Timeline Portion
Displayed
85
Process Activity Chart Display
86
Process Activity Chart Display (mpi)
87
Process Activity Chart Display (hide max)
88
Process Activity Chart Histogram Display
89
Process Activity Chart Histogram Display (log
display)
90
Process Activity Chart Table Display
91
Summary Charts
  • Shows total time spent on each activity
  • Can be sum of all processors or average for each
    processor
  • Similar context menu options as activity charts
  • Default display is horizontal histogram, but can
    also be vertical histogram, pie chart, or table

92
Global Summaric Chart Displays (all symbols)
93
Global Summaric Chart Displays (per process)
94
Global Summaric Chart Displays (mpi)
95
Global Summaric Chart Displays (timeline)
96
Global Summaric Chart Displays
97
Communication Statistics
  • Shows matrix of comm statistics
  • Can show total bytes, total msgs, avg msg size,
    longest, shortest, and transmission rates
  • Can zoom into sub-matrices
  • Can get length statistics
  • Can filter messages by type (tag) and
    communicator

98
Global Communication Statistics Display (total
bytes)
99
Global Communication Statistics Display (total
bytes zoomed in)
100
Global Communication Statistics Display (Average
size)
101
Global Communication Statistics Display (total
messages)
102
Global Communication Statistics Display using
Timeline Portion
103
Global Communication Statistics Display (length
statistics)
104
Global Communication Statistics Display (Filter
Dialog)
105
Global Parallelism Display
106
Tracefile Size
  • Often, the trace file from a fully instrumented
    code grows to an unmanageable size
  • Can limit the problem size for analysis
  • Can limit the number of iterations
  • Can use the vampirtrace API to limit size
  • vttraceoff() - disables tracing
  • vttraceon() - re-enables tracing

107
Performance Analysis and Tuning
  • First, make sure there is available speedup in
    the MPI routines
  • Use a profiling tool such as VAMPIR
  • If the total time spent in MPI routines is a
    small fraction of total execution time, there is
    probably not much use tuning the message passing
    code
  • BEWARE: Profiling tools can miss compute cycles
    consumed by non-blocking calls!

108
Performance Analysis and Tuning
  • If MPI routines account for a significant portion
    of your execution time
  • Try to identify communication hot-spots
  • Will changing the order of communication reduce
    the hotspot problem?
  • Will changing the data distribution reduce
    communication without increasing computation?
  • Sending more data per message is better than
    sending more messages

109
Performance Analysis and Tuning
  • Are you using non-blocking calls?
  • Post sends/receives as soon as possible, but
    don't wait for their completion if there is still
    work you can do!
  • If you are waiting for long periods of time for
    completion of non-blocking sends, this may be an
    indication of small system buffers. Consider
    using buffered mode.

110
Performance Analysis and Tuning
  • Are you sending lots of small messages?
  • Message passing has significant overhead
    (latency). Latency accounts for a large
    proportion of the message transmission time for
    small messages.
  • Consider marshaling values into larger messages
    if this is appropriate
  • If you are using derived datatypes, check if the
    MPI implementation handles these types
    efficiently
  • Consider using MPI_PACK where appropriate
  • e.g. for dynamic data layouts, or when the sender
    needs to send the receiver meta-data.

111
Performance Analysis and Tuning
  • Use collective operations when appropriate
  • many collective operations use mechanisms such as
    broadcast trees to achieve better performance
  • Is your computation to communication ratio too
    small?
  • You may be running on too many processors for the
    problem size

112
DAY 3
  • Morning - Lecture
  • MPI-I/O
  • Afternoon - Lab
  • MPI-I/O exercises

113
MPI-I/O
  • Introduction
  • What is parallel I/O
  • Why do we need parallel I/O
  • What is MPI-I/O
  • MPI-I/O
  • Terms and definitions
  • File manipulation
  • Derived data types and file views

114
OUTLINE (cont)
  • MPI-I/O (cont)
  • Data access
  • Non-collective access
  • Collective access
  • Split collective access
  • File interoperability
  • Gotchas - Consistency and semantics

115
INTRODUCTION
  • What is parallel I/O?
  • Multiple processes accessing a single file

116
INTRODUCTION
  • What is parallel I/O?
  • Multiple processes accessing a single file
  • Often, both data and file access are
    non-contiguous
  • Ghost cells cause non-contiguous data access
  • Block or cyclic distributions cause
    non-contiguous file access

117
Non-Contiguous Access
File layout
Local Mem
118
INTRODUCTION
  • What is parallel I/O?
  • Multiple processes accessing a single file
  • Often, both data and file access are
    non-contiguous
  • Ghost cells cause non-contiguous data access
  • Block or cyclic distributions cause
    non-contiguous file access
  • Want to access data and files with as few I/O
    calls as possible

119
INTRODUCTION (cont)
  • Why use parallel I/O?
  • Many users do not have time to learn the
    complexities of I/O optimization

120
INTRODUCTION (cont)
INTEGER dim
PARAMETER (dim=10000)
INTEGER*4 out_array(dim)

OPEN (fh, FILE=filename, FORM='UNFORMATTED')
WRITE (fh) (out_array(I), I=1,dim)

rl = 4*dim
OPEN (fh, FILE=filename, ACCESS='DIRECT', RECL=rl)
WRITE (fh, REC=1) out_array
121
INTRODUCTION (cont)
  • Why use parallel I/O?
  • Many users do not have time to learn the
    complexities of I/O optimization
  • Use of parallel I/O can simplify coding
  • Single read/write operation vs. multiple
    read/write operations

122
INTRODUCTION (cont)
  • Why use parallel I/O?
  • Many users do not have time to learn the
    complexities of I/O optimization
  • Use of parallel I/O can simplify coding
  • Single read/write operation vs. multiple
    read/write operations
  • Parallel I/O potentially offers significant
    performance improvement over traditional
    approaches

123
INTRODUCTION (cont)
  • Traditional approaches
  • Each process writes to a separate file
  • Often requires an additional post-processing step
  • Without post-processing, restarts must use the
    same number of processors
  • Result sent to a master processor, which collects
    results and writes out to disk
  • Each processor calculates position in file and
    writes individually

124
INTRODUCTION (cont)
  • What is MPI-I/O?
  • MPI-I/O is a set of extensions to the original
    MPI standard
  • This is an interface specification It does NOT
    give implementation specifics
  • It provides routines for file manipulation and
    data access
  • Calls to MPI-I/O routines are portable across a
    large number of architectures

125
MPI-I/O
  • Terms and Definitions
  • Displacement - Number of bytes from the beginning
    of a file
  • etype - unit of data access within a file
  • filetype - datatype used to express access
    patterns of a file
  • file view - definition of access patterns of a
    file

126
MPI-I/O
  • Terms and Definitions
  • Offset - Position in the file, relative to the
    current view, expressed in terms of number of
    etypes
  • file pointers - offsets into the file maintained
    by MPI
  • Individual file pointer - local to the process
    that opened the file
  • Shared file pointer - shared (and manipulated) by
    the group of processes that opened the file

127
FILE MANIPULATION
  • MPI_FILE_OPEN(Comm, filename, mode, info, fh,
    ierr)
  • Opens the file identified by filename on each
    processor in communicator Comm
  • Collective over this group of processors
  • Each processor must use same value for mode and
    reference the same file
  • info is used to give hints about access patterns
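A minimal C sketch of a collective open and close (file name and mode are
illustrative):

  #include <mpi.h>

  void open_close (MPI_Comm comm)
  {
      MPI_File fh;
      /* Collective: every process in comm passes the same name and mode */
      MPI_File_open (comm, "results.dat",
                     MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
      /* ... I/O ... */
      MPI_File_close (&fh);   /* also collective */
  }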

128
FILE MANIPULATION (cont)
  • MPI_FILE_CLOSE (fh)
  • This routine synchronizes the file state and then
    closes the file
  • The user must ensure all I/O routines have
    completed before closing the file
  • This is a collective routine (but not
    synchronizing)

129
DERIVED DATATYPES & VIEWS
  • Derived datatypes are not part of MPI-I/O
  • They are used extensively in conjunction with
    MPI-I/O
  • A filetype is really a datatype expressing the
    access pattern of a file
  • Filetypes are used to set file views

130
DERIVED DATATYPES & VIEWS
  • Non-contiguous memory access
  • MPI_TYPE_CREATE_SUBARRAY
  • NDIMS - number of dimensions
  • ARRAY_OF_SIZES - number of elements in each
    dimension of full array
  • ARRAY_OF_SUBSIZES - number of elements in each
    dimension of sub-array
  • ARRAY_OF_STARTS - starting position in full array
    of sub-array in each dimension
  • ORDER - MPI_ORDER_(C or FORTRAN)
  • OLDTYPE - datatype stored in full array
  • NEWTYPE - handle to new datatype

131
NONCONTIGUOUS MEMORY ACCESS
(Slide figure: a 102 x 102 local array with indices running from (0,0) to (101,101);
the interior 100 x 100 region, indices (1,1) to (100,100), holds the real data and
the outer layer holds ghost cells.)
132
NONCONTIGUOUS MEMORY ACCESS
  • INTEGER sizes(2), subsizes(2), starts(2), dtype,
    ierr
  • sizes(1) = 102
  • sizes(2) = 102
  • subsizes(1) = 100
  • subsizes(2) = 100
  • starts(1) = 1
  • starts(2) = 1
  • CALL MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts,
    MPI_ORDER_FORTRAN, MPI_REAL8, dtype, ierr)

133
NONCONTIGUOUS FILE ACCESS
  • MPI_FILE_SET_VIEW(
  • FH,
  • DISP,
  • ETYPE,
  • FILETYPE,
  • DATAREP,
  • INFO,
  • IERROR)

134
NONCONTIGUOUS FILE ACCESS
  • The file has holes in it from the processor's
    perspective

135
NONCONTIGUOUS FILE ACCESS
  • The file has holes in it from the processor's
    perspective
  • MPI_TYPE_CONTIGUOUS(NUM,OLD,NEW,IERR)
  • NUM - Number of contiguous elements
  • OLD - Old data type
  • NEW - New data type
  • MPI_TYPE_CREATE_RESIZED(OLD,LB,EXTENT,NEW, IERR)
  • OLD - Old data type
  • LB - Lower Bound
  • EXTENT - New size
  • NEW - New data type

136
Holes in the file
Memory layout
File layout (2 ints followed by 3 ints)
CALL MPI_TYPE_CONTIGUOUS (2, MPI_INTEGER, CTYPE, IERR)
DISP = 4
LB = 0
EXTENT = 5*4
CALL MPI_TYPE_CREATE_RESIZED (CTYPE, LB, EXTENT, FTYPE, IERR)
CALL MPI_TYPE_COMMIT (FTYPE, IERR)
CALL MPI_FILE_SET_VIEW (FH, DISP, MPI_INTEGER, FTYPE, 'native',
                        MPI_INFO_NULL, IERR)
137
NONCONTIGUOUS FILE ACCESS
  • The file has holes in it from the processor's
    perspective
  • A block-cyclic data distribution

138
NONCONTIGUOUS FILE ACCESS
  • The file has holes in it from the processor's
    perspective
  • A block-cyclic data distribution
  • MPI_TYPE_VECTOR(
  • COUNT - Number of blocks
  • BLOCKLENGTH - Number of elements per block
  • STRIDE - Elements between start of each block
  • OLDTYPE - Old datatype
  • NEWTYPE - New datatype)

139
Block-cyclic distribution
File layout (blocks of 4 ints)
CALL MPI_TYPE_VECTOR (3, 4, 16, MPI_INTEGER, FILETYPE, IERR)
CALL MPI_TYPE_COMMIT (FILETYPE, IERR)
DISP = 4 * 4 * MYRANK
CALL MPI_FILE_SET_VIEW (FH, DISP, MPI_INTEGER, FILETYPE, 'native',
                        MPI_INFO_NULL, IERR)
140
NONCONTIGUOUS FILE ACCESS
  • The file has holes in it from the processor's
    perspective
  • A block-cyclic data distribution
  • multi-dimensional array access

141
NONCONTIGUOUS FILE ACCESS
  • The file has holes in it from the processor's
    perspective
  • A block-cyclic data distribution
  • multi-dimensional array access
  • MPI_TYPE_CREATE_SUBARRAY()

142
Distributed array access
(Slide figure: a 200 x 200 global array, indices (0,0) to (199,199), distributed
in 100 x 100 blocks across four processes.)
143
Distributed array access
sizes(1) = 200
sizes(2) = 200
subsizes(1) = 100
subsizes(2) = 100
starts(1) = 0
starts(2) = 0
CALL MPI_TYPE_CREATE_SUBARRAY (2, SIZES, SUBSIZES, STARTS,
     MPI_ORDER_FORTRAN, MPI_INTEGER, FILETYPE, IERR)
CALL MPI_TYPE_COMMIT (FILETYPE, IERR)
CALL MPI_FILE_SET_VIEW (FH, 0, MPI_INTEGER, FILETYPE, 'native',
     MPI_INFO_NULL, IERR)
144
NONCONTIGUOUS FILE ACCESS
  • The file has holes in it from the processor's
    perspective
  • A block-cyclic data distribution
  • multi-dimensional array distributed with a block
    distribution
  • Irregularly distributed arrays

145
Irregularly distributed arrays
  • MPI_TYPE_CREATE_INDEXED_BLOCK
  • COUNT - Number of blocks
  • LENGTH - Elements per block
  • MAP - Array of displacements
  • OLD - Old datatype
  • NEW - New datatype

146
Irregularly distributed arrays
147
Irregularly distributed arrays
CALL MPI_TYPE_CREATE_INDEXED_BLOCK (10, 1, FILE_MAP, MPI_INTEGER,
     FILETYPE, IERR)
CALL MPI_TYPE_COMMIT (FILETYPE, IERR)
DISP = 0
CALL MPI_FILE_SET_VIEW (FH, DISP, MPI_INTEGER, FILETYPE, 'native',
     MPI_INFO_NULL, IERR)
148
DATA ACCESS
149
COLLECTIVE I/O
Memory layout on 4 processors
File layout
150
EXPLICIT OFFSETS
  • Parameters
  • FH - File handle
  • OFFSET - Location in file to start
  • BUF - Buffer to write from/read to
  • COUNT - Number of elements
  • DATATYPE - Type of each element
  • STATUS - Return status (blocking)
  • REQUEST - Request handle (non-blocking,
    non-collective)

151
EXPLICIT OFFSETS (cont)
  • I/O Routines
  • MPI_FILE_(READ/WRITE)_AT ()
  • MPI_FILE_(READ/WRITE)_AT_ALL ()
  • MPI_FILE_I(READ/WRITE)_AT ()
  • MPI_FILE_(READ/WRITE)_AT_ALL_BEGIN ()
  • MPI_FILE_(READ/WRITE)_AT_ALL_END (FH, BUF, STATUS)

152
EXPLICIT OFFSETS
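A minimal C sketch (sizes are illustrative) of a collective write in which each
process supplies its own explicit offset:

  #include <mpi.h>

  void write_at_all (MPI_File fh, int rank, int *local, int n)
  {
      /* Assuming the default file view, offsets are in bytes;
         each process writes its n ints at a disjoint offset */
      MPI_Offset offset = (MPI_Offset) rank * n * sizeof(int);
      MPI_File_write_at_all (fh, offset, local, n, MPI_INT, MPI_STATUS_IGNORE);
  }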
153
INDIVIDUAL FILE POINTERS
  • Parameters
  • FH - File handle
  • BUF - Buffer to write to/read from
  • COUNT - number of elements to be read/written
  • DATATYPE - Type of each element
  • STATUS - Return status (blocking)
  • REQUEST - Request handle (non-blocking,
    non-collective)

154
INDIVIDUAL FILE POINTERS
  • I/O Routines
  • MPI_FILE_(READ/WRITE) ()
  • MPI_FILE_(READ/WRITE)_ALL ()
  • MPI_FILE_I(READ/WRITE) ()
  • MPI_FILE_(READ/WRITE)_ALL_BEGIN()
  • MPI_FILE_(READ/WRITE)_ALL_END (FH, BUF, STATUS)

155
INDIVIDUAL FILE POINTERS
156
SHARED FILE POINTERS
  • All processes must have the same view
  • Parameters
  • FH - File handle
  • BUF - Buffer
  • COUNT - Number of elements
  • DATATYPE - Type of the elements
  • STATUS - Return status (blocking)
  • REQUEST - Request handle (non-blocking,
    non-collective)

157
SHARED FILE POINTERS
  • I/O Routines
  • MPI_FILE_(READ/WRITE)_SHARED ()
  • MPI_FILE_I(READ/WRITE)_SHARED ()
  • MPI_FILE_(READ/WRITE)_ORDERED ()
  • MPI_FILE_(READ/WRITE)_ORDERED_BEGIN ()
  • MPI_FILE_(READ/WRITE)_ORDERED_END (FH, BUF,
    STATUS)

158
SHARED FILE POINTERS
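A minimal C sketch of ordered access through the shared file pointer (buffer and
count are illustrative); the processes' blocks land in the file in rank order.

  #include <mpi.h>

  void log_in_rank_order (MPI_File fh, int *local, int n)
  {
      /* Collective: blocks are written in rank order at the shared file pointer */
      MPI_File_write_ordered (fh, local, n, MPI_INT, MPI_STATUS_IGNORE);
  }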
159
FILE INTEROPERABILITY
  • MPI puts no constraints on how an implementation
    should store files
  • If a file is not stored as a linear byte stream,
    there must be a utility for converting the file
    into a linear byte stream
  • Data representation aids interoperability

160
FILE INTEROPERABILITY (cont)
  • Data Representation
  • native - Data stored exactly as it is in memory.
  • internal - Data may be converted, but can always
    be read by the same MPI implementation, even on
    different architectures
  • external32 - This representation is defined by
    MPI. Files written in external32 format can be
    read by any MPI on any machine

161
FILE INTEROPERABILITY (cont)
  • In some MPI-I/O implementations (e.g. ROMIO), the
    files created are no different from those created
    by the underlying file system.
  • This means normal Posix commands (cp, rm, etc)
    work with files created by these implementations
  • Non-MPI programs can read these files

162
GOTCHAS - Consistency Semantics
  • Collective routines are NOT synchronizing
  • Output data may be buffered
  • Just because a process has completed a write does
    not mean the data is available to other processes
  • Three ways to ensure file consistency
  • MPI_FILE_SET_ATOMICITY ()
  • MPI_FILE_SYNC ()
  • MPI_FILE_CLOSE ()

163
CONSISTENCY SEMANTICS
  • MPI_FILE_SET_ATOMICITY ()
  • Causes all writes to be immediately written to
    disk. This is a collective operation
  • MPI_FILE_SYNC ()
  • Collective operation which forces buffered data
    to be written to disk
  • MPI_FILE_CLOSE ()
  • Writes any buffered data to disk before closing
    the file

164
GOTCHA!!!
Process A:
  CALL MPI_FILE_OPEN (..., FH)
  CALL MPI_FILE_SET_ATOMICITY (FH)
  CALL MPI_FILE_WRITE_AT (FH, 100, ...)
  CALL MPI_FILE_READ_AT (FH, 0, ...)

Process B:
  CALL MPI_FILE_OPEN (..., FH)
  CALL MPI_FILE_SET_ATOMICITY (FH)
  CALL MPI_FILE_WRITE_AT (FH, 0, ...)
  CALL MPI_FILE_READ_AT (FH, 100, ...)
165
GOTCHA!!!
Process A:
  CALL MPI_FILE_OPEN (..., FH)
  CALL MPI_FILE_SET_ATOMICITY (FH)
  CALL MPI_FILE_WRITE_AT (FH, 100, ...)
  CALL MPI_BARRIER (...)
  CALL MPI_FILE_READ_AT (FH, 0, ...)

Process B:
  CALL MPI_FILE_OPEN (..., FH)
  CALL MPI_FILE_SET_ATOMICITY (FH)
  CALL MPI_FILE_WRITE_AT (FH, 0, ...)
  CALL MPI_BARRIER (...)
  CALL MPI_FILE_READ_AT (FH, 100, ...)
166
GOTCHA!!!
Process A:
  CALL MPI_FILE_OPEN (..., FH)
  CALL MPI_FILE_WRITE_AT (FH, 100, ...)
  CALL MPI_FILE_CLOSE (FH)
  CALL MPI_FILE_OPEN (..., FH)
  CALL MPI_FILE_READ_AT (FH, 0, ...)

Process B:
  CALL MPI_FILE_OPEN (..., FH)
  CALL MPI_FILE_WRITE_AT (FH, 0, ...)
  CALL MPI_FILE_CLOSE (FH)
  CALL MPI_FILE_OPEN (..., FH)
  CALL MPI_FILE_READ_AT (FH, 100, ...)
167
GOTCHA!!!
Process A:
  CALL MPI_FILE_OPEN (..., FH)
  CALL MPI_FILE_WRITE_AT (FH, 100, ...)
  CALL MPI_FILE_CLOSE (FH)
  CALL MPI_BARRIER (...)
  CALL MPI_FILE_OPEN (..., FH)
  CALL MPI_FILE_READ_AT (FH, 0, ...)

Process B:
  CALL MPI_FILE_OPEN (..., FH)
  CALL MPI_FILE_WRITE_AT (FH, 0, ...)
  CALL MPI_FILE_CLOSE (FH)
  CALL MPI_BARRIER (...)
  CALL MPI_FILE_OPEN (..., FH)
  CALL MPI_FILE_READ_AT (FH, 100, ...)
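A minimal C sketch (two ranks and buffer sizes assumed purely for illustration) of
the sync-barrier-sync recipe that makes one process's write visible to another
without atomic mode or a close/reopen:

  #include <mpi.h>

  void write_then_read (MPI_File fh, MPI_Comm comm, int rank, int *buf, int n)
  {
      /* Default file view assumed, so offsets are in bytes; two ranks assumed */
      MPI_Offset my_off    = (MPI_Offset) rank * n * sizeof(int);
      MPI_Offset other_off = (MPI_Offset) (1 - rank) * n * sizeof(int);

      MPI_File_write_at (fh, my_off, buf, n, MPI_INT, MPI_STATUS_IGNORE);
      MPI_File_sync (fh);      /* flush this process's writes */
      MPI_Barrier (comm);      /* everyone has flushed */
      MPI_File_sync (fh);      /* make the other process's writes visible here */
      MPI_File_read_at (fh, other_off, buf, n, MPI_INT, MPI_STATUS_IGNORE);
  }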
168
CONCLUSIONS
  • MPI-I/O potentially offers significant
    improvement in I/O performance
  • This improvement can be attained with minimal
    effort on part of the user
  • Simpler programming with fewer calls to I/O
    routines
  • Easier program maintenance due to simple API

169
Recommended references
  • MPI - The Complete Reference, Volume 1: The MPI
    Core
  • MPI - The Complete Reference, Volume 2: The MPI
    Extensions
  • Using MPI: Portable Parallel Programming with the
    Message-Passing Interface
  • Using MPI-2: Advanced Features of the
    Message-Passing Interface