Title: PARALLEL COMPUTING WITH MPI
Slide 1: Parallel Programming on the SGI Origin2000
Taub Computer Center, Technion
Moshe Goldberg, mgold_at_tx.technion.ac.il
With thanks to Igor Zacharov / Benoit Marchand, SGI
Mar 2004 (v1.2)
Slide 2: Parallel Programming on the SGI Origin2000
- Parallelization Concepts
- SGI Computer Design
- Efficient Scalar Design
- Parallel Programming - OpenMP
- Parallel Programming - MPI
Slide 3: Parallel Programming - MPI
Slide 4: Parallel classification
- Parallel architectures
  - Shared Memory / Distributed Memory
- Programming paradigms
  - Data parallel / Message passing
Slide 5: Shared Memory
- Each processor can access any part of the memory
- Access times are uniform (in principle)
- Easier to program (no explicit message passing)
- Bottleneck when several tasks access the same location
Slide 6: Distributed Memory
- Each processor can access only its own local memory
- Access times depend on location
- Processors must communicate via explicit message passing
Slide 7: Distributed Memory
(Figure: processor/memory pairs connected by an interconnection network.)
Slide 8: Message Passing Programming
- Separate program on each processor
- Local memory
- Control over distribution and transfer of data
- Additional debugging complexity due to communications
Slide 9: Performance issues
- Concurrency: the ability to perform actions simultaneously
- Scalability: performance is not impaired by an increasing number of processors
- Locality: a high ratio of local to remote memory accesses (or low communication)
Slide 10: SP2 Benchmark
- Goal: checking performance of real-world applications on the SP2
- Execution time (seconds): CPU time for the applications
- Speedup = (execution time for 1 processor) / (execution time for p processors)
Slide 11: (figure, no transcript)
Slide 12: WHAT is MPI?
- A message-passing library specification
- An extended message-passing model
- Not specific to an implementation or computer
Slide 13: BASICS of MPI PROGRAMMING
- MPI is a message-passing library
- Assumes a distributed-memory architecture
- Includes routines for performing communication (exchange of data and synchronization) among the processors
Slide 14: Message Passing
- Data transfer + synchronization
- Synchronization: the act of bringing one or more processes to known points in their execution
- Distributed memory: memory split into segments, each of which may be accessed by only one process
Slide 15: Message Passing
(Figure: handshake between two processes: "May I send?" / "yes" / send data.)
Slide 16: MPI STANDARD
- Standard by consensus, designed in an open forum
- Introduced by the MPI Forum in May 1994, updated in June 1995
- MPI-2 (1998) provides extensions to the MPI standard
Slide 17: Why use MPI?
- Standardization
- Portability
- Performance
- Richness
- Designed to enable libraries
Slide 18: Writing an MPI Program
- If there is a serial version, make sure it is debugged
- If not, try to write a serial version first
- When debugging in parallel, start with a few nodes first
Slide 19: Format of MPI routines
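As a reminder of the standard convention: in C, every MPI routine is a function that returns an integer error code, while in Fortran each routine is a subroutine whose last argument returns the error code (e.g. call MPI_INIT(ierror)). A minimal C sketch:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int ierr;
        ierr = MPI_Init(&argc, &argv);   /* MPI_SUCCESS on success */
        ierr = MPI_Finalize();
        return ierr;
    }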
Slide 20: Six useful MPI functions
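These are presumably the six calls the hello-world examples below rely on; their standard C prototypes are:

    int MPI_Init(int *argc, char ***argv);          /* start MPI           */
    int MPI_Comm_size(MPI_Comm comm, int *size);    /* number of processes */
    int MPI_Comm_rank(MPI_Comm comm, int *rank);    /* my id: 0..size-1    */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm);
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
                 int source, int tag, MPI_Comm comm, MPI_Status *status);
    int MPI_Finalize(void);                         /* shut down MPI       */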
Slide 21: Communication routines
Slide 22: End MPI part of program
Slide 23: Fortran example
      program hello
      include 'mpif.h'
      integer status(MPI_STATUS_SIZE)
      integer rank, size, tag, i, ierror
      character*12 message
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      tag = 100
      if (rank .eq. 0) then
         message = 'Hello, world'
         do i = 1, size-1
            call MPI_SEND(message, 12, MPI_CHARACTER, i, tag,
     .                    MPI_COMM_WORLD, ierror)
         enddo
      else
         call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
     .                 MPI_COMM_WORLD, status, ierror)
      endif
      print*, 'node', rank, ':', message
      call MPI_FINALIZE(ierror)
      end
Slide 24: C example
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        int tag = 100;
        int rank, size, i;
        MPI_Status status;
        char message[12];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        strcpy(message, "Hello,world");
        if (rank == 0)
            for (i = 1; i < size; i++)
                MPI_Send(message, 12, MPI_CHAR, i, tag, MPI_COMM_WORLD);
        else
            MPI_Recv(message, 12, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
        printf("node %d %s\n", rank, message);
        MPI_Finalize();
        return 0;
    }
Slide 25: MPI Messages
- DATA: the data to be sent
- ENVELOPE: information used to route the data
Slide 26: Description of MPI_Send (MPI_Recv)
Slide 27: Description of MPI_Send (MPI_Recv)
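The argument tables are summarized here as an annotated sketch (standard MPI argument meanings; the first three arguments describe the data, the rest form the envelope of slide 25):

    MPI_Send(buf,       /* address of send buffer              (data)     */
             count,     /* number of elements to send          (data)     */
             datatype,  /* type of each element, e.g. MPI_INT  (data)     */
             dest,      /* rank of the destination process     (envelope) */
             tag,       /* message tag                         (envelope) */
             comm);     /* communicator, e.g. MPI_COMM_WORLD   (envelope) */

    MPI_Recv(buf, count, datatype,
             source,    /* rank of sender, or MPI_ANY_SOURCE   (envelope) */
             tag,       /* expected tag, or MPI_ANY_TAG        (envelope) */
             comm,
             &status);  /* returns actual source, tag, length            */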
Slide 28: Some useful remarks
- Source = MPI_ANY_SOURCE means that any source is acceptable
- Tags specified by sender and receiver must match; with MPI_ANY_TAG, any tag is acceptable
- The communicator must be the same for send and receive; usually MPI_COMM_WORLD
Slide 29: POINT-TO-POINT COMMUNICATION
- Transmission of a message between one pair of processes
- The programmer can choose the mode of transmission
Slide 30: MODE of TRANSMISSION
- Can be chosen by the programmer, or the system can be left to decide (see the call names below)
- Synchronous mode
- Ready mode
- Buffered mode
- Standard mode
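Each mode has its own send call in MPI (the receive side is the same for all); a minimal sketch of the calls:

    MPI_Ssend(buf, count, datatype, dest, tag, comm); /* synchronous: completes after the matching receive starts */
    MPI_Rsend(buf, count, datatype, dest, tag, comm); /* ready: the receive must already be posted */
    MPI_Bsend(buf, count, datatype, dest, tag, comm); /* buffered: copied to a buffer from MPI_Buffer_attach */
    MPI_Send (buf, count, datatype, dest, tag, comm); /* standard: the system decides */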
Slide 31: BLOCKING / NON-BLOCKING COMMUNICATIONS
Slide 32: BLOCKING STANDARD SEND (size > threshold)
(Figure: the sending task S calls MPI_SEND and waits; the transfer begins
when the receiver R posts MPI_RECV; data transfer from the source then
completes, and the receiving task continues when the transfer into its
buffer is complete.)
Slide 33: NON-BLOCKING STANDARD SEND (size > threshold)
(Figure: S calls MPI_ISEND and continues working; the transfer begins when
R posts MPI_IRECV; each side calls MPI_WAIT to complete the operation.
There is no interruption if the wait is posted late enough.)
Slide 34: BLOCKING STANDARD SEND (size < threshold)
(Figure: for short messages MPI_SEND transfers the data to a buffer on the
receiver and completes at once; the receiving task continues when the
transfer from that buffer into the user's buffer is complete after
MPI_RECV.)
Slide 35: NON-BLOCKING STANDARD SEND (size < threshold)
(Figure: MPI_ISEND returns with no delay even though the message is not yet
in the buffer on R; the transfer to the intermediate buffer can be avoided
if MPI_IRECV is posted early enough; MPI_WAIT causes no delay if it is
called late enough.)
Slide 36: BLOCKING COMMUNICATION
Slide 37: NON-BLOCKING COMMUNICATION
Slide 38: (figure, no transcript)
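A minimal C sketch of the non-blocking pattern shown on slides 33 and 35 (buffer name, size, dest and tag are illustrative):

    MPI_Request req;
    MPI_Status  status;
    double work[1000];                /* assumed application buffer */
    MPI_Isend(work, 1000, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
    /* ... overlap computation here, without modifying work[] ... */
    MPI_Wait(&req, &status);          /* afterwards, work[] may be reused */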
Slide 39: Deadlock program (cont.)
      if ( irank .EQ. 0 ) then
         idest     = 1
         isrc      = 1
         isend_tag = ITAG_A
         irecv_tag = ITAG_B
      else if ( irank .EQ. 1 ) then
         idest     = 0
         isrc      = 0
         isend_tag = ITAG_B
         irecv_tag = ITAG_A
      end if
C     ----------------------------------------------------------------
C     send and receive messages
C     ----------------------------------------------------------------
      print *, " Task ", irank, " has sent the message"
      call MPI_Send ( rmessage1, MSGLEN, MPI_REAL, idest, isend_tag,
     .                MPI_COMM_WORLD, ierr )
      call MPI_Recv ( rmessage2, MSGLEN, MPI_REAL, isrc, irecv_tag,
     .                MPI_COMM_WORLD, istatus, ierr )
      print *, " Task ", irank, " has received the message"
      call MPI_Finalize (ierr)
      end
Slide 40: DEADLOCK example
(Figure: tasks A and B each call MPI_SEND to the other and only then
MPI_RECV, so each send waits for a receive that is never posted.)
Slide 41: Deadlock example
- SP2 implementation: no receive has been posted yet, so both processes block
- Solutions:
  - Different ordering of the calls
  - Non-blocking calls
  - MPI_Sendrecv (see the sketch below)
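A hedged C sketch of the MPI_Sendrecv solution (variable names echo the Fortran program above; float buffers assumed):

    /* the combined call lets MPI order the two transfers internally,
       so neither task blocks the other */
    MPI_Sendrecv(rmessage1, MSGLEN, MPI_FLOAT, idest, isend_tag,
                 rmessage2, MSGLEN, MPI_FLOAT, isrc,  irecv_tag,
                 MPI_COMM_WORLD, &istatus);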
Slide 42: Determining Information about Messages
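The standard mechanism is the status object filled in by a receive, plus MPI_Get_count; a sketch (buf and MAXLEN are assumed application names):

    MPI_Status status;
    int count, src, tag;
    MPI_Recv(buf, MAXLEN, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    src = status.MPI_SOURCE;                 /* who actually sent it   */
    tag = status.MPI_TAG;                    /* with which tag         */
    MPI_Get_count(&status, MPI_INT, &count); /* how many elements came */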
Slide 43: MPI_WAIT
- Useful for both sender and receiver of non-blocking communications
- The receiving process blocks until the message is received, under programmer control
- The sending process blocks until the send operation completes, at which time the message buffer is available for re-use
Slide 44: MPI_WAIT
(Figure: S computes while the transmission proceeds; MPI_WAIT on R blocks
until the transfer is complete.)
Slide 45: MPI_TEST
(Figure: S posts MPI_Isend and keeps computing, calling MPI_TEST from time
to time to check whether the transmission to R has completed.)
Slide 46: MPI_TEST
- Used by both sender and receiver of non-blocking communication
- Non-blocking call
- The receiver checks whether a specific sender has sent a message that is waiting to be delivered; messages from all other senders are ignored
Slide 47: MPI_TEST (cont.)
- The sender can find out whether the message buffer can be re-used; it has to wait until the operation is complete before doing so
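A short C sketch (req is the request returned by an earlier MPI_Isend or MPI_Irecv):

    int flag;
    MPI_Status status;
    MPI_Test(&req, &flag, &status);   /* returns immediately */
    if (flag) {
        /* operation complete: the send buffer may be reused,
           or the received data is ready */
    }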
Slide 48: MPI_PROBE
- The receiver is notified when messages from potentially any sender arrive and are ready to be processed
- Blocking call
Slide 49: Programming recommendations
- Blocking calls are needed when:
  - Tasks must synchronize
  - MPI_Wait immediately follows the communication call
Slide 50: Collective Communication
- Establishes a communication pattern within a group of nodes
- All processes in the group call the communication routine, with matching arguments
- Collective routine calls can return as soon as their participation in the collective communication is complete
Slide 51: Properties of collective calls
- On completion, the caller is free to access locations in the communication buffer
- Completion does NOT indicate that the other processors in the group have completed
- Only MPI_BARRIER will synchronize all processes
Slide 52: Properties
- MPI guarantees that a message generated by collective communication calls will not be confused with a message generated by point-to-point communication
- The communicator is the group identifier
Slide 53: Barrier
- Synchronization primitive: a node calling it will block until all the nodes within the group have called it
- Syntax: MPI_Barrier(comm, ierr)
Slide 54: Broadcast
- Sends data from one node to all other nodes in the communicator
- MPI_Bcast(buffer, count, datatype, root, comm, ierr)
Slide 55: Broadcast
(Data movement: before the call only P0 holds A0; after it, P0, P1, P2 and
P3 each hold A0.)
Slide 56: Gather and Scatter
(Data movement: scatter takes A0, A1, A2, A3 on P0 and leaves Ai on Pi;
gather is the inverse, collecting the blocks on the root in rank order.)
Slide 57: Allgather effect
(Data movement: before the call each process holds its own block: P0 has
A0, P1 has B0, P2 has C0, P3 has D0; after it, every process holds A0, B0,
C0, D0.)
Slide 58: Syntax for Scatter / Gather
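The standard C prototypes (the Fortran forms take the same arguments plus a trailing ierr):

    int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                    void *recvbuf, int recvcount, MPI_Datatype recvtype,
                    int root, MPI_Comm comm);
    int MPI_Gather (void *sendbuf, int sendcount, MPI_Datatype sendtype,
                    void *recvbuf, int recvcount, MPI_Datatype recvtype,
                    int root, MPI_Comm comm);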
Slide 59: Scatter and Gather
- Gather: collect data from every member of the group (including the root) on the root node, in linear order by the rank of the node
- Scatter: distribute data from the root to every member of the group, in linear order by node
Slide 60: ALLGATHER
- All processes, not just the root, receive the result; the jth block of the receive buffer is the block of data sent from the jth process
- Syntax: MPI_Allgather(sndbuf, scount, datatype, recvbuf, rcount, rdatatype, comm, ierr)
Slide 61: Gather example
      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
      INTEGER root
      DATA root/0/
      DO I = 1, 25
         cpart(I) = 0.
         DO K = 1, 100
            cpart(I) = cpart(I) + A(I,K)*b(K)
         END DO
      END DO
      call MPI_GATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
     .                root, MPI_COMM_WORLD, ierr)
Slide 62: AllGather example
      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
      INTEGER root
      DO I = 1, 25
         cpart(I) = 0.
         DO K = 1, 100
            cpart(I) = cpart(I) + A(I,K)*b(K)
         END DO
      END DO
      call MPI_ALLGATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
     .                   MPI_COMM_WORLD, ierr)
Slide 63: Parallel matrix-vector multiplication
(Figure: c = A*b with the 100x100 matrix A split by rows among P1..P4, 25
rows per processor; each processor computes its 25 elements of c, which are
then gathered, as in the examples above.)
Slide 64: Global Computations
Slide 65: Reduction
- The partial result in each process in the group is combined in one specified process
Slide 66: Reduction
- Dj = D(0,j) + D(1,j) + ... + D(n-1,j)
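A minimal C sketch of a sum reduction onto rank 0 (variable names assumed):

    double part = 1.0;   /* this rank's partial result */
    double total;
    MPI_Reduce(&part, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    /* total now holds the sum over all ranks, on rank 0 only */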
Slide 67: Scan operation
- The scan (prefix-reduction) operation performs partial reductions on distributed data
- D(k,j) = D(0,j) + D(1,j) + ... + D(k,j), for k = 0, 1, ..., n-1
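The matching C call (an inclusive prefix sum; rank is this process's id, assumed already obtained):

    int mine = rank + 1;
    int prefix;
    /* on rank k: prefix = mine(0) + mine(1) + ... + mine(k) */
    MPI_Scan(&mine, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);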
Slide 68: Varying-size gather and scatter
- Both the size and the memory location of the messages may vary
- More flexibility in writing code
- Less need to copy data into temporary buffers
- More compact final code
- The vendor implementation may be optimal
Slide 69: Scatterv syntax
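The standard C prototype (the Fortran form appends ierr):

    int MPI_Scatterv(void *sendbuf, int *sendcounts, int *displs,
                     MPI_Datatype sendtype, void *recvbuf, int recvcount,
                     MPI_Datatype recvtype, int root, MPI_Comm comm);
    /* rank i receives sendcounts[i] elements taken from sendbuf starting
       at offset displs[i] (in elements) */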
Slide 70: SCATTER
(Figure: P0 sends equal-size blocks to P0, P1, P2, P3.)
Slide 71: SCATTERV
(Figure: P0 sends blocks of varying sizes, taken from varying
displacements, to P0, P1, P2, P3.)
Slide 72: Advanced Datatypes
- Predefined basic datatypes: contiguous data of the same type
- We sometimes need:
  - non-contiguous data of a single type
  - contiguous data of mixed types
Slide 73: Solutions
- Multiple MPI calls to send and receive each data element
- Copy the data to a buffer before sending it (MPI_PACK)
- Use MPI_BYTE to get around the datatype-matching rules
Slide 74: Drawbacks
- Slow, clumsy, and wasteful of memory
- Using MPI_BYTE or MPI_PACKED can hamper portability
Slide 75: General Datatypes and Typemaps
- A sequence of basic datatypes
- A sequence of integer (byte) displacements
Slide 76: Typemaps
- typemap = {(type0, disp0), (type1, disp1), ..., (type n-1, disp n-1)}
- Displacements are relative to the start of the buffer
- Example: Typemap(MPI_INT) = {(int, 0)}
Slide 77: Extent of a Derived Datatype
Slide 78: MPI_TYPE_EXTENT
- MPI_TYPE_EXTENT(datatype, extent, ierr)
- Gives the distance (in bytes) from the start of one instance of the datatype to the start of the next
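The C form of this MPI-1 call (a one-line sketch; MPI-2 renames it MPI_Type_get_extent):

    MPI_Aint extent;
    MPI_Type_extent(MPI_INT, &extent);  /* e.g. 4 bytes on most systems */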
Slide 79: How and When Do I Use Derived Datatypes?
- MPI derived datatypes are created at run-time through calls to MPI library routines
Slide 80: How to use
- Construct the datatype
- Commit the datatype (MPI_TYPE_COMMIT)
- Use the datatype
- Deallocate the datatype (MPI_TYPE_FREE)
Slide 81: EXAMPLE
      integer oldtype, newtype, count, blocklength, stride
      integer ierr, n
      real buffer(n,n)
      call MPI_TYPE_VECTOR(count, blocklength, stride, oldtype,
     .                     newtype, ierr)
      call MPI_TYPE_COMMIT(newtype, ierr)
c     use it in a communication operation
      call MPI_SEND(buffer, 1, newtype, dest, tag, comm, ierr)
c     deallocate it
      call MPI_TYPE_FREE(newtype, ierr)
Slide 82: Example of MPI_TYPE_VECTOR
(Figure: newtype consists of blocks of consecutive oldtype elements,
repeated at a fixed stride.)
Slide 83: Summary
- Derived datatypes are datatypes built from the basic MPI datatypes
- Derived datatypes provide a portable and elegant way of communicating non-contiguous or mixed types in a message
- Efficiency may depend on the implementation (see how it compares to MPI_BYTE)
Slide 84: Several datatypes
Slide 85: Several datatypes
Slide 86: GROUP
Slide 87: Group (cont.)
Slide 88: Group (cont.)
      if (rank .eq. 1) then
         print*, 'sum of group1', (rbuf(i), i=1, count)
c        print*, 'sum of group1', (sbuf(i), i=1, count)
      endif
      count2 = size
      do i = 1, count2
         sbuf2(i) = rank*rank
      enddo
      CALL MPI_REDUCE(SBUF2, RBUF2, COUNT2, MPI_INTEGER,
     .                MPI_SUM, 0, WCOMM, IERR)
      if (rank .eq. 0) then
         print*, 'sum of wgroup', (rbuf2(i), i=1, count2)
      else
         CALL MPI_COMM_FREE(SUBCOMM, IERR)
      endif
      CALL MPI_GROUP_FREE(GROUP1, IERR)
      CALL MPI_FINALIZE(IERR)
      stop
      end
Slide 89: PERFORMANCE ISSUES
- Hidden communication takes place
- Performance depends on the implementation of MPI
- Because of forced synchronization, it is not always best to use collective communication
Slide 90: Example: simple broadcast
(Figure: node 1 sends B to each of the other P-1 nodes in turn.
Data = B(P-1), Steps = P-1.)
Slide 91: Example: simple scatter
(Figure: node 1 sends a distinct block of size B to each of the other P-1
nodes in turn. Data = B(P-1), Steps = P-1.)
Slide 92: Example: better scatter
(Figure, for P = 8: in step 1 node 1 sends half of the data (4B) to node 5;
in step 2 nodes 1 and 5 each send 2B; in step 3 each holder passes on B,
reaching all 8 nodes. Data = B(P/2)logP, Steps = log P.)
Slide 93: Timing for sending a message
- The time is composed of the startup time (time to send a 0-length message) and the transfer time (time to transfer one byte of data):
  Tcomm = Tstartup + B * Ttransfer
- It may be worthwhile to group several sends together
Slide 94: Performance evaluation
- Fortran:
      real*8 t1
      t1 = MPI_Wtime()   ! returns elapsed wall-clock time
- C:
      double t1;
      t1 = MPI_Wtime();
Slide 95: MPI References
- The MPI Standard: www-unix.mcs.anl.gov/mpi/index.html
- Parallel Programming with MPI, Peter S. Pacheco, Morgan Kaufmann, 1997
- Using MPI, W. Gropp, Ewing Lusk, Anthony Skjellum, The MIT Press, 1999
Slide 96: Example: better broadcast
(Figure, for P = 8: node 1 sends B to node 5; in the next step nodes 1 and
5 each forward B; the number of nodes holding B doubles each step until all
8 have it. Data = B(P-1), Steps = log P.)