Title: MPI
1. High Performance Computing with MATLAB
Kadin Tseng
Scientific Computing and Visualization Group
Boston University
2. Outline
- Performance Issues
- Memory Access
- Vectorization
- Compiler
- Other Considerations
- Parallel MATLAB
3. Memory Access
- Memory access patterns often affect computational performance. Here are
  some effective ways to enhance performance:
  - Allocate array memory before using it
  - For-loop ordering
  - Compute and save arrays in place wherever possible
4. Allocate Array
- Allocate array memory before using it.
- MATLAB is designed primarily as an interactive, user-friendly
  environment, so no pre-allocation of memory is required. Often,
  however, array sizes are known a priori. Pre-allocating an array
  ensures that all of its elements are stored in one single, contiguous
  block right from the start, instead of the array being resized as it
  grows.
Without pre-allocation (wallclock time: 0.0153 seconds):

    n = 5000;
    x(1) = 1;
    for i = 2:n
        x(i) = 2*x(i-1);
    end

With pre-allocation (wallclock time: 0.0002 seconds):

    n = 5000;
    x = ones(n,1);
    x(1) = 1;
    for i = 2:n
        x(i) = 2*x(i-1);
    end

The timing data are recorded on Katana. The
actual times can vary significantly depending on
the processor.
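
The effect of preallocation can be checked on any machine with tic and toc. The following is a minimal sketch, not the benchmark behind the timings quoted on this slide. The names x_grow, x_pre, t_grow, and t_pre are illustrative, and the recurrence adds 1 instead of doubling so that the values stay finite.

```matlab
n = 5000;

% Version 1: the array grows by one element per iteration.
clear x_grow
tic
x_grow(1) = 1;
for i = 2:n
    x_grow(i) = x_grow(i-1) + 1;
end
t_grow = toc;

% Version 2: the array is allocated once, then filled.
tic
x_pre = zeros(n, 1);
x_pre(1) = 1;
for i = 2:n
    x_pre(i) = x_pre(i-1) + 1;
end
t_pre = toc;

fprintf('growing: %.4f s, preallocated: %.4f s\n', t_grow, t_pre);
```

Both versions produce the same values; only the way memory is obtained differs. Measured times will vary with the MATLAB version and the processor.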
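5. For-loop Ordering
- It is best if the inner-most for-loop runs over the left-most index of
  the array, the next loop out over the second index, and so on.
- MATLAB stores a multi-dimensional array, x(i,j), in column-major order,
  so in the 1D representation of the same array, x(k), elements that are
  adjacent in the left-most index are contiguous in memory.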
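Inner loop over columns, the right-most index (wallclock time: 0.88 seconds):

    n = 5000;
    x = zeros(n);
    for i = 1:n          % rows
        for j = 1:n      % columns
            x(i,j) = i+(j-1)*n;
        end
    end

Inner loop over rows, the left-most index (wallclock time: 0.48 seconds):

    n = 5000;
    x = zeros(n);
    for j = 1:n          % columns
        for i = 1:n      % rows
            x(i,j) = i+(j-1)*n;
        end
    end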
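
The mapping between x(i,j) and its 1D representation x(k) can be checked directly. This is a small sketch using a 4 x 4 array (a size chosen here only for readability) and the built-in sub2ind. The variable names k and in_storage_order are illustrative.

```matlab
n = 4;                       % small size so the result is easy to inspect
x = zeros(n);

% Column-ordered fill: the inner loop runs over the left-most index i.
for j = 1:n
    for i = 1:n
        x(i,j) = i + (j-1)*n;
    end
end

% Linear index of element (3,2): k = i + (j-1)*n = 3 + 1*4 = 7.
k = sub2ind(size(x), 3, 2);

% Each element holds its own linear index, so x(k) equals k and
% reading x in storage order gives 1, 2, ..., n^2.
in_storage_order = x(:).';
disp(k)
disp(in_storage_order)
```

Because consecutive values of i map to consecutive values of k, the loop nest with i inner-most walks through memory one element at a time.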
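6. Compute In-place
- Computing and saving an array in place improves performance, because
  no second array has to be allocated for the result.

Result stored in a new array y (wallclock time: 1.23 seconds):

    x = randn(10000);
    tic
    y = x.^2;
    toc

Result stored in place in x (wallclock time: 0.49 seconds):

    x = randn(10000);
    tic
    x = x.^2;
    toc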
7. Other Considerations
- Use a function m-file instead of a script m-file whenever reasonable.
  - A script m-file is loaded into memory and evaluated one line at a
    time. Subsequent uses require reloading.
  - A function m-file is compiled into pseudo-code and is loaded once.
    Subsequent uses of the function are faster because no reloading is
    needed.
- Avoid using virtual memory. Physical memory is much faster.
- Avoid passing large matrices to a function and modifying only a
  handful of elements.
- Use the MATLAB profiler (profile) to identify hot spots for
  performance enhancement.
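
A typical profiler session might look like the sketch below. The workload between profile on and profile off is a made-up placeholder, and the variable names a, b, and stats are illustrative. Any code of interest can take the workload's place.

```matlab
profile on                      % start collecting timing data

% Illustrative workload (hypothetical): any code of interest goes here.
a = rand(200);
b = zeros(200);
for k = 1:50
    b = b + sin(a) * k;
end

profile off                     % stop collecting
stats = profile('info');        % struct holding the collected data

% In MATLAB, "profile viewer" opens an interactive report of the results.
```

The report lists the time spent per function and per line, which is usually enough to decide where preallocation or vectorization would pay off.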
8. Vectorization
- The use of for-loops in MATLAB can, in general, be expensive,
  especially if the loop count is large or the for-loops are nested.
- Without array allocation, for-loops are very costly.
- From a performance standpoint, a compact vector representation
  should, in general, be used in place of for-loops. Here is an example.

For-loop (wallclock time: 0.0045 seconds):

    i = 0;
    for t = 0:.01:10
        i = i + 1;
        y(i) = sin(t);
    end

Vectorized (wallclock time: 0.0005 seconds):

    t = 0:.01:10;
    y = sin(t);
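
When a loop cannot be avoided, preallocating the output removes the cost of growing the array. The sketch below places a preallocated loop next to the vectorized form and confirms that the two agree. The names y_loop, y_vec, and max_diff are illustrative and do not come from the slides.

```matlab
t = 0:.01:10;

% Loop with preallocation: y_loop has its final size before the loop.
y_loop = zeros(size(t));
for i = 1:numel(t)
    y_loop(i) = sin(t(i));
end

% Vectorized form: one call on the whole array.
y_vec = sin(t);

max_diff = max(abs(y_loop - y_vec));
fprintf('largest difference: %g\n', max_diff);
```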
9. Compiler
- A MATLAB compiler, mcc, is available.
- It compiles m-files into C code, object libraries, or stand-alone
  executables.
- A stand-alone executable generated with mcc can run on compatible
  platforms without an installed MATLAB or a MATLAB license.
- Many MATLAB general and toolbox licenses are available at BU. On
  special occasions, MATLAB access may be denied if all licenses are
  checked out. Running a stand-alone executable requires NO license and
  no waiting.
- Some compiled codes may run more efficiently than m-files because they
  are not run in interpretive mode.
- A stand-alone executable enables you to share your code without
  revealing the source.
- http://scv.bu.edu/documentation/tutorials/MATLAB/compiler/
10. Is Parallel MATLAB the way to go?
- Even in the best case, it can't compete with C/Fortran with MPI/OpenMP.
- It is an acceptable compromise if:
  - Converting your MATLAB code to C/Fortran requires too big an effort
    and you don't have the time or inclination to do that.
  - A big job typically takes hours, rather than days, to run on a
    single processor.
  - You strongly prefer the relative ease and efficiency of programming
    a research code in MATLAB.
  - The appropriate multiprocessing MATLAB paradigm is at your disposal.
11. Multiprocessing MATLAB
1. MatlabMPI
2. pMatlab
3. SCV's parallel MATLAB
4. Distributed Computing Toolbox
5. Star-P
12. 1: MatlabMPI
- MatlabMPI is a parallel MATLAB package developed at Lincoln Lab in
  Lexington, MA.
- It does not require or make use of a high speed interconnect for
  communication among cluster nodes. Instead, it relies on the network
  file system being visible to, or shared by, all processors. With this,
  message passing is achieved through I/O to the file system.
- It has a small basic set of utility routines that mimic those of the
  Message Passing Interface (MPI) in functionality. While the MPI
  routines for sending and receiving messages operate via a high speed
  interconnect, the routines in this package accomplish the same tasks
  via I/O.
- It is good for embarrassingly parallel codes that require only
  infrequent communications.
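
The idea of passing messages through a shared file system can be sketched with plain save and load. This is an illustration of the concept only and is not MatlabMPI's actual API. The file names, the flag-file convention, and the polling interval are all assumptions made for the example, and both the "send" and the "receive" run in one MATLAB session here, whereas in practice they would run in two separate processes.

```matlab
% Hypothetical shared directory; on a cluster this would be a path on the
% network file system that every node can see.
shared_dir = tempname;
mkdir(shared_dir);

msg_file  = fullfile(shared_dir, 'msg_from_0_to_1.mat');
flag_file = fullfile(shared_dir, 'msg_from_0_to_1.ready');

% "Send" (rank 0): write the data, then create an empty flag file to
% signal that the message is complete.
data = magic(4);
save(msg_file, 'data');
fid = fopen(flag_file, 'w');
fclose(fid);

% "Receive" (rank 1): poll until the flag appears, then read the data.
while exist(flag_file, 'file') ~= 2
    pause(0.1);
end
msg = load(msg_file);
received = msg.data;

% Clean up the message files.
delete(msg_file);
delete(flag_file);
rmdir(shared_dir);
```

Every message costs at least one file write, one file read, and some polling, which is why this approach suits codes that communicate rarely.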
13. 2: pMatlab
- pMatlab is a parallel MATLAB package also developed at Lincoln Lab in
  Lexington, MA. It is built on top of MatlabMPI.
- As such, it inherits all the properties of MatlabMPI. It can be
  thought of as providing higher-level wrapper functions that insulate
  programmers from having to deal with lower-level function calls to
  perform parallel tasks.
- It is good for embarrassingly parallel algorithms with a very modest
  amount of communication.
14. 3: SCV's parallel MATLAB
- SCV has a very simple parallel MATLAB package that, like MatlabMPI, is
  based on the shared network file system concept.
- It is subject to most of the same restrictions as MatlabMPI. However,
  there are two departures:
  1. There is only one batch script, plus two function m-files to be
     inserted into your code.
  2. These include a barrier function to synchronize work performed on
     multiprocessing nodes. This is typically required for codes that
     contain serial and parallel sections.
- It is good for embarrassingly parallel algorithms with a very modest
  amount of communication.
- Email or call Kadin if you want to use any of the above three
  packages. An example is given next.
15. SCV parallel MATLAB Example 1
- This example demonstrates the use of multiprocessors to compute
  C = A + B (matrix size is N x N).
- The decomposition is along columns; the matrices can also be
  decomposed along rows, or both.
- C(:, range(rank)) = A(:, range(rank)) + B(:, range(rank))
- In the above, range(rank) is the range of columns as a function of the
  processor rank:
  range(rank) = rank*n+1 : rank*n+n   (0 <= rank <= nproc-1; n = N/nproc)
- For simplicity, N is assumed to be divisible by nproc.

    N = 8;                                       % size of global matrix A
    I = (1:N)';                                  % generate column vector
    A = I(:, ones(1,N))*10 + I(:, ones(1,N))';   % generate A on current (and all) processes
    [pbegin, pend, rank, nproc] = parallel_info(N);   % query for parallel info
    % rank (0 <= rank <= nproc-1) is the current MATLAB process
    n = N/nproc;                                 % distributed column size of matrix B
    b = I(:, ones(1,n))*10;                      % generate N x n matrix b (local B)
    c = A(:, pbegin:pend) + b;                   % compute local c from A and local b
    save matrix_c                                % each current dir has its own individual copy of c
16. SCV parallel MATLAB Example 1 (cont'd)
- Run barrier to synchronize all processors.
- Finally, perform a (serial) gather of c from all ranks into C on rank 0.

    ierr = barrier(rank, nproc);
    if (rank == 0)
        C = zeros(N);        % allocate C
        C(:,1:n) = c;        % starts with c from rank 0, which is already in memory
        for k = 1:nproc-1
            i = n*k+1;       % beginning location to which c will be inserted
            j = n*k+n;       % end location
            fk = ['../' num2str(k) '/matrix_c'];   % file name of c on process k
            load(fk, 'c');
            C(:,i:j) = c;
        end
        save('../matrixC', 'C');   % save C to parent dir
    end
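
The column decomposition and the gather step can be checked on a single MATLAB process. The sketch below makes several assumptions. It takes the operation to be the element-wise sum C = A + B, it replaces the SCV functions parallel_info and barrier with a plain loop over ranks, and it fixes nproc at 4. The test matrices are illustrative.

```matlab
N = 8;                            % global matrix size
nproc = 4;                        % hypothetical number of processes
n = N / nproc;                    % columns per process

I = (1:N)';
A = I(:, ones(1,N))*10 + I(:, ones(1,N))';   % illustrative test data
B = I(:, ones(1,N))*10;

C = zeros(N);
for rank = 0:nproc-1              % each pass plays the role of one process
    cols = rank*n+1 : rank*n+n;   % range(rank)
    c = A(:, cols) + B(:, cols);  % local block owned by this rank
    C(:, cols) = c;               % gather step: place the block into C
end

ok = isequal(C, A + B);           % true if the blocks tile C exactly
disp(ok)
```

Each rank touches only its own n columns, so the blocks never overlap and no communication is needed until the final gather.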
17. Parallel MATLAB Example 1 batch script

    #!/bin/csh
    # Example SGE script for running parallel MATLAB jobs on Katana
    # Submit job with the command: qsub batch_sge.scv
    # "#$ qsub_option" is interpreted by qsub as if "qsub_option" was
    # passed to qsub on the command line.
    # Set hard runtime (wallclock) limit, default is 2 hours.
    # Format: -l h_rt=HH:MM:SS
    #$ -l h_rt=2:00:00
    # Merge stderr into the stdout file to reduce clutter.
    #$ -j y
    # Invoke Parallel Environment for N processors.
    # No default value, it must be specified.
    # For MATLAB apps, DO NOT select omp
    #$ -pe 1_per_node 4
    # end of qsub options
    # By default, the script is executed in the directory from which it
    # was submitted with qsub. You might want to change directories
    # before invoking mpirun ...
    cd $PWD
    # Running the following script generates multiple concurrent copies
    # of MATLAB.
    # Use addpath in startup.m to add path to all necessary matlab m-files
    # batch_sge and sge_matlab should live in either $HOME/bin or $PWD
    sge_matlab $PWD scv_matlab_example.m
18. SCV parallel MATLAB Example 2
The airplane is represented with patches of
quadrilateral elements and the integral
formulation is discretized to yield
? is the known Neumann boundary condition. f is
the unknown to be solved for.
19. Parallel MATLAB Example 2: Geometry
20. Parallel MATLAB Example 2: Timings
21. How slow is MATLAB compared with C?
22. 4: Distributed Computing Toolbox
- The MathWorks has a Distributed Computing Toolbox (DCT), a parallel
  MATLAB package that utilizes the cluster's high speed interconnect for
  inter-processor communications.
- At present, DCT is not available on SCV machines.
23. 5: Star-P
- Star-P is a parallel MATLAB product of Interactive Supercomputing,
  Inc. It bears some resemblance to the pMatlab package in that it
  enables parallel MATLAB while shielding programmers from most of the
  lower-level parallel programming.
- Like the MathWorks DCT, Star-P is a parallel MATLAB package that
  utilizes a high speed interconnect for inter-processor communications.
- At present, this package is not available on SCV machines.
24. Useful SCV Info
- SCV home page (http://scv.bu.edu/)
- Resource Applications (https://acct.bu.edu/SCF)
- Help
  - Web-based tutorials (http://scv.bu.edu/)
    (MPI, OpenMP, MATLAB, IDL, Graphics tools)
  - HPC consultations by appointment
    - Kadin Tseng (kadin_at_bu.edu)
    - Doug Sondak (sondak_at_bu.edu)
  - help_at_twister.bu.edu, help_at_cootie.bu.edu