Parallel Programming Orientation presentation

1
Parallel Programming Orientation
2
Agenda

Parallel jobs
Paradigms to parallelize algorithms
Profiling and compiler optimization
Implementations to parallelize code
OpenMP
MPI
Queuing
Job queuing
Integrating Parallel Programs
Questions and Answers

3
Traditional Ad-Hoc Linux Cluster

Full Linux install to disk; load into memory
Manual and slow: 5-30 minutes
Full set of disparate daemons, services,
user/password, and host access setup
Basic parallel shell with complex glue scripts
to run jobs
Monitoring and management added as isolated tools

4
Cluster Virtualization Architecture Realized

Minimal in-memory OS with single daemon rapidly
deployed in seconds - no disk required
Less than 20 seconds
Virtual, unified process space enables intuitive
single sign-on, job submission
Effortless job migration to nodes
Monitor and manage efficiently from the Master
Single System Install
Single Process Space
Shared cache of the cluster state
Single point of provisioning
Better performance due to lightweight nodes
No version skew, which is inherently more reliable

5
Just a Primer

Only a brief introduction is provided here. Many
other in-depth tutorials are available on the web
and in published sources.
http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
https://computing.llnl.gov/?set=training&page=index

6
Parallel Code Primer

Paradigms for writing parallel programs depend
upon the application
SIMD (single-instruction multiple-data)
MIMD (multiple-instruction multiple-data)
MISD (multiple-instruction single-data)
SIMD-style parallelism will be presented here, as
it is a commonly used template
A single application source is compiled to
perform operations on different sets of data
(the Single Program Multiple Data, or SPMD, model)
The data is read by the different processes or
passed between them via messages (hence MPI, the
Message Passing Interface)
Contrast this with shared memory or OpenMP, where
data is shared locally via memory
The MPI implementation can optimize communication
on the local host; however, the program is still
written using message-passing constructs

7
Explicitly Parallel Programs

Different paradigms exist for parallelizing
programs
Shared memory
OpenMP
Sockets
PVM
Linda
MPI
Most distributed parallel programs are now
written using MPI
Different options for MPI stacks: MPICH,
OpenMPI, HP, Intel
ClusterWare comes integrated with customized
versions of MPICH and OpenMPI

8
Example Code

Calculate π through numerical integration
Compute π by integrating f(x) = 4/(1 + x^2) from
0 to 1
The function is four times the derivative of
arctan(x), so the integral is 4 arctan(1) = π
See source code

9
Compiling and running code

Set n = 400,000,000
gcc -o cpi-serial cpi-serial.c
time ./cpi-serial
Process 0
pi is approximately 3.1415926535895520, Error is
0.0000000000002411
real 0m11.009s
user 0m11.007s
sys 0m0.000s
gcc -g -pg -o cpi-serial_prof cpi-serial.c
time ./cpi-serial_prof
Process 0
pi is approximately 3.1415926535895520, Error is
0.0000000000002411
real 0m11.012s
user 0m11.010s
sys 0m0.000s
ls -ltra

10
Profiling

The -g flag includes debugging information in the
binary. Useful for gdb tracing of an application
and for profiling
The -pg flag generates code to write profile
information
gprof cpi-serial_prof gmon.out
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self                 self     total
 time   seconds   seconds      calls  ns/call  ns/call  name
74.48      2.85      2.85                               main
23.95      3.77      0.92  400000000     2.29     2.29  f
 2.37      3.86      0.09                               frame_dummy
Call graph (explanation follows)
granularity: each sample hit covers 2 byte(s) for
0.26% of 3.86 seconds
index % time self children called name
<spontaneous>

11
Profiling Tips

Code should be profiled using a realistic data set
Contrast the call graphs of n = 100 versus
n = 400,000,000
Profiling can give tips about where to optimize
the current algorithm, but it can't suggest
alternative (better) algorithms
e.g. a Monte Carlo algorithm to calculate π
Amdahl's Law
The speedup that parallelization achieves is
limited by the serial part of the code

12
OpenMP Introduction

Parallelization using shared memory in a single
machine
A portion of the code is forked into threads on
the machine to parallelize it
i.e. not distributed parallelization
Done using pragmas in the source code. The compiler
must support OpenMP (gcc 4, Intel, etc.)
gcc -fopenmp -o cpi-openmp cpi-openmp.c
See Source Code
Profiling can add overhead to the resulting
executable
The time command can be used to measure
improvement
The number of threads is selected at runtime with
the OMP_NUM_THREADS environment variable

13
Scaling with OpenMP

time OMP_NUM_THREADS=1 ./cpi-openmp
Process 0
pi is approximately 3.1415926535895520, Error is
0.0000000000002411
real 0m10.583s
user 0m10.581s
sys 0m0.001s
time OMP_NUM_THREADS=2 ./cpi-openmp
Process 0
Process 1
pi is approximately 3.1415926535900218, Error is
0.0000000000002287
real 0m5.295s
user 0m11.297s
sys 0m0.000s
time OMP_NUM_THREADS=4 ./cpi-openmp
real 0m2.650s
user 0m10.586s

14
Scaling with OpenMP

Code is easy to parallelize
Good scaling is seen up to 8 processors; the kink
in the curve is expected

15
Role of the Compiler

Parallelization using shared memory in a single
machine
i.e. not distributed parallelization
Done using pragmas in the source code. The compiler
must support OpenMP (gcc 4, Intel, etc.)
gcc -fopenmp -o cpi-openmp cpi-openmp.c
Profiling can add overhead to the resulting
executable
The time command can be used to measure
improvement
The number of threads is selected at runtime with
the OMP_NUM_THREADS environment variable

16
GCC versus Intel C

time OMP_NUM_THREADS=1 ./cpi-openmp
real 0m10.583s
user 0m10.581s
sys 0m0.001s
gcc -O3 -fopenmp -o cpi-openmp-gcc-O3 cpi-openmp.c
time OMP_NUM_THREADS=1 ./cpi-openmp-gcc-O3
Process 0
pi is approximately 3.1415926535895520, Error is
0.0000000000002411
real 0m3.154s
user 0m3.143s
sys 0m0.011s
time OMP_NUM_THREADS=8 ./cpi-openmp-gcc-O3
real 0m0.399s
user 0m3.181s
sys 0m0.001s

17
Compiler Timings

19
OpenMP Summary

OpenMP provides a mechanism to parallelize within
a single machine
Shared memory and variables are handled
automatically
With an appropriate compiler, OpenMP can provide
significant speedups
Coupled with large core count SMP machines,
OpenMP could be all of the parallelization
required
GPU programming is similar to the OpenMP model

21
Running MPI Code

Binaries are executed simultaneously
on the same machine or on different machines
After the binaries start running, the
MPI_COMM_WORLD communicator is established
Any data to be transferred must be explicitly
specified by the programmer
Hooks exist for a number of languages
E.g. Python (https://computing.llnl.gov/code/pdf/pyMPI.pdf)

22
Example MPI Source

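cpi.c calculates π using MPI in C

#include "mpi.h"
#include <stdio.h>
#include <math.h>

double f( double );

double f( double a )
{
    return (4.0 / (1.0 + a*a));
}

int main( int argc, char *argv[] )
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);
    fprintf(stderr, "Process %d on %s\n", myid, processor_name);

    n = 0;
    while (!done) {
        if (myid == 0) {
            /*
            printf("Enter the number of intervals (0 quits) ");
            scanf("%d", &n);
            */
            if (n == 0) n = 100; else n = 0;
            startwtime = MPI_Wtime();
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0)
            done = 1;
        else {
            h = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += f(x);
            }
            mypi = h * sum;
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);
            if (myid == 0) {
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
            }
        }
    }
    MPI_Finalize();
    return 0;
}

Compute pi by integrating f(x) = 4/(1 + x^2)
#include "mpi.h": system include file which defines
the MPI functions
MPI_Init: initializes the MPI execution environment
MPI_Comm_size: determines the size of the group
associated with a communicator
MPI_Comm_rank: determines the rank of the calling
process in the communicator
MPI_Get_processor_name: gets the name of the processor
if (myid == 0): differentiates actions based on rank.
Only the master performs this action
MPI_Wtime: MPI built-in function to get a time value
MPI_Bcast: broadcasts 1 MPI_INT from n on the process
with rank 0 to all other processes of the group
for loop: each worker does this loop and increments
the counter by the number of processors (versus
dividing the range, which risks an off-by-one error)
MPI_Reduce: applies the MPI_SUM function to 1
MPI_DOUBLE at mypi on all workers in MPI_COMM_WORLD,
producing a single value at pi on rank 0
Only rank 0 outputs the value of pi
MPI_Finalize: terminates the MPI execution environment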
23
Other Common MPI Functions

MPI_Send, MPI_Recv
Blocking send and receive between two specific
ranks
MPI_Isend, MPI_Irecv
Non-blocking send and receive between two
specific ranks
man pages exist for the MPI functions
Poorly written programs can suffer from poor
communication efficiency (e.g. stair-step) or
lost data if the system buffer fills before the
blocking send or receive that matches a
non-blocking receive or send is initiated
Care should be used when creating temporary files,
as multiple processes may be running on the same
host and overwriting the same temporary file
(include the rank in the file name, inside a
unique temporary directory per simulation)

24
Compiling MPICH programs

mpicc, mpiCC, mpif77, mpif90 are used to
automatically compile code and link in the
correct MPI libraries from /usr/lib64/MPICH
Environment variables can be used to set the
compiler
CC, CPP, FC, F90
Command line options can also set the compiler
-cc, -cxx, -fc, -f90
GNU, PGI, and Intel compilers are supported

25
Running MPICH programs

mpirun is used to launch MPICH programs
Dynamic allocation can be done when using the -np
flag
Mapping is also supported when using the -map
flag
If Infiniband is installed, the interconnect
fabric can be chosen using the -machine flag
-machine p4
-machine vapi

26
Scaling with MPI

which mpicc
/usr/bin/mpicc
mpicc -show -o cpi-mpi cpi-mpi.c
gcc -L/usr/lib64/MPICH/p4/gnu -I/usr/include -o
cpi-mpi cpi-mpi.c -lmpi -lbproc
mpicc -o cpi-mpi cpi-mpi.c
time mpirun -np 1 ./cpi-mpi
Process 0 on scyld.localdomain
real 0m11.198s
user 0m11.187s
sys 0m0.010s
time mpirun -np 2 ./cpi-mpi
Process 0 on scyld.localdomain
Process 1 on n0
real 0m6.486s
user 0m5.510s
sys 0m0.009s
time mpirun -map -1:-1:-1:-1:-1:-1:-1:-1:0:0:0:0:0:0:0:0 ./cpi-mpi

27
Environment Variable Options

Additional environment variable control
NP: The number of processes requested, but not
the number of processors. As in the example
earlier in this section, NP=4 ./a.out will run
the MPI program a.out with 4 processes.
ALL_CPUS: Set the number of processes to the
number of CPUs available to the current user.
Similar to the example above, --all-cpus=1
./a.out would run the MPI program a.out on all
available CPUs.
ALL_NODES: Set the number of processes to the
number of nodes available to the current user.
Similar to the ALL_CPUS variable, but you get a
maximum of one CPU per node. This is useful for
running a job per node instead of per CPU.
ALL_LOCAL: Run every process on the master node;
used for debugging purposes.
NO_LOCAL: Don't run any processes on the master
node.
EXCLUDE: A colon-delimited list of nodes to be
avoided during node assignment.
BEOWULF_JOB_MAP: A colon-delimited list of
nodes. The first node listed will be the first
process (MPI Rank 0) and so on.

28
Compiling and Running OpenMPI programs

The env-modules package allows users to change
their environment variables according to
predefined files
module avail
module load openmpi/gnu
GNU, PGI, and Intel compilers are supported
mpicc, mpiCC, mpif77, mpif90 are used to
automatically compile code and link in the
correct MPI libraries from /opt/scyld/openmpi
mpirun is used to run code
Interconnect can be selected at runtime
-mca btl openib,tcp,sm,self
-mca btl udapl,tcp,sm,self

29
Compiling and Running OpenMPI programs

What env-modules does
Set user environment prior to compiling
export PATH=/opt/scyld/openmpi/gnu/bin:${PATH}
mpicc, mpiCC, mpif77, mpif90 are used to
automatically compile code and link in the
correct MPI libraries from /opt/scyld/openmpi
Environment variables can be used to set the
compiler
OMPI_CC, OMPI_CXX, OMPI_F77, OMPI_FC
Prior to running, PATH and LD_LIBRARY_PATH should
be set
module load openmpi/gnu
/opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out
OR
export PATH=/opt/scyld/openmpi/gnu/bin:${PATH}
export MANPATH=/opt/scyld/openmpi/gnu/share/man
export LD_LIBRARY_PATH=/opt/scyld/openmpi/gnu/lib:${LD_LIBRARY_PATH}
/opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out

30
Scaling with MPI Implementations
31
Scaling with MPI Implementations

Infiniband allows wider scaling
Performance differs between MPICH and
OpenMPI
A little artificial because it's only two
physical machines

32
Scaling with MPI Implementations

Larger problems would allow continued scaling

33
MPI Summary

MPI provides a mechanism to parallelize in a
distributed fashion
Localhost optimization is done on a shared
memory machine
Shared variables are explicitly handled by the
developer
The trade-off between CPU and I/O can determine
the performance characteristics
Hybrid programming models are possible
MPI code with OpenMP sections
MPI code with GPU calls

34
Queuing

How are resources allocated among multiple users
and/or groups?
Statically, by using bpctl user and group
permissions
ClusterWare supports a variety of queuing
packages
TaskMaster (advanced MOAB policy-based scheduler
integrated with ClusterWare)
Torque
SGE

35
Interacting with Torque

To submit a job:
qsub script.sh
Example script.sh:
#!/bin/sh
#PBS -j oe
#PBS -l nodes=4
cd $PBS_O_WORKDIR
hostname
qsub does not accept arguments for script.sh.
All executable arguments must be included in the
script itself
Administrators can create a qapp script that
takes user arguments, creates script.sh with the
user arguments embedded, and runs qsub
script.sh

36
Interacting with Torque

Other commands
qstat: Status of queue server and jobs
qdel: Remove a job from the queue
qhold, qrls: Hold and release a job in the queue
qmgr: Administrator command to configure
pbs_server
/var/spool/torque/server_name should match the
hostname of the head node
/var/spool/torque/mom_priv/config: file to
configure pbs_mom
$usecp *:/home /home indicates that pbs_mom
should use cp rather than rcp or scp to
relocate the stdout and stderr files at the end
of execution
pbsnodes: Administrator command to monitor the
status of the resources
qalter: Administrator command to modify the
parameters of a particular job (e.g. requested
time)

37
Other options to qsub

Options that can be included in a script (with
the #PBS directive) or on the qsub command line
Join output and error files: #PBS -j oe
Request resources: #PBS -l nodes=2:ppn=2
Request walltime: #PBS -l walltime=24:00:00
Define a job name: #PBS -N jobname
Send mail at job events: #PBS -m be
Assign job to an account: #PBS -A account
Export current environment variables: #PBS -V
To start an interactive queue job use
qsub -I for Torque
qrsh for SGE

38
Queue script case studies

qapp script

#!/bin/bash
# Usage: qapp arg1 arg2
debug=0
opt1="${1}"
opt2="${2}"
if [[ "${opt2}" == "" ]] ; then
    echo "Not enough arguments"
    exit 1
fi
cat > app.sh << EOF
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd \$PBS_O_WORKDIR
app $opt1 $opt2
EOF
if [[ "${debug}" -lt 1 ]] ; then
    qsub app.sh
fi
if [[ "${debug}" -eq 0 ]] ; then
    /bin/rm -f app.sh
fi

Be careful to escape special characters in the
here-document section (e.g. \$) so they are
written to app.sh instead of being expanded by qapp

39
Queue script case studies

Using local scratch

#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd $PBS_O_WORKDIR
tmpdir=/scratch/$USER/$PBS_JOBID
/bin/mkdir -p $tmpdir
rsync -a ./ $tmpdir
cd $tmpdir
$pathto/app arg1 arg2
cd $PBS_O_WORKDIR
rsync -a $tmpdir/ .
/bin/rm -fr $tmpdir

40
Queue script case studies

Using local scratch for MPICH parallel jobs
pbsdsh is a Torque command

#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
cd $PBS_O_WORKDIR
tmpdir=/scratch/$USER/$PBS_JOBID
/usr/bin/pbsdsh -u /bin/mkdir -p $tmpdir
/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR; rsync -a ./ $tmpdir"
cd $tmpdir
mpirun -machine vapi $pathto/app arg1 arg2
cd $PBS_O_WORKDIR
/usr/bin/pbsdsh -u rsync -a $tmpdir/ $PBS_O_WORKDIR
/usr/bin/pbsdsh -u /bin/rm -fr $tmpdir

41
Queue script case studies

Using local scratch for OpenMPI parallel jobs
Do a module load openmpi/gnu prior to running
qsub
OR explicitly include a module load openmpi/gnu
in the script itself

#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
#PBS -V
cd $PBS_O_WORKDIR
tmpdir=/scratch/$USER/$PBS_JOBID
/usr/bin/pbsdsh -u /bin/mkdir -p $tmpdir
/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR; rsync -a ./ $tmpdir"
cd $tmpdir
/usr/openmpi/gnu/bin/mpirun -np `cat $PBS_NODEFILE | wc -l` -mca btl openib,sm,self $pathto/app arg1 arg2
cd $PBS_O_WORKDIR
/usr/bin/pbsdsh -u rsync -a $tmpdir/ $PBS_O_WORKDIR
/usr/bin/pbsdsh -u /bin/rm -fr $tmpdir

42
Other considerations

A queue script need not be a single command
Multiple steps can be performed from a single
script
Guaranteed resources
Jobs should typically be a minimum of 2 minutes
Pre-processing and post-processing can be done
from the same script using the local scratch
space
If configured, it is possible to submit
additional jobs from a running queued job
To remove multiple jobs from the queue
qstat | grep " [RQ] " | awk '{print $1}' | xargs qdel

43
Integrating Parallel Programs

The scheduler only keeps track of available
resources
It doesn't monitor how the resources are used
The onus is on the user to request and use the
correct resources
OpenMP: be sure to request multiple processors
on the same machine
Torque: #PBS -l nodes=1:ppn=x
SGE: Correct PE (parallel environment) submission
MPI: be sure to use the machines that have
been assigned by the queue system
Torque: MPICH and OpenMPI mpirun will do the
correct thing. $PBS_NODEFILE contains a list of
assigned hosts
SGE: $PE_HOSTFILE contains a list of assigned
hosts. OpenMPI's mpirun may need to be recompiled

44
Integrating Parallel Programs

Be careful about task pinning (taskset)
Different jobs may assume the same CPU set,
resulting in oversubscription of some cores while
other cores sit free
In a shared environment, not using task pinning
can be easier at a slight trade-off in
performance
Make sure that the same MPI implementation and
compiler combination is used to run the code as
was used to compile and link it

45
Questions??
