Parallel Programming Orientation - PowerPoint PPT Presentation

About This Presentation
Title:

Parallel Programming Orientation

Slides: 46
Provided by: Maryan107
1
Parallel Programming Orientation
2
Agenda
  • Parallel jobs
  • Paradigms to parallelize algorithms
  • Profiling and compiler optimization
  • Implementations to parallelize code
  • OpenMP
  • MPI
  • Queuing
  • Job queuing
  • Integrating Parallel Programs
  • Questions and Answers

3
Traditional Ad-Hoc Linux Cluster
  • Full Linux install to disk, then load into memory
  • Manual and slow: 5-30 minutes
  • Full set of disparate daemons, services,
    user/password, host access setup
  • Basic parallel shell with complex glue scripts
    to run jobs
  • Monitoring and management added as isolated tools

4
Cluster Virtualization Architecture Realized
  • Minimal in-memory OS with single daemon rapidly
    deployed in seconds - no disk required
  • Less than 20 seconds
  • Virtual, unified process space enables intuitive
    single sign-on, job submission
  • Effortless job migration to nodes
  • Monitor and manage efficiently from the Master
  • Single System Install
  • Single Process Space
  • Shared cache of the cluster state
  • Single point of provisioning
  • Better performance due to lightweight nodes
  • No version skew, which is inherently more reliable

5
Just a Primer
  • Only a brief introduction is provided here. Many
    other in-depth tutorials are available on the web
    and in published sources.
  • http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
  • https://computing.llnl.gov/?set=training&page=index

6
Parallel Code Primer
  • Paradigms for writing parallel programs depend
    upon the application
  • SIMD (single-instruction multiple-data)
  • MIMD (multiple-instruction multiple-data)
  • MISD (multiple-instruction single-data)
  • SPMD (single-program multiple-data), a common
    form of MIMD, will be presented here as it is a
    commonly used template
  • A single application source is compiled to
    perform operations on different sets of data
  • The data is read by the different threads or
    passed between threads via messages (hence MPI,
    the Message Passing Interface)
  • Contrast this with shared memory or OpenMP, where
    data is shared locally via memory
  • Optimizations in the MPI implementation can
    perform localhost optimization; however, the
    program is still written using a message passing
    construct

7
Explicitly Parallel Programs
  • Different paradigms exist for parallelizing
    programs
  • Shared memory
  • OpenMP
  • Sockets
  • PVM
  • Linda
  • MPI
  • Most distributed parallel programs are now
    written using MPI
  • Different options for MPI stacks: MPICH,
    OpenMPI, HP, Intel
  • ClusterWare comes integrated with customized
    versions of MPICH and OpenMPI

8
Example Code
  • Calculate π through numerical integration
  • Compute π by integrating f(x) = 4/(1 + x^2) from
    0 to 1
  • f(x) is the derivative of 4 arctan(x), so the
    integral equals 4 arctan(1) = π
  • See source code
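
The source file itself is not reproduced in this transcript. As a minimal sketch of the integration described above, assuming a midpoint rule over n equal-width intervals (the function name approximate_pi is hypothetical, not taken from the slides):

```c
/* Integrand: f(x) = 4 / (1 + x^2); its integral over [0, 1] is pi. */
static double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Midpoint-rule approximation of pi using n equal-width intervals.
   Hypothetical helper for illustration; n must be positive. */
double approximate_pi(long n)
{
    double h = 1.0 / (double)n;      /* width of each interval */
    double sum = 0.0;
    long i;

    for (i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);   /* midpoint of interval i */
        sum += f(x);
    }
    return h * sum;
}
```

Calling approximate_pi from a small main and printing the result with printf("%.16f\n", ...) would give a serial program comparable in spirit to cpi-serial; larger n reduces the error at the cost of run time.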

9
Compiling and running code
  • Set n = 400,000,000

    $ gcc -o cpi-serial cpi-serial.c
    $ time ./cpi-serial
    Process 0
    pi is approximately 3.1415926535895520, Error is 0.0000000000002411
    real    0m11.009s
    user    0m11.007s
    sys     0m0.000s
    $ gcc -g -pg -o cpi-serial_prof cpi-serial.c
    $ time ./cpi-serial_prof
    Process 0
    pi is approximately 3.1415926535895520, Error is 0.0000000000002411
    real    0m11.012s
    user    0m11.010s
    sys     0m0.000s
    $ ls -ltra

10
Profiling
  • -g flag includes debugging information in the
    binary. Useful for gdb tracing of an application
    and for profiling
  • -pg flag generates code to write profile
    information
  • gprof cpi-serial_prof gmon.out

    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self                 self     total
     time   seconds   seconds      calls  ns/call  ns/call  name
     74.48      2.85     2.85                               main
     23.95      3.77     0.92  400000000     2.29     2.29  f
      2.37      3.86     0.09                               frame_dummy

    Call graph (explanation follows)

    granularity: each sample hit covers 2 byte(s) for
    0.26% of 3.86 seconds

    index % time    self  children    called     name
                                                 <spontaneous>

11
Profiling Tips
  • Code should be profiled using a realistic data set
  • Contrast the call graphs of n = 100 versus
    n = 400,000,000
  • Profiling can give tips about where to optimize
    the current algorithm, but it can't suggest
    alternative (better) algorithms
  • e.g. Monte Carlo algorithm to calculate π
  • Amdahl's Law
  • The speedup parallelization achieves is limited
    by the serial part of the code
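
Amdahl's Law can be made concrete with a one-line formula: if a fraction p of the serial run time can be parallelized across n workers, the speedup is at most 1 / ((1 - p) + p / n). The sketch below is illustrative only; the function name amdahl_speedup and the example fraction 0.95 are assumptions, not figures measured from the cpi example.

```c
/* Amdahl's Law: upper bound on speedup with n workers when only
   parallel_fraction (between 0 and 1) of the serial run time can be
   parallelized. Hypothetical helper for illustration. */
double amdahl_speedup(double parallel_fraction, int n)
{
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / (double)n);
}
```

With an assumed parallel fraction of 0.95, amdahl_speedup(0.95, 8) is roughly 5.9, and no number of workers can push the speedup past 1 / (1 - 0.95) = 20.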

12
OpenMP Introduction
  • Parallelization using shared memory in a single
    machine
  • Portion of the code is forked on the machine to
    parallelize
  • i.e. not distributed parallelization
  • Done using pragmas in the source code. Compiler
    must support OpenMP (gcc 4, Intel, etc.)
  • gcc -fopenmp -o cpi-openmp cpi-openmp.c
  • See Source Code
  • Profiling can add overhead to resulting
    executable
  • time can be used to measure improvement
  • Runtime selection of the number of threads using
    OMP_NUM_THREADS environment variable

13
Scaling with OpenMP
    $ time OMP_NUM_THREADS=1 ./cpi-openmp
    Process 0
    pi is approximately 3.1415926535895520, Error is 0.0000000000002411
    real    0m10.583s
    user    0m10.581s
    sys     0m0.001s
    $ time OMP_NUM_THREADS=2 ./cpi-openmp
    Process 0
    Process 1
    pi is approximately 3.1415926535900218, Error is 0.0000000000002287
    real    0m5.295s
    user    0m11.297s
    sys     0m0.000s
    $ time OMP_NUM_THREADS=4 ./cpi-openmp
    real    0m2.650s
    user    0m10.586s
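
The wall-clock ("real") times above can be turned into speedup and parallel efficiency figures. A small sketch, with hypothetical helper names that are not part of the original material:

```c
/* Speedup: serial wall-clock time divided by parallel wall-clock time. */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

/* Parallel efficiency: speedup divided by the number of workers;
   1.0 means perfect linear scaling. */
double efficiency(double t_serial, double t_parallel, int workers)
{
    return speedup(t_serial, t_parallel) / (double)workers;
}
```

Using the timings listed on this slide, speedup(10.583, 5.295) is about 2.00 for two threads and speedup(10.583, 2.650) is about 3.99 for four threads, so efficiency stays above 99% in both cases.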

14
Scaling with OpenMP
  • Code is easy to parallelize
  • Good scaling is seen up to 8 processors, kink in
    the curve is expected

15
Role of the Compiler
  • Parallelization using shared memory in a single
    machine
  • i.e. not distributed parallelization
  • Done using pragmas in the source code. Compiler
    must support OpenMP (gcc 4, Intel, etc.)
  • gcc -fopenmp -o cpi-openmp cpi-openmp.c
  • Profiling can add overhead to resulting
    executable
  • time can be used to measure improvement
  • Runtime selection of the number of threads using
    OMP_NUM_THREADS environment variable

16
GCC versus Intel C
    $ time OMP_NUM_THREADS=1 ./cpi-openmp
    real    0m10.583s
    user    0m10.581s
    sys     0m0.001s
    $ gcc -O3 -fopenmp -o cpi-openmp-gcc-O3 cpi-openmp.c
    $ time OMP_NUM_THREADS=1 ./cpi-openmp-gcc-O3
    Process 0
    pi is approximately 3.1415926535895520, Error is 0.0000000000002411
    real    0m3.154s
    user    0m3.143s
    sys     0m0.011s
    $ time OMP_NUM_THREADS=8 ./cpi-openmp-gcc-O3
    real    0m0.399s
    user    0m3.181s
    sys     0m0.001s

17
Compiler Timings
18
Explicitly Parallel Programs
  • Different paradigms exist for parallelizing
    programs
  • Shared memory
  • OpenMP
  • Sockets
  • PVM
  • Linda
  • MPI
  • Most distributed parallel programs are now
    written using MPI
  • Different options for MPI stacks: MPICH,
    OpenMPI, HP, Intel
  • ClusterWare comes integrated with customized
    versions of MPICH and OpenMPI

19
OpenMP Summary
  • OpenMP provides a mechanism to parallelize within
    a single machine
  • Shared memory and variables are handled
    automatically
  • Performance, with an appropriate compiler, can
    provide significant speedups
  • Coupled with large core count SMP machines,
    OpenMP could be all of the parallelization
    required
  • GPU programming is similar to the OpenMP model

20
Explicitly Parallel Programs
  • Different paradigms exist for parallelizing
    programs
  • Shared memory
  • OpenMP
  • Sockets
  • PVM
  • Linda
  • MPI
  • Most distributed parallel programs are now
    written using MPI
  • Different options for MPI stacks: MPICH,
    OpenMPI, HP, Intel
  • ClusterWare comes integrated with customized
    versions of MPICH and OpenMPI

21
Running MPI Code
  • Binaries are executed simultaneously
  • on the same machine or different machines
  • After the binaries start running, the
    MPI_COMM_WORLD is established
  • Any data to be transferred must be explicitly
    determined by the programmer
  • Hooks exist for a number of languages
  • E.g. Python (https://computing.llnl.gov/code/pdf/pyMPI.pdf)

22
Example MPI Source
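  • cpi.c calculates π using MPI in C

    #include "mpi.h"
    #include <stdio.h>
    #include <math.h>

    double f( double );
    double f( double a )
    {
        return (4.0 / (1.0 + a*a));
    }

    int main( int argc, char *argv[] )
    {
        int done = 0, n = 0, myid, numprocs, i;
        double PI25DT = 3.141592653589793238462643;
        double mypi, pi, h, sum, x;
        double startwtime = 0.0, endwtime;
        int namelen;
        char processor_name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Get_processor_name(processor_name, &namelen);
        fprintf(stderr, "Process %d on %s\n", myid, processor_name);

        while (!done) {
            if (myid == 0) {
                /*
                printf("Enter the number of intervals (0 quits) ");
                scanf("%d", &n);
                */
                if (n == 0) n = 100; else n = 0;
                startwtime = MPI_Wtime();
            }
            MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (n == 0) {
                done = 1;
            } else {
                h = 1.0 / (double) n;
                sum = 0.0;
                for (i = myid + 1; i <= n; i += numprocs) {
                    x = h * ((double) i - 0.5);
                    sum += f(x);
                }
                mypi = h * sum;
                MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
                if (myid == 0) {
                    printf("pi is approximately %.16f, Error is %.16f\n",
                           pi, fabs(pi - PI25DT));
                    endwtime = MPI_Wtime();
                    printf("wall clock time = %f\n", endwtime - startwtime);
                }
            }
        }
        MPI_Finalize();
        return 0;
    }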

  • Compute π by integrating f(x) = 4/(1 + x^2)
  • mpi.h: system include file which defines the MPI
    functions
  • MPI_Init: initializes the MPI execution environment
  • MPI_Comm_size: determines the size of the group
    associated with a communicator
  • MPI_Comm_rank: determines the rank of the calling
    process in the communicator
  • MPI_Get_processor_name: gets the name of the
    processor
  • if (myid == 0): differentiates actions based on
    rank; only the master performs this action
  • MPI_Wtime: MPI built-in function to get a time value
  • MPI_Bcast: broadcasts 1 MPI_INT from n on the
    process with rank 0 to all other processes of the
    group
  • for loop: each worker does this loop and increments
    the counter by the number of processors (versus
    dividing the range -> possible off-by-one error)
  • MPI_Reduce: does the MPI_SUM function on 1
    MPI_DOUBLE at mypi on all workers in MPI_COMM_WORLD
    to a single value at pi on rank 0
  • Only rank 0 outputs the value of π
  • MPI_Finalize: terminates the MPI execution
    environment
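
The work split described in these notes (each worker starts at its own rank and strides through the intervals by the number of processes, then the partial results are summed on rank 0) can be checked without an MPI installation. The sketch below is a plain C simulation, not MPI code: partial_pi and reduce_pi are hypothetical names, and the loop in reduce_pi merely stands in for MPI_Reduce with MPI_SUM.

```c
/* Integrand: f(x) = 4 / (1 + x^2). */
static double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Partial result one rank would compute: it handles intervals
   rank+1, rank+1+numprocs, rank+1+2*numprocs, ... up to n. */
double partial_pi(int rank, int numprocs, long n)
{
    double h = 1.0 / (double)n;
    double sum = 0.0;
    long i;

    for (i = rank + 1; i <= n; i += numprocs) {
        double x = h * ((double)i - 0.5);   /* midpoint of interval i */
        sum += f(x);
    }
    return h * sum;
}

/* Serial stand-in for the reduction: add every rank's partial result. */
double reduce_pi(int numprocs, long n)
{
    double pi = 0.0;
    int rank;

    for (rank = 0; rank < numprocs; rank++)
        pi += partial_pi(rank, numprocs, n);
    return pi;
}
```

Because every interval index from 1 to n is visited by exactly one rank, the strided split covers the range even when n is not a multiple of the process count, which is the off-by-one risk the notes mention for block-wise range division.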
23
Other Common MPI Functions
  • MPI_Send, MPI_Recv
  • Blocking send and receive between two specific
    ranks
  • MPI_Isend, MPI_Irecv
  • Non-blocking send and receive between two
    specific ranks
  • man pages exist for the MPI functions
  • Poorly written programs can suffer from poor
    communication efficiency (e.g. stair-step) or
    lost data if the system buffer fills before a
    blocking send or receive is initiated to
    correspond with a non-blocking receive or send
  • Care should be used when creating temporary files
    as multiple threads may be running on the same
    host overwriting the same temporary file (include
    rank in file name in a unique temporary directory
    per simulation)
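
One way to follow the temporary-file advice above is to build each file name from a per-run directory plus the rank. The helper below is a hypothetical sketch: the name rank_scratch_path, the path layout, and the zero-padded rank format are all assumptions, and in a real program the rank would come from MPI_Comm_rank.

```c
#include <stdio.h>

/* Write "<base_dir>/<run_id>/rank<NNNN>.tmp" into buf.
   Returns 0 on success, -1 if the path did not fit in buf. */
int rank_scratch_path(char *buf, size_t size,
                      const char *base_dir, const char *run_id, int rank)
{
    int written = snprintf(buf, size, "%s/%s/rank%04d.tmp",
                           base_dir, run_id, rank);

    if (written < 0 || (size_t)written >= size)
        return -1;   /* encoding error or truncated output */
    return 0;
}
```

With a run identifier such as the queue system's job ID, two ranks on the same host get distinct files, and two simulations sharing a host get distinct directories.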

24
Compiling MPICH programs
  • mpicc, mpiCC, mpif77, mpif90 are used to
    automatically compile code and link in the
    correct MPI libraries from /usr/lib64/MPICH
  • Environment variables can be used to set the
    compiler
  • CC, CPP, FC, F90
  • Command line options to set the compiler
  • -cc, -cxx, -fc, -f90
  • GNU, PGI, and Intel compilers are supported

25
Running MPICH programs
  • mpirun is used to launch MPICH programs
  • Dynamic allocation can be done when using the -np
    flag
  • Mapping is also supported when using the -map
    flag
  • If Infiniband is installed, the interconnect
    fabric can be chosen using the -machine flag
  • -machine p4
  • -machine vapi

26
Scaling with MPI
    $ which mpicc
    /usr/bin/mpicc
    $ mpicc -show -o cpi-mpi cpi-mpi.c
    gcc -L/usr/lib64/MPICH/p4/gnu -I/usr/include -o cpi-mpi cpi-mpi.c -lmpi -lbproc
    $ mpicc -o cpi-mpi cpi-mpi.c
    $ time mpirun -np 1 ./cpi-mpi
    Process 0 on scyld.localdomain
    real    0m11.198s
    user    0m11.187s
    sys     0m0.010s
    $ time mpirun -np 2 ./cpi-mpi
    Process 0 on scyld.localdomain
    Process 1 on n0
    real    0m6.486s
    user    0m5.510s
    sys     0m0.009s
    $ time mpirun -map -1:-1:-1:-1:-1:-1:-1:-1:0:0:0:0:0:0:0:0 ./cpi-mpi

27
Environment Variable Options
  • Additional environment variable control
  • NP: The number of processes requested, but not
    the number of processors. As in the example
    earlier in this section, NP=4 ./a.out will run
    the MPI program a.out with 4 processes.
  • ALL_CPUS: Set the number of processes to the
    number of CPUs available to the current user.
    Similar to the example above, ALL_CPUS=1
    ./a.out would run the MPI program a.out on all
    available CPUs.
  • ALL_NODES: Set the number of processes to the
    number of nodes available to the current user.
    Similar to the ALL_CPUS variable, but you get a
    maximum of one CPU per node. This is useful for
    running a job per node instead of per CPU.
  • ALL_LOCAL: Run every process on the master node;
    used for debugging purposes.
  • NO_LOCAL: Don't run any processes on the master
    node.
  • EXCLUDE: A colon-delimited list of nodes to be
    avoided during node assignment.
  • BEOWULF_JOB_MAP: A colon-delimited list of
    nodes. The first node listed will be the first
    process (MPI Rank 0) and so on.

28
Compiling and Running OpenMPI programs
  • The env-modules package allows users to change
    their environment variables according to
    predefined files
  • module avail
  • module load openmpi/gnu
  • GNU, PGI, and Intel compilers are supported
  • mpicc, mpiCC, mpif77, mpif90 are used to
    automatically compile code and link in the
    correct MPI libraries from /opt/scyld/openmpi
  • mpirun is used to run code
  • Interconnect can be selected at runtime
  • -mca btl openib,tcp,sm,self
  • -mca btl udapl,tcp,sm,self

29
Compiling and Running OpenMPI programs
  • What env-modules does
  • Set user environment prior to compiling
  • export PATH=/opt/scyld/openmpi/gnu/bin:$PATH
  • mpicc, mpiCC, mpif77, mpif90 are used to
    automatically compile code and link in the
    correct MPI libraries from /opt/scyld/openmpi
  • Environment variables can be used to set the
    compiler
  • OMPI_CC, OMPI_CXX, OMPI_F77, OMPI_FC
  • Prior to running, PATH and LD_LIBRARY_PATH should
    be set
  • module load openmpi/gnu
  • /opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out
  • OR
  • export PATH=/opt/scyld/openmpi/gnu/bin:$PATH
  • export MANPATH=/opt/scyld/openmpi/gnu/share/man
  • export LD_LIBRARY_PATH=/opt/scyld/openmpi/gnu/lib:$LD_LIBRARY_PATH
  • /opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out

30
Scaling with MPI Implementations
31
Scaling with MPI Implementations
  • Infiniband allows wider scaling
  • Performance difference between MPICH versus
    OpenMPI
  • A little artificial because it's only two
    physical machines

32
Scaling with MPI Implementations
  • Larger problems would allow continued scaling

33
MPI Summary
  • MPI provides a mechanism to parallelize in a
    distributed fashion
  • Localhost optimization is done on a shared
    memory machine
  • Shared variables are explicitly handled by the
    developer
  • Tradeoff between CPU versus IO can determine the
    performance characteristics
  • Hybrid programming models are possible
  • MPI code with OpenMP sections
  • MPI code with GPU calls

34
Queuing
  • How are resources allocated among multiple users
    and/or groups?
  • Statically by using bpctl user and group
    permissions
  • ClusterWare supports a variety of queuing
    packages
  • TaskMaster (advanced MOAB policy-based scheduler
    integrated with ClusterWare)
  • Torque
  • SGE

35
Interacting with Torque
  • To submit a job
  • qsub script.sh
  • Example script.sh

    #!/bin/sh
    #PBS -j oe
    #PBS -l nodes=4
    cd $PBS_O_WORKDIR
    hostname

  • qsub does not accept arguments for script.sh.
    All executable arguments must be included in the
    script itself
  • Administrators can create a qapp script that
    takes user arguments, creates script.sh with the
    user arguments embedded, and runs qsub
    script.sh

36
Interacting with Torque
  • Other commands
  • qstat: Status of queue server and jobs
  • qdel: Remove a job from the queue
  • qhold, qrls: Hold and release a job in the queue
  • qmgr: Administrator command to configure
    pbs_server
  • /var/spool/torque/server_name should match the
    hostname of the head node
  • /var/spool/torque/mom_priv/config: file to
    configure pbs_mom
  • $usecp *:/home /home indicates that pbs_mom
    should use cp rather than rcp or scp to
    relocate the stdout and stderr files at the end
    of execution
  • pbsnodes: Administrator command to monitor the
    status of the resources
  • qalter: Administrator command to modify the
    parameters of a particular job (e.g. requested
    time)

37
Other options to qsub
  • Options that can be included in a script (with
    the #PBS directive) or on the qsub command line
  • Join output and error files: #PBS -j oe
  • Request resources: #PBS -l nodes=2:ppn=2
  • Request walltime: #PBS -l walltime=24:00:00
  • Define a job name: #PBS -N jobname
  • Send mail at job events: #PBS -m be
  • Assign job to an account: #PBS -A account
  • Export current environment variables: #PBS -V
  • To start an interactive queue job use
  • qsub -I for Torque
  • qrsh for SGE

38
Queue script case studies
  • qapp script

    #!/bin/bash
    # Usage: qapp arg1 arg2
    debug=0
    opt1="${1}"
    opt2="${2}"
    # Both arguments are required
    if [[ -z "${opt2}" ]]; then
        echo "Not enough arguments"
        exit 1
    fi
    # Generate the job script with the user arguments embedded
    cat > app.sh << EOF
    #!/bin/bash
    #PBS -j oe
    #PBS -l nodes=1
    cd \$PBS_O_WORKDIR
    app $opt1 $opt2
    EOF
    if [[ "${debug}" -lt 1 ]]; then
        qsub app.sh
    fi
    if [[ "${debug}" -eq 0 ]]; then
        /bin/rm -f app.sh
    fi

  • Be careful about escaping special characters in
    the redirect section (\$, \`, \\)

39
Queue script case studies
  • Using local scratch

    #!/bin/bash
    #PBS -j oe
    #PBS -l nodes=1
    cd $PBS_O_WORKDIR
    tmpdir=/scratch/$USER/$PBS_JOBID
    /bin/mkdir -p $tmpdir
    rsync -a ./ $tmpdir
    cd $tmpdir
    pathto/app arg1 arg2
    cd $PBS_O_WORKDIR
    rsync -a $tmpdir/ .
    /bin/rm -fr $tmpdir
40
Queue script case studies
  • Using local scratch for MPICH parallel jobs
  • pbsdsh is a Torque command

    #!/bin/bash
    #PBS -j oe
    #PBS -l nodes=2:ppn=8
    cd $PBS_O_WORKDIR
    tmpdir=/scratch/$USER/$PBS_JOBID
    /usr/bin/pbsdsh -u /bin/mkdir -p $tmpdir
    /usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR; rsync -a ./ $tmpdir"
    cd $tmpdir
    mpirun -machine vapi pathto/app arg1 arg2
    cd $PBS_O_WORKDIR
    /usr/bin/pbsdsh -u rsync -a $tmpdir/ $PBS_O_WORKDIR
    /usr/bin/pbsdsh -u /bin/rm -fr $tmpdir
41
Queue script case studies
  • Using local scratch for OpenMPI parallel jobs
  • Do a module load openmpi/gnu prior to running
    qsub
  • OR explicitly include a module load openmpi/gnu
    in the script itself

    #!/bin/bash
    #PBS -j oe
    #PBS -l nodes=2:ppn=8
    #PBS -V
    cd $PBS_O_WORKDIR
    tmpdir=/scratch/$USER/$PBS_JOBID
    /usr/bin/pbsdsh -u /bin/mkdir -p $tmpdir
    /usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR; rsync -a ./ $tmpdir"
    cd $tmpdir
    /usr/openmpi/gnu/bin/mpirun -np `cat $PBS_NODEFILE | wc -l` -mca btl openib,sm,self pathto/app arg1 arg2
    cd $PBS_O_WORKDIR
    /usr/bin/pbsdsh -u rsync -a $tmpdir/ $PBS_O_WORKDIR
    /usr/bin/pbsdsh -u /bin/rm -fr $tmpdir
42
Other considerations
  • A queue script need not be a single command
  • Multiple steps can be performed from a single
    script
  • Guaranteed resources
  • Jobs should typically be a minimum of 2 minutes
  • Pre-processing and post-processing can be done
    from the same script using the local scratch
    space
  • If configured, it is possible to submit
    additional jobs from a running queued job
  • To remove multiple jobs from the queue
  • qstat | grep " [RQ] " | awk '{print $1}' | xargs
    qdel

43
Integrating Parallel Programs
  • The scheduler only keeps track of available
    resources
  • It doesn't monitor how the resources are used
  • Onus is on the user to request and use the
    correct resources
  • OpenMP: be sure to request multiple processors
    on the same machine
  • Torque: #PBS -l nodes=1:ppn=x
  • SGE: Correct PE (parallel environment) submission
  • MPI: be sure to use the machines that have
    been assigned by the queue system
  • Torque: MPICH and OpenMPI mpirun will do the
    correct thing. $PBS_NODEFILE contains a list of
    assigned hosts
  • SGE: $PE_HOSTFILE contains a list of assigned
    hosts. OpenMPI's mpirun may need to be recompiled

44
Integrating Parallel Programs
  • Be careful about task pinning (taskset)
  • Different jobs may assume the same CPU set,
    resulting in oversubscription of some cores while
    other cores sit free
  • In a shared environment, not using task pinning
    can be easier at a slight trade-off in
    performance
  • Make sure that the same MPI implementation and
    compiler combination is used to run the code as
    was used to compile and link

45
Questions??