Introduction to OpenMP - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Introduction to OpenMP

Introduction to OpenMP
  • Philip Blood Scientific Specialist
  • Pittsburgh Supercomputing Center
  • Jeff Gardner (U. of Washington)
  • Shawn Brown (PSC)

Different types of parallel platforms
  • Distributed Memory
  • Shared Memory

Different types of parallel platforms: Shared
  • SMP: Symmetric Multiprocessing
  • Identical processing units working from the same
    main memory
  • SMP machines are becoming more common in the
    everyday workplace
  • Dual-socket motherboards are very common, and
    quad-sockets are not uncommon
  • 2- and 4-core CPUs are now commonplace
  • Intel Larrabee: 12-48 cores in 2009-2010
  • ASMP: Asymmetric Multiprocessing
  • Not all processing units are identical
  • Example: the Cell processor of the PS3

Parallel Programming Models
  • Shared Memory
  • Multiple processors sharing the same memory space
  • Message Passing
  • Users make calls that explicitly share
    information between execution entities
  • Remote Memory Access
  • Processors can directly access memory on another
    processor
  • These models are then used to build more
    sophisticated models
  • Loop Driven
  • Function Driven Parallel (Task-Level)

Shared Memory Programming
  • SysV memory manipulation
  • One can actually create and manipulate shared
    memory spaces
  • Pthreads (POSIX Threads)
  • Lower-level Unix library to build multi-threaded
    programs
  • OpenMP
  • Protocol designed to provide automatic
    parallelization through compiler pragmas
  • Mainly loop-driven parallelism
  • Best suited to desktop and small SMP computers
  • Caution: Race Conditions
  • When two threads are changing the same memory
    location at the same time

  • OpenMP is designed for shared memory systems.
  • OpenMP is easy to use
  • achieve parallelism through compiler directives
  • or the occasional function call
  • OpenMP is a "quick and dirty" way of
    parallelizing a program
  • OpenMP is usually used on existing serial
    programs to achieve moderate parallelism with
    relatively little effort

Computational Threads
  • Each processor has one thread assigned to it
  • Each thread runs one copy of your program

(Diagram: Thread 0, Thread 1, Thread 2, ..., Thread n)
OpenMP Execution Model
  • In MPI, all threads are active all the time
  • In OpenMP, execution begins only on the master
    thread. Child threads are spawned and released
    as needed.
  • Threads are spawned when program enters a
    parallel region.
  • Threads are released when program exits a
    parallel region

OpenMP Execution Model
Parallel Region Example: For loop
  • Fortran:
  • !$omp parallel do
  • do i = 1, n
  •   a(i) = b(i) + c(i)
  • enddo
  • C/C++:
  • #pragma omp parallel for
  • for(i = 1; i < n; i++)
  •   a[i] = b[i] + c[i];

This comment or pragma tells the OpenMP compiler to
spawn threads and distribute work among those
threads. These actions are combined here, but they
can also be specified separately (a parallel
directive to spawn threads and a for/do directive
to distribute the work between them).
Pros of OpenMP
  • Because it takes advantage of shared memory, the
    programmer does not need to worry (that much)
    about data placement
  • Programming model is serial-like and thus
    conceptually simpler than message passing
  • Compiler directives are generally simple and easy
    to use
  • Legacy serial code does not need to be rewritten

Cons of OpenMP
  • Codes can only be run in shared memory
  • In general, shared memory machines beyond 8 CPUs
    are much more expensive than distributed memory
    ones, so finding a shared memory system to run on
    may be difficult
  • Compiler must support OpenMP
  • whereas MPI can be installed anywhere
  • However, gcc 4.2 now supports OpenMP

Cons of OpenMP
  • In general, only moderate speedups can be
    achieved
  • Because OpenMP codes tend to have serial-only
    portions, Amdahl's Law prohibits substantial
    speedups
  • Amdahl's Law:
  • Speedup = 1 / (F + (1 - F)/N)
  • F = Fraction of serial execution time that cannot
    be parallelized
  • N = Number of processors

If you have big loops that dominate execution
time, these are ideal targets for OpenMP
Goals of this lecture
  • Exposure to OpenMP
  • Understand where OpenMP may be useful to you now
  • Or perhaps 4 years from now when you need to
    parallelize a serial program, you will say, Hey!
    I can use OpenMP.
  • Avoidance of common pitfalls
  • How to make your OpenMP code actually get the
    same answer that it did in serial
  • A few tips on dramatically increasing the
    performance of OpenMP applications

Compiling and Running OpenMP
  • Tru64: -mp
  • SGI IRIX: -mp
  • IBM AIX: -qsmp=omp
  • Portland Group: -mp
  • Intel: -openmp
  • gcc (4.2+): -fopenmp

Compiling and Running OpenMP
  • OMP_NUM_THREADS environment variable sets the
    number of processors the OpenMP program will have
    at its disposal.
  • Example script:
  • #!/bin/tcsh
  • setenv OMP_NUM_THREADS 4
  • mycode < my.in > my.out

OpenMP Basics: 2 Approaches to Parallelism
  • Divide various sections of code between threads
    (functional parallelism)
  • Divide loop iterations among threads (loop-level
    parallelism)
We will focus mainly on loop-level parallelism in
this lecture.

Sections: Functional parallelism
  • #pragma omp parallel
  • #pragma omp sections
  • {
  • #pragma omp section
  • block1
  • #pragma omp section
  • block2
  • }

Parallel DO/for: Loop-level parallelism
  • Fortran:
  • !$omp parallel do
  • do i = 1, n
  •   a(i) = b(i) + c(i)
  • enddo
  • C/C++:
  • #pragma omp parallel for
  • for(i = 1; i < n; i++)
  •   a[i] = b[i] + c[i];

Pitfall 1 Data dependencies
  • Consider the following code
  • a[0] = 1;
  • for(i = 1; i < 5; i++)
  •   a[i] = i * a[i-1];
  • There are dependencies between loop iterations.
  • Sections of loops split between threads will not
    necessarily execute in order
  • Out of order loop execution will result in
    undefined behavior

Pitfall 1 Data dependencies
  • 3 simple rules for data dependencies
  • All assignments are performed on arrays.
  • Each element of an array is assigned to by at
    most one iteration.
  • No loop iteration reads array elements modified
    by any other iteration.

Avoiding dependencies by using Private Variables
(Pitfall 1.5)
  • Consider the following loop
  • #pragma omp parallel for
  • for(i = 0; i < n; i++) {
  •   temp = 2.0 * a[i];
  •   a[i] = temp;
  •   b[i] = c[i] / temp;
  • }
  • By default, all threads share a common address
    space. Therefore, all threads will be modifying
    temp simultaneously

Avoiding dependencies by using Private Variables
(Pitfall 1.5)
  • The solution is to make temp a thread-private
    variable by using the private clause
  • #pragma omp parallel for private(temp)
  • for(i = 0; i < n; i++) {
  •   temp = 2.0 * a[i];
  •   a[i] = temp;
  •   b[i] = c[i] / temp;
  • }

Avoiding dependencies by using Private Variables
(Pitfall 1.5)
  • Default OpenMP behavior is for variables to be
    shared. However, sometimes you may wish to make
    the default private and explicitly declare your
    shared variables (but only in Fortran!)
  • !$omp parallel do default(private) shared(a,b,c,n)
  • do i = 1, n
  •   temp = 2.0 * a(i)
  •   a(i) = temp
  •   b(i) = c(i) / temp
  • enddo
  • !$omp end parallel do

Private variables
  • Note that the loop iteration variable (e.g. i in
    previous example) is private by default
  • Caution The value of any variable specified as
    private is undefined both upon entering and
    leaving the construct in which it is specified
  • Use firstprivate and lastprivate clauses to
    retain values of variables declared as private

Use of function calls within parallel loops
  • In general, the compiler will not parallelize a
    loop that involves a function call unless it can
    guarantee that there are no dependencies between
    iterations
  • sin(x) is OK, for example, if x is private
  • A good strategy is to inline function calls
    within loops. If the compiler can inline the
    function, it can usually verify lack of
    dependencies
  • System calls do not parallelize!!!

Pitfall 2 Updating shared variables
  • Consider the following serial code:
  • the_max = 0;
  • for(i = 0; i < n; i++)
  •   the_max = max(myfunc(a[i]), the_max);
  • This loop can be executed in any order; however,
    the_max is modified every loop iteration.
  • Use the critical clause to specify code segments
    that can only be executed by one thread at a
    time:
  • #pragma omp parallel for private(temp)
  • for(i = 0; i < n; i++) {
  •   temp = myfunc(a[i]);
  •   #pragma omp critical
  •   the_max = max(temp, the_max);
  • }

Reduction operations
  • Now consider a global sum:
  • for(i = 0; i < n; i++)
  •   sum = sum + a[i];
  • This can be done by defining critical sections,
    but for convenience, OpenMP also provides a
    reduction clause:
  • #pragma omp parallel for reduction(+:sum)
  • for(i = 0; i < n; i++)
  •   sum = sum + a[i];

Reduction operations
  • C/C++ reduction-able operators (and initial
    values):
  • + (0)
  • - (0)
  • * (1)
  • & (~0)
  • | (0)
  • ^ (0)
  • && (1)
  • || (0)

Pitfall 3 Parallel overhead
  • Spawning and releasing threads results in
    significant overhead.

Pitfall 3 Parallel Overhead
  • Spawning and releasing threads results in
    significant overhead.
  • Therefore, you want to make your parallel regions
    as large as possible
  • Parallelize over the largest loop that you can
    (even though it will involve more work to declare
    all of the private variables and eliminate
    dependencies)
  • Coarse granularity is your friend!

Separating Parallel and For directives to
reduce overhead
  • In the following example, threads are spawned
    only once, not once per loop
  • #pragma omp parallel
  • {
  • #pragma omp for
  • for(i = 0; i < maxi; i++)
  •   a[i] = b[i];
  • #pragma omp for
  • for(j = 0; j < maxj; j++)
  •   c[j] = d[j];
  • }

!$omp parallel
!$omp do
do i = 1, maxi
  a(i) = b(i)
enddo
!$omp end do    ! (optional)
!$omp do
do j = 1, maxj
  c(j) = d(j)
enddo
!$omp end do    ! (optional)
!$omp end parallel    ! (required)
Use nowait to avoid barriers
  • At the end of every loop is an implied barrier.
  • Use nowait to remove the barrier at the end of
    the first loop
  • #pragma omp parallel
  • {
  • #pragma omp for nowait
  • for(i = 0; i < maxi; i++)
  •   a[i] = b[i];
  • #pragma omp for
  • for(j = 0; j < maxj; j++)
  •   c[j] = d[j];
  • }

Barrier removed by nowait clause
Use nowait to avoid barriers
  • In Fortran, nowait goes at end of loop
  • !$omp parallel
  • !$omp do
  • do i = 1, maxi
  •   a(i) = b(i)
  • enddo
  • !$omp end do nowait
  • !$omp do
  • do j = 1, maxj
  •   c(j) = d(j)
  • enddo
  • !$omp end do
  • !$omp end parallel

Barrier removed by nowait clause
Other useful directives to avoid releasing and
spawning threads
  • #pragma omp master
  • !$omp master ... !$omp end master
  • Denotes code within a parallel region to be
    executed only by the master thread
  • #pragma omp single
  • Denotes code that will be performed by only one
    thread
  • Useful for overlapping serial segments with
    parallel computation
  • #pragma omp barrier
  • Sets a global barrier within a parallel region

Thread stack
  • Each thread has its own memory region called the
    thread stack
  • This can grow to be quite large, so default size
    may not be enough
  • This can be increased (e.g. to 16 MB)
  • csh:
  • limit stacksize 16000
  • setenv KMP_STACKSIZE 16000000
  • bash:
  • ulimit -s 16000
  • export KMP_STACKSIZE=16000000

Useful OpenMP Functions
  • void omp_set_num_threads(int num_threads)
  • Sets the number of OpenMP threads (overrides
    OMP_NUM_THREADS)
  • int omp_get_thread_num()
  • Returns the number of the current thread
  • int omp_get_num_threads()
  • Returns the total number of threads currently
    participating in a parallel region
  • Returns 1 if executed in a serial region
  • For portability, surround these functions with
    #ifdef _OPENMP ... #endif
  • #include <omp.h>

Optimization Scheduling
  • OpenMP partitions workload into chunks for
    distribution among threads
  • Default strategy is static

Optimization Scheduling
  • This strategy has the least amount of overhead
  • However, if not all iterations take the same
    amount of time, this simple strategy will lead to
    load imbalance.

Optimization Scheduling
  • OpenMP offers a variety of scheduling strategies
  • schedule(static,chunksize)
  • Divides workload into equal-sized chunks
  • Default chunksize is Nwork/Nthreads
  • Setting chunksize to less than this will result
    in chunks being assigned in an interleaved manner
  • Lowest overhead
  • Least optimal workload distribution

Optimization Scheduling
  • schedule(dynamic,chunksize)
  • Dynamically assigns chunks to threads
  • Default chunksize is 1
  • Highest overhead
  • Optimal workload distribution
  • schedule(guided,chunksize)
  • Starts with big chunks proportional to (number of
    unassigned iterations)/(number of threads), then
    makes them progressively smaller until chunksize
    is reached
  • Attempts to seek a balance between overhead and
    workload optimization

Optimization Scheduling
  • schedule(runtime)
  • Scheduling can be selected at runtime via the
    OMP_SCHEDULE environment variable
  • e.g. setenv OMP_SCHEDULE "guided, 100"
  • In practice, often use:
  • Default scheduling (static, large chunks)
  • Guided with default chunksize
  • Experiment with your code to determine the
    optimal strategy
What we have learned
  • How to compile and run OpenMP programs
  • Private vs. shared variables
  • Critical sections and reductions for updating
    scalar shared variables
  • Techniques for minimizing thread spawning/exiting
  • Different scheduling strategies

  • OpenMP is often the easiest way to achieve
    moderate parallelism on shared memory machines
  • In practice, to achieve decent scaling, you will
    probably need to invest some effort in tuning
    your application
  • More information available at
  • https//
  • http//
  • Using OpenMP, MIT Press, 2008

  • If you've finished parallelizing the Laplace code
    (or you want a break from MPI):
  • Go to and click on
  • OpenMPHands-On_PSC.pdf for introductory exercises
    and examples.