
1
Introduction to OpenMP
  • Philip Blood Scientific Specialist
  • Pittsburgh Supercomputing Center
  • Jeff Gardner (U. of Washington)
  • Shawn Brown (PSC)

2
Different types of parallel platforms
Distributed Memory
3
Different types of parallel platforms: Shared Memory
4
Different types of parallel platforms: Shared Memory
  • SMP: Symmetric Multiprocessing
  • Identical processing units working from the same
    main memory
  • SMP machines are becoming more common in the
    everyday workplace
  • Dual-socket motherboards are very common, and
    quad-sockets are not uncommon
  • 2- and 4-core CPUs are now commonplace
  • Intel Larrabee: 12-48 cores in 2009-2010
  • ASMP: Asymmetric Multiprocessing
  • Not all processing units are identical
  • Cell processor of the PS3

5
Parallel Programming Models
  • Shared Memory
  • Multiple processors sharing the same memory space
  • Message Passing
  • Users make calls that explicitly share
    information between execution entities
  • Remote Memory Access
  • Processors can directly access memory on another
    processor
  • These models are then used to build more
    sophisticated models
  • Loop Driven
  • Function Driven Parallel (Task-Level)

6
Shared Memory Programming
  • SysV shared memory
  • One can explicitly create and manipulate shared
    memory segments
  • Pthreads (POSIX Threads)
  • Lower-level Unix library for building
    multi-threaded programs
  • OpenMP (www.openmp.org)
  • An API designed to provide parallelization
    through compiler pragmas
  • Mainly loop-driven parallelism
  • Best suited to desktop and small SMP computers
  • Caution: race conditions occur when two threads
    change the same memory location at the same time

7
Introduction
  • OpenMP is designed for shared memory systems.
  • OpenMP is easy to use
  • achieve parallelism through compiler directives
  • or the occasional function call
  • OpenMP is a quick and dirty way of
    parallelizing a program.
  • OpenMP is usually used on existing serial
    programs to achieve moderate parallelism with
    relatively little effort

8
Computational Threads
  • Each processor has one thread assigned to it
  • Each thread runs one copy of your program

(Diagram: Thread 0, Thread 1, Thread 2, …, Thread n)
9
OpenMP Execution Model
  • In MPI, all threads are active all the time
  • In OpenMP, execution begins only on the master
    thread. Child threads are spawned and released
    as needed.
  • Threads are spawned when program enters a
    parallel region.
  • Threads are released when program exits a
    parallel region

10
OpenMP Execution Model
11
Parallel Region Example: For loop
  • Fortran:
  • !$omp parallel do
  • do i = 1, n
  •   a(i) = b(i) + c(i)
  • enddo
  • C/C++:
  • #pragma omp parallel for
  • for(i=1; i<n; i++)
  •   a[i] = b[i] + c[i];

This comment or pragma tells the OpenMP compiler to
spawn threads and distribute work among those
threads. These actions are combined here, but they
can be specified separately.
12
Pros of OpenMP
  • Because it takes advantage of shared memory, the
    programmer does not need to worry (that much)
    about data placement
  • Programming model is serial-like and thus
    conceptually simpler than message passing
  • Compiler directives are generally simple and easy
    to use
  • Legacy serial code does not need to be rewritten

13
Cons of OpenMP
  • Codes can only be run in shared memory
    environments!
  • In general, shared memory machines beyond 8 CPUs
    are much more expensive than distributed memory
    ones, so finding a shared memory system to run on
    may be difficult
  • Compiler must support OpenMP
  • whereas MPI can be installed anywhere
  • However, gcc 4.2 now supports OpenMP

14
Cons of OpenMP
  • In general, only moderate speedups can be
    achieved.
  • Because OpenMP codes tend to have serial-only
    portions, Amdahl's Law prohibits substantial
    speedups
  • Amdahl's Law:
  • Speedup = 1 / (F + (1 - F)/N)
  • F = fraction of serial execution time that
    cannot be parallelized
  • N = number of processors

If you have big loops that dominate execution
time, these are ideal targets for OpenMP
15
Goals of this lecture
  • Exposure to OpenMP
  • Understand where OpenMP may be useful to you now
  • Or perhaps 4 years from now, when you need to
    parallelize a serial program, you will say, "Hey!
    I can use OpenMP."
  • Avoidance of common pitfalls
  • How to make your OpenMP actually get the same
    answer that it did in serial
  • A few tips on dramatically increasing the
    performance of OpenMP applications

16
Compiling and Running OpenMP
  • Tru64: -mp
  • SGI IRIX: -mp
  • IBM AIX: -qsmp=omp
  • Portland Group: -mp
  • Intel: -openmp
  • gcc (4.2): -fopenmp

17
Compiling and Running OpenMP
  • The OMP_NUM_THREADS environment variable sets the
    number of threads the OpenMP program will have
    at its disposal.
  • Example script:
  • #!/bin/tcsh
  • setenv OMP_NUM_THREADS 4
  • mycode < my.in > my.out

18
OpenMP Basics: 2 Approaches to Parallelism
  • Divide various sections of code between threads
  • Divide loop iterations among threads
We will focus mainly on loop-level parallelism in
this lecture.
19
Sections: Functional parallelism
  • #pragma omp parallel
  • {
  •   #pragma omp sections
  •   {
  •     #pragma omp section
  •     { block1 }
  •     #pragma omp section
  •     { block2 }
  •   }
  • }

Image from https://computing.llnl.gov/tutorials/openMP
20
Parallel DO/for: Loop level parallelism
  • Fortran:
  • !$omp parallel do
  • do i = 1, n
  •   a(i) = b(i) + c(i)
  • enddo
  • C/C++:
  • #pragma omp parallel for
  • for(i=1; i<n; i++)
  •   a[i] = b[i] + c[i];

Image from https://computing.llnl.gov/tutorials/openMP
21
Pitfall 1: Data dependencies
  • Consider the following code:
  • a[0] = 1;
  • for(i=1; i<5; i++)
  •   a[i] = i + a[i-1];
  • There are dependencies between loop iterations.
  • Sections of loops split between threads will not
    necessarily execute in order
  • Out-of-order loop execution will result in
    undefined behavior

22
Pitfall 1: Data dependencies
  • 3 simple rules for data dependencies
  • All assignments are performed on arrays.
  • Each element of an array is assigned to by at
    most one iteration.
  • No loop iteration reads array elements modified
    by any other iteration.

23
Avoiding dependencies by using Private Variables
(Pitfall 1.5)
  • Consider the following loop:
  • #pragma omp parallel for
  • for(i=0; i<n; i++) {
  •   temp = 2.0*a[i];
  •   a[i] = temp;
  •   b[i] = c[i]/temp;
  • }
  • By default, all threads share a common address
    space. Therefore, all threads will be modifying
    temp simultaneously

24
Avoiding dependencies by using Private Variables
(Pitfall 1.5)
  • The solution is to make temp a thread-private
    variable by using the private clause:
  • #pragma omp parallel for private(temp)
  • for(i=0; i<n; i++) {
  •   temp = 2.0*a[i];
  •   a[i] = temp;
  •   b[i] = c[i]/temp;
  • }

25
Avoiding dependencies by using Private Variables
(Pitfall 1.5)
  • Default OpenMP behavior is for variables to be
    shared. However, sometimes you may wish to make
    the default private and explicitly declare your
    shared variables (but only in Fortran!):
  • !$omp parallel do default(private)
    shared(n,a,b,c)
  • do i = 1, n
  •   temp = 2.0*a(i)
  •   a(i) = temp
  •   b(i) = c(i)/temp
  • enddo
  • !$omp end parallel do

26
Private variables
  • Note that the loop iteration variable (e.g. i in
    previous example) is private by default
  • Caution: the value of any variable specified as
    private is undefined both upon entering and
    leaving the construct in which it is specified
  • Use firstprivate and lastprivate clauses to
    retain values of variables declared as private

27
Use of function calls within parallel loops
  • In general, the compiler will not parallelize a
    loop that involves a function call unless it can
    guarantee that there are no dependencies between
    iterations.
  • sin(x) is OK, for example, if x is private.
  • A good strategy is to inline function calls
    within loops. If the compiler can inline the
    function, it can usually verify lack of
    dependencies.
  • System calls do not parallelize!!!

28
Pitfall 2: Updating shared variables
simultaneously
  • Consider the following serial code:
  • the_max = 0;
  • for (i=0; i<n; i++)
  •   the_max = max(myfunc(a[i]), the_max);
  • This loop can be executed in any order; however,
    the_max is modified every loop iteration.
  • Use the critical directive to specify code
    segments that can only be executed by one thread
    at a time:
  • #pragma omp parallel for private(temp)
  • for(i=0; i<n; i++) {
  •   temp = myfunc(a[i]);
  •   #pragma omp critical
  •   the_max = max(temp, the_max);
  • }

29
Reduction operations
  • Now consider a global sum:
  • for(i=0; i<n; i++)
  •   sum = sum + a[i];
  • This can be done by defining critical sections,
    but for convenience, OpenMP also provides a
    reduction clause:
  • #pragma omp parallel for reduction(+:sum)
  • for(i=0; i<n; i++)
  •   sum = sum + a[i];

30
Reduction operations
  • C/C++ reduction-able operators (and initial
    values):
  • + (0)
  • - (0)
  • * (1)
  • & (~0)
  • | (0)
  • ^ (0)
  • && (1)
  • || (0)

31
Pitfall 3: Parallel overhead
  • Spawning and releasing threads results in
    significant overhead.

32
Pitfall 3: Parallel overhead
33
Pitfall 3: Parallel Overhead
  • Spawning and releasing threads results in
    significant overhead.
  • Therefore, you want to make your parallel regions
    as large as possible
  • Parallelize over the largest loop that you can
    (even though it will involve more work to declare
    all of the private variables and eliminate
    dependencies)
  • Coarse granularity is your friend!

34
Separating parallel and for directives to
reduce overhead
  • In the following example, threads are spawned
    only once, not once per loop:
  • #pragma omp parallel
  • {
  •   #pragma omp for
  •   for(i=0; i<maxi; i++)
  •     a[i] = b[i];
  •   #pragma omp for
  •   for(j=0; j<maxj; j++)
  •     c[j] = d[j];
  • }

!$omp parallel
!$omp do
do i = 1, maxi
  a(i) = b(i)
enddo
!$omp end do ! (optional)
!$omp do
do j = 1, maxj
  c(j) = d(j)
enddo
!$omp end do ! (optional)
!$omp end parallel ! (required)
35
Use nowait to avoid barriers
  • At the end of every loop is an implied barrier.
  • Use nowait to remove the barrier at the end of
    the first loop:
  • #pragma omp parallel
  • {
  •   #pragma omp for nowait
  •   for(i=0; i<maxi; i++)
  •     a[i] = b[i];
  •   #pragma omp for
  •   for(j=0; j<maxj; j++)
  •     c[j] = d[j];
  • }

Barrier removed by nowait clause
36
Use nowait to avoid barriers
  • In Fortran, nowait goes at the end of the loop:
  • !$omp parallel
  • !$omp do
  • do i = 1, maxi
  •   a(i) = b(i)
  • enddo
  • !$omp end do nowait
  • !$omp do
  • do j = 1, maxj
  •   c(j) = d(j)
  • enddo
  • !$omp end do
  • !$omp end parallel

Barrier removed by nowait clause
37
Other useful directives to avoid releasing and
spawning threads
  • #pragma omp master
  • !$omp master ... !$omp end master
  • Denotes code within a parallel region to be
    executed only by the master thread
  • #pragma omp single
  • Denotes code that will be performed by only one
    thread
  • Useful for overlapping serial segments with
    parallel computation.
  • #pragma omp barrier
  • Sets a global barrier within a parallel region

38
Thread stack
  • Each thread has its own memory region called the
    thread stack
  • This can grow to be quite large, so default size
    may not be enough
  • This can be increased (e.g. to 16 MB):
  • csh:
  • limit stacksize 16000
  • setenv KMP_STACKSIZE 16000000
  • bash:
  • ulimit -s 16000
  • export KMP_STACKSIZE=16000000

39
Useful OpenMP Functions
  • void omp_set_num_threads(int num_threads)
  • Sets the number of OpenMP threads (overrides
    OMP_NUM_THREADS)
  • int omp_get_thread_num()
  • Returns the number of the current thread
  • int omp_get_num_threads()
  • Returns the total number of threads currently
    participating in a parallel region
  • Returns 1 if executed in a serial region
  • For portability, surround calls to these
    functions with #ifdef _OPENMP ... #endif
  • #include <omp.h>

40
Optimization Scheduling
  • OpenMP partitions workload into chunks for
    distribution among threads
  • Default strategy is static

41
Optimization Scheduling
  • This strategy has the least amount of overhead
  • However, if not all iterations take the same
    amount of time, this simple strategy will lead to
    load imbalance.

42
Optimization Scheduling
  • OpenMP offers a variety of scheduling strategies
  • schedule(static,chunksize)
  • Divides workload into equal-sized chunks
  • Default chunksize is Nwork/Nthreads
  • Setting chunksize to less than this will result
    in chunks being assigned in an interleaved manner
  • Lowest overhead
  • Least optimal workload distribution

43
Optimization Scheduling
  • schedule(dynamic,chunksize)
  • Dynamically assigns chunks to threads
  • Default chunksize is 1
  • Highest overhead
  • Optimal workload distribution
  • schedule(guided,chunksize)
  • Starts with big chunks proportional to (number of
    unassigned iterations)/(number of threads), then
    makes them progressively smaller until chunksize
    is reached
  • Attempts to seek a balance between overhead and
    workload optimization

44
Optimization Scheduling
  • schedule(runtime)
  • Scheduling can be selected at runtime using the
    OMP_SCHEDULE environment variable
  • e.g. setenv OMP_SCHEDULE "guided,100"
  • In practice, often use
  • Default scheduling (static, large chunks)
  • Guided with default chunksize
  • Experiment with your code to determine optimal
    strategy

45
What we have learned
  • How to compile and run OpenMP programs
  • Private vs. shared variables
  • Critical sections and reductions for updating
    scalar shared variables
  • Techniques for minimizing thread spawning/exiting
    overhead
  • Different scheduling strategies

46
Summary
  • OpenMP is often the easiest way to achieve
    moderate parallelism on shared memory machines
  • In practice, to achieve decent scaling, you will
    probably need to invest some effort in tuning
    your application.
  • More information available at:
  • https://computing.llnl.gov/tutorials/openMP/
  • http://www.openmp.org
  • Using OpenMP, MIT Press, 2008

47
Hands-On
  • If you've finished parallelizing the Laplace code
    (or you want a break from MPI):
  • Go to www.psc.edu/blood and click on
    OpenMPHands-On_PSC.pdf for introductory exercises
    and examples.