ECE 1747 Parallel Programming
Transcript and Presenter's Notes

1
ECE 1747 Parallel Programming
  • Shared Memory OpenMP
  • Environment and Synchronization

2
What is OpenMP?
  • Standard for shared memory programming for
    scientific applications.
  • Has specific support for scientific application
    needs (unlike Pthreads).
  • Rapidly gaining acceptance among vendors and
    application writers.
  • See http://www.openmp.org for more info.

3
OpenMP API Overview
  • API is a set of compiler directives inserted in
    the source program (in addition to some library
    functions).
  • Ideally, compiler directives do not affect
    sequential code.
  • pragmas in C / C++.
  • (special) comments in Fortran code.

4
OpenMP API Example (1 of 2)
  • Sequential code:

        statement1;
        statement2;
        statement3;

  • Assume we want to execute statement2 in
    parallel, and statement1 and statement3
    sequentially.

5
OpenMP API Example (2 of 2)
  • OpenMP parallel code:

        statement1;
    #pragma omp <specific OpenMP directive>
        statement2;
        statement3;

  • statement2 is (possibly) executed in parallel.
  • statement1 and statement3 are executed
    sequentially.

6
Important Note
  • By giving a parallel directive, the user asserts
    that the program will remain correct if the
    statement is executed in parallel.
  • OpenMP compiler does not check correctness.
  • Some tools exist for helping with that.
  • Totalview - good parallel debugger
    (www.etnus.com)

7
API Semantics
  • Master thread executes sequential code.
  • Master and slaves execute parallel code.
  • Note: very similar to the fork/join semantics of
    the Pthreads create/join primitives.

8
OpenMP Implementation Overview
  • An OpenMP implementation consists of
  • a compiler, and
  • a library.
  • Unlike Pthreads (purely a library).

9
OpenMP Example Usage (1 of 2)
[Diagram: annotated source is fed to the OpenMP compiler; a compiler switch selects whether the output is the sequential program or the parallel program.]
10
OpenMP Example Usage (2 of 2)
  • If you give the sequential switch,
  • comments and pragmas are ignored.
  • If you give the parallel switch,
  • comments and/or pragmas are read, and
  • cause translation into a parallel program.
  • Ideally, one source for both sequential and
    parallel program (big maintenance plus).

11
OpenMP Directives
  • Parallelization directives
  • parallel region
  • parallel for
  • Data environment directives
  • shared, private, threadprivate, reduction, etc.
  • Synchronization directives
  • barrier, critical

12
General Rules about Directives
  • They always apply to the next statement, which
    must be a structured block.
  • Examples:

    #pragma omp ...
        statement

    #pragma omp ...
    { statement1; statement2; statement3; }

13
OpenMP Parallel Region
    #pragma omp parallel
  • A number of threads are spawned at entry.
  • Each thread executes the same code.
  • Each thread waits at the end.
  • Very similar to a number of create/joins with
    the same function in Pthreads.

14
Getting Threads to do Different Things
  • Through explicit thread identification (as in
    Pthreads).
  • Through work-sharing directives.

15
Thread Identification
  • int omp_get_thread_num() gets the thread id.
  • int omp_get_num_threads() gets the total number
    of threads.

16
Example
    #pragma omp parallel
    {
        if( !omp_get_thread_num() )
            master();
        else
            slave();
    }
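
A minimal compilable version of this pattern (a sketch; the file name and the trivial master()/slave() bodies are illustrative assumptions, not from the slides):

    /* master_slave.c: compile with e.g. gcc -fopenmp master_slave.c */
    #include <stdio.h>
    #include <omp.h>

    void master(void) { printf("master: thread %d\n", omp_get_thread_num()); }
    void slave(void)  { printf("slave:  thread %d\n", omp_get_thread_num()); }

    int main(void)
    {
        #pragma omp parallel
        {
            if( !omp_get_thread_num() )   /* thread 0 acts as master */
                master();
            else
                slave();
        }
        return 0;
    }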

17
Work Sharing Directives
  • Always occur within a parallel region directive.
  • Two principal ones are
  • parallel for
  • parallel section

18
OpenMP Parallel For
    #pragma omp parallel
    #pragma omp for
    for( ... ) { ... }
  • Each thread executes a subset of the iterations.
  • All threads wait at the end of the parallel for.

19
Multiple Work Sharing Directives
  • May occur within a single parallel region:

    #pragma omp parallel
    {
    #pragma omp for
        for( ... ) { ... }
    #pragma omp for
        for( ... ) { ... }
    }
  • All threads wait at the end of the first for.

20
The NoWait Qualifier
    #pragma omp parallel
    {
    #pragma omp for nowait
        for( ... ) { ... }
    #pragma omp for
        for( ... ) { ... }
    }

  • Threads proceed to the second for without
    waiting.

21
Parallel Sections Directive
    #pragma omp parallel
    {
    #pragma omp sections
    {
    #pragma omp section   /* this is a delimiter */
        ...
    #pragma omp section
        ...
    }
    }

22
A Useful Shorthand
    #pragma omp parallel
    #pragma omp for
    for( ... ) { ... }

  • is equivalent to

    #pragma omp parallel for
    for( ... ) { ... }

  • (Same for parallel sections.)

23
Note the Difference between ...
    #pragma omp parallel
    {
    #pragma omp for
        for( ... ) { ... }
        f();
    #pragma omp for
        for( ... ) { ... }
    }

24
and ...
    #pragma omp parallel for
    for( ... ) { ... }

    f();

    #pragma omp parallel for
    for( ... ) { ... }

  • Here f() is executed once, sequentially, between
    the two parallel loops; in the previous version,
    every thread of the parallel region executes f().

25
Sequential Matrix Multiply
    for( i=0; i<n; i++ )
        for( j=0; j<n; j++ ) {
            c[i][j] = 0.0;
            for( k=0; k<n; k++ )
                c[i][j] += a[i][k]*b[k][j];
        }

26
OpenMP Matrix Multiply
    #pragma omp parallel for
    for( i=0; i<n; i++ )
        for( j=0; j<n; j++ ) {
            c[i][j] = 0.0;
            for( k=0; k<n; k++ )
                c[i][j] += a[i][k]*b[k][j];
        }
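
For reference, a self-contained compilable version (a sketch; the matrix size and initialization are illustrative assumptions; j and k are explicitly privatized here because they are declared outside the loops):

    /* omp_matmul.c: compile with e.g. gcc -fopenmp omp_matmul.c */
    #include <stdio.h>

    #define N 4
    double a[N][N], b[N][N], c[N][N];

    int main(void)
    {
        int i, j, k;
        for( i = 0; i < N; i++ )
            for( j = 0; j < N; j++ ) {
                a[i][j] = i + j;
                b[i][j] = (i == j);   /* b = identity, so c should equal a */
            }

        #pragma omp parallel for private( j, k )
        for( i = 0; i < N; i++ )
            for( j = 0; j < N; j++ ) {
                c[i][j] = 0.0;
                for( k = 0; k < N; k++ )
                    c[i][j] += a[i][k] * b[k][j];
            }

        printf("c[1][2] = %g (expect 3)\n", c[1][2]);
        return 0;
    }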

27
Sequential SOR
    for some number of timesteps/iterations {
        for( i=0; i<n; i++ )
            for( j=1; j<n; j++ )
                temp[i][j] = 0.25 *
                    ( grid[i-1][j] + grid[i+1][j] +
                      grid[i][j-1] + grid[i][j+1] );
        for( i=0; i<n; i++ )
            for( j=1; j<n; j++ )
                grid[i][j] = temp[i][j];
    }

28
OpenMP SOR
    for some number of timesteps/iterations {
    #pragma omp parallel for
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ )
                temp[i][j] = 0.25 *
                    ( grid[i-1][j] + grid[i+1][j] +
                      grid[i][j-1] + grid[i][j+1] );
    #pragma omp parallel for
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ )
                grid[i][j] = temp[i][j];
    }

29
Equivalent OpenMP SOR
    for some number of timesteps/iterations {
    #pragma omp parallel
    {
    #pragma omp for
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ )
                temp[i][j] = 0.25 *
                    ( grid[i-1][j] + grid[i+1][j] +
                      grid[i][j-1] + grid[i][j+1] );
    #pragma omp for
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ )
                grid[i][j] = temp[i][j];
    }
    }
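
A compilable sketch of the single-region variant (the grid size, iteration count, and boundary handling are illustrative assumptions; the loops run over interior points only so the stencil stays in bounds):

    /* omp_sor.c: compile with e.g. gcc -fopenmp omp_sor.c */
    #include <stdio.h>

    #define N 64
    #define STEPS 100
    double grid[N][N], temp[N][N];

    int main(void)
    {
        int i, j, t;
        for( i = 0; i < N; i++ )      /* boundary condition: left edge hot */
            grid[i][0] = 1.0;

        for( t = 0; t < STEPS; t++ ) {
            #pragma omp parallel private( i, j )
            {
                #pragma omp for
                for( i = 1; i < N-1; i++ )
                    for( j = 1; j < N-1; j++ )
                        temp[i][j] = 0.25 *
                            ( grid[i-1][j] + grid[i+1][j] +
                              grid[i][j-1] + grid[i][j+1] );
                /* implicit barrier at the end of the omp for */
                #pragma omp for
                for( i = 1; i < N-1; i++ )
                    for( j = 1; j < N-1; j++ )
                        grid[i][j] = temp[i][j];
            }
        }
        printf("grid[N/2][1] = %g\n", grid[N/2][1]);
        return 0;
    }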

30
Some Advanced Features
  • Conditional parallelism.
  • Scheduling options.
  • (More can be found in the specification)

31
Conditional Parallelism Issue
  • Oftentimes, parallelism is only useful if the
    problem size is sufficiently big.
  • For smaller sizes, the overhead of
    parallelization exceeds the benefit.

32
Conditional Parallelism Specification
    #pragma omp parallel if( expression )
    #pragma omp for if( expression )
    #pragma omp parallel for if( expression )

  • Execute in parallel if expression is true,
    otherwise execute sequentially.

33
Conditional Parallelism Example
    for( i=0; i<n; i++ ) {
    #pragma omp parallel for if( n-i > 100 )
        for( j=i+1; j<n; j++ )
            for( k=i+1; k<n; k++ )
                a[j][k] = a[j][k] - a[i][k]*a[i][j] / a[j][j];
    }

  • As i grows, the remaining subproblem shrinks, so
    the if clause turns parallelization off once it no
    longer pays.
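
A compilable sketch of the if clause in isolation (the threshold value and the work loop are illustrative assumptions):

    /* omp_if.c: compile with e.g. gcc -fopenmp omp_if.c */
    #include <stdio.h>
    #include <omp.h>

    #define THRESHOLD 100

    int main(void)
    {
        int n = 50;     /* below the threshold: the loop runs sequentially */
        int i;

        #pragma omp parallel for if( n > THRESHOLD )
        for( i = 0; i < n; i++ ) {
            if( i == 0 )   /* team size is 1 when the if clause is false */
                printf("team size: %d\n", omp_get_num_threads());
        }
        return 0;
    }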

34
Scheduling of Iterations Issue
  • Scheduling = assigning iterations to threads.
  • So far, we have assumed the default, which is
    block scheduling.
  • OpenMP allows other scheduling strategies as
    well, for instance cyclic, gss (guided
    self-scheduling), etc.

35
Scheduling of Iterations Specification
    #pragma omp parallel for schedule( <sched> )

  • <sched> can be one of
  • block (the default)
  • cyclic
  • gss

36
Example
  • Multiplication of two matrices C = A x B, where
    the A matrix is upper-triangular (all elements
    below the diagonal are 0).

[Figure: the matrix A, with zeros below the diagonal.]
37
Sequential Matrix Multiply Becomes
    for( i=0; i<n; i++ )
        for( j=0; j<n; j++ ) {
            c[i][j] = 0.0;
            for( k=i; k<n; k++ )
                c[i][j] += a[i][k]*b[k][j];
        }

  • Load imbalance with block distribution.

38
OpenMP Matrix Multiply
    #pragma omp parallel for schedule( cyclic )
    for( i=0; i<n; i++ )
        for( j=0; j<n; j++ ) {
            c[i][j] = 0.0;
            for( k=i; k<n; k++ )
                c[i][j] += a[i][k]*b[k][j];
        }
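
A note on spelling: the OpenMP standard itself writes these schedule kinds as schedule(static) (block), schedule(static, 1) (cyclic), and schedule(guided) (gss). A compilable version of this slide using the standard spelling (matrix size and initialization are illustrative assumptions):

    /* omp_tri_matmul.c: compile with e.g. gcc -fopenmp omp_tri_matmul.c */
    #include <stdio.h>

    #define N 8
    double a[N][N], b[N][N], c[N][N];

    int main(void)
    {
        int i, j, k;
        for( i = 0; i < N; i++ )
            for( j = 0; j < N; j++ ) {
                a[i][j] = (j >= i) ? 1.0 : 0.0;   /* upper-triangular A */
                b[i][j] = 1.0;
            }

        /* chunk size 1 distributes rows cyclically, balancing
           the inner loops, which shrink as i grows */
        #pragma omp parallel for schedule( static, 1 ) private( j, k )
        for( i = 0; i < N; i++ )
            for( j = 0; j < N; j++ ) {
                c[i][j] = 0.0;
                for( k = i; k < N; k++ )
                    c[i][j] += a[i][k] * b[k][j];
            }

        printf("c[0][0] = %g (expect %d)\n", c[0][0], N);
        return 0;
    }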

39
Data Environment Directives (1 of 2)
  • All variables are by default shared.
  • One exception: the loop variable of a parallel
    for is private.
  • By using data directives, some variables can be
    made private or given other special
    characteristics.

40
Reminder Matrix Multiply
    #pragma omp parallel for
    for( i=0; i<n; i++ )
        for( j=0; j<n; j++ ) {
            c[i][j] = 0.0;
            for( k=0; k<n; k++ )
                c[i][j] += a[i][k]*b[k][j];
        }

  • a, b, c are shared
  • i, j, k are private

41
Data Environment Directives (2 of 2)
  • Private
  • Threadprivate
  • Reduction

42
Private Variables
    #pragma omp parallel for private( list )

  • Makes a private copy per thread of each variable
    in the list.
  • This and all further examples use parallel for,
    but the same applies to the other region and
    work-sharing directives.

43
Private Variables Example (1 of 2)
    for( i=0; i<n; i++ ) {
        tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }

  • Swaps the values in a and b.
  • Loop-carried dependence on tmp.
  • Easily fixed by privatizing tmp.

44
Private Variables Example (2 of 2)
    #pragma omp parallel for private( tmp )
    for( i=0; i<n; i++ ) {
        tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }

  • Removes the dependence on tmp.
  • Would be more difficult to do in Pthreads.
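
A minimal compilable version (a sketch; the array size and contents are illustrative assumptions):

    /* omp_swap.c: compile with e.g. gcc -fopenmp omp_swap.c */
    #include <stdio.h>

    #define N 8
    int a[N], b[N];

    int main(void)
    {
        int i, tmp;
        for( i = 0; i < N; i++ ) { a[i] = i; b[i] = -i; }

        /* each thread gets its own tmp, so iterations are independent */
        #pragma omp parallel for private( tmp )
        for( i = 0; i < N; i++ ) {
            tmp  = a[i];
            a[i] = b[i];
            b[i] = tmp;
        }

        printf("a[3] = %d, b[3] = %d (expect -3 and 3)\n", a[3], b[3]);
        return 0;
    }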

45
Private Variables Alternative 1
    for( i=0; i<n; i++ ) {
        tmp[i] = a[i];
        a[i] = b[i];
        b[i] = tmp[i];
    }

  • Requires a sequential program change.
  • Wasteful in space: O(n) instead of O(p).

46
Private Variables Alternative 2
    f()
    {
        int tmp;  /* local allocation on stack */
        for( i=from; i<to; i++ ) {
            tmp = a[i];
            a[i] = b[i];
            b[i] = tmp;
        }
    }

47
Threadprivate
  • Private variables are private on a parallel
    region basis.
  • Threadprivate variables are global variables that
    are private throughout the execution of the
    program.

48
Threadprivate
    #pragma omp threadprivate( list )

  • Example: #pragma omp threadprivate( x )
  • Requires a program change in Pthreads:
  • requires an array of size p,
  • accessed as x[pthread_self()],
  • costly if accessed frequently.
  • Not cheap in OpenMP either.
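
A compilable sketch of threadprivate (the counter variable and what it counts are illustrative assumptions):

    /* omp_threadprivate.c: compile with e.g. gcc -fopenmp omp_threadprivate.c */
    #include <stdio.h>
    #include <omp.h>

    int counter;                       /* one instance per thread */
    #pragma omp threadprivate( counter )

    int main(void)
    {
        #pragma omp parallel
        {
            counter = 0;               /* each thread touches its own copy */
            for( int i = 0; i < 1000; i++ )
                counter++;
            #pragma omp critical
            printf("thread %d: counter = %d\n",
                   omp_get_thread_num(), counter);
        }
        return 0;
    }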

49
Reduction Variables
    #pragma omp parallel for reduction( op : list )

  • op is one of +, *, -, &, ^, |, &&, or ||.
  • The variables in list must be used with this
    operator in the loop.
  • The variables are automatically initialized to
    sensible values.

50
Reduction Variables Example
    #pragma omp parallel for reduction( + : sum )
    for( i=0; i<n; i++ )
        sum += a[i];

  • sum is automatically initialized to zero.
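
As a compilable sketch (the array size and contents are illustrative assumptions):

    /* omp_reduction.c: compile with e.g. gcc -fopenmp omp_reduction.c */
    #include <stdio.h>

    #define N 100
    int a[N];

    int main(void)
    {
        int i, sum = 0;
        for( i = 0; i < N; i++ )
            a[i] = i + 1;

        /* each thread accumulates a private partial sum;
           the partial sums are combined with + at the end */
        #pragma omp parallel for reduction( + : sum )
        for( i = 0; i < N; i++ )
            sum += a[i];

        printf("sum = %d (expect 5050)\n", sum);
        return 0;
    }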

51
SOR Sequential Code with Convergence
    for( ; diff > delta; ) {
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ )
                /* compute temp[i][j] as before */;
        diff = 0;
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ ) {
                diff = max(diff, fabs(grid[i][j] -
                                      temp[i][j]));
                grid[i][j] = temp[i][j];
            }
    }

52
OpenMP SOR Code with Convergence
    for( ; diff > delta; ) {
    #pragma omp parallel for
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ )
                /* compute temp[i][j] as before */;
        diff = 0;
    #pragma omp parallel for reduction( max : diff )
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ ) {
                diff = max(diff, fabs(grid[i][j] -
                                      temp[i][j]));
                grid[i][j] = temp[i][j];
            }
    }

53
OpenMP SOR Code with Convergence
    for( ; diff > delta; ) {
    #pragma omp parallel for
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ )
                /* compute temp[i][j] as before */;
        diff = 0;
    #pragma omp parallel for reduction( max : diff )
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ ) {
                diff = max(diff, fabs(grid[i][j] -
                                      temp[i][j]));
                grid[i][j] = temp[i][j];
            }
    }

  • Bummer: no reduction operator for max or min.
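
(That limitation reflects the OpenMP versions of this deck's era; since OpenMP 3.1, C/C++ do support reduction(max: ...) and reduction(min: ...). A compilable sketch, with the grid size and contents as illustrative assumptions:)

    /* omp_max_reduction.c: needs OpenMP 3.1+,
       compile with e.g. gcc -fopenmp omp_max_reduction.c -lm */
    #include <stdio.h>
    #include <math.h>

    #define N 16
    double grid[N][N], temp[N][N];

    int main(void)
    {
        int i, j;
        double diff = 0.0;
        grid[3][4] = 1.0;             /* make one entry differ */

        #pragma omp parallel for private( j ) reduction( max : diff )
        for( i = 0; i < N; i++ )
            for( j = 0; j < N; j++ ) {
                double d = fabs( grid[i][j] - temp[i][j] );
                if( d > diff ) diff = d;
            }

        printf("diff = %g (expect 1)\n", diff);
        return 0;
    }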

54
Synchronization Primitives
  • Critical:

    #pragma omp critical (name)

  • Implements critical sections, optionally by
    name.
  • Similar to Pthreads mutex locks (one lock per
    name).
  • Barrier:

    #pragma omp barrier

  • Implements a global barrier.
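
A compilable sketch combining both primitives (the shared counter and the critical-section name are illustrative assumptions):

    /* omp_sync.c: compile with e.g. gcc -fopenmp omp_sync.c */
    #include <stdio.h>
    #include <omp.h>

    int hits;

    int main(void)
    {
        #pragma omp parallel
        {
            /* named critical section: one thread at a time */
            #pragma omp critical (hits_lock)
            hits++;

            /* wait until every thread has updated hits */
            #pragma omp barrier

            if( omp_get_thread_num() == 0 )
                printf("hits = %d (one per thread)\n", hits);
        }
        return 0;
    }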

55
OpenMP SOR with Convergence (1 of 2)
    #pragma omp parallel private( mydiff )
    {
        for( ; diff > delta; ) {
    #pragma omp for nowait
            for( i=from; i<to; i++ )
                for( j=0; j<n; j++ )
                    /* compute temp[i][j] as before */;
            diff = 0.0;
            mydiff = 0.0;
    #pragma omp barrier
            ...

56
OpenMP SOR with Convergence (2 of 2)
            ...
    #pragma omp for nowait
            for( i=from; i<to; i++ )
                for( j=0; j<n; j++ ) {
                    mydiff = max(mydiff,
                        fabs(grid[i][j] - temp[i][j]));
                    grid[i][j] = temp[i][j];
                }
    #pragma omp critical
            diff = max( diff, mydiff );
    #pragma omp barrier
        }
    }

57
Synchronization Primitives
  • Big bummer: no condition variables.
  • Result: must busy-wait for condition
    synchronization.
  • Clumsy.
  • Very inefficient on some architectures.

58
PIPE Sequential Program
    for( i=0; i<num_pic, read(in_pic); i++ ) {
        int_pic_1 = trans1( in_pic );
        int_pic_2 = trans2( int_pic_1 );
        int_pic_3 = trans3( int_pic_2 );
        out_pic   = trans4( int_pic_3 );
    }

59
Sequential vs. Parallel Execution
[Figure: timelines of sequential vs. pipelined parallel execution; color = picture, horizontal line = processor.]

60
PIPE Parallel Program
    P0: for( i=0; i<num_pics, read(in_pic); i++ ) {
            int_pic_1[i] = trans1( in_pic );
            signal( event_1_2[i] );
        }

    P1: for( i=0; i<num_pics; i++ ) {
            wait( event_1_2[i] );
            int_pic_2[i] = trans2( int_pic_1[i] );
            signal( event_2_3[i] );
        }

61
PIPE Main Program
    #pragma omp parallel sections
    {
    #pragma omp section
        stage1();
    #pragma omp section
        stage2();
    #pragma omp section
        stage3();
    #pragma omp section
        stage4();
    }
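
A minimal compilable sections demo (a sketch; the stage bodies are illustrative stubs, not the deck's pipeline stages):

    /* omp_sections.c: compile with e.g. gcc -fopenmp omp_sections.c */
    #include <stdio.h>
    #include <omp.h>

    void stage1(void) { printf("stage1 on thread %d\n", omp_get_thread_num()); }
    void stage2(void) { printf("stage2 on thread %d\n", omp_get_thread_num()); }
    void stage3(void) { printf("stage3 on thread %d\n", omp_get_thread_num()); }
    void stage4(void) { printf("stage4 on thread %d\n", omp_get_thread_num()); }

    int main(void)
    {
        /* each section is executed once, by some thread of the team */
        #pragma omp parallel sections
        {
            #pragma omp section
            stage1();
            #pragma omp section
            stage2();
            #pragma omp section
            stage3();
            #pragma omp section
            stage4();
        }
        return 0;
    }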

62
PIPE Stage 1
    void stage1()
    {
        num1 = 0;
        for( i=0; i<num_pics, read(in_pic); i++ ) {
            int_pic_1[i] = trans1( in_pic );
    #pragma omp critical (c1)
            num1++;
        }
    }

63
PIPE Stage 2
    void stage2()
    {
        for( i=0; i<num_pics; i++ ) {
            do {
    #pragma omp critical (c1)
                cond = (num1 <= i);
            } while( cond );
            int_pic_2[i] = trans2( int_pic_1[i] );
    #pragma omp critical (c2)
            num2++;
        }
    }

64
OpenMP PIPE
  • Note the need to exit the critical section while
    waiting.
  • Otherwise, no other thread could enter it to
    update the counter.
  • Never busy-wait inside a critical section!