Shared Memory Programming with OpenMP - PowerPoint PPT Presentation

1
Shared Memory Programming with OpenMP
  • Some examples from Quinn Ch 17

2
A common parallel computing model is the
message-passing model.
  • Each process has local memory
  • Computation only on data in local memory
  • Processors exchange data through communication
    (MPI)

3
Another parallel computing model: shared-memory
parallel programming.
  • Global memory
  • Parallelization by threads
[Diagram: at run time, the master thread runs in serial mode]
4
Another parallel computing model: shared-memory
parallel programming.
  • Global memory
  • Parallelization by threads
[Diagram: at run time, a fork creates new threads, which run in parallel mode]
5
Another parallel computing model: shared-memory
parallel programming.
  • Global memory
  • Parallelization by threads
[Diagram: a fork creates parallel threads; after the join, the master thread again runs in serial mode]
6
OpenMP is a standard for shared-memory
programming.
  • Allows incremental parallelization
  • Profile the serial code
  • Mark for parallelization the loops that take
    the most time
  • Still have to check that the marked loops can
    be executed correctly in parallel.
  • Performance of shared-memory codes is likely to
    scale poorly to large numbers of processors.

7
Jacobi code: serial
double u_new[N+2], u_old[N+2];
u_old[0] = 0.0; u_old[N+1] = 0.0;
u_new[0] = 0.0; u_new[N+1] = 0.0;
h = 1.0/N; h2 = h*h;
for (iteration = 0; iteration < max_iteration; iteration++) {
    for (i = 1; i <= N; i++)
        u_new[i] = 0.5*(h2 + u_old[i+1] + u_old[i-1]);
    . . .
}
8
Jacobi code OpenMP
include ltomp.hgt double u_newN2,
u_oldN2 u_old00.0, u_oldN10.0 u_new0
0.0, u_newN10.0 h1/N,h2hh for
(iteration0 iterationltmax_iteration
iteration) pragma omp parallel for for
(i1 iltN i) u_newi0.5(h2u_old
i1u_oldi-1) . . .
9
OpenMP parallel for
  • To allow the compiler to parallelize the loop, the control
    clause must have canonical shape.
  • for (i = start; i < end; incr)
  • The comparison may be <, <=, >, or >=
  • The increment incr may be any of:
  • i++    ++i
  • i--    --i
  • i += inc    i -= inc
  • i = i + inc    i = i - inc
10
OpenMP parallel for
  • To allow the compiler to parallelize the loop, the loop body
    can't contain statements that allow the loop to exit
    prematurely.
  • No break
  • No return
  • No exit
  • No goto statements to labels outside the loop

11
OpenMP: how many threads to use?
  • int omp_get_num_procs(void)
  • Returns the number of physical processors
    available for use by the parallel program.
  • void omp_set_num_threads(int t)
  • Sets the number of threads to be used in parallel
    sections.
  • Can also be controlled by the environment
    variable OMP_NUM_THREADS

12
OpenMP: how many threads to use?
  • int omp_get_num_procs(void)
  • Returns the number of physical processors
    available for use by the parallel program.
  • void omp_set_num_threads(int t)
  • Sets the number of threads to be used in parallel
    sections.
  • Can also be controlled by the environment
    variable OMP_NUM_THREADS

t = omp_get_num_procs();
omp_set_num_threads(t);
13
OpenMP shared and private variables
  • A shared variable has the same address in every
    thread (there's only one copy)
  • All threads can access shared variables
  • A private variable has a different address in
    each thread (there's a copy for each thread)
  • A thread cannot access a private variable of
    another thread
  • Default for the parallel for pragma:
  • All variables are shared except the loop
    index, which is private.

14
Jacobi code: OpenMP
#include <omp.h>
double u_new[N+2], u_old[N+2];
u_old[0] = 0.0; u_old[N+1] = 0.0;
u_new[0] = 0.0; u_new[N+1] = 0.0;
h = 1.0/N; h2 = h*h;
for (iteration = 0; iteration < max_iteration; iteration++) {
    #pragma omp parallel for
    for (i = 1; i <= N; i++)
        u_new[i] = 0.5*(h2 + u_old[i+1] + u_old[i-1]);
    . . .
}
15
Floyd's Algorithm
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
16
Floyd's Algorithm
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
Which loop to parallelize? Which loops have
dependencies? The i and j loops have no
dependencies.
17
Floyd's Algorithm
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
Which loop to parallelize? Which loops have
dependencies? The i and j loops have no
dependencies.
D[k][j] = min(D[k][j], D[k][k] + D[k][j])
D[i][k] = min(D[i][k], D[i][k] + D[k][k])
18
Floyd's Algorithm
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
Which loop to parallelize? Which loops have
dependencies? The i and j loops have no
dependencies.
D[k][j] and D[i][k] do not change in the k-th
iteration:
D[k][j] = min(D[k][j], D[k][k] + D[k][j])
D[i][k] = min(D[i][k], D[i][k] + D[k][k])
19
Floyd's Algorithm: OpenMP v.1
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++) {
        #pragma omp parallel for
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
    }
20
Floyd's Algorithm: OpenMP v.1
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++) {
        #pragma omp parallel for
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
    }
Pay the fork/join overhead n² times.
21
Floyd's Algorithm: OpenMP v.2, INCORRECT
for (k = 0; k < n; k++) {
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
}
22
Floyd's Algorithm: OpenMP v.2, INCORRECT
for (k = 0; k < n; k++) {
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
}
By default, only i will be a private
variable. Everything else, including j, will be a
shared variable. Each thread will be initializing
and incrementing the same j. Unlikely to get
correct results.
23
Floyd's Algorithm: OpenMP v.2, CORRECT
for (k = 0; k < n; k++) {
    #pragma omp parallel for private(j)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
}
  • A clause is an optional, additional component of
    a pragma.
  • The private (<variable list>) clause directs the
    compiler to make the listed variables private.

24
Floyd's Algorithm: OpenMP v.2, CORRECT
for (k = 0; k < n; k++) {
    #pragma omp parallel for private(j)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
}
Pay the fork/join overhead n times.
  • A clause is an optional, additional component of
    a pragma.
  • The private (<variable list>) clause directs the
    compiler to make the listed variables private.

25
OpenMP private variables
  • By default, private variables are undefined at
    loop entry and loop exit.
  • The clause firstprivate (x) directs the compiler
    to make x a private variable whose initial value
    for each thread is the value of x in the master
    thread before the loop.
  • The clause lastprivate (x) directs the compiler
    to make x a private variable whose value in the
    master thread after the loop will be whatever the
    value of x is in the thread that did the
    iteration that would come last sequentially.

26
Pi code: serial
h = 1.0 / (double) n;
area = 0.0;
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    area += (4.0/(1.0 + x*x));
}
pi = h * area;
27
Pi code: OpenMP, INCORRECT
h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    area += (4.0/(1.0 + x*x));
}
pi = h * area;
28
Race condition
          Thread A               Thread B               Value of area
Step 1    reads 11.667                                  11.667
Step 2                           reads 11.667           11.667
Step 3    adds 3.765, writes                            15.432
Step 4                           adds 3.563, writes     15.230
Thread A's update is lost.
29
Race condition
          Thread A               Thread B               Value of area
Step 1    reads 11.667                                  11.667
Step 2                           reads 11.667           11.667
Step 3    adds 3.765, writes                            15.432
Step 4                           adds 3.563, writes     15.230
  • The += operation is not an atomic (indivisible)
    operation.
  • The race condition results in code whose
    numerical results are nondeterministic.
  • One solution is to force the operation to be
    executed by one thread at a time.
30
Pi code: OpenMP. CORRECT, but inefficient.
h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    #pragma omp critical
    area += (4.0/(1.0 + x*x));
}
pi = h * area;
The critical section is executed by one thread at a
time. Limits attainable speedup via Amdahl's law.
31
Pi code: OpenMP. CORRECT.
h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x) reduction(+:area)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    area += (4.0/(1.0 + x*x));
}
pi = h * area;
  • Note the reduction clause on the parallel for pragma
  • The compiler handles setting up the private variables
    for the partial sums
  • Functionally like MPI_Reduce
  • syntax: reduction(<op>:<variable>)

32
The fork/join cost may be greater than the parallel
gain from splitting the work.
h = 1.0 / (double) n;
a = 0.0;
#pragma omp parallel for private(x) reduction(+:a) if(n > 500)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    a += (4.0/(1.0 + x*x));
}
pi = h * a;
  • Note the if() clause on the parallel for pragma
  • syntax: if (<scalar expression>)
  • If the scalar expression evaluates true, the loop is
    parallelized
  • otherwise it is executed sequentially on the master thread
  • pay the fork/join overhead only when the loop contains
    enough work to cover this cost

33
The fork/join cost may be reduced by reordering
loops.
for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];
34
The fork/join cost may be reduced by reordering
loops.
for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];

for (i = 1; i < m; i++) {
    #pragma omp parallel for
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];
}
35
The fork/join cost may be reduced by reordering
loops.
for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];

for (i = 1; i < m; i++) {
    #pragma omp parallel for
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];
}

#pragma omp parallel for private(i)
for (j = 0; j < n; j++)
    for (i = 1; i < m; i++)
        a[i][j] = 2 * a[i-1][j];
36
OpenMP and functional parallelism
v = velocity_solve();
p = pressure_solve();
e = energy(v, p);
g = grids();
Plot(e, g);
37
OpenMP and functional parallelism
v = velocity_solve();
p = pressure_solve();
e = energy(v, p);
g = grids();
Plot(e, g);
[Task graph: V, P, and G are independent; E depends on V and P; Plot depends on E and G]
38
OpenMP and functional parallelism
#pragma omp parallel sections
{
    #pragma omp section
    v = velocity_solve();
    #pragma omp section
    p = pressure_solve();
    #pragma omp section
    g = grids();
}
e = energy(v, p);
Plot(e, g);