1
Shared Memory Parallel Programming
  • Introduction to OpenMP

2
OpenMP Overview
  • OpenMP: An API for Writing Multithreaded
    Applications
  • A set of compiler directives and library routines
    for parallel application programmers
  • Greatly simplifies writing multi-threaded (MT)
    programs in Fortran, C and C++
  • Standardizes the last 20 years of SMP practice

A sampler of OpenMP directives, runtime routines, and
environment variables:

C$OMP FLUSH
#pragma omp critical
C$OMP THREADPRIVATE(/ABC/)
CALL OMP_SET_NUM_THREADS(10)
C$OMP parallel do shared(a, b, c)
call omp_test_lock(jlok)
call OMP_INIT_LOCK (ilok)
C$OMP MASTER
C$OMP ATOMIC
C$OMP SINGLE PRIVATE(X)
setenv OMP_SCHEDULE "dynamic"
C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
C$OMP ORDERED
C$OMP PARALLEL REDUCTION (+: A, B)
C$OMP SECTIONS
#pragma omp parallel for private(A, B)
!$OMP BARRIER
C$OMP PARALLEL COPYIN(/blk/)
C$OMP DO lastprivate(XX)
Nthrds = OMP_GET_NUM_PROCS()
omp_set_lock(lck)

The name OpenMP is the property of the OpenMP
Architecture Review Board.
3
OpenMP Programming Model
  • Master thread spawns a team of threads as needed
  • Parallelism is added incrementally until the
    desired performance is achieved, i.e. the
    sequential program evolves into a parallel program

[Figure: fork-join diagram - the master thread runs sequentially,
forks a team of threads at each parallel region, and joins them
again at the region's end]
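
A minimal sketch of this fork-join behaviour in C (the printed
strings are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("sequential: master thread only\n");

    #pragma omp parallel              /* fork: master spawns a team */
    {
        printf("parallel: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                 /* join: implicit barrier here */

    printf("sequential again: master thread only\n");
    return 0;
}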
4
Life is Short, Remember?
It's official: OpenMP is easier to use than MPI!
5
How Mainstream Can You Be?
  • Based firmly upon prior experience (PCF)
  • Simplified and streamlined existing APIs
  • High-level programming model
  • Programmer makes strategic decisions
  • Compiler figures out the details
  • Generally available in standard commercial
    compilers
  • Including Microsoft's, and now GNU's
  • Research compilers: Omni, OpenUH, PCOMP, etc.

6
The OpenMP ARB
  • OpenMP is maintained by the OpenMP Architecture
    Review Board (the ARB), which:
  • Interprets OpenMP
  • Writes new specifications - keeps OpenMP relevant
  • Works to increase the impact of OpenMP
  • Members are organizations - not individuals
  • Current members:
  • Permanent: Cray, Fujitsu, HP, IBM, Intel, MS,
    NEC, PGI, SGI, Sun
  • Auxiliary: ASCI, cOMPunity, EPCC, KSL, NASA, RWTH
    Aachen

www.compunity.org
7
OpenMP Release History
1997: OpenMP Fortran 1.0
1998: OpenMP C/C++ 1.0
1999: OpenMP Fortran 1.1
2005: OpenMP 2.5 - a single specification for
      Fortran, C and C++
8
OpenMP 2.5
  • Merged language-specific APIs
  • Fixed minor problems
  • Reorganized material
  • Improved specification of nested parallelism
  • Internal control variables
  • Fixed the flush (memory model)

9
Where Will OpenMP Be Relevant in the Future?
It's either multithreading, or a real heat wave.
Simultaneous multithreading, hyperthreading, chip
multithreading, streaming
10
OpenMP Definitions: Constructs vs. Regions
in OpenMP
OpenMP constructs occupy a single compilation
unit while a region can span multiple source
files.
poo.f:

      call whoami
C$OMP PARALLEL
      call whoami
C$OMP END PARALLEL

bar.f:

      subroutine whoami
      external omp_get_thread_num
      integer iam, omp_get_thread_num
      iam = omp_get_thread_num()
C$OMP CRITICAL
      print *, 'Hello from ', iam
C$OMP END CRITICAL
      return
      end

A parallel construct (the C$OMP PARALLEL block in poo.f).
The parallel region is the text of the construct
plus any code called from inside the construct.
Orphan constructs (such as the CRITICAL in bar.f) can
execute outside the lexical extent of a parallel construct.
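
A C rendering of the same two-file structure may help; this is a
sketch, with the file split of the Fortran example kept as comments:

#include <stdio.h>
#include <omp.h>

/* bar.c: this critical construct is "orphaned" - it is not
   lexically enclosed by any parallel construct, yet it executes
   inside the parallel region created in main() */
void whoami(void)
{
    int iam = omp_get_thread_num();
    #pragma omp critical
    printf("Hello from %d\n", iam);
}

/* poo.c */
int main(void)
{
    #pragma omp parallel
    whoami();          /* the call is part of the parallel region */
    return 0;
}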
11
Parallel Regions
  • You create threads in OpenMP with the omp
    parallel pragma.
  • For example, to create a 4-thread parallel region:

double A[1000];
omp_set_num_threads(4);          /* runtime function to request a
                                    certain number of threads */
#pragma omp parallel
{
    int ID = omp_get_thread_num();   /* runtime function returning
                                        a thread ID */
    pooh(ID, A);    /* each thread executes a copy of the code
                       within the structured block */
}
  • Each thread calls pooh(ID,A) for ID = 0 to 3

12
Parallel Regions
  • You create threads in OpenMP with the omp
    parallel pragma.
  • For example, to create a 4-thread parallel region:

double A[1000];
#pragma omp parallel num_threads(4)   /* clause to request a
                                         certain number of threads */
{
    int ID = omp_get_thread_num();    /* runtime function returning
                                         a thread ID */
    pooh(ID, A);    /* each thread executes a copy of the code
                       within the structured block */
}
  • Each thread calls pooh(ID,A) for ID = 0 to 3

13
Parallel Regions
double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    pooh(ID, A);
}
printf("all done\n");
  • Each thread executes the same code redundantly.

[Diagram: the master thread runs "double A[1000];
omp_set_num_threads(4);" - a single copy of A is shared between all
threads - then four threads call pooh(0,A), pooh(1,A), pooh(2,A) and
pooh(3,A) in parallel; the threads wait at an implicit barrier for
all threads to finish before the master proceeds to
printf("all done\n").]
14
Exercise: A multi-threaded "Hello World" program
  • Write a multithreaded program where each thread
    prints hello world.

#include <stdio.h>

int main()
{
    int ID = 0;
    printf(" hello(%d) ", ID);
    printf(" world(%d) \n", ID);
    return 0;
}
15
A multi-threaded Hello world program
  • Write a multithreaded program where each thread
    prints hello world.

#include <omp.h>      /* OpenMP include file */
#include <stdio.h>

int main()
{
    #pragma omp parallel    /* parallel region with the default
                               number of threads */
    {
        int ID = omp_get_thread_num();   /* runtime library function
                                            to return a thread ID */
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
    }    /* end of the parallel region */
    return 0;
}

Sample output:
hello(1) hello(0) world(1) world(0)
hello(3) hello(2) world(3) world(2)
16
Parallel Regions and the if clause: Active vs.
inactive parallel regions
  • An optional if clause causes the parallel region
    to be active only if the logical expression
    within the clause evaluates to true.
  • An if clause that evaluates to false causes the
    parallel region to be inactive (i.e. executed by
    a team of size one).

double A[N];
#pragma omp parallel if (N > 1000)
{
    int ID = omp_get_thread_num();
    pooh(ID, A);
}
17
OpenMP Work-Sharing Constructs
  • The for work-sharing construct splits up loop
    iterations among the threads in a team

#pragma omp parallel
#pragma omp for
for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
}

By default, there is a barrier at the end of the
omp for. Use the nowait clause to turn off the
barrier:

#pragma omp for nowait

nowait is useful between two consecutive,
independent omp for loops.
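
A hedged sketch of that case - two back-to-back loops that touch
disjoint arrays, so the first barrier can be dropped (all array
names and bounds here are illustrative):

#include <omp.h>

#define N 1000
#define M 1000

void two_loops(double a[N], const double b[N], const double c[N],
               double x[M], const double y[M])
{
    #pragma omp parallel
    {
        #pragma omp for nowait        /* barrier dropped: safe because */
        for (int i = 0; i < N; i++)   /* the next loop never reads a[] */
            a[i] = b[i] + c[i];

        #pragma omp for               /* implicit barrier kept here */
        for (int i = 0; i < M; i++)
            x[i] = y[i] * 2.0;
    }
}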
18
Work-Sharing Constructs: A motivating example
Sequential code:

for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:

#pragma omp parallel
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id + 1) * N / Nthrds;
    for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
}

OpenMP parallel region and a work-sharing for-construct:

#pragma omp parallel
#pragma omp for schedule(static)
for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
19
OpenMP For/Do Construct: The schedule clause
  • Affects how loop iterations are mapped onto
    threads
  • schedule(static, chunk)
  • Deal out blocks of iterations of size chunk to
    each thread.
  • schedule(dynamic,chunk)
  • Each thread grabs chunk iterations off a queue
    until all iterations have been handled.
  • schedule(guided,chunk)
  • Threads dynamically grab blocks of iterations.
    The size of the block starts large and shrinks
    down to size chunk as the calculation proceeds.
  • schedule(runtime)
  • Schedule and chunk size taken from the
    OMP_SCHEDULE environment variable.
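
A short sketch of two of these clauses in use; process() and N are
hypothetical stand-ins for the real per-iteration work and bound:

#include <omp.h>

extern void process(int i);   /* hypothetical per-iteration work */

void run_examples(int N)
{
    /* dynamic: each thread grabs chunks of 4 iterations off a queue -
       useful when the cost of process(i) varies unpredictably */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; i++)
        process(i);

    /* runtime: schedule and chunk size are read from the OMP_SCHEDULE
       environment variable, e.g.  setenv OMP_SCHEDULE "guided,8" */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < N; i++)
        process(i);
}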

20
The schedule clause
The schedule kinds span a spectrum: at one end, the
least work at runtime (scheduling done at compile
time, as with static); at the other, the most work at
runtime (complex scheduling logic used at run time,
as with dynamic and guided).
21
The schedule clause
20 iterations, 6 threads, static schedule:
3 iterations per thread → the last thread has 5 iterations
4 iterations per thread → the last thread has 0 iterations!
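
The arithmetic behind that imbalance, as a small self-contained check:

#include <stdio.h>

int main(void)
{
    int N = 20, P = 6;                /* 20 iterations, 6 threads */

    int chunk = N / P;                /* 3 iterations per thread ... */
    printf("chunk %d: last thread runs %d iterations\n",
           chunk, N - (P - 1) * chunk);              /* prints 5 */

    chunk = (N + P - 1) / P;          /* 4 iterations per thread ... */
    int rest = N - (P - 1) * chunk;
    printf("chunk %d: last thread runs %d iterations\n",
           chunk, rest > 0 ? rest : 0);              /* prints 0 */
    return 0;
}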
22
OpenMP Work-Sharing Constructs
  • The Sections work-sharing construct gives a
    different structured block to each thread.

#pragma omp parallel
#pragma omp sections
{
    #pragma omp section
    X_calculation();
    #pragma omp section
    y_calculation();
    #pragma omp section
    z_calculation();
}

By default, there is a barrier at the end of the
omp sections. Use the nowait clause to turn off
the barrier.
23
OpenMP Work-Sharing Constructs
  • The master construct denotes a structured block
    that is only executed by the master thread. The
    other threads just skip it (no synchronization is
    implied).

#pragma omp parallel private (tmp)
{
    do_many_things();
    #pragma omp master
    {
        exchange_boundaries();
    }
    #pragma omp barrier
    do_many_other_things();
}
24
OpenMP Work-Sharing Constructs
  • The single construct denotes a block of code that
    is executed by only one thread.
  • A barrier is implied at the end of the single
    block.

#pragma omp parallel private (tmp)
{
    do_many_things();
    #pragma omp single
    {
        exchange_boundaries();
    }
    do_many_other_things();
}
25
Combined parallel/work-share
  • OpenMP shortcut: put the parallel and the
    work-share on the same line

double res[MAX];
int i;
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < MAX; i++) {
        res[i] = huge();
    }
}

double res[MAX];
int i;
#pragma omp parallel for
for (i = 0; i < MAX; i++) {
    res[i] = huge();
}

These are equivalent.
  • There's also a parallel sections construct; a
    sketch follows.
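
A minimal sketch of that combined form, reusing the calculation
functions from the sections slide above:

#pragma omp parallel sections
{
    #pragma omp section
    X_calculation();
    #pragma omp section
    y_calculation();
    #pragma omp section
    z_calculation();
}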

26
Some Examples
  • Three examples of application parallelization
    under OpenMP
  • Remember: the application developer gives the
    parallelization strategy
  • The implementation figures out the details of the
    work to be performed by each thread
  • It also maps threads to hardware resources at run
    time

27
Matrix Multiply
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        c[i][j] = 0.0;

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
28
Parallel Matrix Multiply
  • No loop-carried dependences in i- or j-loop
  • In both loop nests
  • Loop-carried dependence on k-loop
  • All i- and j-iterations can be run in parallel
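
The k-loop dependence is a plain accumulation, so if one did want to
parallelize it, OpenMP's reduction clause would apply. A sketch for a
single element of c (the function name and signature are illustrative):

/* computes one element c[i][j] of the product; a_row is row i of a */
double cell(int n, const double a_row[], double *b[], int j)
{
    double sum = 0.0;
    /* the loop-carried dependence on sum becomes a reduction: each
       thread accumulates a private partial sum, combined at the end */
    #pragma omp parallel for reduction(+ : sum)
    for (int k = 0; k < n; k++)
        sum += a_row[k] * b[k][j];
    return sum;
}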

29
Problem Statement
[Figure: the matrices in C = A x B, with row index i and column
index j labeled]
30
Matrix Multiply
#pragma omp parallel for
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        c[i][j] = 0.0;

#pragma omp parallel for
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];

31
Parallel Matrix Multiply (contd.)
  • OpenMP permits parallelization of only one loop
    in a loop nest
  • We have chosen an approach with coarse
    granularity
  • We could have parallelized the j loops instead
  • Performance is influenced by the cost of memory
    accesses
  • It may require some experimentation to choose the
    best strategy

Homework: experiment with OpenMP matrix
multiplication. A starting point follows below.
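
Here is a sketch of the finer-grained j-loop variant the previous
slide mentions (C99 array parameters; assumes c was already zeroed
by the first loop nest):

void matmul_j(int n, double a[n][n], double b[n][n], double c[n][n])
{
    for (int i = 0; i < n; i++) {
        /* the team works on one row of c per i iteration: finer
           granularity, but more scheduling overhead than putting
           the directive on the i loop */
        #pragma omp parallel for
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
    }
}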