OpenMP in Practice

Transcript and Presenter's Notes
1
OpenMP in Practice
  • Gina Goff
  • Rice University

2
Outline
  • Introduction
  • Parallelism, Synchronization, and Environments
  • Restructuring/Designing Programs in OpenMP
  • Example Programs

3
Outline
  • Introduction
  • Parallelism, Synchronization, and Environments
  • Restructuring/Designing Programs in OpenMP
  • Example Programs

4
OpenMP
  • A portable fork-join parallel model for
    shared-memory architectures
  • Portable
  • Based on Parallel Computing Forum (PCF)
  • Fortran 77 binding here today; C coming this year

5
OpenMP (2)
  • Fork-join model
  • Execution starts with one thread of control
  • Parallel regions fork off new threads on entry
  • Threads join back together at the end of the
    region
  • Shared memory
  • (Some) Memory can be accessed by all threads

6
Shared Memory
  • Computation(s) using several processors
  • Each processor has some private memory
  • Each processor has access to a memory shared with
    the other processors
  • Synchronization
  • Used to protect integrity of parallel program
  • Prevents unsafe memory accesses
  • Fine-grained synchronization (point to point)
  • Barriers used for global synchronization

7
Shared Memory in Pictures
8
OpenMP
  • Two basic flavors of parallelism
  • Coarse-grained
  • Program is broken into segments (threads) that
    can be executed in parallel
  • Use barriers to re-synchronize execution at the
    end
  • Fine-grained
  • Execute iterations of DO loop(s) in parallel

9
OpenMP in Pictures
10
Design of OpenMP
  • A flexible standard, easily implemented across
    different platforms
  • Control structures
  • Minimal for simplicity and encouraging common
    cases
  • PARALLEL, DO, SECTIONS, SINGLE, MASTER
  • Data environment
  • New data access capabilities for forked threads
  • SHARED, PRIVATE, REDUCTION

11
Design of OpenMP (2)
  • Synchronization
  • Simple implicit synch at beginning and end of
    control structures
  • Explicit synch for more complex patterns
    BARRIER, CRITICAL, ATOMIC, FLUSH, ORDERED
  • Runtime library
  • Manages modes for forking and scheduling threads
  • E.g., OMP_GET_THREAD_NUM

12
Who's In OpenMP?
  • Software Vendors
  • Absoft Corp.
  • Edinburgh Portable Compilers
  • Kuck & Associates, Inc.
  • Myrias Computer Technologies
  • Numerical Algorithms Group
  • The Portland Group, Inc.
  • Hardware Vendors
  • Digital Equipment Corp.
  • Hewlett-Packard
  • IBM
  • Intel
  • Silicon Graphics/Cray Research
  • Solution Vendors
  • ADINA R&D, Inc.
  • ANSYS, Inc.
  • CPLEX division of ILOG
  • Fluent, Inc.
  • LSTC Corp.
  • MECALOG SARL
  • Oxford Molecular Group PLC
  • Research Organizations
  • US Department of Energy ASCI Program
  • Universite Louis Pasteur, Strasbourg

13
Outline
  • Introduction
  • Parallelism, Synchronization, and Environments
  • Restructuring/Designing Programs in OpenMP
  • Example Programs

14
Control Structures
  • PARALLEL / END PARALLEL
  • The actual fork and join
  • Number of threads won't change inside a parallel
    region
  • Single Program Multiple Data (SPMD) execution
    within region
  • SINGLE / END SINGLE
  • (Short) sequential section
  • MASTER / END MASTER
  • SINGLE on master processor

!$OMP PARALLEL
      CALL S1
!$OMP SINGLE
      CALL S2
!$OMP END SINGLE
      CALL S3
!$OMP END PARALLEL
15
Control Structures (2)
  • DO / END DO
  • The classic parallel loop
  • Inside parallel region
  • Or convenient combined directive PARALLEL DO
  • Iteration space is divided among available
    threads
  • Loop index is private to thread by default
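
A minimal sketch of the combined directive (array names and bounds are illustrative, not from the slides):

!$OMP PARALLEL DO SHARED(A, B, N)
      DO I = 1, N
         A(I) = A(I) + B(I)
      END DO
!$OMP END PARALLEL DO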

16
Control Structures (3)
  • SECTIONS / END SECTIONS
  • Task parallelism, potentially MIMD
  • SECTION marks tasks
  • Inside parallel region
  • Nested parallelism
  • Requires creating new parallel region
  • Not supported on all OpenMP implementations
  • If not allowed, inner PARALLEL is a no-op

!$OMP PARALLEL SECTIONS
!$OMP SECTION
!$OMP   PARALLEL DO
        DO J = 1, 2
           CALL FOO(J)
        END DO
!$OMP   END PARALLEL DO
!$OMP SECTION
        CALL BAR(2)
!$OMP SECTION
!$OMP   PARALLEL DO
        DO K = 1, 3
           CALL BAR(K)
        END DO
!$OMP   END PARALLEL DO
!$OMP END PARALLEL SECTIONS
17
DO Scheduling
  • Static Scheduling (default)
  • Divides loop into equal size iteration chunks
  • Based on runtime loop limits
  • Totally parallel scheduling algorithm
  • Dynamic Scheduling
  • Threads go to scheduler to get next chunk
  • Guided: chunk sizes taper down toward the end of the
    loop
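
As an illustration of the clause syntax (the chunk size of 4 is arbitrary), a static schedule with an explicit chunk size hands out iterations to threads round-robin in blocks of 4:

!$OMP PARALLEL DO SCHEDULE(STATIC, 4)
      DO J = 1, 36
         CALL SUBR(J)
      END DO
!$OMP END PARALLEL DO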

18
DO Scheduling (2)
[Figure: iterations 1-36 of the loop below, shown distributed across six threads under dynamic scheduling and under guided scheduling]
!$OMP PARALLEL DO SCHEDULE(DYNAMIC,1)
      DO J = 1, 36
         CALL SUBR(J)
      END DO
!$OMP END PARALLEL DO

!$OMP PARALLEL DO SCHEDULE(GUIDED,1)
      DO J = 1, 36
         CALL SUBR(J)
      END DO
!$OMP END PARALLEL DO
19
Orphaned Directives
      PROGRAM main
!$OMP PARALLEL
      CALL foo()
      CALL bar()
      CALL error()
!$OMP END PARALLEL
      END

      SUBROUTINE error()
! Not allowed due to
! nested control structs
!$OMP SECTIONS
!$OMP SECTION
      CALL foo()
!$OMP SECTION
      CALL bar()
!$OMP END SECTIONS
      END

      SUBROUTINE foo()
!$OMP DO
      DO i = 1, n
         ...
      END DO
!$OMP END DO
      END

      SUBROUTINE bar()
!$OMP SECTIONS
!$OMP SECTION
      CALL section1()
!$OMP SECTION
      ...
!$OMP SECTION
      ...
!$OMP END SECTIONS
      END
20
OpenMP Synchronization
  • Implicit barriers wait for all threads
  • DO, END DO
  • SECTIONS, END SECTIONS
  • SINGLE, END SINGLE
  • MASTER, END MASTER (no implied barrier at END
    MASTER)
  • NOWAIT at END can override the implicit barrier
  • Global barriers: all threads must hit them in the
    same order
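
A minimal NOWAIT sketch (arrays and loop bodies are illustrative): the barrier after the first loop can be dropped because the second loop does not read the first loop's results.

!$OMP PARALLEL
!$OMP DO
      DO I = 1, N
         A(I) = B(I) + C(I)
      END DO
!$OMP END DO NOWAIT
!$OMP DO
      DO I = 1, N
         D(I) = 2.0 * B(I)
      END DO
!$OMP END DO
!$OMP END PARALLEL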

21
OpenMP Synchronization (2)
  • Explicit directives provide finer control
  • BARRIER must be hit by all threads in team
  • CRITICAL (name), END CRITICAL
  • Only one thread may enter at a time
  • ATOMIC: single-statement critical section,
    e.g., for reductions
  • FLUSH (list): synchronization point at which the
    implementation is required to provide a consistent
    view of memory
  • ORDERED: for pipelining loop iterations
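
A minimal ATOMIC sketch (the histogram arrays HIST and IBIN are illustrative): each increment is a single protected update, cheaper than a full critical section.

!$OMP PARALLEL DO PRIVATE(I)
      DO I = 1, N
!$OMP ATOMIC
         HIST(IBIN(I)) = HIST(IBIN(I)) + 1
      END DO
!$OMP END PARALLEL DO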

22
OpenMP Data Environments
  • Data can be PRIVATE or SHARED
  • Private data is for local variables
  • Shared data is global
  • Data can also be private to a thread: all routines
    executed by that thread can access it, but other
    threads can't see it

23
OpenMP Data Environments
      COMMON /mine/ z
      INTEGER x(3), y(3), k
!$OMP THREADPRIVATE(/mine/)
!$OMP PARALLEL DO DEFAULT(PRIVATE), SHARED(x)
!$OMP& REDUCTION(+: z)
      DO k = 1, 3
         x(k) = k
         y(k) = k*k
         z = z + x(k)*y(k)
      END DO
!$OMP END PARALLEL DO
[Figure: x(1:3) = 1, 2, 3 lives in shared memory; threads 0, 1, 2 each hold private copies of y (1, 4, 9) and a private partial sum z'; the reduction combines the partial sums into z = 36]
24
Brief Example
25
OpenMP Environment and Runtime Library
  • For controlling execution
  • Needed for tuning, but may limit portability
  • Control through environment variables or runtime
    library calls
  • Runtime library takes precedence in conflict

26
OpenMP Environment and Runtime (2)
  • OMP_NUM_THREADS: how many threads to use in a
    parallel region?
  • OMP_GET_NUM_THREADS,
  • OMP_SET_NUM_THREADS
  • Related: OMP_GET_THREAD_NUM,
  • OMP_GET_MAX_THREADS, OMP_GET_NUM_PROCS
  • OMP_DYNAMIC: should the runtime system choose the
    number of threads?
  • OMP_GET_DYNAMIC, OMP_SET_DYNAMIC
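
A small sketch of the library routines (program name and output wording are illustrative):

      PROGRAM showthr
      INTEGER OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS
      CALL OMP_SET_NUM_THREADS(4)
!$OMP PARALLEL
      PRINT *, 'hello from thread', OMP_GET_THREAD_NUM()
!$OMP END PARALLEL
      PRINT *, 'max threads =', OMP_GET_MAX_THREADS()
      END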

27
OpenMP Environment and Runtime (3)
  • OMP_NESTED: should nested parallel regions be
    supported?
  • OMP_GET_NESTED, OMP_SET_NESTED
  • OMP_SCHEDULE: choose the DO scheduling option
  • Used by the SCHEDULE(RUNTIME) clause
  • OMP_IN_PARALLEL: is the program in a parallel
    region?
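
A small sketch of deferring the schedule to the environment (WORK and the OMP_SCHEDULE value shown are illustrative):

!     e.g., setenv OMP_SCHEDULE "guided,4" before running
!$OMP PARALLEL DO SCHEDULE(RUNTIME)
      DO I = 1, N
         CALL WORK(I)
      END DO
!$OMP END PARALLEL DO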

28
Outline
  • Introduction
  • Parallelism, Synchronization, and Environments
  • Restructuring/Designing Programs in OpenMP
  • Example Programs

29
Analyzing for Parallelism
  • Profiling
  • Walk the loop nests
  • Multiple parallel loops

30
Program Profile
  • Is dataset large enough?
  • At the top of the list, should find
  • parallel regions
  • routines called within them
  • What is cumulative percent?
  • Watch for system libraries near top
  • e.g., spin_wait_join_barrier

31
Walking the Key Loop Nest
  • Usually the outermost parallel loop
  • Ignore timestep and convergence loops
  • Ignore loops with few iterations
  • Ignore loops that call unimportant subroutines
  • Don't be put off by
  • Loops that write shared data
  • Loops that step through linked lists
  • Loops with I/O

32
Multiple Parallel Loops
  • Nested parallel loops are good
  • Pick easiest or most parallel code
  • Think about locality
  • Use IF clause to select best based on dataset (see
    the sketch after this list)
  • Plan on doing one across clusters
  • Non-nested parallel loops
  • Consider loop fusion (impacts locality)
  • Execute the code between them inside the parallel
    region
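
A minimal sketch of the IF clause (the threshold of 1000 and the arrays are illustrative): the loop forks only when the dataset is large enough to repay the overhead.

!$OMP PARALLEL DO IF(N .GT. 1000) SHARED(A, B, N)
      DO I = 1, N
         A(I) = A(I) + B(I)
      END DO
!$OMP END PARALLEL DO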

33
Example Loop Nest
      subroutine fem3d()
 10   call addmon()
      if (numelh.ne.0) call solide

      subroutine solide
      do 20 i = 1, nelt
      do 20 j = 1, nelg
         call unpki
         call strain
         call force
 20   continue
      if (...) return
      goto 10

      subroutine force()
      do 10 i = lft, llt
         sgv(i) = sig1(i) - qp(i)*vol(i)
 10   continue
      do 50 n = 1, nnc
         i0 = ia(n)
         i1 = ia(n+1) - 1
         do 50 i = i0, i1
            e(1,ix(i)) = e(1,ix(i)) + ep11(i)
 50   continue

34
Restructuring Applications
  • Two level strategy for parallel processing
  • Determining shared and local variables
  • Adding synchronization

35
Two Levels of Parallel Processing
  • Two-level approach isolates major concerns and
    makes code easier to update
  • Algorithm/Architecture Level
  • Unique to your software
  • Provides majority of SMP performance

36
Two Levels of Parallel Processing (cont.)
  • Platform Specific Level
  • Vendor provides insight
  • Remove last performance obstacles
  • Be careful to limit use of non-portable constructs

37
Determining Shared and Private
  • What are the variable classes?
  • Process for determining class
  • First private/last private

38
Types of Variables
  • Start with access patterns
  • Read only: disposition elsewhere
  • Write then read: possibly local
  • Read then write: independent or reductions
  • Written: live on exit?
  • Goal: determine storage classes
  • Local or private variables are local per thread
  • Shared variables are everything else

39
Guidelines for Classifying Variables
  • In general, big things are shared
  • The major arrays that take all the space
  • It's the default model for threads
  • Program local vars are parallel private vars
  • Temporaries require one copy per thread
  • Subroutine locals become private automatically
  • Move up from leaf subroutines to parallel region
  • Equivalences: ick

40
Process of Classifying Variables
  • Examine refs to each var to determine shared list
  • Split common into shared common and private
    common if vars require different storage classes
  • Use copy-in to a private common as an alternative
    (see the sketch after this list)
  • Construct private list and declare private
    commons by examining the types of remaining
    variables
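
As a sketch of the copy-in alternative (the common block /scratch/, its contents, and the routine work are hypothetical): THREADPRIVATE gives each thread its own copy of the common, and COPYIN initializes those copies from the master's values on entry to the parallel loop.

      COMMON /scratch/ coef(10), mode
!$OMP THREADPRIVATE(/scratch/)
!     ... master thread sets coef and mode here ...
!$OMP PARALLEL DO COPYIN(/scratch/)
      DO i = 1, n
!        each thread works with its own copy of /scratch/
         CALL work(i)
      END DO
!$OMP END PARALLEL DO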

41
Process of Classifying Variables (2)
[Flowchart: examine refs to each variable. Only read in
the parallel region: put on the shared list. Modified in
the parallel region and subscripted by the parallel loop
index (different iterations reference different parts):
put on the shared list. Modified and not subscripted by
the parallel loop index: go to next page.]
42
Process of Classifying Vars (3)
[Flowchart: classify the remaining variables by type.
Formal parameter of known size: private list; unknown
size: treat as pointee. Pointee: put on shared list.
Common member referenced in called routines: declare a
private common; referenced only in the parallel region:
private list with FIRSTPRIVATE. Local to the subroutine
and static: change to a (private) common; automatic:
private list.]
43
Firstprivate and Lastprivate
  • LASTPRIVATE copies value(s) from local copy
    assigned on last iteration of loop to global copy
    of variables or arrays
  • FIRSTPRIVATE copies value(s) from global
    variables or arrays to local copy for first
    iteration of loop on each processor
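
For comparison, a FIRSTPRIVATE sketch (hunt is an assumed table-search helper that takes jlo as a starting guess and updates it; it is not part of OpenMP): each thread's private jlo starts from the value set before the loop, then evolves independently.

      jlo = 1
!$OMP PARALLEL DO FIRSTPRIVATE(jlo) SHARED(xtab, key, idx)
      DO i = 1, n
!        hunt() is an assumed search routine, shown for illustration
         CALL hunt(xtab, ntab, key(i), jlo)
         idx(i) = jlo
      END DO
!$OMP END PARALLEL DO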

44
Firstprivate and Lastprivate (2)
  • Parallelizing a loop and not knowing whether
    there are side effects?
      subroutine foo(n)
      common /foobar/ a(1000), b(1000), x
c$omp parallel do shared(a,b,n) lastprivate(x)
      do 10 i = 1, n
         x = a(i)**2 + b(i)**2
 10      b(i) = sqrt(x)
      end

Use LASTPRIVATE because we don't know where, or if, x
in common /foobar/ will be used again
45
Choosing and Placing Synchronization
  • Finding variables that need to be synchronized
  • Two frequently used types
  • Critical/ordered sections: small updates to
    shared structures
  • Barriers: delimit phases of computation
  • Doing reductions

46
What to Synchronize
  • Updates: parallel-DO-invariant variables that are
    read then written
  • Place critical/ordered sections around groups of
    updates
  • Pay attention to control flow
  • Make sure you don't branch in or out
  • Pull it outside the loop or region if efficient

47
Example Critical/Ordered Section
      if (ncycle.eq.0) then
         do 60 i = lft, llt
            dt2 = amin1(dtx(i),dt2)
            if (dt2.eq.dtx(i)) then
               ielmtc = 128*(ndum-1)+i
               ielmtc = nhex(ielmtc)
               ityptc = 1
            endif
            ielmtd = 128*(ndum-1)+i
            ielmtd = nhex(ielmtd)
            write (13,90) ielmtd,dtx(i)
            write (13,100) ielmtc
 60      continue
      endif
      do 70 i = lft, llt
 70      dt2 = amin1(dtx(i),dt2)
      if (mess.ne.'sw2.') return
      do 80 i = lft, llt
         if (dt2.eq.dtx(i)) then
            ielmtc = 128*(ndum-1)+i
            ielmtc = nhex(ielmtc)
            ityptc = 1
         endif
 80   continue

48
Reductions
  • Correct (but slow) program

      sum = 0.0
c$omp parallel private(i) shared(sum,a,n)
c$omp do
      do 10 i = 1, n
c$omp critical
         sum = sum + a(i)
c$omp end critical
 10   continue
c$omp end parallel

  • Serial program is a reduction

      sum = 0.0
      do 10 i = 1, n
 10      sum = sum + a(i)

49
(Flawed) Plan For a Good Reduction
  • Incorrect parallel program

c$omp parallel private(suml,i)
c$omp& shared(sum,a,n)
      suml = 0.0
c$omp do
      do 10 i = 1, n
 10      suml = suml + a(i)
cbug  need critical section next
      sum = sum + suml
c$omp end parallel

50
Good Reductions
  • Correct reduction

c$omp parallel private(suml,i)
c$omp& shared(sum,a,n)
      suml = 0.0
c$omp do
      do 10 i = 1, n
 10      suml = suml + a(i)
c$omp critical
      sum = sum + suml
c$omp end critical
c$omp end parallel

Using REDUCTION does the same:

c$omp parallel
c$omp& shared(a,n)
c$omp& reduction(+: sum)
c$omp do private(i)
      do 10 i = 1, n
 10      sum = sum + a(i)
c$omp end parallel
51
Typical Parallel Bugs
  • Problem: incorrectly pointing to the same place
  • Symptom: bad answers
  • Fix: initialization of local pointers
  • Problem: incorrectly pointing to different places
  • Symptom: segmentation violation
  • Fix: localization of shared data
  • Problem: incorrect initialization of parallel
    regions
  • Symptom: bad answers
  • Fix: copy-in? / use a parallel region outside the
    parallel do

52
Typical Parallel Bugs (2)
  • Problem: not saving values from parallel regions
  • Symptom: bad answers, core dump
  • Fix: transfer from local into shared
  • Problem: unsynchronized access
  • Symptom: bad answers
  • Fix: critical section / barrier / local
    accumulation
  • Problem: numerical inconsistency
  • Symptom: run-to-run variation in answers
  • Fix: different scheduling mechanisms / ordered
    sections / consistent parallel reductions

53
Typical Parallel Bugs (3)
  • Problem: inconsistently synchronized I/O stmts
  • Symptom: jumbled output, system error messages
  • Fix: critical/ordered section around I/O
  • Problem: inconsistent declarations of common vars
  • Symptom: segmentation violation
  • Fix: verify consistent declarations
  • Problem: parallel stack size problems
  • Symptom: core dump
  • Fix: increase stack size

54
Outline
  • Introduction
  • Parallelism, Synchronization, and Environments
  • Restructuring/Designing Programs in OpenMP
  • Example Programs

55
Designing Parallel Programs in OpenMP
  • Partition
  • Divide problem into tasks
  • Communicate
  • Determine amount and pattern of communication
  • Agglomerate
  • Combine tasks
  • Map
  • Assign agglomerated tasks to physical processors

56
Designing Parallel Programs in OpenMP (2)
  • Partition
  • In OpenMP, look for any independent operations
    (loop parallel, task parallel)
  • Communicate
  • In OpenMP, look for synch points and dependences
  • Agglomerate
  • In OpenMP, create parallel loops and/or parallel
    sections
  • Map
  • In OpenMP, implicit or explicit scheduling
  • Data mapping goes outside the standard

57
Jacobi Iteration: The Problem
  • Numerically solve a PDE on a square mesh
  • Method
  • Update each mesh point by the average of its
    neighbors
  • Repeat until converged

58
Jacobi Iteration: OpenMP Partitioning,
Communication, and Agglomeration
  • Partitioning does not change at all
  • Data parallelism natural for this problem
  • Communication does not change at all
  • Related directly to task partitioning

59
Partitioning, Communication, and Agglomeration (2)
  • Agglomeration analysis changes a little
  • OpenMP cannot nest control constructs easily
  • Requires intervening parallel section, with
    OMP_NESTED turned on
  • Major issue on shared memory machines is locality
    in memory layout
  • Nearest neighbors agglomerated together as blocks
  • Therefore, encourage each processor to keep using
    the same contiguous section(s) of memory

60
Jacobi Iteration: OpenMP Mapping
  • Minimize forking and synchronization overhead
  • One parallel region at highest possible level
  • Mark outermost possible loop for work sharing
  • Keep each processor working on the same data
  • Consistent schedule for DO loops
  • Trust underlying system not to migrate threads
    for no reason
  • Lay out data to be contiguous
  • Column-major ordering in Fortran
  • Therefore, make the outermost work-shared loop run
    over columns

61
Jacobi Iteration: OpenMP Program
(to be continued)
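
The program listing itself did not transcribe; below is a rough sketch along the lines just described (one parallel region at the top, column-oriented work-shared loops), using illustrative names u, unew, n, err, and tol rather than the original slide's code.

!     rough sketch, not the slide's listing
      err = 2.0 * tol
!$OMP PARALLEL SHARED(u, unew, n, err, tol) PRIVATE(i, j)
      DO WHILE (err .GT. tol)
!$OMP BARRIER
!$OMP SINGLE
         err = 0.0
!$OMP END SINGLE
!$OMP DO REDUCTION(+: err)
         DO j = 2, n-1
            DO i = 2, n-1
               unew(i,j) = 0.25 * (u(i-1,j) + u(i+1,j)
     &                           + u(i,j-1) + u(i,j+1))
               err = err + (unew(i,j) - u(i,j))**2
            END DO
         END DO
!$OMP END DO
!$OMP DO
         DO j = 2, n-1
            DO i = 2, n-1
               u(i,j) = unew(i,j)
            END DO
         END DO
!$OMP END DO
      END DO
!$OMP END PARALLEL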
62
Jacobi Iteration/Program (2)
63
Irregular Mesh: The Problem
  • The Problem
  • Given an irregular mesh of values
  • Update each value using its neighbors in the mesh
  • The Approach
  • Store the mesh as a list of edges
  • Process all edges in parallel
  • Compute contribution of edge
  • Add to one endpoint, subtract from the other

64
Irregular Mesh: Sequential Program
      REAL x(nnode), y(nnode), flux
      INTEGER iedge(nedge,2)
      err = tol + 1e6
      DO WHILE (err .gt. tol)
         err = 0.0
         DO i = 1, nedge
            flux = (y(iedge(i,1)) - y(iedge(i,2))) / 2
            x(iedge(i,1)) = x(iedge(i,1)) - flux
            x(iedge(i,2)) = x(iedge(i,2)) + flux
            err = err + flux*flux
         END DO
         err = err / nedge
         DO i = 1, nnode
            y(i) = x(i)
         END DO
      END DO
65
Irregular Mesh: OpenMP Partitioning
  • Flux computations are data-parallel
  • flux = (x(iedge(i,1)) - x(iedge(i,2))) / 2
  • Independent because each edge value depends only on
    node values
  • Node updates are nearly data-parallel
  • x(iedge(i,1)) = x(iedge(i,1)) - flux(i)
  • Not truly independent because sometimes
    iedge(iY,1) = iedge(iX,2)
  • But ATOMIC supports updates using associative
    operators
  • Error check is a reduction
  • err = err + flux(i)*flux(i)
  • REDUCTION class for variables
66
Irregular Mesh: OpenMP Communication
  • Communication needed for all parts
  • Between edges and nodes to compute flux
  • Edge-node and node-node to compute x
  • Reduction to compute err
  • Edge and node communication is static, local with
    respect to grid
  • But unstructured with respect to array indices
  • Reduction communication is static, global

67
Irregular Mesh: OpenMP Agglomeration
  • Because of the tight ties between flux, x, and
    err, it is best to keep the loop intact
  • Incremental parallelization via OpenMP works
    perfectly
  • No differences between computation in different
    iterations
  • Any agglomeration scheme is likely to work well
    for load balance
  • Don't specify SCHEDULE
  • Make the system earn its keep

68
Irregular Mesh: OpenMP Mapping
  • There may be significant differences in data
    movement based on scheduling
  • The ideal
  • Every processor runs over its own edges (static
    scheduling)
  • Endpoints of these edges are not shared by other
    processors
  • Data moves to its home on the first pass, then
    stays put

69
Irregular Mesh: OpenMP Mapping (2)
  • The reality
  • The graph is connected, so some endpoints must be
    shared
  • Memory systems move more than one word at a time,
    which leads to false sharing
  • OpenMP does not standardize how to resolve this
  • Best bet Once the program is working, look for
    good orderings of data

70
Irregular Mesh: OpenMP Program
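
This listing did not transcribe either; here is a minimal sketch combining the pieces discussed above (ATOMIC node updates, a REDUCTION for err, no SCHEDULE clause), with names carried over from the sequential version.

!     rough sketch, not the slide's listing
      err = tol + 1e6
      DO WHILE (err .gt. tol)
         err = 0.0
!$OMP PARALLEL DO PRIVATE(i, flux) SHARED(x, y, iedge)
!$OMP& REDUCTION(+: err)
         DO i = 1, nedge
            flux = (y(iedge(i,1)) - y(iedge(i,2))) / 2
!$OMP ATOMIC
            x(iedge(i,1)) = x(iedge(i,1)) - flux
!$OMP ATOMIC
            x(iedge(i,2)) = x(iedge(i,2)) + flux
            err = err + flux*flux
         END DO
!$OMP END PARALLEL DO
         err = err / nedge
!$OMP PARALLEL DO PRIVATE(i) SHARED(x, y)
         DO i = 1, nnode
            y(i) = x(i)
         END DO
!$OMP END PARALLEL DO
      END DO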
71
Irregular Mesh
  • Divide edge list among processors
  • Ideally, would like all edges referring to a
    given vertex to be assigned to the same processor
  • Often easier said than done

72
Irregular Mesh: Pictures
73
Irregular Mesh: Bad Data Order
74
Irregular Mesh: Good Data Order
75
OpenMP Summary
  • Based on fork-join parallelism in shared memory
  • Threads start at beginning of parallel region,
    come back together at end
  • Close to some hardware
  • Linked from traditional languages
  • Very good for sharing data and incremental
    parallelization
  • Unclear if it is feasible for distributed memory
  • More information at http://www.openmp.org

76
Three Systems Compared
  • HPF
  • Good abstraction: data parallelism
  • System can hide many details from programmer
  • Two-edged sword
  • Well-suited for regular problems on machines with
    locality
  • MPI
  • Lower-level abstraction: message passing
  • System works everywhere, is usually the first
    tool available on new systems
  • Well-suited to handling data on distributed
    memory machines, but requires work up front

77
Three Systems Compared (2)
  • OpenMP
  • Good abstraction: fork-join
  • System excellent for incremental parallelization
    on shared memory
  • No implementations yet on distributed memory
  • Well-suited for any parallel application if
    locality is not an issue
  • Can we combine paradigms?
  • Yes, although it's still research

78
OpenMP + MPI
  • Modern parallel machines are often shared memory
    nodes connected by message passing
  • Can be programmed by calling MPI from OpenMP
  • MPI implementation must be thread-safe
  • ASCI project is using this heavily

79
MPI + HPF
  • Many applications (like the atmosphere/ocean
    model) consist of several data-parallel modules
  • Can link HPF codes on different machines using
    MPI
  • Requires special MPI implementation and runtime
  • HPF+MPI project at Argonne has done a
    proof-of-concept

80
HPF + OpenMP
  • HPF can be implemented by translating it to
    OpenMP
  • Good idea on shared-memory machines
  • May have real advantages for optimizing locality
    and data layout
  • HPF may call OpenMP directly
  • Proposal being made at HPF Users Group meeting
    next week
  • Not trivial, since HPF and OpenMP may not agree
    on data layout
  • Things could get worse if MPI is also implemented
    on OpenMP