Shared Memory Programming Using OpenMP
David Lifka, Computing and Information Science 405
1
Shared Memory Programming Using OpenMP
2
What is Shared Memory Programming?
  • Multiple processors in the same computer sharing
    a common memory
  • Threads running on the various processors can
    interact with one another via shared variables.
  • It tends to be easier to make performance
    improvements to a serial code than with
    message-passing parallelism, which usually requires
    the code/algorithm to be redesigned.
  • Many clusters today are made up of 2 or more
    processors per compute node. Using OpenMP with MPI
    is a common strategy to obtain maximum
    performance.

3
Fork / Join
  • Multi-threaded programming is the most common
    shared-memory programming methodology.
  • A serial code begins execution. The process is
    the master thread, the only executing thread.
  • When a parallel portion of the code is reached,
    the master thread can fork more threads to work
    on it.
  • When the parallel portion of the code has
    completed, the threads join again and the master
    thread continues executing the serial code.

(Diagram: the master thread forks into a team of threads, which join
back into the master thread at the end of the parallel region)
4
What is OpenMP?
  • Every system has its own threading libraries
  • They can be quite complicated
  • They are generally not portable
  • Often optimized to produce the absolute best
    performance
  • OpenMP has emerged as a standard method for
    shared-memory programming
  • Similar to the way MPI has become the standard
    for distributed-memory programming
  • Codes are portable
  • Performance is usually good enough
  • Consists of
  • Environment variables
  • Compiler directives
  • API calls
  • Compiler support
  • C, C++, Fortran
  • Intel
  • http://www.intel.com/cd/software/products/asmo-na/
    eng/compilers/

5
How Does OpenMP Work?
  • for loops typically indicate a parallel section
    of the code.
  • Using compiler directives we can tell the
    compiler that it can automatically parallelize a
    loop for us.
  • In C/C++ these compiler directives are called
    pragmas (pragmatic information)
  • OpenMP pragmas start with
  • #pragma omp
  • Pragmas go in front of for loops to tell the
    compiler that they can be parallelized
  • Loops that can be parallelized have some
    requirements (see the sketch below)
  • The run-time system must know how many iterations
    will be executed
  • Loops cannot have logic that allows them to exit
    early
  • break, return, exit, etc.
  • Loops must have canonical shape
  • for (i = start; i OP end; INCR)
  • OP can be <, <=, >, or >=
  • INCR can be i++, ++i, i--, --i, i += inc, or i -= inc
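
For instance, a loop in canonical shape parallelizes with a single
pragma; a minimal sketch (the array and bound names are illustrative):

    // iterations are independent, the trip count is known on entry,
    // and there is no early exit
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = 2.0 * a[i];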

6
Basic OpenMP Functions
  • omp_get_num_procs
  • int procs = omp_get_num_procs();
  • omp_get_num_threads
  • int threads = omp_get_num_threads();
  • omp_get_max_threads
  • printf("Currently %d threads\n", omp_get_max_threads());
  • omp_get_thread_num
  • printf("Hello from thread id %d\n", omp_get_thread_num());
  • omp_set_num_threads
  • omp_set_num_threads(procs = atoi(argv[1]));

7
OpenMP Helloworld Example
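The listing itself was a slide image and is not preserved in this
transcript; a minimal hello-world sketch in the same spirit:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        // each thread in the team prints its own id
        #pragma omp parallel
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }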
8
Basic OpenMP Compiler Directives: Basic Data
Parallel Example
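The example was a slide image; a minimal data-parallel sketch (array
names and sizes are illustrative):

    #include <omp.h>
    #define N 1000

    int main(void)
    {
        double a[N], b[N], c[N];
        int i;
        for (i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
        // the loop iterations are divided among the threads
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
        return 0;
    }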
9
Command Line Compiling & Linking OpenMP
Applications
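The command lines shown on this slide are not preserved; typical
invocations (the source file name is illustrative) are:

    cl /openmp omp_hello.c      (Microsoft Visual C++)
    icl /Qopenmp omp_hello.c    (Intel C++ on Windows)
    gcc -fopenmp omp_hello.c    (GCC)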
10
Dealing with Manifest Files When VS 2008
Runtimes are Not Installed
11
Command Line Compiling & Linking OpenMP
Applications: Embedding Manifest in .exe using
Manifest Tool
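The exact command from the slide is not preserved; the Manifest Tool
is typically invoked along these lines (the executable name is
illustrative):

    mt.exe -manifest omp_hello.exe.manifest -outputresource:omp_hello.exe;1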
12
Compiling & Linking OpenMP Applications: Visual
Studio 2005
13
OpenMP Environment Variables
  • OMP_NUM_THREADS
  • Sets the number of threads to use during
    execution (the runtime may adjust this
    if OMP_DYNAMIC is set to TRUE)
  • OMP_DYNAMIC
  • Enables (TRUE) or disables (FALSE) dynamic
    adjustment of the number of threads

14
Using OMP_NUM_THREADS & OMP_DYNAMIC Environment
Variables
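The slide's screenshots are not preserved; on Windows (the platform
assumed elsewhere in this talk) the variables are set from the shell
before running the program:

    set OMP_NUM_THREADS=4
    set OMP_DYNAMIC=FALSE
    omp_hello.exe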
15
OMP_DYNAMIC Examples (note: behavior differs by
OpenMP implementation)
16
OMP_NUM_THREADS & OMP_DYNAMIC Examples
17
Private Variables: the private clause
  • Tells the compiler to allocate a private copy of a
    variable for each thread

start = omp_get_wtime();
h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    area += (4.0 / (1.0 + x*x));
} // for
pi = h * area;
end = omp_get_wtime();
18
Critical Sections: #pragma omp critical
  • area gets updated by every thread
  • It is not an atomic operation
  • Results are non-deterministic due to a
    race condition

Wrong:
start = omp_get_wtime();
h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    area += (4.0 / (1.0 + x*x));
} // for
pi = h * area;
end = omp_get_wtime();

Right:
start = omp_get_wtime();
h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
#pragma omp critical
    area += (4.0 / (1.0 + x*x));
} // for
pi = h * area;
end = omp_get_wtime();
19
Example: omp_bubble.c
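The source of omp_bubble.c is not preserved in this transcript; a
minimal sketch of one common OpenMP bubble-sort variant (odd-even
transposition, so that each phase's compare-exchanges are
independent); the function name and signature are assumptions:

    // hypothetical sketch of an OpenMP odd-even transposition sort
    void omp_bubble(int *a, int n)
    {
        int phase, i, tmp;
        for (phase = 0; phase < n; phase++) {
            // even phases compare (0,1),(2,3),...;
            // odd phases compare (1,2),(3,4),...
            #pragma omp parallel for private(tmp)
            for (i = phase % 2; i < n - 1; i += 2) {
                if (a[i] > a[i + 1]) {
                    tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp;
                }
            }
        }
    }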
20
Scheduling Loops (a usage sketch follows this list)
  • schedule(static)
  • Statically allocate (total iterations / total
    threads) contiguous iterations per thread
  • schedule(static, chunk)
  • Interleaved allocation of chunks (chunk =
    number of iterations per chunk) to each thread
  • schedule(dynamic)
  • Iterations dynamically allocated to each thread,
    1 at a time
  • schedule(dynamic, chunk)
  • Iterations dynamically allocated to each thread,
    1 chunk at a time
  • schedule(guided)
  • Guided self-scheduling with minimal chunk size
    of 1
  • schedule(guided, chunk)
  • Dynamic allocation of iterations using guided
    self-scheduling
  • Chunks start large and shrink exponentially, down
    to the minimum size chunk, with each new request
    for a chunk by a thread
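
A schedule clause attaches directly to the work-sharing directive;
a minimal sketch (the loop body is illustrative):

    // hand out iterations in interleaved chunks of 4
    #pragma omp parallel for schedule(static, 4)
    for (i = 0; i < n; i++)
        a[i] = 2.0 * a[i];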

21
One More OpenMP Environment Variable:
OMP_SCHEDULE
  • schedule(runtime)
  • Schedule type chosen at runtime based on the
    value of the OMP_SCHEDULE environment variable
    (see the sketch below)
  • Example
  • set OMP_SCHEDULE=dynamic,10
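
With schedule(runtime) the pragma stays fixed and the schedule is
read from the environment; a minimal sketch:

    // schedule picked up from OMP_SCHEDULE at run time,
    // e.g. after: set OMP_SCHEDULE=dynamic,10
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        a[i] = 2.0 * a[i];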

22
Conditionally Executing Loops
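The example is not preserved; the usual mechanism is OpenMP's if
clause, which forks threads only when the loop is big enough to be
worth the overhead (the threshold is illustrative):

    // run serially unless n is large enough to amortize the fork/join
    #pragma omp parallel for if (n > 5000)
    for (i = 0; i < n; i++)
        a[i] = 2.0 * a[i];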
23
Example: omp.bat environment-based schedule
modification
24
OpenMP omp_bubble.c Performance: Quad PIII Xeon
(500 MHz, 2 GB RAM)
25
OpenMP omp_bubble.c Performance: Quad PIII Xeon
(500 MHz, 2 GB RAM)
26
OpenMP omp_bubble.c Performance: Quad PIII Xeon
(500 MHz, 2 GB RAM)
27
Reductions: reduction(<operator>:variable)
  • The reduction clause can be added to #pragma omp
    parallel for
  • Works like MPI_Reduce
  • Reduction operators for C/C++
  • Operator  Meaning               Types       Initial Value
  • +         Sum                   float, int  0
  • *         Product               float, int  1
  • &         Bitwise and           int         all bits 1
  • |         Bitwise or            int         0
  • ^         Bitwise exclusive or  int         0
  • &&        Logical and           int         1
  • ||        Logical or            int         0

start = omp_get_wtime();
h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x) reduction(+:area)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    area += (4.0 / (1.0 + x*x));
} // for
pi = h * area;
end = omp_get_wtime();
28
OpenMP Pi Example
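The full listing is not preserved; assembling the fragments shown on
the previous slides into a complete program (the argument handling
and default n are assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(int argc, char *argv[])
    {
        int i, n = (argc > 1) ? atoi(argv[1]) : 1000000;
        double x, h, area = 0.0, pi;
        double start, end;

        start = omp_get_wtime();
        h = 1.0 / (double) n;
        // midpoint-rule integration of 4/(1+x*x) on [0,1]
        #pragma omp parallel for private(x) reduction(+:area)
        for (i = 1; i <= n; i++) {
            x = h * ((double)i - 0.5);
            area += 4.0 / (1.0 + x * x);
        }
        pi = h * area;
        end = omp_get_wtime();
        printf("pi = %.10f in %.3f seconds\n", pi, end - start);
        return 0;
    }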
29
OpenMP omp_pi.c Performance: Quad PIII Xeon
(500 MHz, 2 GB RAM)
30
firstprivate & lastprivate clauses
  • firstprivate tells the compiler to create private
    variables with the same initial value as that of
    the master thread.
  • firstprivate variables are initialized once per thread

x[0] = complex_function();
#pragma omp parallel for private(j) firstprivate(x)
for (i = 0; i < n; i++) {
    for (j = 1; j < 4; j++)
        x[j] = g(i, x[j-1]);
    answer[i] = x[1] - x[3];
}
  • lastprivate tells the compiler to copy back to the
    master thread the private copy from the
    sequentially last iteration of the loop. (This is
    the last iteration that would be executed if the
    code were serial.)

#pragma omp parallel for private(j) lastprivate(x)
for (i = 0; i < n; i++) {
    x[0] = 1.0;
    for (j = 1; j < 4; j++)
        x[j] = x[j-1] * (i + 1);
    sum_of_powers[i] = x[0] + x[1] + x[2] + x[3];
}
n_cubed = x[3];
Note examples from pages 412-413 in the text
31
firstprivate example
32
firstprivate output
33
lastprivate example
34
lastprivate output
35
Broken lastprivate example
36
Broken lastprivate example Output
37
omp_heat2d example 1 of 5
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <malloc.h>
#include <stdlib.h>   // for atoi() and exit(), used below
#define EPSILON 0.00001
#define N 100
#define time_steps 100
int main (int argc, char *argv[])
{
    int i, j;
    int step;
    int threads;
    double time;
    double eps, enew;
    double time_max = 3.0;
    double alpha = 0.06;
    double dx = 1.0/N;
    double dy = 1.0/time_steps;
    // declarations implied by later slides:
    double start, end;
    double *tS, *toldS, **t, **told;

38
omp_heat2d example 2 of 5
    if (argc < 2) threads = 1;
    else threads = atoi(argv[1]);
    if (!omp_get_dynamic()) {
        // Set the number of threads
        omp_set_num_threads(threads);
    } else {
        printf("ERROR set OMP_DYNAMIC=FALSE\n");
        exit(0);
    }
    start = omp_get_wtime();
    tS = (double *) malloc(N * N * sizeof(double));
    toldS = (double *) malloc(N * N * sizeof(double));
    t = (double **) malloc(N * sizeof(double *));
    told = (double **) malloc(N * sizeof(double *));
    #pragma omp parallel for
    for (i = 0; i < N; i++) {   // loop body implied between slides:
        t[i] = &tS[i * N];      // point each row at the flat arrays
        told[i] = &toldS[i * N];
    }

39
omp_heat2d example 3 of 5
    // set initial boundary conditions
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        told[i][0] = 0.0;      // left
        told[i][N-1] = 0.0;    // right
    }
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        told[N-1][j] = 0.0;    // bottom
    // for all time steps
    for (step = 1; step <= time_steps; step++) {
        time = step * (time_max / time_steps);
        // reset top boundary condition each timestep
        #pragma omp parallel for
        for (j = 0; j < N; j++)
            told[0][j] = 2.0 * sin(time);  // top
        do {

40
omp_heat2d example 4 of 5
        // #pragma omp parallel for private(j,enew)
        for (i = 1; i < (N-1); i++)
            for (j = 1; j < (N-1); j++) {
                enew = fabs(t[i][j] - told[i][j]);
                // #pragma omp critical
                if (enew > eps) eps = enew;
            }
        #pragma omp parallel for private(j)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                told[i][j] = t[i][j];
        } while (eps > EPSILON);

41
omp_heat2d example 5 of 5
    // Dump raster data to a file
    // minval = 0.0;
    // maxval = 0.0;
    // for (i = 0; i < N; i++)
    //     for (j = 0; j < N; j++) {
    //         if (t[i][j] < minval) minval = t[i][j];
    //         if (t[i][j] > maxval) maxval = t[i][j];
    //     }
    // sprintf(fname, "Output\\heat%03d.raw", step);
    // out = fopen(fname, "wb");
    // for (i = 0; i < N; i++)
    //     for (j = 0; j < N; j++)
    //         fprintf(out, "%c",
    //                 (int)(((t[i][j] - minval) * 255.0) / (maxval - minval)));
    // fclose(out);
    // printf("Time step %d\r", step);
    } // for all time steps
    end = omp_get_wtime();
    printf("%d time steps - %d threads - %.2f seconds\n",
           step - 1, omp_get_max_threads(), end - start);

42
OpenMP omp_heat2d.c Performance: Quad PIII Xeon
(500 MHz, 2 GB RAM)
43
Things that may make omp_heat2d.c run slooooow
  • Ensure you are writing your output to T:\ not H:\
  • Ensure the OpenMP DLLs are also on T:\ with your
    executable
  • If you use a critical section, be careful: they can
    be expensive as the number of threads increases.

44
single Pragma
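The example code is not preserved; the single directive has one
thread of the team execute a block while the others wait at its
implicit barrier. A minimal sketch:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            // every thread runs the surrounding region...
            #pragma omp single
            {
                // ...but exactly one thread executes this block
                printf("Initialization done by thread %d\n",
                       omp_get_thread_num());
            }
            // implicit barrier here: the other threads wait
        }
        return 0;
    }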
45
single example Output
46
Row-Major (C) vs. Column-Major (Fortran)
  • Assume the following matrix a[i][j] and C loop:
  •   A11 A12 A13        for (i = 1; i <= 3; i++)
  •   A21 A22 A23            for (j = 1; j <= 3; j++)
  •   A31 A32 A33                ... a[i][j] ...
  • Row-Major (C) stores the values in memory:
  • A11 A12 A13 A21 A22 A23 A31 A32 A33 ---> higher
    addresses
  • The j index changes faster than the i index
  • Column-Major (Fortran) stores the values in
    memory:
  • A11 A21 A31 A12 A22 A32 A13 A23 A33 ---> higher
    addresses
  • The i index changes faster than the j index

47
Inverting Loops Example
  • Columns can be updated simultaneously (not rows)
  • Inverting the i and j loops reduces the number of
    fork/joins
  • Consider how the transformation affects the
    cache-hit rate

for (i = 2; i < m; i++)
    for (j = 1; j < n; j++)
        a[i][j] = 2 * a[i-1][j];

  • Becomes

#pragma omp parallel for private(i)
for (j = 1; j < n; j++)
    for (i = 2; i < m; i++)
        a[i][j] = 2 * a[i-1][j];

A11 A12 A13 A21 A22 A23 A31 A32 A33
Note examples from page 417 in the text
48
Consider how the array is referenced in memory
  • for (i = 2; i < m; i++)
  •   for (j = 1; j < n; j++) ...
  • A21 <- A11    (i = 2, j = 1 to 3)
  • A22 <- A12
  • A23 <- A13
  • A31 <- A21    (i = 3, j = 1 to 3)
  • A32 <- A22
  • A33 <- A23
  • for (j = 1; j < n; j++)
  •   for (i = 2; i < m; i++) ...
  • A21 <- A11    (j = 1, i = 2 to 3)
  • A31 <- A21
  • A22 <- A12    (j = 2, i = 2 to 3)
  • A32 <- A22
  • A23 <- A13    (j = 3, i = 2 to 3)
  • A33 <- A23

A11 A12 A13 A21 A22 A23 A31 A32 A33
Note examples from page 417 in the text
49
Nesting Parallel Directives: the for pragma and the
nowait clause
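The example is not preserved; nowait removes the implied barrier at
the end of a work-sharing loop so threads fall through to the next
one immediately. A minimal sketch (array names are illustrative):

    #pragma omp parallel
    {
        // no barrier after this loop because of nowait
        #pragma omp for nowait
        for (i = 0; i < n; i++)
            a[i] = 2.0 * a[i];

        // threads start this loop without waiting for the first
        #pragma omp for
        for (i = 0; i < n; i++)
            b[i] = b[i] + 1.0;
    }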
50
Nesting Parallel Directives Example output
51
Parallel pragma: execute a block of code in
parallel
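The slide's code is not preserved; a minimal sketch of the parallel
pragma on its own (each thread executes the entire block, which is
not divided up as it would be with parallel for):

    #pragma omp parallel
    {
        // the whole block runs once per thread
        printf("Block executed by thread %d\n", omp_get_thread_num());
    }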
52
Parallel pragma Example Output
53
Functional Parallelism: sections Pragma
54
Functional Parallelism Example: sections Pragma
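The example itself was a slide image; a minimal sketch of functional
parallelism with sections (the two function names are hypothetical):

    #pragma omp parallel sections
    {
        #pragma omp section
        compute_forces();      // one thread runs this section

        #pragma omp section
        update_positions();    // another thread runs this one concurrently
    }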