Shared Memory Parallelization (Presentation Transcript)
1
Shared Memory Parallelization
  • Outline
  • What is shared memory parallelization?
  • OpenMP
  • Fractal Example
  • False Sharing
  • Variable scoping
  • Examples on sharing and synchronization

2
Shared Memory Parallelization
  • All processors can access all the memory in the
    parallel system
  • The time to access the memory may not be equal
    for all processors - not necessarily a flat
    memory
  • Parallelizing on an SMP does not reduce CPU time -
    it reduces wallclock time
  • Parallel execution is achieved by generating
    threads which execute in parallel
  • The number of threads is independent of the number
    of processors

3
Shared Memory Parallelization
  • The overhead for SMP parallelization is large
    (100-200 µsec): the parallel work construct must
    be significant enough to overcome the overhead
  • SMP parallelization is degraded by other
    processes on the node, so it is important to have
    dedicated use of the SMP node
  • Remember Amdahl's Law: you only get a speedup on
    the code that is parallelized
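  • In formula form: if a fraction p of the runtime is
    parallelized across N processors, the speedup is
    bounded by

      speedup = 1 / ((1 - p) + p / N)

    e.g. p = 0.9 on N = 8 processors gives at most
    1 / (0.1 + 0.9/8), about 4.7x.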

4
Fork-Join Model
  • 1. All OpenMP programs begin as a single process:
    the master thread
  • 2. FORK: the master thread creates a team of
    parallel threads
  • 3. The statements in the parallel region are
    executed in parallel among the various team
    threads
  • 4. JOIN: the threads synchronize and terminate,
    leaving only the master thread
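
  A minimal sketch of the fork-join pattern (assuming an
  OpenMP-aware Fortran compiler):

      program forkjoin
      print *, 'master thread only'     ! serial part before the fork
!$OMP PARALLEL
      print *, 'hello from the team'    ! FORK: executed by every thread
!$OMP END PARALLEL
      print *, 'master thread only'     ! JOIN: back to a single thread
      end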

5
OpenMP
  • In 1997 a group of hardware and software vendors
    announced their support for OpenMP, a new API for
    multi-platform shared-memory programming (SMP) on
    UNIX and Microsoft Windows NT platforms
  • www.openmp.org
  • OpenMP parallelism is specified through compiler
    directives embedded in C/C++ or Fortran source
    code. IBM does not yet support OpenMP for C++.

6
OpenMP
  • How is OpenMP typically used?
  • OpenMP is usually used to parallelize loops:
  • Find your most time-consuming loops
  • Split their iterations up between threads
  • Better scaling can be obtained using OpenMP
    parallel regions, but this can be tricky!
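
  For example, a time-consuming loop is split among the
  threads with one directive (a minimal sketch; the array
  and bounds are illustrative):

      program loopdemo
      integer i
      real*8 a(1000)
!$OMP PARALLEL DO PRIVATE(i) SHARED(a)
      do i = 1, 1000
         a(i) = 2.0d0 * i       ! iterations are divided among the threads
      enddo
!$OMP END PARALLEL DO
      print *, a(1), a(1000)
      end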

7
Loop Parallelization
  (figure: the iterations of one loop are distributed among the threads)
8
Functional Parallelization
  (figure: distinct tasks are distributed among the threads)
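
  In contrast to loop parallelization, functional
  parallelization gives each thread a different task; a
  minimal sketch using OpenMP SECTIONS:

      program funcdemo
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
      print *, 'task A'        ! one thread executes this task
!$OMP SECTION
      print *, 'task B'        ! another thread runs this one concurrently
!$OMP END SECTIONS
!$OMP END PARALLEL
      end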
9
Fractal Example
!$OMP PARALLEL
!$OMP DO SCHEDULE(RUNTIME)
      do i = 0, inos                          ! Long loop
         do k = 1, niter                      ! Short loop
            if (zabs(z(i)) .lt. lim) then
               if (z(i) .eq. dcmplx(0., 0.)) then
                  z(i) = c(i)
               else
                  z(i) = z(i)**alpha + c(i)
               endif
               kount(i) = k
            else
               exit
            endif
         end do
      end do
!$OMP END PARALLEL

10
Fractal Example (contd)
  • The parallel region can also be defined in a
    single directive:

!$OMP PARALLEL DO SCHEDULE(RUNTIME)
      do i = 0, inos        ! Long loop
         do k = 1, niter    ! Short loop
            ...
         end do
      end do

  • C syntax:

#pragma omp parallel for
for (i = 0; i < inos; i++)
    for (k = 1; k < niter; k++)
        ...

11
Fractal Example (contd)
  • The number of threads is machine-dependent, or can
    be set at runtime by setting an environment
    variable
  • The SCHEDULE clause specifies how the iterations
    of the loop are divided among the threads:
  • STATIC: the loop iterations are divided into
    contiguous chunks of equal size
  • DYNAMIC: iterations are broken into chunks of a
    specified size (default 1); as each thread
    finishes its work it dynamically obtains the next
    set of iterations
  • RUNTIME: the schedule is determined at runtime
  • GUIDED: like DYNAMIC, but the chunks start large
    and shrink as the work is consumed
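
  The schedule can also be fixed in the source; these
  directive forms are illustrative alternatives, not a
  complete program:

!$OMP DO SCHEDULE(STATIC)         ! equal contiguous chunks
!$OMP DO SCHEDULE(DYNAMIC, 40)    ! chunks of 40 iterations, claimed on demand
!$OMP DO SCHEDULE(GUIDED)         ! large chunks first, shrinking over time
!$OMP DO SCHEDULE(RUNTIME)        ! schedule read from the environment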

12
Fractal Example (contd)
  • Compilation:

      xlf90_r -qsmp=omp prog.f
      cc_r -qsmp=omp prog.c

  • The threaded versions of the compilers will
    perform automatic parallelization of your program
    unless you specify otherwise using the -qsmp=omp
    (or -qsmp=noauto) option
  • The program will run on four processors unless
    specified otherwise by setting the
    XLSMPOPTS=parthds=n environment variable
  • The default schedule is STATIC. Try setting it to
    DYNAMIC with export XLSMPOPTS="SCHEDULE=dynamic"
  • This will assign loop iterations in chunks of 1.
    Try a larger chunk size (and get better
    performance), for example 40: export
    XLSMPOPTS="SCHEDULE=dynamic=40"

13
Fractal Example (contd)
  • Tradeoff between load balancing and reduced
    overhead:
  • The larger the size (GRANULARITY) of each piece of
    work, the lower the overall thread overhead
  • The smaller the size (GRANULARITY) of each piece
    of work, the better the dynamically scheduled
    load balancing
  • Watch out for FALSE SHARING: a chunk size smaller
    than a cache line

14
False Sharing
  • The IBM Power3 cache line is 128 bytes (16 8-byte
    words)

!$OMP PARALLEL DO
      do I = 1, 50
         A(I) = B(I) + C(I)
      enddo

  • Say A(1:13) goes to the first thread and starts on
    a cache-line boundary
  • Then some of A(14:20), owned by other threads,
    falls on that first cache line, so it won't be
    accessible until the first thread has finished
    with the line
  • Solution: set a chunk size of 32 so chunks won't
    overlap onto another cache line (see the sketch
    below)
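
  A sketch of the fix, assuming A starts on a cache-line
  boundary:

!$OMP PARALLEL DO SCHEDULE(STATIC, 32)
      do I = 1, 50
         A(I) = B(I) + C(I)     ! each chunk of 32 words covers whole cache lines
      enddo
!$OMP END PARALLEL DO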

15
Variable Scoping
  • The most difficult part of shared memory
    parallelization
  • Which memory is SHARED?
  • Which memory is PRIVATE? (each processor has its
    own copy)
  • Compare MPI, where all variables are private
  • Variables are shared by default, except:
  • loop indices
  • scalars, and arrays whose subscripts are constant
    with respect to the PARALLEL DO, that are set and
    then used in the loop

16
How does sharing work?
x initially 0

  • THREAD 1
  • increment(x)
  • x = x + 1

      10 LOAD  A, (x address)
      20 ADD   A, 1
      30 STORE A, (x address)

  • THREAD 2
  • increment(x)
  • x = x + 1

      10 LOAD  A, (x address)
      20 ADD   A, 1
      30 STORE A, (x address)

The result could be 1 or 2: we need synchronization (see the sketch below)
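
  A sketch of one fix in OpenMP (CRITICAL would also
  work): make the increment indivisible with ATOMIC.

!$OMP PARALLEL
!$OMP ATOMIC
      x = x + 1          ! load/add/store now executes as one indivisible update
!$OMP END PARALLEL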
17
Variable Scoping example
      read *, n
      sum = 0.0
      call random (b)
      call random (c)
!$OMP PARALLEL
!$OMP&  PRIVATE (i, sump)
!$OMP&  SHARED (a, b, n, c, sum)
      sump = 0.0
!$OMP DO
      do i = 1, n
         a(i) = sqrt(b(i)**2 + c(i)**2)
         sump = sump + a(i)
      enddo
!$OMP CRITICAL
      sum = sum + sump
!$OMP END CRITICAL
!$OMP END PARALLEL
      end

18
Scoping example 2
      read *, n
      sum = 0.0
      call random (b)
      call random (c)
!$OMP PARALLEL DO
!$OMP&  PRIVATE (i)
!$OMP&  SHARED (a, b, c, n)
!$OMP&  REDUCTION (+:sum)
      do i = 1, n
         a(i) = sqrt(b(i)**2 + c(i)**2)
         sum = sum + a(i)
      enddo
!$OMP END PARALLEL DO
      end

  • Each processor needs a separate copy of i;
    everything else is shared
  • The REDUCTION clause gives each thread a private
    partial sum and combines them at the end

19
Variable Scoping
  • Global variables are SHARED among threads:
  • Fortran: COMMON blocks, SAVE variables, MODULE
    variables
  • C: file-scope variables visible when #pragma omp
    parallel is encountered, and static variables
    declared within a parallel region
  • But not everything is shared...
  • Stack variables in sub-programs called from
    parallel regions are PRIVATE (see the sketch
    below)
  • Automatic variables within a statement block are
    PRIVATE
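
  A sketch illustrating the rule (the routine and
  variable names are hypothetical):

      subroutine work(tid)
      integer tid
      integer local                ! stack variable: PRIVATE to each calling thread
      local = 10 * tid
      print *, 'thread', tid, 'has local =', local
      end

      program stackdemo
      integer omp_get_thread_num
!$OMP PARALLEL
      call work(omp_get_thread_num())  ! each thread gets its own stack frame
!$OMP END PARALLEL
      end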

20
Hello World 1 (correct)
      PROGRAM HELLO
      INTEGER TID, OMP_GET_THREAD_NUM
!$OMP PARALLEL PRIVATE(TID)
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread ', TID
      ...
!$OMP END PARALLEL
      END

21
Hello World 2 (incorrect)
      PROGRAM HELLO
      INTEGER TID, OMP_GET_THREAD_NUM
!$OMP PARALLEL
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread ', TID
      ...
!$OMP END PARALLEL
      END

  • Incorrect: TID is SHARED by default, so the
    threads race when writing it and may print another
    thread's number

22
Hello World 3 (incorrect)
      PROGRAM HELLO
      INTEGER TID, OMP_GET_THREAD_NUM
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread ', TID
!$OMP PARALLEL
      ...
!$OMP END PARALLEL
      END

  • Incorrect: OMP_GET_THREAD_NUM is called outside
    the parallel region, where only the master thread
    (thread 0) exists

23
Another Variable Scoping Example
      subroutine example4(n, m, a, b, c)
      real*8 a(100,100), b(100,100), c(100)
      integer n, i
      real*8 sum
!$OMP PARALLEL DO
!$OMP&  PRIVATE (j, i, c)
!$OMP&  SHARED (a, b, m, n)
      do j = 1, m
         do i = 2, n-1
            c(i) = sqrt(1.0 + b(i,j)**2)
         enddo
         do i = 1, n
            a(i,j) = sqrt(b(i,j)**2 + c(i)**2)
         enddo
      enddo
      end

  • Each processor needs a separate copy of j, i, and
    c; everything else is shared
  • But what about c? The private copies of c never
    set c(1) and c(n), which the second inner loop
    reads

24
Another Variable Scoping Example (contd)
      subroutine example4(n, m, a, b, c)
      real*8 a(100,100), b(100,100), c(100)
      integer n, i
      real*8 sum
!$OMP PARALLEL DO
!$OMP&  PRIVATE (j, i)
!$OMP&  SHARED (a, b, m, n)
!$OMP&  FIRSTPRIVATE (c)
      do j = 1, m
         do i = 2, n-1
            c(i) = sqrt(1.0 + b(i,j)**2)
         enddo
         do i = 1, n
            a(i,j) = sqrt(b(i,j)**2 + c(i)**2)
         enddo
      enddo
      end

  • We need the first (incoming) value of c:
    FIRSTPRIVATE makes the master copy its c array to
    every thread prior to the DO loop

25
Another Variable Scoping Example (contd)
  • What if the last value of c is needed after the
    loop?
  • Use the LASTPRIVATE clause, sketched below
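
  A minimal sketch, using a hypothetical scalar last:
  LASTPRIVATE copies the private value from the
  sequentially final iteration back to the shared
  variable after the loop.

      integer i, n, last
      n = 100
!$OMP PARALLEL DO LASTPRIVATE(last)
      do i = 1, n
         last = i * i             ! private during the loop
      enddo
!$OMP END PARALLEL DO
      print *, last               ! holds the i = n value: 10000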

26
References
  • www.openmp.org
  • ASCI Blue training: http://www.llnl.gov/computing/tutorials/workshops/workshop/
  • EWOMP '99: http://www.it.lth.se/ewomp99/programme.html
  • EWOMP 2000: http://www.epcc.ed.ac.uk/ewomp2000/proceedings.html
  • Multimedia tutorial at Boston University: http://scv.bu.edu/SCV/Tutorials/OpenMP/