Transcript and Presenter's Notes

Title: Predicting the Performance of Injection Communication Patterns on PVM


1
Introduction to Parallel Programming (Shared Memory)
Casiano Rodríguez León  casiano@ull.es
Departamento de Estadística, Investigación Operativa y Computación
2
Introduction
3
What is parallel computing?
  • Parallel computing: the use of multiple computers or processors working together on a common task.
  • Each processor works on its section of the
    problem
  • Processors are allowed to exchange information
    with other processors

4
Why do parallel computing?
  • Limits of single-CPU computing
    • Available memory
    • Performance
  • Parallel computing allows us to
    • Solve problems that don't fit on a single CPU
    • Solve problems that can't be solved in a reasonable time
  • We can run
    • Larger problems
    • Faster
    • More cases


5
Performance Considerations
6
SPEEDUP
SPEEDUP = (time of the fastest sequential algorithm) / (time of the parallel algorithm)

SPEEDUP ≤ NUMBER OF PROCESSORS
  • Consider a parallel algorithm that runs in T steps on P processors.
  • It is a simple fact that the parallel algorithm can be simulated by a sequential machine in T×P steps.
  • Hence the best sequential algorithm runs in T_best_seq ≤ T×P steps, and the speedup is at most T_best_seq / T ≤ P.

7
Amdahl's Law
  • Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors.
  • Effect of multiple processors on run time: t_P = (f_s + f_p / P) × t_1
  • Effect of multiple processors on speedup: S = t_1 / t_P = 1 / (f_s + f_p / P)
  • Where
    • f_s = serial fraction of the code
    • f_p = parallel fraction of the code (f_s + f_p = 1)
    • P = number of processors
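As a quick illustration (the numbers here are chosen for the example, not taken from the slides): with a serial fraction f_s = 0.1, parallel fraction f_p = 0.9 and P = 16 processors,

S = 1 / (0.1 + 0.9/16) = 1 / 0.15625 ≈ 6.4,

and even as P → ∞ the speedup can never exceed 1 / f_s = 10.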
8
Illustration of Amdahl's Law
It takes only a small fraction of serial content
in a code to degrade the parallel performance. It
is essential to determine the scaling behavior of
your code before doing production runs using
large numbers of processors

9
Amdahl's Law vs. Reality
Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance.
10
Memory/Cache
(Figure: CPU → cache → main memory hierarchy.)
11
Locality and Caches
  • a[i] = b[i] + c[i]
  • On uniprocessor systems, from a correctness point of view memory is monolithic; from a performance point of view it is not.
  • It might take more time to bring a[i] from memory than to bring b[i].
  • Bringing in a[i] at one point in time might take longer than bringing it in at a later point in time.

12
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    a[j][i] = 0;

a[j][i] and a[j+1][i] have stride n, n being the dimension of a.
There is a stride-1 access to a[j][i+1] that occurs n iterations after the reference to a[j][i].
Spatial locality: when an element is referenced, its neighbors will be referenced too.
Temporal locality: when an element is referenced, it might be referenced again soon.
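To exploit the spatial locality just described, the loops can be interchanged so that the innermost loop walks the array with stride 1; this is an illustrative sketch added here, not code from the slides:

/* Cache-friendly version of the initialization above: C stores arrays
   row by row, so letting the inner loop vary the last index gives
   stride-1 accesses a[j][0], a[j][1], ... */
for (j = 0; j < n; j++)
  for (i = 0; i < n; i++)
    a[j][i] = 0;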
13
Shared Memory Machines
14
Shared and Distributed memory
Distributed memory - each processor has its own
local memory. Must do message passing to
exchange data between processors.
Shared memory - single address space. All processors have access to a pool of shared memory. Methods of memory access: bus, crossbar.
15
Styles of Shared Memory: UMA and NUMA

16
UMA Memory Access Problems
  • Conventional wisdom is that these systems do not scale well
    • Bus-based systems can become saturated
    • Fast, large crossbars are expensive
  • Cache coherence problem
    • Copies of a variable can be present in multiple caches
    • A write by one processor may not become visible to others
    • They'll keep accessing the stale value in their caches
    • Need to take actions to ensure visibility, i.e. cache coherence

17
Cache coherence problem
  • Processors see different values for u after event 3
  • With write-back caches, the value written back to memory depends on which cache flushes or writes back its value, and when
  • Processes accessing main memory may see the old value
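The slide refers to a figure that is not reproduced in this transcript; the comment timeline below reconstructs the classic three-processor example (the concrete values 5 and 7 are chosen here for illustration):

/* Shared variable u starts as 5 in main memory.
 *
 * Event 1: P1 reads u          -> P1's cache holds u = 5
 * Event 2: P3 reads u          -> P3's cache holds u = 5
 * Event 3: P3 writes u = 7
 *   write-through cache: memory now holds 7, but P1 still caches 5
 *   write-back cache:    memory still holds 5 until P3's line is flushed
 *
 * After event 3 the processors can see different values of u, and a
 * process reading main memory may still see the old value.
 */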

18
Snooping-based coherence
  • Basic idea
    • Transactions on memory are visible to all processors
    • Processors (or their representatives) can snoop (monitor) the bus and take action on relevant events
  • Implementation
    • When a processor writes a value, a signal is sent over the bus
    • The signal is either
      • Write invalidate: tell others the cached value is invalid
      • Write broadcast: tell others the new value

19
Memory Consistency Models
20
Memory Consistency Models
In some commercial shared memory systems it is possible to observe the old value of MyTask->data!!
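The program the slide alludes to is not in the transcript; the sketch below shows the kind of producer/consumer idiom where this happens (MyTask comes from the slide, while the Task type and the Flag variable are assumptions added for the example). Without the flush directives, a relaxed consistency model may let the consumer leave its spin loop and still read the old value of MyTask->data:

#include <stdio.h>
#include <omp.h>

typedef struct { int data; } Task;
Task *MyTask;        /* shared, allocated elsewhere */
int   Flag = 0;      /* shared ready flag           */

void producer(void) {
  MyTask->data = 42;
  #pragma omp flush          /* publish the data ...          */
  Flag = 1;
  #pragma omp flush(Flag)    /* ... before the flag is raised */
}

void consumer(void) {
  int ready;
  do {                       /* spin until the producer signals */
    #pragma omp flush(Flag)
    ready = Flag;
  } while (!ready);
  #pragma omp flush          /* pick up the producer's write */
  printf("%d\n", MyTask->data);
}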
21
Distributed shared memory (NUMA)
  • Consists of N processors and a global address
    space
  • All processors can see all memory
  • Each processor has some amount of local memory
  • Access to the memory of other processors is
    slower
  • NonUniform Memory Access

22
SGI Origin 2000
23
OpenMP Programming
24
Origin2000 memory hierarchy
Level                            Latency (cycles)
register                         0
primary cache                    2..3
secondary cache                  8..10
local main memory + TLB hit      75
remote main memory + TLB hit     250
main memory + TLB miss           2000
page fault                       10^6
25
OpenMP
  • OpenMP C and C++ Application Program Interface
  • DRAFT Version 2.0, November 2001 (Draft 11.05)
  • OpenMP Architecture Review Board
  • http://www.openmp.org/
  • http://www.compunity.org/
  • http://www.openmp.org/specs/
  • http://www.it.kth.se/labs/cs/odinmp/
  • http://phase.etl.go.jp/Omni/

26
Hello World in OpenMP
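The program on this slide is not reproduced in the transcript; a minimal OpenMP "hello world" matching the title might look like this (a sketch, not necessarily the code shown on the slide):

#include <stdio.h>
#include <omp.h>

int main(void) {
  /* The parallel directive creates a team; every thread prints a greeting. */
  #pragma omp parallel
  printf("Hello world from thread %d of %d\n",
         omp_get_thread_num(), omp_get_num_threads());
  return 0;
}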
27
#pragma omp parallel private(i, id, p, load, begin, end)
{
  p = omp_get_num_threads();  id = omp_get_thread_num();
  load = N / p;  begin = id * load;  end = begin + load;
  for (i = begin; (i < end) && keepon; i++) {
    if (a[i] == x) { keepon = 0;  position = i; }
    #pragma omp flush(keepon)
  }
}
28
(Same parallel search code as slide 27.)
29
(Same parallel search code as slide 27.)
Search for x = 100
P = 10 processors
The sequential algorithm traverses 900 elements
Processor 9 finds x = 100 in its first step
Speedup = 900/1 > 10 = P   (superlinear speedup)
30
main() {
  double local, pi = 0.0, w;  long i;
  w = 1.0 / N;
  #pragma omp parallel private(i, local)
  {
    #pragma omp single
      pi = 0.0;
    #pragma omp for reduction(+: pi)
    for (i = 0; i < N; i++) {
      local = (i + 0.5) * w;  pi = pi + 4.0 / (1.0 + local * local); }
  }
}
31
Nested Parallelism
32
The expression of nested parallelism in OpenMP has to conform to these two rules (a small sketch of the first rule follows below):
  • A parallel directive dynamically inside another parallel establishes a new team, which is composed of only the current thread unless nested parallelism is enabled.
  • for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.
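A minimal sketch of the first rule (an illustration added here, not code from the slides): unless nested parallelism is enabled, the inner parallel directive gets a team consisting of only the encountering thread.

#include <stdio.h>
#include <omp.h>

int main(void) {
  omp_set_nested(1);   /* comment this out and each inner team has only 1 thread */
  #pragma omp parallel num_threads(2)
  {
    int outer = omp_get_thread_num();
    #pragma omp parallel num_threads(2)
    printf("outer thread %d, inner thread %d of %d\n",
           outer, omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}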

33
http://www.openmp.org/index.cgi?faq
Q6 What about nested parallelism?
A6 Nested parallelism is permitted by the
OpenMP specification. Supporting nested
parallelism effectively can be difficult, and we
expect most vendors will start out by executing
nested parallel constructs on a single thread.
OpenMP encourages vendors to experiment with
nested parallelism to help us and the users of
OpenMP understand the best model and API to
include in our specification.
We will include the necessary functionality when
we understand the issues better.
34
A parallel directive dynamically inside another
parallel establishes a new team, which is
composed of only the current thread unless nested
parallelism is enabled.
NANOS Ayguade E., Martorell X., Labarta J.,
Gonzalez M. and Navarro N. Exploiting Multiple
Levels of Parallelism in OpenMP A Case
Study Proc. of the 1999 International Conference
on Parallel Processing, Aizu (Japan), September
1999. http://www.ac.upc.es/nanos/
35
KAI Shah S, Haab G, Petersen P, Throop
J. Flexible control structures for parallelism in
OpenMP. 1st European Workshop on OpenMP, Lund,
Sweden, September 1999. http://developer.intel.com/software/products/trans/kai/

Nodeptr list = ...;
#pragma omp taskq
for (nodeptr p = list; p != NULL; p = p->next) {
  #pragma omp task
  process(p->data);
}
36
The Workqueuing Model
void traverse(Node &node) {
  process(node.data);
  if (node.has_left) traverse(node.left);
  if (node.has_right) traverse(node.right);
}
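Using the taskq/task pragmas from the previous slide, the recursive traversal could be parallelized roughly as sketched below; this is an illustration of the workqueuing model added here, not code taken from the slides:

/* The thread that encounters taskq executes the block and enqueues each
   task it meets; idle threads in the team dequeue and run those tasks. */
void traverse(Node &node) {
  #pragma omp taskq
  {
    #pragma omp task
    process(node.data);
    if (node.has_left) {
      #pragma omp task
      traverse(node.left);
    }
    if (node.has_right) {
      #pragma omp task
      traverse(node.right);
    }
  }
}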
Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, Yuli Zhou: Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing 37(1): 55-69 (1996). http://supertech.lcs.mit.edu/cilk/
37
A parallel directive dynamically inside another
parallel establishes a new team, which is
composed of only the current thread unless nested
parallelism is enabled.
OMNI Yoshizumi Tanaka, Kenjiro Taura, Mitsuhisa
Sato, and Akinori Yonezawa Performance Evaluation
of OpenMP Applications with Nested
Parallelism Languages, Compilers, and Run-Time
Systems for Scalable Computers pp. 100-112,
2000. http://pdplab.trc.rwcp.or.jp/Omni/
38
for, sections and single directives that bind to
the same parallel are not allowed to be nested
inside each other.
Simplicity!!
39
for, sections and single directives that bind to
the same parallel are not allowed to be nested
inside each other.
Chandra R., Menon R., Dagum L., Kohr D., Maydan D. and McDonald J. Parallel Programming in OpenMP. Morgan Kaufmann Publishers (Academic Press), 2001.
40
for, sections and single directives that bind to
the same parallel are not allowed to be nested
inside each other.
Page 122: "A work-sharing construct divides a piece of work among a team of parallel threads. However, once a thread is executing within a work-sharing construct, it is the only thread executing that code; there is no team of threads executing that specific piece of code anymore, so it is nonsensical to attempt to further divide a portion of work using a work-sharing construct. Nesting of work-sharing constructs is therefore illegal in OpenMP."
41
Divide and Conquer
void qs(int *v, int first, int last) {
  int i, j;
  if (first < last) {
    #pragma ll MASTER
      partition(v, &i, &j, first, last);
    #pragma ll sections firstprivate(i, j)
    {
      #pragma ll section
        qs(v, first, j);
      #pragma ll section
        qs(v, i, last);
    }
  }
}
42
(Same qs() code as slide 41.)
qs(v, first, last);
...
43
(Same qs() code as slide 41.)
...
44
(Same qs() code as slide 41.)
...
45
OpenMP Architecture Review Board. OpenMP C and C++ Application Program Interface, v. 1.0, October 1998. http://www.openmp.org/specs/mp-documents/cspec10.ps
(Same qs() code as slide 41.)
Page 14: "The sections directive identifies a non-iterative work-sharing construct that specifies a set of constructs that are to be divided among threads in a team. Each section is executed once by a thread in the team."
46
NWS: The Run-Time Library
(Same qs() code as slide 41, shown next to its translation into run-time library calls below.)
void qs(int *v, int first, int last) {
  int i, j;
  if (first < last) {
    MASTER
      partition(v, &i, &j, first, last);
    ll_2_sections( ll_2_first_private(i, j),
                   qs(v, first, j),
                   qs(v, i, last) );
  }
}
47
NWS: The Run-Time Library
#define ll_2_sections(decl, f1, f2)                \
  {                                                \
    decl;                                          \
    ll_barrier();                                  \
    {                                              \
      int ll_oldname = ll_NAME,                    \
          ll_oldnp   = ll_NUMPROCESSORS;           \
      if (ll_DIVIDE(2)) { f2; }                    \
      else              { f1; }                    \
      ll_REJOIN(ll_oldname, ll_oldnp);             \
    }                                              \
  }
48
void ll_barrier() {
  MASTER {
    int i, all_arrived;
    do {
      all_arrived = 1;
      for (i = 1; i < ll_NUMPROCESSORS; i++)
        if (!ll_ARRIVED[i]) all_arrived = 0;
    } while (!all_arrived);
    for (i = 1; i < ll_NUMPROCESSORS; i++)
      ll_ARRIVED[i] = 0;
  }
  SLAVE {
    ll_ARRIVED[ll_NAME] = 1;
    while (ll_ARRIVED[ll_NAME]) ;
  }
}
49
int ll_DIVIDE(int ngroups) {
  int ll_group;
  double ll_first, ll_last;
  ll_group = (int)floor(((double)(ngroups * ll_NAME)) / ((double)ll_NUMPROCESSORS));
  ll_first = (int)ceil(((double)(ll_NUMPROCESSORS * ll_group)) / ((double)ngroups));
  ll_last  = (int)ceil((((double)(ll_NUMPROCESSORS * (ll_group + 1))) / ((double)ngroups)) - 1);
  ll_NUMPROCESSORS = ll_last - ll_first + 1;
  ll_NAME = ll_NAME - ll_first;
  return ll_group;
}

void ll_REJOIN(int old_name, int old_np) {
  ll_NAME = old_name;
  ll_NUMPROCESSORS = old_np;
}

(A bit) more overhead if weights are provided!
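As a worked trace of the partitioning above (assuming the reconstruction of ll_DIVIDE is faithful), take ll_NUMPROCESSORS = 5 and a call to ll_DIVIDE(2):

  processors named 0, 1, 2: ll_group = floor(2·NAME/5) = 0, ll_first = 0, ll_last = 2, so this group keeps 3 processors renamed 0..2;
  processors named 3, 4:    ll_group = 1, ll_first = ceil(5/2) = 3, ll_last = 4, so this group keeps 2 processors renamed 0..1.

The group for which ll_DIVIDE returns 0 then executes f1 and the other group executes f2 in the ll_2_sections macro.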
50
#define ll_2_first_private(ll_var1, ll_var2)                     \
  {                                                              \
    void **q;                                                    \
    MASTER {                                                     \
      ll_FIRST_PRIVATE = (void **) malloc(2 * sizeof(void *));   \
      q = ll_FIRST_PRIVATE;                                      \
      *q = (void *) &(ll_var1);                                  \
      q++;                                                       \
      *q = (void *) &(ll_var2);                                  \
    }                                                            \
    ll_barrier();                                                \
    SLAVE {                                                      \
      q = ll_FIRST_PRIVATE;                                      \
      memcpy((void *) &(ll_var1), *q, sizeof(ll_var1));          \
      q++;                                                       \
      memcpy((void *) &(ll_var2), *q, sizeof(ll_var2));          \
    }                                                            \
    ll_barrier();                                                \
    MASTER { free(ll_FIRST_PRIVATE); }                           \
  }
51
FAST FOURIER TRANSFORM
52
void llcFFT(Complex *A, Complex *a, Complex *W,
            unsigned N, unsigned stride, Complex *D) {
  Complex *B, *C;
  Complex Aux, *pW;
  unsigned i, n;

  if (N == 1) {
    #pragma ll MASTER
    { A[0].re = a[0].re;  A[0].im = a[0].im; }
  } else {
    n = (N >> 1);
    #pragma ll sections
    {
      #pragma ll section
      llcFFT(D, a, W, n, stride << 1, A);
      #pragma ll section
      llcFFT(D + n, a + stride, W, n, stride << 1, A + n);
    }
    B = D;  C = D + n;  pW = W;
    #pragma ll for private(i)
    for (i = 0; i < n; i++) {
      Aux.re = pW->re * C[i].re - pW->im * C[i].im;
      Aux.im = pW->re * C[i].im + pW->im * C[i].re;
      A[i].re = B[i].re + Aux.re;  A[i].im = B[i].im + Aux.im;
      A[i + n].re = B[i].re - Aux.re;  A[i + n].im = B[i].im - Aux.im;
      pW += stride;
    }
  }
}
53
(No Transcript)
54
void kaiFFTdp(Complex *A, ... , int level) {
  ...
  if (level > CRITICAL_LEVEL)
    seqDandCFFT(A, a, W, N, stride, D);
  else if (N == 1) {
    A[0].re = a[0].re;  A[0].im = a[0].im;
  } else {
    n = (N >> 1);
    B = D;  C = D + n;
    #pragma omp taskq
    {
      #pragma omp task
      kaiFFTdp(B, a, W, n, stride << 1, A, level + 1);
      #pragma omp task
      kaiFFTdp(C, a + stride, W, n, stride << 1, A + n, level + 1);
    }
  }
}
55
#pragma omp taskq private(i, Aux, pW, j, start, end)
{
  Workers = omp_get_num_threads();
  CHUNK_SIZE = n / Workers + ((n % Workers) > 0);
  for (j = 0; j < n; j += CHUNK_SIZE) {
    start = j;
    end = min(start + CHUNK_SIZE, n);
    pW = W + start * stride;
    #pragma omp task
    for (i = start; i < end; i++) {
      ...
    } /* task */
  }
} /* taskq */
56
(No Transcript)
57
OpenMP Distributed Memory
58
OpenMP Distributed Memory
  • Barbara Chapman, Piyush Mehrotra and Hans Zima
  • Enhancing OpenMP With Features for Locality Control
  • Technical Report TR99-02,
  • Inst. for Software Technology and Parallel Systems, U. Vienna, Feb. 1999.
  • http://www.par.univie.ac.at
  • C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel
  • TreadMarks: Shared Memory Computing on Networks of Workstations
  • IEEE Computer, 29(2), pp. 18-28, February 1996.
  • http://www.cs.rice.edu/~willy/TreadMarks/overview.html

59
#pragma ll parallel for
#pragma ll result(ri[i], si[i])
for (i = 1; i <= 3; i++) {
  ...
  #pragma ll for
  #pragma ll result(rj[j], sj[j])
  for (j = 0; j < i; j++)
    rj[j] = function_j(i, j, sj[j], ....);
  ri[i] = function_i(i, si[i], ....);
}
60
(Same code as slide 59.)
...
61
(Figure: three worker threads.)
62
#pragma ll parallel for
#pragma ll result(rx[x], sx[x])
for (x = 3; x <= 5; x++) {
  #pragma ll for
  #pragma ll result(ry[y], sy[y])
  for (y = x; y < x + 1; y++)
    ry[y] = ...;
  rx[x] = ...;
}
63
One Thread is One Set of Processors Model (OTOSP)
(Figure: processors 0-5 partitioned into subsets, with weights w1 = 20 and w2 = 10.)
  • Each Subset of Processors corresponds to one and only one Thread.
  • A Group is a Family of Subsets.
  • A Team is a Family of Groups.

64
(No Transcript)
65
(No Transcript)
66
forall (i = 1; i <= 3; i++) result(ri[i], si[i]) {
  ...
  forall (j = 0; j < i; j++) result(rj[j], sj[j]) {
    int a, b;
    ...
    if (i % 2 == 1) {
      if (j % 2 == 0) send(j, j + 1, &a, sizeof(int));
      else receive(j, j - 1, &b, sizeof(int));
    } } }
67
Conclusions and Open Questions (Shared Memory)
  • Is OpenMP here to stay?
  • Scalability of Shared Memory Machines?
  • Performance Prediction Tools and Models?
  • OpenMP for Distributed Memory Machines?
  • The Work Queuing Model?

Pointer
  • http://nereida.deioc.ull.es/html/openmp.html