1. Introduction to Parallel Programming (Shared Memory)
Casiano Rodríguez León, casiano_at_ull.es
Departamento de Estadística, Investigación Operativa y Computación
2. Introduction
3. What is parallel computing?
- Parallel computing: the use of multiple computers or processors working together on a common task.
- Each processor works on its section of the problem.
- Processors are allowed to exchange information with other processors.
4. Why do parallel computing?
- Limits of single-CPU computing
  - Available memory
  - Performance
- Parallel computing allows us to
  - Solve problems that don't fit on a single CPU
  - Solve problems that can't be solved in a reasonable time
- We can run
  - Larger problems
  - Faster
  - More cases
5. Performance Considerations
6. SPEEDUP

SPEEDUP = TIME OF THE FASTEST SEQUENTIAL ALGORITHM / TIME OF THE PARALLEL ALGORITHM

SPEEDUP <= NUMBER OF PROCESSORS

- Consider a parallel algorithm that runs in T steps on P processors.
- It is a simple fact that the parallel algorithm can be simulated by a sequential machine in T*P steps.
- Therefore the best sequential algorithm runs in Tbest_seq <= T*P.
7. Amdahl's Law
- Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors.
- Effect of multiple processors on run time:
    T(P) = (fs + fp / P) * T(1)
- Effect of multiple processors on speedup:
    S(P) = 1 / (fs + fp / P)
- Where
  - fs = serial fraction of code
  - fp = parallel fraction of code
  - P = number of processors
8. Illustration of Amdahl's Law
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
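The formula above can be checked with a short C sketch (the function name is ours, not from the slides):

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law: S(P) = 1 / (fs + fp/P), where fs is the serial fraction,
   fp = 1 - fs is the parallel fraction, and P the number of processors. */
static double amdahl_speedup(double fs, int p) {
    double fp = 1.0 - fs;
    return 1.0 / (fs + fp / (double)p);
}
```

With fs = 0 the speedup is exactly P, but even 5% serial content (fs = 0.05) caps the speedup below 1/fs = 20 no matter how many processors are used.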
9. Amdahl's Law vs. Reality
Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance.
10. Memory/Cache

(figure: CPU - Cache - MAIN MEMORY hierarchy)
11. Locality and Caches
- a(i) = b(i) + c(i)
- On uniprocessor systems, memory is monolithic from a correctness point of view, but not from a performance point of view:
- It might take more time to bring a(i) from memory than to bring b(i).
- Bringing in a(i) at one point in time might take longer than bringing it in at a later point in time.
12.
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    a[j][i] = 0;

a[j][i] and a[j+1][i] have stride n, n being the dimension of a.
There is a stride-1 access to a[j][i+1] that occurs n iterations after the reference to a[j][i].

Spatial locality: when an element is referenced, its neighbors will be referenced too.
Temporal locality: when an element is referenced, it might be referenced again soon.
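A minimal sketch of the fix implied above (our own example, not from the slides): interchanging the two loops turns the stride-n accesses into stride-1 accesses while producing the same result, so each cache line brought in is fully used.

```c
#include <assert.h>
#include <string.h>

#define N 512

static int a[N][N];

/* As on the slide: a[j][i] and a[j+1][i] are n ints apart (stride n). */
static void zero_stride_n(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[j][i] = 0;
}

/* Interchanged loops: consecutive iterations touch adjacent memory
   (stride 1), which exploits spatial locality in the cache. */
static void zero_stride_1(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[j][i] = 0;
}
```

Both functions compute the same thing; only the memory access order, and therefore the cache behavior, differs.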
13. Shared Memory Machines
14. Shared and Distributed Memory
Distributed memory: each processor has its own local memory; processors must do message passing to exchange data.
Shared memory: single address space; all processors have access to a pool of shared memory. Methods of memory access: bus, crossbar.
15. Styles of Shared Memory: UMA and NUMA
16. UMA Memory Access Problems
- Conventional wisdom is that these systems do not scale well
  - Bus-based systems can become saturated
  - Fast large crossbars are expensive
- Cache coherence problem
  - Copies of a variable can be present in multiple caches
  - A write by one processor may not become visible to others
  - They'll keep accessing the stale value in their caches
  - Need to take actions to ensure visibility, i.e. cache coherence
17. Cache Coherence Problem
- Processors see different values for u after event 3
- With write-back caches, the value written back to memory depends on which cache flushes or writes back its value, and when
- Processes accessing main memory may see the old value
18. Snooping-Based Coherence
- Basic idea
  - Transactions on memory are visible to all processors
  - Processors (or their representatives) can snoop (monitor) the bus and take action on relevant events
- Implementation
  - When a processor writes a value, a signal is sent over the bus
  - The signal is either
    - Write invalidate: tell others the cached value is invalid
    - Write broadcast: tell others the new value
19. Memory Consistency Models
20. Memory Consistency Models
In some commercial shared memory systems it is possible to observe the old value of MyTask->data!!
21. Distributed Shared Memory (NUMA)
- Consists of N processors and a global address space
- All processors can see all memory
- Each processor has some amount of local memory
- Access to the memory of other processors is slower
- Non-Uniform Memory Access
22. SGI Origin 2000
23. OpenMP Programming
24. Origin 2000 Memory Hierarchy

Level                          Latency (cycles)
register                       0
primary cache                  2..3
secondary cache                8..10
local main memory / TLB hit    75
remote main memory / TLB hit   250
main memory / TLB miss         2000
page fault                     10^6
25. OpenMP
- OpenMP C and C++ Application Program Interface
  - DRAFT Version 2.0, November 2001 (DRAFT 11.05)
  - OpenMP Architecture Review Board
- http://www.openmp.org/
- http://www.compunity.org/
- http://www.openmp.org/specs/
- http://www.it.kth.se/labs/cs/odinmp/
- http://phase.etl.go.jp/Omni/
26. Hello World in OpenMP
27.
#pragma omp parallel private(i, id, p, load, begin, end)
{
  p = omp_get_num_threads();
  id = omp_get_thread_num();
  load = N / p;
  begin = id * load;
  end = begin + load;
  for (i = begin; ((i < end) && keepon); i++) {
    if (a[i] == x) {
      keepon = 0;
      position = i;
      #pragma omp flush(keepon)
    }
  }
}
28.
(same search code as slide 27)
29.
(same search code as slide 27)

Search for x = 100. P = 10 processors.
The sequential algorithm traverses 900 elements.
Processor 9 finds x in the first step.
Speedup >= 900/1 > 10 = P
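The block decomposition behind this superlinear-speedup example can be sketched on its own (our own variable names; N = 1000 and P = 10 match the example, and we assume P divides N as the slide's load = N/p does):

```c
#include <assert.h>

/* Each thread id gets the contiguous block [id*load, id*load + load),
   exactly as computed by begin/end on the slide. */
static void block_range(int id, int p, int n, int *begin, int *end) {
    int load = n / p;      /* assumes p divides n */
    *begin = id * load;
    *end   = *begin + load;
}
```

With N = 1000 and P = 10, thread 9 owns [900, 1000), so an element sitting at position 900 is found by thread 9 in its first step, while a sequential scan must walk the 900 elements before it.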
30.
main() {
  double local, pi = 0.0, w;
  long i;
  w = 1.0 / N;
  #pragma omp parallel private(i, local)
  {
    #pragma omp single
    pi = 0.0;
    #pragma omp for reduction(+: pi)
    for (i = 0; i < N; i++) {
      local = (i + 0.5) * w;
      pi = pi + 4.0 / (1.0 + local * local);
    }
  }
}
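For reference, a sequential version of the same midpoint-rule integration (our reading of the fragment; the final multiplication by w, which the slide omits, is included here so the sum actually approximates pi):

```c
#include <assert.h>
#include <math.h>

/* Approximate pi as the integral of 4/(1+x^2) over [0,1],
   using the midpoint rule with n intervals of width w = 1/n. */
static double pi_midpoint(long n) {
    double w = 1.0 / (double)n;
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        double local = ((double)i + 0.5) * w;   /* midpoint of interval i */
        sum += 4.0 / (1.0 + local * local);
    }
    return sum * w;   /* scale by the interval width */
}
```

This is exactly the loop the slide parallelizes with `#pragma omp for reduction(+: pi)`: each thread accumulates a partial sum, and the reduction combines them.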
31. Nested Parallelism
32. The expression of nested parallelism in OpenMP has to conform to these two rules:
- A parallel directive dynamically inside another parallel establishes a new team, which is composed of only the current thread, unless nested parallelism is enabled.
- for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.
33. http://www.openmp.org/index.cgi?faq
Q6: What about nested parallelism?
A6: Nested parallelism is permitted by the
OpenMP specification. Supporting nested
parallelism effectively can be difficult, and we
expect most vendors will start out by executing
nested parallel constructs on a single thread.
OpenMP encourages vendors to experiment with
nested parallelism to help us and the users of
OpenMP understand the best model and API to
include in our specification.
We will include the necessary functionality when
we understand the issues better.
34. A parallel directive dynamically inside another parallel establishes a new team, which is composed of only the current thread, unless nested parallelism is enabled.

NANOS: Ayguade E., Martorell X., Labarta J., Gonzalez M. and Navarro N. Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study. Proc. of the 1999 International Conference on Parallel Processing, Aizu (Japan), September 1999. http://www.ac.upc.es/nanos/
35. KAI: Shah S., Haab G., Petersen P., Throop J. Flexible control structures for parallelism in OpenMP. 1st European Workshop on OpenMP, Lund, Sweden, September 1999. http://developer.intel.com/software/products/trans/kai/

Nodeptr list;
...
#pragma omp taskq
for (Nodeptr p = list; p != NULL; p = p->next) {
  #pragma omp task
  process(p->data);
}
36. The Workqueuing Model

void traverse(Node node) {
  process(node.data);
  if (node.has_left)
    traverse(node.left);
  if (node.has_right)
    traverse(node.right);
}

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing 37(1): 55-69 (1996). http://supertech.lcs.mit.edu/cilk/
37. A parallel directive dynamically inside another parallel establishes a new team, which is composed of only the current thread, unless nested parallelism is enabled.

OMNI: Yoshizumi Tanaka, Kenjiro Taura, Mitsuhisa Sato, and Akinori Yonezawa. Performance Evaluation of OpenMP Applications with Nested Parallelism. Languages, Compilers, and Run-Time Systems for Scalable Computers, pp. 100-112, 2000. http://pdplab.trc.rwcp.or.jp/Omni/
38. for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.

Simplicity!!
39. for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.

Chandra R., Menon R., Dagum L., Kohr D., Maydan D. and McDonald J. Parallel Programming in OpenMP. Morgan Kaufmann Publishers, Academic Press, 2001.
40. for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.

Page 122: "A work-sharing construct divides a piece of work among a team of parallel threads. However, once a thread is executing within a work-sharing construct, it is the only thread executing that code; there is no team of threads executing that specific piece of code anymore, so it is nonsensical to attempt to further divide a portion of work using a work-sharing construct. Nesting of work-sharing constructs is therefore illegal in OpenMP."
41. Divide and Conquer

void qs(int *v, int first, int last) {
  int i, j;
  if (first < last) {
    #pragma ll MASTER
    partition(v, &i, &j, first, last);
    #pragma ll sections firstprivate(i, j)
    {
      #pragma ll section
      qs(v, first, j);
      #pragma ll section
      qs(v, i, last);
    }
  }
}
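For comparison, a plain sequential version of the same divide-and-conquer pattern. The partition helper below is our own assumption: we take the slides' partition(v, &i, &j, first, last) to be a two-index Hoare-style split, which is consistent with the recursive calls qs(v, first, j) and qs(v, i, last).

```c
#include <assert.h>

/* Hoare-style split (our assumption about the slides' partition):
   on return, v[first..j] <= pivot <= v[i..last] with j < i. */
static void partition(int *v, int *i, int *j, int first, int last) {
    int pivot = v[(first + last) / 2];
    *i = first;
    *j = last;
    while (*i <= *j) {
        while (v[*i] < pivot) (*i)++;
        while (v[*j] > pivot) (*j)--;
        if (*i <= *j) {
            int t = v[*i]; v[*i] = v[*j]; v[*j] = t;
            (*i)++;
            (*j)--;
        }
    }
}

/* Sequential qs: the two recursive calls are exactly the ones the
   slide runs as parallel sections. */
static void qs(int *v, int first, int last) {
    int i, j;
    if (first < last) {
        partition(v, &i, &j, first, last);
        qs(v, first, j);
        qs(v, i, last);
    }
}
```

The parallel version only changes who executes each recursive call, not the recursion structure itself.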
42.
(same qs code as slide 41)

qs(v, first, last);
...
43.
(same qs code as slide 41)
...
44.
(same qs code as slide 41)
...
45. OpenMP Architecture Review Board. OpenMP C and C++ application program interface v. 1.0, October 1998. http://www.openmp.org/specs/mp-documents/cspec10.ps

(qs code as in slide 41)

Page 14: "The sections directive identifies a non-iterative work-sharing construct that specifies a set of constructs that are to be divided among threads in a team. Each section is executed once by a thread in the team."
46. NWS: The Run Time Library

(qs code as in slide 41, and its run-time-library translation:)

void qs(int *v, int first, int last) {
  int i, j;
  if (first < last) {
    MASTER
      partition(v, &i, &j, first, last);
    ll_2_sections( ll_2_first_private(i, j),
                   qs(v, first, j),
                   qs(v, i, last) );
  }
}
47. NWS: The Run Time Library

#define ll_2_sections(decl, f1, f2)          \
  {                                          \
    decl;                                    \
    ll_barrier();                            \
    {                                        \
      int ll_oldname = ll_NAME,              \
          ll_oldnp   = ll_NUMPROCESSORS;     \
      if (ll_DIVIDE(2)) { f2; }              \
      else              { f1; }              \
      ll_REJOIN(ll_oldname, ll_oldnp);       \
    }                                        \
  }
48.
void ll_barrier() {
  MASTER {
    int i, all_arrived;
    do {
      all_arrived = 1;
      for (i = 1; i < ll_NUMPROCESSORS; i++)
        if (!(ll_ARRIVED[i]))
          all_arrived = 0;
    } while (!all_arrived);
    for (i = 1; i < ll_NUMPROCESSORS; i++)
      ll_ARRIVED[i] = 0;
  }
  SLAVE {
    ll_ARRIVED[ll_NAME] = 1;
    while (ll_ARRIVED[ll_NAME])
      ;
  }
}
49.
int ll_DIVIDE(int ngroups) {
  int ll_group;
  double ll_first;
  double ll_last;
  ll_group = (int)floor(((double)(ngroups * ll_NAME)) / ((double)ll_NUMPROCESSORS));
  ll_first = (int)ceil(((double)(ll_NUMPROCESSORS * ll_group)) / ((double)ngroups));
  ll_last  = (int)ceil((((double)(ll_NUMPROCESSORS * (ll_group + 1))) / ((double)ngroups)) - 1);
  ll_NUMPROCESSORS = ll_last - ll_first + 1;
  ll_NAME = ll_NAME - ll_first;
  return ll_group;
}

void ll_REJOIN(int old_name, int old_np) {
  ll_NAME = old_name;
  ll_NUMPROCESSORS = old_np;
}

(A bit) more overhead if weights are provided!
50.
#define ll_2_first_private(ll_var1, ll_var2)                    \
  {                                                             \
    void **q;                                                   \
    MASTER {                                                    \
      ll_FIRST_PRIVATE = (void **)malloc(2 * sizeof(void *));   \
      q = ll_FIRST_PRIVATE;                                     \
      *q = (void *)&(ll_var1);                                  \
      q++;                                                      \
      *q = (void *)&(ll_var2);                                  \
    }                                                           \
    ll_barrier();                                               \
    SLAVE {                                                     \
      q = (ll_FIRST_PRIVATE - ll_NAME);                         \
      memcpy((void *)&(ll_var1), *q, sizeof(ll_var1));          \
      q++;                                                      \
      memcpy((void *)&(ll_var2), *q, sizeof(ll_var2));          \
    }                                                           \
    ll_barrier();                                               \
    MASTER free(ll_FIRST_PRIVATE);                              \
  }
51. FAST FOURIER TRANSFORM
52.
void llcFFT(Complex *A, Complex *a, Complex *W,
            unsigned N, unsigned stride, Complex *D) {
  Complex *B, *C;
  Complex Aux, *pW;
  unsigned i, n;
  if (N == 1) {
    #pragma ll MASTER
    {
      A[0].re = a[0].re;
      A[0].im = a[0].im;
    }
  } else {
    n = (N >> 1);
    #pragma ll sections
    {
      #pragma ll section
      llcFFT(D, a, W, n, stride << 1, A);
      #pragma ll section
      llcFFT(D + n, a + stride, W, n, stride << 1, A + n);
    }
    B = D;
    C = D + n;
    pW = W;
    #pragma ll for private(i)
    for (i = 0; i < n; i++) {
      Aux.re = pW->re * C[i].re - pW->im * C[i].im;
      Aux.im = pW->re * C[i].im + pW->im * C[i].re;
      A[i].re = B[i].re + Aux.re;
      A[i].im = B[i].im + Aux.im;
      A[i + n].re = B[i].re - Aux.re;
      A[i + n].im = B[i].im - Aux.im;
      pW += stride;
    }
  }
}
53. (No Transcript)
54.
void kaiFFTdp(Complex *A, ..., int level) {
  ...
  if (level > CRITICAL_LEVEL)
    seqDandCFFT(A, a, W, N, stride, D);
  else if (N == 1) {
    A[0].re = a[0].re;
    A[0].im = a[0].im;
  } else {
    n = (N >> 1);
    B = D;
    C = D + n;
    #pragma omp taskq
    {
      #pragma omp task
      kaiFFTdp(B, a, W, n, stride << 1, A, level + 1);
      #pragma omp task
      kaiFFTdp(C, a + stride, W, n, stride << 1, A + n, level + 1);
    }
  }
}
55.
#pragma omp taskq private(i, Aux, pW, j, start, end)
{
  Workers = omp_get_num_threads();
  CHUNK_SIZE = n / Workers + ((n % Workers) > 0);
  for (j = 0; j < n; j += CHUNK_SIZE) {
    start = j;
    end = min(start + CHUNK_SIZE, n);
    pW = W + start * stride;
    #pragma omp task
    {
      for (i = start; i < end; i++) {
        ...
      }
    } /* task */
  }
} /* taskq */
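The chunking expression above, n / Workers + ((n % Workers) > 0), is just a ceiling division. This small sketch (our own names) checks that the resulting chunks cover [0, n) exactly:

```c
#include <assert.h>

/* Ceiling division: the smallest chunk size such that `workers`
   chunks of this size cover all n iterations. */
static int chunk_size(int n, int workers) {
    return n / workers + ((n % workers) > 0);
}
```

Each task then gets the half-open range [start, min(start + CHUNK_SIZE, n)), with the min() clipping the last, possibly shorter, chunk.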
56. (No Transcript)
57. OpenMP Distributed Memory
58. OpenMP Distributed Memory
- Barbara Chapman, Piyush Mehrotra and Hans Zima. Enhancing OpenMP With Features for Locality Control. Technical report TR99-02, Inst. for Software Technology and Parallel Systems, U. Vienna, Feb. 1999. http://www.par.univie.ac.at
- C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. IEEE Computer, 29(2), pp. 18-28, February 1996. http://www.cs.rice.edu/willy/TreadMarks/overview.html
59.
#pragma ll parallel for
#pragma ll result(r[i], s[i])
for (i = 1; i < 3; i++) {
  ...
  #pragma ll for
  #pragma ll result(r[j], s[j])
  for (j = 0; j < i; j++)
    r[j] = function_j(i, j, s[j], ...);
  r[i] = function_i(i, s[i], ...);
}
60.
(same code as slide 59)
...
61.
(figure: three workers)
62.
#pragma ll parallel for
#pragma ll result(r[x], s[x])
for (x = 3; x < 5; x++) {
  #pragma ll for
  #pragma ll result(r[y], s[y])
  for (y = x; y < x + 1; y++)
    r[y] = ...;
  r[x] = ...;
}
63. One Thread is One Set of Processors Model (OTOSP)

(figure: processors 0..5 grouped into subsets, with weights w1 = 20, w2 = 10)

- Each subset of processors corresponds to one and only one thread.
- A group is a family of subsets.
- A team is a family of groups.
64. (No Transcript)
65. (No Transcript)
66.
forall (i = 1; i < 3; i++) result(r[i], s[i]) {
  ...
  forall (j = 0; j < i; j++) result(r[j], s[j]) {
    int a, b;
    ...
    if (i % 2 == 1) {
      if (j % 2 == 0)
        send(j, j + 1, &a, sizeof(int));
      else
        receive(j, j - 1, &b, sizeof(int));
    }
  }
}
67. Conclusions and Open Questions (Shared Memory)
- Scalability of shared memory machines?
- Performance prediction tools and models?
- OpenMP for distributed memory machines?

Pointer:
- http://nereida.deioc.ull.es/html/openmp.html