1. Introduction to Parallel Programming (Shared Memory)
Casiano Rodríguez León, casiano_at_ull.es
Departamento de Estadística, Investigación Operativa y Computación
2. Introduction
3. What is parallel computing?
- Parallel computing: the use of multiple computers or processors working together on a common task.
- Each processor works on its section of the problem.
- Processors are allowed to exchange information with other processors.
4. Why do parallel computing?
- Limits of single-CPU computing
  - Available memory
  - Performance
- Parallel computing allows us to
  - Solve problems that don't fit on a single CPU
  - Solve problems that can't be solved in a reasonable time
- We can run
  - Larger problems
  - Faster
  - More cases
5. Performance Considerations
6. SPEEDUP

SPEEDUP = TIME OF THE FASTEST SEQUENTIAL ALGORITHM / TIME OF THE PARALLEL ALGORITHM

SPEEDUP <= NUMBER OF PROCESSORS

- Consider a parallel algorithm that runs in T steps on P processors.
- It is a simple fact that the parallel algorithm can be simulated by a sequential machine in T*P steps.
- Therefore the best sequential algorithm runs in Tbest_seq <= T*P.
7. Amdahl's Law
- Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors.
- Effect of multiple processors on run time:
    T(P) = (fs + fp / P) * T(1)
- Effect of multiple processors on speedup:
    S(P) = 1 / (fs + fp / P)
- Where
  - fs = serial fraction of code
  - fp = parallel fraction of code
  - P = number of processors
8. Illustration of Amdahl's Law
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
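The formula above can be checked with a short C sketch (the function name is ours, not from the slides):

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law: S(P) = 1 / (fs + fp/P), where fs is the serial fraction,
   fp = 1 - fs is the parallel fraction, and P the number of processors. */
static double amdahl_speedup(double fs, int p) {
    double fp = 1.0 - fs;
    return 1.0 / (fs + fp / (double)p);
}
```

With fs = 0 the speedup is exactly P, but even 5% serial content (fs = 0.05) caps the speedup below 1/fs = 20 no matter how many processors are used.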
9. Amdahl's Law vs. Reality
Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance.
10. Memory/Cache

(figure: CPU - Cache - MAIN MEMORY hierarchy)
11. Locality and Caches
- a(i) = b(i) + c(i)
- On uniprocessor systems, memory is monolithic from a correctness point of view, but not from a performance point of view:
- It might take more time to bring a(i) from memory than to bring b(i).
- Bringing in a(i) at one point in time might take longer than bringing it in at a later point in time.
12.
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    a[j][i] = 0;

a[j][i] and a[j+1][i] have stride n, n being the dimension of a.
There is a stride-1 access to a[j][i+1] that occurs n iterations after the reference to a[j][i].

Spatial locality: when an element is referenced, its neighbors will be referenced too.
Temporal locality: when an element is referenced, it might be referenced again soon.
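A minimal sketch of the fix implied above (our own example, not from the slides): interchanging the two loops turns the stride-n accesses into stride-1 accesses while producing the same result, so each cache line brought in is fully used.

```c
#include <assert.h>
#include <string.h>

#define N 512

static int a[N][N];

/* As on the slide: a[j][i] and a[j+1][i] are n ints apart (stride n). */
static void zero_stride_n(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[j][i] = 0;
}

/* Interchanged loops: consecutive iterations touch adjacent memory
   (stride 1), which exploits spatial locality in the cache. */
static void zero_stride_1(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[j][i] = 0;
}
```

Both functions compute the same thing; only the memory access order, and therefore the cache behavior, differs.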
13. Shared Memory Machines
14. Shared and Distributed Memory
Distributed memory: each processor has its own local memory; processors must do message passing to exchange data.
Shared memory: single address space; all processors have access to a pool of shared memory. Methods of memory access: bus, crossbar.
15. Styles of Shared Memory: UMA and NUMA
16. UMA Memory Access Problems
- Conventional wisdom is that these systems do not scale well
  - Bus-based systems can become saturated
  - Fast large crossbars are expensive
- Cache coherence problem
  - Copies of a variable can be present in multiple caches
  - A write by one processor may not become visible to others
  - They'll keep accessing the stale value in their caches
  - Need to take actions to ensure visibility, i.e. cache coherence
17. Cache Coherence Problem
- Processors see different values for u after event 3
- With write-back caches, the value written back to memory depends on which cache flushes or writes back its value, and when
- Processes accessing main memory may see the old value
18. Snooping-Based Coherence
- Basic idea
  - Transactions on memory are visible to all processors
  - Processors (or their representatives) can snoop (monitor) the bus and take action on relevant events
- Implementation
  - When a processor writes a value, a signal is sent over the bus
  - The signal is either
    - Write invalidate: tell others the cached value is invalid
    - Write broadcast: tell others the new value
19. Memory Consistency Models
20. Memory Consistency Models
In some commercial shared memory systems it is possible to observe the old value of MyTask->data!!
21. Distributed Shared Memory (NUMA)
- Consists of N processors and a global address space
- All processors can see all memory
- Each processor has some amount of local memory
- Access to the memory of other processors is slower
- Non-Uniform Memory Access
22. SGI Origin 2000
23. OpenMP Programming
24. Origin 2000 Memory Hierarchy

Level                          Latency (cycles)
register                       0
primary cache                  2..3
secondary cache                8..10
local main memory / TLB hit    75
remote main memory / TLB hit   250
main memory / TLB miss         2000
page fault                     10^6
25. OpenMP
- OpenMP C and C++ Application Program Interface
  - DRAFT Version 2.0, November 2001 (DRAFT 11.05)
  - OpenMP Architecture Review Board
- http://www.openmp.org/
- http://www.compunity.org/
- http://www.openmp.org/specs/
- http://www.it.kth.se/labs/cs/odinmp/
- http://phase.etl.go.jp/Omni/
26. Hello World in OpenMP
27.
#pragma omp parallel private(i, id, p, load, begin, end)
{
  p = omp_get_num_threads();
  id = omp_get_thread_num();
  load = N / p;
  begin = id * load;
  end = begin + load;
  for (i = begin; ((i < end) && keepon); i++) {
    if (a[i] == x) {
      keepon = 0;
      position = i;
      #pragma omp flush(keepon)
    }
  }
}
28.
(same search code as slide 27)
29.
(same search code as slide 27)

Search for x = 100. P = 10 processors.
The sequential algorithm traverses 900 elements.
Processor 9 finds x in the first step.
Speedup >= 900/1 > 10 = P
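The block decomposition behind this superlinear-speedup example can be sketched on its own (our own variable names; N = 1000 and P = 10 match the example, and we assume P divides N as the slide's load = N/p does):

```c
#include <assert.h>

/* Each thread id gets the contiguous block [id*load, id*load + load),
   exactly as computed by begin/end on the slide. */
static void block_range(int id, int p, int n, int *begin, int *end) {
    int load = n / p;      /* assumes p divides n */
    *begin = id * load;
    *end   = *begin + load;
}
```

With N = 1000 and P = 10, thread 9 owns [900, 1000), so an element sitting at position 900 is found by thread 9 in its first step, while a sequential scan must walk the 900 elements before it.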
30.
main() {
  double local, pi = 0.0, w;
  long i;
  w = 1.0 / N;
  #pragma omp parallel private(i, local)
  {
    #pragma omp single
    pi = 0.0;
    #pragma omp for reduction(+: pi)
    for (i = 0; i < N; i++) {
      local = (i + 0.5) * w;
      pi = pi + 4.0 / (1.0 + local * local);
    }
  }
}
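For reference, a sequential version of the same midpoint-rule integration (our reading of the fragment; the final multiplication by w, which the slide omits, is included here so the sum actually approximates pi):

```c
#include <assert.h>
#include <math.h>

/* Approximate pi as the integral of 4/(1+x^2) over [0,1],
   using the midpoint rule with n intervals of width w = 1/n. */
static double pi_midpoint(long n) {
    double w = 1.0 / (double)n;
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        double local = ((double)i + 0.5) * w;   /* midpoint of interval i */
        sum += 4.0 / (1.0 + local * local);
    }
    return sum * w;   /* scale by the interval width */
}
```

This is exactly the loop the slide parallelizes with `#pragma omp for reduction(+: pi)`: each thread accumulates a partial sum, and the reduction combines them.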
31. Nested Parallelism
32. The expression of nested parallelism in OpenMP has to conform to these two rules:
- A parallel directive dynamically inside another parallel establishes a new team, which is composed of only the current thread, unless nested parallelism is enabled.
- for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.
33. http://www.openmp.org/index.cgi?faq
Q6: What about nested parallelism?
A6: Nested parallelism is permitted by the
OpenMP specification. Supporting nested
parallelism effectively can be difficult, and we
expect most vendors will start out by executing
nested parallel constructs on a single thread.
OpenMP encourages vendors to experiment with
nested parallelism to help us and the users of
OpenMP understand the best model and API to
include in our specification.
We will include the necessary functionality when
we understand the issues better.
34. A parallel directive dynamically inside another parallel establishes a new team, which is composed of only the current thread, unless nested parallelism is enabled.

NANOS: Ayguade E., Martorell X., Labarta J., Gonzalez M. and Navarro N. Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study. Proc. of the 1999 International Conference on Parallel Processing, Aizu (Japan), September 1999. http://www.ac.upc.es/nanos/
35. KAI: Shah S., Haab G., Petersen P., Throop J. Flexible control structures for parallelism in OpenMP. 1st European Workshop on OpenMP, Lund, Sweden, September 1999. http://developer.intel.com/software/products/trans/kai/

Nodeptr list;
...
#pragma omp taskq
for (Nodeptr p = list; p != NULL; p = p->next) {
  #pragma omp task
  process(p->data);
}
36. The Workqueuing Model

void traverse(Node node) {
  process(node.data);
  if (node.has_left)
    traverse(node.left);
  if (node.has_right)
    traverse(node.right);
}

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing 37(1): 55-69 (1996). http://supertech.lcs.mit.edu/cilk/
37. A parallel directive dynamically inside another parallel establishes a new team, which is composed of only the current thread, unless nested parallelism is enabled.

OMNI: Yoshizumi Tanaka, Kenjiro Taura, Mitsuhisa Sato, and Akinori Yonezawa. Performance Evaluation of OpenMP Applications with Nested Parallelism. Languages, Compilers, and Run-Time Systems for Scalable Computers, pp. 100-112, 2000. http://pdplab.trc.rwcp.or.jp/Omni/
38. for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.

Simplicity!!
39. for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.

Chandra R., Menon R., Dagum L., Kohr D., Maydan D. and McDonald J. Parallel Programming in OpenMP. Morgan Kaufmann Publishers, Academic Press, 2001.
40. for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.

Page 122: "A work-sharing construct divides a piece of work among a team of parallel threads. However, once a thread is executing within a work-sharing construct, it is the only thread executing that code; there is no team of threads executing that specific piece of code anymore, so it is nonsensical to attempt to further divide a portion of work using a work-sharing construct. Nesting of work-sharing constructs is therefore illegal in OpenMP."
41. Divide and Conquer

void qs(int *v, int first, int last) {
  int i, j;
  if (first < last) {
    #pragma ll MASTER
    partition(v, &i, &j, first, last);
    #pragma ll sections firstprivate(i, j)
    {
      #pragma ll section
      qs(v, first, j);
      #pragma ll section
      qs(v, i, last);
    }
  }
}
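For comparison, a plain sequential version of the same divide-and-conquer pattern. The partition helper below is our own assumption: we take the slides' partition(v, &i, &j, first, last) to be a two-index Hoare-style split, which is consistent with the recursive calls qs(v, first, j) and qs(v, i, last).

```c
#include <assert.h>

/* Hoare-style split (our assumption about the slides' partition):
   on return, v[first..j] <= pivot <= v[i..last] with j < i. */
static void partition(int *v, int *i, int *j, int first, int last) {
    int pivot = v[(first + last) / 2];
    *i = first;
    *j = last;
    while (*i <= *j) {
        while (v[*i] < pivot) (*i)++;
        while (v[*j] > pivot) (*j)--;
        if (*i <= *j) {
            int t = v[*i]; v[*i] = v[*j]; v[*j] = t;
            (*i)++;
            (*j)--;
        }
    }
}

/* Sequential qs: the two recursive calls are exactly the ones the
   slide runs as parallel sections. */
static void qs(int *v, int first, int last) {
    int i, j;
    if (first < last) {
        partition(v, &i, &j, first, last);
        qs(v, first, j);
        qs(v, i, last);
    }
}
```

The parallel version only changes who executes each recursive call, not the recursion structure itself.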
42.
(same qs code as slide 41)

qs(v, first, last);
...
43.
(same qs code as slide 41)
...
44.
(same qs code as slide 41)
...
45. OpenMP Architecture Review Board. OpenMP C and C++ application program interface v. 1.0, October 1998. http://www.openmp.org/specs/mp-documents/cspec10.ps

(qs code as in slide 41)

Page 14: "The sections directive identifies a non-iterative work-sharing construct that specifies a set of constructs that are to be divided among threads in a team. Each section is executed once by a thread in the team."
46. NWS: The Run Time Library

(qs code as in slide 41, and its run-time-library translation:)

void qs(int *v, int first, int last) {
  int i, j;
  if (first < last) {
    MASTER
      partition(v, &i, &j, first, last);
    ll_2_sections( ll_2_first_private(i, j),
                   qs(v, first, j),
                   qs(v, i, last) );
  }
}
47. NWS: The Run Time Library

#define ll_2_sections(decl, f1, f2)          \
  {                                          \
    decl;                                    \
    ll_barrier();                            \
    {                                        \
      int ll_oldname = ll_NAME,              \
          ll_oldnp   = ll_NUMPROCESSORS;     \
      if (ll_DIVIDE(2)) { f2; }              \
      else              { f1; }              \
      ll_REJOIN(ll_oldname, ll_oldnp);       \
    }                                        \
  }
48.
void ll_barrier() {
  MASTER {
    int i, all_arrived;
    do {
      all_arrived = 1;
      for (i = 1; i < ll_NUMPROCESSORS; i++)
        if (!(ll_ARRIVED[i]))
          all_arrived = 0;
    } while (!all_arrived);
    for (i = 1; i < ll_NUMPROCESSORS; i++)
      ll_ARRIVED[i] = 0;
  }
  SLAVE {
    ll_ARRIVED[ll_NAME] = 1;
    while (ll_ARRIVED[ll_NAME])
      ;
  }
}
49.
int ll_DIVIDE(int ngroups) {
  int ll_group;
  double ll_first;
  double ll_last;
  ll_group = (int)floor(((double)(ngroups * ll_NAME)) / ((double)ll_NUMPROCESSORS));
  ll_first = (int)ceil(((double)(ll_NUMPROCESSORS * ll_group)) / ((double)ngroups));
  ll_last  = (int)ceil((((double)(ll_NUMPROCESSORS * (ll_group + 1))) / ((double)ngroups)) - 1);
  ll_NUMPROCESSORS = ll_last - ll_first + 1;
  ll_NAME = ll_NAME - ll_first;
  return ll_group;
}

void ll_REJOIN(int old_name, int old_np) {
  ll_NAME = old_name;
  ll_NUMPROCESSORS = old_np;
}

(A bit) more overhead if weights are provided!
50.
#define ll_2_first_private(ll_var1, ll_var2)                    \
  {                                                             \
    void **q;                                                   \
    MASTER {                                                    \
      ll_FIRST_PRIVATE = (void **)malloc(2 * sizeof(void *));   \
      q = ll_FIRST_PRIVATE;                                     \
      *q = (void *)&(ll_var1);                                  \
      q++;                                                      \
      *q = (void *)&(ll_var2);                                  \
    }                                                           \
    ll_barrier();                                               \
    SLAVE {                                                     \
      q = (ll_FIRST_PRIVATE - ll_NAME);                         \
      memcpy((void *)&(ll_var1), *q, sizeof(ll_var1));          \
      q++;                                                      \
      memcpy((void *)&(ll_var2), *q, sizeof(ll_var2));          \
    }                                                           \
    ll_barrier();                                               \
    MASTER free(ll_FIRST_PRIVATE);                              \
  }
51. FAST FOURIER TRANSFORM
52.
void llcFFT(Complex *A, Complex *a, Complex *W,
            unsigned N, unsigned stride, Complex *D) {
  Complex *B, *C;
  Complex Aux, *pW;
  unsigned i, n;
  if (N == 1) {
    #pragma ll MASTER
    {
      A[0].re = a[0].re;
      A[0].im = a[0].im;
    }
  } else {
    n = (N >> 1);
    #pragma ll sections
    {
      #pragma ll section
      llcFFT(D, a, W, n, stride << 1, A);
      #pragma ll section
      llcFFT(D + n, a + stride, W, n, stride << 1, A + n);
    }
    B = D;
    C = D + n;
    pW = W;
    #pragma ll for private(i)
    for (i = 0; i < n; i++) {
      Aux.re = pW->re * C[i].re - pW->im * C[i].im;
      Aux.im = pW->re * C[i].im + pW->im * C[i].re;
      A[i].re = B[i].re + Aux.re;
      A[i].im = B[i].im + Aux.im;
      A[i + n].re = B[i].re - Aux.re;
      A[i + n].im = B[i].im - Aux.im;
      pW += stride;
    }
  }
}
53. (No Transcript)
54.
void kaiFFTdp(Complex *A, ..., int level) {
  ...
  if (level > CRITICAL_LEVEL)
    seqDandCFFT(A, a, W, N, stride, D);
  else if (N == 1) {
    A[0].re = a[0].re;
    A[0].im = a[0].im;
  } else {
    n = (N >> 1);
    B = D;
    C = D + n;
    #pragma omp taskq
    {
      #pragma omp task
      kaiFFTdp(B, a, W, n, stride << 1, A, level + 1);
      #pragma omp task
      kaiFFTdp(C, a + stride, W, n, stride << 1, A + n, level + 1);
    }
  }
}
55.
#pragma omp taskq private(i, Aux, pW, j, start, end)
{
  Workers = omp_get_num_threads();
  CHUNK_SIZE = n / Workers + ((n % Workers) > 0);
  for (j = 0; j < n; j += CHUNK_SIZE) {
    start = j;
    end = min(start + CHUNK_SIZE, n);
    pW = W + start * stride;
    #pragma omp task
    {
      for (i = start; i < end; i++) {
        ...
      }
    } /* task */
  }
} /* taskq */
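The chunking expression above, n / Workers + ((n % Workers) > 0), is just a ceiling division. This small sketch (our own names) checks that the resulting chunks cover [0, n) exactly:

```c
#include <assert.h>

/* Ceiling division: the smallest chunk size such that `workers`
   chunks of this size cover all n iterations. */
static int chunk_size(int n, int workers) {
    return n / workers + ((n % workers) > 0);
}
```

Each task then gets the half-open range [start, min(start + CHUNK_SIZE, n)), with the min() clipping the last, possibly shorter, chunk.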
56. (No Transcript)
57. OpenMP Distributed Memory
58. OpenMP Distributed Memory
- Barbara Chapman, Piyush Mehrotra and Hans Zima. Enhancing OpenMP With Features for Locality Control. Technical report TR99-02, Inst. for Software Technology and Parallel Systems, U. Vienna, Feb. 1999. http://www.par.univie.ac.at
- C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. IEEE Computer, 29(2), pp. 18-28, February 1996. http://www.cs.rice.edu/willy/TreadMarks/overview.html
59.
#pragma ll parallel for
#pragma ll result(r[i], s[i])
for (i = 1; i < 3; i++) {
  ...
  #pragma ll for
  #pragma ll result(r[j], s[j])
  for (j = 0; j < i; j++)
    r[j] = function_j(i, j, s[j], ...);
  r[i] = function_i(i, s[i], ...);
}
60.
(same code as slide 59)
...
61.
(figure: three workers)
62.
#pragma ll parallel for
#pragma ll result(r[x], s[x])
for (x = 3; x < 5; x++) {
  #pragma ll for
  #pragma ll result(r[y], s[y])
  for (y = x; y < x + 1; y++)
    r[y] = ...;
  r[x] = ...;
}
63. One Thread is One Set of Processors Model (OTOSP)

(figure: processors 0..5 grouped into subsets, with weights w1 = 20, w2 = 10)

- Each subset of processors corresponds to one and only one thread.
- A group is a family of subsets.
- A team is a family of groups.
64. (No Transcript)
65. (No Transcript)
66.
forall (i = 1; i < 3; i++) result(r[i], s[i]) {
  ...
  forall (j = 0; j < i; j++) result(r[j], s[j]) {
    int a, b;
    ...
    if (i % 2 == 1) {
      if (j % 2 == 0)
        send(j, j + 1, &a, sizeof(int));
      else
        receive(j, j - 1, &b, sizeof(int));
    }
  }
}
67. Conclusions and Open Questions (Shared Memory)
- Scalability of shared memory machines?
- Performance prediction tools and models?
- OpenMP for distributed memory machines?

Pointer:
- http://nereida.deioc.ull.es/html/openmp.html