MiniSymposium: Adaptive Algorithms for Scientific Computing - PowerPoint PPT Presentation

Slides: 43
Provided by: JeanLou85

Transcript and Presenter's Notes

Title: MiniSymposium Adaptive Algorithms for Scientific Computing


2
Mini Symposium: Adaptive Algorithms for Scientific Computing
  • Adaptive, hybrid, oblivious: what do these terms mean?
  • Taxonomy of autonomic computing [Ganek & Corbi 2003]:
    self-configuring / self-healing / self-optimising / self-protecting
  • Objective: towards an analysis based on algorithm performance
  • 9h45 Adaptive algorithms - Theory and applications. Jean-Louis Roch et al., AHA Team, INRIA-CNRS Grenoble, France
  • 10h15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA
  • 10h45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, Gudula Rünger, U. Bayreuth, Germany
  • 11h15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA

3
Adaptive algorithms: Theory and applications
  • Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
  • IMAG-INRIA Workgroup on Adaptive and Hybrid Algorithms, Grenoble, France

Contents: I. Some criteria to analyze adaptive algorithms. II. Work-stealing and adaptive parallel algorithms. III. Adaptive parallel prefix computation.
4
Why adaptive algorithms, and how?
Resource availability is versatile
Input data vary
Measures on resources
Measures on data
Adaptation to improve performance:
  • Scheduling: partitioning, load-balancing, work-stealing
  • Calibration: tuning parameters (block size w.r.t. cache, choice of instructions, ...), priority management

5
Modeling a hybrid algorithm
  • Several algorithms solve the same problem f
  • e.g. algo_f1, algo_f2(block size), ..., algo_fk
  • each algo_fi possibly recursive:

algo_fi(n, ...) { ... f(n - 1, ...); ... f(n / 2, ...); ... }

  • E.g. practical hybrids:
  • Atlas, Goto, FFPack
  • FFTW
  • cache-oblivious B-tree
  • any parallel program with scheduling support: Cilk, Athapascan/Kaapi, Nesl, TLib, ...

6
  • How to manage the overhead due to choices?
  • Classification 1/2:
  • Simple hybrid iff O(1) choices (e.g. block size in Atlas)
  • Baroque hybrid iff an unbounded number of choices (e.g. recursive splitting factors in FFTW);
    choices are either dynamic or pre-computed based on input properties

7
  • Choices may or may not be based on architecture parameters.
  • Classification 2/2: a hybrid is
  • Oblivious: control flow depends neither on static properties of the resources nor on the input (e.g. cache-oblivious algorithms [Bender])
  • Tuned: strategic choices are based on static parameters (e.g. block size w.r.t. cache, granularity, ...)
  • Engineered tuned or self-tuned: e.g. ATLAS and GOTO libraries, FFTW, LinBox/FFLAS [Saunders et al.]
  • Adaptive: self-configuration of the algorithm, dynamic;
    based on input properties or resource circumstances discovered at run-time (e.g. idle processors, data properties); e.g. TLib [Rauber & Rünger]

8
Examples
  • BLAS libraries:
  • Atlas: simple tuned (self-tuned)
  • Goto: simple engineered (engineered tuned)
  • LinBox / FFLAS: simple self-tuned, adaptive [Saunders et al.]
  • FFTW:
  • halving factor: baroque tuned
  • stopping criterion: simple tuned
  • Parallel algorithms and scheduling:
  • choice of parallel degree (e.g. TLib [Rauber & Rünger])
  • work-stealing schedule: baroque hybrid

9
Adaptive algorithms: Theory and applications
  • Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
  • INRIA-CNRS Project on Adaptive and Hybrid Algorithms, Grenoble, France

Contents: I. Some criteria to analyze adaptive algorithms. II. Work-stealing and adaptive parallel algorithms. III. Adaptive parallel prefix computation.
10
Work-stealing (1/2)
Work W1 = total number of operations performed
Depth W∞ = number of operations on a critical path (= parallel time on an unbounded number of resources)
  • Work-stealing: a greedy schedule, but distributed and randomized
  • Each processor manages locally the tasks it creates
  • When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen)

11
Work-stealing (2/2)
Work W1 = total number of operations performed
Depth W∞ = number of operations on a critical path (= parallel time on an unbounded number of resources)
  • Interest -> suited to heterogeneous architectures, with a slight modification [Bender-Rabin 02]: with good probability, near-optimal schedule on p processors with average speed π_ave: Tp < W1/(p·π_ave) + O(W∞/π_ave)
  • NB: number of successful steals (= task migrations) < p·W∞ [Blumofe 98, Narlikar 01, Bender 02]
  • Implementation: work-first principle [Cilk, Kaapi]
  • Local parallelism is implemented by sequential function call
  • Restrictions to ensure validity of the default sequential schedule:
    - series-parallel structure [Cilk]
    - reference order [Kaapi]
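The stealing policy described in the bullets above (owners execute their newest task depth-first; an idle processor steals the oldest ready task from a random non-idle victim) can be illustrated with a toy round-based simulation. This is a minimal sketch: `work_steal_simulate` and its synchronous round loop are illustrative assumptions, not the actual Cilk or Kaapi runtime.

```python
import random
from collections import deque

def work_steal_simulate(tasks, p, seed=0):
    """Toy simulation of the work-stealing policy: each of p processors
    keeps its own deque; owners pop the newest task (depth-first),
    idle processors steal the *oldest* ready task from a randomly
    chosen non-idle victim (breadth-first)."""
    rng = random.Random(seed)
    deques = [deque() for _ in range(p)]
    deques[0].extend(tasks)            # all work starts on processor 0
    done, steals = [], 0
    while any(deques):
        for i in range(p):
            if deques[i]:
                done.append(deques[i].pop())          # local: newest first
            else:
                victims = [j for j in range(p) if deques[j]]
                if victims:
                    v = rng.choice(victims)
                    deques[i].append(deques[v].popleft())  # steal oldest
                    steals += 1
    return done, steals
```

Each task is executed exactly once; the steal counter makes it easy to check experimentally that migrations stay rare when one deque holds long runs of local work.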

12
Work-stealing and adaptability
  • Work-stealing allocates processors to tasks transparently to the application, with provable performance
  • Supports the addition of new resources
  • Supports resilience of resources and fault-tolerance (crash faults, network failures, ...)
  • Checkpoint/restart mechanisms with provable performance [Porch, Kaapi, ...]
  • Baroque hybrid adaptation: there is an implicit dynamic choice between two algorithms:
  • a sequential (local) algorithm: depth-first (default choice)
  • a parallel algorithm: breadth-first
  • The choice is performed at runtime, depending on resource idleness
  • Well suited to applications where a fine-grain parallel algorithm is also a good sequential algorithm [Cilk]:
  • parallel divide-and-conquer computations
  • tree searching, Branch & X, ...
  • -> suited when both the sequential and the parallel algorithm perform (almost) the same number of operations

13
But often parallelism has a cost!
  • Solution: mix a sequential and a parallel algorithm
  • Basic technique:
  • run the parallel algorithm down to a certain "grain", then use the sequential one
  • Problem: W∞ increases, and so do the number of migrations and the inefficiency
  • Work-preserving speed-up [Bini-Pan 94], cascading [Jaja 92]: careful interplay of both algorithms to build one with both W∞ small and W1 = O(Wseq)
  • divide the sequential algorithm into blocks
  • each block is computed with the (non-optimal) parallel algorithm
  • Drawback: sequential at coarse grain and parallel at fine grain
  • Adaptive granularity: the dual approach:
  • parallelism is extracted at run-time from any sequential task
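The "basic technique" above (parallel recursion down to a fixed grain, then the sequential algorithm) can be sketched on a toy reduction. A minimal sketch: `sum_fixed_grain` and the grain value are illustrative assumptions, and the recursive halving stands in for subtasks that a runtime would actually spawn in parallel.

```python
def sum_fixed_grain(a, lo, hi, grain=1024):
    """Recursive (would-be parallel) splitting down to a fixed grain,
    then a plain sequential loop: the static-cutoff scheme from the
    slide, applied to summing a[lo:hi]."""
    if hi - lo <= grain:                 # sequential leaf
        s = 0
        for i in range(lo, hi):
            s += a[i]
        return s
    mid = (lo + hi) // 2                 # these two calls would be spawned
    return (sum_fixed_grain(a, lo, mid, grain)
            + sum_fixed_grain(a, mid, hi, grain))
```

The drawback named on the slide is visible in the structure: the grain is fixed before execution, so it cannot react to how many processors are actually idle.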

14
Self-adaptive grain algorithm
  • Based on the work-first principle: always execute a sequential algorithm, to reduce parallelism overhead
  • => use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation
  • Hypothesis: two algorithms:
  • 1 sequential: SeqCompute
  • 1 parallel: LastPartComputation; at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
  • Examples: iterated product [Vernizzi 05], gzip / compression [Kerfali 04], MPEG-4 / H264 [Bernard 06], prefix computation [Traore 06]
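The SeqCompute / LastPartComputation coupling can be sketched on a running sum. This is a sketch under stated assumptions: thief arrivals are simulated by a set of loop indices, and the extracted intervals are summed at the end rather than by concurrent work-stealers.

```python
def seq_with_extraction(a, thief_arrivals):
    """One process runs the sequential loop (SeqCompute); whenever a
    thief 'arrives' (simulated: the current index is in
    thief_arrivals), the *last part* of the remaining iterations is
    extracted (LastPartComputation) and handed away."""
    lo, hi = 0, len(a)
    total = 0
    extracted = []
    i = lo
    while i < hi:
        if i in thief_arrivals and hi - i > 1:
            mid = (i + hi) // 2           # give away the last half of
            extracted.append((mid, hi))   # the remaining work
            hi = mid
        total += a[i]                     # SeqCompute: one sequential step
        i += 1
    # In reality each thief runs the same scheme on its interval;
    # here we just finish the extracted parts sequentially.
    for (l, h) in extracted:
        total += sum(a[l:h])
    return total
```

Note that when no thief ever arrives, the execution is exactly the sequential algorithm, which is the point of the work-first principle.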

15
Adaptive algorithms: Theory and applications
  • Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
  • INRIA-CNRS Project on Adaptive and Hybrid Algorithms, Grenoble, France

Contents: I. Some criteria to analyze adaptive algorithms. II. Work-stealing and adaptive parallel algorithms. III. Adaptive parallel prefix computation.
16
Prefix computation: an example where parallelism always costs
π1 = a0 ⊕ a1, π2 = a0 ⊕ a1 ⊕ a2, ..., πn = a0 ⊕ a1 ⊕ ... ⊕ an
  • Sequential algorithm: for (i = 1; i <= n; i++) π[i] = π[i-1] ⊕ a[i]; here W1 = W∞ = n
  • Parallel algorithm [Ladner-Fischer]: W∞ = 2·log n but W1 = 2·n, twice as expensive as the sequential one
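The work trade-off on this slide can be checked directly. A minimal sketch: `prefix_seq` is the sequential scan, and `prefix_par` is a Ladner-Fischer-style recursion (pair adjacent elements, recurse on the pairwise sums, fill in the rest) with an operation counter; the function names and the counting convention are mine.

```python
def prefix_seq(a):
    """Sequential prefix: one combine per element, W1 = W_inf ~ n."""
    out = [a[0]]
    for x in a[1:]:
        out.append(out[-1] + x)
    return out

def prefix_par(a):
    """Ladner-Fischer-style recursive prefix: sum adjacent pairs,
    recurse on the n/2 pairwise sums, then fill in the remaining
    positions.  Depth ~ 2 log n, but total work ~ 2n: twice the
    sequential cost, as the slide notes.  Returns (prefixes, ops)."""
    if len(a) == 1:
        return list(a), 0
    pairs = [a[i] + a[i + 1] for i in range(0, len(a) - 1, 2)]
    sub, ops = prefix_par(pairs)
    ops += len(pairs)                      # cost of the pairing step
    out = []
    for i, x in enumerate(a):
        if i == 0:
            out.append(x)
        elif i % 2 == 1:
            out.append(sub[i // 2])        # comes straight from the recursion
        else:
            out.append(sub[i // 2 - 1] + x)  # one extra combine per even index
            ops += 1
    return out, ops
```

For n = 8 the counter reports 11 operations against 7 for the sequential scan, matching the roughly-2x overhead claimed above.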
17
Adaptive prefix computation
  • Any (parallel) prefix performs at least W1 ≥ 2·n - W∞ operations
  • Strict lower bound on p identical processors: Tp ≥ 2n/(p+1), reached by a block algorithm with pipeline [Nicolau et al. 2000]
  • Application of the adaptive scheme:
  • one process performs the main sequential computation
  • the other, work-stealing processes compute parallel "segmented" prefixes
  • Near-optimal performance on processors with changing speeds: Tp < 2n/((p+1)·π_ave) + O(log n / π_ave), to be compared with the lower bound
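The bounds on this slide, restated together in LaTeX (same notation as above, with π_ave the average processor speed):

```latex
% Work lower bound for any parallel prefix of depth W_\infty
W_1 \;\ge\; 2n - W_\infty
% Strict time lower bound on p identical processors
T_p \;\ge\; \frac{2n}{p+1}
% Adaptive prefix on p processors with average speed \pi_{ave}
T_p \;<\; \frac{2n}{(p+1)\,\pi_{ave}} \;+\; O\!\left(\frac{\log n}{\pi_{ave}}\right)
```

The last bound matches the second one up to the lower-order log term, which is the sense in which the adaptive algorithm is near-optimal.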
18
Scheme of the proof
  • Dynamic coupling of two algorithms that complete simultaneously:
  • the sequential one: (optimal) number of operations S
  • the parallel one: performs X operations
  • dynamic splitting is always possible down to the finest grain, BUT execution stays locally sequential
  • scheduled by work-stealing on p-1 processors
  • small critical path: O(log X)
  • each non-constant-time task can be split (variable speeds)
  • Analysis:
  • the algorithmic scheme ensures Ts = Tp + O(log X) => enables bounding the total number X of operations performed, and hence the overhead of parallelism (S + X minus the optimal number of operations)
  • comparison with the lower bound on the number of operations

19
Adaptive Prefix on 3 processors
[Animation over slides 19-24: the prefix values π1 ... π12 appear step by step as the sequential process and the two work-stealers compute them.]
Implicit critical path on the sequential process
25
Adaptive prefix: some experiments (joint work with Daouda Traore)
Prefix of 10,000 elements on an 8-processor SMP (IA64 / Linux), with and without external load.
[Plots: time (s) versus number of processors]
  • Multi-user context: Adaptive is the fastest, with a 15% benefit over a static-grain algorithm
  • Single-user context: Adaptive is equivalent to
  • the sequential algorithm on 1 processor
  • the optimal 2-processor parallel algorithm on 2 processors
  • ...
  • the optimal 8-processor parallel algorithm on 8 processors

26
The Prefix race: sequential / parallel fixed / adaptive
On each of the 10 executions, adaptive completes first.
27
With double sum (r_i = r_{i-1} + x_i)
Finest grain limited to 1 page = 16384 bytes = 2048 doubles
Single user; processors with variable speeds
Remark, for n = 4,096,000 doubles: pure sequential 0.20 s; with a minimal grain of 100 doubles, 0.26 s on 1 processor and 0.175 s on 2 processors (close to the lower bound)
28
E.g. Triangular system solving
1/ x1 = b1 / a11
2/ For k = 2..n: bk = bk - ak1 · x1
[Figure: the triangular matrix A; a system of dimension n reduces to a system of dimension n-1]
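Iterating the two steps above gives forward substitution. A minimal sketch in Python (the function name is mine; the slide's recursion on a system of dimension n-1 becomes the outer loop):

```python
def solve_lower_triangular(A, b):
    """Forward substitution following the slide's recursion:
    compute x1 = b1 / a11, update the remaining right-hand side
    (bk -= ak1 * x1), which leaves a triangular system of
    dimension n-1, and repeat."""
    n = len(b)
    b = list(b)                      # keep the caller's b intact
    x = []
    for j in range(n):
        xj = b[j] / A[j][j]          # step 1: x1 = b1 / a11
        x.append(xj)
        for k in range(j + 1, n):    # step 2: bk = bk - ak1 * x1
            b[k] -= A[k][j] * xj
    return x
```

Each outer iteration peels off one unknown, so the whole solve costs about n²/2 multiply-subtract operations.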
30
Conclusion
  • Adaptive: which choices, and how to choose?
  • Illustration: adaptive parallel prefix based on work-stealing
  • a self-tuned baroque hybrid: O(p·log n) choices
  • achieves near-optimal performance
  • processor-oblivious
  • A generic adaptive scheme to implement parallel algorithms with provable performance

31
Mini Symposium: Adaptive Algorithms for Scientific Computing
  • Adaptive, hybrid, oblivious: what do these terms mean?
  • Taxonomy of autonomic computing [Ganek & Corbi 2003]:
    self-configuring / self-healing / self-optimising / self-protecting
  • Objective: towards an analysis based on algorithm performance
  • 9h45 Adaptive algorithms - Theory and applications. Jean-Louis Roch et al., AHA Team, INRIA-CNRS Grenoble, France
  • 10h15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA
  • 10h45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, U. Bayreuth, Germany
  • 11h15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA

32
Questions?
33
Some examples (1/2)
  • Adaptive algorithms used empirically and theoretically:
  • Atlas [2001]: dense linear algebra library
  • instruction set and instruction schedule
  • self-calibration of the block size at installation time on the machine
  • FFTW (1998, ...): FFT(n) -> p FFT(q) and q FFT(p), with n = p·q
  • for any n, for any recursive call FFT(n), pre-compute the best value for p
  • pre-computation of the optimal split for the vector size n on the machine
  • Cache-oblivious B-trees:
  • recursive block splitting to minimize page faults
  • self-adaptation to the memory hierarchy
  • Work-stealing (Cilk (1998, ...), (2000, ...)): recursive parallelism
  • choice between a sequential depth-first schedule and a breadth-first schedule
  • "work-first principle" to optimize the local sequential execution and put the overhead on the rare steals from idle processors
  • implicitly adaptive
34
Some examples (2/2)
  • Moldable tasks: bi-criteria scheduling with guarantees [Trystram et al. 2004]
  • recursive alternating combination of an approximation algorithm for each criterion
  • self-adaptation with guaranteed performance for each criterion
  • "Cache-oblivious" algorithms [Bender et al. 2004]
  • recursive block splitting that minimizes page faults
  • self-adaptation to the memory hierarchy (B-tree)
  • "Processor-oblivious" algorithms [Roch et al. 2005]
  • recursive combination of two algorithms, one sequential and one parallel
  • self-adaptation to resource idleness
35
Best case: the parallel algorithm is efficient
  • W∞ is small and W1 = Wseq
  • the parallel algorithm is an optimal sequential one
  • examples: parallel divide-and-conquer algorithms
  • Implementation: work-first principle - no overhead for local execution of tasks
  • Examples:
  • Cilk: THE protocol
  • Kaapi: compare-and-swap only
36
Experimentation: knary benchmark

procs | Speed-up
    8 |  7.83
   16 | 15.6
   32 | 30.9
   64 | 59.2
  100 | 90.1

Distributed architecture: iCluster, Athapascan
SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan
Ts = 2397 s ≈ T1 = 2435 s
37
High potential degree of parallelism
In "practice": coarse granularity, splitting into p parts for p resources. Drawback: heterogeneous architectures and dynamic behavior, with π_i(t) the speed of processor i at time t.
In "theory": fine granularity, maximal parallelism. Drawback: overhead of task management.
How to choose/adapt the granularity?
38
How to obtain an efficient fine-grain algorithm?
  • Hypotheses for efficiency of work-stealing:
  • the parallel algorithm is "work-optimal"
  • T∞ is very small (recursive parallelism)
  • Problem:
  • fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm:
  • overhead due to parallelism creation and synchronization
  • but also arithmetic overhead
39
Self-adaptive grain algorithms
  • Recursive computations with local sequential computation
  • Special case:
  • recursive extraction of parallelism when a resource becomes idle
  • but local execution of a sequential algorithm
  • Hypothesis: two algorithms:
  • 1 sequential: SeqCompute
  • 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
  • Examples: iterated product [Vernizzi], gzip / compression [Kerfali], MPEG-4 / H264 [Bernard], prefix computation [Traore]

40
Adaptive Prefix versus optimal on identical processors
41
Illustration: adaptive parallel prefix
  • Adaptive parallel computing on non-uniform and shared resources
  • Example: adaptive prefix computation

42
Indeed, parallelism often costs... e.g. prefix computation: P1 = a0 ⊕ a1, P2 = a0 ⊕ a1 ⊕ a2, ..., Pn = a0 ⊕ a1 ⊕ ... ⊕ an
  • Sequential algorithm: for (i = 1; i <= n; i++) P[i] = P[i-1] ⊕ a[i]; W1 = n
  • Parallel algorithm [Ladner-Fischer]: W∞ = 2·log n but W1 = 2·n, twice as expensive as the sequential one