Title: MOAIS - Multi-programmation et Ordonnancement pour les Applications Interactives de Simulation (Programming and Scheduling Design of Interactive Simulation Applications on Distributed Resources)
1. Adaptive-grain parallel algorithms: a few examples
Jean-Louis.Roch_at_imag.fr
MOAIS project (www-id.imag.fr/MOAIS)
Laboratoire ID-IMAG (CNRS-INRIA, INPG-UJF)
2. MOAIS: Multi-programmation et Ordonnancement pour les Applications Interactives de Simulation (Programming and Scheduling Design of Interactive Simulation Applications on Distributed Resources)
3. ID Research activities
- Adaptive middleware
  - Resource management and scheduling systems
  - Resource brokering based on prediction tools
  - 2nd-generation Open Grid Service Architecture with a P2P approach
- Operational aspects of P2P systems
  - Deployability of P2P services (memberships)
- Network operating systems
  - Open-source Grid-aware OS (extensions of Linux)
- Distributed algorithms
  - Dependable and adaptive
- Programming models and languages
  - High-performance component models
  - Lightweight component platforms
  - QoS-aware self-organizing component platforms with dynamic reconfiguration
- Automatic exploitation of coarse-grain algorithms
  - Communication models
  - Generic framework
  - Computational models
- Novel algorithms and applications
  - Need for grid-aware algorithms and applications
  - Large-scale data management (shared objects, persistence, coherency)
- Security / Accountability
  - P2P services in an unfriendly world
- Tools
  - Performance analysis and prediction
- Application testbeds
  - Bioinformatics
  - Engineering applications
4. ID Research activities
- Two INRIA projects:
- MOAIS (contact: Jean-Louis.Roch_at_imag.fr)
  - Programming and scheduling
  - Adaptive and interactive applications
  - 2 full-time researchers (INRIA)
  - 4 assistant professors (3 INPG, 1 UJF)
  - 14 PhD students
- MESCAL (contact: Bruno.Gaujal_at_inrialpes.fr)
  - Dynamic resource management
  - Performance evaluation and dimensioning
  - 2 full-time researchers (INRIA, CNRS)
  - 6 assistant professors (2 INPG, 4 UJF)
  - 13 PhD students
5. Objective
- Programming of applications where performance is a matter of resources: benefit from more resources and adapt to fewer
  - e.g. a global computing platform (P2P)
- Application code independent from the resources, and adaptive
- Target applications: interactive simulation, virtual observatory
6. GRIMAGE platform (B. Raffin)
- Commodity components:
  - 2003: 11 PCs, 8 projectors and 4 cameras; first demo 12/03
  - 2005: 30 PCs, 16 projectors and 20 cameras
- A display wall:
  - Surface: 2 x 2.70 m
  - Resolution: 4106 x 3172 pixels
  - Very bright: usable in daylight
7. Video (J. Allard, C. Menier)
8. MOAIS: adapting parallelism by scheduling
9. How to adapt the application?
- By minimizing communications
  - e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004]: adaptive granularity
- By controlling latency (interactivity constraints)
  - FlowVR [Allard, Menier, Raffin]: overhead
- By managing node failures and resilience
  - checkpoint/restart, checkers - FlowCert [Jafar, Krings, Leprevost, Roch, Varrette]
- By adapting granularity
  - malleable tasks [Trystram, Mounié]
  - dataflow cactus-stack: Athapascan/Kaapi [Gautier]
  - recursive parallelism by work-stealing [Blumofe-Leiserson 98, Cilk, Athapascan, ...]
- Self-adaptive grain algorithms
  - dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2004]
10. Adaptive-grain parallel algorithms: a few examples
- Scheduling of fine-grain parallel programs: work-stealing and efficiency
- Adaptive-grain algorithms: principle of a dynamic cascade
- Examples
11. High potential degree of parallelism
- In practice: coarse granularity, splitting into p resources. Drawback: heterogeneous, dynamic architectures.
- In theory: fine granularity, maximal parallelism. Drawback: overhead of task management.
- How to choose/adapt the granularity?
12. Parallelism and efficiency
- Work: sequential time, T1 = #operations
- Depth: parallel time on an unbounded number of resources, T∞ = #operations on a critical path
- Problem: how to adapt the potential parallelism to the resources?
- Scheduling = control of the policy (realisation) + an efficient policy (close to optimal)
  - Control: difficult in general (coarse grain), but easy if T∞ is small (fine grain)
  - Policy: greedy scheduling achieves Tp <= T1/p + T∞ [Graham 69]; expensive in general (fine grain), but small overhead with coarse grain
- => goal: keep T∞ small while keeping coarse-grain control
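The greedy bound quoted on this slide can be recovered by the standard two-line argument, sketched here for completeness:

```latex
% Greedy (list) scheduling bound [Graham 69].
% At every time step, either all p processors are busy, or at least one
% ready operation on the current critical path is being executed:
% - at most T_1 / p "full" steps (each consumes p units of work),
% - at most T_\infty "incomplete" steps (each shortens the remaining
%   critical path by at least one operation).
T_p \;\le\; \frac{T_1}{p} + T_\infty
```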
13. Work-stealing: scheduling of a parallel recursive fine-grain algorithm
- Work-stealing scheduling:
  - an idle processor steals the oldest ready task
  - Interest: #successful steals < p.T∞ [Blumofe 98, Narlikar 01, ...]; => suited to heterogeneous architectures [Bender-Rabin 03, ...]
- Hypotheses for efficient parallel executions:
  - the parallel algorithm is work-optimal
  - T∞ is very small (recursive parallelism)
  - a sequential execution of the parallel algorithm is valid (e.g. search trees, Branch & Bound, ...)
- Implementation: work-first principle [Multilisp, Cilk, ...]
  - overhead of task creation only upon a steal request: sequential degeneration of the parallel algorithm
  - cactus-stack management
14. Implementation of work-stealing
- Hypothesis: a sequential schedule is valid; non-preemptive execution of ready tasks
- Stack: f1() { ... fork f2(); ... }
- Interest: statically fine grain, but dynamic control
- Drawback: possible overhead of the parallel algorithm (e.g. prefix computation)
15. Experimentation: knary benchmark

  procs   speed-up
      8       7.83
     16       15.6
     32       30.9
     64       59.2
    100       90.1

- Distributed architecture: iCluster (Athapascan)
- SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan
- Ts = 2397 s vs. T1 = 2435 s
16. How to obtain an efficient fine-grain algorithm?
- Hypotheses for the efficiency of work-stealing:
  - the parallel algorithm is work-optimal
  - T∞ is very small (recursive parallelism)
- Problem:
  - fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm:
  - overhead due to parallelism creation and synchronization,
  - but also arithmetic overhead
17. Prefix computation (n/2)
18. Adaptive-grain parallel algorithms: a few examples
- Scheduling of fine-grain parallel programs: work-stealing and efficiency
- Adaptive-grain algorithms: principle of a dynamic cascade
- Examples
19. Self-adaptive grain algorithm
- Principle: save the parallelism overhead by privileging a sequential algorithm
  - => use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation
- Hypothesis: two algorithms are available:
  - 1 sequential: SeqCompute
  - 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
20. Generic self-adaptive grain algorithm
21. Illustration: f(i), i = 1..100
- LastPart(w): W = [2..100]
- SeqComp(w) on CPU A: f(1)

22. Illustration: f(i), i = 1..100
- LastPart(w): W = [3..100]
- SeqComp(w) on CPU A: f(1); f(2)

23. Illustration: f(i), i = 1..100
- LastPart(w) on CPU B: W = [3..100]
- SeqComp(w) on CPU A: f(1); f(2)

24. Illustration: f(i), i = 1..100
- LastPart(w) on CPU B splits the remaining work:
- LastPart(w): W = [3..51]; LastPart(w): W = [52..100]
- SeqComp(w) on CPU A: f(1); f(2)
- SeqComp(w)

25. Illustration: f(i), i = 1..100
- LastPart(w): W = [3..51]; LastPart(w): W = [52..100]
- SeqComp(w) on CPU A: f(1); f(2)
- SeqComp(w)

26. Illustration: f(i), i = 1..100
- LastPart(w): W = [3..51]; LastPart(w): W = [53..100]
- SeqComp(w) on CPU A: f(1); f(2)
- SeqComp(w) on CPU B: f(52)
27. Cascading a parallel and a sequential algorithm
- In general, two different algorithms may be used:
  - sequential recursive algorithm: Ts operations; algo(n) { ... algo(n-1) ... }
  - parallel algorithm: T∞ small, but T1 >> Ts
- Work-preserving speed-up [Bini-Pan 94], cascading technique [Jaja 92]: careful interplay of both algorithms to build one with both T∞ small and T1 = O(Ts). But fine grain:
  - divide the sequential algorithm into blocks
  - each block is computed with the (non-optimal) parallel algorithm
- Adaptive grain: dual approach, parallelism is extracted from any sequential task
28. E.g. triangular system solving
1/ x1 = b1 / a11
2/ For k = 2..n: bk = bk - ak1 * x1
=> a system of dimension n reduces to a system of dimension n-1
29. E.g. triangular system solving (cont.)
30. Adaptive-grain parallel algorithms: a few examples
- Scheduling of fine-grain parallel programs: work-stealing and efficiency
- Adaptive-grain algorithms: principle of a dynamic cascade
- Examples:
  - iterated product, prefix
  - gzip compression
  - triangular system inversion
  - 3D vision / octree computation
31. Iterated product: sequential, parallel, adaptive (Davide Vernizzi)
- Sequential:
  - input: array of n values
  - output: res
  - C/C++ code:
      for (i = 0; i < n; i++)
        res += atoi(x[i]);
- Parallel algorithm:
  - recursive computation by blocks (binary tree with merge)
  - block size: pagesize
  - code: Kaapi Athapascan API
32. Variant: sum of pages
- Input: set of n pages; each page is an array of values
- Output: one page where each element is the sum of the elements with the same index in the input pages
- C/C++ code:
    for (i = 0; i < n; i++)
      for (j = 0; j < pageSize; j++)
        res[j] += f(pages[i][j]);
33. Demonstration on ensibull
Script:
    vernizzd_at_ensibull:demo> more go-tout.sh
    #!/bin/sh
    ./spg /tmp/data
    ./ppg /tmp/data 1 --a1 -thread.poolsize 3
    ./apg /tmp/data 1 --a1 -thread.poolsize 3
Result:
    vernizzd_at_ensibull:demo> ./go-tout.sh
    Page size: 4096
    ADAPTIVE (3 procs):   res = -2.048e07, time = 0.408178 s, 54 threads created
    PARALLEL (3 procs):   res = -2.048e07, time = 0.964014 s, 7497 forks
    SEQUENTIAL (1 proc):  res = -2.048e07, time = 1.15204 s
34. Where does the difference come from? The program sources
- Source code for the sum of pages:
  - parallel: binary tree
  - adaptive: coupling of the sequential code with Fork<LastPartComp>
  - LastPartComp: (recursive) generation of 3 tasks
35. Parallel algorithm

struct Iterated {
  void operator() (a1::Shared_w<Page> res, int start, int stop) {
    if ((stop - start) < 2) {
      // If the max number of pages is reached: sequential algorithm
      Page resLocal(pageSize);
      IteratedSeq(start, resLocal);
      res.write(resLocal);
    } else {
      // Otherwise: split the range and fork
      int half = (start + stop) / 2;
      a1::Shared<Page> res1;                     // first thread result
      a1::Shared<Page> res2;                     // second thread result
      a1::Fork<Iterated>() (res1, start, half);  // first thread
      a1::Fork<Iterated>() (res2, half, stop);   // second thread
      a1::Fork<Merge>() (res, res1, res2);       // merging results
    }
  }
};
36. Adaptive parallelization
- Block computation on the inputs, split into k blocks
  - 1 block = pagesize
- Independent execution of the k tasks
- Merge of the results
37. Adaptive algorithm (1/3)
- Hypothesis: non-preemptive, work-stealing-style scheduling
- Coupling sequential + adaptive:

void Adaptative (a1::Shared_w<Page> resLocal, DescWork dw) {
  // cout << "Adaptative" << endl;
  a1::Shared<Page> resLPC;
  a1::Fork<LPC>() (resLPC, dw);
  Page resSeq(pageSize);
  AdaptSeq(dw, resSeq);
  a1::Fork<Merge>() (resLPC, resLocal, resSeq);
}
38. Adaptive algorithm (2/3)

void AdaptSeq (DescWork dw, Page& resSeq) {
  DescLocalWork w;
  Page resLoc(pageSize);
  double k;
  while (!dw.desc->extractSeq(w)) {
    for (int i = 0; i < pageSize; i++) {
      k = resLoc.get(i) + (double) buff[w * pageSize + i];
      resLoc.put(i, k);
    }
  }
  resSeq = resLoc;
}
39. Adaptive algorithm (3/3)
- Extraction side: parallel algorithm

struct LPC {
  void operator() (a1::Shared_w<Page> resLPC, DescWork dw) {
    DescWork dw2;
    dw2.Allocate();
    dw2.desc->l.initialize();
    if (dw.desc->extractPar(dw2)) {
      a1::Shared<Page> res2;
      a1::Fork<AdaptativeMain>() (res2, dw2.desc->i, dw2.desc->j);
      a1::Shared<Page> resLPCold;
      a1::Fork<LPC>() (resLPCold, dw);
      a1::Fork<MergeLPC>() (resLPCold, res2, resLPC);
    }
  }
};
40. Adaptive parallelization
- A single computation task is started for all the inputs
- The remaining work is divided only when a processor becomes idle
- Fewer tasks, fewer merges
41. Example 2: parallelization of gzip
- Gzip:
  - widely used (web) and costly, although of linear complexity
  - source code: 10,000 lines of C, complex data structures
  - principle: LZ77 + Huffman tree
- Why gzip?
  - a P-complete problem, but a practical parallelization is possible
  - drawback: any (known) parallelization involves an overhead
  - -> loss of compression ratio
42. How to parallelize gzip?
- Easy parallelization, 100% compatible with gzip/gunzip
- Problems: loss of compression ratio, grain depends on the machine, overhead
43. Adaptive-grain parallelization of gzip
- LastPartComputation
44. Overhead in compressed file size

  Input size   gzip       Adaptive (2 procs)   Adaptive (8 procs)   Adaptive (16 procs)
  0.86 MB      272573     275692               280660               280660
  5.2 MB       1.023 MB   1.027 MB             1.05 MB              1.08 MB
  9.4 MB       6.60 MB    6.62 MB              6.73 MB              6.79 MB
  10 MB        1.12 MB    1.13 MB              1.14 MB              1.17 MB

Gain in execution time:

  5.2 MB       3.35 s     0.96 s               0.55 s
  9.4 MB       7.67 s     6.73 s               6.79 s
  10 MB        6.79 s     1.71 s               0.88 s
45. Performance (Pentium, 4 x 200 MHz)
46. Conclusion
- Adaptive grain:
  - dynamic recursive cascade of 2 algorithms: 1 sequential, 1 parallel
  - parallelism is generated only when a resource becomes idle
  - -> basic operator: ExtractPar on the sequential work in progress
  - generic programming, ... and simple!?
- Interest: reduction of the overhead related to parallelism
  - task creation, scheduling
  - intrinsic arithmetic overhead
  - practical gain: PL code for probabilistic inference [Mazer, SHARP]
- Perspectives:
  - SMP and distributed experiments: gzip, prefixes, ...
  - extension to the distributed and heterogeneous case: node addition/resilience
  - extension to other algorithms: IMAG-INRIA AHA action (3D vision, computer algebra, ...)
47. Questions?
APACHE/MOAIS + EVASION: J. Allard, C. Menier, R. Revire, F. Zara
Video
48. Performance
49. Performance on SMP (Pentium, 4 x 200 MHz)
50. Performance in a distributed setting
- Distributed search in 2 directories of the same size, each on a remote disk (NFS)
- Sequential: Pentium, 4 x 200 MHz
- SMP: Pentium, 4 x 200 MHz
- Distributed architecture (Myrinet): Pentium, 4 x 200 MHz + 2 x 333 MHz