Title: MOAIS - Multi-programmation et Ordonnancement pour les Applications Interactives de Simulation (Programming and Scheduling Design of Interactive Simulation Applications on Distributed Resources)
1. Adaptive-grain parallel algorithms: a few examples
Jean-Louis.Roch_at_imag.fr
MOAIS project (www-id.imag.fr/MOAIS)
Laboratoire ID-IMAG (CNRS-INRIA, INPG-UJF)
2. MOAIS: Multi-programmation et Ordonnancement pour les Applications Interactives de Simulation (Programming and Scheduling Design of Interactive Simulation Applications on Distributed Resources)
3. ID Research activities
- Adaptive middleware
  - Resource management and scheduling systems
  - Resource brokering based on prediction tools
  - 2nd-generation Open Grid Service Architecture with a P2P approach
- Operational aspects of P2P systems
  - Deployability of P2P services (memberships)
- Network operating systems
  - Open-source Grid-aware OS (extensions of Linux)
- Distributed algorithms
  - Dependable and adaptive
- Programming models and languages
  - High-performance component models
  - Lightweight component platforms
  - QoS-aware self-organizing component platforms with dynamic reconfiguration
- Automatic exploitation of coarse-grain algorithms
  - Communication models
  - Generic framework
  - Computational models
- Novel algorithms and applications
  - Need for grid-aware algorithms and applications
  - Large-scale data management (shared objects, persistence, coherency)
- Security / Accountability
  - P2P services in an unfriendly world
- Tools
  - Performance analysis and prediction
- Application testbeds
  - Bioinformatics
  - Engineering applications
4. ID Research activities
- Two INRIA projects:
- MOAIS (contact: Jean-Louis.Roch_at_imag.fr)
  - Programming and scheduling
  - Adaptive and interactive applications
  - 2 full-time researchers (INRIA)
  - 4 assistant professors (3 INPG, 1 UJF)
  - 14 PhD students
- MESCAL (contact: Bruno.Gaujal_at_inrialpes.fr)
  - Dynamic resource management
  - Performance evaluation and dimensioning
  - 2 full-time researchers (INRIA, CNRS)
  - 6 assistant professors (2 INPG, 4 UJF)
  - 13 PhD students
5. Objective
- Programming of applications where performance is a matter of resources: benefit from more resources and adapt to fewer
  - e.g. a global computing platform (P2P)
- Application code independent from the resources, and adaptive
- Target applications: interactive simulation, virtual observatory
6. GRIMAGE platform (B. Raffin)
- Commodity components:
  - 2003: 11 PCs, 8 projectors and 4 cameras; first demo 12/03
  - 2005: 30 PCs, 16 projectors and 20 cameras
- A display wall:
  - Surface: 2 x 2.70 m
  - Resolution: 4106 x 3172 pixels
  - Very bright: usable in daylight
7. Video (J. Allard, C. Menier)
8. MOAIS: adapting parallelism by scheduling
9. How to adapt the application?
- By minimizing communications
  - e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004]: adaptive granularity
- By controlling latency (interactivity constraints)
  - FlowVR [Allard, Menier, Raffin]: overhead
- By managing node failures and resilience
  - checkpoint/restart, checkers - FlowCert [Jafar, Krings, Leprevost, Roch, Varrette]
- By adapting granularity
  - malleable tasks [Trystram, Mounié]
  - dataflow cactus-stack: Athapascan/Kaapi [Gautier]
  - recursive parallelism by work-stealing [Blumofe-Leiserson 98, Cilk, Athapascan, ...]
- Self-adaptive grain algorithms
  - dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2004]
10. Adaptive-grain parallel algorithms: a few examples
- Scheduling of fine-grain parallel programs: work-stealing and efficiency
- Adaptive-grain algorithms: principle of a dynamic cascade
- Examples
11. High potential degree of parallelism
- In practice: coarse granularity, splitting into p resources. Drawback: heterogeneous, dynamic architectures.
- In theory: fine granularity, maximal parallelism. Drawback: overhead of task management.
- How to choose/adapt the granularity?
12. Parallelism and efficiency
- Work: sequential time, T1 = #operations
- Depth: parallel time on an unbounded number of resources, T∞ = #operations on a critical path
- Problem: how to adapt the potential parallelism to the resources?
- Scheduling = control of the policy (realisation) + an efficient policy (close to optimal)
  - Control: difficult in general (coarse grain), but easy if T∞ is small (fine grain)
  - Policy: greedy scheduling achieves Tp <= T1/p + T∞ [Graham 69]; expensive in general (fine grain), but small overhead with coarse grain
- => goal: keep T∞ small while keeping coarse-grain control
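The greedy bound quoted on this slide can be recovered by the standard two-line argument, sketched here for completeness:

```latex
% Greedy (list) scheduling bound [Graham 69].
% At every time step, either all p processors are busy, or at least one
% ready operation on the current critical path is being executed:
% - at most T_1 / p "full" steps (each consumes p units of work),
% - at most T_\infty "incomplete" steps (each shortens the remaining
%   critical path by at least one operation).
T_p \;\le\; \frac{T_1}{p} + T_\infty
```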
13. Work-stealing: scheduling of a parallel recursive fine-grain algorithm
- Work-stealing scheduling:
  - an idle processor steals the oldest ready task
  - Interest: #successful steals < p.T∞ [Blumofe 98, Narlikar 01, ...]; => suited to heterogeneous architectures [Bender-Rabin 03, ...]
- Hypotheses for efficient parallel executions:
  - the parallel algorithm is work-optimal
  - T∞ is very small (recursive parallelism)
  - a sequential execution of the parallel algorithm is valid (e.g. search trees, Branch & Bound, ...)
- Implementation: work-first principle [Multilisp, Cilk, ...]
  - overhead of task creation only upon a steal request: sequential degeneration of the parallel algorithm
  - cactus-stack management
14. Implementation of work-stealing
- Hypothesis: a sequential schedule is valid; non-preemptive execution of ready tasks
- Stack: f1() { ... fork f2(); ... }
- Interest: statically fine grain, but dynamic control
- Drawback: possible overhead of the parallel algorithm (e.g. prefix computation)
15. Experimentation: knary benchmark

  procs   speed-up
      8       7.83
     16       15.6
     32       30.9
     64       59.2
    100       90.1

- Distributed architecture: iCluster (Athapascan)
- SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan
- Ts = 2397 s vs. T1 = 2435 s
16. How to obtain an efficient fine-grain algorithm?
- Hypotheses for the efficiency of work-stealing:
  - the parallel algorithm is work-optimal
  - T∞ is very small (recursive parallelism)
- Problem:
  - fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm:
  - overhead due to parallelism creation and synchronization,
  - but also arithmetic overhead
17. Prefix computation (n/2)
18. Adaptive-grain parallel algorithms: a few examples
- Scheduling of fine-grain parallel programs: work-stealing and efficiency
- Adaptive-grain algorithms: principle of a dynamic cascade
- Examples
19. Self-adaptive grain algorithm
- Principle: save the parallelism overhead by privileging a sequential algorithm
  - => use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation
- Hypothesis: two algorithms are available:
  - 1 sequential: SeqCompute
  - 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
20. Generic self-adaptive grain algorithm
21. Illustration: f(i), i = 1..100
- LastPart(w): W = [2..100]
- SeqComp(w) on CPU A: f(1)

22. Illustration: f(i), i = 1..100
- LastPart(w): W = [3..100]
- SeqComp(w) on CPU A: f(1); f(2)

23. Illustration: f(i), i = 1..100
- LastPart(w) on CPU B: W = [3..100]
- SeqComp(w) on CPU A: f(1); f(2)

24. Illustration: f(i), i = 1..100
- LastPart(w) on CPU B splits the remaining work:
- LastPart(w): W = [3..51]; LastPart(w): W = [52..100]
- SeqComp(w) on CPU A: f(1); f(2)
- SeqComp(w)

25. Illustration: f(i), i = 1..100
- LastPart(w): W = [3..51]; LastPart(w): W = [52..100]
- SeqComp(w) on CPU A: f(1); f(2)
- SeqComp(w)

26. Illustration: f(i), i = 1..100
- LastPart(w): W = [3..51]; LastPart(w): W = [53..100]
- SeqComp(w) on CPU A: f(1); f(2)
- SeqComp(w) on CPU B: f(52)
27. Cascading a parallel and a sequential algorithm
- In general, two different algorithms may be used:
  - sequential recursive algorithm: Ts operations; algo(n) { ... algo(n-1) ... }
  - parallel algorithm: T∞ small, but T1 >> Ts
- Work-preserving speed-up [Bini-Pan 94], cascading technique [Jaja 92]: careful interplay of both algorithms to build one with both T∞ small and T1 = O(Ts). But fine grain:
  - divide the sequential algorithm into blocks
  - each block is computed with the (non-optimal) parallel algorithm
- Adaptive grain: dual approach, parallelism is extracted from any sequential task
28. E.g. triangular system solving
1/ x1 = b1 / a11
2/ For k = 2..n: bk = bk - ak1 * x1
=> a system of dimension n reduces to a system of dimension n-1
29. E.g. triangular system solving (cont.)
30. Adaptive-grain parallel algorithms: a few examples
- Scheduling of fine-grain parallel programs: work-stealing and efficiency
- Adaptive-grain algorithms: principle of a dynamic cascade
- Examples:
  - iterated product, prefix
  - gzip compression
  - triangular system inversion
  - 3D vision / octree computation
31. Iterated product: sequential, parallel, adaptive (Davide Vernizzi)
- Sequential:
  - input: array of n values
  - output: res
  - C/C++ code:
      for (i = 0; i < n; i++)
        res += atoi(x[i]);
- Parallel algorithm:
  - recursive computation by blocks (binary tree with merge)
  - block size: pagesize
  - code: Kaapi Athapascan API
32. Variant: sum of pages
- Input: set of n pages; each page is an array of values
- Output: one page where each element is the sum of the elements with the same index in the input pages
- C/C++ code:
    for (i = 0; i < n; i++)
      for (j = 0; j < pageSize; j++)
        res[j] += f(pages[i][j]);
33. Demonstration on ensibull
Script:
    vernizzd_at_ensibull:demo> more go-tout.sh
    #!/bin/sh
    ./spg /tmp/data
    ./ppg /tmp/data 1 --a1 -thread.poolsize 3
    ./apg /tmp/data 1 --a1 -thread.poolsize 3
Result:
    vernizzd_at_ensibull:demo> ./go-tout.sh
    Page size: 4096
    ADAPTIVE (3 procs):   res = -2.048e07, time = 0.408178 s, 54 threads created
    PARALLEL (3 procs):   res = -2.048e07, time = 0.964014 s, 7497 forks
    SEQUENTIAL (1 proc):  res = -2.048e07, time = 1.15204 s
34. Where does the difference come from? The program sources
- Source code for the sum of pages:
  - parallel: binary tree
  - adaptive: coupling of the sequential code with Fork<LastPartComp>
  - LastPartComp: (recursive) generation of 3 tasks
35. Parallel algorithm

struct Iterated {
  void operator() (a1::Shared_w<Page> res, int start, int stop) {
    if ((stop - start) < 2) {
      // If the max number of pages is reached: sequential algorithm
      Page resLocal(pageSize);
      IteratedSeq(start, resLocal);
      res.write(resLocal);
    } else {
      // Otherwise: split the range and fork
      int half = (start + stop) / 2;
      a1::Shared<Page> res1;                     // first thread result
      a1::Shared<Page> res2;                     // second thread result
      a1::Fork<Iterated>() (res1, start, half);  // first thread
      a1::Fork<Iterated>() (res2, half, stop);   // second thread
      a1::Fork<Merge>() (res, res1, res2);       // merging results
    }
  }
};
36. Adaptive parallelization
- Block computation on the inputs, split into k blocks
  - 1 block = pagesize
- Independent execution of the k tasks
- Merge of the results
37. Adaptive algorithm (1/3)
- Hypothesis: non-preemptive, work-stealing-style scheduling
- Coupling sequential + adaptive:

void Adaptative (a1::Shared_w<Page> resLocal, DescWork dw) {
  // cout << "Adaptative" << endl;
  a1::Shared<Page> resLPC;
  a1::Fork<LPC>() (resLPC, dw);
  Page resSeq(pageSize);
  AdaptSeq(dw, resSeq);
  a1::Fork<Merge>() (resLPC, resLocal, resSeq);
}
38. Adaptive algorithm (2/3)

void AdaptSeq (DescWork dw, Page& resSeq) {
  DescLocalWork w;
  Page resLoc(pageSize);
  double k;
  while (!dw.desc->extractSeq(w)) {
    for (int i = 0; i < pageSize; i++) {
      k = resLoc.get(i) + (double) buff[w * pageSize + i];
      resLoc.put(i, k);
    }
  }
  resSeq = resLoc;
}
39. Adaptive algorithm (3/3)
- Extraction side: parallel algorithm

struct LPC {
  void operator() (a1::Shared_w<Page> resLPC, DescWork dw) {
    DescWork dw2;
    dw2.Allocate();
    dw2.desc->l.initialize();
    if (dw.desc->extractPar(dw2)) {
      a1::Shared<Page> res2;
      a1::Fork<AdaptativeMain>() (res2, dw2.desc->i, dw2.desc->j);
      a1::Shared<Page> resLPCold;
      a1::Fork<LPC>() (resLPCold, dw);
      a1::Fork<MergeLPC>() (resLPCold, res2, resLPC);
    }
  }
};
40. Adaptive parallelization
- A single computation task is started for all the inputs
- The remaining work is divided only when a processor becomes idle
- Fewer tasks, fewer merges
41. Example 2: parallelization of gzip
- Gzip:
  - widely used (web) and costly, although of linear complexity
  - source code: 10,000 lines of C, complex data structures
  - principle: LZ77 + Huffman tree
- Why gzip?
  - a P-complete problem, but a practical parallelization is possible
  - drawback: any (known) parallelization involves an overhead
  - -> loss of compression ratio
42. How to parallelize gzip?
- Easy parallelization, 100% compatible with gzip/gunzip
- Problems: loss of compression ratio, grain depends on the machine, overhead
43. Adaptive-grain parallelization of gzip
- LastPartComputation
44. Overhead in compressed file size

  Input size   gzip       Adaptive (2 procs)   Adaptive (8 procs)   Adaptive (16 procs)
  0.86 MB      272573     275692               280660               280660
  5.2 MB       1.023 MB   1.027 MB             1.05 MB              1.08 MB
  9.4 MB       6.60 MB    6.62 MB              6.73 MB              6.79 MB
  10 MB        1.12 MB    1.13 MB              1.14 MB              1.17 MB

Gain in execution time:

  5.2 MB       3.35 s     0.96 s               0.55 s
  9.4 MB       7.67 s     6.73 s               6.79 s
  10 MB        6.79 s     1.71 s               0.88 s
45. Performance (Pentium, 4 x 200 MHz)
46. Conclusion
- Adaptive grain:
  - dynamic recursive cascade of 2 algorithms: 1 sequential, 1 parallel
  - parallelism is generated only when a resource becomes idle
  - -> basic operator: ExtractPar on the sequential work in progress
  - generic programming, ... and simple!?
- Interest: reduction of the overhead related to parallelism
  - task creation, scheduling
  - intrinsic arithmetic overhead
  - practical gain: PL code for probabilistic inference [Mazer, SHARP]
- Perspectives:
  - SMP and distributed experiments: gzip, prefixes, ...
  - extension to the distributed and heterogeneous case: node addition/resilience
  - extension to other algorithms: IMAG-INRIA AHA action (3D vision, computer algebra, ...)
47. Questions?
APACHE/MOAIS + EVASION: J. Allard, C. Menier, R. Revire, F. Zara
Video
48. Performance
49. Performance on SMP (Pentium, 4 x 200 MHz)
50. Performance in a distributed setting
- Distributed search in 2 directories of the same size, each on a remote disk (NFS)
- Sequential: Pentium, 4 x 200 MHz
- SMP: Pentium, 4 x 200 MHz
- Distributed architecture (Myrinet): Pentium, 4 x 200 MHz + 2 x 333 MHz