MOAIS: Multi-programmation et Ordonnancement pour les Applications Interactives de Simulation
Programming and Scheduling Design of Interactive Simulation Applications on Distributed Resources

Transcript and Presenter's Notes

1
Adaptive-grain parallel algorithms:
a few examples
Jean-Louis.Roch@imag.fr
MOAIS project (www-id.imag.fr/MOAIS)
Laboratoire ID-IMAG (CNRS-INRIA, INPG-UJF)
2
MOAIS: Multi-programmation et Ordonnancement pour les Applications Interactives de Simulation
Programming and Scheduling Design of Interactive Simulation Applications on Distributed Resources
3
ID Research activities
  • Adaptive middleware
    - Resource management and scheduling systems
    - Resource brokering based on prediction tools
    - 2nd-generation Open Grid Services Architecture
      with a P2P approach
    - Operational aspects of P2P systems
    - Deployability of P2P services (memberships)
  • Network operating systems
    - Open-source grid-aware OS (extensions of Linux)
  • Distributed algorithms
    - Dependable and adaptive
  • Programming models and languages
    - High-performance component models
    - Lightweight component platforms
    - QoS-aware self-organizing component platforms
      with dynamic reconfiguration
    - Automatic exploitation of coarse-grain algorithms
  • Communication models
    - Generic framework
  • Computational models
    - Novel algorithms and applications
    - Need for grid-aware algorithms and applications
  • Large-scale data management
    - Shared objects, persistence, coherency
  • Security / Accountability
    - P2P services in an unfriendly world
  • Tools
    - Performance analysis and prediction
  • Application testbeds
    - Bioinformatics
    - Engineering applications

4
ID Research activities
  • Two INRIA projects
  • MOAIS (contact: Jean-Louis.Roch@imag.fr)
  • Programming and scheduling
  • Adaptive and interactive applications
  • 2 full-time researchers (INRIA)
  • 4 assistant professors (3 INPG, 1 UJF)
  • 14 PhD students
  • MESCAL (contact: Bruno.Gaujal@inrialpes.fr)
  • Dynamic resource management
  • Performance evaluation and dimensioning
  • 2 full-time researchers (INRIA, CNRS)
  • 6 assistant professors (2 INPG, 4 UJF)
  • 13 PhD students

5
Objective
  • Programming of applications where performance is
    a matter of resources
  • benefit from more resources, and adapt to fewer
  • e.g. a global computing platform (P2P)
  • Application code independent from the resources,
    and adaptive
  • Target applications: interactive simulation
  • virtual observatory

6
GRIMAGE platform
B. Raffin
Commodity components
  • 2003: 11 PCs, 8 projectors and 4 cameras
  • First demo: 12/03
  • 2005: 30 PCs, 16 projectors and 20 cameras
  • A display wall
  • Surface: 2 x 2.70 m
  • Resolution: 4106 x 3172 pixels
  • Very bright: works in daylight

7
Video
J. Allard, C. Menier
8
MOAIS to adapt parallelism by scheduling
9
How to adapt the application?
  • By minimizing communications
  • e.g. amortizing synchronizations in the
    simulation [Beaumont, Daoudi, Maillard,
    Manneback, Roch - PMAA 2004]: adaptive
    granularity
  • By controlling latency (interactivity constraints)
  • FlowVR [Allard, Menier, Raffin]: overhead
  • By managing node failures and resilience:
    checkpoint/restart + checkers
  • FlowCert [Jafar, Krings, Leprevost, Roch,
    Varrette]
  • By adapting granularity
  • malleable tasks [Trystram, Mounié]
  • dataflow cactus-stack: Athapascan/Kaapi
    [Gautier]
  • recursive parallelism by "work-stealing"
    [Blumofe-Leiserson 98], Cilk, Athapascan, ...
  • Self-adaptive grain algorithms
  • dynamic extraction of parallelism
    [Daoudi, Gautier, Revire, Roch - J. TSI 2004]

10
Adaptive-grain parallel algorithms:
a few examples
  • Scheduling of fine-grain parallel programs:
    work-stealing and efficiency
  • Adaptive-grain algorithms: principle of a
    dynamic "cascade"
  • Examples

11
High potential degree of parallelism
In "practice": coarse granularity
  Splitting into p resources
  Drawback: heterogeneous, dynamic architecture
In "theory": fine granularity
  Maximal parallelism
  Drawback: overhead of task management
How to choose/adapt granularity?
12
Parallelism and efficiency
"Depth": parallel time on an unbounded number of
  resources, T∞ = #operations on a critical path
"Work": sequential time, T1 = #operations
Problem: how to adapt the potential parallelism
to the resources?
Scheduling
  control of the policy (realisation):
    expensive in general (fine grain), but small
    overhead if coarse grain
  efficient policy (close to optimal):
    difficult in general (coarse grain), but easy
    if T∞ is small (fine grain):
    Tp ≤ T1/p + T∞ [greedy scheduling, Graham 69]
    (a short derivation of this bound follows)
=> goal: to have T∞ small, with coarse-grain control
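The greedy bound is garbled in the transcript; for completeness, here is the standard argument behind it (our addition, not from the slides), as a LaTeX note:

    % Greedy (list) scheduling, Graham 1969: at every step, each of the
    % p processors is busy unless no task is ready. Steps where all p
    % processors are busy number at most T_1/p, since T_1 operations are
    % executed in total; every other step executes all currently ready
    % operations of a critical path, so there are at most T_\infty of
    % them. Summing the two kinds of steps:
    \[
      T_p \;\le\; \frac{T_1}{p} + T_\infty
    \]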
13
Work-stealing scheduling of a parallel recursive
fine-grain algorithm
  • Work-stealing scheduling
  • an idle processor steals the oldest ready task
  • Interest: #successful steals < p · T∞ on average
    [Blumofe 98, Narlikar 01, ...]; suited to
    heterogeneous architectures [Bender-Rabin 03, ...]
  • Hypotheses for efficient parallel execution
  • the parallel algorithm is "work-optimal"
  • T∞ is very small (recursive parallelism)
  • a "sequential" execution of the parallel
    algorithm is valid
  • e.g. search trees, Branch & Bound, ...
  • Implementation: work-first principle
    [Multilisp, Cilk, ...]
  • overhead of task creation only upon a steal
    request: sequential degeneration of the
    parallel algorithm
  • cactus-stack management (a minimal deque sketch
    follows below)
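As announced above, a minimal C++ sketch of the deque discipline behind work-stealing (our illustration: the names WorkDeque and Task are ours, and real runtimes such as Cilk or Kaapi use lock-free deques rather than a mutex):

    // One per-worker deque. The owner pushes and pops "young" tasks at
    // the bottom (depth-first, work-first principle); an idle thief
    // steals the OLDEST ready task at the top, i.e. the largest
    // remaining piece of work.
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>

    using Task = std::function<void()>;

    class WorkDeque {
      std::deque<Task> tasks_;   // front = oldest, back = youngest
      std::mutex m_;
    public:
      void push(Task t) {                    // owner: a new ready task
        std::lock_guard<std::mutex> g(m_);
        tasks_.push_back(std::move(t));
      }
      std::optional<Task> pop() {            // owner: newest task first
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
      }
      std::optional<Task> steal() {          // thief: oldest ready task
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
      }
    };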

14
Implementation of work-stealing
Hypothesis: a sequential schedule is valid
  (non-preemptive execution of ready tasks)
[figure: execution stack; f1() forks f2(), the
forked frames live on the stack]
  • Advantage: "statically" fine grain, but dynamic
    control
  • Drawback: possible overhead of the parallel
    algorithm (e.g. prefix computation)

15
Experimentation: knary benchmark
procs | Speed-up
    8 |  7.83
   16 | 15.6
   32 | 30.9
   64 | 59.2
  100 | 90.1
Distributed architecture: iCluster, Athapascan
SMP architecture: Origin 3800 (32 procs), Cilk /
Athapascan
Ts = 2397 s, T1 = 2435 s
16
How to obtain an efficient fine-grain algorithm?
  • Hypotheses for efficiency of work-stealing
  • the parallel algorithm is "work-optimal"
  • T∞ is very small (recursive parallelism)
  • Problem
  • fine-grain (T∞ small) parallel algorithms may
    involve a large overhead with respect to an
    efficient sequential algorithm
  • overhead due to parallelism creation and
    synchronization
  • but also arithmetic overhead

17
Prefix computation ( n / 2 )
[figure: recursive parallel prefix scheme]
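The figure of this slide is lost in the transcript. As context, a C++ sketch of the classical recursive prefix scheme its title points to (our reconstruction; the two inner loops are each a depth-1 parallel loop, written sequentially here). It performs about 2n multiplications versus n-1 for the sequential prefix, illustrating the arithmetic overhead mentioned on slide 16:

    #include <cstddef>
    #include <vector>

    // Prefix products p[i] = x[0]*...*x[i] by recursion on n/2:
    // multiply adjacent pairs, recurse on them, then fill the
    // remaining positions from the half-size prefix.
    std::vector<double> parallelPrefix(const std::vector<double>& x) {
      std::size_t n = x.size();
      if (n <= 1) return x;
      std::vector<double> pairs(n / 2);
      for (std::size_t i = 0; i < n / 2; ++i)     // parallel loop, depth 1
        pairs[i] = x[2 * i] * x[2 * i + 1];
      std::vector<double> pref = parallelPrefix(pairs);  // size n/2
      std::vector<double> out(n);
      out[0] = x[0];
      for (std::size_t i = 1; i < n; ++i)         // parallel loop, depth 1
        out[i] = (i % 2 == 1) ? pref[i / 2]
                              : pref[i / 2 - 1] * x[i];
      return out;
    }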
18
Adaptive-grain parallel algorithms:
a few examples
  • Scheduling of fine-grain parallel programs:
    work-stealing and efficiency
  • Adaptive-grain algorithms: principle of a
    dynamic "cascade"
  • Examples

19
Self-adaptive grain algorithm
  • Principle: save on parallelism overhead by
    giving priority to a sequential algorithm
  • => use a parallel algorithm only when a
    processor becomes idle, by extracting
    parallelism from a sequential computation
  • Hypothesis: two algorithms
  • - 1 sequential: SeqCompute
  • - 1 parallel: LastPartComputation => at any
    time, it is possible to extract parallelism
    from the remaining computations of the
    sequential algorithm
    (a generic sketch follows slide 20's title)

20
Generic self-adaptive grain algorithm
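The diagram of this slide is not in the transcript. Below, a minimal stand-alone sketch of the generic coupling (ours; the name WorkDesc and the grain parameter are our assumptions, though the deck's own DescWork on slides 37-39 exposes the same two operations): the owner repeatedly extracts small chunks from the front of the remaining work, while a thief's LastPartComputation splits off the second half of what is left:

    #include <algorithm>
    #include <mutex>
    #include <optional>
    #include <utility>

    // Shared descriptor of the remaining work, the interval [first, last).
    struct WorkDesc {
      std::mutex m;
      long first = 0, last = 0;
      // Owner (SeqCompute): grab a small chunk at the front.
      std::optional<std::pair<long, long>> extractSeq(long grain) {
        std::lock_guard<std::mutex> g(m);
        if (first >= last) return std::nullopt;     // nothing left
        long b = first;
        first = std::min(first + grain, last);
        return std::make_pair(b, first);
      }
      // Thief (LastPartComputation): steal the second half of the rest.
      std::optional<std::pair<long, long>> extractPar() {
        std::lock_guard<std::mutex> g(m);
        if (last - first < 2) return std::nullopt;  // too small to split
        long mid = first + (last - first) / 2;
        long e = last;
        last = mid;                     // owner keeps [first, mid)
        return std::make_pair(mid, e);  // thief gets [mid, e)
      }
    };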
21
Illustration: f(i), i = 1..100
LastPart(w)
W = [2..100]
SeqComp(w) on CPU A: f(1)
22
Illustration: f(i), i = 1..100
LastPart(w)
W = [3..100]
SeqComp(w) on CPU A: f(1); f(2)
23
Illustration: f(i), i = 1..100
LastPart(w) on CPU B
W = [3..100]
SeqComp(w) on CPU A: f(1); f(2)
24
Illustration: f(i), i = 1..100
LastPart(w) on CPU B
LastPart(w)
LastPart(w)
W = [3..51]
W = [52..100]
SeqComp(w) on CPU A: f(1); f(2)
SeqComp(w)
25
Illustration: f(i), i = 1..100
LastPart(w)
LastPart(w)
W = [3..51]
W = [52..100]
SeqComp(w) on CPU A: f(1); f(2)
SeqComp(w)
26
Illustration: f(i), i = 1..100
LastPart(w)
LastPart(w)
W = [3..51]
W = [53..100]
SeqComp(w) on CPU A: f(1); f(2)
SeqComp(w) on CPU B: f(52)
27
Cascading a parallel and a sequential algorithm
  • In general, two different algorithms may be used
  • Sequential recursive algorithm: Ts operations,
    algo(n) = { ...; algo(n-1); ... }
  • Parallel algorithm: T∞ small but T1 >> Ts
  • Work-preserving speed-up [Bini-Pan 94],
    cascading technique [Jaja 92]: careful interplay
    of both algorithms to build one with both T∞
    small and T1 = O(Ts). But fine grain:
  • divide the sequential algorithm into blocks
  • each block is computed with the (non-optimal)
    parallel algorithm
  • Adaptive grain: dual approach, parallelism is
    extracted from any sequential task

28
Example: triangular system solving
  1/ x1 = b1 / a11
  2/ for k = 2..n: bk = bk - ak1 * x1
[figure: the matrix A; the system of dimension n
reduces to a system of dimension n-1]
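A compact C++ rendering of this recursion (our code, not the presentation's): eliminating x1 leaves a triangular system of dimension n-1, and the recursion is written here as a loop.

    #include <cstddef>

    // Solve the lower-triangular system A x = b.
    // A is n x n, row-major, lower triangular, nonzero diagonal.
    void solveLowerTriangular(std::size_t n, const double* A,
                              double* b, double* x) {
      for (std::size_t j = 0; j < n; ++j) {
        x[j] = b[j] / A[j * n + j];           // 1/ x1 = b1 / a11
        for (std::size_t k = j + 1; k < n; ++k)
          b[k] -= A[k * n + j] * x[j];        // 2/ bk = bk - ak1 * x1
      }
    }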
29
Example: triangular system solving
[figure]
30
Adaptive-grain parallel algorithms:
a few examples
  • Scheduling of fine-grain parallel programs:
    work-stealing and efficiency
  • Adaptive-grain algorithms: principle of a
    dynamic "cascade"
  • Examples
  • Iterated product, prefix
  • gzip compression
  • Triangular system inversion
  • 3D vision / oct-tree computation

31
Iterated product: sequential, parallel,
adaptive [Davide Vernizzi]
  • Sequential
  • Input: array of n values
  • Output: res, accumulated over the array
  • C/C++ code:
      for (i = 0; i < n; i++)
        res += atoi(x[i]);
  • Parallel algorithm
  • recursive computation by blocks (binary tree
    with merge)
  • Block size: pagesize
  • Code: Kaapi / Athapascan API
32
Variant: sum of pages
  • Input: a set of n pages; each page is an array
    of values
  • Output: one page where each element is the sum
    of the elements with the same index in the
    input pages
  • C/C++ code:
      for (i = 0; i < n; i++)
        for (j = 0; j < pageSize; j++)
          res[j] += f(pages[i][j]);

33
Demonstration on ensibull
Script: vernizzd@ensibull:demo> more go-tout.sh
  #!/bin/sh
  ./spg /tmp/data
  ./ppg /tmp/data 1 --a1 -thread.poolsize 3
  ./apg /tmp/data 1 --a1 -thread.poolsize 3
Result: vernizzd@ensibull:demo> ./go-tout.sh
  Page size: 4096
  Memory allocated
  Memory allocated
  0: In main: th 1, parallel 0
  -----------------------------------------
  0: res = -2.048e+07
  0: time = 0.408178 s    -> ADAPTIVE (3 procs)
  0: threads created: 54
  -----------------------------------------
  0: res = -2.048e+07
  0: time = 0.964014 s    -> PARALLEL (3 procs)
  0: fork: 7497
  -----------------------------------------
  res = -2.048e+07
  time = 1.15204 s        -> SEQUENTIAL (1 proc)
  -----------------------------------------
34
Where does the difference come from?
The source code of the programs
Source code for the page sum:
  - parallel: binary tree
  - adaptive: by coupling - sequential +
    Fork<LastPartComp>
  - LastPartComp: (recursive) generation of 3 tasks
35
Parallel algorithm

struct Iterated {
  void operator() (a1::Shared_w<Page> res, int start, int stop) {
    if ((stop - start) < 2) {
      // If max num of pages is reached, sequential algorithm
      Page resLocal(pageSize);
      IteratedSeq(start, resLocal);
      res.write(resLocal);
    } else {
      // If max num of pages is not reached
      int half = (start + stop) / 2;
      a1::Shared<Page> res1;                     // First thread result
      a1::Shared<Page> res2;                     // Second thread result
      a1::Fork<Iterated>()(res1, start, half);   // First thread
      a1::Fork<Iterated>()(res2, half, stop);    // Second thread
      a1::Fork<Merge>()(res, res1, res2);        // Merging results
    }
  }
};
36
Adaptive parallelization
  • Block computation on inputs split into k blocks
  • 1 block = pagesize
  • Independent execution of the k tasks
  • Merging of the results
37
Adaptive algorithm (1/3)
  • Hypothesis: non-preemptive, work-stealing-type
    scheduling
  • Coupling: sequential + adaptive

void Adaptative (a1::Shared_w<Page> resLocal, DescWork dw) {
  // cout << "Adaptative" << endl;
  a1::Shared<Page> resLPC;
  a1::Fork<LPC>()(resLPC, dw);
  Page resSeq(pageSize);
  AdaptSeq(dw, resSeq);
  a1::Fork<Merge>()(resLPC, resLocal, resSeq);
}
38
Adaptive algorithm (2/3)
  • Sequential side

void AdaptSeq (DescWork dw, Page& resSeq) {
  DescLocalWork w;
  Page resLoc(pageSize);
  double k;
  while (!dw.desc->extractSeq(w)) {
    // accumulate the extracted page w into resLoc
    for (int i = 0; i < pageSize; i++) {
      k = resLoc.get(i) + (double) buff[w * pageSize + i];
      resLoc.put(i, k);
    }
  }
  resSeq = resLoc;
}
39
Adaptive algorithm (3/3)
  • Extraction side: parallel algorithm

struct LPC {
  void operator() (a1::Shared_w<Page> resLPC, DescWork dw) {
    DescWork dw2;
    dw2.Allocate();
    dw2.desc->l.initialize();
    if (dw.desc->extractPar(dw2)) {
      a1::Shared<Page> res2;
      a1::Fork<AdaptativeMain>()(res2, dw2.desc->i, dw2.desc->j);
      a1::Shared<Page> resLPCold;
      a1::Fork<LPC>()(resLPCold, dw);
      a1::Fork<MergeLPC>()(resLPCold, res2, resLPC);
    }
  }
};
40
Adaptive parallelization
  • A single computation task is started for the
    whole input
  • The remaining work is divided only when a
    processor becomes idle
  • Fewer tasks, fewer merges

41
Example 2: parallelization of gzip
  • Gzip
  • widely used (web) and costly, although of linear
    complexity
  • source code: 10000 lines of C, complex data
    structures
  • principle: LZ77 + Huffman tree
  • Why gzip?
  • a P-complete problem, but practical
    parallelization is possible
  • drawback: every (known) parallelization incurs
    an overhead
  • -> loss of compression ratio

42
How to parallelize gzip?
"Easy" parallelization, 100% compatible with
gzip/gunzip. Problems: loss of compression ratio,
the grain depends on the machine, overhead.
(A sketch of this block-splitting scheme follows.)
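A minimal C++ sketch of the "easy" block-splitting parallelization referred to above (our code, not the presentation's; compressBlock is a placeholder for a real deflate of one gzip member, e.g. via zlib). Concatenated gzip members form a valid gzip stream, which is why the output stays gunzip-compatible; each block restarts the LZ77 dictionary, which is the compression-ratio loss mentioned on this slide:

    #include <cstddef>
    #include <string>
    #include <thread>
    #include <vector>

    // Placeholder: identity. In a real version this would deflate the
    // block into one gzip member.
    std::string compressBlock(const std::string& block) { return block; }

    std::string parallelGzip(const std::string& input,
                             std::size_t blockSize) {
      std::size_t nBlocks = (input.size() + blockSize - 1) / blockSize;
      std::vector<std::string> parts(nBlocks);
      std::vector<std::thread> workers;
      for (std::size_t b = 0; b < nBlocks; ++b)
        workers.emplace_back([&parts, &input, blockSize, b] {
          // each worker compresses its own block independently
          parts[b] = compressBlock(input.substr(b * blockSize, blockSize));
        });
      for (std::thread& t : workers) t.join();
      std::string out;                 // concatenation of gzip members
      for (const std::string& s : parts) out += s;
      return out;
    }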
43
Adaptive-grain parallelization of gzip
LastPartComputation
44
Overhead in compressed file size

File size | gzip     | Adaptive (2 procs) | Adaptive (8 procs) | Adaptive (16 procs)
0.86 MB   | 272573   | 275692             | 280660             | 280660
5.2 MB    | 1.023 MB | 1.027 MB           | 1.05 MB            | 1.08 MB
9.4 MB    | 6.60 MB  | 6.62 MB            | 6.73 MB            | 6.79 MB
10 MB     | 1.12 MB  | 1.13 MB            | 1.14 MB            | 1.17 MB

Gain in T∞:
5.2 MB | 3.35 s | 0.96 s | 0.55 s
9.4 MB | 7.67 s | 6.73 s | 6.79 s
10 MB  | 6.79 s | 1.71 s | 0.88 s
45
Performance
[plot: 4 x 200 MHz Pentium SMP]
46
Conclusion
  • Adaptive grain
  • dynamic recursive cascade of 2 algorithms: 1
    sequential, 1 parallel
  • parallelism is generated only when resources
    become idle
  • -> basic operator: ExtractPar on the sequential
    work in progress
  • Generic programming, ... and simple!??
  • Benefit: reduction of the parallelism overhead
  • task creation, scheduling
  • intrinsic arithmetic overhead
  • Practical gain: PL code for probabilistic
    inference [Mazer, SHARP]
  • Perspectives
  • - SMP and distributed experiments: gzip,
    prefixes, ...
  • - extension to the distributed, heterogeneous
    case: addition/resilience of resources
  • - extension to other algorithms: IMAG-INRIA AHA
    initiative (3D vision, computer algebra, ...)
47
Questions ?
APACHE/MOAIS & EVASION: J. Allard, C. Menier,
R. Revire, F. Zara
Video
48
Performance
49
Performance on SMP
[plot: 4 x 200 MHz Pentium SMP]
50
Performance in a distributed setting
Distributed search in 2 directories of the same
size, each on a remote disk (NFS)
  • Sequential: 4 x 200 MHz Pentium
  • SMP: 4 x 200 MHz Pentium
  • Distributed architecture (Myrinet): 4 x 200 MHz
    Pentium + 2 x 333 MHz