On-line adaptive parallel prefix computation - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
On-line adaptive parallel prefix computation
  • Jean-Louis Roch, Daouda Traore, Julien Bernard
  • INRIA-CNRS Moais team - LIG Grenoble, France

Contents
I. Motivation
II. Work-stealing scheduling of parallel algorithms
III. Processor-oblivious parallel prefix computation

EUROPAR 2006 - Dresden, Germany - August 29th, 2006
2
Parallel prefix on fixed architecture
  • Prefix problem
  • input: a0, a1, …, an
  • output: π1, …, πn with πi = a0 ∘ a1 ∘ … ∘ ai, for an associative operation ∘

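In this notation, the work-optimal sequential algorithm is a single left-to-right pass; a minimal Python sketch (the parameter `op` plays the role of the associative operation ∘):

```python
def seq_prefix(a, op):
    # Work-optimal sequential prefix: exactly n applications of op
    # for n+1 inputs, one pass left to right.
    out = [a[0]]
    for x in a[1:]:
        out.append(op(out[-1], x))
    return out

print(seq_prefix([1, 2, 3, 4], lambda x, y: x + y))  # [1, 3, 6, 10]
```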

3
The problem
To design a single algorithm that computes prefix(a) efficiently on an arbitrary dynamic architecture

Sequential algorithm / parallel P2 / … / parallel P100 / … / parallel Pmax
→ Which algorithm to choose?

Multi-user SMP server, grid, heterogeneous network
Dynamic architecture: non-fixed number of resources, variable speeds (e.g. a grid, but not only: also an SMP server in multi-user mode)
4
Lower bound for prefix on processors with changing speeds
- Model of heterogeneous processors with changing speeds [Bender et al. 02]: Πi(t) = instantaneous speed of processor i at time t (in operations per second)
  Assumption: Πmax(t) / Πmin(t) < constant
  Def: Πave = average speed per processor for a computation with duration T
- Theorem 2: lower bound for the time of prefix computation on p processors with changing speeds: T ≥ 2n / ((p+1) Πave)
  Sketch of the proof:
  - extension of the 2n/(p+1) lower bound on p identical processors [Faith82]
  - based on the analysis of the number of performed operations
5
Changing speeds and work-stealing
  • A work-stealing schedule on-line adapts to processor availability and speeds [Bender-Rabin 02]
  • Principle of work-stealing: a greedy schedule, but distributed and randomized
  • Each processor manages locally the tasks it creates
  • When idle, a processor steals the oldest ready task from a remote (non-idle) victim processor, chosen randomly
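A toy synchronous simulation of this stealing rule (hypothetical helper, not the real asynchronous runtime): every unit-time task starts on worker 0; each step, a worker with local work runs its newest task, while an idle worker steals the oldest task of a random non-idle victim:

```python
import random
from collections import deque

def work_steal_schedule(tasks, p, seed=0):
    # Toy synchronous work-stealing simulation with p workers.
    # All unit-time tasks start in worker 0's deque.
    rng = random.Random(seed)
    deques = [deque(tasks)] + [deque() for _ in range(p - 1)]
    steps = steals = done = 0
    while any(deques):
        steps += 1
        for w in range(p):
            if deques[w]:
                deques[w].pop()               # run the newest local task
                done += 1
            else:
                victims = [v for v in range(p) if v != w and deques[v]]
                if victims:                   # steal the oldest ready task
                    deques[w].append(deques[rng.choice(victims)].popleft())
                    steals += 1
    return steps, steals, done
```

On 16 tasks and 4 workers the makespan lands between the ideal 16/4 = 4 steps and the fully sequential 16, with only a handful of steals, illustrating why steal counts stay low in practice.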
6
Work-stealing and adaptation
Work W1 = total number of operations performed
Depth W∞ = number of operations on a critical path (parallel time on ∞ resources)
  • Interest: if W1 is fixed and W∞ is small, work-stealing yields a near-optimal adaptive schedule, with good probability, on p processors with average speed Πave
  • Moreover, #steals = #task migrations < p·W∞ [Blumofe 98] [Narlikar 01] [Bender 02]
  • But lower bounds for prefix:
  • minimal work: W1 = n ⇒ W∞ = n
  • minimal depth: W∞ < 2 log n ⇒ W1 > 2n
  • With work-stealing, how to reach the lower bound?

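The W∞ ≈ 2 log n side of this trade-off is attained by the classical work-efficient scan (Ladner-Fischer style up-sweep/down-sweep); a sketch, run sequentially here so that each inner loop counts as one parallel step (assumes the input length is a power of two):

```python
def par_prefix_ops(a, op):
    # Work-efficient parallel scan, simulated sequentially to count
    # operations and depth. Assumes len(a) is a power of two.
    n = len(a)
    t = list(a)
    ops = depth = 0
    d = 1
    while d < n:                          # up-sweep: reduction tree
        for i in range(2 * d - 1, n, 2 * d):
            t[i] = op(t[i - d], t[i])
            ops += 1
        depth += 1
        d *= 2
    d = n // 4
    while d >= 1:                         # down-sweep: fill in the gaps
        for i in range(3 * d - 1, n, 2 * d):
            t[i] = op(t[i - d], t[i])
            ops += 1
        depth += 1
        d //= 2
    return t, ops, depth
```

For n = 8 this uses 11 operations in 5 steps (in general 2n − log₂n − 2 operations in 2·log₂n − 1 steps): depth near 2 log n forces work near 2n, twice the sequential optimum.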
7
How to get both work W1 and depth W∞ small?
  • General approach: couple two algorithms:
  • a sequential algorithm with optimal number of operations Ws
  • a fine-grain parallel algorithm with minimal critical time W∞, but parallel work >> Ws
  • Folk technique: parallel, then sequential
  • run the parallel algorithm down to a certain "grain", then use the sequential one
  • drawback with changing speeds: either too many idle processors or too many operations
  • Work-preserving speed-up technique [Bini-Pan 94]: sequential, then parallel (cascading [Jaja 92]): careful interplay of both algorithms to build one with both W∞ small and W1 = O(Wseq)
  • use the work-optimal sequential algorithm to reduce the size
  • then use the time-optimal parallel algorithm to decrease the time
  • drawback: sequential at coarse grain and parallel at fine grain

8
Alternative: concurrently sequential and parallel
  • Based on work-stealing and the work-first principle:
  • always execute a sequential algorithm, to reduce parallelism overhead
  • use the parallel algorithm only when a processor becomes idle (i.e. work-stealing), by extracting parallelism from the remaining sequential computation (i.e. adaptive granularity)
  • Hypothesis: two algorithms are available:
  • - 1 sequential: SeqCompute
  • - 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
  • Self-adaptive granularity based on work-stealing

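A sequentialized sketch of one steal under this scheme (simplified to a single fixed steal point, whereas the real runtime steals recursively at arbitrary times; names are illustrative): the main worker runs SeqCompute on the front of the input, a thief reduces the stolen last part, and a merge/jump folds the two together:

```python
def adaptive_prefix(a, op, steal_at):
    # Main worker: plain sequential prefix on the part it keeps.
    main = [a[0]]
    for x in a[1:steal_at]:
        main.append(op(main[-1], x))
    # Thief: local prefix of the stolen last part, conceptually
    # computed in parallel with the loop above.
    stolen = [a[steal_at]]
    for x in a[steal_at + 1:]:
        stolen.append(op(stolen[-1], x))
    # Merge/jump: fold the prefix value reached at the steal point
    # into every stolen result -- one extra op per stolen element.
    return main + [op(main[-1], s) for s in stolen]
```

A stolen block of size m thus costs (m − 1) local operations plus m merge operations, about 2m in total: steals are paid for at roughly twice the sequential rate, but only on the part actually stolen.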
9
Alternative: concurrently sequential and parallel
[Diagram: SeqCompute / preempt]
10
Alternative: concurrently sequential and parallel
[Diagram: SeqCompute / merge-jump / Seq / complete]
11
Adaptive Prefix on 3 processors
[Animation frame: Sequential π1 / Parallel]
12
Adaptive Prefix on 3 processors
[Animation frame: Sequential π3 / Parallel π7]
13
Adaptive Prefix on 3 processors
[Animation frame: Sequential / Parallel π8]
14
Adaptive Prefix on 3 processors
[Animation frame: Sequential π8 / Parallel π8, π5, π6, π9, π11]
15
Adaptive Prefix on 3 processors
[Animation frame: Sequential π12, π11, π8 / Parallel π8, π5, π6, π7, π9, π11, π10]
16
Adaptive Prefix on 3 processors
[Final frame: Sequential / Parallel: implicit critical path on the sequential process]
17
Analysis of the algorithm
  • Theorem 3: execution time of the coupled algorithm
  • Sketch of the proof, by analysis of the operations performed:
  • the main sequential process performs S operations on one processor
  • the (p-1) work-stealers perform X = 2(n-S) operations, with depth log X
  • each non-constant-time task can potentially be split (variable speeds)
  • the coupling ensures both algorithms complete simultaneously: Ts = Tp - O(log X), which bounds the total number X of operations performed and the overhead of parallelism, (S + X) - #ops_optimal

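The work accounting behind this sketch, assuming the stolen last part of size n − S costs two operations per element (local reduction plus merge), can be summarized as:

```latex
W_1 = S + X = S + 2(n - S) = 2n - S
```

So the larger the share S completed by the sequential process, the closer the total work gets to the sequential optimum n, while in the worst case it stays within the parallel bound 2n; this is why the coupling keeps both sides busy until they finish together.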
18
Adaptive prefix: experiments 1
Prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), single-user context
[Plot: time (s) vs. number of processors]
Single-user context: processor-adaptive prefix achieves near-optimal performance:
- close to the lower bound both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically optimal off-line parallel algorithm on p processors
19
Adaptive prefix: experiments 2
Prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), multi-user context
[Plot: time (s) vs. number of processors, with external load of (9-p) processes]
Multi-user context with additional external load: (9-p) additional external dummy processes are executed concurrently. Processor-adaptive prefix computation is always the fastest, with a 15% benefit over a parallel algorithm for p processors with an off-line schedule.
20
Conclusion
  • The interplay of an on-line parallel algorithm directed by a work-stealing schedule is useful for the design of processor-oblivious algorithms
  • Application to prefix computation:
  • - theoretically reaches the lower bound on heterogeneous processors with changing speeds
  • - practically achieves near-optimal performance on multi-user SMPs
  • Generic adaptive scheme to implement parallel algorithms with provable performance
  • - work in progress: parallel 3D reconstruction (oct-tree scheme with deadline constraint)

21
Thank you!
Interactive distributed simulation [B. Raffin, E. Boyer]: 5 cameras, 6 PCs; 3D reconstruction + simulation + rendering → adaptive scheme to maximize 3D-reconstruction precision within a fixed timestamp [L. Suares, B. Raffin, J-L. Roch]
22
The prefix race: sequential/parallel, fixed/adaptive
On each of the 10 executions, the adaptive version completes first
23
Adaptive prefix: some experiments
Prefix of 10,000 elements on an 8-processor SMP (IA64 / Linux)
[Plots: time (s) vs. processors, with and without external load]
Multi-user context: adaptive is the fastest, with a 15% benefit over a static-grain algorithm
  • Single-user context: adaptive is equivalent to
  • - the sequential algorithm on 1 processor
  • - the optimal 2-processor parallel algorithm on 2 processors
  • - …
  • - the optimal 8-processor parallel algorithm on 8 processors

24
With double sum (r_i = r_{i-1} + x_i)
Finest grain limited to 1 page = 16384 bytes = 2048 doubles
[Plots: single user; processors with variable speeds]
Remark, for n = 4,096,000 doubles:
- pure sequential: 0.20 s
- minimal grain of 100 doubles: 0.26 s on 1 proc and 0.175 s on 2 procs (close to the lower bound)