Processor-oblivious parallel algorithms and scheduling: illustration on parallel prefix

1
Processor-oblivious parallel algorithms and scheduling: illustration on parallel prefix
  • Jean-Louis Roch, Daouda Traore
  • INRIA-CNRS Moais team - LIG Grenoble, France

Contents
  • I. What is a processor-oblivious parallel algorithm?
  • II. Work-stealing scheduling of parallel algorithms
  • III. Processor-oblivious parallel prefix computation

Workshop Scheduling Algorithms for New Emerging Applications - CIRM Luminy - May 29th-June 2nd, 2006
2
The problem
Problem: compute f(a).
[Figure: the same input fed to a sequential algorithm and to parallel algorithms tuned for P = 2, P = 100, …, Pmax processors]
Which algorithm to choose?
  • Multi-user SMP server
  • Grid
  • Heterogeneous network
The architecture is dynamic: a non-fixed number of resources with variable speeds (e.g. a grid, but not only: also an SMP server in multi-user mode).
3
Processor-oblivious algorithms
A dynamic architecture (non-fixed number of resources, variable speeds; e.g. a grid, but not only: also an SMP server in multi-user mode) motivates a "processor-oblivious" parallel algorithm, that is, one that:
  • is independent of the underlying architecture: no reference to p, nor to Π_i(t), the speed of processor i at time t, nor to any given architecture;
  • has performance guarantees: it behaves as well as an optimal (off-line, non-oblivious) algorithm.
Problem: often, the higher the degree of parallelism, the more operations must be performed!
4
Prefix computation
  • Prefix problem:
  • input: a₀, a₁, …, aₙ
  • output: π₀, π₁, …, πₙ with π_i = a₀ ∘ a₁ ∘ … ∘ a_i
  • Sequential algorithm: for (i = 1; i <= n; i++) π_i = π_{i-1} ∘ a_i; it performs W₁ = W∞ = n operations.
  • Fine-grain optimal parallel algorithm: Ladner-Fischer.
[Figure: Ladner-Fischer computation tree over a₀, a₁, a₂, a₃, a₄, …, a_{n-1}, aₙ]
Critical time W∞ = 2·log n, but it performs W₁ = 2·n operations: twice as expensive as the sequential algorithm.
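As a reference point, a minimal C++ sketch of the sequential loop above (illustrative only: the functor op stands for any associative operation ∘, and all names are assumptions):

    #include <vector>

    // Sequential prefix: pi[i] = a[0] op a[1] op ... op a[i].
    // Performs exactly n operations (W1 = Winf = n).
    template <typename T, typename Op>
    std::vector<T> prefix_seq(const std::vector<T>& a, Op op) {
        if (a.empty()) return {};
        std::vector<T> pi(a.size());
        pi[0] = a[0];
        for (std::size_t i = 1; i < a.size(); ++i)
            pi[i] = op(pi[i - 1], a[i]);   // pi_i = pi_{i-1} op a_i
        return pi;
    }

With op = std::plus<double>() this computes the running sums used in the experiments later in the talk.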
5
Prefix computation: an example where parallelism always costs
  • Any parallel prefix algorithm with critical time W∞ performs W₁ ≥ 2n − W∞ operations; hence, on p processors, its time is at least 2n / (p+1).
  • This strict lower bound is matched by a block algorithm with pipeline [Nicolau et al. 1996].
  • Question: how to design a generic parallel algorithm, independent of the architecture, that achieves optimal performance on any given architecture?
  • => design a malleable algorithm, where the schedule adapts the number of operations performed to the architecture.

6
Architecture model
- Heterogeneous processors with changing speeds [Bender-Rabin 02] => Π_i(t) = instantaneous speed of processor i at time t, in operations per second.
- Average speed per processor for a computation of duration T: Π_ave = (1 / (p·T)) · Σ_i ∫₀ᵀ Π_i(t) dt.
- Lower bound for the time of prefix computation: T ≥ 2n / ((p+1)·Π_ave).
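Written out in LaTeX (a reconstruction of the slide's lost formulas from the definitions above; the averaging formula follows [Bender-Rabin 02]):

    \Pi_{\mathrm{ave}} \;=\; \frac{1}{p\,T}\sum_{i=1}^{p}\int_{0}^{T}\Pi_i(t)\,dt,
    \qquad
    T \;\ge\; \frac{2n}{(p+1)\,\Pi_{\mathrm{ave}}}.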
7
Work-stealing (1/2)
Work W₁ = total number of operations performed.
Depth W∞ = number of operations on a critical path (= parallel time on an unbounded number of resources).
  • Work-stealing: a greedy schedule, but distributed and randomized.
  • Each processor manages locally the tasks it creates.
  • When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (chosen at random).

8
Work-stealing (2/2)
Work W₁ = total number of operations performed.
Depth W∞ = number of operations on a critical path (= parallel time on an unbounded number of resources).
  • Benefits: -> suited to heterogeneous architectures, with a slight modification [Bender-Rabin 02]; -> if W∞ is small enough: a near-optimal processor-oblivious schedule, with good probability, on p processors with average speed Π_ave.
  • NB: the number of successful steals (= task migrations) is O(p·W∞) [Blumofe 98, Narlikar 01, Bender 02].
  • Implementation: the work-first principle (Cilk: series-parallel; Kaapi: dataflow) -> move the scheduling overhead onto the steal operations (the infrequent case) -> in the common case, local parallelism is implemented by a sequential function call. A sketch of the underlying task deque follows below.
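A minimal, mutex-based C++ sketch of the per-processor task deque implied by this discipline (illustrative only: production runtimes such as Cilk or Kaapi use lock-free protocols; all names are assumptions):

    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>

    // Owner pushes/pops at the back (depth-first, sequential order);
    // idle processors steal the OLDEST ready task from the front.
    class WorkStealingDeque {
        std::deque<std::function<void()>> tasks_;
        std::mutex m_;
    public:
        void push(std::function<void()> t) {           // owner only
            std::lock_guard<std::mutex> lk(m_);
            tasks_.push_back(std::move(t));
        }
        std::optional<std::function<void()>> pop() {   // owner only (LIFO)
            std::lock_guard<std::mutex> lk(m_);
            if (tasks_.empty()) return std::nullopt;
            auto t = std::move(tasks_.back());
            tasks_.pop_back();
            return t;
        }
        std::optional<std::function<void()>> steal() { // thief (FIFO: oldest)
            std::lock_guard<std::mutex> lk(m_);
            if (tasks_.empty()) return std::nullopt;
            auto t = std::move(tasks_.front());
            tasks_.pop_front();
            return t;
        }
    };

Popping at the same end the owner pushes keeps the sequential depth-first order (the work-first principle); stealing at the opposite end takes the oldest, typically largest, ready task.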

9
How to get both optimal work W₁ and small W∞?
  • General approach: mix both
  • a sequential algorithm with optimal work W₁,
  • and a fine-grain parallel algorithm with minimal critical time W∞.
  • Folk technique: parallel, then sequential
  • Use the parallel algorithm down to a certain "grain", then switch to the sequential one.
  • Drawback: W∞ increases, and so does the number of steals.
  • Work-preserving speed-up technique [Bini-Pan 94]: sequential, then parallel (cascading [Jaja 92]): a careful interplay of both algorithms to build one with both W∞ small and W₁ = O(W_seq).
  • Use the work-optimal sequential algorithm to reduce the size.
  • Then use the time-optimal parallel algorithm to decrease the time. Drawback: sequential at coarse grain and parallel at fine grain.

10
Alternative: concurrently sequential and parallel
  • Based on the work-first principle:
  • always execute the sequential algorithm, to reduce the parallelism overhead;
  • use the parallel algorithm only when a processor becomes idle (i.e. steals), by extracting parallelism from the remaining sequential computation.
  • Hypothesis: two algorithms are available:
  • - 1. a sequential one: SeqCompute;
  • - 2. a parallel one: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm.
  • Self-adaptive granularity, based on work-stealing (a sketch for prefix follows below).
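A much-simplified C++ sketch of this coupling for prefix sums, with a single fixed steal (a hypothetical illustration, not the actual Kaapi implementation: there, the split point is chosen dynamically at steal time and the finalization pass is itself adaptive):

    #include <cstddef>
    #include <thread>
    #include <vector>

    // pi must have size n = a.size(); assumes n >= 2.
    void adaptive_prefix_2p(std::vector<double>& pi, const std::vector<double>& a) {
        const std::size_t n = a.size();
        const std::size_t split = n / 2;           // fixed split for clarity
        std::thread thief([&] {                    // LastPartComputation:
            pi[split] = a[split];                  // block-local prefixes of [split, n)
            for (std::size_t i = split + 1; i < n; ++i)
                pi[i] = pi[i - 1] + a[i];
        });
        pi[0] = a[0];                              // SeqCompute on [0, split)
        for (std::size_t i = 1; i < split; ++i)
            pi[i] = pi[i - 1] + a[i];
        thief.join();
        const double carry = pi[split - 1];        // finalize the stolen block:
        for (std::size_t i = split; i < n; ++i)    // inject the carry (one extra
            pi[i] += carry;                        // operation per stolen element)
    }

Note that each stolen element costs two operations (one in the block-local prefix, one to inject the carry): exactly the "parallelism always costs" phenomenon of slide 5.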

11-16
Adaptive Prefix on 3 processors
[Animation, slides 11-16: one sequential process advances π₁, π₂, …; when idle, the two other processors steal the last part of the remaining input and compute partial products (π₅ … π₁₂ in the frames) in parallel; the critical path remains implicit on the sequential process.]
17
Analysis of the algorithm
  • Execution time: matches the lower bound below up to an additive O(log n / Π_ave) term (see the LaTeX sketch after this slide).
  • Sketch of the proof: a dynamic coupling of two algorithms that complete simultaneously:
  • the sequential (optimal) one performs S operations on one processor;
  • the parallel one runs in minimal time, but performs X operations on the other processors;
  • dynamic splitting is always possible down to the finest grain, BUT it stays locally sequential;
  • the critical path is small (e.g. log X);
  • each non-constant-time task can potentially be split (variable speeds);
  • the algorithmic scheme ensures T_s = T_p + O(log X) => this bounds the total number X of operations performed, and hence the overhead of parallelism, (S + X) − #ops_optimal.

Lower bound: T ≥ 2n / ((p+1)·Π_ave).
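In LaTeX, the resulting bound can be written as follows (a sketch under the stated model; the exact statement and constants are in the accompanying paper):

    T_p \;\le\; \frac{2n}{(p+1)\,\Pi_{\mathrm{ave}}} \;+\; O\!\left(\frac{\log n}{\Pi_{\mathrm{ave}}}\right),
    \qquad\text{versus the lower bound}\qquad
    T_p \;\ge\; \frac{2n}{(p+1)\,\Pi_{\mathrm{ave}}}.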
18
Adaptive prefix: experiments (1)
Prefix sum of 8·10⁶ doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), single-user context.
[Figure: time (s) vs. number of processors]
Single-user context: the processor-oblivious prefix achieves near-optimal performance:
  • close to the lower bound, both on 1 processor and on p processors;
  • less sensitive to system overhead: even better than the theoretically optimal off-line parallel algorithm on p processors.
19
Adaptive prefix: experiments (2)
Prefix sum of 8·10⁶ doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), multi-user context with external load (9−p external processes).
[Figure: time (s) vs. number of processors]
Multi-user context: (9−p) additional external dummy processes are executed concurrently. The processor-oblivious prefix computation is always the fastest: a 15% benefit over a parallel algorithm for p processors with an off-line schedule.
20
Conclusion
  • Coupling an on-line parallel algorithm with a work-stealing schedule is useful for the design of processor-oblivious algorithms.
  • Application to prefix computation:
  • - theoretically, it reaches the lower bound on heterogeneous processors with changing speeds;
  • - practically, it achieves near-optimal performance on multi-user SMPs.
  • This is a generic adaptive scheme to implement parallel algorithms with provable performance.
  • - Work in progress: parallel 3D reconstruction (oct-tree scheme with a deadline constraint).

21
Thank you!
Interactive Distributed Simulation [B. Raffin, E. Boyer]: 5 cameras, 6 PCs; 3D reconstruction + simulation + rendering -> an adaptive scheme to maximize the 3D-reconstruction precision within a fixed time step.
22
The Prefix race: sequential / parallel fixed / adaptive
On each of the 10 executions, the adaptive version completes first.
23
Adaptive prefix: some experiments
Prefix of 10000 elements on an 8-processor SMP (IA64 / Linux), with and without external load.
[Figures: time (s) vs. number of processors, single-user and multi-user contexts]
Multi-user context: Adaptive is the fastest, with a 15% benefit over a static-grain algorithm.
  • Single-user context: Adaptive is equivalent to
  • - the sequential algorithm on 1 processor,
  • - the optimal parallel algorithm for 2 processors on 2 processors,
  • - …
  • - the optimal parallel algorithm for 8 processors on 8 processors.

24
With a double sum (r_i = r_{i-1} + x_i)
Finest grain limited to 1 page = 16384 bytes = 2048 doubles.
[Figures: single-user context; processors with variable speeds]
Remark, for n = 4,096,000 doubles:
  • pure sequential: 0.20 s;
  • minimal grain of 100 doubles: 0.26 s on 1 processor and 0.175 s on 2 processors (close to the lower bound).