Title: Processor-oblivious parallel algorithms and scheduling Illustration on parallel prefix
1 Processor-oblivious parallel algorithms and scheduling: Illustration on parallel prefix
- Jean-Louis Roch, Daouda Traore
- INRIA-CNRS Moais team - LIG Grenoble, France
Contents
I. What is a processor-oblivious parallel algorithm?
II. Work-stealing scheduling of parallel algorithms
III. Processor-oblivious parallel prefix computation
Workshop Scheduling Algorithms for New Emerging Applications - CIRM Luminy - May 29th-June 2nd, 2006
2 The problem
Problem: compute f(a)
Candidate algorithms: the sequential algorithm, a parallel algorithm for P=2, a parallel algorithm for P=100, ..., a parallel algorithm for P=max
Which algorithm to choose?
Target architectures: multi-user SMP server, grid, heterogeneous network
Dynamic architecture: non-fixed number of resources, variable speeds; e.g. a grid, but also an SMP server in multi-user mode
3 Processor-oblivious algorithms
Dynamic architecture: non-fixed number of resources, variable speeds; e.g. a grid, but also an SMP server in multi-user mode
=> motivates a processor-oblivious parallel algorithm that:
- is independent from the underlying architecture: no reference to p, nor to Πi(t) = speed of processor i at time t, nor ...
- on a given architecture, has performance guarantees: behaves as well as an optimal (off-line, non-oblivious) one
Problem: often, the larger the parallel degree, the larger the number of operations to perform!
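4 Prefix computation
- Prefix problem:
  - input: a0, a1, ..., an
  - output: π0, π1, ..., πn with πi = a0 * a1 * ... * ai, where * is an associative operation
- Sequential algorithm: π[0] = a[0]; for (i = 1; i <= n; i++) π[i] = π[i-1] * a[i]
  performs W1 = W∞ = n operations
- Fine grain optimal parallel algorithm [Ladner-Fischer]:
  [Figure: parallel prefix circuit over the inputs a0, a1, a2, a3, a4, ..., an-1, an]
  critical time W∞ = 2 log n, but performs W1 = 2n ops: twice as expensive as the sequential algorithm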
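The work/depth trade-off above can be checked on a small example. The sketch below is not the Ladner-Fischer circuit itself: it uses the classic recursive pairwise scheme (combine neighbours, recurse, fix up the even positions), which has the same flavour of about 2n operations and about 2 log n depth. The function names and the use of `+` as the associative operation are assumptions made for illustration.

```python
from itertools import accumulate

def prefix_sequential(a):
    """Sequential prefix sums; returns (prefixes, operation count)."""
    out, ops = [a[0]], 0
    for x in a[1:]:
        out.append(out[-1] + x)
        ops += 1
    return out, ops

def prefix_recursive(a):
    """Pairwise recursive prefix; returns (prefixes, ops, depth)."""
    n = len(a)
    if n == 1:
        return [a[0]], 0, 0
    # Step 1: combine neighbours (all pairs are independent: one parallel step).
    pairs = [a[i] + a[i + 1] for i in range(0, n - 1, 2)]
    sub, ops, depth = prefix_recursive(pairs)
    ops += len(pairs)
    # Step 2: odd positions come straight from the recursion,
    # even positions need one more operation each (again independent).
    out = [None] * n
    out[0] = a[0]
    fix = 0
    for i in range(1, n):
        if i % 2 == 1:
            out[i] = sub[i // 2]
        else:
            out[i] = sub[i // 2 - 1] + a[i]
            fix += 1
    return out, ops + fix, depth + 1 + (1 if fix else 0)

if __name__ == "__main__":
    data = list(range(1, 1025))
    _, w_seq = prefix_sequential(data)
    _, w_par, d_par = prefix_recursive(data)
    print(w_seq, w_par, d_par)
```

For 1024 inputs the sequential loop performs 1023 operations, while the recursive scheme performs close to twice as many in exchange for a logarithmic depth.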
5 Prefix computation: an example where parallelism always costs
- Any parallel algorithm with critical time W∞ runs on p processors in a time bounded below by a strict lower bound, which is reached by a block algorithm + pipeline [Nicolau & al. 1996]
- Question: how to design a generic parallel algorithm, independent from the architecture, that achieves optimal performance on any given architecture?
- => design a malleable algorithm whose scheduling adapts the number of operations performed to the architecture
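A block algorithm of the kind mentioned above can be sketched as follows. This is a simplified illustration, not the cited optimal schedule: it assumes p identical simulated processors, p + 1 equal blocks, unit-cost additions, and it replaces the pipeline by a short sequential pass over the block totals. It is written under the assumption that the bound in question is 2n/(p+1) operations per processor; the function name `block_prefix` is hypothetical.

```python
from itertools import accumulate

def block_prefix(a, p):
    """Prefix sums of a with p simulated processors and p + 1 blocks.

    Returns (prefixes, ops) where ops[k] counts the additions of processor k.
    Assumes len(a) is a multiple of p + 1 (simplification for the sketch).
    """
    n = len(a)
    m = n // (p + 1)
    assert m >= 1 and m * (p + 1) == n
    out = list(a)
    ops = [0] * p
    # Phase 1: processor k computes the local prefix of block k.
    # The last block (index p) is left untouched for now.
    for k in range(p):
        for i in range(k * m + 1, (k + 1) * m):
            out[i] = out[i - 1] + a[i]
            ops[k] += 1
    # Phase 2: processor 0 turns block totals into block offsets.
    offset = [0] * (p + 1)          # offset[k] = sum of blocks 0..k-1
    offset[1] = out[m - 1]
    for k in range(2, p + 1):
        offset[k] = offset[k - 1] + out[k * m - 1]
        ops[0] += 1
    # Phase 3: processors 1..p-1 add their offset to their own block,
    # while processor 0 scans the last block sequentially.
    for k in range(1, p):
        for i in range(k * m, (k + 1) * m):
            out[i] += offset[k]
            ops[k] += 1
    acc = offset[p]
    for i in range(p * m, n):
        acc += a[i]
        out[i] = acc
        ops[0] += 1
    return out, ops

if __name__ == "__main__":
    result, ops = block_prefix(list(range(1200)), 3)
    print(max(ops), 2 * 1200 // (3 + 1))
```

Each processor handles one block twice (local prefix, then offset) or two blocks once, so the busiest processor performs roughly 2n/(p+1) additions plus a small term in p.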
6 Architecture model
- Heterogeneous processors with changing speed [Bender-Rabin02]
  => Πi(t) = instantaneous speed of processor i at time t, in operations per second
- Average speed per processor for a computation with duration T: Πave
- Lower bound for the time of prefix computation
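The two quantities named above can be written out as follows. The definition of the average speed follows directly from its description; the last line is an assumed form of the lower bound, obtained by rescaling the identical-processor bound 2n/(p+1) by the average speed, and should be read as a sketch rather than as the exact statement of the talk.

```latex
% Average speed per processor over a computation of duration T
\Pi_{ave} \;=\; \frac{1}{p\,T}\sum_{i=1}^{p}\int_{0}^{T}\Pi_i(t)\,dt

% Total number of operations the p processors can execute within T
\sum_{i=1}^{p}\int_{0}^{T}\Pi_i(t)\,dt \;=\; p\,T\,\Pi_{ave}

% Assumed form of the lower bound for a prefix of n inputs
T \;\ge\; \frac{2n}{(p+1)\,\Pi_{ave}}
```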
7 Work-stealing (1/2)
Work W1 = total number of operations performed
Depth W∞ = number of operations on a critical path (parallel time on ∞ resources)
- Work-stealing = greedy schedule, but distributed and randomized
- Each processor locally manages the tasks it creates
- When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen)
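The stealing rule described above can be illustrated with a small round-based simulation. This is a minimal sketch under stated assumptions: processors are simulated in lockstep, a task is an index range of a divide-and-conquer sum, and the names `work_stealing_sum` and `grain` are hypothetical. Each owner works on its newest task, and an idle processor takes the oldest task of a randomly chosen victim.

```python
import random
from collections import deque

def work_stealing_sum(a, p, grain=4, seed=0):
    """Round-based simulation of work stealing on a divide-and-conquer sum.

    A task is a half-open index range (lo, hi). Returns (total, steals).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(p)]
    deques[0].append((0, len(a)))            # the root task starts on processor 0
    total, steals = 0, 0
    while any(deques):
        for w in range(p):
            if deques[w]:
                lo, hi = deques[w].pop()     # owner takes its newest task
                if hi - lo <= grain:
                    total += sum(a[lo:hi])   # small task: run it sequentially
                else:
                    mid = (lo + hi) // 2     # large task: fork two children
                    deques[w].append((mid, hi))
                    deques[w].append((lo, mid))
            else:
                v = rng.randrange(p)         # idle: pick a random victim
                if v != w and deques[v]:
                    # steal the victim's oldest ready task
                    deques[w].append(deques[v].popleft())
                    steals += 1
    return total, steals

if __name__ == "__main__":
    data = list(range(1000))
    print(work_stealing_sum(data, 4))
```

Stealing the oldest task tends to move the largest remaining piece of work, which keeps the number of steals small compared with the total number of tasks.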
8 Work-stealing (2/2)
- Interests:
  -> suited to heterogeneous architectures, with a slight modification [Bender-Rabin02]
  -> if W∞ is small enough: near-optimal processor-oblivious schedule, with good probability, on p processors with average speed Πave
  NB: number of successful steals = number of task migrations < p W∞ [Blumofe 98, Narlikar 01, Bender 02]
- Implementation: work-first principle [Cilk: series-parallel, Kaapi: dataflow]
  -> Move the scheduling overhead onto the steal operations (infrequent case)
  -> General case: local parallelism is implemented by a sequential function call
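The "near-optimal if W∞ is small enough" claim is usually expressed by a bound of the following shape. The exact constants and probability statement belong to the cited works; the form below is an assumption given only to show why a small W∞ matters.

```latex
% Assumed shape of the work-stealing bound on p processors of average speed \Pi_{ave}
T_p \;\le\; \frac{W_1}{p\,\Pi_{ave}} \;+\; O\!\left(\frac{W_\infty}{\Pi_{ave}}\right)

% The first term is the ideal time; it dominates as soon as
W_\infty \;\ll\; \frac{W_1}{p}
```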
9 How to get both optimal work W1 and small W∞?
- General approach: mix both
  - a sequential algorithm with optimal work W1
  - and a fine grain parallel algorithm with minimal critical time W∞
- Folk technique: parallel, then sequential
  - Parallel algorithm until a certain grain, then use the sequential one
  - Drawback: W∞ increases ;o) and so does the number of steals
- Work-preserving speed-up technique [Bini-Pan94]: sequential, then parallel. Cascading [Jaja92]: careful interplay of both algorithms to build one with both W∞ small and W1 = O(Wseq)
  - Use the work-optimal sequential algorithm to reduce the size
  - Then use the time-optimal parallel algorithm to decrease the time
  - Drawback: sequential at coarse grain and parallel at fine grain :o(
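The drawback of the folk technique can be made concrete by counting work and depth for a divide-and-conquer sum that switches to the sequential loop below a grain threshold. This is a small cost model, not a scheduler; the function name `sum_work_depth` is hypothetical and `grain` is assumed to be at least 1.

```python
def sum_work_depth(n, grain):
    """(work, depth) of a divide-and-conquer sum of n items that switches to
    the sequential loop once a sub-problem has at most `grain` items."""
    if n <= grain:
        return n - 1, n - 1          # sequential leaf: n - 1 additions in a row
    left = sum_work_depth(n // 2, grain)
    right = sum_work_depth(n - n // 2, grain)
    # One extra addition combines the two halves, which run in parallel.
    return left[0] + right[0] + 1, max(left[1], right[1]) + 1

if __name__ == "__main__":
    for grain in (1, 64, 1024):
        print(grain, sum_work_depth(1024, grain))
    # prints:
    # 1 (1023, 10)
    # 64 (1023, 67)
    # 1024 (1023, 1023)
```

The work stays at 1023 additions whatever the grain, but the depth grows from 10 to 67 to 1023: a fixed grain trades critical time for lower scheduling overhead, and the right value depends on the architecture.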
10 Alternative: concurrently sequential and parallel
- Based on the work-first principle:
  - Always execute a sequential algorithm, to reduce parallelism overhead
  - Use the parallel algorithm only if a processor becomes idle (i.e. steals), by extracting parallelism from the sequential computation
- Hypothesis: two algorithms
  - 1 sequential: SeqCompute
  - 1 parallel: LastPartComputation; at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
- Self-adaptive granularity based on work-stealing
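The scheme above can be sketched on the simplest possible loop, a sum, where extracting the last part of the remaining work needs no fix-up. This is an illustrative simulation under assumptions: workers run in lockstep rounds, `speeds[w]` is a hypothetical number of elements worker w processes per round, and the function name `adaptive_sum` is not from the source. Every worker runs the plain sequential loop on its interval; only an idle worker triggers a split, by taking the last half of what a random victim has left.

```python
import random

def adaptive_sum(a, speeds, seed=0):
    """Adaptive reduction: sequential loops plus extraction of the last part
    of a victim's remaining interval. Returns (total, steals, rounds)."""
    rng = random.Random(seed)
    p = len(speeds)
    lo = [0] * p                 # worker w still has to process a[lo[w]:hi[w]]
    hi = [0] * p
    partial = [0] * p
    hi[0] = len(a)               # worker 0 starts with the whole range
    steals = rounds = 0
    while any(lo[w] < hi[w] for w in range(p)):
        rounds += 1
        for w in range(p):
            if lo[w] < hi[w]:
                # SeqCompute: plain sequential loop, speeds[w] elements per round
                stop = min(hi[w], lo[w] + speeds[w])
                for i in range(lo[w], stop):
                    partial[w] += a[i]
                lo[w] = stop
            else:
                v = rng.randrange(p)            # idle: pick a random victim
                remaining = hi[v] - lo[v]
                if v != w and remaining >= 2:
                    # LastPartComputation: take the last half of what is left
                    mid = hi[v] - remaining // 2
                    lo[w], hi[w] = mid, hi[v]
                    hi[v] = mid
                    steals += 1
    return sum(partial), steals, rounds

if __name__ == "__main__":
    data = list(range(1000))
    print(adaptive_sum(data, [1, 1, 1]))
    print(adaptive_sum(data, [1, 2, 5]))
```

No grain parameter appears: the size of each piece is decided by when a worker happens to become idle, so faster workers end up with more elements without any reference to p or to the speeds inside the algorithm itself.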
11-16 Adaptive Prefix on 3 processors
[Figure: six animation steps showing the sequential process and the parallel (stealing) processes computing the prefixes π1 to π12]
Implicit critical path on the sequential process
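The behaviour shown in the animation can be approximated by a small simulation. This is a simplified sketch with two workers instead of three, lockstep rounds, and `+` as the operation; all names are hypothetical and the schedule is not tuned to reach the lower bound. The sequential process Ps computes prefixes left to right; the thief steals the last half of the remaining interval and builds local, offset-free prefixes; when Ps reaches the stolen part it jumps over what the thief already computed, and the skipped elements are finalised by adding the offset.

```python
from itertools import accumulate

def adaptive_prefix_two_workers(a):
    """Returns (prefixes, ops_seq, ops_par, rounds) for a simulated run."""
    n = len(a)
    pi = [None] * n       # final prefixes
    loc = [None] * n      # thief's local prefixes (missing the left offset)
    pos, end = 0, n       # Ps owns a[pos:end]
    acc = 0               # value of pi[pos - 1]; 0 is the neutral element of +
    stolen = None         # [t_lo, t_pos, t_hi]: thief builds loc on a[t_lo:t_hi]
    fin = None            # [f_pos, f_hi, offset]: elements waiting for their offset
    ops_seq = ops_par = rounds = 0
    while pos < n or fin:
        rounds += 1
        # Sequential process Ps
        if pos < end:
            acc = acc + a[pos]
            pi[pos] = acc
            pos += 1
            ops_seq += 1
        elif stolen:
            # Ps reached the stolen part: preempt the thief and jump over it.
            t_lo, t_pos, t_hi = stolen
            if t_pos > t_lo:
                fin = [t_lo, t_pos - 1, acc]
                acc = acc + loc[t_pos - 1]
                pi[t_pos - 1] = acc
                ops_seq += 1
            pos, end, stolen = t_pos, t_hi, None
        elif fin and fin[0] < fin[1]:
            # Ps has no sequential work left: help finalise from the right end.
            fin[1] -= 1
            pi[fin[1]] = fin[2] + loc[fin[1]]
            ops_par += 1
        # Thief
        if fin:
            if fin[0] < fin[1]:
                pi[fin[0]] = fin[2] + loc[fin[0]]
                fin[0] += 1
                ops_par += 1
            if fin[0] >= fin[1]:
                fin = None
        elif stolen:
            t_lo, t_pos, t_hi = stolen
            if t_pos < t_hi:
                if t_pos == t_lo:
                    loc[t_pos] = a[t_pos]
                else:
                    loc[t_pos] = loc[t_pos - 1] + a[t_pos]
                    ops_par += 1
                stolen[1] += 1
        elif end - pos >= 2:
            # Idle: steal the last half of Ps's remaining interval.
            mid = end - (end - pos) // 2
            stolen = [mid, mid, end]
            end = mid
    return pi, ops_seq, ops_par, rounds

if __name__ == "__main__":
    data = list(range(1, 1001))
    pi, s, x, rounds = adaptive_prefix_two_workers(data)
    print(s, x, rounds)
```

The total number of operations exceeds the sequential count because stolen elements are touched twice, which is the price of parallelism discussed earlier, while the number of rounds stays below the length of the input.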
17 Analysis of the algorithm
- Execution time: the leading term matches the lower bound
- Sketch of the proof:
  - Dynamic coupling of two algorithms that complete simultaneously:
    - Sequential: (optimal) number of operations S on one processor
    - Parallel: minimal time, but performs X operations on the other processors
  - Dynamic splitting is always possible down to the finest grain, BUT the computation is locally sequential
  - The critical path is small (e.g. log X)
  - Each non-constant-time task can potentially be split (variable speeds)
  - The algorithmic scheme ensures Ts = Tp + O(log X)
    => this makes it possible to bound the total number X of operations performed, and the overhead of parallelism = (S + X) - #ops_optimal
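The proof sketch can be turned into a back-of-envelope computation. The derivation below is a simplified reading under explicit assumptions: p >= 2 processors of identical speed Π, one operation per element handled by the sequential process, two operations per element handled in parallel (local prefix, then offset), and the O(log X) term dropped.

```latex
% Every one of the n elements is handled either sequentially (1 op) or in parallel (2 ops)
S + \tfrac{X}{2} = n

% Both parts complete together: T_s = T_p
\frac{S}{\Pi} = \frac{X}{(p-1)\,\Pi}

% Solving the two equations
S = \frac{2n}{p+1}, \qquad X = \frac{2n\,(p-1)}{p+1}, \qquad
T \approx \frac{S}{\Pi} = \frac{2n}{(p+1)\,\Pi}

% Overhead of parallelism with respect to the n sequential operations
(S + X) - n = \frac{n\,(p-1)}{p+1}
```

Under these assumptions the execution time has the form 2n/((p+1)Π), and the extra operations never exceed n, in line with the parallel algorithm being at most twice as expensive as the sequential one.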
18 Adaptive prefix: experiments 1
Prefix sum of 8.10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux)
Single-user context
[Figure: time (s) versus number of processors]
Single-user context: processor-oblivious prefix achieves near-optimal performance:
- close to the lower bound both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically optimal off-line parallel algorithm on p processors
19 Adaptive prefix: experiments 2
Prefix sum of 8.10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux)
Multi-user context
[Figure: time (s) versus number of processors, with an external load of (9-p) external processes]
Multi-user context: additional external load: (9-p) additional external dummy processes are executed concurrently
Processor-oblivious prefix computation is always the fastest: 15% benefit over a parallel algorithm for p processors with off-line schedule
20 Conclusion
- The interplay of an on-line parallel algorithm directed by a work-stealing schedule is useful for the design of processor-oblivious algorithms
- Application to prefix computation:
  - theoretically reaches the lower bound on heterogeneous processors with changing speeds
  - practically, achieves near-optimal performance on multi-user SMPs
- Generic adaptive scheme to implement parallel algorithms with provable performance
  - work in progress: parallel 3D reconstruction [oct-tree scheme with deadline constraint]
21 Thank you!
Interactive Distributed Simulation [B. Raffin & E. Boyer]
- 5 cameras
- 6 PCs
3D-reconstruction + simulation + rendering
-> Adaptive scheme to maximize 3D-reconstruction precision within a fixed timestamp
22 The Prefix race: sequential/parallel, fixed/adaptive
On each of the 10 executions, adaptive completes first
23 Adaptive prefix: some experiments
Prefix of 10000 elements on an 8-processor SMP (IA64 / Linux)
[Figures: time (s) versus number of processors, in single-user context and with external load]
Multi-user context: adaptive is the fastest, with 15% benefit over a static grain algorithm
- Single-user context
- Adaptive is equivalent to:
  - sequential on 1 proc
  - optimal parallel-2 proc. on 2 processors
  - ...
  - optimal parallel-8 proc. on 8 processors
24 With double sum (r[i] = r[i-1] + x[i])
Finest grain limited to 1 page = 16384 bytes = 2048 doubles
[Figures: single user; processors with variable speeds]
Remark, for n = 4,096,000 doubles:
- pure sequential: 0.20 s
- minimal grain = 100 doubles: 0.26 s on 1 proc and 0.175 s on 2 procs (close to lower bound)