Processor-oblivious parallel algorithms and scheduling: Illustration on parallel prefix

1
Processor-oblivious parallel algorithms and
scheduling Illustration on parallel prefix

Jean-Louis Roch, Daouda Traore
INRIA-CNRS Moais team - LIG Grenoble, France

Contents
I. What is a processor-oblivious parallel algorithm?
II. Work-stealing scheduling of parallel algorithms
III. Processor-oblivious parallel prefix computation
Workshop "Scheduling Algorithms for New Emerging Applications" - CIRM Luminy - May 29th-June 2nd, 2006
2
The problem
Problem: compute f(a)
Which algorithm to choose?
- Sequential algorithm
- Parallel algorithm for P = 2
- Parallel algorithm for P = 100
- ...
- Parallel algorithm for P = max
Target platforms: multi-user SMP server, grid, heterogeneous network
Dynamic architecture: non-fixed number of resources, variable speeds (e.g. a grid, but also an SMP server in multi-user mode)
3
Processor-oblivious algorithms
Dynamic architecture: non-fixed number of resources, variable speeds (e.g. a grid, but also an SMP server in multi-user mode)
=> motivates the design of a processor-oblivious parallel algorithm that
- is independent from the underlying architecture: no reference to p, nor to Π_i(t) (speed of processor i at time t), nor ...
- on a given architecture, has performance guarantees: behaves as well as an optimal (off-line, non-oblivious) one
Problem: often, the larger the parallel degree, the larger the number of operations to perform!
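The closing remark of this slide can be illustrated on prefix sums. The sketch below is a standard block-wise prefix scheme tuned for a fixed p; it is not taken from the slides, and the name `block_prefix` is hypothetical. It only counts operations, to show the count growing from about n to about 2n as p increases.

```python
from operator import add

def block_prefix(a, p, op=add):
    """Block-wise prefix for a fixed number p of blocks (assumes 1 <= p <= len(a)).

    Returns (prefixes, number of applications of op).
    """
    n = len(a)
    bounds = [(k * n // p, (k + 1) * n // p) for k in range(p)]
    out, ops = list(a), 0
    # Phase 1 (blocks are independent): sequential prefix inside each block.
    for lo, hi in bounds:
        for i in range(lo + 1, hi):
            out[i] = op(out[i - 1], out[i])
            ops += 1
    # Phase 2 (sequential): offsets[k] = a[0] op ... op (last element of block k).
    offsets = [out[bounds[0][1] - 1]]
    for lo, hi in bounds[1:-1]:
        offsets.append(op(offsets[-1], out[hi - 1]))
        ops += 1
    # Phase 3 (blocks are independent): shift every block except the first.
    for k in range(1, p):
        lo, hi = bounds[k]
        for i in range(lo, hi):
            out[i] = op(offsets[k - 1], out[i])
            ops += 1
    return out, ops

if __name__ == "__main__":
    data = list(range(1, 17))
    for p in (1, 2, 4, 16):
        print(p, block_prefix(data, p)[1])   # operation counts: 15, 22, 26, 29
```

With 16 inputs, the operation count rises from 15 (p = 1) to 29 (p = 16): a scheme sized for many processors does almost twice the work when run on few.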
4
Prefix computation

Prefix problem
- input: a0, a1, ..., an
- output: π0, π1, ..., πn with πi = a0 * a1 * ... * ai
Sequential algorithm: π[0] = a[0]; for (i = 1; i <= n; i++) π[i] = π[i-1] * a[i];
- performs W1 = W∞ = n operations
Fine-grain optimal parallel algorithm [Ladner-Fischer]
- critical time W∞ = 2 log n, but performs W1 = 2n operations: twice as expensive as the sequential algorithm
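The work/depth trade-off on this slide can be checked on a small case. The sketch below uses a standard recursive pairing scheme, which is not necessarily the exact Ladner-Fischer construction; the function names are hypothetical. It counts operations and depth to show roughly 2n operations for a depth of about 2 log n.

```python
from operator import add

def sequential_prefix(a, op=add):
    """Inclusive prefix: out[i] = a[0] op ... op a[i], using len(a) - 1 operations."""
    out = [a[0]]
    for x in a[1:]:
        out.append(op(out[-1], x))
    return out

def recursive_prefix(a, op=add):
    """Pairing scheme; returns (prefixes, number of operations, depth)."""
    n = len(a)
    if n == 1:
        return [a[0]], 0, 0
    # Step 1: combine adjacent pairs (all independent, one level of depth).
    pairs = [op(a[i], a[i + 1]) for i in range(0, n - 1, 2)]
    # Step 2: solve the half-size problem recursively.
    sub, ops, depth = recursive_prefix(pairs, op)
    ops += len(pairs)
    # Step 3: odd positions come from the recursion, even ones need one more op.
    out = [a[0]] + [None] * (n - 1)
    for j, s in enumerate(sub):
        out[2 * j + 1] = s
    extra = 0
    for i in range(2, n, 2):
        out[i] = op(out[i - 1], a[i])
        extra += 1
    return out, ops + extra, depth + 1 + (1 if extra else 0)

if __name__ == "__main__":
    data = list(range(1, 9))
    out, ops, depth = recursive_prefix(data)
    assert out == sequential_prefix(data)
    print(ops, depth)   # 11 operations at depth 5, versus 7 sequential operations
```

For 8 inputs the pairing scheme performs 11 operations (2n - log n - 2) at depth 5 (2 log n - 1), against 7 operations for the sequential loop.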
5
Prefix computation: an example where parallelism always costs

Any parallel algorithm with critical time W∞ runs on p processors in time [lower-bound equation on the slide, not preserved]
- strict lower bound: block algorithm + pipeline [Nicolau & al. 1996]
Question: how to design a generic parallel algorithm, independent from the architecture, that achieves optimal performance on any given architecture?
=> design a malleable algorithm whose scheduling adapts the number of operations performed to the architecture
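The equation for this bound is missing from the transcript. One candidate, given here purely as an assumption, is the following bound for n inputs on p identical processors, with one time unit per operation. It is consistent with the timings quoted on the last slide (0.26 s on 1 processor and 0.175 s on 2 processors, described as close to the lower bound) when the 1-processor time is taken as the reference.

```latex
T_p \;\ge\; \frac{2n}{p+1},
\qquad
p = 2:\quad \frac{2}{3}\times 0.26\,\mathrm{s} \approx 0.173\,\mathrm{s}
\quad(\text{measured: } 0.175\,\mathrm{s}).
```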

6
Architecture model
- Heterogeneous processors with changing speeds [Bender-Rabin 02]
  => Π_i(t) = instantaneous speed of processor i at time t, in operations per second
- Average speed per processor for a computation with duration T: [equation on the slide, not preserved]
- Lower bound for the time of prefix computation: [equation on the slide, not preserved]
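The average speed can be written out from the wording alone. In the sketch below, the symbol Π stands for the garbled speed symbol, and the integral form is the natural reading of "average speed per processor for a computation with duration T"; both are assumptions rather than the slide's exact notation.

```latex
\Pi_{\mathrm{ave}} \;=\; \frac{1}{p\,T}\,\sum_{i=1}^{p}\int_{0}^{T}\Pi_i(t)\,dt
\qquad\Longrightarrow\qquad
\text{operations completed by time } T \;\le\; p\,T\,\Pi_{\mathrm{ave}} .
```

Under this definition, any bound stated in operations for unit-speed processors converts to seconds by dividing by Π_ave.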
7
Work-stealing (1/2)
Work W1 = total number of operations performed
Depth W∞ = number of operations on a critical path (parallel time on ∞ resources)

Work-stealing = "greedy" schedule, but distributed and randomized
- Each processor manages locally the tasks it creates
- When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen)
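The two rules above can be tried out in a small round-based simulation. This is a toy model under stated assumptions, not the Cilk or Kaapi runtime: every worker performs one action per step, tasks are index ranges that split in two, and the name `simulate_work_stealing` is hypothetical.

```python
import random
from collections import deque

def simulate_work_stealing(n_leaves, p, seed=0):
    """Toy simulation of work stealing on p unit-speed workers.

    A task is a range (lo, hi). Executing a range of size 1 is one unit of
    useful work; executing a larger range splits it and pushes both halves
    on the local deque. Returns (time steps, work per worker, steals).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(p)]
    deques[0].append((0, n_leaves))          # all the work starts on worker 0
    done = [0] * p
    steals = steps = 0
    while sum(done) < n_leaves:
        steps += 1
        for w in range(p):
            if deques[w]:
                lo, hi = deques[w].pop()     # newest local task
                if hi - lo == 1:
                    done[w] += 1
                else:
                    mid = (lo + hi) // 2
                    deques[w].append((mid, hi))   # older half: exposed to thieves
                    deques[w].append((lo, mid))   # newer half: run next locally
            else:
                victim = rng.randrange(p)    # randomly chosen victim
                if victim != w and deques[victim]:
                    deques[w].append(deques[victim].popleft())  # oldest task
                    steals += 1
    return steps, done, steals

if __name__ == "__main__":
    print(simulate_work_stealing(8, 1))      # (15, [8], 0)
    steps, done, steals = simulate_work_stealing(1024, 4)
    print(steps, done, steals)
```

Stealing the oldest task hands the thief the largest pending range, which is why few steals are enough to balance the load.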

8
Work-stealing (2/2)
Work W1 = total number of operations performed
Depth W∞ = number of operations on a critical path (parallel time on ∞ resources)

Interests:
-> suited to heterogeneous architectures with slight modification [Bender-Rabin 02]
-> if W∞ is small enough, near-optimal processor-oblivious schedule with good probability on p processors with average speed Π_ave
NB: #successful steals = #task migrations < p.W∞ [Blumofe 98, Narlikar 01, Bender 02]
Implementation: work-first principle [Cilk: series-parallel, Kaapi: dataflow]
-> move the scheduling overhead onto the steal operations (infrequent case)
-> general case: "local parallelism" implemented by a sequential function call
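For reference, the classical work-stealing guarantee on p identical unit-speed processors reads as below. The second line, for processors of average speed Π_ave, is an assumed rescaling and not a formula recovered from the slide.

```latex
\mathbb{E}[T_p] \;\le\; \frac{W_1}{p} + O(W_\infty),
\qquad
\mathbb{E}[\#\text{steals}] \;=\; O(p\,W_\infty)
\\[4pt]
\text{(assumed heterogeneous form)}\qquad
T_p \;\le\; \frac{W_1}{p\,\Pi_{\mathrm{ave}}} + O\!\left(\frac{W_\infty}{\Pi_{\mathrm{ave}}}\right)
```

When W∞ is much smaller than W1/p, the first term dominates, which is the "near-optimal" regime referred to above.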

9
How to get both optimal work W1 and W∞ small?

General approach: mix both
- a sequential algorithm with optimal work W1
- and a fine-grain parallel algorithm with minimal critical time W∞
Folk technique: parallel, then sequential
- Parallel algorithm until a certain "grain", then use the sequential one
- Drawback: W∞ increases and, also, the number of steals
Work-preserving speed-up technique [Bini-Pan 94]: sequential, then parallel (cascading [Jaja 92])
Careful interplay of both algorithms to build one with both W∞ small and W1 = O(Wseq)
- Use the work-optimal sequential algorithm to reduce the size
- Then use the time-optimal parallel algorithm to decrease the time
- Drawback: sequential at coarse grain and parallel at fine grain

10
Alternative: concurrently sequential and parallel

Based on the work-first principle:
- always execute a sequential algorithm, to reduce parallelism overhead
- use the parallel algorithm only if a processor becomes idle (i.e. steals), by extracting parallelism from the sequential computation
Hypothesis: two algorithms
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation; at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
Self-adaptive granularity based on work-stealing
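The coupling described above can be mimicked for prefix computation with a round-based toy simulation. This is a sketch under assumptions, not the authors' implementation:
- worker 0 plays the SeqCompute role;
- idle helpers take the last half of the largest remaining interval (a deterministic stand-in for random victim selection) and compute local prefixes there;
- when the sequential process reaches a stolen interval it jumps over the part already done, and the skipped values are corrected later, one operation each.

The name `adaptive_prefix` and all internals are hypothetical, and the input must be non-empty.

```python
from operator import add

def adaptive_prefix(a, p, op=add):
    """Toy simulation: each worker does at most one op per step.
    Returns (prefixes, number of time steps)."""
    n = len(a)
    out = [None] * n            # final prefixes
    loc = [None] * n            # local prefixes inside stolen segments
    out[0] = a[0]
    cur, end = 1, n             # the sequential process owns [cur, end)
    segs = {}                   # start -> [lo, pos, hi, owner]; loc known on [lo, pos)
    own = [None] * p            # segment each helper is currently working on
    fin = []                    # pending corrections [lo, hi, base]: out[i] = base op loc[i]
    steps = 0

    def finalize_one():
        lo, hi, base = fin[-1]
        out[lo] = op(base, loc[lo])
        if lo + 1 == hi:
            fin.pop()
        else:
            fin[-1][0] = lo + 1

    while cur < n or fin:
        steps += 1
        # Worker 0: the sequential process.
        if cur == n:
            finalize_one()
        elif cur < end:
            out[cur] = op(out[cur - 1], a[cur])
            cur += 1
        else:                   # reached a stolen segment: preempt its owner
            seg = segs.pop(cur)
            lo, pos, hi, w = seg
            if own[w] is seg:
                own[w] = None
            end = hi
            if pos > lo:        # jump over the part the helper already did
                out[pos - 1] = op(out[lo - 1], loc[pos - 1])
                if pos - 1 > lo:
                    fin.append([lo, pos - 1, out[lo - 1]])
                cur = pos
        # Workers 1..p-1: helpers.
        for w in range(1, p):
            s = own[w]
            if s is not None and s[1] < s[2]:
                i = s[1]
                loc[i] = a[i] if i == s[0] else op(loc[i - 1], a[i])
                s[1] = i + 1
            elif fin:
                finalize_one()
            else:               # idle: steal the last half of the largest remainder
                seg = max(segs.values(), key=lambda t: t[2] - t[1], default=None)
                if seg is not None and seg[2] - seg[1] > end - cur:
                    lo, hi = seg[1], seg[2]
                else:
                    seg, lo, hi = None, cur, end
                if hi - lo >= 2:
                    mid = (lo + hi) // 2
                    if seg is None:
                        end = mid
                    else:
                        seg[2] = mid
                    own[w] = segs[mid] = [mid, mid, hi, w]
    return out, steps

if __name__ == "__main__":
    print(adaptive_prefix(list("abcdef"), 2))
    # (['a', 'ab', 'abc', 'abcd', 'abcde', 'abcdef'], 4)
```

On one worker the simulation degenerates to the plain sequential loop (n - 1 steps). With helpers, each stolen element costs about two operations instead of one, but the elapsed number of steps drops.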

11-16
Adaptive Prefix on 3 processors
[Animation over six slides: one process runs the Sequential algorithm while the two others run the Parallel algorithm on stolen work. Labels shown in successive frames: π1; π3, π7; π8; π5, π6, π8, π9, π11; π5 to π12.]
Implicit critical path on the sequential process
17
Analysis of the algorithm

Execution time: [equation on the slide, not preserved; annotated "Lower bound"]
Sketch of the proof:
Dynamic coupling of two algorithms that complete simultaneously
- Sequential: (optimal) number of operations S on one processor
- Parallel: minimal time, but performs X operations on the other processors
  - dynamic splitting is always possible down to the finest grain, BUT execution is locally sequential
  - the critical path is small (e.g. log X)
  - each non-constant-time task can potentially be split (variable speeds)
Algorithmic scheme ensures Ts = Tp + O(log X)
=> makes it possible to bound the total number X of operations performed, and the overhead of parallelism = (S + X) - #ops_optimal
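The proof sketch can be turned into a back-of-the-envelope estimate for p identical unit-speed processors. The accounting below is an assumption, not a step recovered from the slide: an input handled by the sequential process costs one operation, and an input handled in parallel costs about two (a local prefix, then a correction).

```latex
n \;\approx\; S + \frac{X}{2}, \qquad S \approx T, \qquad X \approx (p-1)\,T
\;\;\Longrightarrow\;\;
T \;\approx\; \frac{2n}{p+1} + O(\log X),
\\[4pt]
\text{overhead} \;=\; (S + X) - n \;\approx\; \frac{X}{2} \;\approx\; \frac{p-1}{p+1}\,n .
```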
18
Adaptive prefix: experiments 1
Prefix sum of 8.10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux)
[Plot, single-user context: time (s) vs. number of processors]
Single-user context: processor-oblivious prefix achieves near-optimal performance
- close to the lower bound both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically "optimal" off-line parallel algorithm on p processors
19
Adaptive prefix: experiments 2
Prefix sum of 8.10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux)
[Plot, multi-user context with external load of (9-p) external processes: time (s) vs. number of processors]
Multi-user context, additional external load: (9-p) additional external dummy processes are executed concurrently
Processor-oblivious prefix computation is always the fastest: 15% benefit over a parallel algorithm for p processors with off-line schedule
20
Conclusion

The interplay of an on-line parallel algorithm directed by a work-stealing schedule is useful for the design of processor-oblivious algorithms
Application to prefix computation:
- theoretically, reaches the lower bound on heterogeneous processors with changing speeds
- practically, achieves near-optimal performance on multi-user SMPs
Generic adaptive scheme to implement parallel algorithms with provable performance
- work in progress: parallel 3D reconstruction [oct-tree scheme with deadline constraint]

21
Thank you!
Interactive Distributed Simulation [B. Raffin & E. Boyer]
- 5 cameras, 6 PCs
- 3D reconstruction + simulation + rendering
-> adaptive scheme to maximize 3D-reconstruction precision within a fixed timestamp
22
The prefix race: sequential / parallel fixed / adaptive
On each of the 10 executions, adaptive completes first
23
Adaptive prefix: some experiments
Prefix of 10000 elements on an 8-processor SMP (IA64 / Linux)
[Two plots of time (s) vs. number of processors: single-user context, and multi-user context with external load]
Multi-user context: adaptive is the fastest, 15% benefit over a static-grain algorithm

Single-user context: adaptive is equivalent to
- sequential on 1 processor
- optimal parallel-2 proc. on 2 processors
- ...
- optimal parallel-8 proc. on 8 processors

24
With double sum (r[i] = r[i-1] + x[i])
Finest grain limited to 1 page = 16384 bytes = 2048 doubles
[Plots: single user; processors with variable speeds]
Remark: for n = 4,096,000 doubles
- pure sequential: 0.20 s
- minimal grain = 100 doubles: 0.26 s on 1 processor and 0.175 s on 2 processors (close to the lower bound)
