1
Performing Tasks in Asynchronous Environments
  • Dariusz Kowalski
  • University of Connecticut & Warsaw University
  • joint work with
  • Alex Shvartsman
  • University of Connecticut & MIT

2
Do-All problem (DHW et al.)
  • DA(p,t) problem abstracts the basic problem of
    cooperation in a distributed setting
  • p processors must perform t tasks, and at least
    one processor must know about it
    [Dwork, Halpern, Waarts 92/98]
  • Tasks are
  • known to every processor
  • similar - each takes a similar number of local
    steps
  • independent - may be performed in any order
  • idempotent - may be performed concurrently

3
Do-All synchronous model with crashes
  • Model: processors are synchronous, may fail by
    crashes
  • Solutions: problem well understood, results close
    to optimal
  • Shared-memory model -- communication by
    read/write
  • Kanellakis, P.C., Shvartsman, A.A.
  • Fault-tolerant parallel computation. Kluwer
    Academic Publishers (1997)
  • Message-passing model -- communication by
    exchanging messages
  • Dwork, C., Halpern, J., Waarts, O.
  • Performing work efficiently in the presence of
    faults.
  • SIAM Journal on Computing, 27 (1998)
  • De Prisco, R., Mayer, A., Yung, M.
  • Time-optimal message-efficient work performance
    in the presence of faults. Proc. of 13th
    PODC, (1994)
  • Chlebus, B., De Prisco, R., Shvartsman, A.A.
  • Performing tasks on synchronous restartable
    message-passing processors. Distributed
    Computing, 14 (2001)

4
Do-All asynchronous models
  • Models
  • Shared-memory model -- communication by
    read/write -- widely studied, but solutions far
    from optimal
  • Kanellakis, P.C., Shvartsman, A.A.
    Fault-tolerant parallel computation. Kluwer
    Academic Publishers (1997)
  • Anderson, R.J., Woll, H. Algorithms for the
    certified Write-All problem. SIAM Journal on
    Computing, 26 (1997)
  • Kedem, Z., Palem, K., Raghunathan, A., Spirakis,
    P. Combining tentative and definite executions
    for very fast dependable parallel computing.
    Proc. of 23rd STOC, (1991)
  • Message-passing model -- communication by
    exchanging messages -- no interesting solutions
    until recently

5
Shared-Memory vs. Message-Passing
  • Shared-Memory (atomic registers)
  • processors communicate by read/write in
    shared-memory
  • atomicity - guarantees that read outputs the last
    written value
  • one read/write operation per local clock cycle
  • information propagates and information is
    persistent
  • Hence cooperation is always possible, although
    delayed. Here processor scheduling is the major
    challenge
  • Message-Passing
  • processors communicate by exchanging messages
  • duration of a local step may be unbounded
  • message delays may be unbounded
  • information may not propagate -- send/recv depend
    on delay

6
Message-delay-sensitive approach
  • Even if message delays are bounded by d
    (d-adversary), cooperation may be difficult
  • Observation
  • If d = Ω(t) then work must be Ω(t·p)
    (intuitively, a processor may then have to perform
    all t tasks before hearing from any other
    processor)
  • This means that cooperation is difficult, and
    addressing scheduling alone is not enough --
    algorithm design and analysis must be d-sensitive
  • Message-delay-sensitive approach
  • C. Dwork, N. Lynch and L. Stockmeyer. Consensus
    in the presence of partial synchrony. J. of the
    ACM, 35 (1988)

7
Measures of efficiency
  • Termination time: the first time when all tasks
    are done and at least one processor knows about
    it
  • Used only to define work and message complexity
  • Not interesting on its own: if all processors but
    one are delayed, then trivially time is Ω(t)
  • Work measures the sum, over all processors, of
    the number of local steps taken until termination
    time
  • Message complexity (message-passing model)
    measures the number of all point-to-point messages
    sent until termination time

8
Structure of the presentation
  • Part 1: Shared-memory model
  • Model and bibliography
  • Improving AW algorithm in shared-memory by better
    scheduling of processors (task load-balancing)
  • Part 2: Message-passing model
  • Model: asynchrony, message delay, and modeling
    issues
  • Delay-sensitive lower bounds for Do-All
  • Progress-tree Do-All algorithms
  • Simulating shared-memory and Anderson-Woll (AW)
  • Asynch. message-passing progress-tree algorithm
  • Permutation Do-All algorithms

9
Shared-Memory - model and goal
  • We consider the following model
  • p asynchronous processors with PIDs in {0,…,p-1}
  • processors communicate by read/write in
    shared-memory
  • atomicity - read outputs the last written value
  • one read/write operation per local clock cycle
  • Write-All: write 1s into t locations of a given
    array

Goal: improve scheduling of cooperating asynchronous
processors, leading to better load-balancing with
respect to tasks
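
As a baseline for what "work-optimal" means here, note that a single processor solves Write-All with work exactly t. The snippet below only states that baseline concretely (a minimal sketch; the names are ours, not from the slides):

```python
def write_all_single(t):
    """Trivial single-processor Write-All: t unit writes, so work exactly t.
    The challenge in the following slides is keeping the total work of p
    asynchronous processors close to O(t)."""
    array = [0] * t
    for i in range(t):      # each write is one unit task
        array[i] = 1
    return array            # all cells are 1, and this processor knows it
```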
10
Write-All Selected Bibliography
  • Introducing Write-All problem
  • Kanellakis, P.C., Shvartsman, A.A. Efficient
    parallel algorithms can be made robust. PODC
    (1989), Distributed Computing (1992)
  • AW algorithm with work O(t·p^ε)
  • Anderson, R.J., Woll, H. Algorithms for the
    certified Write-All problem. SIAM Journal on
    Computing, 26 (1997)
  • Randomized algorithm with work O(t + p·log p)
  • Martel, C., Subramonian, R. On the complexity of
    Certified Write-All algorithms. J. Algorithms 16
    (1994)
  • First work-optimal deterministic algorithm for
    t = Ω(p^4 log p)
  • Malewicz, G. A work-optimal deterministic
    algorithm for the asynchronous Certified
    Write-All problem. PODC (2003)

11
Progress tree algorithms BKRS, AW
  • Shared memory
  • p processors, t tasks (p ≤ t)
  • q permutations of the set [q]
  • q-ary progress tree of depth log_q p
  • nodes are binary completion bits
  • Permutations establish the order in which
    the children are visited
  • p processors traverse the tree and use the
    q-ary expansion of their PID to choose
    permutations
  • Anderson Woll

[Figure: q-ary progress tree; at each internal node the children 1, 2, 3, …, q
are visited in a permuted order]
12
Algorithm AWT Anderson Woll
  • Progress tree data structure is stored in shared
    memory (a traversal sketch follows the figure
    below)
  • p = t = 9, q = 3
  • Ψ = list of 3 schedules from S_3
  • T = ternary tree of 9 leaves (progress
    tree), with 0-1 values
  • PID(j) = j-th digit of the ternary representation
    of PID

Ψ_0: PID ∈ {0, 3, 6}
Ψ_1: PID ∈ {1, 4, 7}
Ψ_2: PID ∈ {2, 5, 8}
[Figure: ternary progress tree with 9 leaves, nodes numbered 0-12; each
processor visits the children of a node in the order given by its assigned
schedule]
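
The traversal illustrated above can be written down concretely. The following Python sketch shows one processor's pass over a complete q-ary progress tree; the array layout, helper names, and single-processor view are our assumptions, not the AWT code itself:

```python
# Illustrative AW-style traversal of a complete q-ary progress tree with
# q**h leaves.  The tree is stored in the 0/1 array `done`: node 0 is the
# root and the children of node v are q*v + 1, ..., q*v + q.

def qary_digits(pid, q, h):
    """The h least-significant digits of pid in base q."""
    digits = []
    for _ in range(h):
        digits.append(pid % q)
        pid //= q
    return digits

def awt_traverse(pid, q, h, done, perms, do_task):
    """perms: list of q permutations of range(q); processor `pid` picks the
    permutation used at depth j according to the j-th digit of its q-ary PID."""
    digits = qary_digits(pid, q, h)
    first_leaf = (q ** h - 1) // (q - 1)      # index of the leftmost leaf

    def visit(node, depth):
        if done[node]:                        # progress already recorded here
            return
        if depth == h:                        # leaf: perform its task, mark done
            do_task(node - first_leaf)
            done[node] = 1
            return
        for c in perms[digits[depth]]:        # children in the permuted order
            visit(q * node + 1 + c, depth + 1)
        done[node] = 1                        # whole subtree is now complete

    visit(0, 0)
```

In the shared-memory algorithm all p processors run this traversal concurrently on the shared completion bits, each with its own digit sequence.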
13
Contention of permutations
  • S_n - group of all permutations on the set [n],
    with composition ∘ and identity id_n
  • π, σ - permutations in S_n
  • Ψ - set of q permutations from S_n
  • i is an lrm (left-to-right maximum) in π if
    π(i) > max_{j<i} π(j)
  • LRM(π) - number of lrm's in π  [Knuth]
  • Cont(Ψ, σ) = Σ_{π∈Ψ} LRM(σ⁻¹ ∘ π)
  • Contention of Ψ: Cont(Ψ) = max_σ Cont(Ψ, σ)
    [AW] (computed in the sketch after the figure)
  • Theorem [AW]: For any n > 0 there exists a set Ψ
    of n permutations from S_n with Cont(Ψ) ≤ 3nH_n =
    Θ(n log n).
  • [Knuth] Knuth, D.E. The art of computer
    programming, Vol. 3 (third edition).
    Addison-Wesley (1998)

[Figure: permutation (10, 3, 5, 2, 4, 6, 1, 9, 7, 8, 11) with its
left-to-right maxima marked]
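
These definitions are easy to evaluate directly for small cases; the helper below (our function names, brute force over S_n, usable only for tiny n) counts left-to-right maxima and computes Cont(Ψ) from them:

```python
from itertools import permutations

def lrm(pi):
    """Number of left-to-right maxima of the permutation pi."""
    count, best = 0, float("-inf")
    for x in pi:
        if x > best:
            count += 1
            best = x
    return count

def compose(sigma_inv, pi):
    """(sigma^{-1} o pi)(i) = sigma_inv[pi[i]] for 0-based permutations."""
    return [sigma_inv[x] for x in pi]

def contention(Psi, n):
    """Cont(Psi) = max over sigma in S_n of sum_{pi in Psi} LRM(sigma^{-1} o pi)."""
    best = 0
    for sigma in permutations(range(n)):
        sigma_inv = [0] * n
        for i, s in enumerate(sigma):
            sigma_inv[s] = i
        best = max(best, sum(lrm(compose(sigma_inv, pi)) for pi in Psi))
    return best

# Example: the identity and its reverse on n = 4 elements.
print(contention([[0, 1, 2, 3], [3, 2, 1, 0]], 4))
```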
14
Procedure Oblivious Do
  • n - number of jobs and units
  • Ψ - list of n schedules from S_n
  • Procedure Oblivious:
      forall processors PID = 0 to n-1
        for i = 1 to n do
          perform Job(Ψ_PID(i))
  • Execution of Job(Ψ_PID(i)) by processor PID is
    primary if job Ψ_PID(i) has not been previously
    performed (simulated in the sketch below)
  • Lemma [AW]: In algorithm Oblivious with n units,
    n jobs, and using the list Ψ of n permutations
    from S_n, the number of primary job executions is
    at most Cont(Ψ).
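
A simulation-style sketch of procedure Oblivious, assuming the asynchronous execution is given as an interleaving of processor steps (the function and parameter names are ours):

```python
def oblivious_primary_count(Psi, interleaving):
    """Simulate procedure Oblivious for one asynchronous interleaving.

    Psi          -- list of n schedules (permutations of the n jobs);
                    Psi[pid] is the order in which processor pid works
    interleaving -- sequence of processor ids chosen by the adversary;
                    each occurrence lets that processor take one step
    Returns the number of primary job executions (an execution is primary
    if the job had not been performed by anyone before)."""
    n = len(Psi)
    next_index = [0] * n          # position of each processor in its schedule
    done = set()                  # jobs already performed by someone
    primary = 0
    for pid in interleaving:
        if next_index[pid] >= n:  # this processor already finished its list
            continue
        job = Psi[pid][next_index[pid]]
        if job not in done:       # primary execution
            primary += 1
            done.add(job)
        next_index[pid] += 1
    return primary
```

Whatever interleaving the adversary chooses, the lemma bounds the returned count by Cont(Ψ).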

15
AWT(q) - new progress tree traversal algorithm
  • Instead of using q permutations on the set [q], we
    use q permutations on the set [n], where
    n = q^2 log q
  • p = 6, t = 16, q = 2, n = 4
  • Ψ = list of 2 schedules from S_4
  • T = 4-ary tree of 16 leaves (progress
    tree), with 0-1 values
  • PID(j) = j-th digit of the binary (q-ary)
    representation of PID

Ψ_0: processors with even PID
Ψ_1: processors with odd PID
[Figure: 4-ary progress tree with 16 leaves, nodes numbered 0-20; each
processor visits the children of a node in the order given by the schedule
assigned to its PID parity]
16
Main result
  • Set n = q^2 log q and let Ψ be a list of q
    schedules from S_n
  • Define Cont(Ψ, Σ) = max_{σ∈Σ} Cont(Ψ, σ)
  • Lemma: For sufficiently large q and any set Σ of
    at most exp(q^2 log^2 q) permutations on the set
    [q^2 log q], there is a list of q schedules Ψ from
    S_n such that
  • Cont(Ψ, Σ) ≤ q^2 log q + 6q log q
  • Take q = log p and Ψ from the above Lemma
  • Theorem: For every ε > 0, sufficiently large p,
    and t = Ω(p^(2+ε)), algorithm AWT(q) performs
    work O(t).

17
Message-Passing - model and goals
  • We consider the following model
  • p asynchronous processors with PIDs in {0,…,p-1}
  • processors communicate by message passing
  • in one local step each processor can send a
    message to any subset of processors
  • messages incur delays between send and receive
  • processing of all received messages can be done
    during one local step
  • Goal: understand the impact of message delay on
    the efficiency of algorithmic solutions for Do-All

18
Lower bound - randomized algorithms
  • Theorem: Any randomized algorithm solving DA with
    t tasks using p asynchronous message-passing
    processors performs expected work
    Ω(t + p·d·log_{d+1} t)
    against any d-adversary.
  • Proof (sketch)
  • Adversary partitions the computation into stages,
    each containing d time units, and constructs the
    delay pattern stage after stage:
  • it delays all messages in a stage to be received
    at the end of the stage
  • it delays a linear number of processors (those
    that want to perform more than a (1-1/(3d))
    fraction of the undone tasks) during the stage
  • the selection is on-line and has the required
    properties with high probability
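
Reading the bound at the two extremes of the delay (our arithmetic; it agrees with the earlier observation that d = Ω(t) forces work Ω(t·p)):

```latex
\[
W(p,t,d) = \Omega\bigl(t + p\,d\,\log_{d+1} t\bigr)
\quad\Longrightarrow\quad
\begin{cases}
d = O(1): & W = \Omega(t + p\log t),\\
d = \Theta(t): & \log_{d+1} t = \Theta(1), \text{ so } W = \Omega(t + p\,t) = \Omega(t\,p).
\end{cases}
\]
```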

19
Simulating shared-memory algorithms
  • Write-All algorithm AWT
  • Anderson, R.J., Woll, H. Algorithms for the
    certified Write-All problem. SIAM Journal on
    Computing, 26 (1997)
  • Quorum systems Atomic memory services
  • Attiya, H., Bar-Noy, A., Dolev, D. Sharing
    memory robustly in message passing systems. J.
    of the ACM, 42 (1996)
  • Lynch, N., Shvartsman, A. RAMBO: A
    Reconfigurable Atomic Memory Service. Proc. of
    16th DISC, (2002)
  • Emulating asynchronous shared-memory algorithms
  • Momenzadeh, M. Emulating shared-memory Do-All in
    asynchronous message passing systems. Master's
    Thesis, CSE, University of Connecticut, (2003)

20
Atomic memory is not required
  • We use q-ary progress trees as the main data
    structure that is written and read -- note
    that atomicity is not required
  • If the following two writes occur (each writing
    the entire tree), then a subsequent read may
    obtain a third value that was never written
  • Property of monotone progress
  • 1 at a tree node i indicates that all tasks
    attached to the leaves of the sub-tree rooted at
    i have been performed
  • If 1 is written at a node i in the progress tree
    of a processor, it remains 1 forever (so tree
    copies can safely be merged, as sketched after
    the figure)

[Figure: two writes, each of the entire three-node progress tree, followed by
a read that may return a combination of bits that neither write produced]
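
Monotonicity is exactly what makes non-atomic copies safe to combine: a processor can fold any stale copy of the tree it reads or receives into its own by a bitwise OR. A minimal sketch (the function name is ours):

```python
def merge_progress_trees(local, received):
    """Fold a received copy of the progress tree into the local one.

    Both trees are lists of 0/1 completion bits over the same node numbering.
    Since a bit can only change from 0 to 1 (monotone progress), OR-ing the
    two copies can never lose or invent progress, so stale or partially
    updated copies are harmless and no read/write atomicity is required."""
    for i, bit in enumerate(received):
        if bit:
            local[i] = 1
    return local
```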
21
Algorithm DAq - traverse progress tree
  • Instead of using shared memory, processors
    broadcast their progress trees as soon as local
    progress is recorded (a sketch of the local loop
    follows the figure below)
  • p = t = 9, q = 3
  • Ψ = list of 3 schedules from S_3
  • T = ternary tree of 9 leaves (progress
    tree), with 0-1 values
  • PID(j) = j-th digit of the ternary representation
    of PID

Ψ_0: PID ∈ {0, 3, 6}
Ψ_1: PID ∈ {1, 4, 7}
Ψ_2: PID ∈ {2, 5, 8}
[Figure: ternary progress tree with 9 leaves, nodes numbered 0-12; each
processor traverses it in the order given by its assigned schedule]
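
A schematic of the resulting local loop, assuming placeholder broadcast/try_receive primitives and a simplified choice of the next leaf (the actual algorithm follows the AWT traversal order shown earlier); this is only a sketch of the idea stated above, not the DOWORK procedure of the following slides:

```python
def daq_local_loop(tree, q, h, do_task, broadcast, try_receive):
    """Schematic loop of one processor in the broadcast-based algorithm.

    tree  -- local copy of the complete q-ary progress tree (list of 0/1
             bits); node 0 is the root, children of v are q*v+1..q*v+q,
             and the last q**h entries are the leaves (leaf i <-> task i).
    broadcast / try_receive -- placeholder message-passing primitives."""
    first_leaf = len(tree) - q ** h
    while not tree[0]:                                 # root bit 1 = all done
        msg = try_receive()
        while msg is not None:                         # merge received copies
            for i, bit in enumerate(msg):
                tree[i] |= bit
            msg = try_receive()
        # pick an unfinished leaf (the real algorithm uses the AWT order)
        leaf = next((v for v in range(first_leaf, len(tree)) if not tree[v]), None)
        if leaf is None:
            tree[0] = 1                                # every task is done
            continue
        do_task(leaf - first_leaf)
        tree[leaf] = 1
        v = leaf
        while v > 0:                                   # propagate completed subtrees
            v = (v - 1) // q
            if all(tree[c] for c in range(q * v + 1, q * v + q + 1)):
                tree[v] = 1
            else:
                break
        broadcast(list(tree))                          # announce new local progress
```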
22
Algorithm DAq - case p ≥ t
23
Procedure DOWORK
24
Algorithm DAq - analysis
  • Modification of algorithm DAq for p < t
  • We partition the t tasks into p jobs of size t/p
    and let the algorithm DAq work with these
    jobs (a trivial partition sketch is given below)
  • It takes a processor O(t/p) work (instead of
    constant) to process such a job (job unit)
  • Since in each step a processor broadcasts at most
    one message to the p-1 other processors, we obtain
  • Theorem 4: For any constant ε > 0 there is a
    constant q such that the algorithm DAq has work
    W(p,t,d) = O(t·p^ε + p·d·⌈t/d⌉^ε)
    and message complexity
    O(p · W(p,t,d))
    against any d-adversary (d = o(t)).
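
The partition into p jobs mentioned above can be done in the obvious way; a trivial sketch with 0-based task ids (our convention):

```python
def partition_into_jobs(t, p):
    """Split task ids 0..t-1 into p jobs of (almost) equal size, about t/p each."""
    size = -(-t // p)                       # ceil(t / p)
    return [list(range(i * size, min((i + 1) * size, t))) for i in range(p)]

# Example: 10 tasks, 3 processors -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```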

25
Permutation algorithms - case p ≤ t
  • Algorithms proceed in a loop:
  • select the next task using the ORDERSELECT rule
  • perform the selected task
  • send messages, receive messages, and update state
  • ORDERSELECT rules (sketched in code below):
  • PARAN1: initially processor PID permutes the tasks
    randomly;
    PID selects the first task remaining on
    its schedule
  • PARAN2: no initial order;
    PID selects a task at random from those
    remaining
  • PADET: initially processor PID chooses
    schedule Ψ_PID in Ψ;
    PID selects the first task remaining on
    schedule Ψ_PID
  • Ψ - list of p schedules from S_t
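
A compact sketch of the three ORDERSELECT rules (function names are ours; the send/receive bookkeeping of the loop is omitted):

```python
import random

def make_selector(rule, pid, t, Psi=None):
    """Return a function that, given the set of locally undone tasks, selects
    the next task according to the ORDERSELECT rule; Psi (a list of p
    schedules over the t tasks) is required only for PADET."""
    if rule == "PARAN1":            # random initial schedule, then follow it
        schedule = list(range(t))
        random.shuffle(schedule)
    elif rule == "PADET":           # fixed schedule Psi[pid] from the list Psi
        schedule = list(Psi[pid])
    else:                           # PARAN2: no schedule, pick an undone task at random
        schedule = None

    def select(undone):
        if schedule is None:
            return random.choice(sorted(undone))
        return next(task for task in schedule if task in undone)

    return select
```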

26
d-Contention of permutations
  • We introduce the notion of d-Contention
  • i is a d-lrm in π if |{ j < i : π(i) < π(j) }| < d
  • Example: d = 2 for the permutation in the figure
    below
  • LRM_d(π) - number of d-lrm's in π
  • Cont_d(Ψ, σ) = Σ_{π∈Ψ} LRM_d(σ⁻¹ ∘ π)
  • d-Contention of Ψ: Cont_d(Ψ) = max_σ Cont_d(Ψ, σ)
  • Theorem: For sufficiently large p and n, there
    is a list Ψ of p permutations from S_n such that,
    for every integer d > 1,
    Cont_d(Ψ) ≤ n log n + 5pd ln(en/d).
    Moreover, a random Ψ is good with high
    probability.

[Figure: permutation (10, 3, 5, 2, 4, 6, 1, 9, 7, 8, 11) with its d-lrm
positions marked for d = 2]
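
The d-lrm count can be computed directly from the definition; a short sketch with our names, where d = 1 recovers the ordinary left-to-right maxima:

```python
def lrm_d(pi, d):
    """Number of d-lrm positions in pi: position i is a d-lrm if fewer than d
    earlier positions carry a larger value, i.e. |{ j < i : pi[j] > pi[i] }| < d."""
    return sum(
        1
        for i in range(len(pi))
        if sum(1 for j in range(i) if pi[j] > pi[i]) < d
    )

# d = 1 gives the classical left-to-right maxima count, e.g.
# lrm_d([10, 3, 5, 2, 4, 6, 1, 9, 7, 8, 11], 1) == 2   (positions of 10 and 11)
```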
27
d-Contention and work
  • Lemma: For algorithms PADET and PARAN1, the
    respective worst-case work and expected work are
    at most
    Cont_d(Ψ)
    against any d-adversary.
  • Example
  • p = 2, t = 11, d = 2

Order of tasks to perform: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
[Figure: the two processors' schedules over the 11 tasks and the task
executions that result against a d-adversary with d = 2]
28
Permutation algorithms - results
  • Theorem: Randomized algorithms PARAN1 and PARAN2
    perform expected work
    O(t·log p + p·d·log(t/d))
    and have expected communication
    O(t·p·log p + p^2·d·log(t/d))
    against any d-adversary (d = o(t)).
  • Corollary: There exists a deterministic list of
    schedules Ψ such that algorithm PADET performs
    work
    O(t·log p + p·min{t,d}·log(2t/d))
    and has communication
    O(t·p·log p + p^2·min{t,d}·log(2t/d))
    when p ≤ t.

29
Conclusions and open problems
  • Work-optimal Write-All algorithm for t = Ω(p^(2+ε))
  • First message-delay-sensitive analysis of the
    Do-All problem for asynchronous processors in
    message-passing model
  • lower bounds for deterministic and randomized
    algorithms
  • deterministic and randomized algorithms with
    subquadratic (in p and t) work for any message
    delay d, as long as d = o(t)
  • Among the interesting open questions are
  • is there a work-optimal scheduling for
    t = Ω(p log p)?
  • for algorithm PADET, how to construct the list Ψ
    of permutations efficiently?
  • closing the gap between the upper and the lower
    bounds
  • investigate algorithms that simultaneously
    control work and message complexity