Title: Workflow Scheduling Optimisation: The case for revisiting DAG scheduling
1 Workflow Scheduling Optimisation: The case for revisiting DAG scheduling
- Rizos Sakellariou and Henan Zhao
- University of Manchester
2 Scheduling
Slide courtesy Ewa Deelman, deelman_at_isi.edu, www.isi.edu/deelman, pegasus.isi.edu
3 Execution Environment
Slide courtesy Ewa Deelman, deelman_at_isi.edu, www.isi.edu/deelman, pegasus.isi.edu
4 In this talk, optimisation relates to performance. What affects performance?
- Aim to minimise the execution time of the workflow
- How?
- Exploit task parallelism
- But, even if there is enough parallelism, can the environment guarantee that this parallelism can be exploited to improve performance? No!
- Why?
- There is interference from the batch job schedulers that are traditionally used to submit jobs to HPC resources!
5 Example
- The uncertainty of batch schedulers means that any workflow enactment engine must wait for components to complete before beginning to schedule dependent components.
- Furthermore, it is not clear whether parallelism will be fully exploited: e.g., if the three tasks above that can be executed in parallel are submitted to 3 different queues of different length, there is no guarantee that they will execute in parallel; job queues rule!
This execution model fails to hide the latencies resulting from the length of job queues; these determine the execution time of the workflows.
6 Then try to get rid of the evil job queues!
- Advance reservation of resources has been proposed to make jobs run at a precise time.
- However, resources would be wasted if they are reserved for the whole execution of the workflow.
- Can we automatically make advance reservations for individual tasks?
7 Assuming that there is no job queue
- What affects performance?
- The structure of the workflow
- number of parallel tasks
- how long these tasks take to execute
- The number of resources
- typically, much smaller than the parallelism available.
- In addition
- there are communication costs
- there is heterogeneity
- estimating computation/communication costs is not trivial.
8 What does all this imply for mapping?
- An order by which tasks will be executed needs to be established (e.g., red, yellow, or blue first?)
- Resources need to be chosen for each task (some resources are fast, some are not so fast!)
- The cost of moving data between resources should not outweigh the benefits of parallelism.
9 Does the order matter?
- If task 6 on the right takes comparatively longer to run, we'd like to execute task 2 just after task 0 finishes and before tasks 1, 3, 4, 5.
Follow the critical path! Is this new? Not really...
10 Modelling the problem
- A workflow is a Directed Acyclic Graph (DAG).
- Scheduling DAGs onto resources is well studied in the context of homogeneous systems; less so in the context of heterogeneous systems (mostly without taking into account any uncertainty).
- Needless to say, this is an NP-complete problem.
- Are workflows really a general type of DAG or a subclass? We don't really know (some are clearly not DAGs; only DAGs are considered here).
11 Our approach
- Revisit the DAG scheduling problem for heterogeneous systems
- Start with simple static scenarios
- Even this problem is not well understood, despite the fact that there have been perhaps more than 30 heuristics published (check the Heterogeneous Computing Workshop proceedings for a start)
- Try to build on, as we obtain a good understanding of each step!
12 Outline
- Static DAG scheduling onto heterogeneous systems (i.e., we know computation and communication costs a priori)
- Introduce uncertainty in computation times.
- Handle multiple DAGs at the same time.
- Use the knowledge accumulated above to reserve slots for tasks onto resources.
13 Based on
- [1] Rizos Sakellariou, Henan Zhao. A Hybrid Heuristic for DAG Scheduling on Heterogeneous Systems. Proceedings of the 13th IEEE Heterogeneous Computing Workshop (HCW'04) (in conjunction with IPDPS 2004), Santa Fe, April 2004, IEEE Computer Society Press.
- [2] Rizos Sakellariou, Henan Zhao. A low-cost rescheduling policy for efficient mapping of workflows on grid systems. Scientific Programming, 12(4), December 2004, pp. 253-262.
- [3] Henan Zhao, Rizos Sakellariou. Scheduling Multiple DAGs onto Heterogeneous Systems. Proceedings of the 15th Heterogeneous Computing Workshop (HCW'06) (in conjunction with IPDPS 2006), Rhodes, April 2006, IEEE Computer Society Press.
- [4] Henan Zhao, Rizos Sakellariou. Advance Reservation Policies for Workflows. Proceedings of the 12th Workshop on Job Scheduling Strategies for Parallel Processing, 2006.
14 How to schedule? Our model: a DAG, 10 tasks, 3 machines (assume we know execution times and communication costs)
15 A simple idea
- Assign nodes to the fastest machine!
Communication between nodes 4 and 8 takes way too long!!!
Heuristics that take into account the whole structure of the DAG are needed.
Makespan is > 1000!
16 H. Zhao, R. Sakellariou. An experimental study of the rank function of HEFT. Proceedings of Euro-Par '03.
17 Hmm...
- This was a rather well-defined problem
- This was just a small change in the algorithm
- What about different heuristics?
- What about more generic problems?
18 DAG scheduling: A Hybrid Heuristic
- Trying to find out why there were such differences in the outcome of HEFT, we observed problems with the order; to address those problems we came up with a Hybrid Heuristic; it worked quite well!
- Phases:
- Rank (list scheduling)
- Create groups of independent tasks
- Schedule independent tasks
- Can be carried out using any scheduling algorithm for independent tasks, e.g., MinMin, MaxMin
- A novel heuristic (Balanced Minimum Completion Time)
R. Sakellariou, H. Zhao. A Hybrid Heuristic for DAG Scheduling on Heterogeneous Systems. Proceedings of the IEEE Heterogeneous Computing Workshop (HCW'04), 2004.
19 An Example
(Figure: example DAG annotated with computation and communication costs)
20
- Mean Upward Ranking Scheme
- The order is: 0, 1, 4, 5, 7, 2, 3, 6, 8, 9
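The mean upward rank used here is the standard HEFT-style ranking: a task's rank is its mean execution time plus the maximum, over its successors, of the mean communication cost plus the successor's rank; tasks are then ordered by decreasing rank. A minimal sketch, with illustrative input names (`succ`, `mean_exec`, `mean_comm` are not from the papers):

```python
# Sketch of the mean upward ranking scheme (as in HEFT).
# succ maps a task to its successors, mean_exec gives mean execution
# times, mean_comm mean communication costs per edge.
def upward_rank(tasks, succ, mean_exec, mean_comm):
    memo = {}

    def rank(i):
        # rank_u(i) = mean_exec(i) + max over successors j of
        # (mean_comm(i, j) + rank_u(j)); exit tasks get just mean_exec.
        if i not in memo:
            memo[i] = mean_exec[i] + max(
                (mean_comm[(i, j)] + rank(j) for j in succ.get(i, [])),
                default=0.0,
            )
        return memo[i]

    # scheduling order: decreasing upward rank
    return sorted(tasks, key=rank, reverse=True)
```

Sorting by decreasing rank yields a priority order consistent with the precedence constraints, like the order quoted on the slide.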
21 An Example
- Phase 1: Rank the nodes
- Phase 2: Create groups of independent tasks
The order is: 0, 1, 4, 5, 7, 2, 3, 6, 8, 9
22 Balanced Minimum Completion Time Algorithm (BMCT)
- Step I
- Assign each task to the machine that gives the fastest execution time.
- Step II
- Find the machine M with the maximal finish time. Move a task from M to another machine, if it minimizes the overall makespan.
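The two steps can be sketched as follows for a single group of independent tasks. Communication costs and earlier groups are ignored here for brevity, so this illustrates the balancing idea rather than the full algorithm; the `exec_time` table is a hypothetical input:

```python
# Minimal sketch of BMCT for one group of independent tasks,
# ignoring communication costs and previously scheduled groups.
# exec_time[task][machine] is a hypothetical execution-time table.
def bmct_group(tasks, machines, exec_time):
    # Step I: assign each task to its fastest machine
    assign = {t: min(machines, key=lambda m: exec_time[t][m]) for t in tasks}

    def finish_times(a):
        return {m: sum(exec_time[t][m] for t in tasks if a[t] == m)
                for m in machines}

    # Step II: repeatedly move a task off the most loaded machine
    # as long as doing so reduces the overall makespan
    while True:
        ft = finish_times(assign)
        makespan = max(ft.values())
        heavy = max(machines, key=lambda m: ft[m])
        best = None
        for t in [t for t in tasks if assign[t] == heavy]:
            for m in machines:
                if m == heavy:
                    continue
                trial = dict(assign)
                trial[t] = m
                new_ms = max(finish_times(trial).values())
                if new_ms < makespan and (best is None or new_ms < best[0]):
                    best = (new_ms, t, m)
        if best is None:
            return assign  # no improving move exists; stop
        assign[best[1]] = best[2]
```

With three identical tasks that are all fastest on the same machine, Step I piles them up on it and Step II then spreads them out until no move improves the makespan.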
23
- Initially assign each task in the group to the machine giving the fastest time
- No movement for the entry task
24 An Example (2)
- Phase 3: Schedule Independent Tasks in Group 1
(Figure: Gantt chart for machines M0, M1, M2; time axis 0 to 140)
- Initially assign each task in the group to the machine giving the fastest time
25
- Initially assign each task in the group to the machine giving the fastest time
- M2 is the machine with the maximal finish time (70)
26
- Task 5 moves to M0 since it can achieve an earlier overall finish time
- Now M0 is the machine with the maximal finish time (69)
27
- Task 1 moves to M2 since it can achieve an earlier overall finish time
- Now M2 is the machine with the maximal finish time (59)
- No task can be moved from M2; the movement stops.
- Schedule next group
28
- Initially assign each task in this group to the machine giving the fastest time
29
- Task 2 moves to M1 since it can achieve an earlier overall finish time
- M2 is the machine with the maximal finish time
- No movement from M2
- Schedule next group
30
- Initially assign each task in this group to the machine giving the fastest time
31
- Task 6 moves to M0 since it can achieve an earlier overall finish time
- M2 is the machine with the maximal finish time
32
- Task 8 moves to M1 since it can achieve an earlier overall finish time
- M1 is the machine with the maximal finish time
- No movement from M1
- Schedule next group
33
- Initially assign each task in this group to the machine giving the fastest time
- No movement for the exit task
34 (No Transcript)
35 Experiments
- DAG Scheduling
- Algorithms
- Hybrid.BMCT (i.e., the algorithm as presented), and
- Hybrid.MinMin (i.e., MinMin instead of BMCT)
- Applications
- Randomly generated graphs
- Laplace
- FFT
- Fork-join graphs
- Heterogeneity setting (following an approach by Siegel et al.)
- Consistent
- Partially consistent
- Inconsistent
36 Hybrid Heuristic Comparison
- NSL (normalised schedule length)
- Random DAGs, 25-100 tasks with inconsistent heterogeneity
- Average improvement: 25%
37 Hmm...
- Yes, but, so far, you have used static task execution times; in practice such times are difficult to specify exactly
- There is an answer for run-time deviations: adjust at run-time
- But...
- don't we need to understand the static case first?
38 Characterise the Schedule
- Spare time indicates the maximum time that a node, i, may delay without affecting the start time of an immediate successor, j.
- A node i with an immediate successor j on the DAG: spare(i,j) = Start_Time(j) - Data_Arrival_Time(i,j)
- A node i with an immediate successor j on the same machine: spare(i,j) = Start_Time(j) - Finish_Time(i)
- The minimum of the above is MinSpare for a task.
R. Sakellariou, H. Zhao. A low-cost rescheduling policy for efficient mapping of workflows on grid systems. Scientific Programming, 12(4), December 2004, pp. 253-262.
39 Example
- DAT(4,7) = 40.5, ST(7) = 45.5, hence spare(4,7) = 5
- FT(3) = 28, ST(5) = 29.5, hence spare(3,5) = 1.5
- DAT = Data_Arrival_Time, ST = Start_Time, FT = Finish_Time
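Both spare-time cases reduce to one small function. The dictionaries `start`, `finish`, and `dat` are hypothetical stand-ins for values read off an existing schedule:

```python
# Spare time as defined on the slides; start, finish and dat
# (Data_Arrival_Time) are hypothetical dictionaries taken from an
# existing schedule.
def spare(i, j, start, finish, dat, same_machine):
    if same_machine:
        # i and j scheduled back-to-back on the same machine
        return start[j] - finish[i]
    # j is a DAG successor fed by data transferred from i
    return start[j] - dat[(i, j)]
```

With the numbers from the example slide: spare(4,7) = 45.5 - 40.5 = 5 and spare(3,5) = 29.5 - 28 = 1.5.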
40 Characterise the schedule (cont.)
- Slack indicates the maximum time that a node, i, may delay without affecting the overall makespan.
- slack(i) = min(slack(j) + spare(i,j)), for all successor nodes j (both on the DAG and the machine)
- The idea: keep track of the values of the slack and/or the spare time and reschedule only when the delay exceeds the slack
R. Sakellariou, H. Zhao. A low-cost rescheduling policy for efficient mapping of workflows on grid systems. Scientific Programming, 12(4), December 2004, pp. 253-262.
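The slack recursion can be computed bottom-up over the successors. Here `succ` and `spare` are assumed precomputed from the schedule, and a task with no successors (the exit task) gets slack 0, since delaying it delays the makespan directly:

```python
# Bottom-up computation of slack, assuming succ lists each task's
# immediate successors (on the DAG and on the machine) and spare
# holds the precomputed spare times per edge.
def compute_slack(tasks, succ, spare):
    slack = {}

    def rec(i):
        if i not in slack:
            # slack(i) = min over successors j of (slack(j) + spare(i, j));
            # the exit task gets 0: any delay to it delays the makespan
            slack[i] = min(
                (rec(j) + spare[(i, j)] for j in succ.get(i, [])),
                default=0.0,
            )
        return slack[i]

    for t in tasks:
        rec(t)
    return slack
```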
41 Lessons Learned (simulation and deviations of up to 100%)
- Heuristics that perform better statically perform better under uncertainties.
- By using the metrics on spare time, one can provide guarantees for the maximum deviation from the static estimate. Then, we can minimise the number of times we reschedule while still achieving good results.
- Could lead to orders-of-magnitude improvement with respect to workflow execution using DAGMan (would depend on the workflow; only partly true with Montage)
42 Challenges still unanswered
- What are the representative DAGs (workflows) in the context of Grid computing?
- Extensive evaluation/analysis (theoretical too) is needed. It is not clear what the best makespan we can get is (because it is not easy to find the critical path)
- What are the uncertainties involved? How good are the estimates that we can obtain for the execution time / communication cost? Performance prediction is hard
- How heterogeneous are our Grid resources really?
43 Moving on to multiple DAGs
- It is really idealistic to assume that we have exclusive usage of resources
- In practice, we may have multiple DAGs competing for resources at the same time
Henan Zhao, Rizos Sakellariou. Scheduling Multiple DAGs onto Heterogeneous Systems. Proceedings of the 15th Heterogeneous Computing Workshop (HCW'06) (in conjunction with IPDPS 2006), Rhodes, April 2006, IEEE Computer Society Press.
44 Scheduling Multiple DAGs: Approaches
- Approach 1: Schedule one DAG after the other with existing DAG scheduling algorithms
- Low resource utilization; long overall makespan
- Approach 2: Still one after the other, but do some backfilling and fill the gaps
- Which DAG to schedule first? The one with the longest makespan or the one with the shortest makespan?
- Approach 3: Merge all DAGs into a single, composite DAG. Much better than Approach 1 or 2.
45 Example: Two DAGs to be scheduled together
46 Composition Techniques
- C1: Common Entry and Common Exit Node
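Composition C1 can be sketched as hanging every DAG under one zero-cost common entry node and connecting all sinks to one common exit node. The dict-of-successor-lists representation and the `Dk.` renaming are illustrative choices, not taken from the paper:

```python
# Sketch of composition C1: one common entry node feeds the sources
# of every DAG, and every sink feeds one common exit node.
def compose_c1(dags):
    entry, exit_ = "ENTRY", "EXIT"
    composite = {entry: [], exit_: []}
    for k, dag in enumerate(dags):
        prefix = f"D{k}."
        # copy each DAG in, with prefixed task names to avoid clashes
        for t, succs in dag.items():
            composite[prefix + t] = [prefix + s for s in succs]
        preds = {s for succs in dag.values() for s in succs}
        # source tasks (no predecessors) hang off the common entry
        composite[entry] += [prefix + t for t in dag if t not in preds]
        # sink tasks (no successors) feed the common exit
        for t, succs in dag.items():
            if not succs:
                composite[prefix + t] = [exit_]
    return composite
```

The composite DAG can then be handed to any single-DAG scheduling algorithm unchanged.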
47 Composition Techniques
48 Composition Techniques
- C3: Alternate between DAGs (round robin between DAGs)
- Easy!
49 Composition Techniques
- C4: Ranking-Based Composition (compute a weight for each node and merge accordingly)
(Figure: composition starting from nodes A0 and B0)
50
- But is makespan optimisation a good objective when scheduling multiple DAGs?
51 Mission: Fairness
- In multiple DAGs:
- User's perspective: I want my DAG to complete execution as soon as possible.
- System perspective: I would like to keep as many users as possible happy; I would like to increase resource utilisation.
- Let's be fair to users!
- (The system may want to take into account different levels of quality of service agreed with each user)
52 Slowdown
- Slowdown: what is the delay that a DAG would experience as a result of sharing the resources with other DAGs (as opposed to having the resources on its own)?
- Average slowdown for all DAGs
53 Unfairness
- Unfairness indicates, for all DAGs, how different the slowdown of each DAG is from the average slowdown (over all DAGs).
- The higher the difference, the higher the unfairness!
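The slides do not give the formulas, so the following uses one common way to express the two metrics: slowdown as the ratio of a DAG's shared-resource finish time to its exclusive-resource makespan, and unfairness as the total deviation of the per-DAG slowdowns from the average. `own` and `shared` are hypothetical inputs:

```python
# own[d]: DAG d's makespan with exclusive use of the resources.
# shared[d]: DAG d's finish time when sharing with other DAGs.
def slowdown(own, shared):
    return {d: shared[d] / own[d] for d in own}

def unfairness(own, shared):
    s = slowdown(own, shared)
    avg = sum(s.values()) / len(s)
    # total deviation of each DAG's slowdown from the average slowdown
    return sum(abs(v - avg) for v in s.values())
```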
54 Scheduling for Fairness
- Key idea: at each step (that is, every time a task is to be scheduled), select the most affected DAG (that is, the DAG with the highest slowdown value) to schedule.
- What is the most affected DAG at any given point in time?
55 Fairness Scheduling Policies
- F1: Based on Latest Finish Time
- calculates the slowdown value only at the time the last task that was scheduled for this DAG finishes.
- F2: Based on Current Time
- re-calculates the slowdown value for every DAG after any task finishes. A proportion of time, for running tasks, is taken when the calculation is carried out.
56 Lessons Learned / Open questions
- It is possible to achieve reasonably good fairness without affecting makespan.
- An algorithm with good behaviour in the static case appears to make things easier in terms of achieving fairness
- What is fairness?
- What is the behaviour when run-time changes occur?
- What about different notions of Quality of Service (SLAs, etc.)?
57 Finally...
- How to automate advance reservations at the task level for a workflow, when the user has specified a deadline constraint only for the whole workflow?
Henan Zhao, Rizos Sakellariou. Advance Reservation Policies for Workflows. Proceedings of the 12th Workshop on Job Scheduling Strategies for Parallel Processing, 2006.
58 The schedule on the left can be used to plan reservations. However, if one task fails to finish within its slot, the remaining tasks have to be re-negotiated.
59 What we are looking for is...
60 The Idea
- 1. Obtain an initial assignment using any DAG scheduling algorithm (HEFT, HBMCT, ...).
- 2. Repeat:
- I. Compute the Application Spare Time (= user-specified deadline - DAG finish time).
- II. Distribute the Application Spare Time among the tasks.
- 3. Until the Application Spare Time is below a threshold.
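As a toy illustration of step II, assuming (purely for simplicity) a single chain of back-to-back tasks rather than a general DAG, the application spare time can be split evenly and each task's reservation slot stretched by its share, pushing later slots back:

```python
# Toy illustration of spare-time distribution over a single chain of
# back-to-back tasks; start and finish are hypothetical dicts from an
# initial schedule and are mutated in place.
def allocate_spare(start, finish, deadline, threshold=1.0):
    order = sorted(start, key=start.get)  # chain order by start time
    while deadline - finish[order[-1]] >= threshold:
        share = (deadline - finish[order[-1]]) / len(order)
        shift = 0.0
        for t in order:
            start[t] += shift   # pushed back by earlier stretches
            shift += share      # this task's slot grows by `share`
            finish[t] += shift
    return start, finish
```

Each pass consumes the whole remaining application spare time, so every task's reserved slot ends up larger than its estimated execution time, which is what allows run-time deviations without re-negotiation.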
61 Spare Time
- The Spare Time indicates the maximum time that a node may delay without affecting the start time of any of its immediate successor nodes.
- A node i with an immediate successor j on the DAG: spare(i,j) = Start_Time(j) - Data_Arrival_Time(i,j)
- A node i with an immediate successor j on the same machine: spare(i,j) = Start_Time(j) - Finish_Time(i)
- The minimum of the above for all immediate successors is the Spare Time of a task.
- Distributing the Application Spare Time needs to take care of the inherently present spare time!
62 Two main strategies
- Recursive spare time allocation
- The Application Spare Time is divided among all the tasks.
- This is a repetitive process until the Application Spare Time is below a threshold.
- Critical Path based allocation
- Divide the Application Spare Time among the tasks on the critical path.
- Balance the Spare Time of all the other tasks.
- (a total of 6 variants have been studied)
63 An Example
64 Critical Path based allocation
65 Finally...
66 Findings
- Advance reservation of resources for workflows can be automatically converted to reservations at the task level, thus improving resource utilization.
- If the deadline set for the DAG is such that there is enough spare time, then we can reserve resources for each individual task so that deviations of the same order, for each task, can be afforded without any problems.
- Advance reservation is known to harm resource utilization. But this study indicated that if the user is prepared to pay for full usage when only 60% of the slot is used, there is no loss for the machine owner.
67 ...which leads to pricing!
- R. Sakellariou, H. Zhao, E. Tsiakkouri, M. Dikaiakos. Scheduling workflows under budget constraints. To appear as a chapter in a book with selected papers from the 1st CoreGrid Integration Workshop.
- The idea:
- Given a specific budget, what is the best schedule you can obtain for your workflow?
- Multicriteria optimisation is hard!
- Our approach:
- Start from a good solution for one objective, and try to meet the other!
- It works! How well? Difficult to tell!
68 To summarize
- Understanding the basic static scenarios and having robust solutions for those scenarios helps the extension to more complex cases
- Pretty much everything here is addressed by heuristics. Their evaluation requires extensive experimentation. Still:
- No agreement about what DAGs (workflows) look like.
- No agreement about how heterogeneous resources are
- The problems addressed here are perhaps more related to what is supposed to be core CS
- But we may be talking about lots of work for only incremental improvements (10-15%)
69
- Who cares in Computer Science about performance improvements in the order of 10-15%?
- (yet, if Gordon Brown were to increase our taxes by 10-15%, everyone would be so unhappy!)
- Oh well...
70 Thank you!