Exploring Issues with Workflow Scheduling on the Grid

Slides: 41
Provided by: scie304
Learn more at: http://www.lpds.sztaki.hu
Transcript and Presenter's Notes

Title: Exploring Issues with Workflow Scheduling on the Grid


1
Exploring Issues with Workflow Scheduling on the
Grid
  • Rizos Sakellariou
  • University of Manchester, UK
  • with thanks to
  • Henan Zhao and Ewa Deelman for providing slides!
  • also Viktor Yarmolenko, Wei Zheng,
  • and Anastasios Gounaris for presenting it!

2
Workflow applications are widely considered a
common use case of Grids
LIGO (Pegasus team, ISI) (large-scale)
myGrid, Manchester (small size)
3
Modelling the problem
  • A workflow is a Directed Acyclic Graph (DAG)
  • Scheduling DAGs onto resources is well studied in
    the context of homogeneous systems; less so in
    the context of heterogeneous systems (mostly
    without taking into account any uncertainty).
  • Needless to say, this is an NP-complete
    problem.
  • Are workflows really any type of DAG or a
    special type of DAG? We don't really know (some
    workflows are clearly not DAGs; only DAGs are
    considered here).
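
A minimal sketch of the DAG model, assuming a workflow is given as a mapping from each task to the set of tasks it depends on (the task names below are hypothetical). It uses Python's standard graphlib module (Python 3.9 or later) to produce one valid execution order, and reports workflows that contain a cycle and therefore fall outside the DAG model:

```python
from graphlib import TopologicalSorter, CycleError

def execution_order(predecessors):
    """Return one valid task order, or None if the workflow is not a DAG.

    `predecessors` maps each task to the set of tasks it depends on.
    """
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        # A cyclic workflow cannot be ordered this way.
        return None

# Hypothetical diamond-shaped workflow: a -> {b, c} -> d
diamond = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
# Hypothetical workflow with a loop between two tasks
loop = {"a": {"b"}, "b": {"a"}}

print(execution_order(diamond))
print(execution_order(loop))
```

A topological order only says which orders are legal; choosing among the legal orders, and choosing a resource for each task, is the scheduling problem itself.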

4
DAG scheduling
  • An order by which tasks will be executed needs to
    be established (e.g., red, yellow, or blue first?)
  • Resources need to be chosen for each task (some
    resources are fast, some are not so fast!)
  • The cost of moving data between resources should
    not outweigh the benefits of parallelism.

5
Does the order matter?
  • If task 6 takes comparatively longer to run, we'd
    like to execute task 2 just after task 0 finishes
    (perhaps before tasks 1, 3, 4, 5).

Follow the critical path! This is not really
new!
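
A minimal sketch of the critical-path idea, under stated assumptions: the DAG and the task costs below are hypothetical (they are not the example from the slide), all machines are treated as identical, and communication costs are ignored. Each task gets a "bottom level" (the length of the longest path from that task to the exit), and tasks are prioritised by decreasing bottom level:

```python
from functools import lru_cache

# Hypothetical DAG: task -> list of successors.
succ = {0: [1, 2, 3], 1: [4], 2: [5], 3: [4], 4: [6], 5: [6], 6: []}
# Hypothetical execution times; task 5 is deliberately long.
cost = {0: 2, 1: 3, 2: 1, 3: 2, 4: 2, 5: 10, 6: 1}

@lru_cache(maxsize=None)
def bottom_level(task):
    """Length of the longest path from `task` to an exit task, inclusive."""
    return cost[task] + max((bottom_level(s) for s in succ[task]), default=0)

def critical_path(entry):
    """Follow, from `entry`, the successor with the largest bottom level."""
    path = [entry]
    while succ[path[-1]]:
        path.append(max(succ[path[-1]], key=bottom_level))
    return path

# Tasks sorted by decreasing bottom level: a precedence-respecting priority list.
priority = sorted(succ, key=bottom_level, reverse=True)

print(critical_path(0))  # [0, 2, 5, 6]
print(priority)          # [0, 2, 5, 1, 3, 4, 6]
```

In this made-up instance the long task 5 pulls its predecessor, task 2, to the front of the list, ahead of tasks 1 and 3.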
6
Our methodology
  • Revisit the DAG scheduling problem for
    heterogeneous systems
  • Start with simple static scenarios
  • Even this problem is not well understood, despite
    the fact that there have been more than 30
    heuristics published (check the proceedings of
    the Heterogeneous Computing Workshop for a
    start)
  • Try to build on existing knowledge, as we obtain
    a good understanding of each step!

7
Outline of Part I
  • Static DAG scheduling onto heterogeneous systems
    (i.e., we know computation and communication
    costs a priori)
  • Introduce uncertainty in computation times.
  • Handle multiple DAGs at the same time.

[1] Rizos Sakellariou, Henan Zhao. A Hybrid
Heuristic for DAG Scheduling on Heterogeneous
Systems. Proceedings of the 13th IEEE
Heterogeneous Computing Workshop (HCW'04) (in
conjunction with IPDPS 2004), Santa Fe, April
2004, IEEE Computer Society Press, 2004.
[2] Rizos Sakellariou, Henan Zhao. A low-cost
rescheduling policy for efficient mapping of
workflows on grid systems. Scientific
Programming, 12(4), December 2004, pp.
253-262.
[3] Henan Zhao, Rizos Sakellariou.
Scheduling Multiple DAGs onto Heterogeneous
Systems. Proceedings of the 15th Heterogeneous
Computing Workshop (HCW'06) (in conjunction with
IPDPS 2006), Rhodes, Apr. 2006, IEEE Computer
Society Press.
8
The starting point for a model: a DAG, 10 tasks,
3 machines (assume we know execution times,
communication costs)
9
A simple idea
  • Assign nodes to the fastest machine!

Communication between nodes 4 and 8 takes way
too long!!!
Heuristics that take into account the whole
structure of the DAG are needed
Makespan is > 1000!
10
H. Zhao, R. Sakellariou. An experimental study of
the rank function of HEFT. Proceedings of
Euro-Par 2003.
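
For context, a minimal sketch of the upward rank used by HEFT as it is commonly defined: the rank of a task is its aggregated execution time plus the largest (communication cost + rank) over its successors. Assumptions made here for illustration only: the four-task DAG, the per-machine execution times, and the constant communication cost are all hypothetical, and the two aggregation choices (mean and minimum) are just two possibilities.

```python
from statistics import mean

# Hypothetical DAG and per-machine execution times (two machines).
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
exec_time = {"A": [2, 2], "B": [1, 9], "C": [4, 5], "D": [1, 1]}
comm = 1  # hypothetical constant communication cost on every edge

def upward_rank(aggregate):
    """rank(t) = aggregate(exec times of t) + max over successors of (comm + rank)."""
    rank = {}

    def visit(t):
        if t not in rank:
            rank[t] = aggregate(exec_time[t]) + max(
                (comm + visit(s) for s in succ[t]), default=0
            )
        return rank[t]

    for t in succ:
        visit(t)
    return rank

def order(aggregate):
    """Tasks sorted by decreasing upward rank."""
    rank = upward_rank(aggregate)
    return sorted(rank, key=lambda t: -rank[t])

print(order(mean))  # ['A', 'B', 'C', 'D']
print(order(min))   # ['A', 'C', 'B', 'D']
```

Only the way the per-machine times are collapsed into one number changes between the two calls, yet tasks B and C swap places in the priority list.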
11
Hmm
  • This was a rather well defined problem
  • This was just a small change in the algorithm
  • Yet, with big variations in the outcome.
  • What about different heuristics?
  • What about more generic problems?

12
DAG scheduling: A Hybrid Heuristic
  • Trying to find out why there were such
    differences in the outcome of HEFT, we observed
    problems with the order. To address those
    problems we came up with a Hybrid Heuristic; it
    worked quite well!
  • Phases:
  • Rank (list scheduling)
  • Create groups of independent tasks
  • Schedule independent tasks
  • Can be carried out using any scheduling algorithm
    for independent tasks, e.g. MinMin, MaxMin, ...
  • A novel heuristic (Balanced Minimum Completion
    Time)

R. Sakellariou, H. Zhao. A Hybrid Heuristic for DAG
Scheduling on Heterogeneous Systems. Proceedings
of the IEEE Heterogeneous Computing Workshop
(HCW'04), 2004.
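
A minimal sketch of the third phase for one group of independent tasks, using Min-Min (one of the algorithms named above), not the Balanced Minimum Completion Time heuristic, which is not reproduced here. The task names and run times are hypothetical, and data transfer costs are ignored:

```python
def min_min(exec_time, n_machines):
    """Min-Min for independent tasks.

    exec_time[task][m] is the run time of `task` on machine m.
    Repeatedly picks the (task, machine) pair with the earliest
    completion time and commits it.
    """
    ready = [0.0] * n_machines  # time at which each machine becomes free
    unscheduled = set(exec_time)
    schedule = {}
    while unscheduled:
        finish, task, machine = min(
            (ready[m] + exec_time[t][m], t, m)
            for t in unscheduled
            for m in range(n_machines)
        )
        schedule[task] = (machine, finish)
        ready[machine] = finish
        unscheduled.remove(task)
    return schedule, max(ready)

# Hypothetical group of three independent tasks on two machines.
group = {"t1": [3, 5], "t2": [2, 4], "t3": [6, 3]}
schedule, makespan = min_min(group, 2)
print(schedule, makespan)
```

With these numbers t2 goes to machine 0, t3 to machine 1, and t1 follows t2 on machine 0, for a makespan of 5.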
13
Hmm
  • Yes, but, so far, you have used static task
    execution times; in practice such times are
    difficult to specify exactly.
  • There is an answer for run-time deviations:
    adjust at run-time.
  • But...
  • don't we need to understand the static case
    first?

14
Characterise the Schedule
  • Spare time indicates the maximum time that a
    node, i, may delay without affecting the start
    time of an immediate successor, j.
  • Slack indicates the maximum time that a node, i,
    may delay without affecting the overall makespan.
  • The idea: keep track of the values of the slack
    and/or the spare time and reschedule only when
    the delay exceeds the slack (selective rescheduling).

R.Sakellariou, H.Zhao. A low-cost rescheduling
policy for efficient mapping of workflows on
grid systems. Scientific Programming, 12(4),
December 2004, pp. 253-262.
15
Example
  • FT(4) = 32.5, DAT(4,7) = 40.5, ST(7) = 45.5
    => Spare_Time(4) = 5
  • Slack(8) = 0
  • Slack(7) = Slack(8) + Spare_Time(7) = 0
  • Slack(5) = Slack(8) + Spare_Time(5) = 6
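
The arithmetic of this example can be sketched as follows. Assumptions, labelled as such: spare time is taken as ST(j) - DAT(i, j) for a successor j; slack is taken as the minimum over successors of (slack of the successor + spare time), with exit nodes having slack 0; spare time with respect to the next task on the same machine is ignored. Only the values given in the example (nodes 4, 5, 7, 8) are used; the helper names are hypothetical.

```python
def spare_time(start_succ, data_arrival):
    """Spare time of node i w.r.t. successor j: ST(j) - DAT(i, j)."""
    return start_succ - data_arrival

def slack(node, succ_spare, memo=None):
    """Slack of `node`; succ_spare[i] = {j: spare time of i w.r.t. j}.

    Exit nodes (no successors) have slack 0.
    """
    memo = {} if memo is None else memo
    if node not in memo:
        edges = succ_spare.get(node, {})
        memo[node] = min(
            (slack(j, succ_spare, memo) + s for j, s in edges.items()),
            default=0.0,
        )
    return memo[node]

def needs_rescheduling(delay, node_slack):
    """Selective rescheduling: act only when the delay exceeds the slack."""
    return delay > node_slack

# Values from the example: ST(7) = 45.5, DAT(4,7) = 40.5.
print(spare_time(45.5, 40.5))  # 5.0

# Spare times of nodes 7 and 5 with respect to node 8, as in the example.
succ_spare = {7: {8: 0.0}, 5: {8: 6.0}, 8: {}}
print(slack(7, succ_spare), slack(5, succ_spare))  # 0.0 6.0
```

Under these assumptions a delay of 4 time units on node 5 would be absorbed (4 < 6), while any delay on node 7 would trigger rescheduling.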

16
Lessons Learned (simulation and deviations of up
to 100%)
  • Heuristics that perform better statically,
    perform better under uncertainty.
  • By using the metrics on spare time, one can track
    the amount of deviation of the makespan from the
    static estimate. Then, we can minimise the number
    of times we reschedule, still achieving good
    results.

17
Moving on to multiple DAGs
  • It is really ideal to assume that we have
    exclusive usage of resources
  • In practice, we may have multiple DAGs competing
    for resources at the same time

Henan Zhao, Rizos Sakellariou. Scheduling
Multiple DAGs onto Heterogeneous Systems.
Proceedings of the 15th Heterogeneous Computing
Workshop (HCW'06) (in conjunction with IPDPS
2006), Rhodes, Apr. 2006, IEEE Computer Society
Press.
18
Scheduling Multiple DAGs: Approaches
  • Approach 1: Schedule one DAG after the other with
    existing DAG scheduling algorithms
  • Low resource utilization and long overall makespan
  • Approach 2: Still one after the other, but do
    some backfilling to fill the gaps
  • Which DAG to schedule first? The one with the
    longest makespan or the one with the shortest
    makespan?
  • Approach 3: Alternate between DAGs (either
    round-robin or using some other form of
    priority).
  • Much better than Approaches 1 and 2.
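
A minimal sketch of the round-robin variant of Approach 3, assuming each DAG has already been turned into its own precedence-respecting priority list (for example by a rank function). The task names are hypothetical; the merged list would then be handed to a list scheduler, which is not shown:

```python
from collections import deque

def round_robin_merge(task_lists):
    """Interleave per-DAG priority lists, taking one task from each DAG in turn."""
    queues = deque(deque(tasks) for tasks in task_lists if tasks)
    merged = []
    while queues:
        q = queues.popleft()
        merged.append(q.popleft())
        if q:  # this DAG still has tasks: put it at the back of the rotation
            queues.append(q)
    return merged

# Hypothetical priority lists for three DAGs of different sizes.
dags = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
print(round_robin_merge(dags))  # ['a1', 'b1', 'c1', 'a2', 'c2', 'a3']
```

Because tasks of one DAG keep their relative order, precedence within each DAG is preserved; replacing the rotation with a priority queue keyed on some per-DAG metric gives the "other form of priority" variant.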

19
  • But, is makespan optimisation a good objective
    when scheduling multiple DAGs?

20
Mission: Fairness
  • In multiple DAGs:
  • User's perspective: I want my DAG to complete
    execution as soon as possible.
  • System perspective: I would like to keep as many
    users as possible happy; I would like to increase
    resource utilisation (and income).
  • Let's be fair to users!
  • (The system may want to take into account
    different levels of quality of service agreed
    with each user)
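
One possible way to put a number on fairness, sketched here as an assumption rather than as the definition used in this work: measure each DAG's slowdown as (makespan when it has the resources to itself) / (makespan when sharing), and call a schedule unfair to the extent that slowdowns differ across DAGs. The makespans below are made up.

```python
def slowdowns(own_makespan, shared_makespan):
    """Slowdown per DAG: makespan alone divided by makespan when sharing."""
    return {d: own_makespan[d] / shared_makespan[d] for d in own_makespan}

def unfairness(own_makespan, shared_makespan):
    """Sum of absolute deviations of each slowdown from the average slowdown."""
    s = slowdowns(own_makespan, shared_makespan)
    avg = sum(s.values()) / len(s)
    return sum(abs(v - avg) for v in s.values())

# Hypothetical makespans for two DAGs, alone and when scheduled together.
own = {"A": 10.0, "B": 20.0}
shared = {"A": 20.0, "B": 25.0}
print(slowdowns(own, shared))   # {'A': 0.5, 'B': 0.8}
print(unfairness(own, shared))  # approximately 0.3
```

A value of 0 would mean every DAG is slowed down by the same factor; weighting the slowdowns could express different quality-of-service levels per user.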

21
Lessons Learned and Open Questions
  • It is possible to achieve reasonably good
    fairness without affecting makespan.
  • An algorithm with good behaviour in the static
    case appears to make things easier in terms of
    achieving fairness
  • What is fairness?
  • What should be the behavior when run-time changes
    occur?
  • What about different notions of Quality of
    Service (e.g., based on SLAs)?

22
Questions still unanswered
  • What are the representative DAGs (workflows) in
    the context of Grid computing?
  • Extensive evaluation / analysis (theoretical too)
    is needed. It is not clear what the best makespan
    we can get is (it is not easy to find the critical
    path).
  • What are the uncertainties involved? How good are
    the estimates that we can obtain for the
    execution time / communication cost? Performance
    prediction is hard.
  • How heterogeneous are our Grid resources really?

23
Workflows are not generic DAGs
  • Bioinformatics workflows are really small (10s of
    nodes)
  • There are scientific workflows with thousands of
    nodes (Montage, LIGO, SCEC), but they have a
    rather regular structure.
  • Experience from joint work with the Pegasus team
    indicates that there may not be much to gain from
    sophisticated heuristics (paper to be published
    based on the earlier studies below):
  • James Blythe, S. Jain, Ewa Deelman, Yolanda Gil,
    Karan Vahi, Anirban Mandal, Ken Kennedy. Task
    scheduling strategies for workflow-based
    applications in grids. CCGRID 2005, pp. 759-767.
  • Rizos Sakellariou, Henan Zhao. A Hybrid Heuristic
    for DAG Scheduling on Heterogeneous Systems.
    Proceedings of the 13th IEEE Heterogeneous
    Computing Workshop (HCW'04) (in conjunction with
    IPDPS 2004), Santa Fe, April 2004, IEEE Computer
    Society Press, 2004.

24
Part II: But, there is more (than just shortening
the makespan) when scheduling DAGs (workflows)!
25
Efficient data handling
  • Workflow input data is staged dynamically; new
    data products are generated during execution
  • For large workflows: 10,000 input files
  • (Similar order of intermediate/output files)
  • If there is not enough disk space, failures occur
  • Solution:
  • Determine which data are no longer needed and
    when
  • Add nodes to the workflow to clean up data along
    the way
  • Take into account disk space on resources
  • Benefits: simulations show up to 57% space
    improvements for LIGO-like workflows

Scheduling Data-Intensive Workflows onto
Storage-Constrained Distributed Resources, A.
Ramakrishnan, G. Singh, H. Zhao, E. Deelman, R.
Sakellariou, K. Vahi, K. Blackburn, D. Meyers,
and M. Samidi, CCGrid 2007
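
A minimal sketch of why cleanup helps, under simplifying assumptions that are not taken from the paper: tasks run one at a time in a fixed order on a single storage resource, a file can be deleted right after its last consumer finishes, and files that nobody consumes (final outputs) are kept. Instead of adding explicit cleanup nodes to the workflow, the sketch models their effect; the file names and sizes are made up.

```python
def peak_footprint(order, outputs, inputs, size, cleanup):
    """Peak disk usage for a sequential execution `order`.

    outputs[task] / inputs[task] list the files a task writes / reads.
    With cleanup=True, a file is removed after the last task that reads it.
    """
    # Step index of the last task that reads each file.
    last_use = {}
    for step, task in enumerate(order):
        for f in inputs.get(task, []):
            last_use[f] = step

    on_disk, peak = set(), 0
    for step, task in enumerate(order):
        on_disk.update(outputs.get(task, []))
        peak = max(peak, sum(size[f] for f in on_disk))
        if cleanup:
            on_disk -= {f for f in on_disk if last_use.get(f, -1) == step}
    return peak

# Hypothetical four-step pipeline; "stage_in" models staging the input file.
order = ["stage_in", "t1", "t2", "t3"]
outputs = {"stage_in": ["in"], "t1": ["a"], "t2": ["b"], "t3": ["out"]}
inputs = {"t1": ["in"], "t2": ["a"], "t3": ["b"]}
size = {"in": 10, "a": 8, "b": 6, "out": 2}

print(peak_footprint(order, outputs, inputs, size, cleanup=False))  # 26
print(peak_footprint(order, outputs, inputs, size, cleanup=True))   # 18
```

In a real workflow the saving depends on the execution order and on how tasks are mapped to sites, which is why storage needs to be considered during scheduling.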
26
44% improvement in footprint for Montage
workflow (when adding cleanup nodes)
27
LIGO Inspiral Analysis Workflow
Small workflow: 164 nodes
Full-scale analysis: 185,000 nodes and 466,000 edges,
10 TB of input data and 1 TB of output data
LIGO workflow running on OSG
Optimizing Workflow Data Footprint. G. Singh, K.
Vahi, A. Ramakrishnan, G. Mehta, E. Deelman, H.
Zhao, R. Sakellariou, K. Blackburn, D. Brown, S.
Fairhurst, D. Meyers, G. B. Berriman, J. Good,
D. S. Katz, Scientific Programming.
28
LIGO Workflows
26% improvement in disk space usage; 50% slower
runtime
29
LIGO Workflows
56% improvement in space usage; 3 times slower
in runtime
Optimizing Workflow Data Footprint. G. Singh, K.
Vahi, A. Ramakrishnan, G. Mehta, E. Deelman, H.
Zhao, R. Sakellariou, K. Blackburn, D. Brown, S.
Fairhurst, D. Meyers, G. B. Berriman, J. Good,
D. S. Katz, Scientific Programming.
30
Lesson Learned
  • When scheduling workflows, one may want to trade
    performance for lower storage requirements to make
    it feasible to complete the execution of a workflow!

31
Part III: But, there are other issues related to
performance that have to do with the workflow
execution environment and the queuing mechanisms
of traditional systems!
32
Scheduling
Slide courtesy: Ewa Deelman, deelman_at_isi.edu,
www.isi.edu/deelman, pegasus.isi.edu
33
Execution Environment
Slide courtesy: Ewa Deelman, deelman_at_isi.edu,
www.isi.edu/deelman, pegasus.isi.edu
34
Queues are evil ?
Is Advance Reservation a solution?
35
Might be... For sure, there are several challenges
with respect to workflows: e.g., given a
user-specified deadline, how can we make
reservations for individual tasks?
  • Henan Zhao, Rizos Sakellariou. Advance
    Reservation Policies for Workflows. Proceedings
    of the 12th Workshop on Job Scheduling Strategies
    for Parallel Processing, 2006.

36
Advance Reservation still provides only a limited
level of service!
  • Can we think of a model where:
  • users specify their constraints,
  • make an agreement (legally binding contract) with
    the resource owner (Service Level Agreement, SLA),
  • it's up to the system to do the scheduling (based
    on the SLAs) to honour the agreement?

http://www.gridscheduling.org
Viktor Yarmolenko, Rizos Sakellariou. Towards
Increased Expressiveness in Service Level
Agreements. Concurrency and Computation: Practice
and Experience, 2007.
Viktor Yarmolenko, Rizos Sakellariou. An
Evaluation of Heuristics for SLA-based parallel
job scheduling. High Performance Grid Computing
Workshop, IPDPS, 2006.
37
SLA based job scheduling
  • SLA based job scheduling can offer the levels of
    service currently missing
  • It happens all the time in the real-world!
  • But, there are several key challenges to address:
  • Build appropriate protocols (legally binding),
    behaviour models, etc. for negotiation and
    re-negotiation
  • Pricing Policies (income, penalties, etc)
  • Manage complexity
  • Regulation, monitoring, dispute resolution
  • Convince the users to change attitudes!
  • Scheduling the SLAs doesn't appear to be the
    biggest challenge. But...
  • How to schedule workflows using SLAs (how to deal
    with co-allocation problems, for instance) is a
    big challenge!
  • Needs extensive evaluation!

38
To summarize
  • Understanding the basic static scenarios and
    having robust solutions for those scenarios helps
    the extension to more complex cases
  • Pretty much everything here is addressed by
    heuristics. Their evaluation requires extensive
    experimentation. Still:
  • No agreement about what DAGs (workflows) look
    like.
  • No agreement about how heterogeneous resources
    really are.
  • There are indications that sophisticated DAG
    scheduling may not be very relevant for
    workflows. But, there are optimization problems
    that relate to:
  • Data handling, Licences?, Budget? (or multiple
    criteria)
  • and, above all...

39
What is the way to ease the constraints imposed
by the traditional queue-based models for job
scheduling?
40
I'd be happy to hear from anyone with interests
in these problems. You are also welcome to come
and visit us in Manchester!