6d.1 - PowerPoint PPT Presentation

About This Presentation
Title:

6d.1

Slides: 46
Provided by: barry200
Category:
Tags: condor

1
Schedulers and Resource Brokers
ITCS 4010 Grid Computing, 2005, UNC-Charlotte, B. Wilkinson.
2
Scheduler
  • Job manager submits jobs to scheduler.
  • Scheduler assigns work to resources to achieve
    specified time requirements.

3
Scheduling
  • From "Introduction to Grid Computing with
    Globus," IBM Redbooks

4
Executing GT 4 jobs
  • Globus has two modes
  • Interactive/interactive-streaming
  • Batch

5
GT 4 Fork Scheduler
  • GT 4 comes with a fork scheduler which attempts
    to execute the job immediately
  • Provided for starting and controlling a job on a
    local host when the job needs no specially loaded
    software or other special requirements.
  • Other schedulers have to be added separately,
    using an adapter.

6
Batch scheduling
  • Batch: a term from old computing days, when one
    submitted a pack of punched cards as the program
    to a computer and came back after the program
    had been run, maybe overnight.

7
Relationship between GT4 GRAM and a Local Scheduler
(Figure, credited to I. Foster. Labels: Client; GT4 Java Container; GRAM services; GRAM adapter; Local scheduler (various possible); Compute element; Local job control; Job functions; User job.)
8
Scheduler adapters included in GT 4
  • PBS (Portable Batch System)
  • Condor
  • LSF (Load Sharing Facility)
  • Third-party adapter provided for SGE (Sun Grid
    Engine)

9
Meta-schedulers
  • Loosely defined as a higher-level scheduler that
    can schedule jobs between sites.

10
(Local) Scheduler Issues
  • Distribute job
  • Based on load and characteristics of machines,
    available disk storage, network characteristics,
    ...
  • Both globally and locally.
  • Runtime scheduling!
  • Arrange data in right place (Staging)
  • Data Replication and movement as needed
  • Data Error checking

11
Scheduler Issues (continued)
  • Performance
  • Error checking, checkpointing
  • Monitoring job, progress monitoring
  • QOS (Quality of service)
  • Cost (an area considered by Nimrod-G)
  • Security
  • Need to authenticate and authorize remote user
    for job submission
  • Fault Tolerance

12
Batch Scheduling policies
  • First-in, First-out
  • Favor certain types of jobs
  • Shortest job first
  • Smallest (or largest) memory first
  • Short (or long) running job first
  • Fair sharing or priority to certain users
  • Dynamic policies
  • Depending upon time of day and load
  • Custom, preemptive, process migration
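
As a rough illustration of how two of the policies above order a queue, here is a minimal Python sketch. The `Job` fields and the three sample jobs are hypothetical, invented for this example; real batch schedulers work from much richer job descriptions.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    arrival: int       # submission order (hypothetical field)
    est_runtime: int   # estimated run time in minutes (hypothetical field)


def fifo(queue):
    """First-in, first-out: run jobs in order of arrival."""
    return sorted(queue, key=lambda j: j.arrival)


def shortest_job_first(queue):
    """Shortest job first; ties are broken by arrival order."""
    return sorted(queue, key=lambda j: (j.est_runtime, j.arrival))


def average_wait(order):
    """Average time a job waits before starting on a single machine."""
    waited, clock = 0, 0
    for job in order:
        waited += clock
        clock += job.est_runtime
    return waited / len(order)


queue = [Job("A", 0, 120), Job("B", 1, 10), Job("C", 2, 30)]
fifo_order = fifo(queue)
sjf_order = shortest_job_first(queue)
print([j.name for j in fifo_order], average_wait(fifo_order))
print([j.name for j in sjf_order], average_wait(sjf_order))
```

With these sample jobs, shortest job first lowers the average wait because the long job A no longer holds up the two short ones.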

13
Advance Reservation
  • Requesting actions at times in future.
  • A service level agreement in which the
    conditions of the agreement start at some
    agreed-upon time in the future [2]

[2] The Grid 2: Blueprint for a New Computing
    Infrastructure, I. Foster and C. Kesselman,
    editors, Morgan Kaufmann, 2004.

14
Resource Broker
  • A scheduler that optimizes the performance of a
    particular resource. Performance may be measured
    by such criteria as fairness (to ensure that all
    requests for the resources are satisfied) or
    utilization (to measure the amount of the
    resource used). [2]

15
Scheduler/Resource Broker Examples
  • Schedulers/Resource Brokers available that work
    with Globus
  • Condor/Condor-G
  • Sun Grid Engine
  • To be covered by James Ruff and to be used in
    Assignment 4 this year.

16
Condor
  • First developed at University of
    Wisconsin-Madison in mid 1980s to convert a
    collection of distributed workstations and
    clusters into a high-throughput computing
    facility.
  • Key concept - using wasted computer power of idle
    workstations.

17
Condor
  • Converts collections of distributed workstations
    and dedicated clusters into a distributed
    high-throughput computing facility.

18
Features
  • Include
  • Resource finder
  • Batch queue manager
  • Scheduler
  • Checkpoint/restart
  • Process migration

19
  • Intended to complete a job even if:
  • Machines crash
  • Disk space exhausted
  • Software not installed
  • Machines are needed by others
  • Machines are managed by others
  • Machines are far away

20
Uses
  • Consider following scenario
  • I have a simulation that takes two hours to run
    on my high-end computer
  • I need to run it 1000 times with slightly
    different parameters each time.
  • If I do this on one computer, it will take at
    least 2000 hours (or about 3 months)

From "Condor: What it is and why you should
worry about it," by B. Beckles, University of
Cambridge, Seminar, June 23, 2004
21
  • Suppose my department has 100 PCs like mine that
    are mostly sitting idle overnight (say 8 hours a
    day).
  • If I could use them when their legitimate users
    are not using them, so that I do not
    inconvenience them, I could get about 800 CPU
    hours/day.
  • This is an ideal situation for Condor.
  • I could do my simulations in 2.5 days.

From "Condor: What it is and why you should
worry about it," by B. Beckles, University of
Cambridge, Seminar, June 23, 2004
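
The arithmetic behind this scenario can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the 100 PCs are as fast as the original machine and ignores scheduling and data-transfer overhead, which the slides do not quantify.

```python
runs = 1000               # simulations to run
hours_per_run = 2         # hours per simulation on one machine
pcs = 100                 # idle department PCs
idle_hours_per_day = 8    # hours each PC is idle per day

total_cpu_hours = runs * hours_per_run                  # 2000 CPU hours
single_machine_days = total_cpu_hours / 24              # one PC running nonstop
pool_cpu_hours_per_day = pcs * idle_hours_per_day       # 800 CPU hours/day
pool_days = total_cpu_hours / pool_cpu_hours_per_day    # 2.5 days

print(f"one machine: {single_machine_days:.1f} days")
print(f"idle pool:   {pool_days:.1f} days")
```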
22
  • The Condor high-throughput computing system
  • Condor-G agent for grid computing

23
HTCS (high-throughput computing system)
  • Distributed batch computing system
  • Provide large amounts of fault-tolerant computing
    power.
  • Opportunistic computing.
  • Effectively utilizing all resources on the
    network
  • Scavenger (but polite!)

24
Tools
  • ClassAds: flexible framework for matching
    resource requests and providers.
  • Job checkpoint and migration
  • Remote system calls: redirect I/O back to the
    local machine
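
The matching idea behind ClassAds can be sketched in Python as two-sided matching: a job and a machine are paired only if each satisfies the other's requirements. The snippet below is a toy model and not the ClassAd language itself; every attribute name (`memory_mb`, `os`, `idle`, `owner`, `rank`) and every machine is a made-up assumption for illustration.

```python
# Each "ad" is a dict of attributes plus a requirements predicate.
# This imitates the idea of ClassAds only; it is not ClassAd syntax.
machines = [
    {"name": "pc1", "memory_mb": 512, "os": "LINUX", "idle": True,
     "requirements": lambda job: job["owner"] != "guest"},
    {"name": "pc2", "memory_mb": 2048, "os": "LINUX", "idle": True,
     "requirements": lambda job: True},
    {"name": "pc3", "memory_mb": 4096, "os": "WINDOWS", "idle": False,
     "requirements": lambda job: True},
]

job = {
    "owner": "alice",
    # What the job demands of a machine.
    "requirements": lambda m: (m["os"] == "LINUX"
                               and m["memory_mb"] >= 1024
                               and m["idle"]),
    # Among acceptable machines, prefer the one with the most memory.
    "rank": lambda m: m["memory_mb"],
}


def match(job, machines):
    """Return the best machine acceptable to both sides, or None."""
    candidates = [m for m in machines
                  if job["requirements"](m) and m["requirements"](job)]
    return max(candidates, key=job["rank"], default=None)


best = match(job, machines)
print(best["name"] if best else "no match")
```

Here both the job and the machine owner get a say, which mirrors the point that resource owners keep control over who may use their machines.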

25
Condor-G
  • Globus contributes protocols for secure
    inter-domain communications and standardized
    access to remote batch systems.
  • Condor provides everything else

27
Condor Core Components
28
  • User submits job to agent
  • Agent keeps jobs and finds resources willing to
    run them.
  • Agents and resources advertise themselves to a
    matchmaker.
  • Analogy: E-harmony.com's "29 dimensions of
    compatibility."
  • Agent contacts resource

29
  • Agent creates a shadow, which provides all
    details necessary to run the job.
  • Resource creates a sandbox: a safe execution
    environment for the job and the resource.
  • All are independent and individually responsible
    for enforcing their owners' policies.
  • This led to Condor Pools

31
Direct Flocking (Multiple pools)
32
Globus
  • To develop worldwide Grid, needed uniform
    interface for batch execution.
  • Grid Resource Access and Management protocol
    (GRAM).
  • Provides abstraction for remote process queuing
    and execution (with security and GridFTP).
  • Globus provides a server that speaks GRAM,
    converts its commands into a form understood by
    local schedulers

33
  • GRAM does not
  • Remember what jobs have been submitted, where
    they are, what they are doing.
  • Analyze job failure and resubmit
  • Provide queuing, prioritization, logging,
    accounting.
  • Decouple resource allocation and job execution.

34
  • Agent must direct a particular job, executable
    image and all, to a particular queue.
  • Gosh, what if there is a backlog and no
    reasonably available resources?

35
  • Condor adapted its standard agent to speak GRAM
    and uses its own middleware.
  • Gliding

37
Directed Acyclic Graph Manager (DAGMan)
  • Meta-scheduler
  • Allows one to specify dependencies between Condor
    Jobs.

38
  • Example
  • Do not run Job B until Job A has completed
    successfully
  • Especially important to jobs working together (as
    in Grid computing).

39
Directed Acyclic Graph (DAG)
  • A data structure used to represent dependencies.
  • Directed graph.
  • No cycles.
  • Each job is a node in the DAG.
  • Each node can have any number of parents and
    children as long as there are no loops (Acyclic
    graph).

40
DAG
Do job A. Do jobs B and C after job A has
finished. Do job D after both jobs B and C have
finished.
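
The diamond-shaped dependency described here can be written down as a small data structure and ordered into "waves" of jobs that may run together. This is a minimal Python sketch of the general idea, not DAGMan's implementation; the `waves` helper is a hypothetical name.

```python
# Diamond DAG: each key is a job, each value lists that job's parents.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}


def waves(parents):
    """Group jobs into waves; a job is ready once all its parents are done."""
    done, order = set(), []
    remaining = dict(parents)
    while remaining:
        ready = sorted(job for job, ps in remaining.items()
                       if all(p in done for p in ps))
        if not ready:
            # Nothing can start, so the graph must contain a loop.
            raise ValueError("cycle detected: not a DAG")
        order.append(ready)
        done.update(ready)
        for job in ready:
            del remaining[job]
    return order


print(waves(parents))
```

Jobs B and C land in the same wave because neither depends on the other, so they could run at the same time on different machines.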
41
Defining a DAG
  • Defined by a .dag file, listing each of the nodes
    and their dependencies.
  • Each job statement has an abstract job name
    (say A) and a file (say a.condor)
  • PARENT-CHILD statement describes relationship
    between two or more jobs
  • Other statements available.

42
  • Example

diamond.dag

    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
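
To make the file format concrete, here is a small Python sketch that reads the two statement types used in the diamond example into a job table and a parent map. It handles only `Job` and `Parent ... Child ...` lines; `parse_dag` is a hypothetical helper written for this illustration, and the other statements mentioned earlier are simply ignored.

```python
DIAMOND = """\
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Return (jobs, parents): job name -> submit file, job name -> parent set."""
    jobs, parents = {}, {}
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        keyword = words[0].lower()
        if keyword == "job":
            name, submit_file = words[1], words[2]
            jobs[name] = submit_file
            parents.setdefault(name, set())
        elif keyword == "parent":
            # Names before "Child" are parents, names after it are children.
            split = [w.lower() for w in words].index("child")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return jobs, parents


jobs, parents = parse_dag(DIAMOND)
print(jobs)
print({job: sorted(ps) for job, ps in parents.items()})
```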
43
Running a DAG
  • DAGMan acts as a scheduler managing the
    submission of jobs to Condor based upon DAG
    dependencies.
  • DAGMan holds and submits jobs to Condor queue at
    appropriate times.

44
Job Failures
  • DAGMan continues until it cannot make progress
    and then creates a rescue file holding current
    state of DAG.
  • When failed job ready to re-run, rescue file used
    to restore prior state of DAG.
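
The failure behavior described above can be imitated with a short Python sketch: run whatever can run, stop when no further progress is possible, and keep the set of completed jobs so a later attempt can skip them. This is a simplified model under stated assumptions; the actual rescue file format is not shown in these slides, and `run_dag` and `run_job` are hypothetical names.

```python
def run_dag(parents, run_job, done=None):
    """Run each job whose parents succeeded; stop when nothing else can run.

    Returns the set of completed jobs, which stands in for the rescue
    state in this sketch.
    """
    done = set(done or ())
    failed = set()
    progress = True
    while progress:
        progress = False
        for job in sorted(parents):
            if job in done or job in failed:
                continue
            if all(p in done for p in parents[job]):
                if run_job(job):
                    done.add(job)
                else:
                    failed.add(job)
                progress = True
    return done


parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

# First attempt: job C fails, so D can never start, but B still runs.
rescue = run_dag(parents, lambda job: job != "C")
print(sorted(rescue))

# Second attempt: C now succeeds; jobs in the rescue state are skipped.
rerun = []
final = run_dag(parents, lambda job: rerun.append(job) or True, done=rescue)
print(rerun, sorted(final))
```

On the second attempt only C and D are executed, since A and B were already recorded as complete.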

45
Summary of Key Condor Features
  • High-throughput computing using an
    opportunistic environment.
  • Provides mechanisms for running jobs on remote
    machines.
  • Matchmaking
  • Checkpointing
  • DAG scheduling