1
Job Delegation and Planning in Condor-G
ISGC 2005, Taipei, Taiwan
2
The Condor Project (Established '85)
  • Distributed High Throughput Computing research
    performed by a team of 35 faculty, full time
    staff and students.

3
The Condor Project (Established '85)
  • Distributed High Throughput Computing research
    performed by a team of 35 faculty, full time
    staff and students who
  • face software engineering challenges in a
    distributed UNIX/Linux/NT environment
  • are involved in national and international grid
    collaborations,
  • actively interact with academic and commercial
    users,
  • maintain and support large distributed
    production environments,
  • and educate and train students.
  • Funding: US Govt. (DoD, DoE, NASA, NSF, NIH),
  • AT&T, IBM, INTEL, Microsoft, UW-Madison

4
A Multifaceted Project
  • Harnessing the power of clusters dedicated
    and/or opportunistic (Condor)
  • Job management services for Grid applications
    (Condor-G, Stork)
  • Fabric management services for Grid resources
    (Condor, GlideIns, NeST)
  • Distributed I/O technology (Parrot, Kangaroo,
    NeST)
  • Job-flow management (DAGMan, Condor, Hawk)
  • Distributed monitoring and management (HawkEye)
  • Technology for Distributed Systems (ClassAD, MW)
  • Packaging and Integration (NMI, VDT)

5
Some software produced by the Condor Project
  • Condor System
  • ClassAd Library
  • DAGMan
  • Fault Tolerant Shell (FTSH)
  • Hawkeye
  • GCB
  • MW
  • NeST
  • Stork
  • Parrot
  • VDT
  • And others, all as open source

6
Who uses Condor?
  • Commercial
  • Oracle, Micron, Hartford Life Insurance, CORE,
    Xerox, Exxon/Mobil, Shell, Alterra, Texas
    Instruments
  • Research Community
  • Universities, Government Labs
  • Bundles: NMI, VDT
  • Grid Communities: EGEE/LCG/gLite, Particle
    Physics Data Grid (PPDG), USCMS, LIGO, iVDGL, NSF
    Middleware Initiative GRIDS Center

7
Condor Pool
[Diagram: a Condor pool. A central MatchMaker pairs Jobs queued at the Schedds with Startds on the execute machines.]
8
Condor Pool
[Diagram: the same Condor pool; the MatchMaker has matched Jobs from the Schedds onto the Startds.]
9
Condor-G
[Diagram: a Schedd holding Jobs delegates them to remote resources of several kinds: via Condor-C to another Schedd and Startd, and via Condor-G to LSF, PBS, Globus 2, Globus 4, Unicore, and NorduGrid gateways.]
10
[Diagram: the layered view: User/Application/Portal on top of the Grid, on top of the Fabric (processing, storage, communication).]
11
Job Delegation
  • Transfer of responsibility to schedule and
    execute a job
  • Stage in executable and data files
  • Transfer policy instructions
  • Securely transfer (and refresh?) credentials,
    obtain local identities
  • Monitor and present job progress (transparency!)
  • Return results
  • Multiple delegations can be combined in
    interesting ways

12
Simple Job Delegation in Condor-G
[Diagram: Condor-G delegates a job to Globus GRAM, which hands it to the batch system front-end, which runs it on an execute machine.]
13
Expanding the Model
  • What can we do with new forms of job delegation?
  • Some ideas
  • Mirroring
  • Load-balancing
  • Glide-in schedd, startd
  • Multi-hop grid scheduling

14
Mirroring
  • What it does
  • Jobs mirrored on two Condor-Gs
  • If primary Condor-G crashes, secondary one starts
    running jobs
  • On recovery, primary Condor-G gets job status
    from secondary one
  • Removes Condor-G submit point as single point of
    failure

15
Mirroring Example
[Diagram: Jobs are mirrored on Condor-G 1 and Condor-G 2; Condor-G 1 has crashed (marked with an X), so Condor-G 2 runs the jobs on the execute machine.]
16
Mirroring Example
[Diagram: after recovery, Condor-G 1 gets the job status back from Condor-G 2 while the jobs continue on the execute machine.]
17
Load-Balancing
  • What it does
  • Front-end Condor-G distributes all jobs among
    several back-end Condor-Gs
  • Front-end Condor-G keeps updated job status
  • Improves scalability
  • Maintains single submit point for users

18
Load-Balancing Example
[Diagram: a front-end Condor-G distributes jobs across back-end Condor-Gs 1, 2, and 3.]
19
Glide-In
  • Schedd and Startd are separate services that do
    not require any special privileges
  • Thus we can submit them as jobs! (see the sketch
    after this list)
  • Glide-In Schedd
  • What it does
  • Drop a Condor-G onto the front-end machine of a
    remote cluster
  • Delegate jobs to the cluster through the glide-in
    schedd
  • Can apply cluster-specific policies to jobs
  • Not fork-and-forget
  • Send a manager to the site, instead of managing
    across the Internet
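A minimal sketch of what such a glide-in submission might look like, written in the same gt2-style submit syntax as the sample submit file later in the talk. The wrapper script, tarball, config file, and host names here are hypothetical stand-ins, not the project's actual glide-in tooling:

    # Sketch only: glidein_startup.sh, condor_binaries.tar.gz,
    # glidein_condor_config, and frontend.example.edu are hypothetical.
    universe        = grid
    grid_type       = gt2
    globusscheduler = frontend.example.edu/jobmanager-pbs
    # hypothetical wrapper that unpacks the Condor binaries and starts condor_startd
    executable      = glidein_startup.sh
    transfer_input_files = condor_binaries.tar.gz, glidein_condor_config
    output          = glidein.out
    error           = glidein.err
    log             = glidein.log
    queue

The idea is that once the glided-in daemon starts on the remote resource and reports back to the submitter's pool, it can be matched and managed like any other Condor daemon.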

20
Glide-In Schedd Example
Frontend
Middleware
Jobs
Condor-G
Jobs
Batch System
21
Glide-In Startd Example
Frontend
Middleware
Batch System
Condor-G (Schedd)
Startd
Job
22
Glide-In Startd
  • Why?
  • Restores all the benefits that may have been
    washed away by the middleware
  • End-to-end management solution
  • Preserves job semantic guarantees
  • Preserves policy
  • Enables lazy planning

23
Sample Job Submit file
  • universe = grid
  • grid_type = gt2
  • globusscheduler = cluster1.cs.wisc.edu/jobmanager-lsf
  • executable = find_particle
  • arguments = ...
  • output = ...
  • log = ...

But we want metascheduling
24
Represent grid clusters as ClassAds
  • ClassAds
  • are a set of uniquely named expressions; each
    expression is called an attribute and is an
    attribute name/value pair
  • combine query and data
  • extensible
  • semi-structured: no fixed schema (flexibility in
    an environment consisting of distributed
    administrative domains)
  • Designed for MatchMaking

25
  • Example of a ClassAd that could represent a
    compute cluster in a grid
  • Type "GridSite"
  • Name "FermiComputeCluster"
  • Arch Intel-Linux
  • Gatekeeper_url "globus.fnal.gov/lsf"
  • Load
  • QueuedJobs 42
  • RunningJobs 200
  • Requirements ( other.Type "Job"
  • Load.QueuedJobs lt 100 )
  • GoodPeople "howard", "harry"
  • Rank member(other.Owner,
  • GoodPeople) 500

26
Another Sample - Job Submit
  • universe = grid
  • grid_type = gt2
  • owner = howard
  • executable = find_particle.$$(Arch)
  • requirements = other.Arch == "Intel-Linux" ||
    other.Arch == "Sparc-Solaris"
  • rank = 0 - other.Load.QueuedJobs
  • globusscheduler = $$(gatekeeper_url)

Note: We introduced augmentation of the job
ClassAd based upon information discovered in its
matching resource ClassAd.
27
Multi-Hop Grid Scheduling
  • Match a job to a Virtual Organization (VO), then
    to a resource within that VO
  • Easier to schedule jobs across multiple VOs and
    grids

28
Multi-Hop Grid Scheduling Example
Experiment Resource Broker
VO Resource Broker
Experiment Condor-G
VO Condor-G
HEP
CMS
Globus GRAM
Batch Scheduler
29
Endless Possibilities
  • These new models can be combined with each other
    or with other new models
  • Resulting system can be arbitrarily sophisticated

30
Job Delegation Challenges
  • New complexity introduces new issues and
    exacerbates existing ones
  • A few
  • Transparency
  • Representation
  • Scheduling Control
  • Active Job Control
  • Revocation
  • Error Handling and Debugging

31
Transparency
  • Full information about the job should be
    available to the user
  • Information from full delegation path
  • No manual tracing across multiple machines
  • Users need to know what's happening with their
    jobs

32
Representation
  • Job state is a vector
  • How best to show this to the user?
  • Summary
  • Current delegation endpoint
  • Job state at endpoint
  • Full information available if desired
  • Series of nested ClassAds? (sketched below)
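Purely as an illustration of the nested-ClassAds idea (the attribute names and host names below are hypothetical, not existing Condor attributes), each hop in the delegation path could wrap the ad it received from the next hop:

    [
       JobStatus      = "Running";            // summary at the submit point
       DelegatedTo    = "vo-condorg.example.org";
       DelegatedJobAd =                       // the next hop's view, nested
         [
            JobStatus      = "Running";
            DelegatedTo    = "gram.site.example.edu/jobmanager-lsf";
            DelegatedJobAd =
              [
                 JobStatus    = "Running";
                 LocalBatchId = "lsf.12345";  // endpoint detail
              ];
         ];
    ]

The outermost ad gives the summary view; peeling off layers yields the full delegation path when the user asks for it.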

33
Scheduling Control
  • Avoid loops in delegation path
  • Give user control of scheduling
  • Allow limiting of delegation path length?
  • Allow user to specify part or all of the
    delegation path (see the sketch below)
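For illustration only, such user control could be carried as extra attributes in the job ad; a submit file can add arbitrary attributes with a leading '+', though the specific attribute names below are hypothetical and nothing described here enforces them:

    # hypothetical attributes a user might set to constrain delegation
    +MaxDelegationDepth = 3
    +DelegationPath     = "condorg-frontend.example.org, vo-broker.example.org"

The point is only where such policy could live; the schedulers along the path would still have to agree to honor it.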

34
Active Job Control
  • User may request certain actions
  • hold, suspend, vacate, checkpoint
  • Actions cannot be completed synchronously for
    the user
  • Must forward along delegation path
  • User checks completion later

35
Active Job Control (cont)
  • Endpoint systems may not support actions
  • If possible, execute them at the furthest point
    that does support them
  • Allow user to apply action in middle of
    delegation path

36
Revocation
  • Leases (see the sketch below)
  • Lease must be renewed periodically for delegation
    to remain valid
  • Allows revocation during long-term failures
  • What are good values for lease lifetime and
    update interval?
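A purely illustrative sketch of the lease idea (all attribute names here are hypothetical): the upstream scheduler refreshes the lease every UpdateInterval seconds, and the downstream party drops the delegation once LeaseExpiration passes without a renewal:

    [
       LeaseDuration   = 3600;   // delegation stays valid one hour past the last renewal
       UpdateInterval  = 600;    // upstream tries to renew every ten minutes
       LeaseExpiration = LastRenewalTime + LeaseDuration;  // LastRenewalTime kept by the downstream party
    ]

Short leases bound how long an orphaned delegation can linger after a long-term failure; long leases reduce renewal traffic and tolerate brief outages, which is the trade-off behind the question above.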

37
Error Handling and Debugging
  • Many more places for things to go horribly wrong
  • Need clear, simple error semantics
  • Logs, logs, logs
  • Have them everywhere

38
From earlier
  • Transfer of responsibility to schedule and
    execute a job
  • Transfer policy instructions
  • Stage in executable and data files
  • Securely transfer (and refresh?) credentials,
    obtain local identities
  • Monitor and present job progress (transparency!)
  • Return results

39
Job Failure Policy Expressions
  • Condor/Condor-G augmented so users can supply
    job failure policy expressions in the submit
    file.
  • Can be used to describe a successful run, or what
    to do in the face of failure.
  • on_exit_remove = <expression>
  • on_exit_hold = <expression>
  • periodic_remove = <expression>
  • periodic_hold = <expression>

40
Job Failure Policy Examples
  • Do not remove from the queue (i.e. reschedule) if
    it exits with a signal
  • on_exit_remove = (ExitBySignal == False)
  • Place on hold if it exits with nonzero status or
    ran for less than an hour
  • on_exit_hold = ((ExitBySignal == False) &&
    (ExitSignal != 0)) || ((ServerStartTime -
    JobStartDate) < 3600)
  • Place on hold if the job has spent more than 50%
    of its time suspended
  • periodic_hold = CumulativeSuspensionTime >
    (RemoteWallClockTime / 2.0)

41
Data Placement (DaP) must be an integral part of
the end-to-end solution
Space management and Data transfer

42
Stork
  • A scheduler for data placement activities in the
    Grid
  • What Condor is for computational jobs, Stork is
    for data placement
  • Stork introduces a new concept:
  • Make data placement a first-class citizen in the
    Grid.

43
[Diagram: Data Placement Jobs and Computational Jobs.]
44
DAG with DaP
[Diagram: a DAG specification in which data placement (DaP) nodes appear alongside compute nodes.]
45
Why Stork?
  • Stork understands the characteristics and
    semantics of data placement jobs.
  • Can make smart scheduling decisions, for reliable
    and efficient data placement.

46
Failure Recovery and Efficient Resource
Utilization
  • Fault tolerance
  • Just submit a bunch of data placement jobs, and
    then go away.
  • Control the number of concurrent transfers from/to
    any storage system
  • Prevents overloading
  • Space allocation and de-allocation
  • Make sure space is available

47
Support for Heterogeneity
Protocol translation using Stork memory buffer.
48
Support for Heterogeneity
Protocol translation using Stork Disk Cache.
49
Flexible Job Representation and Multilevel Policy
Support
  • Type = "Transfer"
  • Src_Url = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat"
  • Dest_Url = "nest://turkey.cs.wisc.edu/kosart/x.dat"
  • Max_Retry = 10
  • Restart_in = "2 hours"

50
Run-time Adaptation
  • Dynamic protocol selection
  • dap_type = "transfer"
  • src_url = "drouter://slic04.sdsc.edu/tmp/test.dat"
  • dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/test.dat"
  • alt_protocols = "nest-nest, gsiftp-gsiftp"
  • dap_type = "transfer"
  • src_url = "any://slic04.sdsc.edu/tmp/test.dat"
  • dest_url = "any://quest2.ncsa.uiuc.edu/tmp/test.dat"

51
Run-time Adaptation
  • Run-time Protocol Auto-tuning
  • link = "slic04.sdsc.edu - quest2.ncsa.uiuc.edu"
  • protocol = "gsiftp"
  • bs = 1024KB      // block size
  • tcp_bs = 1024KB  // TCP buffer size
  • p = 4            // number of parallel streams

52
[Diagram: the end-to-end stack: Application and Planner on top, DAGMan below, computational jobs flowing through Condor-G, GRAM, and StartD (with Parrot for remote I/O), and data placement flowing through Stork, RFT, and GridFTP.]
53
Thank You!
  • Questions?