New Ways to Fetch Work

Slides: 24
Provided by: Der23
Transcript and Presenter's Notes

1
New Ways to Fetch Work
  • The new hook infrastructure in Condor 7.1.

2
What's the problem?
  • Users wanted to take advantage of Condor's
    resource management daemon (condor_startd) to run
    jobs, but they had their own scheduling system.
  • Specialized scheduling needs
  • Jobs live in their own database or in storage
    other than a Condor job queue

3
Fetch vs. push
  • Instead of trying to get these jobs into a
    condor_schedd, or trying to push them to the
    condor_startd, just get the condor_startd to
    fetch (pull) the work
  • Lower latency, since it skips the overhead of
    matchmaking and the schedd
  • Fetching only requires an outbound network
    connection, which makes life easier if you
    glide in behind a firewall

4
What's the dumb solution?
  • Put code directly into the condor_startd that can
    talk directly to the other scheduling system(s)
  • We'd have to support other protocols
  • We'd have to link even more libraries and
    dependencies into our code
  • Very inflexible

5
Another dumb solution
  • Make it a web service!
  • Mostly the same problems
  • What protocol?
  • What format to describe the jobs?
  • Add a dependency on libCurl?
  • What if I don't want a webserver to be handling
    my jobs?
  • Security? Authentication? Privacy?

6
Our solution (hopefully not dumb)
  • Make a system of hooks that you can plug into
  • A hook is a point during the life-cycle of a job
    where the Condor daemons will invoke an external
    program
  • The hook invocation points have to be hard-coded
    into Condor, but then anyone can implement their
    own hooks to do what they want

7
Why isn't that dumb?
  • All the logic, code, libraries, etc., to fetch
    jobs from any given system live completely
    outside of the Condor source and binaries
  • New hooks can be installed without a new version
    of Condor
  • No new library dependencies for us
  • Hooks are written by people who know what they're
    doing

8
How does Condor communicate with hooks?
  • Passing around ASCII ClassAds via standard input
    and standard output
  • Some hooks get control data via a command-line
    argument (argv)
  • Hooks can be written in any language (scripts,
    binaries, whatever you want) so long as you can
    read STDIN and write STDOUT
  • Decades of UNIX wisdom can't be wrong!
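
As a rough illustration of that STDIN/STDOUT convention, here is a minimal Python sketch that reads and writes ASCII ClassAds as `Attribute = value` lines. The helper names (`parse_classads`, `format_classad`) are made up for this example, and the `-----` line separating two ads on STDIN is an assumption to check against the manual for your Condor version. Values are kept as raw ClassAd expression text, not evaluated.

```python
AD_SEPARATOR = "-----"  # assumption: delimiter line between two ads on STDIN


def parse_classads(text):
    """Split hook input into a list of {attribute: raw value text} dicts."""
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(AD_SEPARATOR):
            # End of one ad; start collecting the next one.
            ads.append(current)
            current = {}
        elif "=" in line:
            # Split on the first "=" only, so expressions such as
            # Requirements = (Arch == "X86_64") stay intact in the value.
            name, _, value = line.partition("=")
            current[name.strip()] = value.strip()
    if current:
        ads.append(current)
    return ads


def format_classad(ad):
    """Render a dict of raw value text back into `Attribute = value` lines."""
    return "".join(f"{name} = {value}\n" for name, value in ad.items())
```

A hook script would typically call `parse_classads(sys.stdin.read())` once at startup and, if it has anything to say, print `format_classad(...)` to STDOUT.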

9
What hooks are available?
  • Hooks for fetching work (condor_startd):
    FETCH_WORK, REPLY_FETCH, EVICT_CLAIM
  • Hooks for running jobs (condor_starter):
    PREPARE_JOB, UPDATE_JOB_INFO, JOB_EXIT

10
HOOK_FETCH_WORK
  • Invoked by the startd whenever it wants to try to
    fetch new work
  • How often that happens is set by the
    FetchWorkDelay expression
  • Hook gets a current copy of the slot ClassAd
  • Hook prints the job ClassAd to STDOUT
  • If STDOUT is empty, there's no work
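
A fetch hook along these lines might look like the following Python sketch. It is only an outline under stated assumptions: the slot ClassAd is assumed to arrive on STDIN, the `pending` list stands in for whatever external system holds your jobs, and the two attributes written (`Cmd`, `Owner`) are just examples, so check the manual for the attributes a job ClassAd actually needs.

```python
def classad_string(value):
    """Render a Python string as a quoted ClassAd string literal.

    Quote escaping is left out of this sketch, so such values are rejected.
    """
    if '"' in value:
        raise ValueError("embedded quotes are not handled in this sketch")
    return '"' + value + '"'


def fetch_work(slot_ad_text, pending, out):
    """Emit one job ClassAd on `out`; emit nothing when no work is pending.

    slot_ad_text is the slot ClassAd supplied by the startd. A real hook
    could use it to pick a suitable job; this sketch ignores it.
    """
    if not pending:
        return False  # empty output tells the startd there is no work
    job = pending.pop(0)
    out.write("Cmd = " + classad_string(job["cmd"]) + "\n")
    out.write("Owner = " + classad_string(job["owner"]) + "\n")
    return True


# A real hook script would end with something like:
#   fetch_work(sys.stdin.read(), load_pending_jobs(), sys.stdout)
# where load_pending_jobs() is a hypothetical query against your own system.
```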

11
HOOK_REPLY_FETCH
  • Invoked by the startd once it decides what to do
    with the job ClassAd returned by HOOK_FETCH_WORK
  • Gives your external system a chance to know what
    happened
  • argv[1]: accept or reject
  • Gets a copy of slot and job ClassAds
  • Condor ignores all output
  • Optional hook

12
HOOK_EVICT_CLAIM
  • Invoked if the startd has to evict a claim that's
    running fetched work
  • Informational only: you can't stop or delay this
    train once it's left the station
  • STDIN: both slot and job ClassAds
  • STDOUT: > /dev/null

13
HOOK_PREPARE_JOB
  • Invoked by the condor_starter when it first
    starts up (only if defined)
  • Opportunity to prepare the job execution
    environment
  • Transfer input files, executables, etc.
  • INPUT: both slot and job ClassAds
  • OUTPUT: ignored, but the starter won't continue
    until this hook exits
  • Not specific to fetched work

14
HOOK_UPDATE_JOB_INFO
  • Periodically invoked by the starter to let you
    know what's happening with the job
  • INPUT: both ClassAds
  • Job ClassAd is updated with additional attributes
    computed by the starter (ImageSize, JobState,
    RemoteUserCpu, etc.)
  • OUTPUT: ignored

15
HOOK_JOB_EXIT
  • Invoked by the starter whenever the job exits for
    any reason
  • argv[1] indicates what happened:
  • exit: died a natural death
  • evict: booted off prematurely by the startd
    (PREEMPT == TRUE, condor_off, etc.)
  • remove: removed by condor_rm
  • hold: held by condor_hold

16
HOOK_JOB_EXIT
  • HUH!?! condor_rm? What are you talking about?
  • The starter hooks can be defined even for regular
    Condor jobs, local universe, etc.
  • INPUT: copy of the job ClassAd with extra
    attributes about what happened (ExitCode,
    JobDuration, etc.)
  • OUTPUT: ignored
  • Except for dumb exceptions: the schedd doesn't
    (yet) distinguish rm vs. hold when telling the
    starter to go away. Argh!
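
To make the argv[1] plus STDIN contract concrete, here is a small Python sketch of what a job-exit hook could do with its inputs. The function names and the "requeue only evicted work" policy are hypothetical choices for this example. The attribute names `ExitCode` and `JobDuration` come from the slide above, and values are kept as raw text.

```python
KNOWN_REASONS = ("exit", "evict", "remove", "hold")


def parse_classad(text):
    """Parse `Attribute = value` lines into a dict of raw value strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


def summarize_exit(argv, job_ad_text):
    """Combine the exit reason from argv[1] with attributes from the job ad."""
    reason = argv[1] if len(argv) > 1 else None
    if reason not in KNOWN_REASONS:
        raise ValueError(f"unexpected exit reason: {reason!r}")
    ad = parse_classad(job_ad_text)
    return {
        "reason": reason,
        "exit_code": ad.get("ExitCode"),    # raw text; None if absent
        "duration": ad.get("JobDuration"),  # raw text; None if absent
        # Hypothetical policy: only evicted work goes back into the queue.
        "requeue": reason == "evict",
    }
```

A real hook would call `summarize_exit(sys.argv, sys.stdin.read())` and then report the result to the external scheduling system. Anything it prints is ignored.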

17
Defining hooks
  • Each slot can have its own hook keyword, which
    is the prefix for its config file parameters
  • Can use different sets of hooks to talk to
    different external systems on each slot
  • Global keyword used when the per-slot keyword is
    not defined
  • Keyword is inserted by the startd into its copy
    of the job ClassAd and given to the starter

18
Defining hooks example
  • Most slots fetch work from the database system
  • STARTD_JOB_HOOK_KEYWORD = DATABASE
  • Slot4 fetches and runs work from a web service
  • SLOT4_JOB_HOOK_KEYWORD = WEB
  • The database system needs to both provide work
    and know the reply for each attempted claim
  • DB_DIR = /usr/local/condor/fetch/db
  • DATABASE_HOOK_FETCH_WORK = $(DB_DIR)/fetch_work.php
  • DATABASE_HOOK_REPLY_FETCH = $(DB_DIR)/reply_fetch.php
  • The web system only needs to fetch work
  • WEB_DIR = /usr/local/condor/fetch/web
  • WEB_HOOK_FETCH_WORK = $(WEB_DIR)/fetch_work.php

19
Semantics of fetched jobs
  • The condor_startd treats them just like any other
    kind of job
  • All the standard resource policy expressions
    apply (START, SUSPEND, PREEMPT, RANK, etc.)
  • Fetched jobs can coexist in the same pool with
    jobs pushed by Condor, COD, etc.
  • Fetched work != Backfill

20
Semantics continued
  • If the startd is unclaimed and fetches a job, a
    claim is created
  • If that job completes, the claim is reused and
    the startd fetches again
  • Keep fetching until either:
  • The claim is evicted by Condor
  • The fetch hook returns no more work
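
The claim life-cycle above can be pictured as a simple loop. The Python sketch below is a toy model, not the startd's logic: `fetch`, `run`, and `is_evicted` are hypothetical callables, and eviction is only checked between jobs, whereas a real claim can be evicted while a job is running.

```python
def run_claim(fetch, run, is_evicted):
    """Model of one claim: fetch and run jobs until eviction or no more work."""
    completed = []
    while not is_evicted():
        job = fetch()
        if job is None:
            # The fetch hook returned no work, so the claim ends here.
            break
        run(job)
        completed.append(job)
    return completed
```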

21
Limitations for fetched jobs
  • No schedd/shadow means no standard universe for
    checkpointing, migration, and remote system calls
  • Could use stand-alone checkpointing
  • Application-specific checkpointing
  • Other features that are unavailable
  • User policy expressions (e.g. periodic hold)
  • No DAGMan (you're on your own)

22
Limitations of the hooks
  • If the starter can't run your fetched job because
    your ClassAd is bogus, no hook is invoked to tell
    you about it
  • We need a HOOK_STARTER_FAILURE
  • No hook when the starter is about to evict you
    (so you can checkpoint)
  • Can implement this yourself with a wrapper script
    and the SoftKillSig attribute
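
One way such a wrapper could look is sketched below in Python. It assumes the job ClassAd's SoftKillSig attribute is set to the signal the wrapper listens for (SIGTERM is used as the default here purely for illustration), and `checkpoint` is a hypothetical callback that saves whatever application state you care about. The wrapper, not the real job, would be what the job ClassAd names as its executable.

```python
import signal
import subprocess


def make_soft_kill_handler(child, checkpoint):
    """Build a signal handler that checkpoints, then stops the wrapped job."""
    def handler(signum, frame):
        checkpoint()        # hypothetical: save application state
        child.terminate()   # then ask the real job to exit
    return handler


def run_wrapped(cmd, checkpoint, soft_kill_signal=signal.SIGTERM):
    """Run cmd as a child process; checkpoint if the soft-kill signal arrives."""
    child = subprocess.Popen(cmd)
    previous = signal.signal(soft_kill_signal,
                             make_soft_kill_handler(child, checkpoint))
    try:
        return child.wait()
    finally:
        # Restore the old handler if Python knows what it was.
        if previous is not None:
            signal.signal(soft_kill_signal, previous)
```

Keeping the handler in its own factory function makes it easy to exercise without sending real signals. Note that `signal.signal` must be called from the main thread.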

23
More information
  • New section in the Condor 7.1 manual
  • Chapter 4: Miscellaneous Concepts
  • 4.4 Job Hooks
  • http://www.cs.wisc.edu/condor/manual/v7.1/4_4Job_Hooks.html
  • Any questions?