Condor Tutorial for Users, INFN-Bologna, 6/29/99

Condor Tutorial for Users, INFN-Bologna, 6/29/99
  • Derek Wright
  • Computer Sciences Department
  • University of Wisconsin-Madison

Conventions Used In This Presentation
  • A slide with an all-yellow background is the
    beginning of a new chapter
  • The slides after it will describe each entry on
    the yellow slide in great detail
  • A Condor tool that users would use will be in red
  • A ClassAd attribute name will be in blue
  • A UNIX shell command or file name will be in
    courier font

What is Condor?
  • A system for High-Throughput Computing
  • Lots of jobs over a long period of time, not a
    short burst of high-performance computing
  • Condor manages both resources (machines) and
    resource requests (jobs)
  • Supports additional features for jobs that are
    re-linked with Condor libraries
  • checkpointing
  • remote system calls

What's Condor Good For?
  • Managing a large number of jobs
  • You specify the jobs in a file and submit them to
    Condor, which runs them all and sends you email
    when they complete
  • Mechanisms to help you manage huge numbers of
    jobs (1000s), all the data, etc.
  • Condor can handle inter-job dependencies (DAGMan)

What's Condor Good For? (cont'd)
  • Robustness
  • Checkpointing allows guaranteed forward progress
    of your jobs, even jobs that run for weeks
  • If an execute machine crashes, you only lose
    work done since the last checkpoint
  • Condor maintains a persistent job queue - if the
    submit machine crashes, Condor will recover

What's Condor Good For? (cont'd)
  • Giving you access to more computing resources
  • Checkpointing allows your job to run on
    opportunistic resources (not dedicated)
  • Checkpointing also provides migration - if a
    machine is no longer available, move!
  • With remote system calls, you don't even need an
    account on a machine where your job executes

What is a Condor Pool?
  • Pool can be a single machine, or a group of
    machines
  • Determined by a central manager - the
    matchmaker and centralized information repository
  • Each machine runs various daemons to provide
    different services, either to the users who
    submit jobs, the machine owners, or the pool

What Kind of Job Do You Have?
  • You must know some things about your job to
    decide if and how it will work with Condor
  • What kind of I/O does it do?
  • Does it use TCP/IP? (network sockets)
  • Can the job be resumed?
  • Is the job multi-process? (fork(),
    pvm_addhost(), etc.)

What Kind of I/O Does Your Job Do?
  • Interactive TTY
  • Batch TTY (just reads from STDIN and writes to
    STDOUT or STDERR, but you can redirect these
    to/from files)
  • X Windows
  • NFS, AFS, or another network file system
  • Local file system
  • TCP/IP

What Does Condor Support?
  • Condor can support various combinations of these
    features in different Universes
  • Different Universes provide different
    functionality for your job
  • Vanilla
  • Standard
  • Scheduler
  • PVM

Condor Universes
  • A Universe specifies a Condor runtime environment
  • Standard
  • Supports checkpointing
  • Supports remote system calls
  • Has some limitations (no fork(), socket(), etc.)
  • Vanilla
  • Any Unix executable (shell scripts, etc.)
  • No Condor checkpointing or remote I/O

Condor Universes (cont'd)
  • PVM (Parallel Virtual Machine)
  • Allows you to run parallel jobs in Condor (more
    on this later)
  • Scheduler
  • Special kind of Condor job: the job is run on the
    submit machine, not a remote execute machine
  • Job is automatically restarted if the
    condor_schedd is shut down
  • Used to schedule jobs (e.g. DAGMan)

Submitting Jobs to Condor
  • Choosing a Universe for your job (already
    covered this)
  • Preparing your job
  • Making it batch-ready
  • Re-linking if checkpointing and remote system
    calls are desired (condor_compile)
  • Creating a submit description file
  • Running condor_submit
  • Sends your request to the User Agent

Preparing Your Job
  • Making your job batch-ready
  • Must be able to run in the background no
    interactive input, windows, GUI, etc.
  • Can still use STDIN, STDOUT, and STDERR (the
    keyboard and the screen), but files are used for
    these instead of the actual devices
  • If your job expects input from the keyboard, you
    have to put the input you want into a file

Preparing Your Job (cont'd)
  • If you are going to use the standard universe
    with checkpointing and remote system calls, you
    must re-link your job with Condor's special
    libraries
  • To do this, you use condor_compile
  • Place condor_compile in front of the command
    you normally use to link your job

condor_compile gcc -o myjob myjob.c
Creating a Submit Description File
  • A plain ASCII text file
  • Tells Condor about your job
  • Which executable, universe, input, output and
    error files to use, command-line arguments,
    environment variables, any special requirements
    or preferences (more on this later)
  • Can describe many jobs at once (a cluster) each
    with different input, arguments, output, etc.

Example Submit Description File

# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe   = standard
Executable = /home/wright/condor/my_job.condor
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
Log        = my_job.log
Arguments  = -arg1 -arg2
InitialDir = /home/wright/condor/run_1
Queue

Example Submit Description File Described
  • Submits a single job to the standard universe,
    specifies files for STDIN, STDOUT and STDERR,
    creates a UserLog, defines command-line arguments,
    and specifies the directory the job should be run in
  • Equivalent to (if run outside of Condor):

cd /home/wright/condor/run_1
/home/wright/condor/my_job.condor -arg1 -arg2 \
  < my_job.stdin > my_job.stdout 2> my_job.stderr
Clusters and Processes
  • If your submit file describes multiple jobs, we
    call this a cluster
  • Each job within a cluster is called a process
    or proc
  • If you only specify one job, you still get a
    cluster, but it has only one process
  • A Condor Job ID is the cluster number, a
    period, and the process number (e.g. 23.5)
  • Process numbers always start at 0

Example Submit Description File for a Cluster

# Example condor_submit input file that defines
# a whole cluster of jobs at once
Universe   = standard
Executable = /home/wright/condor/my_job.condor
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
Log        = my_job.log
Arguments  = -arg1 -arg2
InitialDir = /home/wright/condor/run_$(Process)
Queue 500

Example Submit Description File for a Cluster - Described
  • Now, the initial directory for each job is
    specified with the $(Process) macro, and instead
    of submitting a single job, we use Queue 500 to
    submit 500 jobs at once
  • $(Process) will be expanded to the process number
    for each job in the cluster (from 0 up to 499 in
    this case), so we'll have run_0, run_1, ...,
    run_499 directories
  • All the input/output files will be in different
    directories

Running condor_submit
  • You give condor_submit the name of the submit
    file you have created
  • condor_submit parses the file and creates a
    ClassAd that describes your job(s)
  • Creates the files you specified for STDOUT and
    STDERR
  • Sends your job's ClassAd(s) and executable to the
    condor_schedd, which stores the job in its queue

Monitoring Your Jobs
  • Using condor_q
  • Using a User Log file
  • Using condor_status
  • Using condor_rm
  • Getting email from Condor
  • Once they complete, you can use condor_history to
    examine them

Using condor_q
  • To view the jobs you have submitted, you use
    condor_q
  • Displays the status of your job, how much compute
    time it has accumulated, etc.
  • Many different options
  • A single job, a single cluster, all jobs that
    match a certain constraint, or all jobs
  • Can view remote job queues (either individual
    queues, or -global)
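
As a sketch of what those invocations look like (the job IDs and the owner in the constraint are hypothetical; these need a running Condor pool to do anything):

```
condor_q 23.5                             # a single job
condor_q 23                               # a whole cluster
condor_q -constraint 'Owner == "wright"'  # jobs matching a constraint
condor_q -global                          # every job queue in the pool
```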

Using a User Log file
  • A UserLog must be specified in your submit file
  • Log = filename
  • You get a log entry for everything that happens
    to your job
  • When it was submitted, when it starts executing,
    if it is checkpointed or vacated, if there are
    any problems, etc.
  • Very useful! Highly recommended!

Using condor_status
  • To view the status of the whole Condor pool, you
    use condor_status
  • Can use the -run option to see which machines
    are running jobs, as well as
  • The user who submitted each job
  • The machine they submitted from
  • Can also view the status of various submitters
    with -submitter <name>

Using condor_rm
  • If you want to remove a job from the Condor
    queue, you use condor_rm
  • You can only remove jobs that you own (you can't
    run condor_rm on someone else's jobs unless you
    are root)
  • You can give specific job IDs (cluster or
    cluster.proc), or you can remove all of your jobs
    with the -a option.

Getting Email from Condor
  • By default, Condor will send you email when your
    job completes
  • If you don't want this email, put this in your
    submit file:
  • notification = never
  • If you want email every time something happens to
    your job (checkpoint, exit, etc.), use this:
  • notification = always

Getting Email from Condor (cont'd)
  • If you only want email if your job exits with an
    error, use this:
  • notification = error
  • By default, the email is sent to your account on
    the host you submitted from. If you want the
    email to go to a different address, use this:
  • notify_user = <email address>
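
Putting the two commands together, a submit file fragment that only mails on errors, and sends that mail elsewhere, might look like this (the address is a placeholder, not from the original slides):

```
# Hypothetical fragment of a submit description file
notification = error
# placeholder address -- substitute your own
notify_user  = wright@example.org
```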

Using condor_history
  • Once your job completes, it will no longer show
    up in condor_q
  • Now, you must use condor_history to view the
    job's ClassAd
  • The status field (ST) will have either a "C"
    for completed, or an "X" if the job was removed
    with condor_rm

Any questions?
  • Nothing is too basic
  • If I was unclear, you probably are not the only
    person who doesn't understand, and the rest of
    the day will be even more confusing

Hands-On Exercise 1 Submitting and Monitoring a
Simple Test Job
Hands-On Exercise 1
  • Login to your machine as user condor
  • You will see two windows
  • Netscape, with instructions
  • An xterm, where you execute commands
  • To begin, click on Simple Test Job
  • Please follow the directions carefully
  • Any lines shown at the shell prompt are commands
    that you should execute in your xterm
  • If you accidentally exit Netscape, click on
    Tutorial in the Start menu

Lunch break
  • Please be back by 1330

Welcome Back
Classified Advertisements
  • ClassAds
  • Language for expressing attributes
  • Semantics for evaluating them
  • Intuitively, a ClassAd is a set of named
    expressions
  • Each named expression is an attribute
  • Expressions are similar to C
  • Constants, attribute references, operators

Classified Advertisements Example
  • MyType = "Machine"
  • TargetType = "Job"
  • Name = ""
  • StartdIpAddr = "<128.105.73.44:33846>"
  • Arch = "INTEL"
  • OpSys = "SOLARIS26"
  • VirtualMemory = 225312
  • Disk = 35957
  • KFlops = 21058
  • Mips = 103
  • LoadAvg = 0.011719
  • KeyboardIdle = 12
  • Cpus = 1
  • Memory = 128
  • Requirements = LoadAvg < 0.300000 &&
    KeyboardIdle > 15 * 60
  • Rank = 0

Classified Advertisements Matching
  • ClassAds are always considered in pairs
  • Does ClassAd A match ClassAd B (and vice versa)?
  • This is called 2-way matching
  • If the same attribute appears in both ClassAds,
    you can specify which attribute you mean by
    putting MY. or TARGET. in front of the
    attribute name

Classified Advertisements Examples
  • ClassAd A
  • MyType = "Apartment"
  • TargetType = "ApartmentRenter"
  • SquareArea = 3500
  • RentOffer = 1000
  • HeatIncluded = False
  • OnBusLine = True
  • Rank = (UnderGrad == False)
  • Requirements = (MY.RentOffer -
    TARGET.RentOffer) < 150
  • ClassAd B
  • MyType = "ApartmentRenter"
  • TargetType = "Apartment"
  • UnderGrad = False
  • RentOffer = 900
  • Rank = 1/(TARGET.RentOffer * 100.0)
  • Requirements = OnBusLine &&
    SquareArea > 2700

ClassAds in the Condor System
  • ClassAds allow Condor to be a general system
  • Constraints and ranks on matches expressed by the
    entities themselves
  • Only priority logic is integrated into the
    system itself
  • All principal entities in the Condor system are
    represented by ClassAds
  • Machines, Jobs, Submitters

ClassAds in Condor Requirements and
Rank (Example for Machines)
  • Friend = (Owner == "tannenba" ||
    Owner == "wright")
  • ResearchGroup = (Owner == "jbasney" ||
    Owner == "raman")
  • Trusted = (Owner != "rival" &&
    Owner != "riffraff")
  • Requirements = Trusted && ( ResearchGroup ||
    (LoadAvg < 0.3 && KeyboardIdle > 15*60) )
  • Rank = Friend + ResearchGroup*10

Requirements for Machine Example Described
  • Machine will never start a job submitted by
    rival or riffraff
  • If someone from ResearchGroup (jbasney or
    raman) submits a job, it will always run,
    regardless of keyboard activity or load average
  • If anyone else submits a job, it will only run
    here if the keyboard has been idle for more than
    15 minutes and the load average is less than 0.3

Machine Rank Example Described
  • If the machine is running a job submitted by
    owner foo, it will give this a Rank of 0, since
    foo is neither a friend nor in the same research
    group
  • If wright or tannenba submits a job, it will
    be ranked at 1 (since Friend will evaluate to 1
    and ResearchGroup is 0)
  • If raman or jbasney submit a job, it will
    have a rank of 10
  • While a machine is running a job, it can be
    preempted by a higher-ranked job

ClassAds in Condor Requirements and
Rank (Example for Jobs)
  • Requirements = Arch == "INTEL" && OpSys ==
    "LINUX" && Memory > 20
  • Rank = (Memory > 32) * ( (Memory * 100) +
    (IsDedicated * 10000) + Mips )

Job Example Described
  • The job must run on an Intel CPU, running Linux,
    with at least 20 megs of RAM
  • All machines with 32 megs of RAM or less are
    Ranked at 0
  • Machines with more than 32 megs of RAM are ranked
    according to how much RAM they have, if the
    machine is dedicated (which counts a lot to this
    job!), and how fast the machine is, as measured
    in Million Instructions Per Second

Finding and Using the ClassAd Attributes in your Pool
  • Condor defines a number of attributes by default,
    which are listed in the User Manual (About
    Requirements and Rank)
  • To see if machines in your pool have other
    attributes defined, use
  • condor_status -long <hostname>
  • A custom-defined attribute might not be defined
    on all machines in your pool, so you'll probably
    want to use meta-operators

ClassAd Meta-Operators
  • Meta-operators allow you to compare against
    UNDEFINED as if it were a real value
  • =?= is meta-equal-to
  • =!= is meta-not-equal-to
  • Color != "Red" (non-meta) would evaluate to
    UNDEFINED if Color is not defined
  • Color =!= "Red" would evaluate to True if Color
    is not defined, since UNDEFINED is not "Red"
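
For example, a job can safely test a custom attribute that only some machines define. This is a sketch; HasFastScratch is a hypothetical custom attribute, not one Condor defines:

```
# =?= evaluates to False (rather than UNDEFINED) on
# machines that don't define HasFastScratch at all
Requirements = (HasFastScratch =?= True) && (OpSys == "SOLARIS26")
```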

Hands-On Exercise 2 Submitting Jobs with
Requirements and Rank
Hands-On Exercise 2
  • Please point your browser to the new page
  • Go back to the tutorial homepage
  • Click on Requirements and Rank
  • Again, read the instructions carefully and
    execute any commands shown at the shell prompt
    in your xterm
  • If you exited Netscape, just click on Tutorial
    from your Start menu

Priorities In Condor
  • Two kinds of priorities
  • User Priorities
  • Priorities between users in the pool, to ensure
    fair sharing of the available machines
  • The lower the value, the better the priority
  • Job Priorities
  • Priorities that users give to their own jobs to
    determine the order in which they will run
  • The higher the value, the better the priority
  • Only matters within a given user's jobs

User Priorities in Condor
  • Each active user in the pool has a user priority
  • Viewed or changed with condor_userprio
  • The lower the number, the better
  • A given user's share of available machines is
    inversely related to the ratio between user
    priority values
  • Example: Fred's priority is 10, Joe's is 20.
    Fred will be allocated twice as many machines as
    Joe.

User Priorities in Condor, cont.
  • Condor continuously adjusts user priorities over
    time
  • machines allocated > priority: priority worsens
  • machines allocated < priority: priority improves
  • Priority Preemption
  • Higher priority users will grab machines away
    from lower priority users (thanks to
    checkpointing)
  • Starvation is prevented
  • Priority thrashing is prevented

Job Priorities in Condor
  • Can be set at submit-time in your description
    file with
  • prio = <number>
  • Can be viewed with condor_q
  • Can be changed at any time with condor_prio
  • The higher the number, the more likely the job
    will run (only among the jobs of an individual
    user)

Managing a Large Cluster of Jobs
  • Condor can manage huge numbers of jobs
  • Special features of the submit description file
    make this easier
  • Condor can also manage inter-job dependencies
    with condor_dagman
  • For example: job A should run first; then run
    jobs B and C; when those finish, submit D; etc.
  • We'll discuss DAGMan later
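
The A-then-B-and-C-then-D example above could be sketched as a DAGMan input file (the submit-file names are hypothetical):

```
# Hypothetical DAG: A runs first, then B and C in
# parallel; D is submitted once both B and C finish
Job A A.submit
Job B B.submit
Job C C.submit
Job D D.submit
PARENT A CHILD B C
PARENT B C CHILD D
```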

Submitting a Large Cluster
  • Anywhere in your submit file, if you use
    $(Process), that will expand to the process
    number of each job in the cluster
  • input = my_input.$(Process)
  • arguments = $(Process)
  • It is common to use $(Process) to specify
    InitialDir, so that each process runs in its own
    directory
  • InitialDir = dir.$(Process)

Submitting a Large Cluster (cont'd)
  • Can either have multiple Queue entries, or put a
    number after Queue to tell Condor how many jobs
    to submit
  • Queue 1000
  • A cluster is more efficient: your jobs will run
    faster, and they'll use less space
  • Can only have one executable per cluster
    Different executables must be different clusters!
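
A sketch of the multiple-Queue-entries form (executable name and arguments are hypothetical): each Queue statement adds one process to the same cluster, using whatever settings are in effect at that point.

```
# Hypothetical submit file: one cluster, three procs,
# same executable, different arguments
Universe   = standard
Executable = my_job
Log        = my_job.log
Arguments  = -case 1
Queue
Arguments  = -case 2
Queue
Arguments  = -case 3
Queue
```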

Hands-On Exercise 3 Submitting a Large Cluster
of Jobs
Hands-On Exercise 3
  • Please point your browser to the new page
  • Go back to the tutorial homepage
  • Click on Large Clusters
  • Again, read the instructions carefully and
    execute any commands shown at the shell prompt
    in your xterm
  • If you exited Netscape, just click on Tutorial
    from your Start menu

10 Minute Break
  • Questions are welcome.

Inter-Job Dependencies with DAGMan
  • DAGMan can be used to handle a set of jobs that
    must be run in a certain order
  • Also provides pre and post operations, so you
    can have a program or script run before each job
    is submitted and after it completes
  • Robust: handles errors and submit-machine crashes

Using DAGMan
  • You define a DAG description file, which is
    similar in function to the submit file you give
    to condor_submit
  • DAGMan restrictions
  • Each job in the DAG must be in its own cluster
    (this is a limitation we will remove in a future
    version)
  • All jobs in the DAG must have a User Log and must
    share the same file

Format of the DAGMan Description File
  • A line beginning with # is a comment
  • First section names the jobs in your DAG and
    associates a submit description file with each
  • Second (optional) section defines PRE and POST
    scripts to run
  • Final section defines the job dependencies

Example DAGMan Description File

# Example DAGMan input file
Job A A.submit
Job B B.submit
Job C C.submit
Job D D.submit
Script PRE  D d_input_checker
Script POST A a_output_processor A.out
PARENT A CHILD B
Setting up a DAG for Condor
  • Must create the DAG description file
  • Must create all the submit description files for
    the individual jobs
  • Must prepare any executables you plan to use
  • If you want, you can have a mix of Vanilla and
    Standard jobs
  • Must set up any PRE/POST commands or scripts you
    wish to use

Submitting a DAG to Condor
  • Once you have everything in place, to submit a
    DAG, you use condor_submit_dag and give it the
    name of your DAG description file
  • This will check your input file for errors and
    submit a copy of condor_dagman as a scheduler
    universe job with all the necessary command-line
    arguments

Removing a DAG
  • Removing a DAG is easy
  • Just use condor_rm on the scheduler universe job
  • On shutdown, DAGMan will remove any jobs that are
    currently in the queue that are associated with
    its DAG
  • Once all jobs are gone, DAGMan itself will exit,
    and the scheduler universe job will be removed
    from the queue

Hands-On Exercise 4 Using DAGMan
Hands-On Exercise 4
  • Please point your browser to the new page
  • Go back to the tutorial homepage
  • Click on Using_DAGMan
  • Again, read the instructions carefully and
    execute any commands shown at the shell prompt
    in your xterm
  • If you exited Netscape, just click on Tutorial
    from your Start menu

What's Wrong with my Vanilla Job?
  • Special requirements expressions for vanilla jobs
  • You didn't submit it from a directory that is
    shared with the execute machines
  • Condor isn't running as root (more on this later)
  • You don't have your file permissions set up
    correctly (more on this later)

Special Requirements Expressions for Vanilla Jobs
  • When you submit a vanilla job, Condor
    automatically appends two extra Requirements
  • UID_DOMAIN == <submit_uid_domain>
  • FILESYSTEM_DOMAIN == <submit_fs>
  • Since there are no remote system calls with
    Vanilla jobs, they depend on a shared file system
    and a common UID space to run as you and access
    your files
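
As an illustration, with hypothetical domain names (the Memory clause stands in for whatever Requirements the job already had), the effect is roughly as if the submit file contained:

```
# Hypothetical: a vanilla job submitted from a machine in
# the "example.org" UID and filesystem domains
Requirements = (Memory > 20) && \
    (UID_DOMAIN == "example.org") && \
    (FILESYSTEM_DOMAIN == "example.org")
```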

Special Requirements Expressions for Vanilla Jobs (cont'd)
  • By default, each machine in your pool is in its
    own UID and filesystem domain; the pool
    administrator has to configure your pool
    specially if there really is a common UID space
    and a network file system
  • If you don't have an account on the remote
    system, Vanilla jobs won't work

Shared File Systems for Vanilla Jobs
  • Just because you have AFS or NFS doesn't mean ALL
    files are shared
  • Initialdir = /tmp will probably cause trouble for
    Vanilla jobs!
  • You must be sure to set Initialdir to a shared
    directory (or cd into it to run condor_submit)
    for Vanilla jobs

Why Don't My Jobs Run?
  • Try using condor_q -analyze
  • Try specifying a User Log for your job
  • Look at condor_userprio: maybe you have a bad
    priority and higher priority users are getting
    the machines
  • Problems with file permissions or network file
    systems
  • Look at the SchedLog

Using condor_q -analyze
  • condor_q -analyze will analyze your job's
    ClassAd, get all the ClassAds of the machines in
    the pool, and tell you what's going on
  • Will report errors in your Requirements
    expression (impossible to match, etc.)
  • Will tell you about user priorities in the pool
    (other people have better priority)

Looking at condor_userprio
  • You can look at condor_userprio yourself
  • If your priority value is a really high number
    (because you've been running a lot of Condor
    jobs), other users will have priority to run jobs
    in your pool

File Permissions in Condor
  • If Condor isn't running as root, the
    condor_shadow process runs as the user the
    condor_schedd is running as (usually condor)
  • You must grant this user write access to your
    output files, and read access to your input files
    (both STDOUT, STDIN from your submit file, as
    well as files your job explicitly opens)

File Permissions in Condor (contd)
  • Often, there will be a condor group and you can
    make your files owned and write-able by this
    group
  • For vanilla jobs, even if the UID_DOMAIN setting
    is correct, and they match for your submit and
    execute machines, if Condor isn't running as
    root, your job will be started as user condor,
    not as you!

Problems with NFS in Condor
  • For NFS, sometimes the administrators will set up
    read-only mounts, or have UIDs remapped for
    certain partitions (the classic example is root
    to nobody, but modern NFS can do arbitrary
    remapping)

Problems with NFS in Condor (contd)
  • If your pool uses NFS automounting, the directory
    that Condor thinks is your InitialDir (the
    directory you were in when you ran condor_submit)
    might not exist on a remote machine
  • E.g. you're in /mnt/tmp/home/me/...
  • With automounting, you always need to specify
    InitialDir explicitly
  • InitialDir /home/me/...

Problems with AFS in Condor
  • If your pool uses AFS, the condor_shadow, even if
    it's running with your UID, will not have your
    AFS token
  • You must grant an unauthenticated AFS user the
    appropriate access to your files
  • Some sites provide a better alternative than
    world-writable files
  • Host ACLs
  • Network-specific ACLs

Looking at the SchedLog
  • Looking at the log file of the condor_schedd, the
    SchedLog file, can possibly give you a clue if
    there are problems
  • Find it with
  • condor_config_val SCHEDD_LOG
  • You might need your pool administrator to turn on
    a higher debugging level to see more verbose
    output

Other User Features
  • Submit-Only installation
  • Heterogeneous Submit
  • PVM jobs

Submit-Only Installation
  • Can install just a condor_master and
    condor_schedd on your machine
  • Can submit jobs into a remote pool
  • Special option to condor_install

Heterogeneous Submit
  • The job you submit doesn't have to be the same
    platform as the machine you submit from
  • Maybe you have access to a pool that's full of
    Alphas, but you have a Sparc on your desk, and
    moving all your data is a pain
  • You can take an Alpha binary, copy it to your
    Sparc, and submit it with a requirements
    expression that says you need to run on ALPHA/OSF1
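
That cross-platform submit might be sketched like this (the binary name is hypothetical):

```
# Hypothetical: an Alpha/OSF1 binary submitted from a Sparc
Universe     = standard
Executable   = my_job.alpha-osf1
Requirements = (Arch == "ALPHA") && (OpSys == "OSF1")
Queue
```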

Parallel Jobs in Condor
  • Condor can run parallel applications
  • Written to the popular PVM message-passing
    library
  • Future work includes support for MPI
  • Master-Worker Paradigm
  • What does Condor-PVM do?
  • How to compile and submit Condor-PVM jobs

Master-Worker Paradigm
  • Condor-PVM is designed to run PVM applications
    which follow the master-worker paradigm.
  • Master
  • has a pool of work, sends pieces of work to the
    workers, manages the work and the workers
  • Worker
  • gets a piece of work, does the computation, sends
    the result back

What does Condor-PVM do?
  • Condor acts as the PVM resource manager.
  • All pvm_addhost requests get re-mapped to Condor.
  • Condor dynamically constructs PVM virtual
    machines out of non-dedicated desktop machines.
  • When a machine leaves the pool, the user gets
    notified via the normal PVM notification
    mechanisms

How to compile and submit Condor-PVM jobs
  • Binary Compatible
  • Compile and link with PVM library just as normal
    PVM applications. No need to link with Condor.
  • Submit
  • In the submit description file, set
  • universe = PVM
  • machine_count = <min>..<max>
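
A minimal Condor-PVM submit file might therefore look like this sketch (the executable name is hypothetical):

```
# Hypothetical Condor-PVM submit description file:
# start with 2 hosts, grow up to 8 as machines appear
universe      = PVM
executable    = my_pvm_master
machine_count = 2..8
Queue
```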

Obtaining Condor
  • Condor can be downloaded from the Condor web site
  • http//
  • Complete User's and Administrator's manual
  • http//
  • Contracted Support is available
  • Questions? Email