Title: Condor Tutorial for Users INFN-Bologna, 6/29/99
1. Condor Tutorial for Users (INFN-Bologna, 6/29/99)

- Derek Wright
- Computer Sciences Department
- University of Wisconsin-Madison
- wright@cs.wisc.edu
2. Conventions Used In This Presentation

- A slide with an all-yellow background is the beginning of a new chapter
- The slides after it will describe each entry on the yellow slide in great detail
- A Condor tool that users would use will be in red italics
- A ClassAd attribute name will be in blue
- A UNIX shell command or file name will be in courier font
3. What is Condor?

- A system for High-Throughput Computing
  - Lots of jobs over a long period of time, not a short burst of high performance
- Condor manages both resources (machines) and resource requests (jobs)
- Supports additional features for jobs that are re-linked with Condor libraries:
  - checkpointing
  - remote system calls
4. What's Condor Good For?

- Managing a large number of jobs
  - You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete
  - Mechanisms to help you manage huge numbers of jobs (1000s), all the data, etc.
  - Condor can handle inter-job dependencies (DAGMan)
5. What's Condor Good For? (cont'd)

- Robustness
  - Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion
  - If an execute machine crashes, you only lose work done since the last checkpoint
  - Condor maintains a persistent job queue; if the submit machine crashes, Condor will recover
6. What's Condor Good For? (cont'd)

- Giving you access to more computing resources
  - Checkpointing allows your job to run on opportunistic resources (not dedicated)
  - Checkpointing also provides migration: if a machine is no longer available, move!
  - With remote system calls, you don't even need an account on a machine where your job executes
7. What is a Condor Pool?

- A pool can be a single machine, or a group of machines
- Determined by a central manager: the matchmaker and centralized information repository
- Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself
8. What Kind of Job Do You Have?

- You must know some things about your job to decide if and how it will work with Condor
  - What kind of I/O does it do?
  - Does it use TCP/IP? (network sockets)
  - Can the job be resumed?
  - Is the job multi-process (fork(), pvm_addhost(), etc.)?
9. What Kind of I/O Does Your Job Do?

- Interactive TTY
- Batch TTY (just reads from STDIN and writes to STDOUT or STDERR, but you can redirect to/from files)
- X Windows
- NFS, AFS, or another network file system
- Local file system
- TCP/IP
10. What Does Condor Support?

- Condor can support various combinations of these features in different Universes
- Different Universes provide different functionality for your job:
  - Vanilla
  - Standard
  - Scheduler
  - PVM
11. What Does Condor Support? (summary table shown as a figure)
12. Condor Universes

- A Universe specifies a Condor runtime environment
- STANDARD
  - Supports Checkpointing
  - Supports Remote System Calls
  - Has some limitations (no fork(), socket(), etc.)
- VANILLA
  - Any Unix executable (shell scripts, etc.)
  - No Condor Checkpointing or Remote I/O
13. Condor Universes (cont'd)

- PVM (Parallel Virtual Machine)
  - Allows you to run parallel jobs in Condor (more on this later)
- SCHEDULER
  - Special kind of Condor job: the job is run on the submit machine, not a remote execute machine
  - Job is automatically restarted if the condor_schedd is shut down
  - Used to schedule jobs (e.g. DAGMan)
14. Submitting Jobs to Condor

- Choosing a Universe for your job (already covered)
- Preparing your job
  - Making it batch-ready
  - Re-linking if checkpointing and remote system calls are desired (condor_compile)
- Creating a submit description file
- Running condor_submit
  - Sends your request to the User Agent (condor_schedd)
15. Preparing Your Job

- Making your job batch-ready
  - Must be able to run in the background: no interactive input, windows, GUI, etc.
  - Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices
  - If your job expects input from the keyboard, you have to put the input you want into a file
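A minimal sketch of what "batch-ready" means in practice: keyboard input is pre-recorded into a file, and all three standard streams are redirected. Here `tr` stands in for a hypothetical job binary.

```shell
# Pre-record the input the job would have read from the keyboard
echo "hello condor" > my_job.stdin

# Run the "job" (tr, as a stand-in) fully detached from the terminal:
# input from a file, output and errors to files
tr 'a-z' 'A-Z' < my_job.stdin > my_job.stdout 2> my_job.stderr

cat my_job.stdout   # prints "HELLO CONDOR"
```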
16. Preparing Your Job (cont'd)

- If you are going to use the standard universe with checkpointing and remote system calls, you must re-link your job with Condor's special libraries
- To do this, you use condor_compile
- Place condor_compile in front of the command you normally use to link your job:

    condor_compile gcc -o myjob myjob.c
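condor_compile can also be placed in front of a whole build command rather than a single compiler invocation; for example, if your Makefile performs the final link step, you can wrap make itself:

```
condor_compile make
```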
17. Creating a Submit Description File

- A plain ASCII text file
- Tells Condor about your job
  - Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)
- Can describe many jobs at once (a cluster), each with different input, arguments, output, etc.
18. Example Submit Description File

    # Example condor_submit input file
    # (Lines beginning with # are comments)
    # NOTE: the words on the left side are not
    # case sensitive, but filenames are!
    Universe   = standard
    Executable = /home/wright/condor/my_job.condor
    Input      = my_job.stdin
    Output     = my_job.stdout
    Error      = my_job.stderr
    Log        = my_job.log
    Arguments  = -arg1 -arg2
    InitialDir = /home/wright/condor/run_1
    Queue
19. Example Submit Description File, Described

- Submits a single job to the standard universe; specifies files for STDIN, STDOUT and STDERR; creates a UserLog; defines command-line arguments; and specifies the directory the job should be run in
- Equivalent to (outside of Condor):

    cd /home/wright/condor/run_1
    /home/wright/condor/my_job.condor -arg1 -arg2 \
      > my_job.stdout 2> my_job.stderr \
      < my_job.stdin
20. Clusters and Processes

- If your submit file describes multiple jobs, we call this a "cluster"
- Each job within a cluster is called a "process" or "proc"
- If you only specify one job, you still get a cluster, but it has only one process
- A Condor "Job ID" is the cluster number, a period, and the process number (e.g. 23.5)
- Process numbers always start at 0
21. Example Submit Description File for a Cluster

    # Example condor_submit input file that defines
    # a whole cluster of jobs at once
    Universe   = standard
    Executable = /home/wright/condor/my_job.condor
    Input      = my_job.stdin
    Output     = my_job.stdout
    Error      = my_job.stderr
    Log        = my_job.log
    Arguments  = -arg1 -arg2
    InitialDir = /home/wright/condor/run_$(Process)
    Queue 500
22. Example Submit Description File for a Cluster, Described

- Now the initial directory for each job is specified with the $(Process) macro, and instead of submitting a single job, we use "Queue 500" to submit 500 jobs at once
- $(Process) will be expanded to the process number for each job in the cluster (from 0 up to 499 in this case), so we'll have run_0, run_1, ... run_499 directories
- All the input/output files will be in different directories!
23. Running condor_submit

- You give condor_submit the name of the submit file you have created
- condor_submit parses the file and creates a ClassAd that describes your job(s)
- Creates the files you specified for STDOUT and STDERR
- Sends your job's ClassAd(s) and executable to the condor_schedd, which stores the job in its queue
24. Monitoring Your Jobs

- Using condor_q
- Using a User Log file
- Using condor_status
- Using condor_rm
- Getting email from Condor
- Once your jobs complete, you can use condor_history to examine them
25. Using condor_q

- To view the jobs you have submitted, you use condor_q
- Displays the status of your job, how much compute time it has accumulated, etc.
- Many different options:
  - A single job, a single cluster, all jobs that match a certain constraint, or all jobs
  - Can view remote job queues (either individual queues, or -global)
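A few common invocations, for illustration (the cluster and job IDs are hypothetical):

```
condor_q              # all jobs in the local queue
condor_q 23           # every process in cluster 23
condor_q 23.5         # just job 23.5
condor_q -global      # job queues of all submit machines in the pool
```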
26. Using a User Log File

- A UserLog must be specified in your submit file:
  - Log = filename
- You get a log entry for everything that happens to your job
  - When it was submitted, when it starts executing, if it is checkpointed or vacated, if there are any problems, etc.
- Very useful! Highly recommended!
27. Using condor_status

- To view the status of the whole Condor pool, you use condor_status
- Can use the -run option to see which machines are running jobs, as well as:
  - The user who submitted each job
  - The machine they submitted from
- Can also view the status of various submitters with -submitter <name>
28. Using condor_rm

- If you want to remove a job from the Condor queue, you use condor_rm
- You can only remove jobs that you own (you can't run condor_rm on someone else's jobs unless you are root)
- You can give specific job IDs (cluster or cluster.proc), or you can remove all of your jobs with the -a option
29. Getting Email from Condor

- By default, Condor will send you email when your jobs complete
- If you don't want this email, put this in your submit file:
  - notification = never
- If you want email every time something happens to your job (checkpoint, exit, etc.), use this:
  - notification = always
30. Getting Email from Condor (cont'd)

- If you only want email if your job exits with an error, use this:
  - notification = error
- By default, the email is sent to your account on the host you submitted from. If you want the email to go to a different address, use this:
  - notify_user = email@address.here
31. Using condor_history

- Once your job completes, it will no longer show up in condor_q
- Now you must use condor_history to view the job's ClassAd
- The status field (ST) will have either a "C" for completed, or an "X" if the job was removed with condor_rm
32. Any Questions?

- Nothing is too basic
- If I was unclear, you probably are not the only person who doesn't understand, and the rest of the day will be even more confusing
33. Hands-On Exercise 1: Submitting and Monitoring a Simple Test Job

34. Hands-On Exercise 1

- Log in to your machine as user "condor"
- You will see two windows:
  - Netscape, with instructions
  - An xterm, where you execute commands
- To begin, click on "Simple Test Job"
- Please follow the directions carefully
  - Any lines shown as shell commands should be executed in your xterm
- If you accidentally exit Netscape, click on "Tutorial" in the Start menu
35. Lunch Break

36. Welcome Back
37. Classified Advertisements

- ClassAds
  - Language for expressing attributes
  - Semantics for evaluating them
- Intuitively, a ClassAd is a set of named expressions
  - Each named expression is an attribute
- Expressions are similar to C
  - Constants, attribute references, operators
38. Classified Advertisements: Example

    MyType       = "Machine"
    TargetType   = "Job"
    Name         = "froth.cs.wisc.edu"
    StartdIpAddr = "<128.105.73.44:33846>"
    Arch         = "INTEL"
    OpSys        = "SOLARIS26"
    VirtualMemory = 225312
    Disk         = 35957
    KFlops       = 21058
    Mips         = 103
    LoadAvg      = 0.011719
    KeyboardIdle = 12
    Cpus         = 1
    Memory       = 128
    Requirements = LoadAvg < 0.300000 && KeyboardIdle > 15 * 60
    Rank         = 0
39. Classified Advertisements: Matching

- ClassAds are always considered in pairs
  - Does ClassAd A match ClassAd B (and vice versa)?
  - This is called 2-way matching
- If the same attribute appears in both ClassAds, you can specify which attribute you mean by putting "MY." or "TARGET." in front of the attribute name
40. Classified Advertisements: Examples

- ClassAd A:

    MyType       = "Apartment"
    TargetType   = "ApartmentRenter"
    SquareArea   = 3500
    RentOffer    = 1000
    HeatIncluded = False
    OnBusLine    = True
    Rank         = (UnderGrad == False) * TARGET.RentOffer
    Requirements = MY.RentOffer - TARGET.RentOffer < 150

- ClassAd B:

    MyType       = "ApartmentRenter"
    TargetType   = "Apartment"
    UnderGrad    = False
    RentOffer    = 900
    Rank         = 1/(TARGET.RentOffer * 100.0) + 50*HeatIncluded
    Requirements = OnBusLine && SquareArea > 2700
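Working through this pair by hand (a sketch of what the matchmaker computes; the numbers come from the two ads above):

```
A.Requirements: MY.RentOffer - TARGET.RentOffer < 150
                1000 - 900 = 100, and 100 < 150      --> True
B.Requirements: OnBusLine && SquareArea > 2700
                True && (3500 > 2700)                --> True
Both Requirements are satisfied, so the ads match.
A.Rank: (UnderGrad == False) * TARGET.RentOffer = 1 * 900 = 900
```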
41. ClassAds in the Condor System

- ClassAds allow Condor to be a general system
  - Constraints and ranks on matches are expressed by the entities themselves
  - Only priority logic is integrated into the Match-Maker
- All principal entities in the Condor system are represented by ClassAds
  - Machines, Jobs, Submitters
42. ClassAds in Condor: Requirements and Rank (Example for Machines)

    Friend        = Owner == "tannenba" || Owner == "wright"
    ResearchGroup = Owner == "jbasney" || Owner == "raman"
    Trusted       = Owner != "rival" && Owner != "riffraff"
    Requirements  = Trusted && ( ResearchGroup ||
                      (LoadAvg < 0.3 && KeyboardIdle > 15*60) )
    Rank          = Friend + ResearchGroup*10
43. Requirements for Machine Example, Described

- The machine will never start a job submitted by "rival" or "riffraff"
- If someone from ResearchGroup ("jbasney" or "raman") submits a job, it will always run, regardless of keyboard activity or load average
- If anyone else submits a job, it will only run here if the keyboard has been idle for more than 15 minutes and the load average is less than 0.3
44. Machine Rank Example, Described

- If the machine is running a job submitted by owner "foo", it will give this a Rank of 0, since "foo" is neither a friend nor in the same research group
- If "wright" or "tannenba" submits a job, it will be ranked at 1 (since Friend will evaluate to 1 and ResearchGroup to 0)
- If "raman" or "jbasney" submits a job, it will have a rank of 10
- While a machine is running a job, it will be preempted for a higher-ranked job
45. ClassAds in Condor: Requirements and Rank (Example for Jobs)

    Requirements = Arch == "INTEL" && OpSys == "LINUX"
                   && Memory > 20
    Rank = (Memory > 32) * ( (Memory * 100)
           + (IsDedicated * 10000) + Mips )
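For instance, evaluating this Rank expression against a hypothetical dedicated machine with Memory = 64 and Mips = 100:

```
Rank = (64 > 32) * ( (64 * 100) + (1 * 10000) + 100 )
     = 1 * (6400 + 10000 + 100)
     = 16500
```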
46. Job Example, Described

- The job must run on an Intel CPU, running Linux, with at least 20 megs of RAM
- All machines with 32 megs of RAM or less are ranked at 0
- Machines with more than 32 megs of RAM are ranked according to how much RAM they have, whether the machine is dedicated (which counts a lot to this job!), and how fast the machine is, as measured in Million Instructions Per Second
47. Finding and Using the ClassAd Attributes in Your Pool

- Condor defines a number of attributes by default, which are listed in the User Manual ("About Requirements and Rank")
- To see if machines in your pool have other attributes defined, use:
  - condor_status -long <hostname>
- A custom-defined attribute might not be defined on all machines in your pool, so you'll probably want to use meta-operators
48. ClassAd Meta-Operators

- Meta-operators allow you to compare against UNDEFINED as if it were a real value
  - =?= is "meta-equal-to"
  - =!= is "meta-not-equal-to"
- Color != "Red" (non-meta) would evaluate to UNDEFINED if Color is not defined
- Color =!= "Red" would evaluate to True if Color is not defined, since UNDEFINED is not "Red"
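So a Requirements expression can be written to stay well-defined even on machines that never define the attribute (the attribute name "HasMyApp" here is hypothetical):

```
# Matches machines where HasMyApp is True, and safely evaluates
# to False (not UNDEFINED) on machines that never define it:
Requirements = (HasMyApp =?= True)
```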
49. Hands-On Exercise 2: Submitting Jobs with Requirements and Rank

50. Hands-On Exercise 2

- Please point your browser to the new instructions
  - Go back to the tutorial homepage
  - Click on "Requirements and Rank"
- Again, read the instructions carefully and execute any command lines in your xterm
- If you exited Netscape, just click on "Tutorial" from your Start menu
51. Priorities in Condor

- Two kinds of priorities:
- User Priorities
  - Priorities between users in the pool, to ensure fairness
  - The lower the value, the better the priority
- Job Priorities
  - Priorities that users give to their own jobs, to determine the order in which they will run
  - The higher the value, the better the priority
  - Only matters within a given user's jobs
52. User Priorities in Condor

- Each active user in the pool has a user priority
- Viewed or changed with condor_userprio
- The lower the number, the better
- A given user's share of available machines is inversely related to the ratio between user priorities
- Example: Fred's priority is 10, Joe's is 20. Fred will be allocated twice as many machines as Joe.
53. User Priorities in Condor (cont'd)

- Condor continuously adjusts user priorities over time
  - More machines allocated than your priority deserves: priority worsens
  - Fewer machines allocated than your priority deserves: priority improves
- Priority Preemption
  - Higher priority users will grab machines away from lower priority users (thanks to Checkpointing)
  - Starvation is prevented
  - Priority thrashing is prevented
54. Job Priorities in Condor

- Can be set at submit time in your description file with:
  - priority = <number>
- Can be viewed with condor_q
- Can be changed at any time with condor_prio
- The higher the number, the more likely the job will run (only among the jobs of an individual user)
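For example, after submitting with "priority = 5" in the submit file, you could later raise one particular job (the job ID 23.5 is hypothetical):

```
condor_prio -p 10 23.5     # set job 23.5's priority to 10
```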
55. Managing a Large Cluster of Jobs

- Condor can manage huge numbers of jobs
- Special features of the submit description file make this easier
- Condor can also manage inter-job dependencies with condor_dagman
  - For example: job A should run first; then run jobs B and C; when those finish, submit D; etc.
  - We'll discuss DAGMan later
56. Submitting a Large Cluster

- Anywhere in your submit file, if you use $(Process), it will expand to the process number of each job in the cluster:
  - input = my_input.$(Process)
  - arguments = $(Process)
- It is common to use $(Process) to specify InitialDir, so that each process runs in its own directory:
  - InitialDir = dir.$(Process)
57. Submitting a Large Cluster (cont'd)

- Can either have multiple Queue entries, or put a number after Queue to tell Condor how many jobs to submit:
  - Queue 1000
- A cluster is more efficient: your jobs will run faster, and they'll use less space
- Can only have one executable per cluster. Different executables must be different clusters!
58. Hands-On Exercise 3: Submitting a Large Cluster of Jobs

59. Hands-On Exercise 3

- Please point your browser to the new instructions
  - Go back to the tutorial homepage
  - Click on "Large Clusters"
- Again, read the instructions carefully and execute any command lines in your xterm
- If you exited Netscape, just click on "Tutorial" from your Start menu

60. 10 Minute Break
61. Inter-Job Dependencies with DAGMan

- DAGMan can be used to handle a set of jobs that must be run in a certain order
- Also provides "pre" and "post" operations, so you can have a program or script run before each job is submitted and after it completes
- Robust: handles errors and submit-machine crashes
62. Using DAGMan

- You define a DAG description file, which is similar in function to the submit file you give to condor_submit
- DAGMan restrictions:
  - Each job in the DAG must be in its own cluster (this is a limitation we will remove in future versions)
  - All jobs in the DAG must have a User Log, and all must share the same file
63. Format of the DAGMan Description File

- # is a comment
- First section names the jobs in your DAG and associates a submit description file with each job
- Second (optional) section defines PRE and POST scripts to run
- Final section defines the job dependencies
64. Example DAGMan Description File

    # Example DAGMan input file
    Job A A.submit
    Job B B.submit
    Job C C.submit
    Job D D.submit
    Script PRE  D d_input_checker
    Script POST A a_output_processor A.out
    PARENT A CHILD B C
    PARENT B C CHILD D
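The two PARENT/CHILD lines above describe a diamond-shaped DAG: A runs first; B and C run (possibly in parallel) once A finishes; D runs only after both B and C complete:

```
      A
     / \
    B   C
     \ /
      D
```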
65. Setting Up a DAG for Condor

- Must create the DAG description file
- Must create all the submit description files for the individual jobs
- Must prepare any executables you plan to use
  - If you want, you can have a mix of Vanilla and Standard jobs
- Must set up any PRE/POST commands or scripts you wish to use
66. Submitting a DAG to Condor

- Once you have everything in place, to submit a DAG you use condor_submit_dag and give it the name of your DAG description file
- This will check your input file for errors and submit a copy of condor_dagman as a scheduler universe job, with all the necessary command-line arguments
67. Removing a DAG

- Removing a DAG is easy
- Just use condor_rm on the scheduler universe job (condor_dagman)
- On shutdown, DAGMan will remove any jobs that are currently in the queue that are associated with its DAG
- Once all jobs are gone, DAGMan itself will exit, and the scheduler universe job will be removed from the queue
68. Hands-On Exercise 4: Using DAGMan

69. Hands-On Exercise 4

- Please point your browser to the new instructions
  - Go back to the tutorial homepage
  - Click on "Using_DAGMan"
- Again, read the instructions carefully and execute any command lines in your xterm
- If you exited Netscape, just click on "Tutorial" from your Start menu
70. What's Wrong with My Vanilla Job?

- Special requirements expressions for vanilla jobs
- You didn't submit it from a directory that is shared
- Condor isn't running as root (more on this later)
- You don't have your file permissions set up correctly (more on this later)
71. Special Requirements Expressions for Vanilla Jobs

- When you submit a vanilla job, Condor automatically appends two extra Requirements:
  - UID_DOMAIN == "<submit_uid_domain>"
  - FILESYSTEM_DOMAIN == "<submit_fs>"
- Since there are no remote system calls with Vanilla jobs, they depend on a shared file system and a common UID space to run as you and access your files
72. Special Requirements Expressions for Vanilla Jobs (cont'd)

- By default, each machine in your pool is in its own UID_DOMAIN and FILESYSTEM_DOMAIN, so your pool administrator has to configure your pool specially if there really is a common UID space and a network file system
- If you don't have an account on the remote system, Vanilla jobs won't work
73. Shared File Systems for Vanilla Jobs

- Just because you have AFS or NFS doesn't mean ALL files are shared
- InitialDir = /tmp will probably cause trouble for Vanilla jobs!
- You must be sure to set InitialDir to a shared directory (or cd into it to run condor_submit) for Vanilla jobs
74. Why Don't My Jobs Run?

- Try using condor_q -analyze
- Try specifying a User Log for your job
- Look at condor_userprio: maybe you have a bad priority and higher priority users are being served
- Problems with file permissions or network file systems
- Look at the SchedLog
75. Using condor_q -analyze

- condor_q -analyze will analyze your job's ClassAd, get all the ClassAds of the machines in the pool, and tell you what's going on
- Will report errors in your Requirements expression (impossible to match, etc.)
- Will tell you about user priorities in the pool (other people have better priority)
76. Looking at condor_userprio

- You can look at condor_userprio yourself
- If your priority value is a really high number (because you've been running a lot of Condor jobs), other users will have priority to run jobs in your pool
77. File Permissions in Condor

- If Condor isn't running as root, the condor_shadow process runs as the user the condor_schedd is running as (usually "condor")
- You must grant this user write access to your output files, and read access to your input files (both STDOUT and STDIN from your submit file, as well as files your job explicitly opens)
78. File Permissions in Condor (cont'd)

- Often there will be a "condor" group, and you can make your files owned and writable by this group
- For vanilla jobs, even if the UID_DOMAIN setting is correct, and it matches for your submit and execute machines, if Condor isn't running as root, your job will be started as user "condor", not as you!
79. Problems with NFS in Condor

- For NFS, sometimes the administrators will set up read-only mounts, or have UIDs remapped for certain partitions (the classic example is root -> nobody, but modern NFS can do arbitrary remappings)
80. Problems with NFS in Condor (cont'd)

- If your pool uses NFS automounting, the directory that Condor thinks is your InitialDir (the directory you were in when you ran condor_submit) might not exist on a remote machine
  - E.g. you're in /mnt/tmp/home/me/...
- With automounting, you always need to specify InitialDir explicitly
  - InitialDir = /home/me/...
81. Problems with AFS in Condor

- If your pool uses AFS, the condor_shadow, even if it's running with your UID, will not have your AFS token
- You must grant an unauthenticated AFS user the appropriate access to your files
- Some sites provide a better alternative than world-writable files:
  - Host ACLs
  - Network-specific ACLs
82. Looking at the SchedLog

- Looking at the log file of the condor_schedd, the SchedLog file, can possibly give you a clue if there are problems
- Find it with:
  - condor_config_val schedd_log
- You might need your pool administrator to turn on a higher debugging level to see more verbose output
83. Other User Features

- Submit-Only installation
- Heterogeneous Submit
- PVM jobs
84. Submit-Only Installation

- Can install just a condor_master and condor_schedd on your machine
- Can submit jobs into a remote pool
- Special option to condor_install
85. Heterogeneous Submit

- The job you submit doesn't have to be the same platform as the machine you submit from
- Maybe you have access to a pool that's full of Alphas, but you have a Sparc on your desk, and moving all your data is a pain
- You can take an Alpha binary, copy it to your Sparc, and submit it with a requirements expression that says you need to run on ALPHA/OSF1
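Such a requirements expression might look like this in the submit description file (a sketch, following the Arch/OpSys attribute naming seen in the ClassAd examples earlier):

```
Requirements = Arch == "ALPHA" && OpSys == "OSF1"
```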
86. Parallel Jobs in Condor

- Condor can run parallel applications written to the popular PVM message-passing library
  - Future work includes support for MPI
- Master-Worker Paradigm
- What does Condor-PVM do?
- How to compile and submit Condor-PVM jobs
87. Master-Worker Paradigm

- Condor-PVM is designed to run PVM applications that follow the master-worker paradigm
- Master
  - Has a pool of work; sends pieces of work to the workers; manages the work and the workers
- Worker
  - Gets a piece of work, does the computation, sends the result back
88. What Does Condor-PVM Do?

- Condor acts as the PVM resource manager
- All pvm_addhost() requests get re-mapped to Condor
- Condor dynamically constructs PVM virtual machines out of non-dedicated desktop machines
- When a machine leaves the pool, the user gets notified via the normal PVM notification mechanisms
89. How to Compile and Submit Condor-PVM Jobs

- Binary compatible
  - Compile and link with the PVM library just as with normal PVM applications. No need to link with Condor.
- Submit
  - In the submit description file, set:
    - universe = PVM
    - machine_count = <min>..<max>
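A minimal Condor-PVM submit description file might therefore look like this (the file names are hypothetical):

```
universe      = PVM
executable    = master.exe
input         = in.dat
output        = out.dat
error         = err.dat
machine_count = 1..8
queue
```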
90. Obtaining Condor

- Condor can be downloaded from the Condor web site at:
  - http://www.cs.wisc.edu/condor
- Complete User's and Administrator's manual available:
  - http://www.cs.wisc.edu/condor/manual
- Contracted support is available
- Questions? Email:
  - condor-admin@cs.wisc.edu