Title: High Throughput Computing with Condor at Notre Dame
1 High Throughput Computing with Condor at Notre Dame
- Douglas Thain
- 30 April 2009
2 Today's Talk
- High Level Introduction (20 min)
  - What is Condor?
  - How does it work?
  - What is it good for?
- Hands-On Tutorial (30 min)
  - Finding Resources
  - Submitting Jobs
  - Managing Jobs
  - Ideas for Scaling Up
3 The Cooperative Computing Lab
- We create software that enables the reliable sharing of cycles and storage capacity between cooperating people.
- We conduct research on the effectiveness of various systems and strategies for large scale computing.
- We collaborate with others that need to use large scale computing, so as to find the real problems and make an impact on the world.
- We operate systems like Condor that directly support research and collaboration at ND.
http://www.cse.nd.edu/ccl
4 What is Condor?
- Condor is software from UW-Madison that harnesses idle cycles from existing machines. (Most workstations are 90% idle!)
- With the assistance of CSE, OIT, and CRC staff, Condor has been installed on 700 cores in Engineering and Science since early 2005.
- The Condor pool expands the capabilities of researchers to perform both cycle- and storage-intensive research.
- New users and contributors are welcome to join!
http://condor.cse.nd.edu
7 [Diagram: Condor Distributed Batch System (700 cores)]
- Components shown: batch users, www portals, login nodes, db server, central mgr, and flocking to other Condor pools (Purdue 10k cores, Wisconsin 5k cores).
- Machine groups shown: green house, cclsun 16x2, netscale 16x2, compbio 1x8, CSE 170, ccl 8x1, Fitzpatrick 130, iss 44x2, loco 32x2, sc0 32x2, cvrl 32x2, netscale 1x32, CHEG 25, EE 10, Nieu 20, DeBart 10.
- Usage labels shown: Storage Research, Network Research, Timeshared Collaboration, Hadoop, Biometrics, Batch Capacity, MPI, Personal Workstations, Primary Interactive Users.
8 http://www.cse.nd.edu/ccl/viz
10 The Condor Principle
- Machine Owners Have Absolute Control
  - Set who, what, and when can use the machine.
  - Can kick jobs off at any time manually.
- Default policy that satisfies most people:
  - Start job if console idle > 15 minutes.
  - Suspend job if console used or CPU busy.
  - Kick off job if suspended > 10 minutes.
  - After that, jobs run in this order: owner, research group, Notre Dame, elsewhere.
For the full technical details, see http://www.cse.nd.edu/ccl/operations/condor/policy.shtml
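The three timing rules of the default policy can be sketched as a small decision function. This is only an illustration of the rules as stated on the slide, not Condor's actual configuration: policy_action and its arguments are hypothetical names, "console used" is assumed to mean zero idle minutes, and the slide does not say when a suspended job resumes, so the sketch leaves that out.

```sh
#!/bin/sh
# Hypothetical helper (not a Condor command): report what the default policy
# described on the slide would do. Arguments are illustrative names:
#   $1  job state on the machine: none | running | suspended
#   $2  minutes since the console was last used (0 means in use right now)
#   $3  1 if the owner's own work keeps the CPU busy, otherwise 0
#   $4  minutes the job has already spent suspended
policy_action() {
    state=$1
    idle_min=$2
    cpu_busy=$3
    suspended_min=$4
    case "$state" in
        none)
            # Start job if console idle > 15 minutes.
            if [ "$idle_min" -gt 15 ]; then echo start; else echo wait; fi
            ;;
        running)
            # Suspend job if console used or CPU busy.
            if [ "$idle_min" -eq 0 ] || [ "$cpu_busy" -eq 1 ]; then
                echo suspend
            else
                echo keep-running
            fi
            ;;
        suspended)
            # Kick off job if suspended > 10 minutes.
            if [ "$suspended_min" -gt 10 ]; then
                echo kick-off
            else
                echo stay-suspended
            fi
            ;;
        *)
            echo "unknown state: $state" >&2
            return 1
            ;;
    esac
}

policy_action none 20 0 0        # console idle for 20 minutes
policy_action running 0 0 0      # owner just touched the keyboard
policy_action suspended 0 0 11   # job suspended for 11 minutes
```

The real policy is written as expressions in the Condor configuration; see the policy page linked on the slide for the authoritative version.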
11 What's the value proposition?
- If you install Condor on your workstations, servers, or clusters, then:
  - You retain immediate, preemptive priority on your machines, both batch and interactive.
  - You gain access to the unused cycles available on other machines.
  - By the way, other people get to use your machines when you are not.
12-14 http://condor.cse.nd.edu
15 Condor Architecture
- match maker
- schedd: Represents a user with jobs to run.
- startd: Represents an available machine.
16 700 CPUs at Notre Dame
[Diagram: a single match maker with many startd and schedd daemons.]
17 Flocking to Other Sites
- 2000 CPUs: University of Wisconsin
- 20,000 CPUs: Purdue University
- 700 CPUs: Notre Dame
18 What is Condor Good For?
- Condor works well on large workflows of sequential jobs, provided that they match the machines available to you.
- Ideal workload:
  - One million jobs that require one hour each.
- Doesn't work at all:
  - An 8-node MPI job that must run now.
- Many workloads can be converted into the ideal form, with varying degrees of effort.
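To get a feel for the ideal workload, here is a back-of-the-envelope estimate. It assumes, optimistically, that all 700 Notre Dame cores are available to you the whole time, which the owner-priority policy does not guarantee, so treat the result as a lower bound.

```sh
#!/bin/sh
# Rough estimate only. Assumes every core in the pool is yours all the time.
jobs=1000000        # number of jobs (from the slide)
hours_per_job=1     # runtime per job (from the slide)
cores=700           # size of the Notre Dame pool (from the slides)

# Ceiling division: the last partial wave of jobs still takes a full hour.
total_hours=$(( (jobs * hours_per_job + cores - 1) / cores ))
total_days=$(( (total_hours + 23) / 24 ))
single_core_years=$(( jobs * hours_per_job / 24 / 365 ))

echo "about $total_hours hours (about $total_days days) on $cores cores"
echo "versus roughly $single_core_years years on a single core"
```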
19 High Throughput Computing
- Condor is not High Performance Computing
  - HPC: Run one program as fast as possible.
- Condor is High Throughput Computing
  - HTC: Run as many programs as possible before my paper deadline on May 1st.
20 Intermission and Questions
21 Getting Started
- If your shell is tcsh:
  - setenv PATH /afs/nd.edu/user37/condor/software/bin:$PATH
- If your shell is bash:
  - export PATH=/afs/nd.edu/user37/condor/software/bin:$PATH
- Then, create a temporary working space:
  - mkdir /tmp/YOURNAME
  - cd /tmp/YOURNAME
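A quick way to confirm the setup worked is to check that the shell can find the Condor tools. The sketch below is for sh or bash only. CONDOR_BIN and WORKDIR are names made up for this example, and ${USER:-YOURNAME} simply stands in for YOURNAME.

```sh
#!/bin/sh
# CONDOR_BIN is just a variable name used in this sketch.
CONDOR_BIN=/afs/nd.edu/user37/condor/software/bin

# Prepend the Condor tools to PATH only if they are not already there.
case ":$PATH:" in
    *":$CONDOR_BIN:"*) ;;
    *) PATH="$CONDOR_BIN:$PATH"; export PATH ;;
esac

# Confirm that the shell can find the tools used in the rest of the talk.
for tool in condor_status condor_submit condor_q condor_rm; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool"
    else
        echo "missing: $tool (is $CONDOR_BIN reachable from this machine?)"
    fi
done

# Create the temporary working space and move into it.
WORKDIR="/tmp/${USER:-YOURNAME}"
mkdir -p "$WORKDIR" && cd "$WORKDIR" && pwd
```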
22 Viewing Available Resources
- Condor Status Web Page:
  - http://condor.cse.nd.edu
- Command Line Tool:
  - condor_status
  - condor_status -constraint '(Memory>2048)'
  - condor_status -constraint '(Arch=="INTEL")'
  - condor_status -constraint '(OpSys=="LINUX")'
  - condor_status -run
  - condor_status -submitters
  - condor_status -pool boilergrid.rcac.purdue.edu
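The individual constraints can also be combined into one expression with &&. The sketch below assumes nothing beyond the attributes already shown on the slide; join_constraints is a hypothetical helper, and the query runs only if condor_status is installed.

```sh
#!/bin/sh
# Hypothetical helper: join several ClassAd clauses with && so they can be
# passed to condor_status as a single -constraint expression.
join_constraints() {
    joined=""
    for clause in "$@"; do
        if [ -z "$joined" ]; then
            joined="($clause)"
        else
            joined="$joined && ($clause)"
        fi
    done
    printf '%s\n' "$joined"
}

constraint=$(join_constraints 'Memory > 2048' 'Arch == "INTEL"' 'OpSys == "LINUX"')
echo "constraint: $constraint"

# Only query the pool when the Condor tools are actually installed.
if command -v condor_status >/dev/null 2>&1; then
    # -format prints one attribute per matching slot; Name is the slot name.
    condor_status -constraint "$constraint" -format '%s\n' Name
else
    echo "condor_status not found; skipping the query"
fi
```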
23 A Simple Script Job
- vi simple.sh
- chmod 755 simple.sh
- ./simple.sh hello world
Contents of simple.sh:
    #!/bin/sh
    echo $@
    date
    uname -a
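If you would rather not open an editor, the same script can be written with a here-document and tested locally before handing it to Condor. This is a sketch of that idea. The only change from the slide is that $@ is quoted, which keeps arguments containing spaces intact.

```sh
#!/bin/sh
# Write simple.sh without opening an editor. The quoted 'EOF' keeps the
# shell from expanding $@ while the file is being written.
cat > simple.sh <<'EOF'
#!/bin/sh
echo "$@"
date
uname -a
EOF

# Make it executable and try it locally, exactly as on the slide.
chmod 755 simple.sh
./simple.sh hello world
```

The first line of output should be the arguments, followed by the date and the system description of whichever machine ran the script.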
24 A Simple Submit File
vi simple.submit
    universe = vanilla
    executable = simple.sh
    arguments = hello condor
    output = simple.stdout
    error = simple.stderr
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    log = simple.logfile
    queue
25 Submitting and Watching a Job
- Submit the job:
  - condor_submit simple.submit
- Look at the job queue:
  - condor_q
- Remove a job:
  - condor_rm <job id>
- See where the job went:
  - tail -f simple.logfile
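The same steps can be strung together in one script. The sketch below assumes condor_submit reports the new cluster in its usual "N job(s) submitted to cluster M." form. cluster_from_submit_output is a hypothetical helper, and condor_wait is used in place of tail -f so that the script stops once the log shows the job has finished.

```sh
#!/bin/sh
# Hypothetical helper: pull the cluster number out of condor_submit's
# "N job(s) submitted to cluster M." message, read from standard input.
cluster_from_submit_output() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}

if command -v condor_submit >/dev/null 2>&1 && [ -f simple.submit ]; then
    cluster=$(condor_submit simple.submit | cluster_from_submit_output)
    echo "submitted cluster $cluster"
    condor_q "$cluster"          # show only this cluster in the queue
    condor_wait simple.logfile   # block until the log shows the job is done
    cat simple.stdout            # output sent home by Condor
    # To give up on the job instead: condor_rm "$cluster"
else
    echo "condor_submit or simple.submit not available; nothing submitted"
fi
```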
26 Submitting Lots of Jobs
vi simple.submit
    universe = vanilla
    executable = simple.sh
    arguments = hello $(PROCESS)
    output = simple.stdout.$(PROCESS)
    error = simple.stderr.$(PROCESS)
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    log = simple.logfile
    queue 50
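With "queue 50", $(PROCESS) takes the values 0 through 49, so each job writes its own simple.stdout.N file. A simple way to see which jobs have not reported back yet is to look for missing or empty output files. unfinished_jobs below is a hypothetical helper written for this sketch, not part of Condor.

```sh
#!/bin/sh
# Hypothetical helper: list the process numbers (0 .. count-1) whose output
# file is missing or empty, i.e. jobs that have produced no output so far.
#   $1  number of jobs queued (50 in the submit file above)
#   $2  output file prefix (simple.stdout in the submit file above)
unfinished_jobs() {
    count=$1
    prefix=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # -s is true only for files that exist and are non-empty.
        [ -s "$prefix.$i" ] || echo "$i"
        i=$((i + 1))
    done
}

missing=$(unfinished_jobs 50 simple.stdout | wc -l | tr -d ' ')
echo "$missing of 50 jobs have produced no output yet"
```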
27 What Happened to All My Jobs?
- http://condorlog.cse.nd.edu
28 Setting Requirements
- By default, Condor will only run your job on a machine with the same CPU and OS as the submitter.
- Use requirements to send your job to other kinds of machines:
  - requirements = (Memory>2084)
  - requirements = (Arch=="INTEL" || Arch=="X86_64")
  - requirements = (MachineGroup=="fitzlab")
  - requirements = (UidDomain!="nd.edu")
- (Hint: Try out your requirements expressions using condor_status as above.)
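The hint can be turned into a small habit: preview the expression with condor_status first, then paste the same text into the submit file. try_requirements is a hypothetical helper for this sketch. It sends the preview to standard error so that standard output carries only the submit-file line.

```sh
#!/bin/sh
# Hypothetical helper: preview a requirements expression with condor_status,
# then print the line to paste into the submit file.
try_requirements() {
    req=$1
    if command -v condor_status >/dev/null 2>&1; then
        # -total prints only the summary counts of matching machines.
        condor_status -constraint "$req" -total >&2
    else
        echo "condor_status not found; cannot preview matches" >&2
    fi
    printf 'requirements = %s\n' "$req"
}

try_requirements '(Arch == "INTEL" || Arch == "X86_64") && (Memory > 2084)'
```

If the preview reports no matching machines, the job would sit idle in the queue, so it is worth loosening the expression before submitting.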
29 Setting Requirements
- By default, Condor will assume any machine that satisfies your requirements is sufficient.
- Use the rank expression to indicate which machines you prefer:
  - rank = (Memory>1024)
  - rank = (MachineGroup=="fitzlab")
  - rank = (Arch=="INTEL")*10 + (Arch=="X86_64")*20
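The last rank expression works because a true comparison counts as 1 and a false one as 0, and Condor prefers machines with a higher rank. The arithmetic can be mimicked in the shell as a rough illustration. rank_for_arch is a hypothetical helper, and PPC is just an arbitrary third architecture chosen to show the zero case.

```sh
#!/bin/sh
# Hypothetical helper mirroring:
#   rank = (Arch=="INTEL")*10 + (Arch=="X86_64")*20
# Each comparison contributes 1 when true and 0 when false.
rank_for_arch() {
    arch=$1
    is_intel=0
    is_x86_64=0
    if [ "$arch" = "INTEL" ]; then is_intel=1; fi
    if [ "$arch" = "X86_64" ]; then is_x86_64=1; fi
    echo $(( is_intel * 10 + is_x86_64 * 20 ))
}

# A higher rank means a more preferred machine.
for arch in INTEL X86_64 PPC; do
    echo "$arch -> rank $(rank_for_arch "$arch")"
done
```

With this expression an X86_64 machine is preferred over an INTEL one, and both are preferred over anything else that still satisfies the requirements.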
30 File Transfer
- Notes to keep in mind:
  - Condor cannot write to AFS. (no creds)
  - Not all machines in Condor have AFS.
- So, you must specify what files your job needs, and Condor will send them there:
  - transfer_input_files = x.dat, y.calib, z.library
- By default, all files created by your job will be sent home automatically.
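Here is a sketch of where transfer_input_files sits in a complete submit file. The input file names come from the slide; analyze.sh and the transfer.* file names are hypothetical and only serve the example. The loop at the end is a local sanity check, since a missing input file is an easy mistake to make.

```sh
#!/bin/sh
# Write a submit file that ships input files along with the job.
# analyze.sh and transfer.* are hypothetical names; x.dat, y.calib and
# z.library are the example inputs from the slide.
cat > transfer.submit <<'EOF'
universe = vanilla
executable = analyze.sh
transfer_input_files = x.dat, y.calib, z.library
should_transfer_files = yes
when_to_transfer_output = on_exit
output = transfer.stdout
error = transfer.stderr
log = transfer.logfile
queue
EOF

# Check locally that every listed input exists before submitting.
for f in x.dat y.calib z.library; do
    [ -f "$f" ] || echo "warning: input file $f not found in $(pwd)"
done
```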
31 In Class Assignment
- Execute 50 jobs that run on a machine not at Notre Dame that has >1GB RAM.
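One possible starting point for the assignment combines the pieces from the earlier slides. It is a sketch under two assumptions: that Memory is reported in megabytes, so >1GB becomes Memory > 1024, and that machines outside Notre Dame are the ones whose UidDomain is not "nd.edu", as in the requirements examples. The assignment.* file names are made up for this example.

```sh
#!/bin/sh
# Sketch of a submit file for the assignment. The assignment.* names are
# made up; simple.sh is the script from the earlier slide.
cat > assignment.submit <<'EOF'
universe = vanilla
executable = simple.sh
arguments = hello $(PROCESS)
output = assignment.stdout.$(PROCESS)
error = assignment.stderr.$(PROCESS)
should_transfer_files = yes
when_to_transfer_output = on_exit
log = assignment.logfile
requirements = (UidDomain != "nd.edu") && (Memory > 1024)
queue 50
EOF

# Submit only when the Condor tools and the script are both present.
if command -v condor_submit >/dev/null 2>&1 && [ -x simple.sh ]; then
    condor_submit assignment.submit
else
    echo "wrote assignment.submit; submit it from a machine with Condor installed"
fi
```

The uname -a line in each output file is a quick way to check that the jobs really ran somewhere other than Notre Dame.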