High Throughput Computing with Condor at Notre Dame - PowerPoint PPT Presentation

About This Presentation
Title:

High Throughput Computing with Condor at Notre Dame

Description:

We operate systems like Condor that directly support research and collaboration at ND. ... The Condor pool expands the capabilities of researchers in to ... – PowerPoint PPT presentation

Provided by: dougla9
Learn more at: http://www.cse.nd.edu

Transcript and Presenter's Notes

Title: High Throughput Computing with Condor at Notre Dame


1
High Throughput Computing with Condor at Notre Dame
  • Douglas Thain
  • 30 April 2009

2
Today's Talk
  • High Level Introduction (20 min)
  • What is Condor?
  • How does it work?
  • What is it good for?
  • Hands-On Tutorial (30 min)
  • Finding Resources
  • Submitting Jobs
  • Managing Jobs
  • Ideas for Scaling Up

3
The Cooperative Computing Lab
  • We create software that enables the reliable
    sharing of cycles and storage capacity between
    cooperating people.
  • We conduct research on the effectiveness of
    various systems and strategies for large scale
    computing.
  • We collaborate with others that need to use large
    scale computing, so as to find the real problems
    and make an impact on the world.
  • We operate systems like Condor that directly
    support research and collaboration at ND.

http://www.cse.nd.edu/ccl
4
What is Condor?
  • Condor is software from UW-Madison that harnesses
    idle cycles from existing machines. (Most
    workstations are 90% idle!)
  • With the assistance of CSE, OIT, and CRC staff,
    Condor has been installed on 700 cores in
    Engineering and Science since early 2005.
  • The Condor pool expands the capabilities of
    researchers to perform both cycle- and
    storage-intensive research.
  • New users and contributors are welcome to join!

http://condor.cse.nd.edu
7
(Diagram: Condor Distributed Batch System, 700 cores)
  • Batch users enter through: www portals, login nodes,
    db server, central mgr
  • Flocking to other Condor pools: Purdue (10k cores),
    Wisconsin (5k cores)
  • Machine groups shown: green house, cclsun 16x2,
    netscale 16x2, compbio 1x8, CSE 170, ccl 8x1,
    Fitzpatrick 130, iss 44x2, loco 32x2, sc0 32x2,
    cvrl 32x2, netscale 1x32, CHEG 25, EE 10, Nieu 20,
    DeBart 10
  • Primary interactive users shown: Storage Research,
    Network Research, Timeshared Collaboration, Hadoop,
    Biometrics, Batch Capacity, MPI, Personal Workstations
8
http://www.cse.nd.edu/ccl/viz
10
The Condor Principle
  • Machine Owners Have Absolute Control
  • Set who can use the machine, for what, and when.
  • Can kick jobs off at any time manually.
  • Default policy that satisfies most people:
  • Start job if console idle > 15 minutes.
  • Suspend job if console used or CPU busy.
  • Kick off job if suspended > 10 minutes.
  • After that, jobs run in this order: owner,
    research group, Notre Dame, elsewhere.

For the full technical details, see
http://www.cse.nd.edu/ccl/operations/condor/policy.shtml
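
The default policy on this slide can be read as a small decision table. The following is a minimal sketch in Python, not Condor's actual configuration: the state names ("empty", "running", "suspended"), the function names, and the rule that a suspended job resumes once the owner leaves again are assumptions made for illustration. The thresholds (15 and 10 minutes) and the priority order come from the slide.

```python
# Sketch of the default owner policy from the slide; state names are hypothetical.
PRIORITY = ["owner", "research group", "Notre Dame", "elsewhere"]

def default_policy(state, console_idle_min, cpu_busy, suspended_min=0):
    """Return the next action for a machine.

    state: "empty", "running", or "suspended" (labels invented for this sketch).
    console_idle_min: minutes since the console was last used.
    cpu_busy: True if the owner's own work is keeping the CPU busy.
    suspended_min: minutes the job has been suspended so far.
    """
    owner_active = console_idle_min == 0 or cpu_busy
    if state == "empty":
        return "start" if console_idle_min > 15 else "wait"
    if state == "running":
        return "suspend" if owner_active else "keep running"
    if state == "suspended":
        if suspended_min > 10:
            return "kick off"
        # Assumption: a suspended job resumes once the owner goes away again.
        return "stay suspended" if owner_active else "resume"
    raise ValueError(f"unknown state: {state}")

def next_job(waiting):
    """Pick the waiting (job, submitter_class) pair with the best priority."""
    return min(waiting, key=lambda job: PRIORITY.index(job[1]))

print(default_policy("empty", console_idle_min=20, cpu_busy=False))
print(next_job([("sim42", "elsewhere"), ("fit7", "research group")]))
```

The job names in the usage lines are placeholders. The real policy is expressed in Condor's own configuration language, as described at the URL above.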
11
What's the value proposition?
  • If you install Condor on your workstations,
    servers, or clusters, then:
  • You retain immediate, preemptive priority on your
    machines, both batch and interactive.
  • You gain access to the unused cycles available on
    other machines.
  • By the way, other people get to use your machines
    when you are not.

12
http://condor.cse.nd.edu
15
Condor Architecture
(Diagram: match maker, schedd, startd)
  • schedd: Represents a user with jobs to run.
  • startd: Represents an available machine.
16
700 CPUs at Notre Dame
(Diagram: one match maker serving many schedds and startds)
17
Flocking to Other Sites
2000 CPUs University of Wisconsin
20,000 CPUs Purdue University
700 CPUs Notre Dame
18
What is Condor Good For?
  • Condor works well on large workflows of
    sequential jobs, provided that they match the
    machines available to you.
  • Ideal workload:
  • One million jobs that require one hour each.
  • Doesn't work at all:
  • An 8-node MPI job that must run now.
  • Many workloads can be converted into the ideal
    form, with varying degrees of effort.
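
To put the ideal workload in numbers, here is a rough back-of-the-envelope sketch. It uses the one million one-hour jobs from this slide and the 700-core pool size quoted earlier in the talk. It assumes, unrealistically, that every core is available the whole time and that there is no scheduling overhead. The function name is made up for this example.

```python
def wall_clock_days(num_jobs, hours_per_job, cores):
    """Ideal completion time in days if every core runs jobs back to back."""
    total_cpu_hours = num_jobs * hours_per_job
    return total_cpu_hours / cores / 24

# One workstation versus the whole 700-core pool.
for cores in (1, 700):
    days = wall_clock_days(1_000_000, 1, cores)
    print(f"{cores:4d} cores: about {days:,.1f} days")
```

On a single core the workload takes on the order of a century, while on the full pool the ideal figure is roughly two months. Real throughput on a cycle-scavenging pool will be lower, since owners reclaim their machines.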

19
High Throughput Computing
  • Condor is not High Performance Computing
  • HPC: Run one program as fast as possible.
  • Condor is High Throughput Computing
  • HTC: Run as many programs as possible before my
    paper deadline on May 1st.

20
Intermission and Questions
21
Getting Started
  • If your shell is tcsh:
  • setenv PATH /afs/nd.edu/user37/condor/software/bin:$PATH
  • If your shell is bash:
  • export PATH=/afs/nd.edu/user37/condor/software/bin:$PATH
  • Then, create a temporary working space:
  • mkdir /tmp/YOURNAME
  • cd /tmp/YOURNAME

22
Viewing Available Resources
  • Condor Status Web Page:
  • http://condor.cse.nd.edu
  • Command Line Tool:
  • condor_status
  • condor_status -constraint '(Memory>2048)'
  • condor_status -constraint '(Arch=="INTEL")'
  • condor_status -constraint '(OpSys=="LINUX")'
  • condor_status -run
  • condor_status -submitters
  • condor_status -pool boilergrid.rcac.purdue.edu

23
A Simple Script Job
  • Create the script: vi simple.sh
  • #!/bin/sh
  • echo $@
  • date
  • uname -a
  • Make it executable and test it locally:
  • chmod 755 simple.sh
  • ./simple.sh hello world
24
A Simple Submit File
vi simple.submit
  • universe = vanilla
  • executable = simple.sh
  • arguments = hello condor
  • output = simple.stdout
  • error = simple.stderr
  • should_transfer_files = yes
  • when_to_transfer_output = on_exit
  • log = simple.logfile
  • queue

25
Submitting and Watching a Job
  • Submit the job:
  • condor_submit simple.submit
  • Look at the job queue:
  • condor_q
  • Remove a job:
  • condor_rm <jobid>
  • See where the job went:
  • tail -f simple.logfile

26
Submitting Lots of Jobs
vi simple.submit
  • universe = vanilla
  • executable = simple.sh
  • arguments = hello $(PROCESS)
  • output = simple.stdout.$(PROCESS)
  • error = simple.stderr.$(PROCESS)
  • should_transfer_files = yes
  • when_to_transfer_output = on_exit
  • log = simple.logfile
  • queue 50
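
The effect of the PROCESS macro with "queue 50" can be previewed without submitting anything. The sketch below is plain Python that imitates the substitution and does not call Condor. It relies on the fact that the macro takes the values 0 through N-1, one per queued job. The helper name expand_process is hypothetical.

```python
def expand_process(template, count):
    """Imitate $(PROCESS): one value per queued job, numbered 0..count-1."""
    return [template.replace("$(PROCESS)", str(i)) for i in range(count)]

outputs = expand_process("simple.stdout.$(PROCESS)", 50)
arguments = expand_process("hello $(PROCESS)", 50)

# Show the first and last jobs of the cluster.
for index in (0, 1, 49):
    print(f"job {index}: arguments={arguments[index]!r} output={outputs[index]!r}")
```

Because each job gets its own output and error file, the 50 jobs do not overwrite one another's results, while all of them share the single log file.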

27
What Happened to All My Jobs?
  • http//condorlog.cse.nd.edu

28
Setting Requirements
  • By default, Condor will only run your job on a
    machine with the same CPU and OS as the
    submitter.
  • Use requirements to send your job to other kinds
    of machines:
  • requirements = (Memory>2084)
  • requirements = (Arch=="INTEL" || Arch=="X86_64")
  • requirements = (MachineGroup=="fitzlab")
  • requirements = (UidDomain!="nd.edu")
  • (Hint: Try out your requirements expressions
    using condor_status as above.)
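
A requirements expression acts as a yes/no filter over machine descriptions. The following sketch imitates that filtering in Python. The three machine records are invented for illustration (real values come from condor_status), and the predicates are hand translations of the four example expressions on this slide, not Condor's own evaluator.

```python
# Hypothetical machine descriptions; real ones are reported by condor_status.
MACHINES = [
    {"Name": "ws1", "Arch": "INTEL", "Memory": 1024,
     "MachineGroup": "fitzlab", "UidDomain": "nd.edu"},
    {"Name": "ws2", "Arch": "X86_64", "Memory": 4096,
     "MachineGroup": "ccl", "UidDomain": "nd.edu"},
    {"Name": "remote1", "Arch": "X86_64", "Memory": 8192,
     "MachineGroup": "", "UidDomain": "example.edu"},
]

def big_memory(m):
    return m["Memory"] > 2084

def either_arch(m):
    return m["Arch"] == "INTEL" or m["Arch"] == "X86_64"

def in_fitzlab(m):
    return m["MachineGroup"] == "fitzlab"

def off_campus(m):
    return m["UidDomain"] != "nd.edu"

def select(machines, requirements):
    """Return the names of machines that satisfy the requirements predicate."""
    return [m["Name"] for m in machines if requirements(m)]

for predicate in (big_memory, either_arch, in_fitzlab, off_campus):
    print(predicate.__name__, select(MACHINES, predicate))
```

A machine that fails the predicate is never considered for the job, no matter how idle it is.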

29
Setting Requirements
  • By default, Condor will assume any machine that
    satisfies your requirements is sufficient.
  • Use the rank expression to indicate which
    machines you prefer:
  • rank = (Memory>1024)
  • rank = (MachineGroup=="fitzlab")
  • rank = (Arch=="INTEL")*10 + (Arch=="X86_64")*20
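
Where requirements filter machines, rank orders the ones that remain. The sketch below assumes that the last rank example on this slide is a weighted sum in which a true comparison counts as 1 and a false one as 0, so INTEL machines score 10 and X86_64 machines score 20. The machine records and function names are hypothetical, and this is an illustration, not Condor's matchmaking code.

```python
# Hypothetical machine descriptions for illustration only.
MACHINES = [
    {"Name": "ws1", "Arch": "INTEL", "Memory": 2048},
    {"Name": "ws2", "Arch": "X86_64", "Memory": 4096},
    {"Name": "old1", "Arch": "INTEL", "Memory": 512},
]

def requirements(m):
    """Only machines with more than 1024 MB of memory are acceptable."""
    return m["Memory"] > 1024

def rank(m):
    """Weighted preference; Python booleans act as 1 or 0 in arithmetic."""
    return (m["Arch"] == "INTEL") * 10 + (m["Arch"] == "X86_64") * 20

def pick(machines):
    """Filter by requirements, then prefer the highest-ranked machine."""
    candidates = [m for m in machines if requirements(m)]
    return max(candidates, key=rank, default=None)

best = pick(MACHINES)
print(best["Name"], rank(best))
```

If no X86_64 machine were acceptable, the job would still run on an INTEL machine: rank expresses a preference, not a constraint.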

30
File Transfer
  • Notes to keep in mind:
  • Condor cannot write to AFS. (no credentials)
  • Not all machines in Condor have AFS.
  • So, you must specify what files your job needs,
    and Condor will send them there:
  • transfer_input_files = x.dat, y.calib, z.library
  • By default, all files created by your job will be
    sent home automatically.

31
In Class Assignment
  • Execute 50 jobs that run on a machine not at
    Notre Dame that has > 1 GB RAM.