Introduction to Condor - PowerPoint PPT Presentation

http://www.cs.wisc.edu/condor. 23-June-2002. Introduction to Condor. Adopted by the 'real world' (Galileo, Maxtor, Micron, Oracle, Tigr, ...).

1
Introduction to Condor
2
  • Thank you for having me!
  • I am Alain Roy
  • Computer Science Ph.D. in Quality of Service,
    with the Globus Project
  • Working with the Condor Project

3
Condor Tutorials
  • Today (Sunday), 10:00-12:30: A general introduction to Condor
  • Monday, 17:00-19:00: Using and administering Condor
  • Tuesday, 17:00-19:00: Using Condor on the Grid

4
A General Introduction to Condor
5
The Condor Project (Established 1985)
  • Distributed Computing research performed by a
    team of about 30 faculty, full time staff, and
    students who
  • face software engineering challenges in a Unix
    and Windows environment,
  • are involved in national and international
    collaborations,
  • actively interact with users,
  • maintain and support a distributed production
    environment,
  • and educate and train students.

6
A Multifaceted Project
  • Harnessing clusters, opportunistic and dedicated
    (Condor)
  • Job management for Grid applications (Condor-G,
    DaPSched)
  • Fabric management for Grid resources (Condor,
    GlideIns, NeST)
  • Distributed I/O technology (PFS, Kangaroo, NeST)
  • Job-flow management (DAGMan, Condor)
  • Distributed monitoring and management (HawkEye)
  • Technology for Distributed Systems (ClassAD, MW)

7
Harnessing Computers
  • We have more than 300 pools with more than 8500
    CPUs worldwide.
  • We have more than 1800 CPUs in 10 pools on our
    campus.
  • Established a complete production environment
    for the UW CMS group
  • Adopted by the 'real world' (Galileo, Maxtor,
    Micron, Oracle, Tigr, ...)

8
The Grid
  • Close collaboration and coordination with the
    Globus Project: joint development, adoption of
    common protocols, technology exchange, ...
  • Partner in major national Grid R&D² (Research,
    Development and Deployment) efforts (GriPhyN,
    iVDGL, IPG, TeraGrid)
  • Close collaboration with Grid projects in Europe
    (EDG, GridLab, e-Science)

9
User/Application
Grid
Fabric (processing, storage, communication)
10
User/Application
Grid
Fabric (processing, storage, communication)
11
distributed I/O
  • Close collaboration with the Scientific Data
    Management Group at LBL.
  • Provide management services for distributed data
    storage resources
  • Provide management and scheduling services for
    Data Placement jobs (DaPs)
  • Effective, secure and flexible remote I/O
    capabilities
  • Exception handling

12
job flow management
  • Adoption of Directed Acyclic Graphs (DAGs) as a
    common job flow abstraction.
  • Adoption of the DAGMan as an effective solution
    to job flow management.

13
For the Rest of Today
  • Condor
  • Condor and the Grid
  • Related Technologies
  • DAGMan
  • ClassAds
  • Master-Worker
  • NeST
  • DaP Scheduler
  • Hawkeye
  • Today: Just the Big Picture

14
What is Condor?
  • Condor converts collections of distributively
    owned workstations and dedicated clusters into a
    distributed high-throughput computing facility.
  • Run lots of jobs over a long period of time,
  • Not a short burst of high-performance
  • Condor manages both machines and jobs with
    ClassAd Matchmaking to keep everyone happy

15
Condor Takes Care of You
  • Condor does whatever it takes to run your jobs,
    even if some machines
  • Crash (or are disconnected)
  • Run out of disk space
  • Don't have your software installed
  • Are frequently needed by others
  • Are far away and managed by someone else

16
What is Unique about Condor?
  • ClassAds
  • Transparent checkpoint/restart
  • Remote system calls
  • Works in heterogeneous clusters
  • Clusters can be
  • Dedicated
  • Opportunistic

17
What's Condor Good For?
  • Managing a large number of jobs
  • You specify the jobs in a file and submit them to
    Condor, which runs them all and sends you email
    when they complete
  • Mechanisms to help you manage huge numbers of
    jobs (1000s), all the data, etc.
  • Condor can handle inter-job dependencies (DAGMan)

18
What's Condor Good For? (cont'd)
  • Robustness
  • Checkpointing allows guaranteed forward progress
    of your jobs, even jobs that run for weeks before
    completion
  • If an execute machine crashes, you only lose work
    done since the last checkpoint
  • Condor maintains a persistent job queue - if the
    submit machine crashes, Condor will recover
  • (Story)
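
To make the checkpoint idea concrete, here is a minimal Python sketch of application-level checkpointing. It is only an analogy: Condor's checkpointing is transparent to the program, whereas this example saves its own state explicitly. The function name, the JSON state file, and the `crash_at` parameter are all hypothetical.

```python
import json
import os

def run_with_checkpoints(total_steps, path, crash_at=None):
    """Sum 0..total_steps-1, saving progress to `path` after every step.

    If `path` already exists, the computation resumes from the saved state.
    `crash_at` simulates a machine failure just before that step runs.
    """
    state = {"next_step": 0, "total": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    for step in range(state["next_step"], total_steps):
        if step == crash_at:
            raise RuntimeError(f"simulated crash before step {step}")
        state = {"next_step": step + 1, "total": state["total"] + step}
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        # Atomic rename, so a checkpoint file is never left half-written.
        os.replace(tmp, path)
    return state["total"]
```

After a crash, calling the function again with the same `path` repeats only the work done since the last saved checkpoint.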

19
What's Condor Good For? (cont'd)
  • Giving your job the agility to access more
    computing resources
  • Checkpointing allows your job to run on
    opportunistic resources (not dedicated)
  • Checkpointing also provides migration - if a
    machine is no longer available, move!
  • With remote system calls, run on systems which do
    not share a filesystem. You don't even need an
    account on a machine where your job executes

20
Other Condor features
  • Implement your policy on when the jobs can run on
    your workstation
  • Implement your policy on the execution order of
    the jobs
  • Keep a log of your job activities

21
A Condor Pool In Action
22
A Bit of Condor Philosophy
  • Condor brings more computing to everyone
  • A small-time scientist can make an opportunistic
    pool with 10 machines, and get 10 times as much
    computing done.
  • A large collaboration can use Condor to control
    its dedicated pool with hundreds of machines.

23
The Condor Idea
  • Computing power is everywhere; we try to make
    it usable by anyone.

24
Meet Frieda.
She is a scientist. But she has a big problem.
25
Frieda's Application
  • Simulate the behavior of F(x,y,z) for 20 values
    of x, 10 values of y and 3 values of z (20 × 10 × 3
    = 600 combinations)
  • F takes on the average 3 hours to compute on a
    typical workstation (total 1800 hours)
  • F requires a moderate (128MB) amount of memory
  • F performs moderate I/O - (x,y,z) is 5 MB and
    F(x,y,z) is 50 MB
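
The totals on this slide can be checked with a few lines of Python. The parameter grids below are placeholders (only the counts 20, 10, and 3 come from the slide), and `wall_clock_hours` is a hypothetical helper that assumes the runs spread perfectly evenly over the machines.

```python
from itertools import product

# Placeholder parameter grids; only their sizes come from the slide.
xs = range(20)
ys = range(10)
zs = range(3)

HOURS_PER_RUN = 3   # average compute time of F per combination
OUTPUT_MB = 50      # size of F(x,y,z) for one combination

combinations = list(product(xs, ys, zs))
total_runs = len(combinations)             # 20 * 10 * 3
total_hours = total_runs * HOURS_PER_RUN   # sequential time on one workstation
total_output_mb = total_runs * OUTPUT_MB   # output produced by all runs

def wall_clock_hours(machines):
    """Ideal wall-clock time if the runs are spread evenly over `machines`."""
    return total_hours / machines
```

On one workstation the 1800 hours amount to 75 days of continuous computing, which is why Frieda looks for help.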

26
I have 600 simulations to run. Where can I get
help?
27
Install a Personal Condor!
28
Installing Condor
  • Download Condor for your operating system
  • Available as a free download from
  • http://www.cs.wisc.edu/condor
  • Not labelled as Personal Condor, just Condor.
  • Available for most Unix platforms and Windows NT

29
So Frieda Installs Personal Condor on her machine
  • What do we mean by a Personal Condor?
  • Condor on your own workstation, no root access
    required, no system administrator intervention
    needed; easy to set up.
  • So after installation, Frieda submits her jobs to
    her Personal Condor

30
Personal Condor?! What's the benefit of a Condor
Pool with just one user and one machine?
31
Your Personal Condor will ...
  • Keep an eye on your jobs and will keep you posted
    on their progress
  • Keep a log of your job activities
  • Add fault tolerance to your jobs
  • Implement your policy on when the jobs can run on
    your workstation

32
Frieda is happy until... She realizes she needs to
run a post-analysis on each job, after it
completes.
33
Condor DAGMan
  • Directed Acyclic Graph Manager
  • DAGMan allows you to specify the dependencies
    between your Condor jobs, so it can manage them
    automatically for you.
  • (e.g., "Don't run job B until job A has
    completed successfully.")

34
What is a DAG?
  • A DAG is the data structure used by DAGMan to
    represent these dependencies.
  • Each job is a node in the DAG.
  • Each node can have any number of parent or
    children nodes as long as there are no loops!
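
As a small illustration of the data structure, the sketch below stores a DAG as a Python dictionary and uses the standard library's `graphlib` (Python 3.9+) to order the jobs and to detect loops. The diamond-shaped example and the helper names are assumptions made for illustration; this is not DAGMan's own file format or code.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# Maps each job to the set of jobs it depends on (its parents).
# The diamond shape is an assumed example.
diamond = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

def run_order(dag):
    """Return one valid execution order, parents before children."""
    return list(TopologicalSorter(dag).static_order())

def has_loop(dag):
    """True if the dependencies contain a cycle, so this is not a DAG."""
    try:
        run_order(dag)
        return False
    except CycleError:
        return True

order = run_order(diamond)
```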

35
Running a DAG
  • DAGMan acts as a meta-scheduler, managing the
    submission of your jobs to Condor based on the
    DAG dependencies.

[Figure: DAGMan reads the .dag file (nodes A, B, C, D) and places job A in the Condor job queue.]
36
Running a DAG (cont'd)
  • DAGMan holds and submits jobs to Condor at the
    appropriate times.

[Figure: DAGMan and the Condor job queue, which now holds jobs B and C; the DAG has nodes A, B, C, D.]
37
Running a DAG (cont'd)
  • In case of a job failure, DAGMan continues until
    it can no longer make progress, and then creates
    a rescue file with the current state of the DAG.

[Figure: DAGMan, the Condor job queue, and a rescue file; one node of the DAG is marked X to show the failed job.]
38
Recovering a DAG
  • Once the failed job is ready to be re-run, the
    rescue file can be used to restore the prior
    state of the DAG.

[Figure: DAGMan restores the DAG (nodes A, B, C, D) from the rescue file and places job C in the Condor job queue.]
39
Recovering a DAG (cont'd)
  • Once that job completes, DAGMan will continue the
    DAG as if the failure never happened.

[Figure: DAGMan and the Condor job queue, which now holds job D; the DAG has nodes A, B, C, D.]
40
Finishing a DAG
  • Once the DAG is complete, the DAGMan job itself
    is finished, and exits.

[Figure: DAGMan with the finished DAG (nodes A, B, C, D) and the Condor job queue.]
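
The run, failure, and recovery sequence on the last few slides can be sketched as a short Python simulation. This is a simplified model under stated assumptions, not DAGMan's implementation: `run_dag` is a hypothetical function, jobs are plain callables, and the "rescue file" is reduced to the set of jobs that already completed.

```python
def run_dag(parents, run_job, done=None):
    """Run jobs in dependency order until no further progress is possible.

    parents: maps each job name to the set of jobs it depends on.
    run_job: callable returning True on success, False on failure.
    done:    jobs already completed (for example, read from a rescue file).
    Returns (done, failed): the completed jobs and the jobs that failed.
    """
    done = set(done or ())
    failed = set()
    while True:
        ready = sorted(
            job for job, deps in parents.items()
            if job not in done and job not in failed and deps <= done
        )
        if not ready:
            return done, failed
        for job in ready:
            (done if run_job(job) else failed).add(job)

diamond = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

# First attempt: job C fails, so D can never become ready.
rescue, failed = run_dag(diamond, lambda job: job != "C")

# Second attempt: resume from the rescue state; every job now succeeds.
attempted = []

def succeed(job):
    attempted.append(job)
    return True

final, failed_again = run_dag(diamond, succeed, done=rescue)
```

In the second attempt only C and D are run, because A and B are already recorded as done in the rescue state.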
41
Frieda wants more
  • She decides to use the graduate students'
    computers when they aren't in use, to get done sooner.
  • In exchange, they can use the Condor pool too.

42
Frieda's Condor pool
Frieda's Computer (Central Manager)
Graduate Students' Desktop Computers
43
Frieda's Pool is Flexible
  • Since Frieda is a professor, her jobs are
    preferred.
  • Frieda doesn't always have jobs, so now the
    graduate students have access to more computing
    power.
  • Frieda's pool has enabled more work to be done by
    everyone.

44
How does this work?
  • Frieda submits a job. Condor makes a ClassAd and
    gives it to the Central Manager:
  • Owner = "Frieda"
  • MemoryUsed = 40M
  • ImageSize = 20M
  • Requirements = (OpSys == "Linux" && Memory >
    MemoryUsed)
  • The Central Manager collects machine ClassAds:
  • Memory = 128M
  • Requirements = (ImageSize < 50M)
  • Rank = (Owner == "Frieda")
  • The Central Manager finds the best match
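
The matching step can be sketched in Python. This is a toy model, not Condor's matchmaker: Requirements and Rank are plain Python callables, the machine names and the two extra machines are invented, and OpSys is added to the machine ads as an assumption because the job's Requirements refer to it. Choosing among compatible machines by the machine's Rank is a simplification for illustration.

```python
def matches(job, machine):
    """A match needs both sides' Requirements to accept the other ad."""
    return job["Requirements"](job, machine) and machine["Requirements"](machine, job)

def best_match(job, machines):
    """Among compatible machines, pick the one whose Rank favors this job most."""
    candidates = [m for m in machines if matches(job, m)]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m["Rank"](m, job))

# Job ad from the slide (sizes in MB).
job = {
    "Owner": "Frieda",
    "MemoryUsed": 40,
    "ImageSize": 20,
    "Requirements": lambda me, other: (
        other["OpSys"] == "Linux" and other["Memory"] > me["MemoryUsed"]
    ),
}

machines = [
    {   # machine ad from the slide; Name and OpSys are assumed
        "Name": "grad1", "OpSys": "Linux", "Memory": 128,
        "Requirements": lambda me, other: other["ImageSize"] < 50,
        "Rank": lambda me, other: int(other["Owner"] == "Frieda"),
    },
    {   # hypothetical machine with too little memory for the job
        "Name": "grad2", "OpSys": "Linux", "Memory": 32,
        "Requirements": lambda me, other: True,
        "Rank": lambda me, other: 0,
    },
    {   # hypothetical machine that accepts any job but prefers nobody
        "Name": "grad3", "OpSys": "Linux", "Memory": 256,
        "Requirements": lambda me, other: True,
        "Rank": lambda me, other: 0,
    },
]

chosen = best_match(job, machines)
```

Note that the check is symmetric: the job can reject a machine, and the machine can reject a job.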

45
After a match is found
  • Central Manager tells both parties about the
    match
  • Frieda's computer and the remote computer
    cooperate to run Frieda's job.

46
Lots of flexibility
  • Machines can
  • Only run jobs when I have been idle for at least
    15 minutes, or always run them.
  • Kick off jobs when someone starts using the
    computer, or never kick them off.
  • Jobs can
  • Require or prefer certain machines
  • Use checkpointing, remote I/O, etc
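
The machine-side choices above amount to two small rules, sketched here in Python. `MachinePolicy` and its field names are hypothetical; in Condor such policies are written as configuration expressions rather than Python classes.

```python
from dataclasses import dataclass

@dataclass
class MachinePolicy:
    """Owner-chosen rules for when a workstation runs other people's jobs."""
    min_idle_minutes: float = 15.0   # 0 means "always run jobs"
    evict_on_activity: bool = True   # False means "never kick jobs off"

    def may_start(self, idle_minutes):
        """A job may start once the machine has been idle long enough."""
        return idle_minutes >= self.min_idle_minutes

    def must_evict(self, owner_active):
        """A running job is kicked off only if the policy asks for it."""
        return self.evict_on_activity and owner_active

# A polite desktop and a dedicated cluster node, as two example settings.
desktop = MachinePolicy()
dedicated = MachinePolicy(min_idle_minutes=0, evict_on_activity=False)
```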

47
Happy Day! Frieda's organization purchased a
Beowulf Cluster!
  • Other scientists in her department have realized
    the power of Condor and want to share it.
  • The Beowulf cluster and the graduate student
    computers can be part of a single Condor pool.

48
Frieda's Condor pool
Graduate Students' Desktop Computers
Frieda's Computer (Central Manager)
Beowulf Cluster
49
Frieda's Big Condor Pool
  • Jobs can prefer to run in the Beowulf cluster by
    using Rank.
  • Jobs can run just on appropriate machines based
    on
  • Memory, disk space, software, etc.
  • The Beowulf cluster is dedicated.
  • The student computers are still useful.
  • Everyone's computing power is increased.

50
Frieda collaborates
  • She wants to share her Condor pool with
    scientists from another lab.

51
Condor Flocking
  • Condor pools can work cooperatively

52
Flocking
  • Flocking is Condor-specific: you can just link
    Condor pools together
  • Jobs usually prefer running in their native
    pool, before running in alternate pools.
  • What if you want to connect to a non-Condor pool?

53
Condor-G
  • Condor-G lets you submit jobs to Grid resources.
  • Uses Globus job submission mechanisms
  • You get Condor's benefits
  • Fault tolerance, monitoring, etc.
  • You get the Grid's benefits
  • Use any Grid resources

54
Condor as a Grid Resource
  • Condor can be a backend for Globus
  • Submit Globus jobs to Condor resource
  • The Globus jobs run in the Condor pool

55
Condor Summary
  • Condor is useful, even on a single machine or a
    small pool.
  • Condor can bring computing power to people that
    can't afford a real cluster.
  • Condor can work with dedicated clusters
  • Condor works with the Grid
  • Questions so far?

56
ClassAds
  • Condor uses ClassAds internally to pair jobs with
    machines.
  • Normally, you don't need to know the details when
    you use Condor
  • We saw sample ClassAds earlier.
  • If you like, you can also use ClassAds in your
    own projects.

57
What Are ClassAds?
  • A ClassAd maps attributes to expressions
  • Expressions can be:
  • Constants: strings, numbers, etc.
  • Expressions: other.Memory > 600M
  • Lists: { "roy", "pfc", "melski" }
  • Other ClassAds
  • Powerful tool for grid computing
  • Semi-structured (you pick your structure)
  • Matchmaking

58
ClassAd Example
  • Type = "Job"
  • Owner = "roy"
  • Universe = "Standard"
  • Requirements = (other.OpSys == "Linux" &&
    other.DiskSpace > 140M)
  • Rank = (other.DiskSpace > 300M ? 10 : 1)
  • ClusterID = 12314
  • JobID = 0
  • Env = ""
  • Real ClassAds have more fields than will fit on
    this slide.

59
ClassAd Matchmaking
  • Type = "Job"
  • Owner = "roy"
  • Requirements = (other.OpSys == "Linux" &&
    other.DiskSpace > 140M)
  • Rank = (other.DiskSpace > 300M ? 10 : 1)
  • Type = "Machine"
  • OpSys = "Linux"
  • DiskSpace = 500M
  • AllowedUsers = { "roy", "melski", "pfc" }
  • Requirements = IsMember(other.Owner,
    AllowedUsers)

60
ClassAds Are Open Source
  • Library GNU Public License (LGPL)
  • Complete source code included
  • Library code
  • Test program
  • Available from
  • http://www.cs.wisc.edu/condor/classad
  • Version 0.9.3

61
Who Uses ClassAds?
  • Condor
  • European Data Grid
  • NeST
  • Web site
  • You?

62
ClassAd User: Condor
  • ClassAds describe jobs and machines
  • Matchmaking figures out what jobs run on which
    machines
  • DAGMan will soon internally represent DAGs as
    ClassAds

63
ClassAd User: EU Datagrid
  • JDL: ClassAd schema to describe jobs/machines
  • ResourceBroker matches jobs to machines

64
ClassAd User: NeST
  • NeST is a storage appliance
  • NeST uses ClassAd collections for persistent
    storage of
  • User Information
  • File meta-data
  • Disk Information
  • Lots (storage space allocations)

65
ClassAd User: Web Site
  • Web-based application in Germany
  • User actions (transitions) are constrained
  • Constraints expressed through ClassAds

66
ClassAd Summary
  • ClassAds are flexible
  • Matchmaking is powerful
  • You can use ClassAd independently of Condor
  • http://www.cs.wisc.edu/condor/classad/

67
MW: Master-Worker
  • Master-Worker Style Parallel Applications
  • Large problem partitioned into small pieces
    (tasks)
  • The master manages tasks and resources (worker
    pool)
  • Each worker gets a task, executes it, sends the
    result back, and repeats until all tasks are done
  • Examples: ray-tracing, optimization problems,
    etc.
  • On Condor (PVM, Globus, ...)
  • Many opportunities!
  • Issues (in a Distributed Opportunistic
    Environment)
  • Resource management, communication, portability
  • Fault-tolerance, dealing with runtime pool
    changes.

68
MW to Simplify the Work!
  • An OO framework with simple interfaces
  • 3 classes to extend, a few virtual functions to
    fill
  • Scientists can focus on their algorithms.
  • Lots of Functionality
  • Handles all the issues in a meta-computing
    environment
  • Provides sufficient info. to make smart
    decisions.
  • Many Choices without Changing User Code
  • Multiple resource managers: Condor, PVM, ...
  • Multiple communication interfaces: PVM, File,
    Socket, ...

69
MW's Layered Architecture
[Figure: layered diagram showing the application classes (MW App.), the API, the MW abstract classes, the Infrastructure Provider's Interface (IPI) with the resource manager and communication layer, and the underlying infrastructure.]
70
MW's Runtime Structure
[Figure: a master process, holding the Workers, ToDo tasks, and Running tasks lists, connected to several worker processes.]
  1. User code adds tasks to the master's Todo list
  2. Each task is sent to a worker (Todo -> Running)
  3. The task is executed by the worker
  4. The result is sent back to the master
  5. User code processes the result (can add/remove
    tasks).
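
The five steps above can be sketched with Python's standard thread pool. This is an illustration of the master-worker pattern only; it is not the MW framework's API, and `run_master`, `execute`, and `on_result` are hypothetical names.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_master(initial_tasks, execute, on_result, workers=4):
    """Drive tasks through Todo -> Running -> done with a pool of workers.

    execute(task) runs on a worker; on_result(task, result) runs on the
    master and may return an iterable of new tasks for the Todo list.
    """
    todo = list(initial_tasks)        # step 1: user code fills the Todo list
    running = {}                      # future -> task
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while todo or running:
            while todo:               # step 2: Todo -> Running
                task = todo.pop(0)
                running[pool.submit(execute, task)] = task   # step 3 runs on a worker
            finished, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in finished:   # steps 4 and 5: collect and process results
                task = running.pop(future)
                todo.extend(on_result(task, future.result()) or ())

results = {}

def record(task, result):
    """Store the result; small tasks spawn one follow-up task."""
    results[task] = result
    return [task * 10] if task < 3 else []

run_master([1, 2, 3], lambda n: n * n, record)
```

Because `on_result` runs only on the master, user code can update shared state such as `results` without locking.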

71
MW Summary
  • It's simple
  • simple API, minimal user code.
  • It's powerful
  • works on meta-computing platforms.
  • It's inexpensive
  • On top of Condor, it can exploit 100s of
    machines.
  • It solves hard problems!
  • Nug30, STORM, ...

72
MW Success Stories
  • Nug30 solved in 7 days by MW-QAP
  • Quadratic assignment problem outstanding for 30
    years
  • Utilized 2500 machines from 10 sites
  • NCSA, ANL, UWisc, Gatech, INFN@Italy, ...
  • 1009 workers at peak, 11 CPU years
  • http://www-unix.mcs.anl.gov/metaneos/nug30/
  • STORM (flight scheduling)
  • Stochastic programming problem (1000M row X
    13000M col)
  • 2K times larger than the best sequential program
    can do
  • 556 workers at peak, 1 CPU year
  • http://www.cs.wisc.edu/swright/stochastic/atr/

73
MW Information
  • http://www.cs.wisc.edu/condor/mw/

74
Questions So Far?
75
NeST
  • Traditional file servers have not evolved
  • NeST is a 2nd gen file server
  • Flexible storage appliance for the grid
  • Provides local and remote access to data
  • Easy management of storage resources
  • User-level software turns machines into storage appliances
  • Deployable and portable

76
Research Meets Production
  • NeST exists at an exciting intersection
  • Freedom to pursue academic curiosities
  • Opportunities to discover real user concerns

77
Very exciting intersection
78
NeST Supports Lots
  • A lot is a guaranteed storage allocation.
  • When you run your large analysis on a Grid, will
    you have sufficient storage for your results?
  • Lots ensure you have storage space.
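
A lot can be pictured as a reservation that is granted or refused up front. The Python sketch below is a toy model of that idea; `StorageServer` and its methods are hypothetical and do not reflect NeST's actual interface.

```python
class StorageServer:
    """Toy storage server that hands out guaranteed space reservations (lots)."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.lots = {}          # lot id -> [reserved MB, used MB]
        self._next_id = 1

    def free_mb(self):
        """Space not yet promised to any lot."""
        return self.capacity_mb - sum(size for size, _ in self.lots.values())

    def create_lot(self, size_mb):
        """Reserve size_mb now, or fail now rather than in the middle of a run."""
        if size_mb > self.free_mb():
            raise ValueError("not enough unreserved space for this lot")
        lot_id = self._next_id
        self._next_id += 1
        self.lots[lot_id] = [size_mb, 0]
        return lot_id

    def write(self, lot_id, size_mb):
        """Charge a write against the lot; it may not exceed the reservation."""
        lot = self.lots[lot_id]
        if lot[1] + size_mb > lot[0]:
            raise ValueError("write exceeds the lot's reservation")
        lot[1] += size_mb
```

For example, 600 results of 50 MB each would need a lot of 30000 MB; once the lot is created, other users cannot take that space while the analysis runs.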

79
NeST Supports Multiple Protocols
  • Interoperability between admin domains
  • NeST currently speaks
  • Grid FTP and FTP
  • HTTP
  • NFS (beta)
  • Chirp
  • Designed for integration of new protocols

80
Design structure
[Figure: layered design with the physical network layer, the protocol handlers (Chirp, FTP, GridFTP, NFS, HTTP), the common protocol layer, the storage manager, and the physical storage layer.]
81
Why not JBOS?
  • Just a bunch of servers has limitations
  • NeST advantages over JBOS
  • Single config and admin interface
  • Optimizations across multiple protocols
  • e.g. cache aware scheduling
  • Management and control of protocols
  • e.g. prefer local users to remote users

82
Three-Way Matching
[Figure: a Job Ad, a Machine Ad, and a Storage Ad are matched together, linking the job, the machine, and NeST. The Job Ad refers to NearestStorage; the Machine Ad knows where NearestStorage is.]
83
Three way ClassAds
Job ClassAd
  Type = "job"
  TargetType = "machine"
  Cmd = "sim.exe"
  Owner = "thain"
  Requirements = (OpSys == "linux") && NearestStorage.HasCMSData
Machine ClassAd
  Type = "machine"
  TargetType = "job"
  OpSys = "linux"
  Requirements = (Owner == "thain")
  NearestStorage = (Name == "turkey") && (Type == "Storage")
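
A three-way match can be sketched in Python as a two-step lookup. The sketch assumes the job requires a linux machine whose nearest storage has CMS data, and that the machine names its nearest storage as the Storage ad called turkey. The function names, the callable-based ads, and the extra storage ad "goose" are hypothetical.

```python
def find_storage(machine, storage_ads):
    """Resolve the machine's NearestStorage expression to a storage ad."""
    for ad in storage_ads:
        if machine["NearestStorage"](ad):
            return ad
    return None

def three_way_match(job, machine, storage_ads):
    """Return the matched storage ad, or None if the three ads do not fit."""
    storage = find_storage(machine, storage_ads)
    if storage is None:
        return None
    job_ok = job["Requirements"](machine, storage)
    machine_ok = machine["Requirements"](job)
    return storage if job_ok and machine_ok else None

job = {
    "Type": "job", "TargetType": "machine", "Cmd": "sim.exe", "Owner": "thain",
    "Requirements": lambda machine, storage: (
        machine["OpSys"] == "linux" and storage["HasCMSData"]
    ),
}
machine = {
    "Type": "machine", "TargetType": "job", "OpSys": "linux",
    "Requirements": lambda job: job["Owner"] == "thain",
    "NearestStorage": lambda ad: ad["Name"] == "turkey" and ad["Type"] == "Storage",
}
# Hypothetical storage ads; only the name "turkey" and HasCMSData come from the slide.
storage_ads = [
    {"Type": "Storage", "Name": "goose", "HasCMSData": True},
    {"Type": "Storage", "Name": "turkey", "HasCMSData": True},
]

matched = three_way_match(job, machine, storage_ads)
```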
84
NeST Information
  • http://www.cs.wisc.edu/condor/nest
  • Version 0.9 now available (Linux only, no NFS)
  • Solaris and NFS coming soon
  • Requests welcome

85
DaP Scheduler
  • Intelligent scheduling of data transfers

86
Applications Demand Storage
  • Database systems
  • Multimedia applications
  • Scientific applications
  • High Energy Physics and Computational Genomics
  • Currently terabytes, soon petabytes of data

87
Is Remote access good enough?
  • Huge amounts of data (mostly in tapes)
  • Large number of users
  • Distance / Low Bandwidth
  • Different platforms
  • Scalability and efficiency concerns
  • => Middleware is required

88
Two approaches
  • Move job/application to the data
  • Less common
  • Insufficient computational power on storage site
  • Not efficient
  • Does not scale
  • Move data to the job/application

89
Move data to the Job
[Figure: data moves from a remote staging area over the WAN to a local storage area (e.g. local disk, NeST server), and from there over the LAN to the compute cluster.]
90
Main Issues
  • 1. Insufficient local storage area
  • 2. CPU should not wait much for I/O
  • 3. Crash Recovery
  • 4. Different Platforms and Protocols
  • 5. Make it simple

91
Data Placement Scheduler (DaPS)
  • Intelligently Manages and Schedules Data
    Placement (DaP) activities/jobs
  • DaPS is to DaP jobs what Condor is to
    computational jobs
  • Just submit a bunch of DaP jobs and then relax.

92
Supported Protocols
  • Currently supported
  • FTP
  • GridFTP
  • NeST (chirp)
  • SRB (Storage Resource Broker)
  • Very soon
  • SRM (Storage Resource Manager)
  • GDMP (Grid Data Management Pilot)

93
Case Study: DAGMan
.dag File
94
Current DAG structure
  • All jobs are assumed to be computational jobs

Job A
Job C
Job B
Job D
95
Current DAG structure
  • If data transfer to/from remote sites is
    required, this is performed via pre- and
    post-scripts attached to each job.

Job A
PRE Job B POST
Job C
Job D
96
New DAG structure
  • Add DaP jobs to the DAG structure

PRE Job B POST
97
New DAGMan Architecture
[Figure: DAGMan reads the .dag file and feeds two queues, the DaPS job queue and the Condor job queue; the DAG has nodes A, B, C, D, X, and Y.]
98
DaP Conclusion
  • More intelligent management of remote data
    transfer and staging
  • increase local storage utilization
  • maximize CPU throughput

99
Questions So Far?
100
Hawkeye
  • Sys admins first need information about what is
    happening on the machines they are responsible
    for.
  • Both current and past
  • Information must be consolidated and easily
    accessible
  • Information must be dynamic

101
[Figure: one HawkEye Manager connected to several HawkEye Monitoring Agents.]
102
[Figure: inside a HawkEye Monitoring Agent, Hawkeye_Startup_Agent and Hawkeye_Monitor read /proc and kstat and send ClassAd updates to the HawkEye Manager.]
103
Monitor Agent, cont.
  • Updates are sent periodically
  • Information does not get stale
  • Updates also serve as a heartbeat monitor
  • Know when a machine is down
  • Out of the box, the update ClassAd has many
    attributes about the machine of interest for
    system administration
  • Current prototype: about 200 attributes
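
The heartbeat idea can be sketched in a few lines of Python: a machine counts as down when its latest update is too old. The function name, the `missed` threshold of three intervals, and the example timestamps are assumptions for illustration, not HawkEye settings.

```python
def down_machines(last_update, now, interval, missed=3):
    """Names of machines whose latest update is older than `missed` intervals.

    last_update maps a machine name to the time (in seconds) of its most
    recent ClassAd update; interval is the expected time between updates.
    """
    return sorted(
        name for name, seen in last_update.items()
        if now - seen > missed * interval
    )

# Hypothetical update times, in seconds.
last_seen = {"node1": 100.0, "node2": 40.0, "node3": 95.0}
suspects = down_machines(last_seen, now=100.0, interval=10.0)
```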

104
Custom Attributes
[Figure: the HawkEye Monitoring Agent (Hawkeye_Startup_Agent, Hawkeye_Monitor) sends the HawkEye Manager data from /proc and kstat, plus data from the hawkeye_update_attribute command-line tool.]
  • Create your own HawkEye plugins, or share plugins
    with others
105
Role of HawkEye Manager
  • Store all incoming ClassAds in an indexed resident
    data structure
  • Fast response to client tool queries about
    current state
  • Show me all machines with a load average > 10
  • Periodically store ClassAd attributes into a
    Round Robin Database
  • Store information over time
  • Show me a graph with the load average for this
    machine over the past week
  • Speak to clients via CEDAR, HTTP

106
Web client
http://www.cs.wisc.edu/roy/hawkeye/
  • Command-line, GUI, Web-based

107
Running tasks on behalf of the sys admin
  • Submit your sys admin tasks to HawkEye
  • Tasks are stored in a persistent queue by the
    Manager
  • Tasks can leave the queue upon completion, or
    repeat after specified intervals
  • Tasks can have complex interdependencies via
    DAGMan
  • Records are kept on which task ran where
  • Sounds like Condor, eh?
  • Yes, but simpler

108
Run Tasks in response to monitoring information
  • ClassAd Requirements Attribute
  • Example: Send email if a machine is low on disk
    space or low on swap space
  • Submit an email task with an attribute
  • Requirements = free_disk < 5 || free_swap < 5
  • Example w/ task interdependency: If load average
    is high and OS == Linux and console is Idle, submit
    a task which runs top; if top sees Netscape,
    submit a task to kill Netscape
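
The first example can be sketched in Python as a Requirements test applied to each machine's attributes. The attribute names free_disk and free_swap come from the slide; their units are not stated there, and the function names, machine names, and sample values below are hypothetical. Combining the two conditions with "or" follows the slide's wording.

```python
def low_on_space(ad):
    """Requirements for the email task: low on disk or low on swap."""
    return ad["free_disk"] < 5 or ad["free_swap"] < 5

def triggered(tasks, machine_ads):
    """Return (task name, machine name) pairs whose Requirements hold."""
    return [
        (task, ad["Name"])
        for task, requirements in tasks.items()
        for ad in machine_ads
        if requirements(ad)
    ]

# Hypothetical monitoring data.
machine_ads = [
    {"Name": "node1", "free_disk": 40, "free_swap": 60},
    {"Name": "node2", "free_disk": 3, "free_swap": 60},
    {"Name": "node3", "free_disk": 40, "free_swap": 1},
]
to_run = triggered({"send_email": low_on_space}, machine_ads)
```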

109
Today's Summary
  • Condor works on many levels
  • Small pools can make a big difference
  • Big pools are for the really big problems
  • Condor works in the Grid
  • Condor is assisted by a host of technologies
  • ClassAds, Checkpointing, Remote I/O, DAGMan,
    Master-Worker, NeST, DaPScheduler, Hawkeye

110
Questions? Comments?
  • Web: www.cs.wisc.edu/condor
  • Email: condor-admin@cs.wisc.edu