Transcript and Presenter's Notes

Title: The Condor Story (why it is worth developing the plot further)


1
The Condor Story (why it is worth developing
the plot further)
2
Regardless of what we call IT (distributed
computing, eScience, grid, cyberinfrastructure,
...), IT is not easy!
3
Therefore, if we want IT to happen, we MUST
join forces and work together
4
Working Together
  • Each of us must act as both a consumer
    and a provider, and view the others in the same way
  • We have to know each other
  • We have to trust each other
  • We have to understand each other

5
The Condor Project (Established '85)
  • Distributed Computing research performed by a
    team of 40 faculty, full-time staff and students
    who
  • face software/middleware engineering challenges,
  • are involved in national and international
    collaborations,
  • interact with users in academia and industry,
  • maintain and support a distributed production
    environment (more than 2,300 CPUs at UW),
  • and educate and train students.
  • Funding ($4.5M annual budget):
  • DoE, NASA, NIH, NSF, EU, INTEL, Micron,
    Microsoft and the UW Graduate School

6
(No Transcript)
7
Excellence
Support
Functionality
Research
8
  • Since the early days of mankind the primary
    motivation for the establishment of communities
    has been the idea that by being part of an
    organized group the capabilities of an individual
    are improved. The great progress in the area of
    inter-computer communication led to the
    development of means by which stand-alone
    processing sub-systems can be integrated into
    multi-computer communities.

Miron Livny, "Study of Load Balancing Algorithms
for Decentralized Distributed Processing
Systems", Ph.D. thesis, July 1983.
9
Claims for benefits provided by Distributed
Processing Systems
  • High Availability and Reliability
  • High System Performance
  • Ease of Modular and Incremental Growth
  • Automatic Load and Resource Sharing
  • Good Response to Temporary Overloads
  • Easy Expansion in Capacity and/or Function

"What is a Distributed Data Processing System?",
P. H. Enslow, Computer, January 1978.
10
Benefits to Science
  • Democratization of Computing: you do not have
    to be a SUPER person to do SUPER computing.
    (accessibility)
  • Speculative Science: since the resources are
    there, let's run it and see what we get.
    (unbounded computing power)
  • Function shipping: find the image that has a
    red car in this 3 TB collection. (computational
    mobility)

11
High Throughput Computing
  • For many experimental scientists, scientific
    progress and quality of research are strongly
    linked to computing throughput. In other words,
    they are less concerned about instantaneous
    computing power. Instead, what matters to them is
    the amount of computing they can harness over a
    month or a year --- they measure computing power
    in units of scenarios per day, wind patterns per
    week, instruction sets per month, or crystal
    configurations per year.

12
High Throughput Computing is a 24-7-365 activity
FLOPY ≠ (60 × 60 × 24 × 7 × 52) × FLOPS
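To make the arithmetic concrete, here is a minimal sketch of the point; the peak rate and utilization figures are arbitrary example values, not numbers from the talk:

    # A year of floating-point work (FLOPY) is not peak speed times seconds per year.
    PEAK_FLOPS = 1e9                            # example peak rate: 1 GFLOPS
    SECONDS_PER_YEAR = 60 * 60 * 24 * 7 * 52    # the (60*60*24*7*52) factor from the slide

    ideal_flopy = PEAK_FLOPS * SECONDS_PER_YEAR # what 100% utilization would deliver

    # Machines are down, idle, or doing other work, so the computing actually
    # harnessed over a year is far lower; sustained throughput is what matters.
    utilization = 0.25                          # assumed fraction of cycles harvested
    actual_flopy = ideal_flopy * utilization

    print(f"ideal  FLOPY: {ideal_flopy:.3e}")
    print(f"actual FLOPY: {actual_flopy:.3e} (at {utilization:.0%} utilization)")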
13
Every community needs a Matchmaker!
(or a Classified section in the newspaper, or an
eBay)
14
We use Matchmakers to build Computing
Communities out of Commodity Components
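The idea can be sketched in a few lines: providers and consumers each advertise what they offer and what they require, and a matchmaker pairs advertisements whose requirements accept each other. The sketch below only illustrates that concept in Python; it is not Condor's ClassAd language, and all names and attributes are hypothetical.

    # Toy matchmaking: machines and jobs advertise themselves, and the matchmaker
    # pairs advertisements whose requirements are mutually satisfied.
    machines = [
        {"name": "vulture", "os": "LINUX", "memory_mb": 2048,
         "requirements": lambda job: job["image_mb"] <= 2048},
        {"name": "osprey", "os": "SOLARIS", "memory_mb": 512,
         "requirements": lambda job: job["image_mb"] <= 512},
    ]
    jobs = [
        {"owner": "alice", "image_mb": 1024,
         "requirements": lambda m: m["os"] == "LINUX"},
        {"owner": "bob", "image_mb": 256,
         "requirements": lambda m: m["memory_mb"] >= 256},
    ]

    def matchmake(jobs, machines):
        """Pair each job with the first free machine where both sides' requirements hold."""
        matches, free = [], list(machines)
        for job in jobs:
            for machine in free:
                if job["requirements"](machine) and machine["requirements"](job):
                    matches.append((job["owner"], machine["name"]))
                    free.remove(machine)
                    break
        return matches

    print(matchmake(jobs, machines))  # [('alice', 'vulture'), ('bob', 'osprey')]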
15
CERN '92
16
The '94 Worldwide Condor Flock
[Map: the 1994 flock, with pools in Madison, Amsterdam, Delft, Geneva, Warsaw and Dubna/Berlin; pool sizes range from 3 to 200 machines.]
17
The Grid: Blueprint for a New Computing
Infrastructure, edited by Ian Foster and Carl
Kesselman, July 1998, 701 pages.
The grid promises to fundamentally change the way
we think about and use computing. This
infrastructure will connect multiple regional and
national computational grids, creating a
universal source of pervasive and dependable
computing power that supports dramatically new
classes of applications. The Grid provides a
clear vision of what computational grids are, why
we need them, who will use them, and how they
will be programmed.
18
  • We claim that these mechanisms, although
    originally developed in the context of a cluster
    of workstations, are also applicable to
    computational grids. In addition to the required
    flexibility of services in these grids, a very
    important concern is that the system be robust
    enough to run in production mode continuously
    even in the face of component failures.

Miron Livny and Rajesh Raman, "High Throughput
Resource Management", in The Grid: Blueprint for
a New Computing Infrastructure.
19
  • Grid computing is a partnership between
    clients and servers. Grid clients have more
    responsibilities than traditional clients, and
    must be equipped with powerful mechanisms for
    dealing with and recovering from failures,
    whether they occur in the context of remote
    execution, work management, or data output. When
    clients are powerful, servers must accommodate
    them by using careful protocols.

Douglas Thain and Miron Livny, "Building Reliable
Clients and Servers", in The Grid: Blueprint for
a New Computing Infrastructure, 2nd edition.
20
(No Transcript)
21
Grid
WWW
22
Being a Master
  • A customer deposits task(s) with the master,
    which is responsible for:
  • obtaining resources and/or workers,
  • deploying and managing workers on obtained
    resources,
  • assigning and delivering work units to
    obtained/deployed workers,
  • receiving and processing results,
  • and notifying the customer (see the sketch below).
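A minimal sketch of such a master loop, assuming a hypothetical worker interface with a blocking run(task) method; this is an illustration only, not Condor's actual master-worker framework:

    import itertools

    def run_master(tasks, workers, notify_customer):
        """Toy master: assign work units to workers, collect results, notify the customer.

        `workers` is a list of objects exposing a blocking .run(task) method
        (a hypothetical interface). A real master would also obtain resources,
        deploy and monitor workers, bound its retries, and persist its queue.
        """
        pending = list(tasks)                     # customer deposits task(s) with the master
        results = []
        rotation = itertools.cycle(range(len(workers)))
        while pending:
            task = pending.pop(0)
            worker = workers[next(rotation)]      # trivial round-robin assignment policy
            try:
                results.append(worker.run(task))  # deliver the work unit, receive the result
            except Exception:
                pending.append(task)              # worker failed: re-queue for another worker
        notify_customer(results)                  # report the processed results back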

23
our answer to High Throughput MW (master-worker)
computing on commodity resources
24
(No Transcript)
25
The Layers of Condor
Matchmaker
26
[Diagram: a PSE or user on top of Condor-G (schedd); below it, a local Condor pool connected to remote Condor pools via flocking.]
27
Cycle Delivery at the Madison campus
28
Yearly Condor usage at UW-CS
[Chart: CPU hours per year, on a scale from 2,000,000 to 10,000,000.]
29
Yearly Condor CPUs at UW
30
(Inter)national science
31
U.S. Trillium Grid Partnership
  • Trillium = PPDG + GriPhyN + iVDGL
  • Particle Physics Data Grid: $12M (DOE) (1999-2004)
  • GriPhyN: $12M (NSF) (2000-2005)
  • iVDGL: $14M (NSF) (2001-2006)
  • Basic composition (150 people)
  • PPDG: 4 universities, 6 labs
  • GriPhyN: 12 universities, SDSC, 3 labs
  • iVDGL: 18 universities, SDSC, 4 labs, foreign
    partners
  • Experiments: BaBar, D0, STAR, JLab, CMS, ATLAS,
    LIGO, SDSS/NVO
  • Complementarity of projects
  • GriPhyN: CS research, Virtual Data Toolkit (VDT)
    development
  • PPDG: end-to-end Grid services, monitoring,
    analysis
  • iVDGL: Grid laboratory deployment using VDT
  • Experiments provide frontier challenges
  • Unified entity when collaborating internationally

32
  • Grid2003: an operational national Grid
  • 28 sites: universities and national labs
  • 2800 CPUs, 400-1300 jobs
  • Running since October 2003
  • Applications in HEP, LIGO, SDSS, Genomics

[Map of Grid2003 sites, including a site in Korea.]
http://www.ivdgl.org/grid2003
33
Contributions to Grid3
  • Condor-G: your window to Grid3 resources
  • GRAM 1.5, GASS Cache
  • Directed Acyclic Graph Manager (DAGMan)
  • Packaging, distribution and support of the
    Virtual Data Toolkit (VDT)
  • Troubleshooting
  • Technical road-map/blueprint

34
Contributions to EDG/EGEE
  • Condor-G
  • DAGMan
  • VDT
  • Design of gLite
  • Testbed

35
VDT Growth
Milestones from the growth chart:
  • VDT 1.0: Globus 2.0b, Condor 6.3.1
  • VDT 1.1.3, 1.1.4, 1.1.5: pre-SC 2002
  • VDT 1.1.7: switch to Globus 2.2
  • VDT 1.1.8: first real use by LCG
  • VDT 1.1.11: Grid2003
36
The Build Process
[Diagram: sources from CVS and contributors (VDS, etc.) are built into binaries by NMI on a build-and-test Condor pool (40 computers); VDT packages and patches the binaries into a Pacman cache, RPMs, and GPT source bundles, which are then tested.]
37
Tools in the VDT 1.2.0
Components built by NMI
  • Condor Group: Condor/Condor-G, Fault Tolerant
    Shell, ClassAds
  • Globus Alliance: Job submission (GRAM),
    Information service (MDS), Data transfer
    (GridFTP), Replica Location (RLS)
  • EDG/LCG: Make Gridmap, Certificate Revocation
    List Updater, Glue Schema/Info provider
  • ISI/UC: Chimera, Pegasus
  • NCSA: MyProxy, GSI OpenSSH, UberFTP
  • LBL: PyGlobus, NetLogger
  • Caltech: MonALISA
  • VDT: VDT System Profiler, configuration software
  • Others: KX509 (U. Mich.), DRM 1.2, Java, FBSng
    job manager

38
Tools in the VDT 1.2.0
Components built by contributors
(component list identical to slide 37)

39
Tools in the VDT 1.2.0
Components built by VDT
(component list identical to slide 37)

40
Health
41
Condor at Noregon
> At 10:14 AM 7/15/2004 -0400, xxx wrote:
> Dr. Livny,
> I wanted to update you on our progress with our
> grid computing project. We have about 300 nodes
> deployed presently with the ability to deploy up
> to 6,000 total nodes whenever we are ready. The
> project has been getting attention in the local
> press and has gained the full support of the
> public school system and generated a lot of
> excitement in the business community.
  • Noregon has entered into a partnership with
    Targacept Inc. to develop a system to efficiently
    perform molecular dynamics simulations. Targacept
    is a privately held pharmaceutical company
    located in Winston-Salem's Triad Research Park
    whose efforts are focused on creating drug
    therapies for neurological, psychiatric, and
    gastrointestinal diseases.
  • Using the Condor grid middleware, Noregon is
    designing and implementing an ensemble
    Car-Parrinello simulation tool for Targacept that
    will allow a simulation to be distributed across
    a large grid of inexpensive Windows PCs.
    Simulations can be completed in a fraction of the
    time without the use of high performance
    (expensive) hardware.

42
Electronics
43
Condor at Micron
44
Condor at Micron
  • The Chief Officer value proposition
  • Info Week 2004 IT Survey includes Grid questions!
  • Makes our CIO look good by letting him answer yes
  • Micron's 2003 rank: 23rd
  • Without Condor we only get about 25% of PC value
    today
  • Didn't tell our CFO a $1,000 PC really costs $4,000!
  • Doubling utilization to 50% doubles the CFO's
    return on capital
  • Micron's goal: 66% monthly average utilization
  • Providing a personal supercomputer to every
    engineer
  • CTO appreciates the cool factor
  • CTO really gets it when his engineers say
  • "I don't know how I would have done that without
    the Grid"

45
Condor at Micron: Example Value
  • 73,606 job hours / 24 / 30 ≈ 103 Solaris boxes
  • 103 × $10,000/box ≈ $1,022,306
  • And that's just for one application, not
    considering decreased development time, increased
    uptime, etc.
  • Chances are, if you have Micron memory in your PC,
    it was processed by Condor!

46
Software Engineering
47
Condor at Oracle
  • Condor is used within Oracle's Automated
    Integration Management Environment (AIME) to
    perform automated build and regression testing of
    multiple components for Oracle's flagship
    Database Server product. Each day, nearly 1,000
    developers make contributions to the code base of
    Oracle Database Server. Just the compilation
    alone of these software modules would take over
    11 hours on a capable workstation. But in
    addition to building, AIME must control
    repository labelling/tagging, configuration
    publishing, and last but certainly not least,
    regression testing. Oracle is very serious about
    the stability and correctness of their products;
    therefore, the AIME daily regression
    test suite currently covers 90,000 testable items
    divided into over 700 test packages. The entire
    process must complete within 12 hours to keep
    development moving forward. About five years
    ago, Oracle selected Condor as the resource
    manager underneath AIME because they liked the
    maturity of Condor's core components. In total,
    3,000 CPUs at Oracle are managed by Condor today.

48
GRIDS Center: Enabling Collaborative Science
(Grid Research, Integration, Development and
Support)
49
Procedures, Tools and Facilities
  • Build: generate executable versions of a
    component
  • Package: integrate executables into a
    distribution
  • Test: verify the functionality of
  • a component,
  • a set of components,
  • a distribution

50
Build
  • Reproducibility: "build the version we released
    2 years ago!"
  • Well managed source repository
  • Know your externals and keep them around
  • Portability: "build the component on
    build17.nmi.wisc.edu!"
  • No dependencies on local capabilities
  • Understand your hardware requirements
  • Manageability: "run the build daily and email me
    the outcome"

51
[Workflow diagram: fetch the component, move the
source files to each build site, build the
component there, retrieve the executables from
each build site, then report the outcome and
clean up.]
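A minimal sketch of that per-platform loop. Every step is a caller-supplied callable because the real facility uses its own tools (CVS checkout, GridFTP transfers, remote job submission); the helper names here are hypothetical stand-ins:

    def nightly_build(component, version, sites,
                      fetch, ship, build, retrieve, cleanup, report):
        """Sketch of the build workflow above, run once per target platform."""
        sources = fetch(component, version)      # fetch the tagged component sources
        outcomes = {}
        for site in sites:                       # one build per platform/build site
            try:
                ship(sources, site)              # move source files to the build site
                build(component, site)           # build the component there
                outcomes[site] = retrieve(site)  # retrieve executables from the build site
            except Exception as err:
                outcomes[site] = err             # a failing platform must not stop the others
            finally:
                cleanup(site)                    # always clean up the remote scratch area
        report(component, version, outcomes)     # report the outcome (e.g. by email)
        return outcomes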
52
Goals of the Build Facility
  • Design, develop and deploy a build system (HW and
    software) capable of performing daily builds of a
    suite of middleware packages on a heterogeneous
    (HW, OS, libraries, ...) collection of platforms
  • Dependable
  • Traceable
  • Manageable
  • Portable
  • Extensible
  • Schedulable

53
Using our own technologies
  • Using GRIDS technologies to automate the build,
    deploy, and test cycle
  • Condor: schedule build and testing tasks
  • DAGMan: manage build and testing workflows
  • GridFTP: copy/move files
  • GSI-OpenSSH: remote login, start/stop services,
    etc.
  • Constructed and manage a dedicated heterogeneous
    and distributed facility

54
NMI Build facility
  • Build resources:
  • nmi-aix.cs.wisc.edu
  • nmi-hpux.cs.wisc.edu
  • nmi-irix.cs.wisc.edu
  • nmi-rh72-alpha.cs.wisc.edu
  • nmi-redhat72-ia64.cs.wisc.edu
  • nmi-sles8-ia64.cs.wisc.edu
  • nmi-redhat72-build.cs.wisc.edu
  • nmi-redhat72-dev.cs.wisc.edu
  • nmi-redhat80-ia32.cs.wisc.edu
  • nmi-redhat9-ia32.cs.wisc.edu (rh9 x86)
  • nmi-test-1.cs.wisc.edu (production system, rh73 x86)
  • vger.cs.wisc.edu
  • nmi-dux40f.cs.wisc.edu
  • nmi-tru64.cs.wisc.edu
  • nmi-macosx.local
  • nmi-solaris6.cs.wisc.edu
  • nmi-solaris7.cs.wisc.edu

[Diagram: the facility is driven by a web interface, a database, a build manager, a build generator, and email notification.]
55
The VDT operation
[Diagram: the same build-and-release flow as slide 36 (sources from CVS and contributors built by NMI, then packaged, patched and tested by VDT into a Pacman cache, RPMs, and GPT source bundles), here with a build-and-test Condor pool of 37 computers.]
56
Test
  • Reproducibility: "run last year's test harness on
    last week's build!"
  • Separation between build and test processes
  • Well managed repository of test harnesses
  • Know your externals and keep them around
  • Portability: "run the test harness of component
    A on test17.nmi.wisc.edu!"
  • Automatic install and de-install of the component
  • No dependencies on local capabilities
  • Understand your hardware requirements
  • Manageability: "run the test suite daily and
    email me the outcome" (see the sketch below)
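A sketch of how one such run might be driven, assuming hypothetical install and harness-runner helpers; the point is that any harness version can be paired with any build, and that the component is removed again afterwards:

    import shutil
    import tempfile

    def run_component_test(build, harness, install, run_harness, mail_report):
        """Pair an arbitrary build with an arbitrary test harness (hypothetical helpers)."""
        prefix = tempfile.mkdtemp(prefix="nmi-test-")    # private install area for this run
        try:
            install(build, prefix)                       # automatic install of the component
            outcome = run_harness(harness, prefix)       # e.g. last year's harness on last week's build
        except Exception as err:
            outcome = err
        finally:
            shutil.rmtree(prefix, ignore_errors=True)    # automatic de-install / cleanup
        mail_report(build, harness, outcome)             # run daily and email the outcome
        return outcome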

57
Testing Tools
  • Current focus on component testing
  • Developed scripts and procedures to verify
    deployment and very basic operations
  • Multi-component, multi-version, multi-platform
    test harness and procedures
  • Testing as a bottom-feeder activity
  • Short- and long-term testing cycles

58
Movies
59
C.O.R.E. Digital Pictures
X-Men, X-Men II
  • There has been a lot of buzz in the industry
    about something big going on here at C.O.R.E.
    We're really really really pleased to make the
    following announcement: Yes, it's true. C.O.R.E.
    Digital Pictures has spawned a new division:
    C.O.R.E. Feature Animation. We're in
    production on a CG animated feature film being
    directed by Steve "Spaz" Williams. The script is
    penned by the same writers who brought you
    There's Something About Mary, Ed Decter and John
    Strauss.

The Time Machine
Blade, Blade II
The Nutty Professor II
60
How can we accommodate an unbounded need for
computing with an unbounded amount of
resources?
61
GCB
DAGMan
HawkEye
Parrot
Condor-G
Stork
BirdBath
NeST
Chirp
Condor-C