Title: From a grid of One to a grid of Many
1 From a grid of One to a grid of Many
2 (No Transcript)
3 The Condor Project (established 1985)
- Distributed Computing research performed by a team of 40 faculty, full-time staff and students who
  - face software/middleware engineering challenges in a UNIX/Linux/Windows/OS X environment,
  - are involved in national and international collaborations,
  - interact with users in academia and industry,
  - maintain and support a distributed production environment (more than 3200 CPUs at UW),
  - and educate and train students.
- Funding: DoE, NIH, NSF, EU, INTEL, Micron, Microsoft and the UW Graduate School
4 (Diagram of the Condor technology family: NeST, HawkEye, DAGMan, BirdBath, Condor-G, Stork, Quill, MW, Chirp, Condor-C, GCB)
5 Claims for benefits provided by Distributed Processing Systems
- High Availability and Reliability
- High System Performance
- Ease of Modular and Incremental Growth
- Automatic Load and Resource Sharing
- Good Response to Temporary Overloads
- Easy Expansion in Capacity and/or Function
"What is a Distributed Data Processing System?", P.H. Enslow, Computer, January 1978
6 (No Transcript)
7 The Grid: Blueprint for a New Computing Infrastructure, edited by Ian Foster and Carl Kesselman, July 1998, 701 pages.
"The grid promises to fundamentally change the way we think about and use computing. This infrastructure will connect multiple regional and national computational grids, creating a universal source of pervasive and dependable computing power that supports dramatically new classes of applications."
8 "We claim that these mechanisms, although originally developed in the context of a cluster of workstations, are also applicable to computational grids. In addition to the required flexibility of services in these grids, a very important concern is that the system be robust enough to run in production mode continuously even in the face of component failures."
Miron Livny and Rajesh Raman, "High Throughput Resource Management", in The Grid: Blueprint for a New Computing Infrastructure.
9 "Grid computing is a partnership between clients and servers. Grid clients have more responsibilities than traditional clients, and must be equipped with powerful mechanisms for dealing with and recovering from failures, whether they occur in the context of remote execution, work management, or data output. When clients are powerful, servers must accommodate them by using careful protocols."
Douglas Thain and Miron Livny, "Building Reliable Clients and Servers", in The Grid: Blueprint for a New Computing Infrastructure, 2nd edition.
10 (Diagram: the Grid and the WWW)
11 Being a Master
- Customer delegates task(s) to the master, who is responsible for:
  - Obtaining allocation of resources
  - Deploying and managing workers on allocated resources
  - Delegating work units to deployed workers
  - Receiving and processing results
  - Delivering results to customer
(A submit-file sketch of such delegation follows.)
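As a minimal sketch of what delegating work units looks like in Condor terms (the executable and file names here are hypothetical, not from the talk), the customer hands the schedd, acting as master, a batch of work units through a submit description:

  # Hypothetical submit description: delegate 100 work units to workers
  universe   = vanilla
  executable = worker
  arguments  = input.$(Process)
  output     = result.$(Process)
  error      = worker.$(Process).err
  log        = master.log
  queue 100

The schedd then obtains matches for these work units, runs them on the allocated machines, and records their progress in the log so that results can be collected and delivered back to the customer.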
12 Master must be
- Persistent: work and results must be safely recorded on non-volatile media
- Resourceful: delegates DAGs of work to other masters
- Speculative: takes chances and knows how to recover from failure
- Self-aware: knows its own capabilities and limitations
- Obedient: manages work according to plan
- Reliable: can manage large numbers of work items and resource providers
- Portable: can be deployed on the fly to act as a sub-master
13 Master should not do
- Predictions
- Optimal scheduling
- Data mining
- Bidding
- Forecasting
14 Our answer to High Throughput MW Computing on commodity resources
15 The Layers of Condor
(Layer diagram; the visible label is Matchmaker.)
16 Resource Allocation vs. Work Delegation
17 (No Transcript)
18 Resource Allocation
- A limited assignment of the ownership of a resource
- Owner is charged for the allocation regardless of actual consumption
- Owner can allocate the resource to others
- Owner has the right and means to revoke an allocation
- Allocation is governed by an agreement between the consumer and the owner
- Allocation is always a lease
- Trees of allocations can be formed
19 "We present some principles that we believe should apply in any compute resource management system. The first, P1, speaks to the need to avoid resource leaks of all kinds, as might result, for example, from a monitoring system that consumes a nontrivial number of resources.
P1: It must be possible to monitor and control all resources consumed by a CE, whether for computation or management.
Our second principle is a corollary of P1:
P2: A system should incorporate circuit breakers to protect both the compute resource and clients. For example, negotiating with a CE consumes resources. How do we prevent an eager client from turning into a denial of service attack?"
Ian Foster and Miron Livny, "Virtualization and Management of Compute Resources: Principles and Architecture", a working document (February 2005)
20 Work Delegation
- A limited assignment of the responsibility to perform the work
- Delegation involves a definition of these responsibilities
- Responsibilities may be further delegated
- Delegation always consumes resources
- Delegation is always a lease
- Trees of delegations can be formed
21 From Condor to Condor-G to Condor-C
22 (Diagram: the Condor-G job path. Numbered steps connect DAGMan, the schedD, the shadow, the grid manager, and GAHP-Globus on the submit side with Globus and Unicore gateways and a remote startD/starter on the execute side.)
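For concreteness, a hedged sketch of a Condor-G style submission corresponding to this path; the gatekeeper host and jobmanager name are placeholders, not taken from the talk:

  # Hypothetical Condor-G submit description (grid universe, GT2 gatekeeper)
  universe      = grid
  grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
  executable    = simulate
  output        = simulate.out
  error         = simulate.err
  log           = simulate.log
  queue

Handed to condor_submit, such a job is driven by the schedd's grid manager and GAHP, which carry out the numbered protocol steps against the remote gateway.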
23 (Diagram: a PSE or User submits to a local Condor pool, whose SchedD (Condor-G), matchmaker (MM), and C-app forward work to a remote Condor pool with its own MM and C-app.)
24 Downloads per month
(Chart; the labeled series is X86/Windows.)
25 Condor Adoption
26 Condor adoption: Campus grids
27 "RCAC opens up opportunistic access to 11 TFlops for TeraGrid users" (July 21, 2005), Dr. Sebastien Goasguen, Purdue University
WEST LAFAYETTE, Ind. -- The Rosen Center for Advanced Computing (RCAC) at Purdue University has opened up access to 11 teraflops of computing power to the TeraGrid community. Based on a new model known as "community clusters," developed by researchers at RCAC, this new computing resource will be accessible to TeraGrid researchers and educators using Condor. The community clusters currently supported by RCAC include a 1024 Xeon 64-bit (Irwindale) processor cluster, a 194 Opteron 64-bit processor cluster with InfiniBand interconnects, and a 618 Xeon 32-bit processor cluster, for a combined capacity of 11 TFlops.
28 UW Enterprise-Level Grid
- Condor pools at various departments integrated into a campus-wide grid
  - Grid Laboratory of Wisconsin (GLOW)
  - Older private Condor pools at other departments
    - 1000 1-GHz Intel CPUs at CS
    - 100 2-GHz Intel CPUs at Physics
    - ...
- Condor jobs flock from various departments to GLOW (see the configuration sketch below)
- Excellent utilization
  - Especially when the Condor Standard Universe is used
  - Preemption, Checkpointing, Job Migration
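A rough sketch of the flocking side of this setup, assuming the standard Condor flocking knobs and using placeholder host names:

  # Hypothetical condor_config fragment for a departmental pool
  # Jobs that cannot be matched locally may flock to these pools
  FLOCK_TO = glow-cm.example.wisc.edu, physics-cm.example.wisc.edu
  # Each target pool must list this pool in its own FLOCK_FROM

The effect is that idle departmental jobs overflow to GLOW when local machines are busy, which is what drives the high utilization noted above.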
29 Grid Laboratory of Wisconsin
2003 initiative funded by NSF/UW ($1.5M budget)
Six GLOW sites:
- Computational Genomics, Chemistry
- Amanda, IceCube, Physics/Space Science
- High Energy Physics/CMS, Physics
- Materials by Design, Chemical Engineering
- Radiation Therapy, Medical Physics
- Computer Science
GLOW phases 1 and 2 plus non-GLOW funded nodes already have 1000 Xeons and 100 TB of disk
30 GLOW Deployment
- GLOW Phases I and II are commissioned
- CPU
  - 66 nodes each @ ChemE, CS, LMCG, MedPhys, Physics
  - 30 nodes @ IceCube
  - 100 extra nodes @ CS (50 ATLAS + 50 CS)
  - 26 extra nodes @ Physics
  - Total CPU: 1000
- Storage
  - Head nodes @ all sites
  - 45 TB each @ CS and Physics
  - Total storage: 100 TB
- GLOW resources are used at the 100% level
  - Key is to have multiple user groups
- GLOW continues to grow
31 GLOW Usage Since February 2004
- Leftover cycles available for others
- Takes advantage of shadow jobs
- Takes advantage of checkpointing jobs
- Over 7.6 million CPU-hours (865 CPU-years) served!
32 Example Uses
- ATLAS
  - Over 15 million proton collision events simulated, at 10 minutes each
- CMS
  - Over 10 million events simulated in a month; many more events reconstructed and analyzed
- Computational Genomics
  - Prof. Shwartz asserts that GLOW has opened up a new paradigm of work patterns in his group
  - They no longer think about how long a particular computational job will take; they just do it
- Chemical Engineering
  - Students do not know where the computing cycles are coming from; they just do it
33 Condor adoption: Industry
34 "Seeking the massive computing power needed to hedge a portion of its book of annuity business, Hartford Life, a subsidiary of The Hartford Financial Services Group (Hartford; $18.7 billion in 2003 revenues), has implemented a grid computing solution based on the University of Wisconsin's (Madison, Wis.) Condor open source software. Hartford Life's SVP and CIO Vittorio Severino notes that the move was a matter of necessity. 'It was the necessity to hedge the book,' owing in turn to a tight reinsurance market that is driving the need for an alternative risk management strategy, he says. The challenge was to support the risk generated by clients opting for income protection benefit riders on popular annuity products."
35 Resource: How did you complete this project, on your own or with a vendor's help?
Severino: We completed this project very much on our own. As a matter of fact, it is such a new technology in the insurance industry that others were calling us for assistance on how to do it. So it was interesting, because we were breaking new ground and vendors really couldn't help us. We eventually chose grid computing software from the University of Wisconsin called Condor; it is open source software. We chose the Condor software because it is one of the oldest grid computing software tools around, so it is mature. We have a tremendous amount of confidence in the Condor software.
36 Condor at Micron
37 (No Transcript)
38 Condor adoption: National and International Grids
39 U.S. Trillium Grid Partnership
- Trillium = PPDG + GriPhyN + iVDGL
  - Particle Physics Data Grid: $12M (DOE) (1999-2004)
  - GriPhyN: $12M (NSF) (2000-2005)
  - iVDGL: $14M (NSF) (2001-2006)
- Basic composition (150 people)
  - PPDG: 4 universities, 6 labs
  - GriPhyN: 12 universities, SDSC, 3 labs
  - iVDGL: 18 universities, SDSC, 4 labs, foreign partners
  - Experiments: BaBar, D0, STAR, JLab, CMS, ATLAS, LIGO, SDSS/NVO
- Complementarity of projects
  - GriPhyN: CS research, Virtual Data Toolkit (VDT) development
  - PPDG: end-to-end Grid services, monitoring, analysis
  - iVDGL: Grid laboratory deployment using VDT
  - Experiments provide frontier challenges
- Unified entity when collaborating internationally
40 Grid2003: An Operational National Grid
- 28 sites (universities and national labs)
- 2800 CPUs, 400-1300 jobs
- Running since October 2003
- Applications in HEP, LIGO, SDSS, Genomics
(Site map; labeled sites include Korea.)
http://www.ivdgl.org/grid2003
41 The current gLite CE
- Collaboration of INFN, Univ. of Chicago, Univ. of Wisconsin-Madison, and the EGEE security activity (JRA3)
(Diagram: a job is submitted to the CE via Condor-C; CEMon publishes notifications; Blahpd hands the job to the local batch system: LSF, PBS/Torque, or Condor.)
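For orientation, a hedged sketch of how a client might hand a job to such a CE through Condor-C; the remote schedd and pool names are placeholders, not taken from the talk:

  # Hypothetical Condor-C style submission to the CE's remote schedd
  universe      = grid
  grid_resource = condor ce.example.infn.it ce.example.infn.it
  executable    = analysis
  output        = analysis.out
  error         = analysis.err
  log           = analysis.log
  queue

The job is then managed by the CE's own schedd, which passes it through Blahpd to whichever local batch system the site runs.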
42 What about other types of work and resources?
- Make data placement jobs first-class citizens
- Manage storage space
- Manage FTP connections
- Bridge protocols
- Manage network connections
  - Across private networks
  - Through firewalls
  - Through shared gateways
43 Customer requests: "Place y = F(x) at L!" Master delivers.
44 Data Placement
- Management of storage space and bulk data transfers plays a key role in the end-to-end performance of an application
- Data Placement (DaP) operations must be treated as first-class jobs and explicitly expressed in the job flow
- The fabric must provide services to manage storage space
- Data Placement schedulers are needed
- Data Placement and computing must be coordinated
- Smooth transition of CPU-I/O interleaving across software layers
- Error handling and garbage collection
45 A simple DAG for y = F(x) -> L
- Allocate space of size(x) + size(y) + size(F) at SE(i)
- Move x from SE(j) to SE(i)
- Place F on CE(k)
- Compute F(x) at CE(k)
- Move y to L
- Release allocated space
SE = Storage Element, CE = Compute Element. (A DAGMan-style sketch of this flow follows.)
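In the DAGMan-style syntax shown on the next slide, this flow might be written roughly as follows; the node names and submit-file names are hypothetical, and the DaP nodes would be executed by Stork rather than Condor:

  # Hypothetical DAG for y = F(x) -> L
  # 1. reserve size(x) + size(y) + size(F) at SE(i)
  DaP AllocateSpace allocate.submit
  # 2. move x from SE(j) to SE(i)
  DaP MoveX move_x.submit
  # 3. place F on CE(k)
  Job PlaceF place_f.submit
  # 4. compute y = F(x) at CE(k)
  Job ComputeY compute.submit
  # 5. move y to L
  DaP MoveY move_y.submit
  # 6. release the allocated space
  DaP Release release.submit
  Parent AllocateSpace Child MoveX
  Parent MoveX Child PlaceF
  Parent PlaceF Child ComputeY
  Parent ComputeY Child MoveY
  Parent MoveY Child Release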
46 The Concept
DAGMan sits between the Condor job queue and the Stork job queue, driving both from a single DAG specification:

  DaP A A.submit
  DaP B B.submit
  Job C C.submit
  ...
  Parent A Child B
  Parent B Child C
  Parent C Child D, E
  ...
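Each DaP node points at a Stork submit file. A very rough sketch of one such file, assuming Stork's ClassAd-style request format and using placeholder URLs:

  [
    dap_type = "transfer";
    src_url  = "gsiftp://se-j.example.org/data/x";
    dest_url = "gsiftp://se-i.example.org/scratch/x";
  ]

Stork queues and retries such requests much as the schedd does for compute jobs, which is what lets DAGMan treat data placement and computation uniformly.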
47 Current Status
- Implemented a first version of a framework that unifies the management of compute and data placement activities
- DaP-aware job flow (DAGMan)
- Stork: a DaP scheduler
- Parrot: a tool that speaks a variety of distributed I/O services
- NeST: a portable, Grid-enabled storage appliance
48 (Architecture diagram: Planner, matchmaker (MM), SchedDs, Stork, StartD, RFT, and GridFTP.)
49 Don't ask "what can the Grid do for me?"; ask "what can I do with a Grid?"