Title: From a grid of One to a grid of Many
1 From a grid of One to a grid of Many
2 (No Transcript)
3 The Condor Project (established 1985)
- Distributed Computing research performed by a team of 40 faculty, full-time staff and students who
  - face software/middleware engineering challenges in a UNIX/Linux/Windows/OS X environment,
  - are involved in national and international collaborations,
  - interact with users in academia and industry,
  - maintain and support a distributed production environment (more than 3200 CPUs at UW),
  - and educate and train students.
- Funding: DoE, NIH, NSF, EU, INTEL, Micron, Microsoft and the UW Graduate School
4 (Diagram of the Condor technology family: NeST, HawkEye, DAGMan, BirdBath, Condor-G, Stork, Quill, MW, Chirp, Condor-C, GCB)
5 Claims for benefits provided by Distributed Processing Systems
- High Availability and Reliability
- High System Performance
- Ease of Modular and Incremental Growth
- Automatic Load and Resource Sharing
- Good Response to Temporary Overloads
- Easy Expansion in Capacity and/or Function
"What is a Distributed Data Processing System?", P.H. Enslow, Computer, January 1978
6 (No Transcript)
7 The Grid: Blueprint for a New Computing Infrastructure, edited by Ian Foster and Carl Kesselman, July 1998, 701 pages.
"The grid promises to fundamentally change the way we think about and use computing. This infrastructure will connect multiple regional and national computational grids, creating a universal source of pervasive and dependable computing power that supports dramatically new classes of applications."
8 "We claim that these mechanisms, although originally developed in the context of a cluster of workstations, are also applicable to computational grids. In addition to the required flexibility of services in these grids, a very important concern is that the system be robust enough to run in production mode continuously even in the face of component failures."
Miron Livny and Rajesh Raman, "High Throughput Resource Management", in The Grid: Blueprint for a New Computing Infrastructure.
9 "Grid computing is a partnership between clients and servers. Grid clients have more responsibilities than traditional clients, and must be equipped with powerful mechanisms for dealing with and recovering from failures, whether they occur in the context of remote execution, work management, or data output. When clients are powerful, servers must accommodate them by using careful protocols."
Douglas Thain and Miron Livny, "Building Reliable Clients and Servers", in The Grid: Blueprint for a New Computing Infrastructure, 2nd edition.
10 (Diagram: the Grid and the WWW)
11 Being a Master
- Customer delegates task(s) to the master, who is responsible for:
  - Obtaining allocation of resources
  - Deploying and managing workers on allocated resources
  - Delegating work units to deployed workers
  - Receiving and processing results
  - Delivering results to customer
(A submit-file sketch of such delegation follows.)
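As a minimal sketch of what delegating work units looks like in Condor terms (the executable and file names here are hypothetical, not from the talk), the customer hands the schedd, acting as master, a batch of work units through a submit description:

  # Hypothetical submit description: delegate 100 work units to workers
  universe   = vanilla
  executable = worker
  arguments  = input.$(Process)
  output     = result.$(Process)
  error      = worker.$(Process).err
  log        = master.log
  queue 100

The schedd then obtains matches for these work units, runs them on the allocated machines, and records their progress in the log so that results can be collected and delivered back to the customer.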
12 Master must be
- Persistent: work and results must be safely recorded on non-volatile media
- Resourceful: delegates DAGs of work to other masters
- Speculative: takes chances and knows how to recover from failure
- Self-aware: knows its own capabilities and limitations
- Obedient: manages work according to plan
- Reliable: can manage large numbers of work items and resource providers
- Portable: can be deployed on the fly to act as a sub-master
13 Master should not do
- Predictions
- Optimal scheduling
- Data mining
- Bidding
- Forecasting
14 Our answer to High Throughput MW Computing on commodity resources
15 The Layers of Condor
(Layer diagram; the visible label is Matchmaker.)
16 Resource Allocation vs. Work Delegation
17 (No Transcript)
18 Resource Allocation
- A limited assignment of the ownership of a resource
- Owner is charged for the allocation regardless of actual consumption
- Owner can allocate the resource to others
- Owner has the right and means to revoke an allocation
- Allocation is governed by an agreement between the consumer and the owner
- Allocation is always a lease
- Trees of allocations can be formed
19 "We present some principles that we believe should apply in any compute resource management system. The first, P1, speaks to the need to avoid resource leaks of all kinds, as might result, for example, from a monitoring system that consumes a nontrivial number of resources.
P1: It must be possible to monitor and control all resources consumed by a CE, whether for computation or management.
Our second principle is a corollary of P1:
P2: A system should incorporate circuit breakers to protect both the compute resource and clients. For example, negotiating with a CE consumes resources. How do we prevent an eager client from turning into a denial of service attack?"
Ian Foster and Miron Livny, "Virtualization and Management of Compute Resources: Principles and Architecture", a working document (February 2005)
20 Work Delegation
- A limited assignment of the responsibility to perform the work
- Delegation involves a definition of these responsibilities
- Responsibilities may be further delegated
- Delegation always consumes resources
- Delegation is always a lease
- Trees of delegations can be formed
21 From Condor to Condor-G to Condor-C
22 (Diagram: the Condor-G job path. Numbered steps connect DAGMan, the schedD, the shadow, the grid manager, and GAHP-Globus on the submit side with Globus and Unicore gateways and a remote startD/starter on the execute side.)
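For concreteness, a hedged sketch of a Condor-G style submission corresponding to this path; the gatekeeper host and jobmanager name are placeholders, not taken from the talk:

  # Hypothetical Condor-G submit description (grid universe, GT2 gatekeeper)
  universe      = grid
  grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
  executable    = simulate
  output        = simulate.out
  error         = simulate.err
  log           = simulate.log
  queue

Handed to condor_submit, such a job is driven by the schedd's grid manager and GAHP, which carry out the numbered protocol steps against the remote gateway.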
23 (Diagram: a PSE or User submits to a local Condor pool, whose SchedD (Condor-G), matchmaker (MM), and C-app forward work to a remote Condor pool with its own MM and C-app.)
24 Downloads per month
(Chart; the labeled series is X86/Windows.)
25 Condor Adoption
26 Condor adoption: Campus grids
27 "RCAC opens up opportunistic access to 11 TFlops for TeraGrid users" (July 21, 2005), Dr. Sebastien Goasguen, Purdue University
WEST LAFAYETTE, Ind. -- The Rosen Center for Advanced Computing (RCAC) at Purdue University has opened up access to 11 teraflops of computing power to the TeraGrid community. Based on a new model known as "community clusters," developed by researchers at RCAC, this new computing resource will be accessible to TeraGrid researchers and educators using Condor. The community clusters currently supported by RCAC include a 1024 Xeon 64-bit (Irwindale) processor cluster, a 194 Opteron 64-bit processor cluster with InfiniBand interconnects, and a 618 Xeon 32-bit processor cluster, for a combined capacity of 11 TFlops.
28 UW Enterprise-Level Grid
- Condor pools at various departments integrated into a campus-wide grid
  - Grid Laboratory of Wisconsin (GLOW)
  - Older private Condor pools at other departments
    - 1000 1-GHz Intel CPUs at CS
    - 100 2-GHz Intel CPUs at Physics
    - ...
- Condor jobs flock from various departments to GLOW (see the configuration sketch below)
- Excellent utilization
  - Especially when the Condor Standard Universe is used
  - Preemption, Checkpointing, Job Migration
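A rough sketch of the flocking side of this setup, assuming the standard Condor flocking knobs and using placeholder host names:

  # Hypothetical condor_config fragment for a departmental pool
  # Jobs that cannot be matched locally may flock to these pools
  FLOCK_TO = glow-cm.example.wisc.edu, physics-cm.example.wisc.edu
  # Each target pool must list this pool in its own FLOCK_FROM

The effect is that idle departmental jobs overflow to GLOW when local machines are busy, which is what drives the high utilization noted above.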
29 Grid Laboratory of Wisconsin
2003 initiative funded by NSF/UW ($1.5M budget)
Six GLOW sites:
- Computational Genomics, Chemistry
- Amanda, IceCube, Physics/Space Science
- High Energy Physics/CMS, Physics
- Materials by Design, Chemical Engineering
- Radiation Therapy, Medical Physics
- Computer Science
GLOW phases 1 and 2 plus non-GLOW funded nodes already have 1000 Xeons and 100 TB of disk
30 GLOW Deployment
- GLOW Phases I and II are commissioned
- CPU
  - 66 nodes each @ ChemE, CS, LMCG, MedPhys, Physics
  - 30 nodes @ IceCube
  - 100 extra nodes @ CS (50 ATLAS + 50 CS)
  - 26 extra nodes @ Physics
  - Total CPU: 1000
- Storage
  - Head nodes @ all sites
  - 45 TB each @ CS and Physics
  - Total storage: 100 TB
- GLOW resources are used at the 100% level
  - Key is to have multiple user groups
- GLOW continues to grow
31 GLOW Usage Since February 2004
- Leftover cycles available for others
- Takes advantage of shadow jobs
- Takes advantage of checkpointing jobs
- Over 7.6 million CPU-hours (865 CPU-years) served!
32 Example Uses
- ATLAS
  - Over 15 million proton collision events simulated, at 10 minutes each
- CMS
  - Over 10 million events simulated in a month; many more events reconstructed and analyzed
- Computational Genomics
  - Prof. Shwartz asserts that GLOW has opened up a new paradigm of work patterns in his group
  - They no longer think about how long a particular computational job will take; they just do it
- Chemical Engineering
  - Students do not know where the computing cycles are coming from; they just do it
33 Condor adoption: Industry
34 "Seeking the massive computing power needed to hedge a portion of its book of annuity business, Hartford Life, a subsidiary of The Hartford Financial Services Group (Hartford; $18.7 billion in 2003 revenues), has implemented a grid computing solution based on the University of Wisconsin's (Madison, Wis.) Condor open source software. Hartford Life's SVP and CIO Vittorio Severino notes that the move was a matter of necessity. 'It was the necessity to hedge the book,' owing in turn to a tight reinsurance market that is driving the need for an alternative risk management strategy, he says. The challenge was to support the risk generated by clients opting for income protection benefit riders on popular annuity products."
35 Resource: How did you complete this project, on your own or with a vendor's help?
Severino: We completed this project very much on our own. As a matter of fact, it is such a new technology in the insurance industry that others were calling us for assistance on how to do it. So it was interesting, because we were breaking new ground and vendors really couldn't help us. We eventually chose grid computing software from the University of Wisconsin called Condor; it is open source software. We chose the Condor software because it is one of the oldest grid computing software tools around, so it is mature. We have a tremendous amount of confidence in the Condor software.
36 Condor at Micron
37 (No Transcript)
38 Condor adoption: National and International Grids
39 U.S. Trillium Grid Partnership
- Trillium = PPDG + GriPhyN + iVDGL
  - Particle Physics Data Grid: $12M (DOE) (1999-2004)
  - GriPhyN: $12M (NSF) (2000-2005)
  - iVDGL: $14M (NSF) (2001-2006)
- Basic composition (150 people)
  - PPDG: 4 universities, 6 labs
  - GriPhyN: 12 universities, SDSC, 3 labs
  - iVDGL: 18 universities, SDSC, 4 labs, foreign partners
  - Experiments: BaBar, D0, STAR, JLab, CMS, ATLAS, LIGO, SDSS/NVO
- Complementarity of projects
  - GriPhyN: CS research, Virtual Data Toolkit (VDT) development
  - PPDG: end-to-end Grid services, monitoring, analysis
  - iVDGL: Grid laboratory deployment using VDT
  - Experiments provide frontier challenges
- Unified entity when collaborating internationally
40 Grid2003: An Operational National Grid
- 28 sites (universities and national labs)
- 2800 CPUs, 400-1300 jobs
- Running since October 2003
- Applications in HEP, LIGO, SDSS, Genomics
(Site map; labeled sites include Korea.)
http://www.ivdgl.org/grid2003
41 The current gLite CE
- Collaboration of INFN, Univ. of Chicago, Univ. of Wisconsin-Madison, and the EGEE security activity (JRA3)
(Diagram: a job is submitted to the CE via Condor-C; CEMon publishes notifications; Blahpd hands the job to the local batch system: LSF, PBS/Torque, or Condor.)
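For orientation, a hedged sketch of how a client might hand a job to such a CE through Condor-C; the remote schedd and pool names are placeholders, not taken from the talk:

  # Hypothetical Condor-C style submission to the CE's remote schedd
  universe      = grid
  grid_resource = condor ce.example.infn.it ce.example.infn.it
  executable    = analysis
  output        = analysis.out
  error         = analysis.err
  log           = analysis.log
  queue

The job is then managed by the CE's own schedd, which passes it through Blahpd to whichever local batch system the site runs.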
42 What about other types of work and resources?
- Make data placement jobs first-class citizens
- Manage storage space
- Manage FTP connections
- Bridge protocols
- Manage network connections
  - Across private networks
  - Through firewalls
  - Through shared gateways
43 Customer requests: "Place y = F(x) at L!" Master delivers.
44 Data Placement
- Management of storage space and bulk data transfers plays a key role in the end-to-end performance of an application
- Data Placement (DaP) operations must be treated as first-class jobs and explicitly expressed in the job flow
- The fabric must provide services to manage storage space
- Data Placement schedulers are needed
- Data Placement and computing must be coordinated
- Smooth transition of CPU-I/O interleaving across software layers
- Error handling and garbage collection
45 A simple DAG for y = F(x) -> L
- Allocate space of size(x) + size(y) + size(F) at SE(i)
- Move x from SE(j) to SE(i)
- Place F on CE(k)
- Compute F(x) at CE(k)
- Move y to L
- Release allocated space
SE = Storage Element, CE = Compute Element. (A DAGMan-style sketch of this flow follows.)
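In the DAGMan-style syntax shown on the next slide, this flow might be written roughly as follows; the node names and submit-file names are hypothetical, and the DaP nodes would be executed by Stork rather than Condor:

  # Hypothetical DAG for y = F(x) -> L
  # 1. reserve size(x) + size(y) + size(F) at SE(i)
  DaP AllocateSpace allocate.submit
  # 2. move x from SE(j) to SE(i)
  DaP MoveX move_x.submit
  # 3. place F on CE(k)
  Job PlaceF place_f.submit
  # 4. compute y = F(x) at CE(k)
  Job ComputeY compute.submit
  # 5. move y to L
  DaP MoveY move_y.submit
  # 6. release the allocated space
  DaP Release release.submit
  Parent AllocateSpace Child MoveX
  Parent MoveX Child PlaceF
  Parent PlaceF Child ComputeY
  Parent ComputeY Child MoveY
  Parent MoveY Child Release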
46 The Concept
DAGMan sits between the Condor job queue and the Stork job queue, driving both from a single DAG specification:

  DaP A A.submit
  DaP B B.submit
  Job C C.submit
  ...
  Parent A Child B
  Parent B Child C
  Parent C Child D, E
  ...
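Each DaP node points at a Stork submit file. A very rough sketch of one such file, assuming Stork's ClassAd-style request format and using placeholder URLs:

  [
    dap_type = "transfer";
    src_url  = "gsiftp://se-j.example.org/data/x";
    dest_url = "gsiftp://se-i.example.org/scratch/x";
  ]

Stork queues and retries such requests much as the schedd does for compute jobs, which is what lets DAGMan treat data placement and computation uniformly.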
47 Current Status
- Implemented a first version of a framework that unifies the management of compute and data placement activities
- DaP-aware job flow (DAGMan)
- Stork: a DaP scheduler
- Parrot: a tool that speaks a variety of distributed I/O services
- NeST: a portable, Grid-enabled storage appliance
48 (Architecture diagram: Planner, matchmaker (MM), SchedDs, Stork, StartD, RFT, and GridFTP.)
49 Don't ask "what can the Grid do for me?"; ask "what can I do with a Grid?"