Transcript and Presenter's Notes

Title: Form a grid of One to a grid of Many


1
Form a grid of One to a grid of Many
2
(No Transcript)
3
The Condor Project (Established 1985)
  • Distributed Computing research performed by a
    team of 40 faculty, full-time staff, and students
    who
  • face software/middleware engineering challenges
    in a UNIX/Linux/Windows/OS X environment,
  • are involved in national and international
    collaborations,
  • interact with users in academia and industry,
  • maintain and support a distributed production
    environment (more than 3200 CPUs at UW),
  • and educate and train students.
  • Funding: DoE, NIH, NSF, EU, Intel, Micron,
    Microsoft, and the UW Graduate School

4
NeST
HawkEye
DAGMan
BirdBath
Condor-G
Stork
Quill
MW
Chirp
Condor-C
GCB
5
Claims for benefits provided by Distributed
Processing Systems
  • High Availability and Reliability
  • High System Performance
  • Ease of Modular and Incremental Growth
  • Automatic Load and Resource Sharing
  • Good Response to Temporary Overloads
  • Easy Expansion in Capacity and/or Function

"What is a Distributed Data Processing System?",
P.H. Enslow, Computer, January 1978
6
(No Transcript)
7
The Grid: Blueprint for a New Computing
Infrastructure. Edited by Ian Foster and Carl
Kesselman, July 1998, 701 pages.
The grid promises to fundamentally change the way
we think about and use computing. This
infrastructure will connect multiple regional and
national computational grids, creating a
universal source of pervasive and dependable
computing power that supports dramatically new
classes of applications.
8
  • We claim that these mechanisms, although
    originally developed in the context of a cluster
    of workstations, are also applicable to
    computational grids. In addition to the required
    flexibility of services in these grids, a very
    important concern is that the system be robust
    enough to run in production mode continuously
    even in the face of component failures.

Miron Livny and Rajesh Raman, "High Throughput
Resource Management", in The Grid: Blueprint for
a New Computing Infrastructure.
9
  • Grid computing is a partnership between
    clients and servers. Grid clients have more
    responsibilities than traditional clients, and
    must be equipped with powerful mechanisms for
    dealing with and recovering from failures,
    whether they occur in the context of remote
    execution, work management, or data output. When
    clients are powerful, servers must accommodate
    them by using careful protocols.

Douglas Thain and Miron Livny, "Building Reliable
Clients and Servers", in The Grid: Blueprint for
a New Computing Infrastructure, 2nd edition
10
Grid
WWW
11
Being a Master
  • Customer delegates task(s) to the master, who
    is responsible for:
  • Obtaining allocation of resources
  • Deploying and managing workers on allocated
    resources
  • Delegating work units to deployed workers
  • Receiving and processing results
  • Delivering results to customer

12
Master must be
  • Persistent: work and results must be safely
    recorded on non-volatile media
  • Resourceful: delegates DAGs of work to other
    masters (see the sketch below)
  • Speculative: takes chances and knows how to
    recover from failure
  • Self-aware: knows its own capabilities and
    limitations
  • Obedient: manages work according to plan
  • Reliable: can manage large numbers of work
    items and resource providers
  • Portable: can be deployed on the fly to act as
    a sub-master

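To make "delegates DAGs of work to other masters"
concrete, here is a minimal sketch in DAGMan syntax of a
top-level master that hands a whole sub-DAG to a second
DAGMan instance acting as a sub-master. The SUBDAG EXTERNAL
keyword is present-day HTCondor DAGMan syntax, and all node
and file names are hypothetical.

# top.dag -- illustrative only; node and file names are hypothetical
# PREP is an ordinary compute job described by prep.submit
JOB PREP prep.submit
# ANALYSIS is itself a complete DAG; DAGMan starts a second
# condor_dagman instance that acts as a sub-master for analysis.dag
SUBDAG EXTERNAL ANALYSIS analysis.dag
# COLLECT gathers results once the sub-DAG has finished
JOB COLLECT collect.submit
PARENT PREP CHILD ANALYSIS
PARENT ANALYSIS CHILD COLLECT

Because the sub-master is just another DAGMan process, it
can be deployed on the fly, and its progress is recorded in
log and rescue files, so delegated work survives a restart.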
13
Master should not do
  • Predictions
  • Optimal scheduling
  • Data mining
  • Bidding
  • Forecasting

14
Our answer to High Throughput MW Computing on
commodity resources
15
The Layers of Condor
Matchmaker
16
Resource Allocation vs. Work Delegation
17
(No Transcript)
18
Resource Allocation
  • A limited assignment of the ownership of a
    resource
  • Owner is charged for allocation regardless of
    actual consumption
  • Owner can allocate the resource to others
  • Owner has the right and means to revoke an
    allocation
  • Allocation is governed by an agreement between
    the consumer and the owner
  • Allocation is always a lease
  • Trees of allocations can be formed

19
  • We present some principles that we believe
    should apply in any compute resource management
    system. The first, P1, speaks to the need to
    avoid resource leaks of all kinds, as might
    result, for example, from a monitoring system
    that consumes a nontrivial number of resources.
  • P1 - It must be possible to monitor and control
    all resources consumed by a CE, whether for
    computation or management.
  • Our second principle is a corollary of P1:
  • P2 - A system should incorporate circuit breakers
    to protect both the compute resource and clients.
    For example, negotiating with a CE consumes
    resources. How do we prevent an eager client from
    turning into a denial of service attack?

Ian Foster and Miron Livny, "Virtualization and
Management of Compute Resources: Principles and
Architecture", a working document (February
2005)
20
Work Delegation
  • A limited assignment of the responsibility to
    perform the work
  • Delegation involves a definition of these
    responsibilities
  • Responsibilities may be further delegated
  • Delegation always consumes resources
  • Delegation is always a lease
  • Trees of delegations can be formed

21
From Condor to Condor-G to Condor-C
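As a concrete illustration of this progression, below is a
minimal sketch of a grid-universe submit description file
of the kind Condor-G and Condor-C use to delegate a job to
a remote resource. All host and file names are
hypothetical; the active grid_resource line targets a
Globus GT2 gatekeeper (Condor-G), and the commented-out
alternative targets a remote Condor schedd (Condor-C).

# job.submit -- illustrative only; all host and file names are hypothetical
universe                = grid

# Condor-G: delegate to a Globus GT2 gatekeeper
grid_resource           = gt2 gatekeeper.example.edu/jobmanager-pbs

# Condor-C alternative: delegate to a remote Condor schedd
# grid_resource         = condor schedd.example.edu collector.example.edu

executable              = analyze
arguments               = input.dat
output                  = analyze.out
error                   = analyze.err
log                     = analyze.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue

Either way, the job remains under the control of the local
schedd and its grid manager, which is what lets the submit
machine act as the reliable, failure-aware client described
in the earlier quote.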
22
(Diagram, with numbered steps 1-6, of job delegation among
DAGMan, schedD, shadow, grid manager, GAHP-Globus, Globus,
Unicore, startD, and starter.)
23
(Diagram: PSE or User, a local Condor pool with matchmaker
(MM) and C-app, a SchedD (Condor-G), and a remote Condor
pool with its own MM and C-app.)
24
Downloads per month
X86/Windows
25
Condor Adoption
26
Condor adoption: Campus grids
27
  • RCAC opens up opportunistic access to 11 TFlops
    for TeraGrid users (July 21, 2005) - Dr. Sebastien
    Goasguen, Purdue University
  • WEST LAFAYETTE, Ind. -- Rosen Center for Advanced
    Computing (RCAC) at Purdue University has opened
    up access to 11 teraflops of computing power to
    the TeraGrid community. Based on a new model
    known as "community clusters," developed by
    researchers at RCAC, this new computing resource
    will be accessible to TeraGrid researchers and
    educators using Condor. The community clusters
    currently supported by RCAC include a 1024 Xeon
    64-bit (Irwindale) processor cluster, a 194
    Opteron 64-bit processor cluster with InfiniBand
    interconnects, and a 618 Xeon 32-bit processor
    cluster, for a combined capacity of 11 TFlops.

28
UW Enterprise Level Grid
  • Condor pools at various departments integrated
    into a campus wide grid
  • Grid Laboratory of Wisconsin (GLOW)
  • Older private Condor pools at other departments
  • 1000 1GHz Intel CPUs at CS
  • 100 2GHz Intel CPUs at Physics
  • Condor jobs flock from various departments to
    GLOW
  • Excellent utilization
  • Especially when the Condor Standard Universe is
    used (see the sketch below)
  • Preemption, Checkpointing, Job Migration

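As a sketch of how the Standard Universe features above are
used in practice (program and file names are hypothetical),
the application is first relinked with condor_compile so
Condor can checkpoint it, and the submit description then
simply requests the standard universe:

# Relink the application against Condor's checkpointing library:
#   condor_compile gcc -o sim sim.c

# sim.submit -- minimal Standard Universe submit description (illustrative)
universe   = standard
executable = sim
arguments  = run42.cfg
output     = sim.out
error      = sim.err
log        = sim.log
queue

A job submitted this way can be preempted on one machine,
checkpointed, and resumed on another, which is what makes
opportunistic use of flocked GLOW cycles efficient.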
29
Grid Laboratory of Wisconsin
2003 Initiative funded by NSF/UW ($1.5M budget)
Six GLOW Sites
  • Computational Genomics, Chemistry
  • Amanda, Ice-cube, Physics/Space Science
  • High Energy Physics/CMS, Physics
  • Materials by Design, Chemical Engineering
  • Radiation Therapy, Medical Physics
  • Computer Science

GLOW phases 1 and 2, plus non-GLOW funded nodes, already
have 1000 Xeons and 100 TB of disk
30
GLOW Deployment
  • GLOW Phase-I and II are Commissioned
  • CPU
  • 66 nodes each @ ChemE, CS, LMCG, MedPhys, Physics
  • 30 nodes @ IceCube
  • 100 extra nodes @ CS (50 ATLAS + 50 CS)
  • 26 extra nodes @ Physics
  • Total CPU: 1000
  • Storage
  • Head nodes @ all sites
  • 45 TB each @ CS and Physics
  • Total storage: 100 TB
  • GLOW Resources are used at the 100% level
  • Key is to have multiple user groups
  • GLOW continues to grow

31
GLOW Usage Since February 2004
Leftover cycles available for Others
Takes advantage of shadow jobs
Takes advantage of checkpointing jobs
Over 7.6 million CPU-Hours (865 CPU-Years) served!
32
Example Uses
  • ATLAS
  • Over 15 Million proton collision events simulated
    at 10 minutes each
  • CMS
  • Over 10 Million events simulated in a month -
    many more events reconstructed and analyzed
  • Computational Genomics
  • Prof. Shwartz asserts that GLOW has opened up a
    new paradigm of work patterns in his group
  • They no longer think about how long a particular
    computational job will take - they just do it
  • Chemical Engineering
  • Students do not know where the computing cycles
    are coming from - they just do it

33
Condor adoption: Industry
34
  • Seeking the massive computing power needed to
    hedge a portion of its book of annuity business,
    Hartford Life, a subsidiary of The Hartford
    Financial Services Group (Hartford; $18.7 billion
    in 2003 revenues), has implemented a grid
    computing solution based on the University of
    Wisconsin's (Madison, Wis.) Condor open source
    software. Hartford Life's SVP and CIO Vittorio
    Severino notes that the move was a matter of
    necessity. "It was the necessity to hedge the
    book," owing in turn to a tight reinsurance
    market that is driving the need for an
    alternative risk management strategy, he says.
    The challenge was to support the risk generated
    by clients opting for income protection benefit
    riders on popular annuity products.

35
  • Resource: How did you complete this project - on
    your own or with a vendor's help? Severino: We
    completed this project very much on our own. As a
    matter of fact, it is such a new technology in the
    insurance industry that others were calling us
    for assistance on how to do it. So it was
    interesting because we were breaking new ground
    and vendors really couldn't help us. We
    eventually chose grid computing software from the
    University of Wisconsin called Condor; it is open
    source software. We chose the Condor software
    because it is one of the oldest grid computing
    software tools around, so it is mature. We have a
    tremendous amount of confidence in the Condor
    software.

36
Condor at Micron
37
(No Transcript)
38
Condor adoption: National and International
Grids
39
U.S. Trillium Grid Partnership
  • Trillium = PPDG + GriPhyN + iVDGL
  • Particle Physics Data Grid: $12M (DOE) (1999-2004)
  • GriPhyN: $12M (NSF) (2000-2005)
  • iVDGL: $14M (NSF) (2001-2006)
  • Basic composition (~150 people)
  • PPDG: 4 universities, 6 labs
  • GriPhyN: 12 universities, SDSC, 3 labs
  • iVDGL: 18 universities, SDSC, 4 labs, foreign
    partners
  • Experiments: BaBar, D0, STAR, JLab, CMS, ATLAS,
    LIGO, SDSS/NVO
  • Complementarity of projects
  • GriPhyN: CS research, Virtual Data Toolkit (VDT)
    development
  • PPDG: End-to-end Grid services, monitoring,
    analysis
  • iVDGL: Grid laboratory deployment using VDT
  • Experiments provide frontier challenges
  • Unified entity when collaborating internationally

40
  • Grid2003: An Operational National Grid
  • 28 sites: universities and national labs
  • 2800 CPUs, 400-1300 jobs
  • Running since October 2003
  • Applications in HEP, LIGO, SDSS, Genomics

Korea
http://www.ivdgl.org/grid2003
41
The current gLite CE
  • Collaboration of INFN, Univ. of Chicago, Univ. of
    Wisconsin-Madison, and the EGEE security activity
    (JRA3)

(Diagram: Submit job -> CE, with CEMon notifications and
Condor-C handing the job to Blahpd, which submits it to the
local batch system: LSF, PBS/Torque, or Condor.)
42
What about other types of work and resources?
  • Make data placement jobs first class citizens
    (see the Stork sketch below)
  • Manage storage space
  • Manage FTP connections
  • Bridge protocols
  • Manage network connections
  • Across private networks
  • Through firewalls
  • Through shared gateways

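To illustrate what a first class data placement job looks
like, here is a hedged sketch of a Stork submit description
for a single transfer. Stork submit files are
ClassAd-style; the attribute names below follow Stork's
transfer jobs as commonly documented, and the URLs are
hypothetical.

// transfer.stork -- illustrative sketch of a Stork data placement job
[
  dap_type = "transfer";                        // the DaP operation
  src_url  = "gsiftp://se.example.edu/data/x";  // source (hypothetical)
  dest_url = "file:///scratch/data/x";          // destination (hypothetical)
]

Such a job is handed to Stork (for example with
stork_submit) rather than to the Condor schedd, so that
storage space and FTP connections can be managed by a
scheduler that understands data placement.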
43
Customer requests: "Place y = F(x) at L!"
Master delivers.
44
Data Placement
  • Management of storage space and bulk data
    transfers play a key role in the end-to-end
    performance of an application.
  • Data Placement (DaP) operations must be treated
    as first class jobs and explicitly expressed in
    the job flow
  • Fabric must provide services to manage storage
    space
  • Data Placement schedulers are needed.
  • Data Placement and computing must be coordinated
  • Smooth transition of CPU-I/O interleaving across
    software layers
  • Error handling and garbage collection

45
A simple DAG for y = F(x) → L (see the DAGMan sketch below)
  • Allocate (size(x) + size(y) + size(F)) at SE(i)
  • Move x from SE(j) to SE(i)
  • Place F on CE(k)
  • Compute F(x) at CE(k)
  • Move y to L
  • Release allocated space

Storage Element (SE); Compute Element (CE)
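A hedged sketch of how these six steps might be written as
a DAGMan workflow in which the data placement steps are
Stork jobs and the computation is an ordinary Condor job.
The DaP keyword follows the notation of the next slide, and
all node and file names are hypothetical.

# f_of_x.dag -- a hedged sketch of the y = F(x) -> L workflow.
# DaP marks Stork data placement nodes (notation of the next slide);
# Job marks an ordinary Condor compute node. File names are hypothetical.

# 1. Allocate size(x)+size(y)+size(F) at SE(i)
DaP ALLOCATE allocate.stork
# 2. Move x from SE(j) to SE(i)
DaP MOVE_X move_x.stork
# 3. Place F on CE(k)
DaP PLACE_F place_f.stork
# 4. Compute y = F(x) at CE(k)
Job COMPUTE compute.submit
# 5. Move y to L
DaP MOVE_Y move_y.stork
# 6. Release the allocated space
DaP RELEASE release.stork

Parent ALLOCATE child MOVE_X PLACE_F
Parent MOVE_X PLACE_F child COMPUTE
Parent COMPUTE child MOVE_Y
Parent MOVE_Y child RELEASE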
46
The Concept
(Diagram: a DAG specification drives DAGMan, which places
data placement nodes in the Stork job queue and compute
nodes in the Condor job queue.)

DAG specification:
DaP A A.submit
DaP B B.submit
Job C C.submit
..
Parent A child B
Parent B child C
Parent C child D, E
..
47
Current Status
  • Implemented a first version of a framework that
    unifies the management of compute and data
    placement activities.
  • DaP-aware job flow (DAGMan).
  • Stork: a DaP scheduler
  • Parrot: a tool that speaks a variety of
    distributed I/O services
  • NeST: a portable, Grid-enabled storage appliance

48
(Diagram of components: Planner, MM, SchedD, Stork,
StartD, a second SchedD, RFT, and GridFTP.)
49
Don't ask "what can the Grid do for me?"
Ask "what can I do with a Grid?"