Title: The Computational Grid: Aggregating Performance and Enhanced Capability from Federated Resources
1- The Computational Grid: Aggregating Performance and Enhanced Capability from Federated Resources
- Rich Wolski
- University of California, Santa Barbara
2The Goal
- To provide a seamless, ubiquitous, and high-performance computing environment using a heterogeneous collection of networked computers.
- But there won't be one big, uniform system
- Resources must be able to come and go dynamically
- The base system software supported by each resource must remain inviolate
- Multiple languages and programming paradigms must be supported
- The environment must be secure
- Programs must run fast
- For distributed computing: the Holy Grail
3For Example: Rich's Computational World
umich.edu
wisc.edu
ameslab.gov
osc.edu
harvard.edu
wellesley.edu
anl.gov
ncsa.edu
ksu.edu
uiuc.edu
lbl.gov
indiana.edu
virginia.edu
ncni.net
utk.edu
ucsb.edu
titech.jp
isi.edu
vu.nl
csun.edu
caltech.edu
utexas.edu
ucsd.edu
npaci.edu
rice.edu
4Zoom In
(Figure labels: CT94, SDSC, IBM SP, HPSS, Desktops, Sun, T-3E, C, The Internet, UCSB)
5The Landscape
- Heterogeneous
- Processors: X86, SPARC, RS6000, Alpha, MIPS, PowerPC, Cray
- Networks: GigE, Myrinet, 100baseT, ATM
- OS: Linux, Solaris, AIX, Unicos, OSX, NT, Windows
- Dynamically changing
- Completely dedicated access is impossible => contention
- Failures, upgrades, reconfigurations, etc.
- Federated
- Local administrative policies take precedence
- Performance?
6The Computational Grid
- Vision: Application programs plug into the system to draw computational power from a dynamically changing pool of resources.
- Electrical Power Grid analogy
- Power generation facilities = computers, networks, storage devices, palm tops, databases, libraries, etc.
- Household appliances = application programs
- Scale to national and international levels
- Grid users (both power producers and application consumers) can join and leave the Grid at will.
7The Shape of Things to Come?
- Grid Research Adventures
- Infrastructure
- Grid Programming
- State of the Grid Art
- What do Grids look like today?
- Interesting developments, trends, and prognostications of the Grid future
8Fundamental Questions
- How do we build it?
- software infrastructures
- policies
- maintenance, support, accounting, etc.
- How do we program it?
- concurrency, synchronization
- heterogeneity
- dynamism
- How do we use it for performance?
- metrics
- models
9General Approach
- Combine results from distributed operating systems, parallel computing, and internet computing research domains
- Remote procedure call / remote invocation
- Public/private key encryption
- Domain decomposition
- Location-independent naming
- Engineering strategy: implement Grid software infrastructure as middleware
- Allows resource owners to maintain ultimate control locally over the resources they commit to the Grid
- Permits new resources to be incorporated easily
- Aids in developing a user community
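As one illustration of how domain decomposition meets heterogeneity, here is a minimal sketch that splits a one-dimensional index range into contiguous blocks sized in proportion to each host's relative speed. The function name `decompose` and the speed figures are hypothetical and do not come from the talk or from any Grid toolkit.

```python
def decompose(n_items, speeds):
    """Split range(n_items) into contiguous (start, stop) blocks.

    Block sizes are proportional to the relative speeds given, so a host
    that is twice as fast receives roughly twice as many items.
    """
    if n_items < 0 or not speeds or any(s <= 0 for s in speeds):
        raise ValueError("need a non-negative item count and positive speeds")
    total = sum(speeds)
    bounds = [0]
    running = 0
    for s in speeds:
        running += s
        # Round the cumulative share so the blocks tile the range exactly.
        bounds.append(round(n_items * running / total))
    return [(bounds[i], bounds[i + 1]) for i in range(len(speeds))]


# Hypothetical example: two equal hosts and one host twice as fast.
blocks = decompose(100, [1, 1, 2])
```

Using cumulative shares rather than rounding each block independently keeps the blocks contiguous and guarantees that they cover every item.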
10Middleware Research Efforts
- Globus (I. Foster and C. Kesselman)
- Collection of independent remote execution and naming services
- Legion (A. Grimshaw)
- Distributed object-oriented programming
- NetSolve (J. Dongarra)
- Multi-language brokered RPC
- Condor (M. Livny)
- Idle cycle harvesting
- NINF (S. Matsuoka)
- Java-based brokered RPC
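To make the "brokered RPC" idea concrete, the following sketch runs entirely in one process: servers register the routines they offer with a broker, and the broker forwards each call to the least-loaded server that offers the routine. The classes `Server` and `Broker`, the load figures, and the host names are all hypothetical; this is not the NetSolve or NINF interface.

```python
class Server:
    def __init__(self, name, load, functions):
        self.name = name
        self.load = load            # hypothetical load estimate, lower is better
        self.functions = functions  # routine name -> callable


class Broker:
    def __init__(self):
        self.servers = []

    def register(self, server):
        self.servers.append(server)

    def call(self, routine, *args):
        """Run `routine` on the least-loaded server that offers it."""
        candidates = [s for s in self.servers if routine in s.functions]
        if not candidates:
            raise LookupError(f"no server offers {routine!r}")
        best = min(candidates, key=lambda s: s.load)
        return best.name, best.functions[routine](*args)


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


broker = Broker()
broker.register(Server("host-a", load=0.9, functions={"dot": dot}))
broker.register(Server("host-b", load=0.2, functions={"dot": dot}))

# The client names only the routine; the broker decides where it runs.
chosen_host, result = broker.call("dot", [1, 2, 3], [4, 5, 6])
```

In a real system the call would cross the network and the load figures would come from monitoring; the sketch only shows the division of labor between client, broker, and server.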
11Commonalities
- Runtime systems
- All current infrastructures are implemented as a set of run-time services
- Resource is an abstract notion
- Anything with an API is a resource: operating systems, libraries, databases, hardware devices
- Support for multiple programming languages
- legacy codes
- performance
12Infrastructure Concerns
- Leverage emerging distributed technologies
- Buy it rather than build it
- Network infrastructure
- Web services
- Complexity
- Performance
- Installation, configuration, fault-diagnosis
- Mean time to reconfiguration is probably measured in minutes
- Bringing the Grid down is not an option
- Who operates it?
13NPACI
- National Partnership for Advanced Computational Infrastructure
- high-performance computing for the scientific research community
- Goal: Build a production-quality Grid
- Leverage emerging standards
- Harden and deploy mature Grid technologies
- Packaging, configuration, deployment, diagnostics, accounting
- Deliver the Grid to scientists
14PACI-sized Questions
- If the national infrastructure is managed as a Grid...
- What resources are attached to it?
- X86 is certainly plentiful
- Earth Simulator is certainly expensive
- Multithreading is certainly attractive
- What is the right blend?
- How are they managed?
- How long will you wait for your job to get through the queue?
- Accounting
- What are the units of Grid allocation?
15Grid Programming
- Two models
- Manual: Application is explicitly coded to be a Grid application
- Automatic: Grid software "Gridifies" a parallel or sequential program
- Start with the simpler approach: build programs that can adapt to changing Grid conditions
- What are the current Grid conditions?
- Need a way to assess the available performance
- For example
- What is the speed of your ethernet?
16Ethernet Doesn't Have a Speed -- it Has Many
(Figure: TCP/IP throughput, mb/s)
17More Importantly
- It is not what the speed was, but what the speed will be that matters
- Performance prediction
- Analytical models remain elusive
- Statistical models are difficult
- Whatever models are used, the prediction itself needs to be fast
18The Network Weather Service
- On-line Grid system that
- monitors the performance that is available from distributed resources
- forecasts future performance levels using fast statistical techniques
- delivers forecasts on-the-fly dynamically
- Uses adaptive, non-parametric time series analysis models to make short-term predictions
- Records and reports forecasting error with each prediction stream
- Runs as any user (no privileged access required)
- Scalable and end-to-end
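The adaptive forecasting idea can be sketched as follows: run several cheap predictors side by side, score each against every new measurement, and report the forecast of whichever has the lowest accumulated error so far, together with that error. This is a simplified illustration only. The class `AdaptiveForecaster`, the three predictors, and the window size are assumptions made for the example and are not the NWS implementation or its API.

```python
from statistics import mean, median


class AdaptiveForecaster:
    """Pick, at each step, the simple predictor with the lowest squared error so far."""

    def __init__(self, window=5):
        self.window = window
        self.history = []
        self.predictors = {
            "last_value": lambda h: h[-1],
            "sliding_mean": lambda h: mean(h[-self.window:]),
            "sliding_median": lambda h: median(h[-self.window:]),
        }
        self.sq_err = {name: 0.0 for name in self.predictors}
        self.count = 0  # number of predictions scored so far

    def update(self, measurement):
        # Score what each predictor would have said before seeing this value.
        if self.history:
            for name, predict in self.predictors.items():
                err = predict(self.history) - measurement
                self.sq_err[name] += err * err
            self.count += 1
        self.history.append(measurement)

    def forecast(self):
        """Return (predictor name, forecast, mean squared error of that predictor)."""
        if not self.history:
            raise ValueError("no measurements yet")
        best = min(self.sq_err, key=self.sq_err.get)
        mse = self.sq_err[best] / self.count if self.count else None
        return best, self.predictors[best](self.history), mse


# Hypothetical throughput readings that alternate between two levels.
forecaster = AdaptiveForecaster(window=5)
for reading in [0, 10, 0, 10, 0, 10, 0]:
    forecaster.update(reading)
name, value, mse = forecaster.forecast()
```

Each update costs a fixed, small amount of work per predictor, which is in keeping with the requirement that the prediction itself be fast, and the error returned with every forecast lets the caller judge how far to trust it.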
19NWS Predictions and Errors
Red: NWS Prediction, Black: Data
MSE 73.3, FED 8.5 mb/s, MAE 5.8 mb/s
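For reference, here is a minimal sketch of how mean squared error (MSE) and mean absolute error (MAE) can be computed from a series of predictions and the measurements that followed. The function name `forecast_errors` and the sample values are hypothetical. The slide does not define FED; its value of 8.5 is close to the square root of the reported MSE (about 8.56), but that reading is only a guess, so the sketch reports the root of the MSE under the neutral name `rmse`.

```python
from math import sqrt


def forecast_errors(predictions, measurements):
    """Return MSE, its square root, and MAE for paired predictions and measurements."""
    if len(predictions) != len(measurements) or not predictions:
        raise ValueError("need two equal-length, non-empty series")
    errors = [p - m for p, m in zip(predictions, measurements)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": sqrt(mse), "mae": mae}


# Hypothetical values, not data from the talk.
stats = forecast_errors([12, 8], [10, 10])
```

MSE carries squared units and punishes large misses heavily, while MAE stays in the units of the measurement itself, which is why the two are usually reported together.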
20Clusters Too
MSE 4089, FED 63 mb/s, MAE 56 mb/s
21Many Challenges, No Waiting
- On-line predictions
- Need it better, faster, cheaper, and more accurate
- Adaptive programming
- Even if predictions are there, they will have errors
- Performance fluctuates at machine speeds, not human speeds
- Which resource to use? When?
- Can programmers really manage a fluctuating abstract machine?
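One way a program might answer "which resource to use?" when every prediction carries an error is to discount each forecast by its error before comparing. The sketch below is an assumption-laden illustration: the function `pick_resource`, the `caution` parameter, and the example rates are hypothetical and do not describe any particular Grid scheduler.

```python
def pick_resource(work, forecasts, caution=1.0):
    """Choose the resource with the shortest conservatively estimated run time.

    forecasts maps a resource name to (predicted_rate, error_estimate), both
    in work units per second. Each rate is reduced by caution * error before
    use, and resources whose discounted rate is not positive are skipped.
    """
    best_name, best_time = None, float("inf")
    for name, (rate, err) in forecasts.items():
        conservative = rate - caution * err
        if conservative <= 0:
            continue
        estimate = work / conservative
        if estimate < best_time:
            best_name, best_time = name, estimate
    if best_name is None:
        raise ValueError("no resource has a positive conservative rate")
    return best_name, best_time


# Hypothetical forecasts: one fast but erratic resource, one slower but steady.
forecasts = {"fast_but_noisy": (100.0, 60.0), "steady": (70.0, 5.0)}
cautious_choice = pick_resource(1300, forecasts, caution=1.0)
optimistic_choice = pick_resource(1300, forecasts, caution=0.0)
```

With the error taken into account the steady resource wins, and with the error ignored the nominally faster one wins, which shows why the forecast error matters as much as the forecast.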
22GrADS
- Grid Application Development Software (GrADS) Project (K. Kennedy, PI)
- Investigates Grid programmability
- Soup-to-nuts integrated approach
- Compilers, Debuggers, libraries, etc.
- Automatic Resource Control strategies
- Selection and Scheduling
- Resource economies (stability)
- Performance Prediction and Monitoring
- Applications and resources
- Effective Grid simulation
- Builds upon middleware successes
- Tested with real applications
23Four Observations
- The performance of the Grid middleware and services matters
- Grid fabric must scale even if the individual applications do not
- Adaptivity is critical
- So far, only short-term performance predictions are possible
- Both application and system must adapt on the same time scale
- Extracting performance is really, really hard
- Things happen at machine speeds
- Complexity is a killer
- We need more compilation technology
24Grid Compilers
- Adaptive compilation
- Compiler and program preparation environment needs to manage complexity
- The machine for which the compiler is optimizing is changing dynamically
- Challenges
- Performance of the compiler is important
- Legacy codes
- Security?
- GrADS has broken ground, but there is much more to do
25Grid Research Challenges
- Four foci characterize Grid problems
- Heterogeneity
- Dynamism
- Federalism
- Performance
- Just building the infrastructure makes research questions out of previously solved problems
- Installation
- Configuration
- Accounting
- Grid programming is extremely complex
- New programming technologies
26Okay, so where are we now?
27Rational Exuberance
28For Example -- TeraGrid
- Joint effort between
- San Diego Supercomputer Center (SDSC)
- National Center for Supercomputing Applications (NCSA)
- Argonne National Laboratory (ANL)
- Center for Advanced Computing Research (CACR)
- Stats
- 13.6 Teraflops (peak)
- 600 Terabytes on-line storage
- 40 Gb/s full connectivity, cross country, between sites
- Software Infrastructure is primarily Globus based
- Funded by NSF last year
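To put the storage and network figures side by side, here is a back-of-the-envelope calculation of how long it would take to move the entire 600 terabytes across a 40 gigabit-per-second path. It assumes decimal units, reads the network figure as gigabits per second, and assumes the link is fully used with no protocol overhead. These are simplifications for illustration, not figures from the talk.

```python
def transfer_seconds(n_bytes, bits_per_second):
    """Time to move n_bytes over a link at the given rate, ignoring all overhead."""
    return n_bytes * 8 / bits_per_second


STORAGE_BYTES = 600e12      # 600 terabytes, assuming decimal units
LINK_BITS_PER_S = 40e9      # 40 gigabits per second, assumed fully available

seconds = transfer_seconds(STORAGE_BYTES, LINK_BITS_PER_S)
hours = seconds / 3600
print(f"{seconds:.0f} s, about {hours:.1f} hours")
```

Under these assumptions the transfer takes 120,000 seconds, or a little over 33 hours, so even at this bandwidth data placement remains a scheduling concern.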
29Non-trivial Endeavor
30It's Big, but there is Room to Grow
- Baseline infrastructure
- IA64 processors running Linux
- Gigabit ethernet
- Myrinet
- The Phone Company
- Designed to be heterogeneous and extensible
- Sites have plugged their resources in
- IBM Blue Horizon
- SGI Origin
- Sun Enterprise
- Convex X and V Class
- CAVEs, ImmersaDesks, etc.
31Middleware Status
- Several research and commercial infrastructures have reached maturity
- Research: Globus, Legion, NetSolve, Condor, NINF, PUNCH
- Commercial: Globus, Avaki, Grid Engine
- By far, the most prevalent Grid infrastructure deployed today is Globus
32Globus on One Slide
- Grid protocols for resource access, sharing, and discovery
- Grid Security Infrastructure (GSI)
- Grid Resource Allocation Manager (GRAM)
- MetaDirectory Service (MDS)
- Reference implementation of protocols in toolkit form
33Increasing Research Leverage
- Grid research software artifacts turn out to be valuable
- Much of the extant work is empirical and engineering focused
- Robustness concerns mean that the prototype systems need to work
- Heterogeneity implies the need for portability
- Open source impetus
- Need to go from research prototypes to nationally available software infrastructure
- Download, install, run
34Packaging Efforts
- NSF Middleware Initiative (NMI)
- USC/ISI, SDSC, U. Wisc., ANL, NCSA, I2
- Identifies maturing Grid services and tools
- Provides support for configuration tools, testing, packaging
- Implements a release schedule and coordination
- R1 out 8/02
- Globus, Condor-G, NWS, KX509/KCA
- Release every 3 months
- Many more packages slated
- The NPACkage
- Use NMI technology for PACI infrastructure
35State of the Art
- Dozens of Grid deployments underway
- Linux cluster technology is the primary COTS computing platform
- Heterogeneity is built in from the start
- Networks
- Extant systems
- Special-purpose devices
- Globus is the leading Middleware
- Grid services and software tools are reaching maturity, and mechanisms are in place to maximize leverage
36What's next?
37Grid Standards
- Interoperability is an issue
- Technology drift is starting to become a problem
- Protocol zoo is open for business
- The Global Grid Forum (GGF)
- Modeled after IETF (e.g., working groups)
- Organized at a much earlier stage of development (relatively speaking)
- Meetings every 4 months
- Truly an international organization
38Webification
- Open Grid Services Architecture (OGSA)
- "The Physiology of the Grid," I. Foster, C. Kesselman, J. Nick, S. Tuecke
- Based on W3C standards (XML, WSDL, WSIL, UDDI, etc.)
- Incorporates web service support for interface publication, multiple protocol bindings, and local/remote transparency
- Directly interoperable with Internet-targeted hosting environments
- J2EE, .NET
- The Vendors are excited
39Grid@Home
- Entropia (www.entropia.com)
- Commercial enterprise
- Peer-2-Peer approach
- Napster for compute cycles (without the lawsuits)
- Microsoft PC-based instead of Linux/Unix based
- More compute leverage -- a lot more
- Way more configuration support, deployment support, fault-management built into the system
- Proprietary technology
- Deployed at NPACI on 250 hosts
40Thanks and Credit
- organizations
- NPACI, SDSC, NCSA, The Globus Project (ISI/USC), The Legion Project (UVa), UTK, LBL
- support
- NSF, NASA, DARPA, USPTO, DOE
41More Information
http://www.cs.ucsb.edu/rich
- Entropia
- http://www.entropia.com
- Globus
- http://www.globus.org
- GrADS
- http://hipersoft.cs.rice.edu/grads
- NMI
- http://www.nsf-middleware.org
- NPACI
- http://www.npaci.edu
- NWS
- http://nws.cs.ucsb.edu
- TeraGrid
- http://www.teragrid.org