
Title: Global Data Grids for 21st Century Science


1
  • Global Data Grids for 21st Century Science

Paul Avery, University of Florida
http://www.phys.ufl.edu/avery/
avery@phys.ufl.edu
Florida International University, March 27, 2002
2
What is a Grid?
  • Grid: geographically distributed computing
    resources configured for coordinated use
  • Physical resources and networks provide raw
    capability
  • Middleware software ties it together

3
What Are Grids Good For?
  • Climate modeling
  • Climate scientists visualize, annotate, analyze
    Terabytes of simulation data
  • Biology
  • A biochemist exploits 10,000 computers to screen
    100,000 compounds in an hour
  • High energy physics
  • 3,000 physicists worldwide pool Petaflops of CPU
    resources to analyze Petabytes of data
  • Engineering
  • Civil engineers collaborate to design, execute,
    analyze shake table experiments
  • A multidisciplinary analysis in aerospace couples
    code and data in four companies to design a new
    airframe

From Ian Foster
4
What Are Grids Good For?
  • Application Service Providers
  • A home user invokes architectural design
    functions at an application service provider,
    which purchases computing cycles from cycle
    providers
  • Commercial
  • Scientists at a multinational toy company design
    a new product
  • Cities, communities
  • An emergency response team couples real time
    data, weather model, population data
  • A community group pools members' PCs to analyze
    alternative designs for a local road
  • Health
  • Hospitals and international agencies collaborate
    on stemming a major disease outbreak

From Ian Foster
5
Proto-Grid: SETI@home
  • Community: SETI researchers and enthusiasts
  • Arecibo radio data sent to users (250KB data
    chunks)
  • Over 2M PCs used

6
More Advanced Proto-Grid: Evaluation of AIDS Drugs
  • Community
  • Research group (Scripps)
  • 1000s of PC owners
  • Vendor (Entropia)
  • Common goal
  • Drug design
  • Advance AIDS research

7
Why Grids?
  • Resources for complex problems are distributed
  • Advanced scientific instruments (accelerators,
    telescopes, ...)
  • Storage and computing
  • Groups of people
  • Communities require access to common services
  • Scientific collaborations (physics, astronomy,
    biology, eng., ...)
  • Government agencies
  • Health care organizations, large corporations, ...
  • Goal is to build Virtual Organizations
  • Make all community resources available to any VO
    member
  • Leverage strengths at different institutions
  • Add people and resources dynamically

8
Grids: Why Now?
  • Moore's law improvements in computing
  • Highly functional endsystems
  • Burgeoning wired and wireless Internet
    connections
  • Universal connectivity
  • Changing modes of working and problem solving
  • Teamwork, computation
  • Network exponentials
  • (Next slide)

9
Network Exponentials and Collaboration
  • Network vs. computer performance
  • Computer speed doubles every 18 months
  • Network speed doubles every 9 months
  • Difference: an order of magnitude per 5 years
  • 1986 to 2000
  • Computers: x 500
  • Networks: x 340,000
  • 2001 to 2010?
  • Computers: x 60
  • Networks: x 4000

Scientific American (Jan-2001)
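The "order of magnitude per 5 years" gap follows from the two doubling times quoted on this slide. A minimal Python sketch of that arithmetic (the name `growth_factor` is illustrative and not from the talk):

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Growth multiple after `months` for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


FIVE_YEARS = 60  # months

computers = growth_factor(FIVE_YEARS, 18)  # computer speed doubles every 18 months
networks = growth_factor(FIVE_YEARS, 9)    # network speed doubles every 9 months
gap = networks / computers                 # how far networks pull ahead in 5 years

print(f"computers x{computers:.1f}, networks x{networks:.1f}, gap x{gap:.1f}")
```

Because the network doubling time is half the computer doubling time, the gap after any period equals the computer growth factor itself, which is about 10 after five years.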
10
Grid Challenges
  • Overall goal: coordinated sharing of resources
  • Technical problems to overcome
  • Authentication, authorization, policy, auditing
  • Resource discovery, access, allocation, control
  • Failure detection and recovery
  • Resource brokering
  • Additional issue: lack of central control and
    knowledge
  • Preservation of local site autonomy
  • Policy discovery and negotiation important

11
Layered Grid Architecture (Analogy to Internet Architecture)
  • User: specialized services; application-specific
    distributed services
  • Collective: managing multiple resources;
    ubiquitous infrastructure services
  • Resource: sharing single resources; negotiating
    access, controlling use
  • Connectivity: talking to things; communications,
    security
  • Fabric: controlling things locally; accessing,
    controlling resources

From Ian Foster
12
Globus Project and Toolkit
  • Globus Project (Argonne + USC/ISI)
  • O(40) researchers and developers
  • Identify and define core protocols and services
  • Globus Toolkit 2.0
  • A major product of the Globus Project
  • Reference implementation of core protocols and
    services
  • Growing open source developer community
  • Globus Toolkit used by all Data Grid projects
    today
  • US: GriPhyN, PPDG, TeraGrid, iVDGL
  • EU: EU-DataGrid and national projects
  • Recent announcement of applying web services to
    Grids
  • Keeps Grids in the commercial mainstream
  • GT 3.0

13
Globus General Approach
  • Define Grid protocols and APIs
  • Protocol-mediated access to remote resources
  • Integrate and extend existing standards
  • Develop reference implementation
  • Open source Globus Toolkit
  • Client and server SDKs, services, tools, etc.
  • Grid-enable wide variety of tools
  • Globus Toolkit
  • FTP, SSH, Condor, SRB, MPI, ...
  • Learn about real world problems
  • Deployment
  • Testing
  • Applications

[Figure layers: Applications, Diverse global services, Core services, Diverse resources]
14
Data Grids
15
Data Intensive Science: 2000-2015
  • Scientific discovery increasingly driven by IT
  • Computationally intensive analyses
  • Massive data collections
  • Data distributed across networks of varying
    capability
  • Geographically distributed collaboration
  • Dominant factor: data growth (1 Petabyte = 1000
    TB)
  • 2000: 0.5 Petabyte
  • 2005: 10 Petabytes
  • 2010: 100 Petabytes
  • 2015: 1000 Petabytes?

How to collect, manage, access and interpret
this quantity of data?
Drives demand for Data Grids to handle the
additional dimension of data access and movement
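To compare this projection with the doubling times quoted earlier in the talk, the implied doubling time of the data volume can be worked out from the slide's own numbers. A small Python sketch, assuming simple exponential growth between the listed years (the helper name is illustrative):

```python
import math

# Projected data volumes from the slide, in Petabytes
# (the 2015 value carries a question mark in the talk).
projection_pb = {2000: 0.5, 2005: 10, 2010: 100, 2015: 1000}


def doubling_time_months(year_a: int, year_b: int) -> float:
    """Doubling time implied by exponential growth between two projected years."""
    factor = projection_pb[year_b] / projection_pb[year_a]
    return 12 * (year_b - year_a) / math.log2(factor)


for start, end in [(2000, 2005), (2005, 2010), (2010, 2015), (2000, 2015)]:
    months = doubling_time_months(start, end)
    print(f"{start}-{end}: data doubles about every {months:.1f} months")
```

Under that assumption the projected volume doubles roughly every 14 to 18 months, depending on the interval chosen.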
16
Data Intensive Physical Sciences
  • High energy and nuclear physics
  • Including new experiments at CERN's Large Hadron
    Collider
  • Gravity wave searches
  • LIGO, GEO, VIRGO
  • Astronomy: digital sky surveys
  • Sloan Digital Sky Survey, VISTA, other Gigapixel
    arrays
  • Virtual Observatories (multi-wavelength
    astronomy)
  • Time-dependent 3-D systems (simulation data)
  • Earth Observation, climate modeling
  • Geophysics, earthquake modeling
  • Fluids, aerodynamic design
  • Pollutant dispersal scenarios

17
Data Intensive Biology and Medicine
  • Medical data
  • X-Ray, mammography data, etc. (many petabytes)
  • Digitizing patient records (ditto)
  • X-ray crystallography
  • Bright X-Ray sources, e.g. Argonne Advanced
    Photon Source
  • Molecular genomics and related disciplines
  • Human Genome, other genome databases
  • Proteomics (protein structure, activities, ...)
  • Protein interactions, drug delivery
  • Brain scans (3-D, time dependent)
  • Virtual Population Laboratory (proposed)
  • Database of populations, geography,
    transportation corridors
  • Simulate likely spread of disease outbreaks

Craig Venter keynote @ SC2001
18
Example: High Energy Physics
[Figure: Compact Muon Solenoid at the LHC (CERN), shown next to the Smithsonian "standard man"]
19
LHC Computing Challenges
  • Complexity of LHC interaction environment and
    resulting data
  • Scale: Petabytes of data per year (100 PB by
    2010-12)
  • Global distribution of people and resources

1800 physicists, 150 institutes, 32 countries
20
Global LHC Data Grid
  • Tier0: CERN
  • Tier1: National Lab
  • Tier2: Regional Center (University, etc.)
  • Tier3: University workgroup
  • Tier4: Workstation
  • Key ideas
  • Hierarchical structure
  • Tier2 centers

21
Global LHC Data Grid
  • CERN/outside resource ratio: 1:2
  • Tier0/(Σ Tier1)/(Σ Tier2): 1:1:1
  • Experiment to Online System: PBytes/sec
  • Bunch crossing every 25 nsecs; 100 triggers per
    second; each event is 1 MByte in size
  • Online System to Tier 0: 100 MBytes/sec
  • Tier 0: CERN Computer Center, > 20 TIPS, HPSS
  • Tier 0 to Tier 1: 2.5 Gbits/sec
  • Tier 1: France Center, Italy Center, UK Center,
    USA Center
  • Tier 1 to Tier 2: 2.5 Gbits/sec
  • Tier 2 to Tier 3: 622 Mbits/sec
  • Tier 3: institutes (0.25 TIPS), physics data
    cache
  • Physicists work on analysis channels. Each
    institute has 10 physicists working on one or
    more channels
  • Tier 3 to Tier 4: 100 - 1000 Mbits/sec
  • Tier 4: workstations, other portals
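The rates on this slide can be cross-checked with a few lines of arithmetic. The Python sketch below uses the trigger rate, event size, and 2.5 Gbits/sec link speed from the slide; the 10^7 seconds of data taking per year is an assumption added here for illustration and is not stated in the talk, and the link is taken to be the one leaving Tier 0.

```python
TRIGGERS_PER_SEC = 100        # from the slide
EVENT_SIZE_MB = 1             # from the slide
LINK_GBITS_PER_SEC = 2.5      # from the slide (assumed here to be the Tier 0 outbound link)
LIVE_SECONDS_PER_YEAR = 1e7   # ASSUMPTION: not stated in the talk

# Rate written out by the online system.
rate_mb_per_sec = TRIGGERS_PER_SEC * EVENT_SIZE_MB

# Raw data accumulated per year, converted from MBytes to PBytes (decimal units).
raw_pb_per_year = rate_mb_per_sec * LIVE_SECONDS_PER_YEAR / 1e9

# Fraction of one 2.5 Gbits/sec link needed to ship the raw stream in real time.
link_mb_per_sec = LINK_GBITS_PER_SEC * 1e3 / 8
link_utilization = rate_mb_per_sec / link_mb_per_sec

print(f"online system output: {rate_mb_per_sec} MBytes/sec")
print(f"raw data per year:    {raw_pb_per_year:.1f} PBytes")
print(f"link utilization:     {link_utilization:.0%}")
```

The product of trigger rate and event size reproduces the 100 MBytes/sec figure on the slide, and under the stated running-time assumption the raw stream alone is on the order of a Petabyte per year.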
22
Sloan Digital Sky Survey Data Grid
23
LIGO (Gravity Wave) Data Grid
[Figure: LIGO Data Grid connecting the Hanford Observatory, the Livingston Observatory, MIT, and Caltech (Tier1) over OC3, OC12, and OC48 links]
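The OC link labels in this figure can be related to the Mbits/sec and Gbits/sec figures used elsewhere in the talk. A short Python sketch, relying on the standard SONET convention that OC-n runs at n times the 51.84 Mbit/s OC-1 line rate (the function name is illustrative):

```python
OC1_MBITS = 51.84  # SONET OC-1 line rate in Mbit/s


def oc_rate_mbits(n: int) -> float:
    """Line rate of an OC-n link in Mbit/s."""
    return n * OC1_MBITS


for n in (3, 12, 48):
    print(f"OC{n}: {oc_rate_mbits(n):.2f} Mbit/s")
```

OC12 and OC48 therefore correspond to the 622 Mbits/sec and 2.5 Gbits/sec links quoted on the LHC Data Grid slide.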
24
Data Grid Projects
25
Data Grid Projects
  • Particle Physics Data Grid (US, DOE)
  • Data Grid applications for HENP expts.
  • GriPhyN (US, NSF)
  • Petascale Virtual-Data Grids
  • iVDGL (US, NSF)
  • Global Grid lab
  • TeraGrid (US, NSF)
  • Dist. supercomp. resources (13 TFlops)
  • European Data Grid (EU, EC)
  • Data Grid technologies, EU deployment
  • CrossGrid (EU, EC)
  • Data Grid technologies, EU emphasis
  • DataTAG (EU, EC)
  • Transatlantic network, Grid applications
  • Japanese Grid Projects (APGrid?) (Japan)
  • Grid deployment throughout Japan
  • Collaborations of application scientists and
    computer scientists
  • Infrastructure development and deployment
  • Globus based

26
GriPhyN: App. Science + CS + Grids
  • GriPhyN: Grid Physics Network
  • US-CMS: High Energy Physics
  • US-ATLAS: High Energy Physics
  • LIGO/LSC: Gravity wave research
  • SDSS: Sloan Digital Sky Survey
  • Strong partnership with computer scientists
  • Design and implement production-scale grids
  • Develop common infrastructure, tools and services
  • Integration into the 4 experiments
  • Broad application to other sciences via Virtual
    Data Toolkit
  • Strong outreach program
  • Multi-year project
  • R&D for grid architecture (funded at $11.9M +
    $1.6M)
  • Integrate Grid infrastructure into experiments
    through VDT

27
GriPhyN Institutions
  • UC San Diego
  • San Diego Supercomputer Center
  • Lawrence Berkeley Lab
  • Argonne
  • Fermilab
  • Brookhaven
  • U Florida
  • U Chicago
  • Boston U
  • Caltech
  • U Wisconsin, Madison
  • USC/ISI
  • Harvard
  • Indiana
  • Johns Hopkins
  • Northwestern
  • Stanford
  • U Illinois at Chicago
  • U Penn
  • U Texas, Brownsville
  • U Wisconsin, Milwaukee
  • UC Berkeley

28
GriPhyN: PetaScale Virtual-Data Grids
[Figure: Production teams, individual investigators, and workgroups (1 Petaflop, 100 Petabytes) work through interactive user tools. Beneath these sit virtual data tools, request planning and scheduling tools, and request execution and management tools, supported by resource management services, security and policy services, and other Grid services. Transforms run over distributed resources (code, storage, CPUs, networks) and the raw data source.]
29
GriPhyN Research Agenda
  • Virtual Data technologies (fig.)
  • Derived data, calculable via algorithm
  • Instantiated 0, 1, or many times (e.g., caches)
  • Fetch value vs execute algorithm
  • Potentially complex (versions, consistency, cost
    calculation, etc)
  • LIGO example
  • Get gravitational strain for 2 minutes around
    each of 200 gamma-ray bursts over the last year
  • For each requested data value, need to
  • Locate item location and algorithm
  • Determine costs of fetching vs calculating
  • Plan data movements and computations required to
    obtain results
  • Execute the plan
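
The fetch-versus-compute decision described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the GriPhyN implementation: the class `VirtualDataItem`, the function `plan`, the site names, and the cost numbers are all hypothetical, and costs are treated as single comparable numbers.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class VirtualDataItem:
    name: str
    # Existing copies: location -> estimated cost to fetch (0, 1, or many entries).
    replicas: dict = field(default_factory=dict)
    # Algorithm that can regenerate the item, if one is registered.
    recipe: Optional[Callable[[], object]] = None
    # Estimated cost of running the recipe.
    compute_cost: float = float("inf")


def plan(item: VirtualDataItem) -> tuple:
    """Return ("fetch", location) or ("compute", None), whichever is cheaper."""
    can_fetch = bool(item.replicas)
    can_compute = item.recipe is not None
    if not can_fetch and not can_compute:
        raise LookupError(f"{item.name}: no replica and no algorithm registered")
    if not can_fetch:
        return ("compute", None)
    best_site = min(item.replicas, key=item.replicas.get)
    if can_compute and item.compute_cost < item.replicas[best_site]:
        return ("compute", None)
    return ("fetch", best_site)


# Hypothetical strain segment that exists in two places and can also be recomputed.
cached = VirtualDataItem(
    "strain_segment_001",
    replicas={"archive": 40.0, "regional_cache": 5.0},
    recipe=lambda: "recomputed strain",
    compute_cost=12.0,
)
# Hypothetical segment that has never been instantiated.
uncached = VirtualDataItem(
    "strain_segment_002", recipe=lambda: "recomputed strain", compute_cost=12.0
)

print(plan(cached))
print(plan(uncached))
```

A request like the LIGO example would call such a planner once per requested data value and then hand the resulting list of fetches and computations to an executor.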

30
Virtual Data in Action
  • Data request may
  • Compute locally
  • Compute remotely
  • Access local data
  • Access remote data
  • Scheduling based on
  • Local policies
  • Global policies
  • Cost

Major facilities, archives
Regional facilities, caches
Local facilities, caches
31
GriPhyN Research Agenda (cont.)
  • Execution management
  • Co-allocation of resources (CPU, storage, network
    transfers)
  • Fault tolerance, error reporting
  • Interaction, feedback to planning
  • Performance analysis (with PPDG)
  • Instrumentation and measurement of all grid
    components
  • Understand and optimize grid performance
  • Virtual Data Toolkit (VDT)
  • VDT = virtual data services + virtual data tools
  • One of the primary deliverables of R&D effort
  • Technology transfer mechanism to other scientific
    domains

32
GriPhyN/PPDG Data Grid Architecture
[Figure: Application → DAG → Planner → DAG → Executor → Compute Resource, Storage Resource. Supporting services: Catalog Services, Monitoring, Info Services, Repl. Mgmt., Policy/Security, Reliable Transfer Service. Annotation: initial solution is operational.]
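The architecture passes work between planner and executor as a DAG (directed acyclic graph) of jobs. As a rough illustration of what executing such a DAG involves, here is a minimal Python sketch using the standard library's `graphlib`. The job names and the `execute` function are hypothetical, and a real executor would dispatch jobs to compute and storage resources rather than call a local function.

```python
from graphlib import TopologicalSorter

# Hypothetical request: each job maps to the set of jobs it depends on.
dag = {
    "stage_raw_data": set(),
    "calibrate": {"stage_raw_data"},
    "reconstruct": {"calibrate"},
    "simulate": set(),
    "compare": {"reconstruct", "simulate"},
}


def execute(dag, run_job):
    """Run every job only after all of its dependencies; return the order used."""
    order = []
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    while sorter.is_active():
        # Every job returned by get_ready() has all dependencies finished,
        # so these jobs could also be dispatched in parallel.
        for job in sorter.get_ready():
            run_job(job)
            order.append(job)
            sorter.done(job)
    return order


order = execute(dag, run_job=lambda job: print("running", job))
```

Jobs with no mutual dependencies, such as `stage_raw_data` and `simulate` here, become ready at the same time, which is where co-allocation across sites would come in.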
33
Catalog Architecture
Transparency wrt location
Metadata Catalog (Name, LObjN)
  • X: logO1
  • Y: logO2
  • F.X: logO3
  • G(1).Y: logO4
GCMS (Object Name, Logical Container Name)
Replica Catalog (LCN, PFNs)
  • logC1: URL1
  • logC2: URL2, URL3
  • logC3: URL4
  • logC4: URL5, URL6
URLs for physical file location
Physical file storage
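The two catalogs amount to a chain of lookups from an object name to one or more physical URLs. A minimal Python sketch of that chain, using the entries shown on the slide. The mapping from logical objects to logical containers is not given on the slide, so the one-to-one `object_to_container` table below is an assumption made purely for illustration, as is the function name `locate`.

```python
# Metadata catalog: object name -> logical object name (entries from the slide).
metadata_catalog = {"X": "logO1", "Y": "logO2", "F.X": "logO3", "G(1).Y": "logO4"}

# ASSUMPTION: the slide does not show which container holds which object;
# a one-to-one mapping is used here only to make the example complete.
object_to_container = {
    "logO1": "logC1",
    "logO2": "logC2",
    "logO3": "logC3",
    "logO4": "logC4",
}

# Replica catalog: logical container name -> physical file names (entries from the slide).
replica_catalog = {
    "logC1": ["URL1"],
    "logC2": ["URL2", "URL3"],
    "logC3": ["URL4"],
    "logC4": ["URL5", "URL6"],
}


def locate(object_name: str) -> list:
    """Resolve an object name to every physical location that holds it."""
    logical_object = metadata_catalog[object_name]
    container = object_to_container[logical_object]
    return replica_catalog[container]


print(locate("Y"))
```

Because applications only ever use the object name, replicas can be added or moved by updating the replica catalog alone, which is the location transparency the slide refers to.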
34
iVDGL: A Global Grid Laboratory
"We propose to create, operate and evaluate, over
a sustained period of time, an international
research laboratory for data-intensive
science." From NSF proposal, 2001
  • International Virtual-Data Grid Laboratory
  • A global Grid laboratory (US, EU, South America,
    Asia, ...)
  • A place to conduct Data Grid tests at scale
  • A mechanism to create common Grid infrastructure
  • A facility to perform production exercises for
    LHC experiments
  • A laboratory for other disciplines to perform
    Data Grid tests
  • A focus of outreach efforts to small institutions
  • Funded for $13.65M by NSF

35
iVDGL Components
  • Computing resources
  • Tier1, Tier2, Tier3 sites
  • Networks
  • USA (TeraGrid, Internet2, ESNET), Europe (Géant,
    ...)
  • Transatlantic (DataTAG), Transpacific, AMPATH, ...
  • Grid Operations Center (GOC)
  • Indiana (2 people)
  • Joint work with TeraGrid on GOC development
  • Computer Science support teams
  • Support, test, upgrade GriPhyN Virtual Data
    Toolkit
  • Outreach effort
  • Integrated with GriPhyN
  • Coordination, interoperability

36
Current iVDGL Participants
  • Initial experiments (funded by NSF proposal)
  • CMS, ATLAS, LIGO, SDSS, NVO
  • U.S. Universities and laboratories
  • (Next slide)
  • Partners
  • TeraGrid
  • EU DataGrid + EU national projects
  • Japan (AIST, TITECH)
  • Australia
  • Complementary EU project: DataTAG
  • 2.5 Gb/s transatlantic network

37
Initial U.S. iVDGL Participants
  • U Florida: CMS
  • Caltech: CMS, LIGO
  • UC San Diego: CMS, CS
  • Indiana U: ATLAS, GOC
  • Boston U: ATLAS
  • U Wisconsin, Milwaukee: LIGO
  • Penn State: LIGO
  • Johns Hopkins: SDSS, NVO
  • U Chicago/Argonne: CS
  • U Southern California: CS
  • U Wisconsin, Madison: CS
  • Salish Kootenai: Outreach, LIGO
  • Hampton U: Outreach, ATLAS
  • U Texas, Brownsville: Outreach, LIGO
  • Fermilab: CMS, SDSS, NVO
  • Brookhaven: ATLAS
  • Argonne Lab: ATLAS, CS

T2 / Software
CS support
T3 / Outreach
T1 / Labs (funded elsewhere)
38
TeraGrid: 13 TeraFlops, 40 Gb/s
[Figure: Four sites (Caltech, Argonne, NCSA/PACI with 8 TF and 240 TB, SDSC with 4.1 TF and 225 TB) joined by a 40 Gb/s network. Each site shows site resources, archival storage (HPSS or UniTree), and external network connections.]
39
Initial US-iVDGL Data Grid
40
iVDGL Map (2002-2003)
41
Need for Common Grid Infrastructure
  • Grid computing sometimes compared to electric
    grid
  • You plug in to get a resource (CPU, storage, )
  • You don't care where the resource is located
  • This analogy is more appropriate than originally
    intended
  • It expresses a USA viewpoint: a uniform power grid
  • What happens when you travel around the world?

Different frequencies: 60 Hz, 50 Hz
Different voltages: 120 V, 220 V
Different sockets! USA, 2 pin, France, UK, etc.
Want to avoid this situation in Grid computing
42
Role of Grid Infrastructure
  • Provide essential common Grid services
  • Cannot afford to develop separate
    infrastructures (manpower, timing, immediate
    needs, etc.)
  • Meet needs of high-end scientific and engineering
    collaborations
  • HENP, astrophysics, GVO, earthquake, climate,
    space, biology, ...
  • Already international and even global in scope
  • Drive future requirements
  • Be broadly applicable outside science
  • Government agencies National, regional (EU), UN
  • Non-governmental organizations (NGOs)
  • Corporations, business networks (e.g., suppliers,
    R&D)
  • Other virtual organizations (see "Anatomy of the
    Grid")
  • Be scalable to the Global level

43
Coordination of U.S. Grid Projects
  • Three closely coordinated U.S. projects
  • PPDG: HENP experiments, short term tools,
    deployment
  • GriPhyN: Data Grid research, Virtual Data, VDT
    deliverable
  • iVDGL: Global Grid laboratory
  • Coordination of PPDG, GriPhyN, iVDGL
  • Common experiments and personnel, management
    integration
  • iVDGL as joint PPDG + GriPhyN laboratory
  • Joint meetings (Jan. 2002, April 2002, Sept.
    2002)
  • Joint architecture creation (GriPhyN, PPDG)
  • Adoption of VDT as common core Grid
    infrastructure
  • Common Outreach effort (GriPhyN + iVDGL)
  • New TeraGrid project (Aug. 2001)
  • 13 TFlops across 4 sites, 40 Gb/s networking
  • Aim to integrate into iVDGL, adopt VDT, common
    Outreach

44
Grid Coordination Efforts
  • Global Grid Forum (GGF)
  • www.gridforum.org
  • International forum for general Grid efforts
  • Many working groups, standards definitions
  • Next one in Toronto, Feb. 17-20
  • HICB (High energy physics)
  • Represents HEP collaborations, primarily LHC
    experiments
  • Joint development and deployment of Data Grid
    middleware
  • GriPhyN, PPDG, TeraGrid, iVDGL, EU-DataGrid, LCG,
    DataTAG, CrossGrid
  • Common testbed, open source software model
  • Several meetings so far
  • New infrastructure Data Grid projects?
  • Fold into existing Grid landscape (primarily US +
    EU)

45
Worldwide Grid Coordination
  • Two major clusters of projects
  • US based: GriPhyN Virtual Data Toolkit (VDT)
  • EU based: different packaging of similar
    components

46
Summary
  • Data Grids will qualitatively and quantitatively
    change the nature of collaborations and
    approaches to computing
  • The iVDGL will provide vast experience for new
    collaborations
  • Many challenges during the coming transition
  • New grid projects will provide rich experience
    and lessons
  • Difficult to predict situation even 3-5 years
    ahead

47
Grid References
  • Grid Book
  • www.mkp.com/grids
  • Globus
  • www.globus.org
  • Global Grid Forum
  • www.gridforum.org
  • TeraGrid
  • www.teragrid.org
  • EU DataGrid
  • www.eu-datagrid.org
  • PPDG
  • www.ppdg.net
  • GriPhyN
  • www.griphyn.org
  • iVDGL
  • www.ivdgl.org