1
  • Grids for 21st Century Data Intensive Science

Paul Avery, University of Florida
http://www.phys.ufl.edu/avery/  avery@phys.ufl.edu
University of Michigan, May 8, 2003
2
Grids and Science
3
The Grid Concept
  • Grid: Geographically distributed computing
    resources configured for coordinated use
  • Fabric: Physical resources & networks provide
    raw capability
  • Middleware: Software that ties it all together (tools,
    services, etc.)
  • Goal: Transparent resource sharing

4
Fundamental Idea: Resource Sharing
  • Resources for complex problems are distributed
  • Advanced scientific instruments (accelerators,
    telescopes, ...)
  • Storage, computing, people, institutions
  • Communities require access to common services
  • Research collaborations (physics, astronomy,
    engineering, ...)
  • Government agencies, health care organizations,
    corporations, ...
  • Virtual Organizations
  • Create a VO from geographically separated
    components
  • Make all community resources available to any VO
    member
  • Leverage strengths at different institutions
  • Grids require a foundation of strong networking
  • Communication tools, visualization
  • High-speed data transmission, instrument operation

5
Some (Realistic) Grid Examples
  • High energy physics
  • 3,000 physicists worldwide pool Petaflops of CPU
    resources to analyze Petabytes of data
  • Fusion power (ITER, etc.)
  • Physicists quickly generate 100 CPU-years of
    simulations of a new magnet configuration to
    compare with data
  • Astronomy
  • An international team remotely operates a
    telescope in real time
  • Climate modeling
  • Climate scientists visualize, annotate, analyze
    Terabytes of simulation data
  • Biology
  • A biochemist exploits 10,000 computers to screen
    100,000 compounds in an hour

6
Grids Enhancing Research & Learning
  • Fundamentally alters conduct of scientific
    research
  • Central model: People, resources flow inward to
    labs
  • Distributed model: Knowledge flows between
    distributed teams
  • Strengthens universities
  • Couples universities to data intensive science
  • Couples universities to national & international
    labs
  • Brings front-line research and resources to
    students
  • Exploits intellectual resources of formerly
    isolated schools
  • Opens new opportunities for minority and women
    researchers
  • Builds partnerships to drive advances in
    IT/science/eng
  • Application sciences ↔ Computer Science
  • Physics ↔ Astronomy, biology, etc.
  • Universities ↔ Laboratories
  • Scientists ↔ Students
  • Research Community ↔ IT industry

7
Grid Challenges
  • Operate a fundamentally complex entity
  • Geographically distributed resources
  • Each resource under different administrative
    control
  • Many failure modes
  • Manage workflow across Grid
  • Balance policy vs. instantaneous capability to
    complete tasks
  • Balance effective resource use vs. fast
    turnaround for priority jobs
  • Match resource usage to policy over the long term
  • Goal-oriented algorithms steering requests
    according to metrics
  • Maintain a global view of resources and system
    state
  • Coherent end-to-end system monitoring
  • Adaptive learning for execution optimization
  • Build high-level services & integrated user
    environment

8
Data Grids
9
Data Intensive Science 2000-2015
  • Scientific discovery increasingly driven by data
    collection
  • Computationally intensive analyses
  • Massive data collections
  • Data distributed across networks of varying
    capability
  • Internationally distributed collaborations
  • Dominant factor: data growth (1 Petabyte = 1000
    TB)
  • 2000: 0.5 Petabyte
  • 2005: 10 Petabytes
  • 2010: 100 Petabytes
  • 2015: 1000 Petabytes?

How to collect, manage, access and interpret
this quantity of data?
Drives demand for Data Grids to handle an
additional dimension of data access & movement
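
As a rough check on what these projections imply (only the 2000 and 2015 figures come from the bullets above; the growth rate and doubling time are derived arithmetic, a sketch rather than a stated result):

import math

# Data-volume projections quoted above: ~0.5 PB in 2000, ~1000 PB by 2015.
start_pb, end_pb = 0.5, 1000.0
years = 2015 - 2000

annual_factor = (end_pb / start_pb) ** (1 / years)      # ~1.66x per year
doubling_years = math.log(2) / math.log(annual_factor)  # ~1.4 years

print(f"Implied growth: about x{annual_factor:.2f} per year,")
print(f"i.e. data volume doubling roughly every {doubling_years:.1f} years")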
10
Data Intensive Physical Sciences
  • High energy nuclear physics
  • Including new experiments at CERN's Large Hadron
    Collider
  • Astronomy
  • Digital sky surveys: SDSS, VISTA, other Gigapixel
    arrays
  • VLBI arrays: multiple-Gbps data streams
  • Virtual Observatories (multi-wavelength
    astronomy)
  • Gravity wave searches
  • LIGO, GEO, VIRGO, TAMA
  • Time-dependent 3-D systems (simulation & data)
  • Earth Observation, climate modeling
  • Geophysics, earthquake modeling
  • Fluids, aerodynamic design
  • Dispersal of pollutants in atmosphere

11
Data Intensive Biology and Medicine
  • Medical data
  • X-Ray, mammography data, etc. (many petabytes)
  • Radiation Oncology (real-time display of 3-D
    images)
  • X-ray crystallography
  • Bright X-Ray sources, e.g. Argonne Advanced
    Photon Source
  • Molecular genomics and related disciplines
  • Human Genome, other genome databases
  • Proteomics (protein structure, activities, ...)
  • Protein interactions, drug delivery
  • Brain scans (1-10 μm, time dependent)

12
Driven by LHC Computing Challenges
  • Complexity: Millions of individual detector
    channels
  • Scale: PetaOps (CPU), Petabytes (Data)
  • Distribution: Global distribution of people &
    resources

1800 Physicists, 150 Institutes, 32 Countries
13
CMS Experiment at LHC
Compact Muon Solenoid at the LHC (CERN)
[Shown with a Smithsonian "standard man" for scale]
14
LHC Data Rates: Detector to Storage
  • Collisions at 40 MHz produce ~1000 TB/sec of
    detector data
  • Physics filtering in three trigger stages:
  • Level 1 Trigger (special hardware): 75 KHz, 75 GB/sec
  • Level 2 Trigger (commodity CPUs): 5 KHz, 5 GB/sec
  • Level 3 Trigger (commodity CPUs): 100 Hz,
    100-1500 MB/sec of raw data to storage
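
To make the filtering explicit, a small sketch of the event-rejection factors implied by the trigger rates above (all rates are from the slide; the per-stage ratios are simple derived arithmetic):

# Trigger rates quoted above, detector to storage.
rates_hz = {
    "collisions": 40e6,   # 40 MHz bunch-crossing rate, ~1000 TB/sec
    "Level 1":    75e3,   # special hardware, 75 GB/sec
    "Level 2":    5e3,    # commodity CPUs, 5 GB/sec
    "Level 3":    100,    # commodity CPUs, 100-1500 MB/sec to storage
}

stages = list(rates_hz.items())
for (prev_name, prev_rate), (name, rate) in zip(stages, stages[1:]):
    print(f"{name}: keeps ~1 in {prev_rate / rate:,.0f} events from {prev_name}")

overall = rates_hz["collisions"] / rates_hz["Level 3"]
print(f"Overall: ~1 event in {overall:,.0f} reaches permanent storage")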
15
LHC Higgs Decay into 4 muons
16
Hierarchy of LHC Data Grid Resources
CMS Experiment
  • Tier0 : (Σ Tier1) : (Σ Tier2) ≈ 1:1:1
  • Online System → Tier 0 (CERN Computer Center,
    ~20 TIPS) at 100-1500 MBytes/s
  • Tier 0 → Tier 1 at 10-40 Gbps
  • Tier 1 → Tier 2 at 2.5-10 Gbps
  • Tier 2 → Tier 3 (physics cache) at 1-2.5 Gbps
  • Tier 3 → Tier 4 (PCs) at 1-10 Gbps
  • 10s of Petabytes by 2007-8; 1000 Petabytes in
    5-7 years
17
Digital Astronomy
  • Future dominated by detector improvements
  • Moore's Law growth in CCDs
  • Gigapixel arrays on horizon
  • Growth in CPU/storage tracking data volumes

[Chart: "Glass" = total area of 3m telescopes in the
world, in m²; "MPixels" = total number of CCD pixels,
in Mpixels]
  • 25-year growth: 30x in glass, 3000x in pixels

18
The Age of Astronomical Mega-Surveys
  • Next generation mega-surveys will change
    astronomy
  • Large sky coverage
  • Sound statistical plans, uniform systematics
  • The technology to store and access the data is
    here
  • Following Moore's law
  • Integrating these archives for the whole
    community
  • Astronomical data mining will lead to stunning
    new discoveries
  • Virtual Observatory (next slides)

19
Virtual Observatories
Multi-wavelength astronomy, multiple surveys
20
Virtual Observatory Data Challenge
  • Digital representation of the sky
  • All-sky deep fields
  • Integrated catalog and image databases
  • Spectra of selected samples
  • Size of the archived data
  • 40,000 square degrees
  • Resolution: 50 trillion pixels
  • One band (2 bytes/pixel): 100 Terabytes
  • Multi-wavelength: 500-1000 Terabytes
  • Time dimension: Many Petabytes
  • Large, globally distributed database engines
  • Multi-Petabyte data size, distributed widely
  • Thousands of queries per day, Gbyte/s I/O speed
    per site
  • Data Grid computing infrastructure
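
The archive-size figures above are mutually consistent; a minimal arithmetic sketch (the ~0.1 arcsec pixel scale is an assumption chosen to reproduce the quoted 50 trillion pixels; the sky area and bytes/pixel are from the slide):

sky_sq_deg   = 40_000      # all-sky coverage quoted above
pixel_arcsec = 0.1         # assumed pixel scale (not stated on the slide)
bytes_per_px = 2           # one band, 2 bytes/pixel

pixels_per_sq_deg = (3600 / pixel_arcsec) ** 2        # 3600 arcsec per degree
total_pixels = sky_sq_deg * pixels_per_sq_deg         # ~5e13, "50 trillion"
one_band_tb  = total_pixels * bytes_per_px / 1e12     # ~100 TB

print(f"Total pixels: {total_pixels:.1e}")
print(f"One band:     ~{one_band_tb:.0f} TB")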

21
Sloan Sky Survey Data Grid
22
International Grid/Networking Projects: US, EU, E.
Europe, Asia, S. America, ...
23
Global Context: Data Grid Projects
  • U.S. projects
  • Particle Physics Data Grid (PPDG): DOE
  • GriPhyN: NSF
  • International Virtual Data Grid Laboratory
    (iVDGL): NSF
  • TeraGrid: NSF
  • DOE Science Grid: DOE
  • NSF Middleware Initiative (NMI): NSF
  • EU, Asia: major projects
  • European Data Grid (EU, EC)
  • LHC Computing Grid (LCG) (CERN)
  • EU national projects (UK, Italy, France, ...)
  • CrossGrid (EU, EC)
  • DataTAG (EU, EC)
  • Japanese project
  • Korea project

24
Particle Physics Data Grid
  • Funded 2001-2004 at US$9.5M (DOE)
  • Driven by HENP experiments: D0, BaBar, STAR, CMS,
    ATLAS

25
PPDG Goals
  • Serve high energy nuclear physics (HENP)
    experiments
  • Unique challenges, diverse test environments
  • Develop advanced Grid technologies
  • Focus on end-to-end integration
  • Maintain practical orientation
  • Networks, instrumentation, monitoring
  • DB file/object replication, caching, catalogs,
    end-to-end movement
  • Make tools general enough for wide community
  • Collaboration with GriPhyN, iVDGL, EDG, LCG
  • ESNet Certificate Authority work, security

26
GriPhyN and iVDGL
  • Both funded through NSF ITR program
  • GriPhyN: $11.9M (NSF) + $1.6M (matching) (2000-
    2005)
  • iVDGL: $13.7M (NSF) + $2M (matching) (2001-
    2006)
  • Basic composition
  • GriPhyN: 12 funded universities, SDSC, 3
    labs (80 people)
  • iVDGL: 16 funded institutions, SDSC, 3 labs (80
    people)
  • Experiments: US-CMS, US-ATLAS, LIGO, SDSS/NVO
  • Large overlap of people, institutions, management
  • Grid research vs. Grid deployment
  • GriPhyN: CS research, Virtual Data Toolkit (VDT)
    development
  • iVDGL: Grid laboratory deployment
  • 4 physics experiments provide frontier challenges
  • VDT in common

27
GriPhyN Computer Science Challenges
  • Virtual data (more later)
  • Data & programs (content) & executions of those
    programs
  • Representation, discovery, manipulation of
    workflows and associated data & programs
  • Planning
  • Mapping workflows in an efficient, policy-aware
    manner to distributed resources
  • Execution
  • Executing workflows, incl. data movements,
    reliably and efficiently
  • Performance
  • Monitoring system performance for scheduling &
    troubleshooting

28
Goal: PetaScale Virtual-Data Grids
[Architecture diagram: production teams, workgroups,
and single researchers use interactive user tools;
request planning & scheduling tools, request execution
& management tools, and virtual data tools sit on
resource management services, security and policy
services, and other Grid services; these draw on
distributed resources (code, storage, CPUs, networks),
transforms, and raw data sources]
  • PetaOps
  • Petabytes
  • Performance
29
GriPhyN/iVDGL Science Drivers
  • US-CMS & US-ATLAS
  • HEP experiments at LHC/CERN
  • 100s of Petabytes
  • LIGO
  • Gravity wave experiment
  • 100s of Terabytes
  • Sloan Digital Sky Survey
  • Digital astronomy (1/4 sky)
  • 10s of Terabytes
  • Massive CPU
  • Large, distributed datasets
  • Large, distributed communities

30
Virtual Data: Derivation and Provenance
  • Most scientific data are not simple
    measurements
  • They are computationally corrected/reconstructed
  • They can be produced by numerical simulation
  • Science & eng. projects are more CPU and data
    intensive
  • Programs are significant community resources
    (transformations)
  • So are the executions of those programs
    (derivations)
  • Management of dataset transformations is important!
  • Derivation: Instantiation of a potential data
    product
  • Provenance: Exact history of any existing data
    product

We already do this, but manually!
31
Virtual Data Motivations (1)
I've detected a muon calibration error and want
to know which derived data products need to be
recomputed.
Ive found some interesting data, but I need to
know exactly what corrections were applied before
I can trust it.
[Virtual data relationships: Data is consumed-by /
generated-by a Derivation; a Derivation is an
execution-of a Transformation; Data is a product-of
a Transformation]
I want to search a database for 3-muon SUSY
events. If a program that does this analysis
exists, I won't have to write one from scratch.
I want to apply a forward jet analysis to 100M
events. If the results already exist, I'll save
weeks of computation.
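
The first question above is a provenance query. A minimal, hypothetical sketch (plain Python, not Chimera's actual API; the dataset names are invented for illustration) of how recording what each product was generated from answers it:

from collections import defaultdict

# Hypothetical derivation records: each product -> inputs it was generated from.
derived_from = {
    "hits.calibrated": ["hits.raw", "muon.calib.v3"],
    "tracks.reco":     ["hits.calibrated"],
    "muon.candidates": ["tracks.reco"],
    "jet.candidates":  ["hits.calibrated"],
}

# Invert the records: each input -> the products directly derived from it.
consumers = defaultdict(list)
for product, inputs in derived_from.items():
    for inp in inputs:
        consumers[inp].append(product)

def needs_recompute(changed_input):
    """Return every product transitively derived from a changed input."""
    stale, todo = set(), [changed_input]
    while todo:
        for product in consumers[todo.pop()]:
            if product not in stale:
                stale.add(product)
                todo.append(product)
    return stale

# "I've detected a muon calibration error..." -> which products to recompute?
print(sorted(needs_recompute("muon.calib.v3")))
# ['hits.calibrated', 'jet.candidates', 'muon.candidates', 'tracks.reco']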
32
Virtual Data Motivations (2)
  • Data track-ability and result audit-ability
  • Universally sought by scientific applications
  • Facilitates tool and data sharing and
    collaboration
  • Data can be sent along with its recipe
  • Repair and correction of data
  • Rebuild data products (cf. make)
  • Workflow management
  • Organizing, locating, specifying, and requesting
    data products
  • Performance optimizations
  • Ability to re-create data rather than move it

Manual / error-prone → Automated / robust
33
Chimera Virtual Data System
  • Virtual Data API
  • A Java class hierarchy to represent
    transformations & derivations
  • Virtual Data Language
  • Textual, for people & illustrative examples
  • XML for machine-to-machine interfaces
  • Virtual Data Database
  • Makes the objects of a virtual data definition
    persistent
  • Virtual Data Service (future)
  • Provides a service interface (e.g., OGSA) to
    persistent objects
  • Version 1.0 available
  • To be put into VDT 1.1.7
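
For orientation only, a Python analogue of the two core virtual-data objects (Chimera's real implementation is the Java class hierarchy and VDL described above; the class and field names here are illustrative assumptions, not Chimera's schema):

from dataclasses import dataclass, field

@dataclass
class Transformation:
    """A registered program: a reusable community resource."""
    name: str
    executable: str
    arguments: list = field(default_factory=list)

@dataclass
class Derivation:
    """One execution of a transformation: how a product was (or can be) made."""
    transformation: Transformation
    inputs: list          # logical file names consumed
    outputs: list         # logical file names produced
    parameters: dict = field(default_factory=dict)

# Example: declaring how a calibrated-hits file can be (re)derived on demand.
calibrate = Transformation("calibrate_hits", "/usr/local/bin/calibrate")
derivation = Derivation(calibrate,
                        inputs=["hits.raw"],
                        outputs=["hits.calibrated"],
                        parameters={"calib_version": "v3"})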

34
Chimera Application: SDSS Analysis
[Galaxy cluster data → size distribution, produced
with the Chimera Virtual Data System, the GriPhyN
Virtual Data Toolkit, and the iVDGL Data Grid (many CPUs)]
35
Virtual Data and LHC Computing
  • US-CMS
  • Chimera prototype tested with CMS MC (200K
    events)
  • Currently integrating Chimera into standard CMS
    production tools
  • Integrating virtual data into Grid-enabled
    analysis tools
  • US-ATLAS
  • Integrating Chimera into ATLAS software
  • HEPCAL document includes first virtual data use
    cases
  • Very basic cases, need elaboration
  • Discuss with LHC experiments: requirements, scope,
    technologies
  • New ITR proposal to NSF ITR program ($15M)
  • Dynamic Workspaces for Scientific Analysis
    Communities
  • Continued progress requires collaboration with CS
    groups
  • Distributed scheduling, workflow optimization, ...
  • Need collaboration with CS to develop robust tools

36
iVDGL Goals and Context
  • International Virtual-Data Grid Laboratory
  • A global Grid laboratory (US, EU, E. Europe,
    Asia, S. America, ...)
  • A place to conduct Data Grid tests at scale
  • A mechanism to create common Grid infrastructure
  • A laboratory for other disciplines to perform
    Data Grid tests
  • A focus of outreach efforts to small institutions
  • Context of iVDGL in US-LHC computing program
  • Develop and operate proto-Tier2 centers
  • Learn how to do Grid operations (GOC)
  • International participation
  • DataTag
  • UK e-Science programme supports 6 CS Fellows per
    year in the U.S.

37
US-iVDGL Sites (Spring 2003)
  • Partners?
  • EU
  • CERN
  • Brazil
  • Australia
  • Korea
  • Japan

38
US-CMS Grid Testbed
39
US-CMS Testbed Success Story
  • Production Run for Monte Carlo data production
  • Assigned 1.5 million events for eGamma Bigjets
  • 500 sec per event on a 750 MHz processor, for all
    production stages from simulation to ntuple
  • 2 months continuous running across 5 testbed
    sites
  • Demonstrated at Supercomputing 2002
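
A back-of-envelope sketch of the scale of that run (the event count, time per event, and run length are from the bullets above; the CPU-years and average CPU count are derived, approximate figures):

events        = 1.5e6     # assigned events (eGamma Bigjets)
sec_per_event = 500       # on a 750 MHz processor, all production stages
run_days      = 60        # "2 months continuous running"

cpu_seconds = events * sec_per_event                # 7.5e8 CPU-seconds
cpu_years   = cpu_seconds / (365 * 24 * 3600)       # ~24 CPU-years of work
avg_cpus    = cpu_seconds / (run_days * 24 * 3600)  # ~145 CPUs busy on average

print(f"~{cpu_years:.0f} CPU-years of processing,")
print(f"equivalent to ~{avg_cpus:.0f} CPUs running continuously for 2 months")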

40
Creation of WorldGrid
  • Joint iVDGL/DataTag/EDG effort
  • Resources from both sides (15 sites)
  • Monitoring tools (Ganglia, MDS, NetSaint, ...)
  • Visualization tools (Nagios, MapCenter, Ganglia)
  • Applications: ScienceGrid
  • CMS: CMKIN, CMSIM
  • ATLAS: ATLSIM
  • Submit jobs from US or EU
  • Jobs can run on any cluster
  • Demonstrated at IST2002 (Copenhagen)
  • Demonstrated at SC2002 (Baltimore)

41
WorldGrid Sites
42
Grid Coordination
43
U.S. Project Coordination: Trillium
  • Trillium = GriPhyN + iVDGL + PPDG
  • Large overlap in leadership, people, experiments
  • Driven primarily by HENP, particularly LHC
    experiments
  • Benefit of coordination
  • Common software base & packaging: VDT + PACMAN
  • Collaborative / joint projects: monitoring,
    demos, security, ...
  • Wide deployment of new technologies, e.g. Virtual
    Data
  • Stronger, broader outreach effort
  • Forum for US Grid projects
  • Joint view, strategies, meetings and work
  • Unified entity to deal with EU & other Grid
    projects

44
International Grid Coordination
  • Global Grid Forum (GGF)
  • International forum for general Grid efforts
  • Many working groups, standards definitions
  • Close collaboration with EU DataGrid (EDG)
  • Many connections with EDG activities
  • HICB: HEP Inter-Grid Coordination Board
  • Non-competitive forum, strategic issues,
    consensus
  • Cross-project policies, procedures and
    technology, joint projects
  • HICB-JTB: Joint Technical Board
  • Definition, oversight and tracking of joint
    projects
  • GLUE interoperability group
  • Participation in LHC Computing Grid (LCG)
  • Software & Computing Committee (SC2)
  • Project Execution Board (PEB)
  • Grid Deployment Board (GDB)

45
HEP and International Grid Projects
  • HEP continues to be the strongest science driver
  • (In collaboration with computer scientists)
  • Many national and international initiatives
  • LHC a particularly strong driving function
  • US-HEP committed to working with international
    partners
  • Many networking initiatives with EU colleagues
  • Collaboration on LHC Grid Project
  • Grid projects driving & linked to network
    developments
  • DataTag, SCIC, US-CERN link, Internet2
  • New partners being actively sought
  • Korea, Russia, China, Japan, Brazil, Romania, ...
  • Participate in US-CMS and US-ATLAS Grid testbeds
  • Link to WorldGrid, once some software is fixed

46
New Grid Efforts
47
An Inter-Regional Center for High Energy Physics
Research and Educational Outreach (CHEPREO) at
Florida International University
  • Status
  • Proposal submitted Dec. 2002
  • Presented to NSF review panel
  • Project Execution Plan submitted
  • Funding in June?
  • E/O Center in Miami area
  • iVDGL Grid Activities
  • CMS Research
  • AMPATH network (S. America)
  • Intl Activities (Brazil, etc.)

48
A Global Grid Enabled Collaboratory for
Scientific Research (GECSR)
  • $4M ITR proposal from
  • Caltech (HN PI, JB Co-PI)
  • Michigan (Co-PI, Co-PI)
  • Maryland (Co-PI)
  • Plus senior personnel from
  • Lawrence Berkeley Lab
  • Oklahoma
  • Fermilab
  • Arlington (U. Texas)
  • Iowa
  • Florida State
  • First Grid-enabled Collaboratory
  • Tight integration between
  • Science of Collaboratories
  • Globally scalable work environment
  • Sophisticated collaborative tools (VRVS, VNC
    Next-Gen)
  • Agent-based monitoring & decision support system
    (MonALISA)
  • Initial targets are the global HENP
    collaborations, but GECSR is expected to be
    widely applicable to other large-scale
    collaborative scientific endeavors
  • Giving scientists from all world regions the
    means to function as full partners in the process
    of search and discovery

49
Large ITR Proposal: $15M
Dynamic Workspaces: Enabling Global Analysis
Communities
50
UltraLight Proposal to NSF
  • 10 Gb/s network
  • Caltech, UF, FIU, UM, MIT
  • SLAC, FNAL
  • Intl partners
  • Cisco
  • Applications
  • HEP
  • VLBI
  • Radiation Oncology
  • Grid Projects

51
GLORIAD
  • New 10 Gb/s network linking US-Russia-China
  • Plus Grid component linking science projects
  • H. Newman, P. Avery participating
  • Meeting at NSF April 14 with US-Russia-China
    reps.
  • HEP people (Hesheng, et al.)
  • Broad agreement that HEP can drive Grid portion
  • More meetings planned

52
Summary
  • Progress on many fronts in PPDG/GriPhyN/iVDGL
  • Packaging: Pacman & VDT
  • Testbeds (development and production)
  • Major demonstration projects
  • Productions based on Grid tools using iVDGL
    resources
  • WorldGrid providing excellent experience
  • Excellent collaboration with EU partners
  • Building links to our Asian and other partners
  • Excellent opportunity to build lasting
    infrastructure
  • Looking to collaborate with more international
    partners
  • Testbeds, monitoring, deploying VDT more widely
  • New directions
  • Virtual data: a powerful paradigm for LHC
    computing
  • Emphasis on Grid-enabled analysis

53
Grid References
  • Grid Book
  • www.mkp.com/grids
  • Globus
  • www.globus.org
  • Global Grid Forum
  • www.gridforum.org
  • PPDG
  • www.ppdg.net
  • GriPhyN
  • www.griphyn.org
  • iVDGL
  • www.ivdgl.org
  • TeraGrid
  • www.teragrid.org
  • EU DataGrid
  • www.eu-datagrid.org