The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration

Slides: 51
Provided by: annche
Learn more at: http://www.mcs.anl.gov

Title: The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration


1
The Virtual Data Grid: A New Model and
Architecture for Data-Intensive Collaboration
  • Summer Grid 2004
  • UT Brownsville South Padre Island Center
  • 24 June 2004
  • Mike Wilde
  • Argonne National Laboratory
  • Mathematics and Computer Science Division

2
GriPhyN: Grid Physics Network Mission
  • Enhance scientific productivity through
    discovery and processing of datasets, using the
    grid as a scientific workstation
  • Virtual Data enables this approach by creating
    datasets from workflow recipes and recording
    their provenance.
  • GriPhyN works to cross the chasm: application
    and computer scientists create and field-test
    paradigms and toolkits together

3
Acknowledgements: Virtual Data is a Large Team
Effort
  • The Chimera Virtual Data System is the work of
    Ian Foster, Jens Voeckler, Mike Wilde and Yong
    Zhao
  • The Pegasus Planner is the work of Ewa Deelman,
    Gaurang Mehta, and Karan Vahi
  • Applications described are the work of many
    people, including James Annis, Rick Cavanaugh,
    Dan Engh, Rob Gardner, Albert Lazzarini, Natalia
    Maltsev, Marge Bardeen, and their wonderful teams

4
Virtual Data Scenario
Manage workflow
On-demand data generation
Update workflow following changes
Explain provenance, e.g. for file8
psearch -t 10 -i file3 file4 file5 -o file8
summarize -t 10 -i file6 -o file7
reformat -f fz -i file2 -o file3 file4 file5
conv -l esd -o aod -i file2 -o file6
simulate -t 10 -o file1 file2
5
Virtual Data: Describes analysis workflow
[Diagram: workflow graph. Program nodes: simulate t 10, reformat f fz, conv I esd o aod, summarize t 10, psearch t 10. File nodes shown: file1, file2, file6, file7, file8. Label: Requested dataset]
  • The recorded virtual data recipe here is:
  • Files: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2
  • Programs: 8 < psearch, 7 < summarize, (3,4,5) <
    reformat, 6 < conv, (1,2) < simulate
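
The recipe above can be written down as a small lookup table. The following Python is only an illustrative sketch, not part of Chimera: the names `RECIPE`, `lineage`, and `programs_for` are hypothetical, and the file and program names are copied from the slide.

```python
# Hypothetical encoding of the slide's recipe:
# output file -> (program that writes it, files it is derived from).
RECIPE = {
    "file1": ("simulate", []),
    "file2": ("simulate", []),
    "file3": ("reformat", ["file2"]),
    "file4": ("reformat", ["file2"]),
    "file5": ("reformat", ["file2"]),
    "file6": ("conv", ["file2"]),
    "file7": ("summarize", ["file6"]),
    "file8": ("psearch", ["file1", "file3", "file4", "file5", "file7"]),
}


def lineage(target, recipe=RECIPE):
    """Return every file that `target` depends on, directly or indirectly."""
    seen = set()
    stack = list(recipe[target][1])
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(recipe[name][1])
    return seen


def programs_for(target, recipe=RECIPE):
    """Return the programs needed to produce `target` from scratch."""
    return {recipe[name][0] for name in lineage(target, recipe) | {target}}


if __name__ == "__main__":
    print(sorted(lineage("file8")))
    print(sorted(programs_for("file7")))
```

Asking for the lineage of file8 walks the table backwards, which is the "explain provenance" query from the scenario slide.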

6
Virtual Data: Describes analysis workflow
[Diagram: workflow graph. Program nodes: simulate t 10, reformat f fz, conv I esd o aod, summarize t 10, psearch t 10. File nodes shown: file1, file2, file6, file7, file8. Label: Requested file]
  • To re-create file8, step 1:
  • simulate > file1, file2

7
Virtual Data: Describes analysis workflow
[Diagram: workflow graph. Program nodes: simulate t 10, reformat f fz, conv I esd o aod, summarize t 10, psearch t 10. File nodes shown: file1, file2, file6, file7, file8. Label: Requested file]
  • To re-create file8, step 2:
  • files 3, 4, 5, 6 are derived from file2
  • reformat > file3, file4, file5
  • conv > file6

8
Virtual Data: Describes analysis workflow
[Diagram: workflow graph. Program nodes: simulate t 10, reformat f fz, conv I esd o aod, summarize t 10, psearch t 10. File nodes shown: file1, file2, file6, file7, file8. Label: Requested file]
  • To re-create file8, step 3:
  • file7 depends on file6
  • summarize > file7

9
Virtual Data: Describes analysis workflow
[Diagram: workflow graph. Labels shown: psearch t 10, simulate t 10, summarize t 10, file7, file8. Label: Requested file]
  • To re-create file8, final step:
  • file8 depends on files 1, 3, 4, 5, 7
  • psearch < file1, file3, file4, file5, file7 >
    file8
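
The four steps walked through on the preceding slides amount to a topological ordering of the file dependencies. A minimal sketch follows, assuming Python 3.9+ for the standard-library `graphlib`; `DEPENDS_ON`, `PRODUCER`, and `plan` are hypothetical names, and this is not how Chimera or Pegasus actually plans.

```python
from graphlib import TopologicalSorter

# File dependencies and producing programs, copied from the slides.
DEPENDS_ON = {
    "file1": set(),
    "file2": set(),
    "file3": {"file2"},
    "file4": {"file2"},
    "file5": {"file2"},
    "file6": {"file2"},
    "file7": {"file6"},
    "file8": {"file1", "file3", "file4", "file5", "file7"},
}
PRODUCER = {
    "file1": "simulate", "file2": "simulate",
    "file3": "reformat", "file4": "reformat", "file5": "reformat",
    "file6": "conv", "file7": "summarize", "file8": "psearch",
}


def plan(target):
    """Return a list of steps; each step lists programs that can run together."""
    # Collect only the files the target actually needs.
    needed, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(DEPENDS_ON[name])

    sorter = TopologicalSorter({name: DEPENDS_ON[name] for name in needed})
    sorter.prepare()
    steps = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # files whose inputs all exist
        steps.append(sorted({PRODUCER[name] for name in ready}))
        sorter.done(*ready)
    return steps


if __name__ == "__main__":
    for number, programs in enumerate(plan("file8"), start=1):
        print(number, programs)
```

For file8 this yields four steps: simulate; then conv and reformat; then summarize; then psearch.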

10
Grid3: The Laboratory
Supported by the National Science Foundation and
the Department of Energy.
11
VDL: Virtual Data Language Describes Data
Transformations
  • Transformation: an abstract template of a program
    invocation, similar to a "function definition"
  • Derivation: a function call to a Transformation.
    Derivations store both past and future: a record
    of how data products were generated, and a recipe
    for how data products can be generated
  • Invocation: a record of a Derivation execution
  • These XML documents reside in a virtual data
    catalog (VDC), a relational database
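
The three kinds of VDL object can be pictured as plain records. This is a hedged sketch in Python, not the actual VDC schema: the class fields, and in particular `exit_status` and `wall_seconds`, are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Transformation:
    """Abstract template of a program invocation (like a function definition)."""
    name: str
    inputs: list   # formal input argument names
    outputs: list  # formal output argument names


@dataclass
class Derivation:
    """A call to a Transformation with actual files bound to its formals."""
    name: str
    transformation: Transformation
    bindings: dict  # formal argument name -> logical file name

    def __post_init__(self):
        formals = set(self.transformation.inputs) | set(self.transformation.outputs)
        if set(self.bindings) != formals:
            raise ValueError("bindings must cover exactly the formal arguments")

    def input_files(self):
        return [self.bindings[arg] for arg in self.transformation.inputs]

    def output_files(self):
        return [self.bindings[arg] for arg in self.transformation.outputs]


@dataclass
class Invocation:
    """Record of one execution of a Derivation (fields are hypothetical)."""
    derivation: Derivation
    exit_status: int
    wall_seconds: float


if __name__ == "__main__":
    tr1 = Transformation("tr1", inputs=["a1"], outputs=["a2"])
    x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
    run = Invocation(x1, exit_status=0, wall_seconds=1.5)
    print(x1.input_files(), x1.output_files(), run.exit_status)
```

A Derivation with no Invocation is a recipe for data that could be produced; one with an Invocation is a record of data that was produced.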

12
VDL Describes Workflow via Data Dependencies
  • TR tr1(in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2}; }
  • TR tr2(in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2}; }
  • DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
  • DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});

[Diagram: file1 → x1 → file2 → x2 → file3]
13
Workflow example
  • Graph structure: fan-in and fan-out; "left" and
    "right" can run in parallel
  • Needs external input file, located via replica
    catalog
  • Data file dependencies form the graph structure

[Diagram: preprocess → findrange, findrange → analyze]
14
Complete VDL workflow
  • Generate appropriate derivations
  • DV top->preprocess( b=[ @{out:"f.b1"},
    @{out:"f.b2"} ], a=@{in:"f.a"} );
  • DV left->findrange( b=@{out:"f.c1"},
    a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left",
    p="0.5" );
  • DV right->findrange( b=@{out:"f.c2"},
    a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );
  • DV bottom->analyze( b=@{out:"f.d"},
    a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
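
Because each derivation names its input and output files, the workflow graph can be inferred by matching one derivation's outputs to another's inputs. The sketch below illustrates that idea in Python. The `DERIVATIONS` table is a hand transcription of the four derivations above, and `build_graph`, `ancestors`, and `can_run_in_parallel` are hypothetical helper names.

```python
# Inputs and outputs of the four derivations, transcribed from the slide.
DERIVATIONS = {
    "top":    {"in": ["f.a"],          "out": ["f.b1", "f.b2"]},
    "left":   {"in": ["f.b1", "f.b2"], "out": ["f.c1"]},
    "right":  {"in": ["f.b1", "f.b2"], "out": ["f.c2"]},
    "bottom": {"in": ["f.c1", "f.c2"], "out": ["f.d"]},
}


def build_graph(derivations):
    """Return (parents, external): parent derivations and unproduced inputs."""
    producer = {f: dv for dv, io in derivations.items() for f in io["out"]}
    parents = {dv: set() for dv in derivations}
    external = set()
    for dv, io in derivations.items():
        for f in io["in"]:
            if f in producer:
                parents[dv].add(producer[f])
            else:
                external.add(f)  # must already exist, e.g. found via a catalog
    return parents, external


def ancestors(node, parents):
    """Return all derivations that must finish before `node` can start."""
    seen, stack = set(), list(parents[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def can_run_in_parallel(a, b, parents):
    """Two derivations are independent if neither is an ancestor of the other."""
    return a != b and a not in ancestors(b, parents) and b not in ancestors(a, parents)


if __name__ == "__main__":
    parents, external = build_graph(DERIVATIONS)
    print(parents, external)
    print(can_run_in_parallel("left", "right", parents))
```

No edge is ever declared explicitly: the fan-out from top and the fan-in to bottom both fall out of the file names alone.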

15
Compound Transformations Enable Functional
Abstractions
  • Compound TR encapsulates an entire sub-graph
  • TR rangeAnalysis( in fa, p1, p2, out fd,
      io fc1, io fc2, io fb1, io fb2 ) {
      call preprocess( a=${fa},
        b=[ ${out:fb1}, ${out:fb2} ] );
      call findrange( a1=${in:fb1}, a2=${in:fb2},
        name="LEFT", p=${p1}, b=${out:fc1} );
      call findrange( a1=${in:fb1}, a2=${in:fb2},
        name="RIGHT", p=${p2}, b=${out:fc2} );
      call analyze( a=[ ${in:fc1}, ${in:fc2} ],
        b=${fd} );
    }
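
A compound transformation behaves like a function whose body is a list of calls: invoking it substitutes actual arguments for the formals in each inner call. A rough Python sketch of that expansion follows. The `$name` placeholder convention and the names `RANGE_ANALYSIS`, `substitute`, and `expand` are inventions for this example, not VDL features.

```python
# The compound from the slide, with "$name" marking a reference to a formal.
RANGE_ANALYSIS = {
    "formals": ["fa", "p1", "p2", "fd", "fc1", "fc2", "fb1", "fb2"],
    "calls": [
        ("preprocess", {"a": "$fa", "b": ["$fb1", "$fb2"]}),
        ("findrange", {"a1": "$fb1", "a2": "$fb2", "name": "LEFT",
                       "p": "$p1", "b": "$fc1"}),
        ("findrange", {"a1": "$fb1", "a2": "$fb2", "name": "RIGHT",
                       "p": "$p2", "b": "$fc2"}),
        ("analyze", {"a": ["$fc1", "$fc2"], "b": "$fd"}),
    ],
}


def substitute(value, actuals):
    """Replace "$formal" references with actual values; keep literals as-is."""
    if isinstance(value, list):
        return [substitute(item, actuals) for item in value]
    if isinstance(value, str) and value.startswith("$"):
        return actuals[value[1:]]
    return value


def expand(compound, actuals):
    """Expand one call of a compound into its list of simple calls."""
    missing = set(compound["formals"]) - set(actuals)
    if missing:
        raise KeyError("missing actual arguments: %s" % sorted(missing))
    return [
        (callee, {key: substitute(val, actuals) for key, val in args.items()})
        for callee, args in compound["calls"]
    ]


if __name__ == "__main__":
    actuals = {"fa": "f.00000", "fb2": "f.00001", "fb1": "f.00002",
               "fc2": "f.00003", "fc1": "f.00004", "fd": "f.00005",
               "p1": "0", "p2": "100"}
    for callee, args in expand(RANGE_ANALYSIS, actuals):
        print(callee, args)
```

The caller sees a single abstraction with eight arguments, while a planner can still see the four-node sub-graph inside it.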

16
Derivation scripts
  • Representation of virtual data provenance
  • DV d1->diamond( fd=@{out:"f.00005"},
    fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"},
    fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"},
    fa=@{io:"f.00000"}, p2="100", p1="0" );
  • DV d2->diamond( fd=@{out:"f.0000B"},
    fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"},
    fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"},
    fa=@{io:"f.00006"}, p2="141.42135623731", p1="0"
    );
  • ...
  • DV d70->diamond( fd=@{out:"f.001A3"},
    fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"},
    fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"},
    fa=@{io:"f.0019E"}, p2="800", p1="18" );
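
Derivation scripts like these are usually generated rather than typed. In the three derivations shown, derivation number k uses six consecutive file numbers starting at 6(k-1), written as five uppercase hex digits. The Python sketch below reproduces that naming. The helper names are hypothetical, the emitted VDL punctuation is a reconstruction, and the `p1`/`p2` values are left to the caller because the slide does not give the full parameter schedule.

```python
# Order in which file numbers are assigned within one derivation (from the slide).
ARG_ORDER = ["fa", "fb2", "fb1", "fc2", "fc1", "fd"]
# Link direction used for each argument on the slide.
LINK = {"fd": "out", "fc1": "io", "fc2": "io", "fb1": "io", "fb2": "io", "fa": "io"}
# Order in which arguments are printed on the slide.
PRINT_ORDER = ["fd", "fc1", "fc2", "fb1", "fb2", "fa"]


def file_names(index):
    """Map each argument to its logical file name for derivation `index` (1-based)."""
    base = 6 * (index - 1)
    return {arg: "f.%05X" % (base + offset) for offset, arg in enumerate(ARG_ORDER)}


def derivation_line(index, p1, p2):
    """Format one derivation in the (reconstructed) VDL text syntax."""
    files = file_names(index)
    args = ['%s=@{%s:"%s"}' % (arg, LINK[arg], files[arg]) for arg in PRINT_ORDER]
    args += ['p2="%s"' % p2, 'p1="%s"' % p1]
    return "DV d%d->diamond( %s );" % (index, ", ".join(args))


if __name__ == "__main__":
    print(derivation_line(1, p1="0", p2="100"))
    print(derivation_line(70, p1="18", p2="800"))
```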

17
Invocation Provenance
Completion status and resource usage
Attributes of executable transformation
Attributes of input and output files
18
Executing VDL Workflows
[Diagram labels: Grid Info; Abstract workflow; Global planner (Pegasus); JIT planner (research); local planner; Concrete DAG; DAGMan / Condor-G]
GriPhyN-iVDGL Applications to date
  • ATLAS, BTeV, CMS: HEP event simulation
  • Argonne Computational Biology: sequence
    comparison and result capture
  • LIGO: pulsar search
  • Sloan Digital Sky Survey: cluster finding;
    near-earth object search planned
  • QuarkNet: science education; cosmic rays, HEP
    analysis

20
Genome Analysis Database Update
Application work by Alex Rodriguez, Dina Sulakhe,
Natalia Maltsev, Argonne MCS. Described in
GGF10 workshop paper.
21
Virtual Data Example: Galaxy Cluster Search
[Figure labels: DAG; Sloan Data; Galaxy cluster size distribution]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab;
Michael Milligan, Yong Zhao,
University of Chicago. Described in SC2002 paper
22
Cluster Search: Workflow Graph and Execution Trace
[Figure: workflow jobs vs. time]
23
Virtual Data Application:
High Energy Physics Data Analysis
[Figure parameters: mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000]
Work and slide by Rick Cavanaugh and Dimitri
Bourilkov, University of Florida. Ref: CHEP 2002
paper
24
Using Virtual Data for Science Education
  • The QuarkNet-Trillium collaboration is using Grid
    virtual data tools and methods to enrich science
    education
  • It's an experiment to give students the means to:
  • discover and apply datasets, algorithms, and data
    analysis methods
  • collaborate by developing new ones and sharing
    results and observations
  • learn data analysis methods that will ready and
    excite them for a scientific career
  • And in later steps, we may actually use the Grid!

25
QuarkNet Virtual Data Project
[Diagram: three sites, each with a cosmic ray detector and locally collected data: Central High School, Reston, Virginia; Foothills High School, Great Falls, Montana; Yale / Middletown High Collaboration, Hartford, Connecticut. Other labels: QuarkNet Virtual Data Portal; Virtual Data Toolkit; Virtual Data Catalog; Standard Web access; Student data, algorithms, results, notes, and communications]
Student-teacher teams sharing data, methods,
programs, and knowledge. Enabling
collaboration-intensive science discovery with
virtual data tools and methods
26
Detector Performance Study
27
Example: BTeV Event Simulation
28
Support for Search and Discovery
  • Goal: make it as easy to use as Google
  • More advanced capabilities lie below the surface
    (as with Google)
  • Understand the structure and meaning of the
    datasets and their fields.
  • Advanced search, using SQL-like queries
  • Find both DATA and TRANSFORMATIONS
  • Create datasets from queries
  • Perform calculations on datasets, filtering
    results to look for patterns
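
To make "SQL-like queries" over both data and transformations concrete, here is a toy catalog in Python using the standard-library `sqlite3` module. The schema, the metadata keys, and the sample values are invented for illustration (the object names are borrowed from the QuarkNet example later in the deck). The real virtual data catalog is organized differently.

```python
import sqlite3


def build_catalog():
    """Create a tiny in-memory catalog with made-up metadata."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE objects (name TEXT PRIMARY KEY, kind TEXT)")
    db.execute("CREATE TABLE metadata (name TEXT, key TEXT, value TEXT)")
    db.executemany("INSERT INTO objects VALUES (?, ?)", [
        ("run1a.event", "data"),
        ("run1a.esm", "data"),
        ("ECalEnergySum", "transformation"),
    ])
    db.executemany("INSERT INTO metadata VALUES (?, ?, ?)", [
        ("run1a.event", "detector", "ecal"),
        ("run1a.event", "stage", "raw"),
        ("run1a.esm", "detector", "ecal"),
        ("ECalEnergySum", "detector", "ecal"),
    ])
    return db


def search(db, key, value, kind=None):
    """Find objects by one metadata attribute, optionally restricted by kind."""
    sql = ("SELECT o.name, o.kind FROM objects o "
           "JOIN metadata m ON m.name = o.name "
           "WHERE m.key = ? AND m.value = ?")
    params = [key, value]
    if kind is not None:
        sql += " AND o.kind = ?"
        params.append(kind)
    sql += " ORDER BY o.name"
    return db.execute(sql, params).fetchall()


if __name__ == "__main__":
    catalog = build_catalog()
    print(search(catalog, "detector", "ecal"))
    print(search(catalog, "detector", "ecal", kind="transformation"))
```

Keeping data and transformations in one searchable table means a single query can return both.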

29
Search by Metadata
30
Deriving a new dataset to find the mass of the Z
particle
31
Workflow for missing energy calculations
32
Virtual Provenance: list of derivations and files
<job id="ID000001" namespace="Quarknet.HEPSRCH"
     name="ECalEnergySum" level="5"
     dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
  <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
  <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
</job>
<job id="ID000002" namespace="Quarknet.HEPSRCH"
     name="ECalEnergySum" level="7"
     dv-namespace="Quarknet.HEPSRCH" ...
  <argument><filename file="electron10GeV.event"/> <filename file="electron10GeV.sum"/></argument>
  ...
</job>
<job id="ID000014" namespace="Quarknet.HEPSRCH"
     name="ReconTotalEnergy" level="3" ...
  <argument><filename file="run1a.mis"/> <filename file="run1a.ecal"/> ...
  <uses file="run1a.muon" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.total" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.ecal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.hcal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.mis" link="input" dontRegister="false" dontTransfer="false"/>
</job>
<!-- list of all files used -->
<filename file="ecal.pct" link="inout"/>
<filename file="electron10GeV.avg" link="inout"/>
<filename file="electron10GeV.sum" link="inout"/>
<filename file="hcal.pct" link="inout"/>
(excerpted for display)
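
A job list of this shape can be read with any XML parser. The Python sketch below uses the standard-library `xml.etree.ElementTree` on the one job that appears complete in the excerpt. The `<workflow>` wrapper element and the function name `summarize_jobs` are assumptions added for the example, since the excerpt does not show the document root.

```python
import xml.etree.ElementTree as ET

# First job from the slide, wrapped in a hypothetical root element.
SAMPLE = """
<workflow>
  <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5"
       dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
    <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
    <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
    <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
  </job>
</workflow>
"""


def summarize_jobs(xml_text):
    """Return {job id: {"name", "derivation", "inputs", "outputs"}}."""
    root = ET.fromstring(xml_text.strip())
    jobs = {}
    for job in root.iter("job"):
        uses = job.findall("uses")
        jobs[job.get("id")] = {
            "name": job.get("name"),
            "derivation": job.get("dv-name"),
            "inputs": sorted(u.get("file") for u in uses if u.get("link") == "input"),
            "outputs": sorted(u.get("file") for u in uses if u.get("link") == "output"),
        }
    return jobs


if __name__ == "__main__":
    for job_id, info in summarize_jobs(SAMPLE).items():
        print(job_id, info)
```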
33
Virtual Provenance in XML: control flow graph
<child ref="ID000003"> <parent ref="ID000002"/> </child>
<child ref="ID000004"> <parent ref="ID000003"/> </child>
<child ref="ID000005"> <parent ref="ID000004"/> <parent ref="ID000001"/> ...
<child ref="ID000009"> <parent ref="ID000008"/> </child>
<child ref="ID000010"> <parent ref="ID000009"/> <parent ref="ID000006"/> ...
<child ref="ID000012"> <parent ref="ID000011"/> </child>
<child ref="ID000013"> <parent ref="ID000011"/> </child>
<child ref="ID000014"> <parent ref="ID000010"/> <parent ref="ID000012"/>
  <parent ref="ID000013"/> </child>
(excerpted for display)
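
The child/parent elements encode the control-flow graph directly, so a valid run order is a topological sort of them. A minimal Python sketch follows, assuming Python 3.9+ for `graphlib`. It uses only entries that appear complete in the excerpt, and the `<workflow>` wrapper and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter

# Complete entries from the slide, wrapped in a hypothetical root element.
SAMPLE = """
<workflow>
  <child ref="ID000012"> <parent ref="ID000011"/> </child>
  <child ref="ID000013"> <parent ref="ID000011"/> </child>
  <child ref="ID000014"> <parent ref="ID000010"/> <parent ref="ID000012"/>
    <parent ref="ID000013"/> </child>
</workflow>
"""


def parent_map(xml_text):
    """Return {child job id: set of parent job ids}."""
    root = ET.fromstring(xml_text.strip())
    return {
        child.get("ref"): {p.get("ref") for p in child.findall("parent")}
        for child in root.iter("child")
    }


def run_order(parents):
    """Return one ordering in which every job follows all of its parents."""
    return list(TopologicalSorter(parents).static_order())


if __name__ == "__main__":
    graph = parent_map(SAMPLE)
    print(graph)
    print(run_order(graph))
```

Jobs that appear only as parents, such as ID000010 and ID000011 here, are still included in the ordering, and more than one valid order can exist.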
34
And writing the results up in a poster
35
Poster describing analysis
36
Using active data from Web Services
37
(No Transcript)
38
(No Transcript)
39
(No Transcript)
40
Levels of Interaction
  • Skins: use it like a calculator, experiment
    with scenarios and settings, use virtual data
    like a log book to document, assess, and share
    parameter values.
  • Blocks: re-assemble workflow pipelines using
    existing ones as patterns and pre-developed
    transforms as building blocks
  • Code: write new transforms in a variety of
    languages and data models

41
Observations
  • A provenance approach based on interface
    definition and data flow declaration fits well
    with Grid requirements for code and data
    transportability and heterogeneity
  • Working in a provenance-managed system has many
    fringe benefits: uniformity, precision,
    structure, communication, documentation
  • The real world is messy: finding the right
    abstractions is hard, and handling legacy
    applications is even harder

42
Vision for Provenance in the Large
  • Universal knowledge management and production
    systems
  • Vendors integrate the provenance tracking
    protocol into data processing products
  • Ability to run anywhere in the Grid

43
Virtual Data Grid Vision
44
Planned Dataset Model
<FORM> <Title> ... </FORM>
File
Set of files
Object closure
XML Element
Relational query or spreadsheet range
Set of files with relational index
New user-defined dataset type
Speculative model described in CIDR 2003 paper by
Foster, Voeckler, Wilde and Zhao
45
Planned Dataset Type Model
FileDataset
Representational
File
FileSet
Logical
MultiFileSet
TarFileSet
EventCollection
(Nonleaf types are superclasses)
RawEventSet
SimulatedEventSet
MonteCarloSimulation
DiscreteEventSimulation
46
Provenance Server Plans
  • OGSA-based Grid services
  • Discovery, security, resource management
  • Supports code and data discovery and workflow
    management
  • Object names (TR, DS, TY, DV, IV) can be used as
    global cross-server links
  • Derivations can reference remote transformations
    and datasets
  • Structured object namespaces and object-level
    access control enable large VO collaboration
  • Generalize transforms to describe service calls,
    database queries and language interpreters

47
Provenance Hyperlinks
48
Indexing Servers to Support Discovery
49
For Information and Software
  • Virtual Data System
  • www.griphyn.org/chimera - Chimera Virtual Data
    System: overview, papers, software
  • Grids and Grid Software
  • www.ivdgl.org/grid2003 - Using Grid3
  • www.griphyn.org/vdt - Virtual Data Toolkit
  • www.globus.org - The Globus Toolkit
  • www.cs.wisc.edu/condor - The Condor Project
  • www.ppdg.net - Particle Physics Data Grid

50
Acknowledgements
GriPhyN, iVDGL, and QuarkNet (in part) are
supported by the National Science Foundation.
The Globus Alliance, PPDG, and QuarkNet are
supported in part by the US Department of Energy,
Office of Science; by the NASA Information Power
Grid program; and by IBM.