The Chimera Virtual Data System - PowerPoint PPT Presentation

The Chimera Virtual Data System. www.griphyn.org/chimera. Presented by Mike Wilde ... is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao ... – PowerPoint PPT presentation

1
The Chimera Virtual Data System
  • www.griphyn.org/chimera
  • Presented by Mike Wilde
  • Workflow Workshop
  • 3 December 2003
  • e-Science Institute, Edinburgh

2
Acknowledgements
  • GriPhyN, the Grid Physics Network, is supported
    by the National Science Foundation, Information
    Technology Research Program
  • The Chimera Virtual Data System is the work of
    Ian Foster, Jens Voeckler, Mike Wilde and Yong
    Zhao
  • The Pegasus Planner is the work of Ewa Deelman,
    Gaurang Mehta, and Karan Vahi
  • This talk was also delivered at the Data
    Provenance and Annotation Workshop, 1 Dec 2003

3
The Virtual Data Concept
  • Enhance scientific productivity through:
  • Discovery and application of datasets and
    programs
  • Enabling use of a worldwide data grid as a
    scientific workstation
  • Virtual Data enables this approach by creating
    datasets from workflow recipes and recording
    their provenance.
  • Provenance Virtual Data

4
Provenance System Goals
  • Producing data from transformations with
    uniform, precise data interface descriptions
    enables:
  • Discovery: finding and understanding datasets and
    transformations
  • Workflow: structured paradigm for organizing,
    locating, specifying, producing scientific
    datasets
  • Forming new workflow
  • Building new workflow from existing patterns
  • Managing change
  • Planning: automated, to make the Grid transparent
  • Audit: explanation and validation via provenance

5
Virtual Data Grid Vision
6
Usage Models and Cases
  • Domains where it's valuable (and where it's not)?
    Cost-benefit ratios?
  • Batch models
  • Cluster finding laboratory code and data
    changes, track results.
  • Interactive models
  • Using provenance within interactive dialogs in
    graphical and textual tools
  • Moving back and forth between interactive and
    batch modes
  • Discovery
  • Understand / review / audit
  • Compose
  • Passive: provenance recording
  • Active: provenance declaration

7
Virtual Data Example: Galaxy Cluster Search
DAG
Sloan Data
Galaxy cluster size distribution
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab,
Michael Milligan, Yong Zhao,
University of Chicago
8
Virtual Data Application: High Energy Physics Data Analysis
mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
Work and slide by Rick Cavanaugh and Dimitri
Bourilkov, University of Florida
9
Provenance Scenario
Manage workflow
On-demand data generation
Update workflow following changes
Explain provenance, e.g. for file8
psearch -t 10 -i file1 file3 file4 file5 file7 -o file8
simulate -t 10 -o file1 file2
reformat -f fz -i file2 -o file3 file4 file5
summarize -t 10 -i file6 -o file7
conv -l esd -o aod -i file2 -o file6
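
The "explain provenance for file8" step can be illustrated with a small Python sketch. It reads the file lists after `i` and `o` on each command line as inputs and outputs, which is an assumption about the command syntax; the `STEPS` table and the `explain` helper are hypothetical names, not part of Chimera.

```python
# Each step maps to (inputs, outputs), read off the scenario's command lines.
STEPS = {
    "simulate":  ((), ("file1", "file2")),
    "reformat":  (("file2",), ("file3", "file4", "file5")),
    "conv":      (("file2",), ("file6",)),
    "summarize": (("file6",), ("file7",)),
    "psearch":   (("file1", "file3", "file4", "file5", "file7"), ("file8",)),
}


def producer_of(filename):
    """Return the step that writes `filename`, or None for external inputs."""
    for step, (_, outputs) in STEPS.items():
        if filename in outputs:
            return step
    return None


def explain(filename, seen=None):
    """List the steps that lead to `filename`, each after its prerequisites."""
    seen = [] if seen is None else seen
    step = producer_of(filename)
    if step is None or step in seen:
        return seen
    for parent in STEPS[step][0]:
        explain(parent, seen)
    seen.append(step)
    return seen


print(explain("file8"))
```

The same traversal answers the "on-demand data generation" question: the returned list is one order in which the steps could be re-run to regenerate the file.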
10
Fundamental Units
  • Transformations
  • Interface Declarations
  • Action Declarations
  • Call declaration
  • Invocation
  • Datasets
  • Contents
  • Representation
  • Location

11
VDL: Virtual Data Language Describes Data Transformations
  • Transformation
  • Abstract template of program invocation
  • Similar to "function definition"
  • Derivation
  • Function call to a transformation
  • Stores past and future
  • A record of how data products were generated
  • A recipe of how data products can be generated
  • Invocation
  • Record of a Derivation execution

12
Example Transformation
  • TR t1( out a2, in a1, none pa = "500", none
    env = "100000" ) {
  • argument = "-p "${pa};
  • argument = "-f "${a1};
  • argument = "-x -y";
  • argument stdout = ${a2};
  • profile env.MAXMEM = ${env};
  • }

a1 -> t1 -> a2
13
Example Transformation Calls (Derivations)
  • DV d1->t1( env="20000", pa="600",
    a2=@{out:run1.exp15.T1932.summary},
    a1=@{in:run1.exp15.T1932.raw} );
  • DV d2->t1( a1=@{in:run1.exp16.T1918.raw},
    a2=@{out:run1.exp16.T1918.summary} );
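
The "function definition" and "function call" analogy for transformations and derivations can be sketched in Python. This is only an illustration of how a derivation's actual arguments might be combined with a transformation's defaults; the class names and the `bound_arguments` method are hypothetical and do not reflect Chimera's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Transformation:
    """Abstract template of a program invocation (like a function definition)."""
    name: str
    formals: tuple                      # formal parameter names
    defaults: dict = field(default_factory=dict)


@dataclass
class Derivation:
    """A call to a transformation with actual arguments."""
    name: str
    transformation: Transformation
    actuals: dict

    def bound_arguments(self):
        # Actual arguments override the transformation's defaults.
        bound = {**self.transformation.defaults, **self.actuals}
        missing = [p for p in self.transformation.formals if p not in bound]
        if missing:
            raise ValueError(f"{self.name}: unbound parameters {missing}")
        return bound


t1 = Transformation("t1", ("a2", "a1", "pa", "env"),
                    {"pa": "500", "env": "100000"})
d1 = Derivation("d1", t1, {"env": "20000", "pa": "600",
                           "a2": "run1.exp15.T1932.summary",
                           "a1": "run1.exp15.T1932.raw"})
d2 = Derivation("d2", t1, {"a1": "run1.exp16.T1918.raw",
                           "a2": "run1.exp16.T1918.summary"})

print(d1.bound_arguments()["pa"], d2.bound_arguments()["pa"])
```

Here d1 overrides both defaults, while d2 supplies only the two files and inherits the default values of pa and env.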

14
Workflow from File Dependencies
  • TR tr1(in a1, out a2) {
  • argument stdin = ${a1};
  • argument stdout = ${a2}; }
  • TR tr2(in a1, out a2) {
  • argument stdin = ${a1};
  • argument stdout = ${a2}; }
  • DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
  • DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});

file1 -> x1 -> file2 -> x2 -> file3
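
A minimal Python sketch of how workflow order can be inferred purely from file dependencies: a derivation depends on whichever derivation produces one of its input files. The `DERIVATIONS` table and function names are hypothetical; the sketch uses the standard-library `graphlib` module (Python 3.9+) rather than anything from Chimera.

```python
from graphlib import TopologicalSorter

# Logical file names read and written by each derivation on this slide.
DERIVATIONS = {
    "x1": {"in": ["file1"], "out": ["file2"]},
    "x2": {"in": ["file2"], "out": ["file3"]},
}


def dependency_graph(derivations):
    """Map each derivation to the set of derivations producing its inputs."""
    producer = {lfn: dv for dv, io in derivations.items() for lfn in io["out"]}
    return {
        dv: {producer[lfn] for lfn in io["in"] if lfn in producer}
        for dv, io in derivations.items()
    }


def run_order(derivations):
    """Return derivation names in an order that respects file dependencies."""
    return list(TopologicalSorter(dependency_graph(derivations)).static_order())


print(run_order(DERIVATIONS))
```

No explicit edge between x1 and x2 is declared anywhere; the shared name file2 is what links them.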
15
Example Invocation
Completion status and resource usage
Attributes of executable transformation
Attributes of input and output files
16
Example Workflow
  • Complex structure
  • Fan-in
  • Fan-out
  • "left" and "right" can run in parallel
  • Uses input file
  • Register with RC
  • Complex file dependencies
  • Glues workflow

preprocess
findrange
findrange
analyze
17
Workflow step "preprocess"
  • TR preprocess turns f.a into f.b1 and f.b2
  • TR preprocess( output b[], input a ) {
    argument = "-a top";
    argument = " -i "${input:a};
    argument = " -o " ${output:b}; }
  • Makes use of the "list" feature of VDL
  • Generates 0..N output files.
  • The number of files depends on the caller.

18
Workflow step "findrange"
  • Turns two inputs into one output
  • TR findrange( output b, input a1, input a2,
    none name="findrange", none p="0.0" ) {
    argument = "-a "${name};
    argument = " -i " ${a1} " " ${a2};
    argument = " -o " ${b};
    argument = " -p " ${p}; }
  • Uses the default argument feature

19
Can also use list parameters
  • TR findrange( output b, input a[],
    none name="findrange", none p="0.0" ) {
    argument = "-a "${name};
    argument = " -i " ${" "|a};
    argument = " -o " ${b};
    argument = " -p " ${p}; }

20
Workflow step "analyze"
  • Combines intermediary results
  • TR analyze( output b, input a[] ) {
    argument = "-a bottom";
    argument = " -i " ${a};
    argument = " -o " ${b}; }

21
Complete VDL workflow
  • Generate appropriate derivations
  • DV top->preprocess( b=[ @{out:"f.b1"},
    @{out:"f.b2"} ], a=@{in:"f.a"} );
  • DV left->findrange( b=@{out:"f.c1"},
    a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left",
    p="0.5" );
  • DV right->findrange( b=@{out:"f.c2"},
    a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );
  • DV bottom->analyze( b=@{out:"f.d"},
    a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
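
To illustrate why "left" and "right" can run in parallel, the following Python sketch groups the four derivations into stages based only on their input and output files. The `DIAMOND` table and `parallel_stages` function are hypothetical names for this illustration; the staging uses the standard-library `graphlib` module and is not how Chimera or Pegasus plans a workflow.

```python
from graphlib import TopologicalSorter

# Input and output files of the four derivations in the diamond workflow.
DIAMOND = {
    "top":    {"in": ["f.a"], "out": ["f.b1", "f.b2"]},
    "left":   {"in": ["f.b1", "f.b2"], "out": ["f.c1"]},
    "right":  {"in": ["f.b1", "f.b2"], "out": ["f.c2"]},
    "bottom": {"in": ["f.c1", "f.c2"], "out": ["f.d"]},
}


def parallel_stages(derivations):
    """Group derivations into stages; members of a stage are independent."""
    producer = {lfn: dv for dv, io in derivations.items() for lfn in io["out"]}
    sorter = TopologicalSorter({
        dv: {producer[lfn] for lfn in io["in"] if lfn in producer}
        for dv, io in derivations.items()
    })
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        stages.append(ready)
        sorter.done(*ready)
    return stages


print(parallel_stages(DIAMOND))
```

The middle stage contains both findrange derivations because neither consumes a file the other produces.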

22
Compound Transformations
  • Using compound TR
  • Permits composition of complex TRs from basic
    ones
  • Calls are independent
  • unless linked through LFN
  • A Call is effectively an anonymous derivation
  • Late instantiation at workflow generation time
  • Permits bundling of repetitive workflows
  • Model: function calls nested within a function
    definition

23
Compound Transformations (cont)
  • TR diamond encapsulates diamond workflows
  • TR diamond( out fd, io fc1, io fc2, io fb1, io
    fb2, in fa, p1, p2 ) {
  • call preprocess( a=${fa}, b=[ ${out:fb1},
    ${out:fb2} ] );
  • call findrange( a1=${in:fb1}, a2=${in:fb2},
    name="LEFT", p=${p1}, b=${out:fc1} );
  • call findrange( a1=${in:fb1}, a2=${in:fb2},
    name="RIGHT", p=${p2}, b=${out:fc2} );
  • call analyze( a=[ ${in:fc1}, ${in:fc2} ],
    b=${fd} ); }
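
The idea of late instantiation, where the calls inside a compound TR become concrete only when a derivation supplies actual arguments, can be sketched in Python. The `Ref` marker, the `DIAMOND_CALLS` table, and the `instantiate` function are hypothetical constructs for this illustration, not Chimera's data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Reference to a formal parameter of the enclosing compound TR."""
    formal: str


# The four calls inside TR diamond; plain strings are literal values.
DIAMOND_CALLS = [
    ("preprocess", {"a": Ref("fa"), "b": [Ref("fb1"), Ref("fb2")]}),
    ("findrange", {"a1": Ref("fb1"), "a2": Ref("fb2"), "name": "LEFT",
                   "p": Ref("p1"), "b": Ref("fc1")}),
    ("findrange", {"a1": Ref("fb1"), "a2": Ref("fb2"), "name": "RIGHT",
                   "p": Ref("p2"), "b": Ref("fc2")}),
    ("analyze", {"a": [Ref("fc1"), Ref("fc2")], "b": Ref("fd")}),
]


def resolve(value, actuals):
    """Replace formal-parameter references with the caller's actual values."""
    if isinstance(value, Ref):
        return actuals[value.formal]
    if isinstance(value, list):
        return [resolve(item, actuals) for item in value]
    return value


def instantiate(calls, actuals):
    """Expand a compound TR into concrete (transformation, arguments) steps."""
    return [(tr, {k: resolve(v, actuals) for k, v in args.items()})
            for tr, args in calls]


steps = instantiate(DIAMOND_CALLS, {
    "fa": "f.00000", "fb2": "f.00001", "fb1": "f.00002",
    "fc2": "f.00003", "fc1": "f.00004", "fd": "f.00005",
    "p1": "0", "p2": "100",
})
for step in steps:
    print(step)
```

Each expanded step behaves like an anonymous derivation; the steps are linked only through the file names they share.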

24
Compound Transformations (cont)
  • Multiple DVs allow easy generator scripts
  • DV d1->diamond( fd=@{out:"f.00005"},
    fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"},
    fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"},
    fa=@{io:"f.00000"}, p2="100", p1="0" );
  • DV d2->diamond( fd=@{out:"f.0000B"},
    fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"},
    fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"},
    fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
  • ...
  • DV d70->diamond( fd=@{out:"f.001A3"},
    fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"},
    fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"},
    fa=@{io:"f.0019E"}, p2="800", p1="18" );
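
A generator script of the kind this slide alludes to might look like the Python sketch below. It assumes each derivation uses six consecutive hexadecimal file numbers, which matches the three DVs shown; the exact DV text it emits is an assumed rendering of VDL syntax, and p1 and p2 are passed in by the caller because the slide does not state how they are chosen. The function name `diamond_dv` is hypothetical.

```python
def diamond_dv(index, p1, p2):
    """Render DV d<index>, using six consecutive hex-numbered files."""
    base = (index - 1) * 6
    # File order within one block, lowest number first, as on the slide.
    fa, fb2, fb1, fc2, fc1, fd = (f"f.{base + k:05X}" for k in range(6))
    return (
        f'DV d{index}->diamond( fd=@{{out:"{fd}"}}, '
        f'fc1=@{{io:"{fc1}"}}, fc2=@{{io:"{fc2}"}}, '
        f'fb1=@{{io:"{fb1}"}}, fb2=@{{io:"{fb2}"}}, '
        f'fa=@{{io:"{fa}"}}, p2="{p2}", p1="{p1}" );'
    )


print(diamond_dv(1, "0", "100"))
print(diamond_dv(70, "18", "800"))
```

Looping this function over a table of (p1, p2) pairs would produce the whole family of derivations without writing any of them by hand.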

25
Dataset Requirements
<FORM> <Title> </FORM>
File
Set of files
Object closure
XML Element
Relational query or spreadsheet range
Set of files with relational index
New user-defined dataset type
26
Possible Dataset Type Model
  • Types used for
  • Managing dataset representation
  • Determining argument conformance in invocations
  • Discovery of datasets and transformations
  • Two parallel type hierarchies separate
    representation and semantics
  • Representational organizes and specifies
    families of dataset representation
  • Logical organizes and specifies
    application-specific semantics of datasets

27
Example Dataset Types (Nonleaf Types are Superclasses)
FileDataset
Representational
File
FileSet
Logical
MultiFileSet
TarFileSet
EventCollection
RawEventSet
SimulatedEventSet
MonteCarloSimulation
DiscreteEventSimulation
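
Argument conformance against two parallel type hierarchies could work along the lines of the Python sketch below. The parent/child nesting in `PARENT` is an assumption: the transcript lists the type names but not the tree layout, so the grouping here is a guess made only for illustration. The function names are hypothetical.

```python
# ASSUMED nesting; the transcript gives the names but not the tree structure.
PARENT = {
    # Representational hierarchy
    "File": "FileDataset",
    "FileSet": "FileDataset",
    "MultiFileSet": "FileSet",
    "TarFileSet": "FileSet",
    # Logical hierarchy
    "RawEventSet": "EventCollection",
    "SimulatedEventSet": "EventCollection",
    "MonteCarloSimulation": "SimulatedEventSet",
    "DiscreteEventSimulation": "SimulatedEventSet",
}


def ancestors(type_name):
    """Return the type followed by all of its superclasses."""
    chain = [type_name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def conforms(actual, formal):
    """True if `actual` is `formal` or one of its subtypes."""
    return formal in ancestors(actual)


def argument_conforms(actual, formal):
    """Check a (representational, logical) pair in both dimensions."""
    return conforms(actual[0], formal[0]) and conforms(actual[1], formal[1])


print(argument_conforms(("TarFileSet", "MonteCarloSimulation"),
                        ("FileSet", "SimulatedEventSet")))
```

Keeping representation and semantics in separate hierarchies means a transformation can require, say, any FileSet holding simulated events without caring whether it arrives as a tar archive or as multiple files.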
28
Dataset Representation Descriptor
  • Defines a dataset's physical layout
  • Permits transformations to access datasets
  • Structure is defined by dataset type (examples)
  • File: <lfn> <evt.02>
  • MultiFileSet: <lfn> <evt.03, evt.04, evt05>
  • TarFileSet: <lfn,taropts> <evts.1998, "-b50 -z">
  • Relation: <<odbc><select .>> <server
    name="db.mcs.anl.gov" db="hepdb"
    id="uchep"/><query request="select * from evt
    where eid>2897 and eid<3945" />
  • Stored in dataset catalog
  • Format constrained by DS type def

29
Provenance Schema
describes
describes
Metadata
30
Observations
  • A provenance approach based on interface
    definition and data flow declaration fits well
    with Grid requirements for code and data
    transportability and heterogeneity
  • Working in a provenance-managed system has many
    fringe benefits: uniformity, precision,
    structure, communication, documentation

31
Vision for Provenance in the Large
  • Universal knowledge management and production
    systems
  • Vendors integrate the provenance tracking
    protocol into data processing products
  • Ability to run anywhere in the Grid

32
Virtual Data Grid Vision
33
Systems requirements: Services and Interfaces
  • Provenance databases, servers, virtual machines,
    workflow composers
  • Provenance navigation portals and webs
  • Embedded tracing systems, esp. within interactive
    tools: SPSS, ROOT, Excel, etc.
  • Catalog integration: replica catalogs, metadata
    catalogs, transformation catalogs, integrity,
    coherence, interoperability.
  • Interaction between provenance systems and
    workflow systems

34
Provenance Servers
  • OGSA-based Grid services
  • Discovery, security, resource management
  • Supports code and data discovery and workflow
    management
  • Object names (TR, DS, TY, DV, IV) can be used as
    global cross-server links
  • Derivations can reference remote transformations
    and datasets
  • Structured object namespaces and object-level
    access control enable large VO collaboration

35
Provenance Hyperlinks
36
Indexing Provenance Servers to Support Discovery
37
Challenges
  • What's the unit of change? Dataset? File?
    Object?
  • Relations to the worlds of HDF, CDF, FITS, many
    others
  • Does a dataset type have multiple dimensions?
  • Dataset names/handles
  • Unification of processing models: App, SQL,
    Service
  • Closure and reflection: Are transformations and
    workflows datasets? Can we track provenance of
    annotations?
  • Version management: mutability, timestamps
  • Garbage collection, retention, pruning
  • Distribution: what standards and naming protocols
    are needed? Catalogs, schemas?
  • Theoretical models? Unification of fine-grained and
    coarse-grained models?