Provenance in myGrid and beyond

Slides: 73
Provided by: Carole153
Transcript and Presenter's Notes

Title: Provenance in myGrid and beyond


1
  • Provenance in myGrid and beyond
  • www.mygrid.org.uk
  • Luc Moreau,
  • University of Southampton, UK

2
  • or the Provenance of
  • my interest for Provenance
  • Luc Moreau,
  • University of Southampton, UK

3
Overview
  • Bioinformatics background
  • myGrid facts
  • Services and Workflows
  • Provenance in myGrid
  • Beyond myGrid Provenance
  • Architectural vision
  • Conclusions

5
Large amounts of data
http://www3.ebi.ac.uk/Services/DBStats/
  • EMBL July 2001
  • 150 Gbytes
  • Microarray
  • 1 Petabyte per annum
  • Sanger Centre
  • 20 terabytes of data
  • Genome sequences increase 4x per annum

6
Heterogeneity
  • Complexity
  • Diversity

Genomic, proteomic, transcriptomic, metabolomic,
protein-protein interactions, regulatory
bio-networks, alignments, disease, patterns and
motifs, protein structure, protein
classifications, specialist proteins (enzymes,
receptors), ...
Proteome
7
Heterogeneity
  • Data types and forms
  • Community
  • Autonomy
  • Over 500 different databases
  • Different formats, structure, schemas, coverage
  • Web interfaces, flat file distribution,

8
Heterogeneous Data
  • Multimedia
  • Images and video
  • Text annotations and literature
  • Descriptive as well as numeric
  • Knowledge-based

Text Extraction
9
Bioinformatics Analysis
  • Different algorithms
  • BLAST, FASTA, pSW
  • Different implementations
  • WU-BLAST, NCBI-BLAST
  • Different service providers
  • NCBI, EBI, DDBJ

10
Drug Discovery
11
In silico experimentation
  • Discovery of resources and tools, staging of
    operations, sharing of results
  • Process is as important as outcome
  • Science is dynamic: change happens
  • Scientific discovery is both personal and global
  • Provenance and history

12
Overview
  • Bioinformatics background
  • myGrid facts
  • Services and Workflows
  • Provenance in myGrid
  • Beyond myGrid Provenance
  • Architectural vision
  • Conclusions

13
myGrid
  • EPSRC funded pilot project
  • Generic middleware within application setting
  • 36 months within a 42-month performance period
  • Start 1st October 2001
  • 16 full-time post docs altogether
  • 6 DTA studentships
  • 1 technical project manager

14
myGrid consortium
  • Scientific Team
  • Biologists and Bioinformaticians
  • GSK, AZ, Merck KGaA, Manchester, EBI
  • Technical Team
  • Manchester, Southampton, Newcastle, Sheffield,
    EBI, Nottingham
  • IBM, SUN
  • GeneticXchange
  • Network Inference, Epistemics Ltd

15
myGrid outcomes
  • e-Scientists
  • Bioinformatics demonstrator (Graves disease and
    Williams syndrome)
  • Developers
  • myGrid-in-a-Box developers kit
  • (currently myGrid 0.4)
  • Integrating some existing bioinformatics tools
    with myGrid (EBI services)

16
Overview
  • Bioinformatics background
  • myGrid facts
  • Services and Workflows
  • Provenance in myGrid
  • Beyond myGrid Provenance
  • Architectural vision
  • Conclusions

17
Graves disease
  • Autoimmune disease of the thyroid in which the
    immune system of an individual attacks cells in
    the thyroid gland resulting in hyperthyroidism
  • Weight loss, trembling, muscle weakness,
    increased pulse rate, increased sweating and heat
    intolerance, goitre, exophthalmos

18
The Biology
  • GD caused by the stimulation of the thyrotrophin
    receptor by thyroid-stimulating autoantibodies
    secreted by lymphocytes of the immune system.
  • Why is the lymphocyte causing the antibodies that
    attack the thyroid cell?

19
Graves Disease Experimental Process
20
Experiment life cycle
  • Phases: forming experiments, personalisation,
    discovering and reusing experiments and
    resources, executing experiments, providing
    services and experiments, managing experiments
  • Activities shown around the cycle: personalised
    registries, personalised workflows, info
    repository views, personalised annotations,
    personalised metadata, security, resource and
    service discovery, repository creation, workflow
    creation, database query formation, workflow
    discovery and refinement, provenance, workflow
    enactment, distributed query processing, job
    execution, provenance generation, single sign-on
    authorisation, event notification, service
    registration, workflow deposition, metadata
    annotation, third-party registration, information
    repository, metadata management, provenance
    management, workflow evolution
21
A work bench for demonstrating services
myView on the mIR
Workflow
Metadata about workflow
note about workflow
22
Workflows
  • A workflow represents an experiment that can be
    run on the Grid.
  • A workflow takes data as input.
  • It performs activities, which are steps
    involved in analysing the data, including using
    tools and services, querying databases and
    running other workflows.
  • A workflow can be run on the user's local
    machine, or remotely, taking advantage of
    resources that are distributed.
  • A data-intensive grid has to deal with the
    heterogeneity of the data and processes.
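
The description of a workflow above can be illustrated with a minimal Python sketch. It is not myGrid code: the names `Workflow`, `to_upper` and `reverse` are hypothetical, and the "activities" are plain functions standing in for tools, services and database queries. The one point it tries to show is that a workflow can itself be used as an activity of another workflow.

```python
from typing import Any, Callable, List

# An activity is any step that turns input data into output data (assumption).
Activity = Callable[[Any], Any]


class Workflow:
    """A workflow applies its activities, in order, to the input data."""

    def __init__(self, name: str, activities: List[Activity]) -> None:
        self.name = name
        self.activities = activities

    def __call__(self, data: Any) -> Any:
        # Being callable lets a workflow be nested inside another workflow.
        for activity in self.activities:
            data = activity(data)
        return data


def to_upper(sequence: str) -> str:
    return sequence.upper()


def reverse(sequence: str) -> str:
    return sequence[::-1]


# A small workflow, then a larger one that runs the first as one of its steps.
normalise = Workflow("normalise", [str.strip, to_upper])
analyse = Workflow("analyse", [normalise, reverse])

result = analyse("  acgt ")
```

Whether the steps run locally or remotely is hidden behind each activity, which is the property the slide attributes to workflows.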

23
myGrid schematic
  • Layers: Exemplars, Generic Applications, Core
    components, Services
  • Elements shown: Graves disease scenario,
    Workbench, Workflow editor, Talisman, Gateway,
    Event Notification, Workflow Enactment,
    Information repository, Service Registry,
    Knowledge management, SoapLab, Bio services,
    Distributed query processing, Text services
24
Service Oriented Architecture
  • Components shown: Knowledge Services, Semantic
    registration, Structural registration, Registry,
    Ontology Server, Reasoner, UDDI, UDDI-M, Matcher,
    Registry View, Notification Service, Service
    Discovery, JMS, Provenance service, Workflow
    enactment engine, Build/Edit Workflow, mIR (mInfo
    Repository), Test Data, WSFL, Component
    Discovery, Information Extraction, Distributed
    Query Processor, Job Execution, Workflow
    templates, Workflow instances, PASTA, Services,
    Metadata, Concepts, Data, Provenance, SoapLab,
    DB2
25
myGrid Deployment
26
myGrid 0.4 (Nov 2003)
  • Describer (MAN): a tool for attaching semantic
    descriptions to Web Services and workflows
  • Find Service (MAN): a component for classifying
    and discovering services and workflows via their
    semantic descriptions
  • Ontology Server (MAN): the DAML+OIL reasoner
  • Workbench (NOT): a NetBeans module for examining
    and updating the MIR and submitting workflows for
    enactment
  • e-Science Gateway (NOT): an API giving access to
    myGrid core services
  • MIR (myGrid information repository) (MAN/NEW): a
    Web Service accessing a repository that can hold
    data for an individual scientist or a team of
    scientists
  • Notification Service (IAM): a general-purpose Web
    Service that supports a publish/subscribe model
    of event notification, based on JMS
  • Registry View service (IAM): a Web Service
    supporting a registry of published Web Services
    and workflows annotated with metadata, including
    semantic descriptions
  • Freefluo (ITI): workflow enactment engine
  • Taverna (EBI): workflow editing environment

27
Overview
  • Bioinformatics background
  • myGrid facts
  • Services and Workflows
  • Provenance in myGrid
  • Beyond myGrid Provenance
  • Architectural vision
  • Conclusions

28
Provenance definition
  • Main Entry: provenance
  • Pronunciation: 'präv-nən(t)s, 'prä-və-"nän(t)s
  • Function: noun
  • Etymology: French, from provenir to come
    forth, originate, from Latin provenire, from pro-
    forth + venire to come; more at PRO-, COME
  • Date: 1785
  • 1: ORIGIN, SOURCE
  • 2: the history of ownership of a valued object
    or work of art or literature

33
Provenance
Provenance is related to
  • Experiment is repeatable, if not reproducible,
    and explained by provenance records
  • Who, what, where, why, when, (w)how?
  • The traceability of knowledge as it evolves
    and as it is derived.
  • Immutable metadata
  • Migration: provenance travels with its data but
    may not be stored with it.
  • Private vs Shared provenance records.
  • Credit.

34
Early Provenance Capture
A full provenance record is linked with the
results. It's a log of execution.
35
Kinds of Provenance
  • Backward Derivation
  • An explanation of when, by whom and how something
    was produced.
  • Linking items, usually in a directed graph.
  • Execution- and process-centric
  • To be contrasted with forward derivation, which
    is a path like a workflow, script or query.
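
Backward derivation over a directed graph can be sketched in a few lines of Python. The item names and the `derived_from` mapping below are invented for illustration; the slide does not prescribe any particular representation.

```python
from typing import Dict, List, Set

# Hypothetical derivation links: each item maps to the items it was produced from.
derived_from: Dict[str, List[str]] = {
    "report": ["alignment", "annotation"],
    "alignment": ["sequence"],
    "annotation": ["sequence"],
}


def backward_derivation(item: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every item that `item` was directly or indirectly derived from."""
    seen: Set[str] = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for source in graph.get(current, []):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


ancestors = backward_derivation("report", derived_from)
```

Forward derivation would follow the same links in the opposite direction, from an input to everything a workflow goes on to produce from it.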

36
Kinds of Provenance
  • Annotations
  • Attached to items or collections of items, in a
    structured, semi-structured or free text form.
  • Annotations on one item or linking items.
  • An explanation of why, when, where, who, what,
    how.
  • Data-centric

37
Kinds of Provenance in myGrid
  • Derivations
  • Workflow Enactment Engine provides a detailed
    provenance record stored in the myGrid
    Information Repository (mIR) describing what was
    done, with what services and when
  • XML document, soon to be an RDF model
  • Annotations
  • Every mIR object has Dublin Core provenance
    properties described in an attribute value model

38
Provenance of data
  • Operational execution trail

Diagram: a process node (start time, end time)
linked to Gene AC005412.6, SNP000010197,
urn: Claire Jennings and lsid: HGVBase_retrieve by
edges labelled input, output, run_for and
by_service.
39
From Provenance to Knowledge
  • Declarative semantic execution trail

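Diagram: a process node (start time, end time)
linked to Gene AC005412.6, SNP000010197,
urn: Claire Jennings and lsid: HGVBase_retrieve by
edges labelled input, output, run_for and
by_service, plus a
contains_single_nucleotide_polymorphism edge
between the gene and the SNP, marked "as stated
by" the process.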
40
From Provenance to Knowledge
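  • Trust and attribution

Diagram: a process node (start time, end time)
linked to Gene AC005412.6, SNP000010197,
urn: Claire Jennings and lsid: HGVBase_retrieve by
edges labelled input, output, run_for and
by_service, plus a
contains_single_nucleotide_polymorphism edge
between the gene and the SNP, marked "as stated
by" the process and "disputed by"
urn: Carole Goble.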
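
The three diagrams on slides 38 to 40 can be read as subject, predicate, object statements. The Python sketch below is one possible reading, not the myGrid data model: the identifiers `process-1` and `claim-1` are hypothetical, and the direction of each edge is an assumption, since the transcript preserves only the labels.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

PROCESS = "process-1"  # hypothetical identifier for the process node
CLAIM = "claim-1"      # hypothetical stand-in for the gene/SNP statement

triples: List[Triple] = [
    # Operational execution trail (edge directions are assumptions).
    (PROCESS, "input", "AC005412.6"),
    (PROCESS, "output", "SNP000010197"),
    (PROCESS, "run_for", "urn:Claire Jennings"),
    (PROCESS, "by_service", "lsid:HGVBase_retrieve"),
    # Declarative semantic statement derived from the trail.
    ("AC005412.6", "contains_single_nucleotide_polymorphism", "SNP000010197"),
    # A flat triple list cannot point at another triple, so CLAIM
    # stands in for the statement above when attaching attribution.
    (CLAIM, "as_stated_by", PROCESS),
    (CLAIM, "disputed_by", "urn:Carole Goble"),
]


def objects(subject: str, predicate: str) -> List[str]:
    """Return all objects recorded for a subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


who_ran_it = objects(PROCESS, "run_for")
who_disputes = objects(CLAIM, "disputed_by")
```

Keeping the operational trail and the disputed claim as separate statements means the record of what was executed stays intact even when its interpretation is contested.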
41
Provenance vs
  • Provenance vs Annotation
  • Provenance of an annotation
  • Annotation of Provenance
  • Provenance vs Workflow
  • Provenance describes past execution
  • A workflow is a script for future execution

42
What is Provenance?
  • Annotations may be subject to interpretation
    (e.g. Alice believes annotation X, whereas Bob
    does not).
  • Provenance should aim at recording an undisputed
    view of an execution.

43
What is Provenance?
  • Provenance traces execution
  • Provenance must be generated automatically
  • Annotations can be either generated automatically
    or created by the user
  • Annotations can contain semantic augmentation,
    which can be derived automatically or supplied
    manually.
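
One way to meet the requirement that provenance be generated automatically is to wrap each processing step so that a record is written as a side effect of running it. The Python sketch below uses a decorator for this; the record fields and the `gc_content` step are illustrative assumptions, not the format used by the myGrid enactment engine.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# In-memory log standing in for a provenance store (assumption).
provenance_log: List[Dict[str, Any]] = []


def record_provenance(func: Callable) -> Callable:
    """Wrap a step so every call appends a provenance record automatically."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.time()
        result = func(*args, **kwargs)
        provenance_log.append({
            "service": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "start_time": start,
            "end_time": time.time(),
        })
        return result

    return wrapper


@record_provenance
def gc_content(sequence: str) -> float:
    """Toy analysis step: fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


value = gc_content("GGCA")
```

The user never writes the record by hand, which matches the distinction drawn above: the trace is automatic, while annotations may still be added manually afterwards.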

44
Generating provenance
45
Overview
  • Bioinformatics background
  • myGrid facts
  • Services and Workflows
  • Provenance in myGrid
  • Beyond myGrid Provenance
  • Architectural vision
  • Conclusions

46
Provenance in a Bioinformatics Grid
  • myGrid builds a personalised problem-solving
    environment that helps bioinformaticians find,
    adapt, construct and execute in silico
    experiments
  • Provenance in the drug discovery process
  • FDA requirement on drug companies to keep a
    record of the provenance of drug discovery for
    as long as the drug is in use (sometimes up to
    50 years).

47
Provenance in Aerospace Engineering
  • Provenance requirement to maintain a historical
    record of outputs from each sub-system involved
    in simulations.
  • An aircraft's provenance data needs to be kept
    for up to 99 years when sold to some countries.
  • Currently, little direct support is available for
    this.

48
Provenance in Organ Transplant Management
  • Decision support systems for organ and tissue
    transplant rely on a wide range of data sources,
    patient data, and doctors' and surgeons'
    knowledge
  • Heavily regulated domain: European, national,
    regional and site-specific rules govern how
    decisions are made.
  • Application of these rules must be ensured and
    auditable, and the rules may change over time
  • Provenance allows previous decisions to be
    tracked, which is crucial to maximising the
    efficiency of matching and the recovery rate of
    patients

49
The Grid and Virtual Organisations
  • The Grid problem is defined as "coordinated
    resource sharing and problem solving in dynamic,
    multi-institutional virtual organisations"
    [FKT01].
  • Effort is required to allow users to place their
    trust in the data produced by such virtual
    organisations
  • Understanding how a given service is likely to
    modify data flowing into it, and how this data
    has been generated is crucial.

50
Provenance and Virtual Organisations
  • Given a set of services in an open grid
    environment that decide to form a virtual
    organisation with the aim of producing a given
    result:
  • How can we determine the process that
    generated the result, especially after the
    virtual organisation has been disbanded?
  • The lack of information about the origin of
    results does not help users to trust such open
    environments.

51
Provenance and Workflows
  • Workflow enactment has become popular in the Grid
    and Web Services communities
  • Workflow enactment can be seen as a scripted form
    of virtual organisation.
  • The problem is similar: how can we determine the
    origin of enactment results?

52
Provenance Definition
  • Provenance is data that can explain how a
    particular result has been derived.
  • In a service-oriented architecture, provenance
    identifies what data is passed between services,
    what services are available, and what results are
    generated for particular sets of input values,
    etc.
  • Using provenance, a user can trace the process
    that led to the aggregation of services producing
    a particular output.

53
Overview
  • Bioinformatics background
  • myGrid facts
  • Services and Workflows
  • Provenance in myGrid
  • Beyond myGrid Provenance
  • Architectural vision
  • Conclusions

54
What is the problem?
  • Provenance recording should be part of the
    infrastructure, so that users can elect to enable
    it when they execute their complex tasks over the
    Grid or in Web Services environments.
  • Currently, the Web Services protocol stack and
    the Open Grid Services Architecture do not
    provide any support for recording provenance.

55
Architectural Vision
56
Architectural Vision
  • Provenance gathering is a collaborative process
    that involves multiple entities, including the
    workflow enactment engine, the enactment engine's
    client, the service directory, and the invoked
    services.
  • Provenance data will be submitted to one or more
    provenance repositories acting as storage for
    provenance data.
  • At the user's request, some analysis, navigation
    and reasoning over provenance data can be
    undertaken.

57
Architectural Vision
  • Storage could be achieved by a provenance
    service.
  • The provenance service would provide support for
    analysis, navigation or reasoning over provenance
    data.
  • Client side support for submitting provenance
    data to the provenance service.

58
A First Prototype (Szomszor, Moreau 03)
  • A service-oriented architecture for provenance
    support in Grid and Web Services environments,
    based on the idea of a provenance service
  • A client-side API for recording provenance data
    for Web Service invocation
  • A data model for storing provenance data
  • A server-side interface for querying provenance
    data
  • Two components making use of provenance:
    provenance browsing and provenance validation.

59
Prototype Overview
60
Prototype Sequence Diagram
61
Prototype Sequence Diagram
  • To identify the interactions between provenance
    service, client side library and enactment engine
  • Creation of a session
  • Need to be able to support the most complex
    workflows including conditional branching,
    iteration, recursion and parallel execution.
  • Support asynchronous submission of provenance
    data so that provenance submission does not delay
    workflow execution.
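
The interactions listed above (creating a session, then submitting provenance asynchronously so that the workflow is not delayed) can be sketched in Python with a queue and a background thread. This is not the prototype's API: `ProvenanceService`, `ProvenanceClient` and their methods are hypothetical names, and the service is an in-memory stand-in for a remote one.

```python
import queue
import threading
import uuid
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


class ProvenanceService:
    """In-memory stand-in for a remote provenance service (assumption)."""

    def __init__(self) -> None:
        self.sessions: Dict[str, List[Record]] = {}
        self._lock = threading.Lock()

    def create_session(self) -> str:
        session_id = str(uuid.uuid4())
        with self._lock:
            self.sessions[session_id] = []
        return session_id

    def submit(self, session_id: str, record: Record) -> None:
        with self._lock:
            self.sessions[session_id].append(record)


class ProvenanceClient:
    """Client-side library: queues records and submits them in the background."""

    def __init__(self, service: ProvenanceService) -> None:
        self.service = service
        self.session_id = service.create_session()
        self._queue: "queue.Queue[Optional[Record]]" = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, record: Record) -> None:
        # Returns immediately; the enactment engine is not blocked on submission.
        self._queue.put(record)

    def _drain(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: no more records
                break
            self.service.submit(self.session_id, item)

    def close(self) -> None:
        # Flush everything queued so far, then stop the worker.
        self._queue.put(None)
        self._worker.join()


service = ProvenanceService()
client = ProvenanceClient(service)
client.record({"service": "blast", "input": "ACGT", "output": "hit-1"})
client.record({"service": "annotate", "input": "hit-1", "output": "note-1"})
client.close()
stored = service.sessions[client.session_id]
```

A single worker draining a FIFO queue keeps records in submission order; parallel workflow branches would additionally need identifiers in each record to reconstruct the structure of the execution.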

62
Prototype Provenance Data Model
63
Prototype Provenance Data Model
  • Must support recording of all information
    necessary to replay execution
  • Must support all complex forms of workflows
    (recursion, iterations, parallel execution).
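
The two requirements above suggest a record that keeps enough detail to re-run each invocation, plus a link to the enclosing invocation so that nested or repeated calls can be told apart. The Python sketch below is an illustrative assumption, not the prototype's data model; `InvocationRecord`, its fields and the toy services are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class InvocationRecord:
    invocation_id: str
    parent_id: Optional[str]  # enclosing invocation, for nesting and recursion
    service: str
    inputs: Tuple[Any, ...]
    output: Any


def replay(records: List[InvocationRecord],
           services: Dict[str, Callable[..., Any]]) -> List[str]:
    """Re-run each recorded invocation; return ids whose output differs."""
    mismatches = []
    for record in records:
        if services[record.service](*record.inputs) != record.output:
            mismatches.append(record.invocation_id)
    return mismatches


# Toy services standing in for real Web Services.
services = {"add": lambda a, b: a + b, "double": lambda x: 2 * x}

records = [
    InvocationRecord("i1", None, "add", (1, 2), 3),
    InvocationRecord("i2", "i1", "double", (3,), 6),
    InvocationRecord("i3", "i1", "double", (4,), 9),  # deliberately wrong output
]

changed = replay(records, services)
```

Replaying and comparing outputs is one simple form of the provenance validation mentioned for the prototype: a mismatch signals that a service or its data has changed since the original run.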

64
Prototype Provenance Browser
65
Discussion
  • In order for provenance data to be useful, we
    expect such a protocol to support some
    classical properties of distributed algorithms.
  • Using mutual authentication, an invoked service
    can ensure that it submits data to a specific
    provenance server, and vice-versa, a provenance
    server can ensure that it receives data from a
    given service.
  • With non-repudiation, we can retain evidence of
    the fact that a service has committed to
    executing a particular invocation and has
    produced a given result.
  • We anticipate that cryptographic techniques will
    be useful to ensure such properties

66
Towards Trust
67
Towards Trust
  • Using the provenance of data, trust metrics of
    the data can be derived from
  • Trust the user places in invoked services
  • Trust the user places in the input data
  • Trust the user places in the enacted workflow
  • Trust the user places in the enactor
  • Trust the user places in the provenance service.
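
The slide names the inputs to a trust metric but not how they are combined. The Python sketch below assumes one simple rule, the weakest link: the derived trust in a result is the lowest trust among its contributing factors, each expressed on a scale from 0 to 1. Both the rule and the example values are assumptions made for illustration.

```python
from typing import Dict


def derived_trust(factors: Dict[str, float]) -> float:
    """Weakest-link rule: a result is only as trusted as its least trusted factor."""
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"trust in {name!r} must be between 0 and 1")
    return min(factors.values())


# Hypothetical trust values a user might assign to each factor on the slide.
factors = {
    "invoked_services": 0.9,
    "input_data": 0.8,
    "enacted_workflow": 1.0,
    "enactor": 1.0,
    "provenance_service": 0.5,
}

trust_in_result = derived_trust(factors)
```

Other combinations, such as a product or a weighted average, are equally plausible; the provenance record is what makes any of them computable, because it identifies which services, data, workflow and enactor were involved.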

68
  • The purpose of the PASOA project is to
    investigate provenance in Grid architectures
  • Funded by EPSRC under the fundamental computer
    science for e-Science call
  • In collaboration with Cardiff
  • www.pasoa.org

69
Conclusion
  • Provenance is a rather unexplored domain
  • Strategic for bringing trust to open environments
  • Necessity to design a configurable architecture
    capable of supporting multiple requirements from
    very different application domains.
  • Need to further investigate the algorithmic
    foundations of provenance, which will lead to
    scalable and secure industrial solutions.

70
Publications
  • [SM03] Martin Szomszor and Luc Moreau. Recording
    and reasoning over data provenance in web and
    grid services. In International Conference on
    Ontologies, Databases and Applications of
    SEmantics (ODBASE'03), volume 2888 of Lecture
    Notes in Computer Science, pages 603-620,
    Catania, Sicily, Italy, November 2003.
  • [MCS03] Luc Moreau, Syd Chapman, Andreas
    Schreiber, Rolf Hempel, Omer Rana, Laszlo Varga,
    Ulises Cortes, and Steven Willmott.
    Provenance-based trust for grid computing -
    position paper. 2003.
  • [GGS03] Mark Greenwood, Carole Goble, Robert
    Stevens, Jun Zhao, Matthew Addis, Darren Marvin,
    Luc Moreau, and Tom Oinn. Provenance of e-science
    experiments - experience from bioinformatics. In
    Proceedings of the UK OST e-Science second All
    Hands Meeting 2003 (AHM'03), pages 223-226,
    Nottingham, UK, September 2003.

71
Acknowledgements
  • The myGrid Southampton Team: Simon Miles, Juri
    Papay, Ananth Krishna, Michael Luck, David De
    Roure, Terry Payne
  • Mark Greenwood, Carole Goble, Manchester
  • Martin Szomszor, Southampton
  • Syd Chapman, IBM
  • Omer Rana, Cardiff
  • Andreas Schreiber and Rolf Hempel, DLR
  • Laszlo Varga, SZTAKI
  • Ulises Cortes and Steven Willmott, UPC

72
www.mygrid.org.uk