Quality views: capturing and exploiting the user perspective on information quality

Transcript and Presenter's Notes

Title: Quality views: capturing and exploiting the user perspective on information quality


1
Quality views: capturing and exploiting the user perspective on information quality
www.qurator.org
Describing the Quality of Curated e-Science Information Resources
  • Paolo Missier, Suzanne Embury, Mark Greenwood
  • School of Computer Science, University of
    Manchester
  • Alun Preece, Binling Jin
  • Department of Computing Science, University of
    Aberdeen

2
Outline
  • Information and information quality (IQ) in
    e-science
  • Quality views: a quality lens on data
  • Semantic model for IQ
  • Architectural framework for quality views
  • State of the project and current research

3
Information and quality in e-science
Can I trust this data? What evidence do I have
that it is suitable for my experiment?
In silico experiments (e.g., workflow-based)
  • Scientists are increasingly required to place
    more of their data in the public domain
  • Scientists use other scientists' experimental
    results as part of their own work
  • Variations in the quality of the data being
    shared
  • Scientists have no control over the quality of
    public data
  • Lack of awareness of quality: it is difficult to measure and assess
  • No standards!

4
A concrete scenario
  • Qualitative proteomics: identification of proteins in a cell sample

Candidate data for matching (peptide peak lists)
Match algorithm
Reference DBs: MSDB, NCBI, SwissProt/UniProt
False negatives: incompleteness of reference DBs, pessimistic matching
False positives: optimistic matching
5
The complete in silico workflow
  • 1) identify proteins, 2) analyze their functions

What is the quality of this processor's output? Is the processor introducing noise in the flow?
GO = Gene Ontology: a reference controlled vocabulary for describing protein function (and more)
How can a user rapidly test this and other
hypotheses on quality?
6
The user's perception of quality
Scientists often have only a blurry notion of
their quality requirements for the data
  • A one-size-fits-all approach to quality does not work
  • Scientists tend to apply personal acceptability
    criteria to data
  • Driven mostly by prior personal and peers' experience
  • Based on the expected use of the data
  • What levels of false positives / negatives are
    acceptable?

It is difficult for users to implement quality
criteria and test them on the data
7
Quality views: making quality explicit
  • Our goals
  • To support groups of users within a (scientific)
    community in understanding information quality on
    specific data domains
  • To foster reuse of quality definitions within the
    community
  • Approach
  • Provide a conceptual model and architectural
    framework to capture user preferences on data
    quality
  • Let users populate the framework with custom
    definitions for indicators and personal decision
    criteria
  • The framework allows users to rapidly test quality preferences and observe their effect on the data
  • Semi-automated integration in the data processing
    environment

Quality views: a specification of quality preferences and how they apply to the data
8
Basic elements of information quality
  • 1 - Quality dimensions
  • A basic set of generic definitions for well-known
    non-functional properties of the data
  • Ex.: Accuracy describes how close the observed value is to the actual value
  • 2 - Quality evidence
  • Any measurable quantities that can be used to
    express formal quality criteria
  • Evidence is not by itself a measure of quality
  • Ex.: Hit ratio in protein identification

3 - Quality assertions: decision procedures for data acceptability, based on available evidence
9
The nature of quality evidence
  • Direct evidence: indicators that represent some quality property
  • Algorithms may exist to determine the biological plausibility of an experiment's outcome
  • May be costly, not always available, and possibly inconclusive
  • Indirect evidence: inexpensive indicators that correlate with other, more expensive indicators
  • E.g., some function of hit ratio and sequence coverage
  • Need experimental evidence of the correlation

Goals: design suitable functions to collect/compute evidence, and associate evidence with data (data quality annotation)
10
Generic (e-science) evidence
  • Recency: how recently the experiment was performed, or its results published (a recency sketch follows below)
  • Evidence: submission and publication dates
  • Submitter reputation: is the lab well known for its accuracy in carrying out this type of experiment?
  • Metric: lab ranking (subjective)
  • Publication prestige: are the experimental results presented in high-profile journal publications?
  • Metric: Impact Factor and more (official)

Collecting data provenance is the key to
providing most of these types of evidence
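A minimal sketch, in Python, of how a recency indicator could be computed from provenance dates; the function name and the example dates are illustrative assumptions, not part of the Qurator framework:

    from datetime import date

    def recency_in_days(publication_date: date, today: date = None) -> int:
        """Recency evidence: days elapsed since the result was published."""
        today = today or date.today()
        return (today - publication_date).days

    # Illustrative use: a result published on 1 March 2006, assessed on 1 September 2006
    print(recency_in_days(date(2006, 3, 1), today=date(2006, 9, 1)))  # 184 days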
11
Semantic model for Information Quality
  • The key IQ concepts are captured using an
    ontology
  • Provides shareable, formal definitions for
  • QualityProperties (dimensions)
  • Quality Evidence
  • Quality Assertions
  • DataAnalysisTools: describe how indicators are computed
  • The ontology is implemented in OWL DL
  • Expressive operators for defining concepts and
    their relationships
  • Support for subsumption reasoning

12
Top-level taxonomy of quality dimensions
Generic dimensions
Wang and Strong, "Beyond Accuracy: What Data Quality Means to Data Consumers", Journal of Management Information Systems, 1996
13
Main taxonomies and properties
assertion-based-on-evidence: QualityAssertion → QualityEvidence
is-evidence-for: QualityEvidence → DataEntity
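A hypothetical sketch (Python with rdflib) of how the concepts of the semantic model and the two properties above could be declared in OWL; the namespace URI and the exact class names are assumptions, not the published Qurator ontology:

    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL, RDF, RDFS

    Q = Namespace("http://example.org/qurator#")  # hypothetical namespace
    g = Graph()
    g.bind("q", Q)

    # Core concepts of the IQ semantic model (see slide 11)
    for cls in ("QualityProperty", "QualityEvidence", "QualityAssertion",
                "DataAnalysisTool", "DataEntity"):
        g.add((Q[cls], RDF.type, OWL.Class))

    # assertion-based-on-evidence: QualityAssertion -> QualityEvidence
    p1 = Q["assertion-based-on-evidence"]
    g.add((p1, RDF.type, OWL.ObjectProperty))
    g.add((p1, RDFS.domain, Q.QualityAssertion))
    g.add((p1, RDFS.range, Q.QualityEvidence))

    # is-evidence-for: QualityEvidence -> DataEntity
    p2 = Q["is-evidence-for"]
    g.add((p2, RDF.type, OWL.ObjectProperty))
    g.add((p2, RDFS.domain, Q.QualityEvidence))
    g.add((p2, RDFS.range, Q.DataEntity))

    print(g.serialize(format="turtle"))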
14
Associating evidence to data
  • Annotation functions compute quality evidence values for datasets and associate them with the data (see the sketch below)
  • Defined in the DataAnalysisTool taxonomy as part
    of the ontology
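A hypothetical sketch of such an annotation function; the record fields and the way HitScore and MassCoverage are derived are assumptions made for illustration:

    def annotate_identifications(records):
        """Compute quality evidence values for each protein identification
        and associate them with the data item's id (quality annotation)."""
        annotations = {}
        for rec in records:
            annotations[rec["id"]] = {
                # assumed fields: matched vs. submitted peptides, matched vs. total mass
                "HitScore": rec["peptide_matches"] / max(rec["total_peptides"], 1),
                "MassCoverage": rec["matched_mass"] / max(rec["total_mass"], 1),
            }
        return annotations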

15
Quality assertions
Assertions formalize the user's bias on evidence as computable decision models over that evidence
  • Defined as ranking or classification functions f(D, I)
  • Input:
  • dataset D
  • vector I = (I1, I2, ..., In) of indicator values
  • Possible outputs:
  • A classification (d, c_i) for each d ∈ D
  • A ranking (d, r_i) for each d ∈ D
  • The classification scheme C = {c_1, ..., c_k} and the ranking interval [r, R] are themselves defined in the ontology

Example: PIScoreClassifier partitions the input dataset into three classes (low, avg, high) based on a function of HitScore and MassCoverage
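A minimal sketch of such a quality assertion as a Python classification function; the 0.7/0.3 weights and the class thresholds are illustrative assumptions, not the actual PIScoreClassifier model:

    from typing import Dict, List

    def pi_score_classifier(dataset: List[str],
                            evidence: Dict[str, Dict[str, float]]) -> Dict[str, str]:
        """Partition the dataset into low / avg / high classes using a
        weighted combination of HitScore and MassCoverage evidence."""
        classes = {}
        for d in dataset:
            ev = evidence[d]  # indicator vector I for item d
            score = 0.7 * ev["HitScore"] + 0.3 * ev["MassCoverage"]  # illustrative weights
            if score < 0.4:
                classes[d] = "low"
            elif score < 0.7:
                classes[d] = "avg"
            else:
                classes[d] = "high"
        return classes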
16
Quality views in practice
  • Quality views are declarative specifications for:
  • desired data classification models and evidence
  • I = (I1, I2, ..., In)
  • class_i(d), rank_i(d) for all d ∈ D
  • condition-action pairs, e.g.
  • If <condition on class(d), rank(d), I> then <action>
  • Where <action> depends on the data processing environment:
  • Filter out d
  • Highlight d in a viewer
  • Send d to a designated process or repository
  • Quality views are based on a small set of formal operators
  • They are expressed using an XML syntax (a condition-action sketch follows below)
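A hypothetical condition-action pair written as Python rather than the framework's XML syntax; filtering on the "low" class is just one of the possible actions listed above:

    def apply_quality_view(dataset, classes):
        """Condition: class(d) == 'low'; action: filter d out of the result.
        Other actions could highlight d or route it to another process."""
        accepted, rejected = [], []
        for d in dataset:
            if classes[d] == "low":
                rejected.append(d)
            else:
                accepted.append(d)
        return accepted, rejected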

17
Execution model for Quality views
  • QVs can be embedded within specific data management host environments for runtime execution:
  • For static data: a query processor
  • For dynamic data: a workflow engine

[Diagram: a declarative (XML) QV embedded in a host environment]
18
User model
  • Implementing rapid testing of quality hypotheses

[Diagram: compose quality view (XML) using the IQ ontology; compile and deploy bindings; execute on test data via quality assertion services; assess and view the results; re-deploy (updating the assertion models)]
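A minimal end-to-end run of this loop, chaining the hypothetical sketches above on two fabricated records (all field names and values are illustrative):

    records = [
        {"id": "P1", "peptide_matches": 8, "total_peptides": 10,
         "matched_mass": 900, "total_mass": 1000},
        {"id": "P2", "peptide_matches": 1, "total_peptides": 10,
         "matched_mass": 50, "total_mass": 1000},
    ]
    ids = [r["id"] for r in records]
    evidence = annotate_identifications(records)            # annotate with evidence
    classes = pi_score_classifier(ids, evidence)            # apply the quality assertion
    accepted, rejected = apply_quality_view(ids, classes)   # act on the classification
    print(classes)             # {'P1': 'high', 'P2': 'low'}
    print(accepted, rejected)  # ['P1'] ['P2']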
19
The Qurator quality framework
20
Compiled quality workflow
21
Embedded quality workflow
22
Example effect of QV noise reduction
23
Summary
Main paradigm: let scientists experiment with quality concepts in an easy and intuitive way by observing the effect of their personal bias
  • A conceptual model and architecture for capturing the user's perception of information quality
  • Formal, semantic model makes concepts
  • Shareable
  • Reusable
  • Machine-processable
  • Quality views are user-defined and compiled to
    data processing environments (possibly multiple)
  • The Qurator framework supports a runtime model
    for QVs
  • Current work
  • Formal semantics for QVs
  • Exploiting semantic technology to support the QV
    specification task
  • Addressing more real use cases
