Title: Quality views: capturing and exploiting the user perspective on information quality
1Quality views: capturing and exploiting the user
perspective on information quality
www.qurator.org: Describing the Quality of Curated
e-Science Information Resources
- Paolo Missier, Suzanne Embury, Mark Greenwood
- School of Computer Science, University of Manchester
- Alun Preece, Binling Jin
- Department of Computing Science, University of Aberdeen
2Outline
- Information and information quality (IQ) in e-science
- Quality views: a quality lens on data
- Semantic model for IQ
- Architectural framework for quality views
- State of the project and current research
3Information and quality in e-science
Can I trust this data? What evidence do I have
that it is suitable for my experiment?
In silico experiments (e.g. workflow-based)
- Scientists are increasingly required to place more of their data in the public domain
- Scientists use other scientists' experimental results as part of their own work
- Variations in the quality of the data being shared
- Scientists have no control over the quality of public data
- Lack of awareness of quality: difficult to measure and assess
- No standards!
4A concrete scenario
- Qualitative proteomics: identification of proteins in a cell sample
Candidate data for matching (peptide peak lists)
Match algorithm
Reference DBs: MSDB, NCBI, SwissProt/UniProt
False negatives: incompleteness of reference DBs, pessimistic matching
False positives: optimistic matching
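The trade-off between false positives and false negatives can be illustrated with a toy matcher. This is only a sketch, not the match algorithm used in the scenario: the protein ids, scores, ground truth, and the `threshold` parameter are all invented to show how an optimistic (low) threshold admits false positives while a pessimistic (high) one drops true matches.

```python
# Toy illustration only: scores and ground truth are invented for this sketch.
candidates = [
    # (protein id, match score, truly present in the sample?)
    ("P1", 0.95, True),
    ("P2", 0.70, True),
    ("P3", 0.55, False),
    ("P4", 0.40, True),
    ("P5", 0.20, False),
]


def count_errors(candidates, threshold):
    """Return (false_positives, false_negatives) for a score threshold."""
    fp = sum(1 for _, score, present in candidates
             if score >= threshold and not present)
    fn = sum(1 for _, score, present in candidates
             if score < threshold and present)
    return fp, fn


optimistic = count_errors(candidates, threshold=0.30)   # accepts almost everything
pessimistic = count_errors(candidates, threshold=0.80)  # accepts only strong matches
```

With this invented data the optimistic setting yields one false positive and no false negatives, and the pessimistic setting yields the reverse pattern.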
5The complete in silico workflow
- 1: identify proteins
- 2: analyze their functions
What is the quality of this processor's output? Is the processor introducing noise in the flow?
GO (Gene Ontology): reference controlled vocabulary for describing protein function (and more)
How can a user rapidly test this and other
hypotheses on quality?
6The user's perception of quality
Scientists often have only a blurry notion of
their quality requirements for the data
- A one-size-fits-all approach to quality does not work
- Scientists tend to apply personal acceptability criteria to data
- Driven mostly by prior personal and peers' experience
- Based on the expected use of the data
- What levels of false positives / negatives are acceptable?
It is difficult for users to implement quality
criteria and test them on the data
7Quality views: making quality explicit
- Our goals
- To support groups of users within a (scientific) community in understanding information quality on specific data domains
- To foster reuse of quality definitions within the community
- Approach
- Provide a conceptual model and architectural framework to capture user preferences on data quality
- Let users populate the framework with custom definitions for indicators and personal decision criteria
- The framework allows users to rapidly test quality preferences and observe their effect on the data
- Semi-automated integration in the data processing environment
Quality views: a specification of quality preferences and how they apply to the data
8Basic elements of information quality
- 1 - Quality dimensions
- A basic set of generic definitions for well-known non-functional properties of the data
- Ex.: accuracy describes how close the observed value is to the actual value
- 2 - Quality evidence
- Any measurable quantities that can be used to express formal quality criteria
- Evidence is not by itself a measure of quality
- Ex.: hit ratio in protein identification
- 3 - Quality assertions: decision procedures for data acceptability, based on available evidence
9The nature of quality evidence
- Direct evidence: indicators that represent some quality property
- Algorithms may exist to determine the biological plausibility of an experiment's outcome
- May be costly, not always available, and possibly inconclusive
- Indirect evidence: inexpensive indicators that correlate with other, more expensive indicators
- E.g. some function of hit ratio and sequence coverage
- Need experimental evidence of the correlation
Goals:
- Design suitable functions to collect / compute evidence
- Associate evidence to data (data quality annotation)
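The need for experimental evidence of the correlation can be made concrete with a small check. The sketch below is an assumption-laden illustration: `indirect_indicator` (a weighted mean of hit ratio and sequence coverage), the weight `w`, and the sample values are all hypothetical, and the "costly" column stands in for whatever direct evidence is available.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def indirect_indicator(hit_ratio, coverage, w=0.5):
    """Hypothetical combination: weighted mean of the two cheap indicators."""
    return w * hit_ratio + (1 - w) * coverage


# Invented sample: (hit ratio, sequence coverage, costly direct score).
sample = [
    (0.9, 0.8, 0.95),
    (0.6, 0.7, 0.70),
    (0.4, 0.3, 0.35),
    (0.2, 0.1, 0.10),
]
cheap = [indirect_indicator(h, c) for h, c, _ in sample]
costly = [p for _, _, p in sample]
r = pearson(cheap, costly)  # close to 1 means the cheap indicator tracks the costly one
```

A value of `r` near 1 on real data would support using the cheap indicator in place of the expensive one; a low value would argue against it.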
10Generic (e-science) evidence
- Recency: how recently the experiment was performed, or its results published
- Evidence: submission, publication dates
- Submitter reputation: is the lab well known for its accuracy in carrying out this type of experiment?
- Metric: lab ranking (subjective)
- Publication prestige: are the experiment results presented in high-profile journal publications?
- Metric: Impact Factor and more (official)
Collecting data provenance is the key to
providing most of these types of evidence
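As an illustration of how provenance can be turned into evidence, recency can be derived directly from a recorded publication date. The sketch below is hypothetical: the function names, the exponential decay, and the `half_life_days` parameter are assumptions chosen for the example, not part of the framework described here.

```python
from datetime import date


def recency_in_days(publication_date, as_of):
    """Age of a result in days; smaller means more recent."""
    return (as_of - publication_date).days


def recency_score(publication_date, as_of, half_life_days=365):
    """Hypothetical score in (0, 1] that halves every `half_life_days`."""
    age = max(recency_in_days(publication_date, as_of), 0)
    return 0.5 ** (age / half_life_days)


as_of = date(2006, 1, 1)
fresh = recency_score(date(2006, 1, 1), as_of)     # published on the reference day
year_old = recency_score(date(2005, 1, 1), as_of)  # published 365 days earlier
```

Reputation and prestige metrics would need external inputs (a lab ranking, a journal's Impact Factor), so they are not sketched here.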
11Semantic model for Information Quality
- The key IQ concepts are captured using an ontology
- Provides shareable, formal definitions for
- QualityProperties (dimensions)
- Quality Evidence
- Quality Assertions
- DataAnalysisTools: describe how indicators are computed
- The ontology is implemented in OWL DL
- Expressive operators for defining concepts and their relationships
- Support for subsumption reasoning
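To give a feel for what subsumption reasoning buys, the following toy sketch walks an asserted is-a hierarchy. It is not an OWL DL reasoner: a real reasoner also infers subsumption from class definitions, not only from asserted parent links. The top-level class names come from the slides; placing `HitRatio` and `Recency` under `QualityEvidence`, and the `Thing` root, are assumptions made for the example.

```python
# Toy is-a hierarchy. Only the top-level names come from the slides;
# the placement of HitRatio and Recency under QualityEvidence is assumed.
PARENT = {
    "QualityEvidence": "Thing",
    "QualityAssertion": "Thing",
    "HitRatio": "QualityEvidence",
    "Recency": "QualityEvidence",
}


def is_subsumed_by(cls, ancestor):
    """True if `cls` equals `ancestor` or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)  # None once we walk past the root
    return False


hit_ratio_is_evidence = is_subsumed_by("HitRatio", "QualityEvidence")
hit_ratio_is_assertion = is_subsumed_by("HitRatio", "QualityAssertion")
```

A check of this kind lets a tool accept any specific indicator wherever generic quality evidence is expected.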
12Top-level taxonomy of quality dimensions
Generic dimensions
Wang and Strong, "Beyond Accuracy: What Data Quality Means to Data Consumers", Journal of Management Information Systems, 1996
13Main taxonomies and properties
assertion-based-on-evidence: QualityAssertion → QualityEvidence
is-evidence-for: QualityEvidence → DataEntity
14Associating evidence to data
- Annotation functions compute quality evidence values for datasets and associate them to the data
- Defined in the DataAnalysisTool taxonomy as part of the ontology
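An annotation function in this sense can be sketched as a function that maps each data item to a set of named evidence values. Everything below is illustrative: `annotate`, the record layout, and the definition of `hit_ratio` as matched over submitted peptides are assumptions, not the framework's actual interface.

```python
def annotate(dataset, evidence_functions):
    """Return {item id: {evidence name: value}} for every item in the dataset."""
    return {
        item["id"]: {name: fn(item) for name, fn in evidence_functions.items()}
        for item in dataset
    }


def hit_ratio(item):
    """Hypothetical evidence: fraction of submitted peptides that matched."""
    return item["matched_peptides"] / item["submitted_peptides"]


# Invented records standing in for protein identification results.
dataset = [
    {"id": "P1", "matched_peptides": 8, "submitted_peptides": 10},
    {"id": "P2", "matched_peptides": 1, "submitted_peptides": 4},
]
annotations = annotate(dataset, {"HitRatio": hit_ratio})
```

Keeping the annotations in a separate map, keyed by item id, leaves the original data untouched while still associating evidence with each item.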
15Quality assertions
Assertions formalize the user's bias on evidence as computable decision models on that evidence
- Defined as ranking or classification functions f(D, I)
- Input
- dataset D
- vector I = (I1, I2, ..., In) of indicator values
- Possible outputs
- A classification (d, ci) for each d ∈ D
- A ranking (d, ri) for each d ∈ D
- The classification scheme C = {c1, ..., ck} and the ranking interval [r, R] are themselves defined in the ontology
Example: PIScoreClassifier partitions the input dataset into three classes (low, avg, high) based on a function of HitScore, MassCoverage
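A classification function of this shape can be sketched as follows. The slides do not give the function that PIScoreClassifier actually applies, so the score used here (the mean of HitScore and MassCoverage) and the `low` and `high` thresholds are assumptions for illustration only.

```python
def pi_score_classifier(dataset, indicators, low=0.4, high=0.7):
    """Classify each item as 'low', 'avg' or 'high'.

    `indicators` maps item id -> {"HitScore": float, "MassCoverage": float}.
    The score (mean of the two indicators) and the thresholds are assumptions.
    """
    result = []
    for d in dataset:
        ev = indicators[d]
        score = (ev["HitScore"] + ev["MassCoverage"]) / 2
        if score >= high:
            label = "high"
        elif score >= low:
            label = "avg"
        else:
            label = "low"
        result.append((d, label))  # one (d, ci) pair per item, as in f(D, I)
    return result


D = ["P1", "P2", "P3"]
I = {
    "P1": {"HitScore": 0.9, "MassCoverage": 0.8},
    "P2": {"HitScore": 0.5, "MassCoverage": 0.6},
    "P3": {"HitScore": 0.2, "MassCoverage": 0.1},
}
classes = pi_score_classifier(D, I)
```

A ranking function would have the same signature but return a numeric rank within the interval instead of a class label.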
16Quality views in practice
- Quality views are declarative specifications for
- desired data classification models and evidence
- I = (I1, I2, ..., In)
- class_i(d), rank_i(d) for all d ∈ D
- condition-action pairs, e.g.
- If <condition on class(d), rank(d), I> then <action>
- Where <action> depends on the data processing environment
- Filter out d
- Highlight d in a viewer
- Send d to a designated process or repository
- Quality views are based on a small set of formal operators
- They are expressed using an XML syntax
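The condition-action part of a quality view can be sketched in a few lines. The XML syntax itself is not shown in the slides, so it is not reproduced here; the rule representation, the action names (`keep`, `filter`, `highlight`), and `apply_quality_view` are hypothetical stand-ins.

```python
def apply_quality_view(classified, rules):
    """Apply the first matching (condition, action) rule to each item.

    `classified` is a list of (item, class label) pairs; each rule is a
    (predicate on the label, action name) pair. Unmatched items are kept.
    """
    outcome = {"keep": [], "filter": [], "highlight": []}
    for item, label in classified:
        action = next((a for cond, a in rules if cond(label)), "keep")
        outcome[action].append(item)
    return outcome


# Hypothetical rules: drop low-quality items, highlight high-quality ones.
rules = [
    (lambda label: label == "low", "filter"),
    (lambda label: label == "high", "highlight"),
]
classified = [("P1", "high"), ("P2", "avg"), ("P3", "low")]
outcome = apply_quality_view(classified, rules)
```

In a real host environment the action names would be bound to concrete operations, such as removing a row from a query result or routing an item to another workflow step.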
17Execution model for Quality views
- QVs can be embedded within specific data management host environments for runtime execution
- For static data: a query processor
- For dynamic data: a workflow engine
Declarative (XML) QV
Host environment
18User model
- Implementing rapid testing of quality hypotheses
[Diagram: compose quality view (XML, IQ ontology) → compile and deploy (bindings) → execute on test data (quality assertion services) → assess / view results → re-deploy (update assertion models)]
19The Qurator quality framework
20Compiled quality workflow
21Embedded quality workflow
22Example effect of QV noise reduction
23Summary
Main paradigm: let scientists experiment with quality concepts in an easy and intuitive way by observing the effect of their personal bias
- A conceptual model and architecture for capturing the user's perception of information quality
- Formal, semantic model makes concepts
- Shareable
- Reusable
- Machine-processable
- Quality views are user-defined and compiled to data processing environments (possibly multiple)
- The Qurator framework supports a runtime model for QVs
- Current work
- Formal semantics for QVs
- Exploiting semantic technology to support the QV specification task
- Addressing more real use cases