Metadata Quality for Federated Collections
Besiki Stvilia, Les Gasser, Mike Twidale, Sarah
Shreeves, Tim Cole
  • November, 2004

1. Abstract
  • Centralized metadata repositories attempt to
    provide integrated access across multiple digital
    collections from libraries, archives, and museums.
    Metadata quality in these repositories heavily
    influences the collections' usability: high
    quality can raise satisfaction and use, while low
    quality can render collections unusable. Variances
    in individual metadata type, origin, and quality
    are compounded into complex quality challenges
    when collections are aggregated. Current metadata
    quality assurance is generally piecemeal,
    reactive, ad hoc, and atheoretical; formal
    compatibility and interoperability standards often
    prove unenforceable given metadata providers'
    dynamic and conflicting organizational priorities.
    We are empirically examining large bodies of
    harvested metadata to develop systematic
    techniques for metadata quality assessment and
    assurance. We study metadata quality, value, and
    cost models; algorithms for connecting metadata
    component variations to (aggregate) metadata
    record quality; and prototype metadata quality
    assurance tools that help providers, aggregators,
    and users reason about metadata quality, enabling
    more intelligent selection, aggregation, and
    maintenance of metadata.
2. Approach
The model has been developed using a number of
techniques such as literature analysis, case
studies, statistical analysis, strategic
experimentation, and multi-agent modeling. The
model along with the concepts and metrics used
can serve as a foundation for developing
effective specific methodologies of quality
assurance in various types of organizations. Our
model of metadata quality ties together findings
from existing and new research in information
quality, along with well-developed work in
information seeking/use behavior, and the
techniques of strategic experimentation from
manufacturing. It presents a holistic approach to
determining the quality of a metadata object,
identifying quality requirements based on
typified contexts of metadata use (such as
specific information seeking/use activities) and
expressing interactions between metadata quality
and metadata value.
3. Measuring Metadata Quality
3.1 Metadata Quality Problem
  • Actual quality not matching the required/needed
    level of quality
  • May arise at different levels
  • Element Level
  • Schema Level
  • Quality Dimensions

3.2 Information Quality Dimensions
  • Relational / Contextual
    • Accuracy
    • Completeness
    • Complexity
    • Latency
    • Naturalness
    • Informativeness
    • Relevance (aboutness)
    • Precision
    • Security
    • Verifiability
    • Volatility
  • Intrinsic
    • Accuracy
    • Cohesiveness
    • Complexity
    • Semantic consistency
    • Structural consistency
    • Currency
    • Informativeness
    • Naturalness
    • Precision
  • Reputational
    • Authority

3.3 MQ Dimensions may trade off
  • completeness vs. simplicity
  • robustness vs. simplicity
  • volatility vs. simplicity
  • robustness vs. redundancy
  • accessibility vs. certainty
  • Taguchi curves help to model and reason about
    these trade-offs
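The Taguchi-curve idea can be sketched in code: a quadratic loss penalizes deviation of each quality dimension from its target, and summing weighted losses lets one compare records that trade one dimension against another. This is a minimal illustrative sketch; the function names, targets, and weights are assumptions, not the project's actual model.

```python
# Minimal sketch of Taguchi-style quadratic loss for reasoning about
# metadata-quality trade-offs. All names and constants are illustrative.

def taguchi_loss(y, target, k=1.0):
    """Quadratic loss: deviation from the target quality level is
    penalized quadratically, as in Taguchi's loss function."""
    return k * (y - target) ** 2

def combined_loss(completeness, simplicity,
                  w_completeness=1.0, w_simplicity=1.0):
    """Total loss when two dimensions trade off: raising completeness
    (more elements filled in) tends to lower simplicity, and vice versa.
    Targets are set to 1.0 (the ideal level) for both dimensions."""
    return (w_completeness * taguchi_loss(completeness, 1.0)
            + w_simplicity * taguchi_loss(simplicity, 1.0))

# A fuller record (high completeness, low simplicity) vs. a sparser one:
full   = combined_loss(completeness=0.9, simplicity=0.4)
sparse = combined_loss(completeness=0.4, simplicity=0.9)
```

With equal weights the two records incur the same loss; shifting the weights toward the dimension a given use context cares about is one way such curves support reasoning about which trade-off to accept.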

3.4 Genre Captures Context
4. Measuring Value
4.1 What's the Value of Metadata?
4.2 Value as Amount of Use
Dublin Core element   % of total records containing element
Identifier            99.6
Title                 80.3
Type                  76.5
Subject               72.9
Format                69.4
Publisher             61.2
Language              55.0
Creator               50.7
Description           47.4
Date                  43.0
Rights                41.0
Relation              31.2
Source                14.9
Contributor            6.6
Coverage               5.9
  • The value of metadata can be a function of the
    probability distribution of the
    operations/transactions using the metadata.
  • Human factors experiments can be used for
    assessing the effectiveness of creating and using
    the metadata.
  • Metadata is often an organizational asset,
    especially in organizations like libraries, and
    one can calculate its dollar cost based on the
    average time a cataloger spends on creating a
    record or an element of the record.
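The first bullet above can be made concrete: if we know the distribution of operations that use a record, the record's expected value can be computed as a probability-weighted sum over those operations. The sketch below is purely illustrative; the operation names (borrowed from the Find/Identify/Select/Obtain user tasks mentioned later), probabilities, and element-to-operation mapping are assumptions.

```python
# Illustrative sketch: metadata value as an expectation over the
# distribution of operations/transactions that use the metadata.
# Probabilities and the element mapping are assumed, not measured.

op_probability = {"find": 0.5, "identify": 0.2, "select": 0.2, "obtain": 0.1}

# Which (hypothetical) DC elements each operation relies on:
op_elements = {
    "find":     {"Title", "Subject", "Creator"},
    "identify": {"Title", "Date", "Identifier"},
    "select":   {"Description", "Format", "Type"},
    "obtain":   {"Identifier", "Rights"},
}

def expected_value(present_elements):
    """Expected usefulness of a record: each operation contributes its
    probability weighted by the fraction of its needed elements present."""
    return sum(p * len(op_elements[op] & present_elements) / len(op_elements[op])
               for op, p in op_probability.items())

rich = expected_value({"Title", "Subject", "Creator", "Date", "Identifier",
                       "Description", "Format", "Type", "Rights"})
```

A record supplying every element an operation needs scores 1.0; an empty record scores 0. The same weighting also suggests where cataloging effort pays off most: elements used by high-probability operations.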

5. IMLS Digital Collections and Content Project
  • Promote centralized search, interoperability and
    reusability of metadata collections
  • Harvested metadata from >20 data providers,
    >150,000 Dublin Core records (and growing)
  • Data providers: small public libraries and
    historical societies, large academic libraries,
    museums, research centers
  • Records provided range from dozens to tens of
    thousands per provider
  • Interoperability and reusability require
    negotiation of global quality

5.1 Examples of Quality Problems
  • Ptolemaios son of Diodoros
  • Dioskoros Ptolemaios
  • Dioscorus. Ptolemaios
  • (variant transliteration)
  • <date>2000</date>
  • <date>1998-03-26</date>
  • (ambiguous and structurally inconsistent)
  • <publisher>New York Robert Carter,
    1846</publisher> (schema limitation led to ...)
  • . . .
  • Activity: Find, Collocate
  • Actions: Find, Identify, Select, Obtain
  • Across federated collections

5.2 Findings
  • MQ dimensions with major quality problems
  • completeness
  • redundancy
  • clarity
  • semantic inconsistency (incorrect element use)
  • structural inconsistency
  • inaccurate representation

Problem type   Incomplete   Redundant   Unclear   Incorrect use of elements   Inconsistent   Inaccurate
               100          94          78        73                          47             24
5.3 Findings
  • Correlation between consistency of element use
    and type of metadata objects and type of data
    providers (sample size: 2,000).
  • Grouping by type of objects made the standard
    deviation of the total number of elements used
    drop significantly (from 5.73 to 3.6)
  • Clustering by use of distinct DC elements
    (K-means, with 2 clusters) suggested that
    different types of institutions may use different
    numbers of distinct DC elements:
  • Academic libraries: 13
  • Public libraries: 8
  • Museums: divided

DC element    A (academic)   P (public)
Title         1              1
Creator       1              1
Subject       1              1
Description   1              1
Publisher     1              1
Contributor   0              0
Date          1              1
Type          1              0
Format        1              0
Identifier    1              1
Source        1              0
Language      1              0
Relation      1              0
Coverage      0              0
Rights        1              1
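The clustering step can be sketched as follows: each data provider is represented as a binary vector over the 15 DC elements (1 = the provider uses the element), and a plain K-means with k=2 separates academic-like profiles from public-library-like ones. This is a toy reconstruction, not the project's code; the sample provider vectors are illustrative, loosely based on the A/P profiles in the table above.

```python
# Sketch: K-means (k=2) over binary DC-element-use vectors, mirroring
# the clustering described above. Provider vectors are illustrative.

ELEMENTS = ["Title", "Creator", "Subject", "Description", "Publisher",
            "Contributor", "Date", "Type", "Format", "Identifier",
            "Source", "Language", "Relation", "Coverage", "Rights"]

# One binary vector per data provider: 1 = provider uses that element.
providers = [
    [1,1,1,1,1,0,1,1,1,1,1,1,1,0,1],  # academic-library-like (13 elements)
    [1,1,1,1,1,0,1,1,1,1,1,1,0,0,1],
    [1,1,1,1,1,0,1,0,0,1,0,0,0,0,1],  # public-library-like (8 elements)
    [1,1,1,0,1,0,1,0,0,1,0,0,0,0,1],
]

def kmeans(points, k=2, iters=20):
    """Plain k-means using the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean).
        labels = [min(range(k), key=lambda c: sum(
            (x - y) ** 2 for x, y in zip(p, centroids[c]))) for p in points]
        # Recompute each centroid as the mean of its cluster members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels

labels = kmeans(providers)
```

On this toy data the two academic-like vectors end up in one cluster and the two public-library-like vectors in the other, with distinct-element counts (13 vs. 8) matching the finding above.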
5.3 Findings
  • High complexity of metadata content is related to
    quality problems
  • Strong correlation found between Content
    Simplicity/Complexity Rate and Quality Problem
    Rate (-.434, p < .01)
  • However, no significant correlation found between
    Quality Problem Rate and Length of Metadata
    Object (.043)
  • Differences in how well standard schemas handle
    different types of original objects: lowest
    quality problem rate found for print materials

Object type   Quality problem rates
species       .00099   .00064
manuscript    .00063   .00058
photograph    .00025   .00018
art           .00016   .00015
print         .00010   .00010
6. Conclusions and Lessons Learned
  • Communities of practice may use their own
    implicit or explicit schema when sharing metadata,
    even through a standardized schema such as DC
  • Some schema elements can be more ambiguous than
    others and require qualification (e.g., Date vs. ...)
  • Ambiguity of schema elements can be a major source
    of quality problems, leading to context loss and
    element misuse
  • Inferring the native schema and comparing it to
    the destination schema can point to possible
    sources of quality problems
  • Analysis of activities can help in evaluating
    robustness and clarity of a schema
  • Mining regularities between metadata
    characteristics and quality problems can help in
    constructing robust and inexpensive metrics
  • Some metrics used in Information Retrieval
    (Infonoise, Kullback-Leibler, Average IDF) can be
    effective and scalable in assessing quality at
    the content level
  • A general-purpose dictionary-based metric was
    found robust for assessing cognitive complexity of
    metadata content
  • Structure profiles can be an effective source for
    measuring quality and predicting quality problems
    at the schema level
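One of the IR metrics named above, average IDF, is simple enough to sketch: a record whose content tokens are distinctive across the harvested corpus scores high, while boilerplate or repeated filler scores low, making it a cheap informativeness proxy at the content level. The sample records below are invented for illustration only.

```python
# Sketch: average IDF as a scalable informativeness proxy for metadata
# content, one of the IR-style metrics mentioned above. Records invented.
import math

records = [
    "portrait photograph of a farm family illinois 1905",
    "photograph of main street urbana illinois",
    "map of champaign county illinois 1893",
    "photograph photograph photograph",   # low-informativeness filler
]

def idf(term, docs):
    """Inverse document frequency of a term over the harvested records."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df) if df else 0.0

def avg_idf(text, docs):
    """Mean IDF of a record's tokens: higher means more distinctive
    content, lower suggests boilerplate or repeated filler."""
    terms = text.split()
    return sum(idf(t, docs) for t in terms) / len(terms) if terms else 0.0
```

On this sample the repeated-filler record scores well below the descriptive ones, which is the behavior that makes such metrics usable for flagging low-quality content at scale without manual review.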

Acknowledgements and Contact Information
The research was made possible by generous
support from the Institute of Museum and Library
Services (IMLS) and the UIUC Campus Research
How to contact:
Email Besiki Stvilia at