Title: Metadata Challenges National Forum for Geosciences Information Technology FGIT 2005 Annual Meeting O
1 Metadata Challenges
National Forum for Geosciences Information Technology
FGIT 2005 Annual Meeting
October 6-7, 2005
Washington, D.C.
- Sara Graves, PhD
- Director, Information Technology and Systems Center
- University Professor, Computer Science Department
- University of Alabama in Huntsville
- Director, Information Technology Research Center
- National Space Science and Technology Center
- 256-824-6064
- sgraves_at_itsc.uah.edu
- http://www.itsc.uah.edu
2 Got Metadata?
- Metadata is "data about data" [1]
- meta- (pref.): beyond, transcending, more complete. Greek, from meta: beside, after. [1]
- Is all data inherently meta?
- It is always "about" something, not the thing itself [2]
- In the absence of metadata, information becomes noise [3]
- All data can be considered metadata: what's data and what's metadata depends on who's doing the asking [4]
- There's metadata everywhere. The challenge is to harness and organize this information to enable data usability
- Metadata is information about a resource, where the resource can be data, information, a workflow or a compute resource.
- Metadata is the key to ensuring that resources will survive and continue to be accessible in the future [5]
[1] dictionary.com
[2] Metadata and the Joy of Vague Boundaries, http://www.kn.com.au/networks/2003/11/metadata_and_th.html
[3] Discussion on http://groups.yahoo.com/group/ebook-community/message/18470
[4] The End of Data? Journal of the Hyperlinked Organization, Oct 15, 2004, http://www.hyperorg.com/backissues/joho-oct15-04.html#data
[5] Understanding Metadata, National Information Standards Organization, 2004
3 Formal structured metadata
- Formal metadata is metadata that follows a standard specification providing a common set of terminology, definitions and information about the values to be provided
- Formal metadata are formally structured documentation of resources
- They describe the "who, what, where, when, why, and how" of every aspect of the resource.
- Formal metadata:
- help organize and maintain an organization's internal investment in a resource
- provide information to data catalogs, clearinghouses, search engines, etc.
- provide information to aid data transfers.
- Metadata should be recorded when the information needed for it is known, not after the fact, when important information may be lost or forgotten.
4 Do we need formal metadata?
- Google indexes every word in every document, so why bother creating metadata?
- What about science data and resources, as distinct from documents?
- Even for documents, metadata such as Dublin Core can make searches better
- Formal processes such as standards help in obtaining and maintaining complete metadata, and in automating these processes
- Formal metadata structures can yield machine-readable and operable information
5 Roles of metadata in cyberinfrastructure
- Facilitates discovery and access
- Content information for resource discovery
- Location information for resource access
- Facilitates use
- Syntactic information for resource interoperability and integration
- Semantic information for automated reasoning and analysis
- Facilitates preservation
- All these types of metadata and more (quality, provenance, etc.) support digital identification and preservation
[Diagram labels: SEARCH, ACCESS, USE]
6 Basic tension: how much metadata?
- How much metadata do typical science data producers want to provide?
- The minimum they can get away with
- How much metadata do typical science data users want available?
- As much as they can have
7 What are some metadata design principles?
- Who is the metadata for?
- End-users, scientists, students needing the data for their research or decision analysis
- What metadata schema?
- What specification should one select such that it:
- Covers the needs of the target consumers of the metadata
- Allows interoperability with other systems
- FGDC, ISO, Dublin Core Initiative
- Why use specific metadata elements?
- May not want to use a specification as is
- Need to create an application profile or extension suited to the target user community
- Deleting, modifying or adding metadata elements is based on determining their importance to the user community
8 Good metadata requires collaboration of domain science and information technology
- Information Technology Scientists
- Information Science Research
- Knowledge Management
- Data Exploitation
- Domain Scientists
- Research and Analysis
- Data Set Development
- Collaborations
- Accelerate research process
- Maximize knowledge discovery
- Minimize data handling
- Contribute to both fields
[Diagram labels: Domain Scientists, Information Scientists]
9 Six principles of good metadata
- Appropriate to the data collection, its users, and intended uses
- Supports interoperability
- Uses standard, controlled vocabularies
- Includes a clear statement on terms of use for the digital object
- Metadata is also a data object, with qualities of archivability, persistence, unique identification, etc.
- Supports long-term management of objects in the collection
Understanding Metadata, National Information Standards Organization, 2004
10 Can metadata creation tools ease the burden?
- Templates: enter metadata values into pre-set fields
- Mark-up tools: structure metadata attributes and values to a specified schema
- Extraction tools: analyze a digital (text) resource to automatically create metadata
- Note that metadata quality varies greatly depending on the content and structure of the source text.
- Conversion tools: translate metadata from one format to another
11 Can automated metadata harvesting work for science data and resources?
- Challenges to automated metadata generation include:
- How to decode metadata embedded in directory structures and file names?
- How to locate relevant external metadata?
- How to use structured metadata embedded within data files (e.g., NetCDF COARDS/CF conventions)?
- May need specialized code for each data type
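The first challenge above, decoding metadata from directory structures and file names, can be sketched in Python. The naming convention, the field names and the example path are all hypothetical, chosen only for illustration; a real archive would need its own pattern, which is why specialized code per data type is often unavoidable.

```python
import re
from datetime import date

# Hypothetical convention (an assumption, not a real archive layout):
#   <campaign>/<instrument>_<platform>_<YYYYMMDD>_v<version>.<ext>
FILENAME_PATTERN = re.compile(
    r"^(?P<instrument>[a-z0-9]+)_(?P<platform>[a-z0-9]+)_"
    r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})_v(?P<version>\d+)"
    r"\.(?P<format>\w+)$"
)


def metadata_from_path(path):
    """Return metadata decoded from a path, or None if the name does not match."""
    parts = path.strip("/").split("/")
    match = FILENAME_PATTERN.match(parts[-1])
    if match is None:
        return None
    fields = match.groupdict()
    return {
        # The parent directory is assumed to name the field campaign.
        "campaign": parts[-2] if len(parts) > 1 else None,
        "instrument": fields["instrument"],
        "platform": fields["platform"],
        "date": date(int(fields["year"]), int(fields["month"]), int(fields["day"])),
        "version": int(fields["version"]),
        "format": fields["format"],
    }


record = metadata_from_path("/archive/demo_campaign/radiometer_aircraft_20010910_v2.nc")
print(record)
```

Files that do not follow the convention return None rather than a partial record, so they can be routed to manual metadata entry.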
12 How do we discover data and resources?
- Ideally, search results should be accurate and complete
- Find only what you really want
- Find everything you really want
- Approaches to meet this challenge:
- Registries: emphasis on accuracy; structured metadata, possibly controlled vocabulary; rely on resource providers to register resources; may mandate participation of a specified user community
- Examples: GCMD, FGDC Clearinghouse
- Web crawlers: emphasis on completeness; harvest and index all available information; better for documentation than science data
- Examples: general web search engines, Mercury
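In information-retrieval terms, "find only what you really want" corresponds to precision and "find everything you really want" to recall. The sketch below computes both for two made-up result sets; the dataset identifiers and the two result sets are invented for illustration and do not describe any real registry or crawler.

```python
def precision_recall(returned, relevant):
    """Precision: share of returned items that are relevant.
    Recall: share of relevant items that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: the datasets the user really wants.
relevant = {"ds1", "ds2", "ds3"}

# Hypothetical registry: only registered datasets, so accurate but incomplete.
registry_results = {"ds1", "ds2"}

# Hypothetical crawler: finds everything, along with unrelated documents.
crawler_results = {"ds1", "ds2", "ds3", "doc7", "doc8", "doc9"}

registry_scores = precision_recall(registry_results, relevant)
crawler_scores = precision_recall(crawler_results, relevant)
print("registry:", registry_scores)
print("crawler:", crawler_scores)
```

In this toy case the registry scores higher on precision and the crawler higher on recall, which mirrors the accuracy-versus-completeness trade-off described above.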
13 Google for science data discovery
- Search metadata is document text, indexed
- Results here are primarily science data sources
14 Data discovery by browsing metadata
- Search metadata consists of a list of datasets grouped by collection
- Additional metadata includes structured documentation and browse images
15 Data discovery by browsing a flight calendar
- Search metadata includes field campaign, flight platform and date
- Additional metadata includes flight track and instrument histograms
16 Metadata Interoperability Challenges: Mediation among different schemas
- Metadata crosswalks: mapping of elements, syntax and semantics from one metadata scheme to another
- Examples: GEON, MMI
- Metadata registries: integrating resources, documenting each metadata element
- Example: EPA Environmental Data Registry
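A crosswalk can be pictured as a lookup table from source element names to target element names. The minimal sketch below maps a flat source record onto Dublin Core element names; the source field names and the mapping itself are illustrative assumptions, not an authoritative or complete crosswalk, and real crosswalks must also reconcile syntax and semantics, not just names.

```python
# Illustrative mapping (an assumption): source element name -> Dublin Core element.
CROSSWALK = {
    "origin": "creator",
    "title": "title",
    "abstract": "description",
    "pubdate": "date",
    "themekey": "subject",
}


def apply_crosswalk(record, crosswalk):
    """Translate a record; return (mapped, unmapped) so nothing is silently lost."""
    mapped, unmapped = {}, {}
    for name, value in record.items():
        target = crosswalk.get(name)
        if target is None:
            unmapped[name] = value
        else:
            mapped[target] = value
    return mapped, unmapped


# Hypothetical source record.
source_record = {
    "origin": "Example Data Center",
    "title": "Example Precipitation Dataset",
    "pubdate": "2005",
    "themekey": "precipitation",
    "procstep": "Calibrated and gridded",
}

mapped, unmapped = apply_crosswalk(source_record, CROSSWALK)
print(mapped)
print(unmapped)
```

Returning the unmapped elements separately makes the information lost in translation visible, which is one of the practical difficulties of mediating between schemas.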
17 Metadata Interoperability Challenges: Mediation among different catalogs
- Cross-system search
- Data providers support a common search API: map own search capabilities to common search attributes
- Examples: Z39.50, EOSDIS IMS
- Metadata aggregation
- Data providers support a central metadata repository: translate native metadata to a common set of core elements for aggregation
- Examples: Open Archives Initiative, ECHO
18 Metadata Interoperability Challenges: Mediation among different domain vocabularies
- A standard metadata specification provides the attributes and their definitions, BUT what about the values?
- What vocabulary should be used for these values?
- Two approaches:
- Use a controlled vocabulary
- Examples: Global Change Master Directory (GCMD), Climate and Forecasting (CF)
- Use an ontology, which not only acts as an extended controlled vocabulary but also provides context and relationships for the values
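The difference between the two approaches can be sketched as follows: a controlled vocabulary only answers whether a value is allowed, while an ontology also records how terms relate, so a search can be expanded to synonyms and narrower concepts. The terms and relationships below are invented for illustration and are not taken from GCMD, CF or any real ontology.

```python
# Hypothetical controlled vocabulary: a flat list of permitted values.
CONTROLLED_VOCABULARY = {"precipitation", "rain", "snow", "sea surface temperature"}

# Hypothetical mini-ontology: narrower-term and synonym relationships.
NARROWER = {"precipitation": ["rain", "snow"], "rain": ["drizzle"]}
SYNONYMS = {"rainfall": "rain"}


def is_valid_value(value):
    """Controlled-vocabulary check: is this value permitted?"""
    return value.lower() in CONTROLLED_VOCABULARY


def expand(term):
    """Ontology-based expansion: the preferred term plus all narrower concepts."""
    start = SYNONYMS.get(term.lower(), term.lower())
    found, stack = [], [start]
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(NARROWER.get(current, []))
    return sorted(found)


print(is_valid_value("Rain"))
print(expand("precipitation"))
print(expand("rainfall"))
```

A search for "precipitation" expanded this way would also match records tagged only with "rain", "snow" or "drizzle", which a flat vocabulary check cannot provide.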
19 Beyond data discovery: how can metadata improve data usability?
- Syntactic and semantic metadata are required for effective exchange and use of digital objects described by metadata
- Syntactic metadata describes data structures within the data object; may be stored within the data object or separately
- Examples: README files, self-describing data formats (HDF, NetCDF), ESML
- Semantic metadata attaches meaning to the data structures within the data object; can enable a common framework that allows data to be shared across application, enterprise and/or community boundaries
- Examples: publications, ontologies, Semantic Web
20 Challenge of Data Heterogeneity to Usability
- Earth Science Data Characteristics
- Many different formats, types and structures (50 and counting at NCDC alone!)
- Some formats lack metadata whereas others are metadata rich
- Enormous volumes
- Heterogeneity leads to usability problems
21 Interoperability: Accessing Heterogeneous Data
- One approach: enforce a standard data format, but
- Difficult to implement and enforce
- Can't anticipate all needs
- Some data can't be modeled or is lost in translation
- Converting legacy data is costly
- A better approach: Interchange Technologies
- Earth Science Markup Language
22 What is ESML?
- It is a specialized markup language for Earth Science metadata based on XML, NOT another data format.
- It is a machine-readable and -interpretable representation of the structure, semantics and content of any data file, regardless of data format
- ESML description files contain external metadata that can be generated by either data producer or data consumer (at collection, data set, and/or granule level)
- ESML provides the benefits of a standard, self-describing data format (like HDF, HDF-EOS, netCDF, geoTIFF, ...) without the cost of data conversion
- ESML is the basis for core Interchange Technology that allows data/application interoperability
- ESML complements and extends data catalogs such as FGDC and GCMD by providing the use/access information those directories lack.
- http://esml.itsc.uah.edu
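The idea of an external description file can be illustrated without ESML itself. In the sketch below, a separate description tells a generic reader how to interpret a headerless binary file, so the data never has to be converted. The dictionary layout, field names and scale factor are hypothetical and are NOT ESML syntax; real ESML descriptions are XML documents interpreted by the ESML library.

```python
import struct

# Hypothetical external description (not ESML syntax): byte order, record
# layout as (name, struct type code) pairs, and optional scale factors.
DESCRIPTION = {
    "byte_order": "<",  # little-endian, no padding
    "record": [("latitude", "f"), ("longitude", "f"), ("brightness_temp", "h")],
    "scale": {"brightness_temp": 0.01},
}


def read_records(raw, description):
    """Decode raw bytes into dicts using only the external description."""
    fmt = description["byte_order"] + "".join(code for _, code in description["record"])
    names = [name for name, _ in description["record"]]
    scale = description.get("scale", {})
    records = []
    for values in struct.iter_unpack(fmt, raw):
        records.append({n: v * scale.get(n, 1) for n, v in zip(names, values)})
    return records


# Stand-in for a legacy binary file: two fixed-length records.
raw = struct.pack("<ffh", 34.5, -86.5, 27315) + struct.pack("<ffh", 35.0, -87.0, 28000)

records = read_records(raw, DESCRIPTION)
print(records)
```

Supporting another binary layout means writing another description, not another reader, which is the benefit the slide attributes to interchange technologies.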
23 How might we add semantics? Example: Extending ESML with Ontologies
- ESML Schema provides structural metadata
- Extend ESML schema by embedding semantic terms in the ESML Description File to provide a complete description of the data
- Allow various science communities to create their own ontologies (for example, SWEET) and use them with ESML Description Files for their data
[Diagram labels: DATA; ESML SCHEMA; ESML FILE; ONTOLOGIES; RULES DESCRIBING THE STRUCTURE OF THE DATA; TERMS DEFINING THE MEANING OF THE DATA; SEMANTIC PARSER (INFERENCE ENGINE); CORE ESML LIBRARY; SMART APPLICATION/SERVICES]
24 Example: Noesis semantic search using ontologies
1. Enter search term
2. Review ontology search results (inheriting and synonymous concepts for the search term, with definitions and links to AMS glossaries)
3. Select terms of interest
4. Review results from different search resources
25 How to create large metadata databases?
- Organized Manual
- The original Yahoo! is the classic case of metadata created by organizing an army of people to enter data manually
- Organized Mechanical
- The original AltaVista used a program that followed links and domain names to spider the web, saving the information as it went
- Another interesting option: Volunteer Manual
26 Volunteer Metadata Creation: Will it work for geosciences?
- CDDB example
- The CDDB database has information that allows your computer to identify a particular music CD in the CD drive and list its album title and track titles.
- CDDB lets the software on your PC take that track information, send a CD signature to CDDB through Internet protocols (if you're connected) and get back the titles.
- CDDB was created by getting track timing information and the titles typed in by a volunteer.
- Only one person with each (even obscure) album needed to do this to build the database.
- Bottom line: increasing the value of the database by adding more metadata is a natural by-product of using the metadata database for one's own benefit.
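The volunteer pattern described above can be sketched as a lookup that falls back to asking the user, then stores the answer for everyone else. The signature function here is a simple hash of track lengths chosen for illustration; it is not the actual CDDB disc ID algorithm, and the class and function names are hypothetical.

```python
import hashlib


def disc_signature(track_seconds):
    """Illustrative signature from track lengths (not the real CDDB disc ID)."""
    text = ",".join(str(int(s)) for s in track_seconds)
    return hashlib.sha1(text.encode("ascii")).hexdigest()[:8]


class VolunteerDatabase:
    """Shared store of album metadata keyed by disc signature."""

    def __init__(self):
        self._entries = {}

    def lookup(self, track_seconds):
        return self._entries.get(disc_signature(track_seconds))

    def submit(self, track_seconds, album, titles):
        if len(titles) != len(track_seconds):
            raise ValueError("one title is required per track")
        self._entries[disc_signature(track_seconds)] = {
            "album": album,
            "tracks": list(titles),
        }


def identify(db, track_seconds, ask_user):
    """Return metadata for a disc; only the first user of a disc types it in."""
    entry = db.lookup(track_seconds)
    if entry is None:
        album, titles = ask_user()
        db.submit(track_seconds, album, titles)
        entry = db.lookup(track_seconds)
    return entry


db = VolunteerDatabase()
prompts = []


def ask_user():
    prompts.append("asked")
    return "Example Album", ["Track One", "Track Two"]


first = identify(db, [215, 187], ask_user)   # unknown disc: the volunteer types titles
second = identify(db, [215, 187], ask_user)  # same disc later: served from the database
print(first, len(prompts))
```

The database grows as a side effect of each user labeling a disc for their own benefit, which is the by-product effect the slide describes.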
27 Role of metadata in Open Space Discussions?
[Diagram labels: CONVERSATION; OPEN SPACE AREAS; OPEN SPACE; AREA; HOURS]