Title: Dublin Core Metadata Tutorial July 9, 2007 Stuart Weibel Senior Research Scientist OCLC Programs and Research
1. Dublin Core Metadata Tutorial
July 9, 2007
Stuart Weibel
Senior Research Scientist, OCLC Programs and Research
2. Tutorial Roadmap
- Principles of Metadata
- Dublin Core Metadata Basics
- The Dublin Core Abstract Model
- Syntax Alternatives for DC Metadata
- Mixing and Matching Metadata
- History and workings of the Dublin Core Metadata Initiative
- Acknowledgements: I have borrowed liberally from tutorial slide sets by Tom Baker, Diane Hillmann, Andy Powell, and Marty Kurth, available at dublincore.org
3. Basic Principles of Metadata
- The Web as an information system
- The Internet Commons
- Interoperability is key
- MARC lives
- The varieties of metadata
- Modularity
- Some Challenges
4. State of the Web as an Information System
- Search systems are motivated by business models, not functionality
- Index coverage is broad, but unpredictable
- Too much recall, too little precision
- Index spam abounds
- Resources (and their names) are volatile
- What about versions, editions, back issues?
- Archiving is presently unsolved
- Authority and quality of service are spotty
- Managing Intellectual Property Rights is
difficult
5. Metadata: Part of a Solution
- Structured data about other data
- helps to impose order on chaos
- enables automated discovery/manipulation
- Full-text Web indexing is the dominant idiom for search
- Metadata is more useful in structured collections, used in combination with applications designed to take advantage of structured descriptions
6. Internet Commons includes Multiple Communities
7. Interoperability requires conventions about
- Semantics
- The meaning of the elements
- Structure
- human-readable
- machine-parseable
- Syntax
- grammars to convey semantics and structure
8. Haven't we done metadata already?
The MARC family of standards is the single most
successful resource description standard in the
world
9. MARC Cataloging
- Is really MARC-AACR2 cataloging
- MARC is the communications format
- AACR2 (Anglo-American Cataloging Rules) defines the cataloging rules (semantics)
- MARC and AACR2 are evolving
- Closer alignment with XML as a syntax option
- RDA is an effort to modernize AACR2 and align it with networked environments
- RDA and Dublin Core are cooperating on alignment of a common underlying data model
10. What's wrong with this model on the Web?
- Expensive
- Complex
- Professional Catalogers required
- Bias towards bibliographic artifacts
- Fixed resources
- Incomplete handling of resource evolution and other resource relationships
- Anglo-centric
- MARC 21 accounts for ¾ of MARC records, but there
are many other varieties
11. Metadata Takes Many Forms
12. Warwick Framework: Modular Metadata
- Conceptual architecture for metadata from the Warwick Metadata Workshop (DC-2)
- Conceptual architecture to support the specification, collection, encoding, and exchange of modular metadata
- Provides context for metadata efforts (including Dublin Core)
- Avoids the black hole of comprehensive element sets
- Focuses interoperability issues at package level
- A conceptual framework, NOT an application
13. Modularity and Extensibility: the Lego metaphor
- DC is a beginning, not an end
- An architecture for modular, extensible metadata
- The simplest common denominator
- Add stuff you need for
- Local requirements
- Domain specific functionality
- Other dimensions of description
- E.g., cloud cover, management, structural metadata
14. Descriptive Metadata Standards
- IEEE LOM (Learning Object Metadata)
- Descriptive and structural metadata to support instructional systems
- ONIX (Online Information Exchange): bookseller metadata
- FGDC (Federal Geographic Data Committee): rich descriptive and structural metadata for GIS applications
- Encoded Archival Description: description of archival collections
- MPEG Multimedia Metadata: large, complicated, still in progress; descriptive, structural, rights management
- Dublin Core: core descriptive metadata
15. Metadata Creation
- Metadata is expensive and error prone
- Creating one MARC record at the Library of Congress costs about 100 USD
- Competes with indexing at 00.001 ???
- Capture it as close to point of creation as possible
- Capture as much automatically as possible
- Should be designed with close attention to the functional requirements it serves
- Re-use existing standards whenever possible
- Always tension between completeness of description, intended purpose, and cost
16. Metadata Challenges
- Accommodate multiple varieties of metadata
- Tension between functionality and simplicity
- Tension between extensibility and interoperability
- Human and machine creation and use
- Community-specific functionality, creation, administration, and access work at cross purposes to global interoperability
17. Interoperability barriers cost time and money
A common data model helps avoid this
18. Dublin Core Basics
- Design Philosophy: useful metaphors
- Language and pidgins
- Characteristics of DC metadata
- The simple bucket (properties)
- Resource Types
- Metadata grammar
- Dublin Core Principles
- One-to-one
- Dumb-down rule
- Context appropriate values
- Translations
19. Dublin Core Starting Assumptions and Essential Features
- Simple
- True to a point: the elements are simple, the underlying model is not
- Consensus-based
- Crucial to early success, both in attracting expertise and deployment
- Bottom up
- Based on the experience of practitioners, but hard to capture and capitalize on lessons learned
- Cross-disciplinary and International
- Central success factor
20. Essential Features (continued)
- The Web is the strategic application
- On the mark
- International
- Also a central success factor, but hard (20 languages in the Registry)
- Lego-like modularity and extensibility
- Partially realized promise
- Application Profiles are the means
- Syntax independence
- An ongoing nightmare (HTML, XML, RDF/XML)
- Authors will describe their own works
- Laughably naïve
21. A Pidgin for Digital Tourists
- Metadata is language
- Dublin Core is a small and simple language -- a pidgin -- for finding resources across domains
- Speakers of different languages naturally "pidginize" to communicate
- E.g., tourists using simple phrases to order beer ("zwei Bier bitte", "dva pivo", "biru o san bai"...)
- We are all "tourists" on the Internet
22. A Grammar of Dublin Core
- By design not as rich as mother tongues, but easy to learn and useful in practice
- Pidgins: small vocabularies (Dublin Core: fifteen special nouns and lots of optional adjectives)
- Simple grammars: sentences (statements) follow a simple fixed pattern...
- http://www.dlib.org/dlib/october00/baker/10baker.html
23. Basic Structures in Dublin Core Metadata
- The basic unit of metadata is a statement
- Statements consist of a property (a metadata element) and a value
- Metadata statements describe resources
- More about the Dublin Core Abstract Model later
24. What are the properties and values in the following metadata statements?
- 245 00 $a Amores perros $h [videorecording]
- <title> Nueve reinas </title>
- <type> MovingImage </type>
- Different models for conveying related information
- Dublin Core syntax fits in more naturally with the structure of the Web
25. [Diagram: the grammar of a Dublin Core statement]
Resource (implied subject) has (implied verb) property (one of 15 properties: DC:Creator, DC:Title, DC:Subject, DC:Date...) X (property value, an appropriate literal)
Qualifiers (adjectives) are optional, both on the property and on the value
26. The fifteen elements (properties)
27. Varieties of Qualifiers: Element Refinements
- Make the meaning of an element narrower or more specific
- A Date Created versus a Date Modified
- An IsReplacedBy Relation versus a Replaces Relation
- If your software does not understand the qualifier, you can safely ignore it
28. Varieties of Qualifiers: Value Encoding Schemes
- Says that the value is
- a term from a controlled vocabulary (e.g., Library of Congress Subject Headings)
- a string formatted in a standard way (e.g., "2001-05-02" means May 2, not February 5)
- Even if a scheme is not known by software, the value should be "appropriate" and usable for resource discovery
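A syntax encoding scheme tells software how to read a value string. The following is a minimal sketch, assuming the value is a complete date in the YYYY-MM-DD form shared by ISO 8601 and W3CDTF; shorter forms such as a bare year are not handled here, and the variable names are illustrative only.

```python
from datetime import date

# The example value string from the slide.
value_string = "2001-05-02"

# date.fromisoformat reads the YYYY-MM-DD form: year, then month, then day.
parsed = date.fromisoformat(value_string)

print(parsed.year, parsed.month, parsed.day)
```

Software that does not know the scheme still has a usable string; software that does know it can sort and compare the dates reliably.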
29. [Diagram: two qualified statements]
Resource has Subject "Languages -- Grammar" (encoding scheme: LCSH)
Resource has Date (refinement: Revised) "2000-06-13" (encoding scheme: ISO8601)
30. Dumb-Down Principle for Qualifiers
- Simple DC does not use element refinements or encoding schemes; statements contain only value strings
- Qualified DC uses features of the DCMI Abstract Model, including element refinements and encoding schemes
- Dumbing-down is translating Qualified DC to Simple DC
- Qualifiers refine meaning (but may be harder to understand)
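Dumbing-down can be sketched in a few lines of code. The example below is illustrative only: the dictionary of refinements, the statement layout, and the function name are assumptions made for this sketch, not part of any DCMI specification or library.

```python
# Hypothetical mapping from element refinements to their parent elements.
# A real application would take these relationships from the DCMI terms
# documentation rather than hard-code them.
REFINEMENT_PARENT = {
    "created": "date",
    "modified": "date",
    "isReplacedBy": "relation",
    "replaces": "relation",
}


def dumb_down(statements):
    """Translate qualified statements into simple (element, value string) pairs.

    Each input statement is a dict with a "property" key, a "value" key,
    and an optional "scheme" key naming an encoding scheme.
    """
    simple = []
    for stmt in statements:
        prop = stmt["property"]
        # A refinement is replaced by its parent; plain elements pass through.
        element = REFINEMENT_PARENT.get(prop, prop)
        # The encoding scheme is dropped; only the value string survives.
        simple.append((element, stmt["value"]))
    return simple


qualified = [
    {"property": "modified", "value": "2000-06-13", "scheme": "ISO8601"},
    {"property": "subject", "value": "Languages -- Grammar", "scheme": "LCSH"},
]
simple_dc = dumb_down(qualified)
print(simple_dc)
```

The result is less precise than the input, but every value remains usable for discovery, which is the point of the principle.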
31. The One-to-One Principle
- Each resource should have one metadata description
- For example, do not describe a digital image of the Mona Lisa as if it were the original painting
- Group related descriptions into description sets
- Describe an artist and his or her work separately, not in a single description
32. Appropriate Values
- There are generally tradeoffs between local requirements and global requirements
- Use elements and qualifiers to meet the needs of your local context, but
- Keep in mind that machines and people use and interpret metadata, so
- Consider whether the values used will help discovery outside your local context
33. Dublin Core as a multilingual metadata language
- Dublin Core has been translated into 20 languages
- Machine-readable tokens are shared by all
- Human-readable labels are defined in different languages
- Translations are distributed, maintained in many countries
- Eventually linked in DCMI registry
35. One token, labels in many languages
[Diagram: the token dc:creator shared among the DCMI server, a server in Germany, and a server in Jakarta]
36. Metadata languages are "multilingual"
- Metadata is not a spoken language
- The words of metadata -- "elements" -- are symbols that stand for concepts expressible in multiple natural languages
- Standards may have dozens of translations
- Are concepts like "title", "author", or "subject" used the same way in English, Finnish, and Korean?
37. DCMI Open Metadata Registry
- Managing vocabularies defined by the DCMI
- Languages
- Versioning
- Controlled vocabularies
- Foundation for modular, incremental integration and evolution
- The Registry working group is a Dublin Core Community with participants around the world
38. The Dublin Core Abstract Model
- Terminology
- Simple versus Qualified DC
- Resources
- Descriptions
- Description sets
- Value Strings
- Element refinements
- Encoding Schemes
- Graphical representation of the Abstract Model
- Summary of general ideas
39. Important DCMI Documents concerning the Abstract Model and Syntax Alternatives
- DCMI Abstract Model
- http://dublincore.org/documents/abstract-model/
- Expressing Dublin Core in HTML/XHTML meta and link elements
- http://dublincore.org/documents/dcq-html/
- Expressing Dublin Core metadata using the Resource Description Framework (RDF)
- http://dublincore.org/documents/dc-rdf/
- Expressing Dublin Core metadata using XML
- http://dublincore.org/documents/dc-xml/
40. Simple versus Qualified DC
- Simple DC supports single descriptions using the 15 base elements and value strings
- Qualified DC supports the richer features of the Abstract Model, and allows the use of all DCMI terms as well as other, non-DCMI terms
- An application profile is used to specify a metadata application that includes DCMI terms in combination with non-DCMI terms (mix and match metadata)
41. The DCMI Abstract Model
- A data model for Dublin Core
- Agreed-upon underlying structure for metadata statements
- Many years in the making -- long-term contention
- Describes the structure of statements about resources that we make in our metadata language
42. What is a resource?
- W3C definition
- Anything that has identity: an electronic document, an image, a service
- Not all resources are network retrievable, e.g., human beings, corporations, and bound books can also be considered resources
- In other words, a resource is anything we can identify
- Physical things (books, people, airplanes)
- Digital things (images, web pages, services)
- Concepts (colors, subjects, eras, places)
- In the DC context, the DCMI Type list describes the stuff we describe with DC metadata
43. Resource types for which DC is often used
DCMI Type Vocabulary:
- Collection
- Dataset
- Event
- Image
- Interactive Resource
- Moving Image
- Physical Object
- Service
- Software
- Sound
- Still Image
- Text
44. Abstract Model: Descriptions
- A description is composed of
- One or more statements about a single resource
- Optionally, the URI of the resource being described
- Each statement is made up of
- A property URI (that identifies a property)
- A value URI (that identifies a value) and/or one or more representations of the value (a value string)
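The composition rules above map naturally onto a small data structure. This is a rough sketch under stated assumptions, not the Abstract Model itself: the class and field names are invented for illustration, and the resource URI is a placeholder. The property URI uses the Dublin Core elements namespace.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Statement:
    """One statement: a property URI plus a value URI and/or value strings."""
    property_uri: str
    value_uri: Optional[str] = None
    value_strings: List[str] = field(default_factory=list)

    def __post_init__(self):
        # A statement needs a value URI, at least one value string, or both.
        if self.value_uri is None and not self.value_strings:
            raise ValueError("statement needs a value URI or a value string")


@dataclass
class Description:
    """One or more statements about a single resource; the URI is optional."""
    statements: List[Statement]
    resource_uri: Optional[str] = None

    def __post_init__(self):
        if not self.statements:
            raise ValueError("a description needs at least one statement")


description = Description(
    resource_uri="http://example.org/tutorial",  # hypothetical resource URI
    statements=[
        Statement(
            property_uri="http://purl.org/dc/elements/1.1/title",
            value_strings=["Dublin Core Metadata Tutorial"],
        ),
    ],
)
print(description.statements[0].property_uri)
```

Putting the checks in `__post_init__` keeps invalid descriptions from being built in the first place, which mirrors the "one or more" and "and/or" wording of the slide.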
45. Terminology: Value Strings
- A value string is a human-readable string that represents the value of the property
- Each value string may have an associated value string language that is an ISO language tag (e.g., pt-BR)
46. Terminology: Element Refinements
- Elements are the same as properties
- Element refinements are the same as sub-properties
- An element refinement is a special case of an element that shares the meaning of its parent, but has narrower semantics
- Paulo is illustrator of a book, therefore he is also a contributor to the book
- Illustrator is an element refinement of contributor
47. Terminology: Encoding Schemes
- Values and value strings can be qualified by encoding schemes in order to clarify their meaning
- A vocabulary encoding scheme is used to indicate a terminology set from which a value is taken
- "Stem cells--Research" is a value from LCSH
- "616.02774" is a value from DDC-22
- A syntax encoding scheme is used to indicate the structure of a value string
- "2004-10-12" is structured according to the W3CDTF rules for date encoding
48. Terminology: Description Sets
- The 1:1 principle dictates that each description describes one, and only one, resource
- We often need to describe grouped sets of descriptions, which are known in the abstract model as description sets
- An article and its authors
- A painting and its artist
- When description sets are exchanged between software applications, they are generally encoded according to a particular syntax in a metadata record
49. Abstract Model summary (after Andy Powell)
[Diagram: a record (encoded as HTML, XML, or RDF/XML) contains a description set; the description set contains resource descriptions, each with an optional URI; each description contains statements; each statement has a property (URI) and a value URI and/or value string, with an optional vocabulary encoding scheme, language (e.g., pt-BR), and syntax encoding scheme]
50. General Ideas
- DC is not just the 15 elements, though they form the foundation for Simple DC
- 50 properties (elements) have been approved by DCMI
- The model supports local declarations of additional properties
- The model supports application profiles (mixing DC elements with those of other sets)
- The model allows the grouping of descriptions to create more complex description entities
51. Syntax Alternatives
- Choosing among alternatives
- HTML
- XML
- RDF/XML
52. Syntax Alternatives: HTML, XML, RDF/XML
- Three Web-based models for deploying metadata
- Each has advantages and disadvantages
- What is best depends on local constraints
- What is the objective of the system? How do these syntax alternatives support local functional requirements?
- Are there services and software to consume the metadata created?
- Are trained practitioners available to create and support the systems?
53. Syntax Alternatives: HTML
- Advantages
- Simple META tags embedded in content
- Widely deployed tools and knowledge
- Resource carries its metadata around with it
- Metadata is openly harvestable
54. Syntax Alternatives: HTML (continued)
- Disadvantages
- Limited structural richness (does not support hierarchical, tree-structured data)
- Management of metadata is less reliable (the metadata is out in the wild)
- Describe one thing (the HTML document) and no more!
55. Dublin Core in HTML (example)
<head>
<link rel="schema.DC" href="http://purl.org/dc">
<meta name="DC.title" content="DC Metadata Tutorial">
<meta name="DC.creator" content="Stuart L. Weibel">
<meta name="DC.subject" xml:lang="en-US" content="Metadata">
<meta name="DC.date" scheme="DCTERMS.W3CDTF" content="2007-07-08">
<meta name="DCTERMS.audience" content="technical librarians">
</head>
<body> rest of html document
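Embedded META tags like these are usually generated rather than typed by hand. A minimal sketch, assuming the metadata is held as (name, content) pairs; the function name `dc_meta_tags` is hypothetical, and attribute values are escaped so that quotation marks in a title cannot break the markup.

```python
from html import escape


def dc_meta_tags(pairs):
    """Render (name, content) pairs as HTML <meta> elements, one per line."""
    lines = []
    for name, content in pairs:
        # quote=True also escapes double and single quotes in attribute values.
        lines.append(
            '<meta name="{}" content="{}">'.format(
                escape(name, quote=True), escape(content, quote=True)
            )
        )
    return "\n".join(lines)


tags = dc_meta_tags([
    ("DC.title", "DC Metadata Tutorial"),
    ("DC.creator", "Stuart L. Weibel"),
])
print(tags)
```

The sketch covers only the name and content attributes; scheme and language attributes would need additional parameters.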
56. The namespaces for HTML encoding
- All DCMI terms (elements, element refinements, and encoding schemes) are found in
- DCMI Metadata Terms
- http://dublincore.org/documents/dcmi-terms/
- The namespaces are a result of historical developments
- DC: original elements
- DCTERMS: later elements
57. Syntax Alternatives: XML
- XML: eXtensible Markup Language
- The standard for networked text and data
- Wide-spread tool support
- Parsers are widely available
- Extensibility (XML namespaces)
- Type definitions (XML Schema)
- Transformation and Rendering (XSLT)
- Rich linking semantics (XLINK)
58. XML Schema
- Rich XML-based language for expressing data-type semantics
- Replaces arcane and limited DTD (origin in SGML)
- Facilities
- Data typing (both complex and primitive)
- Constraints (ranges, cardinality)
- Defaults (specify defaults for certain properties)
59. Dublin Core fragment in XML
<metadata xmlns:dc="http://www.openarchives.org/OAI/dc.xsd">
<dc:creator>Carl Lagoze</dc:creator>
<dc:title>Accommodating Simplicity and Complexity in Metadata</dc:title>
<dc:date>2000-07-01</dc:date>
<dc:publisher>Cornell University, Computer Science</dc:publisher>
</metadata>
Where is the rest of the stuff? In the schema!
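Because parsers are widely available, reading such a fragment takes only a few lines. The following sketch uses Python's standard-library ElementTree and the namespace URI exactly as it appears on the slide; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace URI as written in the slide's example.
DC_NS = "http://www.openarchives.org/OAI/dc.xsd"

record = """<metadata xmlns:dc="http://www.openarchives.org/OAI/dc.xsd">
  <dc:creator>Carl Lagoze</dc:creator>
  <dc:title>Accommodating Simplicity and Complexity in Metadata</dc:title>
  <dc:date>2000-07-01</dc:date>
  <dc:publisher>Cornell University, Computer Science</dc:publisher>
</metadata>"""

root = ET.fromstring(record)
statements = []
for child in root:
    # ElementTree reports qualified names in the form "{namespace}local".
    namespace, _, local_name = child.tag[1:].partition("}")
    if namespace == DC_NS:
        statements.append((local_name, child.text))

print(statements)
```

Each child element becomes one (property, value string) pair, which is the statement structure described earlier in the tutorial.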
60. Case Study: OAI-PMH (OAI Protocol for Metadata Harvesting)
- Open Archives Initiative
- http://www.openarchives.org
- Simple protocol for sharing metadata records
- Based on HTTP, XML, XML Schema, and XML namespaces
- Allows a harvester to query a remote repository for some or all of its metadata records
- DC is the default native metadata format in the OAI protocol
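A harvester's query is an ordinary HTTP request with a few parameters. As a sketch only: the repository base URL below is a placeholder and the helper name is hypothetical, while `ListRecords` is an OAI-PMH verb and `oai_dc` is the metadata prefix for unqualified Dublin Core. The code builds the URL and does not contact any server.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a repository's base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # The optional "set" argument restricts harvesting to part of the repository.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical repository address, used only for illustration.
url = list_records_url("http://repository.example.org/oai")
print(url)
```

The response to such a request is an XML document whose records carry Dublin Core descriptions, which ties the protocol back to the XML syntax discussed above.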
61. Syntax Alternatives: RDF
- RDF (Resource Description Framework)
- Syntax expressed in XML
- W3C recommendation for encoding metadata (a Semantic Web technology)
- Enabling technology for richly structured metadata
- Rich data model (the DC Abstract Model is a constrained version of RDF)
- Metadata can be shared easily among independent applications that understand RDF
- W3C Resource Description Framework (RDF)
- http://www.w3.org/RDF/
62. Summary: Syntax alternatives
- Choices should be driven by local requirements and objectives
- Available expertise
- Costs of deployment
- Objectives and functional requirements
63. Association Models: Where do we keep the metadata?
- Embedded
- HTML META tags, XML, or RDF/XML can be embedded in the resource, and hence travel with the resource
- Simple, but limited in structural richness
- Loosely coupled
- Shadow files (like Adobe's XMP sidecar files)
- Requires a system to manage them and ensure that they stay in sync
- RDF or XML descriptions
- Third-party metadata
- Stored in repositories such as library catalogs
- Easier to manage and maintain, and provide service
- Library catalogs, for example
64. Questions about syntax alternatives?
65. Application Profiles: Mixing and Matching Metadata
- What is an Application Profile?
- Why bother?
- Creating new properties
- Documenting and declaring new properties
- Some examples
66. Application Profiles: Mixing and Matching Metadata
- The mixing and matching of elements (properties) from separate metadata sets
- An expression of metadata modularity
- Implementers can benefit from peer applications
- Communities can harmonize their metadata, picking complementary properties
- Promotes convergence over time
- For application profiles to work, there must be public declarations of properties that conform to a common data model (or nearly so)
67. Application Profile: Definition
- Declaration of metadata properties used in a given organization, application, or community
- Documentation of encodings, constraints, and creation guidelines
- Implies formal schemas (XML schemas or RDF schemas)
- Should promote both human understanding and machine interoperability
- The concept of application profiles applies to any metadata community of practice, not just DC
- DC has promoted their use and leads by example
68. Why bother?
- One-size-fits-all metadata results in bloated, unmanageable specifications and applications
- APs allow tailoring a given metadata application to match the element set to specific functional requirements based on local or community needs, while retaining interoperability with a larger metadata community
69. Creating an Application Profile
- Find out what others have done: don't re-invent wheels!
- Develop community consensus
- Define name, label, definition, and relationships (see the DCMI Usage Board guidelines)
- Determine an appropriate URI (a home on the Web)
- Dublin Core Application Profile Guidelines
- http://dublincore.org/usage/documents/profile-guidelines/
70. Document New Properties
- At the very least: a Web page with relevant information
- Better: a Web page with a public schema using new terms in an application profile
- Better still: all properties available as part of a metadata registry
71. Example Application Profiles
- DC-Library AP
- DC-Collection Description AP
- DC-Government AP
- DC-Education AP
72. Some History of the Dublin Core and How the Initiative Works
- The Beginnings
- Landmarks
- Workshops and Conference series
- What the initiative does
- Standardization
- Some example applications
73. Dublin Core: The Beginning
- A casual discussion at WWW-2 in Chicago, October of 1994
- How to make things on the Web easier to find?
- OCLC and NCSA co-sponsored an invitational workshop in March of 1995
- The workshop became a workshop series, and eventually a conference series
- DCMI: Dublin Core Metadata Initiative
- Governance and process evolved over time
- De facto standards maintenance body
74. Dublin Core Landmarks
- 1994: Simple tags to describe Web pages
- 1995: The Dublin Core is one of many vocabularies needed ("Warwick Framework")
- 1996: The Dublin Core's 13 elements expanded to 15, appropriate for Text and Images
- 1997: WF needs formal expression in a Resource Description Framework (RDF)
75. Dublin Core Landmarks (continued)
- 2000: Dublin Core Metadata Initiative recommends qualifiers, broadens its organizational scope beyond the Core
- 2001: Workshop series becomes a conference series
- DCMI Affiliates and a board of trustees
- 2005: Abstract Model (finally)
76. The Dublin Core Workshop Series
- Workshop Venues
- US: DC 1, 3, 6
- UK: DC 2
- Australia: DC 4
- Finland: DC 5
- Germany: DC 7
- Canada: DC 8
- Conferences
- Tokyo (2001), Florence (2002), Seattle (2003), China (2004), Spain (2005), Mexico (2006)
78. DCMI Activities
- Standards development and maintenance
- Metadata registry and infrastructure
- Technical working groups and periodic workshops
- Tutorial materials and user guides
- Education and training
- Open source software
- Liaisons with other standards or user communities
79. Governance of DCMI
- DCMI has a Board of Trustees that oversees the operation and goals of the initiative
- Managing Director
- Makx Dekkers
- Director of Specifications and Documentation
- Tom Baker
- An Advisory Board of metadata experts provides guidance on metadata issues
80. The DCMI Usage Board
- The Usage Board is an editorial committee that evaluates proposals for new elements or revisions
- International selection of metadata experts
- Meets twice yearly
- Documents decisions and updates the DCTERMS document
81. Affiliate Program
- DCMI has National Affiliates which support the Initiative and are represented on the Board of Trustees
- Finland
- UK
- Singapore
- New Zealand
- Korea
- OCLC has been the Host from the start
82. The Three I's
- Independent: DCMI is not controlled by specific commercial or other interests, is not biased towards specific domains, and does not mandate specific technical solutions
- International: DCMI encourages participation from organizations anywhere in the world, respecting linguistic and cultural differences
- Influenceable: DCMI is an open organization aiming at building consensus among the participating organizations; there are no prerequisites for participation
83. The Work gets done by Communities and task groups
- Accessibility Community
- Collection Description Community
- Education Community
- Environment Community
- Global Corporate Circle
- Government Community
- Kernel Community
- Libraries Community
- Localization and Internationalization Community
- Preservation Community
- Registry Community
- Social Tagging Community
- Standards Community
- Tools Community
84. Standardization of the Dublin Core
- IETF RFC 2413
- http://www.ietf.org/rfc/rfc2413.txt
- CEN Workshop Agreement (Europe)
- Endorses Dublin Core elements as CWA13874
- NISO Z39.85
- National Information Standards Organization, an ANSI affiliate
- ISO 15836
85. Metadata Applications: examples
- Governments
- 7 governments have adopted DC metadata
- Adobe products
- XMP: Adobe's variant of RDF
- Dublin Core is a base schema
- IPTC (International Press Telecommunications Council)
- Dublin Core based standard for journalism
- Knowledge management systems commonly use DC metadata
- Visual materials require metadata for findability
- Library systems (mostly MARC cataloging, but increasingly other metadata as well)
86. Metadata Applications (continued)
- Search Systems
- Full-text indexing is enormously useful
- Structured metadata improves search
- The Amazoogles are all aggressively courting metadata aggregators
- Cameras
- Automatically create metadata for each image
- Some even include GPS data
- Commerce systems require metadata
- Social software applications are largely about enriching resource information with tags, reviews, and automated linking
87. To Sum Up
- Many purpose-built metadata standards
- Few have explicit data models
- Few interoperate
- Some will survive, others will not
- The Web demands convergence
- Break down silos between domains and communities of practice
- RDF should help promote convergence, but we are not there yet
- Expect more metadata standards, but hope for fewer
88. How to Participate
- Join the DC-General mailing list
- Join a working group
- Information on lists and working groups is available at http://dublincore.org
89. Stuart L. Weibel
Visit me at http://weibel-lines.typepad.com
Contact me at Weibel_at_oclc.org
Thank you for your attention