Markup and Metadata - PowerPoint PPT Presentation

Provided by: csdl9
Transcript and Presenter's Notes

Title: Markup and Metadata


1
Markup and Metadata
  • How to Build a Digital Library
  • Ian H. Witten and David Bainbridge

2
Digital Library Elements
  • Basic Elements of Organization
  • Markup
  • Controls structure and appearance
  • Metadata
  • Expedites access

3
Structural Markup
  • Identify and maintain the document structure
  • Section divisions
  • Headings
  • Subsection structure
  • Lists
  • Quotations
  • Tabular information
  • Structural markup items become metadata

4
Presentation Markup
  • Specify how the document will appear
    typographically by formatting the document
  • Page size
  • Headers and footers
  • Font
  • Line spacing
  • Section headers
  • Figures

5
Kinds of Metadata
  • Assist navigation
  • Structural markup
  • Resource discovery
  • Metadata to assist in finding documents through
    searching and browsing
  • Value of digital libraries depends on how easily
    information can be located
  • Policy
  • Define rights, restrictions, and rules that
    govern who can do what with digital resources
  • Administration and Preservation
  • Information necessary to preserve the integrity
    and functionality of a digital resource long term

6
Explicit versus Extracted Metadata
  • Explicit Metadata
  • Requires careful analysis of a document
  • Takes 1-2 hours to create a traditional library
    catalog entry (or 5 minutes, depending on number
    of fields!)
  • Extracted Metadata
  • Text Mining
  • Automatically obtained from the contents of a
    document
  • Cheaper, but less reliable

7
HTML
  • Hypertext Markup Language
  • Document format of the World Wide Web
  • Original vision: separate document structure
    from presentation
  • Inconsistent formatting and metadata conventions
    in HTML can hinder automatic processing of
    document collections

8
Basic HTML
  • Angle brackets enclose words
  • <title>My Story</title>
  • Tag names are not case sensitive

9
HTML Tags
  • <p> Paragraph
  • <tr> Table Row
  • <td> Table Cell
  • <li> List item
  • <img> Images
  • <i> Italics
  • <ul> Unordered List, Bulleted List
  • <a> .. </a> Link Anchor

10
HTML
  • Opening Tags
  • Attributes
  • Special Markers
  • Header
  • Gives global information
  • Title, encoding scheme, metadata
  • Body
  • ASCII / UTF-8 Unicode
  • Local link anchors
  • Navigation within a single document
  • Forms
  • Collect data from user
  • Frames
  • HTML document can be tiled into smaller,
    independent segments (each an HTML page)
  • Frameset: a set of frames that can be displayed
    simultaneously (useful for navigation bars)

11
HTML in Digital Libraries
  • Many source documents are presented in HTML form
  • Explicit specification of metadata using <meta>
    tags
  • Extract text
  • The plain-text browser Lynx can extract text from
    HTML documents
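
A minimal sketch of both ideas, assuming Python's standard html.parser rather than Lynx: it collects name/content pairs from meta tags and gathers the visible text. The class MetaAndText, the function extract, the sample page, and the author name in it are all invented for illustration; the "DC.creator" naming follows a common convention for embedding Dublin Core in HTML.

```python
from html.parser import HTMLParser


class MetaAndText(HTMLParser):
    """Gather <meta name=... content=...> pairs and visible text (illustrative)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self.text_parts = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs and "content" in attrs:
            self.metadata[attrs["name"]] = attrs["content"]
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def extract(html_text):
    """Return (metadata dict, whitespace-normalized text) for an HTML page."""
    parser = MetaAndText()
    parser.feed(html_text)
    parser.close()
    text = " ".join("".join(parser.text_parts).split())
    return parser.metadata, text


SAMPLE = """<html>
<head>
<title>My Story</title>
<meta name="DC.creator" content="A. N. Author">
</head>
<body>
<h1>Chapter One</h1>
<p>Once upon a <i>time</i>.</p>
</body>
</html>"""

metadata, text = extract(SAMPLE)
print(metadata)
print(text)
```

This is far cruder than a real text browser: it ignores layout, tables, and links, and only shows where explicit metadata and plain text come from.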

12
XML
  • Extensible Markup Language
  • Flexible way to characterize document structure
    and metadata
  • Well suited to digital libraries
  • Widespread use

13
XML Document Type Definition
  • DTD: Document Type Definition
  • Tag syntax: <!...>
  • Keywords in block capitals
  • Square brackets indicate that the DTD appears
    in-line
  • Otherwise, the DTD can be in an external file
  • Referred to by a URL
  • Desirable
  • New elements
  • Keyword ELEMENT
  • Tag name
  • Description of what element may contain
  • A Leaf
  • An element that is plain text, with no markup
  • Declared as #PCDATA (parsed character data)
  • Special Characters
  • Encoded as in HTML (&lt;, &amp;, etc.)

14
XML Regular Expressions
  • Regular expression
  • Comma indicates an ordered sequence
  • Vertical bar indicates a choice of one element
    from the alternatives
  • Asterisk indicates zero or more
  • Plus indicates one or more
  • Question mark indicates zero or one

15
XML Attributes
  • Attributes
  • Give set of possible values
  • No nesting
  • Keyword ATTLIST
  • Element to which it applies
  • Attribute name
  • Attribute type
  • Appearance restrictions (optional)

16
XML Entities
  • Entities
  • &lt;, &amp;, &gt;, &apos;, &quot;
  • New entities can be added in the DTD
  • Use syntax
  • ENTITY
  • Name
  • Value
  • Example: <!ENTITY howto "How to Build a Digital
    Library">
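
To see an entity declaration at work, the sketch below places the slide's howto entity in an in-line DTD and parses a small document with Python's standard xml.etree.ElementTree. The book and title elements are assumptions made for the example; the underlying parser expands internally declared entities along with the predefined ones.

```python
import xml.etree.ElementTree as ET

# The DTD sits in-line between square brackets; book and title are
# hypothetical element names chosen for this example.
DOCUMENT = """<!DOCTYPE book [
  <!ELEMENT book (title)>
  <!ELEMENT title (#PCDATA)>
  <!ENTITY howto "How to Build a Digital Library">
]>
<book><title>&howto; &amp; more</title></book>"""

root = ET.fromstring(DOCUMENT)
title_text = root.find("title").text
print(title_text)
```

The parsed title contains the replacement text of howto, and the predefined &amp; entity comes back as a literal ampersand.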

17
XML Parameter Entity
  • Several elements share the same attributes
  • Parameter Entity
  • Special type of entity
  • Percent symbol

18
Well Formed and Valid XML
  • Well Formed
  • A document that conforms to XML syntax but does
    not supply a DTD (Document Type Definition)
  • Valid
  • A document that conforms to XML syntax and does
    supply a DTD
  • The content follows the syntactic constraints
    defined in the DTD
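
The distinction can be tried out with a non-validating parser. The sketch below, assuming Python's standard xml.etree.ElementTree, checks well-formedness only: it reads an in-line DTD but does not enforce it, so checking validity would need a validating parser from outside the standard library. The function name is_well_formed and the sample documents are invented for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML; the DTD, if any, is not enforced."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


GOOD = "<book><title>My Story</title></book>"
BAD = "<book><title>My Story</book></title>"  # tags closed in the wrong order

# Well formed but not valid: the DTD demands a title, the document has a chapter.
INVALID = "<!DOCTYPE book [<!ELEMENT book (title)>]><book><chapter/></book>"

print(is_well_formed(GOOD), is_well_formed(BAD), is_well_formed(INVALID))
```

The third document passes this check even though it breaks its own DTD, which is exactly the gap between well formed and valid.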

19
Parsing XML
  • Parsing indicates whether the document conforms
    to the general rules of XML (or the specific DTD,
    when applicable)
  • Parsing produces a parse tree
  • Begins with a root node
  • Root node has descendants
  • Descendants reflect text content and nested tags
  • Programming Interface
  • Lets user traverse the tree and retrieve the data
  • API: Application Programming Interface
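
A small sketch of walking a parse tree, assuming Python's standard xml.etree.ElementTree as the programming interface. The function name outline and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return one indented line per node, starting from the given element."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (": " + text if text else "")]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(
    "<book><title>My Story</title>"
    "<chapter><para>Once upon a time.</para></chapter></book>"
)
print("\n".join(outline(root)))
```

The indentation in the printed outline mirrors the nesting of tags, with book as the root node and the text content attached to the leaves.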

20
XML DOM
  • Document Object Model
  • Application Program Interface (API)
  • Cross-platform
  • Cross-language
  • Allows programs to be written that access and
    modify the documents
  • Content
  • Structure
  • Style
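
As a rough sketch of DOM-style access, the example below uses xml.dom.minidom, the lightweight DOM implementation in Python's standard library, to read and change content and to alter structure. The sample document and the author name are invented for illustration.

```python
from xml.dom.minidom import parseString

doc = parseString("<book><title>My Story</title></book>")

# Access content: the text node is the first child of the title element.
title = doc.getElementsByTagName("title")[0]
original = title.firstChild.data

# Modify content.
title.firstChild.data = "My Other Story"

# Modify structure: append a new author element to the root.
author = doc.createElement("author")
author.appendChild(doc.createTextNode("A. N. Author"))
doc.documentElement.appendChild(author)

result = doc.documentElement.toxml()
print(original)
print(result)
```

The same method names (getElementsByTagName, createElement, appendChild) appear in DOM bindings for other languages, which is the cross-language point made on the slide.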

21
XML and Digital Libraries
  • XML is powerful
  • XML allows file formats within a digital library
    to be shared
  • Structure explanations are put in a DTD (Document
    Type Definition)
  • XML provides syntax for expressing structural
    information and metadata
  • XML goes further by combining with other
    standards
  • Support document restructuring, querying,
    information extraction and formatting
  • Can have display capabilities similar to HTML

22
Style Sheets
  • Control the presentation of marked-up documents
  • Two Kinds of Style Sheets
  • Cascading Style Sheets
  • Work with HTML and XML
  • Extensible Stylesheet Language (XSL)
  • Works with XML
  • Powerful
  • Allows document structure to be altered
    dynamically

23
Bibliographic Metadata
  • Two Standards for Representing Document Metadata
  • Machine-Readable Cataloging (MARC)
  • Used by professional catalogers for use in
    libraries
  • The Dublin Core
  • Minimal standard used by people who are not
    trained in library cataloging
  • Two metadata formats used by document authors in
    scientific and technical fields
  • BibTeX
  • Refer

24
MARC
  • Machine-Readable Cataloging
  • Internally stored as collection of tagged fields
  • Format covers
  • Bibliographic records
  • Authority records: standardized forms that are
    part of the librarian's controlled vocabulary
  • Governed by AACR2R
  • Anglo-American Cataloguing Rules, Second Edition,
    Revised
  • Detailed set of rules and guidelines
  • Two Parts
  • Part 1: Description of Documents
  • Part 2: Description of Works

25
Dublin Core
  • Set of metadata elements
  • Simple - designed for non-specialists
  • Intended for electronic materials that will not
    receive a full MARC catalog entry
  • Named after Dublin, Ohio
  • The first meeting was held there in 1995
  • Approved by ANSI (American National Standards
    Institute) in 2001

26
Dublin Core
  • Fifteen metadata elements form the core element
    set
  • May be refined through qualifiers
  • May be augmented by additional elements for local
    purposes
  • Resource
  • Anything that has identity
  • Similar to entity (objectives of bibliographic
    system)
  • Does not impose any kind of vocabulary control or
    authority files
  • Two people might generate very different
    descriptions of the same resource

27
Dublin Core Metadata Standard
  • Title
  • Creator
  • Subject
  • Description
  • Publisher
  • Contributor
  • Date
  • Type
  • Format
  • Identifier
  • Source
  • Language
  • Relation
  • Coverage
  • Rights

28
BibTeX
  • Manages bibliographic data and references within
    documents
  • TeX
  • Generalized document-processing system
  • Scientific, Mathematical and Technical Purposes
  • LaTeX
  • Customized Version of TeX
  • Freely available
  • BibTeX
  • Subsystem of LaTeX

29
Refer
  • Similar to BibTeX
  • Designed by computer scientists for use by
    scientific and technical researchers
  • Basis of EndNote
  • Bibliographic tool which augments Microsoft Word

30
Metadata for Images and Multimedia
  • Metadata is not confined to text
  • Most image files include data about resolution
  • PNG can store text strings
  • Image metadata is usually kept separate from the
    image file
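
The PNG point can be illustrated with the standard library alone. The sketch below builds a tiny PNG-shaped byte stream containing one tEXt chunk and then reads the keyword and text back. The helper names are invented, and the sample stream has no image data chunk, so it demonstrates the chunk layout only and is not a displayable image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Assemble a chunk: length, type, data, then a CRC over type and data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def read_text_chunks(png_bytes):
    """Return keyword/text pairs from all tEXt chunks (CRCs are not verified)."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG stream")
    texts = {}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png_bytes):
        (length,) = struct.unpack(">I", png_bytes[pos:pos + 4])
        chunk_type = png_bytes[pos + 4:pos + 8]
        data = png_bytes[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            # A tEXt chunk holds a keyword, a zero byte, then Latin-1 text.
            keyword, _, value = data.partition(b"\x00")
            texts[keyword.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return texts


# Header for a 1x1, 8-bit greyscale image, followed by one text chunk.
sample = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + make_chunk(b"tEXt", b"Title\x00My Story")
    + make_chunk(b"IEND", b"")
)
print(read_text_chunks(sample))
```

The same reader could be pointed at the bytes of a real PNG file, where text chunks, if present, typically carry fields such as a title or author.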

31
Metadata for Images and Multimedia
  • Two Metadata Formats
  • TIFF
  • Tagged Image File Format
  • Associates metadata with image files
  • Widespread use for over a decade
  • How images are stored in digital libraries
  • Normal images
  • Document images
  • MPEG-7
  • Multimedia Content Description Interface
  • Scheme to define and store metadata associated
    with any multimedia information
  • General, extensible, and still being standardized

32
Extracting Metadata
  • Text Mining
  • Automatic extraction of information from text
  • Plain text documents
  • Require text comprehension skills
  • Computer techniques for text analysis
  • Good results in constrained domains
  • XML and other Structured Markup Languages
  • Make key aspects of documents available to
    computers and people
  • Encoded information can easily be extracted by
    parsing the document structure
  • Few documents contain explicitly encoded metadata

33
General Techniques
  • Extracting Document Metadata
  • Title, Author, Publisher, Date, etc.
  • Generic Entities
  • Email, URLs, Dates, Time, Money
  • Bibliography Entries
  • Citation analysis

34
Key Phrase Metadata
  • Key-phrase metadata can successfully be obtained
    automatically from documents
  • Two Different Approaches
  • Key-Phrase Assignment
  • Key-Phrase Extraction

35
Generating Phrase Hierarchies
  • Key phrases consist of a few well-chosen words
    that characterize the document
  • It is useful to extract a structure that contains
    ALL the phrases in the documents
  • Hierarchical structure of phrases can support
    browsing around a digital library collection