Content Types: Text and Metadata - PowerPoint PPT Presentation

About This Presentation
Title:

Content Types: Text and Metadata

Slides: 24
Provided by: CSDL
Learn more at: http://www.csdl.tamu.edu

Transcript and Presenter's Notes


1
Content Types: Text and Metadata
2
Introduction
  • Text documents come in many forms
  • Article (news, conference, journal, etc.)
  • Email, memo, etc.
  • Book, manual, manuscript, transcript, etc.
  • Any part of one of the above
  • Syntax can express
  • Structure
  • Presentation style
  • Semantics (e.g. software code)

3
Metadata
  • Metadata: data about data
  • Descriptive metadata
  • External to meaning of document
  • Author, publication date, document source,
    document length, document genre, file type, bits
    per second, frame rate, etc.
  • Semantic metadata
  • Characterizes semantic content of document
  • Library of Congress (LoC) subject headings, keywords,
    subject headings from ontologies (e.g. MeSH), etc.

4
Metadata Formats
  • Machine-Readable Cataloging (MARC) record
  • Used by most libraries
  • Fields include title, author, etc.
  • Resource Description Framework (RDF)
  • Used for Web resources
  • Node and attribute / value pairs
  • Node ID is any Uniform Resource Identifier (URI),
    which could be a URL

5
Metadata Sets
  • Dublin Core Metadata Elements
  • Contributor: entities contributing to the
    content
  • Coverage: extent or scope of content (spatial
    area, temporal period, etc.)
  • Creator: entity primarily responsible for making
    the content
  • Date: date associated with an event (e.g.
    publication) for the resource
  • Description: abstract, table of contents, etc.
  • Format: media (file) type, dimensions (size,
    duration), hardware needed
  • Identifier: unique identifier
  • Language: language of content
  • Publisher: entity responsible for making the
    resource available
  • Relation: reference to related resource(s)
  • Rights: information about rights held in/over
    the resource
  • Source: resource from which content is derived
  • Subject: keywords, key phrases, classification
    code, etc.
  • Title: name of the resource
  • Type: nature or genre of content
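
As a rough illustration of the element set above, a record can be sketched as attribute/value pairs attached to a node identified by a URI, in the spirit of the RDF description on the previous slide. The URI and every value below are invented for the example, and `to_triples` is a hypothetical helper, not part of any RDF library.

```python
# Hypothetical Dublin Core record for a made-up resource.
RESOURCE_URI = "http://example.org/docs/report-1"  # assumed node ID (a URI)

record = {
    "title": "Annual Report",
    "creator": "A. Author",
    "date": "1998-10-02",
    "format": "text/html",
    "language": "en",
    "subject": ["metadata", "digital libraries"],  # repeated element
}


def to_triples(node_id, attributes):
    """Flatten a record into (node, attribute, value) triples."""
    triples = []
    for name, value in attributes.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            triples.append((node_id, "dc:" + name, v))
    return triples


triples = to_triples(RESOURCE_URI, record)
for triple in triples:
    print(triple)
```

The `dc:` prefix is only a naming convention chosen for this sketch; a real RDF serialization would bind it to a namespace.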

6
Text Formats
  • Coding schemes
  • EBCDIC (8 bit, one of the first coding schemes)
  • ASCII (initially 7 bit, extended to 8 bit)
  • Unicode (originally 16 bit, for large alphabets)
  • Additional Formats
  • RTF (format-oriented document exchange)
  • PDF and PostScript (display-oriented
    representation)
  • Multipurpose Internet Mail Extensions (MIME)
    (multiple character sets, languages, media)
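
To make the coding schemes concrete, the sketch below encodes the same character under three of them using Python's built-in codecs. The choice of `cp037` as the EBCDIC code page and of little-endian UTF-16 is an assumption made for the example; other variants exist.

```python
text = "A"

ascii_bytes = text.encode("ascii")   # 7-bit ASCII code for "A" is 0x41
ebcdic_bytes = text.encode("cp037")  # cp037 is one EBCDIC code page: 0xC1

# U+00E9 (e with acute accent) as one 16-bit code unit, little-endian.
utf16_bytes = "\u00e9".encode("utf-16-le")

print(ascii_bytes.hex(), ebcdic_bytes.hex(), utf16_bytes.hex())

# Characters outside the 7-bit range cannot be encoded as ASCII.
try:
    "\u00e9".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```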

7
Information Theory
  • How can we predict information value of
    components of a document?
  • Entropy attempts to model information content
    (information uncertainty)
  • $E = -\sum_{i} p_i \log_2 p_i$, summed over all
    symbols $i$ in the alphabet
  • $p_i$ is the probability of symbol $i$ (symbol
    frequency over total number of symbols)
  • Need a text model for real language
  • Also important for compression, as $E$ acts as a
    limit on how much a text can be compressed.
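
The entropy formula above can be sketched in a few lines, assuming the simplest text model in which each symbol's probability is its relative frequency in the text itself. The function name `entropy` is chosen for this example.

```python
import math
from collections import Counter


def entropy(text):
    """Zero-order entropy in bits per symbol.

    Computes sum(p_i * log2(1 / p_i)), which equals -sum(p_i * log2(p_i)).
    """
    counts = Counter(text)
    total = len(text)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(entropy("aaaa"))      # a single symbol: no uncertainty
print(entropy("abab"))      # two equally likely symbols
print(entropy("abcdabcd"))  # four equally likely symbols
```

Under this model, a text of `n` symbols needs about `n * entropy(text)` bits, which is the compression limit the slide refers to.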

8
Modeling Character Strings
  • Symbols in natural language (NL) are not evenly distributed
  • Some symbols are not part of words (often used
    for syntax)
  • Symbols in words are not evenly distributed
  • Models
  • Binomial model uses distribution of symbols in
    language
  • But previous symbols influence probabilities of
    later symbols
  • (what letter will appear after a q?)
  • Finite context or Markovian models used for this
    dependency
  • k-order where k is the number of previous
    characters taken into account by the model
  • Thus, the binomial model is a 0-order model
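
A minimal sketch of a k-order finite-context model follows: it counts which symbol follows each context of k previous characters. The sample text is invented, and `build_model` and `next_symbol_probs` are names chosen for this example.

```python
from collections import Counter, defaultdict


def build_model(text, k):
    """Count which symbol follows each context of k previous symbols."""
    model = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]  # empty string when k == 0
        model[context][text[i]] += 1
    return model


def next_symbol_probs(model, context):
    """Relative frequency of each symbol seen after the given context."""
    counts = model[context]
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}


sample = "the queen quietly quit the quiz"  # toy text for illustration
order0 = build_model(sample, 0)  # ignores previous symbols
order1 = build_model(sample, 1)  # conditions on one previous character

print(next_symbol_probs(order0, "").get("u"))  # overall share of "u"
print(next_symbol_probs(order1, "q"))          # what follows a "q"?
```

In this toy text every "q" is followed by "u", so the 1-order model is far more certain about the next symbol than the 0-order model.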

9
Word Distribution in Documents
  • How frequent are words within documents?
  • Zipf's Law
  • Frequency of the $i$th most frequent word is
    $1/i^{\theta}$ times the frequency of the most frequent word
  • The value of $\theta$ depends on the text (value of
    1 is logarithmic distribution)
  • $\theta$ values of 1.5 to 2.0 best model real texts
  • In practice, a few hundred words make up 50% of
    most texts
  • Frequent words provide less information
  • Thus, many search strategies involve ignoring
    stopwords (a, an, the, is, of, by, etc.)
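
The two ideas on this slide can be sketched as follows: a function giving the frequency Zipf's Law predicts for a rank, and a simple stopword filter. The top frequency of 1000 is an assumed figure, and the stopword set contains only the slide's own examples.

```python
def zipf_frequency(rank, top_frequency, theta=1.0):
    """Predicted frequency of the word at a given rank (rank 1 = most frequent)."""
    return top_frequency / rank ** theta


STOPWORDS = {"a", "an", "the", "is", "of", "by"}  # the slide's examples only


def remove_stopwords(words):
    """Drop words that appear in the stopword set, ignoring case."""
    return [w for w in words if w.lower() not in STOPWORDS]


top = 1000  # assumed count of the most frequent word
for rank in (1, 2, 10, 100):
    print(rank, zipf_frequency(rank, top, theta=1.0))

filtered = remove_stopwords("The value of theta depends on the text".split())
print(filtered)
```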

10
Word Distribution in Collections
  • Simplest to assume uniform distribution of words
    in documents
  • But not true
  • Better models built on negative binomial
    distributions or Poisson distributions

11
Vocabulary Size for Documents and Collections
  • Heaps' Law
  • Vocabulary size ($V$) grows with number of words
    ($n$)
  • $V = K n^{\beta}$
  • Experimentally,
  • $K$ is between 10 and 100
  • $\beta$ is between 0.4 and 0.6
  • So vocabulary grows roughly in proportion to the
    square root of the size of the document or
    collection in words
  • Works best for large documents and collections
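
A small sketch of the Heaps' Law estimate, with parameter values picked from the middle of the experimental ranges quoted on the slide. The defaults `K=50` and `beta=0.5` are assumptions for illustration, not fitted values.

```python
def heaps_vocabulary(n, K=50.0, beta=0.5):
    """Estimated vocabulary size V = K * n**beta for a text of n words."""
    return K * n ** beta


for n in (10_000, 1_000_000, 100_000_000):
    print(n, round(heaps_vocabulary(n)))
```

With `beta` at 0.5, a text 100 times longer has a vocabulary only about 10 times larger, which is the square-root growth the slide describes.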

12
String Similarity Models
  • Similarity is measured by a distance function
  • Hamming distance: number of positions at which
    two strings differ
  • Levenshtein distance: minimum number of
    insertions, deletions, and substitutions needed
    to make strings equal
  • color to colour is 1
  • survey to surgery is 2
  • Can be extended to documents
  • UNIX diff treats each line as a character
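
Both distances can be sketched briefly. The Levenshtein function below uses the standard dynamic-programming formulation, which is one common way to compute it; the function names are chosen for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("color", "colour"))
print(levenshtein("survey", "surgery"))
```

Because the functions only compare elements, they also accept lists of lines, which is the sense in which a line can be treated as a character when comparing documents.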

13
Content Types: Markup and Multimedia
14
Introduction
  • Markup languages use extra textual syntax to
    encode
  • Formatting / display information
  • Structure information
  • Descriptive metadata
  • Semantic metadata
  • Marks are often called tags
  • The act of adding markup is called tagging
  • Most markup languages use initial and ending tags
    surrounding the marked text

15
Standard Generalized Markup Language (SGML)
  • Metalanguage for markup
  • Includes rules for defining markup languages
  • Use of SGML includes
  • Description of structure of markup
  • Text marked with tags
  • Document Type Definition (DTD)
  • Describes and names tags and how they are related
  • Comments used to express interpretation of tags
    (meaning, presentation, etc.)

16
SGML DTD Example
    <!-- SGML DTD for electronic messages -->
    <!ELEMENT e-mail - - (prolog, contents) >
    <!ELEMENT prolog - - (sender, address+, subject?, Cc*) >
    <!ELEMENT (sender | address | subject | Cc) - O (#PCDATA) >
    <!ELEMENT contents - - (par | image | audio)+ >
    <!ELEMENT par - O (ref | #PCDATA)+ >
    <!ELEMENT ref - O EMPTY >
    <!ELEMENT (image | audio) - - (#NDATA) >
    <!ATTLIST e-mail
        id ID #REQUIRED
        date_sent DATE #REQUIRED
        status (secret | public) public >
    <!ATTLIST ref
        id IDREF #REQUIRED >
    <!ATTLIST (image | audio)
        id IDREF #REQUIRED >

17
SGML Example
    <!DOCTYPE e-mail SYSTEM "e-mail.dtd">
    <e-mail id="94108rby" date_sent="02101998">
    <prolog>
    <sender> Pablo Neruda </sender>
    <address> Federico Garcia Lorca </address>
    <address> Ernest Hemingway </address>
    <subject> Picture of my house in Isla
    <Cc> Gabriel Garcia Marquez </Cc>
    </prolog>
    <contents>
    <par> Here are two photos. One is of the view
    (photo <ref idref="F2">). </par>
    <image id="F1"> photo1.gif </image>
    <image id="F2"> photo2.jpg </image>
    </contents>
    </e-mail>
18
SGML Characteristics
  • DTD provides ability to determine if a given
    document is valid, i.e. conforms to the declared
    structure.
  • SGML generally does not specify
    presentation/appearance.
  • Output specification standards
  • DSSSL (Document Style Semantics and Specification
    Language)
  • FOSI (Formatting Output Specification Instance)

19
HyperText Markup Language (HTML)
  • Based on SGML
  • HTML DTD not explicitly referenced by documents
  • HTML documents can have documents embedded within
    them
  • Images or audio
  • Frames with other HTML documents
  • When programs are included, it is referred to as
    Dynamic HTML
  • Strict HTML includes only non-presentational
    markup.
  • Cascading Style Sheets (CSS) used to define
    presentation
  • In reality, presentational and structural markup
    are blended by HTML authoring applications.

20
(Original) HTML Limitations
  • In contrast to SGML
  • Users cannot specify their own tags or
    attributes.
  • No support for nested structures that can
    represent database schemas or object-oriented
    hierarchies.
  • No support for validation of document by
    consuming applications.

21
eXtensible Markup Language (XML)
  • XML is a simplified subset of SGML
  • XML is a meta-language
  • XML designed for semantic markup that is both
    human and machine readable
  • No DTD is required
  • All tags must be closed
  • Extensible Stylesheet Language (XSL)
  • XML equivalent of CSS
  • Can be used to convert XML into HTML and CSS
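
The points that no DTD is required and that all tags must be closed can be checked with Python's standard XML parser. The two sample documents are invented for the example, and `is_well_formed` is a name chosen here.

```python
import xml.etree.ElementTree as ET

well_formed = "<e-mail><sender>Pablo Neruda</sender></e-mail>"
not_closed = "<e-mail><sender>Pablo Neruda</e-mail>"  # <sender> never closed


def is_well_formed(document):
    """True if the XML parser accepts the document (no DTD involved)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(well_formed))
print(is_well_formed(not_closed))
```

This only tests well-formedness; checking a document against a DTD or schema is a separate validation step that this parser does not perform.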

22
Multimedia
  • Lots of data file formats for non-textual data
  • Images
  • BMP, GIF, JPEG (JPG), TIFF
  • Audio
  • AU, MIDI, WAVE, MP3
  • Video
  • MPEG, AVI, QuickTime
  • Graphics / Virtual Environments
  • CGM, VRML, OpenGL

23
Audio and Video
  • Data files often have
  • Header
  • Indicates time granularity, number of channels,
    bits per channel
  • Somewhat like a DTD
  • Data
  • The signal
  • Data may be compressed
  • Data may be in frequency domain rather than time
    domain
  • Data may be encoded as sequence of differences
    between consecutive time segments.
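
The last point, encoding data as differences between consecutive time segments, can be sketched as a simple delta encoder. The sample values are made up, and real audio and video formats combine this idea with further compression steps.

```python
def delta_encode(samples):
    """Keep the first sample, then store each change from the previous one."""
    if not samples:
        return []
    deltas = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original samples by accumulating the differences."""
    samples = []
    total = 0
    for d in deltas:
        total += d
        samples.append(total)
    return samples


signal = [100, 102, 103, 103, 101]  # made-up sample values
encoded = delta_encode(signal)
print(encoded)
print(delta_decode(encoded) == signal)
```

When a signal changes slowly, the differences are small numbers that need fewer bits than the raw samples.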