Content Types: Text and Metadata - PowerPoint PPT Presentation

About This Presentation

Title:

Content Types: Text and Metadata

Slides: 24

Provided by: CSDL

Learn more at: http://www.csdl.tamu.edu

Transcript and Presenter's Notes

1
Content Types: Text and Metadata
2
Introduction

Text documents come in many forms
Article (news, conference, journal, etc.)
Email, memo, ...
Book, manual, manuscript, transcript, ...
Any part of one of the above
Syntax can express
Structure
Presentation style
Semantics (e.g. software code)

3
Metadata

Metadata: data about data
Descriptive metadata
External to meaning of document
Author, publication date, document source,
document length, document genre, file type, bits
per second, frame rate, etc.
Semantic metadata
Characterizes semantic content of document
Library of Congress (LoC) subject headings, keywords, subject headings
from ontologies (e.g. MeSH), etc.

4
Metadata Formats

Machine Readable Cataloging Record (MARC)
Used by most libraries
Fields include title, author, etc.
Resource Description Framework (RDF)
Used for Web resources
Node and attribute / value pairs
Node ID is any Uniform Resource Identifier (URI),
which could be a URL

5
Metadata Sets

Dublin Core Metadata Elements
Contributor: entities contributing to the content
Coverage: extent or scope of content (spatial area, temporal period, ...)
Creator: entity primarily responsible for making the content
Date: date associated with an event (e.g. publication) for the resource
Description: abstract, table of contents, ...
Format: media (file) type, dimensions (size, duration), hardware needed
Identifier: unique identifier
Language: language of content
Publisher: entity responsible for making the resource available
Relation: reference to related resource(s)
Rights: information about rights held in/over the resource
Source: resource from which content is derived
Subject: keywords, key phrases, classification code, etc.
Title: name of the resource
Type: nature or genre of content
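
As a rough illustration of how the two previous slides fit together, the sketch below stores a few Dublin Core elements as RDF-style (node, attribute, value) triples. The resource URI and the field values are made up for illustration, and the `Triple` and `describe` names are not part of any standard; only the Dublin Core element namespace is the published one.

```python
from typing import Dict, List, NamedTuple

# Published namespace for the Dublin Core element set.
DC = "http://purl.org/dc/elements/1.1/"


class Triple(NamedTuple):
    node: str       # URI identifying the resource being described
    attribute: str  # URI of the metadata element
    value: str      # value of that element for this resource


def describe(uri: str, fields: Dict[str, str]) -> List[Triple]:
    """Turn a mapping of Dublin Core element names into triples."""
    return [Triple(uri, DC + name, value) for name, value in fields.items()]


# Hypothetical record: the URI and values are illustrative only.
record = describe(
    "http://example.org/slides/content-types",
    {
        "title": "Content Types: Text and Metadata",
        "format": "application/vnd.ms-powerpoint",
        "language": "en",
    },
)
```

Each triple can be read as "this node has this attribute with this value", which is the node and attribute/value structure described for RDF.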

6
Text Formats

Coding schemes
EBCDIC (8 bit, one of the first coding schemes)
ASCII (initially 7 bit, extended to 8 bit)
Unicode (originally 16 bit, for large alphabets; now stored as UTF-8, UTF-16, or UTF-32)
Additional Formats
RTF (format-oriented document exchange)
PDF and PostScript (display-oriented
representation)
Multipurpose Internet Mail Extensions (MIME)
(multiple character sets, languages, media)
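
To make the differences between coding schemes concrete, the sketch below counts how many bytes a string needs under a few encodings. It assumes Python's built-in codecs, using `cp037` as one representative EBCDIC code page; the function name `encoded_sizes` is just for illustration.

```python
from typing import Dict, Optional


def encoded_sizes(text: str) -> Dict[str, Optional[int]]:
    """Number of bytes needed to store `text` under several encodings.

    A value of None means the encoding cannot represent the text at all.
    """
    sizes: Dict[str, Optional[int]] = {}
    for codec in ("ascii", "cp037", "utf-8", "utf-16-le"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            sizes[codec] = None
    return sizes


# The same letter has different byte values in ASCII and EBCDIC.
ascii_a = "A".encode("ascii")   # byte 0x41
ebcdic_a = "A".encode("cp037")  # byte 0xC1
```

Plain English text takes one byte per character in ASCII, EBCDIC, and UTF-8, and two in UTF-16, while text in a large alphabet cannot be stored in the single-byte schemes at all.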

7
Information Theory

How can we predict information value of
components of a document?
Entropy attempts to model information content
(information uncertainty)
$E = -\sum_{i} p_i \log_2 p_i$, summed over all symbols $i$ in the alphabet
$p_i$ is the probability of symbol $i$ (symbol
frequency over number of symbols)
Need a text model for real language
Also important for compression as E acts as a
limit of how much a text can be compressed.
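
A minimal sketch of the entropy calculation, assuming the simplest text model in which each symbol's probability is its observed frequency in the text itself. It computes the negated sum of p log2 p over the distinct symbols; the function name is illustrative.

```python
import math
from collections import Counter


def entropy(text: str) -> float:
    """Entropy in bits per symbol, from observed symbol frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum p * log2(p) over distinct symbols, then negate.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A text made of one repeated symbol has entropy 0, and a text using four symbols equally often has entropy 2 bits per symbol, so under this model it cannot be coded in fewer than about 2 bits per symbol on average.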

8
Modeling Character Strings

Symbols in NL are not evenly distributed
Some symbols are not part of words (often used
for syntax)
Symbols in words are not evenly distributed
Models
Binomial model uses distribution of symbols in
language
But previous symbols influence probabilities of
later symbols
(what letter will appear after a q?)
Finite context or Markovian models used for this
dependency
k-order where k is the number of previous
characters taken into account by the model
Thus, the binomial model is a 0-order model
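
The finite-context idea can be sketched as follows: a k-order model counts, for every context of k characters, which symbols follow it. The function names and the tiny training string are illustrative, and no smoothing is applied for unseen contexts.

```python
from collections import Counter, defaultdict
from typing import DefaultDict


def train_model(text: str, k: int) -> DefaultDict[str, Counter]:
    """Map each k-character context to counts of the symbols that follow it."""
    model: DefaultDict[str, Counter] = defaultdict(Counter)
    for i in range(k, len(text)):
        model[text[i - k:i]][text[i]] += 1
    return model


def next_symbol_probability(model, context: str, symbol: str) -> float:
    """Estimated probability that `symbol` follows `context`."""
    followers = model.get(context)
    if not followers:
        return 0.0  # context never seen in training
    return followers[symbol] / sum(followers.values())


sample = "quick quiet queen"
order0 = train_model(sample, 0)  # context is always the empty string
order1 = train_model(sample, 1)  # context is the previous character
```

With k = 0 every symbol shares the empty context, so the model reduces to plain symbol frequencies; with k = 1 the sample already captures that "u" always follows "q".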

9
Word Distribution in Documents

How frequent are words within documents?
Zipf's Law
Frequency of the $i$-th most frequent word is
$1/i^{\theta}$ times the frequency of the most frequent word
The value of $\theta$ depends on the text (value of
1 is logarithmic distribution)
$\theta$ values of 1.5 to 2.0 best model real texts
In practice, a few hundred words make up 50% of
most texts
Frequent words provide less information
Thus, many search strategies involve ignoring
stopwords (a, an, the, is, of, by, ...)
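
As a small sketch of the two ideas on this slide, the code below predicts a word's frequency from its rank under Zipf's law and filters out stopwords. The stopword set is just the examples listed on the slide, and the function names are illustrative.

```python
from typing import Iterable, List

# Only the examples from the slide; real stopword lists are much longer.
STOPWORDS = {"a", "an", "the", "is", "of", "by"}


def zipf_frequency(rank: int, top_frequency: float, theta: float = 1.0) -> float:
    """Predicted frequency of the word at `rank` (1 = most frequent)."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    return top_frequency / rank ** theta


def remove_stopwords(words: Iterable[str]) -> List[str]:
    """Drop words that appear in STOPWORDS, ignoring case."""
    return [w for w in words if w.lower() not in STOPWORDS]
```

For example, if the most frequent word occurs 1000 times, the model with theta = 2 predicts 10 occurrences for the word of rank 10.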

10
Word Distribution in Collections

Simplest to assume uniform distribution of words
in documents
But not true
Better models built on negative binomial
distributions or Poisson distributions

11
Vocabulary Size for Documents and Collections

Heaps' Law
Vocabulary size ($V$) grows with number of words
($n$)
$V = K n^{\beta}$
Experimentally,
$K$ is between 10 and 100
$\beta$ is between 0.4 and 0.6
So vocabulary grows roughly in proportion to the
square root of the size of the document or
collection in words
Works best for large documents and collections
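
A quick sketch of Heaps' law as a formula, V = K * n^beta. The default values K = 50 and beta = 0.5 are arbitrary picks from inside the experimental ranges above, not fitted constants.

```python
def heaps_vocabulary(n: int, k: float = 50.0, beta: float = 0.5) -> float:
    """Estimated vocabulary size for a text of n words under Heaps' law."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return k * n ** beta
```

With beta = 0.5, quadrupling the text length only doubles the estimated vocabulary, which is the square-root growth the slide describes.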

12
String Similarity Models

Similarity is measured by a distance function
Hamming distance: number of positions at which two
strings of equal length differ
Levenshtein distance: minimum number of
insertions, deletions, and substitutions needed
to make strings equal
color to colour is 1
survey to surgery is 2
Can be extended to documents
UNIX diff treats each line as a character
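
Both distances can be sketched in a few lines. The Levenshtein function below uses the standard dynamic-programming recurrence, keeping only two rows of the table; because it only compares elements for equality, it also accepts lists of lines, in the spirit of the diff remark (this is not a claim about how diff itself is implemented).

```python
from typing import Sequence


def hamming(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, 1):
        curr = [i]
        for j, item_b in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                        # delete item_a
                curr[j - 1] + 1,                    # insert item_b
                prev[j - 1] + (item_a != item_b),   # substitute or match
            ))
        prev = curr
    return prev[-1]
```

The two examples from the slide check out: `levenshtein("color", "colour")` is 1 and `levenshtein("survey", "surgery")` is 2.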

13
Content Types: Markup and Multimedia
14
Introduction

Markup languages use extra textual syntax to
encode
Formatting / display information
Structure information
Descriptive metadata
Semantic metadata
Marks are often called tags
The act of adding markup is called tagging
Most markup languages use initial and ending tags
surrounding the marked text

15
Standard Generalized Markup Language (SGML)

Metalanguage for markup.
Includes rules for defining markup language
Use of SGML includes
Description of structure of markup
Text marked with tags
Document Type Definition (DTD)
Describes and names tags and how they are related
Comments used to express interpretation of tags
(meaning, presentation, ...)

16
SGML DTD Example

<!-- SGML DTD for electronic messages -->
<!ELEMENT e-mail - - (prolog, contents)>
<!ELEMENT prolog - - (sender, address+, subject?, Cc*)>
<!ELEMENT (sender | address | subject | Cc) - O (#PCDATA)>
<!ELEMENT contents - - (par | image | audio)+>
<!ELEMENT par - O (ref | #PCDATA)+>
<!ELEMENT ref - O EMPTY>
<!ELEMENT (image | audio) - - (#NDATA)>
<!ATTLIST e-mail
    id ID #REQUIRED
    date_sent DATE #REQUIRED
    status (secret | public) public>
<!ATTLIST ref
    id IDREF #REQUIRED>
<!ATTLIST (image | audio)
    id IDREF #REQUIRED>

17
SGML Example
<!DOCTYPE e-mail SYSTEM "e-mail.dtd">
<e-mail id="94108rby" date_sent="02101998">
<prolog>
<sender> Pablo Neruda </sender>
<address> Federico Garcia Lorca </address>
<address> Ernest Hemingway </address>
<subject> Picture of my house in Isla
<Cc> Gabriel Garcia Marquez </Cc>
</prolog>
<contents>
<par> Here are two photos. One is of the view (photo <ref idref="F2">). </par>
<image id="F1"> photo1.gif </image>
<image id="F2"> photo2.jpg </image>
</contents>
</e-mail>
18
SGML Characteristics

DTD provides ability to determine if a given
document is well-formed.
SGML generally does not specify
presentation/appearance.
Output specification standards
DSSSL (Document Style Semantics and Specification
Language)
FOSI (Formatting Output Specification Instance)

19
HyperText Markup Language (HTML)

Based on SGML
HTML DTD not explicitly referenced by documents
HTML documents can have documents embedded within
them
Images or audio
Frames with other HTML documents
When programs are included, it is referred to as
Dynamic HTML
Strict HTML includes only non-presentational
markup.
Cascading Style Sheets (CSS) used to define
presentation
In reality, presentational and structural markup
are blended by HTML authoring applications.

20
(Original) HTML Limitations

In contrast to SGML
Users cannot specify their own tags or
attributes.
No support for nested structures that can
represent database schemas or object-oriented
hierarchies.
No support for validation of document by
consuming applications.

21
eXtensible Markup Language (XML)

XML is a simplified subset of SGML
XML is a meta-language
XML designed for semantic markup that is both
human and machine readable
No DTD is required
All tags must be closed
Extensible Stylesheet Language (XSL)
XML equivalent of CSS
Can be used to convert XML into HTML and CSS
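
The rule that all tags must be closed can be checked without any DTD. The sketch below uses the XML parser from Python's standard library to test well-formedness and to read an attribute; the sample markup is a cut-down, XML-style version of the e-mail example and is illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """True if the text parses as XML (tags properly nested and closed)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


closed = '<e-mail id="94108rby"><sender>Pablo Neruda</sender></e-mail>'
unclosed = '<e-mail id="94108rby"><sender>Pablo Neruda</e-mail>'

# No DTD is needed to read the structure of a well-formed document.
message = ET.fromstring(closed)
sender = message.find("sender").text
message_id = message.get("id")
```

Note that the SGML version of the e-mail example omits some end tags, which its DTD permits; the same text would be rejected by an XML parser.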

22
Multimedia

Lots of data file formats for non-textual data
Images
BMP, GIF, JPEG (JPG), TIFF
Audio
AU, MIDI, WAVE, MP3
Video
MPEG, AVI, QuickTime
Graphics / Virtual Environments
CGM, VRML, OpenGL

23
Audio and Video

Data files often have
Header
Indicates time granularity, number of channels,
bits per channel
Somewhat like a DTD
Data
The signal
Data may be compressed
Data may be in frequency domain rather than time
domain
Data may be encoded as sequence of differences
between consecutive time segments.
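
The last point, encoding data as differences between consecutive segments, can be sketched as simple delta coding over integer samples. The function names and sample values are illustrative; real audio and video formats use more elaborate schemes.

```python
from itertools import accumulate
from typing import List, Sequence


def delta_encode(samples: Sequence[int]) -> List[int]:
    """Keep the first sample, then store each sample as a difference from the previous one."""
    if not samples:
        return []
    deltas = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas: Sequence[int]) -> List[int]:
    """Rebuild the original samples by summing the differences cumulatively."""
    return list(accumulate(deltas))
```

When a signal changes slowly, the differences are small numbers, which need fewer bits and compress better than the raw samples.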