Title: Content Types: Text and Metadata
1 Content Types: Text and Metadata
2 Introduction
- Text documents come in many forms
- Article (news, conference, journal, etc.)
- Email, memo, etc.
- Book, manual, manuscript, transcript, etc.
- Any part of one of the above
- Syntax can express
- Structure
- Presentation style
- Semantics (e.g. software code)
3 Metadata
- Metadata: data about data
- Descriptive metadata
- External to meaning of document
- Author, publication date, document source, document length, document genre, file type, bits per second, frame rate, etc.
- Semantic metadata
- Characterizes semantic content of document
- Library of Congress (LoC) subject headings, keywords, subject headings from ontologies (e.g. MeSH), etc.
4 Metadata Formats
- Machine-Readable Cataloging (MARC) records
- Used by most libraries
- Fields include title, author, etc.
- Resource Description Framework (RDF)
- Used for Web resources
- Nodes described by attribute/value pairs
- Node ID is any Uniform Resource Identifier (URI), which could be a URL
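As a minimal sketch of the node and attribute/value idea, the Python below stores statements about one resource as (node, attribute, value) tuples. The resource URI, the sample values, and the helper `describe` are hypothetical and chosen only for illustration; this is not an RDF library API.

```python
# Hypothetical resource URI, used only for illustration.
DOC = "http://example.org/docs/report-1"
# Namespace of the Dublin Core element set, used here for attribute names.
DC = "http://purl.org/dc/elements/1.1/"

# Each statement is a (node, attribute, value) tuple.
triples = [
    (DOC, DC + "title", "Annual Report"),
    (DOC, DC + "creator", "A. Author"),
    (DOC, DC + "date", "1998-10-02"),
]

def describe(node, statements):
    """Collect the attribute/value pairs attached to one node."""
    return {attr: value for subj, attr, value in statements if subj == node}

props = describe(DOC, triples)
print(props[DC + "title"])
```

Because the node ID is just a URI, statements from different sources about the same resource can be merged by concatenating their tuple lists.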
5 Metadata Sets
- Dublin Core Metadata Elements
- Contributor: entities contributing to the content
- Coverage: extent or scope of content (spatial area, temporal period, etc.)
- Creator: entity primarily responsible for making the content
- Date: date associated with event (e.g. publication) for resource
- Description: abstract, table of contents, etc.
- Format: media (file) type, dimensions (size, duration), hardware needed
- Identifier: unique identifier
- Language: language of content
- Publisher: entity responsible for making resource available
- Relation: reference to related resource(s)
- Rights: information about rights held in/over resource
- Source: resource from which content is derived
- Subject: keywords, key phrases, classification code, etc.
- Title: name of the resource
- Type: nature or genre of content
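To make the element set concrete, the sketch below represents a Dublin Core description as a plain Python dictionary and checks it against the fifteen element names listed above. The record contents and the helper `unknown_elements` are hypothetical, for illustration only.

```python
# The fifteen Dublin Core element names from the list above.
DUBLIN_CORE_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

def unknown_elements(record):
    """Return, sorted, the keys of a record that are not Dublin Core elements."""
    return sorted(k for k in record if k.lower() not in DUBLIN_CORE_ELEMENTS)

# Hypothetical record describing an imaginary report.
record = {
    "title": "Annual Report",
    "creator": "A. Author",
    "date": "1998-10-02",
    "format": "application/pdf",
    "language": "en",
    "subject": "metadata; information retrieval",
}

print(unknown_elements(record))
```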
6 Text Formats
- Coding schemes
- EBCDIC (8 bit, one of the first coding schemes)
- ASCII (initially 7 bit, extended to 8 bit)
- Unicode (originally 16 bit, designed for large alphabets)
- Additional formats
- RTF (format-oriented document exchange)
- PDF and PostScript (display-oriented representation)
- Multipurpose Internet Mail Extensions (MIME) (multiple character sets, languages, media)
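The difference between coding schemes shows up as soon as the same text is encoded to bytes. The sketch below uses Python's built-in codecs; `cp037` is one EBCDIC code page among several, and the sample word and the helper `fits_ascii` are arbitrary choices for illustration.

```python
def fits_ascii(s):
    """Return True if every character of s is representable in 7-bit ASCII."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

text = "naïve"  # arbitrary sample containing one non-ASCII character

utf8_bytes = text.encode("utf-8")       # variable-width Unicode encoding
utf16_bytes = text.encode("utf-16-be")  # two bytes per character for this text

ascii_a = "A".encode("ascii")   # the letter A in ASCII
ebcdic_a = "A".encode("cp037")  # the same letter in one EBCDIC code page

print(fits_ascii("naive"), fits_ascii(text))
print(len(utf8_bytes), len(utf16_bytes))
print(ascii_a.hex(), ebcdic_a.hex())
```

The same letter maps to different byte values in ASCII and EBCDIC, which is why a document's coding scheme has to be known before its bytes can be interpreted as text.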
7 Information Theory
- How can we predict the information value of components of a document?
- Entropy attempts to model information content (information uncertainty)
- $E = -\sum_{i} p_i \log_2 p_i$, summed over all symbols $i$ in the alphabet
- $p_i$ is the probability of symbol $i$ (symbol frequency over number of symbols)
- Need a text model for real language
- Also important for compression, as $E$ acts as a limit on how much a text can be compressed (a lower bound on the average number of bits per symbol under that model)
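The entropy formula can be sketched directly in Python, assuming the simplest text model in which each symbol's probability is its relative frequency in the text itself. The function name `entropy` is an illustrative choice.

```python
from collections import Counter
from math import log2

def entropy(text):
    """Entropy in bits per symbol, using observed symbol frequencies as p_i."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Two equally likely symbols need one bit each; four need two bits each.
print(entropy("abab"))
print(entropy("abcd"))
```

A text that repeats a single symbol has entropy 0 under this model, which matches the intuition that it carries no uncertainty and compresses extremely well.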
8 Modeling Character Strings
- Symbols in natural language (NL) are not evenly distributed
- Some symbols are not part of words (often used for syntax)
- Symbols in words are not evenly distributed
- Models
- Binomial model uses the distribution of symbols in the language
- But previous symbols influence the probabilities of later symbols (what letter will appear after a q?)
- Finite-context or Markovian models are used for this dependency
- k-order, where k is the number of previous characters taken into account by the model
- Thus, the binomial model is a 0-order model
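A k-order model can be sketched by counting which symbol follows each context of k characters. The function names `train` and `probability` and the sample text are hypothetical; with k = 0 the context is always empty, so the model reduces to plain symbol frequencies.

```python
from collections import Counter, defaultdict

def train(text, k):
    """Count, for every context of k characters, the symbols that follow it."""
    model = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]
        model[context][text[i]] += 1
    return model

def probability(model, context, symbol):
    """Estimated probability of symbol given the preceding context."""
    counts = model.get(context)
    if not counts:
        return 0.0
    return counts[symbol] / sum(counts.values())

sample = "the quick queen quietly quit"  # arbitrary sample text
order0 = train(sample, 0)
order1 = train(sample, 1)

# The 0-order model only knows how common "u" is overall;
# the 1-order model knows that "u" always follows "q" in this sample.
print(probability(order0, "", "u"))
print(probability(order1, "q", "u"))
```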
9 Word Distribution in Documents
- How frequent are words within documents?
- Zipf's Law
- Frequency of the $i$th most frequent word is $1/i^{\theta}$ times the frequency of the most frequent word
- The value of $\theta$ depends on the text ($\theta = 1$ gives a logarithmic distribution)
- $\theta$ values of 1.5 to 2.0 best model real texts
- In practice, a few hundred words make up 50% of most texts
- Frequent words provide less information
- Thus, many search strategies involve ignoring stopwords (a, an, the, is, of, by, etc.)
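The sketch below puts Zipf's Law next to observed counts: `zipf_frequency` computes the frequency the law predicts for a given rank, and `rank_frequencies` lists the actual counts of a word sequence in rank order. Both function names and the sample numbers are hypothetical.

```python
from collections import Counter

def zipf_frequency(rank, top_frequency, theta=1.0):
    """Frequency predicted for the word at a given rank (rank 1 = most frequent)."""
    return top_frequency / rank ** theta

def rank_frequencies(words):
    """Observed word counts, ordered from most to least frequent."""
    return [count for _, count in Counter(words).most_common()]

# If the most frequent word occurs 1000 times, a larger theta makes
# the predicted counts fall off much faster with rank.
print(zipf_frequency(10, 1000, theta=1.0))
print(zipf_frequency(10, 1000, theta=2.0))
print(rank_frequencies("a b a c a b".split()))
```

Fitting theta to the observed list of a real text is one way to check how well the law describes that text.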
10 Word Distribution in Collections
- Simplest to assume a uniform distribution of words in documents
- But not true
- Better models are built on negative binomial distributions or Poisson distributions
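As a rough illustration of the Poisson option, the sketch below computes the probability that a word occurs exactly k times in a document, assuming its occurrences follow a Poisson distribution with mean `lam` per document. The function name and the mean of 0.5 are hypothetical values chosen for the example.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k occurrences when the mean count is lam."""
    return exp(-lam) * lam ** k / factorial(k)

lam = 0.5  # hypothetical mean number of occurrences per document

# Probability that a document contains the word zero, one, or two times.
for k in range(3):
    print(k, poisson_pmf(k, lam))
```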
11 Vocabulary Size for Documents and Collections
- Heaps' Law
- Vocabulary size ($V$) grows with number of words ($n$)
- $V = K n^{\beta}$
- Experimentally,
- $K$ is between 10 and 100
- $\beta$ is between 0.4 and 0.6
- So vocabulary grows roughly in proportion to the square root of the size of the document or collection in words
- Works best for large documents and collections
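A short sketch of Heaps' Law as a function, with default constants picked from inside the experimental ranges quoted above. The specific defaults (K = 50, beta = 0.5) and the function names are assumptions for illustration, not measured values.

```python
def heaps_vocabulary(n, k=50.0, beta=0.5):
    """Vocabulary size predicted by Heaps' Law for a text of n words."""
    return k * n ** beta

def observed_vocabulary(words):
    """Actual number of distinct words in a sequence, for comparison."""
    return len(set(words))

# With beta = 0.5, quadrupling the text length only doubles the vocabulary.
print(heaps_vocabulary(1_000_000))
print(heaps_vocabulary(4_000_000))
print(observed_vocabulary("to be or not to be".split()))
```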
12 String Similarity Models
- Similarity is measured by a distance function
- Hamming distance: number of positions at which two equal-length strings differ
- Levenshtein distance: minimum number of insertions, deletions, and substitutions needed to make strings equal
- color to colour is 1
- survey to surgery is 2
- Can be extended to documents
- UNIX diff treats each line as a character
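Both distances can be sketched in a few lines of Python. The Levenshtein version below is the standard dynamic-programming formulation that keeps only one previous row. It works on any sequences, so passing lists of lines instead of strings gives a line-level distance in the spirit of the diff remark above, although it is not a reimplementation of diff's own algorithm.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, 1):
        curr = [i]
        for j, item_b in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                        # delete item_a
                curr[j - 1] + 1,                    # insert item_b
                prev[j - 1] + (item_a != item_b),   # substitute or match
            ))
        prev = curr
    return prev[-1]

print(levenshtein("color", "colour"))
print(levenshtein("survey", "surgery"))
print(levenshtein(["line a", "line b"], ["line a", "line c", "line b"]))
```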
13 Content Types: Markup and Multimedia
14 Introduction
- Markup languages use extra textual syntax to encode
- Formatting / display information
- Structure information
- Descriptive metadata
- Semantic metadata
- Marks are often called tags
- The act of adding markup is called tagging
- Most markup languages use opening and closing tags surrounding the marked text
15 Standard Generalized Markup Language (SGML)
- Metalanguage for markup
- Includes rules for defining markup languages
- Use of SGML includes
- Description of structure of markup
- Text marked with tags
- Document Type Definition (DTD)
- Describes and names tags and how they are related
- Comments used to express interpretation of tags (meaning, presentation, etc.)
16 SGML DTD Example
<!-- SGML DTD for electronic messages -->
<!ELEMENT e-mail - - (prolog, contents) >
<!ELEMENT prolog - - (sender, address+, subject?, Cc*) >
<!ELEMENT (sender | address | subject | Cc) - O (#PCDATA) >
<!ELEMENT contents - - (par | image | audio)+ >
<!ELEMENT par - O (ref | #PCDATA)+ >
<!ELEMENT ref - O EMPTY >
<!ELEMENT (image | audio) - - (#NDATA) >
<!ATTLIST e-mail
    id ID #REQUIRED
    date_sent DATE #REQUIRED
    status (secret | public) public >
<!ATTLIST ref
    id IDREF #REQUIRED >
<!ATTLIST (image | audio)
    id IDREF #REQUIRED >
17 SGML Example
<!DOCTYPE e-mail SYSTEM "e-mail.dtd">
<e-mail id=94108rby date_sent=02101998>
<prolog>
<sender> Pablo Neruda </sender>
<address> Federico Garcia Lorca </address>
<address> Ernest Hemingway </address>
<subject> Picture of my house in Isla
<Cc> Gabriel Garcia Marquez </Cc>
</prolog>
<contents>
<par> Here are two photos. One is of the view (photo <ref idref=F2>). </par>
<image id=F1> photo1.gif </image>
<image id=F2> photo2.jpg </image>
</contents>
</e-mail>
18 SGML Characteristics
- DTD provides the ability to determine whether a given document is valid, i.e. conforms to the declared structure
- SGML generally does not specify presentation/appearance
- Output specification standards
- DSSSL (Document Style Semantics and Specification Language)
- FOSI (Formatting Output Specification Instance)
19 HyperText Markup Language (HTML)
- Based on SGML
- HTML DTD not explicitly referenced by documents
- HTML documents can have documents embedded within them
- Images or audio
- Frames with other HTML documents
- When programs are included, it is referred to as Dynamic HTML
- Strict HTML includes only non-presentational markup
- Cascading Style Sheets (CSS) used to define presentation
- In reality, presentational and structural markup are blended by HTML authoring applications
20 (Original) HTML Limitations
- In contrast to SGML
- Users cannot specify their own tags or attributes
- No support for nested structures that can represent database schemas or object-oriented hierarchies
- No support for validation of document by consuming applications
21 eXtensible Markup Language (XML)
- XML is a simplified subset of SGML
- XML is a meta-language
- XML designed for semantic markup that is both human and machine readable
- No DTD is required
- All tags must be closed
- Extensible Stylesheet Language (XSL)
- XML equivalent of CSS
- Can be used to convert XML into HTML and CSS
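The "no DTD required, all tags closed" points can be illustrated with Python's standard `xml.etree.ElementTree` parser, which checks only that the markup is well-formed. The document below is a hypothetical XML rendering of the e-mail message from the SGML example: every tag is closed, attribute values are quoted, and the empty `ref` element uses the self-closing form. The helper `is_well_formed` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML version of the SGML e-mail example.
XML_EMAIL = """<e-mail id="94108rby" date_sent="02101998">
  <prolog>
    <sender>Pablo Neruda</sender>
    <address>Federico Garcia Lorca</address>
    <address>Ernest Hemingway</address>
    <subject>Picture of my house in Isla</subject>
    <Cc>Gabriel Garcia Marquez</Cc>
  </prolog>
  <contents>
    <par>Here are two photos. One is of the view (photo <ref idref="F2"/>).</par>
    <image id="F1">photo1.gif</image>
    <image id="F2">photo2.jpg</image>
  </contents>
</e-mail>"""

def is_well_formed(text):
    """Return True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

root = ET.fromstring(XML_EMAIL)
addresses = [a.text for a in root.iter("address")]
referenced = root.find("contents/par/ref").get("idref")

print(addresses)
print(referenced)
print(is_well_formed("<par>unclosed paragraph"))
```

The unclosed `par` that SGML tag omission would allow is rejected here, which is the practical consequence of XML requiring all tags to be closed.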
22 Multimedia
- Many file formats exist for non-textual data
- Images
- BMP, GIF, JPEG (JPG), TIFF
- Audio
- AU, MIDI, WAVE, MP3
- Video
- MPEG, AVI, QuickTime
- Graphics / Virtual Environments
- CGM, VRML, OpenGL
23 Audio and Video
- Data files often have
- Header
- Indicates time granularity, number of channels, bits per channel
- Somewhat like a DTD
- Data
- The signal
- Data may be compressed
- Data may be in frequency domain rather than time domain
- Data may be encoded as a sequence of differences between consecutive time segments
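The header/data split and the difference encoding can both be sketched with Python's standard `wave` module. The example writes a tiny in-memory WAV file, reads the header fields back, and then delta-encodes the samples. The sample values, the 8000 Hz rate, and the helper names `delta_encode` and `delta_decode` are hypothetical choices for illustration.

```python
import io
import struct
import wave

samples = [0, 100, 250, 300, 280, 200]  # hypothetical 16-bit mono samples

# Write a small WAV file into memory: header fields first, then the signal.
buf = io.BytesIO()
with wave.open(buf, "wb") as writer:
    writer.setnchannels(1)      # number of channels
    writer.setsampwidth(2)      # bytes per sample (16 bits)
    writer.setframerate(8000)   # samples per second (time granularity)
    writer.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Read it back: the header tells us how to interpret the data bytes.
buf.seek(0)
with wave.open(buf, "rb") as reader:
    header = (reader.getnchannels(), reader.getsampwidth() * 8, reader.getframerate())
    raw = reader.readframes(reader.getnframes())
decoded = list(struct.unpack("<%dh" % (len(raw) // 2), raw))

def delta_encode(values):
    """Replace each value by its difference from the previous one."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

print(header)
print(delta_encode(decoded))
```

For a slowly varying signal the differences are smaller in magnitude than the samples themselves, which is what makes this representation attractive as a step before compression.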