Content Analysis and Statistical Properties of Text - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Content Analysis and Statistical Properties of Text


1
Content Analysis and Statistical Properties of
Text
  • Ray Larson & Warren Sack
  • University of California, Berkeley
  • School of Information Management and Systems
  • SIMS 202 Information Organization and Retrieval
  • Lecture authors: Marti Hearst, Ray Larson & Warren Sack

2
Last Time
  • Boolean Model of Information Retrieval

3
Boolean Queries
  • Cat
  • Cat OR Dog
  • Cat AND Dog
  • (Cat AND Dog)
  • (Cat AND Dog) OR Collar
  • (Cat AND Dog) OR (Collar AND Leash)
  • (Cat OR Dog) AND (Collar OR Leash)

4
Boolean
  • Advantages
  • simple queries are easy to understand
  • relatively easy to implement
  • Disadvantages
  • difficult to specify what is wanted
  • too much returned, or too little
  • ordering not well determined
  • Dominant language in commercial systems until the
    WWW

5
How are the texts handled?
  • What happens if you take the words exactly as
    they appear in the original text?
  • What about punctuation, capitalization, etc.?
  • What about spelling errors?
  • What about plural vs. singular forms of words?
  • What about cases and declension in non-English
    languages?
  • What about non-Roman alphabets?

6
Today
  • Overview of Content Analysis
  • Text Representation
  • Statistical Characteristics of Text Collections
  • Zipf distribution
  • Statistical dependence

7
Content Analysis
  • Automated transformation of raw text into a form
    that represents some aspect(s) of its meaning
  • Including, but not limited to
  • Automated Thesaurus Generation
  • Phrase Detection
  • Categorization
  • Clustering
  • Summarization

8
Techniques for Content Analysis
  • Statistical
  • Single Document
  • Full Collection
  • Linguistic
  • Syntactic
  • Semantic
  • Pragmatic
  • Knowledge-Based (Artificial Intelligence)
  • Hybrid (Combinations)

9
Text Processing
  • Standard Steps
  • Recognize document structure
  • titles, sections, paragraphs, etc.
  • Break into tokens
  • usually delimited by spaces and punctuation (see the sketch below)
  • Stemming/morphological analysis
  • Special issues with morphologically complex
    languages (e.g., Finnish)
  • Store in inverted index (to be discussed later)
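A minimal Python sketch of the "break into tokens" step above (lowercasing and splitting on anything that is not a letter or digit; real tokenizers handle much more than this):

  import re

  def tokenize(text):
      # Lowercase, then keep maximal runs of letters/digits as tokens.
      return re.findall(r"[a-z0-9]+", text.lower())

  print(tokenize("Now is the time for all good men."))
  # ['now', 'is', 'the', 'time', 'for', 'all', 'good', 'men']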

10
(Diagram: the retrieval pipeline - an information need is turned into a query ("How is the query constructed?") and the collections' text input is processed for matching ("How is the text processed?").)
11
Document Processing Steps
12
Stemming and Morphological Analysis
  • Goal: normalize similar words
  • Morphology (form of words)
  • Inflectional Morphology
  • E.g., inflect verb endings and noun number
  • Never change grammatical class
  • dog, dogs
  • tengo, tienes, tiene, tenemos, tienen
  • Derivational Morphology
  • Derive one word from another
  • Often change grammatical class
  • build, building; health, healthy

13
Automated Methods
  • Powerful multilingual tools exist for
    morphological analysis
  • PCKimmo, Xerox Lexical technology
  • Require rules (a grammar) and dictionary
  • E.g., spys → spies
  • Use finite-state automata
  • Stemmers
  • Very dumb rules work well (for English)
  • Porter Stemmer: iteratively remove suffixes
  • Improvement: pass results through a lexicon

14
Porter Stemmer Rules
  • sses → ss
  • ies → i
  • ed → NULL
  • ing → NULL
  • ational → ate
  • ization → ize
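A minimal Python sketch of applying suffix rules like these, longest suffix first. This is only an illustration of the idea; the real Porter stemmer applies its rules in ordered steps with extra conditions on the stem:

  RULES = [
      ("ational", "ate"),
      ("ization", "ize"),
      ("sses", "ss"),
      ("ies", "i"),
      ("ing", ""),
      ("ed", ""),
  ]

  def stem(word):
      # Apply the first (longest) matching suffix rule, if any.
      for suffix, replacement in RULES:
          if word.endswith(suffix):
              return word[: len(word) - len(suffix)] + replacement
      return word

  print(stem("caresses"))    # caress
  print(stem("relational"))  # relate
  print(stem("ponies"))      # poni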

15
Errors Generated by Porter Stemmer (Krovetz 93)
Errors of Commission         Errors of Omission
organization → organ         urgency → urgent
generalization → generic     triangle → triangular
arm → army                   explain → explanation
16
Table-Lookup Stemming
  • E.g., Karp, Schabes, Zaidel and Egedi, A Freely
    Available Wide Coverage Morphological Analyzer
    for English, COLING-92
  • Example table entries
  • matrices → matrix N 3pl
  • fish → fish N 3sg
  • fish V INF
  • fish N 3pl
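A minimal Python sketch of table-lookup stemming along these lines; the dictionary below contains only the two example entries from this slide:

  # Surface form -> list of (lemma, part of speech, features) analyses.
  TABLE = {
      "matrices": [("matrix", "N", "3pl")],
      "fish": [("fish", "N", "3sg"), ("fish", "V", "INF"), ("fish", "N", "3pl")],
  }

  def analyze(word):
      # Unknown words are returned unanalyzed.
      return TABLE.get(word.lower(), [(word, None, None)])

  print(analyze("matrices"))  # [('matrix', 'N', '3pl')]
  print(analyze("fish"))      # all three analyses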

17
Statistical Properties of Text
  • Token occurrences in text are not uniformly
    distributed
  • They are also not normally distributed
  • They do exhibit a Zipf distribution

18
Common words in Tom Sawyer
  • the 3332
  • and 2972
  • a 1775
  • I 783
  • you 686
  • Tom 679

19
Plotting Word Frequency by Rank
  • Frequency: How many times do tokens (words) occur
    in the text (or collection)?
  • Rank: Now order these according to how often they
    occur.

20
Empirical evaluation of Zipf's Law for Tom Sawyer
word      frequency (f)   rank (r)   f × r
the       3332            1          3332
and       2972            2          5944
a         1775            3          5325
he        877             10         8770
be        294             30         8820
there     222             40         8880
one       172             50         8600
friends   10              800        8000

21
The Corresponding Zipf Curve
Rank   Freq   Term
1      37     system
2      32     knowledg
3      24     base
4      20     problem
5      18     abstract
6      15     model
7      15     languag
8      15     implem
9      13     reason
10     13     inform
11     11     expert
12     11     analysi
13     10     rule
14     10     program
15     10     oper
16     10     evalu
17     10     comput
18     10     case
19     9      gener
20     9      form
22
Zoom in on the Knee of the Curve
Rank   Freq   Term
43     6      approach
44     5      work
45     5      variabl
46     5      theori
47     5      specif
48     5      softwar
49     5      requir
50     5      potenti
51     5      method
52     5      mean
53     5      inher
54     5      data
55     5      commit
56     5      applic
57     4      tool
58     4      technolog
59     4      techniqu
23
Zipf Distribution
  • The Important Points
  • a few elements occur very frequently
  • a medium number of elements have medium frequency
  • many elements occur very infrequently

24
Zipf Distribution
  • The product of the frequency of words (f) and
    their rank (r) is approximately constant
  • Rank = order of words by frequency of occurrence
  • Another way to state this is with an
    approximately correct rule of thumb
  • Say the most common term occurs C times
  • The second most common occurs C/2 times
  • The third most common occurs C/3 times
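A minimal Python sketch of checking this rule of thumb on a text; the filename tom_sawyer.txt is just an assumed local plain-text copy of the novel:

  import re
  from collections import Counter

  text = open("tom_sawyer.txt", encoding="utf-8").read().lower()
  freqs = Counter(re.findall(r"[a-z]+", text))

  # Sort by descending frequency and print frequency * rank for the top words;
  # under Zipf's law the last column should stay roughly constant.
  ranked = sorted(freqs.items(), key=lambda kv: -kv[1])
  for rank, (word, freq) in enumerate(ranked[:20], start=1):
      print(f"{rank:>4}  {word:<12} f={freq:<6} f*r={freq * rank}")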

25
Empirical evaluation of Zipf's Law for Tom Sawyer
word      frequency (f)   rank (r)   f × r
the       3332            1          3332
and       2972            2          5944
a         1775            3          5325
he        877             10         8770
be        294             30         8820
there     222             40         8880
one       172             50         8600
friends   10              800        8000

26
Zipf Distribution (linear and log scale)
27
Consequences of Zipf
  • There are always a few very frequent tokens that
    are not good discriminators.
  • Called stop words in IR
  • Usually correspond to linguistic notion of
    closed-class words
  • English examples: to, from, on, and, the, ...
  • Grammatical classes that don't take on new
    members.
  • There are always a large number of tokens that
    occur once and can mess up algorithms.
  • Medium-frequency words are the most descriptive

28
Frequency v. Resolving Power
Luhn's contribution to automatic text analysis:
frequency data can be used to extract words and
sentences to represent a document.
Luhn, H.P., 'The automatic creation of literature
abstracts', IBM Journal of Research and Development,
2, 159-165 (1958)
29
What Kinds of Data Exhibit a Zipf Distribution?
  • Words in a text collection
  • Virtually any language usage
  • Library book checkout patterns
  • Incoming Web Page Requests (Nielsen)
  • Outgoing Web Page Requests (Cunha & Crovella)
  • Document Size on Web (Cunha & Crovella)

30
Related Distributions/Laws
  • Bradford's Law of Literary Yield (1934)
  • If periodicals are ranked into three groups, each
    yielding the same number of articles on a
    specified topic, the numbers of periodicals in
    each group increase geometrically.
  • Thus, Leith (1969) shows that by reading only the
    core periodicals of your speciality you will
    miss about 40% of the articles relevant to it.

31
Related Distributions/Laws
  • Lotka's distribution of literary productivity
    (1926)
  • The number of author-scientists who had published
    N papers in a given field was roughly 1/N² the
    number of authors who had published one paper
    only.
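In symbols, with a_1 the number of authors who published exactly one paper and a_N the number who published N papers, the statement above reads:

  a_N \approx \frac{a_1}{N^2}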

32
Related Distributions/Laws
  • For more examples, see Robert A. Fairthorne,
    Empirical Distributions (Bradford-Zipf-Mandelbrot
    ) for Bibliometric Description and Prediction,
    Journal of Documentation, 25(4), 319-341
  • Pareto's distribution of wealth; Willis's taxonomic
    distribution in biology; Mandelbrot on
    self-similarity, market prices, and
    communication errors; etc.

33
Statistical Independence vs. Statistical
Dependence
  • How likely is a red car to drive by given we've
    seen a black one?
  • How likely is the word "ambulance" to appear,
    given that we've seen "car accident"?
  • Colors of cars driving by are independent
    (although more frequent colors are more likely)
  • Words in text are not independent (although again
    more frequent words are more likely)

34
Statistical Independence
  • Two events x and y are statistically
    independent if the product of the probabilities
    of their happening individually equals the
    probability of their happening together.
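In symbols, the definition above is:

  P(x, y) = P(x) P(y)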

35
Statistical Independence and Dependence
  • What are examples of things that are
    statistically independent?
  • What are examples of things that are
    statistically dependent?

36
Lexical Associations
  • Subjects write first word that comes to mind
  • doctor/nurse, black/white (Palermo & Jenkins 64)
  • Text Corpora yield similar associations
  • One measure: Mutual Information (Church and Hanks
    89)
  • If word occurrences were independent, the
    numerator and denominator would be equal (if
    measured across a large collection)
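The measure referred to here is the (pointwise) mutual information of Church and Hanks:

  I(x, y) = \log_2 \frac{P(x, y)}{P(x) P(y)}

If x and y were independent, P(x, y) would equal P(x) P(y) and I(x, y) would be 0.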

37
Statistical Independence
  • Compute for a window of words

(Diagram: co-occurrences are counted within sliding windows w1, w11, w21, ... over the token stream a b c d e f g h i j k l m n o p.)
38
Interesting Associations with "Doctor" (AP
Corpus, N = 15 million, Church & Hanks 89)
39
Uninteresting Associations with "Doctor" (AP
Corpus, N = 15 million, Church & Hanks 89)
These associations were likely to happen because
the non-doctor words shown here are very
common and therefore likely to co-occur with any
noun.
40
Document Vectors
  • Documents are represented as bags of words
  • Represented as vectors when used computationally
  • A vector is an array of floating-point numbers
  • Has direction and magnitude
  • Each vector holds a place for every term in the
    collection
  • Therefore, most vectors are sparse
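A minimal Python sketch of building such term-frequency vectors; the vocabulary and the example counts are the ones used on the next few slides:

  from collections import Counter

  vocab = ["nova", "galaxy", "heat", "hwood", "film", "role", "diet", "fur"]

  def to_vector(tokens):
      # One position per vocabulary term; absent terms get 0 (hence mostly sparse).
      counts = Counter(tokens)
      return [counts.get(term, 0) for term in vocab]

  doc_a = ["nova"] * 10 + ["galaxy"] * 5 + ["heat"] * 3
  print(to_vector(doc_a))  # [10, 5, 3, 0, 0, 0, 0, 0]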

41
Document Vectors: One location for each word.
  • nova galaxy heat hwood film role diet fur
  • 10 5 3
  • 5 10
  • 10 8 7
  • 9 10 5
  • 10 10
  • 9 10
  • 5 7 9
  • 6 10 2 8
  • 7 5 1 3

A B C D E F G H I
Nova occurs 10 times in text A, Galaxy occurs
5 times in text A, Heat occurs 3 times in text A.
(Blank means 0 occurrences.)
42
Document Vectors: One location for each word.
  • nova galaxy heat hwood film role diet fur
  • 10 5 3
  • 5 10
  • 10 8 7
  • 9 10 5
  • 10 10
  • 9 10
  • 5 7 9
  • 6 10 2 8
  • 7 5 1 3

A B C D E F G H I
Hollywood occurs 7 times in text I, Film occurs
5 times in text I, Diet occurs 1 time in text I,
Fur occurs 3 times in text I.
43
Document Vectors
Document ids
  • nova galaxy heat hwood film role diet fur
  • 10 5 3
  • 5 10
  • 10 8 7
  • 9 10 5
  • 10 10
  • 9 10
  • 5 7 9
  • 6 10 2 8
  • 7 5 1 3

A B C D E F G H I
44
We Can Plot the Vectors
(Plot: documents as points in a 2D space with axes "Star" and "Diet": a doc about movie stars, a doc about astronomy, and a doc about mammal behavior.)
45
Documents in 3D Space
46
Content Analysis Summary
  • Content Analysis: transforming raw text into more
    computationally useful forms
  • Words in text collections exhibit interesting
    statistical properties
  • Word frequencies have a Zipf distribution
  • Word co-occurrences exhibit dependencies
  • Text documents are transformed to vectors
  • Pre-processing includes tokenization, stemming,
    collocations/phrases
  • Documents occupy multi-dimensional space.

47
(Diagram: the retrieval pipeline again, now asking how the index is constructed from the collections' text input.)
48
Inverted Index
  • This is the primary data structure for text
    indexes
  • Main Idea
  • Invert documents into a big index
  • Basic steps
  • Make a dictionary of all the tokens in the
    collection
  • For each token, list all the docs it occurs in.
  • Do a few things to reduce redundancy in the data
    structure
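A minimal Python sketch of those basic steps: tokenize each document and record, for each token, the documents it occurs in. The two toy documents are the ones used on the following slides:

  import re
  from collections import defaultdict

  docs = {
      1: "Now is the time for all good men to come to the aid of their country",
      2: "It was a dark and stormy night in the country manor. The time was past midnight",
  }

  index = defaultdict(set)
  for doc_id, text in docs.items():
      for token in re.findall(r"[a-z]+", text.lower()):
          index[token].add(doc_id)

  # Sorted postings lists, one per token in the dictionary.
  postings = {term: sorted(ids) for term, ids in index.items()}
  print(postings["country"])  # [1, 2]
  print(postings["manor"])    # [2]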

49
Inverted Indexes
  • We have seen Vector files conceptually. An
    Inverted File is a vector file inverted so
    that rows become columns and columns become rows

50
How Are Inverted Files Created
  • Documents are parsed to extract tokens. These are
    saved with the Document ID.

Doc 1
Doc 2
Now is the time for all good men to come to the
aid of their country
It was a dark and stormy night in the country
manor. The time was past midnight
51
How Inverted Files are Created
  • After all documents have been parsed, the inverted
    file is sorted alphabetically.

52
How Inverted Files are Created
  • Multiple term entries for a single document are
    merged.
  • Within-document term frequency information is
    compiled.

53
How Inverted Files are Created
  • Then the file can be split into
  • A Dictionary file
  • and
  • A Postings file

54
How Inverted Files are Created
  • Dictionary Postings

55
Inverted indexes
  • Permit fast search for individual terms
  • For each term, you get a list consisting of
  • document ID
  • frequency of term in doc (optional)
  • position of term in doc (optional)
  • These lists can be used to solve Boolean queries
  • country → d1, d2
  • manor → d2
  • country AND manor → d2
  • Also used for statistical ranking algorithms
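A minimal Python sketch of answering the Boolean AND above by intersecting two sorted postings lists of document IDs:

  def intersect(p1, p2):
      # Merge-style intersection of two sorted postings lists.
      result, i, j = [], 0, 0
      while i < len(p1) and j < len(p2):
          if p1[i] == p2[j]:
              result.append(p1[i])
              i += 1
              j += 1
          elif p1[i] < p2[j]:
              i += 1
          else:
              j += 1
      return result

  print(intersect([1, 2], [2]))  # [2] -> country AND manor matches doc 2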

56
How Inverted Files are Used
Query on "time" AND "dark":
2 docs with "time" in dictionary → IDs 1 and 2 from postings file
1 doc with "dark" in dictionary → ID 2 from postings file
Therefore, only doc 2 satisfies the query.
  • Dictionary Postings

57
Next Time
  • Term weighting
  • Statistical ranking