Text Mining - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Text Mining


1
Text Mining
Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
eamonn@cs.ucr.edu
2
Text Mining/Information Retrieval
  • Task Statement:
  • Build a system that retrieves documents that
    users are likely to find relevant to their
    queries.
  • This assumption underlies the field of
    Information Retrieval.

3
[Diagram of the retrieval process: an information need, expressed as text input (how is the query constructed? how is the text processed?), is matched against the collections]
4
Terminology
Token: A natural language word, e.g., Swim, Simpson, 92513, etc.
Document: Usually a web page, but more generally any file.
5
Some IR History
  • Roots in the scientific Information Explosion
    following WWII
  • Interest in computer-based IR from mid 1950s
  • H.P. Luhn at IBM (1958)
  • Probabilistic models at Rand (Maron & Kuhns)
    (1960)
  • Boolean system development at Lockheed (60s)
  • Vector Space Model (Salton at Cornell 1965)
  • Statistical Weighting methods and theoretical
    advances (70s)
  • Refinements and Advances in application (80s)
  • User Interfaces, Large-scale testing and
    application (90s)

6
Relevance
  • In what ways can a document be relevant to a
    query?
  • Answer precise question precisely.
  • Who is Homer's boss? Montgomery Burns.
  • Partially answer question.
  • Where does Homer work? Power Plant.
  • Suggest a source for more information.
  • What is Bart's middle name? Look in Issue 234 of
    the Fanzine
  • Give background information.
  • Remind the user of other knowledge.
  • Others ...

7
(No Transcript)
8
[Diagram of the retrieval process: an information need, expressed as text input (how is the query constructed? how is the text processed?), is matched against the collections]
The section that follows is about Content
Analysis (transforming raw text into a
computationally more manageable form)
9
Document Processing Steps
10
Stemming and Morphological Analysis
  • Goal: normalize similar words
  • Morphology (form of words)
  • Inflectional Morphology
  • E.g., inflect verb endings and noun number
  • Never change grammatical class
  • dog, dogs
  • Bike, Biking
  • Swim, Swimmer, Swimming

What about build, building?
11
Examples of Stemming (using Porter's algorithm)
Original Words: consign, consigned, consigning, consignment,
consist, consisted, consistency, consistent, consistently,
consisting, consists
Stemmed Words: consign, consign, consign, consign, consist,
consist, consist, consist, consist, consist, consist
Porter's algorithm is available in Java, C, Lisp, Perl, Python,
etc. from http://www.tartarus.org/martin/PorterStemmer/
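As a quick illustration, the sketch below runs some of the words above through NLTK's implementation of the Porter stemmer (using the nltk package is an assumption; any of the reference implementations linked above should produce the same stems).

```python
# Minimal sketch: stemming the example words with NLTK's Porter stemmer.
# Assumes the nltk package is installed (pip install nltk).
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["consign", "consigned", "consigning", "consignment",
         "consist", "consisted", "consistency", "consistent"]
print([stemmer.stem(w) for w in words])
# -> ['consign', 'consign', 'consign', 'consign',
#     'consist', 'consist', 'consist', 'consist']
```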
12
Errors Generated by Porter Stemmer (Krovetz 93)
13
Statistical Properties of Text
  • Token occurrences in text are not uniformly
    distributed
  • They are also not normally distributed
  • They do exhibit a Zipf distribution

14
Government documents, 157,734 tokens, 32,259 unique

High frequency:   8164 the, 4771 of, 4005 to, 2834 a, 2827 and,
                  2802 in, 1592 The, 1370 for, 1326 is, 1324 s,
                  1194 that, 973 by
Medium frequency: 969 on, 915 FT, 883 Mr, 860 was, 855 be,
                  849 Pounds, 798 TEXT, 798 PUB, 798 PROFILE,
                  798 PAGE, 798 HEADLINE, 798 DOCNO
Occur only once:  1 ABC, 1 ABFT, 1 ABOUT, 1 ACFT, 1 ACI, 1 ACQUI,
                  1 ACQUISITIONS, 1 ACSIS, 1 ADFT, 1 ADVISERS, 1 AE
15
Plotting Word Frequency by Rank
  • Main idea: count
  • How many times tokens occur in the text
  • Over all texts in the collection
  • Now rank these according to how often they occur.
    This is called the rank.

16
Rank  Freq  Term
 1     37   system
 2     32   knowledg
 3     24   base
 4     20   problem
 5     18   abstract
 6     15   model
 7     15   languag
 8     15   implem
 9     13   reason
10     13   inform
11     11   expert
12     11   analysi
13     10   rule
14     10   program
15     10   oper
16     10   evalu
17     10   comput
18     10   case
19      9   gener
20      9   form

The Corresponding Zipf Curve
17
Zipf Distribution
  • The Important Points
  • a few elements occur very frequently
  • a medium number of elements have medium frequency
  • many elements occur very infrequently

18
Zipf Distribution
  • The product of the frequency of words (f) and
    their rank (r) is approximately constant
  • Rank = the order of words by frequency of occurrence
  • Another way to state this is with an
    approximately correct rule of thumb
  • Say the most common term occurs C times
  • The second most common occurs C/2 times
  • The third most common occurs C/3 times

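Stated as a formula (a standard way of writing the rule of thumb above, not taken from the slide itself):

```latex
f \cdot r \approx C
\qquad\Longrightarrow\qquad
f(r) \approx \frac{C}{r}
```

where f is a word's frequency, r its rank, and C roughly the frequency of the most common term.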
19
Zipf Distribution (linear and log scale)
20
What Kinds of Data Exhibit a Zipf Distribution?
  • Words in a text collection
  • Virtually any language usage
  • Library book checkout patterns
  • Incoming Web Page Requests
  • Outgoing Web Page Requests
  • Document Size on Web
  • City Sizes

21
Consequences of Zipf
  • There are always a few very frequent tokens that
    are not good discriminators.
  • Called stop words in IR
  • English examples: to, from, on, and, the, ...
  • There are always a large number of tokens that
    occur once and can mess up algorithms.
  • Medium frequency words are the most descriptive

22
Word Frequency vs. Resolving Power (from van
Rijsbergen 79)
The most frequent words are not the most
descriptive.
23
Statistical Independence
  • Two events x and y are statistically
    independent if the product of the probabilities
    of their happening individually equals the
    probability of their happening together.

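In symbols (the standard definition, added here for reference):

```latex
P(x \cap y) = P(x)\,P(y)
```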
24
Statistical Independence and Dependence
  • What are examples of things that are
    statistically independent?
  • What are examples of things that are
    statistically dependent?

25
Lexical Associations
  • Subjects write first word that comes to mind
  • doctor/nurse, black/white (Palermo & Jenkins 64)
  • Text Corpora yield similar associations
  • One measure Mutual Information (Church and Hanks
    89)
  • If word occurrences were independent, the
    numerator and denominator would be equal (if
    measured across a large collection)

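The formula on the slide is not reproduced in this transcript; the association score used by Church and Hanks is pointwise mutual information, whose standard form is:

```latex
I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
```

If x and y occur independently, the ratio is 1 and the score is 0; large positive values indicate a strong association.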
26
Statistical Independence
  • Compute for a window of words

[Illustration: a sequence of words a b c d e f g h i j k l m n o p, with overlapping windows w1, w11, w21 slid across it]
27
Interesting Associations with "Doctor" (AP
Corpus, N = 15 million, Church & Hanks 89)
28
Un-Interesting Associations with "Doctor" (AP
Corpus, N = 15 million, Church & Hanks 89)
These associations were likely to happen because
the non-doctor words shown here are very
common and therefore likely to co-occur with any
noun.
29
Associations Are Important Because
  • We may be able to discover phrases that
    should be treated as a word, e.g., "data mining".
  • We may be able to automatically discover
    synonyms, e.g., Bike and Bicycle.

30
Content Analysis Summary
  • Content Analysis: transforming raw text into more
    computationally useful forms
  • Words in text collections exhibit interesting
    statistical properties
  • Word frequencies have a Zipf distribution
  • Word co-occurrences exhibit dependencies
  • Text documents are transformed to vectors
  • Pre-processing includes tokenization, stemming,
    collocations/phrases

31
(No Transcript)
32
[Diagram of the retrieval process: an information need, expressed as text input, is matched against the collections (how is the index constructed?)]
The section that follows is about Index
Construction
33
Inverted Index
  • This is the primary data structure for text
    indexes
  • Main Idea
  • Invert documents into a big index
  • Basic steps
  • Make a dictionary of all the tokens in the
    collection
  • For each token, list all the docs it occurs in.
  • Do a few things to reduce redundancy in the data
    structure

34
How Are Inverted Files Created
  • Documents are parsed to extract tokens. These are
    saved with the Document ID.

Doc 1
Doc 2
Now is the time for all good men to come to the
aid of their country
It was a dark and stormy night in the country
manor. The time was past midnight
35
How Inverted Files are Created
  • After all documents have been parsed the inverted
    file is sorted alphabetically.

36
How Inverted Files are Created
  • Multiple term entries for a single document are
    merged.
  • Within-document term frequency information is
    compiled.

37
How Inverted Files are Created
  • Then the file can be split into
  • A Dictionary file
  • and
  • A Postings file

38
How Inverted Files are Created
  • Dictionary Postings

39
Inverted Indexes
  • Permit fast search for individual terms
  • For each term, you get a list consisting of
  • document ID
  • frequency of term in doc (optional)
  • position of term in doc (optional)
  • These lists can be used to solve Boolean queries
  • country -> d1, d2
  • manor -> d2
  • country AND manor -> d2
  • Also used for statistical ranking algorithms

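A minimal sketch of the idea in Python (lower-cased whitespace tokenization and the dictionary-of-sets representation are simplifying assumptions, not the exact structure on the slides):

```python
from collections import defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

# Dictionary: each token maps to the set of document IDs it occurs in (the postings).
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().replace(".", "").split():
        index[token].add(doc_id)

print(index["country"])                   # {1, 2}
print(index["manor"])                     # {2}
print(index["country"] & index["manor"])  # Boolean AND -> {2}
```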
40
How Inverted Files are Used
Query on "time" AND "dark":
2 docs with "time" in the dictionary -> IDs 1 and 2 from the postings file
1 doc with "dark" in the dictionary -> ID 2 from the postings file
Therefore, only doc 2 satisfies the query.
  • Dictionary Postings

41
(No Transcript)
42
[Diagram of the retrieval process: an information need, expressed as text input, is matched against the collections (how is the index constructed?)]
The section that follows is about Querying (and
ranking)
43
Simple query language: Boolean
  • Terms Connectors (or operators)
  • terms
  • words
  • normalized (stemmed) words
  • phrases
  • connectors
  • AND
  • OR
  • NOT
  • NEAR (Pseudo Boolean)
  Word      Doc
  Cat        x
  Dog
  Collar     x
  Leash

44
Boolean Queries
  • Cat
  • Cat OR Dog
  • Cat AND Dog
  • (Cat AND Dog)
  • (Cat AND Dog) OR Collar
  • (Cat AND Dog) OR (Collar AND Leash)
  • (Cat OR Dog) AND (Collar OR Leash)

45
Boolean Queries
  • (Cat OR Dog) AND (Collar OR Leash)
  • Each of the following combinations works
  [Table: example documents that satisfy the query -- each contains
   at least one of Cat, Dog and at least one of Collar, Leash]

46
Boolean Queries
  • (Cat OR Dog) AND (Collar OR Leash)
  • None of the following combinations work
  [Table: example documents that fail the query -- each contains
   terms from only one of the two groups, Cat/Dog or Collar/Leash]

47
Boolean Searching
Information need: Measurement of the width of cracks in
prestressed concrete beams
Formal Query: cracks AND beams AND width_measurement AND
prestressed_concrete
Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR
(C AND W AND P) OR (B AND W AND P)
[Venn diagram of the four concepts: Cracks, Beams, Width
measurement, Prestressed concrete]
48
Ordering of Retrieved Documents
  • Pure Boolean has no ordering
  • In practice
  • order chronologically
  • order by total number of hits on query terms
  • What if one term has more hits than others?
  • Is it better to have one of each term or many of
    one term?

49
Boolean Model
  • Advantages
  • simple queries are easy to understand
  • relatively easy to implement
  • Disadvantages
  • difficult to specify what is wanted
  • too much returned, or too little
  • ordering not well determined
  • Dominant language in commercial Information
    Retrieval systems until the WWW

Since the Boolean model is limited, let's consider
a generalization
50
Vector Model
  • Documents are represented as bags of words
  • Represented as vectors when used computationally
  • A vector is like an array of floating point numbers
  • Has direction and magnitude
  • Each vector holds a place for every term in the
    collection
  • Therefore, most vectors are sparse
  • Smithers secretly loves Monty Burns
  • Monty Burns secretly loves Smithers
  • Both map to
  • Burns, loves, Monty, secretly, Smithers

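A small sketch of the bag-of-words mapping (the tiny vocabulary and helper function are illustrative only):

```python
from collections import Counter

vocab = ["burns", "loves", "monty", "secretly", "smithers"]

def to_vector(text):
    """Map a text to a vector with one position per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

print(to_vector("Smithers secretly loves Monty Burns"))  # [1, 1, 1, 1, 1]
print(to_vector("Monty Burns secretly loves Smithers"))  # [1, 1, 1, 1, 1] -- same vector
```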
51
Document Vectors: One location for each word
  • Terms: nova, galaxy, heat, hwood, film, role, diet, fur
  [Table: term-weight vectors for document IDs A through I; each
   document has non-zero weights for only a few of the terms]
52
We Can Plot the Vectors
[Plot: documents placed in a 2D space with axes "Star" and "Diet";
a doc about movie stars, a doc about astronomy, and a doc about
mammal behavior fall in different regions]
53
Documents in 3D Vector Space
[Figure: documents D1-D11 plotted in a 3D vector space with axes
t1, t2, t3]
54
Vector Space Model
Note that the query is projected into the same
vector space as the documents. The query here is
for Marge. We can use a vector similarity
model to determine the best match to our query
(details in a few slides). But what weights
should we use for the terms?
55
Assigning Weights to Terms
  • Binary Weights
  • Raw term frequency
  • tf x idf
  • Recall the Zipf distribution
  • Want to weight terms highly if they are
  • frequent in relevant documents BUT
  • infrequent in the collection as a whole

56
Binary Weights
  • Only the presence (1) or absence (0) of a term is
    included in the vector

We have already seen and discussed this model.
57
Raw Term Weights
  • The frequency of occurrence for the term in each
    document is included in the vector

This model is open to exploitation by websites
sex sex sex sex sex sex sex sex sex sex sex sex
sex sex sex sex sex sex sex sex sex sex sex sex
sex sex sex sex sex sex
Counts can be normalized by document lengths.
58
tf x idf Weights
  • tf x idf measure:
  • term frequency (tf)
  • inverse document frequency (idf) -- a way to deal
    with the problems of the Zipf distribution
  • Goal: assign a tf x idf weight to each term in
    each document

59
tf x idf
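The weighting formula itself does not survive in this transcript; a common formulation (a reconstruction, not necessarily the exact formula on the slide) is:

```latex
w_{ik} = tf_{ik} \times \log\!\left(\frac{N}{n_k}\right)
```

where tf_ik is the frequency of term k in document i, N is the number of documents in the collection, and n_k is the number of documents that contain term k.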
60
Inverse Document Frequency
  • IDF provides high values for rare words and low
    values for common words

For a collection of 10000 documents
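The table on the slide is not reproduced here; as an illustration (assuming idf = log10(N/n_k) with N = 10,000), the values work out to:

```latex
\begin{array}{ll}
n_k = 10{,}000: & idf = \log(10{,}000/10{,}000) = 0 \\
n_k = 1{,}000:  & idf = \log(10{,}000/1{,}000) = 1 \\
n_k = 100:      & idf = \log(10{,}000/100) = 2 \\
n_k = 1:        & idf = \log(10{,}000/1) = 4
\end{array}
```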
61
Similarity Measures
  • Simple matching (coordination level match)
  • Dice's Coefficient
  • Jaccard's Coefficient
  • Cosine Coefficient
  • Overlap Coefficient
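The formulas on the slide are not in the transcript; the standard set-based definitions (with X and Y the term sets of the two items being compared) are:

```latex
\begin{aligned}
\text{Simple matching:}\quad & |X \cap Y| \\
\text{Dice:}\quad    & \frac{2\,|X \cap Y|}{|X| + |Y|} \\
\text{Jaccard:}\quad & \frac{|X \cap Y|}{|X \cup Y|} \\
\text{Cosine:}\quad  & \frac{|X \cap Y|}{\sqrt{|X|}\,\sqrt{|Y|}} \\
\text{Overlap:}\quad & \frac{|X \cap Y|}{\min(|X|, |Y|)}
\end{aligned}
```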
62
Cosine
[Plot: cosine similarity illustrated in a 2D term space; axis
values run from 0.2 to 1.0]
63
Problems with Vector Space
  • There is no real theoretical basis for the
    assumption of a term space
  • it is more for visualization than having any real
    basis
  • most similarity measures work about the same
    regardless of model
  • Terms are not really orthogonal dimensions
  • Terms are not independent of all other terms

64
Probabilistic Models
  • Rigorous formal model: attempts to predict the
    probability that a given document will be
    relevant to a given query
  • Ranks retrieved documents according to this
    probability of relevance (Probability Ranking
    Principle)
  • Relies on accurate estimates of probabilities

65
(No Transcript)
66
Relevance Feedback
  • Main Idea
  • Modify existing query based on relevance
    judgements
  • Query Expansion: extract terms from relevant
    documents and add them to the query
  • Term Re-weighting: re-weight the terms
    already in the query
  • Two main approaches
  • Automatic (pseudo-relevance feedback)
  • Users select relevant documents
  • Users/system select terms from an
    automatically-generated list

67
Definition: Relevance Feedback is the
reformulation of a search query in response to
feedback provided by the user for the results of
previous versions of the query.
Suppose you are interested in bovine agriculture
on the banks of the river Jordan
[Feedback loop: Search -> Display Results -> Gather Feedback ->
Update Weights, then repeat]
68
Rocchio Method
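The formula on this slide is not in the transcript; the standard Rocchio update (one common form, shown here for reference) is:

```latex
\vec{q}_{new} = \alpha\,\vec{q}_0
  + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j
  - \frac{\gamma}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j
```

where D_r and D_nr are the relevant and non-relevant documents identified through feedback, and alpha, beta, gamma set the relative weight of the original query and the two feedback terms.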
69
Rocchio Illustration
Although we usually work in vector space for
text, it is easier to visualize Euclidean space
Original Query
Term Re-weighting: note that both the location of
the center and the shape of the query have
changed
Query Expansion
70
Rocchio Method
  • Rocchio automatically
  • re-weights terms
  • adds in new terms (from relevant docs)
  • have to be careful when using negative terms
  • Rocchio is not a machine learning algorithm
  • Most methods perform similarly
  • results heavily dependent on test collection
  • Machine learning methods are proving to work
    better than standard IR approaches like Rocchio

71
Using Relevance Feedback
  • Known to improve results
  • People don't seem to like giving feedback!

72
Relevance Feedback for Time Series
The original query
The weight vector. Initially, all weights are the
same.
Note In this example we are using a piecewise
linear approximation of the data. We will learn
more about this representation later.
73
The initial query is executed, and the five best
matches are shown (in the dendrogram)
One by one the 5 best matching sequences will
appear, and the user will rank each of them from
very bad (-3) to very good (3)
74
Based on the user feedback, both the shape and
the weight vector of the query are changed.
The new query can be executed. The hope is that
the query shape and weights will converge to the
optimal query.
Two papers consider relevance feedback for time
series.
Query Expansion: L. Wu, C. Faloutsos, K. Sycara, T. Payne.
FALCON: Feedback Adaptive Loop for Content-Based Retrieval.
VLDB 2000: 297-306.
Term Re-weighting: Keogh, E. & Pazzani, M. Relevance feedback
retrieval of time series data. In Proceedings of SIGIR 99.
75
(No Transcript)
76
Document Space has High Dimensionality
  • What happens beyond 2 or 3 dimensions?
  • Similarity still has to do with how many tokens
    are shared in common.
  • More terms -> harder to understand which subsets
    of words are shared among similar documents.
  • One approach to handling high dimensionality:
    Clustering

77
Text Clustering
  • Finds overall similarities among groups of
    documents.
  • Finds overall similarities among groups of
    tokens.
  • Picks out some themes, ignores others.

78
Scatter/Gather
  • Hearst & Pedersen 95
  • Cluster sets of documents into general themes,
    like a table of contents (using K-means)
  • Display the contents of the clusters by showing
    topical terms and typical titles
  • User chooses subsets of the clusters and
    re-clusters the documents within
  • Resulting new groups have different themes

79
S/G Example: query on "star"
  • Encyclopedia text
  [Cluster sizes and labels: 14 sports; 8 symbols; 47 film, tv;
   68 film, tv (p); 7 music; 97 astrophysics; 67 astronomy (p);
   12 stellar phenomena; 10 flora/fauna; 49 galaxies, stars;
   29 constellations; 7 miscellaneous]
  • Clustering and re-clustering is entirely
    automated

80
(No Transcript)
81
Ego Surfing!
http://vivisimo.com/
82
(No Transcript)
83
[Diagram of the retrieval process: an information need, expressed as text input, is matched against the collections (how is the index constructed?)]
The section that follows is about Evaluation
84
Evaluation
  • Why Evaluate?
  • What to Evaluate?
  • How to Evaluate?

85
Why Evaluate?
  • Determine if the system is desirable
  • Make comparative assessments
  • Others?

86
What to Evaluate?
  • How much of the information need is satisfied.
  • How much was learned about a topic.
  • Incidental learning
  • How much was learned about the collection.
  • How much was learned about other topics.
  • How inviting the system is.

87
What to Evaluate?
  • What can be measured that reflects the users'
    ability to use the system? (Cleverdon 66)
  • Coverage of Information
  • Form of Presentation
  • Effort required/Ease of Use
  • Time and Space Efficiency
  • Recall
  • proportion of relevant material actually
    retrieved
  • Precision
  • proportion of retrieved material actually relevant

effectiveness
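As a concrete illustration of the two definitions (the document IDs and relevance judgments below are made up):

```python
relevant = {"d3", "d7", "d11", "d20", "d42"}   # all relevant docs in the collection
retrieved = {"d3", "d7", "d9", "d15"}          # docs the system actually returned

hits = relevant & retrieved                    # relevant AND retrieved
precision = len(hits) / len(retrieved)         # 2/4 = 0.50
recall = len(hits) / len(relevant)             # 2/5 = 0.40
print(precision, recall)
```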
88
Relevant vs. Retrieved
All docs
Retrieved
Relevant
89
Precision vs. Recall
All docs
Retrieved
Relevant
90
Why Precision and Recall?
  • Intuition
  • Get as much good stuff while at the same time
    getting as little junk as possible.

91
Retrieved vs. Relevant Documents
Very high precision, very low recall
Relevant
92
Retrieved vs. Relevant Documents
Very low precision, very low recall (0 in fact)
Relevant
93
Retrieved vs. Relevant Documents
High recall, but low precision
Relevant
94
Retrieved vs. Relevant Documents
High precision, high recall (at last!)
Relevant
95
Precision/Recall Curves
  • There is a tradeoff between Precision and Recall
  • So measure Precision at different levels of
    Recall
  • Note this is an AVERAGE over MANY queries

[Plot: precision (y-axis) at increasing levels of recall (x-axis),
averaged over many queries]
96
Precision/Recall Curves
  • Difficult to determine which of these two
    hypothetical results is better

[Plot: two hypothetical precision/recall curves; neither one is
clearly better than the other]
97
Precision/Recall Curves
98
Recall under various retrieval assumptions
99
Precision under various assumptions
[Plot: precision (y-axis, 0.0-1.0) vs. proportion of documents
retrieved (x-axis, 0.0-1.0) for a collection of 1000 documents
with 100 relevant; curves shown for Perfect, Tangent Parabolic
Recall, Parabolic Recall, random, and Perverse retrieval]
100
Document Cutoff Levels
  • Another way to evaluate
  • Fix the number of documents retrieved at several
    levels
  • top 5
  • top 10
  • top 20
  • top 50
  • top 100
  • top 500
  • Measure precision at each of these levels
  • Take (weighted) average over results
  • This is a way to focus on how well the system
    ranks the first k documents.

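A small sketch of the computation (the ranking and relevance judgments are invented for illustration):

```python
ranking = ["d7", "d2", "d9", "d4", "d1", "d8", "d3", "d5", "d6", "d0"]
relevant = {"d7", "d4", "d3", "d5"}

def precision_at(k):
    """Precision over the top k documents of the ranking."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

cutoffs = [5, 10]
print([precision_at(k) for k in cutoffs])                    # [0.4, 0.4]
print(sum(precision_at(k) for k in cutoffs) / len(cutoffs))  # unweighted average
```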
101
Problems with Precision/Recall
  • Can't know true recall value
  • except in small collections
  • Precision/Recall are related
  • A combined measure sometimes more appropriate
  • Assumes batch mode
  • Interactive IR is important and has different
    criteria for successful searches
  • Assumes a strict rank ordering matters.

102
Relation to Contingency Table
                        Doc is Relevant   Doc is NOT relevant
  Doc is retrieved            a                   b
  Doc is NOT retrieved        c                   d

  • Accuracy = (a + d) / (a + b + c + d)
  • Precision = a / (a + b)
  • Recall = a / (a + c)
  • Why don't we use Accuracy for IR?
  • (Assuming a large collection)
  • Most docs aren't relevant
  • Most docs aren't retrieved
  • Inflates the accuracy value

103
The E-Measure
  • Combine Precision and Recall into one number (van
    Rijsbergen 79)

P = precision, R = recall, b = measure of the relative
importance of P or R. For example, b = 0.5 means the
user is twice as interested in precision as in recall.
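The formula itself is missing from the transcript; van Rijsbergen's E measure is usually written as:

```latex
E = 1 - \frac{(1 + b^2)\,P\,R}{b^2\,P + R}
```

Lower E is better; b < 1 weights precision more heavily, b > 1 weights recall more heavily.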
104
How to Evaluate? Test Collections
105
Test Collections
  • Cranfield 2
  • 1400 Documents, 221 Queries
  • 200 Documents, 42 Queries
  • INSPEC -- 542 Documents, 97 Queries
  • UKCIS -- > 10,000 Documents, multiple sets, 193
    Queries
  • ADI -- 82 Documents, 35 Queries
  • CACM 3204 Documents, 50 Queries
  • CISI 1460 Documents, 35 Queries
  • MEDLARS (Salton) 273 Documents, 18 Queries

106
TREC
  • Text REtrieval Conference/Competition
  • Run by NIST (National Institute of Standards and
    Technology)
  • 2002 (November) will be the 11th year
  • Collection: > 6 Gigabytes (5 CD-ROMs), > 1.5
    Million Docs
  • Newswire & full-text news (AP, WSJ, Ziff, FT)
  • Government documents (federal register,
    Congressional Record)
  • Radio Transcripts (FBIS)
  • Web subsets

107
TREC (cont.)
  • Queries & Relevance Judgments
  • Queries devised and judged by Information
    Specialists
  • Relevance judgments done only for those documents
    retrieved -- not entire collection!
  • Competition
  • Various research and commercial groups compete
    (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
  • Results judged on precision and recall, going up
    to a recall level of 1000 documents

108
TREC
  • Benefits
  • made research systems scale to large collections
    (pre-WWW)
  • allows for somewhat controlled comparisons
  • Drawbacks
  • emphasis on high recall, which may be unrealistic
    for what most users want
  • very long queries, also unrealistic
  • comparisons still difficult to make, because
    systems are quite different on many dimensions
  • focus on batch ranking rather than interaction
  • no focus on the WWW

109
TREC is changing
  • Emphasis on specialized tracks
  • Interactive track
  • Natural Language Processing (NLP) track
  • Multilingual tracks (Chinese, Spanish)
  • Filtering track
  • High-Precision
  • High-Performance
  • http://trec.nist.gov/

110
What to Evaluate?
  • Effectiveness
  • Difficult to measure
  • Recall and Precision are one way
  • What might be others?