Title: Identifying Sets of Related Words from the World Wide Web (Thesis Defense, 06/09/2005)
1. Identifying Sets of Related Words from the World Wide Web
Thesis Defense, 06/09/2005
- Pratheepan (Prath) Raveendranathan
- Advisor: Ted Pedersen
2. Outline
- Introduction & Objective
- Methodology
- Experimental Results
- Conclusion
- Future Work
- Demo
3. Introduction
- The goal of my thesis research is to use the World Wide Web as a source of information to identify sets of words that are related in meaning.
- Example: given the two words gun, pistol, a possible set of related words would be:
- handgun, holster, shotgun, machine-gun, weapon, ammunition, bullet, magazine
- Example: given the two words toyota, nissan, ford, a possible set of related words would be:
- honda, gmc, chevy, mitsubishi
4. Examples Cont.
- Example: given the two words red, yellow, a possible set of related words would be:
- white, black, blue, colors, green
- Example: given the two words George Bush, Bill Clinton, a possible set of related words would be:
- Ronald Reagan, Jimmy Carter, White House, Presidents, USA, etc.
5. Application
- Use sets of related words to classify the semantic orientation of reviews (Peter Turney).
- Use sets of related words to find the sentiment associated with a particular product (Rajiv Vaidyanathan and Praveen Agarwal).
6. Pros and Cons of Using the Web
- Pros
- Huge amounts of text
- Diverse text
- Encyclopedias, publications, commercial web pages
- Dynamic (ever-changing state)
- Cons
- The Web creates a unique set of challenges:
- Dynamic (ever-changing state)
- News websites, blogs
- Presence of repetitive, noisy, or low-quality data
- HTML tags, web lingo ("home page", "information", etc.)
7. Contributions
- Developed an algorithm that predicts sets of related words by using pattern-matching techniques and frequency counts.
- Developed an algorithm that predicts sets of related words by using a relatedness measure.
- Developed an algorithm that predicts sets of related words by using a relatedness measure and an extension of the Log Likelihood score.
- Applied sets of related words to the problem of sentiment classification.
8. Outline
- Introduction & Objective
- Methodology
- Experimental Results
- Conclusion
- Future Work
- Demo
9. Interface to the Web - Google
- Reasons for using Google:
- This research is very much dependent on both the quantity and quality of the Web content.
- Google has a very effective ranking algorithm called PageRank, which attempts to give more important or higher-quality web pages a higher ranking.
- Google API: an interface which allows programmers to query more than 8 billion web pages using the Google search engine (http://www.google.com/apis/).
10. Problems with the Google API
- Restricted to 1,000 queries a day
- 10 results for each query
- No NEAR operator (proximity-based search)
- Maximum 1,000 results
- Alternative:
- Yahoo API: 5,000 queries a day (released very recently)
- No NEAR operator either
- Cannot retrieve the number of hits
- Note: Google was used only as a means of retrieving information from the Web.
11. Key Idea behind the Algorithms
- Words that are related in meaning often tend to occur together.
- Example:
- "A Springfield, MA, Chevrolet, Ford, Honda, Lexus, Mazda, Nissan, Saturn, Toyota automotive dealer with new and pre-owned vehicle sales and leasing"
12. Algorithm 1
- Features:
- Based on frequency
- Takes only single words as input
- Initial set: 2 words
- Frequency cutoff
- Ranked by frequency
- SMART stop list
- the, if, me, why, you, etc. (non-content words)
- Web stop list
- web page, WWW, home, page, personal, url, information, link, text, decoration, verdana, script, javascript
13. Algorithm 1: High-Level Description
- 1. Create queries to Google based on the input terms.
- 2. Retrieve the top N web pages for each query, and parse the retrieved web page content.
- 3. Tokenize the web page content into a list of words and frequencies, and discard words that occur fewer than C times.
- 4. Find the words common to at least two of the resulting sets of words. This set of intersecting words is the set of words related to the input terms.
- 5. Repeat the process for I iterations, using the set of related words from the previous iteration as input.
- (A minimal sketch of this loop follows below.)
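This is not the thesis implementation (that is the Perl Google-Hack package); it is a minimal Python sketch of the loop, where fetch_tokens is a hypothetical helper that issues the query, downloads the top pages, strips HTML and stop words, and returns the remaining tokens:

```python
from collections import Counter
from itertools import permutations

def algorithm1(terms, num_results=10, cutoff=15, iterations=2):
    """Frequency-based related-word discovery (Algorithm 1 sketch)."""
    related = set()
    for _ in range(iterations):
        # Step 1: queries are the single terms plus all ordered pairs.
        queries = list(terms) + [f"{a} AND {b}" for a, b in permutations(terms, 2)]

        # Steps 2-3: one token-frequency set per query; keep only words
        # that occur at least `cutoff` times in that query's pages.
        word_sets = []
        for q in queries:
            counts = Counter(fetch_tokens(q, num_results))  # hypothetical helper
            word_sets.append({w for w, c in counts.items() if c >= cutoff})

        # Step 4: keep words appearing in at least two of the query sets.
        all_words = set().union(*word_sets) if word_sets else set()
        related = {w for w in all_words
                   if sum(w in s for s in word_sets) >= 2}

        # Step 5: the related set becomes the next iteration's input.
        terms = related
    return related
```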
14. Algorithm 1: Trace 1
- Search Terms: S1 = {pistol, gun}
- Frequency Cutoff: 15
- Num Results (Web Pages): 10
- Iterations: 2
15. Algorithm 1: Step 1
- Create queries to Google based on permutations of the input terms:
- gun
- gun AND pistol
- pistol
- pistol AND gun
16. Algorithm 1: Step 2
- Issue each query to Google.
- Retrieve the top 10 URLs for the query.
- For each URL, retrieve the web page content, and parse the web page for more links.
- Traverse these links and retrieve the content of those web pages as well (one way to do this is sketched below).
- Repeat this process for each query.
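A hedged illustration of the per-URL step using present-day Python libraries (requests and BeautifulSoup; the thesis itself did this in Perl through the Google API):

```python
import requests
from bs4 import BeautifulSoup

def fetch_page_and_links(url):
    """Return the visible text of one page plus the links it contains,
    so the caller can traverse one level deeper (illustration only)."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ")
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return text, links
```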
17. Trace 1 Cont.
- Web pages for the query gun
18. Trace 1 Cont.
19. Trace 1 Cont.
- Web pages for gun AND pistol
20. Trace 1 Cont.
- Web pages for pistol AND gun
21. Algorithm 1: Step 3
- Next, for the total web page content retrieved for each query:
- Remove HTML tags etc. and retrieve the text.
- Remove stop words.
- Tokenize the web page content into lists of words and frequencies.
- Note: this results in 4 sets of words, each set representing the words retrieved for one query.
22. Words from Web Pages after Removing Stop Words
23. Algorithm 1: Step 4
- Find the words that are common to at least 2 of the sets. Let the sets be:
- gun AND pistol
- pistol AND gun
- gun
- pistol
- Related Set (a toy illustration of this intersection follows below)
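To make step 4 concrete, a toy illustration with invented word sets (these are not the actual sets from the trace):

```python
# Invented per-query word sets, standing in for the four sets above.
sets = {
    "gun":            {"rifle", "shooting", "holster", "trigger"},
    "pistol":         {"rifle", "holster", "ammo", "grip"},
    "gun AND pistol": {"shooting", "ammo", "safety"},
    "pistol AND gun": {"holster", "safety", "range"},
}

# A word is related if it occurs in at least two of the four sets.
related = {w for s in sets.values() for w in s
           if sum(w in t for t in sets.values()) >= 2}
print(sorted(related))  # ['ammo', 'holster', 'rifle', 'safety', 'shooting']
```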
24. Related Set 1: Iteration 1
25. Trace 1 Cont.: Iteration 2
- 11 input terms
- Search terms created:
- Rifle
- Shooting
- Guns
- Cases
- Airsoft
- Shooting AND Guns
- Guns AND Shooting
- Guns AND Cases
- etc.
- Results in 11² = 121 queries to Google!
- Note: as you can see, the number of queries to Google increases drastically.
26. Result Set 2: gun, pistol
27. Algorithm 1: red, yellow
Number of Results = 10, Frequency Cutoff = 15, Iterations = 1
Related Words:
28. Problems with Algorithm 1
- Frequency-based ranking
- Number of input terms restricted to 2
- Input and output restricted to single words
29. Algorithm 2
- Features:
- Based on frequency + relatedness score
- Can take either single words or 2-word collocations as input
- Relatedness measure based on Jiang and Conrath
- Frequency cutoff and relatedness score cutoff
- Ranked by score
- Initial set can be more than 2 words
- Bigrams as output
- SMART stop list
- the, if, me, why, you, etc.
- Web stop words and phrases
- web page, WWW, home page, personal, url, information, link, text, decoration, verdana, script, javascript
30. Algorithm 2: High-Level Description
- Repeat the same steps as in Algorithm 1 to retrieve an initial set of related words (adding bigrams to the results as well).
- For each word returned by Algorithm 1 as a related word:
- Calculate the relatedness of the word to the input terms.
- Discard any word or bigram with a relatedness score greater than the score cutoff.
- Sort the remaining terms from most related to least related.
- Repeat steps 1 and 2 for each iteration, using the set of words from the previous iteration as input (a sketch follows below).
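A sketch of the filtering step, building on the algorithm1 sketch above and the relatedness function sketched after slide 32. How scores against multiple input terms are combined is an assumption here; the slides leave it unspecified:

```python
def algorithm2(terms, score_cutoff=30, **alg1_params):
    """Algorithm 2 sketch: Algorithm 1's candidates, filtered and
    ranked by the web relatedness score (lower = more related)."""
    candidates = algorithm1(terms, **alg1_params)  # initial set (plus bigrams)
    scored = {}
    for word in candidates:
        # Average the candidate's score against each input term
        # (averaging is our assumption, not stated on the slides).
        score = sum(relatedness(word, t) for t in terms) / len(terms)
        if score <= score_cutoff:          # discard anything above the cutoff
            scored[word] = score
    # Most related (smallest score) first.
    return sorted(scored, key=scored.get)
```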
31. Relatedness Measure (Distance Measure)
- Relatedness(Word1, Word2) = log(hits(Word1)) + log(hits(Word2)) - 2 log(hits(Word1 AND Word2))
- (Based on the measure by Jiang and Conrath)
- Example 1:
- hits(toyota) = 12,500,000
- hits(ford) = 22,900,000
- hits(toyota AND ford) = 50,000
- Relatedness(toyota, ford) = 32.41
- Example 2:
- hits(toyota) = 12,500,000
- hits(ford) = 22,900,000
- hits(toyota AND ford) = 150,000
- Relatedness(toyota, ford) = 30.82
32. Relatedness Measure Cont.
- Example 3:
- hits(toyota) = 1000
- hits(ford) = 1000
- hits(toyota AND ford) = 1000
- Relatedness(toyota, ford) = 0
- As the measure approaches zero, the relatedness between the two terms increases.
- (The measure as code follows below.)
33. Input Set: gun, pistol
34. Algorithm 2: red, yellow
Number of Results = 10, Frequency Cutoff = 10, Score Cutoff = 30, Iterations = 1
35. Problems with Algorithm 2
- Certain bigrams are not good collocations.
- For example: sunny, cloudy
- Number of Results = 10
- Frequency Cutoff = 15
- Bigram Cutoff = 4
- Score Cutoff = 30
36. Algorithm 3: High-Level Description
- Repeat the same steps as in Algorithm 1 to retrieve an initial set of related words (adding bigrams to the results as well).
- For each term returned by Algorithm 1 as a related word:
- If the term is a bigram, validate whether it is a valid collocation.
- If the bigram is a valid collocation, continue with step 2.2; else remove the term from the set of related words.
- Calculate the relatedness of the word to the input terms.
- Discard any word or collocation with a relatedness score greater than the score cutoff.
- Sort the remaining terms from most related to least related.
37. Verifying Bigrams
- Adapt the Log Likelihood (G²) score to web hit counts.
- Example: New York
- 4 queries to Google:
- New
- New York
- York
- of the
38. Expected Values
- Expected count for each cell = (row total × column total) / grand total:
- (621 × 3560) / 5670
- (5049 × 3560) / 5670
- (621 × 2110) / 5670
- (5049 × 2110) / 5670
39. Identifying a Bad Collocation
- A bigram is discarded if:
- the observed value for the bigram is 0 (e.g., New York), or
- the observed value for the bigram is less than the expected value.
- (A sketch of this check follows below.)
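Putting slides 38 and 39 together: a small sketch that computes the expected cell counts from the marginal totals shown above and applies the discard rule:

```python
def expected_counts(row_totals, col_totals, n):
    """Expected 2x2 cell counts under independence:
    (row total x column total) / grand total."""
    return [[r * c / n for c in col_totals] for r in row_totals]

def keep_bigram(observed, expected):
    """Discard rule from slide 39: drop the bigram when its observed
    count is zero or falls below its expected count."""
    return observed > 0 and observed >= expected

# Marginal totals from the New York example on slide 38.
rows, cols, n = [3560, 2110], [621, 5049], 5670
for row in expected_counts(rows, cols, n):
    print([round(x, 1) for x in row])
# -> [389.9, 3170.1]
#    [231.1, 1878.9]
```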
40. Example Bigrams
41. Methodology
- Introduction & Objective
- Methodology
- Experimental Results & Evaluation
- Conclusion
- Future Work
- Demo
42. Evaluating Results
- Compare with Google Sets
- http://labs.google.com/sets
- Human subject experiments
- Around 20 people expanded 2-word sets into what they felt was a set of related words.
43. F-measure, Precision and Recall
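The slide's figure is not reproduced here; for reference, the standard definitions, which are consistent with the fractions on the following result slides (where "relevant" is the gold set from Google Sets or the human subjects, and "retrieved" is the algorithm's output):

```latex
\mathrm{Precision} = \frac{|\,\text{retrieved} \cap \text{relevant}\,|}{|\,\text{retrieved}\,|}, \qquad
\mathrm{Recall} = \frac{|\,\text{retrieved} \cap \text{relevant}\,|}{|\,\text{relevant}\,|}, \qquad
\mathrm{F} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```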
44. Comparison of Algorithms 1 & 2
45. Algorithm 1
jordan, chicago
Number of Results = 10, Frequency Cutoff = 15, Iterations = 1
Precision = 0, Recall = 0, F-measure = 0
46. Algorithm 2
toyota, ford, nissan
Number of Results = 10, Frequency Cutoff = 10, Score Cutoff = 30, Iterations = 1
Precision = 6/11 = 0.54, Recall = 6/11 = 0.54, F-measure = 0.54
47. Algorithm 2
january, february, may
Number of Results = 10, Frequency Cutoff = 10, Score Cutoff = 30, Iterations = 1
Precision = 9/9 = 1, Recall = 9/9 = 1, F-measure = 1
48. Algorithm 2
armani, versace
Number of Results = 10, Frequency Cutoff = 10, Bigram Cutoff = 4, Score Cutoff = 30, Iterations = 1
Precision = 11/20 = 0.55, Recall = 11/43 = 0.25, F-measure = 0.35
(Not the entire set)
49. Algorithm 2
artificial intelligence, machine learning
Number of Results = 10, Frequency Cutoff = 10, Bigram Cutoff = 4, Score Cutoff = 32, Iterations = 1
Precision = 9/23 = 0.39, Recall = 9/48 = 0.1875, F-measure = 0.25
50. Comparison of Algorithms 2 & 3
sunny, cloudy
Number of Results = 10, Frequency Cutoff = 10, Bigram Cutoff = 4, Score Cutoff = 30, Iterations = 1
51. Algorithm 3 - Bigrams
artificial intelligence, machine learning
52. Performance of the Algorithms
- The F-measure increases from Algorithm 1 to Algorithm 3.
53. Sentiment Classification
- Pointwise Mutual Information - Information Retrieval (PMI-IR) algorithm, Peter Turney
- Used to classify reviews as being positive or negative in orientation:
- Part-of-speech tag the review.
- Extract 2-word phrases from the text:
- an adjective followed by a noun
- a noun followed by a noun, etc.
- Use a positive connotation such as "excellent" and a negative connotation such as "poor", and calculate the Semantic Orientation (SO) of each 2-word phrase.
54. Example
- Let the phrase be "incredible cast":
- SO(incredible cast) = log2( (hits(incredible cast NEAR excellent) × hits(poor)) / (hits(incredible cast NEAR poor) × hits(excellent)) )
- (A sketch of this computation follows below.)
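A sketch of the computation, reusing the hits() stub from the relatedness sketch (AltaVista supported NEAR; Google's API did not, which motivates the extension on the next slides):

```python
from math import log2

def semantic_orientation(phrase):
    """Turney's PMI-IR semantic orientation for a two-word phrase;
    positive values lean toward "excellent", negative toward "poor"."""
    return log2(
        (hits(f"{phrase} NEAR excellent") * hits("poor")) /
        (hits(f"{phrase} NEAR poor") * hits("excellent"))
    )
```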
55. Problem with the Current Algorithm
- Words such as "poor" have at least two senses:
- poor as in poverty
- poor as in not good
56. Extended PMI-IR
- Used Google instead of AltaVista
- Used AND instead of NEAR
- Extended the SO formula:
- Use multiple pairs of positive and negative connotations
- excellent/poor, good/bad, great/mediocre (a sketch follows below)
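A sketch of the extension; averaging over the pairs is our assumption (the slides say only that multiple pairs are used), and hits() is again the stub from the earlier sketch:

```python
from math import log2

CONNOTATION_PAIRS = [("excellent", "poor"), ("good", "bad"), ("great", "mediocre")]

def extended_so(phrase):
    """Extended SO sketch: AND instead of NEAR, averaged over several
    positive/negative connotation pairs to dilute single-word sense
    problems like "poor"."""
    scores = []
    for pos, neg in CONNOTATION_PAIRS:
        num = hits(f"{phrase} AND {pos}") * hits(neg)
        den = hits(f"{phrase} AND {neg}") * hits(pos)
        if num and den:                      # skip degenerate zero counts
            scores.append(log2(num / den))
    return sum(scores) / len(scores) if scores else 0.0
```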
57. A Negative Review of the Movie Planet of the Apes
Classified by our algorithm as negative.
58. A Positive Review of an Audi
Classified by our algorithm as positive.
59. A Negative Movie Review
Classified by our algorithm as negative.
60. Performance of the Extended PMI-IR
- Algorithm run on 20 reviews (movies and automobiles)
- Overall accuracy: 75%
61. End Result
- All of this is freely available on CPAN and SourceForge:
- Google-Hack
62. Conclusions & Contributions
- Developed 3 algorithms that try to predict sets of related words:
- Algorithm 1 was based on frequency.
- Algorithm 2 was based on a relatedness measure.
- Algorithm 3 was based on a relatedness measure and the Log Likelihood score.
- Applied sets of related words to sentiment classification.
63. Conclusions & Contributions
- Released the free Perl package Google-Hack on CPAN and SourceForge.
- Developed a web interface.
64. Future Work
- Addition of a proximity operator
- Restricting the number of web pages traversed
- Finding the intersection of words across different search engines (Yahoo API)
- Using anchor text
65. Related URLs
- Research page
- http://www.d.umn.edu/~rave0029/research
- Google-Hack
- http://google-hack.sf.net
- CPAN release
- http://search.cpan.org/~prath/WebService-GoogleHack-0.15/GoogleHack/GoogleHack.pm
- Web interface
- http://marimba.d.umn.edu/cgi-bin/googlehack/index.cgi