Identifying Sets of Related Words from the World Wide Web Thesis Defense 06/09/2005 - PowerPoint PPT Presentation

About This Presentation

Title:

Identifying Sets of Related Words from the World Wide Web Thesis Defense 06/09/2005

Description:

The goal of my thesis research is to use the World Wide Web as a source of ... Airsoft. Shooting AND Guns. Guns AND Shooting. Guns AND Cases. etc etc. ... – PowerPoint PPT presentation

Number of Views:208

Avg rating:3.0/5.0

Slides: 66

Provided by: Prat4

Learn more at: https://www.d.umn.edu

Category:

more less

Transcript and Presenter's Notes

Title: Identifying Sets of Related Words from the World Wide Web Thesis Defense 06/09/2005

1
Identifying Sets of RelatedWords from the World
Wide Web Thesis Defense 06/09/2005

Pratheepan (Prath) Raveendranathan
Advisor Ted Pedersen

2
Outline

Introduction Objective
Methodology
Experimental Results
Conclusion
Future Work
Demo

3
Introduction

The goal of my thesis research is to use the
World Wide Web as a source of information to
identify sets of words that are related in
meaning.
Example, given two words - gun,pistol
a possible set of related words would be
handgun, holster, shotgun, machine-gun,
weapon,ammunition,bullet, magazine
Example, given two words toyota, nissan, ford
A possible set of related words would
honda, gmc, chevy, mitsubishi

4
Examples Cont

Example, given two words - red,yellow
a possible set of related words would be
white,black,blue, colors, green
Example, given two words - George Bush,Bill
Clinton
a possible set of related words would be
Ronald Reagan, Jimmy Carter, White House,
Presidents, USA, etc

5
Application

Use sets of related words to classify Semantic
Orientation of reviews.
(Peter Turney)
Use sets of related words to find the sentiment
associated with particular product.
(Rajiv Vaidyanathan and Praveen Agarwal).

6
Pros and Cons of using the Web

Pros
Huge amounts of text
Diverse text
Encyclopedias, Publications, Commercial Web
Pages
Dynamic (ever-changing state)
Cons,
The Web creates a unique set of challenges,
Dynamic (ever-changing state)
News websites, Blogs
Presence of repetitive, noisy, or low-quality
data.
HTML tags, web lingo (home page, information etc)

7
Contributions

Developed an Algorithm that predicts sets of
related words by using pattern matching
techniques and frequency counts.
Developed an Algorithm that predicts sets of
related words by using a relatedness measure.
Developed an Algorithm that predicts sets of
related words by using a relatedness measure and
an extension of the Log Likelihood score.
Applied sets of related words to problem of
Sentiment Classification.

8
Outline

Introduction Objective
Methodology
Experimental Results
Conclusion
Future Work
Demo

9
Interface to Web - Google

Reasons for using Google
Research is very much dependant on both the
quantity and quality of the Web content.
Google has a very effective ranking algorithm
called PageRank which attempts to give more
important or higher quality web pages a higher
ranking.
Google API An interface which allows
programmers to query more than 8 billion web
pages using the Google search engine.
(http//www.google.com/apis/).

10
Problems with Google API

Restricted to 1000 queries a day
10 Results for each query
No near operator (Proximity based search)
Maximum 1000 results.
Alternative
Yahoo API 5000 Queries a day (Released very
recently)
No near operator as well.
Cannot retrieve number of hits.
Note Google was used only as means of retrieving
from the
Information.

11
Key Idea behind Algorithms

Words that are related in meaning often tend to
occur together.
Example,
A Springfield, MA , Chevrolet, Ford, Honda,
Lexus, Mazda, Nissan, Saturn, Toyota automotive
dealer with new and pre-owned vehicle sales and
leasing

12
Algorithm 1

Features
Based on frequency
Takes only single words as input
Initial set 2 words
Frequency cutoff
Ranked by frequency
Smart stop list -
The, if, me, why, you etc (non-content words)
Web stop list
Web page, WWW, home,page, personal, url,
information, link, text , decoration, verdana,
script, javascript

13
Algorithm 1 High level Description

Create queries to Google based on the input
terms.
Retrieve the top N number of web pages for each
query.
Parse the retrieved web page content for each
query.
3. Tokenize web page content into list of words
and frequency.
Discard words that occur less than C number of
times.
4. Find the common words between at least two of
the sets of words. This set of intersecting words
are the set of related words to the input term.
5. Repeat the process for I iterations by using
the set of related words
from the previous iteration as input.

14
Algorithm 1 Trace 1

Search Terms S1pistol, gun
Frequency Cutoff 15
Num Results (Web Pages) 10
Iterations - 2

15
Algorithm 1 Step 1

Create queries to Google based permutations of
the Input Terms,
gun
gun AND pistol
pistol
pistol AND gun

16
Algorithm 1 Step 2

Issue query to Google,
Retrieve the top 10 URLs for the query,
For each URL, retrieve the web page content, and
parse the web page for more links.
Traverse these links and retrieve the content of
those web pages as well.
Repeat this process for each query.

17
Trace 1 Cont

Web pages for the query gun

18
Trace 1 Cont

Web pages for pistol

19
Trace 1 Cont

Web pages for gun AND pistol

20
Trace 1 Cont

Web pages for pistol AND gun

21
Algorithm 1 Step 3

3. Next, for the total web page content retrieved
for each query,
Remove HTML Tags etc and retrieve text.
Remove stop words.
Tokenize the web page content into lists of words
and frequency.
Note This would result in the following 4 sets
of words,
each set representing the words retrieved for
each
query.

22
Words from Web pages after removing stop words
23
Algorithm 1 Step 4

4. Find the words that are common at least 2
sets.
Let,
gun AND pistol
pistol AND gun
gun
pistol
Related Set

24
Related Set 1 Iteration 1
25
Trace 1 Cont Iteration 2

11 input terms
Search terms created
Rifle
Shooting
Guns
Cases
Airsoft
Shooting AND Guns
Guns AND Shooting
Guns AND Cases
etc etc.
Results in 112 121 queries to Google!
Note As you can see, the number of queries to
Google increases
drastically.

26
Result Set 2 gun, pistol
27
Algorithm 1 red, yellow
Number of Results 10 Frequency Cutoff -
15 Iterations - 1
Related Words
28
Problems with Algorithm 1

Frequency based ranking,
Number of input terms restricted to 2,
Input and output restricted to single words

29
Algorithm 2

Features
Based on frequency relatedness score
Can takes input as single words or 2 word
collocations
Relatedness measure based on Jiang and Conrath
Frequency cutoff and relatedness score cutoff
Ranked by score
Initial set can be more than 2 words
Bi-grams as output
Smart stop list
The, if, me, why, you etc
Web stop words phrases
Web page, WWW, home page, personal, url,
information, link, text , decoration, verdana,
script, javascript

30
Algorithm 2 High level Description

Repeat same steps as in Algorithm 1 to retrieve
initial set of related words (Add bigrams to
results as well).
For each word returned by Algorithm 1 as a
related word,
Calculate Relatedness of word to input terms.
Discard any word or bigram with a relatedness
score greater than the score cutoff.
Sort remaining terms from most relevant to
irrelevant.
Repeat Steps 1 2 for each iteration, using the
set of words from iteration previous iteration as
input.

31
Relatedness Measure (Distance Measure)

Relatedness (Word1, Word2)
log (hits(Word1)) log (hits(Word2)) 2 log
(hits(Word1 Word2))
(Based on measure by Jiang and Conrath)
Example 1,
hits(toyota) 12,500,000
hits(ford) 22,900,000
hits(toyota AND ford) 50,000
32.41
Example 2,
hits(toyota) 12,500,000
hits(ford) 22,900,000
hits(toyota AND ford) 150,000
30.82

32
Relatedness Measure Cont

Example 3,
hits(toyota) 1000
hits(ford) 1000
hits(toyota AND ford) 1000
Relatedness (toyota,ford) 0
As the measure tends to approach zero, the
relatedness
between the two terms increase.

33
Input Set gun, pistol
34
Algorithm 2 red, yellow
Number of Results 10 Frequency Cutoff -
10 Score Cutoff - 30 Iterations - 1
35
Problems with Algorithm 2

Certain bigrams are not good collocations,
For example,
sunny, cloudy
Number of Results - 10
Frequency Cutoff - 15
Bigram Cutoff - 4
Score Cutoff - 30

36
Algorithm 3 High Level Description

Repeat same steps as in Algorithm 1 to retrieve
initial set of related words (Add bigrams to
results as well).
For each term returned by Algorithm 1 as a
related word,
If the term is a bigram,
Validate if bigram is a valid collocation
If bigram is a valid collocation continue with
step 2.2
else
2. Remove term from set of related words.
Calculate Relatedness of word to input terms.
Discard any word or collocation with a
relatedness score greater than the score cutoff.
Sort remaining terms from most relevant to
irrelevant.

37
Verifying Bigrams

Adapt Log Likelihood (G2) Score to web hit counts
Example, New York
4 Queries to Google

New
New York
York
of the
38
Expected Values
(621 3560) / 5670
(5049 3560) / 5670
(621 2110) / 5670
(5049 2110) / 5670
39
Identifying a bad collocation

Bigram is discarded if,
Observed value for bigram is 0 (eg, New York)
Observed value for bigram is less than the
expected value.

40
Example Bigrams
41
Methodology

Introduction Objective
Methodology
Experimental Results Evaluation
Conclusion
Future Work
Demo

42
Evaluating Results

Compare with Google Sets
http//labs.google.com/sets
Human Subject Experiments
Around 20 people expanded 2-word sets to what
they feel as a set of related words

43
F-measure, Precision and Recall

44
Comparison of Algorithm 1 2
45
Algorithm 1
jordan,chicago
Number of Results 10 Frequency Cutoff -
15 Iterations - 1
Precision 0, Recall 0 F-measure 0
46
Algorithm 2
toyota,ford, nissan
Number of Results 10 Frequency Cutoff -
10 Score Cutoff - 30 Iterations - 1
Precision 6/11 0.54, Recall 6/11
0.54 F-measure 0.54
47
Algorithm 2
january, february, may
Number of Results 10 Frequency Cutoff -
10 Score Cutoff - 30 Iterations - 1
Precision 9/9 1, Recall 9/9
1 F-measure 1
48
Algorithm 2
armani, versace
Number of Results 10 Frequency Cutoff -
10 Bigram Cutoff - 4 Score Cutoff - 30 Iterations
- 1
Precision 11/20 0.55, Recall 11/43
.25 F-measure 0.35
Not Entire Set
49
Algorithm 2
artificial intelligence, machine learning
Number of Results 10 Frequency Cutoff -
10 Bigram Cutoff - 4 Score Cutoff - 32 Iterations
- 1
Precision 9/23 0.39, Recall 9/48
0.1875 F-measure 0.25
50
Comparison of Algorithm 2 3
sunny, cloudy
Number of Results 10 Frequency Cutoff -
10 Bigram Cutoff - 4 Score Cutoff - 30 Iterations
- 1
51
Algorithm 3 - Bigrams
artificial intelligence, machine learning
52
Performance of Algorithms

F-measure increases, from Algorithm 1 to 3

53
Sentiment Classification

Point wise Mutual Information Information
Retrieval Algorithm (PMI-IR) Peter Turney
Used to classify reviews as being positive or
negative in orientation
Part-of-speech tag the review
Extract 2-word phrases from text
Adjective followed by a Noun
Noun followed by a Noun etc.
Use a positive connotation such as excellent
and negative connotation such as poor, and
calculate the Semantic Orientation (SO) for each
2-word phrase,

54
Example,

Let, the phrase be incredible cast
SO(incredible cast)
log2 (hits(incredible cast NEAR excellent))
hits(poor)
(hits(incredible cast NEAR poor))
hits(excellent)

55
Problem with Current Algorithm

Words such as poor have at least two senses
poor as in poverty
poor as in not good

56
Extended PMI-IR

Used Google instead of AltaVista
Used AND instead of NEAR
Extended SO formula
Use multiple pairs of positive and negative
connotations
excellent, poor, good, bad, great, mediocre

57
A Negative Review for the movie Planet of the
Apes
Classified by our Algorithm as being Negative
58
Positive Review for an Audi
Classified by our Algorithm as being Positive
59
Negative Movie Review
Classified by our Algorithm as being Negative
60
Performance of Extended PMI-IR