

1
Identifying Sets of Related Words from the World
Wide Web: Thesis Defense 06/09/2005
  • Pratheepan (Prath) Raveendranathan
  • Advisor: Ted Pedersen

2
Outline
  • Introduction & Objective
  • Methodology
  • Experimental Results
  • Conclusion
  • Future Work
  • Demo

3
Introduction
  • The goal of my thesis research is to use the
    World Wide Web as a source of information to
    identify sets of words that are related in
    meaning.
  • Example: given two words - gun, pistol -
  • a possible set of related words would be
  • handgun, holster, shotgun, machine-gun,
    weapon, ammunition, bullet, magazine
  • Example: given three words - toyota, nissan, ford -
  • a possible set of related words would be
  • honda, gmc, chevy, mitsubishi

4
Examples Cont.
  • Example: given two words - red, yellow -
  • a possible set of related words would be
  • white, black, blue, colors, green
  • Example: given two words - George Bush, Bill
    Clinton -
  • a possible set of related words would be
  • Ronald Reagan, Jimmy Carter, White House,
    Presidents, USA, etc.

5
Application
  • Use sets of related words to classify the Semantic
    Orientation of reviews
  • (Peter Turney)
  • Use sets of related words to find the sentiment
    associated with a particular product
  • (Rajiv Vaidyanathan and Praveen Agarwal)

6
Pros and Cons of using the Web
  • Pros
  • Huge amounts of text
  • Diverse text
  • Encyclopedias, publications, commercial web
    pages
  • Dynamic (ever-changing state)
  • Cons
  • The Web creates a unique set of challenges:
  • Dynamic (ever-changing state)
  • News websites, blogs
  • Presence of repetitive, noisy, or low-quality
    data
  • HTML tags, web lingo (home page, information, etc.)

7
Contributions
  • Developed an algorithm that predicts sets of
    related words by using pattern-matching
    techniques and frequency counts.
  • Developed an algorithm that predicts sets of
    related words by using a relatedness measure.
  • Developed an algorithm that predicts sets of
    related words by using a relatedness measure and
    an extension of the Log-Likelihood score.
  • Applied sets of related words to the problem of
    Sentiment Classification.

8
Outline
  • Introduction & Objective
  • Methodology
  • Experimental Results
  • Conclusion
  • Future Work
  • Demo

9
Interface to the Web - Google
  • Reasons for using Google:
  • The research is very much dependent on both the
    quantity and quality of the Web content.
  • Google has a very effective ranking algorithm
    called PageRank, which attempts to give more
    important or higher-quality web pages a higher
    ranking.
  • Google API: an interface which allows
    programmers to query more than 8 billion web
    pages using the Google search engine
    (http://www.google.com/apis/).

10
Problems with the Google API
  • Restricted to 1000 queries a day
  • 10 results for each query
  • No NEAR operator (proximity-based search)
  • Maximum 1000 results
  • Alternative:
  • Yahoo API: 5000 queries a day (released very
    recently)
  • Also has no NEAR operator
  • Cannot retrieve the number of hits
  • Note: Google was used only as a means of
    retrieving information from the Web.

11
Key Idea behind the Algorithms
  • Words that are related in meaning often tend to
    occur together.
  • Example:
  • "A Springfield, MA, Chevrolet, Ford, Honda,
    Lexus, Mazda, Nissan, Saturn, Toyota automotive
    dealer with new and pre-owned vehicle sales and
    leasing"

12
Algorithm 1
  • Features
  • Based on frequency
  • Takes only single words as input
  • Initial set 2 words
  • Frequency cutoff
  • Ranked by frequency
  • Smart stop list -
  • The, if, me, why, you etc (non-content words)
  • Web stop list
  • Web page, WWW, home,page, personal, url,
    information, link, text , decoration, verdana,
    script, javascript

13
Algorithm 1: High-level Description
  1. Create queries to Google based on the input
     terms.
  2. Retrieve the top N web pages for each query, and
     parse the retrieved web page content for each
     query.
  3. Tokenize the web page content into a list of
     words and frequencies. Discard words that occur
     fewer than C times.
  4. Find the common words between at least two of
     the sets of words. This set of intersecting
     words is the set of words related to the input
     terms.
  5. Repeat the process for I iterations, using the
     set of related words from the previous iteration
     as input (see the sketch below).
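
A minimal sketch of this pipeline in Python, assuming hypothetical helpers
google_top_urls and fetch_text in place of the original Google API and
page-retrieval code (the token pattern and abridged stop lists are
illustrative, not the thesis code):

    import re
    from itertools import permutations
    from collections import Counter

    STOP_WORDS = {"the", "if", "me", "why", "you"}        # abridged stop list
    WEB_STOP = {"www", "url", "javascript", "verdana"}    # abridged web stop list

    def related_words(terms, n_pages=10, cutoff=15, iterations=1):
        for _ in range(iterations):
            # step 1: single-term queries plus every ordered pair (slide 15)
            queries = list(terms) + [" AND ".join(p) for p in permutations(terms, 2)]
            word_sets = []
            for q in queries:
                counts = Counter()
                # step 2: fetch pages (google_top_urls / fetch_text are hypothetical)
                for url in google_top_urls(q, n_pages):
                    text = fetch_text(url)
                    # step 3: tokenize, drop stop words, count frequencies
                    for w in re.findall(r"[a-z][a-z-]+", text.lower()):
                        if w not in STOP_WORDS and w not in WEB_STOP:
                            counts[w] += 1
                word_sets.append({w for w, c in counts.items() if c >= cutoff})
            # step 4: keep words common to at least two per-query sets
            membership = Counter(w for s in word_sets for w in s)
            terms = [w for w, k in membership.items() if k >= 2]
        # step 5: the loop feeds each iteration's output back in as input
        return terms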

14
Algorithm 1: Trace 1
  • Search Terms: S1 = {pistol, gun}
  • Frequency Cutoff: 15
  • Num Results (Web Pages): 10
  • Iterations: 2

15
Algorithm 1: Step 1
  • Create queries to Google based on permutations of
    the input terms (sketched below):
  • gun
  • gun AND pistol
  • pistol
  • pistol AND gun
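
Isolating just this step, the query set can be generated with permutations
(a sketch, not the thesis code):

    from itertools import permutations

    def make_queries(terms):
        # single-term queries plus every ordered pair joined by AND
        return list(terms) + [f"{a} AND {b}" for a, b in permutations(terms, 2)]

    print(make_queries(["gun", "pistol"]))
    # ['gun', 'pistol', 'gun AND pistol', 'pistol AND gun']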

16
Algorithm 1: Step 2
  • Issue the query to Google.
  • Retrieve the top 10 URLs for the query.
  • For each URL, retrieve the web page content, and
    parse the web page for more links.
  • Traverse these links and retrieve the content of
    those web pages as well.
  • Repeat this process for each query.

17
Trace 1 Cont.
  • Web pages for the query gun

18
Trace 1 Cont.
  • Web pages for pistol

19
Trace 1 Cont.
  • Web pages for gun AND pistol

20
Trace 1 Cont.
  • Web pages for pistol AND gun

21
Algorithm 1: Step 3
  • 3. Next, for the total web page content retrieved
    for each query:
  • Remove HTML tags etc. and retrieve the text.
  • Remove stop words.
  • Tokenize the web page content into lists of words
    and frequencies.
  • Note: This results in the following 4 sets of
    words, each set representing the words retrieved
    for one query.

22
Words from Web pages after removing stop words
23
Algorithm 1: Step 4
  • 4. Find the words that are common to at least 2
    sets.
  • Let the four sets be the words retrieved for:
  • gun AND pistol
  • pistol AND gun
  • gun
  • pistol
  • The Related Set is the set of words that appear in
    at least two of these sets (see the sketch below).
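
A sketch of this membership test; the four sets below are illustrative
stand-ins for the tokenized trace output, seeded with words from the
related set on slide 25:

    from collections import Counter

    def common_to_two(word_sets):
        # count how many of the per-query sets each word appears in
        membership = Counter(w for s in word_sets for w in s)
        return {w for w, k in membership.items() if k >= 2}

    gun            = {"rifle", "shooting", "holster"}
    pistol         = {"holster", "airsoft", "cases"}
    gun_and_pistol = {"rifle", "airsoft"}
    pistol_and_gun = {"shooting", "cases"}

    print(common_to_two([gun, pistol, gun_and_pistol, pistol_and_gun]))
    # all five words appear in at least two sets (print order varies)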

24
Related Set 1 (Iteration 1)
25
Trace 1 Cont.: Iteration 2
  • 11 input terms
  • Search terms created:
  • Rifle
  • Shooting
  • Guns
  • Cases
  • Airsoft
  • Shooting AND Guns
  • Guns AND Shooting
  • Guns AND Cases
  • etc., etc.
  • Results in 11² = 121 queries to Google!
  • Note: As you can see, the number of queries to
    Google increases drastically.

26
Result Set 2: gun, pistol
27
Algorithm 1: red, yellow
Number of Results = 10, Frequency Cutoff = 15,
Iterations = 1
Related Words:
28
Problems with Algorithm 1
  • Frequency-based ranking
  • Number of input terms restricted to 2
  • Input and output restricted to single words

29
Algorithm 2
  • Features:
  • Based on frequency + relatedness score
  • Can take single words or 2-word collocations as
    input
  • Relatedness measure based on Jiang and Conrath
  • Frequency cutoff and relatedness score cutoff
  • Ranked by score
  • Initial set can be more than 2 words
  • Bi-grams as output
  • Smart stop list:
  • the, if, me, why, you, etc.
  • Web stop words/phrases:
  • web page, WWW, home page, personal, url,
    information, link, text-decoration, verdana,
    script, javascript

30
Algorithm 2: High-level Description
  1. Repeat the same steps as in Algorithm 1 to
     retrieve an initial set of related words (adding
     bigrams to the results as well).
  2. For each word returned by Algorithm 1 as a
     related word:
     - Calculate the relatedness of the word to the
       input terms.
     - Discard any word or bigram with a relatedness
       score greater than the score cutoff.
     - Sort the remaining terms from most relevant to
       least relevant.
  3. Repeat Steps 1-2 for each iteration, using the
     set of words from the previous iteration as
     input. (A sketch of the scoring step follows.)
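
A minimal sketch of the scoring step, where relatedness(word, term) is a
hypothetical wrapper that fetches hit counts and applies the measure
defined on the next slide; averaging the score over the input terms is an
assumption, since the deck does not say how multiple input terms are
combined:

    def filter_by_relatedness(candidates, input_terms, score_cutoff=30):
        scored = []
        for word in candidates:
            # relatedness(word, term) is hypothetical; average over input terms
            score = sum(relatedness(word, t) for t in input_terms) / len(input_terms)
            if score <= score_cutoff:      # discard anything above the cutoff
                scored.append((score, word))
        scored.sort()                      # lowest score = most related
        return [w for s, w in scored]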

31
Relatedness Measure (Distance Measure)
  • Relatedness(Word1, Word2) =
    log(hits(Word1)) + log(hits(Word2)) - 2 log(hits(Word1 AND Word2))
  • (Based on the measure by Jiang and Conrath)
  • Example 1:
  • hits(toyota) = 12,500,000
  • hits(ford) = 22,900,000
  • hits(toyota AND ford) = 50,000
  • Relatedness(toyota, ford) = 32.41
  • Example 2:
  • hits(toyota) = 12,500,000
  • hits(ford) = 22,900,000
  • hits(toyota AND ford) = 150,000
  • Relatedness(toyota, ford) = 30.82

32
Relatedness Measure Cont.
  • Example 3:
  • hits(toyota) = 1000
  • hits(ford) = 1000
  • hits(toyota AND ford) = 1000
  • Relatedness(toyota, ford) = 0
  • As the measure approaches zero, the relatedness
    between the two terms increases (see the sketch
    below).
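
The measure itself is a one-liner; base-2 logarithms are an assumption,
since the slide does not state the base (the hit counts would come from
the search engine):

    from math import log2

    def relatedness_from_hits(w1_hits, w2_hits, both_hits):
        # distance-style score: identical counts give 0, as in Example 3
        return log2(w1_hits) + log2(w2_hits) - 2 * log2(both_hits)

    print(relatedness_from_hits(1000, 1000, 1000))   # 0.0, matching Example 3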

33
Input Set: gun, pistol
34
Algorithm 2: red, yellow
Number of Results = 10, Frequency Cutoff = 10,
Score Cutoff = 30, Iterations = 1
35
Problems with Algorithm 2
  • Certain bigrams are not good collocations.
  • For example: sunny, cloudy
  • Number of Results = 10
  • Frequency Cutoff = 15
  • Bigram Cutoff = 4
  • Score Cutoff = 30

36
Algorithm 3: High-level Description
  1. Repeat the same steps as in Algorithm 1 to
     retrieve an initial set of related words (adding
     bigrams to the results as well).
  2. For each term returned by Algorithm 1 as a
     related word:
     2.1. If the term is a bigram, validate whether
          it is a valid collocation. If it is,
          continue with step 2.2; otherwise remove
          the term from the set of related words.
     2.2. Calculate the relatedness of the word to
          the input terms, and discard any word or
          collocation with a relatedness score
          greater than the score cutoff.
     2.3. Sort the remaining terms from most relevant
          to least relevant.

37
Verifying Bigrams
  • Adapt the Log-Likelihood (G²) score to web hit counts
  • Example: New York
  • 4 queries to Google:
  • "New", "New York", "York", and "of the"
    (the hit counts fill a 2x2 contingency table)
38
Expected Values
  • Expected count for each cell = (row total × column total) / N:
  • (621 × 3560) / 5670 ≈ 390
  • (5049 × 3560) / 5670 ≈ 3170
  • (621 × 2110) / 5670 ≈ 231
  • (5049 × 2110) / 5670 ≈ 1879
39
Identifying a Bad Collocation
  • A bigram is discarded if:
  • the observed value for the bigram is 0 (e.g., New
    York), or
  • the observed value for the bigram is less than the
    expected value (sketched below).
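
A sketch of the check using the marginal totals from slide 38; the
observed count of 500 is hypothetical, since the deck does not show the
observed table:

    def bigram_looks_valid(observed, row_total, col_total, grand_total):
        # expected count for the bigram cell under independence
        expected = row_total * col_total / grand_total
        # discard rule: observed must be non-zero and at least the expected value
        return observed > 0 and observed >= expected

    # marginals from slide 38: rows 621/5049, columns 3560/2110, N = 5670
    print(round(621 * 3560 / 5670, 1))               # 389.9, expected bigram count
    print(bigram_looks_valid(500, 621, 3560, 5670))  # True for a hypothetical count of 500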

40
Example Bigrams
41
Outline
  • Introduction & Objective
  • Methodology
  • Experimental Results & Evaluation
  • Conclusion
  • Future Work
  • Demo

42
Evaluating Results
  • Compare with Google Sets
  • http://labs.google.com/sets
  • Human subject experiments
  • Around 20 people expanded 2-word sets into what
    they felt were sets of related words

43
F-measure, Precision and Recall
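
The formulas on this slide are the standard ones; as a reference, a small
sketch checked against the toyota/ford/nissan numbers on slide 46, where
both the returned set and the reference set have 11 items:

    def precision_recall_f(n_correct, n_returned, n_gold):
        p = n_correct / n_returned   # fraction of returned words that are right
        r = n_correct / n_gold       # fraction of the reference set recovered
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # slide 46: 6 of 11 returned words matched the 11-word reference set
    p, r, f = precision_recall_f(6, 11, 11)
    print(round(p, 2), round(r, 2), round(f, 2))
    # 0.55 0.55 0.55 (the slide truncates to 0.54)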

44
Comparison of Algorithms 1 & 2
45
Algorithm 1
jordan, chicago
Number of Results = 10, Frequency Cutoff = 15,
Iterations = 1
Precision = 0, Recall = 0, F-measure = 0
46
Algorithm 2
toyota, ford, nissan
Number of Results = 10, Frequency Cutoff = 10,
Score Cutoff = 30, Iterations = 1
Precision = 6/11 = 0.54, Recall = 6/11 = 0.54,
F-measure = 0.54
47
Algorithm 2
january, february, may
Number of Results = 10, Frequency Cutoff = 10,
Score Cutoff = 30, Iterations = 1
Precision = 9/9 = 1, Recall = 9/9 = 1,
F-measure = 1
48
Algorithm 2
armani, versace
Number of Results = 10, Frequency Cutoff = 10,
Bigram Cutoff = 4, Score Cutoff = 30, Iterations = 1
Precision = 11/20 = 0.55, Recall = 11/43 = 0.25,
F-measure = 0.35
(Not the entire set shown)
49
Algorithm 2
artificial intelligence, machine learning
Number of Results = 10, Frequency Cutoff = 10,
Bigram Cutoff = 4, Score Cutoff = 32, Iterations = 1
Precision = 9/23 = 0.39, Recall = 9/48 = 0.1875,
F-measure = 0.25
50
Comparison of Algorithms 2 & 3
sunny, cloudy
Number of Results = 10, Frequency Cutoff = 10,
Bigram Cutoff = 4, Score Cutoff = 30, Iterations = 1
51
Algorithm 3 - Bigrams
artificial intelligence, machine learning
52
Performance of the Algorithms
  • The F-measure increases from Algorithm 1 to
    Algorithm 3
53
Sentiment Classification
  • Pointwise Mutual Information - Information
    Retrieval algorithm (PMI-IR), Peter Turney
  • Used to classify reviews as being positive or
    negative in orientation:
  • Part-of-speech tag the review
  • Extract 2-word phrases from the text:
  • an adjective followed by a noun
  • a noun followed by a noun, etc.
  • Use a positive connotation such as "excellent"
    and a negative connotation such as "poor", and
    calculate the Semantic Orientation (SO) for each
    2-word phrase

54
Example,
  • Let, the phrase be incredible cast
  • SO(incredible cast)
  • log2 (hits(incredible cast NEAR excellent))
    hits(poor)
  • (hits(incredible cast NEAR poor))
    hits(excellent)
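
A sketch of the SO computation; hits() is a hypothetical helper returning
a search engine hit count, and AND stands in for NEAR as in the extended
version on slide 56:

    from math import log2

    def semantic_orientation(phrase, pos="excellent", neg="poor"):
        # positive result -> phrase leans positive; negative -> leans negative
        num = hits(f'"{phrase}" AND {pos}') * hits(neg)
        den = hits(f'"{phrase}" AND {neg}') * hits(pos)
        return log2(num / den)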

55
Problem with the Current Algorithm
  • Words such as "poor" have at least two senses:
  • poor as in poverty
  • poor as in not good

56
Extended PMI-IR
  • Used Google instead of AltaVista
  • Used AND instead of NEAR
  • Extended the SO formula:
  • use multiple pairs of positive and negative
    connotations (see the sketch below):
  • (excellent, poor), (good, bad), (great, mediocre)
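
One reading of the extension, reusing the semantic_orientation sketch
above; averaging over the three pairs is an assumption, since the deck
only says multiple pairs are used, while classifying by the average SO of
the review's phrases follows Turney's original method:

    PAIRS = [("excellent", "poor"), ("good", "bad"), ("great", "mediocre")]

    def extended_so(phrase):
        # averaging across connotation pairs is an assumption
        scores = [semantic_orientation(phrase, pos, neg) for pos, neg in PAIRS]
        return sum(scores) / len(scores)

    def classify_review(phrases):
        # label a review by the average SO of its extracted phrases
        avg = sum(extended_so(p) for p in phrases) / len(phrases)
        return "positive" if avg > 0 else "negative"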

57
A Negative Review for the movie Planet of the
Apes
Classified by our Algorithm as being Negative
58
Positive Review for an Audi
Classified by our Algorithm as being Positive
59
Negative Movie Review
Classified by our Algorithm as being Negative
60
Performance of the Extended PMI-IR
  • Algorithm run on 20 reviews (movies and
    automobiles)
  • Overall accuracy: 75%

61
End Result
  • All of this is freely available on CPAN and
    SourceForge:
  • Google-Hack

62
Conclusions & Contributions
  • Developed 3 algorithms that try to predict sets
    of related words:
  • Algorithm 1 was based on frequency
  • Algorithm 2 was based on a relatedness measure
  • Algorithm 3 was based on a relatedness measure
    and the Log-Likelihood score
  • Applied sets of related words to Sentiment
    Classification

63
Conclusions & Contributions
  • Released a free Perl package, Google-Hack, on
    CPAN and SourceForge.
  • Developed a web interface.

64
Future Work
  • Addition of a proximity operator
  • Restricting the number of web pages traversed
  • Finding the intersection of words across
    different search engines (Yahoo API)
  • Using anchor text

65
Related URLs
  • Research Page
  • http://www.d.umn.edu/rave0029/research
  • Google-Hack
  • http://google-hack.sf.net
  • CPAN Release
  • http://search.cpan.org/prath/WebService-GoogleHack-0.15/GoogleHack/GoogleHack.pm
  • Web Interface
  • http://marimba.d.umn.edu/cgi-bin/googlehack/index.cgi