Recognition and Classification of Noun Phrases in Queries for Effective Retrieval

1
Recognition and Classification of Noun Phrases in Queries for Effective Retrieval
Wei Zhang1 (wzhang_at_cs.uic.edu)
Shuang Liu2 (shuang.liu_at_ask.com)
Clement Yu1 (yu_at_cs.uic.edu)
Chaojing Sun3 (chaojing_at_gmail.com)
Fang Liu4 (fangliu_at_microsoft.com)
Weiyi Meng5 (meng_at_cs.binghamton.edu)
1 Department of Computer Science, University of Illinois at Chicago
2 Ask.com
3 Broadcom Corporation
4 Microsoft
5 Department of Computer Science, Binghamton University
CIKM 2007
2
Outline
  • Motivation
  • Our definitions of the phrases
  • Proper noun and dictionary phrase recognition
  • Simple and complex phrase recognition
  • Experimental results

3
Motivation
  • Terms in a query are related semantically
  • John Smith
  • Recognize this relationship
  • Partition the query terms to groups (phrases)
  • Document retrieval using phrases
  • Adding phrases into searching and ranking

4
Types of Noun Phrases
  • Phrases that have fixed writing formats
  • Names of locations, people, companies, etc.
  • Well-defined concepts, e.g. computer science
  • Freely written phrases
  • Not formally defined but used in the real language

5
Four Types of Noun Phrases
  • Proper Noun (PN)
  • A noun phrase that names a specific person, place
    or thing.
  • First letters of the content words are
    capitalized
  • E.g. John Smith, Atlantic Ocean
  • Dictionary Phrase (DP)
  • A phrase that has a definition in a dictionary,
    excluding PN
  • These two types may overlap
  • E.g. Atlantic Ocean
  • Neither type can replace the other
  • E.g. Linas Pizza (a PN but not a DP), public transportation (a DP but not a PN)

6
Four Types of Noun Phrases
  • Simple Noun Phrase (SNP)
  • A grammatically valid noun phrase other than PN
    and DP
  • 2 words
  • E.g. white car, good hotel
  • Complex Noun Phrase (CNP)
  • A grammatically valid noun phrase other than PN
    and DP
  • 3 or more words
  • May contain PN/DP/SNP
  • E.g. small white car, city public
    transportation

7
Noun Phrase Recognition
  • General procedure
  • Recognize PN and dictionary phrases first
  • Then simple and complex noun phrases
  • An n-word query
  • Check the original query
  • Check the two (n-1)-term sub-sequences
  • Continue down to the (n-1) 2-term sub-sequences
  • n(n-1)/2 candidates in total
  • E.g. World Trade Organization
  • World Trade and Trade Organization
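
The candidate enumeration on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name candidate_phrases is hypothetical, and the query is assumed to be split on whitespace.

```python
def candidate_phrases(query):
    """Return every contiguous sub-sequence of two or more words, longest first."""
    words = query.split()
    n = len(words)
    candidates = []
    for length in range(n, 1, -1):           # lengths n, n-1, ..., 2
        for start in range(n - length + 1):  # every start position for this length
            candidates.append(" ".join(words[start:start + length]))
    return candidates


print(candidate_phrases("World Trade Organization"))
# ['World Trade Organization', 'World Trade', 'Trade Organization']
```

For n = 3 this yields 3 = 3(3-1)/2 candidates, matching the count on the slide.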

8
Noun Phrase Recognition
  • Tools for phrase recognition
  • Dictionaries (Wikipedia, WordNet)
  • Large text corpus (Google for experiments)
  • Parsers (Minipar, Collins parser) and POS tagger

9
PN and DP Recognition
  • Wikipedia
  • For proper nouns and dictionary phrases
  • DP: an entry page exists for the phrase
  • PN: the content words in the first instance of the
    phrase in the main text should be capitalized
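
A rough sketch of the capitalization test for proper nouns, assuming the main text of the entry page is already available as a string. The function name and the small FUNCTION_WORDS list are hypothetical; the slides do not say how content words were identified.

```python
import re

# Assumption: a tiny list of function words that are exempt from the capitalization test.
FUNCTION_WORDS = {"of", "the", "and", "in", "for", "on", "at"}


def first_instance_is_capitalized(phrase, main_text):
    """Check whether the content words of the first occurrence of `phrase` are capitalized."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    match = re.search(pattern, main_text, flags=re.IGNORECASE)
    if match is None:
        return False
    words = match.group(0).split()
    content = [w for w in words if w.lower() not in FUNCTION_WORDS]
    return bool(content) and all(w[0].isupper() for w in content)


print(first_instance_is_capitalized("atlantic ocean", "The Atlantic Ocean is an ocean."))
print(first_instance_is_capitalized("computer science", "Computer science is a field."))
```

In this sketch the first call returns True and the second returns False, because "science" is lower case in its first instance.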

10
PN and DP Recognition
  • WordNet
  • For PN and DP recognition
  • DP: defined in the dictionary
  • PN: has a hypernym of city, province, country,
    organization, geographic area, person, syndrome,
    region, building, or nation

11
PN and DP Recognition
  • Minipar
  • For PN recognition only
  • PN label in the parse tree
  • Semantic label of person, country, corpname,
    location, corpdesig, fname, gname, or date

12
PN and DP Recognition
  • List of first names, last names and rules
  • First_initial last_name
  • First_initial mid_initial last_name
  • First_name middle_initial last_name
  • First_name last_name
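
The four name rules can be checked mechanically once name lists are available. The sketch below uses tiny, hypothetical name lists (FIRST_NAMES, LAST_NAMES) and hypothetical helper names; the slides do not say which lists were used.

```python
import re

# Hypothetical, tiny name lists for illustration only.
FIRST_NAMES = {"john", "mary"}
LAST_NAMES = {"smith", "jones"}


def is_initial(token):
    """True for a single letter with an optional trailing period, e.g. 'J' or 'J.'."""
    return re.fullmatch(r"[A-Za-z]\.?", token) is not None


def is_first(token):
    return token.lower() in FIRST_NAMES


def is_last(token):
    return token.lower() in LAST_NAMES


def matches_name_rule(phrase):
    """Apply the four rules from the slide to a whitespace-separated phrase."""
    t = phrase.split()
    if len(t) == 2:
        # First_initial last_name, or First_name last_name
        return (is_initial(t[0]) or is_first(t[0])) and is_last(t[1])
    if len(t) == 3:
        # First_initial mid_initial last_name, or First_name middle_initial last_name
        return (is_initial(t[0]) or is_first(t[0])) and is_initial(t[1]) and is_last(t[2])
    return False


for candidate in ["John Smith", "J. Smith", "J. K. Smith", "John K Smith", "white car"]:
    print(candidate, matches_name_rule(candidate))
```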

13
PN and DP Recognition
  • Text corpus
  • For less well-known PNs
  • Three instances, first letters of the content
    words capitalized
  • Not a sub-phrase of a longer PN
  • if you choose windows by Vista Window Company,
  • if you choose windows by Super Vista Window
    Company,
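
One way to read the corpus test is: count capitalized occurrences of the candidate that are not flanked by another capitalized word, and accept the candidate once three such instances are found. The sketch below follows that reading. The function names, the "at least three" threshold, and the simplification that every word of the candidate is a content word are assumptions, not details from the slides.

```python
import re

MIN_INSTANCES = 3  # assumption: "three instances" is read as "at least three"


def count_standalone_capitalized(phrase, texts):
    """Count capitalized occurrences of `phrase` that are not part of a longer capitalized run."""
    target = [w.capitalize() for w in phrase.split()]
    k = len(target)
    count = 0
    for text in texts:
        tokens = re.findall(r"[A-Za-z]+|[.,;!?]", text)
        for i in range(len(tokens) - k + 1):
            if tokens[i:i + k] != target:
                continue
            prev_tok = tokens[i - 1] if i > 0 else ""
            next_tok = tokens[i + k] if i + k < len(tokens) else ""
            # A capitalized neighbor suggests a longer PN, e.g. "Super Vista Window Company".
            if not prev_tok[:1].isupper() and not next_tok[:1].isupper():
                count += 1
    return count


def is_corpus_pn(phrase, texts):
    return count_standalone_capitalized(phrase, texts) >= MIN_INSTANCES


texts = [
    "if you choose windows by Vista Window Company, call us",
    "if you choose windows by Super Vista Window Company, call us",
]
print(count_standalone_capitalized("vista window company", texts))  # 1
```

Only the first text counts, because in the second one the candidate is preceded by the capitalized word "Super".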

14
PN and DP Recognition
  • Overlapped phrases
  • Search all words together
  • Count the instances of each phrase in the
    returned documents
  • e.g. Native American Casino
  • Native American and American Casino
  • Compare ( Count(Native American),
    Count(American Casino) )
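
The comparison can be sketched as below. The documents list stands in for the documents returned by the search (the slides use a Web search engine); the function names and the tie-breaking choice (prefer the first phrase) are assumptions.

```python
def count_phrase(phrase, documents):
    """Count case-insensitive occurrences of `phrase` as a contiguous word sequence."""
    target = phrase.lower().split()
    k = len(target)
    total = 0
    for doc in documents:
        tokens = doc.lower().split()
        total += sum(tokens[i:i + k] == target for i in range(len(tokens) - k + 1))
    return total


def resolve_overlap(phrase_a, phrase_b, documents):
    """Keep the overlapping phrase with more instances; ties go to phrase_a (assumption)."""
    a = count_phrase(phrase_a, documents)
    b = count_phrase(phrase_b, documents)
    return phrase_a if a >= b else phrase_b


# Hypothetical stand-in for the documents returned for "Native American Casino".
documents = [
    "Native American casino opens",
    "a Native American tribe runs it",
    "Native American history",
]
print(resolve_overlap("Native American", "American Casino", documents))  # Native American
```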

15
SNP and CNP Recognition
  • Only check the phrase candidates that
  • are not sub-phrases of a recognized PN/DP
  • do not overlap with a recognized PN/DP

16
SNP and CNP Recognition
  • Implicit phrases
  • and / or
  • main and contributing factor →
  • main factor
  • contributing factor
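
The expansion of an implicit phrase can be illustrated with a small sketch. It assumes the pattern "modifier and/or modifier ... head", where the last word is the shared head; the function name is hypothetical and real queries would need a parser to find the head.

```python
def expand_implicit(phrase):
    """Split 'A and/or B head' into ['A head', 'B head']; return [phrase] if no such pattern."""
    words = phrase.split()
    for i, w in enumerate(words):
        # The conjunction needs a modifier before it and a modifier plus head after it.
        if w.lower() in ("and", "or") and 0 < i < len(words) - 2:
            head = words[-1]
            left = " ".join(words[:i] + [head])
            right = " ".join(words[i + 1:])
            return [left, right]
    return [phrase]


print(expand_implicit("main and contributing factor"))
# ['main factor', 'contributing factor']
```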

17
SNP and CNP Recognition
  • Head word replacement
  • Replace the whole phrase by its head word
  • Collins parser
  • Label the noun phrases

[Figure: Collins parse tree with two nested NP nodes, each with head word "sedan", over the leaves Compact/JJ, Best/JJS, Sedan/NN]
18
SNP and CNP Recognition
  • Phrase verification
  • Verify that a phrase is actually used in real-world text
  • For a CNP, this also means finding all of its words
    within a text window
  • E.g. "Colin Farrell wallpaper" and "wallpaper of Colin
    Farrell"

19
SNP and CNP Recognition
  • Overlapped phrases
  • Two potential SNPs/CNPs: search all words together and
    compare the numbers of instances
  • E.g. sony dvd handycam → sony dvd and dvd
    handycam

20
Document Retrieval Using Phrases
  • Search a phrase in a document
  • Exact match: PN/DP
  • Search all words in a text window: SNP/CNP
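
The two matching modes can be sketched as follows. The window size of 5 tokens and all function names are assumptions; the slides do not give the window size used.

```python
import re


def tokens_of(text):
    """Lower-case word tokens of a text."""
    return re.findall(r"\w+", text.lower())


def exact_match(phrase, document):
    """PN/DP: the phrase must appear as a contiguous word sequence."""
    p, d = tokens_of(phrase), tokens_of(document)
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))


def window_match(phrase, document, window=5):
    """SNP/CNP: all words of the phrase must appear, in any order, within one text window."""
    words = set(tokens_of(phrase))
    d = tokens_of(document)
    return any(
        words <= set(d[i:i + window]) for i in range(max(1, len(d) - window + 1))
    )


doc = "wallpaper of Colin Farrell"
print(exact_match("Colin Farrell", doc))             # True
print(exact_match("Colin Farrell wallpaper", doc))   # False
print(window_match("Colin Farrell wallpaper", doc))  # True
```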

21
Document Retrieval Using Phrases
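  • $\mathrm{Sim}(\mathrm{Query}, \mathrm{Doc}) = \langle \mathrm{Sim}_P, \mathrm{Sim}_T \rangle$
  • Phrase similarity
  • $\mathrm{Sim}_P(P_i) = \mathrm{idf}(P_i)$
  • $\mathrm{Sim}_P = \sum_i \mathrm{Sim}_P(P_i)$
  • Term similarity
  • Okapi/BM-25 similarity
  • Document ranking
  • D1 is ranked higher than D2 if
  • $(\mathrm{Sim}_{P1} > \mathrm{Sim}_{P2})$ OR $(\mathrm{Sim}_{P1} = \mathrm{Sim}_{P2}$ AND $\mathrm{Sim}_{T1} > \mathrm{Sim}_{T2})$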
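
The ranking rule is a lexicographic comparison: phrase similarity decides first, and term similarity only breaks ties. A minimal sketch, assuming the phrase idf values and the Okapi/BM-25 term scores are computed elsewhere; the function names, idf values, and scores below are made up for illustration.

```python
def phrase_similarity(matched_phrases, idf):
    """Sum the idf of every query phrase found in the document."""
    return sum(idf[p] for p in matched_phrases)


def rank(documents):
    """documents: list of (doc_id, sim_p, sim_t); sort by sim_p, then sim_t, descending."""
    return sorted(documents, key=lambda d: (d[1], d[2]), reverse=True)


# Hypothetical idf values and term-similarity scores.
idf = {"native american": 2.0, "american casino": 1.5}
docs = [
    ("D1", phrase_similarity(["native american"], idf), 0.3),
    ("D2", phrase_similarity(["native american"], idf), 0.9),
    ("D3", phrase_similarity(["native american", "american casino"], idf), 0.1),
]
print([doc_id for doc_id, _, _ in rank(docs)])  # ['D3', 'D2', 'D1']
```

D3 ranks first despite its low term similarity because its phrase similarity is higher; D2 beats D1 on term similarity since their phrase similarities are equal.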

22
Experimental Results
  • Phrase recognition experiments
  • Tuned by using TREC queries

23
Experimental Results
  • Phrase recognition experiments
  • Tested by using Web queries

24
Experimental Results
  • Performance of individual tools
  • Wikipedia is better than WordNet and Minipar
  • Need for a complete dictionary
  • Collins parser alone is not enough for SNP/CNP
    recognition
  • Lack of real world usage information

25
Experimental Results
  • Document retrieval experiments
  • Ad-hoc TREC 6, 7 and 8, robust TREC 12, 13 and 14
  • 1. Retrieval without using phrases
  • 2. Using Wikipedia for PN/DP and only the Collins parser
    for SNP/CNP
  • 3. Using phrases from the full recognition algorithm
  • 33% MAP increase and 44.27% GMAP increase from 1
    to 2
  • 5.8% MAP increase and 12.58% GMAP increase from 2
    to 3

26
Conclusions
  • Our algorithm can effectively recognize the four
    types of phrases in short Web queries
  • The recognized phrases help improve the retrieval
    effectiveness

27
Questions?
  • wzhang_at_cs.uic.edu
  • http://www.cs.uic.edu/wzhang/