Title: Recognition and Classification of Noun Phrases in Queries for Effective Retrieval
1Recognition and Classification of Noun Phrases in Queries for Effective Retrieval
Wei Zhang1, Shuang Liu2, Clement Yu1, Chaojing Sun3, Fang Liu4, Weiyi Meng5
wzhang_at_cs.uic.edu, shuang.liu_at_ask.com, yu_at_cs.uic.edu, chaojing_at_gmail.com, fangliu_at_microsoft.com, meng_at_cs.binghamton.edu
1 Department of Computer Science, University of Illinois at Chicago
2 Ask.com
3 Broadcom Corporation
4 Microsoft
5 Department of Computer Science, Binghamton University
CIKM 2007
2Outline
- Motivation
- Our definitions of the phrases
- Proper noun and dictionary phrase recognition
- Simple and complex phrase recognition
- Experimental results
3Motivation
- Terms in a query are related semantically
- E.g. John Smith
- Recognize this relationship
- Partition the query terms into groups (phrases)
- Document retrieval using phrases
- Adding phrases into searching and ranking
4Types of Noun Phrases
- Phrases that have fixed writing formats
- Names of locations, people, companies, ...
- Well-defined concepts, e.g. computer science
- Freely written phrases
- Not formally defined but used in real language
5Four Types of Noun Phrases
- Proper Noun (PN)
- A noun phrase that names a specific person, place or thing
- First letters of the content words are capitalized
- E.g. John Smith, Atlantic Ocean
- Dictionary Phrase (DP)
- A phrase that has a definition in a dictionary, excluding PN
- These two types may overlap
- E.g. Atlantic Ocean
- They cannot replace each other
- E.g. Linas Pizza, public transportation
6Four Types of Noun Phrases
- Simple Noun Phrase (SNP)
- A grammatically valid noun phrase other than PN and DP
- 2 words
- E.g. white car, good hotel
- Complex Noun Phrase (CNP)
- A grammatically valid noun phrase other than PN and DP
- 3 or more words
- May contain PN/DP/SNP
- E.g. small white car, city public transportation
7Noun Phrase Recognition
- General procedure
- Recognize PN and dictionary phrases first
- Then simple and complex noun phrases
- An n-word query
- Check the original query
- Check the 2 (n-1)-term arrays
- ...
- Check the (n-1) 2-term arrays
- In total, n(n-1)/2 candidates
- E.g. World Trade Organization
- World Trade and Trade Organization
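The candidate enumeration above can be sketched as follows. This is a minimal illustration, assuming a query is split on whitespace and that every contiguous sub-sequence of two or more words is a candidate; the function name `candidate_phrases` is hypothetical and not from the original slides.

```python
def candidate_phrases(query):
    """Return all contiguous word sequences of length >= 2, longest first."""
    words = query.split()
    n = len(words)
    candidates = []
    # Start with the full query (length n), then shrink down to 2-word arrays.
    for length in range(n, 1, -1):
        for start in range(n - length + 1):
            candidates.append(" ".join(words[start:start + length]))
    return candidates


cands = candidate_phrases("World Trade Organization")
for phrase in cands:
    print(phrase)
```

For an n-word query the loop produces 1 + 2 + ... + (n-1) = n(n-1)/2 candidates, which matches the count on the slide.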
8Noun Phrase Recognition
- Tools for phrase recognition
- Dictionaries (Wikipedia, WordNet)
- Large text corpus (Google for experiments)
- Parsers (Minipar, Collins parser) and POS tagger
9PN and DP Recognition
- Wikipedia
- For proper nouns and dictionary phrases
- DP: existence of the entry page
- PN: content words in the first instance of the phrase in the main text should be capitalized
10PN and DP Recognition
- WordNet
- For PN and DP recognition
- DP: defined in the dictionary
- PN: has a hypernym of city, province, country, organization, geographic area, person, syndrome, region, building, or nation
11PN and DP Recognition
- Minipar
- For PN recognition only
- PN label in the parse tree
- Semantic label of person, country, corpname,
location, corpdesig, fname, gname, or date
12PN and DP Recognition
- List of first names, last names and rules
- First_initial last_name
- First_initial mid_initial last_name
- First_name middle_initial last_name
- First_name last_name
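The four name rules above could be checked roughly as shown here. This is only a sketch: the name lists `FIRST_NAMES` and `LAST_NAMES` are tiny hypothetical stand-ins for the real lists, and treating a single letter with an optional period as an initial is an assumption.

```python
import re

# Hypothetical, tiny name lists used only for illustration.
FIRST_NAMES = {"john", "mary"}
LAST_NAMES = {"smith", "jones"}


def _is_initial(token):
    """Assumed format of an initial: one letter, optionally followed by a period."""
    return re.fullmatch(r"[A-Za-z]\.?", token) is not None


def is_person_name(phrase):
    tokens = phrase.split()
    if len(tokens) == 2:
        first, middle, last = tokens[0], None, tokens[1]
    elif len(tokens) == 3:
        first, middle, last = tokens
    else:
        return False
    if last.lower() not in LAST_NAMES:
        return False
    first_ok = _is_initial(first) or first.lower() in FIRST_NAMES
    if middle is None:
        # Rules: First_initial last_name, First_name last_name
        return first_ok
    # Rules: First_initial mid_initial last_name, First_name middle_initial last_name
    return first_ok and _is_initial(middle)


for candidate in ["John Smith", "J. Smith", "J. K. Smith", "John K. Smith", "white car"]:
    print(candidate, is_person_name(candidate))
```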
13PN and DP Recognition
- Text corpus
- For less well-known PNs
- Three instances, first letters of the content words capitalized
- Not a sub-phrase of a longer PN
- if you choose windows by Vista Window Company,
- if you choose windows by Super Vista Window Company,
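A rough sketch of the corpus check above, under stated assumptions: the corpus is a plain list of text snippets, a small hypothetical `STOPWORDS` set marks non-content words, and a capitalized word directly before or after a match is taken as a sign that the match is part of a longer PN. That neighbor test is a simplification introduced here, not something the slides specify.

```python
STOPWORDS = {"of", "the", "and", "in", "for"}  # assumed non-content words


def _capitalized(word):
    return word[:1].isupper()


def count_pn_instances(phrase, texts):
    """Count capitalized instances that do not sit inside a longer capitalized run."""
    target = phrase.lower().split()
    k = len(target)
    count = 0
    for text in texts:
        words = [w.strip(".,;:!?\"'()") for w in text.split()]
        words = [w for w in words if w]
        for i in range(len(words) - k + 1):
            window = words[i:i + k]
            if [w.lower() for w in window] != target:
                continue
            # First letters of the content words must be capitalized.
            if not all(_capitalized(w) for w in window if w.lower() not in STOPWORDS):
                continue
            before = words[i - 1] if i > 0 else ""
            after = words[i + k] if i + k < len(words) else ""
            # Assumption: a capitalized neighbor means a longer proper noun.
            if _capitalized(before) or _capitalized(after):
                continue
            count += 1
    return count


def is_corpus_pn(phrase, texts, min_instances=3):
    return count_pn_instances(phrase, texts) >= min_instances
```

With snippets such as "if you choose windows by Vista Window Company," the phrase "Vista Window" is rejected because "Company" follows it, while "Vista Window Company" is counted.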
14PN and DP Recognition
- Overlapped phrases
- Search all words together
- Count the instances of each phrase in the returned documents
- E.g. Native American Casino
- Native American and American Casino
- Compare( Count(Native American), Count(American Casino) )
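The comparison above might look like this in code. It is a sketch only: the documents are hypothetical strings standing in for the results returned when all query words are searched together, and breaking ties in favor of the first phrase is an assumption, since the slides do not say how ties are handled.

```python
import re


def count_instances(phrase, documents):
    """Count case-insensitive, whole-word occurrences of the phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase.lower()) + r"\b")
    return sum(len(pattern.findall(doc.lower())) for doc in documents)


def resolve_overlap(phrase_a, phrase_b, documents):
    """Keep the overlapping phrase that occurs more often (ties go to phrase_a)."""
    count_a = count_instances(phrase_a, documents)
    count_b = count_instances(phrase_b, documents)
    return phrase_a if count_a >= count_b else phrase_b


# Hypothetical returned documents for the query "Native American Casino".
docs = [
    "Native American tribes operate the casino.",
    "A Native American casino opened last year.",
    "Native American history and culture.",
]
print(resolve_overlap("Native American", "American Casino", docs))
```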
15SNP and CNP Recognition
- Only check the phrase candidates that
- are not sub-phrases of a recognized PN/DP
- do not overlap with a recognized PN/DP
16SNP and CNP Recognition
- Implicit phrases
- and / or
- main and contributing factor →
- main factor
- contributing factor
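The expansion of an implicit phrase can be illustrated with a deliberately simple sketch. It assumes the shared head is the last word of the phrase and that everything on either side of "and" / "or" is a modifier of that head; a real system would presumably rely on a parser to confirm this. The function name `expand_coordination` is hypothetical.

```python
def expand_coordination(phrase):
    """Split 'A and/or B head' into 'A head' and 'B head' (simplified)."""
    words = phrase.split()
    for i, word in enumerate(words):
        # The conjunction needs a modifier on each side plus a final head word.
        if word.lower() in ("and", "or") and 0 < i < len(words) - 2:
            head = words[-1]
            left = words[:i]
            right = words[i + 1:-1]
            return [" ".join(left + [head]), " ".join(right + [head])]
    return [phrase]


print(expand_coordination("main and contributing factor"))
```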
17SNP and CNP Recognition
- Head word replacement
- Replace the whole phrase by its head word
- Collins parser
- Label the noun phrases
- Parse tree (figure): NP/sedan (head word) nodes over the leaves Best/JJS, Compact/JJ, Sedan/NN
18SNP and CNP Recognition
- Phrase verification
- To verify that a phrase is used in the real world
- For CNP it also means to find all the words in a text window
- E.g. Colin Farrell wallpaper and wallpaper of Colin Farrell
19SNP and CNP Recognition
- Overlapped phrases
- Two potential SNP/CNP: search all words, compare the numbers of the instances
- E.g. sony dvd handycam → sony dvd and dvd handycam
20Document Retrieval Using Phrases
- Search a phrase in a document
- Exact match: PN/DP
- Search all words in a text window: SNP/CNP
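The two matching modes above can be sketched as follows. The window size of 5 words is an assumed value (the slides do not give one), and the tokenization by whitespace with punctuation stripping is likewise a simplification.

```python
def exact_match(phrase, doc_words):
    """PN/DP: the phrase words must appear adjacent and in order."""
    target = phrase.lower().split()
    k = len(target)
    return any(doc_words[i:i + k] == target for i in range(len(doc_words) - k + 1))


def window_match(phrase, doc_words, window=5):
    """SNP/CNP: all phrase words must appear, in any order, within one text window."""
    needed = set(phrase.lower().split())
    return any(needed <= set(doc_words[i:i + window]) for i in range(len(doc_words)))


def phrase_in_document(phrase, phrase_type, document, window=5):
    doc_words = [w.strip(".,;:!?\"'()").lower() for w in document.split()]
    if phrase_type in ("PN", "DP"):
        return exact_match(phrase, doc_words)
    return window_match(phrase, doc_words, window)


print(phrase_in_document("Colin Farrell wallpaper", "CNP",
                         "Download a wallpaper of Colin Farrell here"))
print(phrase_in_document("Atlantic Ocean", "PN",
                         "The ocean near Atlantic City"))
```

The window match lets "wallpaper of Colin Farrell" satisfy the phrase "Colin Farrell wallpaper", while the exact match refuses a reordered proper noun.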
21Document Retrieval Using Phrases
- Sim(Query, Doc) = <Sim_P, Sim_T>
- Phrase similarity
- Sim_P(P_i) = idf(P_i)
- Sim_P = sum( Sim_P(P_i) )
- Term similarity
- Okapi/BM-25 similarity
- Document ranking
- D1 is ranked higher than D2, if
- (Sim_P1 > Sim_P2) OR (Sim_P1 = Sim_P2 AND Sim_T1 > Sim_T2)
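The ranking rule above is a lexicographic comparison of the pair (phrase similarity, term similarity), which can be sketched with Python tuples. The idf values and term scores below are made-up numbers for illustration only, and the term similarity is assumed to be computed elsewhere (for example by an Okapi/BM25 implementation).

```python
def phrase_similarity(matched_phrases, idf):
    """Sim_P: sum of the idf of each query phrase found in the document."""
    return sum(idf[p] for p in matched_phrases)


def rank_documents(scores):
    """scores maps doc id -> (sim_p, sim_t); phrase similarity dominates,
    term similarity only breaks ties."""
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical idf values and term-similarity scores.
idf = {"native american": 2.0, "american casino": 1.5}
scores = {
    "D1": (phrase_similarity(["native american"], idf), 1.5),
    "D2": (phrase_similarity(["native american"], idf), 3.1),
    "D3": (phrase_similarity(["native american", "american casino"], idf), 0.2),
}
print(rank_documents(scores))
```

Here D3 comes first because it matches more phrase weight, and D2 precedes D1 because their phrase similarities are equal and D2 has the higher term similarity.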
22Experimental Results
- Phrase recognition experiments
- Tuned by using TREC queries
23Experimental Results
- Phrase recognition experiments
- Tested by using Web queries
24Experimental Results
- Performance of individual tools
- Wikipedia is better than WordNet and Minipar
- Need for a complete dictionary
- Collins parser alone is not enough for SNP/CNP recognition
- Lack of real world usage information
25Experimental Results
- Document retrieval experiments
- Ad-hoc TREC 6, 7 and 8, robust TREC 12, 13 and 14
- (1) Retrieval without using phrases
- (2) Using Wikipedia for PN/DP and just Collins parser for SNP/CNP
- (3) Using phrases from the full recognition algorithm
- 33% MAP increase and 44.27% GMAP increase from 1 to 2
- 5.8% MAP increase and 12.58% GMAP increase from 2 to 3
26Conclusions
- Our algorithm can effectively recognize the four types of phrases in short Web queries
- The recognized phrases help improve the retrieval effectiveness
27Questions?
- wzhang_at_cs.uic.edu
- http://www.cs.uic.edu/wzhang/