Transcript and Presenter's Notes

Title: IR and NLP


1
IR and NLP
  • Jimmy Lin
  • College of Information Studies
  • Institute for Advanced Computer Studies
  • University of Maryland
  • Wednesday, March 15, 2006

2
On the Menu
  • Overview of information retrieval
  • Evaluation
  • Three IR models
  • Boolean
  • Vector space
  • Language modeling
  • NLP for IR

3
Types of Information Needs
  • Retrospective
  • Searching the past
  • Different queries posed against a static
    collection
  • Time invariant
  • Prospective
  • Searching the future
  • Static query posed against a dynamic collection
  • Time dependent

4
Retrospective Searches (I)
  • Ad hoc retrieval: find documents about this...
  • Known item search
  • Directed exploration

Identify positive accomplishments of the Hubble
telescope since it was launched in 1991. Compile
a list of mammals that are considered to be
endangered, identify their habitat and, if
possible, specify what threatens them.
Find Jimmy Lin's homepage. What's the ISBN
of Modern Information Retrieval?
Who makes the best chocolates? What video
conferencing systems exist for digital reference
desk services?
5
Retrospective Searches (II)
  • Question answering

6
Prospective Searches
  • Filtering
  • Make a binary decision about each incoming
    document
  • Routing
  • Sort incoming documents into different bins?

Spam or not spam?
Categorize news headlines: World? Nation? Metro?
Sports?
7
The Information Retrieval Cycle
Source Selection → Query Formulation → Search → Selection → Examination → Delivery (and back to Source Selection)
8
Supporting the Search Process
[Figure: the retrieval cycle (Source Selection yields a Resource; Query Formulation yields a Query; Search yields a Ranked List; Selection yields Documents; Examination; Delivery), supported by an indexing pipeline in which Acquisition builds the Collection and Indexing builds the Index used by Search.]
9
Evaluation
10
IR is an experimental science!
  • Formulate a research question: the hypothesis
  • Questions about the system
  • Questions about the system user
  • Design an experiment to answer the question
  • Perform the experiment
  • Compare with a baseline
  • Does the experiment answer the question?
  • Are the results significant? Or is it just luck?
  • Report the results!
  • Rinse, repeat

11
The Importance of Evaluation
  • The ability to measure differences underlies
    experimental science
  • How well do our systems work?
  • Is A better than B?
  • Is it really?
  • Under what conditions?
  • Evaluation drives what to research
  • Identify techniques that work and don't work
  • Build on techniques that work

12
Evaluating the Black Box
Search
13
Automatic Evaluation Model
[Figure: a Query and a collection of Documents are fed to the IR Black Box, which produces a Ranked List; an Evaluation Module compares the list against Relevance Judgments to produce a Measure of Effectiveness.]
14
Test Collections
  • Reusable test collections consist of
  • Collection of documents
  • Should be representative
  • Things to consider: size, sources, genre, topics, ...
  • Sample of information needs
  • Should be randomized and representative
  • Usually formalized topic statements
  • Known relevance judgments
  • Assessed by humans, for each topic-document pair
    (topic, not query!)
  • Binary judgments make evaluation easier
  • Measure of effectiveness
  • Usually a numeric score for quantifying
    performance
  • Used to compare different systems

15
Which is the Best Rank Order?
[Figure: six candidate ranked lists (A-F), with relevant and non-relevant documents appearing at different positions.]
16
Set-Based Measures
  • Precision = A / (A + B)
  • Recall = A / (A + C)
  • Miss = C / (A + C)
  • False alarm (fallout) = B / (B + D)

                Relevant   Not relevant
Retrieved          A            B
Not retrieved      C            D

Collection size = A + B + C + D
Relevant = A + C
Retrieved = A + B
When is precision important? When is recall
important?
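For concreteness, here is a minimal Python sketch of the set-based measures above; the retrieved and relevant document IDs are invented for illustration.

```python
def set_measures(retrieved, relevant):
    """Set-based effectiveness measures over sets of document IDs."""
    a = len(retrieved & relevant)                          # A: relevant and retrieved
    precision = a / len(retrieved) if retrieved else 0.0   # A / (A + B)
    recall = a / len(relevant) if relevant else 0.0        # A / (A + C)
    miss = 1.0 - recall                                    # C / (A + C)
    return precision, recall, miss

# Hypothetical run: docs 1-10 retrieved, docs {2, 3, 11, 12} judged relevant
print(set_measures(set(range(1, 11)), {2, 3, 11, 12}))  # (0.2, 0.5, 0.5)
```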
17
Another View
[Figure: Venn diagram over the space of all documents, showing the Relevant set, the Retrieved set, their intersection (relevant and retrieved), and the remainder (neither relevant nor retrieved).]
18
ROC Curves
Adapted from a presentation by Ellen Voorhees at
the University of Maryland, March 29, 1999
19
Building Test Collections
  • Where do test collections come from?
  • Someone goes out and builds them (expensive)
  • As the byproduct of large scale evaluations
  • TREC: Text REtrieval Conference
  • Sponsored by NIST
  • Series of annual evaluations, started in 1992
  • Organized into tracks
  • Larger tracks may draw a few dozen participants

See proceedings online at http://trec.nist.gov/
20
Ad Hoc Topics
  • In TREC, a statement of information need is
    called a topic

Title: Health and Computer Terminals
Description: Is it hazardous to the health of
individuals to work with computer terminals on a
daily basis?
Narrative: Relevant documents would contain any
information that expands on any physical
disorder/problems that may be associated with the
daily working with computer terminals. Such things
as carpal tunnel, cataracts, and fatigue have been
said to be associated, but how widespread are these
or other problems and what is being done to
alleviate any health problems.
21
Obtaining Judgments
  • Exhaustive assessment is usually impractical
  • TREC has 50 queries
  • Collection has over 1 million documents
  • Random sampling won't work
  • If relevant docs are rare, none may be found!
  • IR systems can help focus the sample
  • Each system finds some relevant documents
  • Different systems find different relevant
    documents
  • Together, enough systems will find most of them
  • Leverages cooperative evaluations

22
Pooling Methodology
  • Systems submit top 1000 documents per topic
  • Top 100 documents from each are judged
  • Single pool, duplicates removed, arbitrary order
  • Judged by the person who developed the topic
  • Treat unevaluated documents as not relevant
  • Evaluate down to 1000 documents
  • To make pooling work
  • Systems must do reasonably well
  • Systems must not all do the same thing
  • Gather topics and relevance judgments to create a
    reusable test collection

23
Retrieval Models
  • Boolean
  • Vector space
  • Language Modeling

24
What is a model?
  • A model is a construct designed to help us
    understand a complex system
  • A particular way of looking at things
  • Models inevitably make simplifying assumptions
  • What are the limitations of the model?
  • Different types of models
  • Conceptual models
  • Physical analog models
  • Mathematical models

25
The Central Problem in IR
[Figure: the information seeker's concepts are expressed as query terms; the authors' concepts are expressed as document terms. Do these represent the same concepts?]
26
The IR Black Box
[Figure: the query and the documents each pass through a representation function; the comparison function matches the query representation against the document representations, stored in an index, to produce hits.]
27
How do we represent text?
  • How do we represent the complexities of language?
  • Keeping in mind that computers don't understand
    documents or queries
  • Simple, yet effective approach: bag of words
  • Treat unique words as independent features of the
    document

28
Sample Document
  • McDonald's slims down spuds
  • Fast-food chain to reduce certain types of fat in
    its french fries with new cooking oil.
  • NEW YORK (CNN/Money) - McDonald's Corp. is
    cutting the amount of "bad" fat in its french
    fries nearly in half, the fast-food chain said
    Tuesday as it moves to make all its fried menu
    items healthier.
  • But does that mean the popular shoestring fries
    won't taste the same? The company says no. "It's
    a win-win for our customers because they are
    getting the same great french-fry taste along
    with an even healthier nutrition profile," said
    Mike Roberts, president of McDonald's USA.
  • But others are not so sure. McDonald's will not
    specifically discuss the kind of oil it plans to
    use, but at least one nutrition expert says
    playing with the formula could mean a different
    taste.
  • Shares of Oak Brook, Ill.-based McDonald's (MCD:
    down $0.54 to $23.22, Research, Estimates) were
    lower Tuesday afternoon. It was unclear Tuesday
    whether competitors Burger King and Wendy's
    International (WEN: down $0.80 to $34.91,
    Research, Estimates) would follow suit. Neither
    company could immediately be reached for comment.
  • 16 × said
  • 14 × McDonald's
  • 12 × fat
  • 11 × fries
  • 8 × new
  • 6 × company, french, nutrition
  • 5 × food, oil, percent, reduce, taste, Tuesday

Bag of Words
29
What's the point?
  • Retrieving relevant information is hard!
  • Evolving, ambiguous user needs, context, etc.
  • Complexities of language
  • To operationalize information retrieval, we must
    vastly simplify the picture
  • Bag-of-words approach
  • Information retrieval is all (and only) about
    matching words in documents with words in queries
  • Obviously, not true
  • But it works pretty well!

30
Representing Documents
Document 1: The quick brown fox jumped over the lazy dog's back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword list: for, is, of, the, to
[Figure: term-document table indicating which non-stopword terms appear in Document 1 and Document 2.]
31
Boolean Retrieval
  • Weights assigned to terms are either 0 or 1
  • 0 represents absence: the term isn't in the
    document
  • 1 represents presence: the term is in the
    document
  • Build queries by combining terms with Boolean
    operators
  • AND, OR, NOT
  • The system returns all documents that satisfy the
    query

32
Boolean View of a Collection
Each column represents the view of a particular
document: What terms are contained in this document?
Each row represents the view of a particular
term: What documents contain this term?
To execute a query, pick out the rows corresponding
to the query terms and then apply the logic table of
the corresponding Boolean operator.
33
Sample Queries
dog AND fox → Doc 3, Doc 5
dog OR fox → Doc 3, Doc 5, Doc 7
dog NOT fox → empty
fox NOT dog → Doc 7
good AND party → Doc 6, Doc 8
good AND party NOT over → Doc 6

Term                   Doc1  Doc2  Doc3  Doc4  Doc5  Doc6  Doc7  Doc8
good                    0     1     0     1     0     1     0     1
party                   0     0     0     0     0     1     0     1
good ∧ party            0     0     0     0     0     1     0     1
over                    1     0     1     0     1     0     1     1
good ∧ party ∧ ¬over    0     0     0     0     0     1     0     0
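A minimal sketch of Boolean retrieval over the toy incidence matrix above, using Python sets as postings lists (the document numbering follows the table):

```python
# Postings taken from the incidence matrix above: term -> docs containing it
postings = {
    "good":  {2, 4, 6, 8},
    "party": {6, 8},
    "over":  {1, 3, 5, 7, 8},
}
all_docs = set(range(1, 9))

def b_and(p, q): return p & q
def b_or(p, q):  return p | q
def b_not(p):    return all_docs - p

print(b_and(postings["good"], postings["party"]))     # good AND party -> {6, 8}
print(b_and(b_and(postings["good"], postings["party"]),
            b_not(postings["over"])))                 # good AND party NOT over -> {6}
```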
34
Proximity Operators
  • More precise versions of AND
  • NEAR n allows at most n-1 intervening terms
  • WITH requires terms to be adjacent and in order
  • Other extensions: within n sentences, within n
    paragraphs, etc.
  • Relatively easy to implement, but less efficient
  • Store position information for each word in the
    document vectors
  • Perform normal Boolean computations, but treat
    WITH and NEAR as extra constraints
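A rough sketch of how a NEAR/n constraint can be evaluated against a positional index, as described above; the positions and documents are invented for illustration.

```python
# Positional index: term -> {doc_id: sorted list of word positions}
index = {
    "good":  {6: [3, 17]},
    "party": {6: [5, 40], 8: [12]},
}

def near(t1, t2, n):
    """Documents where t1 and t2 occur within n positions (at most n-1 intervening terms)."""
    hits = set()
    for doc in index[t1].keys() & index[t2].keys():
        if any(0 < abs(p1 - p2) <= n
               for p1 in index[t1][doc] for p2 in index[t2][doc]):
            hits.add(doc)
    return hits

print(near("good", "party", 2))  # {6}: positions 3 and 5 are two apart
```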

35
Why Boolean Retrieval Works
  • Boolean operators approximate natural language
  • Find documents about a good party that is not
    over
  • AND can discover relationships between concepts
  • good party
  • OR can discover alternate terminology
  • excellent party, wild party, etc.
  • NOT can discover alternate meanings
  • Democratic party

36
Why Boolean Retrieval Fails
  • Natural language is way more complex
  • AND discovers nonexistent relationships
  • Terms in different sentences, paragraphs,
  • Guessing terminology for OR is hard
  • good, nice, excellent, outstanding, awesome,
  • Guessing terms to exclude is even harder!
  • Democratic party, party to a lawsuit,

37
Strengths and Weaknesses
  • Strengths
  • Precise, if you know the right strategies
  • Precise, if you have an idea of what you're
    looking for
  • Efficient for the computer
  • Weaknesses
  • Users must learn Boolean logic
  • Boolean logic insufficient to capture the
    richness of language
  • No control over the size of the result set: either too
    many documents or none
  • When do you stop reading? All documents in the
    result set are considered equally good
  • What about partial matches? Documents that don't
    quite match the query may also be useful

38
The Vector Space Model
  • Let's replace relevance with similarity
  • Rank documents by their similarity with the query
  • Treat the query as if it were a document
  • Create a query bag-of-words
  • Find its similarity to each document
  • Rank order the documents by similarity
  • Surprisingly, this works pretty well!

39
Vector Space Model
[Figure: documents d1-d5 and the query plotted as vectors in a space with term axes t1, t2, t3; θ marks the angle between two vectors.]
Postulate: documents that are close together in
vector space talk about the same things.
Therefore, retrieve documents based on how close
each document is to the query (i.e., similarity ≈
closeness).
40
Similarity Metric
  • How about the distance |d1 − d2|?
  • This is the Euclidean distance between the
    vectors
  • Instead of distance, use angle between the
    vectors

Why is this not a good idea?
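The usual fix is the cosine of the angle, which ignores vector length. A short sketch (with made-up vectors) shows why Euclidean distance misbehaves when one document is simply longer than another:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

query = [1, 1, 0]
short_doc = [2, 2, 0]     # same direction as the query
long_doc = [10, 10, 0]    # same topic, but many more term occurrences

print(cosine(query, short_doc), cosine(query, long_doc))       # both 1.0
print(euclidean(query, short_doc), euclidean(query, long_doc)) # distance penalizes length
```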
41
How do we weight doc terms?
  • Here's the intuition:
  • Terms that appear often in a document should get
    high weights
  • Terms that appear in many documents should get
    low weights
  • How do we capture this mathematically?
  • Term frequency
  • Inverse document frequency

The more often a document contains the term
"dog", the more likely it is that the document is
about dogs.
Words like "the", "a", and "of" appear in (nearly)
all documents.
42
TF.IDF Term Weighting
  • Simple, yet effective!

w_ij = tf_ij × log(N / n_i)

where
  w_ij  = weight assigned to term i in document j
  tf_ij = number of occurrences of term i in document j
  N     = number of documents in the entire collection
  n_i   = number of documents containing term i
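A minimal Python sketch of this weighting scheme, using the common tf × log(N / n_i) form written above (the counts are invented):

```python
import math

def tfidf(tf_ij, n_i, N):
    """w_ij = tf_ij * log(N / n_i): frequent in the document, rare in the collection."""
    return tf_ij * math.log(N / n_i)

N = 1_000_000                            # hypothetical collection size
print(tfidf(tf_ij=5, n_i=100, N=N))      # rare term: large weight (~46.1)
print(tfidf(tf_ij=5, n_i=900_000, N=N))  # near-stopword: weight close to zero (~0.53)
```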
43
What is a Language Model?
  • Probability distribution over strings of text
  • How likely is a string in a given language?
  • Probabilities depend on what language we're
    modeling

p1 = P("a quick brown dog")
p2 = P("dog quick a brown")
p3 = P("[Russian words] brown dog")
p4 = P("[Russian phrase]")
In a language model for English: p1 > p2 > p3 > p4
In a language model for Russian: p1 < p2 < p3 < p4
44
Noisy-Channel Model of IR
[Figure: the user has an information need, thinks of a relevant document, and writes down a query; the collection contains documents d1, d2, ..., dn.]
The task of information retrieval: given the query,
figure out which document it came from.
45
Retrieval w/ Language Models
  • Build a model for every document
  • Rank document d based on P(M_D | q)
  • Expand using Bayes' Theorem
  • Same as ranking by P(q | M_D)

P(q) is the same for all documents, so it doesn't
change the ranking; P(M_D), the prior, is assumed to
be the same for all d.
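Written out (a standard reconstruction of the derivation the slide describes; the slide's own formula is an image and is not reproduced here):

```latex
P(M_D \mid q) \;=\; \frac{P(q \mid M_D)\,P(M_D)}{P(q)}
\;\;\propto\;\; P(q \mid M_D)
\quad\text{since } P(q) \text{ is constant across documents and } P(M_D) \text{ is assumed uniform.}
```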
46
What does it mean?
Ranking by P(M_D | q)
is the same as ranking by P(q | M_D)


47
Ranking Models?
Ranking by P(q | M_D)
is the same as ranking documents by how likely their
language models are to generate the query

48
Unigram Language Model
  • Assume each word is generated independently
  • Obviously, this is not true
  • But it seems to work well in practice!
  • The probability of a string, given a model

The probability of a sequence of words decomposes
into a product of the probabilities of the individual
words: P(q | M_D) = P(q1 | M_D) × P(q2 | M_D) × ... × P(qn | M_D)
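A minimal sketch of query-likelihood scoring with this unigram assumption; the add-one smoothing is my own choice (not from the slides) so that unseen query words don't zero out the product.

```python
import math
from collections import Counter

def score(query, doc_tokens, vocab_size):
    """log P(q | M_D) under a unigram model with add-one (Laplace) smoothing."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    logp = 0.0
    for word in query:
        # independence assumption: the product of word probabilities becomes a sum of logs
        logp += math.log((counts[word] + 1) / (total + vocab_size))
    return logp

doc = "the quick brown fox jumped over the lazy dog".split()
print(score(["quick", "fox"], doc, vocab_size=10_000))
```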
49
Modeling
  • How do we build a language model for a document?

What's in the urn?
50
NLP for IR
51
The Central Problem in IR
[Figure: the information seeker's concepts are expressed as query terms; the authors' concepts are expressed as document terms. Do these represent the same concepts?]
52
Why is IR hard?
  • IR is hard because natural language is so rich
    (among other reasons)
  • What are the issues?
  • Tokenization
  • Morphological Variation
  • Synonymy
  • Polysemy
  • Paraphrase
  • Ambiguity
  • Anaphora

53
Possible Solutions
  • Vary the unit of indexing
  • Strings and segments
  • Tokens and words
  • Phrases and entities
  • Senses and concepts
  • Manipulate queries and results
  • Term expansion
  • Post-processing of results

54
Tokenization
  • What's a word?
  • First try: words are separated by spaces
  • What about clitics?
  • What about languages without spaces?
  • Same problem with speech!

I'm not saying that I don't want John's input on
this.
[A Chinese sentence, written with no spaces between words.]
55
Word-Level Issues
  • Morphological variation
  • different forms of the same concept
  • Inflectional morphology: same part of speech
  • Derivational morphology: different parts of
    speech
  • Synonymy
  • different words, same meaning
  • Polysemy
  • same word, different meanings

break, broke, broken; sing, sang, sung; etc.
destroy, destruction; invent, invention,
reinvention; etc.
dog, canine, doggy, puppy, etc. → concept of dog
Bank: financial institution or side of a river?
Crane: bird or construction equipment?
"Is": depends on what the meaning of is is!
56
Paraphrase
  • Language provides different ways of saying the
    same thing
  • Who killed Abraham Lincoln?
  • John Wilkes Booth killed Abraham Lincoln.
  • John Wilkes Booth altered history with a bullet.
    He will forever be known as the man who ended
    Abraham Lincoln's life.
  • When did Wilt Chamberlain score 100 points?
  • Wilt Chamberlain scored 100 points on March 2,
    1962 against the New York Knicks.
  • On December 8, 1961, Wilt Chamberlain scored 78
    points in a triple overtime game. It was a new
    NBA record, but Warriors coach Frank McGuire
    didn't expect it to last long, saying, "He'll get
    100 points someday." McGuire's prediction came
    true just a few months later in a game against
    the New York Knicks on March 2.

57
Ambiguity
  • What exactly do you mean?
  • Why don't we have problems (most of the time)?

58
Ambiguity in Action
  • Different documents with the same keywords may
    have different meanings

What is the largest volcano in the Solar System?
(keywords: largest, volcano, solar, system)
What do frogs eat?
(keywords: frogs, eat)
  1. Adult frogs eat mainly insects and other small
    animals, including earthworms, minnows, and
    spiders.
  2. Alligators eat many kinds of small animals that
    live in or near the water, including fish,
    snakes, frogs, turtles, small mammals, and birds.
  3. Some bats catch fish with their claws, and a few
    species eat lizards, rodents, small birds, tree
    frogs, and other bats.

59
Anaphora
  • Language provides different ways of referring to
    the same entity
  • Who killed Abraham Lincoln?
  • John Wilkes Booth killed Abraham Lincoln.
  • John Wilkes Booth altered history with a bullet.
    He will forever be known as the man who ended
    Abraham Lincoln's life.
  • When did Wilt Chamberlain score 100 points?
  • Wilt Chamberlain scored 100 points on March 2,
    1962 against the New York Knicks.
  • On December 8, 1961, Wilt Chamberlain scored 78
    points in a triple overtime game. It was a new
    NBA record, but Warriors coach Frank McGuire
    didn't expect it to last long, saying, "He'll get
    100 points someday." McGuire's prediction came
    true just a few months later in a game against
    the New York Knicks on March 2.

60
More Anaphora
  • Terminology
  • Anaphor: an expression that refers to another
  • Anaphora: the phenomenon
  • There are other types of referring expressions
  • Anaphora resolution can be hard!

Fujitsu and NEC said they were still
investigating, and that knowledge of more such
bids could emerge... Other major Japanese
computer companies contacted yesterday said they
have never made such bids.
The hotel recently went through a $200 million
restoration; original artworks include an
impressive collection of Greek statues in the
lobby.
The city council denied the demonstrators a
permit because they feared violence. / ...because
they advocated violence.
61
What can we do?
  • Here are some of the problems:
  • Tokenization
  • Morphological variation, synonymy, polysemy
  • Paraphrase, ambiguity
  • Anaphora
  • General approaches
  • Vary the unit of indexing
  • Manipulate queries and results

62
What do we index?
  • In information retrieval, we are after the
    concepts represented in the documents
  • but we can only index strings
  • So what's the best unit of indexing?

63
The Tokenization Problem
  • In many languages, words are not separated by
    spaces
  • Tokenization separating a string into words
  • Simple greedy approach
  • Start with a list of every possible term (e.g.,
    from a dictionary)
  • Look for the longest word in the unsegmented
    string
  • Take longest matching term as the next word and
    repeat

64
Probabilistic Segmentation
  • For an input word c1 c2 c3 ... cn
  • Try all possible partitions
  • Choose the highest probability partition
  • E.g., compute P(c1 c2 c3) using a language model
  • Challenges search, probability estimation

[Figure: the same string c1 c2 c3 c4 ... cn shown under several candidate partitions.]
65
Indexing N-Grams
  • Consider a Chinese document c1 c2 c3 ... cn
  • Don't segment (you could be wrong!)
  • Instead, treat every character bigram as a term
  • Break up queries the same way
  • Works at least as well as trying to segment
    correctly!
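A minimal sketch of overlapping character-bigram indexing; the Latin letters here merely stand in for Chinese characters.

```python
def char_bigrams(text):
    """Index every overlapping character bigram as a term."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

# "ABCD" stands in for a run of Chinese characters
print(char_bigrams("ABCD"))  # ['AB', 'BC', 'CD']
```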

66
Morphological Variation
  • Handling morphology: related concepts have
    different forms
  • Inflectional morphology: same part of speech
  • Derivational morphology: different parts of
    speech
  • Different morphological processes
  • Prefixing
  • Suffixing
  • Infixing
  • Reduplication

dogs = dog + PLURAL
broke = break + PAST
destruction = destroy + -ion
researcher = research + -er
67
Stemming
  • Dealing with morphological variation: index stems
    instead of words
  • Stem: a word equivalence class that preserves the
    central concept
  • How much to stem?
  • organization → organize → organ?
  • resubmission → resubmit/submission → submit?
  • reconstructionism → ?
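For illustration, the Porter stemmer (here via NLTK, a tooling choice of mine rather than something the slides prescribe) shows how aggressively related forms can collapse:

```python
from nltk.stem import PorterStemmer  # requires: pip install nltk

stemmer = PorterStemmer()
for word in ["organization", "organize", "resubmission", "submission",
             "reconstructionism"]:
    # Porter strips suffixes rule by rule; e.g., "organization" comes out as "organ"
    print(word, "->", stemmer.stem(word))
```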

68
Does Stemming Work?
  • Generally, yes! (in English)
  • Helps more for longer queries
  • Lots of work done in this area

Donna Harman. (1991) How Effective is Suffixing?
Journal of the American Society for Information
Science, 42(1):7-15.
Robert Krovetz. (1993) Viewing Morphology as an
Inference Process. Proceedings of SIGIR 1993.
David A. Hull. (1996) Stemming Algorithms: A Case
Study for Detailed Evaluation. Journal of the
American Society for Information Science,
47(1):70-84.
And others.
69
Stemming in Other Languages
  • Arabic makes frequent use of infixes
  • What's the most effective stemming strategy in
    Arabic? An open research question

maktab (office), kitaab (book), kutub (books),
kataba (he wrote), naktubu (we write), etc.
→ the root ktb
70
Words: the wrong indexing unit!
  • Synonymy
  • different words, same meaning
  • Polysemy
  • same word, different meanings
  • It'd be nice if we could index concepts!
  • Word sense: a coherent cluster in semantic space
  • Indexing word senses achieves the effect of
    conceptual indexing

dog, canine, doggy, puppy, etc. → concept of dog
Bank: financial institution or side of a river?
Crane: bird or construction equipment?
71
Indexing Word Senses
  • How does indexing word senses solve the
    synonym/polysemy problem?
  • Okay, so where do we get the word senses?
  • WordNet
  • Automatically find clusters of words that
    describe the same concepts
  • Other methods also have been tried

dog, canine, doggy, puppy, etc. → concept 112986
"I deposited my check in the bank." bank → concept 76529
"I saw the sailboat from the bank." bank → concept 53107
72
Word Sense Disambiguation
  • Given a word in context, automatically determine
    the sense (concept)
  • This is the Word Sense Disambiguation (WSD)
    problem
  • Context is the key
  • For each ambiguous word, note the surrounding
    words
  • Learn a classifier from a collection of examples
  • Use the classifier to determine the senses of
    words in the documents

bank: river, sailboat, water, etc. → side of a river
bank: check, money, account, etc. → financial
institution
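A hedged sketch of the classifier idea using bag-of-context-words features and Naive Bayes; scikit-learn and the tiny training set are my assumptions, not part of the slides.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training contexts for two senses of "bank"
contexts = [
    "deposited my check at the bank and checked the account",
    "the bank approved the loan and updated the account",
    "the sailboat drifted toward the river bank in the water",
    "fished from the grassy bank of the river",
]
senses = ["financial", "financial", "river", "river"]

# Surrounding words become features; the classifier picks the most likely sense
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(contexts, senses)
print(clf.predict(["I saw the sailboat from the bank of the river"]))  # expect ['river']
```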
73
Does it work?
  • Nope!
  • Examples of limited success.

Ellen M. Voorhees. (1993) Using WordNet to
Disambiguate Word Senses for Text Retrieval.
Proceedings of SIGIR 1993.
Mark Sanderson. (1994) Word-Sense Disambiguation
and Information Retrieval. Proceedings of SIGIR 1994.
Hinrich Schütze and Jan O. Pedersen. (1995)
Information Retrieval Based on Word Senses.
Proceedings of the 4th Annual Symposium on
Document Analysis and Information Retrieval.
Rada Mihalcea and Dan Moldovan. (2000) Semantic
Indexing Using WordNet Senses. Proceedings of the
ACL 2000 Workshop on Recent Advances in NLP and IR.
And others.
74
Why Disambiguation Hurts
  • Bag-of-words techniques already disambiguate
  • Context for each term is established in the query
  • WSD is hard!
  • Many words are highly polysemous, e.g., interest
  • Granularity of senses is often domain/application
    specific
  • WSD tries to improve precision
  • But incorrect sense assignments would hurt recall
  • Slight gains in precision do not offset large
    drops in recall

75
An Alternate Approach
  • Indexing word senses freezes concepts at index
    time
  • What if we expanded query terms at query time
    instead?
  • Two approaches
  • Manual thesaurus, e.g., WordNet, UMLS, etc.
  • Automatically-derived thesaurus, e.g.,
    co-occurrence statistics

dog AND cat → ( dog OR canine ) AND ( cat OR
feline )
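A minimal sketch of query-time expansion; the tiny synonym table stands in for WordNet or a co-occurrence-derived thesaurus.

```python
# Tiny hand-built thesaurus standing in for WordNet or co-occurrence statistics
thesaurus = {"dog": ["canine"], "cat": ["feline"]}

def expand(term):
    """Replace a term with an OR group of the term and its known alternatives."""
    alternatives = [term] + thesaurus.get(term, [])
    return "( " + " OR ".join(alternatives) + " )"

query_terms = ["dog", "cat"]
print(" AND ".join(expand(t) for t in query_terms))
# ( dog OR canine ) AND ( cat OR feline )
```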
76
Does it work?
  • Yes, if done carefully
  • User should be involved in the process
  • Otherwise, poor choice of terms can hurt
    performance

77
Handling Anaphora
  • Anaphora resolution: finding what the anaphor
    refers to (i.e., the antecedent)
  • Most common example: pronominal anaphora
    resolution
  • The simplest method works pretty well: find the
    previous noun phrase matching in gender and number

John Wilkes Booth altered history with a bullet.
He will forever be known as the man who ended
Abraham Lincoln's life. (He = John Wilkes Booth)
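A rough sketch of that heuristic; the pre-tagged noun phrases (with gender and number) are assumed inputs from earlier linguistic processing.

```python
# Candidate antecedents as (noun phrase, gender, number), in order of appearance
noun_phrases = [
    ("John Wilkes Booth", "male", "singular"),
    ("history", "neuter", "singular"),
    ("a bullet", "neuter", "singular"),
]

def resolve(pronoun_gender, pronoun_number, candidates):
    """Pick the most recent preceding noun phrase that agrees in gender and number."""
    for np, gender, number in reversed(candidates):
        if gender == pronoun_gender and number == pronoun_number:
            return np
    return None

print(resolve("male", "singular", noun_phrases))  # John Wilkes Booth
```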
78
Expanding Anaphors
  • When indexing, replace anaphors with their
    antecedents
  • Does it work?
  • Somewhat
  • but can be computationally expensive
  • helps more if you want to retrieve sub-document
    segments

79
Beyond Word-Level Indexing
  • Words are the wrong unit to index
  • Many multi-word combinations identify entities
  • Persons: George W. Bush, Dr. Jones
  • Organizations: Red Cross, United Way
  • Corporations: Hewlett Packard, Kraft Foods
  • Locations: Easter Island, New York City
  • Entities often have finer-grained structures

Professor Stephen W. Hawking → title + first name + middle initial + last name
Cambridge, Massachusetts → city + state
80
Indexing Named Entities
  • Why would we want to index named entities?
  • Index named entities as special tokens
  • And treat special tokens like query terms
  • Works pretty well for question answering

Who patented the light bulb?
→ patent, light bulb, PERSON
When was the light bulb patented?
→ patent, light bulb, DATE
John Prager, Eric Brown, and Anni Coden. (2000)
Question-Answering by Predictive Annotation.
Proceedings of SIGIR 2000.
81
Indexing Phrases
  • Two types of phrases
  • Those that make sense, e.g., school bus, hot
    dog
  • Those that don't, e.g., bigrams in Chinese
  • Treat multi-word tokens as index terms
  • Three sources of evidence
  • Dictionary lookup
  • Linguistic analysis
  • Statistical analysis (e.g., co-occurrence)

82
Known Phrases
  • Compile a term list that includes phrases
  • Technical terminology can be very helpful
  • Index any phrase that occurs in the list
  • Most effective in a limited domain
  • Otherwise hard to capture most useful phrases

83
Syntactic Phrases
  • Parsing: automatically assigning structure to a
    sentence
  • Walk the tree and extract phrases
  • Index all noun phrases
  • Index subjects and verbs
  • Index verbs and objects
  • etc.

[Figure: parse tree for "The quick brown fox jumped over the lazy black dog": a Sentence consisting of a Noun Phrase (Det Adj Adj Noun), a Verb, and a Prepositional Phrase (Prep plus a Noun Phrase: Det Adj Adj Noun).]
84
Syntactic Variations
  • What does linguistic analysis buy?
  • Coordinations
  • Substitutions
  • Permutations

lung and breast cancer → lung cancer, breast
cancer
inflammatory sinonasal disease → inflammatory
disease, sinonasal disease
addition of calcium → calcium addition
85
Statistical Analysis
  • Automatically discover phrases based on
    co-occurrence probabilities
  • If terms are not independent, they may form a
    phrase
  • Use this method to automatically learn a phrase
    dictionary

P(kick the bucket) ≫ P(kick) × P(the) × P(bucket)?
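A minimal sketch of that independence test using pointwise mutual information; all counts are invented for illustration.

```python
import math

# Invented corpus counts
N = 1_000_000                 # total bigram positions in the corpus
count = {"kick": 500, "the": 60_000, "bucket": 300}
bigram_count = {("kick", "the"): 80, ("the", "bucket"): 70}

def pmi(w1, w2):
    """log of the observed bigram probability over the independence prediction."""
    p_joint = bigram_count.get((w1, w2), 0) / N
    p_indep = (count[w1] / N) * (count[w2] / N)
    return math.log2(p_joint / p_indep) if p_joint > 0 else float("-inf")

print(pmi("kick", "the"))  # positive: the pair co-occurs more often than independence predicts
```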
86
Does Phrasal Indexing Work?
  • Yes
  • But the gains are so small they're not worth the
    cost
  • Primary drawback too slow!

87
What about ambiguity?
  • Different documents with the same keywords may
    have different meanings

What is the largest volcano in the Solar System?
(keywords: largest, volcano, solar, system)
What do frogs eat?
(keywords: frogs, eat)
  1. Adult frogs eat mainly insects and other small
    animals, including earthworms, minnows, and
    spiders.
  2. Alligators eat many kinds of small animals that
    live in or near the water, including fish,
    snakes, frogs, turtles, small mammals, and birds.
  3. Some bats catch fish with their claws, and a few
    species eat lizards, rodents, small birds, tree
    frogs, and other bats.

88
Indexing Relations
  • Instead of terms, index syntactic relations
    between entities in the text

Adult frogs eat mainly insects and other small
animals, including earthworms, minnows, and
spiders.
<frogs subject-of eat> <insects object-of eat>
<animals object-of eat> <adult modifies frogs>
<small modifies animals>
Alligators eat many kinds of small animals that
live in or near the water, including fish,
snakes, frogs, turtles, small mammals, and birds.
<alligators subject-of eat> <kinds object-of eat>
<small modifies animals>
From the relations, it is clear who's eating whom!
89
Are syntactic relations enough?
  • Consider this example
  • Syntax sometimes isn't enough; we need semantics
    (or meaning)!
  • Semantics, for example, allows us to relate the
    following two fragments

John broke the window. The window broke.
<John subject-of break> <window subject-of break>
"John" and "window" are both subjects, but John
is the person doing the breaking (the agent), and
the window is the thing being broken (the theme).
The barbarians destroyed the city. The
destruction of the city by the barbarians.
event: destroy, agent: barbarians, theme: city
90
Semantic Roles
  • Semantic roles are invariant with respect to
    syntactic expression
  • The idea
  • Identify semantic roles
  • Index frame structures with filled slots
  • Retrieve answers based on semantic-level matching

Mary loaded the truck with hay. Hay was loaded
onto the truck by Mary.
event: load, agent: Mary, material: hay,
destination: truck
91
Does it work?
  • No, not really
  • Why not?
  • Syntactic and semantic analysis is difficult;
    errors offset whatever gains are achieved
  • As with WSD, these techniques are
    precision enhancers; recall usually takes a dive
  • It's slow!

92
Alternative Approach
  • Sophisticated linguistic analysis is slow!
  • Unnecessary processing can be avoided by query
    time analysis
  • Two-stage retrieval
  • Use standard document retrieval techniques to
    fetch a candidate set of documents
  • Use passage retrieval techniques to choose a few
    promising passages (e.g., paragraphs)
  • Apply sophisticated linguistic techniques to
    pinpoint the answer
  • Passage retrieval
  • Find good passages within documents
  • Key idea: locate areas where lots of query terms
    appear close together (a sketch follows below)
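A minimal sketch of that passage-scoring idea: slide a fixed-size window over the document and keep the window containing the most distinct query terms (window size and scoring are illustrative choices, not from the slides).

```python
def best_passage(doc_tokens, query_terms, window=30):
    """Return the window of tokens containing the most distinct query terms."""
    query = set(query_terms)
    best_start, best_score = 0, -1
    for start in range(0, max(1, len(doc_tokens) - window + 1)):
        window_tokens = doc_tokens[start:start + window]
        score = len(query & set(window_tokens))
        if score > best_score:
            best_start, best_score = start, score
    return " ".join(doc_tokens[best_start:best_start + window])

doc = ("adult frogs eat mainly insects and other small animals "
       "including earthworms minnows and spiders").split()
print(best_passage(doc, ["frogs", "eat"], window=5))  # "adult frogs eat mainly insects"
```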

93
Key Ideas
  • IR is hard because language is rich and complex
    (among other reasons)
  • Two general approaches to the problem
  • Attempt to find the best unit of indexing
  • Try to fix things at query time
  • It is hard to predict a priori what techniques
    work
  • Questions must be answered experimentally
  • Words are really the wrong thing to index
  • But there isn't really a better alternative