The Vector Space Model

Transcript and Presenter's Notes


1
The Vector Space Model
  • LBSC 796/CMSC828o
  • Session 3, February 9, 2004
  • Douglas W. Oard

2
Agenda
  • Thinking about search
  • Design strategies
  • Decomposing the search component
  • Boolean free text retrieval
  • The bag of terms representation
  • Proximity operators
  • Ranked retrieval
  • Vector space model
  • Passage retrieval

3
Supporting the Search Process
(Figure: supporting the search process, beginning with source selection.)
4
Design Strategies
  • Foster human-machine synergy
  • Exploit complementary strengths
  • Accommodate shared weaknesses
  • Divide-and-conquer
  • Divide task into stages with well-defined
    interfaces
  • Continue dividing until problems are easily
    solved
  • Co-design related components
  • Iterative process of joint optimization

5
Human-Machine Synergy
  • Machines are good at
  • Doing simple things accurately and quickly
  • Scaling to larger collections in sublinear time
  • People are better at
  • Accurately recognizing what they are looking for
  • Evaluating intangibles such as quality
  • Both are pretty bad at
  • Mapping consistently between words and concepts

6
Divide and Conquer
  • Strategy: use encapsulation to limit complexity
  • Approach:
  • Define interfaces (input and output) for each
    component
  • Query interface: input terms, output representation
  • Define the functions performed by each component
  • Remove common words, weight rare terms higher, ...
  • Repeat the process within components as needed
  • Result: a hierarchical decomposition

7
Search Goal
  • Choose the same documents a human would
  • Without human intervention (less work)
  • Faster than a human could (less time)
  • As accurately as possible (less accuracy)
  • Humans start with an information need
  • Machines start with a query
  • Humans match documents to information needs
  • Machines match document and query representations

8
Search Component Model
(Figure: the search component model. A human's information need is turned, via query formulation, into a query; query processing applies a representation function to produce the query representation, while document processing applies a representation function to each document to produce document representations. A comparison function computes a retrieval status value for each document, which human judgment relates back to the information need as utility.)
9
Relevance
  • Relevance relates a topic and a document
  • Duplicates are equally relevant, by definition
  • Constant over time and across users
  • Pertinence relates a task and a document
  • Accounts for quality, complexity, language,
  • Utility relates a user and a document
  • Accounts for prior knowledge
  • We seek utility, but relevance is what we get!

10
Bag of Terms Representation
  • Bag: a set that can contain duplicates
  • "The quick brown fox jumped over the lazy dog's back" →
    back, brown, dog, fox, jump, lazy, over, quick, the, the
  • Vector: values recorded in any consistent order
  • back, brown, dog, fox, jump, lazy, over, quick, the, the →
    1 1 1 1 1 1 1 1 2
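A minimal Python sketch of the steps above (the tokenizer is a simplification and does not stem, so "jumped" and "dog's" are not reduced to "jump" and "dog" as on the next slide):

```python
from collections import Counter

def bag_of_terms(text):
    """Split on whitespace, strip punctuation, lowercase, and count terms."""
    return Counter(t.strip(".,!?").lower() for t in text.split())

doc = "The quick brown fox jumped over the lazy dog's back."
bag = bag_of_terms(doc)

# Record the counts in a consistent (here alphabetical) order to form a vector.
vocab = sorted(bag)
vector = [bag[t] for t in vocab]
print(vector)  # [1, 1, 1, 1, 1, 1, 1, 1, 2]  -- "the" occurs twice
```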

11
Bag of Terms Example
Document 1: The quick brown fox jumped over the lazy dog's back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword list: for, is, of, the, to

  Term     Document 1   Document 2
  aid          0            1
  all          0            1
  back         1            0
  brown        1            0
  come         0            1
  dog          1            0
  fox          1            0
  good         0            1
  jump         1            0
  lazy         1            0
  men          0            1
  now          0            1
  over         1            0
  party        0            1
  quick        1            0
  their        0            1
  time         0            1
12
Boolean Free Text Retrieval
  • Limit the bag of words to absent and present
  • Boolean values, represented as 0 and 1
  • Represent terms as a bag of documents
  • Same representation, but rows rather than columns
  • Combine the rows using Boolean operators
  • AND, OR, NOT
  • Result set: every document with a 1 remaining
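A minimal Python sketch of the idea: each term's row is the set of documents that contain it, and the Boolean operators become set operations (rows taken from the example table on slide 14):

```python
# Each term maps to the set of documents in which it appears.
index = {
    "dog":   {3, 5},
    "fox":   {3, 5, 7},
    "good":  {2, 4, 6, 8},
    "party": {6, 8},
    "over":  {1, 3, 5, 7, 8},
}

print(index["dog"] & index["fox"])                       # AND -> {3, 5}
print(index["dog"] | index["fox"])                       # OR  -> {3, 5, 7}
print(index["fox"] - index["dog"])                       # fox NOT dog -> {7}
print((index["good"] & index["party"]) - index["over"])  # good AND party NOT over -> {6}
```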

13
Boolean Operators
  NOT B:
    B = 0 -> 1     B = 1 -> 0

  A OR B:
            B = 0   B = 1
    A = 0     0       1
    A = 1     1       1

  A AND B:
            B = 0   B = 1
    A = 0     0       0
    A = 1     0       1

  A NOT B (A AND NOT B):
            B = 0   B = 1
    A = 0     0       0
    A = 1     1       0
14
Boolean Free Text Example
  • dog AND fox
  • Doc 3, Doc 5
  • dog NOT fox
  • Empty
  • fox NOT dog
  • Doc 7
  • dog OR fox
  • Doc 3, Doc 5, Doc 7
  • good AND party
  • Doc 6, Doc 8
  • good AND party NOT over
  • Doc 6

  Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
  aid       0      0      0      1      0      0      0      1
  all       0      1      0      1      0      1      0      0
  back      1      0      1      0      0      0      1      0
  brown     1      0      1      0      1      0      1      0
  come      0      1      0      1      0      1      0      1
  dog       0      0      1      0      1      0      0      0
  fox       0      0      1      0      1      0      1      0
  good      0      1      0      1      0      1      0      1
  jump      0      0      1      0      0      0      0      0
  lazy      1      0      1      0      1      0      1      0
  men       0      1      0      1      0      0      0      1
  now       0      1      0      0      0      1      0      1
  over      1      0      1      0      1      0      1      1
  party     0      0      0      0      0      1      0      1
  quick     1      0      1      0      0      0      0      0
  their     1      0      0      0      1      0      1      0
  time      0      1      0      1      0      1      0      0
15
Why Boolean Retrieval Works
  • Boolean operators approximate natural language
  • Find documents about a good party that is not
    over
  • AND can discover relationships between concepts
  • good party
  • OR can discover alternate terminology
  • excellent party
  • NOT can discover alternate meanings
  • Democratic party

16
The Perfect Query Paradox
  • Every information need has a perfect doc set
  • If not, there would be no sense doing retrieval
  • Almost every document set has a perfect query
  • AND every word to get a query for document 1
  • Repeat for each document in the set
  • OR every document query to get the set query
  • But users find Boolean query formulation hard
  • They get too much, too little, useless stuff,

17
Why Boolean Retrieval Fails
  • Natural language is way more complex
  • She saw the man on the hill with a telescope
  • AND discovers nonexistent relationships
  • Terms in different paragraphs, chapters,
  • Guessing terminology for OR is hard
  • good, nice, excellent, outstanding, awesome,
  • Guessing terms to exclude is even harder!
  • Democratic party, party to a lawsuit,

18
Proximity Operators
  • More precise versions of AND
  • NEAR n allows at most n-1 intervening terms
  • WITH requires terms to be adjacent and in order
  • Easy to implement, but less efficient
  • Store a list of positions for each word in each
    doc
  • Stopwords become very important!
  • Perform normal Boolean computations
  • Treat WITH and NEAR like AND with an extra
    constraint
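A small Python sketch of NEAR and WITH over such a positional index (positions taken from the example on the next slide; the helper names are illustrative):

```python
# For each term, store the word positions at which it occurs in each document.
positions = {
    "quick": {1: [2]},
    "brown": {1: [3]},
    "fox":   {1: [4]},
    "time":  {2: [4]},
    "come":  {2: [10]},
}

def near(index, a, b, n):
    """Docs where some occurrence of a and b has at most n-1 intervening terms."""
    hits = set()
    for doc in index.get(a, {}).keys() & index.get(b, {}).keys():
        if any(abs(pa - pb) <= n for pa in index[a][doc] for pb in index[b][doc]):
            hits.add(doc)
    return hits

def with_(index, a, b):
    """Docs where a is immediately followed by b (adjacent and in order)."""
    hits = set()
    for doc in index.get(a, {}).keys() & index.get(b, {}).keys():
        if any(pb - pa == 1 for pa in index[a][doc] for pb in index[b][doc]):
            hits.add(doc)
    return hits

print(near(positions, "quick", "fox", 2))   # {1}: one intervening term ("brown")
print(with_(positions, "quick", "fox"))     # set(): not adjacent
print(near(positions, "time", "come", 2))   # set(): too far apart
```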

19
Proximity Operator Example
  • time AND come
  • Doc 2
  • time (NEAR 2) come
  • Empty
  • quick (NEAR 2) fox
  • Doc 1
  • quick WITH fox
  • Empty

  Term     Doc 1     Doc 2
  aid        0       1 (13)
  all        0       1 (6)
  back     1 (10)      0
  brown    1 (3)       0
  come       0       1 (10)
  dog      1 (9)       0
  fox      1 (4)       0
  good       0       1 (7)
  jump     1 (5)       0
  lazy     1 (8)       0
  men        0       1 (8)
  now        0       1 (1)
  over     1 (6)       0
  party      0       1 (16)
  quick    1 (2)       0
  their      0       1 (15)
  time       0       1 (4)

  (Doc 1 and Doc 2 are the two example sentences from slide 11; the number in
  parentheses is the term's word position within the document.)
20
Strengths and Weaknesses
  • Strong points
  • Accurate, if you know the right strategies
  • Efficient for the computer
  • Weaknesses
  • Often results in too many documents, or none
  • Users must learn Boolean logic
  • Sometimes finds relationships that don't exist
  • Words can have many meanings
  • Choosing the right words is sometimes hard

21
Ranked Retrieval Paradigm
  • Exact match retrieval often gives useless sets
  • No documents at all, or way too many documents
  • Query reformulation is one solution
  • Manually add or delete query terms
  • Best-first ranking can be superior
  • Select every document within reason
  • Put them in order, with the best ones first
  • Display them one screen at a time

22
Advantages of Ranked Retrieval
  • Closer to the way people think
  • Some documents are better than others
  • Enriches browsing behavior
  • Decide how far down the list to go as you read it
  • Allows more flexible queries
  • Long and short queries can produce useful results

23
Ranked Retrieval Challenges
  • Best first is easy to say but hard to do!
  • The best we can hope for is to approximate it
  • Will the user understand the process?
  • It is hard to use a tool that you don't
    understand
  • Efficiency becomes a concern
  • Only a problem for long queries, though

24
Partial-Match Ranking
  • Form several result sets from one long query
  • Query for the first set is the AND of all the
    terms
  • Then all but the 1st term, all but the 2nd,
  • Then all but the first two terms,
  • And so on until each single term query is tried
  • Remove duplicates from subsequent sets
  • Display the sets in the order they were made
  • Document rank within a set is arbitrary
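A sketch of this scheme in Python, under the assumption that every subset of the query terms is tried as an AND query in decreasing order of size (the slide spells out only the first few subsets):

```python
from itertools import combinations

def partial_match_rank(index, query_terms):
    """Run AND queries over successively smaller subsets of the query terms,
    from all terms down to single terms, appending each new result set."""
    ranked, seen = [], set()
    for size in range(len(query_terms), 0, -1):
        for subset in combinations(query_terms, size):
            docs = set.intersection(*(index.get(t, set()) for t in subset))
            for d in sorted(docs - seen):   # rank within a set is arbitrary
                ranked.append(d)
                seen.add(d)
    return ranked

index = {"information": {1, 2, 3, 4, 5}, "retrieval": {1, 2, 3, 4, 6}}
print(partial_match_rank(index, ["information", "retrieval"]))
# [1, 2, 3, 4, 5, 6]: the AND set first, then documents matching only one term
```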

25
Partial Match Example
information AND retrieval
  Readings in Information Retrieval
  Information Storage and Retrieval
  Speech-Based Information Retrieval for Digital Libraries
  Word Sense Disambiguation and Information Retrieval

information NOT retrieval
  The State of the Art in Information Filtering

retrieval NOT information
  Inference Networks for Document Retrieval
  Content-Based Image Retrieval Systems
  Video Parsing, Retrieval and Browsing
  An Approach to Conceptual Text Retrieval Using the EuroWordNet
  Cross-Language Retrieval English/Russian/French
26
Similarity-Based Queries
  • Treat the query as if it were a document
  • Create a query bag-of-words
  • Find the similarity of each document
  • Using the coordination measure, for example
  • Rank order the documents by similarity
  • Most similar to the query first
  • Surprisingly, this works pretty well!
  • Especially for very short queries

27
Document Similarity
  • How similar are two documents?
  • In particular, how similar is their bag of words?

  1  Nuclear fallout contaminated Montana.
  2  Information retrieval is interesting.
  3  Information retrieval is complicated.

  Term           1   2   3
  complicated    0   0   1
  contaminated   1   0   0
  fallout        1   0   0
  information    0   1   1
  interesting    0   1   0
  nuclear        1   0   0
  retrieval      0   1   1
  siberia        0   0   0
28
The Coordination Measure
  • Count the number of terms in common
  • Based on Boolean bag-of-words
  • Documents 2 and 3 share two common terms
  • But documents 1 and 2 share no terms at all
  • Useful for "more like this" queries
  • "More like doc 2" would rank doc 3 ahead of doc 1
  • Where have you seen this before?
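In Python, the coordination measure is just the size of the intersection of the two Boolean bags of words (documents from the previous slide):

```python
def coordination(doc_a, doc_b):
    """Count the terms two documents have in common."""
    return len(set(doc_a) & set(doc_b))

doc1 = {"nuclear", "fallout", "contaminated", "montana"}
doc2 = {"information", "retrieval", "interesting"}
doc3 = {"information", "retrieval", "complicated"}

print(coordination(doc2, doc3))  # 2: "information" and "retrieval"
print(coordination(doc1, doc2))  # 0: no terms in common
```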

29
Coordination Measure Example
  • Query: complicated retrieval → Result: 3, 2
  • Query: interesting nuclear fallout → Result: 1, 2
  • Query: information retrieval → Result: 2, 3

  Term           1   2   3
  complicated    0   0   1
  contaminated   1   0   0
  fallout        1   0   0
  information    0   1   1
  interesting    0   1   0
  nuclear        1   0   0
  retrieval      0   1   1
  siberia        0   0   0
30
Counting Terms
  • Terms tell us about documents
  • If "rabbit" appears a lot, it may be about
    rabbits
  • Documents tell us about terms
  • "the" is in every document -- not discriminating
  • Documents are most likely described well by rare
    terms that occur in them frequently
  • Higher term frequency is stronger evidence
  • Low collection frequency makes it stronger still

31
The Document Length Effect
  • Humans look for documents with useful parts
  • But probabilities are computed for the whole
  • Document lengths vary in many collections
  • So probability calculations could be inconsistent
  • Two strategies
  • Adjust probability estimates for document length
  • Divide the documents into equal passages

32
Incorporating Term Frequency
  • High term frequency is evidence of meaning
  • And high IDF is evidence of term importance
  • Recompute the bag-of-words
  • Compute TF × IDF for every element
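A sketch in Python, using IDF = log10(N / document frequency), which reproduces the IDF values in the example on slide 34 (e.g. log10(4/2) = 0.301 for a term appearing in 2 of 4 documents):

```python
import math

def idf(term, docs):
    """Inverse document frequency: log10(N / number of documents containing the term)."""
    df = sum(1 for bag in docs.values() if term in bag)
    return math.log10(len(docs) / df) if df else 0.0

def tf_idf(docs):
    """Recompute every bag of words so each raw count becomes TF * IDF."""
    return {d: {t: tf * idf(t, docs) for t, tf in bag.items()}
            for d, bag in docs.items()}

# Term counts from the TF-IDF example (slide 34).
docs = {
    1: {"contaminated": 4, "fallout": 5, "information": 6, "nuclear": 3, "siberia": 2},
    2: {"contaminated": 1, "information": 3, "interesting": 1, "retrieval": 6},
    3: {"complicated": 5, "contaminated": 3, "fallout": 4, "information": 3,
        "nuclear": 7, "retrieval": 1},
    4: {"complicated": 2, "fallout": 3, "information": 2, "retrieval": 4},
}
weights = tf_idf(docs)
print(round(weights[3]["complicated"], 2))  # 1.51 = 5 * log10(4/2)
```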

33
Weighted Matching Schemes
  • Unweighted queries
  • Add up the weights for every matching term
  • User specified query term weights
  • For each term, multiply the query and doc weights
  • Then add up those values
  • Automatically computed query term weights
  • Most queries lack useful TF, but IDF may be
    useful
  • Used just like user-specified query term weights
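A minimal Python sketch of the matching step: a document's score is the sum of its weights for the query terms, each multiplied by a query term weight (1 for unweighted queries; the document weights shown are from the example on the next slide):

```python
def score(doc_weights, query):
    """query maps each query term to its weight; use 1.0 for unweighted queries."""
    return sum(q * doc_weights.get(t, 0.0) for t, q in query.items())

doc2 = {"contaminated": 0.13, "interesting": 0.60, "retrieval": 0.75}
print(round(score(doc2, {"contaminated": 1.0, "retrieval": 1.0}), 2))  # unweighted: 0.88
print(round(score(doc2, {"contaminated": 3.0, "retrieval": 1.0}), 2))  # user-weighted: 1.14
```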

34
TFIDF Example
  • Unweighted query: contaminated retrieval → Result: 2, 3, 1, 4
  • Weighted query: contaminated(3) retrieval(1) → Result: 1, 3, 2, 4
  • IDF-weighted query: contaminated retrieval → Result: 2, 3, 1, 4

  Raw term frequencies (TF):

  Term           Doc 1   Doc 2   Doc 3   Doc 4    IDF
  complicated      -       -       5       2      0.301
  contaminated     4       1       3       -      0.125
  fallout          5       -       4       3      0.125
  information      6       3       3       2      0.000
  interesting      -       1       -       -      0.602
  nuclear          3       -       7       -      0.301
  retrieval        -       6       1       4      0.125
  siberia          2       -       -       -      0.602

  TF × IDF weights:

  Term           Doc 1   Doc 2   Doc 3   Doc 4
  complicated      -       -      1.51    0.60
  contaminated    0.50    0.13    0.38     -
  fallout         0.63     -      0.50    0.38
  information      0       0       0       0
  interesting      -      0.60     -       -
  nuclear         0.90     -      2.11     -
  retrieval        -      0.75    0.13    0.50
  siberia         1.20     -       -       -
35
Document Length Normalization
  • Long documents have an unfair advantage
  • They use a lot of terms
  • So they get more matches than short documents
  • And they use the same words repeatedly
  • So they have much higher term frequencies
  • Normalization seeks to remove these effects
  • Related somehow to maximum term frequency
  • But also sensitive to the number of terms

36
Cosine Normalization
  • Compute the length of each document vector
  • Multiply each weight by itself
  • Add all the resulting values
  • Take the square root of that sum
  • Divide each weight by that length
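A Python sketch of those four steps, checked against document 2 of the example on the next slide:

```python
import math

def cosine_normalize(weights):
    """Divide every weight by the Euclidean length of the document vector."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}

# TF*IDF weights for document 2 of the running example.
doc2 = {"contaminated": 0.13, "interesting": 0.60, "retrieval": 0.75}
print(round(math.sqrt(sum(w * w for w in doc2.values())), 2))   # 0.97
print({t: round(w, 2) for t, w in cosine_normalize(doc2).items()})
# {'contaminated': 0.13, 'interesting': 0.62, 'retrieval': 0.77}
```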

37
Cosine Normalization Example
  Cosine-normalized weights (each TF × IDF weight from the previous slide
  divided by the document's vector length):

  Term           Doc 1   Doc 2   Doc 3   Doc 4
  complicated      -       -      0.57    0.69
  contaminated    0.29    0.13    0.14     -
  fallout         0.37     -      0.19    0.44
  information      0       0       0       0
  interesting      -      0.62     -       -
  nuclear         0.53     -      0.79     -
  retrieval        -      0.77    0.05    0.57
  siberia         0.71     -       -       -

  Length          1.70    0.97    2.67    0.87

  • Unweighted query: contaminated retrieval → Result: 2, 4, 1, 3
    (compare to 2, 3, 1, 4 without length normalization)
38
Why Call It Cosine?
(Figure: two document vectors d1 and d2, separated by the angle θ.)
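Algebraically (a standard identity rather than text transcribed from the slide), the similarity of d1 and d2 is the cosine of the angle θ between them, i.e. the inner product of the length-normalized vectors:

```latex
\cos\theta
  = \frac{\vec{d}_1 \cdot \vec{d}_2}{\lVert\vec{d}_1\rVert\,\lVert\vec{d}_2\rVert}
  = \frac{\sum_i w_{i,1}\, w_{i,2}}
         {\sqrt{\sum_i w_{i,1}^{2}}\;\sqrt{\sum_i w_{i,2}^{2}}}
```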
39
Interpreting the Cosine Measure
  • Think of a document as a vector from zero
  • Similarity is the angle between two vectors
  • Small angle: very similar
  • Large angle: little similarity
  • Passes some key sanity checks
  • Depends on pattern of word use but not on length
  • Every document is most similar to itself

40
Okapi Term Weights
TF component
IDF component
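The formula images did not survive transcription; the standard Okapi BM25 components (an assumed reconstruction of what the slide shows, with k1 and b the usual tuning constants, dl the document length, and avdl the average document length) are:

```latex
\mathrm{TF}_{i,j}
  = \frac{(k_1 + 1)\,\mathit{tf}_{i,j}}
         {k_1\bigl[(1-b) + b\,\tfrac{\mathit{dl}_j}{\mathit{avdl}}\bigr] + \mathit{tf}_{i,j}}
\qquad
\mathrm{IDF}_i
  = \log\frac{N - \mathit{df}_i + 0.5}{\mathit{df}_i + 0.5}
```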
41
Passage Retrieval
  • Another approach to the long-document problem
  • Break it up into coherent units
  • Recognizing topic boundaries is hard
  • But overlapping 300-word passages work fine
  • Document rank is best passage rank
  • And passage information can help guide browsing
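A minimal Python sketch of this strategy (the 50% overlap and the helper names are illustrative assumptions; the slide only says that overlapping 300-word passages work well):

```python
def overlapping_passages(tokens, size=300, step=150):
    """Split a token list into fixed-size passages that overlap by size - step words."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(0, len(tokens) - step, step)]

def best_passage_score(tokens, query, score_fn):
    """Rank a document by the score of its best-scoring passage."""
    return max(score_fn(passage, query) for passage in overlapping_passages(tokens))
```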

42
Summary
  • Goal: find documents most similar to the query
  • Compute normalized document term weights
  • Some combination of TF, DF, and Length
  • Optionally, get query term weights from the user
  • Estimate of term importance
  • Compute inner product of query and doc vectors
  • Multiply corresponding elements and then add

43
Before You Go!
  • On a sheet of paper, please briefly answer the
    following question (no names)
  • What was the muddiest point in today's lecture?