1
Information Retrieval using the Boolean Model
2
Query
  • Which plays of Shakespeare contain the words
    Brutus AND Caesar but NOT Calpurnia?
  • Could grep all of Shakespeare's plays for Brutus
    and Caesar, then strip out lines containing
    Calpurnia?
  • Slow (for large corpora)
  • NOT Calpurnia is non-trivial
  • Other operations (e.g., find the phrase "Romans
    and countrymen") not feasible

3
Term-document incidence
1 if play contains word, 0 otherwise
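A small illustration, reconstructed to be consistent with the
vectors on the next slide (one column per play, e.g. Antony and
Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth):

    Brutus     1 1 0 1 0 0
    Caesar     1 1 0 1 1 1
    Calpurnia  0 1 0 0 0 0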
4
Incidence vectors
  • So we have a 0/1 vector for each term.
  • To answer the query: take the vectors for Brutus,
    Caesar and Calpurnia (complemented), then bitwise
    AND.
  • 110100 AND 110111 AND 101111 = 100100.
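A minimal sketch of this in Python, using the reconstructed
vectors above (integers stand in for bit vectors):

    # Incidence vectors as integers, one bit per play.
    brutus    = 0b110100
    caesar    = 0b110111
    calpurnia = 0b010000

    # Brutus AND Caesar AND (NOT Calpurnia), masked to 6 plays.
    result = brutus & caesar & (~calpurnia & 0b111111)
    print(format(result, "06b"))  # -> 100100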

5
Answers to query
  • Antony and Cleopatra, Act III, Scene ii
  • Agrippa [Aside to DOMITIUS ENOBARBUS]: Why,
    Enobarbus,
    When Antony found Julius Caesar dead,
    He cried almost to roaring; and he wept
    When at Philippi he found Brutus slain.
  • Hamlet, Act III, Scene ii
  • Lord Polonius: I did enact Julius Caesar: I was
    killed i' the Capitol; Brutus killed me.

6
Bigger document collections
  • Consider N = 1 million documents, each with about
    1K terms.
  • Avg 6 bytes/term, incl. spaces/punctuation
  • 6GB of data in the documents.
  • Say there are M = 500K distinct terms among these.

7
Can't build the matrix
  • 500K x 1M matrix has half-a-trillion 0s and 1s.
  • But it has no more than one billion 1s. (Why?)
  • So the matrix is extremely sparse.
  • What's a better representation?
  • We only record the 1 positions.
8
Inverted index
  • For each term T store a list of all documents
    that contain T.
  • Do we use an array or a list for this?

[Figure: postings lists for Brutus, Calpurnia, and
Caesar; Caesar's list is 13 → 16.]
What happens if the word Caesar is added to
document 14?
9
Inverted index
  • Linked lists generally preferred to arrays
  • Dynamic space allocation
  • Insertion of terms into documents easy
  • Space overhead of pointers

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128
Calpurnia → 2 → 3 → 5 → 8 → 13 → 21 → 34
Caesar → 1 → 13 → 16
Sorted by docID (more later on why).
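A minimal sketch of this structure in Python (ordinary sorted
lists standing in for linked lists):

    from bisect import insort

    # Inverted index: term -> postings list of docIDs, sorted.
    index = {
        "brutus":    [2, 4, 8, 16, 32, 64, 128],
        "calpurnia": [2, 3, 5, 8, 13, 21, 34],
        "caesar":    [1, 13, 16],
    }

    # Adding "caesar" to document 14 is a sorted insertion.
    insort(index["caesar"], 14)   # -> [1, 13, 14, 16]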
10
Inverted index construction
[Figure: documents to be indexed, e.g. "Friends,
Romans, countrymen.", feeding the indexer.]
11
Indexer steps
  • Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar: I was killed i'
the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus
hath told you Caesar was ambitious.
12
  • Sort by terms.

Core indexing step.
13
  • Multiple term entries in a single document are
    merged.
  • Frequency information is added.

Why frequency? Will discuss later.
14
  • The result is split into a Dictionary file and a
    Postings file.
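A toy Python rendering of slides 11-14 (assuming whitespace
tokenization and lowercasing; the dictionary keeps term -> doc
frequency, the postings keep docID lists):

    from collections import defaultdict

    docs = {
        1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
        2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
    }

    # (token, docID) pairs, sorted by term then docID.
    pairs = sorted((tok, d) for d, text in docs.items() for tok in text.split())

    postings = defaultdict(list)      # term -> sorted docID list
    for term, d in pairs:             # merge duplicate entries per doc
        if not postings[term] or postings[term][-1] != d:
            postings[term].append(d)
    dictionary = {t: len(p) for t, p in postings.items()}  # term -> doc freq

    print(dictionary["caesar"], postings["caesar"])   # 2 [1, 2]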

15
  • Where do we pay in storage?

[Figure: the dictionary stores the terms; the
postings store the pointers (docIDs).]
16
The index we just built
Today's focus
  • How do we process a Boolean query?
  • Later - what kinds of queries can we process?

17
Query processing
  • Consider processing the query
  • Brutus AND Caesar
  • Locate Brutus in the Dictionary
  • Retrieve its postings.
  • Locate Caesar in the Dictionary
  • Retrieve its postings.
  • Merge the two postings

[Figure: postings lists for Brutus (ending in 128)
and Caesar (ending in 34).]
18
The merge
  • Walk through the two postings simultaneously, in
    time linear in the total number of postings
    entries

[Figure: walking the two lists yields the common
docIDs.]
If the list lengths are x and y, the merge takes
O(x+y) operations. Crucial: postings sorted by
docID.
19
Basic postings intersection
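The intersection pseudocode that appeared on this slide is lost
in the transcript; a standard two-pointer version in Python
looks roughly like this:

    def intersect(p1, p2):
        # Intersect two postings lists sorted by docID, in O(x+y).
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])   # docID present in both lists
                i, j = i + 1, j + 1
            elif p1[i] < p2[j]:
                i += 1                 # advance the smaller docID
            else:
                j += 1
        return answer

    # intersect([2, 4, 8, 16, 32, 64, 128],
    #           [1, 2, 3, 5, 8, 13, 21, 34])  -> [2, 8]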
20
Boolean queries: Exact match
  • Queries using AND, OR and NOT together with query
    terms
  • Views each document as a set of words
  • Is precise: document matches condition or not.
  • Primary commercial retrieval tool for 3 decades.
  • Professional searchers (e.g., lawyers) still like
    Boolean queries
  • You know exactly what you're getting.

21
Example: WestLaw (http://www.westlaw.com/)
  • Largest commercial (paying subscribers) legal
    search service (started 1975; ranking added 1992)
  • About 7 terabytes of data; 700,000 users
  • Majority of users still use Boolean queries
  • Example query:
  • What is the statute of limitations in cases
    involving the federal tort claims act?
  • LIMIT! /3 STATUTE ACTION /s FEDERAL /2 TORT /3
    CLAIM
  • Long, precise queries; proximity operators;
    incrementally developed; not like web search

22
Query optimization
  • What is the best order for query processing?
  • Consider a query that is an AND of t terms.
  • For each of the t terms, get its postings, then
    AND together.

[Figure: postings lists for Brutus, Calpurnia, and
Caesar.]
Query: Brutus AND Calpurnia AND Caesar
23
Query optimization example
  • Process in order of increasing freq:
  • start with the smallest set, then keep cutting
    further.

This is why we kept freq in the dictionary.
Execute the query as (Caesar AND Brutus) AND
Calpurnia.
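A sketch of this ordering, reusing intersect from slide 19 and a
dictionary of document frequencies (names follow the earlier toy
code, not the slides):

    def intersect_many(terms, dictionary, postings):
        # Start with the rarest term; the result can only shrink.
        terms = sorted(terms, key=lambda t: dictionary.get(t, 0))
        result = postings[terms[0]]
        for t in terms[1:]:
            result = intersect(result, postings[t])
            if not result:
                break          # empty intersection: stop early
        return result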
24
Query optimization
25
More general optimization
  • e.g., (madding OR crowd) AND (ignoble OR strife)
  • Get freqs for all terms.
  • Estimate the size of each OR by the sum of its
    freqs (conservative).
  • Process in increasing order of OR sizes.
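A possible helper for this estimate (hypothetical, continuing
the sketch above):

    def or_size_estimate(or_terms, dictionary):
        # Upper bound on |t1 OR t2 OR ...|: sum of doc freqs.
        return sum(dictionary.get(t, 0) for t in or_terms)

    # Process the ORs of (madding OR crowd) AND (ignoble OR strife)
    # in increasing order of or_size_estimate.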

26
Exercise
  • Recommend a query processing order for

(tangerine OR trees) AND (marmalade OR skies)
AND (kaleidoscope OR eyes)
27
Beyond Boolean term search
  • What about phrases?
  • Proximity: Find Gates NEAR Microsoft.
  • Need index to capture position information in
    docs. More later.
  • Zones in documents: Find documents with (author =
    Ullman) AND (text contains automata).

28
Evidence accumulation
  • 1 vs. 0 occurrence of a search term
  • 2 vs. 1 occurrence
  • 3 vs. 2 occurrences, etc.
  • Need term frequency information in docs.
  • Used to compute a score for each document
  • Matching documents rank-ordered by this score.
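A minimal version of such a score (an illustrative sketch, not
the scoring functions covered later in the course):

    def score(query_terms, term_freqs):
        # More occurrences of a query term -> higher score.
        return sum(term_freqs.get(t, 0) for t in query_terms)

    # Rank matching docs by score, highest first:
    # ranked = sorted(docs, key=lambda d: score(q, tf[d]), reverse=True)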

29
Evaluating search engines
30
Measures for a search engine
  • How fast does it index?
  • Number of documents/hour
  • (Average document size)
  • How fast does it search?
  • Latency as a function of index size
  • Expressiveness of query language
  • Speed on complex queries

31
Measures for a search engine
  • All of the preceding criteria are measurable: we
    can quantify speed/size; we can make
    expressiveness precise
  • The key measure: user happiness
  • What is this?
  • Speed of response/size of index are factors
  • But blindingly fast, useless answers won't make a
    user happy
  • Need a way of quantifying user happiness

32
Measuring user happiness
  • Issue: who is the user we are trying to make
    happy?
  • Depends on the setting
  • Web engine: user finds what they want and returns
    to the engine
  • Can measure rate of return users
  • eCommerce site: user finds what they want and
    makes a purchase
  • Is it the end-user, or the eCommerce site, whose
    happiness we measure?
  • Measure time to purchase, or fraction of
    searchers who become buyers?

33
Measuring user happiness
  • Enterprise (company/govt/academic): care about
    user productivity
  • How much time do my users save when looking for
    information?
  • Many other criteria having to do with breadth of
    access, secure access; more later

34
Happiness: elusive to measure
  • Most common proxy: relevance of search results
  • But how do you measure relevance?
  • Will detail a methodology here, then examine its
    issues
  • Requires 3 elements:
  • A benchmark document collection
  • A benchmark suite of queries
  • A binary assessment of either Relevant or
    Irrelevant for each query-doc pair

35
Evaluating an IR system
  • Note: the information need is translated into a
    query
  • Relevance is assessed relative to the information
    need, not the query
  • E.g., information need: I'm looking for
    information on whether drinking red wine is more
    effective at reducing your risk of heart attacks
    than white wine.
  • Query: wine red white heart attack effective

36
Standard relevance benchmarks
  • TREC - the National Institute of Standards and
    Technology (NIST) has run a large IR benchmark
    for many years
  • Reuters and other benchmark doc collections used
  • Retrieval tasks specified
  • sometimes as queries
  • Human experts mark, for each query and for each
    doc, Relevant or Irrelevant
  • or at least for the subset of docs that some
    system returned for that query

37
Precision and Recall
  • Precision: fraction of retrieved docs that are
    relevant = P(relevant|retrieved)
  • Recall: fraction of relevant docs that are
    retrieved = P(retrieved|relevant)
  • Precision P = tp/(tp + fp)
  • Recall R = tp/(tp + fn)

                   Relevant   Not Relevant
  Retrieved           tp           fp
  Not Retrieved       fn           tn
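A quick worked check of the formulas (hypothetical counts for a
single query):

    tp, fp, fn = 40, 10, 60

    precision = tp / (tp + fp)   # 40/50  = 0.8
    recall    = tp / (tp + fn)   # 40/100 = 0.4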
38
Accuracy: a different measure
  • Given a query, an engine classifies each doc as
    Relevant or Irrelevant.
  • Accuracy of an engine: the fraction of these
    classifications that is correct, i.e.
    (tp + tn)/(tp + fp + fn + tn).

39
Why not just use accuracy?
  • How to build a 99.9999% accurate search engine on
    a low budget:
  • People doing information retrieval want to find
    something and have a certain tolerance for junk.

[Mock screenshot: Snoogle.com answers every search
with "0 matching results found."]
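To see why this is easy, a rough calculation with illustrative
numbers (not from the slides): if only 10 of 1,000,000 docs are
relevant to a query, an engine that returns nothing has
tp = fp = 0, fn = 10, tn = 999,990, so accuracy =
(tp + tn)/(tp + fp + fn + tn) = 999,990/1,000,000 = 99.999%,
even though recall is 0.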
40
Precision/Recall
  • Can get high recall (but low precision) by
    retrieving all docs for all queries!
  • Recall is a non-decreasing function of the number
    of docs retrieved
  • Precision usually decreases (in a good system)

41
Difficulties in using precision/recall
  • Should average over large corpus/query ensembles
  • Need human relevance assessments
  • People aren't reliable assessors
  • Assessments have to be binary
  • Nuanced assessments?
  • Heavily skewed by corpus/authorship
  • Results may not translate from one domain to
    another

42
Information Retrieval
Prabhakar Raghavan, Yahoo! Research
  • Lecture 1
  • From Chapters 1, 8 of IIR