Introduction to Information Retrieval (cont.): Boolean Model presentation

1
Introduction to Information Retrieval (cont.)
Boolean Model

University of California, Berkeley
School of Information Management and Systems
SIMS 202 Information Organization and Retrieval
Lecture authors: Marti Hearst, Ray Larson

2
The Standard Retrieval Interaction Model
3
IR is an Iterative Process
4
A sketch of a searcher moving through many
actions towards a general goal of satisfactory
completion of research related to an information
need. (after Bates 89)
[Figure: the searcher's path through successive queries Q0, Q1, Q2, Q3, Q4, Q5]
5
Restricted Form of the IR Problem

The system has available only pre-existing,
canned text passages.
Its response is limited to selecting from these
passages and presenting them to the user.
It must select, say, 10 or 20 passages out of
millions or billions!

6
Information Retrieval

Revised Task Statement
Build a system that retrieves documents that
users are likely to find relevant to their
queries.
This set of assumptions underlies the field of
Information Retrieval.

7
Some IR History

Roots in the scientific Information Explosion
following WWII
Interest in computer-based IR from mid 1950s
H.P. Luhn at IBM (1958)
Probabilistic models at Rand (Maron & Kuhns)
(1960)
Boolean system development at Lockheed (60s)
Vector Space Model (Salton at Cornell 1965)
Statistical Weighting methods and theoretical
advances (70s)
Refinements and Advances in application (80s)
User Interfaces, Large-scale testing and
application (90s)

8
Structure of an IR System
[Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19]
Search Line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store 1: Profiles / Search requests
Storage Line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store 2: Document representations
Rules of the game: Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
Comparison / Matching of Store 1 and Store 2 yields Potentially Relevant Documents
12
Relevance (introduction)

In what ways can a document be relevant to a
query?
Answer precise question precisely.
Who is buried in Grant's tomb? Grant.
Partially answer question.
Where is Danville? Near Walnut Creek.
Suggest a source for more information.
What is lymphedema? Look in this medical
dictionary.
Give background information.
Remind the user of other knowledge.
Others ...
Ideally, IR systems should retrieve ALL and ONLY
the RELEVANT documents for a user.

13
Query Languages

A way to express the question (information need)
Types
Boolean
Natural Language
Stylized Natural Language
Form-Based (GUI)

14
Simple query language: Boolean

Terms and connectors (or operators)
terms
words
normalized (stemmed) words
phrases
thesaurus terms
connectors
AND
OR
NOT

15
Boolean Queries

Cat
Cat OR Dog
Cat AND Dog
(Cat AND Dog)
(Cat AND Dog) OR Collar
(Cat AND Dog) OR (Collar AND Leash)
(Cat OR Dog) AND (Collar OR Leash)
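
The queries above can be read as set operations: AND is intersection and OR is union over the sets of documents containing each term. A minimal sketch, assuming a made-up five-document collection (the `DOCS` table and the `postings` helper are illustrative names, not part of the lecture):

```python
# Hypothetical toy collection: document id -> set of index terms.
DOCS = {
    1: {"cat"},
    2: {"cat", "dog"},
    3: {"dog", "collar"},
    4: {"collar", "leash"},
    5: {"cat", "leash"},
}

def postings(term):
    """Return the ids of all documents that contain `term`."""
    return {doc_id for doc_id, terms in DOCS.items() if term in terms}

cat, dog = postings("cat"), postings("dog")
collar, leash = postings("collar"), postings("leash")

# AND is set intersection (&); OR is set union (|).
print(sorted(cat))                             # Cat
print(sorted(cat | dog))                       # Cat OR Dog
print(sorted(cat & dog))                       # Cat AND Dog
print(sorted((cat & dog) | collar))            # (Cat AND Dog) OR Collar
print(sorted((cat & dog) | (collar & leash)))  # (Cat AND Dog) OR (Collar AND Leash)
print(sorted((cat | dog) & (collar | leash)))  # (Cat OR Dog) AND (Collar OR Leash)
```

Note how moving the parentheses changes the answer: the last two queries use the same four terms but retrieve different documents.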

16
Boolean Queries

(Cat OR Dog) AND (Collar OR Leash)
Each of the following combinations works
Cat x x x x
Dog x x x x x
Collar x x x x
Leash x x x x

17
Boolean Queries

(Cat OR Dog) AND (Collar OR Leash)
None of the following combinations work
Cat     x     x
Dog        x  x
Collar           x     x
Leash               x  x
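
The working and non-working combinations for this query can be checked by brute force over all sixteen presence/absence patterns of the four terms. This is only a sketch; the names `matches`, `works`, and `fails` are made up for the illustration:

```python
from itertools import product

TERMS = ("Cat", "Dog", "Collar", "Leash")

def matches(cat, dog, collar, leash):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) for one document."""
    return (cat or dog) and (collar or leash)

works, fails = [], []
for present in product([False, True], repeat=4):
    # Names of the terms that occur in this hypothetical document.
    combo = tuple(t for t, p in zip(TERMS, present) if p)
    (works if matches(*present) else fails).append(combo)

print(len(works), "combinations satisfy the query")
print(len(fails), "combinations do not:")
for combo in fails:
    print("  ", " + ".join(combo) or "(no terms)")
```

A document fails exactly when it lacks both Cat and Dog, or lacks both Collar and Leash.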

18
Boolean Logic
[Figure: sets A and B]
19
Boolean Queries

Usually expressed as INFIX operators in IR:
((a AND b) OR (c AND b))
NOT is a UNARY PREFIX operator:
((a AND b) OR (c AND (NOT b)))
AND and OR can be n-ary operators:
(a AND b AND c AND d)
Some rules (De Morgan revisited):
NOT(a) AND NOT(b) = NOT(a OR b)
NOT(a) OR NOT(b) = NOT(a AND b)
NOT(NOT(a)) = a
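
The De Morgan rules can be verified exhaustively for truth values, and they carry over to document sets when NOT is taken as the complement within the collection. A small sketch; the collection `ALL_DOCS` and the sets `A` and `B` are invented for the example:

```python
from itertools import product

# Check the three rules for every combination of truth values.
for a, b in product([False, True], repeat=2):
    assert ((not a) and (not b)) == (not (a or b))
    assert ((not a) or (not b)) == (not (a and b))
    assert (not (not a)) == a

# The same rules on document sets.
ALL_DOCS = set(range(1, 9))   # hypothetical collection of 8 document ids
A = {1, 2, 3, 4}              # documents containing term a (assumed)
B = {3, 4, 5, 6}              # documents containing term b (assumed)

def NOT(docs):
    """Complement of a document set within the collection."""
    return ALL_DOCS - docs

print(NOT(A) & NOT(B) == NOT(A | B))
print(NOT(A) | NOT(B) == NOT(A & B))
print(NOT(NOT(A)) == A)
```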

20
Boolean Logic
[Figure: Venn diagram of three terms t1, t2, t3. The eight regions are labeled m1 through m8, one for each combination of t1, t2, t3 present or absent; documents D1 through D11 are placed in the regions.]
21
Boolean Searching
Information need: Measurement of the width of cracks in
prestressed concrete beams
Concepts: Cracks (C), Beams (B), Width measurement (W),
Prestressed concrete (P)
Formal Query: cracks AND beams AND
Width_measurement AND Prestressed_concrete
Relaxed Query: (C AND B AND P) OR (C AND B AND
W) OR (C AND W AND P) OR (B AND W AND P)
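
The relaxed query is the OR of every three-concept AND, which amounts to "at least three of the four concepts are present". A sketch under that reading; the function names and the sample document are hypothetical:

```python
from itertools import combinations

CONCEPTS = ("cracks", "beams", "width_measurement", "prestressed_concrete")

def formal_match(doc_terms):
    """Formal query: all four concepts must be present."""
    return all(t in doc_terms for t in CONCEPTS)

def relaxed_match(doc_terms):
    """Relaxed query: OR over every 3-concept AND, spelled out."""
    return any(all(t in doc_terms for t in triple)
               for triple in combinations(CONCEPTS, 3))

def at_least_three(doc_terms):
    """Equivalent shortcut: count the concepts that are present."""
    return sum(t in doc_terms for t in CONCEPTS) >= 3

doc = {"cracks", "beams", "prestressed_concrete"}   # hypothetical document
print(formal_match(doc), relaxed_match(doc), at_least_three(doc))
```

The document above lacks the width-measurement concept, so it is rejected by the formal query but accepted by the relaxed one.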
22
Pseudo-Boolean Queries

A new notation, from web search:
cat dog collar leash
Does not mean the same thing!
Need a way to group combinations.
Phrases:
"stray cat" AND "frayed collar"
"stray cat" "frayed collar"

23
[Figure: retrieval process diagram. Labels: Information need, Collections, text input]
24
Result Sets

Run a query, get a result set
Two choices:
Reformulate query, run on entire collection
Reformulate query, run on result set
Example: Dialog query
(Redford AND Newman)
-> S1: 1450 documents
(S1 AND Sundance)
-> S2: 898 documents
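
Running a reformulated query on a result set can be sketched by saving each result under a name and intersecting it with further terms. The postings in `INDEX` and the `save` helper are invented for the illustration and are far smaller than the counts in the dialog above:

```python
# Hypothetical postings: term -> set of document ids.
INDEX = {
    "redford":  {1, 2, 3, 4, 5},
    "newman":   {2, 3, 4, 5, 6},
    "sundance": {3, 5, 7},
}

result_sets = {}   # saved result sets, keyed by name (S1, S2, ...)

def save(name, docs):
    """Store a result set under a name and report its size."""
    result_sets[name] = docs
    print(f"-> {name}: {len(docs)} documents")
    return docs

s1 = save("S1", INDEX["redford"] & INDEX["newman"])      # (Redford AND Newman)
s2 = save("S2", result_sets["S1"] & INDEX["sundance"])   # (S1 AND Sundance)
```

Because S2 is computed from S1, it can only shrink the set; running the reformulated query on the entire collection instead would start again from the full postings.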

25
[Figure: retrieval process diagram. Labels: Information need, Collections, text input, Reformulated Query]
26
Ordering of Retrieved Documents

Pure Boolean has no ordering
In practice
order chronologically
order by total number of hits on query terms
What if one term has more hits than others?
Is it better to have one of each term or many of
one term?
Fancier methods have been investigated
p-norm is most famous
usually impractical to implement
usually hard for user to understand
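
For reference, the p-norm idea can be sketched in a few lines. The formulas below are the commonly cited form with all query terms weighted equally, which is an assumption of this example, and the document weights are made up. With p = 1, AND and OR give the same score (a plain average); as p grows, OR approaches the maximum weight and AND approaches the minimum, which is strict Boolean behavior.

```python
def pnorm_or(weights, p):
    """p-norm score of an OR query; weights are term weights in [0, 1]."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1 / p)

def pnorm_and(weights, p):
    """p-norm score of an AND query; weights are term weights in [0, 1]."""
    n = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / n) ** (1 / p)

doc = [0.9, 0.1]   # hypothetical weights of two query terms in one document
for p in (1, 2, 10):
    print(p, round(pnorm_or(doc, p), 3), round(pnorm_and(doc, p), 3))
```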

27
Boolean

Advantages
simple queries are easy to understand
relatively easy to implement
Disadvantages
difficult to specify what is wanted
too much returned, or too little
ordering not well determined
Dominant language in commercial systems until the
WWW

28
Faceted Boolean Query

Strategy: break query into facets (polysemous
with earlier meaning of facets)
conjunction of disjunctions:
(a1 OR a2 OR a3)
AND (b1 OR b2)
AND (c1 OR c2 OR c3 OR c4)
each facet expresses a topic:
("rain forest" OR jungle OR amazon)
AND (medicine OR remedy OR cure)
AND (Smith OR Zhou)
29
Faceted Boolean Query

Query still fails if one facet missing
Alternative: coordination-level ranking
Order results in terms of how many facets
(disjunctions) are satisfied
Also called Quorum ranking, Overlap ranking, and
Best Match
Problem: facets still undifferentiated
Alternative: assign weights to facets
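
Coordination-level ranking can be sketched by counting, for each document, how many facets have at least one matching term. The facets reuse the rain forest example; the documents, the function names, and the facet weights are hypothetical:

```python
FACETS = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

DOCS = {   # hypothetical documents, already reduced to index terms
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"amazon", "remedy"},
    "d3": {"smith"},
    "d4": {"weather"},
}

def coordination_level(doc_terms, facets=FACETS):
    """Number of facets with at least one term in the document."""
    return sum(1 for facet in facets if facet & doc_terms)

def strict_match(doc_terms, facets=FACETS):
    """Plain faceted Boolean query: every facet must be satisfied."""
    return coordination_level(doc_terms, facets) == len(facets)

def weighted_level(doc_terms, weights, facets=FACETS):
    """Variant: each satisfied facet contributes its own weight."""
    return sum(w for facet, w in zip(facets, weights) if facet & doc_terms)

ranked = sorted(
    (d for d in DOCS if coordination_level(DOCS[d]) > 0),
    key=lambda d: coordination_level(DOCS[d]),
    reverse=True,
)
print(ranked)
```

Only d1 satisfies the strict query, but d2 and d3 are still returned, ordered by how many facets they satisfy.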

30
Proximity Searches

Proximity: terms occur within K positions of one
another
pen w/5 paper
A Near function can be more vague
near(pen, paper)
Sometimes order can be specified
Also, Phrases and Collocations
"United Nations", "Bill Clinton"
Phrase Variants
"retrieval of information", "information
retrieval"
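
A proximity operator such as `pen w/5 paper` can be sketched over token positions. This example assumes that "within K positions" means the position difference is at most K; the helper names and the sample sentence are made up:

```python
def positions(tokens, term):
    """Indexes at which `term` occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within(tokens, a, b, k, ordered=False):
    """True if some occurrence of `a` and `b` are at most k positions apart.

    With ordered=True, `a` must come before `b`.
    """
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            dist = j - i
            if ordered and dist <= 0:
                continue
            if 0 < abs(dist) <= k:
                return True
    return False

text = "she put the pen down next to the paper".split()
print(within(text, "pen", "paper", 5))                 # pen w/5 paper
print(within(text, "paper", "pen", 5, ordered=True))   # order specified
print(within("the united nations".split(), "united", "nations", 1, ordered=True))
```

Under this reading, a two-word phrase is the special case of an ordered proximity search with K = 1.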

31
Filters

Filters: reduce the set of candidate docs
Often specified simultaneously with the query
Usually restrictions on metadata
restrict by:
date range
internet domain (.edu .com .berkeley.edu)
author
size
limit number of documents returned
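
Filters of this kind can be sketched as predicates over document metadata, applied to the candidate set. The `Doc` record, its field names, and the sample documents are assumptions made for the example:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    doc_id: int
    author: str
    domain: str
    published: date
    size: int   # size in bytes

CANDIDATES = [   # hypothetical candidate documents
    Doc(1, "author_a", "sims.berkeley.edu", date(1999, 9, 1), 1200),
    Doc(2, "author_b", "berkeley.edu", date(1998, 1, 15), 800),
    Doc(3, "author_c", "example.com", date(1999, 5, 5), 50000),
]

def apply_filters(docs, *, domain_suffix=None, author=None,
                  start=None, end=None, max_size=None, limit=None):
    """Keep documents that pass every filter that was specified."""
    kept = []
    for d in docs:
        if domain_suffix is not None and not d.domain.endswith(domain_suffix):
            continue
        if author is not None and d.author != author:
            continue
        if start is not None and d.published < start:
            continue
        if end is not None and d.published > end:
            continue
        if max_size is not None and d.size > max_size:
            continue
        kept.append(d)
    return kept[:limit]   # limit=None keeps everything

edu_1999 = apply_filters(CANDIDATES, domain_suffix=".edu",
                         start=date(1999, 1, 1), end=date(1999, 12, 31))
print([d.doc_id for d in edu_1999])
```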

32
How are the texts handled?

What happens if you take the words exactly as
they appear in the original text?
What about punctuation, capitalization, etc.?
What about spelling errors?
What about plural vs. singular forms of words?
What about cases and declension in non-English
languages?
What about non-Roman alphabets?

33
Content Analysis

Automated Transformation of raw text into a form
that represents some aspect(s) of its meaning
Including, but not limited to:
Automated Thesaurus Generation
Phrase Detection
Categorization
Clustering
Summarization

34
Techniques for Content Analysis

Statistical
Single Document
Full Collection
Linguistic
Syntactic
Semantic
Pragmatic
Knowledge-Based (Artificial Intelligence)
Hybrid (Combinations)

35
Text Processing

Standard Steps
Recognize document structure
titles, sections, paragraphs, etc.
Break into tokens
usually space and punctuation delineated
special issues with Asian languages
Stemming/morphological analysis
Store in inverted index (to be discussed later)
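
Two of the steps above, breaking text into tokens and storing them in an inverted index, can be sketched as follows. The tokenizer here simply lowercases and splits on anything that is not a letter or digit, which is an assumption of the example; structure recognition and stemming are left out, and the sample documents are invented:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase, then split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

DOCS = {   # hypothetical two-document collection
    1: "Cats and dogs.",
    2: "The dog's collar; the CAT's leash.",
}
index = build_index(DOCS)
print(sorted(index["dog"]), sorted(index["dogs"]))
```

Without stemming, "dog" and "dogs" end up as separate index entries pointing at different documents, which is the problem the stemming step addresses.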

36
[Figure: retrieval process diagram. Labels: Information need, Collections, text input. Annotations: How is the query constructed? How is the text processed?]
37
Document Processing Steps
38
Stemming and Morphological Analysis

Goal: normalize similar words
Morphology (form of words)
Inflectional Morphology
E.g., inflect verb endings and noun number
Never change grammatical class
dog, dogs
tengo, tienes, tiene, tenemos, tienen
Derivational Morphology
Derive one word from another
Often change grammatical class
build, building; health, healthy

39
Automated Methods

Powerful multilingual tools exist for
morphological analysis
PCKimmo, Xerox Lexical technology
Require a grammar and dictionary
Use two-level automata
Stemmers
Very dumb rules work well (for English)
Porter Stemmer: iteratively remove suffixes
Improvement: pass results through a lexicon
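
The idea of iterative suffix removal followed by a lexicon check can be sketched with a toy stemmer. This is not the Porter algorithm: the suffix list, the minimum stem length of three letters, and the tiny lexicon are all invented for the illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")           # hypothetical rule list
LEXICON = {"dog", "cat", "build", "health"}   # hypothetical word list

def strip_once(word):
    """Remove the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem(word):
    """Strip suffixes repeatedly until no rule applies."""
    word = word.lower()
    while True:
        shorter = strip_once(word)
        if shorter == word:
            return word
        word = shorter

def stem_with_lexicon(word, lexicon=LEXICON):
    """Accept the stem only if it is a known word; otherwise keep the input."""
    candidate = stem(word)
    return candidate if candidate in lexicon else word.lower()

for w in ("dogs", "buildings", "news"):
    print(w, stem(w), stem_with_lexicon(w))
```

The word "news" shows why the lexicon pass helps: the dumb rules reduce it to "new", and the lexicon check rejects that stem and keeps the original word.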
