Title: Query Models
1Query Models
- Use
- Types
- What do search engines do
2What we have covered
- What is IR
- Evaluation
- Tokenization and properties of text
- Web crawling
- This time
- Query models
3A Typical Web Search Engine
[Diagram: Users, Interface, Query Engine, Index, Indexer, Crawler, Web]
4Why the interest in Queries?
- Queries are ways we interact with IR systems
- Expression of an information need
- Nonquery methods?
- Types of queries?
5Issues with Query Structures
- Matching and ranking criteria
- Given a query, what documents are retrieved?
- In what order (rank)?
6Types of Query Structures
- Most common query models (languages)
- Boolean Queries
- Extended-Boolean Queries
- Natural Language Queries
- Vector queries
- Others?
7Simple query language: Boolean
- Earliest query model
- Terms and connectors (or operators)
- terms
- words
- normalized (stemmed) words
- phrases
- thesaurus terms
- connectors
- AND
- OR
- NOT
8Simple query language: Boolean
- Geek-speak
- Variations are still used in search engines!
- Ex: X AND Y, Y AND X
9Truth Tables: Boolean Logic
- Presence of P: P = 1
- Absence of P: P = 0
- True = 1, False = 0
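A minimal sketch of the truth tables for the three connectors, using the 1/0 encoding above. The function name `truth_table` and the second operand `Q` are illustrative assumptions, not part of the slides.

```python
from itertools import product

def truth_table():
    """Return rows (P, Q, P AND Q, P OR Q, NOT P), with 1 = true and 0 = false."""
    rows = []
    for p, q in product((0, 1), repeat=2):
        rows.append((p, q, p & q, p | q, 1 - p))
    return rows

print("P Q AND OR NOT_P")
for row in truth_table():
    print(*row)
```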
10Problems with Boolean Queries
- How do you express your need in a Boolean query? (geekspeak)
- No good way to weight terms for significance
- Want music by Beethoven, preferably a sonata
- Query?
- Ranking?
- Binary
11Problems with Boolean Queries
- Incorrect interpretation of Boolean connectives AND and OR
- Example: seeking Saturday entertainment
- Queries
- Dinner AND sports AND symphony
- Dinner OR sports OR symphony
- Dinner AND sports OR symphony
12Order of precedence of operators
- Example of query: is A AND B the same as B AND A?
- Why?
13Sample Boolean Queries
- Cat
- Cat OR Dog
- Cat AND Dog
- (Cat AND Dog)
- (Cat AND Dog) OR Collar
- (Cat AND Dog) OR (Collar AND Leash)
- (Cat OR Dog) AND (Collar OR Leash)
14Satisfaction of Boolean Query
- (Cat OR Dog) AND (Collar OR Leash)
- Each of the following column combinations works
- Cat x x x x
- Dog x x x x x
- Collar x x x x
- Leash x x x x
Others?
15Satisfaction of Boolean Query
- (Cat OR Dog) AND (Collar OR Leash)
- None of the following column combinations work
- Cat x x
- Dog x x
- Collar x x
- Leash x x
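The two slides above can be checked by brute force. The sketch below enumerates all 16 presence/absence combinations of the four terms and splits them into those that satisfy (Cat OR Dog) AND (Collar OR Leash) and those that do not. The helper names are assumptions made for illustration.

```python
from itertools import product

TERMS = ("Cat", "Dog", "Collar", "Leash")

def satisfies(cat, dog, collar, leash):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) for one document."""
    return (cat or dog) and (collar or leash)

def split_combinations():
    """Return (works, fails): term combinations that do / do not satisfy the query."""
    works, fails = [], []
    for present in product((False, True), repeat=len(TERMS)):
        names = tuple(t for t, p in zip(TERMS, present) if p)
        (works if satisfies(*present) else fails).append(names)
    return works, fails

works, fails = split_combinations()
print(len(works), "combinations satisfy the query")
print(len(fails), "combinations do not")
```

A document needs at least one term from each parenthesized group, so 3 x 3 = 9 of the 16 combinations work and the remaining 7 fail.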
16Boolean Logic
[Venn diagram: sets A and B]
17Order of Precedence
- Define order of precedence
- Ex: a OR b AND c
- Infix notation
- Parentheses evaluated first, with left-to-right precedence of operators
- Next NOTs are applied
- Then ANDs
- Then ORs
- a OR b AND c becomes
- a OR (b AND c)
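The precedence rules above (parentheses, then NOT, then AND, then OR) can be sketched as a small recursive-descent parser that returns a fully parenthesized query. This is an illustration only: the function name `parenthesize` is hypothetical, operators are assumed to be upper-case, and malformed input is not handled.

```python
import re

def parenthesize(query):
    """Fully parenthesize a Boolean query using NOT > AND > OR precedence."""
    tokens = re.findall(r"\(|\)|\w+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():  # lowest precedence
        parts = [parse_and()]
        while peek() == "OR":
            take()
            parts.append(parse_and())
        return parts[0] if len(parts) == 1 else "(" + " OR ".join(parts) + ")"

    def parse_and():
        parts = [parse_not()]
        while peek() == "AND":
            take()
            parts.append(parse_not())
        return parts[0] if len(parts) == 1 else "(" + " AND ".join(parts) + ")"

    def parse_not():  # unary prefix operator
        if peek() == "NOT":
            take()
            return "(NOT " + parse_not() + ")"
        return parse_atom()

    def parse_atom():  # a term or a parenthesized sub-query
        tok = take()
        if tok == "(":
            inner = parse_or()
            take()  # consume the closing ")"
            return inner
        return tok

    return parse_or()

print(parenthesize("a OR b AND c"))
```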
18Infix Notation
- Usually expressed as INFIX operators in IR
- ((a AND b) OR (c AND b))
- NOT is UNARY PREFIX operator
- ((a AND b) OR (c AND (NOT b)))
- AND and OR can be n-ary operators
- (a AND b AND c AND d)
- Some rules (De Morgan revisited)
- NOT(a) AND NOT(b) = NOT(a OR b)
- NOT(a) OR NOT(b) = NOT(a AND b)
- NOT(NOT(a)) = a
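These identities can be verified exhaustively, since each variable takes only two values. A minimal check (the function name `rules_hold` is an assumption):

```python
from itertools import product

def rules_hold():
    """Check De Morgan's laws and double negation for every truth assignment."""
    for a, b in product((False, True), repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True

print(rules_hold())
```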
19DNFs and CNFs
- All queries can be rewritten as
- Disjunctive Normal Forms (DNFs)
- Conjunctive Normal Forms (CNFs)
- DNF Constituents
- Terms (words or phrases)
- Conjuncts (terms joined by ANDs)
- Disjuncts (conjuncts joined by ORs)
- Ex: (A AND B) OR (A AND NOT C)
- CNF Constituents
- Terms (words or phrases)
- Disjuncts (terms joined by ORs)
- Conjuncts (disjuncts joined by ANDs)
- Ex: (A OR B) AND (A OR NOT C)
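One way to see that any query can be rewritten as a DNF is to read it off the truth table: every satisfying row becomes one conjunct, and the conjuncts are joined by ORs. The sketch below does this for the CNF example above. It produces the canonical (unsimplified) DNF, and the function name `to_dnf` is hypothetical.

```python
from itertools import product

def to_dnf(query, terms):
    """Build a canonical DNF string for `query`, a function of len(terms) booleans."""
    conjuncts = []
    for values in product((True, False), repeat=len(terms)):
        if query(*values):
            literals = [t if v else "NOT " + t for t, v in zip(terms, values)]
            conjuncts.append("(" + " AND ".join(literals) + ")")
    return " OR ".join(conjuncts)

def cnf_example(a, b, c):
    """The CNF example: (A OR B) AND (A OR NOT C)."""
    return (a or b) and (a or not c)

print(to_dnf(cnf_example, ["A", "B", "C"]))
```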
20Effect of CNFs
- All complex Boolean queries can be simplified
- Why do reference librarians like CNFs?
- ANDs reduce the size of the set returned and are easily expandable
- So do minuses
21Boolean Logic
[Venn diagram: documents D1-D11 placed in the regions formed by three term sets t1, t2, t3; each region corresponds to one of the eight minterms m1-m8 over t1, t2, t3]
22Boolean Searching
- Information need: measurement of the width of cracks in prestressed concrete beams
- Concepts: Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P)
- Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
- Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
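The relaxed query accepts any document that matches at least three of the four concepts C, B, W, P. A sketch that generates such a query for any "k of n" relaxation (the function name `relaxed_query` is an assumption):

```python
from itertools import combinations

def relaxed_query(concepts, k):
    """OR together every AND-group of k concepts ("at least k of n" must match)."""
    groups = combinations(concepts, k)
    return " OR ".join("(" + " AND ".join(g) + ")" for g in groups)

print(relaxed_query(["C", "B", "W", "P"], 3))
```

The groups come out in a different order than on the slide, but the query is logically the same. With k equal to the number of concepts, the result is the original formal query.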
23Pseudo-Boolean Queries
- A new notation, from web search
- cat dog collar leash
- Does not mean the same thing!
- Need a way to group combinations.
- Phrases
- "stray cat" AND "frayed collar"
- "stray cat" "frayed collar"
24Information need
[Diagram: text input, Collections]
25Result Sets
- Run a query, get a result set
- Two choices
- Reformulate query, run on entire collection
- Reformulate query, run on result set
- Example: Dialog query
- (Redford AND Newman)
- -> S1: 1450 documents
- (S1 AND Sundance)
- -> S2: 898 documents
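The second choice, running the reformulated query on the previous result set, amounts to a set intersection. The sketch below uses a toy inverted index with made-up document ids; the `search` helper and its `within` parameter are hypothetical and do not reflect Dialog's actual interface.

```python
def search(index, terms, within=None):
    """AND together the postings of `terms`, optionally inside a prior result set."""
    result = set.intersection(*(index.get(t, set()) for t in terms))
    return result if within is None else result & within

# Toy inverted index: term -> set of document ids (illustrative data only).
index = {
    "redford": {1, 2, 3, 4},
    "newman": {2, 3, 4, 5},
    "sundance": {3, 4, 6},
}

s1 = search(index, ["redford", "newman"])   # Redford AND Newman
s2 = search(index, ["sundance"], within=s1)  # S1 AND Sundance
print(sorted(s1), sorted(s2))
```

For a pure AND refinement, searching within S1 gives the same documents as rerunning the longer query on the whole collection; the result-set form just avoids repeating the earlier work.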
26Information need
[Diagram: text input, Collections, Reformulated Query]
27Ordering (ranking) of Retrieved Documents
- Pure Boolean has no ordering
- Term is there or it's not
- In practice
- order chronologically
- order by total number of hits on query terms
- What if one term has more hits than others?
- Is it better to have one of each term or many of
one term?
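The last two questions can be made concrete by comparing two orderings: one by total number of hits, one by how many distinct query terms appear. The documents and the helper `hit_stats` below are invented for illustration.

```python
from collections import Counter

def hit_stats(doc_text, query_terms):
    """Return (total hits on query terms, number of distinct query terms present)."""
    counts = Counter(doc_text.lower().split())
    hits = [counts[t] for t in query_terms]
    return sum(hits), sum(1 for h in hits if h > 0)

docs = {
    "d1": "cat cat cat cat collar",  # many hits on one term
    "d2": "cat dog collar leash",    # one hit on each term
}
query = ["cat", "dog", "collar", "leash"]

by_total = sorted(docs, key=lambda d: hit_stats(docs[d], query)[0], reverse=True)
by_coverage = sorted(docs, key=lambda d: hit_stats(docs[d], query)[1], reverse=True)
print(by_total, by_coverage)
```

The two criteria rank the same pair of documents in opposite order, which is why "ordering not well determined" is a real problem for Boolean systems.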
28Boolean Query - Summary
- Advantages
- simple queries are easy to understand
- relatively easy to implement
- Disadvantages
- difficult to specify what is wanted
- too much returned, or too little
- ordering not well determined
- Dominant language in commercial systems until the
WWW
29Vector Space Model
- Documents and queries are represented as vectors in term space
- Terms are usually stems
- Documents represented by binary vectors of terms
- Queries represented the same as documents
- Query and Document weights are based on length and direction of their vector
- A vector distance measure between the query and documents is used to rank retrieved documents
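A sketch of ranking with one common vector measure, the cosine of the angle between query and document vectors. The slides do not name a specific measure, so the choice of cosine, the toy binary vectors, and the function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return document names ordered from most to least similar to the query."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)

# Binary term vectors over a three-term vocabulary (illustrative data).
docs = {"D1": [1, 1, 0], "D2": [0, 1, 1], "D3": [1, 0, 0]}
print(rank([1, 1, 0], docs))
```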
30Document Vectors
- Documents are represented as bags of words
- Words are terms with no order
- Represented as vectors when used computationally
- A vector is like an array of floating point values
- Has direction and magnitude
- Each vector holds a place for every term in the collection
- Therefore, most vectors are sparse
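Because most entries are zero, a bag of words is often stored sparsely (only the terms that occur) and expanded to a full array only when needed. A minimal sketch, with a made-up vocabulary and hypothetical helper names:

```python
from collections import Counter

def bag_of_words(text):
    """Sparse representation: term -> count, with word order discarded."""
    return Counter(text.lower().split())

def to_dense(bag, vocabulary):
    """Expand a sparse bag into one float per vocabulary term."""
    return [float(bag[t]) for t in vocabulary]

vocabulary = ["dog", "house", "white", "cat", "tree"]  # illustrative
bag = bag_of_words("white dog white house")
print(dict(bag))
print(to_dense(bag, vocabulary))
```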
31Queries
- Vocabulary (dog, house, white)
- Queries
- dog (1,0,0)
- house (0,1,0)
- white (0,0,1)
- house and dog (1,1,0)
- dog and house (1,1,0)
- Show 3-D space plot
32Documents (queries) in Vector Space
[3-D plot: documents D1-D11 positioned in the space spanned by term axes t1, t2, t3]
33Documents in 3D Space
Assumption: Documents that are close together
in space are similar in meaning.
34Vector Query Problems
- Significance of queries
- Can different values be placed on the different terms, e.g. 2 dog, 1 house?
- Scaling size of vectors
- Number of words in the dictionary?
- 100,000
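A sketch of a weighted query over the three-term vocabulary used earlier (dog, house, white), with optional scaling to unit length so that vectors of different sizes are comparable. The weight format and function names are assumptions.

```python
import math

VOCABULARY = ("dog", "house", "white")

def query_vector(weights):
    """Map {term: weight} onto the fixed vocabulary; absent terms get 0."""
    return [float(weights.get(t, 0)) for t in VOCABULARY]

def normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else vec

q = query_vector({"dog": 2, "house": 1})
print(q)
print(normalize(q))
```

With a 100,000-word dictionary each such vector would have 100,000 positions, which is why sparse storage matters in practice.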
35Proximity Searches
- Proximity: terms occur within K positions of one another
- pen w/5 paper
- A Near function can be more vague
- near(pen, paper)
- Sometimes order can be specified
- Also, Phrases and Collocations
- "United Nations", "Bill Clinton"
- Phrase Variants
- "retrieval of information", "information retrieval"
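A sketch of a proximity test in the spirit of `pen w/5 paper`. The exact semantics of such operators vary between systems; here "within K" is assumed to mean the two terms are at most K token positions apart, and the `ordered` flag is a hypothetical way to require that the first term comes first.

```python
def within_k(tokens, a, b, k, ordered=False):
    """True if terms a and b occur at most k positions apart in the token list."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    for i in pos_a:
        for j in pos_b:
            gap = j - i
            if ordered and gap <= 0:
                continue  # b must come after a
            if 0 < abs(gap) <= k:
                return True
    return False

tokens = "the pen lay on top of the paper".split()
print(within_k(tokens, "pen", "paper", 5))  # pen and paper are 6 positions apart
print(within_k(tokens, "pen", "paper", 6))
```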
36Filters
- Filters: reduce set of candidate docs
- Often specified simultaneously with query
- Usually restrictions on metadata
- restrict by
- date range
- internet domain (.edu .com .berkeley.edu)
- author
- size
- limit number of documents returned
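A sketch of metadata filters applied to a candidate set. The document fields (`host`, `date`), the sample records, and the function `apply_filters` are all hypothetical; real engines expose such restrictions through their own query syntax.

```python
from datetime import date

def apply_filters(docs, domain=None, start=None, end=None, limit=None):
    """Keep docs matching the domain suffix and date range, then cap the count."""
    kept = []
    for doc in docs:
        if domain is not None and not doc["host"].endswith(domain):
            continue
        if start is not None and doc["date"] < start:
            continue
        if end is not None and doc["date"] > end:
            continue
        kept.append(doc)
    return kept[:limit]  # limit=None keeps everything

candidates = [
    {"host": "www.berkeley.edu", "date": date(2004, 3, 1)},
    {"host": "www.example.com", "date": date(2005, 6, 15)},
    {"host": "cs.stanford.edu", "date": date(2001, 1, 10)},
]

edu_only = apply_filters(candidates, domain=".edu")
recent_edu = apply_filters(candidates, domain=".edu", start=date(2002, 1, 1))
print(len(edu_only), len(recent_edu))
```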
37Natural Language Queries
- The Holy Grail of information retrieval
- Issues in Natural Language Processing
- syntax
- semantics
- pragmatics
- speech understanding
- speech generation
38What do search engines do?
- Tags
- Title
- Meta
- Term frequency and location
- Popularity
39UC Berkeley Search Engine Guide
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html
40UC Berkeley Search Engine Guide
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html
41Search Engine Queries
42Old Search Engine Query Differences
44Older Search engine query models
45Search engine query models
46Types of Query Structures
- Most common query models (languages)
- Boolean Queries
- Old model
- Vector queries
- Very common
- Natural Language Queries
- Holy grail of search
- Batch lookup - it's all there before you query!