Introduction to Information Retrieval (cont.): Boolean Model - PowerPoint PPT Presentation

Title: Introduction to Information Retrieval (cont.): Boolean Model


1
Introduction to Information Retrieval (cont.)
Boolean Model
  • University of California, Berkeley
  • School of Information Management and Systems
  • SIMS 202: Information Organization and Retrieval
  • Lecture authors: Marti Hearst & Ray Larson

2
The Standard Retrieval Interaction Model
3
IR is an Iterative Process
4
[Figure: a sketch of a searcher moving through many
actions towards a general goal of satisfactory
completion of research related to an information
need, passing through queries Q0 to Q5 (after Bates 89).]
5
Restricted Form of the IR Problem
  • The system has available only pre-existing,
    canned text passages.
  • Its response is limited to selecting from these
    passages and presenting them to the user.
  • It must select, say, 10 or 20 passages out of
    millions or billions!

6
Information Retrieval
  • Revised Task Statement
  • Build a system that retrieves documents that
    users are likely to find relevant to their
    queries.
  • This set of assumptions underlies the field of
    Information Retrieval.

7
Some IR History
  • Roots in the scientific Information Explosion
    following WWII
  • Interest in computer-based IR from mid 1950s
  • H.P. Luhn at IBM (1958)
  • Probabilistic models at Rand (Maron & Kuhns)
    (1960)
  • Boolean system development at Lockheed (60s)
  • Vector Space Model (Salton at Cornell 1965)
  • Statistical Weighting methods and theoretical
    advances (70s)
  • Refinements and Advances in application (80s)
  • User Interfaces, Large-scale testing and
    application (90s)

8
Structure of an IR System
[Diagram, adapted from Soergel, p. 19: an Information Storage and Retrieval System with two parallel lines.
Search Line: Interest profiles & Queries -> Formulating query in terms of descriptors -> Storage of profiles -> Store1: Profiles/Search requests.
Storage Line: Documents & data -> Indexing (Descriptive and Subject) -> Storage of Documents -> Store2: Document representations.
Rules of the game: Rules for subject indexing, plus a Thesaurus (which consists of Lead-In Vocabulary and Indexing Language).
Comparison/Matching of Store1 against Store2 yields Potentially Relevant Documents.]
12
Relevance (introduction)
  • In what ways can a document be relevant to a
    query?
  • Answer precise question precisely.
  • Who is buried in Grant's tomb? Grant.
  • Partially answer question.
  • Where is Danville? Near Walnut Creek.
  • Suggest a source for more information.
  • What is lymphedema? Look in this Medical
    Dictionary.
  • Give background information.
  • Remind the user of other knowledge.
  • Others ...
  • Ideally, IR systems should retrieve ALL and ONLY
    the RELEVANT documents for a user

13
Query Languages
  • A way to express the question (information need)
  • Types
  • Boolean
  • Natural Language
  • Stylized Natural Language
  • Form-Based (GUI)

14
Simple query language: Boolean
  • Terms and connectors (or operators)
  • terms
  • words
  • normalized (stemmed) words
  • phrases
  • thesaurus terms
  • connectors
  • AND
  • OR
  • NOT

15
Boolean Queries
  • Cat
  • Cat OR Dog
  • Cat AND Dog
  • (Cat AND Dog)
  • (Cat AND Dog) OR Collar
  • (Cat AND Dog) OR (Collar AND Leash)
  • (Cat OR Dog) AND (Collar OR Leash)

16
Boolean Queries
  • (Cat OR Dog) AND (Collar OR Leash)
  • Each of the following combinations works
  • Cat x x x x
  • Dog x x x x x
  • Collar x x x x
  • Leash x x x x

17
Boolean Queries
  • (Cat OR Dog) AND (Collar OR Leash)
  • None of the following combinations work
  • Cat x x
  • Dog x x
  • Collar x x
  • Leash x x
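
The two tables above can be rebuilt by brute force. The following is a minimal sketch, assuming each document is reduced to the set of query terms it contains; the names `TERMS`, `matches`, `working` and `failing` are illustrative and not part of the lecture.

```python
from itertools import product

TERMS = ("cat", "dog", "collar", "leash")

def matches(present):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) on a set of present terms."""
    return (("cat" in present or "dog" in present)
            and ("collar" in present or "leash" in present))

# Enumerate all 16 presence/absence combinations of the four terms.
combos = [frozenset(t for t, flag in zip(TERMS, flags) if flag)
          for flags in product((False, True), repeat=len(TERMS))]

working = [c for c in combos if matches(c)]
# Non-empty combinations that fail: they lack one whole side of the AND.
failing = [c for c in combos if c and not matches(c)]

for combo in failing:
    print(sorted(combo))
```

A combination fails exactly when it has no animal term or no accessory term, which is why "cat dog" alone does not satisfy the query.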

18
Boolean Logic
[Venn diagram of sets A and B]
19
Boolean Queries
  • Usually expressed as INFIX operators in IR
  • ((a AND b) OR (c AND b))
  • NOT is a UNARY PREFIX operator
  • ((a AND b) OR (c AND (NOT b)))
  • AND and OR can be n-ary operators
  • (a AND b AND c AND d)
  • Some rules - (De Morgan revisited)
  • NOT(a) AND NOT(b) = NOT(a OR b)
  • NOT(a) OR NOT(b) = NOT(a AND b)
  • NOT(NOT(a)) = a
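
The rewrite rules above are equivalences, so they can be checked exhaustively over truth values. This is a small sketch, not part of the lecture; `holds_for_all` and the three rule names are made up for illustration.

```python
from itertools import product

def holds_for_all(law, arity):
    """Check a Boolean identity for every assignment of truth values."""
    return all(law(*values) for values in product((False, True), repeat=arity))

def de_morgan_and(a, b):
    # NOT(a) AND NOT(b) = NOT(a OR b)
    return ((not a) and (not b)) == (not (a or b))

def de_morgan_or(a, b):
    # NOT(a) OR NOT(b) = NOT(a AND b)
    return ((not a) or (not b)) == (not (a and b))

def double_negation(a):
    # NOT(NOT(a)) = a
    return (not (not a)) == a

print(holds_for_all(de_morgan_and, 2),
      holds_for_all(de_morgan_or, 2),
      holds_for_all(double_negation, 1))
```

For the n-ary forms, Python's built-in `all(...)` and `any(...)` play the role of AND and OR over any number of operands.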

20
Boolean Logic
[Venn diagram: three overlapping term sets t1, t2, t3 containing documents D1 to D11. The eight regions m1 to m8 each correspond to one combination of t1, t2 and t3 being present or absent.]
21
Boolean Searching
Information need: Measurement of the width of cracks in
prestressed concrete beams
Concepts: Cracks, Beams, Width measurement, Prestressed concrete
Formal Query: cracks AND beams AND
Width_measurement AND Prestressed_concrete
Relaxed Query: (C AND B AND P) OR (C AND B AND
W) OR (C AND W AND P) OR (B AND W AND P)
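
The relaxed query ORs together every way of choosing three of the four concepts, so it amounts to "at least three of the four must be present". A minimal sketch under that reading, assuming each document is a set of index terms; the function names are illustrative only.

```python
from itertools import combinations

CONCEPTS = {"cracks", "beams", "width_measurement", "prestressed_concrete"}

def formal_query(doc_terms):
    """All four concepts ANDed together."""
    return CONCEPTS <= doc_terms

def relaxed_query(doc_terms):
    """OR over every AND of three concepts, as in the relaxed query."""
    return any(set(triple) <= doc_terms
               for triple in combinations(sorted(CONCEPTS), 3))

doc = {"cracks", "beams", "prestressed_concrete", "bridges"}
print(formal_query(doc), relaxed_query(doc))
```

Writing the relaxed form as `len(CONCEPTS & doc_terms) >= 3` gives the same answer and avoids spelling out every triple.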
22
Pseudo-Boolean Queries
  • A new notation, from web search
  • cat dog collar leash
  • Does not mean the same thing!
  • Need a way to group combinations.
  • Phrases
  • "stray cat" AND "frayed collar"
  • "stray cat" "frayed collar"

23
[Diagram labels: Information need; Collections; text input]
24
Result Sets
  • Run a query, get a result set
  • Two choices
  • Reformulate query, run on entire collection
  • Reformulate query, run on result set
  • Example: Dialog query
  • (Redford AND Newman)
  • -> S1: 1450 documents
  • (S1 AND Sundance)
  • -> S2: 898 documents
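
Running a reformulated query on a result set can be pictured with plain set intersection. This is a toy sketch: the posting lists below are invented document ids, not the Dialog data, and the names `s1` and `s2` simply mirror the example above.

```python
# Hypothetical toy postings: term -> set of document ids containing it.
postings = {
    "redford": {1, 2, 3, 5, 8},
    "newman": {2, 3, 5, 7, 8},
    "sundance": {3, 4, 8, 9},
}

# (Redford AND Newman) -> S1
s1 = postings["redford"] & postings["newman"]

# (S1 AND Sundance) -> S2, computed on the saved result set only
s2 = s1 & postings["sundance"]

# The same narrowing, rerun against the entire collection
rerun = postings["redford"] & postings["newman"] & postings["sundance"]

print(sorted(s1), sorted(s2), s2 == rerun)
```

For a pure AND refinement both choices return the same documents; the difference is that the second one only touches the smaller saved set.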

25
[Diagram labels: Information need; Collections; text input; Reformulated Query]
26
Ordering of Retrieved Documents
  • Pure Boolean has no ordering
  • In practice:
  • order chronologically
  • order by total number of hits on query terms
  • What if one term has more hits than others?
  • Is it better to have one of each term or many of
    one term?
  • Fancier methods have been investigated
  • p-norm is most famous
  • usually impractical to implement
  • usually hard for user to understand
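
The question of "one of each term or many of one term" becomes concrete when the two counts are put side by side. A small sketch, assuming documents are already tokenized; both scoring functions are illustrative and neither is the p-norm method mentioned above.

```python
from collections import Counter

def total_hits(tokens, query_terms):
    """Total number of occurrences of any query term in the document."""
    counts = Counter(tokens)
    return sum(counts[term] for term in query_terms)

def distinct_terms_hit(tokens, query_terms):
    """Number of different query terms that occur at least once."""
    present = set(tokens)
    return sum(1 for term in query_terms if term in present)

query = ["cat", "collar"]
doc_a = "cat cat cat cat".split()   # many hits on one term
doc_b = "cat collar".split()        # one hit on each term

print(total_hits(doc_a, query), total_hits(doc_b, query))
print(distinct_terms_hit(doc_a, query), distinct_terms_hit(doc_b, query))
```

Ordering by total hits puts `doc_a` first, while ordering by distinct terms puts `doc_b` first, so the two practical orderings can disagree.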

27
Boolean
  • Advantages
  • simple queries are easy to understand
  • relatively easy to implement
  • Disadvantages
  • difficult to specify what is wanted
  • too much returned, or too little
  • ordering not well determined
  • Dominant language in commercial systems until the
    WWW

28
Faceted Boolean Query
  • Strategy: break query into facets (polysemous
    with earlier meaning of facets)
  • conjunction of disjunctions
  • (a1 OR a2 OR a3) AND
  • (b1 OR b2) AND
  • (c1 OR c2 OR c3 OR c4)
  • each facet expresses a topic
  • ("rain forest" OR jungle OR amazon) AND
  • (medicine OR remedy OR cure) AND
  • (Smith OR Zhou)
29
Faceted Boolean Query
  • Query still fails if one facet missing
  • Alternative: Coordination level ranking
  • Order results in terms of how many facets
    (disjuncts) are satisfied
  • Also called Quorum ranking, Overlap ranking, and
    Best Match
  • Problem: Facets still undifferentiated
  • Alternative: assign weights to facets
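
Coordination level ranking can be sketched in a few lines. The facets below reuse the rain forest example from the previous slide; the documents, the function names, and the choice to treat "rain forest" as a single term are assumptions made for illustration.

```python
FACETS = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

def coordination_level(doc_terms, facets=FACETS):
    """Count how many facets have at least one term in the document."""
    return sum(1 for facet in facets if facet & doc_terms)

def strict_match(doc_terms, facets=FACETS):
    """The plain faceted Boolean query: every facet must be satisfied."""
    return coordination_level(doc_terms, facets) == len(facets)

# Hypothetical documents, each reduced to a set of terms.
docs = {
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"amazon", "remedy"},
    "d3": {"smith"},
    "d4": {"weather"},
}

ranked = sorted(docs, key=lambda d: coordination_level(docs[d]), reverse=True)
print(ranked)
```

Only `d1` passes the strict query, but the ranking still surfaces `d2` ahead of `d3`. Weighting facets would replace the `1` in the sum with a per-facet weight.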

30
Proximity Searches
  • Proximity: terms occur within K positions of one
    another
  • pen w/5 paper
  • A Near function can be more vague
  • near(pen, paper)
  • Sometimes order can be specified
  • Also, Phrases and Collocations
  • "United Nations", "Bill Clinton"
  • Phrase Variants
  • "retrieval of information", "information
    retrieval"
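
A proximity operator needs word positions, not just word presence. The sketch below is one possible reading of "within K positions", assuming a document is a list of tokens; the exact semantics of `w/5` and `near` differ between systems, and the function names here are hypothetical.

```python
def positions(tokens, term):
    """All positions at which a term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within(tokens, a, b, k, ordered=False):
    """True if a and b occur within k positions of each other.

    With ordered=True, a must come before b.
    """
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            distance = j - i
            if ordered and 0 < distance <= k:
                return True
            if not ordered and 0 < abs(distance) <= k:
                return True
    return False

def phrase(tokens, words):
    """True if the words occur consecutively, in order."""
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

tokens = "the pen lay on a sheet of paper".split()
print(within(tokens, "pen", "paper", 5), within(tokens, "pen", "paper", 6))
```

In this sentence "pen" and "paper" are six positions apart, so a window of 5 misses them and a window of 6 finds them. A phrase is the special case of an ordered window with no gaps.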

31
Filters
  • Filters: reduce set of candidate docs
  • Often specified simultaneously with query
  • Usually restrictions on metadata
  • restrict by
  • date range
  • internet domain (.edu .com .berkeley.edu)
  • author
  • size
  • limit number of documents returned
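
Filters act on metadata rather than on the text itself. A minimal sketch, assuming each candidate carries a domain, an author and a publication date; the `Doc` record, the `apply_filters` function and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    doc_id: int
    domain: str
    author: str
    published: date

def apply_filters(candidates, domain_suffix=None, start=None, end=None, limit=None):
    """Keep candidates that pass every restriction that was specified."""
    kept = []
    for doc in candidates:
        if domain_suffix and not doc.domain.endswith(domain_suffix):
            continue
        if start and doc.published < start:
            continue
        if end and doc.published > end:
            continue
        kept.append(doc)
    return kept[:limit]  # limit=None keeps everything

# Invented sample candidates, as if returned by a query.
candidates = [
    Doc(1, "www.berkeley.edu", "author_a", date(1999, 1, 5)),
    Doc(2, "example.com", "author_b", date(1999, 3, 1)),
    Doc(3, "cs.cornell.edu", "author_c", date(1997, 6, 1)),
]

recent_edu = apply_filters(candidates, domain_suffix=".edu", start=date(1998, 1, 1))
print([doc.doc_id for doc in recent_edu])
```

Because filters only shrink the candidate set, they can be applied before or after the Boolean match without changing which documents survive.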

32
How are the texts handled?
  • What happens if you take the words exactly as
    they appear in the original text?
  • What about punctuation, capitalization, etc.?
  • What about spelling errors?
  • What about plural vs. singular forms of words?
  • What about cases and declension in non-English
    languages?
  • What about non-Roman alphabets?

33
Content Analysis
  • Automated Transformation of raw text into a form
    that represents some aspect(s) of its meaning
  • Including, but not limited to:
  • Automated Thesaurus Generation
  • Phrase Detection
  • Categorization
  • Clustering
  • Summarization

34
Techniques for Content Analysis
  • Statistical
  • Single Document
  • Full Collection
  • Linguistic
  • Syntactic
  • Semantic
  • Pragmatic
  • Knowledge-Based (Artificial Intelligence)
  • Hybrid (Combinations)

35
Text Processing
  • Standard Steps
  • Recognize document structure
  • titles, sections, paragraphs, etc.
  • Break into tokens
  • usually space and punctuation delineated
  • special issues with Asian languages
  • Stemming/morphological analysis
  • Store in inverted index (to be discussed later)
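
The token and index steps above can be sketched together. This is a deliberately simple version, assuming lowercase ASCII text split on anything that is not a letter or digit; it skips structure recognition and stemming, and it would not handle the Asian-language issues noted above. The function names are illustrative.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split on spaces and punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

docs = {1: "Cat and dog.", 2: "Dog collar, frayed!"}
index = build_inverted_index(docs)
print(sorted(index["dog"]), sorted(index["cat"]))
```

With the index in place, a Boolean AND is a set intersection of posting sets, for example `index["dog"] & index["collar"]`.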

36
[Diagram labels: Information need; Collections; text input. Callouts: How is the query constructed? How is the text processed?]
37
Document Processing Steps
38
Stemming and Morphological Analysis
  • Goal: normalize similar words
  • Morphology (form of words)
  • Inflectional Morphology
  • E.g., inflect verb endings and noun number
  • Never change grammatical class
  • dog, dogs
  • tengo, tienes, tiene, tenemos, tienen
  • Derivational Morphology
  • Derive one word from another
  • Often change grammatical class
  • build, building; health, healthy

39
Automated Methods
  • Powerful multilingual tools exist for
    morphological analysis
  • PCKimmo, Xerox Lexical technology
  • Require a grammar and dictionary
  • Use two-level automata
  • Stemmers
  • Very dumb rules work well (for English)
  • Porter Stemmer: iteratively remove suffixes
  • Improvement: pass results through a lexicon
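
The idea of "very dumb rules" plus a lexicon check can be illustrated with a toy suffix stripper. This is not the Porter algorithm, which has several ordered phases and conditions on the stem; the suffix list, the minimum stem length and the function names are assumptions made for this sketch.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical toy rule list

def strip_suffixes(word, min_stem=3):
    """Iteratively remove a known suffix while the stem stays long enough."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

def stem_with_lexicon(word, lexicon):
    """Accept the stripped form only if it is a known word."""
    candidate = strip_suffixes(word)
    return candidate if candidate in lexicon else word

lexicon = {"dog", "build"}
print(strip_suffixes("buildings"), strip_suffixes("news"))
print(stem_with_lexicon("dogs", lexicon), stem_with_lexicon("news", lexicon))
```

The bare rules turn "news" into "new", the kind of error a stemmer can make; the lexicon check rejects that stem here because "new" is not in this small word list.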

40
Errors Generated by Porter Stemmer (Krovetz 93)
41
Next
  • Statistical Properties of Text
  • Preparing information for search: Lexical
    analysis
  • Introduction to the Vector Space model of IR.