Transcript and Presenter's Notes

1
Lecture 17: Boolean IR and Text Processing
SIMS 202: Information Organization and Retrieval
  • Prof. Ray Larson and Prof. Marc Davis
  • UC Berkeley SIMS
  • Tuesday and Thursday, 10:30 am - 12:00 pm
  • Fall 2003
  • http://www.sims.berkeley.edu/academics/courses/is202/f03/

2
Announcements
  • Wishter volunteers meeting tonight at 7:00
  • Testers needed!!
  • UI tests on Image Gallery/Annotation software
  • Thursday between 2-4 and Friday 10-4
  • The tests will be approximately 1½ hours (but
    most likely will run a bit shorter)
  • Sign-up sheet will be available at the end of class

3
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic and Boolean IR Systems
  • Text Processing
  • Discussion

Credit for some of the slides in this lecture
goes to Marti Hearst
4
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic and Boolean IR Systems
  • Text Processing
  • Discussion

Credit for some of the slides in this lecture
goes to Marti Hearst
5
IR is an Iterative Process
6
Berry-Picking Model
A sketch of a searcher moving through many
actions towards a general goal of satisfactory
completion of research related to an information
need. (after Bates 89)
[Diagram: successive queries Q0 through Q5 along the searcher's path]
7
Restricted Form of the IR Problem
  • The system has available only pre-existing,
    canned text passages
  • Its response is limited to selecting from these
    passages and presenting them to the user
  • It must select, say, 10 or 20 passages out of
    millions or billions!

8
Information Retrieval
  • Revised Task Statement
  • Build a system that retrieves documents that
    users are likely to find relevant to their
    queries
  • This set of assumptions underlies the field of
    Information Retrieval

9
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic and Boolean IR Systems
  • Text Processing
  • Discussion

Credit for some of the slides in this lecture
goes to Marti Hearst
10
Structure of an IR System
[Flattened diagram (adapted from Soergel, p. 19): an Information Storage and
Retrieval System with a search line and a storage line]
  • Search line: interest profiles and queries are formulated in terms of
    descriptors and stored (Store 1: profiles/search requests)
  • Storage line: documents and data are indexed (descriptive and subject
    indexing) and stored (Store 2: document representations)
  • Both lines follow the "rules of the game": rules for subject indexing and a
    thesaurus (consisting of a lead-in vocabulary and an indexing language)
  • Comparison/matching of the two stores yields potentially relevant documents
11
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic and Boolean IR Systems
  • Text Processing
  • Discussion

Credit for some of the slides in this lecture
goes to Marti Hearst
12
Central Concepts in IR
  • Documents
  • Queries
  • Collections
  • Evaluation
  • Relevance

13
Documents
  • What do we mean by a document?
  • Full document?
  • Document surrogates?
  • Pages?
  • Buckland (JASIS, Sept. 1997), "What is a 'Document'?"
  • Are IR systems better called Document Retrieval
    systems?
  • A document is a representation of some
    aggregation of information, treated as a unit

14
Collection
  • A collection is some physical or logical
    aggregation of documents
  • A database
  • A Library
  • An index?
  • Others?

15
Queries
  • A query is some expression of a user's
    information needs
  • Can take many forms
  • Natural language description of need
  • Formal query in a query language
  • Queries may not be accurate expressions of the
    information need
  • Differences between conversation with a person
    and formal query expression

16
Evaluation: Why Evaluate?
  • Determine if the system is desirable
  • Make comparative assessments
  • Others?

17
What To Evaluate?
  • How much of the information need was satisfied
  • How much was learned about a topic
  • Incidental learning
  • How much was learned about the collection
  • How much was learned about other topics
  • How inviting the system is

18
What To Evaluate?
  • What can be measured that reflects users'
    ability to use the system? (Cleverdon 66)
  • Coverage of information
  • Form of presentation
  • Effort required/ease of use
  • Time and space efficiency
  • Recall
  • Proportion of relevant material actually
    retrieved
  • Precision
  • Proportion of retrieved material actually relevant

Effectiveness
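
A minimal sketch (added for illustration) of how recall and precision can be
computed for a single query from a retrieved set and a set of relevance
judgments; the document IDs and sets below are invented examples, not data
from the lecture.

```python
# Hypothetical example: compute recall and precision for one query.
retrieved = {"d1", "d2", "d3", "d4"}          # documents the system returned
relevant  = {"d2", "d4", "d7", "d9", "d11"}   # documents judged relevant

hits = retrieved & relevant                    # relevant documents actually retrieved

recall = len(hits) / len(relevant)             # proportion of relevant material retrieved
precision = len(hits) / len(retrieved)         # proportion of retrieved material that is relevant

print(f"recall = {recall:.2f}, precision = {precision:.2f}")
# -> recall = 0.40, precision = 0.50
```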
19
Relevance (revisited)
  • "Intuitively, we understand quite well what
    relevance means. It is a primitive 'y' know'
    concept, as is information for which we hardly
    need a definition. ... if and when any productive
    contact in communication is desired,
    consciously or not, we involve and use this
    intuitive notion of relevance."
  • Saracevic, 1975, p. 324

20
Relevance
  • How relevant is the document?
  • For this user, for this information need
  • Subjective, but
  • Measurable to some extent
  • How often do people agree a document is relevant
    to a query?
  • How well does it answer the question?
  • Complete answer? Partial?
  • Background information?
  • Hints for further exploration?

21
Relevance Research and Thought
  • Review to 1975 by Saracevic
  • Reconsideration of user-centered relevance by
    Schamber, Eisenberg and Nilan, 1990
  • Special Issue of JASIS on relevance (April 1994,
    45(3))

22
Saracevic
  • Relevance is considered as a measure of
    effectiveness of the contact between a source and
    a destination in a communications process
  • Systems view
  • Destinations view
  • Subject Literature view
  • Subject Knowledge view
  • Pertinence
  • Pragmatic view

23
Define Your Own Relevance
  • As we saw last time most definitions of relevance
    follow a formula
  • Relevance is the (A) gauge of relevance of an (B)
    aspect of relevance existing between an (C)
    object judged and a (D) frame of reference as
    judged by an (E) assessor

From Saracevic, 1975 and Schamber 1990
24
Schamber, Eisenberg and Nilan
  • Relevance is the measure of retrieval
    performance in all information systems, including
    full-text, multimedia, question-answering,
    database management and knowledge-based systems.
  • Systems-oriented relevance: Topicality

25
Schamber, et al. Conclusions
  • Relevance is a multidimensional concept whose
    meaning is largely dependent on users'
    perceptions of information and their own
    information need situations
  • Relevance is a dynamic concept that depends on
    users' judgments of the quality of the
    relationship between information and information
    need at a certain point in time.
  • Relevance is a complex but systematic and
    measurable concept if approached conceptually and
    operationally from the user's perspective.

26
Janes' View
27
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic and Boolean IR Systems
  • Text Processing
  • Discussion

Credit for some of the slides in this lecture
goes to Marti Hearst
28
Query Languages
  • A way to express the question (information need)
  • Types
  • Boolean
  • Natural Language
  • Stylized Natural Language
  • Form-Based (GUI)

29
Simple Query Language: Boolean
  • Terms + Connectors (or operators)
  • Terms
  • Words
  • Normalized (stemmed) words
  • Phrases
  • Thesaurus terms
  • Connectors
  • AND
  • OR
  • NOT

30
Boolean Queries
  • Cat
  • Cat OR Dog
  • Cat AND Dog
  • (Cat AND Dog)
  • (Cat AND Dog) OR Collar
  • (Cat AND Dog) OR (Collar AND Leash)
  • (Cat OR Dog) AND (Collar OR Leash)
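
As a rough illustration of how such queries select documents, the sketch below
(added for this transcript) evaluates the last query in the list against a few
toy documents represented as sets of terms; the documents are invented.

```python
# Toy collection: each document is just a set of (lowercased) terms.
docs = {
    "doc1": {"cat", "collar", "bell"},
    "doc2": {"dog", "leash", "park"},
    "doc3": {"cat", "dog"},
    "doc4": {"collar", "leash"},
}

def matches(terms):
    """(Cat OR Dog) AND (Collar OR Leash), expressed over a term set."""
    return ({"cat", "dog"} & terms) and ({"collar", "leash"} & terms)

print([d for d, terms in docs.items() if matches(terms)])
# -> ['doc1', 'doc2']
```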

31
Boolean Queries
  • (Cat OR Dog) AND (Collar OR Leash)
  • Each of the following combinations works

32
Boolean Queries
  • (Cat OR Dog) AND (Collar OR Leash)
  • None of the following combinations works

33
Boolean Logic
[Venn diagram of sets A and B]
34
Boolean Queries
  • Usually expressed as INFIX operators in IR
  • ((a AND b) OR (c AND b))
  • NOT is UNARY PREFIX operator
  • ((a AND b) OR (c AND (NOT b)))
  • AND and OR can be n-ary operators
  • (a AND b AND c AND d)
  • Some rules - (De Morgan revisited)
  • NOT(a) AND NOT(b) = NOT(a OR b)
  • NOT(a) OR NOT(b) = NOT(a AND b)
  • NOT(NOT(a)) = a
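
A small sketch (added for illustration) that checks the De Morgan rewrites
above over a hypothetical document universe, treating each term as the set of
documents that contain it.

```python
# Hypothetical postings: which documents contain term a and term b.
universe = set(range(10))           # document ids 0..9
a = {1, 2, 3, 4}                    # docs containing a
b = {3, 4, 5, 6}                    # docs containing b

def NOT(s):
    return universe - s

assert (NOT(a) & NOT(b)) == NOT(a | b)   # NOT(a) AND NOT(b) = NOT(a OR b)
assert (NOT(a) | NOT(b)) == NOT(a & b)   # NOT(a) OR NOT(b) = NOT(a AND b)
assert NOT(NOT(a)) == a                  # NOT(NOT(a)) = a
print("De Morgan identities hold on this collection")
```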

35
Boolean Logic
[Truth table: the eight possible combinations (m1-m8) of documents containing
or not containing terms t1, t2, and t3]
36
Boolean Searching
37
Pseudo-Boolean Queries
  • A new notation, from web search
  • cat dog collar leash
  • Does not mean the same thing!
  • Need a way to group combinations
  • Phrases
  • "stray cat" AND "frayed collar"
  • "stray cat" "frayed collar"

38
Another View of IR
[Diagram: an information need is parsed into a query; collection text input is
pre-processed into an index; the query is ranked against the index]
39
Result Sets
  • Run a query, get a result set
  • Two choices
  • Reformulate query, run on entire collection
  • Reformulate query, run on result set
  • Example: Dialog query
  • (Redford AND Newman)
  • -> S1: 1450 documents
  • (S1 AND Sundance)
  • -> S2: 898 documents
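
A sketch (added for illustration) of the two reformulation choices above,
assuming documents are held as simple term sets; the tiny collection and the
counts are invented and do not reproduce the Dialog numbers.

```python
# Hypothetical collection of documents as term sets.
collection = {
    "d1": {"redford", "newman", "sting"},
    "d2": {"redford", "newman", "sundance"},
    "d3": {"redford", "sundance"},
}

def search(docs, required_terms):
    """Boolean AND of all required terms, run over any set of documents."""
    return {d: t for d, t in docs.items() if required_terms <= t}

s1 = search(collection, {"redford", "newman"})   # query run on the entire collection
s2 = search(s1, {"sundance"})                    # reformulated query run on result set S1
print(sorted(s1), sorted(s2))
# -> ['d1', 'd2'] ['d2']
```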

40
Feedback Queries
41
Ordering of Retrieved Documents
  • Pure Boolean has no ordering
  • In practice
  • Order chronologically
  • Order by total number of hits on query terms
  • What if one term has more hits than others?
  • Is it better to have one of each term or many of
    one term?
  • Fancier methods have been investigated
  • p-norm is most famous
  • Usually impractical to implement
  • Usually hard for user to understand

42
Boolean
  • Advantages
  • Simple queries are easy to understand
  • Relatively easy to implement
  • Disadvantages
  • Difficult to specify what is wanted
  • Too much returned, or too little
  • Ordering not well determined
  • Dominant language in commercial systems until the
    WWW

43
Faceted Boolean Query
  • Strategy: Break query into facets (polysemous
    with the earlier meaning of "facets")
  • Conjunction of disjunctions
  • (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4)
  • Each facet expresses a topic
  • (rain forest OR jungle OR amazon) AND (medicine OR remedy
    OR cure) AND (Smith OR Zhou)

44
Faceted Boolean Query
  • Query still fails if one facet missing
  • Alternative: Coordination-level ranking
  • Order results in terms of how many facets
    (disjuncts) are satisfied
  • Also called Quorum ranking, Overlap ranking, and
    Best Match
  • Problem: Facets still undifferentiated
  • Alternative: Assign weights to facets
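
A minimal sketch (added for illustration) of coordination-level (quorum)
ranking as described above: documents are ordered by how many facets they
satisfy. The facets reuse the example from the previous slide; the documents
are hypothetical.

```python
# Facets: each is a disjunction (OR) of alternative terms.
facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

# Hypothetical documents as sets of terms/phrases.
docs = {
    "d1": {"jungle", "cure", "smith"},      # satisfies all three facets
    "d2": {"amazon", "remedy"},             # satisfies two facets
    "d3": {"zhou", "travel"},               # satisfies one facet
}

def coordination_level(terms):
    """Number of facets with at least one term present in the document."""
    return sum(1 for facet in facets if facet & terms)

ranked = sorted(docs, key=lambda d: coordination_level(docs[d]), reverse=True)
print([(d, coordination_level(docs[d])) for d in ranked])
# -> [('d1', 3), ('d2', 2), ('d3', 1)]
```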

45
Proximity Searches
  • Proximity: Terms occur within K positions of one
    another
  • pen w/5 paper
  • A Near function can be more vague
  • near(pen, paper)
  • Sometimes order can be specified
  • Also, Phrases and Collocations
  • "United Nations", "Bill Clinton"
  • Phrase Variants
  • "retrieval of information" vs. "information
    retrieval"
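
A sketch (added for illustration) of a "within K positions" proximity check
using per-term position lists; the sample sentence is made up.

```python
from collections import defaultdict

# Record the positions of each token in a (made-up) document.
text = "take a pen and some paper then put the pen away"
positions = defaultdict(list)
for i, token in enumerate(text.split()):
    positions[token].append(i)

def within(term1, term2, k):
    """True if some occurrence of term1 is within k positions of some occurrence of term2."""
    return any(abs(p1 - p2) <= k
               for p1 in positions[term1]
               for p2 in positions[term2])

print(within("pen", "paper", 5))   # True  (positions 2 and 5 are 3 apart)
print(within("pen", "away", 1))    # True  (positions 9 and 10)
print(within("take", "paper", 2))  # False (positions 0 and 5)
```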

46
Filters
  • Filters: Reduce the set of candidate docs
  • Often specified simultaneously with the query
  • Usually restrictions on metadata
  • Restrict by
  • Date range
  • Internet domain (.edu, .com, .berkeley.edu)
  • Author
  • Size
  • Limit number of documents returned
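
A sketch (added for illustration) of applying such filters to a candidate set,
assuming each document carries simple metadata fields; the field names and
records below are hypothetical.

```python
from datetime import date

# Hypothetical candidate documents with metadata.
candidates = [
    {"id": "d1", "date": date(2003, 9, 1),  "domain": "berkeley.edu", "author": "Larson"},
    {"id": "d2", "date": date(1999, 5, 20), "domain": "example.com",  "author": "Smith"},
    {"id": "d3", "date": date(2003, 2, 14), "domain": "mit.edu",      "author": "Jones"},
]

def apply_filters(docs, after=None, domain_suffix=None, limit=None):
    """Reduce the candidate set by date range, internet domain, and result count."""
    out = [d for d in docs
           if (after is None or d["date"] >= after)
           and (domain_suffix is None or d["domain"].endswith(domain_suffix))]
    return out[:limit] if limit else out

print([d["id"] for d in apply_filters(candidates, after=date(2003, 1, 1),
                                      domain_suffix=".edu", limit=10)])
# -> ['d1', 'd3']
```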

47
Boolean Systems
  • Most of the commercial database search systems
    that pre-date the WWW are based on Boolean search
  • Dialog, Lexis-Nexis, etc.
  • Most Online Library Catalogs are Boolean systems
  • E.g., MELVYL
  • Database systems use Boolean logic for searching
  • Many of the search engines sold for intranet
    search of web sites are Boolean

48
Why Boolean?
  • Easy to implement
  • Efficient searching across very large databases
  • Easy to explain results
  • Has to have all of the words (AND)
  • Has to have at least one of the words (OR)

49
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic and Boolean IR Systems
  • Text Processing
  • Discussion

Credit for some of the slides in this lecture
goes to Marti Hearst
50
Content Analysis
  • Automated Transformation of raw text into a form
    that represents some aspect(s) of its meaning
  • Including, but not limited to
  • Automated Thesaurus Generation
  • Phrase Detection
  • Categorization
  • Clustering
  • Summarization

51
Techniques for Content Analysis
  • Statistical
  • Single Document
  • Full Collection
  • Linguistic
  • Syntactic
  • Semantic
  • Pragmatic
  • Knowledge-Based (Artificial Intelligence)
  • Hybrid (Combinations)

52
Text Processing
  • Standard Steps
  • Recognize document structure
  • Titles, sections, paragraphs, etc.
  • Break into tokens
  • Usually space and punctuation delineated
  • Special issues with Asian languages
  • Stemming/morphological analysis
  • Store in inverted index (to be discussed later)
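
A minimal sketch (added for illustration) of the standard steps above:
tokenize on spaces and punctuation, crudely normalize, and store the result in
an inverted index. A real pipeline would also handle document structure,
stemming, and non-English text; the two sample documents are invented.

```python
import re
from collections import defaultdict

# Hypothetical tiny collection.
docs = {
    1: "Cats and dogs. Dogs wear collars.",
    2: "A leash for the dog; a collar for the cat.",
}

def tokenize(text):
    """Break into tokens, delimited by spaces and punctuation, and lowercase them."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[token].add(doc_id)

print(sorted(index["dogs"]))                        # -> [1]
print(sorted(index["collar"] | index["collars"]))   # -> [1, 2]  (no stemming yet)
```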

53
Content Analysis Areas
54
Document Processing Steps
From Modern IR Textbook
55
Stemming and Morphological Analysis
  • Goal: normalize similar words
  • Morphology (form of words)
  • Inflectional Morphology
  • E.g., inflect verb endings and noun number
  • Never changes grammatical class
  • dog, dogs
  • tengo, tienes, tiene, tenemos, tienen
  • Derivational Morphology
  • Derive one word from another
  • Often changes grammatical class
  • build, building; health, healthy

56
Automated Methods
  • Powerful multilingual tools exist for
    morphological analysis
  • PC-KIMMO, Xerox lexical technology
  • Require a grammar and dictionary
  • Use two-level automata
  • Stemmers
  • Very dumb rules work well (for English)
  • Porter Stemmer: iteratively remove suffixes
  • Improvement: pass results through a lexicon
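
A deliberately dumb suffix-stripping sketch (added for illustration) in the
spirit of the rules mentioned above, plus the "pass results through a lexicon"
improvement; the rules and the tiny lexicon are invented and far cruder than
the real Porter stemmer.

```python
# A few ordered suffix-stripping rules (much cruder than the real Porter stemmer).
RULES = [("ational", "ate"), ("ization", "ize"), ("ness", ""), ("ing", ""), ("s", "")]

# Tiny hypothetical lexicon used to filter out bad stems.
LEXICON = {"relate", "organize", "dark", "build", "dog"}

def crude_stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

def stem_with_lexicon(word):
    """Improvement: only accept a stem if the lexicon recognizes it."""
    stem = crude_stem(word)
    return stem if stem in LEXICON else word

for w in ["relational", "organization", "darkness", "building", "dogs", "king"]:
    print(w, "->", stem_with_lexicon(w))
# relational -> relate, organization -> organize, darkness -> dark,
# building -> build, dogs -> dog, king -> king  ("k" is rejected by the lexicon)
```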

57
Errors Generated by Porter Stemmer
From Krovetz 93
58
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic
  • Boolean IR Systems
  • Discussion

Credit for some of the slides in this lecture
goes to Marti Hearst
59
Questions from Patrick Riley
  • In Plato's Meno dialogue, Plato asks "How does
    one investigate what one does not know?" Plato's
    question is similar to typical questions we
    encounter in this and other readings of INFOSYS
    202: how do we overcome the synonymy and polysemy
    problems faced by lexical searching? Can the LSA
    (Latent Semantic Analysis) and SVD (singular
    value decomposition) statistical techniques
    demonstrated by Dumais et al. solve the lexicon
    deficiencies in information retrieval?
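
A hedged sketch (added for illustration) of the LSA/SVD idea the question
refers to: factor a term-document matrix with a truncated SVD and compare
documents in the reduced space. The tiny matrix below is invented; it does not
reproduce Dumais et al.'s experiments or parameters.

```python
import numpy as np

# Invented term-document count matrix (rows = terms, columns = documents).
A = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # automobile
    [0, 0, 1, 0],   # engine
    [0, 0, 0, 2],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

# Truncated SVD: keep only k latent dimensions.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T      # each row: a document in the latent space

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Doc 0 ("car") and doc 1 ("automobile") share no terms, yet land together in
# the latent space; doc 3 ("flower", "petal") stays far away.
print(round(cosine(doc_vectors[0], doc_vectors[1]), 2))   # -> 1.0
print(round(cosine(doc_vectors[0], doc_vectors[3]), 2))   # -> 0.0
```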

60
Paradox
  • The "Fundamental Paradox of Information
    Retrieval" as stated by Roland Hjerppe
  • "The need to describe that which you do not know
    in order to find it"

61
Questions from Patrick Riley
  • This paper is from 1988... do you know of any
    applications or advancements of this LSA approach
    from the information retrieval community?
    (Example: AI; LSA passed the TOEFL.)
  • And what are some of the limitations of using
    this corpus-based text comparison mechanism?
    (Example: no use of word order, incompleteness?)
    How does the LSA approach differ from other
    statistical approaches you've encountered?
    (Example: Google's "Similar Pages" feature.)

62
Questions from Joe Hall
  • I would really like to see a show of hands (in
    class; I can't see you now!) of how many people
    have heard of either of the terms "Singular Value
    Decomposition" or "Eigenvector Decomposition"
    before you sat down to read this article. (I ask
    because we use this a lot in numerical
    approximation of radiative transfer in
    astrophysics... SVD is definitely a litmus test
    as to whether or not a problem is difficult.)

63
Questions from Joe Hall
  • I'm going to get picky here. In the Conclusion,
    Dumais et al. claim, "The latent structure LSI
    approach is useful for helping people find
    textual information in large collections."
    However, their results (and those of other
    researchers!) mostly contradict this claim. So
    which is it... does the SVD approach "offer no
    improvement over term matching methods" only for
    "relatively homogenous" groups of documents like
    "information science documents"? Does LSI work
    best on widely different documents? Take a look
    at this paper's abstract, which contradicts the
    Dumais findings: http://tinyurl.com/smfo

64
Questions from Joe Hall
  • If you raised your hand for the first question,
    you may know that SVD is very computationally
    intensive... Dumais claims that "it need only be
    done once for each dataset." That's no fun...
    most datasets change over time... not only that,
    but most datasets grow with time... which means
    that SVD techniques can only be used on small,
    static, homogenous data sets (if you buy the link
    I showed above)... what fun is that? Where is
    SVD-enabled LSI useful? Is it merely a
    fascination of IR researchers and a way to write
    fancy grant proposals to make the next Maserati
    payment?

65
Questions from Tu Tran
  • In what context was this paper written? What was
    the state of the IR field?
  • Imagine you are an information specialist and had
    to explain LSI and SVD to your non-mathematically
    oriented/non-technical manager. How would you do
    it?
  • The paper did not include any user studies. Can
    you imagine tasks where users would not find this
    system useful?

66
Next Time
  • Statistical Properties of Texts and Vector
    Representation
  • Readings/Discussion
  • Cooper, "Getting Beyond Boole" (Dan)
  • Bates, "How to Use Controlled Vocabularies More
    Effectively in Online Searching" (Ann)
  • Hearst, "Improving Full-Text Precision on Short
    Queries Using Simple Constraints" (Simon)
  • Modern IR, Chapter 7 (Sean)