CS 430 / INFO 430 Information Retrieval
Transcript and Presenter's Notes
1
CS 430 / INFO 430 Information Retrieval
Lecture 6 Boolean Methods
2
Course Administration
3
CS 430 / INFO 430 Information Retrieval
Completion of Lecture 5
4
Porter Stemmer
A multi-step, longest-match stemmer.
M. F. Porter, An algorithm for suffix stripping. (Originally published in Program, 14 no. 3, pp. 130-137, July 1980.)
http://www.tartarus.org/martin/PorterStemmer/def.txt
Notation:
  v      vowel(s)
  c      consonant(s)
  (vc)m  vowel(s) followed by consonant(s), repeated m times
Any word can be written [c](vc)m[v]; m is called the measure of the word.
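The slides contain no code; as a minimal sketch of this notation (the function names are my own), the measure m can be computed by scanning the word for vowel-then-consonant runs:

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels, and y counts as a
    vowel when it follows a consonant (e.g. the y in 'happy')."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Return m, the number of (vowel run, consonant run) pairs, so that
    the word matches the form [c](vc)m[v]."""
    m, i, n = 0, 0, len(word)
    while i < n and is_consonant(word, i):           # optional leading [c]
        i += 1
    while i < n:
        while i < n and not is_consonant(word, i):   # vowel run
            i += 1
        if i == n:
            break
        m += 1
        while i < n and is_consonant(word, i):       # consonant run
            i += 1
    return m

# measure("tree") == 0, measure("trouble") == 1, measure("troubles") == 2
```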
5
Porter's Stemmer
Porter Stemming Algorithm: complex suffixes
Complex suffixes are removed bit by bit in the different steps. Thus
GENERALIZATIONS
becomes GENERALIZATION (Step 1)
becomes GENERALIZE (Step 2)
becomes GENERAL (Step 3)
becomes GENER (Step 4)
6
Porter Stemmer: Step 1a
Suffix   Replacement   Examples
sses     ss            caresses -> caress
ies      i             ponies -> poni, ties -> ti
ss       ss            caress -> caress
s        (null)        cats -> cat
7
Porter Stemmer: Step 1b
Condition   Suffix   Replacement   Examples
(m > 0)     eed      ee            feed -> feed, agreed -> agree
(*v*)       ed       (null)        plastered -> plaster, bled -> bled
(*v*)       ing      (null)        motoring -> motor, sing -> sing
*v* - the stem contains a vowel
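A minimal sketch (mine, not the lecture's) of how the Step 1a and 1b rules above can be applied; the measure and vowel tests are simplified (the special treatment of y is omitted), and the clean-up the full algorithm performs after removing ed or ing is left out:

```python
import re

def measure(stem):
    """m from the [c](vc)m[v] form of the stem, found by collapsing
    consonant and vowel runs with regular expressions."""
    form = re.sub(r"[aeiou]+", "v", re.sub(r"[^aeiou]+", "c", stem.lower()))
    return form.count("vc")

def contains_vowel(stem):
    """Condition *v*: the stem contains at least one vowel."""
    return any(ch in "aeiou" for ch in stem.lower())

def step_1a(word):
    # Rules are tried longest suffix first; only the first match is applied.
    if word.endswith("sses"):
        return word[:-4] + "ss"            # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "i"             # ponies -> poni, ties -> ti
    if word.endswith("ss"):
        return word                        # caress -> caress
    if word.endswith("s"):
        return word[:-1]                   # cats -> cat
    return word

def step_1b(word):
    if word.endswith("eed"):
        stem = word[:-3]
        # (m > 0) eed -> ee : agreed -> agree, but feed -> feed
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # (*v*) ed/ing -> null : plastered -> plaster, motoring -> motor,
            # but bled -> bled and sing -> sing (no vowel left in the stem)
            return stem if contains_vowel(stem) else word
    return word

# step_1b(step_1a("caresses")) == "caress"; step_1b("motoring") == "motor"
```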
8
Porter Stemmer: Step 5a
Condition            Suffix   Replacement   Examples
(m > 1)              e        (null)        probate -> probat, rate -> rate
(m = 1 and not *o)   e        (null)        cease -> ceas
*o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
9
Stemming in Practice
Evaluation studies have found that stemming can affect retrieval performance, usually for the better, but the results are mixed.
  • Effectiveness is dependent on the vocabulary. Fine distinctions may be lost through stemming.
  • Automatic stemming is as effective as manual conflation.
  • Performance of the various algorithms is similar.
Porter's algorithm is entirely empirical, but has proved to be an effective algorithm for stemming English text with trained users.
10
Selection of tokens, weights, stop lists and stemming
Special-purpose collections (e.g., law, medicine, monographs): Best results are obtained by tuning the search engine for the characteristics of the collections and the expected queries. It is valuable to use a training set of queries, with lists of relevant documents, to tune the system for each application.
General-purpose collections (e.g., news articles): The modern practice is to use a basic weighting scheme (e.g., tf.idf), a simple definition of token, a short stop list, and little stemming except for plurals, with minimal conflation.
Web searching: combine similarity ranking with ranking based on document importance.
11
CS 430 / INFO 430 Information Retrieval
Lecture 6 Boolean Methods
12
Exact Matching (Boolean Model)
[Diagram: documents are indexed into an index database; a query is run against it by a mechanism for determining whether a document matches the query, producing the set of hits.]
13
Boolean Queries
Boolean query: two or more search terms, related by logical operators, e.g., and, or, not
Examples:
  abacus and actor
  abacus or actor
  (abacus and actor) or (abacus and atoll)
  not actor
14
Boolean Diagram
[Venn diagram of two sets A and B, marking the regions A and B, A or B, and not (A or B).]
15
Adjacent and Near Operators
abacus adj actor
  Terms abacus and actor are adjacent to each other, as in the string "abacus actor".
abacus near 4 actor
  Terms abacus and actor are near to each other, as in the string "the actor has an abacus".
Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).
16
Evaluation of Boolean Operators
Precedence of operators must be defined:
  adj, near   (highest)
  and, not
  or          (lowest)
Example: A and B or C and B is evaluated as (A and B) or (C and B)
17
Evaluating a Boolean Query
Example: abacus and actor
[Slide shows the postings for "abacus" and the postings for "actor".]
Document 19 is the only document that contains both terms, "abacus" and "actor".
To evaluate the and operator, merge the two inverted lists with a logical AND operation (see the sketch below).
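A minimal sketch (mine, not the lecture's) of that AND merge at the document level, assuming each postings list is a list of document ids sorted in increasing order:

```python
def and_merge(postings_a, postings_b):
    """Logical AND of two inverted lists: walk both sorted lists of
    document ids in step and keep the ids that appear in both."""
    hits, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            hits.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return hits

# Documents containing "abacus" and "actor" (ids taken from the postings slide)
print(and_merge([3, 19, 22], [19, 29]))   # [19]
```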
18
Evaluating an Adjacency Operation
Example: abacus adj actor
[Slide shows the postings for "abacus" and the postings for "actor", with document and location numbers.]
Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent to each other (see the sketch below).
19
Query Matching: Boolean Methods
  • Query: (abacus or asp*) and actor
  • 1. From the index file (word list), find the postings file for:
  • "abacus"
  • every word that begins "asp"
  • "actor"
  • 2. Merge these postings lists. For each document that occurs in any of
    the postings lists, evaluate the Boolean expression to see whether it
    is true or false.
  • Step 2 should be carried out in a single pass (see the sketch below).
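A minimal sketch of that single pass, using Python sets of document ids as postings and a hypothetical index dictionary; the trailing * marks the truncated term asp*:

```python
def postings_for(index, term):
    """A trailing * means truncation: union the postings of every indexed
    word that begins with the prefix."""
    if term.endswith("*"):
        prefix = term[:-1]
        docs = set()
        for word, posting in index.items():
            if word.startswith(prefix):
                docs |= posting
        return docs
    return index.get(term, set())

def evaluate_query(index):
    """Evaluate (abacus or asp*) and actor: fetch each postings list once,
    merge the candidate documents, then test the Boolean expression."""
    abacus = postings_for(index, "abacus")
    asp = postings_for(index, "asp*")
    actor = postings_for(index, "actor")
    candidates = abacus | asp | actor
    return {d for d in candidates if (d in abacus or d in asp) and d in actor}

index = {"abacus": {3, 19, 22}, "actor": {19, 29}, "aspen": {5}, "atoll": {3, 34}}
print(evaluate_query(index))   # {19}
```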

20
Use of Postings File for Query Matching
Term number   Term     Postings (document, location)
1             abacus   (3, 94), (19, 7), (19, 212), (22, 56)
2             actor    (–, 66), (19, 213), (29, 45)
3             aspen    (5, 43)
4             atoll    (3, 70), (34, 40)
21
Query Matching: Vector Ranking Methods
  • Query: abacus asp*
  • 1. From the index file (word list), find the postings file for:
  • "abacus"
  • every word that begins "asp"
  • 2. Merge these postings lists. Calculate the similarity to the query for
    each document that occurs in any of the postings lists.
  • 3. Sort the similarities to obtain the results in ranked order.
  • Steps 2 and 3 should be carried out in a single pass (see the sketch below).
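A minimal sketch of steps 2 and 3, with a hypothetical index mapping each term to (document, weight) postings; the "similarity" here is just the sum of the matching term weights, standing in for whatever weighting scheme is used:

```python
from collections import defaultdict

def ranked_retrieval(index, query_terms):
    """Merge the postings lists of the query terms, accumulating a score
    for every document seen, then sort the scores into ranked order."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc, weight in index.get(term, []):
            scores[doc] += weight          # e.g. a tf.idf weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

index = {"abacus": [(3, 0.4), (19, 0.8)], "aspen": [(5, 0.6), (19, 0.3)]}
print(ranked_retrieval(index, ["abacus", "aspen"]))   # doc 19 first, then 5, then 3
```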

22
Contrast of Ranking with Matching
With matching, a document either matches a query exactly or not at all:
  • Encourages short queries
  • Requires precise choice of index terms
  • Requires precise formulation of queries (professional training)
With retrieval using similarity measures, similarities range from 0 to 1 for all documents:
  • Encourages long queries, to have as many dimensions as possible
  • Benefits from large numbers of index terms
  • Benefits from queries with many terms, not all of which need match the document
23
Problems with the Boolean model
Counter-intuitive results:
Query q: A and B and C and D and E
Document d has terms A, B, C and D, but not E.
Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.
Query q: A or B or C or D or E
Document d1 has terms A, B, C, D and E.
Document d2 has term A, but not B, C, D or E.
Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.
24
Problems with the Boolean model (continued)
Boolean is all or nothing:
  • The Boolean model has no way to rank documents.
  • The Boolean model allows for no uncertainty in assigning index terms to documents.
  • The Boolean model has no provision for adjusting the importance of query terms.
25
Extending the Boolean model
  • Term weighting: give weights to terms in documents and/or queries.
  • Combine standard Boolean retrieval with vector ranking of results.
  • Fuzzy sets: relax the boundaries of the sets used in Boolean retrieval.
26
Ranking methods in Boolean systems
SIRE (Syracuse Information Retrieval Experiment)
Term weights:
  • Add term weights to documents.
  • Weights calculated by the standard method of term frequency x inverse document frequency.
Ranking:
  • Calculate the results set by standard Boolean methods.
  • Rank the results by vector distances (see the sketch below).
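A sketch of the overall flow described above (my own illustration, not SIRE's actual code), with tf.idf document vectors assumed to be given as dictionaries and cosine similarity standing in for the vector distance:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def sire_style_search(boolean_hits, doc_vectors, query_vector):
    """Results set from standard Boolean retrieval (boolean_hits),
    then ranked by vector similarity to the query."""
    return sorted(boolean_hits,
                  key=lambda d: cosine(doc_vectors[d], query_vector),
                  reverse=True)
```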
27
Relevance feedback in SIRE
SIRE (Syracuse Information Retrieval Experiment)
Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded:
  • The results set is created by standard Boolean retrieval.
  • The user selects one document from the results set.
  • Other documents in the collection are ranked by vector distance from this document.
28
Boolean model as sets
d is either in the set A or not in A.
[Diagram: element d and set A.]
29
Boolean model as fuzzy sets
d is more or less in A.
[Diagram: element d and fuzzy set A.]
30
Basic concept
A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document.
Term weights are in the range [0, 1]. (In the standard Boolean model all weights are either 0 or 1.)
For a given query, calculate the similarity between the query and each document in the collection.
This calculation is needed for every document that has a non-zero weight for any of the terms in the query.
31
MMM: Mixed Min and Max model
Fuzzy set theory:
  dA is the degree of membership of an element in set A
  intersection (and):  dA∩B = min(dA, dB)
  union (or):          dA∪B = max(dA, dB)
32
MMM: Mixed Min and Max model
Fuzzy set theory example:
                   standard set theory      fuzzy set theory
  dA               1    1    0    0         0.5  0.5  0    0
  dB               1    0    1    0         0.7  0    0.7  0
  and: dA∩B        1    0    0    0         0.5  0    0    0
  or:  dA∪B        1    1    1    0         0.7  0.5  0.7  0
33
MMM: Mixed Min and Max model
Terms: A1, A2, ..., An
Document d, with index-term weights: d1, d2, ..., dn
q_or = (A1 or A2 or ... or An)
Query-document similarity:
  S(q_or, d) = λ_or * max(d1, d2, ..., dn) + (1 - λ_or) * min(d1, d2, ..., dn)
With regular Boolean logic, λ_or = 1.
34
MMM: Mixed Min and Max model
Terms: A1, A2, ..., An
Document d, with index-term weights: d1, d2, ..., dn
q_and = (A1 and A2 and ... and An)
Query-document similarity:
  S(q_and, d) = λ_and * min(d1, ..., dn) + (1 - λ_and) * max(d1, ..., dn)
With regular Boolean logic, λ_and = 1.
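A minimal sketch of these two formulas, where weights holds the index-term weights d1, ..., dn of the query terms in one document:

```python
def mmm_or(weights, lam_or):
    """S(q_or, d) = lam_or * max(d1..dn) + (1 - lam_or) * min(d1..dn)"""
    return lam_or * max(weights) + (1 - lam_or) * min(weights)

def mmm_and(weights, lam_and):
    """S(q_and, d) = lam_and * min(d1..dn) + (1 - lam_and) * max(d1..dn)"""
    return lam_and * min(weights) + (1 - lam_and) * max(weights)

weights = [0.5, 0.7]
print(mmm_and(weights, 1.0))   # 0.5 -- regular fuzzy-set 'and' (min)
print(mmm_or(weights, 0.6))    # 0.62
```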
35
MMM: Mixed Min and Max model
Experimental values:
  λ_and in the range [0.5, 0.8]
  λ_or > 0.2
Computational cost is low. Retrieval performance is much improved.
36
Other Models
Paice model:
  The MMM model considers only the maximum and minimum document weights. The Paice model takes into account all of the document weights. Computational cost is higher than MMM.
P-norm model:
  Document D, with term weights: dA1, dA2, ..., dAn
  Query terms are given weights: a1, a2, ..., an
  Operators have coefficients that indicate degree of strictness.
  Query-document similarity is calculated by considering each document and query as a point in n-space (see the sketch below).
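The slide does not give the p-norm formulas. For reference, the similarities usually quoted for this model (and assumed in the sketch below, which is mine) combine each query-term weight a_i with the document weight d_i, with the parameter p acting as the strictness coefficient:

```python
def pnorm_or(doc_weights, query_weights, p):
    """sim(q_or, d) = ( sum(a_i^p * d_i^p) / sum(a_i^p) ) ** (1/p)"""
    num = sum((a ** p) * (d ** p) for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return (num / den) ** (1.0 / p)

def pnorm_and(doc_weights, query_weights, p):
    """sim(q_and, d) = 1 - ( sum(a_i^p * (1 - d_i)^p) / sum(a_i^p) ) ** (1/p)"""
    num = sum((a ** p) * ((1.0 - d) ** p) for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return 1.0 - (num / den) ** (1.0 / p)

# As p grows large these approach strict Boolean and/or; p = 1 behaves like
# a weighted vector model.
```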
37
Test data
           CISI   CACM   INSPEC
  P-norm    79    106    210
  Paice     77    104    206
  MMM       68    109    195
Percentage improvement over standard Boolean model (average best precision). Lee and Fox, 1988.
38
Reading
E. Fox, S. Betrabet, M. Koushik and W. Lee, "Extended Boolean Models", in Frakes, Chapter 15. Methods based on fuzzy set concepts.