Chapter 5: Query Operations

Provided by: csieN5

1
Chapter 5 Query Operations
  • Baeza-Yates, 1999
  • Modern Information Retrieval

2
Query Modification
  • Improving initial query formulation
  • Relevance feedback: approaches based on feedback information from
    users
  • Local analysis: approaches based on information derived from the
    set of documents initially retrieved (called the local set of
    documents)
  • Global analysis: approaches based on global information derived
    from the document collection

3
Relevance Feedback
  • Relevance feedback process
  • it shields the user from the details of the query
    reformulation process
  • it breaks down the whole searching task into a
    sequence of small steps which are easier to grasp
  • it provides a controlled process designed to
    emphasize some terms and de-emphasize others
  • Two basic techniques
  • Query expansion: addition of new terms from relevant documents
  • Term reweighting: modification of term weights based on the user's
    relevance judgement

4
Vector Space Model
  • Definition
  • wi,j: the ith term in the vector for document dj
  • wi,k: the ith term in the vector for query qk
  • t: the number of unique terms in the data set
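
A minimal Python sketch of how two such vectors are usually compared, assuming the standard cosine measure of the vector space model. The function name and the plain-list representation of the weights are illustrative choices.

```python
import math


def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between a document vector and a query vector.

    Both arguments are lists of t term weights (w_ij and w_ik).
    """
    if len(doc_weights) != len(query_weights):
        raise ValueError("both vectors must have t components")
    dot = sum(wd * wq for wd, wq in zip(doc_weights, query_weights))
    norm_d = math.sqrt(sum(wd * wd for wd in doc_weights))
    norm_q = math.sqrt(sum(wq * wq for wq in query_weights))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_d * norm_q)
```

Documents are then ranked by decreasing similarity to the query.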

5
Query Expansion and Term Reweighting for the
Vector Model
  • Ideal situation
  • CR: set of relevant documents among all documents
    in the collection
  • Rocchio (1965, 1971)
  • R: set of relevant documents, as identified by
    the user among the retrieved documents
  • S: set of non-relevant documents among the
    retrieved documents

6
Rocchio's Algorithm
  • Ide_Regular (1971)
  • Ide_Dec_Hi
  • Parameters
  • α = β = γ = 1
  • β > γ = 0
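
A minimal Python sketch of the three feedback formulas named on this slide, assuming their standard formulations: Rocchio divides the relevant and non-relevant sums by |R| and |S|, Ide_Regular uses the raw sums, and Ide_Dec_Hi subtracts only the highest-ranked non-relevant document. Vectors are plain lists of term weights; the function names and the defaults α = β = γ = 1 are illustrative.

```python
def _add_scaled(target, vector, scale):
    """Add scale * vector to target in place."""
    for i, value in enumerate(vector):
        target[i] += scale * value


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q_m = alpha*q + (beta/|R|)*sum(R) - (gamma/|S|)*sum(S)."""
    modified = [alpha * w for w in query]
    for doc in relevant:
        _add_scaled(modified, doc, beta / len(relevant))
    for doc in nonrelevant:
        _add_scaled(modified, doc, -gamma / len(nonrelevant))
    return modified


def ide_regular(query, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q_m = alpha*q + beta*sum(R) - gamma*sum(S), without normalization."""
    modified = [alpha * w for w in query]
    for doc in relevant:
        _add_scaled(modified, doc, beta)
    for doc in nonrelevant:
        _add_scaled(modified, doc, -gamma)
    return modified


def ide_dec_hi(query, relevant, top_nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Like ide_regular, but subtracts only the top-ranked non-relevant document."""
    modified = [alpha * w for w in query]
    for doc in relevant:
        _add_scaled(modified, doc, beta)
    if top_nonrelevant is not None:
        _add_scaled(modified, top_nonrelevant, -gamma)
    return modified
```

Setting gamma to 0 in any of the three gives a purely positive feedback strategy. Negative weights in the result are left as they are here; clipping them to zero is a separate design choice.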

7
Probabilistic Model
  • Definition
  • pi: the probability of observing term ti in the
    set of relevant documents
  • qi: the probability of observing term ti in the
    set of nonrelevant documents
  • Initial search assumption
  • pi is constant for all terms ti (typically 0.5)
  • qi can be approximated by the distribution of ti
    in the whole collection

8
Term Reweighting for the Probabilistic Model
  • Robertson and Sparck Jones (1976)
  • With relevance feedback from user
  • N: the number of documents in the collection
  • R: the number of relevant documents for query q
  • ni: the number of documents having term ti
  • ri: the number of relevant documents having term
    ti

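Contingency table of document relevance against document indexing:

                 Relevant    Non-relevant       Total
  ti present     ri          ni - ri            ni
  ti absent      R - ri      N - ni - R + ri    N - ni
  Total          R           N - R              N
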
9
Term Reweighting for the Probabilistic Model
(Cont.)
  • Initial search assumption
  • pi is constant for all terms ti (typically 0.5)
  • qi can be approximated by the distribution of ti
    in the whole collection: $q_i = n_i / N$
  • With relevance feedback from users
  • pi and qi can be approximated by
    $p_i = r_i / R$ and $q_i = (n_i - r_i) / (N - R)$
  • hence the term weight is updated by
    $w_i = \log \frac{r_i / (R - r_i)}{(n_i - r_i) / (N - n_i - R + r_i)}$

10
Term Reweighting for the Probabilistic Model
(Cont.)
  • However, the last formula poses problems for
    certain small values of R and ri (e.g., R = 1, ri = 0)
  • Instead of 0.5, alternative adjustments have been
    proposed
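
A minimal Python sketch of the adjusted weight, assuming the usual smoothed estimates pi = (ri + a) / (R + 1) and qi = (ni - ri + a) / (N - R + 1), with a = 0.5 by default. The function name and the `adjustment` parameter are illustrative; passing ni / N as the adjustment is one alternative that has been suggested.

```python
import math


def rsj_weight(N, R, n_i, r_i, adjustment=0.5):
    """Relevance-feedback term weight with an additive adjustment.

    N: documents in the collection, R: known relevant documents,
    n_i: documents containing the term, r_i: relevant documents containing it.
    """
    p_i = (r_i + adjustment) / (R + 1)
    q_i = (n_i - r_i + adjustment) / (N - R + 1)
    return math.log(p_i / (1 - p_i)) + math.log((1 - q_i) / q_i)
```

With R = 1 and ri = 0 the unadjusted estimate gives pi = 0 and the logarithm is undefined, whereas `rsj_weight(100, 1, 10, 0)` stays finite.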

11
Term Reweighting for the Probabilistic Model
(Cont.)
  • Characteristics
  • Advantage
  • the term reweighting is optimal under the
    assumptions of
  • term independence
  • binary document indexing (wi,q ∈ {0,1} and wi,j
    ∈ {0,1})
  • Disadvantage
  • no query expansion is used
  • weights of terms in the previous query
    formulations are also disregarded
  • document term weights are not taken into account
    during the feedback loop

12
Evaluation of relevance feedback
  • The standard evaluation method (i.e., recall-precision)
    is not suitable, because the relevant documents used to
    reweight the query terms are moved to higher ranks
  • The residual collection method
  • the residual collection is the set of all documents minus
    the set of feedback documents provided by the user
  • because highly ranked documents are removed from
    the collection, the recall-precision figures for
    the modified query tend to be lower than the figures
    for the original query
  • as a basic rule of thumb, any experimentation
    involving relevance feedback strategies should
    always evaluate recall-precision figures relative
    to the residual collection
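
A minimal Python sketch of evaluating against the residual collection, assuming rankings are lists of document identifiers. The helper names and the choice of precision at k as the measure are illustrative.

```python
def residual_ranking(ranking, feedback_docs):
    """Drop every document the user already judged from the ranking."""
    feedback = set(feedback_docs)
    return [doc for doc in ranking if doc not in feedback]


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def residual_precision_at_k(ranking, relevant, feedback_docs, k):
    """Precision at k measured on the residual collection only."""
    feedback = set(feedback_docs)
    residual_relevant = set(relevant) - feedback
    return precision_at_k(residual_ranking(ranking, feedback), residual_relevant, k)
```

Both the original and the modified query should be scored this way, so the feedback documents cannot inflate either figure.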

13
Automatic Local Analysis
  • Definition
  • local document set Dl: the set of documents
    retrieved by a query
  • local vocabulary Vl: the set of all distinct
    words in Dl
  • stemmed vocabulary Sl: the set of all distinct
    stems derived from Vl
  • Building local clusters
  • association clusters
  • metric clusters
  • scalar clusters

14
Association Clusters
  • Idea
  • co-occurrence of stems (or terms) inside
    documents
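  • fu,j: the frequency of a stem ku in a document dj
  • correlation: $c(k_u, k_v) = \sum_{d_j \in D_l} f_{u,j} \times f_{v,j}$
  • local association cluster for a stem ku
  • the set of k largest values c(ku, kv)
  • given a query q, find clusters for the |q| query
    terms
  • normalized form: $s(k_u, k_v) = \frac{c(k_u, k_v)}{c(k_u, k_u) + c(k_v, k_v) - c(k_u, k_v)}$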
15
Metric Clusters
  • Idea
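  • consider the distance between two terms in the
    same document
  • Definition
  • V(ku): the set of keywords which have the same
    stem form as ku
  • distance r(ki, kj): the number of words between
    keywords ki and kj
  • correlation: $c(k_u, k_v) = \sum_{k_i \in V(k_u)} \sum_{k_j \in V(k_v)} \frac{1}{r(k_i, k_j)}$
  • normalized form: $s(k_u, k_v) = \frac{c(k_u, k_v)}{|V(k_u)| \times |V(k_v)|}$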
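
A minimal Python sketch of metric correlations, assuming c(ku, kv) sums 1 / r(ki, kj) over keyword pairs with stems ku and kv, and that the normalized form divides by |V(ku)| × |V(kv)|. Two further assumptions are made for the sketch: every pair of occurrences inside a document contributes, and the distance is the difference in word positions (so adjacent words have distance 1). The `stem` argument is any function from keyword to stem; the names are illustrative.

```python
from collections import defaultdict


def metric_correlations(documents, stem):
    """documents: list of token lists. Returns {(stem_u, stem_v): c(u, v)}."""
    c = defaultdict(float)
    for tokens in documents:
        stems = [stem(keyword) for keyword in tokens]
        for i, su in enumerate(stems):
            for j, sv in enumerate(stems):
                if i != j and su != sv:
                    c[(su, sv)] += 1.0 / abs(i - j)
    return dict(c)


def keyword_sets(documents, stem):
    """V(k_u): the distinct keywords that share each stem."""
    v = defaultdict(set)
    for tokens in documents:
        for keyword in tokens:
            v[stem(keyword)].add(keyword)
    return dict(v)


def normalized_metric(c, v, su, sv):
    """c(u, v) divided by |V(k_u)| * |V(k_v)|."""
    size = len(v.get(su, ())) * len(v.get(sv, ()))
    return c.get((su, sv), 0.0) / size if size else 0.0
```

The double loop is quadratic in document length, which is acceptable for a small local document set but would need a distance window for long texts.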

16
Scalar Clusters
  • Idea
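  • two stems with similar neighborhoods have some
    synonymity relationship
  • Definition
  • cu,v = c(ku, kv)
  • vectors of correlation values for stems ku and kv:
    $\vec{s}_u = (c_{u,1}, \ldots, c_{u,n})$ and $\vec{s}_v = (c_{v,1}, \ldots, c_{v,n})$
  • scalar association matrix: $s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}$
  • scalar clusters
  • the set of k largest values of scalar association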
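
A minimal Python sketch of scalar clusters, assuming the scalar association of two stems is the cosine between their rows of the correlation matrix. The matrix `c` is a square list of lists indexed by stem number; the function names are illustrative.

```python
import math


def scalar_association(c, u, v):
    """Cosine between the correlation vectors (rows) of stems u and v."""
    su, sv = c[u], c[v]
    dot = sum(a * b for a, b in zip(su, sv))
    norm = math.sqrt(sum(a * a for a in su)) * math.sqrt(sum(b * b for b in sv))
    return dot / norm if norm else 0.0


def scalar_cluster(c, u, k):
    """Indices of the k stems whose neighborhoods are most similar to u's."""
    scores = [(scalar_association(c, u, v), v) for v in range(len(c)) if v != u]
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [v for _, v in scores[:k]]
```

Any correlation matrix can be used as input, for example one built from association or metric correlations.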

17
Automatic Global Analysis
  • A thesaurus-like structure
  • Short history
  • Until the beginning of the 1990s, global analysis
    was considered to be a technique which failed to
    yield consistent improvements in retrieval
    performance with general collections
  • This perception has changed with the appearance
    of modern procedures for global analysis

18
Query Expansion based on a Similarity Thesaurus
  • Idea by Qiu and Frei (1993)
  • Similarity thesaurus is based on term to term
    relationships rather than on a matrix of
    co-occurrence
  • Terms for expansion are selected based on their
    similarity to the whole query rather than on
    their similarities to individual query terms
  • Definition
  • N: total number of documents in the collection
  • t: total number of terms in the collection
  • tfi,j: occurrence frequency of term ki in the
    document dj
  • tj: the number of distinct index terms in the
    document dj
  • itfj: the inverse term frequency for document dj,
    $itf_j = \log(t / t_j)$

19
Similarity Thesaurus
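  • Each term ki is associated with a vector
    $\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})$
  • where wi,j is a weight associated with the
    index term-document pair [ki, dj]
  • The relationship between two terms ku and kv is
    the correlation factor $c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}$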
  • Note that this is a variation of the correlation
    measure used for computing scalar association
    matrices
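
A minimal Python sketch of building the term vectors and their correlations, assuming the weighting usually attributed to Qiu and Frei: a raw weight (0.5 + 0.5 × tfi,j / max_j tfi,j) × itfj with itfj = log(t / tj), normalized so that each term vector has unit length. Giving weight 0 to documents that do not contain the term is an assumption of this sketch, and the function names are illustrative.

```python
import math


def term_vectors(tf):
    """tf[i][j]: frequency of term i in document j. Returns weights w[i][j]."""
    t = len(tf)
    n_docs = len(tf[0])
    # t_j: number of distinct index terms in document j
    distinct = [sum(1 for i in range(t) if tf[i][j] > 0) for j in range(n_docs)]
    itf = [math.log(t / tj) if tj else 0.0 for tj in distinct]
    vectors = []
    for row in tf:
        max_f = max(row)
        raw = [(0.5 + 0.5 * f / max_f) * itf[j] if f > 0 else 0.0
               for j, f in enumerate(row)]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vectors


def term_correlation(vectors, u, v):
    """c_uv: inner product of the vectors of terms u and v."""
    return sum(a * b for a, b in zip(vectors[u], vectors[v]))
```

Here documents play the role that terms play in ordinary document indexing: each term is described by the documents it occurs in.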

20
Term weighting vs. Term concept space
[Figure: two panels contrasting term weighting with the term-concept space; each links a term ki and a document dj through tfij]

21
Query Expansion Procedure with Similarity
Thesaurus
  • 1. Represent the query in the concept space by
    using the representation of the index terms
  • 2. Compute the similarity sim(q,kv) between each
    term kv and the whole query
  • 3. Expand the query with the top r ranked terms
    according to sim(q,kv)
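
The three steps can be sketched in Python as follows, assuming a precomputed matrix `c` of term correlations and sim(q, kv) = Σ over query terms ku of wu,q × cu,v. The query is a dictionary from term index to weight; the function name is illustrative.

```python
def expand_query(query_weights, c, r):
    """Return the r terms most similar to the whole query, with their scores.

    query_weights: {term index u: weight w_uq}
    c: square matrix (list of lists) of term correlations c_uv
    """
    sims = {}
    for v in range(len(c)):
        if v in query_weights:
            continue  # only terms not already in the query are candidates
        sims[v] = sum(w * c[u][v] for u, w in query_weights.items())
    ranked = sorted(sims, key=lambda v: (-sims[v], v))
    return [(v, sims[v]) for v in ranked[:r]]
```

Because the score sums over all query terms, a candidate is favored when it is close to the query as a whole rather than to a single query term.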

22
Example of Similarity Thesaurus
  • The distance of a given term kv to the query
    centroid QC might be quite distinct from the
    distances of kv to the individual query terms

[Figure: terms ki, kj, and kv plotted with the query terms ka and kb and the query centroid QC; QC = {ka, kb}]

23
Query Expansion based on a Similarity Thesaurus
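  • A document dj is represented in the term-concept space
    by $\vec{d}_j = \sum_{k_i \in d_j} w_{i,j} \vec{k}_i$
  • If the original query q is expanded to include
    all the t index terms, then the similarity sim(q,
    dj) between the document dj and the query q can
    be computed as
    $sim(q, d_j) \propto \sum_{k_v \in d_j} \sum_{k_u \in q} w_{v,j} \times w_{u,q} \times c_{u,v}$
  • which is similar to the generalized vector space
    model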

24
Query Expansion based on a Statistical Thesaurus
  • Idea by Crouch and Yang (1992)
  • Use complete link algorithm to produce small and
    tight clusters
  • Use term discrimination value to select terms for
    entry into a particular thesaurus class
  • Term discrimination value
  • A measure of the change in space separation which
    occurs when a given term is assigned to the
    document collection
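
A minimal Python sketch of a term discrimination value, assuming space density is measured as the average pairwise cosine similarity between documents and the value of term k is the density without the term minus the density with it. Other density measures (for example, similarity to a collection centroid) are also used; the function names are illustrative.

```python
import math
from itertools import combinations


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def space_density(docs):
    """Average pairwise cosine similarity of the document vectors."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        raise ValueError("need at least two documents")
    return sum(_cosine(a, b) for a, b in pairs) / len(pairs)


def discrimination_value(docs, k):
    """Density without term k minus density with it.

    Positive: assigning the term spreads documents apart (good discriminator).
    Negative: assigning the term packs them together (poor discriminator).
    """
    without_k = [[w for i, w in enumerate(doc) if i != k] for doc in docs]
    return space_density(without_k) - space_density(docs)
```

In the toy collection `[[1, 1, 0], [1, 0, 1], [1, 0, 0]]`, term 0 occurs in every document and gets a negative value, while term 1 occurs in one document and gets a positive one.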

25
Term Discrimination Value
  • Terms
  • good discriminators (terms with positive
    discrimination values): index terms
  • indifferent discriminators (near-zero
    discrimination values): thesaurus class
  • poor discriminators (negative discrimination
    values): term phrases
  • Document frequency dfk
  • dfk > n/10: high-frequency term (poor
    discriminator)
  • dfk < n/100: low-frequency term (indifferent
    discriminator)
  • n/100 ≤ dfk ≤ n/10: good discriminator
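
The document-frequency thresholds can be written as a small Python function, assuming n is the number of documents in the collection. The function name and the returned labels are illustrative.

```python
def classify_by_document_frequency(df_k, n):
    """Map a term's document frequency to a discriminator class."""
    if df_k > n / 10:
        return "poor"         # high-frequency term
    if df_k < n / 100:
        return "indifferent"  # low-frequency term
    return "good"             # n/100 <= df_k <= n/10
```

This gives a cheap proxy for the discrimination value, since it needs only document frequencies rather than pairwise document similarities.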

26
Statistical Thesaurus
  • Term discrimination value theory
  • the terms which make up a thesaurus class must be
    indifferent discriminators
  • The proposed approach
  • cluster the document collection into small, tight
    clusters
  • A thesaurus class is defined as the intersection
    of all the low frequency terms in that cluster
  • documents are indexed by the thesaurus classes
  • the thesaurus classes are weighted by

27
Discussion
  • Query expansion
  • useful
  • little explored technique
  • Trends and research issues
  • The combination of local analysis, global
    analysis, visual displays, and interactive
    interfaces is also a current and important
    research problem