Modeling - PowerPoint PPT Presentation

About This Presentation
Title:

Modeling

Slides: 18
Provided by: chiahu

Transcript and Presenter's Notes


1
Modeling
  • Modern Information Retrieval
  • by R. Baeza-Yates and B. Ribeiro-Neto
  • Addison-Wesley, 1999.
  • (Chapter 2)

2
Introduction
  • Ranking algorithms
  • The central problem regarding IR systems is the
    issue of predicting which documents are relevant
    and which are not.
  • Taxonomy of IR Models
  • Boolean (set theoretic)
  • Vector (algebraic)
  • Probabilistic

3
Retrieval
  • Ad hoc
  • the documents in the collection remain relatively
    static while new queries are submitted to the
    system
  • Filtering (Routing)
  • the queries remain relatively static while new
    documents come into the system
  • construction of user profile

4
Basic Concepts
  • In the classic models
  • each document is described by a set of
    representative keywords called index terms
  • index terms are mainly nouns
  • distinct index terms have varying relevance
  • index term weights are usually assumed to be
    mutually independent

5
Boolean Model
  • Binary decision criterion
  • Data retrieval model
  • A query is a Boolean expression which can be
    represented as a disjunction of conjunctive
    vectors
  • Advantage
  • clean formalism, simplicity
  • Disadvantage
  • exact matching may lead to retrieval of too few
    or too many documents
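
The binary decision criterion can be illustrated with a small sketch. It assumes a hypothetical toy collection in which each document is a set of index terms, and it encodes a query in disjunctive normal form as a list of (required terms, forbidden terms) pairs. The names `docs`, `matches`, and `boolean_retrieve` are illustrative and do not come from the slides.

```python
# Hypothetical toy collection: each document is the set of its index terms.
docs = {
    "d1": {"information", "retrieval", "model"},
    "d2": {"vector", "model"},
    "d3": {"information", "vector"},
}


def matches(doc_terms, dnf_query):
    """Return True if the document satisfies at least one conjunctive component.

    Each component is a pair (required, forbidden) of term sets.
    """
    return any(
        required <= doc_terms and not (forbidden & doc_terms)
        for required, forbidden in dnf_query
    )


def boolean_retrieve(docs, dnf_query):
    """Return the ids of all matching documents; there is no ranking."""
    return sorted(d for d, terms in docs.items() if matches(terms, dnf_query))


# (information AND retrieval) OR (vector AND NOT model)
query = [
    ({"information", "retrieval"}, set()),
    ({"vector"}, {"model"}),
]
result = boolean_retrieve(docs, query)
```

A document either matches or it does not, which is why an exact-match query can return too few or too many documents.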

6
Vector Model (1/4)
  • Index terms are assigned non-binary weights
  • Term weights are used to compute the degree of
    similarity between documents and the user query
  • Then, retrieved documents are sorted in
    decreasing order of similarity.
  • Definition: For the vector model, the weight $w_{i,j}$
    is associated with term $k_i$ and document $d_j$

7
Vector Model (2/4)
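  • Degree of similarity: the cosine of the angle
    between the document vector and the query vector
    $sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$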
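
A minimal sketch of cosine ranking, assuming documents and queries are stored as hypothetical dictionaries that map index terms to weights. The function names `cosine_similarity` and `rank` are illustrative and not taken from the slides.

```python
import math


def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * query_weights.get(t, 0.0) for t, w in doc_weights.items())
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_d * norm_q)


def rank(docs, query_weights):
    """Return document ids sorted by decreasing similarity to the query."""
    return sorted(
        docs,
        key=lambda d: cosine_similarity(docs[d], query_weights),
        reverse=True,
    )


# Hypothetical weights, for illustration only.
docs = {
    "d1": {"a": 1.0, "b": 1.0},
    "d2": {"a": 1.0},
}
ranking = rank(docs, {"a": 1.0})
```

Because the score is a real number rather than a yes/no decision, documents that match the query only partially still receive a rank.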

8
Vector Model (3/4)
  • Salton
  • IR vs. clustering
  • intra-cluster similarity: tf factor (term
    frequency)
  • inter-cluster dissimilarity: idf factor (inverse
    document frequency)
  • Definition
  • normalized frequency
  • inverse document frequency
  • term-weighting schemes
  • query-term weights
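
The formulas for these definitions did not survive in the transcript. The sketch below assumes the commonly used formulation: normalized frequency f = freq / max freq within the document, idf = log(N / n_i), document weight w = f × idf, and query weight (0.5 + 0.5 f) × idf. All function names are hypothetical, and each document is assumed to be a non-empty `Counter` of raw term frequencies.

```python
import math
from collections import Counter


def idf(term, collection):
    """Inverse document frequency log(N / n_i); 0.0 if the term is unseen."""
    n_i = sum(1 for counts in collection if term in counts)
    return math.log(len(collection) / n_i) if n_i else 0.0


def doc_weights(counts, collection):
    """tf-idf weights with tf normalized by the most frequent term."""
    max_freq = max(counts.values())
    return {t: (f / max_freq) * idf(t, collection) for t, f in counts.items()}


def query_weights(counts, collection):
    """Query weights (0.5 + 0.5 * normalized tf) * idf (assumed form)."""
    max_freq = max(counts.values())
    return {
        t: (0.5 + 0.5 * f / max_freq) * idf(t, collection)
        for t, f in counts.items()
    }


# Hypothetical two-document collection.
collection = [Counter("a a b".split()), Counter("b c".split())]
w = doc_weights(collection[0], collection)
```

A term that occurs in every document, such as `b` here, gets idf 0 and so contributes nothing to the ranking, which is the inter-cluster dissimilarity effect the idf factor is meant to capture.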

9
Vector Model (4/4)
  • Advantages
  • its term-weighting scheme improves retrieval
    performance
  • its partial matching strategy allows retrieval of
    documents that approximate the query conditions
  • its cosine ranking formula sorts the documents
    according to their degree of similarity to the
    query
  • Disadvantage
  • The assumption of mutual independence between
    index terms

10
Probabilistic Model (1/7)
  • Introduced by Robertson and Sparck Jones, 1976
  • Also called binary independence retrieval (BIR)
    model
  • Idea: Given a user query q and the ideal answer
    set of relevant documents, the problem is to
    specify the properties of this set.
  • i.e., the probabilistic model tries to estimate the
    probability that the user will find the document
    $d_j$ relevant, and ranks documents by the ratio
  • P($d_j$ relevant to q) / P($d_j$ nonrelevant to q)

11
Probabilistic Model (2/7)
  • Definition
  • All index term weights are binary, i.e.,
    $w_{i,j} \in \{0, 1\}$
  • Let $R$ be the set of documents known to be relevant
    to query q
  • Let $\bar{R}$ be the complement of $R$
  • Let $P(R \mid d_j)$ be the probability that the
    document $d_j$ is relevant to the query q
  • Let $P(\bar{R} \mid d_j)$ be the probability that the
    document $d_j$ is nonrelevant to query q
12
Probabilistic Model (3/7)
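  • The similarity $sim(d_j, q)$ of the document $d_j$ to
    the query q is defined as the ratio
    $sim(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}$
  • Using Bayes' rule,
    $sim(d_j, q) = \frac{P(d_j \mid R) \, P(R)}{P(d_j \mid \bar{R}) \, P(\bar{R})}$
  • $P(R)$ stands for the probability that a document
    randomly selected from the entire collection is
    relevant
  • $P(d_j \mid R)$ stands for the probability of
    randomly selecting the document $d_j$ from the set $R$
    of relevant documents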

13
Probabilistic Model (4/7)
  • Assuming independence of index terms and given
    $q = (d_1, d_2, \ldots, d_t)$,

14
Probabilistic Model (5/7)
  • $\Pr(k_i \mid R)$ stands for the probability that the
    index term $k_i$ is present in a document randomly
    selected from the set $R$
  • $\Pr(\bar{k_i} \mid R)$ stands for the probability that the
    index term $k_i$ is not present in a document
    randomly selected from the set $R$
  • let $\Pr(k_i \mid R) = p_i$
  • $d_i$ is either 0 or 1
  • 0: $d_i$ is absent from q
  • 1: $d_i$ is present in q

15
Probabilistic Model (6/7)

16
Probabilistic Model (7/7)
  • The retrieval value of each $k_i$ present in a
    document (i.e., $d_i = 1$) is the term relevance weight
  • $p_j = 0.5$, $q_j = df_j / N$
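
The weight formula itself is missing from the transcript. The sketch below assumes the usual Robertson and Sparck Jones form, log(p(1 − q) / (q(1 − p))), and plugs in the estimates p = 0.5 and q = df / N given above. The function names and the `score` helper are hypothetical.

```python
import math


def term_relevance_weight(p, q):
    """Assumed form: log of the odds ratio p(1 - q) / (q(1 - p))."""
    return math.log((p * (1 - q)) / (q * (1 - p)))


def initial_weight(df, n_docs):
    """Weight under the estimates p = 0.5 and q = df / N."""
    return term_relevance_weight(0.5, df / n_docs)


def score(doc_terms, query_terms, df, n_docs):
    """Sum the weights of query terms that are present in the document.

    Terms occurring in no document or in every document are skipped,
    because the weight is undefined for q = 0 or q = 1.
    """
    return sum(
        initial_weight(df[t], n_docs)
        for t in query_terms & doc_terms
        if 0 < df.get(t, 0) < n_docs
    )


# Hypothetical numbers: a term that appears in 10 of 100 documents.
w = initial_weight(10, 100)
```

With p = 0.5 the expression reduces to log((N − df) / df), so rare terms receive larger weights, much like an idf factor.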

17
Estimation of Term Relevance