Modeling

About This Presentation

Slides: 18

Provided by: chiahu

Transcript and Presenter's Notes

1
Modeling

2
Introduction

Ranking algorithms
The central problem regarding IR systems is the
issue of predicting which documents are relevant
and which are not.
Taxonomy of IR Models
Boolean (set theoretic)
Vector (algebraic)
Probabilistic

3
Retrieval

Ad hoc
the documents in the collection remain relatively
static while new queries are submitted to the
system
Filtering (Routing)
the queries remain relatively static while new
documents come into the system
construction of user profile

4
Basic Concepts

In the classic models
each document is described by a set of
representative keywords called index terms
index terms are mainly nouns
distinct index terms have varying relevance
index term weights are usually assumed to be
mutually independent

5
Boolean Model

Binary decision criterion
Data retrieval model
A query is a Boolean expression which can be
represented as a disjunction of conjunctive
vectors
Advantage
clean formalism, simplicity
Disadvantage
exact matching may lead to retrieval of too few
or too many documents
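
The binary decision criterion can be illustrated with a minimal Python sketch. The toy collection, the query, and the names `matches` and `boolean_retrieve` are hypothetical and chosen only for illustration. The query is written as a disjunction of conjunctive components, and a document is either retrieved or not, with no ranking.

```python
# Hypothetical toy collection: each document is the set of index terms it contains.
docs = {
    "d1": {"information", "retrieval", "model"},
    "d2": {"retrieval", "ranking"},
    "d3": {"model", "ranking"},
}

# Query in disjunctive normal form: a list of conjunctive components.
# Each component maps an index term to whether it must be present (True)
# or absent (False). This one encodes:
#   (retrieval AND model) OR (ranking AND NOT retrieval)
query = [
    {"retrieval": True, "model": True},
    {"ranking": True, "retrieval": False},
]


def matches(doc_terms, dnf_query):
    """Return True if the document satisfies at least one conjunctive component."""
    return any(
        all((term in doc_terms) == required for term, required in conj.items())
        for conj in dnf_query
    )


def boolean_retrieve(collection, dnf_query):
    """Return the names of all matching documents (exact match, unranked)."""
    return sorted(name for name, terms in collection.items() if matches(terms, dnf_query))


result = boolean_retrieve(docs, query)
```

Here `d2` is rejected outright even though it shares a term with the query, which is the exact-matching behavior listed as the disadvantage.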

6
Vector Model (1/4)

Index terms are assigned non-binary weights
Term weights are used to compute the degree of
similarity between documents and the user query
The retrieved documents are then sorted in
decreasing order of this degree of similarity.
Definition: For the vector model, the weight $w_{i,j}$
is associated with term $k_i$ and document $d_j$

7
Vector Model (2/4)

8
Vector Model (3/4)

9
Vector Model (4/4)

Advantages
its term-weighting scheme improves retrieval
performance
its partial matching strategy allows retrieval of
documents that approximate the query conditions
its cosine ranking formula sorts the documents
according to their degree of similarity to the
query
Disadvantage
The assumption of mutual independence between
index terms
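
The cosine ranking mentioned above can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation: the term weights are made-up numbers rather than the output of any particular weighting scheme, and the names `cosine_similarity` and `rank` are hypothetical.

```python
import math


def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between a document vector and a query vector.

    Both vectors are dicts mapping index term -> non-binary weight.
    """
    dot = sum(w * query_weights.get(term, 0.0) for term, w in doc_weights.items())
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is treated as having no similarity
    return dot / (norm_d * norm_q)


def rank(collection, query_weights):
    """Return (document, similarity) pairs in decreasing order of similarity."""
    scored = [(name, cosine_similarity(w, query_weights)) for name, w in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights, for illustration only.
docs = {
    "d1": {"retrieval": 1.0, "model": 1.0},
    "d2": {"retrieval": 2.0},
    "d3": {"ranking": 1.0},
}
query = {"retrieval": 1.0, "model": 1.0}

ranking = rank(docs, query)
```

Document `d2` matches only one of the two query terms yet still receives a non-zero score, which illustrates the partial matching listed among the advantages.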

10
Probabilistic Model (1/7)

Introduced by Robertson and Sparck Jones, 1976
Also called binary independence retrieval (BIR)
model
Idea Given a user query q, and the ideal answer
set of the relevant documents, the problem is to
specify the properties for this set.
i.e., the probabilistic model tries to estimate the
probability that the user will find the document
$d_j$ relevant, and ranks documents by the ratio
P($d_j$ relevant to q) / P($d_j$ non-relevant to q)

11
Probabilistic Model (2/7)

12
Probabilistic Model (3/7)

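The similarity $sim(d_j, q)$ of the document $d_j$ to
the query q is defined as the ratio
$sim(d_j, q) = P(R \mid d_j) / P(\bar{R} \mid d_j)$
Using Bayes' rule,
$sim(d_j, q) = \frac{P(d_j \mid R)\,P(R)}{P(d_j \mid \bar{R})\,P(\bar{R})}$
$P(R)$ stands for the probability that a document
randomly selected from the entire collection is
relevant
$P(d_j \mid R)$ stands for the probability of
randomly selecting the document $d_j$ from the set R
of relevant documents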
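
The Bayes step can be written out explicitly. The following is a sketch of the standard derivation; the notation $\bar{R}$ for the set of non-relevant documents is an assumption made here and does not appear in the surviving slide text.

```latex
\begin{align*}
sim(d_j, q)
  &= \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)} \\
  &= \frac{P(d_j \mid R)\,P(R) \,/\, P(d_j)}
          {P(d_j \mid \bar{R})\,P(\bar{R}) \,/\, P(d_j)} \\
  &= \frac{P(d_j \mid R)\,P(R)}{P(d_j \mid \bar{R})\,P(\bar{R})}
\end{align*}
```

The factor $P(d_j)$ cancels, and since $P(R)$ and $P(\bar{R})$ are the same for every document in the collection, they do not change the order in which documents are ranked.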

13
Probabilistic Model (4/7)

14
Probabilistic Model (5/7)

$P(k_i \mid R)$ stands for the probability that the
index term $k_i$ is present in a document randomly
selected from the set R
$P(\bar{k}_i \mid R)$ stands for the probability that the
index term $k_i$ is not present in a document
randomly selected from the set R
let $P(k_i \mid R) = p_i$
$d_i$ is either 0 or 1
0: $d_i$ is absent from q
1: $d_i$ is present in q
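
A rough Python sketch of how such term probabilities can be turned into a ranking is given below. Several things in it are assumptions that do not appear in the slides: the symbol `r` for the probability that a term occurs in a non-relevant document, the log-odds scoring formula from the standard textbook treatment of the binary independence model, the starting guess of 0.5 for every $p_i$, and the 0.5 smoothing constants. All function names are hypothetical.

```python
import math


def initial_estimates(collection, query_terms):
    """Starting guesses used before any relevance information is available.

    Assumption: p[t] = 0.5 for every query term, and r[t] is the smoothed
    fraction of documents in the collection that contain the term.
    """
    n_docs = len(collection)
    p, r = {}, {}
    for term in query_terms:
        n_term = sum(1 for terms in collection.values() if term in terms)
        p[term] = 0.5
        r[term] = (n_term + 0.5) / (n_docs + 1)  # smoothing keeps r strictly in (0, 1)
    return p, r


def bir_score(doc_terms, query_terms, p, r):
    """Sum log-odds contributions over terms present in both document and query."""
    score = 0.0
    for term in query_terms & doc_terms:  # binary weights: only shared terms count
        score += math.log(p[term] / (1 - p[term])) + math.log((1 - r[term]) / r[term])
    return score


# Hypothetical toy collection: each document is the set of index terms it contains.
docs = {
    "d1": {"retrieval", "model"},
    "d2": {"retrieval"},
    "d3": {"ranking"},
    "d4": {"ranking", "index"},
    "d5": {"index"},
}
query = {"retrieval", "model"}

p, r = initial_estimates(docs, query)
scores = {name: bir_score(terms, query, p, r) for name, terms in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these starting guesses the `p` term contributes nothing, so the initial ranking is driven entirely by how rare each shared query term is in the collection.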

15
Probabilistic Model (6/7)

16
Probabilistic Model (7/7)