Chapter 2 Modeling

Slides: 48
Provided by: chiahu

1
Chapter 2 Modeling
  • Modern Information Retrieval
  • by R. Baeza-Yates and B. Ribeiro-Neto

2
Introduction
  • Traditional information retrieval systems usually
    adopt index terms to index and retrieve
    documents.
  • An index term is a keyword (or group of related
    words) which has some meaning of its own (usually
    a noun).
  • Advantages
  • Simple
  • The semantics of the documents and of the user's
    information need can be naturally expressed
    through sets of index terms.

3
IR Models
  • Ranking algorithms are at the core of information
    retrieval systems (predicting which documents are
    relevant and which are not).

4
A taxonomy of information retrieval models
  • User task: Retrieval (ad hoc, filtering)
  • Classic models: Boolean, Vector, Probabilistic
  • Set theoretic: Fuzzy, Extended Boolean
  • Algebraic: Generalized Vector, Latent Semantic
    Indexing, Neural Networks
  • Probabilistic: Inference Network, Belief Network
  • Structured models: Non-overlapping Lists,
    Proximal Nodes
  • User task: Browsing
  • Browsing models: Flat, Structure Guided, Hypertext
5
Figure 2.2 Retrieval models most frequently
associated with distinct combinations of a
document logical view and a user task.
6
Retrieval Ad hoc and Filtering
  • Ad hoc (search): The documents in the collection
    remain relatively static while new queries are
    submitted to the system.
  • Routing (filtering): The queries remain
    relatively static while new documents come into
    the system.

7
A formal characterization of IR models
  • D: A set composed of logical views (or
    representations) of the documents in the
    collection.
  • Q: A set composed of logical views (or
    representations) of the user information needs
    (queries).
  • F: A framework for modeling document
    representations, queries, and their
    relationships.
  • R(qi, dj): A ranking function which defines an
    ordering among the documents with regard to the
    query.

8
Define
  • $k_i$: a generic index term
  • $K = \{k_1, \ldots, k_t\}$: the set of all index terms
  • $w_{i,j}$: a weight associated with index term
    $k_i$ of a document $d_j$
  • $g_i$: a function that returns the weight associated
    with $k_i$ in any t-dimensional vector
    ($g_i(\vec{d_j}) = w_{i,j}$)

9
Classic IR Model
  • Basic concepts Each document is described by a
    set of representative keywords called index
    terms.
  • Assign numerical weights to index terms to
    reflect how relevant each term is for describing
    a document.

10
Boolean model
  • Binary decision criterion
  • Data retrieval model
  • Advantage
  • clean formalism, simplicity
  • Disadvantage
  • It is not simple to translate an information need
    into a Boolean expression.
  • exact matching may lead to retrieval of too few
    or too many documents

11
Example
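  • A query can be represented as a disjunction of
    conjunctive vectors (in DNF).
  • $q = q_a \land (q_b \lor \lnot q_c)$, so
    $\vec{q}_{dnf} = (1,1,1) \lor (1,1,0) \lor (1,0,0)$
  • Formal definition
  • For the Boolean model, the index term weights are
    all binary.
  • A query is a conventional Boolean expression,
    which can be transformed to a disjunctive normal
    form.
  • $sim(d_j, q) = 1$ if $\exists\, \vec{q}_{cc}$ such that
    $(\vec{q}_{cc} \in \vec{q}_{dnf}) \land (\forall k_i,\ w_{i,j} = g_i(\vec{q}_{cc}))$,
    and 0 otherwise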
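
The matching rule above can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, and each document vector is assumed to be already restricted to the three query terms (a, b, c).

```python
from itertools import product

def to_dnf(query, num_terms):
    """Enumerate every binary weight vector that satisfies the Boolean query."""
    return {bits for bits in product((0, 1), repeat=num_terms) if query(*bits)}

def boolean_sim(doc_vector, dnf):
    """Return 1 if the document matches some conjunctive component, else 0."""
    return 1 if tuple(doc_vector) in dnf else 0

# q = qa AND (qb OR NOT qc), written as a plain Python predicate.
def query(a, b, c):
    return bool(a and (b or not c))

dnf = to_dnf(query, 3)
relevant = boolean_sim((1, 0, 0), dnf)      # matches component (1,0,0)
not_relevant = boolean_sim((0, 1, 1), dnf)  # term a is missing
```

The result is strictly binary, which is why the model cannot rank documents by degree of match.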

12
Vector model
  • Assign non-binary weights to index terms in
    queries and in documents → TF×IDF
  • Compute the similarity between documents and
    query → Sim(Dj, Q)
  • More precise than Boolean model.

13
The IR problem as a clustering problem
  • We think of the documents as a collection C of
    objects and think of the user query as a
    specification of a set A of objects.
  • Intra-cluster
  • What are the features which better describe the
    objects in the set A?
  • Inter-cluster
  • What are the features which better distinguish
    the objects in the set A from the remaining
    objects in the collection C?

14
Idea for TFxIDF
  • TF: intra-cluster similarity is quantified by
    measuring the raw frequency of a term ki inside a
    document dj. This term frequency is usually
    referred to as the tf factor and provides one
    measure of how well that term describes the
    document contents.
  • IDF: inter-cluster dissimilarity is quantified
    by measuring the inverse of the frequency of a
    term ki among the documents in the
    collection. This factor is often referred to as
    the inverse document frequency.

15
Vector Model (1/4)
  • Index terms are assigned positive and non-binary
    weights.
  • The index terms in the query are also weighted.
  • Term weights are used to compute the degree of
    similarity between documents and the user query.
    Then, retrieved documents are sorted in
    decreasing order.

16
Vector Model (2/4)
  • Degree of similarity
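
The formula for this slide did not survive in the transcript. A minimal sketch, assuming the usual cosine of the angle between the document and query weight vectors (the function and variable names are hypothetical):

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between two weight vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

def rank(docs, q):
    """Return document names sorted by decreasing similarity to the query."""
    return sorted(docs, key=lambda name: cosine_sim(docs[name], q), reverse=True)

docs = {"d1": [1, 0], "d2": [1, 1], "d3": [0, 1]}
ranking = rank(docs, [1, 0])
```

Because the score is continuous, a document that matches the query only partially (d2 here) still receives a non-zero rank.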

17
Vector Model (3/4)
  • Definition
  • normalized frequency
  • inverse document frequency
  • term-weighting schemes
  • query-term weights
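
The slide lists these quantities without their formulas. The sketch below assumes the common definitions: normalized frequency f(i,j) = freq(i,j) / max over l of freq(l,j), inverse document frequency idf(i) = log(N / n_i), and document weight w(i,j) = f(i,j) × idf(i). The function name and the toy collection are hypothetical, and documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents in which each term appears at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values())
        weights.append({
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["fuzzy", "set", "set"], ["vector", "set"], ["vector", "model"]]
weights = tf_idf_weights(docs)
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the ranking.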

18
Vector Model (4/4)
  • Advantages
  • its term-weighting scheme improves retrieval
    performance
  • its partial matching strategy allows retrieval of
    documents that approximate the query conditions
  • its cosine ranking formula sorts the documents
    according to their degree of similarity to the
    query
  • Disadvantage
  • The assumption of mutual independence between
    index terms

19
Orthogonal
  • Case 1: v1 = (1,0), v2 = (1,1), v3 = (0,1)
  • cos(v1,v2) = 1/√2, cos(v2,v3) = 1/√2, cos(v1,v3) = 0
  • Case 2: v1 = (1,0), v2 = (0,1), v3 = (-1,1)
  • cos(v1,v2) = 0, cos(v2,v3) = 1/√2, cos(v1,v3) = -1/√2
20
Probabilistic Model (1/6)
  • Introduced by Robertson and Sparck Jones, 1976
  • Also called binary independence retrieval (BIR)
    model
  • Idea: Given a user query q and the ideal answer
    set of relevant documents, the problem is to
    specify the properties of this set.
  • That is, the probabilistic model tries to estimate
    the probability that the user will find the
    document dj relevant, using the ratio
  • P(dj relevant to q) / P(dj non-relevant to q)

21
Probabilistic Model (2/6)
  • Definition
  • All index term weights are binary, i.e.,
    $w_{i,j} \in \{0,1\}$
  • Let R be the set of documents known to be relevant
    to query q
  • Let $\bar{R}$ be the complement of R
  • Let $P(R \mid \vec{d_j})$ be the probability that the
    document dj is relevant to the query q
  • Let $P(\bar{R} \mid \vec{d_j})$ be the probability that the
    document dj is non-relevant to query q

22
Probabilistic Model (3/6)
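  • The similarity sim(dj,q) of the document dj to
    the query q is defined as the ratio
    $sim(d_j, q) = P(R \mid \vec{d_j}) / P(\bar{R} \mid \vec{d_j})$
  • Using Bayes' rule,
    $sim(d_j, q) = \frac{P(\vec{d_j} \mid R)\, P(R)}{P(\vec{d_j} \mid \bar{R})\, P(\bar{R})}$
  • P(R) stands for the probability that a document
    randomly selected from the entire collection is
    relevant
  • $P(\vec{d_j} \mid R)$ stands for the probability of
    randomly selecting the document dj from the set R
    of relevant documents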

23
Probabilistic Model (4/6)
  • Assuming independence of index terms and given
    $q = (d_1, d_2, \ldots, d_t)$,

24
Probabilistic Model (5/6)
  • $P(k_i \mid R)$ stands for the probability that the
    index term ki is present in a document randomly
    selected from the set R
  • $P(\bar{k_i} \mid R)$ stands for the probability that the
    index term ki is not present in a document
    randomly selected from the set R

25
Probabilistic Model (6/6)
26
Estimation of Term Relevance
  • In the very beginning
  • Next, the ranking can be improved as follows:
    let V be a subset of the documents initially
    retrieved
  • For small values of V
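
The formulas for these three steps are missing from the transcript. The sketch below assumes the estimates commonly used with the binary independence model: initially P(ki|R) = 0.5 and P(ki|not R) = n_i / N, then counts taken from the subset V with a 0.5 adjustment so that small V does not produce zeros. All names are hypothetical; n_i is the number of documents containing ki, N the collection size, and V_i the number of documents in V containing ki.

```python
import math

def initial_estimates(n_i, N):
    """Before any relevance information: P(ki|R) = 0.5, P(ki|not R) = n_i / N."""
    return 0.5, n_i / N

def improved_estimates(V_i, V, n_i, N):
    """Estimates from the top-ranked subset V, smoothed for small V."""
    p_rel = (V_i + 0.5) / (V + 1)
    p_nonrel = (n_i - V_i + 0.5) / (N - V + 1)
    return p_rel, p_nonrel

def term_weight(p_rel, p_nonrel):
    """Log-odds contribution of a term that occurs in both query and document."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

p_rel0, p_nonrel0 = initial_estimates(n_i=10, N=100)
w0 = term_weight(p_rel0, p_nonrel0)
p_rel1, p_nonrel1 = improved_estimates(V_i=3, V=5, n_i=10, N=100)
w1 = term_weight(p_rel1, p_nonrel1)
```

A document's score is then the sum of these term weights over the terms it shares with the query, and the estimate step can be repeated on each new ranking.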
27
Alternative Set Theoretic Models
  • Fuzzy Set Model
  • Extended Boolean Model

28
Fuzzy Theory
  • A fuzzy subset A of a universe U is characterized
    by a membership function $\mu_A: U \to [0,1]$ which
    associates with each element $u \in U$ a number $\mu_A(u)$
  • Let A and B be two fuzzy subsets of U,
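
The operations defined on A and B are missing from the transcript. A minimal sketch, assuming the standard fuzzy-set definitions (complement 1 − μ, union by max, intersection by min) and representing a membership function as a hypothetical Python dict over a small universe:

```python
def fuzzy_complement(mu_a):
    """Membership in the complement of A: 1 - mu_A(u)."""
    return {u: 1 - m for u, m in mu_a.items()}

def fuzzy_union(mu_a, mu_b):
    """Membership in A union B: max(mu_A(u), mu_B(u))."""
    return {u: max(mu_a[u], mu_b[u]) for u in mu_a}

def fuzzy_intersection(mu_a, mu_b):
    """Membership in A intersect B: min(mu_A(u), mu_B(u))."""
    return {u: min(mu_a[u], mu_b[u]) for u in mu_a}

mu_a = {"x": 0.25, "y": 1.0}
mu_b = {"x": 0.75, "y": 0.5}
union = fuzzy_union(mu_a, mu_b)
intersection = fuzzy_intersection(mu_a, mu_b)
complement = fuzzy_complement(mu_a)
```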

29
Fuzzy Information Retrieval
  • Using a term-term correlation matrix
  • Define a fuzzy set associated with each
    index term ki.
  • If a term kl of dj is strongly related to ki, that is
    $c_{i,l} \approx 1$, then $\mu_i(d_j) \approx 1$
  • If every term kl of dj is only loosely related to ki,
    that is $c_{i,l} \approx 0$, then $\mu_i(d_j) \approx 0$
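
A minimal sketch of this behaviour, assuming the co-occurrence correlation c(i,l) = n(i,l) / (n_i + n_l − n(i,l)) and the membership μ_i(dj) = 1 − Π over terms kl in dj of (1 − c(i,l)). Neither formula appears in the transcript, so both are assumptions here, and the function names and toy collection are hypothetical.

```python
def correlation(docs, ki, kl):
    """Co-occurrence correlation between two terms over a list of term sets."""
    n_i = sum(ki in d for d in docs)
    n_l = sum(kl in d for d in docs)
    n_il = sum(ki in d and kl in d for d in docs)
    denom = n_i + n_l - n_il
    return n_il / denom if denom else 0.0

def membership(docs, ki, doc):
    """Degree to which doc belongs to the fuzzy set associated with term ki."""
    prod = 1.0
    for kl in doc:
        prod *= 1 - correlation(docs, ki, kl)
    return 1 - prod

docs = [{"a", "b"}, {"a"}, {"c"}]
strong = membership(docs, "a", {"a"})   # contains the term itself
partial = membership(docs, "a", {"b"})  # b co-occurs with a in one of two documents
none = membership(docs, "a", {"c"})     # c never co-occurs with a
```

A document can therefore belong to the fuzzy set of ki without containing ki, as long as it contains a correlated term.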

30
Example
  • Disjunctive Normal Form

31
Algebraic Sum and Product
  • The degree of membership in a disjunctive fuzzy
    set is computed using an algebraic sum, instead
    of max function.
  • The degree of membership in a conjunctive fuzzy
    set is computed using an algebraic product,
    instead of min function.
  • Smoother than the max and min functions.
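
A short sketch comparing the two pairs of operators, assuming the usual definitions: the algebraic sum of memberships is 1 − Π(1 − μ) and the algebraic product is Π μ. Function names are hypothetical; `math.prod` requires Python 3.8 or later.

```python
import math

def algebraic_sum(memberships):
    """Disjunction: 1 minus the product of the complements."""
    return 1 - math.prod(1 - m for m in memberships)

def algebraic_product(memberships):
    """Conjunction: product of the memberships."""
    return math.prod(memberships)

values = [0.5, 0.5]
disjunction = algebraic_sum(values)
conjunction = algebraic_product(values)
```

Unlike max and min, these results change whenever any single membership changes, which is the smoothness the slide refers to.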

32
Alternative Algebraic Models
  • Generalized Vector Space Model
  • Latent Semantic Model

33
Latent Semantic Indexing (1/5)
  • Let A be a term-document association matrix with
    m rows and n columns.
  • Latent semantic indexing decomposes A using
    the singular value decomposition $A = U \Sigma V^T$.
  • U (m×m) is the matrix of eigenvectors derived
    from the term-to-term correlation matrix ($AA^T$)
  • V (n×n) is the matrix of eigenvectors derived
    from the document-to-document matrix ($A^TA$)
  • Σ is an m×n diagonal matrix of singular values,
    where r ≤ min(t, N) is the rank of A (t = m terms,
    N = n documents).

34
Latent Semantic Indexing (2/5)
  • Consider now only the s largest singular values
    of Σ, and their corresponding columns in U and V.
    (The remaining singular values of Σ are deleted.)
  • The resultant matrix $A_s$ (rank s) is closest to
    the original matrix A in the least-squares sense.
  • s < r is the dimensionality of the reduced concept
    space.
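
The truncation can be sketched with NumPy, which is an assumption since the slides name no library. The function name and the small term-document matrix are hypothetical.

```python
import numpy as np

def truncated_svd(A, s):
    """Keep the s largest singular values and the matching columns of U and V."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)  # sigma is sorted descending
    U_s, sigma_s, Vt_s = U[:, :s], sigma[:s], Vt[:s, :]
    A_s = U_s @ np.diag(sigma_s) @ Vt_s
    return U_s, sigma_s, Vt_s, A_s

# Rows are terms, columns are documents.
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
U_s, sigma_s, Vt_s, A_s = truncated_svd(A, s=2)
approximation_error = np.linalg.norm(A - A_s)  # Frobenius norm
```

The approximation error equals the root of the sum of squares of the discarded singular values, which is one way to judge a candidate s.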

35
Latent Semantic Indexing (3/5)
  • The selection of s attempts to balance two
    opposing effects
  • s should be large enough to allow fitting all the
    structure in the real data
  • s should be small enough to allow filtering out
    all the non-relevant representational details
  • $U_s = [u_1, u_2, \ldots, u_s]$ are the s principal components
    of the column space (document space) $\mathbb{R}^m$
  • $V_s = [v_1, v_2, \ldots, v_s]$ are the s principal components
    of the row space (term space) $\mathbb{R}^n$

36
Latent Semantic Indexing (4/5)
  • Consider the relationship between any two
    documents
  • $U_s^T d_i$ is the projected vector for document $d_i$
    ($\mathbb{R}^m \to \mathbb{R}^s$)
  • $V_s^T t_i$ is the projected vector for term vector
    $t_i$ ($\mathbb{R}^n \to \mathbb{R}^s$)

37
Latent Semantic Indexing (5/5)
  • To rank documents with regard to a given user
    query, we model the query as a pseudo-document in
    the matrix A (original).
  • Assume the query is modeled as the document with
    number k.
  • Then the kth row in the matrix $A_s^T A_s$ provides
    the ranks of all documents with respect to this
    query.
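
A sketch of this ranking step with NumPy (an assumption, as above). The query is appended as the last column of the term-document matrix, and the document-to-document matrix of the rank-s approximation is computed from the factors as V_s Σ_s² V_sᵀ. All names are hypothetical.

```python
import numpy as np

def lsi_doc_similarities(A, s):
    """Document-to-document matrix of the rank-s approximation of A."""
    _, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    V_s = Vt[:s, :].T
    return V_s @ np.diag(sigma[:s] ** 2) @ V_s.T

def rank_against_query(A, query, s):
    """Append the query as a pseudo-document and read off its row of similarities."""
    augmented = np.column_stack([A, query])
    sims = lsi_doc_similarities(augmented, s)
    k = augmented.shape[1] - 1           # index of the pseudo-document
    scores = sims[k, :k]                 # similarity of the query to each real document
    return np.argsort(-scores), scores

A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
order, scores = rank_against_query(A, query=np.array([1.0, 1.0, 0.0, 0.0]), s=2)
```

When s equals the rank of the matrix nothing is discarded, and the result reduces to the plain inner products between document columns.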

38
Speedup
  • The matrix-vector multiplication with the full
    matrix requires a total of N × t scalar
    multiplications.
  • The rank-s factored form requires only
    (n + m) × s scalar multiplications.

39
Alternative Probabilistic Model
  • Bayesian Networks
  • Inference Network Model
  • Belief Network Model

40
Bayesian Network
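  • Let xi be a node in a Bayesian network G and $\Gamma_{x_i}$
    be the set of parent nodes of xi.
  • The influence of $\Gamma_{x_i}$ on xi can be specified by
    any set of functions $F_i(x_i, \Gamma_{x_i})$ that satisfy
    $\sum_{\forall x_i} F_i(x_i, \Gamma_{x_i}) = 1$ and $0 \le F_i(x_i, \Gamma_{x_i}) \le 1$
  • $P(x_1, x_2, x_3, x_4, x_5) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1)\, P(x_4 \mid x_2, x_3)\, P(x_5 \mid x_3)$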
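
The five-node factorization can be checked numerically. In the sketch below every node is binary and all conditional probability values are made up for illustration; only the network structure (x1 is the parent of x2 and x3, x2 and x3 are the parents of x4, x3 is the parent of x5) comes from the slide.

```python
from itertools import product

# Hypothetical tables: probability that a node equals 1 given its parents.
p_x1 = 0.6
p_x2 = {0: 0.2, 1: 0.7}                                       # P(x2=1 | x1)
p_x3 = {0: 0.5, 1: 0.1}                                       # P(x3=1 | x1)
p_x4 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}   # P(x4=1 | x2, x3)
p_x5 = {0: 0.3, 1: 0.8}                                       # P(x5=1 | x3)

def bernoulli(p_one, value):
    """Probability of a binary value given the probability of 1."""
    return p_one if value == 1 else 1 - p_one

def joint(x1, x2, x3, x4, x5):
    """P(x1..x5) as the product of each node's probability given its parents."""
    return (bernoulli(p_x1, x1)
            * bernoulli(p_x2[x1], x2)
            * bernoulli(p_x3[x1], x3)
            * bernoulli(p_x4[(x2, x3)], x4)
            * bernoulli(p_x5[x3], x5))

total = sum(joint(*xs) for xs in product((0, 1), repeat=5))
example = joint(1, 1, 0, 1, 0)
```

Summing the joint over all 32 assignments gives 1, which confirms that the local tables define a valid distribution.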

41
Belief Network Model (1/6)
  • The probability space
  • The set $K = \{k_1, k_2, \ldots, k_t\}$ is the universe. To
    each subset u is associated a vector $\vec{k}$ such that
    $g_i(\vec{k}) = 1 \iff k_i \in u$.
  • Random variables
  • To each index term ki is associated a binary
    random variable.

42
Belief Network Model (2/6)
  • Concept space
  • A document dj is represented as a concept
    composed of the terms used to index dj.
  • A user query q is also represented as a concept
    composed of the terms used to index q.
  • Both user query and document are modeled as
    subsets of index terms.
  • Probability distribution P over K

43
Belief Network Model (3/6)
  • A query is modeled as a network node
  • This variable is set to 1 whenever q completely
    covers the concept space K
  • P(q) computes the degree of coverage of the space
    K by q
  • A document dj is modeled as a network node
  • This random variable is 1 to indicate that dj
    completely covers the concept space K
  • P(dj) computes the degree of coverage of the
    space K by dj

44
Belief Network Model (4/6)
45
Belief Network Model (5/6)
  • Assumption
  • $P(d_j \mid q)$ is adopted as the rank of the document
    dj with respect to the query q.

46
Belief Network Model (6/6)
  • Specify the conditional probabilities as follows
  • Thus, the belief network model can be tuned to
    subsume the vector model.

47
Comparison
  • Belief network model
  • Belief network model is based on set-theoretic
    view
  • Belief network model provides a separation
    between the document and the query
  • Belief network model is able to reproduce any
    ranking strategy generated by the inference
    network model
  • Inference network model