1
Chapter 2 Modeling
  • Hsin-Hsi Chen
  • Department of Computer Science and Information
    Engineering
  • National Taiwan University

2
Indexing
3
Indexing
  • indexing: assign identifiers to text items
  • assignment: manual vs. automatic indexing
  • identifiers:
  • objective vs. nonobjective text identifiers (objective identifiers are defined by cataloging rules, e.g., author names, publisher names, dates of publication)
  • controlled vs. uncontrolled vocabularies (e.g., instruction manuals, terminological schedules)
  • single terms vs. term phrases

4
Two Issues
  • Issue 1: indexing exhaustivity
  • exhaustive: assign a large number of terms
  • non-exhaustive: assign only a few terms
  • Issue 2: term specificity
  • broad terms (generic): cannot distinguish relevant from non-relevant items
  • narrow terms (specific): retrieve relatively fewer items, but most of them are relevant

5
Parameters of retrieval effectiveness
  • Recall
  • Precision
  • Goal: high recall and high precision

6
The original slide shows the retrieved/relevant contingency as a diagram:

                   Relevant Items   Non-relevant Items
  Retrieved part   a                b
  Not retrieved    c                d

  Recall = a / (a + c), Precision = a / (a + b)
7
A Joint Measure
  • F-score: Fβ = (β² + 1) · P · R / (β² · P + R)
  • β is a parameter that encodes the relative importance of recall and precision.
  • β = 1: equal weight
  • β < 1: precision is more important
  • β > 1: recall is more important

8
Choices of Recall and Precision
  • Both recall and precision vary from 0 to 1.
  • In principle, the average user wants to achieve
    both high recall and high precision.
  • In practice, a compromise must be reached because
    simultaneously optimizing recall and precision is
    not normally achievable.

9
Choices of Recall and Precision (Continued)
  • Particular choices of indexing and search
    policies have produced variations in performance
    ranging from 0.8 precision and 0.2 recall to 0.1
    precision and 0.8 recall.
  • In many circumstances, recall and precision both in the range of 0.5 to 0.6 are satisfactory for the average user.

10
Term-Frequency Consideration
  • Function words
  • for example, "and", "or", "of", "but",
  • the frequencies of these words are high in all
    texts
  • Content words
  • words that actually relate to document content
  • varying frequencies in the different texts of a
    collection
  • indicate term importance for content

11
A Frequency-Based Indexing Method
  • Eliminate common function words from the document
    texts by consulting a special dictionary, or stop
    list, containing a list of high frequency
    function words.
  • Compute the term frequency tfij for all remaining
    terms Tj in each document Di, specifying the
    number of occurrences of Tj in Di.
  • Choose a threshold frequency T, and assign to each document Di all terms Tj for which tfij > T.
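
A minimal sketch of this method in Python (the stop list, the threshold T, and the sample document are illustrative assumptions, not part of the original slides):

    from collections import Counter

    # Illustrative stop list; a real system would use a much larger one.
    STOP_LIST = {"a", "an", "and", "but", "for", "in", "of", "or", "the", "to"}
    T = 1  # threshold frequency

    def index_terms(document):
        """Assign to the document all non-stopword terms with tf > T."""
        words = [w for w in document.lower().split() if w not in STOP_LIST]
        tf = Counter(words)  # tf_ij for all remaining terms
        return {term: f for term, f in tf.items() if f > T}

    doc = "gold shipment of gold damaged in a gold fire"
    print(index_terms(doc))  # {'gold': 3}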

12
Discussions
  • high-frequency terms: favor recall
  • high precision: requires the ability to distinguish individual documents from each other
  • a high-frequency term is good for precision only when its frequency is not equally high in all documents

13
Inverse Document Frequency
  • Inverse Document Frequency (IDF) for term Tj: idfj = log(N / dfj), where N is the number of documents in the collection and dfj (document frequency of term Tj) is the number of documents in which Tj occurs.
  • Terms that fulfil both the recall and the precision goals occur frequently in individual documents but rarely in the remainder of the collection.

14
New Term Importance Indicator
  • weight wij of a term Tj in a document Di: wij = tfij × idfj
  • Eliminate common function words.
  • Compute the value of wij for each term Tj in each document Di.
  • Assign to the documents of a collection all terms with sufficiently high (tf × idf) factors.

15
Term-discrimination Value
  • Useful index terms: distinguish the documents of a collection from each other
  • Document space:
  • two documents appear close together in the document configuration when they are assigned very similar term sets
  • when a high-frequency term without discriminating power is assigned, the document space density increases

16
A Virtual Document Space
[Figure: the document space in its original state, after assignment of a good discriminator (documents move apart), and after assignment of a poor discriminator (documents move closer together)]
17
Good Term Assignment
  • When a term is assigned to the documents of a
    collection, the few items (i.e., documents) to
    which the term is assigned will be distinguished
    from the rest of the collection.
  • This should increase the average distance between
    the items in the collection and hence produce a
    document space less dense than before.

18
Poor Term Assignment
  • A high frequency term is assigned that does not
    discriminate between the items (i.e., documents)
    of a collection.
  • Its assignment renders the documents more similar to each other.
  • This is reflected in an increase in document
    space density.

19
Term Discrimination Value
  • definition: dvj = Q − Qj, where Q and Qj are the space densities before and after the assignment of term Tj.
  • dvj > 0: Tj is a good term; dvj < 0: Tj is a poor term.
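
The sketch below operationalizes dvj in Python, using one common interpretation of space density, the average cosine similarity of each document to the collection centroid (the density measure and the sample vectors are assumptions for illustration):

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def density(docs):
        """Average similarity of each document to the collection centroid."""
        t = len(docs[0])
        centroid = [sum(d[i] for d in docs) / len(docs) for i in range(t)]
        return sum(cosine(d, centroid) for d in docs) / len(docs)

    def discrimination_value(docs, j):
        """dv_j = Q - Q_j: density without term j minus density with term j."""
        without_j = [[w for i, w in enumerate(d) if i != j] for d in docs]
        return density(without_j) - density(docs)

    # Term 2 has an equally high weight everywhere: a poor discriminator.
    docs = [[1, 0, 5], [0, 1, 5], [1, 1, 5]]
    print(discrimination_value(docs, 2))  # negative, i.e., a poor term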

20
Variations of Term-Discrimination Value with Document Frequency
[Figure: dvj plotted against document frequency, from low frequency up to N]
  • low document frequency: dvj ≈ 0 (candidates for thesaurus transformation)
  • medium document frequency: dvj > 0
  • high document frequency: dvj < 0 (candidates for phrase transformation)
21
Another Term Weighting
  • wij = tfij × dvj
  • compared with wij = tfij × idfj
  • idfj decreases steadily with increasing document frequency
  • dvj increases from zero to positive as the document frequency of the term increases, and decreases sharply (i.e., becomes negative) as the document frequency becomes still larger.

22
Term Relationships in Indexing
  • Single-term indexing
  • Single terms are often ambiguous.
  • Many single terms are either too specific or too
    broad to be useful.
  • Complex text identifiers
  • subject experts and trained indexers
  • linguistic analysis algorithms, e.g., NP chunker
  • term-grouping or term clustering methods

23
Term Classification (Clustering)
24
Term Classification (Clustering)
  • Column part: group terms whose corresponding column representations reveal similar assignments to the documents of the collection.
  • Row part: group documents that exhibit sufficiently similar term assignments.

25
Linguistic Methodologies
  • Indexing phrases: nominal constructions including adjectives and nouns
  • Assign syntactic class indicators (i.e., part of
    speech) to the words occurring in document texts.
  • Construct word phrases from sequences of words
    exhibiting certain allowed syntactic markers
    (noun-noun and adjective-noun sequences).

26
Term-Phrase Formation
  • Term phrase: a sequence of related text words that carries a more specific meaning than its single terms, e.g., "computer science" vs. "computer"

[Figure repeated from slide 20: low-frequency terms (dvj ≈ 0) are candidates for thesaurus transformation; high-frequency terms (dvj < 0) are candidates for phrase transformation]
27
Simple Phrase-Formation Process
  • the principal phrase component (phrase head): a term with a document frequency exceeding a stated threshold, or exhibiting a negative discriminator value
  • the other components of the phrase: medium- or low-frequency terms with stated co-occurrence relationships with the phrase head
  • common function words: not used in the phrase-formation process

28
An Example
  • "Effective retrieval systems are essential for people in need of information."
  • "are", "for", "in", and "of": common function words
  • "system", "people", and "information": phrase heads

29
The Formatted Term-Phrases
[Figure: term phrases formed from "effective retrieval systems essential people need information", with fractions 2/5 and 5/12 of the generated phrases assumed to be useful for content identification]
30
The Problems
  • A phrase-formation process controlled only by
    word co-occurrences and the document frequencies
    of certain words is not likely to generate a
    large number of high-quality phrases.
  • Additional syntactic criteria for phrase heads
    and phrase components may provide further control
    in phrase formation.

31
Additional Term-Phrase Formation Steps
  • Syntactic class indicators are assigned to the terms, and phrase formation is limited to sequences of specified syntactic markers, such as adjective-noun and noun-noun sequences. (Adverb-adjective? Adverb-noun?)
  • The phrase elements are all chosen from within
    the same syntactic unit, such as subject phrase,
    object phrase, and verb phrase.

32
Consider Syntactic Unit
  • effective retrieval systems are essential for
    people in the need of information
  • subject phrase
  • effective retrieval systems
  • verb phrase
  • are essential
  • object phrase
  • people in need of information

33
Phrases within Syntactic Components
subj: effective retrieval systems; vp: are essential for; obj: people need information
  • Adjacent phrase heads and components within
    syntactic components
  • retrieval systems
  • people need
  • need information
  • Phrase heads and components co-occur within
    syntactic components
  • effective systems

2/3
34
Problems
  • More stringent phrase formation criteria produce
    fewer phrases, both good and bad, than less
    stringent methodologies.
  • Prepositional phrase attachment, e.g., "The man saw the girl with the telescope."
  • Anaphora resolution, e.g., "He dropped the plate on his foot and broke it."

35
Problems (Continued)
  • Any phrase matching system must be able to deal
    with the problems of
  • synonym recognition
  • differing word orders
  • intervening extraneous words
  • Example
  • retrieval of information vs. information retrieval

36
Equivalent Phrase Formulation
  • Base form: "text analysis system"
  • Variants:
  • system analyzes the text
  • text is analyzed by the system
  • system carries out text analysis
  • text is subjected to system analysis
  • Related term substitution:
  • text: documents, information items
  • analysis: processing, transformation, manipulation
  • system: program, process

37
Thesaurus-Group Generation
  • Thesaurus transformation
  • broadens index terms whose scope is too narrow to
    be useful in retrieval
  • a thesaurus must assemble groups of related
    specific terms under more general, higher-level
    class indicators

[Figure repeated from slide 20: thesaurus transformation applies to low-frequency terms (dvj ≈ 0); phrase transformation applies to high-frequency terms (dvj < 0)]
38
Sample Classes of Roget's Thesaurus
39
Tongyici Cilin (a Chinese thesaurus)
  • 12 large categories
  • 94 middle categories
  • 1,428 small categories
  • 3,925 word clusters

40
A People
  Aa (a collective name)
    01 Human being; The people; Everybody
    02 I; We
    03 You
    04 He/She; They
    05 Myself; Others; Someone
    06 Who
  Ab (people of all ages and both sexes)
    01 A Man; A Woman; Men and Women
    02 An Old Person; An Adult; The old and the young
    03 A Teenager
    04 An Infant; A Child
  Ac (posture)
    01 A Tall Person; A Dwarf
    02 A Fat Person; A Thin Person
    03 A Beautiful Woman; A Handsome Man
42
The Indexing Prescription (1)
  • Identify the individual words in the document
    collection.
  • Use a stop list to delete from the texts the
    function words.
  • Use a suffix-stripping routine to reduce each remaining word to word-stem form.
  • For each remaining word stem Tj in document Di, compute wij.
  • Represent each document Di by Di = (T1, wi1; T2, wi2; …; Tt, wit)

43
Word Stemming
  • effectiveness → effective → effect
  • picnicking → picnic
  • king -/-> k (must not be stemmed)

44
Some Morphological Rules
  • Restore a silent e after suffix removal from certain words, to produce "hope" from "hoping" rather than "hop".
  • Delete certain doubled consonants after suffix removal, so as to generate "hop" from "hopping" rather than "hopp".
  • Use a final y for an i in forms such as "easier", so as to generate "easy" instead of "easi".
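
A toy Python suffix stripper illustrating the three repair rules above (the consonant-vowel-consonant test is a simplifying assumption, not a full stemmer such as Porter's):

    import re

    def strip_suffix(word):
        """Strip -ing / -ier with the morphological repairs described above."""
        if word.endswith("ier"):                 # easier -> easy
            return word[:-3] + "y"
        if word.endswith("ing"):
            stem = word[:-3]
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                return stem[:-1]                 # hopping -> hopp -> hop
            if re.search(r"[^aeiou][aeiou][^aeiouwxy]$", stem):
                return stem + "e"                # hoping -> hop -> hope
            return stem
        return word

    for w in ["hoping", "hopping", "easier"]:
        print(w, "->", strip_suffix(w))          # hope, hop, easy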

45
The Indexing Prescription (2)
  • Identify individual text words.
  • Use a stop list to delete common function words.
  • Use automatic suffix stripping to produce word
    stems.
  • Compute term-discrimination value for all word
    stems.
  • Use thesaurus class replacement for all
    low-frequency terms with discrimination values
    near zero.
  • Use phrase-formation process for all
    high-frequency terms with negative discrimination
    values.
  • Compute weighting factors for complex indexing
    units.
  • Assign to each document single term weights, term
    phrases, and thesaurus classes with weights.

46
Query vs. Document
  • Differences
  • Query texts are short.
  • Fewer terms are assigned to queries.
  • The occurrence frequency of a query term rarely exceeds 1.

Q = (wq1, wq2, …, wqt), where wqj = inverse document frequency
Di = (di1, di2, …, dit), where dij = term frequency × inverse document frequency
47
Query vs. Document
  • When non-normalized documents are used, the
    longer documents with more assigned terms have a
    greater chance of matching particular query terms
    than do the shorter document vectors.

[Formulas: the inner-product similarity can alternatively be normalized by document length, e.g., dividing by |Di|, or by |Q| · |Di| (cosine)]
48
Relevance Feedback
  • Terms present in previously retrieved documents
    that have been identified as relevant to the
    users query are added to the original
    formulations.
  • The weights of the original query terms are
    altered by replacing the inverse document
    frequency portion of the weights with
    term-relevance weights obtained by using the
    occurrence characteristics of the terms in the
    previous retrieved relevant and nonrelevant
    documents of the collection.

49
Relevance Feedback
  • Q = (wq1, wq2, ..., wqt)
  • Di = (di1, di2, ..., dit)
  • The new query may have the form Q' = a·(wq1, wq2, ..., wqt) + b·(wq(t+1), wq(t+2), ..., wq(t+m))
  • The weights of the newly added terms T(t+1) to T(t+m) may consist of a combined term-frequency and term-relevance weight.
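
A sketch of this reformulation in Python, with query and document vectors as term-to-weight dicts (the coefficients a and b and the toy vectors are illustrative assumptions):

    def feedback_query(q, relevant_docs, a=1.0, b=0.75):
        """New query: a * original weights + b * average weight of terms
        found in the documents judged relevant."""
        new_q = {t: a * w for t, w in q.items()}
        for d in relevant_docs:
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + b * w / len(relevant_docs)
        return new_q

    q = {"gold": 0.5, "truck": 0.3}
    d1 = {"gold": 0.4, "silver": 0.9}  # judged relevant by the user
    print(feedback_query(q, [d1]))     # 'silver' is added to the query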

50
Final Indexing
  • Identify individual text words.
  • Use a stop list to delete common words.
  • Use suffix stripping to produce word stems.
  • Replace low-frequency terms with thesaurus
    classes.
  • Replace high-frequency terms with phrases.
  • Compute term weights for all single terms,
    phrases, and thesaurus classes.
  • Compare query statements with document vectors.
  • Identify some retrieved documents as relevant and
    some as nonrelevant to the query.

51
Final Indexing
  • Compute term-relevance factors based on available
    relevance assessments.
  • Construct new queries with added terms from
    relevant documents and term weights based on
    combined frequency and term-relevance weight.
  • Return to step 7 (compare query statements with document vectors).

52
Summary of expected effectiveness of automatic
indexing (Salton, 1989)
  • Basic single-term automatic indexing: baseline
  • Use of thesaurus to group related terms in the given topic area: +10% to +20%
  • Use of automatically derived term associations obtained from joint term assignments found in sample document collections: 0% to -10%
  • Use of automatically derived term phrases obtained by using co-occurring terms found in the texts of sample collections: +5% to +10%
  • Use of one iteration of relevance feedback to add new query terms extracted from previously retrieved relevant documents: +30% to +60%

53
Models
54
Ranking
  • The central problem of IR: predict which documents are relevant and which are not
  • Ranking: establish an ordering of the retrieved documents
  • IR models: different models provide distinct sets of premises for dealing with document relevance

55
Information Retrieval Models
  • Classic Models
  • Boolean model
  • set theoretic
  • documents and queries are represented as sets of
    index terms
  • compare Boolean query statements with the term
    sets used to identify document content.
  • Vector model
  • algebraic model
  • documents and queries are represented as vectors
    in a t-dimensional space
  • compute global similarities between queries and
    documents.
  • Probabilistic model
  • probabilistic framework
  • documents and queries are represented on the basis of probability theory
  • compute the relevance probabilities for the
    documents of a collection.

56
Information Retrieval Models(Continued)
  • Structured Models
  • reference to the structure present in written
    text
  • non-overlapping list model
  • proximal nodes model
  • Browsing
  • flat
  • structure guided
  • hypertext

57
Taxonomy of Information Retrieval Models
  • User task: retrieval (ad hoc, filtering) and browsing
  • Classic models: Boolean, vector, probabilistic
  • Set theoretic: fuzzy, extended Boolean
  • Algebraic: generalized vector, latent semantic indexing, neural network
  • Probabilistic: inference network, belief network
  • Structured models: non-overlapped lists, proximal nodes
  • Browsing: flat, structure guided, hypertext
58
Issues of a retrieval system
  • Models
  • boolean
  • vector
  • probabilistic
  • Logical views of documents
  • full text
  • set of index terms
  • User task
  • retrieval
  • browsing

59
Combinations of these issues
USER TASK (rows) × LOGICAL VIEW OF DOCUMENTS (columns):

              Index Terms                Full Text                  Full Text + Structure
  Retrieval   Classic, Set Theoretic,    Classic, Set Theoretic,    Structured
              Algebraic, Probabilistic   Algebraic, Probabilistic
  Browsing    Flat                       Flat, Hypertext            Structure Guided, Hypertext
60
Retrieval Ad hoc and Filtering
  • Ad hoc retrieval:
  • documents remain relatively static while new queries are submitted
  • Filtering:
  • queries remain relatively static while new documents come into the system
  • e.g., newswire filtering for stock-market investors
  • a user profile describes the user's preferences
  • the filtering task indicates to the user which documents might be of interest
  • deciding which ones are really relevant is fully reserved to the user
  • Routing: a variation of filtering that ranks the filtered documents and shows this ranking to the users

61
User profile
  • Simplistic approach
  • The profile is described through a set of
    keywords
  • The user provides the necessary keywords
  • Elaborate approach
  • Collect information from the user:
  • initial profile + relevance feedback (relevant information and nonrelevant information)

62
Formal Definition of IR Models
  • An IR model is a quadruple ⟨D, Q, F, R(qi, dj)⟩
  • D: a set composed of logical views (or representations) for the documents in the collection
  • Q: a set composed of logical views (or representations) for the user information needs, i.e., queries
  • F: a framework for modeling document representations, queries, and their relationships
  • R(qi, dj): a ranking function which associates a real number with each query qi ∈ Q and document dj ∈ D
63
Formal Definition of IR Models(continued)
  • classic Boolean model:
  • sets of documents
  • standard operations on sets
  • classic vector model:
  • t-dimensional vector space
  • standard linear algebra operations on vectors
  • classic probabilistic model:
  • sets
  • standard probabilistic operations and Bayes' theorem

64
Basic Concepts of Classic IR
  • index terms (usually nouns): index and summarize document content
  • each index term has a weight
  • Definitions:
  • K = {k1, …, kt}: the set of all index terms
  • wi,j: the weight of index term ki in document dj
  • dj = (w1,j, w2,j, …, wt,j): the index term vector for document dj
  • gi(dj) = wi,j
  • assumption: index term weights are mutually independent

wi,j associated with (ki, dj) tells us nothing about w(i+1),j associated with (k(i+1), dj). This is a simplification: e.g., the terms "computer" and "network" in the area of computer networks are clearly not independent.
65
Boolean Model
  • The index term weight variables are all binary, i.e., wi,j ∈ {0,1}
  • A query q is a Boolean expression (and, or, not)
  • qdnf: the disjunctive normal form of q
  • qcc: any of the conjunctive components of qdnf
  • sim(dj,q): similarity of dj to q
  • sim(dj,q) = 1 if ∃ qcc ∈ qdnf such that ∀ki, gi(dj) = gi(qcc); dj is then relevant to q
  • sim(dj,q) = 0 otherwise
66
Boolean Model (Continued)
  • (ka ∧ kb) ∨ (ka ∧ ¬kc)
  • ka ∧ kb = (ka ∧ kb ∧ kc) ∨ (ka ∧ kb ∧ ¬kc)
  • ka ∧ ¬kc = (ka ∧ kb ∧ ¬kc) ∨ (ka ∧ ¬kb ∧ ¬kc)
  • so the DNF is (ka ∧ kb ∧ kc) ∨ (ka ∧ kb ∧ ¬kc) ∨ (ka ∧ ¬kb ∧ ¬kc)
  • Example:
  • q = ka ∧ (kb ∨ ¬kc)
  • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)

[Figure: Venn diagram over ka, kb, kc marking the conjunctive components (1,1,1), (1,1,0), and (1,0,0)]
67
Boolean Model (Continued)
  • advantage: simple
  • disadvantages:
  • binary decision (relevant or non-relevant) without a grading scale
  • exact match (no partial match)
  • e.g., dj = (0,1,0) is non-relevant to q = ka ∧ (kb ∨ ¬kc)
  • tends to retrieve too few or too many documents
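
A minimal Python sketch of Boolean matching via the disjunctive normal form (the query and the toy document vectors follow the example above):

    from itertools import product

    # Documents as binary term vectors over (ka, kb, kc).
    docs = {"d1": (0, 1, 0), "d2": (1, 1, 0), "d3": (1, 0, 0)}

    def q(ka, kb, kc):
        """q = ka AND (kb OR NOT kc), evaluated on one binary assignment."""
        return ka and (kb or not kc)

    # qdnf: every binary assignment that satisfies the query.
    qdnf = [v for v in product((0, 1), repeat=3) if q(*v)]
    print(qdnf)  # [(1, 0, 0), (1, 1, 0), (1, 1, 1)]

    # sim(dj, q) = 1 iff the document vector equals some conjunctive component.
    for name, vec in docs.items():
        print(name, int(vec in qdnf))  # d1: 0, d2: 1, d3: 1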

68
Basic Vector Space Model
  • Term vector representation of documents Di = (ai1, ai2, …, ait) and queries Qj = (qj1, qj2, …, qjt)
  • t distinct terms are used to characterize content.
  • Each term is identified with a term vector T.
  • The t term vectors are linearly independent.
  • Any vector (i.e., document vector or query vector) is represented as a linear combination of the t term vectors.
  • The r-th document Dr can be represented as a document vector, written as Dr = Σj arj Tj (and similarly the s-th query as Qs = Σj qsj Tj).
69
Document representation in vector space
[Figure: a document vector in a two-dimensional vector space]
70
Similarity Measure
  • document-document similarity: between two document vectors
  • measured by the inner product of two vectors: x · y = |x| |y| cos θ
  • document-query similarity: between a document vector and a query vector
  • how to determine the vector components (i.e., ari, qsj) and the term correlations (i.e., Ti · Tj)?
71
Similarity Measure (Continued)
  • vector components

72
Similarity Measure (Continued)
  • term correlations Ti · Tj are not available; assumption: term vectors are orthogonal, i.e., Ti · Tj = 0 (i ≠ j) and Ti · Ti = 1
  • Assuming that terms are uncorrelated, the similarity between query and document becomes sim(Dr, Qs) = Σj arj · qsj
  • Similarity between documents: sim(Dr, Ds) = Σj arj · asj
73
Sample query-documentsimilarity computation
  • D1 = 2T1 + 3T2 + 5T3; D2 = 3T1 + 7T2 + 1T3; Q = 0T1 + 0T2 + 2T3
  • similarity computations for uncorrelated terms: sim(D1,Q) = 2·0 + 3·0 + 5·2 = 10; sim(D2,Q) = 3·0 + 7·0 + 1·2 = 2
  • D1 is preferred
74
Sample query-documentsimilarity computation
(Continued)
  • term correlation matrix:

          T1    T2    T3
    T1    1     0.5   0
    T2    0.5   1     -0.2
    T3    0     -0.2  1

  • similarity computations for correlated terms:
    sim(D1,Q) = (2T1 + 3T2 + 5T3) · (0T1 + 0T2 + 2T3) = 4 T1·T3 + 6 T2·T3 + 10 T3·T3 = 4·0 + 6·(-0.2) + 10·1 = 8.8
    sim(D2,Q) = (3T1 + 7T2 + 1T3) · (0T1 + 0T2 + 2T3) = 6 T1·T3 + 14 T2·T3 + 2 T3·T3 = 6·0 + 14·(-0.2) + 2·1 = -0.8
  • D1 is preferred
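
Both computations can be reproduced with a few lines of Python (a sketch; the correlation matrix is the one given above):

    T = [[1.0, 0.5, 0.0],
         [0.5, 1.0, -0.2],
         [0.0, -0.2, 1.0]]   # T[i][j] = Ti . Tj

    D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]

    def sim(d, q, corr):
        """sim(d, q) = sum over i, j of d_i * q_j * (Ti . Tj)."""
        return sum(d[i] * q[j] * corr[i][j]
                   for i in range(len(d)) for j in range(len(q)))

    identity = [[float(i == j) for j in range(3)] for i in range(3)]
    print(sim(D1, Q, identity), sim(D2, Q, identity))  # 10.0, 2.0
    print(sim(D1, Q, T), sim(D2, Q, T))                # ~8.8, ~-0.8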

75
Vector Model
  • wi,j: a positive, non-binary weight for the pair (ki, dj)
  • wi,q: a positive, non-binary weight for the pair (ki, q)
  • q = (w1,q, w2,q, …, wt,q): the query vector, where t is the total number of index terms in the system
  • dj = (w1,j, w2,j, …, wt,j): a document vector

76
Similarity of document dj w.r.t. query q
  • The correlation between the vectors dj and q: sim(dj, q) = cos θ = (dj · q) / (|dj| · |q|)
  • |q| does not affect the ranking
  • |dj| provides a normalization
77
document ranking
  • Similarity (i.e., sim(q, dj)) varies from 0 to 1.
  • Retrieve the documents with a degree of similarity above a predefined threshold (allows partial matching).

78
term weighting techniques
  • The IR problem can be viewed as one of clustering:
  • user query: a specification of a set A of objects
  • clustering problem: determine which documents are in the set A (relevant) and which ones are not (non-relevant)
  • intra-cluster similarity:
  • which features better describe the objects in the set A?
  • tf factor in the vector model: the raw frequency of a term ki inside a document dj
  • inter-cluster dissimilarity:
  • which features better distinguish the objects in the set A from the remaining objects in the collection C?
  • idf factor (inverse document frequency) in the vector model: the inverse of the frequency of a term ki among the documents in the collection

79
Definition of tf
  • N: total number of documents in the system
  • ni: the number of documents in which the index term ki appears
  • freqi,j: the raw frequency of term ki in document dj
  • fi,j: the normalized frequency of term ki in document dj: fi,j = freqi,j / maxl freql,j, where the maximum is taken over all terms kl in dj (so fi,j ranges from 0 to 1)
80
Definition of idf and tf-idf scheme
  • idfi: inverse document frequency for ki: idfi = log(N / ni)
  • wi,j: term weighting by the tf-idf scheme: wi,j = fi,j × log(N / ni)
  • query term weight (Salton and Buckley): wi,q = (0.5 + 0.5 · freqi,q / maxl freql,q) × log(N / ni), where freqi,q is the raw frequency of term ki in q
  • e.g., for a very short query with freqi,q = 1 and max freq = 2, the document formula gives a tf part of 0.5, the query formula 0.75
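
A direct transcription of these weighting formulas into Python (base-10 logarithms are assumed, matching the worked examples later in these slides):

    import math

    def doc_weight(freq_ij, max_freq_j, N, n_i):
        """w_ij = f_ij * log(N/n_i), with f_ij = freq_ij / max_l freq_lj."""
        return (freq_ij / max_freq_j) * math.log10(N / n_i)

    def query_weight(freq_iq, max_freq_q, N, n_i):
        """Salton-Buckley query weight:
        (0.5 + 0.5 * freq_iq / max_freq_q) * log(N/n_i)."""
        return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log10(N / n_i)

    # The short-query case from the slide: freq_iq = 1, max freq = 2.
    print(1 / 2)              # 0.5  (document tf part)
    print(0.5 + 0.5 * 1 / 2)  # 0.75 (query tf part)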
81
Analysis of vector model
  • advantages
  • its term-weighting scheme improves retrieval
    performance
  • its partial matching strategy allows retrieval of
    documents that approximate the query conditions
  • its cosine ranking formula sorts the documents
    according to their degree of similarity to the
    query
  • disadvantage:
  • index terms are assumed to be mutually independent

82
Probabilistic Model
  • Given a query, there is an ideal answer set:
  • a set of documents which contains exactly the relevant documents and no others
  • query process:
  • a process of specifying the properties of an ideal answer set
  • problem: what are these properties?

83
Probabilistic Model (Continued)
  • Generate a preliminary probabilistic description
    of the ideal answer set
  • Initiate an interaction with the user
  • The user looks at the retrieved documents and decides which ones are relevant and which ones are not
  • System uses this information to refine the
    description of the ideal answer set
  • Repeat the process many times.

84
Probabilistic Principle
  • Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find dj relevant
  • assumptions
  • The probability of relevance depends on query and
    document representations only
  • There is a subset of all documents which the user
    prefers as the answer set for the query q
  • Given a query, the probabilistic model assigns to
    each document dj a measure of its similarity to
    the query

85
Probabilistic Principle
  • wi,j ∈ {0,1}, wi,q ∈ {0,1}: the index term weight variables are all binary
  • q: a query, which is a subset of index terms
  • R: the set of documents known to be relevant
  • R̄ (the complement of R): the set of non-relevant documents
  • P(R|dj): the probability that document dj is relevant to the query q
  • P(R̄|dj): the probability that dj is non-relevant to q

86
similarity
  • sim(dj,q): the similarity of document dj to the query q
  • sim(dj,q) = P(R|dj) / P(R̄|dj)  (by definition)
  • = [P(dj|R) · P(R)] / [P(dj|R̄) · P(R̄)]  (Bayes' rule)
  • ~ P(dj|R) / P(dj|R̄)  (P(R) and P(R̄) are the same for all documents)
  • P(dj|R): the probability of randomly selecting the document dj from the set R of relevant documents
  • P(R): the probability that a document randomly selected from the entire collection is relevant
87
P(ki|R): the probability that the index term ki is present in a document randomly selected from the set R.
P(k̄i|R): the probability that the index term ki is not present in a document randomly selected from the set R.
With the independence assumption for index terms, taking logarithms and dropping constant factors yields the ranking formula:

sim(dj,q) ~ Σi wi,q · wi,j · [ log( P(ki|R) / (1 − P(ki|R)) ) + log( (1 − P(ki|R̄)) / P(ki|R̄) ) ]
88
(factors independent of the document are dropped from the ranking formula)
Problem: where is the set R?
89
Initial guess
  • P(ki|R) is constant for all index terms ki (typically 0.5).
  • The distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection: P(ki|R̄) = ni / N
  • (justified when N >> |R|, so the whole collection approximates R̄)
90
Initial ranking
  • V: a subset of the documents initially retrieved and ranked by the probabilistic model (the top r documents)
  • Vi: the subset of V composed of documents which contain the index term ki
  • Approximate P(ki|R) by the distribution of the index term ki among the documents retrieved so far: P(ki|R) = Vi / V
  • Approximate P(ki|R̄) by considering that all the non-retrieved documents are non-relevant: P(ki|R̄) = (ni − Vi) / (N − V)
91
Small values of V and Vi cause a problem (e.g., when V = 1 and Vi = 0):
  • alternative 1: add a constant adjustment factor: P(ki|R) = (Vi + 0.5) / (V + 1), P(ki|R̄) = (ni − Vi + 0.5) / (N − V + 1)
  • alternative 2: use ni/N as the adjustment factor: P(ki|R) = (Vi + ni/N) / (V + 1), P(ki|R̄) = (ni − Vi + ni/N) / (N − V + 1)
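
A sketch of the whole estimation loop in Python, assuming binary document vectors (the helper names and the choice of alternative 1 are assumptions for illustration):

    import math

    def initial_probs(N, n):
        """Initial guess: P(ki|R) = 0.5, P(ki|notR) = n_i / N."""
        return [(0.5, n_i / N) for n_i in n]

    def term_weight(p, u):
        """Contribution of one matching term:
        log(p/(1-p)) + log((1-u)/u)."""
        return math.log(p / (1 - p)) + math.log((1 - u) / u)

    def rank(docs, query_terms, probs):
        """sim(dj, q) ~ sum of term weights over query terms present in dj."""
        return [sum(term_weight(*probs[i]) for i in query_terms if d[i])
                for d in docs]

    def refine(V_docs, n, N):
        """Alternative 1 above, estimated from the top-ranked set V."""
        V = len(V_docs)
        Vi = [sum(d[i] for d in V_docs) for i in range(len(n))]
        return [((Vi[i] + 0.5) / (V + 1),
                 (n[i] - Vi[i] + 0.5) / (N - V + 1)) for i in range(len(n))]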

92
Analysis of Probabilistic Model
  • advantage
  • documents are ranked in decreasing order of their
    probability of being relevant
  • disadvantages
  • the need to guess the initial separation of
    documents into relevant and non-relevant sets
  • does not consider the frequency with which an index term occurs inside a document
  • the independence assumption for index terms

93
Comparison of classic models
  • Boolean model: the weakest classic model
  • Vector model is expected to outperform the
    probabilistic model with general collections
    (Salton and Buckley)

94
Alternative Set Theoretic Models-Fuzzy Set Model
  • Model:
  • a query term: defines a fuzzy set of documents
  • a document: has a degree of membership in this set
  • membership function:
  • associates a membership value with each element of the class
  • 0: no membership in the set
  • 1: full membership
  • between 0 and 1: marginal elements of the set
95
Fuzzy Set Theory
  • A fuzzy subset A of a universe of discourse U is characterized by a membership function μA: U → [0,1] which associates with each element u of U a number μA(u) in the interval [0,1]. (For a query term, U is the document collection, A the class of documents matching the term, and u a document.)
  • complement: μ¬A(u) = 1 − μA(u)
  • union: μA∪B(u) = max(μA(u), μB(u))
  • intersection: μA∩B(u) = min(μA(u), μB(u))
96
Examples
  • Assume U = {d1, d2, d3, d4, d5, d6}
  • Let A and B be {d1, d2, d3} and {d2, d3, d4}, respectively.
  • Assume μA = {d1: 0.8, d2: 0.7, d3: 0.6, d4: 0, d5: 0, d6: 0} and μB = {d1: 0, d2: 0.6, d3: 0.8, d4: 0.9, d5: 0, d6: 0}
  • complement ¬A: {d1: 0.2, d2: 0.3, d3: 0.4, d4: 1, d5: 1, d6: 1}
  • union A∪B: {d1: 0.8, d2: 0.7, d3: 0.8, d4: 0.9, d5: 0, d6: 0}
  • intersection A∩B: {d1: 0, d2: 0.6, d3: 0.6, d4: 0, d5: 0, d6: 0}
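
The same example in Python (values are approximate in floating point):

    U = ["d1", "d2", "d3", "d4", "d5", "d6"]
    mu_A = {"d1": 0.8, "d2": 0.7, "d3": 0.6, "d4": 0, "d5": 0, "d6": 0}
    mu_B = {"d1": 0, "d2": 0.6, "d3": 0.8, "d4": 0.9, "d5": 0, "d6": 0}

    complement = {u: 1 - mu_A[u] for u in U}              # 1 - mu_A(u)
    union = {u: max(mu_A[u], mu_B[u]) for u in U}         # max
    intersection = {u: min(mu_A[u], mu_B[u]) for u in U}  # min

    print(complement)    # d1: 0.2, d2: 0.3, d3: 0.4, d4: 1, d5: 1, d6: 1
    print(union)         # d1: 0.8, d2: 0.7, d3: 0.8, d4: 0.9, d5: 0, d6: 0
    print(intersection)  # d1: 0,   d2: 0.6, d3: 0.6, d4: 0,   d5: 0, d6: 0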

97
Fuzzy Information Retrieval
  • basic idea
  • Expand the set of index terms in the query with related terms (from a thesaurus) such that additional relevant documents can be retrieved.
  • A thesaurus can be constructed by defining a term-term correlation matrix c (a keyword connection matrix) whose rows and columns are associated with the index terms in the document collection.
98
Fuzzy Information Retrieval(Continued)
  • normalized correlation factor ci,l between two terms ki and kl (ranges from 0 to 1): ci,l = ni,l / (ni + nl − ni,l), where ni is the number of documents containing term ki, nl the number of documents containing kl, and ni,l the number of documents containing both.
  • In the fuzzy set associated with each index term ki, a document dj has a degree of membership μi,j = 1 − Π over terms kl in dj of (1 − ci,l)
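
Both formulas in Python, computing the correlations on the fly from binary document-term sets (the toy collection is an illustrative assumption):

    def correlation(docs, ki, kl):
        """c_{i,l} = n_{i,l} / (n_i + n_l - n_{i,l})."""
        n_i = sum(ki in d for d in docs)
        n_l = sum(kl in d for d in docs)
        n_il = sum(ki in d and kl in d for d in docs)
        denom = n_i + n_l - n_il
        return n_il / denom if denom else 0.0

    def membership(doc_terms, ki, docs):
        """mu_{i,j} = 1 - prod over terms kl in dj of (1 - c_{i,l})."""
        prod = 1.0
        for kl in doc_terms:
            prod *= 1 - correlation(docs, ki, kl)
        return 1 - prod

    docs = [{"gold", "fire"}, {"silver", "truck"}, {"gold", "truck"}]
    # 'silver' never co-occurs with 'gold' but does co-occur with 'truck':
    print(membership({"gold", "truck"}, "silver", docs))  # 0.5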
99
Fuzzy Information Retrieval(Continued)
  • physical meaning:
  • A document dj belongs to the fuzzy set associated with the term ki if its own terms are related to ki, i.e., μi,j = 1.
  • If there is at least one index term kl of dj which is strongly related to the index ki, then μi,j ≈ 1, and ki is a good fuzzy index for dj.
  • When all index terms of dj are only loosely related to ki, μi,j ≈ 0, and ki is not a good fuzzy index for dj.
100
Example
  • q = ka ∧ (kb ∨ ¬kc) = (ka ∧ kb ∧ kc) ∨ (ka ∧ kb ∧ ¬kc) ∨ (ka ∧ ¬kb ∧ ¬kc) = cc1 + cc2 + cc3
  • Da: the fuzzy set of documents associated with the index term ka; dj ∈ Da has a degree of membership μa,j greater than a predefined threshold K
  • D̄a: the fuzzy set of documents associated with the negation of the index term ka

[Figure: Venn diagram of the fuzzy sets Da, Db, Dc marking the conjunctive components cc1, cc2, cc3]
101
Example
  • Query q = ka ∧ (kb ∨ ¬kc)
  • disjunctive normal form: qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  • (1) the degree of membership in a disjunctive fuzzy set is computed using an algebraic sum (instead of the max function), which behaves more smoothly
  • (2) the degree of membership in a conjunctive fuzzy set is computed using an algebraic product (instead of the min function)
  • hence, recalling the membership definition above: μq,j = 1 − (1 − μa,j μb,j μc,j) · (1 − μa,j μb,j (1 − μc,j)) · (1 − μa,j (1 − μb,j)(1 − μc,j))
102
Fuzzy Set Model
  • Q: "gold silver truck"
    D1: "Shipment of gold damaged in a fire"
    D2: "Delivery of silver arrived in a silver truck"
    D3: "Shipment of gold arrived in a truck"
  • IDF (select keywords):
  • a, in, of: 0 = log(3/3)
  • arrived, gold, shipment, truck: 0.176 = log(3/2)
  • damaged, delivery, fire, silver: 0.477 = log(3/1)
  • 8 keywords (dimensions) are selected: arrived(1), damaged(2), delivery(3), fire(4), gold(5), silver(6), shipment(7), truck(8)
103
Fuzzy Set Model
104
Fuzzy Set Model
105
Fuzzy Set Model
  • Sim(q,d), alternative 1: Sim(q,d3) > Sim(q,d2) > Sim(q,d1)
  • Sim(q,d), alternative 2: Sim(q,d3) > Sim(q,d2) > Sim(q,d1)

106
Alternative Algebraic ModelGeneralized Vector
Space Model
  • the classic vector model assumes independence of index terms:
  • ki: a vector associated with the index term ki
  • the set of vectors {k1, k2, …, kt} is linearly independent
  • orthogonal: ki · kj = 0 for i ≠ j
  • In the generalized vector space model, the index term vectors are assumed linearly independent but are not pairwise orthogonal.
  • The index term vectors, which are not seen as the basis of the space, are composed of smaller components derived from the particular collection.
107
Review
  • Two vectors u and v are linearly independent:
  • if αu + βv = 0 then α = β = 0
  • Two vectors u and v are orthogonal (i.e., θ = 90°):
  • u · v = 0 (i.e., uTv = 0)
  • If two nonzero vectors u and v are orthogonal, then u and v are linearly independent:
  • assume αu + βv = 0 with u ≠ 0 and v ≠ 0
  • uT(αu + βv) = 0 → α uTu + β uTv = 0 → α uTu = 0 → α = 0 (and similarly β = 0)

108
Generalized Vector Space Model
  • {k1, k2, …, kt}: index terms in a collection
  • wi,j: binary weights associated with the term-document pair (ki, dj)
  • The patterns of term co-occurrence (inside documents) can be represented by a set of 2^t minterms
  • gi(mj): returns the weight {0,1} of the index term ki in the minterm mj (1 ≤ i ≤ t)

m1 = (0, 0, …, 0): points to documents containing none of the index terms
m2 = (1, 0, …, 0): points to documents containing the index term k1 only
m3 = (0, 1, …, 0): points to documents containing the index term k2 only
m4 = (1, 1, …, 0): points to documents containing the index terms k1 and k2
m_2^t = (1, 1, …, 1): points to documents containing all the index terms
109
Generalized Vector Space Model(Continued)
  • mi (a 2^t-tuple vector) is associated with minterm mi (a t-tuple vector)
  • e.g., m4 is associated with m4, pointing to documents containing k1 and k2 and no others
  • co-occurrence of index terms inside documents implies dependencies among index terms
  • (the set of vectors mi are pairwise orthogonal)
110
Minterms mr and their vectors mr (t = 3):
  m1 = (0,0,0): m1 = (1,0,0,0,0,0,0,0)
  m2 = (0,0,1): m2 = (0,1,0,0,0,0,0,0)
  m3 = (0,1,0): m3 = (0,0,1,0,0,0,0,0)
  m4 = (0,1,1): m4 = (0,0,0,1,0,0,0,0)
  m5 = (1,0,0): m5 = (0,0,0,0,1,0,0,0)
  m6 = (1,0,1): m6 = (0,0,0,0,0,1,0,0)
  m7 = (1,1,0): m7 = (0,0,0,0,0,0,1,0)
  m8 = (1,1,1): m8 = (0,0,0,0,0,0,0,1)

Sample collection:
  d1 (k1), d2 (k3), d3 (k3), d4 (k1), d5 (k2), d6 (k2), d7 (k2 k3), d8 (k2 k3), d9 (k2), d10 (k2 k3),
  d11 (k1 k2), d12 (k1 k3), d13 (k1 k2), d14 (k1 k2), d15 (k1 k2 k3), d16 (k1 k2), d17 (k1 k2), d18 (k1 k2), d19 (k1 k2 k3), d20 (k1 k2)
113
Generalized Vector Space Model(Continued)
  • Determine the index vector ki associated with the index term ki:

    ki = ( Σ over minterms mr with gi(mr)=1 of ci,r · mr ) / sqrt( Σ over the same mr of ci,r² )

    where ci,r = Σ of wi,j over the documents dj whose term-occurrence pattern coincides with minterm mr

  • i.e., collect all the vectors mr in which the index term ki is in state 1, and sum up the weights wi,j of the documents whose pattern coincides with mr.
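
A compact Python sketch of this construction (general non-negative weights are assumed; the slides use binary ones):

    from collections import defaultdict
    import math

    def index_term_vectors(docs, t):
        """docs: list of length-t weight vectors. Each distinct co-occurrence
        pattern (minterm) becomes one orthogonal dimension; c[i][m] sums the
        w_ij of documents whose pattern coincides with minterm m."""
        c = defaultdict(lambda: defaultdict(float))
        for d in docs:
            minterm = tuple(int(w > 0) for w in d)
            for i, w in enumerate(d):
                if w > 0:
                    c[i][minterm] += w
        vectors = {}
        for i in range(t):
            norm = math.sqrt(sum(v * v for v in c[i].values()))
            vectors[i] = {m: v / norm for m, v in c[i].items()} if norm else {}
        return vectors

    def term_correlation(ki, kj):
        """ki . kj: sum over shared minterms of the normalized c factors."""
        return sum(w * kj.get(m, 0.0) for m, w in ki.items())

    # Example: three documents over t = 3 terms, binary weights.
    ks = index_term_vectors([[1, 0, 0], [0, 1, 1], [1, 1, 0]], 3)
    print(term_correlation(ks[0], ks[1]))  # 0.5 (k1 and k2 co-occur in one pattern)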
114
Generalized Vector Space Model(Continued)
  • ki · kj quantifies a degree of correlation between ki and kj: ki · kj = Σ over minterms mr with gi(mr) = 1 and gj(mr) = 1 of ci,r · cj,r (up to normalization)
  • the standard cosine similarity is adopted for ranking
116
Comparison with Standard Vector Space Model
  d1 (k1): (w1,1, 0, 0)           d11 (k1 k2): (w1,11, w2,11, 0)
  d2 (k3): (0, 0, w3,2)           d12 (k1 k3): (w1,12, 0, w3,12)
  d3 (k3): (0, 0, w3,3)           d13 (k1 k2): (w1,13, w2,13, 0)
  d4 (k1): (w1,4, 0, 0)           d14 (k1 k2): (w1,14, w2,14, 0)
  d5 (k2): (0, w2,5, 0)           d15 (k1 k2 k3): (w1,15, w2,15, w3,15)
  d6 (k2): (0, w2,6, 0)           d16 (k1 k2): (w1,16, w2,16, 0)
  d7 (k2 k3): (0, w2,7, w3,7)     d17 (k1 k2): (w1,17, w2,17, 0)
  d8 (k2 k3): (0, w2,8, w3,8)     d18 (k1 k2): (w1,18, w2,18, 0)
  d9 (k2): (0, w2,9, 0)           d19 (k1 k2 k3): (w1,19, w2,19, w3,19)
  d10 (k2 k3): (0, w2,10, w3,10)  d20 (k1 k2): (w1,20, w2,20, 0)
117
Generalized Vector Space Model
118
Generalized Vector Space Model
119
Generalized Vector Space Model
120
Vector Space Model
  • Q: "gold silver truck"
    D1: "Shipment of gold damaged in a fire"
    D2: "Delivery of silver arrived in a silver truck"
    D3: "Shipment of gold arrived in a truck"
  • 8 dimensions: (arrived, damaged, delivery, fire, gold, silver, shipment, truck)
  • Weight = TF × IDF
  • Q  = (0, 0, 0, 0, .176, .477, 0, .176)
    D1 = (0, .477, 0, .477, .176, 0, .176, 0)
    D2 = (.176, 0, .477, 0, 0, .954, 0, .176)
    D3 = (.176, 0, 0, 0, .176, 0, .176, .176)
121
Construction of Matrix T
[Figure: the term-term correlation matrix T constructed from documents d1, d2, d3]
122
Normalize Matrix K
[Figure: matrix K with each column normalized to unit length (normalized direction)]
123
Construction of Matrix T
Calculate it by yourself.
124
Latent Semantic Indexing (LSI) Model
  • representation of documents and queries by index terms:
  • problem 1: many unrelated documents might be included in the answer set
  • problem 2: relevant documents which are not indexed by any of the query keywords are not retrieved
  • possible solution: concept matching instead of index term matching
  • applied in cross-language information retrieval (CLIR)

125
basic idea
  • Map each document and query vector into a lower-dimensional space which is associated with concepts.
  • Retrieval in the reduced space may be superior to
    retrieval in the space of index terms

126
Definition
  • t: the number of index terms in the collection
  • N: the total number of documents
  • M = (Mij): a term-document association matrix with t rows (terms) and N columns (documents)
  • Mij: a weight wi,j associated with the term-document pair (ki, dj) (e.g., using tf-idf)

127
Singular Value Decomposition
M = K · S · Dᵗ, where K and D have orthogonal (orthonormal) columns and S = diag(σ1, σ2, …, σn) is the diagonal matrix of singular values, with σ1 ≥ σ2 ≥ … ≥ σn ≥ 0.
128
Using (AB)ᵗ = Bᵗ Aᵗ and the orthogonality of K and D:
Mᵗ · M = (K S Dᵗ)ᵗ (K S Dᵗ) = D S Kᵗ K S Dᵗ = D S² Dᵗ
where S = diag(σ1, σ2, …, σn) with σ1 ≥ σ2 ≥ … ≥ σn ≥ 0, so D holds the eigenvectors of Mᵗ · M and the σk² are its eigenvalues.
129
For a symmetric matrix A, A = Q · diag(λ1, λ2, …, λn) · Qᵗ, where λ1, λ2, …, λn are the eigenvalues of A and qk (the k-th column of Q) is the eigenvector of A corresponding to λk.
130
Singular Value Decomposition
According to the eigen-decomposition above, the factors K and D can be obtained from M · Mᵗ and Mᵗ · M, respectively.
131
Because A = Q D Qᵗ, where Q is the matrix of eigenvectors of A and D is the diagonal matrix of its eigenvalues, the SVD factors of M can be derived. Then keep only s < r of the dimensions (the concept space is reduced).
132
Consider only the s largest singular values of S (s << t, s << N): Ms = Ks · Ss · Dsᵗ
The resultant Ms matrix is the matrix of rank s which is closest to the original matrix M in the least-squares sense. The s surviving dimensions act as concepts, relating the index terms to the documents.
133
Ranking in LSI
  • the query is modeled as a pseudo-document in the original term-document matrix M
  • the query is modeled as the document with number 0
  • the first row of Msᵗ · Ms then gives the ranks of all documents with respect to this query
  • element (i,j) of Msᵗ · Ms quantifies the relationship between documents di and dj; when i = 0, it denotes the similarity between q and the documents
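
A numpy sketch of the whole LSI pipeline (the toy matrix and the ranking-by-projection shortcut q @ Ms are assumptions for illustration):

    import numpy as np

    def lsi_rank(M, q, s):
        """M: t x N term-document matrix; q: length-t query vector;
        s: number of concepts kept."""
        K, sigma, Dt = np.linalg.svd(M, full_matrices=False)
        Ms = K[:, :s] @ np.diag(sigma[:s]) @ Dt[:s, :]  # rank-s approximation
        # Treating q as pseudo-document 0, its row of Ms^t Ms equals q @ Ms.
        return q @ Ms

    M = np.array([[1, 0, 1],   # toy 4-term x 3-document matrix
                  [1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1]], dtype=float)
    q = np.array([1.0, 0.0, 0.0, 1.0])
    print(lsi_rank(M, q, s=2))  # one similarity score per document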
134
Structured Text Retrieval Models
  • Definition:
  • combine information on text content with information on the document structure
  • e.g., same-page(near(atomic holocaust, Figure(label(earth))))
  • Expressive power vs. evaluation efficiency:
  • a model based on non-overlapping lists
  • a model based on proximal nodes
  • Terminology:
  • match point: position in the text of a sequence of words that matches the user query
  • region: a contiguous portion of the text
  • node: a structural component of the document (chapter, section, …)
135
Non-Overlapping Lists
  • divide the whole text of each document into non-overlapping text regions (lists)
  • example below
  • text regions from distinct lists might overlap, but regions within a single list do not

Example indexing lists (character positions 1 to 5000; regions within each list are non-overlapping):
  L0, chapters: Chapter 1 = [1, 5000]
  L1, sections: 1.1 = [1, 3000]; 1.2 = [3001, 5000]
  L2, subsections: 1.1.1 = [1, 1000]; 1.1.2 = [1001, 3000]; 1.2.1 = [3001, 5000]
  L3, subsubsections: [1, 500]; [501, 1000]; [1001, …]
136
Non-Overlapping Lists(Continued)
  • Data structure:
  • a single inverted file (recall that there is another inverted file for the words in the text)
  • each structural component (e.g., chapter, section, …) stands as an entry
  • for each entry, there is a list of text regions as a list of occurrences
  • Operations:
  • select a region which contains a given word
  • select a region A which does not contain any other region B (where B belongs to a list distinct from the list for A)
  • select a region not contained within any other region

137
Inverted Files
  • File is represented as an array of indexed
    records.

138
Inverted-file process
  • The record-term array is inverted (transposed).

139
Inverted-file process (Continued)
  • Take two or more rows of an inverted term-record array, and produce a single combined list of record identifiers.
  • Query (term2 AND term3):
    term2: 1 1 0 0
    term3: 0 1 1 1
    AND:   0 1 0 0  → R2
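
The row-combining operation in Python over the toy bit rows above:

    # Rows of the inverted term-record array over records R1..R4.
    inverted = {
        "term2": [1, 1, 0, 0],
        "term3": [0, 1, 1, 1],
    }

    def AND(*terms):
        rows = [inverted[t] for t in terms]
        return ["R%d" % (i + 1)
                for i, bits in enumerate(zip(*rows)) if all(bits)]

    print(AND("term2", "term3"))  # ['R2']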

140
Extensions of Inverted Index Operations(Distance
Constraints)
  • Distance Constraints
  • (A within sentence B)terms A and B must co-occur
    in a common sentence
  • (A adjacent B)terms A and B must occur
    adjacently in the text

141
Extensions of Inverted Index Operations(Distance
Constraints)
  • Implementation:
  • include term locations in the inverted indexes:
    information: {R345, R348, R350, …}
    retrieval: {R123, R128, R345, …}
  • include sentence locations in the indexes:
    information: {(R345, 25), (R345, 37), (R348, 10), (R350, 8)}
    retrieval: {(R123, 5), (R128, 25), (R345, 37), (R345, 40)}

142
Extensions of Inverted Index Operations(Distance
Constraints)
  • include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences:
    information: {(R345, 2, 3, 5)}
    retrieval: {(R345, 2, 3, 6)}
  • query examples:
    (information adjacent retrieval)
    (information within five words retrieval)
  • cost: the size of the indexes
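
A sketch of such a positional index in Python, using the postings from the slide (tuples of record, paragraph, sentence, and word number):

    postings = {
        "information": [("R345", 2, 3, 5)],
        "retrieval":   [("R345", 2, 3, 6)],
    }

    def adjacent(a, b):
        """(a adjacent b): same record, paragraph, and sentence,
        with consecutive word numbers."""
        return [(ra, pa, sa, wa)
                for ra, pa, sa, wa in postings[a]
                for rb, pb, sb, wb in postings[b]
                if (ra, pa, sa) == (rb, pb, sb) and wb - wa == 1]

    print(adjacent("information", "retrieval"))  # [('R345', 2, 3, 5)]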

143
Model Based on Proximal Nodes
  • hierarchical vs. flat indexing structures:
  • hierarchical index: chapters, sections, subsections, subsubsections (nodes are positions in the text)
  • flat index: paragraphs, pages, lines
  • in addition, an inverted list per word, e.g., for "holocaust": entries 10, 256, 48324, … (positions in the text)
144
Model Based on Proximal Nodes(Continued)
  • query language:
  • specification of regular expressions
  • reference to structural components by name
  • combinations of both
  • Example:
  • search for sections, subsections, or subsubsections which contain the word "holocaust":
  • (section) with (holocaust)
145
Model Based on Proximal Nodes(Continued)
  • Basic algorithm:
  • traverse the inverted list for the term "holocaust"
  • for each entry in the list (i.e., an occurrence), search the hierarchical index looking for sections, subsections, and subsubsections
  • Revised algorithm (exploits nearby nodes):
  • for the first entry, search as before
  • let the last matching structural component be the innermost matching component
  • verify whether the innermost matching component also matches the second entry
  • if it does, the larger structural components above it also do
146
Models for Browsing
  • Browsing vs. searching
  • The goal of a searching task is clearer in the
    mind of the user than the goal of a browsing task
  • Models
  • Flat browsing
  • Structure guided browsing
  • The hypertext model

147
Models for Browsing
  • Flat organization:
  • documents are represented as dots in a 2-D plane
  • documents are represented as elements in a 1-D list, e.g., the results of a search engine
  • Structure guided browsing:
  • documents are organized in a directory, which groups documents covering related topics
  • Hypertext model:
  • navigating the hypertext: a traversal of a directed graph
148
Trends and Research Issues
  • Library systems:
  • cognitive and behavioral issues, aimed particularly at a better understanding of which criteria the users adopt to judge relevance
  • Specialized retrieval systems:
  • e.g., legal and business documents
  • how to retrieve all relevant documents without retrieving a large number of unrelated documents
  • The Web:
  • the user does not know what he wants or has great difficulty in formulating his request
  • how the paradigm adopted for the user interface affects the ranking
  • the indexes maintained by the various Web search engines are almost disjoint