Classical Models
1
Lecture 3: Document Models for IR
  • Classical Models
  • Latent Semantic Indexing Model
  • A Structural Model

2
Logical View of a Document
  • A text document may be represented for computer
    analysis in different formats
  • Full text
  • Index Terms
  • Structures

3
The Role of the Indexer
  • The huge size of the Internet makes it
    unrealistic to use full text for information
    retrieval that requires quick responses
  • The indexer simplifies the logical view of a
    document
  • Indexing method dictates document storage and
    retrieval algorithms
  • Automation of indexing methods is necessary for
    information retrieval over the Internet.

4
Possible drawbacks
  • Summarizing a document through a set of index
    terms may lead to poor performance:
  • many unrelated documents may be included in the
    answer set for a query
  • relevant documents which are not indexed by any
    of the query keywords cannot be retrieved

5
A Formal Description of IR Models
  • A quadruple [D, Q, F, R(q,d)]
  • D (documents) is a set composed of logical views
    (or representations) of the documents in the
    collection.
  • Q (queries) is a set composed of logical views
    (or representations) for user information needs.
  • F (Framework) is a framework for modeling
    document representations, queries, and their
    relationships
  • R(q,d) is a ranking function which associates a
    real number with a query q and a document
    representation d. Such ranking defines an
    ordering among the documents with regard to the
    query q .

6
Classic Models
  • Boolean Model
  • Vector Space Model
  • Probabilistic Model

7
Boolean Model
  • Document representation: full text or a set of
    keywords (contained in the text or not)
  • Query representation: logical operators, query
    terms, query expressions
  • Searching: use an inverted file and set
    operations to construct the result set

8
Boolean Searching
  • Queries
  • A AND B AND (C OR D)
  • Break the collection into two unordered sets:
  • documents that match the query
  • documents that don't
  • Return all that match the query (see the sketch
    below)
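As a minimal sketch of Boolean searching with an inverted file and set operations (the collection below reuses three of the lecture's baking titles; the `postings` helper is ours, not from the slides):

```python
# Minimal sketch of Boolean retrieval over an inverted file.
docs = {
    1: "how to bake bread without recipes",
    2: "the classic art of viennese pastry",
    3: "numerical recipes the art of scientific computing",
}

# Build the inverted file: term -> set of document ids.
inverted = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted.setdefault(term, set()).add(doc_id)

def postings(term):
    """Return the set of documents containing the term."""
    return inverted.get(term, set())

# Evaluate "bread AND (recipes OR pastry)" with set operations:
# intersection for AND, union for OR.
result = postings("bread") & (postings("recipes") | postings("pastry"))
print(sorted(result))  # -> [1]
```

Note that the result is an unordered set: documents either match the query or they don't.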

9
Boolean Model
[Figure: Venn diagram over the terms ka, kb, and kc,
marking the regions (1,1,1), (1,1,0), and (1,0,0).]
The three conjunctive components for the query
q = ka ∧ (kb ∨ ¬kc)
10
Another Example
Consider three documents: about CityU @
http://www.cityu.edu.hk/cityu/about/index.htm,
about FSE @
http://www.cityu.edu.hk/cityu/dpt-acad/fse.htm,
and about CS @
http://www.cs.cityu.edu.hk/content/about/.
The query (degree AND aim) returns only the page
about CityU. The query (degree OR aim) returns all
three.
11
Advantages
  • Simple and clean formalism
  • The answer set is exactly what the user asks
    for.
  • Therefore, users can have complete control if
    they know how to write a Boolean formula of terms
    for the documents they want to find.
  • Easy to implement on computers
  • Popular (most search engines support this model)

12
Disadvantages
  • All results are treated as equals; there is no
    ranking of the documents
  • The set of all documents that satisfy a query
    may still be too large for users to browse
    through, or too small
  • Users may only know what they are looking for in
    a vague way and may not be able to formulate it
    as a Boolean expression
  • Users need to be trained

13
Improvements to the Boolean model
  • Expand and refine query through interactive
    protocols
  • Automation of query formula generation
  • Assign weights to query terms and rank the
    results accordingly

14
Vector Space Model
  • Vector Representation
  • Similarity Measure

15
Vector Space Model
  • Represent stored texts as well as information
    queries by vectors of terms
  • A term is typically a word, a word stem, or a
    phrase associated with the text under
    consideration, and may carry a word weight.
  • Generate terms by a term-weighting system:
  • terms are not equally useful for content
    representation
  • assign high weights to terms deemed important
    and low weights to the less important terms

16
Vector Representation
  • Every document in the collection is represented
    by a vector
  • The distinct terms in the collection are called
    index terms, or the vocabulary

[Figure: a page collection about computers; its
index terms include Computer, XML, Operating System,
Microsoft Office, Unix, and Search Engines.]
17
Term relationships
  • Each term is identified as T_i
  • There is no relationship between terms in the
    vector space; they are treated as orthogonal
  • In reality, in a collection about computers,
    terms like computer and OS are correlated with
    each other.

18
Vector space model
  • A vocabulary of 2 terms forms a 2-D space; each
    document may contain 0, 1, or 2 of the terms. We
    may see the following vectors as
    representations:
  • D1 = <0, 0>
  • D2 = <0, 0.3>
  • D3 = <2, 3>

19
Term/Document matrix
  • t terms form a t-dimensional space
  • Documents and queries can be represented as t-D
    vectors
  • A document can be considered a point in the t-D
    space
  • We may form an n-by-t matrix for n documents
    indexed with t terms.

20
Document-Term Matrix
[Figure: the document-term matrix; rows are the
documents, columns are the terms, and each entry is
the weight of a term in a document.]
21
Deciding the weight
  • Combine two factors in the document-term weight:
  • tf_ij: frequency of term j in document i
  • df_j: document frequency of term j, i.e. the
    number of documents containing term j
  • idf_j: inverse document frequency of term j,
    defined as log2(N / df_j), where N is the number
    of documents in the collection
  • Inverse document frequency is an indication of a
    term's value as a document discriminator.

22
Tf-idf term weight
  • A term that occurs frequently in the document
    but rarely in the rest of the collection gets a
    high weight
  • A typical combined term-importance indicator:
  • w_ij = tf_ij × idf_j = tf_ij × log2(N / df_j)
  • Many other ways to determine the document-term
    weight have been recommended (see the sketch
    below)
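As a small, hedged illustration of this formula, the following Python sketch computes w_ij = tf_ij · log2(N / df_j) from a plain term-frequency matrix (the function name `tfidf_matrix` is ours, not from the lecture):

```python
import math

def tfidf_matrix(term_freq):
    """term_freq[i][j] is tf_ij; returns the matrix of weights w_ij."""
    n_docs = len(term_freq)
    n_terms = len(term_freq[0])
    # df_j: number of documents containing term j
    df = [sum(1 for i in range(n_docs) if term_freq[i][j] > 0)
          for j in range(n_terms)]
    # w_ij = tf_ij * log2(N / df_j); terms appearing nowhere get weight 0
    return [[term_freq[i][j] * math.log2(n_docs / df[j]) if df[j] else 0.0
             for j in range(n_terms)]
            for i in range(n_docs)]
```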

23
An Example of 5 documents
  • D1: How to Bake Bread without Recipes
  • D2: The Classic Art of Viennese Pastry
  • D3: Numerical Recipes: The Art of Scientific
    Computing
  • D4: Breads, Pastries, Pies and Cakes: Quantity
    Baking Recipes
  • D5: Pastry: A Book of Best French Recipe

24
Six Index terms
  • T1: bak(e, ing)
  • T2: recipes
  • T3: bread
  • T4: cake
  • T5: pastr(y, ies)
  • T6: pie

25
An Example of 5 documents
  • D1: How to Bake Bread without Recipes
  • D2: The Classic Art of Viennese Pastry
  • D3: Numerical Recipes: The Art of Scientific
    Computing
  • D4: Breads, Pastries, Pies and Cakes: Quantity
    Baking Recipes
  • D5: Pastry: A Book of Best French Recipe

26
Term Frequency in documents
(i, j) = 1 if document i contains term j once

      T1  T2  T3  T4  T5  T6
  D1   1   1   1   0   0   0
  D2   0   0   0   0   1   0
  D3   0   1   0   0   0   0
  D4   1   1   1   1   1   1
  D5   0   1   0   0   1   0
27
Document frequency of term j

        T1  T2  T3  T4  T5  T6
  df_j   2   4   2   1   3   1
28
Tf-idf weight matrix

      T1        T2        T3        T4      T5        T6
  D1  log(5/2)  log(5/4)  log(5/2)  0       0         0
  D2  0         0         0         0       log(5/3)  0
  D3  0         log(5/4)  0         0       0         0
  D4  log(5/2)  log(5/4)  log(5/2)  log(5)  log(5/3)  log(5)
  D5  0         log(5/4)  0         0       log(5/3)  0

(Here log denotes log2, as defined earlier.)
29
Exercise
  • Write a program that uses the tf-idf term weight
    to form the term/document matrix. Test it on the
    following three documents (a starting sketch is
    given below):
  • http://www.cityu.edu.hk/cityu/about/index.htm
  • http://www.cityu.edu.hk/cityu/dpt-acad/fse.htm
  • http://www.cs.cityu.edu.hk/content/about/
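Fetching and parsing the three web pages is left as part of the exercise; as a hedged starting point, this sketch builds the tf-idf matrix for the five baking documents above (texts reduced to their stemmed index terms), so its output can be checked against the weight matrix on the previous slide:

```python
import math

# The five example documents, reduced to stemmed index terms.
docs = [
    "bake bread recipes",                  # D1
    "pastry",                              # D2
    "recipes",                             # D3
    "bread pastry pie cake bake recipes",  # D4
    "pastry recipes",                      # D5
]
terms = ["bake", "recipes", "bread", "cake", "pastry", "pie"]  # T1..T6

N = len(docs)
tf = [[d.split().count(t) for t in terms] for d in docs]
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(terms))]
w = [[tf[i][j] * math.log2(N / df[j]) for j in range(len(terms))]
     for i in range(N)]

for i, row in enumerate(w, 1):
    print("D%d" % i, ["%.2f" % x for x in row])
# e.g. the D4/T4 entry is log2(5/1) = log2(5), matching the slide.
```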

30
Similarity Measure
  • Determine the similarity between a document D
    and a query Q
  • Many methods can be used to calculate the
    similarity
  • Cosine similarity measure

31
Similarity Measure: cosine

[Figure: the vectors d_j and Q with the angle θ
between them.]

Cosine similarity measures the cosine of the angle
between two vectors:
sim(d_j, Q) = cos θ = (d_j · Q) / (|d_j| |Q|)
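A direct transcription of the formula (the vectors below are illustrative):

```python
import math

def cosine(d, q):
    """Cosine of the angle between term-weight vectors d and q."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

print(cosine([1.0, 0.5, 0.0], [0.8, 0.0, 0.3]))  # in [0, 1] for tf-idf weights
```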
32
Advantages of VSM
  • Term-weighting scheme improves retrieval
    performance
  • Partial matching strategy allows retrieval of
    documents that approximate the query conditions
  • Its cosine ranking formula sorts the documents
    according to their degree of similarity to the
    query

33
Limitations of VSM
  • The underlying assumption is that the terms in
    the vector are orthogonal
  • Several query terms are needed if a
    discriminating ranking is to be achieved, whereas
    only two or three ANDed terms may suffice in a
    Boolean environment to obtain high-quality output
  • It is difficult to explicitly specify synonymous
    and phrasal relationships, whereas these can be
    easily handled in a Boolean environment by means
    of the OR and AND operators or by an extended
    Boolean model

34
Latent Semantic Indexing Model of document/query
  • Map document and query vectors into a
    lower-dimensional space which is associated with
    concepts
  • Information retrieval using a singular value
    decomposition model of latent semantic structure.
    11th ACM SIGIR Conference, pp. 465-480, 1988
  • by G. W. Furnas, S. Deerwester, S. T. Dumais,
    T. K. Landauer, R. A. Harshman, L. A. Streeter,
    and K. E. Lochbaum
  • http://www.cs.utk.edu/lsi/
  • A tutorial:
  • http://www.cs.utk.edu/berry/lsi/node5.html

35
General Approach
  • It is based on the vector space model
  • In the vector space model, terms are treated
    independently
  • Here some relationships between the terms are
    obtained implicitly, magically, through matrix
    analysis
  • This allows reduction of some unnecessary
    information in the document representation.

36
Term-document association matrix
  • Let t be the number of terms and N be the number
    of documents
  • Let M = (M_ij) be the term-document association
    matrix.
  • M_ij may be considered as the weight associated
    with the term-document pair (t_i, d_j)

37
Eigenvalue and Eigenvector
  • Let A be an n×n matrix and x be a nonzero
    n-dimensional vector
  • x is an eigenvector of A if Ax is the same as
    cx, where c is a scale factor; c is the
    corresponding eigenvalue.
  • Example:
  • A = [[2, 1], [1, 2]] and x = (1, 1)^t
  • Then Ax = 3x
  • 3 is an eigenvalue, and x is an eigenvector
  • Question: find another eigenvalue.

38
Example continued
  • Let y^t = (1, -1). Then Ay = (1, -1)^t = y.
  • Therefore, another eigenvalue is 1, and its
    associated eigenvector is y
  • Now let S = diag(3, 1), the diagonal matrix of
    the two eigenvalues
  • Then A(x, y) = (x, y)S
  • Moreover, x^t y = 0

39
Example continued
  • Let K = (x, y)/sqrt(2)
  • Then K^t K = I
  • and A = K S K^t
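The example's decomposition can be checked numerically; a small sketch with NumPy (np.linalg.eigh is the eigensolver for symmetric matrices):

```python
import numpy as np

# The example matrix: eigenvalues 3 and 1, eigenvectors (1,1)^t and (1,-1)^t.
A = np.array([[2.0, 1.0], [1.0, 2.0]])

vals, K = np.linalg.eigh(A)   # eigenvalues (ascending) and orthonormal K
S = np.diag(vals)

print(vals)                               # [1. 3.]
print(np.allclose(K.T @ K, np.eye(2)))    # K^t K = I   -> True
print(np.allclose(K @ S @ K.T, A))        # A = K S K^t -> True
```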

40
A General Theorem from Linear Algebra
  • If A is a symmetric matrix, then
  • there exists a matrix K (with K^t K = I) and a
    diagonal matrix S such that
  • A = K S K^t

41
Application to our case
  • Both MM^t and M^t M are symmetric
  • In addition, their eigenvalues are the same,
    except that the larger matrix has an extra number
    of zeros.

42
Decomposition of the term-document association matrix
  • Decompose M = K S D^t
  • K: the matrix of eigenvectors derived from the
    term-to-term correlation matrix given by MM^t
  • D^t: the corresponding matrix derived from M^t M
  • S: an r×r diagonal matrix of singular values,
    where r is the rank of M

43
Reduced Concept Space
  • Let S_s be the diagonal matrix of the s largest
    singular values of S.
  • Let K_s and D_s^t be the corresponding columns
    of K and rows of D^t.
  • The matrix M_s = K_s S_s D_s^t
  • is closest to M in the least-squares sense
  • NOTE: M_s has the same number of rows (terms)
    and columns (documents) as M, but it may be
    totally different from M (see the sketch below).
  • A numerical example:
  • www.cs.arizona.edu/classes/cs630/spring03/slides/jan-29.ppt
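A hedged sketch of the rank-s approximation with NumPy's SVD (the toy 4-term × 3-document matrix M and the choice s = 2 are illustrative):

```python
import numpy as np

# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
M = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

K, sing, Dt = np.linalg.svd(M, full_matrices=False)   # M = K S D^t
s = 2                                                  # concepts to keep
Ks, Ss, Dst = K[:, :s], np.diag(sing[:s]), Dt[:s, :]
Ms = Ks @ Ss @ Dst             # M_s: the closest rank-s matrix to M

print(np.linalg.norm(M - Ms))  # least-squares (Frobenius) approximation error
```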

44
The relationship of two documents d_i and d_j
  • M_s^t M_s = (K_s S_s D_s^t)^t (K_s S_s D_s^t)
  •           = D_s S_s K_s^t K_s S_s D_s^t
  •           = D_s S_s S_s D_s^t
  •           = (D_s S_s)(D_s S_s)^t
  • The (i, j) element quantifies the relationship
    between documents i and j.

45
The choice of s
  • It should be large enough to allow fitting all
    the structure in the original data
  • It should be small enough to allow filtering out
    noise caused by variation in the choice of terms.

46
Ranking documents according to a query
  • Model the query Q as a pseudo-document in the
    original term-document matrix M
  • The vector M_s^t Q provides ranks of all
    documents with respect to this query Q (see the
    sketch below).
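Continuing the sketch above, the query is folded in as a pseudo-document and every document is scored by M_s^t Q (the query vector is illustrative):

```python
import numpy as np

M = np.array([[1.0, 0.0, 1.0],   # same toy term-document matrix as before
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
K, sing, Dt = np.linalg.svd(M, full_matrices=False)
s = 2
Ms = K[:, :s] @ np.diag(sing[:s]) @ Dt[:s, :]

q = np.array([1.0, 0.0, 0.0, 1.0])  # pseudo-document: query uses terms 1 and 4
scores = Ms.T @ q                   # one relevance score per document
print(np.argsort(scores)[::-1])     # document indices, best match first
```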

47
Advantages
  • When s is small with respect to t and N, it
    provides an efficient indexing model
  • It provides for elimination of noise and removal
    of redundancy
  • It introduces conceptualization based on the
    theory of singular value decomposition

48
Graph model of document/query
  • Improving Effectiveness and Efficiency of Web
    Search by Graph-based Text Representation
  • by Junji Tomita and Yoshihiko Hayashi
  • http://www9.org/final-posters/13/poster13.html
  • Interactive web search by graphical query
    refinement
  • by Junji Tomita and Genichiro Kikui
  • http://www10.org/cdrom/posters/1078.pdf

49
Graph-based text representation model
  • Subject graph:
  • a node represents a term in the text,
  • a link denotes an association between the linked
    terms.
  • The significance of terms and of term-term
    associations is represented as weights assigned
    to them.

50
Assignment of Weights
  • Term-statistics-based weighting schemes:
  • frequencies of terms
  • frequencies of term-term associations
  • multiplied by the inverse document frequency

51
Similarity of documents
  • Subject graph matching:
  • weight terms and term-term associations with λ
    and 1 − λ, for an adequately chosen λ.
  • Then calculate the cosine value of the two
    documents, treating the weighted terms and
    term-term associations as elements of the vector
    space model (see the sketch below).
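The papers' exact weighting scheme is not spelled out here; the following is a minimal sketch under that description, scaling term weights by λ and association weights by 1 − λ and taking a cosine over the combined elements (all names and values are illustrative):

```python
import math

def graph_vector(terms, assocs, lam):
    """Combine term weights (scaled by lam) and term-term association
    weights (scaled by 1 - lam) into one sparse vector."""
    v = {("t", t): lam * w for t, w in terms.items()}
    v.update({("a", a): (1 - lam) * w for a, w in assocs.items()})
    return v

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

d1 = graph_vector({"travel": 2.0, "japan": 1.0},
                  {("travel", "japan"): 1.0}, lam=0.6)
d2 = graph_vector({"travel": 1.0, "asia": 1.0},
                  {("travel", "asia"): 1.0}, lam=0.6)
print(cosine(d1, d2))
```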

52
Query as graph
  • Sometimes the user's query is vague
  • The system represents the user's query as a
    query graph
  • The user can interactively and explicitly
    clarify his/her query by looking at and editing
    the query graph
  • The system implicitly edits the query graph
    according to the user's choices of documents

53
User interface of the system
54
A query graph
[Figure: a query graph with the nodes guide,
transport, travel, train, Asia, and Japan; linked
nodes are ANDed, while nodes with no link are ORed.]
55
Interactive query graph refinement
  1. User inputs sentences as a query; the system
     displays the initial query graph made from the
     inputs
  2. User edits the query graph by removing and/or
     adding nodes and/or links
  3. System measures the relevance score of each
     document against the modified query graph
  4. System ranks search results in descending score
     order and displays their titles in the user
     interface
  5. User selects documents relevant to his/her
     needs
  6. System refines the query graph based on the
     documents selected by the user and the old
     query graph
  7. System displays the new query graph to the user
  8. Repeat the previous steps until the user is
     satisfied with the search results

56
Details of step 6: making a new query graph
57
Digest Graph
  • The output of the search engine is presented via
    a graphical representation:
  • a subgraph of the subject graph for the entire
    document.
  • The subgraph is generated on the fly in response
    to the current query.
  • Users can intuitively understand the subject of
    each document from the terms and the term-term
    associations in the graph.