Classical Models
1
Lecture 3: Document Models for IR
  • Classical Models
  • Latent Semantic Indexing Model
  • A Structural Model

2
Logical View of a Document
  • A text document may be represented for computer
    analysis in different formats
  • Full text
  • Index Terms
  • Structures

3
The Role of the Indexer
  • The huge size of the Internet makes it
    unrealistic to use full text for information
    retrieval that requires quick responses
  • The indexer simplifies the logical view of a
    document
  • Indexing method dictates document storage and
    retrieval algorithms
  • Automation of indexing methods is necessary for
    information retrieval over the Internet.

4
Possible drawbacks
  • Summarizing a document through a set of index
    terms may lead to poor performance:
  • many unrelated documents may be included in the
    answer set for a query
  • relevant documents which are not indexed by any
    of the query keywords cannot be retrieved

5
A Formal Description of IR Models
  • A quadruple [D, Q, F, R(q,d)]
  • D (documents) is a set composed of logical views
    (or representations) of the documents in the
    collection.
  • Q (queries) is a set composed of logical views
    (or representations) for user information needs.
  • F (Framework) is a framework for modeling
    document representations, queries, and their
    relationships
  • R(q,d) is a ranking function which associates a
    real number with a query q and a document
    representation d. Such ranking defines an
    ordering among the documents with regard to the
    query q .

6
Classic Models
  • Boolean Model
  • Vector Space Model
  • Probabilistic Model

7
Boolean Model
  • Document representation: full text or a set of
    keywords (contained in the text or not)
  • Query representation: logical operators, query
    terms, query expressions
  • Searching: use an inverted file and set
    operations to construct the result set

8
Boolean Searching
  • Queries
  • A AND B AND (C OR D)
  • Break the collection into two unordered sets:
  • documents that match the query
  • documents that don't
  • Return all that match the query (see the sketch
    below)
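As a minimal sketch of Boolean searching with an inverted file and set operations (the collection below reuses three of the lecture's baking titles; the `postings` helper is ours, not from the slides):

```python
# Minimal sketch of Boolean retrieval over an inverted file.
docs = {
    1: "how to bake bread without recipes",
    2: "the classic art of viennese pastry",
    3: "numerical recipes the art of scientific computing",
}

# Build the inverted file: term -> set of document ids.
inverted = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted.setdefault(term, set()).add(doc_id)

def postings(term):
    """Return the set of documents containing the term."""
    return inverted.get(term, set())

# Evaluate "bread AND (recipes OR pastry)" with set operations:
# intersection for AND, union for OR.
result = postings("bread") & (postings("recipes") | postings("pastry"))
print(sorted(result))  # -> [1]
```

Note that the result is an unordered set: documents either match the query or they don't.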

9
Boolean Model
[Figure: Venn diagram over the terms ka, kb, and kc,
marking the regions (1,1,1), (1,1,0), and (1,0,0).]
The three conjunctive components for the query
q = ka ∧ (kb ∨ ¬kc)
10
Another Example
Consider three documents: about CityU @
http://www.cityu.edu.hk/cityu/about/index.htm,
about FSE @
http://www.cityu.edu.hk/cityu/dpt-acad/fse.htm,
and about CS @
http://www.cs.cityu.edu.hk/content/about/.
The query (degree AND aim) returns only the page
about CityU. The query (degree OR aim) returns all
three.
11
Advantages
  • Simple and clean formalism
  • The answer set is exactly what the user asks
    for.
  • Therefore, users can have complete control if
    they know how to write a Boolean formula of terms
    for the documents they want to find.
  • Easy to implement on computers
  • Popular (most search engines support this model)

12
Disadvantages
  • All results are treated as equals; there is no
    ranking of the documents
  • The set of all documents that satisfy a query
    may still be too large for users to browse
    through, or too small
  • Users may only know what they are looking for in
    a vague way and may not be able to formulate it
    as a Boolean expression
  • Users need to be trained

13
Improvements to the Boolean model
  • Expand and refine query through interactive
    protocols
  • Automation of query formula generation
  • Assign weights to query terms and rank the
    results accordingly

14
Vector Space Model
  • Vector Representation
  • Similarity Measure

15
Vector Space Model
  • Represent stored texts as well as information
    queries by vectors of terms
  • A term is typically a word, a word stem, or a
    phrase associated with the text under
    consideration, and may carry a word weight.
  • Generate terms by a term-weighting system:
  • terms are not equally useful for content
    representation
  • assign high weights to terms deemed important
    and low weights to the less important terms

16
Vector Representation
  • Every document in the collection is represented
    by a vector
  • The distinct terms in the collection are called
    index terms, or the vocabulary

[Figure: a page collection about computers; its
index terms include Computer, XML, Operating System,
Microsoft Office, Unix, and Search Engines.]
17
Term relationships
  • Each term is identified as T_i
  • There is no relationship between terms in the
    vector space; they are treated as orthogonal
  • In reality, in a collection about computers,
    terms like computer and OS are correlated with
    each other.

18
Vector space model
  • A vocabulary of 2 terms forms a 2-D space; each
    document may contain 0, 1, or 2 of the terms. We
    may see the following vectors as
    representations:
  • D1 = <0, 0>
  • D2 = <0, 0.3>
  • D3 = <2, 3>

19
Term/Document matrix
  • t terms form a t-dimensional space
  • Documents and queries can be represented as t-D
    vectors
  • A document can be considered a point in the t-D
    space
  • We may form an n-by-t matrix for n documents
    indexed with t terms.

20
Document-Term Matrix
[Figure: the document-term matrix; rows are the
documents, columns are the terms, and each entry is
the weight of a term in a document.]
21
Deciding the weight
  • Combine two factors in the document-term weight:
  • tf_ij: frequency of term j in document i
  • df_j: document frequency of term j, i.e. the
    number of documents containing term j
  • idf_j: inverse document frequency of term j,
    defined as log2(N / df_j), where N is the number
    of documents in the collection
  • Inverse document frequency is an indication of a
    term's value as a document discriminator.

22
Tf-idf term weight
  • A term that occurs frequently in the document
    but rarely in the rest of the collection gets a
    high weight
  • A typical combined term-importance indicator:
  • w_ij = tf_ij × idf_j = tf_ij × log2(N / df_j)
  • Many other ways to determine the document-term
    weight have been recommended (see the sketch
    below)
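As a small, hedged illustration of this formula, the following Python sketch computes w_ij = tf_ij · log2(N / df_j) from a plain term-frequency matrix (the function name `tfidf_matrix` is ours, not from the lecture):

```python
import math

def tfidf_matrix(term_freq):
    """term_freq[i][j] is tf_ij; returns the matrix of weights w_ij."""
    n_docs = len(term_freq)
    n_terms = len(term_freq[0])
    # df_j: number of documents containing term j
    df = [sum(1 for i in range(n_docs) if term_freq[i][j] > 0)
          for j in range(n_terms)]
    # w_ij = tf_ij * log2(N / df_j); terms appearing nowhere get weight 0
    return [[term_freq[i][j] * math.log2(n_docs / df[j]) if df[j] else 0.0
             for j in range(n_terms)]
            for i in range(n_docs)]
```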

23
An Example of 5 documents
  • D1: How to Bake Bread without Recipes
  • D2: The Classic Art of Viennese Pastry
  • D3: Numerical Recipes: The Art of Scientific
    Computing
  • D4: Breads, Pastries, Pies and Cakes: Quantity
    Baking Recipes
  • D5: Pastry: A Book of Best French Recipe

24
Six Index terms
  • T1: bak(e, ing)
  • T2: recipes
  • T3: bread
  • T4: cake
  • T5: pastr(y, ies)
  • T6: pie

25
An Example of 5 documents
  • D1: How to Bake Bread without Recipes
  • D2: The Classic Art of Viennese Pastry
  • D3: Numerical Recipes: The Art of Scientific
    Computing
  • D4: Breads, Pastries, Pies and Cakes: Quantity
    Baking Recipes
  • D5: Pastry: A Book of Best French Recipe

26
Term Frequency in documents
(i, j) = 1 if document i contains term j once

      T1  T2  T3  T4  T5  T6
  D1   1   1   1   0   0   0
  D2   0   0   0   0   1   0
  D3   0   1   0   0   0   0
  D4   1   1   1   1   1   1
  D5   0   1   0   0   1   0
27
Document frequency of term j

        T1  T2  T3  T4  T5  T6
  df_j   2   4   2   1   3   1
28
Tf-idf weight matrix

      T1        T2        T3        T4      T5        T6
  D1  log(5/2)  log(5/4)  log(5/2)  0       0         0
  D2  0         0         0         0       log(5/3)  0
  D3  0         log(5/4)  0         0       0         0
  D4  log(5/2)  log(5/4)  log(5/2)  log(5)  log(5/3)  log(5)
  D5  0         log(5/4)  0         0       log(5/3)  0

(Here log denotes log2, as defined earlier.)
29
Exercise
  • Write a program that uses the tf-idf term weight
    to form the term/document matrix. Test it on the
    following three documents (a starting sketch is
    given below):
  • http://www.cityu.edu.hk/cityu/about/index.htm
  • http://www.cityu.edu.hk/cityu/dpt-acad/fse.htm
  • http://www.cs.cityu.edu.hk/content/about/
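Fetching and parsing the three web pages is left as part of the exercise; as a hedged starting point, this sketch builds the tf-idf matrix for the five baking documents above (texts reduced to their stemmed index terms), so its output can be checked against the weight matrix on the previous slide:

```python
import math

# The five example documents, reduced to stemmed index terms.
docs = [
    "bake bread recipes",                  # D1
    "pastry",                              # D2
    "recipes",                             # D3
    "bread pastry pie cake bake recipes",  # D4
    "pastry recipes",                      # D5
]
terms = ["bake", "recipes", "bread", "cake", "pastry", "pie"]  # T1..T6

N = len(docs)
tf = [[d.split().count(t) for t in terms] for d in docs]
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(terms))]
w = [[tf[i][j] * math.log2(N / df[j]) for j in range(len(terms))]
     for i in range(N)]

for i, row in enumerate(w, 1):
    print("D%d" % i, ["%.2f" % x for x in row])
# e.g. the D4/T4 entry is log2(5/1) = log2(5), matching the slide.
```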

30
Similarity Measure
  • Determine the similarity between a document D
    and a query Q
  • Many methods can be used to calculate the
    similarity
  • Cosine similarity measure

31
Similarity Measure: cosine

[Figure: the vectors d_j and Q with the angle θ
between them.]

Cosine similarity measures the cosine of the angle
between two vectors:
sim(d_j, Q) = cos θ = (d_j · Q) / (|d_j| |Q|)
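A direct transcription of the formula (the vectors below are illustrative):

```python
import math

def cosine(d, q):
    """Cosine of the angle between term-weight vectors d and q."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

print(cosine([1.0, 0.5, 0.0], [0.8, 0.0, 0.3]))  # in [0, 1] for tf-idf weights
```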
32
Advantages of VSM
  • Term-weighting scheme improves retrieval
    performance
  • Partial matching strategy allows retrieval of
    documents that approximate the query conditions
  • Its cosine ranking formula sorts the documents
    according to their degree of similarity to the
    query

33
Limitations of VSM
  • The underlying assumption is that the terms in
    the vector are orthogonal
  • Several query terms are needed if a
    discriminating ranking is to be achieved, whereas
    only two or three ANDed terms may suffice in a
    Boolean environment to obtain high-quality output
  • It is difficult to explicitly specify synonymous
    and phrasal relationships, whereas these can be
    easily handled in a Boolean environment by means
    of the OR and AND operators or by an extended
    Boolean model

34
Latent Semantic Indexing Model of document/query
  • Map document and query vectors into a
    lower-dimensional space which is associated with
    concepts
  • Information retrieval using a singular value
    decomposition model of latent semantic structure.
    11th ACM SIGIR Conference, pp. 465-480, 1988
  • by G. W. Furnas, S. Deerwester, S. T. Dumais,
    T. K. Landauer, R. A. Harshman, L. A. Streeter,
    and K. E. Lochbaum
  • http://www.cs.utk.edu/lsi/
  • A tutorial:
  • http://www.cs.utk.edu/berry/lsi/node5.html

35
General Approach
  • It is based on the vector space model
  • In the vector space model, terms are treated
    independently
  • Here some relationships between the terms are
    obtained implicitly, magically, through matrix
    analysis
  • This allows reduction of some unnecessary
    information in the document representation.

36
Term-document association matrix
  • Let t be the number of terms and N be the number
    of documents
  • Let M = (M_ij) be the term-document association
    matrix.
  • M_ij may be considered as the weight associated
    with the term-document pair (t_i, d_j)

37
Eigenvalue and Eigenvector
  • Let A be an n×n matrix and x be a nonzero
    n-dimensional vector
  • x is an eigenvector of A if Ax is the same as
    cx, where c is a scale factor; c is the
    corresponding eigenvalue.
  • Example:
  • A = [[2, 1], [1, 2]] and x = (1, 1)^t
  • Then Ax = 3x
  • 3 is an eigenvalue, and x is an eigenvector
  • Question: find another eigenvalue.

38
Example continued
  • Let y^t = (1, -1). Then Ay = (1, -1)^t = y.
  • Therefore, another eigenvalue is 1, and its
    associated eigenvector is y
  • Now let S = diag(3, 1), the diagonal matrix of
    the two eigenvalues
  • Then A(x, y) = (x, y)S
  • Moreover, x^t y = 0

39
Example continued
  • Let K = (x, y)/sqrt(2)
  • Then K^t K = I
  • and A = K S K^t
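The example's decomposition can be checked numerically; a small sketch with NumPy (np.linalg.eigh is the eigensolver for symmetric matrices):

```python
import numpy as np

# The example matrix: eigenvalues 3 and 1, eigenvectors (1,1)^t and (1,-1)^t.
A = np.array([[2.0, 1.0], [1.0, 2.0]])

vals, K = np.linalg.eigh(A)   # eigenvalues (ascending) and orthonormal K
S = np.diag(vals)

print(vals)                               # [1. 3.]
print(np.allclose(K.T @ K, np.eye(2)))    # K^t K = I   -> True
print(np.allclose(K @ S @ K.T, A))        # A = K S K^t -> True
```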

40
A General Theorem from Linear Algebra
  • If A is a symmetric matrix, then
  • there exists a matrix K (with K^t K = I) and a
    diagonal matrix S such that
  • A = K S K^t

41
Application to our case
  • Both MM^t and M^t M are symmetric
  • In addition, their eigenvalues are the same,
    except that the larger matrix has an extra number
    of zeros.

42
Decomposition of the term-document association matrix
  • Decompose M = K S D^t
  • K: the matrix of eigenvectors derived from the
    term-to-term correlation matrix given by MM^t
  • D^t: the corresponding matrix derived from M^t M
  • S: an r×r diagonal matrix of singular values,
    where r is the rank of M

43
Reduced Concept Space
  • Let S_s be the diagonal matrix of the s largest
    singular values of S.
  • Let K_s and D_s^t be the corresponding columns
    of K and rows of D^t.
  • The matrix M_s = K_s S_s D_s^t
  • is closest to M in the least-squares sense
  • NOTE: M_s has the same number of rows (terms)
    and columns (documents) as M, but it may be
    totally different from M (see the sketch below).
  • A numerical example:
  • www.cs.arizona.edu/classes/cs630/spring03/slides/jan-29.ppt
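A hedged sketch of the rank-s approximation with NumPy's SVD (the toy 4-term × 3-document matrix M and the choice s = 2 are illustrative):

```python
import numpy as np

# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
M = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

K, sing, Dt = np.linalg.svd(M, full_matrices=False)   # M = K S D^t
s = 2                                                  # concepts to keep
Ks, Ss, Dst = K[:, :s], np.diag(sing[:s]), Dt[:s, :]
Ms = Ks @ Ss @ Dst             # M_s: the closest rank-s matrix to M

print(np.linalg.norm(M - Ms))  # least-squares (Frobenius) approximation error
```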

44
The relationship of two documents d_i and d_j
  • M_s^t M_s = (K_s S_s D_s^t)^t (K_s S_s D_s^t)
  •           = D_s S_s K_s^t K_s S_s D_s^t
  •           = D_s S_s S_s D_s^t
  •           = (D_s S_s)(D_s S_s)^t
  • The (i, j) element quantifies the relationship
    between documents i and j.

45
The choice of s
  • It should be large enough to allow fitting all
    the structure in the original data
  • It should be small enough to allow filtering out
    noise caused by variation in the choice of terms.

46
Ranking documents according to a query
  • Model the query Q as a pseudo-document in the
    original term-document matrix M
  • The vector M_s^t Q provides ranks of all
    documents with respect to this query Q (see the
    sketch below).
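Continuing the sketch above, the query is folded in as a pseudo-document and every document is scored by M_s^t Q (the query vector is illustrative):

```python
import numpy as np

M = np.array([[1.0, 0.0, 1.0],   # same toy term-document matrix as before
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
K, sing, Dt = np.linalg.svd(M, full_matrices=False)
s = 2
Ms = K[:, :s] @ np.diag(sing[:s]) @ Dt[:s, :]

q = np.array([1.0, 0.0, 0.0, 1.0])  # pseudo-document: query uses terms 1 and 4
scores = Ms.T @ q                   # one relevance score per document
print(np.argsort(scores)[::-1])     # document indices, best match first
```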

47
Advantages
  • When s is small with respect to t and N, it
    provides an efficient indexing model
  • It provides for elimination of noise and removal
    of redundancy
  • It introduces conceptualization based on the
    theory of singular value decomposition

48
Graph model of document/query
  • Improving Effectiveness and Efficiency of Web
    Search by Graph-based Text Representation
  • by Junji Tomita and Yoshihiko Hayashi
  • http://www9.org/final-posters/13/poster13.html
  • Interactive web search by graphical query
    refinement
  • by Junji Tomita and Genichiro Kikui
  • http://www10.org/cdrom/posters/1078.pdf

49
Graph-based text representation model
  • Subject graph:
  • a node represents a term in the text,
  • a link denotes an association between the linked
    terms.
  • The significance of terms and of term-term
    associations is represented as weights assigned
    to them.

50
Assignment of Weights
  • Term-statistics-based weighting schemes:
  • frequencies of terms
  • frequencies of term-term associations
  • multiplied by the inverse document frequency

51
Similarity of documents
  • Subject graph matching:
  • weight terms and term-term associations with λ
    and 1 − λ, for an adequately chosen λ.
  • Then calculate the cosine value of the two
    documents, treating the weighted terms and
    term-term associations as elements of the vector
    space model (see the sketch below).
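The papers' exact weighting scheme is not spelled out here; the following is a minimal sketch under that description, scaling term weights by λ and association weights by 1 − λ and taking a cosine over the combined elements (all names and values are illustrative):

```python
import math

def graph_vector(terms, assocs, lam):
    """Combine term weights (scaled by lam) and term-term association
    weights (scaled by 1 - lam) into one sparse vector."""
    v = {("t", t): lam * w for t, w in terms.items()}
    v.update({("a", a): (1 - lam) * w for a, w in assocs.items()})
    return v

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

d1 = graph_vector({"travel": 2.0, "japan": 1.0},
                  {("travel", "japan"): 1.0}, lam=0.6)
d2 = graph_vector({"travel": 1.0, "asia": 1.0},
                  {("travel", "asia"): 1.0}, lam=0.6)
print(cosine(d1, d2))
```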

52
Query as graph
  • Sometimes the user's query is vague
  • The system represents the user's query as a
    query graph
  • The user can interactively and explicitly
    clarify his/her query by looking at and editing
    the query graph
  • The system implicitly edits the query graph
    according to the user's choices of documents

53
User interface of the system
54
A query graph
[Figure: a query graph with the nodes guide,
transport, travel, train, Asia, and Japan; linked
nodes are ANDed, while nodes with no link are ORed.]
55
Interactive query graph refinement
  1. User inputs sentences as a query; the system
     displays the initial query graph made from the
     inputs
  2. User edits the query graph by removing and/or
     adding nodes and/or links
  3. System measures the relevance score of each
     document against the modified query graph
  4. System ranks search results in descending score
     order and displays their titles in the user
     interface
  5. User selects documents relevant to his/her
     needs
  6. System refines the query graph based on the
     documents selected by the user and the old
     query graph
  7. System displays the new query graph to the user
  8. Repeat the previous steps until the user is
     satisfied with the search results

56
Details of step 6: making a new query graph
57
Digest Graph
  • The output of the search engine is presented via
    a graphical representation:
  • a subgraph of the subject graph for the entire
    document.
  • The subgraph is generated on the fly in response
    to the current query.
  • Users can intuitively understand the subject of
    each document from the terms and the term-term
    associations in the graph.