VECTOR SPACE MODEL - PowerPoint PPT Presentation
Provided by: Arj12 (https://ranger.uta.edu)
Transcript and Presenter's Notes

1
VECTOR SPACE MODEL
  • Its Applications and Implementations in Information Retrieval
  • Lecture 3

2
Slides for Lecture 3
  • The Vector Space Model (VSM) is a way of
    representing documents through the words that
    they contain
  • It is a standard technique in Information
    Retrieval
  • The VSM allows decisions to be made about which
    documents are similar to each other and to
    keyword queries

3
Slides for Lecture 3
  • The Vector Space Model
  • Documents and queries are both vectors:
  • Di = (w_i1, w_i2, ..., w_it)
  • each w_ij is the weight of term j in document i
  • similarity is measured as the cosine of the angle between the two vectors
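The vector representation above can be sketched in a few lines. This is a minimal illustration with a made-up two-document corpus; here the weight w_ij is simply the raw count of term j in document i (tf-idf weighting comes later in the lecture).

```python
# Minimal sketch: documents as term-weight vectors (toy corpus, raw counts).

def build_vocabulary(docs):
    """Collect the sorted set of all terms across the corpus."""
    return sorted({term for doc in docs for term in doc.split()})

def to_vector(doc, vocab):
    """Map a document to a vector; position j holds the weight of term j."""
    words = doc.split()
    return [words.count(term) for term in vocab]

docs = ["information retrieval model", "vector space model model"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

Position j of every vector corresponds to the same term, so vectors from different documents are directly comparable.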

4
Slides for lecture 3
  • Documents and queries are represented as vectors
  • Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t

5
Slides for Lecture 3
  • Cosine Similarity measure
  • sim(d, q) = cos θ
  • (since x · y = |x| |y| cos θ)
  • sim(d, q) = Σ_{j=1..m} w_ij q_j / ( sqrt(Σ_{j=1..m} w_ij²) · sqrt(Σ_{j=1..m} q_j²) )
  • Cosine is a normalized dot product
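The formula above translates directly into code. A minimal sketch (vector names and values are illustrative):

```python
import math

def cosine(d, q):
    """cos(theta) = (d . q) / (|d| |q|); returns 0.0 if either vector is all zeros."""
    dot = sum(w * v for w, v in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(v * v for v in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0
```

Identical vectors give cosine 1, orthogonal vectors (no shared terms) give 0, matching the properties stated on the next slides.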

6
Slides for lecture 3
  • TF-IDF Normalization
  • Normalize the term weights (so longer documents
    are not unfairly given more weight)
  • The longer the document, the more likely it is
    for a given term to appear in it, and the more
    often a given term is likely to appear in it. So,
    we want to reduce the importance attached to a
    term appearing in a document based on the length
    of the document
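The effect of length normalization can be seen numerically. In this made-up example, a "long" document is the short one repeated twice, so every term count doubles: the raw dot product with the query doubles, but the cosine is unchanged.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Normalized dot product."""
    na = math.sqrt(dot(a, a))
    nb = math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

query = [1, 1, 0]
short_doc = [2, 1, 0]
long_doc = [4, 2, 0]  # same content duplicated: every raw count doubles
```

The raw score unfairly favors the long document; the cosine treats both documents identically, which is exactly what the slide argues for.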

7
Slides for lecture 3
  • Cosine is the normalized dot product
  • Documents are ranked in decreasing order of cosine value
  • sim(d, q) = 1 when d and q are identical (point in the same direction)
  • sim(d, q) = 0 when d and q share no terms

8
Slides for lecture 3
  • A user enters a query
  • The query is compared to all documents using a similarity measure
  • The similarity (vector distance) between the query and each document is used to rank the retrieved pages
  • The user is shown the documents in decreasing order of similarity to the query
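The retrieval loop described above can be sketched as follows (document ids and vectors are made up for illustration):

```python
import math

def cosine(d, q):
    """Normalized dot product between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    nd = math.sqrt(sum(x * x for x in d))
    nq = math.sqrt(sum(y * y for y in q))
    return dot / (nd * nq) if nd and nq else 0.0

def rank(docs, query):
    """Score every document against the query; return (id, score) pairs
    in decreasing order of similarity."""
    scored = [(doc_id, cosine(vec, query)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {"d1": [1, 1, 0], "d2": [0, 1, 1], "d3": [1, 0, 0]}
ranking = rank(docs, [1, 0, 0])
```

Note this naive version scores every document; the inverted-list approach later in the lecture avoids touching documents that share no terms with the query.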

9
Slides for lecture 3
  • How to weight terms?
  • The higher a term's weight, the higher its impact on the cosine
  • What terms are important?
  • - if a term is present in the query, then its presence in the document is relevant to the query
  • - infrequent in other documents
  • - frequent in document A
  • So the cosine needs to be modified in this respect

10
Slides for lecture 3
  • Modeling and Implementation
  • Example: suppose the user fires a query specifying three particular terms T1, T2, T3
  • Query q = (T1, T2, T3)
  • let there be n documents with a total of m terms
  • Now for the implementation

11
Slides for Lecture 3
  • Document Ranking
  • A user enters a query
  • The query is compared to all documents using a similarity measure
  • The user is shown the documents in decreasing order of similarity to the query

12
Slides for lecture 3
  • Example: one inverted list per term; each column lists the documents containing that term

        T1    T2    T3   ...  Tm
        d1    d2    d1   ...  d1
        d2    d4    d7   ...  d7
        d3    d8    d9   ...  d10
        d9    d10   d6   ...  d11
        ...   ...   ...  ...  ...
        d7    d89             d65
        d76

  • we can arrange the documents in descending order of the corresponding score that is computed from tf-idf

13
Slides for lecture 3
tf-idf measure = term frequency (tf) × inverse document frequency (idf)
14
Slides for lecture 3
  • For a multi-term query, we start with the smallest of the document lists corresponding to the query terms T1, T2, T3
  • FA and TA algorithms are used for merging the lists
  • FA - Fagin's Algorithm
  • TA - Threshold Algorithm

        T1    T2    T3
        d1    d2    d1
        d2    d3    d3
        d3    d4    d2
        d5    d2    d4
        ...   ...   ...
        d4
        ...   d6
        d7
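A simplified sketch of the Threshold Algorithm (TA) over per-term score lists. The lists and scores here are made up; each list is assumed sorted by descending score, and the total score of a document is the sum of its per-term scores. Real TA interleaves sorted and random access exactly as below, stopping once the top-k scores reach the threshold formed from the last scores seen under sorted access.

```python
def threshold_algorithm(lists, k=1):
    """lists: one [(doc, score), ...] list per term, sorted by descending score.
    Returns the top-k (doc, total_score) pairs by summed score across terms."""
    lookup = [dict(lst) for lst in lists]  # random-access tables: doc -> score
    seen = {}                              # doc -> total score across all terms
    max_depth = max(len(lst) for lst in lists)
    top = []
    for depth in range(max_depth):
        last_seen = []
        for lst in lists:
            # one round of sorted access on each list (re-read last entry if exhausted)
            doc, score = lst[min(depth, len(lst) - 1)]
            last_seen.append(score)
            if doc not in seen:
                # random access: fetch this doc's score in every list
                seen[doc] = sum(tbl.get(doc, 0.0) for tbl in lookup)
        threshold = sum(last_seen)  # no unseen doc can beat this total
        top = sorted(seen.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if len(top) == k and top[-1][1] >= threshold:
            break  # early termination: the answer cannot change
    return top

lists = [
    [("d1", 9.0), ("d2", 7.0), ("d3", 4.0)],   # list for T1
    [("d2", 8.0), ("d3", 6.0), ("d4", 2.0)],   # list for T2
    [("d1", 5.0), ("d3", 3.0), ("d2", 1.0)],   # list for T3
]
result = threshold_algorithm(lists, k=1)
```

The point of TA over a naive merge is the early stop: here the winner is certified after two rounds of sorted access, without scanning the lists to the end.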

15
Slides for lecture 3
  • After the intersected list for the multi-term query has been found, take the tf-idf score of each document for each term, add them up, and arrange the documents in decreasing order of the total

              T1        T2        T3
           (tf-idf)  (tf-idf)  (tf-idf)
      d1     106       106        0     (total 212 - this will be ranked higher)
      d2       4         4        4     (total 12)
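The aggregation step above is a one-liner per document; this sketch reproduces the slide's illustrative scores:

```python
# Per-term tf-idf scores for each candidate document (values from the example).
per_term_scores = {
    "d1": [106, 106, 0],  # scores for T1, T2, T3
    "d2": [4, 4, 4],
}

# Sum the per-term scores and rank documents by the total, descending.
totals = {doc: sum(scores) for doc, scores in per_term_scores.items()}
ranked = sorted(totals, key=totals.get, reverse=True)
```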

16
Slides for lecture 3
  • DATABASE CONTEXT
  • All distinct values are words or terms
  • A tuple is taken as a document
  • Important Points
  • The vector space model does not force any broad conditions
  • No search engine uses the pure vector space model
  • It is implemented, but with some constraints

17
Slides for lecture 3
  • Advantages
  • - Ranked retrieval
  • - Terms are weighted according to importance
  • Disadvantages
  • - Terms are taken as independent
  • - Weighting is not very formal

18
Slides for Lecture 3
  • Thank you
  • Slides Made By Arjun Saraswat

19
Slides for lecture 3
  • References
  • www.scit.wlv.ac.uk/jphb/cp4040/mtnotes
  • http://www.cs.wisc.edu/dbbook/openAccess/thirdEdition/slides
  • http://krakow.lti.cs.cmu.edu
  • http://www.cs.utexas.edu/users/mooney/ir-course/slides
  • http://db.uwaterloo.ca/tozsu/courses