ISYS 300 Document Similarity

Provided by: xlin2

Transcript and Presenter's Notes

1
ISYS 300 Document Similarity
2
Similarity Metrics
  • Criteria for a Similarity Metric
  • A good metric has three defining properties:
  • Its values are non-negative
  • It is symmetric
  • It satisfies the triangle inequality
  • AC ≤ AB + BC
  • Many similarity metrics have been proposed.
  • Many applications for similarity metrics
  • Retrieval is only one of many applications
  • Can also be useful for summarization
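The three properties listed on this slide can be checked mechanically for a candidate distance function on a small sample of points. The following is a minimal sketch, not part of the original slides: the names `check_metric_properties`, `city_block`, and `squared_euclidean` and the sample points are all hypothetical, chosen only for illustration.

```python
from itertools import product


def check_metric_properties(points, dist, tol=1e-9):
    """Report which of the three properties hold for `dist` on `points`."""
    pairs = list(product(points, repeat=2))
    triples = product(points, repeat=3)
    return {
        # Values are non-negative.
        "non_negative": all(dist(a, b) >= -tol for a, b in pairs),
        # dist(A, B) equals dist(B, A).
        "symmetric": all(abs(dist(a, b) - dist(b, a)) <= tol for a, b in pairs),
        # AC <= AB + BC for every triple of points.
        "triangle": all(
            dist(a, c) <= dist(a, b) + dist(b, c) + tol for a, b, c in triples
        ),
    }


def city_block(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def squared_euclidean(u, v):
    """Sum of squared coordinate differences (not a true metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


sample_points = [(0,), (1,), (2,)]  # hypothetical one-dimensional sample
print(check_metric_properties(sample_points, city_block))
print(check_metric_properties(sample_points, squared_euclidean))
```

On this sample the squared Euclidean distance fails the triangle inequality (the distance from 0 to 2 is 4, but going through 1 costs only 1 + 1), which shows why the third property is a real constraint rather than a formality.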

3
Cosine Measure
  • The Cosine Coefficient is a common way to measure
    similarity
  • This is the same as
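For reference, the cosine coefficient is conventionally defined as the dot product of two document vectors divided by the product of their lengths. The sketch below follows that standard definition and assumes each document is an equal-length list of term weights; the function name `cosine_coefficient` and the zero-vector convention are illustrative assumptions, not taken from the slides.

```python
import math


def cosine_coefficient(d1, d2):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(d1) != len(d2):
        raise ValueError("document vectors must have the same length")
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: a document with no terms is similar to nothing.
        return 0.0
    return dot / (norm1 * norm2)


# Two hypothetical documents over a three-term vocabulary.
print(cosine_coefficient([1, 1, 0], [1, 0, 1]))
```

Because the result depends only on the angle between the vectors, scaling a document's weights (for example, doubling every term count) leaves its cosine similarity to other documents unchanged.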

4
Similarity Based on Word Overlaps
  • Another natural measure is the number of shared
    words
  • Assume each tik is either 0 or 1 (term k is absent
    from or present in document i).
  • w = the number of times t1k = 1, t2k = 1.
  • x = the number of times t1k = 1, t2k = 0.
  • y = the number of times t1k = 0, t2k = 1.
  • z = the number of times t1k = 0, t2k = 0.
  • n1 = number of terms in document 1
  • n2 = number of terms in document 2
  • D1's terms only: n1 = w + x (the number of times
    t1k = 1)
  • D2's terms only: n2 = w + y (the number of times
    t2k = 1)
  • mean: mean = (n1 + n2)/2
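The counts defined on this slide can be computed in one pass over two binary term vectors. This is a minimal sketch, assuming both documents are represented as equal-length lists of 0/1 values over the same vocabulary; the function name `overlap_counts` and the sample vectors are hypothetical.

```python
def overlap_counts(t1, t2):
    """Count the four presence/absence combinations for two binary vectors."""
    if len(t1) != len(t2):
        raise ValueError("term vectors must have the same length")
    w = x = y = z = 0
    for a, b in zip(t1, t2):
        if a == 1 and b == 1:
            w += 1  # term in both documents
        elif a == 1 and b == 0:
            x += 1  # term only in document 1
        elif a == 0 and b == 1:
            y += 1  # term only in document 2
        elif a == 0 and b == 0:
            z += 1  # term in neither document
        else:
            raise ValueError("term values must be 0 or 1")
    n1 = w + x  # number of terms in document 1
    n2 = w + y  # number of terms in document 2
    return {"w": w, "x": x, "y": y, "z": z,
            "n1": n1, "n2": n2, "mean": (n1 + n2) / 2}


# Hypothetical documents over a six-term vocabulary.
print(overlap_counts([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
```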

5
Word Overlaps (Cont'd)
  • Sameness count: sc = (w + z)/(n1 + n2)
  • Difference count: dc = (x + y)/(n1 + n2)
  • Rectangular distance: rd = MAX(n1, n2)
  • Conditional probability: cp = MIN(n1, n2)

6
Metrics Based on Word Overlaps
  • Dice's Coefficient
  • Dice(D1, D2) = 2w/(n1 + n2)
  • Jaccard Coefficient
  • Jaccard(D1, D2) = w/(N - z), where N = w + x + y + z
  • = w/(n1 + n2 - w)
  • See Korfhage (pp. 128-129) for many more
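Both coefficients reduce to simple set operations when each document is treated as a set of distinct words: w is the size of the intersection, and n1 and n2 are the set sizes. The sketch below assumes that representation; the function names, the convention of returning 0.0 for two empty documents, and the sample words are illustrative assumptions.

```python
def dice(d1, d2):
    """Dice coefficient 2w/(n1 + n2) for two sets of words."""
    total = len(d1) + len(d2)
    if total == 0:
        return 0.0  # assumption: two empty documents score 0
    return 2 * len(d1 & d2) / total


def jaccard(d1, d2):
    """Jaccard coefficient w/(n1 + n2 - w) for two sets of words."""
    w = len(d1 & d2)
    denominator = len(d1) + len(d2) - w  # size of the union
    if denominator == 0:
        return 0.0  # assumption: two empty documents score 0
    return w / denominator


# Hypothetical documents: w = 2, n1 = 3, n2 = 4.
doc1 = {"information", "retrieval", "system"}
doc2 = {"information", "retrieval", "model", "theory"}
print(dice(doc1, doc2), jaccard(doc1, doc2))
```

For the same pair of documents the Dice value is never smaller than the Jaccard value, since Dice counts the shared words twice in the numerator while Jaccard counts them once in the denominator.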

7
Lp Metric
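Can define a document space and distances
within that space. In particular, for two documents D1 and D2:
Lp(D1, D2) = (Σk |t1k - t2k|^p)^(1/p), summed over terms k = 1, ..., T
See Korfhage p. 132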
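The Lp family is conventionally defined as the p-th root of the summed p-th powers of the coordinate differences, with p = 1 giving the city-block distance and p = 2 the Euclidean distance. The following is a minimal sketch under that standard definition, assuming documents are equal-length lists of term weights; the function name `lp_distance` is hypothetical.

```python
def lp_distance(d1, d2, p=2):
    """Lp distance between two equal-length term-weight vectors."""
    if len(d1) != len(d2):
        raise ValueError("document vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for the result to be a metric")
    return sum(abs(a - b) ** p for a, b in zip(d1, d2)) ** (1 / p)


# Hypothetical two-term documents.
print(lp_distance([0, 0], [3, 4], p=1))  # city-block distance
print(lp_distance([0, 0], [3, 4], p=2))  # Euclidean distance
```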
8
Similarity Matrix
  • Pair-wise coupling of similarities among a group
    of documents
      S11 S12 S13 S14 S15 S16 S17 S18
      S21 S22 S23 S24 S25 S26 S27 S28
      S31 S32 S33 S34 S35 S36 S37 S38
      S41 S42 S43 S44 S45 S46 S47 S48
      S51 S52 S53 S54 S55 S56 S57 S58
      S61 S62 S63 S64 S65 S66 S67 S68
      S71 S72 S73 S74 S75 S76 S77 S78
      S81 S82 S83 S84 S85 S86 S87 S88
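A similarity matrix of this shape can be built by applying any pairwise measure to every pair of documents, so that entry Sij holds the similarity of document i to document j. The sketch below uses a set-based Jaccard coefficient as the measure purely for illustration; the function names and the three sample documents are hypothetical.

```python
def jaccard(d1, d2):
    """Jaccard coefficient for two sets of words (0.0 if both are empty)."""
    union = d1 | d2
    return len(d1 & d2) / len(union) if union else 0.0


def similarity_matrix(docs, sim):
    """Return the matrix S where S[i][j] = sim(docs[i], docs[j])."""
    return [[sim(a, b) for b in docs] for a in docs]


# Hypothetical group of three documents.
docs = [{"cat", "dog"}, {"cat", "fish"}, {"bird"}]
for row in similarity_matrix(docs, jaccard):
    print(row)
```

When the measure is symmetric, the matrix is symmetric as well, so only the entries on or above the diagonal need to be computed for a large collection.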