Fast Indexes and Algorithms For Set Similarity Selection Queries


Provided by: MariosHadj3
Learn more at: http://www.cs.utah.edu

Transcript and Presenter's Notes

Title: Fast Indexes and Algorithms For Set Similarity Selection Queries


1
Fast Indexes and Algorithms For Set Similarity
Selection Queries
  • M. Hadjieleftheriou
  • Chandel
  • N. Koudas
  • D. Srivastava

2
Strings as sets
  • s1 = "Main St. Maine"
  • Words: {Main, St., Maine}
  • 3-grams: {"Mai", "ain", "in ", "n S", " St", "St.", "t. ", ...}
  • s2 = "Main St. Main"
  • Words: {Main, St., Main}
  • How similar are s1 and s2?
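
The two decompositions above can be sketched in a few lines. This is only an illustration: the helper names `word_tokens` and `qgrams` are hypothetical, and it assumes 3-grams are taken over the raw string with spaces included.

```python
def word_tokens(s):
    """Split on whitespace; each word becomes one token."""
    return s.split()


def qgrams(s, q=3):
    """All overlapping substrings of length q, spaces included."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


s1 = "Main St. Maine"
s2 = "Main St. Main"

words1 = word_tokens(s1)
words2 = word_tokens(s2)
grams1 = qgrams(s1)

# Tokens the two strings share when viewed as sets of words.
shared_words = set(words1) & set(words2)

print(words1)
print(grams1[:4])
print(sorted(shared_words))
```

Viewed as sets of words, s1 and s2 share "Main" and "St." and differ only in "Maine", which is what makes a weighted comparison interesting.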

3
TF/IDF weighted similarity
  • Inverse Document Frequency (idf)
  • Main is common
  • Maine is not
  • $\mathrm{idf}(t) = \log_2(1 + N / \mathrm{df}(t))$
  • Term Frequency (tf)
  • Main appears twice in s2
  • Similarity
  • Inner Product
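
As a small illustration of the idf weighting, the sketch below computes idf(t) = log2(1 + N / df(t)) over word tokens. The four-string corpus and the function name `idf_table` are made up for the example.

```python
import math
from collections import Counter


def idf_table(strings):
    """Return {token: log2(1 + N / df(token))} over whitespace tokens."""
    n = len(strings)
    df = Counter()
    for s in strings:
        # A token counts once per string, however often it repeats in it.
        df.update(set(s.split()))
    return {t: math.log2(1 + n / count) for t, count in df.items()}


# Hypothetical corpus: "Main" occurs in three strings, "Maine" in one.
corpus = ["Main St. Maine", "Main St. Main", "Main Ave.", "Elm St."]
idf = idf_table(corpus)

print(round(idf["Main"], 3), round(idf["Maine"], 3))
```

The rare token "Maine" ends up with a higher weight than the common token "Main", matching the intuition on the slide.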

4
Is TF important?
  • Information retrieval
  • Given a query string, retrieve relevant documents
  • Relational databases
  • Given a query string, retrieve relevant strings
  • In practice TF is small in many applications

5
IDF similarity
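  • Query $q = \{t_1, \ldots, t_n\}$
  • Set $s = \{r_1, \ldots, r_m\}$
  • Length $\mathrm{len}(s) = \left(\sum_{t \in s} \mathrm{idf}(t)^2\right)^{1/2}$
  • $I(q, s) = \sum_{t \in s \cap q} \mathrm{idf}(t)^2 / (\mathrm{len}(s)\,\mathrm{len}(q))$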
  • IDF is as good as TF/IDF in practice!
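
A minimal sketch of the IDF similarity, assuming len(s) is the square root of the summed squared idf of the distinct tokens in s, and I(q, s) is the summed squared idf of the shared tokens divided by len(s) len(q). The idf weights below are invented for illustration.

```python
import math


def length(tokens, idf):
    """len(s): Euclidean norm of the idf weights of the distinct tokens."""
    return math.sqrt(sum(idf[t] ** 2 for t in set(tokens)))


def idf_similarity(q, s, idf):
    """I(q, s): squared idf of shared tokens over len(q) * len(s)."""
    shared = set(q) & set(s)
    return sum(idf[t] ** 2 for t in shared) / (length(q, idf) * length(s, idf))


# Hypothetical weights: "Maine" is rare, "Main" and "St." are common.
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0}
q = ["Main", "St.", "Main"]    # term frequency is ignored: treated as a set
s = ["Main", "St.", "Maine"]

print(idf_similarity(q, s, idf))
```

Because "Maine" carries most of the weight of s, missing it costs the query far more than missing a common token would.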

6
How can I build an index?
  • Let $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$
  • Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$
  • So:
  • Decompose strings into tokens
  • Compute the idf of each token
  • Create one inverted list per token
  • Sort lists by string id: do a merge join
  • Sort lists by w: run TA/NRA

7
Example: Sort by id

8
Example: Sort by w
  • NRA
  • Round robin list accesses
  • Main memory hash table
  • Computes lower and upper bounds per entry

9
Semantic properties of IDF
  • Order Preservation
  • For all $t_1 \neq t_2$: if $w(t_1, s) < w(t_1, r)$, then
    $w(t_2, s) < w(t_2, r)$
  • Length Boundedness
  • Query $q$, set $s$, threshold $\tau$
  • $I(q, s) \geq \tau \Rightarrow \tau\,\mathrm{len}(q) \leq \mathrm{len}(s) \leq \mathrm{len}(q) / \tau$
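
To illustrate Length Boundedness, the sketch below assumes a list sorted by increasing len(s) (the same order as decreasing w) and writes the threshold as `tau`. It uses binary search to find the only slice that can hold answers, namely lengths between tau * len(q) and len(q) / tau. The function name and the numbers are hypothetical.

```python
import bisect


def length_window(sorted_lengths, qlen, tau):
    """Return (lo, hi) so that sorted_lengths[lo:hi] lies in
    [tau * qlen, qlen / tau]; entries outside cannot reach the threshold."""
    lo = bisect.bisect_left(sorted_lengths, tau * qlen)
    hi = bisect.bisect_right(sorted_lengths, qlen / tau)
    return lo, hi


# Hypothetical lengths of the sets stored in one inverted list.
lengths = [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
lo, hi = length_window(lengths, qlen=4.0, tau=0.5)
print(lengths[lo:hi])
```

With len(q) = 4 and tau = 0.5 only lengths in [2, 8] survive, so the first and last entries are never read.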

10
Improved NRA
  • Order Preservation determines whether a given set
    appears in a list or not
  • In list $t_i$: encounter $s_1$, then $s_2$
  • In list $t_k$: encounter $s_2$ first, so $s_1$ cannot appear in $t_k$
  • Length Boundedness restricts the search to a
    small portion of each list

11
Something surprising
  • Lemma: NRA can read arbitrarily more elements than
    iNRA
  • Lemma: NRA can read arbitrarily more elements than
    any algorithm that uses the Length Boundedness
    property

12
Any other strategies?
  • NRA style is breadth-first
  • Try depth-first
  • Sort query lists in decreasing idf order
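  • Let $q = \{t_1, \ldots, t_n\}$ and $\mathrm{idf}(t_1) \geq \mathrm{idf}(t_2) \geq \cdots \geq
    \mathrm{idf}(t_n)$
  • Let $\lambda_i$ be the maximum length a set $s$ in list $t_i$ can
    have s.t. $I(q, s) \geq \tau$, assuming that $s$ exists in
    all lists $t_k$, $k \geq i$
  • $\lambda_i = \sum_{i \leq k \leq n} \mathrm{idf}(t_k)^2 / (\tau\,\mathrm{len}(q))$
  • $\lambda_i$ is a natural cutoff point
  • $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$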
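
A short sketch of the cutoff computation, writing the threshold as `tau` and the per-list cutoff as `lambda_i`, and reading the formula above as the sum of squared idf over lists i..n divided by tau * len(q). The function name `cutoffs` and the idf values are hypothetical.

```python
import math


def cutoffs(query_idfs, tau):
    """Return [lambda_1, ..., lambda_n] for idf weights taken in
    decreasing order: lambda_i = sum_{k >= i} idf_k^2 / (tau * len(q))."""
    idfs = sorted(query_idfs, reverse=True)
    qlen = math.sqrt(sum(x * x for x in idfs))
    result = []
    suffix = 0.0
    # Accumulate the suffix sums from the last list backwards.
    for x in reversed(idfs):
        suffix += x * x
        result.append(suffix / (tau * qlen))
    result.reverse()
    return result


# Hypothetical query with two tokens of idf 4 and 3, so len(q) = 5.
lams = cutoffs([3.0, 4.0], tau=0.5)
print(lams)
```

Note that the first cutoff equals len(q) / tau, the Length Boundedness upper limit, and later cutoffs can only shrink.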

13
Shortest-First
  • Sort $q = \{t_1, \ldots, t_n\}$ in decreasing idf order
  • Let candidate set $C = \emptyset$
  • For $1 \leq i \leq n$
  • Skip to first entry with $\mathrm{len}(s) \geq \tau\,\mathrm{len}(q)$
  • Compute $\lambda_i$
  • Let $\lambda_i = \min(\lambda_i, \mathrm{len}(q) / \tau)$
  • Repeat
  • $s$ = pop next element from $t_i$
  • Maintain lower/upper bounds of entries in $C$
  • Until $\mathrm{len}(s) > \max(\text{max len } C, \lambda_i)$
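
The steps above can be sketched roughly as follows. This is one reading of the slide, not the authors' code: it assumes each inverted list holds (len(s), id) pairs sorted by increasing length, writes the threshold as `tau`, keeps the partial score as the lower bound, and prunes a candidate once its upper bound falls below `tau`. The function name and the data are hypothetical, and floating-point ties at the boundaries are ignored.

```python
import math
from bisect import bisect_left


def shortest_first(query, lists, idf, tau):
    """Return {id: I(q, s)} for every s with I(q, s) >= tau."""
    tokens = sorted(set(query), key=lambda t: idf[t], reverse=True)
    qlen = math.sqrt(sum(idf[t] ** 2 for t in tokens))
    # suffix[i] = sum of idf^2 over tokens[i:]
    suffix = [0.0] * (len(tokens) + 1)
    for i in range(len(tokens) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + idf[tokens[i]] ** 2

    cand = {}  # id -> [len(s), partial score (lower bound)]
    for i, t in enumerate(tokens):
        postings = lists.get(t, [])
        lam = min(suffix[i] / (tau * qlen), qlen / tau)
        # Skip to the first entry with len(s) >= tau * len(q).
        pos = bisect_left(postings, (tau * qlen,))
        max_len_c = max((c[0] for c in cand.values()), default=0.0)
        while pos < len(postings):
            slen, sid = postings[pos]
            if slen > max(max_len_c, lam):
                break  # no new or existing candidate can appear further down
            contrib = idf[t] ** 2 / (slen * qlen)
            if sid in cand:
                cand[sid][1] += contrib
            elif slen <= lam:
                cand[sid] = [slen, contrib]
            pos += 1
        # Upper bound: partial score plus the best case over remaining lists.
        hopeless = [sid for sid, (slen, low) in cand.items()
                    if low + suffix[i + 1] / (slen * qlen) < tau]
        for sid in hopeless:
            del cand[sid]
    return {sid: score for sid, (_, score) in cand.items() if score >= tau}


# Hypothetical data and weights.
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Elm": 2.0}
sets = {1: {"Main", "St.", "Maine"}, 2: {"Main", "St."}, 3: {"Elm", "St."}}
lists = {}
for sid, toks in sets.items():
    slen = math.sqrt(sum(idf[t] ** 2 for t in toks))
    for t in toks:
        lists.setdefault(t, []).append((slen, sid))
for postings in lists.values():
    postings.sort()

result = shortest_first({"Main", "Maine"}, lists, idf, tau=0.6)
print(result)
```

In this run the "Maine" list is read first and yields the only candidate; in the "Main" list the shorter set 2 is read but never admitted, because its length already exceeds the cutoff for that list.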

14
Comparison with NRA
  • Lemma: Let $q = \{t_1, \ldots, t_n\}$ and let $d$ be the maximum depth
    SF descends to over all lists. In the worst case
    iNRA will read $(d - 1)(n - 1)$ more elements than
    SF
  • But surprisingly

15
A hybrid strategy
  • Run iNRA normally
  • Use $\lambda_i$ and max len C to stop reading from a
    particular list
  • This guarantees that iNRA stops with or before SF
  • Drawback of NRA variants
  • Very high bookkeeping cost compared to SF

16
Experiments
  • DBLP, IMDB and YellowPages datasets
  • Actors, movies, authors, businesses etc.
  • Vary threshold, query size, query strings and
    mistakes
  • Test wall-clock time, pruning power
  • Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid,
    Sort-by-id, Improved SQL based

17
Wall-clock time vs. Threshold

18
Wall-clock time vs. Query size
  • [Plot legend: TA, SF, NRA, Sort-by-id, iTA]

19
Space

20
Conclusion
  • Proposed a simplified TF/IDF measure
  • Identified strong monotonicity properties
  • Used the properties to design efficient
    algorithms
  • SF works best overall in practice
  • Achieves sub-second answers in most practical
    cases

21
Q&A

22
Pruning power vs. Threshold

23
Pruning power vs. Query size
  • [Plot legend: iTA, TA, NRA]