Fast Indexes and Algorithms For Set Similarity Selection Queries

About This Presentation

Description:

Fast Indexes and Algorithms For Set Similarity Selection Queries. M. Hadjieleftheriou, Chandel, N. Koudas, D. Srivastava. Strings as sets: s1 = Main St. Maine ...

Slides: 24

Provided by: MariosHadj3

Learn more at: http://www.cs.utah.edu

Transcript and Presenter's Notes

Title: Fast Indexes and Algorithms For Set Similarity Selection Queries

1
Fast Indexes and Algorithms For Set Similarity Selection Queries

M. Hadjieleftheriou
A. Chandel
N. Koudas
D. Srivastava

2
Strings as sets

s1 = "Main St. Maine"
Word tokens: "Main", "St.", "Maine"
3-gram tokens: "Mai", "ain", "in ", "n S", " St", "St.", "t. ", ...
s2 = "Main St. Main"
Word tokens: "Main", "St.", "Main"
How similar are s1 and s2?
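
As a minimal sketch of the two decompositions on this slide, the snippet below splits a string into word tokens and into overlapping 3-grams. The function names `word_tokens` and `qgrams` are illustrative, not taken from the talk, and whitespace splitting is an assumption about how words are delimited.

```python
def word_tokens(s):
    """Split a string into whitespace-delimited word tokens."""
    return s.split()


def qgrams(s, q=3):
    """Return every overlapping substring of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


s1 = "Main St. Maine"
s2 = "Main St. Main"

grams1 = qgrams(s1)
grams2 = qgrams(s2)

print(word_tokens(s1))              # word-level view of s1
print(grams1[:7])                   # the 3-grams listed on the slide
print(set(grams1) - set(grams2))    # 3-grams that occur only in s1
```

Viewed as sets of 3-grams, the two strings differ in a single element ("ine"), which is why a set-based similarity measure rates them as close.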

3
TF/IDF weighted similarity

Inverse Document Frequency (idf)
Main is common
Maine is not
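$\mathrm{idf}(t) = \log_2\left(1 + N / \mathrm{df}(t)\right)$, where $N$ is the number of sets and $\mathrm{df}(t)$ is the number of sets containing $t$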
Term Frequency (tf)
Main appears twice in s2
Similarity
Inner Product
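
A small sketch of the idf computation, reading the slide's formula as idf(t) = log2(1 + N / df(t)). The four-entry `collection` and the helper name `idf_table` are made up for illustration only.

```python
import math
from collections import Counter


def idf_table(sets):
    """Map each token to log2(1 + N / df(t)) over a collection of token lists."""
    n = len(sets)
    # df(t): number of sets that contain t at least once
    df = Counter(t for s in sets for t in set(s))
    return {t: math.log2(1 + n / d) for t, d in df.items()}


# Hypothetical toy collection of tokenized strings
collection = [
    ["Main", "St.", "Maine"],
    ["Main", "St.", "Main"],
    ["Main", "Ave."],
    ["Park", "St."],
]

idf = idf_table(collection)
for token in ("Main", "Maine"):
    print(token, round(idf[token], 3))
```

Here "Main" occurs in three of the four sets and "Maine" in only one, so "Maine" receives the larger weight, matching the slide's remark that common tokens matter less.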

4
Is TF important?

Information retrieval
Given a query string retrieve relevant documents
Relational databases
Given a query string retrieve relevant strings
In practice TF is small in many applications

5
IDF similarity

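Query $q = \{t_1, \ldots, t_n\}$
Set $s = \{r_1, \ldots, r_m\}$
Length $\mathrm{len}(s) = \left(\sum_{t \in s} \mathrm{idf}(t)^2\right)^{1/2}$
$I(q, s) = \sum_{t \in s \cap q} \mathrm{idf}(t)^2 / \left(\mathrm{len}(s)\,\mathrm{len}(q)\right)$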
IDF is as good as TF/IDF in practice!
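
The definitions on this slide translate almost directly into code. The sketch below assumes a precomputed token-to-idf mapping; the weights in `idf` are arbitrary illustrative values, and the function names are not from the talk.

```python
import math


def length(s, idf):
    """len(s): square root of the sum of squared idf weights of the distinct tokens in s."""
    return math.sqrt(sum(idf[t] ** 2 for t in set(s)))


def idf_similarity(q, s, idf):
    """I(q, s): sum of idf(t)^2 over shared tokens, divided by len(s) * len(q)."""
    shared = set(q) & set(s)
    return sum(idf[t] ** 2 for t in shared) / (length(s, idf) * length(q, idf))


# Hypothetical idf weights, chosen only to keep the arithmetic easy to follow
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0}

q = ["Main", "St.", "Maine"]
s = ["Main", "St."]

print(idf_similarity(q, q, idf))  # a set compared with itself
print(idf_similarity(q, s, idf))  # q against a set missing the rare token
```

With these weights, dropping the rare token "Maine" costs far more similarity than dropping a common token would, because its squared idf dominates len(q).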

6
How can I build an index?

Let $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$
Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$
So
Decompose strings into tokens
Compute the idf of each token
Create one inverted list per token
Sort lists by string id: do a merge join
Sort lists by w: run TA/NRA
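
The sort-by-id option can be sketched as follows: build one inverted list per token holding (string id, w(t, s)) pairs in increasing id order, then merge the query's lists and sum the products w(t, s) * w(t, q) per id. The names `build_index`, `merge_join` and `tau`, the toy data, and the idf weights are all assumptions for illustration; the sketch also assumes every query token has an idf value.

```python
import heapq
import math
from collections import defaultdict
from itertools import groupby


def build_index(sets, idf):
    """token -> list of (sid, w(t, s)); ids are appended in increasing order."""
    index = defaultdict(list)
    for sid, s in enumerate(sets):
        tokens = set(s)
        norm = math.sqrt(sum(idf[t] ** 2 for t in tokens))  # len(s)
        for t in tokens:
            index[t].append((sid, idf[t] / norm))
    return index


def merge_join(index, q, idf, tau):
    """Return {sid: I(q, s)} for every set whose similarity exceeds tau."""
    q_tokens = set(q)
    q_len = math.sqrt(sum(idf[t] ** 2 for t in q_tokens))
    # One stream per query token, each entry already multiplied by w(t, q)
    streams = [
        [(sid, w * idf[t] / q_len) for sid, w in index.get(t, [])]
        for t in q_tokens
    ]
    results = {}
    # heapq.merge walks all id-sorted streams in global id order
    for sid, group in groupby(heapq.merge(*streams), key=lambda e: e[0]):
        score = sum(partial for _, partial in group)
        if score > tau:
            results[sid] = score
    return results


# Hypothetical data and weights
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Park": 2.0, "Ave.": 2.0}
sets = [["Main", "St.", "Maine"], ["Main", "St."], ["Park", "Ave."]]

index = build_index(sets, idf)
results = merge_join(index, ["Main", "St.", "Maine"], idf, tau=0.5)
print(results)
```

The merge join touches every entry of every query list, which is the cost the sort-by-w algorithms on the following slides try to avoid.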

7
Example: sort by id

8
Example: sort by w

NRA
Round robin list accesses
Main memory hash table
Computes lower and upper bounds per entry

9
Semantic properties of IDF

Order Preservation
For all $t_1 \neq t_2$: if $w(t_1, s) < w(t_1, r)$, then $w(t_2, s) < w(t_2, r)$
Length Boundedness
Query $q$, set $s$, threshold $\tau$
$I(q, s) > \tau \Rightarrow \tau\,\mathrm{len}(q) < \mathrm{len}(s) < \mathrm{len}(q) / \tau$
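
Length Boundedness follows in one step from the definition of $I(q, s)$. The derivation below is a sketch that writes the threshold as $\tau$ (assumed to satisfy $0 < \tau \le 1$) and uses only the definitions of $\mathrm{len}$ and $I$ from the earlier slides.

```latex
\begin{align*}
I(q,s)
  &= \frac{\sum_{t \in q \cap s} \mathrm{idf}(t)^2}{\mathrm{len}(s)\,\mathrm{len}(q)}
   \le \frac{\sum_{t \in q} \mathrm{idf}(t)^2}{\mathrm{len}(s)\,\mathrm{len}(q)}
   = \frac{\mathrm{len}(q)^2}{\mathrm{len}(s)\,\mathrm{len}(q)}
   = \frac{\mathrm{len}(q)}{\mathrm{len}(s)}, \\
I(q,s)
  &\le \frac{\sum_{t \in s} \mathrm{idf}(t)^2}{\mathrm{len}(s)\,\mathrm{len}(q)}
   = \frac{\mathrm{len}(s)}{\mathrm{len}(q)}. \\
\text{Hence } I(q,s) > \tau
  &\;\Rightarrow\; \frac{\mathrm{len}(q)}{\mathrm{len}(s)} > \tau
   \text{ and } \frac{\mathrm{len}(s)}{\mathrm{len}(q)} > \tau
   \;\Rightarrow\; \tau\,\mathrm{len}(q) < \mathrm{len}(s) < \frac{\mathrm{len}(q)}{\tau}.
\end{align*}
```

Order Preservation has an equally short justification: $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$, so within any one list the ordering by $w$ depends only on $\mathrm{len}(s)$ and is the same in every list.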

10
Improved NRA

Order Preservation determines whether a given set appears in a list or not
In list ti we encounter s1, then s2
In list tk we encounter s2 first, so s1 cannot appear in tk
Length Boundedness restricts the search to a small portion of each list

11
Something surprising

Lemma: NRA can read arbitrarily more elements than iNRA (improved NRA)
Lemma: NRA can read arbitrarily more elements than any algorithm that uses the Length Boundedness property

12
Any other strategies?

NRA style is breadth-first
Try depth-first
Sort query lists in decreasing idf order
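Let $q = \{t_1, \ldots, t_n\}$ with $\mathrm{idf}(t_1) > \mathrm{idf}(t_2) > \cdots > \mathrm{idf}(t_n)$
Let $\lambda_i$ be the maximum length a set $s$ in list $t_i$ can have s.t. $I(q, s) > \tau$, assuming that $s$ exists in all lists $t_k$ with $k \ge i$
$\lambda_i = \sum_{i \le k \le n} \mathrm{idf}(t_k)^2 / \left(\tau\,\mathrm{len}(q)\right)$
$\lambda_i$ is a natural cutoff point
$\lambda_1 > \lambda_2 > \cdots > \lambda_n$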
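
To make the cutoff points concrete, the sketch below computes one cutoff per query list as a suffix sum of squared idf values divided by (threshold * len(q)). The names `cutoffs`, `lam` and `tau`, and the example weights, are illustrative assumptions.

```python
import math


def cutoffs(query_idfs, tau):
    """Return the per-list cutoffs, with lists ordered by decreasing idf.

    lam[i] = (sum of idf^2 over lists i..n-1) / (tau * len(q)), using 0-based i.
    """
    idfs = sorted(query_idfs, reverse=True)
    q_len = math.sqrt(sum(v * v for v in idfs))
    lam = []
    suffix = 0.0
    for v in reversed(idfs):      # accumulate the suffix sum from the last list
        suffix += v * v
        lam.append(suffix / (tau * q_len))
    lam.reverse()
    return lam


# Hypothetical query with one rare token and two common ones
lam = cutoffs([2.0, 1.0, 1.0], tau=0.5)
print([round(x, 3) for x in lam])
```

The first cutoff equals len(q) / threshold, the Length Boundedness upper bound, and each later cutoff is smaller because fewer lists remain to contribute to the score.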

13
Shortest-First

Sort $q = \{t_1, \ldots, t_n\}$ in decreasing idf order
Let candidate set $C = \emptyset$
For $1 \le i \le n$
    Skip to first entry with $\mathrm{len}(s) > \tau\,\mathrm{len}(q)$
    Compute $\lambda_i$
    Let $\lambda_i = \min(\lambda_i, \mathrm{len}(q) / \tau)$
    Repeat
        $s$ = pop next element from $t_i$
        Maintain lower/upper bounds of entries in $C$
    Until $\mathrm{len}(s) > \max(\text{max\_len}(C), \lambda_i)$
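
The following is a simplified sketch of the Shortest-First idea, not the authors' implementation. It assumes each inverted list stores (len(s), sid) pairs sorted by increasing length (equivalently, decreasing w), and it accumulates exact partial scores for each candidate instead of maintaining and pruning on lower/upper bounds. All function and variable names, the toy data, and the idf weights are hypothetical.

```python
import math
from bisect import bisect_right


def build_length_index(sets, idf):
    """token -> list of (len(s), sid), sorted by increasing length."""
    index = {}
    for sid, s in enumerate(sets):
        tokens = set(s)
        length = math.sqrt(sum(idf[t] ** 2 for t in tokens))
        for t in tokens:
            index.setdefault(t, []).append((length, sid))
    for lst in index.values():
        lst.sort()
    return index


def shortest_first(index, idf, q, tau):
    """Return {sid: I(q, s)} for sets with similarity above tau."""
    tokens = sorted(set(q), key=lambda t: idf[t], reverse=True)
    q_len = math.sqrt(sum(idf[t] ** 2 for t in tokens))
    # suffix[i] = sum of idf^2 over query tokens i..n-1
    suffix = [0.0] * (len(tokens) + 1)
    for i in range(len(tokens) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + idf[tokens[i]] ** 2

    cand = {}          # sid -> [len(s), sum of idf^2 over shared tokens seen so far]
    max_len_c = 0.0    # longest candidate admitted so far
    for i, t in enumerate(tokens):
        lst = index.get(t, [])
        lam = min(suffix[i] / (tau * q_len), q_len / tau)   # cutoff for this list
        limit = max(max_len_c, lam)
        # Skip entries with len(s) <= tau * len(q): they cannot exceed tau
        start = bisect_right(lst, (tau * q_len, math.inf))
        for length, sid in lst[start:]:
            if length > limit:
                break                       # nothing further in this list can matter
            if sid in cand:
                cand[sid][1] += idf[t] ** 2
            elif length <= lam:             # new candidates only below the cutoff
                cand[sid] = [length, idf[t] ** 2]
                max_len_c = max(max_len_c, length)

    scores = {sid: acc / (length * q_len) for sid, (length, acc) in cand.items()}
    return {sid: sc for sid, sc in scores.items() if sc > tau}


# Hypothetical data and weights
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Park": 2.0, "Ave.": 2.0}
sets = [["Main", "St.", "Maine"], ["Main", "St."], ["Park", "Ave."], ["Main"]]

index = build_length_index(sets, idf)
found = shortest_first(index, idf, ["Main", "St.", "Maine"], tau=0.5)
print(found)
```

In this sketch a set is admitted as a candidate only in the first list where it can still reach the threshold, and later lists are read just far enough to finish scoring the admitted candidates. A fuller version would also drop candidates whose upper bound falls below the threshold, which shrinks max_len(C) and stops list scans earlier.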

14
Comparison with NRA

Lemma: Let $q = \{t_1, \ldots, t_n\}$ and let $d$ be the maximum depth SF descends to over all lists. In the worst case iNRA will read $(d - 1)(n - 1)$ elements more than SF
But surprisingly...

15
A hybrid strategy

Run iNRA normally
Use $\lambda_i$ and max_len(C) to stop reading from a particular list
This guarantees that iNRA stops with or before SF
Drawback of NRA variants
Very high bookkeeping cost compared to SF

16
Experiments

DBLP, IMDB and YellowPages datasets
Actors, movies, authors, businesses etc.
Vary threshold, query size, query strings and mistakes
Test wall-clock time, pruning power
Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, Improved SQL based

17
Wall-clock time vs. Threshold

18
Wall-clock time vs. Query size
Legend: TA, SF, NRA, Sort-by-id, iTA

19
Space

20
Conclusion

Proposed a simplified TF/IDF measure
Identified strong monotonicity properties
Used the properties to design efficient
algorithms
SF works best overall in practice
Achieves sub-second answers in most practical
cases

21
Q&A

22
Pruning power vs. Threshold

23
Pruning power vs. Query size
Legend: iTA, TA, NRA
