Chen Li - PowerPoint PPT Presentation

About This Presentation

Title:

Chen Li

Description:

Answering Approximate Queries Efficiently Chen Li Department of Computer Science Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica ... – PowerPoint PPT presentation

Slides: 52

Provided by: Chen2172

Learn more at: https://ics.uci.edu

Transcript and Presenter's Notes

Title: Chen Li

1
Answering Approximate Queries Efficiently

Chen Li
Department of Computer Science
Joint work with Liang Jin, Nick Koudas, Anthony
Tung, and Rares Vernica

2
30,000-Foot View of Info Systems
Data Repository (RDBMS, Search Engines, etc.)
Answers matching conditions
Query
3
Example: a movie database
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama

4
How about our governor Schwarrzenger?
The user doesn't know the exact spelling!
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama

5
Relaxing Conditions
Find movies with a star similar to
Schwarrzenger.
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama

6
In general: Gap between Queries and Facts

Errors in the query
The user doesn't remember a string exactly
The user unintentionally types a wrong string

7
"Did you mean?" features in Search Engines
8
What if we don't want the user to change the
query?
Answering Queries Approximately
Data Repository (RDBMS, Search Engines, etc.)
Answers matching conditions approximately
Query
9
Technical Challenges

How to relax conditions?
Name: Schwarzenegger vs. Schwarrzenger
Salary: in [50K, 60K] vs. in [49K, 63K]
How to answer queries efficiently?
Index structures
Selectivity estimation
See our three recent VLDB papers

10
Rest of the talk

Selectivity estimation of fuzzy predicates
Our approach: SEPIA
Construction and maintenance of SEPIA
Experiments
Other works

11
Queries with Fuzzy String Predicates

Star's name similar to Schwarrzenger
Employee's SSN similar to 430-87-7294
Customer's telephone number similar to 412-0964

Database

"Similar to":
a domain-specific function
returns a similarity value between two strings
Examples:
Edit distance: ed(Schwarrzenger, Schwarzenegger) = 3
Cosine similarity
Jaccard coefficient distance
Soundex

12
Example Similarity Function: Edit Distance

A widely used metric to define string similarity
ed(s1, s2) = minimum # of operations (insertion,
deletion, substitution) to change s1 to s2
Example
s1 = Tom Hanks
s2 = Ton Hank
ed(s1, s2) = 2
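
The definition above corresponds to the classic dynamic-programming computation. The following is a minimal Python sketch of it; the function name edit_distance is chosen here for illustration and does not come from the talk.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # prev[j] is the distance between the first i-1 characters of s1
    # and the first j characters of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning i characters into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

Only two rows of the table are kept at a time, so memory grows with the length of s2 rather than with the product of both lengths.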

13
Selectivity of Fuzzy Predicates

star SIMILARTO 'Schwarrzenger'
Selectivity: # of records satisfying the predicate

Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama

14
Selectivity Estimation: Problem Formulation
star SIMILARTO 'Schwarrzenger'
Input: fuzzy string predicate P(q, d)
A bag of strings
Output: # of strings s that satisfy dist(s, q) < d
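
For reference, the quantity being estimated can be computed exactly by scanning the whole bag, which is the cost an estimator is meant to avoid. The sketch below assumes dist is edit distance and takes the comparison to be strict; exact_selectivity is a hypothetical name, and the bag is the Star column of the example movie table.

```python
from typing import Iterable


def edit_distance(s1: str, s2: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def exact_selectivity(bag: Iterable[str], q: str, d: int) -> int:
    """Count the strings s in the bag with edit_distance(s, q) < d (strict, by assumption)."""
    return sum(1 for s in bag if edit_distance(s, q) < d)


stars = ["Keanu Reeves", "Samuel Jackson", "Schwarzenegger", "Samuel Jackson"]
print(exact_selectivity(stars, "Schwarrzenger", 4))  # 1
```

Because the input is a bag, duplicates such as the two "Samuel Jackson" entries are counted separately whenever they match.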
15
Why Selectivity Estimation?
SELECT * FROM Movies WHERE star SIMILARTO
'Schwarrzenger' AND year BETWEEN [1970, 1971]
SELECT * FROM Movies WHERE star SIMILARTO
'Schwarrzenger' AND year BETWEEN [1980, 1989]
Movies
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama

The optimizer needs to know the selectivity of a
predicate to decide a good plan.
16
Using traditional histograms?

No nice order for strings
Lexicographical order?
Similar strings could be far from each other
Kammy/Cammy
Adjacent strings have different selectivities
Cathy/Catherine
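
The ordering problem can be seen on a tiny list. In this sketch only Kammy, Cammy, Cathy and Catherine come from the slide; the other names are made up to pad the list.

```python
names = ["Kammy", "Cathy", "Cammy", "Dave", "Catherine", "Emma", "Frank"]

# Sort lexicographically, as a traditional histogram over strings would.
ordered = sorted(names)
position = {name: i for i, name in enumerate(ordered)}

print(ordered)
# Kammy and Cammy differ by one substitution but land at opposite ends.
print(position["Kammy"] - position["Cammy"])      # 6
# Cathy and Catherine are neighbours in the order, yet they are different strings
# whose frequencies need not be related.
print(position["Cathy"] - position["Catherine"])  # 1
```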

17
Outline

Selectivity estimation of fuzzy predicates
Our approach: SEPIA
Overview
Proximity between strings
Estimation algorithm
Construction and maintenance of SEPIA
Experiments
Other works

18
Our approach: SEPIA

Selectivity Estimation of Approximate Predicates

Intuition
19
Proximity between Strings
Edit Distance? Not discriminative enough
20
Edit Vector from s1 to s2

A vector <I, D, S>
I: # of insertions
D: # of deletions
S: # of substitutions
in a sequence of edit operations whose length is their edit
distance

Easily computable
Not symmetric
Not unique, but tends to be (ed < 3 => 91% unique)
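
One way to obtain such a vector is to fill the usual edit-distance table and then trace back one optimal path, counting each kind of operation. The sketch below does that. Since the vector is not unique, the tie-breaking order used here (diagonal step first, then deletion, then insertion) is an arbitrary choice of this example, not something stated in the talk.

```python
from typing import Tuple


def edit_vector(s1: str, s2: str) -> Tuple[int, int, int]:
    """Return (I, D, S) for one minimum-length edit sequence from s1 to s2."""
    n, m = len(s1), len(s2)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1,                           # delete s1[i-1]
                dist[i][j - 1] + 1,                           # insert s2[j-1]
                dist[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),  # substitute or match
            )
    # Trace back from the bottom-right corner, counting operations.
    ins = dele = sub = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            sub += s1[i - 1] != s2[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ins, dele, sub


print(edit_vector("Tom Hanks", "Ton Hank"))  # (0, 1, 1)
print(edit_vector("Ton Hank", "Tom Hanks"))  # (1, 0, 1)
```

The two calls illustrate the "not symmetric" point: swapping the arguments turns the deletion into an insertion. The components always sum to the edit distance.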

21
Why Edit Vector?

More discriminative

22
SEPIA histograms: Overview
23
Frequency table for each cluster
24
Global PPD Table

Proximity Pair Distribution table

25
SEPIA histograms: summary
26
Selectivity Estimation: ed(lukas, 2)

Do it for all v2 vectors in each cluster, for all
clusters
Take the sum of these contributions

27
Selectivity Estimation for ed(q,d)

For each cluster Ci
For each v2 in frequency table of Ci
Use (v1, v2, d) to look up the PPD
Take the sum of these f * N (f: percentage from the PPD, N: frequency of v2)
Pruning possible (triangle inequality)
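
The loop above can be sketched as follows. This is an illustration under stated assumptions, not the SEPIA implementation: v1 is taken to be the edit vector from the query to the cluster pivot, a PPD miss contributes 0, the pruning test is a conservative reading of the triangle inequality, and every number in the toy tables is made up. The names Cluster and estimate_selectivity are hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

Vector = Tuple[int, int, int]  # edit vector <I, D, S>


class Cluster(NamedTuple):
    pivot: str
    freq: Dict[Vector, int]  # v2 -> N, number of strings with that vector from the pivot


def estimate_selectivity(
    q: str,
    d: int,
    clusters: List[Cluster],
    ppd: Dict[Tuple[Vector, Vector, int], float],
    edit_vector: Callable[[str, str], Vector],
) -> float:
    total = 0.0
    for cluster in clusters:
        v1 = edit_vector(q, cluster.pivot)  # assumed direction: query -> pivot
        for v2, n in cluster.freq.items():
            # Triangle inequality: ed(q, s) >= |ed(q, p) - ed(p, s)|, and each
            # vector sums to its edit distance, so such groups cannot qualify.
            if abs(sum(v1) - sum(v2)) > d:
                continue
            f = ppd.get((v1, v2, d), 0.0)  # a miss in the PPD contributes nothing
            total += f * n
    return total


# Toy data, invented for this example.
clusters = [Cluster("lucas", {(0, 0, 0): 1, (0, 0, 1): 4, (1, 0, 3): 7})]
ppd = {((0, 0, 1), (0, 0, 0), 2): 1.0, ((0, 0, 1), (0, 0, 1), 2): 0.75}
toy_vector = lambda q, p: (0, 0, 1)  # "lukas" -> "lucas" is one substitution

print(estimate_selectivity("lukas", 2, clusters, ppd, toy_vector))  # 4.0
```

Here the group with vector (1, 0, 3) is pruned before any lookup, and the other two groups contribute 1.0 * 1 and 0.75 * 4.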

28
Outline

Selectivity estimation of fuzzy predicates
Our approach: SEPIA
Overview
Proximity between strings
Estimation algorithm
Construction and maintenance of SEPIA
Experiments
Other works

29
Clustering Strings

Two example algorithms
Lexicographic order based.
K-Medoids
Choose initial pivots
Assign each string to its closest pivot
Swap a pivot with another string
Reassign the strings
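
The four K-Medoids steps can be sketched with edit distance as the metric. The details below are assumptions of this example rather than the talk's settings: pivots are initialised at random, one random swap is tried per round, and a swap is kept only if it lowers the total distance of strings to their pivots. The sample names are partly made up.

```python
import random
from typing import Dict, List


def edit_distance(s1: str, s2: str) -> int:
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def assign(strings: List[str], pivots: List[str]) -> Dict[str, List[str]]:
    """Assign each string to its closest pivot."""
    clusters: Dict[str, List[str]] = {p: [] for p in pivots}
    for s in strings:
        clusters[min(pivots, key=lambda p: edit_distance(s, p))].append(s)
    return clusters


def total_cost(clusters: Dict[str, List[str]]) -> int:
    return sum(edit_distance(s, p) for p, members in clusters.items() for s in members)


def k_medoids(strings: List[str], k: int, rounds: int = 20, seed: int = 0) -> Dict[str, List[str]]:
    rng = random.Random(seed)
    candidates = sorted(set(strings))     # pivots must be distinct strings
    pivots = rng.sample(candidates, k)    # choose initial pivots
    clusters = assign(strings, pivots)
    cost = total_cost(clusters)
    for _ in range(rounds):
        swap_in = rng.choice(candidates)  # swap a pivot with another string
        if swap_in in pivots:
            continue
        i = rng.randrange(k)
        trial = pivots[:i] + [swap_in] + pivots[i + 1:]
        trial_clusters = assign(strings, trial)  # reassign the strings
        trial_cost = total_cost(trial_clusters)
        if trial_cost < cost:
            pivots, clusters, cost = trial, trial_clusters, trial_cost
    return clusters


names = ["Tom Hanks", "Ton Hank", "Tom Hank", "Keanu Reeves", "Keanu Reeve", "Kenu Reeves"]
for pivot, members in k_medoids(names, k=2).items():
    print(pivot, members)
```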

30
Number of Clusters

It affects
Cluster quality
Similarity of strings within each cluster
Costs
Space
Estimation time

31
Constructing Frequency Tables

For each cluster, group strings based on their
edit vector from the pivot
Count the frequency for each group
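
Building one cluster's frequency table amounts to grouping its strings by edit vector and counting. The sketch below assumes the vector is taken from the pivot to each member; edit_vector is an illustrative helper with an arbitrary tie-breaking rule, and the member strings are invented.

```python
from collections import Counter
from typing import Iterable, Tuple

Vector = Tuple[int, int, int]  # <I, D, S>


def edit_vector(s1: str, s2: str) -> Vector:
    """(I, D, S) along one optimal edit path from s1 to s2."""
    n, m = len(s1), len(s2)
    dist = [[i + j if i == 0 or j == 0 else 0 for j in range(m + 1)] for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]))
    ins = dele = sub = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            sub += s1[i - 1] != s2[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele, i = dele + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return ins, dele, sub


def frequency_table(pivot: str, members: Iterable[str]) -> Counter:
    """Map each edit vector (pivot -> member) to the number of members having it."""
    return Counter(edit_vector(pivot, s) for s in members)


table = frequency_table("lucas", ["lucas", "lukas", "luca", "lucas", "lucia"])
for vector, count in sorted(table.items()):
    print(vector, count)
```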

32
Constructing PPD Table

Get enough samples of string triplets (q,p,s)
Propose a few heuristics
ALL_RAND
CLOSE_RAND
CLOSE_LEX
CLOSE_UNIQUE

33
Dynamic Maintenance: Frequency Table

Take insertion as an example

34
Dynamic Maintenance: PPD
35
Improving Estimation Accuracy

Reasons for estimation errors
Missed hits in PPD.
Inaccurate percentage entries in PPD.
Improvement: use sample fuzzy predicates to
analyze their estimation errors

36
Relative-Error Model

Use the errors to build a model
Use the model to adjust initial estimation

37
Outline

Motivation: selectivity estimation of fuzzy
predicates
Our approach: SEPIA
Overview
Proximity between strings
Estimation algorithm
Construction and maintenance of SEPIA
Experiments
Other works

38
Data

Citeseer
71K author names
Length: [2, 20], avg = 12
Movie records from UCI KDD repository
11K movie titles.
Length: [3, 80], avg = 35
Introduced duplicates
10% of records
# of duplicates: [1, 20], uniform
Final results
Citeseer: 142K author names
UCI KDD: 23K movie titles

39
Setting

Test bed
PC: 2.4GHz P4, 1.2GB RAM, Windows XP
Visual C++ compiler
Query workload
Strings from the data
Strings not in the data
Results similar
Quality measurements
Relative error: (f_est - f_real) / f_real
Absolute relative error: |f_est - f_real| / f_real
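
The two quality measurements are simple to state in code. In this sketch f_est is the estimated count and f_real the true count; the function names and sample numbers are illustrative only.

```python
def relative_error(f_est: float, f_real: float) -> float:
    """Signed error: positive for an overestimate, negative for an underestimate."""
    return (f_est - f_real) / f_real


def absolute_relative_error(f_est: float, f_real: float) -> float:
    """Magnitude of the relative error, ignoring its direction."""
    return abs(f_est - f_real) / f_real


print(relative_error(120, 100))          # 0.2
print(relative_error(80, 100))           # -0.2
print(absolute_relative_error(80, 100))  # 0.2
```

The signed form shows whether an estimator tends to over- or underestimate, while the absolute form keeps such errors from cancelling out when averaged. Both are undefined when f_real is 0.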

40
Clustering Algorithms
K-Medoids is better
41
Quartile distribution of relative errors
Data set 1, CLOSE_RAND, 1000 clusters
42
Number of Clusters
43
Effectiveness of Applying Relative-Error Model
44
Dynamic Maintenance
45
Other work 1: Relaxing SQL queries with
Selections/Joins

SELECT * FROM Jobs J, Candidates C WHERE J.Salary
< 95 AND J.Zipcode = C.Zipcode AND
C.WorkYear > 5

Jobs | Candidates
JID | Company | Zipcode | Salary | CID | Zipcode | ExpSalary | WorkYear
r1 | Broadcom | 92047 | 80 | s1 | 93652 | 120 | 3
r2 | Intel | 93652 | 95 | s2 | 92612 | 130 | 6
r3 | Microsoft | 82632 | 120 | s3 | 82632 | 100 | 5
r4 | IBM | 90391 | 130 | s4 | 90391 | 150 | 1
... | ...
46
Query Relaxation: Skyline!
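
The transcript does not say how the skyline is defined, so the following is only one plausible reading, sketched on the Jobs/Candidates rows shown earlier. It assumes each job-candidate pair is scored by how far each of the three conditions must be relaxed (numeric distance to the bound, with boundary values counted as needing no relaxation, and zipcodes compared as plain numbers), and that the answer is the set of pairs not dominated by any other pair. All function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# JID -> (Zipcode, Salary) and CID -> (Zipcode, WorkYear), copied from the tables.
jobs: Dict[str, Tuple[int, int]] = {
    "r1": (92047, 80), "r2": (93652, 95), "r3": (82632, 120), "r4": (90391, 130),
}
candidates: Dict[str, Tuple[int, int]] = {
    "s1": (93652, 3), "s2": (92612, 6), "s3": (82632, 5), "s4": (90391, 1),
}


def relaxation(job: Tuple[int, int], cand: Tuple[int, int]) -> Tuple[int, int, int]:
    """Amount each condition must be loosened for the pair to qualify (assumed measure)."""
    zip_j, salary = job
    zip_c, work_year = cand
    return (
        max(0, salary - 95),    # salary bound at 95
        abs(zip_j - zip_c),     # join condition on zipcode
        max(0, 5 - work_year),  # work-year bound at 5
    )


def dominates(a: Tuple[int, ...], b: Tuple[int, ...]) -> bool:
    """a needs no more relaxation than b on every condition, and less on at least one."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def relaxation_skyline() -> List[Tuple[str, str]]:
    scores = {(j, c): relaxation(jobs[j], candidates[c]) for j, c in product(jobs, candidates)}
    return sorted(
        pair for pair, v in scores.items()
        if not any(dominates(w, v) for w in scores.values())
    )


print(relaxation_skyline())  # [('r1', 's2'), ('r2', 's1'), ('r3', 's3')]
```

Each of the three surviving pairs is best on a different trade-off: (r2, s1) needs only the work-year condition relaxed, (r1, s2) only the zipcode join, and (r3, s3) only the salary bound.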
47
Other work 2: Fuzzy predicates on attributes of
mixed types
SELECT * FROM Movies WHERE star SIMILARTO
'Schwarrzenger' AND |year - 1977| < 3
Movies
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama

48
Mixed-Typed Predicates

String attributes: edit distance
Numeric attributes: absolute numeric difference

SELECT * FROM Movies WHERE star SIMILARTO
'Schwarrzenger' AND |year - 1977| < 3
49
MAT-tree: Intuition

Indexing on both attributes together is more effective than
two separate indexing structures
Numeric attribute: B-tree
String attribute: tree-based index structure?

50
MAT-tree: Overview

Tree-based indexing structure
Each node has an MBR for both the numeric attribute and
the string attribute
Strings are compressed into a compressed trie that
fits into a limited space
An edit distance between a string and a compressed
trie can be computed
Experiments show that MAT-tree is very efficient

51
Conclusion

It's important to support answering approximate
queries efficiently
Our results so far
SEPIA provides accurate selectivity estimation
for fuzzy string predicates
Relaxing SQL queries with selections and joins
MAT-tree: indexing structure supporting fuzzy
queries with mixed-type predicates
