Chen Li - PowerPoint PPT Presentation

About This Presentation
Title:

Chen Li

Description:

Answering Approximate Queries Efficiently Chen Li Department of Computer Science Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica ... – PowerPoint PPT presentation

Number of Views:85
Avg rating:3.0/5.0
Slides: 52
Provided by: Chen2172
Learn more at: https://ics.uci.edu
Category:
Tags: chen | fuzzy | joins

less

Transcript and Presenter's Notes

Title: Chen Li


1
Answering Approximate Queries Efficiently
  • Chen Li
  • Department of Computer Science
  • Joint work with Liang Jin, Nick Koudas, Anthony
    Tung, and Rares Vernica

2
30,000-Foot View of Info Systems
Data Repository (RDBMS, Search Engines, etc.)
Answers matching conditions
Query
3
Example a movie database
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama

4
How about our governor Schwarrzenger?
The user doesnt know the exact spelling!
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama

5
Relaxing Conditions
Find movies with a star similar to
Schwarrzenger.
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama

6
In general Gap between Queries and Facts
  • Errors in the query
  • The user doesnt remember a string exactly
  • The user unintentionally types a wrong string

7
Did you mean? features in Search Engines
8
What if we dont want the user to change the
query?Answering Queries Approximately
Data Repository (RDBMS, Search Engines, etc.)
Answers matching conditions approximately
Query
9
Technical Challenges
  • How to relax conditions?
  • Name Schwarzenegger vs Schwarrzenger
  • Salary in 50K,60K vs in 49K,63K
  • How to answer queries efficiently?
  • Index structures
  • Selectivity estimation
  • See our three recent VLDB papers

10
Rest of the talk
  • Selectivity estimation of fuzzy predicates
  • Our approach SEPIA
  • Construction and maintenance of SEPIA
  • Experiments
  • Other works

11
Queries with Fuzzy String Predicates
  • Stars name similar to Schwarrzenger
  • Employees SSN similar to 430-87-7294
  • Customers telephone number similar to 412-0964

Database
  • Similar to
  • a domain-specific function
  • returns a similarity value between two strings
  • Examples
  • Edit distance ed(Schwarrzenger,
    Schwarzenegger)2
  • Cosine similarity
  • Jaccard coefficient distance
  • Soundex

12
Example Similarity Function Edit Distance
  • A widely used metric to define string similarity
  • Ed(s1,s2) minimum of operations (insertion,
    deletion, substitution) to change s1 to s2
  • Example
  • s1 Tom Hanks
  • s2 Ton Hank
  • ed(s1,s2) 2

13
Selectivity of Fuzzy Predicates
  • star SIMILARTO Schwarrzenger
  • Selectivity of records satisfying the predicate

Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama

14
Selectivity Estimation Problem Formulation
star SIMILARTO Schwarrzenger
Input fuzzy string predicate P(q, d)
A bag of strings
Output of strings s that satisfy dist(s,q) lt d
15
Why Selectivity Estimation?
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year BETWEEN 1970,1971
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year BETWEEN 1980,1989
Movies
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama

The optimizer needs to know the selectivity of a
predicate to decide a good plan.
16
Using traditional histograms?
  • No nice order for strings
  • Lexicographical order?
  • Similar strings could be far from each other
    Kammy/Cammy
  • Adjacent strings have different selectivities
    Cathy/Catherine

17
Outline
  • Selectivity estimation of fuzzy predicates
  • Our approach SEPIA
  • Overview
  • Proximity between strings
  • Estimation algorithm
  • Construction and maintenance of SEPIA
  • Experiments
  • Other works

18
Our approach SEPIA
  • Selectivity Estimation of Approximate Predicates

Intuition
19
Proximity between Strings
Edit Distance? Not discriminative enough
20
Edit Vector from s1 to s2
  • A vector ltI, D, Sgt
  • I of insertions
  • D of deletions
  • S of substitutions
  • in a sequence of edit operations with their edit
    distance
  • Easily computable
  • Not symmetric
  • Not unique, but tend to be (ed lt 3 ? 91 unique)

21
Why Edit Vector?
  • More discriminative

22
SEPIA histograms Overview
23
Frequency table for each cluster
24
Global PPD Table
  • Proximity Pair Distribution table

25
SEPIA histograms summary
26
Selectivity Estimation ed(lukas, 2)
  • Do it for all v2 vectors in each cluster, for all
    clusters
  • Take the sum of these contributions

27
Selectivity Estimation for ed(q,d)
  • For each cluster Ci
  • For each v2 in frequency table of Ci
  • Use (v1,v2,d) to lookup PPD
  • Take the sum of these f N
  • Pruning possible (triangle inequality)

28
Outline
  • Selectivity estimation of fuzzy predicates
  • Our approach SEPIA
  • Overview
  • Proximity between strings
  • Estimation algorithm
  • Construction and maintenance of SEPIA
  • Experiments
  • Other works

29
Clustering Strings
  • Two example algorithms
  • Lexicographic order based.
  • K-Medoids
  • Choose initial pivots
  • Assign strings to its closest pivot
  • Swap a pivot with another string
  • Reassign the strings

30
Number of Clusters
  • It affects
  • Cluster quality
  • Similarity of strings within each cluster
  • Costs
  • Space
  • Estimation time

31
Constructing Frequency Tables
  • For each cluster, group strings based on their
    edit vector from the pivot
  • Count the frequency for each group

32
Constructing PPD Table
  • Get enough samples of string triplets (q,p,s)
  • Propose a few heuristics
  • ALL_RAND
  • CLOSE_RAND
  • CLOSE_LEX
  • CLOSE_UNIQUE

33
Dynamic Maintenance Frequency Table
  • Take insertion as an example

34
Dynamic Maintenance PPD
35
Improving Estimation Accuracy
  • Reasons of estimate errors
  • Miss hits in PPD.
  • Inaccurate percentage entries in PPD.
  • Improvement use sample fuzzy predicates to
    analyze their estimation errors

36
Relative-Error Model
  • Use the errors to build a model
  • Use the model to adjust initial estimation

37
Outline
  • Motivation selectivity estimation of fuzzy
    predicates
  • Our approach SEPIA
  • Overview
  • Proximity between strings
  • Estimation algorithm
  • Construction and maintenance of SEPIA
  • Experiments
  • Other works

38
Data
  • Citeseer
  • 71K author names
  • Length 2,20, avg 12
  • Movie records from UCI KDD repository
  • 11K movie titles.
  • Length 3,80, avg 35
  • Introduced duplicates
  • 10 of records
  • of duplicates 1,20, uniform
  • Final results
  • Citeseer 142K author names
  • UCI KDD 23K movie titles

39
Setting
  • Test bed
  • PC 2.4G P4, 1.2GB RAM, Windows XP
  • Visual C compiler
  • Query workload
  • Strings from the data
  • String not in the data
  • Results similar
  • Quality measurements
  • Relative error (fest freal) / freal
  • Absolute relative error fest freal / freal

40
Clustering Algorithms
K-Metoids is better
41
Quartile distribution of relative errors
Data set 1. CLOSE_RAND 1000 clusters
42
Number of Clusters
43
Effectiveness of Applying Relative-Error Model
44
Dynamic Maintenance
45
Other work 1 Relaxing SQL queries with
Selections/Joins
  • SELECT FROM Jobs J, Candidate CWHERE J.Salary
    lt 95 AND J.Zipcode C.Zipcode AND
    C.WorkYear gt 5

Jobs Jobs Jobs Jobs Candidates Candidates Candidates Candidates
JID Company Zipcode Salary CID Zipcode ExpSalary WorkYear
r1 Broadcom 92047 80 s1 93652 120 3
r2 Intel 93652 95 s2 92612 130 6
r3 Microsoft 82632 120 s3 82632 100 5
r4 IBM 90391 130 s4 90391 150 1
... ...
46
Query Relaxation Skyline!
47
Other work 2 Fuzzy predicates on attributes of
mixed types
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year 1977 lt 3
Movies
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama

48
Mixed-Typed Predicates
  • String attributes edit distance
  • Numeric attributes absolute numeric difference

SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year 1977 lt 3
49
MAT-tree Intuition
  • Indexing on two attributes is more effective than
    two separate indexing structures
  • Numeric attribute B-tree
  • String attribute tree-based index structure?

50
MAT-tree Overview
  • Tree-based indexing structure
  • Each node has MBR for both numeric attribute and
    string attribute
  • Compressing strings as a compressed trie that
    fits into a limited space
  • An edit distance between a string and compressed
    trie can be computed
  • Experiments show that MAT-tree is very efficient

51
Conclusion
  • Its important to support answering approximate
    queries efficiently
  • Our results so far
  • SEPIA provides accurate selectivity estimation
    for fuzzy string predicates
  • Relaxing SQL queries with selections and joins
  • MAT-tree indexing structure supporting fuzzy
    queries with mixed-types predicates
Write a Comment
User Comments (0)
About PowerShow.com