Liang Jin and Chen Li - PowerPoint PPT Presentation

PPT – Liang Jin and Chen Li PowerPoint presentation | free to download - id: 244988-ZDc1Z

The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
Title:

Liang Jin and Chen Li

Description:

Star Wars: Episode III - Revenge of the Sith. The Matrix. Title. Schwarzenegger. Samuel Jackson ... estimation for predicates with wildcards: star LIKE '%Hanks ... – PowerPoint PPT presentation

Number of Views:15
Avg rating:3.0/5.0
Slides: 53
Provided by: che791
Category:
Tags:
Transcript and Presenter's Notes

Title: Liang Jin and Chen Li

1
Selectivity Estimation for Fuzzy String
Predicates in Large Data Sets
• Liang Jin and Chen Li

Supported by NSF CAREER Award IIS-0238586
2
Example a movie database
Find movies starred Schwarrzenger?
Find movies with a star similar to
Schwarrzenger.
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama

3
Queries with Fuzzy String Predicates
• Stars name similar to Schwarrzenger
• Employees SSN similar to 430-87-7294
• Customers telephone number similar to 412-0964
• Similar to
• a domain-specific function
• returns a similarity value between two strings
• Example edit distance
• Ed(s1,s2) minimum of operations (insertion,
deletion, substitution) to change s1 to s2
• ed(Tom Hanks,
• Ton Hank ) 2

Database
4
Selectivity Estimation Problem Formulation
star SIMILARTO Schwarrzenger
Input fuzzy string predicate P(q, d)
A bag of strings
Output of strings s that satisfy dist(s,q) lt d
5
Why Selectivity Estimation?
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year BETWEEN 1970,1971
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year BETWEEN 1980,1999
Movies
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama

The optimizer needs to know the selectivity of a
predicate to decide a good plan.
6
Rest of the talk
• Motivation selectivity estimation of fuzzy
predicates
• Our approach SEPIA
• Proximity between strings
• Histograms and estimation algorithm
• Construction and maintenance of SEPIA
• Experiments

7
Intuition of SEPIA
• Selectivity Estimation of Approximate Predicates

8
Proximity between Strings
Edit Distance? Not discriminative enough
9
Edit Vector from s1 to s2
• A vector ltI, D, Sgt
• I of insertions
• D of deletions
• S of substitutions
• in a sequence of edit operations with their edit
distance

10
Why Edit Vector? More discriminative
11
SEPIA histograms Overview
12
Frequency table for each cluster
13
Global PPD Table
• Proximity Pair Distribution table

14
SEPIA histograms summary
15
Selectivity Estimation ed(lukas, 2)
• Do it for all v2 vectors in each cluster, for all
clusters
• Take the sum of these contributions

16
Selectivity Estimation for ed(q,d)
• For each cluster Ci
• For each v2 in frequency table of Ci
• Use (v1,v2,d) to lookup PPD
• Take the sum of these f N
• Pruning possible (triangle inequality)

17
Outline
• Motivation selectivity estimation of fuzzy
predicates
• Our approach SEPIA
• Proximity between strings
• Histograms and estimation algorithm
• Construction and maintenance of SEPIA
• Experiments

18
Clustering Strings
• Two example algorithms
• Lexicographic order based.
• K-Medoids
• Choose initial pivots
• Assign strings to its closest pivot
• Swap a pivot with another string
• Reassign the strings

19
Number of Clusters
• It affects
• Cluster quality
• Similarity of strings within each cluster
• Costs
• Space
• Estimation time

20
Constructing Frequency Tables
• For each cluster, group strings based on their
edit vector from the pivot
• Count the frequency for each group

21
Constructing PPD Table
• Get enough samples of string triplets (q,p,s)
• Propose a few heuristics
• ALL_RAND
• CLOSE_RAND
• CLOSE_LEX
• CLOSE_UNIQUE

22
Dynamic Maintenance Frequency Table
• Take insertion as an example

23
Dynamic Maintenance PPD
24
Improving Estimation Accuracy
• A post-processing step to further improve
estimation accuracy
• See paper for details.

25
Outline
• Motivation selectivity estimation of fuzzy
predicates
• Our approach SEPIA
• Proximity between strings
• Histograms and estimation algorithm
• Construction and maintenance of SEPIA
• Experiments

26
Data
• Citeseer
• 71K author names
• Length 2,20, avg 12
• Movie records from UCI KDD repository
• 11K movie titles.
• Length 3,80, avg 35
• Introduced duplicates
• 10 of records
• of duplicates 1,20, uniform
• Final results
• Citeseer 142K author names
• UCI KDD 23K movie titles

27
Setting
• Test bed
• PC 2.4G P4, 1.2GB RAM, Windows XP
• Visual C compiler
• Strings from the data
• String not in the data
• Results similar
• Quality measurements
• Relative error (fest freal) / freal
• Absolute relative error fest freal / freal

28
Quartile distribution of relative errors
Data set 1. CLOSE_RAND 1000 clusters
29
Number of Clusters
30
Dynamic Maintenance
• More results in the paper
• Extension to other similarity functions
• More experimental results

31
Related Work
• Selectivity estimation for predicates with
wildcards star LIKE Hanks
• Answering fuzzy predicates efficiently (another
talk in this conference)

32
Conclusions
• Important to support queries with fuzzy string
predicates
• SEPIA provides accurate selectivity estimation
• Structures can be efficiently constructed and
maintained.
• Extendable to various similarity measurements

The Flamingo Project http//www.ics.uci.edu/fla
mingo/
QA?
33
BLANK
34
Backup Slides
35
Why do we care?
• Errors in queries
• User doesnt remember a string exactly
• User types a wrong string

36
Size of histograms
• Data set 1
• 1000 clusters
• PPD table 5MB
• Frequency tables 200KB

37
Constructing PPD table
• We want to generate enough sample triplets to
cover as many (v1, v2) pairs as possible.
• We also want to control the cost of generating
the samples and calculation.
• Heuristics
• ALL_RAND
• CLOSE_RAND
• CLOSE_LEX
• CLOSE_UNIQUE

38
Populating PPD Table
CLOSE_RAND is used
39
Number of Clusters (cont)
Number of clusters grows with the size of the
dataset
Fixed number of clusters
40
Extension to other similarity functions
• SEPIA a general framework for selectivity
estimation for fuzzy string predicates.
• Key issue in extensions proximity between
strings
• Too specific?
• Too general?
• Example Jaccard coefficient distance
• Proximity between two strings s1 and s2.
• G(s,n) is the n-gram set for string s.
• lt G(s1, n) G(s2, n), G(s1, n) v G(s2, n),
ed(s1, s2) gt

41
Jaccard Coefficient Distance
42
Research Issues
• Deciding similarity functions
• Domain specific
• Query processing
• Answering a query with fuzzy predicates
efficiently
• Query optimization
• Selectivity estimation

43
Queries with Fuzzy String Predicates
• Stars name similar to Schwarrzenger
• Employees SSN similar to 430-87-7294
• Customers telephone number similar to 412-0964

Database
• Similar to
• a domain-specific function
• returns a similarity value between two strings
• Examples
• Edit distance ed(Schwarrzenger,
Schwarzenegger)2
• Cosine similarity
• Jaccard coefficient distance
• Soundex

44
Why do we care?
• Errors in the query
• The user doesnt remember a string exactly
• The user unintentionally types a wrong string

45
Selectivity of Fuzzy Predicates
• star SIMILARTO Schwarrzenger
• Selectivity of records satisfying the predicate

Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama

46
Example Similarity Function Edit Distance
• A widely used metric to define string similarity
• Ed(s1,s2) minimum of operations (insertion,
deletion, substitution) to change s1 to s2
• Example
• s1 Tom Hanks
• s2 Ton Hank
• ed(s1,s2) 2

47
• No nice order for strings
• Lexicographical order?
• Similar strings could be far from each other
Kammy/Cammy
• Adjacent strings have different selectivities
Cathy/Catherine

48
Edit Vector from s1 to s2
• A vector ltI, D, Sgt
• I of insertions
• D of deletions
• S of substitutions
• in a sequence of edit operations with their edit
distance
• Easily computable
• Not symmetric
• Not unique, but tend to be (ed lt 3 ? 91 unique)

49
Improving Estimation Accuracy
• Reasons of estimate errors
• Miss hits in PPD.
• Inaccurate percentage entries in PPD.
• Improvement use sample fuzzy predicates to
analyze their estimation errors

50
Relative-Error Model
• Use the errors to build a model
• Use the model to adjust initial estimation

51
Effectiveness of Applying Relative-Error Model
52
Clustering Algorithms
K-Metoids is better