Title: Robust and Efficient Fuzzy Match for Online Data Cleaning
Slide 1: Robust and Efficient Fuzzy Match for Online Data Cleaning
- Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti (Microsoft Research); Rajeev Motwani (Stanford University)
- Databases and Information Management Seminar
- Presented by Gayatri Swamynathan
Slide 2: INTRODUCTION
- Data warehouses: high data quality is essential
- Need to cleanse incoming data tuples from external sources: match them with acceptable tuples in reference relations (known-to-be-clean tuples)
- Common errors: spelling mistakes, missing fields
- E.g., the product_name and description fields of a sales record from a distributor must match the pre-recorded name and description fields in a Product reference relation
Slide 3: INTRODUCTION
- What is data cleaning? The task of detecting and correcting errors in data
- How?
  - periodic data cleaning of an entire data warehouse, OR
  - avoid the introduction of errors during the process of adding new data into the warehouse: validate and correct input tuples before they are loaded
Slide 4: FUZZY MATCH OPERATION
- Similarity function: a notion of closeness between tuples
- Fuzzy match: error-resilient matching of input tuples against the reference table
- Returns the closest K reference tuples
Slide 5: RELATED WORK
1. Trillium, Vality, Axciom: proprietary domain-specific functions
2. Google's "Did you mean?"
3. String edit distance for measuring similarity between tuples
   - does not make use of informative tokens (I3 matched incorrectly to R2)
   - does not take into account the token transposition error (I4 matched to R3)
4. Information retrieval: use of weights
Slide 6: MOTIVATION
- Goal: develop a domain-independent method
- The new fuzzy match similarity (FMS) function views a string as a sequence of tokens and recognizes the varying importance of tokens by explicitly associating weights quantifying their importance
- Tuples matching on high-weight tokens are more similar than tuples matching on low-weight tokens
- The effective use of token weights in combination with data entry errors has not been considered earlier
Slide 7: THE SIMILARITY FUNCTION
- EDIT DISTANCE ed(s1, s2)
  - the minimum number of character edit operations (delete, insert, substitute) required to transform s1 into s2, normalized by the maximum of the lengths of s1 and s2
  - each character edit has UNIT cost
  - E.g., ed(company, corporation) = 7/11 ≈ 0.64
- REFERENCE RELATION R[Tid, A1, ..., An]
  - Ai denotes the ith column (a string-valued attribute); Tid is the tuple identifier, the key of R
  - v[i] denotes the value ai in a tuple v = [r, a1, ..., an]
  - E.g., v[1] in v = [R1, Boeing Company, Seattle, WA, 98004] is "Boeing Company"
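The normalized edit distance above admits a short dynamic-programming sketch (standard unit-cost Levenshtein, divided by the longer string's length):

```python
def edit_distance(s1: str, s2: str) -> float:
    """Minimum number of unit-cost character edits (delete, insert,
    substitute) turning s1 into s2, normalized by max(len(s1), len(s2))."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))  # distances from the empty prefix of s1
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete s1[i-1]
                          curr[j - 1] + 1,     # insert s2[j-1]
                          prev[j - 1] + cost)  # substitute
        prev = curr
    return prev[n] / max(m, n, 1)

round(edit_distance("company", "corporation"), 2)  # → 0.64 (7 edits / 11 chars)
```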
Slide 8: THE SIMILARITY FUNCTION
- TOKENIZATION tok(s)
  - splits string s into a set of tokens, based on a set of delimiters
  - E.g., for tuple v = [R1, Boeing Company, Seattle, WA, 98004], tok(v[1]) = {boeing, company}
  - a column property: each token in tok(v[col]) is tagged with col, so the column from which a token originates is known
  - the token set tok(v) is the union of all sets of tokens in the tuple
- WEIGHT FUNCTION
  - inverse document frequency (IDF): the importance of a token decreases with its frequency freq(t, i), the number of times it occurs in the ith column of the reference relation, for freq(t, i) > 0
  - freq(t, i) = 0 implies t is an erroneous version of some token, so we assign its weight as the average weight of all tokens in the ith column of relation R
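A minimal sketch of tokenization and IDF weighting, assuming whitespace/comma delimiters and the standard log-IDF form w(t) = log(|R| / freq(t)); both choices are assumptions, since the slide fixes neither:

```python
import math
from collections import Counter

def tok(s: str, delimiters: str = " ,") -> set:
    """Split a string into a set of lowercase tokens on the given delimiters."""
    for d in delimiters:
        s = s.replace(d, " ")
    return {t for t in s.lower().split() if t}

def idf_weights(column_values: list) -> dict:
    """Per-token IDF weight for one reference column: w(t) = log(|R| / freq(t)),
    so a token's importance decreases with its frequency in the column."""
    n = len(column_values)
    freq = Counter(t for v in column_values for t in tok(v))
    return {t: math.log(n / f) for t, f in freq.items()}
```

An unseen token (freq = 0) would then get the average weight of the column's tokens, as the slide prescribes.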
Slide 9: FUZZY SIMILARITY FUNCTION (FMS)
- The similarity between an input tuple and a reference tuple is the cost of transforming the former into the latter
- Schema R[A1, ..., An]; input tuple u, reference tuple v
  1. Token replacement cost: ed(t1, t2) * w(t1, i), where i is the column, t1 is a token in u[i] and t2 is a token in v[i]
  2. Token insertion cost: cins * w(t, i), where cins, the token insertion factor, is between 0 and 1
  3. Token deletion cost: w(t, i)
- Insertion and deletion costs differ: it is more likely that tokens are left out during data entry than that spurious tokens are inserted
Slide 10: FUZZY SIMILARITY FUNCTION (FMS)
- TRANSFORMATION COST tc(u[i], v[i]): the cost of the minimum-cost transformation sequence for transforming u[i] into v[i]
  - E.g., input tuple u = [Beoing Corporation, Seattle, WA, 98004], reference tuple v = [Boeing Company, Seattle, WA, 98004]:
    tc(u[1], v[1]) = 0.33 (beoing to boeing) + 0.64 (corporation to company) = 0.97
- FUZZY SIMILARITY MATCH: the similarity between u and v is defined as
    fms(u, v) = 1 - min(tc(u, v) / w(u), 1.0)
  where w(u) is the sum of the weights of all tokens in the token set tok(u)
  - E.g., w(u) = 5.0 (considering unit weight on each token): fms(u, v) = 1 - 0.97/5.0 = 0.806
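The example above can be reproduced with a small sketch. Note that the greedy token pairing below only approximates the minimum-cost transformation sequence (the real tc is a minimum over all sequences); a normalized Levenshtein helper is included so the snippet is self-contained:

```python
def edit_distance(s1: str, s2: str) -> float:
    """Unit-cost Levenshtein distance, normalized by the longer length."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n] / max(m, n, 1)

def tc_column(u_tokens, v_tokens, w, cins=0.5):
    """Greedy approximation of tc(u[i], v[i]): replace each input token by
    its closest reference token (ed * weight), insert missing reference
    tokens (cins * weight), delete surplus input tokens (weight)."""
    remaining = list(v_tokens)
    cost = 0.0
    for t in u_tokens:
        if remaining:
            best = min(remaining, key=lambda r: edit_distance(t, r))
            cost += edit_distance(t, best) * w(t)      # token replacement
            remaining.remove(best)
        else:
            cost += w(t)                               # token deletion
    return cost + sum(cins * w(r) for r in remaining)  # token insertions

def fms(u, v, w, cins=0.5):
    """fms(u, v) = 1 - min(tc(u, v) / w(u), 1.0), tc summed over columns."""
    tc = sum(tc_column(ui.lower().split(), vi.lower().split(), w, cins)
             for ui, vi in zip(u, v))
    w_u = sum(w(t) for col in u for t in col.lower().split())
    return 1.0 - min(tc / w_u, 1.0)
```

With unit weights, the slide's tuples give tc ≈ 0.33 + 0.64 = 0.97 and fms ≈ 1 - 0.97/5.0 = 0.806.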
Slide 11: THE K-FUZZY MATCH PROBLEM
- Given a reference relation R, a minimum similarity threshold c (0 < c <= 1), and an input tuple u, return the set FM(u) of fuzzy matches of at most K tuples from R such that
  1. fms(u, v) >= c, for all v in FM(u)
  2. fms(u, v) >= fms(u, v') for any v in FM(u) and v' in R - FM(u)
- Naive algorithm: scan the reference relation R, comparing each tuple with u
- Proposed method:
  - build an index on the reference relation for quickly retrieving a superset of the target fuzzy matches (pre-processing phase)
  - this indexed relation is called the Error Tolerant Index (ETI); it is indexed using standard B+-trees to perform fast exact lookups
  - to prepare an ETI, an approximation fmsapx of fms is needed
Slide 12: APPROXIMATION OF FMS
- Pre-processing turns the reference relation (not indexable for fuzzy lookups) into the Error Tolerant Index, a standard database relation that is indexable; at query time it yields a candidate set, a superset of FM(u)
- fmsapx upper bounds fms with high probability: it increases the similarity among tuples
- It is a pared-down version of fms, obtained by
  1. ignoring the ordering among tokens in the input and reference tuples
  2. allowing each input token to match with the closest token from the reference tuple
- E.g., the tuples [beoing company, seattle, wa, 98004] and [company beoing, seattle, wa, 98004] are identical to fmsapx
- In fmsapx, closeness between two tokens is measured through the similarity between sets of substrings called q-gram sets
Slide 13: APPROXIMATION OF FMS
- Q-GRAM SET QGq(s): the set of all size-q substrings of s
  - E.g., the 3-gram set QG3(boeing) is {boe, oei, ein, ing}
- MIN-HASH SIMILARITY
  - Let U denote the universe of strings over an alphabet Σ
  - hi: U -> N, i = 1, ..., H, are hash functions mapping U uniformly and randomly to the set of natural numbers N
  - Let S be a set of strings
  - Min-hash signature mh(S) = [mh1(S), ..., mhH(S)], where the ith coordinate is mhi(S) = arg min_{a in S} hi(a)
  - Let I[X] denote an indicator variable over a boolean X, i.e., I[X] = 1 if X is true, else 0
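A sketch of q-gram sets and min-hash signatures; the per-coordinate hash functions hi are simulated here by seeding SHA-1 with the coordinate index, an implementation choice of this sketch, not something the slides fix:

```python
import hashlib

def qgrams(s: str, q: int = 3) -> set:
    """QGq(s): the set of all size-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def min_hash_signature(strings: set, H: int = 2) -> list:
    """H-coordinate min-hash signature: coordinate i keeps the element of
    the set that minimizes the ith hash function."""
    def h(i: int, a: str) -> str:
        return hashlib.sha1(f"{i}:{a}".encode()).hexdigest()
    return [min(strings, key=lambda a: h(i, a)) for i in range(H)]

qgrams("boeing")  # {'boe', 'oei', 'ein', 'ing'}
```

Because each coordinate takes a minimum under a random hash, the probability that two sets agree on a coordinate equals their Jaccard similarity, which is what fmsapx exploits.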
Slide 14: APPROXIMATION OF FMS
- E.g., input tuple u = [Company Beoing, Seattle, NULL, 98004], reference tuple v = [Boeing Company, Seattle, WA, 98004]
- q = 3, H = 2; token weights: w(company) = 0.25, w(beoing) = 0.5, w(seattle) = 1.0, w(98004) = 2.0; total weight = 3.75
- Suppose the min-hash signatures are [oei, ing], [com, pan], [sea, ttl], [wa], [980, 004]
- The score from matching beoing with boeing is w(beoing) * (2/3 + 0.5 * (1 - 1/3)) = w(beoing)
- Since every token matches exactly with a reference token, fmsapx(u, v) = 3.75/3.75 = 1
Slide 15: ERROR TOLERANT INDEX
- The primary purpose of the ETI is to enable, for each input tuple u, the efficient retrieval of a candidate set S of reference tuples with similarity greater than the minimum similarity threshold
- fmsapx is measured by comparing the min-hash signatures of the tokens in tok(u) and tok(v)
- So to determine the candidate set, we need to efficiently identify, for each token t in tok(u), a set Si of reference tuples sharing min-hash q-grams with those of t
Slide 16: ERROR TOLERANT INDEX
- The union of all the sets Si (one per input token) contains the candidate set S
- Thus we store in the ETI each q-gram s along with the list of tids of all reference tuples with tokens whose min-hash signatures contain s
Slide 17: ERROR TOLERANT INDEX
- The obvious method of computing all ETI tuples in main memory, by scanning and processing each reference tuple, is not scalable
- pre-ETI: leverage the underlying database system by building a temporary relation, pre-ETI, and then constructing the ETI from it using SQL queries
- Schema of pre-ETI: [QGram, Coordinate, Column, Tid]
- E.g., for R1 = [Boeing Company, Seattle, WA, 98004], the size-2 signature of the token "company" belonging to column 1 is [com, pan]; insert into pre-ETI the two rows [com, 1, 1, R1] and [pan, 2, 1, R1]
- SQL query: Select QGram, Coordinate, Column, Tid from pre-ETI order by QGram, Coordinate, Column, Tid
- freq = size of the group
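The pre-ETI row generation can be sketched as follows; `build_pre_eti` is a hypothetical helper assuming whitespace tokenization and SHA-1-seeded hash functions for the min-hash coordinates, and tokens shorter than q are kept whole as their own "gram" (all assumptions of this sketch):

```python
import hashlib

def qgrams(s, q=3):
    """All size-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def min_hash_signature(strings, H=2):
    """Coordinate i keeps the element minimizing the ith (seeded) hash."""
    def h(i, a):
        return hashlib.sha1(f"{i}:{a}".encode()).hexdigest()
    return [min(strings, key=lambda a: h(i, a)) for i in range(H)]

def build_pre_eti(reference, q=3, H=2):
    """Emit pre-ETI rows (QGram, Coordinate, Column, Tid): one row per
    coordinate of the min-hash signature of every token of every column."""
    rows = []
    for tid, columns in reference:
        for col, value in enumerate(columns, start=1):
            for token in value.lower().split():
                grams = qgrams(token, q) or {token}  # short-token fallback
                for coord, g in enumerate(min_hash_signature(grams, H), start=1):
                    rows.append((g, coord, col, tid))
    return rows
```

The rows would then be bulk-loaded into the pre-ETI relation and sorted/grouped with the SQL query on the slide to form the ETI.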
Slide 18: QUERY PROCESSING
- Goal: reduce the number of lookups against the reference relation by effectively using the ETI
- BASIC ALGORITHM: fetches tid-lists by looking up the ETI on all q-grams in the min-hash signatures of all tokens in u
  1. For each token t in tok(u), compute its IDF weight w(t)
  2. Determine the min-hash signature mh(t) of each token
  3. Assign weight w(t)/|mh(t)| to each q-gram in mh(t)
  4. Using the ETI, determine the candidate set S of reference tuples as per fmsapx
  5. Fetch the tuples in S from the reference relation, and test them as per fms
  6. Among the tuples that pass the test, return the K tuples with the highest similarity scores
Slide 19: CANDIDATE SET DETERMINATION
- At any point, the score of a tid equals the sum of the weights of all q-grams whose tid-lists it belongs to, i.e., each tid is assigned a score proportional to the weight w(t) of the parent token t
- If tid r is very close to tuple u, then r has a high overall score
- Tids whose overall score is greater than w(u)*c minus an adjustment term (a correction for approximating the edit distance between tokens by the similarity between their q-gram sets) constitute the candidate set
Slide 20: CANDIDATE SET DETERMINATION
- E.g., input tuple u = [Beoing Company, Seattle, WA, 98004]; q = 3, H = 2
- [q1, q2]w denotes a min-hash signature with q1, q2 each assigned weight w
- Token weights: boeing = 0.5, company = 0.25, seattle = 1.0, WA = 0.75, 98004 = 2.0
- Signatures: [eoi, ing]0.25, [com, pan]0.125, [sea, ttl]0.5, [wa]0.75, [980, 004]1.0
- Look up the ETI to fetch the following tid-lists: eoi -> {}, ing -> {R1}, com -> {R1, R3}, pan -> {R1, R3}, sea -> {R1, R2, R3}, ttl -> {R1, R2, R3}, wa -> {R1, R2, R3}, 980 -> {R1, R2, R3}, 004 -> {R1}
- R1 gets an overall score of 4.25; R2 = 2.75, R3 = 3.0
- Depending on the threshold, the candidate set is a subset of {R1, R2, R3}
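The tid scoring step can be sketched as a weighted accumulation over the fetched tid-lists:

```python
from collections import defaultdict

def candidate_scores(qgram_weights, tid_lists):
    """Each tid accumulates the weights of all q-grams whose tid-lists
    contain it; high-scoring tids form the candidate set."""
    scores = defaultdict(float)
    for g, w in qgram_weights.items():
        for tid in tid_lists.get(g, ()):
            scores[tid] += w
    return dict(scores)
```

Applied to the slide's example weights and tid-lists, it yields R1 = 4.25, R2 = 2.75, R3 = 3.0; tids scoring above w(u)*c minus the adjustment term survive.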
Slide 21: OPTIMISTIC SHORT CIRCUITING
- Need: reduce the number of ETI lookups
- Intuition: the weights of input tokens (and hence of min-hash q-grams) vary significantly
- Look up the ETI on just a few important q-grams, and if a fetching test succeeds, optimistically short-circuit the algorithm by fetching the current closest K reference tuples
- A stopping test verifies whether these tuples are actually the closest K tuples
- This avoids ETI lookups on a number of unimportant q-grams
Slide 22: OPTIMISTIC SHORT CIRCUITING
- E.g., input tuple u = [Beoing Company, Seattle, WA, 98004]; q = 3, H = 2, K = 1
- [q1, q2]w denotes a min-hash signature with q1, q2 each assigned weight w
- Token weights: boeing = 0.5, company = 0.25, seattle = 1.0, WA = 0.75, 98004 = 2.0
- Signatures: [eoi, ing]0.25, [com, pan]0.125, [sea, ttl]0.5, [wa]0.75, [980, 004]1.0
- We order the q-grams in decreasing order of their weights and fetch their tid-lists in this order:
  1. First fetch the tid-list {R1, R2, R3} of 980; we need further distinguishing at this point
  2. Fetch the list {R1} of the next most important q-gram, 004
  3. At this point, the scores are R1 = 2.0, R2 = 1.0, R3 = 1.0
  4. Extrapolated score for R1 over ALL the q-grams = 4.5; score for R2 <= 1.0 + (4.5 - 2.0) = 3.5 (its current score plus the sum of the weights of the remaining q-grams)
  5. Since 4.5 > 3.5, we anticipate R1 to be the closest match, fetch it from R, and compute fms(u, R1)
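The walk above (for K = 1) can be sketched as follows; the stopping test here compares the leader's optimistic extrapolated score against the best possible score of any other tid, mirroring step 4:

```python
from collections import defaultdict

def osc_lookup(qgram_weights, tid_lists):
    """Probe q-grams in decreasing weight order (K = 1 sketch). Stop early
    once the current leader's optimistic score (assuming it appears in every
    remaining tid-list) exceeds every other tid's upper bound."""
    scores = defaultdict(float)
    ordered = sorted(qgram_weights.items(), key=lambda kv: -kv[1])
    remaining = sum(qgram_weights.values())
    leader = None
    for lookups, (g, w) in enumerate(ordered, start=1):
        for tid in tid_lists.get(g, ()):
            scores[tid] += w
        remaining -= w
        if scores:
            leader = max(scores, key=scores.get)
            optimistic = scores[leader] + remaining  # leader hits all the rest
            runner_up = max((s for t, s in scores.items() if t != leader),
                            default=0.0) + remaining  # bound for any other tid
            if optimistic > runner_up:
                return leader, lookups  # anticipate leader; verify with fms
    return leader, len(ordered)
```

On the slide's example this stops after the two lookups (980, then 004) and returns R1, matching step 5; the fms verification that follows is what makes the short circuit safe.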
Slide 23: OPTIMISTIC SHORT CIRCUITING
- The robustness of the stopping test ensures that inaccuracy in estimating the score of R1 over all q-grams does not affect the correctness of the final result
- It may, however, affect performance
- OSC differs from the basic algorithm in:
  1. the order in which we look up q-grams against the ETI
  2. the additional short-circuiting procedure
Slide 24: RESOURCE REQUIREMENTS
- ETI building time, expensive steps:
  1. scan of the reference relation R
  2. writing the pre-ETI
  3. sorting the pre-ETI
  4. writing the ETI
- The total I/O cost during these phases is O(mavg * q * H * |R| + |ETI| * (1 + 2q)), where mavg is the average number of tokens per tuple and |ETI| is the number of tuples in the ETI
- Input tuple processing, expensive steps:
  1. looking up the ETI for the tid-lists of q-grams
  2. processing the tid-lists
  3. fetching the tuples in the candidate set
Slide 25: EXTENSIONS
1. INDEXING USING TOKENS
   - effectively use tokens for further improving efficiency
   - intuition: the input tuple u = [Beoing Company, Seattle, WA, 98004] has only one incorrect token; the rest are all correct
   - index reference tuples on tokens as well, so we can directly look up the ETI against these tokens instead of several min-hash signatures
   - assign the 0th coordinate in the signature to the token itself
   - the challenge is to gain efficiency without losing accuracy
2. COLUMN WEIGHTS
   - a token t in the ith column gets weight w(t) * Wi
   - used to assign varying importance to columns
Slide 26: EXPERIMENTS
- Clean reference table: Customer [name, city, state, zip code] with 1.7 million tuples
- Input datasets created by introducing errors into randomly selected subsets of Customer tuples
- 2 types of error-injection techniques:
  - Type I: all tokens in a column are equally likely to be erroneous
  - Type II: tokens with higher frequency are more likely to be erroneous
- If we decide (with probability pi) to introduce an error into u[i], we select from among several types of errors with the conditional probabilities as shown
Slide 27: EXPERIMENTS (contd.)
- PARAMETER SETTINGS
  - K = 1 (closest fuzzy match)
  - q-gram size q = 4
  - minimum similarity threshold c = 0.0 (i.e., all exact K matches)
  - cins (token insertion factor) = 0.5
- MACHINE SPECIFICATIONS: experiments were run on a 930 MHz Pentium with 256 MB RAM, under MS Windows XP
- The algorithm was implemented on the MS SQLServer 2000 database, using OLEDB for database access
Slide 28: EVALUATION METRICS
1. Normalized elapsed time: the elapsed time to process the set of input tuples using a fuzzy match algorithm, divided by the elapsed time to process one input tuple using the naive algorithm
2. Accuracy: the percentage of input tuples for which a fuzzy match algorithm identifies the seed tuple, from which the erroneous input tuple was generated, as the closest reference tuple
- NOTATION for the signature-computation strategy used: A_H, where A in {Q, QT} and H >= 0; Q denotes q-grams only, QT denotes q-grams plus token signatures
Slide 29: EXPERIMENTAL RESULTS - ACCURACY
- Accuracy comparison between ed and fms:
  - accuracy for Type II errors is better than for Type I
  - consider only datasets with Type I errors; each of D1, D2, D3 has 1655 tuples
- Figure 5:
  1. min-hash signatures significantly improve accuracy (Q_H better than QT_0)
  2. adding tokens to the signature does not negatively impact accuracy
  3. even small signature sizes yield high gains in accuracy
Slide 30: EXPERIMENTAL RESULTS - EFFICIENCY
- Figure 6:
  - the algorithms are 2 to 3 orders of magnitude faster than the naive algorithm
  - query processing time decreases with signature size
  - for all H >= 1, QT_H is faster than Q_H
- Figure 7:
  - QT_H's cost is greater than Q_H's
  - because we persist the ETI as a standard indexed relation, we can use it for subsequent batches of input tuples if the reference table does not change
Slide 31: EXPERIMENTAL RESULTS - EFFICIENCY
- Figure 8:
  - more q-grams help decrease candidate set sizes by better distinguishing the similarity scores of tids
  - when OSC succeeds, we retrieve very few (around 1 per input tuple) candidate tuples
- Figure 9:
  - the average number of tids processed per input tuple increases
  - but this is more than compensated for by the average reduction in candidate set sizes
Slide 32: EXPERIMENTAL RESULTS - EFFICIENCY
- Figure 10:
  - the OSC optimization is successful for 50-75% of the input tuples
  - the success fraction increases with signature size
Slide 33: CONCLUSIONS
- Generalized edit distance similarity by incorporating the notion of tokens and their importance, to develop an accurate fuzzy match
- Developed an error-tolerant index and an efficient algorithm for identifying, with high probability, the closest fuzzy-matching reference tuples