Title: Robust and Efficient Fuzzy Match for Online Data Cleaning
Slide 1: Robust and Efficient Fuzzy Match for Online Data Cleaning
- Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti (Microsoft Research); Rajeev Motwani (Stanford University)
- Databases and Information Management Seminar
- Presented by Gayatri Swamynathan
Slide 2: INTRODUCTION
- Data warehouses: high data quality is essential
- Need to cleanse incoming data tuples from external sources: match them with acceptable tuples in reference relations (known-to-be-clean tuples)
- Common errors: spelling mistakes, missing fields
- E.g., the product_name and description fields of a sales record from a distributor must match the pre-recorded name and description fields in a Product reference relation
Slide 3: INTRODUCTION
- What is data cleaning? The task of detecting and correcting errors in data
- How?
  - periodic data cleaning of an entire data warehouse, OR
  - avoid the introduction of errors during the process of adding new data into the warehouse: validate and correct input tuples before they are loaded
Slide 4: FUZZY MATCH OPERATION
- Similarity function: a notion of closeness between tuples
- Fuzzy match: error-resilient matching of input tuples against the reference table
- Returns the closest K reference tuples
Slide 5: RELATED WORK
1. Trillium, Vality, Axciom: proprietary domain-specific functions
2. Google's "Did you mean?"
3. String edit distance for measuring similarity between tuples
   - does not make use of informative tokens (I3 matched incorrectly to R2)
   - does not take into account the token transposition error (I4 matched to R3)
4. Information retrieval: use of weights
Slide 6: MOTIVATION
- Goal: develop a domain-independent method
- The new fuzzy match similarity (FMS) function views a string as a sequence of tokens and recognizes the varying importance of tokens by explicitly associating weights quantifying their importance
- Tuples matching on high-weight tokens are more similar than tuples matching on low-weight tokens
- The effective use of token weights in combination with data entry errors has not been considered earlier
Slide 7: THE SIMILARITY FUNCTION
- EDIT DISTANCE ed(s1, s2)
  - the minimum number of character edit operations (delete, insert, substitute) required to transform s1 into s2, normalized by the maximum of the lengths of s1 and s2
  - each character edit has UNIT cost
  - E.g., ed(company, corporation) = 7/11 ≈ 0.64
- REFERENCE RELATION R[Tid, A1, ..., An]
  - Ai denotes the ith column (a string-valued attribute); Tid is the tuple identifier, the key of R
  - v[i] denotes the value ai in a tuple v = [r, a1, ..., an]
  - E.g., v[1] in v = [R1, Boeing Company, Seattle, WA, 98004] is "Boeing Company"
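The normalized edit distance above admits a short dynamic-programming sketch (standard unit-cost Levenshtein, divided by the longer string's length):

```python
def edit_distance(s1: str, s2: str) -> float:
    """Minimum number of unit-cost character edits (delete, insert,
    substitute) turning s1 into s2, normalized by max(len(s1), len(s2))."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))  # distances from the empty prefix of s1
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete s1[i-1]
                          curr[j - 1] + 1,     # insert s2[j-1]
                          prev[j - 1] + cost)  # substitute
        prev = curr
    return prev[n] / max(m, n, 1)

round(edit_distance("company", "corporation"), 2)  # → 0.64 (7 edits / 11 chars)
```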
Slide 8: THE SIMILARITY FUNCTION
- TOKENIZATION tok(s)
  - splits string s into a set of tokens, based on a set of delimiters
  - E.g., for tuple v = [R1, Boeing Company, Seattle, WA, 98004], tok(v[1]) = {boeing, company}
  - a column property: each token in tok(v[col]) is tagged with col, so the column from which a token originates is known
  - the token set tok(v) is the union of all sets of tokens in the tuple
- WEIGHT FUNCTION
  - inverse document frequency (IDF): the importance of a token decreases with its frequency freq(t, i), the number of times it occurs in the ith column of the reference relation, for freq(t, i) > 0
  - freq(t, i) = 0 implies t is an erroneous version of some token, so we assign its weight as the average weight of all tokens in the ith column of relation R
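A minimal sketch of tokenization and IDF weighting, assuming whitespace/comma delimiters and the standard log-IDF form w(t) = log(|R| / freq(t)); both choices are assumptions, since the slide fixes neither:

```python
import math
from collections import Counter

def tok(s: str, delimiters: str = " ,") -> set:
    """Split a string into a set of lowercase tokens on the given delimiters."""
    for d in delimiters:
        s = s.replace(d, " ")
    return {t for t in s.lower().split() if t}

def idf_weights(column_values: list) -> dict:
    """Per-token IDF weight for one reference column: w(t) = log(|R| / freq(t)),
    so a token's importance decreases with its frequency in the column."""
    n = len(column_values)
    freq = Counter(t for v in column_values for t in tok(v))
    return {t: math.log(n / f) for t, f in freq.items()}
```

An unseen token (freq = 0) would then get the average weight of the column's tokens, as the slide prescribes.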
Slide 9: FUZZY SIMILARITY FUNCTION (FMS)
- The similarity between an input tuple and a reference tuple is the cost of transforming the former into the latter
- Schema R[A1, ..., An]; input tuple u, reference tuple v
  1. Token replacement cost: ed(t1, t2) * w(t1, i), where i is the column, t1 is a token in u[i] and t2 is a token in v[i]
  2. Token insertion cost: cins * w(t, i), where cins, the token insertion factor, is between 0 and 1
  3. Token deletion cost: w(t, i)
- Insertion and deletion costs differ: it is more likely that tokens are left out during data entry than that spurious tokens are inserted
Slide 10: FUZZY SIMILARITY FUNCTION (FMS)
- TRANSFORMATION COST tc(u[i], v[i]): the cost of the minimum-cost transformation sequence for transforming u[i] into v[i]
  - E.g., input tuple u = [Beoing Corporation, Seattle, WA, 98004], reference tuple v = [Boeing Company, Seattle, WA, 98004]:
    tc(u[1], v[1]) = 0.33 (beoing to boeing) + 0.64 (corporation to company) = 0.97
- FUZZY SIMILARITY MATCH: the similarity between u and v is defined as
    fms(u, v) = 1 - min(tc(u, v) / w(u), 1.0)
  where w(u) is the sum of the weights of all tokens in the token set tok(u)
  - E.g., w(u) = 5.0 (considering unit weight on each token): fms(u, v) = 1 - 0.97/5.0 = 0.806
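The example above can be reproduced with a small sketch. Note that the greedy token pairing below only approximates the minimum-cost transformation sequence (the real tc is a minimum over all sequences); a normalized Levenshtein helper is included so the snippet is self-contained:

```python
def edit_distance(s1: str, s2: str) -> float:
    """Unit-cost Levenshtein distance, normalized by the longer length."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n] / max(m, n, 1)

def tc_column(u_tokens, v_tokens, w, cins=0.5):
    """Greedy approximation of tc(u[i], v[i]): replace each input token by
    its closest reference token (ed * weight), insert missing reference
    tokens (cins * weight), delete surplus input tokens (weight)."""
    remaining = list(v_tokens)
    cost = 0.0
    for t in u_tokens:
        if remaining:
            best = min(remaining, key=lambda r: edit_distance(t, r))
            cost += edit_distance(t, best) * w(t)      # token replacement
            remaining.remove(best)
        else:
            cost += w(t)                               # token deletion
    return cost + sum(cins * w(r) for r in remaining)  # token insertions

def fms(u, v, w, cins=0.5):
    """fms(u, v) = 1 - min(tc(u, v) / w(u), 1.0), tc summed over columns."""
    tc = sum(tc_column(ui.lower().split(), vi.lower().split(), w, cins)
             for ui, vi in zip(u, v))
    w_u = sum(w(t) for col in u for t in col.lower().split())
    return 1.0 - min(tc / w_u, 1.0)
```

With unit weights, the slide's tuples give tc ≈ 0.33 + 0.64 = 0.97 and fms ≈ 1 - 0.97/5.0 = 0.806.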
Slide 11: THE K-FUZZY MATCH PROBLEM
- Given a reference relation R, a minimum similarity threshold c (0 < c <= 1), and an input tuple u, return the set FM(u) of fuzzy matches of at most K tuples from R such that
  1. fms(u, v) >= c, for all v in FM(u)
  2. fms(u, v) >= fms(u, v') for any v in FM(u) and v' in R - FM(u)
- Naive algorithm: scan the reference relation R, comparing each tuple with u
- Proposed method:
  - build an index on the reference relation for quickly retrieving a superset of the target fuzzy matches (pre-processing phase)
  - this indexed relation is called the Error Tolerant Index (ETI); it is indexed using standard B+-trees to perform fast exact lookups
  - to prepare an ETI, an approximation fmsapx of fms is needed
Slide 12: APPROXIMATION OF FMS
- Pre-processing turns the reference relation (not indexable for fuzzy lookups) into the Error Tolerant Index, a standard database relation that is indexable; at query time it yields a candidate set, a superset of FM(u)
- fmsapx upper bounds fms with high probability: it increases the similarity among tuples
- It is a pared-down version of fms, obtained by
  1. ignoring the ordering among tokens in the input and reference tuples
  2. allowing each input token to match with the closest token from the reference tuple
- E.g., the tuples [beoing company, seattle, wa, 98004] and [company beoing, seattle, wa, 98004] are identical to fmsapx
- In fmsapx, closeness between two tokens is measured through the similarity between sets of substrings called q-gram sets
Slide 13: APPROXIMATION OF FMS
- Q-GRAM SET QGq(s): the set of all size-q substrings of s
  - E.g., the 3-gram set QG3(boeing) is {boe, oei, ein, ing}
- MIN-HASH SIMILARITY
  - Let U denote the universe of strings over an alphabet Σ
  - hi: U -> N, i = 1, ..., H, are hash functions mapping U uniformly and randomly to the set of natural numbers N
  - Let S be a set of strings
  - Min-hash signature mh(S) = [mh1(S), ..., mhH(S)], where the ith coordinate is mhi(S) = arg min_{a in S} hi(a)
  - Let I[X] denote an indicator variable over a boolean X, i.e., I[X] = 1 if X is true, else 0
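A sketch of q-gram sets and min-hash signatures; the per-coordinate hash functions hi are simulated here by seeding SHA-1 with the coordinate index, an implementation choice of this sketch, not something the slides fix:

```python
import hashlib

def qgrams(s: str, q: int = 3) -> set:
    """QGq(s): the set of all size-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def min_hash_signature(strings: set, H: int = 2) -> list:
    """H-coordinate min-hash signature: coordinate i keeps the element of
    the set that minimizes the ith hash function."""
    def h(i: int, a: str) -> str:
        return hashlib.sha1(f"{i}:{a}".encode()).hexdigest()
    return [min(strings, key=lambda a: h(i, a)) for i in range(H)]

qgrams("boeing")  # {'boe', 'oei', 'ein', 'ing'}
```

Because each coordinate takes a minimum under a random hash, the probability that two sets agree on a coordinate equals their Jaccard similarity, which is what fmsapx exploits.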
Slide 14: APPROXIMATION OF FMS
- E.g., input tuple u = [Company Beoing, Seattle, NULL, 98004], reference tuple v = [Boeing Company, Seattle, WA, 98004]
- q = 3, H = 2; token weights: w(company) = 0.25, w(beoing) = 0.5, w(seattle) = 1.0, w(98004) = 2.0; total weight = 3.75
- Suppose the min-hash signatures are [oei, ing], [com, pan], [sea, ttl], [wa], [980, 004]
- The score from matching beoing with boeing is w(beoing) * (2/3 + 0.5 * (1 - 1/3)) = w(beoing)
- Since every token matches exactly with a reference token, fmsapx(u, v) = 3.75/3.75 = 1
Slide 15: ERROR TOLERANT INDEX
- The primary purpose of the ETI is to enable, for each input tuple u, the efficient retrieval of a candidate set S of reference tuples with similarity greater than the minimum similarity threshold
- fmsapx is measured by comparing the min-hash signatures of the tokens in tok(u) and tok(v)
- So to determine the candidate set, we need to efficiently identify, for each token t in tok(u), a set Si of reference tuples sharing min-hash q-grams with those of t
Slide 16: ERROR TOLERANT INDEX
- The union of all the sets Si (one per input token) contains the candidate set S
- Thus we store in the ETI each q-gram s along with the list of tids of all reference tuples with tokens whose min-hash signatures contain s
Slide 17: ERROR TOLERANT INDEX
- The obvious method of computing all ETI tuples in main memory, by scanning and processing each reference tuple, is not scalable
- pre-ETI: leverage the underlying database system by building a temporary relation, pre-ETI, and then constructing the ETI from it using SQL queries
- Schema of pre-ETI: [QGram, Coordinate, Column, Tid]
- E.g., for R1 = [Boeing Company, Seattle, WA, 98004], the size-2 signature of the token "company" belonging to column 1 is [com, pan]; insert into pre-ETI the two rows [com, 1, 1, R1] and [pan, 2, 1, R1]
- SQL query: Select QGram, Coordinate, Column, Tid from pre-ETI order by QGram, Coordinate, Column, Tid
- freq = size of the group
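The pre-ETI row generation can be sketched as follows; `build_pre_eti` is a hypothetical helper assuming whitespace tokenization and SHA-1-seeded hash functions for the min-hash coordinates, and tokens shorter than q are kept whole as their own "gram" (all assumptions of this sketch):

```python
import hashlib

def qgrams(s, q=3):
    """All size-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def min_hash_signature(strings, H=2):
    """Coordinate i keeps the element minimizing the ith (seeded) hash."""
    def h(i, a):
        return hashlib.sha1(f"{i}:{a}".encode()).hexdigest()
    return [min(strings, key=lambda a: h(i, a)) for i in range(H)]

def build_pre_eti(reference, q=3, H=2):
    """Emit pre-ETI rows (QGram, Coordinate, Column, Tid): one row per
    coordinate of the min-hash signature of every token of every column."""
    rows = []
    for tid, columns in reference:
        for col, value in enumerate(columns, start=1):
            for token in value.lower().split():
                grams = qgrams(token, q) or {token}  # short-token fallback
                for coord, g in enumerate(min_hash_signature(grams, H), start=1):
                    rows.append((g, coord, col, tid))
    return rows
```

The rows would then be bulk-loaded into the pre-ETI relation and sorted/grouped with the SQL query on the slide to form the ETI.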
Slide 18: QUERY PROCESSING
- Goal: reduce the number of lookups against the reference relation by effectively using the ETI
- BASIC ALGORITHM: fetches tid-lists by looking up the ETI on all q-grams in the min-hash signatures of all tokens in u
  1. For each token t in tok(u), compute its IDF weight w(t)
  2. Determine the min-hash signature mh(t) of each token
  3. Assign weight w(t)/|mh(t)| to each q-gram in mh(t)
  4. Using the ETI, determine the candidate set S of reference tuples as per fmsapx
  5. Fetch the tuples in S from the reference relation, and test them as per fms
  6. Among the tuples that pass the test, return the K tuples with the highest similarity scores
Slide 19: CANDIDATE SET DETERMINATION
- At any point, the score of a tid equals the sum of the weights of all q-grams whose tid-lists it belongs to, i.e., each tid is assigned a score proportional to the weight w(t) of the parent token t
- If tid r is very close to tuple u, then r has a high overall score
- Tids whose overall score is greater than w(u)*c minus an adjustment term (a correction for approximating the edit distance between tokens by the similarity between their q-gram sets) constitute the candidate set
Slide 20: CANDIDATE SET DETERMINATION
- E.g., input tuple u = [Beoing Company, Seattle, WA, 98004]; q = 3, H = 2
- [q1, q2]w denotes a min-hash signature with q1, q2 each assigned weight w
- Token weights: boeing = 0.5, company = 0.25, seattle = 1.0, WA = 0.75, 98004 = 2.0
- Signatures: [eoi, ing]0.25, [com, pan]0.125, [sea, ttl]0.5, [wa]0.75, [980, 004]1.0
- Look up the ETI to fetch the following tid-lists: eoi -> {}, ing -> {R1}, com -> {R1, R3}, pan -> {R1, R3}, sea -> {R1, R2, R3}, ttl -> {R1, R2, R3}, wa -> {R1, R2, R3}, 980 -> {R1, R2, R3}, 004 -> {R1}
- R1 gets an overall score of 4.25; R2 = 2.75, R3 = 3.0
- Depending on the threshold, the candidate set is a subset of {R1, R2, R3}
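The tid scoring step can be sketched as a weighted accumulation over the fetched tid-lists:

```python
from collections import defaultdict

def candidate_scores(qgram_weights, tid_lists):
    """Each tid accumulates the weights of all q-grams whose tid-lists
    contain it; high-scoring tids form the candidate set."""
    scores = defaultdict(float)
    for g, w in qgram_weights.items():
        for tid in tid_lists.get(g, ()):
            scores[tid] += w
    return dict(scores)
```

Applied to the slide's example weights and tid-lists, it yields R1 = 4.25, R2 = 2.75, R3 = 3.0; tids scoring above w(u)*c minus the adjustment term survive.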
Slide 21: OPTIMISTIC SHORT CIRCUITING
- Need: reduce the number of ETI lookups
- Intuition: the weights of input tokens (and hence of min-hash q-grams) vary significantly
- Look up the ETI on just a few important q-grams, and if a fetching test succeeds, optimistically short-circuit the algorithm by fetching the current closest K reference tuples
- A stopping test verifies whether these tuples are actually the closest K tuples
- This avoids ETI lookups on a number of unimportant q-grams
Slide 22: OPTIMISTIC SHORT CIRCUITING
- E.g., input tuple u = [Beoing Company, Seattle, WA, 98004]; q = 3, H = 2, K = 1
- [q1, q2]w denotes a min-hash signature with q1, q2 each assigned weight w
- Token weights: boeing = 0.5, company = 0.25, seattle = 1.0, WA = 0.75, 98004 = 2.0
- Signatures: [eoi, ing]0.25, [com, pan]0.125, [sea, ttl]0.5, [wa]0.75, [980, 004]1.0
- We order the q-grams in decreasing order of their weights and fetch their tid-lists in this order:
  1. First fetch the tid-list {R1, R2, R3} of 980; we need further distinguishing at this point
  2. Fetch the list {R1} of the next most important q-gram, 004
  3. At this point, the scores are R1 = 2.0, R2 = 1.0, R3 = 1.0
  4. Extrapolated score for R1 over ALL the q-grams = 4.5; score for R2 <= 1.0 + (4.5 - 2.0) = 3.5 (its current score plus the sum of the weights of the remaining q-grams)
  5. Since 4.5 > 3.5, we anticipate R1 to be the closest match, fetch it from R, and compute fms(u, R1)
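The walk above (for K = 1) can be sketched as follows; the stopping test here compares the leader's optimistic extrapolated score against the best possible score of any other tid, mirroring step 4:

```python
from collections import defaultdict

def osc_lookup(qgram_weights, tid_lists):
    """Probe q-grams in decreasing weight order (K = 1 sketch). Stop early
    once the current leader's optimistic score (assuming it appears in every
    remaining tid-list) exceeds every other tid's upper bound."""
    scores = defaultdict(float)
    ordered = sorted(qgram_weights.items(), key=lambda kv: -kv[1])
    remaining = sum(qgram_weights.values())
    leader = None
    for lookups, (g, w) in enumerate(ordered, start=1):
        for tid in tid_lists.get(g, ()):
            scores[tid] += w
        remaining -= w
        if scores:
            leader = max(scores, key=scores.get)
            optimistic = scores[leader] + remaining  # leader hits all the rest
            runner_up = max((s for t, s in scores.items() if t != leader),
                            default=0.0) + remaining  # bound for any other tid
            if optimistic > runner_up:
                return leader, lookups  # anticipate leader; verify with fms
    return leader, len(ordered)
```

On the slide's example this stops after the two lookups (980, then 004) and returns R1, matching step 5; the fms verification that follows is what makes the short circuit safe.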
Slide 23: OPTIMISTIC SHORT CIRCUITING
- The robustness of the stopping test ensures that inaccuracy in estimating the score of R1 over all q-grams does not affect the correctness of the final result
- It may, however, affect performance
- OSC differs from the basic algorithm in:
  1. the order in which we look up q-grams against the ETI
  2. the additional short-circuiting procedure
Slide 24: RESOURCE REQUIREMENTS
- ETI building time, expensive steps:
  1. scan of the reference relation R
  2. writing the pre-ETI
  3. sorting the pre-ETI
  4. writing the ETI
- The total I/O cost during these phases is O(mavg * q * H * |R| + |ETI| * (1 + 2q)), where mavg is the average number of tokens per tuple and |ETI| is the number of tuples in the ETI
- Input tuple processing, expensive steps:
  1. looking up the ETI for the tid-lists of q-grams
  2. processing the tid-lists
  3. fetching the tuples in the candidate set
Slide 25: EXTENSIONS
1. INDEXING USING TOKENS
   - effectively use tokens for further improving efficiency
   - intuition: the input tuple u = [Beoing Company, Seattle, WA, 98004] has only one incorrect token; the rest are all correct
   - index reference tuples on tokens as well, so we can directly look up the ETI against these tokens instead of several min-hash signatures
   - assign the 0th coordinate in the signature to the token itself
   - the challenge is to gain efficiency without losing accuracy
2. COLUMN WEIGHTS
   - a token t in the ith column gets weight w(t) * Wi
   - used to assign varying importance to columns
Slide 26: EXPERIMENTS
- Clean reference table: Customer [name, city, state, zip code] with 1.7 million tuples
- Input datasets created by introducing errors into randomly selected subsets of Customer tuples
- 2 types of error-injection techniques:
  - Type I: all tokens in a column are equally likely to be erroneous
  - Type II: tokens with higher frequency are more likely to be erroneous
- If we decide (with probability pi) to introduce an error into u[i], we select from among several types of errors with the conditional probabilities as shown
Slide 27: EXPERIMENTS (contd.)
- PARAMETER SETTINGS
  - K = 1 (closest fuzzy match)
  - q-gram size q = 4
  - minimum similarity threshold c = 0.0 (i.e., all exact K matches)
  - cins (token insertion factor) = 0.5
- MACHINE SPECIFICATIONS: experiments were run on a 930 MHz Pentium with 256 MB RAM, under MS Windows XP
- The algorithm was implemented on the MS SQLServer 2000 database, using OLEDB for database access
Slide 28: EVALUATION METRICS
1. Normalized elapsed time: the elapsed time to process the set of input tuples using a fuzzy match algorithm, divided by the elapsed time to process one input tuple using the naive algorithm
2. Accuracy: the percentage of input tuples for which a fuzzy match algorithm identifies the seed tuple, from which the erroneous input tuple was generated, as the closest reference tuple
- NOTATION for the signature-computation strategy used: A_H, where A in {Q, QT} and H >= 0; Q denotes q-grams only, QT denotes q-grams plus token signatures
Slide 29: EXPERIMENTAL RESULTS - ACCURACY
- Accuracy comparison between ed and fms:
  - accuracy for Type II errors is better than for Type I
  - consider only datasets with Type I errors; each of D1, D2, D3 has 1655 tuples
- Figure 5:
  1. min-hash signatures significantly improve accuracy (Q_H better than QT_0)
  2. adding tokens to the signature does not negatively impact accuracy
  3. even small signature sizes yield high gains in accuracy
Slide 30: EXPERIMENTAL RESULTS - EFFICIENCY
- Figure 6:
  - the algorithms are 2 to 3 orders of magnitude faster than the naive algorithm
  - query processing time decreases with signature size
  - for all H >= 1, QT_H is faster than Q_H
- Figure 7:
  - QT_H's cost is greater than Q_H's
  - because we persist the ETI as a standard indexed relation, we can use it for subsequent batches of input tuples if the reference table does not change
Slide 31: EXPERIMENTAL RESULTS - EFFICIENCY
- Figure 8:
  - more q-grams help decrease candidate set sizes by better distinguishing the similarity scores of tids
  - when OSC succeeds, we retrieve very few (around 1 per input tuple) candidate tuples
- Figure 9:
  - the average number of tids processed per input tuple increases
  - but this is more than compensated for by the average reduction in candidate set sizes
Slide 32: EXPERIMENTAL RESULTS - EFFICIENCY
- Figure 10:
  - the OSC optimization is successful for 50-75% of the input tuples
  - the success fraction increases with signature size
Slide 33: CONCLUSIONS
- Generalized edit distance similarity by incorporating the notion of tokens and their importance, to develop an accurate fuzzy match
- Developed an error-tolerant index and an efficient algorithm for identifying, with high probability, the closest fuzzy-matching reference tuples