1
Robust and Efficient Fuzzy Match for Online Data Cleaning
  • Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti (Microsoft Research)
  • Rajeev Motwani (Stanford University)
  • Databases and Information Management Seminar
  • Presented by Gayatri Swamynathan

2
INTRODUCTION
  • in data warehouses, high data quality is essential
  • incoming data tuples from external sources need to be cleansed:
    matched against acceptable tuples in reference relations
    (known-to-be-clean tuples)
  • common errors: spelling mistakes, missing fields
  • E.g., the [product_name, description] fields of a sales record
    from a distributor must match the pre-recorded name and
    description fields in a Product reference relation

3
INTRODUCTION
  • What is data cleaning? The task of detecting and correcting
    errors in data
  • How?
  • periodically clean an entire data warehouse,
  • OR
  • avoid introducing errors during the process of adding new data
    into the warehouse:
  • validate and correct input tuples before they are loaded

4
FUZZY MATCH OPERATION
  • similarity function: a notion of closeness between tuples
  • fuzzy match: error-resilient matching of input tuples against the
    reference table
  • returns the closest K reference tuples

5
RELATED WORK
  • 1. Trillium, Vality, Acxiom: proprietary, domain-specific
    functions
  • 2. Google's "Did you mean?"
  • 3. String edit distance for measuring similarity between tuples
    - does not make use of informative tokens (I3 matched incorrectly
      to R2)
    - does not take into account token transposition errors (I4
      matched to R3)
  • 4. Information retrieval: use of token weights
6
MOTIVATION
  • to develop a domain-independent method
  • a new fuzzy match similarity (FMS) function views a string as a
    sequence of tokens and recognizes the varying importance of
    tokens by explicitly associating weights quantifying their
    importance
  • tuples matching on high-weight tokens are more similar than
    tuples matching on low-weight tokens
  • the effective use of token weights in combination with data entry
    errors has not been considered earlier

7
THE SIMILARITY FUNCTION
  • EDIT DISTANCE ed(s1, s2)
  • minimum number of character edit operations (delete, insert,
    substitute) required to transform s1 into s2, normalized by the
    maximum of the lengths of s1 and s2
  • each character edit has UNIT cost
  • E.g., ed(company, corporation) = 7/11 ≈ 0.64

REFERENCE RELATION R[Tid, A1, …, An]
Ai denotes the ith column (a string-valued attribute); Tid is the
tuple identifier, the key of R; v[i] denotes the value ai in tuple
v = [r, a1, …, an]
E.g., v[1] in v = [R1, Boeing Company, Seattle, WA, 98004] is
"Boeing Company"
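
The normalized edit distance above is the standard Levenshtein
distance divided by the longer string's length. A minimal Python
sketch (illustrative, not code from the paper):

def edit_distance(s1: str, s2: str) -> float:
    """Character-level Levenshtein distance, normalized by max(len(s1), len(s2))."""
    m, n = len(s1), len(s2)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))   # distances from the empty prefix of s1
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev[j] + 1,           # delete
                         cur[j - 1] + 1,        # insert
                         prev[j - 1] + cost)    # substitute
        prev = cur
    return prev[n] / max(m, n)

print(round(edit_distance("company", "corporation"), 2))  # 0.64, as on the slide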
8
THE SIMILARITY FUNCTION
  • TOKENIZATION tok(s)
  • splits string s into a set of tokens, based on a set of
    delimiters
  • E.g., for tuple v = [R1, Boeing Company, Seattle, WA, 98004],
    tok(v[1]) = {boeing, company}
  • a column property for tokens in tok(v[col]) is col, used to know
    the column from which a token originates
  • token set tok(v): the union of all token sets of a tuple
  • WEIGHT FUNCTION
  • inverse document frequency (IDF): the importance of a token
    decreases with its frequency freq(t, i), the number of times it
    occurs in the ith column of the reference relation; for
    freq(t, i) > 0, w(t, i) = IDF(t, i) = log(|R| / freq(t, i))

freq(t, i) = 0 implies t is an erroneous version of some token, so
we assign its weight as the average weight of all tokens in the ith
column of relation R
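
A sketch of tokenization and IDF weighting in Python. The delimiter
set and the exact log form are assumptions; the slide only states
that a token's weight decreases with its frequency in the column:

import math, re

def tokenize(s: str) -> set:
    """Split a string into lowercase tokens on whitespace/punctuation delimiters."""
    return set(re.split(r"[\s,.;]+", s.lower())) - {""}

def idf_weights(column_values: list) -> dict:
    """w(t) = log(|R| / freq(t)) for tokens seen in this column; an unseen
    token (freq = 0) would instead get the column's average weight."""
    freq = {}
    for value in column_values:
        for t in tokenize(value):
            freq[t] = freq.get(t, 0) + 1
    n = len(column_values)
    return {t: math.log(n / f) for t, f in freq.items()}

w = idf_weights(["Boeing Company", "Boeing Corporation", "United Airlines"])
# "boeing" occurs twice, so it gets a lower weight than "united" or "airlines"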
9
FUZZY SIMILARITY FUNCTION (FMS)
  • the similarity between an input tuple and a reference tuple is
    based on the cost of transforming the former into the latter
  • Schema R[A1, …, An]
  • Input tuple u, reference tuple v
  • 1. Token replacement cost: ed(t1, t2) · w(t1, i), where i is the
    column, t1 is a token in u[i] and t2 is a token in v[i]
  • 2. Token insertion cost: cins · w(t, i), where cins is the token
    insertion factor, between 0 and 1
  • 3. Token deletion cost: w(t, i)
  • insertion and deletion costs differ: it is more likely that
    tokens are left out during data entry than that spurious tokens
    are inserted

10
FUZZY SIMILARITY FUNCTION (FMS)
  • TRANSFORMATION COST tc(u[i], v[i])
  • cost of the minimum-cost transformation sequence for transforming
    u[i] into v[i]

E.g., input tuple u = [Beoing Corporation, Seattle, WA, 98004],
reference tuple v = [Boeing Company, Seattle, WA, 98004]:
tc(u[1], v[1]) = 0.33 (beoing → boeing) + 0.64 (corporation →
company) = 0.97
FUZZY SIMILARITY MATCH: the similarity between u and v is defined as
fms(u, v) = 1 − min(tc(u, v) / w(u), 1.0),
where w(u) is the sum of the weights of all tokens in the token set
tok(u)
E.g., w(u) = 5.0 (considering unit weight on each token):
fms(u, v) = 1 − 0.97/5.0 ≈ 0.806
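
A simplified sketch of tc and fms, reusing edit_distance() from the
slide 7 sketch. It assumes unit token weights and matches each input
token greedily to its cheapest reference token rather than solving
for the true minimum-cost sequence; cins = 0.5 is an assumption:

def tc(u_tokens: list, v_tokens: list, c_ins: float = 0.5) -> float:
    """Transformation cost of one column under unit token weights."""
    cost, unmatched = 0.0, list(v_tokens)
    for t in u_tokens:
        if unmatched:
            best = min(unmatched, key=lambda r: edit_distance(t, r))
            cost += edit_distance(t, best)   # replacement: ed(t, best) * w(t)
            unmatched.remove(best)
        else:
            cost += 1.0                      # deletion: w(t)
    return cost + c_ins * len(unmatched)     # insertions: c_ins * w(r)

def fms(u: list, v: list) -> float:
    total = sum(tc(ui, vi) for ui, vi in zip(u, v))
    w_u = sum(len(ui) for ui in u)           # unit weight per input token
    return 1.0 - min(total / w_u, 1.0)

u = [["beoing", "corporation"], ["seattle"], ["wa"], ["98004"]]
v = [["boeing", "company"], ["seattle"], ["wa"], ["98004"]]
print(round(fms(u, v), 3))                   # 0.806, matching the slide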
11
THE K-FUZZY MATCH PROBLEM
  • Given a reference relation R, a minimum similarity threshold c
    (0 < c < 1), and an input tuple u, return the set FM(u) of fuzzy
    matches of at most K tuples from R such that
  • 1. fms(u, v) ≥ c, for all v in FM(u)
  • 2. fms(u, v) ≥ fms(u, v') for any v in FM(u) and v' in R − FM(u)
  • Naïve algorithm: scan the reference relation R, comparing each
    tuple with u
  • Proposed method:
  • build an index on the reference relation for quickly retrieving a
    superset of the target fuzzy matches (pre-processing phase)
  • this indexed relation is called the Error Tolerant Index (ETI),
    indexed using standard B-trees to perform fast exact lookups
  • to prepare the ETI, an approximation fmsapx is needed

12
APPROXIMATION OF FMS
[Diagram: reference relation (not indexable) → pre-processing →
Error Tolerant Index (a standard, indexable database relation) →
candidate set, a superset of FM(u)]
  • fmsapx upper bounds fms with high probability; it increases
    similarity among tuples
  • it is a pared-down version of fms, obtained by
  • 1. ignoring the ordering among tokens in the input and reference
    tuples
  • 2. allowing each input token to match with the closest token from
    the reference tuple
  • E.g., the tuples [beoing company, seattle, wa, 98004] and
    [company beoing, seattle, wa, 98004] are identical under fmsapx
  • in fmsapx, closeness between two tokens is measured through the
    similarity between sets of substrings called q-gram sets

13
APPROXIMATION OF FMS
  • Q-GRAM SET QGq(s)
  • the set of all size-q substrings of s
  • E.g., the 3-gram set QG3(boeing) is {boe, oei, ein, ing}
  • MIN-HASH SIMILARITY
  • Let U denote the universe of strings over alphabet Σ
  • hi: U → N, i = 1, …, H are hash functions mapping U uniformly and
    randomly to the set of natural numbers N
  • Let S be a set of strings
  • Min-hash signature mh(S) = [mh1(S), …, mhH(S)], where the ith
    coordinate is mhi(S) = argmin_{a ∈ S} hi(a)
  • Let I[X] denote an indicator variable over a boolean X, i.e.,
    I[X] = 1 if X is true, else 0; the fraction of agreeing
    coordinates, (1/H) Σi I[mhi(S1) = mhi(S2)], estimates the Jaccard
    similarity |S1 ∩ S2| / |S1 ∪ S2| (see the sketch below)
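
A sketch of q-gram sets and min-hash signatures. Salted MD5 stands
in for the H independent random hash functions hi (an implementation
choice, not from the paper):

import hashlib

def qgrams(s: str, q: int = 3) -> set:
    """All size-q substrings: qgrams('boeing') == {'boe','oei','ein','ing'}."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def min_hash(strings: set, H: int = 2) -> list:
    """ith coordinate = the element of the set minimizing hash function h_i."""
    def h(i: int, g: str) -> int:
        return int(hashlib.md5(f"{i}:{g}".encode()).hexdigest(), 16)
    return [min(strings, key=lambda g: h(i, g)) for i in range(H)]

sig1, sig2 = min_hash(qgrams("beoing")), min_hash(qgrams("boeing"))
# The fraction of agreeing coordinates is an unbiased estimate of the
# Jaccard similarity between the two q-gram sets.
estimate = sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)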

14
APPROXIMATION OF FMS
E.g., input tuple u = [Company Beoing, Seattle, NULL, 98004],
reference tuple v = [Boeing Company, Seattle, WA, 98004]
q = 3, H = 2; token weights: company = 0.25, beoing = 0.5,
seattle = 1.0, 98004 = 2.0; total weight = 3.75
Suppose the min-hash signatures are {oei, ing}, {com, pan},
{sea, ttl}, {wa}, {980, 004}
The score from matching beoing with boeing is
w(beoing) · (2/3 + 0.5 · (1 − 1/3)) = w(beoing)
Since every input token's score equals its full weight,
fmsapx(u, v) = 3.75/3.75 = 1.0
15
ERROR TOLERANT INDEX
  • the primary purpose of the ETI is to enable, for each input tuple
    u, the efficient retrieval of a candidate set S of reference
    tuples with similarity greater than the minimum similarity
    threshold
  • fmsapx is measured by comparing the min-hash signatures of tokens
    in tok(u) and tok(v)
  • so to determine the candidate set, we need to efficiently
    identify, for each token t in tok(u), a set of reference tuples
    sharing min-hash q-grams with those of t

16
ERROR TOLERANT INDEX
- the union of all such per-token sets Si contains the candidate
set S
- thus we store in the ETI each q-gram s along with the list of tids
of all reference tuples with tokens whose min-hash signatures
contain s
17
ERROR TOLERANT INDEX
- the obvious method of computing all ETI tuples in main memory, by
scanning and processing each reference tuple, is not scalable
- pre-ETI: leverage the underlying database system by building a
temporary relation pre-ETI, and then constructing the ETI from it
using SQL queries
- schema of pre-ETI: [QGram, Coordinate, Column, Tid]
E.g., for R1 = [Boeing Company, Seattle, WA, 98004], the size-2
signature of token "company" belonging to column 1 is {com, pan};
insert into pre-ETI the two rows [com, 1, 1, R1] and [pan, 2, 1, R1]
- SQL query: SELECT QGram, Coordinate, Column, Tid FROM pre-ETI
ORDER BY QGram, Coordinate, Column, Tid
- freq = size of the group
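
A sketch of pre-ETI row generation for one reference tuple, reusing
tokenize(), qgrams(), and min_hash() from the earlier sketches
(q = 3, H = 2; tokens shorter than q are skipped for simplicity):

def pre_eti_rows(tid: str, columns: list, q: int = 3, H: int = 2):
    """Yield [QGram, Coordinate, Column, Tid] rows for one reference tuple."""
    for col, value in enumerate(columns, start=1):
        for token in tokenize(value):
            grams = qgrams(token, q)
            if not grams:                    # token shorter than q
                continue
            for coord, gram in enumerate(min_hash(grams, H), start=1):
                yield (gram, coord, col, tid)

rows = list(pre_eti_rows("R1", ["Boeing Company", "Seattle", "WA", "98004"]))
# Rows like ('com', 1, 1, 'R1'); which q-grams appear depends on the
# hash functions. Bulk-load these rows, then build the ETI with the
# SQL query on the slide.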
18
QUERY PROCESSING
  • goal: reduce the number of lookups against the reference relation
    by effectively using the ETI
  • BASIC ALGORITHM
  • fetches tid-lists by looking up in the ETI all q-grams in the
    min-hash signatures of all tokens in u
  • 1. For each token t in tok(u), compute its IDF weight w(t)
  • 2. Determine the min-hash signature mh(t) of each token
  • 3. Assign weight w(t)/|mh(t)| to each q-gram in mh(t)
  • 4. Using the ETI, determine the candidate set S of reference
    tuples as per fmsapx (steps 1-4 are sketched after this list)
  • 5. Fetch the tuples in S from the reference relation, and test
    them as per fms
  • 6. Among the tuples that pass the test, return the K tuples with
    the highest similarity scores
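
A sketch of steps 1-4, with the ETI modeled as an in-memory dict
from (q-gram, coordinate, column) to a tid-list and w as a dict of
IDF weights; the adjustment term from the next slide is omitted:

from collections import defaultdict

def candidate_set(u_columns: list, eti: dict, w: dict, c: float,
                  q: int = 3, H: int = 2) -> set:
    """Score tids by the summed weights of input q-grams whose
    tid-lists contain them; keep those clearing w(u) * c."""
    scores = defaultdict(float)
    w_u = 0.0
    for col, value in enumerate(u_columns, start=1):
        for token in tokenize(value):        # steps 1-2: weights, signatures
            w_u += w[token]
            grams = qgrams(token, q)
            if not grams:
                continue
            sig = min_hash(grams, H)
            share = w[token] / len(sig)      # step 3: w(t) / |mh(t)|
            for coord, gram in enumerate(sig, start=1):
                for tid in eti.get((gram, coord, col), []):
                    scores[tid] += share     # step 4: accumulate per tid
    return {tid for tid, s in scores.items() if s >= w_u * c}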

19
CANDIDATE SET DETERMINATION
  • at any point, the score of a tid equals the sum of the weights of
    all q-grams whose tid-lists it belongs to
  • i.e., each tid is assigned a score proportional to the weight
    w(t) of the parent token t
  • if tid r is very close to tuple u, then r has a high overall
    score
  • tids whose overall score is greater than w(u) · c minus an
    adjustment term (a correction for approximating the edit distance
    between tokens by the similarity between their q-gram sets)
    constitute the candidate set

20
CANDIDATE SET DETERMINATION
  • E.g., input tuple u = [Beoing Company, Seattle, WA, 98004]
  • q = 3, H = 2
  • {q1, q2}: w denotes a min-hash signature {q1, q2} whose q-grams
    are each assigned weight w
  • token weights: beoing = 0.5, company = 0.25, seattle = 1.0,
    WA = 0.75, 98004 = 2.0
  • q-gram weights: {eoi, ing}: 0.25, {com, pan}: 0.125,
    {sea, ttl}: 0.5, {wa}: 0.75, {980, 004}: 1.0

Looking up the ETI fetches the following tid-lists:
{}, {R1}, {R1, R3}, {R1, R3}, {R1, R2, R3}, {R1, R2, R3},
{R1, R2, R3}, {R1, R2, R3}, {R1}
R1 gets an overall score of 4.25; R2: 2.75, R3: 3.0
Depending on the threshold, the candidate set is a subset of
{R1, R2, R3}
21
OPTIMISTIC SHORT CIRCUITING
  • Need: reduce the number of ETI lookups
  • Intuition: the weights of input tokens (and hence of min-hash
    q-grams) vary significantly
  • look up the ETI on just a few important q-grams, and if a
    fetching test succeeds, optimistically short circuit the
    algorithm by fetching the current closest K reference tuples
  • a stopping test verifies whether these tuples are actually the
    closest K tuples
  • avoids ETI lookups on a number of unimportant q-grams

22
OPTIMISTIC SHORT CIRCUITING
  • E.g., input tuple u = [Beoing Company, Seattle, WA, 98004]
  • q = 3, H = 2, K = 1
  • {q1, q2}: w denotes a min-hash signature {q1, q2} whose q-grams
    are each assigned weight w
  • token weights: beoing = 0.5, company = 0.25, seattle = 1.0,
    WA = 0.75, 98004 = 2.0
  • q-gram weights: {eoi, ing}: 0.25, {com, pan}: 0.125,
    {sea, ttl}: 0.5, {wa}: 0.75, {980, 004}: 1.0
  • We order the q-grams in decreasing order of their weights and
    fetch their tid-lists in this order

1. First fetch the tid-list {R1, R2, R3} of 980; we need further
distinguishing at this point
2. Fetch the list {R1} of the next most important q-gram, 004
3. At this point the scores are R1: 2.0, R2: 1.0, R3: 1.0
4. Extrapolated score for R1 over ALL the q-grams: 4.5; upper-bound
score for R2: 1.0 + (4.5 − 2.0) = 3.5 (its current score plus the
sum of the weights of the remaining q-grams)
5. Since 4.5 > 3.5, anticipate R1 to be the closest match; fetch it
from R and compute fms(u, R1) (see the sketch below)
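
A loose sketch of the OSC control flow for K = 1. The extrapolation
and stopping test are simplified from the paper's versions: the
leader's score is scaled proportionally over all q-grams, other tids
are bounded by current score plus remaining weight, and
fetch_fms(tid) stands for fetching the reference tuple and computing
fms(u, tid):

def osc_lookup(grams: list, eti: dict, fetch_fms, w_u: float):
    """grams: [(weight, eti_key)]; returns (tid, fms) on a successful
    short circuit, or None to fall back to the basic algorithm."""
    total = sum(wt for wt, _ in grams)
    seen, scores = 0.0, {}
    for wt, key in sorted(grams, reverse=True):  # decreasing weight order
        seen += wt
        for tid in eti.get(key, []):
            scores[tid] = scores.get(tid, 0.0) + wt
        if not scores:
            continue
        leader, best = max(scores.items(), key=lambda kv: kv[1])
        extrapolated = best * total / seen       # optimistic score over ALL q-grams
        others = max((s for t, s in scores.items() if t != leader), default=0.0)
        bound = others + (total - seen)          # others: score so far + remaining
        if extrapolated > bound:                 # fetching test succeeds
            sim = fetch_fms(leader)
            if sim * w_u >= bound:               # stopping test: leader unbeatable
                return leader, sim
    return None

On the slide's example, after fetching 980 and 004 the leader R1 has
extrapolated score 2.0 · 4.5/2.0 = 4.5 against a bound of 3.5, so the
short circuit fires, exactly as in steps 4-5 above.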
23
OPTIMISTIC SHORT CIRCUITING
  • Robustness: the stopping test ensures that inaccuracy in
    estimating the score of R1 over all q-grams does not affect the
    correctness of the final result
  • it may, however, affect performance
  • OSC differs from the basic algorithm in
  • 1. the order in which we look up q-grams against the ETI
  • 2. the additional short-circuiting procedure

24
RESOURCE REQUIREMENTS
  • ETI building time, expensive steps:
  • 1. scanning the reference relation R
  • 2. writing the pre-ETI
  • 3. sorting the pre-ETI
  • 4. writing the ETI
  • Total I/O cost during these phases is
    O(m_avg · q · H · |R| + |ETI| · (1 + 2q)), where m_avg is the
    average number of tokens in each tuple and |ETI| is the number of
    tuples in the ETI
  • Input tuple processing, expensive steps:
  • 1. looking up the ETI for the tid-lists of q-grams
  • 2. processing the tid-lists
  • 3. fetching the tuples in the candidate set

25
EXTENSIONS
  • 1. INDEXING USING TOKENS
  • effectively use tokens to further improve efficiency
  • intuition: the input tuple u = [Beoing Company, Seattle, WA,
    98004] has only one incorrect token; all the rest are correct
  • index reference tuples on tokens as well, so we can directly look
    up the ETI against these tokens instead of several min-hash
    signatures
  • assign the 0th coordinate in the signature to the token itself
  • the challenge is to gain efficiency without losing accuracy
  • 2. COLUMN WEIGHTS
  • a token t in the ith column gets weight w(t) · Wi
  • used to assign varying importance to columns

26
EXPERIMENTS
  • clean reference relation: Customer [name, city, state, zip code]
    with 1.7 million tuples
  • input datasets created by introducing errors into randomly
    selected subsets of Customer tuples
  • 2 types of error-injection techniques:
  • Type I: all tokens in a column are equally likely to be erroneous
  • Type II: tokens with higher frequency are more likely to be
    erroneous

If we decide (with probability pi) to introduce an error into u[i],
we select from among several types of errors with the conditional
probabilities as shown.
27
EXPERIMENTS (contd.)
  • PARAMETER SETTINGS
  • K = 1 (closest fuzzy match)
  • q-gram size q = 4
  • minimum similarity threshold c = 0.0 (i.e., always return the K
    closest matches)
  • token insertion factor cins = 0.5
  • MACHINE SPECIFICATIONS: experiments were run on a
  • 930 MHz Pentium with 256 MB RAM
  • MS Windows XP
  • the algorithm was implemented on the MS SQLServer 2000 database,
    using OLEDB for database access
28
EVALUATION METRICS
  • 1. Normalized elapsed time:
    (elapsed time to process the set of input tuples using a fuzzy
    match algorithm) / (elapsed time to process one input tuple using
    the naïve algorithm)
  • 2. Accuracy: the percentage of input tuples for which a fuzzy
    match algorithm identifies the seed tuple, from which the
    erroneous input tuple was generated, as the closest reference
    tuple
  • NOTATION used to denote the signature computation strategy:
    A_H, with A ∈ {Q, QT} and H ≥ 0
  • Q denotes q-grams only; QT denotes q-grams plus token signatures
29
EXPERIMENTAL RESULTS - ACCURACY
Accuracy comparison between ed and fms:
- accuracy on Type II errors is better than on Type I
- we therefore consider only datasets with Type I errors
- each of D1, D2, D3 has 1655 tuples
Figure 5:
1. min-hash signatures significantly improve accuracy (Q_H is better
than QT_0)
2. adding tokens to the signature does not negatively impact
accuracy
3. even small signature sizes yield high gains in accuracy
30
EXPERIMENTAL RESULTS - EFFICIENCY
Figure 6:
- the algorithms are 2 to 3 orders of magnitude faster than the
naïve algorithm
- query processing time decreases with signature size
- for all H ≥ 1, QT_H is faster than Q_H
Figure 7:
- the ETI build time for QT_H is greater than for Q_H
- because we persist the ETI as a standard indexed relation, we can
reuse it for subsequent batches of input tuples if the reference
table does not change
31
EXPERIMENTAL RESULTS - EFFICIENCY
Figure 8:
- more q-grams help decrease candidate set sizes by better
distinguishing the similarity scores of tids
- when OSC succeeds, we retrieve very few (around 1 per input tuple)
candidate tuples
Figure 9:
- the average number of tids processed per input tuple increases
with signature size
- but this is more than compensated by the average reduction in
candidate set sizes
32
EXPERIMENTAL RESULTS - EFFICIENCY
Figure 10:
- the OSC optimization succeeds for 50-75% of the input tuples
- the success fraction increases with signature size
33
CONCLUSIONS
  • generalized edit-distance similarity by incorporating the notion
    of tokens and their importance, to develop an accurate fuzzy
    match
  • developed an error tolerant index and an efficient algorithm for
    identifying, with high probability, the closest fuzzy matching
    reference tuples