RealWorld Data Is Dirty - PowerPoint PPT Presentation

1 / 49

About This Presentation

Title:

RealWorld Data Is Dirty

Description:

... over the data set. rmk 06.April.2004. Selection of Keys ... Equational Theory - Example. Two names are spelled nearly identically and have the same address ... – PowerPoint PPT presentation

Number of Views:63

Avg rating:3.0/5.0

Slides: 50

Provided by: davidg70

Category:

more less

Transcript and Presenter's Notes

Title: RealWorld Data Is Dirty

1
Real-World Data Is Dirty

Data Cleansing and the Merge/Purge Problem
Hernandez Stolfo Columbia University - 1998

Class Presentation by Rhonda Kost, 06.April 2004
2
TOPICS

Introduction
A Basic Data Cleansing Solution
Test Real World Results
Incremental Merge Purge w/ New Data
Conclusion
Recap

3
Introduction
4
The problem

Some corporations acquire large amounts of
information every month
The data is stored in many large databases (DB)
These databases may be heterogeneous
Variations in schema
The data may be represented differently across
the various datasets
Data in these DB may simply be inaccurate

5
Requirement of the analysis

The data mining needs to be done
Quickly
Efficiently
Accurately

6
Examples of real-world applications

Credit card companies
Assess risk of potential new customers
Find false identities
Match disparate records concerning a customer
Mass Marketing companies
Government agencies

7
A Basic Data Cleansing Solution
8
Duplicate Elimination

Sorted-Neighborhood Method (SNM)
This is done in three phases
Create a Key for each record
Sort records on this key
Merge/Purge records

9
SNM Create key

Compute a key for each record by extracting
relevant fields or portions of fields
Example

10
SNM Sort Data

Sort the records in the data list using the key
in step 1
This can be very time consuming
O(NlogN) for a good algorithm,
O(N2) for a bad algorithm

11
SNM Merge records

Move a fixed size window through the sequential
list of records.
This limits the comparisons to the records in the
window

12
SNM Considerations

What is the optimal window size while
Maximizing accuracy
Minimizing computational cost
Execution time for large DB will be bound by
Disk I/O
Number of passes over the data set

13
Selection of Keys

The effectiveness of the SNM highly depends on
the key selected to sort the records
A key is defined to be a sequence of a subset of
attributes
Keys must provide sufficient discriminating power

14
Example of Records and Keys
15
Equational Theory

The comparison during the merge phase is an
inferential process
Compares much more information than simply the
key
The more information there is, the better
inferences can be made

16
Equational Theory - Example

Two names are spelled nearly identically and have
the same address
It may be inferred that they are the same person
Two social security numbers are the same but the
names and addresses are totally different
Could be the same person who moved
Could be two different people and there is an
error in the social security number

17
A simplified rule in English

Given two records, r1 and r2
IF the last name of r1 equals the last name of
r2,
AND the first names differ slightly,
AND the address of r1 equals the address of r2
THEN
r1 is equivalent to r2

18
The distance function

A distance function is used to compare pieces
of data (usually text)
Apply distance function to data that differ
slightly
Select a threshold to capture obvious
typographical errors.
Impacts number of successful matches and number
of false positives

19
Examples of matched records
20
Building an equational theory

The process of creating a good equational theory
is similar to the process of creating a good
knowledge-base for an expert system
In complex problems, an experts assistance is
needed to write the equational theory

21
Transitive Closure

In general, no single pass (i.e. no single key)
will be sufficient to catch all matching records
An attribute that appears first in the key has
higher discriminating power than those appearing
after them
If an employee has two records in a DB with SSN
193456782 and 913456782, its unlikely they will
fall under the same window

22
Transitive Closure

To increase the number of similar records merged
Widen the scanning window size, w
Execute several independent runs of the SNM
Use a different key each time
Use a relatively small window
Call this the Multi-Pass approach

23
Transitive Closure

Each independent run of the Multi-Pass approach
will produce a set of pairs of records
Although one field in a record may be in error,
another field may not
Transitive closure can be applied to those pairs
to be merged

24
Multi-pass Matches

Pass 1 (Lastname discriminates)
KSNKAT48NRTH789 (Kathi Kason 789912345 )
KSNKAT48NRTH879 (Kathy Kason 879912345 )
Pass 2 (Firstname discriminates)
KATKSN48NRTH789 (Kathi Kason 789912345 )
KATKSN48NRTH879 (Kathy Kason 879912345 )
Pass 3 (Address discriminates)
48NRTH879KSNKAT (Kathy Kason 879912345 )
48NRTH879SMTKAT (Kathy Smith 879912345 )

25
Transitive Equality Example

IF A implies B
AND B implies C
THEN A implies C
From example
789912345 Kathi Kason 48 North St. (A)
879912345 Kathy Kason 48 North St. (B)
879912345 Kathy Smith 48 North St. (C)

26
Test Results
27
Test Environment

Test data was created by a database generator
Names are randomly chosen from a list of 63000
real names
The database generator provides a large number of
parameters
size of the DB,
percentage of duplicates,
amount of error

28
Correct Duplicate Detection
29
Time for each run
30
Accuracy for each run
31
Real-World Test

Data was obtained from the Office of Children
Administrative Research (OCAR) of the Department
of Social and Health Services (State of
Washington)
OCARs goals
How long do children stay in foster care?
How many different homes do children typically
stay in?

32
OCARs Database

Most of OCARs data is stored in one relation
The DB contains 6,000,000 total records
The DB grows by about 50,000 records per month

33
Typical Problems in the DB

Names are frequently misspelled
SSN or birthdays are either missing or clearly
wrong
Case number often changes when the childs family
moves to another part of the state
Some records use service provider names instead
of the childs
No reliable unique identifier

34
OCAR Equational Theory

Keys for the independent runs
Last Name, First Name, SSN, Case Number
First Name, Last Name, SSN, Case Number
Case Number, First Name, Last Name, SSN

35
OCAR Results
36
Incremental Merge/Purge w/ New Data
37
Incremental Merge/Purge

Lists are concatenated for first time processing
Concatenating new data before reapplying the
merge/purge process may be very expensive in both
time and space
An incremental merge/purge approach is needed
Prime Representatives method

38
Prime-Representative Definition

A set of records extracted from each cluster of
records used to represent the information in the
cluster
The Cluster Centroid or base element of
equivalence class

39
Prime-Representative creation

Initially, no PR exists
After the execution of the first merge/purge
create clusters of similiar records
Correct selection of PR from cluster impacts
accuracy of results
No PR can be the best selection for some clusters

40
3 Strategies for Choosing PR

Random Sample
Select a sample of records at random from each
cluster
N-Latest
Most recent elements entered in DB
Syntactic
Choose the largest or more complete record

41
Important Assumption

No data previously used to select each clusters
PR will be deleted
Deleted records could require restructuring of
clusters (expensive)
No changes in the rule-set will occur after the
first increment of data is processed
Substantial rule change could invalidate
clusters.

42
Results

Cumulative running time for the Incremental
Merge/Purge algorithm is higher than the classic
algorithm
PR selection methodology could improve cumulative
running time
Total running time of the Incremental Merge/Purge
algorithm is always smaller

43
Conclusion
44
Cleansing of Data

Sorted-Neighborhood Method is expensive due to
the sorting phase
the need for large windows for high accuracy
Multiple passes with small windows followed by
transitive closure improves accuracy and
performance for level of accuracy
increasing number of successful matches
decreasing number of false positives

45
Recap
46

2 major reasons merging large databases becomes
a difficult problem
The databases are heterogeneous
The identifiers or strings differ in how they
are represented within each DB

The 3 steps in SNM are
Creation of key(s)
Sorting records on this key
Merge/Purge records

Prime representative - set of records from
cluster considered to be representative of data
contained in cluster
3 strategies for selecting a PR
Random Sample
N-Latest
Syntactic

49
Questions

Write a Comment

User Comments (0)