Reducing redundancy in databases - PowerPoint PPT Presentation

1
Reducing redundancy in databases
  • Problem definition
  • Solution approach
  • Formulating an optimization problem with graph theory
  • Experimentation and results
  • Comparison with another method

2
Problem definition
  • Textual data often contains a lot of corrupt
    information
  • Key-punch errors ("V" vs. "B")
  • Scanning errors ("I" vs. "1")
  • Misspellings ("Joulie" vs. "Julie")
  • Insertions and deletions of characters ("Charles"
    vs. "Charls")
  • Phonetic errors ("Eduard" vs. "Edwuard")

3
Problem definition
  • Usually names cannot be validated
  • Companies cannot validate their records; one
    cannot ask every customer to "please contact us
    if your name is incorrect"
  • Corrupt information causes redundancies
  • A customer/product name appears twice in a
    database
  • We are unsure if a given record is the
    customer/product we are looking for

4
Solution approaches in the past
  • Probabilistic record linkage.
  • Learn naïve Bayes classifier with a binary
    feature vector of comparing a pair of records
  • Merge/purge problem using sliding windows.
  • As a cluster/classification problem.
  • Many classifications.
  • As an optimization problem.
  • Difficult to find the global optima.
  • A greedy approach, but still no published results.

5
Solution approach
  • A key is the unique record identifier in a
    database
  • Key equivalence is when two or more records point
    to the same real-world object

6
Key equivalence as an optimization problem
  • The idea is to compact the database by as many
    records as the number of redundant records it
    contains
  • Here S is the input database and H is the
    processed (compacted) dataset
  • There is a cost associated with every compaction
    step
  • The cost of assuming that a pair is equivalent:
    is record s equivalent to record t?
  • If two records match well, the cost of the
    assumption is very small
  • Minimize the size of the database by minimizing
    a cost function

7
Cost of equivalence
  • As in record linkage, we can use the probability
    of a match given the feature vector
  • Several string metrics compute weights taking
    values
  • 1 for a perfect match of the pair
  • 0 for a non-match

8
String metrics for names
  • Three main heuristics when matching names
  • Treat the name as a bag of words
  • The first characters of the name are more
    important for matching
  • Give some score to similar characters
    ("l" vs. "1")
  • The match for a pair of names is based on a count
    of common characters within a window of character
    comparisons
  • See the Jaro metric and its variant due to
    Winkler
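The Jaro metric mentioned above can be sketched as follows. This is a minimal, self-contained implementation of the standard Jaro formula (common characters within a sliding window, transposition count) plus the Winkler prefix bonus; the function names are ours, not from the slides.

```python
def jaro(s, t):
    """Jaro similarity: 1.0 for identical strings, 0.0 for no match."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only within this window of each other.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among the matched characters.
    t_hits = [t[j] for j in range(len(t)) if t_matched[j]]
    transpositions, k = 0, 0
    for i in range(len(s)):
        if s_matched[i]:
            if s[i] != t_hits[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaro_winkler(s, t, p=0.1):
    """Winkler variant: boost pairs sharing a common prefix (up to 4)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(jaro("Joulie", "Julie"))        # high similarity despite the typo
print(jaro_winkler("MARTHA", "MARHTA"))
```

Note that "Joulie" vs. "Julie" (the misspelling example from slide 2) scores above 0.9, which is why such metrics are useful as matching costs.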

9
Cost of equivalence
  • Use an appropriate string-metric method to
    obtain the cost of matching two names
  • We can model the equivalence relation as a
    directed graph
  • The directed graph must satisfy the following
    rules
  • One record cannot be equivalent to more than one
    record
  • Two linked records are joined by an acyclic
    relation

(Figure: example directed graph over records a, b, c)
10
Optimization model
Given a weighted directed graph G = (V, E)
Find a subset E' ⊆ E
That minimizes the function
  f(E') = Σ_{(s,t) ∈ E'} (c(s,t) − k)   (1)
Subject to
  each vertex has at most one outgoing arc in E'   (2)
  E' contains no cycles   (3)
Where (s, t) represents an arc from s to t and
c(s,t) is a cost or distance from s to t
11
Optimization model
  • The model contains a constant value k
  • If k = 0, the solution is empty and no record is
    merged
  • If k is very large, all records are matched
  • As k increases from 0, the optimal objective
    value becomes negative, thus the inequality
  • Because we are minimizing, no selected arc has a
    cost greater than k
  • Thus k takes the role of a threshold
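The threshold role of k can be seen directly from the objective Σ (c − k): an arc lowers the objective only when its cost c is below k, so a minimizer keeps exactly those arcs. A toy illustration (the record names and cost values are made up, and the degree and acyclicity constraints are ignored here):

```python
# Hypothetical matching costs between pairs of records.
costs = {
    ("Julie", "Joulie"): 0.06,
    ("Julie", "Edward"): 0.72,
    ("Charles", "Charls"): 0.11,
}

def selected_arcs(costs, k):
    """Arcs a minimizer of sum(c - k) would keep: those with c - k < 0."""
    return {pair for pair, c in costs.items() if c - k < 0}

print(selected_arcs(costs, 0.3))  # only the two cheap (similar) pairs
print(selected_arcs(costs, 0.0))  # empty: no record is merged
```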

12
Solve the model
  • The model belongs to a family of integer
    programming problems with restrictions similar to
    the TSP
  • It cannot be solved efficiently by conventional
    approaches

13
Example by enumeration
(Figure: enumeration of candidate solutions for k = 0.3)
14
Solve the model
  • Any tree satisfies the restrictions of our model.
  • A minimum-weight tree in a weighted graph which
    contains all of the graph's vertices is a minimum
    spanning tree.
  • A spanning forest of a connected graph G is a
    forest whose components are subtrees of a
    spanning tree of G.
  • A minimum spanning forest of a connected graph G
    is a forest whose components are minimum spanning
    trees of the corresponding components in G.
  • For a given k, the optimal solution to model can
    be obtained in polynomial time.

15
Proposed algorithm
  • 1. Build the weighted complete graph G
    corresponding to the given database
  • 2. Find the minimum spanning tree of G
  • 3. Assign to k the largest weight for which a
    good match between records is found
  • 4. Remove all edges with weights > k from the
    minimum spanning tree of G found in step 2
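The four steps above can be sketched with Kruskal's algorithm plus a threshold cut; the connected components that survive are the groups of records assumed equivalent. The record names and edge weights below are hypothetical, standing in for string-metric costs:

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def mst_kruskal(nodes, edges):
    """Kruskal's MST; edges is a list of (weight, u, v) tuples."""
    parent = {n: n for n in nodes}
    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

def dedupe_clusters(nodes, edges, k):
    """Steps 2-4: MST, then drop edges heavier than k; the remaining
    connected components are the clusters of equivalent records."""
    kept = [(w, u, v) for w, u, v in mst_kruskal(nodes, edges) if w <= k]
    parent = {n: n for n in nodes}
    for _, u, v in kept:
        parent[find(parent, u)] = find(parent, v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(parent, n), set()).add(n)
    return list(groups.values())

nodes = ["Julie", "Joulie", "Edward", "Eduard"]
edges = [  # hypothetical string-metric costs
    (0.06, "Julie", "Joulie"), (0.09, "Edward", "Eduard"),
    (0.80, "Julie", "Edward"), (0.85, "Julie", "Eduard"),
    (0.82, "Joulie", "Edward"), (0.90, "Joulie", "Eduard"),
]
print(dedupe_clusters(nodes, edges, k=0.3))
```

With k = 0.3 the two misspelled variants pair up and the expensive cross-edges are cut, which mirrors the enumeration example for k = 0.3.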

16
Evaluate the quality of the solution
  • How good is our model at finding equivalent
    records?
  • precision = c / |E'|, recall = c / |E|
  • c: number of correctly linked rows
  • |E'|: number of edges in the solution set E'
  • |E|: number of redundant records in the
    database
  • We evaluate our algorithm for 31 values of
    k ranging from zero to one
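The two quality measures amount to precision and recall over the proposed links. A minimal sketch (the example counts are invented):

```python
def precision_recall(c, solution_edges, redundant):
    """c: correctly linked pairs; solution_edges: |E'|; redundant: |E|."""
    precision = c / solution_edges if solution_edges else 1.0
    recall = c / redundant if redundant else 1.0
    return precision, recall

# e.g. 8 correct links out of 10 proposed, with 12 truly redundant records:
print(precision_recall(8, 10, 12))
```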

17
Results
18
Discussion
  • The choice of k is different from database to
    database.
  • It is risky to choose k by eye.
  • The solution is too sensitive to the choice of k.
  • Two alternative refinements
  • Reduce the degree of each node.
  • Reduce the length of the branches.

19
Reducing degree of a node
  • Sometimes nodes contains many adjacent arcs.
  • This feature is common when trying to link
    members of the same family.
  • Family of birds, household or family of products.
  • Procedure
  • If the degree of a node is gt 2
  • Find the best adjacent edge.
  • Trim adjacent edges whose cost is relatively high
    with respect to its best adjacent edge.
  • For our datasets we trim edges whose differences
    are bigger than 0.05.
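The degree-trimming procedure can be sketched per node as follows. This is a simplified, one-sided version (a full implementation would also drop each removed edge from the other endpoint's list); the adjacency structure and costs are assumptions for illustration:

```python
def trim_high_degree(adj_edges, max_degree=2, slack=0.05):
    """adj_edges: {node: [(cost, neighbor), ...]} from the spanning tree.
    For nodes with degree > max_degree, keep the best edge and drop any
    adjacent edge whose cost exceeds best + slack (0.05 in the slides)."""
    trimmed = {}
    for node, edges in adj_edges.items():
        if len(edges) <= max_degree:
            trimmed[node] = list(edges)
            continue
        best = min(c for c, _ in edges)
        trimmed[node] = [(c, n) for c, n in edges if c - best <= slack]
    return trimmed

# Hypothetical hub node "a" linked to three records:
tree_adj = {
    "a": [(0.05, "b"), (0.07, "c"), (0.40, "d")],
    "b": [(0.05, "a")],
    "c": [(0.07, "a")],
    "d": [(0.40, "a")],
}
print(trim_high_degree(tree_adj))  # the expensive a-d edge is dropped at "a"
```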

(Figure: high-degree node a with neighbors b, c, d, before and after trimming)
20
Reduce the length of the branches
  • Inference is gained every time two records are
    merged.
  • Too much chained inference corrupts the
    similarity.
  • The similarity between nodes a and d might
    not be evident.
  • Procedure
  • Route each vertex toward the root of its tree.
  • Obtain the cost between the current vertex and
    each visited vertex.
  • If the cost is bigger than a threshold value τ,
    trim the routed branch at its weakest edge.
  • For our datasets we choose τ = 0.6.
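The branch-trimming procedure can be sketched as below. We interpret the "weakest edge" as the highest-cost (least similar) edge on the walked path; the tree, edge costs, and pairwise costs are hypothetical stand-ins for string-metric values:

```python
def trim_long_branches(parent, edge_cost, pair_cost, tau=0.6):
    """parent: child -> parent pointers of the tree (root maps to itself).
    edge_cost: (child, parent) -> cost of that tree edge.
    pair_cost: direct cost between any two records.
    Walk each vertex toward the root; once the direct cost between the
    vertex and a visited ancestor exceeds tau, cut the walked branch at
    its weakest (highest-cost) edge."""
    cuts = set()
    for v in parent:
        path, node = [], v
        while parent[node] != node:
            edge = (node, parent[node])
            if edge in cuts:
                break  # branch already severed higher up
            path.append(edge)
            node = parent[node]
            if pair_cost(v, node) > tau:
                cuts.add(max(path, key=lambda e: edge_cost[e]))
                break
    return cuts

# Hypothetical chain a <- b <- c <- d with made-up costs:
parent = {"a": "a", "b": "a", "c": "b", "d": "c"}
edge_cost = {("b", "a"): 0.1, ("c", "b"): 0.2, ("d", "c"): 0.15}
pc = {("a", "b"): 0.1, ("a", "c"): 0.3, ("b", "c"): 0.2,
      ("b", "d"): 0.35, ("c", "d"): 0.15, ("a", "d"): 0.8}
pair_cost = lambda u, w: pc[tuple(sorted((u, w)))]
print(trim_long_branches(parent, edge_cost, pair_cost))
```

Here the direct cost between d and the root a exceeds τ = 0.6, so the branch is cut at its weakest edge even though every individual merge looked cheap.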

(Figure: chain a-b-c-d; the direct similarity between a and d may be weak)
21
(No Transcript)
22
Compare performance
  • Heuristically match pairs of records using the
    well-known assignment problem.
  • Used by the US Census Bureau.
  • The disadvantage of this method appears when
    there are more than two redundant copies of a
    record in the database.
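The assignment-problem baseline pairs each record of one file with at most one record of the other, minimizing total cost. A brute-force sketch over a hypothetical cost matrix (exhaustive search is only feasible for tiny inputs; real systems use the Hungarian algorithm):

```python
from itertools import permutations

def best_assignment(cost):
    """cost[i][j]: cost of declaring record i of file A equivalent to
    record j of file B.  Returns, for each i, the matched column j."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)

# Hypothetical cost matrix for two records per file:
print(best_assignment([[0.1, 0.9], [0.8, 0.2]]))
```

Because the matching is strictly one-to-one, three or more redundant copies of the same record can never all be grouped together, which is exactly the disadvantage noted above.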

23
(No Transcript)
24
Conclusions
  • The suggested method is efficient and robust.
  • Our model reduces the search space of linked
    records from all possible pairs to the edges of a
    spanning tree.
  • The model is flexible: it can use different cost
    functions.
  • The current method outperforms traditional
    methods.
  • With the trimming refinements, the solution is
    not as sensitive to the choice of k as it
    initially was.

25
Future work
  • The quality of the solution depends on the
    quality of the weights.
  • Weight computation also has to be adapted to the
    domain.
  • Train the similarity measure to obtain better
    results.
  • Different columns contribute different value to
    the match score.
  • As in IDF, less common field values score higher,
    but there are correlations between fields.
  • Apply the method to larger databases.