Reducing redundancy in databases - PowerPoint PPT Presentation

1
Reducing redundancy in databases
  • Problem definition
  • Solution approach
  • Formulating an optimization problem with graph theory
  • Experimentation and results
  • Comparison with another method

2
Problem definition
  • Textual data often contains a lot of corrupt
    information
  • Key-punch errors ("V" vs. "B")
  • Scanning errors ("I" vs. "1")
  • Misspellings ("Joulie" vs. "Julie")
  • Insertions and deletions of characters ("Charles"
    vs. "Charls")
  • Phonetic errors ("Eduard" vs. "Edwuard")

3
Problem definition
  • Usually names cannot be validated
  • Companies cannot validate their records; one
    cannot ask every customer to "please contact us
    if your name is incorrect"
  • Corrupt information causes redundancies
  • A customer/product name appears twice in a
    database
  • We are unsure if a given record is the
    customer/product we are looking for

4
Solution approaches in the past
  • Probabilistic record linkage.
  • Learn naïve Bayes classifier with a binary
    feature vector of comparing a pair of records
  • Merge/purge problem using sliding windows.
  • As a cluster/classification problem.
  • Many classifications.
  • As an optimization problem.
  • Difficult to find the global optima.
  • A greedy approach, but still no published results.

5
Solution approach
  • A key is the unique record identifier in a
    database
  • Key equivalence is when two or more records point
    to the same real-world object

6
Key equivalence as an optimization problem
  • The idea is to compact the database by as many
    records as the number of redundant records it
    contains
  • Here S is the input database and H is the
    processed (compacted) dataset
  • There is a cost associated with every compaction
    step
  • The cost of assuming that a pair is equivalent:
    is record s equivalent to record t?
  • If two records match well, the cost of the
    assumption is very small
  • Minimize the size of the database by minimizing
    a cost function

7
Cost of equivalence
  • As in record linkage, we can use the probability
    of a match given the feature vector
  • Several string metrics compute weights taking
    values
  • 1 for a perfect match of the pair
  • 0 for a non-match

8
String metrics for names
  • Three main heuristics when matching names
  • Treat the name as a bag of words
  • The first characters of the name are more
    important for matching
  • Give some score to similar characters
    ("l" vs. "1")
  • The match for a pair of names is based on a count
    of common characters within a window of character
    comparisons
  • See the Jaro metric and its variant due to
    Winkler
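The Jaro metric mentioned above can be sketched as follows. This is a minimal, self-contained implementation of the standard Jaro formula (common characters within a sliding window, transposition count) plus the Winkler prefix bonus; the function names are ours, not from the slides.

```python
def jaro(s, t):
    """Jaro similarity: 1.0 for identical strings, 0.0 for no match."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only within this window of each other.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among the matched characters.
    t_hits = [t[j] for j in range(len(t)) if t_matched[j]]
    transpositions, k = 0, 0
    for i in range(len(s)):
        if s_matched[i]:
            if s[i] != t_hits[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaro_winkler(s, t, p=0.1):
    """Winkler variant: boost pairs sharing a common prefix (up to 4)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(jaro("Joulie", "Julie"))        # high similarity despite the typo
print(jaro_winkler("MARTHA", "MARHTA"))
```

Note that "Joulie" vs. "Julie" (the misspelling example from slide 2) scores above 0.9, which is why such metrics are useful as matching costs.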

9
Cost of equivalence
  • Use an appropriate string-metric method to
    obtain the cost of matching two names
  • We can model the equivalence relation as a
    directed graph
  • The directed graph must satisfy the following
    rules
  • One record cannot be equivalent to more than one
    record
  • Two linked records are joined by an acyclic
    relation

(Figure: example directed graph over records a, b, c)
10
Optimization model
Given a weighted directed graph G = (V, E)
Find a subset E' ⊆ E
That minimizes the function
  f(E') = Σ_{(s,t) ∈ E'} (c(s,t) − k)   (1)
Subject to
  each vertex has at most one outgoing arc in E'   (2)
  E' contains no cycles   (3)
Where (s, t) represents an arc from s to t and
c(s,t) is a cost or distance from s to t
11
Optimization model
  • The model contains a constant value k
  • If k = 0, the solution is empty and no record is
    merged
  • If k is very large, all records are matched
  • As k increases from 0, the optimal objective
    value becomes negative, thus the inequality
  • Because we are minimizing, no selected arc has a
    cost greater than k
  • Thus k takes the role of a threshold
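The threshold role of k can be seen directly from the objective Σ (c − k): an arc lowers the objective only when its cost c is below k, so a minimizer keeps exactly those arcs. A toy illustration (the record names and cost values are made up, and the degree and acyclicity constraints are ignored here):

```python
# Hypothetical matching costs between pairs of records.
costs = {
    ("Julie", "Joulie"): 0.06,
    ("Julie", "Edward"): 0.72,
    ("Charles", "Charls"): 0.11,
}

def selected_arcs(costs, k):
    """Arcs a minimizer of sum(c - k) would keep: those with c - k < 0."""
    return {pair for pair, c in costs.items() if c - k < 0}

print(selected_arcs(costs, 0.3))  # only the two cheap (similar) pairs
print(selected_arcs(costs, 0.0))  # empty: no record is merged
```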

12
Solve the model
  • The model belongs to a family of integer
    programming problems with restrictions similar to
    the TSP
  • It cannot be solved efficiently by conventional
    approaches

13
Example by enumeration
(Figure: enumeration of candidate solutions for k = 0.3)
14
Solve the model
  • Any tree satisfies the restrictions of our model.
  • A minimum-weight tree in a weighted graph which
    contains all of the graph's vertices is a minimum
    spanning tree.
  • A spanning forest of a connected graph G is a
    forest whose components are subtrees of a
    spanning tree of G.
  • A minimum spanning forest of a connected graph G
    is a forest whose components are minimum spanning
    trees of the corresponding components in G.
  • For a given k, the optimal solution to model can
    be obtained in polynomial time.

15
Proposed algorithm
  • 1. Build the weighted complete graph G
    corresponding to the given database
  • 2. Find the minimum spanning tree of G
  • 3. Assign to k the largest weight for which a
    good match between records is found
  • 4. Remove all edges with weights > k from the
    minimum spanning tree of G found in step 2
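The four steps above can be sketched with Kruskal's algorithm plus a threshold cut; the connected components that survive are the groups of records assumed equivalent. The record names and edge weights below are hypothetical, standing in for string-metric costs:

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def mst_kruskal(nodes, edges):
    """Kruskal's MST; edges is a list of (weight, u, v) tuples."""
    parent = {n: n for n in nodes}
    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

def dedupe_clusters(nodes, edges, k):
    """Steps 2-4: MST, then drop edges heavier than k; the remaining
    connected components are the clusters of equivalent records."""
    kept = [(w, u, v) for w, u, v in mst_kruskal(nodes, edges) if w <= k]
    parent = {n: n for n in nodes}
    for _, u, v in kept:
        parent[find(parent, u)] = find(parent, v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(parent, n), set()).add(n)
    return list(groups.values())

nodes = ["Julie", "Joulie", "Edward", "Eduard"]
edges = [  # hypothetical string-metric costs
    (0.06, "Julie", "Joulie"), (0.09, "Edward", "Eduard"),
    (0.80, "Julie", "Edward"), (0.85, "Julie", "Eduard"),
    (0.82, "Joulie", "Edward"), (0.90, "Joulie", "Eduard"),
]
print(dedupe_clusters(nodes, edges, k=0.3))
```

With k = 0.3 the two misspelled variants pair up and the expensive cross-edges are cut, which mirrors the enumeration example for k = 0.3.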

16
Evaluate the quality of the solution
  • How good is our model at finding equivalent
    records?
  • precision = c / |E'|, recall = c / |E|
  • c: number of correctly linked rows
  • |E'|: number of edges in the solution set E'
  • |E|: number of redundant records in the
    database
  • We evaluate our algorithm for 31 values of
    k ranging from zero to one
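The two quality measures amount to precision and recall over the proposed links. A minimal sketch (the example counts are invented):

```python
def precision_recall(c, solution_edges, redundant):
    """c: correctly linked pairs; solution_edges: |E'|; redundant: |E|."""
    precision = c / solution_edges if solution_edges else 1.0
    recall = c / redundant if redundant else 1.0
    return precision, recall

# e.g. 8 correct links out of 10 proposed, with 12 truly redundant records:
print(precision_recall(8, 10, 12))
```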

17
Results
18
Discussion
  • The choice of k is different from database to
    database.
  • It is risky to choose k by eye.
  • The solution is too sensitive to the choice of k.
  • Two alternative refinements
  • Reduce the degree of each node.
  • Reduce the length of the branches.

19
Reducing degree of a node
  • Sometimes nodes contains many adjacent arcs.
  • This feature is common when trying to link
    members of the same family.
  • Family of birds, household or family of products.
  • Procedure
  • If the degree of a node is gt 2
  • Find the best adjacent edge.
  • Trim adjacent edges whose cost is relatively high
    with respect to its best adjacent edge.
  • For our datasets we trim edges whose differences
    are bigger than 0.05.
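The degree-trimming procedure can be sketched per node as follows. This is a simplified, one-sided version (a full implementation would also drop each removed edge from the other endpoint's list); the adjacency structure and costs are assumptions for illustration:

```python
def trim_high_degree(adj_edges, max_degree=2, slack=0.05):
    """adj_edges: {node: [(cost, neighbor), ...]} from the spanning tree.
    For nodes with degree > max_degree, keep the best edge and drop any
    adjacent edge whose cost exceeds best + slack (0.05 in the slides)."""
    trimmed = {}
    for node, edges in adj_edges.items():
        if len(edges) <= max_degree:
            trimmed[node] = list(edges)
            continue
        best = min(c for c, _ in edges)
        trimmed[node] = [(c, n) for c, n in edges if c - best <= slack]
    return trimmed

# Hypothetical hub node "a" linked to three records:
tree_adj = {
    "a": [(0.05, "b"), (0.07, "c"), (0.40, "d")],
    "b": [(0.05, "a")],
    "c": [(0.07, "a")],
    "d": [(0.40, "a")],
}
print(trim_high_degree(tree_adj))  # the expensive a-d edge is dropped at "a"
```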

(Figure: high-degree node a with neighbors b, c, d, before and after trimming)
20
Reduce the length of the branches
  • Inference is gained every time two records are
    merged.
  • Too much chained inference corrupts the
    similarity.
  • The similarity between nodes a and d might
    not be evident.
  • Procedure
  • Route each vertex toward the root of its tree.
  • Obtain the cost between the current vertex and
    each visited vertex.
  • If the cost is bigger than a threshold value τ,
    trim the routed branch at its weakest edge.
  • For our datasets we choose τ = 0.6.
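The branch-trimming procedure can be sketched as below. We interpret the "weakest edge" as the highest-cost (least similar) edge on the walked path; the tree, edge costs, and pairwise costs are hypothetical stand-ins for string-metric values:

```python
def trim_long_branches(parent, edge_cost, pair_cost, tau=0.6):
    """parent: child -> parent pointers of the tree (root maps to itself).
    edge_cost: (child, parent) -> cost of that tree edge.
    pair_cost: direct cost between any two records.
    Walk each vertex toward the root; once the direct cost between the
    vertex and a visited ancestor exceeds tau, cut the walked branch at
    its weakest (highest-cost) edge."""
    cuts = set()
    for v in parent:
        path, node = [], v
        while parent[node] != node:
            edge = (node, parent[node])
            if edge in cuts:
                break  # branch already severed higher up
            path.append(edge)
            node = parent[node]
            if pair_cost(v, node) > tau:
                cuts.add(max(path, key=lambda e: edge_cost[e]))
                break
    return cuts

# Hypothetical chain a <- b <- c <- d with made-up costs:
parent = {"a": "a", "b": "a", "c": "b", "d": "c"}
edge_cost = {("b", "a"): 0.1, ("c", "b"): 0.2, ("d", "c"): 0.15}
pc = {("a", "b"): 0.1, ("a", "c"): 0.3, ("b", "c"): 0.2,
      ("b", "d"): 0.35, ("c", "d"): 0.15, ("a", "d"): 0.8}
pair_cost = lambda u, w: pc[tuple(sorted((u, w)))]
print(trim_long_branches(parent, edge_cost, pair_cost))
```

Here the direct cost between d and the root a exceeds τ = 0.6, so the branch is cut at its weakest edge even though every individual merge looked cheap.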

(Figure: chain a-b-c-d; the direct similarity between a and d may be weak)
21
(No Transcript)
22
Compare performance
  • Heuristically match pairs of records using the
    well-known assignment problem.
  • Used by the US Census Bureau.
  • The disadvantage of this method appears when
    there are more than two redundant copies of a
    record in the database.
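The assignment-problem baseline pairs each record of one file with at most one record of the other, minimizing total cost. A brute-force sketch over a hypothetical cost matrix (exhaustive search is only feasible for tiny inputs; real systems use the Hungarian algorithm):

```python
from itertools import permutations

def best_assignment(cost):
    """cost[i][j]: cost of declaring record i of file A equivalent to
    record j of file B.  Returns, for each i, the matched column j."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)

# Hypothetical cost matrix for two records per file:
print(best_assignment([[0.1, 0.9], [0.8, 0.2]]))
```

Because the matching is strictly one-to-one, three or more redundant copies of the same record can never all be grouped together, which is exactly the disadvantage noted above.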

23
(No Transcript)
24
Conclusions
  • The suggested method is efficient and robust.
  • Our model reduces the search space of linked
    records from all possible pairs to the edges of a
    spanning tree.
  • The model is flexible: it can use different cost
    functions.
  • The current method outperforms traditional
    methods.
  • With the trimming refinements, the solution is
    not as sensitive to the choice of k as it
    initially was.

25
Future work
  • The quality of the solution depends on the
    quality of the weights.
  • Weight computation also has to be adapted to the
    domain.
  • Train the similarity measure to obtain better
    results.
  • Different columns contribute different value to
    the match score.
  • As in IDF, less common field values score higher,
    but there are correlations between fields.
  • Apply the method to larger databases.