Text Joins for Data Cleansing and Integration in an RDBMS - PowerPoint PPT Presentation

1 / 11
About This Presentation
Title:

Text Joins for Data Cleansing and Integration in an RDBMS

Description:

Text Joins for Data Cleansing and Integration in an RDBMS Luis Gravano Panagiotis G. Ipeirotis Nick Koudas Divesh Srivastava Columbia University AT&T Labs - Research – PowerPoint PPT presentation

Number of Views:102
Avg rating:3.0/5.0
Slides: 12
Provided by: Pana87
Category:

less

Transcript and Presenter's Notes

Title: Text Joins for Data Cleansing and Integration in an RDBMS


1
Text Joins for Data Cleansing and Integration
in an RDBMS
  • Luis Gravano
  • Panagiotis G. Ipeirotis

Nick Koudas Divesh Srivastava
Columbia University
ATT Labs - Research
2
Why Text Joins?
Service B
HATRONIC CORP
EUROAFT INC
EUROAFT CORP

Service A
EUROAFT CORP
HATRONIC INC

Problem Same entity has multiple textual
representations
3
Matching Text Attributes using Cosine Similarity
Similar tuples should share infrequent tokens
EUROAFT CORP EUROAFT INC
EUROAFT CORP ? HATRONIC CORP
Problem Given two relations, report tuple pairs
with similarity above threshold f
4
Computing Text Joins in an RDBMS
R1
R2
Name
1 EUROAFT CORP
2 HATRONIC INC

Name
1 HATRONIC CORP
2 EUROAFT INC
3 EUROAFT CORP

R1 R2 Similarity
EUROAFT CORP EUROAFT INC 0.98
EUROAFT CORP EUROAFT CORP 1.00
EUROAFT CORP HATRONIC CORP 0.01
HATRONIC INC HATRONIC CORP 0.98
HATRONIC INC EUROAFT INC 0.02
5
Sampling Step for Text Joins
Similarity S weight(token, t1) weight(token,
t2)
  • Similarity is a sum of products
  • Products cannot be high when weight is small
  • Can (safely) drop low weights from RiWeights
    (adapted from Cohen Lewis, SODA97 for
    efficient execution inside an RDBMS)

RiWeights
Token W
EUROAFT 0.9144
HATRONIC 0.8419

CORP 0.01247
INC 0.00504
RiSample
Token TIMES SAMPLED
EUROAFT 18 (18/200.90)
HATRONIC 17 (17/200.85)
Eliminates low similarity pairs (e.g., EUROAFT
INC with HATRONIC INC)
6
Sampling-Based Text Joins in SQL
R1Weights
R2Sample
Token W
1 EUROAFT 0.98
1 CORP 0.02
2 HATRONIC 0.98
2 INC 0.01

Token W
1 HATRONIC 0.98
1 CORP 0.02
2 EUROAFT 0.95
2 INC 0.05
3 EUROAFT 0.97
3 CORP 0.03
R1
Name
1 EUROAFT CORP
2 HATRONIC INC

R1 R2 Similarity
EUROAFT CORP EUROAFT INC 0.98
EUROAFT CORP EUROAFT CORP 0.9
HATRONIC INC HATRONIC CORP 0.98
7
Contributions
  • WHIRL Cohen, SIGMOD98 inside an RDBMS
    Scalability, no data exporting/importing
  • Different tokens choices
  • Words Captures word swaps, deletion of common
    words
  • Q-grams All the above, plus spelling mistakes,
    but slower

SQL statements tested in MS SQL Server and
available for download at http//www.cs.columbia.
edu/pirot/DataCleaning/
8
Questions?
9
Overflow Slides
10
Recall for 3-grams
Upcoming WWW 2003 paper
11
Precision for 3-grams
Upcoming WWW 2003 paper
Write a Comment
User Comments (0)
About PowerShow.com