Title: Text Joins for Data Cleansing and Integration in an RDBMS
1Text Joins for Data Cleansing and Integration
in an RDBMS
- Luis Gravano
- Panagiotis G. Ipeirotis
Nick Koudas Divesh Srivastava
Columbia University
ATT Labs - Research
2Why Text Joins?
Service B
HATRONIC CORP
EUROAFT INC
EUROAFT CORP
Service A
EUROAFT CORP
HATRONIC INC
Problem Same entity has multiple textual
representations
3Matching Text Attributes using Cosine Similarity
Similar tuples should share infrequent tokens
EUROAFT CORP EUROAFT INC
EUROAFT CORP ? HATRONIC CORP
Problem Given two relations, report tuple pairs
with similarity above threshold f
4Computing Text Joins in an RDBMS
R1
R2
Name
1 EUROAFT CORP
2 HATRONIC INC
Name
1 HATRONIC CORP
2 EUROAFT INC
3 EUROAFT CORP
R1 R2 Similarity
EUROAFT CORP EUROAFT INC 0.98
EUROAFT CORP EUROAFT CORP 1.00
EUROAFT CORP HATRONIC CORP 0.01
HATRONIC INC HATRONIC CORP 0.98
HATRONIC INC EUROAFT INC 0.02
5Sampling Step for Text Joins
Similarity S weight(token, t1) weight(token,
t2)
- Similarity is a sum of products
- Products cannot be high when weight is small
- Can (safely) drop low weights from RiWeights
(adapted from Cohen Lewis, SODA97 for
efficient execution inside an RDBMS)
RiWeights
Token W
EUROAFT 0.9144
HATRONIC 0.8419
CORP 0.01247
INC 0.00504
RiSample
Token TIMES SAMPLED
EUROAFT 18 (18/200.90)
HATRONIC 17 (17/200.85)
Eliminates low similarity pairs (e.g., EUROAFT
INC with HATRONIC INC)
6Sampling-Based Text Joins in SQL
R1Weights
R2Sample
Token W
1 EUROAFT 0.98
1 CORP 0.02
2 HATRONIC 0.98
2 INC 0.01
Token W
1 HATRONIC 0.98
1 CORP 0.02
2 EUROAFT 0.95
2 INC 0.05
3 EUROAFT 0.97
3 CORP 0.03
R1
Name
1 EUROAFT CORP
2 HATRONIC INC
R1 R2 Similarity
EUROAFT CORP EUROAFT INC 0.98
EUROAFT CORP EUROAFT CORP 0.9
HATRONIC INC HATRONIC CORP 0.98
7Contributions
- WHIRL Cohen, SIGMOD98 inside an RDBMS
Scalability, no data exporting/importing - Different tokens choices
- Words Captures word swaps, deletion of common
words - Q-grams All the above, plus spelling mistakes,
but slower
SQL statements tested in MS SQL Server and
available for download at http//www.cs.columbia.
edu/pirot/DataCleaning/
8Questions?
9Overflow Slides
10Recall for 3-grams
Upcoming WWW 2003 paper
11Precision for 3-grams
Upcoming WWW 2003 paper