Title: AnHai Doan, Pedro Domingos, Alon Halevy
1 Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach
The LSD Project
 - AnHai Doan, Pedro Domingos, Alon Halevy
 - University of Washington
 
2 Data Integration
Query: Find houses with four bathrooms priced under 500,000
[Figure: the query is posed against a mediated schema, which maps to source schemas 1, 2, and 3 over homes.com, realestate.com, and homeseekers.com]
3 Semantic Mappings between Schemas
- Mediated & source schemas = XML DTDs

house
  address
  num-baths
  contact-info
    agent-name
    agent-phone

house
  location
  contact
    name
    phone
  full-baths
  half-baths

1-1 mapping: address <-> location
non 1-1 mapping: num-baths <-> full-baths, half-baths
4 Current State of Affairs
- Finding semantic mappings is now the bottleneck!
 - largely done by hand
 - labor intensive & error prone
- Will only be exacerbated
 - data sharing & XML become pervasive
 - proliferation of DTDs
 - translation of legacy data
 - reconciling ontologies on the semantic web
- Need (semi-)automatic approaches to scale up!
 
5 The LSD (Learning Source Descriptions) Approach
- Suppose user wants to integrate 100 data sources
- 1. User
 - manually creates mappings for a few sources, say 3
 - shows LSD these mappings
- 2. LSD learns from the mappings
- 3. LSD proposes mappings for remaining 97 sources
 
6 Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data:
 listed-price: 250,000; 110,000; ...
 location: Miami, FL; Boston, MA; ...
 phone: (305) 729 0831; (617) 253 1429; ...
 comments: Fantastic house; Great location; ...
Learned hypotheses:
 If "phone" occurs in the name => agent-phone
 If "fantastic" & "great" occur frequently in data values => description
homes.com data:
 price: 550,000; 320,000; ...
 contact-phone: (278) 345 7215; (617) 335 2315; ...
 extra-info: Beautiful yard; Great beach; ...
7 Our Contributions
- 1. Use of multi-strategy learning
 - well-suited to exploit multiple types of knowledge
 - highly modular & extensible
- 2. Extend learning to incorporate constraints
 - handle a wide range of domain & user-specified constraints
- 3. Develop XML learner
 - exploit hierarchical nature of XML
 
8 Multi-Strategy Learning
- Use a set of base learners
 - each exploits well certain types of information
- Match schema elements of a new source
 - apply the base learners
 - combine their predictions using a meta-learner
- Meta-learner
 - uses training sources to measure base learner accuracy
 - weighs each learner based on its accuracy
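The weighting idea on this slide can be sketched as follows. This is a minimal illustration, not the LSD implementation: the function names, the accuracy figures, and the choice to make weights proportional to accuracy are all assumptions made for the example.

```python
def accuracy_weights(accuracies):
    """Turn per-learner accuracies into weights that sum to 1 (assumed scheme)."""
    total = sum(accuracies.values())
    return {name: acc / total for name, acc in accuracies.items()}


def combine(predictions, weights):
    """Weighted sum of the base learners' confidence scores, per label."""
    combined = {}
    for learner, scores in predictions.items():
        for label, conf in scores.items():
            combined[label] = combined.get(label, 0.0) + weights[learner] * conf
    return combined


# Hypothetical accuracies measured on the training sources.
weights = accuracy_weights({"name_learner": 0.6, "naive_bayes": 0.9})

# Hypothetical base-learner predictions for one source element.
combined = combine(
    {
        "name_learner": {"address": 0.5, "description": 0.5},
        "naive_bayes": {"address": 0.8, "description": 0.2},
    },
    weights,
)
```

Here the more accurate learner pulls the combined prediction toward its own answer.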
 
9 Base Learners
- Input
 - schema information: name, proximity, structure, ...
 - data information: value, format, ...
- Output
 - prediction weighted by confidence score
- Examples
 - Name learner
  - agent-name => (name,0.7), (phone,0.3)
 - Naive Bayes learner
  - "Kent, WA" => (address,0.8), (name,0.2)
  - "Great location" => (description,0.9), (address,0.1)
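As a rough illustration of a base learner that turns a data value into labels with confidence scores, here is a small token-based Naive Bayes classifier. It is a sketch under assumptions: the tokenizer, the Laplace smoothing, and the class and method names are choices made for this example, and the training pairs reuse the sample values shown in these slides.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase alphanumeric tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesLearner:
    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (data value, mediated-schema label) pairs."""
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        """Return {label: confidence}, with confidences summing to 1."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            # Laplace smoothing so unseen tokens do not zero out a label.
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(text):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        unnorm = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(unnorm.values())
        return {l: v / z for l, v in unnorm.items()}


learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"), ("Boston, MA", "address"),
    ("250,000", "price"), ("110,000", "price"),
    ("(305) 729 0831", "agent-phone"), ("(617) 253 1429", "agent-phone"),
    ("Fantastic house", "description"), ("Great location", "description"),
])
prediction = learner.predict("Great house")
```

With this toy training set, `prediction` ranks description highest for "Great house", because both tokens were seen only in description values.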
10 Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data:
 <location> Miami, FL </> <listed-price> 250,000 </> <phone> (305) 729 0831 </> <comments> Fantastic house </>
 <location> Boston, MA </> <listed-price> 110,000 </> <phone> (617) 253 1429 </> <comments> Great location </>
Name Learner training examples:
 (location, address) (listed-price, price) (phone, agent-phone) (comments, description) ...
Naive Bayes Learner training examples:
 ("Miami, FL", address) ("250,000", price) ("(305) 729 0831", agent-phone) ("Fantastic house", description) ...
11 Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
Each data instance goes through the Name Learner and Naive Bayes; the Meta-Learner combines their predictions.
area:
 <area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</>
 per-instance predictions: (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3)
 combined prediction: (address,0.7), (description,0.3)
day-phone:
 <day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
 combined prediction: (agent-phone,0.9), (description,0.1)
extra-info:
 <extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>
 combined prediction: (address,0.6), (description,0.4)
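The step from per-instance predictions to one prediction per source element can be sketched as a simple average. This is an assumption made for illustration: the slide does not state the combination rule, but the numbers shown for area are consistent with a mean. The function name is hypothetical.

```python
from collections import defaultdict


def average_predictions(instance_predictions):
    """Average the confidence of each label over all data instances."""
    totals = defaultdict(float)
    for pred in instance_predictions:
        for label, conf in pred.items():
            totals[label] += conf
    n = len(instance_predictions)
    return {label: s / n for label, s in totals.items()}


# Meta-learner output for the three <area> instances on the slide.
area_instances = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area_prediction = average_predictions(area_instances)
```

Up to floating-point rounding, this gives address 0.7 and description 0.3, matching the combined prediction shown for area.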
12 Domain Constraints
- Impose semantic regularities on sources
 - verified using schema or data
- Examples
 - a = address & b = address => a = b
 - a = house-id => a is a key
 - a = agent-info & b = agent-name => b is nested in a
- Can be specified up front
 - when creating mediated schema
 - independent of any actual source schema
 
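13 The Constraint Handler
Domain constraints: a = address & b = address => a = b
Predictions from Meta-Learner:
 area: (address,0.7), (description,0.3)
 contact-phone: (agent-phone,0.9), (description,0.1)
 extra-info: (address,0.6), (description,0.4)
Candidate mappings, scored by the product of confidences:
 area => description, contact-phone => description, extra-info => description: 0.3 x 0.1 x 0.4 = 0.012
 area => address, contact-phone => agent-phone, extra-info => address: 0.7 x 0.9 x 0.6 = 0.378 (two elements map to address, which violates the constraint)
 area => address, contact-phone => agent-phone, extra-info => description: 0.7 x 0.9 x 0.4 = 0.252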
- Can specify arbitrary constraints
- User feedback = domain constraint
 - ad-id = house-id
- Extended to handle domain heuristics
 - a = agent-phone & b = agent-name => a & b are usually close to each other
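The scoring on this slide can be reproduced with a small search over candidate mappings. This is a sketch, not the LSD constraint handler: it enumerates every combination exhaustively, which is a simplification chosen for clarity (the slides do not specify the search procedure), and the function names are hypothetical.

```python
from itertools import product


def best_mapping(predictions, constraints):
    """Return the highest-scoring mapping that satisfies every constraint.

    The score of a mapping is the product of the chosen confidences.
    """
    elements = list(predictions)
    best, best_score = None, 0.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = 1.0
        for element, label in mapping.items():
            score *= predictions[element][label]
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


def at_most_one_address(mapping):
    """Constraint from the slide: a = address & b = address => a = b."""
    return list(mapping.values()).count("address") <= 1


meta_predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}
mapping, score = best_mapping(meta_predictions, [at_most_one_address])
unconstrained, raw_score = best_mapping(meta_predictions, [])
```

Without constraints the 0.378 candidate wins; with the address constraint it is rejected and the 0.252 candidate is returned instead. User feedback such as ad-id = house-id could be passed in as one more function in the `constraints` list.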
14 Putting It All Together: the LSD System
[Figure: LSD architecture with a training phase and a matching phase. Components shown: mediated schema, source schemas, data listings, training data for base learners, base learners L1, L2, ..., Lk, mapping combination, constraint handler, domain constraints, user feedback]
- Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
- Meta-learner
 - uses stacking [TingWitten99, Wolpert92]
 - returns linear weighted combination of base learners' predictions
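One way to obtain the weights of such a linear combination is to regress the true label indicator on the base learners' confidence scores. The sketch below does this for two learners with plain least squares and no intercept. Those choices, the function name, and the training rows are assumptions for illustration and are not taken from the LSD system.

```python
def fit_weights(rows):
    """Least-squares weights (w1, w2) minimizing sum((w1*s1 + w2*s2 - y)^2).

    rows: (s1, s2, y) triples, where s1 and s2 are two base learners'
    confidences for a label and y is 1 if the label is correct, else 0.
    """
    a11 = sum(s1 * s1 for s1, _, _ in rows)
    a12 = sum(s1 * s2 for s1, s2, _ in rows)
    a22 = sum(s2 * s2 for _, s2, _ in rows)
    b1 = sum(s1 * y for s1, _, y in rows)
    b2 = sum(s2 * y for _, s2, y in rows)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("learner scores are collinear; weights are not unique")
    # Solve the 2x2 normal equations by Cramer's rule.
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


# Hypothetical training rows for the label "address":
# (Name Learner confidence, Naive Bayes confidence, is the label correct?)
address_rows = [(0.5, 0.9, 1), (0.6, 0.8, 1), (0.7, 0.2, 0), (0.4, 0.1, 0)]
w_name, w_bayes = fit_weights(address_rows)


def combined_confidence(s_name, s_bayes):
    """Linear weighted combination of the two base learners' predictions."""
    return w_name * s_name + w_bayes * s_bayes
```

In this toy data the Naive Bayes scores track the correct label more closely, so it receives the larger weight.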
15 Empirical Evaluation
- Four domains
 - Real Estate I & II, Course Offerings, Faculty Listings
- For each domain
 - create mediated DTD & domain constraints
 - choose five sources
 - extract & convert data listings into XML
 - mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
- Ten runs for each experiment; in each run
 - manually provide 1-1 mappings for 3 sources
 - ask LSD to propose mappings for remaining 2 sources
 - accuracy = % of 1-1 mappings correctly identified
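The accuracy measure above is straightforward to compute. The sketch below assumes mappings are stored as dictionaries from source element to mediated-schema element; the function name and the sample mappings are hypothetical.

```python
def matching_accuracy(proposed, gold):
    """Percentage of gold 1-1 mappings that the proposed mapping gets right."""
    if not gold:
        raise ValueError("gold mapping must not be empty")
    correct = sum(1 for element, label in gold.items()
                  if proposed.get(element) == label)
    return 100.0 * correct / len(gold)


gold = {"area": "address", "contact-phone": "agent-phone",
        "extra-info": "description"}
proposed = {"area": "address", "contact-phone": "agent-phone",
            "extra-info": "address"}
accuracy = matching_accuracy(proposed, gold)
```

Here two of the three mappings are correct, so the accuracy is two thirds, expressed as a percentage.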
 
16 High Matching Accuracy
[Figure: average matching accuracy (%)]
- LSD's accuracy: 71 - 92%
- Best single base learner: 42 - 72%
- Meta-learner: + 5 - 22%
- Constraint handler: + 7 - 13%
- XML learner: + 0.8 - 6%
17 Performance Sensitivity
[Figure: average matching accuracy (%) vs. number of data listings per source]
18 Contribution of Schema vs. Data
[Figure: average matching accuracy (%)]
- More experiments in the paper!
 
19 Related Work
- Rule-based approaches
 - TRANSCM [MiloZohar98], ARTEMIS [CastanoAntonellis99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
 - utilize only schema information
- Learner-based approaches
 - SEMINT [LiClifton94], ILA [PerkowitzEtzioni95]
 - employ a single learner, limited applicability
- Others
 - DELTA [Clifton et al. 97], CLIO [Miller et al. 00][Yan et al. 01]
- Multi-strategy learning in other domains
 - series of workshops [91,93,96,98,00]
 - [Freitag98], Proverb [Keim et al. 99]
 
20 Summary
- LSD project
 - applies machine learning to schema matching
- Main ideas & contributions
 - use of multi-strategy learning
 - extend learning to handle domain & user-specified constraints
 - develop XML learner
- System design: a contribution to generic schema-matching
 - highly modular & extensible
 - handle multiple types of knowledge
 - continuously improve over time
 
21 Ongoing & Future Work
- Improve accuracy 
 - address current system limitations 
 - Extend LSD to more complex mappings 
 - Apply LSD to other application contexts 
 - data translation 
 - data warehousing 
 - e-commerce 
 - information extraction 
 - semantic web 
- www.cs.washington.edu/homes/anhai/lsd.html
22 Contribution of Each Component
[Figure: average matching accuracy (%) for LSD without the Name Learner, without Naive Bayes, without the Whirl Learner, without the Constraint Handler, and for the complete LSD system]
23 Exploiting Hierarchical Structure
- Existing learners flatten out all structures
- Developed XML learner
 - similar to the Naive Bayes learner
  - input instance = bag of tokens
 - differs in one crucial aspect
  - consider not only text tokens, but also structure tokens

<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact>

<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
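The difference between the two instances above shows up once structure tokens are added to the bag of tokens. The following sketch uses Python's standard `xml.etree.ElementTree`. The token format (`<tag>` for each nested element) and the function names are assumptions for illustration, and the slides do not define exactly which structure tokens LSD uses.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def words(text):
    """Lowercase alphanumeric text tokens; tolerates missing text."""
    return re.findall(r"[a-z0-9]+", (text or "").lower())


def bag_of_tokens(xml_text):
    """Bag of text tokens plus one structure token per nested element."""
    root = ET.fromstring(xml_text)
    bag = Counter(words(root.text))
    for node in root.iter():
        if node is root:
            continue  # the instance's own tag is the element being matched
        bag[f"<{node.tag}>"] += 1      # structure token (assumed format)
        bag.update(words(node.text))   # text inside the nested element
        bag.update(words(node.tail))   # text following the nested element
    return bag


contact_bag = bag_of_tokens(
    "<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact>"
)
description_bag = bag_of_tokens(
    "<description> Victorian house with a view. Name your price! "
    "To see it, contact Gail Murphy at MAX Realtors. </description>"
)
```

Both bags contain the text tokens gail, murphy, max, and realtors, so a flat text learner may confuse the two elements. Only the contact instance carries the structure tokens `<name>` and `<firm>`, which gives a Naive Bayes style learner evidence to tell them apart.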