Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach

CS 652 Information Extraction and Information Integration

Provided by: LiXu8

1
Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach
  • AnHai Doan
  • Pedro Domingos
  • Alon Halevy

2
Problem and Solution
  • Problem
      • Large-scale Data Integration Systems
      • Bottleneck: Semantic Mappings
  • Solution
      • Multi-strategy Learning
      • Integrity Constraints
      • XML Structure Learner
      • 1-1 Mappings

3
Learning Source Descriptions (LSD)
  • Components
      • Base learners
      • Meta-learner
      • Prediction converter
      • Constraint handler
  • Operating Phases
      • Training phase
      • Matching phase

4
Learners
  • Base Learners
      • Name Matcher (WHIRL)
      • Content Matcher (WHIRL)
      • Naïve Bayes Learner
      • County-Name Recognizer
      • XML Learner
  • Meta-Learner (Stacking)
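
The stacking step listed above can be illustrated with a minimal sketch. It assumes the meta-learner holds one weight per (base learner, label) pair and combines base-learner confidence scores by a weighted sum; the slides do not give the combination rule or how the weights are trained, so the function name, the weight values, and the learner names below are hypothetical.

```python
def combine_predictions(base_predictions, weights):
    """Combine base-learner scores into one score per label.

    base_predictions: {learner_name: {label: confidence}}
    weights: {(learner_name, label): weight}  (hypothetical, assumed already trained)
    """
    combined = {}
    for learner, scores in base_predictions.items():
        for label, score in scores.items():
            # A missing weight is treated as 0, i.e. that learner is ignored for the label.
            weight = weights.get((learner, label), 0.0)
            combined[label] = combined.get(label, 0.0) + weight * score
    return combined


# Illustrative numbers only, not taken from the presentation.
base_predictions = {
    "name_matcher": {"ADDRESS": 0.8, "DESCRIPTION": 0.2},
    "naive_bayes": {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
}
weights = {
    ("name_matcher", "ADDRESS"): 0.4,
    ("naive_bayes", "ADDRESS"): 0.6,
    ("name_matcher", "DESCRIPTION"): 0.3,
    ("naive_bayes", "DESCRIPTION"): 0.7,
}
combined = combine_predictions(base_predictions, weights)
```

Keeping a separate weight per label lets the combination trust a learner for some labels and not for others, which is one plausible reading of why a meta-learner sits on top of several base learners.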

5
Naïve Bayes Learner
Input instance: bags of tokens
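
As a rough illustration of a Naïve Bayes learner over bags of tokens, here is a minimal multinomial sketch with add-one (Laplace) smoothing. The smoothing choice, the class name, and the toy training data are assumptions made for the example; the slide only states that each input instance is a bag of tokens.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesLearner:
    """Multinomial Naive Bayes over bags of tokens (illustrative sketch)."""

    def __init__(self):
        self.label_counts = Counter()             # number of training instances per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocabulary = set()

    def train(self, examples):
        # examples: iterable of (list_of_tokens, label) pairs
        for tokens, label in examples:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, tokens):
        """Return {label: probability}, normalized over the known labels."""
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for label, count in self.label_counts.items():
            n_label = sum(self.token_counts[label].values())
            log_p = math.log(count / total)
            for token in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                numerator = self.token_counts[label][token] + 1
                log_p += math.log(numerator / (n_label + vocab_size))
            log_scores[label] = log_p
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        exp_scores = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(exp_scores.values())
        return {l: v / z for l, v in exp_scores.items()}


learner = NaiveBayesLearner()
learner.train([
    (["miami", "fl"], "ADDRESS"),
    (["boston", "ma"], "ADDRESS"),
    (["great", "house"], "DESCRIPTION"),
])
scores = learner.predict(["miami"])
```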
6
XML Learner
Input instance: bags of tokens, including text tokens and structure tokens
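
One way to picture a bag that mixes text tokens with structure tokens is the sketch below. The `node:` and `edge:` token formats and the use of raw tag names as structure tokens are assumptions made for illustration; the slide does not specify how structure tokens are built.

```python
import xml.etree.ElementTree as ET


def xml_tokens(element):
    """Return text tokens plus hypothetical structure tokens for an XML element."""
    tokens = []

    def visit(node):
        # Text tokens: lowercased words of the element's own text.
        if node.text and node.text.strip():
            tokens.extend(node.text.lower().split())
        for child in node:
            # Structure tokens: one per child node and one per parent-child edge.
            tokens.append(f"node:{child.tag}")
            tokens.append(f"edge:{node.tag}->{child.tag}")
            visit(child)
        # Tail text after child elements is ignored in this sketch.

    visit(element)
    return tokens


# Made-up instance, not taken from the presentation.
contact = ET.fromstring(
    "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
)
bag = xml_tokens(contact)
```

A plain text-only learner would see just the four words here; the extra tokens record that the words sit under `name` and `firm` inside `contact`.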
7
Domain Constraint Handler
  • Domain Constraints
      • Impose semantic regularities on schemas and
        source data in the domain
      • Can be specified at the beginning
      • When creating a mediated schema
      • Independent of any actual source schema
  • Constraint Handler
      • Inputs: domain constraints, Prediction Converter
        output, user feedback
      • Output: mappings
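
The role of the constraint handler can be sketched as a search for the highest-scoring assignment of labels to source tags that violates no domain constraint. The sketch below uses brute-force enumeration and a product of scores purely for clarity; the search strategy, the scoring rule, and the example constraint are assumptions, not details given on the slide.

```python
from itertools import product


def best_mapping(predictions, constraints):
    """Pick the best label assignment that satisfies every constraint.

    predictions: {source_tag: {label: score}}
    constraints: list of functions mapping -> bool (hypothetical domain constraints)
    """
    tags = list(predictions)
    labels = sorted({label for scores in predictions.values() for label in scores})
    best, best_score = None, -1.0
    for combo in product(labels, repeat=len(tags)):
        mapping = dict(zip(tags, combo))
        if not all(check(mapping) for check in constraints):
            continue
        score = 1.0
        for tag, label in mapping.items():
            score *= predictions[tag].get(label, 0.0)
        if score > best_score:
            best, best_score = mapping, score
    return best  # None if no assignment satisfies the constraints


def at_most_one_address(mapping):
    # Hypothetical constraint: only one source tag may map to ADDRESS.
    return list(mapping.values()).count("ADDRESS") <= 1


predictions = {
    "location": {"ADDRESS": 0.7, "DESCRIPTION": 0.3},
    "comments": {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
}
unconstrained = best_mapping(predictions, [])
constrained = best_mapping(predictions, [at_most_one_address])
```

Without the constraint both tags would be mapped to ADDRESS; with it, the weaker of the two ADDRESS predictions is pushed to its next-best label. Enumeration is exponential in the number of tags, so a real system would need a smarter search.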

8
Training Phase
  • Manually Specify Mappings for Several Sources
  • Extract Source Data
  • Create Training Data for each Base Learner
  • Train the Base Learners
  • Train the Meta-Learner

9
Example 1 (Training Phase)
10
Example 1 (Cont.)
Training Data
Source Data
11
Example 1 (Cont.)
Source Data: (location: Miami, FL)
Training Data: (location, ADDRESS) and ("Miami, FL", ADDRESS)
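
The step from source data to training data in this example can be sketched as follows. The sketch assumes that tag names yield one kind of training example and tag contents another, which matches the two pairs shown on the slide; the function name and the split into two lists are illustrative.

```python
def make_training_data(source_data, mappings):
    """Build labeled examples from extracted source data.

    source_data: list of (tag, content) pairs extracted from a source
    mappings: {tag: mediated_schema_label}, specified manually
    """
    name_examples = []     # (tag name, label), one per distinct tag
    content_examples = []  # (content, label), one per data instance
    for tag, content in source_data:
        label = mappings[tag]
        if (tag, label) not in name_examples:
            name_examples.append((tag, label))
        content_examples.append((content, label))
    return name_examples, content_examples


source_data = [("location", "Miami, FL")]
mappings = {"location": "ADDRESS"}
name_examples, content_examples = make_training_data(source_data, mappings)
```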
12
Matching Phase
  • Extract and Collect Data
  • Match each Source-DTD Tag
  • Apply the Constraint Handler
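
Matching a source-DTD tag requires turning predictions for its individual data instances into a single prediction for the tag. A minimal sketch of that conversion, assuming a simple per-label average (the slides name a prediction converter but do not state its rule):

```python
def convert_predictions(instance_predictions):
    """Average per-instance label scores into one tag-level prediction.

    instance_predictions: list of {label: score}, one dict per data instance
    """
    totals = {}
    for scores in instance_predictions:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    count = len(instance_predictions)
    # An empty input leaves totals empty, so no division happens.
    return {label: total / count for label, total in totals.items()}


# Illustrative scores for two instances of the same source tag.
tag_prediction = convert_predictions([
    {"ADDRESS": 0.8, "DESCRIPTION": 0.2},
    {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
])
```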

13
Example 2 (Matching Phase)
14
Example 2 (Cont.)
15
Example 2 (Cont.)
16
Experimental Evaluation
  • Measures
      • Matching accuracy of a source
      • Average matching accuracy of a source
      • Average matching accuracy of a domain
  • Experiment Results
      • Average matching accuracy for different domains
      • Contributions of base learners and domain
        constraint handler
      • Contributions of schema information and instance
        information
      • Performance sensitivity to the number of data
        instances
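
The accuracy measures can be sketched under one assumed definition: the matching accuracy of a source is the fraction of its tags with a known correct mapping that the system maps correctly, and the averages are plain means. The slides name the measures without defining them, so treat the definitions and the numbers below as illustrative.

```python
def matching_accuracy(predicted, gold):
    """Fraction of tags in `gold` whose predicted label is correct.

    predicted, gold: {source_tag: label}
    """
    if not gold:
        return 0.0
    correct = sum(1 for tag, label in gold.items() if predicted.get(tag) == label)
    return correct / len(gold)


def average_accuracy(accuracies):
    """Mean of several accuracies, e.g. over runs of a source or sources of a domain."""
    accuracies = list(accuracies)
    return sum(accuracies) / len(accuracies) if accuracies else 0.0


gold = {"location": "ADDRESS", "comments": "DESCRIPTION"}
predicted = {"location": "ADDRESS", "comments": "ADDRESS"}
source_accuracy = matching_accuracy(predicted, gold)
domain_accuracy = average_accuracy([source_accuracy, 1.0])
```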

17
Limitations
  • Enough Training Data
  • Domain Dependent Learners
  • Ambiguities in Sources
  • Efficiency
  • Overlapping of Schemas

18
Conclusion and Future Work
  • Improve over time
  • Extensible framework
  • Multiple types of knowledge
  • Non-1-1 mappings?