Alon Halevy - PowerPoint PPT Presentation

About This Presentation
Title:

Alon Halevy


Transcript and Presenter's Notes


1
Learning to Map Between Schemas and Ontologies
  • Alon Halevy
  • University of Washington
  • Joint work with Anhai Doan and Pedro Domingos

2
Agenda
  • Ontology mapping is a key problem in many
    applications
  • Data integration
  • Semantic web
  • Knowledge management
  • E-commerce
  • LSD
  • Solution that uses multi-strategy learning.
  • We've started with schema matching (i.e., very
    simple ontologies)
  • Currently extending to more expressive
    ontologies.
  • Experiments show the approach is very promising!

3
The Structure Mapping Problem
  • Types of structures
  • Database schemas, XML DTDs, ontologies, ...
  • Input
  • Two (or more) structures, S1 and S2
  • Data instances for S1 and S2
  • Background knowledge
  • Output
  • A mapping between S1 and S2
  • Should enable translating between data instances.
  • Semantics of mapping?

4
Semantic Mappings between Schemas
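  • Source schemas: XML DTDs

[Figure: two DTD trees. First schema: house with children address, num-baths and contact-info (agent-name, agent-phone). Second schema: house with children location, contact (name, phone), full-baths and half-baths. Arrows between the trees mark a 1-1 mapping and a non 1-1 mapping.]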
5
Motivation
  • Database schema integration
  • A problem as old as databases themselves.
  • database merging, data warehouses, data migration
  • Data integration / information gathering agents
  • On the WWW, in enterprises, large science
    projects
  • Model management
  • Model matching: key operator in an algebra where
    models and mappings are first-class objects.
  • See Bernstein et al., 2000 for more.
  • The Semantic Web
  • Ontology mapping.
  • System interoperability
  • E-services, application integration, B2B
    applications, ...

6
Desiderata from Proposed Solutions
  • Accuracy, efficiency, ease of use.
  • Realistic expectations
  • Unlikely to be fully automated. Need user in the
    loop.
  • Some notion of semantics for mappings.
  • Extensibility
  • Solution should exploit additional background
    knowledge.
  • Memory, knowledge reuse
  • System should exploit previous manual or
    automatically generated matchings.
  • Key idea behind LSD.

7
LSD Overview
  • L(earning) S(ource) D(escriptions)
  • Problem: generating semantic mappings between
    mediated schema and a large set of data source
    schemas.
  • Key idea: generate the first mappings manually,
    and learn from them to generate the rest.
  • Technique: multi-strategy learning (extensible!)
  • Step 1
  • [SIGMOD 2001]: 1-1 mappings between XML DTDs.
  • Current focus
  • Complex mappings
  • Ontology mapping.

8
Outline
  • Overview of structure mapping
  • Data integration and source mappings
  • LSD architecture and details
  • Experimental results
  • Current work.

9
Data Integration
[Figure: data integration architecture. A user query ("Find houses with four bathrooms priced under 500,000") is posed against the mediated schema; query reformulation and optimization translate it onto source schemas 1, 2 and 3, which sit on top of wrappers for homes.com, realestate.com and homeseekers.com.]
  • Applications: WWW, enterprises, science projects
  • Techniques: virtual data integration, warehousing,
    custom code.
10
Semantic Mappings between Schemas
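  • Source schemas: XML DTDs

[Figure: two DTD trees. First schema: house with children address, num-baths and contact-info (agent-name, agent-phone). Second schema: house with children location, contact (name, phone), full-baths and half-baths. Arrows between the trees mark a 1-1 mapping and a non 1-1 mapping.]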
11
Semantics (preliminary)
  • Semantics of mappings has received no attention.
  • Semantics of 1-1 mappings
  • Given
  • $R(A_1,\ldots,A_n)$ and $S(B_1,\ldots,B_m)$
  • 1-1 mappings $(A_i, B_j)$
  • Then, we postulate the existence of a relation $W$,
    s.t.
  • $\Pi_{C_1,\ldots,C_k}(W) = \Pi_{A_1,\ldots,A_k}(R)$,
  • $\Pi_{C_1,\ldots,C_k}(W) = \Pi_{B_1,\ldots,B_k}(S)$,
  • W also includes the unmatched attributes of R and
    S.
  • In English: R and S are projections of some
    universal relation W, and the mappings specify
    the projection variables and correspondences.
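
A small worked example may make the projection semantics concrete. The sketch below is illustrative only: the universal relation W, its attribute names C1 and C2, the sample rows, and the helper `project` are all assumptions introduced here, not part of LSD.

```python
# Hypothetical universal relation W: C1 and C2 are the shared (matched)
# attributes; description and phone are unmatched attributes of R and S.
W = [
    {"C1": "Miami, FL", "C2": 250000,
     "description": "Fantastic house", "phone": "(305) 729 0831"},
    {"C1": "Boston, MA", "C2": 110000,
     "description": "Great location", "phone": "(617) 253 1429"},
]


def project(rows, attrs):
    """Relational projection: keep only `attrs` and drop duplicate tuples."""
    return {tuple(row[a] for a in attrs) for row in rows}


# R and S expose the same facts under different attribute names.
R = [{"address": w["C1"], "price": w["C2"], "description": w["description"]}
     for w in W]
S = [{"location": w["C1"], "listed-price": w["C2"], "phone": w["phone"]}
     for w in W]

# The 1-1 mappings (address, location) and (price, listed-price) say that
# both attribute lists correspond to (C1, C2) in W.
shared = project(W, ["C1", "C2"])
r_side = project(R, ["address", "price"])
s_side = project(S, ["location", "listed-price"])
```

Under these assumptions the three projections coincide, which is what the two equations on this slide require.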

12
Why Matching is Difficult
  • Aims to identify same real-world entity
  • using names, structures, types, data values, etc.
  • Schemas represent same entity differently
  • different names => same entity
  • area & address => location
  • same names => different entities
  • area => location or square-feet
  • Schema & data never fully capture semantics!
  • not adequately documented, not sufficiently
    expressive
  • Intended semantics is typically subjective!
  • IBM Almaden Lab = IBM?
  • Cannot be fully automated. Often hard for humans.
    Committees are required!

13
Current State of Affairs
  • Finding semantic mappings is now the bottleneck!
  • largely done by hand
  • labor-intensive & error-prone
  • GTE: 4 hours/element for 27,000 elements
    [Li & Clifton 00]
  • Will only be exacerbated
  • data sharing & XML become pervasive
  • proliferation of DTDs
  • translation of legacy data
  • reconciling ontologies on semantic web
  • Need semi-automatic approaches to scale up!

14
Outline
  • Overview of structure mapping
  • Data integration and source mappings
  • LSD architecture and details
  • Experimental results
  • Current work.

15
The LSD Approach
  • User manually maps a few data sources to the
    mediated schema.
  • LSD learns from the mappings, and proposes
    mappings for the rest of the sources.
  • Several types of knowledge are used in learning:
  • Schema elements, e.g., attribute names
  • Data elements: ranges, formats, word frequencies,
    value frequencies, length of texts.
  • Proximity of attributes
  • Functional dependencies, number of attribute
    occurrences.
  • One learner does not fit all. Use multiple
    learners and combine with meta-learner.

16
Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments

Data from realestate.com
  • listed-price: 250,000; 110,000; ...
  • location: Miami, FL; Boston, MA; ...
  • phone: (305) 729 0831; (617) 253 1429; ...
  • comments: Fantastic house; Great location; ...

Learned hypotheses
  • If "phone" occurs in the name => agent-phone
  • If "fantastic" & "great" occur frequently in
    data values => description

Data from homes.com
  • price: 550,000; 320,000; ...
  • contact-phone: (278) 345 7215; (617) 335 2315; ...
  • extra-info: Beautiful yard; Great beach; ...
17
Multi-Strategy Learning
  • Use a set of base learners
  • Name learner, Naïve Bayes, Whirl, XML learner
  • And a set of recognizers
  • County name, zip code, phone numbers.
  • Each base learner produces a prediction weighted
    by confidence score.
  • Combine base learners with a meta-learner, using
    stacking.

18
Base Learners
  • Name Learner
  • training pairs: (contact-info, office-address),
    (contact, agent-phone), (phone, agent-phone),
    (listed-price, price)
  • new element: (contact-phone, ?)
  • contact-phone => (agent-phone, 0.7),
    (office-address, 0.3)
  • Naive Bayes Learner [Domingos & Pazzani 97]
  • "Kent, WA" => (address, 0.8), (name, 0.2)
  • Whirl Learner [Cohen & Hirsh 98]
  • XML Learner
  • exploits hierarchical structure of XML data
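
The name-learner idea on this slide can be sketched as a nearest-neighbour vote over element names. This is a minimal illustration, not the LSD implementation: the slide does not say which similarity measure is used, so Jaccard overlap of hyphen-separated name tokens is a stand-in, and `tokens`, `jaccard` and `predict_label` are hypothetical helpers.

```python
from collections import defaultdict

# (source element name, mediated-schema label) pairs from the slide.
TRAINING_PAIRS = [
    ("contact-info", "office-address"),
    ("contact", "agent-phone"),
    ("phone", "agent-phone"),
    ("listed-price", "price"),
]


def tokens(name):
    """Split an element name into lowercase tokens on hyphens."""
    return set(name.lower().split("-"))


def jaccard(a, b):
    """Token-set overlap; an assumed stand-in for the real name similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_label(name, training_pairs):
    """Sum similarity per label, then normalise to confidence scores."""
    scores = defaultdict(float)
    for source_name, label in training_pairs:
        scores[label] += jaccard(tokens(name), tokens(source_name))
    total = sum(scores.values())
    if total == 0:
        return {}
    return {label: s / total for label, s in scores.items() if s > 0}


prediction = predict_label("contact-phone", TRAINING_PAIRS)
```

With this stand-in measure, agent-phone ranks above office-address for contact-phone, in line with the slide; the exact confidence values depend on the similarity measure and need not equal the slide's 0.7 and 0.3.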

19
Training the Base Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments

Data from realestate.com
<location> Miami, FL </> <listed-price> 250,000 </> <phone> (305) 729 0831 </>
<comments> Fantastic house </>
<location> Boston, MA </> <listed-price> 110,000 </> <phone> (617) 253 1429 </>
<comments> Great location </>

Name Learner training examples
(location, address) (listed-price, price) (phone, agent-phone) ...

Naive Bayes Learner training examples
("Miami, FL", address) ("250,000", price) ("(305) 729 0831", agent-phone) ...
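
Training a data-value learner from (value, label) pairs of this kind can be sketched as follows. This is a generic multinomial Naive Bayes with add-one smoothing, offered as an assumption about the general technique rather than as LSD's code; the tokenizer, the class name `NaiveBayesMatcher`, and the query "Orlando, FL" are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase words and digit runs; a deliberately simple tokenizer."""
    return re.findall(r"[a-z]+|\d+", text.lower())


class NaiveBayesMatcher:
    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (data value, mediated-schema label)."""
        for value, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(value):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, value):
        """Return normalised confidence scores per label."""
        toks = tokenize(value)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in toks:
                # Add-one (Laplace) smoothing for unseen tokens.
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(exp.values())
        return {l: v / z for l, v in exp.items()}


matcher = NaiveBayesMatcher()
matcher.train([
    ("Miami, FL", "address"), ("Boston, MA", "address"),
    ("250,000", "price"), ("110,000", "price"),
    ("(305) 729 0831", "agent-phone"), ("(617) 253 1429", "agent-phone"),
    ("Fantastic house", "description"), ("Great location", "description"),
])
nb_prediction = matcher.predict("Orlando, FL")
```

Because "fl" was seen only under address, the query is scored highest for address even though "orlando" is unseen.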
20
Entity Recognizers
  • Use pre-programmed knowledge to identify specific
    types of entities
  • date, time, city, zip code, name, etc.
  • house-area (30 X 70, 500 sq. ft.)
  • county-name recognizer
  • Recognizers often have nice characteristics
  • easy to construct
  • many off-the-shelf research & commercial products
  • applicable across many domains
  • help with special cases that are hard to learn
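
A recognizer of this kind can be as simple as a regular expression applied to a column's values. The sketch below is an assumption about how such a module might look: the patterns cover only common US zip-code and phone formats, and `recognize` is a hypothetical helper that reports the fraction of matching values as a confidence score.

```python
import re

# Assumed patterns: 5-digit or ZIP+4 codes, and 10-digit US phone numbers
# with optional parentheses and space, dot or hyphen separators.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")
PHONE_RE = re.compile(r"^\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$")


def recognize(values, pattern):
    """Return the fraction of non-empty values that match `pattern`."""
    cleaned = [v.strip() for v in values if v.strip()]
    if not cleaned:
        return 0.0
    hits = sum(1 for v in cleaned if pattern.match(v))
    return hits / len(cleaned)


phones = ["(305) 729 0831", "(617) 253 1429"]
phone_score = recognize(phones, PHONE_RE)
zip_score = recognize(["98195", "Seattle"], ZIP_RE)
```

The returned fraction can then be treated like any other base-learner confidence score.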

21
Meta-Learner: Stacking
  • Training of meta-learner produces a weight for
    every pair of
  • (base-learner, mediated-schema element)
  • weight(Name-Learner, address) = 0.1
  • weight(Naive-Bayes, address) = 0.9
  • Combining predictions of meta-learner
  • computes weighted sum of base-learner confidence
    scores

Instance: <area> Seattle, WA </>
Name Learner: (address, 0.6)
Naive Bayes: (address, 0.8)
Meta-Learner: (address, 0.6 * 0.1 + 0.8 * 0.9 = 0.78)
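
The weighted sum on this slide can be written out directly. The dictionary layout and the function name `combine` are illustrative assumptions; the weights and scores are the ones given on the slide.

```python
# Weights per (base learner, mediated-schema element), as on the slide.
WEIGHTS = {
    ("name_learner", "address"): 0.1,
    ("naive_bayes", "address"): 0.9,
}


def combine(label, base_scores, weights):
    """Weighted sum of base-learner confidence scores for one label."""
    return sum(weights[(learner, label)] * score
               for learner, score in base_scores.items())


# <area> Seattle, WA </>: Name Learner says 0.6, Naive Bayes says 0.8.
meta_score = combine("address",
                     {"name_learner": 0.6, "naive_bayes": 0.8},
                     WEIGHTS)
# 0.6 * 0.1 + 0.8 * 0.9, i.e. approximately 0.78
```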
22
Training the Meta-Learner
  • For address

Extracted XML instances        Name Learner   Naive Bayes   True prediction
<location> Miami, FL </>       0.5            0.8           1
<listed-price> 250,000 </>     0.4            0.3           0
<area> Seattle, WA </>         0.3            0.9           1
<house-addr> Kent, WA </>      0.6            0.8           1
<num-baths> 3 </>              0.3            0.3           0
...                            ...            ...           ...

Least-squares linear regression
Weight(Name-Learner, address) = 0.1
Weight(Naive-Bayes, address) = 0.9
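
The regression step can be sketched in a few lines. This is a minimal least-squares fit for two base learners, assuming no intercept term (the slide does not say whether one is used); `fit_two_weights` is a hypothetical helper, and the target column below is synthetic, generated from weights 0.1 and 0.9 so that the fit can be checked. In LSD's setting the target column would hold the true 0/1 labels.

```python
def fit_two_weights(rows):
    """Least squares for y ~ w1*x1 + w2*x2 via the 2x2 normal equations.

    rows: iterable of (x1, x2, y) with x1 and x2 the two base-learner
    confidence scores and y the target value.
    """
    rows = list(rows)
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    b1 = sum(x1 * y for x1, _, y in rows)
    b2 = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("base-learner scores are collinear")
    # Cramer's rule on [[s11, s12], [s12, s22]] [w1, w2]^T = [b1, b2]^T.
    w1 = (b1 * s22 - s12 * b2) / det
    w2 = (s11 * b2 - s12 * b1) / det
    return w1, w2


# Synthetic check: targets generated exactly from weights (0.1, 0.9).
scores = [(0.5, 0.8), (0.4, 0.3), (0.3, 0.9), (0.6, 0.8)]
rows = [(a, b, 0.1 * a + 0.9 * b) for a, b in scores]
w_name, w_bayes = fit_two_weights(rows)
```

On noisy 0/1 labels the fitted weights will generally not be this clean, and they are not constrained to lie between 0 and 1.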
23
Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info

area (each instance: Name Learner + Naive Bayes => Meta-Learner)
  • <area> Seattle, WA </> => (address,0.8), (description,0.2)
  • <area> Kent, WA </> => (address,0.6), (description,0.4)
  • <area> Austin, TX </> => (address,0.7), (description,0.3)
  • combined prediction for area: (address,0.7), (description,0.3)

day-phone
  • <day-phone> (278) 345 7215 </> <day-phone> (617) 335 2315 </>
    <day-phone> (512) 427 1115 </>
  • combined prediction: (agent-phone,0.9), (description,0.1)

extra-info
  • <extra-info> Beautiful yard </> <extra-info> Great beach </>
    <extra-info> Close to Seattle </>
  • combined prediction: (description,0.8), (address,0.2)
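
The slide's numbers for the area column are consistent with averaging the per-instance meta-learner predictions; that reading is an inference from the figures, not something the slide states. Under that assumption, a hypothetical `combine_instances` helper looks like this:

```python
from collections import defaultdict


def combine_instances(instance_predictions):
    """Average per-instance label scores into one column-level prediction."""
    if not instance_predictions:
        return {}
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, score in prediction.items():
            totals[label] += score
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}


# Meta-learner output for the three <area> instances on the slide.
area_instances = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area_prediction = combine_instances(area_instances)
# address averages to about 0.7, description to about 0.3
```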
24
The Constraint Handler
  • Extends learning to incorporate constraints
  • hard constraints
  • a = address & b = address => a = b
  • a = house-id => a is a key
  • a = agent-info & b = agent-name => b is
    nested in a
  • soft constraints
  • a = agent-phone & b = agent-name =>
    a & b are usually
    close to each other
  • user feedback = hard or soft constraints
  • Details in [Doan et al., SIGMOD 2001]

25
The Current LSD System
[Figure: LSD system architecture, split into a Training Phase and a Matching Phase. Labeled boxes: Mediated schema, Source schemas, Data listings, Domain Constraints, User Feedback, Base-Learner 1 ... Base-Learner k, Meta-Learner, Constraint Handler, Mappings.]
26
Outline
  • Overview of structure mapping
  • Data integration and source mappings
  • LSD architecture and details
  • Experimental results
  • Current work.

27
Empirical Evaluation
  • Four domains
  • Real Estate I & II, Course Offerings, Faculty
    Listings
  • For each domain
  • create mediated DTD & domain constraints
  • choose five sources
  • extract & convert data listings into XML
    (faithful to schema!)
  • mediated DTDs: 14 - 66 elements, source DTDs: 13
    - 48
  • Ten runs for each experiment - in each run
  • manually provide 1-1 mappings for 3 sources
  • ask LSD to propose mappings for remaining 2
    sources
  • accuracy = % of 1-1 mappings correctly identified

28
Matching Accuracy
Average matching accuracy (%)
LSD's accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%
29
Sensitivity to Amount of Available Data
[Figure: average matching accuracy (%) vs. number of data listings per source (Real Estate I)]
30
Contribution of Schema vs. Data
Average matching accuracy (%)

  • LSD with only schema info.
  • LSD with only data info.
  • Complete LSD
  • More experiments in the paper [Doan et al. 01]

31
Reasons for Incorrect Matching
  • Unfamiliarity
  • suburb
  • solution add a suburb-name recognizer
  • Insufficient information
  • correctly identified general type, failed to
    pinpoint exact type
  • <agent-name>Richard Smith</><phone> (206) 234
    5412 </>
  • solution: add a proximity learner
  • Subjectivity
  • house-style = description?

32
Outline
  • Overview of structure mapping
  • Data integration and source mappings
  • LSD architecture and details
  • Experimental results
  • Current work.

33
Moving Up the Expressiveness Ladder
  • Schemas are very simple ontologies.
  • More expressive power = more domain constraints.
  • Mappings become more complex, but constraints
    provide more to learn from.
  • Non 1-1 mappings
  • $F_1(A_1,\ldots,A_m) = F_2(B_1,\ldots,B_m)$
  • Ontologies (of various flavors)
  • Class hierarchy (i.e., containment on unary
    relations)
  • Relationships between objects
  • Constraints on relationships

34
Finding Non 1-1 Mappings (Current Work)
  • Given two schemas, find
  • 1-many mappings: address = concat(city, state)
  • many-1: half-baths + full-baths = num-baths
  • many-many: concat(addr-line1, addr-line2) =
    concat(street, city, state)
  • 1-many mappings
  • expressed as query
  • value correspondence expression: room-rate = rate *
    (1 + tax-rate)
  • relationship: state of tax-rate = state of
    hotel that has rate
  • special case: 1-many mappings between two
    relational tables

Mediated schema: address, description, num-baths
Source schema: city, state, comments, half-baths, full-baths
35
Brute-Force Solution
  • Define a set of operators
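  • concat, +, -, *, /, etc.
  • For each set of mediated-schema columns
  • enumerate all possible mappings
  • evaluate & return best mapping

[Figure: candidate mappings m1, m2, ..., mk built from source-schema columns are compared with the mediated-schema columns; similarity is computed using all base learners, and the best mapping (m1 in the figure) is returned.]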
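
A brute-force search restricted to the concat operator can be sketched as below. Everything here is an illustrative assumption: the ", " separator, the limit of two columns, the sample rows, and especially `similarity`, which compares coarse format signatures as a cheap stand-in for "compute similarity using all base learners".

```python
import re
from itertools import permutations


def shape(value):
    """Collapse a string into a format signature: 'Kent, WA' -> 'Aa, A'."""
    value = re.sub(r"[A-Z]+", "A", value)
    value = re.sub(r"[a-z]+", "a", value)
    return re.sub(r"[0-9]+", "9", value)


def similarity(candidate_values, target_examples):
    """Fraction of candidate values whose shape occurs among the targets."""
    target_shapes = {shape(v) for v in target_examples}
    hits = sum(1 for v in candidate_values if shape(v) in target_shapes)
    return hits / len(candidate_values)


def best_concat_mapping(source_rows, target_examples, max_columns=2):
    """Enumerate ordered column combinations, score each, keep the best."""
    columns = list(source_rows[0])
    best, best_score = None, -1.0
    for size in range(1, max_columns + 1):
        for combo in permutations(columns, size):
            values = [", ".join(row[c] for c in combo) for row in source_rows]
            score = similarity(values, target_examples)
            if score > best_score:
                best, best_score = combo, score
    return best, best_score


SOURCE_ROWS = [
    {"city": "Kent", "state": "WA", "comments": "Great beach"},
    {"city": "Austin", "state": "TX", "comments": "Beautiful yard"},
]
ADDRESS_EXAMPLES = ["Miami, FL", "Boston, MA"]  # known mediated-schema values
best_mapping, best_score = best_concat_mapping(SOURCE_ROWS, ADDRESS_EXAMPLES)
```

The number of candidates grows quickly with the number of columns and operators, which is the motivation for the search-based approach on the next slide.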
36
Search-Based Solution
  • States = columns
  • goal state: mediated-schema column
  • initial states: all source-schema columns
  • use 1-1 matching to reduce the set of initial
    states
  • Operators: concat, +, -, *, /, etc.
  • Column-similarity
  • use all base learners & recognizers

37
Multi-Strategy Search
  • Use a set of expert modules L1, L2, ..., Ln
  • Each module
  • applies to only certain types of mediated-schema
    column
  • searches a small subspace
  • uses a cheap similarity measure to compare
    columns
  • Example
  • L1: text; concat; TF/IDF
  • L2: numeric; +, -, *, /; [Ho et al. 2000]
  • L3: address; concat; Naive Bayes
  • Search techniques
  • beam search as default
  • specialized, do not have to materialize columns

38
Multi-Strategy Search (cont'd)
  • Apply all applicable expert modules

L1: m11, m12, m13, ..., m1x
L2: m21, m22, m23, ..., m2y
L3: m31, m32, m33, ..., m3z
  • Combine modules' predictions & select the best one

[Figure: the top candidates m11, m12, m21, m22, m31, m32 are compared by computing similarity using all base learners; the best one (m11 in the figure) is selected.]
39
Related Work
  • Recognizers + Schema + 1-1 Matching
  • TRANSCM [Milo & Zohar 98], ARTEMIS
    [Castano & Antonellis 99], [Palopoli et al. 98],
    CUPID [Madhavan et al. 01]
  • Single Learner + 1-1 Matching
  • SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95]
  • Hybrid + 1-1 Matching
  • DELTA [Clifton et al. 97]
  • Multi-Strategy Learning: Learners + Recognizers,
    Schema + Data, 1-1 + non 1-1 Matching
  • LSD [Doan et al. 2000, 2001]
  • Schema + Data, 1-1 + non 1-1 Matching, Sophisticated
    Data-Driven User Interaction
  • CLIO [Miller et al. 00, Yan et al. 01]
  • ?
40
Summary
  • LSD
  • uses multi-strategy learning to
    semi-automatically generate semantic mappings.
  • LSD is extensible and incorporates domain and
    user knowledge, and previous techniques.
  • Experimental results show the approach is very
    promising.
  • Future work and issues to ponder
  • Accommodating more expressive languages:
    ontologies
  • Reuse of learned concepts from related domains.
  • Semantics?
  • Data management is a fertile area for Machine
    Learning research!

41
Backup Slides
42
Mapping Maintenance
[Figure: mappings m1, m2, m3 between source-schema S and mediated-schema M]
  • Ten months later ...
  • are the mappings still correct?

[Figure: the same mappings m1, m2, m3 between source-schema S and mediated-schema M, ten months later]
43
Information Extraction from Text
  • Extract data fragments from text documents
  • date, location, victim's name from a news
    article
  • Intensive research on free-text documents
  • Many documents do have substantial structure
  • XML pages, name card, tables, list
  • Each such document = a data source
  • structure forms a schema
  • only one data value per schema element
  • real data source has many data values per
    schema element
  • Ongoing research in the IE community

44
Contribution of Each Component
[Figure: average matching accuracy (%) for LSD without the Name Learner, without Naive Bayes, without the Whirl Learner, without the Constraint Handler, and for the complete LSD system]
45
Exploiting Hierarchical Structure
  • Existing learners flatten out all structures
  • Developed XML learner
  • similar to the Naive Bayes learner
  • input instance = bag of tokens
  • differs in one crucial aspect
  • consider not only text tokens, but also structure
    tokens

<contact> <name> Gail Murphy </name> <firm>
MAX Realtors </firm> </contact>
<description> Victorian house with a view.
Name your price! To see it, contact Gail
Murphy at MAX Realtors. </description>
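
The difference between text tokens and structure tokens can be illustrated on the two instances above. This is a simplified sketch, not the XML learner itself: the token formats (`<tag>` for nodes, `parent>child` for edges) and the helper `structure_tokens` are assumptions, and text that follows a child element (tail text) is ignored for brevity.

```python
import xml.etree.ElementTree as ET


def structure_tokens(element):
    """Bag of tokens: words in the text plus node and edge tokens."""
    toks = []

    def visit(node):
        toks.append(f"<{node.tag}>")                 # node token
        if node.text and node.text.strip():
            toks.extend(node.text.lower().split())   # text tokens
        for child in node:
            toks.append(f"{node.tag}>{child.tag}")   # edge token
            visit(child)

    visit(element)
    return toks


contact = ET.fromstring(
    "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>")
description = ET.fromstring(
    "<description>Victorian house with a view. Name your price! "
    "To see it, contact Gail Murphy at MAX Realtors.</description>")

contact_tokens = structure_tokens(contact)
description_tokens = structure_tokens(description)
```

Both instances share text tokens such as "gail" and "murphy", so a flat bag-of-words learner has little to separate them; the structure tokens `contact>name` and `contact>firm` occur only in the contact instance.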
46
Domain Constraints
  • Impose semantic regularities on sources
  • verified using schema or data
  • Examples
  • a = address & b = address => a = b
  • a = house-id => a is a key
  • a = agent-info & b = agent-name => b is
    nested in a
  • Can be specified up front
  • when creating mediated schema
  • independent of any actual source schema

47
The Constraint Handler
Domain constraint: a = address & b = address => a = b

Predictions from Meta-Learner
  • area: (address,0.7), (description,0.3)
  • contact-phone: (agent-phone,0.9), (description,0.1)
  • extra-info: (address,0.6), (description,0.4)

Candidate mapping combinations
  • description for all three elements:
    0.3 * 0.1 * 0.4 = 0.012
  • area: address, contact-phone: agent-phone,
    extra-info: address: 0.7 * 0.9 * 0.6 = 0.378
  • area: address, contact-phone: agent-phone,
    extra-info: description: 0.7 * 0.9 * 0.4 = 0.252

  • Can specify arbitrary constraints
  • User feedback = domain constraint
  • ad-id = house-id
  • Extended to handle domain heuristics
  • a = agent-phone & b = agent-name => a & b are
    usually close to each other
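
The arithmetic on this slide suggests that a mapping combination is scored by the product of its confidence scores and that combinations violating a hard constraint are discarded. The sketch below follows that reading, which is an assumption; exhaustive enumeration is used only because the example is tiny, and `at_most_one_address` and `best_assignment` are hypothetical names.

```python
from itertools import product

# Meta-learner predictions from the slide.
PREDICTIONS = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(assignment):
    """Hard constraint: a = address & b = address => a = b."""
    return sum(1 for label in assignment.values() if label == "address") <= 1


def best_assignment(predictions, constraints):
    """Return the highest-scoring label assignment satisfying all constraints."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        assignment = dict(zip(elements, labels))
        if not all(check(assignment) for check in constraints):
            continue
        score = 1.0
        for element, label in assignment.items():
            score *= predictions[element][label]
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


unconstrained, unconstrained_score = best_assignment(PREDICTIONS, [])
constrained, constrained_score = best_assignment(
    PREDICTIONS, [at_most_one_address])
```

Without the constraint the top combination maps both area and extra-info to address (score about 0.378); with it, extra-info falls back to description (score about 0.252), matching the slide.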