SCHEMA Matching Mapping Integration Mediation

1
SCHEMA Matching Mapping Integration
Mediation
  • Zohra Bellahsène, Khalid Saleem, Remi Colleta
  • bella, saleem, colleta_at_lirmm.fr
  • Master 2, Computer Science
  • Course 350 - XML Data Integration
  • I2S - LIRMM

28-09-2007
2
Road Map
  • Schema
  • Matching Schemas
  • Match Operation
  • Application Domains for Schema Matching
  • Match Techniques
  • Result of Match Operation
  • Mapping
  • Integration and Mediation
  • Existing Match Tools and their Deficiencies
  • Large Scale Scenarios
  • Search Space Optimization
  • Charlie's Algorithm, Similarity Flooding,
    S-Match, COMA
  • Tree Mining approach perspective
  • Conclusion

3
Schema
  • Data representation: a particular way to
    structure the data, e.g. XML DTDs, XML Schemas,
    ontologies, OO representations or ER models.
  • It consists of a finite set of elements
    (representation elements).
  • An element is a syntactic construct of the
    representation, e.g.
  • XML elements or attributes in DTDs and schemas,
  • concepts, attributes and relations in ontologies.
  • Each element is associated with a set of data
    instances.

4
Schema and Ontology
  • Schema: represents the database community
  • Schemas often do not provide explicit semantics
    for their data (ER, XML document schema).
  • Ontology: represents the AI community
  • Ontologies are logical systems that themselves
    obey some formal semantics. They are designed to be
    interpreted by computers for reasoning (OWL).
  • Schemas and ontologies are similar in the sense that
  • both provide a vocabulary of elements that
    describes a domain
  • both constrain the meaning of the terms used in the
    vocabulary (hierarchy of elements / relations
    among the elements)

5
XML (semi-structured)
XML Schema:
    <xs:element name="employee">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="firstname" type="xs:string"/>
          <xs:element name="lastname" type="xs:string"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
XML Document:
    <employee>
      <firstname>John</firstname>
      <lastname>Smith</lastname>
    </employee>
Data: John Smith
6
Schema vs Ontology examples
DARPA Agent Markup Language
Ontology Inference Layer
7
OWL
  • OWL is built on top of RDF
  • OWL is for processing information on the web
  • OWL was designed to be interpreted by computers
  • OWL was not designed for being read by people
  • OWL is written in XML

8
OWL Example
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://www.daml.org/2001/10/html/airport-ont">
  <owl:Ontology rdf:about="">
    <owl:versionInfo>$Id: airport-ont.daml,v 1.1 2002/03/14 06:24:16 mdean Exp $</owl:versionInfo>
    <rdfs:comment>Airport</rdfs:comment>
  </owl:Ontology>
  <rdfs:Class rdf:ID="Airport">
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#name"/>
        <owl:allValuesFrom rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#iataCode"/>
        <owl:allValuesFrom rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf>
    </rdfs:subClassOf>
  </rdfs:Class>
  <owl:DatatypeProperty rdf:ID="elevation"/>
  <owl:DatatypeProperty rdf:ID="iataCode"/>
  <owl:DatatypeProperty rdf:ID="icaoCode"/>
  <owl:DatatypeProperty rdf:ID="latitude"/>
  <owl:DatatypeProperty rdf:ID="location"/>
  <owl:DatatypeProperty rdf:ID="longitude"/>
  <owl:DatatypeProperty rdf:ID="name"/>
</rdf:RDF>
9
Taxonomy
  • Mathematically, a hierarchical taxonomy is a
    tree structure of classifications for a given set
    of objects
  • Much lighter than an ontology

[Figure: taxonomy tree for "CS Dept. US" with nodes Entity, Undergrad Courses, Grad Courses, People, Staff, Faculty, Assistant Professor, Associate Professor, Professor]
11
Match Operation
  • Input: two schemas / ontologies
  • Output: a similarity correspondence between each
    pair of elements (E1 of Schema A and E2 of
    Schema B)
  • The similarity correspondence can be
  • E1 = E2 (match)
  • E1 <> E2 (mismatch)
  • E1 ≈ E2 (partial match: E1 ∩ E2 or E1 ⊆ E2)

12
Match Operation
Books Schema A: price, book-title, author-name
Books Schema B: listed-price, title, a-fname, a-lname
[Figure: lines between the elements of Schema A and Schema B, each labelled "match" or "partial match"]
13
The 3 Dimensions of Schema Matching
X - Basic Match Research
Y - Research Domains
Z - Application Domains
14
Application and Research Domains
  • Data Interoperability
  • Data Integration
  • Data Warehousing
  • Catalogue Integration
  • Web Services Discovery and Composition
  • Query over the Web
  • ...
  • Data Exchange
  • E-commerce
  • Agents Communication
  • P2P DB Systems

Static: contributing schema set not evolving >>
matching is a one-time process
Dynamic: contributing schema set evolving >>
matching also evolves
15
Match Techniques based on ..
  • Schema
  • Data Instance

16
Schema Match Techniques
  • Schema Structure Level Techniques
  • Matching combinations of elements appearing
    together
  • Graph Matching (example: directed acyclic graph,
    tree)
  • Children, Descendants
  • Leaves
  • Relations

17
Match Techniques Element level
  • Language based
  • Tokenization, e.g. Tool_Kit → (Tool, Kit)
  • Lemmatization, e.g. Kits → Kit
  • Elimination, e.g. IsRelatedTo → Related
  • Word sense similarity: synonym, hypernym
    (generalization), hyponym (specialization) etc.,
    or part of speech: verb, noun, adjective etc.
  • DeliverTo ≠ InvoiceTo, but DeliverTo ≈ ShipTo
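
The first three language-based steps can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the stop-word list and the plural-stripping rule are simplifying assumptions, and word sense similarity is left out because it needs a thesaurus such as WordNet.

```python
import re

# Hypothetical stop-word list; a real matcher would use a fuller one.
STOP_WORDS = {"is", "to", "of", "the", "a", "an"}

def tokenize(label):
    """Split a label on underscores, hyphens, spaces and camelCase boundaries."""
    tokens = []
    for part in re.split(r"[_\-\s]+", label):
        # Put a space between a lowercase letter and the capital that follows it.
        tokens.extend(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", part).split())
    return tokens

def lemmatize(token):
    """Very naive lemmatization: lowercase and strip a plural 's'."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def normalize(label):
    """Tokenize, lemmatize, then eliminate stop words."""
    lemmas = [lemmatize(t) for t in tokenize(label)]
    return [t for t in lemmas if t not in STOP_WORDS]

print(tokenize("Tool_Kit"))      # ['Tool', 'Kit']
print(lemmatize("Kits"))         # kit
print(normalize("IsRelatedTo"))  # ['related']
```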

18
Match Techniques Element level
  • String based
  • prefix, suffix, e.g. auth → author
  • n-gram: common consecutive substrings of size n
  • 2-grams of kit: {ki, it}; of kite: {ki, it, te}; 2 in common
  • edit distance: number of steps required to
    transform one string into another
  • e.g. class → classe: 1 insertion
  • Soundex, e.g. class = classe
  • Description matching
  • Approximate matching as used in natural language
    processing

http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
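
The string-based measures on this slide are short enough to write out. The sketch below uses the slide's own examples; the function names are illustrative, and the edit distance shown is the standard Levenshtein variant (insert, delete, substitute), which is an assumption about which variant the slide intends.

```python
def ngrams(text, n):
    """Return the set of consecutive substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def common_ngrams(s1, s2, n):
    """Count the n-grams shared by both strings."""
    return len(ngrams(s1, n) & ngrams(s2, n))

def edit_distance(s1, s2):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # delete c1
                               current[j - 1] + 1,       # insert c2
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def is_prefix_or_suffix(short, long):
    """True if `short` starts or ends `long`."""
    return long.startswith(short) or long.endswith(short)

print(common_ngrams("kit", "kite", 2))        # 2
print(edit_distance("class", "classe"))       # 1
print(is_prefix_or_suffix("auth", "author"))  # True
```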
19
Other techniques
  • Constraint based
  • Data types, e.g. String, Integer, Currency etc.
  • Value domain, e.g. 1..12: month or hour
  • Model based
  • Entity Relationship, XML documents or XML Schema,
    OWL, OO etc.
  • Auxiliary information
  • User input
  • Global schema / ontology
  • Previous match decisions
  • Dictionaries, thesauri (e.g. WordNet)

20
Matching Two Schemas - Example
21
Example
22
Matching
23
Result of Match Operation
  • Match Candidates in Similarity Matrix
  • For each element of Schema A , we have a set of
    possibly similar elements in schema B
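
A similarity matrix and its candidate sets can be sketched as nested dictionaries. The element names come from the Books schemas on the earlier Match Operation slide; the similarity function (`difflib.SequenceMatcher` from the Python standard library) and the 0.5 threshold are stand-in assumptions, not the measure used by any tool discussed here.

```python
from difflib import SequenceMatcher

def similarity_matrix(schema_a, schema_b):
    """Score every element of A against every element of B (values in 0..1)."""
    return {a: {b: SequenceMatcher(None, a, b).ratio() for b in schema_b}
            for a in schema_a}

def match_candidates(matrix, threshold=0.5):
    """For each element of A, keep the elements of B scoring at least `threshold`,
    best first."""
    return {a: sorted((b for b, score in row.items() if score >= threshold),
                      key=lambda b: -row[b])
            for a, row in matrix.items()}

schema_a = ["price", "book-title", "author-name"]
schema_b = ["listed-price", "title", "a-fname", "a-lname"]
candidates = match_candidates(similarity_matrix(schema_a, schema_b))
for element, similar in candidates.items():
    print(element, "->", similar)
```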

Match Confidence!
24
Match Confidence
  • Match quality / confidence is calculated as a
    numeric weight between 0 and 1
  • i.e. 0 → no match and 1 → perfect match
  • 3-gram example
  • S1 = university, S2 = université
  • 3-grams of S1: uni, niv, ive, ver, ers, rsi, sit, ity (8)
  • 3-grams of S2: uni, niv, ive, ver, ers, rsi, sit, ité (8)
  • sim(S1, S2) = 1 / (1 + |3gram(S1)| + |3gram(S2)|
    - 2 |3gram(S1) ∩ 3gram(S2)|)
  • = 1 / (1 + 8 + 8 - 2(7)) = 1/3
    = 0.333

Match confidence between S1 and S2 of 3-gram
algorithm!
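
The 3-gram confidence above can be checked with a few lines of Python. The function names are illustrative; the formula is the one on the slide.

```python
def trigrams(text):
    """Set of consecutive 3-character substrings (empty if len(text) < 3)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def trigram_confidence(s1, s2):
    """1 / (1 + |G1| + |G2| - 2 |G1 ∩ G2|), as on the slide."""
    g1, g2 = trigrams(s1), trigrams(s2)
    return 1.0 / (1 + len(g1) + len(g2) - 2 * len(g1 & g2))

# \u00e9 is the accented e in "université".
print(round(trigram_confidence("university", "universit\u00e9"), 3))  # 0.333
print(trigram_confidence("title", "title"))                           # 1.0
```

Note that two strings shorter than three characters both have empty 3-gram sets and therefore score 1.0 under this formula, so very short labels need a different measure.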
25
Match Confidence
26
Combining different Match Algorithms
  • Hybrid Matcher
  • Directly combine several match algorithms to
    determine match candidates.
  • Gives better performance
  • Multiple/Composite Matcher
  • Combines results of several independently
    executed matchers, can include other hybrid
    matchers.
  • Gives Flexibility depending upon the input
    schemas
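
A composite matcher can be sketched as a function that runs independent matchers and aggregates their scores. The two base matchers and the weighted-average aggregation below are assumptions chosen for illustration; other aggregations (maximum, minimum) fit the same shape.

```python
def prefix_matcher(a, b):
    """1.0 if one label is a prefix of the other, else 0.0."""
    a, b = a.lower(), b.lower()
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0

def bigram_matcher(a, b):
    """Dice coefficient over 2-grams."""
    ga = {a[i:i + 2].lower() for i in range(len(a) - 1)}
    gb = {b[i:i + 2].lower() for i in range(len(b) - 1)}
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def composite(matchers, weights=None):
    """Combine independently executed matchers by weighted average."""
    weights = weights or [1.0] * len(matchers)
    total = sum(weights)

    def combined(a, b):
        return sum(w * m(a, b) for m, w in zip(matchers, weights)) / total

    return combined

matcher = composite([prefix_matcher, bigram_matcher])
print(matcher("auth", "author"))  # 0.875
```

Because `composite` returns an ordinary matcher function, a composite can itself be passed into another composite, which mirrors the remark that a composite matcher can include other matchers.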

27
Match Algorithm Dimensions [SE05]
  • Designing match algorithms
  • We need to know how the algorithm will be used, i.e.
    its dimensions
  • Input of the algorithm
  • Data or schema, element level or structure level
  • Characteristics of the matching process
  • Requires exact or approximate matching
  • Performance over quality
  • Output of the algorithm
  • Output is an approximate result
  • or part of a set of match algorithms which are
    combined together for a map result
  • Integrated schema
  • Mapping expressions

29
Mapping
  • Selection of the best candidate match
  • If the candidate match element set contains more
    than one element, further techniques are applied
    to select the best match as the mapping.
  • Usually this is a manual user selection, for quality
    results

Complex Map
Simple Map
30
Map Cardinality
  • Map complexity - 1:1, 1:n, n:m
  • 1:n - authorName ↔ a-fname, a-lname
  • n:m - Tel. of Person

31
Map Quality Measuring Similarity
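  • Precision: share of real mappings among all
    found
  • Recall: share of real mappings that is found

Precision = B / (B + C), Recall = B / (A + B),
where B = real mappings found, C = found mappings that
are not real, A = real mappings not found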
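
With mappings represented as sets of element pairs, both measures follow directly from the definitions. The function name and the sample mappings are made up for illustration.

```python
def precision_recall(found, real):
    """Return (precision, recall) for a set of found mappings against the real ones."""
    found, real = set(found), set(real)
    b = len(found & real)  # B: real mappings that were found
    c = len(found - real)  # C: found mappings that are not real
    a = len(real - found)  # A: real mappings that were missed
    precision = b / (b + c) if found else 0.0
    recall = b / (a + b) if real else 0.0
    return precision, recall

real = {("price", "listed-price"), ("book-title", "title"),
        ("author-name", "a-fname"), ("author-name", "a-lname")}
found = {("price", "listed-price"), ("book-title", "title"),
         ("author-name", "title")}
print(precision_recall(found, real))  # (0.6666666666666666, 0.5)
```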
32
Schema Mapping / Ontology Alignment
  • Schema mapping is usually performed with the help
    of techniques that try to guess the meaning encoded
    in the schemas
  • www.xml.com/
  • Ontology alignment tries to exploit knowledge
    explicitly encoded in the ontologies.
    Ontology example (travel):
  • http://protege.stanford.edu/plugins/owl/

In real-world applications, solutions from both
domains are mutually beneficial
33
Semantic Web Layers
35
Example
[Figure: a Consumer, a Mediator and three Data Sources]
  • Schema heterogeneity: a key roadblock for
    information integration
  • Different data sources speak their own schema
  • Mapping is key to any data sharing architecture

36
Schema Integration and Mediation
  • Schema Mediation
  • All input schemas are merged together into one
    schema without any concept redundancy, i.e.
    similar concepts are represented by one concept.
  • All the input schemas are mapped to this schema
    called the mediated schema

37
Types of Integration Strategies [Batini86]
  • binary: ladder, balanced
  • n-ary: one-shot, iterative

http://www.ifi.unizh.ch/pziegler/IntegrationProjects.htmlx
38
Xylème warehouse
39
Mediator overview
[Figure: e-XML Mediator. XQuery requests come in and XML documents are returned; the mediator sends sub-requests (XPath, XQuery) through a web site wrapper and adapters to an XDBMS, two RDBMSs and a web site (HTML pages)]
40
Nimble
[Figure: Nimble architecture. Labels: Front-End, Lens Builder, User Applications, Lens File, InfoBrowser, Software Developers Kit, NIMBLE APIs, Management Tools, Integration Layer, Nimble Integration Engine, Metadata Server, Compiler, Executor, Cache, Security Tools, Common XML View, Integration Builder, Concordance Developer, Data Administrator]
42
Existing Matching Tools
  • Ontology-specific
  • NOM / QOM
  • OLA
  • Anchor-PROMPT
  • S-Match [GSY04]
  • HICAL
  • SKAT
  • Machine Learning
  • GLUE (LSD, CGLUE) [DMDH02]
  • Automatch
  • Cupid [MBR01]
  • COMA (COMA++) [ADMR05]
  • Similarity Flooding
  • SemInt
  • Artemis
  • DIKE
  • TransScm
  • AutoMed
  • Charlie [TBBT04]

43
Charlie's Algorithm
x/writer/own_books: m/book/author/name, m/book/author/birth, m/book/author, m/book/publisher, m/book/title, m/book
x/writer: m/book 0.15, m/book/author 0.8, m/book/publisher 0.01, m/book/title 0, m/book/author/name 0.6
x/writer/birth: m/book/author/name, m/book/author, m/book/publisher, m/book/title, m/book
Parameters: depth_max, backtrack_max, max_dist
x/writer/name: m/book/author/name
44
Similarity Flooding
  • Directed Graph Matching Technique
  • The technique starts from a string-based comparison
    (common prefix and suffix tests) of vertex labels
    to obtain an initial matching.
  • The algorithm is based on the assumption that
    whenever any two elements in Schemas S1 and S2
    are found to be similar, the similarity of their
    adjacent elements increases.
  • Iterative process
  • www-db.stanford.edu/~melnik/mm/sfa/
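
The idea that similarity spreads to adjacent elements can be illustrated with a small fixpoint loop. This is a deliberately simplified sketch, not the published Similarity Flooding algorithm: it ignores edge labels and propagation coefficients, treats edges as undirected for propagation, and takes the initial similarities as given. The graphs and names are made up.

```python
from itertools import product

def neighbours(edges):
    """Map each node to the nodes it shares an edge with (either direction)."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    return adj

def flood(edges_a, edges_b, initial, iterations=10):
    """Iteratively add to each pair's similarity the similarity of its
    neighbouring pairs, then normalise by the largest value."""
    adj_a, adj_b = neighbours(edges_a), neighbours(edges_b)
    sim = {(a, b): initial.get((a, b), 0.0) for a, b in product(adj_a, adj_b)}
    for _ in range(iterations):
        nxt = {}
        for (a, b), current in sim.items():
            incoming = sum(sim[(x, y)] for x in adj_a[a] for y in adj_b[b])
            nxt[(a, b)] = initial.get((a, b), 0.0) + current + incoming
        top = max(nxt.values()) or 1.0
        sim = {pair: value / top for pair, value in nxt.items()}
    return sim

edges_a = [("book", "title"), ("book", "author")]
edges_b = [("item", "title"), ("item", "writer")]
sim = flood(edges_a, edges_b, {("title", "title"): 1.0})
print(sim[("book", "item")] > sim[("book", "title")])  # True
```

Starting from a single string-level match (title, title), the parents book and item gain similarity because their children match. In this simplified form, sibling pairs such as (author, writer) and (author, title) cannot be told apart by structure alone.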

45
S-Match approach
  • S-Match is an Element Level Matcher
  • Syntactic Matching
  • Initially, the literal concepts of labels are computed
    at nodes using WordNet.
  • Each label is tokenized and the sense of each token is
    looked up in WordNet. The senses of the label's tokens
    are combined to make up the concept represented by the
    label.
  • E.g. VineandCheese represents Vine and Cheese
  • Semantic Matching
  • Relations are computed between concepts at nodes
  • Concept at node is a combination of label
    semantics and the node placement in the tree
  • S-Match utilizes the idea of schemas as trees

46
COMA - Matching Process
http://dbs.uni-leipzig.de/Research/coma.html
47
Deficiencies in Current Matching Tools
  • These tools do not completely fulfil the
    requirements for large scale schema matching
    because
  • Not fully automated
  • Less emphasis on search space optimization
  • Every element of schema A is matched to every
    element of schema B, and a set of algorithms is
    applied to each pair of elements.
  • Schema A: n elements
  • Schema B: m elements
  • k: number of match algorithms
  • Total: n x m x k comparisons
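
The n x m x k cost is easy to see when the exhaustive strategy is written as nested loops. The two matchers and the choice to keep the maximum score per pair are placeholders for illustration.

```python
from itertools import product

def exhaustive_match(schema_a, schema_b, matchers):
    """Apply every matcher to every pair of elements; count the comparisons."""
    comparisons = 0
    scores = {}
    for a, b in product(schema_a, schema_b):  # n x m pairs
        results = []
        for matcher in matchers:              # k matchers per pair
            results.append(matcher(a, b))
            comparisons += 1
        scores[(a, b)] = max(results, default=0.0)
    return scores, comparisons

def exact(a, b):
    return 1.0 if a == b else 0.0

def case_insensitive(a, b):
    return 1.0 if a.lower() == b.lower() else 0.0

schema_a = ["price", "book-title", "author-name"]           # n = 3
schema_b = ["listed-price", "title", "a-fname", "a-lname"]  # m = 4
scores, comparisons = exhaustive_match(schema_a, schema_b, [exact, case_insensitive])
print(comparisons)  # 24 = 3 x 4 x 2
```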

49
Large Scale Scenarios
  • Creating a merged schema for data integration
    from two large schemas (with thousands of nodes).
  • For example, bio-genetic taxonomies
  • Creating a mediated schema from a large set of
    schemas (with hundreds of schemas and thousands
    of nodes)
  • For example, creating a mediated web interface
    input form (schema) from the hundreds of web
    interface forms (schemas) related to the travel domain

50
Web Interface Schema for Travel
Form 1: 1 yahoo-form, 2 Where_do_you_want_to_go, 3 dep_arp_cd_1, 4 dep_arp_range_1, 5 When_are_you_traveling, 6 Depart, 7 dep_dt_mn_1, 8 dep_dt_dy_1, 9 dep_tm_1, 10 Return, 11 dep_dt_mn_2, 12 dep_dt_dy_2, 13 dep_tm_2, 14 num_cnx, 15 How_many_travelers_are_there, 16 adult_pax_cnt, 17 chld_pax_cnt, 18 senior_pax_cnt, 19 Airline_preferences, 20 cls_svc, 21 aln_cd_1
Form 2: 1 aa, 2 WhereDoYouWantToGo, 3 origin, 4 destination, 5 WhenDoYouWantToGo, 6 DepartureDate, 7 departureMonth, 8 departureDay, 9 departureTime, 10 ReturnDate, 11 returnMonth, 12 returnDay, 13 returnTime, 14 NumberOfPassengers, 15 numAdultPassengers, 16 numChildPassengers, 17 WhatAreYourServicePref, 18 cabinClass, 19 maximumStops, 20 carrier, 21 countryPointOfSale
Form 3: 1 nwa-form, 2 origin, 3 EnterDepartureDate, 4 departMonth, 5 departDay, 6 departTime, 7 destination, 8 EnterReturnDate, 9 returnMonth, 10 returnDay, 11 returnTime, 12 adult
Form 4: 1 absTravel, 2 D_City, 3 A_City, 4 Depart, 5 D_Month, 6 D_Day, 7 Return, 8 R_Month, 9 R_Day, 10 ClassOfService, 11 NumAdults
51
Large Scale Integration
52
Search Space Optimization
  • The n x m x k complexity has to be reduced for better
    performance. How?
  • Considering schemas as trees
  • Trees add structure, which makes it possible to
    classify documents more effectively
    (Charlie's Algorithm)
  • Labels of tree nodes have the same literal meaning
    but a specific sense in specific domains, e.g. title
    (Similarity Flooding)
  • Books domain → title of a book
  • Human Resources domain → title of a person (M., Mme)
  • Calculate this domain-specific sense of a label to
    find similar labels,
  • thus minimizing the search space for a given
    node,
  • i.e. match nodes with similar-sense labels
  • e.g. author ↔ writer, auth-name

53
Tree mining approach!
  • Why?
  • Basic function of tree mining is to find
    sub-trees that are frequent in the given set of
    trees, which is similar to schema matching
    activity that tries to find similar concepts
    among a set of schema trees.
  • So it provides us with the technique to compare
    more than two schemas in parallel

54
Tree Mining
Labels: a = author, b = book, d = detail, f = information,
g = general, h = birth, i = isbn, n = name, o = own-books,
p = publisher, r = price, t = title, w = writer
  • Motivation
  • Large Scale Scenario
  • Peer-to-peer Information Systems over the XML Web
  • Tree Mining Approach
  • Semantic Label Concept Matcher
  • Element Level Matching
  • Structure Level Matching

Synonyms: a = w, b = o, f = d
Search sub-trees
55
Tree Mining Approach
  • Node scope values (calculated by depth-first
    pre-order traversal), top-down [Zaki02]
  • Schema matching and integration process for
    handling large sets of XML schema trees.
  • Employs
  • Element level Name Matcher (same node label or
    synonym)
  • Cluster similar/synonym labels
  • Utilize the node scope values properties to
    extract node context semantics out of the
    structure
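
Clustering labels by a synonym table can be sketched with a small union-find structure, so that only nodes whose labels fall in the same cluster need to be compared. The labels and the three synonym pairs are the ones listed on the Tree Mining slide; the function name and the union-find choice are illustrative assumptions.

```python
from collections import defaultdict

LABELS = ["author", "book", "detail", "information", "general", "birth",
          "isbn", "name", "own-books", "publisher", "price", "title", "writer"]
SYNONYMS = [("author", "writer"), ("book", "own-books"), ("information", "detail")]

def cluster_labels(labels, synonym_pairs):
    """Group labels so that synonyms (directly or transitively) share a cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for first, second in synonym_pairs:
        parent[find(second)] = find(first)

    clusters = defaultdict(set)
    for label in labels:
        clusters[find(label)].add(label)
    return sorted(sorted(cluster) for cluster in clusters.values())

for cluster in cluster_labels(LABELS, SYNONYMS):
    print(cluster)
```

With 13 labels and three synonym pairs this yields 10 clusters, so a node labelled writer is compared only with nodes labelled author or writer rather than with every node of the other schemas.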

56
Clustering: Search Space Optimization
Synonym table: a = w, b = o, f = d
R
57
Property: Descendant Node Check
Descendant node check: if the scope of node x is [X,Y]
and the scope of descendant node xd is [Xd,Yd], then
Xd > X and Yd <= Y
  • publisher is mapped to publisher
  • publisher name can be mapped to writer/name or
    publisher/name
  • The node with label name, n4[4,4], is a descendant of
    the node with label publisher, n3[3,4]; the node name
    n2[2,2] is not. This is verified using the descendant test
  • For n2: 2 > 3 and 2 <= 4 (False), and for n4: 4 > 3
    and 4 <= 4 (True)
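
The scope values and the descendant test can be reproduced with a pre-order traversal. The tree below is an assumption reconstructed from the scopes quoted on the slide (n1 as root with children n2 and n3, and n4 under n3); the function names are illustrative.

```python
def assign_scopes(tree, root):
    """Number nodes in depth-first pre-order. The scope of a node is
    (its own number, the number of its last descendant)."""
    scopes = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        start = counter
        for child in tree.get(node, []):
            visit(child)
        scopes[node] = (start, counter)

    visit(root)
    return scopes

def is_descendant(scope_x, scope_xd):
    """Descendant test from the slide: Xd > X and Yd <= Y."""
    x, y = scope_x
    xd, yd = scope_xd
    return xd > x and yd <= y

# Hypothetical tree: n1 (root), n2 (name), n3 (publisher), n4 (name under publisher).
tree = {"n1": ["n2", "n3"], "n3": ["n4"]}
scopes = assign_scopes(tree, "n1")
print(scopes["n2"], scopes["n3"], scopes["n4"])     # (2, 2) (3, 4) (4, 4)
print(is_descendant(scopes["n3"], scopes["n2"]))    # False
print(is_descendant(scopes["n3"], scopes["n4"]))    # True
```

The test needs only two integer comparisons per pair, so the ancestor relation can be checked without walking the tree again.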

58
Conclusion
  • Element-level name and linguistic matching, with
    the support of a thesaurus, is an integral part of
    every match system.
  • With systems moving towards schema/ontology based
    manipulation, and lack of global schemas or
    previous matching results, Structure Level
    matching is equally important for making out the
    semantics.
  • Peer-to-peer environment requires new methods to
    be exploited for performance and quality mapping
    i.e. integration of Tree Mining techniques for
    matching purposes and search space optimisation.
  • Machine Learning algorithms can be beneficial in
    the P2P environment in later stages when training
    examples have been created from instance data,
    provided the target domain remains the same.

59
Some References
  • [AH04] Antoniou G., Harmelen F. A Semantic Web
    Primer. The MIT Press, 2004.
  • [ADMR05] Aumueller D., Do H. H., Massmann S. and
    Rahm E. Schema and Ontology Matching with COMA++.
    In Proceedings of the International Conference on
    Management of Data (SIGMOD), 2005.
  • [BR04] Bellahsène Z. and Roantree M. (2004)
    Querying Distributed Data in a Super-peer based
    Architecture. DEXA 2004.
  • [BMP04] Bernstein PA., Melnik S., Petropoulos M.
    and Quix C. (2004) Industrial-Strength Schema
    Mapping. SIGMOD Record, Vol. 33, No. 4, December
    2004.
  • [DMDH02] Doan AH., Madhavan J., Domingos P. and
    Halevy A. (2002) Learning to Map between Ontologies
    on the Semantic Web. WWW 2002.
  • [MBR01] Madhavan J., Bernstein PA. and Rahm E.
    (2001) Generic Schema Matching with Cupid. VLDB
    2001.
  • [RB01] Rahm E. and Bernstein PA. (2001) A Survey
    of Approaches to Automatic Schema Matching. VLDB
    Journal 10(4): 334-350.
  • [SE05] Shvaiko P. and Euzenat J. (2005) A Survey
    of Schema-based Matching Approaches. Journal on
    Data Semantics, 2005.
  • [TBBT04] Tranier J., Baraer R., Bellahsène Z. and
    Teisseire M. (2004) Where's Charlie: Family-Based
    Heuristics for Peer-to-Peer Schema Integration.
    IDEAS 2004, 227-235.
  • [Zaki02] Zaki MJ. (2002) Efficiently Mining
    Frequent Trees in a Forest. 8th ACM SIGKDD Intl
    Conf. Knowledge Discovery and Data Mining, July
    2002.
  • http://www.w3.org/TR/daml+oil-reference
  • http://www.doc.ic.ac.uk/automed/

60
Thank you
61
Backup slides
62
URI
  • A URI can be classified as a locator or a name or
    both. A Uniform Resource Locator (URL) is a URI
    that, in addition to identifying a resource,
    provides means of acting upon or obtaining a
    representation of the resource by describing its
    primary access mechanism or network "location".
    For example, the URL http://www.wikipedia.org/ is
    a URI that identifies a resource (Wikipedia's
    home page) and implies that a representation of
    that resource (such as the home page's current
    HTML code, as encoded characters) is obtainable
    via HTTP from a network host named
    www.wikipedia.org. A Uniform Resource Name (URN)
    is a URI that identifies a resource by name in a
    particular namespace. A URN can be used to talk
    about a resource without implying its location or
    how to dereference it. For example, the URN
    urn:isbn:0-395-36341-1 is a URI that, like an
    International Standard Book Number (ISBN), allows
    one to talk about a book, but doesn't suggest
    where and how to obtain an actual copy of it.