Architecture - PowerPoint PPT Presentation

1 / 73
About This Presentation
Title:

Architecture

Description:

(virtual views) Storage Method. Enterprise Information Integration. Reaches Maturity ... Primarily Application-Driven View or Identity View ... – PowerPoint PPT presentation

Number of Views:17
Avg rating:3.0/5.0
Slides: 74
Provided by: dbU7
Category:

less

Transcript and Presenter's Notes

Title: Architecture


1
From Enterprise Information Integration to
Community-Based Mediation
A presentation by
Yannis Papakonstantinou
on joint works with
Alin Deutsch, Yannis Katsis, Michalis Petropoulos
CSE Department
2
Data Integration Requirements Desiderata (high
level)
  • Provide application with integrated database
  • single point of (query/update) access to the
    data
  • Provide distribution and heterogeneity
    transparency
  • heterogenous formats, heterogenous interfaces,
    different rates of change (static versus
    dynamic), autonomous sources
  • Decouple application logic from integration
  • Easily add/change sources
  • Customize the delivery of content

3
Most-Generic Integration System Architecture
Client Application
Client Application
Client Application
Pull
Push
Integration Software
. . .
Information Source
Information Source
Information Source
4
SIGMOD Communitys Architecture forUnified
Access to Data Services
(Web) Client Application
(Web) Client Application
(Web) Client Application
Integrated (XML) Global View / Ontology Services
Cache Replication
Mediator
Local Common Model (XML) View Services
Local Common Model (XML) View Services
Wrapper
Wrapper
Information Service Source
Information Service Source
5
Approaches towards View-BasedData Integration
Integration Specification Method
Info Model Query Language
GLAVGAVLAV
Local As View (LAV)
XML (XQuery)
Global As View (GAV)
Object-Oriented
Relational (SQL)
Storage Method
Warehousing (materialized views)
On-Demand (virtual views)
6
Enterprise Information IntegrationReaches
Maturity
Enterprise
  • Materialized View (Warehousing) approach
    well-adopted since mid/late 90s
  • GAV function role played by Extract-Transform-Load
    tools
  • Human Intervention Occasionally Needed in
    Cleaning Up
  • Concordance tables for Object Identification
  • Virtual View (Mediation) approach at early
    adoption
  • many years of research
  • Distributed dbs, federated dbs, mediators
  • moving well into mainstream
  • BEA AquaLogic (XML, Virtual, GAV view)
  • IBM DB2

7
Current Enterprise InformationIntegration
Deployments
Enterprise
  • Small Domain
  • Mostly Vertical Partition of Sources
  • Primarily Application-Driven View or Identity
    View
  • Integration Administrator/Developer in charge

Integrated Global View V(M, S, E)
Integration Admin
GAV View V
View Builder(design time)
Mediator QueryProcessor (run time)
Schemas
Data
Local View M
Local View S
Local View E
Marketing
Sales
Service
8
Opportunities and Needs Presented by Motivated
Communities
Communities
  • Emerging Myriads of Internet Communities of
  • Myriads of sources and clients
  • Source owners motivated to participate
  • EII does not address needs
  • Expensive
  • Bottleneck of Single Integration Admin
  • Make building corresponding portals similar to
    starting and participating in newsgroups
  • Appropriate tools needed to enable source owner
    and client participation

9
A Community-Based Information Modeling
Architecture
Client Application 1
Client Application m
Application View V1a (G)
Application View Vma (G)
GAV V1a
GAV Vma
Integrated View Owners Domain
Data Services
Integrated XML View G
Source Owners Domain
Source Owners Domain
GLAV V1
GLAV Vn
Local XML View S1
Local XML View Sn
Information Source n
Information Source 1
Data Services 1
Data Services n
10
Visual Tools Matter! (example from the Enosys
Query Builder)
TARGET SCHEMA (XML VIEW)
1
OPEN VIEW SOURCE SCHEMAS IN XML
2
DRAG DROP TO CREATE TARGET XML VIEW
AUTOMATICALLY GENERATED MAPS
11
3
RUN TEST XQUERY
XML RESULT
XQUERY BASED ON DESIGN SPECS
12
Architecture for Large-Scale Data Integration
System and Design Tools
How can the user query and Browse the integrated
data? QURSED
Web Domain
Web Forms Reports
?
Developer
What queries can my app issue? What integrated
view services can I build? CLIDE
Application Domain
?
?
Application
Application
?
Integration Domain
Cache (Metadata)
Mediator
Global View Schema
Web Services
Integration Engineer
Source Domain
How do I export my database services
functionality? RIDE-Services
?
Web Service
Web Service
Web Service

Source Owner
How do I export my data? RIDE
Data Source
Data Source
Source Schema
Source Schema

13
Dual Interactive Registration Problems
?
?
Register Source Given Global Schema,
Constraints Queries
Register Client Given Sources
New App
New Query
Guide the client in query/form writing
Apps
Queries
?
?
?
?
Global View
Global View
Guide the source owner in registering a new
source and services
Source Services
New Source and Services
14
Source Data Registration
Server Side
  • How do my source attributes map to global
    attributes
  • mappers automatic matchers
  • How do my data relate to queries other sources
  • Inconsistencies?
  • What takes to contribute to queries?
  • How much should I clean up?
  • Multiple ways of dealing with redundancy

Apps
Queries
?
?
Global View
?
New Source
15
How to achieve this Goal
Before
Now
Apps
Queries
Apps
Queries
?
?
?
?
?
?
Look at all sources queries
Follow the suggestions of the interface
?
Decide how to register your source
Global View
Global View
?
?
Source Registration Tool
?
New Source
New Source
16
Our Goal in Source Registration
Guide the source owner visually through the
registration of the source
so as to avoid/warn about (potential)
inconsistencies and contribute information to the
answer of the queries
while exposing the minimum information possible
and/or minimizing effort
17
The Problem
17
18
The Contribution Problem
Client Queries
  • What is the contribution of source S to the
    result of the query Q?

?
?
?
Q
Mediator
(Global DB)
S
Sources
(Actual Local DBs)
18
19
The Problem
Client Queries
  • What is the contribution of source S to the
    result of the query Q?

?
Q
Q cars
Mediator
(Global DB)
Q cars JOIN reviews
cars
reviews
S
Sources
(Actual Local DBs)
19
20
Relational Schemas Local and Global
  • Relational Schemas
  • Visual Representation

?
S1
S2
G
auto
make
car
Id
Carmake
Model
Origin
Model
Carmake
Carmake
Sales
Doors
Baseprice
detail
brand
Id
Engine
Carmake
Baseprice
Origin
Source 1Business Magazine
Source 2Car Magazine
Global Car Portal
20
21
Source Registration using GLAV Mappings
  • Source Registration Correspondence between
    a source schema and

    the global schema
  • Set of Mapping Constraints of the form
  • (U ? V)
  • Open World
  • Global and Local As View (GLAV)

?
CQ over global schema
CQover source schema
21
22
Target Constraints
  • Constraints on the global schema
  • Set of Constraints of the form
  • (U ? V)
  • Also Expresses Dependencies (PKs, Ref Integrity,
    )

?
CQ over global schema
CQover global schema
22
23
Visual Representation of Mappings (1)
  • Visual Representation (IBM Clio)

?
G
S1
car
make
C
Model
Carmake
O
Carmake
Origin
Doors
Sales
Baseprice
brand
Carmake
Origin
Business Magazine Provides Carmake and Origin
23
24
Visual Representation of Mappings (2)
  • Visual Representation (IBM Clio)

?
G
S2
car
auto
Model
Id
Carmake
Model
Doors
Carmake
Baseprice
detail
brand
Id
Carmake
Engine
Origin
Baseprice
Car Magazine Provides Model, Carmake and
Baseprice
24
25
Example of Target Constraint
  • (Model, Carmake) is a PK of car

?
G
car
?
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
U1(M, C, D1, B1, D2, B2) - car(M, C, D1, B1),
car(M, C,
D2, B2) V1(M, C, D, B, D, B) - car(M, C,
D, B)
(U1 ? V1)
25
26
Query Semantics
  • Queries in UCQ
  • Set of Possible Global Instances
  • Set of global instances that satisfy all
    constraints
  • Query Answers Set of Certain Answers
  • The tuples appearing in the answer to Q for any
    possible global instance

?
?
Q
Possible globalinstances
CertainAnswersto Q
Answer to Qfor any of thepossible
globalinstances
26
27
Source Instances Contribution
-
  • For given instances of the sources
  • Contribution to Q of Source Instance
  • The tuples in answer of Q not provided by the
    other sources

Answer to Q
Answer to Q
27
28
Source Registrations Contribution
  • Source Registration Source Mappings
  • Degrees of Source Registrations Contribution
  • ? Self Sufficient
  • ? Now Complementary
  • ? Later Complementary
  • ? Unusable

More contribution
Less contribution
28
29
Self Sufficient Registration Example
Example
Baseprices of Models
?
G
car
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
29
30
Self Sufficient Registration Definition
? Self Sufficient
-
? Source instance s.t. The source has a non
empty contribution in the absence of the other
sources
Answer to Q
Answer to Q
? ?
X
X
X
X
30
31
Now Complementary Registration Example
Example
Baseprices of Modelsby German manufacturers
?
G
car
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
31
32
Now Complementary Registration Definition
? Now Complementary
-
Not Self Sufficient ? Source instances
s.t. The source has a non empty contribution
in combination with the other existing sources
Answer to Q
Answer to Q
? ?
32
33
Later Complementary Registration Example
Example
Baseprices of Modelsby German manufacturers
?
G
car
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
33
34
Later Complementary Registration Definition
? Later Complementary
-
Not Self Sufficient Not Now Complementary ?
Potential future sources Source
instances s.t. The source has a non empty
contribution in combinationwith the future
sources
Answer to Q
Answer to Q
? ?
34
35
Unusable Registration Example
Example
Origin of Carmakes
?
G
car
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
35
36
Unusable Registration Definition
? Unusable
-
Not Self Sufficient Not Now Complementary Not
Later Complementary ? The source has a empty
contribution regardless of what sources enter the
system
Answer to Q
Answer to Q
?
36
37
Subtleties for Unusable Registrations
Example
Baseprices and Doorsof Models
?
G
car
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
37
38
In presence of PK Unusable Example becomes Later
Complementary
Example
Baseprices and Doorsof Models
?
G
car
?
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
38
39
Decidability Results
Overview What is decidable
Target constraints
Degrees
39
40
Issues
?
Vs
Unique client query
Multiple client queries
Contribute to - all queries?
- one query? - specific
queries? - some queries
based on some ranking?
?
Vs
Data independence
Data dependence
e.g.
M1 cars, refPricesM2 reviewsQ cars JOIN
reviews JOIN refPrices
DB1 cars, refPrices (Audis)DB2 reviews (Hondas)
(M2, Q) now-complementarybut Certain Answers for
Instances DB1, DB2
?
41
Putting it all together
Architecture
Architecture
?
Query Answering / Mappings / Schemas
Query
Q
Contribution
4 categories Self Sufficient / Now Complementary
/ Later Complementary / Unusable
Global Schema
S
Goal
Mappings

M1
Mn1
Mn
Guide the source owner visually through the
registration of the source
LocalSchemas

so as to raise contribution to the answer of the
queries
S1
Sn
Sn1
while exposing the minimum info possible and/or
minimizing effort
Registeredsources
Newsource
42
Example 1
Without primary keys in the target
Global Schema
Query
Local Schemas
Community
AutoTrader
AppQuery
car

car

car

model
id
model
cmodel
drive
drive
ad

review

review

vin
model
model
carId
quality
quality
price
usedAd

usedAd

vin
vin
model
model
price
price
refPrice

refPrice

model
model
condition
condition
price
price
Unusable
BLUE Map at least one of the groups
43
Example 1
Without primary keys in the target
Global Schema
Query
Local Schemas
Community
AutoTrader
AppQuery
car

car

car

model
id
model
cmodel
drive
drive
ad

review

review

vin
model
model
carId
quality
quality
price
usedAd

usedAd

vin
vin
cmo
model
model
price
price
refPrice

refPrice

model
model
condition
condition
price
price
Unusable
44
Example 1
Without primary keys in the target
Global Schema
Query
Local Schemas
Community
AutoTrader
AppQuery
car

car

car

model
id
model
cmodel
drive
drive
ad

review

review

vin
model
model
carId
quality
quality
price
usedAd

usedAd

vin
vin
cmo
model
model
price
price
price
refPrice

refPrice

model
model
condition
condition
price
price
LaterComplementary
45
Example 2
With primary keys in the target
Global Schema
Query
Local Schemas
Community
AutoTrader
AppQuery
car

car

car

model
id
model
cmodel
drive
drive
ad

review

review

vin
model
model
carId
quality
quality
price
usedAd

usedAd

vin
vin
model
model
price
price
refPrice

refPrice

model
model
condition
condition
price
price
Unusable
46
Example 2
With primary keys in the target
Global Schema
Query
Local Schemas
Community
AutoTrader
AppQuery
car

car

car

model
id
model
cmodel
drive
drive
ad

review

review

vin
model
model
carId
quality
quality
price
usedAd

usedAd

vin
vin
vin
model
model
price
price
price
refPrice

refPrice

model
model
condition
condition
price
price
LaterComplementary
47
Lessons learned
Target constraints make a difference
?
To merge data with that of other sources (become
complementary)
Pick a relation and provide
In absence of primary keys
all its attributes asked by the query
In presence of primary keys
its primary key and one of its attributes asked
by the query
The number of choices increases in presence of
primary keys
?
Foreign keys on the target affect the suggestions
48
Large-Scale Data Integration Systems
How can the user query and Browse the integrated
data? QURSED
Web Domain
Web Forms Reports
?
Developer
What queries can the mediator answer for
me? CLIDE
Application Domain
?
?
Application
Application
?
Integration Domain
Mediator
Global View Schema
Integration Engineer
Source Domain
How do I export my database services
functionality? RIDE-Services
?
Web Service
Web Service
Web Service

Source Owner
How do I export my data? RIDE
Data Source
Data Source
Source Schema
Source Schema

49
Running Example
Parameterized Views
Dell
Cisco
  • Schema
  • Computers(cid, cpu, ram, price)
  • NetCards(cid, rate, standard, interface)
  • Views
  • V1 ComByCpu(cpu) ? (Computer)
  • SELECT DISTINCT Com1.
  • FROM Computers Com1
  • WHERE Com1.cpucpu
  • V2 ComNetByCpuRate(cpu, rate) ?
  • (Computer, NetCard)
  • SELECT DISTINCT Com1., Net1.
  • FROM Computers Com1, Network Net1
  • WHERE Com1.cidNet1.cid
  • AND Com1.cpucpu
  • AND Net1.raterate

Schema Routers(rate, standard, price,
type) Views V3 RouByTypeW() ?
(Router) SELECT DISTINCT Rou1. FROM Routers
Rou1 WHERE Rou1.type'Wired' V4 RouByTypeWL() ?
(Router) SELECT DISTINCT Rou1. FROM Routers
Rou1 WHERE Rou1.type'Wireless'
50
Running Example
Global Schema
?
?
Developer
Application
  • Global schema puts togetherthe Dell and Cisco
    schemas
  • Resembles the schema of CNET.com portal
  • Column Associations
  • (Computers.cid, NetCards.cid)
  • (NetCards.rate, Routers.rate)
  • (NetCards.standard, Routers.standard)

Mediator
Global Schema
V1
V3
V2
V4
Dell
Cisco
51
Sophisticated Mediators Make Feasibility Hard to
Predict
  • Feasible Queries FQ
  • Equivalent CQ query rewritings using the views
  • Might involve more than one views
  • Order might matter

Feasible
Query Get all P4 Computers, together with
their NetCards and their compatible Wireless
Routers
Query Get all Computers
Infeasible
E
B
D
Mediator
A
C
RouByTypeWL()
ComNetByCpuRate(P4, 10)
ComNetByCpuRate(P4, 54)
V4
V2
52
Problem
  • Large number of sources
  • Large number of views
  • Mediator capabilities
  • Developer formulates an application query
  • Is an application query feasible?
  • If not, how do I know which ones are feasible?
  • Previous options
  • The developer had to browse the view definitions
    and somehow formulate a feasible query
  • Or formulate queries until a feasible one is
    found(trial-and-error)
  • No system-provided guidance

53
The CLIDE Solution
CLIDE
?
?
Developer
Application
  • A query formulation interface, which
    interactively guides the user toward feasible
    queries by employing a coloring scheme

Mediator
Global Schema
V1
V3
V2
V4
Dell
Cisco
54
QBE-Like Interfaces
Microsoft SQL-Server
55
CLIDE Interface
  • Table, selection, projection and join actions
  • Color-based suggestions
  • Feasibility Flag

56
CLIDE Interface
Snapshot 1
  • Yellow ? required action
  • All feasible queries require this action
  • White ? optional action
  • Feasible queries can be formulatedw/ or w/o
    these actions

57
CLIDE Interface
Snapshot 2
  • Blue ? required choice of action
  • At least one feasible (next) query cannot be
    formulated unless this action is performed

C
Mediator
A
ComByCpu(P4)
B
V1
58
CLIDE Interface
Snapshot 3
  • Join Lines
  • Only yellow and blue are displayed
  • Must appear in Column Associations

59
CLIDE Interface
Snapshot 4
60
CLIDE Interface
Snapshot 5
  • ? any other constant
  • Red ? prohibited action
  • Does not appear in any feasible query
  • Lead to Dead End state

61
CLIDE Interface
Snapshot 6
F
Mediator
A
D
RouByTypeWL()
ComNetByCpuRate(P4, rate)
B
E
V4
V2
62
CLIDE Facts
  • Rapid Convergence
  • At every step, yellow and blue actions lead to a
    feasible query in a minimum number of steps
  • Completeness of Suggestions
  • Every feasible query can be formulated by
    performing yellow and blue actions at every step
  • Minimality of Suggestions
  • At every step, only a minimal number of actions
    are suggested, i.e., the ones that are needed to
    preserve completeness

63
Interaction Graph
Join Action
Table Action
Selection Action
Com1.cidNet1.cid
Com1.cpuP4
Com1
Com1.ram
Rou1
Com1.price
Net1









  • Nodes are queries
  • One for each q?CQ
  • Edges are actions
  • Table, selection, projection and join actions
  • Green nodes are feasible queries
  • Infinitely big structure
  • All CQ queries
  • All possible combinations of actions formulating
    them

64
Interaction Graph Colorable Actions
Com1.cid
Current Node

Com1.cpu

Com1.cid
  • Colorable actions AC labeloutgoing edges of the
    current node


Com1.cpu

Com1.ram

Com1.price

Net1

Rou1

Com2

65
Interaction Graph Colors
  • Yellow action ?
  • Every path from current node n to a feasible node
    contains ?
  • Blue action ?
  • At least one feasible query cannot be formulated
    unless this action is performed (minimality)
  • Red action ?
  • No path to a feasible node contains ?

Current Node
Com1.cid

Com1.cpu

Com1.cid


Com1.cpu
Net1
Com1.cidNet1.cid
Net1.rate54Mbps
Com1.cpu




Com1.ram
Com1.cidNet1.cid


Net1.rate54Mbps
Com1.price



Com1.cpu
Com1.cpu
Rou1
Net1.rateRou1.rate
Com1.cidNet1.cid
Net1






Com2
Rou1

Com2
Com2.cidNet1.cid
Com2.cpuP4
Net1.rate54Mbps





66
Color Determined By a Finite Set of Feasible
Queries
Challenge Infinitely Many Feasible Queries
?


Radius
Closest Feasible Queries FQC
Infinitely many feasible queries


n



  • Start by considering the closest feasible queries
    FQC
  • FQC is sufficient to color actions in AC
  • Theorem Set of Closest Feasible Queries is
    Finite
  • How far can closest feasible queries FQC be?
  • Based on Maximally Contained Queries FQMC?

67
Color Algorithm
Maximally Contained Queries FQMC
Query Q1 Get all Computers
Maximally Contained Query
Query Q2 Get all Computers with a given cpu
  • Assuming fixed SELECT clause (projection list)
  • Covered extensively in literature
  • MiniCon, Bucket, InverseRules
  • FQMC is finite

68
Color Algorithm
Maximally Contained Queries FQMC

Maximally Contained Queries FQMC
pL Radius
Closest Feasible Queries FQC


n



  • Compute maximally contained queries FQMC
  • The radius pL is the longest path to a node n
    such that q(n) in FQMC
  • All FQC queries are reachable via a path of
    length p ? pL

69
Color Algorithm
More on Closest Feasible Queries
Maximally Contained Feasible Queries FQMC

Closest Feasible Queries FQC

n




More feasible nodes
  • Theorem All queries in FQMC are in FQC
  • But not all queries in FQC are in FQMC

70
Color Algorithm
More on Closest Feasible Queries
Maximally Contained Feasible Queries FQMC

Closest Feasible Queries FQC

n



  • Naïve Approach
  • Start from n and explore paths up to length pL

71
Color Algorithm
Collapse Aliases
Maximally Contained Feasible Queries FQMC

Closest Feasible Queries FQC

n



  • Collapse Aliases to compute FQC \ FQMC
  • Check satisfiability

72
Color Algorithm
  • Coloring Non-Projection Actions
  • No interaction graph materialization
  • Use of containment mapping from current query to
    the closest feasible ones
  • An action ? is colored
  • Yellow, if ? is mapped into all queries in FQC
  • Red, if ? is not mapped into any query in FQC
  • Blue, if ? is mapped into at least one query qF
    in FQC, no other action in AP is mapped into qF,
    and ? is neither yellow nor red
  • Coloring Projection Actions
  • Never colored yellow
  • Can be colored blue only if
  • the current query is feasible
  • it is not colored red
  • Which ones are red?
  • Bring all projection atoms from views such that
    feasibility is preserved
  • If action ? is not mapped into any query in FQC,
    then ? is red

73
CLIDE Implementation
?
Other Back-End
Parameterized Views Back-End
Developer
Optimal Maximally Contained Queries
Action
Closest Feasible Queries
Maximally Contained Queries
Current Query
Colored Actions
MiniCon
Containment Test
Collapse Aliases
Color Actions
Front-End
Schemas
Column Associations
Views
  • MiniCon
  • Outputs redundant and non-minimal queries
  • Affects CLIDEs rapid convergence and minimality
    properties
  • Containment Test
  • Well-known NP-complete problem
  • Polynomial when query is acyclic
  • Collapse Aliases / Color Actions
  • Reuse containment mappings created by MiniCon

74
CLIDE Performance
Chains of Stars
  • Queries

A-span 7 B-span 4 Selections 4,6,8,10
A
Write a Comment
User Comments (0)
About PowerShow.com