Capturing both Types and Constraints in Data Integration - PowerPoint PPT Presentation

1 / 28

About This Presentation

Title:

Capturing both Types and Constraints in Data Integration

Description:

Chee-Yong Chan National Univ. of Singapore. Wenfei Fan Bell Laboratories. Juliana Freire Oregon Health & Science Univ. (OGI) Rajeev Rastogi Bell Laboratories ... – PowerPoint PPT presentation

Number of Views:24

Avg rating:3.0/5.0

Slides: 29

Provided by: wenfe4

Category:

more less

Transcript and Presenter's Notes

Title: Capturing both Types and Constraints in Data Integration

1
Capturing both Types and Constraints in Data
Integration

Michael Benedikt Bell Laboratories
Chee-Yong Chan National Univ. of Singapore
Wenfei Fan Bell Laboratories
Juliana Freire Oregon Health Science
Univ. (OGI)
Rajeev Rastogi Bell Laboratories

2
Schema-directed integration

data exchange

Integration
extract relevant data from distributed, multiple
databases
construct an XML view
Schema-directed conformance to a predefined
schema (D, ?)
D a DTD, type constraints
? a set of XML integrity constraints (keys,
foreign keys)

3
Example hospital and insurance company

patient (SSN, name, policy)
visitInfo (SSN, trId, date)

billing (trId, price)
cover (policy, trId)
daily XML report
treatment (trId, name) procedure (trId1, trId2)

Given a date, for each patient of that day,
report
SSN, name (DB1)
treatments (hierarchy) covered by insurance
(DB1, DB3, DB4)
cost of all and only those treatments received
(DB2)

4
Predefined schema (D, ?) DTD

report ? patient
patient ? SSN, name,
treatments, bill
treatments ? treatment
treatment ? trId, tname,
treatments
bill ? item
item ? trId, price

report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
. . .
price
trId
trId
tname
treatments
5
Predefined schema (D, ?) XML constraints ?

constraints relative to each patient,
key each treatment is charged only once
patient ( item.trId ? item )
foreign key every treatment has a billing record
patient ( treatment.trId ? item.trId )

report
date
patient
patient
patient
name
treatments
SSN
bill
treatment
item
item
treatment
. . .
price
trId
treatments
6
Challenge nondeterministic structure

certain structure cannot be decided at compile
time
recursion expansion - the depth of XML tree
treatments ? treatment
treatment ? trId, tname, treatments
--- recursive
choice of children in a disjunction, e.g. A ? B
C

report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
price
trId
trId
tname
treatments
. . .
unbounded --gt
7
Challenge context dependency

bill subtree all and only the trIds in the
treatments subtree
controlled derivation the bill subtree cannot be
started before the treatments subtree is
completed.
information passing downward, upward, sideways

date
report
patient
date
SSN
treatments
name
SSN
bill
trIdS
trId
trId
unbounded
8
Challenges

DTD-conformance recursive, nondeterministic
integrity constraints validation during
document generation
multi-source queries a single one involves
several databases
context-dependency not strictly top-down or
bottom-up
Previous work
XML publishing single data source, no
constraints, top-down.
XML integration little support for XML
schema-conformance.
XML query languages type checking provide no
guidance for how to ensure schema-conformance
optimization hard.

Schema-directed XML integration is nontrivial!
9
Our middleware for schema-directed integration

view definition language
optimization techniques

A lightweight language Attribute Integration
Grammar (AIG)
Cost-based optimization in light of
context-dependency

DTD constraints
semantic attributes

semantic rules
10
Attribute Integration Grammar (AIG)

DTD element type definitions e ? ?
? PCDATA ? e1, , en
e1 en e
Attributes associated with each element type e
Inh(e) inherited from parent/siblings
(top-down/sideways)
Syn(e) synthesized from children (bottom-up)
Syn(e), Inh(e) tuple or set/bag-valued
Rules associated with each production e ? ?, for
e in ?
Inh(child) Inh(e) Q(Inh(parent),
Syn(sibling) )
Syn(parent) Syn(e) U (Syn(children))
-- union
Q multi-source SQL query with parameters
Dependency e2 must be evaluated before e1 if
Inh(e1) Q( Syn(e2) ) -- acyclic graph
(DAG)

11
AIG semantics conceptual evaluation

following the dependency ordering starting from
the root
report ? patient
Inh(patient) ? Q1 (Inh(report))
select Inh(report) as date, p.SSN,
p.name, p.policy
from DB1 patient p, DB1
visitInfo v
where p.SSN v.SSN and v.date
Inh(report)
Recall DB1 patient (SSN, name, policy),
visitInfo (SSN, trId, date)
Parameter in a query Inh(report) as a constant
Data driven the number of patients depends on Q1

report
Inh
patient
patient
patient
12
Multi-source query

patient ? SSN, name, treatments, bill
Inh(SSN) Inh(patient).SSN, . . . ,
Inh(treatments) Q2(v) -- v
Inh(patient)
select t.trId, t.tname
from DB1 visitInfo i, DB3 cover
c, DB4 treatment t
where i.SSN v.SSN and i.date
v.date and t.trId i.trId
and c.trId i.trId and c.policy
v.policy
a single query uses DB1, DB3 and DB4
tuple- and set-valued attributes (Inh(SSN),
Inh(treatments))

Recall DB1 patient(SSN, name, policy),
visitInfo(SSN, trId, date)
DB3 cover (policy, trId)
DB4 treatment (trId, name),
procedure (trId1, trId2)

13
Initial top-down pass context-dependent

patient ? SSN, name, treatments, bill
Inh(SSN) Inh(patient).SSN, Inh(name)
Inh(patient).name,
Inh(treatments) Q2(Inh(patient))
Inh(bill) Syn(treatments) lt--
halt
DTD-directed generate children following the
production
Inh(bill) defined with sibling --
Syn(treatments), dependency ordering evaluate
bill after treatments

lt- - halt
14
Initial top-down pass recursion

treatments ? treatment
Inh(treatment) ? Inh(treatments) -- set of
(trId, tname)
Data driven treatments expansion depends on
Inh(treatments)
empty expansion terminates, Syn(treatments) is
empty
nonempty expands.

15
Leaf step

treatment ? trId, tname, treatments
Inh(trId) Inh(treatment).trId, . . .,
trId ? PCDATA

Syn(trId) Inh(trId)
treatments
Inh (trId, tname)
treatment
treatment
treatments
tname
trId
Inh
16
Bottom-up step synthesize attributes

treatment ? trId, tname, treatments
treatments ? treatment
Syn(treatments) U Syn(treatment)
Processing of an element e Inh(e) ? subtree(e) ?
Syn(e)

Syn(treatment) Syn(trId) ? Syn(treatments)
17
Sideways step controlled derivation

patient ? SSN, name, treatments, bill
Inh(bill) Syn(treatments)
bill ? item
Inh(item) ? Q(Inh(bill) )
select trId, price
from DB2 billing
where trId in Inh(bill) -- set
membership test
DTD-directed each step of construction follows a
production

Recall DB2 billing (trId, price)
18
Constraint compilation

Captured with rules on synthesized attributes of
patient
trIdB bag-valued, collecting trIds under item
trIdS1 set-valued, collecting trIds under
treatment
trIdS2 set-valued, collecting trIds under item
key patient ( item.trId ? item )
unique (Syn(patient).trIdB) -- no
duplicates in the bag
foreign key patient ( treatment.trId ?
item.trId )
subset ( Syn(patient).trIdS1,
Syn(patient).trIdS2)
compilation semantic rules and attributes for
constraints are automatically generated and
evaluated

19
Advantages of AIG

DTD-directed view definition automatically
ensures conformance to DTD -- recursive,
nondeterministic
Constraint compilation automatically captures
integrity constraints in a uniform framework
performance avoid post-materialization checking
optimization jointly with query evaluation
exception handling actions when constraints are
violated
Controlled derivation supports context-dependent
generation.
Information passing top-down, bottom-up,
sideways
Multi-source queries optimizer-based
decomposition
One sweep each node is visited at most twice
evaluates its inherited attribute, subtree, then
its synthesized attribute

20
Middleware evaluation of AIGs
optimizer
AIG
XML
merging
query plan execution
pre-processing
tagging
scheduling
data
query
DB3
cost statistics
DB2

pre-processing
constraint compilation
multi-source query decomposition ? single-source
queries
optimizer query plan generation using cost
statistics
execution SQL queries ? data sources, results
? mediator
tagging relational tables (paths from root) ?
XML view via merge-sorting

21
Optimization

Goal reduce response time
costs query execution, data transfer, storage
(caching)
constraints query dependency graph (DAG)
nodes queries computing inherited/synthesized
attributes
edges dependency relation (producer-consumer
relation)
recursion iterative unfolding by a certain depth
Optimization techniques
query merging
query scheduling

22
Query scheduling

Goal reduce the total response time by
increasing parallelism
Ordering execution of queries on the same site
e.g., ltQ4, Q5, Q3gt vs. ltQ3, Q5, Q4gt on DB2
Constraints
costs execution, communication, caching
overheads
dependency relation
Finding an optimal schedule NP-hard

23
Query merging

Goal reduce DB visits, leverage DB optimizer
Composition of queries on the same site
outer-join/union
Tradeoffs
large result tables with null
communication/caching cost
impact on scheduling -- changing query dependency
graph

Finding an optimal strategy for merging and
scheduling NP-hard
24
Cost-based heuristic scheduling

cost estimate given a fixed schedule, estimate
the completion time of a query Q, comp_time(Q)
statistics eval_cost(Q), size(Q), trans_cost(S,
S, size)
dependency Q cant start before comp_time(Q) if
Q -gt Q
scheduling given a fixed query dependency graph,
a heuristic based on dynamic programming to
find the most costly trailing path of each
query
sort queries to favor critical paths

25
Cost-based heuristic merging

greedy algorithm repeat until no further
improvement
merge each pair of queries on the same source
modify the query dependency graph accordingly
invoke scheduling w.r.t. the modified dependency
estimate the cost
pick the pair with the biggest improvement to
merge

Interaction between query scheduling and merging
26
Preliminary experimental study
Card(3-way self join) 4055 Card(4-way self
join) 6837

Data set running example insurance/hospital
recursion unfolding (depth) 3, 4, 5 --
treatments
data size small, medium, large generated via
ToXgene
System
DB2 v8.1 for Linux
simulation of data transfer bandwidth1Mbps

27
Experimental results ratio of improvement
438/193
61/34
28
Summary

AIG a novel specification language
ensures DTD-conformance (recursion/nondeterminism)
captures integrity constraints in a uniform
framework
supports complex transformations controlled
derivation (context-dependency), multi-source
queries, . . .
Optimization techniques nontrivial optimization
problems
constraint compilation, multi-source query
decomposition
query scheduling w.r.t. query dependency graph
query merging and its interaction with scheduling

A systematic framework for schema-directed XML
integration

Write a Comment

User Comments (0)