Title: Capturing both Types and Constraints in Data Integration
1Capturing both Types and Constraints in Data
Integration
- Michael Benedikt Bell Laboratories
- Chee-Yong Chan National Univ. of Singapore
- Wenfei Fan Bell Laboratories
- Juliana Freire Oregon Health Science
Univ. (OGI) - Rajeev Rastogi Bell Laboratories
2Schema-directed integration
data exchange
- Integration
- extract relevant data from distributed, multiple
databases - construct an XML view
- Schema-directed conformance to a predefined
schema (D, ?) - D a DTD, type constraints
- ? a set of XML integrity constraints (keys,
foreign keys)
3Example hospital and insurance company
- patient (SSN, name, policy)
- visitInfo (SSN, trId, date)
billing (trId, price)
cover (policy, trId)
daily XML report
treatment (trId, name) procedure (trId1, trId2)
- Given a date, for each patient of that day,
report - SSN, name (DB1)
- treatments (hierarchy) covered by insurance
(DB1, DB3, DB4) - cost of all and only those treatments received
(DB2)
4Predefined schema (D, ?) DTD
- report ? patient
- patient ? SSN, name,
treatments, bill - treatments ? treatment
- treatment ? trId, tname,
treatments - bill ? item
- item ? trId, price
report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
. . .
price
trId
trId
tname
treatments
5Predefined schema (D, ?) XML constraints ?
- constraints relative to each patient,
- key each treatment is charged only once
- patient ( item.trId ? item )
- foreign key every treatment has a billing record
- patient ( treatment.trId ? item.trId )
report
date
patient
patient
patient
name
treatments
SSN
bill
treatment
item
item
treatment
. . .
price
trId
treatments
6Challenge nondeterministic structure
- certain structure cannot be decided at compile
time - recursion expansion - the depth of XML tree
- treatments ? treatment
- treatment ? trId, tname, treatments
--- recursive - choice of children in a disjunction, e.g. A ? B
C
report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
price
trId
trId
tname
treatments
. . .
unbounded --gt
7Challenge context dependency
- bill subtree all and only the trIds in the
treatments subtree - controlled derivation the bill subtree cannot be
started before the treatments subtree is
completed. - information passing downward, upward, sideways
date
report
patient
date
SSN
treatments
name
SSN
bill
trIdS
trId
trId
unbounded
8Challenges
- DTD-conformance recursive, nondeterministic
- integrity constraints validation during
document generation - multi-source queries a single one involves
several databases - context-dependency not strictly top-down or
bottom-up - Previous work
- XML publishing single data source, no
constraints, top-down. - XML integration little support for XML
schema-conformance. - XML query languages type checking provide no
guidance for how to ensure schema-conformance
optimization hard.
Schema-directed XML integration is nontrivial!
9Our middleware for schema-directed integration
view definition language
optimization techniques
- A lightweight language Attribute Integration
Grammar (AIG) - Cost-based optimization in light of
context-dependency
DTD constraints
semantic attributes
semantic rules
10Attribute Integration Grammar (AIG)
- DTD element type definitions e ? ?
- ? PCDATA ? e1, , en
e1 en e - Attributes associated with each element type e
- Inh(e) inherited from parent/siblings
(top-down/sideways) - Syn(e) synthesized from children (bottom-up)
- Syn(e), Inh(e) tuple or set/bag-valued
- Rules associated with each production e ? ?, for
e in ? - Inh(child) Inh(e) Q(Inh(parent),
Syn(sibling) ) - Syn(parent) Syn(e) U (Syn(children))
-- union - Q multi-source SQL query with parameters
- Dependency e2 must be evaluated before e1 if
- Inh(e1) Q( Syn(e2) ) -- acyclic graph
(DAG)
11AIG semantics conceptual evaluation
- following the dependency ordering starting from
the root - report ? patient
- Inh(patient) ? Q1 (Inh(report))
- select Inh(report) as date, p.SSN,
p.name, p.policy - from DB1 patient p, DB1
visitInfo v - where p.SSN v.SSN and v.date
Inh(report) - Recall DB1 patient (SSN, name, policy),
visitInfo (SSN, trId, date) - Parameter in a query Inh(report) as a constant
- Data driven the number of patients depends on Q1
report
Inh
patient
patient
patient
12Multi-source query
- patient ? SSN, name, treatments, bill
- Inh(SSN) Inh(patient).SSN, . . . ,
- Inh(treatments) Q2(v) -- v
Inh(patient) - select t.trId, t.tname
- from DB1 visitInfo i, DB3 cover
c, DB4 treatment t - where i.SSN v.SSN and i.date
v.date and t.trId i.trId - and c.trId i.trId and c.policy
v.policy - a single query uses DB1, DB3 and DB4
- tuple- and set-valued attributes (Inh(SSN),
Inh(treatments))
- Recall DB1 patient(SSN, name, policy),
visitInfo(SSN, trId, date) - DB3 cover (policy, trId)
- DB4 treatment (trId, name),
procedure (trId1, trId2)
13Initial top-down pass context-dependent
- patient ? SSN, name, treatments, bill
- Inh(SSN) Inh(patient).SSN, Inh(name)
Inh(patient).name, - Inh(treatments) Q2(Inh(patient))
- Inh(bill) Syn(treatments) lt--
halt - DTD-directed generate children following the
production - Inh(bill) defined with sibling --
Syn(treatments), dependency ordering evaluate
bill after treatments
lt- - halt
14Initial top-down pass recursion
- treatments ? treatment
- Inh(treatment) ? Inh(treatments) -- set of
(trId, tname) - Data driven treatments expansion depends on
Inh(treatments) - empty expansion terminates, Syn(treatments) is
empty - nonempty expands.
15Leaf step
- treatment ? trId, tname, treatments
- Inh(trId) Inh(treatment).trId, . . .,
-
- trId ? PCDATA
Syn(trId) Inh(trId)
treatments
Inh (trId, tname)
treatment
treatment
treatments
tname
trId
Inh
16Bottom-up step synthesize attributes
- treatment ? trId, tname, treatments
- treatments ? treatment
- Syn(treatments) U Syn(treatment)
- Processing of an element e Inh(e) ? subtree(e) ?
Syn(e)
Syn(treatment) Syn(trId) ? Syn(treatments)
17Sideways step controlled derivation
- patient ? SSN, name, treatments, bill
- Inh(bill) Syn(treatments)
- bill ? item
- Inh(item) ? Q(Inh(bill) )
- select trId, price
- from DB2 billing
- where trId in Inh(bill) -- set
membership test - DTD-directed each step of construction follows a
production
Recall DB2 billing (trId, price)
18Constraint compilation
- Captured with rules on synthesized attributes of
patient - trIdB bag-valued, collecting trIds under item
- trIdS1 set-valued, collecting trIds under
treatment - trIdS2 set-valued, collecting trIds under item
- key patient ( item.trId ? item )
- unique (Syn(patient).trIdB) -- no
duplicates in the bag - foreign key patient ( treatment.trId ?
item.trId ) - subset ( Syn(patient).trIdS1,
Syn(patient).trIdS2) - compilation semantic rules and attributes for
constraints are automatically generated and
evaluated
19Advantages of AIG
- DTD-directed view definition automatically
ensures conformance to DTD -- recursive,
nondeterministic - Constraint compilation automatically captures
integrity constraints in a uniform framework - performance avoid post-materialization checking
- optimization jointly with query evaluation
- exception handling actions when constraints are
violated - Controlled derivation supports context-dependent
generation. - Information passing top-down, bottom-up,
sideways - Multi-source queries optimizer-based
decomposition - One sweep each node is visited at most twice
evaluates its inherited attribute, subtree, then
its synthesized attribute
20Middleware evaluation of AIGs
optimizer
AIG
XML
merging
query plan execution
pre-processing
tagging
scheduling
data
query
DB3
cost statistics
DB2
- pre-processing
- constraint compilation
- multi-source query decomposition ? single-source
queries - optimizer query plan generation using cost
statistics - execution SQL queries ? data sources, results
? mediator - tagging relational tables (paths from root) ?
XML view via merge-sorting
21Optimization
- Goal reduce response time
- costs query execution, data transfer, storage
(caching) - constraints query dependency graph (DAG)
- nodes queries computing inherited/synthesized
attributes - edges dependency relation (producer-consumer
relation) - recursion iterative unfolding by a certain depth
- Optimization techniques
- query merging
- query scheduling
22Query scheduling
- Goal reduce the total response time by
increasing parallelism - Ordering execution of queries on the same site
- e.g., ltQ4, Q5, Q3gt vs. ltQ3, Q5, Q4gt on DB2
- Constraints
- costs execution, communication, caching
overheads - dependency relation
- Finding an optimal schedule NP-hard
23Query merging
- Goal reduce DB visits, leverage DB optimizer
- Composition of queries on the same site
outer-join/union - Tradeoffs
- large result tables with null
communication/caching cost - impact on scheduling -- changing query dependency
graph
Finding an optimal strategy for merging and
scheduling NP-hard
24Cost-based heuristic scheduling
- cost estimate given a fixed schedule, estimate
the completion time of a query Q, comp_time(Q) - statistics eval_cost(Q), size(Q), trans_cost(S,
S, size) - dependency Q cant start before comp_time(Q) if
Q -gt Q - scheduling given a fixed query dependency graph,
a heuristic based on dynamic programming to - find the most costly trailing path of each
query - sort queries to favor critical paths
25Cost-based heuristic merging
- greedy algorithm repeat until no further
improvement - merge each pair of queries on the same source
- modify the query dependency graph accordingly
- invoke scheduling w.r.t. the modified dependency
- estimate the cost
- pick the pair with the biggest improvement to
merge
Interaction between query scheduling and merging
26Preliminary experimental study
Card(3-way self join) 4055 Card(4-way self
join) 6837
- Data set running example insurance/hospital
- recursion unfolding (depth) 3, 4, 5 --
treatments - data size small, medium, large generated via
ToXgene - System
- DB2 v8.1 for Linux
- simulation of data transfer bandwidth1Mbps
27Experimental results ratio of improvement
438/193
61/34
28Summary
- AIG a novel specification language
- ensures DTD-conformance (recursion/nondeterminism)
- captures integrity constraints in a uniform
framework - supports complex transformations controlled
derivation (context-dependency), multi-source
queries, . . . - Optimization techniques nontrivial optimization
problems - constraint compilation, multi-source query
decomposition - query scheduling w.r.t. query dependency graph
- query merging and its interaction with scheduling
A systematic framework for schema-directed XML
integration