Capturing both Types and Constraints in Data Integration - PowerPoint PPT Presentation

1 / 28
About This Presentation
Title:

Capturing both Types and Constraints in Data Integration

Description:

Chee-Yong Chan National Univ. of Singapore. Wenfei Fan Bell Laboratories. Juliana Freire Oregon Health & Science Univ. (OGI) Rajeev Rastogi Bell Laboratories ... – PowerPoint PPT presentation

Number of Views:24
Avg rating:3.0/5.0
Slides: 29
Provided by: wenfe4
Category:

less

Transcript and Presenter's Notes

Title: Capturing both Types and Constraints in Data Integration


1
Capturing both Types and Constraints in Data
Integration
  • Michael Benedikt Bell Laboratories
  • Chee-Yong Chan National Univ. of Singapore
  • Wenfei Fan Bell Laboratories
  • Juliana Freire Oregon Health Science
    Univ. (OGI)
  • Rajeev Rastogi Bell Laboratories

2
Schema-directed integration

data exchange
  • Integration
  • extract relevant data from distributed, multiple
    databases
  • construct an XML view
  • Schema-directed conformance to a predefined
    schema (D, ?)
  • D a DTD, type constraints
  • ? a set of XML integrity constraints (keys,
    foreign keys)

3
Example hospital and insurance company
  • patient (SSN, name, policy)
  • visitInfo (SSN, trId, date)

billing (trId, price)
cover (policy, trId)
daily XML report
treatment (trId, name) procedure (trId1, trId2)
  • Given a date, for each patient of that day,
    report
  • SSN, name (DB1)
  • treatments (hierarchy) covered by insurance
    (DB1, DB3, DB4)
  • cost of all and only those treatments received
    (DB2)

4
Predefined schema (D, ?) DTD
  • report ? patient
  • patient ? SSN, name,
    treatments, bill
  • treatments ? treatment
  • treatment ? trId, tname,
    treatments
  • bill ? item
  • item ? trId, price

report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
. . .
price
trId
trId
tname
treatments
5
Predefined schema (D, ?) XML constraints ?
  • constraints relative to each patient,
  • key each treatment is charged only once
  • patient ( item.trId ? item )
  • foreign key every treatment has a billing record
  • patient ( treatment.trId ? item.trId )

report
date
patient
patient
patient
name
treatments
SSN
bill
treatment
item
item
treatment
. . .
price
trId
treatments
6
Challenge nondeterministic structure
  • certain structure cannot be decided at compile
    time
  • recursion expansion - the depth of XML tree
  • treatments ? treatment
  • treatment ? trId, tname, treatments
    --- recursive
  • choice of children in a disjunction, e.g. A ? B
    C

report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
price
trId
trId
tname
treatments
. . .
unbounded --gt
7
Challenge context dependency
  • bill subtree all and only the trIds in the
    treatments subtree
  • controlled derivation the bill subtree cannot be
    started before the treatments subtree is
    completed.
  • information passing downward, upward, sideways

date
report
patient
date
SSN
treatments
name
SSN
bill
trIdS
trId
trId
unbounded
8
Challenges
  • DTD-conformance recursive, nondeterministic
  • integrity constraints validation during
    document generation
  • multi-source queries a single one involves
    several databases
  • context-dependency not strictly top-down or
    bottom-up
  • Previous work
  • XML publishing single data source, no
    constraints, top-down.
  • XML integration little support for XML
    schema-conformance.
  • XML query languages type checking provide no
    guidance for how to ensure schema-conformance
    optimization hard.

Schema-directed XML integration is nontrivial!
9
Our middleware for schema-directed integration

view definition language
optimization techniques
  • A lightweight language Attribute Integration
    Grammar (AIG)
  • Cost-based optimization in light of
    context-dependency

DTD constraints
semantic attributes

semantic rules
10
Attribute Integration Grammar (AIG)
  • DTD element type definitions e ? ?
  • ? PCDATA ? e1, , en
    e1 en e
  • Attributes associated with each element type e
  • Inh(e) inherited from parent/siblings
    (top-down/sideways)
  • Syn(e) synthesized from children (bottom-up)
  • Syn(e), Inh(e) tuple or set/bag-valued
  • Rules associated with each production e ? ?, for
    e in ?
  • Inh(child) Inh(e) Q(Inh(parent),
    Syn(sibling) )
  • Syn(parent) Syn(e) U (Syn(children))
    -- union
  • Q multi-source SQL query with parameters
  • Dependency e2 must be evaluated before e1 if
  • Inh(e1) Q( Syn(e2) ) -- acyclic graph
    (DAG)

11
AIG semantics conceptual evaluation
  • following the dependency ordering starting from
    the root
  • report ? patient
  • Inh(patient) ? Q1 (Inh(report))
  • select Inh(report) as date, p.SSN,
    p.name, p.policy
  • from DB1 patient p, DB1
    visitInfo v
  • where p.SSN v.SSN and v.date
    Inh(report)
  • Recall DB1 patient (SSN, name, policy),
    visitInfo (SSN, trId, date)
  • Parameter in a query Inh(report) as a constant
  • Data driven the number of patients depends on Q1

report
Inh
patient
patient
patient
12
Multi-source query
  • patient ? SSN, name, treatments, bill
  • Inh(SSN) Inh(patient).SSN, . . . ,
  • Inh(treatments) Q2(v) -- v
    Inh(patient)
  • select t.trId, t.tname
  • from DB1 visitInfo i, DB3 cover
    c, DB4 treatment t
  • where i.SSN v.SSN and i.date
    v.date and t.trId i.trId
  • and c.trId i.trId and c.policy
    v.policy
  • a single query uses DB1, DB3 and DB4
  • tuple- and set-valued attributes (Inh(SSN),
    Inh(treatments))
  • Recall DB1 patient(SSN, name, policy),
    visitInfo(SSN, trId, date)
  • DB3 cover (policy, trId)
  • DB4 treatment (trId, name),
    procedure (trId1, trId2)

13
Initial top-down pass context-dependent
  • patient ? SSN, name, treatments, bill
  • Inh(SSN) Inh(patient).SSN, Inh(name)
    Inh(patient).name,
  • Inh(treatments) Q2(Inh(patient))
  • Inh(bill) Syn(treatments) lt--
    halt
  • DTD-directed generate children following the
    production
  • Inh(bill) defined with sibling --
    Syn(treatments), dependency ordering evaluate
    bill after treatments

lt- - halt
14
Initial top-down pass recursion
  • treatments ? treatment
  • Inh(treatment) ? Inh(treatments) -- set of
    (trId, tname)
  • Data driven treatments expansion depends on
    Inh(treatments)
  • empty expansion terminates, Syn(treatments) is
    empty
  • nonempty expands.

15
Leaf step
  • treatment ? trId, tname, treatments
  • Inh(trId) Inh(treatment).trId, . . .,
  • trId ? PCDATA

Syn(trId) Inh(trId)
treatments
Inh (trId, tname)
treatment
treatment
treatments
tname
trId
Inh
16
Bottom-up step synthesize attributes
  • treatment ? trId, tname, treatments
  • treatments ? treatment
  • Syn(treatments) U Syn(treatment)
  • Processing of an element e Inh(e) ? subtree(e) ?
    Syn(e)

Syn(treatment) Syn(trId) ? Syn(treatments)
17
Sideways step controlled derivation
  • patient ? SSN, name, treatments, bill
  • Inh(bill) Syn(treatments)
  • bill ? item
  • Inh(item) ? Q(Inh(bill) )
  • select trId, price
  • from DB2 billing
  • where trId in Inh(bill) -- set
    membership test
  • DTD-directed each step of construction follows a
    production

Recall DB2 billing (trId, price)
18
Constraint compilation
  • Captured with rules on synthesized attributes of
    patient
  • trIdB bag-valued, collecting trIds under item
  • trIdS1 set-valued, collecting trIds under
    treatment
  • trIdS2 set-valued, collecting trIds under item
  • key patient ( item.trId ? item )
  • unique (Syn(patient).trIdB) -- no
    duplicates in the bag
  • foreign key patient ( treatment.trId ?
    item.trId )
  • subset ( Syn(patient).trIdS1,
    Syn(patient).trIdS2)
  • compilation semantic rules and attributes for
    constraints are automatically generated and
    evaluated

19
Advantages of AIG
  • DTD-directed view definition automatically
    ensures conformance to DTD -- recursive,
    nondeterministic
  • Constraint compilation automatically captures
    integrity constraints in a uniform framework
  • performance avoid post-materialization checking
  • optimization jointly with query evaluation
  • exception handling actions when constraints are
    violated
  • Controlled derivation supports context-dependent
    generation.
  • Information passing top-down, bottom-up,
    sideways
  • Multi-source queries optimizer-based
    decomposition
  • One sweep each node is visited at most twice
    evaluates its inherited attribute, subtree, then
    its synthesized attribute

20
Middleware evaluation of AIGs
optimizer
AIG
XML
merging
query plan execution
pre-processing
tagging
scheduling
data
query
DB3
cost statistics
DB2
  • pre-processing
  • constraint compilation
  • multi-source query decomposition ? single-source
    queries
  • optimizer query plan generation using cost
    statistics
  • execution SQL queries ? data sources, results
    ? mediator
  • tagging relational tables (paths from root) ?
    XML view via merge-sorting
  • DB1

21
Optimization
  • Goal reduce response time
  • costs query execution, data transfer, storage
    (caching)
  • constraints query dependency graph (DAG)
  • nodes queries computing inherited/synthesized
    attributes
  • edges dependency relation (producer-consumer
    relation)
  • recursion iterative unfolding by a certain depth
  • Optimization techniques
  • query merging
  • query scheduling

22
Query scheduling
  • Goal reduce the total response time by
    increasing parallelism
  • Ordering execution of queries on the same site
  • e.g., ltQ4, Q5, Q3gt vs. ltQ3, Q5, Q4gt on DB2
  • Constraints
  • costs execution, communication, caching
    overheads
  • dependency relation
  • Finding an optimal schedule NP-hard

23
Query merging
  • Goal reduce DB visits, leverage DB optimizer
  • Composition of queries on the same site
    outer-join/union
  • Tradeoffs
  • large result tables with null
    communication/caching cost
  • impact on scheduling -- changing query dependency
    graph

Finding an optimal strategy for merging and
scheduling NP-hard
24
Cost-based heuristic scheduling
  • cost estimate given a fixed schedule, estimate
    the completion time of a query Q, comp_time(Q)
  • statistics eval_cost(Q), size(Q), trans_cost(S,
    S, size)
  • dependency Q cant start before comp_time(Q) if
    Q -gt Q
  • scheduling given a fixed query dependency graph,
    a heuristic based on dynamic programming to
  • find the most costly trailing path of each
    query
  • sort queries to favor critical paths

25
Cost-based heuristic merging
  • greedy algorithm repeat until no further
    improvement
  • merge each pair of queries on the same source
  • modify the query dependency graph accordingly
  • invoke scheduling w.r.t. the modified dependency
  • estimate the cost
  • pick the pair with the biggest improvement to
    merge

Interaction between query scheduling and merging
26
Preliminary experimental study
Card(3-way self join) 4055 Card(4-way self
join) 6837
  • Data set running example insurance/hospital
  • recursion unfolding (depth) 3, 4, 5 --
    treatments
  • data size small, medium, large generated via
    ToXgene
  • System
  • DB2 v8.1 for Linux
  • simulation of data transfer bandwidth1Mbps

27
Experimental results ratio of improvement
438/193
61/34
28
Summary
  • AIG a novel specification language
  • ensures DTD-conformance (recursion/nondeterminism)
  • captures integrity constraints in a uniform
    framework
  • supports complex transformations controlled
    derivation (context-dependency), multi-source
    queries, . . .
  • Optimization techniques nontrivial optimization
    problems
  • constraint compilation, multi-source query
    decomposition
  • query scheduling w.r.t. query dependency graph
  • query merging and its interaction with scheduling

A systematic framework for schema-directed XML
integration
Write a Comment
User Comments (0)
About PowerShow.com