LocalasView Data Integration - PowerPoint PPT Presentation

1 / 28
About This Presentation
Title:

LocalasView Data Integration

Description:

... as the subset of a query over the mediated schema the 'open world assumption' ... Sources may not have all the data from the domain 'open-world assumption' ... – PowerPoint PPT presentation

Number of Views:79
Avg rating:3.0/5.0
Slides: 29
Provided by: zack4
Category:

less

Transcript and Presenter's Notes

Title: LocalasView Data Integration


1
Local-as-ViewData Integration
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 650 Database Information Systems
  • October 16, 2008

2
Administrivia
  • Next week
  • Please read Section 3, 6, 7 in the Deshpande et
    al. survey
  • Midterm assignment will be handed out Thursday,
    with a one week clock

3
Recall Kinds of Schema Mappings
  • Global As View (GAV)
  • ?x M(x) ? ?y1y2 R1(x1y1) ? R2(x2y2) ?
    c(x1y1x2y2) where x ? x1x2xn
  • Local As View (LAV)
  • ?x,y,z MR(x,y) ? MS(y,z) ? c(xyz) ? ?w
    R1(x1y1z1w) where x1y1z1 ? xyz
  • Global-Local As View (GLAV), aka Tuple-Generating
    Dependencies (TGDs)
  • ?xyz MR(x,y) ? MS(y,z) ? c1(xyz) ? ?y1y2 R1(w1y1)
    ? R2(w2y2) ? c2(xyzy1y2) ? where w1w2 ? xyz
  • Query is of the form Q(x) - MR(x1,y1),
    MS(x2,y2),

4
Global-As-View Mappings
  • Very easy to implement doesnt require any new
    logic on the part of a regular DBMS engine
  • For instance, Starburst QGM rewrites would work
  • But some drawbacks primarily that
  • We dont have a mechanism to describe when a
    source contains only a subset of the data in the
    mediated schema
  • e.g., All books from this source are of type
    textbook
  • The mediated schema often needs to change as we
    add sources it is somewhat brittle because
    its defined in terms of sources

5
An Alternate ApproachLocal-As-View
  • When you integrate something, you have some
    conceptual model of the integrated domain
  • Define that as a basic frame of reference,
    everything else as a view over it
  • Local as View using mappings that are
    conjunctive queries
  • May have overlapping/incomplete sources
  • Define each source as the subset of a query over
    the mediated schema the open world assumption
  • We can use selection or join predicates to
    specify that a source contains a range of values
  • ComputerBooks() ? Books(Title, , Subj), Subj
    Computers

6
The Local-as-View Model
  • The basic model is the following
  • Local sources are views over the mediated
    schema
  • Sources have the data mediated schema is
    virtual
  • Sources may not have all the data from the domain
    open-world assumption
  • The system must use the sources (views) to answer
    queries over the mediated schema

7
Answering Queries Using Views
  • Assumption conjunctive queries, set semantics
  • Suppose we have a mediated schema show(ID,
    title, year, genre), rating(ID, stars, source)
  • A conjunctive query might be q(t) - show(i,
    t, y, g), rating(i, 5, s)
  • Recall intuitions about this class of queries
  • Adding a conjunct to a query (e.g., t 1997)
    removes answers from the result but never adds
    any
  • Any conjunctive query with at least the same
    constraints conjuncts will give valid answers

8
Why This Class of Mappings Queries?
  • Abiteboul Duschka showed the data complexity of
    answering queries using views with OWA
  • Note that the common inflationary semantics
    version of Datalog must terminate in PTIME, even
    with recursion

9
Query Answering
  • Suppose we have the query
  • q(t) - show(i, t, y, g), rating(i, 5, s)
  • and sources
  • 5star(i) - show(i, t, y, g), rating(i, 5, s)
  • TVguide(t,y,g,r) - show(i, t, y, g), rating(i,
    r, TVGuide)
  • movieInfo(i,t,y,g) - show(i, t, y, g)
  • critics(i,r, s) - rating(i, r, s)
  • goodMovies(t,y) - show(i, t, y, drama),
    rating(i, 5, s), y 1997
  • We want to compose the query with the source
    mappings but theyre in the wrong direction!

10
Query Answering
  • Suppose we have the query
  • q(t) - show(i, t, y, g), rating(i, 5, s)
  • and sources
  • 5star(i) ? show(i, t, y, g), rating(i, 5, s)
  • TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
    TVGuide)
  • movieInfo(i,t,y,g) ? show(i, t, y, g)
  • critics(i,r, s) ? rating(i, r, s)
  • goodMovies(t,y) ? show(i, t, y, drama),
    rating(i, 5, s), y 1997
  • We want to compose the query with the source
    mappings but theyre in the wrong direction!

11
Inverse Rules
  • We can take every mapping definition and invert
    it, though sometimes we may have insufficient
    information
  • If
  • 5star(i) ? show(i, t, y, g), rating(i, 5, s)
  • then it is also the case that
  • show(i,??? ,??? ,??? ,???) ? 5star(i)
  • and we can write this as a datalog rule
  • show(i,-,- ,-,-) - 5star(i)
  • But how to handle the absence of attributes?
  • We know that there must be AT LEAST one instance
    of ??? for each attribute for each show ID
  • So we might simply insert a NULL and define that
    NULL means unknown (as opposed to missing)

12
But NULLs Lose Information
  • Suppose we take these rules and ask for
  • q(t) - show(i, t, y, g), rating(i, 5, s)
  • If we look at the rule
  • goodMovies(t,y) ? show(i, t, y, drama),
    rating(i, 5, s), y 1997
  • By inspection, q(t) ? goodMovies(t,y)
  • But if apply our inversion procedure, we get
  • show(i, t, y, g) ? goodMovies(t,y), i NULL, g
    drama, y 1997
  • rating(i, r, s) ? goodMovies(t,y), i NULL, r
    5, s NULL
  • We need a special NULL so we can figure out
    which IDs and ratings match up

13
The Solution Skolem Functions
  • Skolem functions
  • Conceptual perfect hash functions
  • Each function returns a unique, deterministic
    value for each combination of input values
  • Every function returns a non-overlapping set of
    values (Skolem function F will never return a
    value that matches any of Skolem function Gs
    values)
  • Skolem functions wont ever be part of the answer
    set or the computation it doesnt produce real
    values
  • Theyre just a way of logically generating
    special NULLs

14
Query Answering Using Inverse Rules
  • Invert all rules using the procedures described
  • Take the query and the possible rule expansions
    and execute them in a Datalog interpreter
  • Really the same process as GAV unfolding!
  • In the previous query, we expand with all
    combinations of expansions of show and of rating
    every possible way of combining and
    cross-correlating info from different sources
  • Then discard unsatisfiable rewritings via
    unification, i.e., substituting in constants from
    the query for variables in the view
  • Finally, execute the union of all satisfiable
    rewritings

15
Pros Cons of Inverse Rules
  • Works even with recursive queries, binding
    patterns, FDs on schemas
  • Generally, they take view definitions, split
    them, and re-join them to produce answers
  • Not very efficient
  • No treatment of lt, gt predicates
  • Can we do better? i.e., be more efficient?

16
The Bucket Algorithm
  • Given a query Q with relations and predicates
  • Create a bucket for each subgoal in Q
  • Iterate over each view (source mapping)
  • If source includes buckets subgoal
  • Create mapping between qs vars and the views
    var at the same position
  • If satisfiable with substitutions, add to bucket
  • Do cross-product of buckets, see if result is
    contained (exptime, but queries are probably
    relatively small)
  • For each result, do a containment check to make
    sure the rewriting is contained within the query

17
Lets Try a Bucket Example
  • Query
  • q(t) - show(i, t, y, g), rating(i, 5, s)
  • Sources
  • 5star(i) ? show(i, t, y, g), rating(i, 5, s)
  • TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
    TVGuide)
  • movieInfo(i,t,y,g) ? show(i, t, y, g)
  • critics(i,r,s) ? rating(i, r, s)
  • goodMovies(t,y) ? show(i, t, y, drama),
    rating(i, 5, s), y 1997
  • good98(t,y) ? show(i, t, y, drama), rating(i,
    5, s), y 1998

18
Populating the Buckets
19
Evaluation
  • On the board

20
Example of Containment Testing
  • Suppose we have two queriesq1(t) - show(i, t,
    y, g), rating(i, 5, s) , y 1997 q2(t,y) -
    show(i, t, y, drama), rating(i, 5, s)
  • Intuitively, q1 must contain the same or fewer
    answers vs. q2
  • It has all of the same conditions, except one
    extra conjunction (i.e., its more restricted)
  • Theres no union or any other way it can add more
    data
  • We can say that q2 contains q1 because this holds
    for any instance of our mediated schema

21
Checking Containment via Canonical Databases
  • To test for q1 µ q2
  • Create a canonical DB that contains a tuple for
    each subgoal in q1
  • Execute q2 over it
  • If q2 returns a tuple that matches the head of
    q1, then q1 µ q2
  • (This is an NP-complete algorithm in the size of
    the query. Testing for full first-order logic
    queries is undecidable!!!)
  • Lets see this for our example

22
Example Canonical DB
  • q1(t) - show(i, t, 1997, g), rating(i, 5, s)
  • q2(t,y) - show(i, t, y, drama), rating(i, 5,
    s)

rating
show
Need to get tuple lttgt in executing q2 over this
database
What if q2 didnt ask for g drama?
23
Buckets, Rev. 2 The MiniCon Algorithm
  • A much smarter bucket algorithm
  • In many cases, we dont need to perform the
    cross-product of all items in all buckets
  • Eliminates the need for the containment check
  • This and the Chase Backchase strategy of
    Tannen et al are the two methods most used in
    virtual data integration today
  • The chase procedure (though not the backchase)
    can be achieved using a Datalog program that
    looks much like the inverse rules!

24
Minicon Descriptions (MCDs)
  • Basically, a modification to the bucket approach
  • head homomorphism defines what variables must
    be equated by the query
  • Variable-substituted version of the subgoals
  • Mapping of variable names
  • Info about whats covered
  • Property 1
  • If a variable occurs in the head of a query, then
    there must be a corresponding variable in the
    head of the MCD view
  • If a variable participates in a join predicate in
    the query, then it must be in the head of the view

25
MCD Construction
  • For each subgoal of the query
  • For each subgoal of each view
  • Choose the least restrictive head homomorphism to
    match the subgoal of the query
  • If we can find a way of mapping the variables,
    then add MCD for each possible maximal
    extension of the mapping that satisfies Property 1

26
MCDs for Our Example
  • 5star(i) ? show(i, t, y, g), rating(i, 5, s)
  • TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
    TVGuide)
  • movieInfo(i,t,y,g) ? show(i, t, y, g)
  • critics(i,r,s) ? rating(i, r, s)
  • goodMovies(t,y) ? show(i, t, 1997, drama),
    rating(i, 5, s)
  • good98(t,y) ? show(i, t, 1998, drama),
    rating(i, 5, s)

q(t) - show(i, t, y, g), rating(i, r, s), r 5
27
Combining MCDs
  • Now look for ways of combining pairwise disjoint
    subsets of the goals
  • Greatly reduces the number of candidates!
  • Also proven to be correct without the use of a
    containment check
  • Variations need to be made for
  • Constants in general (I sneaked those in)
  • Semi-interval predicates (x lt c)
  • Note that full-blown inequality predicates are
    co-NP-hard in the size of the data, so they dont
    work

28
MiniCon and LAV Summary
  • The state-of-the-art for AQUV in the relational
    world of data integration
  • Its been extended to support conjunctive
    XQuery as well
  • Scales to large numbers of views, which we need
    in LAV data integration
  • Chase Backchase by Tannen et al.
  • A procedure that has very close connections to
    inverse rules
  • Slightly more general in some ways but
  • Typically produces equivalent rewritings, not
    maximally contained ones
  • Not always polynomial in the size of the data
    (though for some useful settings it is!!!)
Write a Comment
User Comments (0)
About PowerShow.com