Data Integration Methods - PowerPoint PPT Presentation

1 / 38
About This Presentation
Title:

Data Integration Methods

Description:

Today's Trivia Question. 4. A Problem ... where show.title = 'Star Wars' and show.genre = 'sci-fi' ... where $s/title/text() = 'Star Wars' and $s/genre/text ... – PowerPoint PPT presentation

Number of Views:103
Avg rating:3.0/5.0
Slides: 39
Provided by: zack4
Category:

less

Transcript and Presenter's Notes

Title: Data Integration Methods


1
Data Integration Methods
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 650 Database Information Systems
  • February 16, 2004

2
Administrivia
  • Next reading assignment on scalable query
    reformulation algorithms
  • Pottinger and Halevy MiniCon
  • Write-up summarize the main ideas of this paper

3
Todays Trivia Question
4
A Problem
  • Weve seen that even with normalization and the
    same needs, different people will arrive at
    different schemas
  • In fact, most people also have different needs!
  • Often people build databases in isolation, then
    want to share their data
  • Different systems within an enterprise
  • Different information brokers on the Web
  • Scientific collaborators
  • Researchers who want to publish their data for
    others to use
  • This is the goal of data integration tie
    together different sources, controlled by many
    people, under a common schema

5
Example
  • We want to build the UltimateMovieGuideTM
  • Given
  • TV Guide schema ShowingAt(time, title,
    year) Shows(title, year, genre,
    rating) Director(title, year, name) Starring(tit
    le, year, name)
  • GoodMovies.com FourStarMovie(title, year,
    genre) DecentMovie(title, year)
  • OscarWinners WinningDirector(director, title,
    year)
  • Documentaries.org Documentary(title, year,
    director, producer)

6
Integrating Data
  • Several steps
  • Getting data out from a data source it may have
    its own query/retrieval interface
  • Sometimes it will need a query before it returns
    answers, e.g., a Web form
  • Getting all of the data into the same data model
    and format
  • Translating the data into the same schema
  • Answering queries
  • How might we handle this?

7
Data Warehouses Offline Replication
  • Get experts together, define a schema they think
    best captures all the info
  • Define a database with this schema
  • Define procedural mappings in an ETL tool to
    import the data
  • Perhaps perform data cleaning
  • Periodically copy all of the data from the data
    sources
  • Note that the sources and the warehouse are
    basically independent at this point

Results
Query
Data Warehouse
Remote, Autonomous Data Sources
8
Pros and Cons of Data Warehouses
  • Need to spend time to design the physical
    database layout, as well as logical
  • This actually takes a lot of effort!
  • Data is generally not up-to-date (lazy or offline
    refresh)
  • Queries over the warehouse dont disrupt the data
    sources
  • Can run very heavy-duty computations, including
    data mining and cleaning

9
An Alternative Mediators or Virtual Integration
Systems
  • Get experts together, define a schema they think
    best captures all the info
  • Define as a virtual mediated schema
  • Create declarative mappings specifying how to get
    data from each source into the warehouse
  • Evaluate queries over the mediated schema on the
    fly using the current data at the sources

Results
Query
Data Integration System
Mediated Schema
Source Catalog
Schema Mappings
Remote, Autonomous Data Sources
10
Core Question How Do We Define and Use
Mappings?
  • Queries must be directly composed with mappings
  • Leads to use of views as the means of specifying
    mappings
  • So which direction do we specify views?
  • Mediated relations as views over source relations
  • Source relations as views over mediated relations
  • TSIMMIS chooses option 1
  • Information Mainfold chooses option 2
  • Neither is perfect or comprehensive, as well see

11
The Job of Mappings
  • Between different data sources
  • May have different numbers of tables different
    decompositions
  • Attributes may be broken down differently
    (rating vs. EbertThumb and RoeperThumb)
  • Metadata in one relation may be data in another
  • Values may not exactly correspond (shows vs.
    movies)
  • It may be unclear whether a value is the same
    (COPPOLA vs. Francis Ford Coppola)
  • May have different, but synonymous terms (ImdbID
    123456 ? SSN 987-45-3210)
  • Might have sub/superclass relationships

12
General Techniques
  • Value-value correspondences accomplished using
    concordance tables
  • Join through a table mapping values to values
  • Imdb_Actor(ID, SAG_actor_name)
  • Table-multitable correspondences accomplished
    using joins (in one direction), projections (in
    other direction)
  • Key question what happens if a needed attribute
    is missing? (e.g., DecentMovie has no genre)
  • Super/subclass relationships generally must be
    captured using selection (in one direction),
    union (in other direction)
  • And sometimes we just cant specify the
    correspondence!

13
Some Examples of Mappings
  • Show(ID, Title, Year, Lang, Genre)
  • Movie(ID, Title, Year, Genre, Director, Star1,
    Star2)
  • EnglishMovie(Title, Year, Genre, Rating)
  • Docu(ID, Title, Year)Participant(ID, Name, Role)

PieceOfArt(I, T, Y, English, G) -
EnglishMovie(T, Y, G, _), MovieIDFor(I, T, Y)
Movie(I, T, Y, doc, D, S1, S2) - Docu(I, T,
Y), Participant(I, D, Dir), Participant(I,
S1, Cast1), Participant(I, S2, Cast2)
T1
ImdbID CastOf
1234 Catwoman
Name CastOf
Berry, H. Monsters Ball
T2
Need a concordance table from ImdbIDs to actress
names
14
TSIMMIS and Information Manifold
  • Focus Web-based queryable sources
  • CGI forms, online databases, maybe a few RDBMSs
  • Each needs to be mapped into the system not as
    easy as web search but the benefits are
    significant vs. query engines
  • A few parenthetical notes
  • Part of a slew of works on wrappers, source
    profiling, etc.
  • The creation of mappings can be partly automated
    systems such as LSD, Cupid, Clio, do this
  • Today most people look at integrating large
    enterprises (thats where the is!) Nimble,
    BEA, IBM

15
TSIMMIS
  • The Stanford-IBM Manager of Multiple Information
    Sources or, a Yiddish stew
  • An instance of a global-as-view mediation
    system
  • One of the first systems to support
    semi-structured data, which predated XML by
    several years

16
Semi-structured Data OEM
  • Observation given a particular schema, its
    attributes may be unavailable from certain
    sources inherent irregularity
  • Proposal Object Exchange Model, OEM
  • OID ltlabel, type, valuegt
  • 1 show 2 id 15 , 3 title Catwoman
    , 4 year 2004 , 5 lang English ,
    6 genre fantasy , 7 criticsrating 8
    stars 0.5 , 9 source Bob

17
Queries in TSIMMIS
  • Specified in OQL-style language called Lorel
  • OQL was an object-oriented query language
  • Lorel is, in many ways, a predecessor to XQuery
  • Based on path expressions over OEM structures
  • select showwhere show.title Star Wars and
    show.genre sci-fi
  • This is basically like XQuery, which well use in
    place of Lorel and the MSL template language.
    Previous query restated
  • for s in AllData()/showwhere s/title/text()
    Star Wars and s/genre/text()
    sci-fireturn s

18
Query Answering in TSIMMIS
  • Basically, its view unfolding, i.e., composing a
    query with a view
  • The query is the one being asked
  • The views are the MSL templates for the wrappers
  • Some of the views may actually require
    parameters, e.g., an author name, before theyll
    return answers
  • Common for web forms (see Amazon, Google, )
  • XQuery functions (XQuerys version of views)
    support parameters as well, so well see these in
    action

19
A Wrapper Definition in MSL
  • Wrappers have templates and binding patterns (T)
    in MSL
  • S - S ltshow ltgenre Ggtgt // select
    from movie where title T //
  • This reformats a SQL query over Movie(title,
    year, genre)
  • In XQuery, this might look like
  • define function GetShow(t AS xsdstring) as show
    for s in sql(Amazon.DB, select
    from movie where title t )return
    ltshowgtlttitlegttlt/titlegt s/year,
    s/genrelt/showgt


movie
genre
year



The union of GetShows results is unioned with
others to form the view AllData()
20
How to Answer the Query
  • Given our query
  • for s in AllData()/showwhere s/title/text()
    Star Wars and s/genre/text()
    sci-fireturn s
  • Find all wrapper definitions that
  • Contain output enough structure to match the
    conditions of the query
  • Or have already tested the conditions for us!

21
Query Composition with Views
  • We find all views that define book with author
    and title, and we compose the query with each
  • define function GetBook(x AS xsdstring) as book
    for b in sql(Amazon.DB, select
    from book where author x )return
    ltbookgt b/title ltauthorgtxlt/authorgtlt/bookgt
  • for b in AllData()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b

book
author
title


22
Example on Board
23
Virtues of TSIMMIS
  • Early adopter of semistructured data, greatly
    predating XML
  • Can support data from many different kinds of
    sources
  • Obviously, doesnt fully solve heterogeneity
    problem
  • Presents a mediated schema that is the union of
    multiple views
  • Query answering based on view unfolding
  • Easily composed in a hierarchy of mediators

24
Limitations of TSIMMIS Approach
  • Some data sources may contain data with certain
    ranges or properties
  • Books by Aho, Students at UPenn,
  • If we ask a query for students at Columbia, dont
    want to bother querying students at Penn
  • How do we express these?
  • Mediated schema is basically the union of the
    various MSL templates as they change, so may
    the mediated schema

25
An Alternate ApproachThe Information Manifold
(Levy et al.)
  • When you integrate something, you have some
    conceptual model of the integrated domain
  • Define that as a basic frame of reference,
    everything else as a view over it
  • Local as View using mappings that are
    conjunctive queries
  • May have overlapping/incomplete sources
  • Define each source as the subset of a query over
    the mediated schema the open world assumption
  • We can use selection or join predicates to
    specify that a source contains a range of values
  • ComputerBooks() ? Books(Title, , Subj), Subj
    Computers

26
The Local-as-View Model
  • The basic model is the following
  • Local sources are views over the mediated
    schema
  • Sources have the data mediated schema is
    virtual
  • Sources may not have all the data from the domain
    open-world assumption
  • The system must use the sources (views) to answer
    queries over the mediated schema

27
Answering Queries Using Views
  • Assumption conjunctive queries, set semantics
  • Suppose we have a mediated schema show(ID,
    title, year, genre), rating(ID, stars, source)
  • A conjunctive query might be q(t) - show(i,
    t, y, g), rating(i, 5, s), y 1997
  • Recall intuitions about this class of queries
  • Adding a conjunct to a query removes answers from
    the result but never adds any
  • Any conjunctive query with at least the same
    constraints conjuncts will give valid answers

28
Query Answering
  • Suppose we have the query
  • q(t) - show(i, t, y, g), rating(i, 5, s), y
    1997
  • and sources
  • 5star(i) ? show(i, t, y, g), rating(i, 5, s)
  • TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
    TVGuide)
  • movieInfo(i,t,y,g) ? show(i, t, y, g)
  • critics(i,r, s) ? rating(i, r, s)
  • goodMovies(t,y) ? show(i, t, y, drama),
    rating(i, 5, s), y 1997
  • We want to compose the query with the source
    mappings but theyre in the wrong direction!

29
Inverse Rules
  • We can take every mapping and invert it, though
    sometimes we may have insufficient information
  • If
  • 5star(i) ? show(i, t, y, g), rating(i, 5, s)
  • then we can also infer that
  • show(i,??? ,??? ,??? ,???) ? 5star(i)
  • But how to handle the absence of the missing
    attributes?
  • We know that there must be AT LEAST one instance
    of ??? for each attribute for each show ID
  • So we might simply insert a NULL and define that
    NULL means unknown (as opposed to missing)

30
But NULLs Lose Information
  • Suppose we take these rules and ask for
  • q(t) - show(i, t, y, g), rating(i, 5, s), y
    1997
  • If we look at the rule
  • goodMovies(t,y) ? show(i, t, y, drama),
    rating(i, 5, s), y 1997
  • By inspection, q(t) ? goodMovies(t,y)
  • But if apply our inversion procedure, we get
  • show(i, t, y, g) ? goodMovies(t,y), i NULL, g
    drama
  • rating(i, r, s) ? goodMovies(t,y), i NULL, r
    5, s NULL
  • We need a special NULL so we can figure out
    which IDs and ratings match up

31
The Solution Skolem Functions
  • Skolem functions
  • Conceptual perfect hash functions
  • Each function returns a unique, deterministic
    value for each combination of input values
  • Every function returns a non-overlapping set of
    values (Skolem function F will never return a
    value that matches any of Skolem function Gs
    values)
  • Skolem functions wont ever be part of the answer
    set or the computation it doesnt produce real
    values
  • Theyre just a way of logically generating
    special NULLs

32
Query Answering Using Inverse Rules
  • Invert all rules using the procedures described
  • Take the query and the possible rule expansions
    and execute them in a Datalog interpreter
  • In the previous query, we expand with all
    combinations of expansions of book and of author
    every possible way of combining and
    cross-correlating info from different sources
  • Then we throw away all unsatisfiable rewritings
    (some expansions will be logically inconsistent)
  • More efficient, but equivalent, algorithms now
    exist
  • Bucket algorithm Levy et al., which we discuss
    next
  • MiniCon Pottinger Halevy (next time)
  • Also related chase and backchase Popa,
    Tannen, Deutsch

33
The Bucket Algorithm
  • Given a query Q with relations and predicates
  • Create a bucket for each subgoal in Q
  • Iterate over each view (source mapping)
  • If source includes buckets subgoal
  • Create mapping between qs vars and the views
    var at the same position
  • If satisfiable with substitutions, add to bucket
  • Do cross-product of buckets, see if result is
    contained (exptime, but queries are probably
    relatively small)
  • For each result, do a containment check to make
    sure the rewriting is contained within the query

34
Lets Try a Bucket Example
  • Query
  • q(t) - show(i, t, y, g), rating(i, 5, s), y
    1997
  • Sources
  • 5star(i) ? show(i, t, y, g), rating(i, 5, s)
  • TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
    TVGuide)
  • movieInfo(i,t,y,g) ? show(i, t, y, g)
  • critics(i,r, s) ? rating(i, r, s)
  • goodMovies(t,y) ? show(i, t, y, drama),
    rating(i, 5, s), y 1997
  • good98(t,y) ? show(i, t, y, drama), rating(i,
    5, s), y 1998

35
Example of Containment Testing
  • Suppose we have two queriesq1(S,C) -
    Student(S, N), Takes(S, C), Course(C, X),
    inCSE(C), Course(C, DB Info
    Systems)q2(S,C) - Student(S, N), Takes(S, C),
    Course(C, X)
  • Intuitively, q1 must contain the same or fewer
    answers vs. q2
  • It has all of the same conditions, except one
    extra conjunction (i.e., its more restricted)
  • Theres no union or any other way it can add more
    data
  • We can say that q2 contains q1 because this holds
    for any instance of our DB Student, Takes,
    Course

36
Checking Containment via Canonical Databases
  • To test for q1 µ q2
  • Create a canonical DB that contains a tuple for
    each subgoal in q1
  • Execute q2 over it
  • If q2 returns a tuple that matches the head of
    q1, then q1 µ q2
  • (This is an NP-complete algorithm in the size of
    the query. Testing for full first-order logic
    queries is undecidable!!!)
  • Lets see this for our example

37
Example Canonical DB
  • q1(S,C) - Student(S, N), Takes(S, C), Course(C,
    X), inCSE(C), Course(C, DB Info Systems)
  • q2(S,C) - Student(S, N), Takes(S, C), Course(C,
    X)

Student
Takes
Course
inCSE
C X
C DB Info Systems
S
S N
S C
Need to get tuple ltS,Cgt in executing q2 over this
database
38
Next Time
  • Well look at the state-of-the-art in query
    reformulation, the MiniCon algorithm
  • Eliminates the need for the containment check
  • Eliminates many cross-product comparisons
  • This and the ChaseBackchase strategy of Tannen
    et al are the two methods most used in virtual
    data integration today
  • Please read the MiniCon paper (Pottinger Halevy)
Write a Comment
User Comments (0)
About PowerShow.com