Transcript and Presenter's Notes

Title: Information


1
Information / Data Integration
  • Combining information from multiple autonomous
    sources

2
The end-game: 3 options
  • Have an in-class final exam
    • 5/8 2:30pm is the designated time
  • Have a take-home exam
  • Make the final homework into a take-home
    • ..and have a mandatory discussion class on 5/8
      2:30pm

Also, note the change in demo schedule
3
Today's Agenda
  • Discuss Semtag/Seeker
  • Lecture: start on Information Integration

4
Information Integration
  • Combining information from multiple autonomous
    information sources
  • And answering queries using the combined
    information
  • Many variations, depending on:
    • The type of information sources (text? data?
      combination?)
      • Data vs. Information integration
    • Horizontal vs. Vertical integration
    • The level of eagerness of the integration
      • Ad hoc vs. Pre-mediated integration
      • Pre-mediation itself can be Warehouse vs. online
        approaches
    • Generality of the solution
      • Mashup vs. Mediated

5
Information Integration as making the database
repository of the web..
The IR View
6
The DB View: Integration as Mediation
over Autonomous databases
7
Who is dying to have it? (Applications)
Skeptic's corner
  • WWW:
    • Comparison shopping
    • Portals integrating data from multiple sources
    • B2B, electronic marketplaces
  • Science and culture:
    • Medical genetics: integrating genomic data
    • Astrophysics: monitoring events in the sky
    • Culture: uniform access to all cultural databases
      produced by countries in Europe, provinces in
      Canada
  • Enterprise data integration:
    • An average company has 49 different databases and
      spends 35% of its IT dollars on integration
      efforts

8
Isn't the web mostly text?
Skeptic's corner
  • The invisible web is mostly structured
    • Most web servers have back-end database servers
    • They dynamically convert ("wrap") the structured
      data into readable English
      • <India, New Delhi> => "The capital of India is
        New Delhi."
    • So, if we can unwrap the text, we have
      structured data!
      • (un)wrappers, learning wrappers etc. (see the
        sketch below)
    • Note also that such dynamic pages cannot be
      crawled...
  • The (coming) semi-structured web
    • Most pages are at least semi-structured
    • The XML standard is expected to ease the
      presentation/on-the-wire transfer of such pages.
      (BUT..)
  • The services
    • Travel services, mapping services
  • The sensors
    • Stock quotes, current temperatures, ticket prices
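
To make the unwrapping idea concrete, here is a minimal sketch (my
own illustration, not from the deck) that inverts the englishified
sentence above with a regular expression; real wrapper induction
would learn such patterns from examples, and the pattern name here
is hypothetical.

  import re

  # Hypothetical minimal "unwrapper": recovers <country, capital>
  # tuples from the English sentences a back-end database was
  # wrapped into.
  CAPITAL_SENTENCE = re.compile(
      r"The capital of (?P<country>[\w ]+?) is (?P<capital>[\w ]+?)\.")

  def unwrap(text):
      """Return the (country, capital) tuples found in the text."""
      return [(m["country"], m["capital"])
              for m in CAPITAL_SENTENCE.finditer(text)]

  print(unwrap("The capital of India is New Delhi."))
  # -> [('India', 'New Delhi')]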

9
Is it like Expedia/Travelocity/Orbitz?
Skeptic's corner
  • Surprisingly, NO!
  • The online travel sites don't quite need to do
    data integration: they just use SABRE
    • SABRE was started in the '60s as a joint project
      between American Airlines and IBM
    • It is the de facto database for most of the
      airline industry (who voluntarily enter their
      data into it)
    • There are very few airlines that buck the SABRE
      trend; Southwest Airlines is one (which is why
      many online sites don't bother with Southwest)
  • So, online travel sites really are talking to a
    single database (not multiple data sources)
  • To be sure, online travel sources do have to
    solve a hard problem. Finding an optimal fare
    (even at a given time) is basically
    computationally intractable (not to mention the
    issues brought in by fluctuating fares). So don't
    be so hard on yourself
  • Check out http://www.maa.org/devlin/devlin_09_02.html

10
Why isn't this just web search?
Skeptic's corner
  • Search engines do text-based retrieval of URLs
  • Works reasonably well for single-document texts,
    or for finding sites based on single-document
    text
  • Cannot integrate information from multiple
    documents
  • Cannot do effective query relaxation or
    generalization
  • Cannot link documents and databases
  • The aim of Information Integration is to support
    query processing over structured and
    semi-structured sources as well as services.

11
Are we talking comparison shopping agents?
  • Certainly closer to the aims of these
  • But:
    • Wider focus
      • Consider a larger range of databases
      • Consider services
    • Implies more challenges
      • Warehousing may not work
      • Manual source characterization/integration won't
        scale up

12
(No Transcript)
13
4/26
  • Information Integration 2
  • Focus on Data Integration

14
[Diagram] Information Integration spans Text
Integration and Data Integration: Collection
Selection, Soft-Joins, Data aggregation (vertical
integration), Data Linking (horizontal integration)
15
Different Integration scenarios
  • Collection Selection
    • All sources export text documents
    • E.g. meta-crawler etc.
    • Challenges: similarity definition; relevance
      handling
  • Data Aggregation (Vertical)
    • All sources export (parts of a) single relation
    • No need for joins etc.
    • Could be warehouse or virtual
    • E.g. BibFinder, Junglee, Employeds etc.
    • Challenges: schema mapping; data overlap
  • Data Linking (Horizontal)
    • Joins over multiple relations stored in multiple
      DBs
    • E.g. soft-joins in WHIRL
    • Ted Kennedy episode
    • Challenges: record linkage over text fields
      (object mapping); query reformulation
  • All together (vertical + horizontal)
    • Many interesting research issues
    • ..but few actual fielded systems

16
Collection Selection
17
Collection Selection / Meta-Search: Introduction
  • Metasearch Engine
    • A system that provides unified access to multiple
      existing search engines
  • Metasearch Engine Components
    • Database Selector: identifying potentially useful
      databases for each user query
    • Document Selector: identifying potentially useful
      documents returned from the selected databases
    • Query Dispatcher and Result Merger: ranking the
      selected documents

18
Collection Selection
19
Evaluating collection selection
  • Let c1..cj be the collections that are chosen to
    be accessed for the query Q. Let d1..dk be the top
    documents returned from these collections.
  • We compare these results to the results that
    would have been returned from a central union
    database
    • Ground Truth: the ranking of documents that the
      retrieval technique (say vector space or Jaccard
      similarity) would have retrieved from a central
      union database that is the union of all the
      collections
  • Compare the precision of the documents returned by
    accessing only the selected collections (a sketch
    follows)
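
A minimal sketch of that comparison (my own illustration; the slide
defines it only in prose): precision of the top-k documents obtained
from the selected collections, measured against the top-k the
central union database would have returned.

  def precision_at_k(returned_docs, union_ranking, k=10):
      """Fraction of the returned top-k that the central union DB
      also ranks in its own top-k (the ground truth)."""
      truth = set(union_ranking[:k])
      top = returned_docs[:k]
      return sum(1 for d in top if d in truth) / max(len(top), 1)

  print(precision_at_k(["d3", "d1", "d9"], ["d1", "d2", "d3"], k=3))
  # -> 0.666...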

20
Existing Approaches in Database Selection
  • Rough Representative Approaches
    • Use a few keywords or sentences to describe the
      contents of the database
    • Often manually generated
  • Statistical Representative Approaches
    • Keep statistical information for each term in the
      databases (such as the document frequency and
      average weight of the term)
    • Scalability issues due to the large amount of
      statistics for each DB
  • Learning-based Approaches
    • Use past retrieval experience to determine the
      usefulness of a database for future queries
    • Training queries / real user queries
    • The experience obtained against a database is
      saved as the representative of the database

21
General Scheme & Challenges
  • Get a representative of each of the databases
    • Representative is a sample of files from the
      database
    • Challenge: get an unbiased sample when you can
      only access the database through queries
  • Compare the query to the representatives to judge
    the relevance of a database
    • Coarse approach: convert the representative files
      into a single file (super-document). Take the
      (vector) similarity between the query and the
      super-document of a database to judge that
      database's relevance (see the sketch below)
    • Finer approach: keep the representative as a
      mini-database. Union the mini-databases to get a
      central mini-database. Apply the query to the
      central mini-database and find the top-k answers
      from it. Decide the relevance of each database
      based on which of the answers came from which
      database's representative
    • You can use an estimate of the size of the
      database too
  • What about overlap between collections? (See the
    ROSCO paper)
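
A sketch of the coarse approach (my own rendering, with assumed
details such as raw term-count vectors): concatenate each database's
sampled files into a super-document and rank databases by
query/super-document cosine similarity.

  import math
  from collections import Counter

  def cosine(a, b):
      dot = sum(v * b.get(t, 0) for t, v in a.items())
      na = math.sqrt(sum(v * v for v in a.values()))
      nb = math.sqrt(sum(v * v for v in b.values()))
      return dot / (na * nb) if na and nb else 0.0

  def rank_databases(query, representatives):
      """representatives: db name -> list of sampled files (strings)."""
      q = Counter(query.lower().split())
      supers = {db: Counter(" ".join(docs).lower().split())
                for db, docs in representatives.items()}  # super-documents
      return sorted(supers, key=lambda db: cosine(q, supers[db]),
                    reverse=True)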

22
Uniform Probing for Content Summary Construction
  • Automatic extraction of document frequency
    statistics from uncooperative databases
    [Callan and Connell, TOIS 2001; Callan et al.,
    SIGMOD 1999]
  • Main Ideas
    • Pick a word and send it as a query to database D
      • RandomSampling-OtherResource (RS-Ord): pick the
        word from a dictionary
      • RandomSampling-LearnedResource (RS-Lrd): pick the
        word from the retrieved documents
    • Retrieve the top-k documents returned
    • If the number of retrieved documents exceeds a
      threshold T, stop; otherwise restart from the
      beginning
    • k=4, T=300
    • Compute the sample document frequency for each
      word that appeared in a retrieved document
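
A sketch of that probing loop (my own rendering; search(db, word, k)
is an assumed interface returning up to k documents, as strings, for
a one-word query):

  import random

  def sample_content_summary(db, dictionary, search, k=4, T=300,
                             learned=True):
      """Query-based sampling. RS-Ord probes with dictionary words;
      RS-Lrd (learned=True) probes with words from documents
      retrieved so far."""
      sample, probe_pool = [], list(dictionary)
      while len(sample) < T:
          word = random.choice(probe_pool)
          docs = search(db, word, k)   # top-k documents for the probe
          sample.extend(docs)
          if learned and docs:         # refill pool from retrieved text
              probe_pool = sorted({w for d in docs
                                   for w in d.split()}) or probe_pool
      df = {}                          # sample document frequency
      for doc in sample:
          for w in set(doc.split()):
              df[w] = df.get(w, 0) + 1
      return df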

23
CORI Net Approach (Representative as a
super-document)
  • Representative statistics
    • The document frequency for each term and each
      database
    • The database frequency for each term
  • Main Ideas
    • Visualize the representative of a database as a
      super-document, and the set of all representatives
      as a database of super-documents
    • Document frequency becomes term frequency in the
      super-document, and database frequency becomes
      document frequency in the super-database
    • Ranking scores can be computed using a similarity
      function such as the Cosine function

24
ReDDE Approach (Representative as a
mini-collection)
  • Use the representatives as mini-collections
  • Construct a union-representative that is the
    union of the mini-collections (such that each
    document keeps information on which collection it
    was sampled from)
  • Send the query first to the union-representative;
    get the top-k ranked results
  • See which of the results in the top-k came from
    which mini-collection. The collections are ranked
    in terms of how much their mini-collections
    contributed to the top-k answers of the query.
  • Scale the number of returned results by the
    expected size of the actual collection
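
A sketch of that ranking step (my own rendering;
retrieve(union_sample, query, k) is an assumed interface returning
the top-k sampled documents tagged with the collection each was
sampled from, and est_sizes/sample_sizes hold the estimated
collection sizes and the sample sizes):

  from collections import Counter

  def rank_collections(union_sample, query, retrieve,
                       est_sizes, sample_sizes, k=50):
      top = retrieve(union_sample, query, k)  # [(doc, source), ...]
      votes = Counter(src for _doc, src in top)
      # scale each collection's vote by how much bigger the real
      # collection is than the mini-collection sampled from it
      scale = {c: est_sizes[c] / sample_sizes[c] for c in est_sizes}
      return sorted(votes, key=lambda c: votes[c] * scale.get(c, 1.0),
                    reverse=True)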

25
Data Integration
26
Models for Integration
  • Modified from Alon Halevy's slides

27
Solutions for small-scale integration
  • Mostly ad-hoc programming: create a special
    solution for every case; pay consultants a lot of
    money.
  • Data warehousing: load all the data periodically
    into a warehouse.
    • 6-18 months lead time
    • Separates operational DBMS from decision-support
      DBMS (not only a solution to data integration).
    • Performance is good; data may not be fresh.
    • Need to clean, scrub your data.

Junglee did this, for employment classifieds
28
Data Warehouse Architecture
[Diagram] Data sources feed data extraction
programs, then data cleaning/scrubbing, into a
relational database (warehouse); user queries run
OLAP / decision support / data cubes / data mining
on top. (Junglee did this, for employment
classifieds.)
29
The Virtual Integration Architecture
  • Leave the data in the sources.
  • When a query comes in:
    • Determine the sources relevant to the query
    • Break down the query into sub-queries for the
      sources
    • Get the answers from the sources, and combine
      them appropriately
  • Data is fresh. Approach is scalable.
  • Issues:
    • Relating sources & mediator
    • Reformulating the query
    • Efficient planning & execution

Garlic (IBM), Hermes (UMD), Tsimmis,
InfoMaster (Stanford), DISCO (INRIA), Information
Manifold (AT&T), SIMS/Ariadne (USC),
Emerac/Havasu (ASU)
30
Virtual Integration Architecture
Sources can be relational, hierarchical (IMS),
structured files, web sites.
31
Research Projects
  • Garlic (IBM)
  • Information Manifold (AT&T)
  • Tsimmis, InfoMaster (Stanford)
  • The Internet Softbot/Razor/Tukwila (UW)
  • Hermes (Maryland)
  • DISCO (INRIA, France)
  • SIMS/Ariadne (USC/ISI)
  • Emerac/Havasu (ASU)
  • BibFinder (ASU)

32
Desiderata for Relating Source-Mediator Schemas
  • Expressive power: distinguish between sources
    with closely related data. Hence, be able to
    prune access to irrelevant sources.
  • Easy addition: make it easy to add new data
    sources.
  • Reformulation: be able to reformulate a user
    query into a query on the sources efficiently and
    effectively.
  • Nonlossy: be able to handle all queries that can
    be answered by directly accessing the sources.

Reformulation
33
11/9
  • Google Mashups: what type of integration are
    they?
  • Mturk.com: collaborative computing, the
    capitalist way

34
Why isn't this just databases / distributed
databases?
Skeptic's corner
  • No common schema
    • Sources with heterogeneous schemas (and
      ontologies)
    • Semi-structured sources
  • Legacy Sources
    • Not relational-complete
    • Variety of access/process limitations
  • Autonomous sources
    • No central administration
    • Uncontrolled source content overlap
  • Unpredictable run-time behavior
    • Makes query execution hard
  • Predominantly read-only
    • Could be a blessing: less worry about transaction
      management
    • (although the push now is to also support
      transactions on the web)

35
Reformulation Problem
  • Given:
    • A query Q posed over the mediated schema
    • Descriptions of the data sources
  • Find:
    • A query Q' over the data source relations, such
      that
      • Q' provides only correct answers to Q, and
      • Q' provides all possible answers to Q given the
        sources.

36
Approaches for relating source & mediator schemas
(differences minor for data aggregation)
  • Global-as-view (GAV): express the mediated-schema
    relations as a set of views over the data source
    relations
  • Local-as-view (LAV): express the source relations
    as views over the mediated schema
  • Can be combined?

Let's compare them in a movie database
integration scenario..
37
Example Scenario
  • A mediator for movie databases
  • Want to provide the information about movies and
    where they are playing

38
Global-as-View
  • Mediated schema:
    • Movie(title, dir, year, genre)
    • Schedule(cinema, title, time)
  • Create View Movie AS
      select * from S1          -- S1(title,dir,year,genre)
      union
      select * from S2          -- S2(title,dir,year,genre)
      union                     -- S3(title,dir), S4(title,year,genre)
      select S3.title, S3.dir, S4.year, S4.genre
      from S3, S4
      where S3.title = S4.title

Express mediator-schema relations as views
over source relations
39
Global-as-View
  • Mediated schema:
    • Movie(title, dir, year, genre)
    • Schedule(cinema, title, time)
  • Create View Movie AS
      select * from S1          -- S1(title,dir,year,genre)
      union
      select * from S2          -- S2(title,dir,year,genre)
      union                     -- S3(title,dir), S4(title,year,genre)
      select S3.title, S3.dir, S4.year, S4.genre
      from S3, S4
      where S3.title = S4.title

Express mediator-schema relations as views
over source relations
Mediator-schema relations are virtual views on
source relations
40
Global-as-View Example 2
  • Mediated schema:
    • Movie(title, dir, year, genre)
    • Schedule(cinema, title, time)
  • Create View Movie AS
      select title, dir, year, NULL    -- S1(title,dir,year)
      from S1
      union
      select title, dir, NULL, genre   -- S2(title,dir,genre)
      from S2

Express mediator-schema relations as views
over source relations
Null values
41
Global-as-View Example 2
  • Mediated schema:
    • Movie(title, dir, year, genre)
    • Schedule(cinema, title, time)
  • Source S4: S4(cinema, genre)

Express mediator-schema relations as views
over source relations

Create View Movie AS
  select NULL, NULL, NULL, genre from S4
Create View Schedule AS
  select cinema, NULL, NULL from S4

But what if we want to find which cinemas are
playing comedies?
Lossy Mediation
42
Local-as-View example 1
  • Mediated schema
  • Movie(title, dir, year, genre),
  • Schedule(cinema, title, time).

Express source schema relations as views
over mediator relations
43
Mediated schema: Movie(title, dir, year,
genre), Schedule(cinema, title, time).
Express source-schema relations as views
over mediator relations:
S1(title,dir,year,genre)
S3(title,dir)
S5(title,dir,year), year > 1960
S4(cinema,genre)
44
Local-as-View Example 2
  • Mediated schema:
    • Movie(title, dir, year, genre)
    • Schedule(cinema, title, time)
  • Source S4: S4(cinema, genre)

Express source-schema relations as views
over mediator relations

Create Source S4 AS
  select cinema, genre
  from Movie m, Schedule s
  where m.title = s.title

Now if we want to find which cinemas are playing
comedies, there is hope!
45
GAV vs. LAV
  • Mediated schema:
    • Movie(title, dir, year, genre)
    • Schedule(cinema, title, time)

Source S4: S4(cinema, genre)
Lossy mediation
46
GAV vs. LAV
GAV:
  • Not modular
    • Addition of new sources changes the mediated
      schema
  • Can be awkward to write the mediated schema
    without loss of information
  • Query reformulation easy
    • Reduces to view unfolding (polynomial)
  • Can build hierarchies of mediated schemas
  • Best when:
    • Few, stable data sources
    • Well-known to the mediator (e.g. corporate
      integration)
  • Garlic, TSIMMIS, HERMES
LAV:
  • Modular: adding new sources is easy
  • Very flexible: power of the entire query language
    available to describe sources
  • Reformulation is hard
    • Involves answering queries using only views (can
      be intractable; see below)
  • Best when:
    • Many, relatively unknown data sources
    • Possibility of addition/deletion of sources
  • Information Manifold, InfoMaster, Emerac, Havasu

47
Reformulation in LAV
  • Given a set of views V1,...,Vn, and a query Q, can
    we answer Q using only the answers to V1,...,Vn?
  • Notice the use of materialized (pre-computed)
    views
    • ONLY materialized views should be used
  • Approaches:
    • Bucket algorithm [Levy, 96]
    • Inverse rules algorithm [Duschka, 99]
    • Hybrid versions
      • SV-Bucket (2001), MiniCon (2001)

48
Reformulation in LAV: The issues
Query: Find all the years in which Zhang Yimou
released movies.
  select year from Movie M where M.dir = 'yimou'
  Q(Y) :- Movie(T,D,Y,G), D='yimou'
(not executable: Movie is a virtual mediated
relation)
Which is the better plan? What are we looking
for? Equivalence? Containment? Maximal
containment? Smallest plan?
  Q(Y) :- S1(T,D,Y,G), D='yimou'             (1)

  Q(Y) :- S1(T,D,Y,G), D='yimou'
  Q(Y) :- S5(T,D,Y), D='yimou'               (2)
49
Maximal Containment
  • A query plan should be sound and complete
    • Sound implies that the plan should be contained
      in the query (i.e., tuples satisfying the plan
      are a subset of the tuples satisfying the query)
    • Completeness?
      • Traditional DBs aim for equivalence: the query
        contains the plan & the plan contains the query
      • Impossible here, given the sources
      • We want all query tuples that can be extracted
        given the sources we have
        • Maximal containment (no other sound query plan
          strictly contains this plan)

P contains Q if Q ⊆ P (checking containment is
exponential even for conjunctive (SPJ) query
plans)
50
The world of Containment
  • Consider Q1(..) :- B1(..) and Q2(..) :- B2(..)
    • Q1 ⊆ Q2 ("contained in") if the answer to Q1 is
      a subset of the answer to Q2
    • Basically holds if B1(x) implies B2(x)
  • Given a query Q and a query plan Q1:
    • Q1 is a sound query plan if Q1 is contained in Q
    • Q1 is a complete query plan if Q is contained in
      Q1
    • Q1 is a maximally contained query plan if there
      is no sound query plan Q2 for Q such that Q1 is
      strictly contained in Q2

51
Computing Containment Checks
  • Consider Q1(..) :- B1(..) and Q2(..) :- B2(..)
    • Q1 ⊆ Q2 ("contained in") if the answer to Q1 is
      a subset of the answer to Q2
    • Basically holds if B1(x) implies B2(x)
    • (but entailment is undecidable in general; boo
      hoo)
  • Containment can be checked through containment
    mappings, if the queries are conjunctive
    (select/project/join queries, without
    constraints). ONLY EXPONENTIAL!!! (aren't you
    relieved?)
  • m is a containment mapping from Vars(Q2) to
    Vars(Q1) if
    • m maps every subgoal in the body of Q2 to a
      subgoal in the body of Q1
    • m maps the head of Q2 to the head of Q1
  • E.g. Q1(x,y) :- R(x), S(y), T(x,y)
        Q2(u,v) :- R(u), S(v)
    • Containment mapping: u -> x, v -> y (so Q1 ⊆ Q2)
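
A sketch of that test (my own illustration): enumerate ways of
mapping each Q2 subgoal onto a same-predicate Q1 subgoal, then check
that some choice is consistent on variables and maps head to head.
Queries are encoded as (head_vars, [(predicate, args), ...]).

  from itertools import product

  def is_contained_in(q1, q2):
      """True iff Q1 is contained in Q2, i.e. a containment mapping
      exists from Vars(Q2) to Vars(Q1)."""
      head1, body1 = q1
      head2, body2 = q2
      options = [[g1 for g1 in body1 if g1[0] == g2[0]]
                 for g2 in body2]
      for choice in product(*options):
          m, ok = {}, True
          for (_, args2), (_, args1) in zip(body2, choice):
              for v2, v1 in zip(args2, args1):
                  if m.setdefault(v2, v1) != v1:
                      ok = False
          if ok and [m.get(v) for v in head2] == list(head1):
              return True
      return False

  # The slide's example: Q1(x,y) :- R(x),S(y),T(x,y);
  #                      Q2(u,v) :- R(u),S(v)
  q1 = (("x", "y"), [("R", ("x",)), ("S", ("y",)), ("T", ("x", "y"))])
  q2 = (("u", "v"), [("R", ("u",)), ("S", ("v",))])
  print(is_contained_in(q1, q2))   # True, via the mapping u->x, v->y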

52
Reformulation Algorithms
Bucket Algorithm [Levy]:
  Q(..) :- V1(..), V2(..)
  Bucket for V1: S11, S12, S00
  Bucket for V2: S21, S22, S00
  • Cartesian product of buckets
  • Followed by a containment check
Inverse Rules [Duschka]:
  V1(..) :- S11(..)   V1(..) :- S12(..)   V1(..) :- S00(..)
  V2(..) :- S21(..)   V2(..) :- S22(..)   V2(..) :- S00(..)
  • Plan fragments for mediator relations

P1 contains P2 if P2 ⊆ P1
53
Q(C,G) :- Movie(T,D,Y,G), Schedule(C,T,Ti)

Inverse rules (sources inverted into definitions
of the mediator relations Movie(T,D,Y,G) and
Sched(C,T,Ti)):
  Movie(T,D,Y,G)     :- S1(T,D,Y,G)
  Movie(T,D,?y,?g)   :- S3(T,D)
  Movie(T,D,Y,?g)    :- S5(T,D,Y), Y>1960
  Movie(?t,?d,?y,G)  :- S4(C,G)
  Schedule(C,?t,?ti) :- S4(C,G)
Skolemize the existential variables: e.g. ?y in
the S5 rule becomes f_S5(T,D,Y); the two S4 rules
share the skolem term f_S4(G,C), so the Movie and
Schedule tuples they generate join up.
54
Take-home vs. In-class (The Tastes Great / Less
Filling debate of CSE 494)
Take-home:
  • More time consuming
    • To set, to take and to grade
    • Caveat: people tend to over-estimate the time
      taken to do the take-home, since they don't
      factor out the preparation time that is
      interleaved with the exam time
  • ..but may be more realistic
  • Probably will be given on Th 6th and will be due
    by 11th evening (+/-)
In-class:
  • Less time-consuming
    • To take (sort of like removing a bandage..)
    • ..and definitely to grade
  • Scheduled to be on Tuesday, May 11th, 2:40-4:30

55
Interactive Review on 5/4
  • A large part (probably most?) of the class on 5/4
    will be devoted to an interactive semester review
  • Everyone should come prepared with at least 3
    topics/ideas that they got out of the class
  • You will be called in random order, and you should
    have enough things to not repeat what others
    before you said.
  • What you say will then be immortalized on the
    class homepage
    • See the old "acquired wisdom" page on the class
      page.

56
Bucket Algorithm: Populating buckets
  • For each subgoal in the query, place relevant
    views in the subgoal's bucket
  • Inputs:
    • Q(x) :- r1(x,y), r2(y,x)
    • V1(a) :- r1(a,b)
    • V2(d) :- r2(c,d)
    • V3(f) :- r1(f,g), r2(g,f)

Buckets:
  r1(x,y): V1(x), V3(x)
  r2(y,x): V2(x), V3(x)
57
Combining Buckets
  • For every combination in the Cartesian product of
    the buckets, check containment in the query

Q(x) :- r1(x,y), r2(y,x)
  • Candidate rewritings:
    • Q1(x) :- V1(x), V2(x)   [fails containment check]
    • Q2(x) :- V1(x), V3(x)   [fails containment check]
    • Q3(x) :- V3(x), V2(x)   [fails containment check]
    • Q4(x) :- V3(x), V3(x)   [contained: kept]

The Bucket Algorithm will check all possible
combinations.
Buckets:
  r1(x,y): V1(x), V3(x)
  r2(y,x): V2(x), V3(x)
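
A self-contained sketch of both steps on this very example (my own
code, under simplifying assumptions: unary view heads and 'x' as the
query's distinguished variable). The containment check that prunes
Q1-Q3 and keeps Q4 would reuse the containment-mapping test sketched
earlier.

  from itertools import product

  QUERY = [("r1", ("x", "y")),
           ("r2", ("y", "x"))]      # Q(x) :- r1(x,y), r2(y,x)
  VIEWS = {"V1": ("a", [("r1", ("a", "b"))]),
           "V2": ("d", [("r2", ("c", "d"))]),
           "V3": ("f", [("r1", ("f", "g")), ("r2", ("g", "f"))])}

  def bucket(subgoal):
      """Views usable for a subgoal: same predicate, and the query's
      head variable x maps to the view's exported (head) variable."""
      pred, args = subgoal
      x_pos = args.index("x")
      return [name for name, (head, body) in VIEWS.items()
              if any(bp == pred and bargs[x_pos] == head
                     for bp, bargs in body)]

  buckets = [bucket(g) for g in QUERY]
  print(buckets)                    # [['V1', 'V3'], ['V2', 'V3']]
  for combo in product(*buckets):   # candidates, pending containment
      print("Q(x) :- " + ", ".join(v + "(x)" for v in combo))
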
58
Complexity of finding maximally contained plans
in LAV
  • Complexity does change if the sources are not
    conjunctive queries:
    • Sources as unions of conjunctive queries
      (NP-hard)
    • Disjunctive descriptions
    • Sources as recursive queries (undecidable)
    • Comparison predicates
  • Complexity is less dependent on the query
    • Recursion is okay, but inequality constraints
      lead to NP-hardness
  • Complexity also changes based on the open- vs.
    closed-world assumption

Example ("Big Two" used-car source):
  Advertised description: Toyota ∪ Olds
  True source contents (of Big Two): all cars
You can reduce the complexity by taking a
conjunctive query that is an upper bound
  -> but this just pushes the complexity to the
     minimization phase
[Abiteboul & Duschka, 98]
59
Is XML standardization a magical solution for
Integration?
  • If all web sources standardize on the XML format:
    • Source access (wrapper generation) issues become
      easier to manage
    • BUT all other problems remain:
      • Still need to relate source (XML) schemas to the
        mediator (XML) schema
      • Still need to reason about source overlap, source
        access limitations etc.
      • Still need to manage execution in the presence of
        source/network uncertainties

60
Query Optimization Challenges
  • Deciding what to optimize
  • Getting the statistics on sources
  • Doing the optimization
We will first talk about reformulation-level
challenges.
61
Practical issues complicating Reformulation
  • Sources may have access limitations
    • Access restrictions can lead to recursive
      rewritings, even when the queries are
      non-recursive!
  • Sources may have overlap
    • Non-minimal rewritings may result when overlap
      information is ignored

62
Source Limitations
  • Sources are not really fully relational databases
    • Legacy systems
    • Limited access patterns
      • (Can't ask a white-pages source for the list of
        all numbers)
    • Limited local processing power
      • Typically only selections (on certain attributes)
        are supported
  • Access limitations are modeled in terms of the
    allowed ("feasible") binding patterns with which
    the source can be accessed
    • E.g. S(X,Y,Z) with feasible patterns "ffb" or
      "bbf" (b = argument must be bound, f = free)

63
Access Restrictions & Recursive Reformulations
  • Create Source S1 as
      select * from Cites
      given paper1
  • Create Source S2 as
      select paper from ASU-Papers
  • Create Source S3 as
      select paper from AwardPapers
      given paper
  • Query: select * from AwardPapers

Source descriptions (superscripts are binding
patterns):
  S1^bf(p1,p2) :- Cites(p1,p2)
  S2(p)        :- Asp(p)
  S3^b(p)      :- Awp(p)
Inverse rules:
  Q(p)         :- Awp(p)
  Awp(p)       :- S3^b(p)
  Asp(p)       :- S2(p)
  Cites(p1,p2) :- S1^bf(p1,p2)
Domain rules (to generate bindings):
  Dom(p)  :- S2(p)
  Dom(p2) :- Dom(p1), S1(p1,p2)
[Kwok & Weld, 96; Duschka & Levy, 97]
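
To see what the Dom rules buy, here is a sketch (my own
illustration) of evaluating them to a fixpoint: starting from the
freely accessible S2, the recursion accumulates every paper id that
can legally be fed to the bound argument of S1 or S3.

  def eval_dom(s2_papers, s1_cites):
      """s2_papers: ids from the unrestricted source S2;
      s1_cites(p): papers cited by p (S1's 'bf' access pattern)."""
      dom = set(s2_papers)              # Dom(p)  :- S2(p)
      frontier = set(dom)
      while frontier:                   # Dom(p2) :- Dom(p1), S1(p1,p2)
          frontier = {p2 for p1 in frontier
                      for p2 in s1_cites(p1)} - dom
          dom |= frontier
      return dom                        # feasible bindings for S3^b

  cites = {"p1": ["p2"], "p2": ["p3"], "p3": []}
  print(eval_dom({"p1"}, lambda p: cites.get(p, [])))
  # -> {'p1', 'p2', 'p3'}
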
64
Managing Source Overlap
  • Often, sources on the Internet have overlapping
    contents
  • The overlap is not centrally managed (unlike
    DDBMS: data replication etc.)
  • Reasoning about overlap is important for plan
    optimality
    • We cannot possibly call all potentially relevant
      sources!
  • Questions: How do we characterize, get, and
    exploit source overlap?
    • Qualitative approaches (LCW statements)
    • Quantitative approaches (coverage/overlap
      statistics)

65
Local Completeness Information
  • If sources are incomplete, we need to look at
    each one of them.
  • Often, sources are locally complete.
    • Movie(title, director, year): complete for years
      after 1960, or for American directors.
  • Question: given a set of local completeness
    statements, is a query plan Q' a complete answer
    to a query Q?

Problems:
1. Sources may not be interested in giving these!
   -> Need to learn; hard to learn!
2. Even if sources are willing to give them, there
   may not be any big enough LCWs. Saying "I
   definitely have the car with vehicle ID XXX" is
   useless.

[Diagram] Advertised description vs. true source
contents; guarantees (LCW & inter-source
comparisons)
66
Quantitative ways of modeling inter-source overlap
  • Coverage & Overlap statistics [Koller et al., 97]
    • "S1 has 80% of the movies made after 1960, while
      S2 has 60% of the movies"
    • "S1 has 98% of the movies stored in S2"
  • Computing cardinalities of unions given
    intersections

Who gives these statistics?
  • Third party
  • Probing
67
Extremes of Laziness in Data Integration
  • Fully query-time II (blue sky for now)
    • Get a query from the user on the mediator schema
    • Go discover relevant data sources
    • Figure out their schemas
    • Map the schemas onto the mediator schema
    • Reformulate the user query into data source
      queries
    • Optimize and execute the queries
    • Return the answers
  • Fully pre-fixed II
    • Decide on the only query you want to support
    • Write a (Java)script that supports the query by
      accessing specific (pre-determined) sources,
      piping results (through known APIs) to specific
      other sources
    • Examples include Google Maps mashups

(most interesting action is in between)
E.g. we may start with known sources and their
known schemas, do hand-mapping, and support
automated reformulation and optimization
68
(No Transcript)
69
  • User queries refer to the mediated schema.
  • Data is stored in the sources in a local schema.
  • Content descriptions provide the semantic
    mappings between the different schemas.
  • Mediator uses the descriptions to translate user
    queries into queries on the sources.

DWIM
70
Dimensions to Consider
  • How many sources are we accessing?
  • How autonomous are they?
  • Can we get meta-data about sources?
  • Is the data structured?
    • Discussion about soft joins: see next slide
  • Supporting just queries, or also updates?
  • Requirements: accuracy, completeness,
    performance, handling inconsistencies.
  • Closed-world assumption vs. open world?
    • See next slide

71
Soft Joins.. WHIRL [Cohen]
  • We can extend the notion of joins to "similarity
    joins", where similarity is measured in terms of
    vector similarity over the text attributes. The
    join tuples are then output in ranked form, with
    rank proportional to the similarity.
  • Neat idea, but it does have some implementation
    difficulties
    • Most tuples in the cross-product will have
      non-zero similarities. So we need query processing
      that will somehow produce just the highly ranked
      tuples
  • Other similarity/distance metrics may also be
    used
    • E.g. edit distance
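
A sketch of a WHIRL-style soft join (my own rendering, with assumed
details: TF-IDF weights and cosine similarity over the text join
attributes; a real implementation would avoid materializing the
whole cross-product):

  import math
  from collections import Counter

  def tfidf(texts):
      docs = [Counter(t.lower().split()) for t in texts]
      df = Counter(w for d in docs for w in d)
      n = len(docs)
      return [{w: tf * math.log(n / df[w]) for w, tf in d.items()}
              for d in docs]

  def cosine(a, b):
      dot = sum(v * b.get(w, 0.0) for w, v in a.items())
      na = math.sqrt(sum(v * v for v in a.values()))
      nb = math.sqrt(sum(v * v for v in b.values()))
      return dot / (na * nb) if na and nb else 0.0

  def soft_join(r1, r2, key1, key2, top_k=10):
      """Rank tuple pairs by similarity of their text attributes."""
      v1 = tfidf([t[key1] for t in r1])
      v2 = tfidf([t[key2] for t in r2])
      pairs = [(cosine(a, b), t1, t2)
               for a, t1 in zip(v1, r1) for b, t2 in zip(v2, r2)]
      return sorted(pairs, key=lambda p: -p[0])[:top_k]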

72
(No Transcript)
73
Local Completeness Information
  • If sources are incomplete, we need to look at
    each one of them.
  • Often, sources are locally complete.
    • Movie(title, director, year): complete for years
      after 1960, or for American directors.
  • Question: given a set of local completeness
    statements, is a query plan Q' a complete answer
    to a query Q?

Problems:
1. Sources may not be interested in giving these!
   -> Need to learn; hard to learn!
2. Even if sources are willing to give them, there
   may not be any big enough LCWs. Saying "I
   definitely have the car with vehicle ID XXX" is
   useless.

[Diagram] Advertised description vs. true source
contents; guarantees (LCW & inter-source
comparisons)
74
Overview
  • User queries refer to the mediated schema.
  • Data is stored in the sources in a local schema.
  • Content descriptions provide the semantic
    mappings between the different schemas.
  • Mediator uses the descriptions to translate user
    queries into queries on the sources.

Schema Template for the stored data
75
Source Descriptions
  • Contain all meta-information about the sources:
    • Logical source contents (books, new cars)
    • Source capabilities (can answer SQL queries)
    • Source completeness (has all books)
    • Physical properties of source and network
    • Statistics about the data (like in an RDBMS)
    • Source reliability
    • Mirror sources
    • Update frequency

76
Source Access
We just did this!
  • How do we get the tuples?
    • Many sources give unstructured output
      • Some are inherently unstructured, while others
        "englishify" their database-style output
    • Need to (un)wrap the output from the sources to
      get tuples
      • Wrapper building / information extraction
      • Can be done manually/semi-manually

77
Source Fusion/Query Planning
  • Accepts user query and generates a plan for
    accessing sources to answer the query
  • Needs to handle tradeoffs between cost and
    coverage
  • Needs to handle source access limitations
  • Needs to reason about the source
    quality/reputation

78
Monitoring/Execution
  • Takes the query plan and executes it on the
    sources
  • Needs to handle source latency
  • Needs to handle transient/short-term network
    outages
  • Needs to handle source access limitations
  • May need to re-schedule or re-plan

79
Integrator vs. DBMS
Reprise
  • No common schema
  • Sources with heterogeneous schemas
  • Semi-structured sources
  • Legacy Sources
  • Not relational-complete
  • Variety of access/process limitations
  • Autonomous sources
  • No central administration
  • Uncontrolled source content overlap
  • Lack of source statistics
  • Tradeoffs between query plan cost, coverage,
    quality etc.
  • Multi-objective cost models
  • Unpredictable run-time behavior
  • Makes query execution hard
  • Presence of services
  • Need to compose services

80
Deductive Databases
  • Relations viewed as predicates.
  • Interrelations between relations expressed as
    datalog rules
    • (Horn clauses, without function symbols)
  • Queries correspond to datalog programs
    • Conjunctive queries are datalog programs with a
      single non-recursive rule; they correspond to SPJ
      queries in SQL

  Emprelated(Name,Dname) :- Empdep(Name,Dname)
  Emprelated(Name,Dname) :- Empdep(Name,D1),
                            Emprelated(D1,Dname)

Empdep is an EDB (stored) predicate; Emprelated is
an IDB (derived) predicate.
81
BibFinder Case Study
  • See the bibfinder slides

82
What to Optimize
  • Traditional DB optimizers compare candidate plans
    purely in terms of the time they take to produce
    all answers to a query.
  • In integration scenarios, the optimization is
    multi-objective:
    • Total time of execution
    • Cost to first few tuples
      • Often, the users are happier with plans that give
        the first tuples faster
    • Coverage of the plan
      • Full coverage is no longer an iron-clad
        requirement
        • Too many relevant sources; uncontrolled overlap
          between the sources
        • Can't call them all!
    • (Robustness,
    • Access premiums)
83
Source Statistics Needed
  • The size of the source relation and attributes
  • The length and cardinality of the attributes
  • The cardinality of the source relation
  • The feasible access patterns for the source
  • The network bandwidth and latency between the
    source and the integration system
  • Coverage of the source S for a relation R, denoted
    by P(S|R)
  • Overlap between sources: P(S1,...,Sk | R)

(both measured with respect to the extension of R)
84
Getting the Statistics
  • Since the sources are autonomous, the mediator
    needs to actively gather the relevant statistics
  • Learning bandwidth and latency statistics
    • [Gruser et al., 2000] use neural networks to
      learn the response-time patterns of web sources
    • Can learn the variation of response times across
      the days of the week and across the hours of the
      day
  • Learning coverages and overlaps
    • [Nie et al., 2002] use itemset mining techniques
      to learn compact statistics about the spread of
      the mediator-schema relations across the
      accessible sources
    • Can trade quality of the statistics for reduced
      space consumption
85
Using LCW rules to minimize plans
  • Basic Idea:
    • If reformulation of Q leads to a union of
      conjunctive plans P1 ∪ P2 ∪ ... ∪ Pk,
    • then, if P1 is complete for Q (under the
      given LCW information), we can minimize the
      reformulation by pruning P2...Pk
    • i.e., P1 + LCW contains P1 ∪ P2 ∪ ... ∪ Pk
  • For recursive plans (obtained when the sources
    have access restrictions):
    • We are allowed to remove a rule r from a
      plan P if the complete version of r is already
      contained in P - r

[Duschka, AAAI-97]
Emerac [Lambrecht & Kambhampati, 99]
86
Example
(Minimization is harder in the presence of source
limitations)

Source descriptions and guarantees:
  S1(T,D,Y)    :- M(T,D,Y)             (advertised)
  S2(T,Th,C,H) :- Sh(T,Th,C,H)         (advertised)
  LCW: S2(T,Th,C,H) :- Sh(T,Th,C,H), C='Seattle'

Query:  Q(T,D) :- M(T,D,Y), Sh(T,Th,C,H),
                  C='Seattle'
Plan 1: Q(T,D) :- S1(T,D,Y), S2(T,Th,C,H),
                  C='Seattle'
Plan 2: Q(T,D) :- M(T,D,Y), S2(T,Th,C,H),
                  C='Seattle'

[Levy, 96; Duschka, 97; Lambrecht & Kambhampati,
99]
87
Example
(Minimization is harder in the presence of source
limitations)

Source descriptions and guarantees:
  S1(T,D,Y)    :- M(T,D,Y)             (advertised)
  LCW: S1(T,D,Y) :- M(T,D,Y), Y>1960
  S2(T,Th,C,H) :- Sh(T,Th,C,H)         (advertised)

Query: Q(T,D) :- M(T,D,Y), Sh(T,Th,C,H),
                 C='Seattle'
Plan:  Q(T,D) :- S1(T,D,Y), S2(T,Th,C,H),
                 C='Seattle'
The part of the plan that is complete:
  Q(T,D) :- M(T,D,Y), Sh(T,Th,C,H), C='Seattle',
            Y>1960

[Levy, 96; Duschka, 97; Lambrecht & Kambhampati,
99]
88
Approaches for handling multiple objectives
  • Do staged optimization
    • Information Manifold: optimize for coverage, and
      then for cost
  • Do joint optimization
    • Generate all the non-dominated solutions
      (Pareto set)
    • Combine the objectives into a single metric,
      e.g. Havasu/Multi-R:
      • Cost increases additively
      • Coverage decreases multiplicatively
      • utility(p) = w*log(coverage(p)) - (1-w)*cost(p)
      • The logarithm makes coverage additive
        [Candan, 01]

89
Learning Coverage/Overlap Statistics
Challenge: it is impractical to learn and store
all the statistics for every query.
  • StatMiner: a threshold-based, hierarchical
    association rule mining approach
    • Learns statistics with respect to query
      classes rather than specific queries
    • Defines query classes in terms of attribute-value
      hierarchies
    • Discovers frequent query classes and limits
      statistics to them
    • Maps a user's query to its closest ancestor
      class, and uses the statistics of the mapped
      class to estimate the statistics of the query
    • Handles the efficiency and accuracy tradeoffs by
      adjusting the thresholds

Havasu [Nie et al., 2002]
90
Alternatives to statistics-based approaches
  • Avoid cost-based optimization and depend on
    qualitative, easier-to-get statistics
  • Instead of coverage/overlap statistics, use:
    • LCW characterizations
    • Mirror source characterizations
    • E.g. RAZOR, INFORMATION MANIFOLD, EMERAC
  • Instead of tuple-transfer statistics, use:
    • The "bound-is-easier" assumption
    • High-traffic binding pattern information
    • E.g. INFORMATION MANIFOLD, EMERAC, Yerneni et
      al.

91
Staged Optimization of Cost & Coverage
  • Ranking and choosing the top-N plans [Doan et
    al., ICDE 02]
  • Finding a physical plan for each selected plan

Problem: cost and coverage are interrelated..
92
Joint Optimization of Cost & Coverage
  • Havasu/Multi-R [Nie et al., 2001]
  • Search in the space of parallel plans
    • Each subplan in the parallel plan contains (a
      subset of) the sources relevant for a subgoal
  • Dynamic programming is used to search among the
    subgoal orders
  • A greedy approach is used to create a subplan for
    a particular subgoal (sketched below)
    • Keep adding sources until the utility (defined in
      terms of cost and coverage) starts to worsen
  • Capable of generating plans for a variety of
    cost/coverage tradeoffs

[Figure: plans generated as the relative weight of
coverage increases]
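
A sketch of that greedy subplan construction (my own rendering; it
reuses the utility function sketched under slide 88 and crudely
approximates the learned overlap statistics by treating sources as
independent, with every source assumed to have non-zero coverage):

  def greedy_subplan(sources, utility, w=0.5):
      """sources: name -> (coverage, cost). Keep adding the best
      source until the subplan's utility would worsen."""
      chosen, coverage, cost = [], 0.0, 0.0
      best = float("-inf")
      remaining = dict(sources)
      while remaining:
          def gain(s):
              cov, c = remaining[s]
              # independence approximation, not real overlap stats
              return utility(coverage + (1 - coverage) * cov,
                             cost + c, w)
          s = max(remaining, key=gain)
          if gain(s) <= best:
              break                 # utility starts to worsen: stop
          cov, c = remaining.pop(s)
          coverage += (1 - coverage) * cov
          cost += c
          best = utility(coverage, cost, w)
          chosen.append(s)
      return chosen
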
93
Techniques for optimizing response time for first
tuples
  • Staged approach: generate plans based on other
    objectives and post-process them to improve their
    response time for the first k tuples
    • The typical idea is to replace asymmetric
      operators with symmetric ones
      • E.g. replace nested-loop join with symmetric hash
        join (sketched below)
      • E.g. Telegraph, Tukwila, Niagara
    • Problem: access limitations between sources may
      disallow symmetric operations
  • Solution: use a joint optimization approach (e.g.
    Havasu) and consider the cost of first tuples as
    a component of plan utility
  • [Viglas & Naughton, 2002] describe approaches for
    characterizing the rate of answer delivery
    offered by various query plans.
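
A sketch of a symmetric hash join (my own illustration; tuples are
dicts sharing a join key): unlike a nested-loop join, it builds hash
tables on both inputs and emits matches as tuples stream in, so the
first answers appear early.

  from collections import defaultdict

  def symmetric_hash_join(left, right, key):
      """Stream both inputs round-robin; insert each tuple into its
      side's hash table, probe the other side, emit matches at once."""
      tables = (defaultdict(list), defaultdict(list))
      iters = [(iter(left), 0), (iter(right), 1)]
      while iters:
          for entry in list(iters):
              it, side = entry
              try:
                  t = next(it)
              except StopIteration:
                  iters.remove(entry)
                  continue
              tables[side][t[key]].append(t)
              for m in tables[1 - side].get(t[key], []):
                  yield (t, m) if side == 0 else (m, t)

  movies = [{"title": "Ran", "dir": "Kurosawa"}]
  shows = [{"title": "Ran", "cinema": "Valley Art"}]
  print(list(symmetric_hash_join(movies, shows, "title")))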

94
Integrating Services
  • Sources can be services rather than data
    repositories
    • E.g. Amazon as a composite service for book buying
    • The separating line is somewhat thin
  • Handling services:
    • Description (API / I-O spec)
      • WSDL
    • Composition
      • Planning, in general
    • Execution
      • Data-flow architectures
  • See next part

95
What about XML/XQuery/RDF?
Impact of X-standards on Integration
96
Impact of XML on Integration
  • If and when all sources accept XQueries and
    exchange data in XML format, then:
    • The mediator can accept user queries in XQuery
    • Access sources using XQuery
    • Get data back in XML format
    • Merge results and send them to the user in XML
      format
  • How about now?
    • Sources can use XML adapters (middleware)
97
Services
  • Sources can be services rather than data
    repositories
    • The separating line is thin
  • Handling services:
    • Description (API / I-O spec)
      • WSDL
    • Composition
      • Planning, in general
    • Execution
      • Data-flow architectures

[SWORD, WWW-2002]