Data Integration - PowerPoint PPT Presentation

1 / 16
About This Presentation
Title:

Data Integration

Description:

proprietary. Mediator Architecture. Key idea. hide heterogeneity in wrappers. data format conversions, capabilities. wrapper makes data source look like DB ... – PowerPoint PPT presentation

Number of Views:96
Avg rating:3.0/5.0
Slides: 17
Provided by: Office20051
Category:

less

Transcript and Presenter's Notes

Title: Data Integration


1
Data Integration
2
Overview
  • Virtual Tables and Mediators
  • On the fly data integration

3
Data Warehouse Architecture
OLTP
OLAP
OLTP Applications
GUI, Spreadsheets
DB1
Data Warehouse
DB2
DB3
4
Data Warehouse Architecture
  • Data Warehouse Materialized View(s)
  • base data comes from different sources
  • schema of data warehouse can be defined as SQL
    over base data (modulo data cleaning)
  • What is the alternative?
  • data warehouse (normal) View
  • Advantages of Materialized Views
  • guaranteed response time after ETL
  • Disadvantages of Materialized Views
  • investment for ETL (needs to be amortized)
  • freshness of data for real-time warehousing
  • additional resources to store / query views

5
Heterogeneous DBMS
  • Idea Transparently query n databases as if it
    were one database (SQL or XQuery)
  • hide heterogeneity
  • hide distribution
  • Kinds of heterogeneity
  • heterogeneous hardware / OS / platform
  • heterogeneous DBMS products (DB2, Oracle, ...)
  • heterogeneous data models (XML, hierarchical,
    ...)
  • semantic heterogeneity (EUR, USD, ...)

6
Mediator Architecture
Client
Client
Client
SQL
Mediator
SQL
wrapper
wrapper
wrapper
wrapper
wrapper
proprietary
SAP
DB2
LDAP
MySQL
EXCEL
7
Mediator Architecture
  • Key idea
  • hide heterogeneity in wrappers
  • data format conversions, capabilities
  • wrapper makes data source look like DB
  • mediator operates like a distributed DB
  • optimizes join orders
  • optimizes data movement
  • needs to respect capabilities of data sources (!)
  • Does not help for semantic heterogeneity!

8
Heterogeneous DB Products
  • IBM Data Joiner
  • today, part of DB2
  • Products from all major DB vendors
  • Research Prototypes
  • Garlic, TSIMMIS, Info Manifold, ...
  • very popular in the late 90s

9
What is the Schema of a HDB?
  • Which schema is supported by mediator?
  • Local as View
  • the local schema is a view on the global schema
  • involves answering queries using views
  • change local schema independent of global schema
  • Global as View
  • the global schema is a view over the local
    schemas
  • involves view unfolding (much simpler)
  • change to local schema -gt change of global schema
  • N.B.
  • often difficult to find global schema (no
    matter)none of these approaches scale at the
    Web-level

10
Schema Examples
  • Global Schema
  • emp(eno, name, salary, dno)
  • dept(dno, manager, budget)
  • Local Schema 1
  • empL1(eno, name, salary, level, dno)
  • deptL1(dno, budget)
  • Local Schema 2
  • empL2(eno, name, salary, dno)
  • deptL2(dno, manager, budget)

11
Global as View
  • create view emp as
  • (select eno, name, (salary 1.6), dno
  • from empL1)
  • UNION
  • (select from empL2)
  • create view dept as
  • (select d.dno, e.eno, d.budget 1.6
  • from empL1, deptL2
  • where e.dno d.dno and e.level manager)
  • UNION
  • (select from deptL2)

12
GaV Query Processing
  • select max(budget) from dept
  • is equivalent to (after optimizationassumpt.)
  • select max(budget)
  • from (select max(budget 1.6) as budget)
  • from deptL1) union
  • (select max(budget) as budget
  • from deptL2)

13
Local as View
  • create view deptL2 as
  • (select dno, budget / 1.6
  • from dept)
  • create view empL2 as
  • (select eno, name, salary, dno
  • from emp)
  • Query processing complicated, but ultimately the
    same after AQUV.

14
Open Questions
  • How do you execute updates?
  • difficult in the GaV approach
  • How do you define integrity constraints?
  • typically done at the data source level only
  • global integrity constraints very difficult
  • How do you execute transactions?
  • the usual distributed DBMS shabang (2PC, etc.)

15
Capability-based Query Opt.
  • Let us assume that your empL1 DB is served by a
    Web-based interface
  • only look-ups by eno and dno possible
  • Query
  • select max(salary)
  • from emp
  • where dno 4711
  • What is the best feasible plan for this query?

16
Solution
  • select max(salary)
  • from (select eno, name, salary1.6 as salary)
  • from empL1) union
  • (select max(salary)
  • from empL2)
  • How is this done?
  • wrappers provide rules to enumerate possible
    plans
  • wrappers provide cost model for their plans
  • reuse query optimizer (DP) infrastructure
  • Haas et al., VLDB 1997
Write a Comment
User Comments (0)
About PowerShow.com