CSE 636 Data Integration

1 / 22
About This Presentation
Title:

CSE 636 Data Integration

Description:

Excel Files. Web Sites. Web Services. Differences in: Names in schema. Attribute grouping ... Invited Tutorial. Data Integration: a Status Report. Alon Halevy ... – PowerPoint PPT presentation

Number of Views:26
Avg rating:3.0/5.0

less

Transcript and Presenter's Notes

Title: CSE 636 Data Integration


1
CSE 636Data Integration
  • Overview

2
Data Warehouse Architecture
?
OLAP / Decision Support Data Cubes / Data Mining
?
Users
Applications
Relational Database (Warehouse)
ETL Tools (Extract-Transform-Load)
Data Cleaning
Data Source
Data Source
Data Source
3
Virtual Integration Architecture
  • Leave the data in the sources
  • When a query comes in
  • Determine the relevant sources to the query
  • Break down the query into sub-queries for the
    sources
  • Get the answers from the sources, filter them if
    needed and combine them appropriately
  • Data is fresh
  • Otherwise known as
  • On Demand Integration

4
Virtual Integration Architecture
Design-Time
?
?
End Users
Applications
  • Sources can be
  • Relational DBs
  • Excel Files
  • Web Sites
  • Web Services

Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
5
Schema Mappings
  • Differences in
  • Names in schema
  • Attribute grouping
  • Coverage of databases
  • Granularity and format of attributes

Inventory Database B
Authors ISBN FirstName LastName
Books Title ISBN Price DiscountPrice Edition
Inventory Database A
BookCategories ISBN Category
BooksAndMusic Title Author Publisher ItemID ItemTy
pe SuggestedPrice Categories Keywords
CDCategories ASIN Category
CDs Album ASIN Price DiscountPrice Studio
Artists ASIN ArtistName GroupName
6
Issues for Schema Mappings
Design-Time
?
?
End Users
Applications
  • What formalisms to express them?
  • How to create them?
  • Can we discover them somehow?
  • How do we use them?

Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
7
Virtual Integration Architecture
Run-Time
Query
Result
Mediator
Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
8
Issues for Query Processing
Reformulation
Query
  • User queries refer to the global schema
  • Data is stored in the sources in a local schema
  • Rewriting algorithms

Mediator
Reformulation
Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
9
Issues for Query Processing
Reformulation
Global Schema
SELECT ISBN, Price FROM Books WHERE Title on
the road
Books Title ISBN Price DiscountPrice Edition
Local Schema A
SELECT ItemID, SuggestedPrice FROM
BooksAndMusic WHERE Title on the road AND
ItemType Books
BooksAndMusic Title Author Publisher ItemID ItemTy
pe SuggestedPrice Categories Keywords
10
Issues for Query Processing
Query Translation
Query
  • Different query languages

Mediator
Reformulation
Global Schema
Optimization
Execution
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
11
Issues for Query Processing
Query Translation
Global Schema
SELECT ISBN, Price FROM Books WHERE Title on
the road
Books Title ISBN Price DiscountPrice Edition
Local Source A
http//www.amazon.com/homepage.html?ItemTypeBooks
Titleontheroad
12
Issues for Query Processing
Data Translation
Query
  • Different data models

Mediator
Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
13
Issues for Query Processing
Data Translation
Global Schema
Title ISBN Price
On the Road 123 10.86
Books Title ISBN Price DiscountPrice Edition
lttablegt lttrgt lttdgt lta
href/details?isbn123gt ltbgtOn the
Roadlt/bgt lt/agt -- by Jack
Kerouac Paperback ltbrgt lta
href/details?isbn123gt Buy new
lt/agt ltb classpricegt10.86lt/bgt
lt/tdgt lt/trgt lt/tablegt
Local Result A
14
Issues for Query Processing
Query Execution
Query
  • Access as many data sources as needed
  • Duplicate/redundant and irrelevant data
  • Limited query capabilities

Mediator
Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
15
Issues for Query Processing
Limited Query Capabilities
SELECT ISBN, Price, DiscountPrice FROM
Books WHERE Title on the road
Global Schema
Books Title ISBN Price DiscountPrice Edition
ISBN Price DiscountPrice
123 10.86 8.86
E
Local Schema A
Local Schema B
BooksAndMusic Title Author ItemID ItemType Suggest
edPrice
DiscountBooks Title Edition ISBN GreatPrice
ItemID SuggestedPrice
123 10.86
ItemID SuggestedPrice
123 10.86
GreatPrice
8.86
B
D
SELECT GreatPrice FROM DiscountBooks WHERE ISBN
?
SELECT ItemID, SuggestedPrice FROM
BooksAndMusic WHERE Title ?
SELECT GreatPrice FROM DiscountBooks WHERE ISBN
123
C
16
Issues for Query Processing
Query Answering
Query
Result
  • Combine the results and further process them if
    needed
  • Mainly union and merge
  • Inconsistencies

Mediator
Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
17
Issues for Query Processing
Query Answering (Union)
ISBN Price
123 10.86
456 8.86
ItemID SuggestedPrice
123 10.86
ISBN GreatPrice
456 8.86
18
Issues for Query Processing
Query Answering (Merge)
Primary Key
ISBN Title Edition Price
123 On the Road 2nd 8.86
ISBN Title Edition Price
123 On the Road 2nd 8.86
ItemID Title
123 On the Road
ISBN Edition Price
123 2nd 8.86
Primary Key
Primary Key
19
Issues for Query Processing
Query Answering (Inconsistencies)
Primary Key
ISBN Title Edition Price
123 On the Road 8.86
ISBN Title Edition Price
123 On the Road ??? 8.86
ItemID Title Edition
123 On the Road 1st
ISBN Edition Price
123 2nd 8.86
Primary Key
Primary Key
20
Community-Based Integration
?
Web Domain
  • Fairly-dynamic environment
  • New sources register over time and new
    applications queries are formulated
  • Allow developers to easily build applications
    based on the community schema
  • ?
  • So that each others needs
  • are accommodated
  • ?
  • Allow source owners to easily and independently
    register their source

Web Forms Reports
End Users
?
Application Domain
?
?
Developers
New Application
Application
?
Community Domain
Mediator
Community Owner
Community Schema
Source Domain
?
Web Service
Web Service

Source Owners
Data Source
New Source
Source Schema
21
Peer-Based Integration
Query
Peer 4
Query
Peer 5
Peer 2
Peer 1
Peer 3
22
Peer-Based Integration
  • No need for a central mediated schema
  • Peers serve as mediators for other peers
  • A peer can be both a server and a client
  • Semantic relationships are specified
    locally(between small sets of peers)
  • Queries are posed using the peers schema
  • Answers come from anywhere in the system
  • This is not P2P file sharing.
  • Data has rich semantics

23
References
  • Information integration
  • Maurizio Lenzerini
  • Eighteenth International Joint Conference on
    Artificial Intelligence, IJCAI 2003
  • Invited Tutorial
  • Data Integration a Status Report
  • Alon Halevy
  • German Database Conference (BTW), 2003
  • Invited Talk
Write a Comment
User Comments (0)