Chapter 13: Incorporating Uncertainty into Data Integration

Provided by: ziv6

Learn more at: https://research.cs.wisc.edu


1
Chapter 13: Incorporating Uncertainty into Data Integration
PRINCIPLES OF DATA INTEGRATION
ANHAI DOAN, ALON HALEVY, ZACHARY IVES
2
Outline

Sources of uncertainty in data integration
Representing uncertain data (brief overview)
Probabilistic schema mappings

3
Managing Uncertain Data

Databases typically model certain data:
A tuple is either true (in the database) or false (not in the database).
Real life involves a lot of uncertainty:
The thief had either blond or brown hair.
The sensor reading is often unreliable.
Uncertain databases try to model such uncertain data and to answer queries in a principled fashion.
Data integration involves multiple facets of uncertainty!

4
Uncertainty in Data Integration

The data itself may be uncertain (perhaps it was extracted from an unreliable source).
Schema mappings can be approximate (perhaps created by an automatic tool).
Reference reconciliation (and hence joins) is approximate.
If the domain is broad enough, even the mediated schema could involve uncertainty.
Queries, often posed as keywords, have uncertain intent.

5
Outline

Sources of uncertainty in data integration
Representing uncertain data (brief overview)
Probabilistic schema mappings

6
Principles of Uncertain Databases

Instead of describing one possible state of the world, an uncertain database describes a set of possible worlds.
The expressive power of the data model determines which sets of possible worlds the database can represent:
Is the uncertainty about the values of an attribute?
Or about the presence of a tuple?
Can dependencies between tuples be represented?

7
C-Tables: Uncertainty without Probabilities

Alice and Bob want to go on a vacation together, but will go to either Tahiti or Ulaanbaatar.
Candace will definitely go to Ulaanbaatar.
Possible worlds result from different assignments to the variables.
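
The slide's c-table itself is not part of this transcript, so the following is only a minimal sketch of the idea, under the assumption that Alice and Bob share a single variable x whose domain is {Tahiti, Ulaanbaatar}. The names `C_TABLE`, `VARIABLES`, and `possible_worlds` are hypothetical and chosen for illustration.

```python
from itertools import product

# Assumed encoding of the slide's example: each row is (person, destination),
# where the destination is either a constant or the name of a variable.
VARIABLES = {"x": ["Tahiti", "Ulaanbaatar"]}  # assumed domain of x
C_TABLE = [
    ("Alice", "x"),              # Alice goes wherever x says
    ("Bob", "x"),                # Bob shares x, so he always goes with Alice
    ("Candace", "Ulaanbaatar"),  # a constant: present in every world
]

def possible_worlds(c_table, variables):
    """Yield (assignment, world) for every assignment of values to the variables."""
    names = sorted(variables)
    for values in product(*(variables[n] for n in names)):
        assignment = dict(zip(names, values))
        # Replace variable names by their assigned value; constants pass through.
        world = {(person, assignment.get(dest, dest)) for person, dest in c_table}
        yield assignment, world

worlds = list(possible_worlds(C_TABLE, VARIABLES))
for assignment, world in worlds:
    print(assignment, sorted(world))
```

Each assignment to x produces one possible world. No probabilities are attached to the worlds, which is the point of the slide title.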

8
Representing Complex Distributions

The c-table represents mutual exclusion of tuples, but does not represent probability distributions.
Representing complex probability distributions and correlations between tuples requires using probabilistic graphical models.
A couple of simpler models:
Independent tuple probabilities
Block independent probabilities

9
Tuple Independent Model

Assign each tuple a probability.
The probability of every possible world is the product of one factor per row:
p_i if row i is in the world, and (1 - p_i) if it is not.
Cannot represent correlations between tuples.
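
To make the product rule concrete, here is a small sketch that enumerates every possible world of a three-row table and computes its probability. The rows and their probabilities are invented for illustration and do not come from the slides.

```python
from itertools import product

# Hypothetical rows and tuple probabilities, for illustration only.
ROWS = {
    ("Alice", "Tahiti"): 0.7,
    ("Bob", "Tahiti"): 0.4,
    ("Candace", "Ulaanbaatar"): 0.9,
}

def world_probability(world, rows):
    """Multiply p_i for rows in the world and (1 - p_i) for rows left out."""
    prob = 1.0
    for row, p in rows.items():
        prob *= p if row in world else (1.0 - p)
    return prob

def all_worlds(rows):
    """Yield every subset of the rows as a frozenset (2^n worlds for n rows)."""
    keys = list(rows)
    for mask in product([False, True], repeat=len(keys)):
        yield frozenset(k for k, keep in zip(keys, mask) if keep)

dist = {w: world_probability(w, ROWS) for w in all_worlds(ROWS)}
print(len(dist), "worlds, total probability", sum(dist.values()))
```

Because every row is decided independently, there is no way in this model to state that two rows must appear together or must exclude each other.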

10
Block Independent Model

You choose one tuple from every block according to the distribution of that block.
Can represent mutual exclusion, but not co-dependence (i.e., Alice and Bob going to the same location).
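
A minimal sketch of the block-independent model follows, assuming one block per person and made-up probabilities. It shows mutual exclusion inside a block and also why co-dependence across blocks cannot be enforced.

```python
from itertools import product

# Hypothetical blocks (one per person); the tuples inside a block are mutually
# exclusive and their probabilities sum to 1.
BLOCKS = {
    "Alice": {("Alice", "Tahiti"): 0.6, ("Alice", "Ulaanbaatar"): 0.4},
    "Bob": {("Bob", "Tahiti"): 0.3, ("Bob", "Ulaanbaatar"): 0.7},
}

def block_worlds(blocks):
    """Pick one tuple per block independently; yield (world, probability)."""
    per_block = [list(dist.items()) for dist in blocks.values()]
    for choice in product(*per_block):
        world = frozenset(t for t, _ in choice)
        prob = 1.0
        for _, p in choice:
            prob *= p
        yield world, prob

dist = dict(block_worlds(BLOCKS))
# Probability that Alice and Bob end up in the same place.
same_place = sum(p for w, p in dist.items() if len({dest for _, dest in w}) == 1)
print(len(dist), "worlds; Pr(same destination) =", same_place)
```

Worlds in which Alice and Bob travel to different places keep nonzero probability whenever both options in each block have nonzero probability, so the constraint from the c-table example cannot be expressed here.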

11
Outline

Sources of uncertainty in data integration
Representing uncertain data (brief overview)
Probabilistic schema mappings

12
Probabilistic Schema Mappings

Source schema:
S(pname, email-addr, home-addr, office-addr)
Target schema:
T(name, mailing-addr)
We may not be sure which attribute of S should map to mailing-addr.
Probabilistic schema mappings let us handle such uncertainty.

13
Probabilistic Schema Mappings
Intuitively, we want to give each mapping a probability.

S(pname, email-addr, home-addr, office-addr)
T(name, mailing-addr)

Possible Mapping                            Probability
(pname, name), (home-addr, mailing-addr)    0.5
(pname, name), (office-addr, mailing-addr)  0.4
(pname, name), (email-addr, mailing-addr)   0.1
14
What are the Semantics?

S(pname, email-addr, home-addr, office-addr)
T(name, mailing-addr)

Possible Mapping                            Probability
(pname, name), (home-addr, mailing-addr)    0.5
(pname, name), (office-addr, mailing-addr)  0.4
(pname, name), (email-addr, mailing-addr)   0.1

Should a single mapping apply to the entire table (by-table semantics), or can different mappings apply to different tuples (by-tuple semantics)?
15
By-Table versus By-Tuple Semantics
Source instance DS:
pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

There are 3 possible target databases DT, one per mapping:

Pr(m1) = 0.5
name    mailing-addr
Alice   Mountain View
Bob     Sunnyvale

Pr(m2) = 0.4
name    mailing-addr
Alice   Sunnyvale
Bob     Sunnyvale

Pr(m3) = 0.1
name    mailing-addr
Alice   alice@
Bob     bob@
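
The by-table computation can be sketched as follows. The source rows and mapping probabilities are taken from the slide; the query (return every (name, mailing-addr) tuple of T) and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Source instance D_S from the slide.
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

# Probabilistic mapping from the slide: (target attribute -> source attribute, probability).
P_MAPPING = [
    ({"name": "pname", "mailing-addr": "home-addr"}, 0.5),    # m1
    ({"name": "pname", "mailing-addr": "office-addr"}, 0.4),  # m2
    ({"name": "pname", "mailing-addr": "email-addr"}, 0.1),   # m3
]

def apply_mapping(mapping, row):
    """Translate one source row into a (name, mailing-addr) target tuple."""
    return tuple(row[mapping[attr]] for attr in ("name", "mailing-addr"))

def by_table_answers(source, p_mapping):
    """One mapping applies to the whole table; sum Pr(m) over mappings yielding each tuple."""
    answers = defaultdict(float)
    for mapping, prob in p_mapping:
        target = {apply_mapping(mapping, row) for row in source}
        for t in target:
            answers[t] += prob
    return dict(answers)

answers = by_table_answers(SOURCE, P_MAPPING)
for t, p in sorted(answers.items()):
    print(t, round(p, 3))
```

Only three target databases need to be examined, one per mapping, regardless of how many rows the source has. A tuple such as (Bob, Sunnyvale) is produced by both m1 and m2, so its probability is the sum of the two.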
16
By-Table versus By-Tuple Semantics
Source instance DS:
pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

There are 9 possible mapping sequences, each yielding a target database DT; three of them are shown:

Pr(<m1, m3>) = 0.05
name    mailing-addr
Alice   Mountain View
Bob     bob@

Pr(<m2, m3>) = 0.04
name    mailing-addr
Alice   Sunnyvale
Bob     bob@

Pr(<m3, m3>) = 0.01
name    mailing-addr
Alice   alice@
Bob     bob@
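
As a sketch of by-tuple semantics, the code below enumerates every mapping sequence (one mapping per source tuple) and multiplies the per-tuple mapping probabilities, which matches the slide's values such as Pr(<m1, m3>) = 0.5 x 0.1 = 0.05. The data comes from the slide; the function and variable names are hypothetical.

```python
from itertools import product

# Source instance D_S from the slide.
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

# Each mapping is reduced to the source attribute it maps to mailing-addr.
MAPPINGS = {
    "m1": ("home-addr", 0.5),
    "m2": ("office-addr", 0.4),
    "m3": ("email-addr", 0.1),
}

def by_tuple_worlds(source, mappings):
    """Yield (sequence, target database, probability) for every mapping sequence."""
    for seq in product(mappings, repeat=len(source)):
        prob = 1.0
        target = set()
        for row, m in zip(source, seq):
            attr, p = mappings[m]
            prob *= p  # mappings are chosen independently per tuple
            target.add((row["pname"], row[attr]))
        yield seq, frozenset(target), prob

worlds = list(by_tuple_worlds(SOURCE, MAPPINGS))
seq_prob = {seq: p for seq, _, p in worlds}
distinct_targets = {t for _, t, _ in worlds}
print(len(worlds), "sequences,", len(distinct_targets), "distinct target databases")
```

With 3 mappings and n source tuples there are 3^n sequences, which gives some intuition for why query answering is more expensive under by-tuple semantics. In this instance several sequences coincide, because Bob's home and office addresses are both Sunnyvale.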
17
Complexity of Query Answering
Answering queries is more expensive under by-tuple semantics.

                     By-table   By-tuple
Data complexity      PTIME      #P-complete
Mapping complexity   PTIME      PTIME
18
Summary of Chapter 13

Uncertainty is everywhere in data integration.
Work in this area is really only beginning.
Great opportunity for further research.
Probabilistic schema mappings:
By-table versus by-tuple semantics.
By-tuple semantics is computationally expensive, but restricted cases can be found where query answering is still polynomial.
Where do the probabilities come from?
Sometimes we interpret statistics as probabilities.
Sometimes the provenance of the data is more meaningful than the probabilities.