Title: Chapter 13: Incorporating Uncertainty into Data Integration
PRINCIPLES OF DATA INTEGRATION
ANHAI DOAN, ALON HALEVY, ZACHARY IVES
Outline
- Sources of uncertainty in data integration
- Representing uncertain data (brief overview)
- Probabilistic schema mappings
Managing Uncertain Data
- Databases typically model certain data
- A tuple is either true (in the database) or false (not in the database).
- Real life involves a lot of uncertainty
- The thief had either blond or brown hair.
- The sensor reading is often unreliable.
- Uncertain databases try to model such uncertain data and to answer queries in a principled fashion.
- Data integration involves multiple facets of uncertainty!
Uncertainty in Data Integration
- Data itself may be uncertain (perhaps it's extracted from an unreliable source)
- Schema mappings can be approximate (perhaps created by an automatic tool)
- Reference reconciliation (and hence joins) is approximate
- If the domain is broad enough, even the mediated schema could involve uncertainty
- Queries, often posed as keywords, have uncertain intent.
Principles of Uncertain Databases
- Instead of describing one possible state of the world, an uncertain database describes a set of possible worlds.
- The expressive power of the data model determines which sets of possible worlds the database can represent.
- Is uncertainty on values of an attribute?
- Or on the presence of a tuple?
- Can dependencies between tuples be represented?
C-Tables: Uncertainty without Probabilities
- Alice and Bob want to go on a vacation together, but will go to either Tahiti or Ulaanbaatar. Candace will definitely go to Ulaanbaatar.
- Possible worlds result from different assignments to the variables.
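The slide's c-table itself did not survive in this text, so the following is only a minimal sketch of how possible worlds arise from variable assignments. It assumes an encoding in which Alice's and Bob's tuples share one variable `x` whose domain is {Tahiti, Ulaanbaatar}, while Candace's tuple holds a constant. The names `DOMAINS`, `C_TABLE`, and `possible_worlds` are hypothetical.

```python
from itertools import product

# Assumed encoding of the vacation example: a destination is either a
# constant or the name of a variable. Alice and Bob share the variable "x".
DOMAINS = {"x": ["Tahiti", "Ulaanbaatar"]}
C_TABLE = [
    ("Alice", "x"),
    ("Bob", "x"),
    ("Candace", "Ulaanbaatar"),
]


def possible_worlds(c_table, domains):
    """Return one ordinary relation per assignment of values to the variables."""
    names = sorted(domains)
    worlds = []
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        # Replace variables by their assigned value; constants stay unchanged.
        world = {(who, assignment.get(dest, dest)) for who, dest in c_table}
        worlds.append(world)
    return worlds


worlds = possible_worlds(C_TABLE, DOMAINS)
for world in worlds:
    print(sorted(world))
```

Because Alice and Bob share the same variable, every world places them at the same destination, which is the kind of dependency between tuples that a c-table can express.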
Representing Complex Distributions
- The c-table represents mutual exclusion of tuples, but doesn't represent probability distributions.
- Representing complex probability distributions and correlations between tuples requires using probabilistic graphical models.
- A couple of simpler models
- Independent tuple probabilities
- Block independent probabilities
Tuple Independent Model
- Assign each tuple a probability.
- The probability of every possible world is the product, over all rows, of the appropriate factor for each row: p_i if row i is in the database, and (1 - p_i) if it is not.
- Cannot represent correlations between tuples.
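The product rule for the tuple independent model can be sketched as follows. The tuples and their probabilities are invented for illustration only, and `world_probability` and `all_worlds` are hypothetical helper names, not part of any library.

```python
from itertools import product

# Hypothetical tuples and probabilities, chosen only for illustration.
TUPLES = {
    ("Alice", "Tahiti"): 0.7,
    ("Bob", "Tahiti"): 0.4,
}


def world_probability(world, tuple_probs):
    """Multiply p_i for tuples in the world and (1 - p_i) for tuples outside it."""
    prob = 1.0
    for row, p in tuple_probs.items():
        prob *= p if row in world else (1.0 - p)
    return prob


def all_worlds(tuple_probs):
    """Enumerate every subset of the tuples together with its probability."""
    rows = list(tuple_probs)
    result = []
    for flags in product([True, False], repeat=len(rows)):
        world = frozenset(r for r, keep in zip(rows, flags) if keep)
        result.append((world, world_probability(world, tuple_probs)))
    return result


distribution = all_worlds(TUPLES)
for world, prob in distribution:
    print(sorted(world), round(prob, 2))
```

With n tuples there are 2^n possible worlds, and their probabilities sum to 1. Since each factor depends on one tuple only, no choice of the p_i can make two tuples always appear together.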
Block Independent Model
- You choose one tuple from every block according to the distribution of that block.
- Can represent mutual exclusion, but not co-dependence (i.e., Alice and Bob going to the same location).
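A minimal sketch of the block independent model, assuming one block per person with invented probabilities (`BLOCKS` and `block_worlds` are hypothetical names). Each world picks exactly one tuple per block, and the blocks are chosen independently of one another.

```python
from itertools import product

# Hypothetical blocks: each block lists mutually exclusive tuples whose
# probabilities sum to 1. The numbers are illustrative only.
BLOCKS = {
    "Alice": {("Alice", "Tahiti"): 0.6, ("Alice", "Ulaanbaatar"): 0.4},
    "Bob": {("Bob", "Tahiti"): 0.5, ("Bob", "Ulaanbaatar"): 0.5},
}


def block_worlds(blocks):
    """Pick one tuple per block; a world's probability is the product of the picks."""
    keys = sorted(blocks)
    result = []
    for choice in product(*(blocks[k].items() for k in keys)):
        world = frozenset(row for row, _ in choice)
        prob = 1.0
        for _, p in choice:
            prob *= p
        result.append((world, prob))
    return result


worlds = block_worlds(BLOCKS)

# Probability that Alice and Bob end up at the same destination.
same_location = sum(
    prob for world, prob in worlds if len({dest for _, dest in world}) == 1
)
print(round(same_location, 2))
```

Within a block the tuples exclude each other, so Alice is never in two places at once. Because the blocks are independent, worlds in which Alice and Bob are at different destinations keep nonzero probability here, so "they travel together" cannot be enforced.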
Probabilistic Schema Mappings
- Source schema
- S(pname, email-addr, home-addr, office-addr)
- Target schema
- T(name, mailing-addr)
- We may not be sure which attribute of S mailing-addr should map to.
- Probabilistic schema mappings let us handle such uncertainty.
Probabilistic Schema Mappings
Intuitively, we want to give each mapping a probability.
- S(pname, email-addr, home-addr, office-addr)
- T(name, mailing-addr)
Possible Mapping                              Probability
(pname, name), (home-addr, mailing-addr)      0.5
(pname, name), (office-addr, mailing-addr)    0.4
(pname, name), (email-addr, mailing-addr)     0.1
What are the Semantics?
- S(pname, email-addr, home-addr, office-addr)
- T(name, mailing-addr)
Possible Mapping                              Probability
(pname, name), (home-addr, mailing-addr)      0.5
(pname, name), (office-addr, mailing-addr)    0.4
(pname, name), (email-addr, mailing-addr)     0.1
Should a single mapping apply to the entire table (by-table semantics), or can different mappings apply to different tuples (by-tuple semantics)?
By-Table versus By-Tuple Semantics
Source instance Ds:
pname   email-addr   home-addr       office-addr
Alice   alice_at_    Mountain View   Sunnyvale
Bob     bob_at_      Sunnyvale       Sunnyvale
By-table semantics: there are 3 possible databases DT, one per mapping.
DT under m1 (home-addr), Pr(m1) = 0.5:
name    mailing-addr
Alice   Mountain View
Bob     Sunnyvale
DT under m2 (office-addr), Pr(m2) = 0.4:
name    mailing-addr
Alice   Sunnyvale
Bob     Sunnyvale
DT under m3 (email-addr), Pr(m3) = 0.1:
name    mailing-addr
Alice   alice_at_
Bob     bob_at_
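By-table semantics can be sketched in a few lines: apply each mapping to the whole source table, and give every answer the total probability of the mappings that produce it. The source rows and mapping probabilities come from the slides. The projection query on mailing-addr and the names `apply_mapping` and `by_table_answers` are assumptions made for illustration.

```python
from collections import defaultdict

SOURCE = [
    {"pname": "Alice", "email-addr": "alice_at_",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob_at_",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

# Each entry: (source attribute mapped to mailing-addr, probability).
# pname maps to name in all three mappings.
P_MAPPING = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]


def apply_mapping(source, addr_attr):
    """Build the target table T(name, mailing-addr) under a single mapping."""
    return [(row["pname"], row[addr_attr]) for row in source]


def by_table_answers(source, p_mapping):
    """Probability of each mailing-addr value when one mapping covers the whole table."""
    answers = defaultdict(float)
    for addr_attr, prob in p_mapping:
        target = apply_mapping(source, addr_attr)
        # Count each distinct value once per mapping (set semantics).
        for value in {addr for _, addr in target}:
            answers[value] += prob
    return dict(answers)


by_table = by_table_answers(SOURCE, P_MAPPING)
for value, prob in sorted(by_table.items()):
    print(value, round(prob, 2))
```

Only as many target databases as there are mappings need to be considered, regardless of how many tuples the source holds.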
By-Table versus By-Tuple Semantics
Source instance Ds:
pname   email-addr   home-addr       office-addr
Alice   alice_at_    Mountain View   Sunnyvale
Bob     bob_at_      Sunnyvale       Sunnyvale
By-tuple semantics: there are 9 possible databases DT, one per sequence <mapping for Alice, mapping for Bob>. Three of them:
DT under <m1, m3>, Pr(<m1, m3>) = 0.05:
name    mailing-addr
Alice   Mountain View
Bob     bob_at_
DT under <m2, m3>, Pr(<m2, m3>) = 0.04:
name    mailing-addr
Alice   Sunnyvale
Bob     bob_at_
DT under <m3, m3>, Pr(<m3, m3>) = 0.01:
name    mailing-addr
Alice   alice_at_
Bob     bob_at_
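Under by-tuple semantics each source tuple picks its own mapping, so a possible target database corresponds to a sequence of mappings, and its probability is the product of the chosen mappings' probabilities. The brute-force enumeration below is a sketch only. The data are from the slides, while the projection query on mailing-addr and the names `by_tuple_worlds` and `by_tuple_answers` are assumptions.

```python
from collections import defaultdict
from itertools import product

SOURCE = [
    {"pname": "Alice", "email-addr": "alice_at_",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob_at_",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

# Mapping name -> (source attribute mapped to mailing-addr, probability).
P_MAPPING = {
    "m1": ("home-addr", 0.5),
    "m2": ("office-addr", 0.4),
    "m3": ("email-addr", 0.1),
}


def by_tuple_worlds(source, p_mapping):
    """Enumerate every sequence of mappings, one mapping per source tuple."""
    worlds = []
    for sequence in product(p_mapping, repeat=len(source)):
        prob = 1.0
        target = []
        for row, name in zip(source, sequence):
            addr_attr, p = p_mapping[name]
            prob *= p
            target.append((row["pname"], row[addr_attr]))
        worlds.append((sequence, target, prob))
    return worlds


def by_tuple_answers(worlds):
    """Probability that each mailing-addr value appears in the target."""
    answers = defaultdict(float)
    for _, target, prob in worlds:
        for value in {addr for _, addr in target}:
            answers[value] += prob
    return dict(answers)


worlds = by_tuple_worlds(SOURCE, P_MAPPING)
by_tuple = by_tuple_answers(worlds)
print(len(worlds))
for value, prob in sorted(by_tuple.items()):
    print(value, round(prob, 2))
```

The number of sequences is (number of mappings) raised to the number of source tuples, so this naive enumeration grows exponentially with the size of the data.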
Complexity of Query Answering
Answering queries is more expensive under by-tuple semantics.
                      By-table   By-tuple
Data complexity       PTIME      #P-complete
Mapping complexity    PTIME      PTIME
Summary of Chapter 13
- Uncertainty is everywhere in data integration
- Work in this area is really only beginning
- Great opportunity for further research.
- Probabilistic schema mappings
- By-table versus by-tuple semantics
- By-tuple semantics is computationally expensive, but restricted cases can be found where query answering is still polynomial.
- Where do the probabilities come from?
- Sometimes we interpret statistics as probabilities
- Sometimes the provenance of the data is more meaningful than the probabilities