Chapter 13: Incorporating Uncertainty into Data Integration

1
Chapter 13: Incorporating Uncertainty into Data Integration
PRINCIPLES OF DATA INTEGRATION
ANHAI DOAN, ALON HALEVY, ZACHARY IVES
2
Outline
  • Sources of uncertainty in data integration
  • Representing uncertain data (brief overview)
  • Probabilistic schema mappings

3
Managing Uncertain Data
  • Databases typically model certain data
  • A tuple is either true (in the database) or false
    (not in the database).
  • Real life involves a lot of uncertainty:
    • The thief had either blond or brown hair.
    • The sensor reading is often unreliable.
  • Uncertain databases try to model such uncertain
    data and to answer queries in a principled
    fashion.
  • Data integration involves multiple facets of
    uncertainty!

4
Uncertainty in Data Integration
  • Data itself may be uncertain (perhaps it is
    extracted from an unreliable source)
  • Schema mappings can be approximate (perhaps
    created by an automatic tool)
  • Reference reconciliation (and hence joins) is
    approximate
  • If the domain is broad enough, even the mediated
    schema could involve uncertainty
  • Queries, often posed as keywords, have uncertain
    intent.

5
Outline
  • Sources of uncertainty in data integration
  • Representing uncertain data (brief overview)
  • Probabilistic schema mappings

6
Principles of Uncertain Databases
  • Instead of describing one possible state of the
    world, an uncertain database describes a set of
    possible worlds.
  • The expressive power of the data model determines
    which sets of possible worlds the database can
    represent:
    • Is uncertainty on values of an attribute?
    • Or on the presence of a tuple?
    • Can dependencies between tuples be represented?

7
C-Tables: Uncertainty without Probabilities
  • Alice and Bob want to go on a vacation together,
    but will go to either Tahiti or Ulaanbaatar.
    Candace will definitely go to Ulaanbaatar.
  • Possible worlds result from different assignments
    to the variables.
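
The c-table from this slide is not reproduced in the transcript, so the following is only a minimal sketch of the idea. It assumes a simplified c-table in which Alice and Bob share one hypothetical variable `x` in the destination column, while Candace has a constant. The names `VARIABLES`, `C_TABLE` and `possible_worlds` are illustrative, not from the book, and real c-tables also allow conditions on tuples.

```python
from itertools import product

# Hypothetical, simplified c-table: a destination is either a constant or
# the name of a variable. Alice and Bob share the variable "x".
VARIABLES = {"x": ["Tahiti", "Ulaanbaatar"]}
C_TABLE = [
    ("Alice", "x"),
    ("Bob", "x"),
    ("Candace", "Ulaanbaatar"),
]

def possible_worlds(c_table, variables):
    """Yield one ordinary relation per assignment of values to variables."""
    names = sorted(variables)
    for values in product(*(variables[n] for n in names)):
        assignment = dict(zip(names, values))
        # Replace variables by their assigned value; constants stay as they are.
        yield [(who, assignment.get(dest, dest)) for who, dest in c_table]

worlds = list(possible_worlds(C_TABLE, VARIABLES))
for world in worlds:
    print(world)
```

Because Alice and Bob share the same variable, every generated world sends them to the same place, which is the kind of dependency between tuples that this representation can express.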

8
Representing Complex Distributions
  • The c-table represents mutual exclusion of
    tuples, but does not represent probability
    distributions.
  • Representing complex probability distributions
    and correlations between tuples requires using
    probabilistic graphical models.
  • A couple of simpler models:
    • Independent tuple probabilities
    • Block independent probabilities

9
Tuple Independent Model
  • Assign each tuple a probability.
  • The probability of every possible world is the
    product of one factor per row: p_i if row i is
    in the database, and (1 - p_i) if it is not.
  • Cannot represent correlations between tuples.
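
As a rough illustration of the product rule above, the sketch below enumerates every possible world of a tiny tuple-independent table. The two tuples and their probabilities are made up for the example, and `world_probabilities` is a hypothetical helper name.

```python
from itertools import product

# Hypothetical tuples and probabilities, chosen only for illustration.
TUPLES = {
    ("Alice", "Tahiti"): 0.7,
    ("Bob", "Tahiti"): 0.4,
}

def world_probabilities(tuple_probs):
    """Map each possible world (a set of rows) to its probability."""
    rows = list(tuple_probs)
    result = {}
    # Each row is independently present or absent: 2^n worlds in total.
    for present in product([True, False], repeat=len(rows)):
        prob = 1.0
        for row, is_in in zip(rows, present):
            p = tuple_probs[row]
            prob *= p if is_in else 1.0 - p
        world = frozenset(r for r, is_in in zip(rows, present) if is_in)
        result[world] = prob
    return result

worlds = world_probabilities(TUPLES)
for world, prob in worlds.items():
    print(sorted(world), round(prob, 4))
```

The probabilities of all worlds sum to 1. Since every row contributes its own independent factor, there is no way to state that two rows must appear together.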

10
Block Independent Model
  • You choose one tuple from every block according
    to the distribution of that block.
  • Can represent mutual exclusion, but not
    co-dependence (i.e., Alice and Bob going to the
    same location).
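
A minimal sketch of the block-independent model, assuming one block per person and made-up probabilities that sum to 1 within each block. `BLOCKS` and `block_worlds` are hypothetical names used only here.

```python
from itertools import product

# Hypothetical blocks: exactly one destination is chosen per person, and the
# probabilities inside each block are assumed to sum to 1.
BLOCKS = {
    "Alice": {"Tahiti": 0.6, "Ulaanbaatar": 0.4},
    "Bob": {"Tahiti": 0.3, "Ulaanbaatar": 0.7},
}

def block_worlds(blocks):
    """Map each world (one tuple chosen per block) to its probability."""
    keys = list(blocks)
    result = {}
    for choice in product(*(blocks[k].items() for k in keys)):
        world = tuple((k, value) for k, (value, _) in zip(keys, choice))
        prob = 1.0
        for _, p in choice:
            prob *= p  # blocks are independent of each other
        result[world] = prob
    return result

worlds = block_worlds(BLOCKS)
# Probability that Alice and Bob happen to pick the same destination.
same_place = sum(p for w, p in worlds.items() if w[0][1] == w[1][1])
print(worlds)
print(same_place)
```

Within a block the alternatives are mutually exclusive, but the chance that Alice and Bob end up in the same place is fixed by the product of the two block distributions. It cannot be forced to 1 without making both blocks deterministic.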

11
Outline
  • Sources of uncertainty in data integration
  • Representing uncertain data (brief overview)
  • Probabilistic schema mappings

12
Probabilistic Schema Mappings
  • Source schema
  • S(pname, email-addr, home-addr, office-addr)
  • Target schema
  • T(name, mailing-addr)
  • We may not be sure which attribute of S
    corresponds to mailing-addr.
  • Probabilistic schema mappings let us handle such
    uncertainty.

13
Probabilistic Schema Mappings
Intuitively, we want to give each mapping a
probability
  • S(pname, email-addr, home-addr, office-addr)
  • T(name, mailing-addr)

Possible Mapping                              Probability
(pname, name), (home-addr, mailing-addr)      0.5
(pname, name), (office-addr, mailing-addr)    0.4
(pname, name), (email-addr, mailing-addr)     0.1
14
What are the Semantics?
  • S(pname, email-addr, home-addr, office-addr)
  • T(name, mailing-addr)

Possible Mapping                              Probability
(pname, name), (home-addr, mailing-addr)      0.5
(pname, name), (office-addr, mailing-addr)    0.4
(pname, name), (email-addr, mailing-addr)     0.1

Should a single mapping apply to the entire
table (by-table semantics), or can different
mappings apply to different tuples (by-tuple
semantics)?
15
By-Table versus By-Tuple Semantics
Ds:
pname    email-addr    home-addr        office-addr
Alice    alice@        Mountain View    Sunnyvale
Bob      bob@          Sunnyvale        Sunnyvale

There are 3 possible databases DT, one per mapping:

DT under m1 (home-addr), Pr(m1) = 0.5:
name     mailing-addr
Alice    Mountain View
Bob      Sunnyvale

DT under m2 (office-addr), Pr(m2) = 0.4:
name     mailing-addr
Alice    Sunnyvale
Bob      Sunnyvale

DT under m3 (email-addr), Pr(m3) = 0.1:
name     mailing-addr
Alice    alice@
Bob      bob@
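
To make the by-table semantics concrete, here is a minimal sketch that computes answer probabilities for one query. The query itself (return the distinct mailing-addr values of T) is an assumption made for illustration, as are the names `SOURCE`, `P_MAPPING` and `by_table_answers`. The data and the mapping probabilities are the ones on the slides.

```python
# Source instance Ds from the slide.
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

# Each possible mapping is reduced to the source attribute that it maps to
# mailing-addr, paired with the probability of that mapping.
P_MAPPING = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

def by_table_answers(source, p_mapping):
    """Probability of each mailing-addr value when ONE mapping is applied
    to the whole table."""
    answers = {}
    for attr, prob in p_mapping:
        # The target database produced by this single mapping.
        target = {row[attr] for row in source}
        for value in target:
            answers[value] = answers.get(value, 0.0) + prob
    return answers

answers = by_table_answers(SOURCE, P_MAPPING)
for value, prob in sorted(answers.items()):
    print(value, round(prob, 4))
```

Sunnyvale is an answer under both m1 and m2, so its probability is 0.5 + 0.4 = 0.9. Only one pass over the possible mappings is needed, since each mapping yields exactly one target database.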
16
By-Table versus By-Tuple Semantics
Ds:
pname    email-addr    home-addr        office-addr
Alice    alice@        Mountain View    Sunnyvale
Bob      bob@          Sunnyvale        Sunnyvale

There are 9 possible databases DT, one per sequence
<mapping for Alice, mapping for Bob>. Three of them:

DT under <m1,m3>, Pr(<m1,m3>) = 0.05:
name     mailing-addr
Alice    Mountain View
Bob      bob@

DT under <m2,m3>, Pr(<m2,m3>) = 0.04:
name     mailing-addr
Alice    Sunnyvale
Bob      bob@

DT under <m3,m3>, Pr(<m3,m3>) = 0.01:
name     mailing-addr
Alice    alice@
Bob      bob@
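
The by-tuple semantics can be sketched by brute force: enumerate every sequence of mappings, one per source tuple, and weight each sequence by the product of its mapping probabilities. As an assumption for illustration, the query again returns the distinct mailing-addr values of T. `SOURCE`, `P_MAPPING` and `by_tuple_answers` are hypothetical names, and this naive enumeration is not the book's algorithm.

```python
from itertools import product

# Source instance Ds from the slide.
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

# Source attribute mapped to mailing-addr, with the mapping's probability.
P_MAPPING = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

def by_tuple_answers(source, p_mapping):
    """Probability of each mailing-addr value when every tuple may use a
    different mapping."""
    answers = {}
    # One mapping per tuple: len(p_mapping) ** len(source) sequences.
    for sequence in product(p_mapping, repeat=len(source)):
        prob = 1.0
        target = set()
        for row, (attr, p) in zip(source, sequence):
            prob *= p
            target.add(row[attr])
        for value in target:
            answers[value] = answers.get(value, 0.0) + prob
    return answers

answers = by_tuple_answers(SOURCE, P_MAPPING)
for value, prob in sorted(answers.items()):
    print(value, round(prob, 4))
```

With two tuples and three mappings the loop visits 3^2 = 9 sequences, and Sunnyvale comes out at 0.94. The number of sequences grows exponentially with the number of source tuples, which gives some intuition for why by-tuple query answering is harder in general.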
17
Complexity of Query Answering
Answering queries is more expensive under
by-tuple semantics.

                      By-table    By-tuple
Data complexity       PTIME       #P-complete
Mapping complexity    PTIME       PTIME
18
Summary of Chapter 13
  • Uncertainty is everywhere in data integration
    • Work in this area is really only beginning
    • Great opportunity for further research
  • Probabilistic schema mappings
    • By-table versus by-tuple semantics
    • By-tuple semantics is computationally expensive,
      but restricted cases can be found where query
      answering is still polynomial
  • Where do the probabilities come from?
    • Sometimes we interpret statistics as
      probabilities
    • Sometimes the provenance of the data is more
      meaningful than the probabilities