Management of Probabilistic Data: Foundations and Challenges

1
Management of Probabilistic Data: Foundations and Challenges
  • Nilesh Dalvi and Dan Suciu
  • University of Washington

2
Databases Are Deterministic
  • Applications since 1970s required precise
    semantics
  • Accounting, inventory
  • Database tools are deterministic
  • A tuple is an answer or is not
  • Underlying theory assumes determinism
  • FO (First Order Logic)

3
Future of Data Management
  • We need to cope with uncertainties !
  • Represent uncertainties as probabilities
  • Extend data management tools to handle
    probabilistic data
  • Major paradigm shift affecting both foundations
    and systems

4
Uncertainties Everywhere
  • In the schema mappings
      • Data spaces
      • Pay as you go data integration
  • In the data mapping
      • Life science data integration
      • Object reconciliation, fuzzy joins
  • In the data itself
      • Data by the masses
      • Information Extraction
      • RFID data, sensor data

[Halevy'2007] [Philippi&Kohler'2006] [Arasu'06] [Gupta&Sarawagi'2006] [Welbourne'2007]
5
Example 1: Data Integration in Life Sciences
[B. Louie et al.'2007]
  • U2 integrates several biological databases

Example: find functional annotations of ABCD1
EntrezProtein, Pfam, TIGRFAM, NCBI Blast, EntrezGene
User types the gene ABCD1. U2 finds 80 related proteins and ranks them by uncertainty score. The 9 correct functions are among the top 11.
Need to represent uncertainties explicitly
6
Example 2: Information Extraction
"...52 A Goregaon West Mumbai ..."
[Gupta&Sarawagi'2006]
20% of such extractions are correct
Here probabilities are meaningful
7
Example 3: RFID Ecosystem at UW
[Welbourne'2007]
8
  • RFID data: noisy
      • SIGHTING(tagID, antennaID, time)
  • Derived data: probabilistic
      • John entered Room 524 at 9:15, prob = 0.6
      • John carried laptop x77 at 11:03, prob = 0.8
      • . . .
  • Queries
      • Which people were in Room 478 yesterday ?

Massive amounts of probabilistic data from RFIDs,
sensors
9
A Model for Uncertainties
  • Data is probabilistic
  • Queries formulated in a standard language
  • Answers are annotated with probabilities

This talk: Probabilistic Databases
10
Probabilistic Databases: Long History
  • [Cavallo&Pitarelli'1987]
  • [Barbara, Garcia-Molina, Porter'1992]
  • [Lakshmanan, Leone, Ross&Subrahmanian'1997]
  • [Fuhr&Roellke'1997]
  • [Dalvi&S'2004]
  • [Widom'2005]

Focus today: the Query Evaluation Problem
11
Has this been solved by AI ?
Fix q; input: DB
Input: KB
12
Outline
  • Data model
  • Query evaluation
  • Challenges

13
What is a Probabilistic Database (PDB) ?
[Barbara et al.'1992]
[Table: HasObject^p, with column groups Keys, Non-keys, Probability]
What does it mean ?
14
Background
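Finite probability space $(\Omega, P)$
  • $\Omega = \{\omega_1, \ldots, \omega_n\}$ = set of outcomes
  • $P : \Omega \to [0,1]$
  • $P(\omega_1) + \cdots + P(\omega_n) = 1$

Event: $E \subseteq \Omega$, $P(E) = \sum_{\omega \in E} P(\omega)$
Independent: $P(E_1 E_2) = P(E_1) P(E_2)$
Mutually exclusive or disjoint: $P(E_1 E_2) = 0$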
15
Possible Worlds Semantics
[Figure: a PDB and its possible worlds, with world probabilities such as $p_1 p_3$, $p_1 p_4$, $p_1 (1 - p_3 - p_4 - p_5)$]
16
Definitions
Definition: A tuple-disjoint/independent table is
R(A1, A2, ..., Am, B1, ..., Bn, P)
(A1, ..., Am: key attributes; B1, ..., Bn: non-key attributes; P: probability)
Definition: A tuple-independent table is
R(A1, A2, ..., Am, P)
Definition: Semantics is given by possible worlds
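The possible-worlds semantics of a tuple-disjoint/independent table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table contents, the probabilities, and the names `possible_worlds` and `HAS_OBJECT` are hypothetical and not taken from the talk. Alternatives that share a key are treated as mutually exclusive, and distinct keys as independent.

```python
from itertools import product

def possible_worlds(table):
    """Enumerate the possible worlds of a tuple-disjoint/independent table.

    table maps a key to a list of (non_key_value, probability) alternatives.
    Alternatives under the same key are disjoint; different keys are independent.
    Returns a list of (frozenset_of_tuples, probability).
    """
    choices = []
    for key, alternatives in table.items():
        options = [((key, value), p) for value, p in alternatives]
        absent = 1.0 - sum(p for _, p in alternatives)
        if absent > 1e-12:
            options.append((None, absent))  # no tuple with this key in the world
        choices.append(options)

    worlds = []
    for combo in product(*choices):
        prob = 1.0
        tuples = set()
        for chosen, p in combo:
            prob *= p  # independence across keys
            if chosen is not None:
                tuples.add(chosen)
        worlds.append((frozenset(tuples), prob))
    return worlds

# Hypothetical data in the shape HasObject(Object, Time, Person, P),
# with (Object, Time) as the key.
HAS_OBJECT = {
    ("Laptop77", "9:07"): [("John", 0.6), ("Jim", 0.3)],
    ("Book302", "9:18"): [("Mary", 0.5)],
}

if __name__ == "__main__":
    for world, prob in possible_worlds(HAS_OBJECT):
        print(round(prob, 3), sorted(world))
```

The number of worlds grows exponentially with the number of keys, which is why the talk treats query evaluation as a problem in its own right rather than enumerating worlds.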
17
HasObject(Object, Time, Person, P)
[Table: rows annotated Disjoint, Independent, Disjoint]
Meets(Person1, Person2, Time, P)
[Table: rows annotated Independent]
18
Query Semantics
A Boolean query q is an event: $\{\omega \mid \omega \models q\}$
$P(q) = \sum_{\omega \models q} P(\omega)$
Did someone take MyBook to the CoffeeRoom ?
q = HasObject(MyBook, x, t), EnterRoom(x, CoffeeRoom, t)
P(q) = 0.96
(meaning quite likely !)
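A brute-force reading of this definition sums the probabilities of all worlds in which the query is true. The sketch below assumes tuple-independent data. The tuples, their probabilities, and the helper names are made up for illustration, so it does not reproduce the 0.96 on the slide.

```python
from itertools import product

def query_probability(tuples, query):
    """P(q) = sum of P(world) over the worlds where query(world) is True.

    tuples maps each tuple to its marginal probability (tuples independent).
    query is a function from a set of tuples to bool.
    """
    items = list(tuples.items())
    total = 0.0
    for mask in product([False, True], repeat=len(items)):
        world = {t for (t, _), keep in zip(items, mask) if keep}
        prob = 1.0
        for (_, p), keep in zip(items, mask):
            prob *= p if keep else 1.0 - p
        if query(world):
            total += prob
    return total

# Hypothetical tuples, following the argument order used in q above.
DB = {
    ("HasObject", "MyBook", "John", 1): 0.8,
    ("EnterRoom", "John", "CoffeeRoom", 1): 0.5,
    ("HasObject", "MyBook", "Mary", 2): 0.5,
    ("EnterRoom", "Mary", "CoffeeRoom", 2): 0.4,
}

def took_book_to_coffee_room(world):
    """q = HasObject(MyBook, x, t), EnterRoom(x, CoffeeRoom, t)."""
    for rel, obj, person, time in world:
        if rel == "HasObject" and obj == "MyBook":
            if ("EnterRoom", person, "CoffeeRoom", time) in world:
                return True
    return False

if __name__ == "__main__":
    print(query_probability(DB, took_book_to_coffee_room))
```

With these hypothetical numbers the result is 1 - (1 - 0.8 * 0.5)(1 - 0.5 * 0.4) = 0.52, up to floating-point rounding.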
19
Discussion of Data Model
  • Tuple-disjoint/independent tables
      • Simple model, can store in any DBMS
  • More advanced models
      • Symbolic Boolean expressions [Fuhr and Roellke]
      • Trio: add lineage [Widom'05, Das Sarma'06, Benjelloun'06]
      • Probabilistic Relational Models [Getoor'2006]
      • Graphical models [Sen&Deshpande'07]
20
Outline
  • Data model
  • Query evaluation
  • Probability of Boolean expressions
  • From queries to Boolean expressions
  • Data complexity of query evaluation
  • Challenges

21
Probability of Boolean Expressions
$\Phi = X_1 X_2 \vee X_1 X_3 \vee X_2 X_3$
$P(X_1) = p_1$, $P(X_2) = p_2$, $P(X_3) = p_3$
Compute $P(\Phi)$
$\Pr(\Phi) = (1-p_1) p_2 p_3 + p_1 (1-p_2) p_3 + p_1 p_2 (1-p_3) + p_1 p_2 p_3$
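The four-term sum for this expression can be checked by walking the truth table. The sketch below is a brute-force illustration, exponential in the number of variables. The function names and the sample probabilities are assumptions made for the example.

```python
from itertools import product

def dnf_probability(clauses, probs):
    """Probability that a positive DNF is true under independent variables.

    clauses is a list of tuples of variable names (each tuple is a conjunction).
    probs maps each variable name to P(variable = True).
    """
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        if any(all(assignment[v] for v in clause) for clause in clauses):
            weight = 1.0
            for name in names:
                weight *= probs[name] if assignment[name] else 1.0 - probs[name]
            total += weight
    return total

# Phi = X1 X2 v X1 X3 v X2 X3
PHI = [("X1", "X2"), ("X1", "X3"), ("X2", "X3")]

def closed_form(p1, p2, p3):
    """The four satisfying rows of the truth table, written out term by term."""
    return ((1 - p1) * p2 * p3 + p1 * (1 - p2) * p3
            + p1 * p2 * (1 - p3) + p1 * p2 * p3)

if __name__ == "__main__":
    p = {"X1": 0.2, "X2": 0.5, "X3": 0.9}  # hypothetical probabilities
    print(dnf_probability(PHI, p), closed_form(0.2, 0.5, 0.9))
```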
22
Background
Fix $P(X_1) = P(X_2) = \cdots = P(X_n) = 1/2$
23
Query q, Database PDB $\to$ $\Phi$
q = R(x, y), S(x, z)
[Figure: PDB with tables R^p and S^p]
$\Phi = X_1 Y_1 \vee X_1 Y_2 \vee X_2 Y_3 \vee X_2 Y_4 \vee X_2 Y_5$
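How a query and a database yield such an expression can be sketched as follows: each tuple gets a Boolean variable, and every pair of joining tuples contributes one conjunct. The attribute values below are hypothetical, since the slide's tables did not survive. They are chosen so that the result has the same shape as the expression above.

```python
def lineage(r_tuples, s_tuples):
    """Lineage of q = R(x, y), S(x, z) as a list of DNF clauses.

    r_tuples: list of (variable_name, (x, y))
    s_tuples: list of (variable_name, (x, z))
    Each pair of tuples that agree on x contributes one clause.
    """
    clauses = []
    for r_var, (r_x, _y) in r_tuples:
        for s_var, (s_x, _z) in s_tuples:
            if r_x == s_x:
                clauses.append((r_var, s_var))
    return clauses

# Hypothetical contents for R^p and S^p.
R = [("X1", ("a1", "b1")), ("X2", ("a2", "b2"))]
S = [
    ("Y1", ("a1", "c1")), ("Y2", ("a1", "c2")),
    ("Y3", ("a2", "c3")), ("Y4", ("a2", "c4")), ("Y5", ("a2", "c5")),
]

if __name__ == "__main__":
    print(" v ".join(r + s for r, s in lineage(R, S)))
```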
24
Application to Query Evaluation
Corollary: Fix an FO query q. Exact evaluation of Pr(q) on input PDB is in #P
Corollary: Fix a conjunctive query q. Approximation of Pr(q) on input PDB is in PTIME (FPTRAS)
[Graedel, Gurevich, Hirsch'1998]
25
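Background: Probabilistic Networks
q = R(x, y), S(x, z)
$\Phi = X_1 Y_1 \vee X_1 Y_2 \vee X_2 Y_3 \vee X_2 Y_4 \vee X_2 Y_5$
  • Inference hard in general
  • KR techniques exploit local properties
  • E.g. bounded treewidth $\Rightarrow$ PTIME

[Zabiyaka&Darwiche'06]
Note: for this query the treewidth is unbounded
[Figure: network for $\Phi$ with $\vee$ and $\wedge$ nodes over variables X1, X2 (probabilities p1, p2) and Y1, ..., Y5 (probabilities q1, ..., q5)]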
26
[D&S'2004]
Safe plan for q = R(x, y), S(x, z)
The data complexity of this query is PTIME
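One way to see why this query is tractable is that its probability can be assembled from independent pieces: project y out of R and z out of S independently for each x, join on x, then project x out. The sketch below follows that plan for tuple-independent tables. It is an illustration, not necessarily the exact plan drawn on the slide, and the table contents are hypothetical.

```python
from collections import defaultdict

def independent_project(table):
    """For each x, P(some tuple with that x exists) = 1 - prod(1 - p)."""
    miss = defaultdict(lambda: 1.0)
    for (x, _other), p in table.items():
        miss[x] *= 1.0 - p
    return {x: 1.0 - m for x, m in miss.items()}

def safe_plan_probability(r, s):
    """P(q) for q = R(x, y), S(x, z) over tuple-independent tables.

    r maps (x, y) to a probability, s maps (x, z) to a probability.
    Runs in time linear in the size of the tables.
    """
    r_x = independent_project(r)  # project y out of R
    s_x = independent_project(s)  # project z out of S
    none = 1.0
    for x in r_x.keys() & s_x.keys():
        # Join on x: the two events are independent, so multiply.
        # Different x values are independent, so multiply the misses.
        none *= 1.0 - r_x[x] * s_x[x]
    return 1.0 - none  # project x out

# Hypothetical tables.
R_TABLE = {("a1", "b1"): 0.5, ("a2", "b2"): 0.4}
S_TABLE = {("a1", "c1"): 0.5, ("a1", "c2"): 0.5, ("a2", "c3"): 0.1}

if __name__ == "__main__":
    print(safe_plan_probability(R_TABLE, S_TABLE))
```

For these numbers the plan gives 1 - (1 - 0.5 * 0.75)(1 - 0.4 * 0.1) = 0.4, up to floating-point rounding, without enumerating any possible worlds.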
27
Dichotomy Theorem
Let q be a conjunctive query without self-joins
  • Theorem: One of the following holds
  • (1) Either q is in PTIME
  • (2) Or q is #P-hard

[D&S'2004]
In Case (1) q can be computed by a safe plan, and we call it a safe query
[Andritsos et al.'2006]
28
#P-Hard Queries
  • h1 = R(x), S(x, y), T(y)
  • h2 = R(x, y), S(y)
  • h3 = R(x, y), S(x, y)
  • . . .

PTIME Queries
  • R(x, y), S(x, z)
  • R(x, y), S(y), T(a, y)
  • R(x), S(x, y), T(y), U(u, y), W(a, u)
  • . . .

How do we decide if a query is in PTIME or #P-hard ?
29
Hierarchical Queries
sg(x) = set of subgoals containing the variable x in a key position
Definition: A query q is hierarchical if for all x, y:
$sg(x) \subseteq sg(y)$ or $sg(x) \supseteq sg(y)$ or $sg(x) \cap sg(y) = \emptyset$
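The definition translates directly into a pairwise check on the sets sg(x). The sketch below assumes the tuple-independent case, where every attribute is a key position, and represents a query as a list of (relation name, variables) pairs. The representation and the function names are choices made for this example.

```python
from itertools import combinations

def subgoals_of(query):
    """Map each variable to the set of subgoal indexes it occurs in."""
    sg = {}
    for index, (_name, variables) in enumerate(query):
        for v in variables:
            sg.setdefault(v, set()).add(index)
    return sg

def is_hierarchical(query):
    """True if for all x, y: sg(x) and sg(y) are nested or disjoint."""
    sg = subgoals_of(query)
    for x, y in combinations(sg, 2):
        a, b = sg[x], sg[y]
        if not (a <= b or b <= a or a.isdisjoint(b)):
            return False
    return True

H1 = [("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))]
Q = [("R", ("x", "y")), ("S", ("x", "z"))]

if __name__ == "__main__":
    print(is_hierarchical(H1))  # sg(x) = {R, S} and sg(y) = {S, T} overlap
    print(is_hierarchical(Q))   # sg(y) and sg(z) are both inside sg(x)
```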
30
Case 1: Independent Tuples Only
[D&S'2004]
PTIME Queries
Fact: If q is hierarchical then q is in PTIME
  • The hierarchy gives the safe plan !
  • Root variable u $\to$ $\pi_{-u}$
  • Connected components $\to$ Join

31
Case 1: Independent Tuples Only
[D&S'2004]
#P-hard Queries
Recall: h1 = R(x), S(x, y), T(y)
h1 is #P-hard (reduction from Partitioned Positive 2DNF)
[Provan&Ball'83]
Fact: If q is non-hierarchical then it is #P-hard.
Proof: it contains h1: q = . . . R(x, . . .), S(x, y, . . .), T(y, . . .) . . .
Theorem: Testing if q is PTIME or #P-hard is in AC^0
32
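Case 2: Independent/disjoint Tuples
PTIME Queries
q = R(x), S(x, y), T(y), U(u, y), W(a, u)
  • Root variable $\to$ $\pi^I$ (independent project)
  • CCs $\to$ Join
  • Constant key attrs $\to$ $\pi^D$

[Figure: safe plan tree with operators $\pi^D_{-u}$, Join_u, $\pi^D_{-y}$, Join_y, $\pi^I_{-x}$, Join_x over W^p(a, u), T^p(y), U^p(u, y), R^p(x), S^p(x, y), next to the hierarchy of the variables x, y, u over the subgoals R, S, T, U, W]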
33
Case 2: Independent/disjoint Tuples
#P-hard Queries
Recall: h1 = R(x), S(x, y), T(y)
h2 = R(x, y), S(y)
#P-hard by reduction from PERMANENT
h3 = R(x, y), S(x, y)
If the safe-plan algorithm fails on q, then q can
be rewritten to either h1 or h2 or h3 and
hence is #P-hard (see paper for details)
Theorem: Testing if q is PTIME or #P-hard is PTIME-complete
34
Summary on Query Evaluation
  • We understand completely only queries w/o
    self-joins
  • Lessons learned from our system MystiQ
      • When the query is safe
          • Evaluate it exactly, in the database engine
          • Performance close to regular SQL
      • When the query is unsafe
          • Approximate it, compute only top-k
          • Performance one or two orders of magnitude worse

[Re'2007]
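For unsafe queries the summary above points to approximation. The simplest version of that idea is naive Monte Carlo sampling over the lineage expression: sample the Boolean variables independently and count how often the expression is true. The sketch below uses this naive estimator only to illustrate the principle. It is not claimed to be the algorithm used in MystiQ, and the sample count and seed are arbitrary choices.

```python
import random

def estimate_dnf_probability(clauses, probs, samples=20000, seed=0):
    """Naive Monte Carlo estimate of the probability of a positive DNF.

    clauses is a list of tuples of variable names (each tuple is a conjunction).
    probs maps each variable name to P(variable = True), variables independent.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        world = {v: rng.random() < p for v, p in probs.items()}
        if any(all(world[v] for v in clause) for clause in clauses):
            hits += 1
    return hits / samples

if __name__ == "__main__":
    # Phi = X1 X2 v X1 X3 v X2 X3 with all probabilities 1/2; exact value is 0.5.
    phi = [("X1", "X2"), ("X1", "X3"), ("X2", "X3")]
    half = {"X1": 0.5, "X2": 0.5, "X3": 0.5}
    print(estimate_dnf_probability(phi, half))
```

The estimate converges at a rate of about one over the square root of the number of samples, so it is cheap when a coarse answer is enough and expensive when the true probability is very small.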
35
Outline
  • Data model
  • Query evaluation
  • Challenges

36
Query Optimization
[Re'2007, Re'2007b]
  • Even a #P-hard query often has subqueries that
    are in PTIME
  • Needed:
      • Combine safe plans with probabilistic inference
      • "Interesting" independence/disjointness
      • Model a probabilistic engine as black-box

CHALLENGE: Integrate a black-box probabilistic
inference in a query processor.
37
Probabilistic Inference Algorithms
  • Open the box ! Logical to physical
  • Examine specific algorithms from KR
      • Variable elimination
      • Junction trees
      • Bounded treewidth

[Sen&Deshpande'2007]
[Bravo&Ramakrishnan'2007]
CHALLENGE: (1) Study the space of optimization
alternatives. (2) Estimate the cost of specific
probabilistic inference algorithms.
38
Open Theory Problems
  • Self-joins are much harder to study
      • Solved only for independent tuples
  • Extend to richer query language
      • Unions, predicates (<, . . .), aggregates
  • Do hardness results still hold for Pr = 1/2 ?

[D&S'2007]
CHALLENGE: Complete the analysis of the query
complexity over probabilistic databases
39
Complex Probabilistic Model
  • Independent and disjoint tuples are insufficient
    for real applications
  • Capturing complex correlations
      • Lineage
      • Graphical models

[Das Sarma'06, Benjelloun'06]
[Getoor'06, Sen&Deshpande'07]
CHALLENGE: Explore the connection between complex
models and views
[Verma&Pearl'1990]
40
Constraints
[Shen'06, Andritsos'06, Richardson'06, Chaudhuri'07]
  • Needed to clean uncertainties in the data
  • Hard constraints
      • Semantics: conditional probability
  • Soft constraints
      • What is the semantics ?
  • Lots of prior work, but still little understood

CHALLENGE: Study the impact of hard/soft
constraints on query evaluation
41
Information Leakage
[Evfimievski'03, Miklau&S'04, DMS'05]
  • A view V should not leak information about a
    secret S
  • Issues: Which prior P ? What is $\approx$ ?
  • Probability Logic
  • $U \Rightarrow V$ means $P(V \mid U) \approx 1$

$P(S) \approx P(S \mid V)$
[Pearl'88, Adams'98]
CHALLENGE: Define a probability logic for
reasoning about information leakage
42
Conclusions
  • Prohibitive cost of cleaning data
  • Represent uncertainties explicitly
  • Need to re-examine many assumptions

A call to arms: the management of probabilistic data