Title: Structured Queries for Semistructured Probabilistic Data
1Structured Queries for Semistructured
Probabilistic Data
- Alex Dekhtyar
- Kevin Krol Mathias
- Praveen Gutti
Department of Computer Science University of
Kentucky
2Outline
- Semistructured Probabilistic Objects (SPOs)
- Why?
- How?
- Semistructured Probabilistic Algebra
(SP-Algebra) - SPOQL
3The Story of the SPO model
Year 2000
- Reasoning with uncertainty in AI applications
Bayesian Networks
Markov Decision Processes
4The Story of the SPO model
Year 2000
- Reasoning with uncertainty in AI applications
Bayesian Networks
Markov Decision Processes
- Probability Distributions
- discrete
- finite domains
- conditional (CPTs)
- with meta-information?
The probability distribution of performance in a
database class for CS majors who got A in Data
Structures
5The Story of the SPO model
major CS
Conditional
Probability Table
meta-data
The probability distribution of performance in a
database class for CS majors who got A in Data
Structures
Data StructuresA
6The Story of the SPO model
major CS
Data StructuresA
Probabilistic Object
- Barbara,Garcia-Molina, Porter, 1992
- Lakshmanan, Leone, Ross, Subrahmanian, 1997
- Dey, Sarkar, 1996
- Eiter, Liu, Lukasiewicz, Subrahmanian, 2001
7The Story of the SPO model
major CS
Data StructuresA
Probabilistic Object
Semistricutured
(SPO)
Should be treated as records from the same
relation
8SPO Birds Eye View
? S1
T context (meta-data)
S ltT,V,P,C,?gt
V participating random variables
(any additional information about the
distribution that we know, which isnt stochastic)
P probability table Pdom(V)? 0,1
C conditionals
? path, internal pseudo-id
9Outline
- Semistructured Probabilistic Objects (SPOs)
- Why?
- How?
- Semistructured Probabilistic Algebra
(SP-Algebra) - SPOQL
10What questions would you ask?
- Find all probability distributions for CS majors?
- Find all probability distributions involving
Databases - Find all probability distributions for students
who got A in Data Structures - Find the marginal probability distribution of
performance in OS - Find the probability distribution of performance
in AI for students with A in Data Structures
and Databases.
11What questions would you ask?
- Find the probability of students getting an A
in Databases - Find all grade combinations which have
probability over 9 - Find the joint distributions of grades from the
data in the database
12SP-Algebra
Selection (s) Projection (p) Conditionalization
(µ) Cartesian Product (x) Join ( , ) Mix (
)
13Selection
On context
On participating variables
On conditionals
14Selection
On probabilities
On probability table
15Selection
On probabilities
On probability table
16Projection
On context On conditionals On participating
variables
marginalization
17Conditionalization
Finds conditional probability distributions
- Remove rows not matching condition
- normalize
- 3. add conditional
18Cartesian ProductJoin, Mix
Compatibility
- different participating
- variables
- matching conditionals
X
- common participating
- variables
- matching conditionals
Join or cartesian product, depending on
compatibility
19Cartesian ProductJoin, Mix
X Z Pr
X Y Pr
A A B B C C
A A B B C C
20Cartesian ProductJoin, Mix
P(ZX)
X Z Pr
X Y Pr
A A B B C C
21Cartesian ProductJoin, Mix
X Z Pr
X Y Pr
A A B B C C
X
X
X
22SP-Algebra
Implementation
Selection (s) Projection (p) Conditionalization
(µ) Cartesian Product (x) Join ( , ) Mix (
)
- on top of RDBMS
- JDBC
- needs views
- XML I/O
- mediator
Path (?) of the result SP-Algebra operation
23Outline
- Semistructured Probabilistic Objects (SPOs)
- Why?
- How?
- Semistructured Probabilistic Algebra
(SP-Algebra) - SPOQL
24From Algebra to Structured Queries
Whats more readable?
pDB,AI((AI in V)(Plt0.2)(µOSA(S))) x
smajorCS( pDS,Networks(sPgt0.1 (R) x T))
or
SELECT S.var.DB, S.var.AI, RT.var.DS,
RT.var.Networks FROM S TIMES (R TIMES T) RT WHERE
EXISTS S.var.AI AND S.tbl.problt0.2 AND
R.tbl.probgt0.1 AND RT.major
CS CONDITIONAL S.var.OS A
Declarative query language appealed to
application programmers
25SPOQL Syntax
p
SELECT ltselectlistgt FROM ltfromlistgt WHERE
ltexpressiongt CONDITIONAL ltconditionalizationsgt
X
s
µ
26SPOQL Syntax
SELECT var.DB, S.cnt.major, R.cnd.OS FROM (S JOIN
R) SR, (T TIMES W) WHERE cnt.year gt 1999 AND
S.cnd.DS A AND
EXISTS T.var.DB AND SR.tbl.P lt
0.1 AND tbl.DB A CONDITIONAL
W.var.DS A AND
R.var.OS B
27SPOQL Syntax
Variable projection on final result
SELECT var.DB, S.cnt.major, R.cnd.OS FROM (S JOIN
R) SR, (T TIMES W) WHERE cnt.year gt 1999 AND
S.cnd.DS A AND
EXISTS T.var.DB AND SR.tbl.P lt
0.1 AND tbl.DB A CONDITIONAL
W.var.DS A AND
R.var.OS B
28SPOQL Syntax
Projection of context and conditionals
SELECT var.DB, S.cnt.major, R.cnd.OS FROM (S JOIN
R) SR, (T TIMES W) WHERE cnt.year gt 1999 AND
S.cnd.DS A AND
EXISTS T.var.DB AND SR.tbl.P lt
0.1 AND tbl.DB A CONDITIONAL
W.var.DS A AND
R.var.OS B
29SPOQL Syntax
Join, alias for intermediate result
SELECT var.DB, S.cnt.major, R.cnd.OS FROM (S JOIN
R) SR, (T TIMES W) WHERE cnt.year gt 1999 AND
S.cnd.DS A AND
EXISTS T.var.DB AND SR.tbl.P lt
0.1 AND tbl.DB A CONDITIONAL
W.var.DS A AND
R.var.OS B
30SPOQL Syntax
Cartesian product, no alias
SELECT var.DB, S.cnt.major, R.cnd.OS FROM (S JOIN
R) SR, (T TIMES W) WHERE cnt.year gt 1999 AND
S.cnd.DS A AND
EXISTS T.var.DB AND SR.tbl.P lt
0.1 AND tbl.DB A CONDITIONAL
W.var.DS A AND
R.var.OS B
31SPOQL Syntax
mix
SELECT var.DB, S.cnt.major, R.cnd.OS FROM (S JOIN
R) SR, (T TIMES W) WHERE cnt.year gt 1999 AND
S.cnd.DS A AND
EXISTS T.var.DB AND SR.tbl.P lt
0.1 AND tbl.DB A CONDITIONAL
W.var.DS A AND
R.var.OS B
32SPOQL Syntax
Selection on content applied to final SP-relation
SELECT var.DB, S.cnt.major, R.cnd.OS FROM (S JOIN
R) SR, (T TIMES W) WHERE cnt.year gt 1999 AND
S.cnd.DS A AND
EXISTS T.var.DB AND SR.tbl.P lt
0.1 AND tbl.DB A CONDITIONAL
W.var.DS A AND
R.var.OS B
Selection on conditions
33SPOQL Syntax
Selection on participating random variables
SELECT var.DB, S.cnt.major, R.cnd.OS FROM (S JOIN
R) SR, (T TIMES W) WHERE cnt.year gt 1999 AND
S.cnd.DS A AND
EXISTS T.var.DB AND SR.tbl.P lt
0.1 AND tbl.DB A CONDITIONAL
W.var.DS A AND
R.var.OS B
34SPOQL Syntax
Selection on probabilities applied to
intermediate result
SELECT var.DB, S.cnt.major, R.cnd.OS FROM (S JOIN
R) SR, (T TIMES W) WHERE cnt.year gt 1999 AND
S.cnd.DS A AND
EXISTS T.var.DB AND SR.tbl.P lt
0.1 AND tbl.DB A CONDITIONAL
W.var.DS A AND
R.var.OS B
Selection on probability table
35SPOQL Syntax
SELECT var.DB, S.cnt.major, R.cnd.OS FROM (S JOIN
R) SR, (T TIMES W) WHERE cnt.year gt 1999 AND
S.cnd.DS A AND
EXISTS T.var.DB AND SR.tbl.P lt
0.1 AND tbl.DB A CONDITIONAL
W.var.DS A AND
R.var.OS B
Conditioning instructions
36SPOQL Syntax
SELECT var.DB, S.cnt.major, R.cnd.OS FROM (S JOIN
R) SR, (T TIMES W) WHERE cnt.year gt 1999 AND
S.cnd.DS A AND
EXISTS T.var.DB AND SR.tbl.P lt
0.1 AND tbl.DB A CONDITIONAL
W.var.DS A AND
R.var.OS B
37SPOQL Semantics. Order of operations
p
s
µ
Only operations that explicitly name S are
performed on S
X
1. Perform all conditionalizations
p
p
2. Perform all selections
3. Perform all projections
s
s
4. Join/CP/Mix with other parts of query
µ
µ
Selections on probability and probability tables
do not commute with other operations
R
S
Order of operations _at_ translation time matters!
38SPOQL Semantics. Questions
- What if need different order?
- How to select which chains to combine
(join/cp/mix)?
39SPOQL Semantics. Questions
- What if need different order?
- nested queries in ltfromlistgt
- How to select which chains to combine
(join/cp/mix)? - instantiate parentheses
- left-to-right
- correct for joins on context
40SPOQL Implementation
- Implemented on top of SPDBMS
- Queries converted into SP-Algebra
- No optimization (yet)
41SPOQL Conclusions
- SPO model stores joint, conditional distributions
of discrete random variables with finite domains
plus additional meta-data - Structured queries are easier to generate within
applications - (easier for humans to comprehend)
- SPOQL-to-SP-Algebra translation requires more
care - see Algorithm Evaluate
42SPO model next steps
- Query optimization
- Incorporation into AI applications (planning
with uncertainty and constraints) - XQuery-powered back-end
43Thank you!