Top-k Query Processing in Uncertain Database presentation

1
Top-k Query Processing in Uncertain Database

Mohamed A. Soliman, Ihab F. Ilyas,
Kevin Chen-Chuan Chang. ICDE07
Kai, Jiang Fudan University

2
Outline

Introduction
Processing Framework
U-Topk Queries
U-kRanks Queries
Queries with Tuple Independence
Experiments
Conclusion

3
Introduction

Uncertain (probabilistic) data
Sensor networks, moving-object tracking, data
cleaning, etc.
Uncertain data model
Possible worlds: a set of possible instances
Confidence: membership uncertainty
Generation rules: logical formulas that determine
valid worlds
Independent tuples: tuples correlated by no rules

4
Uncertain Database
5
Motivation & Challenges

Different from traditional top-k queries
Answers depend not only on the scoring function but also on
membership probability
Two interesting top-k queries
The top-k speeding cars in the last hour
A ranking over the models of the top-k speeding
cars
The interaction between "most probable" and "top-k"
admits several different possible interpretations
Answering involves both ranking and aggregation across
worlds, which is prohibitively expensive

6
Problem Definition U-Topk

Uncertain Top-k Query (U-Topk)
Let D be an uncertain database with possible
worlds space $PW = \{PW^1, \ldots, PW^n\}$. Let
$T = \{T^1, \ldots, T^m\}$ be a set of k-length tuple vectors,
where for each $T^i \in T$:
(1) Tuples of $T^i$ are ordered according to scoring
function F
(2) $T^i$ is the top-k answer for a non-empty set
of possible worlds $PW(T^i) \subseteq PW$.
A U-Topk query, based on F, returns $T^* \in T$, where
$T^* = \arg\max_{T^i \in T} \sum_{w \in PW(T^i)} \Pr(w)$
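
The U-Topk definition can be checked directly on a tiny database by enumerating every possible world. The sketch below is only an illustration: it assumes independent tuples (no generation rules), and the function name `u_topk_bruteforce` and the `cars` data are made up for the example. It is exponential in the number of tuples, which is exactly the cost the later slides try to avoid.

```python
from itertools import product


def u_topk_bruteforce(tuples, k):
    """tuples: list of (name, score, probability), assumed independent.

    Returns (top-k vector, aggregated probability) for the most probable vector.
    """
    answer_prob = {}
    for present in product([False, True], repeat=len(tuples)):
        # Probability of this world under the independence assumption.
        world_prob = 1.0
        for (_, _, p), here in zip(tuples, present):
            world_prob *= p if here else 1.0 - p
        world = [t for t, here in zip(tuples, present) if here]
        if len(world) < k:
            continue  # this world has no k-length answer
        ranked = sorted(world, key=lambda t: t[1], reverse=True)
        vector = tuple(name for name, _, _ in ranked[:k])
        # Aggregate the probability of all worlds sharing this top-k vector.
        answer_prob[vector] = answer_prob.get(vector, 0.0) + world_prob
    return max(answer_prob.items(), key=lambda item: item[1])


# Hypothetical data: (car, speed, confidence).
cars = [("a", 130, 0.4), ("b", 120, 0.9), ("c", 110, 0.8)]
best_vector, best_prob = u_topk_bruteforce(cars, 2)
```

Here the vector (b, c) wins even though a has the highest score, because the worlds in which a is absent carry more total probability.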

7
Problem Definition U-kRanks

Uncertain k Ranks Query (U-kRanks)
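Let D be an uncertain database with possible
worlds space $PW = \{PW^1, \ldots, PW^n\}$. For $i = 1 \ldots k$,
let $\{x_i^1, \ldots, x_i^m\}$ be a set of tuples, where each
tuple $x_i^j$ appears at rank i in a non-empty set
of possible worlds $PW(x_i^j) \subseteq PW$ based on
scoring function F. A U-kRanks query, based on F,
returns $\{x_i^*; i = 1 \ldots k\}$, where
$x_i^* = \arg\max_{x_i^j} \sum_{w \in PW(x_i^j)} \Pr(w)$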

8
Processing Framework
9
Data Access

Theorem: Among all sequential access methods,
sorted score access is optimal in the number of
retrieved tuples to answer uncertain top-k
queries.
An algorithm A that retrieves tuples sequentially out of
score order cannot decide whether a seen tuple t
belongs to any possible top-k answer.
Retrieving tuples in confidence order is also not
optimal: A cannot guarantee that it has seen all tuples
with higher scores than t

10
Process Overview
11
Computing State Probabilities

Probability Reduction
Extending a combination of tuple events by adding
another tuple existence/absence event results in
a new combination with at most the same
probability
State Probability
Let d be the depth of access to D in F order. $P(s_l) = \Pr(s_l \cap \neg I(s_l, d))$
State Extension (extend $s_l$ with tuple t)
A modified version of $s_l$, under event $\neg t$
A state $s_{l+1}$, i.e. $s_l$ appended by t, under event t

12
U-Topk Queries

OptU-Topk algorithm
Buffer the ranked tuples retrieved from D
Q: a priority queue of states ordered on their
probabilities, with ties broken by state length,
initialized with the empty state $s_{0,0}$, where $P(s_{0,0}) = 1$
Lazy materialization: at each step, extend only
the state with the highest probability into its two
possible next states
Terminate when the top state of Q is a complete
state.
Can be extended to return the n most probable U-Topk
answers
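
The queue-driven search described on this slide can be sketched as follows. This is a simplified illustration and not the paper's implementation: it assumes independent tuples, so a state's probability is just a product of existence/absence probabilities, and the names `opt_u_topk`, `stream` and `accessed` are hypothetical.

```python
import heapq
from itertools import count


def opt_u_topk(stream, k):
    """stream: list of (name, probability) in descending score order.

    Returns (vector, probability, number of tuples accessed), or None if no
    world contains k tuples. Assumes independent tuples.
    """
    tie = count()  # unique tiebreaker so heap entries stay comparable
    # Heap entry: (-probability, -length, tiebreak, next stream index, vector).
    # Negation turns Python's min-heap into "highest probability first",
    # with longer states preferred on ties.
    queue = [(-1.0, 0, next(tie), 0, ())]
    accessed = 0
    while queue:
        neg_p, neg_len, _, i, vec = heapq.heappop(queue)
        if -neg_len == k:
            return vec, -neg_p, accessed  # top of the queue is complete
        if i == len(stream):
            continue  # stream exhausted, state can never become complete
        accessed = max(accessed, i + 1)
        name, p = stream[i]
        # Event "t exists": a state one tuple longer.
        heapq.heappush(queue, (neg_p * p, neg_len - 1, next(tie), i + 1, vec + (name,)))
        # Event "t absent": same vector, reduced probability.
        heapq.heappush(queue, (neg_p * (1.0 - p), neg_len, next(tie), i + 1, vec))
    return None


# Hypothetical stream, already sorted by score.
result = opt_u_topk([("a", 0.4), ("b", 0.9), ("c", 0.8)], 2)
```

Because extending a state never increases its probability, the first complete state popped from the queue cannot be beaten by any state still waiting in it.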

14
Optimality

Among all algorithms that access tuples ordered
on score, OptU-Topk is optimal in the number of
accessed tuples.
Let x be the reported answer by OptU-Topk. Among
all algorithms that access tuples ordered on
score, there is no algorithm that can skip a
state visited by OptU-Topk and report x as the
U-Topk query answer.

15
U-kRanks Queries

OptU-kRanks
Extend maintained states based on each seen tuple
Compute $P_{t,i}$ for i = 1, ..., k
For each rank i, remember the most probable
answer obtained so far
Terminate at rank i when
Optimality
Among all algorithms that access tuples ordered
on score, OptU-kRanks is optimal in number of
accessed tuples.

17
U-Topk Queries with Tuple Independence

Under tuple independence, if all states are
maintained after seeing the same tuples, $x_n$ and
$y_m$ (n = m) would follow the same path to reach a
complete state. If $P(x_n) > P(y_m)$, prune $y_m$.
IndepU-Topk groups states into equivalence
classes based on their lengths, and keeps at most one
state for each length 0, ..., k in a candidate set.
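
The pruning rule above suggests keeping a single best state per length. A minimal sketch of that idea follows; the function name `indep_u_topk` and the stopping test (stop once the complete state is at least as probable as every partial state) are this example's own reading of the slide, not code from the paper.

```python
def indep_u_topk(stream, k):
    """stream: list of (name, probability) in descending score order.

    Keeps at most one candidate state per length 0..k (independent tuples).
    Returns (vector, probability) or None if fewer than k tuples exist.
    """
    # best[l] = (probability, vector) of the most probable state of length l.
    best = [None] * (k + 1)
    best[0] = (1.0, ())
    for name, p in stream:
        new = [None] * (k + 1)
        new[k] = best[k]  # complete states are never extended
        for length in range(k):
            if best[length] is None:
                continue
            prob, vec = best[length]
            candidates = (
                (length, (prob * (1.0 - p), vec)),           # tuple absent
                (length + 1, (prob * p, vec + (name,))),     # tuple present
            )
            for target, cand in candidates:
                # Same length, same seen tuples: keep only the more probable.
                if new[target] is None or cand[0] > new[target][0]:
                    new[target] = cand
        best = new
        partial = max(b[0] for b in best[:k] if b is not None)
        # Partial states can only lose probability when extended further.
        if best[k] is not None and best[k][0] >= partial:
            return best[k][1], best[k][0]
    return None if best[k] is None else (best[k][1], best[k][0])


indep_result = indep_u_topk([("a", 0.4), ("b", 0.9), ("c", 0.8)], 2)
```

With k + 1 candidates at most, each retrieved tuple costs O(k) work instead of growing the state space.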

18
IndepU-Topk
19
U-kRanks Queries with Tuple Independence

Dynamic Programming
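
One way to read the dynamic programming idea: under independence, a tuple sits at rank i exactly when it exists and exactly i - 1 higher-scored tuples exist. The sketch below maintains those counts incrementally. It scans the whole stream instead of applying the paper's termination test, and `indep_u_kranks` is a hypothetical name.

```python
def indep_u_kranks(stream, k):
    """stream: list of (name, probability) in descending score order.

    Returns a list of (tuple name, probability), one entry per rank 1..k.
    Assumes independent tuples.
    """
    # exactly[j] = Pr(exactly j of the tuples seen so far exist), j = 0..k-1.
    exactly = [1.0] + [0.0] * (k - 1)
    best = [(None, 0.0)] * k  # best[i] is the answer for rank i + 1
    for name, p in stream:
        for j in range(k):
            # The tuple is at rank j + 1 if it exists and j earlier tuples exist.
            rank_prob = p * exactly[j]
            if rank_prob > best[j][1]:
                best[j] = (name, rank_prob)
        # Fold the current tuple into the counts, highest j first so that
        # exactly[j - 1] still refers to the previous step.
        for j in range(k - 1, 0, -1):
            exactly[j] = exactly[j] * (1.0 - p) + exactly[j - 1] * p
        exactly[0] *= 1.0 - p
    return best


ranks = indep_u_kranks([("a", 0.4), ("b", 0.9), ("c", 0.8)], 2)
```

Note that the same tuple may be reported at more than one rank, since each rank is maximized separately.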

20
Experiment
21
Experiment
22
Experiment
23
Experiment
24
Experiment
25
Conclusion

First paper to address top-k query processing
under possible worlds semantics
Formulates the problem as a state space search, with
query algorithms that have optimality guarantees on
accessed tuples and materialized search states.
The processing framework leverages existing techniques
and can be integrated with existing DBMSs

26
Efficient Top-k Query Evaluation on
Probabilistic Data

Christopher Re, Nilesh Dalvi,
Dan Suciu. ICDE07

27
Outline

Introduction
Preliminaries
Top-k Query Evaluations
Discussion
Experiments
Conclusion

28
Introduction

Imprecise data: probabilistic databases
Compute and rank the top-k answers of a SQL query
Answers with approximate probabilities
Shift focus from probabilities to ranks

29
Application Example
30
Challenges

Computing the exact output probabilities is
computationally hard (#P-complete)
The number of potential answers is large
1415 answers in the previous example
Users are only interested in the first few of them

31
Approach given in the paper

Guarantee
The top k answers are correct
The ranking of the top k answers is correct
Limitations
Probabilities listed explicitly
Do not handle continuous attribute values

32
Possible World

Definition
A probabilistic database over schema S is a pair
(W, P), where $W = \{W_1, \ldots, W_n\}$ is a set of database
instances over S, and $P : W \to [0, 1]$ is a
probability distribution. Each instance $W_j$ for
which $P(W_j) > 0$ is called a possible world.

33
Possible World

Representation
Table
Constraint
Each instance $J^p$ over schema $S^p$ represents a
probabilistic database over S, denoted $Mod(J^p)$
S: a single relation name $R(A_1, \ldots, A_m, B_1, \ldots, B_n)$,
written R(A, B)
$J^p$: table $R^p(A, B, p)$
Wj subsets of

34
Example
35
DNF Formulas over Tuples

Definition
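Let (W, P) be a probabilistic database and $t_1, t_2, \ldots$ all
the tuples
$t_i$ is true if $t_i \in W$ and false if $t_i \notin W$
For a formula E, $P(E) = \sum_{W_i : E \text{ is true in } W_i} P(W_i)$
Example: E = (t1 ? t5) ? t2, $P(E) = P(W_3) + P(W_7) + P(W_{10}) + P(W_{11})$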

36
Queries

Consider SQL queries of the form
Aggregate operators: sum, count (i.e. sum(1)), min, max
Given (W,P), the answer is a table like

37
Possible worlds semantics

SQL query on Wj a set of tuples

38
Semantics based on DNF Formulas

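Possible worlds semantics is not practical
DNF formulas
Modify q into the extended query qe
Evaluate qe on $J^p$ and denote the answer ET
Each row of ET has the form $t = (t_1, \ldots, t_r)$, where each $t_i$ is an input tuple
$t.E = t_1 \wedge t_2 \wedge \cdots \wedge t_r$
P(t.E) can be computed easily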

39
DNF Formulas(continue)

Partition ET by GROUP-BY
$ET = G_1 \cup G_2 \cup \cdots \cup G_n$, $G = \{t_1, \ldots, t_m\}$
Computing P(G.E) is #P-complete

40
Monte Carlo (MC) Simulation

Naïve MC algorithm
Approximate P(G.E) by repeatedly choosing a random
possible world and computing the frequency with which
G.E is true
Luby and Karp's improved MC algorithm
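
Both estimators can be sketched for a small DNF formula. The code below is illustrative only: it assumes independent tuple events and clauses made of positive literals, which is simpler than the paper's setting, and the names `naive_mc`, `karp_luby`, `clauses` and `prob` are hypothetical. The second function is the standard coverage-style estimator associated with Karp and Luby.

```python
import random


def naive_mc(clauses, prob, n_steps, rng):
    """Fraction of sampled worlds in which the DNF is true."""
    hits = 0
    for _ in range(n_steps):
        world = {t for t, p in prob.items() if rng.random() < p}
        hits += any(clause <= world for clause in clauses)
    return hits / n_steps


def karp_luby(clauses, prob, n_steps, rng):
    """Coverage estimator: sample a clause, then a world satisfying it."""
    weights = []
    for clause in clauses:
        w = 1.0
        for t in clause:
            w *= prob[t]  # probability that this conjunction is true
        weights.append(w)
    total = sum(weights)
    hits = 0
    for _ in range(n_steps):
        # Pick clause i with probability proportional to its weight.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample a world conditioned on clause i being true.
        world = set(clauses[i])
        world |= {t for t, p in prob.items()
                  if t not in clauses[i] and rng.random() < p}
        # Count the sample only if i is the first clause the world satisfies,
        # so every satisfying world is counted exactly once in expectation.
        first = next(j for j, clause in enumerate(clauses) if clause <= world)
        hits += first == i
    return total * hits / n_steps


# Hypothetical formula (a AND b) OR c; exact probability is 0.4.
prob = {"a": 0.5, "b": 0.5, "c": 0.2}
clauses = [frozenset({"a", "b"}), frozenset({"c"})]
naive_estimate = naive_mc(clauses, prob, 20000, random.Random(0))
kl_estimate = karp_luby(clauses, prob, 20000, random.Random(0))
```

The coverage estimator matters most when P(G.E) is small: the naive sampler then rarely sees a satisfying world, while the second sampler only ever draws worlds that satisfy at least one clause.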

41
Property of Luby and Karp's algorithm

Let $\delta > 0$
m: the number of disjuncts
N: the number of steps executed
Define
Then

42
Top-k Query Evaluation

Evaluation has two parts
Evaluate the extended SQL query qe and group the
answer tuples
Run an MC simulation on each group to compute the
probabilities, then return the k groups with the highest probabilities
Goal: minimize the total number of simulation
steps

43
Multisimulation (MS)

$G = \{G_1, \ldots, G_n\}$ with unknown probabilities $p_1, \ldots, p_n$; the goal is to
find the k objects with the highest probabilities, denoted
$TopK \subseteq G$
Assumptions

44
Multisimulation

Two intervals $[a_i, b_i]$, $[a_j, b_j]$: if $b_i \le a_j$, the first is
below and the second is above
Two intervals are separated if one is below the other, so we know $p_i < p_j$
n intervals are k-separated if there is a set $T \subseteq G$ of k
intervals such that any interval in T is above any interval
not in T

45
Multisimulation

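Sound strategy: round robin
Cost: $n \cdot N_{opt}$
Notations and definitions
Let $top_k(x_1, \ldots, x_n)$ be the k-th largest value
Critical region: $(c, d) = (top_k(a_1, \ldots, a_n), top_{k+1}(b_1, \ldots, b_n))$
Top objects: $T = \{G_i \mid d \le a_i\}$, $T \subseteq TopK$
Bottom objects: $B = \{G_i \mid b_i \le c\}$, $B \cap TopK = \emptyset$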

46
Multisimulation

There is a k-separation iff the critical region is
empty, i.e. $c \ge d$; then TopK = T
$G_i$ is a double crosser if $a_i < c$ and $d < b_i$
$G_i$ is a lower (upper) crosser if $a_i < c$ ($d < b_i$)

47
Multisimulation

MS Algorithm
First, try a double crosser
Then try to find an upper and lower crosser pair
If no such pair exists, either all crossers have
the same left endpoint $a_i = c$ or the same right
endpoint $d = b_i$. Pick the maximal crosser
After each iteration, re-compute the critical
region
Stop when $c \ge d$ and return the set T
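
The selection loop above can be sketched with simulated intervals. Everything specific here is an assumption made for illustration: the hidden probabilities are passed in directly, one "simulation step" deterministically halves the gap between each endpoint and the true value (a stand-in for a real Monte Carlo step), a crosser is taken to be an interval containing the critical region ($a_i \le c$ and $d \le b_i$), and `ms_topk` is a hypothetical name.

```python
def ms_topk(probs, k, max_steps=10_000):
    """probs: hidden probabilities, assumed distinct, with k < len(probs).

    Returns (set of indices in the top k, total simulation steps).
    """
    n = len(probs)
    steps = [0] * n

    def interval(i):
        # Stand-in for N simulation steps: [0, 1] at N = 0, always contains
        # probs[i], and tightens around it as N grows.
        shrink = 0.5 ** steps[i]
        return probs[i] * (1 - shrink), probs[i] + (1 - probs[i]) * shrink

    total = 0
    while total < max_steps:
        iv = [interval(i) for i in range(n)]
        c = sorted((a for a, _ in iv), reverse=True)[k - 1]  # k-th largest a
        d = sorted((b for _, b in iv), reverse=True)[k]      # (k+1)-th largest b
        if c >= d:  # critical region empty: the intervals are k-separated
            return {i for i, (a, _) in enumerate(iv) if a >= d}, total
        crossers = [i for i, (a, b) in enumerate(iv) if a <= c and d <= b]
        double = [i for i in crossers if iv[i][0] < c and d < iv[i][1]]
        upper = [i for i in crossers if d < iv[i][1]]
        lower = [i for i in crossers if iv[i][0] < c]
        if double:
            chosen = [double[0]]
        elif upper and lower:
            chosen = [upper[0], lower[0]]
        else:
            # Fall back to the widest crosser.
            chosen = [max(crossers, key=lambda i: iv[i][1] - iv[i][0])]
        for i in chosen:
            steps[i] += 1
            total += 1
    raise RuntimeError("no k-separation within max_steps")


top, cost = ms_topk([0.9, 0.8, 0.3, 0.1], 2)
```

Only intervals that still overlap the critical region receive further steps, so objects that are clearly in or clearly out of the top k stop consuming simulation effort.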

48
The Multisimulation Algorithm
49
Algorithm Guarantee

The algorithm always terminates and returns the
correct TopK. For any deterministic algorithm
computing the top k and for any $c < 2$ there exists
an instance on which its cost is at least $c \cdot N_{opt}$.
Let A be any deterministic algorithm for finding
TopK. Then (a) on any instance the cost of
MS_TopK is at most twice the cost of A, and (b)
for any $c < 1$ there exists an instance where the
cost of A is greater than c times the cost of
MS_TopK.

50
Discussion

Extensions
Extend MS to compute and rank the top k
answers.
T_k = MS_TopK(G, k)
T_{k-1} = MS_TopK(T_k, k-1)
T_{k-2} = MS_TopK(T_{k-1}, k-2)
...
T_1 = MS_TopK(T_2, 1)
Variation
Any-time algorithm which computes and returns the
top answers in order 1, 2, 3, ..., and can be stopped
at any time

51
Review the assumptions

Precision
Each step: $P(p \in [a^N, b^N]) > 1 - \delta$; global
precision $> 1 - \delta_0$: $(1 - \delta)^N \ge 1 - \delta_0$ with $\delta = \delta_0 / N$
Progress fails in general
After step N of MC, the midpoint moves by 1/N,
while the width of the interval shrinks by only
$O(1/N^{1/2} - 1/(N+1)^{1/2}) = O(N^{-3/2})$
Solution: run $N^{1/2}$ MC iterations at each step,
s.t. the
midpoint moves between steps N and $N + N^a$ by
$N^{a-1}$
$O(1/N^{1/2} - 1/(N + N^a)^{1/2}) = O(N^{a - 3/2})$, $a = 1/2$
MS algorithm runs at most $2(N_{opt} + N_{opt}^{1/2})$ steps

52
Optimization

Initialize the intervals $[a_i, b_i]$ to better
estimates than [0, 1]
Eliminate low-ranking objects from the start
Safe plan rewriting: identify subqueries whose
probabilities can be computed inside of the SQL
engine.

53
Experiment

Queries

54
Experiment
55
Conclusion

Describe a method for answering top-k queries on
probabilistic databases
Prove the technique to be near optimal and
validate it experimentally

56
The End

Thank you all!
