Top-k Query Processing in Uncertain Database

- Mohamed A. Soliman, Ihab F. Ilyas, Kevin Chen-Chuan Chang. ICDE07
- Kai Jiang, Fudan University

Outline

- Introduction
- Processing Framework
- U-Topk Queries
- U-kRanks Queries
- Queries with Tuple Independence
- Experiments
- Conclusion

Introduction

- Uncertain (probabilistic) data
- Sensor networks, moving-object tracking, data cleaning, etc.
- Uncertain data model
- Possible worlds: a set of possible instances
- Confidence: membership uncertainty
- Generation rules: logical formulas that determine valid worlds
- Independent tuples: correlated with no rules

Uncertain Database

Motivation and Challenges

- Different from traditional top-k queries
- Answers depend not only on the score function but also on membership probability
- Two interesting top-k queries
- Top-k speeding cars in the last hour
- A ranking over the models of the top-k speeding cars
- Interaction between "most probable" and "top-k": several different possible interpretations
- Involves both ranking and aggregation across worlds, which is prohibitively expensive

Problem Definition: U-Topk

- Uncertain Top-k Query (U-Topk)
- Let D be an uncertain database with possible worlds space $PW = \{PW^1, \dots, PW^n\}$. Let $T = \{T^1, \dots, T^m\}$ be a set of k-length tuple vectors, where for each $T^i \in T$:
- (1) Tuples of $T^i$ are ordered according to scoring function F
- (2) $T^i$ is the top-k answer for a non-empty set of possible worlds $PW(T^i) \subseteq PW$.
- A U-Topk query, based on F, returns $T^* \in T$, where $T^* = \arg\max_{T^i \in T} \sum_{w \in PW(T^i)} \Pr(w)$
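The U-Topk definition can be checked by brute force on a tiny database. The sketch below is not one of the paper's algorithms: it enumerates every possible world, assumes independent tuples (no generation rules), and uses hypothetical `(tuple_id, score, probability)` triples as its input format.

```python
from itertools import product

def u_topk_bruteforce(tuples, k):
    """Return (top-k vector, probability) by enumerating all possible worlds.

    tuples: list of (tuple_id, score, probability); tuples are assumed
    independent, which is an assumption of this sketch.
    """
    ranked = sorted(tuples, key=lambda t: -t[1])  # highest score first
    answer_prob = {}  # top-k vector -> total probability of worlds producing it
    for mask in product([True, False], repeat=len(ranked)):
        world_prob = 1.0
        world = []
        for present, (tid, _score, p) in zip(mask, ranked):
            world_prob *= p if present else 1.0 - p
            if present:
                world.append(tid)
        if world_prob == 0.0 or len(world) < k:
            continue  # impossible world, or too few tuples for a top-k vector
        topk = tuple(world[:k])
        answer_prob[topk] = answer_prob.get(topk, 0.0) + world_prob
    return max(answer_prob.items(), key=lambda kv: kv[1])

cars = [("t1", 130, 0.4), ("t2", 120, 0.7), ("t3", 110, 0.3), ("t4", 100, 1.0)]
best_vector, best_prob = u_topk_bruteforce(cars, 2)
print(best_vector, round(best_prob, 3))
```

On this made-up data the most probable top-2 vector is not made of the two highest-scoring tuples, which illustrates why score and membership probability have to be considered together. The enumeration is exponential in the number of tuples, so it is only useful as a reference for small inputs.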

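Problem Definition: U-kRanks

- Uncertain k Ranks Query (U-kRanks)
- Let D be an uncertain database with possible worlds space $PW = \{PW^1, \dots, PW^n\}$. For $i = 1 \dots k$, let $\{x_i^1, \dots, x_i^m\}$ be a set of tuples, where each tuple $x_i^j$ appears at rank i in a non-empty set of possible worlds $PW(x_i^j) \subseteq PW$ based on scoring function F. A U-kRanks query, based on F, returns $\{x_i^* ;\ i = 1 \dots k\}$, where $x_i^* = \arg\max_{x_i^j} \sum_{w \in PW(x_i^j)} \Pr(w)$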
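A U-kRanks answer can likewise be computed by exhaustive enumeration on a small example. This is a reference sketch only: it assumes independent tuples, and the `(tuple_id, score, probability)` input format is hypothetical.

```python
from itertools import product

def u_kranks_bruteforce(tuples, k):
    """For each rank 1..k, return the (tuple_id, probability) most likely at that rank.

    tuples: list of (tuple_id, score, probability), assumed independent.
    Assumes every rank 1..k is reachable in at least one world.
    """
    ranked = sorted(tuples, key=lambda t: -t[1])  # highest score first
    rank_prob = [dict() for _ in range(k)]  # rank_prob[i][tid] = Pr(tid at rank i+1)
    for mask in product([True, False], repeat=len(ranked)):
        world_prob = 1.0
        world = []
        for present, (tid, _score, p) in zip(mask, ranked):
            world_prob *= p if present else 1.0 - p
            if present:
                world.append(tid)
        # Credit this world's probability to the tuple found at each rank.
        for i, tid in enumerate(world[:k]):
            rank_prob[i][tid] = rank_prob[i].get(tid, 0.0) + world_prob
    return [max(d.items(), key=lambda kv: kv[1]) for d in rank_prob]

cars = [("t1", 130, 0.4), ("t2", 120, 0.7), ("t3", 110, 0.3), ("t4", 100, 1.0)]
for rank, (tid, prob) in enumerate(u_kranks_bruteforce(cars, 2), start=1):
    print(rank, tid, round(prob, 3))
```

Each rank is decided separately, so the same tuple may be reported at more than one rank, and the per-rank winners need not form the most probable top-k vector.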

Processing Framework

Data Access

- Theorem: Among all sequential access methods, sorted score access is optimal in the number of retrieved tuples to answer uncertain top-k queries.
- An algorithm A that retrieves tuples sequentially out of score order cannot decide whether a seen tuple t belongs to any possible top-k answer or not.
- Retrieving tuples in confidence order is also not optimal: it cannot guarantee that it has seen all tuples with higher scores than t

Process Overview

Computing State Probabilities

- Probability Reduction
- Extending a combination of tuple events by adding another tuple existence/absence event results in a new combination with at most the same probability
- State Probability
- d: depth of access to D in F order. $P(s_l) = \Pr(s_l \cap \neg I(s_l, d))$
- State Extension (extend $s_l$ with tuple t)
- A modified version of $s_l$, for the event $\neg t$
- A state $s_{l+1}$, which is $s_l$ appended by t, for the event t

U-Topk Queries

- OptU-Topk algorithm
- Buffer the ranked tuples retrieved from D
- Q: a priority queue of states ordered on their probabilities, with ties broken by state length, initialized with the empty state $s_{0,0}$, where $P(s_{0,0}) = 1$
- Lazy materialization: at each step, extend only the state with the highest probability into two possible states
- Terminate when the top state of Q is a complete state.
- Can be extended to return the n most probable U-Topk answers
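The search described above can be sketched as a best-first loop over a priority queue. This is a minimal illustration rather than the paper's pseudocode: it assumes independent tuples, so a state's probability is a plain product of existence and absence probabilities, and the in-memory list `ranked` stands in for the buffer of tuples retrieved from D.

```python
import heapq

def opt_u_topk(ranked, k):
    """Best-first search for the most probable top-k vector.

    ranked: list of (tuple_id, probability) in descending score order,
    tuples assumed independent (an assumption of this sketch).
    Returns (vector, probability), or None if no world has k tuples.
    """
    # Heap entries: (-probability, -length, depth, vector). Negation turns
    # Python's min-heap into "highest probability first, longer state on ties".
    heap = [(-1.0, 0, 0, ())]
    while heap:
        neg_p, neg_len, depth, vec = heapq.heappop(heap)
        if -neg_len == k:
            return vec, -neg_p  # top of the queue is a complete state
        if depth == len(ranked):
            continue  # no more tuples to extend this state with
        tid, p = ranked[depth]
        if p > 0.0:  # extension for the event "t exists"
            heapq.heappush(heap, (neg_p * p, neg_len - 1, depth + 1, vec + (tid,)))
        if p < 1.0:  # extension for the event "t does not exist"
            heapq.heappush(heap, (neg_p * (1.0 - p), neg_len, depth + 1, vec))
    return None

ranked = [("t1", 0.4), ("t2", 0.7), ("t3", 0.3), ("t4", 1.0)]
print(opt_u_topk(ranked, 2))
```

Stopping at the first complete state popped from the queue relies on probability reduction: every state still in the queue can only lose probability when extended, so none of them can overtake it.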

Optimality

- Among all algorithms that access tuples ordered on score, OptU-Topk is optimal in the number of accessed tuples.
- Let x be the answer reported by OptU-Topk. Among all algorithms that access tuples ordered on score, there is no algorithm that can skip a state visited by OptU-Topk and report x as the U-Topk query answer.

U-kRanks Queries

- OptU-kRanks
- Extend maintained states based on each seen tuple
- Compute $P_{t,i}$ for $i = 1, \dots, k$
- For each rank i, remember the most probable answer obtained so far
- Terminate at rank i when
- Optimality
- Among all algorithms that access tuples ordered on score, OptU-kRanks is optimal in the number of accessed tuples.

U-Topk Queries with Tuple Independence

- Under tuple independence, if all states are maintained after seeing the same tuples, $x_n$ and $y_m$ ($n = m$) would follow the same path to reach a complete state. If $P(x_n) > P(y_m)$, prune $y_m$.
- IndepU-Topk groups states into equivalence classes based on their lengths, and keeps at most one state for each length $0, \dots, k$ in a candidate set.
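The one-state-per-length idea can be sketched as follows. This is an illustrative reading of the bullets above, not the paper's IndepU-Topk pseudocode; the function name, the `(tuple_id, probability)` input format, and the early-exit test are assumptions of the sketch.

```python
def indep_u_topk(ranked, k):
    """Most probable top-k vector under tuple independence.

    ranked: list of (tuple_id, probability) in descending score order.
    best[l] holds the single most probable state of length l; best[k] is the
    best complete state found so far. Returns (vector, probability) or None.
    """
    best = [None] * (k + 1)
    best[0] = (1.0, ())
    for tid, p in ranked:
        # Go from long to short states so each state is extended by this tuple once.
        for length in range(k - 1, -1, -1):
            if best[length] is None:
                continue
            prob, vec = best[length]
            include = (prob * p, vec + (tid,))
            if best[length + 1] is None or include[0] > best[length + 1][0]:
                best[length + 1] = include  # keep only the better state of this length
            best[length] = (prob * (1.0 - p), vec)  # the tuple is absent
        incomplete = [s[0] for s in best[:k] if s is not None]
        # Incomplete states can only lose probability, so stop once none can win.
        if best[k] is not None and best[k][0] >= max(incomplete):
            break
    if best[k] is None:
        return None
    return best[k][1], best[k][0]

ranked = [("t1", 0.4), ("t2", 0.7), ("t3", 0.3), ("t4", 1.0)]
print(indep_u_topk(ranked, 2))
```

Because two states of equal length that have seen the same tuples face identical future extensions, only the more probable one can lead to the answer, so the candidate set never holds more than k + 1 states.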

IndepU-Topk

U-kRanks Queries with Tuple Independence

- Dynamic Programming
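One way to realise a dynamic program for this case is sketched below. It is a minimal illustration under the stated independence assumption, not necessarily the recurrence used in the paper: tuple t lands at rank i exactly when t exists and exactly i - 1 higher-scored tuples exist, and the distribution of that count can be updated one tuple at a time. The function and variable names are hypothetical, and the sketch scans all tuples without early termination.

```python
def u_kranks_dp(ranked, k):
    """For each rank 1..k, return the (tuple_id, probability) most likely at that rank.

    ranked: list of (tuple_id, probability) in descending score order,
    tuples assumed independent.
    """
    # count[c] = Pr(exactly c of the tuples seen so far exist), for c = 0..k-1
    count = [1.0] + [0.0] * (k - 1)
    best = [(None, 0.0)] * k
    for tid, p in ranked:
        for i in range(k):
            prob_at_rank = p * count[i]  # t exists and exactly i tuples precede it
            if prob_at_rank > best[i][1]:
                best[i] = (tid, prob_at_rank)
        # Fold this tuple into the count distribution, highest count first
        # so that count[c - 1] is still the value from before this tuple.
        for c in range(k - 1, 0, -1):
            count[c] = count[c] * (1.0 - p) + count[c - 1] * p
        count[0] *= 1.0 - p
    return best

ranked = [("t1", 0.4), ("t2", 0.7), ("t3", 0.3), ("t4", 1.0)]
print(u_kranks_dp(ranked, 2))
```

The cost is O(nk) for n tuples, compared with the exponential number of possible worlds.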

Experiment

Conclusion

- First paper to address top-k query processing under possible worlds semantics
- Formulates the problem as a state space search, with query algorithms that have optimality guarantees on accessed tuples and materialized search states.
- The processing framework leverages existing techniques and can be integrated with existing DBMSs

Efficient Top-k Query Evaluation on Probabilistic Data

- Christopher Re, Nilesh Dalvi, Dan Suciu. ICDE07

Outline

- Introduction
- Preliminaries
- Top-k Query Evaluations
- Discussion
- Experiments
- Conclusion

Introduction

- Imprecise data: probabilistic database
- Compute and rank the top-k answers of a SQL query
- Answers with approximate probabilities
- Shift focus from probabilities to ranks

Application Example

Challenges

- Computing the exact output probabilities is computationally hard (#P-complete)
- Number of potential answers is large
- 1415 answers in the previous example
- Only interested in the first few of them

Approach given in the paper

- Guarantee
- The top k answers are correct
- The ranking of the top k answers is correct
- Limitations
- Probabilities listed explicitly
- Do not handle continuous attribute values

Possible World

- Definition
- A probabilistic database over schema S is a pair (W, P), where $W = \{W_1, \dots, W_n\}$ is a set of database instances over S, and $P : W \to [0, 1]$ is a probability distribution. Each instance $W_j$ for which $P(W_j) > 0$ is called a possible world.

Possible World

- Representation
- Table
- Constraint
- Each instance $J^p$ over schema $S^p$ represents a probabilistic database over S, denoted $Mod(J^p)$
- S: a single relation name $R(A_1, \dots, A_m, B_1, \dots, B_n)$, written $R(A, B)$
- $J^p$: table $R^p(A, B, p)$
- $W_j$: subsets of

Example

DNF Formulas over Tuples

- Definition
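- Let (W, P) be a probabilistic database, and $t_1, t_2, \dots$ all the tuples
- $t_i$ = true if $t_i \in W$ and $t_i$ = false if $t_i \notin W$
- A formula E: $P(E) = \sum \{P(W_i) \mid E = \text{true in } W_i\}$
- $E = (t_1 \wedge t_5) \vee t_2$, $P(E) = P(W_3) + P(W_7) + P(W_{10}) + P(W_{11})$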
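The definition of P(E) as a sum over the worlds where E is true can be made concrete with a small enumeration. The sketch assumes independent tuples with made-up probabilities (the slide's worlds W3, W7, ... are not reproduced here), and represents a DNF formula as a hypothetical list of conjunctions.

```python
from itertools import product

def dnf_probability(tuple_probs, dnf):
    """Sum the probabilities of all worlds in which the DNF formula is true.

    tuple_probs: dict tuple_name -> probability (tuples assumed independent).
    dnf: list of conjunctions, each a list of tuple names.
    """
    names = sorted(tuple_probs)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        world_prob = 1.0
        for name in names:
            world_prob *= tuple_probs[name] if world[name] else 1.0 - tuple_probs[name]
        # E is true if at least one conjunction has all of its tuples present.
        if any(all(world[t] for t in conj) for conj in dnf):
            total += world_prob
    return total

# A formula of the shape (t1 AND t5) OR t2, with illustrative probabilities.
probs = {"t1": 0.5, "t2": 0.2, "t5": 0.4}
print(round(dnf_probability(probs, [["t1", "t5"], ["t2"]]), 3))
```

With these numbers the result agrees with the closed form 1 - (1 - 0.2)(1 - 0.5 * 0.4) = 0.36, which is available here only because the two disjuncts share no tuples.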

Queries

- Consider SQL queries of the form
- Aggregate operators: sum, count (as sum(1)), min, max
- Given (W,P), the answer is a table like

Possible worlds semantics

- SQL query on $W_j$: a set of tuples

Semantics based on DNF Formulas

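- Possible worlds semantics is not practical
- DNF formulas
- Modify $q \to q_e$
- Evaluate $q_e$ on $J^p$ and denote the answer ET
- Form of ET: $t = (t_1, \dots, t_r)$, where each $t_i$ belongs to one of the input tables
- $t.E = t_1 \wedge t_2 \wedge \dots \wedge t_r$
- $P(t.E)$ can be computed easily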

DNF Formulas (continued)

- Partition ET by GROUP-BY
- $ET = G_1 \cup G_2 \cup \dots \cup G_n$, $G = \{t_1, \dots, t_m\}$
- Computing $P(G.E)$ is #P-complete

Monte Carlo (MC) Simulation

- Naïve MC algorithm
- Approximate P(G.E) by repeatedly choosing a random possible world and computing the frequency of G.E = true
- Luby and Karp's improved MC algorithm
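The naive algorithm can be sketched in a few lines. This is an illustration only, assuming independent tuples so that a random world is drawn by flipping one coin per tuple; it is not the improved Luby and Karp estimator, and the function name and input format are hypothetical.

```python
import random

def naive_mc(tuple_probs, dnf, steps, seed=0):
    """Estimate P(G.E) as the fraction of sampled worlds in which the formula is true.

    tuple_probs: dict tuple_name -> probability (tuples assumed independent).
    dnf: list of conjunctions, each a list of tuple names.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(steps):
        # Draw one possible world: each tuple is present with its own probability.
        world = {name: rng.random() < p for name, p in tuple_probs.items()}
        if any(all(world[t] for t in conj) for conj in dnf):
            hits += 1
    return hits / steps

probs = {"t1": 0.5, "t2": 0.2, "t5": 0.4}
estimate = naive_mc(probs, [["t1", "t5"], ["t2"]], steps=20000)
print(estimate)  # the exact value for this formula is 0.36
```

The estimate's standard error shrinks like the inverse square root of the number of steps, which is why spending steps only on the groups that matter for the ranking pays off.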

Property of Luby and Karp's algorithm

- Let $\delta > 0$
- m: number of disjuncts
- N: the number of steps executed
- Define
- Then

Top-k Query Evaluation

- Evaluation has two parts
- Evaluate the extended SQL query $q_e$ and group the answer tuples
- Run an MC simulation on each group to compute the probabilities, then return the top k probabilities
- Goal: minimize the total number of simulation steps

Multisimulation (MS)

- $G = \{G_1, \dots, G_n\}$ with unknown probabilities $p_1, \dots, p_n$; the goal is to find the k objects with the highest probabilities, denoted $TopK \subseteq G$
- Assumptions

Multisimulation

- Two intervals $[a_i, b_i]$, $[a_j, b_j]$: if $b_i \le a_j$, the first is below and the second is above
- Two intervals are separated if we know $p_i < p_j$
- n intervals are k-separated if there is a set $T \subset G$ of k intervals such that any interval in T is above any interval not in T

Multisimulation

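- Sound strategy: round robin
- Cost: $n N_{opt}$
- Notations and definitions
- Let $top_k(x_1, \dots, x_n)$ be the k-th largest value
- Critical region: $(c, d) = (top_k(a_1, \dots, a_n), top_{k+1}(b_1, \dots, b_n))$
- Top objects: $T = \{G_i \mid d \le a_i\}$, $T \subseteq TopK$
- Bottom objects: $B = \{G_i \mid b_i \le c\}$, $B \cap TopK = \emptyset$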
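The critical region and the top and bottom sets follow directly from the interval endpoints. The sketch below is illustrative: the function name and the dict-of-intervals input are hypothetical, and it assumes there are more than k objects.

```python
def critical_region(intervals, k):
    """Return (c, d, top, bottom) for a dict name -> (a, b) of probability intervals.

    c is the k-th largest left endpoint and d the (k+1)-th largest right endpoint.
    Requires len(intervals) > k.
    """
    lows = sorted((a for a, _ in intervals.values()), reverse=True)
    highs = sorted((b for _, b in intervals.values()), reverse=True)
    c = lows[k - 1]
    d = highs[k]
    top = {g for g, (a, _) in intervals.items() if d <= a}     # certainly in TopK
    bottom = {g for g, (_, b) in intervals.items() if b <= c}  # certainly not in TopK
    return c, d, top, bottom

intervals = {"G1": (0.6, 0.9), "G2": (0.3, 0.7), "G3": (0.1, 0.4), "G4": (0.0, 0.2)}
print(critical_region(intervals, 2))
```

In this made-up example the critical region is (0.3, 0.4): G1 is already a top object, G4 is already a bottom object, and only G2 and G3 still need simulation steps to be told apart.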

Multisimulation

- There is a k-separation iff the critical region is empty, i.e. $c \ge d$; then $TopK = T$
- $G_i$ is a double crosser if $a_i < c$ and $d < b_i$
- $G_i$ is a lower (upper) crosser if $a_i < c$ ($d < b_i$)

Multisimulation

- MS Algorithm
- First, try a double crosser
- Then try to find an upper and lower crosser pair
- If none exists, either all crossers have the same left endpoint $a_i = c$ or the same right endpoint $d = b_i$. Pick the maximal crosser
- After each iteration, re-compute the critical region
- Stop when $c \ge d$, and return the set T
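The selection rules above can be exercised on a toy model. In this sketch the Monte Carlo simulation is replaced by a deterministic stand-in that moves both interval endpoints halfway toward a hidden probability; that stand-in, the function names, and the choice of the widest interval among candidates are all assumptions made for illustration, not part of the paper.

```python
def critical_region(intervals, k):
    """(c, d): k-th largest left endpoint and (k+1)-th largest right endpoint."""
    lows = sorted((a for a, _ in intervals.values()), reverse=True)
    highs = sorted((b for _, b in intervals.values()), reverse=True)
    return lows[k - 1], highs[k]

def ms_topk(hidden_probs, k, max_steps=10_000):
    """Toy multisimulation loop; returns (top-k set, simulation steps used).

    hidden_probs: dict name -> true probability, used only by the stand-in
    simulator. Requires more than k objects.
    """
    intervals = {g: (0.0, 1.0) for g in hidden_probs}
    steps = 0

    def simulate(g):
        # Hypothetical refinement step standing in for one MC simulation step.
        nonlocal steps
        a, b = intervals[g]
        p = hidden_probs[g]
        intervals[g] = ((a + p) / 2, (b + p) / 2)
        steps += 1

    def width(g):
        a, b = intervals[g]
        return b - a

    while steps < max_steps:
        c, d = critical_region(intervals, k)
        if c >= d:  # critical region is empty: the intervals are k-separated
            return {g for g, (a, _) in intervals.items() if d <= a}, steps
        # Crossers are the intervals that contain the critical region.
        crossers = [g for g, (a, b) in intervals.items() if a <= c and d <= b]
        lower = [g for g in crossers if intervals[g][0] < c]
        upper = [g for g in crossers if d < intervals[g][1]]
        double = [g for g in lower if g in upper]
        if double:
            simulate(max(double, key=width))
        elif lower and upper:
            simulate(max(upper, key=width))
            simulate(max(lower, key=width))
        else:
            simulate(max(crossers, key=width))  # the maximal crosser
    raise RuntimeError("no k-separation reached within max_steps")

hidden = {"G1": 0.875, "G2": 0.625, "G3": 0.5, "G4": 0.125}
print(ms_topk(hidden, 2))
```

Objects that are not crossers are never simulated, which is where the savings over round robin come from in this toy setting.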

The Multisimulation Algorithm

Algorithm Guarantee

- The algorithm always terminates and returns the correct TopK. For any deterministic algorithm computing the top k and for any $c < 2$, there exists an instance on which its cost is at least $c N_{opt}$.
- Let A be any deterministic algorithm for finding TopK. Then (a) on any instance the cost of MS_TopK is at most twice the cost of A, and (b) for any $c < 1$ there exists an instance where the cost of A is greater than c times the cost of MS_TopK.

Discussion

- Extensions
- Extend MS to compute and rank the top k answers.
- $T_k$ = MS_TopK(G, k)
- $T_{k-1}$ = MS_TopK($T_k$, k-1)
- $T_{k-2}$ = MS_TopK($T_{k-1}$, k-2)
- ...
- $T_1$ = MS_TopK($T_2$, 1)
- Variation
- Any-time algorithm, which computes and returns the top answers in order 1, 2, 3, ... and can be stopped at any time

Review the assumptions

- Precision
- Each step: $P(p \in [a^N, b^N]) > 1 - \delta$; global precision $> 1 - \delta_0$: $(1 - \delta)^N \ge 1 - \delta_0$, $\delta \approx \delta_0 / N$
- Progress: fails in general
- After step N of MC, the midpoint moves by 1/N, while the width of the interval shrinks by only $O(1/N^{1/2} - 1/(N+1)^{1/2}) = O(N^{-3/2})$
- Solution: run MC for $N^{1/2}$ iterations at each step
- The midpoint then moves between steps $N$ and $N + N^{\alpha}$ by $N^{\alpha - 1}$
- $O(1/N^{1/2} - 1/(N + N^{\alpha})^{1/2}) = O(N^{\alpha - 3/2})$, $\alpha = 1/2$
- The MS algorithm runs at most $2(N_{opt} + N_{opt}^{1/2})$ steps

Optimization

- Initialize the intervals $[a_i, b_i]$ to better estimates than $[0, 1]$
- Eliminates low-ranking objects from the start
- Safe plan rewriting: identify subqueries whose probabilities can be computed inside the SQL engine.

Experiment

- Queries

Conclusion

- Describes a method for answering top-k queries on probabilistic databases
- Proves the technique to be near optimal and validates it experimentally

The End

- Thank you all!