Topk Query Processing in Uncertain Database - PowerPoint PPT Presentation

Slides: 57
Provided by: Jian83
1
Top-k Query Processing in Uncertain Database
  • Mohamed A. Soliman, Ihab F. Ilyas,
  • Kevin Chen-Chuan Chang. ICDE07
  • Kai, Jiang Fudan University

2
Outline
  • Introduction
  • Processing Framework
  • U-Topk Queries
  • U-kRanks Queries
  • Queries with Tuple Independence
  • Experiments
  • Conclusion

3
Introduction
  • Uncertain (probabilistic) data
  • sensor networks, moving objects tracking, data
    cleaning etc.
  • Uncertain data model
  • Possible worlds: a set of possible instances
  • Confidence: membership uncertainty
  • Generation rules: logical formulas that determine
    valid worlds
  • Independent tuples: tuples not correlated by any rule

4
Uncertain Database
5
Motivation and Challenges
  • Different from traditional top-k queries
  • The answer depends not only on the scoring function
    but also on membership probability
  • Two interesting top-k queries
  • Top-k speeding cars in the last hour
  • A ranking over the models of the top-k speeding
    cars
  • Interaction between "most probable" and "top-k":
    several different interpretations are possible
  • Both ranking and aggregation across worlds are
    involved, which is prohibitively expensive

6
Problem Definition: U-Topk
  • Uncertain Top-k Query (U-Topk)
  • Let D be an uncertain database with possible
    worlds space $PW = \{PW^1, \dots, PW^n\}$. Let
    $T = \{T^1, \dots, T^m\}$ be a set of k-length tuple
    vectors, where for each $T^i \in T$:
  • (1) Tuples of $T^i$ are ordered according to scoring
    function F
  • (2) $T^i$ is the top-k answer for a non-empty set
    of possible worlds $PW(T^i) \subseteq PW$
  • A U-Topk query, based on F, returns $T^* \in T$, where
    $T^* = \arg\max_{T^i \in T} \sum_{w \in PW(T^i)} \Pr(w)$
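The definition can be checked on tiny inputs by brute force. The sketch below assumes independent tuples (no generation rules) and a hypothetical representation of each tuple as a (name, score, probability) triple; the function name and the sample data are illustrative only. It enumerates every possible world, so it is exponential in the number of tuples.

```python
from collections import defaultdict
from itertools import product


def u_topk_bruteforce(tuples, k):
    """Return the most probable k-length top-k vector and its probability.

    tuples: list of (name, score, prob); tuple independence is an assumption.
    """
    answer_prob = defaultdict(float)
    for mask in product([False, True], repeat=len(tuples)):
        world_prob = 1.0
        world = []
        for present, (name, score, prob) in zip(mask, tuples):
            world_prob *= prob if present else 1.0 - prob
            if present:
                world.append((score, name))
        if len(world) < k:
            continue  # this world has no k-length top-k vector
        world.sort(reverse=True)  # highest score first
        topk = tuple(name for _, name in world[:k])
        # Aggregate the probability of all worlds sharing this top-k vector.
        answer_prob[topk] += world_prob
    return max(answer_prob.items(), key=lambda kv: kv[1])


cars = [("t1", 130, 0.4), ("t2", 120, 0.7), ("t3", 110, 0.3), ("t4", 100, 1.0)]
best_vector, best_prob = u_topk_bruteforce(cars, 2)
# best_vector is ('t2', 't4'), with probability about 0.294
```

Note that the vector containing the two highest-scored tuples, (t1, t2), has probability 0.28 and is not the answer.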

7
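Problem Definition: U-kRanks
  • Uncertain k Ranks Query (U-kRanks)
  • Let D be an uncertain database with possible
    worlds space $PW = \{PW^1, \dots, PW^n\}$. For
    $i = 1, \dots, k$, let $\{x_i^1, \dots, x_i^m\}$ be a set
    of tuples, where each tuple $x_i^j$ appears at rank i
    in a non-empty set of possible worlds
    $PW(x_i^j) \subseteq PW$ based on scoring function F.
    A U-kRanks query, based on F, returns
    $\{x_i^* ; i = 1, \dots, k\}$, where
    $x_i^* = \arg\max_{x_i^j} \sum_{w \in PW(x_i^j)} \Pr(w)$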
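A brute-force sketch of the U-kRanks semantics, again assuming independent tuples and hypothetical (name, score, probability) triples. For each rank it sums the probability of the worlds in which a tuple occupies that rank and reports the most probable tuple per rank.

```python
from collections import defaultdict
from itertools import product


def u_kranks_bruteforce(tuples, k):
    """Return [(name, prob)] for ranks 1..k; independence is an assumption."""
    rank_prob = [defaultdict(float) for _ in range(k)]
    for mask in product([False, True], repeat=len(tuples)):
        world_prob = 1.0
        world = []
        for present, (name, score, prob) in zip(mask, tuples):
            world_prob *= prob if present else 1.0 - prob
            if present:
                world.append((score, name))
        world.sort(reverse=True)  # highest score first
        for i, (_, name) in enumerate(world[:k]):
            rank_prob[i][name] += world_prob  # name sits at rank i + 1
    return [max(p.items(), key=lambda kv: kv[1]) for p in rank_prob]


cars = [("t1", 130, 0.4), ("t2", 120, 0.7), ("t3", 110, 0.3), ("t4", 100, 1.0)]
ranks = u_kranks_bruteforce(cars, 2)
# rank 1: t2 (about 0.42), rank 2: t4 (about 0.432)
```

Each rank is decided separately, so in general the reported tuples need not form the top-k vector of any single world.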

8
Processing Framework
9
Data Access
  • Theorem: Among all sequential access methods,
    sorted score access is optimal in the number of
    retrieved tuples to answer uncertain top-k
    queries.
  • An algorithm A that retrieves tuples sequentially
    out of score order cannot decide whether a seen
    tuple t belongs to any possible top-k answer.
  • Retrieving tuples in confidence order is also not
    optimal: it cannot guarantee that it has seen all
    tuples with higher scores than t.

10
Process Overview
11
Computing State Probabilities
  • Probability Reduction
  • Extending a combination of tuple events by adding
    another tuple existence/absence event results in
    a new combination with at most the same
    probability
  • State Probability
  • d: number of accesses to D in F order;
    $P(s_l) = \Pr(s_l \cap \neg I(s_l, d))$
  • State Extension (extend $s_l$ with tuple t)
  • A modified version of $s_l$, with the event $\neg t$
  • A state $s_{l+1}$: $s_l$ appended by t, with the event t

12
U-Topk Queries
  • OptU-Topk algorithm
  • Buffer the ranked tuples retrieved from D
  • Q: a priority queue of states ordered by
    probability, with ties broken by state length;
    initialized with the empty state $s_{0,0}$, where
    $P(s_{0,0}) = 1$
  • Lazy materialization: at each step, extend only
    the state with the highest probability into its
    two possible successor states
  • Terminate when the top state of Q is a complete
    state
  • Can be extended to return the n most probable
    U-Topk answers
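The best-first search described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes independent tuples, so a state's probability is just the product of the existence and absence probabilities of the tuples seen so far, and it holds all tuples in memory rather than reading them from a ranked stream. The names and the state encoding are hypothetical.

```python
import heapq
from itertools import count


def opt_u_topk(tuples, k):
    """Best-first search over states; tuples are (name, score, prob)."""
    ranked = sorted(tuples, key=lambda t: t[1], reverse=True)  # score order
    tie = count()  # keeps heap entries comparable
    # Heap entry: (-probability, -length, tiebreak, depth, included names).
    # Negation turns heapq's min-heap into a max-heap on probability,
    # with ties broken in favour of longer states.
    queue = [(-1.0, 0, next(tie), 0, ())]
    while queue:
        neg_p, neg_len, _, depth, included = heapq.heappop(queue)
        if -neg_len == k:
            return included, -neg_p  # top state is complete: done
        if depth == len(ranked):
            continue  # no tuples left to extend this state
        name, _, prob = ranked[depth]
        if prob < 1.0:  # extension with the event "t absent"
            heapq.heappush(
                queue,
                (neg_p * (1.0 - prob), neg_len, next(tie), depth + 1, included),
            )
        if prob > 0.0:  # extension with the event "t present"
            heapq.heappush(
                queue,
                (neg_p * prob, neg_len - 1, next(tie), depth + 1, included + (name,)),
            )
    return None  # fewer than k tuples can exist


cars = [("t1", 130, 0.4), ("t2", 120, 0.7), ("t3", 110, 0.3), ("t4", 100, 1.0)]
answer, prob = opt_u_topk(cars, 2)
# answer is ('t2', 't4'), with probability about 0.294
```

Stopping at the first complete state popped from the queue is justified by the probability reduction property: extending a state never increases its probability, so no state still in the queue can lead to a more probable complete state.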

13
(No Transcript)
14
Optimality
  • Among all algorithms that access tuples ordered
    on score, OptU-Topk is optimal in the number of
    accessed tuples.
  • Let x be the reported answer by OptU-Topk. Among
    all algorithms that access tuples ordered on
    score, there is no algorithm that can skip a
    state visited by OptU-Topk and report x as the
    U-Topk query answer.

15
U-kRanks Queries
  • OptU-kRanks
  • Extend maintained states based on each seen tuple
  • Compute $P_{t,i}$ for $i = 1, \dots, k$
  • For each rank i, remember the most probable
    answer obtained so far
  • Terminate at rank i when
  • Optimality
  • Among all algorithms that access tuples ordered
    on score, OptU-kRanks is optimal in number of
    accessed tuples.

16
(No Transcript)
17
U-Topk Queries with Tuple Independence
  • Under tuple independence, if all states are
    maintained after seeing the same tuples, two states
    $x_n$ and $y_m$ with $n = m$ would follow the same path
    to reach a complete state. If $P(x_n) > P(y_m)$,
    prune $y_m$.
  • IndepU-Topk groups states into equivalence
    classes based on their lengths and keeps at most one
    state for each length $0, \dots, k$ in a candidate set.
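A minimal sketch in the spirit of this pruning rule, assuming independent tuples given as hypothetical (name, score, probability) triples. It keeps one best state per length and one best complete state, and stops once no incomplete state is more probable than the best complete one. The bookkeeping details are illustrative and are not taken from the paper.

```python
def indep_u_topk(tuples, k):
    """Keep at most one candidate state per length 0..k."""
    ranked = sorted(tuples, key=lambda t: t[1], reverse=True)
    best = {0: (1.0, ())}          # length -> (probability, included names)
    best_complete = (0.0, None)    # best state of length k seen so far
    for name, _, prob in ranked:
        new_best = {}
        for length, (p, names) in best.items():
            # Extension with the event "tuple absent": same length.
            absent = (p * (1.0 - prob), names)
            if absent[0] > new_best.get(length, (0.0, None))[0]:
                new_best[length] = absent
            # Extension with the event "tuple present": length + 1.
            present = (p * prob, names + (name,))
            if length + 1 == k:
                if present[0] > best_complete[0]:
                    best_complete = present
            elif present[0] > new_best.get(length + 1, (0.0, None))[0]:
                new_best[length + 1] = present
        best = new_best
        # Incomplete states can only lose probability when extended.
        if best_complete[1] is not None and all(
            best_complete[0] >= p for p, _ in best.values()
        ):
            break
    return best_complete[1], best_complete[0]


cars = [("t1", 130, 0.4), ("t2", 120, 0.7), ("t3", 110, 0.3), ("t4", 100, 1.0)]
answer, prob = indep_u_topk(cars, 2)
# answer is ('t2', 't4'), with probability about 0.294
```

Because states of equal length seen at the same depth face identical future multipliers under independence, only the most probable one per length can ever become the answer, which bounds the candidate set by k + 1 states.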

18
IndepU-Topk
19
U-kRanks Queries with Tuple Independence
  • Dynamic Programming
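One way such a dynamic program can look, as a sketch under stated assumptions: tuples are independent, scores are distinct, and each tuple is a hypothetical (name, score, probability) triple. A tuple t is at rank i exactly when t exists and exactly i - 1 higher-scored tuples exist, so it is enough to maintain the distribution of the number of existing tuples seen so far. This version scans all tuples and omits any early-termination test.

```python
def indep_u_kranks(tuples, k):
    """Return [(name, prob)] for ranks 1..k under tuple independence."""
    ranked = sorted(tuples, key=lambda t: t[1], reverse=True)
    # count_prob[j]: probability that exactly j tuples seen so far exist.
    count_prob = [1.0] + [0.0] * (k - 1)
    best = [(None, 0.0)] * k
    for name, _, prob in ranked:
        for i in range(k):
            # Rank i + 1 requires this tuple plus exactly i higher-scored ones.
            p_rank = prob * count_prob[i]
            if p_rank > best[i][1]:
                best[i] = (name, p_rank)
        # Fold the current tuple into the count distribution (high to low,
        # so count_prob[j - 1] is still the value before this tuple).
        for j in range(k - 1, 0, -1):
            count_prob[j] = count_prob[j] * (1.0 - prob) + count_prob[j - 1] * prob
        count_prob[0] *= 1.0 - prob
    return best


cars = [("t1", 130, 0.4), ("t2", 120, 0.7), ("t3", 110, 0.3), ("t4", 100, 1.0)]
ranks = indep_u_kranks(cars, 2)
# rank 1: t2 (about 0.42), rank 2: t4 (about 0.432)
```

The cost is O(n k) for n tuples, compared with the exponential cost of enumerating possible worlds.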

20
Experiment
21
Experiment
22
Experiment
23
Experiment
24
Experiment
25
Conclusion
  • First paper to address top-k query processing
    under possible worlds semantics
  • Formulates the problem as a state space search and
    gives query algorithms with optimality guarantees on
    accessed tuples and materialized search states
  • The processing framework leverages existing
    techniques and can be integrated with existing DBMSs

26
Efficient Top-k Query Evaluation on
Probabilistic Data
  • Christopher Re, Nilesh Dalvi,
  • Dan Suciu. ICDE07

27
Outline
  • Introduction
  • Preliminaries
  • Top-k Query Evaluations
  • Discussion
  • Experiments
  • Conclusion

28
Introduction
  • Imprecise data: probabilistic databases
  • Compute and rank the top-k answers of a SQL query
  • Answers with approximate probabilities
  • Shift focus from probabilities to ranks

29
Application Example
30
Challenges
  • Computing the exact output probabilities is
    computationally hard (#P-complete)
  • Number of potential answers is large
  • 1415 answers in the previous example
  • Only interested in the first few of them

31
Approach given in the paper
  • Guarantee
  • The top k answers are correct
  • The ranking of the top k answers is correct
  • Limitations
  • Probabilities listed explicitly
  • Do not handle continuous attribute values

32
Possible World
  • Definition
  • A probabilistic database over schema S is a pair
    (W, P), where $W = \{W_1, \dots, W_n\}$ is a set of
    database instances over S, and $P: W \to [0, 1]$ is a
    probability distribution. Each instance $W_j$ for
    which $P(W_j) > 0$ is called a possible world.

33
Possible World
  • Representation
  • Table
  • Constraint
  • Each instance $J^p$ over schema $S^p$ represents a
    probabilistic database over S, denoted $Mod(J^p)$
  • S: a single relation name
    $R(A_1, \dots, A_m, B_1, \dots, B_n)$, written R(A, B)
  • $J^p$: table $R^p(A, B, p)$
  • $W_j$: subsets of

34
Example
35
DNF Formulas over Tuples
  • Definition
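  • Let (W, P) be a probabilistic database and
    $t_1, t_2, \dots$ all its tuples
  • $t_i$ is true if $t_i \in W$ and false if $t_i \notin W$
  • For a formula E, $P(E) = \sum \{P(W_i) \mid E \text{ is true in } W_i\}$
  • $E = (t_1 \wedge t_5) \vee t_2$,
    $P(E) = P(W_3) + P(W_7) + P(W_{10}) + P(W_{11})$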
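The probability of a formula is the total probability of the worlds in which it holds. The sketch below uses a small, made-up set of four possible worlds (not the worlds W3, W7, W10, W11 of the slide's example, whose table is not reproduced in this transcript) and a hypothetical helper name.

```python
def formula_probability(worlds, formula):
    """Sum P(W) over the worlds W in which the formula is true.

    worlds: list of (set of tuple names, probability); probabilities sum to 1.
    formula: a function from a world (set of names) to bool.
    """
    return sum(p for world, p in worlds if formula(world))


# Hypothetical possible worlds, for illustration only.
worlds = [
    ({"t1", "t5"}, 0.2),
    ({"t2"}, 0.3),
    ({"t1"}, 0.1),
    ({"t3", "t5"}, 0.4),
]


def e(world):
    # E = (t1 AND t5) OR t2
    return ("t1" in world and "t5" in world) or "t2" in world


prob = formula_probability(worlds, e)
# E holds in the first two worlds, so prob is about 0.5
```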

36
Queries
  • Consider SQL queries of the form
  • Aggregate operators: sum, count (= sum(1)), min, max
  • Given (W,P), the answer is a table like

37
Possible worlds semantics
  • SQL query on $W_j$: a set of tuples

38
Semantics based on DNF Formulas
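  • Possible worlds semantics is not practical
  • DNF formulas
  • Modify $q \to q_e$
  • Evaluate $q_e$ on $J^p$ and denote the answer ET
  • Form of ET: $t = (t_1, \dots, t_r)$, with each $t_i$ a
    tuple of one of the joined tables
  • $t.E = t_1 \wedge t_2 \wedge \dots \wedge t_r$
  • P(t.E) can be computed easily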

39
DNF Formulas (continued)
  • Partition ET by GROUP-BY
  • $ET = G_1 \cup G_2 \cup \dots \cup G_n$, $G = \{t_1, \dots, t_m\}$
  • Computing P(G.E) is #P-complete

40
Monte Carlo (MC) Simulation
  • Naïve MC algorithm
  • Approximate P(G.E) by repeatedly choosing a random
    possible world and computing the frequency with
    which G.E is true
  • Luby and Karp's improved MC algorithm
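The naive estimator can be sketched as below. It assumes, for simplicity, that tuples are independent with known probabilities and that the DNF formula is given as a list of clauses, each a set of tuple names that must all be present. These representations and the function name are hypothetical, and this is the naive sampler only, not Luby and Karp's improved algorithm.

```python
import random


def naive_mc(tuple_probs, dnf, steps, seed=0):
    """Estimate P(DNF is true) as the fraction of sampled worlds satisfying it.

    tuple_probs: dict name -> existence probability (independence assumed).
    dnf: list of sets; a clause is true when all its tuples are in the world.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(steps):
        # Draw one possible world by including each tuple independently.
        world = {t for t, p in tuple_probs.items() if rng.random() < p}
        if any(world.issuperset(clause) for clause in dnf):
            hits += 1
    return hits / steps


probs = {"t1": 0.5, "t2": 0.4, "t5": 0.5}
dnf = [{"t1", "t5"}, {"t2"}]  # (t1 AND t5) OR t2
estimate = naive_mc(probs, dnf, steps=20000)
# The exact value here is 1 - (1 - 0.25) * (1 - 0.4) = 0.55.
```

The estimate converges slowly: the standard error after N samples is on the order of $1/\sqrt{N}$, which is why minimizing the number of simulation steps matters.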

41
Property of Luby and Karp's algorithm
  • Let $\delta > 0$
  • m: the number of disjuncts
  • N: the number of steps executed
  • Define
  • Then

42
Top-k Query Evaluation
  • Evaluation has two parts
  • Evaluate the extended SQL query qe and group the
    answer tuples
  • Run an MC simulation on each group to compute the
    probabilities, then return the k groups with the
    highest probabilities
  • Goal: minimize the total number of simulation
    steps

43
Multisimulation (MS)
  • $G = \{G_1, \dots, G_n\}$ with unknown probabilities
    $p_1, \dots, p_n$; the goal is to find the k objects
    with the highest probabilities, denoted $TopK \subseteq G$
  • Assumptions

44
Multisimulation
  • Given two intervals $[a_i, b_i]$ and $[a_j, b_j]$, if
    $b_i \le a_j$, the first is below and the second is above
  • Two such intervals are separated; in that case we
    know $p_i < p_j$
  • A set of n intervals is k-separated if there is a
    set $T \subseteq G$ of k intervals such that every
    interval in T is above every interval not in T

45
Multisimulation
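  • Sound strategy: round robin
  • Cost: $n \cdot N_{opt}$
  • Notations and definitions
  • Let $top_k(x_1, \dots, x_n)$ be the k-th largest value
  • Critical region:
    $(c, d) = (top_k(a_1, \dots, a_n), top_{k+1}(b_1, \dots, b_n))$
  • Top objects: $T = \{G_i \mid d \le a_i\}$, $T \subseteq TopK$
  • Bottom objects: $B = \{G_i \mid b_i \le c\}$, $B \cap TopK = \emptyset$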

46
Multisimulation
  • There is a k-separation iff the critical region is
    empty, i.e. $c \ge d$; then $TopK = T$
  • $G_i$ is a double crosser if $a_i < c$ and $d < b_i$
  • $G_i$ is a lower (upper) crosser if $a_i < c$ ($d < b_i$)
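The critical region and the object classes can be computed directly from the current intervals. The sketch below assumes intervals are stored in a dict from a hypothetical object name to its (a, b) pair and that 1 <= k < n; the function names and the sample intervals are illustrative.

```python
def critical_region(intervals, k):
    """Return (c, d): the k-th largest lower bound and the (k+1)-th
    largest upper bound. Requires 1 <= k < len(intervals)."""
    lows = sorted((a for a, _ in intervals.values()), reverse=True)
    highs = sorted((b for _, b in intervals.values()), reverse=True)
    return lows[k - 1], highs[k]


def classify(intervals, k):
    """Return (c, d, top objects, bottom objects, double crossers)."""
    c, d = critical_region(intervals, k)
    top = {g for g, (a, b) in intervals.items() if d <= a}
    bottom = {g for g, (a, b) in intervals.items() if b <= c}
    double = {g for g, (a, b) in intervals.items() if a < c and d < b}
    return c, d, top, bottom, double


def is_k_separated(intervals, k):
    c, d = critical_region(intervals, k)
    return c >= d  # empty critical region


intervals = {"G1": (0.8, 0.9), "G2": (0.2, 0.7), "G3": (0.4, 0.6), "G4": (0.1, 0.3)}
c, d, top, bottom, double = classify(intervals, 2)
# c = 0.4, d = 0.6, top = {'G1'}, bottom = {'G4'}, double = {'G2'}
```

In this example the critical region (0.4, 0.6) is not empty, so the intervals are not yet 2-separated and G2, the only double crosser, would be the natural candidate to simulate next.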

47
Multisimulation
  • MS Algorithm
  • First, try a double crosser
  • Otherwise, try to find an upper and lower crosser
    pair
  • If no such pair exists, either all crossers have
    the same left endpoint $a_i = c$ or all have the same
    right endpoint $d = b_i$; pick the maximal crosser
  • After each iteration, recompute the critical
    region
  • Stop when $c \ge d$ and return the set T

48
The Multisimulation Algorithm
49
Algorithm Guarantee
  • The algorithm always terminates and returns the
    correct TopK. For any deterministic algorithm
    computing the top k and for any $c < 2$ there exists
    an instance on which its cost is at least $c \cdot N_{opt}$.
  • Let A be any deterministic algorithm for finding
    TopK. Then (a) on any instance the cost of
    MS_TopK is at most twice the cost of A, and (b)
    for any $c < 1$ there exists an instance where the
    cost of A is greater than c times the cost of
    MS_TopK.

50
Discussion
  • Extensions
  • Extend MS to compute and rank the top k
    answers
  • $T_k$ = MS_TopK(G, k)
  • $T_{k-1}$ = MS_TopK($T_k$, k-1)
  • $T_{k-2}$ = MS_TopK($T_{k-1}$, k-2)
  • ...
  • $T_1$ = MS_TopK($T_2$, 1)
  • Variation
  • Any-time algorithm which computes and returns the
    top answers in order 1, 2, 3, ... and can be stopped
    at any time

51
Review the assumptions
  • Precision
  • Each step: $P(p \in [a^N, b^N]) > 1 - \delta$; for global
    precision $> 1 - \delta_0$, require
    $(1 - \delta)^N \ge 1 - \delta_0$, so $\delta \approx \delta_0 / N$
  • Progress fails in general
  • After step N of MC, the midpoint may move by 1/N,
    while the width of the interval shrinks only by
    $O(1/N^{1/2} - 1/(N+1)^{1/2}) = O(N^{-3/2})$
  • Solution: run MC for $N^{1/2}$ iterations at each step
  • The midpoint moves between steps N and $N + N^a$ by
    $N^{a-1}$, while the width shrinks by
    $O(1/N^{1/2} - 1/(N + N^a)^{1/2}) = O(N^{a-3/2})$, with $a = 1/2$
  • The MS algorithm then runs at most
    $2(N_{opt} + N_{opt}^{1/2})$ steps

52
Optimization
  • Initialize the intervals $[a_i, b_i]$ to better
    estimates than [0, 1]
  • This eliminates low-ranking objects from the start
  • Safe plan rewriting: identify subqueries whose
    probabilities can be computed inside the SQL
    engine

53
Experiment
  • Queries

54
Experiment
55
Conclusion
  • Describe a method for answering top-k queries on
    probabilistic databases
  • Prove the technique to be near optimal and
    validate it experimentally

56
The End
  • Thank you all!