Title: Probabilistic Skylines on Uncertain Data (VLDB 2007), Jian Pei et al.
1 Probabilistic Skylines on Uncertain Data (VLDB 2007), Jian Pei et al.
- Supervisor: Dr Benjamin Kao
- Presenter For
- Date: 22 Feb 2008
?? The possible-world concept
2 Outline
- Motivation
- Traditional and Probabilistic Skyline
- Problem Definition
- Computation Problem and Algorithms (Top-down and Bottom-up)
- Experimental Results
3 Motivation: Skyline Analysis on NBA Players' Performance
Each player has multiple records.
First read the topic and then the subtopic to let others know what you are doing.
Define the skyline and explain the graph: the larger the value, the better.
Instance e dominates b, d, and c.
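The dominance test and the traditional skyline mentioned in these notes can be sketched in Python. This is a minimal sketch, assuming larger values are better as in the NBA graph; the coordinates of points a to e are invented for illustration and are not read off the slide's figure.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """True if p is at least as good as q everywhere and strictly better somewhere.

    Larger values are better here, matching the NBA example.
    """
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def skyline(points: Dict[str, Point]) -> List[str]:
    """Names of the points that no other point dominates."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


# Hypothetical (points, assists) pairs, invented for this sketch.
games = {"a": (5, 30), "b": (10, 10), "c": (12, 8), "d": (8, 12), "e": (20, 20)}

print(skyline(games))  # e dominates b, c and d, so only a and e remain
```

With these made-up values, e dominates b, c and d, while a survives because of its high second attribute.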
4 Motivation: Skyline Analysis on NBA Players with Multiple Records
5 Motivation: Skyline Analysis on NBA Players with Multiple Records
- Easy approach: averaging
- Arbor (x) is better in assists than Eddy, but Eddy (point b) dominates all games of Arbor (x).
- A single game of Bob (point a) biases his aggregate value.
It is not fair to say that Eddy is worse in assists than Arbor.
It is not fair to Bob to be severely affected by only one game.
Completed. Missing: a new graph is needed.
6 Motivation: Motivating Result Using the Probabilistic Skyline
- Olajuwon and Kobe Bryant are missing from the aggregate skyline but present in the probabilistic skyline.
- Their performance varies a lot over games.
- Details are given in the experiment analysis.
Completed. Missing: pictures of them.
7 Traditional and Probabilistic Skyline: Semantic Difference of Dominance between Objects
Certain Data
Uncertain Data
- Dominance
- Certain model: an object dominates another object with probability 1.
- Uncertain model: an object dominates another object with probability P.
Assume that the smaller the value, the better.
Missing: a Flash animation showing the calculation would be better.
Consider object d
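14 Probabilistic Skyline: Calculation of the Probability that Object A Dominates Object C
Pr[A ≺ C] = 1/4 × 1/3 × (4 + 4 + 0) = 2/3
For easier illustration, the discrete case is used.
Explanation of symbols: 1/4 and 1/3 are the per-instance probabilities of the two objects; the sum 4 + 4 + 0 counts the instance pairs (a, c) in which a dominates c.
Completed. Missing: a Flash animation to demonstrate the calculation of dominance.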
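The discrete dominance probability computed on this slide can be reproduced with a few lines of Python. This is a sketch under stated assumptions: smaller values are better, instances of an object are equally likely, A is taken to have 4 instances and C to have 3, and the coordinates are invented so that the pair counts come out as 4, 4 and 0. The function name `dominance_probability` is illustrative only.

```python
from fractions import Fraction
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """Smaller is better: p dominates q if p <= q everywhere and p < q somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def dominance_probability(U: List[Point], V: List[Point]) -> Fraction:
    """Pr[U dominates V] for discrete objects whose instances are equally likely."""
    pairs = sum(1 for u in U for v in V if dominates(u, v))
    return Fraction(pairs, len(U) * len(V))


# Hypothetical instances: all four instances of A dominate two instances of C,
# and none dominates the third, giving the counts 4, 4 and 0.
A = [(1, 4), (2, 3), (3, 2), (4, 1)]
C = [(5, 5), (6, 6), (0, 9)]

print(dominance_probability(A, C))  # 2/3
```

Exact fractions are used instead of floats so that the result can be compared directly with the hand calculation.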
15 Probabilistic Skyline: From Dominance to Skyline
- Intuition for finding the skyline: the probability that an object is not dominated by any other object.
OK. Missing: a Flash animation to do the grouping of objects A, B, C.
OK. Please change the equation of 0 (1/3)(1/3).
16 Probabilistic Skyline: Idea
- Intuition
- 1) We know the definition of dominance.
- 2) Skyline: objects not dominated by other objects.
Missing: a demonstration of "not dominated" for objects A and B.
Consider object A, instance by instance.
17 Probabilistic Skyline: Idea
- Intuition
- 1) We know the definition of dominance.
- 2) Skyline: objects not dominated by other objects.
Missing: a demonstration of "not dominated" for objects A and B. We see that the instance of object A is not dominated by instances of other objects.
20 Probabilistic Skyline: Idea
- Intuition
- No instance of object A is dominated by instances of other objects, so the probability of object A being dominated is 0. The skyline probability of object A is therefore 1.
OK. Missing: a demonstration of "not dominated" for objects A and B.
24 Probabilistic Skyline: Calculation of the Skyline Probability
Pr(D) = ?
Pr(d1) = (1 − 1/4) = 3/4
Pr(d2) = (1 − 1/4)(1 − 2/3) = 1/4
Pr(d3) = (1 − 1/4) = 3/4
Pr(D) = 1/3 × (3/4 + 1/4 + 3/4) = 7/12
OK. Missing: another Flash animation to show the calculation of the skyline probability 7/12.
?? Where to explain the consequence of an instance being dominated by an object.
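The calculation of Pr(D) on this slide follows one pattern: for each instance, multiply (1 − fraction of dominating instances) over every other object, then average over the object's own instances. Below is a minimal Python sketch of that pattern. The coordinates are hypothetical, chosen only so that the factors match the slide (one of four instances of A dominates every instance of D, and two of three instances of C dominate d2); which object has four or three instances is an assumption.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Objects = Dict[str, List[Point]]


def dominates(p: Point, q: Point) -> bool:
    """Smaller is better: p dominates q if p <= q everywhere and p < q somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def instance_probability(u: Point, owner: str, objects: Objects) -> Fraction:
    """Probability that instance u is dominated by no other object."""
    prob = Fraction(1)
    for name, instances in objects.items():
        if name == owner:
            continue
        dominating = sum(1 for v in instances if dominates(v, u))
        prob *= 1 - Fraction(dominating, len(instances))
    return prob


def skyline_probability(owner: str, objects: Objects) -> Fraction:
    """Average of the instance probabilities (instances are equally likely)."""
    own = objects[owner]
    return sum(instance_probability(u, owner, objects) for u in own) / len(own)


# Hypothetical coordinates, invented to reproduce the factors on the slide.
objects = {
    "A": [(2, 2), (1, 8), (8, 1), (9, 0.5)],
    "B": [(0, 10), (0.5, 9)],
    "C": [(4, 4), (5, 5.5), (7, 7)],
    "D": [(3, 3), (6, 6), (2.5, 5)],
}

print(skyline_probability("D", objects))  # 7/12
```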
25 Probabilistic Skyline: The p-skyline
- 1-skyline: {A, B}
- 7/12-skyline: {A, B, D}
If you have time, use the formula to find the skyline probability of object C as well.
26 Problem Definition
- Given a set of uncertain objects S and a probability threshold p (0 ≤ p ≤ 1), the problem of probabilistic skyline computation is to compute the p-skyline on S, that is, the objects whose skyline probability is at least p.
- 1-skyline: {A, B}
- 7/12-skyline: {A, B, D}
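A brute-force reading of this problem definition is to compute every object's skyline probability and keep those that reach the threshold. The sketch below does exactly that; it is a naive baseline for illustration, not either of the algorithms presented later. The data set is hypothetical, built so that the 1-skyline and 7/12-skyline match the sets listed above, and `p_skyline` is an illustrative name.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Objects = Dict[str, List[Point]]


def dominates(p: Point, q: Point) -> bool:
    """Smaller is better: p dominates q if p <= q everywhere and p < q somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline_probability(owner: str, objects: Objects) -> Fraction:
    """Mean over the owner's instances of the probability of not being dominated."""
    total = Fraction(0)
    for u in objects[owner]:
        prob = Fraction(1)
        for name, instances in objects.items():
            if name != owner:
                hits = sum(1 for v in instances if dominates(v, u))
                prob *= 1 - Fraction(hits, len(instances))
        total += prob
    return total / len(objects[owner])


def p_skyline(objects: Objects, p: Fraction) -> List[str]:
    """Objects whose skyline probability is at least p (naive, no pruning)."""
    return sorted(name for name in objects if skyline_probability(name, objects) >= p)


# Hypothetical coordinates, invented for this sketch.
objects = {
    "A": [(2, 2), (1, 8), (8, 1), (9, 0.5)],
    "B": [(0, 10), (0.5, 9)],
    "C": [(4, 4), (5, 5.5), (7, 7)],
    "D": [(3, 3), (6, 6), (2.5, 5)],
}

print(p_skyline(objects, Fraction(1)))      # ['A', 'B']
print(p_skyline(objects, Fraction(7, 12)))  # ['A', 'B', 'D']
```

Every instance is compared with every instance of every other object, which is the cost the bounding and pruning techniques aim to avoid.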
27 Computation Problem of the p-skyline
- First, each uncertain object may have many instances. We have to process a large number of instances.
- Second, we have to consider many probabilities in deriving the probabilistic skylines.
28 Algorithms (Top-down and Bottom-up)
- Data
- Multiple records of each object, in the hope of approximating its probability density function
- Techniques
- Bounding
- Pruning
- Refining
The full algorithms are very detailed; the techniques the authors use for efficient pruning will be discussed.
Assumption: the smaller the value, the better.
Please tell the audience clearly what data is being processed.
29 Bottom-up Algorithm: The Minimum Bounding Box (MBB) Technique
OK. Missing: a Flash animation drawing the bounding box of object D and demonstrating the two properties.
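The two corners of a minimum bounding box are simply the per-dimension minimum and maximum of an object's instances. The sketch below computes them and checks two properties that the slide does not spell out and that are therefore assumed here: Umin is at least as good as every instance, and every instance is at least as good as Umax (smaller is better). The instances of D are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def mbb(instances: List[Point]) -> Tuple[Point, Point]:
    """Return (u_min, u_max): the lower and upper corners of the bounding box."""
    dims = range(len(instances[0]))
    u_min = tuple(min(u[i] for u in instances) for i in dims)
    u_max = tuple(max(u[i] for u in instances) for i in dims)
    return u_min, u_max


def weakly_dominates(p: Point, q: Point) -> bool:
    """p is no worse than q in every dimension (smaller is better)."""
    return all(a <= b for a, b in zip(p, q))


# Hypothetical instances of object D, invented for this sketch.
D = [(3, 3), (6, 6), (2.5, 5)]
d_min, d_max = mbb(D)

print(d_min, d_max)  # (2.5, 3) (6, 6)
print(all(weakly_dominates(d_min, d) and weakly_dominates(d, d_max) for d in D))  # True
```

Note that the corners are virtual points: d_min above is not itself an instance of D.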
30 Bottom-up Algorithm: Pruning Techniques (1/3), Using Umin and Umax to Decide Membership of the p-skyline
- For an uncertain object U and probability threshold p, if Pr(Umin) < p, then U is not in the p-skyline. If Pr(Umax) ≥ p, then U is in the p-skyline.
OK. Missing: a Flash animation using Figure 3 to illustrate.
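The membership test with the two corners can be sketched as a three-way decision. The sketch assumes the rule reads "Pr(Umin) < p means U is out, Pr(Umax) ≥ p means U is in", which follows from Pr(Umax) ≤ Pr(U) ≤ Pr(Umin); the function names and the data set are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Objects = Dict[str, List[Point]]


def dominates(p: Point, q: Point) -> bool:
    """Smaller is better: p dominates q if p <= q everywhere and p < q somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def point_probability(point: Point, others: List[List[Point]]) -> Fraction:
    """Probability that a (possibly virtual) point is dominated by no other object."""
    prob = Fraction(1)
    for instances in others:
        hits = sum(1 for v in instances if dominates(v, point))
        prob *= 1 - Fraction(hits, len(instances))
    return prob


def decide_membership(name: str, objects: Objects, p: Fraction) -> str:
    """Return 'out', 'in' or 'undecided' using only the two MBB corners."""
    own = objects[name]
    others = [inst for other, inst in objects.items() if other != name]
    dims = range(len(own[0]))
    u_min = tuple(min(u[i] for u in own) for i in dims)
    u_max = tuple(max(u[i] for u in own) for i in dims)
    if point_probability(u_min, others) < p:
        return "out"        # even the best corner cannot reach p
    if point_probability(u_max, others) >= p:
        return "in"         # even the worst corner reaches p
    return "undecided"      # instances must be examined


# Hypothetical coordinates, invented for this sketch.
objects = {
    "A": [(2, 2), (1, 8), (8, 1), (9, 0.5)],
    "B": [(0, 10), (0.5, 9)],
    "C": [(4, 4), (5, 5.5), (7, 7)],
    "D": [(3, 3), (6, 6), (2.5, 5)],
}

for p in (Fraction(4, 5), Fraction(1, 5), Fraction(7, 12)):
    print(p, decide_membership("D", objects, p))
```

In this example Pr(Dmin) = 3/4 and Pr(Dmax) = 1/4, so only thresholds between the two leave D undecided.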
31 Bottom-up Algorithm: Pruning Techniques (2/3), Using Umax to Prune Instances of Objects
- Let U and V be uncertain objects such that U ≠ V. If u is an instance of U and Vmax ≺ u, then Pr(u) = 0.
c2 is dominated by the max corner of object D (Dmax), so it is dominated by all instances of D:
Pr(c2) = (1 − 3/3)(…)(…) = 0
OK. Missing: a Flash animation using the equation ()()() to illustrate.
32 Bottom-up Algorithm: Pruning Techniques (3/3), Using a Subset of Instances to Prune Objects
Estimate an upper bound of Pr(Vmin) using Pr(U'max), where U' is a subset of the instances of U.
Pr(Vmin) = (1 − |U'|/|U|)(…)(…)
If |U'| is large, more instances dominate Vmin, so Pr(Vmin) is low.
? How to say this better
OK. A Flash illustration would be better.
You can take min Pr(u) for easy understanding.
To estimate the upper bound of Pr(Vmin) using U'max, assume all points of U appear only in U' and the green region, so that Vmin is dominated by fewer objects.
33 Bottom-up Algorithm: Pruning Techniques (3/3), Using a Subset of Instances to Prune Objects
- Special case
- As a special case, if there exists an instance u ∈ U such that Pr(u) < p and u ≺ Vmin, then Pr(V) < p and V is not in the p-skyline.
- Very useful: an uncertain object that is only partially computed can be used to prune other objects.
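Why a single instance u can prune a whole object V is easier to see written out. The derivation below is a reconstruction under stated assumptions, not a quotation from the paper: U' is a subset of the instances of U, every u in U' dominates Vmin, instances of an object are equally likely, and objects are independent.

```latex
% Assumptions: U' \subseteq U, u \in U', and u \prec V_{\min} \preceq v for every v \in V.
\begin{align*}
\Pr(v) &= \prod_{W \neq V}\Bigl(1 - \frac{|\{w \in W : w \prec v\}|}{|W|}\Bigr)
  && \text{for any instance } v \in V \\
&\le \Bigl(1 - \frac{|U'|}{|U|}\Bigr)
     \prod_{W \neq U, V}\Bigl(1 - \frac{|\{w \in W : w \prec u\}|}{|W|}\Bigr)
  && \text{since } w \prec u \prec v \text{ implies } w \prec v \\
&= \Bigl(1 - \frac{|U'|}{|U|}\Bigr)\Pr(u)
  && \text{since no instance of } V \text{ dominates } u
\end{align*}
% Averaging over v \in V gives \Pr(V) \le (1 - |U'|/|U|)\,\Pr(u) \le \Pr(u),
% so \Pr(u) < p implies \Pr(V) < p.
```

The last step uses the fact that an instance of V dominating u would, by transitivity, have to dominate itself, which is impossible.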
34 Bottom-up Algorithm: Simplified Version of the Bottom-up Algorithm
Input: the instances of all objects and their Umin corners
- For each instance u (of object U) in the list
- If (u is dominated by another object)
- prune u // e.g. c2 is dominated by D
- End if
- If (u is Umin)
- compute Pr(Umin)
- If (Pr(Umin) < p)
- prune U // Umin rule
- End if
- End if
- Use Pr(u) to update the upper and lower bounds of Pr(U)
- Decide the p-skyline membership of U
- Prune other objects // check against the other Umins
- End for
Missing: pictures for illustration.
All instances of the uncertain objects, as well as the Umin corners, are put into a list.
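The bounding-and-refining part of the outline above can be sketched for a single object. This is a heavily simplified illustration, not the authors' algorithm: it processes instances in list order, uses Pr(Umin) as the upper bound for every unprocessed instance, and leaves out the pruning of instances and of other objects. All names and the data set are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Objects = Dict[str, List[Point]]


def dominates(p: Point, q: Point) -> bool:
    """Smaller is better: p dominates q if p <= q everywhere and p < q somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def point_probability(point: Point, others: List[List[Point]]) -> Fraction:
    """Probability that a point is dominated by no other object."""
    prob = Fraction(1)
    for instances in others:
        hits = sum(1 for v in instances if dominates(v, point))
        prob *= 1 - Fraction(hits, len(instances))
    return prob


def bottom_up_membership(name: str, objects: Objects, p: Fraction) -> Tuple[bool, int]:
    """Return (in p-skyline?, number of instances evaluated before deciding)."""
    own = objects[name]
    others = [inst for other, inst in objects.items() if other != name]
    n = len(own)
    u_min = tuple(min(u[i] for u in own) for i in range(len(own[0])))
    pr_min = point_probability(u_min, others)
    if pr_min < p:
        return False, 0                       # pruned by the Umin rule
    total = Fraction(0)
    for done, u in enumerate(own, start=1):
        total += point_probability(u, others)
        lower = total / n                     # unprocessed instances count as 0
        upper = (total + (n - done) * pr_min) / n  # ... or at most Pr(Umin) each
        if lower >= p:
            return True, done
        if upper < p:
            return False, done
    raise AssertionError("unreachable: the bounds meet after the last instance")


# Hypothetical coordinates, invented for this sketch.
objects = {
    "A": [(2, 2), (1, 8), (8, 1), (9, 0.5)],
    "B": [(0, 10), (0.5, 9)],
    "C": [(4, 4), (5, 5.5), (7, 7)],
    "D": [(3, 3), (6, 6), (2.5, 5)],
}

print(bottom_up_membership("D", objects, Fraction(3, 5)))  # (False, 2)
```

With threshold 3/5 the upper bound drops to 7/12 after two instances, so D is rejected without evaluating the third one.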
35 Top-down Algorithm: Difference between the Top-down and Bottom-up Algorithms
- Bottom-up
- Starts with a single instance of an uncertain object
- Top-down
- Starts with the whole set of instances of an uncertain object
36 Top-down Algorithm: Idea of Bounding
- The skyline probability of each subset of an uncertain object can be bounded using its MBB.
- The skyline probability of the uncertain object can be bounded by the weighted mean of the bounds of its subsets.
Missing: if possible, draw a graph with 4 squares inside it to replace the upper one.
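The weighted-mean bound can be written down directly: each subset contributes its lower and upper bound in proportion to the number of instances it holds. The sketch below assumes each subset is given as a (size, lower bound, upper bound) triple; the numbers are made up for illustration and `object_bounds` is a hypothetical name.

```python
from fractions import Fraction
from typing import List, Tuple

Subset = Tuple[int, Fraction, Fraction]  # (number of instances, lower, upper)


def object_bounds(subsets: List[Subset]) -> Tuple[Fraction, Fraction]:
    """Bound Pr(U) by the size-weighted mean of the per-subset bounds."""
    total = sum(size for size, _, _ in subsets)
    lower = sum(Fraction(size, total) * lo for size, lo, _ in subsets)
    upper = sum(Fraction(size, total) * hi for size, _, hi in subsets)
    return lower, upper


# Hypothetical partition of a 3-instance object into two subsets.
subsets = [
    (2, Fraction(1, 4), Fraction(3, 4)),  # two instances, loosely bounded
    (1, Fraction(3, 4), Fraction(3, 4)),  # one instance, known exactly
]

lower, upper = object_bounds(subsets)
print(lower, upper)  # 5/12 3/4
```

If the threshold p is at most the lower bound the object can be accepted, and if p exceeds the upper bound it can be rejected; otherwise the subsets are partitioned further to tighten the bounds.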
37 Top-down Algorithm: Supporting Data Structure (Partition Tree)
[Figure: partition tree; node labels A, B, C, D]
Missing: the look of the partition tree, with 2 dimensions.
Missing: mark the levels of the partition tree (0, 1, 2, etc.).
For simplicity, a 2-d tree is used to illustrate the concept.
38 Top-down Algorithm: Partition Tree for Bounding
[Figure: partition trees; node labels A, B, C, D]
- Compare a partition of U with the partition tree of another uncertain object V by traversing V's partition tree in depth-first order.
The wording needs to be changed if possible dominating objects are mentioned.
?? Add possible dominating objects before discussing the algorithms.
39 Top-down Algorithm: All Possible Situations during Partition Tree Traversal
[Figure: partition trees; node labels A, B, C, D]
40 Top-down Algorithm: Situation 1/3 during Partition Tree Traversal for Bounding Calculation
[Figure: partition trees; node labels A, B, C, D]
41 Top-down Algorithm: Situation 2/3 during Partition Tree Traversal for Bounding Calculation
[Figure: partition trees; node labels A, B, C, D]
(Place the two trees here; it is better to use the subtree starting at level 1.)
42 Top-down Algorithm: Situation 3/3 during Partition Tree Traversal for Bounding Calculation
[Figure: partition trees; node labels A, B, C, D]
Estimate lower bound
Estimate upper bound
(Place the two trees here; it is better to use the subtree starting at level 1.)
43 Top-down Algorithm: Pruning the Partition Tree 1/3
[Figure: partition trees; node labels A, B, C, D]
(Better to put a tree here.)
44 Top-down Algorithm: Pruning the Partition Tree 2/3
[Figure: partition trees; node labels A, B, C, D]
(Better to put a tree here.)
45 Top-down Algorithm: Pruning the Partition Tree 3/3
[Figure: partition trees; node labels A, B, C, D]
46 Experiment: Data and Experiment
- Experiment: aggregate skyline vs. probabilistic skyline (0.1-skyline)
- Data set: NBA players' performance records (339,721)
- Attributes: points, assists, rebounds
47 Experiment: Results
- 1) The top 12 players in the probabilistic skyline also appear in the aggregate skyline.
- 2) Players such as Olajuwon and Kobe Bryant appear only in the probabilistic skyline, not in the aggregate skyline.
- 3) The probabilistic skyline and the aggregate skyline can disagree: player A dominates player B in the aggregate data set, yet B has the higher skyline probability.
48 Experiment
49 Experiment: Results Analysis
- 2) Players such as Olajuwon and Kobe Bryant appear only in the probabilistic skyline, not in the aggregate skyline.
- Finding
- Compared with the aggregate skyline, the probabilistic skyline finds not only players who consistently perform well, but also outstanding players with large variances in performance.
50 Experiment: Results Analysis
- 3) Disagreement between the probabilistic skyline and the aggregate skyline: Ewing (0.13577) has a higher skyline probability than Brand (0.10966), though Ewing is dominated by Brand in the aggregate data set.
- Finding
- Ewing played very well in a few games.
- Probabilistic skylines disclose interesting knowledge about uncertain data that cannot be captured by traditional skyline analysis.
- Ranking can be performed on the probabilistic skyline, which cannot be done on the aggregate skyline.
51 Experiment: Results Analysis
52 Other Experiments: Synthetic Data Sets
- Data
- Synthetic data sets in which the instances of objects are generated in anti-correlated, independent, and correlated distributions
53 Other Experiment Results: Effect of the Probability Threshold on the Size of the Skyline
54 Other Experiment Results: Effect of Dimensionality on the Size of the Skyline
55 Other Experiment Results: Effect of Cardinality (Instances) on the Size of the Skyline
56 Other Experiment Results: Scalability with Respect to the Probability Threshold
57 Other Experiment Results
- Comparison of Top-down and Bottom-up with respect to dimensionality and cardinality
58 The End