SUBSKY: Efficient Computation of Skylines in Subspaces

1
SUBSKY: Efficient Computation of Skylines in
Subspaces
  • Authors: Yufei Tao, Xiaokui Xiao, and Jian Pei
  • Conference: ICDE 2006
  • Presenter: Kamiru
  • Supervisor: Dr. Nikos Mamoulis

2
Skyline Queries
  • Given a set of d-dimensional points, a point p
    dominates another point p' if
  • $p[i] \le p'[i]$ for all dimensions $i \in \{1, \dots, d\}$,
  • and $p[j] < p'[j]$ for at least one dimension $j$
  • Skyline queries aim to find the points that are
    not dominated by any other point

For the NBA database, low turnover rate and low
foul rate are two important factors for a defensive
player
[Figure: players plotted by foul rate and turnover rate, each ranging from 0 to 1, with the best point marked]
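The dominance test and the skyline definition above can be sketched in a few lines. This is a minimal, naive O(n²) illustration, assuming smaller values are better on every dimension; the function names and the player values are hypothetical and are not taken from the slides.

```python
from typing import List, Sequence

Point = Sequence[float]


def dominates(p: Point, q: Point) -> bool:
    """True if p is no larger than q on every dimension and strictly smaller on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    """Naive skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (foul rate, turnover rate) pairs; lower is better on both axes.
players = [(0.2, 0.8), (0.4, 0.4), (0.8, 0.1), (0.6, 0.6), (0.9, 0.9)]
print(skyline(players))  # prints [(0.2, 0.8), (0.4, 0.4), (0.8, 0.1)]
```

Here (0.6, 0.6) and (0.9, 0.9) are both dominated by (0.4, 0.4), so they drop out.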
3
Applications of Skyline Queries
  • Find a good hotel for me according to distance and
    price

Hotel D cannot be a good hotel for this user,
since its price is higher and its distance is farther
than those of the other hotels
[Figure: hotels A, B, C and D, with price labels 500, 1000, 1500 and 2000]
4
Alternative applications of Skyline Queries - i
  • Some top-k queries can be answered with skyline
    queries
  • A top-k query retrieves the k tuples in P with the
    highest scores according to a scoring function g
  • where g must be a monotonic function, e.g.
  • $g(p) = p.x + p.y$

5
Alternative applications of Skyline Queries - i
  • Please help me find the top-2 NBA players
    according to the sum of their points and
    assists in the 2007-2008 season

[Figure: players plotted by assists and points; each player's values are shown at the top-right corner of the photo, and one region is marked PRUNED]
The top-2 results must be in the top-2 skyband
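The claim that top-k results lie in the k-skyband can be checked on a small example. The sketch below assumes larger values are better (as with points and assists), takes the k-skyband to be the set of points dominated by fewer than k other points, and uses made-up player statistics. With tied scores the claim only guarantees that some valid top-k answer lies in the skyband; the data here has no ties.

```python
from typing import Callable, List, Tuple

Player = Tuple[float, float]  # hypothetical (points, assists); larger is better


def dominates_max(p: Player, q: Player) -> bool:
    """p dominates q when larger values are preferred."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def k_skyband(data: List[Player], k: int) -> List[Player]:
    """Points dominated by fewer than k other points."""
    return [p for p in data if sum(dominates_max(q, p) for q in data) < k]


def top_k(data: List[Player], k: int, g: Callable[[Player], float]) -> List[Player]:
    """The k points with the highest score under g."""
    return sorted(data, key=g, reverse=True)[:k]


stats = [(28.0, 7.0), (20.0, 11.0), (25.0, 5.0), (12.0, 4.0), (18.0, 9.0), (10.0, 10.0)]
band = k_skyband(stats, 2)
best = top_k(stats, 2, lambda p: p[0] + p[1])  # monotonic score: points + assists
print(best)                   # prints [(28.0, 7.0), (20.0, 11.0)]
print(set(best) <= set(band)) # prints True
```

Only (12.0, 4.0) falls outside the 2-skyband here, so a top-2 query under any monotonic score never needs to look at it.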
6
Alternative applications of Skyline Queries - ii
  • Another interesting measure is the dominating
    count (DC)
  • The DC of a point is the number of points that it
    dominates

E.g., find the top-2 dominating players in the NBA
database according to turnover rate and foul rate
[Figure: players plotted by foul rate and turnover rate, labelled with dominating counts 1, 1, 2, 4 and 0; the best point is marked]
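A dominating count is straightforward to compute by brute force. The following sketch assumes smaller values are better and uses hypothetical (foul rate, turnover rate) pairs; it is only meant to illustrate the definition, not an efficient method.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q when smaller values are preferred."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def dominating_count(p: Point, data: List[Point]) -> int:
    """Number of points in data that p dominates."""
    return sum(dominates(p, q) for q in data)


# Hypothetical (foul rate, turnover rate) pairs.
pts = [(0.1, 0.9), (0.2, 0.3), (0.5, 0.5), (0.6, 0.2), (0.8, 0.8), (0.9, 0.4)]
top2 = sorted(pts, key=lambda p: dominating_count(p, pts), reverse=True)[:2]
print(top2)  # prints [(0.2, 0.3), (0.6, 0.2)]
```

In this data (0.2, 0.3) dominates three points and (0.6, 0.2) dominates two, while the skyline point (0.1, 0.9) dominates none, which shows that a skyline point need not have a high DC.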
7
Skyline Computations
  • Two categories of skyline computations
  • Computing from scratch (no index)
  • Index-based
  • Computing from scratch (no index)
  • Advantages
  • No pre-computation is needed
  • No index has to be updated when the data change
  • Drawbacks
  • Every query is computed from scratch
  • It must scan the entire dataset at least once

8
Skyline Computations
  • Index-based
  • Once built, the index can be used many times
  • Query cost is lower because the search is
    performed on an appropriate structure
  • B-tree
  • R-tree
  • Since all of us are database people, (I hope) we
    prefer the second category

9
Related works
  • Computing from scratch (no index)
  • Block nested loop
  • Sort filter skyline
  • Divide and conquer
  • Bitmap
  • Linear elimination sort for skyline

10
Related works
  • Index-based
  • B-tree approach
  • Index
  • R-tree approach
  • Nearest neighbor (NN)
  • Branch and bound skyline (BBS)
  • BBS has been proved to be I/O optimal: it
    accesses no more disk pages than any other
    algorithm based on R-trees

11
Related works
  • Index

Point p is added to list i if its smallest
coordinate is on dimension i
List x: p5 (0.1), p6 (0.25), p2 (0.3), p7 (0.6)
List y: p4 (0.1), p1 (0.2), p3 (0.3), p8 (0.6)
1) Ssky = {p5}
2) Ssky = {p5, p4}
3) Ssky = {p5, p4, p1}
[Figure: points p1 to p8 in the x-y plane, with the best point marked]
  • All remaining elements in List x are pruned by p1,
    since both coordinates of p6 (and of every entry
    after it) are larger than those of p1
  • For the same reason, all remaining elements in
    List y are pruned by p1 too
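The pruning idea in this walkthrough can be sketched as a single scan. This is a simplified variant, not the published Index algorithm: instead of keeping one sorted list per dimension, it visits all points in ascending order of their smallest coordinate, which gives the same visiting order as merging the lists. Only the list values (0.1, 0.25, ...) come from the slide; the other coordinate of each point is a hypothetical choice made so that the example runs.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q when smaller values are preferred."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def index_style_skyline(points: List[Point]) -> List[Point]:
    """Scan points by ascending minimum coordinate and stop once the rest is dominated."""
    ordered = sorted(points, key=min)
    sky: List[Point] = []
    bound = float("inf")  # smallest "largest coordinate" of any skyline point seen so far
    for p in ordered:
        if min(p) > bound:
            # Every coordinate of p (and of all later points) exceeds every
            # coordinate of some skyline point, so the scan can stop here.
            break
        if any(dominates(s, p) for s in sky):
            continue
        sky = [s for s in sky if not dominates(p, s)] + [p]
        bound = min(bound, max(p))
    return sky


# Listed in the order p5, p4, p1, p6, p2, p7, p3, p8.
pts = [(0.1, 0.8), (0.7, 0.1), (0.24, 0.2), (0.25, 0.7),
       (0.3, 0.5), (0.6, 0.9), (0.8, 0.3), (0.9, 0.6)]
print(index_style_skyline(pts))  # prints [(0.1, 0.8), (0.7, 0.1), (0.24, 0.2)]
```

With these coordinates the scan stops at p6: its smallest coordinate, 0.25, already exceeds the largest coordinate of p1, 0.24, so the five remaining points are never compared.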

12
Related works
  • BBS

[Figure: an R-tree over points p1 to p8 with nodes M1, M2 and N1 to N4; the dominant region of p1 is shaded]
  • HNN = {p1, p2, N2, M2}
  • p1 is the first nearest neighbor (NN) of the best point
  • The dominant region of p1 is shown in grey

  • p2 is pruned because it falls in the dominant region
  • Expand N2

13
SUBSKY
  • In the NBA database, each player has more than 10
    different attributes
  • A skyline query may be interested in only some of
    the attributes

14
SUBSKY
  • Build one R-tree and run BBS
  • BBS is an I/O optimal algorithm based on an R-tree
    index, but it is optimized for a
    fixed set of dimensions
  • Build R-trees for all elements in the power set
    of dimensions
  • Huge storage space

15
SUBSKY for uniform data
  • Anchor point Ac: the maximal corner of the data
    space, having the maximum coordinate on all dimensions

$f(p) = \max_{i=1}^{d} (1 - p[i])$
$f_{sky}(p_{sky}) = \min_{i=1}^{d} (1 - p_{sky}[i])$
No point p satisfying $f(p) < f_{sky}(p_{sky})$ can
belong to the skyline
A similar result exists for the skyline of any
subspace
[Figure: the unit x-y space with anchor Ac at the maximal corner, points p1, p2 and psky with their f values, the pruning region of psky, and the best point]
16
SUBSKY for uniform data
  • Skyline queries apply only to the relevant dimensions
    SUB
  • $f_{sky}(p_{sky}) = \min_{i \in SUB} (1 - p_{sky}[i])$
  • Then, for any point p with $f(p) < f_{sky}(p_{sky})$,
  • $\max_{i \in SUB} (1 - p[i]) \le f(p) < f_{sky}(p_{sky})$
  • No point p satisfying the above inequality can
    belong to the skyline of SUB

17
SUBSKY for uniform data
  • Assume that our skyline query is interested in
    dimensions x and y only
  • First, we sort the data in descending order of f(p)
  • p3, p4, p1, p2, p5
  • Ssky = {p3}, fsky(p3) = 0.5 = min(1 - 0.5, 1 - 0.3)
  • U = 0.5 (largest fsky value in Ssky)
  • Ssky = {p3, p4}, fsky(p4) = 0.1
  • U = 0.5
  • Ssky = {p1, p4}, fsky(p1) = 0.8
  • p3 is removed when p1 is added, since it is dominated
    by p1
  • U = 0.8
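The walkthrough above can be sketched as a short in-memory routine. This is a simplified sketch under stated assumptions, not the authors' implementation: it sorts a Python list instead of scanning a disk-based index, it assumes coordinates in [0, 1] with smaller values preferred, and apart from p3's x and y values (0.5, 0.3) every coordinate below is hypothetical, chosen to reproduce the fsky values on the slide.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates_in(p: Point, q: Point, sub: Sequence[int]) -> bool:
    """p dominates q on the dimensions listed in sub (smaller is better)."""
    return all(p[i] <= q[i] for i in sub) and any(p[i] < q[i] for i in sub)


def f(p: Point) -> float:
    """1D key: the largest gap to the maximal corner over all dimensions."""
    return max(1.0 - x for x in p)


def f_sky(p: Point, sub: Sequence[int]) -> float:
    """Pruning threshold contributed by a skyline point in subspace sub."""
    return min(1.0 - p[i] for i in sub)


def subsky_uniform(points: List[Point], sub: Sequence[int]) -> List[Point]:
    """Visit points in descending f order and stop once f drops below U."""
    sky: List[Point] = []
    u = float("-inf")  # largest f_sky value seen so far
    for p in sorted(points, key=f, reverse=True):
        if f(p) < u:
            break  # p and every later point is dominated in sub
        if any(dominates_in(s, p, sub) for s in sky):
            continue
        sky = [s for s in sky if not dominates_in(p, s, sub)] + [p]
        u = max(u, f_sky(p, sub))
    return sky


# Listed in the order p3, p4, p1, p2, p5; the query subspace is (x, y).
data = [(0.5, 0.3, 0.05), (0.9, 0.1, 0.5), (0.2, 0.15, 0.6),
        (0.6, 0.5, 0.3), (0.7, 0.8, 0.4)]
print(subsky_uniform(data, (0, 1)))  # prints [(0.9, 0.1, 0.5), (0.2, 0.15, 0.6)]
```

After p1 is added, U becomes 0.8, and the next point in the order has f = 0.7, so the scan stops without examining p2 or p5.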

18
Analysis
  • Assume that you have 100k uniformly distributed
    15D objects, and we want to
    retrieve the skyline in a subspace SUB containing
    any two dimensions.

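There is a greater than 90% chance of finding an object in area(λ, λ),
where λ = 0.001, based on $(1 - \lambda^2)^{100k}$, the probability that the area contains no object.
Therefore, $f_{sky}(p) = 1 - \lambda = 0.999$.
The volume evaluates to $0.999^{15} \approx 98.5\%$, that is, we only need to access
1.5% of the dataset
[Figure: the unit square of the two query dimensions, with λ marked on both axes]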
General SUBSKY
  • In practice, data are usually clustered
  • If the data are clustered, then we should expect
    that a single anchor point cannot give us enough
    pruning power

[Figure: clustered data in the x-y plane with anchors Ac and A1, a skyline point psky, and the best point]
20
General SUBSKY
  • Anchors for different clusters

[Figure: clusters s1 to s4 in the x-y plane with anchors Ac, A1 and A2 and a skyline point psky]
  • Two questions
  • How to find the anchors?
  • How to assign points to anchors?

21
Finding the Anchors
  • First, let us see what a perfect anchor A of a
    point p is
  • If p is assigned to A, then p can be pruned by
    any skyline point dominating p

Any point on this line is a perfect anchor of
point p
[Figure: point p with its anti-dominant region, the major perpendicular plane, and candidate anchors A1, A2 and A3]
22
Finding the Anchors
  • For each point, find its projection onto the major
    perpendicular plane
  • E.g., p1, p2
  • Partition the projected points into m clusters
    using the k-means algorithm, and formulate an anchor
    for each cluster

[Figure: points p1 and p2 with their projections on the major perpendicular plane, and a good anchor]
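The projection step can be sketched as follows. The slides do not define the major perpendicular plane, so this sketch assumes it is the plane perpendicular to the diagonal direction (1, ..., 1) that passes through the maximal corner, i.e. the set of points whose coordinates sum to d. The function name is hypothetical, and the k-means step is left out.

```python
from typing import Tuple

Point = Tuple[float, ...]


def project_to_major_plane(p: Point) -> Point:
    """Slide p along the direction (1, ..., 1) until its coordinates sum to d."""
    d = len(p)
    shift = (d - sum(p)) / d
    return tuple(x + shift for x in p)


# Two hypothetical points lying on the same line parallel to the diagonal.
p1 = (0.2, 0.4)
p2 = (0.5, 0.7)
print(project_to_major_plane(p1))
print(project_to_major_plane(p2))
```

Both points land on (approximately) the same projected position, which is why clustering in the projected space groups points that can share an anchor. Shifting the plane along the diagonal would move every projection by the same offset, so the clustering does not depend on that part of the assumption.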
23
Finding the Anchors
  • How to decide an anchor for a cluster?

The blue points are assigned to cluster S. How can we
decide the anchor A for S?
  • Obtain point B, whose coordinate on each
    dimension equals the lowest coordinate, on that
    axis, of the points in S in their original space

  • Then, the algorithm computes the smallest square
    opposite to B which covers all points in S
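One reading of this construction is sketched below. It assumes that the square has B as its lower corner and that the anchor is the corner of that square diagonally opposite to B; the slide text does not spell this out, so treat both points as assumptions. The function name and the cluster coordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def cluster_anchor(cluster: List[Point]) -> Point:
    """Corner opposite to B of the smallest square from B that covers the cluster."""
    d = len(cluster[0])
    # B takes the lowest coordinate of the cluster on every axis.
    b = tuple(min(p[i] for p in cluster) for i in range(d))
    # The side must reach the farthest point on its worst axis.
    side = max(p[i] - b[i] for p in cluster for i in range(d))
    return tuple(b[i] + side for i in range(d))


s = [(0.2, 0.5), (0.4, 0.3), (0.3, 0.6)]
print(cluster_anchor(s))  # approximately (0.5, 0.6), since B = (0.2, 0.3) and the side is 0.3
```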

24
Assigning Points to Anchors
  • A naïve way is to assign each point to its closest
    anchor point in the major perpendicular plane
    (projected space)
  • This does not directly quantify the benefit of an
    assignment

25
Assigning Points to Anchors
  • In order to assign a point to a good anchor, the
    paper introduces a new measure called the
    effective region (ER)

Every point in the yellow region (the ER of p) has a
pruning region with respect to Ac that covers p
The larger the ER volume of p, the more likely p is
to be pruned
[Figure: point p and its ER, with the pruning regions of p1 and p2]
26
Assigning Points to Anchors
[Figure: two panels showing the ER of p together with points p1 and p2; anchor A is marked]
27
Assigning Points to Anchors
  • The pruning volume size of a point p with respect to an
    anchor point Aj is
  • $\prod_{i=1}^{d} \max(0, A_j[i] - L_\infty(p, A_j))$,
  • where $L_\infty(p, A_j)$ is the L∞ distance between p and Aj
  • Therefore, assign a point p to the anchor Aj that produces
    the largest pruning volume size
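The assignment rule can be sketched directly from the formula: compute the product of max(0, Aj[i] - L∞(p, Aj)) for each anchor and pick the anchor with the largest value. The function names are hypothetical; the two anchors are the ones used in the query example, (1, 1, 1) and (1, 1, 0.8), and the data points are made up.

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]


def linf(p: Point, a: Point) -> float:
    """L-infinity distance between point p and anchor a."""
    return max(abs(ai - pi) for pi, ai in zip(p, a))


def er_volume(p: Point, a: Point) -> float:
    """Product over dimensions of max(0, a[i] - L_inf(p, a))."""
    r = linf(p, a)
    return math.prod(max(0.0, ai - r) for ai in a)


def assign(p: Point, anchors: List[Point]) -> int:
    """Index of the anchor that gives p the largest volume."""
    return max(range(len(anchors)), key=lambda j: er_volume(p, anchors[j]))


anchors = [(1.0, 1.0, 1.0), (1.0, 1.0, 0.8)]
p = (0.9, 0.9, 0.3)   # hypothetical point, far from the first anchor on one axis only
q = (0.5, 0.5, 0.9)   # hypothetical point, equally far from both anchors
print(assign(p, anchors), assign(q, anchors))  # prints 1 0
```

For p the volume is about 0.027 under the first anchor and about 0.075 under the second, so p goes to the second anchor; for q the first anchor gives about 0.125 against about 0.075, so q stays with the first.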

28
Query example
  • We use the same example as in the previous slides
  • Assume that we have two anchors: one is Ac and
    the other, A, is found by k-means (m = 1)
  • Ac = (1, 1, 1) and A = (1, 1, 0.8)
  • First, we calculate the ER volume of each data
    point with respect to Ac and A

[Table: ER volumes of the data points with respect to Ac and A; unit: 10^-3]
29
Query example
  • Sorted lists by f
  • Ac: p4, p1, p2, p5
  • A: p3
  • Ssky = {p4}, fsky(p4) = 0.5
  • U = 0.5
  • Ssky = {p4, p1}, fsky(p1) = 0.8
  • U = 0.8

30
Experiments
  • 3 real datasets: NBA, Household, and Color
  • 2 synthetic datasets (10D)
  • Uniform
  • Clustered
  • 10 cluster centroids
  • Around each centroid, N/10 points are generated whose
    coordinate on each axis follows a Gaussian
    distribution with variance 0.05 and a mean equal
    to the corresponding coordinate of the centroid

31
Experiments
32
Experiments
33
Experiments
3D subspaces, full-space dimensionality is 10
3D subspaces, 1 million cardinality
34
Conclusion
  • The core of SUBSKY is a transformation that
    converts multi-dimensional data into 1D values
  • It performs better than an I/O optimal
    algorithm in the subspace skyline problem
  • Continuous monitoring scenarios are worth
    investigating
  • How to adapt the set of anchor points when the data
    are updated rapidly
  • The f values could be stored in another index
    structure to support fast updates

35
Assigning Points to Anchors
  • Therefore, we have two ways to assign points to
    the anchors
  • Assign each point to its closest anchor point in
    the major perpendicular plane (projected space)
  • Assign each point to the anchor that gives it the
    largest ER volume in the original space
  • The second approach is better than the first,
    because the ER volume directly quantifies the
    benefit of an assignment