A Unified Approach for Computing Top-k Pairs in Multidimensional Space

1
A Unified Approach for Computing Top-k Pairs in Multidimensional Space
  • Presented by: Muhammad Aamir Cheema (1)
  • Joint work with: Xuemin Lin (1), Haixun Wang (2), Jianmin Wang (3), Wenjie Zhang (1)

(1) University of New South Wales, Australia
(2) Microsoft Research Asia
(3) Tsinghua University, China
2
Introduction
  • Top-k Pairs Query
  • Given a scoring function f() that computes
    the score of a pair of objects, return k pairs of
    objects with smallest scores.
  • Examples
    • k-closest pairs
      • f(ou,ov) = dist(ou,ov)
      • Answer (k = 1): (o1,o2)
    • k-furthest pairs
      • f(ou,ov) = - dist(ou,ov)
      • Answer (k = 1): (o2,o4)
    • f(ou,ov) = (ou.x ov.x) (ou.y ov.y)
      • Answer (k = 1): (o4,o5)

[Figure: objects o1 to o5 plotted against the x-axis and y-axis]
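
As a baseline for the definition above, a top-k pairs query can be sketched by scoring every pair and keeping the k smallest. This is a brute-force illustration, not the algorithm presented in these slides; the function name `top_k_pairs`, the `(name, x, y)` tuple layout and the sample coordinates are assumptions made for the example.

```python
import heapq
import math
from itertools import combinations

def top_k_pairs(objects, score, k):
    """Return the k pairs with the smallest score, as (score, (name_u, name_v))."""
    scored = ((score(u, v), (u[0], v[0])) for u, v in combinations(objects, 2))
    return heapq.nsmallest(k, scored)

def dist(u, v):
    # Euclidean distance between the (x, y) parts of two objects.
    return math.dist(u[1:], v[1:])

# Hypothetical objects as (name, x, y); not the coordinates of the slide figure.
objects = [("p1", 0.0, 0.0), ("p2", 1.0, 0.0), ("p3", 5.0, 5.0), ("p4", 6.0, 1.0)]

closest = top_k_pairs(objects, dist, 1)                        # k-closest pairs
furthest = top_k_pairs(objects, lambda u, v: -dist(u, v), 1)   # k-furthest pairs
print(closest, furthest)
```

Negating the distance turns the "smallest score" convention into a furthest-pairs query, which is why one scoring interface can cover both cases.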
3
Related Work
K-Closest Pairs Queries
  • Computational geometry: M. Smid, Handbook on Comp. Geometry
  • Database community
    • Hjaltason et al., SIGMOD 1998
    • Corral et al., SIGMOD 2000
    • Yang et al., IDEAS 2002
    • Shan et al., SSTD 2003

K-Furthest Pairs Queries
  • Supowit, SODA 1990
  • Katoh et al., IJCGA 1995
  • Corral et al., DKE 2004

Top-k Queries
  • Fagin's Algorithm: Fagin, PODS 1996
  • Threshold Algorithm: Fagin, JCSS 1999; Nepal et al., ICDE 1999; Güntzer et al., VLDB 2000
  • No Random Access Algorithm: Fagin, JCSS 1999; Mamoulis et al., TODS 2007

4
Motivation
  • No existing work for more general queries
    • Other Lp distances (e.g., Manhattan distance)?
    • More general scoring functions
    • Chromatic queries

SELECT a.id, b.id FROM AGENT a, AGENT b
WHERE a.id < b.id
ORDER BY a.sold b.sold - a.salary b.salary
LIMIT k

SELECT a.id, b.id FROM AGENT a, AGENT b
WHERE a.id < b.id AND a.manager <> b.manager
ORDER BY a.sold b.sold - a.salary b.salary
LIMIT k
  • No existing unified algorithm
  • One framework that answers a broad class of
    top-k pairs queries

5
Problem Definition (Preliminaries)
  • Monotonic function
    • f() is monotonic if f(x1,...,xN) ≤ f(y1,...,yN)
      whenever xi ≤ yi for every 1 ≤ i ≤ N
  • Examples
    • f(x1,...,xN) = x1 + x2 + ... + xN (summation)
    • f(x1,...,xN) = (x1 + x2 + ... + xN) / N (average)

6
Problem Definition (Preliminaries)
  • Loose monotonic function
    • s() takes two parameters and is loose monotonic
      if both of the following hold for every fixed value x
      • for every y > x, s(x,y) either monotonically
        increases or monotonically decreases as y increases
      • for every y < x, s(x,y) either monotonically
        increases or monotonically decreases as y decreases
  • Loose monotonic functions are more general than
    monotonic functions

[Figure: two example loose monotonic functions, s1(x,y) and s2(x,y)]
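
The loose monotonic condition can be spot-checked on sample values. The sketch below fixes each x, walks y away from x in both directions, and tests whether the scores form a non-decreasing or non-increasing sequence. The helper names are assumptions, "monotonically" is read here as non-strict, and a passing check on samples is only evidence, not a proof.

```python
def is_monotone(seq):
    """True if seq is entirely non-decreasing or entirely non-increasing."""
    nondecreasing = all(a <= b for a, b in zip(seq, seq[1:]))
    nonincreasing = all(a >= b for a, b in zip(seq, seq[1:]))
    return nondecreasing or nonincreasing

def looks_loose_monotonic(s, samples):
    """Check the two loose monotonic conditions on a finite set of sample values."""
    values = sorted(samples)
    for x in values:
        right = [s(x, y) for y in values if y > x]            # y increasing above x
        left = [s(x, y) for y in reversed(values) if y < x]   # y decreasing below x
        if not (is_monotone(right) and is_monotone(left)):
            return False
    return True

samples = range(-5, 6)
print(looks_loose_monotonic(lambda x, y: abs(x - y), samples))        # True
print(looks_loose_monotonic(lambda x, y: x + y, samples))             # True
print(looks_loose_monotonic(lambda x, y: (y - x - 2) ** 2, samples))  # False
```

The last function fails because, for a fixed x, it first decreases and then increases as y grows beyond x.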

7
Problem Definition
  • Return k pairs of objects with smallest scores.

SCORE(a,b) = f( s1(a,b), ..., sd(a,b) )
  • si() is called a local scoring function and can be
    any loose monotonic function of the user's choice.
  • f() is called the global scoring function and can be
    any monotonic function that involves an arbitrary
    set of attributes.

s1(a,b) = a.sold b.sold
s2(a,b) = - a.salary b.salary
f() = s1(a,b) + s2(a,b)

SELECT a.id, b.id FROM AGENT a, AGENT b
WHERE a.id < b.id
ORDER BY a.sold b.sold - a.salary b.salary
LIMIT k
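
The composition of local and global scoring functions can be sketched as follows. The local functions chosen here (absolute difference on `sold`, negated absolute difference on `salary`) and the sample agents are illustrative assumptions, not necessarily the functions used on the slide; `make_score` is a hypothetical helper name.

```python
import heapq
from itertools import combinations

def make_score(local_fns, global_fn):
    """Build SCORE(a, b) = f(s1(a, b), ..., sd(a, b))."""
    def score(a, b):
        return global_fn([s(a, b) for s in local_fns])
    return score

# Assumed local scoring functions (illustrative only); both are loose monotonic.
def s1(a, b):
    return abs(a["sold"] - b["sold"])

def s2(a, b):
    return -abs(a["salary"] - b["salary"])

# Hypothetical AGENT rows.
agents = [
    {"id": 1, "sold": 10, "salary": 50},
    {"id": 2, "sold": 11, "salary": 90},
    {"id": 3, "sold": 30, "salary": 55},
]

score = make_score([s1, s2], sum)   # global function f = sum, which is monotonic
ranked = heapq.nsmallest(
    2,
    ((score(a, b), a["id"], b["id"]) for a, b in combinations(agents, 2)),
)
print(ranked)
```

Because `combinations` emits each unordered pair once, it plays the role of the `a.id < b.id` condition in the SQL form.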
8
Problem Definition
  • Return k pairs of objects with smallest scores
    among the valid pairs.

Let each object be assigned a color.
Chromatic Queries
  • Homochromatic queries: pairs containing objects of the same color
  • Heterochromatic queries: pairs containing objects of different colors

Non-chromatic query:
SELECT a.id, b.id FROM AGENT a, AGENT b
WHERE a.id < b.id
ORDER BY a.sold b.sold - a.salary b.salary
LIMIT k

Heterochromatic query (different managers):
SELECT a.id, b.id FROM AGENT a, AGENT b
WHERE a.id < b.id AND a.manager <> b.manager
ORDER BY a.sold b.sold - a.salary b.salary
LIMIT k

Homochromatic query (same manager):
SELECT a.id, b.id FROM AGENT a, AGENT b
WHERE a.id < b.id AND a.manager = b.manager
ORDER BY a.sold b.sold - a.salary b.salary
LIMIT k
9
Contributions
Unified algorithm (internal and external)
  • k-closest pairs, k-furthest pairs and variants
    (any Lp distance)
  • queries involving any arbitrary subset of
    attributes
  • chromatic and non-chromatic queries
  • skyline pairs queries and rank based top-k pairs
    queries

No pre-built indexes required
  • efficiently builds a simple data structure
    on-the-fly
  • can answer queries involving filtering
    conditions on objects

Known memory requirement
  • existing R-tree based approaches may require
    arbitrarily large heaps
  • our algorithm requires O(k) space and 2d buffer
    pages

Example with filtering conditions on objects:
SELECT a.id, b.id FROM AGENT a, AGENT b
WHERE a.id < b.id AND a.age > 40 AND b.age > 40
ORDER BY a.sold b.sold - a.salary b.salary
LIMIT k
Efficient
  • Theoretically Optimal for d 2
  • Experimentally

10
Framework
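Top-k algorithms (e.g., FA, TA, NRA) take one source per local scoring function, s1(a,b), s2(a,b), ..., sd(a,b), and combine them using f( s1(a,b), s2(a,b), ..., sd(a,b) ).

Example sources (pairs with their local scores, in ascending order):
  (o1,o2) 3    (o2,o3) 5    (o1,o2) 1
  (o2,o5) 4    (o1,o5) 6    (o3,o4) 2
  (o1,o3) 9    (o1,o2) 6    (o1,o4) 5

How can these sources be created and maintained efficiently?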
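
To make the framework concrete, here is a minimal sketch of a threshold-style (TA-like) combination of sorted sources. It assumes each source yields `(local_score, pair)` in ascending order and that the full score of any pair can be computed on demand. The sources are built naively from hypothetical data, so this illustrates only the combining step, not the source construction the slides go on to describe.

```python
import heapq
from itertools import combinations

def threshold_top_k(sources, full_score, global_fn, k):
    """Threshold-style top-k over sources sorted by ascending local score.

    sources    -- iterators yielding (local_score, pair)
    full_score -- computes the global score of a pair (random access)
    global_fn  -- monotonic function applied to a list of local scores
    """
    last = [None] * len(sources)   # last local score read from each source
    seen = set()
    best = []                      # max-heap of the k best, stored as (-score, pair)
    progressed = True
    while progressed:
        progressed = False
        for i, src in enumerate(sources):
            item = next(src, None)
            if item is None:
                continue
            progressed = True
            last[i], pair = item
            if pair not in seen:
                seen.add(pair)
                heapq.heappush(best, (-full_score(pair), pair))
                if len(best) > k:
                    heapq.heappop(best)   # drop the current worst
        if len(best) == k and None not in last:
            # No unseen pair can score below f(last) because f is monotonic.
            if -best[0][0] <= global_fn(last):
                break
    return sorted((-neg, pair) for neg, pair in best)

# Hypothetical data: 5 objects with 2 attributes; local score = absolute difference.
data = [(1, 10), (4, 2), (6, 7), (9, 3), (2, 8)]
pairs = list(combinations(range(len(data)), 2))

def local(d):
    return lambda p: abs(data[p[0]][d] - data[p[1]][d])

local_fns = [local(0), local(1)]
sources = [iter(sorted((s(p), p) for p in pairs)) for s in local_fns]
result = threshold_top_k(sources, lambda p: sum(s(p) for s in local_fns), sum, 3)
print(result)
```

The early exit is what makes sorted sources valuable: the loop can stop before most pairs have been read, provided the sources themselves can be produced lazily.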
11
Creating/maintaining sources
Naïve approach
  • Create all possible pairs: O(N^2)
  • Sort them according to their local scores: O(N^2 log N)
  • Space requirement: O(N^2)

Features of our approach
  • Optimal internal memory algorithm
    • requires O(N) space
    • returns first pair in O(N log N)
    • each next best pair is returned in O(log N)
  • Optimal external memory algorithm
    • B = number of elements that can be stored in one disk page
    • M = internal memory used (at least 2B)
    • returns first pair in O((N/B) log_{M/B} (N/B))
    • each next best pair is returned in O(log_{M/B} (N/B))
12
Creating/maintaining sources
  • Initialize
    • sort the objects
    • for each object ou
      • create its best pair (ou,ov)
      • insert (ou,ov) in heap
  • getNextPair()
    • report the top pair (ou,ov) of heap
    • create next best pair of ou
    • insert the new pair into the heap and delete (ou,ov)

Example: s(x,y) = |x - y| over the sorted values
o1 = 6, o2 = 12, o3 = 14, o4 = 15, o5 = 20, o6 = 30

Initial heap    After reporting (o3,o4)    After reporting (o2,o3)
(o3,o4) 1       (o2,o3) 2                  (o2,o4) 3
(o2,o3) 2       (o4,o5) 5                  (o4,o5) 5
(o4,o5) 5       (o3,o5) 6                  (o3,o5) 6
(o1,o2) 6       (o1,o2) 6                  (o1,o2) 6
(o5,o6) 10      (o5,o6) 10                 (o5,o6) 10
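
The Initialize and getNextPair() steps can be sketched for the special case s(x,y) = |x - y| on a single attribute. Each object is paired only with objects to its right in sorted order, so every pair is generated once and each object's pairs come out in ascending score order. The generator name and the restriction to absolute difference are assumptions; this is not the general loose monotonic case.

```python
import heapq

def pairs_by_abs_difference(values):
    """Yield (score, u, v) in non-decreasing order of |values[u] - values[v]|."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    heap = []
    # Initialize: the best pair of each object is its right-hand neighbour.
    for pos in range(len(order) - 1):
        u, v = order[pos], order[pos + 1]
        heapq.heappush(heap, (values[v] - values[u], pos, pos + 1))
    # getNextPair(): report the top pair, then insert the next best pair of ou.
    while heap:
        score, pos_u, pos_v = heapq.heappop(heap)
        yield score, order[pos_u], order[pos_v]
        nxt = pos_v + 1
        if nxt < len(order):
            next_score = values[order[nxt]] - values[order[pos_u]]
            heapq.heappush(heap, (next_score, pos_u, nxt))

values = [6, 12, 14, 15, 20, 30]              # o1 .. o6 from the example above
names = ["o1", "o2", "o3", "o4", "o5", "o6"]
source = pairs_by_abs_difference(values)
for _ in range(3):
    score, u, v = next(source)
    print(names[u], names[v], score)
# o3 o4 1
# o2 o3 2
# o2 o4 3
```

The heap never holds more than one pair per object, which matches the O(N) space figure quoted for the internal memory algorithm.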
13
Homochromatic Queries
[Figure: objects o1 to o6 with values 6, 12, 14, 15, 20, 30]
14
Heterochromatic Queries
  • Let (ou,ov) be the pair just reported
  • ox = the object next to ov
  • If ou and ox have different colors
    • (ou,ox) is the next best pair
  • else
    • oy = the adjacent object of ox
    • (ou,oy) is the next best pair

[Figure: objects o1 to o6 with values 6, 12, 14, 15, 20, 30]
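
The next-best-pair rule above can be sketched for s(x,y) = |x - y| as follows. The sketch reads "the adjacent object of ox" as the nearest object to the right of ox whose color differs from that of ox; this reading is an assumption, stored here in a precomputed `next_diff` array. The colors in the usage example are hypothetical.

```python
import heapq

def heterochromatic_pairs(values, colors):
    """Yield (score, u, v) with differing colors, ascending by |values[u] - values[v]|."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    color_at = [colors[i] for i in order]
    value_at = [values[i] for i in order]

    # next_diff[p]: nearest position right of p whose color differs from p's (assumed pointer).
    next_diff = [None] * n
    for p in range(n - 2, -1, -1):
        next_diff[p] = p + 1 if color_at[p + 1] != color_at[p] else next_diff[p + 1]

    heap = []
    for p in range(n):
        q = next_diff[p]
        if q is not None:
            heapq.heappush(heap, (value_at[q] - value_at[p], p, q))

    while heap:
        score, p, q = heapq.heappop(heap)
        yield score, order[p], order[q]
        x = q + 1                          # ox: the object next to ov
        if x >= n:
            continue
        if color_at[x] == color_at[p]:
            x = next_diff[x]               # oy: skip the run sharing ou's color
        if x is not None:
            heapq.heappush(heap, (value_at[x] - value_at[p], p, x))

values = [6, 12, 14, 15, 20, 30]
colors = ["red", "red", "blue", "red", "blue", "blue"]   # hypothetical colors
for score, u, v in heterochromatic_pairs(values, colors):
    print(f"(o{u + 1},o{v + 1}) {score}")
```

Since ox shares its color with ou in the else branch, the first object that differs from ox also differs from ou, so a single pointer lookup is enough.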
15
Experiments
  • K-closest pairs queries: Corral et al., SIGMOD 2000
  • Data size: two datasets, each containing 100K objects
  • k = 10

16
Experiments
  • Naive: join the dataset with itself using a
    nested loop (block nested loop for the external
    memory algorithm)
  • Scoring function
    • Local scoring function is either sum or absolute
      difference (chosen randomly)
    • Global scoring function is a weighted aggregate
      (weights are chosen randomly and negative weights
      are allowed)

17
Number of Objects
18
Number of attributes (d)
19
Value of k
20
Number of colors
21
Thanks
22
Complexity
Internal memory algorithm
External memory algorithm
  • d = number of local scoring functions involved
  • N = total number of objects
  • V = total number of valid pairs (at most N^2)
  • M = internal memory used by the algorithm
  • B = number of entries one disk page can store