Progressive and Selective Merge: Computing Top-K with ad-hoc Ranking Functions - PowerPoint PPT Presentation

1 / 33
About This Presentation
Title:

Progressive and Selective Merge: Computing Top-K with ad-hoc Ranking Functions

Description:

Title: Mining Compressed Frequent-Pattern Sets Author: uiucdmg Last modified by: dongxin Created Date: 8/18/2005 12:58:39 AM Document presentation format – PowerPoint PPT presentation

Number of Views:40
Avg rating:3.0/5.0
Slides: 34
Provided by: uiu46
Category:

less

Transcript and Presenter's Notes

Title: Progressive and Selective Merge: Computing Top-K with ad-hoc Ranking Functions


1
Progressive and Selective Merge Computing Top-K
with ad-hoc Ranking Functions
  • Dong Xin, Jiawei Han, Kevin C.-C. Chang
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign

2
Road Map
  • Introduction
  • Sort-Merge vs. Index-Merge
  • Proposed Solutions
  • Experimental Results
  • Conclusions

3
Motivation Smart Search in Massive Databases
Apartment Search on www.rent.com
  • Search is still a problem in massive databases!
  • Challenges on data
  • Gigabytes of data
  • Hundreds of dimensions
  • Challenges on query processing
  • Flood of results A query may return too many
    answers
  • Quality of answers Users are often interested in
    only a few but high-quality results
  • Efficiency It is slow to process and return so
    many answers!

similar web database search on www.cars.com
www.bizrate.com
4
Lesson from the Web Search Ranked Queries
Ranking is the key in successful Web Search
Ranking Apts by aprice-1000 distance to
shopping
Top-K Results
5
Ranked Query More Example
  • Stock Price Predication
  • Stock Value V
  • Earnings E
  • Research Expenditure R
  • Train a Predication Model from history data
  • f(E,R)

Top-k stocks which are most predicable Select top
k from Stock S order by (S.V-f(S.E, S.R))2 asc
6
Data and Query Model
  • Data Relation R
  • Indices are built on attributes
  • B-tree index on single attribute
  • R-tree index on multiple attributes
  • Query with ad hoc Ranking Function F
  • Attributes from multiple indices
  • F is lower-boundable (assume minimal k)
  • Given a domain D
  • One can derive the lower bound value of F(D)
  • The lower bound value need not be tight

7
Example Ranking Functions
  • Typical functions
  • All linear functions xy
  • All quadratic functions (x-100)2(y-100)2
  • All monotone functions xy
  • Some special functions
  • Min square error (x-y)2
  • Median, min, max of input variables

8
Road Map
  • Introduction
  • Sort-Merge vs. Index-Merge
  • Proposed Solutions
  • Experimental Results
  • Conclusions

9
Previous Study Sort-Merge
  • Sort-Merge Fagin et al, 2001
  • Assume monotone functions
  • Many variations

Select top 1 from Apartment order by price /
sq feet asc
tid Price
t1 500
t2 700
t3 800
t4 1000
t5 1100
t6 1200
t7 1200
t8 1350
tid Price Sq feet
t1 500 600
t2 700 800
t3 800 900
t4 1000 1000
t5 1100 200
t6 1200 500
t7 1200 560
t8 1350 1120
tid Sq feet
t8 1120
t4 1000
t3 900
t2 800
t1 600
t7 560
t6 500
t5 200
Intuition Data appearing top on both lists
likely has higher score
10
Our Solution Index-Merge
  • The function involves both price and sq feet
  • Price and Sq feet are indexed individually

Price (P1, P2, P3)
Sq feet (S1, S2, S3)
5 - 8 10-11 12-13
2 - 5 6 - 8 9 -11
5, t1 7, t2 8,t3
10, t4 11, t5
12, t6 12, t7 13,t8
2, t5 5, t6 5,t7
6, t1 8, t4
9, t2 10, t8 11,t3
11
Search over Joint Space
  • Merge both indices online
  • Search top results over the joint Space

Join Price and Sq feet during online
computation
The joint state (P1, S2) Boundary 5 - 8 X 6 -
8 Join results t1
(Price, Sq feet)

(P1,S1)
(P1,S2)
(P1,S3)
(P2,S1)
(P3,S3)
Cartesian product of partitions from price and
sq feet
12
Sort-Merge vs. Index-Merge
Sort-Merge Index-Merge
Data Organization Each attribute is sorted in a list Each attribute or multiple attribute is indexed by hierarchical tree structures
Data Access Methods Sorted Access (SA) Random Access (RA) Node Access (following tree structures)
Ranking Functions Monotone functions Ad-hoc functions (lower-bound-able)
Optimization Opportunities SA only Combine SA and RA Progressive Merge Selective Merge
13
Road Map
  • Introduction
  • Sort-Merge vs. Index-Merge
  • Proposed Solutions
  • Problem Analysis
  • Progressive Merge
  • Selective Merge
  • Experimental Results
  • Conclusions

14
Challenge 1 Complexity of Join
Join Price and Sq feet
(Price, Sq feet)

(P1,S1)
(P1,S2)
(P1,S3)
(P2,S1)
(P3,S3)
Number of children by joining two index node
Cartesian Product
Complexity of Join B-tree with page size 4kB
The fan-out of a node is 204 Join 2 B-trees
40k children Join 3 B-trees 8M children
15
Challenge 2 Empty State
Price (P1, P2, P3)
Sq feet (S1, S2, S3)
Join Price and Sq feet
5 - 8 10-11 12-13
2 - 5 6 - 8 9 -11
(Price, Sq feet)
5, t1 7, t2 8,t3
2, t5 5, t6 5,t7

(P1,S1)
(P1,S2)
t1,t2,t3
t5,t6,t7
Empty join state should be pruned
Probability of Empty State Join B-trees on 1M
data Up to 1M non-empty Joint leaf states
Each B-tree has 5140 leaf nodes Join 2
B-trees 25.2M joint leaf (96 are empty)
Join 3 B-trees 126.5G joint leaf (99.999 are
empty)
16
Efficient Search Over The Joint Space
  • Straight forward search is expensive
  • Assemble the complete joint space first
  • Search over the joint space
  • Smart search join as necessary
  • Progressive Merge
  • Identify which joint states have top scores
  • Assemble good joint states first
  • Selective Merge
  • Identify which joint states contain non-empty
    result
  • Avoid generating joint states that are empty

17
Problem Analysis
  • Minimize
  • CPU cost Number of joint states to be assembled
  • I/O cost Number of joint states to be retrieved
  • Performance At a Glance
  • Ranking Function f(A-B)2
  • A, B are indexed by B-trees, data has 1M records

18
Road Map
  • Introduction
  • Sort-Merge vs. Index-Merge
  • Proposed Solutions
  • Problem Analysis
  • Progressive Merge
  • Selective Merge
  • Experimental Results
  • Conclusions

19
Progressive Merge (1)
Ranking Function price 1k2 sq feet -
8002
Price (P1, P2, P3)
Sq feet (S1, S2, S3)
500 - 800 1000-1100 1200-1350
200 - 500 600 - 800 900 -1120
  • Special Case convex function
  • Find the best joint state by the extreme point,
    e.g., best state (P2, S2)
  • Progressive assemble neighboring states, e.g.,
    (P2, S1) (P2, S3) (P1, S2) (P3, S2)

20
Progressive Merge (2)
Ranking Function price 1k2 sq feet -
8002
Price (P1, P2, P3)
Sq feet (S1, S2, S3)
500 - 800 1000-1100 1200-1350
200 - 500 600 - 800 900 -1120
f(Pi, S) 40,000 0
40,000
f(P, Si) 90,000 0
10,000
  • General function not convex, not monotone?
  • Step 1 Sort Pi by f(Pi, S), Sort Si by f(P, Si)

P2 P1 P3
S2 S3 S1
f(Pi, S) min f(x) (x in Pi and S) i.e., price
in Pi node, and sq_feet in S node
Sorted Lists of P and S
21
Progressive Merge (3)
Ranking Function price 1k2 sq feet -
8002
Price (P1, P2, P3)
Sq feet (S1, S2, S3)
500 - 800 1000-1100 1200-1350
200 - 500 600 - 800 900 -1120
f(Pi, S) 40,000 0
40,000
f(P, Si) 90,000 0
10,000
  • General function not convex, not monotone?
  • Step 2 Merge Pi, Sj progressively

(P2, S2) (P2, S3) (S2, P1) (P1, S3) (P2, S1) (P1,
S1) (P3, S2) (P3, S3) (P3, S1)
P2 P1 P3
S2 S3 S1
Sorted Lists of P and S
Progressive merged results
22
Progressive Merge (4)
Ranking Function price 1k2 sq feet -
8002
Price (P1, P2, P3)
Sq feet (S1, S2, S3)
500 - 800 1000-1100 1200-1350
200 - 500 600 - 800 900 -1120
f(Pi, S) 40,000 0
40,000
f(P, Si) 90,000 0
10,000
  • General function not convex, not monotone?
  • Step 3 Stop when the best seen is no worse
    than the unseen

Best seen f(P2, S2)0
(P2, S2) (P2, S3) (S2, P1) (P1, S3) (P2, S1) (P1,
S1) (P3, S2) (P3, S3) (P3, S1)
f(P2,S2) min (f(P3,S), f(P, S3))
Best possible of unseen min ( f(P1, S), f(P, S3)
) 10,000
23
How Good is the Progressive Merge?
  • Optimal case
  • Suppose Sk is the kth best score in the final
    results
  • Optimal algorithm only examines all joint state
    s, such that f(s) gt Sk
  • Convex functions N mN
  • N number of joint states examined by
    neighborhood expansion
  • N optimal number
  • m number of indices to be merged
  • General functions N 2mN
  • N number of joint states examined by threshold
    expansion
  • N any other algorithm in the same category
  • m number of indices to be merged

24
Road Map
  • Introduction
  • Sort-Merge vs. Index-Merge
  • Proposed Solutions
  • Problem Analysis
  • Progressive Merge
  • Selective Merge
  • Experimental Results
  • Conclusions

25
Selective State
Join Price and Sq feet
(Price, Sq feet)

(P1,S1)
(P1,S2)
(P1,S3)
(P2,S1)
(P3,S3)
P1 contains (t1, t2, t3) S1 contains (t5, t6,
t7) Join P1 and S1 Empty state
How to identify empty states before the joint
states are retrieved?
26
Join Signature
Price (P1, P2, P3)
Sq feet (S1, S2, S3)
5 - 8 10-11 12-13
2 - 5 6 - 8 9 -11
5, t1 7, t2 8,t3
10, t4 11, t5
12, t6 12, t7 13,t8
2, t5 5, t6 5,t7
6, t1 8, t4
9, t2 10, t8 11,t3
Join Signature indicates Empty States Compress
the join signature by bloom filter Check the Join
Signature before assembling states
S1 S2 S3
0 1 1
1 1 0
1 0 1
P1 P2 P3
Data and their location in the index
Join Signature
27
Revisit Comparing to Sort-merge
  • Progressive Merge
  • Corresponds to Sorted Access scheduling in
    sort-merge
  • Node access scheduling in index-merge is more
    complicated
  • Selective Merge
  • Corresponds to Random Access in Sort-merge
  • Random access dose not help in index-merge
    (details in paper)
  • The join signature is also applicable to
    Sort-merge

28
Road Map
  • Introduction
  • Sort-Merge vs. Index-Merge
  • Proposed Solutions
  • Problem Analysis
  • Progressive Merge
  • Selective Merge
  • Experimental Results
  • Conclusions

29
Execution Time w.r.t. Different Functions
  • Join two B-tree indices
  • TS Table Scan, BL Baseline, PE Progressive
    Merge, SIG Signature-based Selective Merge

F(x-a)2(y-b)2
F(x-y2)2
30
Disk Access and Memory Usage
  • Join two B-tree indices, Top 100 query
  • BL Baseline, PE Progressive Merge, SIG
    Signature

F(x-y2)2
31
Road Map
  • Introduction
  • Sort-Merge vs. Index-Merge
  • Proposed Solutions
  • Problem Analysis
  • Progressive Merge
  • Selective Merge
  • Experimental Results
  • Conclusions

32
Discussion and Conclusions
  • The Index-Merge Framework
  • Extends the sort-merge framework
  • Does not confine to monotone functions
  • Efficient Computation
  • Progressive Merge reduce the CPU and memory
    overhead
  • Selective Merge reduce the disk I/O cost

33
Thank You!
Write a Comment
User Comments (0)
About PowerShow.com