Title: Progressive and Selective Merge: Computing Top-K with ad-hoc Ranking Functions
1Progressive and Selective Merge Computing Top-K
with ad-hoc Ranking Functions
- Dong Xin, Jiawei Han, Kevin C.-C. Chang
- Department of Computer Science
- University of Illinois at Urbana-Champaign
2Road Map
- Introduction
- Sort-Merge vs. Index-Merge
- Proposed Solutions
- Experimental Results
- Conclusions
3Motivation Smart Search in Massive Databases
Apartment Search on www.rent.com
- Search is still a problem in massive databases!
- Challenges on data
- Gigabytes of data
- Hundreds of dimensions
- Challenges on query processing
- Flood of results: A query may return too many answers
- Quality of answers: Users are often interested in only a few but high-quality results
- Efficiency: It is slow to process and return so many answers!
Similar web database search on www.cars.com, www.bizrate.com
4Lesson from the Web Search Ranked Queries
Ranking is the key in successful Web Search
Ranking Apts by aprice-1000 distance to
shopping
Top-K Results
5Ranked Query More Example
- Stock Price Prediction
- Stock Value V
- Earnings E
- Research Expenditure R
- Train a Prediction Model from historical data
- f(E,R)
Top-k stocks which are most predictable: Select top k from Stock S order by (S.V - f(S.E, S.R))^2 asc
6Data and Query Model
- Data Relation R
- Indices are built on attributes
- B-tree index on single attribute
- R-tree index on multiple attributes
- Query with ad hoc Ranking Function F
- Attributes from multiple indices
- F is lower-boundable (assume minimal k)
- Given a domain D
- One can derive the lower bound value of F(D)
- The lower bound value need not be tight
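To make "lower-boundable" concrete, here is a minimal Python sketch. It assumes the domain D is an axis-aligned box (one interval per attribute) and that F is a sum of squared distances to a target point, one of the quadratic forms used in this talk. The helper names `clamp` and `lower_bound_quadratic` are hypothetical, not taken from the slides.

```python
def clamp(value, low, high):
    """Return the point in [low, high] that is closest to value."""
    return max(low, min(value, high))


def lower_bound_quadratic(box, target):
    """Lower bound of F(x) = sum_i (x_i - target_i)^2 over an axis-aligned box.

    box    -- list of (low, high) intervals, one per attribute (assumed layout)
    target -- list of target values, one per attribute
    """
    # For this separable F the minimum over the box is reached by clamping
    # each target coordinate into its interval.
    return sum((clamp(t, lo, hi) - t) ** 2 for (lo, hi), t in zip(box, target))


# Price in [500, 800], sq feet in [600, 800], target (1000, 800).
bound = lower_bound_quadratic([(500, 800), (600, 800)], [1000, 800])
```

For this particular F the bound happens to be tight; the query model only requires that some valid lower bound can be derived.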
7Example Ranking Functions
- Typical functions
- All linear functions, e.g., x + y
- All quadratic functions, e.g., (x-100)^2 + (y-100)^2
- All monotone functions xy
- Some special functions
- Min square error (x-y)^2
- Median, min, max of input variables
8Road Map
- Introduction
- Sort-Merge vs. Index-Merge
- Proposed Solutions
- Experimental Results
- Conclusions
9Previous Study Sort-Merge
- Sort-Merge [Fagin et al., 2001]
- Assume monotone functions
- Many variations
Select top 1 from Apartment order by price / sq feet asc
tid Price
t1 500
t2 700
t3 800
t4 1000
t5 1100
t6 1200
t7 1200
t8 1350
tid Price Sq feet
t1 500 600
t2 700 800
t3 800 900
t4 1000 1000
t5 1100 200
t6 1200 500
t7 1200 560
t8 1350 1120
tid Sq feet
t8 1120
t4 1000
t3 900
t2 800
t1 600
t7 560
t6 500
t5 200
Intuition: Data appearing near the top of both lists likely has a higher score
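The sort-merge idea on this slide can be sketched in Python as a threshold-style scan over the two sorted lists. This is an illustrative sketch in the spirit of Fagin et al., not code from the talk: the function name `top1_sort_merge` and the round-robin access order are assumptions. The data are the Price and Sq feet tables above.

```python
# Tables from the slide: tid -> attribute value.
price = {"t1": 500, "t2": 700, "t3": 800, "t4": 1000,
         "t5": 1100, "t6": 1200, "t7": 1200, "t8": 1350}
sqft = {"t1": 600, "t2": 800, "t3": 900, "t4": 1000,
        "t5": 200, "t6": 500, "t7": 560, "t8": 1120}


def score(p, s):
    """price / sq feet: smaller is better, increasing in price, decreasing in sq feet."""
    return p / s


def top1_sort_merge(price, sqft):
    by_price = sorted(price, key=price.get)             # cheapest first
    by_sqft = sorted(sqft, key=sqft.get, reverse=True)  # largest first
    best_tid, best_score = None, float("inf")
    for depth, (tp, ts) in enumerate(zip(by_price, by_sqft), start=1):
        # Sorted access on each list, random access for the other attribute.
        for tid in (tp, ts):
            sc = score(price[tid], sqft[tid])
            if sc < best_score:
                best_tid, best_score = tid, sc
        # No unseen tuple can score below this: its price is at least
        # price[tp] and its sq feet is at most sqft[ts].
        threshold = score(price[tp], sqft[ts])
        if best_score <= threshold:
            return best_tid, best_score, depth
    return best_tid, best_score, len(by_price)


winner, winner_score, depth = top1_sort_merge(price, sqft)
```

On this data the scan stops after reading three entries from each list and returns t1, because the threshold 800/900 is no better than t1's score 500/600.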
10Our Solution Index-Merge
- The function involves both price and sq feet
- Price and Sq feet are indexed individually
Price (P1, P2, P3): 5-8, 10-11, 12-13
Sq feet (S1, S2, S3): 2-5, 6-8, 9-11
P1: (5, t1) (7, t2) (8, t3)
P2: (10, t4) (11, t5)
P3: (12, t6) (12, t7) (13, t8)
S1: (2, t5) (5, t6) (5, t7)
S2: (6, t1) (8, t4)
S3: (9, t2) (10, t8) (11, t3)
11Search over Joint Space
- Merge both indices online
- Search top results over the joint Space
Join Price and Sq feet during online computation
The joint state (P1, S2): boundary [5, 8] x [6, 8]; join result: t1
(Price, Sq feet)
(P1,S1)
(P1,S2)
(P1,S3)
(P2,S1)
(P3,S3)
Cartesian product of partitions from price and
sq feet
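A joint state can be sketched in Python as the intersection of the tuple ids stored in one Price leaf and one Sq feet leaf. The leaf contents below are the ones drawn on the Index-Merge slide; the dictionary layout and the name `join_state` are assumptions made for illustration.

```python
from itertools import product

# Leaf contents as drawn on the Index-Merge slide (tuple ids only).
price_nodes = {"P1": {"t1", "t2", "t3"},
               "P2": {"t4", "t5"},
               "P3": {"t6", "t7", "t8"}}
sqft_nodes = {"S1": {"t5", "t6", "t7"},
              "S2": {"t1", "t4"},
              "S3": {"t2", "t8", "t3"}}


def join_state(p_name, s_name):
    """Tuples that fall in both the given Price node and the given Sq feet node."""
    return price_nodes[p_name] & sqft_nodes[s_name]


# The full joint space is the Cartesian product of the two partitions.
joint_space = {(p, s): join_state(p, s)
               for p, s in product(price_nodes, sqft_nodes)}
```

Materializing the whole dictionary is exactly what the later slides try to avoid; it is done here only because the example has nine states.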
12Sort-Merge vs. Index-Merge
Data Organization
- Sort-Merge: Each attribute is sorted in a list
- Index-Merge: Each attribute or multiple attributes are indexed by hierarchical tree structures
Data Access Methods
- Sort-Merge: Sorted Access (SA), Random Access (RA)
- Index-Merge: Node Access (following tree structures)
Ranking Functions
- Sort-Merge: Monotone functions
- Index-Merge: Ad-hoc functions (lower-bound-able)
Optimization Opportunities
- Sort-Merge: SA only; combine SA and RA
- Index-Merge: Progressive Merge; Selective Merge
13Road Map
- Introduction
- Sort-Merge vs. Index-Merge
- Proposed Solutions
- Problem Analysis
- Progressive Merge
- Selective Merge
- Experimental Results
- Conclusions
14Challenge 1 Complexity of Join
Join Price and Sq feet
(Price, Sq feet)
(P1,S1)
(P1,S2)
(P1,S3)
(P2,S1)
(P3,S3)
Number of children by joining two index nodes: Cartesian product
Complexity of Join
- B-tree with page size 4 kB: the fan-out of a node is 204
- Join 2 B-trees: 40k children
- Join 3 B-trees: 8M children
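The child counts above follow directly from the Cartesian product. A small Python check, using the fan-out of 204 quoted on the slide (the function name `joint_children` is hypothetical):

```python
FAN_OUT = 204  # fan-out quoted on the slide for a 4 kB page


def joint_children(fan_out, num_indices):
    """Children of a joint node when every index node has the same fan-out."""
    return fan_out ** num_indices


two_trees = joint_children(FAN_OUT, 2)    # on the order of 40k
three_trees = joint_children(FAN_OUT, 3)  # on the order of 8M
```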
15Challenge 2 Empty State
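Price (P1, P2, P3): 5-8, 10-11, 12-13
Sq feet (S1, S2, S3): 2-5, 6-8, 9-11
Join Price and Sq feet into joint states (Price, Sq feet), e.g., (P1, S1), (P1, S2)
P1: (5, t1) (7, t2) (8, t3), i.e., t1, t2, t3
S1: (2, t5) (5, t6) (5, t7), i.e., t5, t6, t7
(P1, S1) has no tuple in common: an empty join state, which should be pruned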
Probability of Empty State
- Join B-trees on 1M data: up to 1M non-empty joint leaf states
- Each B-tree has 5140 leaf nodes
- Join 2 B-trees: 25.2M joint leaf states (96% are empty)
- Join 3 B-trees: 126.5G joint leaf states (99.999% are empty)
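The empty-state figures can be reproduced with a few lines of Python. The sketch assumes that "M" and "G" are binary units (2^20 and 2^30), which is the reading under which the slide's 25.2M and 126.5G come out exactly, and that each tuple occupies at most one joint leaf state. The names are hypothetical.

```python
LEAVES_PER_TREE = 5140   # leaf nodes per B-tree, from the slide
NUM_TUPLES = 2 ** 20     # "1M" read as 2^20 (assumption)


def empty_fraction(num_indices):
    """Joint leaf count and a lower bound on the fraction of empty joint leaves."""
    joint_leaves = LEAVES_PER_TREE ** num_indices
    # At most NUM_TUPLES joint leaves can be non-empty.
    return joint_leaves, 1 - NUM_TUPLES / joint_leaves


leaves2, empty2 = empty_fraction(2)
leaves3, empty3 = empty_fraction(3)
```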
16Efficient Search Over The Joint Space
- Straightforward search is expensive
- Assemble the complete joint space first
- Search over the joint space
- Smart search join as necessary
- Progressive Merge
- Identify which joint states have top scores
- Assemble good joint states first
- Selective Merge
- Identify which joint states contain non-empty results
- Avoid generating joint states that are empty
17Problem Analysis
- Minimize
- CPU cost Number of joint states to be assembled
- I/O cost Number of joint states to be retrieved
- Performance At a Glance
- Ranking Function f = (A-B)^2
- A, B are indexed by B-trees, data has 1M records
18Road Map
- Introduction
- Sort-Merge vs. Index-Merge
- Proposed Solutions
- Problem Analysis
- Progressive Merge
- Selective Merge
- Experimental Results
- Conclusions
19Progressive Merge (1)
Ranking Function: (price - 1k)^2 + (sq feet - 800)^2
Price (P1, P2, P3): 500-800, 1000-1100, 1200-1350
Sq feet (S1, S2, S3): 200-500, 600-800, 900-1120
- Special Case: convex function
- Find the best joint state by the extreme point, e.g., best state (P2, S2)
- Progressively assemble neighboring states, e.g., (P2, S1), (P2, S3), (P1, S2), (P3, S2)
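The convex special case can be sketched in Python as a best-first search that starts at the joint state containing the extreme point and only ever assembles neighbors of states already visited. This is a hedged illustration, not the authors' implementation: the grid indexing, the helper names, and the `limit` parameter are assumptions, and the ranking function is (price - 1000)^2 + (sq feet - 800)^2 over the node ranges of this slide.

```python
import heapq

price_ranges = [(500, 800), (1000, 1100), (1200, 1350)]  # P1, P2, P3
sqft_ranges = [(200, 500), (600, 800), (900, 1120)]      # S1, S2, S3


def gap(lo, hi, target):
    """Distance from target to the interval [lo, hi]; 0 if target lies inside."""
    return max(lo - target, 0, target - hi)


def lower_bound(i, j):
    """Lower bound of the ranking function over joint state (P[i], S[j])."""
    return gap(*price_ranges[i], 1000) ** 2 + gap(*sqft_ranges[j], 800) ** 2


def neighborhood_expansion(start, limit):
    """Visit up to `limit` joint states in order of lower bound, growing from `start`."""
    heap = [(lower_bound(*start), start)]
    seen = {start}
    order = []
    while heap and len(order) < limit:
        bound, (i, j) = heapq.heappop(heap)
        order.append(((i, j), bound))
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            inside = 0 <= ni < len(price_ranges) and 0 <= nj < len(sqft_ranges)
            if inside and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (lower_bound(ni, nj), (ni, nj)))
    return order


# Start from (P2, S2), the state that contains the extreme point (1000, 800).
order = neighborhood_expansion(start=(1, 1), limit=3)
```

Only the four neighbors of (P2, S2) are assembled after the first step, rather than all nine joint states.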
20Progressive Merge (2)
Ranking Function: (price - 1k)^2 + (sq feet - 800)^2
Price (P1, P2, P3): 500-800, 1000-1100, 1200-1350
Sq feet (S1, S2, S3): 200-500, 600-800, 900-1120
f(Pi, S): P1 = 40,000, P2 = 0, P3 = 40,000
f(P, Si): S1 = 90,000, S2 = 0, S3 = 10,000
- General function: not convex, not monotone?
- Step 1: Sort Pi by f(Pi, S), sort Si by f(P, Si)
f(Pi, S) = min f(x) over x in Pi x S, i.e., price in the Pi node and sq_feet anywhere in S
Sorted lists of P and S:
P2 P1 P3
S2 S3 S1
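Step 1 can be reproduced with a short Python sketch. It assumes the ranking function (price - 1000)^2 + (sq feet - 800)^2 and relies on the fact that the full range of the other attribute contains its target, so that term contributes 0 to the partial bound. The names `partial_bound`, `p_bounds`, and `s_bounds` are hypothetical.

```python
price_nodes = {"P1": (500, 800), "P2": (1000, 1100), "P3": (1200, 1350)}
sqft_nodes = {"S1": (200, 500), "S2": (600, 800), "S3": (900, 1120)}


def partial_bound(interval, target):
    """min (x - target)^2 for x in the interval; the other attribute is unconstrained."""
    lo, hi = interval
    nearest = max(lo, min(target, hi))
    return (nearest - target) ** 2


p_bounds = {name: partial_bound(iv, 1000) for name, iv in price_nodes.items()}  # f(Pi, S)
s_bounds = {name: partial_bound(iv, 800) for name, iv in sqft_nodes.items()}    # f(P, Si)

# Python's sort is stable, so the tie between P1 and P3 keeps P1 first.
sorted_p = sorted(p_bounds, key=p_bounds.get)
sorted_s = sorted(s_bounds, key=s_bounds.get)
```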
21Progressive Merge (3)
Ranking Function: (price - 1k)^2 + (sq feet - 800)^2
Price (P1, P2, P3): 500-800, 1000-1100, 1200-1350
Sq feet (S1, S2, S3): 200-500, 600-800, 900-1120
f(Pi, S): P1 = 40,000, P2 = 0, P3 = 40,000
f(P, Si): S1 = 90,000, S2 = 0, S3 = 10,000
- General function: not convex, not monotone?
- Step 2: Merge Pi, Sj progressively
Sorted lists of P and S:
P2 P1 P3
S2 S3 S1
Progressive merged results:
(P2, S2) (P2, S3) (P1, S2) (P1, S3) (P2, S1) (P1, S1) (P3, S2) (P3, S3) (P3, S1)
22Progressive Merge (4)
Ranking Function: (price - 1k)^2 + (sq feet - 800)^2
Price (P1, P2, P3): 500-800, 1000-1100, 1200-1350
Sq feet (S1, S2, S3): 200-500, 600-800, 900-1120
f(Pi, S): P1 = 40,000, P2 = 0, P3 = 40,000
f(P, Si): S1 = 90,000, S2 = 0, S3 = 10,000
- General function: not convex, not monotone?
- Step 3: Stop when the best seen is no worse than the best possible unseen
Progressive merged results:
(P2, S2) (P2, S3) (P1, S2) (P1, S3) (P2, S1) (P1, S1) (P3, S2) (P3, S3) (P3, S1)
Best seen: f(P2, S2) = 0
Stop condition: f(P2, S2) <= min(f(P1, S), f(P, S3))
Best possible of unseen: min(f(P1, S), f(P, S3)) = 10,000
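The stopping rule can be sketched in Python as a threshold-style loop over the two sorted lists. This is an illustration under stated assumptions, not the authors' algorithm: joint-state bounds are taken as the sum of the two partial bounds (valid here because the ranking function is a sum of one term per attribute), the next node is drawn from whichever list has the smaller next bound, and the name `threshold_expansion` is hypothetical.

```python
# Sorted lists from Step 1: (node, partial lower bound).
sorted_p = [("P2", 0), ("P1", 40_000), ("P3", 40_000)]
sorted_s = [("S2", 0), ("S3", 10_000), ("S1", 90_000)]
INF = float("inf")


def threshold_expansion(sorted_p, sorted_s):
    """Assemble joint states progressively; stop when best seen <= best possible unseen."""
    seen_p, seen_s = [sorted_p[0]], [sorted_s[0]]
    assembled = {(sorted_p[0][0], sorted_s[0][0]): sorted_p[0][1] + sorted_s[0][1]}
    while True:
        next_p = sorted_p[len(seen_p)][1] if len(seen_p) < len(sorted_p) else INF
        next_s = sorted_s[len(seen_s)][1] if len(seen_s) < len(sorted_s) else INF
        # Every unseen joint state contains an unseen P node or an unseen S node,
        # so it cannot score below the smaller of the two next partial bounds.
        threshold = min(next_p, next_s)
        best = min(assembled.values())
        if best <= threshold:
            return assembled, best, threshold
        if next_p <= next_s:
            name, bound = sorted_p[len(seen_p)]
            seen_p.append((name, bound))
            for s_name, s_bound in seen_s:
                assembled[(name, s_name)] = bound + s_bound
        else:
            name, bound = sorted_s[len(seen_s)]
            seen_s.append((name, bound))
            for p_name, p_bound in seen_p:
                assembled[(p_name, name)] = p_bound + bound


assembled, best, threshold = threshold_expansion(sorted_p, sorted_s)
```

On the slide's data the loop stops immediately: the best seen bound is 0 at (P2, S2) and nothing unseen can be better than 10,000. A full top-k implementation would compare tuple scores rather than state bounds; that part is omitted here.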
23How Good is the Progressive Merge?
- Optimal case
- Suppose Sk is the kth best score in the final results
- Optimal algorithm only examines all joint states s such that f(s) > Sk
- Convex functions N mN
- N: number of joint states examined by neighborhood expansion
- N: optimal number
- m: number of indices to be merged
- General functions N 2mN
- N: number of joint states examined by threshold expansion
- N: any other algorithm in the same category
- m: number of indices to be merged
24Road Map
- Introduction
- Sort-Merge vs. Index-Merge
- Proposed Solutions
- Problem Analysis
- Progressive Merge
- Selective Merge
- Experimental Results
- Conclusions
25Selective State
Join Price and Sq feet
(Price, Sq feet)
(P1,S1)
(P1,S2)
(P1,S3)
(P2,S1)
(P3,S3)
P1 contains (t1, t2, t3)
S1 contains (t5, t6, t7)
Join P1 and S1: empty state
How to identify empty states before the joint states are retrieved?
26Join Signature
Data and their location in the index:
Price (P1, P2, P3): 5-8, 10-11, 12-13
Sq feet (S1, S2, S3): 2-5, 6-8, 9-11
P1: (5, t1) (7, t2) (8, t3)
P2: (10, t4) (11, t5)
P3: (12, t6) (12, t7) (13, t8)
S1: (2, t5) (5, t6) (5, t7)
S2: (6, t1) (8, t4)
S3: (9, t2) (10, t8) (11, t3)
- Join Signature indicates empty states
- Compress the join signature by Bloom filter
- Check the join signature before assembling states
Join Signature:
     S1  S2  S3
P1    0   1   1
P2    1   1   0
P3    1   0   1
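The join signature for this example can be built with a short Python sketch: record, for every tuple, which Price leaf and which Sq feet leaf hold it, and set one bit per occupied pair. The `location` mapping is read off the leaf contents on this slide; the function names are hypothetical, and the Bloom filter compression mentioned above is not shown.

```python
P_NODES = ["P1", "P2", "P3"]
S_NODES = ["S1", "S2", "S3"]

# Leaf of each tuple in the Price index and in the Sq feet index.
location = {
    "t1": ("P1", "S2"), "t2": ("P1", "S3"), "t3": ("P1", "S3"),
    "t4": ("P2", "S2"), "t5": ("P2", "S1"),
    "t6": ("P3", "S1"), "t7": ("P3", "S1"), "t8": ("P3", "S3"),
}


def build_signature(location):
    """Bit matrix: rows are Price nodes, columns are Sq feet nodes, 1 = non-empty."""
    non_empty = set(location.values())
    return [[1 if (p, s) in non_empty else 0 for s in S_NODES] for p in P_NODES]


signature = build_signature(location)


def is_empty(p_name, s_name):
    """Check the signature before assembling joint state (p_name, s_name)."""
    return signature[P_NODES.index(p_name)][S_NODES.index(s_name)] == 0
```

Checking `is_empty("P1", "S1")` before assembling the state avoids retrieving either leaf for a join that would return nothing.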
27Revisit Comparing to Sort-merge
- Progressive Merge
- Corresponds to Sorted Access scheduling in sort-merge
- Node access scheduling in index-merge is more complicated
- Selective Merge
- Corresponds to Random Access in sort-merge
- Random access does not help in index-merge (details in paper)
- The join signature is also applicable to sort-merge
28Road Map
- Introduction
- Sort-Merge vs. Index-Merge
- Proposed Solutions
- Problem Analysis
- Progressive Merge
- Selective Merge
- Experimental Results
- Conclusions
29Execution Time w.r.t. Different Functions
- Join two B-tree indices
- TS: Table Scan, BL: Baseline, PE: Progressive Merge, SIG: Signature-based Selective Merge
F = (x-a)^2 + (y-b)^2
F = (x-y^2)^2
30Disk Access and Memory Usage
- Join two B-tree indices, Top 100 query
- BL: Baseline, PE: Progressive Merge, SIG: Signature
F = (x-y^2)^2
31Road Map
- Introduction
- Sort-Merge vs. Index-Merge
- Proposed Solutions
- Problem Analysis
- Progressive Merge
- Selective Merge
- Experimental Results
- Conclusions
32Discussion and Conclusions
- The Index-Merge Framework
- Extends the sort-merge framework
- Does not confine to monotone functions
- Efficient Computation
- Progressive Merge reduces the CPU and memory overhead
- Selective Merge reduces the disk I/O cost
33Thank You!