Towards Robust Indexing for Ranked Queries

1
Towards Robust Indexing for Ranked Queries

Dong Xin, Chen Chen, Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
VLDB 2006

2
Outline

Introduction
Robust Index
Compute Robust Index
Exact Solution
Approximate Solution
Multiple Indices
Performance Study
Discussion and Conclusions

3
Introduction

Sample Database R
Tid  A1    A2
t1   0.10  1.00
t2   0.15  0.80
t3   0.25  0.55
t4   0.40  0.35
t5   0.80  0.25
t6   0.30  0.70
t7   0.35  0.50
t8   0.75  0.45

Ranked Query (linear ranking function)
Select top 3 from R order by A1 + A2 asc

Query Results
Tid  A1    A2    A1+A2
t4   0.40  0.35  0.75
t3   0.25  0.55  0.80
t7   0.35  0.50  0.85
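The ranked query above can be sketched in a few lines of Python. This is only an illustration, assuming the sample table is held in memory as a dict; the names `R`, `score` and `top_k` are hypothetical and do not come from the slides.

```python
# Sample database R from the slide: tid -> (A1, A2).
R = {
    "t1": (0.10, 1.00), "t2": (0.15, 0.80), "t3": (0.25, 0.55),
    "t4": (0.40, 0.35), "t5": (0.80, 0.25), "t6": (0.30, 0.70),
    "t7": (0.35, 0.50), "t8": (0.75, 0.45),
}

def score(values, weights):
    """Linear ranking function F = w1*A1 + w2*A2 (+ ...)."""
    return sum(w * v for w, v in zip(weights, values))

def top_k(rows, weights, k):
    """Naive evaluation: score every tuple, sort ascending, keep k."""
    ranked = sorted(rows, key=lambda tid: score(rows[tid], weights))
    return ranked[:k]

# "Select top 3 from R order by A1 + A2 asc"
print(top_k(R, (1, 1), 3))  # ['t4', 't3', 't7']
```

This is the full-scan baseline that the indexing methods on the following slides try to avoid.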
4
Efficient Processing of Ranked Queries

Naïve solution: scan the whole database and evaluate all tuples
Using indices or materialized views
  Distributed indexing
    Sort each attribute individually and merge the attributes with a threshold algorithm (TA) [Fagin et al., PODS'96, '99, '01]
  Spatial indexing
    Organize tuples into an R-tree and determine a threshold to prune the search space [Goldstein et al., PODS'97]
    Organize tuples into an R-tree and retrieve data progressively [Papadias et al., SIGMOD'03]
  Sequential indexing
    Organize tuples into convex hulls [Chang et al., SIGMOD'00]
    Materialize ranked views according to the preference functions [Hristidis et al., SIGMOD'01]
  And more
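The distributed-indexing idea (one sorted list per attribute, merged under a threshold) can be sketched as follows. This is a simplified, in-memory rendering of the threshold-algorithm idea for ascending linear scores with non-negative weights, not the cited authors' implementation; `R`, `score` and `threshold_topk` are hypothetical names, and the data is the sample table from the Introduction slide.

```python
import heapq

R = {
    "t1": (0.10, 1.00), "t2": (0.15, 0.80), "t3": (0.25, 0.55),
    "t4": (0.40, 0.35), "t5": (0.80, 0.25), "t6": (0.30, 0.70),
    "t7": (0.35, 0.50), "t8": (0.75, 0.45),
}

def score(tid, weights):
    return sum(w * v for w, v in zip(weights, R[tid]))

def threshold_topk(weights, k):
    """Top-k (smallest scores) by walking the sorted lists in parallel.

    Assumes non-negative weights, so the score of the current frontier
    is a lower bound for every tuple not seen yet.
    """
    dims = len(weights)
    # One list of tids per attribute, sorted by that attribute (ascending).
    sorted_lists = [sorted(R, key=lambda tid, j=j: R[tid][j]) for j in range(dims)]
    seen = {}   # tid -> full score, filled in by "random access"
    best = []
    for depth in range(len(R)):
        frontier = []
        for j in range(dims):
            tid = sorted_lists[j][depth]               # sorted access
            frontier.append(R[tid][j])
            seen.setdefault(tid, score(tid, weights))  # random access
        threshold = sum(w * v for w, v in zip(weights, frontier))
        best = heapq.nsmallest(k, seen.items(), key=lambda kv: kv[1])
        # Stop once the k-th best seen score cannot be beaten by unseen tuples.
        if len(best) == k and best[-1][1] <= threshold:
            break
    return [tid for tid, _ in best]

print(threshold_topk((1, 1), 3))
```

For the query A1 + A2 this stops before the lists are exhausted and returns the same three tuples as a full scan.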

5
Sequential Indexing

Sequential index (ranked view)
  Linearly sort tuples
  No sophisticated data structures
  Sequential data access (good for database I/O)
Representative work
  Onion [Chang et al., SIGMOD'00]
  PREFER [Hristidis et al., SIGMOD'01]
  Our proposal: Robust Index

6
Review: Onion Technique
Sample Database R
Tid  A1    A2
t1   0.10  1.00
t2   0.15  0.80
t3   0.25  0.55
t4   0.40  0.35
t5   0.80  0.25
t6   0.30  0.70
t7   0.35  0.50
t8   0.75  0.45

Index by convex hull
  Retrieve data layer by layer until the top-k results are found
  In the worst case, retrieve the top k layers of tuples
[Figure: convex-hull layers over t1 to t8 in the (A1, A2) plane, labeled First layer and Second layer]

Ranked Query
Select top 3 from R order by a*A1 + b*A2 asc

Index by convex shell, if a and b are non-negative (a and b are the weighting parameters of the linear ranking function)
  Expect fewer tuples in each layer
[Figure: convex-shell layers over t1 to t8 in the (A1, A2) plane, labeled First layer and Second layer]
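The layer-by-layer idea can be sketched for the two-dimensional convex-shell case. This is a minimal illustration, not the Onion implementation: it assumes ascending queries with non-negative weights, takes the shell to be the part of the lower convex hull that runs from the smallest-A1 point down to the smallest-A2 point, and always reads the worst-case top k layers. All function names are hypothetical.

```python
R = {
    "t1": (0.10, 1.00), "t2": (0.15, 0.80), "t3": (0.25, 0.55),
    "t4": (0.40, 0.35), "t5": (0.80, 0.25), "t6": (0.30, 0.70),
    "t7": (0.35, 0.50), "t8": (0.75, 0.45),
}

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def shell(points):
    """Tids on the lower-left convex shell of {tid: (A1, A2)}."""
    order = sorted(points, key=lambda tid: points[tid])
    hull = []
    for tid in order:  # monotone-chain lower hull, left to right
        while len(hull) >= 2 and cross(points[hull[-2]], points[hull[-1]], points[tid]) <= 0:
            hull.pop()
        hull.append(tid)
    # Keep the chain only down to the vertex with the smallest A2.
    lowest = min(range(len(hull)), key=lambda i: points[hull[i]][1])
    return hull[:lowest + 1]

def shell_layers(points):
    """Peel shells repeatedly: layer 1, layer 2, ..."""
    remaining, layers = dict(points), []
    while remaining:
        layer = shell(remaining)
        layers.append(layer)
        for tid in layer:
            del remaining[tid]
    return layers

def layered_top_k(layers, points, weights, k):
    """Worst-case retrieval: read the first k layers, then rank them."""
    fetched = [tid for layer in layers[:k] for tid in layer]
    fetched.sort(key=lambda tid: weights[0] * points[tid][0] + weights[1] * points[tid][1])
    return fetched[:k]

layers = shell_layers(R)
print(layers)
print(layered_top_k(layers, R, (1, 1), 3))
```

On the sample table the peeling yields two layers, t1 to t5 and then t6, t7, t8, so a top-3 query still touches every tuple. That conservativeness is what the robust index later tries to reduce.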
7
Review: PREFER System
Sample Database R
Tid  A1    A2
t1   0.10  1.00
t2   0.15  0.80
t3   0.25  0.55
t4   0.40  0.35
t5   0.80  0.25
t6   0.30  0.70
t7   0.35  0.50
t8   0.75  0.45

Index on a preference ranking function
  Here: index by the ranking function A1 + A2

Ranked Query
Select top 3 from R order by w1*A1 + w2*A2 asc

Given the query ranking function A1 + 2A2
  Map the query ranking function to the index ranking function
  Will retrieve t1, t2, t3, t4, t6, t7
[Figure: tuples t1 to t8 in the (A1, A2) plane, with the query ranking function and the mapping from query to preference marked]
8
Comments on Sequential Indexing

PREFER
  Works extremely well when query functions are close to the index function
  Sensitive to query weights
Onion
  Less sensitive to query weights
  Can we do better?
Both methods
  Require considerable online computation
Motivation for robust indexing
  Not sensitive to query weights
  Push most computation to the index-building phase

[Figure: average number of tuples retrieved over 10 random queries asking for top-50 answers; query weights are randomly selected from {1, 2, 3, 4}]
9
Outline

Introduction
Robust Index
Compute Robust Index
Exact Solution
Approximate Solution
Multiple Indices
Performance Study
Discussion and Conclusions

10
Robust Indexing: Motivating Example
Index by convex hull (shell)
  Organize data layer by layer
  To preserve convexity, each layer is built conservatively
[Figure: convex-shell layers over t1 to t8 in the (A1, A2) plane, labeled First layer and Second layer]
Robust index
  Organize data layer by layer
  Exploit dominating properties between data and push each tuple as deep as possible
  t7 is dominated by t2 and t4 (for any query, at least one of t2 and t4 ranks before t7)
  t7 is dominated by t3 and t5
[Figure: robust-index layers over t1 to t8 in the (A1, A2) plane, labeled First layer, Layer 3 (twice) and Layer 4]
11
Robust Indexing: Formal Definition

How does it work?
  Offline phase
    Put each tuple in its deepest layer: its minimal (best) rank over all possible linear queries
  Online phase
    Retrieve the tuples in the top k layers
    Evaluate all of them and report the top k
What is expected?
  Correctness
  Fewer tuples in each layer than with the convex hull
  A tuple that does not belong to the top k of any query will not be retrieved
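The two phases can be sketched in Python on the sample table from the Introduction slide. This is a brute-force illustration under stated assumptions, not the construction algorithm of the talk: the offline phase approximates "all possible linear queries" by sampling weights with w1 + w2 = 1 on a fixed grid, which can overestimate a tuple's layer on other data. The names `rank_of`, `layer` and `robust_top_k` are hypothetical.

```python
R = {
    "t1": (0.10, 1.00), "t2": (0.15, 0.80), "t3": (0.25, 0.55),
    "t4": (0.40, 0.35), "t5": (0.80, 0.25), "t6": (0.30, 0.70),
    "t7": (0.35, 0.50), "t8": (0.75, 0.45),
}

def f(tid, w1, w2):
    return w1 * R[tid][0] + w2 * R[tid][1]

def rank_of(tid, w1, w2):
    """1-based rank of tid under F = w1*A1 + w2*A2, ascending."""
    return 1 + sum(1 for other in R if f(other, w1, w2) < f(tid, w1, w2))

# Offline phase (approximate): minimal rank over a grid of sampled weights.
GRID = [i / 200 for i in range(201)]
layer = {tid: min(rank_of(tid, w, 1 - w) for w in GRID) for tid in R}

def robust_top_k(w1, w2, k):
    """Online phase: fetch tuples in the top k layers, evaluate, report k."""
    candidates = [tid for tid in R if layer[tid] <= k]
    candidates.sort(key=lambda tid: f(tid, w1, w2))
    return candidates[:k]

print(layer)
print(robust_top_k(0.5, 0.5, 3))
```

A tuple that appears in the top k of some query has rank at most k for that query, so its minimal rank (its layer) is at most k and the filter `layer <= k` cannot drop it. That is the correctness argument behind the online phase.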

12
Robust Indexing: Appealing Properties

Database friendly
  No online algorithm required
  Simply use the following SQL statement:
    Select top k from R
    where layer <= k
    order by F_rank
Space efficient
  Suppose an upper bound on the value of k is given (e.g., k <= 100)
  Only the tuples in the top 100 layers need to be indexed
  Robust indexing uses the least space compared with the alternatives
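The SQL statement above can be tried with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite has no `TOP` clause, so `LIMIT` is used instead; the table layout and the function `ranked_query` are hypothetical; and the `layer` values are not taken from the slides but were worked out by hand for the sample table, for weights between 0 and 1.

```python
import sqlite3

# (tid, A1, A2, layer); the layer column is an assumed, hand-derived value.
ROWS = [
    ("t1", 0.10, 1.00, 1), ("t2", 0.15, 0.80, 1), ("t3", 0.25, 0.55, 1),
    ("t4", 0.40, 0.35, 1), ("t5", 0.80, 0.25, 1), ("t6", 0.30, 0.70, 4),
    ("t7", 0.35, 0.50, 3), ("t8", 0.75, 0.45, 3),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (tid TEXT PRIMARY KEY, A1 REAL, A2 REAL, layer INTEGER)")
conn.executemany("INSERT INTO R VALUES (?, ?, ?, ?)", ROWS)

def ranked_query(w1, w2, k):
    """Top-k by w1*A1 + w2*A2, reading only tuples in the first k layers."""
    sql = (
        "SELECT tid FROM R WHERE layer <= :k "
        "ORDER BY :w1 * A1 + :w2 * A2 ASC LIMIT :k"
    )
    return [row[0] for row in conn.execute(sql, {"w1": w1, "w2": w2, "k": k})]

print(ranked_query(1, 1, 3))  # ['t4', 't3', 't7']
```

With an ordinary index on the `layer` column, the `WHERE` clause becomes a range scan, which is the sense in which no special online algorithm is needed.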

13
Outline

Introduction
Robust Index
Compute Robust Index
Exact Solution
Approximate Solution
Multiple Indices
Performance Study
Discussion and Conclusions

14
Robust Indexing: Algorithm Highlights

Exact solution
  Compute the deepest layer for each tuple
  Complexity
    n: number of tuples
    d: number of dimensions
Approximate solution
  Compute a lower-bound layer for each tuple
  Complexity
Multiple indices
  Transform R into different subspaces by linear transformation
  Build an index in each subspace

15
Exact Solution
Task: compute the minimal rank of tuple t over all possible linear queries
Given a query Q with ranking function F = w1*A1 + w2*A2, where 0 <= w1, w2 <= 1 and w1 + w2 = 1
  Q maps one-to-one to a line L, e.g., A1 + 2A2 maps to L1
Naïve proposal: enumerate all possible combinations of (w1, w2)
  Not feasible, since the enumeration space is infinite
Alternative solution: only enumerate the (w1, w2) whose corresponding line passes through t and another tuple, e.g., L1, ..., L4
  Do not consider t3 and t6, because the corresponding weights do not satisfy 0 <= w1, w2 <= 1
[Figure: tuples t and t1 to t6 in the (A1, A2) plane, with lines L1 to L4 through t]
16
Exact Solution, cont.
Task: compute the minimal rank of tuple t over all possible linear queries
Given a query Q with ranking function F = w1*A1 + w2*A2, where 0 <= w1, w2 <= 1 and w1 + w2 = 1
[Figure: tuples t and t1 to t6 in the (A1, A2) plane, with lines LV, L1, L2, L3, L4 and LH through t]
  LV => L1: minimal rank is 4 (after t1, t2, t3)
  L1 => L2: minimal rank is 3 (after t2, t3)
  L2 => L3: minimal rank is 4 (after t2, t3, t4)
  L3 => L4: minimal rank is 3 (after t3, t4)
  L4 => LH: minimal rank is 4 (after t3, t4, t5)
Minimal rank (the deepest layer) of t is 3
Complexity: sorting all lines takes O(n log n) [the costs for all t and for the general case are not preserved in the transcript]
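The line-enumeration idea can be sketched for two dimensions. The coordinates of the tuples in the figure are not given in the transcript, so this sketch runs on the sample table R from the Introduction slide instead. It is an illustration of the idea rather than the algorithm from the paper: it parameterizes the query as F = w*A1 + (1 - w)*A2, collects the weights at which t ties with another tuple (the lines through t and that tuple), and evaluates the rank once inside each interval between consecutive ties. The name `minimal_rank` is hypothetical.

```python
R = {
    "t1": (0.10, 1.00), "t2": (0.15, 0.80), "t3": (0.25, 0.55),
    "t4": (0.40, 0.35), "t5": (0.80, 0.25), "t6": (0.30, 0.70),
    "t7": (0.35, 0.50), "t8": (0.75, 0.45),
}

def minimal_rank(tid, rows):
    """Best (smallest) rank of tid over F = w*A1 + (1-w)*A2, 0 <= w <= 1."""
    a1, a2 = rows[tid]
    others = [p for other, p in rows.items() if other != tid]
    cuts = {0.0, 1.0}
    for b1, b2 in others:
        # The scores tie where w*(a1 - b1) + (1 - w)*(a2 - b2) = 0.
        denom = (a1 - b1) - (a2 - b2)
        if denom != 0:
            w = -(a2 - b2) / denom
            if 0.0 < w < 1.0:   # weights outside [0, 1] are not valid queries
                cuts.add(w)
    cuts = sorted(cuts)
    best = len(rows)
    for lo, hi in zip(cuts, cuts[1:]):
        w = (lo + hi) / 2       # the rank is constant inside each interval
        own = w * a1 + (1 - w) * a2
        ahead = sum(1 for b1, b2 in others if w * b1 + (1 - w) * b2 < own)
        best = min(best, ahead + 1)
    return best

layers = {tid: minimal_rank(tid, R) for tid in R}
print(layers)
```

On the sample table this places t1 to t5 in layer 1, t7 and t8 in layer 3, and t6 in layer 4. Each tuple needs at most n - 1 tie points, so the enumeration is finite even though the weight space is not.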
17
Approximate Solution
Task: compute a lower bound of the minimal rank of tuple t
Four regions around t
  II: dominating region, data ranked before t
  IV: dominated region, data ranked after t
  I and III: ?
Step 1: Partition regions I and III
Step 2: Count the cardinalities of region II and of sub-regions I1, ..., I4, III1, ..., III4
Step 3: Match the cardinalities of the sub-regions and compute the lower bound
Lower Bounding Theorem:
  Minimal rank of t >= card(II) + min( card(I3 ∪ I2 ∪ I1), card(I2 ∪ I1 ∪ III1), card(I1 ∪ III1 ∪ III2), card(III1 ∪ III2 ∪ III3) )
[Figure: regions I to IV around t in the (A1, A2) plane, with region I split into sub-regions I1 to I4 and region III split into III1 to III4]
18
Approximate Solution, Cont.
Step 2: Count the cardinalities of region II and of sub-regions I1, ..., I4, III1, ..., III4
Count the cardinality of region II?
  1. All tuples in region II dominate t
  2. A reversed version of the skyline problem
  3. Standard divide-and-conquer solution (details in the paper)
Count the cardinality of region I1?
  Suppose t = (a1, a2)
  Line L: A1 + 0.25*A2 = a1 + 0.25*a2
  Tuples in region I1 satisfy -A1 <= -a1 and A1 + 0.25*A2 <= a1 + 0.25*a2
  Transform: A1' = -A1, A2' = A1 + 0.25*A2

Original tuples
Tid  A1    A2
t    0.50  0.50
t1   0.15  0.80
t2   0.25  0.55
t3   0.40  0.35

Transformed tuples
Tid  A1'    A2'
t    -0.50  0.63
t1   -0.15  0.35
t2   -0.25  0.39
t3   -0.40  0.49
[Figure: regions I to IV around t in the (A1, A2) plane, with line L bounding sub-region I1]
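The transformation for region I1 can be checked with a short Python sketch. It assumes the mapping reads A1' = -A1 and A2' = A1 + 0.25*A2, which is consistent with the two small tables on this slide, and it treats the region boundaries as inclusive; both are readings of a damaged transcript. The names `transform` and `count_region_i1`, and the extra point `(0.55, 0.20)`, are hypothetical.

```python
SAMPLE = {
    "t": (0.50, 0.50), "t1": (0.15, 0.80),
    "t2": (0.25, 0.55), "t3": (0.40, 0.35),
}

def transform(p, slope=0.25):
    """Map (A1, A2) to (A1', A2') = (-A1, A1 + slope*A2)."""
    a1, a2 = p
    return (-a1, a1 + slope * a2)

def count_region_i1(t, others, slope=0.25):
    """Count tuples that dominate t in the transformed space.

    After the transformation, membership in sub-region I1 is the same
    test as membership in a dominating region: both coordinates <= t's.
    """
    tp = transform(t, slope)
    return sum(
        1 for p in others
        if all(x <= y for x, y in zip(transform(p, slope), tp))
    )

for tid, p in SAMPLE.items():
    print(tid, transform(p))

others = [SAMPLE["t1"], SAMPLE["t2"], SAMPLE["t3"]]
print(count_region_i1(SAMPLE["t"], others))                    # 0
print(count_region_i1(SAMPLE["t"], others + [(0.55, 0.20)]))   # 1
```

None of t1, t2, t3 falls in I1 because all three have A1 below 0.50; the hypothetical point (0.55, 0.20) does. Reducing each sub-region to a dominance count means the same counting routine used for region II can be reused.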
19
Quality of the Approximate Solution

Complexity
  B: number of partitions in each subspace
  n: number of tuples
  d: number of dimensions
Approximation quality
  Assume the data follow a uniform distribution
  Each subspace is partitioned evenly
  Partitioning according to the data distribution is an important and interesting future topic

20
Multiple Indices

Why?
  To relax the constraint
  To decompose and strengthen the constraints
    Ranking function: F = w1*A1 + w2*A2
    Relaxed: F = w1*A1 + w2*A2, where 0 <= w1, w2 <= 1
    Strengthened: F = w1*A1 + w2*A2, where 0 <= w1 <= w2 <= 1 or 0 <= w2 <= w1 <= 1
How? (e.g., for w1 <= w2)
  Linearly transform R to R', and build an index on R'
    (A1, A2) => (A1 + A2, A2)
  Rewrite the query weights
    (w1, w2) => (w1, w2 - w1)
  Data are projected to a smaller subspace (e.g., A1' >= A2' in the transformed subspace)
  Tuples can be pushed deeper, since more domination can be found
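The rewriting for the case w1 <= w2 rests on the identity w1*A1 + w2*A2 = w1*(A1 + A2) + (w2 - w1)*A2, which a few lines of Python can confirm on the sample table from the Introduction slide. This is only a numerical check; the function names are hypothetical.

```python
R = {
    "t1": (0.10, 1.00), "t2": (0.15, 0.80), "t3": (0.25, 0.55),
    "t4": (0.40, 0.35), "t5": (0.80, 0.25), "t6": (0.30, 0.70),
    "t7": (0.35, 0.50), "t8": (0.75, 0.45),
}

def to_index_space(a1, a2):
    """(A1, A2) => (A1 + A2, A2)."""
    return (a1 + a2, a2)

def rewrite_weights(w1, w2):
    """(w1, w2) => (w1, w2 - w1); only valid for 0 <= w1 <= w2."""
    if not 0 <= w1 <= w2:
        raise ValueError("this index covers queries with 0 <= w1 <= w2")
    return (w1, w2 - w1)

w1, w2 = 0.3, 0.7
v1, v2 = rewrite_weights(w1, w2)
for tid, (a1, a2) in R.items():
    b1, b2 = to_index_space(a1, a2)
    original = w1 * a1 + w2 * a2
    rewritten = v1 * b1 + v2 * b2
    print(tid, round(original, 4), round(rewritten, 4))
```

Both rewritten weights stay non-negative, and every transformed tuple satisfies A1' >= A2' as long as A1 is non-negative. The data therefore occupy only part of the transformed space, which is where the extra domination comes from.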
21
Multiple Indices, Cont.
Number of tuples in top-k layers (synthetic data, 10K tuples)
Top-k  Convex Shell  Robust Indexing
5      329           148
10     823           262
20     2064          427
50     6130          813
100    9965          1271
150    10000         1618
200    10000         1922
Using the same index space, robust indexing can build 8 indices (if the value of k is upper-bounded by 100)
22
Outline

Introduction
Robust Index
Compute Robust Index
Exact Solution
Approximate Solution
Multiple Indices
Performance Study
Discussion and Conclusions

23
Performance Study

Data
  Synthetic data
  Real datasets (abalone3D, cover3D)
Measure
  Number of tuples retrieved
  Execution time is not reported, but robust indexing is expected to be even better
Approaches for comparison
  Onion (convex shell)
  PREFER
  Approximate Robust Indexing (AppRI), partition = 10

24
Index Construction Time
Convex Shell, Convex Hull and AppRI are implemented in C
Construction time for PREFER is not included, since it is implemented in Java
Using the system default parameters, PREFER takes more than 1200 seconds on the 50k data set
25
Query Performance
[Figure: average number of tuples retrieved on synthetic data]
[Figure: average number of tuples retrieved on the Cover3D data set]
26
Multiple Indices (Views)
Synthetic data, 3 dimensions
Build 3 robust indices by decomposing the weighting parameters:
  w1 = max(w1, w2, w3); w2 = max(w1, w2, w3); w3 = max(w1, w2, w3)
27
Discussion and Conclusions

Strengths
  Easy to integrate with current DBMSs
  Good query performance
  Practical construction complexity
Limitations
  Online index maintenance is expensive (some weaker maintenance strategies are available)
  Indexing high-dimensional data remains a challenging problem
