Title: Nearest Neighbor Search in Spatial and Spatiotemporal Databases
1. Nearest Neighbor Search in Spatial and
Spatiotemporal Databases
- Dimitris Papadias
- Hong Kong University of Science and Technology
2. Spatial and spatiotemporal databases
- Spatial databases manage large collections of
multi-dimensional objects.
- Important query types:
- Window query: Retrieve all rivers in CA
- Nearest neighbor: Find my nearest gas station
- Spatial join: Report pairs of (city C, river R)
such that R crosses C
- Spatiotemporal databases deal with the same
queries, but over moving objects. Application areas:
- Mobile computing
- Traffic supervision
- Flight control
- Weather forecasting
3. R-trees (Guttman, SIGMOD 84; Sellis et al., VLDB 87;
Beckmann et al., SIGMOD 90)
4. TPR-trees (Saltenis et al., SIGMOD 00; our group,
VLDB 03)
- The TPR-tree extends the R-tree by introducing the
velocity bounding rectangle (VBR) in non-leaf entries.
- Objects are grouped together based on both their
locations and velocities.
5. Conventional NN search with R-(TPR-) trees
- Depth-first traversal (Roussopoulos et al., SIGMOD 95)
- Best-first traversal (Hjaltason and Samet, TODS
99): incremental and optimal in node accesses
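The best-first traversal can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_toy_tree` is a hypothetical stand-in for a real R-tree (it simply packs sorted points bottom-up), and all function names are invented for this example. The key idea it shows is a single priority queue ordered by mindist, holding both nodes and points, so neighbors come out incrementally in ascending distance.

```python
import heapq
import math
from itertools import count

def mindist(q, mbr):
    """Smallest Euclidean distance from point q to rectangle (xlo, ylo, xhi, yhi)."""
    xlo, ylo, xhi, yhi = mbr
    dx = max(xlo - q[0], 0.0, q[0] - xhi)
    dy = max(ylo - q[1], 0.0, q[1] - yhi)
    return math.hypot(dx, dy)

def bounding_box(points):
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def build_toy_tree(points, fanout=2):
    """Hypothetical stand-in for an R-tree: pack sorted points bottom-up."""
    pts = sorted(points)
    level = [{"mbr": bounding_box(pts[i:i + fanout]),
              "entries": pts[i:i + fanout], "leaf": True}
             for i in range(0, len(pts), fanout)]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            # Both corners of every child MBR determine the parent MBR.
            corners = [c for n in group for c in (n["mbr"][:2], n["mbr"][2:])]
            parents.append({"mbr": bounding_box(corners),
                            "entries": group, "leaf": False})
        level = parents
    return level[0]

def best_first_nn(root, q):
    """Yield (distance, point) pairs in ascending distance from q."""
    tie = count()  # tie-breaker so the heap never compares nodes with points
    heap = [(mindist(q, root["mbr"]), next(tie), root)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if not isinstance(item, dict):
            yield d, item          # a data point: it is the next NN
            continue
        for entry in item["entries"]:
            if item["leaf"]:
                key = math.dist(q, entry)
            else:
                key = mindist(q, entry["mbr"])
            heapq.heappush(heap, (key, next(tie), entry))
```

Because the generator is lazy, asking for one more neighbor only expands the nodes whose mindist is smaller than that neighbor's distance.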
6. NN search - other approaches
- Several algorithms and theoretical performance
bounds have been devised for exact and
approximate processing in main memory. Here we
care about I/O efficiency (minimizing node
and page accesses) and about cost models that
predict practical performance (suitable for query
optimization).
- Several approaches address NN search in
high-dimensional spaces, but that problem is
different due to the dimensionality curse. Here we
consider low-dimensional spaces (spatial and
spatiotemporal databases).
- Ferhatosmanoglu et al. (SSTD 01) discover the NN
in a constrained area of the data space (e.g.,
find the NN to the south of the query point).
- Korn and Muthukrishnan (SIGMOD 00) discuss
reverse nearest neighbor queries, where the goal
is to retrieve the data points whose nearest
neighbor is a specified query point.
- Korn et al. (VLDB 02) study the same problem in
the context of data streams, where the data are
not known in advance.
7. NN search for mobile queries
- Zheng and Lee (SSTD 01) return the current NN
and the validity time of the result.
- Restrictions: (i) assumes a maximum speed, (ii)
applicable only to a single NN, (iii) requires
Voronoi diagrams.
- Song and Roussopoulos (SSTD 01) minimize the
number of queries for moving clients by returning
m > k NNs.
- Problem: how to determine m.
- If 2·dist(q,q') ≤ dist(q,b) - dist(q,a), then the 2
NNs at q' are among the 4 NNs of the first query.
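A minimal sketch of the cache test for the Song and Roussopoulos approach follows. It assumes that a is the k-th and b the m-th nearest neighbor of the original query q (k = 2 and m = 4 in the slide's example); that reading of the symbols, and the function names, are assumptions made for illustration.

```python
import math

def knn_still_covered(q, q_new, cached, k):
    """cached holds the m > k NNs found at q, in ascending distance order.

    Returns True when 2*dist(q, q') <= dist(q, b) - dist(q, a), i.e. when the
    k NNs at q_new are guaranteed to be among the cached m points.
    """
    d_k = math.dist(q, cached[k - 1])   # distance to the k-th NN ("a")
    d_m = math.dist(q, cached[-1])      # distance to the m-th NN ("b")
    return 2 * math.dist(q, q_new) <= d_m - d_k

def knn_from_cache(q_new, cached, k):
    """Answer the query at q_new using only the cached points."""
    return sorted(cached, key=lambda p: math.dist(q_new, p))[:k]
```

The test is sufficient but not necessary: when it fails, the cached points may still contain the answer, but the client can no longer be sure and has to query the server again.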
8. Time-parameterized NN (our group, SIGMOD 02)
- Assuming a constant and known velocity, a TP NN
query returns:
- The current query result R
- The validity period T of R
- The change C of the result at the end of T
- Example result: R = {i}, T = 2, C = {j}
9. TP NN queries: influence time
- Some objects have infinite influence time.
- The object that will become the next nearest
neighbor is the one with the minimum influence
time.
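For intuition, the influence time has a closed form in the simplest setting. The sketch below assumes static data points and a query that starts at q and moves with constant velocity v; it treats the influence time of a point p as the moment p becomes as close to the query as the current NN. Setting |q + vt - p|² = |q + vt - nn|² gives a linear equation in t. The function names are hypothetical.

```python
import math

def sqdist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def influence_time(q, v, nn, p):
    """Time at which static point p ties with the current NN, or math.inf."""
    num = sqdist(q, p) - sqdist(q, nn)          # >= 0 because nn is the current NN
    den = 2 * ((p[0] - nn[0]) * v[0] + (p[1] - nn[1]) * v[1])
    if den <= 0:
        return math.inf                         # the query never gets closer to p than to nn
    return num / den

def tp_nn(q, v, points):
    """Return (R, T, C): current NN, validity period, and the point that replaces it."""
    nn = min(points, key=lambda p: sqdist(q, p))
    times = {p: influence_time(q, v, nn, p) for p in points if p != nn}
    nxt = min(times, key=times.get, default=None)
    T = times[nxt] if nxt is not None else math.inf
    return nn, T, (nxt if T < math.inf else None)
```

Points lying "behind" the direction of travel relative to the current NN get an infinite influence time, which matches the first bullet above.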
10. Processing TP NN with R- (TPR-) trees
- Influence time of an MBR: the earliest possible
time at which any object in the MBR can become the
new NN.
- Algorithm: traverse the R-tree depth-first
or best-first, using the influence time
instead of mindist.
- Cost of TP NN queries: about the same as that of
conventional queries, because we have to visit the
influencing nodes anyway (to find the NN).
11. Continuous Nearest Neighbors (CNN) (our group,
VLDB 02)
- Given a line segment q = [s, e], find the NN of
every point on q.
- Result representation: {s (.NN = a), s1 (.NN = c),
s2 (.NN = f), s3 (.NN = h), e}, i.e., a is the NN
for [s, s1], c for [s1, s2], and so on.
- The points (s, s1, s2, s3, e) are the split
points.
12. Main idea
- Maintain the set of split points incrementally.
[Figure: the split list after processing point a, and after processing point c]
13. Processing TP NN with an R- (TPR-) tree
- Avoid examination of all points.
- Given an MBR E and query segment q, E must be
searched if and only if there exists a split
point si ∈ SL (the current split list) such that
dist(si, si.NN) > mindist(si, E).
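The visiting condition translates almost directly into code. In this sketch the split list is assumed to be a list of (split_point, nn) pairs and an MBR is a tuple (xlo, ylo, xhi, yhi); both representations are choices made here for illustration.

```python
import math

def mindist(p, mbr):
    """Smallest Euclidean distance from point p to rectangle (xlo, ylo, xhi, yhi)."""
    xlo, ylo, xhi, yhi = mbr
    dx = max(xlo - p[0], 0.0, p[0] - xhi)
    dy = max(ylo - p[1], 0.0, p[1] - yhi)
    return math.hypot(dx, dy)

def must_visit(split_list, mbr):
    """True if some split point si has dist(si, si.NN) > mindist(si, E)."""
    return any(math.dist(s, nn) > mindist(s, mbr) for s, nn in split_list)
```

An MBR that is farther from every split point than that split point's current NN cannot contain a point that changes the result, so it can be skipped.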
14. Location-Based NN queries (LBNN) (our group,
SIGMOD 03)
- A location-based kNN query q returns:
- The current k NNs
- A validity region such that the result remains
the same as long as q remains in the region.
- For a single NN, the validity region of q is the
Voronoi cell (VC) of the NN o.
15. Computing the Voronoi cell on-the-fly
- Step 1: Find the current NN
- Step 2: Use time-parameterized (TP) NN queries to
tighten the validity region
16. NN queries in road networks (our group, VLDB 03)
- Find my nearest gas station in terms of driving
distance.
- Answer (in the figure's example): hotel b, although
the Euclidean NN is d
- Assumptions:
- We can incrementally compute Euclidean NNs using
conventional NN algorithms.
- We can compute the network distance between the
query and any point (i.e., the length of the
shortest path connecting them) using Dijkstra's
algorithm.
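The second assumption can be illustrated with a compact Dijkstra implementation. The adjacency-list format ({node: [(neighbor, edge_length), ...]}) and the function name are assumptions for this sketch, not part of the original slides.

```python
import heapq
import math

def network_distance(graph, source, target):
    """Length of the shortest path from source to target, or math.inf."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d                      # first time target is popped, d is final
        if d > dist.get(u, math.inf):
            continue                      # stale heap entry
        for w, length in graph.get(u, []):
            nd = d + length
            if nd < dist.get(w, math.inf):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return math.inf
```

Stopping as soon as the target is popped keeps the expansion local to the query, which matters when the network is much larger than the neighborhood of interest.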
17. Euclidean Restriction Algorithm
[Figure: the 1st and 2nd Euclidean NNs of the query on the road network]
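One way to read the Euclidean restriction idea is sketched below, under the assumption that every edge is at least as long as the straight line between its endpoints, so the Euclidean distance lower-bounds the network distance. Candidates are examined in ascending Euclidean order, and the search stops once the next Euclidean distance can no longer beat the best network distance found so far. For brevity the sketch sorts all candidates and runs Dijkstra once from the query; a real system would use an incremental NN search on the R-tree instead. All names are hypothetical.

```python
import heapq
import math

def network_distances(graph, source):
    """Dijkstra from source; graph is {node: [(neighbor, edge_length), ...]}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for w, length in graph[u]:
            if d + length < dist.get(w, math.inf):
                dist[w] = d + length
                heapq.heappush(heap, (d + length, w))
    return dist

def euclidean_restriction_nn(coords, graph, q, candidates):
    """Return (network NN, its network distance, number of candidates examined)."""
    net = network_distances(graph, q)
    by_euclid = sorted(candidates, key=lambda p: math.dist(coords[q], coords[p]))
    best, best_d, examined = None, math.inf, 0
    for p in by_euclid:
        if math.dist(coords[q], coords[p]) >= best_d:
            break                          # no remaining candidate can be closer
        examined += 1
        d = net.get(p, math.inf)
        if d < best_d:
            best, best_d = p, d
    return best, best_d, examined

def undirected(edges):
    """Helper: build an adjacency list from (u, v, length) triples."""
    graph = {}
    for u, v, length in edges:
        graph.setdefault(u, []).append((v, length))
        graph.setdefault(v, []).append((u, length))
    return graph
```

In the test scenario used with this sketch, the Euclidean NN is reachable only through a detour, so the second Euclidean NN wins on network distance and the far candidate is never examined.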
18. Network Expansion Algorithm
19. NN in the presence of obstacles (not published)
- The NN of q in terms of obstructed distance is b,
although the Euclidean NN is a.
20. Visibility graphs
- Have been used widely in computational geometry
for shortest path problems (e.g., find the
shortest path from p_start to p_end that does not
cross any obstacle).
- Problem: We cannot maintain the entire visibility
graph in memory for real spatial datasets.
- Solution: We only need the obstacles and objects
that affect the result of the query.
21. Obstacle nearest neighbor algorithm
- Idea: Similar to the Euclidean Restriction
algorithm for road networks.
- But how do we perform the obstructed distance
computations?
22. Obstructed distance computation
- Goal: compute the obstructed distance between p
and q.
- First, retrieve the obstacles o1, o2 in the
Euclidean range.
- Compute a provisional distance d1(p,q) using only
o1, o2.
- d1(p,q) is not enough because the shortest path
is obstructed by o3.
- Perform a second Euclidean range query on the
obstacle R-tree using d1(p,q) and retrieve o3,
o4.
- Compute a new obstructed distance d2(p,q) taking
o3, o4 into account.
- Repeat the process until the obstructed distance
remains the same for two consecutive iterations.
23. Other related work
- By our group: Concepts similar to the ones
presented here apply to several other spatial
queries, e.g., TP spatial joins and continuous
window queries.
- Cost models for TP and continuous queries (TODS
03).
- Analysis of predictive NN (and other) queries
(TODS, to appear).
- An Efficient Cost Model for Optimization of
Nearest Neighbor Search in Low and Medium
Dimensional Spaces (TKDE, to appear).
- By other groups: increasing interest in novel
types of NN search in the context of mobile
computing and data stream applications.
- Iwerks et al. (VLDB 03) discuss continuous NN in
the presence of object updates.
- Shekhar et al. (ACM GIS 03) discuss the in-route
nearest neighbor query, which, given a
trajectory, retrieves the single NN (e.g., gas
station) that results in the minimum diversion
from the trajectory.
- Jensen et al. (ACM GIS 03) discuss NN for objects
moving on road networks.
24. Group NN queries (our group, ICDE 04)
- Input: a set P = {p1, ..., pN} of static data points
in multidimensional space and a group of query
points Q = {q1, ..., qn}.
- Output: the k (≥ 1) data point(s) with the
smallest sum of distances to all points in Q. The
distance between a data point p and Q is defined
as dist(p,Q) = Σ_{i=1..n} |pqi|, where |pqi| is the
Euclidean distance between p and query point qi.
- Example: three users at locations q1, q2 and q3
want to find a meeting point (e.g., a
restaurant); the corresponding query returns the
data point p that minimizes the sum of Euclidean
distances |pqi| for 1 ≤ i ≤ 3.
- Assumption: the data points are indexed by an
R-tree. Q may or may not fit in main memory.
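As a reference point for the index-based methods, the definition can be evaluated by brute force. The sketch below ignores the R-tree entirely and scans P; the function names are chosen here for illustration.

```python
import heapq
import math

def group_dist(p, Q):
    """dist(p, Q): sum of Euclidean distances from p to every query point."""
    return sum(math.dist(p, q) for q in Q)

def gnn_bruteforce(P, Q, k=1):
    """The k data points with the smallest dist(p, Q), best first."""
    return heapq.nsmallest(k, P, key=lambda p: group_dist(p, Q))
```

This costs one distance computation per (data point, query point) pair, which is exactly the work that the pruning methods try to avoid.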
25. Multiple Query Method (MQM)
- Idea: Perform incremental NN queries for each
point in Q and combine their results.
- <p10, 7>, <p11, 6>, T = 5 (2 + 3)
- <p11, 7>
- T = 6 (3 + 3)
- MQM terminates
- Problem: MQM may visit the same node and discover
the same data point many times (for different
query points).
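A threshold-style sketch of MQM for k = 1 follows. It assumes that each query point qi keeps a threshold ti equal to the distance of its most recently retrieved NN, and that the search can stop once the sum T of the thresholds reaches best_dist, since any point not yet seen is at least ti away from every qi. Reading T in the trace above as a sum of per-query thresholds is an interpretation. The incremental NN streams are simulated with sorted lists rather than an R-tree.

```python
import math

def mqm(P, Q):
    """Return (best point, best_dist) for a group NN query with k = 1."""
    # One incremental NN stream per query point (simulated by sorting).
    streams = [iter(sorted(P, key=lambda p, q=q: math.dist(p, q))) for q in Q]
    thresholds = [0.0] * len(Q)
    best, best_dist = None, math.inf
    while True:
        for i, q in enumerate(Q):
            p = next(streams[i], None)
            if p is None:
                return best, best_dist      # stream exhausted: all points were seen
            thresholds[i] = math.dist(p, q)
            d = sum(math.dist(p, qj) for qj in Q)
            if d < best_dist:
                best, best_dist = p, d
            if sum(thresholds) >= best_dist:
                return best, best_dist      # no unseen point can do better
```

The same data point can be returned by several streams, so its global distance is recomputed each time; this is the redundancy noted in the last bullet.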
26. Minimum Bounding Method (MBM)
- Applies the MBR of Q to prune the search space.
- Heuristic 1: Let M be the MBR of Q, and best_dist
be the distance of the best GNN found so far. A
node N cannot contain qualifying points if
[condition not preserved in this transcript].
- Heuristic 2: A node N cannot contain qualifying
points if [condition not preserved in this transcript].
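The two pruning conditions did not survive in this transcript, so the sketch below does not claim to reproduce them. It uses two lower bounds that follow directly from the definition of dist(p,Q) and are assumptions here: every point p inside N satisfies dist(p,Q) ≥ n · mindist(N, M), and also dist(p,Q) ≥ Σ mindist(N, qi). A node can be discarded when either bound reaches best_dist. Rectangles are tuples (xlo, ylo, xhi, yhi) and the function names are hypothetical.

```python
import math

def mindist_point_rect(p, r):
    dx = max(r[0] - p[0], 0.0, p[0] - r[2])
    dy = max(r[1] - p[1], 0.0, p[1] - r[3])
    return math.hypot(dx, dy)

def mindist_rect_rect(a, b):
    dx = max(a[0] - b[2], 0.0, b[0] - a[2])
    dy = max(a[1] - b[3], 0.0, b[1] - a[3])
    return math.hypot(dx, dy)

def mbr_of(points):
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def prune_by_mbr(node, Q, best_dist):
    """Cheap test: one rectangle-to-rectangle distance, scaled by n = |Q|."""
    return len(Q) * mindist_rect_rect(node, mbr_of(Q)) >= best_dist

def prune_by_points(node, Q, best_dist):
    """Tighter but costlier test: one point-to-rectangle distance per query point."""
    return sum(mindist_point_rect(q, node) for q in Q) >= best_dist
```

A natural design is to try the cheap test first and fall back to the tighter one only for nodes that survive it.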
27. File Multiple Query Method (F-MQM)
- What happens if Q does not fit in memory?
- F-MQM sorts the query points according to their
Hilbert value and splits Q into blocks Q1, ...,
Qm that fit in memory.
- For each block, it computes the GNN using one of
the main-memory algorithms.
- It finally combines their results using MQM.
- Complication: once a NN of a group has been
retrieved, we cannot compute its global distance
(i.e., with respect to all query points)
immediately.
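The Hilbert-based blocking step can be sketched as follows. The sketch assumes query points with integer coordinates on a 2^order × 2^order grid (real coordinates would first need to be quantized) and uses the standard bit-manipulation conversion from grid cell to curve position. `hilbert_blocks` and its `block_size` parameter are names invented for this illustration.

```python
def hilbert_index(order, x, y):
    """Position of integer cell (x, y) on the Hilbert curve of a 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def hilbert_blocks(points, order, block_size):
    """Sort points by Hilbert value and cut the sequence into blocks."""
    ordered = sorted(points, key=lambda p: hilbert_index(order, p[0], p[1]))
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]
```

Because consecutive positions on the curve are adjacent cells, each block tends to cover a compact region, which keeps the per-block searches spatially focused.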
28. F-MQM (cont.)
- Solution: lazy evaluation
- First, we find the GNN p1 of the first group Q1.
- Then, we load into memory the second group Q2 and
retrieve its NN p2. At the same time, we also
compute the distance between p1 and Q2.
- Similarly, when we load Q3, we update the current
distances of p1 and p2, taking into account the
query points of the third group.
- After the end of the first round, we have only
one data point (p1) whose global distance with
respect to all query points has been computed.
29. File Minimum Bounding Method (F-MBM)
- First, the points of Q are sorted by their
Hilbert value and are assigned to groups (that
fit in memory) according to this order.
- For each group Qi, F-MBM keeps in memory its MBR
Mi and cardinality ni (but not its contents).
- F-MBM descends the R-tree of P (in depth-first or
best-first order), following only nodes that
may contain qualifying points.
- Heuristic: Let best_dist be the distance of the
best GNN found so far. A node N can be safely
pruned if [condition not preserved in this transcript].
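The pruning condition itself is missing from this transcript. A bound that can be computed from exactly the information F-MBM keeps in memory (each group's MBR Mi and cardinality ni) is Σ ni · mindist(N, Mi), because every query point of group Qi lies inside Mi. The sketch below prunes a node when that bound reaches best_dist. It is offered as an assumption consistent with the definitions, not as the slide's formula, and the function names are hypothetical.

```python
import math

def mindist_rect_rect(a, b):
    """Smallest distance between rectangles given as (xlo, ylo, xhi, yhi)."""
    dx = max(a[0] - b[2], 0.0, b[0] - a[2])
    dy = max(a[1] - b[3], 0.0, b[1] - a[3])
    return math.hypot(dx, dy)

def summarize(block):
    """What stays in memory for one group: (MBR Mi, cardinality ni)."""
    xs, ys = zip(*block)
    return (min(xs), min(ys), max(xs), max(ys)), len(block)

def group_lower_bound(node, summaries):
    """Lower bound on dist(p, Q) for any point p inside the node's MBR."""
    return sum(n_i * mindist_rect_rect(node, m_i) for m_i, n_i in summaries)

def fmbm_can_prune(node, summaries, best_dist):
    return group_lower_bound(node, summaries) >= best_dist
```

Nodes that survive this test still need the actual query points of each group, loaded block by block, before the exact distance of a candidate is known.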