A Spatial Index Structure for High Dimensional Point Data - PowerPoint PPT Presentation

Provided by: WeiW8

Learn more at: http://www.cs.unc.edu


1
A Spatial Index Structure for High Dimensional
Point Data
PK tree

Wei Wang, Jiong Yang, and Richard Muntz
Data Mining Lab
Department of Computer Science
University of California, Los Angeles

2
Outline

Introduction
Structure of PK-tree
Operations on PK-tree
Performance
Conclusions

3
Introduction

Dynamic spatial indexing has been an active
research area.
Index structures based on spatial decomposition:
PR-Quad tree, K-D tree, K-D-B tree, ...
No overlap among sibling nodes.
Achieving high disk page utilization at high
dimensionality with skewed data distributions
remains a challenge.
The R-tree family of index structures:
R-tree, SR-tree, X-tree, ...
Overlap among sibling nodes grows with
dimensionality and degrades performance severely.

4
Introduction

PK-tree
Spatial decomposition
no overlapping among sibling nodes
Bound on height
Bounds on number of children
Uniqueness for any data set
independent of order of insertion and deletion
Solid theoretical foundation
Fast retrieval and updates

5
Structure of PK-tree

Recursive rectilinear division of space

Set notation (e.g., ∈, ⊂, ∪) is used
to express relationships among cells.
6
Structure of PK-tree

Space is recursively divided until a level LD
such that each cell contains at most one point.
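As an illustration of the recursive division, the following minimal sketch computes which cell a point falls into at a given level and finds the first level at which every cell holds at most one point. It assumes the index space is the unit hypercube and that every axis is halved at each level; the slides do not fix either choice, and the names `cell_of` and `separating_level` are hypothetical.

```python
def cell_of(point, level):
    """Grid coordinates of the level-`level` cell containing `point`.

    Assumes the index space is the unit hypercube [0, 1)^d and that every
    axis is halved at each level (both are assumptions of this sketch).
    """
    return tuple(int(x * 2 ** level) for x in point)


def separating_level(points, max_level=32):
    """Smallest level at which every cell holds at most one point."""
    for level in range(max_level + 1):
        cells = {cell_of(p, level) for p in points}
        if len(cells) == len(points):
            return level
    raise ValueError("points are not separated within max_level (duplicates?)")


points = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.9)]
print(cell_of((0.3, 0.8), 2))    # (1, 3)
print(separating_level(points))  # 3
```

The `max_level` cap is only there so that duplicate points raise an error instead of looping forever.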

7
Structure of PK-tree

Point cell: a non-empty cell at level LD.
A cell C is K-instantiable iff
C is a point cell, or
there do not exist K-1 or fewer K-instantiable
sub-cells C_1, ..., C_{K-1} ⊂ C such that
∀d ∈ D (d ∈ C ⇒ d ∈ ∪_{i=1}^{K-1} C_i).
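One way to read the definition above is bottom-up: a cell is K-instantiable exactly when the outermost K-instantiable cells strictly inside it number at least K, because fewer than K of them would already cover all of its points. The sketch below follows that reading. It assumes points lie in the unit hypercube, every axis is halved at each level, and `leaf_level` is deep enough to separate all points; the function name and cell encoding `(level, coords)` are hypothetical.

```python
from collections import defaultdict


def k_instantiable_cells(points, K, leaf_level):
    """Return the set of K-instantiable cells as (level, coords) pairs."""
    # frontier maps each non-empty cell at the current level to the outermost
    # K-instantiable cells inside it (the cell itself if it is instantiable).
    frontier = {}
    for p in points:
        coords = tuple(int(x * 2 ** leaf_level) for x in p)
        frontier[coords] = [(leaf_level, coords)]  # point cells always qualify
    instantiable = {cells[0] for cells in frontier.values()}

    for level in range(leaf_level - 1, -1, -1):
        merged = defaultdict(list)
        for coords, cells in frontier.items():
            parent = tuple(c // 2 for c in coords)  # enclosing cell one level up
            merged[parent].extend(cells)
        frontier = {}
        for coords, cells in merged.items():
            if len(cells) >= K:
                # No K-1 or fewer instantiable sub-cells cover the points.
                instantiable.add((level, coords))
                frontier[coords] = [(level, coords)]
            else:
                frontier[coords] = cells
    return instantiable


pts = [(0.05, 0.05), (0.20, 0.05), (0.05, 0.20), (0.9, 0.9)]
cells = k_instantiable_cells(pts, K=3, leaf_level=3)
print(sorted(cells))
```

With these four points and K = 3, the only instantiable cell above the leaf level is the level-2 cell holding the three clustered points; the isolated point does not force any intermediate cell into existence.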

8
Structure of PK-tree
Example of a PK-tree of rank 3
9
Structure of PK-tree

Given a finite set of points D over index space
C0 and dividing ratio R, a PK-tree of rank K
(K > 1) is defined as follows.
The cell at level 0 (C0) is always instantiated
and serves as the root of the PK-tree.
Every other node in the PK-tree is mapped
one-to-one to a K-instantiable cell.
For any two nodes C1 and C2 in the PK-tree, C1 is
a child of C2 (or C2 is the parent of C1) iff
C1 is a proper sub-cell of C2, i.e., C1 ⊂ C2, and
there does not exist C3 in the PK-tree such that
C1 ⊂ C3 and C3 ⊂ C2.
Properties: existence and uniqueness, bounds on
node outdegree, bounded storage space, bounds on
expected height, no overlap among sibling
nodes, and so on.
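The definition can be turned into a small bulk-construction sketch that returns each node together with its children. This is not the authors' implementation: it assumes the unit hypercube, halving of every axis per level, a `leaf_level` of at least 1 that separates all points, and a hypothetical `(level, coords)` encoding of cells. Because the tree is derived from the point set alone, feeding the points in a different order yields the same result, which matches the uniqueness property.

```python
from collections import defaultdict


def build_pk_tree(points, K, leaf_level):
    """Return {node: sorted list of children}; nodes are (level, coords)."""
    # pending maps each non-empty cell at the current level to the tree nodes
    # inside it that do not have a parent yet.
    pending = {}
    for p in points:
        coords = tuple(int(x * 2 ** leaf_level) for x in p)
        pending[coords] = [(leaf_level, coords)]

    children = {}
    for level in range(leaf_level - 1, -1, -1):
        merged = defaultdict(list)
        for coords, cells in pending.items():
            merged[tuple(c // 2 for c in coords)].extend(cells)
        pending = {}
        for coords, cells in merged.items():
            # A cell becomes a node if it is K-instantiable; the root
            # (level 0) is always instantiated.
            if len(cells) >= K or level == 0:
                children[(level, coords)] = sorted(cells)
                pending[coords] = [(level, coords)]
            else:
                pending[coords] = cells
    return children


pts = [(0.05, 0.05), (0.20, 0.05), (0.05, 0.20), (0.9, 0.9)]
tree = build_pk_tree(pts, K=3, leaf_level=3)
for node, kids in sorted(tree.items()):
    print(node, "->", kids)
```

In this example the root has two children: the level-2 cell containing the three clustered points and the point cell of the isolated point, which attaches directly to the root because no intermediate cell is instantiable.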

10
Properties of PK-tree
Expected Height of a PK-tree
11
Properties of PK-tree

M-Level Clustering Spatial Distribution
0-level: uniform distribution over C0, i.e.,
P(d ∈ C_{i+1} | d ∈ C_i) = 1/r
1-level: let A ⊂ C0 be some subset of C0 and
A^c = C0 - A. The distributions for points in A
and A^c are each 0-level clustering spatial
distributions.
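To make the two lowest levels concrete, here is a minimal sampling sketch. It assumes C0 is the unit hypercube, takes A to be an axis-aligned box that is a proper subset of C0, and introduces a mixing weight `p_a` (the probability that a point lands in A); the box shape, the weight, and all function names are assumptions of this sketch rather than part of the slides.

```python
import random


def in_box(p, lo, hi):
    """True if point p lies in the half-open box [lo, hi)."""
    return all(l <= x < h for x, l, h in zip(p, lo, hi))


def sample_0_level(n, dim, rng):
    """0-level: n points uniformly distributed over the unit hypercube C0."""
    return [tuple(rng.random() for _ in range(dim)) for _ in range(n)]


def sample_1_level(n, lo, hi, p_a, rng):
    """1-level: uniform inside box A with probability p_a, else uniform in C0 - A."""
    dim = len(lo)
    points = []
    for _ in range(n):
        if rng.random() < p_a:
            points.append(tuple(rng.uniform(l, h) for l, h in zip(lo, hi)))
        else:
            # Rejection sampling: redraw until the point falls outside A.
            q = tuple(rng.random() for _ in range(dim))
            while in_box(q, lo, hi):
                q = tuple(rng.random() for _ in range(dim))
            points.append(q)
    return points


rng = random.Random(0)
uniform_pts = sample_0_level(1000, 2, rng)
clustered_pts = sample_1_level(1000, (0.0, 0.0), (0.25, 0.25), 0.8, rng)
share = sum(in_box(p, (0.0, 0.0), (0.25, 0.25)) for p in clustered_pts) / 1000
print(f"share of points inside A: {share:.2f}")
```

Higher levels would repeat the same construction inside A and its complement.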

12
Operations on PK-tree

Pagination of the PK-tree
Pick the parameter K and the number of dimensions
to split at each level such that the maximum node
size is close to the page size.
Allocate one node to a page.
Space utilization can be guaranteed to be at
least 50% and is much higher than 50% in
experiments.
Insertion
First follow the path from the root to locate all
(potential) ancestors of the inserted leaf cell.
Then walk from the leaf level back to the root
along the same path, making all necessary changes
(e.g., instantiating or de-instantiating cells).
Search
K Nearest Neighbor Query
Range Query
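The first phase of insertion, locating the potential ancestors of the new leaf cell, can be sketched as follows. The sketch assumes the unit hypercube with every axis halved per level and a hypothetical `(level, coords)` cell encoding; `nodes` stands in for the set of cells currently instantiated in the tree. The second phase (instantiating or de-instantiating cells on the way back up) is not shown.

```python
def ancestor_path(point, leaf_level):
    """Cells containing `point`, from the root (level 0) down to the leaf level."""
    leaf = tuple(int(x * 2 ** leaf_level) for x in point)
    return [
        # Dropping the low-order bits of a leaf coordinate gives the
        # coordinate of the enclosing cell at a coarser level.
        (level, tuple(c >> (leaf_level - level) for c in leaf))
        for level in range(leaf_level + 1)
    ]


def instantiated_ancestors(point, leaf_level, nodes):
    """Potential ancestors of the leaf cell that currently exist in the tree."""
    return [c for c in ancestor_path(point, leaf_level)[:-1] if c in nodes]


# Hypothetical tree state: only the root and one level-2 cell are instantiated.
nodes = {(0, (0, 0)), (2, (0, 0))}
print(ancestor_path((0.20, 0.05), 3))
print(instantiated_ancestors((0.20, 0.05), 3, nodes))
```

Only the cells on this single root-to-leaf path can change status when a point is inserted, which is why the update can be confined to one downward and one upward pass.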

13
Performance

Setup: Sparc 10 workstation (SunOS 5.5) with 208
MB main memory and a local disk with 9 GB capacity
Synthetic data sets (each contains 100,000
points)
u: uniform distribution
c1, c2: 20% of the data are uniformly distributed
and 80% are distributed in disjoint clusters
Height of generated trees

14
Performance

Size of index in MB with 100,000 points

15
Performance

Range query on uniform data distribution

16
Performance

Range query on clustered data distribution

17
Performance

KNN query on uniform data distribution

18
Performance

KNN query on clustered data distribution

19
Performance

Real data set: NASA Sky Telescope Data
200,000 two-dimensional points (the coordinates
of crater locations on the surface of Mars)

20
Conclusions

The PK-tree employs spatial decomposition to
ensure no overlap among sibling nodes while
avoiding the large number of nodes that usually
results from a skewed spatial distribution of
objects.
The total number of nodes in a PK-tree is O(N)
and the expected height of a PK-tree is O(log N)
under some general conditions.
Other properties: uniqueness, bounds on number of
children.
Empirical studies show that the PK-tree
outperforms the SR-tree and X-tree by a wide margin.