A Spatial Index Structure for High Dimensional Point Data

Provided by: WeiW8
Learn more at: http://www.cs.unc.edu

1
A Spatial Index Structure for High Dimensional
Point Data
PK-tree
  • Wei Wang, Jiong Yang, and Richard Muntz
  • Data Mining Lab
  • Department of Computer Science
  • University of California, Los Angeles

2
Outline
  • Introduction
  • Structure of PK-tree
  • Operations on PK-tree
  • Performance
  • Conclusions

3
Introduction
  • Dynamic spatial index methods have been an
    active research area.
  • Index structures based on spatial decomposition
      • PR-Quad tree, K-D tree, K-D-B tree, ...
      • No overlap among sibling nodes
      • Achieving high disk page utilization for
        high dimensionality with skewed data
        distributions remains a challenge.
  • R-tree family of index structures
      • R-tree, SR-tree, X-tree, ...
      • Overlap among sibling nodes increases with
        dimensionality and degrades performance
        severely.

4
Introduction
  • PK-tree
      • Spatial decomposition
          • No overlap among sibling nodes
      • Bound on height
      • Bounds on number of children
      • Uniqueness for any data set
          • Independent of the order of insertion
            and deletion
      • Solid theoretical foundation
      • Fast retrieval and updates

5
Structure of PK-tree
  • Recursive rectilinear division of space

Set notation (e.g., $\in$, $\subset$, $\cup$) is used
to express relationships among cells.

6
Structure of PK-tree
  • Space is recursively divided until a level $L_D$
    such that each cell contains at most one point.
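
As a minimal sketch of this stopping rule, the snippet below finds the first level at which every cell holds at most one point. It assumes, for illustration only, that the index space is the unit hypercube and that each level halves every dimension; the names `cell_of` and `separating_level` are hypothetical and do not come from the slides.

```python
from collections import Counter


def cell_of(point, level):
    """Integer grid coordinates of the level-`level` cell containing `point`.

    Assumption: the index space is [0, 1)^d and each level halves every
    dimension, so level l has 2**l cells per dimension.
    """
    side = 1 << level  # number of cells per dimension at this level
    # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
    return tuple(min(int(x * side), side - 1) for x in point)


def separating_level(points, max_level=32):
    """Smallest level at which no cell contains more than one point."""
    for level in range(max_level + 1):
        counts = Counter(cell_of(p, level) for p in points)
        if all(n <= 1 for n in counts.values()):
            return level
    raise ValueError("points coincide or are closer than the finest grid")


points = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.9)]
level_d = separating_level(points)
print(level_d, [cell_of(p, level_d) for p in points])
```

The two nearby points share a cell until the grid is fine enough to separate them, which is why clustered data pushes the bottom level deeper than uniform data does.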

7
Structure of PK-tree
  • Point cell: a non-empty cell at level $L_D$
  • A cell $C$ is K-instantiable iff
      • $C$ is a point cell, or
      • there do not exist $K-1$ or fewer
        K-instantiable sub-cells
        $C_1, \ldots, C_{K-1} \subset C$ such that
        $\forall d \in D \; (d \in C \Rightarrow d \in \bigcup_{i=1}^{K-1} C_i)$.
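
The recursive definition can be evaluated bottom-up. The sketch below is one possible reading, not the authors' implementation: because cells are either nested or disjoint, the smallest cover of a cell's points by K-instantiable proper sub-cells consists of the maximal such sub-cells, so a cell is K-instantiable exactly when at least K of them are needed. The unit-hypercube grid that halves every dimension per level, and the names `cell_of` and `k_instantiable_cells`, are assumptions made for illustration.

```python
from collections import defaultdict


def cell_of(point, level):
    """Grid coordinates of the level-`level` cell (assumes [0, 1)^d, halving)."""
    side = 1 << level
    return tuple(min(int(x * side), side - 1) for x in point)


def k_instantiable_cells(points, k, leaf_level):
    """Return the K-instantiable cells as a set of (level, coords) pairs.

    Assumes `leaf_level` is deep enough that each cell holds <= 1 point.
    """
    instantiable = set()

    def maximal_cells(pts, level, coords):
        # Returns the maximal K-instantiable cells within this non-empty
        # cell (the cell itself, if it turns out to be K-instantiable).
        if level == leaf_level:  # non-empty cell at the bottom level
            instantiable.add((level, coords))
            return [(level, coords)]
        groups = defaultdict(list)
        for p in pts:
            groups[cell_of(p, level + 1)].append(p)
        cover = []
        for sub, sub_pts in groups.items():
            cover.extend(maximal_cells(sub_pts, level + 1, sub))
        # `cover` is the smallest set of K-instantiable proper sub-cells
        # that contains every point of this cell.
        if len(cover) >= k:  # no cover with K-1 or fewer sub-cells exists
            instantiable.add((level, coords))
            return [(level, coords)]
        return cover

    if points:
        maximal_cells(list(points), 0, (0,) * len(points[0]))
    return instantiable


pts = [(0.1, 0.1), (0.3, 0.1), (0.1, 0.3), (0.9, 0.9)]
rank3 = k_instantiable_cells(pts, k=3, leaf_level=2)
rank2 = k_instantiable_cells(pts, k=2, leaf_level=2)
print(sorted(rank3))
print(sorted(rank2 - rank3))
```

In this toy data set the level-1 cell holding three points is 3-instantiable, while the level-1 cell holding the lone point is not, since its single point cell already covers it.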

8
Structure of PK-tree
[Figure: example of a PK-tree of rank 3]

9
Structure of PK-tree
  • Given a finite set of points $D$ over index space
    $C_0$ and dividing ratio $R$, a PK-tree of rank $K$
    ($K > 1$) is defined as follows.
      • The cell at level 0 ($C_0$) is always
        instantiated and serves as the root of the
        PK-tree.
      • Every other node in the PK-tree is mapped
        one-to-one to a K-instantiable cell.
      • For any two nodes $C_1$ and $C_2$ in the
        PK-tree, $C_1$ is a child of $C_2$ (or $C_2$ is
        the parent of $C_1$) iff
          • $C_1$ is a proper sub-cell of $C_2$, i.e.,
            $C_1 \subset C_2$, and
          • there does not exist $C_3$ in the PK-tree
            such that $C_1 \subset C_3$ and
            $C_3 \subset C_2$.
  • Properties: existence and uniqueness, bounds on
    node outdegree, bounded storage space, bounds on
    expected height, no overlap among sibling
    nodes, and so on.
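
The parent rule amounts to linking each node to its nearest instantiated proper ancestor cell. The sketch below illustrates that rule under stated assumptions: cells are encoded as `(level, coords)` pairs on a grid that halves every dimension per level, the set of K-instantiable cells is supplied by hand, and `build_pk_tree` is a hypothetical helper name.

```python
def build_pk_tree(instantiable, dims):
    """Link K-instantiable cells into a tree; returns (root, children map).

    Assumption: a cell is (level, coords) and its ancestor at a coarser
    level is obtained by right-shifting each integer coordinate.
    """
    root = (0, (0,) * dims)          # the level-0 cell is always a node
    nodes = set(instantiable) | {root}
    children = {node: [] for node in nodes}
    for level, coords in nodes:
        # Walk upward; the first ancestor that is a node is the parent,
        # so no other node lies strictly between child and parent.
        for up in range(level - 1, -1, -1):
            shift = level - up
            ancestor = (up, tuple(c >> shift for c in coords))
            if ancestor in nodes:
                children[ancestor].append((level, coords))
                break
    return root, children


# Hand-picked 3-instantiable cells for four points on a 4 x 4 grid:
# three point cells clustered in one quadrant plus one isolated point cell.
cells = {(1, (0, 0)), (2, (0, 0)), (2, (1, 0)), (2, (0, 1)), (2, (3, 3))}
root, children = build_pk_tree(cells, dims=2)
for node in sorted(children):
    print(node, "->", sorted(children[node]))
```

The isolated point cell attaches directly to the root because its level-1 ancestor is not a node, which is how the structure skips sparse intermediate cells. Since the result depends only on the set of cells, it does not depend on insertion order.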

10
Properties of PK-tree
Expected Height of a PK-tree

11
Properties of PK-tree
  • M-Level Clustering Spatial Distribution
      • 0-level: uniform distribution over $C_0$,
        $P(d \in C_{i+1} \mid d \in C_i) = 1/r$
      • 1-level: let $A \subset C_0$ be some subset of
        $C_0$ and $A^c = C_0 - A$. The distributions of
        points in $A$ and $A^c$ are each 0-level
        clustering spatial distributions.

12
Operations on PK-tree
  • Pagination of the PK-tree
      • Pick the parameter $K$ and the number of
        dimensions to split at each level such that
        the maximum node size is close to the page
        size.
      • Allocate one node to a page.
      • Space utilization can be guaranteed to be at
        least 50% and is much more than 50% in
        experiments.
  • Insertion
      • First follow the path from the root to locate
        all (potential) ancestors of the inserted
        leaf cell.
      • Then walk from the leaf level back to the
        root along the same path, making all
        necessary changes (e.g., instantiating or
        de-instantiating cells).
  • Search
      • K Nearest Neighbor Query
      • Range Query
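
The slides do not spell out the search procedures. As a rough sketch of how a range query can exploit non-overlapping sibling cells, the code below descends only into children whose cell intersects the query box. The `Node` layout, the half-open box convention, and the function names are all assumptions made for illustration, not the authors' design.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    lo: tuple                      # lower corner of the cell (inclusive)
    hi: tuple                      # upper corner of the cell (exclusive)
    children: list = field(default_factory=list)
    point: Optional[tuple] = None  # set only for point cells


def intersects(node, qlo, qhi):
    """True if the node's cell overlaps the half-open box [qlo, qhi)."""
    return all(nl < qh and ql < nh
               for nl, nh, ql, qh in zip(node.lo, node.hi, qlo, qhi))


def range_query(node, qlo, qhi):
    """Collect all stored points p with qlo <= p < qhi in every dimension."""
    if not intersects(node, qlo, qhi):
        return []                  # prune: nothing below can match
    if node.point is not None:
        p = node.point
        inside = all(l <= x < h for l, x, h in zip(qlo, p, qhi))
        return [p] if inside else []
    found = []
    for child in node.children:    # sibling cells are disjoint
        found.extend(range_query(child, qlo, qhi))
    return found


# Small hand-built tree: one dense quadrant plus one isolated point.
a = Node((0.0, 0.0), (0.25, 0.25), point=(0.1, 0.1))
b = Node((0.25, 0.0), (0.5, 0.25), point=(0.3, 0.1))
c = Node((0.0, 0.25), (0.25, 0.5), point=(0.1, 0.3))
d = Node((0.75, 0.75), (1.0, 1.0), point=(0.9, 0.9))
dense = Node((0.0, 0.0), (0.5, 0.5), children=[a, b, c])
tree = Node((0.0, 0.0), (1.0, 1.0), children=[dense, d])

print(sorted(range_query(tree, (0.0, 0.0), (0.2, 0.5))))
print(range_query(tree, (0.6, 0.6), (1.0, 1.0)))
```

Because sibling cells never overlap, each stored point is reachable along exactly one root-to-leaf path, so no result needs deduplication.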

13
Performance
  • Setup: Sparc 10 workstation (SunOS 5.5) with 208
    MB main memory and a local disk with 9 GB
    capacity
  • Synthetic data sets (each contains 100,000
    points)
      • u: uniform distribution
      • c1, c2: 20% of the data are uniformly
        distributed and 80% of the data are
        distributed in disjoint clusters
  • Height of generated trees

14
Performance
  • Size of index in MB with 100,000 points

15
Performance
  • Range query on uniform data distribution

16
Performance
  • Range query on clustered data distribution

17
Performance
  • KNN query on uniform data distribution

18
Performance
  • KNN query on clustered data distribution

19
Performance
  • Real data set: NASA Sky Telescope Data
      • 200,000 two-dimensional points (the
        coordinates of crater locations on the
        surface of Mars)

20
Conclusions
  • The PK-tree employs spatial decomposition to
    ensure no overlap among sibling nodes while
    avoiding the large number of nodes that usually
    results from a skewed spatial distribution of
    objects.
  • The total number of nodes in a PK-tree is $O(N)$
    and the expected height of a PK-tree is
    $O(\log N)$ under some general conditions.
  • Other properties: uniqueness, bounds on the
    number of children.
  • Empirical studies show that the PK-tree
    outperforms the SR-tree and X-tree by a wide
    margin.