Metric based KNN indexing (presentation transcript)
1
Metric based KNN indexing
  • Lecturer: Prof Ooi Beng Chin
  • Presenters: Frankie Chan (HT00-3550Y) and
    Tan Zhenqiang (HT01-6163J)

2
Outline
  • Introduction
  • Examples
  • SR-Tree
  • MVP-Tree
  • Future work
  • Conclusion
  • References

3
Introduction
  • Metric-based queries consider the relative
    distance of an object/point from a given query point
  • The most commonly used metric is the Euclidean
    distance
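  • In symbols, for D-dimensional points p and q:

      d(p, q) = \sqrt{ \sum_{i=1}^{D} (p_i - q_i)^2 }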

4
Metric based queries
  • Variations
  • Used with join queries: find the 3 closest
    restaurants for each of 2 different theaters
  • Used with spatial queries: find the KNN to the
    east of a location

5
SR-Tree
  • The SR-tree: An Indexing Structure for
    High-Dimensional Nearest Neighbor Queries
  • Norio Katayama (Multimedia Info. Research Div.)
    and Shinichi Satoh (Software Research Div.),
    National Institute of Informatics

6
SR-Tree Introduction
  • SR-tree stands for Sphere/Rectangle-Tree
  • SR-Tree is an extension of the R-tree and the
    SS-tree
  • A region of the SR-tree is specified by the
    intersection of a bounding sphere and a bounding
    rectangle
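  • As an illustration (our own code, not taken from the paper), a
    minimal sketch of a lower bound on the distance from a query point
    to such a region, taken as the larger of the distance to the
    bounding sphere and the distance to the bounding rectangle:

      import math

      def dist_to_sphere(q, center, radius):
          # Distance from query q to a bounding sphere (0 if q is inside it).
          return max(0.0, math.dist(q, center) - radius)

      def dist_to_rect(q, low, high):
          # Distance from query q to an axis-aligned bounding rectangle
          # (0 if q is inside it).
          return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                               for x, l, h in zip(q, low, high)))

      def dist_to_sr_region(q, center, radius, low, high):
          # The SR-tree region is the intersection of the sphere and the
          # rectangle, so the larger of the two distances is a valid
          # lower bound on the distance to the region itself.
          return max(dist_to_sphere(q, center, radius),
                     dist_to_rect(q, low, high))

  • Such a lower bound is what a nearest-neighbour search can use to
    decide whether a region may still contain a closer point.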

7
The R-Tree Structure
8
The SS-Tree Structure
9
Definitions (SR-Tree)
  • The diameter of a bounded region means:
  • the diameter of the bounding sphere for the SS-Tree
  • the diagonal of the bounding rectangle for the R-Tree

10
Properties
  • Bounding rectangles divide points into
    smaller-volume regions, but tend to have longer
    diameters than bounding spheres, especially in
    high-dimensional space.
  • Bounding spheres divide points into
    short-diameter regions, but tend to have larger
    volumes than bounding rectangles.

11
Properties
  • The SR-Tree combines the use of a bounding sphere
    and a bounding rectangle, as their properties are
    complementary to each other.

12
The SR-Tree Structure
13
Bounded regions
14
Indexing Structure
  • The structure of the leaf L
  • The structure of the node N
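  • As a sketch of these structures (field names are our own), each
    internal-node entry summarises one child by its centroid, weight
    (number of points), bounding-sphere radius, bounding rectangle and
    child pointer, while a leaf simply stores data points:

      from dataclasses import dataclass
      from typing import List

      @dataclass
      class NodeEntry:
          # One entry of an internal SR-tree node: a summary of one child.
          center: List[float]     # centroid of the points under the child
          weight: int             # number of data points under the child
          radius: float           # radius of the child's bounding sphere
          rect_low: List[float]   # lower corner of the child's bounding rectangle
          rect_high: List[float]  # upper corner of the child's bounding rectangle
          child: object           # the child node or leaf

      @dataclass
      class Leaf:
          points: List[List[float]]  # data points stored in the leaf

      @dataclass
      class Node:
          entries: List[NodeEntry]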

15
Insertion Algorithm
  • The SR-tree insertion algorithm is based on the
    SS-tree's centroid-based algorithm: descend the
    tree and choose the subtree whose centroid is
    nearest to the new entry
  • The SR-tree algorithm updates both the bounding
    spheres (differently from the SS-tree) and the
    bounding rectangles (in the same way as the
    R-tree); see the sketch below
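  • A minimal sketch of these two pieces, reusing the Node/NodeEntry
    layout sketched earlier (our own illustration, not the paper's code):

      import math

      def choose_subtree(node, point):
          # Descend by picking the child whose centroid is nearest to
          # the new point (the SS-tree style, centroid-based choice).
          return min(node.entries, key=lambda e: math.dist(e.center, point))

      def enlarge_rect(entry, point):
          # Grow the child's bounding rectangle to cover the new point,
          # as in the R-tree; the bounding sphere is recomputed separately.
          entry.rect_low = [min(l, x) for l, x in zip(entry.rect_low, point)]
          entry.rect_high = [max(h, x) for h, x in zip(entry.rect_high, point)]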

16
Insertion Algorithm
  • Bounding sphere computation:
  • Center: X = (X1, X2, ..., XD)
  • Radius: r
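  • The gist of the computation in the SR-tree paper (restated here in
    our own notation): the parent centre is the centroid of the child
    centres x_i weighted by the number of points w_i under each child,
    and the radius is bounded by both the child spheres and the child
    bounding rectangles,

      X = \frac{\sum_i w_i x_i}{\sum_i w_i}

      r = \max_i \min\big( d(X, x_i) + r_i, \; d_{max}(X, R_i) \big)

  • where r_i and R_i are the radius and the bounding rectangle of
    child i, and d_{max}(X, R_i) is the maximum distance from X to R_i.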

17
Deletion Algorithm
  • The SR-tree deletion algorithm is similar to
    that of the R-tree
  • If deleting an entry does not cause leaf/node
    under-utilisation, then just remove it
  • Otherwise, remove the under-utilised leaf/node
    and reinsert all orphaned entries
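  • A minimal sketch of this flow (the threshold and the
    detach_leaf/insert helpers are hypothetical, named only for
    illustration):

      MIN_ENTRIES = 2  # hypothetical under-utilisation threshold

      def delete_point(tree, leaf, point):
          # Remove the point; if the leaf under-flows, detach it and
          # reinsert the surviving entries via the normal insertion path.
          leaf.points.remove(point)
          if len(leaf.points) < MIN_ENTRIES:
              orphans = list(leaf.points)
              detach_leaf(tree, leaf)   # hypothetical helper
              for p in orphans:
                  insert(tree, p)       # hypothetical helper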

18
Nearest Neighbour Search
  • Algorithm: ordered depth-first search
  • It finds a number of points nearest to the
    query to make a candidate set
  • Then it revises the candidate set whenever it
    visits a leaf whose region overlaps the
    range of the candidate set
  • After it has visited all leaves, the final
    candidate set is
  • the search result

19
Definition (NN search)
  • Minimum Distance (MINDIST): the Euclidean distance
    from the query point to the bounded region
  • Minimax Distance (MINMAXDIST): the minimum value of
    all the maximum distances between the query point
    and points on each of the n axes respectively
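  • In symbols (following Roussopoulos et al.), for a query point
    P = (p_1, ..., p_n) and a rectangle R with lower corner
    (s_1, ..., s_n) and upper corner (t_1, ..., t_n), using squared
    distances:

      MINDIST(P, R) = \sum_{i=1}^{n} (p_i - r_i)^2,  where
        r_i = s_i if p_i < s_i;  r_i = t_i if p_i > t_i;  r_i = p_i otherwise

      MINMAXDIST(P, R) = \min_{1 \le k \le n}
        \Big( (p_k - rm_k)^2 + \sum_{i \ne k} (p_i - rM_i)^2 \Big),  where
        rm_k = s_k if p_k \le (s_k + t_k)/2, else t_k
        rM_i = s_i if p_i \ge (s_i + t_i)/2, else t_i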

20
Definition (NN search)
21
Search pruning
  • A region R1 whose MINDIST is greater than the
    MINMAXDIST of another region R2 is discarded,
    because it cannot contain the NN (downward pruning)
  • An object O whose actual distance from the query
    point P is greater than the MINMAXDIST of some
    region is discarded (upward pruning)
  • A region whose MINDIST is greater than the actual
    distance from the query point P to an object O is
    discarded (upward pruning)

22
Recursive procedure (leaf node)
    if Node.type = LEAF then
      for i = 1 to Node.count
        dist = objectDist(Pt, Node.branch[i].region)
        if dist < Nearest.dist then
          Nearest.dist = dist
          Nearest.region = Node.branch[i].region

23
Recursive procedure (non-leaf node)
    else  /* non-leaf node: order, sort, prune, then visit nodes */
      genBranchList(Pt, Node, branchList)
      sortBranchList(branchList)
      /* perform downward pruning */
      last = pruneBranchList(Node, Pt, Nearest, branchList)
      for i = 1 to last
        newNode = Node.branch[branchList[i]]
        nearestNeighborSearch(newNode, Pt, Nearest)
        /* perform upward pruning */
        last = pruneBranchList(Node, Pt, Nearest, branchList)
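  • A small Python sketch (our own illustration) of MINDIST, MINMAXDIST
    and the pruning they enable; rectangle corners follow the
    rect_low/rect_high fields sketched earlier, and all values are
    squared distances:

      import math

      def mindist(q, low, high):
          # Squared MINDIST from q to the rectangle [low, high].
          return sum(max(l - x, 0.0, x - h) ** 2
                     for x, l, h in zip(q, low, high))

      def minmaxdist(q, low, high):
          # Squared MINMAXDIST from q to the rectangle [low, high].
          far = [(x - (l if x >= (l + h) / 2 else h)) ** 2
                 for x, l, h in zip(q, low, high)]
          total = sum(far)
          best = math.inf
          for k, (x, l, h) in enumerate(zip(q, low, high)):
              near = l if x <= (l + h) / 2 else h
              best = min(best, total - far[k] + (x - near) ** 2)
          return best

      def prune_branch_list(branches, q, nearest_dist):
          # Drop branches whose MINDIST exceeds either the smallest
          # MINMAXDIST among the branches (downward pruning) or the
          # current nearest distance (upward pruning).
          bound = min([nearest_dist] +
                      [minmaxdist(q, b.rect_low, b.rect_high) for b in branches])
          return [b for b in branches
                  if mindist(q, b.rect_low, b.rect_high) <= bound]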

24
Performance Analysis - Insertion
  • Insertion cost of R-tree, SS-tree and SR-tree
    (uniform dataset)

25
Performance Analysis - Query
  • Performance of VAMSplit R-tree, SS-tree and
    SR-tree
  • (Uniform dataset)

26
Performance Analysis - Query
  • Performance of VAMSplit R-tree, SS-tree and
    SR-tree
  • (Real dataset)

27
SR-tree average volume and diameter
  • Average volume and diameter of the leaf-level
    regions of the R-tree, SS-tree and SR-tree (real
    dataset)

28
Strengths
  • The SR-Tree divides points into regions with small
    volumes and short diameters.
  • Dividing points into smaller regions improves
    disjointness.
  • Smaller volumes and diameters enhance nearest
    neighbour query performance.

29
Weaknesses
  • The SR-tree suffers from the fanout problem
    (branching factor = max node entries)
  • The node size grows as dimensionality increases
  • The reduced fanout may require more nodes to be
    read on queries
  • This can possibly affect query performance

30
MVP-Tree
  • Indexing Large Metric Spaces For Similarity
    Search Queries
  • Tolga Bozkaya (Oracle Corporation)
  • Meral Ozsoyoglu (Dept. of Computer Engineering and
    Science, Case Western Reserve University)

31
Outline
  • Main idea
  • Algorithm basis
  • How to build an mvp-tree from the given data
  • How to do similarity search in an mvp-tree
  • Performance analysis and comparison based on
    experiments

32
Main Idea
  • Triangle inequality; distance-based indexing
  • Adopt more vantage points and more levels
  • Pre-computed distances are kept in the leaf nodes
  • Use the pre-computed distances to prune query
    branches
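  • The pruning rests on the triangle inequality: for a vantage point
    v, a query Q with range r, and a data point p,

      |d(Q, v) - d(v, p)| \le d(Q, p) \le d(Q, v) + d(v, p)

  • so p can be discarded without computing d(Q, p) whenever
    d(v, p) < d(Q, v) - r or d(v, p) > d(Q, v) + r.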

33
Algorithm Basis
  • vp1, vp2, vp3 are vantage points
  • Q is the given query point
  • p1 belongs to the point set
  • The more vantage points, the more unnecessary
    query branches are pruned
  • The larger the distance between two vantage
    points, normally the better

34
How to construct mvp-tree(m,k,p)
  • 1) If |S| = 0, then create an empty tree and
    quit.
  • 2) If |S| <= k, then create a leaf node L, put all
    the data into L, and quit.
  • 3) Choose the first vantage point Svp1, and keep
    the distances from it in arrays.
  • 4) Divide S into m groups with the same cardinality
    based on the distances from Svp1 to the points in
    S, and keep these distances as well.
  • 5) For each of the groups above:
  • 5.1) Choose the last point of the previous group
    as the new vantage point.
  • 5.2) Divide the group into m sub-groups with the
    same cardinality based on the distances from the
    new vantage point to the points in the group, and
    keep these distances as well.
  • 6) Recursively create mvp-trees on the resulting
    sub-groups following steps 1) to 5); a simplified
    code sketch follows below.
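  • A simplified sketch of this construction in Python (our own
    illustration, not the paper's code: it uses one randomly chosen
    vantage point per node and omits the second-level vantage points
    and the stored path distances):

      import math
      import random

      def build_mvp(points, m=3, k=10, dist=math.dist):
          # Empty set -> empty tree; small set -> leaf holding the data.
          if not points:
              return None
          if len(points) <= k:
              return {"leaf": points}
          # Pick a vantage point and order the rest by distance to it.
          vp = random.choice(points)
          rest = sorted((p for p in points if p is not vp),
                        key=lambda p: dist(vp, p))
          # Split into (at most) m equal-cardinality groups and remember,
          # for each group, the largest distance to the vantage point.
          step = -(-len(rest) // m)          # ceiling division
          groups = [rest[i:i + step] for i in range(0, len(rest), step)]
          cutoffs = [dist(vp, g[-1]) for g in groups]
          return {"vp": vp,
                  "cutoffs": cutoffs,
                  "children": [build_mvp(g, m, k, dist) for g in groups]}

  • For example, build_mvp(points, m=3, k=10) returns a nested
    structure whose internal nodes keep the vantage point and the group
    cutoff distances that are used for pruning at query time.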

35
Example
  • Sv1: the first vantage point (level 1)
  • Sv2: the vantage points (level 2)
  • D1..k: the distances between the data points in a
    leaf node and the vantage points
  • x.PATH[p]: the distances between the data point x
    and the first p vantage points along the path from
    the root to the leaf node that keeps it

36
How to do similarity search
  • Depth-first process. Q is the given query object;
    r is the query range.
  • 1) For i = 1 to m:
    if d(Q, Svi) <= r then Svi is in the answer set
    (Svi is the i-th vantage point of the current node).
  • 2) If the current node is a leaf node:
    for all data points Si in the node, if for all
    vantage points Sv
    d(Q, Sv) - r <= d(Si, Sv) <= d(Q, Sv) + r holds, and
    for all i = 1..p
    PATH[i] - r <= Si.PATH[i] <= PATH[i] + r holds,
    then compute d(Q, Si); if d(Q, Si) <= r,
    then Si is in the answer set.
  • 3) If the current node is an internal node:
    for all i = 1..m,
    if d(Q, Svi) - r <= Mi then recursively search the
    corresponding branch
    (Mi is the maximum of the distances from Svi to the
    points in its child node).
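  • A range-search sketch in Python matching the simplified build_mvp
    structure above (again our own illustration; it prunes with the
    triangle inequality only and omits the path-distance filters):

      import math

      def range_search(node, q, r, dist=math.dist, out=None):
          # Report every point within distance r of q, pruning whole
          # branches with the triangle inequality.
          if out is None:
              out = []
          if node is None:
              return out
          if "leaf" in node:
              out.extend(p for p in node["leaf"] if dist(q, p) <= r)
              return out
          dq = dist(q, node["vp"])
          if dq <= r:
              out.append(node["vp"])
          lo = 0.0
          for cutoff, child in zip(node["cutoffs"], node["children"]):
              # The child holds points p with lo <= d(vp, p) <= cutoff,
              # so it can contain answers only if [dq - r, dq + r]
              # overlaps that ring of distances.
              if dq - r <= cutoff and dq + r >= lo:
                  range_search(child, q, r, dist, out)
              lo = cutoff
          return out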

37
Comparison Experiment
  • Performance results for queries on a data set
    where the data points form several physical
    clusters

38
Performance Analysis
  • Experiments comparing mvp-trees with vp-trees that
    use only one vantage point at each level and do not
    make use of the pre-computed distances show that
    the mvp-tree outperforms the vp-tree by 20% to 80%
    for varying query ranges and different distance
    distributions.
  • For small query ranges, experiments with Euclidean
    vectors showed that mvp-trees require 40% to 80%
    fewer distance computations compared to vp-trees.
  • For larger query ranges, the percentage-wise
    difference decreases gradually, yet mvp-trees
    still perform better, making up to 30% fewer
    distance computations for the largest query
    ranges used.
  • Experiments on gray-level images using a data set
    of 1151 images show that mvp-trees performed 20%
    to 30% fewer distance computations.

39
Strengths
  • Built on the triangle inequality: more vantage
    points prune more unnecessary query branches;
    longer distances among vantage points are normally
    better; and pre-computed distances are reused as
    much as possible.
  • The mvp-tree is flatter than the vp-tree.
  • It is also balanced because of the way it is
    constructed.
  • Experiments show that it is more efficient than the
    vp-tree and the M-tree.

40
Weaknesses
  • 1. Construction cost: O(n log_m n) distance
    computations
  • 2. Additional storage costs are high: an array of
    size p must be kept with every data point in a
    leaf node.
  • 3. Updating and inserting data points may
    lead to reconstruction of the mvp-tree.
  • 4. If the insertions cause the tree structure
    to become skewed (that is, the additions of new data
    points change the distance distribution of the
    whole data set), global restructuring may have to
    be done, possibly during off-hours of operation.
  • 5. As the mvp-tree is created from an initial
    set of data objects in a top-down fashion, it is
    a rather static index structure.

41
Conclusion
  • Metric-based indexing can be effective for
    high-dimensional and non-uniform datasets (e.g.
    image/video similarity indexing)
  • Future work:
  • algorithms that perform well in both dynamic and
    static database environments
  • analyse the use of metrics together with other
    attributes to enable range queries

42
References
  • N. Katayama and S. Satoh. The SR-tree: An
    Indexing Structure for High-Dimensional Nearest
    Neighbor Queries. Proc. ACM SIGMOD Int. Conf. on
    Management of Data, pages 369-380, Tucson,
    Arizona, 1997.
  • R. Kurniawati, J. S. Jin, and J. A. Shepherd. The
    SS+-tree: An Improved Index Structure for
    Similarity Searches in a High-Dimensional Feature
    Space. SPIE Storage and Retrieval for Image and
    Video Databases V, pages 110-120, 1997.
  • N. Beckmann, H.-P. Kriegel, R. Schneider, and B.
    Seeger. The R*-tree: An Efficient and Robust Access
    Method for Points and Rectangles. Proc. ACM
    SIGMOD Int. Conf. on Management of Data,
    pages 322-331, 1990.
  • N. Roussopoulos, S. Kelley, and F. Vincent.
    Nearest Neighbor Queries. Proc. ACM SIGMOD, San
    Jose, USA, pages 71-79, May 1995.
  • Tolga Bozkaya and Meral Ozsoyoglu. Indexing Large
    Metric Spaces for Similarity Search Queries. ACM
    Transactions on Database Systems, pages 1-34, 1999.
  • Roberto Figueira Santos Filho, Agma Traina,
    Caetano Traina Jr. and Christos Faloutsos.
    Similarity Search Without Tears: The OMNI-Family
    of All-Purpose Access Methods. ICDE 2001.