NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms
1
NNH: Improving Performance of Nearest-Neighbor
Searches Using Histograms
  • Liang Jin (UC Irvine)
  • Nick Koudas (AT&T Labs Research)
  • Chen Li (UC Irvine)
  • Supported by NSF CAREER No. IIS-0238586
  • EDBT 2004

2
NN (nearest-neighbor) search
  • k-NN: find the k nearest neighbors of an object.

3
Example image search
Query image
  • Images represented as features (color histogram,
    texture moments, etc.)
  • Similarity search using these features
  • Find 10 most similar images for the query image
  • Other applications
  • Web-page search: find the 100 most similar pages for
    a given page
  • GIS: find the 5 closest cities to Irvine
  • Data cleaning

4
NN Algorithms
  • Distance measurement
  • When objects are points, distance is well defined
  • Usually Euclidean; other distances are possible
  • For arbitrarily shaped objects, assume we have a
    distance function between them
  • Most algorithms assume a high-dimensional tree
    structure for the datasets (e.g., R-tree).

5
Search process (1-NN for example)
  • Most algorithms traverse the structure (e.g., the
    R-tree) top down, following a branch-and-bound
    approach
  • Keep a priority queue of nodes (MBRs) to be
    visited, sorted by the minimum distance between q
    and each node
  • Improvements
  • Use MINDIST and MINMAXDIST
  • Reduce the queue size
  • Avoid unnecessary disk I/Os to access MBRs
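The branch-and-bound traversal above can be sketched with a binary heap. The node layout here (dicts with "mbr" and "children", data points as tuples) is a hypothetical stand-in for a real R-tree:

```python
import heapq
import math

def mindist(q, mbr):
    """MINDIST: distance from point q to the closest point of an
    axis-aligned rectangle given as (lower_corner, upper_corner)."""
    lo, hi = mbr
    return math.dist(q, [min(max(x, l), h) for x, l, h in zip(q, lo, hi)])

def nn_search(root, q):
    """Best-first 1-NN: always expand the queue entry nearest to q.
    The first data point popped is the nearest neighbor, since every
    unexpanded subtree is at least as far from q."""
    heap = [(0.0, 0, root)]           # (distance, tie-breaker, entry)
    counter = 1
    while heap:
        d, _, entry = heapq.heappop(heap)
        if isinstance(entry, tuple):  # a data point: done
            return entry, d
        for child in entry["children"]:
            key = (math.dist(q, child) if isinstance(child, tuple)
                   else mindist(q, child["mbr"]))
            heapq.heappush(heap, (key, counter, child))
            counter += 1
    return None, math.inf
```

With the NNH estimates introduced later, any child whose key is at least the estimated k-NN distance can simply not be pushed, which is where the queue-size savings come from.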

6
Problem
  • Queue size may be large
  • 60,000 objects, 32-d (image) vectors, 50 NNs
  • Max queue size: 15K entries
  • Avg queue size: about half that (7.5K entries)
  • If the queue can't fit in memory, more disk I/Os!
  • Problem worse for k-NN joins
  • E.g., a 1500 x 1500 join
  • Max queue size: 1.7M entries, > 1 GB of memory!
  • 750 seconds to run
  • Couldn't scale up to 2000 objects!
  • Disk thrashing

7
Our Solution: Nearest-Neighbor Histogram (NNH)
  • Main idea
  • Utilizing NNH in a search (KNN, join)
  • Construction and incremental maintenance
  • Experiments
  • Related work

8
NNH: Nearest-Neighbor Histograms
  • A set of m pivots p1, p2, …, pm
  • Each pivot stores the distances r1, r2, …, rT to
    its nearest neighbors
  • The pivots are not part of the database
9
Structure
  • Nearest-neighbor vectors: each ri is the distance
    of p's i-th NN; T is the length of each vector
  • Nearest-neighbor histogram: a collection of m
    pivots with their NN vectors

10
Outline
  • Main idea
  • Utilizing NNH in a search (KNN, join)
  • Construction and incremental maintenance
  • Experiments
  • Related work

11
Estimate the NN distance for a query object
  • NNH does not give exact NN information for an
    object
  • But we can estimate an upper bound ρ_q^est for the
    k-NN distance of q
  • Triangle inequality: each of a pivot p's k nearest
    neighbors lies within dist(q, p) + rk of q
12
Estimate the NN distance for a query object (cont'd)
  • Apply the triangle inequality to every pivot
  • The upper-bound estimate of q's NN distance is the
    minimum over all pivots
  • Complexity: O(m)
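In code, the O(m) estimate is a single pass over the pivots (function and variable names here are illustrative; nnh[i] is pivot i's sorted NN-distance vector):

```python
import math

def knn_upper_bound(q, pivots, nnh, k):
    """Upper bound on q's k-NN distance via the triangle inequality:
    each of pivot p_i's k nearest neighbors lies within
    dist(q, p_i) + nnh[i][k-1] of q, so at least k database objects
    do; the tightest such bound over all m pivots is the estimate."""
    return min(math.dist(q, p) + nnh[i][k - 1]
               for i, p in enumerate(pivots))
```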

13
Utilizing estimates in NN search
  • More pruning: prune an MBR if
    MINDIST(q, mbr) ≥ ρ_q^est
14
Utilizing estimates in NN join
  • k-NN join: for each object o1 in D1, find its
    k nearest neighbors in D2
  • Traverse the two trees top down, keeping a queue
    of node pairs

15
Utilizing estimates in NN join (cont'd)
  • Construct an NNH for D2
  • For each object o1 in D1, keep its estimated NN
    radius ρ_o1^est using the NNH of D2
  • As in a k-NN query, ignore an MBR for o1 if
    MINDIST(o1, mbr) ≥ ρ_o1^est
16
More powerful: prune MBR pairs
17
Prune MBR pairs (cont'd)
  • Prune the pair (mbr1, mbr2) if
    MINDIST(mbr1, mbr2) is at least the largest
    estimated k-NN radius of the D1 objects under mbr1
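A hedged sketch of the pair test: MINDIST between two axis-aligned MBRs, compared against the largest NNH-estimated k-NN radius of the D1 objects under mbr1 (how that radius is aggregated per node is an assumption here; the paper gives the exact rule):

```python
def mindist_rects(a, b):
    """Minimum distance between two axis-aligned MBRs, each given
    as (lower_corner, upper_corner)."""
    gap_sq = 0.0
    for (al, ah), (bl, bh) in zip(zip(*a), zip(*b)):
        g = max(al - bh, bl - ah, 0.0)  # per-dimension gap; 0 if overlapping
        gap_sq += g * g
    return gap_sq ** 0.5

def prune_pair(mbr1, mbr2, max_est_radius):
    """Drop the pair: no point under mbr2 can be closer to any object
    under mbr1 than that object's already-guaranteed k-NN distance."""
    return mindist_rects(mbr1, mbr2) >= max_est_radius
```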
18
Outline
  • Main idea
  • Utilizing NNH in a search (KNN, join)
  • Construction and incremental maintenance
  • Experiments
  • Related work

19
NNH Construction
  • Once the m pivots are selected
  • Just run k-NN queries for them to construct the NNH
  • Time: O(m) k-NN queries, done offline
  • Important: selecting the pivots
  • Size-constraint construction
  • Error-constraint construction (see paper)

20
Size-constraint NNH construction
  • The number of pivots m determines
  • Storage size
  • Initial construction cost
  • Incremental-maintenance cost
  • Choose the m best pivots

21
Size-constraint NNH construction
  • Given m (the number of pivots), assume
  • query objects are from the database D
  • H(pi, k) doesn't vary too much
  • Goal: find pivots p1, p2, …, pm that minimize the
    objects' distances to their closest pivots
  • A clustering problem
  • Many algorithms available
  • Use k-means for its simplicity and efficiency
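The pivot-selection step can be sketched with a plain Lloyd's k-means (a minimal pure-Python version for illustration; any standard k-means implementation would do):

```python
import math
import random

def kmeans_pivots(points, m, iters=20, seed=42):
    """Lloyd's algorithm: choose m pivots (cluster centers) that
    minimize the objects' distances to their closest pivot."""
    rng = random.Random(seed)
    centers = rng.sample(points, m)
    for _ in range(iters):
        # assign each point to its nearest center
        clusters = [[] for _ in range(m)]
        for p in points:
            j = min(range(m), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # move each center to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centers
```

Each returned center then gets its NN vector by running a T-NN query against the database, as in the construction slide above.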

22
Incremental Maintenance
  • How do we update the NNH when inserting or
    deleting objects?
  • Naively, each vector would need to be shifted
  • Instead, associate a valid length Ei with each
    NN vector
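The insertion side can be sketched as follows (names are illustrative; the deletion rules and the precise role of the valid length E_i are in the paper):

```python
import bisect
import math

def on_insert(nnh, pivots, new_obj):
    """On insertion, the new object may enter a pivot's NN vector:
    slide its distance into the sorted vector and drop the last
    entry, keeping the length T. Deletions (not shown) can invalidate
    tail entries instead, which is what the valid length E_i tracks."""
    for i, p in enumerate(pivots):
        d = math.dist(p, new_obj)
        if d < nnh[i][-1]:
            bisect.insort(nnh[i], d)
            nnh[i].pop()
```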

23
Outline
  • Main idea
  • Utilizing NNH in a search (KNN, join)
  • Construction and incremental maintenance
  • Experiments
  • Related work

24
Experiments
  • Datasets
  • Corel image database
  • Contains 60,000 images
  • Each image represented by a 32-dimensional float
    vector
  • Time-series data from AT&T
  • Similar trends; we report results for the Corel
    data set
  • Test bed
  • PC: 1.5 GHz Athlon, 512 MB memory, 80 GB HD,
    Windows 2000
  • GNU C in Cygwin

25
Goal
  • Is the pruning using NNH estimates powerful?
  • KNN queries
  • NN-join queries
  • Is it cheap to have such a structure?
  • Storage
  • Initial construction
  • Incremental maintenance

26
Improvement in k-NN search
  • Ran k-means algorithm to generate 400 pivots for
    60K objects, and constructed NNH
  • Performed k-NN queries on 100 randomly selected
    query objects
  • Queue size used to measure memory usage:
  • Max queue size
  • Average queue size

27
Reduced Memory Requirement
28
Reduced running time
29
Effects of different numbers of pivots
30
Join: Reduced memory requirement
31
Join: Reduced running time
32
Join: Running time for different data sizes
33
Cost/Benefit of NNH
For 60,000 32-d float vectors:

  # of pivots (m)                10    50    100   150   200   250   300   350   400
  Construction time (sec)       0.7   3.59  6.6   9.4   11.5  13.7  15.7  17.8  20.4
  Storage space (KB)            2     10    20    30    40    50    60    70    80
  Incr. maintenance time (ms)   ~0    ~0    ~0    ~0    ~0    ~0    ~0    ~0    ~0
  Improved queue size, kNN (%)  40    30    28    24    24    24    23    20    18
  Improved queue size, join (%) 45    34    28    26    26    25    24    24    22

  (~0 means almost zero.)
34
Conclusion
  • NNH: an efficient, effective approach to improving
    NN-search performance
  • Easily embedded into existing implementations of
    NN algorithms
  • Efficiently constructed and maintained
  • Offers substantial performance advantages

35
Related work
  • Summary histograms
  • E.g., Jagadish et al., VLDB '98; Matias et al.,
    VLDB '00
  • Objective: approximate frequency values
  • NN search algorithms
  • Many algorithms developed; many of them can
    benefit from NNH
  • Algorithms based on pivots/foci/anchors
  • E.g., Omni (Filho et al., ICDE '01), vantage
    objects (Vleugels et al., VIIS '99), M-trees
    (Ciaccia et al., VLDB '97)
  • These choose pivots far from each other (to
    capture the intrinsic dimensionality)
  • NNH pivots instead depend on how clustered the
    objects are
  • Experiments show the differences

36
Work conducted in the Flamingo Project on Data
Cleansing at UC Irvine