NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms

About This Presentation

Description:

Authors: Liang Jin, Nick Koudas, Chen Li. Last modified by: Chen Li.

Slides: 37

Provided by: LiangJin

Learn more at: https://ics.uci.edu

Transcript and Presenter's Notes

Title: NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms

1
NNH: Improving Performance of Nearest-Neighbor
Searches Using Histograms

Liang Jin (UC Irvine)
Nick Koudas (AT&T Labs Research)
Chen Li (UC Irvine)
Supported by NSF CAREER No. IIS-0238586
EDBT 2004

2
NN (nearest-neighbor) search

KNN: find the k nearest neighbors of an object.

3
Example: image search
Query image

Images represented as features (color histogram,
texture moments, etc.)
Similarity search using these features
Find 10 most similar images for the query image
Other applications
Web-page search: find the 100 most similar pages
for a given page
GIS: find the 5 cities closest to Irvine
Data cleaning

4
NN Algorithms

Distance measurement
For objects that are points, distance is well defined
Usually Euclidean
Other distances possible
For arbitrary-shaped objects, assume we have a
distance function between them
Most algorithms assume a high-dimensional tree
structure for the datasets (e.g., R-tree).

5
Search process (1-NN for example)

Most algorithms traverse the structure (e.g.,
R-tree) top down, and follow a branch-and-bound
approach
Keep a priority queue of nodes (MBRs) to be
visited
Sorted based on the minimum distance between q
and each node
Improvement
Use MINDIST and MINMAXDIST
Reduce the queue size
Avoid unnecessary disk IOs to access MBRs

(Figure: priority queue)
6
Problem

Queue size may be large
60,000 objects, 32-d (image) vectors, 50 NNs
Max queue size: 15K entries
Avg queue size: half (7.5K entries)
If the queue can't fit in memory, more disk IOs!
Problem worse for k-NN joins
E.g., 1500 x 1500 join
Max queue size: 1.7M entries (> 1 GB memory!)
750 seconds to run
Couldn't scale up to 2000 objects!
Disk thrashing

7
Our Solution: Nearest-Neighbor Histogram (NNH)

Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work

8
NNH: Nearest-Neighbor Histograms
Pivots p1, p2, ..., pm
m: # of pivots
For each pivot: distances of its nearest neighbors r1, r2, ...
The pivots are not part of the database
9
Structure

Nearest Neighbor Vectors

For a pivot p, each ri is the distance from p to
its i-th NN; T is the length of each vector

Nearest Neighbor Histogram
Collection of m pivots with their NN vectors
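
As a rough illustration of the structure described on this slide, the sketch below builds an NNH by brute force. The function names (`nn_vector`, `build_nnh`), the toy data, and the choice of Euclidean distance are assumptions made for the example; the slides do not prescribe an implementation.

```python
import math

def nn_vector(pivot, data, T):
    """Distances from `pivot` to its T nearest objects in `data`, ascending."""
    return sorted(math.dist(pivot, x) for x in data)[:T]

def build_nnh(pivots, data, T):
    """NNH as a list of (pivot, NN vector) pairs, one pair per pivot."""
    return [(p, nn_vector(p, data, T)) for p in pivots]

# Toy 2-d database and two hypothetical pivots (not database objects).
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
pivots = [(0.0, 1.0), (4.0, 4.0)]
nnh = build_nnh(pivots, data, T=2)
```

Each entry stores only T distances per pivot, so the storage grows linearly with the number of pivots.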

10
Outline

Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work

11
Estimate NN distance for query object

NNH does not give exact NN information for an
object
But we can estimate an upper bound δ_q^est for the
k-NN distance of q

Triangle inequality
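
The triangle-inequality argument for a single pivot can be written out as follows. The notation is an assumption for this note: Δ is the distance function, H(p_i, k) is the k-th entry of pivot p_i's NN vector, and δ_k(q) is the true k-NN distance of q. The argument needs k ≤ T and NN vectors that reflect the current database.

```latex
\begin{align*}
\Delta(q, o) &\le \Delta(q, p_i) + \Delta(p_i, o) \le \Delta(q, p_i) + H(p_i, k)
  && \text{for each of the } k \text{ nearest neighbors } o \text{ of } p_i, \\
\delta_k(q) &\le \Delta(q, p_i) + H(p_i, k)
  && \text{since at least } k \text{ objects lie within that distance of } q.
\end{align*}
```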
12
Estimate NN for query object (cont.)

Apply the triangle inequality to all pivots
Upper bound estimate of NN distance of q

Complexity: O(m)
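
A minimal sketch of the estimate, assuming Euclidean distance and an NNH stored as (pivot, NN vector) pairs. The name `estimate_knn_radius` and the toy data are hypothetical. It takes the smallest per-pivot bound, which is one pass over the m pivots.

```python
import math

def estimate_knn_radius(q, nnh, k):
    """Upper bound on q's k-NN distance: min over pivots of dist(q, p) + r_k."""
    return min(math.dist(q, p) + vec[k - 1] for p, vec in nnh if len(vec) >= k)

# Toy database, two hypothetical pivots, NN vectors of length T = 2.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
pivots = [(0.0, 1.0), (4.0, 4.0)]
nnh = [(p, sorted(math.dist(p, x) for x in data)[:2]) for p in pivots]

q, k = (0.5, 0.5), 2
bound = estimate_knn_radius(q, nnh, k)
# Brute-force k-NN distance, computed only to check the bound on this toy data.
true_radius = sorted(math.dist(q, x) for x in data)[k - 1]
```

On this data the bound is never below `true_radius`, as the triangle inequality guarantees. How tight it is depends on how close q lies to some pivot.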

13
Utilizing estimates in NN search

More pruning: prune an MBR if MINDIST(q, mbr)
exceeds the estimated upper bound on the k-NN
distance of q.
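
The effect of this pruning on the queue can be sketched with a simplified best-first search. Everything here is an assumption for illustration: a flat list of leaf MBRs stands in for an R-tree, the data is a toy grid, the pivots are block centers, and `knn_search` is a hypothetical name. Entries whose distance exceeds the NNH bound are never queued.

```python
import heapq
import math

def mindist(q, lo, hi):
    """Minimum distance from point q to the axis-aligned box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def knn_search(q, leaves, k, bound=math.inf):
    """Best-first k-NN; entries farther than `bound` are never queued.

    Returns (distances of the k nearest points, maximum queue size seen).
    """
    heap = []
    for i, (lo, hi, _) in enumerate(leaves):
        d = mindist(q, lo, hi)
        if d <= bound:                      # prune whole MBRs
            heap.append((d, 1, i, -1))      # (distance, is_node, leaf id, point id)
    heapq.heapify(heap)
    max_size, result = len(heap), []
    while heap and len(result) < k:
        d, is_node, i, _ = heapq.heappop(heap)
        if not is_node:
            result.append(d)                # points pop in ascending distance order
            continue
        for j, p in enumerate(leaves[i][2]):
            dp = math.dist(q, p)
            if dp <= bound:                 # prune individual objects
                heapq.heappush(heap, (dp, 0, i, j))
        max_size = max(max_size, len(heap))
    return result, max_size

# Toy data: a 20 x 20 grid of points, grouped into 4 x 4 blocks acting as leaves.
leaves = []
for bx in range(0, 20, 4):
    for by in range(0, 20, 4):
        pts = [(float(x), float(y)) for x in range(bx, bx + 4) for y in range(by, by + 4)]
        leaves.append(((bx, by), (bx + 3, by + 3), pts))
data = [p for _, _, pts in leaves for p in pts]

# Hypothetical pivots (block centers) with brute-force NN vectors of length 5.
k = 5
pivots = [(bx + 1.5, by + 1.5) for bx in range(0, 20, 4) for by in range(0, 20, 4)]
nnh = [(p, sorted(math.dist(p, x) for x in data)[:k]) for p in pivots]

q = (7.3, 8.6)
bound = min(math.dist(q, p) + vec[k - 1] for p, vec in nnh)
plain = knn_search(q, leaves, k)            # no NNH pruning
pruned = knn_search(q, leaves, k, bound)    # with NNH pruning
```

Both calls return the same k distances, because the bound never falls below the true k-NN distance. The pruned run keeps fewer entries in the queue.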
14
Utilizing estimates in NN join

K-NN join: for each object o1 in D1, find its
k nearest neighbors in D2.
Traverse the two trees top down; keep a queue of pairs

15
Utilizing estimates in NN join (cont.)

Construct NNH for D2.
For each object o1 in D1, keep its estimated NN
radius δ_o1^est, computed using the NNH of D2.
Similar to a k-NN query, ignore an MBR for o1 if
MINDIST(o1, mbr) exceeds δ_o1^est.
16
More powerful: prune MBR pairs
17
Prune MBR pairs (cont.)
Prune this MBR pair (mbr1, mbr2) if
MINDIST(mbr1, mbr2) exceeds an upper bound on the
k-NN distance of every object in mbr1.
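
One way to test an MBR pair is sketched below. It rests on an assumption the transcript does not spell out: that a bound valid for every object in mbr1 is obtained by replacing the object-to-pivot distance with MAXDIST(pivot, mbr1). All function names and the toy NNH values are hypothetical.

```python
import math

def mindist_boxes(lo1, hi1, lo2, hi2):
    """Minimum distance between two axis-aligned boxes (0 if they overlap)."""
    return math.sqrt(sum(max(l1 - h2, 0.0, l2 - h1) ** 2
                         for l1, h1, l2, h2 in zip(lo1, hi1, lo2, hi2)))

def maxdist_point_box(p, lo, hi):
    """Largest distance from point p to any point of the box [lo, hi]."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, lo, hi)))

def mbr_radius_bound(lo, hi, nnh, k):
    """Upper bound on the k-NN distance (in D2) of every point inside the box."""
    return min(maxdist_point_box(p, lo, hi) + vec[k - 1] for p, vec in nnh)

def can_prune_pair(mbr1, mbr2, nnh, k):
    """True if no object in mbr2 can be a k-NN of any object in mbr1."""
    return mindist_boxes(*mbr1, *mbr2) > mbr_radius_bound(*mbr1, nnh, k)

# Made-up NNH of D2: one pivot at the origin whose 1st and 2nd NN distances are 1 and 2.
nnh_d2 = [((0.0, 0.0), [1.0, 2.0])]
mbr1 = ((0.0, 0.0), (1.0, 1.0))             # node from the D1 tree
far_mbr2 = ((10.0, 10.0), (11.0, 11.0))     # nodes from the D2 tree
near_mbr2 = ((2.0, 0.0), (3.0, 1.0))
prune_far = can_prune_pair(mbr1, far_mbr2, nnh_d2, k=2)
prune_near = can_prune_pair(mbr1, near_mbr2, nnh_d2, k=2)
```

Here the far pair is pruned and the near pair is kept. Pruning a pair removes all object pairs below it from the join queue at once.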
18
Outline

Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work

19
NNH Construction

If we have selected the m pivots
Just run KNN queries for them to construct NNH
Time is O(m) KNN queries
Offline
Important: selecting pivots
Size-Constraint Construction
Error-Constraint Construction (see paper)

20
Size-constraint NNH construction

# of pivots (m) determines
Storage size
Initial construction cost
Incremental-maintenance cost
Choose the m best pivots

21
Size-constraint NNH construction

Given m (# of pivots), assume
query objects are from the database D
H(pi,k) doesn't vary too much
Goal: find pivots p1, p2, ..., pm to minimize
object distances to the pivots
Clustering problem
Many algorithms available
Use K-means for its simplicity and efficiency
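
A bare-bones version of k-means pivot selection might look like the sketch below. This is plain Lloyd's iteration with a deterministic, evenly spaced initialization chosen only to keep the example reproducible. The function name, the initialization, and the toy data are assumptions, not details from the talk.

```python
import math

def kmeans_pivots(data, m, iters=20):
    """Return m cluster centers of `data` to be used as NNH pivots."""
    step = len(data) // m
    centers = data[::step][:m]                    # simplistic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(m)]
        for x in data:
            nearest = min(range(m), key=lambda c: math.dist(x, centers[c]))
            clusters[nearest].append(x)
        # New center = mean of its cluster; an empty cluster keeps its old center.
        centers = [tuple(sum(col) / len(pts) for col in zip(*pts)) if pts else centers[j]
                   for j, pts in enumerate(clusters)]
    return centers

# Two well-separated toy clusters.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
pivots = kmeans_pivots(data, m=2)
```

The centers are generally not database objects. Running a KNN query for each center then yields its NN vector.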

22
Incremental Maintenance

How to update the NNH when inserting or deleting
objects?
Need to shift each vector
Associate a valid length Ei to each NN vector.
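
The slide gives only the outline of the update, so the following is one plausible reading rather than the authors' procedure. Each pivot keeps just the prefix of its NN vector that is still known to be exact, and the length of that prefix plays the role of the valid length Ei. The class and method names are hypothetical.

```python
import bisect
import math

class NNVector:
    """Trusted prefix of one pivot's sorted NN distances (at most T entries)."""

    def __init__(self, pivot, data, T):
        self.pivot, self.T = pivot, T
        self.r = sorted(math.dist(pivot, x) for x in data)[:T]

    @property
    def valid_length(self):
        return len(self.r)

    def insert(self, obj):
        """New object: if it falls inside the trusted prefix, shift entries right."""
        d = math.dist(self.pivot, obj)
        if self.r and d < self.r[-1]:
            bisect.insort(self.r, d)
            del self.r[self.T:]                 # keep at most T entries

    def delete(self, obj):
        """Deleted object: if it is in the prefix, shift left; valid length shrinks."""
        d = math.dist(self.pivot, obj)
        i = bisect.bisect_left(self.r, d)
        if i < len(self.r) and self.r[i] == d:
            del self.r[i]

vec = NNVector((0.0, 0.0), [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)], T=3)
vec.insert((1.5, 0.0))     # lands between the 1st and 2nd NN
after_insert = list(vec.r)
vec.delete((1.0, 0.0))     # removes the current 1st NN
vec.insert((9.0, 0.0))     # beyond the trusted prefix: no change
```

Under this reading a pivot can answer estimates only for k up to its valid length, so a vector that has shrunk too far would need to be recomputed with a fresh KNN query.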

23
Outline

Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work

24
Experiments

Datasets
Corel image database
Contains 60,000 images
Each image represented by a 32-dimensional float
vector
Time-series data from AT&T
Similar trends on both data sets; results reported for the Corel data set
Test bed
PC: 1.5 GHz Athlon, 512 MB memory, 80 GB HD, Windows 2000.
GNU C in CYGWIN

25
Goal

Is the pruning using NNH estimates powerful?
KNN queries
NN-join queries
Is it cheap to have such a structure?
Storage
Initial construction
Incremental maintenance

26
Improvement in k-NN search

Ran k-means algorithm to generate 400 pivots for
60K objects, and constructed NNH
Performed K-NN queries on 100 randomly selected
query objects.
Queue size to measure memory usage.
Max queue size
Average queue size

27
Reduced Memory Requirement
28
Reduced running time
29
Effects of different # of pivots
30
Join: reduced memory requirement
31
Join: reduced running time
32
Join: running time for different data sizes
33
Cost/Benefit of NNH
For 60,000 float vectors (32-d).
Pivot # (m)                   10    50   100   150   200   250   300   350   400
Construction time (sec)      0.7  3.59   6.6   9.4  11.5  13.7  15.7  17.8  20.4
Storage space (kB)             2    10    20    30    40    50    60    70    80
Incr. maintenance time (ms)    0     0     0     0     0     0     0     0     0
Improved q-size (kNN) (%)     40    30    28    24    24    24    23    20    18
Improved q-size (join) (%)    45    34    28    26    26    25    24    24    22
0 means almost zero.
34
Conclusion

NNH: an efficient, effective approach to improving
NN-search performance.
Can be easily embedded into current
implementation of NN algorithms.
Can be efficiently constructed and maintained.
Offers substantial performance advantages.

35
Related work

Summary histograms
E.g., Jagadish et al. (VLDB98), Mattias et al.
(VLDB00)
Objective: approximate frequency values
NN Search algorithms
Many algorithms developed
Many of them can benefit from NNH
Algorithms based on pivots/foci/anchors
E.g., Omni (Filho et al., ICDE01), vantage objects
(Vleugels et al., VIIS99), M-trees (Ciaccia et al.,
VLDB97)
Choose pivots far from each other (to represent
the intrinsic dimensionality)
NNH pivots depend on how clustered the objects
are
Experiments show the differences