Title: NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms
1 NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms
- Liang Jin (UC Irvine)
- Nick Koudas (AT&T Labs Research)
- Chen Li (UC Irvine)
- Supported by NSF CAREER No. IIS-0238586
- EDBT 2004
2 NN (nearest-neighbor) search
- KNN: find the k nearest neighbors of an object.
3 Example: image search
[Figure: query image]
- Images represented as features (color histogram, texture moments, etc.)
- Similarity search using these features
- Find the 10 most similar images for the query image
- Other applications
- Web-page search: find the 100 most similar pages for a given page
- GIS: find the 5 closest cities to Irvine
- Data cleaning
4 NN Algorithms
- Distance measurement
- For objects that are points, distance is well defined
- Usually Euclidean
- Other distances possible
- For arbitrary-shaped objects, assume we have a distance function between them
- Most algorithms assume a high-dimensional tree structure for the datasets (e.g., R-tree).
5 Search process (1-NN for example)
- Most algorithms traverse the structure (e.g., R-tree) top down, and follow a branch-and-bound approach
- Keep a priority queue of nodes (MBRs) to be visited
- Sorted based on the minimum distance between q and each node
- Improvement
- Use MINDIST and MINMAXDIST
- Reduce the queue size
- Avoid unnecessary disk I/Os to access MBRs
[Figure: priority queue]
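The branch-and-bound traversal described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes Euclidean distance and a flat list of leaf MBRs instead of a real R-tree, it omits MINMAXDIST, and the names `mindist` and `nearest_neighbor` are hypothetical.

```python
import heapq
import math


def mindist(q, lo, hi):
    """Minimum Euclidean distance from point q to the box with corners lo, hi."""
    total = 0.0
    for x, a, b in zip(q, lo, hi):
        # Per-dimension gap: zero when x lies inside [a, b].
        gap = a - x if x < a else (x - b if x > b else 0.0)
        total += gap * gap
    return math.sqrt(total)


def nearest_neighbor(q, leaves):
    """1-NN search; each leaf is (lo, hi, points). Assumed flat layout."""
    # Priority queue of leaves, ordered by MINDIST to the query.
    heap = [(mindist(q, lo, hi), i) for i, (lo, hi, _) in enumerate(leaves)]
    heapq.heapify(heap)
    best, best_dist, visited = None, math.inf, 0
    while heap:
        d, i = heapq.heappop(heap)
        if d >= best_dist:
            break  # bound: no remaining leaf can hold a closer point
        visited += 1
        for p in leaves[i][2]:
            dp = math.dist(q, p)
            if dp < best_dist:
                best, best_dist = p, dp
    return best, best_dist, visited
```

The `visited` counter stands in for disk I/Os: every leaf that is popped and opened would cost a page access in a disk-resident index.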
6 Problem
- Queue size may be large
- 60,000 objects, 32-d (image) vectors, 50 NNs
- Max queue size: 15K entries
- Avg queue size: about half (7.5K entries)
- If the queue can't fit in memory, more disk I/Os!
- Problem worse for k-NN joins
- E.g., 1500 x 1500 join
- Max queue size: 1.7M entries, > 1GB memory!
- 750 seconds to run
- Couldn't scale up to 2000 objects!
- Disk thrashing
7 Our Solution: Nearest-Neighbor Histogram (NNH)
- Main idea
- Utilizing NNH in a search (KNN, join)
- Construction and incremental maintenance
- Experiments
- Related work
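8 NNH: Nearest-Neighbor Histograms
[Figure: pivots p1, p2, ..., pm, each with the distances of its nearest neighbors]
- m: number of pivots
- For each pivot, the distances of its nearest neighbors: r1, r2, ...
- The pivots are not part of the database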
9 Structure
Each ri is the distance from pivot p to its i-th NN; T is the length of each vector
- Nearest Neighbor Histogram
- Collection of m pivots with their NN vectors
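As a data structure, the histogram described above is small. The sketch below is one possible in-memory layout, assuming points are tuples of floats; the class and field names (`PivotEntry`, `NNH`, `nn_dists`) are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PivotEntry:
    """One pivot p and its NN vector (r1, ..., rT), sorted ascending."""
    pivot: Tuple[float, ...]
    nn_dists: List[float]

    def H(self, k: int) -> float:
        """Distance from the pivot to its k-th nearest neighbor (1-based k)."""
        return self.nn_dists[k - 1]


@dataclass
class NNH:
    """Nearest-neighbor histogram: m pivots with their NN vectors."""
    entries: List[PivotEntry]
```

With this layout, storage grows as m times T distances plus the m pivot coordinates.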
10 Outline
- Main idea
- Utilizing NNH in a search (KNN, join)
- Construction and incremental maintenance
- Experiments
- Related work
11 Estimate NN distance for query object
- NNH does not give exact NN information for an object
- But we can estimate an upper bound δ_q^est for the k-NN distance of q
[Figure: triangle inequality]
12 Estimate NN for query object (cont.)
- Apply the triangle inequality to all pivots
- Upper bound estimate of NN distance of q
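The estimate can be worked through concretely. For a pivot p, its k nearest neighbors all lie within H(p, k) of p, so by the triangle inequality they lie within dist(q, p) + H(p, k) of q; at least k objects are therefore that close to q, and the smallest such sum over all pivots is an upper bound on the k-NN distance of q. The sketch below assumes Euclidean distance and represents the NNH as a list of (pivot, nn_dists) pairs; the function name is hypothetical.

```python
import math


def estimate_knn_distance(q, nnh, k):
    """Upper bound on the distance from q to its k-th nearest neighbor.

    nnh: list of (pivot, nn_dists) pairs, nn_dists sorted ascending
    (an assumed representation of the histogram).
    """
    # Triangle inequality applied to every pivot; keep the tightest bound.
    return min(math.dist(q, p) + nn[k - 1] for p, nn in nnh)
```

The bound is tighter when some pivot is close to q, which is why pivot placement matters.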
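13 Utilizing estimates in NN search
- More pruning: prune an MBR if MINDIST(q, mbr) exceeds the estimated upper bound on the k-NN distance of q
[Figure: query q, an MBR, and the MINDIST between them]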
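The pruning test above amounts to one comparison per MBR before it enters the queue. The sketch below is an illustration under assumptions: Euclidean distance, MBRs given as (lo, hi) corner pairs, and hypothetical function names.

```python
import math


def mindist(q, lo, hi):
    """Minimum Euclidean distance from point q to the box with corners lo, hi."""
    total = 0.0
    for x, a, b in zip(q, lo, hi):
        gap = a - x if x < a else (x - b if x > b else 0.0)
        total += gap * gap
    return math.sqrt(total)


def prune_with_estimate(q, mbrs, est):
    """Return only the MBRs that may still contain a k-nearest neighbor of q.

    est is an upper bound on the k-NN distance of q (for example, one
    derived from an NNH); MBRs with MINDIST above it are never enqueued.
    """
    return [(lo, hi) for lo, hi in mbrs if mindist(q, lo, hi) <= est]
```

Because the estimate is available before the traversal starts, this filter applies from the first node visited, whereas the usual bound only tightens once k candidates have been found.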
14 Utilizing estimates in NN join
- k-NN join: for each object o1 in D1, find its k nearest neighbors in D2.
- Traverse two trees top down; keep a queue of pairs
15 Utilizing estimates in NN join (cont.)
- Construct NNH for D2.
- For each object o1 in D1, keep its estimated NN radius δ_o1^est using the NNH of D2.
- Similar to a k-NN query, ignore an MBR for o1 if MINDIST(o1, mbr) exceeds this estimated radius
[Figure: object o1, an MBR, and the MINDIST between them]
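The per-object pruning in the join can be sketched as below. This is a simplified illustration, not the two-tree traversal from the slides: D2 is assumed to be a flat list of leaves (lo, hi, points), the NNH of D2 is a list of (pivot, nn_dists) pairs, distances are Euclidean, and `knn_join` is a hypothetical name.

```python
import math


def mindist(q, lo, hi):
    """Minimum Euclidean distance from point q to the box with corners lo, hi."""
    total = 0.0
    for x, a, b in zip(q, lo, hi):
        gap = a - x if x < a else (x - b if x > b else 0.0)
        total += gap * gap
    return math.sqrt(total)


def knn_join(d1, leaves2, nnh2, k):
    """For each o1 in d1, return its k nearest neighbors among D2's points."""
    result = {}
    for o1 in d1:
        # Estimated NN radius of o1, from the NNH built on D2.
        radius = min(math.dist(o1, p) + nn[k - 1] for p, nn in nnh2)
        candidates = []
        for lo, hi, points in leaves2:
            if mindist(o1, lo, hi) > radius:
                continue  # this MBR cannot hold any of o1's k nearest neighbors
            candidates.extend(points)
        candidates.sort(key=lambda o2: math.dist(o1, o2))
        result[o1] = candidates[:k]
    return result
```

Since the radius is an upper bound on the true k-NN distance, every true neighbor sits in an MBR that survives the test, so the result matches a brute-force join.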
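16 More powerful: prune MBR pairs
17 Prune MBR pairs (cont.)
[Figure: mbr1, mbr2, and the MINDIST between them]
- Prune this MBR pair if MINDIST(mbr1, mbr2) exceeds an NNH-based upper bound on the NN radius of every object in mbr1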
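One way to realize pair pruning is sketched below. The slide's exact condition is not recoverable from this transcript, so the bound used here is an assumption: for every object o in mbr1, dist(o, p) is at most the maximum distance from mbr1 to pivot p, so that maximum distance plus H(p, k) bounds the k-NN radius of all objects in mbr1. All function names are hypothetical and distances are Euclidean.

```python
import math


def mindist_boxes(lo1, hi1, lo2, hi2):
    """Minimum Euclidean distance between two axis-aligned boxes."""
    total = 0.0
    for a1, b1, a2, b2 in zip(lo1, hi1, lo2, hi2):
        gap = max(0.0, a2 - b1, a1 - b2)  # zero when the intervals overlap
        total += gap * gap
    return math.sqrt(total)


def maxdist_to_point(lo, hi, p):
    """Largest Euclidean distance from any point in the box to p."""
    return math.sqrt(sum(max(abs(x - a), abs(x - b)) ** 2
                         for x, a, b in zip(p, lo, hi)))


def mbr_radius_estimate(lo, hi, nnh, k):
    """Upper bound on the k-NN distance of every object inside the box."""
    return min(maxdist_to_point(lo, hi, p) + nn[k - 1] for p, nn in nnh)


def can_prune_pair(mbr1, mbr2, nnh, k):
    """True if no object in mbr2 can be a k-NN of any object in mbr1."""
    (lo1, hi1), (lo2, hi2) = mbr1, mbr2
    return mindist_boxes(lo1, hi1, lo2, hi2) > mbr_radius_estimate(lo1, hi1, nnh, k)
```

Pruning at the pair level discards whole subtrees of the second index at once, rather than testing each object of mbr1 separately.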
18 Outline
- Main idea
- Utilizing NNH in a search (KNN, join)
- Construction and incremental maintenance
- Experiments
- Related work
19 NNH Construction
- If we have selected the m pivots
- Just run KNN queries for them to construct NNH
- Time is O(m) KNN queries (one per pivot)
- Offline
- Important: selecting pivots
- Size-Constraint Construction
- Error-Constraint Construction (see paper)
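Given the pivots, construction reduces to one KNN query per pivot. The sketch below substitutes a brute-force scan for the index-based KNN query the slides imply, so it illustrates the output rather than the cost; `build_nnh` and its (pivot, nn_dists) output format are assumptions.

```python
import heapq
import math


def build_nnh(pivots, data, T):
    """Build an NNH: for each pivot, the sorted distances to its T nearest objects."""
    nnh = []
    for p in pivots:
        # heapq.nsmallest returns the T smallest distances in ascending order.
        nn_dists = heapq.nsmallest(T, (math.dist(p, o) for o in data))
        nnh.append((p, nn_dists))
    return nnh
```

T should be at least the largest k that queries will use, since H(p, k) reads the k-th entry of each vector.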
20 Size-constraint NNH construction
- Number of pivots m determines
- Storage size
- Initial construction cost
- Incremental-maintenance cost
- Choose the m best pivots
21 Size-constraint NNH construction
- Given m (number of pivots), assume
- query objects are from the database D
- H(pi,k) doesn't vary too much
- Goal: find pivots p1, p2, ..., pm to minimize object distances to the pivots
- Clustering problem
- Many algorithms available
- Use K-means for its simplicity and efficiency
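Pivot selection by K-means can be sketched with plain Lloyd iterations. This is a minimal sketch under assumptions: Euclidean distance, points as tuples, and evenly spaced data points as initial centers (chosen here only to keep the example deterministic; the slides do not say how K-means was initialized). The name `kmeans_pivots` is hypothetical.

```python
import math


def kmeans_pivots(data, m, iters=20):
    """Return m cluster centers of data, to be used as NNH pivots."""
    step = max(1, len(data) // m)
    centers = [tuple(x) for x in data[::step][:m]]  # assumed initialization
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:
            # Assign each object to its closest center.
            j = min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
            clusters[j].append(x)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

The returned centers are generally not database objects, which matches the earlier remark that pivots are not part of the database.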
22 Incremental Maintenance
- How to update the NNH when inserting or deleting objects?
- Need to shift each vector
- Associate a valid length Ei to each NN vector.
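The shifting and valid-length ideas above can be illustrated as follows. This is one interpretation of the two bullets, not necessarily the procedure in the paper: the list `dists` holds only the valid prefix of a pivot's NN vector, so its length plays the role of Ei. The class and method names are hypothetical, and distances are Euclidean.

```python
import bisect
import math


class NNVector:
    """Sorted NN distances of one pivot; len(self.dists) is the valid length E."""

    def __init__(self, pivot, dists, T):
        self.pivot = pivot
        self.T = T
        self.dists = sorted(dists)[:T]

    def insert(self, obj):
        """A new database object shifts later entries right if it is close enough."""
        d = math.dist(self.pivot, obj)
        if self.dists and d < self.dists[-1]:
            bisect.insort(self.dists, d)
            del self.dists[self.T:]  # drop the entry pushed beyond position T

    def delete(self, obj):
        """A removed object shifts later entries left; the valid length shrinks."""
        d = math.dist(self.pivot, obj)
        i = bisect.bisect_left(self.dists, d)
        if i < len(self.dists) and self.dists[i] == d:
            del self.dists[i]

    def needs_rebuild(self, k):
        """True when too few valid entries remain to answer H(p, k)."""
        return len(self.dists) < k
```

An insertion farther away than the last valid entry is ignored, because after deletions nothing is known about objects beyond that distance; when the valid length drops below the k in use, the vector has to be recomputed with a KNN query.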
23 Outline
- Main idea
- Utilizing NNH in a search (KNN, join)
- Construction and incremental maintenance
- Experiments
- Related work
24 Experiments
- Datasets
- Corel image database
- Contains 60,000 images
- Each image represented by a 32-dimensional float vector
- Time-series data from AT&T
- Similar trends. Report results for Corel data set
- Test bed
- PC: 1.5G Athlon, 512MB Mem, 80G HD, Windows 2000.
- GNU C in CYGWIN
25 Goal
- Is the pruning using NNH estimates powerful?
- KNN queries
- NN-join queries
- Is it cheap to have such a structure?
- Storage
- Initial construction
- Incremental maintenance
26 Improvement in k-NN search
- Ran k-means algorithm to generate 400 pivots for 60K objects, and constructed NNH
- Performed k-NN queries on 100 randomly selected query objects.
- Queue size to measure memory usage.
- Max queue size
- Average queue size
27 Reduced Memory Requirement
28 Reduced running time
29 Effects of different numbers of pivots
30 Join: Reduced Memory Requirement
31 Join: Reduced running time
32 Join: Running time for different data sizes
33 Cost/Benefit of NNH
For 60,000 float vectors (32-d).
Pivot # (m)                     10    50   100   150   200   250   300   350   400
Construction time (sec)        0.7  3.59   6.6   9.4  11.5  13.7  15.7  17.8  20.4
Storage space (kB)               2    10    20    30    40    50    60    70    80
Incr. maintenance time (ms)      0     0     0     0     0     0     0     0     0
Improved q-size (kNN) (%)       40    30    28    24    24    24    23    20    18
Improved q-size (join) (%)      45    34    28    26    26    25    24    24    22
0 means almost zero.
34 Conclusion
- NNH: an efficient, effective approach to improving NN-search performance.
- Can be easily embedded into current implementations of NN algorithms.
- Can be efficiently constructed and maintained.
- Offers substantial performance advantages.
35 Related work
- Summary histograms
- E.g., Jagadish et al VLDB98, Mattias et al VLDB00
- Objective: approximate frequency values
- NN Search algorithms
- Many algorithms developed
- Many of them can benefit from NNH
- Algorithms based on pivots/foci/anchors
- E.g., Omni (Filho et al, ICDE01), vantage objects (Vleugels et al, VIIS99), M-trees (Ciaccia et al, VLDB97)
- Choose pivots far from each other (to represent the intrinsic dimensionality)
- NNH: pivots depend on how clustered the objects are
- Experiments show the differences
36 Work conducted in the Flamingo Project on Data Cleansing at UC Irvine