Nearest Neighbor Retrieval Using Distance-Based Hashing - PowerPoint PPT Presentation

1 / 1
About This Presentation
Title:

Nearest Neighbor Retrieval Using Distance-Based Hashing

Description:

Nearest Neighbor Retrieval Using Distance-Based Hashing ... Experiments on several real-world data sets demonstrate that our method produces ... – PowerPoint PPT presentation

Number of Views:61
Avg rating:3.0/5.0
Slides: 2
Provided by: cspeo
Category:

less

Transcript and Presenter's Notes

Title: Nearest Neighbor Retrieval Using Distance-Based Hashing


1
Database Group
Nearest Neighbor Retrieval Using Distance-Based
Hashing Michalis Potamias and Panagiotis
Papapetrou supervised by Prof George Kollios
Analysis
  • Probability of collision between any two objects
  • Same probability on a k-bit hash table
  • Prob of collision in at least one of the l hash
    tables
  • Accuracy, i.e. the probability over all queries Q
    that we will retrieve the nearest neighbor N(Q)
  • LookupCost Expected number of objects that
    collide in at least one of the l hash tables
  • HashCost of distance computations to evaluate
    h-functions
  • Total Cost per query
  • Efficiency (for all Queries)
  • Use Sampling to estimate Accuracy and Efficiency
  • Sample Queries

Hash Based Indexing
  • Idea
  • Come up with hash functions that hash similar
    objects to similar buckets
  • Hash every database object to some buckets
  • At query time apply the same hash function to the
    query
  • Filter Retrieve the collisions. The rest of the
    database is pruned.
  • Refine Compute actual distances. Return the
    object with the smallest distance as the NN.

A method is proposed for indexing spaces with
arbitrary distance measures, so as to achieve
efficient approximate nearest neighbor retrieval.
Hashing methods, such as Locality Sensitive
Hashing (LSH), have been successfully applied for
similarity indexing in vector spaces and string
spaces under the Hamming distance. The key
novelty of the hashing technique proposed here is
that it can be applied to spaces with arbitrary
distance measures. First, we describe a
domain-independent method for constructing a
family of binary hash functions. Then, we use
these functions to construct multiple multi-bit
hash tables. We show that the LSH formalism is
not applicable for analyzing the behavior of
these tables as index structures. We present a
novel formulation, that uses statistical
observations from sample data to analyze
retrieval accuracy and efficiency for the
proposed indexing method. Experiments on several
real-world data sets demonstrate that our method
produces good trade-offs between accuracy and
efficiency, and significantly outperforms
VP-trees, which are a well-known method for
distance-based indexing.
Locality Sensitive Hashing
Additional Optimizations
Locality Sensitive Family of Functions Amplify
the gap between p1 and p2 Randomly pick l hash
vectors of k functions each. Probability of
collision in at least one of l hash tables
Problem
  • Hierarchical DBH
  • Rank Queries according to D(Q,N(Q)
  • Divide space into disjoint subsets (equi-height)
  • Train separate indices for each subset
  • Reduce Hash Cost
  • Use small number of pseudoline points
  • NEAREST NEIGHBOR Given a database S, a distance
    function D our task is for a previous unseen
    query q, locate a point p of the database such
    that the distance between q and every point o of
    the database is greater or equal than the
    distance between p and q.
  • COST MODEL Minimize number of Distance
    Computations
  • Computing D may be very expensive
  • Dynamic Time Warping for Time Series
  • Edit Distance Variants for DNA alignment
  • PROBLEM DEFINITION Define index structure to
    answer Nearest Neighbor queries efficiently
  • A SOLUTION Brute Force! Try them all and
    get the exact answer
  • OUR SOLUTION Are we willing to trade accuracy
    for efficiency ?

Experiments
H using Pseudoline Projections (HDBH)
Works on Arbitrary Space but is not Locality
Sensitive! Define a line projection function that
maps an arbitrary space into the real line
R Real valued ? Discrete valued Hash
tables should be balanced. Thus t1, t2 are
chosen from V
ACCURACY vs. EFFICIENCY How often is the actual
NN retrieved? How much time does NN retrieval
take?
Conclusion
  • General purpose
  • Distance is black box
  • Does not require metric properties
  • Statistical analysis is possible
  • Even when NN is not returned, a very close N is
    returned For many applications thats fine!!
  • Not sublinear in size of DB
  • Statistical (not probabilistic)
  • Need representative sample sets
  • Hands dataset .. actual performance was
    different than simulation ..
  • the training set was not representative!
Write a Comment
User Comments (0)
About PowerShow.com