Similarity Searching in High Dimensions via Hashing

Slides: 17

Provided by: Andy4188

Learn more at: https://cis.temple.edu

Transcript and Presenter's Notes

1
Similarity Searching in High Dimensions
via Hashing

The approach is intended for similarity searching
in high dimensions.
The idea behind the approach is that, since the
selection of features and the choice of distance
metric are rather heuristic, determining an
approximate nearest neighbor should suffice for
most practical purposes.

The basic idea is to hash the points from the
database so that the probability of collision is
much higher for objects that are close to each
other than for those that are far apart.
The need arose from the so-called curse of
dimensionality in large databases.
In high dimensions, the existing search techniques
degrade to linear search when an exact answer is
required.

The similarity search problem involves finding the
nearest (most similar) object in a given
collection of objects to a given query.
Typically the objects of interest are represented
as points in $\mathbb{R}^d$ and a distance metric
is used to measure the similarity of the objects.
The basic problem is to perform indexing or
similarity searching for query objects.

The problem arises because the present methods
are not entirely satisfactory for large d.
The approach is based on the idea that for most
applications an exact answer is not necessary.
It also provides the user with a time-quality
trade-off.
The above statements rest on the assumption that
searching for approximate answers is faster than
finding exact answers.

The technique is to use locality-sensitive
hashing (LSH) rather than space-partitioning
indexes.
The idea is to hash points using several hash
functions so as to ensure that, for each
function, the probability of collision is much
higher for objects that are close to each other
than for those that are far apart.
Then, one can determine near neighbors by hashing
the query point and retrieving the elements stored
in the buckets containing that point.

LSH (locality-sensitive hashing) previously
achieved a worst-case time of $O(d n^{1/\epsilon})$
for approximate nearest neighbor search over an
n-point database.
In the presented paper, the new technique improves
the worst-case running time to
$O(d n^{1/(1+\epsilon)})$, which is a significant
improvement.

8
Preliminaries

$l_p^d$ denotes the Euclidean space $\mathbb{R}^d$
under the $l_p$ norm, i.e., when the length of
the vector $(x_1, \ldots, x_d)$ is defined as
$(|x_1|^p + \cdots + |x_d|^p)^{1/p}$.
Further, $d(p,q)$ denotes the distance between the
points p and q in $l_p^d$.
We use $H^d$ to represent the Hamming metric space
of dimension d.
We use $d_H(p,q)$ to denote the Hamming distance.
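
As a small illustration of the notation above, the following Python sketch computes the l_p length of a vector, the induced distance between two points, and the Hamming distance between two binary vectors. The function names are chosen here for illustration, and taking d(p,q) to be the l_p length of p - q is the usual reading rather than something stated on the slide.

```python
from typing import Sequence


def lp_norm(x: Sequence[float], p: float) -> float:
    """Length of x under the l_p norm: (|x1|^p + ... + |xd|^p)^(1/p)."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def lp_distance(a: Sequence[float], b: Sequence[float], p: float) -> float:
    """Distance between a and b, taken as the l_p length of a - b."""
    return lp_norm([ai - bi for ai, bi in zip(a, b)], p)


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(1 for ai, bi in zip(a, b) if ai != bi)
```

For example, lp_distance((0, 0), (3, 4), 2) is the ordinary Euclidean distance between the two points.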

The general definition of the problem is to find
the K nearest points in the given database, where
K > 1.
Even for this K-NNS problem, our algorithm
generalizes to finding the K (> 1) approximate
nearest neighbors.
Here we wish to find K points $p_1, \ldots, p_K$
such that the distance of $p_i$ to the query q is
at most $(1+\epsilon)$ times the distance from the
ith nearest point to q.
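
The approximation condition can be checked directly on small inputs. The sketch below is only an illustration: it assumes the l_1 distance and compares the i-th closest returned point against the i-th closest point in the whole database. The names l1 and is_approx_knn are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[int, ...]


def l1(a: Point, b: Point) -> int:
    """l_1 distance, used here only as an example metric."""
    return sum(abs(x - y) for x, y in zip(a, b))


def is_approx_knn(database: Sequence[Point], query: Point,
                  answer: Sequence[Point], eps: float) -> bool:
    """True if the i-th returned point is within (1 + eps) times the
    distance of the i-th true nearest neighbor, for every i."""
    exact = sorted(l1(p, query) for p in database)
    found = sorted(l1(p, query) for p in answer)
    return all(f <= (1 + eps) * e for f, e in zip(found, exact))
```

With eps = 0 the check accepts only answers whose distances match the exact K nearest neighbors.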

10
The Algorithm

11
Locality Sensitive Hashing

The new algorithm is in many respects more
natural than the earlier ones: it does not
require that a bucket store only one point.
It has a better running time.
The analysis is generalized to the case of
secondary memory.

Let C be the largest coordinate in all points in
P.
Then we can embed P into the Hamming cube $H^{d'}$
with $d' = C \cdot d$, by transforming each point
$p = (x_1, \ldots, x_d)$ into a binary vector
$v(p) = \mathrm{Unary}_C(x_1) \ldots \mathrm{Unary}_C(x_d)$,
where $\mathrm{Unary}_C(x)$ denotes the unary
representation of x, i.e., a sequence of x zeroes
followed by C-x ones.
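
A minimal sketch of this embedding, assuming the coordinates are non-negative integers no larger than C (the function names unary and embed are chosen here). It follows the slide's convention of x zeroes followed by C-x ones. Under this encoding the Hamming distance between two embedded points equals the l_1 distance between the original points.

```python
from typing import List, Sequence


def unary(x: int, C: int) -> List[int]:
    """Unary code of x with length C: x zeroes followed by C - x ones."""
    if not 0 <= x <= C:
        raise ValueError("coordinate must lie in [0, C]")
    return [0] * x + [1] * (C - x)


def embed(point: Sequence[int], C: int) -> List[int]:
    """Concatenate the unary codes of all coordinates (length C * d)."""
    bits: List[int] = []
    for x in point:
        bits.extend(unary(x, C))
    return bits
```

For instance, the points (1, 3) and (3, 0) with C = 4 map to two 8-bit vectors that differ in exactly |1-3| + |3-0| = 5 positions.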

For an integer l, choose l subsets
$I_1, \ldots, I_l$ of $\{1, \ldots, d'\}$.
Let $p|_I$ denote the projection of vector p on
the coordinate set I, obtained by selecting the
coordinate positions given by I and concatenating
the bits in those positions.
Denote $g_j(p) = p|_{I_j}$.
For the preprocessing we store each $p \in P$ in
the buckets $g_j(p)$, for $j = 1, \ldots, l$.
As the total number of buckets may be large, we
compress the buckets by resorting to standard
hashing.
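
The preprocessing step can be sketched as follows. How the subsets are chosen is not stated on the slide. This sketch assumes each subset holds k positions sampled uniformly at random with replacement, with k a hypothetical parameter, and it uses 0-based positions. Points are referred to by their position in the input list.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Key = Tuple[int, ...]


def choose_subsets(d_prime: int, k: int, l: int, seed: int = 0) -> List[List[int]]:
    """Pick l subsets of bit positions, k positions each (assumed scheme)."""
    rng = random.Random(seed)
    return [[rng.randrange(d_prime) for _ in range(k)] for _ in range(l)]


def g(bits: Sequence[int], subset: Sequence[int]) -> Key:
    """Projection of a bit vector on the positions listed in subset."""
    return tuple(bits[i] for i in subset)


def build_indices(vectors: Sequence[Sequence[int]],
                  subsets: Sequence[Sequence[int]]) -> List[Dict[Key, List[int]]]:
    """One index per subset; each maps a bucket key g_j(p) to point ids."""
    indices: List[Dict[Key, List[int]]] = [defaultdict(list) for _ in subsets]
    for pid, bits in enumerate(vectors):
        for index, subset in zip(indices, subsets):
            index[g(bits, subset)].append(pid)
    return indices
```

Identical vectors always land in the same bucket of every index, and vectors that differ in few positions share a bucket whenever none of the sampled positions falls on a differing bit.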

Thus we use two levels of hashing.
The LSH maps the points into buckets gj(p) while
a standard hashing function maps the contents of
these buckets into a hash table of size M.
If a bucket in a given index is full, a new
point is not added to it; this is acceptable
because, with very high probability, the point
will be added to another index.
This saves the overhead of maintaining a link
structure.
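
The second level of hashing with fixed-capacity buckets might look like the sketch below. Python's built-in hash stands in for the standard hashing function, and the bucket capacity B and the class name BoundedIndex are assumptions made for this illustration. A full bucket simply rejects the new point, so no overflow chain is kept.

```python
from typing import List, Tuple

Key = Tuple[int, ...]


class BoundedIndex:
    """Hash table of M slots; each slot keeps at most B point ids."""

    def __init__(self, M: int, B: int) -> None:
        self.M = M
        self.B = B
        self.table: List[List[int]] = [[] for _ in range(M)]

    def slot(self, key: Key) -> int:
        # Stand-in for the standard hash function mapping g_j(p) to [0, M).
        return hash(key) % self.M

    def insert(self, key: Key, pid: int) -> bool:
        """Store pid under key; return False if the bucket is already full."""
        bucket = self.table[self.slot(key)]
        if len(bucket) >= self.B:
            return False  # dropped here; other indices are expected to hold it
        bucket.append(pid)
        return True

    def lookup(self, key: Key) -> List[int]:
        """Return the point ids stored in the bucket for key."""
        return list(self.table[self.slot(key)])
```

Because different keys can share a slot, a lookup may return points whose LSH key differs from the query's; candidates are filtered by actual distance afterwards.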

To process a query q, we search the indices
$g_1(q), \ldots, g_l(q)$ until we either encounter
at least $c \cdot l$ points or have used all l
indices.
The number of disk accesses is always bounded
above by l, the number of indices.
Let $p_1, \ldots, p_t$ be the points encountered
in the process.
For the output we return the nearest K points, or
fewer if the search did not find that many
points.
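
The query procedure can be sketched as below. The sketch assumes each index is a plain dictionary from a bucket key to a list of point ids, counts only distinct points toward the c·l limit, and ranks candidates by Hamming distance on the embedded vectors. All three are choices made here for illustration, as are the function names.

```python
from typing import Dict, List, Sequence, Tuple

Key = Tuple[int, ...]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    return sum(1 for x, y in zip(a, b) if x != y)


def query(q_bits: Sequence[int],
          vectors: Sequence[Sequence[int]],
          subsets: Sequence[Sequence[int]],
          indices: Sequence[Dict[Key, List[int]]],
          K: int, c: int) -> List[int]:
    """Return ids of up to K nearest points among the candidates found."""
    limit = c * len(indices)          # stop once c * l points were seen
    seen: List[int] = []
    seen_set = set()
    for subset, index in zip(subsets, indices):
        key = tuple(q_bits[i] for i in subset)   # g_j(q)
        for pid in index.get(key, []):
            if pid not in seen_set:
                seen_set.add(pid)
                seen.append(pid)
        if len(seen) >= limit:
            break
    seen.sort(key=lambda pid: hamming(vectors[pid], q_bits))
    return seen[:K]
```

At most one bucket is read per index, which matches the bound of l accesses per query.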

The principle behind our method is that the
probability of collision of two points p and q is
closely related to the distance between them.
In particular, the larger the distance, the
smaller the collision probability.
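
To make this relationship concrete, suppose (as an assumption, since the slides do not specify the sampling scheme) that each hash function reads k bit positions drawn uniformly and independently from the d' positions. Two points at Hamming distance D then agree on one sampled position with probability 1 - D/d', so they collide under one hash function with probability (1 - D/d')^k. The sketch below evaluates this, along with the chance of colliding in at least one of l indices.

```python
def collision_probability(distance: int, d_prime: int, k: int) -> float:
    """P[g(p) = g(q)] when g samples k of d_prime positions uniformly."""
    return (1.0 - distance / d_prime) ** k


def any_collision_probability(distance: int, d_prime: int,
                              k: int, l: int) -> float:
    """Probability that p and q share a bucket in at least one of l indices."""
    p = collision_probability(distance, d_prime, k)
    return 1.0 - (1.0 - p) ** l
```

Increasing k makes far points less likely to collide, while increasing l raises the chance that near points meet in at least one index.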
