Transcript and Presenter's Notes

Title: Dimension Reduction in the Hamming Cube (and its Applications)


1
Dimension Reduction in the Hamming Cube (and its
Applications)

Rafail Ostrovsky, UCLA (joint works with
Rabani, and with Kushilevitz and Rabani)

2
PLAN
  • Problem Formulations
  • Communication complexity game
  • What really happened? (dimension reduction)
  • Solutions to 2 problems
  • ANN
  • k-clustering
  • What's next?

3
Problem statements
  • Johnson-Lindenstrauss lemma: n points in a
    high-dim. Hilbert space can be embedded into an
    O(log n)-dim. subspace with small distortion
  • Q: how do we do it for the Hamming cube?
  • (we show how to avoid the impossibility result of
    Charikar-Sahai)

4
Many different formulations of ANN
  • ANN: approximate nearest neighbor search
  • (many applications in computational geometry,
    biology/stringology, IR, and other areas)
  • Here are several different formulations

5
Approximate Searching
  • Motivation: given a DB of names and a user with a
    target name, find whether any of the DB names are
    close to the target name, without doing a linear
    scan.

[Example: DB = Jon, Alice, Bob, Eve, Panconesi, Kate, Fred; query = A.Panconesi?]
6
Geometric formulation
  • Nearest Neighbor Search (NNS): given N blue
    points (and a distance function, say Euclidean
    distance in R^d), store all these points somehow

7
Data structure question
  • given a new red point, find the closest blue point.

Naive solution 1: store the blue points as-is and, when given a red point, measure distances to all blue points. Q: can we do better?
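
A minimal Python sketch of naive solution 1 (illustrative only; db is assumed to be a list of equal-length bit tuples):

    def hamming(x, y):
        # Hamming distance between two equal-length bit sequences
        return sum(a != b for a, b in zip(x, y))

    def linear_scan_nn(db, query):
        # naive solution 1: O(N * n) work per query
        return min(db, key=lambda p: hamming(p, query))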
8
Can we do better?
  • Easy in small dimensions (Voronoi diagrams)
  • Curse of dimensionality in high dimensions
  • KOR: we can get a good approximate solution
    efficiently!

9
Hamming Cube Formulation for ANN
  • Given a DB of N blue n-bit strings, process them
    somehow. Given an n-bit red string, find its ANN in
    the hypercube {0,1}^n
  • Naïve solution 2: pre-compute all (exponentially
    many) answers (we want small data structures!)

[Example: DB = 00101011, 01011001, 11101001, 10110110, 11010101, 11011000, 10101010, 10101111; query = 11010100?]
10
The clustering problem that I'll discuss in detail
  • k-clustering

11
An example of clustering: find centers
  • Given N points in R^d

12
A clustering formulation
  • Find cluster centers

13
Clustering formulation
  • The cost is the sum of distances

14
Main technique
  • First, as a communication game
  • Second, interpreted as a dimension reduction

15
COMMUNICATION COMPLEXITY GAME
  • Given two players, Alice and Bob:
  • Alice is secretly given string x
  • Bob is secretly given string y
  • they want to estimate the Hamming distance between x
    and y with small communication (and small
    error), provided that they have common randomness
  • How can they do it? (say x and y have length n)
  • Much easier: how do we check whether x = y?

16
Main lemma: an abstract game
  • How can Alice and Bob estimate the Hamming distance
    between X and Y with small CC?
  • We assume Alice and Bob share randomness

[Figure: Alice holds X = X1 X2 X3 X4 … Xn; Bob holds Y = Y1 Y2 Y3 Y4 … Yn.]
17
A simpler question
  • To estimate the Hamming distance between X and Y
    (within (1+ε)) with small CC, it is sufficient for
    Alice and Bob, for any L, to be able to distinguish
  • H(X,Y) < L, OR
  • H(X,Y) > (1+ε)L
  • Q: why does sampling not work? (A shared random
    coordinate hits a differing bit only with probability
    H(X,Y)/n, so for small L roughly n/(ε²L) samples are
    needed, far more than polylogarithmic communication.)

18
Alice and Bob pick the SAME n-bit blue R: each
bit of R is 1 independently with probability 1/(2L)
[Figure: both parties hold the same R = 0 1 0 0 0 1 0 0 1 0 0; Alice holds X = 0 1 0 1 0 0 0 1 0 1 0 and Bob holds Y = 0 1 0 1 1 1 0 1 0 1 0.]
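
A minimal Python sketch of one round of this test (a common seed stands in for the shared randomness; all names are illustrative):

    import random

    def shared_r(seed, n, L):
        # both parties derive the SAME R: each bit is 1 with prob. 1/(2L)
        rng = random.Random(seed)
        return [1 if rng.random() < 1.0 / (2 * L) else 0 for _ in range(n)]

    def parity_bit(bits, r):
        # XOR of the coordinates selected by R; one bit of communication
        return sum(b for b, ri in zip(bits, r) if ri) % 2

Alice sends parity_bit(X, R) to Bob, who compares it with parity_bit(Y, R).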
19
What is the difference in probabilities between
H(X,Y) < L and H(X,Y) > (1+ε)L?
[Figure: each party XORs the bits of its string selected by R, producing one output bit (0/1) per side.]
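
The gap can be made explicit (a standard parity calculation, added here for intuition). The two output bits differ exactly when R selects an odd number of the H = H(X,Y) differing coordinates, each selected independently with probability p = 1/(2L):

    \Pr[\text{output bits differ}]
      = \frac{1 - (1 - 2p)^{H}}{2}
      = \frac{1 - (1 - 1/L)^{H}}{2}
      \approx \frac{1 - e^{-H/L}}{2}

So for H(X,Y) < L the probability is below (1 - e^{-1})/2 ≈ 0.316, while for H(X,Y) > (1+ε)L it is above roughly (1 - e^{-(1+ε)})/2: the two cases are separated by a gap of Θ(ε).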
20
How do we amplify?
21
How do we amplify? Repeat, with many
independent R's, but with the same distribution!
22
A refined game with small communication
  • How can Alice and Bob distinguish between
  • H(X,Y) < L, OR
  • H(X,Y) > (1+ε)L ?

BOB: Y1 Y2 Y3 Y4 … Yn. For each R, XOR the same subset of the Yi; compare the outputs.
ALICE: X1 X2 X3 X4 … Xn. For each R, XOR the subset of the Xi selected by R; compare the outputs.
Pick (1/ε²)·log N R's with the correct distribution and compare the outputs of this linear transformation.
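
Concretely, the chosen R's define a random linear map over GF(2) from the big cube {0,1}^n into a small cube {0,1}^m. A hedged Python sketch, reusing shared_r and parity_bit from the earlier sketch (the constant inside m is illustrative):

    import math

    def make_embedding(seed, n, L, eps, N):
        # m = O(log N / eps^2) rows, each with bit-density 1/(2L)
        m = int(math.ceil(math.log(max(N, 2)) / (eps * eps)))
        rows = [shared_r(hash((seed, j)), n, L) for j in range(m)]
        def embed(x):
            # output bit j is <R_j, x> mod 2, so the map is linear over GF(2)
            return tuple(parity_bit(x, row) for row in rows)
        return embed

Alice and Bob each send the m-bit image of their string; the fraction of positions where the images disagree distinguishes H(X,Y) < L from H(X,Y) > (1+ε)L with high probability.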
23
Dimension Reduction in the Hamming Cube [OR]
For each L, we can pick O(log N) R's and boost the probabilities! Key property: we get an embedding from the large cube to a small cube that preserves ranges around L very well.
24
Dimension Reduction in the Hamming Cube [OR]
For each L, we can pick O(log N) R's and boost the probabilities! Key property: we get an embedding from the large cube to a small cube that preserves ranges around L. Key idea in applications: we can build an inverse lookup table for the small cube!
25
Applications
  • Applications of the dimension reduction in the
    Hamming cube:
  • for ANN in the Hamming cube and R^d
  • for k-clustering

26
Application to ANN in the Hamming Cube
  • For each possible L, build a small cube and
    project the original DB into it
  • Pre-compute an inverse table for each entry of the
    small cube.
  • Why is this efficient?
  • How do we answer any query?
  • How do we navigate between different L?

27
Putting it all together: user's private approximate
search from a DB
  • Each projection is O(log N) R's. The user picks many
    such projections for each L-range. That defines
    all the embeddings.
  • Now, the DB builds inverse lookup tables for each
    projection, as new DBs for each L.
  • The user can now project its query into the small
    cube and use binary search on L (see the sketch
    below)
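
A much-simplified sketch of these steps (illustrative only: the real KOR structure precomputes an inverse lookup table over the entire small cube, whereas here a dictionary of embedded DB points plus a scan over its keys stands in, and thresh is a stand-in parameter; make_embedding and hamming are from the earlier sketches):

    def build_tables(db, n, eps, N, scales, seed=0):
        # one embedding and one table of embedded DB points per scale L
        tables = {}
        for L in scales:  # e.g. scales = 1, 2, 4, ..., n
            embed = make_embedding(hash(('scale', seed, L)), n, L, eps, N)
            table = {}
            for p in db:
                table.setdefault(embed(p), []).append(p)
            tables[L] = (embed, table)
        return tables

    def probe(tables, query, L, thresh):
        # is some DB point at distance about L? (a test in the small cube)
        embed, table = tables[L]
        q = embed(query)
        for code, points in table.items():
            if hamming(code, q) <= thresh:
                return points[0]
        return None

A query then binary-searches over the scales L, returning the point found at the smallest scale that answers positively.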

28
MAIN THM [KOR]
  • We can build a poly-size data structure that answers
    ANN queries on high-dimensional data in time
    polynomial in d and poly-log in N
  • for the Hamming cube
  • L_1
  • L_2
  • the square of the Euclidean distance
  • [IM] had similar results, with a slightly weaker
    guarantee.

29
Dealing with R^d
  • Project onto random lines, choose cut points
  • Well, not exactly: we also need navigation (see the
    sketch below)
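
One way to read this step (a hedged sketch, not the exact KOR construction): project each point onto random directions and threshold against random cut points, so two points disagree on a bit exactly when a cut point falls between their projections:

    import random

    def line_cut_embedding(d, m, spread, seed=0):
        # m bits: a random Gaussian direction plus a random cut point each
        rng = random.Random(seed)
        dirs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
        cuts = [rng.uniform(-spread, spread) for _ in range(m)]
        def embed(x):
            # bit j records which side of cut j the projection of x falls on
            return tuple(
                1 if sum(v * xi for v, xi in zip(dirs[j], x)) > cuts[j] else 0
                for j in range(m))
        return embed

For points whose projections stay inside (-spread, spread), the probability that bit j separates x and y is |<v_j, x-y>| / (2·spread), which in expectation is proportional to the Euclidean distance between x and y; the navigation caveat on the slide is about making this work across all distance scales at once.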

30
Clustering
  • Huge number of applications (IR, mining, analysis
    of statistical data, biology, automatic taxonomy
    formation, the web, topic-specific data collections,
    etc.)
  • Two independent issues:
  • Representation of data
  • Forming clusters (many incomparable methods)

31
Representation of data: examples
  • Latent semantic indexing yields points in R^d with
    the l_2 distance (distance indicating similarity)
  • The min-wise permutation approach (Broder et al.)
    yields points in the Hamming metric
  • Many other representations from the IR literature
    lead to other metrics, including the edit-distance
    metric on strings
  • Recent news: [OR'05] showed that we can embed the
    edit-distance metric into l_1 with small distortion:
    exp(sqrt(log n · log log n))

32
Geometric clustering: examples
  • Min-sum clustering in R^d: form clusters s.t. the
    sum of intra-cluster distances is minimized
  • k-clustering: pick k centers in the ambient
    space. The cost is the sum of distances from each
    data point to the closest center
  • Agglomerative clustering (form clusters below
    some distance threshold)
  • Q: which is better?

33
Methods are (in general) incomparable
34
Min-SUM
35
2-Clustering
36
A k-clustering problem: notation
  • N = number of points
  • d = dimension
  • k = number of centers

37
About k-clustering
  • When k is fixed, this is easy for small d
  • Kleinberg, Papadimitriou, Raghavan: NP-complete
    for k=2 for the cube
  • Drineas, Frieze, Kannan, Vempala, Vinay:
    NP-complete for R^d for the square of the Euclidean
    distance
  • When k is not fixed, this is facility location
    (Euclidean k-median)
  • For fixed d but growing k, a PTAS was given by
    Arora, Raghavan, Rao (using dynamic prog.)
  • (this talk) OR: a PTAS for fixed k, arbitrary d

38
Common tools in geometric PTAS
  • Dynamic programming
  • Sampling: Schulman, AS, DLVK
  • DFKVV: use SVD
  • Embeddings/dimension reduction seem useless,
    because:
  • there are too many candidate centers
  • they may introduce new centers

39
OR k-clustering result
  • A PTAS for fixed k, for:
  • the Hamming cube {0,1}^d
  • l_1^d
  • l_2^d (Euclidean distance)
  • the square of the Euclidean distance

40
Main ideas
  • For 2-clustering, finding a good partition is as
    good as solving the problem
  • Switch to the cube
  • Try partitions in the embedded low-dimensional
    data set
  • Given a partition, compute centers and cost in
    the original data set
  • Embedding/dim. reduction is used to reduce the
    number of candidate partitions

41
Stronger property of the OR dimension reduction
  • Our random linear transformation preserves ranges!

42
THE ALGORITHM
43
The algorithm yet again
  • Guess the 2-center distance
  • Map to the small cube
  • Partition in the small cube
  • Measure the partition in the big cube (see the
    sketch below)
  • THM: this gets within (1+ε) of optimal.
  • Disclaimer: a PTAS is (almost never) practical;
    this shows feasibility only, and more ideas are
    needed for a practical solution.
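
The "measure" step can be made concrete: in the big Hamming cube the optimal center of a fixed cluster is the coordinate-wise majority, so scoring a candidate 2-partition is cheap. A sketch of the measuring step only (reusing hamming from the earlier sketch; this is not the full PTAS):

    def majority_center(cluster):
        # optimal Hamming center of a non-empty cluster:
        # per-coordinate majority vote
        n = len(cluster[0])
        return tuple(
            1 if 2 * sum(p[i] for p in cluster) > len(cluster) else 0
            for i in range(n))

    def partition_cost(part_a, part_b):
        # cost of the 2-clustering induced by a partition of the points
        ca, cb = majority_center(part_a), majority_center(part_b)
        return (sum(hamming(p, ca) for p in part_a) +
                sum(hamming(p, cb) for p in part_b))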

44
Dealing with k > 2
  • The apex of a tournament is a node of max out-degree
  • Fact: the apex has a path of length at most 2 to
    every node
  • Guess all (k choose 2) center distances
  • Embed into (k choose 2) small cubes
  • Guess the center-projections in the small cubes
  • For every point, for every pair of centers,
    define a tournament: which center is closer in
    the projection
  • Every point is assigned to the apex of its center
    tournament (see the helper sketch below)
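
The apex fact is easy to check in code (a small illustrative helper; beats[i][j] == True means i beats j in the tournament):

    def apex(beats):
        # the apex: a node of maximum out-degree
        k = len(beats)
        return max(range(k),
                   key=lambda i: sum(1 for j in range(k) if beats[i][j]))

    def reaches_within_two(beats, a):
        # fact from the slide: the apex reaches every node in <= 2 steps
        k = len(beats)
        return all(j == a or beats[a][j] or
                   any(beats[a][w] and beats[w][j] for w in range(k))
                   for j in range(k))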

45
Conclusions
  • Dimension reduction in the cube allows us to deal
    with huge numbers of incomparable attributes.
  • Embeddings of other metrics into the cube allow
    fast ANN for those metrics
  • Real applications still require considerable
    additional ideas
  • A fun area to work in