Compact Data Representations and their Applications


1
Compact Data Representations and their
Applications
  • Moses Charikar
  • Princeton University

2
Sketching Paradigm
  • Construct compact representation (sketch) of data
    such that
  • Interesting functions of data can be computed
    (estimated) from compact representation

3
Why care about compact representations?
  • Practical motivations
  • Algorithmic techniques for massive data sets
  • Compact representations lead to reduced space,
    time requirements
  • Make impractical tasks feasible
  • Theoretical Motivations
  • Interesting mathematical problems
  • Connections to many areas of research

4
Questions
  • What is the data?
  • What functions do we want to compute on the
    data?
  • How do we estimate functions on the sketches?
  • Different considerations arise from different
    combinations of answers
  • Compact representation schemes are functions of
    the requirements

5
What is the data?
  • Sets, vectors, points in Euclidean space, points
    in a metric space, vertices of a graph.
  • Mathematical representation of objects (e.g.
    documents, images, customer profiles, queries).

6
What functions do we want to compute on the data?
  • Local functions: pairs of objects, e.g. distance
    between objects
  • Sketch of each object, such that function can be
    estimated from pairs of sketches
  • Global functions: entire data set, e.g.
    statistical properties of data
  • Sketch of entire data set, ability to update,
    combine sketches

7
Local functions: distance/similarity
  • Distance is a general metric, i.e. satisfies
    triangle inequality
  • Normed space: x = (x1, x2, ..., xd), y = (y1, y2,
    ..., yd)
  • Other special metrics (e.g. Earth Mover Distance)

8
Estimating distance from sketches
  • Arbitrary function of sketches
  • Information theory, communication complexity
    question.
  • Sketches are points in normed space
  • Embedding original distance function in normed
    space [Bourgain 85] [Linial, London, Rabinovich
    94]
  • Original metric is (same) normed space
  • Original data points are high dimensional
  • Sketches are points in low dimensions
  • Dimension reduction in normed spaces [Johnson,
    Lindenstrauss 84]

9
Global functions
  • Statistical properties of entire data set
  • Frequency moments
  • Sortedness of data
  • Set membership
  • Size of join of relations
  • Histogram representation
  • Most frequent items in data set
  • Clustering of data

10
Streaming algorithms
  • Perform computation in one (or constant) pass(es)
    over data using a small amount of storage space
  • Availability of sketch function facilitates
    streaming algorithm
  • Additional requirements - sketch should allow
  • Update to incorporate new data items
  • Combination of sketches for different data sets

11
Goals
  • Glimpse of sketching techniques, especially in
    geometric settings.
  • Basic theoretical ideas, no messy details
  • Concrete application

12
Talk Outline
  • Dimension reduction
  • Similarity preserving hash functions
  • sketching vector norms
  • sketching Earth Mover Distance (EMD)
  • Application to image retrieval

13
Low Distortion Embeddings
  • Given metric spaces (X1,d1) and (X2,d2), an
    embedding f: X1 → X2 has distortion D if the ratio
    of distances changes by at most D
  • Dimension Reduction
  • Original space high dimensional
  • Make target space be of low dimension, while
    maintaining small distortion

14
Dimension Reduction in L2
  • n points in Euclidean space (L2 norm) can be
    mapped down to O((log n)/ε²) dimensions with
    distortion at most 1+ε [Johnson, Lindenstrauss
    84]
  • Two interesting properties
  • Linear mapping
  • Oblivious: choice of linear mapping does not
    depend on point set
  • Quite simple [JL84, FM88, IM98, DG99, Ach01].
    Even a random +1/-1 matrix works
  • Many applications
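
A minimal sketch of the random +1/-1 mapping mentioned on this slide, in plain Python. The function names (`random_sign_matrix`, `project`) and the sizes d = 1000, k = 200 are illustrative assumptions, not taken from the talk; k is simply picked large enough to make the effect visible and is not derived from the O((log n)/ε²) bound.

```python
import math
import random

def random_sign_matrix(k, d, rng):
    """k x d matrix of independent +1/-1 entries, scaled by 1/sqrt(k)."""
    scale = 1.0 / math.sqrt(k)
    return [[scale * rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(k)]

def project(matrix, x):
    """Apply the linear map to a d-dimensional point x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in matrix]

def l2_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

rng = random.Random(0)
d, k, n = 1000, 200, 5          # illustrative sizes (assumption)
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

# Oblivious: the matrix is drawn without looking at the point set.
R = random_sign_matrix(k, d, rng)
sketches = [project(R, p) for p in points]

# Ratio of sketched distance to original distance for every pair.
ratios = [
    l2_distance(sketches[i], sketches[j]) / l2_distance(points[i], points[j])
    for i in range(n) for j in range(i + 1, n)
]
print(min(ratios), max(ratios))
```

With these sizes the printed ratios should stay close to 1, which is the small-distortion behaviour the slide describes.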

15
Dimension reduction for L1
  • [C, Sahai 02] Linear embeddings are not good for
    dimension reduction in L1
  • There exist O(n) points in L1 in n dimensions,
    such that any linear mapping with distortion δ
    needs n/δ² dimensions

16
Dimension reduction for L1
  • [C, Brinkman 03] Strong lower bounds for
    dimension reduction in L1
  • There exist n points in L1, such that any
    embedding with constant distortion δ needs
    n^(1/δ²) dimensions
  • Simpler proof by [Lee, Naor 04]
  • Does not rule out other sketching techniques

17
Talk Outline
  • Dimension reduction
  • Similarity preserving hash functions
  • sketching vector norms
  • sketching Earth Mover Distance (EMD)
  • Application to image retrieval

18
Similarity Preserving Hash Functions
  • Similarity function sim(x,y), distance d(x,y)
  • Family of hash functions F with probability
    distribution such that
    Pr_{h∈F}[h(x) = h(y)] = sim(x,y)

19
Applications
  • Compact representation scheme for estimating
    similarity
  • Approximate nearest neighbor search
    [Indyk, Motwani 98] [Kushilevitz, Ostrovsky,
    Rabani 98]

20
Relaxations of SPH
  • Estimate distance measure, not similarity measure
    in [0,1].
  • Measure E[f(h(x),h(y))].
  • Estimator will approximate distance function.

21
Sketching Set Similarity: Minwise Independent
Permutations
[Broder, Manasse, Glassman, Zweig 97]
[Broder, C, Frieze, Mitzenmacher 98]
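
A small illustration of sketching set similarity with random permutations. Under a uniformly random permutation of the universe, the probability that two sets share the same minimum element is |A ∩ B| / |A ∪ B|, so the fraction of agreeing coordinates estimates that ratio. The helper names, the universe size 1000 and the sketch length 400 are assumptions made for this example only; explicit permutations are used for clarity and would be replaced by hash functions on a large universe.

```python
import random

def make_permutations(universe_size, m, rng):
    """m independent uniformly random permutations of range(universe_size)."""
    perms = []
    for _ in range(m):
        p = list(range(universe_size))
        rng.shuffle(p)
        perms.append(p)
    return perms

def minhash_sketch(s, perms):
    """One coordinate per permutation: the smallest image of any element of s."""
    return [min(p[x] for x in s) for p in perms]

def estimate_similarity(sketch_a, sketch_b):
    """Fraction of coordinates on which the two sketches agree."""
    agree = sum(1 for u, v in zip(sketch_a, sketch_b) if u == v)
    return agree / len(sketch_a)

rng = random.Random(1)
perms = make_permutations(1000, 400, rng)

A = set(range(0, 300))
B = set(range(150, 450))
true_sim = len(A & B) / len(A | B)          # 150 / 450
est = estimate_similarity(minhash_sketch(A, perms), minhash_sketch(B, perms))
print(true_sim, est)
```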
22
Sketching L1
  • Design sketch for vectors to estimate L1 norm
  • Hash function to distinguish between small and
    large distances [KOR 98]
  • Map L1 to Hamming space
  • Bit vectors a = (a1, a2, ..., an) and
    b = (b1, b2, ..., bn)
  • Distinguish between distances ≤ (1-ε)n/k versus
    ≥ (1+ε)n/k
  • XOR random set of k bits
  • Pr[h(a) = h(b)] differs by constant in two cases
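
The XOR step above can be sketched as follows. This example assumes the k positions are drawn uniformly with replacement, which is one convenient way to read "random set of k bits" and is not necessarily the exact construction of [KOR 98]. Under that assumption, if a and b differ in D of n positions then Pr[h(a) = h(b)] = (1 + (1 - 2D/n)^k) / 2. All names and the parameters n = 1024, k = 16, ε = 0.5 are illustrative.

```python
import random

def collision_probability(n, k, dist):
    """Pr[h(a) = h(b)] for vectors at Hamming distance dist (positions drawn with replacement)."""
    return 0.5 * (1.0 + (1.0 - 2.0 * dist / n) ** k)

def make_xor_hash(n, k, rng):
    """Hash that XORs the bits at k randomly chosen positions."""
    positions = [rng.randrange(n) for _ in range(k)]
    def h(bits):
        out = 0
        for i in positions:
            out ^= bits[i]
        return out
    return h

def flip_prefix(bits, count):
    """Copy of bits with the first `count` positions flipped."""
    return [b ^ 1 if i < count else b for i, b in enumerate(bits)]

def empirical_collision_rate(a, b, k, trials, rng):
    """Fraction of independently drawn hashes on which a and b collide."""
    hits = 0
    for _ in range(trials):
        h = make_xor_hash(len(a), k, rng)
        hits += h(a) == h(b)
    return hits / trials

rng = random.Random(2)
n, k, eps = 1024, 16, 0.5
a = [rng.randrange(2) for _ in range(n)]
near = flip_prefix(a, int((1 - eps) * n / k))   # distance (1-eps)n/k
far = flip_prefix(a, int((1 + eps) * n / k))    # distance (1+eps)n/k

near_rate = empirical_collision_rate(a, near, k, 2000, rng)
far_rate = empirical_collision_rate(a, far, k, 2000, rng)
print(near_rate, far_rate)
```

The two printed rates should be separated by a constant gap, which is what lets a modest number of independent hash bits tell the two cases apart.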

23
Sketching L1 via stable distributions
  • a = (a1, a2, ..., an) and b = (b1, b2, ..., bn)
  • Sketching L2
  • f(a) = Σi ai Xi, f(b) = Σi bi Xi, with Xi
    independent Gaussian
  • f(a)-f(b) has Gaussian distribution scaled by
    ||a-b||2
  • Form many coordinates, estimate ||a-b||2 by taking
    L2 norm
  • Sketching L1
  • f(a) = Σi ai Xi, f(b) = Σi bi Xi, with Xi
    independent Cauchy distributed
  • f(a)-f(b) has Cauchy distribution scaled by
    ||a-b||1
  • Form many coordinates, estimate ||a-b||1 by taking
    median [Indyk 00] -- streaming applications
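
A rough sketch of the L1 case, assuming standard Cauchy variables generated by the inverse-CDF formula tan(π(U - 1/2)). Since the median of the absolute value of a standard Cauchy variable is 1, the median of the absolute coordinate differences estimates the L1 distance. The function names and the sizes n = 50, m = 1001 are assumptions chosen for illustration.

```python
import math
import random
import statistics

def cauchy_matrix(m, n, rng):
    """m x n matrix of independent standard Cauchy variables."""
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
            for _ in range(m)]

def sketch(matrix, v):
    """Each coordinate is f(v) = sum_i v_i X_i for one row of Cauchy variables."""
    return [sum(x * vi for x, vi in zip(row, v)) for row in matrix]

def estimate_l1(sketch_a, sketch_b):
    """Median of |f(a) - f(b)| over all sketch coordinates."""
    return statistics.median(abs(u - w) for u, w in zip(sketch_a, sketch_b))

rng = random.Random(3)
n, m = 50, 1001
a = [rng.uniform(-1.0, 1.0) for _ in range(n)]
b = [rng.uniform(-1.0, 1.0) for _ in range(n)]

C = cauchy_matrix(m, n, rng)
true_l1 = sum(abs(x - y) for x, y in zip(a, b))
est_l1 = estimate_l1(sketch(C, a), sketch(C, b))
print(true_l1, est_l1)
```

Because the sketch is linear, sketch(a) - sketch(b) equals the sketch of a - b, which is also what makes it easy to update as new data arrives in a stream.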

24
Earth Mover Distance (EMD)
[Figure: point sets P and Q, with EMD(P,Q)]
25
Bipartite/Bichromatic matching
  • Minimum cost matching between two sets of points.
  • Point weights ≡ multiple copies of points

Fast estimation of bipartite matching
[Agarwal, Varadarajan 04]
Goal: sketch point set to enable estimation of
min cost matching
26
Detour: Classification with pairwise
relationships [Kleinberg, Tardos 99]
27
LP Relaxation and Rounding
[Kleinberg, Tardos 99]
[Chekuri, Khanna, Naor, Zosin 01]
28
Approximating metrics by trees
29
EMD on trees: embedding into L1
suggested by Piotr Indyk
EMD(P,Q) = Σ_T l_T |w_T(P) - w_T(Q)|
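
A small worked example of the tree formula, reading T as a subtree hanging below an edge, l_T as the length of that edge, and w_T(P) as the total weight of P inside T. That reading, the parent-array encoding, and the requirement parent[v] < v are assumptions of this sketch. Each non-root node contributes one L1 coordinate l_T · w_T, so EMD on the tree becomes an L1 distance between two vectors.

```python
def tree_l1_vector(parent, length, weights):
    """One coordinate per non-root node v: (edge length above v) * (weight in subtree of v).

    parent[v] is the parent of node v, node 0 is the root, and parent[v] < v
    is assumed so that children can be processed before their parents.
    """
    n = len(parent)
    subtree = [weights.get(v, 0.0) for v in range(n)]
    for v in range(n - 1, 0, -1):
        subtree[parent[v]] += subtree[v]
    return [length[v] * subtree[v] for v in range(1, n)]

def l1_distance(u, w):
    return sum(abs(x - y) for x, y in zip(u, w))

# Tree: 0 is the root; 1 and 2 are children of 0; 3 and 4 are children of 1.
parent = [-1, 0, 0, 1, 1]
length = [0.0, 1.0, 2.0, 3.0, 1.0]   # length[v] = length of edge (v, parent[v])

P = {3: 1.0, 2: 1.0}                 # unit weights at nodes 3 and 2
Q = {4: 1.0, 0: 1.0}                 # unit weights at nodes 4 and 0

vP = tree_l1_vector(parent, length, P)
vQ = tree_l1_vector(parent, length, Q)
print(l1_distance(vP, vQ))
```

The best matching here pairs node 3 with node 4 (tree distance 4) and node 2 with node 0 (tree distance 2), for a cost of 6, which agrees with the L1 distance between the two vectors.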
30
EMD on general metrics
  • Approximate metric by probability distribution on
    trees
  • Sample tree from distribution and compute L1
    representation
  • EMD(P,Q) ≤ E[d(v(P),v(Q))] ≤ O(log n) EMD(P,Q)

31
Tree approximations for Euclidean points
distortion O(d log Δ) [Bartal 96, CCGGP 98]
proposed by [Indyk, Thaper 03] for estimating EMD
32
Talk Outline
  • Dimension reduction
  • Similarity preserving hash functions
  • sketching vector norms
  • sketching Earth Mover Distance (EMD)
  • Application to image retrieval

33
Motivation
  • Apply sketching techniques in complex setting
  • Compact data structures for high-quality and
    efficient image retrieval?
  • Evaluate effectiveness of sketching techniques
  • [Lv, C, Li 04]

34
Region Based Image Retrieval (RBIR)
  • Region representation
  • color (histogram, moments, Fourier coefficients)
  • position, shape
  • Region based image similarity measure
  • Independent best match: Blobworld, NETRA
  • each region in one image matched to best region
    in other
  • One-to-one match: Windsurf, WALRUS
  • one-to-one matching between two sets of regions
  • EMD match

35
Overview
36
Region Representation
  • Color moments
  • First three moments in HSV color space
  • → 9-D vector
  • Bounding box
  • Aspect ratio
  • Bounding box size
  • Area ratio
  • Region centroid
  • → 5-D vector
  • Weighted L1 distance

37
Addressing problems with EMD match
  • Region weights proportional to region size?
  • Big background has disproportionate effect
  • Ground distance = region distance?
  • Pair of different regions can have large effect
  • distance meaningless beyond certain point
  • EMD-match
  • Region weights: square root of region size
  • Ground distance: thresholded region distance

38
Compact region representation
  • 14-D region vectors → N×K bit vectors
  • Hamming distance ≈ (weighted) L1 distance
  • XOR groups of K bits → N bit vector
  • Hamming distance ≈ thresholded L1 distance

39
Thresholding distance by XORing bits
Number of bits XORed controls shape of flattened
curve
40
EMD embedding: combining region vectors
  • Pick random pattern (bit positions and bit
    values)
  • Add region weights matching pattern
  • M such patterns → M coordinates of image vector
  • Related to [Indyk, Thaper 04]

41
Evaluation Criteria
  • Effectiveness of EMD match
  • Compactness of data structures without
    compromising quality
  • Efficiency and effectiveness of embedding and
    filtering algorithm

42
Evaluation Methodology
  • 10,000 images
  • 32 queries with similar images identified:
    http://dbvis.inf.uni-konstanz.de/research/projects/SimSearch/effpics.html
  • Segmentation via JSEG: avg. regions 7.16, min
    1, max 57
  • Effectiveness measured by average precision:
    average of precision values at positions of k
    target images
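
The average precision measure above can be sketched as follows. The function name, the image identifiers, and the choice to count target images that never appear in the ranking as precision 0 are assumptions of this example, not details stated in the talk.

```python
def average_precision(ranked_ids, target_ids):
    """Average of the precision values at the rank positions of the target images."""
    targets = set(target_ids)
    if not targets:
        return 0.0
    hits = 0
    total = 0.0
    for position, image_id in enumerate(ranked_ids, start=1):
        if image_id in targets:
            hits += 1
            total += hits / position      # precision at this target's position
    # Targets missing from the ranking add 0 to the sum (assumption).
    return total / len(targets)

# Hypothetical ranking returned for one query; "a", "b", "c" are the targets.
ranking = ["a", "x", "b", "y", "c"]
score = average_precision(ranking, ["a", "b", "c"])
print(score)   # (1/1 + 2/3 + 3/5) / 3
```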

44
Effectiveness of EMD Match
SIMPLIcity avg. precision: 0.331
45
Region representation size
46
Effect of region bit vector size
47
Effect of number of random patterns (raw image
vector size)
48
Search quality: effect of image bit vector size
and filtering
49
Average query time: effect of image bit vector
size and filtering
50
Conclusions
  • Compact representations at the heart of several
    algorithmic techniques for large data sets
  • Compact representations tailored to applications
  • Effective for region based image retrieval
  • Many interesting research questions
  • sketching EMD over points in R²
  • upper bounds for dimension reduction in L1