1
Compact Data Representations and their
Applications
  • Moses Charikar
  • Princeton University

2
Lots and lots of data
  • AT&T
  • Information about who calls whom
  • What information can be gleaned from this data?
  • Network router
  • Sees high-speed stream of packets
  • Detect DoS attacks? Fair resource allocation?

3
Lots and lots of data
  • Google search engine
  • About 3 billion web pages
  • Many, many queries every day
  • How to process this data efficiently?
  • Eliminate near duplicate web pages
  • Query log analysis

4
Sketching Paradigm
  • Construct compact representation (sketch) of data
    such that
  • Interesting functions of data can be computed
    (estimated) from the compact representation

5
Why care about compact representations?
  • Practical motivations
  • Algorithmic techniques for massive data sets
  • Compact representations lead to reduced space and
    time requirements
  • Make impractical tasks feasible
  • Theoretical Motivations
  • Interesting mathematical problems
  • Connections to many areas of research

6
Questions
  • What is the data?
  • What functions do we want to compute on the data?
  • How do we estimate functions on the sketches?
  • Different considerations arise from different
    combinations of answers
  • Compact representation schemes are functions of
    the requirements

7
What is the data?
  • Sets, vectors, points in Euclidean space, points
    in a metric space, vertices of a graph.
  • Mathematical representation of objects (e.g.
    documents, images, customer profiles, queries).

8
What functions do we want to compute on the data?
  • Local functions: pairs of objects, e.g. distance
    between objects
  • Sketch of each object, such that the function can
    be estimated from pairs of sketches
  • Global functions: entire data set, e.g.
    statistical properties of data
  • Sketch of entire data set, ability to update and
    combine sketches

9
Local functions: distance/similarity
  • Distance is a general metric, i.e. it satisfies
    the triangle inequality
  • Normed space: x = (x1, x2, …, xd),
    y = (y1, y2, …, yd), d(x,y) = ||x − y||
  • Other special metrics (e.g. Earth Mover Distance)

10
Estimating distance from sketches
  • Arbitrary function of sketches
  • Information theory, communication complexity
    question
  • Sketches are points in a normed space
  • Embedding the original distance function in a
    normed space [Bourgain 85], [Linial, London,
    Rabinovich 94]
  • Original metric is a (same) normed space
  • Original data points are high dimensional
  • Sketches are points in low dimensions
  • Dimension reduction in normed spaces
    [Johnson, Lindenstrauss 84]

11
Streaming algorithms
  • Perform computation in one (or constant) pass(es)
    over data using a small amount of storage space
  • Availability of a sketch function facilitates a
    streaming algorithm
  • Additional requirements: the sketch should allow
  • Updates to incorporate new data items
  • Combination of sketches for different data sets

12
Global functions
  • Statistical properties of entire data set
  • Frequency moments
  • Sortedness of data
  • Set membership
  • Size of join of relations
  • Histogram representation
  • Most frequent items in data set
  • Clustering of data

13
Goals
  • A glimpse of compact representation techniques in
    the sketching and streaming domains
  • Basic ideas, no messy details

14
Talk Outline
  • Classical techniques: spectral methods
  • Dimension reduction
  • Similarity preserving hash functions
  • sketching vector norms
  • sketching Earth Mover Distance (EMD)

15
Spectral methods: approximating matrices
  • SVD (Singular Value Decomposition),
    LSI (Latent Semantic Indexing)
  • Related to PCA (Principal Component Analysis),
    MDS (MultiDimensional Scaling)

16
SVD: Matrix Factorization
X = U × Σ × VT
where X is m × n, U is m × r (basis), Σ is r × r
(singular values), and VT is r × n (representation)
Restrictions on representation: U, V orthonormal,
Σ diagonal
17
Matrix approximation
  • X = Σi ui si viT
  • X(k) = Σi=1..k ui si viT
  • X(k) is the best rank-k approximation to X: it
    minimizes Σij (xij − x(k)ij)² (see the sketch
    below)
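A minimal NumPy sketch of this truncation, with a random matrix standing in for X:

```python
import numpy as np

def rank_k_approx(X: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation: keep the top k singular triples."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

X = np.random.default_rng(0).random((50, 30))
Xk = rank_k_approx(X, k=5)
# Frobenius error = sqrt of the sum of the squared dropped singular values
print(np.linalg.norm(X - Xk))
```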

18
Dimension Reduction
Xr = U × Σr × VT
where Σr keeps only the top r singular values and
zeroes out the rest
The columns of Xr represent the docs, but in r << m
dimensions. Best rank-r approximation according to
the 2-norm.
19
Closely related notions
  • Singular Value Decomposition
  • Karhunen-Loeve (KL) Transform
  • Principal Component Analysis (PCA)
  • Latent Semantic Indexing (LSI)
  • Information retrieval

20
SVD complexity
  • O(min(nm², mn²))
  • Less work
  • if we want just eigenvalues
  • if we want the first k eigenvectors
  • if the matrix is sparse
  • Implemented in any linear algebra package
    (LINPACK, Matlab, S-Plus, Mathematica, …)

21
Applications
  • Image processing and compression
  • low rank approximation leads to compressed
    representation, noise reduction
  • Molecular dynamics
  • characterizing protein molecular dynamics
  • higher principal components correspond to large
    scale motions

22
Applications
  • Information retrieval
  • LSI (Latent Semantic Indexing): SVD applied to the
    term-document matrix
  • compute best rank k approximation
  • eigenvectors correspond to linguistic concepts
  • Gene expression data analysis
  • SVD is a useful preprocessing step
  • grouping genes by transcriptional response,
    grouping assays by expression profile

23
Microarray gene expression data
24
SVD applied to gene expression data
25
Information retrieval
  • X is term document matrix
  • m terms, n documents
  • entry (t,d) for term t and document d is a
    function of how many times t occurs in d
  • SVD of X gives a low dimensional representation Xr
  • Latent Semantic Indexing
  • XrT Xr is the matrix of document similarities
  • Columns of Xr represent the documents, but in
    r << m dimensions

26
Semi-precise intuition
  • We accomplish more than dimension reduction here
  • Docs with lots of overlapping terms stay together
  • Terms from these docs also get pulled together.
  • Thus "car" and "automobile" get pulled together
    because both co-occur in docs with "tires",
    "radiator", "cylinder", etc.

27
Query processing
  • View a query as a (short) doc
  • call it column 0 of Xr.
  • Now the entries in column 0 of XrTXr give the
    similarities of the query with each doc.
  • Entry (j,0) is the score of doc j on the query.
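A minimal NumPy sketch of this step; the random term-document matrix, the fold-in of the query via U, and the choice r = 20 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 100))          # term-document matrix: 500 terms, 100 docs
U, s, Vt = np.linalg.svd(X, full_matrices=False)

r = 20
Xr = np.diag(s[:r]) @ Vt[:r, :]     # columns: docs in r << m dimensions
q = rng.random(500)                 # query viewed as a (short) doc over terms
q_r = U[:, :r].T @ q                # fold the query into the r-dim space
scores = q_r @ Xr                   # "column 0" inner products with each doc
print(scores.argmax())              # highest-scoring document
```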

28
Talk Outline
  • Dimension reduction
  • Similarity preserving hash functions
  • sketching vector norms
  • sketching Earth Mover Distance (EMD)

29
Low Distortion Embeddings
  • Given metric spaces (X1,d1), (X2,d2), an embedding
    f: X1 → X2 has distortion D if the ratio of
    distances changes by at most a factor D
  • Dimension Reduction
  • Original space: high dimensional
  • Make the target space low dimensional, while
    maintaining small distortion

30
Dimension Reduction in L2
  • n points in Euclidean space (L2 norm) can be
    mapped down to O((log n)/ε²) dimensions with
    distortion at most 1+ε [Johnson, Lindenstrauss 84]
  • Two interesting properties
  • Linear mapping
  • Oblivious: the choice of linear mapping does not
    depend on the point set
  • Quite simple [JL 84, FM 88, IM 98, DG 99, Ach 01]:
    even a random +1/−1 matrix works
  • Many applications (see the sketch below)
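A minimal sketch of such an oblivious projection with a random +1/−1 matrix; the target dimension k is an illustrative choice, not a tuned bound:

```python
import numpy as np

def jl_project(points: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Oblivious linear map: multiply by a random sign matrix, scaled so
    L2 norms (and pairwise distances) are preserved in expectation."""
    d = points.shape[1]
    R = np.random.default_rng(seed).choice([-1.0, 1.0], size=(d, k))
    return points @ R / np.sqrt(k)

pts = np.random.default_rng(1).normal(size=(100, 10_000))
low = jl_project(pts, k=500)
# one pairwise distance, before and after projection
print(np.linalg.norm(pts[0] - pts[1]), np.linalg.norm(low[0] - low[1]))
```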

31
Dimension reduction for L1
  • [Charikar, Sahai 02]: Linear embeddings are not
    good for dimension reduction in L1
  • There exist O(n) points in L1 in n dimensions such
    that any linear mapping with distortion D needs
    n/D² dimensions

32
Dimension reduction for L1
  • [Charikar, Brinkman 03]: Strong lower bounds for
    dimension reduction in L1
  • There exist n points in L1 such that any embedding
    with constant distortion D needs n^(1/D²)
    dimensions
  • Simpler proof by [Lee, Naor 04]
  • Does not rule out other sketching techniques

33
Talk Outline
  • Dimension reduction
  • Similarity preserving hash functions
  • sketching vector norms
  • sketching Earth Mover Distance (EMD)

34
Similarity Preserving Hash Functions
  • Similarity function sim(x,y), distance d(x,y)
  • Family of hash functions F with a probability
    distribution such that
    Prh∈F [h(x) = h(y)] = sim(x,y)

35
Applications
  • Compact representation scheme for estimating
    similarity
  • Approximate nearest neighbor search
    [Indyk, Motwani 98],
    [Kushilevitz, Ostrovsky, Rabani 98]

36
Relaxations of SPH
  • Estimate a distance measure, not a similarity
    measure in [0,1]
  • Measure E[f(h(x), h(y))]
  • Estimator will approximate the distance function

37
Sketching Set Similarity: Minwise Independent
Permutations
[Broder, Manasse, Glassman, Zweig 97]
[Broder, Charikar, Frieze, Mitzenmacher 98]
(see the sketch below)
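A minimal sketch of the min-wise hashing idea: Pr[min h(A) = min h(B)] equals the resemblance |A ∩ B| / |A ∪ B|, so the fraction of agreeing minima estimates set similarity. The seeded blake2b hashes stand in for random permutations:

```python
import hashlib

def h(seed: int, x: int) -> int:
    data = f"{seed}:{x}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=4).digest(), "big")

def minhash(s: set[int], num_hashes: int = 128) -> list[int]:
    """Sketch: the minimum hash value of the set under each hash function."""
    return [min(h(seed, x) for x in s) for seed in range(num_hashes)]

def estimate_resemblance(sk_a: list[int], sk_b: list[int]) -> float:
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

A, B = set(range(0, 80)), set(range(40, 120))
print(estimate_resemblance(minhash(A), minhash(B)))  # true value 40/120 ≈ 0.33
```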
38
Other similarity functions? [Charikar 02]
  • Necessary conditions for existence of similarity
    preserving hash functions.
  • SPH does not exist for Dice coefficient and
    Overlap coefficient.
  • SPH schemes from rounding algorithms
  • Hash function for vectors based on random
    hyperplane rounding.

39
Existence of SPH schemes
  • sim(x,y) admits an SPH scheme if ∃ a family of
    hash functions F such that
    Prh∈F [h(x) = h(y)] = sim(x,y)

40
  • Theorem: If sim(x,y) admits an SPH scheme, then
    1 − sim(x,y) satisfies the triangle inequality.
  • Proof

41
Non-existence of SPH
42
Stronger Condition
  • Theorem: If sim(x,y) admits an SPH scheme, then
    (1 + sim(x,y))/2 has an SPH scheme with hash
    functions mapping objects to {0,1}.
  • Theorem: If sim(x,y) admits an SPH scheme, then
    1 − sim(x,y) is isometrically embeddable in the
    Hamming cube.

43
  • For n vectors, the random hyperplane can be chosen
    using O(log² n) random bits
    [Indyk], [Engebretsen, Indyk, O'Donnell]
  • Alternate similarity measure for sets

44
Random Hyperplane Rounding based SPH
  • Collection of vectors
  • Pick a random hyperplane through the origin
    (defined by a random normal vector); hash each
    vector to the side of the hyperplane it falls on
  • [Goemans, Williamson] (see the sketch below)
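A minimal NumPy sketch of this hyperplane hash: each random normal contributes one bit, and Pr[bits differ] = angle(x,y)/π, so the fraction of differing bits estimates the angle:

```python
import numpy as np

def simhash(v: np.ndarray, normals: np.ndarray) -> np.ndarray:
    """One bit per hyperplane: the sign of the inner product with its normal."""
    return (normals @ v >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
normals = rng.normal(size=(256, 5))          # 256 random hyperplanes in R^5
x, y = rng.normal(size=5), rng.normal(size=5)
est = np.mean(simhash(x, normals) != simhash(y, normals)) * np.pi
true = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(est, true)                             # estimated vs. true angle
```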

45
Sketching L1
  • Design a sketch for vectors to estimate the L1
    norm
  • Hash function to distinguish between small and
    large distances [Kushilevitz, Ostrovsky, Rabani 98]
  • Map L1 to Hamming space
  • Bit vectors a = (a1, a2, …, an) and
    b = (b1, b2, …, bn)
  • Distinguish between distances ≤ (1−ε)n/k versus
    ≥ (1+ε)n/k
  • XOR a random set of k bits
  • Pr[h(a) = h(b)] differs by a constant between the
    two cases

46
Sketching L1 via stable distributions
  • a = (a1, a2, …, an) and b = (b1, b2, …, bn)
  • Sketching L2
  • f(a) = Σi ai Xi, f(b) = Σi bi Xi, with the Xi
    independent Gaussian
  • f(a) − f(b) has a Gaussian distribution scaled by
    ||a − b||2
  • Form many coordinates, estimate ||a − b||2 by
    taking the L2 norm
  • Sketching L1
  • f(a) = Σi ai Xi, f(b) = Σi bi Xi, with the Xi
    independent Cauchy distributed
  • f(a) − f(b) has a Cauchy distribution scaled by
    ||a − b||1
  • Form many coordinates, estimate ||a − b||1 by
    taking the median [Indyk 00]; streaming
    applications (see the sketch below)
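A minimal NumPy sketch of both constructions; the number of sketch coordinates r is an illustrative choice, and the L1 estimator uses the fact that the median of |standard Cauchy| samples is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 1000, 400                     # vector dimension, sketch coordinates
a, b = rng.random(n), rng.random(n)

# L2: Gaussian projections; each sketch coordinate of a-b is N(0, ||a-b||_2^2)
G = rng.normal(size=(r, n))
print(np.linalg.norm(G @ (a - b)) / np.sqrt(r), np.linalg.norm(a - b, 2))

# L1: Cauchy projections; each coordinate is Cauchy scaled by ||a-b||_1,
# so the median of the absolute values estimates ||a-b||_1
C = rng.standard_cauchy(size=(r, n))
print(np.median(np.abs(C @ (a - b))), np.linalg.norm(a - b, 1))
```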

47
Earth Mover Distance (EMD)
(figure: two point sets P and Q; EMD(P,Q) is the cost
of the min-cost matching between them)
48
Bipartite/Bichromatic matching
  • Minimum cost matching between two sets of points.
  • Point weights → multiple copies of points

Fast estimation of bipartite matching
[Agarwal, Varadarajan 04]
Goal: sketch a point set to enable estimation of the
min-cost matching
49
Approximating metrics by trees
50
EMD on trees: embedding into L1
(suggested by Piotr Indyk)
EMD(P,Q) = Σe le · |we(P) − we(Q)|, where the sum is
over tree edges e, le is the edge length, and we(P)
is the total weight of P in the subtree below e
51
EMD on general metrics
  • Approximate the metric by a probability
    distribution over trees
  • Sample a tree from the distribution and compute
    the L1 representation
  • EMD(P,Q) ≤ E[d(v(P), v(Q))] ≤ O(log n) · EMD(P,Q)

52
Tree approximations for Euclidean points
distortion O(d log Δ) [Bartal 96, CCGGP 98]
proposed by [Indyk, Thaper 03] for estimating EMD
53
Conclusions
  • Compact representations at the heart of several
    algorithmic techniques for large data sets
  • Compact representations tailored to applications
  • Effective for region-based image retrieval

54
ISOMAP and LLE
  • Nonlinear dimension reduction methods
  • Learn hidden structure in data
  • See the slides of Chan-Su Lee and Rong Xu from
    Michael Littman's course at Rutgers
  • http://www.cs.rutgers.edu/~mlittman/courses/lightai03/chansu.ppt
  • http://www.cs.rutgers.edu/~mlittman/courses/lightai03/rongxu.ppt

55
Compact Representations in Streaming Algorithms
  • Moses Charikar
  • Princeton University

56
Compact Representations in Streaming
  • Statistical properties of data streams
  • Distinct elements
  • Frequency moments, norm estimation
  • Frequent items

57
Frequency Moments [Alon, Matias, Szegedy 99]
  • Stream consists of elements from {1, 2, …, n}
  • mi = number of times i occurs
  • Frequency moment: Fk = Σi mi^k
  • F0 = number of distinct elements
  • F1 = size of stream
  • F2 = Σi mi²

58
Overall Scheme
  • Design an estimator (i.e. a random variable) with
    the right expectation
  • If the estimator is tightly concentrated, maintain
    a number of independent copies E1, E2, …, Er
  • Obtain the estimate E from E1, E2, …, Er
  • Within (1±ε) with probability 1−δ (see the
    combiner sketch below)
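A generic median-of-means combiner for this scheme, as a minimal sketch; the sample values below are illustrative:

```python
import statistics

def median_of_means(estimates: list[float], groups: int) -> float:
    """Means of groups shrink the variance; the median of the group means
    boosts the success probability (robust to a few wild copies)."""
    size = len(estimates) // groups
    means = [sum(estimates[g * size:(g + 1) * size]) / size
             for g in range(groups)]
    return statistics.median(means)

# 15 noisy copies of an estimator whose true value is 10.0 (one outlier)
copies = [9.1, 10.8, 9.9, 11.2, 8.7, 10.1, 9.5, 10.4,
          10.0, 9.8, 30.0, 9.7, 10.2, 9.9, 10.3]
print(median_of_means(copies, groups=5))
```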

59
Randomness
  • Design estimator assuming perfect hash functions,
    as much randomness as needed
  • Too much space required to explicitly store such
    a hash function
  • Fix later by showing that limited randomness
    suffices

60
Distinct Elements
  • Estimate the number of distinct elements in a
    data stream
  • Brute force solution: maintain a list of the
    distinct elements seen so far
  • Ω(n) storage
  • Can we do better?

61
Distinct Elements [Flajolet, Martin 83]
  • Pick a random hash function h: [n] → [0,1]
  • Say there are k distinct elements
  • Then the minimum value of h over the k distinct
    elements is around 1/k
  • Apply h() to every element of the data stream,
    maintaining the minimum value
  • Estimator: 1/minimum (see the sketch below)
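A minimal one-pass sketch of this estimator, averaging several independent copies; the seeded hash is a stand-in for a random function into [0,1):

```python
import hashlib

def h(seed: int, x: int) -> float:
    data = f"{seed}:{x}".encode()
    v = int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
    return v / 2**64                         # pseudo-random value in [0,1)

def count_distinct(stream, copies: int = 64) -> float:
    mins = [1.0] * copies                    # one minimum per independent copy
    for x in stream:                         # one pass, O(copies) space
        for s in range(copies):
            mins[s] = min(mins[s], h(s, x))
    # E[min] = 1/(k+1) for k distinct elements, so invert the average minimum
    return 1.0 / (sum(mins) / copies) - 1.0

stream = [i % 500 for i in range(10_000)]    # 500 distinct elements
print(count_distinct(stream))
```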

62
(Idealized) Analysis
  • Assume a perfectly random hash function
    h: [n] → [0,1]
  • S: a set of k elements of [n]
  • X = min a∈S h(a)
  • E[X] = 1/(k+1)
  • Var[X] = O(1/k²)
  • Mean of O(1/ε²) independent estimators is within
    (1±ε) of 1/k with constant probability

63
Analysis
  • [Alon, Matias, Szegedy]: Analysis goes through
    with a pairwise independent hash function
    h(x) = ax + b
  • 2-approximation
  • O(log n) space
  • Many improvements
    [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan]

64
Estimating F2
  • F2 = Σi mi²
  • Brute force solution: maintain counters for all
    distinct elements
  • Sampling?
  • n^(1/2) space

65
Estimating F2 [Alon, Matias, Szegedy]
  • Pick a random hash function h: [n] → {+1, −1}
  • hi = h(i)
  • Z = Σi mi hi
  • Z initially 0; add hi every time you see i
  • Estimator: X = Z² (see the sketch below)
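A minimal sketch of this estimator with a median-of-means combiner; the seeded hash below stands in for a 4-wise independent family (discussed on the randomness slide):

```python
import hashlib
import statistics

def sign(seed: int, x: int) -> int:
    data = f"{seed}:{x}".encode()
    return 1 if hashlib.blake2b(data, digest_size=1).digest()[0] & 1 else -1

def estimate_f2(stream, groups: int = 5, per_group: int = 16) -> float:
    counters = [0] * (groups * per_group)    # independent copies of Z
    for x in stream:                         # streaming update: Z += h(x)
        for s in range(len(counters)):
            counters[s] += sign(s, x)
    means = [sum(z * z for z in counters[g * per_group:(g + 1) * per_group])
             / per_group for g in range(groups)]
    return statistics.median(means)          # median of means over Z^2 copies

stream = [1] * 30 + [2] * 20 + [3] * 10
print(estimate_f2(stream), 30**2 + 20**2 + 10**2)  # true F2 = 1400
```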

66
Analyzing the F2 estimator
67
Analyzing the F2 estimator
  • Median of means gives a good estimator

68
What about the randomness?
  • The analysis only requires 4-wise independence of
    the hash function h
  • Pick h from a 4-wise independent family
  • O(log n) space representation, efficient
    computation of h(i)

69
Properties of the F2 estimator
  • Sketch of the data stream that allows computation
    of F2 = Σi mi²
  • Linear function of the mi
  • Can be added, subtracted
  • Given two streams with frequencies mi, ni:
    E[(Z1 − Z2)²] = Σi (mi − ni)²
  • Estimate the L2 norm of the difference
  • How about the L1 norm? Lp norm?

70
Stable Distributions
  • p-stable distribution D: if X1, X2, …, Xn are
    i.i.d. samples from D, then m1X1 + m2X2 + … + mnXn
    is distributed as ||(m1, m2, …, mn)||p · X
  • Defining property, up to a scale factor
  • Gaussian distribution is 2-stable
  • Cauchy distribution is 1-stable
  • p-stable distributions exist for all 0 < p ≤ 2

71
Estimating the Lp norm [Indyk 00]
  • Compute Z = m1X1 + m2X2 + … + mnXn
  • Distributed as ||(m1, m2, …, mn)||p · X
  • Estimate the scale factor of the distribution
    from Z1, Z2, …, Zr
  • Given i.i.d. samples from a p-stable distribution,
    how to estimate the scale?
  • Compute a statistical property of the samples and
    compare to that of the distribution

72
Estimating scale factor
  • Zi distributed as ||(m1, m2, …, mn)||p · X
  • Estimate the scale factor of the distribution
    from Z1, Z2, …, Zr
  • Mean(|Z1|, |Z2|, …, |Zr|) works for the Gaussian
    distribution
  • p = 1: the Cauchy distribution does not have a
    finite mean!
  • Median(|Z1|, |Z2|, …, |Zr|) works in this case
  • Note: the sketch is linear, so nice properties
    follow

73
What about the randomness?
  • Analog of 4-wise independence for the F2
    estimator?
  • Key insight: the p-stable based sketch computation
    is done in O(log n) space
  • Use a pseudo-random number generator that fools
    any space-bounded computation [Nisan 90]
  • The difference between using truly random and
    pseudo-random bits is negligible
  • Random seed of polylogarithmic size, efficient
    generation of the required pseudo-random variables

74
Talk Outline
  • Similarity preserving hash functions
  • Similarity estimation
  • Statistical properties of data streams
  • Distinct elements
  • Frequency moments, norm estimation
  • Frequent items

75
Variants of the F2 estimator [Alon, Gibbons, Matias,
Szegedy]
  • Estimate the join size of two relations:
    (m1, m2, …), (n1, n2, …)
  • Variance may be too high

76
Finding Frequent Items
[Charikar, Chen, Farach-Colton 02]
  • Goal
  • Given a data stream, return an approximate list
    of the k most frequent items in one pass and
    sub-linear space
  • Applications
  • Analyzing search engine queries, network
    traffic.

77
Finding Frequent Items
  • ai: the ith most frequent element
  • mi: its frequency
  • If we had an oracle that gave us exact
    frequencies, we could find the most frequent items
    in one pass
  • Solution
  • A data structure called a Count Sketch that gives
    good estimates of the frequencies of the high
    frequency elements at every point in the stream

78
Intuition
  • Consider a single counter X with a single hash
    function h: a → {+1, −1}
  • On seeing each element ai, update the counter:
    X += h(ai)
  • X = Σi mi h(ai)
  • Claim: E[X h(ai)] = mi
  • Proof idea: cross-terms cancel because of
    pairwise independence

79
Finding the max element
  • Problem with the single counter scheme: the
    variance is too high
  • Replace it with an array of t counters, using
    independent hash functions h1 … ht

h1: a → {+1, −1}
…
ht: a → {+1, −1}
80
Analysis of the array-of-counters data structure
  • Expectation still correct
  • Claim: variance of the final estimate < Σ mi² / t
  • Variance of each estimate < Σ mi²
  • Proof idea: cross-terms cancel
  • Set t = O(log n · Σ mi² / (ε m1)²) to get the
    answer with high probability
  • Proof idea: median of averages

81
Problem with the array-of-counters data structure
  • Variance of the estimator is dominated by the
    contribution of large elements
  • Estimates for important elements such as ak get
    corrupted by larger elements (variance much more
    than mk²)
  • To avoid collisions, replace each counter with a
    hash table of b counters to spread out the large
    elements (see the sketch below)
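A minimal sketch of the resulting Count Sketch structure: t hash tables of b counters, an update adds the item's sign to one bucket per table, and the frequency estimate is the median across tables. Parameters and hashes are illustrative:

```python
import hashlib
import statistics

def h(seed: int, x: int, mod: int) -> int:
    data = f"{seed}:{x}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(),
                          "big") % mod

class CountSketch:
    def __init__(self, t: int = 5, b: int = 256):
        self.t, self.b = t, b
        self.tables = [[0] * b for _ in range(t)]

    def add(self, x: int) -> None:
        for j in range(self.t):
            s = 1 if h(2 * j + 1, x, 2) else -1    # sign hash for table j
            self.tables[j][h(2 * j, x, self.b)] += s

    def estimate(self, x: int) -> float:
        vals = []
        for j in range(self.t):
            s = 1 if h(2 * j + 1, x, 2) else -1
            vals.append(s * self.tables[j][h(2 * j, x, self.b)])
        return statistics.median(vals)             # robust to collisions

cs = CountSketch()
for x in [7] * 1000 + list(range(2000)):           # 7 is the heavy hitter
    cs.add(x)
print(cs.estimate(7))                              # close to 1001
```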

82
In Conclusion
  • Simple, powerful ideas at the heart of several
    algorithmic techniques for large data sets
  • Sketches of data tailored to applications
  • Many interesting research questions

83
Content-Based Similarity Search
  • Moses Charikar
  • Princeton University
  • Joint work with
  • Qin Lv, William Josephson, Zhe Wang, Perry Cook,
    Matthew Hoffman, Kai Li

84
Motivation
  • Massive amounts of feature-rich digital data
  • Audio, video, digital photos, scientific sensor
    data
  • Noisy, high-dimensional
  • Traditional file systems/search tools inadequate
  • Exact match
  • Keyword-based search
  • Annotations
  • Need content-based similarity search

85
Motivation
  • Recent progress in theoretical studies of sketches
  • compact data representations for estimating
    pairwise similarity/distance
  • Compact data structures for high-quality and
    efficient content-based similarity search?

86
Compact representation
(figure: a complex object mapped to a compact sketch,
drawn as a bit vector)
  • Distance measured by (weighted) l1 distance:
    d(x,y) = Σi wi |xi − yi|
  • Better still, Hamming distance between bit
    vectors
  • Distance between sketches estimates distance
    between objects
  • Several theoretical constructions of sketches for
    sets, vectors, earth mover distance (EMD)

87
Outline
  • Motivation
  • System architecture
  • Implementation details
  • Segmentation & feature extraction
  • Sketch construction
  • Filtering
  • Indexing
  • Performance evaluation
  • Conclusions & future work

88
System Architecture
89
Similarity Search Engine Architecture
(figure: architecture with a pre-processing phase and
a query-time phase)
90
Similarity Search Problem
  • Similarity search: finding objects similar to a
    query object, i.e. containing similar features
  • Object representation
  • Distance function d (X, Y)
  • Nearest neighbor query
  • K-nearest neighbor (KNN)
  • Approximate nearest neighbor (ANN)

91
Object Representation & Distance Function
Earth Mover Distance (EMD)
92
Segmentation & Feature Extraction (1)
  • Derive a small set of features that characterize
    the important attributes of a data object
  • Data-dependent

93
Segmentation & Feature Extraction (1)
  • Image Data
  • JSEG image segmentation tool
  • Each segment is represented by a 14-dimensional
    feature vector
  • Color moments
  • First three moments in HSV color space
  • → 9-D vector
  • Bounding box
  • Aspect ratio, bounding box size, area ratio,
    region centroid
  • → 5-D vector
  • Segment weight ∝ square root of segment size
  • l1 distance between segments, EMD between images

94
Segmentation & Feature Extraction (2)
  • Audio Data
  • Phonetic segmentation & feature extraction using
    MARSYAS
  • Each segment:
  • 50 sliding windows × 6 MFCC parameters = 300
    dimensions
  • Segment weight ∝ segment length
  • Segment distance: l1 distance
  • Sentence distance: EMD

95
Segmentation & Feature Extraction (3)
  • 3D shape data
  • 32 decomposing spheres
  • Spherical harmonic descriptor (SHD)
  • Spherical harmonic coefficients up to order 16
  • 32 × 17 = 544 dimensions
  • l2 distance

96
Sketch Construction
  • Sketches: tiny data structures that can be used
    to estimate properties of the original data
  • High-dimensional feature vector → N×K bit vector
  • Hamming distance ≈ original feature vector
    distance
  • XOR groups of K bits → N bit vector
  • Hamming distance ≈ thresholded distance (see the
    sketch below)
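A minimal NumPy sketch of the XOR step: if each input bit differs between two sketches with probability p, the XOR of K bits differs with probability (1 − (1−2p)^K)/2, which saturates at 1/2 for large p and so thresholds the estimated distance:

```python
import numpy as np

def xor_compact(bits: np.ndarray, K: int) -> np.ndarray:
    """Collapse an N*K-bit vector to N bits: each output bit is the
    parity (XOR) of one group of K input bits."""
    return bits.reshape(-1, K).sum(axis=1) % 2

raw = np.random.default_rng(0).integers(0, 2, size=4096)  # stand-in sketch
print(xor_compact(raw, K=8).shape)                        # (512,) bits
```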

97
Thresholding distance by XORing bits
The number of bits XORed controls the shape of the
flattened curve
98
Filtering for Similarity Search
  • EMD computation is expensive
  • Filtering
  • Scans through the entire dataset
  • Uses a much faster distance function to filter
    out bad answers
  • Computes EMD for a much smaller candidate set
  • Criteria for picking candidate objects
  • Has at least one segment that is close enough to
    one of the top segments of the query object

99
Indexing for Similarity Search
  • a leveled tree where each level is a cover for
    the level beneath it
  • Nesting
  • Covering tree: for every node p at one level,
    there exists a node q at the level above that
    covers it, and exactly one such q is the parent
    of p
  • Separation: all nodes at the same level are
    well-separated from each other

100
Performance Evaluation
  • Can we achieve high-quality similarity search
    results at high speed?
  • How small can the sketches be as the metadata of
    the similarity search engine?
  • What are the performance tradeoffs of
  • Brute-force
  • Filtering
  • Indexing

101
Benchmarks
  • Search quality benchmark suite
  • VARY image: 10k images, 32 sets
  • TIMIT audio: 6,300 sentences, 450 sets
  • PSB shape: 1,814 models, 92 sets
  • Search speed benchmark suite
  • Mixed image dataset: 600k images
  • Mixed audio dataset: 60k sentences
  • Mixed shape dataset: 40k shape models

102
Search Quality Metrics
  • Given a query q with k similar objects
  • First-tier
  • Percentage of similar objects returned within
    rank k
  • Second-tier
  • Percentage of similar objects returned within
    rank 2k
  • Average precision
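A minimal sketch of the two tier metrics, with illustrative names (`ranked` is the result list excluding the query, `relevant` the set of objects similar to it):

```python
def tier(ranked: list[int], relevant: set[int], multiple: int = 1) -> float:
    """Fraction of the relevant objects found in the top multiple*k results."""
    k = len(relevant)
    top = ranked[:multiple * k]
    return sum(1 for x in top if x in relevant) / k

ranked = [3, 9, 1, 4, 7, 8, 2]
relevant = {1, 3, 7}
print(tier(ranked, relevant, 1))   # first-tier: within rank k
print(tier(ranked, relevant, 2))   # second-tier: within rank 2k
```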

103
Search Quality & Search Speed
104
Search Quality vs. Sketch Size
105
Brute Force, Filtering, Indexing
106
Conclusions & Future Work
  • A general purpose content-based similarity search
    system
  • high-quality similarity search with reasonably
    high speed
  • Using sketches reduces metadata size
  • Filtering & indexing speed up similarity search
  • Future work
  • More efficient distance functions than EMD
  • Further investigation of indexing data structures
  • More data types
  • video, genomic microarray data, other sensor data
