SDM06 Tutorial: Randomized Algorithms for Matrices and Massive Data Sets
(transcript of slides)

1
SDM06 Tutorial: Randomized Algorithms for
Matrices and Massive Data Sets
Michael W. Mahoney (Yahoo! Research)
Petros Drineas (CS, RPI)
Tutorial given at the SIAM Data Mining Meeting, April
22, 2006. Most recent copy available at
http://www.cs.yale.edu/homes/mmahoney and
http://www.cs.rpi.edu/drinep
2
Randomized Linear Algebra Algorithms
  • Goal: To develop and analyze (fast) Monte Carlo
    algorithms for performing useful computations on
    large (and later not-so-large!) matrices and
    tensors.
  • Matrix Multiplication
  • Computation of the Singular Value Decomposition
  • Computation of the CUR Decomposition
  • Testing Feasibility of Linear Programs
  • Least Squares Approximation
  • Tensor computations: SVD generalizations
  • Tensor computations: CUR generalizations
  • Such computations generally require time which is
    superlinear in the number of nonzero elements of
    the matrix/tensor, e.g., O(n^3) for n x n matrices.

3
Example: the CUR decomposition
Algorithmic Motivation: To speed up computations
in applications where extremely large data sets
are modeled by matrices and, e.g., O(n^2) space
and O(n^3) time is not an option.
Structural Motivation: To reveal novel structural
properties of the datasets, given sufficient
computational time, that are useful in applications.
Goal: make ||A - CUR|| small.
Why? (Algorithmic) After making two passes over
A, we can compute provably good C, U, and R and
store them (a "sketch") instead of A: O(m+n) vs.
O(n^2) space.
Why? Given a sample consisting of a few columns
(C) and a few rows (R) of A, we can compute U and
reconstruct A as CUR. If the sampling
probabilities are not too bad, we get provably
good accuracy.
Why? (Structural) Given sufficient time, we can
find C, U and R such that A - CUR is very
small. This might lead to better understanding
of the data.
4
Applications of such algorithms
  • Matrices arise, e.g., since m objects (documents,
    genomes, images, web pages), each with n
    features, may be represented by an m x n matrix
    A.
  • Covariance Matrices
  • Latent Semantic Indexing
  • DNA Microarray Data
  • Eigenfaces and Image Recognition
  • Similarity Queries
  • Matrix Reconstruction
  • LOTS of other data applications!!
  • More generally,
  • Linear and Nonlinear Programming Applications
  • Design of Approximation Algorithms
  • Statistical Learning Theory Applications

5
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

6
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

7
The Pass Efficient Model
  • Motivation: Amount of disk/tape space has
    increased enormously; RAM and computing speeds
    have increased less rapidly.
  • Can store large amounts of data, but
  • Cannot process these data with traditional
    algorithms.
  • In the Pass-Efficient Model
  • Data are assumed to be stored on disk/tape.
  • Algorithm has access to the data via a pass over
    the data.
  • Algorithm is allowed additional RAM space and
    additional computation time.
  • An algorithm is pass-efficient if it uses a small
    constant number of passes and sublinear
    additional time and space to compute a
    description of the solution.
  • Note: If the data are an m x n matrix A, then
    algorithms which require additional time and
    space that is O(m+n) or O(1) are pass-efficient.

8
Random Sampling
  • Random Sampling and Randomized Algorithms
  • Better complexity properties (randomization as a
    resource).
  • Simpler algorithms and/or analysis (maybe
    de-randomize later).
  • Uniform Sampling
  • Typically things work in expectation, but have
    poor variance properties.
  • Non-uniform Sampling
  • With "good" probabilities, we can make the variance
    small.

9
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

10
Approximating Matrix Multiplication
(D. & Kannan FOCS '01; D., Kannan, & M. TR
'04, SICOMP '06) Problem Statement: Given an
m-by-n matrix A and an n-by-p matrix B,
approximate the product AB, or, equivalently,
approximate the sum of n rank-one matrices:
  AB = Σ_{i=1..n} A^(i) B_(i),
where A^(i) is the i-th column of A and B_(i) is
the i-th row of B. Each term in the summation is
a rank-one matrix.
11
Approximating AB by random sampling
  • Algorithm
  • Fix a set of probabilities p_i, i = 1,...,n, summing
    up to 1.
  • For t = 1 up to s,
  • set j_t = i, where Pr(j_t = i) = p_i
  • (Pick s terms of the sum, with replacement, with
    respect to the p_i.)
  • Approximate AB by the sum of the s terms, after
    scaling.

12
Random sampling (cont'd)
Keeping the terms j_1, j_2, ..., j_s.
13
The algorithm (matrix notation)
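
A minimal NumPy sketch of this sampling-and-rescaling construction (our own
illustration; the function name and the example sizes are ours, not from the
slides):

import numpy as np

def approx_matmul(A, B, s, rng=None):
    # Sample s rank-one terms A^(j) B_(j) i.i.d. with the variance-minimizing
    # probabilities p_j proportional to ||A^(j)|| * ||B_(j)||, rescale each
    # kept term by 1/sqrt(s * p_j), and return C @ R, whose expectation is A @ B.
    rng = np.random.default_rng(rng)
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=s, replace=True, p=p)
    scale = 1.0 / np.sqrt(s * p[idx])
    C = A[:, idx] * scale            # m x s
    R = B[idx, :] * scale[:, None]   # s x p
    return C @ R

A = np.random.randn(1000, 2000)
B = np.random.randn(2000, 500)
AB_approx = approx_matmul(A, B, s=200)
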
14
Simple Lemmas
  • The input matrices are given in "sparse
    unordered" representation; e.g., their non-zero
    entries are presented as triples (i, j, A_ij) in
    any order.
  • The expectation of CR (element-wise) is AB.
  • Our nonuniform sampling minimizes the variance of
    the estimator.
  • It is easy to implement the sampling in two
    passes.
  • If the matrices are dense, the algorithm runs in
    O(smp) time, instead of O(nmp) time.
  • It requires O(sm + sp) RAM space.
  • It does not tamper with the sparsity of the matrices.

15
Error Bounds
For the above algorithm,
  E[ ||AB - CR||_F ] ≤ (1/√s) ||A||_F ||B||_F.
For the above algorithm, with probability at
least 1-δ, the same bound holds up to an extra
log(1/δ) factor.
  • This is a relative error bound if ||AB||_F =
    Ω(||A||_F ||B||_F), i.e. if there is not much
    cancellation in the multiplication.
  • We removed the expectation (by applying a
    martingale argument) and so have an extra
    log(1/δ) factor.
  • Markov's inequality would also remove the
    expectation, introducing an extra 1/δ factor.

16
Special case: B = A^T
If B = A^T, then the sampling probabilities are
  p_i = ||A^(i)||^2 / ||A||_F^2.
Also, R = C^T, and the error bounds become bounds
on ||AA^T - CC^T||_{2,F}.
17
Special case: B = A^T (cont'd)
(Rudelson & Vershynin '04, Vershynin '04)
Improvement of the spectral norm bound for the
special case B = A^T.
  • Uses a result of M. Rudelson for random vectors
    in isotropic position.
  • Tight concentration results can be proven using
    Talagrand's theory.
  • The sampling procedure is slightly different: s
    columns/rows are kept in expectation, i.e.,
    column i is picked (independently) with probability
    proportional to ||A^(i)||^2.
18
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

19
Singular Value Decomposition (SVD)
A = U Σ V^T.
U (V): orthogonal matrix containing the left
(right) singular vectors of A. Σ: diagonal matrix
containing the singular values of A.
  • Exact computation of the SVD takes O(min{mn^2,
    m^2n}) time.
  • The top few singular vectors/values can be
    approximated faster (Lanczos/Arnoldi methods).

20
Rank-k approximations (A_k)
  • U_k (V_k): orthogonal matrix containing the top k
    left (right) singular vectors of A.
  • Σ_k: diagonal matrix containing the top k
    singular values of A.
  • Also, A_k = U_k Σ_k V_k^T = U_k U_k^T A.

A_k is the matrix of rank k such that ||A - A_k||_{2,F} is
minimized over all rank-k matrices! This property
is very useful in the context of Principal
Component Analysis.
21
Approximating SVD in O(n) time
  • (D., Frieze, Kannan, Vempala & Vinay SODA '99,
    JML '04; D., Kannan, & M. TR '04, SICOMP '06)
  • Given an m x n matrix A:
  • Sample c columns from A and rescale to form the
    m x c matrix C.
  • Compute the m x k matrix H_k of the top k left
    singular vectors of C.

Structural Theorem: For any probabilities and
number of columns,
  ||A - H_k H_k^T A||_{2,F}^2 ≤ ||A - A_k||_{2,F}^2 + 2√k ||AA^T - CC^T||_F.
Algorithmic Theorem: If p_i = ||A^(i)||^2 / ||A||_F^2 and
c ≥ 4η^2 k / ε^2 (η a factor depending on the failure probability), then
  ||A - H_k H_k^T A||_{2,F}^2 ≤ ||A - A_k||_{2,F}^2 + ε ||A||_F^2.
Proof: via the matrix multiplication theorem and
matrix perturbation theory.
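
A minimal NumPy sketch of this column-sampling idea (an illustration under the
stated probabilities, not the authors' implementation; the function name and
the choices of c and k below are ours):

import numpy as np

def linear_time_svd(A, c, k, rng=None):
    # Sample c columns with probability proportional to their squared norms,
    # rescale so that E[C C^T] = A A^T, and return the top-k left singular
    # vectors H_k of the sampled matrix C.
    rng = np.random.default_rng(rng)
    sq_norms = np.linalg.norm(A, axis=0) ** 2
    p = sq_norms / sq_norms.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])
    Hk, _, _ = np.linalg.svd(C, full_matrices=False)
    return Hk[:, :k]

A = np.random.randn(512, 512)
Hk = linear_time_svd(A, c=100, k=10)
A_approx = Hk @ (Hk.T @ A)          # the rank-k approximation analyzed above
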
22
Example of randomized SVD
(Figure: the original matrix A, a 512 x 512 test image, and the matrix
obtained after sampling columns.)
Compute the top k left singular vectors of the
matrix C and store them in the 512-by-k matrix H_k.
23
Example of randomized SVD (contd)
(Figure: the original image A next to its approximation H_k H_k^T A.)
A and H_k H_k^T A are close.
24
Element-wise sampling
(Achlioptas & McSherry, STOC '01, JACM '05)
  • The Algorithm in 2 lines:
  • To approximate a matrix A, keep a few elements
    of the matrix (instead of rows or columns) and
    zero out the remaining elements.
  • Compute a rank-k approximation to this sparse
    matrix (using Lanczos methods).

||A - S||_2 is bounded ⇒ (i) the singular values of
A and S are close, and (ii, under additional
assumptions) the top k left (right) singular
vectors of S span a subspace that is close to the
subspace spanned by the top k left (right)
singular vectors of A.
25
Element-wise sampling (contd)
  • Approximating singular values fast:
  • Zero out (a large number of) elements of A,
    scale the remaining ones appropriately.
  • Compute the singular values of the resulting
    sparse matrix using iterative techniques.
  • (Good choice for p_ij: p_ij = s·A_ij^2 / Σ_{i,j} A_ij^2,
    where s denotes the expected number of elements
    that we seek to keep in S.)
  • Note: Each element is kept or discarded
    independently of the others.
  • Similar ideas have been used to:
  • explain the success of Latent Semantic Indexing
    (LSI) (Papadimitriou, Raghavan, Tamaki, Vempala,
    PODS '98; Azar, Fiat, Karlin, McSherry, Saia
    STOC '01)
  • design recommendation systems (Azar, Fiat,
    Karlin, McSherry, Saia STOC '01)
  • speed up kernel computations (Achlioptas,
    McSherry, and Schölkopf, NIPS '02)
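
A small NumPy/SciPy sketch of this element-wise sparsification (illustrative
only; the function name and matrix sizes are ours):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def elementwise_sparsify(A, s, rng=None):
    # Keep A_ij independently with probability p_ij = min(1, s*A_ij^2/sum(A^2)),
    # rescaling kept entries by 1/p_ij so that E[S] = A.
    rng = np.random.default_rng(rng)
    p = np.minimum(1.0, s * A**2 / np.sum(A**2))
    keep = rng.random(A.shape) < p
    S = np.where(keep, A / np.where(keep, p, 1.0), 0.0)
    return csr_matrix(S)

A = np.random.randn(1000, 800)
S = elementwise_sparsify(A, s=50_000)    # roughly 50K nonzeros kept in expectation
U, sigma, Vt = svds(S, k=10)             # iterative (Lanczos-type) SVD on the sparse sketch
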

26
Element-wise vs. row/column sampling
  • Quite different techniques!
  • Row/column sampling preserves subspace/structural
    properties of the matrices.
  • Element-wise sampling explains how adding noise
    and/or quantizing the elements of a matrix
    perturbs its singular values/vectors.
  • The two techniques should be complementary!
  • Some similarities and differences
  • Similar error bounds.
  • Element-wise sampling is doable in one pass!
  • Running time of element-wise sampling depends on
    the speed of, e.g., Arnoldi methods.
  • Element-wise methods do not seem amenable to
    many of the extensions we will present.

27
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

28
A novel CUR matrix decomposition
(D. & Kannan, SODA '03; D., Kannan, & M. TR '04,
SICOMP '06)
Create an approximation to the original matrix
of the following form: A ≈ C·U·R.
29
The CUR decomposition
Given a large m-by-n matrix A (stored on disk),
compute a decomposition CUR of A such that:
  • C consists of c = O(k/ε^2) columns of A.
  • R consists of r = O(k/ε^2) rows of A.
  • C (R) is created using importance sampling, e.g.,
    columns (rows) are picked in i.i.d. trials with
    respect to probabilities proportional to their
    squared Euclidean norms.
  • C, U, R can be stored in O(m+n) space, after
    making two passes through the entire matrix A,
    using O(m+n) additional space and time.
  • The product CUR satisfies (with high
    probability)
    ||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F.

30
Computing U
  • Intuition (which can be formalized; see later):
  • The CUR algorithm expresses every row of the
    matrix A as a linear combination of a small
    subset of the rows of A.
  • This small subset consists of the rows in R.
  • Given a row of A, say A_(i), the algorithm
    computes a good fit for the row A_(i) using the
    rows in R as the basis, by approximately solving
    min_u ||A_(i) - u R||_2.

But only c = O(1) elements of the i-th row are
given as input. So, we only approximate the
optimal vector u instead of computing it
exactly. Actually, the pass-efficient CUR
decomposition approximates the approximation.
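
A hedged NumPy sketch of the CUR idea (our own illustration: it samples by
squared norms and uses the Frobenius-optimal U = C^+ A R^+, which the
pass-efficient algorithm only approximates since it never touches all of A at
once):

import numpy as np

def cur_sketch(A, c, r, rng=None):
    # Sample columns/rows with probability proportional to squared norms,
    # then take U = pinv(C) @ A @ pinv(R), the best U for the chosen C and R.
    rng = np.random.default_rng(rng)
    pc = np.linalg.norm(A, axis=0) ** 2; pc = pc / pc.sum()
    pr = np.linalg.norm(A, axis=1) ** 2; pr = pr / pr.sum()
    cols = rng.choice(A.shape[1], size=c, replace=True, p=pc)
    rows = rng.choice(A.shape[0], size=r, replace=True, p=pr)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

A = np.random.randn(2000, 1500)
C, U, R = cur_sketch(A, c=60, r=60)
rel_err = np.linalg.norm(A - C @ U @ R, "fro") / np.linalg.norm(A, "fro")
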
31
Error bounds for CUR
Assume A_k is the best rank-k approximation to
A (through SVD). Then, if we pick O(k/ε^2) rows
and O(k/ε^2) columns,
  ||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F.
32
Previous CUR-type decompositions
(For details see Drineas & Mahoney, "A Randomized
Algorithm for a Tensor-Based Generalization of
the SVD," '05.)
33
Lower Bounds
Question: How many queries does a sampling
algorithm need to approximate a given function
accurately with high probability?
  • (Ziv Bar-Yossef '03, '04) Lower bounds for the
    low-rank matrix approximation problem and the
    matrix reconstruction problem.
  • Any sampling algorithm that w.h.p. finds a good
    low-rank approximation requires Ω(mn) queries.
  • Even if the algorithm is given the exact weight
    distribution over the columns of a matrix, it will
    still require Ω(k/ε^4) queries.
  • Finding a matrix D such that ||A - D||_F ≤ ε||A||_F
    requires Ω(mn) queries, and finding a D such
    that ||A - D||_2 ≤ ε||A||_2 requires Ω(mn) queries.
  • Applied to our results:
  • The LinearTimeSVD algorithm is optimal w.r.t.
    the Frobenius-norm bounds.
  • The ConstantTimeSVD algorithm is optimal w.r.t.
    the spectral-norm bounds up to poly factors.
  • The CUR algorithm is optimal for constant ε.

34
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

35
CUR application Data Mining
  • Database: An m-by-n matrix A, e.g., m (> 10^6)
    objects and n (> 10^5) features.
  • Queries: Given a new object x, find similar
    objects (nearest neighbors) in A.
  • Closeness: Two normalized objects x and d are
    close if x^T d = cos(x, d) is high.
  • Given a query vector x, the matrix product Ax
    computes all the angles/distances.
  • Key observation: The exact value x^T d might not
    be necessary.
  • The feature values in the vectors are set by
    coarse heuristics.
  • It is in general enough to see if x^T d >
    threshold.
  • Algorithm: Given a query vector x, compute CURx
    (instead of Ax) to identify nearest neighbors; see
    the sketch after this list.
  • Theorem: We have a bound on the worst-case error of
    using CUR instead of A.
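
A tiny illustration (ours) of answering such queries from a precomputed CUR
sketch instead of the full matrix: it assumes C, U, R come from a CUR
decomposition of the row-normalized object matrix (e.g., cur_sketch above),
and the threshold value is hypothetical.

import numpy as np

def approximate_neighbors(C, U, R, x, threshold=0.8):
    # scores approximate A @ x, but only the sketch (C, U, R) is touched
    scores = C @ (U @ (R @ x))
    return np.nonzero(scores > threshold)[0]
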

36
CUR application Genetic Microarray Data
Exploit structural properties of CUR in
biological applications.
(Figure: a genes-by-experimental-conditions data matrix.)
Find a good set of genes and arrays to include
in C and R. Provable and/or heuristic strategies
are acceptable.
  • Common in Biological/Chemical/Medical
    applications of PCA:
  • Explain the singular vectors by mapping them to
    meaningful biological processes.
  • This is a challenging task (think
    "reification")!
  • CUR is a low-rank decomposition in terms of the
    data that practitioners understand.
  • Use it to explain the data and do dimensionality
    reduction, classification, clustering.
  • Gene microarray data: M., D., & Alter (UT Austin)
    (sporulation and cell cycle data).

37
CUR application Recommendation Systems
(D., Raghavan, & Kerenidis, STOC '02)
  • The problem: m customers and n products; A_ij is
    the (unknown) utility of product j for customer
    i.
  • The goal: recreate A from a few samples to
    recommend high-utility products.
  • (KRRT98) Assuming strong clustering of the
    products, competitive algorithms exist even with only 2
    samples/customer.
  • (AFKMS01) Assuming sampling of Ω(mn) entries of
    A and a gap requirement, one can accurately recreate A.
  • Question: Can we get competitive performance by
    sampling o(mn) elements?
  • Answer: Apply the CUR decomposition.

38
Kernel-CUR Motivation
  • Kernel-based learning methods to extract
    non-linear structure:
  • Choose features to define a (dot product) space
    F.
  • Map the data, X, to F by Φ: X -> F.
  • Do classification, regression, and clustering in
    F with linear methods (SVMs, GPs, SVD).
  • If the Gram matrix G, where G_ij = k_ij =
    (Φ(X^(i)), Φ(X^(j))), is dense but has low numerical
    rank, then calculations of interest need O(n^2) space
    and O(n^3) time:
  • matrix inversion in GP prediction,
  • quadratic programming problems in SVMs,
  • computation of the eigendecomposition of G.
  • Relevant recent work using low-rank methods:
  • (Williams and Seeger, NIPS '01, etc.) Nystrom
    method for out-of-sample extensions.
  • (Achlioptas, McSherry, and Schölkopf, NIPS '02)
    randomized kernels.

39
Kernel-CUR Decomposition
(D. & M., COLT '05, TR '05, JMLR '05)
  • Input: n x n SPSD matrix G, probabilities p_i,
    i = 1,...,n, c < n, and k < c.
  • Algorithm:
  • Let C be the n x c matrix containing c randomly
    sampled columns of G.
  • Let W be the c x c matrix containing the
    intersection of C and C^T.
  • Theorem: Let p_i = G_ii^2 / Σ_i G_ii^2. If c = O(k
    log(1/δ)/ε^4), then w.p. at least 1-δ,
    ||G - C W_k^+ C^T||_{2,F} ≤ ||G - G_k||_{2,F} + ε Σ_i G_ii^2.

If c = O(log(1/δ)/ε^4), then with probability at
least 1-δ, the corresponding spectral-norm bound holds.
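
A rough NumPy sketch of this Nystrom-style construction (ours; it omits the
column rescaling used in the analysis, and the RBF Gram matrix in the example
is made-up data):

import numpy as np

def nystrom_kernel_cur(G, c, k, rng=None):
    # Sample c columns of the SPSD Gram matrix G with p_i ~ G_ii^2, form C and
    # the intersection W, and approximate G by C pinv(W_k) C^T, where W_k is
    # the best rank-k approximation of W.
    rng = np.random.default_rng(rng)
    p = np.diag(G) ** 2
    p = p / p.sum()
    idx = rng.choice(G.shape[0], size=c, replace=True, p=p)
    C = G[:, idx]
    W = G[np.ix_(idx, idx)]
    ew, ev = np.linalg.eigh(W)
    top = np.argsort(ew)[::-1][:k]
    Wk_pinv = (ev[:, top] / np.maximum(ew[top], 1e-12)) @ ev[:, top].T
    return C @ Wk_pinv @ C.T

X = np.random.randn(500, 5)                              # made-up data points
sq = np.sum(X**2, axis=1)
G = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T))   # RBF Gram matrix
G_approx = nystrom_kernel_cur(G, c=80, k=20)
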
40
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

41
Datasets modeled as tensors
  • Tensors (naively, a dataset subscripted by
    multiple indices) appear both in Math and CS.
  • Represent high-dimensional functions.
  • Connections to complexity theory (e.g., matrix
    multiplication complexity).
  • Statistical applications (e.g., ICA, HOS, etc.).
  • Large data-set applications (e.g., Medical
    Imaging, Hyperspectral Imaging).
  • Problem: There does not exist a definition of
    tensor rank (and an associated tensor SVD) with the
    nice properties found in the matrix case.
  • (Lek-Heng Lim '05: strong impossibility
    results!)
  • Common heuristic: "unfold" the tensor along a
    mode and apply Linear Algebra.
  • We will do this, but note that this kills the
    essential tensor structure.

42
Datasets modeled as tensors (contd)
Goal: Extract structure from a tensor dataset A
(naively, a dataset subscripted by multiple
indices) using a small number of samples.
  • Tensor rank (the minimum number of rank-one
    tensors) is NP-hard to compute.
  • Tensor α-rank (unfold along the α-th mode and
    take the matrix SVD) is a commonly-used heuristic.

Randomized Tensor-CUR: unfold along a
distinguished mode and reconstruct.
Randomized Tensor-SVD: unfold along every mode and
choose columns. (Drineas & Mahoney, "A Randomized
Algorithm for a Tensor-Based Generalization of
the SVD," TR '05.)
43
The TensorCUR algorithm (3-modes)
  • Choose the preferred mode α (e.g., time).
  • Pick a few representative slabs (let R denote
    the tensor of the sampled slabs).
  • Use only information in a small number of
    representative fibers (let C denote the tensor
    of sampled fibers and U a low-dimensional
    encoding tensor).
  • Express the remaining slabs as linear
    combinations of the basis of sampled slabs.
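
A rough NumPy sketch of these four steps for a 3-mode tensor (our
illustration; it samples slabs and fibers uniformly rather than with the
data-dependent probabilities the actual algorithm uses):

import numpy as np

def tensor_cur_sketch(A, n_slabs, n_fibers, rng=None):
    # A has shape (nx, ny, nt); sample a few time slabs and a few (x, y)
    # fibers, then express every slab as a least-squares combination of the
    # sampled slabs, fit only on the sampled fibers.
    rng = np.random.default_rng(rng)
    nx, ny, nt = A.shape
    slabs = rng.choice(nt, size=n_slabs, replace=False)          # sampled slabs
    fibers = rng.choice(nx * ny, size=n_fibers, replace=False)   # sampled fibers
    R = A[:, :, slabs].reshape(nx * ny, n_slabs)                 # basis slabs, unfolded
    A_unf = A.reshape(nx * ny, nt)
    coeffs, *_ = np.linalg.lstsq(R[fibers, :], A_unf[fibers, :], rcond=None)
    return (R @ coeffs).reshape(nx, ny, nt)

A = np.random.randn(64, 64, 128)
A_approx = tensor_cur_sketch(A, n_slabs=12, n_fibers=1000)
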

44
Tensor-CUR application Hyperspectral Image
Analysis
(with M. Maggioni and R. Coifman at Yale)
Goal: Extract structure from temporally-resolved
images or spectrally-resolved images of medical
interest using a small number of samples (images
and/or pixels).
Note A temporally or spectrally resolved image
may be viewed as a tensor (naively, a dataset
subscripted by multiple indices) or as a matrix
(whose columns have internal structure that is
not modeled).
m x n x p tensor A or mn x p matrix A
Note: The chosen images are a dictionary from the
data to express every image.
Note: The chosen pixels are a dictionary from the data
to express every pixel.
45
(No Transcript)
46
Sampling hyperspectral data
  • Sample slabs depending on total absorption.
  • For example, absorption at two pixel types

Sample fibers uniformly (since intensity depends
on stain).
47
Eigen-analysis of slabs and fibers
48
Look at the exact (65-th) slab.
49
The (65-th) slab approximately reconstructed
This slab was reconstructed by approximate
least-squares fit to the basis from slabs 41 and
50, using 1000 (of 250K) pixels/fibers.
50
Tissue Classification - Exact Data
51
Tissue Classification - Ns = 12, Nf = 1000
52
Tensor-CUR application Recommendation Systems
  • Important Comment:
  • Utility is an ordinal, not a cardinal, concept.
  • Compare products; don't assign utility values.
  • Recommendation Model Revisited:
  • Every customer has an n-by-n matrix (whose
    entries are +1, -1) that represents pair-wise
    product comparisons.
  • There are m such matrices, forming an
    n-by-n-by-m 3-mode tensor A.
  • Extract the structure of this tensor.

53
Application to Jester Joke Recommendations
Use just the 14,140 "full" users who rated all
100 Jester jokes. For each user, convert the
utility vector to a 100 x 100 pair-wise preference
matrix. Choose, e.g., 300 slabs/users, and a
small number of fibers/comparisons.
54
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

55
Modeling data as matrices
  • People studying data
  • put the data onto a graph or into a vector space
  • even if the data don't naturally or obviously
    "live" there
  • and perform graph operations or vector space
    operations
  • to extract information from the data.
  • Such data often have structure unrelated to the
    graphical or linear algebraic structure implicit
    in the modeling.
  • This non-modeled structure is difficult to
    formalize.
  • Practitioners often have extensive field-specific
    intuition about the data.
  • This intuition is often used to choose where
    the data live.
  • The choice of where the data live may capture
    non-modeled structure.

56
Modeling data as matrices (contd)
  • Matrices often arise since n objects
    (documents, genomes, images, web pages), each
    with m features, may be represented by an m x n
    matrix A.
  • Such data matrices often have structure:
  • for linear structure, SVD or PCA is often used;
  • for non-linear structure, kernel-based (e.g.,
    diffusion-based) methods are used;
  • other structures include sparsity,
    nonnegativity, etc.
  • Note: We know what the rows/columns "mean" from
    the application area.
  • Goal: Develop principled, provably-accurate
    algorithmic methods such that
  • they are agnostic with respect to any particular
    field,
  • one can fruitfully couple them to the
    field-specific intuition,
  • they perform well on complex non-toy data sets.

57
SVD and low-rank approximations
  • Theorem: Let A be an m x n matrix of rank ρ.
    Truncate the SVD of A by keeping k ≤ ρ terms: A_k =
    U_k Σ_k V_k^T. This gives the best rank-k
    approximation to A.
  • Interesting properties of the truncated SVD:
  • Used in data analysis via Principal Components
    Analysis (PCA).
  • The rows of U_k (= U_{A,k}) are NOT orthogonal and
    are NOT unit length.
  • The lengths/Euclidean norms of the rows of U_k
    capture a notion of information dispersal.
  • Gives a low-rank approximation with a very
    particular structure (rotate-rescale-rotate).
  • Best at capturing Frobenius (and other) norm.
  • Problematic w.r.t. sparsity, interpretability,
    etc.

58
Problems with SVD/Eigen-Analysis
  • Problems arise since structure in the data is not
    respected by mathematical operations on the data:
  • Reification - maximum variance directions are
    just that.
  • Interpretability - what does a linear
    combination of 6000 genes mean?
  • Sparsity - is destroyed by orthogonalization.
  • Non-negativity - is a convex and not a linear
    algebraic notion.
  • The SVD gives two bases to diagonalize the
    matrix.
  • Truncating gives a low-rank matrix approximation
    with a very particular structure.
  • Think rotation-with-truncation, rescaling,
    rotation-back-up.
  • Question: Do there exist "better" low-rank
    matrix approximations?
  • better structural properties for certain
    applications.
  • better at respecting relevant structure.
  • better for interpretability and informing
    intuition.

59
Exactly and approximately rank-k matrices
  • Theorem: Let the m x n matrix A be of rank exactly
    k. Then:
  • There exist k linear combinations of columns,
    rows, etc. such that
  • A = A_k = U_k Σ_k V_k^T.
  • There exist k actual columns and rows of A,
    permuted, such that A = CUR exactly.

Take-home message: Low-rank structure IS
redundancy in columns/rows.
Theorem: Let the m x n matrix A be approximately rank k.
Then A ≈ A_k = U_k Σ_k V_k^T is the best approximation.
Question: Can we express approximately rank-k matrices in
terms of their actual columns and/or rows?
60
Dictionaries for data analysis
  • Discrete Cosine Transform (DCT):
  • f_j = Σ_{n=0,...,N-1} x_n cos[ j (n + 1/2) π / N ]
  • the basis is fixed.
  • O(N^2) or O(N log N) computation to determine
    coefficients.
  • Singular Value Decomposition (SVD):
  • A = Σ_{i=1,...,ρ} σ_i U^(i) (V^(i))^T = Σ_{i=1,...,ρ} σ_i A_i
  • O(N^3) computation to determine basis and
    coefficients.
  • Many other more complex/expensive procedures,
    depending on the application.
  • Question: Can actual data points and/or feature
    vectors be the dictionary?
  • Core-sets on graphs.
  • CUR-decompositions on matrices.

61
CX and CUR matrix decompositions
Recall: Matrices are about their rows and
columns.
Recall: Low-rank matrices have
redundancy in their rows and columns.
Def: A CX matrix decomposition is a low-rank approximation
explicitly expressed in terms of a small number
of columns of the original matrix A (e.g., PCA,
CCA).
Def: A CUR matrix decomposition is a
low-rank approximation explicitly expressed in
terms of a small number of columns and rows of
the original matrix A.
62
Dictionaries the SVD
  • A = U Σ V^T = Σ_{i=1,...,ρ} σ_i U^(i) (V^(i))^T,
  • where U^(i), V^(i) = eigen-cols and eigen-rows.
  • Approximate A^(j) ≈ Σ_{i=1,...,k} z_ij U^(i)
  • by min_{z_ij} || A^(j) - Σ_{i=1,...,k} z_ij U^(i) ||_2
  • Z = U_k^T A  ->  A ≈ A_k = (U_k U_k^T) A
  • project onto the space of the top k eigen-cols.
  • Z = Σ_k V_k^T  ->  A ≈ A_k = U_k (Σ_k V_k^T)
  • approximate every column of A in terms of a small
    number of eigen-rows and a low-dimensional
    encoding matrix Σ_k.

63
Dictionaries columns and rows
  • A ≈ CUR = Σ_{ij} u_ij C^(i) R_(j), where U = W^+ and W =
    the intersection of C and R,
  • where C^(i), R_(j) = actual-cols and actual-rows.
  • Approximate A^(j) ≈ Σ_{i=1,...,c} y_ij C^(i)
  • by min_{y_ij} || A^(j) - Σ_{i=1,...,c} y_ij C^(i) ||_2
  • Y = C^+ A  ->  A ≈ P_C A = (C C^+) A
  • project onto the space of those c actual-cols.
  • Y = W^+ R  ->  A ≈ P_C A ≈ C (W^+ R)
  • approximate every column of A in terms of a small
    number of actual-rows and a low-dimensional
    encoding matrix U = W^+.
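
A two-line NumPy illustration (ours) of the CX projection above: once the
columns C are fixed, the best encoding is Y = C^+ A, giving A ≈ C Y = (C C^+) A.
The choice of columns below (every 10th) is for illustration only.

import numpy as np

def cx_approximation(A, cols):
    C = A[:, cols]                     # chosen actual columns of A
    Y = np.linalg.pinv(C) @ A          # best coefficients: Y = C^+ A
    return C, Y

A = np.random.randn(300, 200)
C, Y = cx_approximation(A, cols=np.arange(0, 200, 10))
err = np.linalg.norm(A - C @ Y, "fro")
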

64
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

65
Problem formulation (1 of 3)
  • Never mind columns and rows - just deal with
    columns (for now) of the matrix A.
  • Could ask to find the "best" k of n columns of
    A.
  • Combinatorial problem - a trivial algorithm takes
    n^k time.
  • Probably NP-hard if k is not fixed.
  • Let's ask a different question.
  • Fix a rank parameter k.
  • Let's over-sample columns by a little (e.g.,
    k+3, 10k, k^2, etc.).
  • Try to get close (additive error or relative
    error) to the best rank-k approximation.

66
Problem formulation (2 of 3)
Ques: Do there exist O(k), or O(k^2), or ...,
columns s.t.
  ||A - CC^+A||_{2,F} ≤ ||A - A_k||_{2,F} + ε||A||_F?
Ans: Yes - and we can find them in O(m+n)
space and time after two passes over the data!
(DFKVV99, DKM04)
Ques: Do there exist O(k), or
O(k^2), or ..., columns such that
  ||A - CC^+A||_{2,F} ≤ (1-ε)^{-1} ||A - A_k||_{2,F} + ε^t ||A||_F?
Ans: Yes - and we can
find them in O(m+n) space and time after t passes
over the data! (RVW05, DM05)
Ques: Do there
exist, and can we find, O(k), or O(k^2), or ...,
columns such that
  ||A - CC^+A||_F ≤ (1+ε) ||A - A_k||_F?
Ans: Yes, they exist - existential
proof - no non-exhaustive algorithm given!
(RVW05, DRVW06)
Ans: ...
67
Problem formulation (3 of 3)
Ques: Do there exist O(k), or O(k^2), or ...,
columns and rows such that
  ||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε||A||_F?
Ans: Yes - lots of them,
and we can find them in O(m+n) space and time after
two passes over the data! (DK03, DKM04)
Note: "lots of them" since these are randomized Monte
Carlo algorithms!
Ques: Do there exist O(k), or
O(k^2), or ..., columns and rows such
that
  ||A - CUR||_F ≤ (1+ε) ||A - A_k||_F?
Ans:
68
Theorem Relative-Error CUR
Fix any k, ε, δ. Then, there exists a Monte
Carlo algorithm that uses O(SVD(A_k)) time to find
C and R and construct U s.t.
  ||A - CUR||_F ≤ (1+ε) ||A - A_k||_F
holds with probability at least 1-δ, by picking
c = O( k^2 log(1/δ) / ε^2 ) columns, and
r = O( k^4 log^2(1/δ) / ε^6 ) rows.
(Current theory work: we can improve the sampling
complexity to c, r = O(k poly(1/ε, 1/δ)).)
(Current empirical work: we can usually choose
c, r ≈ k+4.)
(Don't worry about δ: choose δ = 1% if you want!)
69
L2 Regression problems
  • Problem: minimize ||Ax - b||_2 over x, where A is an
    n x d matrix and b is an n-vector.
  • First consider overconstrained problems, n >> d.
  • Typically, there is no x such that Ax = b.
  • Can generalize to non-overconstrained problems if
    rank(A) = k.
  • We seek sampling-based algorithms for
    approximating ℓ2 regression.
  • Nontrivial structural insights in overconstrained
    problems.
  • Nontrivial algorithmic insights for
    non-overconstrained problems.

70
Creating an induced subproblem
  • Algorithm
  • Fix a set of probabilities p_i, i = 1,...,n, summing up
    to 1.
  • Pick r indices from {1,...,n} in r i.i.d. trials,
    with respect to the p_i's.
  • For each sampled index j, keep the j-th row of A
    and the j-th element of b; rescale both by
    (1/(r p_j))^{1/2}.
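
A minimal NumPy sketch of this row-sampling scheme (ours; it uses
leverage-score probabilities, one choice satisfying the conditions given on
the following slides, and the problem sizes are made up):

import numpy as np

def sampled_l2_regression(A, b, r, rng=None):
    # Sample r rows i.i.d. with probabilities p_i = ||U_(i)||^2 / d (leverage
    # scores), rescale row i of A and entry b_i by 1/sqrt(r * p_i), and solve
    # the small induced least-squares problem.
    rng = np.random.default_rng(rng)
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    p = np.sum(U**2, axis=1)
    p = p / p.sum()
    idx = rng.choice(A.shape[0], size=r, replace=True, p=p)
    scale = 1.0 / np.sqrt(r * p[idx])
    SA = A[idx, :] * scale[:, None]
    Sb = b[idx] * scale
    x_tilde, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x_tilde

n, d = 100_000, 20
A = np.random.randn(n, d)
b = A @ np.random.randn(d) + 0.01 * np.random.randn(n)
x_approx = sampled_l2_regression(A, b, r=2000)
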

71
The induced subproblem
72
Our main L2 Regression result
If the p_i satisfy certain conditions, then with
probability at least 1-δ, the residual and the
solution of the induced subproblem are close to
those of the original problem (the solution-vector
bound involves κ(A) = the condition number of A).
The sampling complexity is r = O(d^2).
(New improvement: we can reduce the sampling
complexity to r = O(d).)
73
Conditions for the probabilities
The conditions that the p_i must satisfy, for some
β_1, β_2, β_3 ∈ (0,1]:
74
Rows of left singular vectors
What do the lengths of the rows of the n x d
matrix U = U_A mean?
Consider possible n x d matrices U of d left singular vectors:
  I_{n,k}: k columns from the identity; row lengths 0 or 1;
    I_{n,k} x -> x
  H_{n,k}: k columns from the n x n Hadamard (real Fourier)
    matrix; row lengths all equal; H_{n,k} x -> maximally dispersed
  U_k: k columns from any orthogonal matrix; row lengths
    between 0 and 1
Lengths of the rows of U = U_A correspond to a notion of
"information dispersal": where in R^m the "information" in A is
sent, not what the "information" is.
75
Comments on L2 Regression
  • Main point: The relevant information for ℓ2
    regression, if n >> d, is contained in an induced
    subproblem of size O(d^2)-by-d.
  • In O(nd^2) = O(SVD(A)) = O(SVD(A_d)) time we can
    easily compute p_i's that satisfy all three
    conditions, with β_1 = β_2 = β_3 = 1/3.
  • Too expensive in practice for this
    over-constrained problem!
  • NOT too expensive when applied to CX and CUR
    matrix problems!!
  • Key observation (FKV98, DK01, DKM04, RV04): U^s
    is almost orthogonal, and we can bound the
    spectral and the Frobenius norm of (U^s)^T U^s - I.
  • NOTE: K. Clarkson in SODA 2005 analyzed
    sampling-based algorithms for overconstrained ℓ1
    regression (p = 1) problems.

76
L2 Regression and CUR Approximation
  • Extended L2 Regression Algorithm:
  • Input: m x n matrix A, m x p matrix B, and a
    rank parameter k.
  • Output: n x p matrix X approximately solving
    min_X ||A_k X - B||_F.
  • Algorithm: Randomly sample r = O(d^2) or r = O(d) rows
    from A_k and B;
  • Solve the induced sub-problem:
  • X_opt = A_k^+ B ≈ (S A_k)^+ S B
  • Complexity: O(SVD(A_k)) time and space.
  • Corollary 1: Approximately solve min_X ||A_k^T X -
    A^T||_F to get columns C such that ||A - CC^+A||_F ≤
    (1+ε) ||A - A_k||_F.
  • Corollary 2: Approximately solve min_X ||C X -
    A||_F to get rows R such that ||A - CUR||_F ≤ (1+ε)
    ||A - CC^+A||_F.

77
Theorem Relative-Error CUR
Fix any k, ε, δ. Then, there exists a Monte
Carlo algorithm that uses O(SVD(A_k)) time to find
C and R and construct U s.t.
  ||A - CUR||_F ≤ (1+ε) ||A - A_k||_F
holds with probability at least 1-δ, by picking
c = O( k^2 log(1/δ) / ε^2 ) columns, and
r = O( k^4 log^2(1/δ) / ε^6 ) rows.
(Current theory work: we can improve the sampling
complexity to c, r = O(k poly(1/ε, 1/δ)).)
(Current empirical work: we can usually choose
c, r ≈ k+4.)
(Don't worry about δ: choose δ = 1% if you want!)
78
Subsequent relative-error algorithms
  • November 2005: Drineas, Mahoney, and
    Muthukrishnan
  • The first relative-error low-rank matrix
    approximation algorithm.
  • O(SVD(A_k)) time and O(k^2) columns for both CX
    and CUR decompositions.
  • January 2006: Har-Peled
  • Used ε-nets and VC-dimension arguments on
    optimal k-flats.
  • O(mn k^2 log k) - linear-in-mn - time to get a
    (1+ε) approximation.
  • March 2006: Deshpande and Vempala
  • Used the volume sampling / adaptive sampling
    procedure of RVW05, DRVW06.
  • O(Mk/ε) = O(SVD(A_k)) time and O(k log k)
    columns for a CX-like decomposition.
  • April 2006: Drineas, Mahoney, and Muthukrishnan
  • Improved the DMM November 2005 result to O(k
    log k) columns.

79
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

80
CUR data application DNA tagging-SNPs
(data from K. Kidd's lab at Yale University;
joint work with Dr. Paschou at Yale University)
Single Nucleotide Polymorphisms (SNPs): the most common
type of genetic variation in the genome across
different individuals. They are known locations
in the human genome where two alternate
nucleotide bases (alleles) are observed (out of
A, C, G, T).
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
There are 10 million SNPs in the human genome,
so this table could have 10 million columns.
81
Recall the human genome
  • Human genome: 3 billion base pairs
  • 30,000 - 40,000 genes
  • The functionality of 97% of the genome is
    unknown.
  • BUT: individual differences (polymorphic
    variation) at 1 b.p. per thousand.

SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT TT
GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG
CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG
TT TT CC GG TT GG GG TT GG AA GG TT TT GG TT CC
CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG
CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GG GT GA AG GG TT TT GG TT CC CC CC CC
GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC
AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG
GT GT GA AG GG TT TT GG TT CC CC CC CC GG AA GG
GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA
GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG
AA GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA
CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG
CT CG CG CG AT CT CT AG CT AG GG TT GG AA GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG
GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GT TT GG AA
individuals
SNPs occur quite frequently within the genome
allowing the tracking of disease genes and
population histories. Thus, SNPs are effective
markers for genomic research.
82
Two copies of a chromosome (father, mother)
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
83
C
C
Two copies of a chromosome (father, mother)
  • An individual could be
  • Heterozygotic (in our study, CT = TC)
  • Homozygotic at the first allele, e.g., C

SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
84
T
T
Two copies of a chromosome (father, mother)
  • An individual could be
  • Heterozygotic (in our study, CT = TC)
  • Homozygotic at the first allele, e.g., C
  • Homozygotic at the second allele, e.g., T

SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
85
Why are SNPs important?
  • Genetic Association Studies
  • Locate causative genes for common complex
    disorders (e.g., diabetes, heart disease, etc.)
    by identifying association between affection
    status and known SNPs.
  • No prior knowledge about the function of the
    gene(s) or the etiology of the disorder is
    necessary.

Biology and Association Studies: The subsequent
investigation of candidate genes that are in
physical proximity with the associated SNPs is
the first step towards understanding the
etiological pathway of a disorder and designing
a drug.
Data Analysis and Association Studies:
Susceptibility alleles (and genotypes carrying
them) should be more common in the patient
population.
86
SNPs carry redundant information
  • Key observation: non-random relationships between
    SNPs.
  • The human genome is organized into a block-like
    structure.
  • Strong intra-block correlations.
  • We can focus only on tagSNPs.
  • Among different populations (e.g., European,
    Asian, African, etc.), different patterns of SNP
    allele frequencies or SNP correlations are often
    observed.
  • Understanding such differences is crucial in
    order to develop the next generation of drugs
    that will be population specific (eventually
    genome specific) and not just disease
    specific.

87
Funding
  • Mapping the whole genome sequence of a single
    individual is very expensive.
  • Mapping all the SNPs is also quite expensive,
    but the costs are dropping fast.
  • HapMap project ($100,000,000 funding from NIH
    and other sources)
  • Map all 10,000,000 SNPs for 270 individuals from
    4 different populations (YRI, CEU, CHB, JPT), in
    order to create a genetic map to be used by
    researchers.
  • Also, funding from pharmaceutical companies,
    NSF, the Department of Justice, etc.

Is it possible to identify the ethnicity of a
suspect from his DNA?
88
Research directions
Research questions (working within a
population): (i) Are different SNPs correlated,
within or across populations? (ii) Find a good
set of tagging-SNPs capturing the diversity of a
chromosomal region of the human genome. (iii)
Find a set of individuals that capture the
diversity of a chromosomal region. (iv) Is
extrapolation feasible?
Why? - Understand structural properties of the
human genome. - Save time/money by assaying only
the tSNPs and predicting the rest. - Save
time/money by running (drug) tests only on the
cell lines of the selected individuals.
Existing literature: Pairwise metrics of SNP
correlation, called LD (linkage disequilibrium)
distance, are based on nucleotide frequencies and
co-occurrences. Almost no metrics exist for
measuring correlation between more than 2 SNPs,
and LD is very difficult to generalize. Exhaustive
and semi-exhaustive algorithms exist in order to pick
good ht-SNPs that have small LD distance with
all other SNPs. Using Linear Algebra: an SVD-based
algorithm was proposed by Lin & Altman, Am.
J. Hum. Gen. 2004.
89
The DNA SNP data
  • Samples from 38 different populations.
  • Average size: 50 subjects/population.
  • For each subject, 63 SNPs were assayed, from a
    region in chromosome 17 called SORCS3, 900,000
    bases long.
  • We are in the process of analyzing HapMap data
    as well as 3 more regions assayed by Kidd's lab
    (with Asif Javed).

90
(No Transcript)
91
Encoding the data
SNPs
0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0
1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1
-1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1
-1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0
1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1
0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1
-1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1
0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1
0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1
1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0
-1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0
0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1
1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1
-1 0 -1 -1 1
individuals
  • How?
  • Exactly two nucleotides (out of A, G, C, T) appear
    in each column.
  • Thus, the two alleles might be both equal to the
    first one (encode by +1), both equal to the
    second one (encode by -1), or different (encode
    by 0). (See the encoding sketch after this list.)
  • Notes:
  • The order of the alleles is irrelevant, so TG is
    the same as GT.
  • Encoding, e.g., GG to +1 and TT to -1 is not
    any different (for our purposes) from encoding
    GG to -1 and TT to +1.
  • (Flipping the signs of the columns of a matrix
    does not affect our techniques.)
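
A tiny Python sketch of this {+1, 0, -1} encoding (ours; it assumes exactly
two alleles appear in the column, as stated above, and the sign convention is
arbitrary):

def encode_snp_column(genotypes):
    # genotypes: list of two-letter strings for one SNP, e.g. ["AG", "AA", "GG"]
    alleles = sorted({base for g in genotypes for base in g})
    a, b = alleles[0], alleles[1]
    encoded = []
    for g in genotypes:
        if g == a + a:
            encoded.append(+1)        # homozygous for the first allele
        elif g == b + b:
            encoded.append(-1)        # homozygous for the second allele
        else:
            encoded.append(0)         # heterozygous; order (AG vs GA) irrelevant
    return encoded

print(encode_snp_column(["AG", "AA", "GG", "GA"]))   # [0, 1, -1, 0]
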

92
Evaluating (linear) structure
For each population:
  • We ran SVD to determine
the optimal number k of eigenSNPs covering 90%
of the variance.
If we pick the top k left singular vectors we can
express every column (i.e., SNP) of A as a linear
combination of the left singular vectors, losing
10% of the data.
  • We ran CUR to pick a small number (e.g., k+2)
of columns of A and express every column (i.e.,
SNP) of A as a linear combination of the picked
columns, losing 10% of the data.
SNPs
0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0
1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1
-1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1
-1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0
1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1
0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1
-1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1
0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1
0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1
1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0
-1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0
0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1
1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1
-1 0 -1 -1 1
individuals
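
A short NumPy sketch (ours) of the "90% of the variance" rule used above to
pick k; the matrix sizes follow the study (roughly 50 individuals by 63 SNPs),
but the data here are random placeholders rather than encoded genotypes.

import numpy as np

def eigensnps_for_variance(A, frac=0.90):
    # Smallest k such that the top-k singular values capture `frac` of the
    # squared Frobenius norm (the variance, for centered data).
    s = np.linalg.svd(A, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, frac) + 1)

A = np.random.randn(50, 63)        # individuals x SNPs (placeholder data)
k = eigensnps_for_variance(A, 0.90)
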
93
(No Transcript)
94
Predicting SNPs within a population
Split the individuals into two sets: training and
test. Given a small number of SNPs for all
individuals (tagging-SNPs), and all SNPs for
individuals in the training set, predict the
unassayed SNPs. Tagging-SNPs are selected using
only the training set. (A prediction sketch
follows the figure labels below.)
SNPs
Training individuals, chosen uniformly at
random (for a few subjects, we are given all
SNPs)
individuals
SNP sample (for all subjects, we are given a
small number of SNPs)
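
One simple least-squares decoding consistent with this setup (our own
illustration, not necessarily the exact procedure used in the study): regress
every SNP on the tagging-SNPs over the training individuals, apply the fit to
the test individuals, and round to {-1, 0, +1}.

import numpy as np

def predict_unassayed_snps(A_train, tag_cols, tags_test):
    # A_train: training individuals x all SNPs (encoded -1/0/+1)
    # tags_test: test individuals x tagging-SNPs (columns given by tag_cols)
    coeffs, *_ = np.linalg.lstsq(A_train[:, tag_cols], A_train, rcond=None)
    pred = tags_test @ coeffs
    return np.clip(np.rint(pred), -1, 1)
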
95
(No Transcript)
96
(No Transcript)
97
Predicting SNPs across populations
Given all SNPs for all individuals in population
X, and a small number of tagging-SNPs for
population Y, predict all unassayed SNPs for all
individuals of Y. Tagging-SNPs are selected
using only the training set. Training set:
individuals in X. Test set: individuals in Y. A
contains all individuals in both X and Y.
SNPs
All individuals in population X.
individuals
SNP sample (for all subjects, we are given a
small number of SNPs)
98
FIG. 6
99
(No Transcript)
100
(No Transcript)
101
Keeping both SNPs and individuals
Given a small number of SNPs for all individuals,
and all SNPs for some judiciously chosen
individuals, predict the values of the remaining
SNPs.
SNPs
Basis individuals, JUDICIOUSLY CHOSEN (for a
few subjects, we are given all SNPs)
individuals
SNP sample (for all subjects, we are given a
small number of SNPs)
102
(No Transcript)
103
(No Transcript)
104
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

105
Conclusions & Open Problems
  • Impose other structural properties in CUR-type
    decompositions
  • Non-negativity
  • Element quantization, e.g., to {0, 1, -1}
  • Block-SVD type structure
  • Robust heuristics and robust extensions
  • Especially for noisy data
  • L1 norm bounds
  • Extension to different statistical learning
    problems
  • Matrix reconstruction
  • Regression
  • Classification, e.g. SVM
  • Clustering

106
Conclusions & Open Problems
  • Relate to traditional numerical linear algebra
  • Gu and Eisenstat - deterministically find
    well-conditioned columns
  • Goreinov and Tyrtyshnikov - volume-maximization
    and conditioning criteria
  • Stewart - backward error analysis
  • Empirical evaluation of different sampling
    probabilities
  • Uniform
  • Non-uniform: norms of rows/columns or of
    left/right singular vectors
  • Others, depending on the problem to be solved?
  • Use CUR and CW^+C^T for improved interpretability
    of data matrices
  • Structural vs. algorithmic issues

107
Workshop on Algorithms for Modern Massive Data
Sets (http://www.stanford.edu/group/mmds/)
@ Stanford University and Yahoo! Research, June
21-24, 2006.
Objective:
- Explore novel
techniques for modeling and analyzing massive,
high-dimensional, and nonlinearly-structured data.
- Bring together computer scientists,
computational and applied mathematicians,
statisticians, and practitioners to promote
cross-fertilization of ideas.
Organizers: G.
Golub, M. W. Mahoney, L-H. Lim, and P.
Drineas.
Sponsors: NSF, Yahoo! Research, Ask!.