TUTORIAL: Randomized Algorithms for Matrices and Massive Data Sets

1
TUTORIAL: Randomized Algorithms for Matrices and
Massive Data Sets
Michael W. Mahoney, Yale University, Dept. of
Mathematics (now at Yahoo! Research)
Petros Drineas, RPI, Dept. of Computer Science
(Most recent copy) available at
http://www.cs.yale.edu/homes/mmahoney
http://www.cs.rpi.edu/drinep
2
Randomized Linear Algebra Algorithms
  • Goal: To develop and analyze (fast) Monte Carlo
    algorithms for performing useful computations on
    large matrices and tensors.
  • Matrix Multiplication
  • Computation of the Singular Value Decomposition
  • Computation of the CUR Decomposition
  • Testing Feasibility of Linear Programs
  • Least Squares Approximation
  • Tensor computations: SVD generalizations
  • Tensor computations: CUR generalization
  • Such computations generally require time which is
    superlinear in the number of nonzero elements of
    the matrix/tensor, e.g., O(n³) for n × n matrices.

3
Randomized Linear Algebra Algorithms
Motivation
  • (Algorithmic) To speed up computations in
    applications where extremely large data sets are
    modeled by matrices/tensors and, e.g., O(n3) time
    is not an option.
  • (Algorithmic) To reduce the memory requirements
    in applications where the data sets are modeled
    by matrices/tensors, and storing the whole data
    is not an option, because the data sets are
    extremely large.
  • (Equivalent to the above) To provide some
    analysis of the accuracy of simple algorithms
    when only a small sample of the full data set is
    available.
  • (Structural) To reveal novel structural
    properties of the datasets, given sufficient
    computational time.

4
Example: the CUR decomposition
Goal: make (some norm of) A - CUR small.
Why? After making two passes over A, we can
compute provably good C, U, and R and store them
(a sketch) instead of A: O(m+n) vs. O(n²) space.
Why? Given a sample consisting of a few columns
(C) and a few rows (R) of A, we can compute U and
reconstruct A as CUR; if the sampling
probabilities are not too bad, we get provably
good accuracy.
Why? Given sufficient time, we can find C, U and
R such that ||A - CUR|| is very small. This might
lead to better understanding of the data.
5
Applications of such algorithms
  • Matrices arise, e.g., since m objects (documents,
    genomes, images, web pages), each with n
    features, may be represented by an m x n matrix
    A.
  • Covariance Matrices
  • Latent Semantic Indexing
  • DNA Microarray Data
  • Eigenfaces and Image Recognition
  • Similarity Queries
  • Matrix Reconstruction
  • LOTS of other data applications!!
  • More generally,
  • Linear and Nonlinear Programming Applications
  • Design of Approximation Algorithms
  • Statistical Learning Theory Applications

6
Overview (1/3)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Feasibility testing of Linear Programs
  • Singular Value Decomposition
  • Sampling rows/columns to approximate singular
    vectors/values
  • Sampling elements to approximate singular
    vectors/values
  • CUR Matrix Decomposition
  • Lower bounds on sampling complexities

7
Overview (2/3)
  • Applications of Matrix CUR
  • Designing approximation algorithms
  • Data mining
  • Microarray data
  • Recommendation Systems
  • Tensor-based data sets
  • Tensor-CUR
  • Applications of Tensor-CUR
  • Hyperspectral data
  • Recommendation systems

8
Overview (3/3)
  • Kernel-CUR and the Nystrom Method
  • Regression problems
  • Least squares problems
  • The CUR algorithm revisited
  • A general bound
  • Iterative sampling of rows/columns
  • Optimal choices of rows/columns
  • Application finding ht-SNPs
  • Conclusions open problems

9
Computation on Massive Data Sets
Data are too large to fit into main memory they
are either not stored or are stored in external
memory. Algorithms that compute on data streams
examine the stream, keep a small sketch of the
data, and perform computations on the
sketch. Algorithms are (typically) randomized
and approximate.
  • Evaluate performance by measures such as
  • the time to process an item
  • the number of passes over the data
  • the additional workspace and time
  • the quality of the approximation returned
  • Munro & Paterson '78 studied the relation
    between the amount of internal storage available
    and the number of passes required to select the
    k-th highest of n inputs.

10
The Pass Efficient Model
  • Motivation: Amount of disk/tape space has
    increased enormously; RAM and computing speeds
    have increased less rapidly.
  • Can store large amounts of data, but
  • Cannot process these data with traditional
    algorithms.
  • In the Pass-Efficient Model
  • Data are assumed to be stored on disk/tape.
  • Algorithm has access to the data via a pass over
    the data.
  • Algorithm is allowed additional RAM space and
    additional computation time.
  • An algorithm is pass-efficient if it uses a small
    constant number of passes and sublinear
    additional time and space to compute a
    description of the solution.

Note: If data are an m x n matrix A, then
algorithms which require additional time and
space that is O(m+n) or O(1) are pass-efficient.
11
Random Sampling
Random sampling is used to estimate some
parameter defined over a very large set by
looking at only a very small subset. Uniform
Sampling: every piece of data is equally likely
to be picked.
  • Advantages
  • Coins can be tossed blindly.
  • Even if the number of data elements is not known
    in advance, can select one element u.a.r. in one
    pass over the data.
  • Much recent work on quantities that may be
    approximated with a small uniformly drawn sample.
  • Disadvantages
  • Many quantities cannot be approximated well with
    a small random sample that is uniformly drawn.
  • E.g., estimate the sum of N numbers, where one
    number is very large and the others very small.

12
Random Sampling (contd)
  • Nonuniform Sampling
  • Advantages
  • Can obtain much more generality and big gains,
    e.g., can approximately solve problems in sparse
    as well as dense matrices.
  • Smaller sampling complexity for similar bounds.
  • Disadvantages
  • Must determine the nonuniform probabilities;
    multiple passes over the data are usually needed.
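A minimal Python sketch (not part of the original slides) contrasting the two schemes on the sum-of-N-numbers example from the previous slide; the data values and sample size are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# One huge element among many tiny ones: the hard case for uniform sampling.
x = np.full(100_000, 1e-3)
x[42] = 1e6
true_sum = x.sum()

s = 100  # number of samples

# Uniform sampling: estimate = N * (mean of the sampled values).
idx_u = rng.integers(0, len(x), size=s)
est_uniform = len(x) * x[idx_u].mean()

# Nonuniform (importance) sampling: p_i proportional to |x_i|;
# the unbiased estimator averages x_{j_t} / p_{j_t} over the s samples.
p = np.abs(x) / np.abs(x).sum()
idx_n = rng.choice(len(x), size=s, p=p)
est_nonuniform = np.mean(x[idx_n] / p[idx_n])

print(true_sum, est_uniform, est_nonuniform)
```

Uniform sampling almost always misses the dominant element and grossly underestimates the sum; the magnitude-weighted sample recovers it accurately.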

13
Overview (1/3)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Feasibility testing of Linear Programs
  • Singular Value Decomposition
  • Sampling rows/columns to approximate singular
    vectors/values
  • Sampling elements to approximate singular
    vectors/values
  • CUR Matrix Decomposition
  • Lower bounds on sampling complexities

14
Approximating Matrix Multiplication
(D. & Kannan, FOCS '01; D., Kannan, & M., TR
'04, SICOMP '05) Problem Statement: Given an
m-by-n matrix A and an n-by-p matrix B,
approximate the product AB, or, equivalently,
approximate the sum of n rank-one matrices:
AB = Σ_{i=1}^{n} A^(i) B_(i) (column i of A times row i of B).
Each term in the summation is a rank-one matrix.
15
Approximating AB by random sampling
  • Algorithm
  • Fix a set of probabilities p_i, i = 1..n, summing up
    to 1.
  • For t = 1 up to s,
  • set j_t = i, where Pr(j_t = i) = p_i
  • (Pick s terms of the sum, with replacement, with
    respect to the pi.)
  • Approximate AB by the sum of the s terms, after
    scaling.
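A minimal numpy sketch of this sampling scheme (an illustration, not the authors' code); the variance-minimizing probabilities p_i ∝ ||A^(i)|| ||B_(i)|| discussed on the following slides are assumed:

```python
import numpy as np

def approx_matmul(A, B, s, rng=np.random.default_rng()):
    """Approximate A @ B by sampling s rank-one terms (columns of A times rows of B)."""
    n = A.shape[1]
    # Nonuniform probabilities proportional to |A^(i)| * |B_(i)|.
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(n, size=s, p=p)            # i.i.d. sampling with replacement
    scale = 1.0 / np.sqrt(s * p[idx])
    C = A[:, idx] * scale                        # m x s: sampled, rescaled columns of A
    R = B[idx, :] * scale[:, None]               # s x p: sampled, rescaled rows of B
    return C @ R                                 # E[C @ R] = A @ B

A = np.random.randn(200, 500)
B = np.random.randn(500, 100)
err = np.linalg.norm(A @ B - approx_matmul(A, B, 50)) / np.linalg.norm(A @ B)
print(err)
```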

16
Random sampling (contd)
Keeping the terms j_1, j_2, ..., j_s.
17
The algorithm (matrix notation)
18
The algorithm (matrix notation, contd)
19
Simple Lemmas
  • The input matrices are given in sparse
    unordered representation; e.g., their non-zero
    entries are presented as triples (i, j, A_ij) in
    any order.
  • The expectation of CR (element-wise) is AB.
  • Our nonuniform sampling minimizes the variance of
    the estimator.
  • It is easy to implement the sampling in two
    passes.
  • If the matrices are dense the algorithm runs in
    O(smp) time, instead of O(nmp) time.
  • It requires O(sm + sp) RAM space.
  • Does not tamper with the sparsity of the matrices.

20
Error Bounds
For the above algorithm (with the nonuniform probabilities),
E[ ||AB - CR||_F² ] ≤ ||A||_F² ||B||_F² / s.
If ||AB||_F = Ω(||A||_F ||B||_F), then the above
bound is a relative error bound. This happens if
there is not much cancellation in the
multiplication.
21
Error Bounds (tight concentration)
For the above algorithm, with probability at
least 1-δ,
Notice that we removed the expectation (by
applying a martingale argument) at the cost of an
extra log(1/δ) factor. (Markov's inequality would
also remove the expectation, introducing an extra
1/δ factor.)
22
Special case: B = A^T
If B = A^T, then the sampling probabilities are
p_i = ||A^(i)||² / ||A||_F².
Also, R = C^T, and the error bounds are stated for ||AA^T - CC^T||.
23
Special case: B = A^T (contd)
(Rudelson & Vershynin '04, Vershynin '04)
An improved spectral norm bound for the
special case B = A^T.
24
Empirical evaluation: setup
(Data from D. Lewis & E. Cohen, SODA '97, J. of
Algorithms '99)
Database: Dense document-concept matrix A with
5000 documents and 320 concepts. Experiment: Our
goal was to identify all document-document
matches, e.g. compute AA^T and identify all
entries larger than some threshold τ (proximity
problem).
  • The algorithm
  • Approximate AA^T (using our method) by CC^T.
  • Find all entries of CC^T that are larger than
    τ - ε, ε > 0 (candidate entries).
  • Deterministically compute the dot products for
    candidate entries.
  • Return the ones that are larger than τ as
    matches.

25
Empirical evaluation results (τ = 0.85)
26
Empirical evaluation results (τ = 0.75)
27
Experiments uniform sampling
28
Random sampling of Linear Programs
(D., Kannan, M. TR 04 STACS 05)
Let P(i) denote the i th column of the r-by-n
matrix P and suppose that the following Linear
Program is feasible
29
The picture
Sample a few variables!
30
Another picture
31
A converse LP sampling lemma
Let P(i) denote the i th column of the r-by-n
matrix P and suppose that the following Linear
Program is infeasible
32
The picture
Sample a few variables!
33
Another picture
34
Overview (1/3)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Feasibility testing of Linear Programs
  • Singular Value Decomposition
  • Sampling rows/columns to approximate singular
    vectors/values
  • Sampling elements to approximate singular
    vectors/values
  • CUR Matrix Decomposition
  • Lower bounds on sampling complexities

35
Singular Value Decomposition (SVD)
A = U Σ V^T.
U (V): orthogonal matrix containing the left
(right) singular vectors of A. Σ: diagonal matrix
containing the singular values of A.
  1. Exact computation of the SVD takes O(min{mn²,
    m²n}) time.
  2. The top few singular vectors/values can be
    approximated faster (Lanczos/ Arnoldi methods).

36
Rank k approximations (Ak)
  • U_k (V_k): orthogonal matrix containing the top k
    left (right) singular vectors of A.
  • Σ_k: diagonal matrix containing the top k
    singular values of A.
  • Also, A_k = U_k U_k^T A.

A_k is a matrix of rank k such that ||A - A_k||_{2,F} is
minimized over all rank-k matrices! This property
is very useful in the context of Principal
Component Analysis.
37
Approximating SVD in O(n) time
  • (Frieze, Kannan & Vempala, FOCS '98; D., Frieze,
    Kannan, Vempala & Vinay, SODA '99, JML '04; D.,
    Kannan & M., TR '04, SICOMP '05)
  • Given m x n matrix A
  • Sample c columns from A and rescale to form the
    m x c matrix C.
  • Compute the m x k matrix Hk of the top k left
    singular vectors of C.

Structural Theorem: For any probabilities and
number of columns,
||A - H_k H_k^T A||_{2,F}² ≤ ||A - A_k||_{2,F}² + 2√k ||AA^T - CC^T||_F.
Algorithmic Theorem: If p_i = ||A^(i)||² / ||A||_F² and
c = Ω(k/ε²), then
||A - H_k H_k^T A||_{2,F}² ≤ ||A - A_k||_{2,F}² + ε ||A||_F².
Proof: via the matrix multiplication
theorem and matrix perturbation theory.
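A hedged numpy sketch of this column-sampling step (illustrative only; function and variable names are mine), using the length-squared probabilities from the Algorithmic Theorem:

```python
import numpy as np

def sampled_left_singular_vectors(A, c, k, rng=np.random.default_rng()):
    """Approximate the top-k left singular vectors of A from c sampled, rescaled columns."""
    col_norms_sq = np.sum(A * A, axis=0)
    p = col_norms_sq / col_norms_sq.sum()          # p_i = |A^(i)|^2 / ||A||_F^2
    idx = rng.choice(A.shape[1], size=c, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])            # m x c rescaled sample
    Hk, _, _ = np.linalg.svd(C, full_matrices=False)
    return Hk[:, :k]                               # m x k approximate top-k left basis

A = np.random.randn(512, 2000)
Hk = sampled_left_singular_vectors(A, c=200, k=10)
approx = Hk @ (Hk.T @ A)                           # rank-k approximation H_k H_k^T A
print(np.linalg.norm(A - approx, 'fro') / np.linalg.norm(A, 'fro'))
```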
38
Example of randomized SVD
A = the original image (baboondet.eps; the figure and its
EPS preview metadata are not reproduced in this transcript).
Original matrix
After sampling columns
Compute the top k left singular vectors of the
matrix C and store them in the 512-by-k matrix H_k.
39
Example of randomized SVD (contd)
(Figure not reproduced: the original image A next to
its approximation H_k H_k^T A.)
A and H_k H_k^T A are close.
40
Element-wise sampling
(Achlioptas McSherry, STOC 01, JACM 05)
  • The Algorithm in 2 lines
  • To approximate a matrix A, keep a few elements
    of the matrix (instead of rows or columns) and
    zero out the remaining elements.
  • Compute a rank k approximation to this sparse
    matrix (using Lanczos methods).

||A - S||_2 is bounded ⇒ (i) the singular values of
A and S are close, and (ii, under additional
assumptions) the top k left (right) singular
vectors of S span a subspace that is close to the
subspace spanned by the top k left (right)
singular vectors of A.
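A minimal sketch of element-wise sampling (my own code, assuming the p_ij ∝ A_ij² keep-probabilities described on the next slide and a Lanczos-type sparse SVD from scipy):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def sparsify(A, s, rng=np.random.default_rng()):
    """Keep entry A_ij independently with prob p_ij = min(1, s*A_ij^2 / ||A||_F^2),
    rescale kept entries by 1/p_ij so that E[S] = A, and zero out the rest."""
    p = np.minimum(1.0, s * A**2 / np.sum(A**2))
    keep = rng.random(A.shape) < p
    S = np.where(keep, A / np.where(keep, p, 1.0), 0.0)
    return csr_matrix(S)

A = np.random.randn(500, 400)
S = sparsify(A, s=20_000)                  # roughly 20K of 200K entries kept
U, sigma, Vt = svds(S, k=10)               # iterative SVD on the sparse matrix
print(S.nnz, sigma[::-1][:3], np.linalg.svd(A, compute_uv=False)[:3])
```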
41
How to use it
  • Approximating singular values fast
  • Zero out (a large number of) elements of A,
    scale the remaining ones appropriately.
  • Compute the singular values of the resulting
    sparse matrix using iterative techniques.
  • (Good choice for p_ij: p_ij = s·A_ij² / Σ_{i,j} A_ij²,
    where s denotes the expected number of elements
    that we seek to keep in S.)
  • Note Each element is kept or discarded
    independently of the others.
  • Similar ideas have been used to
  • explain the success of Latent Semantic Indexing
    (LSI)
  • (Papadimitriou, Raghavan, Tamaki, Vempala, PODS
    98 Azar, Fiat, Karlin, McSherry, Saia STOC
    01)
  • design recommendation systems
  • (Azar, Fiat, Karlin, McSherry, Saia STOC 01 )
  • speedup kernel computations
  • (Achlioptas, McSherry, and Schölkopf, NIPS 02)

42
Element-wise vs. row/column sampling
  • Quite different techniques!
  • Row/column sampling preserves structural
    properties of the matrices.
  • Element-wise sampling explains how adding noise
    and/or quantizing the elements of a matrix
    perturbs its singular values/vectors.
  • The two techniques could be complementary!
  • Some similarities and differences
  • Similar error bounds.
  • Element-wise sampling is doable in one pass!
  • Running time of element-wise sampling depends on
    the speed of, e.g., Arnoldi methods.

43
Overview (1/3)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Feasibility testing of Linear Programs
  • Singular Value Decomposition
  • Sampling rows/columns to approximate singular
    vectors/values
  • Sampling elements to approximate singular
    vectors/values
  • CUR Matrix Decomposition
  • Lower bounds on sampling complexities

44
A novel CUR matrix decomposition
(D. & Kannan, SODA '03; D., Kannan, & M., TR '04,
SICOMP '05)
Create an approximation to the original matrix
of the following form: A ≈ C·U·R, where C consists of
a few columns of A, R consists of a few rows of A,
and U is a small, carefully constructed matrix.
45
The CUR decomposition
Given a large m-by-n matrix A (stored on disk),
compute a decomposition CUR of A such that
46
The CUR decomposition (contd)
  • Given a large m-by-n matrix A (stored on disk),
    compute a decomposition CUR of A such that
  • C, U, R can be stored in O(m+n) space, after
    making two passes through the entire matrix A,
    using O(m+n) additional space and time.
  1. The product CUR satisfies (with high probability)

47
Computing U
  • Intuition (which can be formalized)
  • The CUR algorithm essentially expresses every
    row of the matrix A as a linear combination of a
    small subset of the rows of A.
  • This small subset consists of the rows in R.
  • Given a row of A, say A_(i), the algorithm
    computes a good fit for the row A_(i) using the
    rows in R as the basis, by approximately solving
    min_u ||A_(i) - u^T R||_2.

Notice that only c = O(1) elements of the i-th row
are given as input. However, a vector of
coefficients u can still be computed.
48
Computing U (contd)
Given c elements of A(i) the algorithm computes a
good fit for the row A(i) using the rows in R as
the basis, by approximately solving
However, our CUR decomposition approximates the
vectors u instead of exactly computing
them. Later in this talk new error bounds using
the optimal coefficients.
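As an illustration of this step (a sketch under assumed notation, not the paper's exact procedure): with the rows of R as a basis and only c sampled entries of the row available, u comes from a small least-squares fit.

```python
import numpy as np

def fit_row_coefficients(R, sampled_cols, sampled_vals):
    """Fit u so that u^T R approximates a row of A, using only the sampled entries
    of that row: solve min_u || R[:, sampled_cols]^T u - sampled_vals ||_2."""
    u, *_ = np.linalg.lstsq(R[:, sampled_cols].T, sampled_vals, rcond=None)
    return u

# Toy usage: A has rank 5, R holds 8 of its rows, only 40 entries of a row are read.
rng = np.random.default_rng(1)
A = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 1000))
R = A[rng.choice(300, size=8, replace=False), :]
cols = rng.choice(1000, size=40, replace=False)
u = fit_row_coefficients(R, cols, A[7, cols])
print(np.linalg.norm(A[7] - u @ R) / np.linalg.norm(A[7]))
```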
49
Error bounds for CUR
Assume A_k is the best rank-k approximation to
A (through the SVD). Then, if we pick O(k/ε²) rows
and O(k/ε²) columns,
||A - CUR||_F ≤ ||A - A_k||_F + ε ||A||_F.
50
Other (randomized) CUR decompositions
  • Computing U in constant (O(1) instead of O(mn))
    space and time
  • (D., Kannan, M. TR 04, SICOMP 05)
  • samples O(poly(k, 1/ε)) rows and columns of A;
    needs an extra pass.
  • significantly improves the error bounds of
    Frieze, Kannan, and Vempala, FOCS 98.
  • Computing U and R for any C
  • (D., M., Muthukrishnan 05)
  • For any subset of the columns, denoted C (e.g.,
    chosen by the practitioner)
  • Obtain bounds of the form ||A - CUR||_F ≤ (1 +
    ε) ||A - CC⁺A||_F.
  • Uses ideas from approximating L2 Regression
    problems by random sampling.
  • Can combine with recent algorithms/heuristics
    for choosing columns more details later.

51
Other CUR decompositions
(G.W. Stewart, Num. Math. 99, TR 04)
  • Compute sparse low-rank approximations to sparse
    matrices.
  • The quasi-Gram-Schmidt (variant of QR) algorithm
  • Input m x n matrix A.
  • Output m x k matrix C (k columns of A) and
    upper-triangular k x k matrix SC (that
    orthogonalizes these columns).

52
Other CUR decompositions, contd
(Goreinov, Tyrtyshnikov, and Zamarashkin, LAA 97
Goreinov, Tyrtyshnikov, Cont. Math. 01)
If A is approximately rank-k to accuracy ε (for
ε sufficiently small) then there exist C, R, and
W such that
  • Notes
  • Scattering applications with large matrices with
    low-rank blocks.
  • Relate to maximum volume concept in
    interpolation theory.
  • They call the CUR decomposition the
    pseudoskeleton component of A.

53
Lower Bounds
Question How many queries does a sampling
algorithm need to approximate a given function
accurately with high probability?
  • (Ziv Bar-Yossef '03, '04) Lower bounds for the
    low rank matrix approximation problem and the
    matrix reconstruction problem.
  • Any sampling algorithm that w.h.p. finds a good
    low rank approximation requires Ω(mn) queries.
  • Even if the algorithm is given the exact weight
    distribution over the columns of a matrix it will
    still require Ω(k/ε⁴) queries.
  • Finding a matrix D such that ||A - D||_F ≤ ε ||A||_F
    requires Ω(mn) queries, and finding a D such
    that ||A - D||_2 ≤ ε ||A||_2 requires Ω(mn) queries.
  • Applied to our results:
  • The LinearTimeSVD algorithm is optimal w.r.t.
    the Frobenius-norm bounds.
  • The ConstantTimeSVD algorithm is optimal w.r.t.
    the spectral-norm bounds up to poly factors.
  • The CUR algorithm is optimal for constant ε.

54
Overview (2/3)
  • Applications of Matrix CUR
  • Designing approximation algorithms
  • Data mining
  • Microarray data
  • Recommendation Systems
  • Tensor-based data sets
  • Tensor-CUR
  • Applications of Tensor-CUR
  • Hyperspectral data
  • Recommendation systems

55
Approximating Max-Cut
Max-Cut (NP-hard for general and dense graphs,
Max-SNP): Given a graph G = (V,E), |V| = n, partition
V in two disjoint subsets V1 and V2 such that the
number of edges of E that have one endpoint in V1
and one endpoint in V2 is maximized.
Goemans & Williamson '94: .878-approximation
algorithm (might be tight!)
Arora, Karger & Karpinski '95: LP based, εn²
additive error (*). De La Vega '96: Combinatorial
methods, εn² additive error. Frieze & Kannan '96:
Linear Algebraic methods, εn² additive
error (*). Goldreich, Goldwasser & Ron '96:
Property Testing, εn² additive error. Alon, De La
Vega, Kannan & Karpinski '02, '03: Cut
Decomposition, εn² additive error (*). All the
above algorithms run in constant time/space. (*)
More generally, it achieves εn^r for all problems
in Max-r-CSP (Max-SNP).
56
Weighted Max-Cut via CUR
(D., Kannan, M. TR 04 and STACS 05) Let A
denote the adjacency matrix of G. All previous
algorithms also guarantee an
additive error approximation for the weighted
Max-Cut problem, where Amax is the maximum edge
weight in the graph.
57
The algorithm (sketch)
Compute the MaxCut of CUR, thus getting an
approximation to the MaxCut of A. It turns out
that we can solve the problem on CUR in constant
time, in constant space
58
Data mining with CUR
We are given m (> 10⁶) objects and n (> 10⁵)
features describing the objects. Database: An
m-by-n matrix A (A_ij shows the importance of
feature j for object i). E.g., m documents,
represented w.r.t. n terms. Queries: Given a new
object x, find similar objects in the database
(nearest neighbors).
59
Data mining with CUR (contd)
Two objects are close if the angle between
their corresponding (normalized) vectors is
small. So, x^T d = cos(x,d) is high when the two
objects are close. A·x computes all the angles
and answers the query.
  • Key observation: The exact value x^T d might not
    be necessary.
  • The feature values in the vectors are set by
    coarse heuristics.
  • It is in general enough to see if x^T d >
    Threshold.

60
Data mining with CUR (contd)
Assume that CUR is an approximation to A, such
that CUR is stored efficiently (e.g. in
RAM). Given a query vector x, instead of
computing A x, compute CUR x to identify its
nearest neighbors.
(Also recall how we used the matrix
multiplication algorithm to solve the proximity
problem, namely find all pairs of objects that
are close.)
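A small sketch of the query step (shapes are hypothetical): evaluating CUR·x right-to-left never forms an m-by-n matrix.

```python
import numpy as np

def cur_query(C, U, R, x):
    """Approximate A @ x as C @ (U @ (R @ x)); costs O(nr + rc + mc) instead of O(mn)."""
    return C @ (U @ (R @ x))

# Hypothetical sizes: m objects, n features, c sampled columns, r sampled rows.
m, n, c, r = 100_000, 10_000, 50, 50
C, U, R = np.random.randn(m, c), np.random.randn(c, r), np.random.randn(r, n)
x = np.random.randn(n)
scores = cur_query(C, U, R, x)          # approximate similarities of x to all m objects
top10 = np.argsort(-scores)[:10]        # candidate nearest neighbors
```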
61
Genetic microarray data and CUR
Exploit structural properties of CUR in
biological applications
Experimental conditions
Can we find a good set of genes and arrays to include
in C and R? Provable and/or heuristic strategies
are acceptable.
genes
  • Common in Biological/Chemical/Medical
    applications of PCA
  • Explain the singular vectors, by mapping them to
    meaningful biological processes.
  • This is a challenging task (think
    reification) !
  • CUR is a low-rank decomposition in terms of the
    data that practitioners understand.
  • Use it to explain the data and do dimensionality
    reduction, classification, clustering.
  • Gene microarray data M., D., Alter (UT Austin)
    (sporulation and cell cycle data).

62
Recommendation Systems
(D., Raghavan, Kerenidis, STOC 02)
The problem Assume the existence of m customers
and n products. Then, A is an (unknown) matrix
s.t. Aij is the utility of product j for customer
i. Our goal is to recreate A from a few samples,
s.t. we can recommend high utility products to
customers.
  • (Kumar, Raghavan, Rajagopalan Tomkins FOCS
    98)
  • Assuming strong clustering of the products, they
    offer competitive algorithms even with only 2
    samples/customer.
  • (Azar, Fiat, Karlin, McSherry Saia STOC 01)
  • Assuming sampling of Ω(mn) entries of A and a
    certain gap requirement, they -very- accurately
    recreate A.

63
Recommendation Systems
The problem Assume the existence of m customers
and n products. Then, A is an (unknown) matrix
s.t. Aij is the utility of product j for customer
i. Our goal is to recreate A from a few samples,
s.t. we can recommend high utility products to
customers.
Critical assumption (supported by experimental
evidence) There exist a small, constant (e.g. k)
number of different customer types or,
equivalently, the rows of the matrix A are
(approximately) linearly dependent or,
equivalently, the matrix A has a good low-rank
(e.g. k) approximation.
64
Recommendation Systems (contd)
Question: Can we get competitive performance by
sampling fewer than Ω(mn) elements?
65
Recommendation Systems (contd)
Details: Sample a constant number of rows and
columns of A and compute A ≈ CUR. Assuming that A
has a good low-rank approximation, this
implies that we can make relatively accurate
recommendations with far fewer samples.
66
Overview (2/3)
  • Applications of Matrix CUR
  • Designing approximation algorithms
  • Data mining
  • Microarray data
  • Recommendation Systems
  • Tensor-based data sets
  • Tensor-CUR
  • Applications of Tensor-CUR
  • Hyperspectral data
  • Recommendation systems

67
Datasets modeled as tensors
Goal: Extract structure from a tensor dataset A
(naively, a dataset subscripted by multiple
indices) using a small number of samples.
Q. What do we know about tensor
decompositions? A. Not much, although tensors
arise in numerous applications.
m × n × p tensor A
68
Tensors in Applications
  • Tensors appear both in Math and CS.
  • Represent high dimensional functions.
  • Connections to complexity theory (i.e., matrix
    multiplication complexity).
  • Statistical applications (i.e., Independent
    Component Analysis, higher order statistics,
    etc.).
  • Large data-set applications (e.g., Medical
    Imaging Hyperspectral Imaging)

Problem: There does not exist a definition of
tensor rank (and associated tensor SVD) with the
nice properties found in the matrix
case. (Lek-Heng Lim '05: strong impossibility
results!) Heuristic solution: unfold the
tensor along a mode and apply Linear Algebra.
69
Tensor rank
(Hastad, J of Algorithms 90, DeLaVega, Kannan,
Karpinski, Vempala STOC 05)
A definition of tensor rank: Given a tensor, find
the minimum number of rank-one tensors into which it
can be decomposed.
  • computing it is NP-hard
  • only weak bounds are known
  • tensor rank depends on the underlying ring of
    scalars
  • successive rank-one approximations are no good

70
Tensor α-rank
The α-rank of a tensor: Given a tensor, unfold it
along mode α to create the unfolded matrix A_α and
compute its rank, called the α-rank.
  • Pros
  • Easily computable.
  • Cons
  • different α-ranks are different,
  • information is lost.
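A small numpy illustration of unfolding and the α-rank (my own indexing convention; other conventions permute the remaining modes differently):

```python
import numpy as np

def unfold(T, mode):
    """Unfold a 3-mode tensor along `mode`: rows are indexed by that mode,
    columns by the two remaining modes flattened together."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def alpha_rank(T, mode):
    return np.linalg.matrix_rank(unfold(T, mode))

T = np.random.randn(6, 7, 8)
print([alpha_rank(T, a) for a in range(3)])   # the three unfoldings can have different ranks
```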
71
Tensors in real applications
  • 3 classes of tensors in data applications
  • All modes are comparable (e.g., tensor faces,
    chemometrics)
  • A priori dominant mode (e.g., coarse scales,
    microarrays vs. time, images vs. frequency)
  • All other combinations

(D. and M. 05) TensorSVD paper deals with
(1). We will focus on (2), where there is a
preferred mode.
72
The TensorCUR algorithm (3-modes)
  • Choose the preferred mode α (time)
  • Pick a few representative snapshots
  • Express all the other snapshots in terms of the
    representative snapshots.

73
The TensorCUR algorithm (contd)
2 samples
  • Let R denote the tensor of the sampled
    snapshots.
  • Express the remaining images as linear
    combinations of the sampled snapshots.

m genes
n environmental conditions
74
The TensorCUR algorithm (contd)
Theorem
Unfold R along the α dimension and pre-multiply
by CU.
Best rank-k_α approximation to A_α
  • TensorCUR
  • Framework for dealing with very large
    tensor-based data sets,
  • to extract a sketch in a principled and
    optimal manner,
  • which can be coupled with more traditional
    methods of data analysis.
  • How to proceed
  • Can recurse on each sub-tensor in R,
  • or do SVD, exact or approximate,
  • or do kernel-based diffusion analysis,
  • or do wavelet-based methods.

75
Overview (2/3)
  • Applications of Matrix CUR
  • Designing approximation algorithms
  • Data mining
  • Microarray data
  • Recommendation Systems
  • Tensor-based data sets
  • Tensor-CUR
  • Applications of Tensor-CUR
  • Hyperspectral data
  • Recommendation systems

76
TensorCUR on scientific data sets
  • Apply random sampling methodology and
    kernel-based Laplacian methods to large physical
    and chemical and biological data sets.
  • Common application areas of large data set
    analysis
  • telecommunications,
  • finance,
  • web-based modeling, and
  • astronomy.
  • Scientific data sets are quite different
  • with respect to their size,
  • with respect to their noise properties, and
  • with respect to the available field-specific
    intuition.

77
Data sets being considered
  • Sequence and mutational data from G-protein
    coupled receptors
  • to identify mutants with enhanced stability
    properties.
  • Genomic microarray data
  • to understand large-scale genomic and cellular
    behavior.
  • Hyperspectral colon cancer data
  • for improved detection of anomalous behavior.
  • Time-resolved fMRI data
  • to better represent large, complex visual
    brain-imaging data.
  • Simulational data
  • to more efficiently conduct large scale
    computations.

78
(No Transcript)
79
(No Transcript)
80
Sampling hyperspectral data
Sample slabs depending on total absorption
Sample fibers uniformly (since intensity depends
on stain).
81
Look at the exact 65-th slab.
82
The 65-th slab approximately reconstructed
This slab was reconstructed by approximate
least-squares fit to the basis from slabs 41 and
50, using 1000 (of 250K) pixels/fibers.
83
Tissue Classification - Exact Data
84
Tissue Classification - Ns = 12, Nf = 1000
85
TensorCUR on Recommendation Systems
Our previous setup Assume the existence of m
customers and n products. Then, A is an (unknown)
matrix s.t. Aij is the utility of product j for
customer i. Our goal is to recreate A from a few
samples, s.t. we can recommend high utility
products to customers.
products
Customer sample (guinea pigs)
customers
Customer sample (purchases, small surveys)
86
Recommendation Systems (contd)
Comment: It is folklore knowledge in the economics
literature that product utility is an ordinal and
not a cardinal quantity. Thus, it is more natural
to compare products than assign utility values.
Model revisited: Every customer has an n-by-n
matrix (whose entries are ±1) representing
pairwise product comparisons. Overall, there are
m such matrices, forming an n-by-n-by-m 3-mode
array or a three-dimensional tensor, denoted by
A. We seek to extract the structure of this
tensor by sampling.
87
Recommendation Systems (contd)
Our goal Recreate the tensor A from a few
samples in order to recommend high utility
products to the customers.
88
Overview (3/3)
  • Kernel-CUR and the Nystrom Method
  • Regression problems
  • Least squares problems
  • The CUR algorithm revisited
  • A general bound
  • Iterative sampling of rows/columns
  • Optimal choices of rows/columns
  • Application finding ht-SNPs
  • Conclusions open problems

89
Motivation for Kernels (1 of 3)
  • Methods to extract linear structure from the
    data
  • Support Vector Machines (SVMs).
  • Gaussian Processes (GPs).
  • Singular Value Decomposition (SVD) and the
    related PCA.
  • Kernel-based learning methods to extract
    non-linear structure
  • Choose features to define a (dot product) space
    F.
  • Map the data, X, to F by Φ: X → F.
  • Do classification, regression, and clustering in
    F with linear methods.

90
Motivation for Kernels (2 of 3)
  • Use dot products for information about mutual
    positions.
  • Define the kernel or Gram matrix:
    G_ij = k_ij = (Φ(X^(i)), Φ(X^(j))).
  • Algorithms that are expressed in terms of dot
    products may be given as input the Gram matrix G
    instead of the data covariance matrix XTX.

Note: Isomap, LLE, graph Laplacian eigenmaps,
Hessian eigenmaps, SDE (dimensionality reduction
methods for nonlinear manifolds) are kernel-PCA
for particular Gram matrices. Note: for Mercer
kernels, G is SPSD.
91
Motivation for Kernels (3 of 3)
  • If the Gram matrix G, where G_ij = k_ij = (Φ(X^(i)),
    Φ(X^(j))), is dense but has low numerical rank,
    then calculations of interest still need O(n²)
    space and O(n³) time:
  • matrix inversion in GP prediction,
  • quadratic programming problems in SVMs,
  • computation of eigendecomposition of G.
  • Relevant recent work using low-rank methods
  • (Achlioptas, McSherry, and Schölkopf, NIPS 02),
    randomized kernels.
  • Nystrom method for out-of-sample extensions
    (Williams and Seeger, NIPS '01, and others).

92
Kernel-CUR
(D. M., COLT 05, TR 05)
  • Main algorithm
  • Randomized algorithm to approximate a Gram
    matrix.
  • Low-rank approximation in terms of columns (and
    rows) of G = X^T X.
  • Main quality-of-approximation theorem
  • Provably good approximation if nonuniform
    probabilities are used.
  • Discussion of the Nystrom method
  • Nystrom method for integral equations and matrix
    problems.
  • Relationship to randomized SVD and CUR
    algorithms.

93
Kernel-CUR Algorithm
Input: n × n SPSD matrix G, probabilities p_i,
i = 1,...,n, c < n, and k < c. Output: n × c
matrix C, and c × c matrix W_k (s.t. C W_k⁺ C^T ≈ G).
  • Algorithm
  • Pick c columns of G in i.i.d. trials, with
    replacement and with respect to the p_i; let L
    be the set of indices of the sampled columns.
  • Scale each sampled column (with index i ∈ L) by
    dividing it by (c·p_i)^{1/2}.
  • Let C be the n × c matrix containing the
    rescaled sampled columns.
  • Let W be the c × c submatrix of G with entries
    G_ij/(c·p_i^{1/2}·p_j^{1/2}), i, j ∈ L.
  • Compute W_k, the best rank-k approximation to W.
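A hedged numpy sketch of the algorithm above (my own implementation; using the pseudoinverse of the rank-k piece of W in the final approximation is an assumption about how W_k enters):

```python
import numpy as np

def kernel_cur(G, c, k, rng=np.random.default_rng()):
    """Nystrom-style approximation of an SPSD Gram matrix: G ~ C @ Wk_pinv @ C.T."""
    n = G.shape[0]
    diag_sq = np.diag(G) ** 2
    p = diag_sq / diag_sq.sum()                       # p_i = G_ii^2 / sum_i G_ii^2
    idx = rng.choice(n, size=c, p=p)
    scale = 1.0 / np.sqrt(c * p[idx])
    C = G[:, idx] * scale                             # n x c rescaled columns
    W = G[np.ix_(idx, idx)] * np.outer(scale, scale)  # c x c rescaled intersection block
    # Best rank-k approximation of W, then its pseudoinverse.
    vals, vecs = np.linalg.eigh(W)
    order = np.argsort(vals)[::-1][:k]
    inv_vals = np.array([1.0 / v if v > 1e-12 else 0.0 for v in vals[order]])
    Wk_pinv = (vecs[:, order] * inv_vals) @ vecs[:, order].T
    return C, Wk_pinv

X = np.random.randn(1000, 20)
G = X @ X.T                                           # SPSD Gram matrix (linear kernel)
C, Wk_pinv = kernel_cur(G, c=80, k=20)
print(np.linalg.norm(G - C @ Wk_pinv @ C.T) / np.linalg.norm(G))
```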

94
Notes on the algorithm
  • Note the structural simplicity of the algorithm
  • C consists of a small number of representative
    data points.
  • W consists of the induced subgraph defined by
    those points.
  • Computational resource requirements
  • Assume the data X (or Gram matrix G) are stored
    externally.
  • Algorithm performs two passes over the data.
  • Algorithm uses O(n) additional scratch space and
    additional computation time.

95
Analysis of Kernel-CUR
Let ε > 0 and η = O(log(1/δ)). Construct an
approximation C W_k⁺ C^T with the Kernel-CUR
algorithm by sampling c columns of G with
probabilities p_i = G_ii² / Σ_i G_ii². If c = Ω(k
log(1/δ)/ε⁴), then with probability at least 1-δ,
96
Interpreting the sampling probabilities
  • If the sampling probabilities were
  • p_i = ||G^(i)||² / ||G||_F²
  • they would provide a bias towards data points
    that are more important - longer and/or more
    representative.
  • the additional error would be ε||G||_F and not
    ε Σ_i G_ii² (where G = X^T X).

Our sampling probabilities ignore correlations:
p_i = G_ii² / Σ_i G_ii², which depend only on the
diagonal of G, i.e., on the column lengths of X when G = X^T X.
97
The Nystrom Method (1 of 3)
Consider the eigenfunction problem
Discretize with a quadrature rule (wj are the
weights and sj are the quadrature points)
98
The Nystrom Method (2 of 3)
99
The Nystrom Method (3 of 3)
  • Randomized SVD Algorithms
  • Randomly sample columns (or rows).
  • Compute/approximate low-dimensional singular
    vectors.
  • Nystrom-extend to approximate H_k, the
    high-dimensional singular vectors.
  • Bound: ||A - H_k H_k^T A||_{2,F} ≤ ||A - A_k||_{2,F} +
    ε||A||_F.
  • Randomized CUR Algorithms
  • Randomly sample columns and rows.
  • Bound: ||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε||A||_F.
  • Do not need or use the SPSD property.

100
Fast Computation with Kernels
Adjacency matrix, t0
Kernel-based diffusion
  • To construct a coarse-grained version of the data
    graph
  • Construct landmarks,
  • Partition/Quantization,
  • Diffusion wavelets.

101
Overview (3/3)
  • Kernel-CUR and the Nystrom Method
  • Regression problems
  • Least squares problems
  • The CUR algorithm revisited
  • A general bound
  • Iterative sampling of rows/columns
  • Optimal choices of rows/columns
  • Application finding ht-SNPs
  • Conclusions open problems

102
Regression problems
(D., M., & Muthukrishnan, '05)
We seek sampling-based algorithms for solving l2
regression. We are interested in overconstrained
problems, n >> d. Typically, there is no x such
that Ax = b.
103
Induced regression problems
sampled rows of b
sampled rows of A
104
Regression problems, definition
There is work by K. Clarkson in SODA 2005 on
sampling-based algorithms for l1 regression (p =
1) for overconstrained problems.
105
Exact solution
106
Singular Value Decomposition (SVD)
U (V): orthogonal matrix containing the left
(right) singular vectors of A. Σ: diagonal matrix
containing the singular values of A. ρ: rank of A.
107
Questions
Can sampling methods provide accurate estimates
for l2 regression? Is it possible to approximate
the optimal vector and the optimal value Z2
by only looking at a small sample from the
input? (Even if it takes some sophisticated
oracles to actually perform the sampling.)
Equivalently, is there an induced subproblem of
the full regression problem, whose optimal
solution and its value Z2,s approximates the
optimal solution and its value Z2?
108
Creating an induced subproblem
  • Algorithm
  • Fix a set of probabilities p_i, i = 1..n, summing up
    to 1.
  • Pick r indices from {1..n} in r i.i.d. trials,
    with respect to the p_i's.
  • For each sampled index j, keep the j-th row of A
    and the j-th element of b; rescale both by
    (1/(r·p_j))^{1/2}.
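A minimal sketch of the induced subproblem (illustrative only): here the probabilities are taken proportional to the squared row norms of U, computed exactly via an SVD of A, which is the expensive O(nd²) route discussed a few slides later.

```python
import numpy as np

def sampled_l2_regression(A, b, r, rng=np.random.default_rng()):
    """Solve a rescaled, row-sampled least-squares problem instead of the full one."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)   # n x d left singular vectors
    lev = np.sum(U * U, axis=1)                        # squared row norms of U
    p = lev / lev.sum()
    idx = rng.choice(A.shape[0], size=r, p=p)
    w = 1.0 / np.sqrt(r * p[idx])
    x_s, *_ = np.linalg.lstsq(A[idx] * w[:, None], b[idx] * w, rcond=None)
    return x_s

n, d = 100_000, 20
A = np.random.randn(n, d)
b = A @ np.random.randn(d) + 0.1 * np.random.randn(n)
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
x_samp = sampled_l2_regression(A, b, r=2 * d * d)      # r = O(d^2) rows
print(np.linalg.norm(x_full - x_samp) / np.linalg.norm(x_full))
```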

109
The induced subproblem
110
Our results
If the p_i satisfy certain conditions, then with
probability at least 1-δ,
111
Our results, contd
If the p_i satisfy certain conditions, then with
probability at least 1-δ,
κ(A): condition number of A
The sampling complexity is
112
Back to induced subproblems
sampled elements of b, rescaled
sampled rows of A, rescaled
The relevant information for l2 regression if n
>> d is contained in an induced subproblem of
size O(d²)-by-d. (Upcoming writeup: we can reduce
the sampling complexity to r = O(d).)
113
Conditions on the probabilities, SVD
U (V): orthogonal matrix containing the left
(right) singular vectors of A. Σ: diagonal matrix
containing the singular values of A. ρ: rank of A.
Let U_(i) denote the i-th row of U. Let U^⊥ ∈ R^{n ×
(n-ρ)} denote the orthogonal complement of U.
114
Conditions for the probabilities
The conditions that the p_i must satisfy, for some
β_1, β_2, β_3 ∈ (0,1]:
115
Computing good probabilities
In O(nd²) time we can easily compute p_i's that
satisfy all three conditions, with β_1 = β_2 = β_3 =
1/3. (Too expensive!)
Open question: can we compute good
probabilities faster, in a pass-efficient manner?
Some assumptions might be acceptable (e.g.,
bounded condition number of A, etc.)
116
Overview (3/3)
  • Kernel-CUR and the Nystrom Method
  • Regression problems
  • Least squares problems
  • The CUR algorithm revisited
  • A general bound
  • Iterative sampling of rows/columns
  • Optimal choices of rows/columns
  • Application finding ht-SNPs
  • Conclusions open problems

117
Revisiting the CUR decomposition
(D., M. Muthukrishnan, 05)
Create an approximation to A, using rows and
columns of A
Goal: provide (good) bounds for some norm of the
error matrix A - CUR
  1. How do we draw the rows and columns of A to
    include in C and R?
  2. How do we construct U?

118
A simple algorithm
Create an approximation to A, using rows and
columns of A
Algorithm: Step 1: pick a few columns of A and
include them in C (basis columns). (Any set of
columns will do; good choices will be discussed
later.) Step 2: express all columns of A as
linear combinations of the basis columns.
119
Step 2
basis columns
i-th column
The second step is trivial; it amounts to solving
an l2 regression problem. Unfortunately, it
necessitates looking at every element of A^(i) and
expresses A as CC⁺A.
Can we approximately find the vector of
coefficients by looking only at a few elements of
A^(i)? Yes!
120
Step 3
i-th column of R
D: rescaling, diagonal matrix
sampled rows of C
Pick a few (r) rows of A and solve induced
sampled l2 regression problems using these rows
for all i ∈ 1..n (every column of A). (Not
straightforward from the l2 regression results.)
121
Error bounds (given C)
In O(c²m + cmn) = O(mn) time, we can construct D
and R such that the error bound holds with probability at
least 1-δ. We need to pick r = O(c² log(1/δ)/ε²)
rows. (Upcoming writeup: we can reduce c² to c.)
Constructing D and R: we compute a set of
probabilities p_i and sample and rescale rows of A
in r i.i.d. trials with respect to the p_i's.
122
Forming the pi (1/3)
U (V): orthogonal matrix containing the left
(right) singular vectors of C. Σ: diagonal matrix
containing the singular values of C. ρ: rank of C.
Let U_(i) denote the i-th row of U. Let U^⊥ ∈ R^{m ×
(m-ρ)} denote the orthogonal complement of U.
123
Forming the pi (2/3)
Compute p_i that satisfy, for some β_1, β_2, β_3 ∈
(0,1]:
lengths of rows of matrix of left singular
vectors of C
Component of A not in the span of the columns of C
124
Forming the pi (3/3)
In O(c²m + mnc) = O(mn) time we can easily
compute p_i that satisfy all the constraints, with
β_1, β_2, β_3 all equal to 1/3.
125
Overall decomposition
columns of A
rows of A
intersection of C and R (r × c)
126
Overall decomposition (a variant)
columns of A
rows of A
intersection of C and R
(Goreinov, Tyrtyshnikov, & Zamarashkin, LAA '98)
Error bounds if C and R contain the columns and
rows of A that define a parallelepiped of maximal
volume.
127
A (first) choice for C
(D., Frieze, Kannan, Vempala & Vinay '99, '03, and
D., Kannan, & M., TR '04, SICOMP '05) (For any k)
Assume A_k is the best rank-k approximation to A
(through the SVD). Then, after making two passes
through A, we can construct a matrix C such that,
with probability at least 1-δ,
We need to pick O(k log(1/δ) / ε²) columns to
include in C. (Columns are picked with
probability proportional to their Euclidean
length squared; recall the SVD algorithm early in
this tutorial.)
128
CUR error
(For any k) In O(mn) time, we can construct C, U,
and R such that the error bound holds with probability at
least 1-δ, by picking O(k log(1/δ) / ε²)
columns, and O(k² log³(1/δ) / ε⁶) rows.
129
A (second) choice for C
(Rademacher, Vempala & Wang '04; D. & M., TR
'05) (For any k) Assume A_k is the best rank-k
approximation to A (through the SVD). Then, after
making 2t passes through A, we can construct a
matrix C such that, with probability at least 1-δ,
We need to pick columns of A in t rounds, picking
O(k log(1/δ) / ε²) columns per round.
130
The MultiPass sampling algorithm
  • R = A; S = empty set
  • For i = 1..t
  • Pick c columns from R; each column is sampled
    with probability proportional to its Euclidean
    length squared.
  • S = S ∪ {sampled columns from step 1}
  • R = A - P_S A
  • End For
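A numpy sketch of this adaptive scheme (my own rendering of the loop above; the projection P_S is computed from an orthonormal basis of the selected columns):

```python
import numpy as np

def multipass_columns(A, c, t, rng=np.random.default_rng()):
    """Pick c columns per round for t rounds, sampling each round from the residual
    R = A - P_S A with probabilities proportional to squared residual column norms."""
    R = A.copy()
    selected = []
    for _ in range(t):
        norms_sq = np.sum(R * R, axis=0)
        p = norms_sq / norms_sq.sum()
        selected.extend(rng.choice(A.shape[1], size=c, p=p))
        Q, _ = np.linalg.qr(A[:, selected])      # orthonormal basis of span(S)
        R = A - Q @ (Q.T @ A)                    # residual: part of A not yet spanned
    return np.array(selected)

A = np.random.randn(300, 2000)
cols = multipass_columns(A, c=10, t=3)
print(len(cols))
```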

131
Intuition
  • R = A; S = empty set
  • For i = 1..t
  • Pick c columns from R; each column is sampled
    with probability proportional to its Euclidean
    length squared.
  • S = S ∪ {sampled columns from step 1}
  • R = A - P_S A
  • End For

A simple idea this step allows us to focus on
the subspace of A that is not sufficiently
spanned by the current selection of columns.
132
CUR error
(For any k) In O(mnt) time, we can construct C,
U, and R such that the error bound holds with probability at
least 1-δ, by picking O(t k log(1/δ) / ε²)
columns, and O(t² k² log³(1/δ) / ε⁶) rows.
133
A (third) choice for C
(Rademacher, Vempala & Wang '04) (For any k)
Assume A_k is the best rank-k approximation to A
(through the SVD). Then, there exist c = O(k²/ε²)
columns of A such that
(It seems possible to use the volume work by
Goreinov et al. to design randomized strategies
to pick such columns.)
134
CUR error
(For any k) In O(nc + mn) time, we can construct
C, U, and R such that the error bound holds with probability
at least 1-δ, by picking O(k² / ε²) columns,
and O(k⁴ log(1/δ) / ε⁶) rows. Notice that we
can pick the rows efficiently, given the columns.
135
Example finding ht-SNPs
(Data from K. Kidd's lab at Yale University;
joint work with Dr. Paschou at Yale University)
Single Nucleotide Polymorphisms: the most common
type of genetic variation in the
genome. (Locations in the human genome where
typically two alternate nucleotide bases are
observed.)
Why are SNPs important? They occur quite
frequently within the genome, allowing the
tracking of disease genes and population
histories. Thus, they are effective markers for
genomic research! There are 10 million SNPs in
the human genome (under some assumptions).
136
Research topics
Research Topics (working within a specific
chromosomal region of the human genome): (i)
Are different SNPs correlated, either within a
population, or across different populations?
(Folklore knowledge: yes.) (ii) Find a good
set of haplotype tagging-SNPs capturing the
diversity of a chromosomal region of the human
genome. (iii) Is extrapolation feasible? (Recall
Nystrom/CUR extrapolation.) This would save (a
lot of) time and money!
Existing literature: Pairwise metrics of SNP
correlation, called LD (linkage disequilibrium)
distance, based on nucleotide frequencies and
co-occurrences. Almost no metrics exist for
measuring correlation between more than 2 SNPs,
and LD is very difficult to generalize. Exhaustive
and semi-exhaustive algorithms are used in order to pick
good ht-SNPs that have small LD distance with
all other SNPs. Using Linear Algebra: an SVD-based
algorithm was proposed by Lin & Altman, Am.
J. Hum. Gen. 2004.
137
The raw data
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT TT
GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG
CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG
TT TT CC GG TT GG GG TT GG AA GG TT TT GG TT CC
CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG
CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GG GT GA AG GG TT TT GG TT CC CC CC CC
GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC
AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG
GT GT GA AG GG TT TT GG TT CC CC CC CC GG AA GG
GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA
GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG
AA GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA
CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG
CT CG CG CG AT CT CT AG CT AG GG TT GG AA GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG
GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GT TT GG AA
individuals
  • Notes
  • Each SNP genotype consists of two alleles.
  • Observed SNPs are in known positions in the human
    genome.
  • Exactly two nucleotides appear in each column.

138
Encoding the data
SNPs
0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0
1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1
-1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1
-1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0
1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1
0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1
-1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1
0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1
0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1
1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0
-1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0
0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1
1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1
-1 0 -1 -1 1
individuals
  • How?
  • Exactly two nucleotides (out of A,G,C,T) appear
    in each column.
  • Thus, the two alleles might be both equal to the
    first one (encode by 1), both equal to the
    second one (encode by -1), or different (encode
    by 0).
  • Notes
  • Order of the alleles is irrelevant, so TG is
    the same as GT.
  • Encoding, e.g., GG to 1 and TT to -1 is not
    any different (for our purposes) from encoding
    GG to -1 and TT to 1.
  • (Recall that flipping the signs of the columns of
    a matrix only changes the signs of the left
    singular vectors.)
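A small sketch of this encoding rule (a hypothetical helper, for illustration):

```python
import numpy as np

def encode_snp_column(genotypes):
    """Encode one SNP (a column of two-letter genotypes) into {+1, 0, -1}:
    +1 = homozygous in the first observed allele, -1 = homozygous in the second,
     0 = heterozygous. Allele order within a genotype ('TG' vs 'GT') is ignored."""
    alleles = sorted({a for g in genotypes for a in g})   # exactly two letters expected
    first = alleles[0]
    out = []
    for g in genotypes:
        if g[0] == g[1]:
            out.append(1 if g[0] == first else -1)
        else:
            out.append(0)
    return np.array(out)

print(encode_snp_column(["AG", "AA", "GG", "GA"]))   # -> [0, 1, -1, 0]
```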

139
Evaluating (linear) structure
  • For each population
  • We examined samples from 38 different
    populations, with an average size of 50 subjects.
    For each subject, we were given 63 SNPs, covering
    a chromosomal region of the human genome which is
    ~900,000 bases long, close to the end of the
    long arm of chromosome 17.
  • For each population, we ran SVD to determine the
    optimal number k of principal components that
    are necessary in order to cover 90% of its
    spectrum.

140
Algorithms
  • We ran various schemes to select good columns
    (i.e., ht-SNPs).
  • A greedy Multi-Pass heuristic scheme gave the
    best results.
  • (Select one column in each round, subtracting
    from A the projection of A on this column and
    repeating.)
  • Notice that some provable results exist for
    similar algorithms.

Nice feature: SVD provides a non-trivial (maybe
not achievable) lower bound. In many cases, the
lower bound is attained by the greedy
heuristic! (In our data, at most k = 5 columns
suffice to extract 90% of the structure.)
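A sketch of the greedy multi-pass scheme described above (my own implementation; the per-round rule of picking the column with the largest residual norm is an assumption):

```python
import numpy as np

def greedy_columns(A, k):
    """Select k columns one at a time; after each pick, subtract from the residual
    its projection onto the chosen column, then pick the column with the largest
    remaining norm."""
    R = A.astype(float).copy()
    picked = []
    for _ in range(k):
        j = int(np.argmax(np.sum(R * R, axis=0)))
        picked.append(j)
        col = R[:, [j]]
        R = R - col @ (col.T @ R) / (col.T @ col)   # remove the direction of column j
    return picked

A = np.random.randn(50, 63)          # e.g., 50 subjects x 63 encoded SNPs
print(greedy_columns(A, 5))
```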
141
America
Oceania
Asia
Europe
Africa
142
Extrapolation
Given a small number of SNPs for all subjects,
and all SNPs for some subjects, extrapolate the
values of the missing SNPs.
SNPs
Training data (for a few subjects, we are
given all SNPs)
individuals
SNP sample (for all subjects, we are given a
small number of SNPs)
143
Extrapolation
We split our data in training (90%) and test
(10%). The training set corresponds to R; for the
test set we are given data on a small number of
SNPs, picked using our greedy multipass heuristic
on the training set.
144
Rondonian Surui
145
Arizona Pima
146
Atayal
147
Samaritan
148
European American
149
African American
150
Overview (3/3)
  • Kernel-CUR and the Nystrom Method
  • Regression problems
  • Least squares problems
  • The CUR algorithm revisited
  • A general bound
  • Iterative sampling of rows/columns
  • Optimal choices of rows/columns
  • Application finding ht-SNPs
  • Conclusions open problems

151
Conclusions Open Problems
  • Empirical evaluation of different sampling
    probabilities
  • Uniform
  • Row/column lengths squared
  • Lengths of the rows of the left/right singular
    vectors
  • Inverse squares of row/column lengths to pick
    outliers
  • Others, depending on the problem to be solved?
  • Empirical evaluation for different learning
    problems
  • Matrix reconstruction, regression,
    classification, clustering, etc.
  • Use CUR for improved interpretability for data
    matrices
  • Structural vs. algorithmic issues
  • Use CWCT for improved interpretability of Gram
    Matrices
  • Structural vs. algorithmic issues

152
Conclusions Open Problems
  • Impose other structural properties in CUR-type
    decompositions
  • Non-negativity
  • Element quantization, e.g., to 0,1,-1
  • Block-SVD type structure
  • Robust heuristics and robust extensions
  • Especially for noisy data
  • Quasi Monte-Carlo vs. Monte-Carlo vs.
    Deterministic heuristics of the sketch for
    different data sets
  • Especially for not-so-large datasets
  • Relate the pass-efficient framework to practice
  • passes over the data
  • a priori knowledge regarding data generation
  • oracle based models