Title: VLDB06 TUTORIAL: Randomized Algorithms for Matrices and Massive Data Sets
1VLDB06 TUTORIAL Randomized Algorithms for
Matrices and Massive Data Sets
Michael W. Mahoney Yahoo! Research
Petros Drineas CS - RPI
Tutorial given at VLDB 2006 in Seoul, Korea.
http//www.cs.yale.edu/homes/mmahoney http//www.
cs.rpi.edu/drinep
2Randomized Linear Algebra Algorithms
- Goal To develop and analyze (fast) Monte Carlo
algorithms for performing useful computations on
large (and later not so large!) matrices and
tensors. - Matrix Multiplication
- Computation of the Singular Value Decomposition
- Computation of the CUR Decomposition
- Testing Feasibility of Linear Programs
- Least Squares Approximation
- Tensor computations SVD generalizations
- Tensor computations CUR generalization
- Such computations generally require time which is
superlinear in the number of nonzero elements of
the matrix/tensor, e.g., O(n3) for n x n matrices.
3Example the CUR decomposition
Algorithmic Motivation To speed up computations
in applications where extremely large data sets
are modeled by matrices and, e.g., O(n2) space
and O(n3) time is not an option. Structural
Motivation To reveal novel structural properties
of the datasets, given sufficient computational
time, that are useful in applications.
Goal make A-CUR small.
Why? (Algorithmic) After making two passes over
A, we can compute provably good C, U, and R and
store them (sketch) instead of A O(mn) vs.
O(n2) space.
Why? Given a sample consisting of a few columns
(C) and a few rows (R) of A, we can compute U and
reconstruct A as CUR. If the sampling
probabilities are not too bad, we get provably
good accuracy.
Why? (Structural) Given sufficient time, we can
find C, U and R such that A CUR is very
small. This might lead to better understanding
of the data.
4Applications of such algorithms
- Matrices arise, e.g., since m objects (documents,
genomes, images, web pages), each with n
features, may be represented by an m x n matrix
A. - Covariance Matrices
- Latent Semantic Indexing
- DNA Microarray Data
- Eigenfaces and Image Recognition
- Similarity Queries
- Matrix Reconstruction
- LOTS of other data applications!!
- More generally,
- Linear and Nonlinear Programming Applications
- Design of Approximation Algorithms
- Statistical Learning Theory Applications
5Overview (1/2)
- Data Streaming Models and Random Sampling
- Matrix Multiplication
- Singular Value Decomposition
- CUR Matrix Decomposition
- Applications of Matrix CUR
- Data mining
- DNA microarray (and DNA SNP) data
- Recommendation Systems
- Kernel-CUR and the Nystrom Method
6Overview (2/2)
- Tensor-based data sets
- Tensor-CUR
- Hyperspectral data
- Recommendation systems
-
- From Very-Large to Medium-Sized Data
- Relative-error CX and CUR Matrix Decompositions
- L2 Regression Problems
- Application to DNA SNP Data
- Conclusions and Open Problems
7The Pass Efficient Model
- Motivation Amount of disk/tape space has
increased enormously RAM and computing speeds
have increased less rapidly. - Can store large amounts of data, but
- Cannot process these data with traditional
algorithms. - In the Pass-Efficient Model
- Data are assumed to be stored on disk/tape.
- Algorithm has access to the data via a pass over
the data. - Algorithm is allowed additional RAM space and
additional computation time. - An algorithm is pass-efficient if it uses a small
constant number of passes and sublinear
additional time and space to compute a
description of the solution. - Note If data are an m x n matrix A, then
algorithms which require additional time and
space that is O(mn) or O(1) are pass-efficient.
8Random Sampling
- Random Sampling and Randomized Algorithms
- Better complexity properties (randomization as a
resource). - Simpler algorithms and/or analysis (maybe
de-randomize later). - Uniform Sampling
- Typically things work in expectation, but poor
variance properties. - Non-uniform Sampling
- With good probabilities, can make the variance
small.
9Overview (1/2)
- Data Streaming Models and Random Sampling
- Matrix Multiplication
- Singular Value Decomposition
- CUR Matrix Decomposition
- Applications of Matrix CUR
- Data mining
- DNA microarray (and DNA SNP) data
- Recommendation Systems
- Kernel-CUR and the Nystrom Method
10Approximating Matrix Multiplication
(D. Kannan FOCS 01, and D., Kannan, M. TR
04, SICOMP 06) Problem Statement Given an
m-by-n matrix A and an n-by-p matrix B,
approximate the product AB, OR,
equivalently, Approximate the sum of n rank-one
matrices.
Each term in the summation is a rank-one matrix
11by random sampling
- Algorithm
- Fix a set of probabilities pi, i1,,n, summing
up to 1. - For t1 up to s,
- set jt i, where Pr(jt i) pi
- (Pick s terms of the sum, with replacement, with
respect to the pi.) - Approximate AB by the sum of the s terms, after
scaling.
12Random sampling (contd)
Keeping the terms j1, j2, js.
13The algorithm (matrix notation)
14Simple Lemmas
- The input matrices are given in sparse
unordered representation e.g., their non-zero
entries are presented as triples (i, j, Aij) in
any order. - The expectation of CR (element-wise) is AB.
- Our nonuniform sampling minimizes the variance of
the estimator. - It is easy to implement the sampling in two
passes. - If the matrices are dense the algorithm runs in
O(smp) time, instead of O(nmp) time, - It requires O(smsp) RAM space.
- Does not tamper with the sparsity of the matrices.
15Error Bounds
For the above algorithm,
For the above algorithm, with probability at
least 1-?,
- This is a relative error bound if ABF
?(AF BF), i.e. if there is not much
cancellation in the multiplication. - We removed the expectation (by applying a
martingale argument) and so have an extra
log(1/?) factor. - Markovs inequality would also remove the
expectation, introducing an extra 1/? factor.
16Special case B AT
If B AT, then the sampling probabilities are
Also, R CT, and the error bounds are
17Special case B AT (contd)
(Rudelson Vershynin 04, Vershynin 04)
Improvement for the spectral norm bound for the
special case B AT.
- Uses a result of M. Rudelson for random vectors
in isotropic position. - Tight concentration results can be proven using
Talagrands theory. - The sampling procedure is slightly different s
columns/rows are kept in expectation, i.e.,
column i is picked with probability
18Overview (1/2)
- Data Streaming Models and Random Sampling
- Matrix Multiplication
- Singular Value Decomposition
- CUR Matrix Decomposition
- Applications of Matrix CUR
- Data mining
- DNA microarray (and DNA SNP) data
- Recommendation Systems
- Kernel-CUR and the Nystrom Method
19Singular Value Decomposition (SVD)
U (V) orthogonal matrix containing the left
(right) singular vectors of A. S diagonal matrix
containing the singular values of A.
- Exact computation of the SVD takes O(minmn2 ,
m2n) time. - The top few singular vectors/values can be
approximated faster (Lanczos/ Arnoldi methods).
20Rank k approximations (Ak)
- Uk (Vk) orthogonal matrix containing the top k
left (right) singular vectors of A. - S k diagonal matrix containing the top k
singular values of A. - Also, AkUkUkTA.
Ak is a matrix of rank k such that A-Ak2,F is
minimized over all rank k matrices! This property
of very useful in the context of Principal
Component Analysis.
21Approximating SVD in O(n) time
- (D., Frieze, Kannan, Vempala Vinay SODA 99,
JML 04, D. Kannan, M. TR 04, SICOMP 06) - Given m x n matrix A
- Sample c columns from A and rescale to form the
m x c matrix C. - Compute the m x k matrix Hk of the top k left
singular vectors of C.
Structural Theorem For any probabilities and
number of columns A-HkHkTA2,F2
A-Ak2,F2 2vkAAT-CCTF Algorithmic
Theorem If pi A(i)2/AF2 and c 4?2k/?2,
then A-HkHkTA2,F2 A-Ak2,F2
?AF2. Proof via matrix multiplication
theorem and matrix perturbation theory.
22Example of randomized SVD
A
Title
C\Petros\Image Processing\baboondet.eps
Creator
MATLAB, The Mathworks, Inc.
Preview
This EPS picture was not saved
with a preview included in it.
Comment
This EPS picture will print to a
PostScript printer, but not to
other types of printers.
Original matrix
After sampling columns
Compute the top k left singular vectors of the
matrix C and store them in the 512-by-k matrix Hk.
23Example of randomized SVD (contd)
Title
C\Petros\Image Processing\baboondet.eps
Creator
MATLAB, The Mathworks, Inc.
Preview
This EPS picture was not saved
with a preview included in it.
Comment
This EPS picture will print to a
PostScript printer, but not to
other types of printers.
A
HkHkTA
A and HkHkTA are close.
24Element-wise sampling
(Achlioptas McSherry, STOC 01, JACM 05)
- The Algorithm in 2 lines
- To approximate a matrix A, keep a few elements
of the matrix (instead of rows or columns) and
zero out the remaining elements. - Compute a rank k approximation to this sparse
matrix (using Lanczos methods).
A-S2 is bounded ! (i) the singular values of
A and S are close, and (ii, under additional
assumptions) the top k left (right) singular
vectors of S span a subspace that is close the to
subspace spanned by the top k left (right)
singular vectors of A.
25Overview (1/2)
- Data Streaming Models and Random Sampling
- Matrix Multiplication
- Singular Value Decomposition
- CUR Matrix Decomposition
- Applications of Matrix CUR
- Data mining
- DNA microarray (and DNA SNP) data
- Recommendation Systems
- Kernel-CUR and the Nystrom Method
26CX and CUR matrix decompositions
Recall Matrices are about their rows and
columns. Recall Low-rank matrices have
redundancy in their rows and columns. Def A CX
matrix decomposition is a low-rank approximation
explicitly expressed in terms of a small number
of columns of the original matrix A (e.g., PCA
CCA). Def A CUR matrix decomposition is a
low-rank approximation explicitly expressed in
terms of a small number of columns and rows of
the original matrix A.
27A novel CUR matrix decomposition
(D. Kannan, SODA 03, D., Kannan, M. TR 04,
SICOMP 06)
Create an approximation to the original matrix
of the following form
28The CUR decomposition
Given a large m-by-n matrix A (stored on disk),
compute a decomposition CUR of A such that
- C consists of c O(k/?2) columns of A.
- R consists of r O(k/?2) rows of A.
- C (R) is created using importance sampling, e.g.
columns (rows) are picked in i.i.d. trials with
respect to probabilities
- C, U, R can be stored in O(mn) space, after
making two passes through the entire matrix A,
using O(mn) additional space and time.
- The product CUR satisfies (with high
probability)
29Computing U
- Intuition (which can be formalized - see later)
- The CUR algorithm expresses every row of the
matrix A as a linear combination of a small
subset of the rows of A. - This small subset consists of the rows in R.
- Given a row of A say A(i) the algorithm
computes a good fit for the row A(i) using the
rows in R as the basis, by approximately solving
But, only c O(1) elements of the i-th row are
given as input. So, we only approximate the
optimal vector u instead of computing it
exactly. Actually, the pass-efficient CUR
decomposition approximates the approximation.
30Error bounds for CUR
Assume Ak is the best rank k approximation to
A (through SVD). Then, if we pick O(k/?2) rows
and O(k/?2) columns,
31Previous CUR-type decompositions
(For details see Drineas Mahoney, A Randomized
Algorithm for a Tensor-Based Generalization of
the SVD, 05.)
32Overview (1/2)
- Data Streaming Models and Random Sampling
- Matrix Multiplication
- Singular Value Decomposition
- CUR Matrix Decomposition
- Applications of Matrix CUR
- Data mining
- DNA microarray (and DNA SNP) data
- Recommendation Systems
- Kernel-CUR and the Nystrom Method
33CUR application Data Mining
- Database An m-by-n matrix A, e.g., m (gt106)
objects and n(gt105) features. - Queries Given a new object x, find similar
objects (nearest neighbors) in A. - Closeness Two normalized objects x and y are
close xTd cos(x,d) is high. - Given a query vector x, the matrix product Ax
computes all the angles/distances. - Key observation The exact value xT d might not
be necessary. - The feature values in the vectors are set by
coarse heuristics. - It is in general enough to see if xT d gt
Threshold. - Algorithm Given a query vector x, compute CUR
x to identify nearest neighbors. - Theorem We have a bound on the worst case of x
using CUR instead of A
34CUR application Genetic Microarray Data
Exploit structural properties of CUR in
biological applications
Experimental conditions
Find a good set of genes and arrays to include
in C and R? Provable and/or heuristic strategies
are acceptable.
genes
- Common in Biological/Chemical/Medical
applications of PCA - Explain the singular vectors, by mapping them to
meaningful biological processes. - This is a challenging task (think
reification) ! - CUR is a low-rank decomposition in terms of the
data that practitioners understand. - Use it to explain the data and do dimensionality
reduction, classification, clustering. - Gene microarray data M., D., Alter (UT Austin)
(sporulation and cell cycle data).
35CUR application Recommendation Systems
(D., Raghavan, Kerenidis, STOC 02)
- The problem m customers and n products Aij is
the (unknown) utility of product j for customer
i. - The goal recreate A from a few samples to
recommend high utility products. - (KRRT98) Assuming strong clustering of the
products, competitive algorithms even with only 2
samples/customer. - (AFKMS01) Assuming sampling of ?(mn) entries of
A and a gap requirement, accurately recreate A. - Question Can we get competitive performance by
sampling o(mn) elements? - Answer Apply the CUR decomposition
36Kernel-CUR Motivation
- Kernel-based learning methods to extract
non-linear structure - Choose features to define a (dot product) space
F. - Map the data, X, to F by ? X -gt F.
- Do classification, regression, and clustering in
F with linear methods (SVMs,GPs,SVD). - If the Gram matrix G, where Gijkij(?(X(i)),
?(X(j))), is dense but has low numerical rank,
then calculations of interest need O(n2) space
and O(n3) time - matrix inversion in GP prediction,
- quadratic programming problems in SVMs,
- computation of eigendecomposition of G.
- Relevant recent work using low-rank methods
- (Williams and Seeger, NIPS 01, etc.) Nystrom
method for out-of-sample extensions. - (Achlioptas, McSherry, and Schölkopf, NIPS 02)
randomized kernels.
37Kernel-CUR Decomposition
(D. M., COLT 05, TR 05, JMLR 05)
- Input n x n SPSD matrix G, probabilities pi,
11,,n, c lt n, and k lt c. - Algorithm
- Let C be the n x c matrix containing c randomly
sampled columns of G. - Let W be the c x c matrix with containing
intersection of C and CT. - Theorem Let pi Gii2/ ?i Gii2. If c O(k
log(1/?)/?4), then w.p. at least 1-?,
If c O(log(1/?)/?4), then with probability at
least 1-?,
38Overview (2/2)
- Tensor-based data sets
- Tensor-CUR
- Hyperspectral data
- Recommendation systems
-
- From Very-Large to Medium-Sized Data
- Relative-error CX and CUR Matrix Decompositions
- L2 Regression Problems
- Application to DNA SNP Data
- Conclusions and Open Problems
39Datasets modeled as tensors
- Tensors (naively, a dataset subscripted by
multiple indices) appear both in Math and CS. - Represent high dimensional functions.
- Connections to complexity theory (i.e., matrix
multiplication complexity). - Statistical applications (i.e., ICA, HOS, etc.).
- Large data-set applications (e.g., Medical
Imaging Hyperspectral Imaging) - Problem There does not exist a definition of
tensor rank (and associated tensor SVD) with the
nice properties found in the matrix case. - (Lek-Heng Lim 05 strong impossibility
results!) - Common heuristic unfold the tensor along a
mode and apply Linear Algebra. - We will do this, but note that this kills the
essential tensor structure.
40Datasets modeled as tensors (contd)
Goal Extract structure from a tensor dataset A
(naively, a dataset subscripted by multiple
indices) using a small number of samples.
- Tensor rank (minimum number of rank-one
tensors) is NP-hard to compute. - Tensor ?-rank (unfold along the ?th mode and
the the matrix SVD) is a commonly-used heuristic.
Randomized-Tensor-CUR unfold along a
distinguished mode and reconstruct. Randomized-T
ensor-SVD unfold along every mode and choose
columns. (Drineas Mahoney, A Randomized
Algorithm for a Tensor-Based Generalization of
the SVD, TR05.)
41The TensorCUR algorithm (3-modes)
- Choose the preferred mode ? (e.g., time).
- Pick a few representative slabs (let R denote
the tensor of the sampled slabs). - Use only information in a small number of
representative fibers (let C denote the tensor
of sampled fibers and U a low-dimensional
encoding tensor). - Express the remaining slabs as linear
combinations of the basis of sampled slabs.
42Tensor-CUR application Hyperspectral Image
Analysis
(with M. Maggioni and R. Coifman at Yale)
Goal Extract structure from temporally-resolved
images or spectrally-resolved images of medical
interest using a small number of samples (images
and/or pixels).
Note A temporally or spectrally resolved image
may be viewed as a tensor (naively, a dataset
subscripted by multiple indices) or as a matrix
(whose columns have internal structure that is
not modeled).
m x n x p tensor A or mn x p matrix A
Note The chosen images are a dictionary from the
data to express every image. Note The chosen
pixels are a dictionary from the data to express
every pixel.
43(No Transcript)
44Eigen-analysis of slabs and fibers
45Look at the exact (65-th) slab.
46The (65-th) slab approximately reconstructed
This slab was reconstructed by approximate
least-squares fit to the basis from slabs 41 and
50, using 1000 (of 250K) pixels/fibers.
47Tissue Classification - Exact Data
48Tissue Classification - Ns12 Nf1000
49Tensor-CUR application Recommendation Systems
- Important Comment
- Utility is ordinal and not cardinal concept.
- Compare products dont assign utility values.
- Recommendation Model Revisited
- Every customer has an n-by-n matrix (whose
entries are 1,-1) and represent pair-wise
product comparisons. - There are m such matrices, forming an
n-by-n-by-m 3-mode tensor A. - Extract the structure of this tensor.
50Application to Jester Joke Recommendations
Use just the 14,140 full users who rated all
100 Jester jokes. For each user, convert the
utility vector to 100 x 100 pair-wise preference
matrix. Choose, e.g., 300 slabs/users, and a
small number of fibers/comparisons.
51Overview (2/2)
- Tensor-based data sets
- Tensor-CUR
- Hyperspectral data
- Recommendation systems
-
- From Very-Large to Medium-Sized Data
- Relative-error CX and CUR Matrix Decompositions
- L2 Regression Problems
- Application to DNA SNP Data
- Conclusions and Open Problems
52Modeling data as matrices
- People studying data
- put the data onto a graph or into a vector space
- even if the data dont naturally or obviously
live there - and perform graph operations or vector space
operations - to extract information from the data.
- Such data often have structure unrelated to the
graphical or linear algebraic structure implicit
in the modeling. - This non-modeled structure is difficult to
formalize. - Practitioners often have extensive field-specific
intuition about the data. - This intuition is often used to choose where
the data live.
53Modeling data as matrices (contd)
- Matrices often arise since n objects
(documents, genomes, images, web pages), each
with m features, may be represented by an m x n
matrix A. - Such data matrices often have structure
- for linear structure, SVD or PCA is often used,
- for non-linear structure, kernel, e.g.,
diffusion-based, methods used, - other structures include sparsity,
nonnegativity, etc. - Note We know what the rows/columns mean from
the application area.
54Problems with SVD/Eigen-Analysis
- Problems arise since structure in the data is not
respected by mathematical operations on the data
- Reification - maximum variance directions are
just that. - Interpretability - what does a linear
combination of 6000 genes mean. - Sparsity - is destroyed by orthogonalization.
- Non-negativity - is a convex and not linear
algebraic notion. - The SVD gives two bases to diagonalize the
matrix. - Truncating gives a low-rank matrix approximation
with a very particular structure. - Think rotation-with-truncation rescaling
rotation-back-up. - Question Do there exist better low-rank
matrix approximations. - better structural properties for certain
applications. - better at respecting relevant structure.
- better for interpretability and informing
intuition.
55Dictionaries for data analysis
- Discrete Fourier Transform (DCT)
- fj ?i0,,N-1 xn cosj(n1/2)?/N
- the basis is fixed.
- O(N2) or O(Nlog(N)) computation to determine
coefficients. - Singular Value Decomposition (SVD)
- A ?i1,,? ?iU(i)V(i)T ?i1,,? ?i Ai
- O(N3) computation to determine basis and
coefficients. - Many other more complex/expensive procedures
depending on the application. - Question Can actual data points and/or feature
vectors be the dictionary? - Core-sets on graphs.
- CUR-decompositions on matrices.
56Overview (2/2)
- Tensor-based data sets
- Tensor-CUR
- Hyperspectral data
- Recommendation systems
-
- From Very-Large to Medium-Sized Data
- Relative-error CX and CUR Matrix Decompositions
- L2 Regression Problems
- Application to DNA SNP Data
- Conclusions and Open Problems
57Problem formulation (1 of 3)
- Never mind columns and rows - just deal with
columns (for now) of the matrix A. - Could ask to find the best k of n columns of
A. - Combinatorial problem - trivial algorithm takes
nk time. - Probably NP-hard if k is not fixed.
- Lets ask a different question.
- Fix a rank parameter k.
- Lets over-sample columns by a little (e.g.,
k3, 10k, k2, etc.). - Try to get close (additive error or relative
error) to the best rank-k approximation..
58Problem formulation (2 of 3)
Ques Do there exist O(k), or O(k2), or ,
columns s.t. A-CCA2,F lt A-Ak2,F
?AF Ans Yes - and can find them in O(mn)
space and time after two passes over the data!
(DFKVV99,DKM04) Ques Do there exist, and can
we find, O(k), or O(k2), or , columns such
that A-CCAF lt (1?)A-AkF Ans Yes, they
exist - existential proof - no non-exhaustive
algorithm given! (RVW05,DRVW06) Ans ...
59Problem formulation (3 of 3)
Ques Do there exist O(k), or O(k2), or ,
columns and rows such that A-CUR2,F lt
A-Ak2,F ?AF Ans Yes - and can find
them in O(mn) space and time after two passes
over the data! (DK03,DKM04) Ques Do there
exist O(k), or O(k2), or , columns and rows such
that A-CURF lt (1?)A-AkF Ans
60L2 Regression problems
- First consider overconstrained problems, n gtgt d.
- Typically, there is no x such that Ax b.
- Can generalize to non-overconstrained problems if
rank(A)k. - We seek sampling-based algorithms for
approximating l2 regression. - Nontrivial structural insights in overconstrained
problems - Nontrivial algorithmic insights for
non-overconstrained problems.
61Our main L2 Regression result
If the pi satisfy certain conditions, then with
probability at least 1-?,
?(A) condition number of A
The sampling complexity is
(New improvement we can reduce the sampling
complexity to r O(d).)
62L2 Regression and CUR Approximation
- Extended L2 Regression Algorithm
- Input m x n matrix A, m x p matrix B, and a
rank parameter k. - Output n x p matrix X approximately solving
minX AkX - B F. - Algrthm Randomly sample rO(d2) or rO(d) rows
from Ak and B - Solve the induced sub-problem.
- Xopt AkB (SAk)SB
- Cmplxty O(SVD(Ak)) time and space
- Corollary 1 Approximately solve minX ATkX -
ATF to get columns C such that A-CCAF
(1?)A-AkF. - Corollary 2 Approximately solve minX CX -
AF to get rows R such that A-CURF (1?)
A-CCAF.
63Theorem Relative-Error CUR
Fix any k, ?, ?. Then, there exists a Monte
Carlo algorithm that uses O(SVD(Ak)) time to find
C and R and construct U s.t. holds with
probability at least 1-?, by picking c O( k2
log(1/?) / ?2 ) columns, and r O( k4 log2(1/?)
/ ?6 ) rows. (Current theory work we can
improve the sampling complexity to c,rO(k
poly(1/?, 1/?)).) (Current empirical work we can
usually choose c,r k4.) (Dont worry about ?
choose ?1 if you want!)
64Subsequent relative-error algorithms
- November 2005 Drineas, Mahoney, and
Muthukrishnan - The first relative-error low-rank matrix
approximation algorithm. - O(SVD(Ak)) time and O(k2) columns for both CX
and CUR decompositions. - January 2006 Har-Peled
- Used ?-nets and VC-dimension arguments on
optimal k-flats. - O(mn k2 log(k)) - linear in mn time to get
1? approximation. - March 2006 Despande and Vempala
- Used a volume sampling - adaptive sampling
procedure of RVW05, DRVW06. - O(Mk/?) O(SVD(Ak)) time and O(k log(k))
columns for CX-like decomposition. - April 2006 Drineas, Mahoney, and Muthukrishnan
- Improved the DMM November 2005 result to O(k
log(k)) columns.
65Overview (2/2)
- Tensor-based data sets
- Tensor-CUR
- Hyperspectral data
- Recommendation systems
-
- From Very-Large to Medium-Sized Data
- Relative-error CX and CUR Matrix Decompositions
- L2 Regression Problems
- Application to DNA SNP Data
- Conclusions and Open Problems
66CUR data application DNA tagging-SNPs
(data from K. Kidds lab at Yale University,
joint work with Dr. Paschou at Yale University)
Single Nucleotide Polymorphisms the most common
type of genetic variation in the genome across
different individuals. They are known locations
at the human genome where two alternate
nucleotide bases (alleles) are observed (out of
A, C, G, T).
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
There are 10 million SNPs in the human genome,
so this table could have 10 million columns.
67Recall the human genome
- Human genome 3 billion base pairs
- 30,000 40,000 genes
- The functionality of 97 of the genome is
unknown. - BUT individual differences (polymorphic
variation) at 1 b.p. per thousand.
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT TT
GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG
CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG
TT TT CC GG TT GG GG TT GG AA GG TT TT GG TT CC
CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG
CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GG GT GA AG GG TT TT GG TT CC CC CC CC
GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC
AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG
GT GT GA AG GG TT TT GG TT CC CC CC CC GG AA GG
GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA
GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG
AA GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA
CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG
CT CG CG CG AT CT CT AG CT AG GG TT GG AA GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG
GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GT TT GG AA
individuals
SNPs occur quite frequently within the genome
allowing the tracking of disease genes and
population histories. Thus, SNPs are effective
markers for genomic research.
68Two copies of a chromosome (father, mother)
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
69C
C
Two copies of a chromosome (father, mother)
- An individual could be
- Heterozygotic (in our study, CT TC)
- Homozygotic at the first allele, e.g., C
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
70T
T
Two copies of a chromosome (father, mother)
- An individual could be
- Heterozygotic (in our study, CT TC)
- Homozygotic at the first allele, e.g., C
- Homozygotic at the second allele, e.g., T
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
71The DNA SNP data
- Samples from 38 different populations.
- Average size 50 subjects/population.
- For each subject 63 SNPs were assayed, from a
region in chromosome 17 called SORCS3, 900,000
bases long. - We are in the process of analyzing HapMap data
as well as more 3 regions assayed by Kidds lab
(with Asif Javed).
72(No Transcript)
73Encoding the data
SNPs
0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0
1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1
-1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1
-1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0
1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1
0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1
-1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1
0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1
0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1
1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0
-1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0
0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1
1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1
-1 0 -1 -1 1
individuals
- How?
- Exactly two nucleotides (out of A,G,C,T) appear
in each column. - Thus, the two alleles might be both equal to the
first one (encode by 1), both equal to the
second one (encode by -1), or different (encode
by 0).
- Notes
- Order of the alleles is irrelevant, so TG is
the same as GT. - Encoding, e.g., GG to 1 and TT to -1 is not
any different (for our purposes) from encoding
GG to -1 and TT to 1. - (Flipping the signs of the columns of a matrix
does not affect our techniques.)
74Evaluating (linear) structure
For each population ? We ran SVD to determine
the optimal number k of eigenSNPs covering 90
of the variance.
If we pick the top k left singular vectors we can
express every column (i.e, SNP) of A as a linear
combination of the left singular vectors loosing
10 of the data.
? We ran CUR to pick a small number (e.g., k2)
of columns of A and express every column (i.e.,
SNP) of A as a linear combination of the picked
columns, loosing 10 of the data.
SNPs
0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0
1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1
-1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1
-1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0
1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1
0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1
-1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1
0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1
0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1
1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0
-1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0
0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1
1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1
-1 0 -1 -1 1
individuals
75(No Transcript)
76Predicting SNPs within a population
Split the individuals in two sets training and
test. Given a small number of SNPs for all
individuals (tagging-SNPs), and all SNPs for
individuals in the training set, predict the
unassayed SNPs. Tagging-SNPs are selected using
only the training set.
SNPs
Training individuals, chosen uniformly at
random (for a few subjects, we are given all
SNPs)
individuals
SNP sample (for all subjects, we are given a
small number of SNPs)
77(No Transcript)
78(No Transcript)
79Predicting SNPs across populations
Given all SNPs for all individuals in population
X, and a small number of tagging-SNPs for
population Y, predict all unassayed SNPs for all
individuals of Y. Tagging-SNPs are selected
using only the training set. Training set
individuals in X. Test set individuals in Y. A
contains all individuals in both X and Y.
SNPs
All individuals in population X.
individuals
SNP sample (for all subjects, we are given a
small number of SNPs)
80FIG. 6
81(No Transcript)
82(No Transcript)
83Overview (2/2)
- Tensor-based data sets
- Tensor-CUR
- Hyperspectral data
- Recommendation systems
-
- From Very-Large to Medium-Sized Data
- Relative-error CX and CUR Matrix Decompositions
- L2 Regression Problems
- Application to DNA SNP Data
- Conclusions and Open Problems
84Conclusions Open Problems
- Random sampling and vector space operations in
terms of columns and/or rows of matrices are
algorithmically tractable and applicable to data
matrices. - Applications to data analysis
- Several applications described.
- Applications to statistical learning
- Matrix reconstruction, regression,
classification, and clustering. - Applications to information retrieval
- Vector space model for large and small
documents. - Applications to scientific and internet
databases - Less-structured data, in particular.