1
VLDB06 TUTORIAL: Randomized Algorithms for Matrices and Massive Data Sets
Michael W. Mahoney, Yahoo! Research
Petros Drineas, CS, RPI
Tutorial given at VLDB 2006 in Seoul, Korea.
http://www.cs.yale.edu/homes/mmahoney
http://www.cs.rpi.edu/drinep
2
Randomized Linear Algebra Algorithms
  • Goal: To develop and analyze (fast) Monte Carlo algorithms for performing useful computations on large (and later not-so-large!) matrices and tensors.
  • Matrix Multiplication
  • Computation of the Singular Value Decomposition
  • Computation of the CUR Decomposition
  • Testing Feasibility of Linear Programs
  • Least Squares Approximation
  • Tensor computations: SVD generalizations
  • Tensor computations: CUR generalization
  • Such computations generally require time which is superlinear in the number of nonzero elements of the matrix/tensor, e.g., O(n^3) for n x n matrices.

3
Example: the CUR decomposition
Algorithmic Motivation: To speed up computations in applications where extremely large data sets are modeled by matrices and, e.g., O(n^2) space and O(n^3) time is not an option.
Structural Motivation: To reveal novel structural properties of the datasets, given sufficient computational time, that are useful in applications.
Goal: make ||A - CUR|| small.
Why? (Algorithmic) After making two passes over A, we can compute provably good C, U, and R and store them (a sketch) instead of A: O(m+n) vs. O(n^2) space.
Why? Given a sample consisting of a few columns (C) and a few rows (R) of A, we can compute U and reconstruct A as CUR. If the sampling probabilities are not too bad, we get provably good accuracy.
Why? (Structural) Given sufficient time, we can find C, U and R such that ||A - CUR|| is very small. This might lead to better understanding of the data.
4
Applications of such algorithms
  • Matrices arise, e.g., since m objects (documents,
    genomes, images, web pages), each with n
    features, may be represented by an m x n matrix
    A.
  • Covariance Matrices
  • Latent Semantic Indexing
  • DNA Microarray Data
  • Eigenfaces and Image Recognition
  • Similarity Queries
  • Matrix Reconstruction
  • LOTS of other data applications!!
  • More generally,
  • Linear and Nonlinear Programming Applications
  • Design of Approximation Algorithms
  • Statistical Learning Theory Applications

5
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

6
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

7
The Pass-Efficient Model
  • Motivation: The amount of disk/tape space has increased enormously; RAM and computing speeds have increased less rapidly.
  • Can store large amounts of data, but
  • Cannot process these data with traditional algorithms.
  • In the Pass-Efficient Model:
  • Data are assumed to be stored on disk/tape.
  • Algorithm has access to the data via a pass over the data.
  • Algorithm is allowed additional RAM space and additional computation time.
  • An algorithm is pass-efficient if it uses a small constant number of passes and sublinear additional time and space to compute a description of the solution.
  • Note: If data are an m x n matrix A, then algorithms which require additional time and space that is O(m+n) or O(1) are pass-efficient.

8
Random Sampling
  • Random Sampling and Randomized Algorithms
  • Better complexity properties (randomization as a
    resource).
  • Simpler algorithms and/or analysis (maybe
    de-randomize later).
  • Uniform Sampling
  • Typically things work in expectation, but poor
    variance properties.
  • Non-uniform Sampling
  • With good probabilities, can make the variance
    small.

9
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

10
Approximating Matrix Multiplication
(D. & Kannan FOCS '01; D., Kannan, & M. TR '04, SICOMP '06)
Problem Statement: Given an m-by-n matrix A and an n-by-p matrix B, approximate the product AB, or, equivalently, approximate the sum of n rank-one matrices:
AB = Σ_{i=1,…,n} A^(i) B_(i),
where A^(i) is the i-th column of A and B_(i) is the i-th row of B.
Each term in the summation is a rank-one matrix.
11
… by random sampling
  • Algorithm:
  • Fix a set of probabilities p_i, i = 1,…,n, summing up to 1.
  • For t = 1 up to s:
  • set j_t = i, where Pr(j_t = i) = p_i
  • (Pick s terms of the sum, with replacement, with respect to the p_i's.)
  • Approximate AB by the sum of the s terms, each rescaled by 1/(s p_{j_t}) (see the code sketch below).
12
Random sampling (cont'd)
Keeping the terms j_1, j_2, …, j_s.
13
The algorithm (matrix notation): form C = ASD, the m x s matrix of sampled and rescaled columns of A, and R = (SD)^T B, the s x p matrix of the corresponding sampled and rescaled rows of B; then CR ≈ AB.
14
Simple Lemmas
  • The input matrices are given in "sparse unordered" representation; e.g., their non-zero entries are presented as triples (i, j, A_ij) in any order.
  • The expectation of CR (element-wise) is AB.
  • Our nonuniform sampling minimizes the variance of the estimator.
  • It is easy to implement the sampling in two passes.
  • If the matrices are dense, the algorithm runs in O(smp) time, instead of O(nmp) time.
  • It requires O(sm + sp) RAM space.
  • It does not tamper with the sparsity of the matrices.

15
Error Bounds
For the above algorithm (with the nonuniform probabilities), in expectation:
||AB - CR||_F ≤ (1/√s) ||A||_F ||B||_F.
For the above algorithm, with probability at least 1-δ:
||AB - CR||_F ≤ O(log(1/δ)/√s) ||A||_F ||B||_F.
  • This is a relative error bound if ||AB||_F = Ω(||A||_F ||B||_F), i.e., if there is not much cancellation in the multiplication.
  • We removed the expectation (by applying a martingale argument) and so have an extra log(1/δ) factor.
  • Markov's inequality would also remove the expectation, introducing an extra 1/δ factor.
16
Special case: B = A^T
If B = A^T, then the sampling probabilities are p_i = |A^(i)|^2 / ||A||_F^2.
Also, R = C^T, and the error bounds become ||AA^T - CC^T||_{2,F} ≤ (1/√s) ||A||_F^2.
17
Special case: B = A^T (cont'd)
(Rudelson & Vershynin '04, Vershynin '04)
Improvement of the spectral norm bound for the special case B = A^T.
  • Uses a result of M. Rudelson for random vectors in isotropic position.
  • Tight concentration results can be proven using Talagrand's theory.
  • The sampling procedure is slightly different: s columns/rows are kept in expectation, i.e., column i is picked with probability min{1, s |A^(i)|^2 / ||A||_F^2}.
18
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

19
Singular Value Decomposition (SVD)
A = U Σ V^T.
U (V): orthogonal matrix containing the left (right) singular vectors of A.
Σ: diagonal matrix containing the singular values of A.
  • Exact computation of the SVD takes O(min{mn^2, m^2n}) time.
  • The top few singular vectors/values can be approximated faster (Lanczos/Arnoldi methods).

20
Rank k approximations (A_k)
  • A_k = U_k Σ_k V_k^T.
  • U_k (V_k): orthogonal matrix containing the top k left (right) singular vectors of A.
  • Σ_k: diagonal matrix containing the top k singular values of A.
  • Also, A_k = U_k U_k^T A.

A_k is the matrix of rank k such that ||A - A_k||_{2,F} is minimized over all rank-k matrices! This property is very useful in the context of Principal Component Analysis.
21
Approximating SVD in O(n) time
  • (D., Frieze, Kannan, Vempala & Vinay SODA '99, JML '04; D., Kannan, & M. TR '04, SICOMP '06)
  • Given an m x n matrix A:
  • Sample c columns from A and rescale to form the m x c matrix C.
  • Compute the m x k matrix H_k of the top k left singular vectors of C.

Structural Theorem: For any probabilities and number of columns,
||A - H_k H_k^T A||_{2,F}^2 ≤ ||A - A_k||_{2,F}^2 + 2√k ||AA^T - CC^T||_F.
Algorithmic Theorem: If p_i = |A^(i)|^2/||A||_F^2 and c ≥ 4η^2 k/ε^2 (η a constant depending on the failure probability), then
||A - H_k H_k^T A||_{2,F}^2 ≤ ||A - A_k||_{2,F}^2 + ε ||A||_F^2.
Proof: via the matrix multiplication theorem and matrix perturbation theory.
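A minimal NumPy sketch of this column-sampling SVD approximation (an illustration, not the authors' implementation): sample c columns with probabilities proportional to squared column norms, rescale, take the top-k left singular vectors of C, and use H_k H_k^T A as the low-rank approximation.

```python
import numpy as np

def approx_svd_columns(A, c, k):
    """Approximate the top-k left singular subspace of A via column sampling."""
    rng = np.random.default_rng()
    n = A.shape[1]
    col_norms_sq = (A ** 2).sum(axis=0)
    p = col_norms_sq / col_norms_sq.sum()                    # p_i = |A^(i)|^2 / ||A||_F^2
    idx = rng.choice(n, size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])                      # rescale so E[C C^T] = A A^T
    Hk = np.linalg.svd(C, full_matrices=False)[0][:, :k]     # top-k left singular vectors of C
    return Hk, Hk @ (Hk.T @ A)                               # H_k and the approximation H_k H_k^T A
```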
22
Example of randomized SVD
[Figure: the original 512 x 512 test image (baboondet.eps) as matrix A, and the matrix after sampling columns.]
Compute the top k left singular vectors of the matrix C and store them in the 512-by-k matrix H_k.
23
Example of randomized SVD (cont'd)
[Figure: side-by-side comparison of A and H_k H_k^T A (baboondet.eps).]
A and H_k H_k^T A are close.
24
Element-wise sampling
(Achlioptas & McSherry, STOC '01, JACM '05)
  • The Algorithm in 2 lines:
  • To approximate a matrix A, keep a few elements of the matrix (instead of rows or columns) and zero out the remaining elements.
  • Compute a rank-k approximation to this sparse matrix (using Lanczos methods).

||A - S||_2 is bounded ⇒ (i) the singular values of A and S are close, and (ii) under additional assumptions, the top k left (right) singular vectors of S span a subspace that is close to the subspace spanned by the top k left (right) singular vectors of A.
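A rough NumPy sketch of element-wise sparsification (an illustration under simple assumptions, not the Achlioptas-McSherry implementation): keep each entry independently with probability proportional to its squared magnitude, rescale kept entries so the sparse matrix S is an unbiased estimate of A, then take a rank-k SVD of S.

```python
import numpy as np

def sparsify_and_approximate(A, keep_fraction, k):
    """Keep a few entries of A (zeroing the rest), then rank-k approximate the sparse matrix."""
    rng = np.random.default_rng()
    # Keep entry (i, j) with probability proportional to A_ij^2 (capped at 1).
    probs = np.minimum(1.0, keep_fraction * A.size * A**2 / (A**2).sum())
    mask = rng.random(A.shape) < probs
    S = np.where(mask, A / np.maximum(probs, 1e-12), 0.0)    # rescale so E[S] = A
    U, s, Vt = np.linalg.svd(S, full_matrices=False)          # Lanczos would be used at scale
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```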
25
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

26
CX and CUR matrix decompositions
Recall: Matrices are about their rows and columns.
Recall: Low-rank matrices have redundancy in their rows and columns.
Def: A CX matrix decomposition is a low-rank approximation explicitly expressed in terms of a small number of columns of the original matrix A (e.g., PCA, CCA).
Def: A CUR matrix decomposition is a low-rank approximation explicitly expressed in terms of a small number of columns and rows of the original matrix A.
27
A novel CUR matrix decomposition
(D. & Kannan, SODA '03; D., Kannan, & M. TR '04, SICOMP '06)
Create an approximation to the original matrix of the following form: A ≈ C U R, where C consists of a few columns of A, R consists of a few rows of A, and U is a small, carefully constructed matrix.
28
The CUR decomposition
Given a large m-by-n matrix A (stored on disk), compute a decomposition CUR of A such that:
  • C consists of c = O(k/ε^2) columns of A.
  • R consists of r = O(k/ε^2) rows of A.
  • C (R) is created using importance sampling, e.g., columns (rows) are picked in i.i.d. trials with respect to probabilities proportional to their squared Euclidean norms, p_i = |A^(i)|^2 / ||A||_F^2.
  • C, U, R can be stored in O(m+n) space, after making two passes through the entire matrix A, using O(m+n) additional space and time.
  • The product CUR satisfies (with high probability):
    ||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F.
29
Computing U
  • Intuition (which can be formalized - see later):
  • The CUR algorithm expresses every row of the matrix A as a linear combination of a small subset of the rows of A.
  • This small subset consists of the rows in R.
  • Given a row of A - say A_(i) - the algorithm computes a good fit for the row A_(i) using the rows in R as the basis, by approximately solving min_u ||A_(i) - u R||_2.

But only c = O(1) elements of the i-th row are given as input. So, we only approximate the optimal vector u instead of computing it exactly. Actually, the pass-efficient CUR decomposition approximates the approximation.
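The following NumPy sketch illustrates the overall shape of a CUR approximation (a simplified illustration, not the pass-efficient algorithm from the tutorial): sample and rescale columns and rows with norm-based probabilities, and build U by fitting A onto the sampled columns and rows via pseudoinverses.

```python
import numpy as np

def cur_decomposition(A, c, r):
    """Simplified CUR sketch: sample c columns and r rows, set U = C^+ A R^+."""
    rng = np.random.default_rng()
    m, n = A.shape
    pc = (A**2).sum(axis=0); pc = pc / pc.sum()          # column probabilities
    pr = (A**2).sum(axis=1); pr = pr / pr.sum()          # row probabilities
    cols = rng.choice(n, size=c, replace=True, p=pc)
    rows = rng.choice(m, size=r, replace=True, p=pr)
    C = A[:, cols] / np.sqrt(c * pc[cols])               # rescaled sampled columns
    R = A[rows, :] / np.sqrt(r * pr[rows])[:, None]      # rescaled sampled rows
    # Note: this U touches all of A; the pass-efficient algorithm builds U from C/R alone.
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

A = np.random.randn(300, 200)
C, U, R = cur_decomposition(A, c=40, r=40)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```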
30
Error bounds for CUR
Assume A_k is the best rank-k approximation to A (through SVD). Then, if we pick O(k/ε^2) rows and O(k/ε^2) columns,
||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F.
31
Previous CUR-type decompositions
(For details see Drineas & Mahoney, "A Randomized Algorithm for a Tensor-Based Generalization of the SVD," '05.)
32
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

33
CUR application: Data Mining
  • Database: An m-by-n matrix A, e.g., m (> 10^6) objects and n (> 10^5) features.
  • Queries: Given a new object x, find similar objects (nearest neighbors) in A.
  • Closeness: Two normalized objects x and d are close if x^T d = cos(x, d) is high.
  • Given a query vector x, the matrix product Ax computes all the angles/distances.
  • Key observation: The exact value x^T d might not be necessary.
  • The feature values in the vectors are set by coarse heuristics.
  • It is in general enough to see if x^T d > threshold.
  • Algorithm: Given a query vector x, compute CURx to identify nearest neighbors (see the sketch below).
  • Theorem: We have a bound on the worst-case error over queries x when using CUR instead of A.
34
CUR application: Genetic Microarray Data
Exploit structural properties of CUR in biological applications:
[Figure: a genes-by-experimental-conditions data matrix.]
Find a good set of genes and arrays to include in C and R? Provable and/or heuristic strategies are acceptable.
  • Common in Biological/Chemical/Medical applications of PCA:
  • Explain the singular vectors by mapping them to meaningful biological processes.
  • This is a challenging task (think: reification)!
  • CUR is a low-rank decomposition in terms of the data that practitioners understand.
  • Use it to explain the data and do dimensionality reduction, classification, clustering.
  • Gene microarray data: M., D., & Alter (UT Austin) (sporulation and cell cycle data).

35
CUR application: Recommendation Systems
(D., Raghavan, & Kerenidis, STOC '02)
  • The problem: m customers and n products; A_ij is the (unknown) utility of product j for customer i.
  • The goal: recreate A from a few samples to recommend high-utility products.
  • (KRRT '98) Assuming strong clustering of the products, competitive algorithms even with only 2 samples/customer.
  • (AFKMS '01) Assuming sampling of Ω(mn) entries of A and a gap requirement, accurately recreate A.
  • Question: Can we get competitive performance by sampling o(mn) elements?
  • Answer: Apply the CUR decomposition.

36
Kernel-CUR: Motivation
  • Kernel-based learning methods to extract non-linear structure:
  • Choose features to define a (dot product) space F.
  • Map the data, X, to F by Φ: X → F.
  • Do classification, regression, and clustering in F with linear methods (SVMs, GPs, SVD).
  • If the Gram matrix G, where G_ij = k_ij = (Φ(X^(i)), Φ(X^(j))), is dense but has low numerical rank, then calculations of interest need O(n^2) space and O(n^3) time:
  • matrix inversion in GP prediction,
  • quadratic programming problems in SVMs,
  • computation of the eigendecomposition of G.
  • Relevant recent work using low-rank methods:
  • (Williams and Seeger, NIPS '01, etc.): Nystrom method for out-of-sample extensions.
  • (Achlioptas, McSherry, and Schölkopf, NIPS '02): randomized kernels.

37
Kernel-CUR Decomposition
(D. & M., COLT '05, TR '05, JMLR '05)
  • Input: n x n SPSD matrix G, probabilities p_i, i = 1,…,n, c < n, and k ≤ c.
  • Algorithm:
  • Let C be the n x c matrix containing c randomly sampled (and rescaled) columns of G.
  • Let W be the c x c matrix containing the intersection of C and C^T.
  • Theorem: Let p_i = G_ii^2 / Σ_i G_ii^2. If c = O(k log(1/δ)/ε^4), then with probability at least 1-δ,
||G - C W_k^+ C^T||_F ≤ ||G - G_k||_F + ε Σ_i G_ii^2.

If c = O(log(1/δ)/ε^4), then with probability at least 1-δ,
||G - C W^+ C^T||_2 ≤ ||G - G_k||_2 + ε Σ_i G_ii^2.
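A minimal NumPy sketch of the Nystrom-style construction described above (an illustration, not the authors' code): sample c columns of the SPSD matrix G, take the c x c intersection W, and approximate G by C W_k^+ C^T.

```python
import numpy as np

def nystrom_approx(G, c, k):
    """Approximate an n x n SPSD matrix G by C @ pinv(W_k) @ C.T."""
    rng = np.random.default_rng()
    n = G.shape[0]
    d = np.diag(G) ** 2
    p = d / d.sum()                              # p_i proportional to G_ii^2
    idx = rng.choice(n, size=c, replace=True, p=p)
    scale = 1.0 / np.sqrt(c * p[idx])
    C = G[:, idx] * scale                        # rescaled sampled columns
    W = C[idx, :] * scale[:, None]               # c x c intersection of C and C^T
    U, s, _ = np.linalg.svd(W)                   # W is symmetric PSD, so U ≈ V
    Wk_pinv = U[:, :k] @ np.diag(1.0 / np.maximum(s[:k], 1e-12)) @ U[:, :k].T
    return C @ Wk_pinv @ C.T
```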
38
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

39
Datasets modeled as tensors
  • Tensors (naively, a dataset subscripted by multiple indices) appear both in Math and CS.
  • Represent high dimensional functions.
  • Connections to complexity theory (e.g., matrix multiplication complexity).
  • Statistical applications (e.g., ICA, HOS, etc.).
  • Large data-set applications (e.g., Medical Imaging, Hyperspectral Imaging).
  • Problem: There does not exist a definition of tensor rank (and associated tensor SVD) with the nice properties found in the matrix case.
  • (Lek-Heng Lim '05: strong impossibility results!)
  • Common heuristic: unfold the tensor along a mode and apply Linear Algebra.
  • We will do this, but note that this kills the essential tensor structure.

40
Datasets modeled as tensors (cont'd)
Goal: Extract structure from a tensor dataset A (naively, a dataset subscripted by multiple indices) using a small number of samples.
  • Tensor rank (minimum number of rank-one tensors) is NP-hard to compute.
  • Tensor α-rank (unfold along the α-th mode and take the matrix SVD) is a commonly-used heuristic.

Randomized-Tensor-CUR: unfold along a distinguished mode and reconstruct.
Randomized-Tensor-SVD: unfold along every mode and choose columns.
(Drineas & Mahoney, "A Randomized Algorithm for a Tensor-Based Generalization of the SVD," TR '05.)
41
The TensorCUR algorithm (3-modes)
  • Choose the preferred mode α (e.g., time).
  • Pick a few representative slabs (let R denote the tensor of the sampled slabs).
  • Use only information in a small number of representative fibers (let C denote the tensor of sampled fibers and U a low-dimensional encoding tensor).
  • Express the remaining slabs as linear combinations of the basis of sampled slabs (see the sketch below).
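A rough NumPy sketch of these steps for a 3-mode tensor (an illustration of the idea, not the TensorCUR implementation): unfold along the distinguished mode, sample slabs and fibers, and express all slabs in the basis of the sampled slabs via least squares on the sampled fibers only.

```python
import numpy as np

def tensor_cur_3mode(A, num_slabs, num_fibers):
    """A: array of shape (m, n, p); the third mode (size p) is the distinguished mode."""
    rng = np.random.default_rng()
    m, n, p = A.shape
    unfolded = A.reshape(m * n, p)                 # column j is the j-th slab, flattened
    slab_w = (unfolded**2).sum(axis=0); slab_w = slab_w / slab_w.sum()
    fiber_w = (unfolded**2).sum(axis=1); fiber_w = fiber_w / fiber_w.sum()
    slabs = rng.choice(p, size=num_slabs, replace=True, p=slab_w)
    fibers = rng.choice(m * n, size=num_fibers, replace=True, p=fiber_w)
    R = unfolded[:, slabs]                         # basis of sampled slabs
    # Fit every slab on the sampled fibers only, then reconstruct all slabs in full.
    coeffs = np.linalg.lstsq(R[fibers, :], unfolded[fibers, :], rcond=None)[0]
    return (R @ coeffs).reshape(m, n, p)           # approximation to A
```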

42
Tensor-CUR application: Hyperspectral Image Analysis
(with M. Maggioni and R. Coifman at Yale)
Goal: Extract structure from temporally-resolved images or spectrally-resolved images of medical interest using a small number of samples (images and/or pixels).
Note: A temporally or spectrally resolved image may be viewed as a tensor (naively, a dataset subscripted by multiple indices) or as a matrix (whose columns have internal structure that is not modeled).
m x n x p tensor A, or mn x p matrix A.
Note: The chosen images are a dictionary from the data to express every image.
Note: The chosen pixels are a dictionary from the data to express every pixel.
43
(No Transcript)
44
Eigen-analysis of slabs and fibers
45
Look at the exact (65-th) slab.
46
The (65-th) slab approximately reconstructed
This slab was reconstructed by approximate
least-squares fit to the basis from slabs 41 and
50, using 1000 (of 250K) pixels/fibers.
47
Tissue Classification - Exact Data
48
Tissue Classification - Ns = 12, Nf = 1000
49
Tensor-CUR application: Recommendation Systems
  • Important Comment:
  • Utility is an ordinal and not a cardinal concept.
  • Compare products; don't assign utility values.
  • Recommendation Model Revisited:
  • Every customer has an n-by-n matrix (whose entries are +1/-1) representing pair-wise product comparisons.
  • There are m such matrices, forming an n-by-n-by-m 3-mode tensor A.
  • Extract the structure of this tensor.

50
Application to Jester Joke Recommendations
Use just the 14,140 "full" users who rated all 100 Jester jokes. For each user, convert the utility vector to a 100 x 100 pair-wise preference matrix. Choose, e.g., 300 slabs/users, and a small number of fibers/comparisons.
51
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

52
Modeling data as matrices
  • People studying data
  • put the data onto a graph or into a vector space
  • even if the data don't naturally or obviously live there
  • and perform graph operations or vector space
    operations
  • to extract information from the data.
  • Such data often have structure unrelated to the
    graphical or linear algebraic structure implicit
    in the modeling.
  • This non-modeled structure is difficult to
    formalize.
  • Practitioners often have extensive field-specific
    intuition about the data.
  • This intuition is often used to choose where
    the data live.

53
Modeling data as matrices (cont'd)
  • Matrices often arise since n objects
    (documents, genomes, images, web pages), each
    with m features, may be represented by an m x n
    matrix A.
  • Such data matrices often have structure:
  • for linear structure, SVD or PCA is often used,
  • for non-linear structure, kernel (e.g., diffusion-based) methods are used,
  • other structures include sparsity, nonnegativity, etc.
  • Note: We know what the rows/columns "mean" from the application area.

54
Problems with SVD/Eigen-Analysis
  • Problems arise since structure in the data is not respected by mathematical operations on the data:
  • Reification - maximum variance directions are just that.
  • Interpretability - what does a linear combination of 6000 genes mean?
  • Sparsity - is destroyed by orthogonalization.
  • Non-negativity - is a convex and not a linear-algebraic notion.
  • The SVD gives two bases to diagonalize the matrix.
  • Truncating gives a low-rank matrix approximation with a very particular structure.
  • Think: rotation-with-truncation; rescaling; rotation-back-up.
  • Question: Do there exist "better" low-rank matrix approximations?
  • better structural properties for certain applications,
  • better at respecting relevant structure,
  • better for interpretability and informing intuition.

55
Dictionaries for data analysis
  • Discrete Cosine Transform (DCT):
  • f_j = Σ_{n=0,…,N-1} x_n cos[π j (n + 1/2)/N]
  • the basis is fixed.
  • O(N^2) or O(N log N) computation to determine coefficients.
  • Singular Value Decomposition (SVD):
  • A = Σ_{i=1,…,ρ} σ_i U^(i) V^(i)T = Σ_{i=1,…,ρ} σ_i A_i
  • O(N^3) computation to determine basis and coefficients.
  • Many other more complex/expensive procedures depending on the application.
  • Question: Can actual data points and/or feature vectors be the dictionary?
  • Core-sets on graphs.
  • CUR-decompositions on matrices (see the sketch below).
56
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

57
Problem formulation (1 of 3)
  • Never mind columns and rows - just deal with columns (for now) of the matrix A.
  • Could ask to find the "best" k of n columns of A.
  • Combinatorial problem - the trivial algorithm takes n^k time.
  • Probably NP-hard if k is not fixed.
  • Let's ask a different question.
  • Fix a rank parameter k.
  • Let's over-sample columns by a little (e.g., k+3, 10k, k^2, etc.).
  • Try to get close (additive error or relative error) to the best rank-k approximation.
58
Problem formulation (2 of 3)
Ques: Do there exist O(k), or O(k^2), or …, columns C s.t.
||A - CC^+A||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F ?
Ans: Yes - and we can find them in O(m+n) space and time after two passes over the data! (DFKVV99, DKM04)
Ques: Do there exist, and can we find, O(k), or O(k^2), or …, columns such that
||A - CC^+A||_F ≤ (1+ε) ||A - A_k||_F ?
Ans: Yes, they exist - existential proof - no non-exhaustive algorithm given! (RVW05, DRVW06)
Ans: ...
59
Problem formulation (3 of 3)
Ques: Do there exist O(k), or O(k^2), or …, columns and rows such that
||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F ?
Ans: Yes - and we can find them in O(m+n) space and time after two passes over the data! (DK03, DKM04)
Ques: Do there exist O(k), or O(k^2), or …, columns and rows such that
||A - CUR||_F ≤ (1+ε) ||A - A_k||_F ?
Ans: ...
60
L2 Regression problems
  • First consider overconstrained problems, n >> d: minimize ||Ax - b||_2 over x, where A is n x d.
  • Typically, there is no x such that Ax = b.
  • Can generalize to non-overconstrained problems if rank(A) = k.
  • We seek sampling-based algorithms for approximating ℓ2 regression.
  • Nontrivial structural insights in overconstrained problems.
  • Nontrivial algorithmic insights for non-overconstrained problems.

61
Our main L2 Regression result
If the p_i satisfy certain conditions, then with probability at least 1-δ, the solution of the row-sampled problem has residual within a (1+ε) factor of the optimal ||Ax_opt - b||_2, and the error in the solution vector itself is controlled in terms of κ(A).
κ(A): the condition number of A.
The sampling complexity is r = O(d^2) rows (up to logarithmic and 1/ε factors).
(New improvement: we can reduce the sampling complexity to r = O(d).)
62
L2 Regression and CUR Approximation
  • Extended L2 Regression Algorithm:
  • Input: m x n matrix A, m x p matrix B, and a rank parameter k.
  • Output: n x p matrix X approximately solving min_X ||A_k X - B||_F.
  • Algorithm: Randomly sample r = O(d^2) or r = O(d) rows from A_k and B;
  • Solve the induced sub-problem:
  • X̃_opt = A_k^+ B ≈ (S A_k)^+ S B, where S is the sampling-and-rescaling operator.
  • Complexity: O(SVD(A_k)) time and space.
  • Corollary 1: Approximately solve min_X ||A_k^T X - A^T||_F to get columns C such that ||A - CC^+A||_F ≤ (1+ε) ||A - A_k||_F.
  • Corollary 2: Approximately solve min_X ||C X - A||_F to get rows R such that ||A - CUR||_F ≤ (1+ε) ||A - CC^+A||_F (see the row-sampling sketch below).
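A minimal NumPy sketch of row-sampled least squares (an illustration with leverage-score-style probabilities; not the exact conditions required by the theorem): sample and rescale r rows of A and b, then solve the small problem.

```python
import numpy as np

def sampled_least_squares(A, b, r):
    """Approximate argmin_x ||Ax - b||_2 by solving a row-sampled subproblem."""
    rng = np.random.default_rng()
    n = A.shape[0]
    # Row probabilities from leverage scores (the exact SVD here is only for illustration).
    U = np.linalg.svd(A, full_matrices=False)[0]
    lev = (U**2).sum(axis=1)
    p = lev / lev.sum()
    idx = rng.choice(n, size=r, replace=True, p=p)
    scale = 1.0 / np.sqrt(r * p[idx])
    SA = A[idx, :] * scale[:, None]        # sampled, rescaled rows of A
    Sb = b[idx] * scale                    # sampled, rescaled entries of b
    return np.linalg.lstsq(SA, Sb, rcond=None)[0]
```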

63
Theorem: Relative-Error CUR
Fix any k, ε, δ. Then there exists a Monte Carlo algorithm that uses O(SVD(A_k)) time to find C and R and construct U s.t.
||A - CUR||_F ≤ (1+ε) ||A - A_k||_F
holds with probability at least 1-δ, by picking
c = O( k^2 log(1/δ) / ε^2 ) columns, and
r = O( k^4 log^2(1/δ) / ε^6 ) rows.
(Current theory work: we can improve the sampling complexity to c, r = O(k poly(1/ε, 1/δ)).)
(Current empirical work: we can usually choose c, r ≈ k + 4.)
(Don't worry about ε: choose ε = 1 if you want!)
64
Subsequent relative-error algorithms
  • November 2005: Drineas, Mahoney, and Muthukrishnan
  • The first relative-error low-rank matrix approximation algorithm.
  • O(SVD(A_k)) time and O(k^2) columns for both CX and CUR decompositions.
  • January 2006: Har-Peled
  • Used ε-nets and VC-dimension arguments on optimal k-flats.
  • O(mn k^2 log k) - linear in mn - time to get a (1+ε) approximation.
  • March 2006: Deshpande and Vempala
  • Used the volume sampling / adaptive sampling procedure of RVW05, DRVW06.
  • O(Mk/ε) ≈ O(SVD(A_k)) time and O(k log k) columns for a CX-like decomposition.
  • April 2006: Drineas, Mahoney, and Muthukrishnan
  • Improved the DMM November 2005 result to O(k log k) columns.

65
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

66
CUR data application: DNA tagging-SNPs
(data from K. Kidd's lab at Yale University; joint work with Dr. Paschou at Yale University)
Single Nucleotide Polymorphisms: the most common type of genetic variation in the genome across different individuals. They are known locations in the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T).
[Figure: an individuals-by-SNPs table of genotypes, with entries such as AG, CT, GT, GG, CC, AA, TT, ...]
There are ~10 million SNPs in the human genome, so this table could have ~10 million columns.
67
Recall the human genome
  • Human genome: ~3 billion base pairs
  • 30,000 - 40,000 genes
  • The functionality of 97% of the genome is unknown.
  • BUT: individual differences (polymorphic variation) at ~1 b.p. per thousand.

[Same individuals-by-SNPs genotype table as above.]
SNPs occur quite frequently within the genome
allowing the tracking of disease genes and
population histories. Thus, SNPs are effective
markers for genomic research.
68
Two copies of a chromosome (father, mother)
[Same individuals-by-SNPs genotype table as above.]
69
Two copies of a chromosome (father, mother)
  • An individual could be:
  • Heterozygotic (in our study, CT = TC)
  • Homozygotic at the first allele, e.g., C (both copies carry C)

[Same individuals-by-SNPs genotype table as above.]
70
Two copies of a chromosome (father, mother)
  • An individual could be:
  • Heterozygotic (in our study, CT = TC)
  • Homozygotic at the first allele, e.g., C
  • Homozygotic at the second allele, e.g., T (both copies carry T)

[Same individuals-by-SNPs genotype table as above.]
71
The DNA SNP data
  • Samples from 38 different populations.
  • Average size: ~50 subjects/population.
  • For each subject, 63 SNPs were assayed, from a region in chromosome 17 called SORCS3, ~900,000 bases long.
  • We are in the process of analyzing HapMap data as well as 3 more regions assayed by Kidd's lab (with Asif Javed).

72
(No Transcript)
73
Encoding the data
[Figure: the same individuals-by-SNPs table, now encoded as a matrix with entries in {-1, 0, +1}.]
  • How?
  • Exactly two nucleotides (out of A, G, C, T) appear in each column.
  • Thus, the two alleles might be both equal to the first one (encode by +1), both equal to the second one (encode by -1), or different (encode by 0).
  • Notes:
  • Order of the alleles is irrelevant, so TG is the same as GT.
  • Encoding, e.g., GG to +1 and TT to -1 is not any different (for our purposes) from encoding GG to -1 and TT to +1.
  • (Flipping the signs of the columns of a matrix does not affect our techniques; see the encoding sketch below.)
74
Evaluating (linear) structure
For each population:
  • We ran SVD to determine the optimal number k of eigenSNPs covering 90% of the variance. If we pick the top k left singular vectors, we can express every column (i.e., SNP) of A as a linear combination of the left singular vectors, losing 10% of the data.
  • We ran CUR to pick a small number (e.g., k+2) of columns of A and express every column (i.e., SNP) of A as a linear combination of the picked columns, losing 10% of the data.

[Figure: the encoded individuals-by-SNPs matrix.]
75
(No Transcript)
76
Predicting SNPs within a population
Split the individuals into two sets: training and test. Given a small number of SNPs for all individuals (the tagging-SNPs), and all SNPs for the individuals in the training set, predict the unassayed SNPs. Tagging-SNPs are selected using only the training set. A least-squares sketch of this prediction step is given below.
[Figure: individuals-by-SNPs matrix; training individuals are chosen uniformly at random (for a few subjects, we are given all SNPs); the SNP sample gives a small number of SNPs for all subjects.]
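One simple way to realize this prediction (a hypothetical least-squares sketch, not necessarily the procedure used in the study): learn, on the training individuals, a linear map from the tagging-SNP columns to all SNP columns, apply it to the test individuals, and round back to {-1, 0, +1}.

```python
import numpy as np

def predict_snps(A_train, tag_idx, A_test_tags):
    """A_train: training individuals x all SNPs (entries in {-1, 0, +1}).
    tag_idx: indices of the tagging-SNP columns.
    A_test_tags: test individuals x tagging SNPs only."""
    # Least-squares map from tagging SNPs to all SNPs, learned on the training set.
    W = np.linalg.lstsq(A_train[:, tag_idx], A_train, rcond=None)[0]
    pred = A_test_tags @ W
    return np.clip(np.round(pred), -1, 1)   # round back to {-1, 0, +1}
```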
77
(No Transcript)
78
(No Transcript)
79
Predicting SNPs across populations
Given all SNPs for all individuals in population X, and a small number of tagging-SNPs for population Y, predict all unassayed SNPs for all individuals of Y. Tagging-SNPs are selected using only the training set. Training set: individuals in X. Test set: individuals in Y. A contains all individuals in both X and Y.
[Figure: individuals-by-SNPs matrix containing all individuals in population X, plus the SNP sample (a small number of SNPs for all subjects).]
80
FIG. 6
81
(No Transcript)
82
(No Transcript)
83
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

84
Conclusions & Open Problems
  • Random sampling and vector space operations in terms of columns and/or rows of matrices are algorithmically tractable and applicable to data matrices.
  • Applications to data analysis:
  • Several applications described.
  • Applications to statistical learning:
  • Matrix reconstruction, regression, classification, and clustering.
  • Applications to information retrieval:
  • Vector space model for large and small documents.
  • Applications to scientific and internet databases:
  • Less-structured data, in particular.