SDM06 Tutorial: Randomized Algorithms for Matrices and Massive Data Sets
(transcript of slides)

1
SDM06 Tutorial: Randomized Algorithms for
Matrices and Massive Data Sets
Michael W. Mahoney (Yahoo! Research)
Petros Drineas (CS, RPI)
Tutorial given at the SIAM Data Mining Meeting, April
22, 2006. Most recent copy available at
http://www.cs.yale.edu/homes/mmahoney and
http://www.cs.rpi.edu/drinep
2
Randomized Linear Algebra Algorithms
  • Goal: To develop and analyze (fast) Monte Carlo
    algorithms for performing useful computations on
    large (and later not-so-large!) matrices and
    tensors.
  • Matrix Multiplication
  • Computation of the Singular Value Decomposition
  • Computation of the CUR Decomposition
  • Testing Feasibility of Linear Programs
  • Least Squares Approximation
  • Tensor computations: SVD generalizations
  • Tensor computations: CUR generalizations
  • Such computations generally require time which is
    superlinear in the number of nonzero elements of
    the matrix/tensor, e.g., O(n^3) for n x n matrices.

3
Example: the CUR decomposition
Algorithmic Motivation: To speed up computations
in applications where extremely large data sets
are modeled by matrices and, e.g., O(n^2) space
and O(n^3) time is not an option.
Structural Motivation: To reveal novel structural
properties of the datasets, given sufficient
computational time, that are useful in applications.
Goal: make ||A - CUR|| small.
Why? (Algorithmic) After making two passes over
A, we can compute provably good C, U, and R and
store them (a "sketch") instead of A: O(m+n) vs.
O(n^2) space.
Why? Given a sample consisting of a few columns
(C) and a few rows (R) of A, we can compute U and
reconstruct A as CUR. If the sampling
probabilities are not too bad, we get provably
good accuracy.
Why? (Structural) Given sufficient time, we can
find C, U and R such that A - CUR is very
small. This might lead to better understanding
of the data.
4
Applications of such algorithms
  • Matrices arise, e.g., since m objects (documents,
    genomes, images, web pages), each with n
    features, may be represented by an m x n matrix
    A.
  • Covariance Matrices
  • Latent Semantic Indexing
  • DNA Microarray Data
  • Eigenfaces and Image Recognition
  • Similarity Queries
  • Matrix Reconstruction
  • LOTS of other data applications!!
  • More generally,
  • Linear and Nonlinear Programming Applications
  • Design of Approximation Algorithms
  • Statistical Learning Theory Applications

5
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

6
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

7
The Pass Efficient Model
  • Motivation: Amount of disk/tape space has
    increased enormously; RAM and computing speeds
    have increased less rapidly.
  • Can store large amounts of data, but
  • Cannot process these data with traditional
    algorithms.
  • In the Pass-Efficient Model
  • Data are assumed to be stored on disk/tape.
  • Algorithm has access to the data via a pass over
    the data.
  • Algorithm is allowed additional RAM space and
    additional computation time.
  • An algorithm is pass-efficient if it uses a small
    constant number of passes and sublinear
    additional time and space to compute a
    description of the solution.
  • Note: If the data are an m x n matrix A, then
    algorithms which require additional time and
    space that is O(m+n) or O(1) are pass-efficient.

8
Random Sampling
  • Random Sampling and Randomized Algorithms
  • Better complexity properties (randomization as a
    resource).
  • Simpler algorithms and/or analysis (maybe
    de-randomize later).
  • Uniform Sampling
  • Typically things work in expectation, but have
    poor variance properties.
  • Non-uniform Sampling
  • With "good" probabilities, we can make the variance
    small.

9
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

10
Approximating Matrix Multiplication
(D. & Kannan FOCS '01; D., Kannan, & M. TR
'04, SICOMP '06) Problem Statement: Given an
m-by-n matrix A and an n-by-p matrix B,
approximate the product AB, or, equivalently,
approximate the sum of n rank-one matrices:
  AB = Σ_{i=1..n} A^(i) B_(i),
where A^(i) is the i-th column of A and B_(i) is
the i-th row of B. Each term in the summation is
a rank-one matrix.
11
Approximating AB by random sampling
  • Algorithm
  • Fix a set of probabilities p_i, i = 1,...,n, summing
    up to 1.
  • For t = 1 up to s,
  • set j_t = i, where Pr(j_t = i) = p_i
  • (Pick s terms of the sum, with replacement, with
    respect to the p_i.)
  • Approximate AB by the sum of the s terms, after
    scaling.

12
Random sampling (cont'd)
Keeping the terms j_1, j_2, ..., j_s.
13
The algorithm (matrix notation)
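
A minimal NumPy sketch of this sampling-and-rescaling construction (our own
illustration; the function name and the example sizes are ours, not from the
slides):

import numpy as np

def approx_matmul(A, B, s, rng=None):
    # Sample s rank-one terms A^(j) B_(j) i.i.d. with the variance-minimizing
    # probabilities p_j proportional to ||A^(j)|| * ||B_(j)||, rescale each
    # kept term by 1/sqrt(s * p_j), and return C @ R, whose expectation is A @ B.
    rng = np.random.default_rng(rng)
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=s, replace=True, p=p)
    scale = 1.0 / np.sqrt(s * p[idx])
    C = A[:, idx] * scale            # m x s
    R = B[idx, :] * scale[:, None]   # s x p
    return C @ R

A = np.random.randn(1000, 2000)
B = np.random.randn(2000, 500)
AB_approx = approx_matmul(A, B, s=200)
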
14
Simple Lemmas
  • The input matrices are given in "sparse
    unordered" representation; e.g., their non-zero
    entries are presented as triples (i, j, A_ij) in
    any order.
  • The expectation of CR (element-wise) is AB.
  • Our nonuniform sampling minimizes the variance of
    the estimator.
  • It is easy to implement the sampling in two
    passes.
  • If the matrices are dense, the algorithm runs in
    O(smp) time, instead of O(nmp) time.
  • It requires O(sm + sp) RAM space.
  • It does not tamper with the sparsity of the matrices.

15
Error Bounds
For the above algorithm,
  E[ ||AB - CR||_F ] ≤ (1/√s) ||A||_F ||B||_F.
For the above algorithm, with probability at
least 1-δ, the same bound holds up to an extra
log(1/δ) factor.
  • This is a relative error bound if ||AB||_F =
    Ω(||A||_F ||B||_F), i.e. if there is not much
    cancellation in the multiplication.
  • We removed the expectation (by applying a
    martingale argument) and so have an extra
    log(1/δ) factor.
  • Markov's inequality would also remove the
    expectation, introducing an extra 1/δ factor.

16
Special case: B = A^T
If B = A^T, then the sampling probabilities are
  p_i = ||A^(i)||^2 / ||A||_F^2.
Also, R = C^T, and the error bounds become bounds
on ||AA^T - CC^T||_{2,F}.
17
Special case: B = A^T (cont'd)
(Rudelson & Vershynin '04, Vershynin '04)
Improvement of the spectral norm bound for the
special case B = A^T.
  • Uses a result of M. Rudelson for random vectors
    in isotropic position.
  • Tight concentration results can be proven using
    Talagrand's theory.
  • The sampling procedure is slightly different: s
    columns/rows are kept in expectation, i.e.,
    column i is picked (independently) with probability
    proportional to ||A^(i)||^2.
18
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

19
Singular Value Decomposition (SVD)
A = U Σ V^T.
U (V): orthogonal matrix containing the left
(right) singular vectors of A. Σ: diagonal matrix
containing the singular values of A.
  • Exact computation of the SVD takes O(min{mn^2,
    m^2n}) time.
  • The top few singular vectors/values can be
    approximated faster (Lanczos/Arnoldi methods).

20
Rank-k approximations (A_k)
  • U_k (V_k): orthogonal matrix containing the top k
    left (right) singular vectors of A.
  • Σ_k: diagonal matrix containing the top k
    singular values of A.
  • Also, A_k = U_k Σ_k V_k^T = U_k U_k^T A.

A_k is the matrix of rank k such that ||A - A_k||_{2,F} is
minimized over all rank-k matrices! This property
is very useful in the context of Principal
Component Analysis.
21
Approximating SVD in O(n) time
  • (D., Frieze, Kannan, Vempala & Vinay SODA '99,
    JML '04; D., Kannan, & M. TR '04, SICOMP '06)
  • Given an m x n matrix A:
  • Sample c columns from A and rescale to form the
    m x c matrix C.
  • Compute the m x k matrix H_k of the top k left
    singular vectors of C.

Structural Theorem: For any probabilities and
number of columns,
  ||A - H_k H_k^T A||_{2,F}^2 ≤ ||A - A_k||_{2,F}^2 + 2√k ||AA^T - CC^T||_F.
Algorithmic Theorem: If p_i = ||A^(i)||^2 / ||A||_F^2 and
c ≥ 4η^2 k / ε^2 (η a factor depending on the failure probability), then
  ||A - H_k H_k^T A||_{2,F}^2 ≤ ||A - A_k||_{2,F}^2 + ε ||A||_F^2.
Proof: via the matrix multiplication theorem and
matrix perturbation theory.
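
A minimal NumPy sketch of this column-sampling idea (an illustration under the
stated probabilities, not the authors' implementation; the function name and
the choices of c and k below are ours):

import numpy as np

def linear_time_svd(A, c, k, rng=None):
    # Sample c columns with probability proportional to their squared norms,
    # rescale so that E[C C^T] = A A^T, and return the top-k left singular
    # vectors H_k of the sampled matrix C.
    rng = np.random.default_rng(rng)
    sq_norms = np.linalg.norm(A, axis=0) ** 2
    p = sq_norms / sq_norms.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])
    Hk, _, _ = np.linalg.svd(C, full_matrices=False)
    return Hk[:, :k]

A = np.random.randn(512, 512)
Hk = linear_time_svd(A, c=100, k=10)
A_approx = Hk @ (Hk.T @ A)          # the rank-k approximation analyzed above
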
22
Example of randomized SVD
(Figure: the original matrix A, a 512 x 512 test image, and the matrix
obtained after sampling columns.)
Compute the top k left singular vectors of the
matrix C and store them in the 512-by-k matrix H_k.
23
Example of randomized SVD (contd)
(Figure: the original image A next to its approximation H_k H_k^T A.)
A and H_k H_k^T A are close.
24
Element-wise sampling
(Achlioptas & McSherry, STOC '01, JACM '05)
  • The Algorithm in 2 lines:
  • To approximate a matrix A, keep a few elements
    of the matrix (instead of rows or columns) and
    zero out the remaining elements.
  • Compute a rank-k approximation to this sparse
    matrix (using Lanczos methods).

||A - S||_2 is bounded ⇒ (i) the singular values of
A and S are close, and (ii, under additional
assumptions) the top k left (right) singular
vectors of S span a subspace that is close to the
subspace spanned by the top k left (right)
singular vectors of A.
25
Element-wise sampling (contd)
  • Approximating singular values fast:
  • Zero out (a large number of) elements of A,
    scale the remaining ones appropriately.
  • Compute the singular values of the resulting
    sparse matrix using iterative techniques.
  • (Good choice for p_ij: p_ij = s·A_ij^2 / Σ_{i,j} A_ij^2,
    where s denotes the expected number of elements
    that we seek to keep in S.)
  • Note: Each element is kept or discarded
    independently of the others.
  • Similar ideas have been used to:
  • explain the success of Latent Semantic Indexing
    (LSI) (Papadimitriou, Raghavan, Tamaki, Vempala,
    PODS '98; Azar, Fiat, Karlin, McSherry, Saia
    STOC '01)
  • design recommendation systems (Azar, Fiat,
    Karlin, McSherry, Saia STOC '01)
  • speed up kernel computations (Achlioptas,
    McSherry, and Schölkopf, NIPS '02)
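
A small NumPy/SciPy sketch of this element-wise sparsification (illustrative
only; the function name and matrix sizes are ours):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def elementwise_sparsify(A, s, rng=None):
    # Keep A_ij independently with probability p_ij = min(1, s*A_ij^2/sum(A^2)),
    # rescaling kept entries by 1/p_ij so that E[S] = A.
    rng = np.random.default_rng(rng)
    p = np.minimum(1.0, s * A**2 / np.sum(A**2))
    keep = rng.random(A.shape) < p
    S = np.where(keep, A / np.where(keep, p, 1.0), 0.0)
    return csr_matrix(S)

A = np.random.randn(1000, 800)
S = elementwise_sparsify(A, s=50_000)    # roughly 50K nonzeros kept in expectation
U, sigma, Vt = svds(S, k=10)             # iterative (Lanczos-type) SVD on the sparse sketch
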

26
Element-wise vs. row/column sampling
  • Quite different techniques!
  • Row/column sampling preserves subspace/structural
    properties of the matrices.
  • Element-wise sampling explains how adding noise
    and/or quantizing the elements of a matrix
    perturbs its singular values/vectors.
  • The two techniques should be complementary!
  • Some similarities and differences
  • Similar error bounds.
  • Element-wise sampling is doable in one pass!
  • Running time of element-wise sampling depends on
    the speed of, e.g., Arnoldi methods.
  • Element-wise methods do not seem amenable to
    many of the extensions we will present.

27
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

28
A novel CUR matrix decomposition
(D. & Kannan, SODA '03; D., Kannan, & M. TR '04,
SICOMP '06)
Create an approximation to the original matrix
of the following form: A ≈ C·U·R.
29
The CUR decomposition
Given a large m-by-n matrix A (stored on disk),
compute a decomposition CUR of A such that:
  • C consists of c = O(k/ε^2) columns of A.
  • R consists of r = O(k/ε^2) rows of A.
  • C (R) is created using importance sampling, e.g.,
    columns (rows) are picked in i.i.d. trials with
    respect to probabilities proportional to their
    squared Euclidean norms.
  • C, U, R can be stored in O(m+n) space, after
    making two passes through the entire matrix A,
    using O(m+n) additional space and time.
  • The product CUR satisfies (with high
    probability)
    ||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F.

30
Computing U
  • Intuition (which can be formalized; see later):
  • The CUR algorithm expresses every row of the
    matrix A as a linear combination of a small
    subset of the rows of A.
  • This small subset consists of the rows in R.
  • Given a row of A, say A_(i), the algorithm
    computes a good fit for the row A_(i) using the
    rows in R as the basis, by approximately solving
    min_u ||A_(i) - u R||_2.

But only c = O(1) elements of the i-th row are
given as input. So, we only approximate the
optimal vector u instead of computing it
exactly. Actually, the pass-efficient CUR
decomposition approximates the approximation.
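
A hedged NumPy sketch of the CUR idea (our own illustration: it samples by
squared norms and uses the Frobenius-optimal U = C^+ A R^+, which the
pass-efficient algorithm only approximates since it never touches all of A at
once):

import numpy as np

def cur_sketch(A, c, r, rng=None):
    # Sample columns/rows with probability proportional to squared norms,
    # then take U = pinv(C) @ A @ pinv(R), the best U for the chosen C and R.
    rng = np.random.default_rng(rng)
    pc = np.linalg.norm(A, axis=0) ** 2; pc = pc / pc.sum()
    pr = np.linalg.norm(A, axis=1) ** 2; pr = pr / pr.sum()
    cols = rng.choice(A.shape[1], size=c, replace=True, p=pc)
    rows = rng.choice(A.shape[0], size=r, replace=True, p=pr)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

A = np.random.randn(2000, 1500)
C, U, R = cur_sketch(A, c=60, r=60)
rel_err = np.linalg.norm(A - C @ U @ R, "fro") / np.linalg.norm(A, "fro")
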
31
Error bounds for CUR
Assume A_k is the best rank-k approximation to
A (through SVD). Then, if we pick O(k/ε^2) rows
and O(k/ε^2) columns,
  ||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F.
32
Previous CUR-type decompositions
(For details see Drineas & Mahoney, "A Randomized
Algorithm for a Tensor-Based Generalization of
the SVD," '05.)
33
Lower Bounds
Question: How many queries does a sampling
algorithm need to approximate a given function
accurately with high probability?
  • (Ziv Bar-Yossef '03, '04) Lower bounds for the
    low-rank matrix approximation problem and the
    matrix reconstruction problem.
  • Any sampling algorithm that w.h.p. finds a good
    low-rank approximation requires Ω(mn) queries.
  • Even if the algorithm is given the exact weight
    distribution over the columns of a matrix, it will
    still require Ω(k/ε^4) queries.
  • Finding a matrix D such that ||A - D||_F ≤ ε||A||_F
    requires Ω(mn) queries, and finding a D such
    that ||A - D||_2 ≤ ε||A||_2 requires Ω(mn) queries.
  • Applied to our results:
  • The LinearTimeSVD algorithm is optimal w.r.t.
    the Frobenius-norm bounds.
  • The ConstantTimeSVD algorithm is optimal w.r.t.
    the spectral-norm bounds up to poly factors.
  • The CUR algorithm is optimal for constant ε.

34
Overview (1/2)
  • Data Streaming Models and Random Sampling
  • Matrix Multiplication
  • Singular Value Decomposition
  • CUR Matrix Decomposition
  • Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

35
CUR application Data Mining
  • Database: An m-by-n matrix A, e.g., m (> 10^6)
    objects and n (> 10^5) features.
  • Queries: Given a new object x, find similar
    objects (nearest neighbors) in A.
  • Closeness: Two normalized objects x and d are
    close if x^T d = cos(x, d) is high.
  • Given a query vector x, the matrix product Ax
    computes all the angles/distances.
  • Key observation: The exact value x^T d might not
    be necessary.
  • The feature values in the vectors are set by
    coarse heuristics.
  • It is in general enough to see if x^T d >
    threshold.
  • Algorithm: Given a query vector x, compute CURx
    (instead of Ax) to identify nearest neighbors; see
    the sketch after this list.
  • Theorem: We have a bound on the worst-case error of
    using CUR instead of A.
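
A tiny illustration (ours) of answering such queries from a precomputed CUR
sketch instead of the full matrix: it assumes C, U, R come from a CUR
decomposition of the row-normalized object matrix (e.g., cur_sketch above),
and the threshold value is hypothetical.

import numpy as np

def approximate_neighbors(C, U, R, x, threshold=0.8):
    # scores approximate A @ x, but only the sketch (C, U, R) is touched
    scores = C @ (U @ (R @ x))
    return np.nonzero(scores > threshold)[0]
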

36
CUR application Genetic Microarray Data
Exploit structural properties of CUR in
biological applications.
(Figure: a genes-by-experimental-conditions data matrix.)
Find a good set of genes and arrays to include
in C and R. Provable and/or heuristic strategies
are acceptable.
  • Common in Biological/Chemical/Medical
    applications of PCA:
  • Explain the singular vectors by mapping them to
    meaningful biological processes.
  • This is a challenging task (think
    "reification")!
  • CUR is a low-rank decomposition in terms of the
    data that practitioners understand.
  • Use it to explain the data and do dimensionality
    reduction, classification, clustering.
  • Gene microarray data: M., D., & Alter (UT Austin)
    (sporulation and cell cycle data).

37
CUR application Recommendation Systems
(D., Raghavan, & Kerenidis, STOC '02)
  • The problem: m customers and n products; A_ij is
    the (unknown) utility of product j for customer
    i.
  • The goal: recreate A from a few samples to
    recommend high-utility products.
  • (KRRT98) Assuming strong clustering of the
    products, competitive algorithms exist even with only 2
    samples/customer.
  • (AFKMS01) Assuming sampling of Ω(mn) entries of
    A and a gap requirement, one can accurately recreate A.
  • Question: Can we get competitive performance by
    sampling o(mn) elements?
  • Answer: Apply the CUR decomposition.

38
Kernel-CUR Motivation
  • Kernel-based learning methods to extract
    non-linear structure:
  • Choose features to define a (dot product) space
    F.
  • Map the data, X, to F by Φ: X -> F.
  • Do classification, regression, and clustering in
    F with linear methods (SVMs, GPs, SVD).
  • If the Gram matrix G, where G_ij = k_ij =
    (Φ(X^(i)), Φ(X^(j))), is dense but has low numerical
    rank, then calculations of interest need O(n^2) space
    and O(n^3) time:
  • matrix inversion in GP prediction,
  • quadratic programming problems in SVMs,
  • computation of the eigendecomposition of G.
  • Relevant recent work using low-rank methods:
  • (Williams and Seeger, NIPS '01, etc.) Nystrom
    method for out-of-sample extensions.
  • (Achlioptas, McSherry, and Schölkopf, NIPS '02)
    randomized kernels.

39
Kernel-CUR Decomposition
(D. & M., COLT '05, TR '05, JMLR '05)
  • Input: n x n SPSD matrix G, probabilities p_i,
    i = 1,...,n, c < n, and k < c.
  • Algorithm:
  • Let C be the n x c matrix containing c randomly
    sampled columns of G.
  • Let W be the c x c matrix containing the
    intersection of C and C^T.
  • Theorem: Let p_i = G_ii^2 / Σ_i G_ii^2. If c = O(k
    log(1/δ)/ε^4), then w.p. at least 1-δ,
    ||G - C W_k^+ C^T||_{2,F} ≤ ||G - G_k||_{2,F} + ε Σ_i G_ii^2.

If c = O(log(1/δ)/ε^4), then with probability at
least 1-δ, the corresponding spectral-norm bound holds.
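
A rough NumPy sketch of this Nystrom-style construction (ours; it omits the
column rescaling used in the analysis, and the RBF Gram matrix in the example
is made-up data):

import numpy as np

def nystrom_kernel_cur(G, c, k, rng=None):
    # Sample c columns of the SPSD Gram matrix G with p_i ~ G_ii^2, form C and
    # the intersection W, and approximate G by C pinv(W_k) C^T, where W_k is
    # the best rank-k approximation of W.
    rng = np.random.default_rng(rng)
    p = np.diag(G) ** 2
    p = p / p.sum()
    idx = rng.choice(G.shape[0], size=c, replace=True, p=p)
    C = G[:, idx]
    W = G[np.ix_(idx, idx)]
    ew, ev = np.linalg.eigh(W)
    top = np.argsort(ew)[::-1][:k]
    Wk_pinv = (ev[:, top] / np.maximum(ew[top], 1e-12)) @ ev[:, top].T
    return C @ Wk_pinv @ C.T

X = np.random.randn(500, 5)                              # made-up data points
sq = np.sum(X**2, axis=1)
G = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T))   # RBF Gram matrix
G_approx = nystrom_kernel_cur(G, c=80, k=20)
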
40
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

41
Datasets modeled as tensors
  • Tensors (naively, a dataset subscripted by
    multiple indices) appear both in Math and CS.
  • Represent high-dimensional functions.
  • Connections to complexity theory (e.g., matrix
    multiplication complexity).
  • Statistical applications (e.g., ICA, HOS, etc.).
  • Large data-set applications (e.g., Medical
    Imaging, Hyperspectral Imaging).
  • Problem: There does not exist a definition of
    tensor rank (and an associated tensor SVD) with the
    nice properties found in the matrix case.
  • (Lek-Heng Lim '05: strong impossibility
    results!)
  • Common heuristic: "unfold" the tensor along a
    mode and apply Linear Algebra.
  • We will do this, but note that this kills the
    essential tensor structure.

42
Datasets modeled as tensors (contd)
Goal: Extract structure from a tensor dataset A
(naively, a dataset subscripted by multiple
indices) using a small number of samples.
  • Tensor rank (the minimum number of rank-one
    tensors) is NP-hard to compute.
  • Tensor α-rank (unfold along the α-th mode and
    take the matrix SVD) is a commonly-used heuristic.

Randomized Tensor-CUR: unfold along a
distinguished mode and reconstruct.
Randomized Tensor-SVD: unfold along every mode and
choose columns. (Drineas & Mahoney, "A Randomized
Algorithm for a Tensor-Based Generalization of
the SVD," TR '05.)
43
The TensorCUR algorithm (3-modes)
  • Choose the preferred mode α (e.g., time).
  • Pick a few representative slabs (let R denote
    the tensor of the sampled slabs).
  • Use only information in a small number of
    representative fibers (let C denote the tensor
    of sampled fibers and U a low-dimensional
    encoding tensor).
  • Express the remaining slabs as linear
    combinations of the basis of sampled slabs.
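
A rough NumPy sketch of these four steps for a 3-mode tensor (our
illustration; it samples slabs and fibers uniformly rather than with the
data-dependent probabilities the actual algorithm uses):

import numpy as np

def tensor_cur_sketch(A, n_slabs, n_fibers, rng=None):
    # A has shape (nx, ny, nt); sample a few time slabs and a few (x, y)
    # fibers, then express every slab as a least-squares combination of the
    # sampled slabs, fit only on the sampled fibers.
    rng = np.random.default_rng(rng)
    nx, ny, nt = A.shape
    slabs = rng.choice(nt, size=n_slabs, replace=False)          # sampled slabs
    fibers = rng.choice(nx * ny, size=n_fibers, replace=False)   # sampled fibers
    R = A[:, :, slabs].reshape(nx * ny, n_slabs)                 # basis slabs, unfolded
    A_unf = A.reshape(nx * ny, nt)
    coeffs, *_ = np.linalg.lstsq(R[fibers, :], A_unf[fibers, :], rcond=None)
    return (R @ coeffs).reshape(nx, ny, nt)

A = np.random.randn(64, 64, 128)
A_approx = tensor_cur_sketch(A, n_slabs=12, n_fibers=1000)
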

44
Tensor-CUR application Hyperspectral Image
Analysis
(with M. Maggioni and R. Coifman at Yale)
Goal: Extract structure from temporally-resolved
images or spectrally-resolved images of medical
interest using a small number of samples (images
and/or pixels).
Note A temporally or spectrally resolved image
may be viewed as a tensor (naively, a dataset
subscripted by multiple indices) or as a matrix
(whose columns have internal structure that is
not modeled).
m x n x p tensor A or mn x p matrix A
Note: The chosen images are a dictionary from the
data to express every image.
Note: The chosen pixels are a dictionary from the data
to express every pixel.
45
(No Transcript)
46
Sampling hyperspectral data
  • Sample slabs depending on total absorption.
  • For example, absorption at two pixel types

Sample fibers uniformly (since intensity depends
on stain).
47
Eigen-analysis of slabs and fibers
48
Look at the exact (65-th) slab.
49
The (65-th) slab approximately reconstructed
This slab was reconstructed by approximate
least-squares fit to the basis from slabs 41 and
50, using 1000 (of 250K) pixels/fibers.
50
Tissue Classification - Exact Data
51
Tissue Classification - Ns = 12, Nf = 1000
52
Tensor-CUR application Recommendation Systems
  • Important Comment:
  • Utility is an ordinal, not a cardinal, concept.
  • Compare products; don't assign utility values.
  • Recommendation Model Revisited:
  • Every customer has an n-by-n matrix (whose
    entries are +1, -1) that represents pair-wise
    product comparisons.
  • There are m such matrices, forming an
    n-by-n-by-m 3-mode tensor A.
  • Extract the structure of this tensor.

53
Application to Jester Joke Recommendations
Use just the 14,140 "full" users who rated all
100 Jester jokes. For each user, convert the
utility vector to a 100 x 100 pair-wise preference
matrix. Choose, e.g., 300 slabs/users, and a
small number of fibers/comparisons.
54
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

55
Modeling data as matrices
  • People studying data
  • put the data onto a graph or into a vector space
  • even if the data don't naturally or obviously
    "live" there
  • and perform graph operations or vector space
    operations
  • to extract information from the data.
  • Such data often have structure unrelated to the
    graphical or linear algebraic structure implicit
    in the modeling.
  • This non-modeled structure is difficult to
    formalize.
  • Practitioners often have extensive field-specific
    intuition about the data.
  • This intuition is often used to choose where
    the data live.
  • The choice of where the data live may capture
    non-modeled structure.

56
Modeling data as matrices (contd)
  • Matrices often arise since n objects
    (documents, genomes, images, web pages), each
    with m features, may be represented by an m x n
    matrix A.
  • Such data matrices often have structure:
  • for linear structure, SVD or PCA is often used;
  • for non-linear structure, kernel-based (e.g.,
    diffusion-based) methods are used;
  • other structures include sparsity,
    nonnegativity, etc.
  • Note: We know what the rows/columns "mean" from
    the application area.
  • Goal: Develop principled, provably-accurate
    algorithmic methods such that
  • they are agnostic with respect to any particular
    field,
  • one can fruitfully couple them to the
    field-specific intuition,
  • they perform well on complex non-toy data sets.

57
SVD and low-rank approximations
  • Theorem: Let A be an m x n matrix of rank ρ.
    Truncate the SVD of A by keeping k ≤ ρ terms: A_k =
    U_k Σ_k V_k^T. This gives the best rank-k
    approximation to A.
  • Interesting properties of the truncated SVD:
  • Used in data analysis via Principal Components
    Analysis (PCA).
  • The rows of U_k (= U_{A,k}) are NOT orthogonal and
    are NOT unit length.
  • The lengths/Euclidean norms of the rows of U_k
    capture a notion of information dispersal.
  • Gives a low-rank approximation with a very
    particular structure (rotate-rescale-rotate).
  • Best at capturing Frobenius (and other) norm.
  • Problematic w.r.t. sparsity, interpretability,
    etc.

58
Problems with SVD/Eigen-Analysis
  • Problems arise since structure in the data is not
    respected by mathematical operations on the data:
  • Reification - maximum variance directions are
    just that.
  • Interpretability - what does a linear
    combination of 6000 genes mean?
  • Sparsity - is destroyed by orthogonalization.
  • Non-negativity - is a convex and not a linear
    algebraic notion.
  • The SVD gives two bases to diagonalize the
    matrix.
  • Truncating gives a low-rank matrix approximation
    with a very particular structure.
  • Think rotation-with-truncation, rescaling,
    rotation-back-up.
  • Question: Do there exist "better" low-rank
    matrix approximations?
  • better structural properties for certain
    applications.
  • better at respecting relevant structure.
  • better for interpretability and informing
    intuition.

59
Exactly and approximately rank-k matrices
  • Theorem: Let the m x n matrix A be of rank exactly
    k. Then:
  • There exist k linear combinations of columns,
    rows, etc. such that
  • A = A_k = U_k Σ_k V_k^T.
  • There exist k actual columns and rows of A,
    permuted, such that A = CUR exactly.

Take-home message: Low-rank structure IS
redundancy in columns/rows.
Theorem: Let the m x n matrix A be approximately rank k.
Then A ≈ A_k = U_k Σ_k V_k^T is the best approximation.
Question: Can we express approximately rank-k matrices in
terms of their actual columns and/or rows?
60
Dictionaries for data analysis
  • Discrete Cosine Transform (DCT):
  • f_j = Σ_{n=0,...,N-1} x_n cos[ j (n + 1/2) π / N ]
  • the basis is fixed.
  • O(N^2) or O(N log N) computation to determine
    coefficients.
  • Singular Value Decomposition (SVD):
  • A = Σ_{i=1,...,ρ} σ_i U^(i) (V^(i))^T = Σ_{i=1,...,ρ} σ_i A_i
  • O(N^3) computation to determine basis and
    coefficients.
  • Many other more complex/expensive procedures,
    depending on the application.
  • Question: Can actual data points and/or feature
    vectors be the dictionary?
  • Core-sets on graphs.
  • CUR-decompositions on matrices.

61
CX and CUR matrix decompositions
Recall: Matrices are about their rows and
columns.
Recall: Low-rank matrices have
redundancy in their rows and columns.
Def: A CX matrix decomposition is a low-rank approximation
explicitly expressed in terms of a small number
of columns of the original matrix A (e.g., PCA,
CCA).
Def: A CUR matrix decomposition is a
low-rank approximation explicitly expressed in
terms of a small number of columns and rows of
the original matrix A.
62
Dictionaries the SVD
  • A = U Σ V^T = Σ_{i=1,...,ρ} σ_i U^(i) (V^(i))^T,
  • where U^(i), V^(i) = eigen-cols and eigen-rows.
  • Approximate A^(j) ≈ Σ_{i=1,...,k} z_ij U^(i)
  • by min_{z_ij} || A^(j) - Σ_{i=1,...,k} z_ij U^(i) ||_2
  • Z = U_k^T A  ->  A ≈ A_k = (U_k U_k^T) A
  • project onto the space of the top k eigen-cols.
  • Z = Σ_k V_k^T  ->  A ≈ A_k = U_k (Σ_k V_k^T)
  • approximate every column of A in terms of a small
    number of eigen-rows and a low-dimensional
    encoding matrix Σ_k.

63
Dictionaries columns and rows
  • A ≈ CUR = Σ_{ij} u_ij C^(i) R_(j), where U = W^+ and W =
    the intersection of C and R,
  • where C^(i), R_(j) = actual-cols and actual-rows.
  • Approximate A^(j) ≈ Σ_{i=1,...,c} y_ij C^(i)
  • by min_{y_ij} || A^(j) - Σ_{i=1,...,c} y_ij C^(i) ||_2
  • Y = C^+ A  ->  A ≈ P_C A = (C C^+) A
  • project onto the space of those c actual-cols.
  • Y = W^+ R  ->  A ≈ P_C A ≈ C (W^+ R)
  • approximate every column of A in terms of a small
    number of actual-rows and a low-dimensional
    encoding matrix U = W^+.
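
A two-line NumPy illustration (ours) of the CX projection above: once the
columns C are fixed, the best encoding is Y = C^+ A, giving A ≈ C Y = (C C^+) A.
The choice of columns below (every 10th) is for illustration only.

import numpy as np

def cx_approximation(A, cols):
    C = A[:, cols]                     # chosen actual columns of A
    Y = np.linalg.pinv(C) @ A          # best coefficients: Y = C^+ A
    return C, Y

A = np.random.randn(300, 200)
C, Y = cx_approximation(A, cols=np.arange(0, 200, 10))
err = np.linalg.norm(A - C @ Y, "fro")
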

64
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

65
Problem formulation (1 of 3)
  • Never mind columns and rows - just deal with
    columns (for now) of the matrix A.
  • Could ask to find the "best" k of n columns of
    A.
  • Combinatorial problem - a trivial algorithm takes
    n^k time.
  • Probably NP-hard if k is not fixed.
  • Let's ask a different question.
  • Fix a rank parameter k.
  • Let's over-sample columns by a little (e.g.,
    k+3, 10k, k^2, etc.).
  • Try to get close (additive error or relative
    error) to the best rank-k approximation.

66
Problem formulation (2 of 3)
Ques: Do there exist O(k), or O(k^2), or ...,
columns s.t.
  ||A - CC^+A||_{2,F} ≤ ||A - A_k||_{2,F} + ε||A||_F?
Ans: Yes - and we can find them in O(m+n)
space and time after two passes over the data!
(DFKVV99, DKM04)
Ques: Do there exist O(k), or
O(k^2), or ..., columns such that
  ||A - CC^+A||_{2,F} ≤ (1-ε)^{-1} ||A - A_k||_{2,F} + ε^t ||A||_F?
Ans: Yes - and we can
find them in O(m+n) space and time after t passes
over the data! (RVW05, DM05)
Ques: Do there
exist, and can we find, O(k), or O(k^2), or ...,
columns such that
  ||A - CC^+A||_F ≤ (1+ε) ||A - A_k||_F?
Ans: Yes, they exist - existential
proof - no non-exhaustive algorithm given!
(RVW05, DRVW06)
Ans: ...
67
Problem formulation (3 of 3)
Ques: Do there exist O(k), or O(k^2), or ...,
columns and rows such that
  ||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε||A||_F?
Ans: Yes - lots of them,
and we can find them in O(m+n) space and time after
two passes over the data! (DK03, DKM04)
Note: "lots of them" since these are randomized Monte
Carlo algorithms!
Ques: Do there exist O(k), or
O(k^2), or ..., columns and rows such
that
  ||A - CUR||_F ≤ (1+ε) ||A - A_k||_F?
Ans:
68
Theorem Relative-Error CUR
Fix any k, ε, δ. Then, there exists a Monte
Carlo algorithm that uses O(SVD(A_k)) time to find
C and R and construct U s.t.
  ||A - CUR||_F ≤ (1+ε) ||A - A_k||_F
holds with probability at least 1-δ, by picking
c = O( k^2 log(1/δ) / ε^2 ) columns, and
r = O( k^4 log^2(1/δ) / ε^6 ) rows.
(Current theory work: we can improve the sampling
complexity to c, r = O(k poly(1/ε, 1/δ)).)
(Current empirical work: we can usually choose
c, r ≈ k+4.)
(Don't worry about δ: choose δ = 1% if you want!)
69
L2 Regression problems
  • Problem: minimize ||Ax - b||_2 over x, where A is an
    n x d matrix and b is an n-vector.
  • First consider overconstrained problems, n >> d.
  • Typically, there is no x such that Ax = b.
  • Can generalize to non-overconstrained problems if
    rank(A) = k.
  • We seek sampling-based algorithms for
    approximating ℓ2 regression.
  • Nontrivial structural insights in overconstrained
    problems.
  • Nontrivial algorithmic insights for
    non-overconstrained problems.

70
Creating an induced subproblem
  • Algorithm
  • Fix a set of probabilities p_i, i = 1,...,n, summing up
    to 1.
  • Pick r indices from {1,...,n} in r i.i.d. trials,
    with respect to the p_i's.
  • For each sampled index j, keep the j-th row of A
    and the j-th element of b; rescale both by
    (1/(r p_j))^{1/2}.
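
A minimal NumPy sketch of this row-sampling scheme (ours; it uses
leverage-score probabilities, one choice satisfying the conditions given on
the following slides, and the problem sizes are made up):

import numpy as np

def sampled_l2_regression(A, b, r, rng=None):
    # Sample r rows i.i.d. with probabilities p_i = ||U_(i)||^2 / d (leverage
    # scores), rescale row i of A and entry b_i by 1/sqrt(r * p_i), and solve
    # the small induced least-squares problem.
    rng = np.random.default_rng(rng)
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    p = np.sum(U**2, axis=1)
    p = p / p.sum()
    idx = rng.choice(A.shape[0], size=r, replace=True, p=p)
    scale = 1.0 / np.sqrt(r * p[idx])
    SA = A[idx, :] * scale[:, None]
    Sb = b[idx] * scale
    x_tilde, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x_tilde

n, d = 100_000, 20
A = np.random.randn(n, d)
b = A @ np.random.randn(d) + 0.01 * np.random.randn(n)
x_approx = sampled_l2_regression(A, b, r=2000)
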

71
The induced subproblem
72
Our main L2 Regression result
If the p_i satisfy certain conditions, then with
probability at least 1-δ, the residual and the
solution of the induced subproblem are close to
those of the original problem (the solution-vector
bound involves κ(A) = the condition number of A).
The sampling complexity is r = O(d^2).
(New improvement: we can reduce the sampling
complexity to r = O(d).)
73
Conditions for the probabilities
The conditions that the p_i must satisfy, for some
β_1, β_2, β_3 ∈ (0,1]:
74
Rows of left singular vectors
What do the lengths of the rows of the n x d
matrix U = U_A mean?
Consider possible n x d matrices U of d left singular vectors:
  I_{n,k}: k columns from the identity; row lengths 0 or 1;
    I_{n,k} x -> x
  H_{n,k}: k columns from the n x n Hadamard (real Fourier)
    matrix; row lengths all equal; H_{n,k} x -> maximally dispersed
  U_k: k columns from any orthogonal matrix; row lengths
    between 0 and 1
Lengths of the rows of U = U_A correspond to a notion of
"information dispersal": where in R^m the "information" in A is
sent, not what the "information" is.
75
Comments on L2 Regression
  • Main point: The relevant information for ℓ2
    regression, if n >> d, is contained in an induced
    subproblem of size O(d^2)-by-d.
  • In O(nd^2) = O(SVD(A)) = O(SVD(A_d)) time we can
    easily compute p_i's that satisfy all three
    conditions, with β_1 = β_2 = β_3 = 1/3.
  • Too expensive in practice for this
    over-constrained problem!
  • NOT too expensive when applied to CX and CUR
    matrix problems!!
  • Key observation (FKV98, DK01, DKM04, RV04): U^s
    is almost orthogonal, and we can bound the
    spectral and the Frobenius norm of (U^s)^T U^s - I.
  • NOTE: K. Clarkson in SODA 2005 analyzed
    sampling-based algorithms for overconstrained ℓ1
    regression (p = 1) problems.

76
L2 Regression and CUR Approximation
  • Extended L2 Regression Algorithm:
  • Input: m x n matrix A, m x p matrix B, and a
    rank parameter k.
  • Output: n x p matrix X approximately solving
    min_X ||A_k X - B||_F.
  • Algorithm: Randomly sample r = O(d^2) or r = O(d) rows
    from A_k and B;
  • Solve the induced sub-problem:
  • X_opt = A_k^+ B ≈ (S A_k)^+ S B
  • Complexity: O(SVD(A_k)) time and space.
  • Corollary 1: Approximately solve min_X ||A_k^T X -
    A^T||_F to get columns C such that ||A - CC^+A||_F ≤
    (1+ε) ||A - A_k||_F.
  • Corollary 2: Approximately solve min_X ||C X -
    A||_F to get rows R such that ||A - CUR||_F ≤ (1+ε)
    ||A - CC^+A||_F.

77
Theorem Relative-Error CUR
Fix any k, ε, δ. Then, there exists a Monte
Carlo algorithm that uses O(SVD(A_k)) time to find
C and R and construct U s.t.
  ||A - CUR||_F ≤ (1+ε) ||A - A_k||_F
holds with probability at least 1-δ, by picking
c = O( k^2 log(1/δ) / ε^2 ) columns, and
r = O( k^4 log^2(1/δ) / ε^6 ) rows.
(Current theory work: we can improve the sampling
complexity to c, r = O(k poly(1/ε, 1/δ)).)
(Current empirical work: we can usually choose
c, r ≈ k+4.)
(Don't worry about δ: choose δ = 1% if you want!)
78
Subsequent relative-error algorithms
  • November 2005: Drineas, Mahoney, and
    Muthukrishnan
  • The first relative-error low-rank matrix
    approximation algorithm.
  • O(SVD(A_k)) time and O(k^2) columns for both CX
    and CUR decompositions.
  • January 2006: Har-Peled
  • Used ε-nets and VC-dimension arguments on
    optimal k-flats.
  • O(mn k^2 log k) - linear-in-mn - time to get a
    (1+ε) approximation.
  • March 2006: Deshpande and Vempala
  • Used the volume sampling / adaptive sampling
    procedure of RVW05, DRVW06.
  • O(Mk/ε) = O(SVD(A_k)) time and O(k log k)
    columns for a CX-like decomposition.
  • April 2006: Drineas, Mahoney, and Muthukrishnan
  • Improved the DMM November 2005 result to O(k
    log k) columns.

79
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

80
CUR data application DNA tagging-SNPs
(data from K. Kidd's lab at Yale University;
joint work with Dr. Paschou at Yale University)
Single Nucleotide Polymorphisms (SNPs): the most common
type of genetic variation in the genome across
different individuals. They are known locations
in the human genome where two alternate
nucleotide bases (alleles) are observed (out of
A, C, G, T).
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
There are 10 million SNPs in the human genome,
so this table could have 10 million columns.
81
Recall the human genome
  • Human genome: 3 billion base pairs
  • 30,000 - 40,000 genes
  • The functionality of 97% of the genome is
    unknown.
  • BUT: individual differences (polymorphic
    variation) at 1 b.p. per thousand.

SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT TT
GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG
CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG
TT TT CC GG TT GG GG TT GG AA GG TT TT GG TT CC
CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG
CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GG GT GA AG GG TT TT GG TT CC CC CC CC
GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC
AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG
GT GT GA AG GG TT TT GG TT CC CC CC CC GG AA GG
GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA
GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG
AA GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA
CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG
CT CG CG CG AT CT CT AG CT AG GG TT GG AA GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG
GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GT TT GG AA
individuals
SNPs occur quite frequently within the genome
allowing the tracking of disease genes and
population histories. Thus, SNPs are effective
markers for genomic research.
82
Two copies of a chromosome (father, mother)
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
83
C
C
Two copies of a chromosome (father, mother)
  • An individual could be
  • Heterozygotic (in our study, CT = TC)
  • Homozygotic at the first allele, e.g., C

SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
84
T
T
Two copies of a chromosome (father, mother)
  • An individual could be
  • Heterozygotic (in our study, CT = TC)
  • Homozygotic at the first allele, e.g., C
  • Homozygotic at the second allele, e.g., T

SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
85
Why are SNPs important?
  • Genetic Association Studies
  • Locate causative genes for common complex
    disorders (e.g., diabetes, heart disease, etc.)
    by identifying association between affection
    status and known SNPs.
  • No prior knowledge about the function of the
    gene(s) or the etiology of the disorder is
    necessary.

Biology and Association Studies: The subsequent
investigation of candidate genes that are in
physical proximity with the associated SNPs is
the first step towards understanding the
etiological pathway of a disorder and designing
a drug.
Data Analysis and Association Studies:
Susceptibility alleles (and genotypes carrying
them) should be more common in the patient
population.
86
SNPs carry redundant information
  • Key observation: non-random relationships between
    SNPs.
  • The human genome is organized into a block-like
    structure.
  • Strong intra-block correlations.
  • We can focus only on tagSNPs.
  • Among different populations (e.g., European,
    Asian, African, etc.), different patterns of SNP
    allele frequencies or SNP correlations are often
    observed.
  • Understanding such differences is crucial in
    order to develop the next generation of drugs
    that will be population specific (eventually
    genome specific) and not just disease
    specific.

87
Funding
  • Mapping the whole genome sequence of a single
    individual is very expensive.
  • Mapping all the SNPs is also quite expensive,
    but the costs are dropping fast.
  • HapMap project ($100,000,000 funding from NIH
    and other sources)
  • Map all 10,000,000 SNPs for 270 individuals from
    4 different populations (YRI, CEU, CHB, JPT), in
    order to create a genetic map to be used by
    researchers.
  • Also, funding from pharmaceutical companies,
    NSF, the Department of Justice, etc.

Is it possible to identify the ethnicity of a
suspect from his DNA?
88
Research directions
Research questions (working within a
population): (i) Are different SNPs correlated,
within or across populations? (ii) Find a good
set of tagging-SNPs capturing the diversity of a
chromosomal region of the human genome. (iii)
Find a set of individuals that capture the
diversity of a chromosomal region. (iv) Is
extrapolation feasible?
Why? - Understand structural properties of the
human genome. - Save time/money by assaying only
the tSNPs and predicting the rest. - Save
time/money by running (drug) tests only on the
cell lines of the selected individuals.
Existing literature: Pairwise metrics of SNP
correlation, called LD (linkage disequilibrium)
distance, are based on nucleotide frequencies and
co-occurrences. Almost no metrics exist for
measuring correlation between more than 2 SNPs,
and LD is very difficult to generalize. Exhaustive
and semi-exhaustive algorithms exist in order to pick
good ht-SNPs that have small LD distance with
all other SNPs. Using Linear Algebra: an SVD-based
algorithm was proposed by Lin & Altman, Am.
J. Hum. Gen. 2004.
89
The DNA SNP data
  • Samples from 38 different populations.
  • Average size: 50 subjects/population.
  • For each subject, 63 SNPs were assayed, from a
    region in chromosome 17 called SORCS3, 900,000
    bases long.
  • We are in the process of analyzing HapMap data
    as well as 3 more regions assayed by Kidd's lab
    (with Asif Javed).

90
(No Transcript)
91
Encoding the data
SNPs
0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0
1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1
-1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1
-1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0
1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1
0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1
-1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1
0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1
0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1
1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0
-1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0
0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1
1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1
-1 0 -1 -1 1
individuals
  • How?
  • Exactly two nucleotides (out of A, G, C, T) appear
    in each column.
  • Thus, the two alleles might be both equal to the
    first one (encode by +1), both equal to the
    second one (encode by -1), or different (encode
    by 0). (See the encoding sketch after this list.)
  • Notes:
  • The order of the alleles is irrelevant, so TG is
    the same as GT.
  • Encoding, e.g., GG to +1 and TT to -1 is not
    any different (for our purposes) from encoding
    GG to -1 and TT to +1.
  • (Flipping the signs of the columns of a matrix
    does not affect our techniques.)
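
A tiny Python sketch of this {+1, 0, -1} encoding (ours; it assumes exactly
two alleles appear in the column, as stated above, and the sign convention is
arbitrary):

def encode_snp_column(genotypes):
    # genotypes: list of two-letter strings for one SNP, e.g. ["AG", "AA", "GG"]
    alleles = sorted({base for g in genotypes for base in g})
    a, b = alleles[0], alleles[1]
    encoded = []
    for g in genotypes:
        if g == a + a:
            encoded.append(+1)        # homozygous for the first allele
        elif g == b + b:
            encoded.append(-1)        # homozygous for the second allele
        else:
            encoded.append(0)         # heterozygous; order (AG vs GA) irrelevant
    return encoded

print(encode_snp_column(["AG", "AA", "GG", "GA"]))   # [0, 1, -1, 0]
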

92
Evaluating (linear) structure
For each population:
  • We ran SVD to determine
the optimal number k of eigenSNPs covering 90%
of the variance.
If we pick the top k left singular vectors we can
express every column (i.e., SNP) of A as a linear
combination of the left singular vectors, losing
10% of the data.
  • We ran CUR to pick a small number (e.g., k+2)
of columns of A and express every column (i.e.,
SNP) of A as a linear combination of the picked
columns, losing 10% of the data.
SNPs
0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0
1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1
-1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1
-1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0
1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1
0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1
-1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1
0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1
0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1
1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0
-1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0
0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1
1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1
1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1
-1 0 -1 -1 1
individuals
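
A short NumPy sketch (ours) of the "90% of the variance" rule used above to
pick k; the matrix sizes follow the study (roughly 50 individuals by 63 SNPs),
but the data here are random placeholders rather than encoded genotypes.

import numpy as np

def eigensnps_for_variance(A, frac=0.90):
    # Smallest k such that the top-k singular values capture `frac` of the
    # squared Frobenius norm (the variance, for centered data).
    s = np.linalg.svd(A, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, frac) + 1)

A = np.random.randn(50, 63)        # individuals x SNPs (placeholder data)
k = eigensnps_for_variance(A, 0.90)
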
93
(No Transcript)
94
Predicting SNPs within a population
Split the individuals into two sets: training and
test. Given a small number of SNPs for all
individuals (tagging-SNPs), and all SNPs for
individuals in the training set, predict the
unassayed SNPs. Tagging-SNPs are selected using
only the training set. (A prediction sketch
follows the figure labels below.)
SNPs
Training individuals, chosen uniformly at
random (for a few subjects, we are given all
SNPs)
individuals
SNP sample (for all subjects, we are given a
small number of SNPs)
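
One simple least-squares decoding consistent with this setup (our own
illustration, not necessarily the exact procedure used in the study): regress
every SNP on the tagging-SNPs over the training individuals, apply the fit to
the test individuals, and round to {-1, 0, +1}.

import numpy as np

def predict_unassayed_snps(A_train, tag_cols, tags_test):
    # A_train: training individuals x all SNPs (encoded -1/0/+1)
    # tags_test: test individuals x tagging-SNPs (columns given by tag_cols)
    coeffs, *_ = np.linalg.lstsq(A_train[:, tag_cols], A_train, rcond=None)
    pred = tags_test @ coeffs
    return np.clip(np.rint(pred), -1, 1)
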
95
(No Transcript)
96
(No Transcript)
97
Predicting SNPs across populations
Given all SNPs for all individuals in population
X, and a small number of tagging-SNPs for
population Y, predict all unassayed SNPs for all
individuals of Y. Tagging-SNPs are selected
using only the training set. Training set:
individuals in X. Test set: individuals in Y. A
contains all individuals in both X and Y.
SNPs
All individuals in population X.
individuals
SNP sample (for all subjects, we are given a
small number of SNPs)
98
FIG. 6
99
(No Transcript)
100
(No Transcript)
101
Keeping both SNPs and individuals
Given a small number of SNPs for all individuals,
and all SNPs for some judiciously chosen
individuals, predict the values of the remaining
SNPs.
SNPs
Basis individuals, JUDICIOUSLY CHOSEN (for a
few subjects, we are given all SNPs)
individuals
SNP sample (for all subjects, we are given a
small number of SNPs)
102
(No Transcript)
103
(No Transcript)
104
Overview (2/2)
  • Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
  • From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
  • Conclusions and Open Problems

105
Conclusions & Open Problems
  • Impose other structural properties in CUR-type
    decompositions
  • Non-negativity
  • Element quantization, e.g., to {0, 1, -1}
  • Block-SVD type structure
  • Robust heuristics and robust extensions
  • Especially for noisy data
  • L1 norm bounds
  • Extension to different statistical learning
    problems
  • Matrix reconstruction
  • Regression
  • Classification, e.g. SVM
  • Clustering

106
Conclusions & Open Problems
  • Relate to traditional numerical linear algebra
  • Gu and Eisenstat - deterministically find
    well-conditioned columns
  • Goreinov and Tyrtyshnikov - volume-maximization
    and conditioning criteria
  • Stewart - backward error analysis
  • Empirical evaluation of different sampling
    probabilities
  • Uniform
  • Non-uniform: norms of rows/columns or of
    left/right singular vectors
  • Others, depending on the problem to be solved?
  • Use CUR and CW^+C^T for improved interpretability
    of data matrices
  • Structural vs. algorithmic issues

107
Workshop on Algorithms for Modern Massive Data
Sets (http://www.stanford.edu/group/mmds/)
@ Stanford University and Yahoo! Research, June
21-24, 2006.
Objective:
- Explore novel
techniques for modeling and analyzing massive,
high-dimensional, and nonlinearly-structured data.
- Bring together computer scientists,
computational and applied mathematicians,
statisticians, and practitioners to promote
cross-fertilization of ideas.
Organizers: G.
Golub, M. W. Mahoney, L-H. Lim, and P.
Drineas.
Sponsors: NSF, Yahoo! Research, Ask!.