
Information Retrieval & Data Mining: A Linear Algebraic Perspective

Petros Drineas, Rensselaer Polytechnic Institute, Computer Science Department

To access my web page: drineas

Modern data

Facts: Computers make it easy to collect and store data. Costs of storage are very low and are dropping very fast (most laptops have a storage capacity of more than 100 GB). When it comes to storing data, the current policy typically is "store everything in case it is needed later" instead of deciding what could be deleted.

Data mining

Data Mining: Extract useful information from the massive amount of available data.

About the tutorial

Tools: Introduce matrix algorithms and matrix decompositions for data mining and information retrieval applications. Goal: Learn a model for the underlying physical system generating the dataset.


[Diagram: data, mathematics, algorithms]

Math is necessary to design and analyze principled algorithmic techniques to data-mine the massive datasets that have become ubiquitous in scientific research.


Why linear (or multilinear) algebra?

Data are represented by matrices: numerous modern datasets are in matrix form. Data are represented by tensors: data in the form of tensors (multi-mode arrays) have become very common in the data mining and information retrieval literature in the last few years. Linear algebra (and numerical analysis) provides the fundamental mathematical and algorithmic tools to deal with matrix and tensor computations. (This tutorial will focus on matrices; pointers to some tensor decompositions will be provided.)

Why matrix decompositions?

- Matrix decompositions (e.g., SVD, QR, SDD, CX and CUR, NMF, MMMF, etc.) use the relationships between the available data in order to identify components of the underlying physical system generating the data.
- Some assumptions on the relationships between the underlying components are necessary.
- Very active area of research: some matrix decompositions are more than a century old, whereas others are very recent.

Overview

- Datasets in the form of matrices (and tensors)
- Matrix Decompositions
- Singular Value Decomposition (SVD)
- Column-based Decompositions (CX, interpolative decomposition)
- CUR-type decompositions
- Non-negative matrix factorization
- Semi-Discrete Decomposition (SDD)
- Maximum-Margin Matrix Factorization (MMMF)
- Tensor decompositions
- Regression
- Coreset constructions
- Fast algorithms for least-squares regression

Datasets in the form of matrices

We are given m objects and n features describing the objects (each object has n numeric values describing it). Dataset: an m-by-n matrix A, where Aij shows the importance of feature j for object i. Every row of A represents an object. Goal: we seek to understand the structure of the data, e.g., the underlying process generating the data.

Market basket matrices

An m-by-n matrix: m customers (rows) by n products (columns; e.g., milk, bread, wine, etc.), with Aij = quantity of the j-th product purchased by the i-th customer. This is a common representation for association rule mining.

- Data mining tasks
- Find association rules, e.g., customers who buy product x buy product y with probability 89%.
- Such rules are used to make item display decisions, advertising decisions, etc.

Social networks (e-mail graph)

An n-by-n matrix over n users, with Aij = number of emails exchanged between users i and j during a certain time period. Represents the email communications between groups of users.

- Data mining tasks
- Cluster the users
- Identify dense networks of users (dense subgraphs)

Document-term matrices

A collection of documents is represented by an m-by-n matrix (bag-of-words model): m documents (rows) by n terms/words (columns), with Aij = frequency of the j-th term in the i-th document.

- Data mining tasks
- Cluster or classify documents
- Find nearest neighbors
- Feature selection: find a subset of terms that (accurately) clusters or classifies documents.


Example later
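Building such a bag-of-words matrix takes only a few lines; a minimal numpy sketch over a made-up three-document corpus (the documents are illustrative, not from the tutorial):

```python
import numpy as np

# Toy corpus (hypothetical documents for illustration).
docs = ["data mining finds patterns in data",
        "matrix decompositions for data mining",
        "web search ranks pages"]

# Vocabulary: one column per distinct term.
terms = sorted({t for d in docs for t in d.split()})

# A[i, j] = frequency of the j-th term in the i-th document.
A = np.array([[d.split().count(t) for t in terms] for d in docs])

assert A.shape == (3, len(terms))
assert A[0, terms.index("data")] == 2   # "data" appears twice in doc 0
```

Real systems would add stemming, stop-word removal, and tf-idf weighting on top of the raw counts.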

Recommendation systems

The m-by-n matrix A represents m customers and n products, with Aij = utility of the j-th product to the i-th customer.

Data mining task: given a few samples from A, recommend high-utility products to customers.

Biology: microarray data

Microarray Data: rows = genes (approx. 5,500); columns = 46 soft-tissue tumour specimens (different types of cancer, e.g., LIPO, LEIO, GIST, MFH, etc.).

Task: pick a subset of genes (if it exists) that suffices in order to identify the cancer type of a patient.

Nielsen et al., Lancet, 2002


Example later


Human genetics

Single Nucleotide Polymorphisms (SNPs): the most common type of genetic variation in the genome across different individuals. They are known locations in the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T).

[SNP genotype table: rows = individuals, columns = SNPs; entries are genotypes such as AG, CT, GG, TT, ...]

Matrices including hundreds of individuals and more than 300,000 SNPs are publicly available. Task: split the individuals into different clusters depending on their ancestry, and find a small subset of genetic markers that are ancestry-informative.


Example later

Tensors: recommendation systems

- Economics
- Utility is an ordinal and not a cardinal concept.
- Compare products; don't assign utility values.
- Recommendation Model Revisited
- Every customer has an n-by-n matrix (whose entries are +1/-1) representing pair-wise product comparisons.
- There are m such matrices, forming an n-by-n-by-m 3-mode tensor A.

Tensors: hyperspectral images

Spectrally resolved images may be viewed as a tensor.

Task: identify and analyze regions of significance in the images.

Overview

- Datasets in the form of matrices (and tensors)
- Matrix Decompositions
- Singular Value Decomposition (SVD)
- Column-based Decompositions (CX, interpolative decomposition)
- CUR-type decompositions
- Non-negative matrix factorization
- Semi-Discrete Decomposition (SDD)
- Maximum-Margin Matrix Factorization (MMMF)
- Tensor decompositions
- Regression
- Coreset constructions
- Fast algorithms for least-squares regression

The Singular Value Decomposition (SVD)

Recall: data matrices have m rows (one for each object) and n columns (one for each feature). Matrix rows are points (vectors) in a Euclidean space; e.g., given 2 objects (x, d), each described with respect to two features, we get a 2-by-2 matrix. Two objects are "close" if the angle between their corresponding vectors is small.

SVD, intuition

Let the blue circles represent m data points in a 2-D Euclidean space. Then, the SVD of the m-by-2 matrix of the data will return the singular vectors and the corresponding singular values: σ1 measures how much of the data variance is explained by the first singular vector; σ2 measures how much of the data variance is explained by the second singular vector.

SVD, formal definition

A = U Σ V^T, where ρ = rank of A; U (V) is an orthogonal matrix containing the left (right) singular vectors of A; Σ is a diagonal matrix containing the singular values of A.
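In code, the decomposition is a single library call; a quick numpy sketch on a small hypothetical matrix, checking the defining properties:

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 2.0, 0.0]])

# Thin SVD: U is m-by-r, s holds the singular values (descending), Vt is r-by-n.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.allclose(U @ np.diag(s) @ Vt, A)   # exact reconstruction A = U Sigma V^T
assert np.all(s[:-1] >= s[1:])               # singular values sorted descending
assert np.allclose(U.T @ U, np.eye(3))       # orthonormal left singular vectors
```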

Rank-k approximations via the SVD

[Diagram: A = U Σ V^T, with rows of A indexed by objects and columns by features; the top k singular directions are labeled "significant" and the remaining ones "noise".]

Rank-k approximations (Ak)

Ak = Uk Σk Vk^T, where Uk (Vk) is the orthogonal matrix containing the top k left (right) singular vectors of A, and Σk is the diagonal matrix containing the top k singular values of A.
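A rank-k approximation is obtained by truncating the three factors; a short sketch (the helper name `rank_k_approx` is ours):

```python
import numpy as np

def rank_k_approx(A, k):
    """A_k = U_k Sigma_k V_k^T: keep only the top k singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 8))
A2 = rank_k_approx(A, 2)
assert np.linalg.matrix_rank(A2) == 2

# The error is exactly the discarded spectrum: ||A - A_2||_F^2 = sum_{i>2} s_i^2.
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A - A2), np.sqrt((s[2:] ** 2).sum()))
```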

PCA and SVD

Principal Components Analysis (PCA) essentially amounts to the computation of the Singular Value Decomposition (SVD) of a covariance matrix. SVD is the algorithmic tool behind MultiDimensional Scaling (MDS) and Factor Analysis. "SVD is the Rolls-Royce and the Swiss Army Knife of Numerical Linear Algebra." (Dianne O'Leary, MMDS '06)


Ak as an optimization problem

Ak minimizes ||A - Ψ X||_F (the Frobenius norm) over all m-by-k matrices Ψ and k-by-n matrices X. Given Ψ, it is easy to find X from standard least squares. However, the fact that we can find the optimal Ψ is intriguing! Optimal Ψ = Uk; optimal X = Uk^T A.


LSI: Ak for document-term matrices (Berry, Dumais, and O'Brien '92)

Latent Semantic Indexing (LSI): replace A by Ak; apply clustering/classification algorithms on Ak. (A is the m-documents-by-n-terms matrix with Aij = frequency of the j-th term in the i-th document.)

- Pros
- Less storage for small k: O(km + kn) vs. O(mn).
- Improved performance: documents are represented in a "concept space".
- Cons
- Ak destroys sparsity.
- Interpretation is difficult.
- Choosing a good k is tough.
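A minimal LSI sketch on a made-up document-term matrix: project the documents into a k-dimensional concept space and compare them there.

```python
import numpy as np

# Toy document-term matrix: 4 documents, 5 terms (hypothetical counts).
A = np.array([[1, 1, 0, 0, 0],
              [2, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 1, 2]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Each document's coordinates in the k-dimensional concept space
# (equivalently A @ Vt[:k].T).
docs_k = U[:, :k] * s[:k]

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Documents 0 and 1 share terms, so they remain closer after projection
# than the unrelated documents 0 and 2.
assert cos(docs_k[0], docs_k[1]) > cos(docs_k[0], docs_k[2])
```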

Ak and k-means clustering (Drineas, Frieze, Kannan, Vempala, and Vinay '99)

k-means clustering: a standard objective function that measures cluster quality. (It often also denotes an iterative algorithm that attempts to optimize the k-means objective function.)

k-means objective: Input: a set of m points in Rn and a positive integer k. Output: a partition of the m points into k clusters. Partition the m points into k clusters in order to minimize the sum of the squared Euclidean distances from each point to its cluster centroid.


k-means, cont'd

We seek to split the input points in 5 clusters. The cluster centroid is the average of all the points in the cluster.


k-means: a matrix formulation

Let A be the m-by-n matrix representing m points in Rn. Then, we seek to minimize ||A - X X^T A||_F over the m-points-by-k-clusters membership matrices X, where Xij denotes whether the i-th point belongs to the j-th cluster.

- Columns of X are normalized to have unit length. (We divide each column by the square root of the number of points in the cluster.)
- Every row of X has at most one non-zero element. (Each point belongs to at most one cluster.)
- X is an orthogonal matrix, i.e., X^T X = I.
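The membership matrix X above can be constructed directly; a small sketch with a hypothetical assignment of 5 points to 2 clusters:

```python
import numpy as np

# Hypothetical cluster assignment of 5 points to k = 2 clusters.
labels = np.array([0, 0, 1, 1, 1])
m, k = len(labels), 2

# X[i, j] = 1/sqrt(|cluster j|) if point i belongs to cluster j, else 0
# (each column normalized to unit length).
X = np.zeros((m, k))
for j in range(k):
    members = labels == j
    X[members, j] = 1.0 / np.sqrt(members.sum())

# X has orthonormal columns (X^T X = I) and one non-zero per row.
assert np.allclose(X.T @ X, np.eye(k))
assert (np.count_nonzero(X, axis=1) <= 1).all()
```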


SVD and k-means

If we only require that X is an orthogonal matrix and remove the condition on the number of non-zero entries per row of X, then ||A - X X^T A||_F is easy to minimize! The solution is X = Uk.

- Using SVD to solve k-means:
- We can get a 2-approximation algorithm for k-means. (Drineas, Frieze, Kannan, Vempala, and Vinay '99, '04)
- We can get heuristic schemes to assign points to clusters. (Zha, He, Ding, Simon, and Gu '01)
- There exist PTAS (based on random projections) for the k-means problem. (Ostrovsky and Rabani '00, '02)
- Deeper connections between SVD and clustering in Kannan, Vempala, and Vetta '00, '04.

Ak and Kleinberg's HITS algorithm (Kleinberg '98, '99)

Hypertext Induced Topic Selection (HITS): a link analysis algorithm that rates Web pages for their authority and hub scores. Authority score: an estimate of the value of the content of the page. Hub score: an estimate of the value of the links from this page to other pages. These values can be used to rank Web search results.

Ak and Kleinberg's HITS algorithm

Authority: a page that is pointed to by many pages with high hub scores. Hub: a page pointing to many pages that are good authorities. This is a recursive definition: notice that each node has two scores.

Ak and Kleinberg's HITS algorithm

Phase 1: Given a query term (e.g., "jaguar"), find all pages containing the query term (root set). Expand the resulting graph by one move forward and backward (base set).

Ak and Kleinberg's HITS algorithm

Phase 2: Let A be the adjacency matrix of the (directed) graph of the base set. Let h, a ∈ Rn be the vectors of hub (authority) scores. Then, h = Aa and a = A^T h, so h = AA^T h and a = A^T A a.


Ak and Kleinberg's HITS algorithm

Thus, the top left (right) singular vector of A corresponds to hub (authority) scores. What about the rest? They provide a natural way to extract additional densely linked collections of hubs and authorities from the base set. See the "jaguar" example in Kleinberg '99.
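The two phases translate into a few lines of numpy; the toy adjacency matrix below is a hypothetical base set, not the jaguar example:

```python
import numpy as np

# Hypothetical base set: A[i, j] = 1 if page i links to page j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 1, 1, 0]], dtype=float)

# Hub/authority scores are the top left/right singular vectors of A.
U, s, Vt = np.linalg.svd(A)
hubs = np.abs(U[:, 0])        # h solves h = c * A A^T h
authorities = np.abs(Vt[0])   # a solves a = c * A^T A a

# Page 2 receives the most links, so it gets the top authority score;
# pages 0 and 3 link to both linked-to pages, so they are the top hubs.
assert authorities.argmax() == 2
assert hubs.argmax() in (0, 3)
```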

SVD example: microarray data

Microarray Data (Nielsen et al., Lancet, 2002): columns = genes (approx. 5,500); rows = 32 patients, three different cancer types (GIST, LEIO, SynSarc).

SVD example: microarray data

Applying k-means with k = 3 in this 3-D space results in 3 misclassifications. Applying k-means with k = 3 but retaining 4 PCs results in one misclassification. Can we find actual genes (as opposed to "eigengenes") that achieve similar results?

SVD example: ancestry-informative SNPs

Single Nucleotide Polymorphisms (SNPs): the most common type of genetic variation in the genome across different individuals. They are known locations in the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T).

[SNP genotype table: rows = individuals, columns = SNPs]

There are approx. 10 million SNPs in the human genome, so this table could have 10 million columns.

Two copies of a chromosome (father, mother)

An individual could be:
- Heterozygotic (in our study, CT = TC): encode as 0
- Homozygotic at the first allele, e.g., CC: encode as +1
- Homozygotic at the second allele, e.g., TT: encode as -1
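The encoding can be written down directly; `encode_genotype` and its arguments are illustrative names, assuming the two alleles of each SNP are known:

```python
def encode_genotype(genotype, allele1, allele2):
    """Encode a two-character SNP genotype as 0 / +1 / -1 (the scheme above)."""
    a, b = genotype[0], genotype[1]
    if a != b:           # heterozygotic, e.g., CT or TC
        return 0
    if a == allele1:     # homozygotic at the first allele, e.g., CC
        return 1
    return -1            # homozygotic at the second allele, e.g., TT

# One individual's row for a SNP with alleles C and T:
row = [encode_genotype(g, "C", "T") for g in ["CC", "CT", "TC", "TT"]]
assert row == [1, 0, 0, -1]
```

Applying this per SNP turns the genotype table into the real-valued matrix that SVD/PCA operates on.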


(a) Why are SNPs really important?

Association studies: locating causative genes for common complex disorders (e.g., diabetes, heart disease, etc.) is based on identifying association between affection status and known SNPs. No prior knowledge about the function of the gene(s) or the etiology of the disorder is necessary.

The subsequent investigation of candidate genes that are in physical proximity with the associated SNPs is the first step towards understanding the etiological pathway of a disorder and designing a drug.

(b) Why are SNPs really important?

Among different populations (e.g., European, Asian, African, etc.), different patterns of SNP allele frequencies or SNP correlations are often observed.

Understanding such differences is crucial in order to develop the next generation of drugs that will be population-specific (eventually genome-specific) and not just disease-specific.

The HapMap project

- Mapping the whole genome sequence of a single individual is very expensive.
- Mapping all the SNPs is also quite expensive, but the costs are dropping fast.

HapMap project ($130,000,000 in funding from NIH and other sources): map approx. 4 million SNPs for 270 individuals from 4 different populations (YRI, CEU, CHB, JPT), in order to create a genetic map to be used by researchers.

Also, funding from pharmaceutical companies, NSF, the Department of Justice, etc. Is it possible to identify the ethnicity of a suspect from his DNA?

CHB and JPT

- Let A be the 90-by-2.7-million matrix of the CHB and JPT populations in HapMap.
- Run SVD (PCA) on A, keep the two (left) singular vectors, and plot the results.
- Run a (naïve, e.g., k-means) clustering algorithm to split the data points into two clusters.

Paschou, Ziv, Burchard, Mahoney, and Drineas, to appear in PLoS Genetics '07 (data from E. Ziv and E. Burchard, UCSF)

Paschou, Mahoney, Javed, Kidd, Pakstis, Gu, Kidd, and Drineas, Genome Research '07 (data from K. Kidd, Yale University)


EigenSNPs cannot be assayed

Not altogether satisfactory: the (top two left) singular vectors are linear combinations of all SNPs, and of course cannot be assayed! Can we find actual SNPs that capture the information in the (top two left) singular vectors (e.g., spanning the same subspace)? We will get back to this later.

Overview

- Datasets in the form of matrices (and tensors)
- Matrix Decompositions
- Singular Value Decomposition (SVD)
- Column-based Decompositions (CX, interpolative decomposition)
- CUR-type decompositions
- Non-negative matrix factorization
- Semi-Discrete Decomposition (SDD)
- Maximum-Margin Matrix Factorization (MMMF)
- Tensor decompositions
- Regression
- Coreset constructions
- Fast algorithms for least-squares regression

CX decomposition

Constrain Ψ to contain exactly k columns of A. Notation: replace Ψ by C(olumns), so A ≈ CX. It is easy to prove that the optimal X = C^+ A. (C^+ is the Moore-Penrose pseudoinverse of C.) This is also called an interpolative approximation (some extra conditions on the elements of X are required).
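A quick numpy sketch of the CX factorization on a random matrix (the chosen column indices are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 5))

# Keep two columns of A (indices chosen arbitrarily for illustration).
C = A[:, [0, 3]]

# Optimal coefficient matrix: X = C^+ A (Moore-Penrose pseudoinverse).
X = np.linalg.pinv(C) @ A

# C X = C C^+ A projects A onto span(C); the projector P_C is idempotent,
P = C @ np.linalg.pinv(C)
assert np.allclose(P @ P, P)
# and the selected columns themselves are reproduced exactly.
assert np.allclose(C @ X[:, [0, 3]], A[:, [0, 3]])
```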

CX decomposition

Why? If A is an object-feature matrix, then selecting representative columns is equivalent to selecting representative features. This leads to easier interpretability compared to "eigenfeatures", which are linear combinations of all features.

Column Subset Selection problem (CSS)

Given an m-by-n matrix A, find k columns of A forming an m-by-k matrix C that minimizes ||A - CC^+A|| over all O(n^k) choices for C.

Column Subset Selection problem (CSS)

C^+: the pseudoinverse of C, easily computed via the SVD of C. (If C = U Σ V^T, then C^+ = V Σ^-1 U^T.) P_C = CC^+ is the projector matrix onto the subspace spanned by the columns of C.

Column Subset Selection problem (CSS)

Complexity of the problem? O(n^k mn) trivially works. The problem is NP-hard if k grows as a function of n. (NP-hardness shown in Civril & Magdon-Ismail '07)
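The "trivially works" claim is easy to demonstrate at toy scale: exhaustive enumeration over all column subsets, here with the Frobenius norm (a sketch, feasible only for tiny n and k):

```python
import itertools
import numpy as np

def best_k_columns(A, k):
    """Brute-force CSS: try all O(n^k) subsets, minimize ||A - C C^+ A||_F."""
    best, best_err = None, np.inf
    for cols in itertools.combinations(range(A.shape[1]), k):
        C = A[:, cols]
        err = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A)  # Frobenius norm
        if err < best_err:
            best, best_err = cols, err
    return best, best_err

A = np.random.default_rng(3).normal(size=(8, 6))
cols, err = best_k_columns(A, 2)

# Sanity check against the SVD lower bound: err >= ||A - A_2||_F.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert err >= np.linalg.norm(A - U[:, :2] @ np.diag(s[:2]) @ Vt[:2]) - 1e-9
```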

Spectral norm

Given an m-by-n matrix A, find k columns of A forming an m-by-k matrix C such that ||A - CC^+A||_2 is minimized over all O(n^k) possible choices for C.

- Remarks
- P_C A = CC^+A is the projection of A on the subspace spanned by the columns of C.
- The spectral (or 2-) norm of an m-by-n matrix X is ||X||_2 = max over unit vectors y of ||Xy||_2.

A lower bound for the CSS problem

For any m-by-k matrix C consisting of at most k columns of A:

||A - CC^+A||_2 >= ||A - Ak||_2

- Remarks
- This is also true if we replace the spectral norm by the Frobenius norm.
- This is a potentially weak lower bound.

Prior work: numerical linear algebra

- Numerical Linear Algebra algorithms for CSS
- Deterministic, typically greedy approaches.
- Deep connection with the Rank Revealing QR factorization.
- Strongest results so far (spectral norm): in O(mn^2) time, ||A - CC^+A||_2 <= p(k,n) ||A - Ak||_2, for some function p(k,n).

Prior work: numerical linear algebra

- Numerical Linear Algebra algorithms for CSS
- Deterministic, typically greedy approaches.
- Deep connection with the Rank Revealing QR factorization.
- Strongest results so far (Frobenius norm): in O(n^k) time.

Working on p(k,n): from 1965 until today.

Prior work: theoretical computer science

- Theoretical Computer Science algorithms for CSS
- Randomized approaches, with some failure probability.
- More than k columns are picked, e.g., O(poly(k)) columns.
- Very strong bounds for the Frobenius norm in low polynomial time.
- Not many spectral norm bounds.

The strongest Frobenius norm bound

Given an m-by-n matrix A, there exists an O(mn^2) algorithm that picks at most O(k log k / ε^2) columns of A such that, with probability at least 1 - 10^-20,

||A - CC^+A||_F <= (1 + ε) ||A - Ak||_F.

The CX algorithm

Input: an m-by-n matrix A, and 0 < ε < 1, the desired accuracy.
Output: C, the matrix consisting of the selected columns.

- CX algorithm
- Compute probabilities pj summing to 1.
- Let c = O(k log k / ε^2).
- For each j = 1, 2, ..., n, pick the j-th column of A with probability min{1, c·pj}.
- Let C be the matrix consisting of the chosen columns. (C has in expectation at most c columns.)

Subspace sampling (Frobenius norm)

Vk: orthogonal matrix containing the top k right singular vectors of A. Σk: diagonal matrix containing the top k singular values of A.

Remark: the rows of Vk^T are orthonormal vectors, but its columns (Vk^T)^(i) are not.


Subspace sampling in O(mn^2) time: pj = ||(Vk^T)^(j)||^2 / k, normalized so that the pj sum up to 1.
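The sampling step can be sketched as follows; `subspace_probs` and `cx_sample` are illustrative names, and the constants are placeholders rather than the theoretically required values:

```python
import numpy as np

def subspace_probs(A, k):
    """Subspace sampling probabilities: p_j = ||(V_k^T)^(j)||^2 / k, summing to 1."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return (Vt[:k] ** 2).sum(axis=0) / k

def cx_sample(A, k, c, rng):
    """Keep column j independently with probability min(1, c * p_j)."""
    p = subspace_probs(A, k)
    keep = rng.random(A.shape[1]) < np.minimum(1.0, c * p)
    return A[:, keep]

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 12))
p = subspace_probs(A, k=3)
assert np.isclose(p.sum(), 1.0)   # valid probability distribution
C = cx_sample(A, k=3, c=8, rng=rng)
assert C.shape[0] == 30           # C keeps all m rows
```

The probabilities sum to 1 because the k rows of Vk^T are orthonormal, so the squared entries sum to k.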

Prior work in TCS

- Drineas, Mahoney, and Muthukrishnan 2005: O(mn^2) time, O(k^2/ε^2) columns.
- Drineas, Mahoney, and Muthukrishnan 2006: O(mn^2) time, O(k log k/ε^2) columns.
- Deshpande and Vempala 2006: O(mnk^2) time and O(k^2 log k/ε^2) columns. They also prove the existence of k columns of A forming a matrix C with a comparable Frobenius norm bound; compare to the prior best existence result.

Open problems

- Design:
- Faster algorithms (next slide)
- Algorithms that achieve better approximation guarantees (a hybrid approach)

Prior work spanning NLA and TCS

- Woolfe, Liberty, Rokhlin, and Tygert 2007 (also Martinsson, Rokhlin, and Tygert 2006): O(mn log k) time, k columns, same spectral norm bounds as prior work.
- A beautiful application of the Fast Johnson-Lindenstrauss transform of Ailon and Chazelle.


A hybrid approach (Boutsidis, Mahoney, and Drineas '07)

- Given an m-by-n matrix A:
- (Randomized phase) Run a randomized algorithm to pick c = O(k log k) columns.
- (Deterministic phase) Run a deterministic algorithm on the above columns to pick exactly k columns of A and form an m-by-k matrix C.

Not so simple...

Our algorithm runs in O(mn^2) time and satisfies, with probability at least 1 - 10^-20, the bounds on the following slides.

Comparison: Frobenius norm

Our algorithm runs in O(mn^2) time and satisfies, with probability at least 1 - 10^-20:

- We provide an efficient algorithmic result.
- We guarantee a Frobenius norm bound that is at most (k log k)^(1/2) worse than the best known existential result.

Comparison: spectral norm

Our algorithm runs in O(mn^2) time and satisfies, with probability at least 1 - 10^-20:

- Our running time is comparable with NLA algorithms for this problem.
- Our spectral norm bound grows as a function of (n-k)^(1/4) instead of (n-k)^(1/2)!
- Do notice that with respect to k our bound is k^(1/4) log^(1/2) k worse than previous work.
- To the best of our knowledge, our result is the first asymptotic improvement of the work of Gu & Eisenstat 1996.

Randomized phase: O(k log k) columns

- Randomized phase: c = O(k log k)
- Compute probabilities pj summing to 1.
- For each j = 1, 2, ..., n, pick the j-th column of A with probability min{1, c·pj}.
- Let C be the matrix consisting of the chosen columns. (C has in expectation at most c columns.)

Subspace sampling

V_k: orthogonal matrix containing the top k right singular vectors of A.
Σ_k: diagonal matrix containing the top k singular values of A.

Remark: We need more elaborate subspace sampling probabilities than previous work.

Subspace sampling in O(mn^2) time

Normalization s.t. the p_j sum up to 1

Deterministic phase: k columns

- Deterministic phase
- Let S1 be the set of indices of the columns selected by the randomized phase.
- Let (V_k^T)_{S1} denote the set of columns of V_k^T with indices in S1.
- (An extra technicality is that the columns of (V_k^T)_{S1} must be rescaled.)
- Run a deterministic NLA algorithm on (V_k^T)_{S1} to select exactly k columns.
- (Any algorithm with p(k,n) = k^{1/2}(n-k)^{1/2} will do.)
- Let S2 be the set of indices of the selected columns (the cardinality of S2 is exactly k).
- Return A_{S2} (the columns of A corresponding to indices in S2) as the final output.
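The two phases above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact algorithm: the function names are invented, a greedy column-pivoted Gram-Schmidt stands in for the deterministic NLA routine, and the fallback for undersized samples is an assumption added for robustness.

```python
import numpy as np

def pivoted_qr_columns(V, k):
    """Greedy column-pivoted Gram-Schmidt: select k columns of V.
    (A simple stand-in for the deterministic NLA routine the slides assume.)"""
    V = V.astype(float).copy()
    chosen = []
    for _ in range(k):
        j = int(np.argmax(np.linalg.norm(V, axis=0)))
        chosen.append(j)
        q = V[:, j] / np.linalg.norm(V[:, j])
        V -= np.outer(q, q @ V)  # deflate the chosen direction
    return chosen

def hybrid_column_select(A, k, c, rng=np.random.default_rng(0)):
    """Two-phase (randomized + deterministic) column selection sketch."""
    # Subspace sampling probabilities from the top-k right singular vectors.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]
    p = np.sum(Vk**2, axis=0) / k           # leverage-score probabilities, sum to 1
    q = np.minimum(1.0, c * p)
    # Randomized phase: keep column j with probability min{1, c*p_j}.
    S1 = np.flatnonzero(rng.random(A.shape[1]) < q)
    if S1.size < k:                          # degenerate fallback: top-probability columns
        S1 = np.argsort(-q)[:k]
    VkS1 = Vk[:, S1] / np.sqrt(q[S1])        # rescaled sampled columns of Vk^T
    # Deterministic phase: pick exactly k of the sampled columns.
    S2 = S1[pivoted_qr_columns(VkS1, k)]
    return A[:, S2], S2
```

The output C = A[:, S2] is the m-by-k matrix of actual columns of A.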

Back to SNPs: CHB and JPT

Let A be the 90-by-2.7-million matrix of the CHB and JPT populations in HapMap.

Can we find actual SNPs that capture the information in the top two left singular vectors?

Results

Number of SNPs    Misclassifications
40 (c = 400)      6
50 (c = 500)      5
60 (c = 600)      3
70 (c = 700)      1

- Essentially as good as the best existing metric (informativeness).
- However, our metric is unsupervised!
- (Informativeness is supervised: it essentially identifies SNPs that are correlated with population membership, given such membership information.)
- The fact that we can select ancestry-informative SNPs in an unsupervised manner based on PCA is novel, and seems interesting.

Overview

- Datasets in the form of matrices (and tensors)
- Matrix Decompositions
- Singular Value Decomposition (SVD)
- Column-based Decompositions (CX, interpolative decomposition)
- CUR-type decompositions
- Non-negative matrix factorization
- Semi-Discrete Decomposition (SDD)
- Maximum-Margin Matrix Factorization (MMMF)
- Tensor decompositions
- Regression
- Coreset constructions
- Fast algorithms for least-squares regression

CUR-type decompositions

For any matrix A, we can find C, U and R such that the norm of A - CUR is almost equal to the norm of A - A_k. This might lead to a better understanding of the data.

Theorem: relative error CUR (Drineas, Mahoney, Muthukrishnan '06, '07)

For any k, O(mn^2) time suffices to construct C, U, and R s.t. the relative-error guarantee holds with probability at least 1-δ, by picking O(k log k log(1/δ) / ε^2) columns and O(k log^2 k log(1/δ) / ε^6) rows.

From SVD to CUR

Exploit structural properties of CUR to analyze data.

(Figure: an m-objects-by-n-features data matrix.)

A CUR-type decomposition needs O(min{mn^2, m^2 n}) time.

- Instead of reifying the Principal Components:
- Use PCA (a.k.a. SVD) to find how many Principal Components are needed to explain the data.
- Run CUR and pick columns/rows instead of eigen-columns and eigen-rows!
- Assign meaning to actual columns/rows of the matrix! Much more intuitive! Sparse!
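Once column and row indices are in hand, forming a CUR approximation is a one-liner. The choice U = C⁺AR⁺ below is a common, simple construction for the middle matrix, not necessarily the U used by the theorem above; the function name and index arguments are illustrative.

```python
import numpy as np

def cur_decomposition(A, col_idx, row_idx):
    """CUR sketch: C = chosen columns, R = chosen rows, U = C^+ A R^+.
    If C spans the column space of A and R spans its row space,
    C @ U @ R reproduces A exactly."""
    C = A[:, col_idx]
    R = A[row_idx, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R
```

For a rank-k matrix, any k (or more) generic columns and rows give an exact reconstruction, which is what makes CUR interpretable: the factors are actual data columns and rows.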

CUR decompositions: a summary

G.W. Stewart (Num. Math. '99, TR '04): C: variant of the QR algorithm; R: variant of the QR algorithm; U: minimizes ||A-CUR||_F. No a priori bounds; solid experimental performance.

Goreinov, Tyrtyshnikov, and Zamarashkin (LAA '97, Cont. Math. '01): C: columns that span max volume; U: W^+; R: rows that span max volume. Existential result; error bounds depend on ||W^+||_2; spectral norm bounds!

Williams and Seeger (NIPS '01): C: uniformly at random; U: W^+; R: uniformly at random. Experimental evaluation; A is assumed PSD; connections to the Nystrom method.

Drineas, Kannan, and Mahoney (SODA '03, '04): C: w.r.t. column lengths; U: in linear/constant time; R: w.r.t. row lengths. Randomized algorithm; provable, a priori bounds; explicit dependency on A - A_k.

Drineas, Mahoney, and Muthu ('05, '06): C: depends on singular vectors of A; U: (almost) W^+; R: depends on singular vectors of C. (1+ε) approximation to A - A_k; computable in SVD_k(A) time.

Data applications of CUR

CMD factorization (Sun, Xie, Zhang, and Faloutsos '07, best paper award at the SIAM Conference on Data Mining '07): A CUR-type decomposition that avoids duplicate rows/columns that might appear in some earlier versions of CUR-type decompositions. Many interesting applications to large network datasets, DBLP, etc.; extensions to tensors.

Fast computation of Fourier Integral Operators (Demanet, Candes, and Ying '06): Application to seismology imaging data (petabytes of data can be generated). The problem boils down to solving integral equations, i.e., matrix equations after discretization. CUR-type structures appear; uniform sampling seems to work well in practice.


Decompositions that respect the data

Non-negative matrix factorization (Lee and Seung '00): Assume that the A_ij are non-negative for all i, j.

The Non-negative Matrix Factorization

Non-negative matrix factorization (Lee and Seung '00): Assume that the A_ij are non-negative for all i, j. Constrain both factors of the approximation A ≈ WX to have only non-negative entries as well. This should respect the structure of the data better than A_k = U_k Σ_k V_k^T, which introduces a lot of (difficult to interpret) negative entries.

The Non-negative Matrix Factorization

- It has been extensively applied to:
- Image mining (Lee and Seung '00)
- The Enron email collection (Berry and Brown '05)
- Other text mining tasks (Berry and Plemmons '04)

- Algorithms for NMF:
- Multiplicative update rules (Lee and Seung '00, Hoyer '02)
- Gradient descent (Hoyer '04, Berry and Plemmons '04)
- Alternating least squares (dating back to Paatero '94)
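The multiplicative update rules are short enough to sketch in full. This is a minimal version of the Lee-Seung updates for the Frobenius-norm objective, assuming the A ≈ WX notation from above; the small epsilon and the initialization offset are numerical conveniences, not part of the published rules.

```python
import numpy as np

def nmf_multiplicative(A, k, iters=500, rng=np.random.default_rng(0)):
    """Lee-Seung multiplicative updates minimizing ||A - W X||_F
    with W >= 0, X >= 0 (sketch: no convergence tuning, no sparsity)."""
    m, n = A.shape
    W = rng.random((m, k)) + 0.1   # positive random init
    X = rng.random((k, n)) + 0.1
    eps = 1e-12                    # avoid division by zero
    for _ in range(iters):
        # Each update multiplies by a non-negative ratio, so
        # non-negativity is preserved automatically.
        X *= (W.T @ A) / (W.T @ W @ X + eps)
        W *= (A @ X.T) / (W @ X @ X.T + eps)
    return W, X
```

As the slides note, each update is a fixed-point iteration: it can stall at a local optimum or saddle point, and different initializations give different factorizations.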

Algorithmic challenges for NMF

- Algorithmic challenges for the NMF:
- NMF (as stated above) is convex given W or X, but not if both are unknown.
- No unique solution: many matrices W and X minimize the error.
- Other optimization objectives could be chosen (e.g., spectral norm, etc.).
- NMF becomes harder if sparsity constraints are included (e.g., X has a small number of non-zeros).
- For the multiplicative update rules there exists some theory proving that they converge to a fixed point; this might be a local optimum or a saddle point.
- Little theory is known for the other algorithms.


SemiDiscrete Decomposition (SDD)

D_k: diagonal matrix. X_k, Y_k: all entries are in {-1, 0, 1}. SDD identifies regions of the matrix that have homogeneous density.

A_SDD = X_k D_k Y_k^T

SemiDiscrete Decomposition (SDD)

SDD looks for blocks of similar height (towers) and similar depth (holes): bump hunting. Applications include image compression and text mining.

O'Leary and Peleg '83, Kolda and O'Leary '98, '00, O'Leary and Roth '06. The figures are from D. Skillicorn's book on Data Mining with Matrix Decompositions.


Collaborative Filtering and MMMF

User ratings for movies. Goal: predict unrated movies (?).

Maximum Margin Matrix Factorization (MMMF): A novel, semi-definite programming based matrix decomposition that seems to perform very well on real data, including the Netflix challenge. Srebro, Rennie, and Jaakkola '04; Rennie and Srebro '05. Some pictures are from Srebro's presentation at NIPS '04.

A linear factor model

(Figure: user biases for different movie attributes, across all users.)

(Possible) solution to collaborative filtering: fit a rank (exactly) k matrix X to Y. Fully observed Y ⇒ X is the best rank-k approximation to Y. Azar, Fiat, Karlin, McSherry, and Saia '01; Drineas, Kerenidis, and Raghavan '02.

Imputing the missing entries via SVD (Achlioptas and McSherry '01, '06)

- Reconstruction Algorithm:
- Compute the SVD of the matrix, filling in the missing entries with zeros.
- Some rescaling prior to computing the SVD is necessary, e.g., multiply by 1/(fraction of observed entries).
- Keep the resulting top k principal components.

Under assumptions on the quality of the observed entries, reconstruction accuracy bounds may be proven. The error bounds scale with the Frobenius norm of the matrix.
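The three steps of the reconstruction algorithm translate directly into code. A minimal sketch, assuming a boolean observation mask; the function name and mask convention are illustrative.

```python
import numpy as np

def svd_impute(A_observed, mask, k):
    """Achlioptas-McSherry-style reconstruction sketch:
    zero-fill missing entries, rescale by 1/(fraction observed),
    then keep the top-k part of the SVD."""
    frac = mask.mean()                           # fraction of observed entries
    A_filled = np.where(mask, A_observed, 0.0) / frac
    U, s, Vt = np.linalg.svd(A_filled, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]        # best rank-k approximation
```

The rescaling makes the zero-filled matrix an unbiased estimate of the full matrix, entrywise, which is what the quoted error bounds rely on.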

A convex formulation

- MMMF:
- Focus on ±1 rankings (for simplicity).
- Fit a prediction matrix X = UV^T to the observations.
- Objectives (CONVEX!):
- Minimize the total number of mismatches between the observed data and the predicted data.
- Keep the trace norm of X small.

MMMF and SDP

MMMF: This may be formulated as a semi-definite program, and thus may be solved efficiently.

Bounding the factor contribution

MMMF: Instead of a hard rank constraint (non-convex), a softer constraint is introduced. The total number of contributing factors (the number of columns/rows in U/V^T) is unbounded, but their total contribution is bounded.
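The trace norm is just the sum of the singular values, and "bounding the total contribution" corresponds to shrinking that sum rather than capping the number of factors. The sketch below shows the trace norm and the singular-value soft-thresholding step that generic trace-norm-regularized solvers use; it is not the SDP formulation from the slides, and the function names are illustrative.

```python
import numpy as np

def trace_norm(X):
    """Trace (nuclear) norm: the sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()

def shrink_trace_norm(X, tau):
    """Singular value soft-thresholding: shrink every singular value by tau
    (clipping at zero). The number of nonzero factors is not fixed in
    advance; only their total weight is reduced."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

This is the proximal operator of the trace norm: the convex surrogate for a hard rank constraint, exactly the "softer constraint" the slide describes.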

Tensors

- Tensors appear both in Math and CS.
- Connections to complexity theory (i.e., matrix multiplication complexity)
- Data set applications (i.e., Independent Component Analysis, higher order statistics, etc.)
- Also, many practical applications, e.g., Medical Imaging, Hyperspectral Imaging, video, Psychology, Chemometrics, etc.

However, there does not exist a definition of tensor rank (and an associated tensor SVD) with the nice properties found in the matrix case.

Tensor rank

A definition of tensor rank: Given a tensor, find the minimum number of rank-one tensors into which it can be decomposed.

- only weak bounds are known
- tensor rank depends on the underlying ring of scalars
- computing it is NP-hard
- successive rank-one approximations are no good

Tensor decompositions

- Many tensor decompositions matricize the tensor
- PARAFAC, Tucker, Higher-Order SVD, DEDICOM, etc.
- Most are computed via iterative algorithms (e.g., alternating least squares).

Given a tensor, unfold it to create the unfolded matrix.
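Unfolding (matricization) is a reshape: the mode-n fibers of the tensor become the columns of a matrix. A minimal NumPy sketch; the function name is illustrative.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricization: bring axis `mode` to the front, then flatten
    the remaining axes, so the result is (T.shape[mode]) x (prod of the rest)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
```

For example, a 3 x 4 x 5 tensor unfolds to a 3 x 20, 4 x 15, or 5 x 12 matrix depending on the mode; a rank-one tensor (an outer product) unfolds to a rank-one matrix in every mode, which is what matricization-based decompositions exploit.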

Useful links on tensor decompositions

- Workshop on Algorithms for Modern Massive Data Sets (MMDS) '06
- http://www.stanford.edu/group/mmds/
- Check the tutorial by Lek-Heng Lim on tensor decompositions.
- Tutorial by Faloutsos, Kolda, and Sun at the SIAM Data Mining Conference '07
- Tamara Kolda's web page
- http://csmr.ca.sandia.gov/tgkolda/

Problem definition and motivation

In many applications (e.g., statistical data analysis and scientific computation), one has n observations of the form

Least-norm approximation problems

Recall the linear measurement model. In order to estimate x, solve

Application: data analysis in science

- First application: Astronomy
- Predicting the orbit of the asteroid Ceres (in 1801!).
- Gauss (1809) -- see also Legendre (1805) and Adrain (1808).
- First application of least-squares optimization, and it runs in O(nd^2) time!
- Data analysis: Fit parameters of a biological, chemical, economical, physical (astronomical), social, internet, etc. model to experimental data.

Norms of common interest

Let y = Ax and define the residual r = y - b.

Least-squares approximation

Chebyshev or mini-max approximation

Sum of absolute residuals approximation

Lp norms and their unit balls

Recall the Lp norm for p >= 1.

Lp regression problems

We are interested in over-constrained Lp regression problems, n >> d. Typically, there is no x such that Ax = b. Want to find the best x such that Ax ≈ b. Lp regression problems are convex programs (or better). There exist poly-time algorithms. We want to solve them faster.

Exact solution to L2 regression

Cholesky Decomposition: If A is full rank and well-conditioned, decompose A^T A = R^T R, where R is upper triangular, and solve the normal equations R^T R x = A^T b.

QR Decomposition: Slower but numerically stable, esp. if A is rank-deficient. Write A = QR, and solve Rx = Q^T b.

Singular Value Decomposition: Most expensive, but best if A is very ill-conditioned. Write A = UΣV^T, in which case x_OPT = A^+ b = VΣ^{-1}U^T b.

Complexity is O(nd^2), but constant factors differ.
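The three exact solvers can be compared side by side in a few lines of NumPy. A minimal sketch: all three are O(nd^2) and agree on a well-conditioned problem; the function name is illustrative.

```python
import numpy as np

def solve_l2(A, b):
    """Three exact solvers for min_x ||Ax - b||_2 (full-rank A assumed)."""
    # 1) Normal equations via Cholesky: A^T A = L L^T (L lower triangular).
    L = np.linalg.cholesky(A.T @ A)
    x_chol = np.linalg.solve(L.T, np.linalg.solve(L, A.T @ b))
    # 2) QR: A = QR, then solve the triangular system R x = Q^T b.
    Q, R = np.linalg.qr(A)
    x_qr = np.linalg.solve(R, Q.T @ b)
    # 3) SVD: A = U diag(s) V^T, so x = V diag(1/s) U^T b.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    x_svd = Vt.T @ ((U.T @ b) / s)
    return x_chol, x_qr, x_svd
```

Cholesky squares the condition number (it works on A^T A), QR avoids that, and the SVD additionally handles near-rank-deficiency, which is the trade-off the slide summarizes.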

Questions

Approximation algorithms: Can we approximately solve Lp regression faster than exact methods?

Core-sets (or induced sub-problems): Can we find a small set of constraints such that solving the Lp regression problem on those constraints gives an approximation to the original problem?

Randomized algorithms for Lp regression

Alg. 1 p2 Sampling (core-set) (1?)-approx O(nd2) Drineas, Mahoney, and Muthu 06, 07

Alg. 2 p2 Projection (no core-set) (1?)-approx O(nd logd) Sarlos 06 Drineas, Mahoney, Muthu, and Sarlos 07

Alg. 3 p ? 1,8) Sampling (core-set) (1?)-approx O(nd5) o(exact) DasGupta, Drineas, Harb, Kumar, Mahoney 07

Note Clarkson 05 gets a (1?)-approximation for

L1 regression in O(d3.5/?4) time. He

preprocessed A,b to make it well-rounded or

well-conditioned and then sampled.

Algorithm 1: Sampling for L2 regression

- Algorithm:
- Fix a set of probabilities p_i, i = 1..n, summing up to 1.
- Pick the i-th row of A and the i-th element of b with probability min{1, r·p_i},
- and rescale both by (1/min{1, r·p_i})^{1/2}.
- Solve the induced problem.

Note: in expectation, at most r rows of A and r elements of b are kept.
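The sampling algorithm, together with one standard choice of probabilities (row leverage scores, matching the condition discussed below), can be sketched as follows. The function names and the rank tolerance are illustrative assumptions.

```python
import numpy as np

def leverage_scores(A):
    """p_i = ||U_(i)||^2 / rho, where U holds the left singular vectors
    of A and rho = rank(A); the p_i sum to 1."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    rho = int(np.sum(s > 1e-10 * s[0]))
    return np.sum(U[:, :rho] ** 2, axis=1) / rho

def sample_l2_regression(A, b, r, p, rng=np.random.default_rng(0)):
    """Keep row i with probability min{1, r*p_i}, rescale the kept row of A
    and entry of b by (1/min{1, r*p_i})^{1/2}, then solve the small problem."""
    q = np.minimum(1.0, r * p)
    keep = rng.random(A.shape[0]) < q
    scale = 1.0 / np.sqrt(q[keep])
    x, *_ = np.linalg.lstsq(A[keep] * scale[:, None], b[keep] * scale,
                            rcond=None)
    return x
```

On a consistent (noiseless) system the sampled problem recovers the exact solution whenever the kept rows have full column rank; the theorem quantifies the error in the general case.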

Sampling algorithm for L2 regression

(Figure: the sampled rows of A and the corresponding sampled entries of b.)

Our results for p = 2

If the p_i satisfy a condition, then with probability at least 1-δ,

κ(A): the condition number of A

The sampling complexity is

Notation

U_(i): the i-th row of U
ρ: the rank of A
U: orthogonal matrix containing the left singular vectors of A.

Condition on the probabilities

The condition that the p_i must satisfy is, for some β ∈ (0, 1]:

- Notes:
- O(nd^2) time suffices (to compute the probabilities and to construct a core-set).
- Important question:
- Is O(nd^2) necessary? Can we compute the p_i's, or construct a core-set, faster?

The Johnson-Lindenstrauss lemma

- Results for J-L:
- Johnson and Lindenstrauss '84: project to a random subspace
- Frankl and Maehara '88: random orthogonal matrix
- DasGupta and Gupta '99: matrix with entries from N(0,1), normalized
- Indyk and Motwani '98: matrix with entries from N(0,1)
- Achlioptas '03: matrix with entries in {-1, 0, 1}
- Alon '03: optimal dependency on n, and almost optimal dependency on ε

Fast J-L transform (Ailon and Chazelle '06)

- Multiplication of the vectors by PHD is fast, since:
- (Du) is O(d), since D is diagonal
- (HDu) is O(d log d), using Fast Fourier Transform-like algorithms
- (PHDu) is O(poly(log n)), since P has on average O(poly(log n)) non-zeros per row

O(nd log d) L2 regression

Fact 1: since H_n (the n-by-n Hadamard matrix) and D_n (an n-by-n diagonal matrix with ±1 entries on the diagonal, chosen uniformly at random) are orthogonal matrices, ||H_n D_n A x - H_n D_n b||_2 = ||Ax - b||_2.

Thus, we can work with H_n D_n A x ≈ H_n D_n b. Let's use our sampling approach.


Fact 2: Using a Chernoff-type argument, we can prove that the lengths of all the rows of the left singular vectors of H_n D_n A are, with probability at least 0.9, roughly uniform.

O(nd log d) L2 regression

DONE! We can perform uniform sampling in order to keep r = O(d log d / ε^2) rows of H_n D_n A; our L2 regression theorem guarantees the accuracy of the approximation. The running time is O(nd log d), since we can use the fast Hadamard-Walsh transform to multiply H_n and D_n A.
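The whole pipeline (random signs D, fast Hadamard transform H, uniform row sampling) fits in a short sketch. This assumes n is a power of two so the Hadamard matrix exists, and the function names are illustrative; the hand-rolled butterfly below is the textbook fast Walsh-Hadamard transform.

```python
import numpy as np

def fwht(a):
    """Fast Walsh-Hadamard transform along axis 0 (length a power of 2)."""
    a = a.astype(float).copy()
    n, h = a.shape[0], 1
    while h < n:
        for i in range(0, n, 2 * h):          # butterfly step
            x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
        h *= 2
    return a

def srht_l2_regression(A, b, r, rng=np.random.default_rng(0)):
    """Sketch: flip signs (D), mix with the Hadamard transform (H),
    sample r rows uniformly, and solve the small least-squares problem."""
    n = A.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)
    HA = fwht(A * signs[:, None]) / np.sqrt(n)  # H/sqrt(n) is orthogonal
    Hb = fwht(b * signs) / np.sqrt(n)
    idx = rng.choice(n, size=r, replace=False)
    scale = np.sqrt(n / r)
    x, *_ = np.linalg.lstsq(HA[idx] * scale, Hb[idx] * scale, rcond=None)
    return x
```

Because HD flattens the row leverage scores (Fact 2), plain uniform sampling on the mixed problem inherits the guarantees of the leverage-score sampling scheme.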

Open problem: sparse approximations

Sparse approximations and L2 regression (Natarajan '95, Tropp '04, '06): In the sparse approximation problem, we are given a d-by-n matrix A forming a redundant dictionary for R^d and a target vector b ∈ R^d, and we seek to solve

minimize ||x||_0 subject to ||Ax - b||_2 <= ε

In words, we seek a sparse, bounded-error representation of b in terms of the vectors in the dictionary.

This is (sort of) under-constrained least-squares regression. Can we use the aforementioned ideas to get better and/or faster approximation algorithms for the sparse approximation problem?

Application: feature selection for RLSC

Regularized Least Squares Classification (RLSC): Given a term-document matrix A and a class label for each document, find x_opt to minimize the regularized least-squares objective. Here c is the vector of labels. For simplicity assume two classes, thus c_i = ±1.

Given a new document-vector q, its classification is determined by the sign of q^T x_opt.
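A minimal RLSC sketch, assuming the common ridge-style objective ||A^T x - c||² + λ||x||² for a terms-by-documents matrix A with ±1 labels c; the function names, the exact regularization constant, and the toy orientation of A are assumptions, not the slides' precise formulation.

```python
import numpy as np

def rlsc_train(A, c, lam):
    """Closed-form ridge solution: x = (A A^T + lam I)^{-1} A c,
    where A is terms x documents and c holds the ±1 document labels."""
    m = A.shape[0]
    return np.linalg.solve(A @ A.T + lam * np.eye(m), A @ c)

def rlsc_classify(q, x):
    """Classify a new document-vector q by the sign of q^T x."""
    return np.sign(q @ x)
```

Feature (term) selection then amounts to keeping only some rows of A before training, which is where the sampling scheme from the regression section plugs in.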

Feature selection for RLSC

Feature selection for RLSC: Is it possible to select a small number of actual features (terms) and apply RLSC only on the selected terms, without a huge loss in accuracy?

This is a well-studied problem, and supervised algorithms (they employ the class label vector c) exist. We applied our L2 regression sampling scheme to select terms: unsupervised!

A smaller RLSC problem

TechTC data from ODP (Gabrilovich and Markovitch '04)

TechTC data: 100 term-document matrices, of average size ≈ 20,000 terms and ≈ 150 documents.

In prior work, feature selection was performed using a supervised metric called information gain (IG), an entropic measure of correlation with the class labels.

Conclusion of the experiments: Our unsupervised technique had (on average) comparable performance to IG.



Conclusions and future directions

- Linear algebraic techniques (e.g., matrix decompositions and regression) are fundamental in data mining and information retrieval.
- Randomized algorithms for linear algebra computations contribute novel results and ideas, both from a theoretical as well as an applied perspective.
- Important direction:
