Title: Simultaneous Tensor Subspace Selection and Clustering: The Equivalence of High Order SVD and KMeans
1. Simultaneous Tensor Subspace Selection and Clustering: The Equivalence of High Order SVD and K-Means Clustering (KDD '08)
2. Introduction
- Data are everywhere.
- Some are 1-D vectors, e.g. the daily temperature in Chapel Hill.
- Some are 2-D matrices, e.g. the daily temperature in NC cities.
- The rest are high-dimensional (N-D) data, e.g. the daily temperature, humidity, and light in NC cities.
- Data dimension reduction is an important topic in data mining, machine learning, and pattern recognition applications.
- Early on, PCA and SVD were the popular tools for 2-D arrays of data.
- Recently, tensor-based methods have been extensively studied, e.g. HOSVD, 2DSVD, GLRAM.
3. Introduction
- The contributions the authors claim:
- Prove the equivalence of HOSVD to simultaneous subspace selection (GLRAM/2DSVD) plus K-means clustering.
- Present experiments that demonstrate the equivalence theory.
- Provide a HOSVD-based dataset quality assessment method to help select subdatasets with an expected noise level.
4. Outline
- Review of SVD and PCA
- Tensors
- High Order SVD, 2DSVD, tensor clustering
- The equivalence theorem
- Experimental results on the AT&T database
- Dataset quality assessment and subdataset selection
5. Data dimension reduction of matrices
- Singular Value Decomposition (SVD)
- X ≈ U S V^T, where
  - X = (x_ij) is the M x N data matrix
  - U = (u_ij) is M x K, with orthonormal columns
  - S = (s_ij) is K x K, diagonal, holding the singular values
  - V^T = (v_ij) is K x N, with orthonormal rows
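As a concrete sketch (not from the paper, variable names are illustrative), a rank-K truncated SVD in NumPy keeps only the K largest singular values and vectors:

```python
import numpy as np

# Hypothetical data matrix X (M x N); K is the target rank.
rng = np.random.default_rng(0)
M, N, K = 8, 6, 2
X = rng.standard_normal((M, N))

# Thin SVD, then keep the K leading singular values/vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :K], np.diag(s[:K]), Vt[:K, :]

# Rank-K reconstruction X ~= U S V^T
X_k = U_k @ S_k @ Vt_k
assert X_k.shape == X.shape
```

By the Eckart-Young theorem this X_k is the best rank-K approximation of X in Frobenius norm.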
6. Data dimension reduction of matrices
- Principal Component Analysis (PCA)
- An important application of SVD
- X ≈ U V^T, where
  - X = (x_ij) is the M x N data matrix
  - U = (u_ij) is M x K: the principal components (PCs)
  - V^T = (v_ij) is K x N: the loadings
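A minimal sketch of PCA as an application of SVD (my own toy data, not from the paper): center the data, take the SVD, and read off scores and loadings.

```python
import numpy as np

# PCA as a thin wrapper around SVD: center the data, then SVD it.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))          # 100 samples, 5 features
Xc = X - X.mean(axis=0)                    # centering is what turns SVD into PCA

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
K = 2
scores = U[:, :K] * s[:K]                  # principal components (PCs)
loadings = Vt[:K, :]                       # loadings: directions in feature space

# Projecting the centered data onto the loadings recovers the scores.
assert np.allclose(Xc @ loadings.T, scores)
```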
7. Generalization to N dimensions
- The limits of PCA and SVD:
- They are methods for analyzing matrices (2-D arrays)
- Not natural to apply to higher-dimensional data
- From 2-D to N-D (e.g. 3-D):
- Matrix → Tensor
- SVD/PCA → HOSVD, 2DSVD/GLRAM
8. What is a tensor?
- A tensor is a multidimensional array.
9. Tensor mode-n multiplication
- The mode-n product B ×_n U multiplies every mode-n fiber of the tensor B by the matrix U: (B ×_n U)_{i_1 … j … i_N} = Σ_{i_n} u_{j i_n} b_{i_1 … i_n … i_N}.
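A mode-n product can be sketched in NumPy with `tensordot` (my own helper, illustrative only):

```python
import numpy as np

def mode_n_product(T, M, n):
    """Mode-n product T x_n M: contract mode n of tensor T with the
    rows of matrix M (shape J x I_n), so mode n changes size I_n -> J."""
    # tensordot contracts T's axis n with M's axis 1 and appends the new
    # axis at the end; moveaxis puts it back in position n.
    return np.moveaxis(np.tensordot(T, M, axes=(n, 1)), -1, n)

# Toy check on a 3 x 4 x 5 tensor.
rng = np.random.default_rng(2)
T = rng.standard_normal((3, 4, 5))
M = rng.standard_normal((2, 4))            # acts on mode 1 (size 4 -> 2)
out = mode_n_product(T, M, 1)
assert out.shape == (3, 2, 5)
```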
10. Frobenius norm
- For an M x N matrix A, the Frobenius norm is ||A||_F = (Σ_{i,j} a_ij^2)^{1/2}.
- For a P x Q x R tensor B, the norm is ||B|| = (Σ_{p,q,r} b_pqr^2)^{1/2}.
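Both norms are just the square root of the sum of squared entries, which a short NumPy check (toy data, my own) makes explicit:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 5))            # M x N matrix
B = rng.standard_normal((3, 4, 5))         # P x Q x R tensor

# Frobenius norm: sqrt of the sum of squared entries,
# for matrices and tensors alike.
fro_A = np.sqrt((A ** 2).sum())
fro_B = np.sqrt((B ** 2).sum())

# Agrees with NumPy's built-in norms.
assert np.isclose(fro_A, np.linalg.norm(A, 'fro'))
assert np.isclose(fro_B, np.linalg.norm(B.ravel()))
```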
11. High Order SVD
- Assume the data form a 3-D tensor, e.g. a set of images of the same size.
- The HOSVD factorization treats every index uniformly.
- The goal of HOSVD is min_{U,V,W,S} ||X − S ×_1 U ×_2 V ×_3 W||^2.
- U, V, W are orthogonal 2-D matrices; S is a 3-D core tensor.
- With explicit indices, the formula is x_ijk ≈ Σ_{p,q,r} u_ip v_jq w_kr s_pqr.
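A common way to compute a truncated HOSVD (a sketch under the standard unfolding-based construction, not the paper's code; `unfold`/`hosvd` are my own names) is to take the leading left singular vectors of each mode unfolding and project X onto them:

```python
import numpy as np

def unfold(T, n):
    """Mode-n unfolding: mode n becomes the rows, all other modes the columns."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def hosvd(X, ranks):
    """Truncated HOSVD: factor matrices from the left singular vectors of
    each unfolding; core tensor from projecting X onto all of them."""
    factors = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
               for n, r in enumerate(ranks)]
    S = X
    for n, U in enumerate(factors):
        # S = S x_n U^T  (mode-n product with the transposed factor)
        S = np.moveaxis(np.tensordot(S, U.T, axes=(n, 1)), -1, n)
    return S, factors

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 5, 4))
S, (U, V, W) = hosvd(X, (3, 3, 2))
assert S.shape == (3, 3, 2)
```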
12. An illustration of HOSVD
[Figure: the tensor X factored into a core tensor S and matrices U, V, W]
13. GLRAM/2DSVD
- Instead of treating every index equally, 2DSVD views the 3-D tensor as a collection of matrices {X_1, …, X_n3}.
- Each X_i is a 2-D matrix, e.g. an image.
- The goal of GLRAM/2DSVD is min_{U,V,{M_i}} Σ_i ||X_i − U M_i V^T||^2, with U and V shared across all X_i.
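A minimal 2DSVD sketch (toy data and my own variable names; the standard non-iterative construction from the averaged row-row and column-column covariances, not necessarily the exact algorithm in the paper):

```python
import numpy as np

# Shared row-space U and column-space V from averaged covariances,
# then each image is compressed to M_l = U^T X_l V.
rng = np.random.default_rng(5)
images = [rng.standard_normal((10, 8)) for _ in range(20)]   # toy "images"
k1, k2 = 4, 3                                                # reduced sizes

F = sum(X @ X.T for X in images)       # row-row covariance (10 x 10)
G = sum(X.T @ X for X in images)       # column-column covariance (8 x 8)

# eigh returns ascending eigenvalues; reverse to take the top-k eigenvectors.
U = np.linalg.eigh(F)[1][:, ::-1][:, :k1]
V = np.linalg.eigh(G)[1][:, ::-1][:, :k2]

M = [U.T @ X @ V for X in images]      # compressed representations
assert M[0].shape == (k1, k2)
```

Each 10 x 8 image is now a 4 x 3 matrix, mirroring the paper's reduction of 102 x 92 images to 30 x 30.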
14. An illustration of 2DSVD
[Figure: each X_i (i = 1, …, n3) approximated as U M_i V^T, with U and V^T shared]
15. Tensor clustering
- For vectors x_1, …, x_n, K-means minimizes J = Σ_k Σ_{i ∈ C_k} ||x_i − c_k||^2, where c_k is the centroid vector of cluster C_k.
- For tensors M_1, …, M_n, we generalize it to J = Σ_k Σ_{i ∈ C_k} ||M_i − c_k||^2, where c_k is the centroid tensor of cluster C_k and ||·|| is the tensor Frobenius norm.
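Because the Frobenius norm of a tensor equals the Euclidean norm of its flattened entries, tensor K-means reduces to ordinary K-means on flattened tensors. A sketch (my own toy implementation, using Lloyd's iterations):

```python
import numpy as np

def tensor_kmeans(tensors, k, iters=20, seed=0):
    """K-means on tensors: flattening each tensor reduces the Frobenius
    objective to ordinary vector k-means (Lloyd's algorithm)."""
    X = np.stack([t.ravel() for t in tensors])
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # init from the data
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute means.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return labels, centroids.reshape((k,) + tensors[0].shape)

# Two well-separated groups of 3 x 3 "tensors".
a = [np.zeros((3, 3)) + 0.01 * i for i in range(5)]
b = [np.ones((3, 3)) * 10 + 0.01 * i for i in range(5)]
labels, cents = tensor_kmeans(a + b, 2)
assert len(set(labels[:5])) == 1 and len(set(labels[5:])) == 1
```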
16. The equivalence theorem
- HOSVD performs simultaneous 2DSVD and K-means clustering:
- (1) The solution of W in HOSVD is the cluster indicator of K-means.
- (2) The (U, V) in HOSVD is the same (U, V) as in 2DSVD (the Global Consistency Lemma).
17. Experiment on the AT&T face image database
- 10 different images of each of 40 distinct subjects; 400 images in total.
- Images were taken under different conditions:
- Different times
- Varying lighting
- Facial expressions (open/closed eyes, smiling/not smiling)
- Facial details (glasses/no glasses)
- Image size is 102 x 92.
18. Three methods are explored
- PCA + K-means clustering
- Reshape each image into one vector
- The images form a 9384 x 400 matrix
- K-means is run on the matrix (K = 40)
- 2DSVD + K-means clustering
- 2DSVD is applied first; the 102 x 92 images (X_l, l = 1, …, 400) are reduced to 30 x 30 matrices (M_l, l = 1, …, 400)
- The M_l are clustered with K-means (K = 40)
19. Three methods are explored
- HOSVD
- Perform HOSVD on the 102 x 92 x 400 tensor for simultaneous compression and clustering, with reduced dimensions 30 x 30 x 40.
20.
- After running the three methods, the results are represented as 400 x 40 matrices Q, where Q_ij indicates that image i is clustered into cluster j.
- Create a new 40 x 40 matrix I: each row of I represents one subject and each column represents one cluster.
[Figure: images from one subject, aggregated from the image-by-cluster matrix Q into the subject-by-cluster matrix I]
21. Visualization
[Figure: subject-by-cluster matrices for PCA + K-means, 2DSVD + K-means, and HOSVD]
- Rows: subjects
- Columns: clusters
- Green squares show the number of images clustered into the same cluster.
22. Data inconsistency
- Ideally, the 10 images of a subject should all receive the same cluster label.
- In practice, this is not the case.
23. Data inconsistency
24. Clustering accuracy comparison
- Clustering accuracy = (number of images clustered into their default subject cluster) / (total number of images).
- Compared on three image sets:
- (1) All 400 images
- (2) A 300-image subset
- (3) A 220-image subset
- How are the subsets selected?
- Run the three methods on all 400 images, then select the groups in which at least n images (n = 8 and n = 10 in the cases above) are clustered into the default cluster by at least one method.
- Merge these groups together.
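The accuracy metric above can be sketched as follows; since the slides do not spell out how the "default subject cluster" is determined, this toy version (my own) matches each cluster to its majority subject:

```python
import numpy as np

def clustering_accuracy(subjects, clusters):
    """Fraction of images whose cluster's majority subject matches their
    own subject (a simple stand-in for the 'default cluster')."""
    subjects, clusters = np.asarray(subjects), np.asarray(clusters)
    correct = 0
    for c in np.unique(clusters):
        members = subjects[clusters == c]
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()            # images agreeing with the majority
    return correct / len(subjects)

# Toy example: 2 subjects x 3 images; one image lands in the wrong cluster.
subj = [0, 0, 0, 1, 1, 1]
clus = [0, 0, 1, 1, 1, 1]
acc = clustering_accuracy(subj, clus)
assert abs(acc - 5 / 6) < 1e-12
```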
25. Clustering accuracy comparison
26. Dataset quality assessment and subset selection
- From the experiments we learn that a person's images can be more similar to other subjects' images than to their own.
- Such outliers cause the data inconsistency problem and confuse data mining and pattern analysis algorithms.
- How can we select high-dimensional datasets (or subdatasets) with fewer outliers? With HOSVD.
27. Dataset quality assessment and subset selection
- Subset selection method (on the AT&T dataset):
- Apply HOSVD to all images.
- Select the subjects for which at least n images are clustered into the default subject cluster.
28. Conclusions of the paper
- First, the authors prove that HOSVD performs simultaneous 2DSVD and K-means clustering.
- Second, the authors provide experiments that demonstrate the theoretical results.
- Finally, a HOSVD-based dataset quality assessment method is provided to select clean datasets with an expected noise level.
29. Thanks!