Simultaneous Tensor Subspace Selection and Clustering: The Equivalence of High Order SVD and KMeans - PowerPoint PPT Presentation

1
Simultaneous Tensor Subspace Selection and
ClusteringThe Equivalence of High Order SVD and
K-Means Clustering(KDD08)
  • Shunping Huang

2
Introduction
  • Data are everywhere.
  • Some are 1-D vectors, e.g. the daily temperature
    in Chapel Hill
  • Some are 2-D matrices, e.g. the daily temperatures
    in NC cities
  • The rest are high-dimensional, N-D data, e.g.
    the daily temperature, humidity, and light in NC
    cities
  • Data dimension reduction is an important topic in
    data mining, machine learning, and pattern
    recognition applications.
  • At the very beginning, PCA and SVD were the popular
    tools for 2-D arrays of data
  • Recently, tensor-based methods have been
    extensively studied, e.g. HOSVD, 2DSVD, GLRAM.

3
Introduction
  • The contributions the authors claim:
  • Prove the equivalence of HOSVD to simultaneous
    subspace selection (GLRAM/2DSVD) and K-means
    clustering
  • Experiments that demonstrate the equivalence theory
  • A dataset quality assessment method based on HOSVD
    that helps select subdatasets with an expected
    noise level.

4
Outline
  • Review of SVD and PCA
  • Tensors
  • High Order SVD, 2DSVD, tensor clustering
  • The equivalence theorem
  • Experimental results on the AT&T face database
  • Dataset quality assessment and subdataset
    selection

5
Data dimension reduction of matrices
  • Singular Value Decomposition (SVD)
  • X ≈ U S Vᵀ

X = (x_ij), M×N
U = (u_ij), M×K
S = (s_ij), K×K
Vᵀ = (v_ij), K×N
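As a concrete sketch of the truncated SVD above (NumPy; the sizes M=8, N=6, K=3 are illustrative, not from the paper):

```python
import numpy as np

# Truncated SVD: X (M x N) ~ U (M x K) @ S (K x K) @ Vt (K x N)
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))

K = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :K], np.diag(s[:K]), Vt[:K, :]

# Best rank-K approximation of X in the Frobenius norm
X_approx = U_k @ S_k @ Vt_k
err = np.linalg.norm(X - X_approx, "fro")
```

Keeping only the K largest singular values gives the best rank-K approximation in the Frobenius norm, which is why SVD is the basic dimension-reduction tool here.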



6
Data dimension reduction of matrices
  • Principal Component Analysis (PCA)
  • An important application of SVD
  • X ≈ U Vᵀ

X = (x_ij), M×N
U = (u_ij), M×K (the PCs)
Vᵀ = (v_ij), K×N (the loadings)
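A minimal sketch of PCA via SVD (the 20×5 data matrix is illustrative): center the data, then the centered matrix factors into PC scores and loadings.

```python
import numpy as np

# PCA via SVD: X_c ~ (U S) Vt, where U S are the PC scores
# and the rows of Vt are the loadings.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))

X_c = X - X.mean(axis=0)      # center each column
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)

K = 2
scores = U[:, :K] * s[:K]     # projections onto the first K PCs
loadings = Vt[:K, :]
X_approx = scores @ loadings  # rank-K reconstruction
```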



7
Generalization to N Dimension
  • The limits of PCA and SVD
  • They are methods for analyzing matrices (2-D arrays)
  • Not natural to apply to higher-dimensional data
  • From 2-D to N-D (e.g. 3-D)
  • Matrix → Tensor
  • SVD/PCA → HOSVD, 2DSVD/GLRAM

8
What is a tensor?
  • Tensor
  • A multidimensional array

9
Tensor Mode-n Multiplication
  • The mode-n product A ×ₙ U multiplies every mode-n
    fiber of the tensor A by the matrix U
  • For a 3-D tensor A of size I×J×K and a matrix U of
    size P×I, the mode-1 product is
    (A ×₁ U)_pjk = Σᵢ u_pi a_ijk
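The mode-n product can be sketched as a small NumPy helper (sizes are illustrative): move the chosen mode to the front, contract it with U, and move the result back.

```python
import numpy as np

# Mode-n product: contract U's second axis with tensor axis `mode`.
def mode_n_product(A, U, mode):
    return np.moveaxis(np.tensordot(U, A, axes=(1, mode)), 0, mode)

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 5, 6))
U = rng.standard_normal((3, 4))

B = mode_n_product(A, U, 0)  # (A x_1 U) has shape (3, 5, 6)
```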
10
Frobenius norm
  • Assuming A is an M×N matrix
  • The Frobenius norm of A is
    ‖A‖_F = ( Σᵢ Σⱼ a_ij² )^{1/2}
  • Assuming B is a P×Q×R tensor
  • The norm of B is
    ‖B‖ = ( Σᵢ Σⱼ Σₖ b_ijk² )^{1/2}
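As a quick check (illustrative 2×3×4 tensor): the tensor norm is just the square root of the sum of squared entries, which NumPy's `norm()` computes for any shape by flattening.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((2, 3, 4))

manual = np.sqrt((B ** 2).sum())  # sum of squared entries, square-rooted
builtin = np.linalg.norm(B)       # norm() flattens by default
```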

11
High Order SVD
  • Assuming that the data is a 3D tensor. For
    example a set of images, which are of the same
    size.
  • HOSVD factorization treats every index uniformly.
  • The goal of HOSVD is
    min ‖X − S ×₁ U ×₂ V ×₃ W‖²
  • U, V, W are orthogonal 2-D matrices; S is a 3-D core
    tensor
  • Using explicit indices, we can rewrite the formula as
    x_ijk ≈ Σₚ Σ_q Σ_r u_ip v_jq w_kr s_pqr
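A minimal HOSVD sketch (not the paper's implementation; the 6×5×4 tensor and ranks (3, 3, 2) are illustrative): each factor matrix comes from the SVD of the corresponding mode unfolding, and the core is the tensor contracted with the factor transposes.

```python
import numpy as np

def unfold(X, mode):
    # Matricize X along `mode`: shape (X.shape[mode], -1)
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def hosvd(X, ranks):
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U[:, :r])
    S = X
    for mode, U in enumerate(factors):
        # S <- S x_mode U^T
        S = np.moveaxis(np.tensordot(U.T, S, axes=(1, mode)), 0, mode)
    return S, factors

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 5, 4))
S, (U, V, W) = hosvd(X, (3, 3, 2))  # core S has shape (3, 3, 2)
```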

12
An illustration of HOSVD
  • The goal of HOSVD

[Diagram: X ≈ S ×₁ U ×₂ V ×₃ W]
13
GLRAM/2DSVD
  • Instead of treating every index equally, 2DSVD views
    the 3-D tensor as a set of matrices X₁, …, X_{n₃}
  • Each Xᵢ is a 2-D matrix, e.g. an image.
  • The goal of GLRAM/2DSVD is
    min Σᵢ ‖Xᵢ − U Mᵢ Vᵀ‖²
  • or, equivalently, Xᵢ ≈ U Mᵢ Vᵀ with shared U and V
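A 2DSVD sketch (illustrative sizes: 50 matrices of 10×8, reduced to 4×3): the shared U and V come from the top eigenvectors of the summed row-row and column-column covariances, and each matrix is reduced to Mᵢ = Uᵀ Xᵢ V.

```python
import numpy as np

rng = np.random.default_rng(5)
Xs = rng.standard_normal((50, 10, 8))

F = sum(X @ X.T for X in Xs)   # 10 x 10 row-row covariance
G = sum(X.T @ X for X in Xs)   # 8 x 8 column-column covariance

k1, k2 = 4, 3
U = np.linalg.eigh(F)[1][:, ::-1][:, :k1]  # top-k1 eigenvectors of F
V = np.linalg.eigh(G)[1][:, ::-1][:, :k2]  # top-k2 eigenvectors of G

Ms = np.stack([U.T @ X @ V for X in Xs])   # reduced 4 x 3 representations
```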

14
An illustration of 2DSVD
  • The goal of 2DSVD

[Diagram: Xᵢ ≈ U Mᵢ Vᵀ for each i = 1, …, n₃]
15
Tensor Clustering
  • For vectors x₁, …, xₙ, we can do K-means clustering:
    min Σₖ Σ_{xᵢ ∈ Cₖ} ‖xᵢ − cₖ‖²
  • cₖ is the centroid vector of cluster Cₖ
  • For tensors M₁, …, Mₙ, we can generalize it:
    min Σₖ Σ_{Mᵢ ∈ Cₖ} ‖Mᵢ − cₖ‖²
  • cₖ is the centroid tensor of cluster Cₖ
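Because the objective uses the Frobenius norm, K-means on tensors is just K-means on their flattened vectors. A bare-bones Lloyd iteration (30 random 4×3 "tensors" as illustrative input):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(6)
Ms = rng.standard_normal((30, 4, 3))
labels, _ = kmeans(Ms.reshape(30, -1), k=3)
```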

16
The Equivalence Theorem
  • HOSVD performs simultaneous 2DSVD and K-means
    clustering.
  • (1) The solution W in HOSVD is the cluster indicator
    of K-means
  • (2) The (U,V) in HOSVD is the same (U,V) as in 2DSVD
    (the Global Consistency Lemma)

17
Experiment on the AT&T Face Image Database
  • 10 different images of each of 40 distinct
    subjects, 400 images in total.
  • Images were taken under different conditions:
  • Different times
  • Varying lighting
  • Facial expressions (open/closed eyes,
    smiling/not smiling)
  • Facial details (glasses/no glasses)
  • Image size is 102 × 92.

18
Three methods are explored
  • PCA + K-means clustering
  • Reshape each image into one vector
  • The images form a 9384×400 matrix
  • K-means is employed on the matrix (K=40)
  • 2DSVD + K-means clustering
  • 2DSVD is applied first, and the 102×92 images (X_l,
    l = 1, …, 400) are reduced to 30×30 dimensions (M_l,
    l = 1, …, 400).
  • Cluster the M_l with K-means (K=40)

19
Three methods are explored
  • HOSVD
  • Perform HOSVD on the 102×92×400 tensor to carry out
    simultaneous compression and clustering with
    reduced dimensions 30×30×40.
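A hedged sketch of reading a clustering out of HOSVD (illustrative sizes, not the AT&T data): per the equivalence theorem, the image-mode factor W (n×K) acts as a relaxed cluster indicator; one simple way to harden it is an argmax over each row of |W|.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((10, 8, 30))        # 30 "images" of size 10 x 8

n, K = X.shape[2], 4
X3 = np.moveaxis(X, 2, 0).reshape(n, -1)    # mode-3 unfolding, n x 80
W = np.linalg.svd(X3, full_matrices=False)[0][:, :K]
labels = np.abs(W).argmax(axis=1)           # hard cluster assignment
```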

20
  • After running the three methods, the results are
    represented as 400×40 matrices Q
  • Q_ij = 1 means image i is clustered into cluster j.
  • Create a new 40×40 matrix I. Each row of I represents
    one subject and each column represents one cluster


[Diagram: the rows of Q (images) belonging to one subject
are aggregated into one row of I (subject × cluster)]
21
Visualization
  • PCA + K-means | 2DSVD + K-means | HOSVD
  • Rows: subjects
  • Columns: clusters
  • Green squares show the number of images clustered
    into the same cluster.

22
Data inconsistency
  • Ideally, the 10 images of a subject should have
    the same cluster label.
  • But actually this is not the case.

23
Data inconsistency
24
Clustering accuracy comparison
  • The clustering accuracy is
  • (the number of images clustered into their default
    subject clusters) / (the total number of images)
  • Compared on three image sets:
  • (1) all 400 images
  • (2) a 300-image subset
  • (3) a 220-image subset
  • How are the subsets selected?
  • Run the three methods on all 400 images, and select
    the groups (subjects) in which at least n images
    (n=8 and n=10 in the cases above) are clustered into
    the default cluster by at least one method.
  • Merge them together.
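The accuracy definition above can be sketched with toy labels (not the AT&T data; the identity mapping between subjects and default clusters is an assumption of this example):

```python
import numpy as np

# 4 subjects x 10 images; 3 images mis-clustered as "outliers"
true_subject = np.repeat(np.arange(4), 10)
predicted = true_subject.copy()
predicted[[17, 25, 33]] = 0

accuracy = (predicted == true_subject).mean()  # 37/40 = 0.925
```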

25
Clustering accuracy comparison
26
Dataset quality assessment and subset selection
  • From the experiments we know:
  • The images of one person can be more similar to
    other people's images than to his own
  • These outliers lead to the data inconsistency
    problem and confuse data mining and pattern
    analysis algorithms
  • How can we select high-dimensional datasets (or
    subdatasets) with fewer outliers?
  • HOSVD

27
Dataset quality assessment and subset selection
  • Subset Selection Method (on the AT&T dataset)
  • Apply HOSVD to all images
  • Select the subjects that have at least n images
    clustered into their default subject cluster.

28
Conclusion of the paper
  • First, the authors prove that HOSVD performs
    simultaneous 2DSVD and K-means clustering
  • Second, the authors provide experiments that
    demonstrate the theoretical results
  • Finally, an HOSVD-based dataset quality assessment
    method is provided to select clean datasets with
    an expected noise level.

29
Thanks!