1
(Semi-)Supervised Probabilistic Principal
Component Analysis
  • Shipeng Yu
  • University of Munich, Germany
  • Siemens Corporate Technology
  • http://www.dbs.ifi.lmu.de/spyu
  • Joint work with Kai Yu, Volker Tresp,
  • Hans-Peter Kriegel, Mingrui Wu

2
Dimensionality Reduction
  • We are dealing with high-dimensional data
  • Texts: e.g. bag-of-words features
  • Images: color histogram, correlogram, etc.
  • Web pages: texts, linkages, structures, etc.
  • Motivations
  • Noisy features: how to remove or down-weight
    them?
  • Learnability: curse of dimensionality
  • Inefficiency: high computational cost
  • Visualization
  • A pre-processing for many data mining tasks

3
Unsupervised versus Supervised
  • Unsupervised Dimensionality Reduction
  • Only the input data are given
  • PCA (principal component analysis)
  • Supervised Dimensionality Reduction
  • Should be biased by the outputs
  • Classification: FDA (Fisher discriminant
    analysis)
  • Regression: PLS (partial least squares)
  • RVs: CCA (canonical correlation analysis)
  • More general solutions?
  • Semi-Supervised?

4
Our Settings and Notations
  • N data points, M input features, L output
    labels
  • We aim to derive a mapping f : R^M → R^K
    (K « M) such that the projection z = f(x)
    reflects the inputs and, when available, the
    outputs

[Figure: the N × (M + L) data matrix, with a labeled block (inputs and outputs) and an unlabeled block (inputs only); the three settings: unsupervised, semi-supervised, supervised]
5
Outline
  • Principal Component Analysis
  • Probabilistic PCA
  • Supervised Probabilistic PCA
  • Related Work
  • Conclusion

6
PCA Motivation
  • Find the K orthogonal projection directions that
    capture the most data variance
  • Applications
  • Visualization
  • De-noising
  • Latent semantic indexing
  • Eigenfaces

[Figure: 2D data cloud with its 1st and 2nd principal component directions]
7
PCA Algorithm
  • Basic Algorithm
  • Centralize data: x_n ← x_n − μ, with sample
    mean μ
  • Compute the sample covariance matrix
    S = (1/N) Σ_n (x_n − μ)(x_n − μ)^T
  • Do eigen-decomposition S = U Λ U^T (sort
    eigenvalues decreasingly)
  • PCA directions are given in U_K, the first K
    columns of U
  • The PCA projection of a test point x* is
    z* = U_K^T (x* − μ)
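A minimal NumPy sketch of this algorithm (function and variable names are mine, not from the slides):

```python
import numpy as np

def pca_fit(X, K):
    """PCA as on this slide: centralize, covariance, eigen-decomposition."""
    mu = X.mean(axis=0)                    # centralize data
    Xc = X - mu
    S = Xc.T @ Xc / X.shape[0]             # sample covariance matrix (M x M)
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues decreasingly
    U_K = eigvecs[:, order[:K]]            # PCA directions: first K columns
    return mu, U_K

def pca_project(x_star, mu, U_K):
    """PCA projection of a test point: z* = U_K^T (x* - mu)."""
    return U_K.T @ (x_star - mu)
```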

8
Supervised PCA?
  • PCA is unsupervised
  • When output information is available
  • Classification labels: 0/1
  • Regression responses: real values
  • Ranking orders: rank labels / preferences
  • Multi-outputs: output dimension > 1
  • Structured outputs, ...
  • Can PCA be biased by outputs?
  • And how?

9
Outline
  • Principal Component Analysis
  • Probabilistic PCA
  • Supervised Probabilistic PCA
  • Related Work
  • Conclusion

10
Latent Variable Model for PCA
  • Another interpretation of PCA [Pearson 1901]
  • PCA minimizes the reconstruction error
    Σ_n ||x_n − W z_n||² of the (centralized) x_n
  • z_n are latent variables: the PCA projections
    of x_n
  • W holds the factor loadings: the PCA mappings
  • Equivalent to PCA up to a scaling factor
  • Leads to the idea of PPCA
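A quick numerical check of this equivalence, as a sketch under my own variable names: the best rank-K reconstruction of the centralized data (via truncated SVD) spans the same subspace as the PCA directions from the eigen-decomposition on slide 7.

```python
import numpy as np

# Pearson's view: the rank-K factors minimizing sum_n ||x_n - W z_n||^2,
# obtained here by truncated SVD, span the covariance eigenvector subspace.
np.random.seed(0)
X = np.random.randn(300, 6) @ np.random.randn(6, 6)   # toy data, N=300, M=6
Xc = X - X.mean(axis=0)
K = 2

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:K].T                  # factor loadings (PCA mappings)
Z = Xc @ W                    # latent variables (PCA projections)

S = Xc.T @ Xc / Xc.shape[0]   # sample covariance
_, evecs = np.linalg.eigh(S)
U_K = evecs[:, ::-1][:, :K]   # top-K eigenvectors
print(np.allclose(np.abs(W.T @ U_K), np.eye(K), atol=1e-6))  # True
```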

11
Probabilistic PCA [TipBis99]
  • Latent variable model: x = W z + μ + ε, with
    latent z ~ N(0, I_K) and noise ε ~ N(0, σ² I_M)
  • Conditional independence: p(x | z) factorizes
    over the M input dimensions
  • If σ² → 0, PPCA leads to the PCA solution (up to
    a rotation and scaling factor)
  • x is Gaussian distributed: x ~ N(μ, W W^T + σ² I_M),
    where μ is the mean vector

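A minimal sketch of this generative process in NumPy (parameter values are arbitrary illustrations, not from the slides): sample from the model and check the Gaussian marginal numerically.

```python
import numpy as np

# PPCA generative model: z ~ N(0, I_K), x = W z + mu + eps, eps ~ N(0, sigma2 I_M)
rng = np.random.default_rng(0)
M, K, N, sigma2 = 5, 2, 100_000, 0.1
W = rng.standard_normal((M, K))
mu = rng.standard_normal(M)

Z = rng.standard_normal((N, K))                        # latent variables
X = Z @ W.T + mu + np.sqrt(sigma2) * rng.standard_normal((N, M))

# The marginal of x is N(mu, W W^T + sigma2 I): compare model vs. sample covariance
C_model = W @ W.T + sigma2 * np.eye(M)
C_sample = np.cov(X, rowvar=False)
print(np.max(np.abs(C_model - C_sample)))              # small for large N
```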
12
From Unsupervised to Supervised
  • Key insights of PPCA
  • All the M input dimensions are conditionally
    independent given the K latent variables
  • In PCA we are seeking the K latent variables that
    best explain the data covariance
  • When we have outputs y, we believe
  • There are inter-covariances between x and y
  • There are intra-covariances within y if L > 1
  • Idea: Let the latent variables explain all of
    them!

13
Outline
  • Principal Component Analysis
  • Probabilistic PCA
  • Supervised Probabilistic PCA
  • Related Work
  • Conclusion

14
Supervised Probabilistic PCA
  • Supervised latent variable model:
    x = W_x z + μ_x + ε_x,  y = W_y z + μ_y + ε_y,
    with a shared latent z ~ N(0, I_K)
  • All M + L input and output dimensions are
    conditionally independent given z
  • (x, y) are jointly Gaussian distributed
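A small sketch of sampling from this model (the separate noise levels sigma2_x / sigma2_y and all names below are my assumptions for illustration):

```python
import numpy as np

# SPPCA as on this slide: one shared latent z generates both x and y
rng = np.random.default_rng(1)
M, L, K, N = 10, 3, 2, 500
Wx, Wy = rng.standard_normal((M, K)), rng.standard_normal((L, K))
mux, muy = rng.standard_normal(M), rng.standard_normal(L)
sigma2_x, sigma2_y = 0.1, 0.05                 # assumed separate noise levels

Z = rng.standard_normal((N, K))                # shared latent variables
X = Z @ Wx.T + mux + np.sqrt(sigma2_x) * rng.standard_normal((N, M))
Y = Z @ Wy.T + muy + np.sqrt(sigma2_y) * rng.standard_normal((N, L))
# given z, all M + L observed dimensions are conditionally independent,
# and (x, y) is jointly Gaussian
```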

15
Semi-Supervised Probabilistic PCA
  • Idea: An SPPCA model with missing data!
  • Likelihood: labeled points contribute the joint
    p(x_n, y_n); unlabeled points contribute the
    marginal p(x_n)

16
S2PPCA EM Learning
  • Model parameters: the mappings, mean vectors and
    noise variances of the latent variable model
  • EM Learning
  • E-step: estimate the latent z_n for each data
    point (a projection / inference problem)
  • M-step: maximize the data log-likelihood w.r.t.
    the parameters
  • An extension of EM learning for the PPCA model
  • Can be kernelized!
  • By-product: an EM learning algorithm for kernel
    PCA
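The update equations are lost from this transcript; below is a minimal sketch of the standard EM iteration for plain PPCA [TipBis99], whose E/M structure the slide says S2PPCA extends to labeled plus unlabeled data (function and variable names are mine):

```python
import numpy as np

def ppca_em(X, K, n_iter=200, seed=0):
    """EM for plain PPCA: alternate posterior inference of z_n (E-step)
    with maximization of the expected log-likelihood (M-step)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    W = rng.standard_normal((M, K))            # random initialization
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of z_n (the projection / inference problem)
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(K))
        Ez = Xc @ W @ Minv                     # E[z_n], one row per point
        Ezz = N * sigma2 * Minv + Ez.T @ Ez    # sum_n E[z_n z_n^T]
        # M-step: maximize the expected complete-data log-likelihood
        W = Xc.T @ Ez @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2) - np.sum((Xc @ W) * Ez)) / (N * M)
    return mu, W, sigma2
```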

17
S2PPCA Toy Problem - Linear
18
S2PPCA Toy Problem - Nonlinear
19
S2PPCA Toy Problem - Nonlinear
20
S2PPCA Properties
  • Semi-supervised projection
  • Takes PCA and kernel PCA as special cases
  • Applicable to large data sets
  • Primal: O(t(M+L)NK) time, O((M+L)N) space
  • Dual: O(tN²K) time, O(N²) space (cheaper than
    Primal if M > N)
  • A latent variable solution [Yu et al., SIGIR05]
  • Cannot deal with the semi-supervised setting!
  • Closed-form solutions for SPPCA
  • No closed-form solutions for S2PPCA

21
SPPCA Primal Solution
[Slide formula: closed-form SPPCA solution via eigen-decomposition of an (M+L)×(M+L) matrix]
22
SPPCA Dual Solution
[Slide formula: dual SPPCA solution via eigen-decomposition of a new kernel matrix]
23
Experiments
  • Train a Nearest Neighbor classifier after
    projection
  • Evaluation metrics
  • Multi-class classification: error rate
  • Multi-label classification: F1-measure, AUC
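A sketch of this evaluation protocol (the helper and its names are mine; any projection, e.g. the pca_project sketch above or an S2PPCA projection, can produce the inputs):

```python
import numpy as np

def one_nn_error(Z_train, y_train, Z_test, y_test):
    """Error rate of a 1-nearest-neighbour classifier on projected data."""
    # pairwise squared Euclidean distances between test and train projections
    d = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(-1)
    y_hat = y_train[np.argmin(d, axis=1)]     # label of the nearest neighbour
    return float(np.mean(y_hat != y_test))    # multi-class classification error
```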

24
Multi-class Classification
  • S2PPCA is almost always better than SPPCA
  • LDA is very good for FACE data
  • S2PPCA is very good on TEXT data
  • S2PPCA has good scalability

25
Multi-label Classification
  • S2PPCA is always better than SPPCA
  • S2PPCA is better than the other methods in most
    cases
  • S2PPCA has good scalability

26
Extensions
  • Put priors on the factor loading matrices
  • Learn MAP estimates for them
  • Relax the Gaussian noise model for outputs
  • A better way to incorporate supervised information
  • We need to do more approximations (using, e.g.,
    EP)
  • Directly predict missing outputs (i.e., in a
    single step)
  • Mixture modeling in the latent variable space
  • Achieve local PCA instead of global PCA
  • Robust supervised PCA mapping
  • Replace the Gaussian with a Student-t
  • Outlier detection in PCA

27
Related Work
  • Fisher discriminant analysis (FDA)
  • Goal: Find directions that maximize between-class
    distance while minimizing within-class distance
  • Only deals with multi-class classification
    outputs
  • Limited number of projection dimensions
  • Canonical correlation analysis (CCA)
  • Goal: Maximize the correlation between inputs and
    outputs
  • Ignores the intra-covariance of both inputs and
    outputs
  • Partial least squares (PLS)
  • Goal: Sequentially find orthogonal directions that
    maximize covariance with respect to outputs
  • A penalized CCA: poor generalization to new
    output dimensions

28
Conclusion
  • A general framework for (semi-)supervised
    dimensionality reduction
  • We can solve the problem analytically (EIG) or
    via iterative algorithms (EM)
  • Trade-off:
  • EIG: optimization-based, easily extended to other
    losses
  • EM: semi-supervised projection, good scalability
  • Both algorithms can be kernelized
  • PCA and kernel PCA are special cases
  • More applications expected

29
Thank you!
  • Questions?

http://www.dbs.ifi.lmu.de/spyu