Title: Density Ratio Estimation: A New Versatile Tool for Machine Learning
1. Density Ratio Estimation: A New Versatile Tool for Machine Learning
- Department of Computer Science
- Tokyo Institute of Technology
- Masashi Sugiyama
- sugi_at_cs.titech.ac.jp
- http://sugiyama-www.cs.titech.ac.jp/sugi
2. Overview of My Talk (1)
- Consider the ratio of two probability densities.
- If the ratio is estimated, various machine
learning problems can be solved!
- Non-stationarity adaptation, domain adaptation,
multi-task learning, outlier detection, change
detection in time series, feature selection,
dimensionality reduction, independent component
analysis, conditional density estimation,
classification, two-sample test
3. Overview of My Talk (2)
Vapnik said: "When solving a problem of
interest, one should not solve a more general
problem as an intermediate step."
Knowing densities ⇒ Knowing ratio
- Estimating the density ratio is substantially easier
than estimating the densities!
- Various direct density ratio estimation methods
have been developed recently.
4. Organization of My Talk
- Usage of Density Ratios
- Covariate shift adaptation
- Outlier detection
- Mutual information estimation
- Conditional probability estimation
- Methods of Density Ratio Estimation
5. Covariate Shift Adaptation
Shimodaira (JSPI2000)
- Training/test input distributions are different,
but the target function remains unchanged
(an extrapolation setting).
[Figure: target function, input densities, and
training/test samples (extrapolation)]
6. Adaptation Using Density Ratios
- Ordinary least squares is not consistent
under covariate shift.
- Density-ratio-weighted least squares is
consistent.
- Applicable to any likelihood-based method!
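The weighting scheme above can be sketched in a few lines of NumPy. The Gaussian input densities, the sinc target, the linear model, and the use of the true (rather than estimated) importance weights are all assumptions made for this sketch; estimating the weights themselves is the subject of the later slides.

```python
import numpy as np

# Density-ratio-weighted least squares under covariate shift (toy sketch).
rng = np.random.default_rng(0)

def gauss_pdf(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

# Training and test inputs come from different Gaussians,
# but the target function is the same (extrapolation setting).
x_tr = rng.normal(1.0, 0.5, 200)
y_tr = np.sinc(x_tr) + rng.normal(0.0, 0.1, 200)

# Importance weights w(x) = p_test(x) / p_train(x); here the true densities
# are known, whereas in practice the ratio itself must be estimated.
w = gauss_pdf(x_tr, 2.0, 0.5) / gauss_pdf(x_tr, 1.0, 0.5)

# Fit a linear model y = a*x + b by ordinary and by weighted least squares.
X = np.stack([x_tr, np.ones_like(x_tr)], axis=1)
theta_ols = np.linalg.solve(X.T @ X, X.T @ y_tr)
theta_iw = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y_tr))
```

The weighted fit emphasizes training points that are likely under the test distribution, which is what restores consistency under covariate shift.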
7. Model Selection
- Controlling the bias-variance trade-off is important
in covariate shift adaptation.
- No weighting: low variance, high bias
- Density ratio weighting: low bias, high variance
- Density ratio weighting also plays an important role
in unbiased model selection:
- Akaike information criterion (regular models)
- Subspace information criterion (linear models)
- Cross-validation (arbitrary models)
Shimodaira (JSPI2000)
Sugiyama & Müller (StatDec2005)
Sugiyama, Krauledat & Müller (JMLR2007)
8. Active Learning
- Covariate shift naturally occurs in active
learning scenarios:
- The training input distribution is designed by the user.
- The test input distribution is determined by the environment.
- Density ratio weighting plays an important role
in low-bias active learning:
- Population-based methods
- Pool-based methods
Wiens (JSPI2000); Kanamori & Shimodaira (JSPI2003);
Sugiyama (JMLR2006); Kanamori (NeuroComp2007);
Sugiyama & Nakajima (MLJ2009)
9. Real-world Applications
- Age prediction from faces (with NECsoft)
- Illumination and angle change (KLSCV)
- Speaker identification (with Yamaha)
- Voice quality change (KLRCV)
- Brain-computer interface
- Mental condition change (LDACV)
Ueki, Sugiyama & Ihara (ICPR2010)
Yamada, Sugiyama & Matsui (SP2010)
Sugiyama, Krauledat & Müller (JMLR2007)
Li, Kambara, Koike & Sugiyama (IEEE-TBE2010)
10. Real-world Applications (cont.)
- Japanese text segmentation (with IBM)
- Domain adaptation from general conversation to
medicine (CRFCV)
- Semiconductor wafer alignment (with Nikon)
- Optical marker allocation (AL)
- Robot control
- Dynamic motion update (KLSCV)
- Optimal exploration (AL)
Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2008)
Sugiyama & Nakajima (MLJ2009)
Hachiya, Akiyama, Sugiyama & Peters (NN2009)
Hachiya, Peters & Sugiyama (ECML2009)
Akiyama, Hachiya & Sugiyama (NN2010)
11. Organization of My Talk
- Usage of Density Ratios
- Covariate shift adaptation
- Outlier detection
- Mutual information estimation
- Conditional probability estimation
- Methods of Density Ratio Estimation
12. Inlier-based Outlier Detection
Hido, Tsuboi, Kashima, Sugiyama & Kanamori (ICDM2008);
Smola, Song & Teo (AISTATS2009)
- Test samples having a different tendency from
regular ones are regarded as outliers.
[Figure: test samples with one outlier deviating
from the regular samples]
13. Real-world Applications
- Steel plant diagnosis (with JFE Steel)
- Printer roller quality control (with Canon)
- Loan customer inspection (with IBM)
- Sleep therapy from biometric data
Takimoto, Matsugu & Sugiyama (DMSS2009)
Hido, Tsuboi, Kashima, Sugiyama & Kanamori (KAIS2010)
Kawahara & Sugiyama (SDM2009)
14. Organization of My Talk
- Usage of Density Ratios
- Covariate shift adaptation
- Outlier detection
- Mutual information estimation
- Conditional probability estimation
- Methods of Density Ratio Estimation
15. Mutual Information Estimation
- Mutual information (MI):
MI = ∬ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy
- MI as an independence measure:
MI = 0 if and only if x and y are statistically independent.
- MI can be approximated using the density ratio
r(x, y) = p(x, y) / (p(x) p(y)).
Suzuki, Sugiyama & Tanaka (ISIT2009)
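A quick numerical check of the ratio form of MI, on made-up discrete joint distributions (so the densities are exact and no estimation is involved):

```python
import numpy as np

# MI written through the density ratio r(x, y) = p(x, y) / (p(x) p(y)):
#   MI = E_{p(x,y)}[ log r(x, y) ].
def mutual_information(p_xy):
    px = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (2, 1)
    py = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, 2)
    ratio = p_xy / (px @ py)               # density ratio r(x, y)
    return float((p_xy * np.log(ratio)).sum())

p_indep = np.array([[0.25, 0.25], [0.25, 0.25]])  # independent: MI = 0
p_dep = np.array([[0.4, 0.1], [0.1, 0.4]])        # dependent: MI > 0

mi0 = mutual_information(p_indep)
mi1 = mutual_information(p_dep)
```

The independent table gives a ratio of 1 everywhere (MI = 0), while the dependent table gives a strictly positive MI, matching the independence-measure property above.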
16. Usage of MI Estimator
- Independence between input x and output y:
- Feature selection
- Sufficient dimension reduction
- Independence between inputs:
- Independent component analysis
- Independence between input x and residual e:
- Causal inference
Suzuki, Sugiyama, Sese & Kanamori (BMC Bioinformatics 2009)
Suzuki & Sugiyama (AISTATS2010)
Suzuki & Sugiyama (ICA2009)
Yamada & Sugiyama (AAAI2010)
[Diagram: input x vs. output y (feature selection,
sufficient dimension reduction); among inputs
(independent component analysis); input x vs.
residual e (causal inference)]
17. Organization of My Talk
- Usage of Density Ratios
- Covariate shift adaptation
- Outlier detection
- Mutual information estimation
- Conditional probability estimation
- Methods of Density Ratio Estimation
18. Conditional Probability Estimation
- When the output is continuous:
- Conditional density estimation
- Mobile robot transition estimation
- When the output is categorical:
- Probabilistic classification
Sugiyama, Takeuchi, Suzuki, Kanamori, Hachiya
& Okanohara (AISTATS2010)
[Figure: estimated class-posterior probabilities
for a two-class pattern (Class 1 vs. Class 2)]
Sugiyama (Submitted)
19. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment matching Approach
- Ratio matching Approach
- Comparison
- Dimensionality Reduction
20. Density Ratio Estimation
- Density ratios are shown to be versatile.
- In practice, however, the density ratio must be
estimated from data.
- Naïve approach: estimate the two densities separately
and take their ratio.
21. Vapnik's Principle
"When solving a problem, don't solve more
difficult problems as an intermediate step."
Knowing densities ⇒ Knowing ratio
- Estimating the density ratio is substantially easier
than estimating the densities!
- We estimate the density ratio without going through
density estimation.
22. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Comparison
- Dimensionality Reduction
23. Probabilistic Classification
- Separate numerator and denominator samples by a
probabilistic classifier.
- Logistic regression achieves the minimum
asymptotic variance when the model is correct.
Qin (Biometrika1998)
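A minimal NumPy sketch of this approach: train a logistic regression to separate numerator samples (label 1) from denominator samples (label 0), then convert class probabilities to a ratio estimate via Bayes' rule. The one-dimensional Gaussian samples, the plain gradient-descent fit, and all hyperparameters are assumptions made for the sketch.

```python
import numpy as np

# By Bayes' rule, the density ratio is recovered from the classifier as
#   r(x) = (n_de / n_nu) * P(y=1 | x) / P(y=0 | x).
rng = np.random.default_rng(0)
x_nu = rng.normal(1.0, 1.0, 300)   # numerator samples
x_de = rng.normal(0.0, 1.0, 300)   # denominator samples

X = np.concatenate([x_nu, x_de])
y = np.concatenate([np.ones(300), np.zeros(300)])
Phi = np.stack([X, np.ones_like(X)], axis=1)   # linear feature + bias

w = np.zeros(2)
for _ in range(2000):                          # gradient descent on log-loss
    p = 1.0 / (1.0 + np.exp(-Phi @ w))
    w -= 0.05 * Phi.T @ (p - y) / len(y)

def ratio(x):
    p1 = 1.0 / (1.0 + np.exp(-(w[0] * x + w[1])))
    return (len(x_de) / len(x_nu)) * p1 / (1.0 - p1)
```

For these two Gaussians the true log-ratio is linear in x, so a linear logistic model is well specified; the estimated ratio increases where numerator samples are denser.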
24. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Comparison
- Dimensionality Reduction
25. Moment Matching
- Match moments of p_nu(x) and of the ratio model
times p_de(x).
- Ex) Matching the mean: choose the ratio model r so
that the mean of x under r(x) p_de(x) matches the
mean of x under p_nu(x).
26. Kernel Mean Matching
Huang, Smola, Gretton, Borgwardt & Schölkopf (NIPS2006)
- Matching a finite number of moments does not
yield the true ratio, even asymptotically.
- Kernel mean matching: all moments are
efficiently matched using a Gaussian RKHS.
- Ratio values at samples can
be estimated via a quadratic program.
- Variants for learning the entire ratio function and
for other loss functions are also available.
Kanamori, Suzuki & Sugiyama (arXiv2009)
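A dependency-free sketch of the kernel-mean-matching idea: choose weights on the denominator samples so that their weighted kernel mean matches the kernel mean of the numerator samples in a Gaussian RKHS. The full method solves a constrained quadratic program; the ridge-regularized linear solve with clipping below is a simplification, and the sample distributions, kernel width, and regularization constant are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x_de = rng.normal(0.0, 1.0, (100, 1))   # denominator samples
x_nu = rng.normal(0.5, 1.0, (150, 1))   # numerator samples

def gauss_kernel(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

K = gauss_kernel(x_de, x_de)                               # (n_de, n_de)
# kappa_j = (n_de / n_nu) * sum_i k(x_de_j, x_nu_i): the matching target.
kappa = gauss_kernel(x_de, x_nu).mean(axis=1) * len(x_de)

# Minimize (1/2) w^T K w - kappa^T w with ridge regularization,
# then clip at zero in place of the QP's nonnegativity constraint.
lam = 0.1
w = np.linalg.solve(K + lam * np.eye(len(x_de)), kappa)
w = np.clip(w, 0.0, None)
```

The weights shift the effective denominator sample toward the numerator distribution, which is visible in the weighted sample mean.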
27. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Kullback-Leibler Divergence
- Squared Distance
- Comparison
- Dimensionality Reduction
28. Ratio Matching
Sugiyama, Suzuki & Kanamori (RIMS2010)
- Match a ratio model r(x) with the true ratio r*(x)
under the Bregman divergence:
BR_f(r*‖r) = ∫ p_de(x) [ f(r*(x)) − f(r(x))
− ∂f(r(x)) (r*(x) − r(x)) ] dx
- Its empirical approximation (ignoring a constant) is
(1/n_de) Σ_j [ ∂f(r(x_j^de)) r(x_j^de) − f(r(x_j^de)) ]
− (1/n_nu) Σ_i ∂f(r(x_i^nu)).
29. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Kullback-Leibler Divergence
- Squared Distance
- Comparison
- Dimensionality Reduction
30. Kullback-Leibler Divergence (KL)
Sugiyama, Nakajima, Kashima, von Bünau & Kawanabe (NIPS2007);
Nguyen, Wainwright & Jordan (NIPS2007)
- Put f(t) = t log t − t. Then BR yields the KL objective
(1/n_de) Σ_j r(x_j^de) − (1/n_nu) Σ_i log r(x_i^nu).
- For a linear model r(x) = Σ_l θ_l φ_l(x)
(e.g., Gaussian kernels),
minimization of KL is convex.
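A sketch of the KL-based direct estimator with a Gaussian-kernel linear model, fit by projected gradient descent. The real methods use more careful optimization and cross-validated hyperparameters; the step size, kernel width, centers, and iteration count below are assumptions made for the sketch.

```python
import numpy as np

# Minimize  J(theta) = mean_de[ r(x) ] - mean_nu[ log r(x) ]
# for the linear model r(x) = sum_l theta_l k(x, c_l), theta >= 0.
rng = np.random.default_rng(0)
x_de = rng.normal(0.0, 1.0, 200)   # denominator samples
x_nu = rng.normal(1.0, 1.0, 200)   # numerator samples
centers = x_nu[:20]                # Gaussian kernel centers

def design(x, sigma=1.0):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi_de, Phi_nu = design(x_de), design(x_nu)
theta = np.ones(len(centers)) * 0.1
for _ in range(500):               # projected gradient descent
    r_nu = Phi_nu @ theta
    grad = Phi_de.mean(axis=0) - (Phi_nu / r_nu[:, None]).mean(axis=0)
    theta = np.clip(theta - 0.1 * grad, 1e-8, None)

r_de = Phi_de @ theta              # estimated ratio at denominator samples
```

At an interior stationary point the estimate is automatically normalized (the average of r over denominator samples approaches 1), and the fitted ratio is larger where numerator samples are denser.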
31. Properties
Nguyen, Wainwright & Jordan (NIPS2007); Sugiyama,
Suzuki, Nakajima, Kashima, von Bünau & Kawanabe
(AISM2008)
- Parametric case:
- The learned parameter converges to the optimal value
at order n^{-1/2}.
- Non-parametric case:
- The learned function converges to the optimal
function at an order slightly slower than n^{-1/2}
(depending on the covering number or the
bracketing entropy of the function class).
32. Experiments
- Setup: d-dimensional Gaussians with identity covariance:
- Denominator mean: (0, 0, 0, …, 0)
- Numerator mean: (1, 0, 0, …, 0)
- Ratio of kernel density estimators (RKDE):
- Estimate the two densities separately and take their ratio.
- Gaussian width is chosen by CV.
- KL method:
- Estimate the density ratio directly.
- Gaussian width is chosen by CV.
33. Accuracy as a Function of Input Dimensionality
[Figure: normalized MSE vs. input dimensionality
for RKDE and KL]
- RKDE suffers from the curse of dimensionality.
- KL works well.
34. Variations
- EM algorithms for KL with
- log-linear models,
- Gaussian mixture models, and
- probabilistic PCA mixture models
are also available.
Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2009)
Yamada & Sugiyama (IEICE2009)
Yamada, Sugiyama, Wichern & Jaak (Submitted)
35. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Kullback-Leibler Divergence
- Squared Distance
- Comparison
- Dimensionality Reduction
36. Squared Distance (SQ)
Kanamori, Hido & Sugiyama (JMLR2009)
- Put f(t) = (t − 1)² / 2. Then BR yields the squared-distance
objective (1/(2 n_de)) Σ_j r(x_j^de)² − (1/n_nu) Σ_i r(x_i^nu).
- For a linear model r(x) = Σ_l θ_l φ_l(x),
the minimizer of SQ is analytic.
- Computationally efficient!
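The analytic solution can be sketched directly in NumPy: for a linear-in-parameter kernel model, the regularized squared-distance minimizer is a single linear solve. The kernel width, centers, regularization constant, and sample distributions below are assumptions made for the sketch.

```python
import numpy as np

# theta = (H + lam I)^{-1} h, with
#   H = mean over denominator samples of phi(x) phi(x)^T,
#   h = mean over numerator samples of phi(x).
rng = np.random.default_rng(0)
x_de = rng.normal(0.0, 1.0, 300)   # denominator samples
x_nu = rng.normal(1.0, 1.0, 300)   # numerator samples
centers = x_nu[:50]                # Gaussian kernel centers

def design(x, sigma=1.0):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi_de, Phi_nu = design(x_de), design(x_nu)
H = Phi_de.T @ Phi_de / len(x_de)
h = Phi_nu.mean(axis=0)
lam = 0.1
theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)

def ratio(x):
    return np.clip(design(x) @ theta, 0.0, None)   # clip negative values
```

No iterative optimization is needed, which is why this approach is marked "fast" in the comparison table later; the analytic form is also what makes leave-one-out CV and derivative computations cheap.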
37. Properties
Kanamori, Hido & Sugiyama (JMLR2009)
Kanamori, Suzuki & Sugiyama (arXiv2009)
- Optimal parametric/non-parametric rates of
convergence are achieved.
- The leave-one-out cross-validation score can be
computed analytically.
- SQ has the smallest condition number.
- The analytic solution allows computation of the
derivative of the mutual information estimator:
- Sufficient dimension reduction
- Independent component analysis
- Causal inference
38. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Comparison
- Dimensionality Reduction
39. Density Ratio Estimation Methods

Method | Density estimation | Domains of de/nu | Model selection | Computation time
RKDE   | Involved           | Could differ     | Possible        | Very fast
LR     | Free               | Same?            | Possible        | Slow
KMM    | Free               | Same?            | ???             | Slow
KL     | Free               | Could differ     | Possible        | Slow
SQ     | Free               | Could differ     | Possible        | Fast

- SQ would be preferable in practice.
40. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Comparison
- Dimensionality Reduction
41. Direct Density Ratio Estimation with
Dimensionality Reduction (D3)
- Direct density ratio estimation without going
through density estimation is promising.
- However, for high-dimensional data, density
ratio estimation is still challenging.
- We combine direct density ratio estimation with
dimensionality reduction!
42. Hetero-distributional Subspace (HS)
- Key assumption: p_nu(x) and p_de(x) are
different only within a subspace (called the HS).
- This allows us to estimate the density ratio only
within the low-dimensional HS!
[Figure: decomposition of x by a full-rank,
orthogonal transform into the HS component
and its complement]
43. HS Search Based on Supervised Dimension Reduction
Sugiyama, Kawanabe & Chui (NN2010)
- In the HS, p_nu and p_de are different,
so numerator and denominator samples would be separable.
- We may use any supervised dimension reduction
method for HS search.
- Local Fisher discriminant analysis (LFDA):
- Can handle within-class multimodality
- No limitation on reduced dimensions
- Analytic solution available
Sugiyama (JMLR2007)
44. Pros and Cons of the Supervised Dimension
Reduction Approach
- Pros
- SQ + LFDA gives an analytic solution for D3.
- Cons
- Rather restrictive implicit assumption that
the HS component and its complement are independent.
- Separability does not always lead to the HS (e.g.,
two overlapping but different densities).
[Figure: a separable case vs. an overlapping case
where separability fails to identify the HS]
45. Characterization of the HS
Sugiyama, Hara, von Bünau, Suzuki, Kanamori
& Kawanabe (SDM2010)
- The HS is given as the maximizer of the Pearson
divergence (PE).
- Thus, we can identify the HS by maximizing
PE with respect to the projection.
- PE can be analytically approximated by SQ!
46. HS Search
- Gradient descent
- Natural gradient descent:
- Utilize the Stiefel-manifold structure for efficiency
- Givens rotation:
- Choose a pair of axes and rotate in the subspace
- Subspace rotation:
- Ignore rotations within subspaces for efficiency
Amari (NeuralComp1998)
Plumbley (NeuroComp2005)
[Diagram: rotations across vs. within the
hetero-distributional subspace]
47. Experiments
[Figure: true ratio (2d), samples (2d), and estimates
by plain SQ (2d) and D3-SQ (2d); comparison of plain SQ
and D3-SQ as dimensionality is increased by adding
noisy dimensions]
48. Conclusions
- Many ML tasks can be solved via density ratio
estimation:
- Non-stationarity adaptation, domain adaptation,
multi-task learning, outlier detection, change
detection in time series, feature selection,
dimensionality reduction, independent component
analysis, conditional density estimation,
classification, two-sample test
- Directly estimating density ratios without going
through density estimation is the key:
LR, KMM, KL, SQ, and D3.
49. Books
- Quiñonero-Candela, Sugiyama, Schwaighofer &
Lawrence (Eds.), Dataset Shift in Machine
Learning, MIT Press, 2009.
- Sugiyama, von Bünau, Kawanabe & Müller, Covariate
Shift Adaptation in Machine Learning, MIT Press
(coming soon!)
- Sugiyama, Suzuki & Kanamori,
Density Ratio Estimation in Machine Learning,
Cambridge University Press (coming soon!)
50. The World of Density Ratios
- Real-world applications: brain-computer
interface, robot control, image understanding,
speech recognition, natural language processing,
bioinformatics
- Machine learning algorithms: covariate shift
adaptation, outlier detection, conditional
probability estimation, mutual information
estimation
- Density ratio estimation: fundamental algorithms
(LR, KMM, KL, SQ); large-scale settings,
high dimensionality, stabilization, robustification
- Theoretical analysis: convergence, information
criteria, numerical stability