Title: Density Ratio Estimation: A New Versatile Tool for Machine Learning
1. Density Ratio Estimation: A New Versatile Tool for Machine Learning
- Department of Computer Science
- Tokyo Institute of Technology
- Masashi Sugiyama
- sugi_at_cs.titech.ac.jp
- http://sugiyama-www.cs.titech.ac.jp/sugi
2. Overview of My Talk (1)
- Consider the ratio of two probability densities.
- If the ratio is estimated, various machine
learning problems can be solved!
- Non-stationarity adaptation, domain adaptation,
multi-task learning, outlier detection, change
detection in time series, feature selection,
dimensionality reduction, independent component
analysis, conditional density estimation,
classification, two-sample test
3. Overview of My Talk (2)
Vapnik said: "When solving a problem of
interest, one should not solve a more general
problem as an intermediate step."
Knowing densities ⇒ Knowing ratio
- Estimating the density ratio is substantially easier
than estimating the densities!
- Various direct density ratio estimation methods
have been developed recently.
4. Organization of My Talk
- Usage of Density Ratios
- Covariate shift adaptation
- Outlier detection
- Mutual information estimation
- Conditional probability estimation
- Methods of Density Ratio Estimation
5. Covariate Shift Adaptation
Shimodaira (JSPI2000)
- Training/test input distributions are different,
but the target function remains unchanged
(an extrapolation setting).
[Figure: target function, input densities, and
training/test samples (extrapolation)]
6. Adaptation Using Density Ratios
- Ordinary least squares is not consistent
under covariate shift.
- Density-ratio-weighted least squares is
consistent.
- Applicable to any likelihood-based method!
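The weighting scheme above can be sketched in a few lines of NumPy. The Gaussian input densities, the sinc target, the linear model, and the use of the true (rather than estimated) importance weights are all assumptions made for this sketch; estimating the weights themselves is the subject of the later slides.

```python
import numpy as np

# Density-ratio-weighted least squares under covariate shift (toy sketch).
rng = np.random.default_rng(0)

def gauss_pdf(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

# Training and test inputs come from different Gaussians,
# but the target function is the same (extrapolation setting).
x_tr = rng.normal(1.0, 0.5, 200)
y_tr = np.sinc(x_tr) + rng.normal(0.0, 0.1, 200)

# Importance weights w(x) = p_test(x) / p_train(x); here the true densities
# are known, whereas in practice the ratio itself must be estimated.
w = gauss_pdf(x_tr, 2.0, 0.5) / gauss_pdf(x_tr, 1.0, 0.5)

# Fit a linear model y = a*x + b by ordinary and by weighted least squares.
X = np.stack([x_tr, np.ones_like(x_tr)], axis=1)
theta_ols = np.linalg.solve(X.T @ X, X.T @ y_tr)
theta_iw = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y_tr))
```

The weighted fit emphasizes training points that are likely under the test distribution, which is what restores consistency under covariate shift.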
7. Model Selection
- Controlling the bias-variance trade-off is important
in covariate shift adaptation.
- No weighting: low variance, high bias
- Density ratio weighting: low bias, high variance
- Density ratio weighting also plays an important role
in unbiased model selection:
- Akaike information criterion (regular models)
- Subspace information criterion (linear models)
- Cross-validation (arbitrary models)
Shimodaira (JSPI2000)
Sugiyama & Müller (StatDec2005)
Sugiyama, Krauledat & Müller (JMLR2007)
8. Active Learning
- Covariate shift naturally occurs in active
learning scenarios:
- The training input distribution is designed by the user.
- The test input distribution is determined by the environment.
- Density ratio weighting plays an important role
in low-bias active learning:
- Population-based methods
- Pool-based methods
Wiens (JSPI2000); Kanamori & Shimodaira (JSPI2003);
Sugiyama (JMLR2006); Kanamori (NeuroComp2007);
Sugiyama & Nakajima (MLJ2009)
9. Real-world Applications
- Age prediction from faces (with NECsoft)
- Illumination and angle change (KLSCV)
- Speaker identification (with Yamaha)
- Voice quality change (KLRCV)
- Brain-computer interface
- Mental condition change (LDACV)
Ueki, Sugiyama & Ihara (ICPR2010)
Yamada, Sugiyama & Matsui (SP2010)
Sugiyama, Krauledat & Müller (JMLR2007)
Li, Kambara, Koike & Sugiyama (IEEE-TBE2010)
10. Real-world Applications (cont.)
- Japanese text segmentation (with IBM)
- Domain adaptation from general conversation to
medicine (CRFCV)
- Semiconductor wafer alignment (with Nikon)
- Optical marker allocation (AL)
- Robot control
- Dynamic motion update (KLSCV)
- Optimal exploration (AL)
Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2008)
Sugiyama & Nakajima (MLJ2009)
Hachiya, Akiyama, Sugiyama & Peters (NN2009)
Hachiya, Peters & Sugiyama (ECML2009)
Akiyama, Hachiya & Sugiyama (NN2010)
11. Organization of My Talk
- Usage of Density Ratios
- Covariate shift adaptation
- Outlier detection
- Mutual information estimation
- Conditional probability estimation
- Methods of Density Ratio Estimation
12. Inlier-based Outlier Detection
Hido, Tsuboi, Kashima, Sugiyama & Kanamori (ICDM2008);
Smola, Song & Teo (AISTATS2009)
- Test samples having a different tendency from
regular ones are regarded as outliers.
[Figure: test samples with one outlier deviating
from the regular samples]
13. Real-world Applications
- Steel plant diagnosis (with JFE Steel)
- Printer roller quality control (with Canon)
- Loan customer inspection (with IBM)
- Sleep therapy from biometric data
Takimoto, Matsugu & Sugiyama (DMSS2009)
Hido, Tsuboi, Kashima, Sugiyama & Kanamori (KAIS2010)
Kawahara & Sugiyama (SDM2009)
14. Organization of My Talk
- Usage of Density Ratios
- Covariate shift adaptation
- Outlier detection
- Mutual information estimation
- Conditional probability estimation
- Methods of Density Ratio Estimation
15. Mutual Information Estimation
- Mutual information (MI):
MI = ∬ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy
- MI as an independence measure:
MI = 0 if and only if x and y are statistically independent.
- MI can be approximated using the density ratio
r(x, y) = p(x, y) / (p(x) p(y)).
Suzuki, Sugiyama & Tanaka (ISIT2009)
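A quick numerical check of the ratio form of MI, on made-up discrete joint distributions (so the densities are exact and no estimation is involved):

```python
import numpy as np

# MI written through the density ratio r(x, y) = p(x, y) / (p(x) p(y)):
#   MI = E_{p(x,y)}[ log r(x, y) ].
def mutual_information(p_xy):
    px = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (2, 1)
    py = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, 2)
    ratio = p_xy / (px @ py)               # density ratio r(x, y)
    return float((p_xy * np.log(ratio)).sum())

p_indep = np.array([[0.25, 0.25], [0.25, 0.25]])  # independent: MI = 0
p_dep = np.array([[0.4, 0.1], [0.1, 0.4]])        # dependent: MI > 0

mi0 = mutual_information(p_indep)
mi1 = mutual_information(p_dep)
```

The independent table gives a ratio of 1 everywhere (MI = 0), while the dependent table gives a strictly positive MI, matching the independence-measure property above.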
16. Usage of MI Estimator
- Independence between input x and output y:
- Feature selection
- Sufficient dimension reduction
- Independence between inputs:
- Independent component analysis
- Independence between input x and residual e:
- Causal inference
Suzuki, Sugiyama, Sese & Kanamori (BMC Bioinformatics 2009)
Suzuki & Sugiyama (AISTATS2010)
Suzuki & Sugiyama (ICA2009)
Yamada & Sugiyama (AAAI2010)
[Diagram: input x vs. output y (feature selection,
sufficient dimension reduction); among inputs
(independent component analysis); input x vs.
residual e (causal inference)]
17. Organization of My Talk
- Usage of Density Ratios
- Covariate shift adaptation
- Outlier detection
- Mutual information estimation
- Conditional probability estimation
- Methods of Density Ratio Estimation
18. Conditional Probability Estimation
- When the output is continuous:
- Conditional density estimation
- Mobile robot transition estimation
- When the output is categorical:
- Probabilistic classification
Sugiyama, Takeuchi, Suzuki, Kanamori, Hachiya
& Okanohara (AISTATS2010)
[Figure: estimated class-posterior probabilities
for a two-class pattern (Class 1 vs. Class 2)]
Sugiyama (Submitted)
19. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment matching Approach
- Ratio matching Approach
- Comparison
- Dimensionality Reduction
20. Density Ratio Estimation
- Density ratios are shown to be versatile.
- In practice, however, the density ratio must be
estimated from data.
- Naïve approach: estimate the two densities separately
and take their ratio.
21. Vapnik's Principle
"When solving a problem, don't solve more
difficult problems as an intermediate step."
Knowing densities ⇒ Knowing ratio
- Estimating the density ratio is substantially easier
than estimating the densities!
- We estimate the density ratio without going through
density estimation.
22. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Comparison
- Dimensionality Reduction
23. Probabilistic Classification
- Separate numerator and denominator samples by a
probabilistic classifier.
- Logistic regression achieves the minimum
asymptotic variance when the model is correct.
Qin (Biometrika1998)
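A minimal NumPy sketch of this approach: train a logistic regression to separate numerator samples (label 1) from denominator samples (label 0), then convert class probabilities to a ratio estimate via Bayes' rule. The one-dimensional Gaussian samples, the plain gradient-descent fit, and all hyperparameters are assumptions made for the sketch.

```python
import numpy as np

# By Bayes' rule, the density ratio is recovered from the classifier as
#   r(x) = (n_de / n_nu) * P(y=1 | x) / P(y=0 | x).
rng = np.random.default_rng(0)
x_nu = rng.normal(1.0, 1.0, 300)   # numerator samples
x_de = rng.normal(0.0, 1.0, 300)   # denominator samples

X = np.concatenate([x_nu, x_de])
y = np.concatenate([np.ones(300), np.zeros(300)])
Phi = np.stack([X, np.ones_like(X)], axis=1)   # linear feature + bias

w = np.zeros(2)
for _ in range(2000):                          # gradient descent on log-loss
    p = 1.0 / (1.0 + np.exp(-Phi @ w))
    w -= 0.05 * Phi.T @ (p - y) / len(y)

def ratio(x):
    p1 = 1.0 / (1.0 + np.exp(-(w[0] * x + w[1])))
    return (len(x_de) / len(x_nu)) * p1 / (1.0 - p1)
```

For these two Gaussians the true log-ratio is linear in x, so a linear logistic model is well specified; the estimated ratio increases where numerator samples are denser.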
24. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Comparison
- Dimensionality Reduction
25. Moment Matching
- Match moments of p_nu(x) and of the ratio model
times p_de(x).
- Ex) Matching the mean: choose the ratio model r so
that the mean of x under r(x) p_de(x) matches the
mean of x under p_nu(x).
26. Kernel Mean Matching
Huang, Smola, Gretton, Borgwardt & Schölkopf (NIPS2006)
- Matching a finite number of moments does not
yield the true ratio, even asymptotically.
- Kernel mean matching: all moments are
efficiently matched using a Gaussian RKHS.
- Ratio values at samples can
be estimated via a quadratic program.
- Variants for learning the entire ratio function and
for other loss functions are also available.
Kanamori, Suzuki & Sugiyama (arXiv2009)
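A dependency-free sketch of the kernel-mean-matching idea: choose weights on the denominator samples so that their weighted kernel mean matches the kernel mean of the numerator samples in a Gaussian RKHS. The full method solves a constrained quadratic program; the ridge-regularized linear solve with clipping below is a simplification, and the sample distributions, kernel width, and regularization constant are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x_de = rng.normal(0.0, 1.0, (100, 1))   # denominator samples
x_nu = rng.normal(0.5, 1.0, (150, 1))   # numerator samples

def gauss_kernel(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

K = gauss_kernel(x_de, x_de)                               # (n_de, n_de)
# kappa_j = (n_de / n_nu) * sum_i k(x_de_j, x_nu_i): the matching target.
kappa = gauss_kernel(x_de, x_nu).mean(axis=1) * len(x_de)

# Minimize (1/2) w^T K w - kappa^T w with ridge regularization,
# then clip at zero in place of the QP's nonnegativity constraint.
lam = 0.1
w = np.linalg.solve(K + lam * np.eye(len(x_de)), kappa)
w = np.clip(w, 0.0, None)
```

The weights shift the effective denominator sample toward the numerator distribution, which is visible in the weighted sample mean.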
27. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Kullback-Leibler Divergence
- Squared Distance
- Comparison
- Dimensionality Reduction
28. Ratio Matching
Sugiyama, Suzuki & Kanamori (RIMS2010)
- Match a ratio model r(x) with the true ratio r*(x)
under the Bregman divergence:
BR_f(r*‖r) = ∫ p_de(x) [ f(r*(x)) − f(r(x))
− ∂f(r(x)) (r*(x) − r(x)) ] dx
- Its empirical approximation (ignoring a constant) is
(1/n_de) Σ_j [ ∂f(r(x_j^de)) r(x_j^de) − f(r(x_j^de)) ]
− (1/n_nu) Σ_i ∂f(r(x_i^nu)).
29. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Kullback-Leibler Divergence
- Squared Distance
- Comparison
- Dimensionality Reduction
30. Kullback-Leibler Divergence (KL)
Sugiyama, Nakajima, Kashima, von Bünau & Kawanabe (NIPS2007);
Nguyen, Wainwright & Jordan (NIPS2007)
- Put f(t) = t log t − t. Then BR yields the KL objective
(1/n_de) Σ_j r(x_j^de) − (1/n_nu) Σ_i log r(x_i^nu).
- For a linear model r(x) = Σ_l θ_l φ_l(x)
(e.g., Gaussian kernels),
minimization of KL is convex.
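A sketch of the KL-based direct estimator with a Gaussian-kernel linear model, fit by projected gradient descent. The real methods use more careful optimization and cross-validated hyperparameters; the step size, kernel width, centers, and iteration count below are assumptions made for the sketch.

```python
import numpy as np

# Minimize  J(theta) = mean_de[ r(x) ] - mean_nu[ log r(x) ]
# for the linear model r(x) = sum_l theta_l k(x, c_l), theta >= 0.
rng = np.random.default_rng(0)
x_de = rng.normal(0.0, 1.0, 200)   # denominator samples
x_nu = rng.normal(1.0, 1.0, 200)   # numerator samples
centers = x_nu[:20]                # Gaussian kernel centers

def design(x, sigma=1.0):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi_de, Phi_nu = design(x_de), design(x_nu)
theta = np.ones(len(centers)) * 0.1
for _ in range(500):               # projected gradient descent
    r_nu = Phi_nu @ theta
    grad = Phi_de.mean(axis=0) - (Phi_nu / r_nu[:, None]).mean(axis=0)
    theta = np.clip(theta - 0.1 * grad, 1e-8, None)

r_de = Phi_de @ theta              # estimated ratio at denominator samples
```

At an interior stationary point the estimate is automatically normalized (the average of r over denominator samples approaches 1), and the fitted ratio is larger where numerator samples are denser.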
31. Properties
Nguyen, Wainwright & Jordan (NIPS2007); Sugiyama,
Suzuki, Nakajima, Kashima, von Bünau & Kawanabe
(AISM2008)
- Parametric case:
- The learned parameter converges to the optimal value
at order n^{-1/2}.
- Non-parametric case:
- The learned function converges to the optimal
function at an order slightly slower than n^{-1/2}
(depending on the covering number or the
bracketing entropy of the function class).
32. Experiments
- Setup: d-dimensional Gaussians with identity covariance:
- Denominator mean: (0, 0, 0, …, 0)
- Numerator mean: (1, 0, 0, …, 0)
- Ratio of kernel density estimators (RKDE):
- Estimate the two densities separately and take their ratio.
- Gaussian width is chosen by CV.
- KL method:
- Estimate the density ratio directly.
- Gaussian width is chosen by CV.
33. Accuracy as a Function of Input Dimensionality
[Figure: normalized MSE vs. input dimensionality
for RKDE and KL]
- RKDE suffers from the curse of dimensionality.
- KL works well.
34. Variations
- EM algorithms for KL with
- log-linear models,
- Gaussian mixture models, and
- probabilistic PCA mixture models
are also available.
Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2009)
Yamada & Sugiyama (IEICE2009)
Yamada, Sugiyama, Wichern & Jaak (Submitted)
35. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Kullback-Leibler Divergence
- Squared Distance
- Comparison
- Dimensionality Reduction
36. Squared Distance (SQ)
Kanamori, Hido & Sugiyama (JMLR2009)
- Put f(t) = (t − 1)² / 2. Then BR yields the squared-distance
objective (1/(2 n_de)) Σ_j r(x_j^de)² − (1/n_nu) Σ_i r(x_i^nu).
- For a linear model r(x) = Σ_l θ_l φ_l(x),
the minimizer of SQ is analytic.
- Computationally efficient!
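The analytic solution can be sketched directly in NumPy: for a linear-in-parameter kernel model, the regularized squared-distance minimizer is a single linear solve. The kernel width, centers, regularization constant, and sample distributions below are assumptions made for the sketch.

```python
import numpy as np

# theta = (H + lam I)^{-1} h, with
#   H = mean over denominator samples of phi(x) phi(x)^T,
#   h = mean over numerator samples of phi(x).
rng = np.random.default_rng(0)
x_de = rng.normal(0.0, 1.0, 300)   # denominator samples
x_nu = rng.normal(1.0, 1.0, 300)   # numerator samples
centers = x_nu[:50]                # Gaussian kernel centers

def design(x, sigma=1.0):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi_de, Phi_nu = design(x_de), design(x_nu)
H = Phi_de.T @ Phi_de / len(x_de)
h = Phi_nu.mean(axis=0)
lam = 0.1
theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)

def ratio(x):
    return np.clip(design(x) @ theta, 0.0, None)   # clip negative values
```

No iterative optimization is needed, which is why this approach is marked "fast" in the comparison table later; the analytic form is also what makes leave-one-out CV and derivative computations cheap.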
37. Properties
Kanamori, Hido & Sugiyama (JMLR2009)
Kanamori, Suzuki & Sugiyama (arXiv2009)
- Optimal parametric/non-parametric rates of
convergence are achieved.
- The leave-one-out cross-validation score can be
computed analytically.
- SQ has the smallest condition number.
- The analytic solution allows computation of the
derivative of the mutual information estimator:
- Sufficient dimension reduction
- Independent component analysis
- Causal inference
38. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Comparison
- Dimensionality Reduction
39. Density Ratio Estimation Methods

Method | Density estimation | Domains of de/nu | Model selection | Computation time
RKDE   | Involved           | Could differ     | Possible        | Very fast
LR     | Free               | Same?            | Possible        | Slow
KMM    | Free               | Same?            | ???             | Slow
KL     | Free               | Could differ     | Possible        | Slow
SQ     | Free               | Could differ     | Possible        | Fast

- SQ would be preferable in practice.
40. Organization of My Talk
- Usage of Density Ratios
- Methods of Density Ratio Estimation
- Probabilistic Classification Approach
- Moment Matching Approach
- Ratio Matching Approach
- Comparison
- Dimensionality Reduction
41. Direct Density Ratio Estimation with
Dimensionality Reduction (D3)
- Direct density ratio estimation without going
through density estimation is promising.
- However, for high-dimensional data, density
ratio estimation is still challenging.
- We combine direct density ratio estimation with
dimensionality reduction!
42. Hetero-distributional Subspace (HS)
- Key assumption: p_nu(x) and p_de(x) are
different only within a subspace (called the HS).
- This allows us to estimate the density ratio only
within the low-dimensional HS!
[Figure: decomposition of x by a full-rank,
orthogonal transform into the HS component
and its complement]
43. HS Search Based on Supervised Dimension Reduction
Sugiyama, Kawanabe & Chui (NN2010)
- In the HS, p_nu and p_de are different,
so numerator and denominator samples would be separable.
- We may use any supervised dimension reduction
method for HS search.
- Local Fisher discriminant analysis (LFDA):
- Can handle within-class multimodality
- No limitation on reduced dimensions
- Analytic solution available
Sugiyama (JMLR2007)
44. Pros and Cons of the Supervised Dimension
Reduction Approach
- Pros
- SQ + LFDA gives an analytic solution for D3.
- Cons
- Rather restrictive implicit assumption that
the HS component and its complement are independent.
- Separability does not always lead to the HS (e.g.,
two overlapping but different densities).
[Figure: a separable case vs. an overlapping case
where separability fails to identify the HS]
45. Characterization of the HS
Sugiyama, Hara, von Bünau, Suzuki, Kanamori
& Kawanabe (SDM2010)
- The HS is given as the maximizer of the Pearson
divergence (PE).
- Thus, we can identify the HS by maximizing
PE with respect to the projection.
- PE can be analytically approximated by SQ!
46. HS Search
- Gradient descent
- Natural gradient descent:
- Utilize the Stiefel-manifold structure for efficiency
- Givens rotation:
- Choose a pair of axes and rotate in the subspace
- Subspace rotation:
- Ignore rotations within subspaces for efficiency
Amari (NeuralComp1998)
Plumbley (NeuroComp2005)
[Diagram: rotations across vs. within the
hetero-distributional subspace]
47. Experiments
[Figure: true ratio (2d), samples (2d), and estimates
by plain SQ (2d) and D3-SQ (2d); comparison of plain SQ
and D3-SQ as dimensionality is increased by adding
noisy dimensions]
48. Conclusions
- Many ML tasks can be solved via density ratio
estimation:
- Non-stationarity adaptation, domain adaptation,
multi-task learning, outlier detection, change
detection in time series, feature selection,
dimensionality reduction, independent component
analysis, conditional density estimation,
classification, two-sample test
- Directly estimating density ratios without going
through density estimation is the key:
LR, KMM, KL, SQ, and D3.
49. Books
- Quiñonero-Candela, Sugiyama, Schwaighofer &
Lawrence (Eds.), Dataset Shift in Machine
Learning, MIT Press, 2009.
- Sugiyama, von Bünau, Kawanabe & Müller, Covariate
Shift Adaptation in Machine Learning, MIT Press
(coming soon!)
- Sugiyama, Suzuki & Kanamori,
Density Ratio Estimation in Machine Learning,
Cambridge University Press (coming soon!)
50. The World of Density Ratios
- Real-world applications: brain-computer
interface, robot control, image understanding,
speech recognition, natural language processing,
bioinformatics
- Machine learning algorithms: covariate shift
adaptation, outlier detection, conditional
probability estimation, mutual information
estimation
- Density ratio estimation: fundamental algorithms
(LR, KMM, KL, SQ); large-scale settings,
high dimensionality, stabilization, robustification
- Theoretical analysis: convergence, information
criteria, numerical stability