Transcript and Presenter's Notes

Title: Learning with Limited Supervision by Input and Output Coding


1
Learning with Limited Supervision by Input and
Output Coding
  • Yi Zhang
  • Machine Learning Department
  • Carnegie Mellon University
  • April 30th, 2012

2
Thesis Committee
  • Jeff Schneider, Chair
  • Geoff Gordon
  • Tom Mitchell
  • Xiaojin (Jerry) Zhu, University of
    Wisconsin-Madison

3
Introduction
(x1, y1), ..., (xn, yn)
  • Learning a prediction system, usually based on
    examples
  • Training examples are usually limited
  • Cost of obtaining high-quality examples
  • Complexity of the prediction problem

[Diagram: learning a mapping from the input space X to the output space Y]
4
Introduction
(x1, y1), ..., (xn, yn)

  • Solution: exploit extra information about the
    input and output spaces
  • Improve the prediction performance
  • Reduce the cost for collecting training examples

5
Introduction
(x1, y1), ..., (xn, yn)
  • Solution: exploit extra information about the
    input and output spaces
  • Representation and discovery?
  • Incorporation?

6
Outline
Part I Encoding Input Information by
Regularization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Learn compressible models
Projection penalties
Part II Encoding Output Information by Output
Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
7
Regularization
  • The general formulation
  • Ridge regression
  • Lasso
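To make the two penalties concrete, here is a minimal sketch using scikit-learn on synthetic data (the data and hyperparameter values are illustrative, not from the talk):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Toy regression data: only the first 3 of 20 features matter.
rng = np.random.RandomState(0)
X = rng.randn(50, 20)
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.randn(50)

# Ridge: squared-L2 penalty, shrinks all coefficients smoothly.
ridge = Ridge(alpha=1.0).fit(X, y)
# Lasso: L1 penalty, drives many coefficients exactly to zero.
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge nonzero coefficients:", np.sum(np.abs(ridge.coef_) > 1e-6))
print("lasso nonzero coefficients:", np.sum(np.abs(lasso.coef_) > 1e-6))
```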

8
Outline
Part I Encoding Input Information by
Regularization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Learn compressible models
Projection penalties
Part II Encoding Output Information by Output
Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
9
Learning with unlabeled text
  • For a text classification task
  • Plenty of unlabeled text on the Web
  • Seemingly unrelated to the task
  • What can we gain from such unlabeled text?

Yi Zhang, Jeff Schneider and Artur Dubrawski.
Learning the Semantic Correlation: An Alternative
Way to Gain from Unlabeled Text. NIPS 2008
10
A motivating example for text learning
  • Humans learn text classification effectively!
  • Two training examples
  • + gasoline, truck
  • - vote, election
  • Query
  • gallon, vehicle
  • Seems very easy! But why?

11
A motivating example for text learning
  • Humans learn text classification effectively!
  • Two training examples
  • + gasoline, truck
  • - vote, election
  • Query
  • gallon, vehicle
  • Seems very easy! But why?
  • Gasoline ↔ gallon, truck ↔ vehicle

12
A covariance operator for regularization
  • Covariance structure of model coefficients
  • Usually unknown -- learn from unlabeled text?

13
Learning with unlabeled text
  • Infer the covariance operator
  • Extract latent topics from unlabeled text (with
    resampling)
  • Observe the contribution of words in each topic
  • gas 0.3, gallon 0.2, truck 0.2, safety
    0.2,
  • Estimate the correlation (covariance) of
    words

14
Learning with unlabeled text
  • Infer the covariance operator
  • Extract latent topics from unlabeled text (with
    resampling)
  • Observe the contribution of words in each topic
  • gas 0.3, gallon 0.2, truck 0.2, safety
    0.2,
  • Estimate the correlation (covariance) of
    words
  • For a new task, we learn with regularization
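A minimal sketch of this pipeline, assuming a topic-word matrix is already available from the unlabeled text (e.g., from a topic model). The covariance of word weights across topics plays the role of the operator above, and the new task is learned with the penalty w'Σ⁻¹w instead of the usual w'w. All names and data below are illustrative:

```python
import numpy as np

def word_covariance_from_topics(topic_word, jitter=1e-3):
    """Estimate a word-word covariance from per-topic word contributions.
    topic_word: (n_topics, n_words) array, e.g. topic-word weights extracted
    from unlabeled text (a stand-in for the talk's estimator)."""
    return np.cov(topic_word, rowvar=False) + jitter * np.eye(topic_word.shape[1])

def regularized_least_squares(X, y, Sigma, lam=1.0):
    """Closed-form minimizer of ||Xw - y||^2 + lam * w' Sigma^{-1} w."""
    return np.linalg.solve(X.T @ X + lam * np.linalg.inv(Sigma), X.T @ y)

# Toy vocabulary: [gasoline, gallon, vote, election]; the first two words
# co-occur in topics, so their coefficients are encouraged to agree.
topic_word = np.array([[0.30, 0.20, 0.00, 0.00],
                       [0.40, 0.30, 0.05, 0.00],
                       [0.00, 0.05, 0.50, 0.40]])
Sigma = word_covariance_from_topics(topic_word)
X = np.array([[1.0, 0.0, 0.0, 0.0],    # document containing only "gasoline" (+)
              [0.0, 0.0, 1.0, 0.0]])   # document containing only "vote"     (-)
y = np.array([1.0, -1.0])
print(regularized_least_squares(X, y, Sigma))  # weight should leak to "gallon"
```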

15
Experiments
  • Empirical results on 20 newsgroups
  • 190 1-vs-1 classification tasks, 2 labeled
    examples
  • For any task, the majority of the unlabeled text (18 of
    the 20 newsgroups) is irrelevant
  • Similar results on logistic regression and least
    squares

[1] V. Sindhwani and S. Keerthi. Large scale
semi-supervised linear SVMs. In SIGIR, 2006
16
Outline
Part I Encoding Input Information by
Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Learn compressible models
Projection penalties
Part II Encoding Output Information by Output
Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
17
Multi-task learning
  • Different but related prediction tasks
  • An example
  • Landmine detection using radar images
  • Multiple tasks: different landmine fields
  • Geographic conditions
  • Landmine types
  • Goal: information sharing among tasks

18
Regularization for multi-task learning
  • Our approach: view MTL as estimating a parameter
    matrix

W
19
Regularization for multi-task learning
  • Our approach: view MTL as estimating a parameter
    matrix
  • A covariance operator for regularizing a matrix?
  • Vector w
  • Matrix W

W
(Gaussian prior)
Yi Zhang and Jeff Schneider. Learning Multiple
Tasks with a Sparse Matrix-Normal Penalty. NIPS
2010
20
Matrix-normal distributions
  • Consider a 2-by-3 matrix W
  • The full covariance is the Kronecker product of
    the row covariance and the column covariance

21
Matrix-normal distributions
  • Consider a 2-by-3 matrix W
  • The full covariance is the Kronecker product of
    the row covariance and the column covariance
  • The matrix-normal density offers a compact form
    for this structured full covariance
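The Kronecker structure is easy to check numerically: if W = A Z B' with i.i.d. standard normal Z and AA', BB' equal to the row and column covariances, then the covariance of the row-stacked vec(W) is their Kronecker product. A small verification sketch, not from the talk:

```python
import numpy as np

q, d = 2, 3                                  # 2 rows (tasks), 3 columns (features)
Omega = np.array([[1.0, 0.6],                # row covariance (q x q)
                  [0.6, 1.0]])
Sigma = np.array([[1.0, 0.3, 0.0],           # column covariance (d x d)
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])

full_cov = np.kron(Omega, Sigma)             # covariance of vec(W): (qd x qd) = 6 x 6

# Sample W ~ MN(0, Omega, Sigma) by coloring i.i.d. noise and compare.
A, B = np.linalg.cholesky(Omega), np.linalg.cholesky(Sigma)
rng = np.random.RandomState(0)
samples = np.stack([(A @ rng.randn(q, d) @ B.T).ravel() for _ in range(100000)])
print(np.allclose(np.cov(samples, rowvar=False), full_cov, atol=0.03))  # ~True
```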

22
Learning with a matrix-normal penalty
  • Joint learning of multiple tasks
  • Alternating optimization

Matrix-normal prior
23
Learning with a matrix-normal penalty
  • Joint learning of multiple tasks
  • Alternating optimization
  • Other recent work can be viewed as variants of special cases
  • Multi-task feature learning [Argyriou et al., NIPS 06]:
    learning with the feature covariance
  • Clustered multi-task learning [Jacob et al., NIPS 08]:
    learning with the task covariance and spectral constraints
  • Multi-task relationship learning [Zhang et al., UAI 10]:
    learning with the task covariance

Matrix-normal prior
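A rough sketch of the alternating scheme under squared loss (my own simplified formulation of the objective and covariance updates, not the exact algorithm from the paper): with the task covariance Ω and feature covariance Σ fixed, W is a single regularized least-squares solve; with W fixed, the two covariances are re-estimated from W, with a small ridge to keep them invertible.

```python
import numpy as np
from scipy.linalg import block_diag

def multitask_matrix_normal(Xs, ys, lam=1.0, n_iters=10, eps=1e-2):
    """Sketch of alternating optimization for
        min_W  sum_t ||X_t w_t - y_t||^2 + lam * tr(Omega^{-1} W Sigma^{-1} W'),
    where row t of W is the coefficient vector of task t."""
    q, d = len(Xs), Xs[0].shape[1]
    Omega, Sigma = np.eye(q), np.eye(d)
    XtX = block_diag(*[X.T @ X for X in Xs])                 # one block per task
    Xty = np.concatenate([X.T @ y for X, y in zip(Xs, ys)])
    for _ in range(n_iters):
        # W-step: the penalty is quadratic in vec(W), so one linear solve suffices.
        penalty = lam * np.kron(np.linalg.inv(Omega), np.linalg.inv(Sigma))
        W = np.linalg.solve(XtX + penalty, Xty).reshape(q, d)
        # Covariance step: flip-flop re-estimation from W, ridged for stability.
        Omega = W @ np.linalg.inv(Sigma) @ W.T / d + eps * np.eye(q)
        Sigma = W.T @ np.linalg.inv(Omega) @ W / q + eps * np.eye(d)
    return W, Omega, Sigma
```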
24
Sparse covariance selection
  • Sparse covariance selection in matrix-normal
    penalties
  • Sparsity of the inverse row and column covariances
  • Conditional independence between rows (tasks) and
    between columns (feature dimensions) of W

25
Sparse covariance selection
  • Sparse covariance selection in matrix-normal
    penalties
  • Sparsity of the inverse row and column covariances
  • Conditional independence between rows (tasks) and
    between columns (feature dimensions) of W
  • Alternating optimization
  • Estimating W: same as before
  • Estimating the row and column covariances: L1-penalized
    covariance estimation
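The L1-penalized covariance estimation step can be sketched with an off-the-shelf graphical lasso solver (used here as a stand-in for the estimator in the thesis); zeros in the resulting precision matrix correspond to conditional independences between tasks:

```python
import numpy as np
from sklearn.covariance import graphical_lasso

# Toy parameter matrix W: 5 tasks x 40 features, with tasks 0 and 1 made similar.
rng = np.random.RandomState(0)
W = rng.randn(5, 40)
W[1] = W[0] + 0.1 * rng.randn(40)

emp_task_cov = np.cov(W) + 1e-3 * np.eye(5)        # empirical task (row) covariance
cov, precision = graphical_lasso(emp_task_cov, alpha=0.1)

# Zero entries of the precision (inverse covariance) encode conditional
# independence between the corresponding tasks.
print(np.round(precision, 2))
```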

26
Results on multi-task learning
  • Landmine detection: multiple landmine fields
  • Face recognition: multiple 1-vs-1 tasks

[1] Jacob, Bach, and Vert. Clustered multi-task
learning: a convex formulation. NIPS, 2008
[2] Argyriou, Evgeniou, and Pontil. Multi-task
feature learning. NIPS, 2006
27
Outline
Part I Encoding Input Information by
Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Learn compressible models
Projection penalties
Part II Encoding Output Information by Output
Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
28
Learning compressible models
  • Learning compressible models
  • A compression operator P in the penalty, instead of
    penalizing the coefficients w directly
  • Bias toward model compressibility

Yi Zhang, Jeff Schneider and Artur Dubrawski.
Learning Compressible Models. SDM 2010
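A minimal sketch of the resulting learning problem (my own proximal-gradient formulation, not the thesis code): penalize ||Pw||₁ with P an orthonormal DCT, so the estimate is biased toward coefficient vectors whose energy concentrates in a few frequencies rather than a few coordinates.

```python
import numpy as np
from scipy.fft import dct

def compressible_lasso(X, y, lam=5.0, n_iters=1000):
    """Proximal-gradient sketch for  min_w ||Xw - y||^2 + lam * ||P w||_1,
    with P an orthonormal 1-D DCT used as the compression operator."""
    d = X.shape[1]
    P = dct(np.eye(d), norm='ortho', axis=0)        # orthonormal DCT matrix
    lr = 0.9 / (2 * np.linalg.norm(X, 2) ** 2)      # safe step size for the smooth part
    z = np.zeros(d)                                 # optimize z = P w
    for _ in range(n_iters):
        w = P.T @ z                                 # w = P^{-1} z (P is orthonormal)
        z -= lr * (P @ (2 * X.T @ (X @ w - y)))     # gradient step on the data term
        z = np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)   # soft-threshold
    return P.T @ z

# Toy use: a smooth coefficient vector is sparse in the DCT domain.
rng = np.random.RandomState(0)
d = 64
w_true = np.sin(np.linspace(0, 3 * np.pi, d))
X = rng.randn(200, d)
y = X @ w_true + 0.1 * rng.randn(200)
w_hat = compressible_lasso(X, y)
print(np.sum(np.abs(dct(w_hat, norm='ortho')) > 1e-3), "active DCT coefficients")
```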
29
Energy compaction
  • Image energy is concentrated at a few frequencies

JPEG (2D-DCT), 46:1 compression
30
Energy compaction
  • Image energy is concentrated at a few frequencies
  • Models need to operate at relevant frequencies

JPEG (2D-DCT), 46:1 compression
2D-DCT
31
Digit recognition
  • Sparse vs. compressible
  • Model coefficients w

[Figure: sparse vs. compressible model coefficients w, the compressed coefficients Pw, and w displayed as an image]
32
Outline
Part I Encoding Input Information by
Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II Encoding Output Information by Output
Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
33
Dimension reduction
  • Dimension reduction conveys information about the
    input space
  • Feature selection → importance
  • Feature clustering → granularity
  • Feature extraction → more general structures

34
How to use a dimension reduction?
  • However, any reduction loses certain information
  • May be relevant to a prediction task
  • Goal of projection penalties
  • Encode useful information from a dimension
    reduction
  • Control the risk of potential information loss

Yi Zhang and Jeff Schneider. Projection Penalties:
Dimension Reduction without Loss. ICML 2010
35
Projection penalties the basic idea
  • The basic idea
  • Observation: reducing the feature space restricts
    the model search to a model subspace MP
  • Solution: still search in the full model space M,
    and penalize the projection distance to the model
    subspace MP

36
Projection penalties linear cases
  • Learn with a (linear) dimension reduction P

37
Projection penalties linear cases
  • Learn with projection penalties
  • Optimization

projection distance
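A rough sketch of the linear case under squared loss (my own closed-form variant, with the Euclidean norm as the projection distance): fit w in the full space, but add λ·min_v ||w − P'v||², the squared distance of w from the model subspace that the reduction P would allow.

```python
import numpy as np

def projection_penalty_least_squares(X, y, P, lam=1.0):
    """min_w ||Xw - y||^2 + lam * ||(I - Pi) w||^2, where Pi is the orthogonal
    projector onto the row space of the reduction P (p x d).  Sketch only."""
    d = X.shape[1]
    Pi = P.T @ np.linalg.pinv(P.T)          # projector onto the reduced model subspace
    R = np.eye(d) - Pi                      # residual (distance-to-subspace) operator
    return np.linalg.solve(X.T @ X + lam * R, X.T @ y)   # R is symmetric idempotent

# lam -> infinity recovers learning inside the reduced space only;
# lam -> 0 recovers ordinary least squares in the full model space.
```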
38
Projection penalties nonlinear cases
[Diagram: full model space M, model subspace MP induced by the dimension reduction P (mapping Rd to Rp), the projected model wP, and a feature map Φ from X to a feature space F for the nonlinear case]

Yi Zhang and Jeff Schneider. Projection Penalties:
Dimension Reduction without Loss. ICML 2010
39
Projection penalties nonlinear cases
[Diagram: the same construction carried out in the kernel-induced feature space F: model space M, model subspace MP, reduction P, and projected model wP]

Yi Zhang and Jeff Schneider. Projection Penalties:
Dimension Reduction without Loss. ICML 2010
40
Empirical results
  • Text classification (20 newsgroups), using
    logistic regression
  • Dimension reduction: latent Dirichlet allocation

[Chart: classification errors for the Original features, the Reduction alone, and the Projection Penalty]
41
Empirical results
  • Text classification (20 newsgroups), using
    logistic regression
  • Dimension reduction: latent Dirichlet allocation

[Chart: classification errors]
Similar results on face recognition, using SVM
(poly-2); dimension reduction: KPCA, KDA,
OLaplacianFace. Similar results on house price
prediction, using regression; dimension reduction:
PCA and partial least squares.
42
Outline
Part I Encoding Input Information by
Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II Encoding Output Information by Output
Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
43
Outline
Part I Encoding Input Information by
Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II Encoding Output Information by Output
Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
44
Multi-label classification
  • Multi-label classification
  • Existence of certain label dependency
  • Example: classify an image into scenes (desert,
    river, forest, etc.)
  • The multi-class problem is a special case: only one
    class is true

[Diagram: predict labels y1, ..., yq from x; the labels exhibit dependency]
45
Output coding
  • d < q: compression, i.e., source coding
  • d > q: error-correcting codes, i.e., channel
    coding
  • Use the redundancy to correct prediction
    (transmission) errors

[Diagram: encode the labels y = (y1, ..., yq) into a code z = (z1, ..., zd), learn to predict z from x, then decode the prediction back to y]
46
Error-correcting output codes (ECOCs)
  • Multi-class ECOCs [Dietterich & Bakiri, 1994;
    Allwein, Schapire & Singer, 2001]
  • Encode into a (redundant) set of binary problems
  • Learn to predict the code
  • Decode the predictions
  • Our goal: design ECOCs for multi-label
    classification

[Diagram: binary subproblems such as y1, y2 vs. y3, and y3,y4 vs. y7 define the code bits z1, ..., zt; learn to predict them from x and decode back to y1, ..., yq]
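A small sketch of this multi-class ECOC recipe (toy data and a hand-picked code matrix; nothing here is from the thesis): one binary classifier per code bit, with decoding by nearest codeword in Hamming distance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ecoc(X, y, code):
    """code: (n_classes, n_bits) binary matrix; train one classifier per bit."""
    return [LogisticRegression(max_iter=1000).fit(X, code[y, b])
            for b in range(code.shape[1])]

def decode_ecoc(X, classifiers, code):
    """Predict every bit, then return the class whose codeword is nearest."""
    bits = np.column_stack([clf.predict(X) for clf in classifiers])
    hamming = np.abs(bits[:, None, :] - code[None, :, :]).sum(axis=2)
    return hamming.argmin(axis=1)

# Toy 4-class problem with a redundant 6-bit code (pairwise Hamming distance 4),
# so a single wrongly predicted bit can still be corrected at decoding time.
code = np.array([[0, 0, 0, 1, 1, 1],
                 [0, 1, 1, 0, 0, 1],
                 [1, 0, 1, 0, 1, 0],
                 [1, 1, 0, 1, 0, 0]])
rng = np.random.RandomState(0)
X = rng.randn(400, 5)
y = X[:, :4].argmax(axis=1)
classifiers = train_ecoc(X, y, code)
print("accuracy:", (decode_ecoc(X, classifiers, code) == y).mean())
```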
47
Outline
Part I Encoding Input Information by
Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II Encoding Output Information by Output
Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
48
Composite likelihood
  • The composite likelihood (CL): a partial
    specification of the likelihood as the product of
    simple component likelihoods
  • e.g., pairwise likelihood
  • e.g., full conditional likelihood
  • Estimation using composite likelihoods
  • Computational and statistical efficiency
  • Robustness under model misspecification

49
Multi-label problem decomposition
  • Problem decomposition methods
  • Decomposition into subproblems (encoding)
  • Decision making by combining subproblem
    predictions (decoding)
  • Examples: 1-vs-all, 1-vs-1, 1-vs-1 + 1-vs-all,
    etc.





50
1-vs-All (Binary Relevance)
  • Classify each label independently
  • The composite likelihood view

51
Pairwise label ranking [1]

[Diagram: pairwise subproblems y1 vs. y2, y1 vs. y3, ..., yq-1 vs. yq, each learned from x]
  • 1-vs-1 method (a.k.a. pairwise label ranking)
  • Subproblems: pairwise label comparisons
  • Decision making: label ranking by counting the
    number of winning comparisons, then thresholding

[1] Hüllermeier et al. Artif. Intell., 2008
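A sketch of the 1-vs-1 decomposition for multi-label data (the thresholding here is a plain cutoff on the normalized vote count, a simplification of the calibrated version on the later slides; all names are mine):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_pairwise(X, Y):
    """For each label pair (i, j), train on examples where exactly one of the
    two labels is relevant; the target says whether label i 'beats' label j."""
    q, models = Y.shape[1], {}
    for i in range(q):
        for j in range(i + 1, q):
            mask = Y[:, i] != Y[:, j]
            if len(np.unique(Y[mask, i])) == 2:            # need both outcomes present
                models[(i, j)] = LogisticRegression(max_iter=1000).fit(X[mask], Y[mask, i])
    return models

def predict_pairwise(X, models, q, threshold=0.5):
    """Count the comparisons each label wins, then threshold the vote fraction."""
    votes = np.zeros((X.shape[0], q))
    for (i, j), clf in models.items():
        wins_i = clf.predict(X)                            # 1 means label i beats label j
        votes[:, i] += wins_i
        votes[:, j] += 1 - wins_i
    return votes / max(q - 1, 1) >= threshold
```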
52
Pairwise label ranking [1]

[Diagram: pairwise subproblems y1 vs. y2, y1 vs. y3, ..., yq-1 vs. yq, each learned from x]
  • 1-vs-1 method (a.k.a. pairwise label ranking)
  • Subproblems: pairwise label comparisons
  • Decision making: label ranking by counting the
    number of winning comparisons, then thresholding
  • The composite likelihood view

[1] Hüllermeier et al. Artif. Intell., 2008
53
Calibrated label ranking [2]

[Diagram: pairwise subproblems plus per-label (1-vs-all) subproblems, each learned from x]
  • 1-vs-1 + 1-vs-all (a.k.a. calibrated label
    ranking)
  • Subproblems: 1-vs-1 + 1-vs-all
  • Decision making: label ranking, and a smart
    thresholding based on 1-vs-1 and 1-vs-all
    predictions

[2] Fürnkranz et al. MLJ, 2008
54
Calibrated label ranking [2]

[Diagram: pairwise subproblems plus per-label (1-vs-all) subproblems, each learned from x]
  • 1-vs-1 + 1-vs-all (a.k.a. calibrated label
    ranking)
  • Subproblems: 1-vs-1 + 1-vs-all
  • Decision making: label ranking, and a smart
    thresholding based on 1-vs-1 and 1-vs-all
    predictions
  • The composite likelihood view

[2] Fürnkranz et al. MLJ, 2008
55
A composite likelihood view
  • A composite likelihood view for problem
    decomposition
  • Choice of subproblems → specification of a
    composite likelihood?
  • Decision making → inference on the composite
    likelihood?





56
A composite pairwise coding
  • Subproblems: individual and pairwise label
    densities
  • A pairwise density conveys more information
    than the two marginal densities

[Table: the four joint configurations of a label pair yi, yj: (0,0), (0,1), (1,0), (1,1)]
Yi Zhang and Jeff Schneider. A Composite
Likelihood View for Multi-Label Classification.
AISTATS 2012
57
A composite pairwise coding
  • Decision making: a robust mean-field
    approximation
  • The standard mean-field objective is not robust to
    underestimation of label densities

Yi Zhang and Jeff Schneider. A Composite
Likelihood View for Multi-Label Classification.
AISTATS 2012
58
A composite pairwise coding
  • Decision making: a robust mean-field
    approximation
  • The standard mean-field objective is not robust to
    underestimation of label densities
  • A composite divergence, robust and efficient to
    optimize

Yi Zhang and Jeff Schneider. A Composite
Likelihood View for Multi-Label Classification.
AISTATS 2012
59
Data sets
  • The Scene data
  • Image → scenes (beach, sunset, fall foliage,
    field, mountain and urban)

  • [Example image] → beach, urban

Boutell et al., Pattern Recognition 2004
60
Data sets
  • The Emotion data
  • Music → emotions (amazed, happy, relaxed, sad,
    etc.)
  • The Medical data
  • Clinical text → medical categories (ICD-9-CM
    codes)
  • The Yeast data
  • Gene → functional categories
  • The Enron data
  • Email → tags on topics, attachment types, and
    emotional tones

61
Empirical results
  • Similar results on other data sets (emotions,
    medical, etc)

[1] Hüllermeier et al. Label ranking by learning
pairwise preferences. Artif. Intell., 2008
[2] Fürnkranz et al. Multi-label classification
via calibrated label ranking. MLJ, 2008
[3] Read et al. Classifier chains for
multi-label classification. ECML, 2009
[4] Tsoumakas et al. Random k-labelsets: an
ensemble method for multilabel classification.
ECML, 2007
[5] Zhang et al. Multi-label learning by
exploiting label dependency. KDD, 2010
62
Outline
Part I Encoding Input Information by
Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II Encoding Output Information by Output
Codes
problem-dependent coding and code predictability
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
63
Multi-label output coding
  • Design output coding for multi-label problems
  • Problem-dependent encodings to exploit label
    dependency
  • Code predictability
  • Propose multi-label ECOCs via CCA

64
Canonical correlation analysis
  • Given paired inputs and labels, CCA finds
    projection directions in the two spaces
  • with maximum correlation between the projections

65
Canonical correlation analysis
  • Given paired inputs and labels, CCA finds
    projection directions in the two spaces
  • with maximum correlation between the projections
  • Also known as the most predictable criterion
  • CCA finds the most predictable directions v in the
    label space
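A minimal sketch of this criterion with scikit-learn's CCA on toy multi-label data (illustrative only): the correlation of each pair of canonical variates measures how predictable that direction of the label space is from the input.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Toy data: 4 labels, three of which are driven by the same input signal.
rng = np.random.RandomState(0)
X = rng.randn(300, 10)
signal = X[:, 0] + 0.5 * X[:, 1]
Y = (np.column_stack([signal, signal, -signal, rng.randn(300)])
     + 0.5 * rng.randn(300, 4) > 0).astype(float)

cca = CCA(n_components=2).fit(X, Y)
Xc, Yc = cca.transform(X, Y)
for k in range(2):
    # High correlation = a highly predictable direction in the label space.
    print("component", k, "correlation:",
          round(float(np.corrcoef(Xc[:, k], Yc[:, k])[0, 1]), 3))
```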

66
Multi-label ECOCs using CCA
  • Encoding and learning
  • Perform CCA
  • Code includes both original labels and label
    projections

Yi Zhang and Jeff Schneider. Multi-label Output
Codes using Canonical Correlation Analysis.
AISTATS 2011
67
Multi-label ECOCs using CCA
  • Encoding and learning
  • Perform CCA
  • Code includes both original labels and label
    projections
  • Learn classifiers for original labels
  • Learn regression for label projections

Yi Zhang and Jeff Schneider. Multi-label Output
Codes using Canonical Correlation Analysis.
AISTATS 2011
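A sketch of the encoding and learning stage as described on these slides (the probabilistic decoding with Bernoulli and Gaussian potentials on the following slides is omitted; helper names are mine):

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LogisticRegression, Ridge

def fit_cca_output_code(X, Y, d=2):
    """Code = original labels plus d CCA label projections.
    Classifiers predict the labels; regressors predict the projections."""
    cca = CCA(n_components=d).fit(X, Y)
    Z = cca.transform(X, Y)[1]                       # label-space projections
    label_clfs = [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
                  for j in range(Y.shape[1])]        # assumes each label takes both values
    proj_regs = [Ridge(alpha=1.0).fit(X, Z[:, k]) for k in range(d)]
    return label_clfs, proj_regs

def predict_codeword(X, label_clfs, proj_regs):
    """Predicted codeword: per-label probabilities plus projection estimates.
    A decoder would then reconcile the two parts (e.g., by mean-field inference)."""
    probs = np.column_stack([c.predict_proba(X)[:, 1] for c in label_clfs])
    projs = np.column_stack([r.predict(X) for r in proj_regs])
    return probs, projs
```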
68
Multi-label ECOCs using CCA
  • Decoding
  • Classifiers: Bernoulli models of the q
    original labels
  • Regression: Gaussian models of the d
    label projections

Yi Zhang and Jeff Schneider. Multi-label Output
Codes using Canonical Correlation Analysis.
AISTATS 2011
69
Multi-label ECOCs using CCA
  • Decoding
  • Classifiers: Bernoulli models of the q
    original labels
  • Regression: Gaussian models of the d
    label projections
  • Mean-field approximation

Yi Zhang and Jeff Schneider. Multi-label Output
Codes using Canonical Correlation Analysis.
AISTATS 2011
70
Empirical results
  • Similar results on other criteria (macro/micro
    F-1 scores)
  • Similar results on other data (emotions)
  • Similar results on other base learners (decision
    trees, SVMs)

[1] Fürnkranz et al. Multi-label classification
via calibrated label ranking. MLJ, 2008
[2] D. Hsu et al. Multi-label prediction via
compressed sensing. NIPS, 2009
[3] Zhang and Schneider. A composite likelihood
view for multi-label classification. AISTATS 2012
71
Outline
Part I Encoding Input Information by
Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II Encoding Output Information by Output
Codes
problem-dependent coding and code predictability
Composite likelihood for pairwise coding
Discriminative and predictable codes
Multi-label output codes with CCA
Maximum-margin output coding
72
Recall coding with CCA
  • CCA finds label projections z that are most
    predictable
  • Low transmission errors in channel coding

73
A recent paper [1]: coding with PCA
  • Label projections z obtained by PCA
  • z has maximum sample variance, i.e., the codewords
    are far away from each other
  • Minimum code distance?

[1] Tai and Lin, 2010
74
Goal: predictable and discriminative codes
  • Predictable: the prediction is close to the correct
    codeword
  • Discriminative: the prediction is far away from
    incorrect codewords

75
Maximum margin output coding
  • A max-margin formulation

76
Maximum margin output coding
  • A max-margin formulation
  • Assume M is the best linear predictor (in closed form
    in terms of X, Y, V)
  • Reformulate using metric learning
  • Deal with the exponentially large number of
    constraints
  • The cutting plane method
  • Overgenerating

77
Maximum margin output coding
  • A max-margin formulation
  • Assume M is the best linear predictor, and define

78
Maximum margin output coding
  • A max-margin formulation
  • Metric learning formulation: define the
    Mahalanobis metric
  • and the notation

79
Maximum margin output coding
  • The metric learning problem
  • An exponentially large number of constraints
  • Cutting plane method? No polynomial-time
    separation oracle!

80
Maximum margin output coding
  • The metric learning problem
  • An exponentially large number of constraints
  • Cutting plane method? No polynomial-time
    separation oracle!
  • Cutting plane method with overgenerating
    (relaxation)
  • Relax the discrete domain into a continuous one
  • Linearize for the relaxed domain
  • New separation oracle: a box-constrained QP

81
Empirical results
  • Similar results on other data (emotions and
    medical)

[1] Fürnkranz et al. Multi-label classification
via calibrated label ranking. MLJ, 2008
[2] Zhang et al. Multi-label learning by
exploiting label dependency. KDD, 2010
[3] D. Hsu et al. Multi-label prediction via
compressed sensing. NIPS, 2009
[4] Tai and Lin. Multi-label Classification with
Principal Label Space Transformation. Neural Computation.
[5] Zhang and Schneider. Multi-label output codes
via canonical correlation analysis. AISTATS 2011
82
Conclusion
  • Regularization to exploit input information
  • Semi-supervised learning with word correlation
  • Multi-task learning with a matrix-normal penalty
  • Learning compressible models
  • Projection penalties for dimension reduction
  • Output coding to exploit output information
  • Composite pairwise coding
  • Coding via CCA
  • Coding via max-margin formulation
  • Future

83
Thank you! Questions?
Part I Encoding Input Information by
Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II Encoding Output Information by Output
Codes
problem-dependent coding and code predictability
Composite likelihood for pairwise coding
Discriminative and predictable codes
Multi-label output codes with CCA
Maximum-margin output coding
84
(No Transcript)
85
(No Transcript)
86
Local smoothness
  • Smoothness of model coefficients
  • Key property: a certain order of derivatives is
    sparse

Differentiation operator
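A minimal sketch of such an operator (my own construction): a discrete first-difference matrix D, so that penalizing ||Dw||₁ (fused-lasso style) makes the derivatives of the coefficient sequence sparse and the coefficients piecewise constant; composing D gives higher-order smoothness.

```python
import numpy as np

def difference_operator(d, order=1):
    """(d - order) x d discrete differentiation operator: (Dw)[i] = w[i+1] - w[i],
    composed with itself for higher orders."""
    D = (np.eye(d, k=1) - np.eye(d))[:-1]
    for _ in range(order - 1):
        m = D.shape[0]
        D = (np.eye(m, k=1) - np.eye(m))[:-1] @ D
    return D

# A piecewise-constant-then-linear coefficient vector has sparse derivatives.
w = np.concatenate([np.zeros(5), np.ones(5), np.linspace(1, 0, 5)])
D1 = difference_operator(len(w), order=1)
print(np.sum(np.abs(D1 @ w) > 1e-8), "nonzero first differences out of", D1.shape[0])
```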
87
Brain-computer interaction
  • Classify electroencephalography (EEG) signals
  • Sparse models vs. piecewise
    smooth models

88
Projection penalties linear cases
  • Learn a linear model with a given linear
    reduction P

89
Projection penalties linear cases
  • Learn a linear model with a given linear
    reduction P

90
Projection penalties linear cases
  • Learn a linear model with projection penalties

projection distance
91
Projection penalties RKHS cases
  • Learning in RKHS with projection penalties
  • Primal
  • Solve for w in the dual (see the next page)
  • Solve for v and b in the primal

92
Projection penalties RKHS cases
  • Representer theorem for w
  • Dual

93
Projection penalties nonlinear cases
  • Learning linear models
  • Learning RKHS models

94
Empirical results
  • Face recognition (Yale), using SVM (poly-2)
  • Dimension reduction: KPCA, KDA, OLaplacianFace

Classification Errors
95
Empirical results
  • Face recognition (Yale), using SVM (poly-2)
  • Dimension reduction: KPCA, KDA, OLaplacianFace

Classification Errors
96
Empirical results
  • Face recognition (Yale), SVM (poly-2)
  • Dimension reduction: KPCA, KDA, OLaplacianFace

Classification Errors
97
Empirical results
  • Price forecasting (Boston housing), using ridge
    regression
  • Dimension reduction: partial least squares

1-R2
98
Binary relevance
  • Binary relevance (a.k.a. 1-vs-all)
  • Subproblems: classify each label independently
  • Decision making: same
  • Assume no label dependency

99
Binary relevance
  • Binary relevance (a.k.a. 1-vs-all)
  • Subproblems: classify each label independently
  • Decision making: same
  • Assume no label dependency
  • The composite likelihood view

100
Empirical results
  • Emotion data (classify music into different
    emotions)
  • Evaluation measure: subset accuracy

[1] Fürnkranz et al. Multi-label classification
via calibrated label ranking. MLJ, 2008
[2] D. Hsu et al. Multi-label prediction via
compressed sensing. NIPS, 2009