1
Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
  • Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller
  • Stanford AI Lab

2
Motivation
Problem: With few instances, learned models aren't robust
Task: Shape modeling
[Figure: mean shape with principal components at ±1 std.]
3
Transfer Learning
The shape is stabilized, but it doesn't look like an elephant.
Can we use rhinos to help elephants?
[Figure: mean shape with principal components at ±1 std.]
4
Hierarchical Bayes
[Diagram: two-level hierarchy. Root prior P(θroot); class parameters θElephant and θRhino drawn from P(θElephant | θroot) and P(θRhino | θroot); data generated by P(DataElephant | θElephant) and P(DataRhino | θRhino).]
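As a concrete reading of this factorization, here is a minimal sketch assuming Gaussian conditionals at every level (a hypothetical instantiation; the slides do not fix a parametric form):

```python
import numpy as np

# Hypothetical Gaussian instantiation of the two-level hierarchy:
#   theta_root ~ N(0, s0^2 I)
#   theta_c | theta_root ~ N(theta_root, s^2 I)  for c in {Rhino, Elephant}
#   x | theta_c ~ N(theta_c, I)                  per instance in Data_c
def log_joint(theta_root, thetas, data, s0=10.0, s=1.0):
    """Log of P(theta_root) * prod_c P(theta_c|theta_root) P(Data_c|theta_c),
    up to additive constants."""
    lp = -0.5 * np.dot(theta_root, theta_root) / s0**2
    for c, theta_c in thetas.items():
        lp -= 0.5 * np.sum((theta_c - theta_root) ** 2) / s**2
        lp -= 0.5 * np.sum((data[c] - theta_c) ** 2)
    return lp
```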
5
Goals
  • Transfer between related classes
  • Range of settings, tasks
  • Probabilistic motivation
  • Multilevel, complex hierarchies
  • Simple, efficient computation
  • Automatically learn what to transfer

6
Hierarchical Bayes
  • Compute the full posterior P(Θ | D)
  • P(θc | θroot) must be conjugate to P(Dc | θc)

Problem: Often we can't perform the full Bayesian computations
7
Approx.: Point Estimation
The best parameters are good enough; we don't need the full distribution
  • Empirical Bayes
  • Point estimation
  • Other approximations (posterior approximated as a normal, sampling, etc.)

8
More Issues: Multiple Levels
  • Conjugate priors usually can't be extended to multiple levels (e.g., Dirichlet, inverse-Wishart)
  • Exception: Thibaux and Jordan (05)

9
More Issues: Restrictive Priors
  • Example: inverse-Wishart
  • Pseudocount restriction: ν > d (ν pseudocounts, d dimensions)
  • If d is large and N is small, the signal from the prior overwhelms the data
  • We show experiments with N = 3, d = 20 (N samples, d dimensions)
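To see the pseudocount arithmetic concretely, a small sketch using the standard inverse-Wishart conjugate update (the numbers N = 3, d = 20 are from the slide; the prior scale is an illustrative choice):

```python
import numpy as np

# Inverse-Wishart prior IW(Psi, nu) over a d x d covariance requires nu > d
# (and nu > d + 1 for the prior mean to exist). The posterior mean is
# (Psi + S) / (nu + N - d - 1): the prior acts like ~nu pseudo-samples.
d, N = 20, 3
nu = d + 2                          # smallest df with a finite prior mean
Psi = (nu - d - 1) * np.eye(d)      # chosen so the prior mean is I

rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc                       # scatter matrix of the data

post_mean = (Psi + S) / (nu + N - d - 1)
print(f"prior pseudocounts ~{nu} vs. N={N} samples")  # prior dominates
```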
10
Alternative: Shrinkage
  • McCallum et al. (98)
  • 1. Compute maximum likelihood at each node
  • 2. Shrink each node toward its parent
  • Linear combination of θ and θparent
  • Uses cross-validation to choose α
  • Pros
  • Simple to compute
  • Handles multiple levels
  • Cons
  • Naive heuristic for transfer
  • Averaging not always appropriate

θ ← α·θ + (1 − α)·θparent
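A minimal sketch of this two-step procedure as described on the slide (illustrative code, not the authors'; `parent_of` maps each node to its parent, with `None` at the root):

```python
import numpy as np

def shrinkage(theta_ml, parent_of, alpha):
    """McCallum-style shrinkage: keep the root's maximum-likelihood
    estimate, and pull every other node toward its parent's ML estimate."""
    shrunk = {}
    for node, theta in theta_ml.items():
        parent = parent_of[node]
        if parent is None:
            shrunk[node] = theta                      # root keeps its ML estimate
        else:
            shrunk[node] = alpha * theta + (1 - alpha) * theta_ml[parent]
    return shrunk

# alpha is chosen by cross-validation in the original method; a fixed
# value here just illustrates the interpolation.
params = {"root": np.array([0.5]), "elephant": np.array([0.9])}
tree = {"root": None, "elephant": "root"}
print(shrinkage(params, tree, alpha=0.7))   # elephant pulled toward root
```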
11
Undirected HB Reformulation
Probabilistic Abstraction Hierarchies (Segal et al. 01) define an undirected Markov random field over (Θ, D).
12
Undirected Probabilistic Model
[Diagram: θroot linked to θRhino and θElephant by Divergence edges with weights β (high or low); each class linked to its data by an Fdata potential.]
Divergence: encourages parameters to be similar to their parents
Fdata: encourages parameters to explain the data
13
Purpose of Reformulation
  • Easy to specify
  • Fdata can be a likelihood, classification, or other objective
  • Divergence can be L1 distance, L2 distance, ε-insensitive loss, KL divergence, etc.
  • No conjugacy or proper prior restrictions
  • Easy to optimize
  • Convex over Θ if Fdata is concave and Divergence is convex (a minimal sketch follows)
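A minimal sketch of such an objective, assuming a Gaussian-style (concave) Fdata and an L2 divergence, one illustrative choice among those listed above:

```python
import numpy as np

def objective(thetas, data, parent_of, beta):
    """F_data(Theta) minus weighted divergences along hierarchy edges.
    Concave F_data plus convex divergence => maximizing this is a
    convex optimization problem."""
    val = 0.0
    for node, theta in thetas.items():
        if node in data:                               # leaf data term
            val -= 0.5 * np.sum((data[node] - theta) ** 2)
        parent = parent_of[node]
        if parent is not None:                         # L2 divergence edge
            val -= beta * np.sum((theta - thetas[parent]) ** 2)
    return val
```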

14
Application: Text Categorization
  • Task: Categorize documents
  • Bag-of-words model
  • Fdata: multinomial log likelihood (regularized)
  • µi represents the frequency of word i
  • Divergence: L2 norm

Newsgroup20 dataset
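A minimal sketch of this instantiation for a single node (variable names are mine; µ is a class's word-frequency vector on the simplex):

```python
import numpy as np

def node_objective(counts, mu, mu_parent, lam=1.0, eps=1e-12):
    """F_data: multinomial log likelihood of the class's word counts
    (up to the multinomial coefficient); Divergence: L2 norm to parent."""
    loglik = np.sum(counts * np.log(mu + eps))
    divergence = np.sum((mu - mu_parent) ** 2)
    return loglik - lam * divergence

# Toy usage with a 5-word vocabulary.
counts = np.array([3, 0, 1, 2, 0])
mu = np.array([0.4, 0.1, 0.2, 0.2, 0.1])
mu_parent = np.array([0.3, 0.2, 0.2, 0.2, 0.1])
print(node_objective(counts, mu, mu_parent))
```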
15
Baselines
  • 1. Maximum likelihood at each node (no hierarchy)
  • 2. Cross-validate regularization (no hierarchy)
  • 3. Shrinkage (McCallum et al. 98, with hierarchy)

[Diagram: flat baselines with separate θA, θB vs. shrinkage toward the parent: θ ← α·θ + (1 − α)·θparent.]
16
Can It Handle Multiple Levels?
Newsgroup Topic Classification
[Chart: classification rate (0.35–0.7) vs. total number of training instances (75–375) for Max Likelihood (no regularization), Shrinkage, Regularized Max Likelihood, and Undirected HB.]
17
Application: Shape Modeling
Mammals Dataset (Fink, 05)
  • Task: Learn shape (density estimation; test likelihood)
  • Instances represented by 60 x-y coordinates of landmarks on the outline
  • Divergence: L2 norm over mean and variance

[Figure: Fdata is a Gaussian with a mean landmark location and a covariance over landmarks, plus regularization; mean shape and principal components shown.]
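A minimal sketch of the shape Fdata under these choices (a Gaussian over the stacked landmark coordinates; the ridge term stands in for the regularization mentioned above):

```python
import numpy as np

def shape_log_likelihood(X, mean, cov, ridge=1e-3):
    """Gaussian log likelihood of shapes. X: (instances, 120) stacked x-y
    coordinates of 60 landmarks; mean: (120,); cov: (120, 120)."""
    d = cov.shape[0]
    cov = cov + ridge * np.eye(d)              # regularize the covariance
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return -0.5 * np.sum(quad + logdet + d * np.log(2 * np.pi))
```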
18
Does Hierarchy Help?
Mammal Pairs
[Chart: delta log-loss per instance vs. total number of training instances (6–30), relative to Regularized Max Likelihood, for ten mammal pairs: Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino.]
Unregularized max likelihood and shrinkage: much worse, not shown
19
Transfer
Not all parameters deserve equal sharing
20
Degrees of Transfer
Split θ into subcomponents µi, each with its own transfer weight λi. This allows different transfer strengths for different subcomponents and child-parent pairs (see the sketch below). How do we estimate all these weights?
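A minimal sketch of the resulting divergence term, with θ split into subcomponents µi and one weight λi per subcomponent (illustrative names):

```python
import numpy as np

def weighted_divergence(mu_child, mu_parent, lam):
    """Sum_i lambda_i * ||mu_i^child - mu_i^parent||^2: each subcomponent
    gets its own degree of transfer toward the parent."""
    return sum(l * np.sum((c - p) ** 2)
               for c, p, l in zip(mu_child, mu_parent, lam))

# Toy usage: the first subcomponent transfers strongly, the second weakly.
mu_child = [np.array([1.0, 2.0]), np.array([0.5])]
mu_parent = [np.array([1.1, 1.9]), np.array([2.0])]
print(weighted_divergence(mu_child, mu_parent, lam=[10.0, 0.1]))
```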
21
Learning Degrees of Transfer
  • Bootstrap approach
  • If µi^child and µi^parent have a consistent relationship, we want to encourage them to be similar
  • Define d = µi^child − µi^parent
  • We want to estimate the variance of d across all possible datasets
  • Select random subsets of the data
  • Let σ² be the empirical variance of d over these subsets
  • If µi^child and µi^parent have a consistent relationship (low variance), λi = 1/σ² strongly encourages similarity (a sketch follows this list)
  • Two routes overall: the bootstrap approach above, and the hyper-prior approach (next slides)
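A minimal sketch of the bootstrap heuristic under a simplifying assumption (each subcomponent is a coordinate mean, so it can be re-estimated on a subset by averaging; data arrays are 2-D, instances by dimensions):

```python
import numpy as np

def bootstrap_lambda(data_child, data_parent, n_boot=50, seed=0):
    """Resample the data, re-estimate child and parent subcomponents,
    and set lambda = 1 / empirical variance of d = mu_child - mu_parent."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        sub_c = rng.choice(data_child, size=len(data_child), replace=True)
        sub_p = rng.choice(data_parent, size=len(data_parent), replace=True)
        diffs.append(sub_c.mean(axis=0) - sub_p.mean(axis=0))
    var = np.var(np.asarray(diffs), axis=0)
    return 1.0 / (var + 1e-12)   # consistent difference -> strong transfer
```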

22
Degrees of Transfer
  • Bootstrap approach
  • If we use an L2 norm for our Div terms, λ·‖θchild − θparent‖² resembles a product of Gaussian priors with variance 1/λ
  • If we were to fix the value of λ (e.g., via the bootstrap estimate): empirical Bayes
  • Undirected empirical Bayes estimation

L2 norm ↔ Gaussian prior; λ is the degree of transfer
23
Learning Degrees of Transfer
  • Bootstrap approach: if µi^child and µi^parent have a consistent relationship, we want to encourage them to be similar
  • Hyper-prior approach
  • Bayesian idea: put a prior on λ
  • Add λ as a parameter to the optimization along with Θ
  • Concretely: an inverse-Gamma prior (so λ is forced to be positive)

If the likelihood is concave, the entire objective remains convex (a sketch follows).
[Figure: prior on the degree of transfer.]
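A minimal sketch of the hyper-prior variant for a single edge (the inverse-Gamma shape and scale a, b are illustrative choices, not values from the slides):

```python
import numpy as np

def log_inv_gamma(lam, a=2.0, b=1.0):
    """Log density of InvGamma(a, b) up to an additive constant;
    it tends to -inf as lam -> 0+, keeping lambda positive."""
    return -(a + 1) * np.log(lam) - b / lam

def edge_objective(f_data, mu_child, mu_parent, lam):
    """Data term minus the lambda-weighted L2 divergence plus the
    hyper-prior on lambda; lambda is optimized jointly with Theta."""
    divergence = lam * np.sum((mu_child - mu_parent) ** 2)
    return f_data - divergence + log_inv_gamma(lam)
```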
24
Do Degrees of Transfer Help?
Mammal Pairs
[Chart: delta log-loss per instance (−15 to 15) vs. total number of training instances (6–30), Hyperprior relative to Regularized Max Likelihood, for the ten mammal pairs.]
25
Do Degrees of Transfer Help?
Bison-Rhino
  • Const: no degrees of transfer
  • Hyperprior, Bootstrap: use degrees of transfer

[Chart: Bison-Rhino, delta log-loss per instance (−15 to 10) vs. total number of training instances (6–30) for Hyperprior, Bootstrap, RegML, and Const.]
26
Do Degrees of Transfer Help?
Llama-Rhino
Const: >100 bits worse (not shown)
[Chart: Llama-Rhino, delta log-loss per instance (0–8) vs. total number of training instances (6–30) for Hyperprior, Bootstrap, and RegML.]
27
Degrees of Transfer
Distribution of DOT coefficients using the Hyperprior
[Histogram: counts of degree-of-transfer coefficients 1/λ, from 0 to 50; small 1/λ means stronger transfer, large 1/λ means weaker transfer.]
28
Summary
  • Transfer between related classes
  • Range of settings, tasks
  • Probabilistic motivation
  • Multilevel, complex hierarchies
  • Simple, efficient computation
  • Refined transfer of components

29
Future Work
  • Non-tree hierarchies (multiple inheritance), e.g., the Gene Ontology (GO) network or the WordNet hierarchy
  • Block degrees of transfer
  • Structure learning
  • Part discovery

The general undirected model doesn't require a tree structure.