1
Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
  • Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller
  • Stanford AI Lab

2
Motivation
Problem: With few instances, learned models aren't robust
Task: Shape modeling
[Figure: mean shape with principal components at ±1 std.]
3
Transfer Learning
The shape is stabilized, but it doesn't look like an elephant.
Can we use rhinos to help elephants?
[Figure: mean shape with principal components at ±1 std.]
4
Hierarchical Bayes
[Diagram: two-level hierarchy. Root prior P(θroot); class parameters θElephant and θRhino drawn from P(θElephant | θroot) and P(θRhino | θroot); data generated by P(DataElephant | θElephant) and P(DataRhino | θRhino).]
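As a concrete reading of this factorization, here is a minimal sketch assuming Gaussian conditionals at every level (a hypothetical instantiation; the slides do not fix a parametric form):

```python
import numpy as np

# Hypothetical Gaussian instantiation of the two-level hierarchy:
#   theta_root ~ N(0, s0^2 I)
#   theta_c | theta_root ~ N(theta_root, s^2 I)  for c in {Rhino, Elephant}
#   x | theta_c ~ N(theta_c, I)                  per instance in Data_c
def log_joint(theta_root, thetas, data, s0=10.0, s=1.0):
    """Log of P(theta_root) * prod_c P(theta_c|theta_root) P(Data_c|theta_c),
    up to additive constants."""
    lp = -0.5 * np.dot(theta_root, theta_root) / s0**2
    for c, theta_c in thetas.items():
        lp -= 0.5 * np.sum((theta_c - theta_root) ** 2) / s**2
        lp -= 0.5 * np.sum((data[c] - theta_c) ** 2)
    return lp
```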
5
Goals
  • Transfer between related classes
  • Range of settings, tasks
  • Probabilistic motivation
  • Multilevel, complex hierarchies
  • Simple, efficient computation
  • Automatically learn what to transfer

6
Hierarchical Bayes
  • Compute the full posterior P(Θ | D)
  • P(θc | θroot) must be conjugate to P(Dc | θc)

Problem: Often we can't perform the full Bayesian computations
7
Approx.: Point Estimation
The best parameters are good enough; we don't need the full distribution
  • Empirical Bayes
  • Point estimation
  • Other approximations (posterior approximated as a normal, sampling, etc.)

8
More Issues: Multiple Levels
  • Conjugate priors usually can't be extended to multiple levels (e.g., Dirichlet, inverse-Wishart)
  • Exception: Thibaux and Jordan (05)

9
More Issues: Restrictive Priors
  • Example: inverse-Wishart
  • Pseudocount restriction: ν > d (ν pseudocounts, d dimensions)
  • If d is large and N is small, the signal from the prior overwhelms the data
  • We show experiments with N = 3, d = 20 (N samples, d dimensions)
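To see the pseudocount arithmetic concretely, a small sketch using the standard inverse-Wishart conjugate update (the numbers N = 3, d = 20 are from the slide; the prior scale is an illustrative choice):

```python
import numpy as np

# Inverse-Wishart prior IW(Psi, nu) over a d x d covariance requires nu > d
# (and nu > d + 1 for the prior mean to exist). The posterior mean is
# (Psi + S) / (nu + N - d - 1): the prior acts like ~nu pseudo-samples.
d, N = 20, 3
nu = d + 2                          # smallest df with a finite prior mean
Psi = (nu - d - 1) * np.eye(d)      # chosen so the prior mean is I

rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc                       # scatter matrix of the data

post_mean = (Psi + S) / (nu + N - d - 1)
print(f"prior pseudocounts ~{nu} vs. N={N} samples")  # prior dominates
```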
10
Alternative: Shrinkage
  • McCallum et al. (98)
  • 1. Compute maximum likelihood at each node
  • 2. Shrink each node toward its parent
  • Linear combination of θ and θparent
  • Uses cross-validation to choose α
  • Pros
  • Simple to compute
  • Handles multiple levels
  • Cons
  • Naive heuristic for transfer
  • Averaging not always appropriate

θ ← α·θ + (1 − α)·θparent
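A minimal sketch of this two-step procedure as described on the slide (illustrative code, not the authors'; `parent_of` maps each node to its parent, with `None` at the root):

```python
import numpy as np

def shrinkage(theta_ml, parent_of, alpha):
    """McCallum-style shrinkage: keep the root's maximum-likelihood
    estimate, and pull every other node toward its parent's ML estimate."""
    shrunk = {}
    for node, theta in theta_ml.items():
        parent = parent_of[node]
        if parent is None:
            shrunk[node] = theta                      # root keeps its ML estimate
        else:
            shrunk[node] = alpha * theta + (1 - alpha) * theta_ml[parent]
    return shrunk

# alpha is chosen by cross-validation in the original method; a fixed
# value here just illustrates the interpolation.
params = {"root": np.array([0.5]), "elephant": np.array([0.9])}
tree = {"root": None, "elephant": "root"}
print(shrinkage(params, tree, alpha=0.7))   # elephant pulled toward root
```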
11
Undirected HB Reformulation
Probabilistic Abstraction Hierarchies (Segal et al. 01) define an undirected Markov random field over (Θ, D).
12
Undirected Probabilistic Model
[Diagram: θroot linked to θRhino and θElephant by Divergence edges with weights β (high or low); each class linked to its data by an Fdata potential.]
Divergence: encourages parameters to be similar to their parents
Fdata: encourages parameters to explain the data
13
Purpose of Reformulation
  • Easy to specify
  • Fdata can be a likelihood, classification, or other objective
  • Divergence can be L1 distance, L2 distance, ε-insensitive loss, KL divergence, etc.
  • No conjugacy or proper prior restrictions
  • Easy to optimize
  • Convex over Θ if Fdata is concave and Divergence is convex (a minimal sketch follows)
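A minimal sketch of such an objective, assuming a Gaussian-style (concave) Fdata and an L2 divergence, one illustrative choice among those listed above:

```python
import numpy as np

def objective(thetas, data, parent_of, beta):
    """F_data(Theta) minus weighted divergences along hierarchy edges.
    Concave F_data plus convex divergence => maximizing this is a
    convex optimization problem."""
    val = 0.0
    for node, theta in thetas.items():
        if node in data:                               # leaf data term
            val -= 0.5 * np.sum((data[node] - theta) ** 2)
        parent = parent_of[node]
        if parent is not None:                         # L2 divergence edge
            val -= beta * np.sum((theta - thetas[parent]) ** 2)
    return val
```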

14
Application: Text Categorization
  • Task: Categorize documents
  • Bag-of-words model
  • Fdata: multinomial log likelihood (regularized)
  • µi represents the frequency of word i
  • Divergence: L2 norm

Newsgroup20 dataset
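A minimal sketch of this instantiation for a single node (variable names are mine; µ is a class's word-frequency vector on the simplex):

```python
import numpy as np

def node_objective(counts, mu, mu_parent, lam=1.0, eps=1e-12):
    """F_data: multinomial log likelihood of the class's word counts
    (up to the multinomial coefficient); Divergence: L2 norm to parent."""
    loglik = np.sum(counts * np.log(mu + eps))
    divergence = np.sum((mu - mu_parent) ** 2)
    return loglik - lam * divergence

# Toy usage with a 5-word vocabulary.
counts = np.array([3, 0, 1, 2, 0])
mu = np.array([0.4, 0.1, 0.2, 0.2, 0.1])
mu_parent = np.array([0.3, 0.2, 0.2, 0.2, 0.1])
print(node_objective(counts, mu, mu_parent))
```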
15
Baselines
  • 1. Maximum likelihood at each node (no hierarchy)
  • 2. Cross-validate regularization (no hierarchy)
  • 3. Shrinkage (McCallum et al. 98, with hierarchy)

[Diagram: flat baselines with separate θA, θB vs. shrinkage toward the parent: θ ← α·θ + (1 − α)·θparent.]
16
Can It Handle Multiple Levels?
Newsgroup Topic Classification
[Chart: classification rate (0.35–0.7) vs. total number of training instances (75–375) for Max Likelihood (no regularization), Shrinkage, Regularized Max Likelihood, and Undirected HB.]
17
Application: Shape Modeling
Mammals Dataset (Fink, 05)
  • Task: Learn shape (density estimation; test likelihood)
  • Instances represented by 60 x-y coordinates of landmarks on the outline
  • Divergence: L2 norm over mean and variance

[Figure: Fdata is a Gaussian with a mean landmark location and a covariance over landmarks, plus regularization; mean shape and principal components shown.]
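A minimal sketch of the shape Fdata under these choices (a Gaussian over the stacked landmark coordinates; the ridge term stands in for the regularization mentioned above):

```python
import numpy as np

def shape_log_likelihood(X, mean, cov, ridge=1e-3):
    """Gaussian log likelihood of shapes. X: (instances, 120) stacked x-y
    coordinates of 60 landmarks; mean: (120,); cov: (120, 120)."""
    d = cov.shape[0]
    cov = cov + ridge * np.eye(d)              # regularize the covariance
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return -0.5 * np.sum(quad + logdet + d * np.log(2 * np.pi))
```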
18
Does Hierarchy Help?
Mammal Pairs
[Chart: delta log-loss per instance vs. total number of training instances (6–30), relative to Regularized Max Likelihood, for ten mammal pairs: Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino.]
Unregularized max likelihood and shrinkage: much worse, not shown
19
Transfer
Not all parameters deserve equal sharing
20
Degrees of Transfer
Split θ into subcomponents µi, each with its own transfer weight λi. This allows different transfer strengths for different subcomponents and child-parent pairs (see the sketch below). How do we estimate all these weights?
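A minimal sketch of the resulting divergence term, with θ split into subcomponents µi and one weight λi per subcomponent (illustrative names):

```python
import numpy as np

def weighted_divergence(mu_child, mu_parent, lam):
    """Sum_i lambda_i * ||mu_i^child - mu_i^parent||^2: each subcomponent
    gets its own degree of transfer toward the parent."""
    return sum(l * np.sum((c - p) ** 2)
               for c, p, l in zip(mu_child, mu_parent, lam))

# Toy usage: the first subcomponent transfers strongly, the second weakly.
mu_child = [np.array([1.0, 2.0]), np.array([0.5])]
mu_parent = [np.array([1.1, 1.9]), np.array([2.0])]
print(weighted_divergence(mu_child, mu_parent, lam=[10.0, 0.1]))
```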
21
Learning Degrees of Transfer
  • Bootstrap approach
  • If µi^child and µi^parent have a consistent relationship, we want to encourage them to be similar
  • Define d = µi^child − µi^parent
  • We want to estimate the variance of d across all possible datasets
  • Select random subsets of the data
  • Let σ² be the empirical variance of d over these subsets
  • If µi^child and µi^parent have a consistent relationship (low variance), λi = 1/σ² strongly encourages similarity (a sketch follows this list)
  • Two routes overall: the bootstrap approach above, and the hyper-prior approach (next slides)
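A minimal sketch of the bootstrap heuristic under a simplifying assumption (each subcomponent is a coordinate mean, so it can be re-estimated on a subset by averaging; data arrays are 2-D, instances by dimensions):

```python
import numpy as np

def bootstrap_lambda(data_child, data_parent, n_boot=50, seed=0):
    """Resample the data, re-estimate child and parent subcomponents,
    and set lambda = 1 / empirical variance of d = mu_child - mu_parent."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        sub_c = rng.choice(data_child, size=len(data_child), replace=True)
        sub_p = rng.choice(data_parent, size=len(data_parent), replace=True)
        diffs.append(sub_c.mean(axis=0) - sub_p.mean(axis=0))
    var = np.var(np.asarray(diffs), axis=0)
    return 1.0 / (var + 1e-12)   # consistent difference -> strong transfer
```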

22
Degrees of Transfer
  • Bootstrap approach
  • If we use an L2 norm for our Div terms, λ·‖θchild − θparent‖² resembles a product of Gaussian priors with variance 1/λ
  • If we were to fix the value of λ (e.g., via the bootstrap estimate): empirical Bayes
  • Undirected empirical Bayes estimation

L2 norm ↔ Gaussian prior; λ is the degree of transfer
23
Learning Degrees of Transfer
  • Bootstrap approach: if µi^child and µi^parent have a consistent relationship, we want to encourage them to be similar
  • Hyper-prior approach
  • Bayesian idea: put a prior on λ
  • Add λ as a parameter to the optimization along with Θ
  • Concretely: an inverse-Gamma prior (so λ is forced to be positive)

If the likelihood is concave, the entire objective remains convex (a sketch follows).
[Figure: prior on the degree of transfer.]
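A minimal sketch of the hyper-prior variant for a single edge (the inverse-Gamma shape and scale a, b are illustrative choices, not values from the slides):

```python
import numpy as np

def log_inv_gamma(lam, a=2.0, b=1.0):
    """Log density of InvGamma(a, b) up to an additive constant;
    it tends to -inf as lam -> 0+, keeping lambda positive."""
    return -(a + 1) * np.log(lam) - b / lam

def edge_objective(f_data, mu_child, mu_parent, lam):
    """Data term minus the lambda-weighted L2 divergence plus the
    hyper-prior on lambda; lambda is optimized jointly with Theta."""
    divergence = lam * np.sum((mu_child - mu_parent) ** 2)
    return f_data - divergence + log_inv_gamma(lam)
```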
24
Do Degrees of Transfer Help?
Mammal Pairs
[Chart: delta log-loss per instance (−15 to 15) vs. total number of training instances (6–30), Hyperprior relative to Regularized Max Likelihood, for the ten mammal pairs.]
25
Do Degrees of Transfer Help?
Bison-Rhino
  • Const: no degrees of transfer
  • Hyperprior, Bootstrap: use degrees of transfer

[Chart: Bison-Rhino, delta log-loss per instance (−15 to 10) vs. total number of training instances (6–30) for Hyperprior, Bootstrap, RegML, and Const.]
26
Do Degrees of Transfer Help?
Llama-Rhino
Const: >100 bits worse (not shown)
[Chart: Llama-Rhino, delta log-loss per instance (0–8) vs. total number of training instances (6–30) for Hyperprior, Bootstrap, and RegML.]
27
Degrees of Transfer
Distribution of DOT coefficients using the Hyperprior
[Histogram: counts of degree-of-transfer coefficients 1/λ, from 0 to 50; small 1/λ means stronger transfer, large 1/λ means weaker transfer.]
28
Summary
  • Transfer between related classes
  • Range of settings, tasks
  • Probabilistic motivation
  • Multilevel, complex hierarchies
  • Simple, efficient computation
  • Refined transfer of components

29
Future Work
  • Non-tree hierarchies (multiple inheritance), e.g., the Gene Ontology (GO) network or the WordNet hierarchy
  • Block degrees of transfer
  • Structure learning
  • Part discovery

The general undirected model doesn't require a tree structure.