# Learning Tree Conditional Random Fields - PowerPoint PPT Presentation

## Learning Tree Conditional Random Fields

Description: Learning Tree Conditional Random Fields. Joseph K. Bradley, Carlos Guestrin.
Transcript and Presenter's Notes

Title: Learning Tree Conditional Random Fields

1
Learning Tree Conditional Random Fields
• Joseph K. Bradley
• Carlos Guestrin

2
(Application from Palatucci et al., 2009)
X = fMRI voxels
Y = semantic features
• Metal?
• Found in house?
• ...

We want to model conditional correlations.
Predict independently? P(Yi | X), for all i
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg
3
Conditional Random Fields (CRFs)
• (Lafferty et al., 2001)

In fMRI, X = 500 to 10,000 voxels
Pro: Avoid modeling P(X)
4
Conditional Random Fields (CRFs)
[Figure: CRF over outputs Y1...Y4]
Pro: Avoid modeling P(X)
5
Conditional Random Fields (CRFs)
[Figure: CRF over outputs Y1...Y4]
Pro: Avoid modeling P(X)
6
Conditional Random Fields (CRFs)
Con: Compute Z(x) for each inference
Pro: Avoid modeling P(X)
7
Conditional Random Fields (CRFs)
Exact inference intractable in general. Approximate inference expensive.
Use tree CRFs!
Con: Compute Z(x) for each inference
Pro: Avoid modeling P(X)
8
Conditional Random Fields (CRFs)
Use tree CRFs!
Pro: Fast, exact inference
Con: Compute Z(x) for each inference
Pro: Avoid modeling P(X)
9
CRF Structure Learning
Feature selection
Tree CRFs: Fast, exact inference; avoid modeling P(X)
10
CRF Structure Learning
(scalable)
Local inputs
Tree CRFs: Fast, exact inference; avoid modeling P(X)
11
This work
Goals
• Structured conditional models P(Y|X)
• Scalable methods
• Tree structures
• Local inputs Xij
• Max spanning trees
• Outline
• Gold standard
• Max spanning trees
• Generalized edge weights
• Heuristic weights
• Experiments: synthetic & fMRI

12
Related work
| Method | Feature selection? | Tractable models? |
| --- | --- | --- |
| Torralba et al. (2004): Boosted Random Fields | Yes | No |
| Schmidt et al. (2008): Block-L1 regularized pseudolikelihood | No | No |
| Shahaf et al. (2009): Edge weight low-treewidth model | No | Yes |
• Vs. our work
• Choice of edge weights
• Local inputs

13
Chow-Liu
For generative models
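Chow-Liu weights each candidate edge by the empirical mutual information between the two variables, then keeps a max spanning tree. A minimal sketch of that mutual-information estimate for discrete samples (my own illustration, not code from the talk):

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats, from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))          # joint counts
    px, py = Counter(xs), Counter(ys)   # marginal counts
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log[ p(x,y) / (p(x) p(y)) ], in counts: (c/n) * log(c*n / (cx*cy))
        mi += (c / n) * log(c * n / (px[x] * py[y]))
    return mi
```

Perfectly correlated pairs give I = H(X) (log 2 for a fair binary variable); independent pairs give I = 0.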
14
Chow-Liu for CRFs?
For CRFs with global inputs
• Global CMI (Conditional Mutual Information)
• Pro: Gold standard
• Con: I(Yi; Yj | X) intractable for big X

15
Where now?
• Global CMI (Conditional Mutual Information)
• Pros: Gold standard
• Cons: I(Yi; Yj | X) intractable for big X
• Algorithmic framework
• Given data {(y(i), x(i))}
• Given input mapping Yi → Xi
• Weight potential edge (Yi,Yj) with Score(i,j)
• Choose max spanning tree

Local inputs!
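The max-spanning-tree step of the framework above can be sketched with Kruskal's algorithm; `score` here is a hypothetical stand-in for whichever edge score (global CMI, local CMI, PWL, DCI) is plugged in:

```python
def max_spanning_tree(n, score):
    """Kruskal's algorithm: max-weight spanning tree over nodes 0..n-1.
    score(i, j) gives the weight of candidate edge (Yi, Yj)."""
    edges = sorted(((score(i, j), i, j)
                    for i in range(n) for j in range(i + 1, n)), reverse=True)
    parent = list(range(n))          # union-find forest
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    tree = []
    for w, i, j in edges:            # take heaviest edges that join components
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```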
16
Generalized edge scores
• Key step: Weight edge (Yi,Yj) with Score(i,j).

Local Linear Entropy Scores: Score(i,j) is a linear combination of entropies over Yi, Yj, Xi, Xj
E.g., Local Conditional Mutual Information
17
Generalized edge scores
• Key step: Weight edge (Yi,Yj) with Score(i,j).

Local Linear Entropy Scores: Score(i,j) is a linear combination of entropies over Yi, Yj, Xi, Xj
• Theorem
• Assume true P(Y|X) is a tree CRF
• (w/ non-trivial parameters).
• ⇒ No Local Linear Entropy Score can recover all such tree CRFs
• (even with exact entropies).

18
Heuristics
• Outline
• Gold standard
• Max spanning trees
• Generalized edge weights
• Heuristic weights
• Experiments: synthetic & fMRI

→ Piecewise likelihood → Local CMI → DCI
19
Piecewise likelihood (PWL)
Sutton and McCallum (2005, 2007): PWL for parameter learning. Main idea: Bound Z(X).
For tree CRFs, optimal parameters give:
• Edge score w/ local inputs Xij
• Bounds log likelihood
• Fails on a simple counterexample
• Helps explain other edge scores

20
Piecewise likelihood (PWL)
True P(Y,X)
21
Local Conditional Mutual Info
• Decomposable score w/ local inputs Xij
• Theorem: Local CMI bounds log likelihood gain
• Does pretty well in practice
• Can fail with strong potentials
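As an illustration (not from the slides), the Local CMI score I(Yi; Yj | Xi, Xj) can be estimated from empirical counts when Yi, Yj, and the local inputs are discrete:

```python
from collections import Counter
from math import log

def local_cmi(samples):
    """Estimate I(Yi; Yj | Xi, Xj) in nats from samples (yi, yj, x),
    where x = (xi, xj) is the local input pair."""
    n = len(samples)
    c_yyx = Counter((yi, yj, x) for yi, yj, x in samples)
    c_ix = Counter((yi, x) for yi, _, x in samples)
    c_jx = Counter((yj, x) for _, yj, x in samples)
    c_x = Counter(x for _, _, x in samples)
    cmi = 0.0
    for (yi, yj, x), c in c_yyx.items():
        # p(yi,yj,x) * log[ p(yi,yj|x) / (p(yi|x) p(yj|x)) ]
        cmi += (c / n) * log(c * c_x[x] / (c_ix[(yi, x)] * c_jx[(yj, x)]))
    return cmi
```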

22
Local Conditional Mutual Info
True P(Y,X)
[Figure: chain Y1-Y2-Y3; arrow marks a strong potential on one edge]
23
Decomposable Conditional Influence (DCI)
• Exact measure of gain for some edges
• Edge score w/ local inputs Xij
• Succeeds on counterexample
• Does best in practice

24
Experiments
Algorithmic details
• Given: Data {(y(i), x(i))}; input mapping Yi → Xi
• Compute edge scores
• Choose max spanning tree
• Parameter learning
• Conjugate gradient on L2-regularized log likelihood
• 10-fold CV to choose regularization

25
Synthetic experiments
P(Y|X)
P(X)
[Figure: inputs X1, X2, X3, ..., Xn]
• Experiments
• Binary Y, X; tabular edge factors
• Use natural input mapping Yi → Xi

26
Synthetic experiments
P(Y|X)
P(X)
[Figure: outputs Y1...Y5; tractable vs. intractable P(Y,X); factors F(Yij, Xij)]
• P(Y|X), P(X): chains & trees
• P(Y,X): tractable & intractable
27
Synthetic experiments
P(Y|X)
[Figure: chain Y1, Y2, Y3, ..., Yn with cross factors to inputs X1, X2, X3, ..., Xn; factors F(Yij, Xij)]
• P(Y|X): chains & trees
• P(Y,X): tractable & intractable
• With & without cross-factors
• Associative (all positive / alternating +/-) & random factors

28
Synthetic: vary train exs.
29
Synthetic: vary train exs.
Tree; intractable P(Y,X); associative F (alternating +/-); |Y| = 40; 1000 test examples
30-34
Synthetic: vary train exs.
35
Synthetic: vary model size
Fixed: 50 train exs., 1000 test exs.
36
fMRI experiments
X (500 fMRI voxels)
Y (218 semantic features)
predict
• Metal?
• Found in house?
• ...

Data, setup from Palatucci et al. (2009)
Zero-shot learning: can predict objects not in training data (given decoding).
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg
37
fMRI experiments
X (500 fMRI voxels)
Y (218 semantic features)
predict
Input mapping: regressed Yi on Y-i, X
Chose top K inputs
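A rough sketch of a top-K input mapping. The talk regresses Yi on Y-i and X; this illustration instead ranks candidate inputs by absolute Pearson correlation as a simpler stand-in, so treat it as an assumption, not the authors' method:

```python
from statistics import mean

def top_k_inputs(y, X_cols, k):
    """Return indices of the k columns of X most correlated (|Pearson r|) with y."""
    def abs_corr(col):
        my, mc = mean(y), mean(col)
        num = sum((a - my) * (b - mc) for a, b in zip(y, col))
        dy = sum((a - my) ** 2 for a in y) ** 0.5
        dc = sum((b - mc) ** 2 for b in col) ** 0.5
        return abs(num / (dy * dc)) if dy and dc else 0.0
    ranked = sorted(range(len(X_cols)), key=lambda j: abs_corr(X_cols[j]), reverse=True)
    return ranked[:k]
```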
38
fMRI experiments
Accuracy (for zero-shot learning): Hold out objects i, j. Predict Ŷ(i), Ŷ(j). If ||Ŷ(i) - Y(i)||² < ||Ŷ(j) - Y(i)||², then we got i right.
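The accuracy criterion on this slide, written out as a small check (squared L2 distances; `y_hat` denotes a prediction):

```python
def zero_shot_correct(y_hat_i, y_true_i, y_hat_j):
    """Slide's criterion: the prediction for held-out object i counts as correct
    when it is closer (squared L2) to i's true code than j's prediction is."""
    sq_dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return sq_dist(y_hat_i, y_true_i) < sq_dist(y_hat_j, y_true_i)
```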
39
fMRI experiments
Accuracy: CRFs a bit worse
40
fMRI experiments
Accuracy: CRFs a bit worse. Log likelihood: CRFs better.
41
fMRI experiments
Accuracy: CRFs a bit worse. Log likelihood: CRFs better. Squared error: CRFs better.
43
Conclusion
• Scalable learning of CRF structure
• Analyzed edge scores for spanning tree methods
• Local Linear Entropy Scores imperfect
• Heuristics
• Pleasing theoretical properties
• Empirical success; we recommend DCI
• Future work
• Templated CRFs
• Learning edge score
• Assumptions on model/factors which give learnability

Thank you!
44
Thank you!
• References
• M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
• J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
• M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
• M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure Learning in Random Fields for Heart Motion Abnormality Detection. CVPR 2008.
• D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
• C. Sutton, A. McCallum. Piecewise Training of Undirected Models. UAI 2005.
• C. Sutton, A. McCallum. Piecewise Pseudolikelihood for Efficient Training of Conditional Random Fields. ICML 2007.
• A. Torralba, K. Murphy, W. Freeman. Contextual Models for Object Detection Using Boosted Random Fields. NIPS 2004.

45
(extra slides)
46
B: Score Decay Assumption
47
B: Example complexity
48
Future work: Templated CRFs
• Learn template, e.g.,
• Score(i,j) = DCI(i,j)
• Parametrization
• WebKB (Craven et al., 1998)
• Given webpages (Yi = page type, Xi = content)
• Use template to: choose tree over pages; instantiate parameters
• ⇒ P(Y | X = x) = P(pages' types | pages' content)
• Requires local inputs
• Potentially very fast

49
Future work: Learn score
• Given training queries
• Data
• Ground-truth model (e.g., from an expensive structure learning method)
• Learn function Score(Yi,Yj) for MST algorithm.