
Learning Tree Conditional Random Fields

- Joseph K. Bradley
- Carlos Guestrin

Reading people's minds

(Application from Palatucci et al., 2009)

X: fMRI voxels

Y: semantic features

- Metal?
- Manmade?
- Found in house?
- ...

We want to model conditional correlations

Predict independently? Y_i | X, for all i

Image from http://en.wikipedia.org/wiki/File:FMRI.jpg

Conditional Random Fields (CRFs)

- (Lafferty et al., 2001)

[Figure: tree CRF over outputs Y1, ..., Y4, conditioned on inputs X]

In fMRI, X = 500 to 10,000 voxels

- Pro: Avoid modeling P(X)
- Con: Compute Z(x) for each inference
- Exact inference intractable in general; approximate inference expensive.
- Use tree CRFs!
- Pro: Fast, exact inference
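For reference, this is the standard pairwise CRF form (Lafferty et al., 2001) that the pros and cons above refer to; for a tree CRF, the product runs over the tree's edges:

```latex
P(y \mid x) = \frac{1}{Z(x)} \prod_{(i,j) \in E} \psi_{ij}(y_i, y_j, x),
\qquad
Z(x) = \sum_{y'} \prod_{(i,j) \in E} \psi_{ij}(y'_i, y'_j, x)
```

On a tree, Z(x) and all marginals are computable exactly by belief propagation in time linear in the number of outputs, which is what makes inference fast and exact.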

CRF Structure Learning

[Figure: this work sits at the intersection of feature selection (scalable, local inputs) and tree CRFs (fast, exact inference; avoid modeling P(X))]

Goals

- Structured conditional models P(Y|X)
- Scalable methods
- Tree structures
- Local inputs X_ij
- Max spanning trees

- Outline
- Gold standard
- Max spanning trees
- Generalized edge weights
- Heuristic weights
- Experiments: synthetic & fMRI

Related work

Method | Feature selection? | Tractable models?
Torralba et al. (2004): Boosted Random Fields | Yes | No
Schmidt et al. (2008): Block-L1 regularized pseudolikelihood | No | No
Shahaf et al. (2009): Edge-weight low-treewidth model | No | Yes

- Vs. our work
- Choice of edge weights
- Local inputs

Chow-Liu

- For generative models

Chow-Liu for CRFs?

- For CRFs with global inputs
- Global CMI (Conditional Mutual Information)
- Pro: Gold standard
- Con: I(Y_i; Y_j | X) intractable for big X
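Concretely, Chow-Liu weights each candidate edge with the mutual information of its endpoints and returns a max spanning tree; the natural CRF analog swaps in the global CMI:

```latex
w_{ij}^{\text{Chow-Liu}} = I(Y_i; Y_j),
\qquad
w_{ij}^{\text{global CMI}} = I(Y_i; Y_j \mid X)
```

Estimating I(Y_i; Y_j | X) requires handling all of X jointly, hence the intractability for large X.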

Where now?

- Global CMI (Conditional Mutual Information)
- Pro: Gold standard
- Con: I(Y_i; Y_j | X) intractable for big X

- Algorithmic framework (see the sketch below)
- Given data {(y(i), x(i))}.
- Given input mapping Y_i → X_i
- Weight potential edge (Y_i, Y_j) with Score(i,j)
- Choose max spanning tree

Local inputs!
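A minimal sketch of this framework in Python, assuming a user-supplied score function (e.g., one of the heuristics below) and using networkx for the spanning tree:

```python
import networkx as nx

def learn_tree_structure(score, n_labels):
    """Weight each candidate edge (Y_i, Y_j) with score(i, j),
    then return the edges of the max-weight spanning tree."""
    g = nx.Graph()
    for i in range(n_labels):
        for j in range(i + 1, n_labels):
            g.add_edge(i, j, weight=score(i, j))
    return list(nx.maximum_spanning_tree(g).edges())
```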

Generalized edge scores

- Key step: Weight edge (Y_i, Y_j) with Score(i,j).

Local Linear Entropy Scores: Score(i,j) = a linear combination of entropies over Y_i, Y_j, X_i, X_j

- E.g., Local Conditional Mutual Information

- Theorem
- Assume the true P(Y|X) is a tree CRF (with non-trivial parameters).
- Then no Local Linear Entropy Score can recover all such tree CRFs (even with exact entropies).
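To make the score family concrete: writing X_ij = (X_i, X_j), Local CMI expands, by standard information identities, into exactly such a linear combination of entropies:

```latex
I(Y_i; Y_j \mid X_{ij})
  = H(Y_i, X_{ij}) + H(Y_j, X_{ij}) - H(Y_i, Y_j, X_{ij}) - H(X_{ij})
```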

Heuristics

- Outline
- Gold standard
- Max spanning trees
- Generalized edge weights
- Heuristic weights → Piecewise likelihood, Local CMI, DCI
- Experiments: synthetic & fMRI

Piecewise likelihood (PWL)

- Sutton and McCallum (2005, 2007): PWL for parameter learning. Main idea: bound Z(X).
- For tree CRFs, optimal parameters give:
- Edge score w/ local inputs X_ij
- Bounds log likelihood

- Fails on a simple counterexample
- Does badly in practice

- Helps explain other edge scores

[Figure: counterexample, true P(Y,X)]
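A sketch of the bound behind PWL for a pairwise CRF: because all factors are nonnegative, the partition function is bounded by a product of per-edge partition functions, so the log likelihood is lower-bounded by a sum of local, per-edge terms:

```latex
Z(x) \le \prod_{(i,j) \in E} Z_{ij}(x), \quad
Z_{ij}(x) = \sum_{y_i, y_j} \psi_{ij}(y_i, y_j, x_{ij})
\;\Longrightarrow\;
\log P(y \mid x) \ge \sum_{(i,j) \in E} \log \frac{\psi_{ij}(y_i, y_j, x_{ij})}{Z_{ij}(x)}
```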

Local Conditional Mutual Info

- Decomposable score w/ local inputs X_ij
- Theorem: Local CMI bounds the log likelihood gain
- Does pretty well in practice
- Can fail with strong potentials

[Figure: counterexample, true P(Y,X) with a strong potential among Y1, Y2, Y3]
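A minimal sketch (mine, not the paper's implementation) of estimating this score for binary variables from empirical counts:

```python
import numpy as np

def entropy(counts):
    # Empirical entropy from a vector of joint counts.
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def joint_counts(*cols):
    # Counts of each joint configuration of the given sample columns.
    arr = np.stack(cols, axis=1)
    _, counts = np.unique(arr, axis=0, return_counts=True)
    return counts

def local_cmi(yi, yj, xi, xj):
    # I(Yi;Yj|Xi,Xj) = H(Yi,X) + H(Yj,X) - H(Yi,Yj,X) - H(X), X = (Xi,Xj).
    return (entropy(joint_counts(yi, xi, xj))
            + entropy(joint_counts(yj, xi, xj))
            - entropy(joint_counts(yi, yj, xi, xj))
            - entropy(joint_counts(xi, xj)))
```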

Decomposable Conditional Influence (DCI)

- Exact measure of gain for some edges
- Edge score w/ local inputs X_ij
- Succeeds on counterexample
- Does best in practice

Experiments

Algorithmic details

- Given: data {(y(i), x(i))}; input mapping Y_i → X_i
- Compute edge scores
- Choose max spanning tree
- Parameter learning (sketched below)
- Conjugate gradient on L2-regularized log likelihood
- 10-fold CV to choose regularization
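A hedged sketch of the parameter-learning step; neg_log_lik is a placeholder for the exact tree-CRF objective and gradient (computable by belief propagation on the learned tree), and lam would be chosen by 10-fold CV:

```python
import numpy as np
from scipy.optimize import minimize

def fit_tree_crf(theta0, neg_log_lik, lam):
    """Conjugate gradient on the L2-regularized log likelihood.
    neg_log_lik(theta) -> (value, gradient) over the training set."""
    def objective(theta):
        nll, grad = neg_log_lik(theta)
        return nll + lam * theta @ theta, grad + 2 * lam * theta
    return minimize(objective, theta0, jac=True, method='CG').x
```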

Synthetic experiments

[Figure: generative model P(X) over inputs X1, X2, X3, ..., Xn, with conditional model P(Y|X)]

- Experiments
- Binary Y, X; tabular edge factors
- Use natural input mapping Y_i → X_i

Synthetic experiments

[Figure: P(Y|X) over Y1, ..., Y5 and P(X); factors F(Y_ij, X_ij); tractable vs. intractable P(Y,X)]

- P(Y|X), P(X): chains & trees
- P(Y,X): tractable & intractable

Synthetic experiments

[Figure: chain P(Y|X) over Y1, Y2, Y3, ..., Yn with cross factors to inputs X1, X2, X3, ..., Xn]

- P(Y|X): chains & trees
- P(Y,X): tractable & intractable, with factors F(Y_ij, X_ij)
- With & without cross-factors
- Associative (all positive & alternating +/-) & random factors

Synthetic: vary training examples

[Plots: tree, intractable P(Y,X), associative F (alternating +/-), |Y| = 40, 1000 test examples]

Synthetic: vary model size

[Plots: fixed 50 train exs., 1000 test exs.]

fMRI experiments

X (500 fMRI voxels) → predict → Y (218 semantic features):

- Metal?
- Manmade?
- Found in house?
- ...

- Data, setup from Palatucci et al. (2009)
- Zero-shot learning: can predict objects not in the training data (given decoding).

Image from http://en.wikipedia.org/wiki/File:FMRI.jpg

fMRI experiments

X (500 fMRI voxels) → predict → Y (218 semantic features)

- Input mapping: regressed Y_i on (Y_{-i}, X)
- Chose top K inputs
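A hypothetical sketch of that input mapping; the least-squares fit and top-K selection here are my assumptions about the details, not the authors' exact procedure:

```python
import numpy as np

def choose_inputs(Y, X, i, K):
    """Regress Y[:, i] on [Y_{-i}, X]; return indices of the K voxels
    with the largest absolute regression weights."""
    others = np.delete(Y, i, axis=1)          # Y_{-i}
    design = np.hstack([others, X])
    w, *_ = np.linalg.lstsq(design, Y[:, i], rcond=None)
    voxel_w = w[others.shape[1]:]             # weights on the X block
    return np.argsort(-np.abs(voxel_w))[:K]
```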

fMRI experiments

Accuracy (for zero-shot learning): Hold out objects i, j. Predict Ŷ(i), Ŷ(j). If ||Ŷ(i) - Y(i)||² < ||Ŷ(i) - Y(j)||², then we got i right.
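The same test as a minimal code sketch:

```python
import numpy as np

def zero_shot_correct(y_pred_i, y_true_i, y_true_j):
    # Object i is decoded correctly if its predicted semantic vector
    # is closer (squared error) to its own features than to object j's.
    return (np.sum((y_pred_i - y_true_i) ** 2)
            < np.sum((y_pred_i - y_true_j) ** 2))
```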

fMRI experiments

[Plots; arrows indicate the "better" direction]

- Accuracy: CRFs a bit worse
- Log likelihood: CRFs better
- Squared error: CRFs better

Conclusion

- Scalable learning of CRF structure
- Analyzed edge scores for spanning tree methods
- Local Linear Entropy Scores are imperfect
- Heuristics
- Pleasing theoretical properties
- Empirical success; we recommend DCI
- Future work
- Templated CRFs
- Learning the edge score
- Assumptions on model/factors which give learnability

Thank you!

- References
- M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
- J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
- M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
- M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure learning in random fields for heart motion abnormality detection. CVPR 2008.
- D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AI-STATS 2009.
- C. Sutton, A. McCallum. Piecewise training of undirected models. UAI 2005.
- C. Sutton, A. McCallum. Piecewise pseudolikelihood for efficient training of conditional random fields. ICML 2007.
- A. Torralba, K. Murphy, W. Freeman. Contextual models for object detection using boosted random fields. NIPS 2004.

(extra slides)

B: Score Decay Assumption

B: Example complexity

Future work: Templated CRFs

- Learn a template, e.g.
- Score(i,j) = DCI(i,j)
- Parametrization

- WebKB (Craven et al., 1998)
- Given webpages (Y_i = page type, X_i = content)
- Use template to: choose tree over pages; instantiate parameters
- → P(Y | X = x) = P(pages' types | pages' content)

- Requires local inputs
- Potentially very fast

Future work: Learn the score

- Given training queries
- Data
- Ground-truth model (e.g., from an expensive structure learning method)
- Learn a function Score(Y_i, Y_j) for the MST algorithm.