1
Graphical model software for machine learning
  • Kevin Murphy
  • University of British Columbia

December, 2005
2
Outline
  • Discriminative models for iid data
  • Beyond iid data: conditional random fields
  • Beyond supervised learning: generative models
  • Beyond optimization: Bayesian models

3
Supervised learning as Bayesian inference
[Plate diagram: a parameter node θ connects the training pairs (Xn, Yn), n = 1..N (plate), to the test pair (X, Y); training and testing are posed as one joint inference problem.]
4
Supervised learning as optimization
[The same plate diagram; here θ is treated as a parameter to be optimized rather than integrated out.]
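In standard notation (the slides' equations are not in the transcript, so this is a reconstruction), the Bayesian view predicts by integrating out θ, while the optimization view plugs in a point estimate:

$$ \text{Bayesian:}\quad p(y \mid x, D) = \int p(y \mid x, \theta)\, p(\theta \mid D)\, d\theta $$
$$ \text{Optimization:}\quad \hat{\theta} = \arg\max_{\theta} \sum_{n=1}^N \log p(y_n \mid x_n, \theta), \qquad p(y \mid x, D) \approx p(y \mid x, \hat{\theta}) $$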
5
Example: logistic regression
  • Let yn ∈ {1,...,C} be given by a softmax (see below)
  • Maximize conditional log likelihood
  • Max margin solution
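The slide's equations are not in the transcript; the standard softmax model and conditional log likelihood are

$$ p(y_n = c \mid x_n, W) = \frac{\exp(w_c^T x_n)}{\sum_{c'=1}^C \exp(w_{c'}^T x_n)}, \qquad \ell(W) = \sum_{n=1}^N \log p(y_n \mid x_n, W), $$

and the max margin solution instead maximizes the margin $w_{y_n}^T x_n - \max_{c \neq y_n} w_c^T x_n$, as in multiclass SVMs.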

6
Outline
  • Discriminative models for iid data
  • Beyond iid data: conditional random fields
  • Beyond supervised learning: generative models
  • Beyond optimization: Bayesian models

7
1D chain CRFs for sequence labeling
A 1D conditional random field (CRF) is an extension of logistic regression to the case where the output labels are sequences, yn ∈ {1,...,C}^m.
[Chain diagram: labels Yn1, Yn2, ..., Ynm linked by edge potentials ψij, with local evidence ψi connecting each label to the input Xn.]
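In standard form (a reconstruction of the slide's equation), the chain CRF is

$$ p(y \mid x, w) = \frac{1}{Z(x, w)} \prod_{i=1}^{m} \psi_i(y_i, x) \prod_{i=1}^{m-1} \psi_{ij}(y_i, y_{i+1}, x), $$

where the ψi are the local evidence terms and the ψij are the edge potentials.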
8
2D Lattice CRFs for pixel labeling
A conditional random field (CRF) is a discriminative model of P(y|x). The edge potentials ψij are image-dependent.
9
2D Lattice MRFs for pixel labeling
A Markov Random Field (MRF) is an undirected graphical model. Here we model correlation between pixel labels using ψij(yi, yj). We also have a per-pixel generative model of observations P(xi|yi). [The slide's equation is annotated with three labels: local evidence, potential function, and partition function.]
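In standard form (reconstructing the slide's equation from its three labels), the model is

$$ p(y, x) = \frac{1}{Z} \prod_{\langle i,j \rangle} \psi_{ij}(y_i, y_j) \prod_i p(x_i \mid y_i), $$

where p(xi | yi) is the local evidence, ψij is the potential function, and Z is the partition function.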
10
Tree-structured CRFs
  • Used in parts-based object detection
  • Yi is the location of part i in the image

[Figure: tree-structured model of face parts: eyeL, eyeR, nose, mouth.]
Fischler & Elschlager, "The representation and matching of pictorial structures," PAMI 1973; Felzenszwalb & Huttenlocher, "Pictorial Structures for Object Recognition," IJCV 2005.
11
General CRFs
  • In general, the graph may have arbitrary
    structure
  • eg for collective web page classification: nodes = URLs, edges = hyperlinks
  • The potentials are in general defined on cliques, not just edges (see below)
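In standard form, with 𝒞 the set of cliques:

$$ p(y \mid x) = \frac{1}{Z(x)} \prod_{c \in \mathcal{C}} \psi_c(y_c, x) $$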

12
Factor graphs
Square nodes = factors (potentials); round nodes = random variables; the graph structure is bipartite.
13
Potential functions
  • For the local evidence, we can use a
    discriminative classifier (trained iid)
  • For the edge compatibilities, we can use a maxent / log-linear form, using pre-defined features (see below)
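A standard maxent / log-linear parameterization (the slide's own equation is not in the transcript) is

$$ \psi_{ij}(y_i, y_j) = \exp\Big( \sum_k w_k\, f_k(y_i, y_j, x) \Big), $$

where the fk are the pre-defined features and the weights wk are learned.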

14
Restricted potential functions
  • For some applications (esp in vision), we often use a Potts model of the form shown below
  • We can generalize this for ordered labels (eg
    discretization of continuous states)
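The standard Potts form, and a common generalization to ordered labels (reconstructions; the slide's equations are not in the transcript), are

$$ \psi_{ij}(y_i, y_j) = \exp\big( -\beta\, \mathbb{I}[y_i \neq y_j] \big) \qquad \text{and} \qquad \psi_{ij}(y_i, y_j) = \exp\big( -\beta \min(|y_i - y_j|, \tau) \big), $$

the latter being a truncated-linear penalty on the label difference.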

16
Learning CRFs
  • If the log likelihood (with parameters tied across cliques) is
    $$ \ell(w) = \sum_n \Big[ \sum_c w^T \phi_c(y_n^c, x_n) - \log Z(x_n, w) \Big] $$
  • then the gradient is
    $$ \nabla \ell(w) = \sum_n \sum_c \Big[ \phi_c(y_n^c, x_n) - \mathbb{E}_{p(y \mid x_n, w)}\, \phi_c(y^c, x_n) \Big] $$

Gradient = observed features − expected features
17
Learning CRFs
  • Given the gradient ∇ℓ, one can find the global optimum using first- or second-order optimization methods, such as
  • Conjugate gradient
  • Limited memory BFGS
  • Stochastic meta descent (SMD)
  • The bottleneck is computing the expected features needed for the gradient (see the sketch below)
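As a concrete sketch of this training loop, here is gradient-based fitting with L-BFGS via scipy. To keep it self-contained and runnable, inference is trivial: the model is a single-node CRF (multinomial logistic regression), so the expected features are exact; in a real CRF they would come from BP or another inference routine. All names are illustrative, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik_and_grad(w_flat, X, y, C):
    """Negative conditional log likelihood and its gradient."""
    N, D = X.shape
    W = w_flat.reshape(C, D)
    scores = X @ W.T                        # (N, C) unnormalized log-potentials
    scores -= scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)       # p(y = c | x_n, W)
    nll = -np.sum(np.log(P[np.arange(N), y]))
    G = P.T @ X                             # expected features ...
    for c in range(C):
        G[c] -= X[y == c].sum(axis=0)       # ... minus observed features
    return nll, G.ravel()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # toy data
y = (X[:, 0] > 0).astype(int)               # 2 classes
res = minimize(neg_log_lik_and_grad, np.zeros(2 * 3), args=(X, y, 2),
               jac=True, method="L-BFGS-B")
```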

18
Exact inference
  • For 1D chains, one can compute P(yi, yi+1 | x) exactly in O(N K^2) time using belief propagation (BP, the forwards-backwards algorithm; sketched below)
  • For restricted potentials (eg ψij(k, l) = f(k − l)), one can do this in O(N K) time using FFT-like tricks
  • This can be generalized to trees.
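A minimal forwards-backwards sketch, assuming one chain with a tied pairwise potential psi (K x K) and local evidence phi (m x K); the names are illustrative. It returns the node marginals; the pairwise marginals P(yi, yi+1 | x) needed for learning follow from the same alpha/beta messages.

```python
import numpy as np

def chain_marginals(phi, psi):
    """Node marginals of a chain model in O(m K^2) time."""
    m, K = phi.shape
    alpha = np.zeros((m, K))
    beta = np.zeros((m, K))
    alpha[0] = phi[0]
    for i in range(1, m):                   # forwards pass
        alpha[i] = phi[i] * (alpha[i - 1] @ psi)
        alpha[i] /= alpha[i].sum()          # normalize for numerical stability
    beta[-1] = 1.0
    for i in range(m - 2, -1, -1):          # backwards pass
        beta[i] = psi @ (phi[i + 1] * beta[i + 1])
        beta[i] /= beta[i].sum()
    marg = alpha * beta
    return marg / marg.sum(axis=1, keepdims=True)

marg = chain_marginals(np.random.rand(5, 3), np.random.rand(3, 3))
```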

19
Sum-product vs max-product
  • We use sum-product to compute marginal
    probabilities needed for learning
  • We use max-product to find the most probable
    assignment (Viterbi decoding)
  • We can also compute max-marginals

20
Complexity of exact inference
In general, the running time is O(N K^w), where w is the treewidth of the graph: the size of the maximal clique of the triangulated graph (assuming an optimal elimination ordering). For chains and trees, w = 2. For n × n lattices, w = O(n).
21
Approximate sum-product
Algorithm | Potential (pairwise) | Time (N = num nodes, K = num states, I = num iterations)
BP (exact iff tree) | General | O(N K^2 I)
BP + FFT (exact iff tree) | Restricted | O(N K I)
Generalized BP | General | O(N K^(2c) I), c = cluster size
Gibbs | General | O(N K I)
Swendsen-Wang | General | O(N K I)
Mean field | General | O(N K I)
22
Approximate max-product
Algorithm | Potential (pairwise) | Time (N = num nodes, K = num states, I = num iterations)
BP (exact iff tree) | General | O(N K^2 I)
BP + DT (exact iff tree) | Restricted | O(N K I)
Generalized BP | General | O(N K^(2c) I), c = cluster size
Graph-cuts (exact iff K = 2) | Restricted | O(N^2 K I) ?
ICM (iterated conditional modes) | General | O(N K I)
SLS (stochastic local search) | General | O(N K I)
23
Learning intractable CRFs
  • We can use approximate inference and hope the
    gradient is good enough.
  • If we use max-product, we are doing Viterbi
    training (cf perceptron rule)
  • Or we can use other techniques, such as pseudo-likelihood, which does not need inference.

24
Pseudo-likelihood
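The slide's equation is not in the transcript; the standard definition replaces the likelihood with a product of per-node conditionals,

$$ \ell_{PL}(w) = \sum_n \sum_i \log p\big(y_{ni} \mid y_{n,\mathcal{N}(i)}, x_n, w\big), $$

where N(i) are the neighbors of node i. Each factor normalizes over a single variable (K terms), which is why no global inference is needed.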
25
Software for inference and learning in 1D CRFs
  • Various packages
  • Mallet (McCallum et al): Java
  • Crf.sourceforge.net (Sarawagi, Cohen): Java
  • My code: Matlab (just a toy, not integrated with BNT)
  • Ben Taskar says he will soon release his Max Margin Markov net code (which uses LP for inference and QP for learning)
  • Nothing standard; emphasis on NLP apps

26
Software for inference in general CRFs/ MRFs
  • Max-product: C code for GC, BP, TRP and ICM (for 2D lattices) by Rick Szeliski et al
  • "A comparative study of energy minimization methods for MRFs," Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother
  • Sum-product for Gaussian MRFs: GMRFlib, C code by Håvard Rue (exact inference)
  • Sum-product: various other ad hoc pieces
  • My Matlab BP code (MRF2)
  • Rivasseau's C code for BP, Gibbs, tree-sampling (factor graphs)
  • Meltzer's C code for BP, GBP, Gibbs, MF (2D lattices)

27
Software for learning general MRFs/CRFs
  • Hardly any!
  • Parise's Matlab code (approx gradient, pseudo-likelihood, CD, etc)
  • My Matlab code (IPF, approx gradient; just a toy, not integrated with BNT)

28
Structure of ideal toolbox
[Block diagram of the pipeline: generator/GUI/file, train, testData, infer, decisionEngine, performance, decision, visualize, summarize, utilities.]
29
Structure of BNT
[Block diagram of BNT in the same layout: generator/GUI/file (LeRay, Shan GUIs); graphs and CPDs stored as cell arrays; train via EM / structural EM, with BP, jtree, or MCMC as inference subroutines; testData given as node IDs; infer via jtree or VarElim; decisionEngine = LIMID, returning a policy; results represented as arrays, Gaussians, or samples (N = 1 gives the MAP); visualize and summarize.]
30
Outline
  • Discriminative models for iid data
  • Beyond iid data: conditional random fields
  • Beyond supervised learning: generative models
  • Beyond optimization: Bayesian models

31
Unsupervised learning why?
  • Labeling data is time-consuming.
  • Often not clear what label to use.
  • Complex objects often not describable with a
    single discrete label.
  • Humans learn without labels.
  • Want to discover novel patterns/ structure.

32
Unsupervised learning what?
  • Clusters (eg GMM)
  • Low dim manifolds (eg PCA)
  • Graph structure (eg biology, social networks)
  • Features (eg maxent models of language and
    texture)
  • Objects (eg sprite models in vision)

33
Unsupervised learning of objects from video
Frey and Jojic; Williams and Titsias et al
34
Unsupervised learning issues
  • Objective function not as obvious as in
    supervised learning. Usually try to maximize
    likelihood (measure of data compression).
  • Local minima (non-convex objective).
  • Uses inference as a subroutine (can be slow, but no worse than discriminative learning)

35
Unsupervised learning how?
  • Construct a generative model (eg a Bayes net).
  • Perform inference.
  • May have to use approximations such as maximum
    likelihood and BP.
  • Cannot use max likelihood for model selection

36
A comparison of BN software
www.ai.mit.edu/murphyk/Software/Bayes/bnsoft.html
37
Popular BN software
  • BNT (Matlab)
  • Intel's PNL (C)
  • Hugin (commercial)
  • Netica (commercial)
  • GMTK (free .exe from Jeff Bilmes)

38
Outline
  • Discriminative models for iid data
  • Beyond iid data: conditional random fields
  • Beyond supervised learning: generative models
  • Beyond optimization: Bayesian models

39
Bayesian inference why?
  • It is optimal.
  • It can easily incorporate prior knowledge (esp.
    useful for small n, large p problems).
  • It properly reports confidence in output (useful
    for combining estimates, and for risk-averse
    applications).
  • It separates models from algorithms.

40
Bayesian inference how?
  • Since we want to integrate, we cannot use
    max-product.
  • Since the unknown parameters are continuous, we
    cannot use sum-product.
  • But we can use EP (expectation propagation),
    which is similar to BP.
  • We can also use variational inference.
  • Or MCMC (eg Gibbs sampling; see the sketch below).
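As a concrete instance of the last option, here is a minimal Gibbs-sampling sketch for a toy conjugate model (an illustration, not from the slides): x_i ~ N(mu, 1/lam), with priors mu ~ N(0, 1/t0) and lam ~ Gamma(a0, b0), so both full conditionals can be sampled exactly.

```python
import numpy as np

def gibbs(x, iters=2000, t0=1e-2, a0=1.0, b0=1.0, seed=0):
    """Gibbs sampler for the posterior over (mu, lam)."""
    rng = np.random.default_rng(seed)
    n, xbar = len(x), x.mean()
    mu, lam = 0.0, 1.0
    samples = []
    for _ in range(iters):
        prec = t0 + n * lam                 # mu | lam, x is Normal
        mu = rng.normal(n * lam * xbar / prec, 1.0 / np.sqrt(prec))
        rate = b0 + 0.5 * np.sum((x - mu) ** 2)
        lam = rng.gamma(a0 + n / 2.0, 1.0 / rate)  # lam | mu, x is Gamma
        samples.append((mu, lam))
    return np.array(samples)

data = np.random.default_rng(1).normal(5.0, 2.0, size=100)
post = gibbs(data)
print(post[1000:].mean(axis=0))             # posterior means after burn-in
```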

41
General purposeBayesian software
  • BUGS (Gibbs sampling)
  • VIBES (variational message passing)
  • Minka and Winn's toolbox (infer.net)

42
Structure of ideal Bayesian toolbox
[The same ideal-toolbox block diagram as on slide 28: generator/GUI/file, train, testData, infer, decisionEngine, performance, decision, visualize, summarize, utilities.]