Sequential Learning with Dependency Nets - PowerPoint PPT Presentation

1
Sequential Learning with Dependency Nets
  • William W. Cohen
  • 2/22

2
Announcements
  • Critiques for the week due Tuesday
  • Email to Vitor and me
  • (Critiques showing evidence of classroom/classwork
    multiplexing are considered really, really late)
  • Confusion about the number of student presentations
    for today (1? 2?)
  • Please don't make changes after I copy the
    presentations over to the web page
  • Do we need a system?
  • Office hours are no-appointment-necessary
  • Appointments are via sharonw_at_cs
  • Preferences on later topics?
  • Relation extraction
  • Semantic role labeling
  • Semi-supervised IE; IE on the web and large
    corpora; bootstrapping

3
CRFs: the good, the bad, and the cumbersome
  • Good points
  • Global optimization of weight vector that guides
    decision making
  • Trade off decisions made at different points in
    sequence
  • Worries
  • Cost (of training)
  • Complexity (do we need all this math?)
  • Amount of context
  • The matrix for the normalizer is |Y| x |Y|, so high-order
    models with many classes get expensive fast.
  • Strong commitment to maxent-style learning
  • Loglinear models are nice, but nothing is always
    best.

4
Dependency Nets
6
  • Proposed solution
  • the parents of a node are its Markov blanket
  • like an undirected Markov net: capture all
    correlational associations
  • one conditional probability for each node X,
    namely P(X | parents of X)
  • like a directed Bayes net: no messy clique
    potentials

7
Example: bidirectional chains

[Figure: a bidirectional chain of labels Y1, Y2, …, Yi over the tokens "When will dr Cohen post the notes"]
8
DN chains


[Figure: the dependency-net chain of labels Y1, Y2, …, Yi over the tokens "When will dr Cohen post the notes"]
  • How do we do inference? Iteratively:
  • Pick values for Y1, Y2, … at random
  • Pick some j, and compute Pr(Yj | x, the current
    values of the other Y's)
  • Set a new value of Yj according to this
    distribution
  • Go back to (2)
9
This is an MCMC process

Markov Chain Monte Carlo: a randomized process that
changes y(t) to y(t+1), with a transition probability
that doesn't depend on the y's before y(t)
One particular run

  • How do we do inference? Iteratively:
  • Pick values for Y1, Y2, … at random: y(0)
  • Pick some j, and compute Pr(Yj | x, the current
    values of the other Y's)
  • Set a new value of Yj according to this: y(1)
  • Go back to (2) and repeat to get y(1), y(2), …,
    y(t), …
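The update loop above is plain Gibbs sampling. A minimal runnable sketch, with the caveat that the chain structure, the toy local conditional `p_cond`, and all variable names are illustrative assumptions rather than anything from the slides:

```python
import random

random.seed(0)

# Toy dependency net: a chain of binary labels Y1..Yn where each
# node's parents (its Markov blanket) are its chain neighbors.
def p_cond(j, y):
    # A made-up local conditional Pr(Y_j = 1 | neighbors) that
    # prefers agreeing with the neighboring labels.
    neighbors = [y[k] for k in (j - 1, j + 1) if 0 <= k < len(y)]
    agree = sum(1 for v in neighbors if v == 1)
    return (1 + 2 * agree) / (2 + 2 * len(neighbors))

def gibbs_step(y):
    # Pick some j, compute Pr(Y_j | current values of the others),
    # and set a new value of Y_j according to that distribution.
    j = random.randrange(len(y))
    y[j] = 1 if random.random() < p_cond(j, y) else 0
    return y

# Pick values y(0) at random, then iterate to get y(1), y(2), ..., y(t)
y = [random.randrange(2) for _ in range(5)]
for t in range(1000):
    y = gibbs_step(y)
print(y)  # a draw from (approximately) the chain's stationary distribution
```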
11
This is an MCMC process

Claim: suppose Y(t) is drawn from some distribution D
that is stationary under the transition (i.e.,
resampling one Yj from its local conditional leaves D
unchanged). Then Y(t+1) is also drawn from D: the
random flip doesn't move us away from D.
12
This is an MCMC process

Burn-in
Claim: if you wait long enough, then for some t, Y(t)
will be drawn from a stationary distribution D, under
certain reasonable conditions (e.g., the graph of
potential edges is connected, …). So D is a sink.
14
This is an MCMC process

[Figure: one run of the chain; the burn-in prefix is discarded and the remaining samples are averaged for prediction]
  • An algorithm:
  • Run the MCMC chain for a long time t, and hope
    that Y(t) will be drawn from the target
    distribution D.
  • Run the MCMC chain for a while longer and save
    the sample S = {Y(t), Y(t+1), …, Y(t+m)}
  • Use S to answer probabilistic queries like
    Pr(Yj | X)
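Answering queries from the saved sample (the last step above) is just counting. A sketch under the assumption that some sampler has already produced a post-burn-in sample list `S`; here `S` is faked with uniform draws, and all names are illustrative:

```python
import random

random.seed(0)

# Pretend S = [Y(t), Y(t+1), ..., Y(t+m)] was saved from an MCMC run
# after burn-in; here it is faked with uniform binary label vectors.
S = [[random.randrange(2) for _ in range(4)] for _ in range(1000)]

def estimate_marginal(S, j, value=1):
    # Approximate Pr(Y_j = value | X) by the fraction of saved
    # samples in which Y_j took that value.
    return sum(1 for y in S if y[j] == value) / len(S)

print(estimate_marginal(S, 2))  # near 0.5 for this fake uniform sampler
```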

15
More on MCMC
  • This particular process is Gibbs sampling
  • Transition probabilities are defined by sampling
    from the posterior of one variable Yj given the
    others.
  • MCMC is a very general-purpose inference scheme
    (and sometimes very slow)
  • On the plus side, learning is relatively cheap,
    since there's no inference involved (!)
  • A dependency net is closely related to a Markov
    random field learned by maximizing
    pseudo-likelihood
  • Identical?
  • The statistical relational learning community has
    some proponents of this approach
  • Pedro Domingos, David Jensen, …
  • A big advantage is the generality of the approach
  • Sparse learners (e.g., L1-regularized maxent,
    decision trees, …) can be used to infer the Markov
    blanket (NIPS 2006)

16
Examples


[Figure: the label chain Y1, Y2, …, Yi over the tokens "When will dr Cohen post the notes"]
17
Examples
[Figure: two coupled label chains over the tokens "When will dr Cohen post the notes": a POS-tag chain Z1, Z2, …, Zi above and a BIO entity-tag chain Y1, Y2, …, Yi below; the figure also shows the token "Mahesh"]
18
Examples


[Figure: the label chain Y1, Y2, …, Yi over the tokens "When will dr Cohen post the notes"]
19
Dependency nets
  • The bad and the ugly
  • Inference is less efficient: MCMC sampling
  • Can't reconstruct the probability via the chain
    rule
  • Networks might be inconsistent
  • i.e., the local P(x | pa(x))'s don't define a pdf
  • Exactly equal, representationally, to normal
    undirected Markov nets
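The inconsistency point can be made concrete for two binary variables: the pair P(x | y), P(y | x) arises from some joint distribution iff the ratio P(x | y)/P(y | x) factors as u(x)v(y) (a compatibility condition due to Arnold and Press), which for 2x2 tables reduces to a cross-ratio test. The tables below are illustrative examples, not from the slides:

```python
# Compatibility test for two binary variables: the local conditionals
# P(x|y) and P(y|x) are the conditionals of some joint pdf iff the
# ratio P(x|y)/P(y|x) factors as u(x)*v(y), i.e. its 2x2 cross-ratio is 1.

def consistent(p_x_given_y, p_y_given_x, tol=1e-9):
    # p_x_given_y[y][x] = P(X=x | Y=y);  p_y_given_x[x][y] = P(Y=y | X=x)
    r = [[p_x_given_y[y][x] / p_y_given_x[x][y] for y in (0, 1)]
         for x in (0, 1)]
    return abs(r[0][0] * r[1][1] / (r[0][1] * r[1][0]) - 1.0) < tol

# Consistent pair: the conditionals of the joint
# p(0,0)=.1, p(0,1)=.2, p(1,0)=.3, p(1,1)=.4
p_x_given_y = [[0.25, 0.75], [1 / 3, 2 / 3]]
p_y_given_x = [[1 / 3, 2 / 3], [3 / 7, 4 / 7]]
print(consistent(p_x_given_y, p_y_given_x))       # True

# Inconsistent pair: perturb one table so no joint matches both
p_y_given_x_bad = [[0.5, 0.5], [3 / 7, 4 / 7]]
print(consistent(p_x_given_y, p_y_given_x_bad))   # False
```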

21
Dependency nets
  • The good
  • Learning is simple and elegant (if you know each
    node's Markov blanket): just learn a
    probabilistic classifier for P(X | pa(X)) for each
    node X.
  • (You might not learn a consistent model, but
    you'll probably learn a reasonably good one.)
  • Inference can be sped up substantially over
    naïve Gibbs sampling.

22
Dependency nets
  • Learning is simple and elegant (if you know each
    node's Markov blanket): just learn a
    probabilistic classifier for P(X | pa(X)) for each
    node X.

Pr(y1 | x, y2)
Pr(y2 | x, y1, y3)
Pr(y3 | x, y2, y4)
Pr(y4 | x, y3)

[Figure: a chain y1, y2, y3, y4, each node also connected to the input x]

Learning is local, but inference is not, and need
not be unidirectional
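A minimal sketch of this local learning step for a label chain, where each node's classifier for P(Y_j | neighbors) is just a smoothed count table (the training sequences and all names are made up for illustration; any probabilistic classifier could stand in):

```python
from collections import Counter

# Toy fully-labeled training sequences over binary labels.
train = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]

# Fit one local conditional per interior node: count how often each
# middle label co-occurs with its Markov-blanket context.
counts, totals = Counter(), Counter()
for seq in train:
    for j in range(1, len(seq) - 1):
        ctx = (seq[j - 1], seq[j + 1])   # the node's chain neighbors
        counts[(ctx, seq[j])] += 1
        totals[ctx] += 1

def p_local(yj, left, right, k=1.0):
    # Add-k smoothed estimate of Pr(Y_j = yj | left, right neighbors)
    ctx = (left, right)
    return (counts[(ctx, yj)] + k) / (totals[ctx] + 2 * k)

print(p_local(1, 1, 1))   # Pr(middle label is 1 given both neighbors are 1)
```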
23
Toutanova, Klein, Manning, Singer
  • Dependency nets for POS tagging vs CMMs.
  • Maxent is used for local conditional model.
  • Goals
  • An easy-to-train bidirectional model
  • A really good POS tagger

24
Toutanova et al
  • Don't use Gibbs sampling for inference; instead
    use a Viterbi variant (which is not guaranteed to
    produce the ML sequence)

Example: D = {(1,1), (1,1), (1,1), (1,2), (2,1), (3,3)}. The ML state is (1,1), but P(a=1 | b=1) P(b=1 | a=1) < 1 = P(a=3 | b=3) P(b=3 | a=3).
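The slide's counterexample, D = {(1,1), (1,1), (1,1), (1,2), (2,1), (3,3)}, can be checked numerically: (1,1) is the maximum-likelihood state, yet the dependency-net score P(a | b) P(b | a) is larger for (3,3), so maximizing the product of local conditionals need not recover the ML sequence:

```python
from collections import Counter
from fractions import Fraction

# D = {(1,1) x3, (1,2), (2,1), (3,3)}: (1,1) is the most frequent
# (maximum-likelihood) state, but the product of local conditionals
# P(a|b) * P(b|a) scores (3,3) higher.
D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]

pair = Counter(D)
a_marg = Counter(a for a, b in D)
b_marg = Counter(b for a, b in D)

def dn_score(a, b):
    # P(a | b) * P(b | a), estimated by counting in D
    return Fraction(pair[(a, b)], b_marg[b]) * Fraction(pair[(a, b)], a_marg[a])

print(dn_score(1, 1))   # 9/16
print(dn_score(3, 3))   # 1
```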
25
Results with model
26
Results with model
27
Results with model
Best model includes some special unknown-word
features, including a crude company-name
detector
28
Results with model
Final test-set results
MXPost (Ratnaparkhi): 47.6, 96.4, 86.2
CRF (Lafferty et al., ICML 2001): 95.7, 76.4
29
Other comments
  • Smoothing (quadratic regularization, aka Gaussian
    prior) is important: it avoids the overfitting
    effects reported elsewhere