Title: CS598: Machine Learning and Natural Language. Lecture 7: Probabilistic Classification. Oct. 19 and 21, 2006
1 CS598 Machine Learning and Natural Language
Lecture 7: Probabilistic Classification
Oct. 19 and 21, 2006
- Dan Roth
- University of Illinois, Urbana-Champaign
- danr@cs.uiuc.edu
- http://L2R.cs.uiuc.edu/danr
2 Generative Model
- Model the problem of text correction as that of generating correct sentences.
- Goal: learn a model of the language; use it to predict.
- PARADIGM:
  - Learn a probability distribution over all sentences.
  - Use it to estimate which sentence is more likely:
    Pr(I saw the girl it the park) <> Pr(I saw the girl in the park)
- In the same paradigm we sometimes learn a conditional probability distribution.
3 Before: Error Driven Learning
- Consider a distribution D over space X × Y.
- X is the instance space; Y is the set of labels (e.g., {-1, +1}).
- Given a sample {(x_i, y_i)}, i = 1..m, and a loss function L(h(x), y), find h ∈ H that minimizes Σ_{i=1..m} L(h(x_i), y_i).
- L can be:
  - L(h(x), y) = 1 if h(x) ≠ y, 0 otherwise (0-1 loss)
  - L(h(x), y) = (h(x) - y)^2 (L2 loss)
  - L(h(x), y) = exp(-y h(x))
- Find an algorithm that minimizes the average loss; then we know that things will be okay (as a function of H). (A small sketch of these losses follows.)
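To make the three losses and the empirical risk concrete, here is a minimal Python sketch; the function and variable names are illustrative, not from the lecture:

    # Minimal sketch of the three losses mentioned above (names are illustrative).
    import math

    def zero_one_loss(h_x, y):
        # 1 if the prediction disagrees with the label, 0 otherwise
        return 1.0 if h_x != y else 0.0

    def l2_loss(h_x, y):
        # squared error
        return (h_x - y) ** 2

    def exp_loss(h_x, y):
        # exponential loss, for labels y in {-1, +1} and a real-valued h(x)
        return math.exp(-y * h_x)

    def empirical_risk(loss, predictions, labels):
        # average loss of hypothesis h on the sample {(x_i, y_i)}, i = 1..m
        return sum(loss(h_x, y) for h_x, y in zip(predictions, labels)) / len(labels)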
4 Basics of Bayesian Learning
- Goal: find the best hypothesis from some space H of hypotheses, given the observed data D.
- Define "best" to be the most probable hypothesis in H.
- In order to do that, we need to assume a probability distribution over the class H.
- In addition, we need to know something about the relation between the observed data and the hypotheses (e.g., a coin problem).
- As we will see, we will be Bayesian about other things, e.g., the parameters of the model.
5 Basics of Bayesian Learning
- P(h): the prior probability of a hypothesis h. Reflects background knowledge before data is observed. If there is no information, use a uniform distribution.
- P(D): the probability that this sample of the data is observed (with no knowledge of the hypothesis).
- P(D|h): the probability of observing the sample D, given that the hypothesis h holds.
- P(h|D): the posterior probability of h; the probability that h holds, given that D has been observed.
6 Bayes Theorem
- P(h|D) increases with P(h) and with P(D|h).
- P(h|D) decreases with P(D).
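The formula this slide refers to appeared as a figure in the original deck; its standard statement, consistent with the bullets above, is:

    P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}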
7 Learning Scenario
- The learner considers a set of candidate hypotheses H (models) and attempts to find the most probable one, h ∈ H, given the observed data.
- Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis; Bayes theorem is used to compute it.
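The MAP computation was shown as a figure in the original slides; its standard form (the P(D) term is dropped since it does not depend on h) is:

    h_{MAP} = \arg\max_{h \in H} P(h \mid D)
            = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)}
            = \arg\max_{h \in H} P(D \mid h)\, P(h)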
8 Learning Scenario (2)
- We may assume that, a priori, hypotheses are equally probable.
- We then get the Maximum Likelihood hypothesis.
- Here we just look for the hypothesis that best explains the data.
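With a uniform prior, P(h_i) = P(h_j) for all h_i, h_j ∈ H, the prior term drops out of the MAP expression and we are left with the maximum likelihood hypothesis:

    h_{ML} = \arg\max_{h \in H} P(D \mid h)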
9 Bayes Optimal Classifier
- How should we use the general formalism?
- What should H be?
  - H can be a collection of functions. Given the training data, choose an optimal function; then, given new data, evaluate the selected function on it.
  - H can be a collection of possible predictions. Given the data, try to directly choose the optimal prediction.
  - H can be a collection of (conditional) probability distributions.
  - These choices could be different!
- Specific examples we will discuss:
  - Naive Bayes: a maximum likelihood based algorithm
  - Max Entropy: seemingly, a different selection criterion
  - Hidden Markov Models
10 Bayesian Classifier
- f: X → V, where V is a finite set of values.
- Instances x ∈ X can be described as a collection of features.
- Given an example, assign it the most probable value in V.
11 Bayesian Classifier
- f: X → V, where V is a finite set of values.
- Instances x ∈ X can be described as a collection of features.
- Given an example, assign it the most probable value in V; use Bayes rule to compute it.
- Notational convention: P(y) means P(Y = y).
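The Bayes-rule computation this slide refers to was a figure in the original; in standard form, for an instance described by features x_1, ..., x_n:

    v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid x_1, \ldots, x_n)
            = \arg\max_{v_j \in V} \frac{P(x_1, \ldots, x_n \mid v_j)\, P(v_j)}{P(x_1, \ldots, x_n)}
            = \arg\max_{v_j \in V} P(x_1, \ldots, x_n \mid v_j)\, P(v_j)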
12 Bayesian Classifier
- Given training data we can estimate the two terms.
- Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data.
- However, it is not feasible to estimate P(x_1, ..., x_n | v_j) this way.
13 Bayesian Classifier
- Given training data we can estimate the two terms.
- Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data.
- However, it is not feasible to estimate P(x_1, ..., x_n | v_j).
- In this case we would have to estimate, for each target value, the probability of each instance (most of which will not occur).
14 Bayesian Classifier
- Given training data we can estimate the two terms.
- Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data.
- However, it is not feasible to estimate P(x_1, ..., x_n | v_j).
- In this case we would have to estimate, for each target value, the probability of each instance (most of which will not occur).
- In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities.
15 Naive Bayes
- Assumption: feature values are independent given the target value.
16 Naive Bayes
- Assumption: feature values are independent given the target value.
- Generative model:
  - First choose a value v_j ∈ V according to P(v_j).
  - For each v_j, choose x_1, x_2, ..., x_n according to P(x_k | v_j).
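The naive Bayes assumption and the resulting decision rule were figures in the original deck; their standard statements are:

    P(x_1, \ldots, x_n \mid v_j) = \prod_{k=1}^{n} P(x_k \mid v_j)

    v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{k=1}^{n} P(x_k \mid v_j)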
17 Naive Bayes
- Assumption: feature values are independent given the target value.
- Learning method: estimate n·|V| parameters and use them to compute the new value. (How should we estimate them?)
18 Naive Bayes
- Assumption: feature values are independent given the target value.
- Learning method: estimate n·|V| parameters and use them to compute the new value.
- This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions).
- This is learning without trying to achieve consistency, or even approximate consistency.
19 Naive Bayes
- Assumption: feature values are independent given the target value.
- Learning method: estimate n·|V| parameters and use them to compute the new value.
- This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions); see the counting sketch below.
- This is learning without trying to achieve consistency, or even approximate consistency. Why does it work?
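A minimal counting-based sketch of this "learning without search" in Python; the dataset format and function names are illustrative, not from the lecture:

    # Naive Bayes by counting: estimate P(v) and P(x_k | v), then predict by argmax.
    from collections import Counter, defaultdict

    def train_nb(examples):
        # examples: list of (features, label), where features is a tuple (x_1, ..., x_n)
        label_counts = Counter(label for _, label in examples)
        feat_counts = defaultdict(Counter)   # (position, label) -> Counter over feature values
        for feats, label in examples:
            for k, value in enumerate(feats):
                feat_counts[(k, label)][value] += 1
        priors = {v: c / len(examples) for v, c in label_counts.items()}
        return priors, feat_counts, label_counts

    def predict_nb(x, priors, feat_counts, label_counts):
        # no search: just multiply the estimated probabilities and take the argmax
        def score(v):
            p = priors[v]
            for k, value in enumerate(x):
                p *= feat_counts[(k, v)][value] / label_counts[v]
            return p
        return max(priors, key=score)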
20 Conditional Independence
- Notice that the feature values are conditionally independent given the target value, and are not required to be independent.
- Example: f(x,y) = x ∧ y over the product distribution defined by p(x=0) = p(x=1) = 1/2 and p(y=0) = p(y=1) = 1/2.
- The distribution is defined so that x and y are independent: p(x,y) = p(x) p(y) (interpretation: for every value of x and y).
- But, given that f(x,y) = 0:
  - p(x=1 | f=0) = p(y=1 | f=0) = 1/3
  - p(x=1, y=1 | f=0) = 0
- So x and y are not conditionally independent.
21 Conditional Independence
- The other direction also does not hold: x and y can be conditionally independent but not independent.
- f=0: p(x=1 | f=0) = 1, p(y=1 | f=0) = 0
- f=1: p(x=1 | f=1) = 0, p(y=1 | f=1) = 1
- Assume, say, that p(f=0) = p(f=1) = 1/2.
- Given the value of f, x and y are independent.
- What about unconditional independence?
22 Conditional Independence
- The other direction also does not hold: x and y can be conditionally independent but not independent.
- f=0: p(x=1 | f=0) = 1, p(y=1 | f=0) = 0
- f=1: p(x=1 | f=1) = 0, p(y=1 | f=1) = 1
- Assume, say, that p(f=0) = p(f=1) = 1/2.
- Given the value of f, x and y are independent.
- What about unconditional independence?
- p(x=1) = p(x=1 | f=0) p(f=0) + p(x=1 | f=1) p(f=1) = 1·0.5 + 0·0.5 = 0.5
- p(y=1) = p(y=1 | f=0) p(f=0) + p(y=1 | f=1) p(f=1) = 0·0.5 + 1·0.5 = 0.5
- But p(x=1, y=1) = p(x=1, y=1 | f=0) p(f=0) + p(x=1, y=1 | f=1) p(f=1) = 0
- So x and y are not independent. (A small enumeration check of the slide 20 example follows.)
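A small enumeration that checks the numbers in the example on slide 20 (f(x,y) = x AND y under the uniform product distribution); this script is illustrative only:

    # Enumerate the four equally likely (x, y) pairs for f(x, y) = x AND y
    # and verify conditional (non-)independence given f = 0.
    from itertools import product

    points = [(x, y, x & y) for x, y in product([0, 1], repeat=2)]  # each pair has prob 1/4

    given_f0 = [(x, y) for x, y, f in points if f == 0]
    p_x1_given_f0 = sum(x == 1 for x, _ in given_f0) / len(given_f0)              # 1/3
    p_y1_given_f0 = sum(y == 1 for _, y in given_f0) / len(given_f0)              # 1/3
    p_x1y1_given_f0 = sum(x == 1 and y == 1 for x, y in given_f0) / len(given_f0)  # 0

    print(p_x1_given_f0, p_y1_given_f0, p_x1y1_given_f0)
    # 1/3 * 1/3 != 0, so x and y are NOT conditionally independent given f = 0,
    # even though x and y are independent under the product distribution.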
23 Example

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
24 Estimating Probabilities
- How do we estimate P(observation | v)? (A common answer is sketched below.)
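The slide leaves the estimation question open; the usual answer is relative-frequency counts, often smoothed (e.g., Laplace or m-estimates) so that unseen feature values do not receive probability zero. A minimal sketch with illustrative names:

    # Relative-frequency estimate of P(observation = value | v), with optional Laplace smoothing.
    def conditional_estimate(count_value_and_v, count_v, num_values, alpha=1.0):
        # count_value_and_v: number of training examples with this value and label v
        # count_v:           number of training examples with label v
        # num_values:        number of possible values for this feature
        # alpha = 0 gives the plain maximum likelihood estimate
        return (count_value_and_v + alpha) / (count_v + alpha * num_values)

    # Example: P(Outlook = Overcast | PlayTennis = no) from the table above
    # is 0/5 unsmoothed, but (0 + 1) / (5 + 3) = 0.125 with Laplace smoothing.
    print(conditional_estimate(0, 5, 3, alpha=0.0), conditional_estimate(0, 5, 3, alpha=1.0))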
25 Example
- Compute P(PlayTennis = yes) and P(PlayTennis = no).
- Compute P(outlook = s/oc/r | PlayTennis = yes/no)  (6 numbers)
- Compute P(temp = hot/mild/cool | PlayTennis = yes/no)  (6 numbers)
- Compute P(humidity = hi/nor | PlayTennis = yes/no)  (4 numbers)
- Compute P(wind = weak/strong | PlayTennis = yes/no)  (4 numbers)
26 Example
- Compute P(PlayTennis = yes) and P(PlayTennis = no).
- Compute P(outlook = s/oc/r | PlayTennis = yes/no)  (6 numbers)
- Compute P(temp = hot/mild/cool | PlayTennis = yes/no)  (6 numbers)
- Compute P(humidity = hi/nor | PlayTennis = yes/no)  (4 numbers)
- Compute P(wind = weak/strong | PlayTennis = yes/no)  (4 numbers)
- Given a new instance:
  (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
- Predict: PlayTennis = ?
27 Example
- Given: (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
- P(PlayTennis = yes) = 9/14 = 0.64;  P(PlayTennis = no) = 5/14 = 0.36
- P(outlook = sunny | yes) = 2/9;  P(outlook = sunny | no) = 3/5
- P(temp = cool | yes) = 3/9;  P(temp = cool | no) = 1/5
- P(humidity = hi | yes) = 3/9;  P(humidity = hi | no) = 4/5
- P(wind = strong | yes) = 3/9;  P(wind = strong | no) = 3/5
- P(yes | ...) ∝ 0.0053;  P(no | ...) ∝ 0.0206
28 Example
- Given: (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
- P(PlayTennis = yes) = 9/14 = 0.64;  P(PlayTennis = no) = 5/14 = 0.36
- P(outlook = sunny | yes) = 2/9;  P(outlook = sunny | no) = 3/5
- P(temp = cool | yes) = 3/9;  P(temp = cool | no) = 1/5
- P(humidity = hi | yes) = 3/9;  P(humidity = hi | no) = 4/5
- P(wind = strong | yes) = 3/9;  P(wind = strong | no) = 3/5
- P(yes | ...) ∝ 0.0053;  P(no | ...) ∝ 0.0206
- What if we were asked about Outlook = OC (overcast)?
29 Example
- Given: (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
- P(PlayTennis = yes) = 9/14 = 0.64;  P(PlayTennis = no) = 5/14 = 0.36
- P(outlook = sunny | yes) = 2/9;  P(outlook = sunny | no) = 3/5
- P(temp = cool | yes) = 3/9;  P(temp = cool | no) = 1/5
- P(humidity = hi | yes) = 3/9;  P(humidity = hi | no) = 4/5
- P(wind = strong | yes) = 3/9;  P(wind = strong | no) = 3/5
- P(yes | ...) ∝ 0.0053;  P(no | ...) ∝ 0.0206
- P(no | instance) = 0.0206 / (0.0053 + 0.0206) = 0.795
30 Naive Bayes: Two Classes
- Notice that the naive Bayes method gives a method for predicting, rather than an explicit classifier.
- In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff P(v=1 | x_1, ..., x_n) > P(v=0 | x_1, ..., x_n).
31 Naive Bayes: Two Classes
- Notice that the naive Bayes method gives a method for predicting, rather than an explicit classifier.
- In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff P(v=1 | x_1, ..., x_n) > P(v=0 | x_1, ..., x_n).
32 Naive Bayes: Two Classes
- In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff P(v=1 | x_1, ..., x_n) > P(v=0 | x_1, ..., x_n).
33 Naïve Bayes: Two Classes
- In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff P(v=1 | x_1, ..., x_n) > P(v=0 | x_1, ..., x_n).
34 Naïve Bayes: Two Classes
- In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff P(v=1 | x_1, ..., x_n) > P(v=0 | x_1, ..., x_n).
- We get that the optimal Bayes behavior is given by a linear separator, with weights and bias as in the reconstruction below.
35 Why does it work?
- We have not addressed the question of why this classifier performs well, given that the assumptions are unlikely to be satisfied.
- The linear form of the classifiers provides some hints.
- (More on that later; see also [Roth 99] and [Garg & Roth, ECML'02].) One of the presented papers will also address this partly.
36 Naïve Bayes: Two Classes
- In the case of two classes we have that:
37 Naïve Bayes: Two Classes
- In the case of two classes we have that (1):
- but since (2):
- we get (plug (2) into (1) and do some algebra):
- which is simply the logistic (sigmoid) function used in the neural network representation. (A reconstruction of (1) and (2) follows.)
38 Another look at Naive Bayes
Note: this is a bit different from the previous linearization. Rather than a single function, here we have an argmax over several different functions.
Graphical model: it encodes the NB independence assumption in the edge structure (siblings are independent given their parents).
39 Hidden Markov Model (HMM)
- An HMM is a probabilistic generative model.
- It models how an observed sequence is generated.
- Let's call each position in a sequence a time step.
- At each time step, there are two variables:
  - the current state (hidden)
  - the observation
40 HMM
- Elements:
  - Initial state probability P(s_1)
  - Transition probability P(s_t | s_{t-1})
  - Observation probability P(o_t | s_t)
- As before, the graphical model is an encoding of the independence assumptions.
- Note that we have seen this in the context of POS tagging.
41 HMM for Shallow Parsing
- States: B, I, O
- Observations: actual words and/or part-of-speech tags
42 HMM for Shallow Parsing
- Given a sentence, we can ask what the most likely state sequence is.
- Transition probabilities: P(s_t=B | s_{t-1}=B), P(s_t=I | s_{t-1}=B), P(s_t=O | s_{t-1}=B), P(s_t=B | s_{t-1}=I), P(s_t=I | s_{t-1}=I), P(s_t=O | s_{t-1}=I), ...
- Initial state probabilities: P(s_1=B), P(s_1=I), P(s_1=O)
- Observation probabilities: P(o_t=Mr. | s_t=B), P(o_t=Brown | s_t=B), ..., P(o_t=Mr. | s_t=I), P(o_t=Brown | s_t=I), ...
43 Finding most likely state sequence in HMM (1)
44 Finding most likely state sequence in HMM (2)
45 Finding most likely state sequence in HMM (3)
[Figure: derivation steps; the remaining term to be maximized is a function of s_k.]
46 Finding most likely state sequence in HMM (4)
- Viterbi's algorithm
- Dynamic programming (a sketch follows)
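The derivation on slides 43-46 was given in figures; below is a minimal Viterbi sketch in Python under the HMM factorization above. The state names and probability tables are illustrative toy values, not the lecture's:

    # Viterbi: dynamic programming over delta[t][s] = max probability of any state
    # sequence ending in state s that explains observations o_1..o_t.
    def viterbi(obs, states, init_p, trans_p, emit_p):
        delta = [{s: init_p[s] * emit_p[s].get(obs[0], 1e-12) for s in states}]
        back = [{}]
        for t in range(1, len(obs)):
            delta.append({})
            back.append({})
            for s in states:
                # best previous state to transition from
                prev = max(states, key=lambda r: delta[t-1][r] * trans_p[r][s])
                delta[t][s] = delta[t-1][prev] * trans_p[prev][s] * emit_p[s].get(obs[t], 1e-12)
                back[t][s] = prev
        # backtrack from the best final state
        last = max(states, key=lambda s: delta[-1][s])
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return list(reversed(path))

    # Toy B/I/O example with made-up probabilities (for illustration only):
    states = ["B", "I", "O"]
    init_p = {"B": 0.5, "I": 0.1, "O": 0.4}
    trans_p = {"B": {"B": 0.2, "I": 0.6, "O": 0.2},
               "I": {"B": 0.2, "I": 0.5, "O": 0.3},
               "O": {"B": 0.4, "I": 0.1, "O": 0.5}}
    emit_p = {"B": {"Mr.": 0.4, "Brown": 0.1},
              "I": {"Mr.": 0.1, "Brown": 0.5},
              "O": {"said": 0.3}}
    print(viterbi(["Mr.", "Brown", "said"], states, init_p, trans_p, emit_p))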
47 Learning the Model
- Estimate:
  - Initial state probability P(s_1)
  - Transition probability P(s_t | s_{t-1})
  - Observation probability P(o_t | s_t)
- Unsupervised learning (states are not observed): the EM algorithm.
- Supervised learning (states are observed; more common): ML estimates of the above terms directly from the data (see the counting sketch below).
- Notice that this is completely analogous to the case of naive Bayes, and essentially all other models.
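A counting-based sketch of the supervised (ML) estimates, directly analogous to the naive Bayes counts earlier; the data format and names are illustrative:

    # Estimate initial, transition, and observation probabilities from tagged sequences.
    from collections import Counter, defaultdict

    def ml_estimate(sequences):
        # sequences: list of [(word_1, state_1), ..., (word_T, state_T)]
        init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
        for seq in sequences:
            init[seq[0][1]] += 1                       # count the first state
            for (w, s) in seq:
                emit[s][w] += 1                        # count (state, word) pairs
            for (_, s_prev), (_, s_next) in zip(seq, seq[1:]):
                trans[s_prev][s_next] += 1             # count state bigrams

        def normalize(counts):
            total = sum(counts.values())
            return {k: v / total for k, v in counts.items()}

        init_p = normalize(init)
        trans_p = {s: normalize(c) for s, c in trans.items()}
        emit_p = {s: normalize(c) for s, c in emit.items()}
        return init_p, trans_p, emit_p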
48 Another view of Markov Models
[Figure: input word sequence W (observations) and tag sequence T (states), with the model's independence assumptions.]
49 Another View of Markov Models
[Figure: input sequence of tags t and words w.]
As for NB, the features are pairs and singletons of t's and w's. Only 3 features are active for a given decision.
This can be extended to an argmax that maximizes the prediction of the whole state sequence and is computed, as before, via Viterbi.
50 Learning with Probabilistic Classifiers
- Learning Theory
- We showed that probabilistic predictions can be viewed as predictions via Linear Statistical Queries (LSQ) models [Roth 99].
- The low expressivity explains generalization and robustness.
- Is that all?
- It does not explain why it is possible to (approximately) fit the data with these models. Namely, is there a reason to believe that these hypotheses minimize the empirical error on the sample?
- In general, no (unless it corresponds to some probabilistic assumptions that hold).
51 Learning Protocol
- LSQ hypotheses are computed directly, without assumptions on the underlying distribution:
  - Choose features
  - Compute coefficients
- Is there a reason to believe that an LSQ hypothesis minimizes the empirical error on the sample?
- In general, no (unless it corresponds to some probabilistic assumptions that hold).
52 Learning Protocol: Practice
- LSQ hypotheses are computed directly:
  - Choose features
  - Compute coefficients
- If the hypothesis does not fit the training data:
  - Augment the set of features (forget your original assumption)
53 Example: Probabilistic Classifiers
If the hypothesis does not fit the training data, augment the set of features (forget the assumptions).
Features are pairs and singletons of t's and w's.
Additional features are included.
54 Robustness of Probabilistic Predictors
- Why is it relatively easy to fit the data?
- Consider all distributions with the same marginals. (E.g., a naïve Bayes classifier will predict the same regardless of which of these distributions generated the data.)
- [Garg & Roth, ECML'01]: in most cases (i.e., for most such distributions), the resulting predictor's error is close to that of the optimal classifier (the one given the correct distribution).
55 Summary: Probabilistic Modeling
- Classifiers derived from probability density estimation models were viewed as LSQ hypotheses.
- Probabilistic assumptions:
  - guide feature selection, but also
  - do not allow the use of more general features.
56 A Unified Approach
- Most methods blow up the original feature space.
- And make predictions using a linear representation over the new feature space.
- Note: methods do not have to actually do that, but they produce the same decision as a hypothesis that does [Roth 98, 99, 00].
57 A Unified Approach
- Most methods blow up the original feature space.
- And make predictions using a linear representation over the new feature space.
58 A Unified Approach
- Most methods blow up the original feature space.
- And make predictions using a linear representation over the new feature space.
- Q1: How are the weights determined?
- Q2: How is the new feature space determined? Implications? Restrictions?