Title: CS598: Machine Learning and Natural Language. Lecture 7: Probabilistic Classification. Oct. 19 and 21, 2006
1 CS598 Machine Learning and Natural Language
Lecture 7: Probabilistic Classification
Oct. 19 and 21, 2006
- Dan Roth
- University of Illinois, Urbana-Champaign
- danr@cs.uiuc.edu
- http://L2R.cs.uiuc.edu/danr
2 Generative Model
- Model the problem of text correction as that of generating correct sentences.
- Goal: learn a model of the language; use it to predict.
- PARADIGM:
  - Learn a probability distribution over all sentences.
  - Use it to estimate which sentence is more likely:
    Pr(I saw the girl it the park) <> Pr(I saw the girl in the park)
- In the same paradigm we sometimes learn a conditional probability distribution.
3 Before: Error Driven Learning
- Consider a distribution D over space X × Y.
- X is the instance space; Y is the set of labels (e.g., {-1, +1}).
- Given a sample {(x_i, y_i)}, i = 1..m, and a loss function L(h(x), y), find h ∈ H that minimizes Σ_{i=1..m} L(h(x_i), y_i).
- L can be:
  - L(h(x), y) = 1 if h(x) ≠ y, 0 otherwise (0-1 loss)
  - L(h(x), y) = (h(x) - y)^2 (L2 loss)
  - L(h(x), y) = exp(-y h(x))
- Find an algorithm that minimizes the average loss; then we know that things will be okay (as a function of H). (A small sketch of these losses follows.)
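To make the three losses and the empirical risk concrete, here is a minimal Python sketch; the function and variable names are illustrative, not from the lecture:

    # Minimal sketch of the three losses mentioned above (names are illustrative).
    import math

    def zero_one_loss(h_x, y):
        # 1 if the prediction disagrees with the label, 0 otherwise
        return 1.0 if h_x != y else 0.0

    def l2_loss(h_x, y):
        # squared error
        return (h_x - y) ** 2

    def exp_loss(h_x, y):
        # exponential loss, for labels y in {-1, +1} and a real-valued h(x)
        return math.exp(-y * h_x)

    def empirical_risk(loss, predictions, labels):
        # average loss of hypothesis h on the sample {(x_i, y_i)}, i = 1..m
        return sum(loss(h_x, y) for h_x, y in zip(predictions, labels)) / len(labels)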
4 Basics of Bayesian Learning
- Goal: find the best hypothesis from some space H of hypotheses, given the observed data D.
- Define "best" to be the most probable hypothesis in H.
- In order to do that, we need to assume a probability distribution over the class H.
- In addition, we need to know something about the relation between the observed data and the hypotheses (e.g., a coin problem).
- As we will see, we will be Bayesian about other things, e.g., the parameters of the model.
5 Basics of Bayesian Learning
- P(h): the prior probability of a hypothesis h. Reflects background knowledge before data is observed. If there is no information, use a uniform distribution.
- P(D): the probability that this sample of the data is observed (with no knowledge of the hypothesis).
- P(D|h): the probability of observing the sample D, given that the hypothesis h holds.
- P(h|D): the posterior probability of h; the probability that h holds, given that D has been observed.
6 Bayes Theorem
- P(h|D) increases with P(h) and with P(D|h).
- P(h|D) decreases with P(D).
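The formula this slide refers to appeared as a figure in the original deck; its standard statement, consistent with the bullets above, is:

    P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}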
7 Learning Scenario
- The learner considers a set of candidate hypotheses H (models) and attempts to find the most probable one, h ∈ H, given the observed data.
- Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis; Bayes theorem is used to compute it.
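The MAP computation was shown as a figure in the original slides; its standard form (the P(D) term is dropped since it does not depend on h) is:

    h_{MAP} = \arg\max_{h \in H} P(h \mid D)
            = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)}
            = \arg\max_{h \in H} P(D \mid h)\, P(h)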
8 Learning Scenario (2)
- We may assume that, a priori, hypotheses are equally probable.
- We then get the Maximum Likelihood hypothesis.
- Here we just look for the hypothesis that best explains the data.
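With a uniform prior, P(h_i) = P(h_j) for all h_i, h_j ∈ H, the prior term drops out of the MAP expression and we are left with the maximum likelihood hypothesis:

    h_{ML} = \arg\max_{h \in H} P(D \mid h)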
9 Bayes Optimal Classifier
- How should we use the general formalism?
- What should H be?
  - H can be a collection of functions. Given the training data, choose an optimal function; then, given new data, evaluate the selected function on it.
  - H can be a collection of possible predictions. Given the data, try to directly choose the optimal prediction.
  - H can be a collection of (conditional) probability distributions.
  - These choices could be different!
- Specific examples we will discuss:
  - Naive Bayes: a maximum likelihood based algorithm
  - Max Entropy: seemingly, a different selection criterion
  - Hidden Markov Models
10 Bayesian Classifier
- f: X → V, where V is a finite set of values.
- Instances x ∈ X can be described as a collection of features.
- Given an example, assign it the most probable value in V.
11 Bayesian Classifier
- f: X → V, where V is a finite set of values.
- Instances x ∈ X can be described as a collection of features.
- Given an example, assign it the most probable value in V; use Bayes rule to compute it.
- Notational convention: P(y) means P(Y = y).
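The Bayes-rule computation this slide refers to was a figure in the original; in standard form, for an instance described by features x_1, ..., x_n:

    v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid x_1, \ldots, x_n)
            = \arg\max_{v_j \in V} \frac{P(x_1, \ldots, x_n \mid v_j)\, P(v_j)}{P(x_1, \ldots, x_n)}
            = \arg\max_{v_j \in V} P(x_1, \ldots, x_n \mid v_j)\, P(v_j)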
12 Bayesian Classifier
- Given training data we can estimate the two terms.
- Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data.
- However, it is not feasible to estimate P(x_1, ..., x_n | v_j) this way.
13 Bayesian Classifier
- Given training data we can estimate the two terms.
- Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data.
- However, it is not feasible to estimate P(x_1, ..., x_n | v_j).
- In this case we would have to estimate, for each target value, the probability of each instance (most of which will not occur).
14 Bayesian Classifier
- Given training data we can estimate the two terms.
- Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data.
- However, it is not feasible to estimate P(x_1, ..., x_n | v_j).
- In this case we would have to estimate, for each target value, the probability of each instance (most of which will not occur).
- In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities.
15 Naive Bayes
- Assumption: feature values are independent given the target value.
16 Naive Bayes
- Assumption: feature values are independent given the target value.
- Generative model:
  - First choose a value v_j ∈ V according to P(v_j).
  - For each v_j, choose x_1, x_2, ..., x_n according to P(x_k | v_j).
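The naive Bayes assumption and the resulting decision rule were figures in the original deck; their standard statements are:

    P(x_1, \ldots, x_n \mid v_j) = \prod_{k=1}^{n} P(x_k \mid v_j)

    v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{k=1}^{n} P(x_k \mid v_j)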
17 Naive Bayes
- Assumption: feature values are independent given the target value.
- Learning method: estimate n·|V| parameters and use them to compute the new value. (How should we estimate them?)
18 Naive Bayes
- Assumption: feature values are independent given the target value.
- Learning method: estimate n·|V| parameters and use them to compute the new value.
- This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions).
- This is learning without trying to achieve consistency, or even approximate consistency.
19 Naive Bayes
- Assumption: feature values are independent given the target value.
- Learning method: estimate n·|V| parameters and use them to compute the new value.
- This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions); see the counting sketch below.
- This is learning without trying to achieve consistency, or even approximate consistency. Why does it work?
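A minimal counting-based sketch of this "learning without search" in Python; the dataset format and function names are illustrative, not from the lecture:

    # Naive Bayes by counting: estimate P(v) and P(x_k | v), then predict by argmax.
    from collections import Counter, defaultdict

    def train_nb(examples):
        # examples: list of (features, label), where features is a tuple (x_1, ..., x_n)
        label_counts = Counter(label for _, label in examples)
        feat_counts = defaultdict(Counter)   # (position, label) -> Counter over feature values
        for feats, label in examples:
            for k, value in enumerate(feats):
                feat_counts[(k, label)][value] += 1
        priors = {v: c / len(examples) for v, c in label_counts.items()}
        return priors, feat_counts, label_counts

    def predict_nb(x, priors, feat_counts, label_counts):
        # no search: just multiply the estimated probabilities and take the argmax
        def score(v):
            p = priors[v]
            for k, value in enumerate(x):
                p *= feat_counts[(k, v)][value] / label_counts[v]
            return p
        return max(priors, key=score)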
20 Conditional Independence
- Notice that the feature values are conditionally independent given the target value, and are not required to be independent.
- Example: f(x,y) = x ∧ y over the product distribution defined by p(x=0) = p(x=1) = 1/2 and p(y=0) = p(y=1) = 1/2.
- The distribution is defined so that x and y are independent: p(x,y) = p(x) p(y) (interpretation: for every value of x and y).
- But, given that f(x,y) = 0:
  - p(x=1 | f=0) = p(y=1 | f=0) = 1/3
  - p(x=1, y=1 | f=0) = 0
- So x and y are not conditionally independent.
21 Conditional Independence
- The other direction also does not hold: x and y can be conditionally independent but not independent.
- f=0: p(x=1 | f=0) = 1, p(y=1 | f=0) = 0
- f=1: p(x=1 | f=1) = 0, p(y=1 | f=1) = 1
- Assume, say, that p(f=0) = p(f=1) = 1/2.
- Given the value of f, x and y are independent.
- What about unconditional independence?
22 Conditional Independence
- The other direction also does not hold: x and y can be conditionally independent but not independent.
- f=0: p(x=1 | f=0) = 1, p(y=1 | f=0) = 0
- f=1: p(x=1 | f=1) = 0, p(y=1 | f=1) = 1
- Assume, say, that p(f=0) = p(f=1) = 1/2.
- Given the value of f, x and y are independent.
- What about unconditional independence?
- p(x=1) = p(x=1 | f=0) p(f=0) + p(x=1 | f=1) p(f=1) = 1·0.5 + 0·0.5 = 0.5
- p(y=1) = p(y=1 | f=0) p(f=0) + p(y=1 | f=1) p(f=1) = 0·0.5 + 1·0.5 = 0.5
- But p(x=1, y=1) = p(x=1, y=1 | f=0) p(f=0) + p(x=1, y=1 | f=1) p(f=1) = 0
- So x and y are not independent. (A small enumeration check of the slide 20 example follows.)
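A small enumeration that checks the numbers in the example on slide 20 (f(x,y) = x AND y under the uniform product distribution); this script is illustrative only:

    # Enumerate the four equally likely (x, y) pairs for f(x, y) = x AND y
    # and verify conditional (non-)independence given f = 0.
    from itertools import product

    points = [(x, y, x & y) for x, y in product([0, 1], repeat=2)]  # each pair has prob 1/4

    given_f0 = [(x, y) for x, y, f in points if f == 0]
    p_x1_given_f0 = sum(x == 1 for x, _ in given_f0) / len(given_f0)              # 1/3
    p_y1_given_f0 = sum(y == 1 for _, y in given_f0) / len(given_f0)              # 1/3
    p_x1y1_given_f0 = sum(x == 1 and y == 1 for x, y in given_f0) / len(given_f0)  # 0

    print(p_x1_given_f0, p_y1_given_f0, p_x1y1_given_f0)
    # 1/3 * 1/3 != 0, so x and y are NOT conditionally independent given f = 0,
    # even though x and y are independent under the product distribution.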
23 Example

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
24 Estimating Probabilities
- How do we estimate P(observation | v)? (A common answer is sketched below.)
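The slide leaves the estimation question open; the usual answer is relative-frequency counts, often smoothed (e.g., Laplace or m-estimates) so that unseen feature values do not receive probability zero. A minimal sketch with illustrative names:

    # Relative-frequency estimate of P(observation = value | v), with optional Laplace smoothing.
    def conditional_estimate(count_value_and_v, count_v, num_values, alpha=1.0):
        # count_value_and_v: number of training examples with this value and label v
        # count_v:           number of training examples with label v
        # num_values:        number of possible values for this feature
        # alpha = 0 gives the plain maximum likelihood estimate
        return (count_value_and_v + alpha) / (count_v + alpha * num_values)

    # Example: P(Outlook = Overcast | PlayTennis = no) from the table above
    # is 0/5 unsmoothed, but (0 + 1) / (5 + 3) = 0.125 with Laplace smoothing.
    print(conditional_estimate(0, 5, 3, alpha=0.0), conditional_estimate(0, 5, 3, alpha=1.0))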
25 Example
- Compute P(PlayTennis = yes) and P(PlayTennis = no).
- Compute P(outlook = s/oc/r | PlayTennis = yes/no)  (6 numbers)
- Compute P(temp = hot/mild/cool | PlayTennis = yes/no)  (6 numbers)
- Compute P(humidity = hi/nor | PlayTennis = yes/no)  (4 numbers)
- Compute P(wind = weak/strong | PlayTennis = yes/no)  (4 numbers)
26 Example
- Compute P(PlayTennis = yes) and P(PlayTennis = no).
- Compute P(outlook = s/oc/r | PlayTennis = yes/no)  (6 numbers)
- Compute P(temp = hot/mild/cool | PlayTennis = yes/no)  (6 numbers)
- Compute P(humidity = hi/nor | PlayTennis = yes/no)  (4 numbers)
- Compute P(wind = weak/strong | PlayTennis = yes/no)  (4 numbers)
- Given a new instance:
  (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
- Predict: PlayTennis = ?
27 Example
- Given: (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
- P(PlayTennis = yes) = 9/14 = 0.64;  P(PlayTennis = no) = 5/14 = 0.36
- P(outlook = sunny | yes) = 2/9;  P(outlook = sunny | no) = 3/5
- P(temp = cool | yes) = 3/9;  P(temp = cool | no) = 1/5
- P(humidity = hi | yes) = 3/9;  P(humidity = hi | no) = 4/5
- P(wind = strong | yes) = 3/9;  P(wind = strong | no) = 3/5
- P(yes | ...) ∝ 0.0053;  P(no | ...) ∝ 0.0206
28 Example
- Given: (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
- P(PlayTennis = yes) = 9/14 = 0.64;  P(PlayTennis = no) = 5/14 = 0.36
- P(outlook = sunny | yes) = 2/9;  P(outlook = sunny | no) = 3/5
- P(temp = cool | yes) = 3/9;  P(temp = cool | no) = 1/5
- P(humidity = hi | yes) = 3/9;  P(humidity = hi | no) = 4/5
- P(wind = strong | yes) = 3/9;  P(wind = strong | no) = 3/5
- P(yes | ...) ∝ 0.0053;  P(no | ...) ∝ 0.0206
- What if we were asked about Outlook = OC (overcast)?
29 Example
- Given: (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
- P(PlayTennis = yes) = 9/14 = 0.64;  P(PlayTennis = no) = 5/14 = 0.36
- P(outlook = sunny | yes) = 2/9;  P(outlook = sunny | no) = 3/5
- P(temp = cool | yes) = 3/9;  P(temp = cool | no) = 1/5
- P(humidity = hi | yes) = 3/9;  P(humidity = hi | no) = 4/5
- P(wind = strong | yes) = 3/9;  P(wind = strong | no) = 3/5
- P(yes | ...) ∝ 0.0053;  P(no | ...) ∝ 0.0206
- P(no | instance) = 0.0206 / (0.0053 + 0.0206) = 0.795
30 Naive Bayes: Two Classes
- Notice that the naive Bayes method gives a method for predicting, rather than an explicit classifier.
- In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff P(v=1 | x_1, ..., x_n) > P(v=0 | x_1, ..., x_n).
31 Naive Bayes: Two Classes
- Notice that the naive Bayes method gives a method for predicting, rather than an explicit classifier.
- In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff P(v=1 | x_1, ..., x_n) > P(v=0 | x_1, ..., x_n).
32 Naive Bayes: Two Classes
- In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff P(v=1 | x_1, ..., x_n) > P(v=0 | x_1, ..., x_n).
33 Naïve Bayes: Two Classes
- In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff P(v=1 | x_1, ..., x_n) > P(v=0 | x_1, ..., x_n).
34 Naïve Bayes: Two Classes
- In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff P(v=1 | x_1, ..., x_n) > P(v=0 | x_1, ..., x_n).
- We get that the optimal Bayes behavior is given by a linear separator, with weights and bias as in the reconstruction below.
35 Why does it work?
- We have not addressed the question of why this classifier performs well, given that the assumptions are unlikely to be satisfied.
- The linear form of the classifiers provides some hints.
- (More on that later; see also [Roth 99] and [Garg & Roth, ECML'02].) One of the presented papers will also address this partly.
36 Naïve Bayes: Two Classes
- In the case of two classes we have that:
37 Naïve Bayes: Two Classes
- In the case of two classes we have that (1):
- but since (2):
- we get (plug (2) into (1) and do some algebra):
- which is simply the logistic (sigmoid) function used in the neural network representation. (A reconstruction of (1) and (2) follows.)
38 Another look at Naive Bayes
Note: this is a bit different from the previous linearization. Rather than a single function, here we have an argmax over several different functions.
Graphical model: it encodes the NB independence assumption in the edge structure (siblings are independent given their parents).
39 Hidden Markov Model (HMM)
- An HMM is a probabilistic generative model.
- It models how an observed sequence is generated.
- Let's call each position in a sequence a time step.
- At each time step, there are two variables:
  - the current state (hidden)
  - the observation
40 HMM
- Elements:
  - Initial state probability P(s_1)
  - Transition probability P(s_t | s_{t-1})
  - Observation probability P(o_t | s_t)
- As before, the graphical model is an encoding of the independence assumptions.
- Note that we have seen this in the context of POS tagging.
41 HMM for Shallow Parsing
- States: B, I, O
- Observations: actual words and/or part-of-speech tags
42 HMM for Shallow Parsing
- Given a sentence, we can ask what the most likely state sequence is.
- Transition probabilities: P(s_t=B | s_{t-1}=B), P(s_t=I | s_{t-1}=B), P(s_t=O | s_{t-1}=B), P(s_t=B | s_{t-1}=I), P(s_t=I | s_{t-1}=I), P(s_t=O | s_{t-1}=I), ...
- Initial state probabilities: P(s_1=B), P(s_1=I), P(s_1=O)
- Observation probabilities: P(o_t=Mr. | s_t=B), P(o_t=Brown | s_t=B), ..., P(o_t=Mr. | s_t=I), P(o_t=Brown | s_t=I), ...
43 Finding most likely state sequence in HMM (1)
44 Finding most likely state sequence in HMM (2)
45 Finding most likely state sequence in HMM (3)
[Figure: derivation steps; the remaining term to be maximized is a function of s_k.]
46 Finding most likely state sequence in HMM (4)
- Viterbi's algorithm
- Dynamic programming (a sketch follows)
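The derivation on slides 43-46 was given in figures; below is a minimal Viterbi sketch in Python under the HMM factorization above. The state names and probability tables are illustrative toy values, not the lecture's:

    # Viterbi: dynamic programming over delta[t][s] = max probability of any state
    # sequence ending in state s that explains observations o_1..o_t.
    def viterbi(obs, states, init_p, trans_p, emit_p):
        delta = [{s: init_p[s] * emit_p[s].get(obs[0], 1e-12) for s in states}]
        back = [{}]
        for t in range(1, len(obs)):
            delta.append({})
            back.append({})
            for s in states:
                # best previous state to transition from
                prev = max(states, key=lambda r: delta[t-1][r] * trans_p[r][s])
                delta[t][s] = delta[t-1][prev] * trans_p[prev][s] * emit_p[s].get(obs[t], 1e-12)
                back[t][s] = prev
        # backtrack from the best final state
        last = max(states, key=lambda s: delta[-1][s])
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return list(reversed(path))

    # Toy B/I/O example with made-up probabilities (for illustration only):
    states = ["B", "I", "O"]
    init_p = {"B": 0.5, "I": 0.1, "O": 0.4}
    trans_p = {"B": {"B": 0.2, "I": 0.6, "O": 0.2},
               "I": {"B": 0.2, "I": 0.5, "O": 0.3},
               "O": {"B": 0.4, "I": 0.1, "O": 0.5}}
    emit_p = {"B": {"Mr.": 0.4, "Brown": 0.1},
              "I": {"Mr.": 0.1, "Brown": 0.5},
              "O": {"said": 0.3}}
    print(viterbi(["Mr.", "Brown", "said"], states, init_p, trans_p, emit_p))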
47 Learning the Model
- Estimate:
  - Initial state probability P(s_1)
  - Transition probability P(s_t | s_{t-1})
  - Observation probability P(o_t | s_t)
- Unsupervised learning (states are not observed): the EM algorithm.
- Supervised learning (states are observed; more common): ML estimates of the above terms directly from the data (see the counting sketch below).
- Notice that this is completely analogous to the case of naive Bayes, and essentially all other models.
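A counting-based sketch of the supervised (ML) estimates, directly analogous to the naive Bayes counts earlier; the data format and names are illustrative:

    # Estimate initial, transition, and observation probabilities from tagged sequences.
    from collections import Counter, defaultdict

    def ml_estimate(sequences):
        # sequences: list of [(word_1, state_1), ..., (word_T, state_T)]
        init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
        for seq in sequences:
            init[seq[0][1]] += 1                       # count the first state
            for (w, s) in seq:
                emit[s][w] += 1                        # count (state, word) pairs
            for (_, s_prev), (_, s_next) in zip(seq, seq[1:]):
                trans[s_prev][s_next] += 1             # count state bigrams

        def normalize(counts):
            total = sum(counts.values())
            return {k: v / total for k, v in counts.items()}

        init_p = normalize(init)
        trans_p = {s: normalize(c) for s, c in trans.items()}
        emit_p = {s: normalize(c) for s, c in emit.items()}
        return init_p, trans_p, emit_p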
48 Another view of Markov Models
[Figure: input word sequence W (observations) and tag sequence T (states), with the model's independence assumptions.]
49 Another View of Markov Models
[Figure: input sequence of tags t and words w.]
As for NB, the features are pairs and singletons of t's and w's. Only 3 features are active for a given decision.
This can be extended to an argmax that maximizes the prediction of the whole state sequence and is computed, as before, via Viterbi.
50 Learning with Probabilistic Classifiers
- Learning Theory
- We showed that probabilistic predictions can be viewed as predictions via Linear Statistical Queries (LSQ) models [Roth 99].
- The low expressivity explains generalization and robustness.
- Is that all?
- It does not explain why it is possible to (approximately) fit the data with these models. Namely, is there a reason to believe that these hypotheses minimize the empirical error on the sample?
- In general, no (unless it corresponds to some probabilistic assumptions that hold).
51 Learning Protocol
- LSQ hypotheses are computed directly, without assumptions on the underlying distribution:
  - Choose features
  - Compute coefficients
- Is there a reason to believe that an LSQ hypothesis minimizes the empirical error on the sample?
- In general, no (unless it corresponds to some probabilistic assumptions that hold).
52 Learning Protocol: Practice
- LSQ hypotheses are computed directly:
  - Choose features
  - Compute coefficients
- If the hypothesis does not fit the training data:
  - Augment the set of features (forget your original assumption)
53 Example: Probabilistic Classifiers
If the hypothesis does not fit the training data, augment the set of features (forget the assumptions).
Features are pairs and singletons of t's and w's.
Additional features are included.
54 Robustness of Probabilistic Predictors
- Why is it relatively easy to fit the data?
- Consider all distributions with the same marginals. (E.g., a naïve Bayes classifier will predict the same regardless of which of these distributions generated the data.)
- [Garg & Roth, ECML'01]: in most cases (i.e., for most such distributions), the resulting predictor's error is close to that of the optimal classifier (the one given the correct distribution).
55 Summary: Probabilistic Modeling
- Classifiers derived from probability density estimation models were viewed as LSQ hypotheses.
- Probabilistic assumptions:
  - guide feature selection, but also
  - do not allow the use of more general features.
56 A Unified Approach
- Most methods blow up the original feature space.
- And make predictions using a linear representation over the new feature space.
- Note: methods do not have to actually do that, but they produce the same decision as a hypothesis that does [Roth 98, 99, 00].
57 A Unified Approach
- Most methods blow up the original feature space.
- And make predictions using a linear representation over the new feature space.
58 A Unified Approach
- Most methods blow up the original feature space.
- And make predictions using a linear representation over the new feature space.
- Q1: How are the weights determined?
- Q2: How is the new feature space determined? Implications? Restrictions?