1
CS598: Machine Learning and Natural Language
Lecture 7: Probabilistic Classification, Oct. 19, 21 2006
  • Dan Roth
  • University of Illinois, Urbana-Champaign
  • danr@cs.uiuc.edu
  • http://L2R.cs.uiuc.edu/danr

2
Generative Model
  • Model the problem of text correction as that of generating correct sentences.
  • Goal: learn a model of the language; use it to predict.
  • PARADIGM
  • Learn a probability distribution over all sentences.
  • Use it to estimate which sentence is more likely (a toy sketch follows below).
  • Pr(I saw the girl it the park) <> Pr(I saw the girl in the park)
  • In the same paradigm we sometimes learn a conditional probability distribution.
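A minimal sketch of the comparison above under a toy bigram model (the probabilities and the default for unseen bigrams are hypothetical, purely for illustration, not estimates from a real corpus):

    from functools import reduce

    # Hypothetical bigram probabilities; a real model would estimate these from a corpus.
    bigram_prob = {
        ("girl", "in"): 0.050, ("girl", "it"): 0.001,
        ("in", "the"): 0.400,  ("it", "the"): 0.002,
    }

    def sentence_score(words, default=0.01):
        # Product of bigram probabilities; unseen bigrams get a default value.
        return reduce(lambda acc, pair: acc * bigram_prob.get(pair, default),
                      zip(words, words[1:]), 1.0)

    s1 = "I saw the girl it the park".split()
    s2 = "I saw the girl in the park".split()
    print(sentence_score(s1) < sentence_score(s2))  # True: the corrected sentence scores higher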

3
Before: Error Driven Learning
  • Consider a distribution D over space X × Y
  • X - the instance space; Y - the set of labels (e.g., ±1)
  • Given a sample {(x_i, y_i)}, i = 1..m, and a loss function L(x,y), find h ∈ H that minimizes
  • Σ_{i=1..m} L(h(x_i), y_i)
  • L can be: L(h(x),y) = 1 if h(x) ≠ y, o/w L(h(x),y) = 0   (0-1 loss)
  • L(h(x),y) = (h(x) - y)^2   (L2 loss)
  • L(h(x),y) = exp(-y h(x))   (exponential loss)
  • Find an algorithm that minimizes average loss; then, we know that things will be okay (as a function of H). A sketch of these losses follows.
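A short sketch of these three losses and the empirical (average) loss, assuming labels y ∈ {-1, +1} and a real-valued prediction h(x) whose sign is the predicted label:

    import math

    def zero_one_loss(hx, y):
        return 0.0 if (hx >= 0) == (y > 0) else 1.0   # 1 iff the predicted sign disagrees with y

    def squared_loss(hx, y):                          # L2 loss
        return (hx - y) ** 2

    def exp_loss(hx, y):                              # exponential loss
        return math.exp(-y * hx)

    # Empirical risk: the average loss over a (hypothetical) sample of (h(x), y) pairs.
    sample = [(0.8, 1), (-0.3, 1), (-1.2, -1)]
    for loss in (zero_one_loss, squared_loss, exp_loss):
        print(loss.__name__, sum(loss(hx, y) for hx, y in sample) / len(sample))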

4
Basics of Bayesian Learning
  • Goal: find the best hypothesis from some space H of hypotheses, given the observed data D.
  • Define best to be the most probable hypothesis in H.
  • In order to do that, we need to assume a probability distribution over the class H.
  • In addition, we need to know something about the relation between the observed data and the hypotheses. (E.g., a coin problem.)
  • As we will see, we will be Bayesian about other things, e.g., the parameters of the model.

5
Basics of Bayesian Learning
  • P(h) - the prior probability of a hypothesis h.
  • Reflects background knowledge before data is observed. If no information - uniform distribution.
  • P(D) - the probability that this sample of the data is observed. (No knowledge of the hypothesis.)
  • P(D|h) - the probability of observing the sample D, given that the hypothesis h holds.
  • P(h|D) - the posterior probability of h. The probability that h holds, given that D has been observed.

6
Bayes Theorem
  • P(h|D) = P(D|h) P(h) / P(D)
  • P(h|D) increases with P(h) and with P(D|h)
  • P(h|D) decreases with P(D)

7
Learning Scenario
  • The learner considers a set of candidate
    hypotheses H (models), and attempts to find the
    most probable one h ?H, given the observed data.
  • Such maximally probable hypothesis is called
    maximum a posteriori hypothesis (MAP) Bayes
    theorem is used to compute it

8
Learning Scenario (2)
  • We may assume that a priori, hypotheses are equally probable: P(h_i) = P(h_j) for all h_i, h_j ∈ H.
  • We get the Maximum Likelihood hypothesis: h_ML = argmax_{h ∈ H} P(D|h)
  • Here we just look for the hypothesis that best explains the data.

9
Bayes Optimal Classifier
  • How should we use the general formalism?
  • What should H be?
  • H can be a collection of functions. Given the
    training data, choose an optimal function. Then,
    given new data, evaluate the selected function on
    it.
  • H can be a collection of possible predictions.
    Given the data, try to directly choose the
    optimal prediction.
  • H can be a collection of (conditional)
    probability distributions.
  • Could be different!
  • Specific examples we will discuss:
  • Naive Bayes: a maximum likelihood based algorithm
  • Max Entropy: seemingly, a different selection criterion
  • Hidden Markov Models

10
Bayesian Classifier
  • f: X → V, V a finite set of values
  • Instances x ∈ X can be described as a collection of features: x = (x_1, x_2, ..., x_n)
  • Given an example, assign it the most probable value in V:
  • v_MAP = argmax_{v ∈ V} P(v | x_1, x_2, ..., x_n)

11
Bayesian Classifier
  • f: X → V, V a finite set of values
  • Instances x ∈ X can be described as a collection of features: x = (x_1, x_2, ..., x_n)
  • Given an example, assign it the most probable value in V.
  • Bayes Rule:
  • v_MAP = argmax_{v ∈ V} P(v | x_1, ..., x_n) = argmax_{v ∈ V} P(x_1, ..., x_n | v) P(v) / P(x_1, ..., x_n) = argmax_{v ∈ V} P(x_1, ..., x_n | v) P(v)
  • Notational convention: P(y) means P(Y=y)

12
Bayesian Classifier
  • Given training data we can estimate the two terms.
  • Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data.
  • However, it is not feasible to estimate P(x_1, ..., x_n | v_j).

13
Bayesian Classifier
  • Given training data we can estimate the two terms.
  • Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data.
  • However, it is not feasible to estimate P(x_1, ..., x_n | v_j).
  • In this case we have to estimate, for each target value, the probability of each instance (most of which will not occur).

14
Bayesian Classifier
  • Given training data we can estimate the two terms.
  • Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data.
  • However, it is not feasible to estimate P(x_1, ..., x_n | v_j).
  • In this case we have to estimate, for each target value, the probability of each instance (most of which will not occur).
  • In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities.

15
Naive Bayes
  • Assumption: feature values are independent given the target value
  • P(x_1, ..., x_n | v_j) = Π_{k=1..n} P(x_k | v_j)

16
Naive Bayes
  • Assumption: feature values are independent given the target value
  • Generative model (a sampling sketch follows):
  • First choose a value v_j ∈ V according to P(v_j)
  • Given v_j, choose x_1, x_2, ..., x_n independently according to P(x_k | v_j)
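A minimal sampling sketch of this generative story; the label prior and the per-feature probabilities below are hypothetical toy parameters:

    import random

    p_v = {"yes": 0.6, "no": 0.4}              # P(v), hypothetical
    p_x_given_v = {                            # P(x_k = 1 | v) for three binary features, hypothetical
        "yes": [0.9, 0.2, 0.7],
        "no":  [0.3, 0.8, 0.1],
    }

    def sample_example():
        # First draw the label v according to P(v) ...
        v = random.choices(list(p_v), weights=list(p_v.values()))[0]
        # ... then draw each feature independently from P(x_k | v).
        x = [1 if random.random() < p else 0 for p in p_x_given_v[v]]
        return x, v

    print(sample_example())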

17
Naive Bayes
  • Assumption: feature values are independent given the target value
  • Learning method: estimate n·|V| parameters and use them to compute the new value. (How to estimate them?)

18
Naive Bayes
  • Assumption: feature values are independent given the target value
  • Learning method: estimate n·|V| parameters and use them to compute the new value.
  • This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions).
  • This is learning without trying to achieve consistency or even approximate consistency.

19
Naive Bayes
  • Assumption: feature values are independent given the target value
  • Learning method: estimate n·|V| parameters and use them to compute the new value.
  • This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions).
  • This is learning without trying to achieve consistency or even approximate consistency. Why does it work?

20
Conditional Independence
  • Notice that the feature values are conditionally independent given the target value, and are not required to be independent.
  • Example: f(x,y) = x ∧ y over the product distribution defined by p(x=0) = p(x=1) = 1/2 and p(y=0) = p(y=1) = 1/2
  • The distribution is defined so that x and y are independent: p(x,y) = p(x)p(y) (interpretation: for every value of x and y)
  • But, given that f(x,y) = 0:
  • p(x=1|f=0) = p(y=1|f=0) = 1/3
  • p(x=1, y=1|f=0) = 0
  • so x and y are not conditionally independent. (A brute-force check follows.)
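A brute-force check of this example (f = x ∧ y under the uniform product distribution):

    from itertools import product

    points = [(x, y, 0.25) for x, y in product([0, 1], repeat=2)]     # p(x)p(y) = 1/4 for each pair

    def prob(pred):
        return sum(p for x, y, p in points if pred(x, y))

    p_f0 = prob(lambda x, y: x * y == 0)                              # p(f=0) = 3/4
    print(prob(lambda x, y: x == 1 and x * y == 0) / p_f0)            # p(x=1 | f=0) = 1/3
    print(prob(lambda x, y: y == 1 and x * y == 0) / p_f0)            # p(y=1 | f=0) = 1/3
    print(prob(lambda x, y: x == 1 and y == 1 and x * y == 0) / p_f0) # p(x=1,y=1 | f=0) = 0, not 1/9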

21
Conditional Independence
  • The other direction also does not hold.
  • x and y can be conditionally independent but
    not independent.
  • f0 p(x1f0) 1, p(y1f0) 0
  • f1 p(x1f1) 0, p(y1f1) 1
  • and assume, say, that p(f0)
    p(f1)1/2
  • Given the value of f, x and y are
    independent.
  • What about unconditional independence ?

22
Conditional Independence
  • The other direction also does not hold: x and y can be conditionally independent but not independent.
  • f=0: p(x=1|f=0) = 1, p(y=1|f=0) = 0
  • f=1: p(x=1|f=1) = 0, p(y=1|f=1) = 1
  • and assume, say, that p(f=0) = p(f=1) = 1/2
  • Given the value of f, x and y are independent.
  • What about unconditional independence?
  • p(x=1) = p(x=1|f=0)p(f=0) + p(x=1|f=1)p(f=1) = 1·0.5 + 0·0.5 = 0.5
  • p(y=1) = p(y=1|f=0)p(f=0) + p(y=1|f=1)p(f=1) = 0·0.5 + 1·0.5 = 0.5
  • But p(x=1, y=1) = p(x=1,y=1|f=0)p(f=0) + p(x=1,y=1|f=1)p(f=1) = 0 ≠ p(x=1)p(y=1) = 0.25,
  • so x and y are not independent. (A numerical check follows.)
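The same arithmetic, checked numerically:

    p_f = {0: 0.5, 1: 0.5}
    p_x1_given_f = {0: 1.0, 1: 0.0}
    p_y1_given_f = {0: 0.0, 1: 1.0}

    p_x1 = sum(p_x1_given_f[f] * p_f[f] for f in p_f)                       # 0.5
    p_y1 = sum(p_y1_given_f[f] * p_f[f] for f in p_f)                       # 0.5
    # x and y are independent given f, so the joint factors within each value of f.
    p_x1_y1 = sum(p_x1_given_f[f] * p_y1_given_f[f] * p_f[f] for f in p_f)  # 0.0
    print(p_x1 * p_y1, p_x1_y1)                                             # 0.25 vs. 0.0: not independent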

23
Example
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
 1   Sunny     Hot          High      Weak    No
 2   Sunny     Hot          High      Strong  No
 3   Overcast  Hot          High      Weak    Yes
 4   Rain      Mild         High      Weak    Yes
 5   Rain      Cool         Normal    Weak    Yes
 6   Rain      Cool         Normal    Strong  No
 7   Overcast  Cool         Normal    Strong  Yes
 8   Sunny     Mild         High      Weak    No
 9   Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
24
Estimating Probabilities
  • How do we estimate P(observation | v)? (A counting sketch follows.)
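One possible answer, sketched in code: the maximum-likelihood estimate is the relative frequency of the observation among training examples with label v; the optional add-k term is one common smoothing choice (an assumption here, not necessarily what the original slide proposed):

    def estimate(data, feature, value, label, k=0.0, n_values=1):
        # data: list of (features_dict, label) pairs
        in_class = [f for f, v in data if v == label]
        count = sum(1 for f in in_class if f[feature] == value)
        return (count + k) / (len(in_class) + k * n_values)

    toy = [({"outlook": "sunny"}, "no"),
           ({"outlook": "rain"},  "yes"),
           ({"outlook": "sunny"}, "yes")]
    print(estimate(toy, "outlook", "sunny", "yes"))                   # 0.5 (1 of the 2 'yes' examples)
    print(estimate(toy, "outlook", "sunny", "yes", k=1, n_values=3))  # smoothed: (1+1)/(2+3) = 0.4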

25
Example
  • Compute P(PlayTennis yes) P(PlayTennis no)
  • Compute P(outlook s/oc/r PlayTennis
    yes/no) (6 numbers)
  • Compute P(Temp h/mild/cool PlayTennis
    yes/no) (6 numbers)
  • Compute P(humidity hi/nor PlayTennis
    yes/no) (4 numbers)
  • Compute P(wind w/st PlayTennis
    yes/no) (4 numbers)

26
Example
  • Compute P(PlayTennis = yes), P(PlayTennis = no)
  • Compute P(outlook = s/oc/r | PlayTennis = yes/no)      (6 numbers)
  • Compute P(Temp = h/mild/cool | PlayTennis = yes/no)    (6 numbers)
  • Compute P(humidity = hi/nor | PlayTennis = yes/no)     (4 numbers)
  • Compute P(wind = w/st | PlayTennis = yes/no)           (4 numbers)
  • Given a new instance:
  • (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
  • Predict: PlayTennis = ?

27
Example
  • Given (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong):
  • P(PlayTennis = yes) = 9/14 = 0.64      P(PlayTennis = no) = 5/14 = 0.36
  • P(outlook = sunny | yes) = 2/9         P(outlook = sunny | no) = 3/5
  • P(temp = cool | yes) = 3/9             P(temp = cool | no) = 1/5
  • P(humidity = hi | yes) = 3/9           P(humidity = hi | no) = 4/5
  • P(wind = strong | yes) = 3/9           P(wind = strong | no) = 3/5
  • P(yes | ...) ∝ 0.0053                  P(no | ...) ∝ 0.0206

28
Example
  • Given (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong):
  • P(PlayTennis = yes) = 9/14 = 0.64      P(PlayTennis = no) = 5/14 = 0.36
  • P(outlook = sunny | yes) = 2/9         P(outlook = sunny | no) = 3/5
  • P(temp = cool | yes) = 3/9             P(temp = cool | no) = 1/5
  • P(humidity = hi | yes) = 3/9           P(humidity = hi | no) = 4/5
  • P(wind = strong | yes) = 3/9           P(wind = strong | no) = 3/5
  • P(yes | ...) ∝ 0.0053                  P(no | ...) ∝ 0.0206
  • What if we were asked about Outlook = OC?

29
Example
  • Given (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong):
  • P(PlayTennis = yes) = 9/14 = 0.64      P(PlayTennis = no) = 5/14 = 0.36
  • P(outlook = sunny | yes) = 2/9         P(outlook = sunny | no) = 3/5
  • P(temp = cool | yes) = 3/9             P(temp = cool | no) = 1/5
  • P(humidity = hi | yes) = 3/9           P(humidity = hi | no) = 4/5
  • P(wind = strong | yes) = 3/9           P(wind = strong | no) = 3/5
  • P(yes | ...) ∝ 0.0053                  P(no | ...) ∝ 0.0206
  • P(no | instance) = 0.0206 / (0.0053 + 0.0206) = 0.795   (reproduced in code below)
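The same computation in code, reproducing the numbers above (the dictionary layout is just one convenient encoding):

    priors = {"yes": 9 / 14, "no": 5 / 14}
    likelihoods = {
        "yes": {"outlook=sunny": 2 / 9, "temp=cool": 3 / 9, "humidity=hi": 3 / 9, "wind=strong": 3 / 9},
        "no":  {"outlook=sunny": 3 / 5, "temp=cool": 1 / 5, "humidity=hi": 4 / 5, "wind=strong": 3 / 5},
    }

    scores = {}
    for v in priors:
        score = priors[v]
        for p in likelihoods[v].values():
            score *= p                                  # P(v) * prod_k P(x_k | v)
        scores[v] = score

    print(scores)                                       # yes: ~0.0053, no: ~0.0206
    print(scores["no"] / sum(scores.values()))          # ~0.795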

30
Naive Bayes: Two Classes
  • Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.
  • In the case of two classes, v ∈ {0,1}, we predict that v=1 iff
  • P(v=1) Π_i P(x_i | v=1) / [ P(v=0) Π_i P(x_i | v=0) ] > 1

31
Naive Bayes: Two Classes
  • Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.
  • In the case of two classes, v ∈ {0,1}, we predict that v=1 iff
  • P(v=1) Π_i P(x_i | v=1) / [ P(v=0) Π_i P(x_i | v=0) ] > 1

32
Naive Bayes: Two Classes
  • In the case of two classes, v ∈ {0,1}, we predict that v=1 iff P(v=1 | x_1,...,x_n) / P(v=0 | x_1,...,x_n) > 1

33
Naïve Bayes: Two Classes
  • In the case of two classes, v ∈ {0,1}, we predict that v=1 iff P(v=1 | x_1,...,x_n) / P(v=0 | x_1,...,x_n) > 1

34
Naïve Bayes: Two Classes
  • In the case of two classes, v ∈ {0,1}, we predict that v=1 iff P(v=1 | x_1,...,x_n) / P(v=0 | x_1,...,x_n) > 1
  • We get that the optimal Bayes behavior is given by a linear separator with weights w_i = log [p_i(1-q_i)] / [q_i(1-p_i)], where p_i = P(x_i=1|v=1) and q_i = P(x_i=1|v=0), assuming binary features (a derivation sketch follows).
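A derivation sketch of the linear form, reconstructed in the standard way (assuming binary features x_i ∈ {0,1}, with p_i = P(x_i=1|v=1) and q_i = P(x_i=1|v=0); the original slide's exact notation may differ). Taking logs of the decision criterion:

    \[
    \frac{P(v{=}1)\prod_i p_i^{x_i}(1-p_i)^{1-x_i}}
         {P(v{=}0)\prod_i q_i^{x_i}(1-q_i)^{1-x_i}} > 1
    \;\Longleftrightarrow\;
    \sum_i w_i x_i + b > 0,
    \]
    \[
    \text{with}\quad
    w_i = \log\frac{p_i(1-q_i)}{q_i(1-p_i)}, \qquad
    b = \log\frac{P(v{=}1)}{P(v{=}0)} + \sum_i \log\frac{1-p_i}{1-q_i}.
    \]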

35
Why does it work?
  • We have not addressed the question of why this classifier performs well, given that the assumptions are unlikely to be satisfied.
  • The linear form of the classifiers provides some hints.
  • (More on that later; see also [Roth 99], [Garg & Roth, ECML'02].) One of the presented papers will also address this partly.

36
Naïve Bayes: Two Classes
  • In the case of two classes we have that:
  • log [P(v=1|x) / P(v=0|x)] = Σ_i w_i x_i + b     (1)

37
Naïve Bayes: Two Classes
  • In the case of two classes we have that: log [P(v=1|x) / P(v=0|x)] = Σ_i w_i x_i + b     (1)
  • but since P(v=0|x) = 1 - P(v=1|x)     (2)
  • we get (plug (2) into (1) and do some algebra): P(v=1|x) = 1 / (1 + exp(-(Σ_i w_i x_i + b)))
  • which is simply the logistic (sigmoid) function used in the neural network representation (the omitted algebra is spelled out below).
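The omitted algebra, for completeness (same parameterization as above):

    \[
    \log\frac{P(v{=}1\mid x)}{P(v{=}0\mid x)} = \sum_i w_i x_i + b \quad (1),
    \qquad
    P(v{=}0\mid x) = 1 - P(v{=}1\mid x) \quad (2)
    \]
    \[
    \Longrightarrow\quad
    \frac{P(v{=}1\mid x)}{1 - P(v{=}1\mid x)} = e^{\sum_i w_i x_i + b}
    \quad\Longrightarrow\quad
    P(v{=}1\mid x) = \frac{1}{1 + e^{-(\sum_i w_i x_i + b)}}.
    \]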

38
Another look at Naive Bayes
Note: this is a bit different from the previous linearization. Rather than a single function, here we have an argmax over several different functions.
Graphical model: it encodes the NB independence assumption in the edge structure (siblings are independent given parents).
39
Hidden Markov Model (HMM)
  • HMM is a probabilistic generative model.
  • It models how an observed sequence is generated.
  • Let's call each position in a sequence a time step.
  • At each time step, there are two variables:
  • Current state (hidden)
  • Observation

40
HMM
  • Elements:
  • Initial state probability P(s_1)
  • Transition probability P(s_t | s_t-1)
  • Observation probability P(o_t | s_t)
  • As before, the graphical model is an encoding of the independence assumptions.
  • Note that we have seen this in the context of POS tagging.

41
HMM for Shallow Parsing
  • States
  • B, I, O
  • Observations
  • Actual words and/or part-of-speech tags

42
HMM for Shallow Parsing
  • Given a sentence, we can ask what the most likely state sequence is.

Transition probability: P(s_t=B|s_t-1=B), P(s_t=I|s_t-1=B), P(s_t=O|s_t-1=B), P(s_t=B|s_t-1=I), P(s_t=I|s_t-1=I), P(s_t=O|s_t-1=I), ...
Initial state probability: P(s_1=B), P(s_1=I), P(s_1=O)
Observation probability: P(o_t=Mr. | s_t=B), P(o_t=Brown | s_t=B), ..., P(o_t=Mr. | s_t=I), P(o_t=Brown | s_t=I), ...
43
Finding most likely state sequence in HMM (1)
44
Finding most likely state sequence in HMM (2)
45
Finding most likely state sequence in HMM (3)
A function of s_k
46
Finding most likely state sequence in HMM (4)
  • Viterbi's Algorithm
  • Dynamic Programming (a sketch follows)
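A minimal Viterbi sketch for an HMM like the BIO tagger above; the model is passed in as dictionaries, and the parameter names (init, trans, emit) are illustrative, not from the slides:

    def viterbi(obs, states, init, trans, emit):
        # delta[s] = probability of the best state sequence ending in state s at the current step.
        # emit[s].get(o, 1e-6): small floor for unseen observations (an assumed smoothing hack).
        delta = {s: init[s] * emit[s].get(obs[0], 1e-6) for s in states}
        backpointers = []
        for o in obs[1:]:
            prev, delta, ptr = delta, {}, {}
            for s in states:
                best_prev = max(states, key=lambda r: prev[r] * trans[r][s])
                delta[s] = prev[best_prev] * trans[best_prev][s] * emit[s].get(o, 1e-6)
                ptr[s] = best_prev
            backpointers.append(ptr)
        # Recover the most likely sequence by following back-pointers from the best final state.
        path = [max(states, key=lambda s: delta[s])]
        for ptr in reversed(backpointers):
            path.append(ptr[path[-1]])
        return list(reversed(path))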

47
Learning the Model
  • Estimate:
  • Initial state probability P(s_1)
  • Transition probability P(s_t | s_t-1)
  • Observation probability P(o_t | s_t)
  • Unsupervised learning (states are not observed):
  • EM algorithm
  • Supervised learning (states are observed; more common):
  • ML estimate of the above terms directly from data (a counting sketch follows)
  • Notice that this is completely analogous to the case of naive Bayes, and essentially all other models.
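A sketch of the supervised (counting) estimate; `tagged` is assumed to be a list of sentences, each a list of (word, state) pairs — a hypothetical input format:

    from collections import Counter, defaultdict

    def estimate_hmm(tagged):
        init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
        for sent in tagged:
            init[sent[0][1]] += 1                         # count the first state of each sentence
            for w, s in sent:
                emit[s][w] += 1                           # count (state, word) emissions
            for (_, s_prev), (_, s_next) in zip(sent, sent[1:]):
                trans[s_prev][s_next] += 1                # count state-to-state transitions
        normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
        return (normalize(init),
                {s: normalize(c) for s, c in trans.items()},
                {s: normalize(c) for s, c in emit.items()})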

48
Another view of Markov Models
[Figure: the tagging model as a graphical model, with hidden states/tags T, observations/words W, and edges encoding the independence assumptions.]
49
Another View of Markov Models
Input
As for NB, features are pairs and singletons of t's, w's.
Only 3 active features.
This can be extended to an argmax that maximizes the prediction of the whole state sequence, computed, as before, via Viterbi.
50
Learning with Probabilistic Classifiers
  • Learning Theory
  • We showed that probabilistic predictions can be viewed as predictions via Linear Statistical Queries Models [Roth 99].
  • The low expressivity explains generalization/robustness.
  • Is that all?
  • It does not explain why it is possible to (approximately) fit the data with these models. Namely, is there a reason to believe that these hypotheses minimize the empirical error on the sample?
  • In general, no. (Unless it corresponds to some probabilistic assumptions that hold.)

51
Learning Protocol
  • LSQ hypotheses are computed directly, w/o
    assumptions on the underlying distribution
  • - Choose features
  • - Compute coefficients
  • Is there a reason to believe that an LSQ
    hypothesis
  • minimizes the empirical error on the sample?
  • In general, no.
  • (Unless it corresponds to some probabilistic
    assumptions that hold).

52
Learning Protocol Practice
  • LSQ hypotheses are computed directly
  • - Choose features
  • - Compute coefficients
  • If hypothesis does not fit the training data -
  • - Augment set of features

(Forget your original assumption)
53
Example: probabilistic classifiers
If the hypothesis does not fit the training data, augment the set of features (forget assumptions).
Features are pairs and singletons of t's, w's.
Additional features are included.
54
Robustness of Probabilistic Predictors
  • Why is it relatively easy to fit the data?
  • Consider all distributions with the same marginals.
  • (E.g., a naïve Bayes classifier will predict the same regardless of which of these distributions generated the data.)
  • [Garg & Roth, ECML'01]
  • In most cases (i.e., for most such distributions), the resulting predictor's error is close to that of the optimal classifier (the one given the correct distribution).

55
Summary: Probabilistic Modeling
  • Classifiers derived from probability density estimation models were viewed as LSQ hypotheses.
  • Probabilistic assumptions:
  • - Guiding feature selection, but also -
  • - Not allowing the use of more general features.

56
A Unified Approach
  • Most methods blow up the original feature space.
  • And make predictions using a linear representation over the new feature space.

Note: methods do not have to actually do that, but they produce the same decision as a hypothesis that does. (Roth 98, 99, 00)
57
A Unified Approach
  • Most methods blow up the original feature space.
  • And make predictions using a linear representation over the new feature space.

58
A Unified Approach
  • Most methods blow up the original feature space.
  • And make predictions using a linear representation over the new feature space.

Q1: How are weights determined?
Q2: How is the new feature space determined?
Implications? Restrictions?