Title: Bayesian Learning


1
  • Bayesian Learning
  • Machine Learning by Mitchell, Chp. 6
  • Neural Networks for Pattern Recognition by Bishop, Chp. 1
  • Berrin Yanikoglu
  • Nov 2007

2
  • Imagine that your task is to classify a's (C1) from b's (C2).
  • 1) How would you decide if you had to decide without seeing a new instance?
  • Choose C1 if P(C1) > P(C2) (the prior probabilities)
  • Choose C2 otherwise
  • 2) How about if you have one measured feature X?
  • First, let's define class-conditional and posterior probabilities.

3
Definition of probabilities based on frequencies:
P(C1, X=x) = (num. samples in corresponding box) / (num. of all samples)   // joint probability of C1 and X
P(X=x | C1) = (num. samples in corresponding box) / (num. of samples in C1 row)   // class-conditional probability of X
P(C1) = (num. of samples in C1 row) / (num. of all samples)   // prior probability of C1
P(C1, X=x) = P(X=x | C1) P(C1)   (Bayes Thm.)
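As a quick illustration of these frequency-based estimates, here is a minimal Python sketch using a made-up 2x3 count table (classes as rows, discrete feature values as columns); the numbers are not from the slides:

    import numpy as np

    # Made-up counts: rows are classes C1, C2; columns are feature values x = 0, 1, 2.
    counts = np.array([[30, 10,  5],   # class C1
                       [ 5, 20, 30]])  # class C2
    total = counts.sum()

    joint_C1_x1 = counts[0, 1] / total            # P(C1, X=1): samples in box / all samples
    cond_x1_C1  = counts[0, 1] / counts[0].sum()  # P(X=1 | C1): samples in box / samples in C1 row
    prior_C1    = counts[0].sum() / total         # P(C1): samples in C1 row / all samples

    # Product rule from the slide: P(C1, X=1) = P(X=1 | C1) * P(C1)
    assert np.isclose(joint_C1_x1, cond_x1_C1 * prior_C1)
    print(joint_C1_x1, cond_x1_C1, prior_C1)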
4
2) How about if you have one measured feature X about your instances (and in particular, for a new instance, you have measured X=x)?
5
  • Imagine that your task is to classify a's (C1) from b's (C2).
  • 2) How about if you have one measured feature X about your instances (and in particular, for a new instance, you have measured X=x)?
  • You would minimize misclassification errors if you choose the class that has the maximum posterior probability (why?)
  • Choose C1 if P(C1|X=x) > P(C2|X=x)
  • Choose C2 otherwise
  • Since P(C1|X=x) = P(X=x|C1) P(C1) / P(X=x), equivalently:
  • Choose C1 if P(X=x|C1) P(C1) > P(X=x|C2) P(C2), ignoring P(X=x)
  • Choose C2 otherwise
  • Notice that both P(X=x|C1) and P(C1) are easy to compute (see the sketch below).
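A minimal sketch of this decision rule; the likelihoods and priors below are invented for illustration:

    # Invented class-conditional probabilities and priors for one measured value x;
    # the rule itself follows the slide.
    p_x_given_C1, p_x_given_C2 = 0.30, 0.10   # P(X=x | C1), P(X=x | C2)
    prior_C1, prior_C2 = 0.40, 0.60           # P(C1), P(C2)

    # Compare P(X=x | Ck) * P(Ck); the common factor P(X=x) cancels, so it is ignored.
    score_C1 = p_x_given_C1 * prior_C1
    score_C2 = p_x_given_C2 * prior_C2
    decision = "C1" if score_C1 > score_C2 else "C2"
    print(decision)   # -> C1, since 0.12 > 0.06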

6
Posterior Probability Distribution
7
Continuous valued attributes
P(x ∈ [a, b]) = 1 if the interval [a, b] corresponds to the whole of X-space. Note that, to be proper, we use upper-case letters for probabilities and lower-case letters for probability densities, but often this is not strictly followed. For continuous variables, the class-conditional probabilities introduced above become class-conditional probability density functions, which we write in the form p(x|Ck).
8
Multiple attributes
  • If there are d variables/attributes x1, ..., xd, we may group them into a vector x = (x1, ..., xd)^T corresponding to a point in a d-dimensional space.
  • The distribution of values of x can be described by a probability density function p(x), such that the probability of x lying in a region R of the x-space is given by P(x ∈ R) = ∫_R p(x) dx.

9
Expected Value
  • The expected value of a function Q(x), where x has the probability density p(x), is E[Q] = ∫ Q(x) p(x) dx.
  • For a finite set of data points x1, ..., xN drawn from the distribution p(x), the expectation can be approximated by the average over the data points: E[Q] ≈ (1/N) Σ_n Q(x_n).
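A quick numerical sanity check of this approximation, assuming (arbitrarily) that p(x) is the standard normal and Q(x) = x^2, so that E[Q] = 1:

    import numpy as np

    # E[Q] = integral of Q(x) p(x) dx, approximated by the sample average
    # (1/N) * sum_n Q(x_n) with x_n drawn from p(x).
    rng = np.random.default_rng(0)
    x = rng.standard_normal(100_000)      # samples from p(x) = N(0, 1)
    estimate = np.mean(x ** 2)            # Q(x) = x^2, so E[Q] should be 1
    print(estimate)                       # close to 1.0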

10
Bayes Thm. in General
  • For continuous variables, the prior probabilities can be combined with the class-conditional densities to give the posterior probabilities P(Ck|x) using Bayes theorem: P(Ck|x) = p(x|Ck) P(Ck) / p(x).
  • Note that you can show (and generalize to k classes) that p(x) = Σ_j p(x|Cj) P(Cj), so that the posteriors sum to 1.
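A sketch of this computation for k classes, with made-up Gaussian class-conditional densities and priors; the posteriors sum to 1 because the normalizer is p(x) = Σ_j p(x|Cj) P(Cj):

    import numpy as np
    from math import sqrt, pi, exp

    def gauss(x, mu, sigma):
        """Univariate normal density used here as a made-up p(x | Ck)."""
        return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

    x = 1.2
    priors = np.array([0.5, 0.3, 0.2])                 # P(Ck), invented
    likelihoods = np.array([gauss(x, 0.0, 1.0),
                            gauss(x, 2.0, 1.0),
                            gauss(x, 4.0, 2.0)])       # p(x | Ck), invented

    evidence = np.sum(likelihoods * priors)            # p(x) = sum_k p(x|Ck) P(Ck)
    posteriors = likelihoods * priors / evidence       # Bayes theorem
    assert np.isclose(posteriors.sum(), 1.0)           # posteriors sum to 1
    print(posteriors)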

11
Decision Regions
  • In general, assign a feature x to Ck if Ck = argmax_j P(Cj|x).
  • Equivalently, assign a feature x to Ck if p(x|Ck) P(Ck) > p(x|Cj) P(Cj) for all j ≠ k.
  • This generates c decision regions R1, ..., Rc such that a point falling in region Rk is assigned to class Ck.
  • Note that each of these regions need not be
    contiguous, but may itself be divided into
    several disjoint regions all of which are
    associated with the same class. The boundaries
    between these regions are known as decision
    surfaces or decision boundaries.
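A small sketch of how such regions arise: scan a 1-D grid of x values, assign each point to argmax_k p(x|Ck) P(Ck), and read off where the assigned class changes (the two Gaussian class-conditional densities and the priors are invented):

    import numpy as np

    xs = np.linspace(-4.0, 8.0, 601)
    mus, sigmas = np.array([0.0, 3.0]), np.array([1.0, 1.5])   # made-up Gaussians
    priors = np.array([0.6, 0.4])

    # p(x | Ck) * P(Ck) for each class, evaluated on the grid
    scores = np.stack([
        priors[k] * np.exp(-0.5 * ((xs - mus[k]) / sigmas[k]) ** 2)
                  / (np.sqrt(2 * np.pi) * sigmas[k])
        for k in range(2)
    ])
    labels = np.argmax(scores, axis=0)          # decision rule on the grid

    # Decision boundaries: grid points where the assigned class changes
    boundaries = xs[1:][np.diff(labels) != 0]
    print(boundaries)                           # split points between R1 and R2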


12
Probability of Error
  • For two regions R1 and R2 (you can generalize):

Not ideal decision boundary!
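The slide's equation is an image in the original; for two classes it is presumably the standard decomposition of the error probability (written here in LaTeX):

    P(\text{error}) = P(x \in R_1, C_2) + P(x \in R_2, C_1)
                    = \int_{R_1} p(x \mid C_2)\, P(C_2)\, dx + \int_{R_2} p(x \mid C_1)\, P(C_1)\, dx

Placing each x in the region of the class with the larger p(x|Ck) P(Ck) makes the integrand as small as possible at every point, which is why the "not ideal" boundary in the figure incurs extra error.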
13
Justification for the Decision Criteria based on max. posterior probability
14
Justification for the Decision Criteria based on max. posterior probability
  • Another approach to justify

15
Discriminant Functions
  • Although we have focused on probability distribution functions, the decision on class membership in our classifiers has been based solely on the relative sizes of the probabilities. This observation allows us to reformulate the classification process in terms of a set of discriminant functions y1(x), ..., yc(x) such that an input vector x is assigned to class Ck if yk(x) > yj(x) for all j ≠ k.
  • Recast the decision rule for minimizing the probability of misclassification in terms of discriminant functions, by choosing yk(x) = P(Ck|x).

16
Discriminant Functions
We can use any monotonic function of yk(x) that simplifies calculations. Since a monotonic transformation does not change the ordering of the yk(x), it does not change the decision.
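A small sketch of this point, comparing yk(x) = p(x|Ck) P(Ck) with the monotonic transform ln yk(x); all numbers are invented:

    import numpy as np

    likelihoods = np.array([0.05, 0.20, 0.02])   # p(x | Ck), invented
    priors = np.array([0.5, 0.3, 0.2])           # P(Ck), invented

    y = likelihoods * priors                     # discriminants y_k(x)
    g = np.log(likelihoods) + np.log(priors)     # monotonic transform: ln y_k(x)

    # The monotonic transform preserves the ordering, hence the decision.
    assert np.argmax(y) == np.argmax(g)
    print("assigned class:", np.argmax(y))       # class index 1 here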
17
Minimizing Risks
  • We define a loss matrix with elements Lkj
    specifying the penalty associated with assigning
    a pattern to class Cj when in fact it belongs to
    class Ck.
  • Consider all the patterns x which belong to class
    Ck. Then the expected loss for those patterns is
    given by
  • Overall expected loss/risk

18
Minimizing Expected Risk
  • This risk is minimized if the integrand is minimized at each point x, that is, if the regions Rj are chosen such that x ∈ Rj when Σ_k L_kj p(x|Ck) P(Ck) < Σ_k L_ki p(x|Ck) P(Ck) for all i ≠ j.
  • This is a generalization of the simpler rule of minimizing the number of misclassifications.
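A sketch of the pointwise form of this rule at a single x: choosing the class Cj with the smallest Σ_k L_kj P(Ck|x) minimizes the integrand there. The loss matrix and posteriors below are invented; note that with asymmetric losses the minimum-risk choice can differ from the maximum-posterior choice:

    import numpy as np

    # L[k, j]: penalty for assigning to Cj when the true class is Ck (invented).
    # Here a miss of C1 (true C1, chosen C2) costs 10x the opposite mistake.
    L = np.array([[0.0, 10.0],
                  [1.0,  0.0]])
    posteriors = np.array([0.15, 0.85])           # P(Ck | x), invented

    # Expected loss of choosing Cj at this x: sum_k L[k, j] * P(Ck | x)
    expected_loss = posteriors @ L                # one value per candidate class Cj
    choice = np.argmin(expected_loss)
    print(expected_loss, "-> choose class", choice)   # [0.85 1.5] -> class 0 (C1)
    # With the 0/1 loss (L = 1 - I) this reduces to picking the maximum posterior.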

19
Mitchell Chp.6
  • Use of Bayes thm. to justify selecting a
    particular hypothesis

20
  • Bayes thm

21
Choosing Hypotheses
22
Example to Work on
23
(No Transcript)
24
Probability - Basics
25
  • What is the relation between Bayes thm. and
    concept learning?
  • Calculate the posterior probability of each
    hypothesis and output the one which is most
    likely.
  • Brute Force MAP algorithm
  • Computationally complex, but interesting
    theoretically

26
  • For this, we must specify P(h) and P(D|h)
  • P(D) will be found from these two
  • Let's choose them to be consistent with our assumptions (in Find-S and Candidate Elimination):
  • Training data D is noise free
  • The target concept c is in the hypothesis space H
  • Each hypothesis is equally probable (a priori)

27
  • Let's choose them to be consistent with our assumptions:
  • Each hypothesis is equally probable (a priori):
  • P(h) = 1/|H| for all h in H
  • Training data D is noise free:
  • P(D|h) = 1 if h is consistent with D, 0 otherwise

28
  • P(h|D) = P(D|h) P(h) / P(D)
  • For inconsistent hypotheses: P(h|D) = (0 · 1/|H|) / P(D) = 0
  • For consistent hypotheses: P(h|D) = (1 · 1/|H|) / P(D) = 1 / |VS_H,D|
  • with P(D) = |VS_H,D| / |H|, as shown in the following slide.
  • Under our choice for P(h) and P(D|h), every consistent hypothesis has equal posterior probability (1 / |VS_H,D|) and every inconsistent hypothesis has probability 0 (see the sketch below).
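A minimal brute-force MAP sketch under exactly these choices of P(h) and P(D|h), using an invented hypothesis space of threshold concepts h_t(x) = [x >= t] for t = 0..10 and a tiny noise-free data set:

    # Hypothesis space: h_t(x) = 1 iff x >= t, for thresholds t = 0..10 (invented).
    H = [(t, lambda x, t=t: int(x >= t)) for t in range(11)]

    # Noise-free training data D: (instance, target label) pairs (invented).
    D = [(2, 0), (4, 0), (7, 1), (9, 1)]

    prior = 1.0 / len(H)                                        # P(h) = 1/|H|
    consistent = [all(h(x) == y for x, y in D) for _, h in H]   # P(D|h) in {0, 1}

    unnormalized = [prior * int(c) for c in consistent]
    p_D = sum(unnormalized)                                     # = |VS_H,D| / |H|
    posterior = [u / p_D for u in unnormalized]                 # 1/|VS_H,D| or 0

    for (t, _), p in zip(H, posterior):
        print(f"threshold {t}: P(h|D) = {p:.3f}")   # thresholds 5, 6, 7 each get 1/3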

29
(No Transcript)
30
Evolution of Posterior Probabilities
As we gather more data (no data, then D1, then D1 and D2), inconsistent hypotheses get zero posterior probability and the consistent ones share the remaining probability mass (summing to 1).
31
Characterizing Learning Algorithms by Equivalent
MAP Learners
Every consistent learner outputs a MAP hypothesis, if we assume equal priors and noise-free data.
32
Normal Distribution and Multivariate Normal Distribution
  • For a single variable, the normal density function is p(x) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²)).
  • For variables in higher dimensions, this generalizes to p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(-½ (x - μ)^T Σ⁻¹ (x - μ)),
  • where the mean μ is now a d-dimensional vector, Σ is a d x d covariance matrix, and |Σ| is the determinant of Σ.
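The density formulas are images on the original slide; below is a sketch of the standard univariate and multivariate normal densities in numpy (the test values are arbitrary):

    import numpy as np

    def normal_pdf(x, mu, sigma):
        """Univariate normal density with mean mu and standard deviation sigma."""
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

    def multivariate_normal_pdf(x, mu, Sigma):
        """d-dimensional normal density with mean vector mu and covariance Sigma."""
        d = len(mu)
        diff = x - mu
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
        return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

    # For d = 1 the two expressions agree (sigma^2 plays the role of the 1x1 covariance).
    assert np.isclose(normal_pdf(0.7, 0.0, 2.0),
                      multivariate_normal_pdf(np.array([0.7]),
                                              np.array([0.0]),
                                              np.array([[4.0]])))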

33
Learning a Real Valued Function
34
(No Transcript)
35
  • Skip 6.5 and 6.6 for now.
  • So far we have considered the question "What is the most probable hypothesis given the training data?"
  • In fact, the question that is often of most significance is "What is the most probable classification of the new instance given the training data?"
  • Although it may seem that this second question can be answered by simply applying the MAP hypothesis to the new instance, in fact it is possible to do better.

36
(No Transcript)
37
Bayes Optimal Classifier
No other classifier using the same hypothesis
space and same prior knowledge can outperform
this method on average
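A sketch of the Bayes optimal rule argmax_v Σ_h P(v|h) P(h|D), using the classic three-hypothesis example from Mitchell Chp. 6 (posteriors 0.4, 0.3, 0.3; h1 predicts +, h2 and h3 predict -):

    from collections import defaultdict

    # Posterior of each hypothesis given the data, P(h|D), and each hypothesis's
    # prediction for the new instance.
    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    predictions = {"h1": "+", "h2": "-", "h3": "-"}

    # P(v|D) = sum_h P(v|h) P(h|D), with P(v|h) = 1 if h predicts v, else 0.
    votes = defaultdict(float)
    for h, p in posteriors.items():
        votes[predictions[h]] += p

    print(dict(votes))                      # {'+': 0.4, '-': 0.6}
    print(max(votes, key=votes.get))        # '-' : differs from the MAP hypothesis h1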
38
Gibbs Classifier (Opper and Haussler, 1991, 1994)
39
Naive Bayes Classifier
40
Naive Bayes Classifier
  • But it is difficult (requires a lot of data) to estimate P(a1, a2, ..., an | vj).
  • Naive Bayes assumption: the attributes are conditionally independent given the target value, so P(a1, a2, ..., an | vj) = Π_i P(ai | vj) (see the sketch below).
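A minimal Naive Bayes sketch under this assumption, v_NB = argmax_vj P(vj) Π_i P(ai|vj), with all probabilities estimated from counts on a made-up categorical training set (no smoothing):

    from collections import Counter, defaultdict

    # Made-up training data: each row is ((a1, a2, a3), target value v).
    data = [
        (("sunny", "hot", "high"), "no"),
        (("sunny", "mild", "high"), "no"),
        (("rain", "mild", "high"), "yes"),
        (("rain", "cool", "normal"), "yes"),
        (("overcast", "hot", "normal"), "yes"),
    ]

    targets = Counter(v for _, v in data)
    # cond[(i, a, v)] counts how often attribute i takes value a among examples labelled v
    cond = defaultdict(int)
    for attrs, v in data:
        for i, a in enumerate(attrs):
            cond[(i, a, v)] += 1

    def classify(attrs):
        """v_NB = argmax_v P(v) * prod_i P(a_i | v), all estimated from counts."""
        best_v, best_score = None, -1.0
        for v, n_v in targets.items():
            score = n_v / len(data)                 # P(v)
            for i, a in enumerate(attrs):
                score *= cond[(i, a, v)] / n_v      # P(a_i | v), no smoothing
            if score > best_score:
                best_v, best_score = v, score
        return best_v

    print(classify(("rain", "hot", "high")))        # -> "yes" for this toy data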

41
(No Transcript)
42
Illustrative Example
43
Illustrative Example
44
Naive Bayes Subtleties
45
Naive Bayes Subtleties