Title: Bayesian Learning
1- Bayesian Learning
- Machine Learning by Mitchell, Chp. 6
- Neural Networks for Pattern Recognition by Bishop, Chp. 1
- Berrin Yanikoglu
- Nov 2007
2- Imagine that your task is to classify a's (C1) from b's (C2)
- 1) How would you decide if you had to decide without seeing a new instance?
- Choose C1 if P(C1) > P(C2)   (prior probabilities)
- Choose C2 otherwise
- 2) How about if you have one measured feature X?
- First let's define class-conditional and posterior probabilities
3Definition of probabilities based on frequencies
P(C1, X=x) = (num. samples in corresponding box) / (num. all samples)        // joint probability of C1 and X=x
P(X=x|C1) = (num. samples in corresponding box) / (num. samples in C1-row)   // class-conditional probability of X
P(C1) = (num. samples in C1-row) / (num. all samples)                        // prior probability of C1
P(C1, X=x) = P(X=x|C1) P(C1)                                                 // Bayes Thm. (checked numerically below)
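These frequency definitions are easy to verify numerically. A minimal Python sketch (the 2x2 count table below is invented for illustration; rows are the classes, columns the values of X):

```python
import numpy as np

# Hypothetical count table: rows = classes (C1, C2), columns = feature values (X=0, X=1).
counts = np.array([[30, 10],   # class C1
                   [ 5, 25]])  # class C2
total = counts.sum()

p_joint      = counts[0, 1] / total             # P(C1, X=1): samples in the (C1, X=1) box / all samples
p_x_given_c1 = counts[0, 1] / counts[0].sum()   # P(X=1|C1): same box / samples in the C1-row
p_c1         = counts[0].sum() / total          # P(C1): samples in the C1-row / all samples

print(p_joint, p_x_given_c1 * p_c1)             # both print 0.142857...: P(C1, X=1) = P(X=1|C1) P(C1)
```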
4- 2) How about if you have one measured feature X about your instances (and in particular, for a new instance, you have measured X=x)?
5- Imagine that your task is to classify a's (C1) from b's (C2)
- 2) How about if you have one measured feature X about your instances (and in particular, for a new instance, you have measured X=x)?
- You would minimize misclassification errors if you choose the class that has the maximum posterior probability (why?)
- Choose C1 if p(C1|X=x) > p(C2|X=x)
- Choose C2 otherwise
- Since p(C1|X=x) = p(X=x|C1)P(C1)/P(X=x), equivalently:
- Choose C1 if p(X=x|C1)P(C1) > p(X=x|C2)P(C2), ignoring P(X=x)
- Choose C2 otherwise
- Notice that both p(X=x|C1) and P(C1) are easy to compute (see the sketch below).
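A minimal sketch of this two-class rule in Python; the priors and class-conditional values here are invented for illustration, standing in for quantities estimated from training data:

```python
# Hypothetical estimates for a new instance with measured value X = x.
prior      = {"C1": 0.6, "C2": 0.4}   # P(C1), P(C2)
likelihood = {"C1": 0.2, "C2": 0.5}   # P(X=x|C1), P(X=x|C2)

# Compare P(X=x|C1)P(C1) with P(X=x|C2)P(C2); the common factor P(X=x) cancels and is ignored.
score_c1 = likelihood["C1"] * prior["C1"]
score_c2 = likelihood["C2"] * prior["C2"]

decision = "C1" if score_c1 > score_c2 else "C2"
print(decision)   # 0.12 < 0.20, so the instance is assigned to C2 despite the larger prior of C1
```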
6Posterior Probability Distribution
7Continuous valued attributes
P(x ∈ [a, b]) = ∫_a^b p(x) dx, which equals 1 if the interval [a, b] corresponds to the whole of X-space. Note that, to be proper, we use upper-case letters for probabilities and lower-case letters for probability densities, but often this is not strictly followed. For continuous variables, the class-conditional probabilities introduced above become class-conditional probability density functions, which we write in the form p(x|Ck).
8Multiple attributes
- If there are d variables/attributes x1, ..., xd, we may group them into a vector x = (x1, ..., xd)^T corresponding to a point in a d-dimensional space.
- The distribution of values of x can be described by a probability density function p(x), such that the probability of x lying in a region R of the x-space is given by P(x ∈ R) = ∫_R p(x) dx.
9Expected Value
- The expected value of a function Q(x), where x has the probability density p(x), is E[Q] = ∫ Q(x) p(x) dx
- For a finite set of data points x1, ..., xN drawn from the distribution p(x), the expectation can be approximated by the average over the data points: E[Q] ≈ (1/N) Σ_{n=1..N} Q(xn)  (illustrated below)
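A quick illustration of this approximation, assuming for concreteness a standard normal p(x) and Q(x) = x², so the exact expectation is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def Q(x):
    # Example function; under a standard normal, E[x^2] = 1 exactly.
    return x ** 2

for N in (10, 1000, 100000):
    x = rng.standard_normal(N)   # x1, ..., xN drawn from p(x)
    print(N, Q(x).mean())        # the sample average approaches E[Q] = 1 as N grows
```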
10Bayes Thm. in General
- For continuous variables, the prior probabilities can be combined with the class-conditional densities to give the posterior probabilities P(Ck|x) using Bayes' theorem: P(Ck|x) = p(x|Ck)P(Ck) / p(x)
- Note that you can show (and generalize to k classes) that p(x) = p(x|C1)P(C1) + p(x|C2)P(C2), so the posteriors sum to one.
11Decision Regions
- In general, assign a feature x to Ck if Ck = argmax_j P(Cj|x)
- Equivalently, assign a feature x to Ck if p(x|Ck)P(Ck) > p(x|Cj)P(Cj) for all j ≠ k
- This generates c decision regions R1, ..., Rc such that a point falling in region Rk is assigned to class Ck.
- Note that each of these regions need not be contiguous, but may itself be divided into several disjoint regions, all of which are associated with the same class. The boundaries between these regions are known as decision surfaces or decision boundaries.
12Probability of Error
- For two regions R1 and R2 (you can generalize): P(error) = P(x ∈ R2, C1) + P(x ∈ R1, C2) = ∫_{R2} p(x|C1)P(C1) dx + ∫_{R1} p(x|C2)P(C2) dx
- (Figure: an example decision boundary that is not ideal, i.e., does not minimize this error.)
13Justification for the Decision Criteria based on max. posterior probability
14Justification for the Decision Criteria based on max. posterior probability
- Another approach to justify
15Discriminant Functions
- Although we have focused on probability distribution functions, the decision on class membership in our classifiers has been based solely on the relative sizes of the probabilities. This observation allows us to reformulate the classification process in terms of a set of discriminant functions y1(x), ..., yc(x) such that an input vector x is assigned to class Ck if yk(x) > yj(x) for all j ≠ k.
- Recast the decision rule for minimizing the probability of misclassification in terms of discriminant functions, by choosing yk(x) = P(Ck|x), or equivalently yk(x) = p(x|Ck)P(Ck).
16Discriminant Functions
We can use any monotonic function of yk(x) that simplifies calculations. Since a monotonic transformation does not change the ordering of the yk's, it does not change the decision. For example, yk(x) = ln p(x|Ck) + ln P(Ck) gives the same decisions as yk(x) = p(x|Ck)P(Ck), as sketched below.
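A minimal sketch of this log-discriminant idea with invented one-dimensional Gaussian class-conditional densities and priors:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical one-dimensional Gaussian class-conditional densities and priors for two classes.
means, sigmas = [0.0, 2.0], [1.0, 1.0]
priors = [0.7, 0.3]

def discriminants(x):
    # yk(x) = ln p(x|Ck) + ln P(Ck): a monotonic transform of p(x|Ck)P(Ck),
    # so the argmax (and hence the decision) is unchanged.
    return [norm.logpdf(x, m, s) + np.log(p)
            for m, s, p in zip(means, sigmas, priors)]

x = 1.2
print("assign to class", int(np.argmax(discriminants(x))) + 1)
```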
17Minimizing Risks
- We define a loss matrix with elements Lkj specifying the penalty associated with assigning a pattern to class Cj when in fact it belongs to class Ck.
- Consider all the patterns x which belong to class Ck. Then the expected loss for those patterns is given by Rk = Σ_j Lkj ∫_{Rj} p(x|Ck) dx
- Overall expected loss/risk: R = Σ_k Rk P(Ck)
18Minimizing Expected Risk
- This risk is minimized if the integrand is minimized at each point x, that is, if the regions Rj are chosen such that x ∈ Rj when Σ_k Lkj p(x|Ck)P(Ck) < Σ_k Lki p(x|Ck)P(Ck) for all i ≠ j  (sketched below)
- This is a generalization of the simple rule minimizing the number of misclassifications.
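A minimal sketch of this rule at a single point x, with an invented loss matrix and invented posteriors; for each candidate class Cj we compute the conditional risk Σ_k Lkj P(Ck|x) and choose the class with the smallest risk:

```python
import numpy as np

# Hypothetical loss matrix: L[k, j] = penalty for deciding class j+1 when the true class is k+1.
# Misclassifying a C2 pattern as C1 is ten times worse than the reverse.
L = np.array([[0.0, 1.0],
              [10.0, 0.0]])

# Hypothetical posteriors P(Ck|x) at the point being classified.
posterior = np.array([0.8, 0.2])

risk = L.T @ posterior                    # conditional risk of each decision: sum_k L[k, j] * P(Ck|x)
print(risk, "-> assign to class", int(np.argmin(risk)) + 1)
```

Note how the asymmetric losses flip the decision to C2 even though C1 has the larger posterior; with a zero-one loss matrix the rule reduces to choosing the maximum posterior.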
19Mitchell Chp.6
- Use of Bayes thm. to justify selecting a
particular hypothesis
20(No Transcript)
21Choosing Hypotheses
22Example to Work on
23(No Transcript)
24Probability - Basics
25- What is the relation between Bayes thm. and concept learning?
- Calculate the posterior probability of each hypothesis and output the one which is most likely.
- Brute Force MAP algorithm
- Computationally complex, but interesting theoretically
26- For this, we must specify P(h) and P(D|h)
- P(D) will be found from these two
- Let's choose them to be consistent with our assumptions (in Find-S and Candidate Elimination):
- Training data D is noise free
- The target concept c is in the hypothesis space H
- Each hypothesis is equally probable (a priori)
27- Let's choose them to be consistent with our assumptions:
- Each hypothesis is equally probable (a priori)
- P(h) = 1/|H| for all h in H
- Training data D is noise free
- P(D|h) = 1 if h is consistent with D
- P(D|h) = 0 otherwise
28- P(h|D) = P(D|h)P(h) / P(D)
- For inconsistent hypotheses: P(h|D) = (0 · 1/|H|) / P(D) = 0
- For consistent hypotheses: P(h|D) = (1 · 1/|H|) / P(D) = 1/|VS_H,D|
- with P(D) = |VS_H,D| / |H|, as shown in the following slide.
- Under our choice for P(h) and P(D|h), every consistent hypothesis has equal posterior probability (1/|VS_H,D|) and every inconsistent hypothesis has probability 0 (a toy sketch follows).
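A toy sketch of the Brute Force MAP computation under these assumptions. The hypothesis space here is invented for illustration (threshold functions on a single integer attribute); with noise-free data, the consistent hypotheses end up sharing the posterior mass equally:

```python
# Toy hypothesis space: h_t(x) = 1 iff x >= t, for thresholds t = 0..10 (invented for illustration).
H = [lambda x, t=t: int(x >= t) for t in range(11)]

# Noise-free training data D: (attribute value, label) pairs.
D = [(2, 0), (5, 1), (8, 1)]

prior = 1.0 / len(H)                                    # P(h) = 1/|H|
likelihood = [float(all(h(x) == y for x, y in D))       # P(D|h) = 1 if h is consistent with D, else 0
              for h in H]

p_D = sum(lk * prior for lk in likelihood)              # P(D) = |VS_H,D| / |H|
posterior = [lk * prior / p_D for lk in likelihood]     # P(h|D) by Bayes theorem

print([round(p, 3) for p in posterior])   # the 3 consistent hypotheses each get 1/3, the rest get 0
```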
29(No Transcript)
30Evolution of Posterior Probabilities
As we gather more data (nothing, then D1, then D2), inconsistent hypotheses get 0 posterior probability and consistent ones share the remaining probability (summing up to 1).
31Characterizing Learning Algorithms by Equivalent MAP Learners
Every consistent learner outputs a MAP hypothesis, if we assume equal priors and noise-free data.
32Normal Distribution and Multivariate Normal Distribution
- For a single variable, the normal density function is p(x) = (1 / sqrt(2π σ²)) exp(-(x - μ)² / (2σ²))
- For variables in higher dimensions, this generalizes to p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(-(1/2)(x - μ)^T Σ⁻¹ (x - μ))
- where the mean μ is now a d-dimensional vector, Σ is a d x d covariance matrix and |Σ| is the determinant of Σ (checked numerically below)
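A quick numerical check of the multivariate formula with an invented mean vector and covariance matrix, comparing a direct implementation against scipy's built-in density:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters for d = 2.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

d = len(mu)
diff = x - mu
# p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^-1 (x-mu))
p = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** -0.5 \
    * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(p, multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # the two values agree
```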
33Learning a Real Valued Function
34(No Transcript)
35- Skip 6.5 and 6.6 for now.
- So far we have considered the question:
- "What is the most probable hypothesis given the training data?"
- In fact, the question that is often of most significance is:
- "What is the most probable classification of the new instance given the training data?"
- Although it may seem that this second question can be answered by simply applying the MAP hypothesis to the new instance, in fact it is possible to do better.
36(No Transcript)
37Bayes Optimal Classifier
The Bayes optimal classification of a new instance is the value vj that maximizes Σ_{hi in H} P(vj|hi) P(hi|D). No other classifier using the same hypothesis space and same prior knowledge can outperform this method on average.
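A toy sketch of this weighted vote over hypotheses; the posteriors and predictions below are invented, but chosen so that the Bayes optimal classification differs from the prediction of the single MAP hypothesis, which is the point of the slide:

```python
# Toy posteriors P(hi|D) and each hypothesis's prediction for the new instance
# (here P(v|hi) is 1 for the predicted value and 0 otherwise).
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

# Bayes optimal classification: argmax_v  sum_hi P(v|hi) P(hi|D)
votes = {}
for h, p in posteriors.items():
    votes[prediction[h]] = votes.get(prediction[h], 0.0) + p

print(votes)                        # {'+': 0.4, '-': 0.6}
print(max(votes, key=votes.get))    # '-': differs from the MAP hypothesis h1, which predicts '+'
```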
38Gibbs Classifier (Opper and Haussler, 1991, 1994)
39Naive Bayes Classifier
40Naive Bayes Classifier
- But it is difficult (requires a lot of data) to estimate P(a1, a2, ..., an | vj)
- Naive Bayes assumption: P(a1, a2, ..., an | vj) = Π_i P(ai | vj), so vNB = argmax_vj P(vj) Π_i P(ai | vj)  (a minimal sketch follows)
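A minimal sketch of the resulting classifier on an invented dataset of discrete attributes: P(vj) and each P(ai|vj) are estimated by counting, and the class maximizing P(vj) Π_i P(ai|vj) is returned.

```python
from collections import Counter, defaultdict

# Invented training data: each row is (attribute tuple, class label).
data = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rain", "mild"), "yes"),
    (("overcast", "hot"), "yes"),
    (("rain", "cool"), "yes"),
]

class_counts = Counter(v for _, v in data)   # counts of each class label vj
attr_counts = defaultdict(Counter)           # counts of attribute value ai per (class, position)
for attrs, v in data:
    for i, a in enumerate(attrs):
        attr_counts[(v, i)][a] += 1

def classify(attrs):
    best_v, best_score = None, -1.0
    for v, cv in class_counts.items():
        score = cv / len(data)                    # P(vj)
        for i, a in enumerate(attrs):
            score *= attr_counts[(v, i)][a] / cv  # P(ai|vj), using the naive independence assumption
        if score > best_score:
            best_v, best_score = v, score
    return best_v

print(classify(("rain", "hot")))   # -> 'yes'
```

Note that an attribute value never seen with a class makes the whole product zero; handling this (e.g. with m-estimates) is one of the Naive Bayes subtleties discussed in the later slides.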
41(No Transcript)
42Illustrative Example
43Illustrative Example
44Naive Bayes Subtleties
45Naive Bayes Subtleties