Transcript and Presenter's Notes

Title: Bayesian Learning


1
Bayesian Learning
  • Thanks to Nir Friedman, HU

2
Example
  • Suppose we are required to build a controller
    that removes bad oranges from a packaging line
  • Decisions are made based on a sensor that reports
    the overall color of the orange

(Figure: example oranges on the packaging line, with the bad oranges labeled)
3
Classifying oranges
  • Suppose we know all the aspects of the problem
  • Prior probabilities
  • Probability of good (1) and bad (-1) oranges
  • P(C = 1): probability of a good orange
  • P(C = -1): probability of a bad orange
  • Note: P(C = 1) + P(C = -1) = 1
  • Assumption: oranges are independent, i.e., the
    occurrence of a bad orange does not depend on
    previous ones

4
Classifying oranges (cont)
  • Sensor performance
  • Let X denote the sensor measurement; each type of
    orange gives rise to a different distribution
    P(X | C) of measurements

5
Bayes Rule
  • Given this knowledge, we can compute the
    posterior probabilities
  • Bayes Rule:
    P(C = c | X = x) = P(X = x | C = c) P(C = c) / P(X = x)

6
Posterior of Oranges
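To make this concrete, here is a minimal Python sketch of the posterior computation for the orange example. The prior and the Gaussian sensor model below are assumed illustrative numbers, not values from the slides.

```python
# Posterior of "good" vs "bad" given a sensor color reading, via Bayes rule.
# The prior and the Gaussian sensor model are assumed for illustration only.
import math

prior = {+1: 0.9, -1: 0.1}                   # assumed P(C): most oranges are good
sensor = {+1: (0.7, 0.1), -1: (0.4, 0.1)}    # assumed (mean, std) of color score per class

def likelihood(x, c):
    """P(X = x | C = c) under the assumed Gaussian sensor model."""
    mu, sigma = sensor[c]
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(x):
    joint = {c: likelihood(x, c) * prior[c] for c in prior}   # P(X = x | C = c) P(C = c)
    evidence = sum(joint.values())                            # P(X = x)
    return {c: joint[c] / evidence for c in joint}

print(posterior(0.5))    # posterior shifts toward "bad" for low color scores
print(posterior(0.65))   # and toward "good" for high ones
```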
7
Decision making
  • Intuition
  • Predict Good if P(C = 1 | X) > P(C = -1 | X)
  • Predict Bad otherwise

8
Loss function
  • Assume we have classes 1, -1
  • Suppose we can make predictions a1, ..., ak
  • A loss function L(ai, cj) describes the loss
    associated with making prediction ai when the
    class is cj

                 Real Label
Prediction      -1     1
Bad              1     5
Good            10     0
9
Expected Risk
  • Given the estimates of P(C | X), we can compute
    the expected conditional risk of each decision
    R(a | X) = Σc L(a, c) P(C = c | X)

10
The Risk in Oranges
                 Real Label
Prediction      -1     1
Bad              1     5
Good            10     0

(Figure: conditional risks R(Good | X) and R(Bad | X) as functions of the sensor reading X)
11
Optimal Decisions
  • Goal
  • Minimize risk
  • Optimal decision rule
  • Given X = x, predict ai if R(ai | X = x) = min_a
    R(a | X = x)
  • (break ties arbitrarily)
  • Note: randomized decisions do not help
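As a concrete illustration of this rule, here is a small Python sketch that uses the orange loss table from the earlier slides and a given posterior P(C | X = x); the posterior value is an assumed number, not one from the slides.

```python
# Expected conditional risk R(a | x) = sum_c L(a, c) * P(C = c | x),
# and the minimum-risk decision rule, for the orange loss table.

loss = {("Bad", -1): 1, ("Bad", 1): 5,     # L(prediction, true class)
        ("Good", -1): 10, ("Good", 1): 0}

def risk(action, posterior):
    return sum(loss[(action, c)] * p for c, p in posterior.items())

def decide(posterior):
    return min(("Bad", "Good"), key=lambda a: risk(a, posterior))

posterior = {1: 0.8, -1: 0.2}    # assumed P(C | X = x) for some reading x
print(risk("Good", posterior))   # 0.8*0 + 0.2*10 = 2.0
print(risk("Bad", posterior))    # 0.8*5 + 0.2*1  = 4.2
print(decide(posterior))         # "Good" (the lower-risk action)
```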

12
0-1 Loss
  • If we don't have prior knowledge, it is common to
    use the 0-1 loss
  • L(a, c) = 0 if a = c
  • L(a, c) = 1 otherwise
  • Consequence
  • R(a | X) = P(a ≠ C | X)
  • Decision rule: choose ai if P(C = ai | X) = max_a
    P(C = a | X)

13
Bayesian Decisions Summary
  • Decisions based on two components
  • Conditional distribution P(C | X)
  • Loss function L(A,C)
  • Pros
  • Specifies optimal actions in presence of noisy
    signals
  • Can deal with skewed loss functions
  • Cons
  • Requires P(C | X)

14
Simple Statistics: Binomial Experiment
(Figure: a tossed thumbtack landing either Head or Tail)
  • When tossed, it can land in one of two positions:
    Head or Tail
  • We denote by θ the (unknown) probability P(H).
  • Estimation task
  • Given a sequence of toss samples x1, x2, ..., xM,
    we want to estimate the probabilities P(H) = θ
    and P(T) = 1 - θ

15
Why is Learning Possible?
  • Suppose we perform M independent flips of the
    thumbtack
  • The number of heads NH that we see has a binomial
    distribution: P(NH = k) = C(M, k) θ^k (1 - θ)^(M - k)
  • and thus E[NH / M] = θ
  • This suggests that we can estimate θ by NH / M

16
Maximum Likelihood Estimation
  • MLE Principle
  • Learn parameters that maximize the likelihood
    function L(θ : D) = P(D | θ)
  • This is one of the most commonly used estimators
    in statistics
  • Intuitively appealing
  • Well studied properties

17
Computing the Likelihood Function
  • To compute the likelihood in the coin-tossing
    example we only require NH and NT (the number of
    heads and the number of tails):
    L(θ : D) = θ^NH (1 - θ)^NT
  • Applying the MLE principle we get
    θ = NH / (NH + NT)
  • NH and NT are sufficient statistics for the
    binomial distribution
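A minimal Python sketch of this computation; the flip sequence is an arbitrary illustrative example, not data from the slides.

```python
# Binomial MLE from a sequence of coin flips.
# Only the sufficient statistics NH and NT are needed, not the full sequence.

flips = "HHTHTTHHHT"          # illustrative data
n_heads = flips.count("H")
n_tails = flips.count("T")

theta_mle = n_heads / (n_heads + n_tails)
print(n_heads, n_tails, theta_mle)   # 6 4 0.6
```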

18
Sufficient Statistics
  • A sufficient statistic is a function of the data
    that summarizes the relevant information for the
    likelihood
  • Formally, s(D) is a sufficient statistic if for
    any two datasets D and D'
  • s(D) = s(D') ⇒ L(θ : D) = L(θ : D')

19
Maximum A Posteriori (MAP)
  • Suppose we observe the sequence
  • H, H
  • MLE estimate is P(H) = 1, P(T) = 0
  • Should we really believe that tails are
    impossible at this stage?
  • Such an estimate can have a disastrous effect
  • If we assume that P(T) = 0, then we are willing
    to act as though this outcome is impossible

20
Laplace Correction
  • Suppose we observe n coin flips with k heads
  • MLE: θ = k / n
  • Laplace correction: θ = (k + 1) / (n + 2)
  • As though we observed one additional H and one
    additional T
  • Can we justify this estimate? Uniform prior!
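The sketch below contrasts the two estimates on the two-heads example from the MAP slide.

```python
# MLE vs. Laplace correction for estimating P(H) from coin flips.

def mle(k, n):
    return k / n

def laplace(k, n):
    return (k + 1) / (n + 2)   # as though we saw one extra H and one extra T

# Observing the sequence H, H (k = 2 heads out of n = 2 flips):
print(mle(2, 2))      # 1.0  -> tails judged impossible
print(laplace(2, 2))  # 0.75 -> tails still considered possible
```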

21
Bayesian Reasoning
  • In Bayesian reasoning we represent our
    uncertainty about the unknown parameter ? by a
    probability distribution
  • This probability distribution can be viewed as
    subjective probability
  • This is a personal judgment of uncertainty

22
Bayesian Inference
  • We start with
  • P(θ) - prior distribution over the values of θ
  • P(x1, ..., xn | θ) - likelihood of the examples
    given a known value θ
  • Given examples x1, ..., xn, we can compute the
    posterior distribution on θ:
    P(θ | x1, ..., xn) = P(x1, ..., xn | θ) P(θ) / P(x1, ..., xn)
  • where the marginal likelihood is
    P(x1, ..., xn) = ∫ P(x1, ..., xn | θ) P(θ) dθ

23
Binomial Distribution: Laplace Estimate
  • In this case the unknown parameter is θ = P(H)
  • Simplest prior: P(θ) = 1 for 0 < θ < 1
  • Likelihood: P(x1, ..., xn | θ) = θ^k (1 - θ)^(n - k),
    where k is the number of heads in the sequence
  • Marginal likelihood:
    P(x1, ..., xn) = ∫0^1 θ^k (1 - θ)^(n - k) dθ

24
Marginal Likelihood
  • Using integration by parts we have
    ∫0^1 θ^k (1 - θ)^(n - k) dθ =
    [(n - k) / (k + 1)] ∫0^1 θ^(k+1) (1 - θ)^(n - k - 1) dθ
  • Multiplying both sides by (n choose k), we have
    C(n, k) ∫0^1 θ^k (1 - θ)^(n - k) dθ =
    C(n, k+1) ∫0^1 θ^(k+1) (1 - θ)^(n - k - 1) dθ

25
Marginal Likelihood - Cont
  • The recursion terminates when k = n, where
    C(n, n) ∫0^1 θ^n dθ = 1 / (n + 1)
  • Thus
    ∫0^1 θ^k (1 - θ)^(n - k) dθ = 1 / [(n + 1) C(n, k)]
  • We conclude that the posterior is
    P(θ | x1, ..., xn) = (n + 1) C(n, k) θ^k (1 - θ)^(n - k)

26
Bayesian Prediction
  • How do we predict using the posterior?
  • We can think of this as computing the probability
    of the next element in the sequence
  • Assumption: if we know θ, the probability of Xn+1
    is independent of X1, ..., Xn

27
Bayesian Prediction
  • Thus, we conclude that
    P(Xn+1 = H | x1, ..., xn) = ∫ θ P(θ | x1, ..., xn) dθ
    = (k + 1) / (n + 2)
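As a sanity check on this result, the sketch below numerically integrates θ against the posterior (n + 1) C(n, k) θ^k (1 - θ)^(n - k) and compares it with (k + 1) / (n + 2); the particular k and n are arbitrary.

```python
# Numerically verify that the Bayesian prediction with a uniform prior
# equals the Laplace-corrected estimate (k + 1) / (n + 2).
from math import comb

def posterior_density(theta, k, n):
    # P(theta | data) for a uniform prior: (n + 1) * C(n, k) * theta^k * (1 - theta)^(n - k)
    return (n + 1) * comb(n, k) * theta**k * (1 - theta) ** (n - k)

def predict_head(k, n, steps=10000):
    # E[theta | data] via a simple midpoint-rule integration over [0, 1]
    h = 1.0 / steps
    return sum(((i + 0.5) * h) * posterior_density((i + 0.5) * h, k, n) * h
               for i in range(steps))

k, n = 7, 10   # arbitrary example: 7 heads in 10 flips
print(predict_head(k, n))   # ~ 0.6667
print((k + 1) / (n + 2))    # 0.6666...
```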

28
Naïve Bayes
29
Bayesian Classification - Binary Domain
  • Consider the following situation
  • Two classes: -1, 1
  • Each example is described by N attributes
  • Each Xn is a binary variable with values 0, 1
  • Example dataset

X1  X2  ...  XN    C
 0   1  ...   1    1
 1   0  ...   1   -1
 1   1  ...   0    1
...
 0   0  ...   0    1
30
Binary Domain - Priors
  • How do we estimate P(C) ?
  • Simple Binomial estimation
  • Count of instances with C = -1, and with C = 1

X1  X2  ...  XN    C
 0   1  ...   1    1
 1   0  ...   1   -1
 1   1  ...   0    1
...
 0   0  ...   0    1
31
Binary Domain - Attribute Probability
  • How do we estimate P(X1, ..., XN | C)?
  • Two sub-problems: estimating P(X1, ..., XN | C = 1)
    and P(X1, ..., XN | C = -1)

X1  X2  ...  XN    C
 0   1  ...   1    1
 1   0  ...   1   -1
 1   1  ...   0    1
...
 0   0  ...   0    1
32
Naïve Bayes
  • Naïve Bayes
  • Assume P(X1, ..., XN | C) = P(X1 | C) P(X2 | C) ... P(XN | C)
  • This is an independence assumption
  • Each attribute Xi is independent of the other
    attributes once we know the value of C
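A minimal Python sketch of classification under this assumption, given already-estimated parameters; the prior and per-attribute probabilities below are made-up numbers for illustration.

```python
# Naive Bayes classification for binary attributes:
# score(c) = P(C = c) * prod_i P(X_i = x_i | C = c); predict the argmax.

# Assumed (made-up) parameters: P(C) and P(X_i = 1 | C) for N = 3 attributes.
prior = {1: 0.7, -1: 0.3}
p_xi_is_1 = {1: [0.8, 0.6, 0.9], -1: [0.3, 0.5, 0.2]}

def score(x, c):
    s = prior[c]
    for i, xi in enumerate(x):
        p1 = p_xi_is_1[c][i]
        s *= p1 if xi == 1 else (1 - p1)
    return s

def classify(x):
    return max((1, -1), key=lambda c: score(x, c))

print(classify([1, 0, 1]))   # 1
print(classify([0, 0, 0]))   # -1
```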

33
Naïve Bayes - Boolean Domain
  • Parameters θi,c = P(Xi = 1 | C = c)
  • for each attribute i and class c
  • How do we estimate θ1,1?
  • Simple binomial estimation
  • Count the 1 and 0 values of X1 in instances where
    C = 1

X1  X2  ...  XN    C
 0   1  ...   1    1
 1   0  ...   1   -1
 1   1  ...   0    1
...
 0   0  ...   0    1
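Continuing the sketch above, estimating these parameters from a dataset like the one in the table is just counting; the tiny dataset and the Laplace-smoothed counts below are illustrative assumptions.

```python
# Estimating naive Bayes parameters by counting, with Laplace correction.
# data: list of (attribute vector, class label); made-up rows in the spirit of the table.
data = [([0, 1, 1], 1), ([1, 0, 1], -1), ([1, 1, 0], 1), ([0, 0, 0], 1)]

def estimate(data, n_attrs):
    classes = {c for _, c in data}
    prior, theta = {}, {}
    for c in classes:
        rows = [x for x, label in data if label == c]
        prior[c] = (len(rows) + 1) / (len(data) + len(classes))        # smoothed P(C = c)
        theta[c] = [(sum(x[i] for x in rows) + 1) / (len(rows) + 2)    # smoothed P(Xi = 1 | C = c)
                    for i in range(n_attrs)]
    return prior, theta

prior, theta = estimate(data, n_attrs=3)
print(prior)   # e.g. {1: 0.666..., -1: 0.333...}
print(theta)
```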
34
Interpretation of Naïve Bayes
35
Interpretation of Naïve Bayes
  • Each Xi votes about the prediction
  • If P(Xi | C = -1) = P(Xi | C = 1) then Xi has no
    say in the classification
  • If P(Xi | C = -1) = 0 then Xi overrides all other
    votes (veto)

36
Interpretation of Naïve Bayes
  • Set wi(Xi) = log [ P(Xi | C = 1) / P(Xi | C = -1) ]
    and w0 = log [ P(C = 1) / P(C = -1) ]
  • Classification rule: predict C = 1 if
    w0 + Σi wi(Xi) > 0, and C = -1 otherwise
37
Normal Distribution
  • The Gaussian distribution
    N(x | μ, σ²) = (1 / √(2πσ²)) exp(-(x - μ)² / (2σ²))

(Figure: plot of the Gaussian density)
38
Maximum Likelihood Estimate
  • Suppose we observe x1, ..., xm
  • Simple calculations show that the MLE is
    μ = (1/m) Σi xi and σ² = (1/m) Σi (xi - μ)²
  • Sufficient statistics are Σi xi and Σi xi²
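A minimal Python sketch of these estimates; the sample values are arbitrary.

```python
# MLE for a Gaussian: sample mean and (biased) sample variance,
# computable from the sufficient statistics sum(x) and sum(x**2).
xs = [2.1, 1.9, 2.4, 2.0, 2.6]        # arbitrary sample

m = len(xs)
s1 = sum(xs)                          # sufficient statistic: sum of x_i
s2 = sum(x * x for x in xs)           # sufficient statistic: sum of x_i^2

mu_hat = s1 / m
var_hat = s2 / m - mu_hat ** 2        # equals (1/m) * sum((x - mu_hat)**2)
print(mu_hat, var_hat)
```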

39
Naïve Bayes with Gaussian Distributions
  • Recall, P(X1, ..., XN | C) = Πi P(Xi | C)
  • Assume P(Xi | C = c) is Gaussian with parameters
    μi,c and σi²
  • Mean of Xi depends on the class
  • Variance of Xi does not

40
Naïve Bayes with Gaussian Distributions
  • Recall the vote wi(Xi) = log [ P(Xi | C = 1) /
    P(Xi | C = -1) ]; with Gaussian attributes and a
    shared variance it becomes
    wi(Xi) = [(μi,1 - μi,-1) / σi²] (Xi - (μi,1 + μi,-1) / 2)
  • (μi,1 - μi,-1): distance between the means
  • (Xi - (μi,1 + μi,-1) / 2): distance of Xi to the
    midway point
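A short Python sketch of this identity for a single attribute, using made-up class means and a shared variance; it checks that the linear form matches the direct log density ratio.

```python
# Gaussian naive Bayes vote for one attribute with a shared variance:
# the log density ratio is linear in x.
from math import log, pi, isclose

mu_pos, mu_neg, var = 0.7, 0.4, 0.01    # assumed class means and shared variance

def log_density(x, mu, var):
    return -0.5 * log(2 * pi * var) - (x - mu) ** 2 / (2 * var)

def vote_direct(x):
    return log_density(x, mu_pos, var) - log_density(x, mu_neg, var)

def vote_linear(x):
    return (mu_pos - mu_neg) / var * (x - (mu_pos + mu_neg) / 2)

x = 0.62
print(vote_direct(x), vote_linear(x))            # both ~ 2.1
print(isclose(vote_direct(x), vote_linear(x)))   # True
```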
41
Different Variances?
  • If we allow different variances, the
    classification rule is more complex
  • The term is quadratic in Xi