Statistical Learning - PowerPoint PPT Presentation

1
Statistical Learning
  • Chapter 20 of AIMA
  • KAIST CS570
  • Lecture note

Based on AIMA slides, Jahwan Kim's slides, and Duda, Hart & Stork's slides

2
Statistical Learning
  • We view LEARNING as a form of uncertain reasoning
    from observation

3
Outline
  • Bayesian Learning
  • Bayesian inference
  • MAP and ML
  • Naïve Bayesian method
  • Parameter Learning
  • Examples
  • Regression and LMS
  • Learning Probability Distribution
  • Parametric method
  • Non-parametric method

4
Bayesian Learning 1
  • View learning as Bayesian updating of a
    probability distribution over the hypothesis
    space
  • H is the hypothesis variable, with values h_1, …, h_n, the possible hypotheses.
  • Let d = (d_1, …, d_N) be the observed data vectors.
  • The i.i.d. assumption is often (always) made.
  • Let X denote the prediction.
  • In Bayesian learning, we compute the probability of each hypothesis given the data, and predict on that basis.
  • Predictions are made by using all hypotheses.
  • Learning in Bayesian setting is reduced to
    probabilistic inference.

5
Bayesian Learning 2
  • The probability that the prediction is X, given the observed data d, is
  • P(X|d) = Σ_i P(X|d, h_i) P(h_i|d) = Σ_i P(X|h_i) P(h_i|d)
  • The prediction is a weighted average over the predictions of the individual hypotheses (see the sketch below).
  • Hypotheses are intermediaries between the data and the predictions.
  • Requires computing P(h_i|d) for all i, which is usually intractable.
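A minimal sketch of this weighted-average prediction over a discrete hypothesis space; the function names and arguments are illustrative, not from the slides.

```python
# Full Bayesian prediction over a discrete hypothesis space (illustrative sketch).
# `likelihood(h, data)` returns P(data | h); `predict(h, x)` returns P(X = x | h).
def bayesian_predict(x, data, hypotheses, priors, likelihood, predict):
    # Posterior over hypotheses: P(h_i | d) is proportional to P(d | h_i) P(h_i)
    unnorm = [likelihood(h, data) * prior for h, prior in zip(hypotheses, priors)]
    total = sum(unnorm)
    posteriors = [u / total for u in unnorm]
    # The prediction is the posterior-weighted average of per-hypothesis predictions
    return sum(predict(h, x) * post for h, post in zip(hypotheses, posteriors))
```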

6
Bayesian Learning Basics Terms
  • P(h_i) is called the (hypothesis) prior.
  • We can embed knowledge by means of the prior.
  • It also controls the complexity of the model.
  • P(h_i|d) is called the posterior (or a posteriori) probability.
  • Using Bayes' rule,
  • P(h_i|d) ∝ P(d|h_i) P(h_i)
  • P(d|h_i) is called the likelihood of the data.
  • Under the i.i.d. assumption,
  • P(d|h_i) = Π_j P(d_j|h_i)
  • Let h_MAP be the hypothesis for which the posterior probability P(h_i|d) is maximal. It is called the maximum a posteriori (or MAP) hypothesis.

7
Candy Example
  • Two flavors of candy, cherry and lime, wrapped in
    the same opaque wrapper. (cannot see inside)
  • Sold in very large bags, of which there are known
    to be five kinds
  • h1: 100% cherry
  • h2: 75% cherry, 25% lime
  • h3: 50% cherry, 50% lime
  • h4: 25% cherry, 75% lime
  • h5: 100% lime
  • Priors are known: P(h1), …, P(h5) are 0.1, 0.2, 0.4, 0.2, 0.1.
  • Suppose from a bag of candy we took N pieces of candy and all of them were lime (data d_N).
  • What kind of bag is it ?
  • What flavor will the next candy be ?

8
Candy Example: Posterior Probability of Hypotheses
  • P(h1|d_N) ∝ P(d_N|h1) P(h1) = 0,
    P(h2|d_N) ∝ P(d_N|h2) P(h2) = 0.2 (0.25)^N,
    P(h3|d_N) ∝ P(d_N|h3) P(h3) = 0.4 (0.5)^N,
    P(h4|d_N) ∝ P(d_N|h4) P(h4) = 0.2 (0.75)^N,
    P(h5|d_N) ∝ P(d_N|h5) P(h5) = 0.1.
  • Normalize them by requiring them to sum to 1 (see the numeric sketch below).
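A numeric sketch of this update and the resulting prediction, using the priors and per-bag lime proportions from the slides:

```python
# Posterior over the five candy-bag hypotheses after observing N lime candies,
# and the predictive probability that the next candy is lime.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1) .. P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h_i)

def posteriors(n):
    unnorm = [prior * (pl ** n) for prior, pl in zip(priors, p_lime)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def p_next_lime(n):
    # P(next = lime | d_N) = sum_i P(lime | h_i) P(h_i | d_N)
    return sum(pl * post for pl, post in zip(p_lime, posteriors(n)))

print(posteriors(10))    # h5 dominates as N grows
print(p_next_lime(10))   # approaches 1
```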

9
Candy Example: Prediction Probability
10
Maximum a posteriori (MAP) Learning
  • Since calculating the exact probability is often impractical, we approximate using the MAP hypothesis. That is,
  • P(X|d) ≈ P(X|h_MAP)
  • Make predictions with the most probable hypothesis.
  • Summing over the hypothesis space is often intractable;
  • instead of a large summation (integration), an optimization problem is solved.
  • For deterministic hypotheses, P(d|h_i) is 1 if h_i is consistent with the data and 0 otherwise → MAP gives the simplest consistent hypothesis (cf. science).
  • The true hypothesis eventually dominates the
    Bayesian prediction

11
MAP Approximation: the MDL Principle
  • Since P(h_i|d) ∝ P(d|h_i) P(h_i), instead of maximizing P(h_i|d) we may maximize P(d|h_i) P(h_i).
  • Equivalently, we may minimize
  • −log P(d|h_i) P(h_i) = −log P(d|h_i) − log P(h_i).
  • We can interpret this as choosing h_i to minimize the number of bits required to encode the hypothesis h_i and the data d under that hypothesis.
  • The principle of minimizing code length (under
    some pre-determined coding scheme) is called the
    minimum description length (or MDL) principle.
  • MDL is used in a wide range of practical machine-learning applications (a small description-length comparison is sketched below).
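A toy description-length comparison, reusing the candy hypotheses from the earlier slides (the specific numbers are illustrative):

```python
import math

# Description length in bits of hypothesis h plus the data encoded under h:
# L(h, d) = -log2 P(d | h) - log2 P(h). The MDL (equivalently MAP) choice minimizes it.
def description_length(prior, likelihood):
    if likelihood == 0.0:
        return float("inf")      # hypothesis inconsistent with the data: infinite cost
    return -math.log2(likelihood) - math.log2(prior)

# Five lime candies under h3 (50% lime, prior 0.4) vs. h5 (100% lime, prior 0.1)
print(description_length(0.4, 0.5 ** 5))   # about 6.3 bits
print(description_length(0.1, 1.0 ** 5))   # about 3.3 bits, so h5 is preferred
```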

12
Maximum Likelihood Approximation
  • Assume furthermore that the P(h_i) are all equal, i.e., assume a uniform prior.
  • Reasonable when there is no reason to prefer one hypothesis over another a priori.
  • For large data sets, the prior becomes irrelevant.
  • To obtain the MAP hypothesis, it then suffices to maximize P(d|h_i), the likelihood.
  • This gives the maximum likelihood hypothesis h_ML.
  • MAP with a uniform prior reduces to ML.
  • ML is the standard statistical learning method.
  • Simply find the best fit to the data.

13
Naïve Bayes Method
  • Attributes (components of the observed data) are assumed to be independent in the Naïve Bayes method.
  • Works well for about 2/3 of real-world problems, despite the naivety of this assumption.
  • Goal: predict the class C, given the observed attribute values X_i = x_i.
  • By the independence assumption,
  • P(C|x_1, …, x_n) ∝ P(C) Π_i P(x_i|C)
  • We choose the most likely class (see the sketch below).
  • Merits of NB
  • Scales well: no search is required.
  • Robust against noisy data.
  • Gives probabilistic predictions.
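A minimal count-based sketch of such a classifier for discrete attributes, assuming examples are given as tuples of attribute values (the names and the add-one smoothing are illustrative choices, not from the slides):

```python
import math
from collections import Counter, defaultdict

# Naive Bayes for discrete attributes: pick the class c maximizing
# log P(c) + sum_i log P(x_i | c), with add-one smoothing for unseen values.
def train_nb(examples, labels):
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    for x, c in zip(examples, labels):
        for i, v in enumerate(x):
            value_counts[(i, c)][v] += 1
    return class_counts, value_counts

def predict_nb(x, class_counts, value_counts):
    total = sum(class_counts.values())
    def score(c):
        s = math.log(class_counts[c] / total)
        for i, v in enumerate(x):
            counts = value_counts[(i, c)]
            s += math.log((counts[v] + 1) / (class_counts[c] + len(counts) + 1))
        return s
    return max(class_counts, key=score)
```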

14
Learning Curve on the Restaurant Problem
15
Learning with Data: Parameter Learning
  • Introduce a parametric probability model with parameter θ.
  • Then the hypotheses are h_θ, i.e., hypotheses are parameterized.
  • In the simplest case, θ is a single scalar. In more complex cases, θ consists of many components.
  • Using the data d, predict the parameter θ.

16
ML Parameter Learning Examples (discrete case)
  • A bag of candy whose lime-cherry proportions are completely unknown.
  • In this case we have hypotheses parameterized by the probability θ of cherry.
  • P(d|h_θ) = Π_j P(d_j|h_θ) = θ^c (1−θ)^ℓ, where c and ℓ are the numbers of cherry and lime candies observed.
  • Find h_θ: maximize P(d|h_θ).
  • Two wrappers, green and red, are selected according to some unknown conditional distribution, depending on the flavor.
  • This model has three parameters: θ = P(F=cherry), θ1 = P(W=red|F=cherry), θ2 = P(W=red|F=lime).
  • P(d|h_Θ) = θ^c (1−θ)^ℓ · θ1^rc (1−θ1)^gc · θ2^rl (1−θ2)^gl, where rc, gc count red- and green-wrapped cherries and rl, gl count red- and green-wrapped limes.
  • Find h_Θ: maximize P(d|h_Θ) (see the counting sketch below).
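The maxima have a closed form: the ML estimates are just observed frequencies. A counting sketch for the wrapped-candy model, assuming the data is encoded as a list of (flavor, wrapper) pairs (an illustrative encoding, not from the slides):

```python
# ML estimates for the wrapped-candy model are observed frequencies.
def ml_estimates(data):
    c = sum(1 for flavor, _ in data if flavor == "cherry")
    l = len(data) - c
    rc = sum(1 for flavor, wrapper in data if flavor == "cherry" and wrapper == "red")
    rl = sum(1 for flavor, wrapper in data if flavor == "lime" and wrapper == "red")
    theta = c / (c + l)                   # P(F = cherry)
    theta1 = rc / c if c else 0.0         # P(W = red | F = cherry)
    theta2 = rl / l if l else 0.0         # P(W = red | F = lime)
    return theta, theta1, theta2

print(ml_estimates([("cherry", "red"), ("cherry", "green"), ("lime", "green")]))
```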

17
ML Parameter Learning Example (continuous case): Single-Variable Gaussian
  • Gaussian pdf on a single variable: P(x|μ,σ) = (1/(σ√(2π))) exp(−(x−μ)²/(2σ²))
  • Suppose x_1, …, x_N are observed. Then the log likelihood is
  • L = Σ_j ( −log(σ√(2π)) − (x_j−μ)²/(2σ²) )
  • We want to find the μ and σ that maximize this. Find where the gradient is zero.

18
ML Parameter Learning Example (continuous case): Single-Variable Gaussian
  • Solving this, we find the sample mean μ = (1/N) Σ_j x_j and σ² = (1/N) Σ_j (x_j − μ)².
  • This verifies that ML agrees with our common sense (a numeric check is sketched below).
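A quick numeric check of these closed-form estimates (note the 1/N, i.e. biased, variance):

```python
import math

# ML estimates for a single-variable Gaussian: the sample mean and the
# (1/N, biased) sample standard deviation.
def gaussian_ml(xs):
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

print(gaussian_ml([2.1, 1.9, 2.4, 2.0, 1.6]))   # mean 2.0, sigma about 0.26
```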

19
ML Parameter Learning Example (continuous case): Linear Regression
  • Y has a Gaussian distribution whose mean depends linearly on X and whose standard deviation is fixed.
  • Maximizing the likelihood is then equivalent to
  • minimizing Σ_j (y_j − (θ1 x_j + θ2))²
  • This quantity is the sum of squared errors. Thus, in this case,
  • ML ≡ least mean squares (LMS) (a closed-form sketch follows below).
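A closed-form simple-linear-regression sketch of this least-squares fit (illustrative; slope and intercept play the roles of θ1 and θ2):

```python
# Least-squares fit of y = slope * x + intercept, the ML solution when Y has
# Gaussian noise of fixed variance around a line:
# minimize sum_j (y_j - (slope * x_j + intercept))^2.
def least_squares(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

print(least_squares([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8]))   # roughly slope 2, intercept 1
```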

20
Bayesian Parameter Learning
  • ML approximation's deficiency with small data:
  • e.g., the ML estimate after observing one cherry is "100% cherry".
  • Bayesian parameter learning
  • Place a hypothesis prior over the possible values
    of parameters
  • Update this distribution as data arrive

21
Bayesian Learning of Parameter θ
  • The density becomes more peaked as the number of samples increases.
  • Despite different prior distributions, the posterior densities are virtually identical given a large set of data.

22
Bayesian Parameter Learning Example: Beta Distribution (candy example revisited)
  • θ is the value of a random variable Θ in the Bayesian view.
  • P(Θ) is a continuous distribution.
  • The uniform density is one candidate.
  • Another possibility is to use beta distributions.
  • The beta distribution has two hyperparameters a and b, and is given by (α a normalizing constant)
  • beta[a,b](θ) = α θ^(a−1) (1−θ)^(b−1)
  • Its mean is a/(a+b).
  • A larger a suggests θ is closer to 1 than to 0.
  • More peaked when a+b is large, suggesting greater certainty about the value of Θ.

23
Beta Distribution
beta[a,b](θ) = α θ^(a−1) (1−θ)^(b−1)
24
Bayesian Parameter Learning Example: Properties of the Beta Distribution
  • If Θ has a prior beta[a,b], then the posterior distribution for Θ is also a beta distribution.
  • P(θ|D_1 = cherry) = α P(D_1 = cherry|θ) P(θ)
  • = α θ · beta[a,b](θ)
  • = α θ · θ^(a−1) (1−θ)^(b−1)
  • = α θ^a (1−θ)^(b−1)
  • = beta[a+1,b](θ)
  • The beta distribution is called the conjugate prior for the family of distributions over a Boolean variable.
  • a and b act as virtual counts:
  • starting from the uniform prior beta[1,1], beta[a,b] corresponds to having seen a−1 cherries and b−1 limes (see the update sketch below).
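A minimal sketch of this conjugate update as counting, assuming flavor observations given as strings:

```python
# Conjugate Beta update for theta = P(cherry): starting from beta[a, b],
# each observed cherry adds 1 to a and each lime adds 1 to b (virtual counts).
def beta_update(a, b, observations):
    for flavor in observations:
        if flavor == "cherry":
            a += 1
        else:
            b += 1
    return a, b

a, b = beta_update(1, 1, ["cherry", "lime", "cherry", "cherry"])  # uniform prior beta[1,1]
print(a, b, a / (a + b))   # posterior beta[4,2]; posterior mean 2/3
```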

25
Density Estimation
  • Classical parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multimodal densities.
  • Nonparametric procedures can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known.
  • There are two types of nonparametric methods:
  • estimating the class-conditional densities p(x|ω_j), or
  • bypassing the densities and going directly to a posteriori probability estimation.

26
Density Estimation: Basic Idea
  • The probability that a vector x will fall in region R is
  • P = ∫_R p(x′) dx′    (1)
  • P is a smoothed (or averaged) version of the density function p(x). If we have a sample of size n, the probability that exactly k of the points fall in R follows the binomial law
  • P_k = C(n,k) P^k (1−P)^(n−k)    (2)
  • and the expected value of k is
  • E(k) = nP    (3)

27
  • The ML estimate of P (the value maximizing P_k)
  • is reached for P̂ = k/n.
  • Therefore, the ratio k/n is a good estimate of the probability P and hence of the density function p.
  • If p(x) is continuous and the region R is so small that p does not vary significantly within it, we can write
  • P ≅ p(x) V    (4)
  • where x is a point within R and V is the volume enclosed by R.
  • Combining equations (1), (3), and (4) yields p(x) ≅ (k/n)/V (a one-dimensional sketch follows below).
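A one-dimensional illustration of this basic estimate, counting samples in a small interval around x (the interval length stands in for the volume V):

```python
# Basic density estimate in one dimension: p(x) ~ (k/n) / V, where k is the number
# of samples falling in a small interval of length V centered at x.
def density_estimate(x, samples, half_width):
    k = sum(1 for s in samples if abs(s - x) <= half_width)
    volume = 2 * half_width
    return k / (len(samples) * volume)

samples = [0.1, 0.2, 0.25, 0.8, 1.3]
print(density_estimate(0.2, samples, half_width=0.1))   # 3 of 5 samples in [0.1, 0.3]
```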

28
Parzen Windows
  • The Parzen-window approach to estimating densities assumes that the region R_n is a d-dimensional hypercube with edge length h_n (and volume V_n = h_n^d).
  • The window function φ((x−x_i)/h_n) is equal to unity if x_i falls within the hypercube of volume V_n centered at x, and equal to zero otherwise.

29
  • The number of samples in this hypercube is
  • k_n = Σ_(i=1..n) φ((x−x_i)/h_n)
  • By substituting k_n into equation (7), we obtain the following estimate:
  • p_n(x) = (1/n) Σ_(i=1..n) (1/V_n) φ((x−x_i)/h_n)
  • p_n(x) estimates p(x) as an average of functions of x and the samples x_i (i = 1, …, n). These window functions φ can be quite general!

30
Illustration of Parzen Window
  • The behavior of the Parzen-window method
  • Case where p(x) ~ N(0,1)
  • Let φ(u) = (1/√(2π)) exp(−u²/2) and h_n = h_1/√n (n > 1), where h_1 is a known parameter.
  • Thus
  • p_n(x) = (1/n) Σ_i (1/h_n) φ((x−x_i)/h_n)
  • is an average of normal densities centered at the samples x_i (see the sketch below).
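A one-dimensional sketch of this estimator with the Gaussian window and shrinking width h_n = h_1/√n:

```python
import math

# One-dimensional Parzen-window estimate with a Gaussian window:
# p_n(x) = (1/n) * sum_i (1/h_n) * phi((x - x_i) / h_n), with h_n = h_1 / sqrt(n).
def phi(u):
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def parzen_estimate(x, samples, h1=1.0):
    n = len(samples)
    hn = h1 / math.sqrt(n)
    return sum(phi((x - xi) / hn) for xi in samples) / (n * hn)

samples = [-0.3, 0.1, 0.4, 1.2, -1.0]
print(parzen_estimate(0.0, samples, h1=1.0))
```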

31
Numerical results
  • For n = 1 and h_1 = 1, and for n = 10 and h = 0.1, the contributions of the individual samples are clearly observable!

32
(No Transcript)
33
(No Transcript)
34
Analogous results are also obtained in two
dimensions as illustrated
35
Case where p(x) = λ1·U(a,b) + λ2·T(c,d) (unknown density: a mixture of a uniform and a triangle density)
36
(No Transcript)
37
Summary
  • Full Bayesian learning gives the best possible predictions but is intractable.
  • MAP learning balances complexity with accuracy on the training data.
  • The ML approximation assumes a uniform prior; it is OK for large data sets.
  • Parameter estimation is often used