Transcript and Presenter's Notes

Title: Maximum Likelihood


1
Outline
  • Maximum Likelihood
  • Maximum A-Posteriori (MAP) Estimation
  • Bayesian Parameter Estimation
  • Example: The Gaussian Case
  • Recursive Bayesian Incremental Learning
  • Problems of Dimensionality
  • Nonparametric Techniques
  • Density Estimation
  • Histogram Approach
  • Parzen-window method

2
Bayes' Decision Rule (Minimizes the probability
of error) 
  • choose w1 if P(w1|x) > P(w2|x)
  • choose w2 otherwise
  • or
  • w1 if p(x|w1) P(w1) > p(x|w2) P(w2)
  • w2 otherwise
  • and
  • P(error|x) = min[ P(w1|x), P(w2|x) ]
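As a small illustration (not from the original slides), the two-class rule can be written in a few lines of Python; the posterior values p1 and p2 are assumed to be given.

    # Minimal sketch of the two-class Bayes decision rule.
    # p1, p2 are the posteriors P(w1|x) and P(w2|x) for a given x (assumed known here).
    def bayes_decide(p1, p2):
        label = 1 if p1 > p2 else 2   # choose the class with the larger posterior
        p_error = min(p1, p2)         # P(error|x) = min[P(w1|x), P(w2|x)]
        return label, p_error

    # Example: P(w1|x) = 0.7, P(w2|x) = 0.3 -> choose w1, error probability 0.3
    print(bayes_decide(0.7, 0.3))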

3
Normal Density - Multivariate Case
  • The general multivariate normal density (MND) in
    d dimensions is written as
  • It can be shown that
  • which means, for the components,
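The equations on this slide were images and did not survive extraction; assuming the standard multivariate normal expressions, they presumably read

    p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left[-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right],
    E[x] = \mu, \qquad E[(x-\mu)(x-\mu)^T] = \Sigma,
    \mu_i = E[x_i], \qquad \sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)].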

4
Maximum Likelihood and Bayesian Parameter
Estimation
  • To design an optimal classifier we need P(wi)
    and p(x|wi), but usually we do not know them.
  • Solution: use training data to estimate the
    unknown probabilities. Estimation of
    class-conditional densities is a difficult task.

5
Maximum Likelihood and Bayesian Parameter
Estimation
  • Supervised learning: we get to see samples from
    each of the classes separately (called tagged
    or labeled samples).
  • Tagged samples are expensive, so we need to learn
    the distributions as efficiently as possible.
  • Two methods: parametric (easier) and
    non-parametric (harder).

6
Learning From Observed Data
[Diagram: learning from observed data, contrasting hidden vs. observed variables and unsupervised vs. supervised settings.]
7
Maximum Likelihood and Bayesian Parameter
Estimation
  • Program for parametric methods:
  • Assume specific parametric distributions with
    parameters θ.
  • Estimate the parameters θ from the training
    data.
  • Replace the true class-conditional density with
    the approximation and apply the Bayesian
    framework for decision making.

8
Maximum Likelihood and Bayesian Parameter
Estimation
  • Suppose we can assume that the relevant
    (class-conditional) densities are of some
    parametric form. That is,
  • p(x|w) = p(x|θ), where θ is a parameter vector.
  • Examples of parameterized densities:
  • Binomial: x(n) has m 1s and (n − m) 0s.
  • Exponential: each data point x is distributed
    according to an exponential density.

9
Maximum Likelihood and Bayesian Parameter
Estimation cont.
  • Two procedures for parameter estimation will be
    considered:
  • Maximum likelihood estimation: choose the
    parameter value that makes the data most probable
    (i.e., maximizes the probability of obtaining the
    sample that has actually been observed).
  • Bayesian learning: define a prior probability on
    the model space and compute the posterior.
    Additional samples sharpen the posterior density,
    which peaks near the true values of the
    parameters.

10
Sampling Model
  • It is assumed that a sample set D with n
    independently generated samples is available.
  • The sample set is partitioned into separate
    sample sets for each class, D1, ..., Dc.
  • A generic sample set will simply be denoted by D.
  • Each class-conditional density p(x|wi) is
    assumed to have a known parametric form and is
    uniquely specified by a parameter (vector) θi.
  • Samples in each set are assumed to be
    independent and identically distributed (i.i.d.)
    according to some true probability law p(x|θi).

11
Log-Likelihood function and Score Function
  • The sample sets are assumed to be functionally
    independent, i.e., the training set Di contains
    no information about θj for j ≠ i.
  • The i.i.d. assumption implies that
  • Let D = {x1, ..., xn} be a generic sample of
    size n.
  • Log-likelihood function:
  • The log-likelihood function is identical to the
    logarithm of the probability density function,
    but is interpreted as a function of the parameter
    θ for the given sample D.
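The two missing formulas above are presumably the i.i.d. factorization and the log-likelihood:

    p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta), \qquad
    l(\theta) \equiv \ln p(D|\theta) = \sum_{k=1}^{n} \ln p(x_k|\theta).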

12
Log-Likelihood Illustration
  • Assume that all the points in D are drawn
    from some (one-dimensional) normal distribution
    with some (known) variance and unknown mean.

13
Log-Likelihood function and Score Function cont.
  • Maximum likelihood estimator (MLE):
  • (tacitly assuming that such a maximum exists!)
  • Score function:
  • and hence
  • Necessary condition for the MLE (if it is not on
    the border of the domain):
14
Maximum A Posteriori
  • Maximum a posteriori (MAP):
  • Find the value of θ that maximizes l(θ) + ln p(θ),
    where p(θ) is a prior probability of different
    parameter values. A MAP estimator finds the peak,
    or mode, of the posterior.
  • Drawback of MAP: after an arbitrary nonlinear
    transformation of the parameter space, the
    density will change, and the MAP solution will no
    longer be correct.

15
Maximum A-Posteriori (MAP) Estimation
  • The most likely value of θ is given by

16
Maximum A-Posteriori (MAP) Estimation

  • since the data are i.i.d.
  • We can disregard the normalizing factor p(D)
    when looking for the maximum.

17
MAP - continued
  • So, the θ we are looking for is

18
The Gaussian Case: Unknown Mean
  • Suppose that the samples are drawn from a
    multivariate normal population with mean μ and
    covariance matrix Σ.
  • Consider first the case where only the mean μ is
    unknown.
  • For a sample point xk, we have
  • and
  • The maximum likelihood estimate for μ must
    satisfy
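For the Gaussian case the missing expressions are presumably

    \ln p(x_k|\mu) = -\tfrac{1}{2}\ln\!\left[(2\pi)^d |\Sigma|\right] - \tfrac{1}{2}(x_k-\mu)^T \Sigma^{-1} (x_k-\mu),
    \nabla_{\mu} \ln p(x_k|\mu) = \Sigma^{-1}(x_k-\mu),

so the ML estimate must satisfy \sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0.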

19
The Gaussian Case: Unknown Mean
  • Multiplying by Σ and rearranging, we obtain
  • The ML estimate for the unknown population mean
    is just the arithmetic average of the training
    samples (the sample mean).
  • Geometrically, if we think of the n samples as a
    cloud of points, the sample mean is the centroid
    of the cloud.
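Written out, the estimate is

    \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k.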

20
The Gaussian Case: Unknown Mean and Covariance
  • In the general multivariate normal case, neither
    the mean μ nor the covariance matrix Σ is known.
  • Consider first the univariate case with θ1 = μ
    and θ2 = σ². The log-likelihood of a single
    point is
  • and its derivative is
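These are presumably the standard univariate expressions

    \ln p(x_k|\theta) = -\tfrac{1}{2}\ln(2\pi\theta_2) - \frac{(x_k-\theta_1)^2}{2\theta_2}, \qquad
    \nabla_{\theta} \ln p(x_k|\theta) =
    \begin{pmatrix} (x_k-\theta_1)/\theta_2 \\ -\dfrac{1}{2\theta_2} + \dfrac{(x_k-\theta_1)^2}{2\theta_2^2} \end{pmatrix}.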

21
The Gaussian Case: Unknown Mean and Covariance
  • Setting the gradient to zero and using all the
    sample points, we get the following necessary
    conditions
  • where these are the ML estimates for μ and σ²,
    respectively.
  • Solving for the estimates of μ and σ², we obtain
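In this notation the solutions are

    \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad
    \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})^2.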

22
The Gaussian multivariate case
  • For the multivariate case, it is easy to show
    that the ML estimates for μ and Σ are given by
  • The ML estimate for the mean vector is the sample
    mean, and the ML estimate for the covariance
    matrix is the arithmetic average of the n
    outer-product matrices of the centered samples
    (see the sketch below).
  • The ML estimate for the covariance is biased
    (i.e., the expected value of the sample variance
    over all data sets of size n is not equal to the
    true variance).
23
The Gaussian multivariate case
  • Unbiased estimators for σ² and Σ are given by
  • and
  • C is called the sample covariance matrix. C is
    absolutely unbiased; the ML estimate is only
    asymptotically unbiased.
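The missing expressions are presumably

    \sigma^2_{\text{unbiased}} = \frac{1}{n-1}\sum_{k=1}^{n} (x_k-\hat{\mu})^2, \qquad
    C = \frac{1}{n-1}\sum_{k=1}^{n} (x_k-\hat{\mu})(x_k-\hat{\mu})^T.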

24
Bayesian Estimation: Class-Conditional Densities
  • The aim is to find the posteriors P(wi|x) knowing
    p(x|wi) and P(wi), but they are unknown. How to
    find them?
  • Given the sample D, we say that the aim is to
    find P(wi|x, D).
  • Bayes' formula gives
  • We use the information provided by the training
    samples to determine the class-conditional
    densities and the prior probabilities.
  • Generally used assumptions:
  • Priors are generally known or obtainable from
    trivial calculations. Thus P(wi) = P(wi|D).
  • The training set can be separated into c subsets
    D1, ..., Dc.

25
Bayesian Estimation: Class-Conditional Densities
  • The samples in Dj have no influence on
    p(x|wi, Di) if j ≠ i.
  • Thus we can write
  • We have c separate problems of the form:
  • Use a set D of samples drawn independently
    according to a fixed but unknown probability
    distribution p(x) to determine p(x|D).

26
Bayesian Estimation: General Theory
  • Bayesian learning considers θ (the parameter
    vector to be estimated) to be a random variable.
  • Before we observe the data, the parameters are
    described by a prior p(θ), which is typically
    very broad. Once we have observed the data, we
    can use Bayes' formula to find the posterior
    p(θ|D). Since some values of the parameters are
    more consistent with the data than others, the
    posterior is narrower than the prior. This is
    Bayesian learning (see fig.).

27
General Theory cont.
  • Density function for x, given the training data
    set D:
  • From the definition of conditional probability
    densities,
  • The first factor is independent of D since it is
    just our assumed parametric form for the density.
  • Therefore
  • Instead of choosing a specific value for θ, the
    Bayesian approach performs a weighted average
    over all values of θ.
  • The weighting factor, which is the posterior of
    θ, is determined by starting from some assumed
    prior.
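In formulas, the Bayesian (predictive) density is the weighted average

    p(x|D) = \int p(x,\theta|D)\, d\theta = \int p(x|\theta)\, p(\theta|D)\, d\theta.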

28
General Theory cont.
  • Then update it using Bayes' formula to take
    account of the data set D. Since the samples are
    drawn independently,
  • which is the likelihood function.
  • The posterior for θ is
  • where the normalization factor is
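Writing the normalization factor as \alpha, the missing formulas are presumably

    p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta), \qquad
    p(\theta|D) = \frac{p(D|\theta)\, p(\theta)}{\alpha}, \qquad
    \alpha = \int p(D|\theta)\, p(\theta)\, d\theta.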

29
Bayesian Learning: Univariate Normal Distribution
  • Let us use the Bayesian estimation technique to
    calculate the a posteriori density p(μ|D) and the
    desired probability density p(x|D) for the case
  • Univariate case:
  • Let μ be the only unknown parameter:

30
Bayesian Learning: Univariate Normal Distribution
  • Prior probability: a normal distribution over μ,
  • μ0 encodes some prior knowledge about the true
    mean μ, while σ0² measures our prior uncertainty.
  • If μ is drawn from p(μ), then the density for x
    is completely determined. Letting
    D = {x1, ..., xn}, we use

31
Bayesian Learning: Univariate Normal Distribution
  • Computing the posterior distribution

32
Bayesian Learning: Univariate Normal Distribution
  • where factors that do not depend on μ have been
    absorbed into the constants.
  • p(μ|D) is an exponential of a quadratic function
    of μ, i.e., it is a normal density.
  • p(μ|D) remains normal for any number of training
    samples.
  • If we write
  • then, identifying the coefficients, we get
33
Bayesian Learning: Univariate Normal Distribution
  • where x̄n is the sample mean.
  • Solving explicitly for μn and σn², we obtain
    the expressions below.
  • μn represents our best guess for μ after
    observing n samples.
  • σn² measures our uncertainty about this guess.
  • σn² decreases monotonically with n (approaching
    σ²/n as n approaches infinity).
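The missing expressions are presumably the standard results for a normal likelihood with the normal prior p(μ) ~ N(μ0, σ0²):

    \mu_n = \frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\,\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2+\sigma^2}\,\mu_0, \qquad
    \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}, \qquad
    \bar{x}_n = \frac{1}{n}\sum_{k=1}^{n} x_k.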

34
Bayesian Learning: Univariate Normal Distribution
  • Each additional observation decreases our
    uncertainty about the true value of μ.
  • As n increases, p(μ|D) becomes more and more
    sharply peaked, approaching a Dirac delta
    function as n approaches infinity. This behavior
    is known as Bayesian learning.

35
Bayesian Learning: Univariate Normal Distribution
  • In general, μn is a linear combination of x̄n and
    μ0, with coefficients that are non-negative and
    sum to one.
  • Thus μn lies somewhere between x̄n and μ0.
  • If σ0 ≠ 0, then μn approaches the sample mean x̄n
    as n approaches infinity.
  • If σ0 = 0, our a priori certainty that μ = μ0 is
    so strong that no number of observations can
    change our opinion.
  • If σ0 >> σ, the a priori guess is very uncertain,
    and we take μn ≈ x̄n.
  • The ratio σ²/σ0² is called the dogmatism.

36
Bayesian Learning: Univariate Normal Distribution
  • The Univariate Case
  • where

37
Bayesian Learning: Univariate Normal Distribution
  • Since p(x|D) is normal with mean μn and variance
    σ² + σn², we can write
  • To obtain the class-conditional probability
    p(x|wj, Dj), whose parametric form is known to be
    normal, we replace μ by μn and σ² by σ² + σn².
  • The conditional mean μn is treated as if it were
    the true mean, and the known variance is
    increased to account for the additional
    uncertainty in x resulting from our lack of exact
    knowledge of the mean μ.
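The predictive density referred to above is presumably

    p(x|D) = \int p(x|\mu)\, p(\mu|D)\, d\mu \;\sim\; N(\mu_n,\ \sigma^2 + \sigma_n^2).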

38
Example (demo-MAP)
  • We have N points which are generated by a
    one-dimensional Gaussian.
  • Since we think that the mean should not be very
    big, we use as a prior a zero-mean Gaussian whose
    variance is a hyperparameter. The total objective
    function is
  • which is maximized to give the estimate shown
    below.
  • For a weak prior the influence of the prior is
    negligible and the result is the ML estimate, but
    for a very strong belief in the prior the
    estimate tends to zero. Thus,
  • if few data are available, the prior will bias
    the estimate towards the prior expected value.
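Under the assumptions stated above (a zero-mean Gaussian prior on the mean with variance \sigma_0^2, data variance \sigma^2), the maximization presumably gives the estimate

    \hat{\mu} = \frac{\sum_{k=1}^{N} x_k}{N + \sigma^2/\sigma_0^2},

so for \sigma_0^2 \gg \sigma^2 the prior is negligible and the estimate approaches the ML (sample-mean) estimate, while for \sigma_0^2 \to 0 it tends to zero, consistent with the remarks above.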

39
Recursive Bayesian Incremental Learning
  • We have seen that the posterior is proportional
    to the likelihood times the prior. Let us define
    the data set seen so far as Dn = {x1, ..., xn}.
    Then
  • Substituting into Bayes' formula, we have
  • Finally,
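With D^n = \{x_1, \dots, x_n\} and D^0 the empty set, the recursion presumably reads

    p(\theta|D^n) = \frac{p(x_n|\theta)\, p(\theta|D^{n-1})}{\int p(x_n|\theta)\, p(\theta|D^{n-1})\, d\theta}, \qquad p(\theta|D^0) = p(\theta).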

40
Recursive Bayesian Incremental Learning
  • Repeated use of this equation produces the
    sequence of densities p(θ), p(θ|x1),
    p(θ|x1, x2), and so on.
  • This is called the recursive Bayes approach to
    parameter estimation (also incremental or
    on-line learning).
  • When this sequence of densities converges to a
    Dirac delta function centered about the true
    parameter value, we have Bayesian learning.

41
Maximum Likelihood vs. Bayesian
  • ML and Bayesian estimation are asymptotically
    equivalent and consistent: they yield the same
    class-conditional densities when the size of the
    training data grows to infinity.
  • ML is typically computationally easier: in ML we
    need to do (multidimensional) differentiation,
    while in Bayesian estimation we need
    (multidimensional) integration.
  • ML is often easier to interpret: it returns the
    single best model (parameter), whereas Bayesian
    estimation gives a weighted average of models.
  • But for finite training data (and given a
    reliable prior) Bayesian estimation is more
    accurate (it uses more of the information).
  • Bayesian estimation with a flat prior is
    essentially ML; with asymmetric and broad priors
    the two methods lead to different solutions.

42
Problems of Dimensionality: Accuracy, Dimension,
and Training Sample Size
  • Consider two-class multivariate normal
    distributions with the same covariance. If the
    priors are equal, then the Bayes error rate is
    given by
  • where r² is the squared Mahalanobis distance.
  • Thus the probability of error decreases as r
    increases. In the conditionally independent case,
    the formulas are given below.
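The missing formulas are presumably the standard ones:

    P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du, \qquad
    r^2 = (\mu_1-\mu_2)^T \Sigma^{-1} (\mu_1-\mu_2),

and in the conditionally independent case, \Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2), so that

    r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1}-\mu_{i2}}{\sigma_i}\right)^2.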

43
Problems of Dimensionality
  • While classification accuracy can improve with
    growing dimensionality (and a growing amount of
    training data),
  • beyond a certain point, the inclusion of
    additional features leads to worse rather than
    better performance:
  • computational complexity grows;
  • the problem of overfitting arises.

44
Occam's Razor
  • "Pluralitas non est ponenda sine neccesitate" or
    "plurality should not be posited without
    necessity." The words are those of the medieval
    English philosopher and Franciscan monk William
    of Occam (ca. 1285-1349).
  • Decisions based on overly complex models often
    lead to lower accuracy of the classifier.

45
Outline
  • Nonparametric Techniques
  • Density Estimation
  • Histogram Approach
  • Parzen-window method
  • Kn-Nearest-Neighbor Estimation
  • Component Analysis and Discriminants
  • Principal Components Analysis
  • Fisher Linear Discriminant
  • MDA

46
NONPARAMETRIC TECHNIQUES
  • So far, we treated supervised learning under the
    assumption that the forms of the underlying
    density functions were known.
  • The common parametric forms rarely fit the
    densities actually encountered in practice.
  • Classical parametric densities are unimodal,
    whereas many practical problems involve
    multimodal densities.
  • We examine nonparametric procedures that can be
    used with arbitrary distributions and without the
    assumption that the forms of the underlying
    densities are known.

47
NONPARAMETRIC TECHNIQUES
  • There are several types of nonparametric methods:
  • Procedures for estimating the density functions
    from sample patterns. If these estimates are
    satisfactory, they can be substituted for the
    true densities when designing the classifier.
  • Procedures for directly estimating the a
    posteriori probabilities.
  • Nearest-neighbor rules, which bypass probability
    estimation and go directly to decision functions.

48
Histogram Approach
  • The conceptually simplest method of estimating a
    p.d.f. is the histogram. The range of each
    component xs of the vector x is divided into a
    fixed number m of equal intervals. The resulting
    boxes (bins) of identical volume V are then
    inspected, and the number of points falling into
    each bin is counted.
  • Suppose that we have ni samples xj, j = 1, ..., ni,
    from class wi.
  • Let the number of sample points in the j-th bin,
    bj, be kj. The histogram estimate of the density
    function

49
Histogram Approach
  • is defined as
  • The estimate is constant over every bin bj.
  • Let us verify that it is a density function:
  • We can choose the number m of bins and their
    starting points. The choice of starting points is
    not critical, but m is important: it plays the
    role of a smoothing parameter. Too large an m
    makes the histogram spiky; with too small an m we
    lose the true form of the density function (see
    the sketch below).
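A minimal NumPy illustration of the histogram estimator (names and data are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(1)
    # One-dimensional data from a mixture of two Gaussians, as in the example slides.
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

    m = 7                                    # number of bins (smoothing parameter)
    counts, edges = np.histogram(x, bins=m)  # k_j for each bin b_j
    V = edges[1] - edges[0]                  # bin width (the "volume" in 1-D)
    p_hat = counts / (len(x) * V)            # histogram density estimate, constant per bin

    print(p_hat.sum() * V)                   # integrates to 1 over the data range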

50
The Histogram Method: Example
  • Assume (one dimensional) data
  • Some points were sampled from a combination of
    two Gaussians
  • 3 bins

51
The Histogram Method: Example
  • 7 bins
  • 11 bins

52
Histogram Approach
  • The histogram p.d.f. estimator is very efficient:
    it can be computed online, since all we have to
    do is update the counters kj at run time, so we
    do not need to keep all the data, which could be
    huge.
  • But its usefulness is limited to low-dimensional
    vectors x, because the number of bins, Nb, grows
    exponentially with the dimensionality d.
  • This is the so-called curse of dimensionality.

53
DENSITY ESTIMATION
  • To estimate the density at x, we form a sequence
    of regions R1, R2, ...
  • The probability for x to fall into R is
  • Suppose we have n i.i.d. samples x1, ..., xn
    drawn according to p(x). The probability that k
    of them fall in R is
  • and the expected value of k is E[k] = nP, with
    variance nP(1 − P).
  • The fraction of samples which fall into R, k/n,
    is also a random variable, for which E[k/n] = P.
  • As n grows, the variance becomes smaller and k/n
    becomes a better estimator for P.

54
DENSITY ESTIMATION
  • Pk peaks sharply about the mean, so
  • k/n is a good estimate of P.
  • For a small enough R, P ≈ p(x)V,
  • where x is within R and V is the volume enclosed
    by R.
  • Thus
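Combining the two relations, the estimate that presumably follows is

    p(x) \approx \frac{k/n}{V}.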

55
Three Conditions for DENSITY ESTIMATION
  • Let us take a growing sequence of samples,
    n = 1, 2, 3, ...
  • We take regions Rn with decreasing volumes
    V1 > V2 > V3 > ...
  • Let kn be the number of samples falling in Rn.
  • Let pn(x) be the n-th estimate of p(x).
  • If pn(x) is to converge to p(x), three conditions
    are required:
  • Vn must approach 0: resolution as high as
    possible (to reduce smoothing);
  • kn must approach infinity: otherwise the region
    Rn would not contain an infinite number of
    points, k/n would not converge to P, and we would
    get pn(x) = 0;
  • kn/n must approach 0: to guarantee convergence of
    the estimate.

56
Parzen Window and KNN
  • How to obtain the sequence R1, R2, ...?
  • There are two common approaches to obtaining
    sequences of regions that satisfy the above
    conditions:
  • Shrink an initial region by specifying the volume
    Vn as some function of n, such as Vn = 1/√n, and
    show that kn and kn/n behave properly, i.e., that
    pn(x) converges to p(x).
  • This is the Parzen-window (or kernel) method.
  • Specify kn as some function of n, such as
    kn = √n. Here the volume Vn is grown until it
    encloses kn neighbors of x.
  • This is the kn-nearest-neighbor method.
  • Both of these methods converge, although it is
    difficult to make meaningful statements about
    their finite-sample behavior.

57
PARZEN WINDOWS
  • Assume that the region Rn is a d-dimensional
    hypercube.
  • If hn is the length of an edge of that hypercube,
    then its volume is given by Vn = hn^d.
  • Define the following window function:
  • φ(u) defines a unit hypercube centered at the
    origin.
  • φ((x − xi)/hn) = 1 if xi falls within the
    hypercube of volume Vn centered at x, and is zero
    otherwise.
  • The number of samples in this hypercube is given
    by
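In the usual notation (presumably what the slide images showed),

    \varphi(u) = \begin{cases} 1 & |u_j| \le 1/2,\ j = 1,\dots,d \\ 0 & \text{otherwise} \end{cases}, \qquad
    k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x-x_i}{h_n}\right),

which leads to the Parzen estimate p_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{V_n}\,\varphi\!\left(\frac{x-x_i}{h_n}\right).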

58
PARZEN WINDOWS cont.
  • Since
  • Rather than limiting ourselves to the hypercube
    window, we can use a more general class of window
    functions. Thus pn(x) is an average of functions
    of x and the samples xi.
  • The window function is being used for
    interpolation: each sample contributes to the
    estimate in accordance with its distance from x.
  • pn(x) must
  • be nonnegative and
  • integrate to 1.

59
PARZEN WINDOWS cont.
  • This can be assured by requiring the window
    function itself to be a density function, i.e.,
  • Effect of the window size hn on pn(x):
  • Define the function
  • then we can write pn(x) as the average
  • Since Vn = hn^d, hn affects both the amplitude
    and the width of the window function.

60
PARZEN WINDOWS cont.
  • Examples of two-dimensional circularly symmetric
    normal Parzen windows for three different values
    of h.
  • If hn is very large, the amplitude of the window
    is small, and x must be far from xi before the
    window value changes much from its value at
    x = xi.

61
PARZEN WINDOWS cont.
  • In this case pn(x) is the superposition of n
    broad, slowly varying functions and is a very
    smooth, "out-of-focus" estimate of p(x).
  • If hn is very small, the peak value of the window
    is large and occurs near x = xi.
  • In this case, pn(x) is the superposition of n
    sharp pulses centered at the samples: an erratic,
    "noisy" estimate.
  • As hn approaches zero, the window approaches a
    Dirac delta function centered at xi, and pn(x)
    approaches a superposition of delta functions
    centered at the samples.

62
PARZEN WINDOWS cont.
  • Three Parzen-window density estimates based on
    the same set of 5 samples, using the windows from
    the previous figure.
  • The choice of hn (or Vn) has an important effect
    on pn(x).
  • If Vn is too large, the estimate will suffer from
    too little resolution.
  • If Vn is too small, the estimate will suffer from
    too much statistical variability.
  • If there is a limited number of samples, one
    should seek some acceptable compromise.

63
PARZEN WINDOWS cont.
  • If we have an unlimited number of samples, we can
    let Vn slowly approach zero as n increases and
    have pn(x) converge to the unknown density p(x).
  • Examples:
  • Example 1: p(x) is a zero-mean, unit-variance,
    univariate normal density. Let the window
    function be of the same form.
  • Let hn = h1/√n, where h1 is a parameter.
  • pn(x) is an average of normal densities centered
    at the samples (see the sketch below).
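A short NumPy sketch of this example, using a Gaussian window with hn = h1/√n (illustrative, not the original demo code):

    import numpy as np

    def parzen_estimate(x_grid, samples, h1=1.0):
        """Parzen-window estimate with a univariate Gaussian kernel and h_n = h1/sqrt(n)."""
        n = len(samples)
        hn = h1 / np.sqrt(n)
        # Average of n normal densities of width hn centered at the samples.
        u = (x_grid[:, None] - samples[None, :]) / hn
        return np.exp(-0.5 * u**2).sum(axis=1) / (n * hn * np.sqrt(2.0 * np.pi))

    rng = np.random.default_rng(2)
    samples = rng.normal(0.0, 1.0, 100)       # p(x): zero-mean, unit-variance normal
    x_grid = np.linspace(-4.0, 4.0, 9)
    print(parzen_estimate(x_grid, samples))   # rough approximation of the true density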