1
Parameter Estimation
  • Shyh-Kang Jeng
  • Department of Electrical Engineering/
  • Graduate Institute of Communication/
  • Graduate Institute of Networking and Multimedia,
    National Taiwan University

2
Typical Classification Problem
  • Rarely know the complete probabilistic structure
    of the problem
  • Have vague, general knowledge
  • Have a number of design samples or training data
    as representatives of patterns for classification
  • Find some way to use this information to design
    or train the classifier

3
Estimating Probabilities
  • Not difficult to estimate prior probabilities
  • Hard to estimate class-conditional densities
  • Number of available samples always seems too
    small
  • Serious when dimensionality is large

4
Estimating Parameters
  • Problems permit us to parameterize the conditional
    densities
  • Simplifies the problem from one of estimating an
    unknown function to one of estimating the
    parameters
  • e.g., mean vector and covariance matrix for
    multi-variate normal distribution

5
Maximum-Likelihood Estimation
  • View the parameters as quantities whose values
    are fixed but unknown
  • Best estimate is the one that maximizes the
    probability of obtaining the samples actually
    observed
  • Nearly always has good convergence properties as
    the number of samples increases
  • Often simpler than alternative methods (a minimal
    numerical sketch follows)
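
To make "maximize the probability of obtaining the samples actually observed" concrete, here is a minimal Python sketch (the data set, the grid, and the known variance are hypothetical assumptions): the log-likelihood is evaluated over candidate means and the maximizer is read off, which for this Gaussian case coincides with the sample mean.

    import numpy as np

    # Hypothetical i.i.d. samples; sigma assumed known for this illustration.
    samples = np.array([1.2, 0.8, 1.9, 1.4, 1.1])
    sigma = 1.0

    def log_likelihood(mu):
        # log p(D | mu) = sum_k log p(x_k | mu) for a Gaussian with known sigma
        return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                      - (samples - mu)**2 / (2 * sigma**2))

    # Maximize over a grid of candidate parameter values.
    grid = np.linspace(-5, 5, 2001)
    mu_hat = grid[np.argmax([log_likelihood(m) for m in grid])]
    print(mu_hat, samples.mean())   # both close to the sample mean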

6
I. I. D. Random Variables
  • Separate data into D_1, . . ., D_c
  • Samples in D_j are drawn independently according
    to p(x|ω_j)
  • Such samples are independent and identically
    distributed (i.i.d.) random variables
  • Let p(x|ω_j) have a known parametric form,
    determined uniquely by a parameter vector θ_j,
    i.e., p(x|ω_j) = p(x|ω_j, θ_j)

7
Simplification Assumptions
  • Samples in D_i give no information about θ_j, if i
    is not equal to j
  • Can work with each class separately
  • Have c separate problems of the same form
  • Use set D of i.i.d. samples from p(x|θ) to
    estimate unknown parameter vector θ

8
Maximum-likelihood Estimate
9
Maximum-likelihood Estimation
10
A Note
  • The likelihood p(D|θ) as a function of θ is not a
    probability density function of θ
  • Its area over the θ-domain has no significance
  • The likelihood p(D|θ) can be regarded as the
    probability of D for a given θ

11
Analytical Approach
12
MAP Estimators
13
Gaussian Case: Unknown μ
14
Univariate Gaussian Case: Unknown μ and σ²
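
For reference, a minimal sketch of the standard closed-form results for this case (hypothetical data): the ML estimate of μ is the sample mean and the ML estimate of σ² is the average squared deviation about it, with divisor n rather than n-1.

    import numpy as np

    x = np.array([1.2, 0.8, 1.9, 1.4, 1.1])    # hypothetical samples
    n = len(x)

    mu_hat = x.sum() / n                        # ML estimate of the mean
    sigma2_hat = ((x - mu_hat) ** 2).sum() / n  # ML estimate of the variance (biased)
    print(mu_hat, sigma2_hat)
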
15
Multivariate Gaussian Case: Unknown μ and Σ
16
Bias, Absolutely Unbiased, and Asymptotically
Unbiased
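
A quick Monte Carlo illustration of these terms (the sample size, variance, and trial count are arbitrary assumptions): the divisor-n ML variance estimate has expectation ((n-1)/n)σ², so it is biased for finite n but asymptotically unbiased, while the divisor-(n-1) estimate is unbiased.

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials, true_var = 5, 200000, 4.0

    data = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
    mle = data.var(axis=1, ddof=0)        # divisor n   (ML estimate)
    unbiased = data.var(axis=1, ddof=1)   # divisor n-1 (unbiased estimate)

    print(mle.mean())       # ~ (n-1)/n * true_var = 3.2
    print(unbiased.mean())  # ~ true_var = 4.0
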
17
Model Error
  • For a reliable model, the ML classifier can give
    excellent results
  • If the model is wrong, the ML classifier cannot
    get the best results, even for the assumed set of
    models

18
Bayesian Estimation (Bayesian Learning)
  • Answers obtained are in general nearly identical
    to those from maximum-likelihood estimation
  • Basic conceptual difference
  • The parameter vector θ is a random variable
  • Use the training data to convert a distribution
    on this variable into a posterior probability
    density

19
Central Problem
20
Parameter Distribution
  • Assume p(x) has a known parametric form with
    parameter vector θ of unknown value
  • Thus, p(x|θ) is completely known
  • Information about θ prior to observing samples is
    contained in a known prior density p(θ)
  • Observations convert p(θ) to p(θ|D), which
    should be sharply peaked about the true value of θ
    (a minimal sketch for the Gaussian case follows)
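
A minimal sketch of that conversion for the univariate Gaussian case with known variance (the p(μ|D) case of the following slides), assuming a Gaussian prior p(μ) = N(μ_0, σ_0²) and hypothetical data; the conjugate update gives the posterior mean and variance in closed form.

    import numpy as np

    x = np.array([1.2, 0.8, 1.9, 1.4, 1.1])   # hypothetical observations
    sigma2 = 1.0                               # known data variance
    mu0, sigma0_2 = 0.0, 10.0                  # prior p(mu) = N(mu0, sigma0^2)

    n, xbar = len(x), x.mean()
    # Posterior p(mu | D) is Gaussian with:
    mu_n = (n * sigma0_2 * xbar + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n_2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
    print(mu_n, sigma_n_2)   # posterior sharpens around the sample mean as n grows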

21
Parameter Distribution
22
Univariate Gaussian Case: p(μ|D)
23
Reproducing Density
24
Bayesian Learning
25
Dogmatism
26
Univariate Gaussian Case: p(x|D)
27
Multivariate Gaussian Case
28
Multivariate Gaussian Case
29
Multivariate Bayesian Learning
30
General Bayesian Estimation
31
Recursive Bayesian Learning
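
A minimal sketch of the recursion p(θ|D^n) ∝ p(x_n|θ) p(θ|D^(n-1)) on a discretized parameter grid, assuming a hypothetical Gaussian-mean problem with known variance and a flat prior; each new sample re-weights and renormalizes the current posterior.

    import numpy as np

    theta_grid = np.linspace(-5, 5, 1001)    # discretized parameter values
    posterior = np.ones_like(theta_grid)     # start from a flat prior p(theta)
    posterior /= posterior.sum()

    sigma = 1.0                              # known data variance (assumption)
    samples = [1.2, 0.8, 1.9, 1.4, 1.1]      # hypothetical observations

    for x in samples:
        likelihood = np.exp(-(x - theta_grid)**2 / (2 * sigma**2))
        posterior *= likelihood              # p(theta|D^n) from p(theta|D^(n-1))
        posterior /= posterior.sum()         # renormalize

    print(theta_grid[np.argmax(posterior)])  # peaks near the sample mean
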
32
Example 1: Recursive Bayes Learning
33
Example 1: Recursive Bayes Learning
34
Example 1: Bayes vs. ML
35
Identifiability
  • p(x|θ) is identifiable
  • Sequence of posterior densities p(θ|D^n) converges
    to a delta function
  • Only one θ causes p(x|θ) to fit the data
  • In some cases, more than one value of θ may
    yield the same p(x|θ)
  • p(θ|D^n) will peak near all values of θ that
    explain the data
  • Ambiguity is erased in the integration for p(x|D^n),
    which converges to p(x) whether or not p(x|θ) is
    identifiable

36
ML vs. Bayes Methods
  • Computational complexity
  • Interpretability
  • Confidence in prior information
  • Form of the underlying distribution p(x|θ)
  • Results differ when p(θ|D) is broad or
    asymmetric around the estimated θ
  • Bayes methods would exploit such information
    whereas ML would not

37
Classification Errors
  • Bayes or indistinguishability error
  • Model error
  • Estimation error
  • Parameters are estimated from a finite sample
  • Vanishes in the limit of infinite training data
    (ML and Bayes would have the same total
    classification error)

38
Invariance and Non-informative Priors
  • Guidance in creating priors
  • Invariance
  • Translation invariance
  • Scale invariance
  • Non-informative with respect to an invariance
  • Much better than accommodating arbitrary
    transformation in a MAP estimator
  • Of great use in Bayesian estimation

39
Gibbs Algorithm
40
Sufficient Statistics
  • Statistic
  • Any function of samples
  • Sufficient statistic s of samples D
  • s contains all information relevant to
    estimating some parameter θ
  • Definition: p(D|s, θ) is independent of θ
  • If θ can be regarded as a random variable

41
Factorization Theorem
  • A statistic s is sufficient for θ if and only if
    P(D|θ) can be written as the product
  • P(D|θ) = g(s, θ) h(D)
  • for some functions g(.,.) and h(.) (a worked
    univariate example follows)
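
As a worked illustration (a standard univariate case, stated here as an assumption rather than the multivariate example on the next slide): for Gaussian samples with known variance σ² and unknown mean θ,

    P(D|θ) = Π_k (2πσ²)^(-1/2) exp(-(x_k - θ)²/(2σ²))
           = exp( n(θs - θ²/2)/σ² ) · (2πσ²)^(-n/2) exp(-Σ_k x_k²/(2σ²)),

with s = (1/n) Σ_k x_k. Taking g(s, θ) as the first factor and h(D) as the second satisfies the factorization, so the sample mean s is sufficient for θ.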

42
Example: Multivariate Gaussian
43
Proof of Factorization Theorem: The Only-If Part
44
Proof of Factorization Theorem: The If Part
45
Kernel Density
  • Factoring of P(D|θ) into g(s,θ)h(D) is not unique
  • If f(s) is any function, g'(s,θ) = f(s)g(s,θ) and
    h'(D) = h(D)/f(s) are equivalent factors
  • Ambiguity is removed by defining the kernel
    density, which is invariant to such scaling

46
Example: Multivariate Gaussian
47
Kernel Density and Parameter Estimation
  • Maximum-likelihood
  • maximization of g(s,θ)
  • Bayesian
  • If prior knowledge of θ is vague, p(θ) tends to be
    uniform, and p(θ|D) is approximately the same as
    the kernel density
  • If p(x|θ) is identifiable, g(s,θ) peaks sharply
    at some value, and p(θ) is continuous as well as
    non-zero there, then p(θ|D) approaches the kernel
    density

48
Sufficient Statistics for Exponential Family
49
Error Rate and Dimensionality
50
Accuracy and Dimensionality
51
Effects of Additional Features
  • In practice, beyond a certain point, inclusion of
    additional features leads to worse rather than
    better performance
  • Sources of difficulty
  • Wrong models
  • Number of design or training samples is finite
    and thus the distributions are not estimated
    accurately

52
Computational Complexity for Maximum-Likelihood
Estimation
53
Computational Complexity for Classification
54
Approaches for Inadequate Samples
  • Reduce dimensionality
  • Redesign feature extractor
  • Select appropriate subset of features
  • Combine the existing features
  • Pool the available data by assuming all classes
    share the same covariance matrix
  • Look for a better estimate for Σ
  • Use Bayesian estimate and diagonal Σ_0
  • Threshold sample covariance matrix
  • Assume statistical independence

55
Shrinkage (Regularized Discriminant Analysis)
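
A minimal sketch of one common form of shrinkage (an assumption about the exact variant meant on this slide): the sample covariance is shrunk toward a scaled identity, Σ(β) = (1-β)Σ + β (tr Σ / d) I, which keeps the estimate well conditioned when samples are scarce.

    import numpy as np

    def shrink_covariance(X, beta):
        # Sigma(beta) = (1 - beta) * Sigma_hat + beta * (trace(Sigma_hat) / d) * I
        d = X.shape[1]
        sigma_hat = np.cov(X, rowvar=False)    # d x d sample covariance
        return (1 - beta) * sigma_hat + beta * (np.trace(sigma_hat) / d) * np.eye(d)

    # Hypothetical use: few samples relative to dimension, mild shrinkage.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 5))                # 8 samples, 5 features
    sigma_reg = shrink_covariance(X, beta=0.3) # better conditioned than np.cov(X)
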
56
Concept of Overfitting
57
Best Representative Point
58
Projection Along a Line
59
Best Projection to a Line Through the Sample Mean
60
Best Representative Direction
61
Principal Component Analysis (PCA)
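
A minimal PCA sketch (hypothetical data matrix): after centering, the principal components are the leading eigenvectors of the scatter matrix, and projecting onto them gives the best lower-dimensional representation in the sum-of-squared-error sense.

    import numpy as np

    def pca(X, k):
        # Project the rows of X onto the k leading principal components.
        mean = X.mean(axis=0)
        Xc = X - mean                               # center the data
        scatter = Xc.T @ Xc                         # scatter matrix (proportional to covariance)
        eigvals, eigvecs = np.linalg.eigh(scatter)  # ascending eigenvalues
        components = eigvecs[:, ::-1][:, :k]        # k largest-eigenvalue directions
        return Xc @ components, components, mean

    # Hypothetical usage
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 4))
    Y, W, m = pca(X, k=2)                           # 100 x 2 projection
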
62
Concept of Fisher Linear Discriminant
63
Fisher Linear Discriminant Analysis
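
For the two-class case, a minimal sketch of the standard result (hypothetical sample sets X1 and X2): the direction maximizing the ratio of between-class to within-class scatter is w proportional to S_W^(-1)(m_1 - m_2).

    import numpy as np

    def fisher_direction(X1, X2):
        # Two-class Fisher linear discriminant: w ∝ S_W^{-1} (m1 - m2)
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        S1 = (X1 - m1).T @ (X1 - m1)             # class scatter matrices
        S2 = (X2 - m2).T @ (X2 - m2)
        Sw = S1 + S2                             # within-class scatter
        w = np.linalg.solve(Sw, m1 - m2)         # optimal projection direction
        return w / np.linalg.norm(w)

    # Hypothetical usage with two sample classes
    rng = np.random.default_rng(2)
    X1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))
    X2 = rng.normal(loc=[2, 1], scale=1.0, size=(50, 2))
    w = fisher_direction(X1, X2)                 # project samples with X @ w
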
64
Fisher Linear Discriminant Analysis
65
Fisher Linear Discriminant Analysis
66
Fisher Linear Discriminant Analysis for
Multivariate Normal
67
Concept of Multidimensional Discriminant Analysis
68
Multiple Discriminant Analysis
69
Multiple Discriminant Analysis
70
Multiple Discriminant Analysis
71
Multiple Discriminant Analysis
72
Expectation-Maximization (EM)
  • Finding the maximum-likelihood estimate of the
    parameters of an underlying distribution
  • from a given data set when the data is incomplete
    or has missing values
  • Two main applications
  • When the data indeed has missing values
  • When optimizing the likelihood function is
    analytically intractable but when the likelihood
    function can be simplified by assuming the
    existence of and values for additional but
    missing (or hidden) parameters

73
Expectation-Maximization (EM)
  • Full sample D = {x_1, . . ., x_n}
  • x_k = {x_kg, x_kb} (good/observed and bad/missing
    features)
  • Separate individual features into D_g and D_b
  • D is the union of D_g and D_b
  • Form the function

74
Expectation-Maximization (EM)
  • begin initialize θ^0, T, i ← 0
  •   do i ← i + 1
  •     E step: compute Q(θ; θ^i)
  •     M step: θ^(i+1) ← arg max_θ Q(θ; θ^i)
  •   until Q(θ^(i+1); θ^i) - Q(θ^i; θ^(i-1)) ≤ T
  •   return θ ← θ^(i+1)
  • end
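
To make the E and M steps concrete, a minimal sketch for a standard hidden-value problem that differs from the missing-feature example on the following slides: a two-component univariate Gaussian mixture, with hypothetical data, the component labels playing the role of the missing values, and a fixed iteration count in place of the threshold T.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical data from two Gaussians (the labels are the missing values).
    x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

    # Initial parameter guesses
    pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

    def normal_pdf(x, mu, var):
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    for _ in range(50):
        # E step: posterior responsibility of each component for each sample
        r = pi * normal_pdf(x[:, None], mu, var)
        r /= r.sum(axis=1, keepdims=True)
        # M step: re-estimate mixing weights, means, and variances
        Nk = r.sum(axis=0)
        pi = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

    print(pi, mu, var)   # should approach the generating values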

75
Expectation-Maximization (EM)
76
Example 2: 2D Model
77
Example 2: 2D Model
78
Example 2: 2D Model
79
Example 2: 2D Model
80
Generalized Expectation-Maximization (GEM)
  • Instead of maximizing Q(θ; θ^i), we find some
    θ^(i+1) such that
  • Q(θ^(i+1); θ^i) > Q(θ^i; θ^i)
  • and is also guaranteed to converge
  • Convergence will not be as rapid
  • Offers great freedom to choose computationally
    simpler steps
  • e.g., using maximum-likelihood values of the
    unknown features, if they lead to a greater
    likelihood

81
Hidden Markov Model (HMM)
  • Used for problems of making a series of decisions
  • e.g., speech or gesture recognition
  • Problem: states at time t are influenced directly
    by a state at t-1
  • More reference
  • L. R. Rabiner and B.-H. Juang, Fundamentals of
    Speech Recognition, Prentice-Hall, 1993, Chapter
    6.

82
First Order Markov Models
83
First Order Hidden Markov Models
84
Hidden Markov Model Probabilities
85
Hidden Markov Model Computation
  • Evaluation problem
  • Given a_ij and b_jk, determine P(V^T|θ)
    (a forward-pass sketch follows this list)
  • Decoding problem
  • Given V^T, determine the most likely sequence of
    hidden states that leads to V^T
  • Learning problem
  • Given training observations of visible symbols
    and the coarse structure but not the
    probabilities, determine a_ij and b_jk
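
A minimal sketch of the evaluation (forward) pass referenced above, with a hypothetical transition matrix A (a_ij), emission matrix B (b_jk), and an assumed initial state distribution; some formulations instead fix a known start state.

    import numpy as np

    def forward(A, B, pi, obs):
        # HMM forward algorithm: returns P(V^T | theta) for an observation sequence.
        # A[i, j]: transition probability a_ij; B[j, k]: emission probability b_jk;
        # pi[i]: initial state distribution; obs: observed symbol indices.
        alpha = pi * B[:, obs[0]]             # alpha_j(1)
        for v in obs[1:]:
            alpha = (alpha @ A) * B[:, v]     # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] b_jk
        return alpha.sum()                    # P(V^T | theta)

    # Hypothetical two-state, three-symbol model
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    pi = np.array([0.6, 0.4])
    print(forward(A, B, pi, [0, 2, 1]))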

86
Evaluation
87
HMM Forward
88
HMM Forward and Trellis
89
HMM Forward
90
HMM Backward
91
HMM Backward
92
Example 3: Hidden Markov Model
93
Example 3: Hidden Markov Model
94
Example 3: Hidden Markov Model
95
Left-to-Right Models for Speech
96
HMM Decoding
97
Problem of Local Optimization
  • This decoding algorithm depends only on the
    single previous time step, not the full sequence
    (a sketch follows this list)
  • Does not guarantee that the path is indeed
    allowable
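
A minimal sketch of the greedy decoder described above, reusing the hypothetical model from the forward-pass sketch: at each step only the locally most probable state is reported, which is why the resulting path need not be globally optimal or even allowable under A.

    import numpy as np

    def greedy_decode(A, B, pi, obs):
        # At each time t pick the hidden state with the largest forward
        # probability alpha_j(t); each choice looks only at the current step.
        alpha = pi * B[:, obs[0]]
        path = [int(np.argmax(alpha))]
        for v in obs[1:]:
            alpha = (alpha @ A) * B[:, v]
            path.append(int(np.argmax(alpha)))
        return path

    # Hypothetical model reused from the forward-pass sketch
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    pi = np.array([0.6, 0.4])
    print(greedy_decode(A, B, pi, [0, 2, 1]))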

98
HMM Decoding
99
Example 4: HMM Decoding
100
Forward-Backward Algorithm
  • Determines model parameters a_ij and b_jk from an
    ensemble of training samples
  • An instance of a generalized
    expectation-maximization algorithm
  • No known method exists for finding the optimal or
    most likely set of parameters from the data

101
Probability of Transition
102
Improved Estimate for a_ij
103
Improved Estimate for b_jk
104
Forward-Backward Algorithm (Baum-Welch Algorithm)