1
Parametric Methods
2
Learning Objectives
  • Understand what parametric methods are
  • Understand how to learn probabilities from a
    training set
  • Understand regression as a method for prediction

3
Acknowledgements
  • Some of these slides have been adapted from Ethem
    Alpaydin.

4
Terminology
  • A statistic is a value calculated from a given
    sample.
  • Parametric methods assume that the training set
    obeys a known model (e.g., a Gaussian or normal
    distribution)

5
Parametric Estimation
  • X = {x^t}, t = 1, ..., N, where x^t ~ p(x)
  • Parametric estimation:
  • Assume a form for p(x|θ) and estimate θ, its
    sufficient statistics, using X
  • e.g., N(µ, σ²) where θ = (µ, σ²)

6
Maximum Likelihood Estimation
  • Likelihood of θ given the sample X:
    l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)
  • Log likelihood:
    L(θ|X) = log l(θ|X) = Σ_t log p(x^t|θ)
  • Maximum likelihood estimator (MLE):
    θ* = argmax_θ L(θ|X)

7
Examples: Bernoulli/Multinomial
  • Bernoulli: two states, failure/success, x in {0,1}
    P(x) = p_o^x (1 − p_o)^(1−x)
    L(p_o|X) = log ∏_t p_o^(x^t) (1 − p_o)^(1−x^t)
    MLE: p̂_o = Σ_t x^t / N
  • Multinomial: K > 2 states, x_i in {0,1}
    P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i)
    L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)
    MLE: p̂_i = Σ_t x_i^t / N
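A minimal NumPy sketch of both estimators (the sample arrays are hypothetical):

    import numpy as np

    # Bernoulli: the MLE of p_o is just the sample mean of the 0/1 outcomes.
    x = np.array([1, 0, 1, 1, 0, 1])           # hypothetical binary sample
    p_hat = x.sum() / len(x)                   # p_o = sum_t x^t / N

    # Multinomial: with one-hot rows, the MLE of each p_i is the
    # fraction of samples falling in state i.
    X = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1]])                  # hypothetical one-hot sample
    p_hat_multi = X.sum(axis=0) / len(X)       # p_i = sum_t x_i^t / N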

8
Gaussian (Normal) Distribution
  • p(x) = N(µ, σ²):
    p(x) = 1/√(2πσ²) · exp(−(x − µ)² / (2σ²))
  • MLE for µ and σ²:
    m = Σ_t x^t / N
    s² = Σ_t (x^t − m)² / N
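A short NumPy sketch of the two estimators above on a synthetic sample (the true µ = 2.0 and σ = 1.5 are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic Gaussian sample

    m  = x.sum() / len(x)                  # MLE of the mean
    s2 = ((x - m) ** 2).sum() / len(x)     # MLE of the variance; note the
                                           # N (not N-1) denominator

Dividing by N rather than N − 1 makes s² a biased estimator, which connects to the next slide.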
9
Bias and Variance
Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i
Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]
Mean square error:
r(d, θ) = E[(d − θ)²]
        = (E[d] − θ)² + E[(d − E[d])²]
        = Bias² + Variance
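A Monte Carlo sketch that checks this decomposition for the sample mean as an estimator d(X) of a Gaussian mean (all constants hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    theta, M, N = 2.0, 10000, 20              # true parameter; M samples of size N

    # d_i = d(X_i): the sample mean of each sample X_i
    d = rng.normal(theta, 1.0, size=(M, N)).mean(axis=1)

    bias     = d.mean() - theta               # E[d] - theta
    variance = d.var()                        # E[(d - E[d])^2]
    mse      = ((d - theta) ** 2).mean()      # approx. bias^2 + variance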
10
Bayes Estimator
  • Treat θ as a random variable with prior p(θ)
  • Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
  • Full: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
  • Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
  • Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)
  • Bayes: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ

11
Bayes Estimator Example
  • x^t ~ N(θ, σ_o²) and prior θ ~ N(µ, σ²)
  • θ_ML = m = Σ_t x^t / N
  • θ_MAP = θ_Bayes =
    (N/σ_o²) / (N/σ_o² + 1/σ²) · m + (1/σ²) / (N/σ_o² + 1/σ²) · µ
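A sketch of this example in NumPy; the prior and noise parameters are hypothetical, and theta_bayes implements the posterior-mean formula above:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma_o, mu, sigma = 1.0, 0.0, 2.0        # hypothetical noise std and prior
    x = rng.normal(1.0, sigma_o, size=50)     # hypothetical sample
    N, m = len(x), x.mean()

    theta_ml = m                              # maximum likelihood estimate
    w = (N / sigma_o**2) / (N / sigma_o**2 + 1 / sigma**2)
    theta_bayes = w * m + (1 - w) * mu        # posterior mean; equals the MAP here

With more data (larger N) the weight w approaches 1 and the estimate moves from the prior mean µ toward the sample mean m.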

12
Bayesian Classification: Why?
  • Probabilistic learning: calculates explicit
    probabilities for hypotheses; among the most
    practical approaches to certain types of learning
    problems
  • Incremental: each training example can
    incrementally increase/decrease the probability
    that a hypothesis is correct. Prior knowledge
    can be combined with observed data.
  • Probabilistic prediction: predicts multiple
    hypotheses, weighted by their probabilities
  • Standard: even when Bayesian methods are
    computationally intractable, they can provide a
    standard of optimal decision making against which
    other methods can be measured

13
Bayes' Theorem
  • Given training data D, the posterior probability
    of a hypothesis h, P(h|D), follows Bayes' theorem:
    P(h|D) = P(D|h) P(h) / P(D)
  • MAP (maximum a posteriori) hypothesis:
    h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)
  • Practical difficulty: requires initial knowledge
    of many probabilities and significant computational
    cost

14
Naïve Bayes Classifier (I)
  • A simplifying assumption: attributes are
    conditionally independent given the class
  • Greatly reduces the computation cost: only count
    the class distribution

15
Naive Bayesian Classifier (II)
  • Given a training set, we can compute the
    probabilities

16
Bayesian classification
  • The classification problem may be formalized
    using a-posteriori probabilities:
  • P(C|X) = probability that the sample tuple
    X = <x_1, ..., x_k> is of class C
  • E.g., P(class = N | outlook = sunny, windy = true, ...)
  • Idea: assign to sample X the class label C such
    that P(C|X) is maximal

17
Estimating a-posteriori probabilities
  • Bayes' theorem:
    P(C|X) = P(X|C) P(C) / P(X)
  • P(X) is constant for all classes
  • P(C) = relative frequency of class C samples
  • C such that P(C|X) is maximum = C such that
    P(X|C) P(C) is maximum
  • Problem: computing P(X|C) is infeasible!

18
Naïve Bayesian Classification
  • Naïve assumption: attribute independence
    P(x_1, ..., x_k|C) = P(x_1|C) · ... · P(x_k|C)
  • If the i-th attribute is categorical: P(x_i|C) is
    estimated as the relative frequency of samples
    having value x_i as the i-th attribute in class C
  • If the i-th attribute is continuous: P(x_i|C) is
    estimated through a Gaussian density function
  • Computationally easy in both cases

19
Play-tennis example: estimating P(x_i|C)
20
Play-tennis example: classifying X
  • An unseen sample X = <rain, hot, high, false>
  • P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
    = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
  • P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
    = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
  • Sample X is classified in class n (don't play)
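A few lines of Python that reproduce the arithmetic above:

    from math import prod

    # Terms from the slide: P(rain|C), P(hot|C), P(high|C), P(false|C), P(C)
    score_p = prod([3/9, 2/9, 3/9, 6/9, 9/14])    # = 0.010582...
    score_n = prod([2/5, 2/5, 4/5, 2/5, 5/14])    # = 0.018286...
    print("n (don't play)" if score_n > score_p else "p (play)")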

21
The independence hypothesis
  • makes computation possible
  • yields optimal classifiers when satisfied
  • but is seldom satisfied in practice, as
    attributes (variables) are often correlated
  • Attempts to overcome this limitation:
  • Bayesian networks, which combine Bayesian
    reasoning with causal relationships between
    attributes
  • Decision trees, which reason on one attribute at
    a time, considering the most important attributes
    first

22
Bayesian Belief Networks (I)
[Figure: a belief network in which FamilyHistory and Smoker
are parents of LungCancer and Emphysema, and LungCancer is in
turn a parent of PositiveXRay and Dyspnea.]

The conditional probability table for the variable LungCancer
(columns are the four (FamilyHistory, Smoker) configurations):

         (FH, S)   (FH, ¬S)   (¬FH, S)   (¬FH, ¬S)
  LC       0.8       0.5        0.7        0.1
  ¬LC      0.2       0.5        0.3        0.9
23
Bayesian Belief Networks (II)
  • A Bayesian belief network allows a subset of the
    variables to be conditionally independent
  • A graphical model of causal relationships
  • Several cases of learning Bayesian belief
    networks:
  • Given both the network structure and all the
    variables: easy
  • Given the network structure but only some variables
  • When the network structure is not known in advance
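A minimal sketch of how the factored representation is used, assuming the LungCancer table above; the joint distribution factorizes as P(x_1, ..., x_n) = ∏_i P(x_i | Parents(x_i)):

    # CPT for LungCancer, indexed by (FamilyHistory, Smoker);
    # values taken from the table on the previous slide.
    P_LC = {(True, True): 0.8, (True, False): 0.5,
            (False, True): 0.7, (False, False): 0.1}

    def p_lung_cancer(lc: bool, fh: bool, s: bool) -> float:
        """P(LungCancer = lc | FamilyHistory = fh, Smoker = s)."""
        p = P_LC[(fh, s)]
        return p if lc else 1.0 - p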

24
Regression
  • Estimator g(x|θ) for the unknown function f(x)
  • Assume r = f(x) + ε with noise ε ~ N(0, σ²), so that
    p(r|x) ~ N(g(x|θ), σ²)
25
Regression: From LogL to Error
  • Under the Gaussian noise model, maximizing the log
    likelihood L(θ|X) is equivalent to minimizing the
    sum of squared errors:
    E(θ|X) = (1/2) Σ_t [r^t − g(x^t|θ)]²
26
Linear Regression
  • Linear regression: Y = α + β X
  • Two parameters, α and β, specify the line and
    are to be estimated using the data at hand,
  • using the least squares criterion on the known
    values Y_1, Y_2, ..., X_1, X_2, ...:
  • β̂ = Σ_i (x_i − avg(x)) (y_i − avg(y)) / Σ_i (x_i − avg(x))²
  • α̂ = avg(y) − β̂ · avg(x)
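The least squares formulas in a few lines of NumPy (the data points are hypothetical):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical X values
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # hypothetical Y values

    beta  = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    alpha = y.mean() - beta * x.mean()
    y_hat = alpha + beta * x                   # fitted line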

27
Other Error Measures
  • Square error: E(θ|X) = Σ_t [r^t − g(x^t|θ)]²
  • Relative square error:
    E(θ|X) = Σ_t [r^t − g(x^t|θ)]² / Σ_t [r^t − r̄]²
  • Absolute error: E(θ|X) = Σ_t |r^t − g(x^t|θ)|
  • ε-sensitive error:
    E(θ|X) = Σ_t 1(|r^t − g(x^t|θ)| > ε) (|r^t − g(x^t|θ)| − ε)
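The four measures as NumPy functions; note that the normalization in relative_square_error (dividing by the squared deviations of r from its mean) is an assumption, since the slide's formula did not survive extraction:

    import numpy as np

    def square_error(r, g):
        return ((r - g) ** 2).sum()

    def relative_square_error(r, g):
        return ((r - g) ** 2).sum() / ((r - r.mean()) ** 2).sum()

    def absolute_error(r, g):
        return np.abs(r - g).sum()

    def eps_sensitive_error(r, g, eps):
        d = np.abs(r - g)
        return ((d > eps) * (d - eps)).sum()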

28
Bias and Variance
Expected square error at x:
E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                   = noise + squared error
Averaging the squared error over samples X:
E_X[(E[r|x] − g(x))² | x]
  = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
  = bias² + variance
29
Estimating Bias and Variance
  • M samples X_i = {x^t_i, r^t_i}, i = 1, ..., M,
  • are used to fit g_i(x), i = 1, ..., M:
    ḡ(x) = (1/M) Σ_i g_i(x)
    Bias²(g) = (1/N) Σ_t [ḡ(x^t) − f(x^t)]²
    Variance(g) = (1/(N·M)) Σ_t Σ_i [g_i(x^t) − ḡ(x^t)]²
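A simulation sketch of these estimates; in practice f is unknown, so this only works on synthetic data where f can be chosen (here f = sin and degree-3 polynomial fits, both hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    f, M, N = np.sin, 100, 25
    x = np.linspace(0, 2 * np.pi, N)

    # Fit g_i on each of M noisy samples taken at the same x^t.
    g = np.empty((M, N))
    for i in range(M):
        r_i = f(x) + rng.normal(0, 0.3, size=N)
        g[i] = np.polyval(np.polyfit(x, r_i, deg=3), x)

    g_bar    = g.mean(axis=0)                  # average fit over the M samples
    bias2    = ((g_bar - f(x)) ** 2).mean()    # squared bias at the x^t
    variance = ((g - g_bar) ** 2).mean()       # spread of the g_i around g_bar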

30
Bias/Variance Dilemma
  • Example: g_i(x) = 2 has no variance and high bias
  • g_i(x) = Σ_t r^t_i / N has lower bias but higher variance
  • As we increase complexity,
  • bias decreases (a better fit to the data) and
  • variance increases (the fit varies more with the data)
  • Bias/variance dilemma (Geman et al., 1992)

31
[Figure: the individual fits g_i scatter around their average g;
bias is the distance between g and the true function f, and
variance is the spread of the g_i around g.]
32
Polynomial Regression
[Figure: polynomial fits of increasing order; the best fit minimizes the error.]
33
[Figure: error vs. polynomial order; the best fit is at the "elbow" of the curve.]
34
Model Selection
  • Cross-validation: measure generalization accuracy
    by testing on data unused during training (see the
    sketch after this list)
  • Regularization: penalize complex models
    E′ = error on data + λ · model complexity
  • Akaike's information criterion (AIC), Bayesian
    information criterion (BIC)
  • Minimum description length (MDL): Kolmogorov
    complexity, shortest description of the data
  • Structural risk minimization (SRM)
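A minimal cross-validation sketch that picks a polynomial order, in the spirit of the first bullet (the data and candidate orders are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 60)
    r = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=60)  # hypothetical data

    # Hold out every other point for validation and pick the order
    # with the lowest validation error (the "elbow" of the error curve).
    train, val = np.arange(0, 60, 2), np.arange(1, 60, 2)
    errors = []
    for k in range(1, 9):
        p = np.polyfit(x[train], r[train], deg=k)
        errors.append(((r[val] - np.polyval(p, x[val])) ** 2).mean())
    best_k = 1 + int(np.argmin(errors))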

35
Bayesian Model Selection
  • Prior on models, p(model)
  • Regularization, when the prior favors simpler models
  • Bayes: MAP of the posterior, p(model|data)
  • Average over a number of models with high
    posterior (voting, ensembles: Chapter 15)