1
Parametric Methods
2
Learning Objectives
  • Understand what parametric methods are
  • Understand how to learn probabilities from a
    training set
  • Understand regression as a method for prediction

3
Acknowledgements
  • Some of these slides have been adapted from Ethem
    Alpaydin.

4
Terminology
  • A statistic is a value calculated from a given
    sample.
  • Parametric methods assume that the training set
    obeys a known model (e.g., a Gaussian or normal
    distribution)

5
Parametric Estimation
  • X = {x^t}, t = 1, ..., N, where x^t ~ p(x)
  • Parametric estimation:
  • Assume a form for p(x|θ) and estimate θ, its
    sufficient statistics, using X
  • e.g., N(µ, σ²) where θ = (µ, σ²)

6
Maximum Likelihood Estimation
  • Likelihood of θ given the sample X:
    l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)
  • Log likelihood:
    L(θ|X) = log l(θ|X) = Σ_t log p(x^t|θ)
  • Maximum likelihood estimator (MLE):
    θ* = argmax_θ L(θ|X)

7
Examples: Bernoulli/Multinomial
  • Bernoulli: two states, failure/success, x in {0,1}
    P(x) = p_o^x (1 − p_o)^(1−x)
    L(p_o|X) = log ∏_t p_o^(x^t) (1 − p_o)^(1−x^t)
    MLE: p̂_o = Σ_t x^t / N
  • Multinomial: K > 2 states, x_i in {0,1}
    P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i)
    L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)
    MLE: p̂_i = Σ_t x_i^t / N
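A minimal NumPy sketch of both estimators (the sample arrays are hypothetical):

    import numpy as np

    # Bernoulli: the MLE of p_o is just the sample mean of the 0/1 outcomes.
    x = np.array([1, 0, 1, 1, 0, 1])           # hypothetical binary sample
    p_hat = x.sum() / len(x)                   # p_o = sum_t x^t / N

    # Multinomial: with one-hot rows, the MLE of each p_i is the
    # fraction of samples falling in state i.
    X = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1]])                  # hypothetical one-hot sample
    p_hat_multi = X.sum(axis=0) / len(X)       # p_i = sum_t x_i^t / N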

8
Gaussian (Normal) Distribution
  • p(x) = N(µ, σ²):
    p(x) = 1/√(2πσ²) · exp(−(x − µ)² / (2σ²))
  • MLE for µ and σ²:
    m = Σ_t x^t / N
    s² = Σ_t (x^t − m)² / N
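A short NumPy sketch of the two estimators above on a synthetic sample (the true µ = 2.0 and σ = 1.5 are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic Gaussian sample

    m  = x.sum() / len(x)                  # MLE of the mean
    s2 = ((x - m) ** 2).sum() / len(x)     # MLE of the variance; note the
                                           # N (not N-1) denominator

Dividing by N rather than N − 1 makes s² a biased estimator, which connects to the next slide.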
9
Bias and Variance
Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i
Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]
Mean square error:
r(d, θ) = E[(d − θ)²]
        = (E[d] − θ)² + E[(d − E[d])²]
        = Bias² + Variance
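A Monte Carlo sketch that checks this decomposition for the sample mean as an estimator d(X) of a Gaussian mean (all constants hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    theta, M, N = 2.0, 10000, 20              # true parameter; M samples of size N

    # d_i = d(X_i): the sample mean of each sample X_i
    d = rng.normal(theta, 1.0, size=(M, N)).mean(axis=1)

    bias     = d.mean() - theta               # E[d] - theta
    variance = d.var()                        # E[(d - E[d])^2]
    mse      = ((d - theta) ** 2).mean()      # approx. bias^2 + variance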
10
Bayes Estimator
  • Treat θ as a random variable with prior p(θ)
  • Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
  • Full: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
  • Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
  • Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)
  • Bayes: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ

11
Bayes Estimator Example
  • x^t ~ N(θ, σ_o²) and prior θ ~ N(µ, σ²)
  • θ_ML = m = Σ_t x^t / N
  • θ_MAP = θ_Bayes =
    (N/σ_o²) / (N/σ_o² + 1/σ²) · m + (1/σ²) / (N/σ_o² + 1/σ²) · µ
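A sketch of this example in NumPy; the prior and noise parameters are hypothetical, and theta_bayes implements the posterior-mean formula above:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma_o, mu, sigma = 1.0, 0.0, 2.0        # hypothetical noise std and prior
    x = rng.normal(1.0, sigma_o, size=50)     # hypothetical sample
    N, m = len(x), x.mean()

    theta_ml = m                              # maximum likelihood estimate
    w = (N / sigma_o**2) / (N / sigma_o**2 + 1 / sigma**2)
    theta_bayes = w * m + (1 - w) * mu        # posterior mean; equals the MAP here

With more data (larger N) the weight w approaches 1 and the estimate moves from the prior mean µ toward the sample mean m.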

12
Bayesian Classification: Why?
  • Probabilistic learning: calculates explicit
    probabilities for hypotheses; among the most
    practical approaches to certain types of learning
    problems
  • Incremental: each training example can
    incrementally increase/decrease the probability
    that a hypothesis is correct. Prior knowledge
    can be combined with observed data.
  • Probabilistic prediction: predicts multiple
    hypotheses, weighted by their probabilities
  • Standard: even when Bayesian methods are
    computationally intractable, they can provide a
    standard of optimal decision making against which
    other methods can be measured

13
Bayes' Theorem
  • Given training data D, the posterior probability
    of a hypothesis h, P(h|D), follows Bayes' theorem:
    P(h|D) = P(D|h) P(h) / P(D)
  • MAP (maximum a posteriori) hypothesis:
    h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)
  • Practical difficulty: requires initial knowledge
    of many probabilities and significant computational
    cost

14
Naïve Bayes Classifier (I)
  • A simplifying assumption: attributes are
    conditionally independent given the class
  • Greatly reduces the computation cost: only count
    the class distribution

15
Naive Bayesian Classifier (II)
  • Given a training set, we can compute the
    probabilities

16
Bayesian classification
  • The classification problem may be formalized
    using a-posteriori probabilities:
  • P(C|X) = probability that the sample tuple
    X = <x_1, ..., x_k> is of class C
  • E.g., P(class = N | outlook = sunny, windy = true, ...)
  • Idea: assign to sample X the class label C such
    that P(C|X) is maximal

17
Estimating a-posteriori probabilities
  • Bayes' theorem:
    P(C|X) = P(X|C) P(C) / P(X)
  • P(X) is constant for all classes
  • P(C) = relative frequency of class C samples
  • C such that P(C|X) is maximum = C such that
    P(X|C) P(C) is maximum
  • Problem: computing P(X|C) is infeasible!

18
Naïve Bayesian Classification
  • Naïve assumption: attribute independence
    P(x_1, ..., x_k|C) = P(x_1|C) · ... · P(x_k|C)
  • If the i-th attribute is categorical: P(x_i|C) is
    estimated as the relative frequency of samples
    having value x_i as the i-th attribute in class C
  • If the i-th attribute is continuous: P(x_i|C) is
    estimated through a Gaussian density function
  • Computationally easy in both cases

19
Play-tennis example: estimating P(x_i|C)
20
Play-tennis example: classifying X
  • An unseen sample X = <rain, hot, high, false>
  • P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
    = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
  • P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
    = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
  • Sample X is classified in class n (don't play)
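A few lines of Python that reproduce the arithmetic above:

    from math import prod

    # Terms from the slide: P(rain|C), P(hot|C), P(high|C), P(false|C), P(C)
    score_p = prod([3/9, 2/9, 3/9, 6/9, 9/14])    # = 0.010582...
    score_n = prod([2/5, 2/5, 4/5, 2/5, 5/14])    # = 0.018286...
    print("n (don't play)" if score_n > score_p else "p (play)")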

21
The independence hypothesis
  • makes computation possible
  • yields optimal classifiers when satisfied
  • but is seldom satisfied in practice, as
    attributes (variables) are often correlated
  • Attempts to overcome this limitation:
  • Bayesian networks, which combine Bayesian
    reasoning with causal relationships between
    attributes
  • Decision trees, which reason on one attribute at
    a time, considering the most important attributes
    first

22
Bayesian Belief Networks (I)
[Figure: a belief network in which FamilyHistory and Smoker
are parents of LungCancer and Emphysema, and LungCancer is in
turn a parent of PositiveXRay and Dyspnea.]

The conditional probability table for the variable LungCancer
(columns are the four (FamilyHistory, Smoker) configurations):

         (FH, S)   (FH, ¬S)   (¬FH, S)   (¬FH, ¬S)
  LC       0.8       0.5        0.7        0.1
  ¬LC      0.2       0.5        0.3        0.9
23
Bayesian Belief Networks (II)
  • A Bayesian belief network allows a subset of the
    variables to be conditionally independent
  • A graphical model of causal relationships
  • Several cases of learning Bayesian belief
    networks:
  • Given both the network structure and all the
    variables: easy
  • Given the network structure but only some variables
  • When the network structure is not known in advance
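A minimal sketch of how the factored representation is used, assuming the LungCancer table above; the joint distribution factorizes as P(x_1, ..., x_n) = ∏_i P(x_i | Parents(x_i)):

    # CPT for LungCancer, indexed by (FamilyHistory, Smoker);
    # values taken from the table on the previous slide.
    P_LC = {(True, True): 0.8, (True, False): 0.5,
            (False, True): 0.7, (False, False): 0.1}

    def p_lung_cancer(lc: bool, fh: bool, s: bool) -> float:
        """P(LungCancer = lc | FamilyHistory = fh, Smoker = s)."""
        p = P_LC[(fh, s)]
        return p if lc else 1.0 - p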

24
Regression
  • Estimator g(x|θ) for the unknown function f(x)
  • Assume r = f(x) + ε with noise ε ~ N(0, σ²), so that
    p(r|x) ~ N(g(x|θ), σ²)
25
Regression: From LogL to Error
  • Under the Gaussian noise model, maximizing the log
    likelihood L(θ|X) is equivalent to minimizing the
    sum of squared errors:
    E(θ|X) = (1/2) Σ_t [r^t − g(x^t|θ)]²
26
Linear Regression
  • Linear regression: Y = α + β X
  • Two parameters, α and β, specify the line and
    are to be estimated using the data at hand,
  • using the least squares criterion on the known
    values Y_1, Y_2, ..., X_1, X_2, ...:
  • β̂ = Σ_i (x_i − avg(x)) (y_i − avg(y)) / Σ_i (x_i − avg(x))²
  • α̂ = avg(y) − β̂ · avg(x)
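The least squares formulas in a few lines of NumPy (the data points are hypothetical):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical X values
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # hypothetical Y values

    beta  = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    alpha = y.mean() - beta * x.mean()
    y_hat = alpha + beta * x                   # fitted line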

27
Other Error Measures
  • Square error: E(θ|X) = Σ_t [r^t − g(x^t|θ)]²
  • Relative square error:
    E(θ|X) = Σ_t [r^t − g(x^t|θ)]² / Σ_t [r^t − r̄]²
  • Absolute error: E(θ|X) = Σ_t |r^t − g(x^t|θ)|
  • ε-sensitive error:
    E(θ|X) = Σ_t 1(|r^t − g(x^t|θ)| > ε) (|r^t − g(x^t|θ)| − ε)
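The four measures as NumPy functions; note that the normalization in relative_square_error (dividing by the squared deviations of r from its mean) is an assumption, since the slide's formula did not survive extraction:

    import numpy as np

    def square_error(r, g):
        return ((r - g) ** 2).sum()

    def relative_square_error(r, g):
        return ((r - g) ** 2).sum() / ((r - r.mean()) ** 2).sum()

    def absolute_error(r, g):
        return np.abs(r - g).sum()

    def eps_sensitive_error(r, g, eps):
        d = np.abs(r - g)
        return ((d > eps) * (d - eps)).sum()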

28
Bias and Variance
Expected square error at x:
E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                   = noise + squared error
Averaging the squared error over samples X:
E_X[(E[r|x] − g(x))² | x]
  = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
  = bias² + variance
29
Estimating Bias and Variance
  • M samples X_i = {x^t_i, r^t_i}, i = 1, ..., M,
  • are used to fit g_i(x), i = 1, ..., M:
    ḡ(x) = (1/M) Σ_i g_i(x)
    Bias²(g) = (1/N) Σ_t [ḡ(x^t) − f(x^t)]²
    Variance(g) = (1/(N·M)) Σ_t Σ_i [g_i(x^t) − ḡ(x^t)]²
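A simulation sketch of these estimates; in practice f is unknown, so this only works on synthetic data where f can be chosen (here f = sin and degree-3 polynomial fits, both hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    f, M, N = np.sin, 100, 25
    x = np.linspace(0, 2 * np.pi, N)

    # Fit g_i on each of M noisy samples taken at the same x^t.
    g = np.empty((M, N))
    for i in range(M):
        r_i = f(x) + rng.normal(0, 0.3, size=N)
        g[i] = np.polyval(np.polyfit(x, r_i, deg=3), x)

    g_bar    = g.mean(axis=0)                  # average fit over the M samples
    bias2    = ((g_bar - f(x)) ** 2).mean()    # squared bias at the x^t
    variance = ((g - g_bar) ** 2).mean()       # spread of the g_i around g_bar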

30
Bias/Variance Dilemma
  • Example: g_i(x) = 2 has no variance and high bias
  • g_i(x) = Σ_t r^t_i / N has lower bias but higher variance
  • As we increase complexity,
  • bias decreases (a better fit to the data) and
  • variance increases (the fit varies more with the data)
  • Bias/variance dilemma (Geman et al., 1992)

31
[Figure: the individual fits g_i scatter around their average g;
bias is the distance between g and the true function f, and
variance is the spread of the g_i around g.]
32
Polynomial Regression
[Figure: polynomial fits of increasing order; the best fit minimizes the error.]
33
[Figure: error vs. polynomial order; the best fit is at the "elbow" of the curve.]
34
Model Selection
  • Cross-validation: measure generalization accuracy
    by testing on data unused during training (see the
    sketch after this list)
  • Regularization: penalize complex models
    E′ = error on data + λ · model complexity
  • Akaike's information criterion (AIC), Bayesian
    information criterion (BIC)
  • Minimum description length (MDL): Kolmogorov
    complexity, shortest description of the data
  • Structural risk minimization (SRM)
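A minimal cross-validation sketch that picks a polynomial order, in the spirit of the first bullet (the data and candidate orders are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 60)
    r = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=60)  # hypothetical data

    # Hold out every other point for validation and pick the order
    # with the lowest validation error (the "elbow" of the error curve).
    train, val = np.arange(0, 60, 2), np.arange(1, 60, 2)
    errors = []
    for k in range(1, 9):
        p = np.polyfit(x[train], r[train], deg=k)
        errors.append(((r[val] - np.polyval(p, x[val])) ** 2).mean())
    best_k = 1 + int(np.argmin(errors))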

35
Bayesian Model Selection
  • Prior on models, p(model)
  • Regularization, when the prior favors simpler models
  • Bayes: MAP of the posterior, p(model|data)
  • Average over a number of models with high
    posterior (voting, ensembles: Chapter 15)