Linear Methods For Classification Chapter 4

Slides: 51

Learn more at: http://www.cs.cmu.edu

1
Linear Methods For Classification Chapter 4

Machine Learning Seminar
Shinjae Yoo
Tal Blum

2
Bayesian Decision Theory

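World states $\omega_j$ (i.e. classes)
Actions $\alpha(x)$ (i.e. classification)
$R(\alpha(x) \mid x)$: risk or cost function
The total risk
$R = \int R(\alpha(x) \mid x)\, p(x)\, dx$
The Bayes Decision Rule
$\alpha(x) = \arg\min_{\alpha_j} R(\alpha_j \mid x) = \arg\min_{\alpha_j} \sum_{k=1}^{c} \lambda(\alpha_j \mid \omega_k)\, P(\omega_k \mid x)$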
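
The Bayes decision rule on this slide can be sketched numerically. The snippet below is a minimal illustration, not part of the original slides: the loss matrix and posterior values are hypothetical, and `bayes_action` is a name introduced here. It computes the conditional risk of every action and picks the smallest.

```python
import numpy as np

def bayes_action(loss, posterior):
    """Return the minimum-risk action index and the conditional risks.

    loss[j, k]   : cost lambda(alpha_j | omega_k) of action j when class k is true
    posterior[k] : P(omega_k | x)
    """
    risk = loss @ posterior          # R(alpha_j | x) for every action j
    return int(np.argmin(risk)), risk

# Hypothetical two-class example with asymmetric costs:
# choosing action 0 when class 1 is true costs 10, the reverse mistake costs 1.
loss = np.array([[0.0, 10.0],
                 [1.0, 0.0]])
posterior = np.array([0.8, 0.2])
action, risk = bayes_action(loss, posterior)
```

With these made-up costs the rule prefers action 1 even though class 0 has the larger posterior, which is the point of weighting posteriors by a loss.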

3
Minimum Error-Rate Classification

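Introduction of the zero-one loss function
$\lambda(\alpha_i \mid \omega_j) = 0$ if $i = j$, and $1$ if $i \neq j$
Therefore, the conditional risk is
$R(\alpha_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$
The risk corresponding to this loss function is the average probability of error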

4
Minimum Error-Rate Classification

Minimizing the risk is equivalent to maximizing $P(\omega_i \mid x)$
Optimal strategy is
Decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ $\forall j \neq i$
For classification, choose the class with the highest posterior probability

5
Discriminant Functions

Denoting $g_i(x) = P(\omega_i \mid x)$
$\forall f$ s.t. $f$ is monotonically increasing, the set $f(g_i(x))$ gives the same classification as the set $g_i(x)$
Examples
$g_c(x) = P(x \mid c) P(c) / \sum_{c'} P(x \mid c') P(c')$
$g_c(x) = P(x \mid c) P(c)$
$g_c(x) = \ln P(x \mid c) + \ln P(c)$
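
A quick numerical check of the claim that monotone transforms leave the classification unchanged. This is a sketch with hypothetical likelihood and prior values (none of these numbers come from the slides); it evaluates the three example discriminants and collects the class each one selects.

```python
import numpy as np

# Hypothetical class-conditional likelihoods P(x | c) and priors P(c) for one x
likelihood = np.array([0.05, 0.20, 0.10])
prior = np.array([0.5, 0.2, 0.3])

g_posterior = likelihood * prior / np.sum(likelihood * prior)  # normalized posterior
g_joint = likelihood * prior                                   # drop the normalizer
g_log = np.log(likelihood) + np.log(prior)                     # take logs

# Each discriminant is a monotone increasing function of the others,
# so the arg max is the same class for all three.
choices = {int(np.argmax(g)) for g in (g_posterior, g_joint, g_log)}
```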

7
Linear Discriminant Functions

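Special case when a monotone function $f(g(x))$ of the discriminant is linear
Example
Logistic regression
$f$ is the log function applied to the odds (the logit transform $\log[p/(1-p)]$)
Decision boundary: the set of points where the log-odds are zero, $\{x : \beta_0 + \beta^T x = 0\}$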

8
Extensions to Linear Discriminant Functions
9
Linear Regression Of An Indicator Matrix

$Y = (Y_1, \ldots, Y_K)$, an $N \times K$ indicator matrix
$Y_{j,k} = 1$ iff $G_j = k$
Can be seen as K separate linear regressions

10
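Linear Regression Of An Indicator Matrix

The algorithm
Compute $\hat{B} = (X^T X)^{-1} X^T Y$
Compute the fitted output $\hat{f}(x) = [(1, x)\hat{B}]^T$, a K vector
Classify to the largest component: $\hat{G}(x) = \arg\max_k \hat{f}_k(x)$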
11
Linear Regression Of An Indicator Matrix

Properties
Is linear regression flexible enough to model $f_i(x)$?
$f_i(x)$ can be negative or bigger than 1
Incorporating more basis elements can help
Gives the same estimate as
$\min_B \sum_{i=1}^{N} \| y_i - [(1, x_i) B]^T \|^2$

12
Masking Effect
14
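Gaussian Discriminant Functions

Multivariate density
Multivariate normal density in d dimensions is
$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^t \Sigma^{-1} (x - \mu)\right]$
where
$x = (x_1, x_2, \ldots, x_d)^t$ (t stands for the transpose vector form)
$\mu = (\mu_1, \mu_2, \ldots, \mu_d)^t$ is the mean vector
$\Sigma$ is the $d \times d$ covariance matrix
$|\Sigma|$ and $\Sigma^{-1}$ are its determinant and inverse respectively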
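
For reference, the log of the d-dimensional normal density can be evaluated directly from its definition. The sketch below assumes the standard form of the density; `mvn_logpdf` is a helper name introduced here. It uses `slogdet` and `solve` rather than an explicit determinant and inverse for numerical stability.

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Log of the multivariate normal density N(x; mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)          # ln |Sigma|
    maha = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^t Sigma^{-1} (x - mu)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

# Standard normal in two dimensions, evaluated at the mean
value_at_mean = mvn_logpdf(np.zeros(2), np.zeros(2), np.eye(2))
```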

15
Gaussian Discriminant Functions (2)

The discriminant function we use
$g_i(x) = \ln P(x \mid \omega_i) + \ln P(\omega_i)$
The parameters are usually not known, so they are estimated from the data

16
LDA Linear Discriminant Analysis

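Case 1: $\Sigma_i$ are equal and $\Sigma_i = \sigma^2 I$
The separating plane is $w^t (x - x_0) = 0$
Where $w = (\mu_i - \mu_j)/\sigma^2$ is the direction of the difference of the means
$x_0$ is given by
$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln\frac{P(\omega_i)}{P(\omega_j)} (\mu_i - \mu_j)$
$x_0$ is on the line joining the means, but not necessarily at its midpoint unless the priors are equal.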
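
A small sketch of Case 1, assuming the standard closed form for the boundary point, $x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln\frac{P(\omega_i)}{P(\omega_j)}(\mu_i - \mu_j)$, which follows from setting $g_i(x) = g_j(x)$ with isotropic covariance. The function name and the numbers are hypothetical.

```python
import numpy as np

def isotropic_boundary(mu_i, mu_j, sigma2, p_i, p_j):
    """Normal vector w and point x0 of the plane w^t (x - x0) = 0."""
    diff = mu_i - mu_j
    w = diff / sigma2
    x0 = 0.5 * (mu_i + mu_j) - sigma2 / (diff @ diff) * np.log(p_i / p_j) * diff
    return w, x0

mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 0.0])

# Equal priors: x0 is the midpoint of the two means
_, x0_equal = isotropic_boundary(mu_i, mu_j, 1.0, 0.5, 0.5)

# Unequal priors: x0 moves away from the more probable class (here class i)
_, x0_skewed = isotropic_boundary(mu_i, mu_j, 1.0, 0.8, 0.2)
```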

17
Case where $\Sigma_i = \sigma^2 I$
18
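LDA - where $\Sigma_i = \Sigma$

Case 2: $\Sigma_i$ are equal ($\Sigma_i = \Sigma$)
The separating plane is $w^t (x - x_0) = 0$
Where $w = \Sigma^{-1}(\mu_i - \mu_j)$
$w$ is generally not parallel to $\mu_i - \mu_j$, so the separating plane is generally not orthogonal to the line joining the means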

20
Connection Between LDA and Multiple Linear
Regression

For a 2-class problem, the directions of the decision boundaries are the same, but unless $N_1 = N_2$ the intercepts are different.
While both estimate discriminant functions and produce the same type of linear boundary, linear regression is a discriminative method while LDA is a generative one.
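
The two-class claim about directions can be checked numerically. The sketch below uses simulated data with unequal class sizes (all values hypothetical) and compares the least-squares coefficient vector from a 0/1 coded response with the LDA direction computed from the pooled within-class covariance. It does not compare the intercepts.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 1.2], [1.2, 1.0]])     # shared covariance (assumed)
L = np.linalg.cholesky(Sigma)
X0 = rng.standard_normal((40, 2)) @ L.T                           # class 0, N1 = 40
X1 = rng.standard_normal((100, 2)) @ L.T + np.array([2.0, 1.0])   # class 1, N2 = 100
X = np.vstack([X0, X1])
y = np.r_[np.zeros(len(X0)), np.ones(len(X1))]

# LDA direction: pooled within-class covariance applied to the mean difference
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (len(X) - 2)
w_lda = np.linalg.solve(S, m1 - m0)

# Least squares of the 0/1 response on (1, X); drop the intercept
A = np.column_stack([np.ones(len(X)), X])
w_ls = np.linalg.lstsq(A, y, rcond=None)[0][1:]

# Cosine of the angle between the two directions
cosine = w_lda @ w_ls / (np.linalg.norm(w_lda) * np.linalg.norm(w_ls))
```

The cosine comes out as 1 up to rounding, so the two coefficient vectors are proportional even though the class sizes differ.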

22
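QDA: Arbitrary $\Sigma_i$

Case 3: $\Sigma_i$ are arbitrary
The discriminant functions
$g_i(x) = x^t W_i x + w_i^t x + w_{i0}$
Where
$W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$, $w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$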
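
A sketch of the quadratic discriminant for one class, assuming the standard expansion of $\ln P(x \mid \omega_i) + \ln P(\omega_i)$ for a Gaussian class density with the constant $-\frac{d}{2}\ln 2\pi$ dropped: $g_i(x) = x^t W_i x + w_i^t x + w_{i0}$ with $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^t\Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$. The function name is introduced here for illustration.

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, prior):
    """Quadratic discriminant g_i(x) for a Gaussian class with its own covariance."""
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv                              # quadratic term
    w = Sigma_inv @ mu                                # linear term
    w0 = (-0.5 * mu @ Sigma_inv @ mu
          - 0.5 * np.linalg.slogdet(Sigma)[1]
          + np.log(prior))                            # constant term
    return x @ W @ x + w @ x + w0
```

To classify a point, evaluate this function once per class with that class's mean, covariance and prior, and take the largest value.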

25
Comparing Extended LDA and QDA
26
What do we use, LDA or QDA?

QDA is more expressive, but requires more parameters.
Number of parameters
LDA: $(K-1)(p+1)$
QDA: $(K-1)\{p(p+3)/2 + 1\}$
Both perform very well on many tasks.
Usually this is because the data do not support more complex boundaries.
If the data are not Gaussian, cross-validation (CV) on the cutoffs may help.
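
To make the parameter counts concrete, here is a small sketch that evaluates the two formulas, $(K-1)(p+1)$ for LDA and $(K-1)\{p(p+3)/2 + 1\}$ for QDA. The function names are introduced here, and the example sizes (K = 11 classes, p = 10 inputs) are chosen only for illustration.

```python
def lda_param_count(K, p):
    """Number of parameters in the LDA decision boundaries: (K-1)(p+1)."""
    return (K - 1) * (p + 1)

def qda_param_count(K, p):
    """Number of parameters for QDA: (K-1){p(p+3)/2 + 1}."""
    # p(p+3) is always even, so integer division is exact
    return (K - 1) * (p * (p + 3) // 2 + 1)

lda_n = lda_param_count(11, 10)
qda_n = qda_param_count(11, 10)
```

For these sizes the QDA count is six times the LDA count, which shows how quickly the quadratic term grows with p.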

27
Regularized Discriminant Analysis

Is there a middle way between LDA and QDA?

28
Regularized Discriminant Analysis
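
One common form of the compromise, as described in the textbook chapter these slides follow, shrinks each class covariance toward the pooled covariance: $\hat{\Sigma}_k(\alpha) = \alpha\hat{\Sigma}_k + (1-\alpha)\hat{\Sigma}$ with $\alpha \in [0, 1]$. The sketch below only illustrates that blend; the function name and matrices are hypothetical, and in practice $\alpha$ would be chosen on validation data.

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma_pooled, alpha):
    """Blend a class covariance with the pooled one.

    alpha = 1 gives the QDA covariance, alpha = 0 gives the LDA covariance.
    """
    return alpha * Sigma_k + (1.0 - alpha) * Sigma_pooled

Sigma_k = np.array([[3.0, 0.5], [0.5, 1.0]])       # hypothetical class covariance
Sigma_pooled = np.array([[2.0, 0.0], [0.0, 2.0]])  # hypothetical pooled covariance
halfway = rda_covariance(Sigma_k, Sigma_pooled, 0.5)
```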
29
Computation of LDA

Compute the eigendecomposition $\hat{\Sigma} = U D U^T$
Project X by $X^* \leftarrow D^{-1/2} U^T X$
Classify to the closest centroid in the transformed space, modulo the prior probabilities $\pi_i$.
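
The steps above can be sketched as follows. This is a minimal illustration assuming a known common covariance matrix and data stored as rows; the function names and example values are introduced here and are not from the slides. The "modulo the priors" step is implemented by subtracting the log prior from half the squared distance.

```python
import numpy as np

def sphere(X, Sigma):
    """Map each row x to D^{-1/2} U^T x, where Sigma = U D U^T."""
    D, U = np.linalg.eigh(Sigma)
    return X @ U / np.sqrt(D)

def classify_sphered(Xs, centroids_s, priors):
    """Nearest centroid in the sphered space, adjusted by the log priors."""
    d2 = ((Xs[:, None, :] - centroids_s[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(0.5 * d2 - np.log(priors), axis=1)

Sigma = np.array([[2.0, 1.2], [1.2, 1.0]])       # hypothetical common covariance
centroids = np.array([[0.0, 0.0], [3.0, 1.0]])   # hypothetical class means
points = np.array([[0.1, 0.0], [2.9, 1.0]])

labels = classify_sphered(sphere(points, Sigma),
                          sphere(centroids, Sigma),
                          np.array([0.5, 0.5]))
```

After sphering, the common covariance becomes the identity, so ordinary Euclidean distance in the new coordinates equals Mahalanobis distance in the original ones.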

30
Matrix factorization!
Using matrix factorization, we hope to reduce the dimensionality (compress)
[Figure: numeric example of a matrix X and factor matrices W and C; the entries are not recoverable from the transcript]
31
SVD
32
Reduced Rank LDA

LDA can be computed by projecting the points into a K-1 dimensional space and computing distances there
Reduced rank PCA minimizes the reconstruction error
Reduced rank LDA is about finding an orthogonal set of vectors that maximize the Rayleigh quotient $\frac{a^T B a}{a^T W a}$
W: the within-class covariance, a sum of the class covariance matrices
B: the between-class covariance, the covariance matrix of the centroids of X
$W + B = T = X^T X$
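
A sketch of the decomposition and of the discriminant directions, written with unnormalized scatter matrices on simulated data. Several things here are assumptions: the data are synthetic, B is weighted by class size so that W + B = T holds exactly for centered data, and the Rayleigh quotient is taken as $a^T B a / a^T W a$. The directions are found by whitening with $W^{-1/2}$ and then taking eigenvectors of the whitened B.

```python
import numpy as np

def scatter_matrices(X, g, K):
    """Within-class (W), between-class (B) and total (T) scatter matrices."""
    m = X.mean(axis=0)
    p = X.shape[1]
    W, B = np.zeros((p, p)), np.zeros((p, p))
    for k in range(K):
        Xk = X[g == k]
        mk = Xk.mean(axis=0)
        W += (Xk - mk).T @ (Xk - mk)
        B += len(Xk) * np.outer(mk - m, mk - m)
    T = (X - m).T @ (X - m)
    return W, B, T

def discriminant_directions(W, B):
    """Directions maximizing a^T B a / a^T W a, sorted by decreasing quotient."""
    D, U = np.linalg.eigh(W)
    W_inv_half = (U / np.sqrt(D)) @ U.T            # symmetric W^{-1/2}
    evals, V_star = np.linalg.eigh(W_inv_half @ B @ W_inv_half)
    order = np.argsort(evals)[::-1]
    return evals[order], W_inv_half @ V_star[:, order]

rng = np.random.default_rng(0)
g = np.repeat(np.arange(3), 30)                    # K = 3 classes, 30 points each
means = np.array([[0.0, 0, 0, 0], [3.0, 0, 0, 0], [0.0, 3, 0, 0]])
X = rng.standard_normal((90, 4)) + means[g]

W, B, T = scatter_matrices(X, g, 3)
evals, V = discriminant_directions(W, B)
v1 = V[:, 0]
top_quotient = (v1 @ B @ v1) / (v1 @ W @ v1)
```

With K = 3 classes only the first K - 1 = 2 quotients are nonzero, which is why the projection can stop at K - 1 dimensions.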

33
Reduced Rank LDA

Helps to visualize high dimensional data
The reduction in dimension is usually done by
ordering the vectors and then choosing the first
M vectors
LDA is also used just as a dimension reduction
method together with other classification methods
such as Nearest Neighbor.
It is equivalent to projecting the vectors and
their class centers into a low dimensional
subspace

34
Reduced LDA
36
Questions

2 Kinds of questions
Clarification / derivation
Performance / understanding

37
Questions

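1) Why is $E(Y_k \mid X = x) = \Pr(G = k \mid X = x)$?
Answer: In general
$E_{Y \mid X = x}(Y) = \sum_y y\, P(Y = y \mid X = x)$
$= 1 \cdot p + 0 \cdot (1 - p)$
$= p = P(Y = 1 \mid X = x)$
In this example $Y_k$ is an indicator variable for the class k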

38
Questions(2)

The relationship to linear regression is that we are modeling $E(Y \mid X = x)$ as a linear function of x.

39
Questions(3)

1) In the last paragraph on p.90, it is said that "the large discrepancy between the training and test error is partly due to the fact that there are many repeat measurements on a small number of individuals". Two questions about this: a) since it is "partly due to", what are the other factors? b) Please explain "many repeat measurements on a small number of individuals".
2) I am curious where the number 30 comes from in the first paragraph on p.105. Can we explain it only by the Gaussian assumption on $f_k(x)$?

40
Questions(4)

1) Eq. 4.3: Y is a matrix of 0's and 1's, with each row having a single 1. If we put more than one 1 in a row, can we extend linear regression to multi-label classification?
2) p.95: the book says we can apply classification after data reduction. However, to me, LDA has already used the labels (Y), and for classification we will use the labels again to estimate the distribution. Does this overfit?
3) What are generative and discriminative models? Is LDA a generative or a discriminative model?

41
Questions(5)

A high level picture that I get from this chapter is that we are trying to come up with models that approximate the conditional expectation $E = \Pr(G = k \mid X = x)$. We start with a linear regression model that tries to approximate E by $\hat{f}_k(x)$, which is rigid and is not a good approximation to the posterior probability (E), as it can be negative or greater than 1. Then we go to the LDA model, where we assume a model for the class densities and use the class-conditional densities to calculate our posterior (E). The posterior obtained is a better approximation of E, but we are also adding bias (the class density assumption), which also reduces variance. Then, using the logistic regression model, we still approximate E better than the linear regression model, and this model is more robust, as it maximizes the likelihood of the training data and makes fewer assumptions than LDA.
How can we extend this picture? What other models can we use that better approximate the conditional expectation $\Pr(G = k \mid X = x)$? One way to extend LDA is to get a better approximation of the class models using unlabeled data, labeling them with a distance metric similar to the one used in the k-NN classifier in ch. 2.

42
Questions(6)

2. Sometimes we can get an estimate of the prior probabilities of the classes using domain knowledge. The LDA framework allows us to use these prior probabilities when calculating the posterior. Can we use prior probabilities in other models, like linear regression and logistic regression?

43
Questions(7)

The chapter has two broad categories: one where we use discriminant functions like $\Pr(G = k \mid X = x)$ and then use these for classification, and the other where we directly model the boundaries between the classes (the hyperplane approach). When should we use which approach? One case is when you know the densities: then you can use LDA, because the optimal hyperplane might use noise as support points. But what if you don't know the density? Another example is that a variant of the hyperplane approach (SVMs) performs better on high dimensional input data. Similarly, are there any other situations where a particular approach is preferred?

44
Questions(8)

1. P.83 describes the masking phenomenon. In figure 4.2, consider the case where the middle class is moved slightly down and to the right. Though it will not be masked, it still gets only a small slice of area, which gives very bad prediction. Hence, to me, the masking phenomenon means that linear regression for multivariate Gaussian data is generally inaccurate. And of course it is inaccurate, because linear regression assumes the probability distribution is linear, not Gaussian. Now, since this assumption no longer holds, why do we go on to a quadratic fit? What is the rationale? Since the data is Gaussian, why not just use LDA or QDA, with sound theoretical inference?
Put another way, the logic looks like this: linear regression may cause masking; a quadratic fit can avoid masking; when K >= 3, use polynomial terms up to degree K-1. My point is that just because a quadratic fit avoids masking does not mean it is a reasonable classifier, especially when the reason for masking is that the model is Gaussian, not linear.

45
Questions(9)

I suppose that in most cases people use LDA / QDA, but in some cases people use polynomial forms of linear regression, guided by heuristics and experience. What is the real situation?
P.90, Regularized Discriminant Analysis: Why do we want a compromise between LDA and QDA? Because of computational tractability? It seems to me that RDA requires more computation, not less.

46
Questions(10)

When does "masking" usually happen with the regression approach?
I was wondering the opposite question. When doing linear regression on an indicator matrix with K >= 3 classes, is there ever a case where K-2 of the classes aren't masked? It seems that with a linear decision boundary you can only ever bisect the space, so you can only decide between two classes (except when augmenting the data with X^2, etc.). Is this correct?
(p.83-84) The book mentions a "loose but general rule" for using polynomial terms in linear regression for classification. Is there a hard rule for the maximal degree of polynomial input that would be required to resolve K separable classes?
- Is all separable data separable by some polynomial? Of bounded degree?
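
The masking questions above can be explored with a tiny experiment. The sketch below is an illustration on constructed data, not a general answer: three classes are placed symmetrically along a line (the centers and spacing are hypothetical), indicator regression is fitted once with x alone and once with x and x^2, and the number of points assigned to the middle class is counted.

```python
import numpy as np

# Three classes on a line, centered at -4, 0 and 4, each spread symmetrically
offsets = np.linspace(-1.0, 1.0, 20)
x = np.concatenate([c + offsets for c in (-4.0, 0.0, 4.0)])
g = np.repeat(np.arange(3), offsets.size)
Y = np.eye(3)[g]                                  # indicator response matrix

def fit_predict(features):
    """Indicator regression on the given feature columns plus an intercept."""
    A = np.column_stack([np.ones(len(x))] + features)
    B, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return (A @ B).argmax(axis=1)

pred_linear = fit_predict([x])
pred_quadratic = fit_predict([x, x ** 2])

n_middle_linear = int((pred_linear == 1).sum())        # middle class is masked
n_middle_quadratic = int((pred_quadratic == 1).sum())  # recovered with x^2
```

In this configuration the fitted function for the middle class is flat, so one of the outer classes always dominates it under the linear fit; adding the squared term lets the middle class win near its own center.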

47
Questions(11)

In logistic regression, it is shown (Eqn 4.26) that fitting is a reweighted least squares problem. Conceptually, what is the role of the reweighting by W? Do other reweighting schemes exist that give better classification performance?
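
As background for this question, the reweighted least squares view can be sketched in code. This is a minimal illustration of the standard Newton / IRLS update for two-class logistic regression on simulated data (the true coefficients, sample size and iteration count are hypothetical), not a production fitting routine. Each step solves a weighted least squares problem with weights p(1 - p) and an adjusted response z.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Fit logistic regression by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(p * (1.0 - p), 1e-10, None)   # diagonal of W
        z = eta + (y - p) / w                     # adjusted response
        # Weighted least squares: (X^T W X) beta = X^T W z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.standard_normal((N, 2))])
true_beta = np.array([0.5, 1.0, -1.0])            # hypothetical coefficients
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = irls_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))
score = X.T @ (y - p_hat)                         # gradient of the log-likelihood
```

The weights are largest where p is near 1/2, so points close to the decision boundary have the most influence on each least squares step.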

48
Questions(12)

(1) In section 4.3.1, regularized discriminant analysis, what is the effect of diagonalizing the covariance estimate $\hat{\Sigma}$? It seems to me to strengthen the independence assumption, so does it really help in practice?
(2) On p.92 the l-th discriminant variable is computed as $Z_l = v_l^T X$, where $v_l = W^{-1/2} v_l^*$. Why does $v_l$ have a $W^{-1/2}$ term instead of a $W^{1/2}$ term?

49
Questions(13)

I'm not sure I immediately see how allowing linear regression onto basis expansions h(X) of the inputs will lead to consistent estimates of the posterior probabilities $\Pr(G = k \mid X = x)$ (p.82). Is there a clearer way of demonstrating this (perhaps this question is better suited to Ch 5)? The text also suggests that these expansions should be adaptively applied as the size of our training set grows, which is also a little ambiguous. Perhaps basis expansions could be applied to the example vowel recognition problem on pp. 84-5 to demonstrate this? I'll look into it.

50
More Questions

1. We learned in Chapter 3 how to do hypothesis testing for linear regression. Could you talk a little bit about that for logistic regression? That is, what are the assumptions, and what are the tests?
2. We have already seen this question twice: where does the number 30 come from anyway?
