Title: Linear Methods For Classification Chapter 4
1Linear Methods For Classification, Chapter 4
- Machine Learning Seminar
- Shinjae Yoo
- Tal Blum
2Bayesian Decision Theory
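- World states $\omega_j$ (i.e. classes)
- Actions $\alpha(x)$ (i.e. classification)
- $R(\alpha(x) \mid x)$: risk or cost function
- The total risk
- $R = \int R(\alpha(x) \mid x)\, p(x)\, dx$
- The Bayes Decision Rule
- $\alpha(x) = \arg\min_{\alpha_j} R(\alpha_j \mid x)$
- $\quad = \arg\min_{\alpha_j} \sum_{k=1}^{c} \lambda(\alpha_j \mid \omega_k)\, P(\omega_k \mid x)$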
3Minimum Error-Rate Classification
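- Introduction of the zero-one loss function
- $\lambda(\alpha_i \mid \omega_j) = 0$ if $i = j$, and $1$ if $i \neq j$
- Therefore, the conditional risk is
- $R(\alpha_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$
- The risk corresponding to this loss function is the average probability of error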
4Minimum Error-Rate Classification
- Minimizing the risk is equivalent to maximizing $P(\omega_i \mid x)$
- Optimal strategy:
- Decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \neq i$
- For classification choose the class with the
highest posterior probability
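The decision rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the posterior values are made up, and the function name `bayes_decision` is hypothetical. It computes the conditional risk of every action from a loss matrix and picks the smallest, which under zero-one loss is the class with the highest posterior.

```python
import numpy as np

def bayes_decision(posteriors, loss):
    """Return the minimum-risk action and all conditional risks.

    posteriors: shape (c,), posterior probability of each class given x.
    loss: shape (a, c), loss[j, k] = cost of action j when the true class is k.
    """
    risks = loss @ posteriors          # conditional risk of each action
    return int(np.argmin(risks)), risks

# Made-up posteriors for a single x (illustration only).
posteriors = np.array([0.2, 0.5, 0.3])

# Zero-one loss: no cost for the correct class, unit cost otherwise.
zero_one = 1.0 - np.eye(3)

action, risks = bayes_decision(posteriors, zero_one)
# Under zero-one loss each risk equals 1 - posterior, so the chosen
# action coincides with the class of highest posterior probability.
```

Swapping `zero_one` for an asymmetric loss matrix changes the chosen action without touching the rest of the code.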
5Discriminant Functions
- Denoting $g_i(x) = P(\omega_i \mid x)$
- $\forall f$ s.t. $f$ is monotonically increasing, the set $f(g_i(x))$ gives the same classification as the set $g_i(x)$
- Examples
- $g_c(x) = P(x \mid c) P(c) / \sum_{c'} P(x \mid c') P(c')$
- $g_c(x) = P(x \mid c) P(c)$
- $g_c(x) = \ln P(x \mid c) + \ln P(c)$
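A small numerical check of the claim that these discriminant functions classify identically. The likelihoods and priors are made-up numbers chosen for illustration; only the comparison of the arg-max labels matters.

```python
import numpy as np

# Made-up class-conditional likelihoods P(x | c) for 2 points and 3 classes.
likelihoods = np.array([[0.10, 0.40, 0.05],
                        [0.30, 0.20, 0.25]])
priors = np.array([0.5, 0.2, 0.3])      # made-up priors P(c)

joint = likelihoods * priors
g_posterior = joint / joint.sum(axis=1, keepdims=True)   # normalized posterior
g_joint = joint                                          # unnormalized posterior
g_log = np.log(likelihoods) + np.log(priors)             # log of the joint

# The per-point normalizer and the log are monotone transformations,
# so all three sets of discriminants pick the same class for each point.
labels = [g.argmax(axis=1) for g in (g_posterior, g_joint, g_log)]
```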
7Linear Discriminant Functions
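- Special case when a monotone function of the boundary, $f(g(x))$, is linear
- Example
- Logistic regression
- $f$ = the log function, applied to the posterior odds (the logit)
- Decision boundary
- $\{x : \beta_0 + \beta^T x = 0\}$, where the log-odds are zero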
8Extensions to Linear Discriminant Functions
9Linear Regression Of An Indicator Matrix
- $Y = (Y_1, \ldots, Y_K)$, an $N \times K$ indicator matrix
- $Y_{j,k} = 1$ iff $G_j = k$
- Can be seen as K separate linear regressions
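10Linear Regression Of An Indicator Matrix
- The algorithm
- Compute $\hat{B} = (X^T X)^{-1} X^T Y$
- $\hat{f}(x) = [(1, x)\hat{B}]^T$, a K vector
- $\hat{G}(x) = \arg\max_k \hat{f}_k(x)$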
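The algorithm can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions: the data are synthetic, the function names are hypothetical, and the coefficient matrix is obtained with a least-squares solver rather than an explicit matrix inverse.

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Least-squares fit of an N x K indicator matrix on (1, X)."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                      # Y[j, k] = 1 iff g[j] == k
    X1 = np.hstack([np.ones((N, 1)), X])          # prepend the intercept column
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)    # K regressions solved at once
    return B

def predict(B, X):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    F = X1 @ B                                    # each row is a K vector of fitted values
    return F.argmax(axis=1), F

# Synthetic data (assumption): three well-separated classes in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
g = np.repeat(np.arange(3), 50)
X = centers[g] + 0.5 * rng.standard_normal((150, 2))

B = fit_indicator_regression(X, g, K=3)
labels, F = predict(B, X)
accuracy = (labels == g).mean()
```

Because the model includes an intercept, each row of `F` sums to 1, even though individual entries may fall outside the interval [0, 1].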
11Linear Regression Of An Indicator Matrix
- Properties
- Is linear regression flexible enough to model $f_i(x)$?
- $f_i(x)$ can be negative or bigger than 1
- Incorporating more basis elements can help
- Gives the same estimate as
- $\min_B \sum_{i=1}^{N} \| y_i - [(1, x_i) B]^T \|^2$
12Masking Effect
14Gaussians Discriminant Functions
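- Multivariate density
- Multivariate normal density in d dimensions is
- $p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2}(x-\mu)^t \Sigma^{-1} (x-\mu)\right]$
- where
- $x = (x_1, x_2, \ldots, x_d)^t$ (t stands for the transpose vector form)
- $\mu = (\mu_1, \mu_2, \ldots, \mu_d)^t$: mean vector
- $\Sigma$: $d \times d$ covariance matrix
- $|\Sigma|$ and $\Sigma^{-1}$ are the determinant and inverse, respectively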
15Gaussians Discriminant Functions(2)
- The discriminant function we use
- $g_i(x) = \ln P(x \mid \omega_i) + \ln P(\omega_i)$
- The parameters are usually not known, so they are estimated
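A minimal sketch of that estimation step, assuming labeled training data and plug-in estimates (sample mean, sample covariance, class frequency). The function names and the synthetic data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def estimate_class_parameters(X, g, K):
    """Sample mean, sample covariance and prior for each class."""
    params = []
    for k in range(K):
        Xk = X[g == k]
        params.append((Xk.mean(axis=0), np.cov(Xk, rowvar=False), len(Xk) / len(X)))
    return params

def log_gaussian_density(x, mu, Sigma):
    """Log of the multivariate normal density at x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)           # log of the determinant
    maha = diff @ np.linalg.solve(Sigma, diff)     # squared Mahalanobis distance
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def discriminants(x, params):
    """Log class density plus log prior, for every class."""
    return np.array([log_gaussian_density(x, mu, S) + np.log(pi)
                     for mu, S, pi in params])

# Synthetic data (assumption): two Gaussian classes in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([4, 4], 1.0, size=(100, 2))])
g = np.repeat([0, 1], 100)

params = estimate_class_parameters(X, g, K=2)
label = int(np.argmax(discriminants(np.array([3.5, 4.2]), params)))
```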
16LDA Linear Discriminant Analysis
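- Case 1: $\Sigma_i$ are equal and $\Sigma_i = \sigma^2 I$
- The separating plane is $w^t(x - x_0) = 0$
- Where $w = (\mu_i - \mu_j)/\sigma^2$ is the direction of the difference of the means
- $x_0$ is given by
- $x_0 = \tfrac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln\frac{P(\omega_i)}{P(\omega_j)}\, (\mu_i - \mu_j)$
- $x_0$ is on the line joining the means, but not necessarily in the middle of the line, unless the priors are equal.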
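17Case where $\Sigma_i = \sigma^2 I$
18LDA - where $\Sigma_i = \Sigma$
- Case 2: $\Sigma_i$ are equal ($\Sigma_i = \Sigma$)
- The separating plane is $w^t(x - x_0) = 0$
- Where $w = \Sigma^{-1}(\mu_i - \mu_j)$
- $w$ is generally not parallel to the vector joining the means, so the separating plane is generally not orthogonal to that vector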
20Connection Between LDA and Multiple Linear Regression
- For a 2-class problem, the directions of the decision boundaries are the same, but unless $N_1 = N_2$ the intercepts are different.
- While both estimate discriminant functions and produce the same type of linear boundary, linear regression is a discriminative method while LDA is a generative one.
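The two-class claim about directions can be checked numerically. The sketch below uses synthetic data with unequal class sizes (an assumption made for illustration) and compares the least-squares slope vector with the LDA direction built from the pooled within-class covariance.

```python
import numpy as np

# Synthetic two-class data with unequal class sizes (assumption).
rng = np.random.default_rng(0)
n1, n2 = 40, 120
cov = np.array([[1.0, 0.6], [0.6, 2.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=n1)
X2 = rng.multivariate_normal([2.0, 1.0], cov, size=n2)
X = np.vstack([X1, X2])
y = np.concatenate([np.zeros(n1), np.ones(n2)])    # 0/1 coding of the class

# LDA direction: pooled covariance inverse times the mean difference.
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n1 + n2 - 2)
w_lda = np.linalg.solve(S, mu2 - mu1)

# Least-squares regression of y on (1, x); keep only the slopes.
A = np.hstack([np.ones((len(X), 1)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
w_reg = beta[1:]

# Cosine of the angle between the two directions.
cosine = w_lda @ w_reg / (np.linalg.norm(w_lda) * np.linalg.norm(w_reg))
```

The cosine comes out as 1 up to rounding, so the two normals are parallel. The cut points along that direction are a separate matter and generally differ when the class sizes are unequal.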
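22QDA Arbitrary $\Sigma_i$
- Case 3: $\Sigma_i$ are arbitrary
- The discriminant functions
- $g_i(x) = x^t W_i x + w_i^t x + w_{i0}$
- Where
- $W_i = -\tfrac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$, $w_{i0} = -\tfrac{1}{2}\mu_i^t \Sigma_i^{-1}\mu_i - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$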
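As a sketch of the arbitrary-covariance case, the code below assumes the standard expansion of the Gaussian log density plus log prior into a quadratic form in x (quadratic term, linear term, constant), dropping the class-independent constant. The function names and the numbers are illustrative. It checks that the expanded form agrees with the direct expression.

```python
import numpy as np

def qda_coefficients(mu, Sigma, prior):
    """Quadratic, linear and constant coefficients of the discriminant."""
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    _, logdet = np.linalg.slogdet(Sigma)
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * logdet + np.log(prior)
    return W, w, w0

def g_quadratic(x, W, w, w0):
    return x @ W @ x + w @ x + w0

def g_direct(x, mu, Sigma, prior):
    """Log density plus log prior, without the (2*pi)^(d/2) constant."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * diff @ np.linalg.solve(Sigma, diff) - 0.5 * logdet + np.log(prior)

# Made-up parameters for one class and one query point.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])
x = np.array([0.5, 0.7])
W, w, w0 = qda_coefficients(mu, Sigma, prior=0.4)
```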
25Comparing Extended LDA and QDA
26What do we use, LDA or QDA?
- QDA is more expressive, but requires more parameters.
- Number of parameters
- LDA: $(K-1)(p+1)$
- QDA: $(K-1)\,[p(p+3)/2 + 1]$
- Both perform very well on many tasks.
- Usually because the data does not support more complex boundaries.
- If the data is not Gaussian, CV on the cutoffs may help.
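To get a feel for how quickly the parameter counts diverge, the two formulas can be evaluated directly. The helper names are hypothetical; K is the number of classes and p the number of inputs.

```python
def lda_param_count(K, p):
    """(K - 1) * (p + 1) parameters for the LDA decision boundaries."""
    return (K - 1) * (p + 1)

def qda_param_count(K, p):
    """(K - 1) * (p * (p + 3) / 2 + 1) parameters for QDA."""
    # p * (p + 3) is always even, so integer division is exact here.
    return (K - 1) * (p * (p + 3) // 2 + 1)

# Example: 3 classes, 10 inputs.
counts = (lda_param_count(3, 10), qda_param_count(3, 10))   # (22, 132)
```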
27Regularized Discriminant Analysis
- Is there a middle way between LDA and QDA?
28Regularized Discriminant Analysis
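One such middle way, assuming the convex-combination form of regularized discriminant analysis given in Section 4.3.1 of the textbook, shrinks each class covariance toward the pooled covariance with a tuning parameter alpha in [0, 1]. The function name and the matrices below are illustrative.

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma_pooled, alpha):
    """Shrink a class covariance toward the pooled one; alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * Sigma_k + (1.0 - alpha) * Sigma_pooled

# Made-up covariance matrices for illustration.
Sigma_k = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_pooled = np.eye(2)
Sigma_half = rda_covariance(Sigma_k, Sigma_pooled, alpha=0.5)
```

Setting `alpha=1` recovers the QDA covariance and `alpha=0` recovers the LDA covariance. In practice alpha would be chosen on validation data or by cross-validation.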
29Computation of LDA
- Compute the eigendecomposition of the common covariance estimate, $\hat{\Sigma} = U D U^T$
- Project X by $X^* \leftarrow D^{-1/2} U^T X$
- Classify to the closest centroid in the transformed space, modulo the prior probabilities $\pi_i$.
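These steps can be sketched as follows. The sketch assumes the matrix being decomposed is the pooled within-class covariance, stores observations as rows (so the projection is applied from the right), and handles "modulo the priors" by adding the log prior to minus half the squared distance. Function names and data are illustrative.

```python
import numpy as np

def lda_fit(X, g, K):
    N = X.shape[0]
    priors = np.array([(g == k).mean() for k in range(K)])
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])
    resid = X - means[g]
    Sigma = resid.T @ resid / (N - K)        # pooled within-class covariance
    evals, U = np.linalg.eigh(Sigma)         # Sigma = U diag(evals) U^T
    A = U / np.sqrt(evals)                   # U D^{-1/2}; a row x maps to x @ A
    return A, means @ A, priors

def lda_predict(X, A, centroids_star, priors):
    Xs = X @ A                               # sphered data
    d2 = ((Xs[:, None, :] - centroids_star[None, :, :]) ** 2).sum(axis=2)
    scores = -0.5 * d2 + np.log(priors)      # closest centroid, adjusted by prior
    return scores.argmax(axis=1)

# Synthetic data (assumption): three classes sharing one covariance.
rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.5], [0.5, 1.5]])
centers = np.array([[0.0, 0.0], [4.0, 1.0], [1.0, 5.0]])
g = np.repeat(np.arange(3), 100)
X = centers[g] + rng.multivariate_normal(np.zeros(2), cov, size=300)

A, centroids_star, priors = lda_fit(X, g, K=3)
labels = lda_predict(X, A, centroids_star, priors)
accuracy = (labels == g).mean()
```

After the projection the pooled within-class covariance is the identity, which is why plain Euclidean distance to the centroids is enough.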
30matrix factorization!
Using matrix factorization, we hope to
- Reduce the dimensionality (compress)
[Figure: numeric example matrices labeled X, W and C; entries omitted]
31SVD
32Reduced Rank LDA
- LDA can be computed by projecting the points into a K-1 dimensional space and computing distances there
- Reduced rank PCA minimizes the reconstruction error
- Reduced Rank LDA is about finding an orthogonal set of vectors that maximize the Rayleigh quotient $a^T B a / a^T W a$
- W: the within-class covariance, a sum of the class covariance matrices
- B: the between-class covariance, the covariance matrix of the centroids of X
- $W + B = T = X^T X$
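A sketch of the decomposition and of the leading discriminant directions. It makes two labeled assumptions: X is centered, and W and B are unnormalized scatter (sum-of-squares) matrices with B weighted by class counts, so that W + B = T holds exactly. The directions come from the eigenvectors of the between-class scatter after whitening by W. All names and data are illustrative.

```python
import numpy as np

def scatter_matrices(X, g, K):
    Xc = X - X.mean(axis=0)                     # center the data
    T = Xc.T @ Xc                               # total scatter
    W = np.zeros_like(T)
    B = np.zeros_like(T)
    for k in range(K):
        Xk = Xc[g == k]
        mk = Xk.mean(axis=0)
        W += (Xk - mk).T @ (Xk - mk)            # within-class scatter
        B += len(Xk) * np.outer(mk, mk)         # between-class scatter
    return W, B, T

def discriminant_directions(W, B, M):
    evals, U = np.linalg.eigh(W)
    W_inv_sqrt = U @ np.diag(evals ** -0.5) @ U.T
    lam, V_star = np.linalg.eigh(W_inv_sqrt @ B @ W_inv_sqrt)
    order = np.argsort(lam)[::-1][:M]           # largest eigenvalues first
    return W_inv_sqrt @ V_star[:, order], lam[order]

def rayleigh(a, W, B):
    return (a @ B @ a) / (a @ W @ a)

# Synthetic data (assumption): three classes in 4-D.
rng = np.random.default_rng(2)
centers = rng.normal(0.0, 3.0, size=(3, 4))
g = np.repeat(np.arange(3), 60)
X = centers[g] + rng.standard_normal((180, 4))

W, B, T = scatter_matrices(X, g, K=3)
V, lam = discriminant_directions(W, B, M=2)
Z = (X - X.mean(axis=0)) @ V                    # data in the K-1 = 2 dimensional subspace
```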
33Reduced Rank LDA
- Helps to visualize high dimensional data
- The reduction in dimension is usually done by ordering the vectors and then choosing the first M vectors
- LDA is also used just as a dimension reduction method together with other classification methods such as Nearest Neighbor.
- It is equivalent to projecting the vectors and their class centers into a low dimensional subspace
34Reduced LDA
36Questions
- 2 Kinds of questions
- Clarification / derivation
- Performance / understanding
37Questions
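- 1) Why is $E(Y_k \mid X = x) = \Pr(G = k \mid X = x)$?
- Answer: In general
- $E_{Y \mid X=x}(Y) = \sum_y y\, P(Y = y \mid X = x)$
- $= 1 \cdot p + 0 \cdot (1 - p)$, where $p = P(Y = 1 \mid X = x)$
- $= P(Y = 1 \mid X = x)$
- In this example $Y_k$ is an indicator variable for the class k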
38Questions(2)
- The relationship to Linear Regression is that we are modeling $E(Y \mid X = x)$ as a linear function of x.
39Questions(3)
- 1) Last paragraph on p.90: it's said, "the large discrepancy between the training and test error is partly due to the fact that there are many repeat measurements on a small number of individuals". Two questions about this: a) since it's "partly due to", what are the other factors? b) Give some explanation of "many repeat measurements on a small number of individuals".
- 2) I am curious where the number 30 comes from, in the first paragraph on p.105.
- Can we explain it only by the Gaussian assumption on $f_k(x)$?
40Questions(4)
- 1) Eq. 4.3: Y is a matrix of 0's and 1's, with each row having a single 1. If we put more than one 1 in a row, can we extend linear regression to multi-label classification?
- 2) p.95: the book says we can apply classification after data reduction. However, to me, LDA has already used the labels (Y), and for classification we will use the labels again to estimate the distribution. Is it overfitted?
- 3) What are generative and discriminative models? Is LDA a generative or a discriminative model?
41Questions(5)
- A high-level picture that I get from this chapter is that we are trying to come up with models that approximate the conditional expectation $E = \Pr(G = k \mid X = x)$. We start with a linear regression model that tries to approximate E by fhat_k(x), which is rigid and is not a good approximation to the posterior probability (E), as it could be negative or greater than 1. Then we go to the LDA model, where we assume a model for the class densities and use the conditional class densities to calculate our posterior (E). The posterior obtained is a better approximation of E, but we are also adding bias (the class densities), which also reduces variance. Then, using the logistic regression model, we still approximate E better than the linear regression model, and this model is more robust as it maximizes the likelihood of the training data and it also has fewer assumptions compared to LDA.
- How can we extend this picture? What other models can we use that can better approximate the conditional expectation $\Pr(G = k \mid X = x)$? One way to extend LDA is to get a better approximation of the class models using unlabeled data, labeling them with a distance metric similar to the one used in the k-NN classifier in ch. 2.
42Questions(6)
- 2. Sometimes we can get an estimate of the prior probabilities of the classes using domain knowledge. The LDA framework allows us to use these prior probabilities when calculating the posterior. Can we use prior probabilities in other models, like linear regression and logistic regression?
43Questions(7)
- The chapter has two broad categories: one where we use discriminant functions like $\Pr(G = k \mid X = x)$ and then use them for classification, and the other where we directly model the boundaries between the classes (the hyperplane approach). When should we use which approach? One case is when you know the densities: you can use LDA, because the optimal hyperplane might use noise as support points. But what if you don't know the density? Another example: a variant of the hyperplane approach (SVMs) performs better on high-dimensional input data. Similarly, are there any other situations where a particular approach is preferred?
44Questions(8)
- 1. P.83 describes the masking phenomenon. In figure 4.2, consider the case where the middle class is moved slightly to the lower right. Though it will not be masked, it still gets only a small slice of area, which is a very bad prediction. Hence, to me, the masking phenomenon means that linear regression for multivariate Gaussian data is generally inaccurate. And of course it is inaccurate, because linear regression assumes the probability distribution is linear, not Gaussian. Now, since this assumption no longer holds, why do we go on to a quadratic fit? What is the rationale? Since the data is Gaussian, why not just use LDA or QDA, with sound theoretical inference?
- Put another way, the logic looks like: linear regression may cause masking; a quadratic fit can avoid masking; when K > 3, use polynomial terms up to degree K-1. My point is, just because a quadratic fit avoids masking does not mean it is a reasonable classification, especially when the reason for masking is the model being Gaussian, not linear.
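The masking logic discussed above can be reproduced on synthetic data. The sketch assumes three equally sized classes lined up along a single input, which is the textbook setting for masking. It fits the indicator regression once with a linear basis and once with an added squared term, then counts how often the middle class is predicted.

```python
import numpy as np

# Synthetic data (assumption): three classes lined up on one axis.
rng = np.random.default_rng(0)
n = 100
centers = [-4.0, 0.0, 4.0]
x = np.concatenate([c + 0.5 * rng.standard_normal(n) for c in centers])
g = np.repeat(np.arange(3), n)
Y = np.eye(3)[g]                                  # N x 3 indicator matrix

def fit_predict(H):
    """Regress the indicator matrix on basis H and classify by arg-max."""
    B, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return (H @ B).argmax(axis=1)

H_lin = np.column_stack([np.ones_like(x), x])            # basis (1, x)
H_quad = np.column_stack([np.ones_like(x), x, x ** 2])   # basis (1, x, x^2)

pred_lin = fit_predict(H_lin)
pred_quad = fit_predict(H_quad)

n_mid_lin = int((pred_lin == 1).sum())                   # middle class picked by the linear fit
recall_mid_quad = float((pred_quad[g == 1] == 1).mean()) # middle-class recall with x^2 added
```

With the linear basis the fitted line for the middle class is nearly flat and never dominates, so that class is masked. The squared term lets its fitted function peak in the middle. Whether this is a principled classifier, as opposed to a fix for masking, is exactly the question raised above.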
45Questions(9)
- I suppose in most cases people use LDA / QDA, but in some cases people use polynomial forms of linear regression, guided by heuristics and experience. What is the real situation?
- P.90, Regularized Discriminant Analysis: why do we want a compromise between LDA and QDA? Because of computational tractability? It seems to me that RDA requires more computation, not less.
46Questions(10)
- When does "masking" usually happen with the regression approach?
- I was wondering the opposite question. When doing linear regression on an indicator matrix with K > 3 classes, is there ever a case where K-2 of the classes aren't masked? It seems that with a linear decision boundary you can only ever bisect the space, so you can only decide between two classes (except when augmenting the data with X^2, etc.). Is this correct?
- (p.83-84) The book mentions a "loose but general rule" for using polynomial terms in linear regression for classification. Is there a hard rule for the maximal degree of polynomial input that would be required to resolve K separable classes?
- Is all separable data separable by some polynomial? Of bounded degree?
47Questions(11)
- In logistic regression, it is shown (Eqn 4.26) that fitting is a reweighted least squares problem. Conceptually, what is the role of the reweighting by W? Do other reweighting schemes exist that give better classification performance?
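To make the role of W concrete, here is a minimal sketch of iteratively reweighted least squares for two-class logistic regression. It assumes the standard Newton-step form: weights p(1 - p) on the diagonal of W and an adjusted response z. The data are synthetic and the function name is hypothetical. There is no step-size control, so this is only suitable for well-behaved, non-separable data.

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Fit two-class logistic regression by repeated weighted least squares."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # add intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        eta = X1 @ beta                             # current linear predictor
        p = 1.0 / (1.0 + np.exp(-eta))              # fitted probabilities
        w = p * (1.0 - p)                           # diagonal of W
        z = eta + (y - p) / w                       # adjusted response
        XtW = X1.T * w                              # X^T W without forming W
        beta = np.linalg.solve(XtW @ X1, XtW @ z)   # weighted least-squares step
    return beta

# Synthetic data (assumption): one input, overlapping classes.
rng = np.random.default_rng(3)
x = rng.standard_normal((500, 1))
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x[:, 0])))
y = (rng.random(500) < p_true).astype(float)

beta = logistic_irls(x, y)

# At the maximum-likelihood solution the score X^T (y - p) vanishes.
X1 = np.hstack([np.ones((500, 1)), x])
score = X1.T @ (y - 1.0 / (1.0 + np.exp(-(X1 @ beta))))
```

The weights are largest where p is near 1/2, so each step gives the most influence to observations whose class is most uncertain under the current fit.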
48Questions(12)
- (1) In section 4.3.1, regularized discriminant analysis, what is the effect of diagonalizing $\Sigma$?
- It seems to me that it strengthens the independence assumption, so does it really help in practice?
- (2) On p.92 the l-th discriminant variable is computed as $Z_l = v_l^T X$, where $v_l = W^{-1/2} v_l^*$.
- Why does $v_l$ have a $W^{-1/2}$ term instead of a $W^{1/2}$ term?
49Questions(13)
- I'm not sure I immediately see how allowing
linear regression onto basis expansions h(X) of
the inputs will lead to consistent estimates of
the posterior probabilities $\Pr(G = k \mid X = x)$
(p82). Is there a clearer way of demonstrating
this (perhaps this question is better suited to
Ch 5)? The text also suggests that these
expansions should be adaptively applied as the
size of our training set grows, which is also a
little ambiguous. Perhaps basis expansions could
be applied to the example vowel recognition
problem on pp 84-5 to demonstrate this? I'll look
into it.
50More Questions
- 1. We learned in Chap. 3 how to do hypothesis testing for linear regression. Could you talk a little bit about that for logistic regression? That is, what are the assumptions, and what are the tests?
- 2. Already seen this question twice: where does the 30 come from anyway?