
Linear Methods For Classification, Chapter 4

- Machine Learning Seminar
- Shinjae Yoo
- Tal Blum

Bayesian Decision Theory

- World states ωj (i.e., the classes)
- Actions α(x) (i.e., the classification decisions)
- R(α(x) | x): the risk or cost function
- The total risk
- R = ∫ R(α(x) | x) p(x) dx
- The Bayes decision rule
- α(x) = argmin over the αj of R(αj | x)
- = argmin over the αj of Σ(k=1..c) λ(αj | ωk) P(ωk | x)
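As a concrete illustration of the rule above, here is a minimal numpy sketch; the loss matrix and posterior vector are invented values, not from the slides.

```python
import numpy as np

# A minimal sketch of the Bayes decision rule; lam[j, k] = λ(αj | ωk),
# the loss for taking action αj when the true class is ωk.
lam = np.array([[0.0, 1.0, 2.0],
                [1.0, 0.0, 1.0],
                [2.0, 1.0, 0.0]])
posterior = np.array([0.2, 0.5, 0.3])   # P(ωk | x) for one input x

risk = lam @ posterior                  # R(αj | x) = Σk λ(αj | ωk) P(ωk | x)
print(np.argmin(risk))                  # Bayes rule: take the minimum-risk action

# Under zero-one loss (λ = 1 − I) the rule reduces to picking the class with
# the highest posterior, i.e. minimum error-rate classification.
zero_one = 1.0 - np.eye(3)
assert np.argmin(zero_one @ posterior) == np.argmax(posterior)
```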

Minimum Error-Rate Classification

- Introduce the zero-one loss function: λ(αi | ωj) = 0 if i = j, and 1 otherwise
- The conditional risk is then R(αi | x) = Σ(j≠i) P(ωj | x) = 1 − P(ωi | x)
- The risk corresponding to this loss function is the average probability of error

Minimum Error-Rate Classification

- Minimizing the risk is equivalent to maximizing P(ωi | x)
- Optimal strategy: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i
- For classification, choose the class with the highest posterior probability

Discriminant Functions

- Denote gi(x) = P(ωi | x)
- For every f that is monotonically increasing, the set {f(gi(x))} gives the same classification as the set {gi(x)}
- Examples
- gc(x) = P(x|c)P(c) / Σc' P(x|c')P(c')
- gc(x) = P(x|c)P(c)
- gc(x) = ln P(x|c) + ln P(c)
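A tiny numpy check of this invariance, using made-up likelihoods and priors for three classes:

```python
import numpy as np

# Sketch: monotone transforms of the discriminants leave the decision unchanged.
lik = np.array([0.05, 0.20, 0.10])    # P(x|c) for three classes (illustrative)
prior = np.array([0.5, 0.2, 0.3])     # P(c)

g1 = lik * prior / np.sum(lik * prior)   # posterior P(c|x)
g2 = lik * prior                         # drop the common denominator
g3 = np.log(lik) + np.log(prior)         # take logs (a monotone function)

assert np.argmax(g1) == np.argmax(g2) == np.argmax(g3)
```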


Linear Discriminant Functions

- Special case: some monotone function f(g(x)) of the discriminant is linear
- Example: logistic regression, where f is the log-odds (logit) function
- The decision boundary is then a hyperplane, the set where the linear function vanishes

Extensions to Linear Discriminant Functions

Linear Regression Of An Indicator Matrix

- Y = (Y1, ..., YK): an N×K indicator matrix
- Yj,k = 1 iff Gj = k
- Can be seen as K separate linear regressions

Linear Regression Of An Indicator Matrix

- The algorithm
- Compute B = (XᵀX)⁻¹XᵀY
- For a new input x, compute f(x) = [(1, x)B]ᵀ, a K-vector, and classify to its largest component

Linear Regression Of An Indicator Matrix

- Properties
- Is linear regression flexible enough to model fi(x)?
- fi(x) can be negative or bigger than 1
- Incorporating more basis elements can help
- Gives the same estimate as
- min over B of Σ(i=1..N) ||yi − [(1, xi)B]ᵀ||²
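A minimal numpy sketch of the whole procedure on synthetic data (the data-generating setup is invented for illustration):

```python
import numpy as np

# Sketch of regression on an indicator matrix, on small synthetic data.
rng = np.random.default_rng(0)
N, P, K = 150, 2, 3
G = rng.integers(0, K, size=N)                   # class labels 0..K-1
X = rng.normal(size=(N, P)) + 2.0 * G[:, None]   # class-shifted inputs

Y = np.zeros((N, K))
Y[np.arange(N), G] = 1.0                 # N×K indicator matrix: Y[i, k] = 1 iff Gi = k

X1 = np.hstack([np.ones((N, 1)), X])     # prepend the intercept column
B, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # B = (X'X)^{-1} X'Y, K regressions at once

Ghat = np.argmax(X1 @ B, axis=1)         # f(x) is a K-vector; classify to its maximum
print("training accuracy:", np.mean(Ghat == G))
```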

Masking Effect


Gaussian Discriminant Functions

- Multivariate density
- The multivariate normal density in d dimensions is
- p(x) = (2π)^(−d/2) |Σ|^(−1/2) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
- where
- x = (x1, x2, ..., xd)ᵗ (t stands for the transpose vector form)
- μ = (μ1, μ2, ..., μd)ᵗ: the mean vector
- Σ: the d×d covariance matrix
- |Σ| and Σ⁻¹ are its determinant and inverse, respectively

Gaussian Discriminant Functions (2)

- The discriminant function we use is
- gi(x) = ln P(x | ωi) + ln P(ωi)
- The parameters are usually not known, so they are estimated

LDA: Linear Discriminant Analysis

- Case 1: the Σi are equal and Σi = σ²I
- The separating plane is ω·(x − x0) = 0
- where ω = (μi − μj)/σ² is the direction of the difference of the means
- x0 is given by
- x0 = ½(μi + μj) − (σ² / ||μi − μj||²) ln(P(ωi)/P(ωj)) (μi − μj)
- x0 is on the line joining the means, but not necessarily at its midpoint, unless the priors are equal.
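A short numpy sketch of this boundary, with made-up means, variance, and priors:

```python
import numpy as np

# Sketch of the Case 1 boundary (Σi = σ²I); all parameters are illustrative.
mu_i, mu_j = np.array([1.0, 1.0]), np.array([3.0, 2.0])
sigma2 = 0.5
prior_i, prior_j = 0.7, 0.3

w = (mu_i - mu_j) / sigma2                 # normal to the plane: the mean difference
d2 = np.sum((mu_i - mu_j) ** 2)            # ||μi − μj||²
x0 = 0.5 * (mu_i + mu_j) - (sigma2 / d2) * np.log(prior_i / prior_j) * (mu_i - mu_j)

def decide(x):
    # Decide ωi when ω·(x − x0) > 0, otherwise ωj.
    return "omega_i" if w @ (x - x0) > 0 else "omega_j"

# With unequal priors, x0 slides along the line joining the means,
# away from the more probable class.
print(x0, decide(np.array([2.0, 1.5])))
```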

Case where Σi = σ²I

LDA: where Σi = Σ

- Case 2: the Σi are equal (Σi = Σ)
- The separating plane is ω·(x − x0) = 0
- where ω = Σ⁻¹(μi − μj)
- ω is generally not orthogonal to the vector separating the means
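The same sketch for a shared, non-spherical Σ (parameters again made up):

```python
import numpy as np

# Sketch of the Case 2 direction with a shared, non-spherical Σ.
mu_i, mu_j = np.array([1.0, 1.0]), np.array([3.0, 2.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])

diff = mu_i - mu_j
w = np.linalg.solve(Sigma, diff)          # ω = Σ⁻¹(μi − μj)

# ω is generally not parallel to the mean difference: Σ tilts the boundary.
cosine = w @ diff / (np.linalg.norm(w) * np.linalg.norm(diff))
print(w, cosine)                          # cosine < 1 here
```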


Connection Between LDA and Multiple Linear Regression

- For a 2-class problem, the directions of the decision boundaries are the same, but unless N1 = N2 the intercepts are different.
- While both estimate discriminant functions and produce the same type of linear boundaries, linear regression is a discriminative method while LDA is a generative one.


QDA: Arbitrary Σi

- Case 3: the Σi are arbitrary
- The discriminant functions are
- gi(x) = −½ ln|Σi| − ½ (x − μi)ᵀ Σi⁻¹ (x − μi) + ln P(ωi)
- where μi, Σi, and P(ωi) are the class mean, covariance, and prior
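A minimal numpy sketch of this discriminant with invented per-class parameters:

```python
import numpy as np

# Sketch of the QDA discriminant; per-class parameters are made up.
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.4, 0.6]

def g(x, mu, Sigma, prior):
    # gi(x) = −½ ln|Σi| − ½ (x − μi)' Σi⁻¹ (x − μi) + ln P(ωi)
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.solve(Sigma, d)
            + np.log(prior))

x = np.array([1.0, 0.5])
scores = [g(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print(np.argmax(scores))   # with distinct Σi the boundary is quadratic in x
```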


Comparing Extended LDA and QDA

What do we use, LDA or QDA?

- QDA is more expressive, but requires more parameters.
- Number of parameters (worked out in the sketch below)
- LDA: (K − 1)(P + 1)
- QDA: (K − 1){P(P + 3)/2 + 1}
- Both perform very well on many tasks, usually because the data does not support more complex boundaries.
- If the data is not Gaussian, cross-validation on the cutoffs may help.
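A worked count for illustrative sizes (K and P chosen arbitrarily):

```python
# Parameter counts for K = 3 classes and P = 10 inputs (illustrative values).
K, P = 3, 10
lda_params = (K - 1) * (P + 1)                  # (K−1)(P+1) = 22
qda_params = (K - 1) * (P * (P + 3) // 2 + 1)   # (K−1){P(P+3)/2 + 1} = 132
print(lda_params, qda_params)                   # QDA grows quadratically in P
```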

Regularized Discriminant Analysis

- Is there a middle way between LDA and QDA?

Regularized Discriminant Analysis

- Shrink each class covariance toward the pooled one: Σk(α) = α Σk + (1 − α) Σ, with α ∈ [0, 1]
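A minimal numpy sketch of this blend (the data and α are illustrative):

```python
import numpy as np

# Sketch of the RDA covariance blend between the QDA and LDA extremes.
def rda_covariance(Sigma_k, Sigma_pooled, alpha):
    # Σk(α) = α Σk + (1 − α) Σ : α = 1 recovers QDA, α = 0 recovers LDA.
    return alpha * Sigma_k + (1.0 - alpha) * Sigma_pooled

rng = np.random.default_rng(0)
Xa, Xb = rng.normal(size=(20, 3)), rng.normal(size=(20, 3))
S_a, S_b = np.cov(Xa, rowvar=False), np.cov(Xb, rowvar=False)
S_pooled = 0.5 * (S_a + S_b)        # equal class sizes, for simplicity

print(rda_covariance(S_a, S_pooled, alpha=0.5))
```

In the book, α is treated as a tuning parameter chosen by performance on held-out data.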

Computation of LDA

- Compute the eigendecomposition of the pooled covariance estimate: Σ = UDUᵀ
- Sphere the data: X* ← D^(−1/2)UᵀX
- Classify to the closest centroid in the transformed space, modulo the prior probabilities πi.
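A minimal numpy sketch of these three steps on synthetic data; the pooled-covariance estimate here is a rough unweighted average, for brevity:

```python
import numpy as np

# Sketch of LDA via sphering, on synthetic class-shifted data.
rng = np.random.default_rng(0)
N, P, K = 90, 2, 3
G = rng.integers(0, K, size=N)
X = rng.normal(size=(N, P)) + 2.0 * G[:, None]

centroids = np.stack([X[G == k].mean(axis=0) for k in range(K)])
pooled = sum(np.cov(X[G == k], rowvar=False) for k in range(K)) / K  # rough pooled Σ
priors = np.bincount(G, minlength=K) / N

# Eigendecomposition Σ = U D Uᵀ, then sphere: x* = D^{-1/2} Uᵀ x.
D, U = np.linalg.eigh(pooled)
sphere = U / np.sqrt(D)            # columns scaled so that X @ sphere is sphered
Xs, Cs = X @ sphere, centroids @ sphere

# Closest sphered centroid, corrected by the log prior πk.
dist2 = ((Xs[:, None, :] - Cs[None, :, :]) ** 2).sum(axis=2)
Ghat = np.argmin(0.5 * dist2 - np.log(priors)[None, :], axis=1)
print("training accuracy:", np.mean(Ghat == G))
```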

Matrix Factorization!

- Using matrix factorization, we hope to reduce the dimensionality (compress)

[Figure: a numeric data matrix X factored as X ≈ W × C via the SVD, illustrating compression]
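A short numpy sketch of this factorization using a random matrix:

```python
import numpy as np

# Sketch: rank-r compression X ≈ W C via the SVD (random X for illustration).
rng = np.random.default_rng(0)
X = rng.random((7, 5))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = 2
W = U[:, :r] * s[:r]          # 7×r factor (left singular vectors, scaled)
C = Vt[:r, :]                 # r×5 factor (right singular vectors)

# Eckart-Young: W @ C is the best rank-r approximation in least squares.
print(np.linalg.norm(X - W @ C))   # the error shrinks as r grows
```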

Reduced Rank LDA

- LDA can be computed by projecting the points into a (K − 1)-dimensional space and computing distances there
- Reduced-rank PCA minimizes the reconstruction error
- Reduced-rank LDA is about finding an orthogonal set of vectors that maximizes the Rayleigh quotient (see the sketch below)
- W: the within-class covariance, a sum of the class covariance matrices
- B: the between-class covariance, the covariance matrix of the centroids of X
- W + B = T = XᵀX, the total covariance
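A numpy sketch of this maximization via whitening (synthetic data; a small ridge keeps W invertible):

```python
import numpy as np

# Sketch: directions maximizing the Rayleigh quotient a'Ba / a'Wa,
# found through the symmetric eigenproblem on W^{-1/2} B W^{-1/2}.
rng = np.random.default_rng(0)
N, P, K = 120, 4, 3
G = rng.integers(0, K, size=N)
X = rng.normal(size=(N, P)) + 1.5 * G[:, None]

grand = X.mean(axis=0)
W = 1e-8 * np.eye(P)        # ridge for numerical safety
B = np.zeros((P, P))
for k in range(K):
    Xk = X[G == k]
    mk = Xk.mean(axis=0)
    W += (Xk - mk).T @ (Xk - mk)                     # within-class scatter
    B += len(Xk) * np.outer(mk - grand, mk - grand)  # between-class scatter

d, U = np.linalg.eigh(W)
Wmh = (U / np.sqrt(d)) @ U.T                         # W^{-1/2}
vals, vecs = np.linalg.eigh(Wmh @ B @ Wmh)
A = Wmh @ vecs[:, ::-1][:, :K - 1]   # top K−1 discriminant directions
print(vals[::-1][:K - 1])            # the maximized Rayleigh quotients
```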

Reduced Rank LDA

- Helps to visualize high-dimensional data
- The reduction in dimension is usually done by ordering the vectors and then choosing the first M vectors
- LDA is also used purely as a dimension-reduction method, together with other classification methods such as Nearest Neighbors.
- It is equivalent to projecting the vectors and their class centers into a low-dimensional subspace

Reduced LDA


Questions

- Two kinds of questions
- Clarification / derivation
- Performance / understanding

Questions

- 1) Why is E(Yk | X = x) = Pr(G = k | X = x)?
- Answer: in general,
- E[Y | X = x] = Σy y P(Y = y | X = x)
- = 1·p + 0·(1 − p)
- = P(Y = 1 | X = x)
- In this example Yk is an indicator variable for the class k

Questions(2)

- The relationship to linear regression is that we are modeling E(Y | X = x) as a linear function of x.

Questions(3)

- 1) The last paragraph on p. 90 says, "the large discrepancy between the training and test error is partly due to the fact that there are many repeat measurements on a small number of individuals". Two questions about this: a) since it is only "partly due to" this, what are the other factors? b) Please explain "many repeat measurements on a small number of individuals".
- 2) I am curious where the number 30, in the first paragraph on p. 105, comes from. Can we explain it only by the Gaussian assumption on fk(x)?

Questions(4)

- 1) In Eq. 4.3, Y is a matrix of 0's and 1's, with each row having a single 1. If we put more than one 1 in a row, can we extend linear regression to multi-label classification?
- 2) On p. 95 the book says we can apply classification after data reduction. However, to me, LDA already uses the labels (Y), and for classification we will use the labels again to estimate the distribution. Is this overfitting?
- 3) What are generative and discriminative models? Is LDA a generative or a discriminative model?

Questions(5)

- A high-level picture I get from this chapter is that we are trying to come up with models that approximate the conditional expectation E = Pr(G = k | X = x). We start with a linear regression model that approximates E by fk(x), which is rigid and is not a good approximation to the posterior probability, as it can be negative or greater than 1. Then we go to the LDA model, where we assume a model for the class densities and use the conditional class densities to calculate our posterior; the posterior obtained is a better approximation of E, but we are also adding bias (the class densities), which also reduces variance. Then, using the logistic regression model, we still approximate E better than the linear regression model, and this model is more robust, as it maximizes the likelihood of the training data and makes fewer assumptions than LDA.
- How can we extend this picture? What other models can better approximate the conditional expectation Pr(G = k | X = x)? One way to extend LDA is to get a better approximation of the class models using unlabeled data, labeling it with a distance metric similar to the one used in the k-NN classifier in Ch. 2.

Questions(6)

- 2. Sometimes we can get an estimate of the prior probabilities of the classes using domain knowledge; the LDA framework allows us to use these prior probabilities during the calculation of the posterior. Can we use prior probabilities in other models, like linear regression and logistic regression?

Questions(7)

- The chapter has two broad categories: one where we use discriminant functions like Pr(G = k | X = x) and then use them for classification, and the other where we directly model the boundaries between the classes (the hyperplane approach). When should we use which approach? One case: when you know the densities, you can use LDA, because the optimal hyperplane might use noise as support points. But what if you don't know the density? Another example: a variant of the hyperplane approach (SVMs) performs better on high-dimensional input data. Similarly, are there other situations where a particular approach is preferred?

Questions(8)

- 1. P. 83 describes the masking phenomenon. In Figure 4.2, consider the case where the middle class is moved slightly down and to the right. Though it will not be masked, it still gets only a small slice of the area, which is a very bad prediction. Hence, to me, the masking phenomenon means that linear regression for multivariate Gaussians is generally inaccurate. And of course it is inaccurate, because linear regression assumes the probability distribution is linear, not Gaussian. Now, since this assumption no longer holds, why do we go on to a quadratic fit? What is the rationale? Since the data is Gaussian, why not just use LDA or QDA, with sound theoretical inference?
- Put another way, the logic looks like: linear regression may cause masking; a quadratic fit can avoid masking; when K > 3, use polynomial terms up to degree K − 1. My point is that just because a quadratic fit avoids masking does not mean it is a reasonable classifier, especially when the cause of masking is the model being Gaussian, not linear.

Questions(9)

- I suppose in most cases people use LDA/QDA, but in some cases people use polynomial forms of linear regression based on heuristics and experience. What is the real situation?
- P. 90, Regularized Discriminant Analysis: why do we want a compromise between LDA and QDA? Because of computational tractability? It seems to me that RDA requires more computation, not less.

Questions(10)

- When does "masking" usually happen with the regression approach? I was wondering about the opposite question: when doing linear regression on an indicator matrix with K > 3 classes, is there ever a case where the K − 2 middle classes aren't masked? It seems that with a linear decision boundary you can only ever bisect the space, so you can only decide between two classes (except when augmenting the data with X², etc.). Is this correct?
- (pp. 83-84) The book mentions a "loose but general rule" for using polynomial terms in linear regression for classification. Is there a hard rule for the maximal degree of polynomial input required to resolve K separable classes? Is all separable data separable by some polynomial? Of bounded degree?

Questions(11)

- In logistic regression, it is shown (Eq. 4.26) that the fit is a reweighted least squares problem. Conceptually, what is the role of reweighting by W? Do other reweighting schemes exist that give better classification performance?
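The reweighting can be made concrete with a short IRLS sketch; this is a minimal numpy illustration of the Newton step behind Eq. 4.26 on synthetic data, not the book's code:

```python
import numpy as np

# Each Newton step of two-class logistic regression solves a weighted LS problem:
#   β ← (X'WX)⁻¹ X'W z,  W = diag(p(1 − p)),  z = Xβ + W⁻¹(y − p).
rng = np.random.default_rng(0)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 1))])   # intercept + one feature
y = (X[:, 1] + rng.normal(scale=0.7, size=N) > 0).astype(float)

beta = np.zeros(2)
for _ in range(8):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = np.clip(p * (1.0 - p), 1e-10, None)   # the weights: variance of each yi
    z = X @ beta + (y - p) / w                # the adjusted (working) response
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))

print(beta)   # points near the boundary (p ≈ ½) carry the largest weight
```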

Questions(12)

- (1) In Section 4.3.1, regularized discriminant analysis, what is the effect of diagonalizing Σ? It seems to me to strengthen the independence assumption, so does it really help in practice?
- (2) On p. 92 the l-th discriminant variable is computed as Zl = vlᵀX, where vl = W^(−1/2)vl*. Why does vl have a W^(−1/2) term instead of a W^(1/2) term?

Questions(13)

- I'm not sure I immediately see how allowing linear regression onto basis expansions h(X) of the inputs will lead to consistent estimates of the posterior probabilities Pr(G = k | X = x) (p. 82). Is there a clearer way of demonstrating this (perhaps this question is better suited to Ch. 5)? The text also suggests that these expansions should be adaptively applied as the size of our training set grows, which is also a little ambiguous. Perhaps basis expansions could be applied to the example vowel recognition problem on pp. 84-85 to demonstrate this? I'll look into it.

More Questions

- 1. We learned in Ch. 3 how to do hypothesis testing for linear regression; could you talk a little bit about that for logistic regression? That is, what are the assumptions, and what are the tests?
- 2. Already seen this question twice: where does the number 30 come from, anyway?