Linear Methods For Classification Chapter 4 - PowerPoint PPT Presentation



Linear Methods For Classification Chapter 4
  • Machine Learning Seminar
  • Shinjae Yoo
  • Tal Blum

Bayesian Decision Theory
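  • World states $\omega_j$ (i.e. classes)
  • Actions $\alpha(x)$ (i.e. classification)
  • $R(\alpha(x) \mid x)$: risk or cost function
  • The total risk
  • $R = \int R(\alpha(x) \mid x)\, p(x)\, dx$
  • The Bayes Decision Rule
  • $\alpha(x) = \arg\min_{\alpha_j} R(\alpha_j \mid x)$
  • $= \arg\min_{\alpha_j} \sum_{k=1}^{c} \lambda(\alpha_j \mid \omega_k)\, P(\omega_k \mid x)$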
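The Bayes decision rule can be sketched in a few lines of NumPy. This is only an illustration: the function names, the posterior values and the asymmetric loss matrix are hypothetical, and `loss[j, k]` is assumed to hold the cost of taking action j when the true class is k.

```python
import numpy as np

def conditional_risk(loss, posteriors):
    """Risk of each action j: sum over k of loss[j, k] * P(class k | x)."""
    return loss @ posteriors

def bayes_action(loss, posteriors):
    """Index of the action with the smallest conditional risk."""
    return int(np.argmin(conditional_risk(loss, posteriors)))

# Hypothetical posteriors P(class k | x) for three classes at one point x.
posteriors = np.array([0.5, 0.3, 0.2])

# Zero-one loss: every error costs 1, a correct decision costs 0.
zero_one = 1.0 - np.eye(3)

# Hypothetical asymmetric loss: missing class 2 costs 10.
asymmetric = np.array([[0.0, 1.0, 10.0],
                       [1.0, 0.0, 10.0],
                       [1.0, 1.0, 0.0]])

print(bayes_action(zero_one, posteriors))    # agrees with argmax of the posteriors
print(bayes_action(asymmetric, posteriors))  # the costly class wins despite a low posterior
```

With the zero-one loss the rule reduces to picking the largest posterior; a different loss matrix can change the decision without changing the posteriors.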

Minimum Error-Rate Classification
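  • Introduction of the zero-one loss function: $\lambda(\alpha_i \mid \omega_j) = 0$ if $i = j$ and $1$ if $i \neq j$
  • Therefore, the conditional risk is $R(\alpha_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$
  • The risk corresponding to this loss function is
    the average probability of error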

Minimum Error-Rate Classification
  • Minimizing the risk is equivalent to maximizing
    $P(\omega_i \mid x)$
  • Optimal strategy is
  • Decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \neq i$
  • For classification, choose the class with the
    highest posterior probability

Discriminant Functions
  • Denoting $g_i(x) = P(\omega_i \mid x)$
  • $\forall f$ s.t. $f$ is monotonically increasing,
  • the set $f(g_i(x))$ gives the same classification
    as the set $g_i(x)$
  • Examples
  • $g_c(x) = P(x \mid c) P(c) / \sum_{c'} P(x \mid c') P(c')$
  • $g_c(x) = P(x \mid c) P(c)$
  • $g_c(x) = \ln P(x \mid c) + \ln P(c)$
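A small numerical check of the three example discriminants, as a sketch only. The likelihood table and priors are made-up numbers; at each point x the posterior, the joint, and the log joint differ by a monotone increasing transformation, so they should pick the same class.

```python
import numpy as np

# Hypothetical class-conditional likelihoods P(x | c) for 3 classes,
# evaluated at 4 points x (rows = points, columns = classes).
likelihood = np.array([[0.20, 0.05, 0.01],
                       [0.02, 0.30, 0.10],
                       [0.01, 0.04, 0.40],
                       [0.10, 0.12, 0.03]])
prior = np.array([0.5, 0.3, 0.2])  # hypothetical priors P(c)

joint = likelihood * prior                             # P(x|c) P(c)
posterior = joint / joint.sum(axis=1, keepdims=True)   # P(c|x)
log_joint = np.log(likelihood) + np.log(prior)         # ln P(x|c) + ln P(c)

# Classify each point with each of the three discriminants.
decisions = [scores.argmax(axis=1) for scores in (posterior, joint, log_joint)]
print(decisions)
```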

Linear Discriminant Functions
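  • Special case when a monotone function $f(g(x))$ of the
    discriminant is linear
  • Example
  • Logistic regression
  • $f$ is the log function, applied to the posterior odds (the logit)
  • Decision boundary: the set of points where the log-odds equal zero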

Extensions to Linear Discriminant Functions
Linear Regression Of An Indicator Matrix
  • $Y = (Y_1, \ldots, Y_K)$, an $N \times K$ indicator matrix
  • $Y_{j,k} = 1$ iff $G_j = k$
  • Can be seen as K separate linear regressions

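Linear Regression Of An Indicator Matrix
  • The algorithm
  • Fit $\hat{B} = (X^T X)^{-1} X^T Y$, where X includes a column of ones
  • Compute $\hat{f}(x) = [(1, x)\hat{B}]^T$,
  • a K vector
  • Classify to the largest component of $\hat{f}(x)$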
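The indicator-matrix regression can be sketched as follows. This is a minimal illustration, not the presenters' code: the function names and the toy data are hypothetical, and the fit uses `numpy.linalg.lstsq` instead of forming $(X^T X)^{-1}$ explicitly.

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Least-squares fit of an N x K indicator matrix on (1, X)."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                  # Y[j, k] = 1 iff g[j] == k
    X1 = np.column_stack([np.ones(N), X])     # prepend the intercept column
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return B                                  # shape (p + 1, K)

def predict_indicator_regression(B, X):
    """Compute the K-vector (1, x) B for each row and pick its largest component."""
    X1 = np.column_stack([np.ones(X.shape[0]), X])
    F = X1 @ B
    return F, F.argmax(axis=1)

# Hypothetical toy data: two well-separated classes in two dimensions.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
g = np.array([0, 0, 0, 1, 1, 1])

B = fit_indicator_regression(X, g, K=2)
F, g_hat = predict_indicator_regression(B, X)
print(F.sum(axis=1))   # the K fitted values sum to 1 at every point
print(g_hat)
```

The fitted values sum to one because the model contains an intercept, but the individual components are not constrained to lie in [0, 1].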

Linear Regression Of An Indicator Matrix
  • Properties
  • Is linear regression flexible enough to model
    the posterior probabilities?
  • $\hat{f}_i(x)$ can be negative or bigger than 1
  • Incorporating more basis elements can help
  • Gives the same estimate as
  • $\min_B \sum_{i=1}^{N} \| y_i - [(1, x_i) B]^T \|^2$

Masking Effect
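A deterministic sketch of the masking effect, under assumptions of our own: three classes lined up on a single axis at hypothetical positions -3, 0 and 3, four points each. With a purely linear fit the middle class's fitted value stays flat near 1/3 and is always dominated by one of its neighbours; adding a squared term lets it peak in the middle.

```python
import numpy as np

def indicator_regression_predict(H, g, K):
    """Fit the indicator matrix on the basis matrix H and return argmax classes."""
    Y = np.eye(K)[g]
    B, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return (H @ B).argmax(axis=1)

# Hypothetical one-dimensional data: three classes lined up at -3, 0 and 3.
offsets = np.array([-0.75, -0.25, 0.25, 0.75])
x = np.concatenate([-3 + offsets, offsets, 3 + offsets])
g = np.repeat([0, 1, 2], 4)

linear = np.column_stack([np.ones_like(x), x])
quadratic = np.column_stack([np.ones_like(x), x, x**2])

pred_linear = indicator_regression_predict(linear, g, K=3)
pred_quadratic = indicator_regression_predict(quadratic, g, K=3)

print(pred_linear)     # the middle class (1) is never predicted
print(pred_quadratic)  # the quadratic term recovers all three classes
```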
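Gaussian Discriminant Functions
  • Multivariate density
  • Multivariate normal density in d dimensions is
  • $p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^t \Sigma^{-1} (x-\mu)\right]$
  • where
  • $x = (x_1, x_2, \ldots, x_d)^t$ ($t$ stands for
    the transpose vector form)
  • $\mu = (\mu_1, \mu_2, \ldots, \mu_d)^t$, the mean vector
  • $\Sigma$, the $d \times d$ covariance matrix
  • $|\Sigma|$ and $\Sigma^{-1}$ are its determinant and
    inverse respectively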

Gaussian Discriminant Functions (2)
  • The discriminant function we use
  • $g_i(x) = \ln P(x \mid \omega_i) + \ln P(\omega_i)$
  • The parameters are usually not known, so they
    are estimated
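As a sketch of "estimate the parameters, then plug them into the discriminant": the code below estimates a prior, mean vector and covariance matrix per class and evaluates $g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$ for a normal density. The function names and the tiny data set are hypothetical, and `np.cov` is used with its default unbiased normalization.

```python
import numpy as np

def estimate_gaussian_classes(X, g, K):
    """Estimate (prior, mean vector, covariance matrix) for each class."""
    params = []
    for k in range(K):
        Xk = X[g == k]
        params.append((len(Xk) / len(X),            # prior
                       Xk.mean(axis=0),             # mean vector
                       np.cov(Xk, rowvar=False)))   # covariance (divides by N_k - 1)
    return params

def gaussian_discriminant(x, prior, mu, sigma):
    """ln of the normal density at x plus ln of the prior."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)      # squared Mahalanobis distance
    return -0.5 * maha - 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + np.log(prior)

def classify(x, params):
    """Pick the class with the largest discriminant value."""
    return int(np.argmax([gaussian_discriminant(x, *p) for p in params]))

# Hypothetical toy data: two classes on unit squares around (0.5, 0.5) and (5.5, 5.5).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]])
g = np.array([0, 0, 0, 0, 1, 1, 1, 1])

params = estimate_gaussian_classes(X, g, K=2)
print(classify(np.array([0.5, 0.5]), params))
print(classify(np.array([5.5, 5.5]), params))
```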

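LDA: Linear Discriminant Analysis
  • Case 1: $\Sigma_i$ are equal and $\Sigma_i = \sigma^2 I$
  • The separating plane is $w^t (x - x_0) = 0$
  • where $w = (\mu_i - \mu_j)/\sigma^2$ is the direction of the
    difference between the means
  • $x_0$ is given by $x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln\frac{P(\omega_i)}{P(\omega_j)} (\mu_i - \mu_j)$
  • $x_0$ is on the line connecting the means, but not
    necessarily in the middle of the line, unless the
    priors are equal.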

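Case where $\Sigma_i = \sigma^2 I$
LDA - where $\Sigma_i = \Sigma$
  • Case 2: $\Sigma_i$ are equal ($\Sigma_i = \Sigma$)
  • The separating plane is $w^t (x - x_0) = 0$
  • where $w = \Sigma^{-1} (\mu_i - \mu_j)$
  • $w$ is generally not parallel to $\mu_i - \mu_j$, so the plane is
    generally not orthogonal to the line connecting the means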
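A sketch of the shared-covariance boundary $w^t(x - x_0) = 0$ with $w = \Sigma^{-1}(\mu_i - \mu_j)$. The slide does not spell out $x_0$ for this case; the code assumes the standard textbook form $x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln[P(\omega_i)/P(\omega_j)]}{(\mu_i - \mu_j)^t \Sigma^{-1} (\mu_i - \mu_j)} (\mu_i - \mu_j)$. The means, covariance and priors are hypothetical.

```python
import numpy as np

def lda_boundary(mu_i, mu_j, sigma, prior_i, prior_j):
    """Return (w, x0) of the separating plane w^T (x - x0) = 0."""
    diff = mu_i - mu_j
    w = np.linalg.solve(sigma, diff)                  # Sigma^{-1} (mu_i - mu_j)
    shift = np.log(prior_i / prior_j) / (diff @ w)    # log prior ratio / Mahalanobis distance
    x0 = 0.5 * (mu_i + mu_j) - shift * diff
    return w, x0

# Hypothetical means and a non-spherical shared covariance.
mu_i = np.array([0.0, 0.0])
mu_j = np.array([2.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

w, x0 = lda_boundary(mu_i, mu_j, sigma, prior_i=0.7, prior_j=0.3)
print(w)    # not a multiple of mu_i - mu_j because sigma is not spherical
print(x0)   # shifted away from the midpoint, toward the less likely class
```

With equal priors the shift term vanishes and $x_0$ is the midpoint of the two means.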

Connection Between LDA and Multiple Linear Regression
  • For a 2-class problem, the directions of the
    decision boundaries are the same, but unless
    $N_1 = N_2$ the intercepts are different.
  • While both estimate discriminant
    functions and produce the same type of linear
    boundaries, linear regression is a discriminative
    method while LDA is a generative one.

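QDA: Arbitrary $\Sigma_i$
  • Case 3: $\Sigma_i$ are arbitrary
  • The discriminant functions are $g_i(x) = x^t W_i x + w_i^t x + w_{i0}$
  • where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$, and $w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$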
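A sketch of the quadratic discriminant in coefficient form, assuming the standard expansion $g_i(x) = x^t W_i x + w_i^t x + w_{i0}$ with $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^t\Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$. The function names and numbers are hypothetical; the example compares the expanded form against the unexpanded log-density (constant term dropped).

```python
import numpy as np

def quadratic_coefficients(prior, mu, sigma):
    """Coefficients (W, w, w0) of the quadratic discriminant for one Gaussian class."""
    sigma_inv = np.linalg.inv(sigma)
    _, logdet = np.linalg.slogdet(sigma)
    W = -0.5 * sigma_inv
    w = sigma_inv @ mu
    w0 = -0.5 * mu @ sigma_inv @ mu - 0.5 * logdet + np.log(prior)
    return W, w, w0

def qda_discriminant(x, W, w, w0):
    """Evaluate x^T W x + w^T x + w0."""
    return x @ W @ x + w @ x + w0

# Hypothetical class parameters and query point.
prior = 0.4
mu = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, -1.2])

W, w, w0 = quadratic_coefficients(prior, mu, sigma)
expanded = qda_discriminant(x, W, w, w0)

# The same quantity written directly from the Gaussian log-density.
diff = x - mu
direct = (-0.5 * diff @ np.linalg.solve(sigma, diff)
          - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior))
print(expanded, direct)
```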

Comparing Extended LDA and QDA
What do we use, LDA or QDA?
  • QDA is more expressive, but requires more parameters
  • Number of parameters
  • LDA: $(K-1)(p+1)$
  • QDA: $(K-1)\{p(p+3)/2 + 1\}$
  • Both perform very well on many tasks.
  • Usually because the data does not support more
    complex boundaries.
  • If the data is not Gaussian, CV on the cutoffs
    may help.
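The two parameter counts are easy to compare numerically. A minimal sketch; the function names and the example sizes (K = 11 classes, p = 10 inputs) are chosen only for illustration.

```python
def lda_param_count(K, p):
    """Parameters needed for the K - 1 linear boundaries: (K - 1)(p + 1)."""
    return (K - 1) * (p + 1)

def qda_param_count(K, p):
    """Parameters for the quadratic boundaries: (K - 1)(p(p + 3)/2 + 1)."""
    # p * (p + 3) is always even, so integer division is exact.
    return (K - 1) * (p * (p + 3) // 2 + 1)

K, p = 11, 10  # hypothetical problem size
print(lda_param_count(K, p), qda_param_count(K, p))
```

The LDA count grows linearly in p while the QDA count grows quadratically, which is one reason QDA needs more data.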

Regularized Discriminant Analysis
  • Is there a middle way between LDA and QDA?
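One common answer, sketched here as an assumption since the slide does not show the formula, is to shrink each class covariance toward the pooled covariance: $\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma}$ with $\alpha \in [0, 1]$. The function name and the matrices below are hypothetical.

```python
import numpy as np

def regularized_covariances(class_covs, pooled_cov, alpha):
    """Blend each class covariance with the pooled one; alpha must lie in [0, 1]."""
    return [alpha * S_k + (1.0 - alpha) * pooled_cov for S_k in class_covs]

# Hypothetical class covariances; the pooled one is their plain average here.
class_covs = [np.array([[1.0, 0.0], [0.0, 1.0]]),
              np.array([[3.0, 1.0], [1.0, 2.0]])]
pooled_cov = 0.5 * (class_covs[0] + class_covs[1])

for alpha in (0.0, 0.5, 1.0):
    print(alpha, regularized_covariances(class_covs, pooled_cov, alpha)[1])
```

Setting alpha to 0 gives every class the pooled covariance (LDA), alpha equal to 1 keeps the separate covariances (QDA), and values in between trade one off against the other; alpha would typically be chosen by validation.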

Regularized Discriminant Analysis
Computation of LDA
  • Compute the eigen decomposition $\hat{\Sigma} = U D U^T$
  • Project X by $X^* \leftarrow D^{-1/2} U^T X$
  • Classify to the closest centroid in the
    transformed space, modulo the prior probabilities
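The three steps can be sketched as below. The function name, the random toy data and the choice of the pooled within-class covariance (divided by N - K) as $\hat{\Sigma}$ are assumptions for illustration. "Modulo the prior probabilities" is implemented by adding the log prior to minus half the squared distance.

```python
import numpy as np

def lda_by_sphering(X, g, K):
    """Classify by nearest centroid after sphering with the pooled covariance."""
    N, p = X.shape
    priors = np.array([(g == k).mean() for k in range(K)])
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])
    centered = X - means[g]
    sigma = centered.T @ centered / (N - K)      # pooled within-class covariance
    D, U = np.linalg.eigh(sigma)                 # sigma = U diag(D) U^T
    sphere = U / np.sqrt(D)                      # U D^{-1/2}, scales each column of U
    X_star = X @ sphere                          # rows are (D^{-1/2} U^T x)^T
    means_star = means @ sphere
    sq_dist = ((X_star[:, None, :] - means_star[None, :, :]) ** 2).sum(axis=2)
    scores = -0.5 * sq_dist + np.log(priors)     # closest centroid, adjusted by priors
    return scores.argmax(axis=1)

# Hypothetical data: three Gaussian blobs with unequal class sizes.
rng = np.random.default_rng(0)
sizes = [40, 25, 15]
centers = np.array([[0.0, 0.0], [3.0, 1.0], [1.0, 4.0]])
X = np.vstack([c + rng.normal(size=(n, 2)) for c, n in zip(centers, sizes)])
g = np.repeat([0, 1, 2], sizes)

pred = lda_by_sphering(X, g, K=3)
print((pred == g).mean())   # training accuracy on the toy data
```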

Matrix factorization!
Using matrix factorization, we hope to reduce
the dimensionality (compress).
(Slide figure: example numeric matrices; the entries are not recoverable from the transcript.)
Reduced Rank LDA
  • LDA can be computed by projecting the points into
    a K-1 dimensional space and computing distances
  • Reduced rank PCA minimizes the reconstruction error
  • Reduced rank LDA is about finding an orthogonal set
    of vectors that maximize the Rayleigh quotient $a^T B a / a^T W a$
  • W: the within-class covariance, a sum of the
    class covariance matrices
  • B: the between-class covariance, the covariance matrix
    of the centroids of X
  • $W + B = T = X^T X$
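A sketch of how the directions maximizing the Rayleigh quotient $a^T B a / a^T W a$ can be computed: whiten with $W^{-1/2}$, take the leading eigenvectors of $W^{-1/2} B W^{-1/2}$, and map them back. The function names and random data are hypothetical, W is taken as the pooled within-class covariance, and B as the unweighted covariance of the class centroids.

```python
import numpy as np

def within_between(X, g, K):
    """Within-class covariance W and covariance B of the class centroids."""
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])
    centered = X - means[g]
    W = centered.T @ centered / (len(X) - K)
    B = np.cov(means, rowvar=False)
    return W, B

def discriminant_directions(W, B, M):
    """First M directions maximizing a^T B a / a^T W a, with their quotients."""
    d, U = np.linalg.eigh(W)
    W_inv_sqrt = U @ np.diag(d ** -0.5) @ U.T           # symmetric W^{-1/2}
    evals, V_star = np.linalg.eigh(W_inv_sqrt @ B @ W_inv_sqrt)
    order = np.argsort(evals)[::-1][:M]                 # largest eigenvalues first
    V = W_inv_sqrt @ V_star[:, order]                   # map back to the original coordinates
    return V, evals[order]

# Hypothetical data: three classes in four dimensions.
rng = np.random.default_rng(1)
centers = np.array([[0, 0, 0, 0], [3, 1, 0, 0], [1, 4, 0, 0]], dtype=float)
X = np.vstack([c + rng.normal(size=(30, 4)) for c in centers])
g = np.repeat([0, 1, 2], 30)

W, B = within_between(X, g, K=3)
V, quotients = discriminant_directions(W, B, M=2)
Z = X @ V            # the data in the K - 1 = 2 dimensional discriminant space
print(Z.shape, quotients)
```

The resulting vectors are orthonormal with respect to W rather than in the ordinary Euclidean sense, which is where the $W^{-1/2}$ factor comes from.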

Reduced Rank LDA
  • Helps to visualize high dimensional data
  • The reduction in dimension is usually done by
    ordering the vectors and then choosing the first
    M vectors
  • LDA is also used just as a dimension reduction
    method together with other classification methods
    such as Nearest Neighbor.
  • It is equivalent to projecting the vectors and
    their class centers into a low dimensional space

Reduced LDA
  • 2 Kinds of questions
  • Clarification / derivation
  • Performance / understanding

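  • 1) Why is $E(Y_k \mid X = x) = \Pr(G = k \mid X = x)$?
  • Answer: In general
  • $E_{Y \mid X = x}(Y) = \sum_y y\, P(Y = y \mid X = x)$
  • $= 1 \cdot p + 0 \cdot (1 - p)$ for a 0/1 variable with $p = P(Y = 1 \mid X = x)$
  • $= P(Y = 1 \mid X = x)$
  • In this example $Y_k$ is an indicator variable for
    the class k

  • The relationship to Linear Regression is that we
    are modeling $E(Y \mid X = x)$ as a linear function of x.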

  • 1) In the last paragraph on p. 90, it says "the large
    discrepancy between the training and test error
    is partly due to the fact that there are many repeat
    measurements on a small number of individuals".
    Two questions about this: a) since it's "partly
    due to", what are the other factors? b) Please give some
    explanation of "many repeat measurements on a
    small number of individuals".
  • 2) I am curious where the number 30 comes
    from in the first paragraph on p. 105.
  • Can we explain it only by the Gaussian
    assumption on $f_k(x)$?

  • 1) In Eq. 4.3, Y is a matrix of 0's and 1's, with
    each row having a single 1. If we put more than
    one 1 in a row, can we extend linear regression
    to multi-label classification?
  • 2) On p. 95, the book says we can apply
    classification after data reduction. However, to
    me, LDA has already used the labels (Y), and for
    classification we will use the labels again to estimate the
    distribution. Does this overfit?
  • 3) What are generative and discriminative
    models? Is LDA a generative or a discriminative model?

  • A high level picture that I get from this chapter
    is that we are trying to come up with models that
    approximate the conditional expectation E =
    $\Pr(G = k \mid X = x)$. We start with a linear regression
    model that tries to approximate E by fhat_k(x),
    which is rigid and is not a good approximation to
    the posterior probability (E), as it could be negative
    or greater than 1. Then we go to the LDA model, where we assume
    a model for the class densities and use the conditional
    class densities to calculate our posterior (E).
    The posterior obtained is a better approximation
    of E, but we are also adding bias (the class
    density assumption), which also reduces variance. Then,
    using the logistic regression model, we still
    approximate E better than the linear regression
    model, and this model is more robust, as it
    maximizes the likelihood of the training data and
    also makes fewer assumptions than LDA.
  • How can we extend this picture? What other models
    can we use that better approximate the
    conditional expectation $\Pr(G = k \mid X = x)$? One way to
    extend LDA is to get a better approximation of the
    class models by using unlabeled data, labeling them
    with a distance metric similar to the one
    used in the k-NN classifier in ch. 2.

  • 2. Sometimes we can get an estimate of the prior
    probabilities of the classes using domain knowledge.
    The LDA framework allows us to use these prior
    probabilities when calculating the
    posterior. Can we use prior probabilities in
    other models like linear regression and logistic regression?

  • The chapter has two broad categories: one where
    we use discriminant functions like $\Pr(G = k \mid X = x)$ and
    then use these for classification, and the other
    where we directly model the boundaries
    between the classes (the hyperplane
    approach). When should we use which approach? One case
    is when you know the densities: you can use LDA,
    because the optimal hyperplane might use noise points as
    support points. But what if you don't know the
    density? Another example is that a variant of the
    hyperplane approach (SVMs) performs better on
    high dimensional input data. Similarly, are
    there any other situations where a particular
    approach is preferred?

  • 1. P.83 describes the masking phenomenon. In
    figure 4.2, consider the case where the middle
    class is moved slightly down and to the right. Though it
    will not be masked, it still gets only a small
    slice of area, which is a very bad prediction.
    Hence, to me, the masking phenomenon means linear
    regression for multivariate Gaussian data is generally
    inaccurate. And of course it is inaccurate,
    because linear regression assumes the probability
    distribution is linear, not Gaussian. Now, since
    this assumption no longer holds, why do we go on to a
    quadratic fit? What is the rationale? Since the
    data is Gaussian, why not just use LDA or QDA,
    with sound theoretical inference?
  • Put another way, the logic looks like: linear
    regression may cause masking; a quadratic fit can
    avoid masking; when $K \ge 3$, use polynomial terms up
    to degree K-1. My point is, just because a quadratic fit
    avoids masking does not mean it is a reasonable
    classifier, especially when the reason for
    masking is that the model is Gaussian, not linear.

  • I suppose in most cases people use LDA / QDA, but in
    some cases people use polynomial forms of linear
    regression based on heuristics and experience. What is
    the real situation?
  • P.90, Regularized Discriminant Analysis: why do we
    want a compromise between LDA and QDA? Because of
    computational tractability? It seems to me that
    RDA requires more computation, not less.

  • When does "mask" usually happen with the
    regression approach?
  • I was wondering the
    opposite question. When doing linear regression
    on an indicator matrix with $K \ge 3$ classes, is
    there ever a case where K-2 of the classes aren't
    masked? It seems that with a linear decision
    boundary, you can only ever bisect the space, so
    you can only decide between two classes. (Except
    when augmenting the data with $X^2$, etc.) Is this
    correct?
  • (p.83-84) The book mentions a "loose but
    general rule" for using polynomial terms in
    linear regression for classification. Is there a
    hard rule for the maximal degree of polynomial
    input that would be required to resolve K
    separable classes?
  • Is all separable data
    separable by some polynomial? Of bounded degree?

  • In logistic regression, it is shown (Eqn 4.26)
    that fitting is a reweighted least squares problem.
    Conceptually, what is the role of reweighting
    using W? Do other reweighting schemes exist
    that give better classification performance?
  • (1) In section 4.3.1, regularized discriminant
    analysis, what is the effect of diagonalizing $\Sigma$?
  • It seems to me to strengthen the
    independence assumption, so does it really help in
  • (2) On p.92 the l-th discriminant variable is
    computed as $Z_\ell = v_\ell^T X$, where $v_\ell = W^{-1/2} v_\ell^*$.
  • Why does $v_\ell$ have a $W^{-1/2}$ term instead of a
    $W^{1/2}$ term?

  • I'm not sure I immediately see how allowing
    linear regression onto basis expansions h(X) of
    the inputs will lead to consistent estimates of
    the posterior probabilities $\Pr(G = k \mid X = x)$
    (p. 82). Is there a clearer way of demonstrating
    this (perhaps this question is better suited to
    Ch 5)? The text also suggests that these
    expansions should be adaptively applied as the
    size of our training set grows, which is also a
    little ambiguous. Perhaps basis expansions could
    be applied to the example vowel recognition
    problem on pp. 84-5 to demonstrate this? I'll look
    into it.

More Questions
  • 1. We learned in Chapter 3 how to do hypothesis
    testing for linear regression. Could you talk a
    little bit about that for logistic regression?
    That is, what are the assumptions, and what are
    the tests?
  • 2. Already seen this question twice: where does 30
    come from anyway?