Transcript and Presenter's Notes

Title: Logistic Regression


1
Logistic Regression
  • Rong Jin

2
Logistic Regression Model
  • In Gaussian generative model
  • Generalize the ratio to a linear model
  • Parameters: w and c

4
Logistic Regression Model
  • The log-ratio of the positive class to the negative
    class
  • Results (written out below)
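In standard notation (a sketch, with weight vector w and offset c as the parameters), the linear log-ratio model is

  \log\frac{p(y=+1 \mid x)}{p(y=-1 \mid x)} = w \cdot x + c,
  \qquad
  p(y=+1 \mid x) = \frac{1}{1 + \exp\big(-(w \cdot x + c)\big)}.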

6
Logistic Regression Model
  • Assume the inputs and outputs are related through a
    log-linear function
  • Estimate weights: MLE approach (sketched below)
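A sketch of the MLE objective referred to here, assuming training data D = {(x_i, y_i)} with labels y_i in {+1, -1}:

  l(D) = \sum_i \log p(y_i \mid x_i)
       = \sum_i \log \sigma\big(y_i (w \cdot x_i + c)\big)
       = -\sum_i \log\Big(1 + \exp\big(-y_i (w \cdot x_i + c)\big)\Big),

where \sigma(z) = 1/(1 + e^{-z}); the weights are chosen to maximize l(D).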

7
Example 1: Heart Disease
  Age group id: 1 = 25-29, 2 = 30-34, 3 = 35-39, 4 = 40-44,
  5 = 45-49, 6 = 50-54, 7 = 55-59, 8 = 60-64
  • Input feature x: age group id
  • Output y: having heart disease or not
  • +1: having heart disease
  • -1: no heart disease

8
Example 1: Heart Disease
  • Logistic regression model
  • Learning w and c: MLE approach
  • Numerical optimization: w = 0.58, c = -3.34

9
Example 1: Heart Disease
  • w = 0.58
  • An older person is more likely to have heart
    disease
  • c = -3.34
  • x·w + c < 0 ⇒ p(+|x) < p(-|x)
  • x·w + c > 0 ⇒ p(+|x) > p(-|x)
  • x·w + c = 0 ⇒ decision boundary
  • x = 5.78 ⇒ about 53 years old (see the check below)
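As a quick check, the boundary follows from setting the linear score to zero; with the rounded estimates above,

  x w + c = 0 \;\Rightarrow\; x = -c/w = 3.34 / 0.58 \approx 5.76,

which agrees with the reported boundary of about 5.78 up to the rounding of w and c.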

10
Naïve Bayes Solution
  • Inaccurate fitting
  • Non-Gaussian distribution
  • i = 5.59 (decision boundary)
  • Close to the estimate from logistic regression
  • Even though naïve Bayes does not fit the input
    patterns well, it still works fine for the
    decision boundary

11
Problems with Using Histogram Data?
12
Uneven Sampling for Different Ages
13
Solution
w = 0.63, c = -3.56 ⇒ i = 5.65
14
Solution
w = 0.63, c = -3.56 ⇒ i = 5.65 < 5.78
16
Example 2: Text Classification
  • Learn to classify text into predefined categories
  • Input x: a document
  • Represented by a vector of words
  • Example: (president, 10), (bush, 2), (election,
    5), ...
  • Output y: whether the document is about politics or
    not
  • +1 for a political document, -1 for a non-political
    document
  • Training data

17
Example 2: Text Classification
  • Logistic regression model
  • Every term ti is assigned a weight wi
  • Learning parameters: MLE approach
  • Need numerical solutions (see the sketch below)
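A minimal sketch of this kind of fit using scikit-learn; the documents, labels, and settings below are hypothetical illustrations, not the data or configuration behind these slides:

# Minimal sketch: logistic regression on bag-of-words counts.
# The documents and labels are hypothetical, not the slides' data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "president bush election campaign vote",    # political
    "stock market earnings quarterly profit",   # not political
    "senate election debate president policy",  # political
    "championship game score team coach",       # not political
]
y = [1, -1, 1, -1]

# Represent every document by its vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit p(y|x); scikit-learn applies L2 regularization by default,
# in the spirit of the regularization slides later on.
clf = LogisticRegression()
clf.fit(X, y)

# Every term t_i gets a weight w_i; positive weights are evidence
# for the political (+1) class.
for term, w in zip(vectorizer.get_feature_names_out(), clf.coef_[0]):
    print(f"{term}: {w:+.3f}")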

19
Example 2: Text Classification
  • Weight wi
  • wi > 0: term ti is positive evidence
  • wi < 0: term ti is negative evidence
  • wi = 0: term ti is irrelevant to the category of the
    document
  • The larger wi is, the more important term ti is in
    determining whether the document is interesting
  • Threshold c

21
Example 2: Text Classification
  • Dataset: Reuters-21578
  • Classification accuracy
  • Naïve Bayes: 77%
  • Logistic regression: 88%

22
Why Does Logistic Regression Work Better for Text
Classification?
  • Optimal linear decision boundary
  • Generative model
  • Weight: log p(w|+) - log p(w|-)
  • Sub-optimal weights
  • Independence assumption
  • Naïve Bayes assumes that each word is generated
    independently
  • Logistic regression is able to take the correlations
    among words into account

23
Discriminative Model
  • The logistic regression model is a discriminative
    model
  • Models the conditional probability p(y|x), i.e.,
    the decision boundary
  • Gaussian generative model
  • Models p(x|y), i.e., the input patterns of different
    classes

24
Comparison
  • Generative Model
  • Models P(x|y)
  • Models the input patterns
  • Usually converges fast
  • Cheap computation
  • Robust to noisy data
  • But usually performs worse
  • Discriminative Model
  • Models P(y|x) directly
  • Models the decision boundary
  • Usually good performance
  • But slow convergence
  • Expensive computation
  • Sensitive to noisy data

26
A Few Words about Optimization
  • Convex objective function
  • The solution could be non-unique (the objective is
    convex, but not necessarily strictly convex)

27
Problems with Logistic Regression?
How about words that appear in only one class?
28
Overfitting Problem with Logistic Regression
  • Consider a word t that appears in only one document
    d, and d is a positive document. Let w be its
    associated weight
  • Consider the derivative of l(Dtrain) with respect
    to w
  • w will be driven to infinity! (see the sketch below)
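A sketch of why the weight diverges, assuming labels y_i in {+1, -1} and writing x_{d,t} for the count of term t in document d:

  \frac{\partial l(D_{train})}{\partial w}
    = \sum_i y_i \, x_{i,t} \Big(1 - \sigma\big(y_i (w \cdot x_i + c)\big)\Big)
    = x_{d,t} \Big(1 - \sigma(w \cdot x_d + c)\Big) > 0,

because every other document has x_{i,t} = 0. The derivative stays positive no matter how large w becomes, so maximizing the likelihood keeps pushing w upward without bound.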

30
Example of Overfitting for LogRes
(Plot: the classification accuracy on the test data
decreases as the number of training iterations grows.)
31
Solution: Regularization
  • Regularized log-likelihood (written out below)
  • The term s‖w‖² is called the regularizer
  • Favors small weights
  • Prevents weights from becoming too large
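Written out (a sketch, with a regularization constant s > 0 as on the following slides), the regularized log-likelihood is

  l_{reg}(D_{train}) = \sum_i \log \sigma\big(y_i (w \cdot x_i + c)\big) - s \, \|w\|^2.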

32
The Rare Word Problem
  • Consider a word t that appears in only one document
    d, and d is a positive document. Let w be its
    associated weight

33
The Rare Word Problem
  • Consider the derivative of l(Dtrain) with respect
    to w
  • When s is small, the derivative is still positive
  • But it becomes negative when w is large

34
The Rare Word Problem
  • Consider the derivative of l(Dtrain) with respect
    to w (written out below)
  • When w is small, the derivative is still positive
  • But it becomes negative when w is large
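In the single-rare-word setting above, the derivative of the regularized objective with respect to that weight has the form (same notation as before)

  \frac{\partial}{\partial w}\Big(l(D_{train}) - s\,w^2\Big)
    = x_{d,t}\Big(1 - \sigma(w \cdot x_d + c)\Big) - 2\,s\,w.

The first term is bounded by x_{d,t}, so the derivative is positive while w is small but turns negative once w grows large; the weight therefore settles at a finite value instead of diverging.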

35
Regularized Logistic Regression
36
Interpretation of the Regularizer
  • Many interpretations of the regularizer
  • Bayesian statistics: model prior
  • Statistical learning: minimize the generalization
    error
  • Robust optimization: min-max solution

37
Regularizer: Robust Optimization
  • Assume each data point is unknown but bounded within
    a sphere of radius s centered at xi
  • Find the classifier w that is able to classify
    the unknown-but-bounded data points with high
    classification confidence (sketched below)
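A hedged sketch of the min-max problem this refers to, with each input perturbed by an unknown \delta_i bounded by \|\delta_i\| \le s:

  \max_{w,\,c} \; \sum_i \; \min_{\|\delta_i\| \le s} \log \sigma\Big(y_i\big(w \cdot (x_i + \delta_i) + c\big)\Big),
  \qquad
  \min_{\|\delta_i\| \le s} y_i\big(w \cdot (x_i + \delta_i) + c\big)
    = y_i (w \cdot x_i + c) - s\,\|w\|.

The worst-case perturbation subtracts s\|w\| from every margin, so large weights are penalized in the same spirit as the regularizer above.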

38
Sparse Solution
  • What does the solution of regularized logistic
    regression look like?
  • A sparse solution
  • Most weights are small and close to zero

40
Why Do We Need a Sparse Solution?
  • Two types of solutions
  • 1. Many non-zero weights, but many of them are small
  • 2. Only a small number of non-zero weights, and many
    of them are large
  • Occam's Razor: the simpler, the better
  • A simpler model that fits the data is unlikely to be
    a coincidence
  • A complicated model that fits the data might be a
    coincidence
  • A smaller number of non-zero weights
  • ⇒ less evidence to consider
  • ⇒ a simpler model
  • ⇒ the second type of solution (case 2) is preferred

41
Occam's Razor
42
Occam's Razor: Power 1
43
Occam's Razor: Power 3
44
Occam's Razor: Power 10
45
Finding Optimal Solutions
  • Concave objective function
  • No local maximum
  • Many standard optimization algorithms work

46
Gradient Ascent
  • Maximize the log-likelihood by iteratively
    adjusting the parameters in small increments
  • In each iteration, we adjust w in the direction
    that increases the log-likelihood (the direction of
    the gradient); see the sketch below
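A minimal NumPy sketch of this procedure for the regularized objective; the step size, regularization constant, and toy data are illustrative assumptions, not values from the slides:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, s=0.1, eta=0.1, n_iters=1000):
    """Maximize sum_i log sigma(y_i (w.x_i + c)) - s * ||w||^2."""
    n, d = X.shape
    w, c = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + c)
        coef = y * (1.0 - sigmoid(margins))   # y_i * d/dz log sigma(z) at each margin
        grad_w = X.T @ coef - 2.0 * s * w     # gradient of the regularized log-likelihood
        grad_c = coef.sum()
        # Take a small step in the direction of the gradient.
        w += eta * grad_w
        c += eta * grad_c
    return w, c

# Hypothetical toy data: two features, labels in {+1, -1}.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, c = gradient_ascent(X, y)
print("w =", w, "c =", c)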

47
Graphical Illustration
No regularization case
49
When Should We Stop?
  • Log-likelihood will monotonically increase during
    the gradient ascent iterations
  • When should we stop?

51
When Should We Stop?
  • The gradient ascent learning method converges
    when there is no incentive to move the parameters
    in any particular direction, i.e., when the gradient
    of the objective is (close to) zero