Application of Proper Scoring Rules to Cost-Weighted Classification - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Application of Proper Scoring Rules to Cost-Weighted Classification


1
Application of Proper Scoring Rules to
Cost-Weighted Classification
  • Yi Shen
  • Department of Statistics
  • University of Pennsylvania

2
Outline
  • Introduction to cost-weighted classification.
  • Review of logistic regression and potential
    problems.
  • New weighting schemes by using tailored loss
    functions (proper scoring rules).
  • Application to real-world data sets.
  • From logistic regression to boosting.
  • Conclusion.

3
Cost-Weighted Classification
  • In a two-class classification problem, we observe
    (x1, y1), …, (xn, yn). The response Y takes two
    values, for example, diseased or not, default or
    not. For the purpose of modeling, we label the y's
    as 0 or 1. The xi are features of the ith subject,
    for example, education, capital gain, etc.
  • The goal is to predict the value of the response
    given the values of features. Thus one needs to
    find a classification rule, f(x), that divides
    the feature space into two parts.

4
Cost-Weighted Classification
  • Given a classification rule, one may make two
    types of mistakes:
  • When the true class label is 0 and the rule
    classifies it as 1; this is called
  • class 0 misclassification
  • When the true class label is 1 and the rule
    classifies it as 0; this is called
  • class 1 misclassification

5
Cost-Weighted Classification
  • In a cost-weighted classification problem, we
    associate different costs with the two types of
    misclassification. For example, it is more
    expensive to miss a bankruptcy or a severely
    diseased patient.
  • Let c1 be the cost for class 1 misclassification,
    c0 be the cost for class 0 misclassification.
  • In ordinary classification, c1 and c0 are set
    to be equal.
  • Now the goal is to find a classification rule
    such that the overall cost is minimized.

6
The Optimal Classification Rule (Bayes Rule)
  • Theoretically, the best classification is made
    according to the posterior probability p(Y = 1 | x),
    which we write p(x) from now on.
  • For a cost-weighted classification problem, the
    optimal classification rule is to classify as
    class 1 if p(x) > c0 / (c1 + c0), otherwise
    classify as class 0.
  • W.l.o.g. assume c1 + c0 = 1. Hence we are
    interested in estimating the boundary
  • p(x) = c0
  • In reality, p(x) is unknown and needs to be
    estimated from data. We denote the estimate of
    p(x) as q(x).
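The threshold above follows from comparing expected costs; a short standard derivation (sketched here, not reproduced verbatim from the slides):

\[
  \mathrm{E}[\text{cost} \mid \text{classify as } 1] = c_0\,(1 - p(x)), \qquad
  \mathrm{E}[\text{cost} \mid \text{classify as } 0] = c_1\, p(x),
\]
\[
  c_1\, p(x) > c_0\,(1 - p(x)) \iff p(x) > \frac{c_0}{c_0 + c_1},
\]

which reduces to p(x) > c0 when c1 + c0 = 1.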

7
Logistic Regression
  • Logistic regression is a generalized linear model
    in which log( q(x) / (1 − q(x)) ) = β0 + βᵀx.
  • The above transformation is called the logistic
    link.
  • The link function maps the probability scale to
    the modeling scale.
  • Logistic regression tries to minimize the
    negative log-likelihood of a Bernoulli distribution,
  • −Σi [ yi log q(xi) + (1 − yi) log(1 − q(xi)) ]

(Log-loss)
8
Logistic Regression
  • Logistic regression is equivalent to an
    iteratively reweighted least squares fit which
    minimizes a weighted sum of squares,
  • where q(x)(1 − q(x)) is the estimated variance
  • of a Bernoulli distributed variable.
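Below is a minimal numpy sketch of the standard IRLS (Newton) iteration for logistic regression; it illustrates the equivalence stated above. The function name, tolerance, and iteration cap are illustrative choices, not taken from the presentation.

import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares.

    Each step solves a weighted least-squares problem with working
    response z and weights q(1 - q); the weighted residual sum of
    squares equals sum_i (y_i - q_i)^2 / (q_i (1 - q_i)).
    """
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        eta = X @ beta                      # linear predictor
        q = 1.0 / (1.0 + np.exp(-eta))      # fitted probabilities
        w = q * (1.0 - q)                   # Bernoulli variance weights
        z = eta + (y - q) / w               # working response
        # Weighted least-squares update: beta = (X'WX)^{-1} X'Wz
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta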

9
Logistic Regression
  • The data points at the two tails are weighted
    more than those in the middle.

10
Logistic Regression
  • When the model assumption is not true, the
    weighting scheme from the logistic regression may
    suffer.
  • D. Hand et al. (2003) illustrate this
  • problem with an artificial example in
  • which the true p(x) is not globally linear
  • in x, yet the boundary p(x) = c is linear for all c.

11
Hand's Example
  • 50% class 1 and 50% class 0.
  • Linear functions
  • can describe all boundaries p(x) = c.
  • But NO single linear function
  • can describe all of them!

12
Hand's Example
  • In Hand's example,
  • p(x) = X2 / (X1 + X2).
  • p(x) = c implies
  • (1 − c) X2 = c X1
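A small simulation in the spirit of Hand's example; only the formula p(x) = X2 / (X1 + X2) comes from the slide, while the uniform distribution of (X1, X2) and the sample size are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0.0, 1.0, size=(n, 2))        # (X1, X2); distribution assumed
p = X[:, 1] / (X[:, 0] + X[:, 1])             # true p(x) = X2 / (X1 + X2)
y = rng.binomial(1, p)                        # class labels

# Each probability boundary p(x) = c is the straight line
# (1 - c) * X2 = c * X1 through the origin, so every boundary is linear,
# but no single linear function of (X1, X2) matches p(x) at all levels.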

13
Hand's Example
  • Logistic regression will suffer
  • since it tries to fit a single linear
  • function to the boundaries!

14
Hand's Example
Logistic regression has a problem estimating
p(x) = 0.3 and p(x) = 0.5.
15
Another Example
  • 80% class 1 and 20% class 0.
  • p(x) = 5 X2 / (X1 + 5 X2)
  • logistic regression induces bias even in
    estimating
  • p(x) = .5

16
How to fix it?
  • Recall: each probability boundary is linear.
  • Idea:
  • Tailor the fitted linear function to a
    specific level,
  • and ignore the others!

17
Modification of Logistic Regression
  • Recall: logistic regression upweights extreme q's.
  • Idea: upweight the points with q(x) near c0!

18
New Weighting Schemes
  • Our goal is to design iteratively reweighted
    least squares with new weights

19
New Weighting Schemes (Method 1)
  • Let
  • where
  • Nice: when a = b = 0, the weight function is
    equivalent to the weight in logistic regression.

20
New Weighting Schemes (Method 1)
  • We focus the weight function on c0 by

21
New Weighting Schemes (Method 1)
22
New Weighting Schemes (Method 1)
  • The weight is the density function of a Beta
    distribution with parameters a and b, where a
    and b are greater than 0.
  • The mean of the Beta distribution is a / (a + b).
  • The constraint above suggests setting the mean of
    the Beta distribution to c0.
  • As a and b go to infinity, the weight
    function puts
  • all the mass on q = c0.
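The slide pins down only that the Method 1 weight is a Beta(a, b) density whose mean equals c0. One way to parameterize that is sketched below; the concentration parameter k is an illustrative choice, not something specified in the slides.

import numpy as np
from scipy.stats import beta

def method1_weight(q, c0, k=10.0):
    """Beta-density weight centred at c0.

    a and b are chosen so the Beta mean a / (a + b) equals c0;
    larger k concentrates the weight around q = c0, and as
    k -> infinity all mass goes to q = c0, as stated on the slide.
    """
    a, b = k * c0, k * (1.0 - c0)
    return beta.pdf(q, a, b)

# Example: emphasize observations whose fitted probability is near c0 = 0.3
q_grid = np.linspace(0.01, 0.99, 5)
print(method1_weight(q_grid, c0=0.3))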

23
New Weighting Schemes (Method 2)
  • Scheme 2
  • where

24
New Weighting Schemes (Method 2)
c0 = .3
25
New Weighting Schemes (Method 2)
  • When a = 0 and c0 = .5, it is equivalent to the
    weight function in logistic regression.
  • The weight function is a density when a is
    greater than 0. Its mean is c0.
  • As a goes to infinity, the weight function puts
    all the mass on q = c0.

26
Properties of the Weight Functions
  • The weight functions define a family of loss
    functions that share a property which we motivate
    now.
  • Recall that in logistic regression we minimize
    the negative log-likelihood

27
Properties of the Weight Functions
  • Minimize the conditional expected loss w.r.t. q

For given p, logistic regression tries to
minimize the expected loss over q. If no predictors
are present, the solution should be
q = p.
  • Indeed, setting ∂L / ∂q = 0
  • implies that the minimum is achieved at q = p.
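For the log-loss this stationarity check is a one-line calculation:

\[
  R(q \mid p) = -\,p \log q - (1 - p)\log(1 - q), \qquad
  \frac{\partial R}{\partial q} = -\frac{p}{q} + \frac{1 - p}{1 - q} = 0
  \iff q = p .
\]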

28
Properties of the Weight Functions
  • Any loss L(p | q) that satisfies such a property
    is called a

proper scoring rule
29
Structures of Proper Scoring Rules
  • Let
  • L1(1 − q) denote the loss associated with
    classification into class 1; it is monotone
    decreasing in q.
  • L0(q) denote the loss associated with the
    classification into class 0; it is monotone
    increasing in q.

30
Structure of Proper Scoring Rules
  • Assume the losses L1(1-q) and L0(q) are
    differentiable. They form a proper scoring rule
    if and only if they satisfy
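A derivation consistent with the stationarity argument on the previous slides recovers the standard characterization (the notation on the original slide may differ):

\[
  R(q \mid p) = p\,L_1(1 - q) + (1 - p)\,L_0(q), \qquad
  \frac{\partial R}{\partial q}\Big|_{q = p} = -\,p\,L_1'(1 - p) + (1 - p)\,L_0'(p) = 0,
\]
\[
  \text{i.e.}\quad q\,L_1'(1 - q) = (1 - q)\,L_0'(q) \quad \text{for all } q \in (0, 1).
\]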

31
Implementation of the IRLS for the tailored loss
  • We want to minimize the empirical loss
  • where
  • The optimization is still equivalent to IRLS

32
Experiments
  • Hand's example.
  • Training set size: 2000

33
Experiments
  • Unbalanced design of Hand's example.
  • Training set size: 2000

34
Experiments (Pima Indians Diabetes)
  • Pima Indians diabetes: 752 instances, 9
    features, 35% 1's.
  • Buja et al. (2001) found that BODY MASS
  • and PLASMA are the two most
  • dominating predictors.
  • They also found that p(x) has a
  • characteristic similar to that in
  • Hand's example.

35
Experiments
The highlights represent slices with
near-constant estimated class 1 probability,
c − e < p < c + e. The values of c in the slices
increase left to right and top to bottom. The
class 1 probabilities were estimated with
20-nearest-neighbor estimates. Glyphs: open
circles = no diabetes (class 0); filled circles =
diabetes (class 1); small gray dots = points
outside the slice.
36
Experiments
  • Using IRLS with tailored loss, we have

37
Experiments
  • German credit data: 1000 instances, 20
    attributes, Y = good credit or not.
  • For c0 = 1/6, 5-fold CV test error:

38
Experiments
  • Adult income data: 14 features; response: annual
    income exceeds 50K or not (25% 1's).
  • Training set size: 30,562; test set size: 15,060.

39
Experiments
  • For c0 = .2,

40
Experiments
  • For c0 = .8,

41
From Logistic Regression to Boosting
  • Boosting, by Freund and Schapire (1996), has
    achieved great success in machine learning for
    classification problems.
  • It combines a group of weak learners
  • (classifiers) with a weighted voting
  • scheme so that a strong learner
  • (classifier) is obtained.

42
From Logistic Regression to Boosting
  • Suppose the response y is labeled −1 or +1, i.e.,
  • ỹ = 2y − 1. The popular version of the AdaBoost
    algorithm is (a compact sketch of this loop is
    given below):
  • 1. Initialize F(x) = 0 and weights wi = 1/n,
    i = 1, …, n.
  • 2. For t = 1, …, T:
  • (a) Fit a weak classifier ft(x) ∈ {−1, +1} to
    the weighted training data.
  • (b) Compute the weighted misclassification
    rate for ft(x).
  • (c) Update the weights.
  • 3. Final classifier: the sign of the weighted
    vote Σt αt ft(x).
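A compact sketch of this discrete AdaBoost loop with decision stumps as the weak learners; the scikit-learn stump and the default T are illustrative choices, not part of the original slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """Discrete AdaBoost for labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # 1. initialize w_i = 1/n
    learners, alphas = [], []
    for _ in range(T):                           # 2. for t = 1, ..., T
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # (a) weak classifier f_t
        pred = stump.predict(X)
        miss = (pred != y)
        err = np.clip(np.sum(w * miss), 1e-10, 1 - 1e-10)  # (b) weighted error
        alpha = 0.5 * np.log((1.0 - err) / err)  #     voting weight
        w = w * np.exp(alpha * miss)             # (c) upweight the mistakes
        w = w / w.sum()
        learners.append(stump)
        alphas.append(alpha)
    def classify(X_new):                         # 3. sign of the weighted vote
        votes = sum(a * f.predict(X_new) for a, f in zip(alphas, learners))
        return np.sign(votes)
    return classify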
43
Boosting As a stagewise Additive Fitting
  • In Friedman, Hastie, Tibshirani (2000), boosting
    is viewed as a stagewise additive fitting
    which exploits IRLS-like steps to minimize the
    exponential loss (the standard forms are given
    below).
  • The update is
  • where usually ft(x) is taken to be a stump or a
    shallow tree.
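The standard forms being referred to (following Friedman, Hastie, Tibshirani 2000), with ỹ the ±1-coded response:

\[
  \sum_{i=1}^{n} \exp\!\big(-\tilde y_i\, F(x_i)\big), \qquad
  F_t(x) = F_{t-1}(x) + \alpha_t\, f_t(x),
\]

with f_t and α_t chosen at stage t to decrease the loss while F_{t−1} is held fixed.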

44
Boosting As a stagewise Additive Fitting
  • Boosting is a stagewise fitting as opposed to a
    stepwise fitting:
  • in stagewise fitting, F_old is not re-fitted when
    a new term is added,
  • while in stepwise fitting it would be.

45
Exponential Loss Can Be Mapped to A Proper
Scoring Rule
  • Exponential loss can be decomposed into a proper
    scoring rule and a link function.
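One way to write this decomposition, consistent with FHT (2000), where the population minimizer of the exponential loss is F = ½ log(q / (1 − q)):

\[
  F = \tfrac{1}{2} \log\frac{q}{1 - q}
  \;\Longrightarrow\;
  L_1(1 - q) = e^{-F} = \sqrt{\frac{1 - q}{q}}, \qquad
  L_0(q) = e^{F} = \sqrt{\frac{q}{1 - q}},
\]

and one can check that q L1'(1 − q) = (1 − q) L0'(q), so the pair satisfies the proper-scoring-rule condition above.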

46
Logit-Boost
  • FHT (2000) proposed a stagewise additive fitting
    algorithm to minimize the negative log-likelihood.
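A sketch of the two-class LogitBoost step as described in FHT (2000); the half-step and the working response below follow their parameterization:

\[
  q(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}, \qquad
  z_i = \frac{y_i^{*} - q(x_i)}{q(x_i)\,(1 - q(x_i))}, \qquad
  w_i = q(x_i)\,(1 - q(x_i)),
\]

with y* in {0, 1}: fit f_t(x) by weighted least squares of z_i on x_i, then update F(x) ← F(x) + ½ f_t(x).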

47
Cost-weighted Boosting
  • Again, for cost-weighted classification, we could
    optimize the tailored losses with

48
Experiments
  • Artificial data: 2000 instances.
  • Class 1 and class 0 have equal margins, i.e.,
    there are 50% class 1 and 50% class 0.
  • Suppose we are interested in estimating
  • p(x) = .3.

49
Experiments
  • t = 50.

Panels: Log-loss, Method 1, Method 2
50
Experiments
  • t = 300, overfitting.

Panels: Log-loss, Method 1, Method 2
51
Experiments
  • Test set size: 10,000; test error as a function
    of t.
52
Experiments
  • Pima Indians diabetes (using all features)

Panels: c0 = .2 and c0 = .8
53
Experiments
  • German Credit data

54
Experiments
  • Adult income

Panels: c0 = .8 and c0 = .2
55
Conclusion
  • Logistic regression has potential problems for
    cost-weighted classification.
  • Reweighting with proper scoring rules
  • helps us stay more focused on the
  • classification boundary.
  • Reweighting could be extended to
  • semiparametric modeling like boosting.