Title: Application of Proper Scoring Rules to Cost-Weighted Classification
1. Application of Proper Scoring Rules to Cost-Weighted Classification
- Yi Shen
- Department of Statistics
- University of Pennsylvania
2. Outline
- Introduction to cost-weighted classification.
- Review of logistic regression and potential problems.
- New weighting schemes using tailored loss functions (proper scoring rules).
- Application to real-world data sets.
- From logistic regression to boosting.
- Conclusion.
3. Cost-Weighted Classification
- In a two-class classification problem, we observe (x_1, y_1), ..., (x_n, y_n). The response Y takes two values, for example, diseased or not, default or not. For the purpose of modeling, we label the y's as 0 or 1. The x_i are features of the i-th subject, for example, education, capital gain, etc.
- The goal is to predict the value of the response given the values of the features. Thus one needs to find a classification rule, f(x), that divides the feature space into two parts.
4. Cost-Weighted Classification
- Given a classification rule, one might make two types of mistakes:
- When the true class label is 0 but it is classified as 1, we call this a class 0 misclassification.
- When the true class label is 1 but it is classified as 0, we call this a class 1 misclassification.
5. Cost-Weighted Classification
- In a cost-weighted classification problem, we associate different costs with the two types of misclassification. For example, it is more expensive to miss a bankruptcy or a severely diseased patient.
- Let c_1 be the cost of a class 1 misclassification and c_0 the cost of a class 0 misclassification.
- In ordinary classification, c_1 and c_0 are set equal.
- Now the goal is to find a classification rule such that the overall expected cost is minimized.
6. The Optimal Classification Rule (Bayes Rule)
- Theoretically, the best classification is made according to the posterior probability p(Y = 1 | x), which we write as p(x) from now on.
- For a cost-weighted classification problem, the optimal classification rule is to classify as class 1 if p(x) > c_0 / (c_1 + c_0), and otherwise to classify as class 0. (A short derivation follows this slide.)
- W.l.o.g. assume c_1 + c_0 = 1. Hence we are interested in estimating the boundary
  p(x) = c_0.
- In reality, p(x) is unknown and needs to be estimated from data. We denote the estimate of p(x) by q(x).
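The threshold comes from comparing conditional expected costs, a step the slide leaves implicit: classifying x as 1 incurs cost c_0 with probability 1 - p(x), while classifying it as 0 incurs cost c_1 with probability p(x). Class 1 is therefore the cheaper prediction exactly when

  c_1 p(x) > c_0 (1 - p(x)),   i.e.,   p(x) > c_0 / (c_1 + c_0).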
7. Logistic Regression
- Logistic regression is a generalized linear model in which
  log( p(x) / (1 - p(x)) ) = x'β.
- The above transformation is called the logistic link.
- The link function maps the probability scale to the modeling scale.
- Logistic regression minimizes the negative log-likelihood of a Bernoulli distribution (log-loss):
  Σ_i [ -y_i log q(x_i) - (1 - y_i) log(1 - q(x_i)) ].
8. Logistic Regression
- Logistic regression is equivalent to an iteratively reweighted least squares (IRLS), which at each iteration minimizes
  Σ_i w_i (z_i - x_i'β)^2,
  where w_i = q(x_i)(1 - q(x_i)) is the estimated variance of a Bernoulli distributed variable and z_i is the working response (the exact quantities are recalled below).
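For reference, the standard IRLS quantities; the working response and update below are the usual Newton-Raphson ones, reconstructed here rather than copied from the slide:

  z_i = x_i'β_old + (y_i - q(x_i)) / ( q(x_i)(1 - q(x_i)) ),   w_i = q(x_i)(1 - q(x_i)),
  β_new = (X'WX)^{-1} X'Wz,   W = diag(w_1, ..., w_n).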
9. Logistic Regression
- The data points at the two tails (q near 0 or 1) are weighted more heavily by the loss than those in the middle.
10. Logistic Regression
- When the model assumption is not true, the weighting scheme of logistic regression may suffer.
- D. Hand et al. (2003) illustrate this problem with an artificial example in which the true p(x) is not globally linear in x, yet every level boundary p(x) = c is linear.
11. Hand's Example
- 50% class 1 and 50% class 0.
- Linear functions can describe all boundaries p(x) = c.
- But NO single linear function can describe all of them!
12. Hand's Example
- In Hand's example,
  p(x) = X_2 / (X_1 + X_2).
- p(x) = c implies
  (1 - c) X_2 = c X_1,
  a straight line through the origin. (A simulation sketch follows.)
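For concreteness, a small sketch that simulates data of this type; the uniform feature distribution is my assumption (it yields the 50/50 class balance by symmetry), not a detail from the talk:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000                                   # training set size used later
    X = rng.uniform(size=(n, 2))               # features X1, X2 (assumed uniform)
    p = X[:, 1] / (X[:, 0] + X[:, 1])          # true p(x) = X2 / (X1 + X2)
    y = rng.binomial(1, p)                     # class labels
    # every level set p(x) = c is the line (1 - c) X2 = c X1,
    # but p(x) itself is not a logistic-linear function of (X1, X2)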
13. Hand's Example
- Logistic regression will suffer, since it tries to fit a single linear function to all the boundaries!
14. Hand's Example
Logistic regression has a problem estimating the boundaries p(x) = 0.3 and p(x) = 0.5.
15. Another Example
- 80% class 1 and 20% class 0.
- Logistic regression induces bias even in estimating p(x) = .5.
16. How to Fix It?
- Recall: each probability boundary is linear.
- Idea: tailor the fitted linear function to one specific level, and ignore the others!
17. Modification of Logistic Regression
- Recall: logistic regression upweights extreme q's.
- Idea: upweight the points with q(x) near c_0 instead!
18. New Weighting Schemes
- Our goal is to design an iteratively reweighted least squares procedure with new weights that concentrate near c_0.
19. New Weighting Schemes (Method 1)
- Let the weight function be
  ω(q) = q^(a-1) (1 - q)^(b-1),
  where a and b are tuning parameters.
- Nice: when a = b = 0, the weight function becomes 1/(q(1 - q)), which is equivalent to the weight in logistic regression.
20. New Weighting Schemes (Method 1)
- We focus the weight function on c_0 by constraining ω to concentrate around q = c_0:  (∗)
21. New Weighting Schemes (Method 1)
22. New Weighting Schemes (Method 1)
- For a, b > 0, the weight is (up to normalization) the density function of a Beta distribution with parameters a and b.
- The mean of the Beta distribution is a / (a + b).
- The constraint (∗) suggests setting the mean of the Beta distribution to c_0, i.e., a / (a + b) = c_0.
- As a and b go to infinity (with the mean held at c_0), the weight function puts all the mass on q = c_0. (A small code sketch follows.)
23. New Weighting Schemes (Method 2)
24. New Weighting Schemes (Method 2)
(Figure: Method 2 weight function for c_0 = .3.)
25. New Weighting Schemes (Method 2)
- When a = 0 and c_0 = .5, it is equivalent to the weight function in logistic regression.
- The weight function is a density when a is greater than 0; its mean is c_0.
- As a goes to infinity, the weight function puts all the mass on q = c_0.
26. Properties of the Weight Functions
- The weight functions define a family of loss functions that share a property, which we motivate now.
- Recall that in logistic regression we minimize the negative log-likelihood (log-loss)
  L(y, q) = -[ y log q + (1 - y) log(1 - q) ].
27. Properties of the Weight Functions
- Minimize the conditional expected loss with respect to q: for given p, logistic regression tries to minimize
  L(p | q) = -[ p log q + (1 - p) log(1 - q) ]
  over q. If no predictors are present, the solution should be q = p.
- Indeed, setting ∂L/∂q = -p/q + (1 - p)/(1 - q) = 0 implies that the minimum is achieved at q = p.
28. Properties of the Weight Functions
- Any loss L(y, q) whose conditional expected loss L(p | q) is minimized at q = p is called a proper scoring rule.
29. Structure of Proper Scoring Rules
- Let L(y, q) = y L_1(1 - q) + (1 - y) L_0(q), where
- L_1(1 - q) denotes the loss associated with classification into class 1; it is monotone decreasing in q.
- L_0(q) denotes the loss associated with classification into class 0; it is monotone increasing in q.
30. Structure of Proper Scoring Rules
- Assume the losses L_1(1 - q) and L_0(q) are differentiable. They form a proper scoring rule if and only if they satisfy
  q L_1'(1 - q) = (1 - q) L_0'(q)   for all q in (0, 1).
- The common value equals q(1 - q) ω(q), which defines the weight function ω(q) of the rule.
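As a quick check (not on the original slide), log-loss satisfies this condition: with L_1(1 - q) = -log q and L_0(q) = -log(1 - q),

  L_1'(1 - q) = 1/q,   L_0'(q) = 1/(1 - q),

so q L_1'(1 - q) = 1 = (1 - q) L_0'(q), and the weight function is ω(q) = 1/(q(1 - q)).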
31. Implementation of IRLS for the Tailored Loss
- We want to minimize the empirical loss
  Σ_i L(y_i, q(x_i)),
  where q(x) = 1 / (1 + exp(-x'β)) and L is the tailored loss defined by the new weights.
- The optimization is still equivalent to an IRLS, as sketched below.
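A minimal sketch of such a tailored IRLS, assuming the logistic link and the Method 1 Beta weights from above; the Gauss-Newton form of the working weights is my reconstruction of "IRLS-like", not code from the talk:

    import numpy as np

    def sigmoid(eta):
        return 1.0 / (1.0 + np.exp(-eta))

    def tailored_irls(X, y, c0, s, n_iter=50):
        """IRLS for a tailored loss (sketch; Method 1 Beta weights assumed).

        X : (n, d) design matrix (include a column of ones for an intercept)
        y : (n,) labels in {0, 1}
        c0, s : target level and concentration of the weight omega(q)
        """
        a, b = s * c0, s * (1.0 - c0)
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            eta = X @ beta
            q = np.clip(sigmoid(eta), 1e-6, 1 - 1e-6)
            omega = q ** (a - 1.0) * (1.0 - q) ** (b - 1.0)  # tailored weight
            v = q * (1.0 - q)                                # Bernoulli variance
            w = omega * v ** 2    # Gauss-Newton working weights (one natural choice)
            z = eta + (y - q) / v                            # working response
            # weighted least squares step: minimize sum_i w_i (z_i - x_i' beta)^2
            WX = X * w[:, None]
            beta = np.linalg.solve(X.T @ WX, WX.T @ z)
        return beta

    # s = 0 gives omega = 1/(q(1-q)), w = q(1-q): ordinary logistic regression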
32. Experiments
- Hand's example.
- Training set size: 2000.
33. Experiments
- Unbalanced design of Hand's example.
- Training set size: 2000.
34. Experiments (Pima Indians Diabetes)
- Pima Indians diabetes: 752 instances, 9 features, 35% 1's.
- Buja et al. (2001) found that BODY MASS and PLASMA are the two most dominant predictors.
- They also found that p(x) has characteristics similar to those in Hand's example.
35. Experiments
The highlights represent slices with near-constant estimated class 1 probability, c - ε < p̂ < c + ε. The values of c in the slices increase left to right and top to bottom. The class 1 probabilities were estimated with 20-nearest-neighbor estimates. Glyphs: open circles = no diabetes (class 0); filled circles = diabetes (class 1); small gray dots = points outside the slice.
36. Experiments
- Using IRLS with the tailored loss, we obtain:
37. Experiments
- German credit data: 1000 instances, 20 attributes; Y = good credit or not.
- For c_0 = 1/6, the 5-fold CV test error:
38. Experiments
- Adult income data: 14 features; response = annual income exceeds 50K or not (25% 1's).
- Training set size: 30562; test set size: 15060.
39. Experiments
40. Experiments
41. From Logistic Regression to Boosting
- Boosting, by Freund and Schapire (1996), has achieved great success in machine learning for classification problems.
- It combines a group of weak learners (classifiers) with a weighted voting scheme so that a strong learner (classifier) is obtained.
42. From Logistic Regression to Boosting
- Suppose the response is relabeled as -1 or +1, i.e., ỹ = 2y - 1. The popular version of the AdaBoost algorithm is (a code sketch follows):
- 1. Initialize F(x) = 0 and weights w_i = 1/n, i = 1, ..., n.
- 2. For t = 1, ..., T:
  - (a) Fit a weak classifier f_t(x) ∈ {-1, 1} to the weighted training data.
  - (b) Compute the weighted misclassification rate for f_t(x),
    err_t = Σ_i w_i 1{ỹ_i ≠ f_t(x_i)} / Σ_i w_i,  and set α_t = (1/2) log((1 - err_t) / err_t).
  - (c) Change the weights: w_i ← w_i exp( α_t 1{ỹ_i ≠ f_t(x_i)} ), then renormalize.
- 3. Final classifier: F(x) = sign( Σ_t α_t f_t(x) ).
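A compact sketch of the algorithm above (my illustration, using sklearn decision stumps as the weak learner):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, yt, T=100):
        """Discrete AdaBoost; yt holds labels in {-1, +1}."""
        n = len(yt)
        w = np.full(n, 1.0 / n)
        stumps, alphas = [], []
        for _ in range(T):
            f = DecisionTreeClassifier(max_depth=1).fit(X, yt, sample_weight=w)
            miss = (f.predict(X) != yt)
            err = np.clip(w[miss].sum() / w.sum(), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1.0 - err) / err)
            w *= np.exp(alpha * miss)      # upweight misclassified points
            w /= w.sum()                   # renormalize
            stumps.append(f)
            alphas.append(alpha)
        def F(Xnew):                       # weighted-vote final classifier
            scores = sum(a * f.predict(Xnew) for a, f in zip(alphas, stumps))
            return np.sign(scores)
        return F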
43. Boosting as Stagewise Additive Fitting
- In Friedman, Hastie, and Tibshirani (2000), boosting is viewed as a stagewise additive fitting which exploits IRLS-like steps to minimize the exponential loss
  Σ_i exp( -ỹ_i F(x_i) ).
- The update is
  F_new(x) = F_old(x) + α_t f_t(x),
  where usually f_t(x) is taken to be stumps or shallow trees. (See the derivation below.)
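A one-line minimization (implicit between the last two slides) shows why this stagewise view reproduces AdaBoost's α_t: with w_i = exp(-ỹ_i F_old(x_i)),

  Σ_i exp( -ỹ_i (F_old(x_i) + α f_t(x_i)) ) = e^{-α} Σ_{ỹ_i = f_t(x_i)} w_i + e^{α} Σ_{ỹ_i ≠ f_t(x_i)} w_i,

and setting the derivative in α to zero gives α_t = (1/2) log((1 - err_t) / err_t).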
44. Boosting as Stagewise Additive Fitting
- Boosting is a stagewise fitting, as opposed to stepwise fitting:
- In stagewise fitting, F_old is not updated when a new term is added, while in stepwise fitting it is refit.
45. Exponential Loss Can Be Mapped to a Proper Scoring Rule
- The exponential loss can be decomposed into a proper scoring rule and a link function (spelled out below).
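Reconstructing the formula from Friedman, Hastie, and Tibshirani (2000), since the slide's equation did not survive extraction: under the half-logit link F(x) = (1/2) log( q / (1 - q) ),

  exp( -ỹ F(x) ) = sqrt( (1 - q) / q ) if ỹ = +1,   sqrt( q / (1 - q) ) if ỹ = -1,

so L_1(1 - q) = sqrt((1 - q)/q) and L_0(q) = sqrt(q/(1 - q)) form a proper scoring rule, with weight function ω(q) ∝ ( q(1 - q) )^{-3/2} (the Beta family with a = b = -1/2).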
46. LogitBoost
- FHT (2000) proposed a stagewise additive fitting algorithm, LogitBoost, that minimizes the negative log-likelihood instead (its Newton step is recalled below).
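For reference, the LogitBoost Newton step from FHT (2000), not shown on the slide: with q(x) = 1 / (1 + exp(-2F(x))), each stage fits f_t by weighted least squares of a working response on x and updates F:

  z_i = (y_i - q(x_i)) / ( q(x_i)(1 - q(x_i)) ),   w_i = q(x_i)(1 - q(x_i)),   F ← F + (1/2) f_t.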
47. Cost-Weighted Boosting
- Again, for cost-weighted classification, we can optimize the tailored losses with stagewise additive fitting.
48. Experiments
- Artificial data: 2000 instances.
- Class 1 and class 0 have equal marginal proportions, i.e., 50% class 1 and 50% class 0.
- Suppose we are interested in estimating p(x) = .3.
49. Experiments
(Figure panels: log-loss, Method 1, Method 2.)
50. Experiments
(Figure panels: log-loss, Method 1, Method 2.)
51. Experiments
- Test set size: 10000; test error:
52. Experiments
- Pima Indians diabetes (using all features).
(Figure panels: c_0 = .2 and c_0 = .8.)
53. Experiments
54. Experiments
(Figure panels: c_0 = .8 and c_0 = .2.)
55. Conclusion
- Logistic regression has potential problems for cost-weighted classification.
- Reweighting with proper scoring rules helps us stay focused on the classification boundary of interest.
- Reweighting can be extended to semiparametric modeling such as boosting.