Title: Application of Proper Scoring Rules to Cost-Weighted Classification
1. Application of Proper Scoring Rules to Cost-Weighted Classification
- Yi Shen
- Department of Statistics
- University of Pennsylvania
2. Outline
- Introduction to cost-weighted classification.
- Review of logistic regression and potential problems.
- New weighting schemes using tailored loss functions (proper scoring rules).
- Application to real-world data sets.
- From logistic regression to boosting.
- Conclusion.
3. Cost-Weighted Classification
- In a two-class classification problem, we observe (x_1, y_1), ..., (x_n, y_n). The response Y takes two values, for example, diseased or not, default or not. For the purpose of modeling, we label the y's as 0 or 1. The x_i are features of the i-th subject, for example, education, capital gain, etc.
- The goal is to predict the value of the response given the values of the features. Thus one needs to find a classification rule, f(x), that divides the feature space into two parts.
4. Cost-Weighted Classification
- Given a classification rule, one might make two types of mistakes:
- When the true class label is 0 but it is classified as 1, we call this a class 0 misclassification.
- When the true class label is 1 but it is classified as 0, we call this a class 1 misclassification.
5. Cost-Weighted Classification
- In a cost-weighted classification problem, we associate different costs with the two types of misclassification. For example, it is more expensive to miss a bankruptcy or a severely diseased patient.
- Let c_1 be the cost of a class 1 misclassification and c_0 the cost of a class 0 misclassification.
- In ordinary classification, c_1 and c_0 are set equal.
- Now the goal is to find a classification rule such that the overall expected cost is minimized.
6. The Optimal Classification Rule (Bayes Rule)
- Theoretically, the best classification is made according to the posterior probability p(Y = 1 | x), which we write as p(x) from now on.
- For a cost-weighted classification problem, the optimal classification rule is to classify as class 1 if p(x) > c_0 / (c_1 + c_0), and otherwise to classify as class 0. (A short derivation follows this slide.)
- W.l.o.g. assume c_1 + c_0 = 1. Hence we are interested in estimating the boundary
  p(x) = c_0.
- In reality, p(x) is unknown and needs to be estimated from data. We denote the estimate of p(x) by q(x).
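The threshold comes from comparing conditional expected costs, a step the slide leaves implicit: classifying x as 1 incurs cost c_0 with probability 1 - p(x), while classifying it as 0 incurs cost c_1 with probability p(x). Class 1 is therefore the cheaper prediction exactly when

  c_1 p(x) > c_0 (1 - p(x)),   i.e.,   p(x) > c_0 / (c_1 + c_0).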
7. Logistic Regression
- Logistic regression is a generalized linear model in which
  log( p(x) / (1 - p(x)) ) = x'β.
- The above transformation is called the logistic link.
- The link function maps the probability scale to the modeling scale.
- Logistic regression minimizes the negative log-likelihood of a Bernoulli distribution (log-loss):
  Σ_i [ -y_i log q(x_i) - (1 - y_i) log(1 - q(x_i)) ].
8. Logistic Regression
- Logistic regression is equivalent to an iteratively reweighted least squares (IRLS), which at each iteration minimizes
  Σ_i w_i (z_i - x_i'β)^2,
  where w_i = q(x_i)(1 - q(x_i)) is the estimated variance of a Bernoulli distributed variable and z_i is the working response (the exact quantities are recalled below).
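For reference, the standard IRLS quantities; the working response and update below are the usual Newton-Raphson ones, reconstructed here rather than copied from the slide:

  z_i = x_i'β_old + (y_i - q(x_i)) / ( q(x_i)(1 - q(x_i)) ),   w_i = q(x_i)(1 - q(x_i)),
  β_new = (X'WX)^{-1} X'Wz,   W = diag(w_1, ..., w_n).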
9. Logistic Regression
- The data points at the two tails (q near 0 or 1) are weighted more heavily by the loss than those in the middle.
10. Logistic Regression
- When the model assumption is not true, the weighting scheme of logistic regression may suffer.
- D. Hand et al. (2003) illustrate this problem with an artificial example in which the true p(x) is not globally linear in x, yet every level boundary p(x) = c is linear.
11. Hand's Example
- 50% class 1 and 50% class 0.
- Linear functions can describe all boundaries p(x) = c.
- But NO single linear function can describe all of them!
12. Hand's Example
- In Hand's example,
  p(x) = X_2 / (X_1 + X_2).
- p(x) = c implies
  (1 - c) X_2 = c X_1,
  a straight line through the origin. (A simulation sketch follows.)
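For concreteness, a small sketch that simulates data of this type; the uniform feature distribution is my assumption (it yields the 50/50 class balance by symmetry), not a detail from the talk:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000                                   # training set size used later
    X = rng.uniform(size=(n, 2))               # features X1, X2 (assumed uniform)
    p = X[:, 1] / (X[:, 0] + X[:, 1])          # true p(x) = X2 / (X1 + X2)
    y = rng.binomial(1, p)                     # class labels
    # every level set p(x) = c is the line (1 - c) X2 = c X1,
    # but p(x) itself is not a logistic-linear function of (X1, X2)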
13. Hand's Example
- Logistic regression will suffer, since it tries to fit a single linear function to all the boundaries!
14. Hand's Example
Logistic regression has a problem estimating the boundaries p(x) = 0.3 and p(x) = 0.5.
15. Another Example
- 80% class 1 and 20% class 0.
- Logistic regression induces bias even in estimating p(x) = .5.
16. How to Fix It?
- Recall: each probability boundary is linear.
- Idea: tailor the fitted linear function to one specific level, and ignore the others!
17. Modification of Logistic Regression
- Recall: logistic regression upweights extreme q's.
- Idea: upweight the points with q(x) near c_0 instead!
18. New Weighting Schemes
- Our goal is to design an iteratively reweighted least squares procedure with new weights that concentrate near c_0.
19. New Weighting Schemes (Method 1)
- Let the weight function be
  ω(q) = q^(a-1) (1 - q)^(b-1),
  where a and b are tuning parameters.
- Nice: when a = b = 0, the weight function becomes 1/(q(1 - q)), which is equivalent to the weight in logistic regression.
20. New Weighting Schemes (Method 1)
- We focus the weight function on c_0 by constraining ω to concentrate around q = c_0:  (∗)
21. New Weighting Schemes (Method 1)
22. New Weighting Schemes (Method 1)
- For a, b > 0, the weight is (up to normalization) the density function of a Beta distribution with parameters a and b.
- The mean of the Beta distribution is a / (a + b).
- The constraint (∗) suggests setting the mean of the Beta distribution to c_0, i.e., a / (a + b) = c_0.
- As a and b go to infinity (with the mean held at c_0), the weight function puts all the mass on q = c_0. (A small code sketch follows.)
23. New Weighting Schemes (Method 2)
24. New Weighting Schemes (Method 2)
(Figure: Method 2 weight function for c_0 = .3.)
25. New Weighting Schemes (Method 2)
- When a = 0 and c_0 = .5, it is equivalent to the weight function in logistic regression.
- The weight function is a density when a is greater than 0; its mean is c_0.
- As a goes to infinity, the weight function puts all the mass on q = c_0.
26. Properties of the Weight Functions
- The weight functions define a family of loss functions that share a property, which we motivate now.
- Recall that in logistic regression we minimize the negative log-likelihood (log-loss)
  L(y, q) = -[ y log q + (1 - y) log(1 - q) ].
27. Properties of the Weight Functions
- Minimize the conditional expected loss with respect to q: for given p, logistic regression tries to minimize
  L(p | q) = -[ p log q + (1 - p) log(1 - q) ]
  over q. If no predictors are present, the solution should be q = p.
- Indeed, setting ∂L/∂q = -p/q + (1 - p)/(1 - q) = 0 implies that the minimum is achieved at q = p.
28. Properties of the Weight Functions
- Any loss L(y, q) whose conditional expected loss L(p | q) is minimized at q = p is called a proper scoring rule.
29. Structure of Proper Scoring Rules
- Let L(y, q) = y L_1(1 - q) + (1 - y) L_0(q), where
- L_1(1 - q) denotes the loss associated with classification into class 1; it is monotone decreasing in q.
- L_0(q) denotes the loss associated with classification into class 0; it is monotone increasing in q.
30. Structure of Proper Scoring Rules
- Assume the losses L_1(1 - q) and L_0(q) are differentiable. They form a proper scoring rule if and only if they satisfy
  q L_1'(1 - q) = (1 - q) L_0'(q)   for all q in (0, 1).
- The common value equals q(1 - q) ω(q), which defines the weight function ω(q) of the rule.
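As a quick check (not on the original slide), log-loss satisfies this condition: with L_1(1 - q) = -log q and L_0(q) = -log(1 - q),

  L_1'(1 - q) = 1/q,   L_0'(q) = 1/(1 - q),

so q L_1'(1 - q) = 1 = (1 - q) L_0'(q), and the weight function is ω(q) = 1/(q(1 - q)).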
31. Implementation of IRLS for the Tailored Loss
- We want to minimize the empirical loss
  Σ_i L(y_i, q(x_i)),
  where q(x) = 1 / (1 + exp(-x'β)) and L is the tailored loss defined by the new weights.
- The optimization is still equivalent to an IRLS, as sketched below.
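A minimal sketch of such a tailored IRLS, assuming the logistic link and the Method 1 Beta weights from above; the Gauss-Newton form of the working weights is my reconstruction of "IRLS-like", not code from the talk:

    import numpy as np

    def sigmoid(eta):
        return 1.0 / (1.0 + np.exp(-eta))

    def tailored_irls(X, y, c0, s, n_iter=50):
        """IRLS for a tailored loss (sketch; Method 1 Beta weights assumed).

        X : (n, d) design matrix (include a column of ones for an intercept)
        y : (n,) labels in {0, 1}
        c0, s : target level and concentration of the weight omega(q)
        """
        a, b = s * c0, s * (1.0 - c0)
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            eta = X @ beta
            q = np.clip(sigmoid(eta), 1e-6, 1 - 1e-6)
            omega = q ** (a - 1.0) * (1.0 - q) ** (b - 1.0)  # tailored weight
            v = q * (1.0 - q)                                # Bernoulli variance
            w = omega * v ** 2    # Gauss-Newton working weights (one natural choice)
            z = eta + (y - q) / v                            # working response
            # weighted least squares step: minimize sum_i w_i (z_i - x_i' beta)^2
            WX = X * w[:, None]
            beta = np.linalg.solve(X.T @ WX, WX.T @ z)
        return beta

    # s = 0 gives omega = 1/(q(1-q)), w = q(1-q): ordinary logistic regression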
32. Experiments
- Hand's example.
- Training set size: 2000.
33. Experiments
- Unbalanced design of Hand's example.
- Training set size: 2000.
34. Experiments (Pima Indians Diabetes)
- Pima Indians diabetes: 752 instances, 9 features, 35% 1's.
- Buja et al. (2001) found that BODY MASS and PLASMA are the two most dominant predictors.
- They also found that p(x) has characteristics similar to those in Hand's example.
35. Experiments
The highlights represent slices with near-constant estimated class 1 probability, c - ε < p̂ < c + ε. The values of c in the slices increase left to right and top to bottom. The class 1 probabilities were estimated with 20-nearest-neighbor estimates. Glyphs: open circles = no diabetes (class 0); filled circles = diabetes (class 1); small gray dots = points outside the slice.
36. Experiments
- Using IRLS with the tailored loss, we obtain:
37. Experiments
- German credit data: 1000 instances, 20 attributes; Y = good credit or not.
- For c_0 = 1/6, the 5-fold CV test error:
38. Experiments
- Adult income data: 14 features; response = annual income exceeds 50K or not (25% 1's).
- Training set size: 30562; test set size: 15060.
39. Experiments
40. Experiments
41. From Logistic Regression to Boosting
- Boosting, by Freund and Schapire (1996), has achieved great success in machine learning for classification problems.
- It combines a group of weak learners (classifiers) with a weighted voting scheme so that a strong learner (classifier) is obtained.
42. From Logistic Regression to Boosting
- Suppose the response is relabeled as -1 or +1, i.e., ỹ = 2y - 1. The popular version of the AdaBoost algorithm is (a code sketch follows):
- 1. Initialize F(x) = 0 and weights w_i = 1/n, i = 1, ..., n.
- 2. For t = 1, ..., T:
  - (a) Fit a weak classifier f_t(x) ∈ {-1, 1} to the weighted training data.
  - (b) Compute the weighted misclassification rate for f_t(x),
    err_t = Σ_i w_i 1{ỹ_i ≠ f_t(x_i)} / Σ_i w_i,  and set α_t = (1/2) log((1 - err_t) / err_t).
  - (c) Change the weights: w_i ← w_i exp( α_t 1{ỹ_i ≠ f_t(x_i)} ), then renormalize.
- 3. Final classifier: F(x) = sign( Σ_t α_t f_t(x) ).
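A compact sketch of the algorithm above (my illustration, using sklearn decision stumps as the weak learner):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, yt, T=100):
        """Discrete AdaBoost; yt holds labels in {-1, +1}."""
        n = len(yt)
        w = np.full(n, 1.0 / n)
        stumps, alphas = [], []
        for _ in range(T):
            f = DecisionTreeClassifier(max_depth=1).fit(X, yt, sample_weight=w)
            miss = (f.predict(X) != yt)
            err = np.clip(w[miss].sum() / w.sum(), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1.0 - err) / err)
            w *= np.exp(alpha * miss)      # upweight misclassified points
            w /= w.sum()                   # renormalize
            stumps.append(f)
            alphas.append(alpha)
        def F(Xnew):                       # weighted-vote final classifier
            scores = sum(a * f.predict(Xnew) for a, f in zip(alphas, stumps))
            return np.sign(scores)
        return F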
43. Boosting as Stagewise Additive Fitting
- In Friedman, Hastie, and Tibshirani (2000), boosting is viewed as a stagewise additive fitting which exploits IRLS-like steps to minimize the exponential loss
  Σ_i exp( -ỹ_i F(x_i) ).
- The update is
  F_new(x) = F_old(x) + α_t f_t(x),
  where usually f_t(x) is taken to be stumps or shallow trees. (See the derivation below.)
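A one-line minimization (implicit between the last two slides) shows why this stagewise view reproduces AdaBoost's α_t: with w_i = exp(-ỹ_i F_old(x_i)),

  Σ_i exp( -ỹ_i (F_old(x_i) + α f_t(x_i)) ) = e^{-α} Σ_{ỹ_i = f_t(x_i)} w_i + e^{α} Σ_{ỹ_i ≠ f_t(x_i)} w_i,

and setting the derivative in α to zero gives α_t = (1/2) log((1 - err_t) / err_t).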
44. Boosting as Stagewise Additive Fitting
- Boosting is a stagewise fitting, as opposed to stepwise fitting:
- In stagewise fitting, F_old is not updated when a new term is added, while in stepwise fitting it is refit.
45. Exponential Loss Can Be Mapped to a Proper Scoring Rule
- The exponential loss can be decomposed into a proper scoring rule and a link function (spelled out below).
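Reconstructing the formula from Friedman, Hastie, and Tibshirani (2000), since the slide's equation did not survive extraction: under the half-logit link F(x) = (1/2) log( q / (1 - q) ),

  exp( -ỹ F(x) ) = sqrt( (1 - q) / q ) if ỹ = +1,   sqrt( q / (1 - q) ) if ỹ = -1,

so L_1(1 - q) = sqrt((1 - q)/q) and L_0(q) = sqrt(q/(1 - q)) form a proper scoring rule, with weight function ω(q) ∝ ( q(1 - q) )^{-3/2} (the Beta family with a = b = -1/2).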
46. LogitBoost
- FHT (2000) proposed a stagewise additive fitting algorithm, LogitBoost, that minimizes the negative log-likelihood instead (its Newton step is recalled below).
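For reference, the LogitBoost Newton step from FHT (2000), not shown on the slide: with q(x) = 1 / (1 + exp(-2F(x))), each stage fits f_t by weighted least squares of a working response on x and updates F:

  z_i = (y_i - q(x_i)) / ( q(x_i)(1 - q(x_i)) ),   w_i = q(x_i)(1 - q(x_i)),   F ← F + (1/2) f_t.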
47. Cost-Weighted Boosting
- Again, for cost-weighted classification, we can optimize the tailored losses with stagewise additive fitting.
48. Experiments
- Artificial data: 2000 instances.
- Class 1 and class 0 have equal marginal proportions, i.e., 50% class 1 and 50% class 0.
- Suppose we are interested in estimating p(x) = .3.
49. Experiments
(Figure panels: log-loss, Method 1, Method 2.)
50. Experiments
(Figure panels: log-loss, Method 1, Method 2.)
51. Experiments
- Test set size: 10000; test error:
52. Experiments
- Pima Indians diabetes (using all features).
(Figure panels: c_0 = .2 and c_0 = .8.)
53. Experiments
54. Experiments
(Figure panels: c_0 = .8 and c_0 = .2.)
55. Conclusion
- Logistic regression has potential problems for cost-weighted classification.
- Reweighting with proper scoring rules helps us stay focused on the classification boundary of interest.
- Reweighting can be extended to semiparametric modeling such as boosting.