Classification: Definition

Slides: 74

Provided by: jiaw206

Learn more at: https://www.cs.kent.edu

1
Classification: Definition

Given a collection of records (the training set).
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

2
Classification Process (1): Model Construction
Classification Algorithms
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
3
Classification Process (2): Use the Model in Prediction
Unseen data: (Jeff, Professor, 4)
Tenured?
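The rule from the model-construction slide can be applied to the unseen tuple directly. This is a minimal sketch; the function name `is_tenured` and the case-insensitive comparison are assumptions made for illustration.

```python
def is_tenured(rank: str, years: int) -> bool:
    """Apply the learned rule: rank is professor OR more than 6 years."""
    return rank.lower() == "professor" or years > 6

# The unseen tuple from the slide: (Jeff, Professor, 4).
print(is_tenured("Professor", 4))
```

Jeff satisfies the rule through the rank test alone, even though 4 years is below the threshold.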
4
Classification by Decision Tree Induction

Decision tree:
A flow-chart-like tree structure
An internal node denotes a test on an attribute
A branch represents an outcome of the test
Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases:
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of a decision tree: classifying an unknown sample
Test the attribute values of the sample against the decision tree

5
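Example of a Decision Tree
[Figure: training data and the model, a decision tree over the splitting attributes Refund, MarSt, and TaxInc.]
Refund = Yes: NO
Refund = No: test MarSt
MarSt = Married: NO
MarSt = Single, Divorced: test TaxInc
TaxInc < 80K: NO
TaxInc > 80K: YES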
6
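Another Example of Decision Tree
[Figure: the same training data (attribute types: categorical, categorical, continuous, class) with a different tree.]
MarSt = Married: NO
MarSt = Single, Divorced: test Refund
Refund = Yes: NO
Refund = No: test TaxInc
TaxInc < 80K: NO
TaxInc > 80K: YES
There could be more than one tree that fits the same data!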
7
Apply Model to Test Data
Test Data
Start from the root of the tree.
8-12
Apply Model to Test Data (continued)
[Figure sequence: the test record is traced down the Refund / MarSt / TaxInc tree from the root until it reaches a leaf.]
Assign Cheat to "No"
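Applying the model amounts to walking the Refund / MarSt / TaxInc tree from the root to a leaf. The sketch below hard-codes that tree; the test record is hypothetical (the slide's own record is not preserved in this transcript), and treating an income of exactly 80K as the "> 80K" branch is an assumption.

```python
def classify(record: dict) -> str:
    """Walk the Refund -> MarSt -> TaxInc tree and return the Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: fall through to the taxable-income test (in thousands).
    # Assumption: exactly 80K is sent down the "> 80K" branch.
    return "No" if record["TaxInc"] < 80 else "Yes"

# Hypothetical test record, for illustration only.
test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))
```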
13
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
The tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
There are no samples left

14
General Structure of Hunt's Algorithm

Let D_t be the set of training records that reach a node t.
General procedure:
If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
If D_t is an empty set, then t is a leaf node labeled by the default class, y_d.
If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
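The three cases of the general procedure map onto a short recursive function. This is only a sketch: the attribute is chosen naively (first one in the list) rather than by a splitting criterion, the majority class of the parent is passed down as the default, and the toy records are hypothetical.

```python
from collections import Counter

def hunt(records, attributes, default="No"):
    """Grow a tree from (features_dict, label) pairs following Hunt's procedure."""
    if not records:                      # D_t is empty: leaf with the default class
        return default
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority                  # pure node, or no attribute left to test
    # Assumption: take the next attribute in order instead of the best split.
    attr, rest = attributes[0], attributes[1:]
    children = {}
    for value in sorted({feats[attr] for feats, _ in records}):
        subset = [(f, y) for f, y in records if f[attr] == value]
        children[value] = hunt(subset, rest, default=majority)
    return (attr, children)

# Hypothetical toy records, for illustration only.
data = [
    ({"Refund": "Yes", "MarSt": "Single"}, "No"),
    ({"Refund": "No", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Single"}, "Yes"),
]
tree = hunt(data, ["Refund", "MarSt"])
print(tree)
```

Internal nodes come back as `(attribute, {value: subtree})` pairs and leaves as plain class labels, so the result can be walked with a simple loop.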
15
Hunt's Algorithm
[Figure: Hunt's algorithm applied to the cheat training data; node label shown: Don't Cheat]
16
Tree Induction

Greedy strategy:
Split the records based on an attribute test that optimizes a certain criterion.
Issues:
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting

17
How to Specify Test Condition?

Depends on attribute types
Nominal
Ordinal
Continuous
Depends on number of ways to split
2-way split
Multi-way split

18
Splitting Based on Nominal Attributes

Multi-way split: use as many partitions as distinct values.
Binary split: divides values into two subsets. Need to find the optimal partitioning.

19
Splitting Based on Continuous Attributes

Different ways of handling:
Discretization to form an ordinal categorical attribute
Static: discretize once at the beginning
Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
Binary decision: (A < v) or (A ≥ v)
Consider all possible splits and find the best cut
Can be more compute intensive

20
Splitting Based on Continuous Attributes
21
Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.
Assume there are two classes, P and N.
Let the set of examples S contain p elements of class P and n elements of class N.
The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$

22
Refund Attribute: Information Gain

Class C (cheat) contains 3 tuples
Class NC (don't cheat) contains 7 tuples
I(C, NC) = -3/10 log(3/10) - 7/10 log(7/10) = .2653 (base-10 logarithms)
Check attribute Refund: the Yes value contains 3 NC and 0 C; the No value contains 4 NC and 3 C
For Yes: 3/3 log(3/3) + 0/3 log(0) = 0 (with the convention 0 log 0 = 0)
For No: 4/7 log(7/4) + 3/7 log(7/3) = .2966
E(Refund) = 3/10 × 0 + 7/10 × .2966 = .2076
Gain = I(C, NC) - E(Refund) = .0577

23
Marital Status: Information Gain

Check attribute MS: the Single value contains 3 NC and 3 C; the Married value contains 4 NC and 0 C
For Single: 3/6 log(6/3) + 3/6 log(6/3) = .30
For Married: 4/4 log(4/4) + 0/4 log(0) = 0
E(MS) = 6/10 × .30 + 4/10 × 0 = .18
Gain = I(C, NC) - E(MS) = .0853

24
Taxable Income (TI): Information Gain

Suppose that taxable income is discretized into (0, 75], (75, 100], (100, 1000000]
The first segment contains 3 NC and 0 C
The second segment contains 1 NC and 3 C
The third segment contains 3 NC and 0 C
For the 1st segment: 3/3 log(1) + 0/3 log(1) = 0
For the 2nd segment: 1/4 log(4/1) + 3/4 log(4/3) = .2442
For the 3rd segment we also obtain 0
E(TI) = 3/10 × 0 + 4/10 × .2442 + 3/10 × 0 = .0977
Gain = I(C, NC) - E(TI) = .1676
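The three gain computations can be reproduced from the class counts alone. The sketch below uses base-10 logarithms, which is what the slide's value of .2653 implies; the helper names `info` and `gain` are assumptions. The unrounded Marital Status gain comes out slightly below the slide's figure because the slide rounds .301 to .30 before multiplying.

```python
from math import log10

def info(counts):
    """Expected information of a list of class counts (base-10 logs, 0 log 0 = 0)."""
    total = sum(counts)
    return sum(c / total * log10(total / c) for c in counts if c > 0)

def gain(parent, partitions):
    """Information gain of splitting the `parent` counts into `partitions`."""
    n = sum(parent)
    expected = sum(sum(part) / n * info(part) for part in partitions)
    return info(parent) - expected

parent = [7, 3]  # [NC, C]
gains = {
    "Refund": gain(parent, [[3, 0], [4, 3]]),
    "MS": gain(parent, [[3, 3], [4, 0]]),
    "TI": gain(parent, [[3, 0], [1, 3], [3, 0]]),
}
for name, g in gains.items():
    print(f"{name}: {g:.4f}")
```

Taxable income has the largest gain of the three, so under this measure it would be chosen as the first split.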

25
Information Gain in Decision Tree Induction

Assume that using attribute A a set S will be partitioned into sets {S_1, S_2, ..., S_v}.
If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is
$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$
The encoding information that would be gained by branching on A is
$Gain(A) = I(p, n) - E(A)$

26
RID  Age     Income  Student  Credit     Class: buys_computer
1    <=30    high    no       fair       no
2    <=30    high    no       excellent  no
3    31..40  high    no       fair       yes
4    >40     medium  no       fair       yes
5    >40     low     yes      fair       yes
6    >40     low     yes      excellent  no
7    31..40  low     yes      excellent  yes
8    <=30    medium  no       fair       no
9    <=30    low     yes      fair       yes
10   >40     medium  yes      fair       yes
11   <=30    medium  yes      excellent  yes
12   31..40  medium  no       excellent  yes
13   31..40  high    yes      fair       yes
14   >40     medium  no       excellent  no
27
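Attribute Selection by Information Gain Computation

Class P: buys_computer = "yes"
Class N: buys_computer = "no"
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age:
E(age) = 5/14 I(2, 3) + 4/14 I(4, 0) + 5/14 I(3, 2) = 0.694
Hence Gain(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246
Similarly, Gain(income) = 0.029, Gain(student) = 0.152, and Gain(credit_rating) = 0.048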
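The gain of every attribute can be recomputed from the 14-row buys_computer table. The sketch below transcribes that table (with the age value "30" read as "<=30", an assumption) and uses base-2 logarithms, which is what I(9, 5) = 0.940 implies. The names `ROWS`, `entropy`, and `gain` are illustrative.

```python
from collections import Counter
from math import log2

ROWS = [  # (age, income, student, credit, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit"]

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Information gain of splitting `rows` on the attribute in column `col`."""
    expected = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        expected += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - expected

gains = {name: gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
print(best, {name: round(g, 3) for name, g in gains.items()})
```

Age has the highest gain on this data, so it would become the root test.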

28
Extracting Classification Rules from Trees

Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31..40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

29
Stopping Criteria for Tree Induction

Stop expanding a node when all the records belong
to the same class
Stop expanding a node when all the records have
similar attribute values
Early termination (to be discussed later)

30
Table to be classified
name     btemp  skinCover  givesBirth  aquatic  aerial  legs  hibernates  class
Human    Warm   Hair       Yes         No       No      Yes   No          Mammal
Python   Cold   Scales     No          No       No      No    Yes         Reptile
Salmon   Cold   Scales     No          Yes      No      No    No          Fish
Whale    Warm   Hair       Yes         Yes      No      No    No          Mammal
Frog     Cold   None       No          Semi     No      Yes   Yes         Amphibian
Komodo   Cold   Scales     No          No       No      Yes   No          Reptile
Eel      Cold   Scales     No          Yes      No      No    No          Fish
Bat      Warm   Hair       Yes         No       Yes     Yes   Yes         Mammal
Pigeon   Warm   Feather    No          No       Yes     Yes   No          Bird
Cat      Warm   Fur        Yes         No       No      Yes   No          Mammal
Leopard  Warm   Fur        Yes         No       No      Yes   No          Mammal
Shark    Cold   Scales     Yes         Yes      No      No    No          Fish
Turtle   Cold   Scales     No          Semi     No      Yes   No          Reptile
Penguin  Warm   Feather    No          Semi     No      Yes   No          Bird
31
Decision Tree Based Classification

Advantages
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification
techniques for many simple data sets

32
Practical Issues of Classification

Underfitting and Overfitting
Missing Values
Costs of Classification

33
Underfitting and Overfitting (Example)
500 circular and 500 triangular data points.
Circular points: $0.5 \le \sqrt{x_1^2 + x_2^2} \le 1$
Triangular points: $\sqrt{x_1^2 + x_2^2} < 0.5$ or $\sqrt{x_1^2 + x_2^2} > 1$
34
Underfitting and Overfitting
[Figure: region labeled Overfitting]
Underfitting: when the model is too simple, both training and test errors are large
35
Overfitting due to Noise
Decision boundary is distorted by noise point
36
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region.
An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
37
Notes on Overfitting

Overfitting results in decision trees that are
more complex than necessary
Training error no longer provides a good estimate
of how well the tree will perform on previously
unseen records
Need new ways for estimating errors

38
Minimum Description Length (MDL)

Cost(Model, Data) = Cost(Data|Model) + Cost(Model)
Cost is the number of bits needed for encoding.
Search for the least costly model.
Cost(Data|Model) encodes the misclassification errors.
Cost(Model) uses node encoding (number of children) plus splitting condition encoding.

39
Metrics for Performance Evaluation

Focus on the predictive capability of a model, rather than how fast it takes to classify or build models, scalability, etc.
Confusion matrix:

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a           b
CLASS    Class=No    c           d

a: TP (true positive); b: FN (false negative); c: FP (false positive); d: TN (true negative)
40
Metrics for Performance Evaluation

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

Most widely-used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

41
Limitation of Accuracy

Consider a 2-class problem:
Number of Class 0 examples = 9990
Number of Class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
Accuracy is misleading because the model does not detect any class 1 example

42
Cost Matrix

                     PREDICTED CLASS
         C(i|j)      Class=Yes    Class=No
ACTUAL   Class=Yes   C(Yes|Yes)   C(No|Yes)
CLASS    Class=No    C(Yes|No)    C(No|No)

C(i|j): cost of misclassifying a class j example as class i
43
Computing Cost of Classification

Cost matrix:
                 PREDICTED CLASS
         C(i|j)  +      -
ACTUAL   +       -1     100
CLASS    -       1      0

Model M1:
                 PREDICTED CLASS
                 +      -
ACTUAL   +       150    40
CLASS    -       60     250
Accuracy = 80%, Cost = 3910

Model M2:
                 PREDICTED CLASS
                 +      -
ACTUAL   +       250    45
CLASS    -       5      200
Accuracy = 90%, Cost = 4255
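The accuracy and cost figures for M1 and M2 follow from multiplying each confusion-matrix cell by the matching cost-matrix cell. A minimal sketch, assuming rows are the actual class (+, -) and columns the predicted class (+, -); the function names are illustrative.

```python
# Rows: actual class (+, -). Columns: predicted class (+, -).
COST = [[-1, 100], [1, 0]]
M1 = [[150, 40], [60, 250]]
M2 = [[250, 45], [5, 200]]

def accuracy(cm):
    """Fraction of records on the diagonal of the confusion matrix."""
    return (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

def total_cost(cm, cost):
    """Sum of count times cost over all four cells."""
    return sum(cm[i][j] * cost[i][j] for i in range(2) for j in range(2))

for name, cm in (("M1", M1), ("M2", M2)):
    print(name, accuracy(cm), total_cost(cm, COST))
```

M2 is more accurate but also more costly, because it makes more of the expensive false-negative errors.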
44
Cost vs Accuracy

Count:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a           b
CLASS    Class=No    c           d

Cost:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   p           q
CLASS    Class=No    q           p
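With a symmetric cost table (p for every correct prediction, q for every error), total cost becomes a linear function of accuracy. A short derivation, where the symbol N for the total number of records is introduced here for convenience:

```latex
\begin{align*}
N &= a + b + c + d, \qquad \text{Accuracy} = \frac{a + d}{N} \\
\text{Cost} &= p\,(a + d) + q\,(b + c) \\
            &= p\,(a + d) + q\,(N - a - d) \\
            &= qN - (q - p)(a + d) \\
            &= N\,\bigl[\,q - (q - p)\cdot\text{Accuracy}\,\bigr]
\end{align*}
```

So when q > p, ranking models by accuracy and ranking them by cost give the same order; the two only disagree when the costs are asymmetric.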
45
Cost-Sensitive Measures

Precision is biased towards C(Yes|Yes) and C(Yes|No)
Recall is biased towards C(Yes|Yes) and C(No|Yes)
F-measure is biased towards all except C(No|No)
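These measures can be written in terms of the confusion-matrix counts a (TP), b (FN), c (FP), and d (TN). The sketch below uses the standard definitions, which are not spelled out in the slide text itself; returning 0.0 for an empty denominator is an assumption.

```python
def precision(a, b, c, d):
    """TP / (TP + FP): uses only the predicted-Yes column."""
    return a / (a + c) if a + c else 0.0

def recall(a, b, c, d):
    """TP / (TP + FN): uses only the actual-Yes row."""
    return a / (a + b) if a + b else 0.0

def f_measure(a, b, c, d):
    """Harmonic mean of precision and recall, 2a / (2a + b + c)."""
    return 2 * a / (2 * a + b + c) if 2 * a + b + c else 0.0

# Counts of model M1 from the cost example: a=150, b=40, c=60, d=250.
m1 = (150, 40, 60, 250)
print(precision(*m1), recall(*m1), f_measure(*m1))

# The "predict everything as class 0" model: 10 positives, all missed.
print(recall(0, 10, 0, 9990))
```

None of the three functions reads d, which matches the remark that the F-measure is biased towards everything except C(No|No).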

46
Bayesian Classification: Why?

Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

47
Naive Bayesian Classification Example

Discretization of Height is as follows: (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
P(Short) = 4/15 = .267
P(Medium) = 8/15 = .533
P(Tall) = 3/15 = .2
P(M|Short) = 1/4 = .25
P(M|Medium) = 2/8 = .25
P(M|Tall) = 3/3 = 1.0
P(F|Short) = 3/4 = .75
P(F|Medium) = 6/8 = .75
P(F|Tall) = 0/3 = 0.0
P((0, 1.6]|Short) = 2/4 = .5
P((1.6, 1.7]|Short) = 2/4 = .5
P((1.9, 2.0]|Tall) = 1/3 = .333
P((2.0, ∞)|Tall) = 2/3 = .666
P((1.7, 1.8]|Medium) = 3/8 = .375
P((1.8, 1.9]|Medium) = 4/8 = .5
P((1.9, 2.0]|Medium) = 1/8 = .125

Id  Gender  Height  Class
1   F       1.6     Short
2   M       2.0     Tall
3   F       1.9     Medium
4   F       1.88    Medium
5   F       1.7     Short
6   M       1.85    Medium
7   F       1.6     Short
8   M       1.7     Short
9   M       2.2     Tall
10  M       2.1     Tall
11  F       1.8     Medium
12  M       1.95    Medium
13  F       1.9     Medium
14  F       1.8     Medium
15  F       1.75    Medium
48
Naive Bayesian Classification Example
(Training data: the same 15-row table as on slide 47.)

Consider tuple t = <16, M, 1.95>
Bayes rule: P(Short|t) = P(t|Short) P(Short) / P(t)
P(t) is a constant for any class.
P(t|Short) = P(M|Short) × P((1.9, 2.0]|Short) = .25 × 0 = 0
P(Short) = .267, so P(Short|t) = 0/P(t)
P(t|Medium) = P(M|Medium) × P((1.9, 2.0]|Medium) = .25 × .125 = .031
P(Medium|t) = .031 × .533/P(t) = .016/P(t)
Similarly, P(Tall|t) = 1.0 × .333 × .2/P(t) = .066/P(t)
Thus, t is Tall
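The classification of t can be reproduced from the 15 training rows by counting. This is a sketch under the slide's discretization of Height; the names `DATA`, `EDGES`, `height_bin`, and `scores` are illustrative, and the scores are the unnormalized products P(t|class) P(class), so they differ slightly from the slide's rounded intermediate values.

```python
from collections import Counter

DATA = [  # (gender, height, class), rows 1-15 of the training table
    ("F", 1.6, "Short"), ("M", 2.0, "Tall"), ("F", 1.9, "Medium"),
    ("F", 1.88, "Medium"), ("F", 1.7, "Short"), ("M", 1.85, "Medium"),
    ("F", 1.6, "Short"), ("M", 1.7, "Short"), ("M", 2.2, "Tall"),
    ("M", 2.1, "Tall"), ("F", 1.8, "Medium"), ("M", 1.95, "Medium"),
    ("F", 1.9, "Medium"), ("F", 1.8, "Medium"), ("F", 1.75, "Medium"),
]
# Upper bounds of (0,1.6], (1.6,1.7], ..., (1.9,2.0]; anything above 2.0 is the last bin.
EDGES = [1.6, 1.7, 1.8, 1.9, 2.0]

def height_bin(h):
    """Index of the discretization interval that contains height h."""
    for i, edge in enumerate(EDGES):
        if h <= edge:
            return i
    return len(EDGES)

def scores(gender, height):
    """Unnormalized posterior P(gender|c) * P(bin|c) * P(c) for each class c."""
    result = {}
    for cls, n_cls in Counter(c for _, _, c in DATA).items():
        rows = [(g, h) for g, h, c in DATA if c == cls]
        p_gender = sum(g == gender for g, _ in rows) / n_cls
        p_height = sum(height_bin(h) == height_bin(height) for _, h in rows) / n_cls
        result[cls] = p_gender * p_height * n_cls / len(DATA)
    return result

s = scores("M", 1.95)
print(max(s, key=s.get), s)
```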

49
Estimating a-posteriori probabilities

Bayes theorem:
P(C|X) = P(X|C) P(C) / P(X)
P(X) is constant for all classes
P(C) = relative frequency of class C samples
C such that P(C|X) is maximum = C such that P(X|C) P(C) is maximum
Problem: computing P(X|C) is infeasible!

50
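Naïve Bayesian Classification

Naïve assumption: attribute independence
P(x1, ..., xk|C) = P(x1|C) × ... × P(xk|C)
If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function
Computationally easy in both cases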

51
Play-tennis example: estimating P(xi|C)
outlook
P(sunny|p) = 2/9       P(sunny|n) = 3/5
P(overcast|p) = 4/9    P(overcast|n) = 0
P(rain|p) = 3/9        P(rain|n) = 2/5
temperature
P(hot|p) = 2/9         P(hot|n) = 2/5
P(mild|p) = 4/9        P(mild|n) = 2/5
P(cool|p) = 3/9        P(cool|n) = 1/5
humidity
P(high|p) = 3/9        P(high|n) = 4/5
P(normal|p) = 6/9      P(normal|n) = 1/5
windy
P(true|p) = 3/9        P(true|n) = 3/5
P(false|p) = 6/9       P(false|n) = 2/5
P(p) = 9/14
P(n) = 5/14
52
Play-tennis example: classifying X

An unseen sample X = <rain, hot, high, false>
P(X|p) P(p) = P(rain|p) P(hot|p) P(high|p) P(false|p) P(p) = 3/9 × 2/9 × 3/9 × 6/9 × 9/14 = 0.010582
P(X|n) P(n) = P(rain|n) P(hot|n) P(high|n) P(false|n) P(n) = 2/5 × 2/5 × 4/5 × 2/5 × 5/14 = 0.018286
Sample X is classified in class n (don't play)
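The two products can be checked with exact fractions. The sketch below copies only the four conditional probabilities that X needs for each class; the dictionary layout and the name `score` are assumptions.

```python
from fractions import Fraction as F

# Conditional probabilities used by X = <rain, hot, high, false>.
COND = {
    "p": {"rain": F(3, 9), "hot": F(2, 9), "high": F(3, 9), "false": F(6, 9)},
    "n": {"rain": F(2, 5), "hot": F(2, 5), "high": F(4, 5), "false": F(2, 5)},
}
PRIOR = {"p": F(9, 14), "n": F(5, 14)}

def score(cls, sample):
    """P(X|cls) * P(cls) under the naive independence assumption."""
    result = PRIOR[cls]
    for value in sample:
        result *= COND[cls][value]
    return result

x = ("rain", "hot", "high", "false")
scores = {cls: score(cls, x) for cls in PRIOR}
print({cls: round(float(s), 6) for cls, s in scores.items()})
print(max(scores, key=scores.get))
```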

53
Instance-Based Methods

Instance-based learning
Store training examples and delay the processing
(lazy evaluation) until a new instance must be
classified
Typical approaches
k-nearest neighbor approach
Instances represented as points in a Euclidean
space.
Locally weighted regression
Constructs local approximation
Case-based reasoning
Uses symbolic representations and knowledge-based
inference

54
Other Classification Methods

k-nearest neighbor classifier
case-based reasoning
Rough set approach
Fuzzy set approaches

55
The k-Nearest Neighbor Algorithm

All instances correspond to points in the n-D space.
The nearest neighbors are defined in terms of Euclidean distance.
The target function could be discrete- or real-valued.
For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to x_q.

56
Discussion on the k-NN Algorithm

The k-NN algorithm for continuous-valued target functions:
Calculate the mean values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm:
Weight the contribution of each of the k neighbors according to their distance to the query point x_q, giving greater weight to closer neighbors
Robust to noisy data by averaging the k nearest neighbors
Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes.
To overcome it, stretch the axes or eliminate the least relevant attributes.
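A compact sketch of k-NN classification with an optional distance-weighted vote follows. The inverse-squared-distance weight is one common choice and is an assumption here, as are the function name `knn_predict` and the toy training points.

```python
from collections import defaultdict
from math import dist

def knn_predict(train, query, k=3, weighted=False):
    """Classify `query` by a vote of its k nearest training points."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = dist(point, query)
        # Assumed weighting: 1 / d^2, with a tiny constant to avoid division by zero.
        votes[label] += 1.0 / (d * d + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical 2-D training points, for illustration only.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.0), "B"),
         ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]
query = (1.1, 1.0)
print(knn_predict(train, query, k=3))
print(knn_predict(train, query, k=5))
print(knn_predict(train, query, k=5, weighted=True))
```

With k = 5 the unweighted vote is swamped by the three distant "B" points, while the distance-weighted vote still follows the two close "A" points.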

57
Remarks on Lazy vs. Eager Learning

Instance-based learning: lazy evaluation
Decision-tree and Bayesian classification: eager evaluation
Key differences:
A lazy method may consider the query instance x_q when deciding how to generalize beyond the training data D
An eager method cannot, since it has already chosen its global approximation by the time it sees the query
Efficiency: lazy methods spend less time training but more time predicting
Accuracy:
A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
An eager method must commit to a single hypothesis that covers the entire instance space

58
Rough Set Approach

Rough sets are used to approximately or "roughly" define equivalence classes
A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computation intensity

59
Fuzzy Set Approaches

Fuzzy logic uses truth values between 0.0 and 1.0
to represent the degree of membership (such as
using fuzzy membership graph)
Attribute values are converted to fuzzy values
e.g., income is mapped into the discrete
categories low, medium, high with fuzzy values
calculated
For a given new sample, more than one fuzzy value
may apply
Each applicable rule contributes a vote for
membership in the categories
Typically, the truth values for each predicted
category are summed

60
What Is Prediction?

Prediction is similar to classification:
First, construct a model
Second, use the model to predict unknown values
The major method for prediction is regression:
Linear and multiple regression
Non-linear regression
Prediction is different from classification:
Classification predicts categorical class labels
Prediction models continuous-valued functions

61
Predictive Modeling in Databases

Predictive modeling: predict data values or construct generalized linear models based on the database data.
One can only predict value ranges or category distributions
Method outline:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction
Determine the major factors which influence the prediction:
Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis

62
Linear Regression for Prediction

Given a set of tuples (x_1, y_1), (x_2, y_2), ..., (x_s, y_s)
Assume that $Y = A + BX$
$B = \frac{\sum_{i=1}^{s} (x_i - E(X))(y_i - E(Y))}{\sum_{i=1}^{s} (x_i - E(X))^2}$
$A = E(Y) - B\,E(X)$

63
Linear Regression for Prediction

X (Years of Experience)   Y (Salary)
3                         30
8                         57
9                         64
13                        72
3                         36
6                         43
11                        59
21                        90
1                         20
16                        83

E(X) = 9.1, E(Y) = 55.4
B = [(3 - 9.1)(30 - 55.4) + (8 - 9.1)(57 - 55.4) + ...] / [(3 - 9.1)^2 + ...] = 3.5
A = 55.4 - (3.5 × 9.1) = 23.6
Y = 23.6 + 3.5X
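The fit can be recomputed from the ten (years, salary) pairs with the mean-centered formulas for B and A. A minimal sketch; `fit_line` is an illustrative name. Without intermediate rounding the slope is about 3.54 and the intercept about 23.2, which the slide rounds to 3.5 and 23.6.

```python
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]        # years of experience
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]  # salary

def fit_line(xs, ys):
    """Least-squares intercept A and slope B for Y = A + B X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

a, b = fit_line(xs, ys)
print(f"Y = {a:.2f} + {b:.2f}X")
```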

64
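Linear Regression Using a Minimal Error Approach

$Y = A + BX + e$, where $e$ is the error if $Y$ is replaced by $A + BX$
$L = \sum_i e_i^2 = \sum_i (y_i - A - B x_i)^2$
Take the derivative with respect to A and B respectively:
$\partial L/\partial A = -2\sum_i y_i + 2\sum_i A + 2\sum_i B x_i = 0$
$\partial L/\partial B = -2\sum_i (y_i - A - B x_i)\,x_i = 0$
$A = \left(\sum_i y_i - B\sum_i x_i\right)/N$
$B = \frac{\sum_i x_i y_i - \left(\sum_i x_i \sum_i y_i\right)/N}{\sum_i x_i^2 - \left(\sum_i x_i\right)^2/N}$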

65
Linear Regression Using Minimal Error
Data: the same ten (years of experience, salary) pairs as on slide 63.

A = 23.21, B = 3.54
Y = 23.21 + 3.54X

66
Regression Analysis and Log-Linear Models in Prediction

Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can be transformed into the above.
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables.
Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$

67
Locally Weighted Regression

Construct an explicit approximation to f over a local region surrounding the query instance x_q.
Locally weighted linear regression:
The target function f is approximated near x_q using a linear function:
$\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$
Minimize the squared error, weighted by a distance-decreasing weight K
Train with the gradient descent training rule
In most cases, the target function is approximated by a constant, linear, or quadratic function.

68
Prediction: Numerical Data
69
Prediction: Categorical Data
70
Classification Accuracy: Estimating Error Rates

Partition: training-and-testing
Use two independent data sets, e.g., training set (2/3), test set (1/3)
Used for data sets with a large number of samples
Cross-validation
Divide the data set into k subsamples
Use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation)
For data sets of moderate size
Bootstrapping (leave-one-out)
For small data sets
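The k-fold scheme can be sketched as an index generator: each fold serves once as the test set while the other k-1 folds form the training set. The function name `k_fold_splits` is an assumption, and the sketch splits indices in order without shuffling, which real use would normally add.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_splits(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Setting k equal to the number of samples gives one test record per fold, i.e. leave-one-out.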

71
Boosting and Bagging

Boosting increases classification accuracy
Applicable to decision trees or Bayesian classifiers
Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
Boosting requires only linear time and constant space

72
Boosting Technique (II): Algorithm

Assign every example an equal weight 1/N
For t = 1, 2, ..., T do:
Obtain a hypothesis (classifier) h(t) under w(t)
Calculate the error of h(t) and re-weight the examples based on the error
Normalize w(t+1) to sum to 1
Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
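The loop above leaves the re-weighting rule and the hypothesis weights unspecified. The sketch below fills them in with the AdaBoost formulas, which is an assumption rather than something the slide states, and uses one-dimensional threshold rules ("stumps") as the weak classifier. The data set is a hypothetical toy example with labels +1 and -1.

```python
from math import exp, log

def best_stump(xs, ys, w):
    """Return (weighted error, threshold, sign) of the best 1-D threshold rule."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x <= thr else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def boost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n                          # equal initial weights 1/N
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = best_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * log((1 - err) / err)     # AdaBoost vote weight (assumption)
        ensemble.append((alpha, thr, sign))
        preds = [sign if x <= thr else -sign for x in xs]
        w = [wi * exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]           # normalize w(t+1) to sum to 1
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted sum of all stump votes."""
    score = sum(alpha * (sign if x <= thr else -sign) for alpha, thr, sign in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]        # not separable by a single threshold
model = boost(xs, ys, rounds=3)
print(sum(predict(model, x) == y for x, y in zip(xs, ys)), "of", len(xs), "correct")
```

No single threshold classifies this data correctly, but the weighted vote of three stumps does, which illustrates how each round concentrates on the examples its predecessor got wrong.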

73
Summary

Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks)
Classification is probably one of the most widely used data mining techniques, with a lot of extensions
Scalability is still an important issue for database applications; thus, combining classification with database techniques should be a promising topic
Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.