
CENG 464 Introduction to Data Mining

Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Classification Definition

- Given a collection of records (training set)
- Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Prediction Problems: Classification vs. Numeric Prediction

- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data
- Numeric prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - Credit/loan approval
  - Medical diagnosis: is a tumor cancerous or benign?
  - Fraud detection: is a transaction fraudulent?
  - Web page categorization: which category does a page belong to?

Classification: A Two-Step Process

- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set is independent of the training set (otherwise overfitting occurs)
  - If the accuracy is acceptable, use the model to classify new data
- Note: if the test set is used to select models, it is called a validation (test) set

Process (1): Model Construction

Classification Algorithms

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Process (2): Using the Model in Prediction

(Jeff, Professor, 4)

Tenured?

Illustrating Classification Task

The training and test sets are randomly sampled (supervised learning; accuracy is measured on the test set). The task is to find a mapping or function that can predict the class label of a given tuple X.

Classification Techniques

- Decision Tree-based Methods
- Bayes Classification Methods
- Rule-based Methods
- Nearest-Neighbor Classifiers
- Artificial Neural Networks
- Support Vector Machines
- Memory-based Reasoning

Example of a Decision Tree

Root node; internal nodes: attribute test conditions; leaf nodes: class labels. The splitting attributes here are Refund, MarSt, and TaxInc:

    Refund?
    ├─ Yes → NO
    └─ No → MarSt?
            ├─ Married → NO
            └─ Single, Divorced → TaxInc?
                                  ├─ < 80K → NO
                                  └─ > 80K → YES

Model: decision tree, learned from the training data.

Another Example of Decision Tree

The training data has two categorical attributes (Refund, Marital Status), one continuous attribute (Taxable Income), and the class label. This tree splits on MarSt first:

    MarSt?
    ├─ Married → NO
    └─ Single, Divorced → Refund?
                          ├─ Yes → NO
                          └─ No → TaxInc?
                                  ├─ < 80K → NO
                                  └─ > 80K → YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task

Decision Tree

Apply Model to Test Data

Test Data

Start from the root of the tree and, at each internal node, follow the branch that matches the test record's attribute value.

The test record is routed down the tree, one test condition at a time (Refund, then MarSt, then TaxInc if needed), until it reaches a leaf; the label of that leaf is the prediction. In this example the record reaches the NO leaf, so Cheat is assigned the value No.

Decision Tree Classification Task

Decision Tree

Decision Tree Induction

- Many algorithms:
  - Hunt's Algorithm (one of the earliest)
  - ID3, C4.5
  - CART
  - SLIQ, SPRINT

Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm; a minimal code sketch follows this list)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  - There are no samples left
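A minimal sketch of this greedy, top-down procedure in Python (ID3-style, categorical attributes only, information gain as the selection measure; the function names and dict-based tree representation are illustrative, not from any particular library):

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information needed to classify a sample with these labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by partitioning on attr."""
    total = len(labels)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:        # all samples in one class: leaf
        return labels[0]
    if not attrs:                    # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for v in {r[best] for r in rows}:   # partition recursively on best
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        sub_rows, sub_labels = zip(*sub)
        node[best][v] = build_tree(list(sub_rows), list(sub_labels),
                                   [a for a in attrs if a != best])
    return node
```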

Tree Induction

- Greedy strategy
  - Split the records based on an attribute test that optimizes a certain criterion
- Issues
  - Determine how to split the records
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting

How to Specify Test Condition?

- Depends on attribute types
  - Nominal
  - Ordinal
  - Continuous
- Depends on number of ways to split
  - 2-way split
  - Multi-way split

Splitting Based on Nominal Attributes

- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divides values into two subsets; need to find the optimal partitioning.

Splitting Based on Ordinal Attributes

- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divides values into two subsets; need to find the optimal partitioning.
- What about a split that groups non-adjacent values (e.g., {Small, Large} vs. {Medium})? It violates the order property of the attribute.

Splitting Based on Continuous Attributes

- Different ways of handling
  - Discretization to form an ordinal categorical attribute
    - Static: discretize once at the beginning
    - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  - Binary decision: (A < v) or (A ≥ v)
    - consider all possible splits and find the best cut
    - can be more compute-intensive

How to determine the Best Split

Before splitting: 10 records of class 0 and 10 records of class 1. Which test condition is the best?

How to determine the Best Split

- Greedy approach
  - Nodes with homogeneous class distribution are preferred
- Need a measure of node impurity:
  - Non-homogeneous: high degree of impurity
  - Homogeneous: low degree of impurity

Attribute Selection (Splitting Rules) Measures: Measures of Node Impurity

- Provide a ranking for each attribute describing the given training tuples. The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.
- Information Gain (Entropy)
- Gini Index
- Misclassification error

Brief Review of Entropy

Entropy measures the uncertainty of a random variable. For a variable with m possible outcomes occurring with probabilities p_1, ..., p_m:

    H = -Σ_{i=1}^{m} p_i log2(p_i)

For m = 2, H ranges from 0 (a pure distribution) to 1 bit (p_1 = p_2 = 0.5).

Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain
- This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in these partitions
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
- Expected information (entropy) needed to classify a tuple in D:

    Info(D) = -Σ_{i=1}^{m} p_i log2(p_i)

- Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

- Information gained by branching on attribute A:

    Gain(A) = Info(D) − Info_A(D)

Attribute Selection: Information Gain

- Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

    Info(D) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

- "age = youth" has 5 out of 14 samples, with 2 yes's and 3 no's; it therefore contributes (5/14) × I(2, 3) to Info_age(D)
- Similarly, summing over all three age groups gives Info_age(D) = 0.694, hence Gain(age) = 0.940 − 0.694 = 0.246
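A quick way to check this arithmetic (the age distribution 5/4/5 with class counts (2,3), (4,0), (3,2) is the standard 14-tuple buys_computer example and is an assumption here if your data differs):

```python
import math

def info(*counts):
    """Expected information I(c1, c2, ...) = -sum p_i log2 p_i."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

info_d = info(9, 5)                    # 0.940 bits
info_age = (5/14) * info(2, 3) \
         + (4/14) * info(4, 0) \
         + (5/14) * info(3, 2)         # 0.694 bits
print(round(info_d - info_age, 3))     # Gain(age) = 0.246
```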

Computing Information Gain for Continuous-Valued Attributes

- Let attribute A be a continuous-valued attribute
- Must determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    - (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}
  - The point with the minimum expected information requirement for A is selected as the split point for A
- Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point (a minimal code sketch of this scan follows)
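A minimal sketch of the midpoint scan, assuming numeric values paired with class labels (the function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())

def best_split_point(values, labels):
    """Scan midpoints of adjacent sorted values; return the one that
    minimizes the expected information requirement."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        point = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= point]   # D1: A <= split-point
        right = [l for v, l in pairs if v > point]   # D2: A >  split-point
        w = len(left) / len(pairs)
        split_info = w * entropy(left) + (1 - w) * entropy(right)
        if split_info < best_info:
            best_point, best_info = point, split_info
    return best_point

print(best_split_point([60, 70, 75, 85, 90, 95],
                       ["N", "N", "N", "Y", "Y", "Y"]))   # -> 80.0
```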

Gain Ratio for Attribute Selection (C4.5)

- The information gain measure is biased towards attributes with a large number of values
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):

    SplitInfo_A(D) = −Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

    GainRatio(A) = Gain(A) / SplitInfo_A(D)

- Ex.: gain_ratio(income) = 0.029 / 1.557 = 0.019
- The attribute with the maximum gain ratio is selected as the splitting attribute

Gini Index (CART, IBM IntelligentMiner)

- If a data set D contains examples from n classes, the gini index gini(D) is defined as

    gini(D) = 1 − Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in D
- If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

    gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

- Reduction in impurity:

    Δgini(A) = gini(D) − gini_A(D)

- The attribute that provides the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (need to enumerate all possible splitting points for each attribute)

Computation of Gini Index

- Ex.: D has 9 tuples in buys_computer = "yes" and 5 in "no":

    gini(D) = 1 − (9/14)² − (5/14)² = 0.459

- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

    gini_{income ∈ {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2)

- gini_{income ∈ {low,high}}(D) is 0.458 and gini_{income ∈ {medium,high}}(D) is 0.450; thus, split on the {low, medium} (and {high}) partition since it has the lowest gini index
- All attributes are assumed continuous-valued
- May need other tools, e.g., clustering, to get the possible split values
- Can be modified for categorical attributes
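A sketch of the gini computation (the 7/3 and 2/2 class counts inside D1 and D2 follow the textbook version of this example and are assumptions here):

```python
def gini(*counts):
    """gini = 1 - sum of squared relative class frequencies."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_d = gini(9, 5)                                        # 0.459
# D1: income in {low, medium} (10 tuples), D2: income = high (4 tuples)
gini_split = (10/14) * gini(7, 3) + (4/14) * gini(2, 2)    # 0.443
print(round(gini_d, 3), round(gini_split, 3))
```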

Comparing Attribute Selection Measures

- The three measures, in general, return good results, but:
  - Information gain: biased towards multivalued attributes
  - Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  - Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions

Other Attribute Selection Measures

- CHAID: a popular decision tree algorithm; measure based on the χ² test for independence
- C-SEP: performs better than information gain and gini index in certain cases
- G-statistic: has a close approximation to the χ² distribution
- MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  - The best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree
- Multivariate splits (partitioning based on combinations of multiple variables):
  - CART: finds multivariate splits based on a linear combination of attributes
- Which attribute selection measure is the best? Most give good results; none is significantly superior to the others

Overfitting and Tree Pruning

- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"

Decision Tree Based Classification

- Advantages
  - Inexpensive to construct
  - Extremely fast at classifying unknown records
  - Easy to interpret for small-sized trees
  - Accuracy is comparable to other classification techniques for many simple data sets

Chapter 8. Classification Basic Concepts

- Classification: Basic Concepts
- Decision Tree Induction
- Bayes Classification Methods
- Rule-Based Classification
- Model Evaluation and Selection
- Techniques to Improve Classification Accuracy: Ensemble Methods
- Summary

Bayesian Classification: Why?

- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
- Foundation: based on Bayes' theorem
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayes' Theorem: Basics

- Total probability theorem: P(B) = Σ_i P(B | A_i) P(A_i)
- Bayes' theorem: P(H | X) = P(X | H) P(H) / P(X)
- Let X be a data sample ("evidence"); its class label is unknown
- Let H be the hypothesis that X belongs to class C
- Classification is to determine P(H | X) (the posteriori probability): the probability that the hypothesis holds given the observed data sample X
- P(H) (prior probability): the initial probability
  - E.g., X will buy a computer, regardless of age, income, etc.
- P(X): the probability that the sample data is observed
- P(X | H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  - E.g., given that X will buy a computer, the probability that X is aged 31..40 with medium income

Prediction Based on Bayes Theorem

- Given training data X, the posteriori probability of a hypothesis H, P(H | X), follows Bayes' theorem:

    P(H | X) = P(X | H) P(H) / P(X)

- Informally, this can be viewed as: posteriori = likelihood × prior / evidence
- Predict that X belongs to C_i iff the probability P(C_i | X) is the highest among all the P(C_k | X) for all the k classes
- Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost

Classification Is to Derive the Maximum Posteriori

- Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-D attribute vector X = (x1, x2, ..., xn)
- Suppose there are m classes C1, C2, ..., Cm
- Classification is to derive the maximum posteriori, i.e., the maximal P(C_i | X)
- This can be derived from Bayes' theorem:

    P(C_i | X) = P(X | C_i) P(C_i) / P(X)

- Since P(X) is constant for all classes, only P(X | C_i) P(C_i) needs to be maximized

Naïve Bayes Classifier

- A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

    P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i) = P(x1 | C_i) × P(x2 | C_i) × ... × P(xn | C_i)

- This greatly reduces the computation cost: only the class distribution needs to be counted
- If A_k is categorical, P(x_k | C_i) is the number of tuples in C_i having value x_k for A_k, divided by |C_{i,D}| (the number of tuples of C_i in D)
- If A_k is continuous-valued, P(x_k | C_i) is usually computed from a Gaussian distribution with mean µ and standard deviation σ:

    g(x, µ, σ) = (1 / (√(2π) σ)) e^(−(x−µ)² / (2σ²))

  and P(x_k | C_i) = g(x_k, µ_{C_i}, σ_{C_i})

Naïve Bayes Classifier: Training Dataset

Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"

Data to be classified: X = (age = youth, income = medium, student = yes, credit_rating = fair)

Naïve Bayes Classifier: An Example

- P(C_i): P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357
- Compute P(X | C_i) for each class:
  - P(age = youth | buys_computer = "yes") = 2/9 = 0.222
  - P(age = youth | buys_computer = "no") = 3/5 = 0.6
  - P(income = medium | buys_computer = "yes") = 4/9 = 0.444
  - P(income = medium | buys_computer = "no") = 2/5 = 0.4
  - P(student = yes | buys_computer = "yes") = 6/9 = 0.667
  - P(student = yes | buys_computer = "no") = 1/5 = 0.2
  - P(credit_rating = fair | buys_computer = "yes") = 6/9 = 0.667
  - P(credit_rating = fair | buys_computer = "no") = 2/5 = 0.4
- X = (age = youth, income = medium, student = yes, credit_rating = fair)
- P(X | C_i):
  - P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  - P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
- P(X | C_i) P(C_i):
  - P(X | buys_computer = "yes") P(buys_computer = "yes") = 0.028
  - P(X | buys_computer = "no") P(buys_computer = "no") = 0.007
- Therefore, X belongs to the class buys_computer = "yes"
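The same arithmetic as a short sketch (the probabilities are taken from the slide; the dictionary layout is illustrative):

```python
# P(Ci) and P(x_k | Ci) from the 14-tuple buys_computer example
priors = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"age=youth": 2/9, "income=medium": 4/9,
            "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age=youth": 3/5, "income=medium": 2/5,
            "student=yes": 1/5, "credit=fair": 2/5},
}

def score(cls, x):
    """P(X | Ci) * P(Ci) under the conditional-independence assumption."""
    p = priors[cls]
    for feature in x:
        p *= cond[cls][feature]
    return p

x = ["age=youth", "income=medium", "student=yes", "credit=fair"]
scores = {c: score(c, x) for c in priors}   # yes: ~0.028, no: ~0.007
print(max(scores, key=scores.get))          # -> yes
```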

Avoiding the Zero-Probability Problem

- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero, since

    P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i)

- Ex.: suppose a dataset with 1000 tuples: income = low (0 tuples), income = medium (990), income = high (10)
- Use the Laplacian correction (or Laplacian estimator)
  - Add 1 to each case:
    - Prob(income = low) = 1/1003
    - Prob(income = medium) = 991/1003
    - Prob(income = high) = 11/1003
  - The "corrected" probability estimates are close to their "uncorrected" counterparts

Naïve Bayes Classifier: Comments

- Advantages
  - Easy to implement
  - Robust to noise
  - Can handle null values
  - Good results obtained in most of the cases
- Disadvantages
  - Assumption of class conditional independence, and therefore a loss of accuracy
  - Practically, dependencies exist among variables
    - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    - Dependencies among these cannot be modeled by a naïve Bayes classifier
- How to deal with these dependencies? Bayesian belief networks

Chapter 8. Classification Basic Concepts

- Classification: Basic Concepts
- Decision Tree Induction
- Bayes Classification Methods
- Rule-Based Classification
- Model Evaluation and Selection
- Techniques to Improve Classification Accuracy: Ensemble Methods
- Summary

Using IF-THEN Rules for Classification

- Represent the knowledge in the form of IF-THEN rules
  - R: IF age = youth AND student = yes THEN buys_computer = yes
  - Rule antecedent/precondition vs. rule consequent
- If a rule is satisfied by X, it covers the tuple and the rule is said to be triggered
- If R1 is the rule satisfied, then the rule fires by returning the class prediction
- Assessment of a rule: coverage and accuracy
  - n_covers = number of tuples covered by R
  - n_correct = number of tuples correctly classified by R
  - coverage(R) = n_covers / |D|, where D is the training data set
  - accuracy(R) = n_correct / n_covers

Using IF-THEN Rules for Classification

- If more than one rule is triggered, we need conflict resolution
  - Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
  - Rule ordering: prioritize rules beforehand
    - Class-based ordering: classes are sorted in order of decreasing "importance", such as order of prevalence or misclassification cost per class; within each class, rules are not ordered
    - Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality such as accuracy, coverage, or size. The first rule satisfying X fires its class prediction; any other rule satisfying X is ignored. Each rule in the list implies the negation of the rules that come before it, which makes decision lists difficult to interpret
- What if no rule is fired for X? Use a default rule!

Rule Extraction from a Decision Tree

- Rules are easier to understand than large trees
- One rule is created for each path from the root to a leaf: each attribute-value pair along the path forms a conjunct, the conjuncts are logically ANDed to form the rule antecedent, and the leaf holds the class prediction
- Rules extracted this way are mutually exclusive and exhaustive
  - Mutually exclusive: no two rules will be triggered for the same tuple
  - Exhaustive: there is one rule for each possible attribute-value combination, so no default rule is needed

Rule Extraction from a Decision Tree

- Example: rule extraction from our buys_computer decision tree:
  - IF age = young AND student = no THEN buys_computer = no
  - IF age = young AND student = yes THEN buys_computer = yes
  - IF age = middle_aged THEN buys_computer = yes
  - IF age = senior AND credit_rating = excellent THEN buys_computer = no
  - IF age = senior AND credit_rating = fair THEN buys_computer = yes

Rule Induction: Sequential Covering Method

- Sequential covering algorithm: extracts rules directly from training data
- Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
- Rules are learned sequentially; each rule for a given class C_i should cover many tuples of C_i but none (or few) of the tuples of other classes
- Steps:
  - Rules are learned one at a time
  - Each time a rule is learned, the tuples covered by the rule are removed
  - The process repeats on the remaining tuples until a termination condition holds, e.g., there are no more training examples, or the quality of a rule returned is below a user-specified threshold
- Contrast with decision-tree induction, which learns a whole set of rules simultaneously

Sequential Covering Algorithm

- When learning a rule for a class C, we would like the rule to cover all or most of the training tuples of class C and none or few of the tuples from other classes:

    while (enough target tuples left)
        generate a rule
        remove positive target tuples satisfying this rule

(The accompanying figure shows the positive examples being covered, in turn, by Rule 1, Rule 2, and Rule 3. A runnable version of the loop follows.)
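A minimal runnable version of the covering loop, assuming a learn_one_rule helper that returns a predicate covering mostly positive tuples (all names here are illustrative):

```python
def sequential_covering(examples, target_class, learn_one_rule):
    """examples: list of (attributes, class_label) pairs."""
    rules = []
    remaining = list(examples)
    # while enough target tuples are left, generate a rule ...
    while any(label == target_class for _, label in remaining):
        rule = learn_one_rule(remaining, target_class)
        if rule is None:          # no acceptable rule found: stop early
            break
        rules.append(rule)
        # ... and remove the positive tuples satisfying this rule
        remaining = [(x, label) for x, label in remaining
                     if not (rule(x) and label == target_class)]
    return rules
```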

How to Learn-One-Rule?

- Two approaches: specialization and generalization
- Specialization
  - Start with the most general rule possible: an empty antecedent ("IF <empty> THEN class = y")
  - The best attribute-value pair from a candidate list A is added to the antecedent
  - Continue until the rule performance measure can no longer be improved
  - E.g.:
    IF income = high THEN loan_decision = accept
    IF income = high AND credit_rating = excellent THEN loan_decision = accept
  - Greedy algorithm: always add the attribute-value pair that is best at the moment

How to Learn-One-Rule?

- Generalization
  - Start with a randomly selected positive tuple, converted to a rule that covers it
  - E.g., the tuple (overcast, high, false, P) can be converted to the rule:
    IF outlook = overcast AND humidity = high AND windy = false THEN class = P
  - Choose one attribute-value pair and remove it so that the rule covers more positive examples
  - Repeat the process until the rule starts to cover negative examples

How to Learn-One-Rule?

- Rule-quality measures: used to decide whether appending a test to a rule's condition will result in an improved rule; both accuracy and coverage matter
- Consider: R1 correctly classifies 38 of 40 tuples, whereas R2 covers 2 tuples and correctly classifies both. Which rule is better? Accuracy alone would prefer R2 (100% vs. 95%), yet R1 is clearly the stronger rule, so coverage must be considered as well
- Different measures: FOIL-gain, likelihood ratio statistic, chi-square statistic

How to Learn-One-Rule?

- Rule-quality measure FOIL-gain: checks whether ANDing a new condition to rule R (giving R') results in a better rule; it considers both coverage and accuracy
- FOIL-gain (used in FOIL and RIPPER) assesses the information gained by extending the condition:

    FOIL_Gain = pos' × ( log2(pos' / (pos' + neg')) − log2(pos / (pos + neg)) )

  where pos and neg are the numbers of positively and negatively covered tuples of R, and pos' and neg' are those of R'
- It favors rules that have high accuracy and cover many positive tuples
- There is no test set for evaluating rules, but rule pruning can be performed by removing a condition:

    FOIL_Prune(R) = (pos − neg) / (pos + neg)

  where pos and neg are the numbers of positive and negative tuples covered by R
- If FOIL_Prune is higher for the pruned version of R, prune R

Nearest Neighbour Approach

- General idea
  - The model: a set of training examples stored in memory
  - Lazy learning: the decision is delayed to the time of classification; in other words, there is no training!
  - To classify an unseen record: compute its proximity to all training examples and locate the 1 or k nearest-neighbour examples; the nearest neighbours determine the class of the record (e.g., by majority vote)
  - Rationale: "If it walks like a duck, quacks like a duck, and looks like a duck, it probably is a duck."

Nearest Neighbour Approach

- kNN classification algorithm:

    algorithm kNN (Tr: training set; k: integer; r: data record): Class
    begin
        for each training example t in Tr do
            calculate proximity d(t, r) on the descriptive attributes
        end for
        select the top k nearest neighbours into set D accordingly
        Class := majority class in D
        return Class
    end
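The same algorithm as runnable Python (Euclidean distance on numeric attributes; the function name and data layout are illustrative):

```python
import math
from collections import Counter

def knn(training, k, record):
    """training: list of (attribute_vector, class_label) pairs."""
    nearest = sorted(training, key=lambda t: math.dist(t[0], record))[:k]
    votes = Counter(label for _, label in nearest)   # majority class in D
    return votes.most_common(1)[0][0]

# classify (1.0, 1.2) against three stored examples, with k = 1
print(knn([((0, 0), "A"), ((1, 1), "B"), ((5, 5), "A")],
          k=1, record=(1.0, 1.2)))                   # -> "B"
```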

Nearest Neighbour Approach

- PEBLS algorithm
  - A class-based similarity measure is used
  - A nearest-neighbour algorithm (k = 1)
  - Examples in memory have weights (exemplars)
  - Simple training: assigning and refining weights
  - A different proximity measure
- Algorithm outline
  - Build value difference tables for the descriptive attributes (in preparation for measuring distances between examples)
  - For each training example, refine the weight of its nearest neighbour
  - Refine the weights of some training examples when classifying validation examples

Nearest Neighbour Approach

- PEBLS value difference table:

    d(V1, V2) = Σ_i | C_{i,V1} / C_{V1} − C_{i,V2} / C_{V2} |^r

  with r set to 1, where C_{V1} is the total number of examples with value V1, C_{V2} the total number with V2, and C_{i,V1}, C_{i,V2} the numbers of examples with V1 (resp. V2) belonging to class i.

Nearest Neighbour Approach

- PEBLS distance function:

    Δ(X, Y) = w_X w_Y Σ_{i=1}^{m} d(x_i, y_i)²

  where w_X and w_Y are the weights for X and Y, m is the number of attributes, and x_i, y_i are the values of the i-th attribute of X and Y.

    w_X = T / C

  where T is the total number of times that X is selected as the nearest neighbour, and C is the total number of times that X correctly classifies examples (so a reliable exemplar has w_X close to 1).

Nearest Neighbour Approach

- PEBLS Distance Function (Example)

Value difference tables. Assuming row1.weight = row2.weight = 1:

    Δ(row1, row2) = d(row1.outlook, row2.outlook)² + d(row1.temperature, row2.temperature)²
                  + d(row1.humidity, row2.humidity)² + d(row1.windy, row2.windy)²
                  = d(sunny, sunny)² + d(hot, hot)² + d(high, high)² + d(false, true)²
                  = 0 + 0 + 0 + (1/2)² = 1/4


Artificial Neural Network Approach

- Our brains are made up of about 100 billion tiny units called neurons.
- Each neuron is connected to thousands of other neurons and communicates with them via electrochemical signals.
- Signals coming into a neuron are received via junctions called synapses, which are located at the ends of branches of the neuron cell called dendrites.
- The neuron continuously receives signals from these inputs.
- The neuron sums up the inputs in some way and then, if the result is greater than some threshold value, the neuron "fires".
- It generates a voltage and outputs a signal along something called an axon.

Artificial Neural Network Approach

- General idea
  - The model: a network of connected artificial neurons
  - Training: select a specific network topology and use the training examples to tune the weights attached to the links connecting the neurons
  - To classify an unseen record X: feed the descriptive attribute values of the record into the network as inputs; the network computes an output value that can be converted to a class label

Artificial Neural Network Approach

- Artificial neuron (unit)

    Sum function: x = w1·i1 + w2·i2 + w3·i3
    Transformation function: sigmoid(x) = 1 / (1 + e^(−x))
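A one-line sketch of this unit (illustrative names; three inputs as above):

```python
import math

def neuron(inputs, weights):
    x = sum(w * i for w, i in zip(weights, inputs))  # sum function
    return 1 / (1 + math.exp(-x))                    # sigmoid transformation

print(neuron([0.5, 1.0, -1.0], [0.4, 0.6, 0.2]))     # about 0.646
```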

Artificial Neural Network Approach

- A neural network can have many hidden layers, but one layer is normally considered sufficient
- The more units a hidden layer has, the more capacity for pattern recognition
- Constant inputs can be fed into the units in the hidden and output layers as inputs
- A network with links only from lower layers to upper layers is a feed-forward network
- A network with links between nodes of the same layer is a recurrent network

Artificial Neural Network Approach

- Artificial neuron (perceptron)

    Sum function: x = w1·i1 + w2·i2 + w3·i3
    Transformation function: sigmoid(x) = 1 / (1 + e^(−x))

Artificial Neural Network Approach

- General principle for training an ANN:

    algorithm trainNetwork (Tr: training set): Network
    begin
        R := initial network with a particular topology
        initialise the weight vector with random values w(0)
        repeat
            for each training example t = <xi, yi> in Tr do
                compute the predicted class output ŷ(k)
                for each weight wj in the weight vector do
                    update the weight: wj(k+1) := wj(k) + λ (yi − ŷ(k)) xij
                end for
            end for
        until stopping criterion is met
        return R
    end

λ is the learning factor: the larger its value, the bigger the weight change at each update. (A runnable single-unit version follows.)
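A runnable single-unit (perceptron) version of the loop above, using a step output and the same delta update rule (the names and the AND-function demo are illustrative):

```python
import random

def train_network(examples, n_inputs, lam=0.1, epochs=100):
    """examples: list of (input_tuple, target) pairs; returns weights."""
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]   # w(0)
    for _ in range(epochs):                                    # repeat ...
        for x, y in examples:
            y_pred = 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0
            # wj(k+1) = wj(k) + lam * (yi - y_pred) * xij
            w = [wj + lam * (y - y_pred) * xj for wj, xj in zip(w, x)]
    return w

# learn the AND function; the first input is a constant bias of 1
data = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
print(train_network(data, n_inputs=3))
```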

Artificial Neural Network Approach

- Using ANN for classification
  - With multiple hidden layers we do not know the desired output of a hidden unit, and hence it is difficult to adjust its weights
    - Solution: back-propagation (layer by layer, starting from the output layer)
  - Model overfitting: use validation examples to further tune the weights in the network
  - Descriptive attributes should be normalized or converted to binary
  - Training examples are used repeatedly; the training cost is therefore very high
  - Difficulty in explaining classification decisions
Artificial Neural Network Approach

- Network topology
  - Number of nodes in the input layer: determined by the number and data types of the attributes
    - Continuous and binary attributes: 1 node for each attribute
    - Categorical attributes: convert to numeric or binary; an attribute with k labels needs at least log2(k) nodes
  - Number of nodes in the output layer: determined by the number of classes
    - For a 2-class problem: 1 node
    - For a k-class problem: at least log2(k) nodes
  - Number of hidden layers, and of nodes in the hidden layers: difficult to decide
  - In networks with hidden layers, weights are updated using back-propagation

Model Evaluation and Selection

- Evaluation metrics: how can we measure accuracy? What other metrics should be considered?
- Use a validation test set of class-labeled tuples instead of the training set when assessing accuracy
- Methods for estimating a classifier's accuracy:
  - Holdout method, random subsampling
  - Cross-validation
  - Bootstrap
- Comparing classifiers:
  - Confidence intervals
  - Cost-benefit analysis and ROC curves

Classifier Evaluation Metrics: Confusion Matrix

Confusion matrix:

    Actual class \ Predicted class | yes                  | no
    yes                            | True Positives (TP)  | False Negatives (FN)
    no                             | False Positives (FP) | True Negatives (TN)

Example confusion matrix:

    Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
    buy_computer = yes             | 6954               | 46                | 7000
    buy_computer = no              | 412                | 2588              | 3000
    Total                          | 7366               | 2634              | 10000

- TP and TN are the correctly predicted tuples
- A confusion matrix may have extra rows/columns to provide totals

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity, and Specificity

- Class imbalance problem:
  - One class may be rare, e.g., fraud or HIV-positive
  - There is a significant majority of the negative class and a minority of the positive class
- Sensitivity: true positive recognition rate; Sensitivity = TP/P
- Specificity: true negative recognition rate; Specificity = TN/N

    A \ P | Y  | N  |
    Y     | TP | FN | P
    N     | FP | TN | N
          | P' | N' | All

- Classifier accuracy, or recognition rate: the percentage of test set tuples that are correctly classified; Accuracy = (TP + TN) / All
- Error rate (misclassification rate) = 1 − accuracy; equivalently, Error rate = (FP + FN) / All

Classifier Evaluation Metrics: Precision and Recall, and F-measures

- Precision (exactness): what fraction of tuples that the classifier labeled as positive are actually positive?

    Precision = TP / (TP + FP)

- Recall (completeness): what fraction of positive tuples did the classifier label as positive?

    Recall = TP / (TP + FN)

- A perfect score is 1.0
- There is an inverse relationship between precision and recall
- F measure (F1 or F-score): the harmonic mean of precision and recall:

    F = 2 × precision × recall / (precision + recall)

- F_β: a weighted measure of precision and recall that assigns β times as much weight to recall as to precision:

    F_β = (1 + β²) × precision × recall / (β² × precision + recall)

Classifier Evaluation Metrics: Example

    Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
    cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
    cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
    Total                          | 230          | 9770        | 10000 | 96.50 (accuracy)

- Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00% (the sketch below checks these numbers)
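A sketch computing the metrics from this confusion matrix:

```python
TP, FN, FP, TN = 90, 210, 140, 9560
P, N = TP + FN, FP + TN          # actual positives / negatives

sensitivity = TP / P             # 0.3000
specificity = TN / N             # 0.9856
accuracy = (TP + TN) / (P + N)   # 0.9650
precision = TP / (TP + FP)       # 0.3913
recall = TP / (TP + FN)          # 0.3000 (same as sensitivity)
f1 = 2 * precision * recall / (precision + recall)   # 0.3396
print(sensitivity, specificity, accuracy, precision, recall, f1)
```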

Evaluating Classifier Accuracy: Holdout and Cross-Validation Methods

- Holdout method
  - The given data is randomly partitioned into two independent sets
    - Training set (e.g., 2/3) for model construction
    - Test set (e.g., 1/3) for accuracy estimation
  - Random subsampling: a variation of holdout; repeat holdout k times, and take the accuracy as the average of the accuracies obtained
- Cross-validation (k-fold, where k = 10 is most popular; see the sketch below)
  - Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size
  - At the i-th iteration, use Di as the test set and the others as the training set
  - Leave-one-out: k folds where k = the number of tuples; for small-sized data
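A minimal sketch of k-fold cross-validation; train and evaluate are stand-ins for any classifier-construction and accuracy-measurement functions (all names are illustrative):

```python
import random

def k_fold_cv(data, k, train, evaluate):
    """Returns the average accuracy over k folds."""
    data = data[:]
    random.shuffle(data)                     # random partition
    folds = [data[i::k] for i in range(k)]   # k mutually exclusive subsets
    accuracies = []
    for i in range(k):
        test = folds[i]                      # D_i as the test set ...
        training = [t for j, fold in enumerate(folds) if j != i
                    for t in fold]           # ... the others for training
        accuracies.append(evaluate(train(training), test))
    return sum(accuracies) / k
```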

Evaluating Classifier Accuracy Bootstrap

- Bootstrap
  - Works well with small data sets
  - Samples the given training tuples uniformly with replacement
    - i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
  - Examples used for the training set can therefore appear in the test set too

Ensemble Methods: Increasing the Accuracy

- Ensemble methods
  - Use a combination of models to increase accuracy
  - Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*
- Popular ensemble methods: bagging, boosting, and ensembles of heterogeneous classifiers

Classification of Class-Imbalanced Data Sets

- Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil spills, faults, etc.
- Traditional methods assume a balanced distribution of classes and equal error costs, so they are not suitable for class-imbalanced data
- Typical methods for imbalanced data in 2-class classification:
  - Oversampling: re-sampling of data from the positive class
  - Under-sampling: randomly eliminate tuples from the negative class

Model Selection: ROC Curves

- ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
- Originated from signal detection theory
- Shows the trade-off between the true positive rate and the false positive rate
  - The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
- The area under the ROC curve is a measure of the accuracy of the model
  - A model with perfect accuracy has an area of 1.0
  - The plot also shows a diagonal line: along it, for every TP we are equally likely to encounter an FP
  - The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model

Issues Affecting Model Selection

- Accuracy
  - classifier accuracy in predicting class labels
- Speed
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency with disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Comparison of Techniques

- Comparison of approaches
  - Model interpretability: ease of understanding classification decisions
  - Model maintainability: ease of modifying the model in the presence of new training examples
  - Training cost: the computational cost of building a model
  - Classification cost: the computational cost of classifying an unseen record


Decision Tree Induction in Weka

- Overview
  - ID3 (only works on categorical attributes)
  - J48 (Java implementation of C4.5)
  - RandomTree (with K attributes)
  - RandomForest (a forest of random trees)
  - REPTree (regression tree with reduced-error pruning)
  - BFTree (best-first tree, using Gain or Gini)
  - FT (functional tree, logistic regression as split nodes)
  - SimpleCart (CART with cost-complexity pruning)

Decision Tree Induction in Weka

- Preparation
  - Pre-process attributes if necessary
  - Specify the class attribute
  - Select attributes

Decision Tree Induction in Weka

- Constructing classification models (ID3)
  1. Choose a method and set its parameters
  2. Set a test option
  3. Start the process
  4. View the model and the evaluation results
  5. Select the option to view the tree

Decision Tree Induction in Weka

- J48 (unpruned tree)

Decision Tree Induction in Weka

- RandomTree

Decision Tree Induction in Weka

- Classifying unseen records
  - Prepare the unseen records in an ARFF file; class values are left as unknown ("?")

Decision Tree Induction in Weka

- Classifying unseen records
  - Classifying the unseen records in the file:
    - Select the supplied-test-set option and click the Set button
    - Press the button and load the file
    - Press Start to begin the classification

Decision Tree Induction in Weka

- Classifying unseen records
  - Saving classification results into a file:
    - Select the option to pop up the visualisation
    - Set both X and Y to instance_number
    - Save the results into a file

Decision Tree Induction in Weka

- Classifying unseen records
  - Classification results in an ARFF file: class labels assigned

Comparison of Techniques

- Comparison of performance in Weka
  - A system module known as the Experimenter
  - Designed for comparing the performance of classification techniques over a single data set or a collection of data sets
  - Data miners set up an experiment with:
    - Selected data set(s)
    - Selected algorithm(s) and the number of repeated runs
    - A selected test option (e.g., cross-validation)
    - A selected p value (indicating confidence)
  - Output: accuracy rates of the algorithms
  - Pairwise comparison of algorithms, with significantly better and worse accuracies marked out

Comparison of Techniques

- Setting up an experiment in Weka
  - New or existing experiment
  - Choose a test option
  - Name the file to store experiment results
  - Set the number of times each algorithm is repeated
  - Add data sets; the list of selected data sets is shown
  - Add algorithms; the list of selected algorithms is shown

Comparison of Techniques

- Experiment results in Weka
  - Choose the analysis method
  - Load the experiment data
  - Set the value of significance
  - Perform the analysis
  - Read the results of the pairwise comparisons

Classification in Practice

- Process of a classification project
  1. Locate data
  2. Prepare data
  3. Choose a classification method
  4. Construct the model and tune the model
  5. Measure its accuracy and go back to step 3 or 4 until the accuracy is satisfactory
  6. Further evaluate the model from other aspects such as complexity, comprehensibility, etc.
  7. Deliver the model and test it in the real environment; further modify the model if necessary

Classification in Practice

- Data preparation
  - Identify descriptive features (input attributes)
  - Identify or define the class
  - Determine the sizes of the training, validation and test sets
  - Select examples:
    - Spread and coverage of classes
    - Spread and coverage of attribute values
    - Null values
    - Noisy data
  - Prepare the input values (categorical to continuous, continuous to categorical)

References (1)

- C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
- C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
- L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
- P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95.
- H. Cheng, X. Yan, J. Han, and C.-W. Hsu. Discriminative Frequent Pattern Analysis for Effective Classification. ICDE'07.
- H. Cheng, X. Yan, J. Han, and P. S. Yu. Direct Discriminative Pattern Mining for Effective Classification. ICDE'08.
- W. Cohen. Fast effective rule induction. ICML'95.
- G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. SIGMOD'05.

References (3)

- T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000.
- J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994.
- M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. EDBT'96.
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
- S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1: 81-106, 1986.
- J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML'93.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96.

References (4)

- R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. VLDB'98.
- J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. VLDB'96.
- J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.
- P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
- S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufman, 1991.
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2ed. Morgan Kaufmann, 2005.
- X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03.
- H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03.