# CENG 464 Introduction to Data Mining - PowerPoint PPT Presentation

## CENG 464 Introduction to Data Mining

Created: 10/23/2013 · 109 slides

Transcript:
1
CENG 464 Introduction to Data Mining
2
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

3
Classification Definition
• Given a collection of records (training set)
• Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
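The train/test division described above can be sketched in a few lines of Python (a minimal illustration; the 2/3 vs. 1/3 ratio and the fixed seed are common conventions, not requirements):

```python
import random

def train_test_split(records, test_fraction=1/3, seed=42):
    """Randomly partition records into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = records[:]          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

data = list(range(30))             # 30 dummy records
train, test = train_test_split(data)
print(len(train), len(test))       # 20 10
```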

4
Classification Definition
5
Prediction Problems Classification vs. Numeric
Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
• Numeric Prediction
• models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
• Credit/loan approval
• Medical diagnosis: is a tumor cancerous or benign?
• Fraud detection: is a transaction fraudulent?
• Web page categorization: which category does a page belong to?

6
Classification: A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
• Estimate the accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: if the test set is used to select models, it is called a validation (test) set

7
Process (1) Model Construction
Classification Algorithms
IF rank = professor OR years > 6 THEN tenured = yes
8
Process (2) Using the Model in Prediction
(Jeff, Professor, 4)
Tenured?
9
Training and test sets are randomly sampled (supervised learning; accuracy is measured on the test set).
Classification: find a mapping or function y = f(X) that can predict the class label y of a given tuple X.
10
Classification Techniques
• Decision Tree based Methods
• Bayes Classification Methods
• Rule-based Methods
• Nearest-Neighbor Classifier
• Artificial Neural Networks
• Support Vector Machines
• Memory based reasoning

11
Example of a Decision Tree
Root node / internal nodes: attribute test conditions; leaf nodes: class labels.

Splitting attributes:
  Refund?
  ├─ Yes → NO
  └─ No → MarSt?
      ├─ Married → NO
      └─ Single, Divorced → TaxInc?
          ├─ < 80K → NO
          └─ > 80K → YES

Model: decision tree (learned from the training data)
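The tree on this slide can be written out as explicit code (a sketch; the attribute names Refund, MarSt, TaxInc and the 80K threshold come from the slide, and a record is assumed to be a dict):

```python
def classify(record):
    """Apply the slide's decision tree: Refund -> MarSt -> TaxInc."""
    if record["Refund"] == "Yes":
        return "NO"
    if record["MarSt"] == "Married":
        return "NO"
    # Single or Divorced: test the continuous attribute
    if record["TaxInc"] < 80_000:
        return "NO"
    return "YES"

print(classify({"Refund": "No", "MarSt": "Single", "TaxInc": 90_000}))  # YES
```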
12
Another Example of Decision Tree
(Attribute types: Refund and MarSt are categorical, TaxInc is continuous; Cheat is the class.)

  MarSt?
  ├─ Married → NO
  └─ Single, Divorced → Refund?
      ├─ Yes → NO
      └─ No → TaxInc?
          ├─ < 80K → NO
          └─ > 80K → YES

There could be more than one tree that fits the same data!
13
Decision Tree
14
Apply Model to Test Data
Test Data
Start from the root of the tree.
15-18
Apply Model to Test Data
Descend the tree one test at a time: at each internal node (Refund, then MarSt, then TaxInc if needed), follow the branch matching the test record's attribute value.
19
Apply Model to Test Data
The test record reaches the leaf NO: assign Cheat = No.
20
Decision Tree
21
Decision Tree Induction
• Many algorithms
• Hunt's Algorithm
• ID3, C4.5
• CART
• SLIQ, SPRINT

22
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are discretized in advance)
• Examples are partitioned recursively based on selected attributes
• Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning; majority voting is then employed for classifying the leaf
• There are no samples left
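The basic algorithm can be sketched as a short recursive function (an illustrative sketch, assuming categorical attributes, information gain as the selection measure, and records represented as dicts):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy from partitioning on a categorical attribute."""
    n = len(labels)
    after = 0.0
    for v in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        after += len(subset) / n * entropy(subset)
    return entropy(labels) - after

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:           # all samples in the same class
        return labels[0]
    if not attrs or not rows:           # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {}
    for v in set(r[best] for r in rows):     # partition recursively
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        rs, ls = zip(*sub)
        node[(best, v)] = build_tree(list(rs), list(ls), attrs - {best})
    return node

tree = build_tree([{"age": "youth"}, {"age": "senior"}], ["no", "yes"], {"age"})
assert tree[("age", "youth")] == "no" and tree[("age", "senior")] == "yes"
```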

23
Tree Induction
• Greedy strategy
• Split the records based on an attribute test that optimizes a certain criterion
• Issues
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting

24
How to Specify Test Condition?
• Depends on attribute types
• Nominal
• Ordinal
• Continuous
• Depends on number of ways to split
• 2-way split
• Multi-way split

25
Splitting Based on Nominal Attributes
• Multi-way split Use as many partitions as
distinct values.
• Binary split Divides values into two subsets.
Need to find optimal partitioning.

OR
26
Splitting Based on Ordinal Attributes
• Multi-way split Use as many partitions as
distinct values.
• Binary split Divides values into two subsets.
Need to find optimal partitioning.

OR
27
Splitting Based on Continuous Attributes
• Different ways of handling
• Discretization to form an ordinal categorical attribute
• Static: discretize once at the beginning
• Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
• Binary decision: (A < v) or (A ≥ v)
• consider all possible splits and find the best cut
• can be more compute-intensive

28
Splitting Based on Continuous Attributes
29
How to determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1
Which test condition is the best?
30
How to determine the Best Split
• Greedy approach
• Nodes with homogeneous class distribution are
preferred
• Need a measure of node impurity

Non-homogeneous, High degree of impurity
Homogeneous, Low degree of impurity
31
Attribute Selection: Splitting Rules (Measures of Node Impurity)
• Provide a ranking for each attribute describing the given training tuples; the attribute having the best score for the measure is chosen as the splitting attribute for the given tuples
• Information gain (entropy)
• Gini index
• Misclassification error

32
Brief Review of Entropy
• Entropy measures the uncertainty (impurity) of a distribution: H = −Σ_{i=1..m} p_i log2(p_i), where m is the number of classes (m = 2 for a binary class) and p_i is the probability of class i
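The definition translates directly into code (a minimal sketch; the convention 0·log 0 = 0 is made explicit):

```python
from math import log2

def entropy(probs):
    """H = -sum_i p_i log2(p_i) over the m class probabilities (0 log 0 := 0)."""
    s = sum(p * log2(p) for p in probs if p > 0)
    return -s if s else 0.0

print(entropy([0.5, 0.5]))              # 1.0: maximum uncertainty for m = 2
print(entropy([1.0]))                   # 0.0: a pure node
print(round(entropy([9/14, 5/14]), 3))  # 0.94 for a 9 vs. 5 class split
```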
33
Attribute Selection Measure Information Gain
(ID3/C4.5)
• Select the attribute with the highest information gain
• This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or impurity in these partitions
• Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
• Expected information (entropy) needed to classify a tuple in D: Info(D) = −Σ_{i=1..m} pi log2(pi)
• Information needed (after using A to split D into v partitions) to classify D: Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)
• Information gained by branching on attribute A: Gain(A) = Info(D) − Info_A(D)
34
Attribute Selection Information Gain
• The "youth" branch of age has 5 out of 14 samples, with 2 yeses and 3 nos, so it contributes the term (5/14) × Info(2, 3) to Info_age(D)
• Similar terms are computed for the other age groups
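The slide's arithmetic can be reproduced directly (a sketch; the class counts 9/5 overall and 2/3, 4/0, 3/2 per age group are taken from this example, which is the textbook buys_computer data set):

```python
from math import log2

def info(*counts):
    """Expected information Info(D) for the class counts in a partition."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

# Class distribution in D: 9 "yes", 5 "no"
info_D = info(9, 5)                                   # ~0.940 bits

# age splits D into youth (2 yes, 3 no), middle_aged (4, 0), senior (3, 2)
info_age = (5/14) * info(2, 3) + (4/14) * info(4, 0) + (5/14) * info(3, 2)
gain_age = info_D - info_age
print(round(gain_age, 3))   # 0.247 (the slides round down to 0.246)
```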

35
Computing Information-Gain for Continuous-Valued
Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
• Sort the values of A in increasing order
• Typically, the midpoint between each pair of adjacent values is considered as a possible split point
• (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
• The point with the minimum expected information requirement for A is selected as the split point for A
• Split:
• D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
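The midpoint search can be sketched as follows (an illustration; the sample values and labels are invented for the demo):

```python
from collections import Counter
from math import log2

def info(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and
    return the one minimizing the expected information requirement."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                       # identical values: no midpoint
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= mid]
        right = [l for v, l in pairs if v > mid]
        need = len(left) / n * info(left) + len(right) / n * info(right)
        best = min(best, (need, mid))
    return best[1]

print(best_split_point([60, 70, 85, 90, 95],
                       ["no", "no", "yes", "yes", "yes"]))   # 77.5
```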

36
Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain): SplitInfo_A(D) = −Σ_{j=1..v} (|Dj| / |D|) × log2(|Dj| / |D|)
• GainRatio(A) = Gain(A) / SplitInfo_A(D)
• Ex. gain_ratio(income) = 0.029/1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute

37
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the Gini index gini(D) is defined as gini(D) = 1 − Σ_{j=1..n} pj², where pj is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the Gini index of the split is defined as gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
• Reduction in impurity: Δgini(A) = gini(D) − gini_A(D)
• The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
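Both formulas fit in a few lines (a sketch; the 7 yes / 3 no breakdown of D1 in the second call is an assumption matching the textbook example, since the slide only gives the partition sizes):

```python
def gini(counts):
    """gini(D) = 1 - sum(p_j^2) over the class relative frequencies."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(part1, part2):
    """Weighted Gini index of a binary split D -> D1, D2."""
    n = sum(part1) + sum(part2)
    return sum(part1) / n * gini(part1) + sum(part2) / n * gini(part2)

print(round(gini([9, 5]), 3))                # 0.459 for 9 "yes" / 5 "no"
print(round(gini_split([7, 3], [2, 2]), 3))  # 0.443, assuming D1 has 7 yes / 3 no
```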

38
Computation of Gini Index
• Ex. D has 9 tuples in buys_computer = yes and 5 in no: gini(D) = 1 − (9/14)² − (5/14)² = 0.459
• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}
• Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450; thus, split on {low, medium} (and {high}) since it has the lowest Gini index
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes

39
Comparing Attribute Selection Measures
• The three measures, in general, return good results, but:
• Information gain:
• biased towards multivalued attributes
• Gain ratio:
• tends to prefer unbalanced splits in which one partition is much smaller than the others
• Gini index:
• biased towards multivalued attributes
• has difficulty when the # of classes is large
• tends to favor tests that result in equal-sized partitions and purity in both partitions

40
Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm; measure based on the χ² test for independence
• C-SEP: performs better than information gain and Gini index in certain cases
• G-statistic: has a close approximation to the χ² distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
• The best tree is the one that requires the fewest # of bits to both (1) encode the tree and (2) encode the exceptions to the tree
• Multivariate splits (partition based on multiple variable combinations):
• CART: finds multivariate splits based on a linear combination of attributes
• Which attribute selection measure is the best?
• Most give good results; none is significantly superior to the others

41
Overfitting and Tree Pruning
• Overfitting: an induced tree may overfit the training data
• Too many branches, some of which may reflect anomalies due to noise or outliers
• Poor accuracy for unseen samples
• Two approaches to avoid overfitting
• Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
• Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
• Use a set of data different from the training data to decide which is the best pruned tree

42
Decision Tree Based Classification
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification
techniques for many simple data sets

43
Chapter 8. Classification Basic Concepts
• Classification Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy
Ensemble Methods
• Summary

43
44
Bayesian Classification Why?
• A statistical classifier performs probabilistic
prediction, i.e., predicts class membership
probabilities
• Foundation Based on Bayes Theorem.
• Performance A simple Bayesian classifier, naïve
Bayesian classifier, has comparable performance
with decision tree and selected neural network
classifiers
• Incremental Each training example can
incrementally increase/decrease the probability
that a hypothesis is correct prior knowledge
can be combined with observed data
• Standard Even when Bayesian methods are
computationally intractable, they can provide a
standard of optimal decision making against which
other methods can be measured

45
Bayes' Theorem: Basics
• Total probability theorem: P(B) = Σ_i P(B|Ai) P(Ai)
• Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
• Let X be a data sample ("evidence"); its class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X) (i.e., the posterior probability): the probability that the hypothesis holds given the observed data sample X
• P(H) (prior probability): the initial probability
• E.g., X will buy a computer, regardless of age, income, …
• P(X): the probability that the sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
• E.g., given that X will buy a computer, the probability that X is 31..40 with medium income

46
Prediction Based on Bayes Theorem
• Given training data X, posteriori probability of
a hypothesis H, P(HX), follows the Bayes
theorem
• Informally, this can be viewed as
• posteriori likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability
P(CiX) is the highest among all the P(CkX) for
all the k classes
• Practical difficulty It requires initial
knowledge of many probabilities, involving
significant computational cost

47
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm
• Classification is to derive the maximum posterior, i.e., the maximal P(Ci|X)
• This can be derived from Bayes' theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized

48
Naïve Bayes Classifier
• A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes): P(X|Ci) = Π_{k=1..n} P(xk|Ci)
• This greatly reduces the computation cost: only count the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak, divided by |Ci,D| (the # of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean µ and standard deviation σ: g(x, µ, σ) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)), and P(xk|Ci) = g(xk, µ_Ci, σ_Ci)

49
Naïve Bayes Classifier Training Dataset
Data to be classified: X = (age = youth, income = medium, student = yes, credit_rating = fair)
50
Naïve Bayes Classifier An Example
• P(Ci): P(buys_computer = yes) = 9/14 = 0.643; P(buys_computer = no) = 5/14 = 0.357
• Compute P(X|Ci) for each class:
• P(age = youth | yes) = 2/9 = 0.222; P(age = youth | no) = 3/5 = 0.6
• P(income = medium | yes) = 4/9 = 0.444; P(income = medium | no) = 2/5 = 0.4
• P(student = yes | yes) = 6/9 = 0.667; P(student = yes | no) = 1/5 = 0.2
• P(credit_rating = fair | yes) = 6/9 = 0.667; P(credit_rating = fair | no) = 2/5 = 0.4
• X = (age = youth, income = medium, student = yes, credit_rating = fair)
• P(X|Ci): P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044; P(X | buys_computer = no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
• Therefore, X belongs to class (buys_computer = yes)
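The slide's arithmetic is easy to check in code (a sketch reusing the conditional probabilities listed above):

```python
# Conditional probabilities read off the training data on the slide
p_yes, p_no = 9/14, 5/14
px_yes = (2/9) * (4/9) * (6/9) * (6/9)   # age, income, student, credit | yes
px_no  = (3/5) * (2/5) * (1/5) * (2/5)   # the same attributes | no

score_yes = px_yes * p_yes               # P(X|yes) P(yes)
score_no  = px_no * p_no                 # P(X|no)  P(no)
label = "yes" if score_yes > score_no else "no"
print(round(px_yes, 3), round(px_no, 3), label)   # 0.044 0.019 yes
```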

51
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero
• Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
• Use the Laplacian correction (or Laplacian estimator)
• Adding 1 to each case:
• Prob(income = low) = 1/1003
• Prob(income = medium) = 991/1003
• Prob(income = high) = 11/1003
• The "corrected" probability estimates are close to their "uncorrected" counterparts
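The correction is one line of code (a minimal sketch of the slide's 1000-tuple income example):

```python
def laplace(count, total, n_values):
    """Laplacian-corrected estimate: add 1 to each of the n_values cases."""
    return (count + 1) / (total + n_values)

probs = [laplace(c, 1000, 3) for c in (0, 990, 10)]
print([round(p, 4) for p in probs])   # [0.001, 0.988, 0.011]
```

Note that the corrected estimates still sum to 1, and none of them is zero.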

52
Naïve Bayes Classifier: Comments
• Easy to implement
• Robust to noise
• Can handle null values
• Good results obtained in most of the cases
• Assumption: class conditional independence, therefore loss of accuracy
• Practically, dependencies exist among variables
• E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by a naïve Bayes classifier
• How to deal with these dependencies? Bayesian belief networks

53
Chapter 8. Classification Basic Concepts
• Classification Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy
Ensemble Methods
• Summary

53
54
Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
• R: IF age = youth AND student = yes THEN buys_computer = yes
• Rule antecedent/precondition vs. rule consequent
• If a rule is satisfied by X, it covers the tuple; the rule is said to be triggered
• If R1 is the rule satisfied, then the rule fires by returning the class prediction
• Assessment of a rule: coverage and accuracy
• ncovers = # of tuples covered by R
• ncorrect = # of tuples correctly classified by R
• coverage(R) = ncovers / |D|, where D is the training data set
• accuracy(R) = ncorrect / ncovers
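The two measures can be computed directly (a sketch; the rule and the four labelled tuples are hypothetical examples in the spirit of the slide):

```python
def rule_stats(rule, dataset):
    """coverage(R) = n_covers/|D|, accuracy(R) = n_correct/n_covers."""
    covered = [(x, y) for x, y in dataset if rule["condition"](x)]
    n_covers = len(covered)
    n_correct = sum(1 for _, y in covered if y == rule["consequent"])
    return n_covers / len(dataset), (n_correct / n_covers) if n_covers else 0.0

# Hypothetical rule: IF age = youth AND student = yes THEN buys_computer = yes
rule = {"condition": lambda x: x["age"] == "youth" and x["student"] == "yes",
        "consequent": "yes"}
data = [({"age": "youth", "student": "yes"}, "yes"),
        ({"age": "youth", "student": "yes"}, "no"),
        ({"age": "senior", "student": "no"}, "no"),
        ({"age": "youth", "student": "no"}, "no")]
coverage, accuracy = rule_stats(rule, data)
print(coverage, accuracy)   # 0.5 0.5
```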

55
Using IF-THEN Rules for Classification
• If more than one rule is triggered, we need conflict resolution
• Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
• Rule ordering: prioritize rules beforehand
• Class-based ordering: classes are sorted in order of decreasing "importance", such as order of prevalence or misclassification cost per class; within each class, rules are not ordered
• Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality such as accuracy, coverage, or size. The first rule satisfying X fires its class prediction; any other rule satisfying X is ignored. Each rule in the list implies the negation of the rules that come before it, which makes the list difficult to interpret
• What if no rule is fired for X? Use a default rule!

56
Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf; the tests along the path are logically ANDed to form the rule antecedent
• Each attribute-value pair along a path forms a conjunct; the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive
• Mutually exclusive: no two rules will be triggered for the same tuple
• Exhaustive: there is one rule for each possible attribute-value combination, so there is no need for a default rule

57
Rule Extraction from a Decision Tree
• Example: rule extraction from our buys_computer decision tree
• IF age = youth AND student = no THEN buys_computer = no
• IF age = youth AND student = yes THEN buys_computer = yes
• IF age = middle_aged THEN buys_computer = yes
• IF age = senior AND credit_rating = excellent THEN buys_computer = no
• IF age = senior AND credit_rating = fair THEN buys_computer = yes

58
Rule Induction Sequential Covering Method
• Sequential covering algorithm: extracts rules directly from training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
• Steps:
• Rules are learned one at a time
• Each time a rule is learned, the tuples covered by the rule are removed
• Repeat the process on the remaining tuples until a termination condition holds, e.g., there are no more training examples, or the quality of a rule returned is below a user-specified threshold
• Compare with decision-tree induction, which learns a set of rules simultaneously

59
Sequential Covering Algorithm
• When learning a rule for a class, C, we would
like the rule to cover all or most of the
training tuples of class C and none or few of the
tuples from other classes
• while (enough target tuples left)
• generate a rule
• remove positive target tuples satisfying this
rule

Examples covered by Rule 2
Examples covered by Rule 1
Examples covered by Rule 3
Positive examples
60
How to Learn-One-Rule?
• Two approaches
• Specialization: start with the most general rule possible (an empty antecedent): IF THEN class = y
• The best attribute-value pair is added from list A into the antecedent
• Continue until the rule performance measure cannot improve further
• IF income = high THEN loan_decision = accept
• IF income = high AND credit_rating = excellent THEN loan_decision = accept
• Greedy algorithm: always add the attribute-value pair that is best at the moment

61
How to Learn-One-Rule?
• Two approaches
• Generalization: start with a randomly selected positive tuple, converted to a rule that covers it
• Tuple (overcast, high, false, P) can be converted to a rule as IF outlook = overcast AND humidity = high AND windy = false THEN class = P
• Choose one attribute-value pair and remove it, so that the rule covers more positive examples
• Repeat the process until the rule starts to cover negative examples

62
How to Learn-One-Rule?
• Rule-quality measures
• used to decide if appending a test to the rule's condition will result in an improved rule: accuracy, coverage
• Consider: R1 correctly classifies 38 of 40 tuples, whereas R2 covers 2 tuples and correctly classifies all of them. Which rule is better? Is accuracy alone enough?
• Different measures: FOIL-gain, likelihood ratio statistic, chi-square statistic

63
How to Learn-One-Rule?
• Rule-quality measure: FOIL-gain checks if ANDing a new condition results in a better rule
• considers both coverage and accuracy
• FOIL-gain (in FOIL and RIPPER) assesses the information gained by extending the condition: FOIL_Gain = pos′ × (log2(pos′/(pos′ + neg′)) − log2(pos/(pos + neg)))
• pos and neg are the # of positively and negatively covered tuples of the original rule R
• pos′ and neg′ are the # of positively and negatively covered tuples of the extended rule R′
• favors rules that have high accuracy and cover many positive tuples
• There is no test set for evaluating rules, but rule pruning can be performed by removing a condition: FOIL_Prune(R) = (pos − neg)/(pos + neg)
• pos/neg are the # of positive/negative tuples covered by R
• If FOIL_Prune is higher for the pruned version of R, prune R

64
Nearest Neighbour Approach
• General idea
• The model: a set of training examples stored in memory
• Lazy learning: delaying the decision to the time of classification; in other words, there is no training!
• To classify an unseen record: compute its proximity to all training examples and locate the 1 or k nearest neighbour examples. The nearest neighbours determine the class of the record (e.g., by majority vote)
• Rationale: "If it walks like a duck, quacks like a duck, and looks like a duck, it probably is a duck."

65
Nearest Neighbour Approach
• kNN classification algorithm
• algorithm kNN (Tr: training set; k: integer; r: data record): Class
• begin
•   for each training example t in Tr do
•     calculate proximity d(t, r) upon descriptive attributes
•   end for
•   select the top k nearest neighbours into set D accordingly
•   Class := majority class in D
•   return Class
• end
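The pseudocode maps to a short Python function (a sketch assuming numeric attribute vectors and Euclidean distance as the proximity measure; the training points are invented for the demo):

```python
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

def knn_classify(training, k, record):
    """training: list of (attribute_vector, class) pairs.
    Returns the majority class among the k nearest neighbours of record."""
    neighbours = sorted(training, key=lambda t: dist(t[0], record))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B")]
print(knn_classify(train, 3, (1.1, 1.0)))   # A
```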

Class(?)
66
Nearest Neighbour Approach
• PEBLS algorithm
• A class-based similarity measure is used
• A nearest neighbour algorithm (k = 1)
• Examples in memory have weights (exemplars)
• Simple training: assigning and refining weights
• A different proximity measure
• Algorithm outline:
• Build value difference tables for descriptive attributes (in preparation for measuring distances between examples)
• For each training example, refine the weight of its nearest neighbour
• Refine the weights of some training examples when classifying validation examples

67
Nearest Neighbour Approach
• PEBLS Value Difference Table

d(V1, V2) = Σ_i |C_{i,V1}/C_{V1} − C_{i,V2}/C_{V2}|^r, with r set to 1.
C_{V1} = total number of examples with V1; C_{V2} = total number of examples with V2.
C_{i,V1} = total number of examples with V1 and of class i; C_{i,V2} = total number of examples with V2 and of class i.
68
Nearest Neighbour Approach
• PEBLS Distance Function

Δ(X, Y) = w_X w_Y Σ_{i=1..m} d(x_i, y_i)², where w_X, w_Y = the weights for X and Y; m = the number of attributes; x_i, y_i = the values of the i-th attribute for X and Y.
w_X = T/C, where T = the total number of times that X is selected as the nearest neighbour, and C = the total number of times that X correctly classifies examples.
69
Nearest Neighbour Approach
• PEBLS Distance Function (Example)

Value Difference Tables
Assuming row1.weight = row2.weight = 1:
Δ(row1, row2) = d(row1.outlook, row2.outlook)² + d(row1.temperature, row2.temperature)² + d(row1.humidity, row2.humidity)² + d(row1.windy, row2.windy)²
= d(sunny, sunny)² + d(hot, hot)² + d(high, high)² + d(false, true)²
= 0 + 0 + 0 + (1/2)² = 1/4
70
Nearest Neighbour Approach
• PEBLS Example

71
Artificial Neural Network Approach
• The human brain is made up of billions of simple processing units called neurons.
• Each neuron is connected to thousands of other neurons and communicates with them via electrochemical signals.
• Signals coming into the neuron are received via junctions called synapses; these in turn are located at the ends of branches of the neuron cell called dendrites.
• The neuron continuously receives signals from these inputs.
• What the neuron does is sum up the inputs to itself in some way and then, if the end result is greater than some threshold value, the neuron fires.
• It generates a voltage and outputs a signal along something called an axon.

72
Artificial Neural Network Approach
• General Idea
• The Model A network of connected artificial
neurons
• Training select a specific network topology and
use the training example to tune the weights
attached on the links connecting the neurons
• To classify an unseen record X, feed the
descriptive attribute values of the record into
the network as inputs. The network computes an
output value that can be converted to a class
label

73
Artificial Neural Network Approach
• Artificial Neuron (Unit)

Sum function: x = w1·i1 + w2·i2 + w3·i3
Transformation function: Sigmoid(x) = 1/(1 + e^(−x))
74
Artificial Neural Network Approach
• A neural network can have many hidden layers, but one layer is normally considered sufficient
• The more units a hidden layer has, the more pattern-recognition capacity
• Constant inputs can be fed into the units in the hidden and output layers as inputs
• A network with links from lower layers to upper layers is a feed-forward network
• A network with links between nodes of the same layer is a recurrent network

75
Artificial Neural Network Approach
• Artificial Neuron (Perceptron)

Sum function: x = w1·i1 + w2·i2 + w3·i3
Transformation function: Sigmoid(x) = 1/(1 + e^(−x))
76
Artificial Neural Network Approach
• General Principle for Training an ANN
• algorithm trainNetwork (Tr: training set): Network
• begin
•   R := initial network with a particular topology
•   initialise the weight vector with random values w(0)
•   repeat
•     for each training example t = <xi, yi> in Tr do
•       compute the predicted class output ŷ(k)
•       for each weight wj in the weight vector do
•         update the weight: wj(k+1) := wj(k) + λ(yi − ŷ(k))·xij
•       end for
•     end for
•   until stopping criterion is met
•   return R
• end

λ is the learning factor: the larger its value, the bigger the weight changes.
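The training loop above can be made concrete with a single-neuron (perceptron) version (a sketch; the step activation, learning rate, epoch count, and the AND-function training data are illustrative choices, not part of the slide):

```python
import random

def train_perceptron(examples, lr=0.1, epochs=100, seed=0):
    """examples: list of (inputs, target) with target 0 or 1.
    Weight update: w_j <- w_j + lr * (y_i - y_hat) * x_ij (plus a bias)."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n + 1)]   # last entry = bias
    for _ in range(epochs):
        for x, y in examples:
            s = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            y_hat = 1 if s > 0 else 0                    # step activation
            err = y - y_hat
            w = [wi + lr * err * xi for wi, xi in zip(w, x)] + [w[-1] + lr * err]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + w[-1] > 0 else 0

# Learn the linearly separable AND function
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(data)
print([predict(w, x) for x, _ in data])   # [0, 0, 0, 1]
```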
77
Artificial Neural Network Approach
• Using ANN for classification
• Multiple hidden layers
• The actual class values of hidden-layer units are not known, and hence their errors cannot be computed directly
• Solution: back-propagation (layer by layer, from the output layer)
• Model overfitting: use validation examples to further tune the weights in the network
• Descriptive attributes should be normalized or converted to binary
• Training examples are used repeatedly; the training cost is therefore very high
• Difficulty in explaining classification decisions

78
Artificial Neural Network Approach
• Network topology
• # of nodes in the input layer: determined by the number and data types of attributes
• Continuous and binary attributes: 1 node for each attribute
• Categorical attribute: convert to numeric or binary
• An attribute with k labels needs at least log2(k) nodes
• # of nodes in the output layer: determined by the # of classes
• For a 2-class solution: 1 node
• For a k-class solution: at least log2(k) nodes
• # of hidden layers and nodes in the hidden layers: difficult to decide
• In networks with hidden layers, weights are updated using back-propagation

79
Model Evaluation and Selection
• Evaluation metrics: how can we measure accuracy? Other metrics to consider?
• Use a validation test set of class-labeled tuples instead of the training set when assessing accuracy
• Methods for estimating a classifier's accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC curves

79
80
Classifier Evaluation Metrics Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class | yes | no
yes | True Positives (TP) | False Negatives (FN)
no | False Positives (FP) | True Negatives (TN)

Example of Confusion Matrix (column totals): 7366 predicted yes, 2634 predicted no, 10000 in total
• TP and TN are the correctly predicted tuples
• May have extra rows/columns to provide totals

80
81
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
• Class imbalance problem:
• One class may be rare, e.g., fraud or HIV-positive
• Significant majority of the negative class and minority of the positive class
• Sensitivity: true positive recognition rate = TP/P
• Specificity: true negative recognition rate = TN/N

A \ P | Y | N | total
Y | TP | FN | P
N | FP | TN | N
total | P′ | N′ | All

• Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified: Accuracy = (TP + TN)/All
• Error rate (misclassification rate) = 1 − accuracy, or Error rate = (FP + FN)/All

81
82
Classifier Evaluation Metrics: Precision and Recall, and F-measures
• Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive? Precision = TP/(TP + FP)
• Recall (completeness): what % of positive tuples did the classifier label as positive? Recall = TP/(TP + FN)
• A perfect score is 1.0
• Inverse relationship between precision and recall
• F measure (F1 or F-score): harmonic mean of precision and recall: F = 2 × precision × recall / (precision + recall)
• Fβ: weighted measure of precision and recall: Fβ = (1 + β²) × precision × recall / (β² × precision + recall)
• assigns β times as much weight to recall as to precision

82
83
Classifier Evaluation Metrics Example
Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes | 90 | 210 | 300 | 30.00 (sensitivity)
cancer = no | 140 | 9560 | 9700 | 98.56 (specificity)
Total | 230 | 9770 | 10000 | 96.40 (accuracy)
• Precision = ? Recall = ?
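The open question on the slide can be answered by plugging the table's counts into the formulas (cancer = yes is taken as the positive class):

```python
TP, FN, FP, TN = 90, 210, 140, 9560     # counts from the cancer example

precision = TP / (TP + FP)              # 90 / 230
recall    = TP / (TP + FN)              # 90 / 300 (= sensitivity, 30.00%)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Despite the 96.40% overall accuracy, precision is only about 0.39 and recall 0.30, which is why accuracy alone is misleading on imbalanced data.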

83
84
Evaluating Classifier Accuracy: Holdout and Cross-Validation Methods
• Holdout method
• The given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random subsampling: a variation of holdout
• Repeat holdout k times; accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
• At the i-th iteration, use Di as the test set and the others as the training set
• Leave-one-out: k folds where k = # of tuples, for small-sized data
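The k-fold partitioning step can be sketched as follows (an illustration; evaluating a model on each fold is left out):

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k mutually exclusive folds
    of approximately equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(14, 10)
# At iteration i, folds[i] is the test set; the other folds form the training set
test = set(folds[0])
train = set(j for i, f in enumerate(folds) if i != 0 for j in f)
assert test.isdisjoint(train) and (test | train) == set(range(14))
```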

84
85
Evaluating Classifier Accuracy Bootstrap
• Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
• Examples used for the training set can be used for the test set too

85
86
Ensemble Methods Increasing the Accuracy
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
• Popular ensemble methods:
• Bagging, boosting, ensembles of heterogeneous classifiers

86
87
Classification of Class-Imbalanced Data Sets
• Class imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil spills, faults, etc.
• Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalanced data in 2-class classification:
• Oversampling: re-sampling of data from the positive class
• Under-sampling: randomly eliminate tuples from the negative class

87
88
Model Selection: ROC Curves
• ROC (Receiver Operating Characteristic) curves:
for visual comparison of classification models
• Originated from signal detection theory
• Vertical axis represents the true positive rate;
horizontal axis represents the false positive rate
• Shows the trade-off between the true positive
rate and the false positive rate
• The plot also shows a diagonal line: for every TP,
the model is equally likely to encounter an FP
(random guessing)
• The area under the ROC curve is a measure of the
accuracy of the model
• A model with perfect accuracy will have an area
of 1.0
• The closer the curve is to the diagonal line (i.e.,
the closer the area is to 0.5), the less accurate
the model
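The curve and its area can be computed directly from model scores; a minimal sketch (toy scores and labels; ties among scores are broken optimistically, which is fine for illustration):

```python
def roc_points(scores, labels):
    """Sweep the decision threshold from high to low; after each
    tuple emit (FPR, TPR). Scores are model-assigned probabilities
    of the positive class; labels are 1/0."""
    P = sum(labels)
    N = len(labels) - P
    pts = [(0.0, 0.0)]
    tp = fp = 0
    for s, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / N, tp / P))
    return pts

def auc(pts):
    """Area under the ROC curve via the trapezoid rule:
    0.5 = diagonal (random guessing), 1.0 = perfect ranking."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1,   1,   0,   1,   0,    0]
pts = roc_points(scores, labels)
print(round(auc(pts), 3))   # 8/9 ≈ 0.889 for this toy ranking
```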

88
89
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class labels
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction
time)
• Robustness: handling noise and missing values
• Scalability: efficiency with disk-resident
databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as
decision tree size or compactness of
classification rules

89
90
Comparison of Techniques
• Comparison of Approaches

Model interpretability: ease of understanding
classification decisions
Model maintainability: ease of modifying the model
in the presence of new training examples
Training cost: computational cost for building a
model
Classification cost: computational cost for
classifying an unseen record
92
Decision Tree Induction in Weka
• Overview
• ID3 (only works with categorical attributes)
• J48 (Java implementation of C4.5)
• RandomTree (considers K randomly chosen
attributes at each node)
• RandomForest (a forest of random trees)
• REPTree (regression tree with reduced-error
pruning)
• BFTree (best-first tree, using information gain
or the Gini index)
• FT (functional tree, logistic regression at split
nodes)
• SimpleCart (CART with cost-complexity pruning)

93
Decision Tree Induction in Weka
• Preparation

Pre-processing attributes if necessary
Specifying the class attribute
Selecting attributes
94
Decision Tree Induction in Weka
• Constructing Classification Models (ID3)

1. Choosing a method and setting parameters
2. Setting a test option
3. Starting the process
4. Viewing the model and evaluation results
5. Selecting the option to view the tree
95
Decision Tree Induction in Weka
• J48 (unpruned tree)

96
Decision Tree Induction in Weka
• RandomTree

97
Decision Tree Induction in Weka
• Classifying Unseen Records
• Preparing unseen records in an ARFF file

Class values are left as unknown (?)
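For illustration, such an unseen-records file might look like this (hypothetical weather-style attributes; the key point is the `?` in the class column):

```
@relation weather-unseen
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,?
overcast,64,?
```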
98
Decision Tree Induction in Weka
• Classifying Unseen Records
• Classifying unseen records in the file
1. Selecting this option and clicking the Set button
2. Pressing the button and loading the file
3. Pressing the button to start the classification

99
Decision Tree Induction in Weka
• Classifying Unseen Records
• Saving Classification Results into a file
1. Selecting the option to pop up the visualisation
2. Setting both X and Y to instance_number
3. Saving the results into a file

100
Decision Tree Induction in Weka
• Classifying Unseen Records
• Classification Results in an ARFF file

Class labels assigned
101
Comparison of Techniques
• Comparison of Performance in Weka
• A system module known as the Experimenter
• Designed for comparing the performance of
classification techniques over a single data set
or a collection of data sets
• Data miners set up an experiment with
• Selected data set(s)
• Selected algorithm(s) and the number of repeated
runs
• Selected test option (e.g., cross-validation)
• Selected p value (indicating confidence)
• Output: accuracy rates of the algorithms
• Pairwise comparison of algorithms, with
significantly better and worse accuracies marked out

102
Comparison of Techniques
• Setting up Experiment in Weka

New or existing experiment
Choosing a test option
Naming the file to store experiment results
No. of times each algorithm repeated
The list of data sets selected
The list of selected algorithms
103
Comparison of Techniques
• Experiment Results in Weka

Analysis method
Value of significance
Performing the Analysis
Results of Pairwise Comparisons
104
Classification in Practice
• Process of a Classification Project
• Locate data
• Prepare data
• Choose a classification method
• Construct the model and tune the model
• Measure its accuracy and go back to step 3 or 4
until the accuracy is satisfactory
• Further evaluate the model from other aspects
such as complexity, comprehensibility, etc.
• Deliver the model and test it in a real
environment. Further modify the model if
necessary

105
Classification in Practice
• Data Preparation
• Identify descriptive features (input attributes)
• Identify or define the class
• Determine the sizes of the training, validation
and test sets
• Select examples
• Spread and coverage of classes
• Spread and coverage of attribute values
• Null values
• Noisy data
• Prepare the input values (categorical to
continuous, continuous to categorical)

106
References (1)
• C. Apte and S. Weiss. Data mining with decision
trees and decision rules. Future Generation
Computer Systems, 13, 1997
• C. M. Bishop, Neural Networks for Pattern
Recognition. Oxford University Press, 1995
• L. Breiman, J. Friedman, R. Olshen, and C. Stone.
Classification and Regression Trees. Wadsworth
International Group, 1984
• C. J. C. Burges. A Tutorial on Support Vector
Machines for Pattern Recognition. Data Mining and
Knowledge Discovery, 2(2):121-168, 1998
• P. K. Chan and S. J. Stolfo. Learning arbiter and
combiner trees from partitioned data for scaling
machine learning. KDD'95
• H. Cheng, X. Yan, J. Han, and C.-W. Hsu,
Discriminative Frequent Pattern Analysis for
Effective Classification, ICDE'07
• H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct
Discriminative Pattern Mining for Effective
Classification, ICDE'08
• W. Cohen. Fast effective rule induction. ICML'95
• G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu.
Mining top-k covering rule groups for gene
expression data. SIGMOD'05

106
107
References (3)
• T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A
comparison of prediction accuracy, complexity,
and training time of thirty-three old and new
classification algorithms. Machine Learning,
2000.
• J. Magidson. The CHAID approach to segmentation
modeling: Chi-squared automatic interaction
detection. In R. P. Bagozzi, editor, Advanced
Methods of Marketing Research, Blackwell
• M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A
fast scalable classifier for data mining.
EDBT'96.
• T. M. Mitchell. Machine Learning. McGraw Hill,
1997.
• S. K. Murthy. Automatic Construction of Decision
Trees from Data: A Multi-Disciplinary Survey.
Data Mining and Knowledge Discovery, 2(4):
345-389, 1998
• J. R. Quinlan. Induction of decision trees.
Machine Learning, 1:81-106, 1986.
• J. R. Quinlan and R. M. Cameron-Jones. FOIL: A
midterm report. ECML'93.
• J. R. Quinlan. C4.5 Programs for Machine
Learning. Morgan Kaufmann, 1993.
• J. R. Quinlan. Bagging, boosting, and c4.5.
AAAI'96.

107
108
References (4)
• R. Rastogi and K. Shim. PUBLIC: A decision tree
classifier that integrates building and pruning.
VLDB'98.
• J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A
scalable parallel classifier for data mining.
VLDB'96.
• J. W. Shavlik and T. G. Dietterich. Readings in
Machine Learning. Morgan Kaufmann, 1990.
• P. Tan, M. Steinbach, and V. Kumar. Introduction
to Data Mining. Addison Wesley, 2005.
• S. M. Weiss and C. A. Kulikowski. Computer
Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets,
Machine Learning, and Expert Systems. Morgan
Kaufmann, 1991.
• S. M. Weiss and N. Indurkhya. Predictive Data
Mining. Morgan Kaufmann, 1997.
• I. H. Witten and E. Frank. Data Mining: Practical
Machine Learning Tools and Techniques, 2nd ed.
Morgan Kaufmann, 2005.
• X. Yin and J. Han. CPAR: Classification based on
predictive association rules. SDM'03
• H. Yu, J. Yang, and J. Han. Classifying large
data sets using SVM with hierarchical clusters.
KDD'03.

108