1
Chap. 6 Classification and Prediction
  • Data Mining

2
Classification vs. Prediction
  • Classification
  • Predicts categorical class labels
  • Constructs a model based on the training set,
    tests the model by using the test set, and uses
    it in classifying new data
  • Prediction
  • Models continuous-valued functions
  • Typical Applications
  • Credit approval
  • Target marketing
  • Medical diagnosis

3
Classification
  • Model construction
  • The set of tuples used for model construction is
    the training set
  • Each tuple/sample belongs to a predefined
    class, as determined by its class label
  • The model is represented as classification rules,
    decision trees, or mathematical formulae
  • Model usage
  • Classify future or unknown data
  • Estimate accuracy of the model - a test set with
    known class labels is classified, and the result
    is compared with the true labels
  • Accuracy rate = the percentage of test set samples
    that are correctly classified by the model

4
Model Construction
[Figure: training data fed to a classification algorithm, producing a
classifier (model), e.g.:]
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
5
Use the Model in Prediction
[Figure: the classifier applied to unseen data - (Jeff, Professor, 4) →
Tenured? → Yes]
6
Supervised vs. Unsupervised
  • Supervised learning (classification)
  • Supervision: the training data (observations,
    measurements, etc.) are accompanied by labels
    indicating the class
  • New data is classified based on the training set
  • Unsupervised learning (clustering)
  • The class labels of the training data are not given
  • Given a set of measurements, observations, etc.
    → establish the existence of classes or
    clusters in the data

7
Preparing the Data
  • Data cleaning
  • Preprocess data in order to reduce noise and
    handle missing values
  • Relevance analysis (feature selection)
  • Remove the irrelevant or redundant attributes
  • Data transformation
  • Generalize and/or normalize data
  • Ex> income → {low, medium, high}
  • Ex> income, age (integers) → [0.0, 1.0]

8
Comparing Classification Methods
  • Predictive accuracy
  • Speed
  • Time to construct the model / use the model
  • Robustness
  • Handling noise and missing values
  • Scalability
  • Efficiency for large, disk-resident databases
  • Interpretability
  • Understanding and insight provided by the model

9
Decision Tree Induction
  • Decision tree
  • Internal node - a test on an attribute
  • Branch - an outcome of the test
  • Leaf nodes - class labels
  • Decision tree generation - two phases
  • Tree construction
  • At start, all the training examples are at the
    root
  • Partition examples recursively based on selected
    attributes
  • Tree pruning
  • Identify and remove branches that reflect noise
    or outliers
  • Classifying an unknown sample
  • Test the attribute values of the sample against
    the decision tree

10
Training Dataset
[Table: 14 training samples with attributes age, income, student,
credit_rating and the class label buys_computer]
11
Output: A Decision Tree for buys_computer
[Figure: the decision tree rooted at age, as derived on the following slides]
X = (age = <30, income = medium, student = yes, credit = fair) → classified
as Yes
12
Algorithm for DT Induction
  • Basic algorithm
  • Top-down recursive divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Test attributes are selected on the basis of a
    heuristic or statistical measure (e.g.,
    information gain)
  • Examples are partitioned recursively based on
    selected attributes
  • Conditions for stopping partitioning
  • All samples belong to the same class
  • No samples left → label with the majority class of
    the parent node
  • No attributes left → label with the majority class
    of the remaining samples (see the sketch below)
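A minimal Python sketch of this top-down induction for categorical attributes (the function names and the dict-based sample encoding are illustrative, not from the slides):

```python
# Minimal ID3-style decision tree induction (illustrative sketch).
from collections import Counter
from math import log2

def entropy(labels):
    """I(s1, ..., sn) = -sum(pi * log2(pi)) over the class distribution."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = I(S) - E(A), where E(A) is the weighted entropy after splitting on A."""
    total = len(labels)
    expected = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        expected += len(subset) / total * entropy(subset)
    return entropy(labels) - expected

def build_tree(rows, labels, attributes):
    # Stop: all samples belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left -> majority class of the remaining samples.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Select the attribute with the largest information gain.
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        part = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in part]
        sub_labels = [l for _, l in part]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(sub_rows, sub_labels, remaining)
    return tree

# Usage: rows are dicts such as {"age": "<30", "student": "yes", ...},
# labels are the corresponding buys_computer values.
```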

13
Algorithm for DT Induction - Example
Step 1 - split on age:
  <30    → 9,11 (Yes); 1,2,8 (No)
  30..40 → 3,7,12,13 (Yes)
  >40    → 4,5,10 (Yes); 6,14 (No)
Step 2 - split the impure branches:
  age <30    → student?       : no → 1,2,8 (No); yes → 9,11 (Yes)
  age 30..40 → Yes
  age >40    → credit rating? : fair → 4,5,10 (Yes); excellent → 6,14 (No)
14
Information Gain (ID3/C4.5)
  • S contains s data samples from classes C1, C2,
    ..., Cn
  • Number of samples in class Ci : si
  • Probability that a sample belongs to Ci : pi = si / s
  • Information required to classify a sample in Ci
  • Expected information to classify a sample in S
    (entropy of S) - see the formulas below
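In standard ID3 notation, these two quantities are:

$$ -\log_2 p_i \qquad\text{and}\qquad I(s_1,\ldots,s_n) \;=\; -\sum_{i=1}^{n} p_i \log_2 p_i $$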

15
Information Gain (ID3/C4.5)
  • Expected entropy after checking the value of
    attribute A
  • Average entropy of the subsets S1, S2, ..., Sv after
    partitioning S using attribute A with values
    a1, a2, ..., av
  • Information gained by branching on attribute A
  • Select the attribute with the largest gain - see
    the formulas below
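In the same notation, with attribute A taking values a1, ..., av and s_ij samples of class Ci in subset Sj:

$$ E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{nj}}{s}\, I(s_{1j}, \ldots, s_{nj}), \qquad \mathrm{Gain}(A) = I(s_1, \ldots, s_n) - E(A) $$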

16
Attribute Selection by Information Gain - Example
  • C1: buys_computer = yes, C2: buys_computer = no
  • I(S) = I(9,5) = 0.940
  • E(age) = 0.694
  • Gain(age) = I(9,5) - E(age) = 0.246
  • Gain(income) = 0.03, Gain(student) = 0.15
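These values can be checked against the class counts used on the earlier slides (9 yes / 5 no overall; the age partition has 2/3, 4/0, and 3/2 yes/no samples):

$$ I(9,5) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 $$
$$ E(\text{age}) = \tfrac{5}{14} I(2,3) + \tfrac{4}{14} I(4,0) + \tfrac{5}{14} I(3,2) \approx 0.694, \qquad \mathrm{Gain}(\text{age}) \approx 0.246 $$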

17
Information Gain for Continuous-Value Attributes
  • Let attribute A be a continuous-valued attribute
  • Must determine the best split point for A
  • Sort the values of A in increasing order
  • Typically, the midpoint between each pair of
    adjacent values is considered as a possible split
    point
  • (ai + ai+1)/2 is the midpoint between the values of
    ai and ai+1
  • The point with the minimum expected information
    requirement for A is selected as the split-point
    for A
  • Split
  • D1 is the set of tuples in D satisfying A ≤
    split-point, and D2 is the set of tuples in D
    satisfying A > split-point (see the sketch below)
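A minimal sketch of the midpoint-candidate search described above, reusing the illustrative entropy helper from the earlier tree sketch:

```python
# Choose the best split point for a continuous attribute (illustrative sketch).
def best_split_point(values, labels):
    """Return the midpoint split with minimum expected information (entropy)."""
    pairs = sorted(zip(values, labels))
    best, best_exp_info = None, float("inf")
    for (a_i, _), (a_next, _) in zip(pairs, pairs[1:]):
        if a_i == a_next:
            continue
        mid = (a_i + a_next) / 2.0                    # candidate split point
        left = [lab for v, lab in pairs if v <= mid]  # D1: A <= split-point
        right = [lab for v, lab in pairs if v > mid]  # D2: A >  split-point
        exp_info = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if exp_info < best_exp_info:
            best, best_exp_info = mid, exp_info
    return best
```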

18
Gain Ratio for Attribute Selection
  • Information gain measure is biased towards
    attributes with a large number of values
  • C4.5 uses gain ratio to overcome the problem
    (normalization to information gain)
  • GainRatio(A) = Gain(A) / SplitInfo(A)
  • Ex.
  • gain_ratio(income) = 0.029/0.926 = 0.031
  • The attribute with the maximum gain ratio is
    selected as the splitting attribute
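For reference, the standard C4.5 split information used as the normalizer is:

$$ \mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|} $$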

19
Extracting Classification Rules
  • Represent the knowledge in the form of IF-THEN
    rules
  • One rule is created for each path from the root
    to a leaf
  • Each attribute-value pair along a path forms a
    conjunction
  • The leaf node holds the class prediction
  • Rules are easier for humans to understand
  • Example
  • IF age = '<30' AND student = 'no' THEN
    buys_computer = 'no'
  • IF age = '<30' AND student = 'yes' THEN
    buys_computer = 'yes'
  • IF age = '31..40' THEN buys_computer = 'yes'
  • IF age = '>40' AND credit = 'excellent' THEN
    buys_computer = 'no'
  • IF age = '>40' AND credit = 'fair' THEN
    buys_computer = 'yes'

20
Avoid Overfitting
  • The generated tree may overfit the training data
  • Too many branches, some may reflect anomalies due
    to noise or outliers
  • Results in poor accuracy for unseen samples
  • Prepruning
  • Halt tree construction early - do not split a node
    if this would result in the goodness measure
    falling below a threshold
  • Postpruning
  • Remove branches from a fully grown tree
  • If pruning a node leads to a smaller error rate
    (on a test set), prune it

21
Discussion on DT
  • Advantages
  • Convertible to understandable classification
    rules
  • Relatively faster learning/classification speed
  • Disadvantages
  • Sensitive (not robust) to noise
  • Continuous-valued attributes - dynamically
    partition the continuous attribute value into a
    discrete set of intervals

22
Bayes Theorem
  • Given data X, we want to know P(h|X)
  • h: the hypothesis that X belongs to class C
  • Posterior probability of hypothesis h: P(h|X)
  • MAP (maximum a posteriori) hypothesis
  • Assign X to the h with maximum P(h|X) → Bayesian
    classifier
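The formulas behind these bullets are Bayes' theorem and the MAP rule:

$$ P(h \mid X) = \frac{P(X \mid h)\, P(h)}{P(X)}, \qquad h_{\mathrm{MAP}} = \arg\max_h P(h \mid X) = \arg\max_h P(X \mid h)\, P(h) $$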

23
Naïve Bayesian Classifier
  • Assumption: attributes are conditionally
    independent given the class
  • Steps
  • Compute P(Ck) and P(xi|Ck) for all xi and Ck from
    the training samples
  • To classify an unknown sample X = (x1, x2, ..., xn),
    compute P(Ck) Πi P(xi|Ck) for each class
  • Assign X to the Ck with maximum probability (see
    the sketch below)
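A minimal sketch of these steps for categorical attributes (function and variable names are illustrative; no Laplace correction is applied):

```python
# Minimal naive Bayesian classifier over categorical attributes (illustrative sketch).
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(Ck) and P(xi | Ck) from the training samples."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    cond = defaultdict(Counter)            # (class, attribute) -> value counts
    for row, c in zip(rows, labels):
        for attr, value in row.items():
            cond[(c, attr)][value] += 1
    return priors, cond

def classify_nb(x, priors, cond):
    """Assign x to the class with maximum P(Ck) * prod_i P(xi | Ck)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            counts = cond[(c, attr)]
            score *= counts[value] / sum(counts.values()) if counts else 0.0
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Usage with the deck's example attributes:
# priors, cond = train_nb(rows, labels)
# classify_nb({"age": "<30", "income": "medium", "student": "yes",
#              "credit": "fair"}, priors, cond)
```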

24
Naïve Bayesian Classifier - Example
[Table: the 14 training samples from the earlier Training Dataset slide]
25
Naïve Bayesian Classifier - Example
  • Compute P(Ck), P(xi|Ck)
  • P(Yes) = 9/14 = 0.643
  • P(No) = 5/14 = 0.357
  • P(age = <30 | Yes) = 2/9 = 0.222
  • P(age = <30 | No) = 3/5 = 0.600
  • P(income = medium | Yes) = 4/9 = 0.444
  • P(income = medium | No) = 2/5 = 0.400
  • P(student = yes | Yes) = 6/9 = 0.667
  • P(student = yes | No) = 1/5 = 0.200
  • P(credit = fair | Yes) = 6/9 = 0.667
  • P(credit = fair | No) = 2/5 = 0.400

26
Naïve Bayesian Classifier - Example
  • Classify the unknown sample
  • X = (age = <30, income = medium,
    student = yes, credit = fair)
  • P(Yes | X) ∝ P(Yes) P(X | Yes)
  •   = 0.643 x 0.222 x 0.444 x 0.667 x 0.667
      = 0.028  ← maximum
  • P(No | X) ∝ P(No) P(X | No)
  •   = 0.357 x 0.600 x 0.400 x 0.200 x 0.400
      = 0.007
  • Classify X as Yes

27
Discussion on Naïve Bayesian
  • Advantages
  • Optimal classifier if all the joint probabilities
    P(X|h) are known (without the independence
    assumption)
  • Easy to apply
  • Disadvantages
  • Needs a large number of training examples
  • Low accuracy when there are strong dependencies
    between attributes
  • Laplace correction
  • Avoids overly strong probability estimates (0 or 1)
    caused by zero counts

28
Bayesian Belief Networks
  • The independence hypothesis
  • Makes computation possible and yields optimal
    classifiers when satisfied
  • But it is seldom satisfied in practice, as
    attributes are often correlated
  • Bayesian networks
  • A graphical model of causal relationships
  • Combines Bayesian reasoning with causal
    relationships between attributes → a joint
    conditional probability distribution
  • It allows conditional independence between subsets
    of the variables

29
Bayesian Belief Networks
[Figure: example belief network over FamilyHistory (F), Smoker (S),
LungCancer (LC), PositiveXRay (PX), Dyspnea (D), with a conditional
probability table for LungCancer given FamilyHistory and Smoker]
30
Bayesian Belief Networks
  • Computing probabilities
  • Compute any P(Ck | x1, ..., xn)
  • using the joint probability distribution encoded by
    the network: P(x1, ..., xn) = Πi P(xi | Parents(xi))

31
Bayesian Belief Networks
  • Example: compute P(LungCancer | F, S, PX, D)
  • Using a naive Bayesian classifier
  • Need P(LC), P(F|LC), P(S|LC), P(PX|LC), P(D|LC)
  • → 2 + 4 + 4 + 4 + 4 = 18 probabilities
  • Using the Bayesian network model
  • Need P(F), P(S), P(LC|F,S), P(PX|LC), P(D|LC)
  • → 2 + 2 + 8 + 4 + 4 = 20 probabilities
  • Using the full joint probability table
  • Need P(LC, F, S, PX, D) → 2^5 = 32 probabilities

32
Linear Classification
  • Classification as a mathematical mapping
  • x ∈ X = ℝ^n, y ∈ Y = {+1, -1}
  • We want a function f: X → Y
  • Binary classification problem
  • The data above the red line belongs to class 'x'
  • The data below the red line belongs to class 'o'
  • Examples: SVM, Perceptron, Probabilistic
    Classifiers

33
Perceptron
  • Vectors X, W
  • Input: (X1, y1), (X2, y2), ...
  • Output: a classification function f(X) with
  • f(Xi) > 0 for yi = +1
  • f(Xi) < 0 for yi = -1
  • Decision boundary: f(X) = W·X + b = 0,
    i.e., w1x1 + w2x2 + b = 0 in two dimensions
  • Perceptron: update W additively (see the sketch
    below)
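A minimal sketch of the additive perceptron update (the learning rate, epoch count, and names are illustrative choices):

```python
# Minimal perceptron training loop with additive weight updates (illustrative sketch).
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """X: (m, n) array of samples, y: array of +1/-1 labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Misclassified if sign(w.x + b) disagrees with yi.
            if yi * (np.dot(w, xi) + b) <= 0:
                w += lr * yi * xi        # additive update of W
                b += lr * yi
    return w, b

def predict(X, w, b):
    return np.where(X @ w + b > 0, 1, -1)
```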

34
Neural Networks
  • A neuron

[Figure: a single neuron - input vector X = (x0, x1, ..., xn) and weight
vector W = (w0, w1, ..., wn); the weighted sum Σ wi·xi (with bias θj) is
passed through an activation function to produce the output o]
35
Learning of a Neuron
  • Given (X, t), minimize error E
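A common formulation of this minimization (assuming squared error and a gradient-descent update; the original slide may have used a different form) is:

$$ E = \tfrac{1}{2}(t - o)^2, \qquad w_i \leftarrow w_i + \eta\,(t - o)\, f'(\mathrm{net})\, x_i, \quad \text{where } o = f(\mathrm{net}),\; \mathrm{net} = \textstyle\sum_i w_i x_i $$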

36
Multi-Layer Perceptron
[Figure: multi-layer perceptron - input vector X = (x1, x2, ...) feeds the
hidden nodes through weights wij, and the hidden nodes feed the output
nodes through weights wjk, producing the output vector O]
37
Network Training
  • The objective of training
  • Obtain a set of weights that makes almost all the
    samples in the training data classified correctly
  • Steps (a sketch follows this list)
  • Initialize the weights wij with random values
  • Feed the training samples X into the network one
    by one
  • For each unit
  • Compute the output value O using the activation
    function
  • Compute the error E
  • Update the weights wij and the biases
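A compact sketch of these steps for a one-hidden-layer network, assuming sigmoid activations and squared error (an illustrative choice, not necessarily the configuration used in the slides; biases are omitted for brevity):

```python
# One-hidden-layer network trained with backpropagation (illustrative sketch).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, T, hidden=4, epochs=1000, lr=0.5, seed=0):
    """X: (m, n) inputs, T: (m, 1) targets in [0, 1]."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))  # input -> hidden weights (wij)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))           # hidden -> output weights (wjk)
    for _ in range(epochs):
        # Forward pass: compute outputs with the activation function.
        H = sigmoid(X @ W1)
        O = sigmoid(H @ W2)
        # Backward pass: propagate the error E = 1/2 * (T - O)^2.
        delta_out = (O - T) * O * (1 - O)
        delta_hid = (delta_out @ W2.T) * H * (1 - H)
        # Update the weights.
        W2 -= lr * H.T @ delta_out
        W1 -= lr * X.T @ delta_hid
    return W1, W2
```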

38
Backpropagation Learning
Given training pairs (X, T), the output error is propagated backwards
through the weights wjk and wij to update them.
[Figure: the same multi-layer network, annotated with the backward flow of
the error from the output vector O toward the input vector X]
39
Neural Network Classifier - Example
[Figure: a network with four inputs (age, income, student, credit) and one
output (buys_computer)]
40
Neural Network Classifier - Example
  • Train with the 14 examples, 1000 iterations
  • Classify the unknown sample
  • X = (age = <30, income = medium,
    student = yes, credit = fair)
  • → encoded as (0, 0.5, 1, 0)
  • O = 0.85
  • Classify X as Yes

41
Discussion on NN
  • Advantages
  • Robust - works when training examples contain
    errors
  • Output may be discrete, real-valued, or a vector
    of several discrete or real-valued attributes
  • Fast evaluation of the learned target function
  • Criticism
  • Long training time
  • Difficult to understand the learned function
    (weights)
  • Not easy to incorporate domain knowledge

42
Classification Based on Association
  • Use association rule mining
  • Associative classification
  • Mines high support and high confidence rules
  • cond_set => y   (y: a class label)
  • If several rules have the same cond_set, select the
    highest-confidence rule → possible rule set
  • Rules are organized in decreasing order of
    confidence
  • Classification of new data
  • The first rule whose cond_set is satisfied by the
    data is applied

43
Associative Classification - Example
  • Possible rule set
  • (age = 30-40) → Yes (c = 100%, s = 21%)
  • (age = <30) AND (student = no) → No (c = 100%,
    s = 14%)
  • (student = yes) → Yes (c = 86%, s = 43%)
  • (income = low) → Yes (c = 75%, s = 21%)
  • Classify the unknown sample
  • X = (income = medium, student = yes)
  • The first rule satisfied by X is (student = yes) →
    Yes, so classify X as Yes (see the sketch below)
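A tiny sketch of "apply the first rule whose condition set the sample satisfies" (the rule encoding and the default class are illustrative):

```python
# Apply the first matching associative-classification rule (illustrative sketch).
# Each rule: (conditions dict, predicted class), already sorted by confidence.
rules = [
    ({"age": "30-40"}, "Yes"),
    ({"age": "<30", "student": "no"}, "No"),
    ({"student": "yes"}, "Yes"),
    ({"income": "low"}, "Yes"),
]

def classify(sample, rules, default="Yes"):
    for conditions, label in rules:
        if all(sample.get(attr) == value for attr, value in conditions.items()):
            return label
    return default  # fall back to a default class if no rule fires

print(classify({"income": "medium", "student": "yes"}, rules))  # -> Yes
```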

44
k-Nearest Neighbor Algorithm
  • Store all training examples (instances)
  • All instances correspond to points in the n-D
    space.
  • Classify a new instance by finding its nearest
    examples
  • The nearest neighbors are defined in terms of
    Euclidean distance
  • The target function can be discrete- or
    real-valued

[Figure: 2-D scatter of '+' and '-' training points with the query point X
among them]
45
k-Nearest Neighbor Algorithm
  • For a discrete-valued target
  • k-NN returns the most common value among the
    k training examples nearest to X
  • Ex> Classify Yes or No with 10-NN: 7 Yes, 3 No → Yes
  • For a continuous-valued target
  • Calculate the mean value of the k nearest
    neighbors
  • Distance-weighted method (see the sketch below)
  • Weight the k neighbors according to their distance
    to X
  • Larger weights for closer neighbors
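A minimal sketch of distance-weighted k-NN for a numeric target (names are illustrative); it reproduces the worked example on the next slide:

```python
# Distance-weighted k-nearest-neighbor prediction (illustrative sketch).
import math

def knn_predict(x, examples, k=3):
    """examples: list of (feature_vector, numeric_target) pairs."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(examples, key=lambda e: dist(x, e[0]))[:k]
    # Weight each neighbor by the inverse of its distance to x.
    weights = [1.0 / max(dist(x, f), 1e-9) for f, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)

# The deck's example: X = (0, 0.5, 1, 0), 3-NN gives a weighted value of 0.75 -> Yes.
examples = [((0, 0, 1, 0), 1), ((0, 0.5, 0, 0), 0), ((0, 0.5, 1, 1), 1)]
print(knn_predict((0, 0.5, 1, 0), examples, k=3))   # 0.75
```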

46
k-NN - Example
  • Given 14 examples → map them to a 4-D space
  • Classify the unknown sample
  • X = (age = <30, income = medium,
    student = yes, credit = fair) → (0, 0.5, 1, 0)
  • 3-NN: (0, 0, 1, 0) → 1 (yes), d = 0.5, w = 1/0.5
  •       (0, 0.5, 0, 0) → 0 (no), d = 1.0, w = 1/1.0
  •       (0, 0.5, 1, 1) → 1 (yes), d = 1.0, w = 1/1.0
  • W = 4
  • y = 2/4 x 1 (yes) + 1/4 x 0 (no) + 1/4 x 1 (yes)
      = 0.75
  • Classify X as Yes

47
Discussion on k-NN
  • Advantage
  • Robust to noisy data (averaging k-nearest
    neighbors)
  • No training time
  • Disadvantage
  • Classification time can be long when there are
    too many instances (O(n²) distance computations)
  • Curse of dimensionality - distance between
    neighbors could be dominated by irrelevant
    attributes

48
Case-Based Reasoning
  • Similar to k-NN
  • Instances (cases) are symbolic descriptions (not
    points in a Euclidean space)
  • Customer service help desk, law cases, technical
    designs
  • Methodology
  • Instances are represented by symbolic
    descriptions
  • Find cases with similar descriptions
  • Solutions of multiple retrieved cases are
    combined
  • Research issues
  • Finding similarity measure
  • Combining cases

49
Lazy vs. Eager Learning
  • Eager learning
  • Construct generalization model before receiving
    new samples to classify
  • Decision tree, Bayesian classifier, neural
    network
  • Lazy learning
  • Do not build a model until a new sample to
    classify is given
  • k-nearest neighbor classifier, case-based
    reasoning
  • Difference
  • Training - lazy learning is faster
  • Classifying - eager learning is faster

50
Fuzzy Set Approaches
  • Fuzzy logic
  • Uses truth values in [0.0, 1.0] (not only F/T)
  • → represented as a membership function
  • IF (income > 50K) THEN yes → false for income = 49K
  • IF (income = high) THEN yes → 0.9 yes for income =
    49K

51
Fuzzy Set Approaches
  • Using fuzzy logic in rule-based system
  • Rules are represented with fuzzy categories
  • IF (income = high) THEN yes
  • IF (income = medium) THEN no
  • For a given new sample, attribute values are
    converted to fuzzy values
  • income = 49K → 0.1 medium, 0.9 high
  • Each applicable rule contributes a vote for
    membership in the categories
  • 0.9 yes, 0.1 no
  • The truth values for each predicted category are
    summed

52
Prediction
  • Prediction is similar to classification
  • Construct a model → use the model to predict an
    unknown value
  • Linear regression: Y = α + βX
  • α and β are estimated by applying the least squares
    criterion to the known data (X1, Y1), (X2, Y2),
    ..., (Xn, Yn)
  • Minimize Σ(Yi - α - βXi)² → find α and β
  • Multiple regression: Y = α + β1X1 + β2X2
  • Y is modeled as a linear function of multiple
    attributes
  • Non-linear regression: Y = α + β1X + β2X²
  • Y is modeled as a non-linear function of a single
    attribute
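For the simple linear case, the least squares criterion has the standard closed-form solution:

$$ \beta = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \alpha = \bar{Y} - \beta \bar{X} $$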

53
Classification Accuracy
  • Holdout
  • Partition the data set into a training set and a
    test set
  • Training set: derive the classifier
  • Test set: estimate the accuracy
  • K-fold cross-validation
  • Divide the data set into k subsets
  • Use k-1 subsets as training data and 1 subset as
    test data
  • Repeat k times, and average the accuracy

[Illustration: with subsets S1, S2, S3, S4, S5, each round holds out one
subset Si as the test set and trains on the remaining four]
54
Classification Accuracy
  • Bootstrap
  • Works well with small data sets
  • Samples the given training tuples uniformly with
    replacement
  • i.e., each time a tuple is selected, it is
    equally likely to be selected again and re-added
    to the training set
  • .632 bootstrap
  • Given a data set of d tuples, the data set is
    sampled d times, with replacement, resulting in a
    training set of d samples
  • The data tuples that did not make it into the
    training set end up forming the test set
  • About 36.8% of the tuples will form the test set,
    since (1 - 1/d)^d ≈ e^(-1) = 0.368 for large d
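A commonly used form of the .632 bootstrap accuracy estimate (stated here for completeness; it may not match the original slide exactly) averages over k bootstrap rounds:

$$ \mathrm{Acc}(M) = \frac{1}{k} \sum_{i=1}^{k} \bigl( 0.632 \cdot \mathrm{Acc}(M_i)_{\text{test}} + 0.368 \cdot \mathrm{Acc}(M_i)_{\text{train}} \bigr) $$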

55
Ensemble Methods
  • Ensemble
  • Use a combination of models to increase accuracy
  • Combine a series of k learned models, M1, M2, ...,
    Mk, with the aim of creating an improved model M
  • Popular ensemble methods
  • Bagging: averaging the prediction over a
    collection of classifiers
  • Boosting: weighted vote with a collection of
    classifiers

56
Bagging
  • Analogy
  • Diagnosis based on multiple doctors' majority
    vote
  • Training
  • Given a set D of d tuples, at each iteration i, a
    training set Di of d tuples is sampled with
    replacement from D (i.e., bootstrap)
  • A classifier model Mi is learned for each
    training set Di
  • Classification
  • Each classifier Mi returns its class prediction
  • The bagged classifier M counts the votes and
    assigns the class with the most votes to an
    unknown sample X
  • Accuracy
  • Often significantly better than a single classifier
    derived from D
  • For noisy data: not considerably worse, more
    robust
  • Proven to give improved accuracy in prediction

57
Boosting
  • Analogy
  • Consult several doctors and base the decision on a
    combination of their diagnoses
  • Weights are assigned based on each doctor's
    previous diagnosis accuracy
  • Training
  • Weights are assigned to each training tuple
  • A series of k classifiers is iteratively learned
  • After a classifier Mi is learned, the weights are
    updated to allow the subsequent classifier, Mi+1,
    to pay more attention to the training tuples that
    were misclassified by Mi
  • Classification
  • The final M combines the votes of each
    individual classifier, where the weight of each
    classifier's vote is a function of its accuracy
  • Accuracy
  • Compared with bagging, boosting tends to achieve
    greater accuracy, but it also risks overfitting
    the model to the misclassified data

58
References
  • C. Apte and S. Weiss. Data mining with decision
    trees and decision rules. Future Generation
    Computer Systems, 13, 1997.
  • C. M. Bishop, Neural Networks for Pattern
    Recognition. Oxford University Press, 1995.
  • L. Breiman, J. Friedman, R. Olshen, and C. Stone.
    Classification and Regression Trees. Wadsworth
    International Group, 1984.
  • C. J. C. Burges. A Tutorial on Support Vector
    Machines for Pattern Recognition. Data Mining and
    Knowledge Discovery, 2(2):121-168, 1998.
  • P. K. Chan and S. J. Stolfo. Learning arbiter and
    combiner trees from partitioned data for scaling
    machine learning. KDD'95.
  • W. Cohen. Fast effective rule induction.
    ICML'95.
  • G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu.
    Mining top-k covering rule groups for gene
    expression data. SIGMOD'05.
  • A. J. Dobson. An Introduction to Generalized
    Linear Models. Chapman and Hall, 1990.
  • G. Dong and J. Li. Efficient mining of emerging
    patterns: Discovering trends and differences.
    KDD'99.
  • R. O. Duda, P. E. Hart, and D. G. Stork. Pattern
    Classification, 2ed. John Wiley and Sons, 2001
  • U. M. Fayyad. Branching on attribute values in
    decision tree generation. AAAI'94.
  • Y. Freund and R. E. Schapire. A
    decision-theoretic generalization of on-line
    learning and an application to boosting. J.
    Computer and System Sciences, 1997.
  • J. Gehrke, R. Ramakrishnan, and V. Ganti.
    RainForest: A framework for fast decision tree
    construction of large datasets. VLDB'98.

59
References
  • J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y.
    Loh, BOAT -- Optimistic Decision Tree
    Construction. SIGMOD'99.
  • T. Hastie, R. Tibshirani, and J. Friedman. The
    Elements of Statistical Learning: Data Mining,
    Inference, and Prediction. Springer-Verlag,
    2001.
  • D. Heckerman, D. Geiger, and D. M. Chickering.
    Learning Bayesian networks: The combination of
    knowledge and statistical data. Machine Learning,
    1995.
  • M. Kamber, L. Winstone, W. Gong, S. Cheng, and
    J. Han. Generalization and decision tree
    induction: Efficient classification in data
    mining. RIDE'97.
  • B. Liu, W. Hsu, and Y. Ma. Integrating
    classification and association rule mining. KDD'98.
  • W. Li, J. Han, and J. Pei. CMAR: Accurate and
    Efficient Classification Based on Multiple
    Class-Association Rules, ICDM'01.
  • T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A
    comparison of prediction accuracy, complexity,
    and training time of thirty-three old and new
    classification algorithms. Machine Learning,
    2000.
  • J. Magidson. The CHAID approach to segmentation
    modeling: Chi-squared automatic interaction
    detection. In R. P. Bagozzi, editor, Advanced
    Methods of Marketing Research, Blackwell
    Business, 1994.
  • M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A
    fast scalable classifier for data mining.
    EDBT'96.
  • T. M. Mitchell. Machine Learning. McGraw Hill,
    1997.
  • S. K. Murthy, Automatic Construction of Decision
    Trees from Data: A Multi-Disciplinary Survey.
    Data Mining and Knowledge Discovery, 2(4):
    345-389, 1998.

60
References
  • J. R. Quinlan. Induction of decision trees.
    Machine Learning, 1:81-106, 1986.
  • J. R. Quinlan and R. M. Cameron-Jones. FOIL: A
    midterm report. ECML'93.
  • J. R. Quinlan. C4.5: Programs for Machine
    Learning. Morgan Kaufmann, 1993.
  • J. R. Quinlan. Bagging, boosting, and c4.5.
    AAAI'96.
  • R. Rastogi and K. Shim. PUBLIC: A decision tree
    classifier that integrates building and pruning.
    VLDB'98.
  • J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A
    scalable parallel classifier for data mining.
    VLDB'96.
  • J. W. Shavlik and T. G. Dietterich. Readings in
    Machine Learning. Morgan Kaufmann, 1990.
  • P. Tan, M. Steinbach, and V. Kumar. Introduction
    to Data Mining. Addison Wesley, 2005.
  • S. M. Weiss and C. A. Kulikowski. Computer
    Systems that Learn: Classification and
    Prediction Methods from Statistics, Neural Nets,
    Machine Learning, and Expert Systems. Morgan
    Kaufmann, 1991.
  • S. M. Weiss and N. Indurkhya. Predictive Data
    Mining. Morgan Kaufmann, 1997.
  • I. H. Witten and E. Frank. Data Mining: Practical
    Machine Learning Tools and Techniques, 2ed.
    Morgan Kaufmann, 2005.
  • X. Yin and J. Han. CPAR: Classification based on
    predictive association rules. SDM'03.
  • H. Yu, J. Yang, and J. Han. Classifying large
    data sets using SVM with hierarchical clusters.
    KDD'03.