DATA MINING : CLASSIFICATION

1
DATA MINING : CLASSIFICATION
2
Classification : Definition
  • Classification is a supervised learning task.
  • Uses training sets which have correct answers
    (class label attributes).
  • A model is created by running the algorithm on
    the training data.
  • Test the model. If accuracy is low, regenerate
    the model after changing features or
    reconsidering samples.
  • Identify a class label for the incoming new
    data.

3
Applications
  • Classifying credit card transactions as
    legitimate or fraudulent.
  • Classifying secondary structures of protein as
    alpha-helix, beta-sheet, or random coil.
  • Categorizing news stories as finance, weather,
    entertainment, sports, etc.

4
Classification : A two-step process
  • Model construction describing a set of
    predetermined classes.
  • Each sample is assumed to belong to a predefined
    class, as determined by the class label
    attribute.
  • The set of samples used for model construction is
    the training set.
  • The model is represented as classification rules,
    decision trees, or mathematical formulae.

5
  • Model usage for classifying future or unknown
    objects.
  • Estimate accuracy of the model.
  • The known label of each test sample is compared
    with the classified result from the model.
  • Accuracy rate is the percentage of test set
    samples that are correctly classified by the
    model.
  • The test set is independent of the training set.
  • If the accuracy is acceptable, use the model to
    classify data samples whose class labels are not
    known (see the sketch after this list).
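
A minimal Python sketch of this two-step process, assuming scikit-learn
is installed; the iris data is used only as a stand-in labelled dataset
(it is not part of these slides):

    # Step 1: build the model on a training set; Step 2: estimate its
    # accuracy on an independent test set before using it on new data.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)   # test set kept independent

    model = DecisionTreeClassifier()             # model construction
    model.fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Test-set accuracy: {accuracy:.2f}")  # use the model only if acceptable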

6
Model Construction
(Figure: a classification algorithm is run on the
training data to produce a classifier, for example
the rule below.)
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
7
Classification Process (2) : Use the Model in
Prediction
(Figure: the model is applied to new, unlabelled
data, e.g. the tuple (Jeff, Professor, 4).
Tenured? A sketch follows.)
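
A tiny Python sketch of using the rule from the previous slide for
prediction; the function name and argument handling are illustrative,
not taken from the slides:

    # Apply the learned rule
    #   IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    # to a new, unlabelled tuple.
    def predict_tenured(name, rank, years):
        return "yes" if rank.lower() == "professor" or years > 6 else "no"

    print(predict_tenured("Jeff", "Professor", 4))   # -> "yes" (rank matches)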
8
Classification techniques
  • Decision Tree based Methods
  • Rule-based Methods
  • Neural Networks
  • Bayesian Classification
  • Support Vector Machines

9
Algorithm for decision tree induction
  • Basic algorithm
  • Tree is constructed in a top-down
    recursive divide-and-conquer manner.
  • At start, all the training examples are at the
    root.
  • Attributes are categorical (if continuous-valued,
    they are discretized in advance).
  • Examples are partitioned recursively based on
    selected attributes (see the sketch after this
    list).
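
A minimal Python sketch of this top-down, recursive, divide-and-conquer
induction over categorical attributes. Information gain is used to
select the splitting attribute; the slide does not name a specific
measure, so treat that as one possible choice:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def build_tree(rows, labels, attributes):
        # Stop: all examples in one class, or no attributes left to split on.
        if len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]   # majority-class leaf

        def info_gain(attr):
            gain = entropy(labels)
            for value in set(row[attr] for row in rows):
                subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
                gain -= len(subset) / len(labels) * entropy(subset)
            return gain

        best = max(attributes, key=info_gain)     # attribute selection

        # Partition the examples on the chosen attribute and recurse.
        tree = {best: {}}
        remaining = [a for a in attributes if a != best]
        for value in set(row[best] for row in rows):
            sub = [(row, lab) for row, lab in zip(rows, labels) if row[best] == value]
            tree[best][value] = build_tree([r for r, _ in sub],
                                           [l for _, l in sub], remaining)
        return tree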

10
Example of Decision Tree
Training Dataset
11
Output : A Decision Tree for buys_computer
12
Advantages of decision tree based classification
  • Inexpensive to construct.
  • Extremely fast at classifying unknown records.
  • Easy to interpret for small-sized trees.
  • Accuracy is comparable to other classification
    techniques for many simple data sets.

13
Enhancements to basic decision tree induction
  • Allow for continuous-valued attributes
    • Dynamically define new discrete-valued
      attributes that partition the continuous
      attribute values into a discrete set of
      intervals (see the sketch after this list).
  • Handle missing attribute values
    • Assign the most common value of the attribute.
    • Assign a probability to each of the possible
      values.
  • Attribute construction
    • Create new attributes based on existing ones
      that are sparsely represented.
    • This reduces fragmentation, repetition, and
      replication.
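
A short Python sketch of the first two enhancements, assuming pandas is
available; the column names and values are illustrative only:

    import pandas as pd

    df = pd.DataFrame({"age":    [23, 35, 41, None, 58],
                       "income": ["low", "medium", None, "high", "medium"]})

    # Continuous-valued attribute -> a discrete set of intervals.
    df["age_interval"] = pd.cut(df["age"], bins=[0, 30, 40, 120],
                                labels=["<=30", "31..40", ">40"])

    # Missing categorical value -> assign the most common value.
    df["income"] = df["income"].fillna(df["income"].mode()[0])
    print(df)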

14
Potential Problem
  • Overfitting : the generated model does not
    generalize to new incoming data.
  • Causes include training data that is too small,
    not covering many cases.
  • Wrong assumptions.
  • Overfitting results in decision trees that are
    more complex than necessary.
  • Training error no longer provides a good estimate
    of how well the tree will perform on previously
    unseen records.
  • Need new ways of estimating errors.

15
How to avoid Overfitting
  • Two ways to avoid overfitting are
  • Pre-pruning
  • Post-pruning
  • Pre-pruning
  • Stop the algorithm before it becomes a fully
    grown tree.
  • Stop if all instances belong to the same class.
  • Stop if the number of instances is less than some
    user-specified threshold.

16
  • Post-pruning
  • Grow decision tree to its entirety.
  • Trim the nodes of the decision tree in a
    bottom-up fashion.
  • If the generalization error improves after
    trimming, replace the sub-tree by a leaf node.
  • The class label of the leaf node is determined
    from the majority class of instances in the
    sub-tree (see the sketch after this list).
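
A brief Python sketch of both pruning styles as exposed by scikit-learn;
the thresholds are illustrative, and scikit-learn's cost-complexity
pruning differs in detail from the reduced-error procedure described
above:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Pre-pruning: stop splitting when a node holds fewer instances than
    # a user-specified threshold (pure nodes stop anyway).
    pre_pruned = DecisionTreeClassifier(min_samples_split=20).fit(X_train, y_train)

    # Post-pruning: grow the tree, then trim it back (here via
    # cost-complexity pruning, controlled by ccp_alpha).
    post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_train, y_train)

    for name, model in (("pre-pruned", pre_pruned), ("post-pruned", post_pruned)):
        print(name, "test accuracy:", model.score(X_test, y_test))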

17
Bayesian Classification Algorithm
  • Let X be a data sample whose class label is
    unknown
  • Let H be a hypothesis that X belongs to class C
  • For classification problems, determine P(H|X),
    the probability that the hypothesis holds given
    the observed data sample X.
  • P(H) : prior probability of hypothesis H (i.e.
    the initial probability before we observe any
    data; reflects the background knowledge).
  • P(X) : probability that the sample data is
    observed.
  • P(X|H) : probability of observing the sample X,
    given that the hypothesis holds.
  • These are combined by Bayes' theorem,
    P(H|X) = P(X|H) P(H) / P(X) (see the sketch
    after this list).
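
A minimal naive Bayes sketch in Python: the posterior P(H|X) is
proportional to P(H) times the product of per-attribute likelihoods
P(X|H) under the independence assumption discussed on a later slide.
The tiny training list is made up for illustration (the slides show
their dataset as a figure), and no smoothing is applied:

    from collections import Counter

    train = [({"age": "<=30", "student": "yes"}, "yes"),
             ({"age": "<=30", "student": "no"},  "no"),
             ({"age": ">40",  "student": "yes"}, "yes")]

    def naive_bayes(sample):
        labels = [label for _, label in train]
        priors = {c: n / len(train) for c, n in Counter(labels).items()}  # P(H)
        scores = {}
        for c, prior in priors.items():
            rows = [feats for feats, label in train if label == c]
            likelihood = 1.0
            for attr, value in sample.items():                           # P(X|H)
                likelihood *= sum(r[attr] == value for r in rows) / len(rows)
            scores[c] = prior * likelihood          # proportional to P(H|X)
        return max(scores, key=scores.get)

    print(naive_bayes({"age": "<=30", "student": "yes"}))   # -> "yes"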

18
Training dataset for Bayesian Classification
Classes : C1 : buys_computer = 'yes',
          C2 : buys_computer = 'no'
Data sample X = (age <= 30, Income = 'medium',
Student = 'yes', Credit_rating = 'Fair')
19
Advantages and Disadvantages of Bayesian
Classification
  • Advantages
  • Easy to implement
  • Good results obtained in most of the cases
  • Disadvantages
  • Due to the independence assumption there is a
    loss of accuracy.
  • Practically, dependencies exist among variables.
  • E.g., hospital patients: Profile (age, family
    history, etc.), Symptoms (fever, cough, etc.),
    Disease (lung cancer, diabetes, etc.).
  • Dependencies among these cannot be modeled by the
    Bayesian classifier.

20
Conclusion
  • Training data is an important factor in building
    a model in supervised algorithms.
  • The classification results generated by each of
    the algorithms (Naïve Bayes, Decision Tree,
    Neural Networks) are not considerably different
    from one another.
  • Different classification algorithms can take
    different amounts of time to train and build
    models.
  • Automated (machine) classification is faster than
    manual classification.

21
References
  • www.google.com
  • http://www.thearling.com
  • www.mamma.com
  • www.amazon.com
  • http://www.kdnuggets.com
  • C. Apte and S. Weiss. Data mining with decision
    trees and decision rules. Future Generation
    Computer Systems, 13, 1997.
  • L. Breiman, J. Friedman, R. Olshen, and C. Stone.
    Classification and Regression Trees. Wadsworth
    International Group, 1984.

22
Thank you !!!