Classification and Prediction - PowerPoint PPT Presentation

Slides: 149
Provided by: SCS75

Transcript and Presenter's Notes

1
Classification and Prediction
2
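The Course
[Figure: course overview diagram linking the data sources (DS), the staging database (DP), the data warehouse (DW), OLAP, and data mining (DM) with its three tasks: Association, Classification, Clustering.]
Legend: DS = Data source, DW = Data warehouse, DM = Data Mining, DP = Staging Database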
3
Chapter Objectives
  • Learn basic techniques for data classification
    and prediction.
  • Realize the difference between the following
    classifications of data
  • supervised classification
  • prediction
  • unsupervised classification

4
Chapter Outline
  • What is classification and prediction of data?
  • How do we classify data by decision tree
    induction?
  • What are neural networks and how can they
    classify?
  • What is Bayesian classification?
  • Are there other classification techniques?
  • How do we predict continuous values?

5
What is Classification?
  • The goal of data classification is to organize
    and categorize data in distinct classes.
  • A model is first created based on the data
    distribution.
  • The model is then used to classify new data.
  • Given the model, a class can be predicted for new
    data.
  • Classification = prediction for discrete and
    nominal values

6
What is Prediction?
  • The goal of prediction is to forecast or deduce
    the value of an attribute based on values of
    other attributes.
  • A model is first created based on the data
    distribution.
  • The model is then used to predict future or
    unknown values
  • In Data Mining
  • If forecasting discrete value → Classification
  • If forecasting continuous value → Prediction

7
Supervised and Unsupervised
  • Supervised Classification = Classification
  • We know the class labels and the number of
    classes
  • Unsupervised Classification = Clustering
  • We do not know the class labels and may not know
    the number of classes

8
Preparing Data Before Classification
  • Data transformation
  • Discretization of continuous data
  • Normalization to [-1, 1] or [0, 1]
  • Data Cleaning
  • Smoothing to reduce noise
  • Relevance Analysis
  • Feature selection to eliminate irrelevant
    attributes
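
The normalization step above can be sketched as a linear (min-max) rescaling. This is only an illustration: the function name `min_max_normalize` and the sample income values are assumptions, not part of the slides.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale numbers linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map everything to the lower bound
        # instead of dividing by zero.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


incomes = [2000, 6000, 10000]                   # hypothetical attribute values
print(min_max_normalize(incomes))               # range 0..1
print(min_max_normalize(incomes, -1.0, 1.0))    # range -1..1
```

The same function covers both target ranges named on the slide by changing `new_min` and `new_max`.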

9
Application
  • Credit approval
  • Target marketing
  • Medical diagnosis
  • Defective parts identification in manufacturing
  • Crime zoning
  • Treatment effectiveness analysis
  • Etc

10
Supervised learning process: 3 steps
  1. Training Data → Classification Method → Classification Model
  2. Test Data → Classification Model → Accuracy
  3. New Data → Classification Model → Class
11
Classification is a 3-step process
  • 1. Model construction (Learning)
  • Each tuple is assumed to belong to a predefined
    class, as determined by one of the attributes,
    called the class label.
  • The set of all tuples used for construction of
    the model is called training set.
  • The model is represented in the following forms
  • Classification rules, (IF-THEN statements),
  • Decision tree
  • Mathematical formulae

12
1. Classification Process (Learning)
Training Data
Name   Income  Age     Credit rating (class)
Samir  Low     <30     bad
Ahmed  Medium  30..40  good
Salah  High    <30     good
Ali    Medium  >40     good
Sami   Low     30..40  good
Emad   Medium  <30     bad
Training Data → Classification Method → Classification Model
Classification Model: IF Income = High OR Age > 30 THEN Class = Good, OR a Decision Tree, OR a Mathematical Formula
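
As a rough check, the IF-THEN rule on this slide can be written as a small function and applied to the six training tuples. The encoding of ages as bracket strings and the reading of "Age > 30" as "bracket 30..40 or >40" are assumptions made for this sketch.

```python
# Training tuples from the slide: (name, income, age bracket, credit rating).
TRAINING = [
    ("Samir", "Low", "<30", "bad"),
    ("Ahmed", "Medium", "30..40", "good"),
    ("Salah", "High", "<30", "good"),
    ("Ali", "Medium", ">40", "good"),
    ("Sami", "Low", "30..40", "good"),
    ("Emad", "Medium", "<30", "bad"),
]


def classify(income, age):
    """IF Income = High OR Age > 30 THEN good, otherwise bad."""
    older_than_30 = age in ("30..40", ">40")  # assumed reading of the age brackets
    return "good" if income == "High" or older_than_30 else "bad"


for name, income, age, label in TRAINING:
    print(name, classify(income, age), "(expected:", label + ")")
```

On these six tuples the rule reproduces every class label, which is what model construction on the training set is meant to achieve.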
13
Classification is a 3-step process
  • 2. Model Evaluation (Accuracy)
  • Estimate accuracy rate of the model based on a
    test set.
  • The known label of test sample is compared with
    the classified result from the model.
  • Accuracy rate is the percentage of test set
    samples that are correctly classified by the
    model.
  • The test set must be independent of the training
    set; otherwise over-fitting goes undetected and
    the accuracy estimate is too optimistic

14
2. Classification Process (Accuracy Evaluation)
Classification Model
Name   Income  Age     Credit rating (class)  Model
Naser  Low     <30     Bad                    Bad
Lutfi  Medium  <30     Bad                    good
Adel   High    >40     good                   good
Fahd   Medium  30..40  good                   good
Accuracy = 75%
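
The accuracy figure on this slide can be reproduced by comparing the known labels with the model's output, one test tuple at a time. A minimal sketch, with the two label lists copied from the slide and the function name `accuracy` chosen for illustration:

```python
def accuracy(actual, predicted):
    """Fraction of test samples whose predicted label equals the known label."""
    if len(actual) != len(predicted):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


actual = ["bad", "bad", "good", "good"]      # known credit ratings (Naser, Lutfi, Adel, Fahd)
predicted = ["bad", "good", "good", "good"]  # labels produced by the model

print(f"accuracy = {accuracy(actual, predicted):.0%}")
```

Three of the four test tuples match, which gives the 75% shown on the slide.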
15
Classification is a three-step process
  • 3. Model Use (Classification)
  • The model is used to classify unseen objects.
  • Give a class label to a new tuple
  • Predict the value of an actual attribute

16
3. Classification Process (Use)
Classification Model
Name   Income  Age  Credit rating
Adham  Low     <30  ? → Good
17
Classification Methods
Classification Method
  • Decision Tree Induction
  • Neural Networks
  • Bayesian Classification
  • Association-Based Classification
  • K-Nearest Neighbour
  • Case-Based Reasoning
  • Genetic Algorithms
  • Rough Set Theory
  • Fuzzy Sets
  • Etc.

18
Evaluating Classification Methods
  • Predictive accuracy
  • Ability of the model to correctly predict the
    class label
  • Speed and scalability
  • Time to construct the model
  • Time to use the model
  • Robustness
  • Handling noise and missing values
  • Scalability
  • Efficiency in large databases (not memory
    resident data)
  • Interpretability
  • The level of understanding and insight provided
    by the model

19
Chapter Outline
  • What is classification and prediction of data?
  • How do we classify data by decision tree
    induction?
  • What are neural networks and how can they
    classify?
  • What is Bayesian classification?
  • Are there other classification techniques?
  • How do we predict continuous values?

20
Decision Tree
21
What is a Decision Tree?
  • A decision tree is a flow-chart-like tree
    structure.
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • All tuples in branch have the same value for the
    tested attribute.
  • Leaf node represents class label or class label
    distribution

22
Sample Decision Tree
[Figure: scatter plot of excellent and fair customers, Age (20 to 80) against Income (2000 to 10000), next to a decision tree that tests Income (< 6K, > 6K) with leaves YES and No.]
23
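Sample Decision Tree
[Figure: the same scatter plot, Age (20 to 80) against Income (2000 to 10000), next to a decision tree that tests Income (< 6k, > 6k) and then Age (< 50, > 50) with leaves Yes and NO.]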
24
Sample Decision Tree
Outlook   Temp  Humidity  Windy  Play?
sunny     hot   high      FALSE  No
sunny     hot   high      TRUE   No
overcast  hot   high      FALSE  Yes
rainy     mild  high      FALSE  Yes
rainy     cool  normal    FALSE  Yes
rainy     cool  normal    TRUE   No
overcast  cool  normal    TRUE   Yes
sunny     mild  high      FALSE  No
sunny     cool  normal    FALSE  Yes
rainy     mild  normal    FALSE  Yes
sunny     mild  normal    TRUE   Yes
overcast  mild  high      TRUE   Yes
overcast  hot   normal    FALSE  Yes
rainy     mild  high      TRUE   No
http://www-lmmb.ncifcrf.gov/toms/paper/primer/latex/index.html
http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
25
Decision-Tree Classification Methods
  • The basic top-down decision tree generation
    approach usually consists of two phases
  • Tree construction
  • At the start, all the training examples are at
    the root.
  • Examples are partitioned recursively based on
    selected attributes.
  • Tree pruning
  • Aims at removing tree branches that may reflect
    noise in the training data and lead to errors
    when classifying test data → improves
    classification accuracy

26
How to Specify Test Condition?
  • Depends on attribute types
  • Nominal
  • Ordinal
  • Continuous
  • Depends on number of ways to split
  • 2-way split
  • Multi-way split

27
Splitting Based on Nominal Attributes
  • Multi-way split: Use as many partitions as
    distinct values.
  • Binary split: Divides values into two subsets.
    Need to find optimal partitioning.

OR
28
Splitting Based on Ordinal Attributes
  • Multi-way split: Use as many partitions as
    distinct values.
  • Binary split: Divides values into two subsets.
    Need to find optimal partitioning.
  • What about this split?

OR
29
Splitting Based on Continuous Attributes
  • Different ways of handling
  • Discretization to form an ordinal categorical
    attribute
  • Static: discretize once at the beginning
  • Dynamic: ranges can be found by equal interval
    bucketing, equal frequency bucketing
    (percentiles), or clustering.
  • Binary Decision: (A < v) or (A ≥ v)
  • considers all possible splits and finds the best
    cut
  • can be more compute intensive

30
Splitting Based on Continuous Attributes
31
Tree Induction
  • Greedy strategy.
  • Split the records based on an attribute test that
    optimizes certain criterion.
  • Issues
  • Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
  • Determine when to stop splitting

32
How to determine the Best Split
fair customers
Good customers
Customers
Income
Age
lt10k
gt10k
young
old
33
How to determine the Best Split
  • Greedy approach
  • Nodes with homogeneous class distribution are
    preferred
  • Need a measure of node impurity

50% red, 50% green: high degree of impurity
75% red, 25% green: low degree of impurity
100% red, 0% green: pure
34
Measures of Node Impurity
  • Information gain
  • Uses Entropy
  • Gain Ratio
  • Uses Information Gain and Splitinfo
  • Gini Index
  • Used only for binary splits

35
Algorithm for Decision Tree Induction
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Attributes are categorical (if continuous-valued,
    they are discretized in advance)
  • Examples are partitioned recursively based on
    selected attributes
  • Test attributes are selected on the basis of a
    heuristic or statistical measure (e.g.,
    information gain)
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same
    class
  • There are no remaining attributes for further
    partitioning (majority voting is employed for
    classifying the leaf)
  • There are no samples left

36
Classification Algorithms
  • ID3
  • Uses information gain
  • C4.5
  • Uses Gain Ratio
  • CART
  • Uses Gini

37
Entropy Used by ID3
$Entropy(S) = -p \log_2 p - q \log_2 q$
  • Entropy measures the impurity of S
  • S is a set of examples
  • p is the proportion of positive examples
  • q is the proportion of negative examples
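
The two-class entropy defined above is easy to evaluate directly. The sketch below assumes the usual convention that a term with proportion 0 contributes 0 bits; the function name `entropy` is illustrative.

```python
from math import log2


def entropy(p):
    """Two-class entropy in bits, where p is the proportion of positive examples."""
    q = 1.0 - p
    # A proportion of 0 contributes nothing (0 * log2(0) is taken as 0).
    return -sum(x * log2(x) for x in (p, q) if x > 0)


print(f"{entropy(9 / 14):.3f} bits")  # weather data: 9 positive and 5 negative examples
print(f"{entropy(0.5):.3f} bits")     # evenly mixed set
print(f"{entropy(1.0):.3f} bits")     # pure set
```

An evenly mixed set has the highest impurity (1 bit) and a pure set has none.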

38
ID3
39
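ID3
[Figure: ID3 on the weather data. Root node: 0.94 bits, the amount of information required to specify the class of an example given that it reaches the node. Branch: 0.97 bits, weighted by 5/14. Maximal information gain: 0.25 bits.]
40
ID3
[Figure: the root is split on outlook (sunny, overcast, rainy), the attribute with maximal information gain; the sunny branch holds 0.97 bits.]
41
ID3
[Figure: outlook (sunny, overcast, rainy); the sunny branch (0.97 bits) is split on humidity (high, normal).]
42
ID3
outlook
  sunny → humidity
    high → No
    normal → Yes
  overcast → Yes
  rainy → windy
    false → Yes
    true → No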
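
The choice of outlook as the root can be checked by computing the information gain of each attribute on the 14-row weather data. This is a sketch of the gain calculation only, not a full ID3 implementation; the names `DATA`, `ATTRIBUTES` and `information_gain` are illustrative.

```python
from collections import Counter
from math import log2

# Weather data from the slides: (outlook, temp, humidity, windy, play).
DATA = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]
ATTRIBUTES = {"outlook": 0, "temp": 1, "humidity": 2, "windy": 3}


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy before the split minus the weighted entropy of the partitions."""
    base = entropy([row[-1] for row in rows])
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {name: information_gain(DATA, col) for name, col in ATTRIBUTES.items()}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f} bits")
print("split on:", best)
```

ID3 would repeat the same calculation inside each branch, using only the rows that reach that branch.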
43
C4.5
  • Information gain measure is biased towards
    attributes with a large number of values
  • C4.5 (a successor of ID3) uses gain ratio to
    overcome the problem (normalization to
    information gain)
  • GainRatio(A) = Gain(A)/SplitInfo(A)
  • Ex.
  • gain_ratio(income) = 0.029/0.926 = 0.031
  • The attribute with the maximum gain ratio is
    selected as the splitting attribute
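
A short sketch of the gain ratio calculation follows. SplitInfo is taken here to be the entropy of the partition sizes, which is the usual C4.5 definition and is not spelled out on the slide. The partition sizes (5, 4, 5) and the gain of 0.247 bits are the values for outlook on the weather data; the function names are illustrative.

```python
from math import log2


def split_info(partition_sizes):
    """Entropy of the partition sizes produced by splitting on an attribute."""
    total = sum(partition_sizes)
    return -sum((n / total) * log2(n / total) for n in partition_sizes if n > 0)


def gain_ratio(gain, partition_sizes):
    """Information gain normalized by the split information."""
    return gain / split_info(partition_sizes)


# outlook splits the 14 weather examples into sunny (5), overcast (4), rainy (5).
print(f"SplitInfo(outlook) = {split_info([5, 4, 5]):.3f}")
print(f"GainRatio(outlook) = {gain_ratio(0.247, [5, 4, 5]):.3f}")

# Arithmetic of the slide's own example, using the numbers exactly as given there.
print(f"gain_ratio(income) = {0.029 / 0.926:.3f}")
```

An attribute with many distinct values produces a large SplitInfo, so its gain is scaled down; this is how the bias of plain information gain is countered.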

44
CART
  • If a data set D contains examples from n classes,
    gini index, gini(D) is defined as
    $gini(D) = 1 - \sum_{j=1}^{n} p_j^2$
    where p_j is the relative frequency of class
    j in D
  • If a data set D is split on A into two subsets
    D1 and D2, the gini index gini_split(D) is defined as
    $gini_{split}(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$
  • Reduction in Impurity
    $\Delta gini(A) = gini(D) - gini_{split}(D)$
  • The attribute that provides the smallest gini_split(D)
    (or the largest reduction in impurity) is chosen
    to split the node (need to enumerate all the
    possible splitting points for each attribute)

45
CART
  • Ex. D has 9 tuples in buys_computer = yes and
    5 in no, so gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
  • Suppose the attribute income partitions D into 10
    in D1 {low, medium} and 4 in D2 {high}
  • but gini_{medium, high} is 0.30 and thus the best
    since it is the lowest
  • All attributes are assumed continuous-valued
  • May need other tools, e.g., clustering, to get
    the possible split values
  • Can be modified for categorical attributes
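
The Gini calculations can be sketched in a few lines. The 9 yes / 5 no class counts come from the example above. The class counts inside D1 and D2 are not given on the slide, so the values (7, 3) and (2, 2) below are assumptions used only to show how a binary split is scored.

```python
def gini(class_counts):
    """Gini index of a node, given the number of examples in each class."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


def gini_split(partitions):
    """Size-weighted Gini index of the partitions produced by a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)


print(f"gini(D) = {gini([9, 5]):.3f}")  # 9 yes and 5 no

# Hypothetical class counts (yes, no) for D1 (10 tuples) and D2 (4 tuples).
assumed_partitions = [(7, 3), (2, 2)]
split_value = gini_split(assumed_partitions)
print(f"gini_split = {split_value:.3f}")
print(f"reduction in impurity = {gini([9, 5]) - split_value:.3f}")
```

CART would evaluate every candidate binary split this way and keep the one with the smallest weighted Gini index.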

46
Comparing Attribute Selection Measures
  • The three measures, in general, return good
    results but
  • Information gain
  • biased towards multivalued attributes
  • Gain ratio
  • tends to prefer unbalanced splits in which one
    partition is much smaller than the others
  • Gini index
  • biased to multivalued attributes
  • has difficulty when the number of classes is large
  • tends to favor tests that result in equal-sized
    partitions and purity in both partitions

47
Other Attribute Selection Measures
  • CHAID: a popular decision tree algorithm, measure
    based on χ² test for independence
  • C-SEP: performs better than info. gain and gini
    index in certain cases
  • G-statistics: has a close approximation to χ²
    distribution
  • MDL (Minimal Description Length) principle (i.e.,
    the simplest solution is preferred)
  • The best tree is the one that requires the fewest
    bits to both (1) encode the tree, and (2)
    encode the exceptions to the tree
  • Multivariate splits (partition based on multiple
    variable combinations)
  • CART finds multivariate splits based on a linear
    comb. of attrs.
  • Which attribute selection measure is the best?
  • Most give good results, none is significantly
    superior to the others

48
Underfitting and Overfitting
Overfitting
Underfitting: when the model is too simple, both
training and test errors are large
49
Overfitting due to Noise
Decision boundary is distorted by noise point
50
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the
diagram makes it difficult to predict correctly
the class labels of that region - Insufficient
number of training records in the region causes
the decision tree to predict the test examples
using other training records that are irrelevant
to the classification task
51
Two approaches to avoid Overfitting
  • Prepruning
  • Halt tree construction early: do not split a node
    if this would result in the goodness measure
    falling below a threshold
  • Difficult to choose an appropriate threshold
  • Postpruning
  • Remove branches from a fully grown tree to get a
    sequence of progressively pruned trees
  • Use a set of data different from the training
    data to decide which is the best pruned tree

52
Scalable Decision Tree Induction Methods
  • ID3, C4.5, and CART are not efficient when the
    training set doesn't fit in the available memory.
    Instead, the following algorithms are used:
  • SLIQ
  • Builds an index for each attribute and only class
    list and the current attribute list reside in
    memory
  • SPRINT
  • Constructs an attribute list data structure
  • RainForest
  • Builds an AVC-list (attribute, value, class
    label)
  • BOAT
  • Uses bootstrapping to create several small samples

53
BOAT
  • BOAT (Bootstrapped Optimistic Algorithm for Tree
    Construction)
  • Use a statistical technique called bootstrapping
    to create several smaller samples (subsets), each
    fits in memory
  • Each subset is used to create a tree, resulting
    in several trees
  • These trees are examined and used to construct a
    new tree T
  • It turns out that T is very close to the tree
    that would be generated using the whole data set
    together
  • Advantages: requires only two scans of the DB; it is an
    incremental algorithm

54
Why decision tree induction in data mining?
  • Relatively faster learning speed (than other
    classification methods)
  • Convertible to simple and easy to understand
    classification rules
  • Comparable classification accuracy with other
    methods

55
Converting Tree to Rules
R1: IF (Outlook = Sunny) AND (Humidity = High) THEN Play = No
R2: IF (Outlook = Sunny) AND (Humidity = Normal) THEN Play = Yes
R3: IF (Outlook = Overcast) THEN Play = Yes
R4: IF (Outlook = Rain) AND (Wind = Strong) THEN Play = No
R5: IF (Outlook = Rain) AND (Wind = Weak) THEN Play = Yes
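
Each rule corresponds to one root-to-leaf path, so the five rules translate directly into code. A minimal sketch, with the function name `play` and the string encodings of the attribute values chosen for illustration:

```python
def play(outlook, humidity, wind):
    """Apply rules R1 to R5; each rule is one root-to-leaf path of the tree."""
    if outlook == "Sunny" and humidity == "High":    # R1
        return "No"
    if outlook == "Sunny" and humidity == "Normal":  # R2
        return "Yes"
    if outlook == "Overcast":                        # R3
        return "Yes"
    if outlook == "Rain" and wind == "Strong":       # R4
        return "No"
    if outlook == "Rain" and wind == "Weak":         # R5
        return "Yes"
    raise ValueError("no rule covers this combination of values")


print(play("Sunny", "High", "Weak"))
print(play("Overcast", "High", "Strong"))
print(play("Rain", "Normal", "Strong"))
```

Because the rules come from a tree, they are mutually exclusive: at most one rule fires for any input.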
56
Decision trees: The Weka tool
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
http://www.cs.waikato.ac.nz/ml/weka/
57
Bayesian Classifier
Thomas Bayes (1702-1761)
58
Basic Statistics
  • Assume
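  • D: All students
  • X: ICS students
  • C: SWE students

[Figure: Venn diagram of D with 74 students outside X and C, 6 in X only, 4 in both X and C, and 16 in C only.]
|X| = 10, |C| = 20, |D| = 100
P(X) = 10/100, P(C) = 20/100, P(X,C) = 4/100
P(X|C) = P(X,C)/P(C) = 4/20, P(C|X) = P(X,C)/P(X) = 4/10
P(X,C) = P(C|X)P(X) = P(X|C)P(C)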
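
The student counts can be used to check these identities with exact fractions. This sketch takes only the counts from the slide; the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the slide: 100 students in D, 10 in X, 20 in C, 4 in both.
n_d, n_x, n_c, n_xc = 100, 10, 20, 4

p_x = Fraction(n_x, n_d)
p_c = Fraction(n_c, n_d)
p_xc = Fraction(n_xc, n_d)

p_x_given_c = p_xc / p_c  # 4/20
p_c_given_x = p_xc / p_x  # 4/10

# Product rule: both factorizations give the same joint probability.
print(p_c_given_x * p_x, p_x_given_c * p_c, p_xc)

# Bayes' rule: recover P(C|X) from P(X|C), P(C) and P(X).
print(p_x_given_c * p_c / p_x)
```

Rearranging the product rule this way is what gives the Bayesian classifier its basic equation.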
59
Bayesian Classifier Basic Equation
P(X,C) = P(C|X)P(X) = P(X|C)P(C)
P(C): Class Prior Probability
P(X|C): Descriptor Posterior Probability
P(C|X): Class Posterior Probability
P(X): Descriptor Prior Probability
60
Naive Bayesian Classifier
Independence assumption about descriptors
61
Training Data
Outlook   Temp  Humidity  Windy  Play?
sunny     hot   high      FALSE  No
sunny     hot   high      TRUE   No
overcast  hot   high      FALSE  Yes
rainy     mild  high      FALSE  Yes
rainy     cool  normal    FALSE  Yes
rainy     cool  normal    TRUE   No
overcast  cool  normal    TRUE   Yes
sunny     mild  high      FALSE  No
sunny     cool  normal    FALSE  Yes
rainy     mild  normal    FALSE  Yes
sunny     mild  normal    TRUE   Yes
overcast  mild  high      TRUE   Yes
overcast  hot   normal    FALSE  Yes
rainy     mild  high      TRUE   No
P(yes) = 9/14, P(no) = 5/14
62
Bayesian Classifier: Probabilities for the
weather data
Frequency Tables
Outlook   No   Yes
Sunny     3    2
Overcast  0    4
Rainy     2    3

Temp.     No   Yes
Hot       2    2
Mild      2    4
Cool      1    3

Humidity  No   Yes
High      4    3
Normal    1    6

Windy     No   Yes
False     2    6
True      3    3

Likelihood Tables
Outlook   No   Yes
Sunny     3/5  2/9
Overcast  0/5  4/9
Rainy     2/5  3/9

Temp.     No   Yes
Hot       2/5  2/9
Mild      2/5  4/9
Cool      1/5  3/9

Humidity  No   Yes
High      4/5  3/9
Normal    1/5  6/9

Windy     No   Yes
False     2/5  6/9
True      3/5  3/9
63
Bayesian Classifier: Predicting a new day
Outlook  Temp.  Humidity  Windy  Play
sunny    cool   high      true   ?
Let X be the new day.
P(yes|X) ∝ p(sunny|yes) x p(cool|yes) x p(high|yes) x p(true|yes) x p(yes)
= 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
Normalized: 0.0053/(0.0053 + 0.0206) = 0.205
P(no|X) ∝ p(sunny|no) x p(cool|no) x p(high|no) x p(true|no) x p(no)
= 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206
Normalized: 0.0206/(0.0053 + 0.0206) = 0.795
Class? NO
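
The prediction for the new day can be reproduced from the likelihood tables. The sketch below hard-codes only the table entries this example needs; the names `PRIOR`, `LIKELIHOOD` and `posterior` are illustrative.

```python
from fractions import Fraction as F

# Priors and the likelihood entries needed for X = (sunny, cool, high, true).
PRIOR = {"yes": F(9, 14), "no": F(5, 14)}
LIKELIHOOD = {
    "yes": {"sunny": F(2, 9), "cool": F(3, 9), "high": F(3, 9), "true": F(3, 9)},
    "no": {"sunny": F(3, 5), "cool": F(1, 5), "high": F(4, 5), "true": F(3, 5)},
}


def posterior(evidence):
    """Return (normalized posteriors, unnormalized scores) for each class."""
    scores = {}
    for cls, prior in PRIOR.items():
        score = prior
        for value in evidence:
            # Naive assumption: descriptors are independent given the class.
            score *= LIKELIHOOD[cls][value]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}, scores


post, raw = posterior(["sunny", "cool", "high", "true"])
for cls in ("yes", "no"):
    print(f"{cls}: score {float(raw[cls]):.4f}, posterior {float(post[cls]):.3f}")
print("predicted class:", max(post, key=post.get))
```

Dividing by the sum of the two scores turns them into probabilities that add up to 1; the predicted class is the same before and after this normalization.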
64
Bayesian Classifier: zero frequency problem
  • What if a descriptor value doesn't occur with
    every class value
  • P(outlook = overcast | No) = 0
  • Remedy: add 1 to the count for every
    descriptor-class combination
  • (Laplace Estimator)

Outlook   No   Yes
Sunny     3+1  2+1
Overcast  0+1  4+1
Rainy     2+1  3+1

Temp.     No   Yes
Hot       2+1  2+1
Mild      2+1  4+1
Cool      1+1  3+1

Humidity  No   Yes
High      4+1  3+1
Normal    1+1  6+1

Windy     No   Yes
False     2+1  6+1
True      3+1  3+1
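
A small sketch of the Laplace estimator for the outlook attribute and the class No follows. The slide shows only the adjusted counts; dividing by the class total plus the number of distinct attribute values is the usual convention and is assumed here.

```python
from fractions import Fraction


def laplace(count, class_total, n_values):
    """Smoothed likelihood: add 1 to the count and n_values to the class total."""
    return Fraction(count + 1, class_total + n_values)


# Outlook counts for class No, taken from the frequency table (5 No days in total).
OUTLOOK_NO = {"sunny": 3, "overcast": 0, "rainy": 2}

smoothed = {
    value: laplace(count, 5, len(OUTLOOK_NO)) for value, count in OUTLOOK_NO.items()
}
for value, prob in smoothed.items():
    print(f"P(outlook = {value} | No) = {prob}")
```

The smoothed estimates still sum to 1, and P(outlook = overcast | No) is no longer zero, so a single unseen combination cannot wipe out the whole product.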
65
Bayesian Classifier General Equation
Likelihood
Continuous variable
66
Bayesian Classifier Dealing with numeric
attributes
67
Bayesian Classifier Dealing with numeric
attributes
68
Naïve Bayesian Classifier Comments
  • Advantages
  • Easy to implement
  • Good results obtained in most of the cases
  • Disadvantages
  • Assumption: class conditional independence,
    therefore loss of accuracy
  • Practically, dependencies exist among variables
  • E.g., hospital patients. Profile: age, family
    history, etc.
  • Symptoms: fever, cough, etc. Disease: lung
    cancer, diabetes, etc.
  • Dependencies among these cannot be modeled by
    Naïve Bayesian Classifier
  • How to deal with these dependencies?
  • Bayesian Belief Networks

69
Bayesian Belief Networks
  • A Bayesian belief network allows a subset of the
    variables to be conditionally independent
  • A graphical model of causal relationships
  • Represents dependency among the variables
  • Gives a specification of joint probability
    distribution
  • Nodes: random variables
  • Links: dependency
  • X and Y are the parents of Z, and Y is the
    parent of P
  • No dependency between Z and P
  • Has no loops or cycles

[Figure: example network with nodes X, Y, Z and P.]
70
Bayesian Belief Network: An Example
[Figure: belief network with nodes Family History, Smoker, LungCancer, Emphysema, PositiveXRay and Dyspnea, next to the conditional probability table (CPT) for variable LungCancer.]
The CPT shows the conditional probability for each
possible combination of values of its parents.
Derivation of the probability of a particular
combination of values of X, from the CPT
71
Training Bayesian Networks
  • Several scenarios
  • Given both the network structure and all
    variables observable: learn only the CPTs
  • Network structure known, some hidden variables:
    gradient descent (greedy hill-climbing) method,
    analogous to neural network learning
  • Network structure unknown, all variables
    observable: search through the model space to
    reconstruct network topology
  • Unknown structure, all hidden variables: no good
    algorithms known for this purpose.

72
Support Vector Machines
73
Sabic
  • Email Mohammed S. Al-Shahrani
  • shahranims_at_sabic.com

74
Support Vector Machines
  • Find a linear hyperplane (decision boundary) that
    will separate the data

75
Support Vector Machines
  • One Possible Solution

76
Support Vector Machines
  • Another possible solution

77
Support Vector Machines
  • Other possible solutions

78
Support Vector Machines
  • Which one is better? B1 or B2?
  • How do you define better?

79
Support Vector Machines
  • Find a hyperplane that maximizes the margin ⇒
    B1 is better than B2

80
Support Vectors
Support Vectors
81
Support Vector Machines
Support Vectors
82
Support Vector Machines
83
Finding the Decision Boundary
  • Let x_1, ..., x_n be our data set and let y_i ∈
    {1, -1} be the class label of x_i
  • The decision boundary should classify all points
    correctly ⇒ $y_i (w^T x_i + b) \ge 1$ for all i
  • The decision boundary can be found by solving the
    following constrained optimization problem:
    minimize $\frac{1}{2} \|w\|^2$ subject to $y_i (w^T x_i + b) \ge 1$ for all i
  • This is a constrained optimization problem.
    Solving it is beyond our course

84
Support Vector Machines
  • We want to maximize the margin: $\frac{2}{\|w\|}$
  • Which is equivalent to minimizing: $\frac{1}{2} \|w\|^2$
  • But subject to the following constraints:
    $y_i (w^T x_i + b) \ge 1$ for all i
  • This is a constrained optimization problem
  • Numerical approaches to solve it (e.g., quadratic
    programming)

85
Classifying new Tuples
  • The decision boundary is determined only by the
    support vectors
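  • Let t_j (j = 1, ..., s) be the indices of the s
    support vectors.
  • For testing with a new data z
  • Compute
    $w^T z + b = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} (x_{t_j}^T z) + b$,
    where $\alpha_{t_j}$ is the multiplier learned for support
    vector $x_{t_j}$, and classify z as class 1 if
    the sum is positive, and class 2 otherwise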
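
The classification step can be sketched with a hand-worked toy model. Everything numeric below is an assumption for illustration: two support vectors at (1, 1) and (-1, -1), each with multiplier 0.25, and bias 0, which together give the weight vector w = (0.5, 0.5). Training, which would produce these values, is not shown.

```python
# Each entry: (support vector x, class label y, multiplier alpha). Toy values.
SUPPORT_VECTORS = [
    ((1.0, 1.0), 1, 0.25),
    ((-1.0, -1.0), -1, 0.25),
]
BIAS = 0.0


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(z):
    """Sum over the support vectors of alpha * y * (x . z), plus the bias."""
    return sum(alpha * y * dot(x, z) for x, y, alpha in SUPPORT_VECTORS) + BIAS


def classify(z):
    """Class 1 if the decision value is positive, class 2 otherwise."""
    return "class 1" if decision_value(z) > 0 else "class 2"


for z in [(2.0, 0.0), (-3.0, 1.0)]:
    print(z, decision_value(z), classify(z))
```

Only the support vectors appear in the sum, which is why the rest of the training set can be discarded once the model has been learned.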

86
Support Vector Machines
Support Vectors
87
Support Vector Machines
  • What if the training set is not linearly
    separable?
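  • Slack variables ξ_i can be added to allow
    misclassification of difficult or noisy examples;
    the resulting margin is called soft.

[Figure: separating hyperplane with slack variables ξ_i marked for the points that violate the margin.]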
88
Support Vector Machines
  • What if the problem is not linearly separable?
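  • Introduce slack variables ξ_i
  • Need to minimize $\frac{1}{2} \|w\|^2 + C \sum_i \xi_i$,
    where C sets the penalty for misclassification
  • Subject to $y_i (w^T x_i + b) \ge 1 - \xi_i$ and
    $\xi_i \ge 0$ for all i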

89
Nonlinear Support Vector Machines
  • What if decision boundary is not linear?

90
Non-linear SVMs
  • Datasets that are linearly separable with some
    noise work out great
  • But what are we going to do if the dataset is
    just too hard?
  • How about mapping data to a higher-dimensional
    space

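[Figure: one-dimensional data on the x axis that is not linearly separable, and the same data mapped to the two-dimensional space (x, x²).]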
91
Non-linear SVMs Feature spaces
  • General idea the original feature space can
    always be mapped to some higher-dimensional
    feature space where the training set is separable

Φ: x → φ(x)
92
prediction
Linear Regression
93
What Is Prediction?
  • (Numerical) prediction is similar to
    classification
  • construct a model
  • use model to predict continuous or ordered value
    for a given input
  • Prediction is different from classification
  • Classification predicts categorical
    class labels
  • Prediction models continuous-valued functions
  • Major method for prediction: regression
  • model the relationship between one or more
    predictor variables and a response variable

94
Prediction
Training data
Response
Attribute (Y)
Attribute (X)
Predictor
95
Types of Correlation
Positive correlation
Negative correlation
No correlation
96
Regression Analysis
  • Simple Linear regression
  • multiple regression
  • Non-linear regression
  • Other regression methods
  • generalized linear model,
  • Poisson regression,
  • log-linear models,
  • regression trees

97
Simple Linear Regression
describes the linear relationship between a
predictor variable, plotted on the x-axis, and a
response variable, plotted on the y-axis
Y
X
98
Simple Linear Regression
Y
1.0
X
99
Simple Linear Regression
Y
X
100
Simple Linear Regression
Y
e
e
X
101
Simple Linear Regression
Fitting data to a linear model: $y_i = w_0 + w_1 x_i + e_i$
intercept: $w_0$
slope: $w_1$
residuals: $e_i$
102
Simple Linear Regression
How to fit data to a linear model?
Least Square Method
103
Least Squares Regression
Model line: $\hat{y} = w_0 + w_1 x$
Residual (e): $e_i = y_i - \hat{y}_i$
Sum of squares of residuals: $\sum_i e_i^2$
  • we must find values of $w_0$ and $w_1$ that
    minimise $\sum_i e_i^2$


104
Linear Regression
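  • A model line $y = w_0 + w_1 x$ obtained by using the
    method of least squares to estimate the
    best-fitting straight line has
    $w_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $w_0 = \bar{y} - w_1 \bar{x}$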
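
The least squares estimates for a line y = w0 + w1 x are the standard closed-form ones: the slope is the sum of products of deviations from the means divided by the sum of squared x deviations, and the intercept follows from the means. A minimal sketch, with hypothetical data points chosen to lie exactly on y = 1 + 2x:

```python
def fit_line(xs, ys):
    """Least squares estimates (w0, w1) for the model y = w0 + w1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    w0 = y_mean - w1 * x_mean
    return w0, w1


xs = [1, 2, 3, 4, 5]   # hypothetical predictor values
ys = [3, 5, 7, 9, 11]  # hypothetical responses
w0, w1 = fit_line(xs, ys)
print(f"y = {w0:.2f} + {w1:.2f} x")
print("prediction for x = 6:", w0 + w1 * 6)
```

With noisy data the residuals would no longer be zero, but the same two formulas still give the line with the smallest sum of squared residuals.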

105
Multiple Linear Regression
  • Multiple linear regression involves more than
    one predictor variable
  • The linear model with a single predictor variable
    X can easily be extended to two or more
    predictor variables
  • Solvable by extension of least square method or
    using SAS, S-Plus

106
Nonlinear Regression
  • Some nonlinear models can be modeled by a
    polynomial function
  • A polynomial regression model can be transformed
    into linear regression model. For example,
  • $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$
  • convertible to linear with new variables $x_2 = x^2$,
    $x_3 = x^3$
  • $y = w_0 + w_1 x + w_2 x_2 + w_3 x_3$
  • Other functions, such as power function, can also
    be transformed to linear model
  • Some models are intractable nonlinear
  • possible to obtain least square estimates through
    extensive calculation on more complex formulae

107
Artificial Neural Networks (ANN)
108
What is an ANN?
  • ANN is a data structure that supposedly simulates
    the behavior of neurons in a biological brain.
  • ANN is composed of layers of interconnected
    units.
  • Messages are passed along the connections from
    one unit to the other.
  • Messages can change based on the weight of the
    connection and the value in the node

109
General Structure of ANN
110
ANN
Output Y is 1 if at least two of the three inputs
are equal to 1.
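
One way to realize this function is a single threshold unit. The weights (0.3 each) and the threshold (0.4) in the sketch below are assumed values that happen to satisfy the specification; they do not appear in this transcript.

```python
from itertools import product

WEIGHTS = (0.3, 0.3, 0.3)  # assumed link weights
THRESHOLD = 0.4            # assumed threshold t


def perceptron(inputs):
    """Output 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(WEIGHTS, inputs))
    return 1 if weighted_sum - THRESHOLD > 0 else 0


# Enumerate all eight binary input combinations.
for inputs in product((0, 1), repeat=3):
    print(inputs, "->", perceptron(inputs))
```

With these values one active input gives 0.3, which stays below the threshold, while two active inputs give 0.6, which exceeds it.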
111
ANN
112
Artificial Neural Networks
  • Model is an assembly of inter-connected nodes and
    weighted links
  • Output node sums up each of its input values
    according to the weights of its links
  • Compare output node against some threshold t

Perceptron Model: $Y = I(\sum_i w_i X_i - t > 0)$
or $Y = sign(\sum_i w_i X_i - t)$
113
Neural Networks
  • Advantages
  • prediction accuracy is generally high.
  • robust, works when training examples contain
    errors.
  • output may be discrete, real-valued, or a vector
    of several discrete or real-valued attributes.
  • fast evaluation of the learned target function.
  • Criticism
  • long training time.
  • difficult to understand the learned function
    (weights).
  • not easy to incorporate domain knowledge.

114
Learning Algorithms
  • Back propagation for classification
  • Kohonen feature maps for clustering
  • Recurrent back propagation for classification
  • Radial basis function for classification
  • Adaptive resonance theory
  • Probabilistic neural networks

115
Major Steps for Back Propagation Network
  • Constructing a network
  • input data representation
  • selection of number of layers, number of nodes in
    each layer.
  • Training the network using training data
  • Pruning the network
  • Interpret the results

116
A Multi-Layer Feed-Forward Neural Network
wij
117
How A Multi-Layer Neural Network Works?
  • The inputs to the network correspond to the
    attributes measured for each training tuple
  • Inputs are fed simultaneously into the units
    making up the input layer
  • They are then weighted and fed simultaneously to
    a hidden layer
  • The number of hidden layers is arbitrary,
    although usually only one
  • The weighted outputs of the last hidden layer are
    input to units making up the output layer, which
    emits the network's prediction
  • The network is feed-forward in that none of the
    weights cycles back to an input unit or to an
    output unit of a previous layer
  • From a statistical point of view, networks
    perform nonlinear regression. Given enough hidden
    units and enough training samples, they can
    closely approximate any function

118
Defining a Network Topology
  • First decide the network topology: the number of
    units in the input layer, the number of hidden
    layers (if > 1), the number of units in each
    hidden layer, and the number of units in the
    output layer
  • Normalize the input values for each attribute
    measured in the training tuples to [0.0, 1.0]
  • One input unit per domain value
  • Output: for classification with more than two
    classes, one output unit per class is used
  • If a trained network's accuracy is
    unacceptable, repeat the training process with
    a different network topology or a different set
    of initial weights

119
Backpropagation
  • Iteratively process a set of training tuples and
    compare the network's prediction with the actual
    known target value
  • For each training tuple, the weights are modified
    to minimize the mean squared error between the
    network's prediction and the actual target value
  • Modifications are made in the backwards
    direction from the output layer, through each
    hidden layer down to the first hidden layer,
    hence backpropagation
  • Steps
  • Initialize weights (to small random numbers) and
    biases in the network
  • Propagate the inputs forward (by applying
    activation function)
  • Backpropagate the error (by updating weights and
    biases)
  • Terminating condition (when error is very small,
    etc.)

120
Backpropagation
Correct value
Generated value
121
Network Pruning
  • Fully connected network will be hard to
    articulate
  • n input nodes, h hidden nodes and m output nodes
    lead to h(m + n) links (weights)
  • Pruning: Remove some of the links without
    affecting classification accuracy of the network.

122
Other Classification Methods
  • Associative classification: Association rule
    based, condSet → class
  • Genetic algorithm: Initial population of encoded
    rules is changed by mutation and cross-over
    based on survival of the accurate ones (survival).
  • K-nearest neighbor classifier: Learning by
    analogy.
  • Case-based reasoning: Similarity with other
    cases.
  • Rough set theory: Approximation to equivalence
    classes.
  • Fuzzy sets: Based on fuzzy logic (truth values
    between 0..1).

123
Lazy Learners
124
Lazy vs. Eager Learning
  • Lazy vs. eager learning
  • Lazy learning (e.g., instance-based learning):
    Simply stores training data (or does only minor
    processing) and waits until it is given a test
    tuple
  • Eager learning (the above discussed methods):
    Given a training set, constructs a
    classification model before receiving new (e.g.,
    test) data to classify
  • Lazy: less time in training but more time in
    predicting

125
Lazy Learner: Instance-Based Methods
  • Instance-based learning
  • Store training examples and delay the processing
    (lazy evaluation) until a new instance must be
    classified
  • Typical approaches
  • k-nearest neighbor approach
  • Instances represented as points in a Euclidean
    space.
  • Case-based reasoning
  • Uses symbolic representations and knowledge-based
    inference

126
Nearest Neighbor Classifiers
  • Basic idea
  • If it walks like a duck and quacks like a duck, then
    it's probably a duck

Training records
127
Instance-Based Classifiers
  • Store the training records
  • Use training records to predict the class
    label of unseen cases

128
Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data
points that have the k smallest distances to x
129
The k-Nearest Neighbor Algorithm
  • All instances correspond to points in the n-D
    space
  • The nearest neighbors are defined in terms of
    Euclidean distance, dist(X1, X2)
  • The target function could be discrete- or real-valued
  • For a discrete-valued target, k-NN returns the most common
    value among the k training examples nearest to xq
  • Voronoi diagram: the decision surface induced by
    1-NN for a typical set of training examples

[Figure: training examples scattered around the query point xq]

130
Nearest-Neighbor Classifiers
  • Requires three things
  • The set of stored records
  • A distance metric to compute the distance between
    records
  • The value of k, the number of nearest neighbors
    to retrieve
  • To classify an unknown record
  • Compute distance to other training records
  • Identify k nearest neighbors
  • Use class labels of nearest neighbors to
    determine the class label of unknown record
    (e.g., by taking majority vote)
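
The procedure above can be sketched in a few lines. This is an illustrative sketch, assuming numeric attributes, Euclidean distance, and a plain majority vote. The function names and the sample records are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(records, query, k):
    """records: list of (attributes, class_label) pairs."""
    # Compute the distance to every stored record and keep the k nearest.
    nearest = sorted(records, key=lambda r: euclidean(r[0], query))[:k]
    # Majority vote over the class labels of the nearest neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical stored records with two attributes each.
records = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
           ((6, 6), "B"), ((7, 6), "B")]
label_near_a = knn_classify(records, (1.5, 1.5), k=3)
label_near_b = knn_classify(records, (6, 7), k=3)
```

Note that all the work happens at classification time: nothing is built in advance beyond storing `records`, which is what makes the method a lazy learner.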

131
Nearest Neighbor Classification
  • Compute distance between two points
  • Euclidean distance
  • Determine the class from nearest neighbor list
  • take the majority vote of class labels among the
    k-nearest neighbors
  • Weigh the vote according to distance
  • weight factor, w = 1/d^2
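
Distance-weighted voting with the weight factor w = 1/d^2 can be sketched as below. The function name and the records are hypothetical, and returning the label of an exact match (d = 0) is an assumed policy, since 1/d^2 is undefined there.

```python
import math
from collections import defaultdict

def weighted_knn_classify(records, query, k):
    """records: list of (attributes, class_label) pairs."""
    by_distance = sorted(
        ((math.dist(attrs, query), label) for attrs, label in records),
        key=lambda pair: pair[0],
    )
    scores = defaultdict(float)
    for d, label in by_distance[:k]:
        if d == 0:
            # Assumed policy: an exact match decides the class outright.
            return label
        scores[label] += 1.0 / d ** 2  # weight factor w = 1/d^2
    return max(scores, key=scores.get)

# Hypothetical records: one close "B" neighbor and two distant "A" neighbors.
records = [((1, 0), "B"), ((3, 0), "A"), ((0, 3), "A")]
weighted_label = weighted_knn_classify(records, (0, 0), k=3)
```

With these records a plain majority vote over k = 3 would pick "A" (two votes to one), while the weighted vote favors the single close "B" neighbor (weight 1 against 2/9).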

132
Nearest Neighbor Classification
  • Scaling issues
  • Attributes may have to be scaled to prevent
    distance measures from being dominated by one of
    the attributes
  • Example
  • height of a person may vary from 1.5m to 1.8m
  • weight of a person may vary from 90lb to 300lb
  • income of a person may vary from 10K to 1M
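
One common way to scale attributes is min-max normalization to [0, 1], sketched below. The slides do not name a scaling method, so this choice is an assumption, and the three people are made-up rows built from the ranges in the example.

```python
def min_max_scale(rows):
    """Rescale each column of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column carries no information; map it to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Hypothetical people: (height in m, weight in lb, income).
people = [
    [1.5, 90, 10_000],
    [1.8, 300, 1_000_000],
    [1.65, 195, 505_000],
]
scaled = min_max_scale(people)
```

Without scaling, the income difference between two people (up to 990,000) would swamp the height difference (at most 0.3) in any Euclidean distance.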

133
Nearest Neighbor Classification
  • Choosing the value of k
  • If k is too small, sensitive to noise points
  • If k is too large, neighborhood may include
    points from other classes

134
Metrics for Performance Evaluation
  • Focus on the predictive capability of a model
  • Rather than how long it takes to classify or
    build models, scalability, etc.
  • Confusion Matrix

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d

a = TP (true positive), b = FN (false negative),
c = FP (false positive), d = TN (true negative)
135
Metrics for Performance Evaluation
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a (TP)      b (FN)
CLASS    Class=No     c (FP)      d (TN)
  • Most widely-used metric: accuracy

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Error Rate = 1 - Accuracy
136
Limitation of Accuracy
  • Consider a 2-class problem
  • Number of Class 0 examples = 9990
  • Number of Class 1 examples = 10
  • If the model predicts everything to be class 0,
    accuracy is 9990/10000 = 99.9%
  • Accuracy is misleading because the model does not
    detect any class 1 example
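
The example can be checked with a short sketch that tallies the confusion matrix and computes accuracy as (TP + TN) / (TP + TN + FP + FN). The function names are hypothetical, and class 1 is treated as the positive class.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FN, FP, TN) for the given positive class label."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

# 9990 class 0 examples, 10 class 1 examples, model predicts class 0 always.
actual = [0] * 9990 + [1] * 10
predicted = [0] * 10000
tp, fn, fp, tn = confusion_counts(actual, predicted, positive=1)
acc = accuracy(tp, fn, fp, tn)
```

Here `acc` is 0.999 even though `tp` is 0: every class 1 example lands in the false-negative cell.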

137
Alternative Classifier Accuracy Measures
  • accuracy = sensitivity × pos/(pos + neg)
    + specificity × neg/(pos + neg)
  • sensitivity = tp/pos (true positive
    recognition rate)
  • specificity = tn/neg (true negative
    recognition rate)
  • precision = tp/(tp + fp)
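
A small sketch of these measures, computed from confusion-matrix counts. The function name and the counts passed in are hypothetical; the last line of the function checks that accuracy is the weighted combination of sensitivity and specificity given above.

```python
def rates(tp, fn, fp, tn):
    pos, neg = tp + fn, fp + tn
    sensitivity = tp / pos          # true positive recognition rate
    specificity = tn / neg          # true negative recognition rate
    precision = tp / (tp + fp)
    # Accuracy expressed through sensitivity and specificity.
    acc = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
    return sensitivity, specificity, precision, acc

# Hypothetical counts for an imbalanced two-class problem.
sensitivity, specificity, precision, acc = rates(tp=90, fn=210, fp=140, tn=9560)
```

For these counts accuracy is 96.5% while sensitivity is only 30%, which is the kind of gap that accuracy alone hides.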

138
Predictor Error Measures
  • Test error (generalization error): the average
    loss over the test set
  • Mean absolute error
  • Mean squared error
  • Relative absolute error
  • Relative squared error
  • The mean squared error exaggerates the presence
    of outliers. The (square) root mean squared error is
    popularly used; similarly, the root relative
    squared error
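
The measures listed above can be sketched as follows, using their usual textbook definitions. Two assumptions are made here: the relative measures normalize by the deviation of the actual values from their own mean, and the function name and the sample values are hypothetical.

```python
import math

def error_measures(actual, predicted):
    d = len(actual)
    mean_actual = sum(actual) / d
    abs_err = [abs(y - p) for y, p in zip(actual, predicted)]
    sq_err = [(y - p) ** 2 for y, p in zip(actual, predicted)]
    mse = sum(sq_err) / d
    # Relative measures compare against always predicting the mean.
    rse = sum(sq_err) / sum((y - mean_actual) ** 2 for y in actual)
    return {
        "mae": sum(abs_err) / d,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "rae": sum(abs_err) / sum(abs(y - mean_actual) for y in actual),
        "rse": rse,
        "rrse": math.sqrt(rse),
    }

# Hypothetical test set: three exact predictions and one outlier (error of 4).
m = error_measures([1, 2, 3, 4], [1, 2, 3, 8])
```

The single outlier gives a mean absolute error of 1 but a mean squared error of 4; taking the root brings the figure back to 2, on the same scale as the predicted quantity.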

139
Evaluating Accuracy
  • Holdout method
  • Given data is randomly partitioned into two
    independent sets
  • Training set (e.g., 2/3) for model construction
  • Test set (e.g., 1/3) for accuracy estimation
  • Random sampling: a variation of holdout
  • Repeat holdout k times; accuracy = average of the
    accuracies obtained
  • Cross-validation (k-fold, where k = 10 is most
    popular)
  • Randomly partition the data into k mutually
    exclusive subsets, each of approximately equal size
  • At the i-th iteration, use Di as the test set and the
    others as the training set
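
A minimal sketch of the k-fold partitioning described above. The function name and the seed parameter are hypothetical; learning and scoring a model on each split are left to the caller.

```python
import random

def k_fold_splits(data, k, seed=0):
    """Yield (training_set, test_set) pairs, one per fold."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    # k mutually exclusive subsets of approximately equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Hypothetical data set of 25 tuples, split into 10 folds.
splits = list(k_fold_splits(list(range(25)), k=10))
```

Each tuple is used for testing exactly once and for training k - 1 times, so averaging the k accuracies uses all of the data for estimation.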

140
Evaluating Accuracy
  • Bootstrap
  • Works well with small data sets
  • Samples the given training tuples uniformly with
    replacement
  • Several bootstrap methods exist; a common one is
    the .632 bootstrap
  • Suppose we are given a data set of d tuples. The
    data set is sampled d times, with replacement,
    resulting in a training set of d samples. The
    data tuples that did not make it into the
    training set end up forming the test set. About
    63.2% of the original data will end up in the
    bootstrap, and the remaining 36.8% will form the
    test set (since (1 - 1/d)^d ≈ e^(-1) ≈ 0.368)
  • Repeat the sampling procedure k times to estimate
    the overall accuracy of the model
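
The sampling step and the 36.8% figure can be checked with a short sketch. The function names, the seed, and the data set size are hypothetical; only one round of sampling is shown.

```python
import math
import random

def bootstrap_split(data, rng):
    """Sample d tuples with replacement; unsampled tuples form the test set."""
    d = len(data)
    chosen = [rng.randrange(d) for _ in range(d)]
    chosen_set = set(chosen)
    train = [data[i] for i in chosen]
    test = [data[i] for i in range(d) if i not in chosen_set]
    return train, test

def prob_never_chosen(d):
    """Probability that a given tuple is missed in all d draws."""
    return (1 - 1 / d) ** d

rng = random.Random(0)
train, test = bootstrap_split(list(range(10_000)), rng)
test_fraction = len(test) / 10_000
```

For large d, `prob_never_chosen(d)` approaches e^(-1), and the observed `test_fraction` should land close to that value.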

141
Ensemble Methods
  • Construct a set of classifiers from the training
    data
  • Predict class label of previously unseen records
    by aggregating predictions made by multiple
    classifiers
  • Use a combination of models to increase accuracy
  • Combine a series of k learned models, M1, M2, ...,
    Mk, with the aim of creating an improved model M
  • Popular ensemble methods
  • Bagging
  • averaging the prediction over a collection of
    classifiers
  • Boosting
  • weighted vote with a collection of classifiers

142
General Idea
143
Bagging: Bootstrap Aggregation
  • Analogy: diagnosis based on multiple doctors'
    majority vote
  • Training
  • Given a set D of d tuples, at each iteration i, a
    training set Di of d tuples is sampled with
    replacement from D (i.e., bootstrap)
  • A classifier model Mi is learned for each
    training set Di
  • Classification: classify an unknown sample X
  • Each classifier Mi returns its class prediction
  • The bagged classifier M counts the votes and
    assigns the class with the most votes to X
  • Prediction: bagging can be applied to the prediction of
    continuous values by taking the average value of
    each prediction for a given test tuple
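
The training and classification steps can be sketched as below. The base learner here is a 1-nearest-neighbor classifier on one-dimensional values, chosen only to keep the example short; it, the function names, and the data are hypothetical, and any learner could be passed in instead.

```python
import random
from collections import Counter

def learn_1nn(training):
    """Hypothetical base learner: 1-NN over (value, label) pairs."""
    def classify(x):
        return min(training, key=lambda pair: abs(pair[0] - x))[1]
    return classify

def bagging(D, k, learn, seed=0):
    """Learn k models, each from a bootstrap sample Di of D."""
    rng = random.Random(seed)
    d = len(D)
    models = []
    for _ in range(k):
        Di = [rng.choice(D) for _ in range(d)]  # sampled with replacement
        models.append(learn(Di))
    return models

def bagged_classify(models, x):
    """Each model votes; the class with the most votes wins."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical data: low values labeled "low", high values labeled "high".
D = [(v, "low") for v in range(5)] + [(v, "high") for v in range(10, 15)]
models = bagging(D, k=11, learn=learn_1nn)
label_low = bagged_classify(models, 1)
label_high = bagged_classify(models, 13)
```

An odd k avoids tied votes in a two-class problem. For continuous targets, the vote in `bagged_classify` would be replaced by the average of the individual predictions.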

144
Bagging: Bootstrap Aggregation
  • Accuracy
  • Often significantly better than a single classifier
    derived from D
  • For noisy data: not considerably worse, more
    robust
  • Proven improved accuracy in prediction

145
Boosting
  • Analogy: consult several doctors and combine their
    weighted diagnoses, with each weight assigned
    based on the previous diagnosis accuracy
  • How does boosting work?
  • Weights are assigned to each training tuple
  • A series of k classifiers is iteratively learned
  • After a classifier Mi is learned, the weights are
    updated to allow the subsequent classifier, Mi+1,
    to pay more attention to the training tuples that
    were misclassified by Mi
  • The final M combines the votes of each
    individual classifier, where the weight of each
    classifier's vote is a function of its accuracy

146
Boosting
  • The boosting algorithm can be extended for the
    prediction of continuous values
  • Compared with bagging, boosting tends to achieve
    greater accuracy, but it also risks overfitting
    the model to misclassified data

147
Boosting: AdaBoost
  • Given a set of d class-labeled tuples, (X1, y1),
    ..., (Xd, yd)
  • Initially, all the weights of tuples are set to the
    same value (1/d)
  • Generate k classifiers in k rounds. At round i,
  • Tuples from D are sampled (with replacement) to
    form a training set Di of the same size
  • Each tuple's chance of being selected is based on
    its weight
  • A classification model Mi is derived from Di
  • Its error rate is calculated using Di as a test
    set
  • If a tuple is misclassified, its weight is
    increased, otherwise it is decreased
  • Error rate: err(Xj) is the misclassification
    error of tuple Xj. Classifier Mi's error rate is
    the sum of the weights of the misclassified
    tuples: error(Mi) = Σj wj × err(Xj)
  • The weight of classifier Mi's vote is computed
    from its error rate
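
A simplified sketch of the rounds described above, under several assumptions that should be read as such. First, to keep the example deterministic, each classifier is fitted against the tuple weights directly rather than on a weighted sample Di. Second, the vote weight log((1 - error) / error) is the usual AdaBoost choice and is assumed here. Third, the base classifier (a one-dimensional decision stump), the function names, and the data are hypothetical.

```python
import math

def stump_predict(stump, x):
    threshold, sign = stump
    return sign if x > threshold else -sign

def best_stump(xs, ys, weights):
    """Return the (threshold, sign) stump with the lowest weighted error."""
    points = sorted(set(xs))
    thresholds = [points[0] - 1.0] + [(a + b) / 2 for a, b in zip(points, points[1:])]
    best, best_err = None, float("inf")
    for t in thresholds:
        for sign in (1, -1):
            # Error rate: sum of the weights of the misclassified tuples.
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict((t, sign), x) != y)
            if err < best_err:
                best, best_err = (t, sign), err
    return best, best_err

def adaboost(xs, ys, k):
    d = len(xs)
    weights = [1.0 / d] * d  # initially all weights are 1/d
    ensemble = []
    for _ in range(k):
        # ASSUMPTION: fit on the weights directly instead of sampling Di.
        stump, err = best_stump(xs, ys, weights)
        if err >= 0.5:
            break  # no better than guessing; stop early (assumed policy)
        err = max(err, 1e-10)  # guard against division by zero
        # ASSUMPTION: vote weight log((1 - err) / err).
        ensemble.append((math.log((1 - err) / err), stump))
        # Shrink the weights of correctly classified tuples, then normalize,
        # so misclassified tuples gain relative weight.
        weights = [w * err / (1 - err) if stump_predict(stump, x) == y else w
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def boosted_predict(ensemble, x):
    score = sum(vote * stump_predict(stump, x) for vote, stump in ensemble)
    return 1 if score > 0 else -1

# Hypothetical data with labels +1 / -1 that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, k=3)
predictions = [boosted_predict(ensemble, x) for x in xs]
```

On this data a single stump misclassifies three tuples, while the weighted vote of three stumps classifies all ten training tuples correctly.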

148
Summary
  • Classification vs. prediction
  • Eager learners
  • Decision tree
  • Bayesian
  • Support vector machines (SVM)
  • Neural networks
  • Linear regression
  • Lazy learners
  • K-Nearest Neighbor (KNN)
  • Performance (Accuracy) Evaluation
  • Holdout
  • Cross validation
  • Bootstrap
  • Ensemble Methods
  • Bagging
  • Boosting

149
END