Title: COMP 578 Artificial Neural Networks for Data Mining
1 COMP 578: Artificial Neural Networks for Data Mining
- Keith C.C. Chan
- Department of Computing
- The Hong Kong Polytechnic University
2 Human vs. Computer
- Computers
  - Not good at tasks such as visual or audio processing/recognition.
  - Execute instructions one after another extremely rapidly.
  - Good at serial activities (e.g. counting, adding).
- Human brain
  - Units respond at about 10 per second (vs. a 2.5 GHz Pentium 4).
  - Works on many different things at once.
  - Vision and speech recognition emerge from the interaction of many different pieces of information.
3 The Brain
- The human brain is complicated and poorly understood.
- It contains approximately 10^10 basic units called neurons.
- Each neuron is connected to about 10,000 others.
(Diagram: dendrites, soma (or cell body), axon, synapse.)
4 The Neuron
(Diagram: dendrites, soma, axon, synapse.)
- A neuron accepts many inputs (through its dendrites).
- The inputs are all added up in some fashion.
- If enough active inputs are received at once, the neuron is activated and fires (along its axon).
5 The Synapse
- The axon produces a voltage pulse called an action potential (AP).
- The arrival of more than one AP is needed to trigger a synapse.
- A synapse releases neurotransmitters when the AP is raised sufficiently.
- The neurotransmitters diffuse across the gap, chemically activating the dendrites on the other side.
- Some synapses pass a large signal across, whilst others allow very little through.
6 Modeling the Single Neuron
- n inputs.
- The efficiency of the synapses is modeled by a multiplicative factor on each of the inputs to the neuron.
- These multiplicative factors are the weights associated with the input lines.
- The neuron's tasks:
  - Calculate the weighted sum of its inputs.
  - Compare the sum to some internal threshold.
  - Turn on if the threshold is exceeded.
(Diagram: inputs x1, x2, …, xn with weights w1, w2, …, wn feed a summation unit Σ, which produces the output y.)
7 A Mathematical Model of Neurons
- The neuron computes the weighted sum SUM = Σi wi xi.
- It fires if SUM exceeds a threshold θ:
  - y = 1 if SUM > θ
  - y = 0 if SUM ≤ θ.
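To make the model concrete, here is a minimal Python sketch of such a threshold unit; the function name and the example weights and threshold are illustrative, not taken from the slides.

```python
def neuron_output(inputs, weights, theta):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold theta."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > theta else 0

# Example: two inputs with weights 0.6 and 0.4 and threshold 0.5
print(neuron_output([1, 1], [0.6, 0.4], 0.5))  # 1 (sum 1.0 exceeds 0.5)
print(neuron_output([1, 0], [0.6, 0.4], 0.5))  # 1 (sum 0.6 exceeds 0.5)
print(neuron_output([0, 1], [0.6, 0.4], 0.5))  # 0 (sum 0.4 does not exceed 0.5)
```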
8 Learning in Simple Neurons
- We need to be able to determine the connection weights.
- Inspiration comes from looking at real neural systems: reinforce good behavior and reprimand bad.
- E.g., train a NN to recognize the two characters H and F:
  - Output 1 when an H is presented and 0 when it sees an F.
  - If it produces an incorrect output, we want to reduce the chances of that happening again.
  - This is done by modifying the weights.
9 Learning in Simple Neurons (2)
- The neuron is given random initial weights; at this starting state it knows nothing.
- Present an H.
- The neuron computes the weighted sum of its inputs and compares it with the threshold.
- If the sum exceeds the threshold, it outputs a 1; otherwise a 0.
- If the output is 1, the neuron is correct: do nothing.
- Otherwise, if the neuron produces a 0: increase the weights so that next time the sum will exceed the threshold and produce a 1.
10 A Simple Learning Rule
- By how much should the weights be changed?
- A simple rule:
  - Add the input values to the weights when we want the output to be on.
  - Subtract the input values from the weights when we want the output to be off.
- This learning rule is called the Hebb rule.
  - It is a variant of one proposed by Donald Hebb, and this kind of learning is called Hebbian learning.
  - It is the earliest and simplest learning rule for a neuron.
11 The Hebb Net
- Step 0. Initialize all weights: wi = 0 (i = 1 to n).
- Step 1. For each training input vector s and its target output t, do Steps 2-4.
- Step 2. Set the activations of all input units.
- Step 3. Set the activation of the output unit.
- Step 4. Adjust the weights and the bias:
  - wi(new) = wi(old) + xi y (i = 1 to n), i.e. Δwi = xi y
  - b(new) = b(old) + y.
- The bias b is adjusted like a weight from a unit whose output signal is always 1.
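A short Python sketch of Steps 0-4 (bipolar inputs and targets assumed; the function name is illustrative):

```python
def train_hebb(records):
    """records: list of (input_vector, target) pairs. Returns (weights, bias)."""
    n = len(records[0][0])
    weights = [0.0] * n            # Step 0: all weights start at zero
    bias = 0.0                     # bias = weight from a unit whose signal is always 1
    for x, t in records:           # Step 1: one pass over the training records
        y = t                      # Steps 2-3: clamp the output unit to the target
        for i in range(n):         # Step 4: w_i(new) = w_i(old) + x_i * y
            weights[i] += x[i] * y
        bias += y                  # b(new) = b(old) + y
    return weights, bias
```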
12 A Hebb Net Example
13 The Data Set
- Attributes
  - HS_Index: Drop, Rise
  - Trading_Vol: Small, Medium, Large
  - DJIA: Drop, Rise
- Class Label
  - Buy_Sell: Buy, Sell
14 The Data Set
    HS_Index   Trading_Vol   DJIA   Buy_Sell
1   Drop       Large         Drop   Buy
2   Rise       Large         Rise   Sell
3   Rise       Medium        Drop   Buy
4   Drop       Small         Drop   Sell
5   Rise       Small         Drop   Sell
6   Rise       Large         Drop   Buy
7   Rise       Small         Rise   Sell
8   Drop       Large         Rise   Sell
15 Transformation
- Input Features
  - HS_Index_Drop: -1, 1
  - HS_Index_Rise: -1, 1
  - Trading_Vol_Small: -1, 1
  - Trading_Vol_Medium: -1, 1
  - Trading_Vol_Large: -1, 1
  - DJIA_Drop: -1, 1
  - DJIA_Rise: -1, 1
  - Bias: 1
- Output Feature
  - Buy_Sell: -1, 1
(Diagram: input nodes HSI_Drop, HSI_Rise, …, DJIA_Drop, DJIA_Rise plus the bias feed a single output node B/S.)
16 Transformed Data
    Input Feature                     Output Feature
1   <1, -1, -1, -1, 1, 1, -1, 1>      <1>
2   <-1, 1, -1, -1, 1, -1, 1, 1>      <-1>
3   <-1, 1, -1, 1, -1, 1, -1, 1>      <1>
4   <1, -1, 1, -1, -1, 1, -1, 1>      <-1>
5   <-1, 1, 1, -1, -1, 1, -1, 1>      <-1>
6   <-1, 1, -1, -1, 1, 1, -1, 1>      <1>
7   <-1, 1, 1, -1, -1, -1, 1, 1>      <-1>
8   <1, -1, -1, -1, 1, -1, 1, 1>      <-1>
17 Record 1
- Input Feature: <1, -1, -1, -1, 1, 1, -1, 1>
- Output Feature: <1>
- Original Weight: <0, 0, 0, 0, 0, 0, 0, 0>
- Weight Change: <1, -1, -1, -1, 1, 1, -1, 1>
- New Weight: <1, -1, -1, -1, 1, 1, -1, 1>
18 Record 2
- Input Feature: <-1, 1, -1, -1, 1, -1, 1, 1>
- Output Feature: <-1>
- Original Weight: <1, -1, -1, -1, 1, 1, -1, 1>
- Weight Change: <1, -1, 1, 1, -1, 1, -1, -1>
- New Weight: <2, -2, 0, 0, 0, 2, -2, 0>
19 Record 3
- Input Feature: <-1, 1, -1, 1, -1, 1, -1, 1>
- Output Feature: <1>
- Original Weight: <2, -2, 0, 0, 0, 2, -2, 0>
- Weight Change: <-1, 1, -1, 1, -1, 1, -1, 1>
- New Weight: <1, -1, -1, 1, -1, 3, -3, 1>
20 Record 4
- Input Feature: <1, -1, 1, -1, -1, 1, -1, 1>
- Output Feature: <-1>
- Original Weight: <1, -1, -1, 1, -1, 3, -3, 1>
- Weight Change: <-1, 1, -1, 1, 1, -1, 1, -1>
- New Weight: <0, 0, -2, 2, 0, 2, -2, 0>
21 Record 5
- Input Feature: <-1, 1, 1, -1, -1, 1, -1, 1>
- Output Feature: <-1>
- Original Weight: <0, 0, -2, 2, 0, 2, -2, 0>
- Weight Change: <1, -1, -1, 1, 1, -1, 1, -1>
- New Weight: <1, -1, -3, 3, 1, 1, -1, -1>
22 Record 6
- Input Feature: <-1, 1, -1, -1, 1, 1, -1, 1>
- Output Feature: <1>
- Original Weight: <1, -1, -3, 3, 1, 1, -1, -1>
- Weight Change: <-1, 1, -1, -1, 1, 1, -1, 1>
- New Weight: <0, 0, -4, 2, 2, 2, -2, 0>
23 Record 7
- Input Feature: <-1, 1, 1, -1, -1, -1, 1, 1>
- Output Feature: <-1>
- Original Weight: <0, 0, -4, 2, 2, 2, -2, 0>
- Weight Change: <1, -1, -1, 1, 1, 1, -1, -1>
- New Weight: <1, -1, -5, 3, 3, 3, -3, -1>
24 Record 8
- Input Feature: <1, -1, -1, -1, 1, -1, 1, 1>
- Output Feature: <-1>
- Original Weight: <1, -1, -5, 3, 3, 3, -3, -1>
- Weight Change: <-1, 1, 1, 1, -1, 1, -1, -1>
- New Weight: <0, 0, -4, 4, 2, 4, -4, -2>
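The whole trace on slides 17-24 can be reproduced with a few lines of Python; in this sketch the bias is carried as the last component of every vector (its input is always 1), exactly as in the transformed records:

```python
# Transformed records from slide 16: (input vector, target)
records = [
    ([ 1, -1, -1, -1,  1,  1, -1, 1],  1),
    ([-1,  1, -1, -1,  1, -1,  1, 1], -1),
    ([-1,  1, -1,  1, -1,  1, -1, 1],  1),
    ([ 1, -1,  1, -1, -1,  1, -1, 1], -1),
    ([-1,  1,  1, -1, -1,  1, -1, 1], -1),
    ([-1,  1, -1, -1,  1,  1, -1, 1],  1),
    ([-1,  1,  1, -1, -1, -1,  1, 1], -1),
    ([ 1, -1, -1, -1,  1, -1,  1, 1], -1),
]

w = [0] * 8
for x, t in records:
    w = [wi + xi * t for wi, xi in zip(w, x)]   # Hebb update for each record
print(w)  # [0, 0, -4, 4, 2, 4, -4, -2], matching the New Weight of Record 8
```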
25 A Hebb Net Example 2
Input        Target
(x1 x2 1)
(1 1 1)       1
(1 -1 1)     -1
(-1 1 1)     -1
(-1 -1 1)    -1
26
Input        Target   Weight Changes    Weights
(x1 x2 1)             (Δw1 Δw2 Δb)      (w1 w2 b)
                                        (0 0 0)
(1 1 1)       1       (1 1 1)           (1 1 1)

The separating line becomes x2 = -x1 - 1.
27
Input        Target   Weight Changes    Weights
(x1 x2 1)             (Δw1 Δw2 Δb)      (w1 w2 b)
                                        (1 1 1)
(1 -1 1)     -1       (-1 1 -1)         (0 2 0)

The separating line becomes x2 = 0.
28
Input        Target   Weight Changes    Weights
(x1 x2 1)             (Δw1 Δw2 Δb)      (w1 w2 b)
                                        (0 2 0)
(-1 1 1)     -1       (1 -1 -1)         (1 1 -1)

The separating line becomes x2 = -x1 + 1.
29
Input        Target   Weight Changes    Weights
(x1 x2 1)             (Δw1 Δw2 Δb)      (w1 w2 b)
                                        (1 1 -1)
(-1 -1 1)    -1       (1 1 -1)          (2 2 -2)

Even though the weights have changed, the separating line is still x2 = -x1 + 1. The graph of the decision regions (the positive response and the negative response) remains as shown.
30 A Hebb Net Example 3
Input        Target
(x1 x2 1)
(1 1 1)       1
(1 0 1)       0
(0 1 1)       0
(0 0 1)       0
31
Input        Target   Weight Changes    Weights
(x1 x2 1)             (Δw1 Δw2 Δb)      (w1 w2 b)
                                        (0 0 0)
(1 1 1)       1       (1 1 1)           (1 1 1)

The separating line becomes x2 = -x1 - 1.
32 Since the target value is 0, no learning occurs. Using binary target values prevents the net from learning any pattern for which the target is off.

Input        Target   Weight Changes    Weights
(x1 x2 1)             (Δw1 Δw2 Δb)      (w1 w2 b)
(1 0 1)       0       (0 0 0)           (1 1 1)
(0 1 1)       0       (0 0 0)           (1 1 1)
(0 0 1)       0       (0 0 0)           (1 1 1)
33 Characteristics of the Hebb Net
- The choice of training records determines which problems can be solved.
- Training records corresponding to the AND function can be learned if the inputs and targets are in bipolar form.
- The bipolar representation allows a weight to be modified both when the input and target are on together and when they are off together.
34 The Perceptron Learning Rule
- More powerful than the Hebb rule.
- The Perceptron learning rule convergence theorem states that:
  - If weights exist that allow the neuron to respond correctly to all training patterns, then the rule will find such weights.
  - The neuron will find these weights in a finite number of training steps.
- Let SUM be the weighted sum; the output of the Perceptron, y = f(SUM), can be 1, 0, or -1.
- The activation function is given below.
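The formula itself did not survive the slide export; the standard three-valued form, consistent with the undecided band of width 2θ described on slides 37-38, is:

```latex
f(\mathrm{SUM}) =
\begin{cases}
 1 & \text{if } \mathrm{SUM} > \theta \\
 0 & \text{if } -\theta \le \mathrm{SUM} \le \theta \\
-1 & \text{if } \mathrm{SUM} < -\theta
\end{cases}
```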
35 Perceptron Learning
- For each training record, the net calculates the response of the output unit.
- It determines whether an error occurred for this pattern by comparing the calculated output with the target value.
- If an error occurred, the weights are changed according to wi(new) = wi(old) + α t xi, where t is +1 or -1 and α is the learning rate.
- If no error occurred, the weights are not changed.
- Training continues until no error occurs.
36 Perceptron for Classification
- Step 0. Initialize all weights and the bias (for simplicity, set them to zero). Set the learning rate α (0 < α ≤ 1; for simplicity, α can be set to 1).
- Step 1. While the stopping condition is false, do Steps 2-6.
- Step 2. For each training pair, do Steps 3-5.
- Step 3. Set the activations of the input units xi.
- Step 4. Compute the response of the output unit: SUM = b + Σi xi wi, y = f(SUM).
- Step 5. Update the weights and bias if an error occurred for this vector:
  - If y ≠ t: wi(new) = wi(old) + α t xi and b(new) = b(old) + α t
  - else: wi(new) = wi(old) and b(new) = b(old).
- Step 6. If no weights changed in Step 2, stop; else continue.
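A Python sketch of the loop in Steps 0-6; the threshold θ = 0.2 and learning rate α = 1 match the worked examples that follow, and the names are illustrative:

```python
def f(s, theta=0.2):
    """Activation with an undecided band of width 2*theta."""
    if s > theta:
        return 1
    if s < -theta:
        return -1
    return 0

def train_perceptron(records, alpha=1.0, theta=0.2, max_epochs=100):
    """records: list of (input_vector, bipolar_target) pairs."""
    n = len(records[0][0])
    w, b = [0.0] * n, 0.0                                              # Step 0
    for _ in range(max_epochs):                                        # Step 1
        changed = False
        for x, t in records:                                           # Steps 2-3
            y = f(b + sum(wi * xi for wi, xi in zip(w, x)), theta)     # Step 4
            if y != t:                                                 # Step 5: update only on error
                w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
                b += alpha * t
                changed = True
        if not changed:                                                # Step 6
            break
    return w, b
```

Run on the example of slides 52-61 (inputs (1,1), (1,0), (0,1), (0,0) with targets 1, -1, -1, -1, in that order), this sketch reproduces the final weights w1 = 2, w2 = 3, b = -4.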
37 Perceptron for Classification (2)
- Only the weights connecting active input units (xi ≠ 0) are updated.
- Weights are updated only for patterns that do not produce the correct value of y.
- There is less learning as more training patterns produce the correct response.
- The threshold on the activation function of the response unit is a fixed, non-negative value θ.
- The form of the activation function for the output unit creates an undecided band of fixed width, determined by θ, separating the region of positive response from that of negative response.
38 Perceptron for Classification (3)
- Instead of one separating line, we have a line separating the region of positive response from the region of zero response (the line bounding the inequality)
  - w1 x1 + w2 x2 + b > θ
- and a line separating the region of zero response from the region of negative response (the line bounding the inequality)
  - w1 x1 + w2 x2 + b < -θ.
39 Perceptron
40 The Data Set (1)
- Attributes
  - HS_Index: Drop, Rise
  - Trading_Vol: Small, Medium, Large
  - DJIA: Drop, Rise
- Class Label
  - Buy_Sell: Buy, Sell
41 The Data Set (2)
    HS_Index   Trading_Vol   DJIA   Buy_Sell
1   Drop       Large         Drop   Buy
2   Rise       Large         Rise   Sell
3   Rise       Medium        Drop   Buy
4   Drop       Small         Drop   Sell
5   Rise       Small         Drop   Sell
6   Rise       Large         Drop   Buy
7   Rise       Small         Rise   Sell
8   Drop       Large         Rise   Sell
42 Transformation
- Input Features
  - HS_Index_Drop: 0, 1
  - HS_Index_Rise: 0, 1
  - Trading_Vol_Small: 0, 1
  - Trading_Vol_Medium: 0, 1
  - Trading_Vol_Large: 0, 1
  - DJIA_Drop: 0, 1
  - DJIA_Rise: 0, 1
  - Bias: 1
- Output Feature
  - Buy → 1
  - Sell → -1
43 Transformed Data
    Input Feature                  Output Feature
1   <1, 0, 0, 0, 1, 1, 0, 1>       <1>
2   <0, 1, 0, 0, 1, 0, 1, 1>       <-1>
3   <0, 1, 0, 1, 0, 1, 0, 1>       <1>
4   <1, 0, 1, 0, 0, 1, 0, 1>       <-1>
5   <0, 1, 1, 0, 0, 1, 0, 1>       <-1>
6   <0, 1, 0, 0, 1, 1, 0, 1>       <1>
7   <0, 1, 1, 0, 0, 0, 1, 1>       <-1>
8   <1, 0, 0, 0, 1, 0, 1, 1>       <-1>
44 Record 1
- Input Feature: <1, 0, 0, 0, 1, 1, 0, 1>
- Output Feature: <1>
- Original Weight: <0, 0, 0, 0, 0, 0, 0, 0>
- Output: f(0) = 0
- Weight Change: <1, 0, 0, 0, 1, 1, 0, 1>
- New Weight: <1, 0, 0, 0, 1, 1, 0, 1>
45 Record 2
- Input Feature: <0, 1, 0, 0, 1, 0, 1, 1>
- Output Feature: <-1>
- Original Weight: <1, 0, 0, 0, 1, 1, 0, 1>
- Output: f(2) = 1
- Weight Change: <0, -1, 0, 0, -1, 0, -1, -1>
- New Weight: <1, -1, 0, 0, 0, 1, -1, 0>
46 Record 3
- Input Feature: <0, 1, 0, 1, 0, 1, 0, 1>
- Output Feature: <1>
- Original Weight: <1, -1, 0, 0, 0, 1, -1, 0>
- Output: f(0) = 0
- Weight Change: <0, 1, 0, 1, 0, 1, 0, 1>
- New Weight: <1, 0, 0, 1, 0, 2, -1, 1>
47 Record 4
- Input Feature: <1, 0, 1, 0, 0, 1, 0, 1>
- Output Feature: <-1>
- Original Weight: <1, 0, 0, 1, 0, 2, -1, 1>
- Output: f(4) = 1
- Weight Change: <-1, 0, -1, 0, 0, -1, 0, -1>
- New Weight: <0, 0, -1, 1, 0, 1, -1, 0>
48 Record 5
- Input Feature: <0, 1, 1, 0, 0, 1, 0, 1>
- Output Feature: <-1>
- Original Weight: <0, 0, -1, 1, 0, 1, -1, 0>
- Output: f(0) = 0
- Weight Change: <0, -1, -1, 0, 0, -1, 0, -1>
- New Weight: <0, -1, -2, 1, 0, 0, -1, -1>
49 Record 6
- Input Feature: <0, 1, 0, 0, 1, 1, 0, 1>
- Output Feature: <1>
- Original Weight: <0, -1, -2, 1, 0, 0, -1, -1>
- Output: f(-2) = -1
- Weight Change: <0, 1, 0, 0, 1, 1, 0, 1>
- New Weight: <0, 0, -2, 1, 1, 1, -1, 0>
50 Record 7
- Input Feature: <0, 1, 1, 0, 0, 0, 1, 1>
- Output Feature: <-1>
- Original Weight: <0, 0, -2, 1, 1, 1, -1, 0>
- Output: f(-3) = -1
- Weight Change: <0, 0, 0, 0, 0, 0, 0, 0>
- New Weight: <0, 0, -2, 1, 1, 1, -1, 0>
51 Record 8
- Input Feature: <1, 0, 0, 0, 1, 0, 1, 1>
- Output Feature: <-1>
- Original Weight: <0, 0, -2, 1, 1, 1, -1, 0>
- Output: f(0) = 0
- Weight Change: <-1, 0, 0, 0, -1, 0, -1, -1>
- New Weight: <-1, 0, -2, 1, 0, 1, -2, -1>
52 A Perceptron Example
Input        Target
(x1 x2 1)
(1 1 1)       1
(1 0 1)      -1
(0 1 1)      -1
(0 0 1)      -1
53
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (0 0 0)
(1 1 1)      0     0     1       (1 1 1)           (1 1 1)

The separating lines become x1 + x2 + 1 = 0.2 and x1 + x2 + 1 = -0.2.
54
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (1 1 1)
(1 0 1)      2     1    -1       (-1 0 -1)         (0 1 0)

The separating lines become x2 = 0.2 and x2 = -0.2.
55
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (0 1 0)
(0 1 1)      1     1    -1       (0 -1 -1)         (0 0 -1)
(0 0 1)     -1    -1    -1       (0 0 0)           (0 0 -1)
56
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (0 0 -1)
(1 1 1)     -1    -1     1       (1 1 1)           (1 1 0)

The separating lines become x1 + x2 = 0.2 and x1 + x2 = -0.2.
57
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (1 1 0)
(1 0 1)      1     1    -1       (-1 0 -1)         (0 1 -1)

The separating lines become x2 - 1 = 0.2 and x2 - 1 = -0.2.
58 Completing the second epoch:
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (0 1 -1)
(0 1 1)      0     0    -1       (0 -1 -1)         (0 0 -2)
(0 0 1)     -2    -1    -1       (0 0 0)           (0 0 -2)

The results for the third epoch are:
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (0 0 -2)
(1 1 1)     -2    -1     1       (1 1 1)           (1 1 -1)
(1 0 1)      0     0    -1       (-1 0 -1)         (0 1 -2)
(0 1 1)     -1    -1    -1       (0 0 0)           (0 1 -2)
(0 0 1)     -2    -1    -1       (0 0 0)           (0 1 -2)
59 The results for the fourth epoch are:
(1 1 1)     -1    -1     1       (1 1 1)           (1 2 -1)
(1 0 1)      0     0    -1       (-1 0 -1)         (0 2 -2)
(0 1 1)      0     0    -1       (0 -1 -1)         (0 1 -3)
(0 0 1)     -3    -1    -1       (0 0 0)           (0 1 -3)

For the fifth epoch, we have:
(1 1 1)     -2    -1     1       (1 1 1)           (1 2 -2)
(1 0 1)     -1    -1    -1       (0 0 0)           (1 2 -2)
(0 1 1)      0     0    -1       (0 -1 -1)         (1 1 -3)
(0 0 1)     -3    -1    -1       (0 0 0)           (1 1 -3)

And for the sixth epoch:
(1 1 1)     -1    -1     1       (1 1 1)           (2 2 -2)
(1 0 1)      0     0    -1       (-1 0 -1)         (1 2 -3)
(0 1 1)     -1    -1    -1       (0 0 0)           (1 2 -3)
(0 0 1)     -3    -1    -1       (0 0 0)           (1 2 -3)
60 The results for the seventh epoch are:
(1 1 1)      0     0     1       (1 1 1)           (2 3 -2)
(1 0 1)      0     0    -1       (-1 0 -1)         (1 3 -3)
(0 1 1)      0     0    -1       (0 -1 -1)         (1 2 -4)
(0 0 1)     -4    -1    -1       (0 0 0)           (1 2 -4)

The eighth epoch yields:
(1 1 1)     -1    -1     1       (1 1 1)           (2 3 -3)
(1 0 1)     -1    -1    -1       (0 0 0)           (2 3 -3)
(0 1 1)      0     0    -1       (0 -1 -1)         (2 2 -4)
(0 0 1)     -4    -1    -1       (0 0 0)           (2 2 -4)

And the ninth:
(1 1 1)      0     0     1       (1 1 1)           (3 3 -3)
(1 0 1)      0     0    -1       (-1 0 -1)         (2 3 -4)
(0 1 1)     -1    -1    -1       (0 0 0)           (2 3 -4)
(0 0 1)     -4    -1    -1       (0 0 0)           (2 3 -4)
61 Finally, the results for the tenth epoch are:
(1 1 1)      1     1     1       (0 0 0)           (2 3 -4)
(1 0 1)     -2    -1    -1       (0 0 0)           (2 3 -4)
(0 1 1)     -1    -1    -1       (0 0 0)           (2 3 -4)
(0 0 1)     -4    -1    -1       (0 0 0)           (2 3 -4)

- The positive response is given by 2 x1 + 3 x2 - 4 > 0.2, with boundary line x2 = -(2/3) x1 + 7/5.
- The negative response is given by 2 x1 + 3 x2 - 4 < -0.2, with boundary line x2 = -(2/3) x1 + 19/15.
62 A 2nd Perceptron Example
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (0 0 0)
(1 1 1)      0     0     1       (1 1 1)           (1 1 1)
(1 -1 1)     1     1    -1       (-1 1 -1)         (0 2 0)
(-1 1 1)     2     1    -1       (1 -1 -1)         (1 1 -1)
(-1 -1 1)   -3    -1    -1       (0 0 0)           (1 1 -1)
63 In the second epoch of training, we have:
(1 1 1)      1     1     1       (0 0 0)           (1 1 -1)
(1 -1 1)    -1    -1    -1       (0 0 0)           (1 1 -1)
(-1 1 1)    -1    -1    -1       (0 0 0)           (1 1 -1)
(-1 -1 1)   -3    -1    -1       (0 0 0)           (1 1 -1)

Since all the Δw's are 0 in epoch 2, the system was fully trained after the first epoch.
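A compact check of this bipolar example (a sketch assuming θ = 0.2 and learning rate 1, as above):

```python
def f(s, theta=0.2):
    return 1 if s > theta else (-1 if s < -theta else 0)

data = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w1 = w2 = b = 0
for epoch in range(2):
    for (x1, x2), t in data:
        y = f(w1 * x1 + w2 * x2 + b)
        if y != t:                              # update only on error
            w1, w2, b = w1 + t * x1, w2 + t * x2, b + t
print(w1, w2, b)  # 1 1 -1 after the first epoch; unchanged by the second
```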
64 Limitations of Perceptrons
- The Perceptron finds a straight line that separates the classes.
- It cannot learn the exclusive-OR (XOR) function: such patterns are not linearly separable.
- Little work followed after Minsky and Papert published their book in 1969.
- Rumelhart and McClelland produced an improvement in 1986: they proposed modern adaptations of the Perceptron, called the multilayer Perceptron.
65 The Multilayer Perceptron
- Overcoming linear inseparability:
  - Use more perceptrons, each set up to identify small, linearly separable sections of the inputs.
  - Combine their outputs into another perceptron.
- Each neuron still takes the weighted sum of its inputs, thresholds it, and outputs 1 or 0.
- But how can such a network learn?
66 The Multilayer Perceptron (2)
- Perceptrons in the 2nd layer do not know which of the real inputs were on or not.
- The 2-state output (on or off) gives no indication of how much to adjust the weights:
  - Some weighted inputs turn a neuron on decisively.
  - Some weighted inputs only just turn a neuron on and should not be altered to the same extent.
- What changes would produce a better solution next time? Which of the input weights should be increased and which should not?
- We have no way of finding out (the credit assignment problem).
67 The Solution
- We need a non-binary thresholding function.
- Use a slightly different non-linearity, so that the unit more or less turns on or off.
- A possible new thresholding function is the sigmoid function.
- A sigmoid thresholding function does not mask the inputs from the outputs.
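For reference, the usual choice (and the one matched by the arithmetic in the worked example on slides 76-78) is the logistic sigmoid, whose derivative takes the convenient form used there:

```latex
f(x) = \frac{1}{1 + e^{-x}}, \qquad f'(x) = f(x)\,\bigl(1 - f(x)\bigr)
```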
68 The Multi-layer Perceptron
- An input layer, an output layer, and a hidden layer.
- Each unit in the hidden and output layers is like a perceptron unit, but its thresholding function is the sigmoid.
- Units in the input layer serve only to distribute the values they receive to the next layer; they do not perform a weighted sum or threshold.
69 The Backpropagation Rule
- The single-layer perceptron model is changed:
  - The thresholding function changes from a step to a sigmoid function.
  - A hidden layer is added.
- The learning rule needs to be altered.
- The new learning rule for the multilayer perceptron is called the generalized delta rule, or the backpropagation rule:
  - Show the NN a pattern and calculate its response.
  - Compare with the desired response.
  - Alter the weights so that the NN produces a more accurate output next time.
- The learning rule provides the method for adjusting the weights so as to decrease the error next time.
70 Backpropagation Details
- Define an error function to represent the difference between the NN's current output and the correct output.
- The backpropagation rule aims to reduce the error by:
  - Calculating the value of the error for a particular input.
  - Back-propagating the error from one layer to the previous one.
- Each unit in the net has its weights adjusted so as to reduce the value of the error function:
  - For units at the output, the output and desired output are known, so adjusting the weights is relatively simple.
  - For units in the middle, those connected to outputs with a large error should have their weights adjusted a lot, while those that feed almost correct outputs should not be altered much.
71 The Detailed Algorithm
- Step 0. Initialize the weights (set to small random values).
- Step 1. While the stopping condition is false, do Steps 2-9.
- Step 2. For each training pair, do Steps 3-8.
- Feedforward:
  - Step 3. Each input unit (Xi, i = 1, …, n) receives input signal xi and broadcasts this signal to all units in the layer above (the hidden units).
  - Step 4. Each hidden unit (Zj, j = 1, …, p) sums its weighted input signals, z_inj = v0j + Σi xi vij, applies its activation function to compute its output signal, zj = f(z_inj), and sends this signal to all units in the layer above (the output units).
  - Step 5. Each output unit (Yk, k = 1, …, m) sums its weighted input signals, y_ink = w0k + Σj zj wjk, and applies its activation function to compute its output signal, yk = f(y_ink).
72 The Detailed Algorithm (2)
- Backpropagation of error:
  - Step 6. Each output unit (Yk, k = 1, …, m) receives a target pattern corresponding to the input training pattern and computes its error information term, δk = (tk - yk) f'(y_ink). It calculates its weight correction term (used to update wjk later), Δwjk = α δk zj, calculates its bias correction term (used to update w0k later), Δw0k = α δk, and sends δk to the units in the layer below.
  - Step 7. Each hidden unit (Zj, j = 1, …, p) sums its delta inputs (from the units in the layer above), δ_inj = Σk δk wjk, and multiplies by the derivative of its activation function to calculate its error information term, δj = δ_inj f'(z_inj). It then calculates its weight correction term (used to update vij later), Δvij = α δj xi, and its bias correction term (used to update v0j later), Δv0j = α δj.
73 The Detailed Algorithm (3)
- Update weights and biases:
  - Step 8. Each output unit (Yk, k = 1, …, m) updates its bias and weights (j = 0, …, p): wjk(new) = wjk(old) + Δwjk. Each hidden unit (Zj, j = 1, …, p) updates its bias and weights (i = 0, …, n): vij(new) = vij(old) + Δvij.
- Step 9. Test the stopping condition.
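A compact Python sketch of Steps 0-9 for a single hidden layer, in the notation above (v for input-to-hidden weights, w for hidden-to-output weights, α for the learning rate). The layer sizes, α and the fixed epoch count are illustrative assumptions, and the stopping test of Step 9 is replaced by a fixed number of epochs:

```python
import math
import random

def f(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def train_backprop(data, n, p, m, alpha=0.5, epochs=1000):
    """data: list of (x, t) pairs; n, p, m: input, hidden and output layer sizes."""
    # Step 0: small random weights; row 0 of each matrix holds the bias.
    v = [[random.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(n + 1)]
    w = [[random.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(p + 1)]
    for _ in range(epochs):                                            # Step 1
        for x, t in data:                                              # Step 2
            # Steps 3-5: feedforward
            z_in = [v[0][j] + sum(x[i] * v[i + 1][j] for i in range(n)) for j in range(p)]
            z = [f(s) for s in z_in]
            y_in = [w[0][k] + sum(z[j] * w[j + 1][k] for j in range(p)) for k in range(m)]
            y = [f(s) for s in y_in]
            # Step 6: output error terms, delta_k = (t_k - y_k) f'(y_in_k)
            d_out = [(t[k] - y[k]) * y[k] * (1.0 - y[k]) for k in range(m)]
            # Step 7: hidden error terms, delta_j = (sum_k delta_k w_jk) f'(z_in_j)
            d_hid = [sum(d_out[k] * w[j + 1][k] for k in range(m)) * z[j] * (1.0 - z[j])
                     for j in range(p)]
            # Step 8: update hidden-to-output weights and biases ...
            for k in range(m):
                w[0][k] += alpha * d_out[k]
                for j in range(p):
                    w[j + 1][k] += alpha * d_out[k] * z[j]
            # ... and input-to-hidden weights and biases
            for j in range(p):
                v[0][j] += alpha * d_hid[j]
                for i in range(n):
                    v[i + 1][j] += alpha * d_hid[j] * x[i]
    return v, w
```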
74 An Example: Multilayer Perceptron Network with Backpropagation Training
(Network diagram: input nodes 1-3 labelled HSI_Rise, Vol_High and DJIA_Drop; hidden nodes 4-6; output nodes 7-8.)
75 Initial Weights and Bias Values
- wij: weight between nodes i and j.
- θi: bias value of node i.
- For node 4: w14 = 0.2, w24 = 0.4, w34 = -0.5, θ4 = -0.4
- For node 5: w15 = -0.3, w25 = 0.1, w35 = 0.2, θ5 = 0.2
- For node 6: w16 = 0.6, w26 = 0.7, w36 = -0.1, θ6 = 0.1
- For node 7: w47 = -0.3, w57 = -0.2, w67 = 0.1, θ7 = 0.6
- For node 8: w48 = -0.5, w58 = 0.1, w68 = -0.3, θ8 = 0.3
76 Training (1)
- Learning rate = 0.9
- Input = <1, 0, 1>
- Target output = <1, 0>
- For node 4:
  - Input = 0.2 × 1 + 0.4 × 0 + (-0.5) × 1 + (-0.4) = -0.7
  - Output = 1 / (1 + e^0.7) = 0.332
- For node 5:
  - Input = (-0.3) × 1 + 0.1 × 0 + 0.2 × 1 + 0.2 = 0.1
  - Output = 1 / (1 + e^-0.1) = 0.525
- For node 6:
  - Input = 0.6 × 1 + 0.7 × 0 + (-0.1) × 1 + 0.1 = 0.6
  - Output = 1 / (1 + e^-0.6) = 0.646
- For node 7:
  - Input = 0.332 × (-0.3) + 0.525 × (-0.2) + 0.646 × 0.1 + 0.6 = 0.460
  - Output = 1 / (1 + e^-0.460) = 0.613
- For node 8:
  - Input = 0.332 × (-0.5) + 0.525 × 0.1 + 0.646 × (-0.3) + 0.3 = -0.007
  - Output = 1 / (1 + e^0.007) = 0.498
77 Training (2)
- For node 7:
  - Error = 0.613 × (1 - 0.613) × (1 - 0.613) = 0.092
- For node 8:
  - Error = 0.498 × (1 - 0.498) × (0 - 0.498) = -0.125
- For node 4:
  - Error = 0.332 × (1 - 0.332) × (0.092 × (-0.3) + (-0.125) × (-0.5)) = 0.008
- For node 5:
  - Error = 0.525 × (1 - 0.525) × (0.092 × (-0.2) + (-0.125) × 0.1) = -0.008
- For node 6:
  - Error = 0.646 × (1 - 0.646) × (0.092 × 0.1 + (-0.125) × (-0.3)) = 0.011
78 Training (3)
- For each weight:
  - w14 = 0.2 + 0.9 × 0.008 × 0.332 ≈ 0.202
  - w15 = -0.3 + 0.9 × (-0.008) × 0.525 ≈ -0.304
  - …
- For each bias:
  - θ4 = -0.4 + 0.9 × 0.008 ≈ -0.393
  - θ5 = 0.2 + 0.9 × (-0.008) ≈ 0.193
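A quick numerical check of the forward pass and the output-layer error terms above (a sketch; it assumes the logistic sigmoid and the signs of the initial weights as reconstructed on slide 75):

```python
import math

sig = lambda s: 1.0 / (1.0 + math.exp(-s))

o4 = sig(0.2 * 1 + 0.4 * 0 + (-0.5) * 1 + (-0.4))        # 0.332
o5 = sig((-0.3) * 1 + 0.1 * 0 + 0.2 * 1 + 0.2)           # 0.525
o6 = sig(0.6 * 1 + 0.7 * 0 + (-0.1) * 1 + 0.1)           # 0.646
o7 = sig(o4 * (-0.3) + o5 * (-0.2) + o6 * 0.1 + 0.6)     # 0.613
o8 = sig(o4 * (-0.5) + o5 * 0.1 + o6 * (-0.3) + 0.3)     # 0.498

err7 = o7 * (1 - o7) * (1 - o7)   # target 1 ->  0.092
err8 = o8 * (1 - o8) * (0 - o8)   # target 0 -> -0.125
print(round(o7, 3), round(o8, 3), round(err7, 3), round(err8, 3))
```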
79 Using ANN for Data Mining
- Constructing a network
  - input data representation
  - selection of the number of layers and the number of nodes in each layer
- Training the network using training data
- Pruning the network
- Interpreting the results
80 Step 1: Constructing the Network
Multi-layer perceptron (MLP), feed-forward, backpropagation.
(Network diagram: input nodes x1 = # of Terms, x2 = GPA, x3 = Demographics, x4 = Courses, x5 = Fin Aid feed a hidden layer through weights w1, …, w5n; output nodes o1 = Persist, o2 = Not-persist.)
81 Constructing the Network (2)
- The number of input nodes corresponds to the dimensionality of the input tuples.
- Thermometer coding, e.g. age 20-80 split into 6 intervals (see the sketch below):
  - [20, 30) → 000001, [30, 40) → 000011, …, [70, 80) → 111111
- The number of hidden nodes is adjusted during training.
- The number of output nodes equals the number of classes.
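A small Python sketch of this thermometer coding (the function name and the interval arithmetic are illustrative assumptions):

```python
def thermometer_age(age, low=20, high=80, bits=6):
    """Encode age in [low, high) as a thermometer code: one extra bit per 10-year interval."""
    width = (high - low) // bits                        # 10-year intervals
    on = min(bits, max(0, (age - low) // width + 1))    # [20,30) -> 1 bit, ..., [70,80) -> 6 bits
    return "0" * (bits - on) + "1" * on

print(thermometer_age(25))   # 000001
print(thermometer_age(38))   # 000011
print(thermometer_age(79))   # 111111
```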
82 Step 2: Network Training
- The ultimate objective of training: obtain a set of weights that makes almost all the tuples in the training data classified correctly.
- Steps:
  - Initial weights are set randomly.
  - Input tuples are fed into the network one by one.
  - Activation values for the hidden nodes are computed.
  - The output vector is computed once the activation values of all hidden nodes are available.
  - Weights are adjusted using the error (desired output - actual output).
83 Step 3: Network Pruning
- A fully connected network is hard to articulate:
  - n input nodes, h hidden nodes and m output nodes lead to h(m + n) links (weights).
- Pruning: remove some of the links without affecting the classification accuracy of the network.
84 Step 4: Extracting Rules from ANN
- Discretize the activation values: replace each individual activation value by its cluster average while maintaining the network's accuracy.
- Enumerate the outputs from the discretized activation values to find rules between activation values and outputs.
- Find the relationship between the inputs and the activation values.
- Combine the above two to obtain rules relating the outputs to the inputs.
85 An Example (I)
- IBM synthetic data
  - nine attributes (age, salary, …)
- classification function:
  - if ((age < 40) ∧ (50K ≤ salary ≤ 100K)) ∨ ((40 ≤ age < 60) ∧ (75K ≤ salary ≤ 125K)) ∨ ((age > 60) ∧ (25K ≤ salary ≤ 75K)) then class A else class B
- initial network
  - 87 input nodes, 2 output nodes, 4 hidden nodes
  - trained network using 1000 tuples
- pruned network
  - 7 input nodes, 3 hidden nodes, 2 output nodes
  - 17 links
  - accuracy: 96.30%
86 An Example (II)
- Hidden node value discretization:
  - a1: (-1, 0, 1)
  - a2: (0, 1)
  - a3: (-1, 0.24, 1)
- Enumerate the output from the discretized activation values:
  - a2 = 0, a3 = -1
  - a1 = -1, a2 = 1, a3 = -1
  - a1 = -1, a2 = 0, a3 = -0.24
  - ⇒ C1 = 1, C2 = 0
  - otherwise C1 = 0, C2 = 1
87 An Example (III)
- From input to hidden node:
  - I2 = I17 = 0 ⇒ a2 = 0
  - I5 = I15 = 1 ⇒ a3 = -1
  - I13 = 0 ⇒ a3 = -1
  - …
- Obtain rules relating input and output:
  - I2 = I17 = 0, I5 = I15 = 1 ⇒ class 1
  - I2 = I17 = 0, I13 = 0 ⇒ class 1
- Transform to the original input attributes:
  - I17 = 0 ⇒ age < 40; I2 = 0 ⇒ salary < 100K
88 ANN vs. Others for Data Mining
- Advantages
  - prediction accuracy is generally high
  - robust: works when training examples contain errors
  - output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  - fast evaluation of the learned target function
- Criticism
  - long training time
  - the learned function (the weights) is difficult to understand
  - not easy to incorporate domain knowledge