Title: COMP 578 Artificial Neural Networks for Data Mining
1 COMP 578: Artificial Neural Networks for Data Mining
- Keith C.C. Chan
- Department of Computing
- The Hong Kong Polytechnic University
2 Human vs. Computer
- Computers
  - Not good at tasks such as visual or audio processing/recognition.
  - Execute instructions one after another extremely rapidly.
  - Good at serial activities (e.g. counting, adding).
- Human brain
  - Units respond at about 10 per second (vs. a 2.5 GHz Pentium 4).
  - Works on many different things at once.
  - Vision and speech recognition emerge from the interaction of many different pieces of information.
3 The Brain
- The human brain is complicated and poorly understood.
- It contains approximately 10^10 basic units called neurons.
- Each neuron is connected to about 10,000 others.
(Diagram: dendrites, soma (or cell body), axon, synapse.)
4 The Neuron
(Diagram: dendrites, soma, axon, synapse.)
- A neuron accepts many inputs (through its dendrites).
- The inputs are all added up in some fashion.
- If enough active inputs are received at once, the neuron is activated and fires (along its axon).
5 The Synapse
- The axon produces a voltage pulse called an action potential (AP).
- The arrival of more than one AP is needed to trigger a synapse.
- A synapse releases neurotransmitters when the AP is raised sufficiently.
- The neurotransmitters diffuse across the gap, chemically activating the dendrites on the other side.
- Some synapses pass a large signal across, whilst others allow very little through.
6 Modeling the Single Neuron
- n inputs.
- The efficiency of the synapses is modeled by a multiplicative factor on each of the inputs to the neuron.
- These multiplicative factors are the weights associated with the input lines.
- The neuron's tasks:
  - Calculate the weighted sum of its inputs.
  - Compare the sum to some internal threshold.
  - Turn on if the threshold is exceeded.
(Diagram: inputs x1, x2, …, xn with weights w1, w2, …, wn feed a summation unit Σ, which produces the output y.)
7 A Mathematical Model of Neurons
- The neuron computes the weighted sum SUM = Σi wi xi.
- It fires if SUM exceeds a threshold θ:
  - y = 1 if SUM > θ
  - y = 0 if SUM ≤ θ.
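To make the model concrete, here is a minimal Python sketch of such a threshold unit; the function name and the example weights and threshold are illustrative, not taken from the slides.

```python
def neuron_output(inputs, weights, theta):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold theta."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > theta else 0

# Example: two inputs with weights 0.6 and 0.4 and threshold 0.5
print(neuron_output([1, 1], [0.6, 0.4], 0.5))  # 1 (sum 1.0 exceeds 0.5)
print(neuron_output([1, 0], [0.6, 0.4], 0.5))  # 1 (sum 0.6 exceeds 0.5)
print(neuron_output([0, 1], [0.6, 0.4], 0.5))  # 0 (sum 0.4 does not exceed 0.5)
```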
8 Learning in Simple Neurons
- We need to be able to determine the connection weights.
- Inspiration comes from looking at real neural systems: reinforce good behavior and reprimand bad.
- E.g., train a NN to recognize the two characters H and F:
  - Output 1 when an H is presented and 0 when it sees an F.
  - If it produces an incorrect output, we want to reduce the chances of that happening again.
  - This is done by modifying the weights.
9 Learning in Simple Neurons (2)
- The neuron is given random initial weights; at this starting state it knows nothing.
- Present an H.
- The neuron computes the weighted sum of its inputs and compares it with the threshold.
- If the sum exceeds the threshold, it outputs a 1; otherwise a 0.
- If the output is 1, the neuron is correct: do nothing.
- Otherwise, if the neuron produces a 0: increase the weights so that next time the sum will exceed the threshold and produce a 1.
10 A Simple Learning Rule
- By how much should the weights be changed?
- A simple rule:
  - Add the input values to the weights when we want the output to be on.
  - Subtract the input values from the weights when we want the output to be off.
- This learning rule is called the Hebb rule.
  - It is a variant of one proposed by Donald Hebb, and this kind of learning is called Hebbian learning.
  - It is the earliest and simplest learning rule for a neuron.
11 The Hebb Net
- Step 0. Initialize all weights: wi = 0 (i = 1 to n).
- Step 1. For each training input vector s and its target output t, do Steps 2-4.
- Step 2. Set the activations of all input units.
- Step 3. Set the activation of the output unit.
- Step 4. Adjust the weights and the bias:
  - wi(new) = wi(old) + xi y (i = 1 to n), i.e. Δwi = xi y
  - b(new) = b(old) + y.
- The bias b is adjusted like a weight from a unit whose output signal is always 1.
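A short Python sketch of Steps 0-4 (bipolar inputs and targets assumed; the function name is illustrative):

```python
def train_hebb(records):
    """records: list of (input_vector, target) pairs. Returns (weights, bias)."""
    n = len(records[0][0])
    weights = [0.0] * n            # Step 0: all weights start at zero
    bias = 0.0                     # bias = weight from a unit whose signal is always 1
    for x, t in records:           # Step 1: one pass over the training records
        y = t                      # Steps 2-3: clamp the output unit to the target
        for i in range(n):         # Step 4: w_i(new) = w_i(old) + x_i * y
            weights[i] += x[i] * y
        bias += y                  # b(new) = b(old) + y
    return weights, bias
```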
12 A Hebb Net Example
13 The Data Set
- Attributes
  - HS_Index: Drop, Rise
  - Trading_Vol: Small, Medium, Large
  - DJIA: Drop, Rise
- Class Label
  - Buy_Sell: Buy, Sell
14 The Data Set
    HS_Index   Trading_Vol   DJIA   Buy_Sell
1   Drop       Large         Drop   Buy
2   Rise       Large         Rise   Sell
3   Rise       Medium        Drop   Buy
4   Drop       Small         Drop   Sell
5   Rise       Small         Drop   Sell
6   Rise       Large         Drop   Buy
7   Rise       Small         Rise   Sell
8   Drop       Large         Rise   Sell
15 Transformation
- Input Features
  - HS_Index_Drop: -1, 1
  - HS_Index_Rise: -1, 1
  - Trading_Vol_Small: -1, 1
  - Trading_Vol_Medium: -1, 1
  - Trading_Vol_Large: -1, 1
  - DJIA_Drop: -1, 1
  - DJIA_Rise: -1, 1
  - Bias: 1
- Output Feature
  - Buy_Sell: -1, 1
(Diagram: input nodes HSI_Drop, HSI_Rise, …, DJIA_Drop, DJIA_Rise plus the bias feed a single output node B/S.)
16 Transformed Data
    Input Feature                     Output Feature
1   <1, -1, -1, -1, 1, 1, -1, 1>      <1>
2   <-1, 1, -1, -1, 1, -1, 1, 1>      <-1>
3   <-1, 1, -1, 1, -1, 1, -1, 1>      <1>
4   <1, -1, 1, -1, -1, 1, -1, 1>      <-1>
5   <-1, 1, 1, -1, -1, 1, -1, 1>      <-1>
6   <-1, 1, -1, -1, 1, 1, -1, 1>      <1>
7   <-1, 1, 1, -1, -1, -1, 1, 1>      <-1>
8   <1, -1, -1, -1, 1, -1, 1, 1>      <-1>
17 Record 1
- Input Feature: <1, -1, -1, -1, 1, 1, -1, 1>
- Output Feature: <1>
- Original Weight: <0, 0, 0, 0, 0, 0, 0, 0>
- Weight Change: <1, -1, -1, -1, 1, 1, -1, 1>
- New Weight: <1, -1, -1, -1, 1, 1, -1, 1>
18 Record 2
- Input Feature: <-1, 1, -1, -1, 1, -1, 1, 1>
- Output Feature: <-1>
- Original Weight: <1, -1, -1, -1, 1, 1, -1, 1>
- Weight Change: <1, -1, 1, 1, -1, 1, -1, -1>
- New Weight: <2, -2, 0, 0, 0, 2, -2, 0>
19 Record 3
- Input Feature: <-1, 1, -1, 1, -1, 1, -1, 1>
- Output Feature: <1>
- Original Weight: <2, -2, 0, 0, 0, 2, -2, 0>
- Weight Change: <-1, 1, -1, 1, -1, 1, -1, 1>
- New Weight: <1, -1, -1, 1, -1, 3, -3, 1>
20 Record 4
- Input Feature: <1, -1, 1, -1, -1, 1, -1, 1>
- Output Feature: <-1>
- Original Weight: <1, -1, -1, 1, -1, 3, -3, 1>
- Weight Change: <-1, 1, -1, 1, 1, -1, 1, -1>
- New Weight: <0, 0, -2, 2, 0, 2, -2, 0>
21 Record 5
- Input Feature: <-1, 1, 1, -1, -1, 1, -1, 1>
- Output Feature: <-1>
- Original Weight: <0, 0, -2, 2, 0, 2, -2, 0>
- Weight Change: <1, -1, -1, 1, 1, -1, 1, -1>
- New Weight: <1, -1, -3, 3, 1, 1, -1, -1>
22 Record 6
- Input Feature: <-1, 1, -1, -1, 1, 1, -1, 1>
- Output Feature: <1>
- Original Weight: <1, -1, -3, 3, 1, 1, -1, -1>
- Weight Change: <-1, 1, -1, -1, 1, 1, -1, 1>
- New Weight: <0, 0, -4, 2, 2, 2, -2, 0>
23 Record 7
- Input Feature: <-1, 1, 1, -1, -1, -1, 1, 1>
- Output Feature: <-1>
- Original Weight: <0, 0, -4, 2, 2, 2, -2, 0>
- Weight Change: <1, -1, -1, 1, 1, 1, -1, -1>
- New Weight: <1, -1, -5, 3, 3, 3, -3, -1>
24 Record 8
- Input Feature: <1, -1, -1, -1, 1, -1, 1, 1>
- Output Feature: <-1>
- Original Weight: <1, -1, -5, 3, 3, 3, -3, -1>
- Weight Change: <-1, 1, 1, 1, -1, 1, -1, -1>
- New Weight: <0, 0, -4, 4, 2, 4, -4, -2>
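The whole trace on slides 17-24 can be reproduced with a few lines of Python; in this sketch the bias is carried as the last component of every vector (its input is always 1), exactly as in the transformed records:

```python
# Transformed records from slide 16: (input vector, target)
records = [
    ([ 1, -1, -1, -1,  1,  1, -1, 1],  1),
    ([-1,  1, -1, -1,  1, -1,  1, 1], -1),
    ([-1,  1, -1,  1, -1,  1, -1, 1],  1),
    ([ 1, -1,  1, -1, -1,  1, -1, 1], -1),
    ([-1,  1,  1, -1, -1,  1, -1, 1], -1),
    ([-1,  1, -1, -1,  1,  1, -1, 1],  1),
    ([-1,  1,  1, -1, -1, -1,  1, 1], -1),
    ([ 1, -1, -1, -1,  1, -1,  1, 1], -1),
]

w = [0] * 8
for x, t in records:
    w = [wi + xi * t for wi, xi in zip(w, x)]   # Hebb update for each record
print(w)  # [0, 0, -4, 4, 2, 4, -4, -2], matching the New Weight of Record 8
```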
25 A Hebb Net Example 2
Input        Target
(x1 x2 1)
(1 1 1)       1
(1 -1 1)     -1
(-1 1 1)     -1
(-1 -1 1)    -1
26
Input        Target   Weight Changes    Weights
(x1 x2 1)             (Δw1 Δw2 Δb)      (w1 w2 b)
                                        (0 0 0)
(1 1 1)       1       (1 1 1)           (1 1 1)

The separating line becomes x2 = -x1 - 1.
27
Input        Target   Weight Changes    Weights
(x1 x2 1)             (Δw1 Δw2 Δb)      (w1 w2 b)
                                        (1 1 1)
(1 -1 1)     -1       (-1 1 -1)         (0 2 0)

The separating line becomes x2 = 0.
28
Input        Target   Weight Changes    Weights
(x1 x2 1)             (Δw1 Δw2 Δb)      (w1 w2 b)
                                        (0 2 0)
(-1 1 1)     -1       (1 -1 -1)         (1 1 -1)

The separating line becomes x2 = -x1 + 1.
29
Input        Target   Weight Changes    Weights
(x1 x2 1)             (Δw1 Δw2 Δb)      (w1 w2 b)
                                        (1 1 -1)
(-1 -1 1)    -1       (1 1 -1)          (2 2 -2)

Even though the weights have changed, the separating line is still x2 = -x1 + 1. The graph of the decision regions (the positive response and the negative response) remains as shown.
30 A Hebb Net Example 3
Input        Target
(x1 x2 1)
(1 1 1)       1
(1 0 1)       0
(0 1 1)       0
(0 0 1)       0
31
Input        Target   Weight Changes    Weights
(x1 x2 1)             (Δw1 Δw2 Δb)      (w1 w2 b)
                                        (0 0 0)
(1 1 1)       1       (1 1 1)           (1 1 1)

The separating line becomes x2 = -x1 - 1.
32 Since the target value is 0, no learning occurs. Using binary target values prevents the net from learning any pattern for which the target is off.

Input        Target   Weight Changes    Weights
(x1 x2 1)             (Δw1 Δw2 Δb)      (w1 w2 b)
(1 0 1)       0       (0 0 0)           (1 1 1)
(0 1 1)       0       (0 0 0)           (1 1 1)
(0 0 1)       0       (0 0 0)           (1 1 1)
33 Characteristics of the Hebb Net
- The choice of training records determines which problems can be solved.
- Training records corresponding to the AND function can be learned if the inputs and targets are in bipolar form.
- The bipolar representation allows a weight to be modified both when the input and target are on together and when they are off together.
34 The Perceptron Learning Rule
- More powerful than the Hebb rule.
- The Perceptron learning rule convergence theorem states that:
  - If weights exist that allow the neuron to respond correctly to all training patterns, then the rule will find such weights.
  - The neuron will find these weights in a finite number of training steps.
- Let SUM be the weighted sum; the output of the Perceptron, y = f(SUM), can be 1, 0, or -1.
- The activation function is given below.
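The formula itself did not survive the slide export; the standard three-valued form, consistent with the undecided band of width 2θ described on slides 37-38, is:

```latex
f(\mathrm{SUM}) =
\begin{cases}
 1 & \text{if } \mathrm{SUM} > \theta \\
 0 & \text{if } -\theta \le \mathrm{SUM} \le \theta \\
-1 & \text{if } \mathrm{SUM} < -\theta
\end{cases}
```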
35 Perceptron Learning
- For each training record, the net calculates the response of the output unit.
- It determines whether an error occurred for this pattern by comparing the calculated output with the target value.
- If an error occurred, the weights are changed according to wi(new) = wi(old) + α t xi, where t is +1 or -1 and α is the learning rate.
- If no error occurred, the weights are not changed.
- Training continues until no error occurs.
36 Perceptron for Classification
- Step 0. Initialize all weights and the bias (for simplicity, set them to zero). Set the learning rate α (0 < α ≤ 1; for simplicity, α can be set to 1).
- Step 1. While the stopping condition is false, do Steps 2-6.
- Step 2. For each training pair, do Steps 3-5.
- Step 3. Set the activations of the input units xi.
- Step 4. Compute the response of the output unit: SUM = b + Σi xi wi, y = f(SUM).
- Step 5. Update the weights and bias if an error occurred for this vector:
  - If y ≠ t: wi(new) = wi(old) + α t xi and b(new) = b(old) + α t
  - else: wi(new) = wi(old) and b(new) = b(old).
- Step 6. If no weights changed in Step 2, stop; else continue.
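A Python sketch of the loop in Steps 0-6; the threshold θ = 0.2 and learning rate α = 1 match the worked examples that follow, and the names are illustrative:

```python
def f(s, theta=0.2):
    """Activation with an undecided band of width 2*theta."""
    if s > theta:
        return 1
    if s < -theta:
        return -1
    return 0

def train_perceptron(records, alpha=1.0, theta=0.2, max_epochs=100):
    """records: list of (input_vector, bipolar_target) pairs."""
    n = len(records[0][0])
    w, b = [0.0] * n, 0.0                                              # Step 0
    for _ in range(max_epochs):                                        # Step 1
        changed = False
        for x, t in records:                                           # Steps 2-3
            y = f(b + sum(wi * xi for wi, xi in zip(w, x)), theta)     # Step 4
            if y != t:                                                 # Step 5: update only on error
                w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
                b += alpha * t
                changed = True
        if not changed:                                                # Step 6
            break
    return w, b
```

Run on the example of slides 52-61 (inputs (1,1), (1,0), (0,1), (0,0) with targets 1, -1, -1, -1, in that order), this sketch reproduces the final weights w1 = 2, w2 = 3, b = -4.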
37 Perceptron for Classification (2)
- Only the weights connecting active input units (xi ≠ 0) are updated.
- Weights are updated only for patterns that do not produce the correct value of y.
- There is less learning as more training patterns produce the correct response.
- The threshold on the activation function of the response unit is a fixed, non-negative value θ.
- The form of the activation function for the output unit creates an undecided band of fixed width, determined by θ, separating the region of positive response from that of negative response.
38 Perceptron for Classification (3)
- Instead of one separating line, we have a line separating the region of positive response from the region of zero response (the line bounding the inequality)
  - w1 x1 + w2 x2 + b > θ
- and a line separating the region of zero response from the region of negative response (the line bounding the inequality)
  - w1 x1 + w2 x2 + b < -θ.
39 Perceptron
40 The Data Set (1)
- Attributes
  - HS_Index: Drop, Rise
  - Trading_Vol: Small, Medium, Large
  - DJIA: Drop, Rise
- Class Label
  - Buy_Sell: Buy, Sell
41 The Data Set (2)
    HS_Index   Trading_Vol   DJIA   Buy_Sell
1   Drop       Large         Drop   Buy
2   Rise       Large         Rise   Sell
3   Rise       Medium        Drop   Buy
4   Drop       Small         Drop   Sell
5   Rise       Small         Drop   Sell
6   Rise       Large         Drop   Buy
7   Rise       Small         Rise   Sell
8   Drop       Large         Rise   Sell
42 Transformation
- Input Features
  - HS_Index_Drop: 0, 1
  - HS_Index_Rise: 0, 1
  - Trading_Vol_Small: 0, 1
  - Trading_Vol_Medium: 0, 1
  - Trading_Vol_Large: 0, 1
  - DJIA_Drop: 0, 1
  - DJIA_Rise: 0, 1
  - Bias: 1
- Output Feature
  - Buy → 1
  - Sell → -1
43 Transformed Data
    Input Feature                  Output Feature
1   <1, 0, 0, 0, 1, 1, 0, 1>       <1>
2   <0, 1, 0, 0, 1, 0, 1, 1>       <-1>
3   <0, 1, 0, 1, 0, 1, 0, 1>       <1>
4   <1, 0, 1, 0, 0, 1, 0, 1>       <-1>
5   <0, 1, 1, 0, 0, 1, 0, 1>       <-1>
6   <0, 1, 0, 0, 1, 1, 0, 1>       <1>
7   <0, 1, 1, 0, 0, 0, 1, 1>       <-1>
8   <1, 0, 0, 0, 1, 0, 1, 1>       <-1>
44 Record 1
- Input Feature: <1, 0, 0, 0, 1, 1, 0, 1>
- Output Feature: <1>
- Original Weight: <0, 0, 0, 0, 0, 0, 0, 0>
- Output: f(0) = 0
- Weight Change: <1, 0, 0, 0, 1, 1, 0, 1>
- New Weight: <1, 0, 0, 0, 1, 1, 0, 1>
45 Record 2
- Input Feature: <0, 1, 0, 0, 1, 0, 1, 1>
- Output Feature: <-1>
- Original Weight: <1, 0, 0, 0, 1, 1, 0, 1>
- Output: f(2) = 1
- Weight Change: <0, -1, 0, 0, -1, 0, -1, -1>
- New Weight: <1, -1, 0, 0, 0, 1, -1, 0>
46 Record 3
- Input Feature: <0, 1, 0, 1, 0, 1, 0, 1>
- Output Feature: <1>
- Original Weight: <1, -1, 0, 0, 0, 1, -1, 0>
- Output: f(0) = 0
- Weight Change: <0, 1, 0, 1, 0, 1, 0, 1>
- New Weight: <1, 0, 0, 1, 0, 2, -1, 1>
47 Record 4
- Input Feature: <1, 0, 1, 0, 0, 1, 0, 1>
- Output Feature: <-1>
- Original Weight: <1, 0, 0, 1, 0, 2, -1, 1>
- Output: f(4) = 1
- Weight Change: <-1, 0, -1, 0, 0, -1, 0, -1>
- New Weight: <0, 0, -1, 1, 0, 1, -1, 0>
48 Record 5
- Input Feature: <0, 1, 1, 0, 0, 1, 0, 1>
- Output Feature: <-1>
- Original Weight: <0, 0, -1, 1, 0, 1, -1, 0>
- Output: f(0) = 0
- Weight Change: <0, -1, -1, 0, 0, -1, 0, -1>
- New Weight: <0, -1, -2, 1, 0, 0, -1, -1>
49 Record 6
- Input Feature: <0, 1, 0, 0, 1, 1, 0, 1>
- Output Feature: <1>
- Original Weight: <0, -1, -2, 1, 0, 0, -1, -1>
- Output: f(-2) = -1
- Weight Change: <0, 1, 0, 0, 1, 1, 0, 1>
- New Weight: <0, 0, -2, 1, 1, 1, -1, 0>
50 Record 7
- Input Feature: <0, 1, 1, 0, 0, 0, 1, 1>
- Output Feature: <-1>
- Original Weight: <0, 0, -2, 1, 1, 1, -1, 0>
- Output: f(-3) = -1
- Weight Change: <0, 0, 0, 0, 0, 0, 0, 0>
- New Weight: <0, 0, -2, 1, 1, 1, -1, 0>
51 Record 8
- Input Feature: <1, 0, 0, 0, 1, 0, 1, 1>
- Output Feature: <-1>
- Original Weight: <0, 0, -2, 1, 1, 1, -1, 0>
- Output: f(0) = 0
- Weight Change: <-1, 0, 0, 0, -1, 0, -1, -1>
- New Weight: <-1, 0, -2, 1, 0, 1, -2, -1>
52 A Perceptron Example
Input        Target
(x1 x2 1)
(1 1 1)       1
(1 0 1)      -1
(0 1 1)      -1
(0 0 1)      -1
53
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (0 0 0)
(1 1 1)      0     0     1       (1 1 1)           (1 1 1)

The separating lines become x1 + x2 + 1 = 0.2 and x1 + x2 + 1 = -0.2.
54
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (1 1 1)
(1 0 1)      2     1    -1       (-1 0 -1)         (0 1 0)

The separating lines become x2 = 0.2 and x2 = -0.2.
55
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (0 1 0)
(0 1 1)      1     1    -1       (0 -1 -1)         (0 0 -1)
(0 0 1)     -1    -1    -1       (0 0 0)           (0 0 -1)
56
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (0 0 -1)
(1 1 1)     -1    -1     1       (1 1 1)           (1 1 0)

The separating lines become x1 + x2 = 0.2 and x1 + x2 = -0.2.
57
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (1 1 0)
(1 0 1)      1     1    -1       (-1 0 -1)         (0 1 -1)

The separating lines become x2 - 1 = 0.2 and x2 - 1 = -0.2.
58 Completing the second epoch:
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (0 1 -1)
(0 1 1)      0     0    -1       (0 -1 -1)         (0 0 -2)
(0 0 1)     -2    -1    -1       (0 0 0)           (0 0 -2)

The results for the third epoch are:
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (0 0 -2)
(1 1 1)     -2    -1     1       (1 1 1)           (1 1 -1)
(1 0 1)      0     0    -1       (-1 0 -1)         (0 1 -2)
(0 1 1)     -1    -1    -1       (0 0 0)           (0 1 -2)
(0 0 1)     -2    -1    -1       (0 0 0)           (0 1 -2)
59 The results for the fourth epoch are:
(1 1 1)     -1    -1     1       (1 1 1)           (1 2 -1)
(1 0 1)      0     0    -1       (-1 0 -1)         (0 2 -2)
(0 1 1)      0     0    -1       (0 -1 -1)         (0 1 -3)
(0 0 1)     -3    -1    -1       (0 0 0)           (0 1 -3)

For the fifth epoch, we have:
(1 1 1)     -2    -1     1       (1 1 1)           (1 2 -2)
(1 0 1)     -1    -1    -1       (0 0 0)           (1 2 -2)
(0 1 1)      0     0    -1       (0 -1 -1)         (1 1 -3)
(0 0 1)     -3    -1    -1       (0 0 0)           (1 1 -3)

And for the sixth epoch:
(1 1 1)     -1    -1     1       (1 1 1)           (2 2 -2)
(1 0 1)      0     0    -1       (-1 0 -1)         (1 2 -3)
(0 1 1)     -1    -1    -1       (0 0 0)           (1 2 -3)
(0 0 1)     -3    -1    -1       (0 0 0)           (1 2 -3)
60 The results for the seventh epoch are:
(1 1 1)      0     0     1       (1 1 1)           (2 3 -2)
(1 0 1)      0     0    -1       (-1 0 -1)         (1 3 -3)
(0 1 1)      0     0    -1       (0 -1 -1)         (1 2 -4)
(0 0 1)     -4    -1    -1       (0 0 0)           (1 2 -4)

The eighth epoch yields:
(1 1 1)     -1    -1     1       (1 1 1)           (2 3 -3)
(1 0 1)     -1    -1    -1       (0 0 0)           (2 3 -3)
(0 1 1)      0     0    -1       (0 -1 -1)         (2 2 -4)
(0 0 1)     -4    -1    -1       (0 0 0)           (2 2 -4)

And the ninth:
(1 1 1)      0     0     1       (1 1 1)           (3 3 -3)
(1 0 1)      0     0    -1       (-1 0 -1)         (2 3 -4)
(0 1 1)     -1    -1    -1       (0 0 0)           (2 3 -4)
(0 0 1)     -4    -1    -1       (0 0 0)           (2 3 -4)
61 Finally, the results for the tenth epoch are:
(1 1 1)      1     1     1       (0 0 0)           (2 3 -4)
(1 0 1)     -2    -1    -1       (0 0 0)           (2 3 -4)
(0 1 1)     -1    -1    -1       (0 0 0)           (2 3 -4)
(0 0 1)     -4    -1    -1       (0 0 0)           (2 3 -4)

- The positive response is given by 2 x1 + 3 x2 - 4 > 0.2, with boundary line x2 = -(2/3) x1 + 7/5.
- The negative response is given by 2 x1 + 3 x2 - 4 < -0.2, with boundary line x2 = -(2/3) x1 + 19/15.
62 A 2nd Perceptron Example
Input       Net   Out   Target   Weight Changes   Weights
(x1 x2 1)                                          (w1 w2 b)
                                                   (0 0 0)
(1 1 1)      0     0     1       (1 1 1)           (1 1 1)
(1 -1 1)     1     1    -1       (-1 1 -1)         (0 2 0)
(-1 1 1)     2     1    -1       (1 -1 -1)         (1 1 -1)
(-1 -1 1)   -3    -1    -1       (0 0 0)           (1 1 -1)
63 In the second epoch of training, we have:
(1 1 1)      1     1     1       (0 0 0)           (1 1 -1)
(1 -1 1)    -1    -1    -1       (0 0 0)           (1 1 -1)
(-1 1 1)    -1    -1    -1       (0 0 0)           (1 1 -1)
(-1 -1 1)   -3    -1    -1       (0 0 0)           (1 1 -1)

Since all the Δw's are 0 in epoch 2, the system was fully trained after the first epoch.
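A compact check of this bipolar example (a sketch assuming θ = 0.2 and learning rate 1, as above):

```python
def f(s, theta=0.2):
    return 1 if s > theta else (-1 if s < -theta else 0)

data = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w1 = w2 = b = 0
for epoch in range(2):
    for (x1, x2), t in data:
        y = f(w1 * x1 + w2 * x2 + b)
        if y != t:                              # update only on error
            w1, w2, b = w1 + t * x1, w2 + t * x2, b + t
print(w1, w2, b)  # 1 1 -1 after the first epoch; unchanged by the second
```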
64 Limitations of Perceptrons
- The Perceptron finds a straight line that separates the classes.
- It cannot learn the exclusive-OR (XOR) function: such patterns are not linearly separable.
- Little work followed after Minsky and Papert published their book in 1969.
- Rumelhart and McClelland produced an improvement in 1986: they proposed modern adaptations of the Perceptron, called the multilayer Perceptron.
65 The Multilayer Perceptron
- Overcoming linear inseparability:
  - Use more perceptrons, each set up to identify small, linearly separable sections of the inputs.
  - Combine their outputs into another perceptron.
- Each neuron still takes the weighted sum of its inputs, thresholds it, and outputs 1 or 0.
- But how can such a network learn?
66 The Multilayer Perceptron (2)
- Perceptrons in the 2nd layer do not know which of the real inputs were on or not.
- The 2-state output (on or off) gives no indication of how much to adjust the weights:
  - Some weighted inputs turn a neuron on decisively.
  - Some weighted inputs only just turn a neuron on and should not be altered to the same extent.
- What changes would produce a better solution next time? Which of the input weights should be increased and which should not?
- We have no way of finding out (the credit assignment problem).
67 The Solution
- We need a non-binary thresholding function.
- Use a slightly different non-linearity, so that the unit more or less turns on or off.
- A possible new thresholding function is the sigmoid function.
- A sigmoid thresholding function does not mask the inputs from the outputs.
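For reference, the usual choice (and the one matched by the arithmetic in the worked example on slides 76-78) is the logistic sigmoid, whose derivative takes the convenient form used there:

```latex
f(x) = \frac{1}{1 + e^{-x}}, \qquad f'(x) = f(x)\,\bigl(1 - f(x)\bigr)
```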
68 The Multi-layer Perceptron
- An input layer, an output layer, and a hidden layer.
- Each unit in the hidden and output layers is like a perceptron unit, but its thresholding function is the sigmoid.
- Units in the input layer serve only to distribute the values they receive to the next layer; they do not perform a weighted sum or threshold.
69 The Backpropagation Rule
- The single-layer perceptron model is changed:
  - The thresholding function changes from a step to a sigmoid function.
  - A hidden layer is added.
- The learning rule needs to be altered.
- The new learning rule for the multilayer perceptron is called the generalized delta rule, or the backpropagation rule:
  - Show the NN a pattern and calculate its response.
  - Compare with the desired response.
  - Alter the weights so that the NN produces a more accurate output next time.
- The learning rule provides the method for adjusting the weights so as to decrease the error next time.
70 Backpropagation Details
- Define an error function to represent the difference between the NN's current output and the correct output.
- The backpropagation rule aims to reduce the error by:
  - Calculating the value of the error for a particular input.
  - Back-propagating the error from one layer to the previous one.
- Each unit in the net has its weights adjusted so as to reduce the value of the error function:
  - For units at the output, the output and desired output are known, so adjusting the weights is relatively simple.
  - For units in the middle, those connected to outputs with a large error should have their weights adjusted a lot, while those that feed almost correct outputs should not be altered much.
71 The Detailed Algorithm
- Step 0. Initialize the weights (set to small random values).
- Step 1. While the stopping condition is false, do Steps 2-9.
- Step 2. For each training pair, do Steps 3-8.
- Feedforward:
  - Step 3. Each input unit (Xi, i = 1, …, n) receives input signal xi and broadcasts this signal to all units in the layer above (the hidden units).
  - Step 4. Each hidden unit (Zj, j = 1, …, p) sums its weighted input signals, z_inj = v0j + Σi xi vij, applies its activation function to compute its output signal, zj = f(z_inj), and sends this signal to all units in the layer above (the output units).
  - Step 5. Each output unit (Yk, k = 1, …, m) sums its weighted input signals, y_ink = w0k + Σj zj wjk, and applies its activation function to compute its output signal, yk = f(y_ink).
72 The Detailed Algorithm (2)
- Backpropagation of error:
  - Step 6. Each output unit (Yk, k = 1, …, m) receives a target pattern corresponding to the input training pattern and computes its error information term, δk = (tk - yk) f'(y_ink). It calculates its weight correction term (used to update wjk later), Δwjk = α δk zj, calculates its bias correction term (used to update w0k later), Δw0k = α δk, and sends δk to the units in the layer below.
  - Step 7. Each hidden unit (Zj, j = 1, …, p) sums its delta inputs (from the units in the layer above), δ_inj = Σk δk wjk, and multiplies by the derivative of its activation function to calculate its error information term, δj = δ_inj f'(z_inj). It then calculates its weight correction term (used to update vij later), Δvij = α δj xi, and its bias correction term (used to update v0j later), Δv0j = α δj.
73 The Detailed Algorithm (3)
- Update weights and biases:
  - Step 8. Each output unit (Yk, k = 1, …, m) updates its bias and weights (j = 0, …, p): wjk(new) = wjk(old) + Δwjk. Each hidden unit (Zj, j = 1, …, p) updates its bias and weights (i = 0, …, n): vij(new) = vij(old) + Δvij.
- Step 9. Test the stopping condition.
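A compact Python sketch of Steps 0-9 for a single hidden layer, in the notation above (v for input-to-hidden weights, w for hidden-to-output weights, α for the learning rate). The layer sizes, α and the fixed epoch count are illustrative assumptions, and the stopping test of Step 9 is replaced by a fixed number of epochs:

```python
import math
import random

def f(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def train_backprop(data, n, p, m, alpha=0.5, epochs=1000):
    """data: list of (x, t) pairs; n, p, m: input, hidden and output layer sizes."""
    # Step 0: small random weights; row 0 of each matrix holds the bias.
    v = [[random.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(n + 1)]
    w = [[random.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(p + 1)]
    for _ in range(epochs):                                            # Step 1
        for x, t in data:                                              # Step 2
            # Steps 3-5: feedforward
            z_in = [v[0][j] + sum(x[i] * v[i + 1][j] for i in range(n)) for j in range(p)]
            z = [f(s) for s in z_in]
            y_in = [w[0][k] + sum(z[j] * w[j + 1][k] for j in range(p)) for k in range(m)]
            y = [f(s) for s in y_in]
            # Step 6: output error terms, delta_k = (t_k - y_k) f'(y_in_k)
            d_out = [(t[k] - y[k]) * y[k] * (1.0 - y[k]) for k in range(m)]
            # Step 7: hidden error terms, delta_j = (sum_k delta_k w_jk) f'(z_in_j)
            d_hid = [sum(d_out[k] * w[j + 1][k] for k in range(m)) * z[j] * (1.0 - z[j])
                     for j in range(p)]
            # Step 8: update hidden-to-output weights and biases ...
            for k in range(m):
                w[0][k] += alpha * d_out[k]
                for j in range(p):
                    w[j + 1][k] += alpha * d_out[k] * z[j]
            # ... and input-to-hidden weights and biases
            for j in range(p):
                v[0][j] += alpha * d_hid[j]
                for i in range(n):
                    v[i + 1][j] += alpha * d_hid[j] * x[i]
    return v, w
```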
74 An Example: Multilayer Perceptron Network with Backpropagation Training
(Network diagram: input nodes 1-3 labelled HSI_Rise, Vol_High and DJIA_Drop; hidden nodes 4-6; output nodes 7-8.)
75 Initial Weights and Bias Values
- wij: weight between nodes i and j.
- θi: bias value of node i.
- For node 4: w14 = 0.2, w24 = 0.4, w34 = -0.5, θ4 = -0.4
- For node 5: w15 = -0.3, w25 = 0.1, w35 = 0.2, θ5 = 0.2
- For node 6: w16 = 0.6, w26 = 0.7, w36 = -0.1, θ6 = 0.1
- For node 7: w47 = -0.3, w57 = -0.2, w67 = 0.1, θ7 = 0.6
- For node 8: w48 = -0.5, w58 = 0.1, w68 = -0.3, θ8 = 0.3
76 Training (1)
- Learning rate = 0.9
- Input = <1, 0, 1>
- Target output = <1, 0>
- For node 4:
  - Input = 0.2 × 1 + 0.4 × 0 + (-0.5) × 1 + (-0.4) = -0.7
  - Output = 1 / (1 + e^0.7) = 0.332
- For node 5:
  - Input = (-0.3) × 1 + 0.1 × 0 + 0.2 × 1 + 0.2 = 0.1
  - Output = 1 / (1 + e^-0.1) = 0.525
- For node 6:
  - Input = 0.6 × 1 + 0.7 × 0 + (-0.1) × 1 + 0.1 = 0.6
  - Output = 1 / (1 + e^-0.6) = 0.646
- For node 7:
  - Input = 0.332 × (-0.3) + 0.525 × (-0.2) + 0.646 × 0.1 + 0.6 = 0.460
  - Output = 1 / (1 + e^-0.460) = 0.613
- For node 8:
  - Input = 0.332 × (-0.5) + 0.525 × 0.1 + 0.646 × (-0.3) + 0.3 = -0.007
  - Output = 1 / (1 + e^0.007) = 0.498
77 Training (2)
- For node 7:
  - Error = 0.613 × (1 - 0.613) × (1 - 0.613) = 0.092
- For node 8:
  - Error = 0.498 × (1 - 0.498) × (0 - 0.498) = -0.125
- For node 4:
  - Error = 0.332 × (1 - 0.332) × (0.092 × (-0.3) + (-0.125) × (-0.5)) = 0.008
- For node 5:
  - Error = 0.525 × (1 - 0.525) × (0.092 × (-0.2) + (-0.125) × 0.1) = -0.008
- For node 6:
  - Error = 0.646 × (1 - 0.646) × (0.092 × 0.1 + (-0.125) × (-0.3)) = 0.011
78 Training (3)
- For each weight:
  - w14 = 0.2 + 0.9 × 0.008 × 0.332 ≈ 0.202
  - w15 = -0.3 + 0.9 × (-0.008) × 0.525 ≈ -0.304
  - …
- For each bias:
  - θ4 = -0.4 + 0.9 × 0.008 ≈ -0.393
  - θ5 = 0.2 + 0.9 × (-0.008) ≈ 0.193
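A quick numerical check of the forward pass and the output-layer error terms above (a sketch; it assumes the logistic sigmoid and the signs of the initial weights as reconstructed on slide 75):

```python
import math

sig = lambda s: 1.0 / (1.0 + math.exp(-s))

o4 = sig(0.2 * 1 + 0.4 * 0 + (-0.5) * 1 + (-0.4))        # 0.332
o5 = sig((-0.3) * 1 + 0.1 * 0 + 0.2 * 1 + 0.2)           # 0.525
o6 = sig(0.6 * 1 + 0.7 * 0 + (-0.1) * 1 + 0.1)           # 0.646
o7 = sig(o4 * (-0.3) + o5 * (-0.2) + o6 * 0.1 + 0.6)     # 0.613
o8 = sig(o4 * (-0.5) + o5 * 0.1 + o6 * (-0.3) + 0.3)     # 0.498

err7 = o7 * (1 - o7) * (1 - o7)   # target 1 ->  0.092
err8 = o8 * (1 - o8) * (0 - o8)   # target 0 -> -0.125
print(round(o7, 3), round(o8, 3), round(err7, 3), round(err8, 3))
```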
79 Using ANN for Data Mining
- Constructing a network
  - input data representation
  - selection of the number of layers and the number of nodes in each layer
- Training the network using training data
- Pruning the network
- Interpreting the results
80 Step 1: Constructing the Network
Multi-layer perceptron (MLP), feed-forward, backpropagation.
(Network diagram: input nodes x1 = # of Terms, x2 = GPA, x3 = Demographics, x4 = Courses, x5 = Fin Aid feed a hidden layer through weights w1, …, w5n; output nodes o1 = Persist, o2 = Not-persist.)
81 Constructing the Network (2)
- The number of input nodes corresponds to the dimensionality of the input tuples.
- Thermometer coding, e.g. age 20-80 split into 6 intervals (see the sketch below):
  - [20, 30) → 000001, [30, 40) → 000011, …, [70, 80) → 111111
- The number of hidden nodes is adjusted during training.
- The number of output nodes equals the number of classes.
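A small Python sketch of this thermometer coding (the function name and the interval arithmetic are illustrative assumptions):

```python
def thermometer_age(age, low=20, high=80, bits=6):
    """Encode age in [low, high) as a thermometer code: one extra bit per 10-year interval."""
    width = (high - low) // bits                        # 10-year intervals
    on = min(bits, max(0, (age - low) // width + 1))    # [20,30) -> 1 bit, ..., [70,80) -> 6 bits
    return "0" * (bits - on) + "1" * on

print(thermometer_age(25))   # 000001
print(thermometer_age(38))   # 000011
print(thermometer_age(79))   # 111111
```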
82 Step 2: Network Training
- The ultimate objective of training: obtain a set of weights that makes almost all the tuples in the training data classified correctly.
- Steps:
  - Initial weights are set randomly.
  - Input tuples are fed into the network one by one.
  - Activation values for the hidden nodes are computed.
  - The output vector is computed once the activation values of all hidden nodes are available.
  - Weights are adjusted using the error (desired output - actual output).
83 Step 3: Network Pruning
- A fully connected network is hard to articulate:
  - n input nodes, h hidden nodes and m output nodes lead to h(m + n) links (weights).
- Pruning: remove some of the links without affecting the classification accuracy of the network.
84 Step 4: Extracting Rules from ANN
- Discretize the activation values: replace each individual activation value by its cluster average while maintaining the network's accuracy.
- Enumerate the outputs from the discretized activation values to find rules between activation values and outputs.
- Find the relationship between the inputs and the activation values.
- Combine the above two to obtain rules relating the outputs to the inputs.
85 An Example (I)
- IBM synthetic data
  - nine attributes (age, salary, …)
- classification function:
  - if ((age < 40) ∧ (50K ≤ salary ≤ 100K)) ∨ ((40 ≤ age < 60) ∧ (75K ≤ salary ≤ 125K)) ∨ ((age > 60) ∧ (25K ≤ salary ≤ 75K)) then class A else class B
- initial network
  - 87 input nodes, 2 output nodes, 4 hidden nodes
  - trained network using 1000 tuples
- pruned network
  - 7 input nodes, 3 hidden nodes, 2 output nodes
  - 17 links
  - accuracy: 96.30%
86 An Example (II)
- Hidden node value discretization:
  - a1: (-1, 0, 1)
  - a2: (0, 1)
  - a3: (-1, 0.24, 1)
- Enumerate the output from the discretized activation values:
  - a2 = 0, a3 = -1
  - a1 = -1, a2 = 1, a3 = -1
  - a1 = -1, a2 = 0, a3 = -0.24
  - ⇒ C1 = 1, C2 = 0
  - otherwise C1 = 0, C2 = 1
87 An Example (III)
- From input to hidden node:
  - I2 = I17 = 0 ⇒ a2 = 0
  - I5 = I15 = 1 ⇒ a3 = -1
  - I13 = 0 ⇒ a3 = -1
  - …
- Obtain rules relating input and output:
  - I2 = I17 = 0, I5 = I15 = 1 ⇒ class 1
  - I2 = I17 = 0, I13 = 0 ⇒ class 1
- Transform to the original input attributes:
  - I17 = 0 ⇒ age < 40; I2 = 0 ⇒ salary < 100K
88 ANN vs. Others for Data Mining
- Advantages
  - prediction accuracy is generally high
  - robust: works when training examples contain errors
  - output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  - fast evaluation of the learned target function
- Criticism
  - long training time
  - the learned function (the weights) is difficult to understand
  - not easy to incorporate domain knowledge