
Feed-Forward Neural Networks


Content

- Introduction
- Single-Layer Perceptron Networks
- Learning Rules for Single-Layer Perceptron Networks
- Perceptron Learning Rule
- Adaline Learning Rule
- δ-Learning Rule
- Multilayer Perceptron
- Back-Propagation Learning Algorithm

Feed-Forward Neural Networks

- Introduction

Historical Background

- 1943: McCulloch and Pitts proposed the first computational model of the neuron.
- 1949: Hebb proposed the first learning rule.
- 1958: Rosenblatt's work on perceptrons.
- 1969: Minsky and Papert exposed the limitations of the theory.
- 1970s: A decade of dormancy for neural networks.
- 1980s-90s: Neural networks return (self-organization, back-propagation algorithms, etc.).

Nervous Systems

- The human brain contains about 10^11 neurons.
- Each neuron is connected to about 10^4 others.
- Some scientists have compared the brain with a complex, nonlinear, parallel computer.
- The largest modern neural networks achieve complexity comparable to the nervous system of a fly.

Neurons

- The main purpose of neurons is to receive, analyze, and transmit information in the form of signals (electric pulses).
- When a neuron sends information, we say that the neuron fires.

Neurons

Acting through specialized projections known as dendrites and axons, neurons carry information throughout the neural network.

This animation demonstrates the firing of a synapse between the pre-synaptic terminal of one neuron and the soma (cell body) of another neuron.

A Model of Artificial Neuron


Feed-Forward Neural Networks

- Graph representation:
- nodes: neurons
- arrows: signal-flow directions
- A neural network that does not contain cycles (feedback loops) is called a feed-forward network (or perceptron).

Layered Structure

Hidden Layer(s)

Knowledge and Memory

- The output behavior of a network is determined by the weights.
- Weights are the memory of an NN.
- Knowledge is distributed across the network.
- A large number of nodes:
- increases the storage capacity
- ensures that the knowledge is robust
- provides fault tolerance
- New information is stored by changing weights.

Pattern Classification

- The network implements a mapping from input patterns x to output patterns y.
- The NN's output is used to distinguish between and recognize different input patterns.
- Different output patterns correspond to particular classes of input patterns.
- Networks with hidden layers can solve more complex problems than just linear pattern classification.

Training

Training Set

Goal

Generalization

- A properly trained neural network can produce reasonable outputs for input patterns not seen during training (generalization).
- Generalization is particularly useful for the analysis of noisy data (e.g. time series).


Applications

- Pattern classification
- Object recognition
- Function approximation
- Data compression
- Time series analysis and forecast
- . . .

Feed-Forward Neural Networks

- Single-Layer Perceptron Networks

The Single-Layered Perceptron

Training a Single-Layered Perceptron

Training Set

Goal

Learning Rules

- Linear Threshold Units (LTUs): Perceptron Learning Rule
- Linearly Graded Units (LGUs): Widrow-Hoff Learning Rule

Training Set

Goal

Feed-Forward Neural Networks

- Learning Rules for Single-Layered Perceptron Networks
- Perceptron Learning Rule
- Adaline Learning Rule
- δ-Learning Rule

Perceptron

Linear Threshold Unit

sgn

Perceptron

Goal

Linear Threshold Unit

sgn

Example

Goal

Class 1

g(x) = 2x1 + 2x2 − 2 = 0

Class 2

Augmented input vector

Goal

Class 1 (1)

Class 2 (−1)

Augmented input vector

Goal

Augmented input vector

Goal

The decision plane passes through the origin of the augmented input space.

Linearly Separable vs. Linearly Non-Separable

AND

OR

XOR

Linearly Separable

Linearly Separable

Linearly Non-Separable
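The separability claims above can be checked directly in code; this is an illustrative sketch (the weights, biases, and search grid are hand-picked choices, not taken from the slides):

```python
# A single linear threshold unit (LTU) realizes AND and OR with suitable
# weights, but no weight choice realizes XOR. Weights here are hand-picked.
import itertools

def ltu(w, b, x):
    """Linear threshold unit: fires (1) iff w.x + b >= 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

def AND(x):
    return ltu([1, 1], -1.5, x)   # fires only for (1, 1)

def OR(x):
    return ltu([1, 1], -0.5, x)   # fires whenever some input is 1

# Brute-force search over a weight grid: no LTU computes XOR,
# because XOR is linearly non-separable.
grid = [i / 2 for i in range(-4, 5)]  # -2.0 .. 2.0 in steps of 0.5
xor_found = any(
    all(ltu([w1, w2], b, x) == (x[0] ^ x[1])
        for x in [(0, 0), (0, 1), (1, 0), (1, 1)])
    for w1, w2, b in itertools.product(grid, repeat=3))
print(xor_found)  # False
```

The grid search is of course only suggestive; the geometric argument is that no single line can separate the XOR corners.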

Goal

- Given training sets T1 ⊂ C1 and T2 ⊂ C2 with elements of the form x = (x1, x2, ..., xm−1, xm)^T, where x1, x2, ..., xm−1 ∈ R and xm = −1.
- Assume T1 and T2 are linearly separable.
- Find w = (w1, w2, ..., wm)^T such that sgn(w^T x) = +1 for x ∈ T1 and sgn(w^T x) = −1 for x ∈ T2.

Goal

w^T x = 0 is a hyperplane passing through the origin of the augmented input space.

- Given training sets T1 ⊂ C1 and T2 ⊂ C2 with elements of the form x = (x1, x2, ..., xm−1, xm)^T, where x1, x2, ..., xm−1 ∈ R and xm = −1.
- Assume T1 and T2 are linearly separable.
- Find w = (w1, w2, ..., wm)^T such that sgn(w^T x) = +1 for x ∈ T1 and sgn(w^T x) = −1 for x ∈ T2.

Observation

Which w's correctly classify x?

What trick can be used?

Observation

Is this w ok?

w1x1 + w2x2 = 0

Observation

w1x1 + w2x2 = 0

Is this w ok?

Observation

w1x1 + w2x2 = 0

Is this w ok?

How to adjust w?

Δw = ?

Observation

Is this w ok?

How to adjust w?

Δw = −ηx

reasonable?

> 0

< 0

Observation

Is this w ok?

reasonable?

How to adjust w?

Δw = ηx

> 0

< 0

Observation

Is this w ok?

Δw = ηx or −ηx

Perceptron Learning Rule

Upon misclassification on

Define error

Perceptron Learning Rule

Define error

Perceptron Learning Rule

Summary: Perceptron Learning Rule

Based on the general weight learning rule.

correct

incorrect

Summary: Perceptron Learning Rule

Converge?

Perceptron Convergence Theorem

- Exercise: consult papers or textbooks to prove the theorem.

If the given training set is linearly separable, the learning process will converge in a finite number of steps.
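The rule can be sketched end to end; in this illustrative example the training set (the AND function), learning rate, and stopping test are choices of this sketch, not taken from the slides:

```python
# Sketch of the perceptron learning rule on a linearly separable training
# set (the AND function). Inputs are augmented with a fixed last component
# of -1 so the threshold is learned as the last weight; eta is the learning
# rate. Weights change only on misclassification: w <- w + eta*(d - y)/2 * x.

def sgn(v):
    return 1 if v >= 0 else -1

# Augmented samples (x1, x2, -1) with desired outputs d in {+1, -1} (AND).
T = [((0, 0, -1), -1), ((0, 1, -1), -1), ((1, 0, -1), -1), ((1, 1, -1), 1)]

w = [0.0, 0.0, 0.0]
eta = 0.5
for epoch in range(100):
    errors = 0
    for x, d in T:
        y = sgn(sum(wi * xi for wi, xi in zip(w, x)))
        if y != d:   # update only on misclassification
            w = [wi + eta * (d - y) / 2 * xi for wi, xi in zip(w, x)]
            errors += 1
    if errors == 0:  # converged, in a finite number of steps
        break

print(w)
```

On a linearly separable set the loop always exits early, in line with the convergence theorem; on a non-separable set such as XOR it would cycle forever.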

The Learning Scenario

Linearly Separable.

The Learning Scenario

(Animation: the weight vector is adjusted step by step, w1, w2, w3, w4, until it correctly separates the training data.)

The demonstration is in augmented space.

Conceptually, in augmented space, we adjust the weight vector to fit the data.

Weight Space

A weight in the shaded area will give correct classification for the positive example.

Weight Space

A weight in the shaded area will give correct classification for the positive example.

Δw = ηx

Weight Space

A weight not in the shaded area will give correct classification for the negative example.

Weight Space

A weight not in the shaded area will give correct classification for the negative example.

Δw = −ηx

The Learning Scenario in Weight Space

To correctly classify the training set, the weight must move into the shaded area.

(Animation: starting from w0, successive updates w1, w2, ..., w11 move the weight vector into the feasible region.)

Conceptually, in weight space, we move the weight into the feasible region.

Feed-Forward Neural Networks

- Learning Rules for Single-Layered Perceptron Networks
- Perceptron Learning Rule
- Adaline Learning Rule
- δ-Learning Rule

Adaline (Adaptive Linear Element)

Widrow 1962

Adaline (Adaptive Linear Element)

Under what conditions is the goal reachable?

Goal

Widrow 1962

LMS (Least Mean Square)

Minimize the cost function (error function)

Gradient Descent Algorithm

Our goal is to go downhill.

Contour Map

Δw

(w1, w2)

Gradient Descent Algorithm

Our goal is to go downhill.

How do we find the steepest descent direction?

Contour Map

Δw

(w1, w2)

Gradient Operator

Let f(w) = f(w1, w2, ..., wm) be a function over R^m.

Define

Gradient Operator

df > 0: go uphill
df = 0: plain (flat)
df < 0: go downhill

The Steepest Descent Direction

To minimize f, we choose Δw = −η ∇f.

df > 0: go uphill
df = 0: plain (flat)
df < 0: go downhill
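The downhill idea can be sketched on a simple quadratic bowl; the cost function f(w1, w2) = (w1 − 1)² + (w2 + 2)², the starting point, and the step size are all illustrative choices:

```python
# Steepest descent: repeatedly step against the gradient, dw = -eta * grad f.

def grad_f(w):
    # gradient of f(w1, w2) = (w1 - 1)^2 + (w2 + 2)^2, minimized at (1, -2)
    return [2 * (w[0] - 1), 2 * (w[1] + 2)]

w = [4.0, 3.0]   # arbitrary starting point
eta = 0.1
for _ in range(200):
    w = [wi - eta * gi for wi, gi in zip(w, grad_f(w))]  # move downhill

print(w)  # approaches the minimizer (1, -2)
```

Each step shrinks the distance to the minimizer by the factor (1 − 2η), so convergence here is geometric.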

LMS (Least Mean Square)

Minimize the cost function (error function)


Adaline Learning Rule

Minimize the cost function (error function)

Learning Modes

- Batch Learning Mode
- Incremental Learning Mode
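The incremental mode can be sketched for a linear unit; the noiseless linear target, learning rate, and data below are illustrative choices:

```python
# Incremental (per-sample) LMS / Adaline updates: the learning signal uses
# the linear output y = w.x, and each sample moves the weights by
# dw = eta * (d - y) * x. Batch mode would instead accumulate these updates
# over the whole training set before changing w.
import random

random.seed(0)
data = []
for _ in range(50):
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    data.append(((x1, x2), 2 * x1 - 3 * x2))   # noiseless linear target

w = [0.0, 0.0]
eta = 0.1
for epoch in range(100):
    for (x1, x2), d in data:
        y = w[0] * x1 + w[1] * x2   # linear output (no threshold in learning)
        err = d - y
        w[0] += eta * err * x1
        w[1] += eta * err * x2

print(w)  # converges toward the true weights [2, -3]
```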

Summary: Adaline Learning Rule

δ-Learning Rule = LMS Algorithm = Widrow-Hoff Learning Rule

Converge?

LMS Convergence

- Based on the independence theory (Widrow, 1976):
- The successive input vectors are statistically independent.
- At time t, the input vector x(t) is statistically independent of all previous samples of the desired response, namely d(1), d(2), ..., d(t−1).
- At time t, the desired response d(t) depends on x(t), but is statistically independent of all previous values of the desired response.
- The input vector x(t) and desired response d(t) are drawn from Gaussian-distributed populations.

LMS Convergence

It can be shown that LMS converges if

0 < η < 2/λmax

where λmax is the largest eigenvalue of the correlation matrix Rx of the inputs.

LMS Convergence

Since λmax is hardly available in practice, we commonly use

0 < η < 2/tr(Rx)

It can be shown that LMS converges if 0 < η < 2/λmax, where λmax is the largest eigenvalue of the correlation matrix Rx of the inputs.
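The relation between the two bounds can be checked numerically; the Gaussian input data below is an illustrative choice:

```python
# Numerical check of the LMS step-size bound 0 < eta < 2/lambda_max, where
# lambda_max is the largest eigenvalue of the input correlation matrix
# R_x = E[x x^T]. Because tr(R_x) >= lambda_max for this positive
# semidefinite matrix, 0 < eta < 2/tr(R_x) is a safe, easier-to-compute
# substitute. The Gaussian inputs are an illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # 1000 input vectors in R^3
R = (X.T @ X) / len(X)                # sample correlation matrix R_x

lam_max = np.linalg.eigvalsh(R).max()
print(2 / lam_max, 2 / np.trace(R))   # the trace bound is the tighter one
```

The trace is cheap to maintain online (it is the mean input power), which is why it is the common practical choice.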

Comparisons

- Fundamental: the perceptron rule rests on the Hebbian assumption; the Adaline (LMS) rule on gradient descent.
- Convergence: the perceptron rule converges in finitely many steps; LMS converges asymptotically.
- Constraint: the perceptron rule requires linear separability; LMS requires linear independence.

Feed-Forward Neural Networks

- Learning Rules for Single-Layered Perceptron Networks
- Perceptron Learning Rule
- Adaline Learning Rule
- δ-Learning Rule

Adaline

Unipolar Sigmoid

Bipolar Sigmoid
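The two activation functions named above can be written out directly; this sketch also records the derivative identities (y(1 − y) for the unipolar case, (1 − y²)/2 for the bipolar case) that the learning rules below reuse:

```python
# The unipolar and bipolar sigmoid activation functions.
# Identities: if y = unipolar(x), dy/dx = y*(1 - y);
#             if y = bipolar(x),  dy/dx = (1 - y**2)/2.
import math

def unipolar(x):
    """Unipolar sigmoid, range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bipolar(x):
    """Bipolar sigmoid, range (-1, 1); equals 2*unipolar(x) - 1."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

print(unipolar(0.0), bipolar(0.0))  # 0.5 0.0
```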

Goal

Minimize

Gradient Decent Algorithm

Minimize

The Gradient

Minimize

Depends on the activation function used.

Weight Modification Rule

Minimize

Batch

Learning Rule

Incremental

The Learning Efficacy

Minimize

Sigmoid

Unipolar

Bipolar

Adaline

Exercise

Learning Rule: Unipolar Sigmoid

Minimize

Comparisons

- Adaline: batch and incremental update rules
- Sigmoid: batch and incremental update rules

The Learning Efficacy

- Sigmoid: learning efficacy depends on the output
- Adaline: learning efficacy is constant

The Learning Efficacy

- Sigmoid: learning efficacy depends on the output
- Adaline: learning efficacy is constant

The learning efficacy of Adaline is constant, meaning that Adaline never saturates.

The Learning Efficacy

- Sigmoid: learning efficacy depends on the output
- Adaline: learning efficacy is constant

The sigmoid saturates when its output value nears either extreme.

Initialization for Sigmoid Neurons

Why?

Before training, the weights must be sufficiently small: large initial weights drive the sigmoids into their saturated regions, where the gradient, and hence learning, nearly vanishes.

Feed-Forward Neural Networks

- Multilayer Perceptron

Multilayer Perceptron

Output Layer

Hidden Layer

Input Layer

Multilayer Perceptron

Where does the knowledge come from?

Classification

Output

Analysis

Learning

Input

How Does an MLP Work?

Example

- XOR is not linearly separable.
- Is a single-layer perceptron workable?

XOR

How Does an MLP Work?

Example

(Figure: the XOR input patterns 00, 01, 11 plotted in the input plane, with the hidden-layer decision boundaries built up step by step.)
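A two-layer network with hand-picked threshold units solves the XOR example; the particular decomposition (OR and NAND hidden units feeding an AND output) is one illustrative choice:

```python
# A two-layer network of linear threshold units solving XOR: hidden units
# compute OR and NAND, and the output unit ANDs them, using
# x1 XOR x2 = (x1 OR x2) AND NOT (x1 AND x2). Weights are hand-picked.

def ltu(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

def xor_mlp(x):
    h1 = ltu([1, 1], -0.5, x)            # hidden unit 1: OR
    h2 = ltu([-1, -1], 1.5, x)           # hidden unit 2: NAND
    return ltu([1, 1], -1.5, (h1, h2))   # output unit: AND of h1, h2

print([xor_mlp(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

The hidden layer carves the input plane with two lines, and the output unit selects the band between them, exactly the geometric picture of the figure above.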

How Does an MLP Work?

Example

Parity Problem

Is the problem linearly separable?

Parity Problem

(Figure: the 3-bit input cube with corners labeled 000, 001, 011, 111, partitioned by planes P1, P2, P3 along the x1, x2, x3 axes; a further unit P4 combines the resulting regions.)

General Problem


Hyperspace Partition

Region Encoding

(Figure: each region of the partitioned space is labeled with a binary code: 000, 001, 010, 100, 101, 110, 111.)

Hyperspace Partition: Region Encoding Layer

Region Identification Layer


Classification

(Figure: the encoded regions are mapped to class outputs 0, −1, 1.)

Feed-Forward Neural Networks

- Back-Propagation Learning Algorithm

Activation Function: Sigmoid

Remember this
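The fact to remember here is presumably the sigmoid's derivative identity, which the back-propagation derivation reuses at every layer; for the unipolar sigmoid:

```latex
y = \varphi(v) = \frac{1}{1 + e^{-v}}
\quad\Longrightarrow\quad
\frac{dy}{dv} = \frac{e^{-v}}{(1 + e^{-v})^{2}} = y\,(1 - y)
```

The derivative is thus available for free from the forward-pass output y, with no extra exponentials.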

Supervised Learning

Training Set

Output Layer

Hidden Layer

Input Layer

Supervised Learning

Training Set

Sum of Squared Errors

Goal

Minimize

Back Propagation Learning Algorithm

- Learning on Output Neurons
- Learning on Hidden Neurons

Learning on Output Neurons


Learning on Output Neurons

depends on the activation function

Learning on Output Neurons

Using sigmoid,

Learning on Output Neurons

Using sigmoid,

Learning on Output Neurons

Learning on Output Neurons

How do we train the weights connecting to the output neurons?

Learning on Hidden Neurons


Learning on Hidden Neurons

Learning on Hidden Neurons


Learning on Hidden Neurons

Learning on Hidden Neurons

Back Propagation

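The whole algorithm can be sketched end to end. This minimal implementation uses one hidden layer of four sigmoid units with bias inputs, the sum-of-squared-errors cost, and batch gradient updates on the XOR data; the layer sizes, learning rate, and iteration count are illustrative choices. It typically reaches outputs near [0, 1, 1, 0], though plain gradient descent can occasionally stall in a poor minimum:

```python
# Minimal back-propagation sketch: forward pass, output-layer deltas,
# deltas propagated back through the hidden layer, gradient-descent updates.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[0.], [1.], [1.], [0.]])
Xa = np.hstack([X, np.ones((4, 1))])      # augment inputs with a bias of 1

W1 = rng.normal(scale=0.5, size=(3, 4))   # input -> hidden (small init)
W2 = rng.normal(scale=0.5, size=(5, 1))   # hidden (+bias) -> output

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(W1, W2):
    H = sigmoid(Xa @ W1)
    Ha = np.hstack([H, np.ones((4, 1))])  # hidden outputs with bias
    return H, Ha, sigmoid(Ha @ W2)

eta = 0.5
_, _, Y0 = forward(W1, W2)
sse_initial = float(((D - Y0) ** 2).sum())

for _ in range(20000):
    H, Ha, Y = forward(W1, W2)
    delta_out = (D - Y) * Y * (1 - Y)                 # output-layer delta
    delta_hid = (delta_out @ W2[:4].T) * H * (1 - H)  # back-propagated delta
    W2 += eta * Ha.T @ delta_out                      # gradient-descent steps
    W1 += eta * Xa.T @ delta_hid

_, _, Y = forward(W1, W2)
sse_final = float(((D - Y) ** 2).sum())
print(sse_initial, sse_final, np.round(Y).ravel())
```

Note how each delta is the product of an error signal and the sigmoid derivative y(1 − y) available from the forward pass.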

Learning Factors

- Initial Weights
- Learning Constant (η)
- Cost Functions
- Momentum
- Update Rules
- Training Data and Generalization
- Number of Layers
- Number of Hidden Nodes

Reading Assignments

- Shi Zhong and Vladimir Cherkassky, "Factors Controlling Generalization Ability of MLP Networks." In Proc. IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 625-630, Washington DC, July 1999. (http://www.cse.fau.edu/zhong/pubs.htm)
- Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. I, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, MIT Press, Cambridge, 1986. (http://www.cnbc.cmu.edu/plaut/85-419/papers/RumelhartETAL86.backprop.pdf)