Neural Networks

WHY ARTIFICIAL NEURAL NETWORKS?

- Characteristics of the human brain that are not present in von Neumann or modern parallel computers include:
- massive parallelism,
- distributed representation and computation,
- learning ability,
- generalization ability,
- adaptivity,
- inherent contextual information processing,
- fault tolerance, and
- low energy consumption.
- It is hoped that devices based on biological neural networks will possess some of these desirable characteristics.


ANNs

- Inspired by biological neural networks, ANNs are massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections.
- ANN models attempt to use some organizational principles believed to be used in the human brain.

Brief historical review

- ANN research has experienced three periods of extensive activity.
- The first peak, in the 1940s, was due to the work of McCulloch and Pitts.
- The second occurred in the 1960s with Rosenblatt's perceptron convergence theorem and Minsky and Papert's work showing the limitations of a simple perceptron. Minsky and Papert's results dampened the enthusiasm of most researchers, and the lull lasted almost 20 years.
- Since the early 1980s, ANNs have received considerable renewed interest. The major developments include:
- Hopfield's energy approach in 1982, and
- the back-propagation learning algorithm for multilayer perceptrons (multilayer feed-forward networks), first proposed by Werbos and then popularized by Rumelhart et al. in 1986.

Biological neural networks

- A neuron (or nerve cell) is a special biological cell that processes information. It is composed of a cell body, or soma, and two types of out-reaching tree-like branches: the axon and the dendrites.

Biological neural networks (cont.)

- A neuron receives signals (impulses) from other neurons through its dendrites (receivers) and transmits signals generated by its cell body along the axon (transmitter), which eventually branches into strands and substrands.
- At the terminals of these strands are the synapses.
- A synapse is an elementary structure and functional unit between two neurons (an axon strand of one neuron and a dendrite of another).

Biological neural networks (cont.)

- The human brain contains about $10^{11}$ neurons, which is approximately the number of stars in the Milky Way.
- Neurons are massively connected, much more complex and dense than telephone networks.
- Each neuron is connected to $10^{3}$ to $10^{4}$ other neurons.
- In total, the human brain contains approximately $10^{14}$ to $10^{15}$ interconnections.

Biological neural networks (cont.)

- Complex perceptual decisions such as face recognition are typically made by humans within a few hundred milliseconds.
- These decisions are made by a network of neurons whose operational speed is only a few milliseconds. This implies that:
- the computations cannot take more than about 100 serial stages;
- the brain runs parallel programs that are about 100 steps long for such perceptual tasks. This is known as the hundred-step rule.

Computational models of neurons

- This mathematical neuron computes a weighted sum of its $n$ input signals $x_j$, $j = 1, 2, \ldots, n$.
- It generates an output of 1 if this sum is greater than a certain threshold $U$. Otherwise, an output of 0 results.
- Mathematically, $y = \theta\left(\sum_{j=1}^{n} w_j x_j - U\right)$, where
- $\theta(\cdot)$ is the unit step function, and
- $w_j$ is the synapse weight associated with the $j$th input.
- For simplicity of notation, we often consider the threshold $U$ as another weight $w_0 = -U$ attached to the neuron with a constant input $x_0 = 1$.
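
The threshold neuron described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `step`, `mp_neuron`, and `mp_neuron_bias` and the example weights are assumptions, not part of the original slides. The second function shows the notational trick of folding the threshold in as an extra weight with a constant input of 1.

```python
def step(v):
    """Unit step function: 1 if v > 0, otherwise 0."""
    return 1 if v > 0 else 0


def mp_neuron(x, w, threshold):
    """Weighted sum of the inputs compared against an explicit threshold."""
    v = sum(wj * xj for wj, xj in zip(w, x)) - threshold
    return step(v)


def mp_neuron_bias(x, w, threshold):
    """Same neuron, with the threshold folded in as w0 = -threshold, x0 = 1."""
    w_ext = [-threshold] + list(w)
    x_ext = [1] + list(x)
    return step(sum(wj * xj for wj, xj in zip(w_ext, x_ext)))


# Illustrative weights: with w = (1, 1) and threshold 1.5 the neuron acts as AND.
outputs = [mp_neuron(x, [1, 1], 1.5) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Both formulations give the same output for every input, which is why the threshold is often treated as just another weight.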

Activation Functions

The Sigmoid

- The standard sigmoid function is the logistic function, defined by $g(x) = \dfrac{1}{1 + e^{-\beta x}}$, where $\beta$ is the slope parameter.
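
A short sketch of the logistic function may help here, assuming the usual form $g(x) = 1/(1 + e^{-\beta x})$ with $\beta$ as the slope parameter. The names `sigmoid` and `sigmoid_derivative` are illustrative choices, not from the slides.

```python
import math


def sigmoid(x, beta=1.0):
    """Logistic function 1 / (1 + exp(-beta * x)); beta is the slope parameter."""
    return 1.0 / (1.0 + math.exp(-beta * x))


def sigmoid_derivative(x, beta=1.0):
    """Derivative of the logistic function: beta * g(x) * (1 - g(x))."""
    gx = sigmoid(x, beta)
    return beta * gx * (1.0 - gx)


# A larger beta makes the transition around x = 0 steeper.
values = [(beta, sigmoid(1.0, beta)) for beta in (0.5, 1.0, 5.0)]
```

The derivative can be computed from the function value alone, which is one reason this activation is convenient for gradient-based learning.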

Network architectures

- ANNs can be viewed as weighted directed graphs in which artificial neurons are nodes and directed edges (with weights) are connections between neuron outputs and neuron inputs.
- Based on the connection pattern, ANNs can be grouped into two categories:
- feed-forward networks, in which graphs have no loops, and
- recurrent (or feedback) networks, in which loops occur because of feedback connections.

Network architectures

Different connectivities yield different network behaviors.

Network architectures

- Feed-forward networks are
- static, that is, they produce only one set of output values rather than a sequence of values from a given input, and
- memory-less, in the sense that their response to an input is independent of the previous network state.
- Recurrent, or feedback, networks are dynamic systems.
- When a new input pattern is presented, the neuron outputs are computed. Because of the feedback paths, the inputs to each neuron are then modified, which leads the network to enter a new state.
- Different network architectures require appropriate learning algorithms.

Learning

- A learning process in the ANN context can be viewed as the problem of updating network architecture and connection weights so that a network can efficiently perform a specific task.
- The network usually must learn the connection weights from available training patterns.
- Performance is improved over time by iteratively updating the weights in the network.

Learning

- ANNs' ability to automatically learn from examples makes them attractive and exciting.
- ANNs appear to learn underlying rules (like input-output relationships) from the given collection of representative examples.
- This is one of the major advantages of neural networks over traditional expert systems.

Learning algorithm

- To understand or design a learning process, you must have:
- A learning paradigm: a model of the environment in which a neural network operates, i.e., you must know what information is available to the network.
- Learning rules: you must understand how network weights are updated, i.e., which learning rules govern the updating process.
- A learning algorithm refers to a procedure in which learning rules are used for adjusting the weights.

Learning paradigms

- Supervised learning: The network is provided with a correct answer (output) for every input pattern (learning with a teacher).
- Weights are determined to allow the network to produce answers as close as possible to the known correct answers.
- Reinforcement learning is a variant of supervised learning in which the network is provided with only a critique on the correctness of network outputs, not the correct answers themselves.
- Unsupervised learning: The network explores the underlying structure in the data, or correlations between patterns in the data, and organizes patterns into categories from these correlations (learning without a teacher).
- Hybrid learning: Part of the weights are usually determined through supervised learning, while the others are obtained through unsupervised learning (combines supervised and unsupervised learning).

Learning theory

- Learning theory must address three fundamental and practical issues associated with learning from samples: capacity, sample complexity, and computational complexity.
- Capacity: how many patterns can be stored, and what functions and decision boundaries a network can form.
- Sample complexity determines the number of training patterns needed to train the network to guarantee a valid generalization.
- Too few patterns may cause over-fitting (wherein the network performs well on the training data set, but poorly on independent test patterns drawn from the same distribution as the training patterns).
- Computational complexity refers to the time required for a learning algorithm to estimate a solution from training patterns.
- Many existing learning algorithms have high computational complexity.

Learning rules

- There are four basic types of learning rules: error-correction, Boltzmann, Hebbian, and competitive learning.
- ERROR-CORRECTION RULES: During the learning process, the actual output $y$ generated by the network may not equal the desired output $d$.
- The basic principle of error-correction learning rules is to use the error signal $(d - y)$ to modify the connection weights to gradually reduce this error.
- The perceptron learning rule is based on this error-correction principle.
- A perceptron consists of a single neuron with adjustable weights $w_j$, $j = 1, 2, \ldots, n$, and threshold $U$ (threshold function).

ERROR-CORRECTION RULES

- Given an input vector $x = (x_1, x_2, \ldots, x_n)^t$, the net input to the neuron is $v = \sum_{j=1}^{n} w_j x_j - U$.
- The output $y$ of the perceptron is 1 if $v > 0$, and 0 otherwise.
- In a two-class classification problem, the perceptron assigns an input pattern to one class if $y = 1$, and to the other class if $y = 0$.
- The linear equation $\sum_{j=1}^{n} w_j x_j - U = 0$ defines the decision boundary that halves the space.

Perceptron learning algorithm

- 1. Randomly initialize the weights $w_1, w_2, \ldots, w_n$ and the threshold.
- 2. Present an input vector $x = (x_1, x_2, \ldots, x_n)^t$ and evaluate the output of the neuron.
- 3. Update the weights according to $w_j(t+1) = w_j(t) + \eta\,(d - y)\,x_j$,
- where $d$ is the desired output, $t$ is the iteration number, and $\eta$ is the gain (step size), with $0.0 < \eta < 1.0$.
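
The three steps above can be sketched as a small training loop. This is a minimal illustration under stated assumptions: the threshold is folded into a bias weight `w[0]` with a constant input of 1, the AND function is used as a linearly separable toy data set, and the names `predict` and `train_perceptron`, the initial weight range, and the fixed random seed are all choices made for this example rather than anything given in the slides.

```python
import random


def predict(w, x):
    """Perceptron output: 1 if the net input is positive, otherwise 0."""
    # w[0] is the bias weight (w0 = -threshold) with constant input x0 = 1.
    v = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if v > 0 else 0


def train_perceptron(samples, eta=0.1, max_epochs=1000, seed=0):
    """Apply the update w_j <- w_j + eta * (d - y) * x_j until no errors remain."""
    rng = random.Random(seed)
    n = len(samples[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n + 1)]
    for _ in range(max_epochs):
        errors = 0
        for x, d in samples:
            y = predict(w, x)
            if y != d:  # learning occurs only when the perceptron makes an error
                errors += 1
                w[0] += eta * (d - y)
                for j, xj in enumerate(x, start=1):
                    w[j] += eta * (d - y) * xj
        if errors == 0:
            break
    return w


# Toy data set (assumed for illustration): the Boolean AND function.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w_and = train_perceptron(AND)
```

Because AND is linearly separable, the loop stops after a finite number of passes with all four patterns classified correctly.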

Perceptron learning algorithm

- Note that learning occurs only when the perceptron makes an error.
- The perceptron convergence theorem: Rosenblatt proved that when training patterns are drawn from two linearly separable classes, the perceptron learning procedure converges after a finite number of iterations.
- In practice, you do not know whether the patterns are linearly separable.
- Many variations of this learning algorithm have been proposed in the literature.
- Other activation functions that lead to different learning characteristics can also be used.
- The back-propagation learning algorithm is based on the error-correction principle.

Perceptrons and Boolean Functions

- If inputs are all 0s and 1s and outputs are all 0s and 1s, a perceptron:
- Can learn the function $x_1 \wedge x_2$ (AND).
- Can learn the function $x_1 \vee x_2$ (OR).

Perceptrons and Boolean Functions

- What about the exclusive-or function?
- $f(x_1, x_2) = x_1 \oplus x_2 = (x_1 \wedge \neg x_2) \vee (\neg x_1 \wedge x_2)$

XOR problem

- Desired: make an ANN which will produce Y = X1 xor X2 on inputs X1 and X2.
- Problem: there is no single line that can cut the X1-X2 space into two proper regions. Therefore, a single-layer neural net cannot be used.
- Solution: use a multilayer network.
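
To make the multilayer solution concrete, here is a minimal sketch of a two-layer network of threshold units that computes XOR. The weights are hand-chosen for illustration (they are an assumption, not learned and not taken from the slides): one hidden unit computes OR, the other computes AND, and the output unit fires when OR is on and AND is off.

```python
def step(v):
    """Unit step function: 1 if v > 0, otherwise 0."""
    return 1 if v > 0 else 0


def xor_net(x1, x2):
    """Two hidden threshold units feeding one output threshold unit."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # fires if OR is on and AND is off


truth_table = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Each hidden unit draws one line in the X1-X2 plane, and the output unit combines the two half-planes, which a single unit cannot do.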

HEBBIAN RULE

- The oldest learning rule is Hebb's postulate of learning. Hebb based it on the following observation from neurobiological experiments:
- If neurons on both sides of a synapse are activated synchronously and repeatedly, the synapse's strength is selectively increased.
- Mathematically, the Hebbian rule can be described as $w_{ij}(t+1) = w_{ij}(t) + \eta\, y_j(t)\, x_i(t)$,
- where $x_i$ and $y_j$ are the output values of neurons $i$ and $j$, respectively, which are connected by the synapse $w_{ij}$, and $\eta$ is the learning rate. Note that $x_i$ is the input to the synapse.

HEBBIAN RULE

- An important property of this rule is that learning is done locally, i.e., the change in synapse weight depends only on the activities of the two neurons connected by it.
- This significantly simplifies the complexity of the learning circuit in a VLSI implementation.

HEBBIAN RULE

- A single neuron trained using the Hebbian rule exhibits an orientation selectivity.
- The points depicted are drawn from a two-dimensional Gaussian distribution and used for training a neuron.
- The weight vector of the neuron is initialized to $w_0$.
- As the learning proceeds, the weight vector moves progressively closer to the direction $w$ of maximal variance in the data.
- $w$ is the eigenvector of the covariance matrix of the data corresponding to the largest eigenvalue.
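
The orientation selectivity described on this slide can be illustrated with a small simulation. This is a sketch under assumptions not in the original: the weight vector is renormalized to unit length after every Hebbian update so that it stays bounded, the data are generated with most of their variance along the (1, 1) diagonal, and the function name `hebbian_direction`, the learning rate, and the random seed are illustrative.

```python
import math
import random


def hebbian_direction(samples, eta=0.01, w0=(1.0, 0.0)):
    """Train one linear neuron with the Hebbian rule, renormalizing each step."""
    w = list(w0)
    for x in samples:
        y = w[0] * x[0] + w[1] * x[1]                     # neuron output
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]   # Hebbian update
        norm = math.hypot(w[0], w[1])
        w = [wi / norm for wi in w]                       # keep |w| = 1 (assumption)
    return w


# Two-dimensional Gaussian data: large spread along (1, 1), small along (-1, 1).
rng = random.Random(0)
samples = []
for _ in range(2000):
    u = rng.gauss(0.0, 3.0)
    v = rng.gauss(0.0, 0.5)
    samples.append(((u - v) / math.sqrt(2), (u + v) / math.sqrt(2)))

w = hebbian_direction(samples)
# Cosine of the angle between w and the direction of maximal variance.
alignment = abs(w[0] + w[1]) / math.sqrt(2)
```

Starting from `w0 = (1, 0)`, the weight vector should rotate toward the diagonal, so `alignment` ends up close to 1.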

BOLTZMANN LEARNING

- The Boltzmann machine was named by its inventors in honour of the 19th-century scientist Ludwig Boltzmann.
- Boltzmann machines are symmetric recurrent networks consisting of binary units (+1 for on and -1 for off).
- Symmetric means that the weight on the connection from unit $i$ to unit $j$ is equal to the weight on the connection from unit $j$ to unit $i$.
- A subset of the neurons, called visible, interact with the environment; the rest, called hidden, do not.
- Each neuron is a stochastic unit that generates an output (or state) according to the Boltzmann distribution of statistical mechanics.

- Boltzmann machines operate in two modes:
- Clamped: visible neurons are clamped onto specific states determined by the environment; and
- Free-running: both visible and hidden neurons are allowed to operate freely. The hidden neurons always operate freely.
- $K$ is the number of visible neurons.
- $L$ is the number of hidden neurons.

BOLTZMANN LEARNING

- Boltzmann learning is a stochastic learning rule derived from information-theoretic and thermodynamic principles.
- The objective of Boltzmann learning is to adjust the connection weights so that the states of visible units satisfy a particular desired probability distribution.
- According to the Boltzmann learning rule, the change in the connection weight $w_{ij}$ is given by $\Delta w_{ij} = \eta\,(\bar{\rho}_{ij} - \rho_{ij})$,
- where $\eta$ is the learning rate, and $\bar{\rho}_{ij}$ and $\rho_{ij}$ are the correlations between the states of units $i$ and $j$ when the network operates in the clamped mode and free-running mode, respectively.
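
The weight update itself is simple once the two sets of correlations are available. The sketch below covers only that step: it assumes the clamped and free-running states have already been sampled (the stochastic sampling and annealing are omitted), estimates each correlation as the average product of two unit states, and applies the difference scaled by a learning rate. The function names `correlations` and `boltzmann_update` and the tiny state lists are hypothetical.

```python
def correlations(states):
    """Average of s_i * s_j over a list of +1/-1 state vectors."""
    n = len(states[0])
    return [[sum(s[i] * s[j] for s in states) / len(states) for j in range(n)]
            for i in range(n)]


def boltzmann_update(w, clamped_states, free_states, eta=0.1):
    """Return new weights w + eta * (clamped correlation - free correlation)."""
    rho_clamped = correlations(clamped_states)
    rho_free = correlations(free_states)
    n = len(w)
    # The diagonal is kept at 0 because units have no self-feedback.
    return [[0.0 if i == j
             else w[i][j] + eta * (rho_clamped[i][j] - rho_free[i][j])
             for j in range(n)] for i in range(n)]


# Hypothetical samples for a two-unit network.
clamped = [[1, 1], [1, 1]]     # units always agree when clamped
free = [[1, -1], [1, 1]]       # units agree only half the time when free
w_new = boltzmann_update([[0.0, 0.0], [0.0, 0.0]], clamped, free)
```

Because the correlations are symmetric in $i$ and $j$, the update keeps the weight matrix symmetric.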

Summary of the Boltzmann Machine Learning Procedure

- 1. Initialization: set the weights to random numbers in $[-1, 1]$.
- 2. Clamping phase: present the net with the mapping it is supposed to learn by clamping input and output units to patterns. For each pattern, perform simulated annealing on the hidden units at a sequence $T_0, T_1, \ldots, T_{final}$ of temperatures. At the final temperature, collect statistics to estimate the clamped-mode correlations between unit states.

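Summary of the Boltzmann Machine Learning Procedure

- 3. Free-running phase: repeat the calculations performed in step 2, but this time clamp only the input units. Hence, at the final temperature, estimate the free-running correlations between unit states.
- 4. Updating of weights: update them using the learning rule $\Delta w_{ji} = \eta\,(\bar{\rho}_{ji} - \rho_{ji})$, where $\bar{\rho}_{ji}$ and $\rho_{ji}$ are the clamped and free-running correlations and $\eta$ is a learning rate parameter.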

Summary of the Boltzmann Machine Learning Procedure

- 5. Iterate until convergence: iterate steps 2 to 4 until the learning procedure converges, with no more changes taking place in the synaptic weights $w_{ji}$ for all $j$, $i$.


Alternative Boltzmann Architecture

- Alternatively, the visible units may be viewed as divided into input and output units.
- In this case the Boltzmann machine performs association under the supervision of a teacher, with the input units receiving information from the environment, and the output units reporting the outcome for that input pattern.

Boltzmann vs Hopfield

- Similarities:
- 1. Processing units have binary states ($\pm 1$).
- 2. Connections between units are symmetric.
- 3. Units are picked at random and one at a time for updating.
- 4. Units have no self-feedback.
- Differences:
- 1. The Boltzmann machine permits the use of hidden neurons.
- 2. The Boltzmann machine uses stochastic neurons with a probabilistic firing mechanism, whereas the standard Hopfield net uses neurons based on the McCulloch-Pitts model with a deterministic firing mechanism.
- 3. The Boltzmann machine may also be trained by a probabilistic form of supervision.

COMPETITIVE LEARNING RULES

- In competitive learning, output units compete among themselves for activation. As a result, only one output unit is active at any given time. This phenomenon is known as winner-take-all.
- Competitive learning has been found to exist in biological neural networks.
- Competitive learning often clusters or categorizes the input data. Similar patterns are grouped by the network and represented by a single unit. This grouping is done automatically based on data correlations.

COMPETITIVE LEARNING RULES

- The simplest competitive learning network consists of a single layer of output units.
- Each output unit $i$ in the network connects to all the input units ($x_j$'s) via weights $w_{ij}$, $j = 1, 2, \ldots, n$.
- Each output unit also connects to all other output units via inhibitory weights but has a self-feedback with an excitatory weight.

COMPETITIVE LEARNING RULES

- A simple competitive learning rule can be stated as $\Delta w_{ij} = \eta\,(x_j - w_{ij})$ if unit $i$ is the winner, and $\Delta w_{ij} = 0$ otherwise.
- Note that only the weights of the winner unit get updated.
- The effect of this learning rule is to move the stored pattern in the winner unit (weights) a little bit closer to the input pattern.
- Assume that all input vectors have been normalized to have unit length.
- The weight vectors of the three units are randomly initialized. Their initial and final positions on the sphere after competitive learning are marked as Xs.
- Each of the three natural groups (clusters) of patterns has been discovered by an output unit whose weight vector points to the center of gravity of the discovered group.
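
A single winner-take-all update can be sketched as follows. The assumptions are labeled here because they are not in the slides: the winner is taken to be the unit with the largest inner product between its weight vector and the (unit-length) input, the learning rate of 0.5 is chosen only to make the arithmetic easy to follow, and the name `competitive_step` is illustrative.

```python
def competitive_step(weights, x, eta=0.5):
    """Update only the winning unit, moving its weights toward the input x."""
    # Winner: the unit whose weight vector has the largest inner product with x.
    nets = [sum(wj * xj for wj, xj in zip(w, x)) for w in weights]
    winner = nets.index(max(nets))
    weights[winner] = [wj + eta * (xj - wj) for wj, xj in zip(weights[winner], x)]
    return winner


# Two output units and one unit-length input pattern.
weights = [[1.0, 0.0], [0.0, 1.0]]
winner = competitive_step(weights, [0.6, 0.8])
```

Here the second unit wins, and its weight vector moves halfway from (0, 1) toward the input (0.6, 0.8), while the first unit is left untouched.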

COMPETITIVE LEARNING RULES

- You can see from the competitive learning rule that the network will not stop learning (updating weights) unless the learning rate $\eta$ is 0.
- A particular input pattern can fire different output units at different iterations during learning.
- The system is said to be stable if no pattern in the training data changes its category after a finite number of learning iterations.
- One way to achieve stability is to force the learning rate to decrease gradually towards 0 as the learning process proceeds. However, this artificial freezing of learning causes another problem concerning plasticity, which is the ability to adapt to new data. This is known as Grossberg's stability-plasticity dilemma in competitive learning.

COMPETITIVE LEARNING RULES

- The most well-known example of competitive learning is vector quantization for data compression.
- It has been widely used in speech and image processing for efficient storage, transmission, and modeling.
- Its goal is to represent a set or distribution of input vectors with a relatively small number of prototype vectors (weight vectors), or a codebook. Once a codebook has been constructed and agreed upon by both the transmitter and the receiver, you need only transmit or store the index of the prototype corresponding to the input vector.
- Given an input vector, its corresponding prototype can be found by searching for the nearest prototype in the codebook.
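
The encode and decode steps of vector quantization can be sketched as below. The codebook here is a hypothetical three-prototype example, Euclidean distance is assumed as the nearness measure, and the names `encode` and `decode` are illustrative.

```python
def encode(codebook, x):
    """Return the index of the prototype nearest to x (squared Euclidean distance)."""
    def sq_dist(p):
        return sum((pi - xi) ** 2 for pi, xi in zip(p, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def decode(codebook, index):
    """Recover the prototype that stands in for the original vector."""
    return codebook[index]


# Hypothetical codebook shared by transmitter and receiver.
codebook = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
inputs = [(1.0, 2.0), (9.0, 8.0), (1.0, 9.0)]
indices = [encode(codebook, x) for x in inputs]
reconstructed = [decode(codebook, i) for i in indices]
```

Only the small integers in `indices` need to be stored or transmitted; the receiver looks them up in its own copy of the codebook.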

Well known learning algorithms


SUMMARY

- Learning rules based on error-correction can be used for training feed-forward networks.
- Hebbian learning rules have been used for all types of network architectures.
- Each learning algorithm is designed for training a specific architecture.
- When we discuss a learning algorithm, a particular network architecture association is implied.
- Each algorithm can perform only a few tasks well.
- Other algorithms include Adaline, Madaline, linear discriminant analysis, Sammon's projection, and principal component analysis.

Multilayer Networks

- The class of functions representable by perceptrons is limited.
- The output of a multilayer network is a nonlinear function of a linear combination of nonlinear functions of linear combinations of inputs.

A 1-HIDDEN LAYER NET

(Figure: NINPUTS = 2, NHIDDEN = 3. Inputs x1, x2; input-to-hidden weights w11, w12, w21, w22, w31, w32; hidden-to-output weights w1, w2, w3.)

OTHER NEURAL NETS

Multilayer perceptron

- The most popular class of multilayer feed-forward networks is multilayer perceptrons.
- Each computational unit employs either the thresholding function or the sigmoid function.
- Multilayer perceptrons can form arbitrarily complex decision boundaries and represent any Boolean function.
- The development of the back-propagation learning algorithm for determining weights in a multilayer perceptron has made these networks the most popular among researchers and users of neural networks.

Multilayer perceptron

- We denote $w_{ij}^{(l)}$ as the weight on the connection from the $i$th unit in layer $(l-1)$ to the $j$th unit in layer $l$.
- Let $(x^{(1)}, d^{(1)}), (x^{(2)}, d^{(2)}), \ldots, (x^{(p)}, d^{(p)})$ be a set of $p$ training patterns (input-output pairs),
- where $x^{(i)} \in R^n$ is the input vector in the $n$-dimensional pattern space, and
- $d^{(i)} \in [0, 1]^m$, an $m$-dimensional hypercube.
- For classification purposes, $m$ is the number of classes. The squared-error cost function most frequently used in the ANN literature is defined as $E = \frac{1}{2} \sum_{i=1}^{p} \| y^{(i)} - d^{(i)} \|^2$, where $y^{(i)}$ is the network output for input $x^{(i)}$.

Back-propagation

- The back-propagation algorithm is a gradient-descent method to minimize the squared-error cost function $E$.

GRADIENT DESCENT

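- Suppose we have a scalar function $f(w)$ of a single weight $w$.
- We want to find a local minimum.
- Assume our current weight is $w$.
- GRADIENT DESCENT RULE: $w \leftarrow w - \eta\, \dfrac{\partial f}{\partial w}(w)$
- $\eta$ is called the LEARNING RATE: a small positive number, e.g. $\eta = 0.05$.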
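
A one-dimensional gradient descent loop can be sketched as follows. The quadratic $f(w) = (w - 3)^2$, the starting point, the step count, and the name `gradient_descent` are assumptions chosen for illustration; the learning rate 0.05 matches the example value on the slide.

```python
def gradient_descent(grad, w, eta=0.05, steps=200):
    """Repeatedly move w a small step against the gradient."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


# Illustrative function f(w) = (w - 3)^2, whose derivative is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
```

For this function each step shrinks the distance to the minimum at $w = 3$ by a constant factor of 0.9, so 200 steps bring `w_min` very close to 3.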

Gradient Descent in m Dimensions

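- Given $f(w): R^m \to R$, the gradient $\nabla f(w) = \left(\dfrac{\partial f}{\partial w_1}, \ldots, \dfrac{\partial f}{\partial w_m}\right)^t$ points in the direction of steepest ascent.
- GRADIENT DESCENT RULE: $w \leftarrow w - \eta\, \nabla f(w)$
- Equivalently, $w_j \leftarrow w_j - \eta\, \dfrac{\partial f}{\partial w_j}$, where $w_j$ is the $j$th weight ("just like a linear feedback system").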

A RULE KNOWN BY MANY NAMES

- The Widrow-Hoff rule
- The LMS rule
- The delta rule
- The Adaline rule
- Classical conditioning

Back-propagation algorithm

- 1. Initialize the weights to small random values.
- 2. Randomly choose an input pattern $x^{(\mu)}$.
- 3. Propagate the signal forward through the network.
- 4. Compute $\delta_i^L$ in the output layer ($o_i = y_i^L$): $\delta_i^L = g'(h_i^L)\,[d_i^{\mu} - y_i^L]$, where $h_i^l$ represents the net input to the $i$th unit in the $l$th layer, and $g'$ is the derivative of the activation function $g$.
- 5. Compute the deltas for the preceding layers by propagating the errors backwards: $\delta_i^l = g'(h_i^l) \sum_j w_{ij}^{l+1}\, \delta_j^{l+1}$, for $l = (L-1), \ldots, 1$.
- 6. Update weights using $\Delta w_{ji}^l = \eta\, \delta_i^l\, y_j^{l-1}$.
- 7. Go to step 2 and repeat for the next pattern until the error in the output layer is acceptably low, or a prespecified number of iterations is reached.
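
One pass of steps 3 to 6 can be sketched for a small network with two inputs, two hidden units, and one output. This is an illustration under assumptions: the logistic activation is used, so $g'(h) = g(h)(1 - g(h))$; bias weights are stored at index 0 of each weight list; and the names `forward`, `backprop_step`, and `squared_error`, the initial weight range, and the random seed are choices made for this example.

```python
import math
import random


def g(h):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-h))


def forward(W1, W2, x):
    """Forward pass. W1[j] = [bias, w_j1, w_j2]; W2 = [bias, v_1, v_2]."""
    x_ext = [1.0] + list(x)
    hidden = [g(sum(w * xi for w, xi in zip(row, x_ext))) for row in W1]
    h_ext = [1.0] + hidden
    y = g(sum(w * hi for w, hi in zip(W2, h_ext)))
    return x_ext, h_ext, y


def backprop_step(W1, W2, x, d, eta=0.05):
    """Return updated copies of W1 and W2 after one training pattern (x, d)."""
    x_ext, h_ext, y = forward(W1, W2, x)
    # Output delta: g'(h) * (d - y), with g'(h) = y * (1 - y).
    delta_out = y * (1.0 - y) * (d - y)
    # Hidden deltas: g'(h_j) * (weight to output) * delta_out, using old weights.
    delta_hid = [h_ext[j + 1] * (1.0 - h_ext[j + 1]) * W2[j + 1] * delta_out
                 for j in range(len(W1))]
    # Weight change: eta * delta of the receiving unit * output of the sending unit.
    new_W2 = [w + eta * delta_out * hi for w, hi in zip(W2, h_ext)]
    new_W1 = [[w + eta * delta_hid[j] * xi for w, xi in zip(W1[j], x_ext)]
              for j in range(len(W1))]
    return new_W1, new_W2


def squared_error(W1, W2, x, d):
    """Squared-error cost for a single pattern."""
    return 0.5 * (forward(W1, W2, x)[2] - d) ** 2


# Small random initial weights (step 1), with a fixed seed for repeatability.
rng = random.Random(0)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
W2 = [rng.uniform(-0.5, 0.5) for _ in range(3)]
```

Calling `backprop_step` repeatedly over randomly chosen patterns corresponds to steps 2 to 7; each call changes the weights by minus the learning rate times the gradient of the single-pattern squared error.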

Backpropagation algorithm (instance-based)

- 1. Randomize the weights $w_s$ to small random values (both positive and negative) to ensure that the network is not saturated by large values of weights.
- 2. Select an instance $t$, that is, the vector $x_i(t)$, $i = 1, \ldots, N_{inp}$ (a pair of input and output patterns), from the training set.
- 3. Apply the network input vector to the network input.
- 4. Calculate the network output vector $z_k(t)$, $k = 1, \ldots, N_{out}$.
- 5. Calculate the errors for each of the outputs $k$, $k = 1, \ldots, N_{out}$: the difference between the desired output and the network output (for simplicity we will denote it as simply $E$).
- 6. Calculate the necessary updates for the weights, $\Delta w_s$, in a way that minimizes this error (discussed below).
- 7. Adjust the weights of the network by $\Delta w_s$.
- 8. Repeat steps 2 to 7 for each instance (pair of input-output vectors) in the training set until the error for the entire system (error $E$ defined above, or the error on a cross-validation set) is acceptably low, or the pre-defined number of iterations is reached.

Backpropagation algorithm

- Often it is reasonable not to update the weights immediately after processing each instance, but to accumulate (sum up) the necessary changes across a subset of training instances (called an epoch) and only then update the weights. This allows for faster convergence (Smith 1993).
- An epoch can be a part of the training set or the whole training set. After the whole training set is processed (this sequence of steps is called an iteration), the whole process is repeated in an iterative fashion until the total error is acceptably low.
- The number of such iterations may sometimes be as high as several thousand.

Backpropagation algorithm (epoch-based, with cumulative updates)

- 1-6. As above.
- 7. Add the calculated weight updates $\Delta w_s$ to the accumulated total updates $\Delta W_s$.
- 8. Repeat steps 2 to 7 for several instances comprising an epoch.
- 9. Adjust the weights $w_s$ of the network by the updates $\Delta W_s$.
- 10. Repeat steps 2 to 9 until all instances in the training set are processed. This constitutes one iteration.
- 11. Repeat the iteration of steps 2 to 10 until the error for the entire system (error $E$ defined above, or the error on a cross-validation set) is acceptably low, or the pre-defined number of iterations is reached.

Backpropagation

- In a single-layer network, each neuron adjusts its weights according to what output was expected of it and the output it gave. This can be mathematically expressed by the Perceptron Delta Rule, in which $w$ is the array of weights and $x$ is the array of inputs.

The Sigmoid (logistic) function

- One of the more popular activation functions used with back-propagation nets is the sigmoid (logistic) function.

The perceptron learning rule

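- $w_{new} = w_{old} + \eta\,(d_i - y_i)\,x$, where $w$ is the array of weights, $x$ is the array of inputs, and $\eta$ is defined as the learning rate.
- $y_i$ and $d_i$ are the actual and desired outputs, respectively.
- The deltas for the output layer are calculated as $\delta_i = y_i\,(1 - y_i)\,(d_i - y_i)$ for the logistic activation function.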

Calculate delta for the hidden layers

- We have to know the effect on the output of the neuron if a weight is to change.
- Therefore, we need to know the derivative of the error with respect to that weight.
- It has been proven that for neuron $q$ in hidden layer $p$, delta is $\delta_q^{p} = g'(h_q^{p}) \sum_j w_{qj}^{p+1}\, \delta_j^{p+1}$.
- Each delta value for a hidden layer requires that the delta values for the layer after it be calculated first.

Backpropagation example

(Figure: NINPUTS = 2, NHIDDEN = 2. Inputs x1, x2, with constant bias inputs of 1; layer-1 weights W1(0,1), W1(0,2), W1(1,1), W1(1,2), W1(2,1), W1(2,2); layer-2 weights W2(0,1), W2(1,1), W2(2,1).)

Back propagation algorithm

- 1. Initialize the weights to small random values, for layer 1 and for layer 2.
- 2. Randomly choose an input pattern $X^{(\mu)}$.
- 3. Propagate the signal forward through the network.
- Layer 1: the net input to unit $i$ is $\sum_{k=0}^{2} W_1(k,i)\, X(k)$.
- $Out(x) = g\left(\sum_{k=0}^{2} W_1(k,i)\, X(k)\right)$
- $X_2(i) = g\left(\sum_{k=0}^{2} W_1(k,i)\, X(k)\right)$
- $g(x) = 1/(1 + e^{-x})$

- 4. Compute $\delta_i^L$ in the output layer ($o_i = y_i^L$): $\delta_i^L = g'(h_i^L)\,[d_i^{\mu} - y_i^L]$, where $h_i^l$ represents the net input to the $i$th unit in the $l$th layer, and $g'$ is the derivative of the activation function $g$.
- For the example network: $\delta_3(1) = x_3(1)\,(1 - x_3(1))\,(d - x_3(1))$
- 5. Compute the deltas for the preceding layers by propagating the errors backwards: $\delta_i^l = g'(h_i^l) \sum_j w_{ij}^{l+1}\, \delta_j^{l+1}$, for $l = (L-1), \ldots, 1$.
- 6. Update weights using $\Delta w_{ji}^l = \eta\, \delta_i^l\, y_j^{l-1}$.
- Taking $\eta$ as 0.05: $\Delta W_2(0,1) = \eta\, x_1(0)\, \delta_2(1)$

- 7. Go to step 2 and repeat for the next pattern until the error in the output layer is below a prespecified threshold or a maximum number of iterations is reached.
- Run the entire process again on the next set of training data.
- Slowly, as the training data is fed in and the network is retrained a few thousand times, the network could balance out to certain values.

APPLICATIONS

- To successfully work with real-world problems, you must deal with numerous design issues, including network model, network size, activation function, learning parameters, and number of training samples.
- Pattern classification
- Clustering
- Function approximation
- Prediction
- Optimization
- Content addressable memory
- Control
