
An Introduction to Artificial Neural Networks

- Piotr Golabek, Ph.D.
- Radom Technical University
- Poland
- pgolab_at_pr.radom.net

An overview of the lecture

- What are ANNs? What are they for?
- Neural networks as inductive machines: the inductive reasoning tradition
- The evolution of the concept: keywords, structures, algorithms

An overview of the lecture

- Two general tasks: classification and approximation
- The above tasks in a more familiar setting: decision making, signal processing, control systems
- Live presentations

What are ANNs?

- Don't ask me...
- An ANN is a set of processing elements (PEs) influencing each other
- (that definition suits almost anything...)

What are ANNs

- ... but seriously...
- "neural": following a biological (neurophysiological) inspiration
- "artificial": don't forget these are not real neurons!
- "networks": strongly interconnected (in fact, massively parallel processing)
- and the implicit meaning: ANNs are learning machines, i.e. they adapt, just as biological neurons do

Machine learning

- An important field of AI
- "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"
- (Take a look at "Machine Learning" by Tom Mitchell)

What is ANN?

- In the case of ANNs, the experience is input data (examples)
- The ANN is an inductive learning machine, i.e. a machine constructing internal, generalized concepts based on the evidence brought by the data stream
- An ANN learns from examples: a paradigm shift

What is ANN

- Structurally, an ANN is a complex, interconnected structure composed of simple processing elements, often mimicking biological neurons
- Functionally, an ANN is an inductive learning machine: it is able to undergo an adaptation process (learning) driven by examples

What are ANNs used for?

- Recognition of images, OCR
- Recognition of time-signal signatures: vibration diagnostics, sonar signal interpretation, detection of intrusion patterns in various transaction systems
- Trend prediction, esp. in financial markets (bond rating prediction)
- Decision support, e.g. in credit assessment, medical diagnosis
- Industrial process control, e.g. the melting parameters in metallurgical processes
- Adaptive signal filtering to restore information from a corrupted source

Inductive process

- Concepts rooted in epistemology (episteme = knowledge)
- Heraclitus: "Nature likes to hide"
- Observations vs. the true nature of the phenomenon
- The empirical (experimental) method of developing the model (hypothesis) of the true phenomenon: the inductive process
- Something like this goes on during ANN learning

ANN as inductive learning machine

- The theory: the way the ANN behaves
- Experimental data: the examples the ANN learns from
- New examples cause the ANN to change its behaviour, in order to better fit the evidence brought by the examples

Inductive process

- Inductive bias: the initial theory (a priori knowledge)
- Variance: the evidence brought by the data
- A strong bias prevents the data from affecting the theory
- A weak bias makes the theory vulnerable to corruption in the data
- The game is to properly set the bias-variance balance

ANN as inductive learning machines

- We can shape the inductive bias of the learning process, e.g. by tuning the number of neurons
- The more neurons, the more flexible the network (and the more sensitive to the data)

Inductive vs deductive reasoning

- Reasoning: premises → conclusions
- Deductive reasoning: the conclusions are more specific than the premises (we just reason out the consequences)
- Inductive reasoning: the conclusions are more general than the premises (we reason out the general rules governing the phenomenon from the specific examples)

The main goal of inductive reasoning

- The main goal: to achieve good generalization, i.e. to reason out a rule general enough that it fits any future data
- This is also the main goal of machine learning: to use the experience to build good enough performance (in every possible future situation)

McCulloch-Pitts model

- Warren McCulloch
- Walter Pitts

"A Logical Calculus of the Ideas Immanent in Nervous Activity", 1943

McCulloch-Pitts model

- Logical calculus approach
- Elementary logical operations: AND, OR, NOT
- The basic reasoning operator: implication
- (given premise p, we draw conclusion q)

McCulloch-Pitts model

- Logical operators are functions
- Truth tables:

x y | x → y
0 0 | 1
0 1 | 1
1 0 | 0
1 1 | 1

x y | x AND y
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1

x y | x OR y
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 1

x | NOT x
0 | 1
1 | 0

McCulloch-Pitts model

- The working question: can a neuron perform the logical functions AND, OR, NOT?
- If the answer is yes, a chain of implications (reasoning) could be implemented in a neural network
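The answer is indeed yes, with hand-picked weights. A minimal Python sketch (not from the slides; all names are illustrative) of a McCulloch-Pitts threshold unit implementing AND, OR and NOT:

```python
# A sketch of a McCulloch-Pitts threshold unit: the neuron fires (outputs 1)
# when the weighted sum of its inputs reaches the activation threshold.

def mcp_neuron(inputs, weights, threshold):
    """Fire iff the total excitation reaches the threshold."""
    excitation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if excitation >= threshold else 0

# Hand-picked weights and thresholds implement the elementary operations:
def AND(x, y):
    return mcp_neuron([x, y], [1, 1], threshold=2)

def OR(x, y):
    return mcp_neuron([x, y], [1, 1], threshold=1)

def NOT(x):
    return mcp_neuron([x], [-1], threshold=0)
```

Checking the functions against the truth tables above confirms that a single threshold unit can realize each elementary operation.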

McCulloch-Pitts model

[Diagram of the neuron model: inputs, weights, summation, total excitation, activation threshold, activation function, neuron output (activation)]

McCulloch-Pitts transfer function

Implementation of AND, OR, NOT

- McCulloch-Pitts neuron

Including threshold into weights

McCulloch-Pitts model

- Neuron equations: total excitation z = w · x = Σᵢ wᵢxᵢ (a vector dot product), output y = f(z)

[Figure: dot products of input vector x and weight vector w: opposite vectors give max anti-similarity, orthogonal vectors max dissimilarity (orthogonality), parallel vectors max similarity]

Vector dot product interpretation

- The inputs are called the input vector
- The weights are called the weight vector
- The neuron excites when the input vector is similar enough to the weight vector
- The weight vector is a template for some set of input vectors

Neurons: the elements of ANNs

- Don't be fooled...
- These are our neurons...

Neurons: the elements of ANNs

Single neuron (stereoscopic)

Neurons: the elements of ANNs

- There is some analogy...

The real neuron

Synaptic connection: the organic structure

The real neuron

Synaptic connection: the molecular level

McCulloch-Pitts model

- The conclusion: if we tune the weights of the neuron properly, we can make it implement the transfer function we need (AND, OR, NOT)
- The question: how are the weights of neurons tuned in our brains? What is the adaptation mechanism?

Neuron adaptation

- Donald Hebb (1949, neurophysiologist): "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."

Hebb rule

Hebb rule

- It is a local rule of adaptation
- The multiplication of input and output signifies a correlation between them
- The rule is unstable: a weight can grow without limit
- (that doesn't happen in nature, where resources are limited)
- Numerous modifications of the Hebb rule have been proposed to make it stable
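The plain Hebb rule can be sketched as follows (a hypothetical illustration, assuming the standard form Δw = η·x·y with an arbitrary learning rate η). Iterating it on a constant input/output pair makes the instability visible: the weight only ever grows.

```python
# A sketch of the plain Hebb rule: the weight change is proportional to the
# product of the presynaptic input x and the postsynaptic output y.

def hebb_update(w, x, y, eta=0.1):
    """One Hebbian step: w + eta * x * y."""
    return w + eta * x * y

# Iterating the rule on a constant, positive input/output pair shows the
# instability mentioned above: the weight grows without any limit.
w = 0.5
history = []
for _ in range(5):
    w = hebb_update(w, x=1.0, y=1.0)
    history.append(w)
```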

Hebb rule

- The Hebb rule is very important and useful...
- ... but for now we want to make the neuron learn the function we need

Rosenblatt Perceptron

- Frank Rosenblatt (1958): the Perceptron, a hardware (electromechanical) implementation of an ANN (effectively 1 neuron).

Rosenblatt Perceptron

- One of the goals of the experiment was to train the neuron, i.e. to make it go active whenever a specific pattern appears on the retina
- The neuron was to be trained with examples
- The experimenter (teacher) was to expose the neuron to different patterns and in each case tell it whether it should fire or not
- The learning algorithm should do its best to make the neuron do what the teacher requires

Perceptron learning rule

- A kind of modified Hebbian rule (the weight correction depends on the error between the actual and the desired output)
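The rule can be sketched in Python (an illustrative version, assuming the common form Δwᵢ = η(d − y)xᵢ with a separate bias term):

```python
# A sketch of the perceptron learning rule: w_i += eta * (d - y) * x_i,
# where d is the desired output and y the actual output.

def predict(weights, bias, x):
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s >= 0 else 0

def train_perceptron(pairs, n_inputs, eta=0.1, epochs=100):
    """Train on (input, desired-output) training pairs."""
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for x, d in pairs:
            error = d - predict(weights, bias, x)   # the teacher's correction
            weights = [w + eta * error * xi for w, xi in zip(weights, x)]
            bias += eta * error
    return weights, bias

# The AND function is linearly separable, so the perceptron can learn it.
pairs = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(pairs, n_inputs=2)
```

After training, the neuron reproduces the teacher's answers on every training pair.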

Supervised scheme

Supervised scheme

- One training example, the pair <input value, desired output>, is called a training pair
- The set of all the training pairs is called the training set

Unsupervised scheme

Example of supervised learning

- Linear Associator

Neural networks

- A set of processing elements influencing each other
- The neurons (PEs) are interconnected. The output of each neuron can be connected to the input of every neuron, including itself

Neural networks

- If there is a path of propagation (direct or indirect) between the output of a neuron and its own input, we have feedback; such structures are called recurrent
- If there is no feedback in a network, the structure is called feedforward

What does recurrent mean?

- A recurrent (recursive) definition is a definition of a concept using the very same concept (but perhaps in a lower-complexity setup)
- A recurrent (recursive) function is a function calling itself
- The classical recursive definition: the factorial function
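The factorial example can be written out directly (a trivial Python sketch):

```python
# The classical recursive definition: the factorial function calls itself
# on a smaller argument until it reaches the base case.

def factorial(n):
    if n <= 1:                        # base case
        return 1
    return n * factorial(n - 1)       # the function calls itself
```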

Recurrent connection

- function calling itself

Recurrent connection

- At any given moment, the whole history of past excitations influences the neuron output
- The concept of temporal memory emerges
- The past influences the present to a degree determined by the weight of the recurrent connection
- This weight is effectively a forgetting factor

Feedforward layered network

Our brain

- There are ca. 10^11 neurons in our brain
- Each of them is connected on average to 1000 other neurons
- That is only about one connection per hundred million other neurons
- If every neuron were connected to every other, our brain would have to be a few hundred meters in diameter
- There is strong modularity

Our brain

A fragment of the neural network connecting the retina to the visual perception area of the brain

Our brain vs computers

- Memory size estimation: ca. 10^14 connections gives an estimated size of 100 TB (each connection has a continuous, real-valued weight)
- Neurons are quite slow, capable of activating no more than 200 times per second, but there are a lot of them; that gives an estimate of 10^16 floating point operations per second

Neural networks vs computers

Neural networks:
- Many (10^11) simple processing elements (neurons)
- Massively parallel, distributed processing
- Memory evenly distributed over the whole structure, content-addressable
- Large fault tolerance

Computers:
- A few complex processing elements
- Sequential, centralized processing
- Compact memory, addressed by index
- Large fault vulnerability

How to train the whole network?

- For the Perceptron, the output of the neuron could be compared to the desired value
- But what about a layered structure? How do we reach the hidden neurons?
- The original idea comes from the experiments of Widrow and Hoff in the 1960s
- Global error optimization using gradient descent was used

Supervised scheme once again

Error minimization

- The error function can be quite elaborately defined
- But the goal is always to minimize the error
- One widely used technique of function optimization (minimization/maximization) is gradient descent

Error function

- One cycle of training consists of the presentation of many training pairs; it is called one epoch of learning
- The error accumulated over the whole epoch is an average
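With the usual quadratic error component, the epoch-averaged error can be written as (a standard reconstruction; d denotes the desired output, y the actual output, N the number of training pairs in the epoch):

```latex
E = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{2} \sum_{k} \left( d_k^{(n)} - y_k^{(n)} \right)^2
```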

Why quadratic function?

Error function once again

- As subsequent input/output pairs are averaged out, we can think of the error function mainly as a function of the weights w
- The goal of learning: to choose the weights in such a way that the error is minimized

Error function derivative

The derivative tells us whether the function increases or decreases as the argument increases (and how fast).

If the function is falling, the sign of the derivative is negative.

We want to minimize the function value, so we have to increase the argument.

The gradient rule

Error function gradient

- In the multidimensional case we deal with a vector of partial derivatives of the error function with respect to each dimension (the gradient)

Gradient method

The method of moving against the gradient is commonly called steepest descent (its maximization counterpart, moving along the gradient, is hill-climbing)
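A one-dimensional sketch of the method (the error function E(w) = (w − 3)² is an invented example whose minimum sits at w = 3):

```python
# A sketch of steepest descent: repeatedly step against the derivative.

def gradient_descent(dE_dw, w0, eta=0.1, steps=100):
    """Move the argument against the gradient of the error function."""
    w = w0
    for _ in range(steps):
        w = w - eta * dE_dw(w)   # negative derivative => increase w
    return w

# dE/dw = 2 * (w - 3) for the example error function E(w) = (w - 3)**2
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```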

Gradient method

Steepest descent demo

- MATLAB demonstration

Other form of activation function

- The so-called sigmoidal function, e.g. f(z) = 1 / (1 + e^(-βz))

Other form of activation function

[Figure: sigmoid plots for β = 1, β = 100 and β = 0.4; the larger the β, the closer the shape is to the threshold function]
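A sketch of the sigmoid and its derivative (assuming the common form above; the identity f′ = βf(1 − f) becomes handy during backpropagation):

```python
# A sketch of the sigmoidal activation f(z) = 1 / (1 + exp(-beta * z)).
# Large beta approaches the hard threshold; small beta is almost linear.
import math

def sigmoid(z, beta=1.0):
    return 1.0 / (1.0 + math.exp(-beta * z))

def sigmoid_derivative(z, beta=1.0):
    """f'(z) = beta * f(z) * (1 - f(z)); used later in backpropagation."""
    f = sigmoid(z, beta)
    return beta * f * (1.0 - f)
```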

Backpropagation algorithm

Backpropagation algorithm

Chain rule

- Applies the chain rule of differentiation
- That makes it possible to transfer the error backward toward the hidden units

Chain rule

Backpropagation through neuron

- Conclusion: if we know the gradient of the error function with respect to the output of the neuron, we can compute the gradient with respect to each of its weights
- In general, our goal is to propagate the error function gradient from the output of the network to the outputs of the hidden units
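This conclusion can be sketched numerically for a single sigmoidal neuron (an illustrative version; β = 1 assumed):

```python
# A sketch of the chain rule through one sigmoidal neuron: given dE/dy at
# the neuron output, recover dE/dw_i for every weight.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_neuron(x, w, dE_dy):
    """For y = sigmoid(sum_i w_i * x_i), return (dE/dz, [dE/dw_i])."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    y = sigmoid(z)
    dE_dz = dE_dy * y * (1.0 - y)     # back through the activation function
    dE_dw = [dE_dz * xi for xi in x]  # dz/dw_i = x_i
    return dE_dz, dE_dw
```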

Backpropagation

- An additional problem: in general, each hidden neuron is connected to more than one neuron of the next layer
- There are many paths for the error gradient to be transmitted backward from the next layer

Error backpropagation

Backpropagation through layer

- Applying the differentiation rule for functions of compound arguments (the multivariate chain rule),
- we can propagate the error gradient through the layer

Backpropagation through layer

More generally

Backpropagation through layer

Forward propagation

The activations of the neurons are propagated

Forward propagation

[Figure: activations a1, a2, a3 are multiplied by the weights w11, w12, w13 and summed into the excitation z1]

The activations of the neurons are propagated forward.

Backpropagation

The error function gradient is propagated backward.

Backpropagation

[Figure: the error gradient reaches the activation a2 backward through the next layer's weights w12 and w22]

Single algorithm cycle

Forward propagation

- One cycle of the algorithm:
- get the inputs of the current layer
- compute the excitations of the considered layer by transferring the inputs through the layer of weights (multiplying the inputs by the corresponding weights and performing the summation)
- calculate the activations of the layer's neurons by transferring the neuron excitations through the activation functions
- Repeat that cycle, starting with layer 1 and going on to the output layer. The activations of the neurons of the output layer are the outputs of the network

Backpropagation

- One cycle of the algorithm:
- get the error function gradients with respect to the outputs of the layer
- compute the error gradients with respect to the excitations of the layer's neurons by transferring the gradients backward through the derivatives of the neuron activation functions
- compute the error function gradients with respect to the outputs of the prior layer by transferring the gradients computed so far through the layer of weights (multiplying the gradients by the corresponding weights and performing the summation)

Backpropagation

- Repeat that cycle, starting from the last layer (where the error function gradients can be computed directly) and moving toward the first layer. The gradients computed through the process can be used to calculate the gradients with respect to the weights

BP Algorithm

- It all ends up with a computationally efficient and elegant procedure for computing the partial derivative of the error function with respect to every weight in the network.
- It allows us to correct every weight of the network in such a way as to reduce the error
- Repeating the process over and over gradually reduces the error and constitutes the learning process

Example source code (MATLAB)
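The original MATLAB listing is not reproduced here; below is a hypothetical Python sketch of the same cycle (forward propagation, backpropagation, gradient-descent corrections) for a tiny 2-2-1 sigmoid network learning the AND function. All names, sizes and parameter values are invented for illustration.

```python
# A sketch of backpropagation training for a 2-input, 2-hidden, 1-output
# network of sigmoidal neurons (each weight list ends with a bias term).
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """Forward propagation: hidden activations h and network output y."""
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in W1]
    y = sigmoid(W2[0] * h[0] + W2[1] * h[1] + W2[2])
    return h, y

def train(pairs, eta=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden layer
    W2 = [rng.uniform(-1, 1) for _ in range(3)]                      # output neuron
    for _ in range(epochs):
        for x, d in pairs:
            h, y = forward(x, W1, W2)
            # backpropagation: error gradients w.r.t. the excitations
            delta_y = (y - d) * y * (1 - y)
            delta_h = [delta_y * W2[j] * h[j] * (1 - h[j]) for j in range(2)]
            # gradient-descent weight corrections
            W2 = [W2[0] - eta * delta_y * h[0],
                  W2[1] - eta * delta_y * h[1],
                  W2[2] - eta * delta_y]
            for j in range(2):
                W1[j] = [W1[j][0] - eta * delta_h[j] * x[0],
                         W1[j][1] - eta * delta_h[j] * x[1],
                         W1[j][2] - eta * delta_h[j]]
    return W1, W2

def mse(pairs, W1, W2):
    return sum((forward(x, W1, W2)[1] - d) ** 2 for x, d in pairs) / len(pairs)

pairs = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # the AND function
W1, W2 = train(pairs)
```

Repeating the cycle gradually reduces the epoch error, exactly as described above.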

Learning rate

- The term η is called the learning rate
- The faster, the better, but too fast a rate can cause the learning process to become unstable

Learning rate

- In practice we have to manipulate the learning rate during the course of the learning process
- A constant learning rate is not a very good strategy

Two types of problems

- Data grouping/classification
- Function approximation

Classification

Classification

- Alternative scheme: one output per class

[Figure: ten network outputs for the digit classes 0-9; the output for class 7 responds strongly (90) while all the others stay low (1), so the input is classified as "7"; when no output dominates: no decision]

Classification: typical applications

- Classification = pattern recognition
- medical diagnosis
- fault condition recognition
- handwriting recognition
- object identification
- decision support

Classification example

- Applet: character recognition

Classification

- Assumes that a class is a group of similar objects
- Similarity has to be defined
- Similar objects: objects having similar attributes
- We have to describe the attributes

Classification

- E.g. some human attributes:
- Height
- Age
- Class K: tall people under 30

Classification

- Object O1, belonging to class K: a person 180 cm tall, 23 years old → (180, 23)
- Object O2, not belonging to class K: a person 165 cm tall, 35 years old → (165, 35)

Classification

The similarity of objects

The similarity

- Euclidean distance (Euclidean metric): d(a, b) = √(Σᵢ (aᵢ − bᵢ)²)

Other metrics

- Manhattan metric: d(a, b) = Σᵢ |aᵢ − bᵢ|
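Both metrics are easy to sketch (using the (height, age) objects from the earlier slide):

```python
# A sketch of the Euclidean and Manhattan distances between objects
# described by attribute vectors.
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

# The two people from the example: (height, age) = (180, 23) and (165, 35).
o1, o2 = (180, 23), (165, 35)
```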

Classification

- The more attributes, the more dimensions

Multidimensional metric

Multidimensional data

- OLIVE presentation

Classification

[Figure: an object described by many attributes: Atr 1, Atr 2, ..., Atr 7, Atr 8, etc.]

Classification

y = kx, i.e. AGE = k·HEIGHT

- Drawing the boundary between the two groups:
- above the line: AGE > k·HEIGHT
- below the line: AGE < k·HEIGHT

Classification

AGE = k·HEIGHT + B, i.e. AGE − k·HEIGHT − B = 0 (in the figure: k = 2, B = 20)

[Figure: the separating line in the (HEIGHT, AGE) plane; the example objects lie at ages 23 and 35]

Classification

- In general, for the multidimensional case, the so-called classification hyperplane is described by w·x + b = 0
- We are very close to McCulloch-Pitts...

McCulloch-Pitts

Neuron as a simple classifier

- A single McCulloch-Pitts threshold unit performs a linear dichotomy (separation of two classes in the multidimensional space)
- Tuning the weights and the threshold changes the orientation of the separating hyperplane

Neuron as a simple classifier

- If we tune the weights properly (train the neuron properly), it will classify the processed objects
- Processing an object means exposing the object's attributes to the neuron's inputs

More classes

- More neurons: a network
- Every neuron performs a bisection of the feature space
- A few neurons partition the space into a few distinct areas

Sigmoidal activation function

Classification example

- NeuroSolutions: Principal Component

Complicated separation border

- NeuroSolutions: Support Vector Machine

Approximation

[Figure: an unknown mapping "?" from X to Y]

Example

- True phenomenon

Example

- There is only a limited number of observations

Example

- And the observations are corrupted

Typical situation

- We have a small amount of data
- The data is corrupted (we are not certain how reliable it is)

Example

- The experimenter sees only the data

Experimenter/system task

- To fill the gaps?
- We would call that interpolation
- But what we truly mean is approximation: looking for a model (trace) which is most similar (approximate) to the unknown (!) true phenomenon

Example

- We can apply e.g. MATLAB's polyfit

Polyfit

- Polynomial approximation

Example

- Polyfit with 2nd order polynomial
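MATLAB's polyfit has a NumPy counterpart; a sketch of a 2nd-order fit under assumed conditions (the "true phenomenon" y = x², the noise level and the sample are all invented for illustration):

```python
# A sketch of a 2nd-order polynomial fit to noisy observations of an
# assumed true phenomenon y = x**2.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 21)
y_true = x ** 2
y_observed = y_true + rng.normal(0.0, 0.1, size=x.shape)  # corrupted data

coeffs = np.polyfit(x, y_observed, deg=2)   # [a, b, c] for a*x**2 + b*x + c
model = np.polyval(coeffs, x)               # the fitted trace
```

With the right model degree, the recovered leading coefficient lands close to the true value of 1 despite the corruption.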

Example

- But how do we know that we should apply a 2nd-order polynomial?

Example

- And what if we apply the 15th degree? It fits the data much better (but it doesn't fit the original phenomenon well)

The variance factor

- The higher the degree, the more erratic it gets
- A 15th-degree polynomial is quite flexible; it can be fit to many things
- However, generalization is sacrificed: the model fits the data well, but would most probably fail on other data coming later
- That is getting too close to modelling the variance of the data

Example

- We could also insist on the 1st order

Example

- ... or even the 0th order (the data is almost completely ignored)...

The bias factor

- A lower polynomial degree means lower flexibility
- The arbitrary choice of the model degree is what we called an inductive bias
- It is a kind of a priori knowledge we introduce
- In the case of the 0th and 1st order, the bias is too strong

Polyfit

- A polynomial
- Training set
- Polyfit

Approximation

- Linear model
- A model employing polynomials (linear as well)

Approximation

- Generalized linear model

Approximation

- The hₖ() functions can be various: polynomials, sines, ...
- They can be sigmoids as well

Approximation

- An ANN can implement a linear model...

Approximation

- ... but it can do much more!

ANN transfer function

- This looks like a nonlinear function, indeed...

Approximation

- An Artificial Neural Network built of processing elements with sigmoidal activation functions is a universal approximator for continuous functions (Hornik, Stinchcombe & White, 1989)
- Every typical transfer function can be modelled with arbitrary precision, provided there is an appropriate number of neurons

Function approximation example

- Applet: Java function approximation

Where to go now?

- This set of slides:
- http://pr.radom.net/pgolabek/Antwerp/NNIntro.ppt
- Be sure to check the comp.ai.neural-nets FAQ:
- http://www.faqs.org/faqs/ai-faq/neural-nets/
- Books:
- Simon Haykin, "Neural Networks: A Comprehensive Foundation"
- Christopher Bishop, "Neural Networks for Pattern Recognition"
- "Neural and Adaptive Systems": the NeuroSolutions interactive book (www.nd.com)

Where to go now

- Software:
- NeuroSolutions: www.nd.com
- MATLAB Neural Network Toolbox
- SNNS: the Stuttgart Neural Network Simulator
- ... and countless others