An Introduction to Artificial Neural Networks - PowerPoint PPT Presentation


1
An Introduction to Artificial Neural Networks
  • Piotr Golabek, Ph.D.
  • Radom Technical University
  • Poland
  • pgolab@pr.radom.net

2
An overview of the lecture
  • What are ANNs? What are they for?
  • Neural networks as inductive machines: the inductive reasoning tradition
  • The evolution of the concept: keywords, structures, algorithms

3
An overview of the lecture
  • Two general tasks: classification and approximation
  • The above tasks in a more familiar setting: decision making, signal processing, control systems
  • Live presentations

4
What are ANNs?
  • Don't ask me...
  • An ANN is a set of processing elements (PEs) influencing each other
  • (that definition suits almost anything...)

5
What are ANNs?
  • ... but seriously...
  • neural: following biological (neurophysiological) inspiration
  • artificial: don't forget these are not real neurons!
  • networks: strongly interconnected (in fact, massively parallel processing)
  • and the implicit meaning:
  • ANNs are learning machines, i.e. they adapt, just as biological neurons do

6
Machine learning
  • Important field of AI
  • A computer program is said to learn from
    experience E with respect to some class of tasks
    T and performance measure P, if its performance
    at tasks in T, as measured by P, improves with
    experience E
  • (Take a look at Machine Learning by Tom
    Mitchell)

7
What is an ANN?
  • In the case of ANNs, the experience is the input data (examples)
  • The ANN is an inductive learning machine, i.e. a machine constructing internal generalized concepts based on the evidence brought by the data stream
  • The ANN learns from examples: a paradigm shift

8
What is an ANN?
  • Structurally, an ANN is a complex, interconnected structure composed of simple processing elements, often mimicking biological neurons
  • Functionally, an ANN is an inductive learning machine: it is able to undergo an adaptation process (learning) driven by examples

9
What are ANNs used for?
  • Recognition of images, OCR
  • Recognition of time-signal signatures: vibration diagnostics, sonar signal interpretation, detection of intrusion patterns in various transaction systems
  • Trend prediction, esp. in financial markets (bond rating prediction)
  • Decision support, e.g. in credit assessment, medical diagnosis
  • Industrial process control, e.g. the melting parameters in metallurgical processes
  • Adaptive signal filtering to restore information from a corrupted source

10
Inductive process
  • Concepts rooted in epistemology (episteme = knowledge)
  • Heraclitus: "Nature likes to hide"
  • Observations vs. the true nature of the phenomenon
  • The empirical (experimental) method of developing a model (hypothesis) of the true phenomenon: the inductive process
  • Something like this goes on during ANN learning

11
ANN as inductive learning machine
  • The theory: the way the ANN behaves
  • Experimental data: the examples the ANN learns from
  • New examples cause the ANN to change its behaviour, in order to fit better to the evidence brought by the examples

12
Inductive process
  • Inductive bias: the initial theory (a priori knowledge)
  • Variance: the evidence brought by the data
  • A strong bias prevents the data from affecting the theory
  • A weak bias makes the theory vulnerable to corruption by the data
  • The game is to set the bias-variance balance properly

13
ANN as inductive learning machines
  • We can shape the inductive bias of the learning process, e.g. by tuning the number of neurons
  • The more neurons, the more flexible the network (the more sensitive to the data)

14
Inductive vs deductive reasoning
  • Reasoning: premises → conclusions
  • Deductive reasoning: the conclusions are more specific than the premises (we just derive the consequences)
  • Inductive reasoning: the conclusions are more general than the premises (we infer the general rules governing the phenomenon from specific examples)

15
The main goal of inductive reasoning
  • The main goal: to achieve good generalization, i.e. to infer a rule general enough that it fits any future data
  • This is also the main goal of machine learning: to use experience in order to build good enough performance (in every possible future situation)

16
McCulloch-Pitts model
  • Warren McCulloch
  • Walter Pitts

"A Logical Calculus of the Ideas Immanent in Nervous Activity", 1943
17
McCulloch-Pitts model
  • Logical calculus approach
  • elementary logical operations: AND, OR, NOT
  • the basic reasoning operator: implication (given premise p, we draw conclusion q)

18
McCulloch-Pitts model
  • Logical operators are functions
  • Truth tables

x y  x → y
0 0    1
0 1    1
1 0    0
1 1    1

x y  x AND y
0 0    0
0 1    0
1 0    0
1 1    1

x y  x OR y
0 0    0
0 1    1
1 0    1
1 1    1

x  NOT x
0    1
1    0
19
McCulloch-Pitts model
  • The working question: can a neuron perform the logical functions AND, OR, NOT?
  • If the answer is yes, a chain of implications (reasoning) could be implemented in a neural network

20
McCulloch-Pitts model
Diagram of the model: the inputs, multiplied by the weights and summed to give the total excitation, are passed through the activation function (with an activation threshold) to produce the neuron output (activation)
21
McCulloch-Pitts transfer function
22
Implementation of AND, OR, NOT
  • McCulloch-Pitts neuron
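A minimal Python sketch (my own illustration, not part of the original slides) of such a threshold unit; the weights and thresholds below are one conventional choice assumed for AND, OR and NOT:

import numpy as np

def mcculloch_pitts(x, w, theta):
    # fire (output 1) when the weighted sum of the inputs reaches the threshold theta
    return int(np.dot(x, w) >= theta)

# AND: both inputs are needed to reach the threshold; OR: one input is enough
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x,
          "AND:", mcculloch_pitts(x, w=[1, 1], theta=2),
          "OR:",  mcculloch_pitts(x, w=[1, 1], theta=1))
# NOT: a single negative weight and a zero threshold
print("NOT 0:", mcculloch_pitts([0], w=[-1], theta=0),
      "NOT 1:", mcculloch_pitts([1], w=[-1], theta=0))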

23
Including threshold into weights
24
McCulloch-Pitts model
  • Neuron equations

25
(vector dot product)
26
(vector dot product)
Diagram: the mutual orientation of the input vector x and the weight vector w ranges from maximum similarity (vectors aligned), through maximum dissimilarity (orthogonality), to maximum antisimilarity (vectors opposite)
27
Vector dot product interpretation
  • The inputs form the input vector
  • The weights form the weight vector
  • The neuron fires when the input vector is similar enough to the weight vector
  • The weight vector is a template for some set of input vectors
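A small Python illustration (the vectors are assumed, not from the slides) of this interpretation: the excitation w·x is largest when the input matches the weight-vector template, zero when the two are orthogonal, and most negative when they point in opposite directions:

import numpy as np

w = np.array([1.0, 1.0])             # weight vector: the "template"
for x in [np.array([ 1.0,  1.0]),    # aligned with w   -> maximum similarity
          np.array([ 1.0, -1.0]),    # orthogonal to w  -> maximum dissimilarity
          np.array([-1.0, -1.0])]:   # opposite to w    -> maximum antisimilarity
    print(x, "dot product:", np.dot(w, x))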

28
Neurons: the elements of ANNs
  • Don't be fooled...
  • These are our neurons...

29
Neurons: the elements of ANNs
Single neuron (stereoscopic)
30
Neurons: the building blocks of neural networks
  • There is some analogy...

31
The real neuron
Synaptic connection: the organic structure
32
The real neuron
Synaptic connection: the molecular level
33
McCulloch-Pitts model
  • The conclusion:
  • If we tune the weights of the neuron properly, we can make it implement the transfer function we need (AND, OR, NOT)
  • The question:
  • How are the weights of neurons tuned in our brains, i.e. what is the adaptation mechanism?

34
Neuron adaptation
  • Donald Hebb (1949, neurophysiologist): "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."

35
Hebb rule
36
Hebb rule
  • It is a local rule of adaptation
  • The multiplication of input and output signifies a correlation between them
  • The rule is unstable: a weight can grow without limits (illustrated in the sketch below)
  • (that doesn't happen in nature, where resources are limited)
  • numerous modifications of the Hebb rule have been proposed to make it stable
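A minimal sketch of the plain Hebb rule, Δw = η·x·y (the learning rate and signal values are assumed, for illustration only), showing the unbounded growth mentioned above:

import numpy as np

eta = 0.5                       # learning rate
w = np.array([0.1, 0.1])        # initial weights
x = np.array([1.0, 1.0])        # a repeatedly presented input

for step in range(5):
    y = np.dot(w, x)            # linear neuron output
    w = w + eta * x * y         # Hebb rule: correlation of input and output
    print(step, w)
# the weights keep growing without limit -- the plain rule is unstable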

37
Hebb rule
  • Hebb rule is very important and useful ...
  • ... but for now we want to make the neuron learn the function we need

38
Rosenblatt Perceptron
  • Frank Rosenblatt (1958): the Perceptron, a hardware (electromechanical) implementation of an ANN (effectively one neuron).

39
Rosenblatt Perceptron
  • One of the goals of the experiment was to train the neuron, i.e. to make it go active whenever a specific pattern appears on the retina
  • The neuron was to be trained with examples
  • The experimenter (teacher) was to expose the neuron to the different patterns and in each case tell it whether it should fire or not
  • The learning algorithm should do its best to make the neuron do what the teacher requires

40
Perceptron learning rule
  • A kind of modified Hebbian rule (the weight correction depends on the error between the actual and the desired output); a sketch follows below
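A sketch of this rule, Δw = η·(d − y)·x, with assumed values; the threshold is folded into the weights as a constant bias input, and the unit is trained on the AND function:

import numpy as np

eta = 0.2
w = np.zeros(3)                                   # [bias, w1, w2]; threshold folded into w[0]
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # the AND function

for epoch in range(10):
    for x, d in data:
        xb = np.array([1.0, *x])                  # prepend the constant bias input
        y = int(np.dot(w, xb) >= 0)               # threshold activation
        w += eta * (d - y) * xb                   # correction proportional to the error
print("weights:", w)
print("outputs:", [int(np.dot(w, np.array([1.0, *x])) >= 0) for x, _ in data])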

41
Supervised scheme
42
Supervised scheme
  • One training example, the pair <input value, desired output>, is called a training pair
  • The set of all the training pairs is called the training set

43
Unsupervised scheme
44
Example of supervised learning
  • Linear Associator

45
Neural networks
  • A set of processing elements influencing each other
  • The neurons (PEs) are interconnected. The output of each neuron can be connected to the input of every neuron, including itself

46
Neural networks
  • If there is a path of propagation (direct or indirect) between the output of a neuron and its own input, we have feedback; such structures are called recurrent
  • If there is no feedback in a network, the structure is called feedforward

47
What does recurrent mean?
  • a recurrent definition is a definition of a concept that uses the very same concept (but perhaps in a lower-complexity setting)
  • a recurrent function is a function calling itself
  • the classical recurrent definition: the factorial function

48
Recurrent connection
  • function calling itself

49
Recurrent connection
  • At any given moment, the whole history of past excitations influences the neuron's output
  • The concept of temporal memory emerges
  • The past influences the present to a degree determined by the weight of the recurrent connection
  • This weight is effectively a forgetting factor (see the sketch below)
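A tiny illustration (the recurrent weight is assumed) of the self-connection acting as a forgetting factor: every past excitation keeps influencing the state, but its influence decays at each step:

w_rec = 0.5                            # weight of the recurrent connection = forgetting factor
state = 0.0
inputs = [1.0, 0.0, 0.0, 0.0, 1.0]     # a pulse, then silence, then another pulse

for t, x in enumerate(inputs):
    state = w_rec * state + x          # the past enters through the recurrent weight
    print(t, round(state, 4))
# the pulse at t = 0 is still (faintly) visible in the state several steps later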

50
Feedforward layered network
51
Our brain
  • There are ca. 10^11 neurons in our brain
  • Each of them is connected on average to about 1000 other neurons
  • That is only one connection per 10 billion possible ones
  • If every neuron were connected to every other one, our brain would have to be a few hundred meters in diameter
  • There is a strong modularity

52
Our brain
A fragment of the neural network connecting the retina to the visual perception area of the brain
53
Our brain vs computers
  • Memory size estimation: ca. 10^14 connections gives an estimated size of 100 TB (each connection has a continuous real-valued weight)
  • Neurons are quite slow, capable of activating no more than 200 times per second, but there are a lot of them, which gives an estimate of 10^16 floating-point operations per second.

54
Neural networks vs computer
Neural networks:
  • Many (10^11) simple processing elements (neurons)
  • Massively parallel, distributed processing
  • Memory evenly distributed over the whole structure, content-addressable
  • Large fault tolerance
Computers:
  • A few complex processing elements
  • Sequential, centralized processing
  • Compact memory, addressed by index
  • Large fault vulnerability

55
How to train the whole network?
  • For the Perceptron, the output of the neuron could be compared directly to the desired value
  • But what about a layered structure? How do we reach the hidden neurons?
  • The original idea comes from the experiments of Widrow and Hoff in the 1960s
  • Global error optimization using gradient descent was used

56
Supervised scheme once again
57
Error minimization
  • The error function component can be quite
    elaborately defined
  • But the goal is always to minimize the error
  • One widely used technique of function
    optimization (minimization/maximization) is
    called gradient descent

58
Error function
  • One cycle of training consists of the presentation of many training pairs; it is called one epoch of learning
  • The error accumulated over the whole epoch is an average

59
Why quadratic function?
60
Error function once again
  • As subsequent input/output pairs are averaged
    out, we can think of the error function mainly
    as a function of weights w
  • The goal of learning: to choose the weights in such a way that the error is minimized

61
Error function derivative
The derivative tells us whether the function increases or decreases when the argument increases (and how fast).
If the function is falling, the sign of the derivative is negative.
We want to minimize the function value, so we have to increase the argument w_i.
62
The gradient rule
63
Error function gradient
  • In the multidimensional case we deal with a vector of partial derivatives of the error function with respect to each dimension (the gradient); a sketch follows below
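A minimal gradient descent sketch (a toy quadratic error surface and an assumed learning rate) in which every step moves against the gradient vector:

import numpy as np

def error(w):                       # toy error surface with its minimum at (1, -2)
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def gradient(w):                    # vector of partial derivatives of the error
    return np.array([2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)])

w = np.array([4.0, 3.0])            # initial weights
eta = 0.1                           # learning rate
for step in range(50):
    w = w - eta * gradient(w)       # move against the gradient
print("w =", w, "error =", error(w))   # w ends up close to the minimum (1, -2)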

64
Gradient method
The method of moving against the gradient is commonly called steepest descent (moving along the gradient is known as hill-climbing)
65
Gradient method
66
Steepest descent demo
  • MATLAB demonstration

67
Another form of activation function
  • The so-called sigmoidal function, e.g.
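One common choice (assumed here) is the logistic function f(z) = 1 / (1 + e^(−β·z)); a small sketch with the β values from the plot on the next slide:

import numpy as np

def sigmoid(z, beta=1.0):
    # logistic sigmoid; a large beta approaches the hard McCulloch-Pitts threshold
    return 1.0 / (1.0 + np.exp(-beta * z))

for beta in (0.4, 1.0, 100.0):
    print(beta, [round(sigmoid(z, beta), 3) for z in (-2.0, 0.0, 2.0)])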

68
Other form of activation function
Plot: the sigmoid for different slope values β = 0.4, β = 1 and β = 100 (the larger β, the closer to a hard threshold)
69
Backpropagation algorithm
70
Backpropagation algorithm
71
Chain rule
  • Apply the chain rule of differentiation

That makes it possible to transfer the error backward toward the hidden units
72
Chain rule
Backward propagation through neuron
73
Backpropagation through neuron
81
Backpropagation through neuron
  • Conclusion: if we know the error function gradient with respect to the output of the neuron, we can compute the gradient with respect to each of its weights
  • In general, our goal is to propagate the error
    function gradient from the output of the network
    to the outputs of the hidden units

82
Backpropagation
  • An additional problem: in general, each hidden neuron is connected to more than one neuron of the next layer
  • There are many paths for the error gradient to be
    transmitted backward from the next layer

83
Error backpropagation
84
Backpropagation through layer
  • Applying the chain rule for a function of compound arguments,
  • we can propagate the error gradient through the layer

85
Backpropagation through layer
86
Backpropagation through layer
87
Backpropagation through layer
More generally:
88
Backpropagation through layer
89
Forward propagation
The activations of the neurons are propagated
90
Forward propagation
Diagram: the activations a1, a2, a3 are passed through the weights w11, w12, w13 and summed into the excitation z1
The activations of the neurons are propagated
91
Backpropagation
The error function gradient is propagated (back toward activation a2)
92
Backpropagation
The error function gradient is propagated (the contributions arriving through the weights w12 and w22 are summed at activation a2)
93
Single algorithm cycle
94
Forward propagation
  • One cycle of the algorithm:
  • get the inputs of the current layer
  • compute the excitations of the considered layer, transferring the inputs through the layer of weights (multiplying the inputs by the corresponding weights and performing the summation)
  • calculate the activations of the layer's neurons by transferring the neuron excitations through the activation functions
  • Repeat that cycle, starting with layer 1 and continuing to the output layer. The activations of the output layer's neurons are the outputs of the network

95
Backpropagation
  • One cycle of the algorithm:
  • get the error function gradients with respect to the outputs of the layer
  • compute the error gradients with respect to the excitations of the layer's neurons by transferring the gradients backward through the derivatives of the neuron activation functions
  • compute the error function gradients with respect to the outputs of the prior layer by transferring the gradients computed so far through the layer of weights (multiplying the gradients by the corresponding weights and performing the summation)

96
Backpropagation
  • Repeat that cycle, starting from the last layer (where the error function gradients can be computed directly) and moving toward the first layer. The gradients computed along the way can be used to calculate the gradients with respect to the weights

97
BP Algorithm
  • It all ends up with a computationally efficient and elegant procedure to compute the partial derivative of the error function with respect to every weight in the network.
  • It allows us to correct every weight of the network in such a way as to reduce the error
  • Repeating the process on and on gradually reduces
    the error and constitutes the learning process
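A compact sketch of the whole procedure for one hidden layer of sigmoidal units and a mean squared error (my own minimal Python/NumPy illustration with an assumed toy target; it is not the MATLAB source referred to on the next slide):

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (50, 2))                      # training inputs
D = (np.sin(X[:, :1]) + X[:, 1:]) / 2.0              # desired outputs (a toy target)

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)    # input -> hidden weights
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)    # hidden -> output weights
eta = 0.5                                            # learning rate

for epoch in range(2000):
    # forward propagation, layer by layer
    A1 = sig(X @ W1 + b1)                            # hidden activations
    Y  = A1 @ W2 + b2                                # linear output layer
    # backward propagation of the error gradient
    dY  = (Y - D) / len(X)                           # dE/dY for the mean squared error
    dA1 = dY @ W2.T                                  # gradient w.r.t. hidden outputs
    dZ1 = dA1 * A1 * (1.0 - A1)                      # ... through the sigmoid derivative
    # gradient descent on every weight
    W2 -= eta * A1.T @ dY;  b2 -= eta * dY.sum(axis=0)
    W1 -= eta * X.T @ dZ1;  b1 -= eta * dZ1.sum(axis=0)

print("final mean squared error:", float(np.mean((Y - D) ** 2)))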

98
Example source code (MATLAB)
99
Learning rate
  • The term η is called the learning rate
  • The faster, the better, but too high a rate can cause the learning process to become unstable
100
Learning rate
  • In practice we have to adjust the learning rate during the course of the learning process
  • A constant learning rate is not a very good strategy

101
Two types of problems
  • Data grouping/classification
  • Function approximation

102
Classification
103
Classification
  • Alternative scheme

Diagram: one network output per digit class 0-9 (here class 7 is strongly activated), plus an additional output for "no decision"
104
Classification typical applications
  • Classification = pattern recognition:
  • medical diagnosis
  • fault condition recognition
  • handwriting recognition
  • object identification
  • decision support

105
Classification example
  • Applet: character recognition

106
Classification
  • Assumes that a class is a group of similar objects
  • Similarity has to be defined
  • Similar objects: objects having similar attributes
  • We have to describe the attributes

107
Classification
  • E.g. some human attributes:
  • Height
  • Age
  • Class K: tall people under 30

108
Classification
  • Object O1, belonging to class K:
  • a person 180 cm tall, 23 years old
  • Object O2, not belonging to class K:
  • a person 165 cm tall, 35 years old

(180, 23) (165, 35)
109
Classification
110
The similarity of objects
111
The similarity
  • Euclidean distance (Euclidean metric)

112
Other metrics

Manhattan metric
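The Manhattan metric sums the absolute coordinate differences instead of squaring them. A small example (using the objects from the earlier slide, O1 = (180, 23) and O2 = (165, 35)) computing both metrics:

import numpy as np

o1 = np.array([180.0, 23.0])    # height [cm], age [years]
o2 = np.array([165.0, 35.0])

euclidean = np.sqrt(np.sum((o1 - o2) ** 2))   # square root of the sum of squared differences
manhattan = np.sum(np.abs(o1 - o2))           # sum of absolute differences
print(euclidean, manhattan)                   # approx. 19.21 and 27.0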
113
Classification
  • The more attributes, the more dimensions
114
Multidimensional metric
115
Multidimensional data
  • OLIVE presentation

116
Classification
Diagram: objects described by many attributes (Atr 1, Atr 2, ..., Atr 8, etc.)
117
Classification
Y = K·X, i.e. AGE = K·HEIGHT
AGE > K·HEIGHT
  • Drawing the boundary between the two groups

AGE < K·HEIGHT
118
Classification
AGE = K·HEIGHT + B, i.e. AGE − K·HEIGHT − B = 0 (another parameter set K2, B2 would give a different boundary line)
Plot: AGE (example values 23 and 35) against HEIGHT, with the boundary line separating the two groups
119
Classification
  • In general, for the multidimensional case, the so-called classification hyperplane is described by
  • We are very close to the McCulloch-Pitts model...

120
McCulloch-Pitts
121
Neuron as a simple classifier
  • A single McCulloch-Pitts threshold unit performs a linear dichotomy (separation of two classes in multidimensional space)
  • Tuning the weights and the threshold changes the orientation of the separating hyperplane

122
Neuron as a simple classifier
  • If we tune the weights properly (train the neuron properly), it will classify the processed objects
  • Processing an object means exposing the object's attributes to the neuron's inputs

123
More classes
  • More neurons: a network
  • Every neuron performs a bisection of the feature space
  • A few neurons partition the space into several distinct areas

124
Sigmoidal activation function
125
Classification example
  • NeuroSolutions: Principal Component

126
Complicated separation border
  • NeuroSolutions: Support Vector Machine

127
Approximation
Diagram: an unknown mapping (marked "?") from X to Y
128
Example
  • True phenomenon

129
Example
  • There is only a limited number of observations

130
Example
  • And the observations are corrupted

131
Typical situation
  • We have a small amount of data
  • Data is corrupted (we are not certain of how
    reliable it is)

132
Example
  • The experimenter sees only the data

133
Experimenter/system task
  • To fill the gaps?
  • We would call that an interpolation
  • But what we truly mean is an approximation: looking for a model (trace) which is as similar as possible (approximate) to the unknown (!) true phenomenon

134
Example
  • We can apply e.g. a MATLAB polyfit

135
Polyfit
  • Polynomial approximation

136
Example
  • Polyfit with 2nd order polynomial

137
Example
  • But how do we know that we should apply a 2nd-order polynomial?

138
Example
  • And what if we apply a 15th-degree polynomial? It fits the data much better (but it doesn't fit the original phenomenon well)

139
The variance factor
  • The higher the degree, the wilder it gets
  • A 15th-degree polynomial is quite flexible; it can be fit to many things
  • However, generalization is sacrificed: the model fits the data well, but would most probably fail on other data coming later
  • That comes too close to modelling the variance of the data

140
Example
  • We could also insist on the 1st order

141
Example
  • ... or even the 0th order (the data are almost completely ignored)...

142
The bias factor
  • A lower polynomial degree means lower flexibility
  • An arbitrary choice of the model degree is what we called an inductive bias
  • It is a kind of a priori knowledge that we introduce
  • In the case of the 0th and 1st order, the bias is too strong

143
Polyfit
  • A polynomial
  • Training set
  • Polyfit
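A sketch of the polyfit experiment described in the previous slides (the "true phenomenon" and the noise level are assumed here, since the original data are not reproduced): polyfit finds the least-squares polynomial of a given degree, and the degree controls the bias-variance balance:

import numpy as np

rng = np.random.default_rng(1)
true = lambda x: 1.0 - 2.0 * x + 3.0 * x ** 2            # the assumed "true phenomenon"
x = np.linspace(-1, 1, 30)
y = true(x) + rng.normal(0, 0.3, x.shape)                # a small set of corrupted observations

x_test = np.linspace(-1, 1, 200)                         # fresh points to judge generalization
for degree in (0, 1, 2, 15):
    coeffs = np.polyfit(x, y, degree)                    # least-squares polynomial fit
    fit_err = np.mean((np.polyval(coeffs, x) - y) ** 2)                  # error on the training data
    gen_err = np.mean((np.polyval(coeffs, x_test) - true(x_test)) ** 2)  # error against the true curve
    print(degree, round(fit_err, 4), round(gen_err, 4))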

144
Approximation
  • Linear model
  • A model employing polynomials (linear as well)

145
Approximation
  • Generalized linear model

146
Approximation
  • The h_k() functions can be various: polynomials, sines, ...
  • They can be sigmoids as well (see the sketch below)
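A sketch of such a generalized linear model, y = Σ_k w_k · h_k(x), with assumed basis functions (a constant, x itself and two shifted sigmoids), fitted by ordinary least squares:

import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
# assumed basis functions h_k: a constant, a linear term and two shifted sigmoids
basis = [lambda x: np.ones_like(x),
         lambda x: x,
         lambda x: sig(4.0 * (x - 0.5)),
         lambda x: sig(4.0 * (x + 0.5))]

x = np.linspace(-2, 2, 40)
y = np.tanh(x)                                   # target function to approximate

H = np.column_stack([h(x) for h in basis])       # design matrix, one column per h_k
w, *_ = np.linalg.lstsq(H, y, rcond=None)        # least-squares weights w_k
print("weights:", np.round(w, 3))
print("max error:", float(np.max(np.abs(H @ w - y))))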

147
Approximation
  • ANN can do a linear model...

148
Approximation
  • But can do much more!

149
ANN transfer function
  • This indeed looks like a nonlinear function...

150
Approximation
  • An Artificial Neural Network built on processing elements with sigmoidal activation functions is a universal approximator for functions of class C1 (with a continuous first derivative) (Hornik, 1989)
  • Every typical transfer function can be modelled with arbitrary precision, provided there is an appropriate number of neurons

151
Function approximation example
  • Java applet: function approximation

152
Where to go now?
  • This set of slides:
  • http://pr.radom.net/pgolabek/Antwerp/NNIntro.ppt
  • Be sure to check the comp.ai.neural-nets FAQ:
  • http://www.faqs.org/faqs/ai-faq/neural-nets/
  • Books:
  • Simon Haykin, Neural Networks: A Comprehensive Foundation
  • Christopher Bishop, Neural Networks for Pattern Recognition
  • Neural and Adaptive Systems: the NeuroSolutions interactive book (www.nd.com)

153
Where to go now
  • Software:
  • NeuroSolutions: www.nd.com
  • MATLAB Neural Network Toolbox
  • SNNS, the Stuttgart Neural Network Simulator
  • and countless others