Artificial Neural Networks - PowerPoint PPT Presentation

Slides: 59
Provided by: raviCs

1
  • Artificial Neural Networks
  • Notes based on Nilsson's and Mitchell's
    machine learning texts

2
Outline
  • Perceptrons (LTU)
  • Gradient descent
  • Multi-layer networks
  • Backpropagation

3
Biological Neural Systems
  • Neuron switching time: $> 10^{-3}$ s
  • Number of neurons in the human brain: $10^{10}$
  • Connections (synapses) per neuron: $10^{4}$ to $10^{5}$
  • Face recognition: 0.1 s
  • High degree of parallel computation
  • Distributed representations

4
Properties of Artificial Neural Nets (ANNs)
  • Many simple neuron-like threshold switching units
  • Many weighted interconnections among units
  • Highly parallel, distributed processing
  • Learning by adaptation of the connection weights

5
Appropriate Problem Domains for Neural Network
Learning
  • Input is high-dimensional discrete or real-valued
    (e.g. raw sensor input)
  • Output is discrete or real valued
  • Output is a vector of values
  • Form of target function is unknown
  • Humans do not need to interpret the results
    (black box model)

6
  • General Idea
  • A network of neurons. Each neuron is
    characterized by:
  • the number of input/output wires
  • the weights on each wire
  • a threshold value
  • These values are not explicitly programmed;
    they evolve through a training process.
  • During the training phase, labeled samples are
    presented. If the network classifies a sample
    correctly, the weights are left unchanged.
    Otherwise, the weights are adjusted.
  • The backpropagation algorithm is used to adjust
    the weights.

7
ALVINN (Carnegie Mellon Univ)
Automated driving at 70 mph on a public highway
[Figure: ALVINN network. A 30x32-pixel camera image is the input; each of the 4 hidden units receives 30x32 weights from the input; 30 output units encode the steering direction.]
8
  • Another Example
  • NETtalk: a program that learns to pronounce
    English text (Sejnowski and Rosenberg 1987).
  • A difficult task using conventional programming
    models.
  • Rule-based approaches are too complex since
    pronunciations are very irregular.
  • NETtalk takes as input a sentence and produces a
    sequence of phonemes and an associated stress for
    each letter.

9
NETtalk
A phoneme is a basic unit of sound in a language. Stress is the relative loudness of that sound. Because the pronunciation of a single letter depends on its context and the letters around it, NETtalk is given a seven-character window. Each position is encoded by one of 29 symbols (26 letters and 3 punctuation marks). The letter in each position activates the corresponding unit.
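As a rough illustration of the input encoding described above, the sketch below one-hot encodes a seven-character window into 7 x 29 = 203 input units. The three punctuation symbols (space, comma, period) and the helper name `encode_window` are assumptions made for illustration; the slides only state that there are 3 punctuation symbols.

```python
import string

# Assumption: the slides do not list the 3 punctuation symbols; these are placeholders.
PUNCTUATION = " ,."
SYMBOLS = string.ascii_lowercase + PUNCTUATION  # 26 letters + 3 = 29 symbols
WINDOW = 7

def encode_window(window):
    """One-hot encode a 7-character window into a flat list of 7 * 29 units."""
    if len(window) != WINDOW:
        raise ValueError("window must contain exactly 7 characters")
    units = []
    for ch in window.lower():
        one_hot = [0] * len(SYMBOLS)
        one_hot[SYMBOLS.index(ch)] = 1  # raises ValueError for unknown symbols
        units.extend(one_hot)
    return units

vec = encode_window("a cat s")
print(len(vec), sum(vec))  # 203 units in total, exactly 7 of them active
```

Exactly one unit per window position is active, so the network sees the middle letter together with three letters of context on each side.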
10
NETtalk
The output units encode phonemes using 21 different features of human articulation. The remaining five units encode stress and syllable boundaries. NETtalk also has a middle layer (hidden layer) that has 80 hidden units, and the network has nearly 18,000 connections (edges). NETtalk is trained by giving it a seven-character window so that it learns to pronounce the middle character. It learns by comparing the computed pronunciation to the correct pronunciation.
11
Handwritten character recognition
  • This is another area in which neural networks
    have been successful.
  • In fact, all the successful programs have a
    neural network component.

12
Threshold Logic Unit (TLU)
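[Figure: a threshold logic unit. The inputs are multiplied by the weights $w_1, w_2, \ldots, w_n$ and summed (Σ) into the activation a, which is compared against the threshold $\theta$ to produce the output y.]
Activation: $a = \sum_{i=1}^{n} w_i x_i$
Output: $y = 1$ if $a \ge \theta$, $y = 0$ if $a < \theta$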
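A minimal Python sketch of a threshold logic unit, which compares the weighted sum of its inputs against a threshold. The function name `tlu` and the example weights are illustrative choices, not taken from the slides.

```python
def tlu(inputs, weights, theta):
    """Threshold logic unit: output 1 if the weighted sum reaches theta, else 0."""
    a = sum(w * x for w, x in zip(weights, inputs))  # activation
    return 1 if a >= theta else 0

print(tlu([1, 0], [0.5, 0.5], 0.4))  # activation 0.5 reaches the threshold 0.4
print(tlu([0, 0], [0.5, 0.5], 0.4))  # activation 0.0 stays below the threshold
```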

13
(No Transcript)
14
Activation Functions
[Figure: four plots of the output y against the activation a, one for each activation function: threshold, linear, piece-wise linear, and sigmoid.]
15
Decision Surface of a TLU
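Decision line: $w_1 x_1 + w_2 x_2 = \theta$
[Figure: points of class 1 and class 0 in the $(x_1, x_2)$ plane, separated by the decision line; the weight vector w is also shown.]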
16
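Scalar Products and Projections
[Figure: three pairs of vectors w and v with angle $\varphi$ between them, illustrating the cases $w \cdot v > 0$, $w \cdot v = 0$, and $w \cdot v < 0$.]
$w \cdot v = |w|\,|v| \cos\varphi$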
17
Geometric Interpretation
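The relation $w \cdot x = \theta$ defines the decision line.
[Figure: the decision line $w \cdot x = \theta$ in the $(x_1, x_2)$ plane with the weight vector w; $y = 1$ on one side of the line and $y = 0$ on the other. The projection of a point x on the line onto w has length $x_w = \theta / |w|$.]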
18
Geometric Interpretation
  • In n dimensions the relation $w \cdot x = \theta$ defines an
    (n-1)-dimensional hyper-plane, which is
    perpendicular to the weight vector w.
  • On one side of the hyper-plane ($w \cdot x > \theta$) all
    patterns are classified by the TLU as 1, while
    those that get classified as 0 lie on the other
    side of the hyper-plane.
  • If patterns cannot be separated by a hyper-plane,
    then they cannot be correctly classified with a
    TLU.

19
Linear Separability
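Logical AND: $w_1 = 1$, $w_2 = 1$, $\theta = 1.5$

  x1  x2  a  y
  0   0   0  0
  0   1   1  0
  1   0   1  0
  1   1   2  1

Logical XOR: $w_1 = ?$, $w_2 = ?$, $\theta = ?$

  x1  x2  y
  0   0   0
  0   1   1
  1   0   1
  1   1   0

[Figure: for each function, the four input points in the $(x_1, x_2)$ plane labeled with their outputs 0 or 1.]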
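The following sketch checks the AND weights from the slide and then searches a coarse grid of weights and thresholds for a TLU that realizes XOR. The grid search is only an illustration (a failed search on a grid is not a proof), and the helper names `realizes` and `grid_search` are hypothetical.

```python
from itertools import product

def tlu(x, w, theta):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]

def realizes(w, theta, targets):
    """True if the TLU with these parameters reproduces every target output."""
    return all(tlu(x, w, theta) == t for x, t in zip(INPUTS, targets))

def grid_search(targets, step=0.5, limit=3.0):
    """Try every (w1, w2, theta) on a coarse grid; return the first match or None."""
    n = int(limit / step)
    values = [i * step for i in range(-n, n + 1)]
    for w1, w2, theta in product(values, repeat=3):
        if realizes((w1, w2), theta, targets):
            return (w1, w2, theta)
    return None

print(realizes((1, 1), 1.5, AND))  # True: the weights from the slide realize AND
print(grid_search(XOR))            # None: no grid point realizes XOR
```

Because XOR is not linearly separable, refining the grid or widening its limits would not change the second result.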
20
Threshold as Weight
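The threshold is treated as an extra weight: $\theta = w_{n+1}$, $x_{n+1} = -1$
[Figure: a TLU whose inputs are weighted by $w_1, w_2, \ldots, w_n$, plus an extra constant input $x_{n+1} = -1$ weighted by $w_{n+1}$; the weighted inputs are summed (Σ) to produce the output y.]
Activation: $a = \sum_{i=1}^{n+1} w_i x_i$
Output: $y = 1$ if $a \ge 0$, $y = 0$ if $a < 0$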
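A small sketch of folding the threshold into the weight vector: appending a constant input of -1 whose weight is the threshold gives the same output as the explicit comparison. The function names are hypothetical and the AND parameters are reused only as test values.

```python
def tlu(x, w, theta):
    """TLU with an explicit threshold."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

def tlu_augmented(x, w_aug):
    """Same unit with the threshold stored as the last weight and a constant -1 input."""
    x_aug = list(x) + [-1]
    return 1 if sum(wi * xi for wi, xi in zip(w_aug, x_aug)) >= 0 else 0

w, theta = [1, 1], 1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, tlu(x, w, theta), tlu_augmented(x, w + [theta]))
```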


21
Geometric Interpretation
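The relation $w \cdot x = 0$ defines the decision line.
[Figure: the decision line $w \cdot x = 0$ in the $(x_1, x_2)$ plane with the weight vector w and an input vector x; $y = 1$ on one side of the line and $y = 0$ on the other.]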
22
Training ANNs
  • Training set S of examples {x, t}
  • x is an input vector and
  • t the desired target vector
  • Example: logical AND
  • S = {((0,0), 0), ((0,1), 0), ((1,0), 0), ((1,1), 1)}
  • Iterative process
  • Present a training example x, compute the network
    output y, compare output y with target t, adjust
    weights and thresholds
  • Learning rule
  • Specifies how to change the weights w and
    thresholds $\theta$ of the network as a function of the
    inputs x, output y and target t.

23
Adjusting the Weight Vector
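Target t = 1, output y = 0 ($\varphi > 90^\circ$): move w in the direction of x
$w' = w + \alpha x$
Target t = 0, output y = 1 ($\varphi < 90^\circ$): move w away from the direction of x
$w' = w - \alpha x$
[Figure: for each case, the input x, the weight vector w, the correction $\pm\alpha x$, and the updated weight vector w', with $\varphi$ the angle between w and x.]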
24
Perceptron Learning Rule
  • $w' = w + \alpha (t - y)\, x$
  • Or in components
  • $w_i' = w_i + \Delta w_i = w_i + \alpha (t - y)\, x_i$ $(i = 1, \ldots, n+1)$
  • With $w_{n+1} = \theta$ and $x_{n+1} = -1$
  • The parameter $\alpha$ is called the learning rate. It
    determines the magnitude of the weight updates $\Delta w_i$.
  • If the output is correct ($t = y$), the weights are
    not changed ($\Delta w_i = 0$).
  • If the output is incorrect ($t \ne y$), the weights $w_i$
    are changed so that the new weight vector $w'$
    moves closer to the input x (when t = 1) or
    further from it (when t = 0).

25
Perceptron Training Algorithm
  • repeat
  •   for each training vector pair (x, t)
  •     evaluate the output y when x is the input
  •     if $y \ne t$ then
  •       form a new weight vector $w'$ according
  •       to $w' = w + \alpha (t - y)\, x$
  •     else
  •       do nothing
  •     end if
  •   end for
  • until $y = t$ for all training vector pairs
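The training loop above can be sketched in Python as follows. This is a minimal version under stated assumptions: the threshold is folded into the weights through a constant input of -1, the weights start at zero, and a `max_epochs` cap (not part of the slide) guards against non-separable data. The AND training set is used as the example.

```python
def predict(w, x):
    """TLU with the threshold folded into the weights (last input fixed at -1)."""
    a = sum(wi * xi for wi, xi in zip(w, list(x) + [-1]))
    return 1 if a >= 0 else 0

def train_perceptron(samples, alpha=0.1, max_epochs=100):
    """Repeat passes over the samples until every one is classified correctly."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            y = predict(w, x)
            if y != t:
                x_aug = list(x) + [-1]
                # Perceptron rule: w' = w + alpha * (t - y) * x
                w = [wi + alpha * (t - y) * xi for wi, xi in zip(w, x_aug)]
                errors += 1
        if errors == 0:  # y == t for all training pairs
            break
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND)
print(w, [predict(w, x) for x, _ in AND])
```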

26
Perceptron Convergence Theorem
  • The algorithm converges to the correct
    classification
  • if the training data is linearly separable
  • and $\alpha$ is sufficiently small
  • If two classes of vectors X1 and X2 are linearly
    separable, the application of the perceptron
    training algorithm will eventually result in a
    weight vector w0, such that w0 defines a TLU
    whose decision hyper-plane separates X1 and X2
    (Rosenblatt 1962).
  • The solution $w_0$ is not unique: if $w_0 \cdot x = 0$
    defines a hyper-plane, so does $w_0' = k\, w_0$.

27
Example
  x1    x2    output
  1.0   1.0    1
  9.4   6.4   -1
  2.5   2.1    1
  8.0   7.7   -1
  0.5   2.2    1
  7.9   8.4   -1
  7.0   7.0   -1
  2.8   0.8    1
  1.2   3.0    1
  7.8   6.1   -1

Initial weights: (0.75, -0.5, -0.6)
28
Multiple TLUs
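  • Handwritten alphabetic character recognition
  • 26 classes: A, B, C, ..., Z
  • The first TLU distinguishes between A's and
    non-A's, the second TLU between B's and non-B's,
    etc.

[Figure: 26 TLUs with outputs $y_1, y_2, \ldots, y_{26}$; the weight $w_{ji}$ connects input $x_i$ with output $y_j$.]
$w_{ji}' = w_{ji} + \alpha (t_j - y_j)\, x_i$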
29
Linear Unit
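[Figure: a linear unit. The inputs are multiplied by the weights $w_1, w_2, \ldots, w_n$ and summed (Σ) into the activation, which is passed unchanged to the output y.]
Activation: $a = \sum_{i=1}^{n} w_i x_i$
Output: $y = a = \sum_{i=1}^{n} w_i x_i$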
30
Gradient Descent Learning Rule
  • Consider a linear unit without threshold and with
    continuous output o (not just -1, 1)
  • $o = w_0 + w_1 x_1 + \ldots + w_n x_n$
  • Train the $w_i$'s such that they minimize the
    squared error
  • $e = \sum_{a \in D} (f_a - d_a)^2$
  • where D is the set of training examples
  • Here $f_a$ is the actual output and $d_a$ is the desired
    output.

31
  • Gradient Descent rule
  • We want to choose the weights $w_i$ so that e is
    minimized. Recall that
  • $e = \sum_{a \in D} (f_a - d_a)^2$
  • Since our goal is to work with this error function
    for one input at a time, let us consider a fixed
    input x in D, and define
  • $e = (f_x - d_x)^2$
  • We will drop the subscript and just write this
    as
  • $e = (f - d)^2$
  • Our goal is to find the weights that will
    minimize this expression.

32
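$\partial e / \partial W = [\partial e / \partial w_1, \ldots, \partial e / \partial w_{n+1}]$
Since s, the input to the threshold function, is given by $s = X \cdot W$, we have $\partial e / \partial W = (\partial e / \partial s)(\partial s / \partial W)$. However, $\partial s / \partial W = X$. Thus, $\partial e / \partial W = (\partial e / \partial s)\, X$.
Recall from the previous slide that $e = (f - d)^2$. So, we have $\partial e / \partial s = 2 (f - d)\, \partial f / \partial s$ (note that d is constant).
This gives the expression $\partial e / \partial W = 2 (f - d)\, (\partial f / \partial s)\, X$.
A problem arises when dealing with a TLU, namely that f is not a continuous function of s.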
33
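  • For a fixed input x, suppose the desired output
    is d and the actual output is f. For a linear unit
    ($f = s$, so $\partial f / \partial s = 1$) the above
    expression becomes
  • $\partial e / \partial W = -2 (d - f)\, x$
  • Stepping against this gradient gives the update
    $W \leftarrow W + c\, (d - f)\, x$. This is what is known
    as the Widrow-Hoff procedure, with 2 replaced by c.
  • The key idea is to move the weight vector along
    the negative gradient.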
  • When will this converge to the correct weights?
  • We are assuming that the data is linearly
    separable.
  • We are also assuming that the desired output
    from the linear threshold gate is available for
    the training set.
  • Under these conditions, perceptron convergence
    theorem shows that the above procedure will
    converge to the correct weights after a finite
    number of iterations.
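A minimal sketch of the Widrow-Hoff update for a linear unit, applied one example at a time. The data set is hypothetical (generated from d = 2*x1 - 1, with a constant second input of 1 acting as a bias), and the step size `c` and epoch count are arbitrary illustrative choices.

```python
def widrow_hoff(samples, c=0.05, epochs=500):
    """Incremental gradient descent on e = (f - d)^2 for a linear unit f = w . x."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, d in samples:
            f = sum(wi * xi for wi, xi in zip(w, x))
            # Step against the gradient -2 (d - f) x, with the factor 2 absorbed into c.
            w = [wi + c * (d - f) * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical data: d = 2*x1 - 1, second input is a constant 1 for the bias.
samples = [((x1, 1.0), 2.0 * x1 - 1.0) for x1 in (0.0, 1.0, 2.0, 3.0)]
w = widrow_hoff(samples)
print([round(wi, 3) for wi in w])
```

Because the data here is exactly linear, the learned weights approach the generating coefficients; with noisy data the procedure would instead settle near a least-squares fit.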

34
Neuron with Sigmoid-Function
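[Figure: a neuron with sigmoid output. The inputs are multiplied by the weights $w_1, w_2, \ldots, w_n$ and summed (Σ) into the activation a, which is passed through the sigmoid to produce the output y.]
Activation: $a = \sum_{i=1}^{n} w_i x_i$
Output: $y = \sigma(a) = 1 / (1 + e^{-a})$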
35
Sigmoid Unit
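[Figure: a sigmoid unit. The inputs are weighted by $w_1, w_2, \ldots, w_n$, and a constant input $x_0 = -1$ is weighted by $w_0$; the weighted inputs are summed (Σ) and passed through the sigmoid to give the output y.]
$a = \sum_{i=0}^{n} w_i x_i$
$y = \sigma(a) = 1 / (1 + e^{-a})$
$\sigma(x)$ is the sigmoid function $1 / (1 + e^{-x})$
$d\sigma(x)/dx = \sigma(x)\,(1 - \sigma(x))$
Derive gradient descent rules to train sigmoid units.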
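As a quick numerical check of the derivative identity for the sigmoid, the sketch below compares s(x)(1 - s(x)) with a central finite difference. The helper names and the sample points are illustrative choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Closed-form derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, sigmoid_prime(x), numerical_derivative(sigmoid, x))
```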
36
Sigmoid function
[Figure: plot of the sigmoid output f against its input s.]

37
Gradient Descent Rule for Sigmoid Output Function
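$E_p[w_1, \ldots, w_n] = \tfrac{1}{2} (t^p - y^p)^2$
  • $\partial E_p / \partial w_i = \partial / \partial w_i \; \tfrac{1}{2} (t^p - y^p)^2$
  • $= \partial / \partial w_i \; \tfrac{1}{2} (t^p - \sigma(\sum_i w_i x_i^p))^2$
  • $= (t^p - y^p)\, \sigma'(\sum_i w_i x_i^p)\, (-x_i^p)$
  • for $y = \sigma(a) = 1 / (1 + e^{-a})$
  • $\sigma'(a) = e^{-a} / (1 + e^{-a})^2 = \sigma(a)\,(1 - \sigma(a))$

$w_i' = w_i + \Delta w_i = w_i + \alpha\, y^p (1 - y^p)(t^p - y^p)\, x_i^p$
[Figure: the sigmoid $\sigma$ plotted against a.]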
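A minimal sketch of training a single sigmoid unit with the update w_i <- w_i + alpha * y(1 - y)(t - y) x_i, applied one example at a time. The constant input of -1 for the threshold weight, the zero initial weights, the learning rate, and the epoch count are assumptions chosen for illustration; logical AND serves as the training set.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def output(w, x):
    """Sigmoid unit; the last weight multiplies a constant input of -1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, list(x) + [-1.0])))

def train_sigmoid_unit(samples, alpha=0.5, epochs=2000):
    """Incremental gradient descent with w_i <- w_i + alpha * y(1-y)(t-y) x_i."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(epochs):
        for x, t in samples:
            y = output(w, x)
            delta = y * (1.0 - y) * (t - y)
            x_aug = list(x) + [-1.0]
            w = [wi + alpha * delta * xi for wi, xi in zip(w, x_aug)]
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_sigmoid_unit(AND)
print([round(output(w, x), 2) for x, _ in AND])
```

Unlike the TLU, the outputs are continuous values between 0 and 1; thresholding them at 0.5 recovers a binary classification.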
38
Presentation of Training Examples
  • Presenting all training examples once to the ANN
    is called an epoch.
  • In incremental (stochastic) gradient descent,
    training examples can be presented in
  • Fixed order (1, 2, 3, ..., M)
  • Randomly permuted order (5, 2, 7, ..., 3)
  • Completely random order (4, 1, 7, 1, 5, 4, ...)
    (repetitions allowed arbitrarily)
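The three presentation orders can be sketched with Python's standard `random` module. Examples are identified by their indices 1..M; the function names are hypothetical, and the seed is fixed only to make the sketch reproducible.

```python
import random

def fixed_order(m):
    """Present examples 1..m in the same order every epoch."""
    return list(range(1, m + 1))

def random_permutation(m, rng):
    """Present each example exactly once per epoch, in shuffled order."""
    order = list(range(1, m + 1))
    rng.shuffle(order)
    return order

def completely_random(m, rng):
    """Draw m examples independently, so repetitions and omissions can occur."""
    return [rng.randint(1, m) for _ in range(m)]

rng = random.Random(0)  # seeded only for reproducibility
m = 7
print(fixed_order(m))
print(random_permutation(m, rng))
print(completely_random(m, rng))
```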

39
Capabilities of Threshold Neurons
  • The threshold neuron can realize any linearly
    separable function $R^n \to \{0, 1\}$.
  • Although we only looked at two-dimensional
    input, our findings apply to any dimensionality
    n.
  • For example, for n = 3, our neuron can realize
    any function that divides the three-dimensional
    input space along a two-dimensional plane.

40
Capabilities of Threshold Neurons
  • What do we do if we need a more complex
    function?
  • We can combine multiple artificial neurons to
    form networks with increased capabilities.
  • For example, we can build a two-layer network
    with any number of neurons in the first layer
    giving input to a single neuron in the second
    layer.
  • The neuron in the second layer could, for
    example, implement an AND function.

41
Capabilities of Threshold Neurons
  • What kind of function can such a network realize?

42
Capabilities of Threshold Neurons
  • Assume that the dotted lines in the diagram
    represent the input-dividing lines implemented by
    the neurons in the first layer
  • Then, for example, the second-layer neuron could
    output 1 if the input is within a polygon, and 0
    otherwise.
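A small sketch of such a two-layer network of threshold neurons: four first-layer neurons each implement one dividing line, and a second-layer AND neuron fires only when the input lies on the inner side of all of them. The unit square is a hypothetical polygon chosen for illustration.

```python
def tlu(x, w, theta):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# First layer: each neuron implements one dividing line (a half-plane).
# Hypothetical example: the four sides of the square 0 <= x1 <= 1, 0 <= x2 <= 1.
FIRST_LAYER = [
    ((1, 0), 0),    # x1 >= 0
    ((-1, 0), -1),  # x1 <= 1
    ((0, 1), 0),    # x2 >= 0
    ((0, -1), -1),  # x2 <= 1
]

def inside_square(x):
    hidden = [tlu(x, w, theta) for w, theta in FIRST_LAYER]
    # Second layer: AND of all first-layer outputs (fires only if all four are 1).
    return tlu(hidden, [1, 1, 1, 1], 3.5)

print(inside_square((0.5, 0.5)), inside_square((1.5, 0.5)))  # 1 0
```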

43
Capabilities of Threshold Neurons
  • However, we still may want to implement
    functions that are more complex than that.
  • An obvious idea is to extend our network even
    further.
  • Let us build a network that has three layers,
    with arbitrary numbers of neurons in the first
    and second layers and one neuron in the third
    layer.
  • The first and second layers are completely
    connected, that is, each neuron in the first
    layer sends its output to every neuron in the
    second layer.

44
Capabilities of Threshold Neurons
  • What type of function can a three-layer network
    realize?

45
Capabilities of Threshold Neurons
  • Assume that the polygons in the diagram indicate
    the input regions for which each of the
    second-layer neurons yields output 1
  • Then, for example, the third-layer neuron could
    output 1 if the input is within any of the
    polygons, and 0 otherwise.
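A sketch of the three-layer idea: the first layer implements dividing lines, the second layer ANDs the lines of each polygon, and the third layer ORs the polygons. The two axis-aligned boxes are hypothetical regions. For brevity each second-layer neuron is wired only to its own lines, which corresponds to a completely connected network in which all other weights are zero.

```python
def tlu(x, w, theta):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

def box_lines(x_lo, x_hi, y_lo, y_hi):
    """First-layer neurons (weights, threshold) for the four sides of a box."""
    return [((1, 0), x_lo), ((-1, 0), -x_hi), ((0, 1), y_lo), ((0, -1), -y_hi)]

# Hypothetical regions: two axis-aligned boxes standing in for the polygons.
POLYGONS = [box_lines(0, 1, 0, 1), box_lines(2, 3, 2, 3)]

def network(x):
    layer2 = []
    for lines in POLYGONS:
        # Layer 1: the dividing lines of this polygon.
        layer1 = [tlu(x, w, theta) for w, theta in lines]
        # Layer 2: AND neuron, fires if x is inside this polygon.
        layer2.append(tlu(layer1, [1] * len(lines), len(lines) - 0.5))
    # Layer 3: OR neuron, fires if x lies in any polygon.
    return tlu(layer2, [1] * len(layer2), 0.5)

print(network((0.5, 0.5)), network((2.5, 2.5)), network((1.5, 1.5)))  # 1 1 0
```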

46
Capabilities of Threshold Neurons
  • The more neurons there are in the first layer,
    the more vertices the polygons can have.
  • With a sufficient number of first-layer neurons,
    the polygons can approximate any given shape.
  • The more neurons there are in the second layer,
    the more of these polygons can be combined to
    form the output function of the network.
  • With a sufficient number of neurons and
    appropriate weight vectors $w_i$, a three-layer
    network of threshold neurons can realize any
    function $R^n \to \{0, 1\}$.

47
Terminology
  • Usually, we draw neural networks in such a way
    that the input enters at the bottom and the
    output is generated at the top.
  • Arrows indicate the direction of data flow.
  • The first layer, termed input layer, just
    contains the input vector and does not perform
    any computations.
  • The second layer, termed hidden layer, receives
    input from the input layer and sends its output
    to the output layer.
  • After applying their activation function, the
    neurons in the output layer contain the output
    vector.

48
Terminology
  • Example: network function $f: R^3 \to \{0, 1\}^2$
[Figure: a network drawn from bottom to top: input vector, input layer, hidden layer, output layer, output vector.]

49
Multi-Layer Networks
[Figure: a multi-layer network with an input layer, a hidden layer, and an output layer.]
50
Training-Rule for Weights to the Output Layer
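$E_p[w_{ji}] = \tfrac{1}{2} \sum_j (t_j^p - y_j^p)^2$
$\partial E_p / \partial w_{ji} = \partial / \partial w_{ji} \; \tfrac{1}{2} \sum_j (t_j^p - y_j^p)^2 = -\, y_j^p (1 - y_j^p)(t_j^p - y_j^p)\, x_i^p$
$\Delta w_{ji} = \alpha\, y_j^p (1 - y_j^p)(t_j^p - y_j^p)\, x_i^p = \alpha\, \delta_j^p\, x_i^p$
with $\delta_j^p = y_j^p (1 - y_j^p)(t_j^p - y_j^p)$
[Figure: the weight $w_{ji}$ connecting input $x_i$ to output unit $y_j$.]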
51
Training-Rule for Weights to the Hidden Layer
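Credit assignment problem: there are no target values t for hidden layer units.
Error for hidden units?
$\delta_k = \sum_j \delta_j\, w_{jk}\, x_k (1 - x_k)$
$\Delta w_{ki} = \alpha\, \delta_k^p\, x_i^p$
[Figure: input $x_i$ connects through weight $w_{ki}$ to hidden unit $x_k$ (error $\delta_k$), which connects through weight $w_{jk}$ to output unit $y_j$ (error $\delta_j$).]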
52
Training-Rule for Weights to the Hidden Layer
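$E_p[w_{ki}] = \tfrac{1}{2} \sum_j (t_j^p - y_j^p)^2$
$\partial E_p / \partial w_{ki} = \partial / \partial w_{ki} \; \tfrac{1}{2} \sum_j (t_j^p - y_j^p)^2$
$= \partial / \partial w_{ki} \; \tfrac{1}{2} \sum_j (t_j^p - \sigma(\sum_k w_{jk} x_k^p))^2$
$= \partial / \partial w_{ki} \; \tfrac{1}{2} \sum_j (t_j^p - \sigma(\sum_k w_{jk}\, \sigma(\sum_i w_{ki} x_i^p)))^2$
$= -\sum_j (t_j^p - y_j^p)\, \sigma_j'(a)\, w_{jk}\, \sigma_k'(a)\, x_i^p$
$= -\sum_j \delta_j\, w_{jk}\, \sigma_k'(a)\, x_i^p$
$= -\sum_j \delta_j\, w_{jk}\, x_k (1 - x_k)\, x_i^p$
$\Delta w_{ki} = \alpha\, \delta_k\, x_i^p$ with $\delta_k = \sum_j \delta_j\, w_{jk}\, x_k (1 - x_k)$
[Figure: input $x_i$ connects through weight $w_{ki}$ to hidden unit $x_k$ (error $\delta_k$), which connects through weight $w_{jk}$ to output unit $y_j$ (error $\delta_j$).]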
53
Backpropagation
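Forward step: propagate activation from input to output layer
Backward step: propagate errors from output to hidden layer
[Figure: input $x_i$, weight $w_{ki}$, hidden unit $x_k$ with error $\delta_k$, weight $w_{jk}$, output unit $y_j$ with error $\delta_j$; activations flow upward and errors flow downward.]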
54
Backpropagation Algorithm
  • Initialize weights $w_{ij}$ with a small random value
  • repeat
  •   for each training pair $(x_1, \ldots, x_n)^p, (t_1, \ldots, t_m)^p$ do
  •     Present $(x_1, \ldots, x_n)^p$ to the network and compute the
        outputs $y_j$ (forward step)
  •     Compute the errors $\delta_j$ in the output layer and
        propagate them to the hidden layer (backward
        step)
  •     Update the weights in both layers according to
  •     $\Delta w_{ki} = \alpha\, \delta_k\, x_i$
  •   end for loop
  • until overall error E becomes acceptably low

55
Backpropagation Algorithm
  • Initialize each $w_i$ to some small random value
  • Until the termination condition is met, Do
  •   For each training example $\langle (x_1, \ldots, x_n), t \rangle$ Do
  •     Input the instance $(x_1, \ldots, x_n)$ to the network and
        compute the network outputs $y_k$
  •     For each output unit k
  •       $\delta_k = y_k (1 - y_k)(t_k - y_k)$
  •     For each hidden unit h
  •       $\delta_h = y_h (1 - y_h) \sum_k w_{h,k}\, \delta_k$
  •     For each network weight $w_{i,j}$ Do
  •       $w_{i,j} = w_{i,j} + \Delta w_{i,j}$ where
  •       $\Delta w_{i,j} = \alpha\, \delta_j\, x_{i,j}$
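The algorithm above can be sketched in plain Python for a network with one hidden layer of sigmoid units. This is an illustrative version under stated assumptions: the class `TwoLayerNet` and helper `squared_error` are hypothetical names, each unit gets an extra weight on a constant input of 1, and logical OR with a fixed number of epochs stands in for a real data set and termination condition.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

class TwoLayerNet:
    """n_in inputs -> n_hidden sigmoid units -> n_out sigmoid units (hypothetical helper)."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        # Small random initial weights; the extra weight per unit acts on a constant input 1.
        self.w_hidden = [[rng.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        x_aug = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(ws, x_aug))) for ws in self.w_hidden]
        h_aug = hidden + [1.0]
        out = [sigmoid(sum(w * v for w, v in zip(ws, h_aug))) for ws in self.w_out]
        return x_aug, h_aug, out

    def train_example(self, x, t, alpha):
        x_aug, h_aug, y = self.forward(x)
        # Output units: delta_k = y_k (1 - y_k) (t_k - y_k)
        d_out = [yk * (1 - yk) * (tk - yk) for yk, tk in zip(y, t)]
        # Hidden units: delta_h = y_h (1 - y_h) * sum_k w_{h,k} delta_k
        d_hid = [h_aug[h] * (1 - h_aug[h]) *
                 sum(self.w_out[k][h] * d_out[k] for k in range(len(d_out)))
                 for h in range(len(self.w_hidden))]
        # Weight updates: w <- w + alpha * delta * (input feeding that weight)
        for k, ws in enumerate(self.w_out):
            self.w_out[k] = [w + alpha * d_out[k] * v for w, v in zip(ws, h_aug)]
        for h, ws in enumerate(self.w_hidden):
            self.w_hidden[h] = [w + alpha * d_hid[h] * v for w, v in zip(ws, x_aug)]

def squared_error(net, samples):
    return 0.5 * sum((tk - yk) ** 2
                     for x, t in samples
                     for tk, yk in zip(t, net.forward(x)[2]))

rng = random.Random(0)  # seeded only for reproducibility
net = TwoLayerNet(2, 2, 1, rng)
samples = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (1,))]  # logical OR
before = squared_error(net, samples)
for _ in range(2000):
    for x, t in samples:
        net.train_example(x, t, alpha=0.5)
print(before, squared_error(net, samples))
```

The hidden-unit errors are computed from the output weights before those weights are updated, so both layers use the gradient at the same point.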

56
Backpropagation
  • Gradient descent over entire network weight
    vector
  • Easily generalized to arbitrary directed graphs
  • Will find a local, not necessarily global, error
    minimum
  • In practice it often works well (can be invoked
    multiple times with different initial weights)
  • Often includes a weight momentum term
  • $\Delta w_{i,j}(n) = \alpha\, \delta_j\, x_{i,j} + \lambda\, \Delta w_{i,j}(n-1)$
  • Minimizes error over training examples
  • Will it generalize well to unseen instances
    (over-fitting)?
  • Training can be slow: typically 1000-10000
    iterations
  • (Using the network after training is fast)

57
Backpropagation
  • Easily generalized to arbitrary directed graphs
    without clear layers.
  • BP finds a local, not necessarily global, error
    minimum
  • In practice it often works well (can be invoked
    multiple times with different initial weights)
  • Minimizes error over training examples
  • How does it generalize to unseen instances?
  • Training can be slow: typically 1000-10000
    iterations
  • (use more efficient optimization methods than
    gradient descent)
  • Using the network after training is fast

58
Convergence of Backprop
  • Gradient descent to some local minimum, perhaps
    not the global minimum
  • Add a momentum term to $\Delta w_{ki}(n)$
  • $\Delta w_{ki}(n) = \alpha\, \delta_k(n)\, x_i(n) + \lambda\, \Delta w_{ki}(n-1)$
  • with $\lambda \in [0, 1]$
  • Stochastic gradient descent
  • Train multiple nets with different initial
    weights
  • Nature of convergence
  • Initialize weights near zero
  • Therefore, initial networks near-linear
  • Increasingly non-linear functions possible as
    training progresses
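The momentum term mentioned above can be sketched for a single weight as follows. The function `momentum_step` is a hypothetical helper, `grad_step` stands for the plain gradient-descent update (learning rate times error term times input), and the momentum factor 0.9 is an arbitrary illustrative value.

```python
def momentum_step(w, grad_step, prev_delta, lam=0.9):
    """Return (new_w, delta) where delta = grad_step + lam * prev_delta."""
    delta = grad_step + lam * prev_delta
    return w + delta, delta

# With a constant gradient step, successive updates grow toward grad_step / (1 - lam).
w, delta = 0.0, 0.0
for n in range(5):
    w, delta = momentum_step(w, grad_step=0.1, prev_delta=delta)
    print(n, round(delta, 4), round(w, 4))
```

Setting `lam` to 0 recovers plain gradient descent, since the previous update is then ignored.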

59
Expressive Capabilities of ANN
  • Boolean functions
  • Every boolean function can be represented by a
    network with a single hidden layer
  • But it might require a number of hidden units that
    is exponential in the number of inputs
  • Continuous functions
  • Every bounded continuous function can be
    approximated with arbitrarily small error by a
    network with one hidden layer (Cybenko 1989;
    Hornik 1989)
  • Any function can be approximated to arbitrary
    accuracy by a network with two hidden layers
    (Cybenko 1988)
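One way to see the boolean-function claim is a direct construction with threshold units, sketched below: one hidden unit per input row whose output is 1, followed by an OR output unit. This particular construction is an illustration chosen here, not taken from the slides, and the helper names are hypothetical. It also shows why the hidden layer can grow quickly, since the number of hidden units equals the number of 1-rows in the truth table.

```python
from itertools import product

def tlu(x, w, theta):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

def build_network(truth_table):
    """One hidden threshold unit per input row whose output is 1."""
    hidden = []
    for row, out in truth_table.items():
        if out == 1:
            w = [1 if bit else -1 for bit in row]  # +1 where the bit is 1, -1 where it is 0
            theta = sum(row) - 0.5                 # reached only by exactly this row
            hidden.append((w, theta))
    return hidden

def evaluate(hidden, x):
    h = [tlu(x, w, theta) for w, theta in hidden]
    return tlu(h, [1] * len(h), 0.5)               # output unit: OR of the hidden units

xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
net = build_network(xor)
print([evaluate(net, x) for x in xor])  # [0, 1, 1, 0]

parity3 = {x: sum(x) % 2 for x in product((0, 1), repeat=3)}
print(len(build_network(parity3)))      # 4 hidden units for 3-input parity
```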