EE645 Neural Networks and Learning Theory - PowerPoint PPT Presentation

Title:

EE645 Neural Networks and Learning Theory

Slides: 72
Provided by: wayneas
Learn more at: http://ee.hawaii.edu

Transcript and Presenter's Notes

1
EE645 Neural Networks and Learning Theory
Spring 2003
Prof. Anthony Kuh, Dept. of Elec. Eng., University of Hawaii
Phone: (808)-956-7527, Fax: (808)-956-3427
Email: kuh_at_spectra.eng.hawaii.edu
2
I. Introduction to neural networks
  • Goal: study the computational capabilities of neural
    networks and learning systems.
  • Multidisciplinary field
  • Algorithms, Analysis, Applications

3
A. Motivation
  • Why study neural networks and machine learning?
  • Biological inspiration (natural computation)
  • Nonparametric models: adaptive learning systems,
    learning from examples, analysis of learning
    models
  • Implementation
  • Applications
  • Cognitive (Human vs. Computer Intelligence)
  • Humans superior to computers in pattern
    recognition, associative recall, learning complex
    tasks.
  • Computers superior to humans in arithmetic
    computations, simple repeatable tasks.
  • Biological (study human brain)
  • $10^{10}$ to $10^{11}$ neurons in the cerebral cortex, with an
    average of $10^3$ interconnections per neuron.

4
A neuron

Schematic of one neuron
5
Neural Network
  • Connection of many neurons together forms a
    neural network.
  • Neural network properties
  • Highly parallel (distributed computing)
  • Robust and fault tolerant
  • Flexible (short and long term learning)
  • Handles a variety of information
    (often random, fuzzy, and inconsistent)
  • Small, compact, dissipates very little power

6
B. Single Neuron
(Computational node)
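[Figure: computational node. Inputs x are weighted by w and summed (Σ) with the threshold weight w0 to give s; g( ) maps s to the output y.]
  • $s = w^T x + w_0$: synaptic strength (linearly
    weighted sum of inputs).
  • $y = g(s)$: activation or squashing function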

7
Activation functions
  • Linear units: $g(s) = s$.
  • Linear threshold units: $g(s) = \mathrm{sgn}(s)$.
  • Sigmoidal units: $g(s) = \tanh(Bs)$, $B > 0$.
  • Neural networks generally have nonlinear
    activation functions.

Most popular models: linear threshold units
and sigmoidal units.
Other types of computational units: receptive
units (radial basis functions).
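The computational node and the three activation functions above can be sketched in a few lines. This is a minimal illustration, not course code: the function names, the numeric values, and the choice sgn(0) = +1 are assumptions made for the example.

```python
import math

def neuron_output(w, x, w0, g):
    """Compute y = g(s) with s = w^T x + w0 for one computational node."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return g(s)

def linear(s):
    # Linear unit: g(s) = s.
    return s

def sgn(s):
    # Linear threshold unit; sgn(0) is taken as +1 here (an assumption).
    return 1.0 if s >= 0 else -1.0

def make_sigmoid(B):
    # Sigmoidal unit g(s) = tanh(B s) with gain B > 0.
    return lambda s: math.tanh(B * s)

# Illustrative weights, input, and threshold weight: s = 0.75.
w, x, w0 = [0.5, -1.0], [2.0, 0.5], 0.25
y_lin = neuron_output(w, x, w0, linear)
y_thr = neuron_output(w, x, w0, sgn)
y_sig = neuron_output(w, x, w0, make_sigmoid(2.0))
```

The same weighted sum s feeds all three units; only g changes, which is why the activation function is the main design choice at the level of a single node.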
8
C. Neural Network Architectures
Systems composed of interconnected neurons
[Figure: network of interconnected neurons, from inputs to output]
Neural network represented by a directed graph:
edges represent weights and nodes represent
computational units.
9
Definitions
  • Feedforward neural network has no loops in
    directed graph.
  • Neural networks are often arranged in layers.
  • Single layer feedforward neural network has one
    layer of computational nodes.
  • Multilayer feedforward neural network has two or
    more layers of computational nodes.
  • Computational nodes that are not output nodes are
    called hidden units.

10
D. Learning and Information Storage
  • Neural networks have computational capabilities.
  • Where is information stored in a neural network?
  • What are parameters of neural network?
  • How does a neural network work? (two phases)
  • Training or learning phase (equivalent to write
    phase in conventional computer memory): weights
    are adjusted to meet a desired criterion.
  • Recall or test phase (equivalent to read phase in
    conventional computer memory): weights are fixed
    as the neural network realizes some task.

11
Learning and Information (continued)
  • 3) What can neural network models learn?
  • Boolean functions
  • Pattern recognition problems
  • Function approximation
  • Dynamical systems
  • 4) What type of learning algorithms are there?
  • Supervised learning (learning with a teacher)
  • Unsupervised learning (no teacher)
  • Reinforcement learning (learning with a critic)

12
Learning and Information (continued)
  • 5) How do neural networks learn?
  • Iterative algorithm: weights of the neural network
    are adjusted on-line as training data is
    received.
  • $w(k+1) = L(w(k), x(k), d(k))$ for supervised
    learning, where $d(k)$ is the desired output.
  • Need a cost criterion. A common cost criterion:
  • Mean squared error for one output: $J(w) = \sum_k (y(k) - d(k))^2$
  • Goal is to find minimum J(w) over all possible w.
    Iterative techniques often use gradient descent
    approaches.

13
Learning and Information (continued)
  • 6) Learning and generalization
  • Learning algorithm takes training examples as
    inputs and produces the concept, pattern, or function
    to be learned.
  • How good is the learning algorithm? Generalization
    ability measures how well the learning algorithm
    performs on examples outside the training set.
  • Sufficient number of training examples. (LLN,
    typical sequences)
  • Occam's razor: the simplest explanation is the
    best.

[Figure: regression problem]
14
Learning and Information (continued)
  • Generalization error
  • $\epsilon_g = \epsilon_{emp} + \epsilon_{model}$
  • Empirical error: average error from training data
    (desired output vs. actual output)
  • Model error: due to dimensionality of the class of
    functions or patterns
  • Desire class to be large enough so that empirical
    error is small and small enough so that model
    error is small.

15
II. Linear threshold units
A. Preliminaries
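[Figure: linear threshold unit. Inputs x are weighted by w and summed (Σ) with the threshold weight w0 to give s; sgn( ) maps s to the output y.]
$\mathrm{sgn}(s) = 1$ if $s > 0$, $-1$ if $s < 0$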
16
Linearly separable
Consider a set of points with two labels, + and o.
A set of points is linearly separable if a linear
threshold function can partition the + points
from the o points.
[Figure: set of linearly separable points]

17
Not linearly separable
A set of labeled points that cannot be
partitioned by a linear threshold function is not
linearly separable.
[Figure: set of points that are not linearly separable]
18
B. Perceptron Learning Algorithm
  • An iterative learning algorithm that can find a
    linear threshold function to partition two sets of
    points.
  • 1) $w(0)$ arbitrary
  • 2) Pick point $(x(k), d(k))$.
  • 3) If $w(k)^T x(k)\, d(k) > 0$, go to 5)
  • 4) $w(k+1) = w(k) + x(k)\, d(k)$
  • 5) $k = k+1$; check if cycled through data; if not, go
    to 2)
  • 6) Otherwise stop.
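The perceptron steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the stopping test is read as "a full cycle through the data with no update", w(0) is set to zero, the threshold weight is handled by appending a constant input of 1, and the toy data and the names perceptron and max_epochs are hypothetical.

```python
def perceptron(points, labels, max_epochs=100):
    """Perceptron learning algorithm.

    points: list of input vectors x(k); labels: desired outputs d(k) in {+1, -1}.
    Stops after a full cycle through the data with no update (assumed reading
    of the stopping rule), or after max_epochs cycles.
    """
    w = [0.0] * len(points[0])          # w(0) is arbitrary; zero is used here
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for x, d in zip(points, labels):
            s = sum(wi * xi for wi, xi in zip(w, x))
            if s * d <= 0:              # point is misclassified or on the boundary
                w = [wi + xi * d for wi, xi in zip(w, x)]   # w(k+1) = w(k) + x(k) d(k)
                updates += 1
                changed = True
        if not changed:
            break
    return w, updates

# Hypothetical toy data; the trailing 1.0 lets the last weight act as the threshold weight.
points = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -0.5, 1.0]]
labels = [1, 1, -1, -1]
w, updates = perceptron(points, labels)
```

On linearly separable data such as this the loop ends with every point satisfying w^T x d > 0; on data that is not linearly separable it only stops because of the max_epochs guard.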

19
PLA comments
  • Perceptron convergence theorem (requires margins)
  • Sketch of proof
  • Updating threshold weights
  • Algorithm is based on the cost function
  • $J(w) = -$(sum of synaptic strengths of
    misclassified points)
  • $w(k+1) = w(k) - \mu(k) \nabla J(w(k))$ (gradient descent)

20
Perceptron Convergence Theorem
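  • Assumptions: $w^*$ is a solution with $\|w^*\| = 1$, no
    threshold, and $w(0) = 0$. Let $\max_k \|x(k)\| = \beta$ and
    $\min_k y(k)\, x(k)^T w^* = \gamma$.
  • $\langle w(k), w^* \rangle = \langle w(k-1) + x(k-1) y(k-1), w^* \rangle \ge \langle w(k-1), w^* \rangle + \gamma \ge k\gamma$.
  • $\|w(k)\|^2 \le \|w(k-1)\|^2 + \|x(k-1)\|^2 \le \|w(k-1)\|^2 + \beta^2 \le k\beta^2$.
  • Implies that $k \le (\beta/\gamma)^2$ (max number of
    updates).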

21
III. Linear Units
A. Preliminaries
[Figure: linear unit. Inputs x are weighted by w and summed (Σ) to give the output y = s.]
22
Model Assumptions and Parameters
  • Training examples (x(k),d(k)) drawn randomly
  • Parameters
  • Inputs x(k)
  • Outputs y(k)
  • Desired outputs d(k)
  • Weights w(k)
  • Error: $e(k) = d(k) - y(k)$
  • Error criterion (MSE)
  • $\min_w J(w) = E[\tfrac{1}{2} e(k)^2]$

23
Wiener solution
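  • Define $P = E(x(k) d(k))$ and $R = E(x(k) x(k)^T)$.
  • $J(w) = \tfrac{1}{2} E(d(k) - y(k))^2$
  • $= \tfrac{1}{2} E(d(k)^2) - E(x(k) d(k))^T w + \tfrac{1}{2} w^T E(x(k) x(k)^T) w$
  • $= \tfrac{1}{2} E(d(k)^2) - P^T w + \tfrac{1}{2} w^T R w$
  • Note J(w) is a quadratic function of w. To
    minimize J(w), find the gradient $\nabla J(w)$ and set it to 0.
  • $\nabla J(w) = -P + Rw = 0$
  • $Rw = P$ (Wiener solution)
  • If R is nonsingular, then $w = R^{-1} P$.
  • Resulting MSE: $\tfrac{1}{2} [E(d(k)^2) - P^T R^{-1} P]$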
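As a small check of the Wiener solution, the sketch below replaces the expectations with sample averages over a handful of made-up examples and solves Rw = P for a two-input linear unit by Cramer's rule. The data, the name wiener_2d, and the use of sample averages in place of E(·) are assumptions for illustration.

```python
def wiener_2d(xs, ds):
    """Estimate R = E[x x^T] and P = E[x d] by sample averages, then solve R w = P."""
    n = len(xs)
    r11 = sum(x[0] * x[0] for x in xs) / n
    r12 = sum(x[0] * x[1] for x in xs) / n
    r22 = sum(x[1] * x[1] for x in xs) / n
    p1 = sum(x[0] * d for x, d in zip(xs, ds)) / n
    p2 = sum(x[1] * d for x, d in zip(xs, ds)) / n
    det = r11 * r22 - r12 * r12
    if abs(det) < 1e-12:
        raise ValueError("R is singular; the Wiener solution is not unique")
    # Cramer's rule for the 2x2 system R w = P.
    w1 = (r22 * p1 - r12 * p2) / det
    w2 = (r11 * p2 - r12 * p1) / det
    return [w1, w2]

# Hypothetical noiseless data generated by a known weight vector.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
w_true = [0.5, -2.0]
ds = [w_true[0] * x[0] + w_true[1] * x[1] for x in xs]
w = wiener_2d(xs, ds)
```

Because the desired outputs here are an exact linear function of the inputs, P equals R times w_true and the solver recovers w_true; with noisy data the result would instead be the least squares fit.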

24
Iterative algorithms
  • Steepest descent algorithm (move in direction of
    negative gradient)
  • $w(k+1) = w(k) - \mu \nabla J(w(k)) = w(k) + \mu (P - R w(k))$
  • Least mean square algorithm
  • (approximate gradient from training example)
  • $\nabla J(w(k)) \approx -e(k) x(k)$
  • $w(k+1) = w(k) + \mu e(k) x(k)$
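The LMS update can be sketched directly from the last rule above. The step size value, the number of passes, the zero initial weights, and the toy data are assumptions chosen so that the example converges; they are not values from the course.

```python
def lms(xs, ds, mu=0.05, epochs=200):
    """Least mean square algorithm for a linear unit y = w^T x."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, d in zip(xs, ds):
            y = sum(wi * xi for wi, xi in zip(w, x))
            e = d - y                                    # e(k) = d(k) - y(k)
            w = [wi + mu * e * xi for wi, xi in zip(w, x)]   # w(k+1) = w(k) + mu e(k) x(k)
    return w

# Hypothetical noiseless data generated by a known weight vector.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
w_true = [0.5, -2.0]
ds = [w_true[0] * x[0] + w_true[1] * x[1] for x in xs]
w = lms(xs, ds)
```

Each update uses only the current example, so no estimate of R or P is needed; the price is that many passes are required before w settles near the Wiener solution.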


25
Steepest Descent Convergence
  • $w(k+1) = w(k) + \mu (P - R w(k))$. Let $w^*$ be the solution.
  • Center weight vector: $v = w - w^*$
  • $v(k+1) = v(k) - \mu R v(k)$. Assume R is
    nonsingular.
  • Decorrelate weight vector: $u = Q^{-1} v$, where $R = Q \Lambda Q^{-1}$
    is the transformation that diagonalizes R.
  • $u(k+1) = (I - \mu \Lambda) u(k)$, so $u(k) = (I - \mu \Lambda)^k u(0)$.
  • Condition for convergence: $0 < \mu < 2/\lambda_{max}$.
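The convergence condition can be checked numerically on the decoupled modes: each component of u is multiplied by (1 - μλ) at every step. The eigenvalues and step sizes below are hypothetical values picked to sit just inside and just outside the bound.

```python
def mode_magnitude(mu, lam, u0=1.0, steps=50):
    """Return |u(steps)| for one decoupled mode u(k+1) = (1 - mu * lam) u(k)."""
    return abs((1.0 - mu * lam) ** steps * u0)

eigenvalues = [0.5, 2.0, 4.0]          # hypothetical eigenvalues of R
lam_max = max(eigenvalues)

mu_ok = 0.9 * 2.0 / lam_max            # inside 0 < mu < 2 / lam_max
mu_bad = 1.1 * 2.0 / lam_max           # outside the bound

final_ok = [mode_magnitude(mu_ok, lam) for lam in eigenvalues]
final_bad = [mode_magnitude(mu_bad, lam) for lam in eigenvalues]
```

With mu_ok every mode shrinks toward zero, while with mu_bad the mode belonging to the largest eigenvalue grows, since |1 - μλ| exceeds 1 for that mode.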

26
LMS Algorithm Properties
  • Steepest descent and LMS algorithm convergence
    depends on the step size $\mu$ and the eigenvalues of R.
  • LMS algorithm is simple to implement.
  • LMS algorithm convergence is relatively slow.
  • Tradeoff between convergence speed and excess
    MSE.
  • LMS algorithm can track training data that is
    time varying.

27
Adaptive MMSE Methods
  • Training data
  • Linear MMSE: LMS, RLS algorithms
  • Nonlinear: decision feedback detectors
  • Blind algorithms
  • Second order statistics
  • Minimum Output Energy Methods
  • Reduced order approximations: PCA, multistage
    Wiener filter
  • Higher order statistics
  • Cumulants, Information based criteria

28
Designing a learning system
  • Given a set of training data, design a system
    that can realize the desired task.

[Figure: block diagram. Inputs -> Signal Processing -> Feature Extraction -> Neural Network -> Outputs]
29
IV. Multilayer Networks
  • A. Capabilities
  • Depend directly on total number of weights and
    threshold values.
  • A one hidden layer network with a sufficient number
    of hidden units can approximate arbitrarily well any
    Boolean function, pattern recognition problem,
    or well-behaved function approximation problem.
  • Sigmoidal units more powerful than linear
    threshold units.

30
B. Error backpropagation
  • Error backpropagation algorithm: a methodical way
    of implementing the LMS algorithm for multilayer
    neural networks.
  • Two passes: forward pass (computational pass),
    backward pass (weight correction pass).
  • Analog computations based on MSE criterion.
  • Hidden units usually sigmoidal units.
  • Initialization: weights take on small random
    values.
  • Algorithm may not converge to global minimum.
  • Algorithm converges slower than for linear
    networks.
  • Representation is distributed.
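The two passes can be sketched for a small network and checked against finite differences. This is an illustration under assumptions, not the course's implementation: the network has two tanh hidden units and one linear output unit, the cost is J = 0.5 (d - y)^2 for a single example, and all names and numeric values are hypothetical.

```python
import math

def forward(params, x):
    """Forward pass: tanh hidden units feeding one linear output unit."""
    W1, b1, w2, b2 = params
    h = [math.tanh(sum(wij * xj for wij, xj in zip(row, x)) + bi)
         for row, bi in zip(W1, b1)]
    y = sum(wi * hi for wi, hi in zip(w2, h)) + b2
    return h, y

def cost(params, x, d):
    """Squared-error cost J = 0.5 * (d - y)^2 for one example."""
    _, y = forward(params, x)
    return 0.5 * (d - y) ** 2

def backprop(params, x, d):
    """Backward pass: gradients of J with respect to every weight."""
    W1, b1, w2, b2 = params
    h, y = forward(params, x)
    delta_out = y - d                         # error term at the linear output
    grad_w2 = [delta_out * hi for hi in h]
    grad_b2 = delta_out
    # tanh'(s) = 1 - tanh(s)^2, so the hidden error terms reuse h.
    delta_hid = [delta_out * wi * (1.0 - hi * hi) for wi, hi in zip(w2, h)]
    grad_W1 = [[dh * xj for xj in x] for dh in delta_hid]
    grad_b1 = list(delta_hid)
    return grad_W1, grad_b1, grad_w2, grad_b2

def numeric_grad_W1(params, x, d, i, j, eps=1e-6):
    """Central finite difference of J with respect to W1[i][j]."""
    W1, b1, w2, b2 = params
    plus = [row[:] for row in W1]
    minus = [row[:] for row in W1]
    plus[i][j] += eps
    minus[i][j] -= eps
    return (cost((plus, b1, w2, b2), x, d) - cost((minus, b1, w2, b2), x, d)) / (2 * eps)

# Hypothetical weights (W1, b1, w2, b2) and one training example.
params = ([[0.1, -0.2], [0.3, 0.4]], [0.05, -0.05], [0.7, -0.6], 0.1)
x, d = [1.0, -1.0], 0.5
grad_W1, grad_b1, grad_w2, grad_b2 = backprop(params, x, d)
max_diff = max(abs(grad_W1[i][j] - numeric_grad_W1(params, x, d, i, j))
               for i in range(2) for j in range(2))
```

A gradient descent step would subtract the step size times each gradient from the corresponding weight. Comparing against finite differences, as max_diff does, is a common way to catch sign or indexing mistakes in the backward pass.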

31
BP Algorithm Comments
  • $\delta$s are error terms computed from the output layer
    back to the first layer in the dual network.
  • Training is usually done online.
  • Examples presented in random or sequential order.
  • Update rule is local: a weight change involves only
    the units that the weight connects.
  • Computational complexity depends on number of
    computational units.
  • Initial weights randomized to avoid converging to
    local minima.

32
BP Algorithm Comment continued
  • Threshold weights updated in a similar manner to
    other weights (input = 1).
  • Momentum term added to speed up convergence.
  • Step size set to small value.
  • Sigmoidal activation derivatives simple to
    compute.

33
BP Architecture
[Figure: BP architecture. Forward network: outputs of computational units are calculated. Sensitivity network: error terms are calculated.]
34
Modifications to BP Algorithm
  • Batch procedure
  • Variable step size
  • Better approximation of gradient method (momentum
    term, conjugate gradient)
  • Newton methods (Hessian)
  • Alternate cost functions
  • Regularization
  • Network construction algorithms
  • Incorporating time

35
When to stop training
  • First major features captured. As training
    continues minor features captured.
  • Look at training error.
  • Crossvalidation (training, validation, and test
    sets)

[Figure: testing error and training error curves]
Learning typically slow and may find flat
learning areas with little improvement in energy
function.
36
C. Radial Basis Functions
  • Use locally receptive units (potential functions)
  • Transform input space to hidden unit space via
    potential functions.
  • Output unit is linear.

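[Figure: RBF network. Inputs feed the potential units $\phi$; a linear unit (Σ) combines their outputs to give the output.]
Potential units: $\phi(x) = \exp(-\tfrac{1}{2} \|x - c\|^2 / \sigma^2)$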
37
Transformation of input space
[Figure: X and O points in the input space are mapped by $\phi$ to the feature space.]
$\phi: X \to Z$
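A small sketch of this transformation: two Gaussian potential units map the four XOR corner points into a feature space where one linear threshold separates the classes. The centers, the width sigma = 1, the labels, and the threshold 1.3 are assumptions picked for the illustration.

```python
import math

def potential(x, c, sigma=1.0):
    """Gaussian potential unit phi(x) = exp(-0.5 * ||x - c||^2 / sigma^2)."""
    dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-0.5 * dist2 / sigma ** 2)

centers = [[0.0, 0.0], [1.0, 1.0]]     # hypothetical centers of the two hidden units

def to_feature_space(x):
    """Map an input point to the vector of potential-unit outputs."""
    return [potential(x, c) for c in centers]

# XOR-style labels: not linearly separable in the input space.
xor_points = {(0.0, 0.0): "O", (1.0, 1.0): "O", (0.0, 1.0): "X", (1.0, 0.0): "X"}
features = {p: to_feature_space(p) for p in xor_points}

# In feature space a single linear threshold on z1 + z2 separates the labels.
predicted = {p: ("O" if sum(z) > 1.3 else "X") for p, z in features.items()}
```

The O points sit on a center, so one of their features is 1, while the X points are equally far from both centers; the sum of the two features therefore differs between the classes.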
38
Training Radial basis functions
  • Use gradient descent on unknown parameters:
    centers, widths, and output weights
  • Separate tasks for quicker training (first layer:
    centers, widths), (second layer: weights)
  • First layer
  • Fix widths, centers determined from lattice
    structure
  • Fix widths, clustering algorithm for centers
  • Resource allocation network
  • Second layer: use LMS to learn weights

39
Comparisons between RBFs and BP Algorithm
  • RBF networks have a single hidden layer, whereas
    networks trained with the BP algorithm can have many
    hidden layers.
  • RBF (potential functions): locally receptive units,
    versus BP algorithm (sigmoidal units): distributed
    representations.
  • RBF networks typically have many more hidden units.
  • RBF training is typically quicker.

40
V. Alternate Detection Method
  • Consider detection methods based on optimum
    margin classifiers or Support Vector Machines
    (SVM)
  • SVM are based on concepts from statistical
    learning theory.
  • SVM are easily extended to nonlinear decision
    regions via kernel functions.
  • SVM solutions involve solving quadratic
    programming problems.

41
Optimal Margin Classifiers
Given a set of points that are linearly separable,
which hyperplane should you choose to separate the
points?
Choose the hyperplane that maximizes the distance between
the two sets of points.
[Figure: linearly separable X and O points]
42
Finding Optimal Hyperplane
  • Draw convex hull around each set of points.
  • Find shortest line segment connecting two convex
    hulls.
  • Find midpoint of line segment.
  • Optimal hyperplane intersects line segment at
    midpoint, perpendicular to line segment.

[Figure: X and O points with margins, weight vector w, and the optimal hyperplane]
43
Alternative Characterization of Optimal
Margin Classifiers
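Maximizing margins is equivalent to minimizing the
magnitude of the weight vector.
$w^T u + b = 1$
$w^T v + b = -1$
$w^T (u - v) = 2$
$w^T (u - v) / \|w\| = 2 / \|w\| = 2m$
[Figure: X and O points with margin points u and v, weight vector w, and margin width 2m]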
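The margin relation can be verified on a tiny numeric case. The weight vector, threshold, and the two margin points below are hypothetical values chosen so that w^T u + b = 1 and w^T v + b = -1 hold exactly.

```python
import math

def decision(w, b, x):
    """Value of w^T x + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical hyperplane and one margin point on each side.
w, b = [0.5, 0.5], -1.0
u, v = [2.0, 2.0], [0.0, 0.0]

norm_w = math.sqrt(sum(wi * wi for wi in w))
margin_width = 2.0 / norm_w                       # 2m in the notation above
projected_gap = sum(wi * (ui - vi) for wi, ui, vi in zip(w, u, v)) / norm_w
```

Both quantities come out equal, which is the point of the derivation: with the constraints fixed at +1 and -1, a smaller norm of w means a wider margin.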
44
Solution in 1 Dimension
O O O O O X O X X O X X X
Points on wrong side of hyperplane
If C is large SV include
If C is small SV include all points (scaled MMSE
solution)
Note that weight vector depends most heavily on
outer support vectors.
45
Comments on 1 Dimensional Solution
  • Simple algorithm can be implemented to solve 1D
    problem.
  • Solution in multiple dimensions is finding
    weight and then projecting down to 1D.
  • Min. probability of error threshold depends on
    likelihood ratio.
  • MMSE solution depends on all points, whereas SVM
    depends on the SVs (points that are under the margin),
    closer to min. probability of error.
  • Min. probability of error, MMSE solution, and
    SVM in general give different detectors.

46
Kernel Methods
In many classification and detection problems a
linear classifier is not sufficient. However,
working in higher dimensions can lead to curse
of dimensionality.
Solution: use kernel methods, where computations are
done in the dual observation space.
[Figure: X and O points in the input space are mapped by $\phi$ to the feature space.]
$\phi: X \to Z$
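One way to see why kernels avoid working in the higher dimensional space: for some feature maps the inner product in feature space equals a simple function of the inner product in input space. The quadratic kernel and its explicit map below are a standard textbook example chosen for illustration; the slides do not name a specific kernel.

```python
import math

def phi(x):
    """Explicit feature map for the 2-D homogeneous quadratic kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def kernel(x, z):
    """K(x, z) = (x^T z)^2, computed entirely in the input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = [1.0, 2.0], [3.0, -1.0]
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))   # inner product in feature space
implicit = kernel(x, z)                                 # same value without forming phi
```

The two numbers agree, so an algorithm that only needs inner products between examples can use kernel(x, z) and never build the feature vectors.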
47
Solving QP problem
  • SVMs require solving large QP problems. However,
    many $\alpha$s are zero (not support vectors). Break up
    the QP into subproblems.
  • Chunking (Vapnik 1979): numerical solution.
  • Osuna algorithm (1997): numerical solution.
  • Platt algorithm (1998), Sequential Minimal
    Optimization (SMO): analytical solution.

48
SMO Algorithm
  • Sequential Minimal Optimization breaks up the QP
    program into small subproblems that are solved
    analytically.
  • SMO solves the dual QP SVM problem by examining
    points that violate KKT conditions.
  • Algorithm converges and consists of
  • Search for 2 points that violate KKT conditions.
  • Solve QP program for 2 points.
  • Calculate threshold value b.
  • Continue until all points satisfy KKT conditions.
  • On numerous benchmarks, time to convergence of
    SMO varied from $O(l)$ to $O(l^{2.2})$.
    Convergence time depends on the difficulty of the
    classification problem and the kernel functions used.

49
SVM Summary
  • SVM are based on optimum margin classifiers and
    are solved using quadratic programming methods.
  • SVM are easily extended to problems that are not
    linearly separable.
  • SVM can create nonlinear separating surfaces via
    kernel functions.
  • SVM can be efficiently programmed via the SMO
    algorithm.
  • SVM can be extended to solve regression problems.

50
VI. Unsupervised Learning
  • A. Motivation
  • Given a set of training examples with no teacher
    or critic, why do we learn?
  • Feature extraction
  • Data compression
  • Signal detection and recovery
  • Self organization
  • Information can be found about data from inputs.

51
B. Principal Component Analysis
  • Introduction
  • Consider a zero mean random vector $x \in R^n$ with
    autocorrelation matrix $R = E(x x^T)$.
  • R has eigenvectors $q(1), \ldots, q(n)$ and associated
    eigenvalues $\lambda(1) \ge \cdots \ge \lambda(n)$.
  • Let $Q = [q(1) \cdots q(n)]$ and $\Lambda$ be a diagonal
    matrix containing the eigenvalues along the diagonal.
  • Then R can be decomposed into its eigenvector and
    eigenvalue decomposition, $R = Q \Lambda Q^T$.

52
First Principal Component
  • Find $\max x^T R x$ subject to $\|x\| = 1$.
  • Maximum obtained when $x = q(1)$, as this
    corresponds to $x^T R x = \lambda(1)$.
  • $q(1)$ is the first principal component of x and also
    yields the direction of maximum variance.
  • $y(1) = q(1)^T x$ is the projection of x onto the first
    principal component.

[Figure: data vector x projected onto q(1) to give y(1)]
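The first principal component can be found numerically with power iteration, sketched below for a small matrix. Power iteration is one standard way to get the leading eigenvector and is used here as an illustration; the matrix R, the starting vector, and the iteration count are assumptions.

```python
import math

def first_principal_component(R, iters=200):
    """Power iteration: return (q1, lambda1) for a symmetric matrix R."""
    n = len(R)
    q = [1.0] * n                       # starting vector (must not be orthogonal to q1)
    for _ in range(iters):
        v = [sum(R[i][j] * q[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        q = [vi / norm for vi in v]     # keep ||q|| = 1
    Rq = [sum(R[i][j] * q[j] for j in range(n)) for i in range(n)]
    lam = sum(qi * rqi for qi, rqi in zip(q, Rq))   # q^T R q
    return q, lam

R = [[3.0, 1.0], [1.0, 2.0]]            # hypothetical autocorrelation matrix
q1, lam1 = first_principal_component(R)

# Any other unit vector gives a smaller value of x^T R x, for example x = (1, 0).
value_at_e1 = R[0][0]
```

For this R the largest eigenvalue is (5 + sqrt(5))/2, and the returned q1 attains it, while the unit vector (1, 0) only reaches 3.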
53
Other Principal Components
  • ith principal component denoted by $q(i)$ and
    projection denoted by $y(i) = q(i)^T x$, with
    $E(y(i)) = 0$ and $E(y(i)^2) = \lambda(i)$.
  • Note that $y = Q^T x$, and we can obtain the data vector
    x from y by noting that $x = Qy$.
  • We can approximate x by taking the first m principal
    components (PC) to get $z = q(1) y(1) + \cdots + q(m) y(m)$.
    Error given by $e = x - z$. e is orthogonal
    to $q(i)$ when $1 \le i \le m$.

54
Diagram of PCA
[Figure: scatter of data points (x) with the first PC and second PC directions marked]
First PC gives more information than second PC.
55
Learning algorithms for PCA
  • Hebbian learning rule: when presynaptic and
    postsynaptic signals are positive, the weight
    associated with the synapse increases in strength.

[Figure: synapse with weight w connecting input x to output y]
$\Delta w = \mu x y$
56
Oja's rule
  • Use normalized Hebbian rule applied to linear
    neuron.

[Figure: linear unit. Inputs x are weighted by w and summed (Σ) to give the output y = s.]
Need normalized Hebbian rule; otherwise the
weight vector will grow unbounded.
57
Oja's rule continued
  • $w_i(k+1) = w_i(k) + \mu x_i(k) y(k)$ (apply Hebbian
    rule)
  • $w(k+1) = w(k+1) / \|w(k+1)\|$ (renormalize weight)
  • Unfortunately the above rule is difficult to
    implement, so a modification approximates it,
    giving
  • $w_i(k+1) = w_i(k) + \mu y(k) (x_i(k) - y(k) w_i(k))$
  • Similar to Hebbian rule with modified input.
  • Can show that $w(k) \to q(1)$ with probability one
    given that x(k) is zero mean second order and
    drawn from a fixed distribution.
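A short simulation of the modified rule, under assumptions: the data is zero mean Gaussian with standard deviation 2 along a known direction q1 and 0.5 along the orthogonal direction, the step size is a small constant, and the random seed, sample count, and starting weights are arbitrary choices for the example.

```python
import math
import random

def oja(samples, mu=0.005, w_init=(0.0, 1.0)):
    """Oja's rule: w(k+1) = w(k) + mu * y(k) * (x(k) - y(k) * w(k))."""
    w = list(w_init)
    for x in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + mu * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

random.seed(0)
theta = math.pi / 6
q1 = [math.cos(theta), math.sin(theta)]      # direction of largest variance
q2 = [-math.sin(theta), math.cos(theta)]     # orthogonal direction

samples = []
for _ in range(5000):
    a = random.gauss(0.0, 2.0)               # large variance along q1
    b = random.gauss(0.0, 0.5)               # small variance along q2
    samples.append([a * q1[0] + b * q2[0], a * q1[1] + b * q2[1]])

w = oja(samples)
alignment = abs(sum(wi * qi for wi, qi in zip(w, q1)))   # |w^T q1|
norm = math.sqrt(sum(wi * wi for wi in w))
```

The weight vector ends up close to unit length and nearly parallel to q1 (the sign is arbitrary), without any explicit renormalization step.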

58
Learning other PCs
  • Adaptive learning rules (subtract larger PCs
    out)
  • Generalized Hebbian Algorithm
  • APEX
  • Batch Algorithm (singular value decomposition)
  • Approximate correlation matrix R with time
    averages.

59
Applications of PCA
  • Matched filter problem: $x(k) = s(k) + \sigma v(k)$.
  • Multiuser communications: CDMA
  • Image coding (data compression)

[Figure: image coding block diagram with PCA (GHA) and a quantizer]
61
C. Independent Component Analysis
  • PCA decorrelates inputs. However in many
    instances we may want to make outputs independent.

[Figure: independent inputs U pass through A to give X; W maps X to the outputs Y.]
Inputs U are assumed independent and the user sees
X. Goal is to find W so that the components of Y are independent.
62
ICA Solution
  • $Y = DPU$, where D is a diagonal matrix and P is a
    permutation matrix.
  • Algorithm is unsupervised. What are the assumptions
    under which learning is possible? All components of U
    except possibly one are non-Gaussian.
  • Establish a criterion to learn from (use higher
    order statistics): information based criteria,
    kurtosis function.
  • Kullback-Leibler divergence
  • $D(f,g) = \int f(x) \log (f(x)/g(x))\, dx$

63
ICA Information Criterion
  • Kullback-Leibler divergence is nonnegative.
  • Set f to the joint density of Y and g to the product of
    the marginals of Y; then
  • $D(f,g) = -H(Y) + \sum_i H(Y_i)$
  • which is minimized when the components of Y are
    independent.
  • When outputs are independent they can be a
    permuted and scaled version of U.
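The criterion can be illustrated with discrete distributions in place of densities (a simplification made for this example): the divergence between a joint distribution and the product of its marginals is zero when the components are independent and positive otherwise. The two probability tables are made up.

```python
import math

def entropy(p):
    """Entropy of a discrete distribution, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def divergence_from_independence(joint):
    """D(f, g) = -H(Y) + sum_i H(Y_i) for a joint pmf given as a 2-D table."""
    flat = [p for row in joint for p in row]
    marginal_1 = [sum(row) for row in joint]
    marginal_2 = [sum(col) for col in zip(*joint)]
    return -entropy(flat) + entropy(marginal_1) + entropy(marginal_2)

# Outer product of (0.4, 0.6) and (0.3, 0.7): independent components.
independent = [[0.12, 0.28], [0.18, 0.42]]
# Mass concentrated on the diagonal: dependent components.
dependent = [[0.4, 0.1], [0.1, 0.4]]

d_ind = divergence_from_independence(independent)
d_dep = divergence_from_independence(dependent)
```

The independent table gives a divergence of zero up to rounding, and the dependent table gives a strictly positive value, matching the nonnegativity stated above.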

64
Learning Algorithms
  • Can learn weights by approximating divergence
    cost function using contrast functions.
  • Iterative gradient estimate algorithms can be
    used.
  • Faster convergence can be achieved with fixed
    point algorithms that approximate Newton's
    method.
  • Algorithms have been shown to converge.

65
Applications of ICA
  • Array antenna processing
  • Blind source separation: speech separation,
    biomedical signals, financial data

66
D. Competitive Learning
  • Motivation: neurons compete with one another, with
    only one winner emerging.
  • Brain is a topologically ordered computational
    map.
  • Array of neurons self organize.
  • Generalized competitive learning algorithm.
  • 1) Initialize weights
  • 2) Randomly choose inputs
  • 3) Pick winner.
  • 4) Update weights associated with winner.
  • 5) Go to 2).
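The loop above can be sketched as a winner-take-all update. Several details are assumptions, since the slide does not specify them: the winner is the weight vector closest to the input in Euclidean distance, the winner moves a fraction mu of the way toward the input, inputs are presented in a fixed order rather than randomly so the run is repeatable, and the data and initial weights are made up.

```python
def competitive_learning(inputs, weights, mu=0.1, epochs=20):
    """Winner-take-all learning: only the closest weight vector moves toward each input."""
    weights = [list(w) for w in weights]
    for _ in range(epochs):
        for x in inputs:                # fixed order here; the slide chooses inputs randomly
            dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
            winner = dists.index(min(dists))
            # Assumed update rule: move the winner a fraction mu toward the input.
            weights[winner] = [wi + mu * (xi - wi)
                               for wi, xi in zip(weights[winner], x)]
    return weights

# Two hypothetical clusters, near (0.1, 0.1) and near (5, 5).
inputs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
initial = [[1.0, 1.0], [4.0, 4.0]]
final = competitive_learning(inputs, initial)
```

Each weight vector settles near the center of the cluster it keeps winning, which is the behavior the K-means style algorithms on the next slide build on.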

67
Competitive Learning Algorithm
  • K means algorithm (no topological ordering)
  • Online algorithm
  • Update centers
  • Reclassify points
  • Converges to local minima
  • Kohonen Self Organization Feature Map
    (topological ordering)
  • Neurons arranged on lattice
  • Weights that are updated depend on winner, step
    size, and neighborhood.
  • Decrease step size and neighborhood size to get
    topological ordering.

68
KSOFM 2 dimensional lattice
69
Neural Network Applications
  • Backgammon (Feedforward network)
  • 459-24-24-1 network to rate moves
  • Hand crafted examples, noise helped in training
  • 59% winning percentage against SUN gammontools
  • Later versions used reinforcement learning
  • Handwritten zip code (Feedforward network)
  • 16-768-192-30-10 network to distinguish numbers
  • Preprocessed data, 2 hidden layers act as
    feature detectors
  • 7291 training examples, 2000 test examples
  • Error rates: training data 0.14%, test data 5%,
    test/reject data 1%, 12%

70
Neural Network Applications
  • Speech recognition
  • KSOFM map followed by feedforward neural network
  • 40-120 frames mapped onto 12 by 12 Kohonen map
  • Each frame composed of 600 to 1800 analog vector
  • Output of Kohonen map fed to feedforward network
  • Reduced search using KSOFM map
  • TI 20 word data base: 98-99% correct on speaker
    dependent classification

71
Other topics
  • Reinforcement learning
  • Associative networks
  • Neural dynamics and control
  • Computational learning theory
  • Bayesian learning
  • Neuroscience
  • Cognitive science
  • Hardware implementation