Artificial Neural Networks

Provided by: timokn

1
Artificial Neural Networks
  • Threshold units
  • Gradient descent
  • Multilayer networks
  • Backpropagation
  • Hidden layer representations
  • Example
  • Advanced topics

2
Biological motivation
  • Biological learning system (brain)
  • complex network of neurons
  • ANN
  • network of simple units
  • real-valued inputs and outputs

3
Brain must be parallel
  • Properties
  • ~10^10 neurons, each connected to ~10^4 others
  • switching time ~0.001 s
  • scene recognition in ~0.1 s
  • room for only ~100 sequential inference steps?
  • Two schools
  • modeling biological systems
  • just building learning systems

4
Prototypical ANN
  • Units interconnected in layers
  • directed, acyclic graph (DAG)
  • Network structure is fixed
  • learning = weight adjustment
  • backpropagation algorithm

5
Appropriate problems
  • Instances: vectors of attributes
  • discrete or real values
  • Target function
  • discrete, real, vector
  • Noisy data
  • Long training times acceptable
  • Fast evaluation
  • No need to be readable

6
Perceptrons
  • Structure and function
  • inputs, weights, threshold
  • hypotheses in weight vector space
  • Representational power
  • defines a hyperplane decision surface
  • linearly separable problems
  • most boolean functions
  • m-of-n problems

7
Representational power
  • Nonseparable problem: XOR
  • Networks?
  • AND, OR, NOT with a single unit
  • any boolean function
  • actually 2 layers suffice (CNF form)
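The representability claim can be checked directly. A minimal sketch (weights are my own choice, not from the slides) of a 2-layer network of threshold units computing XOR, which no single perceptron can represent:

```python
# Sketch: a 2-layer threshold-unit network computing XOR.
# Hidden layer computes OR and NAND; the output unit ANDs them.

def threshold_unit(weights, bias, x):
    """Perceptron-style unit: outputs 1 if w . x + b > 0, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def xor_net(x1, x2):
    h_or = threshold_unit([1, 1], -0.5, [x1, x2])      # OR(x1, x2)
    h_nand = threshold_unit([-1, -1], 1.5, [x1, x2])   # NAND(x1, x2)
    return threshold_unit([1, 1], -1.5, [h_or, h_nand])  # AND of hidden units

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # prints the XOR truth table
```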

8
Perceptron training rule
  • Learn weights of a single unit
  • Problem
  • given examples labeled +1 / -1
  • learn weight vector
  • Algorithms
  • perceptron rule, delta rule,
  • guaranteed to converge
  • different acceptable hypotheses and assumptions

9
Training procedure
  • Begin with random weights
  • REPEAT
  • FOR each <x, c(x)> IN D
  • IF h(x) <> c(x) THEN
  • adjust weights
  • UNTIL no errors made
  • Notation (traditional)
  • t = c(x), o = h(x)

10
Training rule
  • Adjustment (the training rule)
  • wi ← wi + Δwi
  • Δwi = η(t - o)xi
  • η > 0 is the learning rate parameter
  • Why does it work?
  • t = o → no change
  • t = 1, o = -1 → weight increases
  • Fact: converges (small η, linearly separable data)
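The procedure and rule above can be sketched as follows, on the AND function with the +1/-1 labeling (function and variable names are my own):

```python
# Sketch of the perceptron training rule: adjust weights only on
# misclassified examples, repeat until no errors are made.

def train_perceptron(data, eta=0.1, max_epochs=100):
    w = [0.0, 0.0]   # weights for the two inputs
    b = 0.0          # threshold carried as bias weight w0 (with x0 = 1)
    for _ in range(max_epochs):
        errors = 0
        for x, t in data:
            o = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1
            if o != t:                          # IF h(x) <> c(x) THEN adjust
                w[0] += eta * (t - o) * x[0]    # delta_wi = eta * (t - o) * xi
                w[1] += eta * (t - o) * x[1]
                b += eta * (t - o)              # x0 = 1
                errors += 1
        if errors == 0:                         # UNTIL no errors made
            break
    return w, b

# AND is linearly separable, so convergence is guaranteed for small eta
AND = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w, b = train_perceptron(AND)
```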

11
Delta rule
  • Works reasonably with non-separable data
  • Minimizes error
  • Gradient descent method
  • basis of Backpropagation method
  • basis for methods working in multidimensional
    continuous spaces

12
Delta rule
  • Consider linear unit (no threshold)
  • output = dot product w · x
  • w0 = threshold weight, x0 = 1
  • Minimize squared error
  • E(w) = 1/2 Σ_d (t_d - o_d)^2
  • function of w (o depends on it)
  • training set assumed constant
  • later h minimizing E is the most probable one

13
Hypothesis space
  • Example case two weights
  • error surface E(w)
  • parabolic (by definition)
  • single global minimum
  • arrow: negated gradient at one point
  • steepest descent along the surface

14
Derivation of the rule
  • Compute the gradient of E(w)
  • vector ∇E(w) of partial derivatives
  • specifies the direction of steepest increase in E
  • training rule: Δw = -η∇E(w)
  • componentwise
  • wi ← wi + Δwi, Δwi = -η ∂E/∂wi
  • wi is changed in proportion to ∂E/∂wi

15
Practical algorithm
  • Efficient computation of ∂E/∂wi
  • reduces to Σ_d (t_d - o_d)(-x_id)
  • Converges to minimum error
  • too large η → may oscillate
  • common modification: gradually decrease the learning
    rate
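The batch algorithm can be sketched as follows for a two-weight linear unit (a toy fit of my own, not from the slides):

```python
# Sketch of batch gradient descent for a linear unit (delta rule),
# minimizing E(w) = 1/2 * sum_d (t_d - o_d)^2 over the whole training set.

def gradient_descent(data, eta=0.05, epochs=500):
    w = [0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x, t in data:
            o = w[0] * x[0] + w[1] * x[1]        # linear unit: o = w . x
            for i in range(2):
                grad[i] += (t - o) * (-x[i])     # dE/dwi = sum_d (t-o)(-xi)
        for i in range(2):
            w[i] -= eta * grad[i]                # wi <- wi - eta * dE/dwi
    return w

# Toy data generated by t = 2*x0 + 3*x1 (x0 acts as the constant input)
data = [([1, 0], 2.0), ([1, 1], 5.0), ([1, 2], 8.0)]
w = gradient_descent(data)
```

The error surface here is the parabola from the previous slide, so descent reaches the single global minimum at w ≈ (2, 3).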

16
Problems of gradient descent
  • Gradient descent
  • continuous hypothesis space
  • error can be differentiated wrt hypothesis
    parameters
  • Difficulties
  • converging can be slooooow
  • many local minima → no guarantee we find the
    global one

17
Stochastic approximation
  • Variation of the batch method
  • alleviates previous difficulties
  • incremental / stochastic descent
  • approximate gradient for each example (not the
    whole D)
  • Δwi = η(t - o)xi
  • distinct error function for each example, we try
    to minimize all
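The incremental variant amounts to moving the weight update inside the loop over examples. A sketch on the same toy linear fit as above (names and data are my own):

```python
# Sketch of the stochastic (incremental) variant: update after each
# example with delta_wi = eta * (t - o) * xi, instead of summing over D.
import random

def stochastic_descent(data, eta=0.05, epochs=500, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(epochs):
        rng.shuffle(data)                        # visit examples in random order
        for x, t in data:
            o = w[0] * x[0] + w[1] * x[1]        # linear unit
            for i in range(2):
                w[i] += eta * (t - o) * x[i]     # per-example step
    return w

data = [([1, 0], 2.0), ([1, 1], 5.0), ([1, 2], 8.0)]
w = stochastic_descent(data)
```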

18
Incremental
  • Fact
  • stochastic gradient approximates the true
    gradient arbitrarily closely (?)
  • Differences
  • batch method requires more computation per update
    but can use a larger learning rate
  • stochastic is better in avoiding local minima

19
Remarks
  • Perceptron rule
  • thresholded output
  • converges after a finite number of iterations
  • provided data is separable
  • Delta rule
  • unthresholded
  • asymptotic convergence
  • regardless of training data

20
Multilayer Networks
  • Complex decision surfaces
  • nonlinear (new type of unit)
  • example speech signal recognition
  • Learning
  • backpropagation algorithm
  • based on gradient descent

21
Differentiable TU
  • Linear units?
  • only linear functions
  • Perceptrons?
  • discontinuous (step function)
  • unsuitable for gradient methods
  • Sigmoid unit
  • much like a perceptron
  • smoother, differentiable

22
Sigmoid unit
  • Output = σ(net)
  • net = w · x (net input)
  • σ(y) = 1/(1 + e^-y)
  • Nice derivative
  • σ'(y) = σ(y)(1 - σ(y))
  • Other possibilities
  • 1/(1 + e^-ky) (k > 0)
  • tanh
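The function and its derivative in code (a direct transcription of the formulas above):

```python
# The sigmoid and its convenient derivative, sigma'(y) = sigma(y)(1 - sigma(y)).
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_prime(y):
    s = sigmoid(y)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_prime(0.0))  # 0.25
```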

23
Backpropagation alg.
  • Single unit
  • gradient ∂E/∂wi
  • = -Σ_d (t_d - o_d) o_d (1 - o_d) x_id
  • Multiple output units
  • re-define E as the sum of the individual errors
  • Search space
  • space of all weights of the network
  • try to minimize E

24
Backpropagation
  • Stochastic variant
  • Notation
  • x_ji = input from node i to node j
  • w_ji = associated weight
  • δ_k = error term for unit k
  • analogous to (t - o) in the delta rule
  • δ_k = -∂E/∂net_k

25
Weight update rule
  • Much like delta rule
  • (t - o) replaced by δ (derived later)
  • Intuition
  • output units: (t - o) times the derivative
  • hidden unit h? (all we know are the targets t)
  • sum the δ of the output units connected to h
  • weight the errors with w_kh
  • how much h is responsible for δ_k
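The stochastic variant for a tiny 2-2-1 sigmoid network can be sketched as below. Class and variable names are my own; as the later slides note, this need not solve XOR from every random initialization (local minima), so the training loop is illustrative:

```python
# Sketch of stochastic backpropagation for a 2-2-1 sigmoid network.
# delta_o: output unit, (t - o) times the derivative o(1 - o).
# delta_h: hidden unit, its derivative times the weighted output delta.
import math, random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

class TinyNet:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        # w_h[j]: weights into hidden unit j (index 2 is the bias weight)
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
        # w_o: weights into the output unit (index 2 is the bias weight)
        self.w_o = [rng.uniform(-0.5, 0.5) for _ in range(3)]

    def forward(self, x):
        self.h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in self.w_h]
        self.o = sigmoid(self.w_o[0] * self.h[0] + self.w_o[1] * self.h[1] + self.w_o[2])
        return self.o

    def backprop(self, x, t, eta=0.5):
        o = self.forward(x)
        delta_o = (t - o) * o * (1 - o)
        # hidden deltas use the OLD output weights, so compute them first
        delta_h = [self.h[j] * (1 - self.h[j]) * self.w_o[j] * delta_o
                   for j in range(2)]
        for j in range(2):                       # w_ji += eta * delta_j * x_ji
            self.w_o[j] += eta * delta_o * self.h[j]
        self.w_o[2] += eta * delta_o
        for j in range(2):
            for i in range(2):
                self.w_h[j][i] += eta * delta_h[j] * x[i]
            self.w_h[j][2] += eta * delta_h[j]

net = TinyNet()
XOR = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
for _ in range(5000):
    for x, t in XOR:
        net.backprop(x, t)
```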

26
Termination
  • Fixed number of iterations
  • Error below some constant
  • training set or separate validation set
  • Too few iterations
  • large error on all data
  • Too many iterations
  • overfit (and time)

27
Adding momentum
  • Variation of basic algorithm
  • update on nth iteration depends partially on the
    one made at (n-1)
  • add α Δw_ji(n-1)
  • 0 ≤ α < 1 is the momentum
  • analogy ball rolling down
  • jumps over small local minima, continues movement
    on plateaus, accelerates if gradient stays same
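A sketch of the momentum update on a toy quadratic (the setup and names are mine; the slide's Δw_ji(n-1) is carried here as a velocity vector):

```python
# Sketch of the momentum modification: the step at iteration n adds
# a fraction alpha of the step taken at iteration n-1.

def momentum_step(grad, velocity, eta=0.1, alpha=0.5):
    """delta_w(n) = -eta * grad + alpha * delta_w(n-1)."""
    return [-eta * g + alpha * v for g, v in zip(grad, velocity)]

# Minimize E(w) = w0^2 + w1^2 from (3, -2); the gradient is (2*w0, 2*w1).
w = [3.0, -2.0]
v = [0.0, 0.0]             # previous step, initially zero
for _ in range(100):
    grad = [2 * w[0], 2 * w[1]]
    v = momentum_step(grad, v)
    w = [wi + vi for wi, vi in zip(w, v)]
```

On this bowl-shaped surface the accumulated velocity carries the point smoothly to the minimum at the origin.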

28
Arbitrary acyclic networks
  • Backpropagation works
  • non-layered networks of
  • arbitrary depth
  • only the computation of δ changes
  • Generalized error term
  • δ_k = o_k(1 - o_k) Σ_n w_nk δ_n
  • n ranges over Downstream(k)
  • the nodes that take k's output as input

29
Derivations of rules
  • Task compute
  • Δw_ji = -η ∂E/∂w_ji
  • E(w) = 1/2 Σ_k (t_k - o_k)^2
  • k ranges over the output level
  • Derivation
  • o_j, t_j, net_j, Downstream(j)
  • w_ji influences E only through net_j
  • the chain rule applies
  • ∂E/∂w_ji = (∂E/∂net_j) x_ji

30
Derivations...
  • Remaining task: ∂E/∂net_j
  • Output level
  • net_j influences E only through o_j
  • chain rule
  • ∂E/∂o_j = -(t_j - o_j)
  • ∂o_j/∂net_j = o_j(1 - o_j)
  • we have Δw_ji for output nodes (same as in the
    algorithm)

31
Hidden Units
  • w_ji influences the outputs and E only indirectly
  • the rule must consider all the ways
  • ∂E/∂net_j
  • net_j affects E only through Downstream(j)
  • = Σ_k (∂E/∂net_k)(∂net_k/∂net_j)
  • ∂E/∂net_k = -δ_k
  • ∂net_k/∂net_j = w_kj o_j(1 - o_j)

32
Remarks on BP alg.
  • Convergence and local minima
  • Representational power
  • Search Bias
  • Hidden layer representations
  • Generalization, overfitting and stopping

33
Convergence
  • Error surfaces are very complex
  • no guarantee to reach global min.
  • Still highly effective in practice
  • many dimensions actually helps
  • weight evolution
  • initially small → smooth, almost linear function
  • later more complex, but we have also done a lot
    of search already

34
Common heuristics
  • Momentum
  • may also jump over global minimum
  • Stochastic descent (as used)
  • every example has different surface
  • Multiple networks
  • different initial state
  • select best or make a vote

35
Repr. power
  • Boolean functions (2 layers)
  • Continuous functions
  • bounded f can be approximated with arbitrarily
    small error with 2 layers
  • Arbitrary functions
  • any f approx with arb. small error with 3 layers
  • Number of nodes, reachability

36
Hyp. space bias
  • n-dimensional continuous space
  • h any weight assignment
  • E is differentiable wrt w_ij → descent methods
  • Search bias of gradient descent
  • symbolic systems: general-to-specific
  • decision trees: simple-to-complex
  • smooth interpolation

37
Hidden representations
  • Interesting property
  • hidden units discover intermediate concepts
  • they have to?
  • properties most relevant in learning the target
    function (?)
  • Example
  • network invents binary numbers

38
Hidden features
  • Key to ANN learning
  • no restriction on predefined features
  • Example case
  • squared error
  • evolution of hidden layer
  • evolution of individual weights
  • significant changes at same time

39
Overfit
  • Obvious termination criterion: small E
  • a poor choice (overfit, as in decision trees)
  • generalization accuracy = performance on a
    validation set
  • Technique 1 weight decay
  • decrease each w_ji at each iteration
  • as if E had penalty for weight sum
  • bias against too complex surfaces

40
Using validation sets
  • Successful method
  • monitor performance on validation set while
    learning with training set
  • remember best network so far
  • stop when error increases (careful)
  • Small data sets
  • k-fold cross-validation to choose the number of iterations
  • compute the average, then apply BP to all the data
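The monitoring loop can be sketched on a deliberately contrived toy problem where the training optimum and validation optimum differ, so "remember the best network so far" visibly pays off (the problem, hooks, and names are all mine):

```python
# Sketch of early stopping with a validation set: keep the best weights
# seen so far, stop when validation error has not improved for a while.
import copy

def early_stop_train(w, train_grad, val_error, eta=0.05, patience=10, max_iters=1000):
    best_w, best_err, bad = copy.deepcopy(w), val_error(w), 0
    for _ in range(max_iters):
        w = [wi - eta * g for wi, g in zip(w, train_grad(w))]  # train step
        err = val_error(w)
        if err < best_err:
            best_w, best_err, bad = copy.deepcopy(w), err, 0   # remember best
        else:
            bad += 1
            if bad >= patience:       # stop when error keeps increasing
                break
    return best_w

train_grad = lambda w: [2 * (w[0] - 2.0)]   # training optimum at w = 2
val_error = lambda w: (w[0] - 1.0) ** 2     # validation optimum at w = 1
w = early_stop_train([0.0], train_grad, val_error)
```

Training error keeps improving all the way to w = 2, but the returned weights stay near w = 1, where validation error bottomed out.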

41
Example application
  • Face recognition
  • data programs available from www
  • 20 people, 32 images of each
  • angles, expressions, glasses,
  • background, clothing, position,
  • 120 x 128 resolution, 256 shades
  • Target function?
  • here: the direction the person is facing

42
Design choices
  • Input encoding
  • apply traditional machine vision techniques to
    extract features?
  • a fixed number of features is preferable
  • image → 30x32 (computational demands)
  • intensity → 0..1
  • mean, random,

43
Design choices
  • Output encoding (4 values)
  • single unit with 4 values?
  • 4 units, highest wins (also known as 1-of-n
    encoding)
  • more degrees of freedom (4 times as many weights)
  • difference of 1st and 2nd = measure of confidence
  • target values t = 0.1 / 0.9 (0 and 1 are bad)
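The encoding choices above can be sketched directly (the direction labels are hypothetical placeholders for the four facing directions):

```python
# Sketch of 1-of-n output encoding with 0.1/0.9 targets, plus the
# confidence measure: the gap between the two highest outputs.

DIRECTIONS = ["left", "right", "up", "straight"]   # hypothetical labels

def encode(label):
    """Target vector: 0.9 for the true class, 0.1 elsewhere."""
    return [0.9 if d == label else 0.1 for d in DIRECTIONS]

def decode(outputs):
    """Highest output wins; the gap to the runner-up acts as confidence."""
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    confidence = outputs[ranked[0]] - outputs[ranked[1]]
    return DIRECTIONS[ranked[0]], confidence

print(encode("up"))                        # [0.1, 0.1, 0.9, 0.1]
label, conf = decode([0.2, 0.7, 0.3, 0.4])
```

Targets of 0.1/0.9 are preferred because a sigmoid can only approach 0 and 1 asymptotically, which would drive the weights toward infinity.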

44
Design choices
  • Network graph structure
  • standard layered structure
  • one (or two) hidden layers (no more)
  • number of hidden units?
  • 3 units: 90% accuracy, 5 minutes
  • 30 units: 92%, one hour
  • empirically
  • a certain minimum is required
  • after that, accuracy won't increase much

45
Design choices
  • Other parameters
  • learning rate (0.3)
  • momentum (0.3)
  • too high values → no convergence
  • full gradient descent used
  • output units random, others 0 initially
  • gives visualization for learned weights
  • training and validation sets, checked after every 50
    iterations

46
Hidden representations
  • Weights (4) of output units (4)
  • threshold and 3 hidden units
  • brightness indicates value
  • Weights of hidden units
  • each receives 30x32 inputs
  • values seem to be sensitive to features in the
    regions where the face and body appear

47
Advanced topics
  • Alternative
  • error functions
  • minimization procedures
  • Recurrent networks
  • Modifying structure

48
New error functions
  • New E → new weight update rule
  • Penalty for weight magnitude
  • reduces the risk of overfitting
  • add γ Σ w_ij^2 to E(w)
  • adjustment: multiply w_ij by (1 - 2γη)
  • Include slope (derivative) in E
  • not always available
  • example invariance to translations
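The weight-decay penalty translates into one line of code: the extra term γ Σ w_ij^2 contributes 2γ·w_ij to each gradient, which after the -η step shrinks every weight by the factor (1 - 2γη). A sketch (function name mine):

```python
# Sketch of weight decay: shrink each weight by (1 - 2*gamma*eta),
# then take the usual gradient step on the data error.

def decay_then_step(w, grad, eta=0.1, gamma=0.01):
    return [(1 - 2 * gamma * eta) * wi - eta * g for wi, g in zip(w, grad)]

# With a zero data gradient, only the shrink toward zero remains
w = decay_then_step([1.0, -2.0], [0.0, 0.0])
```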

49
New error functions
  • Minimize cross entropy
  • learning a probabilistic function
  • fact best estimate minimizes c.e.
  • training rule in chapter 6
  • Tying weights together
  • require that some weights be equal → representational
    bias
  • speech recognition: independence of time
  • update each, assign mean value

50
Minimization procedures
  • Weight update makes 2 decisions
  • direction and amount of update
  • Line search
  • decide direction
  • search point minimizing E
  • Conjugate gradient
  • sequence of line searches
  • No significant impact

51
Recurrent networks
  • Apply to time series data
  • output at time t used as input at time t+1
  • Example
  • stock market average y(t+1)
  • based on economic indicators x(t)
  • depends also on earlier values of x?
  • recurrence allows arbitrary time window

52
Recurrent networks
  • Training
  • unfold network a few times
  • train as usual
  • weight in the recurrent network = mean value of the
    corresponding weights in the copies
  • Experiences
  • more difficult to train
  • do not generalize as reliably

53
Modifying structure
  • Number of units
  • accuracy vs. training efficiency
  • Cascade-Correlation
  • start with no hidden units
  • try to repair error of the network with a new
    hidden unit (etc)
  • weights maximize the correlation between the new unit's
    value and the network error
  • the weights into the new unit are then fixed

54
Modifying structure
  • Opposite idea: prune the network
  • is weight w_ij inessential?
  • almost zero
  • effect of a small variation (∂E/∂w_ij)
  • optimal brain damage
  • subsequent training faster
  • Experiences
  • mixed success
  • improvements in training time

55
Summary
  • Practical learning method
  • continuous functions, noise
  • Backpropagation algorithm
  • hypothesis space, gradient descent
  • Invention of new features
  • Overfit
  • Alternative learning methods