1
Best Practices for Convolutional NNs Applied to
Visual Document Analysis (according to
P.Y. Simard, D. Steinkraus, and J.C. Platt)
2
outline
  • the task
  • training set expansion
  • network architecture
  • learning

3
the task
  • handwriting recognition
  • segmented handwritten digits
  • data:
  • benchmark set of English digit images (MNIST)
  • size-normalized to 28 x 28 pixels
  • 60,000 training patterns, 10,000 test patterns
  • goal: image vector → {0, 1, …, 9}

4
the task
  • example from test set

5
training set expansion
  • E_test − E_train ∝ 1/P (P = size of the training set)
  • idea: apply transformations to generate
    additional data
  • the learning algorithm will learn transformation
    invariance (w.r.t. the original, non-transformed
    input)

6
training set expansion
  • examples of transformations
  • translation
  • rotation
  • skewing
  • method: for every pixel in the original image,
    compute a new location from a displacement field,
    e.g. (see the sketch after this list)
  • translation: Δx(x,y) = 1, Δy(x,y) = 0
  • scaling: Δx(x,y) = αx, Δy(x,y) = αy
    (interpolation needed if α is not an integer)
  • elastic deformations
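A minimal sketch of this displacement-field method, assuming numpy/scipy are available; the warp helper and its inverse-mapping convention are illustrative assumptions, not the paper's code:

```python
# A minimal sketch: warp an image with per-pixel displacement fields.
import numpy as np
from scipy.ndimage import map_coordinates

def warp(image, dx, dy):
    # Sample the original image at the displaced locations with
    # bilinear interpolation (order=1); samples falling outside
    # the image are filled with 0.
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([ys + dy, xs + dx])
    return map_coordinates(image, coords, order=1, mode="constant")

image = np.random.rand(28, 28)
# Translation: dx(x,y) = 1, dy(x,y) = 0 for every pixel.
translated = warp(image, np.ones((28, 28)), np.zeros((28, 28)))
```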

7
training set expansion
worked example: the pixel at (1, 0) has displacement
Δx = 0.75, Δy = −0.5, so
x_new(x,y) = 1.75, y_new(x,y) = −0.5
gray levels (gl) at the surrounding grid points:
gl(1, 0) = 3, gl(2, 0) = 7, gl(1, −1) = 5, gl(2, −1) = 9
evaluate gl at (x_new, y_new) with bilinear interpolation:
over x: 3 + 0.75 · (7 − 3) = 6 and 5 + 0.75 · (9 − 5) = 8
over y: 8 + 0.5 · (6 − 8) = 7
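The same computation as a small Python sketch; the bilinear helper and the dictionary of gray levels are assumptions for illustration:

```python
# Bilinear interpolation over the four surrounding grid points;
# gl maps integer grid points to gray levels, as in the example.
import math

def bilinear(gl, x, y):
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    # Interpolate over x on the two bracketing rows ...
    lower = gl[(x0, y0)] + fx * (gl[(x0 + 1, y0)] - gl[(x0, y0)])
    upper = gl[(x0, y0 + 1)] + fx * (gl[(x0 + 1, y0 + 1)] - gl[(x0, y0 + 1)])
    # ... then over y between the two rows.
    return lower + fy * (upper - lower)

gl = {(1, 0): 3, (2, 0): 7, (1, -1): 5, (2, -1): 9}
print(bilinear(gl, 1.75, -0.5))  # 7.0, matching the slide
```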
8
training set expansion
  • elastic deformations
  • Δx(x,y) = rand(−1, 1), Δy(x,y) = rand(−1, 1)
  • smooth the random fields with a Gaussian of a given
    standard deviation σ (in pixels)
  • if σ is large, the resulting values are small
    (the random values average out)
  • if σ is small, the field stays essentially random
  • an intermediate σ gives an elastic deformation
  • multiply by a factor α to control the intensity
    of the deformation
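A minimal sketch of this recipe, assuming numpy/scipy; the default σ = 4 and α = 34 are the values commonly quoted from the paper for MNIST:

```python
# Elastic deformation: random displacement fields, Gaussian-smoothed
# with sigma (pixels) and scaled by the intensity factor alpha.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, sigma=4.0, alpha=34.0, rng=None):
    rng = rng or np.random.default_rng()
    h, w = image.shape
    # Uniform random displacement fields in [-1, 1] ...
    dx = rng.uniform(-1, 1, size=(h, w))
    dy = rng.uniform(-1, 1, size=(h, w))
    # ... smoothed with a Gaussian of standard deviation sigma,
    # then scaled by the intensity factor alpha.
    dx = gaussian_filter(dx, sigma) * alpha
    dy = gaussian_filter(dy, sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([ys + dy, xs + dx])
    return map_coordinates(image, coords, order=1, mode="constant")
```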

9
training set expansion
  • examples of distortions

10
network architecture
  • account for topological properties of input
    (shape of curves, edges, etc.)
  • gradually extract more complex features
  • simple features extracted at higher resolutions;
    more complex features at coarser resolutions, over
    smaller regions
  • the conversion from one resolution to the next is
    the convolution operation
  • coarser resolutions are generated by sub-sampling
    (see the sketch after this list)
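A minimal sketch of these two operations (illustrative code, not the paper's): a "valid" 2-D convolution extracts a feature map, and sub-sampling coarsens it by local averaging.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image; "valid" means no padding.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def subsample(fmap, factor=2):
    # Coarsen the feature map by averaging factor x factor blocks.
    h, w = fmap.shape
    return fmap[:h - h % factor, :w - w % factor] \
        .reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

image = np.random.rand(28, 28)
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)  # a simple edge detector
fmap = conv2d_valid(image, edge_kernel)         # 26 x 26 feature map
coarse = subsample(fmap)                        # 13 x 13
```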

11
network architecture
12
network architecture
  • set of layers, each with one or more planes
  • each unit on a plane receives input from a small area
    on planes in the previous layer → local receptive
    fields
  • shared weights at all points on a plane → reduced
    number of parameters
  • multiple planes in each layer → detect multiple
    features
  • once a feature has been detected, its exact position
    matters less → spatial subsampling by local averaging
    of unit outputs
  • (partial) invariance to translation, rotation,
    scale, and deformation

13
network architecture
  • C1: 5 features (e.g. edge, ink, intersection),
    kernel size 5x5
  • S1: sub-sampling by a factor of 2
  • C2: 50 features
  • S2: sub-sampling
  • fully connected layer with 100 hidden units
    (a sketch of this stack follows)
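A minimal modern re-implementation sketch of this stack in PyTorch; the tanh nonlinearities, average pooling, the 5x5 kernel in C2, and the 10-way output are assumptions for illustration, not taken from the slide:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 5, kernel_size=5),   # C1: 5 feature maps, 28x28 -> 24x24
    nn.Tanh(),
    nn.AvgPool2d(2),                  # S1: sub-sample by 2 -> 12x12
    nn.Conv2d(5, 50, kernel_size=5),  # C2: 50 feature maps -> 8x8
    nn.Tanh(),
    nn.AvgPool2d(2),                  # S2: sub-sample by 2 -> 4x4
    nn.Flatten(),
    nn.Linear(50 * 4 * 4, 100),       # 100 hidden units
    nn.Tanh(),
    nn.Linear(100, 10),               # one output per digit class
)

out = model(torch.randn(1, 1, 28, 28))  # -> shape (1, 10)
```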
14
gradient-based learning
  • backpropagation
  • output: Y^p = F(X^p, W)
  • loss function: E^p = D(D^p, F(X^p, W))
  • E_train(W) = average of E^p over the training set
    (X^1, D^1), …, (X^P, D^P)
  • e.g. E^p = (D^p − F(X^p, W))² / 2
  • E_train(W) = (1/P) Σ_p E^p
  • simplest setting: find W that minimizes
    E_train(W)
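A small numeric sketch of these definitions, with toy values assumed for the outputs and targets:

```python
# Squared error per pattern, averaged over the training set.
import numpy as np

outputs = np.array([[0.9, 0.1], [0.2, 0.8]])  # F(X^p, W) for P = 2 patterns
targets = np.array([[1.0, 0.0], [0.0, 1.0]])  # desired outputs D^p

E_p = np.sum((targets - outputs) ** 2, axis=1) / 2  # E^p per pattern
E_train = E_p.mean()                                # (1/P) * sum of E^p
```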

15
gradient-based learning
  • if E is differentiable w.r.t. W,
  • gradient-based optimization can be used to
    compute the minimum
  • module output: X_n = F_n(X_{n−1}, W_n)
  • W_n = trainable parameters, W_n ⊂ W
  • X_{n−1} = the module's input (the previous
    module's output)
  • X_0 = input pattern X^p
16
gradient-based learning
  • if ∂E/∂X_n is known, then ∂E/∂W_n and ∂E/∂X_{n−1}
    can be computed:
  • ∂E/∂W_n = ∂F/∂W (W_n, X_{n−1}) · ∂E/∂X_n
    → compute the gradient
  • ∂E/∂X_{n−1} = ∂F/∂X (W_n, X_{n−1}) · ∂E/∂X_n
    → propagate backward
  • here ∂F/∂W and ∂F/∂X are the Jacobians J_F of F
    w.r.t. W and X, evaluated at (W_n, X_{n−1}); a
    Jacobian is the matrix containing the partial
    derivatives of all outputs w.r.t. all inputs
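A minimal sketch of this recursion for a single linear module y = W x (illustrative code, not the paper's): given ∂E/∂y from the module above, produce the gradient ∂E/∂W and the quantity ∂E/∂x passed to the previous module.

```python
import numpy as np

def linear_backward(W, x, dE_dy):
    dE_dW = np.outer(dE_dy, x)  # Jacobian of y w.r.t. W, applied to dE/dy
    dE_dx = W.T @ dE_dy         # Jacobian of y w.r.t. x: propagate backward
    return dE_dW, dE_dx

W = np.random.randn(3, 4)
x = np.random.randn(4)
dE_dy = np.random.randn(3)      # assumed known from the module above
dE_dW, dE_dx = linear_backward(W, x, dE_dy)
```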
17
gradient-based learning
  • simplest minimization: gradient descent
  • W is iteratively adjusted as follows:
    W ← W − ε · ∂E_train/∂W
  • traditional backprop is a special case of gradient
    learning, with
  • Y_n = W_n X_{n−1}
  • X_n = F(Y_n)
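A tiny numeric sketch of the update rule, using an assumed toy one-parameter loss:

```python
# Gradient descent on E(W) = (W - 3)^2 / 2, whose gradient is W - 3;
# epsilon is the learning rate.
W, epsilon = 0.0, 0.1
for step in range(100):
    W = W - epsilon * (W - 3.0)  # W <- W - epsilon * dE/dW
print(W)  # converges toward 3.0, the minimizer
```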

18
application
  • ZIP-code scanning (with a version generalized to
    the time domain)
  • fax reading
  • similar techniques are used in other digital image
    recognition tasks (e.g. face recognition, X-ray,
    MRI, etc.)
  • a later version (2003) dynamically changes the
    layer parameters