Best Practices for Convolutional NNs Applied to Visual Document Analysis
(according to P.Y. Simard, D. Steinkraus, and J.C. Platt)
outline
- the task
- training set expansion
- network architecture
- learning
the task
- handwriting recognition
- segmented handwritten digits
- data
- benchmark set of English digit images (MNIST)
- size-normalized to 28 x 28 pixels
- 60,000 training patterns, 10,000 test patterns
- goal: map the image vector to a class in {0, 1, ..., 9}
training set expansion
- E_test - E_train ∝ 1/P (P = size of the training set)
- idea: apply transformations to generate additional data; the learning algorithm will learn transformation invariance (w.r.t. the original, non-transformed input)
training set expansion
- examples of transformations
  - translation
  - rotation
  - skewing
- method: for every pixel in the original image, compute a new location from a displacement field, e.g.
  - Δx(x,y) = 1, Δy(x,y) = 0 (translation by one pixel to the right)
  - Δx(x,y) = αx, Δy(x,y) = αy (scaling; interpolation needed if α is not an integer)
- elastic deformations
training set expansion
- worked example (bilinear interpolation)
  - displacement at pixel (0,0): x_new(0,0) = 1.75, y_new(0,0) = -0.5
  - gray levels (gl) at the surrounding integer locations: gl(1,0) = 3, gl(2,0) = 7, gl(1,-1) = 5, gl(2,-1) = 9
  - evaluate gl at (x_new, y_new) with bilinear interpolation
  - over x: 3 + 0.75 * (7 - 3) = 6 (on the row y = 0) and 5 + 0.75 * (9 - 5) = 8 (on the row y = -1)
  - over y: 8 + 0.5 * (6 - 8) = 7
  - so the new gray level at (0,0) is 7
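The bilinear interpolation worked through above can be sketched directly in NumPy (the helper name `bilinear` and the array layout, with row 0 holding the y = 0 scan line, are my own choices):

```python
import numpy as np

def bilinear(img, x, y):
    """Gray level at fractional location (x, y); img[row, col] with row = y index.
    Assumes (x, y) lies strictly inside the image (no border handling)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - x0, y - y0
    top = img[y0, x0] + fx * (img[y0, x0 + 1] - img[y0, x0])            # over x, first row
    bot = img[y0 + 1, x0] + fx * (img[y0 + 1, x0 + 1] - img[y0 + 1, x0])  # over x, second row
    return top + fy * (bot - top)                                        # over y

# the slide's numbers: 3 and 7 on one scan line, 5 and 9 on the next
img = np.array([[0.0, 3.0, 7.0],
                [0.0, 5.0, 9.0]])
gl = bilinear(img, 1.75, 0.5)  # 3+0.75*4 = 6, 5+0.75*4 = 8, then 8+0.5*(6-8) = 7
```

With these values `gl` reproduces the slide's result of 7.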
training set expansion
- elastic deformations
  - Δx(x,y) = rand(-1, 1), Δy(x,y) = rand(-1, 1)
  - smooth both fields with a Gaussian of a given standard deviation SD (in pixels)
  - if the chosen SD is large, the resulting values are small (the field averages out)
  - if the SD is small, the field stays essentially random
  - an intermediate SD yields an elastic deformation
  - multiply by a factor to control the intensity of the deformation
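A minimal NumPy sketch of these displacement fields (separable Gaussian smoothing via `np.convolve`; the function names, the 3-SD kernel truncation, and the zero padding at the borders are my own simplifications, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(sigma):
    r = int(3 * sigma)                       # truncate at 3 standard deviations
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def smooth(field, sigma):
    """Separable Gaussian smoothing; assumes the image side >= kernel length."""
    k = gaussian_kernel(sigma)
    field = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, field)
    field = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, field)
    return field

def elastic_fields(shape, sigma, alpha):
    """Random displacements in [-1, 1], Gaussian-smoothed, scaled by intensity alpha."""
    dx = smooth(rng.uniform(-1, 1, shape), sigma) * alpha
    dy = smooth(rng.uniform(-1, 1, shape), sigma) * alpha
    return dx, dy

dx, dy = elastic_fields((28, 28), sigma=4.0, alpha=8.0)
```

Each pixel (x, y) of the deformed image is then read from (x + dx, y + dy) with the bilinear interpolation of the previous slide; a very large sigma washes the field out, a very small one leaves pure noise.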
network architecture
- account for topological properties of the input (shape of curves, edges, etc.)
- gradually extract more complex features
- simple features extracted at higher resolutions; more complex features at coarser resolutions over smaller regions
- conversion from one resolution to the next with the operation of convolution
- coarser resolutions generated by sub-sampling
network architecture
- a set of layers, each with one or more planes (feature maps)
- each unit on a plane receives input from a small area on the planes of the previous layer → local receptive fields
- shared weights at all points on a plane → reduced number of parameters
- multiple planes in each layer → detect multiple features
- once a feature is detected, spatial subsampling → local averaging of the plane's outputs
- result: (partial) invariance to translation, rotation, scale, and deformation
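The two core operations above, a plane of shared-weight convolutions followed by subsampling, can be sketched as (NumPy; the sizes and function names are illustrative choices, not the paper's):

```python
import numpy as np

def convolve2d_valid(img, kernel):
    """Apply the same (shared) kernel weights at every position; 'valid' borders."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def subsample2x2(plane):
    """Spatial subsampling: average over non-overlapping 2x2 blocks."""
    h, w = plane.shape
    plane = plane[:h - h % 2, :w - w % 2]   # drop odd edge rows/columns
    return plane.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

feature_map = convolve2d_valid(np.ones((6, 6)), np.ones((3, 3)))  # shape (4, 4)
pooled = subsample2x2(feature_map)                                # shape (2, 2)
```

Because the same kernel is reused everywhere, this plane has only 9 weights regardless of the image size, which is exactly the parameter reduction the slide refers to.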
network architecture
- example layer stack (input → output)
  - C1: convolution, 5 features (e.g. edge, ink, intersection), kernel size 5x5
  - S1: subsampling (factor of 2)
  - C2: convolution, 50 features
  - S2: subsampling
  - fully connected layer with 100 hidden units
gradient-based learning
- backpropagation
- output: Yp = F(Xp, W)
- loss function: Ep = D(Dp, F(Xp, W))
- Etrain(W): average of Ep over the training set (X1, D1), ..., (XP, DP)
- e.g. Ep = (Dp - F(Xp, W))^2 / 2
- Etrain(W) = (1/P) * sum_p Ep
- simplest setting: find W that minimizes Etrain(W)
gradient-based learning
- if E is differentiable w.r.t. W, gradient-based optimization can be used to compute the minimum
- module output: Xn = Fn(Xn-1, Wn)
  - Wn: the module's trainable parameters, Wn ⊂ W
  - Xn-1: the module's input (the previous module's output)
  - X0: the input pattern Xp
gradient-based learning
- if ∂E/∂Xn is known, then ∂E/∂Wn and ∂E/∂Xn-1 can be computed:
  - ∂E/∂Wn = ∂F/∂W (Wn, Xn-1) · ∂E/∂Xn (compute the gradient)
  - ∂E/∂Xn-1 = ∂F/∂X (Wn, Xn-1) · ∂E/∂Xn (propagate backward)
- ∂F/∂W: Jacobian of F w.r.t. W, evaluated at (Wn, Xn-1)
- the Jacobian JF is the matrix containing the partial derivatives of all outputs w.r.t. all inputs
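As a concrete check of the two Jacobian rules, take a single linear module Xn = W Xn-1 with a squared loss; the Jacobian products reduce to an outer product and a transposed matrix multiply. A NumPy sketch with made-up sizes, verified against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))      # module parameters W_n
x_prev = rng.normal(size=4)      # module input X_{n-1}
d = rng.normal(size=3)           # desired output

def loss(W, x_prev):
    x = W @ x_prev               # module output X_n
    return 0.5 * np.sum((d - x) ** 2)

# backward pass: from dE/dX_n, the Jacobians give dE/dW_n and dE/dX_{n-1}
dE_dx = W @ x_prev - d           # dE/dX_n for the squared loss
dE_dW = np.outer(dE_dx, x_prev)  # Jacobian w.r.t. W applied to dE/dX_n
dE_dxprev = W.T @ dE_dx          # propagate backward through the module

# finite-difference check of one entry of each gradient
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
num_W = (loss(W2, x_prev) - loss(W, x_prev)) / eps
x2 = x_prev.copy(); x2[0] += eps
num_x = (loss(W, x2) - loss(W, x_prev)) / eps
```

The numeric quotients `num_W` and `num_x` agree with `dE_dW[0, 0]` and `dE_dxprev[0]` up to the finite-difference error.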
gradient-based learning
- simplest minimization: gradient descent
- W is iteratively adjusted as W ← W - η ∂Etrain/∂W
- traditional backprop is a special case of gradient-based learning with
  - Yn = Wn Xn-1
  - Xn = F(Yn)
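Putting the pieces together, gradient descent on the averaged squared loss for one linear module looks like the following (a toy NumPy sketch; the data, learning rate, and iteration count are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W_true = rng.normal(size=(2, 3))          # the map the weights should recover
X = rng.normal(size=(50, 3))              # 50 input patterns X_p
D = X @ W_true.T                          # desired outputs D_p (noise-free)

W = np.zeros((2, 3))
eta = 0.1                                 # learning rate
for _ in range(500):
    Y = X @ W.T                           # forward pass: Y_p = F(X_p, W)
    grad = (Y - D).T @ X / len(X)         # dE_train/dW for E_p = ||D_p - Y_p||^2 / 2
    W -= eta * grad                       # update: W <- W - eta * dE_train/dW

E_train = np.mean(np.sum((D - X @ W.T) ** 2, axis=1)) / 2
```

On this noise-free toy problem the loss is convex, so the iterates converge to `W_true` and `E_train` approaches zero.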
application
- zip-code scanning (a version generalized over the time domain)
- fax reading
- similar techniques used in other digital image recognition tasks (e.g. face recognition, X-ray, MRI, etc.)
- a later version (2003) dynamically changes layer parameters