Title: Tutorial on Neural Network Models for Speech and Image Processing
1. Tutorial on Neural Network Models for Speech and Image Processing
- B. Yegnanarayana
- Speech Vision Laboratory
- Dept. of Computer Science and Engineering
- IIT Madras, Chennai-600036
- yegna_at_cs.iitm.ernet.in
WCCI 2002, Honolulu, Hawaii, USA, May 12, 2002
2. Need for New Models of Computing for Speech and Image Tasks
- Speech and image processing tasks
- Issues in dealing with these tasks by human beings
- Issues in dealing with the tasks by machine
- Need for new models of computing for dealing with natural signals
- Need for effective (relevant) computing
- Role of Artificial Neural Networks (ANN)
3. Organization of the Tutorial
- Part I: Feature extraction and classification problems with speech and image data
- Part II: Basics of ANN
- Part III: ANN models for feature extraction and classification
- Part IV: Applications in speech and image processing
4. PART I: Feature Extraction and Classification Problems in Speech and Image
5. Feature Extraction and Classification Problems in Speech and Image
- Distinction between natural and synthetic signals (unknown model vs known model generating the signal)
- Nature of speech and image data (non-repetitive data, but repetitive features)
- Need for feature extraction and classification
- Methods for feature extraction and models for classification
- Need for nonlinear approaches (methods and models)
6. Speech vs Audio
- Audio (audible) signals (noise, music, speech and other signals)
- Categories of audio signals
  - Audio signal vs non-signal (noise)
  - Signal from speech production mechanism vs other audio signals
  - Non-speech vs speech signals (as with natural language)
7. Speech Production Mechanism (figure)
8. Different Types of Sounds (figure)
9. Categorization of Sound Units (figure)
10. Nature of Speech Signal
- Digital speech: sequence of samples or numbers
- Waveform for the word MASK (figure)
- Characteristics of speech signal
  - Excitation source characteristics
  - Vocal tract system characteristics
11. Waveform for the word MASK (figure)
12. Source-System Model of Speech Production
- Block diagram (figure): an impulse train generator (controlled by the pitch period) and a random noise generator feed, through a voiced/unvoiced switch and gain G, the excitation u(n) to a time-varying digital filter (vocal tract parameters), producing the speech signal s(n)
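The source-system model above can be sketched in a few lines of Python. This is a toy illustration, not code from the tutorial: the function names and the single filter coefficient are illustrative, and the vocal tract is modeled as a simple all-pole (autoregressive) filter.

```python
import numpy as np

def source_filter_synth(excitation, a, gain):
    """Pass an excitation u(n) through an all-pole vocal-tract filter:
    s(n) = gain*u(n) - sum_k a[k-1]*s(n-k)."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= a[k - 1] * s[n - k]
        s[n] = acc
    return s

def impulse_train(n_samples, pitch_period):
    """Voiced excitation: impulses spaced one pitch period apart."""
    u = np.zeros(n_samples)
    u[::pitch_period] = 1.0
    return u

# Voiced sounds use an impulse train; unvoiced sounds use random noise.
voiced = source_filter_synth(impulse_train(400, 80), a=[-0.9], gain=1.0)
unvoiced = source_filter_synth(
    np.random.default_rng(0).standard_normal(400), a=[-0.9], gain=0.5)
```

Switching between the two excitations (and varying the filter coefficients over time) is what the voiced/unvoiced switch and the time-varying filter in the block diagram represent.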
13. Features from Speech Signal (demo)
- Different components of speech (speech, source and system)
- Different speech sound units (alphabet in Indian languages)
- Different emotions
- Different speakers
14. Speech Signal Processing Methods
- To extract source-system features and suprasegmental features
- Production-based features
- DSP-based features
- Perception-based features
15. Models for Matching and Classification
- Dynamic Time Warping (DTW)
- Hidden Markov Models (HMM)
- Gaussian Mixture Models (GMM)
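Of the three matching models listed, DTW is the easiest to show compactly. The following is a minimal, generic implementation of the standard dynamic program for two 1-D sequences (a sketch, not code from the tutorial):

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic Time Warping distance between two 1-D sequences.
    Classic O(len(x)*len(y)) dynamic program: each cell accumulates the
    local cost plus the cheapest of the three allowed predecessor moves."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

In speech matching the scalars would be replaced by feature vectors (e.g. cepstral frames) and `abs` by a vector distance; the recursion is unchanged.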
16. Applications of Speech Processing
- Speech recognition
- Speaker recognition/verification
- Speech enhancement
- Speech compression
- Audio indexing and retrieval
17. Limitations of Feature Extraction Methods and Classification Models
- Fixed frame analysis
- Variability in the implicit pattern
- Not pattern-based analysis
- Temporal nature of the patterns
18. Need for New Approaches
- To deal with ambiguity and variability in the data for feature extraction
- To combine evidence from multiple sources (classifiers and knowledge sources)
19. Images
- Digital image: matrix of numbers
- Types of images
  - Line sketches, binary, gray level and color
  - Still images, video, multimedia
20. Image Analysis
- Feature extraction
- Image segmentation: gray level, color, texture
- Image classification
21. Processing of Texture-like Images
- 2-D Gabor filter (figures)
- A typical Gaussian filter with σ = 30
- A typical Gabor filter with σ = 30, ω = 3.14 and θ = 45°
22. Limitations
- Feature extraction
- Matching
- Classification methods/models
23. Need for New Approaches
- Feature extraction: PCA and nonlinear PCA
- Matching: stereo images
- Smoothing: using the knowledge of the image, not of the noise
- Edge extraction and classification: integration of global and local information, or combining evidence
24. PART II: Basics of ANN
25. Artificial Neural Networks
- Problem solving: pattern recognition tasks by human and machine
- Pattern vs data
- Pattern processing vs data processing
- Architectural mismatch
- Need for new models of computing
26. Biological Neural Networks
- Structure and function: neurons, interconnections, dynamics for learning and recall
- Features: robustness, fault tolerance, flexibility, ability to deal with a variety of data situations, collective computation
- Comparison with computers: speed, processing, size and complexity, fault tolerance, control mechanism
- Parallel and Distributed Processing (PDP) models
27. Basics of ANN
- ANN terminology: processing unit (fig), interconnection, operation and update (input, weights, activation value, output function, output value)
- Models of neurons: MP neuron, perceptron and adaline
- Topology (fig)
- Basic learning laws (fig)
28. Model of a Neuron (figure)
29. Topology (figure)
30. Basic Learning Laws (figure)
31. Activation and Synaptic Dynamics Models
- General activation dynamics model: passive decay term, excitatory term, inhibitory term
- General synaptic dynamics model: passive decay term, correlation term
- Stability and convergence
32. Functional Units and Pattern Recognition Tasks
- Feedforward ANN
  - Pattern association
  - Pattern classification
  - Pattern mapping/classification
- Feedback ANN
  - Autoassociation
  - Pattern storage (LTM)
  - Pattern environment storage (LTM)
- Feedforward and Feedback (Competitive Learning) ANN
  - Pattern storage (STM)
  - Pattern clustering
  - Feature map
33. Two-Layer Feedforward Neural Network (FFNN) (figure)
34. PR Tasks by FFNN
- Pattern association
  - Architecture: two layers, linear processing units, single set of weights
  - Learning: Hebb's rule (orthogonal inputs), Delta rule (linearly independent inputs)
  - Recall: direct
  - Limitation: linear independence; number of patterns restricted to input dimensionality
  - To overcome: nonlinear processing units; leads to a pattern classification problem
- Pattern classification
  - Architecture: two layers, nonlinear processing units, geometrical interpretation
  - Learning: perceptron learning
  - Recall: direct
  - Limitation: linearly separable functions only; cannot handle hard problems
  - To overcome: more layers; leads to a hard learning problem
- Pattern mapping/classification
  - Architecture: multilayer (hidden layers), nonlinear processing units, geometrical interpretation
  - Learning: generalized delta rule (backpropagation)
  - Recall: direct
  - Limitation: slow learning; does not guarantee convergence
  - To overcome: more complex architecture
35. Perceptron Network
- Perceptron classification problem
- Perceptron learning law
- Perceptron convergence theorem
- Perceptron representation problem
- Multilayer perceptron
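The perceptron learning law above can be sketched as follows. This is a minimal generic implementation (names and the AND-gate demo are illustrative): the weight vector is corrected only on misclassified inputs, and by the perceptron convergence theorem it terminates when the classes are linearly separable.

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, epochs=100):
    """Perceptron learning law: w <- w + lr*(target - output)*x,
    with the bias absorbed as an extra input fixed at 1."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, ti in zip(Xb, y):
            out = 1 if xi @ w > 0 else 0
            if out != ti:
                w += lr * (ti - out) * xi
                errors += 1
        if errors == 0:                          # converged: no mistakes
            break
    return w

def perceptron_predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ w > 0).astype(int)
```

A linearly separable example such as the AND function converges in a few epochs; XOR (the representation problem on the slide) never does with a single layer.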
36. Geometric Interpretation of Perceptron Learning (figure)
37. Generalized Delta Rule (Backpropagation Learning)
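A compact sketch of the generalized delta rule, assuming a sigmoid network with one hidden layer trained on XOR (the hidden-layer size, learning rate and seed are illustrative choices, not values from the tutorial). The deltas are the squared-error gradient propagated backwards through the layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_xor(hidden=4, epochs=20000, lr=1.0, seed=0):
    """Backpropagation on a 2-hidden-1 network learning XOR, a problem a
    single-layer perceptron cannot represent. Returns the mean squared
    error before and after training."""
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    t = np.array([[0], [1], [1], [0]], dtype=float)
    W1 = rng.normal(0, 1, (2, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 1, (hidden, 1)); b2 = np.zeros(1)
    mse0 = None
    for _ in range(epochs):
        h = sigmoid(X @ W1 + b1)              # forward pass: hidden layer
        o = sigmoid(h @ W2 + b2)              # forward pass: output layer
        if mse0 is None:
            mse0 = float(((o - t) ** 2).mean())
        d_o = (o - t) * o * (1 - o)           # output-layer delta
        d_h = (d_o @ W2.T) * h * (1 - h)      # delta propagated backwards
        W2 -= lr * h.T @ d_o; b2 -= lr * d_o.sum(axis=0)
        W1 -= lr * X.T @ d_h; b1 -= lr * d_h.sum(axis=0)
    o = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    return mse0, float(((o - t) ** 2).mean())
```

The slow, non-guaranteed convergence noted on the previous slide shows up here directly: gradient descent can settle into a local minimum depending on the initial weights.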
38. Issues in Backpropagation Learning
- Description and features of error backpropagation
- Performance of backpropagation learning
- Refinements of backpropagation learning
- Interpretation of results of learning
- Generalization
- Tasks with backpropagation network
- Limitations of backpropagation learning
- Extensions to backpropagation
39. PR Tasks by FBNN
- Autoassociation
  - Architecture: single layer with feedback, linear processing units
  - Learning: Hebb (orthogonal inputs), Delta (linearly independent inputs)
  - Recall: activation dynamics until stable states are reached
  - Limitation: no accretive behavior
  - To overcome: nonlinear processing units; leads to a pattern storage problem
- Pattern storage
  - Architecture: feedback neural network, nonlinear processing units, states, Hopfield energy analysis
  - Learning: not important
  - Recall: activation dynamics until stable states are reached
  - Limitation: hard problems, limited number of patterns, false minima
  - To overcome: stochastic update, hidden units
- Pattern environment storage
  - Architecture: Boltzmann machine, nonlinear processing units, hidden units, stochastic update
  - Learning: Boltzmann learning law, simulated annealing
  - Recall: activation dynamics, simulated annealing
  - Limitation: slow learning
  - To overcome: different architecture
40. Hopfield Model
- Model: binary units s_i ∈ {-1, +1}, symmetric weights w_ij = w_ji, energy E = -(1/2) Σ_i Σ_j w_ij s_i s_j
- Pattern storage condition: w_ij = (1/N) Σ_l a_i^l a_j^l, where a^l is the l-th pattern to be stored
- Capacity of Hopfield model: number of patterns for a given probability of error (about 0.14N for N units)
- Continuous Hopfield model
41. State Transition Diagram (figure)
42. Computation of Weights for Pattern Storage
- Patterns to be stored: (111) and (010). This results in a set of inequalities to be satisfied.
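Instead of solving the inequalities by hand, the Hebbian rule from the previous slide can be applied directly to the two patterns. A minimal numpy sketch, assuming bipolar encoding of the binary patterns and asynchronous recall (the function names are illustrative):

```python
import numpy as np

def store_patterns(patterns):
    """Hebbian computation of Hopfield weights: map binary patterns to
    bipolar (+1/-1), sum the outer products, and zero the diagonal
    (no self-connections)."""
    S = 2 * np.array(patterns) - 1
    W = S.T @ S
    np.fill_diagonal(W, 0)
    return W

def recall(W, state, sweeps=20):
    """Asynchronous activation dynamics until a stable state is reached.
    A unit keeps its previous output when its net input is exactly zero."""
    s = 2 * np.array(state) - 1
    for _ in range(sweeps):
        changed = False
        for i in range(len(s)):
            net = W[i] @ s
            new = s[i] if net == 0 else (1 if net > 0 else -1)
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return ((s + 1) // 2).tolist()
```

Both stored patterns are stable states of the resulting network, and a corrupted input relaxes to the nearest one.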
43. Pattern Storage Tasks
- Hard problems: conflicting requirements on a set of inequalities
- Hidden units: problem of false minima
- Stochastic update
- Stochastic equilibrium: Boltzmann-Gibbs law
44. Simulated Annealing
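The stochastic update and annealing schedule can be sketched together. This is a simplified illustration (the probability form, schedule constants and function names are assumptions, not taken from the slides): each unit fires with a Boltzmann-Gibbs probability, and the temperature is lowered geometrically so the network can escape false minima before settling.

```python
import math
import random

def stochastic_update(net, T, rng):
    """Boltzmann-Gibbs stochastic update: the unit outputs +1 with
    probability 1/(1 + exp(-2*net/T)). As T -> 0 this approaches the
    deterministic threshold rule; at high T updates are nearly random."""
    x = 2.0 * net / T
    if x > 30.0:
        p = 1.0
    elif x < -30.0:
        p = 0.0
    else:
        p = 1.0 / (1.0 + math.exp(-x))
    return 1 if rng.random() < p else -1

def anneal(W, s, T0=10.0, alpha=0.9, sweeps=60, seed=0):
    """Simulated annealing on a Hopfield-type network: stochastic sweeps
    over all units while the temperature follows a geometric schedule
    T <- alpha*T (with a small floor)."""
    rng = random.Random(seed)
    T = T0
    for _ in range(sweeps):
        for i in range(len(s)):
            net = sum(W[i][j] * s[j] for j in range(len(s)))
            s[i] = stochastic_update(net, T, rng)
        T = max(alpha * T, 1e-3)
    return s
```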
45. Boltzmann Machine
- Pattern environment storage
- Architecture: visible units, hidden units, stochastic update, simulated annealing
- Boltzmann learning law
46. Discussion on Boltzmann Learning
- Expression for Boltzmann learning
  - Significance of p_ij and p'_ij
  - Learning and unlearning
  - Local property
  - Choice of the learning rate η and initial weights
- Implementation of Boltzmann learning
  - Algorithm for learning a pattern environment
  - Algorithm for recall of a pattern
  - Implementation of simulated annealing
  - Annealing schedule
- Pattern recognition tasks by Boltzmann machine
  - Pattern completion
  - Pattern association
  - Recall from noisy or partial input
- Interpretation of Boltzmann learning
  - Markov property of simulated annealing
  - Clamped free energy and full free energy
- Variations of Boltzmann learning
  - Deterministic Boltzmann machine
47. Competitive Learning Neural Network (CLNN)
- Structure (figure): an input layer feeding an output layer with on-center and off-surround connections
48. PR Tasks by CLNN
- Pattern storage (STM)
  - Architecture: two layers (input and competitive), linear processing units
  - Learning: no learning in the FF stage, fixed weights in the FB layer
  - Recall: not relevant
  - Limitation: STM only, no application, of theoretical interest
  - To overcome: nonlinear output function in the FB stage, learning in the FF stage
- Pattern clustering (grouping)
  - Architecture: two layers (input and competitive), nonlinear processing units in the competitive layer
  - Learning: only in the FF stage, competitive learning
  - Recall: direct in the FF stage; activation dynamics until a stable state is reached in the FB layer
  - Limitation: fixed (rigid) grouping of patterns
  - To overcome: train neighbourhood units in the competitive layer
- Feature map
  - Architecture: self-organization network, two layers, nonlinear processing units, excitatory neighbourhood units
  - Learning: weights leading to the neighbourhood units in the competitive layer
  - Recall: apply input, determine winner
  - Limitation: only visual features, not quantitative
  - To overcome: more complex architecture
49. Learning Algorithms for PCA Networks
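The slide's equations are not reproduced here, but Oja's rule is one standard PCA learning algorithm and serves as a sketch (the demo data and constants below are illustrative): a Hebbian update with an implicit weight-decay term drives a single linear unit's weight vector to the first principal component.

```python
import numpy as np

def oja_learn(X, lr=0.01, epochs=100, seed=0):
    """Oja's learning rule:
        w <- w + lr * y * (x - y*w),  with  y = w.x
    The -y*w term keeps the weight norm bounded, and w converges to the
    unit-norm first principal component of the input distribution."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = w @ x
            w += lr * y * (x - y * w)
    return w

# Demo: data with dominant variance along the direction (1, 1)/sqrt(2).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1)) * np.array([1.0, 1.0]) \
    + 0.1 * rng.normal(size=(300, 2))
w = oja_learn(X)
```

Generalizations such as Sanger's rule extract several components with a network of such units.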
50. Self-Organization Network
- (a) Network structure: input layer fully connected to the output layer (figure)
- (b) Neighborhood regions at different times in the output layer (figure)
51. Illustration of SOM
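A minimal Kohonen SOM sketch in the spirit of the illustrations: a 1-D chain of units organizing 2-D inputs (the chain length, decay schedules and function names are illustrative assumptions). The winner and its neighbourhood on the chain move toward each input, with the learning rate and neighbourhood width shrinking over time.

```python
import numpy as np

def train_som(X, n_units=10, epochs=50, lr0=0.5, sigma0=3.0, seed=0):
    """Kohonen self-organizing map: a 1-D chain of units mapping inputs
    of dimension X.shape[1]. Each input pulls the winning unit and its
    (Gaussian-weighted) chain neighbours toward itself."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0, 1, (n_units, X.shape[1]))
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                  # decaying learning rate
        sigma = max(sigma0 * (1 - t / epochs), 0.5)  # shrinking neighbourhood
        for x in X[rng.permutation(len(X))]:
            winner = np.argmin(np.linalg.norm(W - x, axis=1))
            d = np.arange(n_units) - winner          # distance along the chain
            h = np.exp(-(d ** 2) / (2 * sigma ** 2)) # neighbourhood function
            W += lr * h[:, None] * (x - W)
    return W
```

The same update with a 2-D grid of units and 16-dimensional LPC inputs gives the phoneme-map organization mentioned later in the tutorial.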
52. PART III: ANN Models for Feature Extraction and Classification
53. Neural Network Architectures and Models for Feature Extraction
- Multilayer Feedforward Neural Network (MLFFNN)
- Autoassociative Neural Network (AANN)
- Constraint Satisfaction Model (CSM)
- Self-Organization Map (SOM)
- Time Delay Neural Network (TDNN)
- Hidden Markov Models (HMM)
54. Multilayer FFNN
- Nonlinear feature extraction followed by a linearly separable classification problem
55. Multilayer FFNN
- Complex decision hypersurfaces for classification
- Asymptotic approximation of a posteriori class probabilities
56. Radial Basis Function
- Radial Basis Function NN: clustering followed by classification
- Structure (figure): input vector a passes through basis functions φ_j(a) centered at c_1, ..., c_N, whose outputs are combined to produce the class labels
57. Autoassociative Neural Network (AANN)
- Architecture
- Nonlinear PCA
- Feature extraction
- Distribution capturing ability
58. Autoassociative Neural Network (AANN)
- Structure (figure): input layer, dimension-compression hidden layer, output layer
59. Distribution Capturing Ability of AANN
- Distribution of feature vectors (fig)
- Illustration of the distribution in the 2-D case (fig)
- Comparison with Gaussian Mixture Model (fig)
60. Distribution of Feature Vectors (figure)
61. (a) Illustration of the distribution in the 2-D case; (b, c) Comparison with Gaussian Mixture Model (figures)
62. Feature Extraction by AANN
- Input and output to AANN: sequence of signal samples (captures dominant second-order statistical features)
- Input and output to AANN: sequence of residual samples (captures higher-order statistical features in the sample sequence)
63. Constraint Satisfaction Model
- Purpose: to satisfy the given (weak) constraints as much as possible
- Structure: feedback network with units (hypotheses) and connections (constraints/knowledge)
- Goodness-of-fit function: depends on the outputs of the units and the connection weights
- Relaxation strategies: deterministic and stochastic
64. Applications of CS Models
- Combining evidence
- Combining classifier outputs
- Solving optimization problems
65. Self-Organization Map (illustrations)
- Organization of 2-D input to a 1-D feature mapping
- Organization of 16-dimensional LPC vectors to obtain a phoneme map
- Organization of large document files
66. Time Delay Neural Networks for Temporal Pattern Recognition (figure)
67. Stochastic Models for Temporal Pattern Recognition
- Maximum likelihood formulation: determine the class w, given the observation symbol sequence y, using the criterion w* = arg max_w P(y | w)
- Markov models
- Hidden Markov Models
68. PART IV: Applications in Speech and Image Processing
69. Applications in Speech and Image Processing
- Edge extraction in texture-like images
- Texture segmentation/classification by CS model
- Road detection from satellite images
- Speech recognition by CS model
- Speaker recognition by AANN model
70. Problem of Edge Extraction in Texture-like Images
- Nature of texture-like images
- Problem of edge extraction
- Preprocessing (1-D) to derive partial evidence
- Combining evidence using the CS model
71. Problem of Edge Extraction
- Texture edges are the locations where there is an abrupt change in texture properties
- Figures: image with 4 natural texture regions; edge map showing micro edges; edge map showing macro edges
72. 1-D Processing using Gabor Filter and Difference Operator
- 1-D Gabor smoothing filter: magnitude and phase
- 1-D Gabor filter: a Gaussian modulated by a complex sinusoid, with even (cosine) and odd (sine) components
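The 1-D Gabor filter described above can be written directly from its definition. A minimal numpy sketch (the parameter values and truncation at 3σ are illustrative choices): the real part is the even cosine component and the imaginary part the odd sine component, and filtering a line of samples gives the magnitude and phase used for edge evidence.

```python
import numpy as np

def gabor_1d(sigma, omega, half_width=None):
    """1-D complex Gabor filter: a Gaussian window modulated by the
    complex sinusoid exp(j*omega*n). Real part = even (cosine)
    component, imaginary part = odd (sine) component."""
    if half_width is None:
        half_width = int(3 * sigma)          # truncate the Gaussian tail
    n = np.arange(-half_width, half_width + 1)
    gauss = np.exp(-n ** 2 / (2 * sigma ** 2))
    return gauss * np.exp(1j * omega * n)

def gabor_filter_line(signal, sigma, omega):
    """Apply the filter along one line of samples; return the magnitude
    and phase of the complex output."""
    h = gabor_1d(sigma, omega)
    out = np.convolve(signal, h, mode='same')
    return np.abs(out), np.angle(out)
```

Applying a bank of such filters (different sigma and omega) along every row, then a difference operator along the columns, yields the partial edge evidence described on the following slides.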
73. 1-D Processing using Gabor Filter and Difference Operator (contd.)
- Differential operator for edge evidence: first derivative of the 1-D Gaussian function
- Need for a set (bank) of Gabor filters
74. Texture Edge Extraction using 1-D Gabor Magnitude and Phase
- Apply a 1-D Gabor filter along each of the parallel lines of the image in one direction (say, horizontal)
- Apply all Gabor filters of the filter bank in a similar way
- For each Gabor-filtered output, extract partial edge information by applying the 1-D differential operator in the orthogonal direction (say, vertical)
- Repeat the entire process with the two directions interchanged to obtain the partial edge evidence in the other direction
- Combine the partial edge evidence using a constraint satisfaction neural network model
75. Texture Edge Extraction using a Set of 1-D Gabor Filters
- Processing pipeline (figure): input image → bank of 1-D Gabor filters → filtered images → post-processing using the 1-D differential operator and thresholding → edge evidence → combining the edge evidence using the constraint satisfaction neural network model → edge map
76. Combining Evidence using the CSNN Model
- Structure of the 3-D CSNN model (figure): a 3-D lattice of size I×J×K, with +ve connections among the nodes across the layers for each pixel, and -ve connections from a set of neighboring nodes to each node in the same layer
77. Combining the Edge Evidence using the Constraint Satisfaction Neural Network (CSNN) Model
- The neural network model contains nodes arranged in a 3-D lattice structure
- Each node corresponds to a pixel in the post-processed Gabor filter output
- The post-processed output of a single 1-D Gabor filter is the input to one 2-D layer of nodes
- Different layers of nodes, each corresponding to a particular filter output, are stacked one upon the other to form the 3-D structure
- Each node represents a hypothesis
- A connection between two nodes represents a constraint
- Each node is connected to other nodes with inhibitory and excitatory connections
78. Combining Evidence using the CSNN Model (contd.)
- Let W(i,j,k; i1,j1,k) represent the weight of the connection from node (i,j,k) to node (i1,j1,k) within layer k, and let W(i,j,k; i,j,k1) represent the constraint between the nodes in two different layers (k and k1) in the same column (equations)
- Each node is connected to the other nodes in the same column with excitatory connections
79. Combining Evidence using the CSNN Model (contd.)
- Let O_{i,j,k} denote the output of node (i,j,k), and the set {O_{i,j,k}} the state of the network
- The state of the neural network model is initialized using the edge evidence
- In the deterministic relaxation method, the state of the network is updated iteratively by changing the output of one node at a time
- The net input of each node is obtained using
  U_{i,j,k}(n) = Σ W_{i,j,k; i1,j1,k} O_{i1,j1,k} + Σ W_{i,j,k; i,j,k1} O_{i,j,k1} + I_{i,j,k}
  where U_{i,j,k}(n) is the net input to node (i,j,k) at the nth iteration, and I_{i,j,k} is the external input given to node (i,j,k)
- The state of the network is updated by thresholding the net input, where θ is the threshold
80. Comparison of Edge Extraction using Gabor Magnitude and Gabor Phase
- Results (figures) for each texture image: edge maps from 1-D Gabor magnitude, 1-D Gabor phase, and a 2-D Gabor filter
81. Texture Segmentation and Classification
- Image analysis (revisited)
- Problem of texture segmentation and classification
- Preprocessing using 2-D Gabor filters to derive feature vectors
- Combining the partial evidence using the CS model
82. CS Model for Texture Classification
- Supervised and unsupervised problems
- Modeling of image constraints
- Formulation of the a posteriori probability CS model
- Hopfield neural network model and its energy function
- Deterministic and stochastic relaxation strategies
83. CS Model for Texture Classification: Modeling of Image Constraints
- Feature formation process: defined by the conditional probability of the feature vector g_s of each pixel s, given the model parameters of each class k
- Partition process: defines the probability of the label of a pixel given the labels of the pixels in its p-th order neighborhood
- Label competition process: describes the conditional probability of assigning a new label to an already labeled pixel
84. CS Model for Texture Classification: Modeling of Image Constraints (contd.)
- Formulation of the a posteriori probability (equations)
- Total energy of the system (equation)
85. CS Model for Texture Classification
- Structure (figure): an I×J×K lattice of nodes (i,j,k), one layer per class label k = 1, ..., K; +ve connections among the nodes across the layers for each pixel, and -ve connections from a set of neighboring nodes to each node in the same layer; an accompanying plot shows the energy E decreasing with the network state
86. Hopfield Neural Network and its Energy Function
- Structure (figure): nodes of the I×J×K lattice with outputs o_1, ..., o_j, ..., o_N and biases B_1, ..., B_j, ..., B_N
87. Results of Texture Classification: Natural Textures
- Figures: natural textures, initial classification, final classification
88. Results of Texture Classification: Remotely Sensed Data
- Figures: Band-2 IRS image containing 4 texture classes, initial classification, final classification
89. Results of Texture Classification: Multispectral Data
- Figures: SIR-C/X-SAR image of the Lost City of Ubar; classification using multispectral and textural information; classification using multispectral information alone
90. Speech Recognition using CS Model
- Problem of recognition of SCV units (table)
- Issues in classification of SCVs (table)
- Representation of an isolated utterance of an SCV unit
  - 60 ms before and 140 ms after the vowel onset point
  - 240-dimensional feature vector consisting of weighted cepstral coefficients
- Block diagram of the recognition system for SCV units (fig)
- CS network for classification of SCV units (fig)
91. Problem of Recognition of SCV Units (table)
92. Issues in Classification of SCVs
- Importance of SCVs
  - High frequency of occurrence: about 45%
- Main issues in classification of SCVs
  - Large number of SCV classes
  - Similarity among several SCV classes
- Models for classification of SCVs
  - Should have good discriminatory capability (artificial neural networks)
  - Should be able to handle a large number of classes (neural networks based on a modular approach)
93. Block Diagram of the Recognition System for SCV Units (figure)
94. CS Network for Classification of SCV Units
- Structure (figure): Vowel, POA and MOA feedback subnetworks; the external evidence (bias) for each node is computed using the output of the corresponding MLFFNN (e.g., MLFFNN1, MLFFNN5, MLFFNN9)
95. Classification Performance of CSM and other SCV Recognition Systems on Test Data of 80 SCV Classes (table)
96. Speaker Verification using AANN Models and Vocal Tract System Features
- One AANN for each speaker
- Verification by identification
- AANN structure: 19L 38N 4N 38N 19L
- Features: 19 weighted LPCC from 16th-order LPC for each frame of 27.5 ms, with a frame shift of 13.75 ms
- Training: pattern mode, 100 epochs, 1 min of data
- Testing: the model giving the highest confidence for 10 sec of test data
97. Speaker Recognition using Source Features
- One model for each speaker
- Structure of AANN: 40L 48N 12N 48N 40L
- Training: about 10 sec of data, 60 epochs
- Testing: select the model giving the highest confidence for 2 sec of test data
98. Other Applications
- Speech enhancement
- Speech compression
- Image compression
- Character recognition
- Stereo image matching
99. Summary and Conclusions
- Speech and image processing: natural tasks
- Significance of pattern processing
- Limitations of conventional computer architecture
- Need for new models and architectures for pattern processing tasks
- Basics of ANN
- Architectures of ANN for feature extraction and classification
- Potential of ANN for speech and image processing
100. References
1. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India, New Delhi, 1999.
2. L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, New Jersey, 1993.
3. Alan C. Bovik, Handbook of Image and Video Processing, Academic Press, 2001.
4. Xuedong Huang, Alex Acero and Hsiao-Wuen Hon, Spoken Language Processing, Prentice-Hall, New Jersey, 2001.
5. P. P. Raghu, Artificial Neural Network Models for Texture Analysis, PhD Thesis, CSE Dept., IIT Madras, 1995.
6. C. Chandra Sekhar, Neural Network Models for Recognition of Stop Consonant Vowel (SCV) Segments in Continuous Speech, PhD Thesis, CSE Dept., IIT Madras, 1996.
7. P. Kiran Kumar, Texture Edge Extraction using One-Dimensional Processing, MS Thesis, CSE Dept., IIT Madras, 2001.
8. S. P. Kishore, Speaker Verification using Autoassociative Neural Network Models, MS Thesis, CSE Dept., IIT Madras, 2000.
9. B. Yegnanarayana, K. Sharath Reddy and S. P. Kishore, "Source and System Features for Speaker Recognition using AANN Models", ICASSP, May 2001.
10. S. P. Kishore, Suryakanth V. Gangashetty and B. Yegnanarayana, "Online Text-Independent Speaker Verification System using Autoassociative Neural Network Models", INNS-IEEE Int. Conf. Neural Networks, July 2001.
11. K. Sharat Reddy, Source and System Features for Speaker Recognition, MS Thesis, CSE Dept., IIT Madras, September 2001.
12. B. Yegnanarayana and S. P. Kishore, "Autoassociative Neural Networks: An Alternative to GMM for Pattern Recognition", to appear in Neural Networks, 2002.