Title: A Frame Classifier that Provides Endpoints for Owner Speech Recorded in Multi-Channel Meetings
- Ziad Al Bawab
- Ph.D. Candidate
- Robust Speech Recognition Laboratory
- Department of Electrical and Computer Engineering
- Carnegie Mellon University
- July 14, 2005
Outline

- 1. Introduction
  - Overview of Speech Endpoint Detection
  - Challenges Faced in Multi-Channel Meetings
  - Proposed Approach
  - Motivation Behind Our Approach
- 2. Classifier Description
- 3. Segmentation Mechanism
- 4. Experimental Results
- 5. Summary and Conclusions
Overview of Speech Endpoint Detection in Multi-Channel Meetings

- Multi-channel meetings
  - Head-mounted microphones
  - A Meeting Recorder (MR) application runs on each user's laptop, recording and recognizing speech in real time
  - No information is shared across channels
- Owner speech is the speech of the participant wearing the headset
Challenges Faced in Multi-Channel Meetings

- Noise
  - Microphone (e.g. clicks, glitches)
  - Human (e.g. breath, cough, laughter)
  - Background (e.g. door, paper, phone)
- Crosstalk (i.e. speech of other participants)
Overview of Speech Endpoint Detection

- Our objective is to accurately detect the speech endpoints of each channel's Owner
Why Speech Endpointing?

- Reduce Automatic Speech Recognition (ASR) errors by accurately identifying the temporal boundaries of the speech used in recognition
- Save system resources by not processing unwanted speech (we are only interested in Owner speech)
- Provide markers for where speech occurs, for offline human transcription efforts

[Figure: Speech Recognition System]
How to?

- The traditional approach uses frame energy
  - Pros: computationally efficient for real-time implementation; performs well in quiet environments
  - Cons: not robust to noise or crosstalk
- Meeting speech endpointing requirements
  - Accurate detection of Owner speech endpoints
  - Distinguishing Owner speech from noise and crosstalk
  - Consistency of feature distributions across different channels (e.g. energy normalization)
  - Low computational cost for real-time implementation
Our Solution

- 1. Use a frame-by-frame speech classifier
  - Classify speech into four classes
  - Mel-Frequency Cepstral Coefficients (MFCCs) as features
  - Histogram-based energy normalization
  - Gaussian Mixture Model (GMM) statistical classification
- 2. Use the classification results to identify the temporal boundaries (i.e. segment endpoints)
Our Solution to Speech Endpoint Detection
Outline

- 1. Introduction
- 2. Classifier Description
  - Features Used
  - c0 Normalization
  - MAP Classification
- 3. Segmentation Mechanism
- 4. Experimental Results
- 5. Summary and Conclusions
Design

[Pipeline: Feature Extraction (MFCCs) → Normalization (Histogram-based) → Classification (GMMs) → speech classes (O, S, N, SIL) → Segmentation (State Machine) → endpoints]
Features Used for Classification

- Mel-Frequency Cepstral Coefficients (MFCCs)
  - The features most commonly used for speech recognition
  - Capture the spectral characteristics of the signal

Filter bank figure excerpted from Davis and Mermelstein [4]
MFCCs (Continued)

- 25 ms window length
- 10 ms frame shift

[Figure: overlapping analysis frames F1, F2, F3]
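As a rough illustration of the framing parameters above (25 ms windows advanced by a 10 ms shift) and of MFCC computation in general, here is a minimal NumPy sketch. The function name, FFT size, and filter count are illustrative assumptions, not the implementation used in the talk.

```python
import numpy as np

def mfcc(signal, sr=16000, win_ms=25, shift_ms=10, n_filters=26, n_ceps=13):
    """Minimal MFCC sketch: framing, power spectrum, mel filter bank, log, DCT."""
    win, shift, n_fft = sr * win_ms // 1000, sr * shift_ms // 1000, 512
    # Frame the signal: 25 ms windows advanced by a 10 ms shift
    n_frames = 1 + (len(signal) - win) // shift
    frames = np.stack([signal[i * shift:i * shift + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced uniformly on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(0, mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II of the log filter-bank energies; c0 is (up to scale) their sum
    k, n = np.arange(n_ceps)[:, None], np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return logmel @ dct.T
```

With a 16 kHz signal, a one-second input yields 98 frames of 13 coefficients each.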
MFCCs (Continued)

- c0 is the sum of the logs of the energies at the outputs of the filters
- c1 reflects the difference between the lower- and higher-frequency content of the Mel-filtered log spectrum
- Higher-order coefficients represent more rapid variations of the log spectrum with respect to frequency
Spectral Characteristics

[Figure: spectrogram, frequency vs. time]
Energy Normalization

- Different channels have different microphone gains
- The distance between the mouth and the microphone varies across speakers
- For robust classification, c0 statistics should be consistent across channels so that they match (as closely as possible) the models used in classification

[Figure: c0 histograms for Channel 1 and Channel 2]
Histogram Normalization

- Based on NIST's Signal-to-Noise Ratio (SNR) estimation routine [5]
Histogram Normalization (Continued)

[Figures: c0 histograms for Chan 1 and Chan 2, not-normalized (upper) vs. normalized (lower)]
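The slides do not spell out the routine, but a histogram-based c0 normalization can be sketched as follows: estimate the noise floor from the dominant histogram peak, estimate a speech level from an upper percentile, and map both to fixed targets. The function name, target values, and the 95th-percentile choice are assumptions for illustration, not NIST's actual SPQA algorithm.

```python
import numpy as np

def normalize_c0(c0, target_noise=-10.0, target_speech=10.0, n_bins=100):
    """Illustrative histogram-based c0 normalization (targets are arbitrary).
    Maps each channel's noise floor and speech level to common values so that
    c0 statistics become comparable across channels."""
    hist, edges = np.histogram(c0, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    noise = centers[np.argmax(hist)]   # dominant peak: silence/noise floor
    speech = np.percentile(c0, 95)     # upper percentile: speech level
    scale = (target_speech - target_noise) / max(speech - noise, 1e-6)
    return (c0 - noise) * scale + target_noise
```

Applied per channel, this pins the two histogram landmarks to the same positions, which is the effect shown in the normalized lower panels of the figure.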
GMM Classification

- Maximum a posteriori probability (MAP) decision rule
- M Gaussian densities per mixture
- 4 classes: O, S, N, SIL
- Diagonal covariance matrices
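A diagonal-covariance GMM classifier with a MAP decision rule can be sketched like this. It is a toy sketch: the class models in the usage example below are hand-set placeholders, whereas the talk's models were trained with SPHINX's EM algorithm.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """log p(x | class) for frames x (T, D) under a diagonal-covariance GMM
    with M components: weights (M,), means (M, D), variances (M, D)."""
    diff = x[:, None, :] - means[None, :, :]                       # (T, M, D)
    comp = -0.5 * np.sum(diff ** 2 / variances
                         + np.log(2 * np.pi * variances), axis=2)  # (T, M)
    comp += np.log(weights)
    m = comp.max(axis=1, keepdims=True)                            # log-sum-exp
    return (m + np.log(np.sum(np.exp(comp - m), axis=1,
                              keepdims=True))).ravel()

def classify_map(x, models, priors):
    """MAP rule: per frame, pick the class maximizing
    log p(x | class) + log P(class)."""
    classes = list(models)
    scores = np.stack([gmm_loglik(x, *models[c]) + np.log(priors[c])
                       for c in classes])                          # (C, T)
    return [classes[i] for i in np.argmax(scores, axis=0)]
```

Diagonal covariances keep both storage and per-frame likelihood evaluation linear in the feature dimension, which matters for the real-time requirement.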
Segmentation Mechanism

- Frame classifications are first smoothed using a 5-frame voting window
- Endpoints are specified by a state machine, which also controls the ASR
Simplified State Machine

[Figure: state machine diagram]
Outline

- 1. Introduction
- 2. Classifier Description
- 3. Segmentation Mechanism
- 4. Experimental Results
  - Training and Testing Meeting Data
  - Classifier Performance and Frame Error Rate (FER)
  - Segmentation Performance
    - Segment Error Rate (SER)
    - Word Error Rate (WER)
- 5. Summary and Conclusions
Evaluations

- Utterances were segmented (i.e. obtained) using the energy-based endpointer
- Models were trained with CMU's SPHINX (EM algorithm)
- User1's data had a better representation of frames from the four classes
- Testing used GMM classification as explained before
- 40% of the frames are Owner, 60% are Other (S, N, SIL)
- 441 utterances containing 141067 frames belonged to the User1 channel
Classifier Performance

- We measure the frame error rate (FER) in classifying Owner speech versus Other
- Theoretically, the probability of error is:
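The equation on this slide did not survive extraction; for a two-class Owner-vs-Other decision, the standard expression it presumably showed is the Bayes error over the two decision regions (a reconstruction, not the slide's verbatim formula):

```latex
P(\mathrm{error})
  = P(\mathrm{Other}) \int_{R_{O}} p(x \mid \mathrm{Other})\, dx
  + P(O) \int_{R_{\mathrm{Other}}} p(x \mid O)\, dx
```

where \(R_{O}\) and \(R_{\mathrm{Other}}\) are the regions of feature space assigned to each class by the decision rule.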
Theoretical Evaluation

- Using c0 only
- M = 1 Gaussian
- No classification smoothing applied
- No normalization (User1 dataset)
Including More Features

- Using all test data (FER = 25.19% with c0 only)
- Applying classification smoothing
- Varying the number of MFCCs (1 to 13)
- M = 1 Gaussian density
- Including c1 with c0 yields a 15.4% absolute and 61% relative improvement in FER
- Including normalization yields a 3.6% absolute and 37% relative improvement
- FER = 4.43% with 13 MFCCs and normalization
More Gaussian Densities per Mixture

- FER = 4.22% with four Gaussians, 13 MFCCs, and c0 normalization: a 4.74% relative improvement with respect to the one-Gaussian case
- With more densities per mixture, the models become overfitted to the training data
Frame Classifications Example

- User1 test data, M = 4 Gaussians, normalization applied, 13 MFCCs

          O      S      N    SIL
  O   24757     45    357    150
  S     226   7219    507   2953
  N     723   1010  10305   2430
  SIL  1538   4695   4980  79172

            O   OTHER
  O     24757     552
  OTHER  2487  113271

- Probability of detection: 0.9782
- False rejection: 0.0218
- Correct rejection: 0.9785
- False acceptance: 0.0215
- Frame error rate: 0.0215
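The listed rates follow directly from the collapsed 2x2 Owner-vs-Other matrix; a small helper (with an illustrative name) reproduces them:

```python
def frame_metrics(tp, fn, fp, tn):
    """Detection metrics from a 2x2 Owner-vs-Other confusion matrix.
    tp: Owner frames labeled Owner, fn: Owner frames labeled Other,
    fp: Other frames labeled Owner, tn: Other frames labeled Other."""
    return {
        "detection": tp / (tp + fn),          # Owner frames caught
        "false_rejection": fn / (tp + fn),    # Owner frames missed
        "correct_rejection": tn / (fp + tn),  # Other frames rejected
        "false_acceptance": fp / (fp + tn),   # Other frames let through
        "fer": (fn + fp) / (tp + fn + fp + tn),
    }
```

Plugging in the matrix above (24757, 552, 2487, 113271) reproduces the slide's numbers, including the 0.0215 frame error rate.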
Segmentation Techniques
- Human-Based Segmentation
- Energy-Based Segmentation
- Classifier-Based Segmentation
Human-Based Segmentation

<Sync time="4.44"/>
/h/
<Sync time="6.381"/>
<Sync time="20.896"/>
for
<Sync time="22.202"/>
<Sync time="25.075"/>
/oh/ really /uh/ maybe the i i don't know yeah /uh/
<Sync time="33.739"/>
Energy-Endptr Output

>> cont_fileseg 11025 0.3 chan3.raw
Calibrating ... done
Utt 00000000, st 0.00s, et 1.74s, seg 1.74s (samp 19184)
Utt 00000001, st 2.62s, et 11.88s, seg 9.26s (samp 102080)
Utt 00000002, st 16.24s, et 20.43s, seg 4.20s (samp 46288)
>>
Classifier-Endptr Output

>> classify_endptr means variances mixture_weights filelist mfc
chan3 (2190 Frames, 20.862 secs)
------------------------------------------------------
Utt1, Leader 0.086, Trailer 1.824
Utt2, Leader 2.743, Trailer 6.434
Utt3, Leader 6.800, Trailer 7.224
Utt4, Leader 7.743, Trailer 9.938
Utt5, Leader 12.276, Trailer 13.148
Utt6, Leader 13.971, Trailer 14.405
Utt7, Leader 16.295, Trailer 20.862
------------------------------------------------------
>>
Segmentation Performance

[Figure: examples of deletion, insertion, and correct detection]

- The reference is the hand segmentation
- A segment is defined by consecutive O frames
- Segment deletion: the number of O frames in the hypothesis is < 90% of the number of O frames in the reference
- Segment insertion: the number of O frames in the reference is < 10% of the number of O frames in the hypothesis
- Segment Error Rate (SER)
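The deletion and insertion rules above can be scored as follows. This is a sketch under assumptions: segments are inclusive frame ranges, and the SER denominator is taken to be the number of reference segments, which the slide does not state explicitly.

```python
def segment_error_rate(ref_segs, hyp_segs):
    """Score segmentation against a hand-labeled reference.
    A reference segment counts as deleted when the hypothesis covers < 90%
    of its O frames; a hypothesis segment counts as an insertion when the
    reference covers < 10% of its O frames. SER = (deletions + insertions)
    / number of reference segments (normalization assumed)."""
    ref = set().union(*(range(s, e + 1) for s, e in ref_segs))
    hyp = set().union(*(range(s, e + 1) for s, e in hyp_segs))
    deletions = sum(1 for s, e in ref_segs
                    if len(hyp & set(range(s, e + 1))) < 0.9 * (e - s + 1))
    insertions = sum(1 for s, e in hyp_segs
                     if len(ref & set(range(s, e + 1))) < 0.1 * (e - s + 1))
    return (deletions + insertions) / len(ref_segs)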
Segment Error Rate

- 27% relative improvement for the Classifier Segmentation and 30% for the Normalized Classifier Segmentation over the Energy Segmentation
Word Error Rate

- Speech recognition experiments using CMU's Sphinx
- Acoustic model composed of 3-state HMMs with 16 Gaussian mixtures per state
- 434 utterances from four speakers
- Recognize the speech between the boundaries specified by the three segmentation techniques
Word Error Rate

- 37.5% relative degradation for the Energy Segmentation and 14% for the Classifier Segmentation with respect to the Hand Segmentation
Summary and Conclusion

- We presented a frame-classifier approach for speech endpoint detection in multi-channel meetings
  - Classifies speech into four classes
  - Normalizes the energy feature for each channel separately
  - Runs in real time (less than 0.1x real time)
  - FER as low as 4.22%
- Described a segmentation mechanism that generates accurate Owner speech endpoints
  - Uses the frame-by-frame classification results
  - Outperformed the energy-based segmentation in both SER and WER
- Our endpointer outperformed and has already replaced the energy-based endpointer in the MR
References

- [1] T. Pfau, D. P. W. Ellis, and A. Stolcke, "Multispeaker speech activity detection for the ICSI Meeting Recorder," in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, Madonna di Campiglio, Italy, Dec. 2001.
- [2] S. N. Wrigley, G. J. Brown, V. Wan, and S. Renals, "Speech and crosstalk detection in multi-channel audio," to appear in IEEE Transactions on Speech and Audio Processing, 2004.
- [3] S. E. Bou-Ghazale and K. Assaleh, "A robust endpoint detection of speech for noisy environments with application to automatic speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 3808-3811, May 2002.
- [4] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, August 1980.
- [5] NIST Speech Quality Assurance (SPQA) Package Version 2.3, available from http://www.nist.gov/speech/tools/index.htm.
Questions?