Title: A Frame Classifier that Provides Endpoints for Owner Speech Recorded in Multi-Channel Meetings
- Ziad Al Bawab
- Ph.D. Candidate
- Robust Speech Recognition Laboratory
- Department of Electrical and Computer Engineering
- Carnegie Mellon University
- July 14, 2005
Outline

- 1. Introduction
  - Overview of Speech Endpoint Detection
  - Challenges Faced in Multi-Channel Meetings
  - Proposed Approach
  - Motivation Behind Our Approach
- 2. Classifier Description
- 3. Segmentation Mechanism
- 4. Experimental Results
- 5. Summary and Conclusions
Overview of Speech Endpoint Detection in Multi-Channel Meetings

- Multi-channel meetings
  - Head-mounted microphones
  - A Meeting Recorder (MR) application runs on each user's laptop, recording and recognizing speech in real time
  - No information is shared across channels
- Owner speech is the speech of the participant wearing the headset
Challenges Faced in Multi-Channel Meetings

- Noise
  - Microphone (e.g. clicks, glitches)
  - Human (e.g. breath, cough, laughter)
  - Background (e.g. door, paper, phone)
- Crosstalk (i.e. speech of other participants)
Overview of Speech Endpoint Detection

- Our objective is to accurately detect the speech endpoints of each channel's Owner
Why Speech Endpointing?

- Reduce Automatic Speech Recognition (ASR) errors by accurately identifying the temporal boundaries of the speech used in recognition
- Save system resources by not processing unwanted speech (we are only interested in Owner speech)
- Provide markers for where speech occurs, for offline human transcription efforts

[Figure: Speech Recognition System]
How to?

- The traditional approach uses frame energy
  - Pros: computationally efficient for real-time implementation; performs well in quiet environments
  - Cons: not robust to noise or crosstalk
- Meeting speech endpointing requirements
  - Accurate detection of Owner speech endpoints
  - Distinguishing Owner speech from noise and crosstalk
  - Consistency of feature distributions across different channels (e.g. energy normalization)
  - Low computational cost for real-time implementation
Our Solution

- 1. Use a frame-by-frame speech classifier
  - Classify speech into four classes
  - Mel-Frequency Cepstral Coefficients (MFCCs) as features
  - Histogram-based energy normalization
  - Gaussian Mixture Model (GMM) statistical classification
- 2. Use the classification results to identify the temporal boundaries (i.e. segment endpoints)
Our Solution to Speech Endpoint Detection
Outline

- 1. Introduction
- 2. Classifier Description
  - Features Used
  - c0 Normalization
  - MAP Classification
- 3. Segmentation Mechanism
- 4. Experimental Results
- 5. Summary and Conclusions
Design

[Pipeline: Feature Extraction (MFCCs) → Normalization (Histogram-based) → Classification (GMMs) → speech classes (O, S, N, SIL) → Segmentation (State Machine) → endpoints]
Features Used for Classification

- Mel-Frequency Cepstral Coefficients (MFCCs)
  - The features most commonly used for speech recognition
  - Capture the spectral characteristics of the signal

Filter bank figure excerpted from Davis and Mermelstein [4]
MFCCs (Continued)

- 25 ms window length
- 10 ms frame shift

[Figure: overlapping analysis frames F1, F2, F3]
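As a rough illustration of the framing parameters above (25 ms windows advanced by a 10 ms shift) and of MFCC computation in general, here is a minimal NumPy sketch. The function name, FFT size, and filter count are illustrative assumptions, not the implementation used in the talk.

```python
import numpy as np

def mfcc(signal, sr=16000, win_ms=25, shift_ms=10, n_filters=26, n_ceps=13):
    """Minimal MFCC sketch: framing, power spectrum, mel filter bank, log, DCT."""
    win, shift, n_fft = sr * win_ms // 1000, sr * shift_ms // 1000, 512
    # Frame the signal: 25 ms windows advanced by a 10 ms shift
    n_frames = 1 + (len(signal) - win) // shift
    frames = np.stack([signal[i * shift:i * shift + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced uniformly on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(0, mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II of the log filter-bank energies; c0 is (up to scale) their sum
    k, n = np.arange(n_ceps)[:, None], np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return logmel @ dct.T
```

With a 16 kHz signal, a one-second input yields 98 frames of 13 coefficients each.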
MFCCs (Continued)

- c0 is the sum of the logs of the energies at the outputs of the filters
- c1 reflects the difference between the lower- and higher-frequency content of the Mel-filtered log spectrum
- Higher-order coefficients represent more rapid variations of the log spectrum with respect to frequency
Spectral Characteristics

[Figure: spectrogram, frequency vs. time]
Energy Normalization

- Different channels have different microphone gains
- The distance between the mouth and the microphone varies across speakers
- For robust classification, c0 statistics should be consistent across channels so that they match (as closely as possible) the models used in classification

[Figure: c0 histograms for Channel 1 and Channel 2]
Histogram Normalization

- Based on NIST's Signal-to-Noise Ratio (SNR) estimation routine [5]
Histogram Normalization (Continued)

[Figures: c0 histograms for Chan 1 and Chan 2, not-normalized (upper) vs. normalized (lower)]
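The slides do not spell out the routine, but a histogram-based c0 normalization can be sketched as follows: estimate the noise floor from the dominant histogram peak, estimate a speech level from an upper percentile, and map both to fixed targets. The function name, target values, and the 95th-percentile choice are assumptions for illustration, not NIST's actual SPQA algorithm.

```python
import numpy as np

def normalize_c0(c0, target_noise=-10.0, target_speech=10.0, n_bins=100):
    """Illustrative histogram-based c0 normalization (targets are arbitrary).
    Maps each channel's noise floor and speech level to common values so that
    c0 statistics become comparable across channels."""
    hist, edges = np.histogram(c0, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    noise = centers[np.argmax(hist)]   # dominant peak: silence/noise floor
    speech = np.percentile(c0, 95)     # upper percentile: speech level
    scale = (target_speech - target_noise) / max(speech - noise, 1e-6)
    return (c0 - noise) * scale + target_noise
```

Applied per channel, this pins the two histogram landmarks to the same positions, which is the effect shown in the normalized lower panels of the figure.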
GMM Classification

- Maximum a posteriori probability (MAP) decision rule
- M Gaussian densities per mixture
- 4 classes: O, S, N, SIL
- Diagonal covariance matrices
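A diagonal-covariance GMM classifier with a MAP decision rule can be sketched like this. It is a toy sketch: the class models in the usage example below are hand-set placeholders, whereas the talk's models were trained with SPHINX's EM algorithm.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """log p(x | class) for frames x (T, D) under a diagonal-covariance GMM
    with M components: weights (M,), means (M, D), variances (M, D)."""
    diff = x[:, None, :] - means[None, :, :]                       # (T, M, D)
    comp = -0.5 * np.sum(diff ** 2 / variances
                         + np.log(2 * np.pi * variances), axis=2)  # (T, M)
    comp += np.log(weights)
    m = comp.max(axis=1, keepdims=True)                            # log-sum-exp
    return (m + np.log(np.sum(np.exp(comp - m), axis=1,
                              keepdims=True))).ravel()

def classify_map(x, models, priors):
    """MAP rule: per frame, pick the class maximizing
    log p(x | class) + log P(class)."""
    classes = list(models)
    scores = np.stack([gmm_loglik(x, *models[c]) + np.log(priors[c])
                       for c in classes])                          # (C, T)
    return [classes[i] for i in np.argmax(scores, axis=0)]
```

Diagonal covariances keep both storage and per-frame likelihood evaluation linear in the feature dimension, which matters for the real-time requirement.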
Segmentation Mechanism

- Frame classifications are first smoothed using a 5-frame voting window
- Endpoints are specified by a state machine, which also controls the ASR
Simplified State Machine

[Figure: state machine diagram]
Outline

- 1. Introduction
- 2. Classifier Description
- 3. Segmentation Mechanism
- 4. Experimental Results
  - Training and Testing Meeting Data
  - Classifier Performance and Frame Error Rate (FER)
  - Segmentation Performance
    - Segment Error Rate (SER)
    - Word Error Rate (WER)
- 5. Summary and Conclusions
Evaluations

- Utterances were segmented (i.e. obtained) using the energy-based endpointer
- Models were trained with CMU's SPHINX (EM algorithm)
- User1's data had a better representation of frames from the four classes
- Testing used GMM classification as explained before
- 40% of the frames are Owner, 60% are Other (S, N, SIL)
- 441 utterances containing 141067 frames belonged to the User1 channel
Classifier Performance

- We measure the frame error rate (FER) in classifying Owner speech versus Other
- Theoretically, the probability of error is:
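The equation on this slide did not survive extraction; for a two-class Owner-vs-Other decision, the standard expression it presumably showed is the Bayes error over the two decision regions (a reconstruction, not the slide's verbatim formula):

```latex
P(\mathrm{error})
  = P(\mathrm{Other}) \int_{R_{O}} p(x \mid \mathrm{Other})\, dx
  + P(O) \int_{R_{\mathrm{Other}}} p(x \mid O)\, dx
```

where \(R_{O}\) and \(R_{\mathrm{Other}}\) are the regions of feature space assigned to each class by the decision rule.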
Theoretical Evaluation

- Using c0 only
- M = 1 Gaussian
- No classification smoothing applied
- No normalization (User1 dataset)
Including More Features

- Using all test data (FER = 25.19% with c0 only)
- Applying classification smoothing
- Varying the number of MFCCs (1 to 13)
- M = 1 Gaussian density
- Including c1 with c0 yields a 15.4% absolute and 61% relative improvement in FER
- Including normalization yields a 3.6% absolute and 37% relative improvement
- FER = 4.43% with 13 MFCCs and normalization
More Gaussian Densities per Mixture

- FER = 4.22% with four Gaussians, 13 MFCCs, and c0 normalization: a 4.74% relative improvement with respect to the one-Gaussian case
- With more densities per mixture, the models become overfitted to the training data
Frame Classifications Example

- User1 test data, M = 4 Gaussians, normalization applied, 13 MFCCs

          O      S      N    SIL
  O   24757     45    357    150
  S     226   7219    507   2953
  N     723   1010  10305   2430
  SIL  1538   4695   4980  79172

            O   OTHER
  O     24757     552
  OTHER  2487  113271

- Probability of detection: 0.9782
- False rejection: 0.0218
- Correct rejection: 0.9785
- False acceptance: 0.0215
- Frame error rate: 0.0215
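The listed rates follow directly from the collapsed 2x2 Owner-vs-Other matrix; a small helper (with an illustrative name) reproduces them:

```python
def frame_metrics(tp, fn, fp, tn):
    """Detection metrics from a 2x2 Owner-vs-Other confusion matrix.
    tp: Owner frames labeled Owner, fn: Owner frames labeled Other,
    fp: Other frames labeled Owner, tn: Other frames labeled Other."""
    return {
        "detection": tp / (tp + fn),          # Owner frames caught
        "false_rejection": fn / (tp + fn),    # Owner frames missed
        "correct_rejection": tn / (fp + tn),  # Other frames rejected
        "false_acceptance": fp / (fp + tn),   # Other frames let through
        "fer": (fn + fp) / (tp + fn + fp + tn),
    }
```

Plugging in the matrix above (24757, 552, 2487, 113271) reproduces the slide's numbers, including the 0.0215 frame error rate.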
Segmentation Techniques
- Human-Based Segmentation
- Energy-Based Segmentation
- Classifier-Based Segmentation
Human-Based Segmentation

<Sync time="4.44"/>
/h/
<Sync time="6.381"/>
<Sync time="20.896"/>
for
<Sync time="22.202"/>
<Sync time="25.075"/>
/oh/ really /uh/ maybe the i i don't know yeah /uh/
<Sync time="33.739"/>
Energy-Endptr Output

>> cont_fileseg 11025 0.3 chan3.raw
Calibrating ... done
Utt 00000000, st 0.00s, et 1.74s, seg 1.74s (samp 19184)
Utt 00000001, st 2.62s, et 11.88s, seg 9.26s (samp 102080)
Utt 00000002, st 16.24s, et 20.43s, seg 4.20s (samp 46288)
>>
Classifier-Endptr Output

>> classify_endptr means variances mixture_weights filelist mfc
chan3 (2190 Frames, 20.862 secs)
------------------------------------------------------
Utt1, Leader 0.086, Trailer 1.824
Utt2, Leader 2.743, Trailer 6.434
Utt3, Leader 6.800, Trailer 7.224
Utt4, Leader 7.743, Trailer 9.938
Utt5, Leader 12.276, Trailer 13.148
Utt6, Leader 13.971, Trailer 14.405
Utt7, Leader 16.295, Trailer 20.862
------------------------------------------------------
>>
Segmentation Performance

[Figure: examples of deletion, insertion, and correct detection]

- The reference is the hand segmentation
- A segment is defined by consecutive O frames
- Segment deletion: the number of O frames in the hypothesis is < 90% of the number of O frames in the reference
- Segment insertion: the number of O frames in the reference is < 10% of the number of O frames in the hypothesis
- Segment Error Rate (SER)
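The deletion and insertion rules above can be scored as follows. This is a sketch under assumptions: segments are inclusive frame ranges, and the SER denominator is taken to be the number of reference segments, which the slide does not state explicitly.

```python
def segment_error_rate(ref_segs, hyp_segs):
    """Score segmentation against a hand-labeled reference.
    A reference segment counts as deleted when the hypothesis covers < 90%
    of its O frames; a hypothesis segment counts as an insertion when the
    reference covers < 10% of its O frames. SER = (deletions + insertions)
    / number of reference segments (normalization assumed)."""
    ref = set().union(*(range(s, e + 1) for s, e in ref_segs))
    hyp = set().union(*(range(s, e + 1) for s, e in hyp_segs))
    deletions = sum(1 for s, e in ref_segs
                    if len(hyp & set(range(s, e + 1))) < 0.9 * (e - s + 1))
    insertions = sum(1 for s, e in hyp_segs
                     if len(ref & set(range(s, e + 1))) < 0.1 * (e - s + 1))
    return (deletions + insertions) / len(ref_segs)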
Segment Error Rate

- 27% relative improvement for the Classifier Segmentation and 30% for the Normalized Classifier Segmentation over the Energy Segmentation
Word Error Rate

- Speech recognition experiments using CMU's Sphinx
- Acoustic model composed of 3-state HMMs with 16 Gaussian mixtures per state
- 434 utterances from four speakers
- Recognize the speech between the boundaries specified by the three segmentation techniques
Word Error Rate

- 37.5% relative degradation for the Energy Segmentation and 14% for the Classifier Segmentation with respect to the Hand Segmentation
Summary and Conclusion

- We presented a frame-classifier approach for speech endpoint detection in multi-channel meetings
  - Classifies speech into four classes
  - Normalizes the energy feature for each channel separately
  - Runs in real time (less than 0.1x real time)
  - FER as low as 4.22%
- Described a segmentation mechanism that generates accurate Owner speech endpoints
  - Uses the frame-by-frame classification results
  - Outperformed the energy-based segmentation in both SER and WER
- Our endpointer outperformed and has already replaced the energy-based endpointer in the MR
References

- [1] T. Pfau, D. P. W. Ellis, and A. Stolcke, "Multispeaker speech activity detection for the ICSI Meeting Recorder," in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, Madonna di Campiglio, Italy, Dec. 2001.
- [2] S. N. Wrigley, G. J. Brown, V. Wan, and S. Renals, "Speech and crosstalk detection in multi-channel audio," to appear in IEEE Transactions on Speech and Audio Processing, 2004.
- [3] S. E. Bou-Ghazale and K. Assaleh, "A robust endpoint detection of speech for noisy environments with application to automatic speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 3808-3811, May 2002.
- [4] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, August 1980.
- [5] NIST Speech Quality Assurance (SPQA) Package Version 2.3, available from http://www.nist.gov/speech/tools/index.htm.
Questions?