1
  • A Frame Classifier that Provides Endpoints for
    Owner Speech Recorded in Multi-Channel Meetings
  • Ziad Al Bawab
  • Ph.D. Candidate
  • Robust Speech Recognition Laboratory
  • Department of Electrical and Computer Engineering
  • Carnegie Mellon University
  • July 14, 2005

2
Outline
  • 1. Introduction
  • Overview of Speech Endpoint Detection
  • Challenges Faced in Multi-Channel Meetings
  • Proposed Approach
  • Motivation Behind Our Approach
  • 2. Classifier Description
  • 3. Segmentation Mechanism
  • 4. Experimental Results
  • 5. Summary and Conclusions

3
Overview of Speech Endpoint Detection in
Multi-Channel Meetings
  • Multi-channel Meetings
  • Head-mounted Microphones
  • Meeting Recorder (MR) application running on each
    user's laptop, recording and recognizing speech
    in real time
  • No information is shared across channels
  • Owner speech is the speech of the participant
    wearing the headset

4
Challenges Faced in Multi-Channel Meetings
  • Noise
  • Microphone (e.g. clicks, glitches, etc.)
  • Human (e.g. breath, cough, laughter, etc.)
  • Background (e.g. door, paper, phone, etc.)
  • Crosstalk (i.e. speech of other participants)

5
Overview of Speech Endpoint Detection
6
Overview of Speech Endpoint Detection
7
Overview of Speech Endpoint Detection
  • Our objective is to accurately detect the speech
    endpoints of each channel's Owner

8
Why Speech Endpointing?
  • Reduce Automatic Speech Recognition (ASR) Errors
    by accurately identifying the temporal boundaries
    of speech used in recognition
  • Save system resources by not processing unwanted
    speech (we are only interested in Owner speech)
  • Provide marking information on where speech
    occurs for offline human transcription efforts

Speech Recognition System
9
How to?
  • Traditional approaches use frame energy (a
    minimal sketch follows this list)
  • pros - computationally efficient for real-time
    implementation
  • - performs well in quiet environments
  • cons - not robust to noise or crosstalk
  • Meeting speech endpointing requirements
  • Accurate detection of Owner speech endpoints
  • Distinguishing Owner speech from noise and
    crosstalk
  • Consistency of feature distributions across
    channels (e.g. energy normalization)
  • Low computational cost for real-time
    implementation
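For contrast with the classifier approach, here is a
minimal Python sketch of an energy-threshold
endpointer. The window and shift match the framing
parameters stated later in the deck; the threshold
and its placement relative to the peak are
illustrative assumptions, not the MR's actual routine.

    import numpy as np

    def energy_endpoints(signal, sr, win_ms=25, shift_ms=10, thresh_db=-30.0):
        """Mark frames whose log energy is within thresh_db of the peak
        as speech (threshold placement is an illustrative assumption)."""
        win = int(sr * win_ms / 1000)
        shift = int(sr * shift_ms / 1000)
        n_frames = 1 + max(0, (len(signal) - win) // shift)
        log_e = np.empty(n_frames)
        for i in range(n_frames):
            frame = np.asarray(signal[i * shift: i * shift + win], dtype=float)
            log_e[i] = 10 * np.log10(np.sum(frame ** 2) + 1e-10)
        return log_e > (log_e.max() + thresh_db)   # True = speech frame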

10
Our Solution
  • 1. Use a frame-by-frame speech classifier
  • Classify speech into four classes
  • Mel-Frequency Cepstral Coefficients (MFCCs) for
    features
  • Histogram-based energy normalization
  • Gaussian Mixture Models (GMMs) statistical
    classification
  • 2. Use the classification results to identify the
    temporal boundaries (i.e. segment endpoints)

11
Our Solution to Speech Endpoint Detection
12
Our Solution to Speech Endpoint Detection
13
Outline
  • 1. Introduction
  • 2. Classifier Description
  • Features Used
  • c0 Normalization
  • MAP Classification
  • 3. Segmentation Mechanism
  • 4. Experimental Results
  • 5. Summary and Conclusions

14
Design
speech → Feature Extraction (MFCCs) → Normalization
(Histogram-based) → Classification (GMMs) → O, S, N,
SIL → Segmentation (State Machine) → endpoints
15
Features Used for Classification
  • Mel-Frequency Cepstral Coefficients (MFCCs)
  • Features most commonly used for speech
    recognition
  • Capture the spectral characteristics of the
    signal

Filter bank figure is excerpted from Davis and
Mermelstein [4]
16
MFCCs (Continued)
  • Framing
  • 25 msecs Window Length
  • 10 msecs Frame Shift

Framing figure: overlapping frames F1, F2, F3 (see
the sketch below)
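A short sketch of the framing step with the stated
25 ms window and 10 ms shift; the Hamming taper is a
standard assumption, not stated in the slides.

    import numpy as np

    def frame_signal(signal, sr, win_ms=25, shift_ms=10):
        """Slice a signal into overlapping frames F1, F2, F3, ...
        Each frame is 25 ms long; consecutive frames start 10 ms apart."""
        win = int(sr * win_ms / 1000)
        shift = int(sr * shift_ms / 1000)
        n_frames = 1 + (len(signal) - win) // shift
        frames = np.stack([signal[i * shift: i * shift + win]
                           for i in range(n_frames)])
        return frames * np.hamming(win)   # assumed standard tapering window

    # At the deck's 11025 Hz sampling rate, a 25 ms window is 275
    # samples and a 10 ms shift is 110 samples.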
17
Features Used for Classification
  • Mel-Frequency Cepstral Coefficients (MFCCs)
  • Features most commonly used for speech
    recognition
  • Capture the spectral characteristics of the
    signal

Filter bank figure is excerpted from Davis and
Mermelstein [4]
18
MFCCs (Continued)
  • c0 is the sum of the log of the energy at the
    output of each filter
  • c1 reflects the difference between lower and
    higher frequency content of the Mel-filtered log
    spectrum
  • Higher-order coefficients represent more rapid
    variations in the log spectrum with respect to
    frequency (see the DCT sketch below)
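These properties follow from the DCT that produces
the cepstral coefficients; a minimal sketch, assuming
the Mel filter-bank log energies are already
available:

    import numpy as np

    def ceps_from_log_energies(log_E, n_ceps=13):
        """DCT of the K Mel filter-bank log energies.
        For n = 0 the cosine is 1 everywhere, so c0 is the plain sum of
        the log energies; for n = 1 low filters get positive weight and
        high filters negative weight, so c1 contrasts low- vs.
        high-frequency content; larger n picks up faster variation
        across frequency."""
        K = len(log_E)
        k = np.arange(K) + 0.5
        return np.array([np.sum(log_E * np.cos(n * np.pi * k / K))
                         for n in range(n_ceps)])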

19
Spectral Characteristics
(Spectrogram figure: frequency vs. time)
20
Design
speech → Feature Extraction (MFCCs) → Normalization
(Histogram-based) → Classification (GMMs) → O, S, N,
SIL → Segmentation (State Machine) → endpoints
21
Energy Normalization
  • Different channels have different microphone
    gains
  • Distance between the mouth and the microphone
    varies with different speakers
  • For robust classification, c0 statistics should
    be consistent across channels to match (as
    closely as possible) the models used in
    classification

Channel 2 c0 Histogram
Channel 1 c0 Histogram
22
Histogram Normalization
  • Based on NIST's Signal-to-Noise Ratio (SNR)
    estimation routine [5] (a minimal sketch of the
    idea follows this slide)
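The slides do not reproduce the NIST routine itself;
the sketch below shows the general idea under assumed
parameters: pick a noise-floor point and a
speech-peak point from each channel's c0 histogram
and map them linearly onto common targets so the
per-channel c0 distributions line up.

    import numpy as np

    def normalize_c0(c0, lo_pct=5.0, hi_pct=95.0, target_lo=0.0, target_hi=1.0):
        """Map the low (noise-floor) and high (speech-peak) percentiles
        of a channel's c0 values onto fixed targets (all four values
        are illustrative assumptions)."""
        lo, hi = np.percentile(c0, [lo_pct, hi_pct])
        return (c0 - lo) / (hi - lo) * (target_hi - target_lo) + target_lo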

23
Histogram Normalization (Continued)
24
Histogram Normalization (Continued)
Chan 1, Not-Normalized (upper), Normalized (lower)
Chan 2, Not-Normalized (upper), Normalized (lower)
25
Design
speech → Feature Extraction (MFCCs) → Normalization
(Histogram-based) → Classification (GMMs) → O, S, N,
SIL → Segmentation (State Machine) → endpoints
26
GMM Classification
  • Maximum a posteriori probability (MAP) decision
    rule: pick the class c maximizing P(c) p(x | c)
  • M Gaussian densities per mixture
  • 4 classes: O, S, N, SIL
  • Diagonal covariance matrices (a sketch follows
    this list)
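A sketch of this classifier using scikit-learn in
place of the SPHINX-trained models; uniform class
priors are an assumption (the deck does not state the
priors used).

    import numpy as np
    from sklearn.mixture import GaussianMixture

    CLASSES = ["O", "S", "N", "SIL"]

    def train_gmms(features_by_class, M=4):
        """Fit one M-component, diagonal-covariance GMM per class.
        features_by_class maps each class label to an
        (n_frames, n_mfcc) array of training features."""
        return {c: GaussianMixture(n_components=M, covariance_type="diag").fit(X)
                for c, X in features_by_class.items()}

    def map_classify(gmms, X, log_priors=None):
        """MAP rule: argmax_c [ log p(x | c) + log P(c) ] per frame."""
        if log_priors is None:
            log_priors = {c: 0.0 for c in CLASSES}   # uniform-prior assumption
        scores = np.stack([gmms[c].score_samples(X) + log_priors[c]
                           for c in CLASSES])
        return [CLASSES[i] for i in scores.argmax(axis=0)]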

27
Design
Features Extraction (MFCCs)
Normalization (Histogram-based)
Classification (GMMs)
speech
Segmentation (State Machine)
endpoints
O, S, N, SIL
28
Segmentation Mechanism
  • Frame classifications are first smoothed using a
    5-frame voting window
  • Endpoints are specified using a state machine,
    which also controls ASR (a sketch follows the
    next slide)

29
Simplified State Machine
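The state-machine figure is not reproduced here;
below is a minimal Python sketch of both steps:
majority voting over the 5-frame window, then a
simplified two-state segmenter that opens an Owner
segment after a run of O frames and closes it after a
run of non-O frames. The 3-frame entry and 10-frame
exit run lengths are assumptions, and the real
machine also gates the recognizer.

    from collections import Counter

    def smooth(labels, win=5):
        """Majority vote over a centered 5-frame window."""
        h = win // 2
        return [Counter(labels[max(0, i - h): i + h + 1]).most_common(1)[0][0]
                for i in range(len(labels))]

    def owner_segments(labels, shift_s=0.010, enter_n=3, exit_n=10):
        """Turn smoothed frame labels into (start_s, end_s) Owner segments."""
        segs, in_speech, run, start = [], False, 0, 0
        for i, lab in enumerate(labels):
            if not in_speech:
                run = run + 1 if lab == "O" else 0
                if run >= enter_n:                 # enough O frames: open
                    in_speech, start, run = True, i - enter_n + 1, 0
            else:
                run = run + 1 if lab != "O" else 0
                if run >= exit_n:                  # enough non-O frames: close
                    segs.append((start * shift_s, (i - exit_n + 1) * shift_s))
                    in_speech, run = False, 0
        if in_speech:                              # segment still open at EOF
            segs.append((start * shift_s, len(labels) * shift_s))
        return segs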
30
Outline
  • 1. Introduction
  • 2. Classifier Description
  • 3. Segmentation Mechanism
  • 4. Experimental Results
  • Training and Testing Meeting Data
  • Classifier Performance and Frame Error Rate (FER)
  • Segmentation Performance
  • Segment Error Rate (SER)
  • Word Error Rate (WER)
  • 5. Summary and Conclusions

31
Evaluations
  • Utterances were segmented (i.e. obtained) using
    the energy-based endpointer
  • Training the models was done using CMU's SPHINX
    (EM algorithm)
  • User1 data had a better representation of frames
    from the four classes
  • Testing used GMM classification as explained
    before
  • 40% of frames are Owner, 60% are Other (S, N,
    SIL)
  • 441 utterances containing 141067 frames belonged
    to the User1 channel

32
Classifier Performance
  • We measure the frame error rate (FER) in
    classifying Owner speech versus Other
  • Theoretically, the probability of error is the
    Bayes error of the two-class MAP rule:
    P(error) = ∫ min( P(O) p(x|O), P(Other) p(x|Other) ) dx

33
Theoretical Evaluation
  • Using c0 only
  • M = 1 Gaussian
  • No classification smoothing applied
  • No normalization (User1 dataset)

34
Including More Features
  • Using all test data (FER = 25.19% with c0 only)
  • Applying classification smoothing
  • Varying the number of MFCCs (1-13)
  • M = 1 Gaussian density
  • Including c1 and c0 results in a 15.4% absolute
    and 61% relative improvement in FER
  • Including normalization results in a 3.6% absolute
    and 37% relative improvement
  • FER = 4.43% with 13 MFCCs and normalization

35
More Gaussian Densities per Mixture
  • FER = 4.22% with four Gaussians, 13 MFCCs, and
    c0 normalization: a 4.74% relative improvement
    with respect to the one-Gaussian case
  • With more densities per mixture, the models
    become overfitted to the training data

36
Frame Classifications Example
  • Four-class confusion matrix (rows: reference,
    columns: hypothesis)

           O      S      N     SIL
    O   24757     45    357    150
    S     226   7219    507   2953
    N     723   1010  10305   2430
    SIL  1538   4695   4980  79172

  • Collapsed to Owner vs. Other

               O   OTHER
    O      24757     552
    OTHER   2487  113271

  • Prob. of Detection 0.9782
  • False Rejection 0.0218
  • Correct Rejection 0.9785
  • False Acceptance 0.0215
  • Frame Error Rate 0.0215
  • Conditions: User1 test data, M = 4 Gaussians,
    normalization applied, 13 MFCCs
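The detection and rejection figures above follow
directly from the collapsed 2×2 counts; a minimal
check in Python:

    def owner_metrics(o_o, o_other, other_o, other_other):
        """Rates from the Owner-vs-Other confusion counts."""
        p_det = o_o / (o_o + o_other)                       # 0.9782
        corr_rej = other_other / (other_o + other_other)    # 0.9785
        total = o_o + o_other + other_o + other_other
        fer = (o_other + other_o) / total                   # 0.0215
        return p_det, 1 - p_det, corr_rej, 1 - corr_rej, fer

    print(owner_metrics(24757, 552, 2487, 113271))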

37
Segmentation Techniques
  • Human-Based Segmentation
  • Energy-Based Segmentation
  • Classifier-Based Segmentation

38
Human-Based Segmentation
  • <Sync time="4.44"/>
  • /h/
  • <Sync time="6.381"/>
  • <Sync time="20.896"/>
  • for
  • <Sync time="22.202"/>
  • <Sync time="25.075"/>
  • /oh/ really /uh/ maybe the i i don't know yeah
    /uh/
  • <Sync time="33.739"/>
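The hand segmentation uses Transcriber-style Sync
tags; a small regex sketch (how the MR tooling
actually reads these files is not shown in the deck)
to pull out the time stamps:

    import re

    def sync_times(transcript_text):
        """Return the <Sync time="..."/> stamps, in seconds, in file order."""
        return [float(t) for t in
                re.findall(r'<Sync time="([0-9.]+)"/>', transcript_text)]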

39
Energy-Endptr Output
  • >> cont_fileseg 11025 0.3 chan3.raw
  • Calibrating ... done
  • Utt 00000000, st 0.00s, et 1.74s, seg 1.74s
    (samp 19184)
  • Utt 00000001, st 2.62s, et 11.88s, seg 9.26s
    (samp 102080)
  • Utt 00000002, st 16.24s, et 20.43s, seg 4.20s
    (samp 46288)
  • >>

40
Classifier-Endptr Output
  • >> classify_endptr means variances
    mixture_weights filelist mfc
  • chan3 (2190 Frames, 20.862 secs)
  • ------------------------------------------------
  • Utt1, Leader 0.086, Trailer 1.824
  • Utt2, Leader 2.743, Trailer 6.434
  • Utt3, Leader 6.800, Trailer 7.224
  • Utt4, Leader 7.743, Trailer 9.938
  • Utt5, Leader 12.276, Trailer 13.148
  • Utt6, Leader 13.971, Trailer 14.405
  • Utt7, Leader 16.295, Trailer 20.862
  • ------------------------------------------------
  • >>

41
Segmentation Performance
(Figure: examples of segment deletion, insertion, and
correct detection against the reference)
  • Reference is the hand segmentation
  • A segment is defined by consecutive O frames
  • Segment deletion: number of O frames in the
    hypothesis < 90% of the number of O frames in the
    reference segment
  • Segment insertion: number of O frames in the
    reference < 10% of the number of O frames in the
    hypothesis segment
  • Segment Error Rate (SER); a counting sketch
    follows this list
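A sketch of the deletion/insertion counting under
those 90% / 10% rules, assuming reference and
hypothesis are given as per-frame O/not-O labels on a
common time base:

    def segment_counts(ref, hyp):
        """Count deleted and inserted Owner segments.
        ref, hyp: lists of booleans, True where the frame is labeled O."""
        def runs(lbl):                       # maximal runs of O frames
            out, start = [], None
            for i, v in enumerate(lbl + [False]):
                if v and start is None:
                    start = i
                elif not v and start is not None:
                    out.append((start, i))
                    start = None
            return out

        def o_inside(seg, other):            # O frames of `other` inside seg
            return sum(other[i] for i in range(*seg))

        deletions = sum(o_inside(s, hyp) < 0.9 * (s[1] - s[0])
                        for s in runs(ref))
        insertions = sum(o_inside(s, ref) < 0.1 * (s[1] - s[0])
                         for s in runs(hyp))
        return deletions, insertions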

42
Segment Error Rate
  • 27% relative improvement in SER for the
    Classifier Segmentation, and 30% for the
    Normalized Classifier Segmentation, over the
    Energy Segmentation

43
Word Error Rate
  • Speech recognition experiments using CMU's Sphinx
  • Acoustic model composed of 3-state HMMs with
    16-Gaussian mixtures per state
  • 434 utterances from four speakers
  • Recognize the speech between the boundaries
    specified by the three segmentation techniques

44
Word Error Rate
  • 37.5% relative degradation in WER for the Energy
    Segmentation, and 14% for the Classifier
    Segmentation, relative to the Hand Segmentation

45
Summary and Conclusion
  • We presented a frame classifier approach for
    speech endpoint detection in multi-channel
    meetings
  • Classifies speech into four classes
  • Normalizes the energy feature for each channel
    separately
  • Runs in real time (less than 0.1× real time)
  • FER as low as 4.22%
  • Described a segmentation mechanism that generates
    accurate Owner speech endpoints
  • Uses the frame-by-frame classification results
  • Outperformed the energy-based segmentation in
    both SER and WER
  • Our endpointer has already replaced the
    energy-based endpointer in the MR application

46
References
  • [1] T. Pfau, D. P. W. Ellis, and A. Stolcke,
    "Multispeaker speech activity detection for the
    ICSI Meeting Recorder," in Proceedings of the
    IEEE Automatic Speech Recognition and
    Understanding Workshop, Madonna di Campiglio,
    Italy, Dec. 2001.
  • [2] S. N. Wrigley, G. J. Brown, V. Wan, and S.
    Renals, "Speech and crosstalk detection in
    multi-channel audio," to appear in IEEE
    Transactions on Speech and Audio Processing
    (2004).
  • [3] S. E. Bou-Ghazale and K. Assaleh, "A robust
    endpoint detection of speech for noisy
    environments with application to automatic speech
    recognition," in Proceedings of the IEEE
    International Conference on Acoustics, Speech,
    and Signal Processing, vol. 4, pp. 3808-3811,
    May 2002.
  • [4] S. B. Davis and P. Mermelstein, "Comparison
    of parametric representations for monosyllabic
    word recognition in continuously spoken
    sentences," IEEE Transactions on Acoustics,
    Speech, and Signal Processing, vol. 28, no. 4,
    pp. 357-366, August 1980.
  • [5] NIST Speech Quality Assurance (SPQA) Package
    Version 2.3, available from
    http://www.nist.gov/speech/tools/index.htm

47
Questions?