Computer Vision, Speech Communication - PowerPoint PPT Presentation

Slides: 62
Provided by: cvspC
Transcript and Presenter's Notes

Title: Computer Vision, Speech Communication


1
HIWIRE
Computer Vision, Speech Communication and Signal
Processing Research Group
2
HIWIRE Involved CVSP Members
  • Group Leader Prof. Petros Maragos
  • Ph.D. Students / Graduate Research Assistants
  • D. Dimitriadis (speech recognition,
    modulations)
  • V. Pitsikalis (speech recognition,
    fractals/chaos, NLP)
  • A. Katsamanis (speech modulations, statistical
    processing, recognition)
  • G. Papandreou (vision PDEs, active contours,
    level sets, AV-ASR)
  • G. Evangelopoulos (vision/speech texture,
    modulations, fractals)
  • S. Leykimiatis (speech statistical processing,
    microphone arrays)

3
ICCS-NTUA in HIWIRE 1st Year
  • Evaluation
  • Databases Completed
  • Baseline Completed
  • WP1
  • Noise Robust Features Results 1st Year
  • Audio-Visual ASR Baseline Visual Features
  • Multi-microphone array Exploratory Phase
  • VAD Prelim. Results
  • WP2
  • Speaker Normalization Baseline
  • Non-native Speech Database Completed

4
ICCS-NTUA in HIWIRE 1st Year
  • Evaluation
  • Databases Completed
  • Baseline Completed
  • WP1
  • Noise Robust Features Results 1st Year
  • Modulation Features Results 1st Year
  • Fractal Features Results 1st Year
  • Audio-Visual ASR Baseline Visual Features
  • Multi-microphone array Exploratory Phase
  • VAD Prelim. Results
  • WP2
  • Speaker Normalization Baseline
  • Non-native Speech Database Completed

5
WP1 Noise Robustness
  • Platform HTK
  • Baseline Evaluation
  • Aurora 2, Aurora 3, TIMIT+NOISE
  • Modulation Features
  • AM-FM Modulations
  • Teager Energy Cepstrum
  • Fractal Features
  • Dynamical Denoising
  • Correlation Dimension
  • Multiscale Fractal Dimension
  • Hybrid-Merged Features

up to 62% (Aurora 3)
up to 36% (Aurora 2)
up to 61% (Aurora 2)
6
ICCS-NTUA in HIWIRE 1st Year
  • Evaluation
  • Databases Completed
  • Baseline Completed
  • WP1
  • Noise Robust Features Results 1st Year
  • Speech Modulation Features Results 1st Year
  • Fractal Features Results 1st Year
  • Audio-Visual ASR Baseline Visual
    Features
  • Multi-microphone array Exploratory Phase
  • VAD Prelim. Results
  • WP2
  • Speaker Normalization Baseline
  • Non-native Speech Database Completed

7
Speech Modulation Features
  • Filterbank Design
  • Short-Term AM-FM Modulation Features
  • Short-Term Mean Inst. Amplitude
    IA-Mean
  • Short-Term Mean Inst. Frequency IF-Mean
  • Frequency Modulation Percentages FMP
  • Short-Term Energy Modulation Features
  • Average Teager Energy, Cepstrum Coef. TECC
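
The energy features above build on the Teager-Kaiser energy operator, psi[x](n) = x(n)^2 - x(n-1) x(n+1). As a minimal sketch (not the group's implementation; the function names and the test tone are illustrative assumptions), the operator and an energy-separation estimate of amplitude and frequency, which is exact for a pure sinusoid, might look like this:

```python
import numpy as np

def teager_energy(x):
    """Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1] (interior samples only)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def energy_separation(x):
    """Estimate amplitude and frequency (rad/sample) from the Teager energies of the
    signal and of its backward difference. Assumes a locally sinusoidal signal with
    non-zero energy."""
    x = np.asarray(x, dtype=float)
    psi_x = teager_energy(x)[1:]        # samples n = 2 .. N-2
    psi_y = teager_energy(np.diff(x))   # y[n] = x[n] - x[n-1], same sample range
    cos_omega = np.clip(1.0 - psi_y / (2.0 * psi_x), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    amplitude = np.sqrt(psi_x / (1.0 - cos_omega ** 2))
    return amplitude, omega

# Illustrative test tone (hypothetical values): A = 0.7, omega = 0.3 rad/sample.
n = np.arange(200)
tone = 0.7 * np.cos(0.3 * n + 0.1)
amp_est, omega_est = energy_separation(tone)
```

Averaging such estimates over a short frame is one plausible reading of the IA-Mean and IF-Mean features; the slides do not give the exact definitions.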

8
Modulation Acoustic Features
[Block diagram, blocks: Speech; Regularization + Multiband
Filtering; Nonlinear Processing; Demodulation; Statistical
Processing; Robust Feature Transformation/Selection; V.A.D.]
  • AM-FM Modulation Features: Mean Inst. Ampl. (IA-Mean),
    Mean Inst. Freq. (IF-Mean), Freq. Mod. Percent. (FMP)
  • Energy Features: Teager Energy Cepstrum Coeff. (TECC)
9
TIMIT-based Speech Databases
  • TIMIT Database
  • Training Set 3696 sentences, 35
    phonemes/utterance
  • Testing Set 1344 utterances, 46680 phonemes
  • Sampling Frequency 16 kHz
  • Feature Vectors
  • MFCC + C0 + AM-FM + 1st + 2nd Time Derivatives
  • Stream Weights (1) for MFCC and (2) for
    AM-FM
  • 3-state left-right HMMs, 16 mixtures
  • All-pair, Unweighted grammar
  • Performance Criterion Phone Accuracy Rates (%)
  • Back-end System HTK v3.2.0
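
In a multi-stream HMM of the kind HTK supports, each stream's state likelihood is raised to its stream weight, so in the log domain the per-stream log-likelihoods are scaled by their weights and summed. A minimal sketch, assuming the weights (1) and (2) listed above and made-up log-likelihood values (the function name is illustrative, not an HTK call):

```python
def combine_stream_loglik(stream_logliks, stream_weights):
    """Weighted sum of per-stream log-likelihoods, i.e. the log of a product of
    stream likelihoods each raised to its stream weight."""
    if len(stream_logliks) != len(stream_weights):
        raise ValueError("one weight per stream is required")
    return sum(w * ll for w, ll in zip(stream_weights, stream_logliks))

# Hypothetical log-likelihoods for one frame and one HMM state.
mfcc_loglik, amfm_loglik = -42.0, -17.5
combined = combine_stream_loglik([mfcc_loglik, amfm_loglik], [1.0, 2.0])
```

A larger weight makes the decoder trust that stream more; with equal weights the streams simply multiply.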

10
Results TIMIT+Noise
Up to 106%
11
Aurora 3 - Spanish
  • Connected-Digits, Sampling Frequency 8 kHz
  • Training Set
  • WM (Well-Matched) 3392 utterances (quiet 532,
    low 1668 and high noise 1192)
  • MM (Medium-Mismatch) 1607 utterances (quiet 396
    and low noise 1211)
  • HM (High-Mismatch) 1696 utterances (quiet 266,
    low 834 and high noise 596)
  • Testing Set
  • WM 1522 utterances (quiet 260, low 754 and high
    noise 508), 8056 digits
  • MM 850 utterances (quiet 0, low 0 and high
    noise 850), 4543 digits
  • HM 631 utterances (quiet 0, low 377 and high
    noise 254), 3325 digits
  • 2 Back-end ASR Systems (HTK and BLasr)
  • Feature Vectors MFCC+AM-FM (or Auditory+AM-FM),
    TECC
  • All-Pair, Unweighted Grammar (or Word-Pair
    Grammar)
  • Performance Criterion Word (digit) Accuracy Rates
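
Word (digit) accuracy in the HTK sense penalizes insertions as well as deletions and substitutions: Acc = 100 (N - D - S - I) / N. A small sketch follows; the error counts are invented purely to exercise the function, and only the 8056-digit total comes from the WM test set above:

```python
def word_accuracy(n_ref, deletions, substitutions, insertions):
    """Accuracy in percent: 100 * (N - D - S - I) / N, for N reference words."""
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref

# Hypothetical error counts, NOT results from the presentation.
wm_accuracy = word_accuracy(8056, deletions=40, substitutions=100, insertions=21)
```

Because insertions are subtracted, accuracy can fall below the percent-correct figure and can even go negative for very noisy output.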

12
Results Aurora 3 (HTK)
Up to 62%
13
Databases Aurora 2
  • Task Speaker Independent Recognition of Digit
    Sequences
  • TI - Digits at 8kHz
  • Training (8440 Utterances per scenario, 55M/55F)
  • Clean (8kHz, G712)
  • Multi-Condition (8kHz, G712)
  • 4 Noises (artificial) subway, babble, car,
    exhibition
  • 5 SNRs 5, 10, 15, 20 dB, clean
  • Testing, artificially added noise
  • 7 SNRs -5, 0, 5, 10, 15, 20 dB, clean
  • A noises as in multi-cond train., G712 (28028
    Utters)
  • B restaurant, street, airport, train station,
    G712 (28028 Utters)
  • C subway, street (MIRS) (14014 Utters)
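
Artificially adding noise at a target SNR comes down to scaling the noise so that the speech-to-noise power ratio matches the target. The sketch below is a plain power-based version and is not the procedure used to build Aurora 2 (the slides do not describe it); the signals and function name are illustrative assumptions:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals `snr_db`, then add it."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain

# Toy signals standing in for speech and noise at 8 kHz (hypothetical data).
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000.0)
noise = rng.standard_normal(8000)
noisy, gain = add_noise_at_snr(speech, noise, snr_db=5.0)
```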

14
Results Aurora 2
Up to 12%
15
Work To Be Done on Modulation Features
16
ICCS-NTUA in HIWIRE 1st Year
  • Evaluation
  • Databases Completed
  • Baseline Completed
  • WP1
  • Noise Robust Features Results 1st Year
  • Speech Modulation Features Results 1st Year
  • Fractal Features Results 1st Year
  • Audio-Visual ASR Baseline Visual Features
  • Multi-microphone array Exploratory Phase
  • VAD Prelim. Results
  • WP2
  • Speaker Normalization Baseline
  • Non-native Speech Database Completed

17
Fractal Features

[Block diagram: speech signal, Embedding, N-d Signal (Noisy
Embedding), Local SVD (Geometrical Filtering), N-d Cleaned
(Filtered Embedding)]
  • FDCD Filtered Dynamics - Correlation Dimension (8)
  • MFD Multiscale Fractal Dimension (6)
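
The "Embedding" step in the diagram maps the 1-D speech signal to an N-dimensional trajectory. A common way to do this is time-delay embedding; the slides do not state the method or its parameters, so the function below and its `dim` and `tau` values are assumptions made for illustration:

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Time-delay embedding: row n is [x[n], x[n+tau], ..., x[n+(dim-1)*tau]]."""
    x = np.asarray(x, dtype=float)
    n_points = len(x) - (dim - 1) * tau
    if n_points <= 0:
        raise ValueError("signal too short for this dim/tau")
    return np.stack([x[i * tau : i * tau + n_points] for i in range(dim)], axis=1)

# Hypothetical parameters: 3-dimensional embedding with a lag of 2 samples.
trajectory = delay_embed(np.arange(10), dim=3, tau=2)
```

Each row of `trajectory` is one point of the reconstructed N-d signal; filtering steps such as a local SVD would then operate on neighborhoods of these points.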
18
Databases Aurora 2
  • Task Speaker Independent Recognition of Digit
    Sequences
  • TI - Digits at 8kHz
  • Training (8440 Utterances per scenario, 55M/55F)
  • Clean (8kHz, G712)
  • Multi-Condition (8kHz, G712)
  • 4 Noises (artificial) subway, babble, car,
    exhibition
  • 5 SNRs 5, 10, 15, 20 dB, clean
  • Testing, artificially added noise
  • 7 SNRs -5, 0, 5, 10, 15, 20 dB, clean
  • A noises as in multi-cond train., G712 (28028
    Utters)
  • B restaurant, street, airport, train station,
    G712 (28028 Utters)
  • C subway, street (MIRS) (14014 Utters)

19
Results Aurora 2
Up to 40%
20
Results Aurora 2
Up to 27%
21
Results Aurora 2
Up to 61%
22
Future Directions on Fractal Features
  • Refine Fractal Feature Extraction.
  • Application to Aurora 3.
  • Fusion with other features.

23
ICCS-NTUA in HIWIRE 1st Year
  • Evaluation
  • Databases Completed
  • Baseline Completed
  • WP1
  • Noise Robust Features Results 1st Year
  • Audio-Visual ASR Baseline Visual
    Features
  • Multi-microphone array Exploratory Phase
  • VAD Prelim. Results
  • WP2
  • Speaker Normalization Baseline
  • Non-native Speech Database Completed

24
Visual Front-End
  • Aim
  • Extract low-dimensional visual speech feature
    vector from video
  • Visual front-end modules
  • Speaker's face detection
  • ROI tracking
  • Facial Model Fitting
  • Visual feature extraction
  • Challenges
  • Very high dimensional signal - which features are
    proper?
  • Robustness
  • Computational Efficiency

25
Face Modeling
  • A well studied problem in Computer Vision
  • Active Appearance Models, Morphable Models,
    Active Blobs
  • Both Shape & Appearance can enhance lipreading
  • The shape and appearance of human faces live in
    low dimensional manifolds



26
Image Fitting Example
step 2
step 6
step 10
step 14
step 18
27
Example Face Interpretation Using AAM
[Video panels: original video; shape track superimposed on
original video; reconstructed face]
Reconstructed face: this is what the visual-only speech
recognizer sees!
  • Generative models like AAM allow us to evaluate
    the output of the visual front-end

28
Evaluation on the CUAVE Database
29
Audio-Visual ASR Database
  • Subset of CUAVE database used
  • 36 speakers (30 training, 6 testing)
  • 5 sequences of 10 connected digits per speaker
  • Training set 1500 digits (30x5x10)
  • Test set 300 digits (6x5x10)
  • CUAVE database also contains more complex data
    sets speaker moving around, speaker shows
    profile, continuous digits, two speakers (to be
    used in future evaluations)
  • CUAVE was kindly provided by the Clemson
    University

30
Recognition Results (Word Accuracy)
  • Data
  • Training 500 digits (29 speakers)
  • Testing 100 digits (4 speakers)

                 Audio  Visual  Audiovisual
Classification   99     46      85
Recognition      98     26      78
31
Future Work
  • Visual Front-end
  • Better trained AAM
  • Temporal tracking
  • Feature fusion
  • Experimentation with alternative DBN
    architectures
  • Automatic stream weight determination
  • Integration with non-linear acoustic features
  • Experiments on other audio-visual databases
  • Systematic evaluation of visual features

32
ICCS-NTUA in HIWIRE 1st Year
  • Evaluation
  • Databases Completed
  • Baseline Completed
  • WP1
  • Noise Robust Features Results 1st Year
  • Modulation Features Results 1st Year
  • Fractal Features Results 1st Year
  • Audio-Visual ASR Baseline Visual Features
  • Multi-microphone array Exploratory Phase
  • VAD Prelim. Results
  • WP2
  • Speaker Normalization Baseline
  • Non-native Speech Database Completed

33
User Robustness, Speaker Adaptation
  • VTLN Baseline
  • Platform HTK
  • Database AURORA 4
  • Fs 8 kHz
  • Scenarios Training, Testing
  • Comparison with MLLR
  • Collection of non-Native Speech Data Completed
  • 10 Speakers
  • 100 Utterances/Speaker

34
Vocal Tract Length Normalization
  • Implementation HTK
  • Warping Factor Estimation
  • Maximum Likelihood (ML) criterion

Figures from Hain99, Lee96
35
VTLN
  • Training
  • AURORA 4 Baseline Setup
  • Clean (SIC), Multi-Condition (SIM), Noisy (SIN)
  • Testing
  • Estimate warping factor using adaptation
    utterances (Supervised VTLN)
  • Per speaker warping factor (1, 2, 10, 20
    Utterances)
  • 2-pass Decoding
  • 1st pass
  • Get a hypothetical transcription
  • Alignment and ML to estimate per utterance
    warping factor
  • 2nd pass
  • Decode properly normalized utterance
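
The ML criterion for warping-factor estimation amounts to picking, for a given transcription, the factor whose warped features score highest against the acoustic model. A minimal sketch of that search, assuming a grid of candidate factors and a caller-supplied scoring function (`score_fn` is hypothetical and stands in for feature warping plus forced alignment; it is not an HTK call, and the grid range is an assumption):

```python
import numpy as np

def estimate_warp_factor(score_fn, alphas=None):
    """Return the candidate warping factor with the highest score (log-likelihood)."""
    if alphas is None:
        alphas = np.linspace(0.88, 1.12, 13)  # assumed grid, step 0.02
    scores = np.array([score_fn(a) for a in alphas])
    best = int(np.argmax(scores))
    return float(alphas[best]), float(scores[best])

def toy_score(alpha):
    """Toy stand-in for an alignment log-likelihood, peaked at alpha = 1.04."""
    return -(alpha - 1.04) ** 2

best_alpha, best_score = estimate_warp_factor(toy_score)
```

In the 2-pass scheme above, the first-pass hypothesis would supply the transcription used inside `score_fn`, and the second pass would decode features warped with `best_alpha`.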

36
Databases Aurora 4
  • Task 5000 Word, Continuous Speech Recognition
  • WSJ0 (16 / 8 kHz) + Artificially Added Noise
  • 2 microphones Sennheiser, Other
  • Filtering G712, P341
  • Noises Car, Babble, Restaurant, Street, Airport,
    Train Station
  • Training (7138 Utterances per scenario)
  • Clean Sennheiser mic.
  • Multi-Condition Sennheiser & Other mic.,
  • 75% w. artificially added noise @ SNR 10-20
    dB
  • Noisy Sennheiser, artificially added noise
  • SNR 10-20 dB
  • Testing (330 Utterances / 166 Utterances each,
    8 Speakers)
  • SNR 5-15 dB
  • 1-7 Sennheiser microphone
  • 8-14 Other microphone

37
VTLN Results, Clean Training
38
VTLN Results, Multi-Condition Training
39
VTLN Results, Noisy Training
40
Future Directions for Speaker Normalization
  • Estimate warping transforms at signal level
  • Exploit instantaneous amplitude or frequency
    signals to estimate the warping parameters,
    Normalize the signal
  • Effective integration with model-based adaptation
    techniques (collaboration with TSI)

41
ICCS-NTUA in HIWIRE 1st Year
  • Evaluation
  • Databases Completed
  • Baseline Completed
  • WP1
  • Noise Robust Features Results 1st Year
  • Audio-Visual ASR Baseline Visual Features
  • Multi-microphone array Exploratory Phase
  • VAD Prelim. Results
  • WP2
  • Speaker Normalization Baseline
  • Non-native Speech Database Completed

42
WP1 Appendix Slides
  • Aurora 3

43
ASR Results ?
44
Experimental Results IIa (HTK)
45
Aurora 3 Configs
  • HM
  • States 14, Mixs 12
  • MM
  • States 16, Mixs 6
  • WM
  • States 16, Mixs 16

46
WP1 Appendix Slides
  • Aurora 2

47
Baseline Aurora 2
  • Database Structure
  • 2 Training Scenarios, 3 Test Sets, 4+4+2
    Conditions,
  • 7 SNRs per Condition, Total of 2x70 Tests
  • Presentation of Selected Results
  • Average over SNR.
  • Average over Condition.
  • Training Scenarios Clean- vs. Multi-Train.
  • Noise Level Low vs. High SNR.
  • Condition Worst vs. Easy Conditions.
  • Features MFCC+D+A vs. MFCC+D+A+CMS
  • Set up states 18 (10-22), mixs 3-32,
    MFCC+D+A+CMS
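
The test counts follow from the database structure: test sets A, B and C contain 4, 4 and 2 noise conditions, each evaluated at 7 SNRs, under 2 training scenarios. As a quick arithmetic check (variable names are illustrative):

```python
conditions_per_set = {"A": 4, "B": 4, "C": 2}
snrs_per_condition = 7
training_scenarios = 2

tests_per_scenario = sum(conditions_per_set.values()) * snrs_per_condition
total_tests = training_scenarios * tests_per_scenario

# Utterances per (condition, SNR) cell implied by the 28028 utterances of set A.
utterances_per_cell = 28028 // (conditions_per_set["A"] * snrs_per_condition)
set_c_utterances = conditions_per_set["C"] * snrs_per_condition * utterances_per_cell
```

This reproduces the 2x70 tests and is consistent with the 14014 utterances listed for test set C.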

48
Average Baseline Results Aurora 2
Average over all SNRs and all Conditions
Plain: MFCC+D+A, CMS: MFCC+D+A+CMS.
Mixture: Clean train (Both Plain, CMS) 3, Multi train
Plain 22, CMS 32.
Best: Select for each condition/noise the mixs with the
best result.
Average HTK results as reported with the database.
49
Results Aurora 2
Up to 12%
50
Results Aurora 2
Up to 40%
51
Results Aurora 2
Up to 27%
52
Results Aurora 2
Up to 61%
53
Aurora 2 Distributed, Multicondition Training
Multicondition Training - Full
Set      A       A       A       A           A        B           B       B        B        B        C         C         C
Noise    Subway  Babble  Car     Exhibition  Average  Restaurant  Street  Airport  Station  Average  Subway M  Street M  Average  Average
Clean    98,68   98,52   98,39   98,49       98,52    98,68       98,52   98,39    98,49    98,52    98,50     98,58     98,54    98,52
20 dB    97,61   97,73   98,03   97,41       97,70    96,87       97,58   97,44    97,01    97,23    97,30     96,55     96,93    97,35
15 dB    96,47   97,04   97,61   96,67       96,95    95,30       96,31   96,12    95,53    95,82    96,35     95,53     95,94    96,29
10 dB    94,44   95,28   95,74   94,11       94,89    91,96       94,35   93,29    92,87    93,12    93,34     92,50     92,92    93,79
5 dB     88,36   87,55   87,80   87,60       87,83    83,54       85,61   86,25    83,52    84,73    82,41     82,53     82,47    85,52
0 dB     66,90   62,15   53,44   64,36       61,71    59,29       61,34   65,11    56,12    60,47    46,82     54,44     50,63    59,00
-5 dB    26,13   27,18   20,58   24,34       24,56    25,51       27,60   29,41    21,07    25,90    18,91     24,24     21,58    24,50
Average  88,76   87,95   86,52   88,03       87,82    85,39       87,04   87,64    85,01    86,27    83,24     84,31     83,78    86,39
54
Aurora 2 Distributed, Clean Training
Clean Training - Full
Set      A       A       A       A           A        B           B       B        B        B        C         C         C
Noise    Subway  Babble  Car     Exhibition  Average  Restaurant  Street  Airport  Station  Average  Subway M  Street M  Average  Average
Clean    98,93   99,00   98,96   99,20       99,02    98,93       99,00   98,96    99,20    99,02    99,14     98,97     99,06    99,03
20 dB    97,05   90,15   97,41   96,39       95,25    89,99       95,74   90,64    94,72    92,77    93,46     95,13     94,30    94,07
15 dB    93,49   73,76   90,04   92,04       87,33    76,24       88,45   77,01    83,65    81,34    86,77     88,91     87,84    85,04
10 dB    78,72   49,43   67,01   75,66       67,71    54,77       67,11   53,86    60,29    59,01    73,90     74,43     74,17    65,52
5 dB     52,16   26,81   34,09   44,83       39,47    31,01       38,45   30,33    27,92    31,93    51,27     49,21     50,24    38,61
0 dB     26,01   9,28    14,46   18,05       16,95    10,96       17,84   14,41    11,57    13,70    25,42     22,91     24,17    17,09
-5 dB    11,18   1,57    9,39    9,60        7,94     3,47        10,46   8,23     8,45     7,65     11,82     11,15     11,49    8,53
Average  69,49   49,89   60,60   65,39       61,34    52,59       61,52   53,25    55,63    55,75    66,16     66,12     66,14    60,06
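
The "Average" rows of the two tables above are consistent with a mean over the 20 dB to 0 dB rows only, leaving out Clean and -5 dB. A short check on the Subway column of test set A (values copied from the tables, with decimal commas written as points; the helper name is illustrative):

```python
# Subway column (test set A), values in percent, copied from the two tables.
subway = {
    "multicondition": {"Clean": 98.68, "20 dB": 97.61, "15 dB": 96.47,
                       "10 dB": 94.44, "5 dB": 88.36, "0 dB": 66.90, "-5 dB": 26.13},
    "clean": {"Clean": 98.93, "20 dB": 97.05, "15 dB": 93.49,
              "10 dB": 78.72, "5 dB": 52.16, "0 dB": 26.01, "-5 dB": 11.18},
}

AVERAGED_SNRS = ("20 dB", "15 dB", "10 dB", "5 dB", "0 dB")

def aurora2_average(column):
    """Mean over the 0-20 dB rows, excluding the Clean and -5 dB rows."""
    return sum(column[snr] for snr in AVERAGED_SNRS) / len(AVERAGED_SNRS)

multi_avg = aurora2_average(subway["multicondition"])
clean_avg = aurora2_average(subway["clean"])
```

Rounded to two decimals, these match the tabulated 88,76 and 69,49.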
55
WP1 Appendix Slides
  • Audio Visual Details

56
Introduction Motivations for AV-ASR
  • Audio-only ASR does not work reliably in many
    scenarios
  • Noisy background (e.g. car's cabin, cockpit)
  • Interference between talkers
  • Need to enhance the auditory signal when it is
    not reliable
  • Human speech perception is multimodal
  • Different modalities are weighed according to
    their reliability
  • Hearing impaired people can lipread
  • McGurk Effect (McGurk & MacDonald, 1976)
  • Machines should also be able to exploit
    multimodal information

57
Audio-Visual Feature Fusion
  • Audio-visual feature integration is highly
    non-trivial
  • Audio-visual speech asynchrony (100 ms)
  • Relative reliability of streams can vary wildly
  • Many approaches to feature fusion in the
    literature
  • Early integration
  • Intermediate integration
  • Late integration
  • Highly active research area (mainly machine
    learning)
  • The class of Dynamic Bayesian Networks (DBNs)
    seems particularly suited for the problem
  • Stream interaction explicitly modeled
  • Model parameter inference is more difficult than
    in HMM

58
Visual Front-End AAM Parameters
  • First frame of the 36 videos manually annotated
  • 68 points on the whole face as shape landmarks
  • Color appearance sampled at 10000 pixels
  • Eigenvectors retained explain 70% variance
  • 5 eigenshapes & 10 eigenfaces
  • Initial condition at each new frame: the converged
    solution at the previous frame
  • Inverse-compositional gradient descent algorithm
  • Coarse-to-fine refinement (Gaussian pyramid - 3
    scales)
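
Retaining eigenvectors that explain a target share of the variance is a standard PCA step. The sketch below assumes a 70% target and toy data; it illustrates the selection rule only and is not the AAM training code behind these slides:

```python
import numpy as np

def pca_retain_variance(samples, target_fraction=0.70):
    """Return the leading principal directions whose cumulative explained
    variance first reaches `target_fraction`, plus the cumulative fractions."""
    samples = np.asarray(samples, dtype=float)
    centered = samples - samples.mean(axis=0)
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    n_keep = min(int(np.searchsorted(cumulative, target_fraction)) + 1, len(cumulative))
    return components[:n_keep], cumulative[:n_keep]

# Toy data (hypothetical): 4 independent dimensions with std 3, 2, 1 and 0.1.
rng = np.random.default_rng(0)
toy_shapes = rng.standard_normal((5000, 4)) * np.array([3.0, 2.0, 1.0, 0.1])
basis, explained = pca_retain_variance(toy_shapes)
```

Projecting a centered sample onto `basis` gives its low-dimensional parameter vector, which is the role the eigenshape and eigenface coefficients play as visual features.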

59
AV-ASR Experiment Setup
  • Features
  • Audio 39 features (MFCC_D_A)
  • Visual (upsampled from 30 Hz to 100 Hz)
  • 5 shape features (Sh)
  • 10 appearance features (App)
  • Audio-Visual 39+45 feats (MFCC_D_A + SHAPP_D_A)
  • Two-stream HMM
  • 8 state, left-to-right HMM whole-digit models
    with no state skipping
  • Single Gaussian observation probability densities
  • Separate audio & video feature streams with equal
    weights (1,1)
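
Bringing the 30 Hz visual features up to the 100 Hz audio frame rate can be done by interpolating each feature dimension over time. The slides do not say which interpolation was used, so linear interpolation and the function name below are assumptions:

```python
import numpy as np

def upsample_features(feats, src_rate_hz=30.0, dst_rate_hz=100.0):
    """Linearly interpolate a (frames x dims) feature matrix to a higher frame rate."""
    feats = np.asarray(feats, dtype=float)
    t_src = np.arange(feats.shape[0]) / src_rate_hz
    n_dst = int(np.floor(t_src[-1] * dst_rate_hz)) + 1
    t_dst = np.arange(n_dst) / dst_rate_hz
    columns = [np.interp(t_dst, t_src, feats[:, d]) for d in range(feats.shape[1])]
    return np.stack(columns, axis=1)

# Toy input: 31 video frames (1 second) of 15 static visual features
# (5 shape + 10 appearance), each a linear ramp in time.
t_video = np.arange(31) / 30.0
visual_30hz = np.outer(t_video, np.arange(1, 16))
visual_100hz = upsample_features(visual_30hz)
```

Time derivatives would then be computed on the upsampled features, so that both streams share the same frame clock.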

60
WP1 Appendix Slides
  • Aurora 4

61
Aurora 4, Multi-Condition Training
62
Aurora 4, Noisy Training
63
Aurora 4, Noisy Training