74.419 Artificial Intelligence 2004 Speech - PowerPoint PPT Presentation

Slides: 31
Provided by: jdurston


1
74.419 Artificial Intelligence 2004
Speech & Natural Language Processing
  • Natural Language Processing
  • written text as input
  • sentences (well-formed)
  • Speech Recognition
  • acoustic signal as input
  • conversion into written words
  • Spoken Language Understanding
  • analysis of spoken language (transcribed speech)

3
Speech & Natural Language Processing
  • Areas in Natural Language Processing
  • Morphology
  • Grammar Parsing (syntactic analysis)
  • Semantics
  • Pragmatics
  • Discourse / Dialogue
  • Spoken Language Understanding
  • Areas in Speech Recognition
  • Signal Processing
  • Phonetics
  • Word Recognition

5
Speech Production & Reception
  • Sound and Hearing
  • change in air pressure → sound wave
  • reception through inner ear membrane / microphone
  • break-up into frequency components: receptors in
    cochlea / mathematical frequency analysis (e.g.
    Fast Fourier Transform, FFT) → frequency spectrum
  • perception/recognition of phonemes and
    subsequently words (e.g. Neural Networks,
    Hidden Markov Models)
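
The frequency break-up described above can be illustrated with a small sketch. It uses a naive discrete Fourier transform rather than a real FFT (an FFT computes the same result faster), and the sample rate, signal length, and the two component frequencies are assumptions chosen only for the example.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive discrete Fourier transform; returns one magnitude per frequency bin."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):          # bins from 0 Hz up to half the sample rate
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc) / n)
    return mags

SAMPLE_RATE = 800   # Hz, assumed for the example
N = 400             # number of samples (0.5 s)

# Composite signal: 100 Hz at amplitude 1.0 plus 240 Hz at amplitude 0.5
signal = [
    math.sin(2 * math.pi * 100 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 240 * t / SAMPLE_RATE)
    for t in range(N)
]

mags = dft_magnitudes(signal)
bin_hz = SAMPLE_RATE / N  # width of one frequency bin in Hz

# The two strongest bins recover the two component frequencies.
strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
peak_frequencies = sorted(k * bin_hz for k in strongest)
print(peak_frequencies)
```

The magnitudes over all bins form the frequency spectrum of the signal; computing them for successive time frames would give a spectrogram.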

7
Speech Recognition Phases
  • Speech Recognition
  • acoustic signal as input
  • signal analysis - spectrogram
  • feature extraction
  • phoneme recognition
  • word recognition
  • conversion into written words

8
Speech Signal
  • Speech Signal
  • composed of different sine waves with
    different frequencies and amplitudes
  • Frequency: waves (cycles) per second → perceived as pitch
  • Amplitude: height of wave → perceived as loudness
  • noise (not a sine wave)
  • Speech Signal
  • composite signal comprising different frequency
    components
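
A minimal sketch of such a composite signal, assuming two arbitrary sine components and an arbitrary sample rate. The helper names `sine_wave` and `rms` are made up for the illustration; RMS is used here only as a rough stand-in for loudness.

```python
import math

def sine_wave(freq_hz, amplitude, sample_rate, n_samples):
    """Sample a sine wave with the given frequency and amplitude."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
        for t in range(n_samples)
    ]

def rms(samples):
    """Root-mean-square level, a simple proxy for loudness."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

SAMPLE_RATE = 8000  # Hz, assumed
N = 8000            # one second of samples

low_quiet = sine_wave(200, 0.2, SAMPLE_RATE, N)   # low pitch, small amplitude
high_loud = sine_wave(800, 0.8, SAMPLE_RATE, N)   # higher pitch, large amplitude

# The composite signal is the sample-wise sum of its components.
composite = [a + b for a, b in zip(low_quiet, high_loud)]

print(round(rms(low_quiet), 3), round(rms(high_loud), 3), round(rms(composite), 3))
```

The component with the larger amplitude dominates the level of the composite signal, while the frequencies of the components stay unchanged in the sum.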

9
Waveform (fig. 7.20) of "She just had a baby."
(axes: amplitude/pressure over time)
10
Waveform for Vowel [ae] (fig. 7.21)
(axes: amplitude/pressure over time)
11
Speech Signal Analysis
  • Analog-Digital Conversion of Acoustic Signal
  • Sampling in Time Frames (windows)
  • frequency: zero-crossings per time frame
  • → e.g. 2 crossings/second is 1 Hz (1 wave)
  • → e.g. a 10 kHz component needs a sampling rate of at least 20 kHz
  • measure amplitudes of signal in time frame
  • → digitized waveform
  • separate different frequency components
  • → FFT (Fast Fourier Transform)
  • → spectrogram
  • other frequency-based representations
  • → LPC (linear predictive coding)
  • → Cepstrum
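
The zero-crossing rule and the sampling-rate rule above can be sketched as follows. This is an illustration for a single pure tone only; the function names, the 440 Hz test tone, and the small phase offset are assumptions made for the example.

```python
import math

def zero_crossing_frequency(samples, sample_rate):
    """Estimate the frequency of a single tone: 2 zero-crossings make 1 cycle."""
    crossings = sum(
        1
        for prev, cur in zip(samples, samples[1:])
        if (prev < 0) != (cur < 0)       # sign change between neighbouring samples
    )
    duration = (len(samples) - 1) / sample_rate   # seconds covered by the samples
    return crossings / 2 / duration

def min_sampling_rate(max_freq_hz):
    """Sampling needs at least two samples per cycle of the highest frequency."""
    return 2 * max_freq_hz

SAMPLE_RATE = 20000  # Hz, assumed
# One second of a 440 Hz tone; the phase offset avoids samples lying exactly on zero.
tone = [
    math.sin(2 * math.pi * 440 * t / SAMPLE_RATE + 0.1)
    for t in range(SAMPLE_RATE)
]

estimate = zero_crossing_frequency(tone, SAMPLE_RATE)
print(round(estimate))             # close to 440
print(min_sampling_rate(10000))    # the 10 kHz example from the slide
```

Counting zero-crossings only works for a signal dominated by one frequency; for composite signals such as speech, the frequency components are separated with an FFT instead.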

12
Waveform and Spectrogram (figs. 7.20, 7.23)
13
Waveform and LPC Spectrum for Vowel [ae] (figs.
7.21, 7.22)
(axes: amplitude/pressure over time; energy over frequency, with formants marked)
14
Speech Signal Characteristics
  • From the signal representation derive, e.g.
  • formants: dark stripes in the spectrogram
  • strong frequency components; characterize
    particular vowels and gender of speaker
  • pitch: fundamental frequency
  • baseline for higher-frequency harmonics like
    formants; gender characteristic
  • change in frequency distribution
  • characteristic for e.g. plosives (form of
    articulation)
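
As a rough illustration of deriving pitch from the signal, the sketch below estimates the fundamental frequency by autocorrelation. This is one common technique and not necessarily the one used in the lecture; the synthetic test signal, the search range of 50 to 500 Hz, and the function name `estimate_pitch` are assumptions.

```python
import math

def estimate_pitch(samples, sample_rate, f_min=50, f_max=500):
    """Return the fundamental frequency whose period (lag) correlates best."""
    best_lag, best_score = None, float("-inf")
    for lag in range(sample_rate // f_max, sample_rate // f_min + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        score = sum(
            samples[i] * samples[i + lag] for i in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

SAMPLE_RATE = 8000  # Hz, assumed
N = 800             # 0.1 s of signal

# Synthetic voiced sound: 125 Hz fundamental plus two weaker harmonics.
voiced = [
    math.sin(2 * math.pi * 125 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 250 * t / SAMPLE_RATE)
    + 0.3 * math.sin(2 * math.pi * 375 * t / SAMPLE_RATE)
    for t in range(N)
]

pitch = estimate_pitch(voiced, SAMPLE_RATE)
print(pitch)
```

The signal repeats every 64 samples, so the strongest correlation is found at that lag, which corresponds to the 125 Hz fundamental rather than to one of the harmonics.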

17
Video of glottis and speech signal in lingWAVES
(from http://www.lingcom.de)
21
Phoneme Recognition
  • Recognition Process based on
  • features extracted from spectral analysis
  • phonological rules
  • statistical properties of language/ pronunciation
  • Recognition Methods
  • Hidden Markov Models
  • Neural Networks
  • Pattern Classification in general

22
Pronunciation Networks / Word Models as
Probabilistic FAs (fig 5.12)
23
Pronunciation Network for 'about' (fig 5.13)
24
Word Recognition with Probabilistic FA / Markov
Chain (fig 5.14)
25
Viterbi-Algorithm
  • The Viterbi Algorithm finds an optimal sequence
    of states in continuous Speech Recognition, given
    an observation sequence of phones and a
    probabilistic (weighted) FA. The algorithm
    returns the path through the automaton which has
    maximum probability and accepts the observed
    sequence.

26
Viterbi-Algorithm (fig 5.19)
function VITERBI(observations of len T, state-graph) returns best-path
  num-states ← NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0, 0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s specified by state-graph
        new-score ← viterbi[s, t] * a[s, s'] * b_s'(o_t)
        if ((viterbi[s', t+1] = 0) || (new-score > viterbi[s', t+1])) then
          viterbi[s', t+1] ← new-score
          back-pointer[s', t+1] ← s
  Backtrace from highest probability state in the final column of
  viterbi[] and return path
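
The algorithm can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the figure's exact formulation: it keeps explicit start probabilities and, for each state, takes the maximum over its predecessors instead of pushing scores forward along transitions. The three-state word model and all probabilities are invented for the example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, probability) for a sequence of observed phones."""
    # best[t][s]: probability of the best path ending in state s at time t
    best = [{
        s: start_p.get(s, 0.0) * emit_p[s].get(observations[0], 0.0)
        for s in states
    }]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p].get(s, 0.0)
                 * emit_p[s].get(observations[t], 0.0), p)
                for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # Backtrace from the most probable state in the final column.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path, best[-1][last]

# Hypothetical left-to-right word model with self-loops (made-up numbers).
states = ["s1", "s2", "s3"]
start_p = {"s1": 1.0}
trans_p = {
    "s1": {"s1": 0.3, "s2": 0.7},
    "s2": {"s2": 0.4, "s3": 0.6},
    "s3": {"s3": 1.0},
}
emit_p = {
    "s1": {"n": 0.9, "m": 0.1},
    "s2": {"iy": 0.8, "ih": 0.2},
    "s3": {"d": 0.7, "t": 0.3},
}

path, prob = viterbi(["n", "iy", "iy", "d"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Here the repeated phone "iy" is absorbed by the self-loop on the second state, so the best path stays in that state for two time steps before moving on.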
27
Speech Recognition
Acoustic / sound wave
  → Signal Processing / Analysis: Filtering, Sampling, Spectral Analysis, FFT
Frequency Spectrum, Features (Phonemes, Context)
  → Phoneme Recognition: HMM, Neural Networks
Phonemes
  → Grammar or Statistics
Phoneme Sequences / Words
  → Grammar or Statistics for likely word sequences
Word Sequence / Sentence
28
Speech Recognizer Architecture (fig. 7.2)
29
Speech Processing - Important Types and
Characteristics
  • single word vs. continuous speech
  • unlimited vs. large vs. small vocabulary
  • speaker-dependent vs. speaker-independent training
  • Speech Recognition vs. Speaker Identification
30
Additional References
  • Huang, X., A. Acero & H. Hon: Spoken Language
    Processing. A Guide to Theory, Algorithm, and
    System Development. Prentice-Hall, NJ, 2001
  • Figures taken from
  • Jurafsky, D. & J. H. Martin: Speech and Language
    Processing, Prentice-Hall, 2000, Chapters 5 and
    7.
  • lingWAVES (from http://www.lingcom.de)