
1
  • Automatic Speech Recognition

2
  • Opportunity to participate in a new user study
    for Newsblaster and get $25-30 for 2.5-3 hours
    of time, respectively.
  • http://www1.cs.columbia.edu/delson/study.html
  • More opportunities will be coming.

3
What is speech recognition?
  • Transcribing words?
  • Understanding meaning?
  • Today:
  • Overview of ASR issues
  • Building an ASR system
  • Using an ASR system
  • Future research

4
It's hard to ... recognize speech / wreck a nice
beach
  • Speaker variability, within and across speakers
  • Recording environment varies with respect to
    noise
  • Transcription task must handle all of this and
    produce a transcript of what was said, from
    limited, noisy information in the speech signal
  • Success = low word error rate (WER)
  • WER = (S + I + D) / N x 100, where S, I, and D
    count the substitutions, insertions, and deletions
    needed to turn the hypothesis into the reference,
    and N is the number of reference words
  • "Thesis test" vs. "This is a test.": 75% WER
    (computed in the sketch below)
  • Understanding task must do more: from words to
    meaning
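
A minimal sketch of the WER computation via word-level edit distance (illustrative code, not from the slides):

```python
def wer(reference, hypothesis):
    """WER = (S + I + D) / N x 100, computed via edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = fewest edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[-1][-1] / len(ref) * 100

print(wer("This is a test", "Thesis test"))  # 75.0
```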

5
  • Measure concept accuracy (CA) of string in terms
    of accuracy of recognition of domain concepts
    mentioned in string and their values
  • I want to go from Boston to Baltimore on
    September 29
  • Domain concepts Values
  • source city Boston
  • target city Baltimore
  • travel date September 29
  • Score recognized string Go from Boston to
    Washington on December 29 (1/3 33 CA)
  • Go to Boston from Baltimore on September 29
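
A toy sketch of CA scoring over concept-value slots (the frame representation and slot names are assumptions for illustration):

```python
def concept_accuracy(reference, recognized):
    """Percent of domain concept slots whose values were recognized correctly."""
    correct = sum(1 for slot, value in reference.items()
                  if recognized.get(slot) == value)
    return correct / len(reference) * 100

ref = {"source city": "Boston", "target city": "Baltimore",
       "travel date": "September 29"}
# "Go from Boston to Washington on December 29"
hyp = {"source city": "Boston", "target city": "Washington",
       "travel date": "December 29"}
print(concept_accuracy(ref, hyp))  # 33.3...: only the source city is right
```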

6
Again, the Noisy Channel Model
  • Input to channel: spoken sentence s
  • Output from channel: an observation O
  • Decoding task: find s' = argmax_s P(s|O)
  • Using Bayes' Rule: P(s|O) = P(O|s) P(s) / P(O)
  • And since P(O) doesn't change for any
    hypothetical s:
  • s' = argmax_s P(O|s) P(s)
  • P(O|s) is the observation likelihood, or Acoustic
    Model, and P(s) is the prior, or Language Model
    (toy example below)
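
A toy decoder in this spirit, with made-up acoustic and language model scores (log domain avoids numeric underflow):

```python
import math

# Hypothetical scores for two candidate transcriptions of the same audio.
candidates = {
    "recognize speech":   {"am": 0.0020, "lm": 0.00050},
    "wreck a nice beach": {"am": 0.0025, "lm": 0.00002},
}

# s' = argmax_s [log P(O|s) + log P(s)]
best = max(candidates, key=lambda s: math.log(candidates[s]["am"])
                                     + math.log(candidates[s]["lm"]))
print(best)  # "recognize speech": the LM prior outweighs the acoustic edge
```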

7
What do we need to build and use an ASR system?
  • Corpora for training and testing of components
  • Feature extraction component
  • Pronunciation Model
  • Acoustic Model
  • Language Model
  • Algorithms to search hypothesis space efficiently

8
Training and Test Corpora
  • Collect corpora appropriate for the recognition
    task at hand
  • Small speech corpus with phonetic transcription,
    to associate sounds with symbols (Acoustic Model)
  • Large (> 60 hrs) speech corpus with orthographic
    transcription, to associate words with sounds
    (Acoustic Model)
  • Very large text corpus to estimate unigram and
    bigram probabilities (Language Model)

9
Representing the Signal
  • What parameters (features) of the speech input:
  • Can be extracted automatically
  • Will preserve phonetic identity and distinguish
    it from other phones
  • Will be independent of speaker variability and
    channel conditions
  • Will not take up too much space
  • Speech representations (for /ae/ in "had"):
  • Waveform: change in sound pressure over time
  • LPC spectrum: component frequencies of a waveform
  • Spectrogram: overall view of how frequencies
    change from phone to phone

10
  • Speech captured by microphone and sampled
    (digitized) -- may not capture all vital
    information
  • Signal divided into frames
  • Power spectrum computed to represent energy in
    different bands of the signal
  • LPC spectrum, Cepstra, PLP
  • Each frames spectral features represented by
    small set of numbers
  • Frames clustered into phone-like groups (phones
    in context) -- Gaussian or other models
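
A minimal front-end sketch with NumPy: overlapping frames and a per-frame power spectrum (the 25 ms / 10 ms sizes are common choices, not from the slides):

```python
import numpy as np

def power_spectra(signal, sample_rate=16000, frame_ms=25, step_ms=10):
    """Split a digitized signal into overlapping frames and return each
    frame's power spectrum (energy per frequency band). Real front ends
    go on to cepstra (MFCC), LPC, or PLP features."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    step = int(sample_rate * step_ms / 1000)        # 160-sample hop
    window = np.hamming(frame_len)                  # taper frame edges
    spectra = [np.abs(np.fft.rfft(signal[i:i + frame_len] * window)) ** 2
               for i in range(0, len(signal) - frame_len + 1, step)]
    return np.array(spectra)

# one second of synthetic audio: a 440 Hz tone
t = np.arange(16000) / 16000.0
print(power_spectra(np.sin(2 * np.pi * 440 * t)).shape)  # (frames, freq bins)
```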

11
  • Why this works?
  • Different phonemes have different spectral
    characteristics
  • Why it doesnt work?
  • Phonemes can have different properties in
    different acoustic contexts, spoken by different
    people
  • Nice white rice

12
Pronunciation Model
  • Models likelihood of word given network of
    candidate phone hypotheses (weighted phone
    lattice)
  • Allophones butter vs. but
  • Multiple pronunciations for each word
  • Lexicon may be weighted automaton or simple
    dictionary
  • Words come from all corpora pronunciations from
    pronouncing dictionary or TTS system
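
A toy weighted lexicon; the phone symbols and weights are invented for illustration (a weighted automaton would collapse such alternatives into one network):

```python
# Each word maps to alternative phone sequences with probabilities.
lexicon = {
    "butter": [(("b", "ah", "dx", "er"), 0.85),   # flapped /t/ allophone
               (("b", "ah", "t",  "er"), 0.15)],  # careful speech
    "but":    [(("b", "ah", "t"), 1.0)],
}

def pronunciation_prob(word, phones):
    """P(phone sequence | word) under the lexicon, 0.0 if unlisted."""
    return dict(lexicon.get(word, [])).get(tuple(phones), 0.0)

print(pronunciation_prob("butter", ["b", "ah", "dx", "er"]))  # 0.85
```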

13
Acoustic Models
  • Model likelihood of phones or subphones given
    spectral features and prior context
  • Use pronunciation models
  • Usually represented as HMM
  • Set of states representing phones or other
    subword units
  • Transition probabilities on states how likely is
    it to see one phone after seeing another?
  • Observation/output likelihoods how likely is
    spectral feature vector to be observed from phone
    state i, given phone state i-1?
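
A minimal HMM sketch: phone states, transition probabilities, and a single Gaussian per state over a 1-D feature. All values are invented; real systems use mixtures of multivariate Gaussians over whole feature vectors.

```python
import math

A = {"ih": {"ih": 0.6, "t": 0.4},   # P(next state | current state)
     "t":  {"ih": 0.0, "t": 1.0}}
emit = {"ih": (1.0, 0.5),           # (mean, std dev) for each state
        "t":  (4.0, 1.0)}

def b(state, x):
    """Observation likelihood P(x | state) under the state's Gaussian."""
    mu, sigma = emit[state]
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

print(A["ih"]["t"], b("ih", 1.2))  # a transition prob and an output likelihood
```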

14
  • Initial estimates for
  • Transition probabilities between phone states
  • Observation probabilities associating phone
    states with acoustic examples
  • Re-estimate both probabilities by feeding the HMM
    the transcribed speech training corpus (forced
    alignment)
  • I.e., we tell the HMM the right answers --
    which words to associate with which sequences of
    sounds
  • Iteratively retrain the transition and
    observation probabilities by running the training
    data through the model and scoring output until
    no improvement
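
One re-estimation step sketched for the transition probabilities: with the alignments fixed as the "right answers," counting and normalizing gives new estimates (observation probabilities are re-estimated analogously; this is a hard-count simplification of the full procedure):

```python
from collections import Counter, defaultdict

def reestimate_transitions(aligned_state_seqs):
    """Count state-to-state transitions in the aligned training data,
    then normalize each row into probabilities."""
    counts = defaultdict(Counter)
    for seq in aligned_state_seqs:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    return {s: {t: c / sum(nxt.values()) for t, c in nxt.items()}
            for s, nxt in counts.items()}

# toy alignments of phone states to frames
alignments = [["ih", "ih", "t"], ["ih", "t", "t"]]
print(reestimate_transitions(alignments))
# {'ih': {'ih': 0.33..., 't': 0.66...}, 't': {'t': 1.0}}
```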

15
Language Model
  • Models likelihood of word given prior word and of
    entire sentence
  • Ngram models
  • Build the LM by calculating bigram or trigram
    probabilities from text training corpus
  • Smoothing issues very important for real systems
  • Grammars
  • Finite state grammar or Context Free Grammar
    (CFG) or semantic grammar
  • Out of Vocabulary (OOV) problem
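
A bigram LM sketch with add-k smoothing (the tiny corpus is invented; real LMs train on very large text and use better smoothing):

```python
from collections import Counter

corpus = "I want to go to Boston I want to go to Baltimore".lower().split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def p_bigram(w_prev, w, k=1):
    """Add-k smoothed P(w | w_prev); smoothing keeps unseen bigrams
    from zeroing out whole hypotheses."""
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

print(p_bigram("to", "go"))      # seen twice: 0.3
print(p_bigram("to", "boston"))  # seen once: 0.2
print(p_bigram("go", "boston"))  # unseen: small but nonzero
```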

16
  • Entropy H(X) the amount of information in a LM,
    grammar
  • How many bits will it take on average to encode a
    choice or a piece of information?
  • More likely things will take fewer bits to encode
  • Perplexity 2H a measure of the weighted mean
    number of choice points in e.g. a language model
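
A quick illustration of the entropy/perplexity relationship:

```python
import math

def entropy(probs):
    """H = -sum p * log2(p): average bits to encode one choice."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25] * 4            # four equally likely next words
skewed  = [0.7, 0.1, 0.1, 0.1]  # one word dominates

for dist in (uniform, skewed):
    H = entropy(dist)
    print(f"H = {H:.2f} bits, perplexity = {2 ** H:.2f}")
# uniform: H = 2.00, perplexity = 4.00 (four full choice points)
# skewed:  H = 1.36, perplexity = 2.56 (fewer effective choices)
```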

17
Search/Decoding
  • Find the best hypothesis P(Os) P(s) given
  • Lattice of subword units (Acoustic Model)
  • Segmentation of all paths into possible words
    (Pronunciation Model)
  • Probabilities of word sequences (Language Model)
  • Produces a huge search space How to reduce?
  • Lattice minimization and determinization
  • Forward algorithm sum of all paths leading to a
    state
  • Viterbi algorithm max of all paths leading to a
    state
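
A small Viterbi sketch over discrete observations (states, probabilities, and observation symbols are toy values; real decoders run over the lattices above):

```python
def viterbi(obs, states, A, B, pi):
    """Max (not sum) over paths: best state sequence for the observations.
    A: transition probs, B: observation likelihoods, pi: initial probs."""
    V = [{s: pi[s] * B[s][obs[0]] for s in states}]  # best score ending in s
    back = []
    for o in obs[1:]:
        prev = V[-1]
        col, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] * A[r][s])
            col[s] = prev[best] * A[best][s] * B[s][o]
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    # follow backpointers from the best final state
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["ih", "t"]
A  = {"ih": {"ih": 0.6, "t": 0.4}, "t": {"ih": 0.1, "t": 0.9}}
B  = {"ih": {"hi": 0.8, "lo": 0.2}, "t": {"hi": 0.1, "lo": 0.9}}
pi = {"ih": 0.7, "t": 0.3}
print(viterbi(["hi", "hi", "lo"], states, A, B, pi))  # ['ih', 'ih', 't']
```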

18
  • Forward-backward (Baum-Welch, Expectation-Maximiza
    tion) algorithm computes probability of
    sequence at any state in search space
  • Beam search prune the lattice
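
A beam-pruning sketch: at each step, only hypotheses near the current best score survive (the width and threshold are arbitrary illustrative values):

```python
def prune(hypotheses, beam_width=3, threshold=0.01):
    """Keep hypotheses within a factor of the best score, capped at
    beam_width; the rest of the lattice is never expanded."""
    best = max(hypotheses.values())
    survivors = {h: p for h, p in hypotheses.items() if p >= best * threshold}
    return dict(sorted(survivors.items(), key=lambda kv: -kv[1])[:beam_width])

scores = {"go to boston": 0.30, "go to austin": 0.10,
          "godzilla pasta": 0.0001, "go-go bossa": 0.002}
print(prune(scores))  # the implausible paths fall outside the beam
```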

19
Varieties of Speech Recognition
20
Challenges for Transcription
  • Robustness to channel characteristics and noise
  • Portability to new applications
  • Adaptation to speakers and to environments
  • LMs: simple n-grams need help
  • Confidence measures
  • OOV words
  • New speaking styles/genres
  • New applications

21
Challenges for Understanding
  • Recognizing communicative problems
  • ASR errors
  • User corrections
  • Disfluencies and self-repairs
  • Possibilities
  • Recognizing speaker emotion
  • Identifying speech acts (e.g., "okay")
  • Locating topic boundaries for topic tracking,
    audio browsing, speech data mining

22
An Unsuccessful Dialogue
  • S: Hi, this is the AT&T Amtrak schedule system.
    This is TOOT. How may I help you?
  • U: I'd like to find a train from Ch- to Chicago
    from Baltimore.
  • S: Sorry, I can't understand you. Please repeat
    your utterance.
  • U: I'd like to find a train from Baltimore to
    Chicago.
  • S: I heard you say "leave today around 5 thirty
    in the morning." Which city do you want to go
    to?
  • U: I would like to leave Saturday at eight in
    the morning.

23
  • S: You can say the arrival city name, such as
    "New York City."
  • U: Baltimore to Chicago.
  • S: You can say the arrival city name, such as
    "New York City."
  • U: Baltimore to Chicago. (I hate it.)
  • U: Train from Baltimore ...
  • S: ...I heard you s...
  • U: to Chicago.... Hello?
  • S: You can say the name of your departure city,
    such as "New York City."
  • U: Baltimore.
24
  • U: Bal-ti-more....
  • S: You can say...
  • U: Baltimore.
  • S: Do you want to continue and find a new set of
    trains?
  • U: Yes.
  • S: You can say "yes" or "I am done here" to find
    a new set of trains.
  • U: Yes.
  • S: Thank you for using the AT&T Amtrak train
    time table system. See you next time.
  • U: I nev-
25
Summary
  • ASR technology relies upon a large number of
    phenomena and techniques we've already seen to
    convert sound into words
  • Phonetic/phonological, morphological, and lexical
    events
  • FSAs, N-grams, dynamic programming algorithms
  • Better modeling of linguistic phenomena will be
    needed to improve performance on transcription
    and especially on understanding
  • For next class we'll start talking about larger
    structures in language above the word (Ch 8)

26
Disfluencies and Self-Repairs
  • Disfluencies abound in spontaneous speech
  • every 4.6s in radio call-in (Blackmer & Mitton
    '91)
  • hesitation: "Ch- change strategy."
  • filled pause: "Um Baltimore."
  • self-repair: "Ba- uh Chicago."
  • Hard to recognize
  • "Ch- change strategy." --> "to D C D C today ten
    fifteen."
  • "Um Baltimore." --> "From Baltimore ten."
  • "Ba- uh Chicago." --> "For Boston Chicago."