Modeling Prosody for Automatic Corpus Labeling and Utterance Classification for Translation

1
Modeling Prosody for Automatic Corpus Labeling
and Utterance Classification for Translation
  • Shankar Ananthakrishnan
  • July 29, 2004

2
Labeling: Previous work
  • Automatic annotation of stress patterns
  • Stress patterns based on ToBI labeling
  • Previous approach
  • Static feature vectors
  • Syllable level feature extraction
  • Only features related to pitch employed
  • Limitations
  • Temporal misalignment between event and evidence
  • No history or language model
  • Syntactic information not incorporated

3
Labeling: Current approach
  • Binary decision at the syllable level
  • Simple stressed/unstressed classification
  • Avoids complicated ToBI stress categorization
  • Time series modeling framework
  • Integrate multiple streams of information
    (e.g., intensity, phone duration, pitch); see the
    feature-stream sketch below
  • Streams evolve at different rates (asynchrony)
  • Modeling framework: coupled HMMs (CHMMs)
  • Incorporate syntactic knowledge statistically
  • Part-of-Speech bigrams predict probability of
    stress
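A minimal sketch, assuming syllable-level features have already been extracted: separate observation streams (pitch, intensity, duration) per syllable. The feature names and numbers are illustrative placeholders, not the actual feature set.

```python
# Minimal illustrative sketch (not the original code): each syllable yields
# several observation streams that evolve at different rates and are later
# modeled jointly by the coupled HMM. Feature names/values are hypothetical.
from dataclasses import dataclass

@dataclass
class SyllableFeatures:
    pitch_mean_hz: float    # mean F0 over the syllable
    intensity_db: float     # mean energy
    duration_ms: float      # syllable duration

utterance = [
    SyllableFeatures(pitch_mean_hz=180.0, intensity_db=62.0, duration_ms=140.0),
    SyllableFeatures(pitch_mean_hz=235.0, intensity_db=69.0, duration_ms=210.0),
    SyllableFeatures(pitch_mean_hz=170.0, intensity_db=60.0, duration_ms=120.0),
]

# Separate streams handed to the acoustic model:
pitch_stream     = [s.pitch_mean_hz for s in utterance]
intensity_stream = [s.intensity_db for s in utterance]
duration_stream  = [s.duration_ms for s in utterance]
```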

4
Acoustic Model
  • Coupled Hidden Markov Model
  • Captures relationship between asynchronous
    information streams
  • Used in audio-visual ASR to jointly model
    acoustic/video data

[Figure: coupled HMM with two state chains, Stream 1 (states A1-A4) and
Stream 2 (states B1-B4), coupled across streams; a joint-state sketch follows]
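A minimal sketch of the coupling idea under the usual two-chain CHMM formulation, where each chain's next state depends on the previous states of both chains. State counts and probabilities are random placeholders, not the models used in this work.

```python
# Minimal sketch of a two-stream coupled HMM (illustrative): each chain's
# transition depends on the previous states of BOTH chains, p(a'|a,b) and
# p(b'|b,a). The coupled model is equivalent to an HMM over joint states
# (a, b), whose transition matrix is assembled below.
import numpy as np

nA, nB = 3, 3                                   # hypothetical state counts
rng = np.random.default_rng(0)

# Cross-coupled transition tensors: T_A[a, b, a2] = p(a2 | a, b), etc.
T_A = rng.random((nA, nB, nA)); T_A /= T_A.sum(axis=-1, keepdims=True)
T_B = rng.random((nB, nA, nB)); T_B /= T_B.sum(axis=-1, keepdims=True)

# Equivalent transition matrix over coupled states (a, b).
T_joint = np.zeros((nA * nB, nA * nB))
for a in range(nA):
    for b in range(nB):
        for a2 in range(nA):
            for b2 in range(nB):
                T_joint[a * nB + b, a2 * nB + b2] = T_A[a, b, a2] * T_B[b, a, b2]

assert np.allclose(T_joint.sum(axis=1), 1.0)    # each row is a distribution
```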
5
Language Model
  • Syntax is a powerful predictor of prosody
  • Can be used to determine a baseline prosody
    regardless of the specific utterance (Arnfield,
    1994)
  • Part-of-Speech (P-o-S) is a very accurate
    indicator of presence/absence of stress
  • Words in noun phrases and content words much more
    likely to be stressed
  • P-o-S based language model for prosody
    recognizer
  • Use tagger to label corpus with P-o-S
  • Annotated data can be used to build language
    models of the type p(NP_s | PP_u) (sketch below)
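A minimal counting sketch of such a syntactic-prosodic bigram model. The tag inventory and the toy tag sequence are hypothetical; they only illustrate how estimates like p(NP_s | PP_u) could be obtained.

```python
# Minimal sketch (illustrative data): estimate bigram probabilities over
# combined POS+stress tags, e.g. p(NP_s | PP_u), by simple counting.
from collections import Counter

# Toy tag sequence: POS tag plus _s (stressed) / _u (unstressed).
tags = ["DT_u", "NN_s", "PP_u", "NP_s", "VB_u", "NN_s", "PP_u", "NP_s"]

bigrams  = Counter(zip(tags, tags[1:]))
unigrams = Counter(tags[:-1])

def p(curr, prev):
    """Maximum-likelihood estimate of p(curr | prev)."""
    return bigrams[(prev, curr)] / unigrams[prev] if unigrams[prev] else 0.0

print(p("NP_s", "PP_u"))   # p(NP_s | PP_u) on the toy data
```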

6
Prosody Recognizer
  • Acoustic models at the syllable level
  • Stress occurs at the syllable level
  • Word/higher level models will have an undesirable
    averaging effect on stress labeling
  • Only two categories for the acoustic model:
    stressed and unstressed (two CHMMs)
  • Recognition (decoding)
  • Construct FSM from syntactic-prosodic language
    model and CHMM acoustic models using lexical
    stress information from standard dictionaries
  • Decode using the Viterbi algorithm to obtain the
    most likely stress-pattern sequence (sketch below)
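A minimal Viterbi sketch over the two syllable classes. The transition and acoustic scores are placeholder numbers standing in for the syntactic-prosodic language model and the CHMM likelihoods.

```python
# Minimal Viterbi sketch: decode the most likely stressed/unstressed sequence
# from per-syllable acoustic scores and transition probabilities (all values
# are illustrative placeholders, not trained model outputs).
import numpy as np

states = ["unstressed", "stressed"]
log_trans = np.log(np.array([[0.6, 0.4],        # p(next | prev), placeholder
                             [0.5, 0.5]]))
# Per-syllable acoustic log-likelihoods, shape (num_syllables, num_states).
log_acoustic = np.log(np.array([[0.7, 0.3],
                                [0.2, 0.8],
                                [0.6, 0.4]]))

T, S = log_acoustic.shape
delta = np.zeros((T, S))
back = np.zeros((T, S), dtype=int)
delta[0] = np.log(0.5) + log_acoustic[0]        # uniform initial prior
for t in range(1, T):
    scores = delta[t - 1][:, None] + log_trans + log_acoustic[t][None, :]
    back[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0)

path = [int(delta[-1].argmax())]                # backtrack from the best end
for t in range(T - 1, 0, -1):
    path.append(int(back[t][path[-1]]))
print([states[s] for s in reversed(path)])      # most likely stress pattern
```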

7
Machine Translation
  • Utterance classification
  • Map each utterance to its canonical form
  • Pick corresponding utterance in target language
  • Simple approach (Transonics)
  • Build class-specific language models p(w | c)
  • Evaluate the test utterance against each class LM
    and pick the class whose model provides the
    highest score (sketch below)
  • ĉ = argmax_c p(w | c)
  • Limitations of this approach
  • Insufficient data to train class-specific LMs
  • Ignores information contained in acoustics
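A minimal sketch of the ĉ = argmax_c p(w | c) decision rule. The two classes and their unigram probabilities are invented placeholders for the class-specific LMs.

```python
# Minimal sketch: pick the class whose LM scores the utterance highest
# (illustrative classes and unigram probabilities, not the system's LMs).
import math

class_lms = {
    "ask_name": {"what": 0.3, "is": 0.2, "your": 0.2, "name": 0.3},
    "ask_pain": {"where": 0.3, "does": 0.2, "it": 0.2, "hurt": 0.3},
}

def log_p(words, lm, floor=1e-6):
    """Unigram log-probability of the word sequence under one class LM."""
    return sum(math.log(lm.get(w, floor)) for w in words)

utt = "what is your name".split()
c_hat = max(class_lms, key=lambda c: log_p(utt, class_lms[c]))
print(c_hat)    # -> ask_name
```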

8
Prosody for Classification
  • Joint model incorporating acoustics and language
    information
  • Assume all classes are equally likely for the
    following formulation
  • MAP classifier
  • ĉ = argmax_c p(c | p, w) = argmax_c p(p | w, c) p(w | c)
    where p(p | w, c) is the acoustic-prosodic model
    and p(w | c) is the class LM (derivation below)
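A short worked version of this step, assuming the uniform class prior stated above:

```latex
\begin{align*}
\hat{c} &= \arg\max_{c}\, p(c \mid p, w)
         = \arg\max_{c}\, \frac{p(p, w \mid c)\, p(c)}{p(p, w)}
           && \text{Bayes' rule} \\
        &= \arg\max_{c}\, p(p, w \mid c)
           && \text{uniform } p(c),\ p(p, w)\ \text{constant in } c \\
        &= \arg\max_{c}\, \underbrace{p(p \mid w, c)}_{\text{acoustic-prosodic model}}\,
           \underbrace{p(w \mid c)}_{\text{class LM}}
           && \text{chain rule}
\end{align*}
```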
9
A Naïve Approach
  • Simplify the acoustic-prosodic model
  • ĉ = argmax_c p(c | p, w) = argmax_c p(p, w | c)
    ≈ argmax_c p(p | c) p(w | c)
  • Assumption: given the target class, prosody is
    only weakly dependent on the word sequence
  • Can use results from prosody labeling to build
    prosodic "language models" for each class
    (sketch below)
  • Class-based word LMs can be used as before
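A minimal sketch of the naïve factorization, combining a per-class prosodic "language model" over stressed/unstressed (s/u) syllable bigrams with the class word LM. All probabilities are invented placeholders.

```python
# Minimal sketch of c_hat = argmax_c p(p | c) p(w | c): add a prosodic-LM
# log-score over s/u stress bigrams to the word-LM log-score for each class.
import math

word_lm = {
    "ask_name": {"what": 0.3, "is": 0.2, "your": 0.2, "name": 0.3},
    "ask_pain": {"where": 0.3, "does": 0.2, "it": 0.2, "hurt": 0.3},
}
# p(next stress | previous stress, class), trainable from automatic labels.
prosody_lm = {
    "ask_name": {("u", "s"): 0.7, ("s", "u"): 0.6, ("u", "u"): 0.3, ("s", "s"): 0.4},
    "ask_pain": {("u", "s"): 0.4, ("s", "u"): 0.5, ("u", "u"): 0.6, ("s", "s"): 0.5},
}

def score(words, stresses, c, floor=1e-6):
    """Combined word-LM and prosodic-LM log-score for class c."""
    lw = sum(math.log(word_lm[c].get(w, floor)) for w in words)
    lp = sum(math.log(prosody_lm[c].get(bg, floor))
             for bg in zip(stresses, stresses[1:]))
    return lw + lp

words, stresses = "what is your name".split(), ["u", "s", "u", "s"]
print(max(word_lm, key=lambda c: score(words, stresses, c)))   # -> ask_name
```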

10
Joint Modeling
  • Sparsity of training data
  • Insufficient data for modeling p(p | w, c) if we
    model acoustics at the word or higher level
  • Even p(w | c) suffers from sparsity problems
    (only 4-5 sentences per class on average)
  • Modeling at phrase level
  • Captures high-level information useful for
    classification
  • CHMMs at the phrase level instead of syllable
    level
  • Issue: what units to use for acoustic models so
    as to capture phrase-level prosody and overcome
    problems posed by sparsity?

11
Semantic Annotation
  • Attach semantic tags to phrases
  • Automatic phrase parser segments text into phrase
    constituents (Charniak)
  • Statistical semantic tagger can be built from
    hand-annotated corpora (Gildea & Jurafsky)
  • UC Berkeley / ICSI FrameNet corpus
  • Eliminating sparsity
  • Few semantic roles for our translation problem
    (medical domain)
  • Class-based language and acoustic models can be
    built from tagged semantic roles instead of raw
    word sequences (sketch below)
  • LMs will be much more accurate
  • Acoustic models may perform better, since prosody
    depends more on the semantic role than on the
    actual word sequence
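A minimal sketch of estimating class-conditional models over semantic role tags rather than raw words. The role labels and the tiny tagged corpus are hypothetical, FrameNet-style placeholders.

```python
# Minimal sketch (hypothetical role tags): build class-conditional counts
# over semantic roles instead of raw word sequences, so paraphrases that
# share roles share statistics and sparsity is reduced.
from collections import Counter, defaultdict

tagged_corpus = [
    ("ask_pain", ["Interrogative", "Body_part", "Sensation"]),
    ("ask_pain", ["Interrogative", "Sensation"]),
    ("ask_name", ["Interrogative", "Person_name"]),
]

role_counts = defaultdict(Counter)
for cls, roles in tagged_corpus:
    role_counts[cls].update(roles)

def p_role(role, cls):
    """Class-conditional unigram probability of a semantic role."""
    total = sum(role_counts[cls].values())
    return role_counts[cls][role] / total if total else 0.0

print(p_role("Sensation", "ask_pain"))   # role-level class LM estimate
```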

12
Priorities
  • Procure FrameNet corpus (request sent)
  • Finalize feature set and topology for CHMM
    acoustic models
  • Begin implementing above ideas!
  • ICASSP 2005