Title: Automatic Cue-Based Dialogue Act Tagging
- Discourse & Dialogue
- CMSC 35900-1
- November 3, 2006
Roadmap
- Task & Corpus
- Dialogue Act Tagset
- Automatic Tagging Models
- Features
- Integrating Features
- Evaluation
- Comparison Summary
Task & Corpus
- Goal
- Identify dialogue acts in conversational speech
- Spoken corpus: Switchboard
- Telephone conversations between strangers
- Not task-oriented; topics were suggested
- Thousands of conversations
- Recorded, transcribed, and segmented
Dialogue Act Tagset
- Cover general conversational dialogue acts
- No particular task/domain constraints
- Original set: 50 tags
- Augmented with flags for task and conversation management
- 220 tags used in labeling; some rare
- Final set: 42 mutually exclusive tags
- Inter-annotator agreement: Kappa = 0.80 (high)
- 1,155 conversations labeled; split into train/test sets
Common Tags
- Statement/Opinion: declarative, +/- opinion
- Question: Yes/No; declarative form, question force
- Backchannel: continuers like uh-huh, yeah
- Turn Exit/Abandon: break off, +/- pass
- Answer: Yes/No; follows questions
- Agreement: Accept/Reject/Maybe
Probabilistic Dialogue Models
- HMM dialogue models
- argmax_U P(U)P(E|U), where E = evidence, U = DA sequence
- Assume decomposable by utterance
- Evidence from true words, ASR words, prosody
- Structured as an offline decoding process over the dialogue
- States: DAs; observations: utterances; P(Obs): P(E_i|U_i); transitions: P(U)
- Conditioning on speaker tags improves the model
- A bigram model of DA context is adequate and useful
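The argmax over DA sequences can be sketched as standard Viterbi decoding over a bigram DA prior and per-utterance evidence likelihoods. The tag set, transition probabilities, and likelihood values below are toy stand-ins, not the paper's trained models:

```python
import math

# Toy HMM for DA decoding: bigram prior P(U) over invented tags.
TAGS = ["Statement", "Question", "Backchannel"]

BIGRAM = {  # P(tag_i | tag_{i-1}); "<s>" marks conversation start
    "<s>": {"Statement": 0.6, "Question": 0.3, "Backchannel": 0.1},
    "Statement": {"Statement": 0.4, "Question": 0.3, "Backchannel": 0.3},
    "Question": {"Statement": 0.7, "Question": 0.1, "Backchannel": 0.2},
    "Backchannel": {"Statement": 0.5, "Question": 0.3, "Backchannel": 0.2},
}

def viterbi(likelihoods):
    """Return argmax_U P(U) * prod_i P(E_i|U_i) over tag sequences U.

    likelihoods: one dict per utterance mapping tag -> P(E_i | U_i).
    """
    scores = {t: math.log(BIGRAM["<s>"][t]) + math.log(likelihoods[0][t])
              for t in TAGS}
    back = []  # backpointers, one dict per utterance after the first
    for obs in likelihoods[1:]:
        prev, scores, ptrs = scores, {}, {}
        for t in TAGS:
            best = max(TAGS, key=lambda p: prev[p] + math.log(BIGRAM[p][t]))
            scores[t] = prev[best] + math.log(BIGRAM[best][t]) + math.log(obs[t])
            ptrs[t] = best
        back.append(ptrs)
    tag = max(TAGS, key=scores.get)  # best final state
    path = [tag]
    for ptrs in reversed(back):      # trace back the best path
        tag = ptrs[tag]
        path.append(tag)
    return path[::-1]
```

Because the model is a bigram over DAs, decoding is linear in dialogue length, which is what makes offline whole-dialogue decoding cheap.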
DA Classification - Words
- Words
- Combine notions of discourse markers and collocations, e.g. uh-huh -> Backchannel
- Contrast true words, ASR 1-best, ASR n-best
- Results
- Best: 71% with true words, 65% with ASR 1-best
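A minimal word-cue classifier in this spirit, using per-tag unigram likelihoods with add-one smoothing in place of the paper's per-tag n-gram LMs; the tiny training set is invented for illustration:

```python
from collections import Counter

# Hypothetical toy training data: utterances labeled with DA tags.
train = [
    ("uh-huh", "Backchannel"),
    ("yeah", "Backchannel"),
    ("i think that is right", "Statement"),
    ("do you like it", "Question"),
    ("is that so", "Question"),
]

# Per-tag word counts, estimated here as a unigram model
# (the paper uses per-tag n-gram LMs; unigrams keep the sketch short).
counts = {}
for text, tag in train:
    counts.setdefault(tag, Counter()).update(text.split())
vocab = {w for c in counts.values() for w in c}

def word_likelihood(words, tag):
    """Add-one-smoothed P(words | tag)."""
    c = counts[tag]
    total = sum(c.values()) + len(vocab)
    p = 1.0
    for w in words:
        p *= (c[w] + 1) / total
    return p

def classify(text):
    words = text.split()
    return max(counts, key=lambda tag: word_likelihood(words, tag))
```

The discourse-marker effect falls out of the counts: uh-huh only ever occurs under Backchannel, so its likelihood dominates there.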
DA Classification - Prosody
- Features
- Duration, pause, pitch, energy, rate, gender
- Pitch accent, tone
- Results
- Decision trees on 5 common classes
- 45.4% accuracy (baseline: 16.6%)
- In HMM with DT likelihoods as P(E_i|U_i)
- 49.7% (vs. 35% baseline)
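A hand-written stand-in for the prosodic decision trees: prosodic features in, a distribution over DA classes out, usable as the likelihood term P(F_i|U_i) in the HMM. The features, thresholds, and probabilities below are all invented for illustration:

```python
# Stand-in for a learned CART tree over prosodic features.
# Thresholds and class distributions are invented, not trained values.
def prosody_tree(duration_s, mean_f0_hz, final_f0_slope):
    if duration_s < 0.5:                # very short utterances
        return {"Backchannel": 0.7, "Statement": 0.2, "Question": 0.1}
    if final_f0_slope > 0:              # rising terminal pitch
        return {"Question": 0.6, "Statement": 0.3, "Backchannel": 0.1}
    return {"Statement": 0.6, "Question": 0.2, "Backchannel": 0.2}
```

Each leaf's class distribution can be plugged into the HMM directly, which is how the 45.4% tree becomes a 49.7% sequence model.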
DA Classification - All
- Combine word and prosodic information
- Consider case with ASR words and acoustics
- P(A_i,W_i,F_i|U_i) ≈ P(A_i,W_i|U_i)P(F_i|U_i)
- Reweight for different accuracies
- Slightly better than raw ASR alone
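The reweighting is typically done with an exponent on the less reliable stream; a minimal sketch, with the weight value invented rather than tuned:

```python
def combined_likelihood(p_asr_words, p_prosody, weight=0.3):
    """P(A_i,W_i,F_i|U_i) ~= P(A_i,W_i|U_i) * P(F_i|U_i)**weight.

    The exponent (0 < weight <= 1) downweights the prosodic stream,
    which is less accurate than the word stream; 0.3 is illustrative.
    """
    return p_asr_words * p_prosody ** weight
```

Per-class likelihoods combined this way drop into the HMM in place of the word-only term.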
Integrated Classification
- Focused analysis
- Prosodically disambiguated classes
- Statement vs. Question-Y/N and Agreement vs. Backchannel
- Prosodic decision trees for agreement vs. backchannel
- Disambiguated by duration and loudness
- Substantial improvement for prosody + words
- True words: S/Q 85.9% -> 87.6%; A/B 81.0% -> 84.7%
- ASR words: S/Q 75.4% -> 79.8%; A/B 78.2% -> 81.7%
- Prosody is more useful when recognition is unreliable
Observations
- DA classification can work in open domains
- Exploits word models, DA context, and prosody
- Best results for prosody + words
- Words are quite effective alone, even from ASR
- Questions
- Whole-utterance models? More fine-grained models?
- Longer structure, long-term features?
Automatic Metadata Annotation
- What is structural metadata?
- Why annotate?
What is Structural Metadata?
- Issue: speech is messy
- Sentence/utterance boundaries are not marked
- These are the basic units for dialogue acts, etc.
- Speech has disfluencies
- Result: automatic transcripts are hard to read
- Structural metadata annotation
- Mark utterance boundaries
- Identify fillers and repairs
Metadata Details
- Sentence-like units (SUs)
- Provide basic units for other processing
- Not necessarily grammatical sentences
- Distinguish full and incomplete SUs
- Conversational fillers
- Discourse markers and disfluencies: um, uh, anyway
- Edit disfluencies
- Repetitions, repairs, restarts
- Mark material that should be excluded from the fluent transcript
- Interruption point (IP): where the correction starts
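A sketch of why this annotation pays off downstream: once each token carries a filler/edit label, producing a readable transcript is a simple filter. The labels here are hand-assigned, hypothetical annotations:

```python
# Given tokens labeled as conversational filler (CF), edit region (ED),
# or fluent, keep only the fluent material.
def fluent_transcript(labeled_tokens):
    return " ".join(w for w, label in labeled_tokens if label == "fluent")

# "i i um really liked it": the repeated "i" is an edit, "um" a filler
utt = [("i", "ED"), ("i", "fluent"), ("um", "CF"),
       ("really", "fluent"), ("liked", "fluent"), ("it", "fluent")]
```

Here `fluent_transcript(utt)` yields "i really liked it", the cleaned-up reading of the disfluent original.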
Annotation Architecture
- 2-step process
- For each word, mark boundary type: IP, SU, ISU, or none
- For regions delimited by boundary words, identify CF/ED type
- Post-process to remove insertions
- Boundary detection: decision trees
- Prosodic features: duration, pitch, amplitude, silence
- Lexical features: POS tags, word/POS tag patterns, adjacent filler words
Boundary Detection - LM
- Language-model-based boundaries
- Hidden-event language model
- Trigram model with boundary tags
- Combine with decision tree
- Use LM value as a feature in the DT
- Linear interpolation of DT and LM probabilities
- Jointly model with an HMM
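A bigram sketch of the hidden-event idea (the actual system uses trigrams): boundaries are extra tokens in the word stream, and the LM decides whether inserting one at a gap raises the sequence probability. All probabilities below are hand-set for illustration:

```python
import math

# Hand-set log P(next | prev); pairs not listed back off to DEFAULT.
LOGP = {
    ("yeah", "<SU>"): math.log(0.6), ("yeah", "i"): math.log(0.1),
    ("<SU>", "i"): math.log(0.5), ("i", "know"): math.log(0.4),
    ("know", "<SU>"): math.log(0.5),
}
DEFAULT = math.log(0.01)  # crude back-off for unseen bigrams

def score(tokens):
    """Total bigram log-probability of a token sequence."""
    return sum(LOGP.get(pair, DEFAULT) for pair in zip(tokens, tokens[1:]))

def boundary_after(words, i):
    """True if inserting an SU boundary after position i raises the LM score."""
    with_boundary = words[:i + 1] + ["<SU>"] + words[i + 1:]
    return score(with_boundary) > score(words)
```

In the real system this hidden-event posterior is not used alone: it becomes a feature in the decision tree, is interpolated with the tree's probabilities, or is decoded jointly in an HMM.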
Edit and Filler Detection
- Transformation-based learning
- Baseline predictor, rule templates, objective function
- Classify with the baseline
- Use rule templates to generate rules that fix errors
- Add best rule to the baseline
- Training: supervised
- Features: word, POS, word use, repetition, location
- Tags: filled pause, edit, marker, edit term
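The loop above can be sketched Brill-style. For brevity the "templates" here are pre-instantiated (word, old-tag, new-tag) triples rather than true templates, and the data and tags are a toy example:

```python
# Minimal transformation-based learning: start from a baseline tagging,
# then greedily add the candidate rule that fixes the most errors.

def apply_rule(rule, words, tags):
    """Apply one (word, old_tag, new_tag) rule to a tag sequence."""
    word, old, new = rule
    return [new if w == word and t == old else t
            for w, t in zip(words, tags)]

def tbl_train(words, gold, candidate_rules, baseline_tag="O"):
    tags = [baseline_tag] * len(words)   # step 1: baseline predictor
    learned = []
    while True:
        # step 2: score every candidate rule by net error reduction
        def gain(rule):
            fixed = apply_rule(rule, words, tags)
            return (sum(f == g for f, g in zip(fixed, gold))
                    - sum(t == g for t, g in zip(tags, gold)))
        best = max(candidate_rules, key=gain)
        if gain(best) <= 0:
            break                        # no rule helps; stop
        learned.append(best)             # step 3: add best rule
        tags = apply_rule(best, words, tags)
    return learned, tags
```

At tagging time only the learned rule list is replayed in order, which is what makes TBL taggers fast and their decisions easy to inspect.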
Evaluation
- SU: best to combine all feature types
- No single type is great alone
- CF/ED: best features are lexical match and IP
- Overall SU detection relatively good
- Better on reference transcripts than ASR
- Most filled-pause (FP) errors due to ASR errors
- Discourse-marker (DM) errors not due to ASR
- Remaining tasks are problematic
SU Detection
Recall (R) and precision (P), in percent; ISU-P is undefined when ISU-R = 0:

Features                  SU-R  SU-P  ISU-R  ISU-P  IP-R  IP-P
Prosody only              46.5  74.6   0      –      8.8  47.2
POS, Pattern, LM          77.3  79.6  30     53.3   64.4  77.4
Pros, POS, Pattern, LM    81.5  80.4  36.5   69.7   66.1  78.7
All + frag                81.1  81.6  20.1   60.7   80.7  80.4