1
Context and Prosody in the Interpretation of Cue
Phrases in Dialogue
Spoken Dialog with Humans and Machines
  • Julia Hirschberg
  • Columbia University and KTH
  • 11/22/07

2
In collaboration with
  • Agustín Gravano, Stefan Benus, Héctor Chávez,
    Shira Mitchell, and Lauren Wilcox
  • With thanks to Gregory Ward and Elisa Sneed
    German

3
Managing Conversation
  • How do speakers indicate conversational structure
    in human/human dialogue?
  • How do they communicate varying levels of
    attention, agreement, acknowledgment?
  • What role does lexical choice play in these
    communicative acts? Phonetic realization?
    Prosodic variation? Prior context?
  • Can human/human behavior be modeled in Spoken
    Dialogue Systems?

4
Cue Phrases/Discourse Markers/Cue Words/
Discourse Particles/Clue Words
  • Linguistic expressions that can be employed
  • to convey information about the discourse
    structure, or
  • to make a semantic (literal?) contribution.
  • Examples
  • now, well, so, alright, and, okay, first, on the
    other hand, by the way, for example, ...

5
Some Examples
  • that's pretty much okay
  • Speaker 1: between the yellow mermaid and the whale
    Speaker 2: okay
    Speaker 1: and it is
  • okay we gonna be placing the blue moon

6
A Problem for Spoken Dialogue Systems
  • How do speakers produce and hearers interpret
    such potentially ambiguous terms?
  • How important is acoustic/prosodic information?
  • Phonetic variation?
  • Discourse context?

7
Research Goals
  • Learn which features best characterize the
    different functions of single affirmative cue
    words.
  • Determine how these can be identified
    automatically.
  • Important in Spoken Dialogue Systems
  • Understand user input.
  • Produce output appropriately.

8
Overview
  • Previous research
  • The Columbia Games Corpus
  • Collection paradigm
  • Annotations
  • Perception Study of Okays
  • Experimental design
  • Analysis and results
  • Machine Learning Experiments on Okay
  • Future work: Entrainment and Cue Phrases

9
Previous Work
  • General studies
  • Schiffrin 82, 87; Reichman 85; Grosz & Sidner 86
  • Cues to cue phrase disambiguation
  • Hirschberg & Litman 87, 93; Hockey 93; Litman 94
  • Cues to Dialogue Act identification
  • Jurafsky et al. 98; Rosset & Lamel 04
  • Contextual cues to the production of backchannels
  • Ward & Tsukahara 00; Sanjanhar & Ward 06

10
The Columbia Games Corpus: Collection
  • 12 spontaneous task-oriented dyadic conversations
    in Standard American English (9h 8m speech)
  • 2 subjects playing a series of computer games, no
    eye contact (45m 39s mean session time)
  • 2 sessions per subject, w/different partners
  • Several types of games, designed to vary the way
    discourse entities became old, or given, in the
    discourse, in order to study variation in the
    intonational realization of information status

11
Cards Game 1
[Game screenshots: Player 1 (Describer) | Player 2 (Searcher)]
  • Short monologues
  • Vary frequency and order of occurrence of objects
    on the cards.

12
Cards Game 2
?
Player 1 (Describer)
?
Player 2 (Searcher)
  • Dialogue
  • Vary frequency and order of occurrence of objects
    on the cards across speakers.

13
Objects Game
  • Follower must place the target object where it
    appears on the Describer's screen, solely via the
    description provided (4h 19m)

[Game screenshots: Describer | Follower]
14
The Columbia Games Corpus: Recording and Logging
  • Recorded on separate channels in a soundproof
    booth, digitized and downsampled to 16 kHz
  • All user and system behaviors logged

15
The Columbia Games Corpus: Annotation
  • Orthographic transcription and alignment (73k
    words).
  • Laughs, coughs, breaths, smacks,
    throat-clearings.
  • Self-repairs.
  • Intonation, using ToBI conventions.
  • Function (10 categories) of affirmative cue words
    (alright, mm-hm, okay, right, uh-huh, yeah, yes, ...).
  • Question form and function.
  • Turn-taking behaviors.

16
The Columbia Games Corpus: ToBI Labeling
  • Tones
  • Pitch accents: L*, H*, L+H*, H+!H*, ...
  • Phrase accents: L-, H-, !H-
  • Boundary tones: L%, H%
  • Break Indices
  • Degrees of juncture: 0 = no word boundary ... 4 =
    full intonational phrase boundary
  • Miscellaneous
  • Disfluencies, non-speech sounds, ...

17
The Columbia Games Corpus: ToBI Example
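
As an illustrative sketch only (the values below are made up, not taken from
the corpus annotation), a single ToBI-labeled token of okay could be stored
as a small record combining the tone, break-index, and miscellaneous tiers:

# Hypothetical ToBI annotation for one "okay" token (illustrative values only).
okay_token = {
    "word": "okay",
    "start": 12.48,                    # seconds, from the word alignment
    "end": 12.82,
    "pitch_accent": "H*",              # tone tier
    "phrase_accent_boundary": "H-H%",  # phrase accent + boundary tone
    "break_index": 4,                  # full intonational phrase boundary
    "misc": [],                        # disfluencies, non-speech sounds
}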
18
Perception Study: Selection of Materials
Speaker 1: yeah um there's like there's some space there's
Speaker 2: okay I think I got it
Speaker 1: but it's gonna be below the onion
Speaker 2: okay
Speaker 1: okay alright I'll try it okay
Speaker 2: okay the owl is blinking
19
Perception Study: Experiment Design
  • 54 instances of okay (18 for each function).
  • 2 tokens for each okay:
  • Isolated condition: only the word okay.
  • Contextualized condition: 2 full speaker turns:
  • The turn containing the target okay, and
  • The previous turn by the other speaker.

20
Perception Study: Experiment Design
  • 1/3 each: all 3 labelers agreed, only 2 agreed,
    none agreed.
  • Two conditions
  • Part 1: 54 isolated tokens
  • Part 2: 54 contextualized tokens
  • Subjects asked to classify each token of okay as
  • Acknowledgment / Agreement, or
  • Backchannel, or
  • Cue beginning discourse segment.

21
Perception Study: Definitions Given to the Subjects
  • Acknowledgment / Agreement
  • The function of okay that indicates "I believe
    what you said" and/or "I agree with what you
    say."
  • Backchannel
  • The function of okay in response to another
    speaker's utterance that indicates only "I'm
    still here" or "I hear you and please continue."
  • Cue beginning discourse segment
  • The function of okay that marks a new segment of
    a discourse or a new topic. This use of okay
    could be replaced by now.

22
Perception StudySubjects and Procedure
  • Subjects
  • 20 paid subjects (10 female, 10 male).
  • Ages between 20 and 60.
  • Native speakers of English.
  • No hearing problems.
  • GUI on a laboratory workstation with headphones.

23
Results: Inter-Subject Agreement
  • Kappa measure of agreement with respect to chance
    (Fleiss 71); a computation sketch follows the table
    below.

                          Isolated Condition   Contextualized Condition
Overall                         .120                   .294
Ack / Agree vs. Other           .089                   .227
Backchannel vs. Other           .118                   .164
Cue beginning vs. Other         .157                   .497
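
A minimal sketch of the Fleiss-kappa computation, assuming the raw judgments
are arranged as an items-by-categories count matrix; the function name and the
toy numbers are illustrative, not the study's data:

import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (n_items x n_categories) matrix of label counts.

    counts[i, j] = number of raters who assigned category j to item i.
    Assumes every item was rated by the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()

    # Proportion of all assignments that went to each category.
    p_j = counts.sum(axis=0) / (n_items * n_raters)

    # Per-item observed agreement, then its mean.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Agreement expected by chance.
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1.0 - p_e)

# Toy example: 5 okay tokens, 20 raters, 3 categories
# (Ack/Agree, Backchannel, Cue beginning).
toy = [[14, 4, 2],
       [6, 10, 4],
       [2, 3, 15],
       [8, 8, 4],
       [12, 5, 3]]
print(round(fleiss_kappa(toy), 3))

Kappa rescales observed agreement by the agreement expected from the overall
label distribution, which is why values near zero (as in the isolated
condition) indicate agreement barely above chance.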
24
Results: Cues to Interpretation
  • Phonetic transcription of okay
  • Isolated Condition
  • Strong correlation for realization of initial
    vowel
  • One realization of the initial vowel → Backchannel
  • Another realization → Ack/Agree, Cue Beginning
  • Contextualized Condition
  • No strong correlations found for phonetic
    variants.

25
Results: Cues to Interpretation
  • Ack / Agree
  • Isolated: shorter /k/
  • Contextualized: shorter latency between turns;
    shorter pause before okay
  • Backchannel
  • Isolated: higher final pitch slope; longer 2nd
    syllable; lower intensity
  • Contextualized: higher final pitch slope; more
    words by S2 before okay; fewer words by S1 after okay
  • Cue beginning
  • Isolated: lower final pitch slope; lower overall
    pitch slope
  • Contextualized: lower final pitch slope; longer
    latency between turns; more words by S1 after okay
  • S1 = utterer of the target okay. S2 = the other speaker.
26
Results: Cues to Interpretation
  • Phrase-final intonation (ToBI)
  • (Both isolated and contextualized conditions.)
  • H-H% → Backchannel
  • H-L%
  • L-H% → Ack/Agree, Backchannel
  • L-L% → Ack/Agree, Cue beginning

27
Perception Study: Conclusions
  • Agreement
  • Availability of context improves inter-subject
    agreement.
  • Cue beginnings easier to disambiguate than the
    other two functions.
  • Cues to interpretation
  • Contextual features override word features.
  • Exception: the final pitch slope of okay, in both
    conditions.

28
Machine Learning Experiments: Okay
  • Can we identify the different functions of okay
    in our larger corpus reliably?
  • What features perform best?
  • How do these compare to those that predict human
    judgments?

29
Method
  • ML Algorithm
  • JRip, Weka's implementation of the propositional
    rule learner Ripper (Cohen 95).
  • We also tried J4.8, Weka's implementation of the
    decision tree learner C4.5 (Quinlan 93, 96),
    with similar results.
  • 10-fold cross-validation in all experiments.
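
The experiments themselves used Weka; as a rough stand-in for readers without
Weka, the sketch below runs 10-fold cross-validation with scikit-learn's CART
decision tree (scikit-learn implements neither RIPPER nor C4.5), on placeholder
features and labels:

import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: rows = okay tokens, columns = text/timing/acoustic features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2434, 40))      # e.g. 2434 tokens, 40 features (made up)
y = rng.integers(0, 3, size=2434)    # e.g. Ack/Agree, Backchannel, Cue begin

# CART tree as a stand-in for Weka's J4.8 (C4.5); Weka's JRip has no
# scikit-learn equivalent.
clf = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)

# 10-fold cross-validation, as in the experiments.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"mean error rate: {1 - scores.mean():.3f}")

Per-class F-measures like those reported later can be obtained from
cross-validated predictions (e.g. sklearn.model_selection.cross_val_predict
plus sklearn.metrics.classification_report).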

30
Units of Analysis
  • IPU (Inter-pausal unit)
  • Maximal sequence of words delimited by pauses >
    50 ms (see the segmentation sketch below).
  • Conversational Turn
  • Maximal sequence of IPUs by the same speaker,
    with no contribution from the other speaker.
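
A minimal sketch of the IPU rule, assuming word-level time alignments are
available as (word, start, end) tuples for one speaker channel; the 50 ms
threshold comes from the slide, everything else is illustrative:

def segment_ipus(words, max_pause=0.050):
    """Group time-aligned words into inter-pausal units (IPUs).

    words: list of (token, start_sec, end_sec), sorted by start time,
           for a single speaker channel.
    A new IPU starts whenever the silence between consecutive words
    exceeds max_pause (50 ms).
    """
    ipus = []
    current = []
    for i, (token, start, end) in enumerate(words):
        if current and start - words[i - 1][2] > max_pause:
            ipus.append(current)
            current = []
        current.append((token, start, end))
    if current:
        ipus.append(current)
    return ipus

# Toy example: "okay", then a 300 ms pause, then a new IPU.
words = [("okay", 0.00, 0.35), ("the", 0.65, 0.72), ("owl", 0.72, 1.01)]
print(len(segment_ipus(words)))   # -> 2

Turns can then be built by merging consecutive IPUs of one speaker that
contain no intervening speech from the other speaker.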

31
Experimental features
  • Text-based features (from transcriptions)
  • Word identity, POS tags (automatic); position of
    word in IPU / turn
  • IPU and turn length in words; previous turn by the
    same speaker?
  • Timing features (from time alignment)
  • Word / IPU / turn duration; amount of speaker
    overlap
  • Time to word beginning/end in IPU, turn
  • Acoustic features
  • min, mean, max, stdev of pitch and intensity
  • Slope of pitch, stylized pitch, and intensity,
    over the whole word and over its last 100, 200,
    300 ms (see the sketch below).
  • Acoustic features from the last IPU of the prior
    speaker's turn.
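
A sketch of the slope and summary-statistic features, assuming an f0 (or
intensity) track has already been extracted at a fixed frame rate with
unvoiced frames removed; the pitch tracker and the stylization step used for
the corpus are not shown here:

import numpy as np

def pitch_slope(times, f0, window_ms=None):
    """Least-squares slope of the f0 contour (Hz per second).

    times, f0: parallel arrays for one word, unvoiced frames removed.
    If window_ms is given, use only the last window_ms milliseconds
    of the word (e.g. 100, 200, 300).
    """
    times = np.asarray(times, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    if window_ms is not None:
        keep = times >= times[-1] - window_ms / 1000.0
        times, f0 = times[keep], f0[keep]
    if len(times) < 2:
        return float("nan")
    slope, _intercept = np.polyfit(times, f0, deg=1)
    return slope

def summary_stats(track):
    """min, mean, max, stdev of a pitch or intensity track."""
    track = np.asarray(track, dtype=float)
    return track.min(), track.mean(), track.max(), track.std()

# Toy rising contour sampled every 10 ms over a 350 ms "okay".
t = np.arange(0.0, 0.35, 0.01)
f0 = 180 + 120 * t
print(round(pitch_slope(t, f0, window_ms=200), 1))   # ~120 Hz/s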

32
Results: Classification of individual words
  • Classification of each individual word into its
    most common functions.
  • alright → Ack/Agree, Cue Begin, Other
  • mm-hm → Ack/Agree, Backchannel
  • okay → Ack/Agree, Backchannel, Cue Begin,
    Ack/Agree+Cue Begin, Ack/Agree+Cue End, Other
  • right → Ack/Agree, Check, Literal Modifier
  • yeah → Ack/Agree, Backchannel

33
Majority-Labeled Functions of Okay (n = 2434)
  • 1137 Acknowledgment / Agreement
  • 548 Cue beginning discourse segment
  • 232 Pivot ending (Ack/Agree + Cue end)
  • 121 Backchannel
  • 68 Pivot beginning (Ack/Agree + Cue begin)
  • 33 Check with the interlocutor
  • 29 Literal modifier
  • 15 Stall / Filler
  • 10 Cue ending discourse segment
  • 6 Back from task

34
Results: Classification of okay

Feature Set           Error Rate (%)   F-Measure per class (majority-label count)
                                       Ack/Agree  Backchannel  Cue Begin  Ack/Agree+CueBegin  Ack/Agree+CueEnd
                                       (1137)     (121)        (548)      (68)                (232)
Text-based            31.7             .76        .16          .77        .09                 .33
Acoustic              40.2             .69        .24          .64        .03                 .25
Text-based + Timing   25.6             .79        .31          .82        .18                 .67
Full set              25.5             .80        .46          .83        .21                 .66
Baseline (1)          48.3             .68        .00          .00        .00                 .00
Human labelers (2)    14.0             .89        .78          .94        .56                 .73

(1) Majority-class baseline: Ack/Agree.
(2) Calculated w.r.t. each labeler's agreement with the majority labels.
35
Conclusions: ML Experiments
  • Context and timing features
  • As in the contextualized-condition perception
    results, timing matters:
  • Pause after okay, not before
  • Number of succeeding words
  • Acoustic features impoverished
  • No phonetic features
  • No pitch slope
  • But ToBI labels (where available) didn't help

36
Future Work
  • Experiments with full ToBI labeling
  • Other features
  • Lexical, Acoustic-Prosodic, and Discourse
    Entrainment and Dis-Entrainment
  • Positive correlations for affirmative cue words
  • Affirmative cue word entrainment and game scores
  • Affirmative cue word entrainment and overlaps and
    interruptions in turn-taking

37
Tack! (Thank you!)
38
Other Work
  • Benus et al., 2007
  • "The prosody of backchannels in American
    English," ICPhS 2007, Saarbrücken, Germany,
    August 2007.
  • Gravano et al., 2007
  • "Classification of discourse functions of
    affirmative words in spoken dialogue,"
    Interspeech 2007, Antwerp, Belgium, August 2007.

39
Importance for Spoken Dialogue Systems
  • Convey ambiguous terms with the intended meaning
  • Interpret the user's input correctly

40
Experiment Design
  • Goal: Study the relation between the down-stepped
    contour and
  • Information status
  • Syntactic position
  • Discourse position
  • Spontaneous speech
  • Both monologue and dialogue

41
Experiment Design
  • Three computer games.
  • Two players, each on a different computer.
  • They collaborate to perform a common task.
  • Totally unrestricted speech.

42
Objects Game
[Game screenshots: Player 1 (Describer) | Player 2 (Searcher)]
  • Dialogue
  • Vary target and surrounding objects (subject and
    object position).

43
Games Session
  • Repeat 3 times
  • Cards Game 1
  • Cards Game 2
  • Short break (optional)
  • Repeat 3 times
  • Objects Game
  • Each subject participated in 2 sessions →
    12 sessions in total

44
Subjects
  • Postings
  • Columbia's webpage for temporary job ads
  • Craigslist
  • http://www.craigslist.org
  • Category: Gigs → Event gigs
  • Problem
  • People are unreliable
  • 50% did not show up, or cancelled with short
    notice.

45
Subjects
  • Possible solutions
  • Give precise instructions to e-mail ALL required
    info
  • Name, native speaker?, hearing impairments?, etc.
  • Ask for a phone number.
  • Call them and explain why it is so important for
    us that they show up (or cancel with adequate
    notice).
  • Increase the pay after each session.
  • Example: $5, $10, $15 instead of $10, $10, $10.

46
Recording
  • Sound-proof booth
  • 2 subjects + 1 or 2 confederates.
  • Head-mounted mics.
  • Digital Audio Tape (DAT), one channel per
    speaker.
  • Wav files
  • One mono file per speaker.
  • Sample rate: 48000 Hz.
  • Downsampled to 16000 Hz, but kept the original
    files (see the resampling sketch below).
  • 20 hours of speech → 2.8 GB (at 16 kHz)
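
A sketch of the 48 kHz → 16 kHz downsampling step, assuming 16-bit PCM WAV
copies of the DAT channels; the file names are placeholders and the tool
actually used for the corpus is not specified here:

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

# Read one speaker's mono channel recorded at 48 kHz.
rate, audio = wavfile.read("session01_speakerA_48k.wav")   # placeholder name
assert rate == 48000

# 48000 / 3 = 16000, so decimate by 3; resample_poly applies an
# anti-aliasing filter before decimating.
audio_16k = resample_poly(audio, up=1, down=3)

# Cast back to 16-bit PCM (the sketch assumes int16 input) and write the
# 16 kHz file, keeping the 48 kHz original untouched.
wavfile.write("session01_speakerA_16k.wav", 16000, audio_16k.astype(np.int16))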

47
Logs
  • Log everything the subjects do to a text file.
  • Example
  • 170355234 BEGIN_EXECUTION
  • 170404868 NEXT_TURN
  • 170431837 RESULTS 97 points awarded.
  • 170438426 NEXT_TURN
  • 170503873 RESULTS 92 points awarded.
  • ...
  • Later, this may be used (e.g.) to divide each
    session into smaller tasks or conversations.
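
Following the slide's note about splitting sessions into smaller tasks, here
is a sketch of one way to do that, assuming each log line is a timestamp
followed by an event name and optional detail (the helper names are
illustrative):

def parse_log(path):
    """Yield (timestamp, event, detail) tuples from a game log file."""
    with open(path) as f:
        for line in f:
            parts = line.strip().split(None, 2)
            if len(parts) < 2:
                continue
            ts, event = parts[0], parts[1]
            detail = parts[2] if len(parts) == 3 else ""
            yield ts, event, detail

def split_into_tasks(events):
    """Group events into tasks, starting a new task at each NEXT_TURN."""
    tasks, current = [], []
    for ts, event, detail in events:
        if event == "NEXT_TURN" and current:
            tasks.append(current)
            current = []
        current.append((ts, event, detail))
    if current:
        tasks.append(current)
    return tasks

# Example (placeholder file name):
# tasks = split_into_tasks(parse_log("session01.log"))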