1
Machine Learning of Disfluencies
  • Piroska Lendvai, Antal van den Bosch, Emiel
    Krahmer
  • Dept. of Computational Linguistics
  • Tilburg University

2
Overview
  • Machine Learning and Disfluencies Introduction
  • Case studies
  • ML-based detection of fragmented words
  • ML-based disfluency chunking
  • Machine Learning and Disfluencies Perspectives

3
Introduction
  • Disfluency in speech
  • Causes problems in NLP
  • Repetition, deletion, insertion, substitution,
    incompletely uttered word, filled pause,
    abandoned sentence
  • Have (some) hierarchy
  • phono-/morpho-/syntactic level
  • repair (delete/insert/substitute), reformulate
    (has editing term between 2 constituents),
    hesitate (repeat, filled pause)
  • (some) Have structure
  • het veilig gebruik van interne-- [IP] sorry van
    electronic commerce
  • Corpora with disfluencies annotated: CGN, MapTask,
    Switchboard, ATIS, TRAINS, Verbmobil

4
ML and disfluencies
  • Methods: CART trees, decision trees, unsupervised
    clustering, statistical classification,
    rule-based correction of syntactic parses, rule
    induction, memory-based learning
  • Tasks: detect disfluencies and correct/erase them,
    identify cues to facilitate repairing, classify
    repeated items as un/planned, define/model
  • Features used: lexical context, fine-grained
    prosody, syntax, word overlap in context,
    presence of other disfluencies, occurrence
    position, total number of words in utterance,
    gender, eye-contact, semantic information density

5
ML-based fragment detection
  • Explorative study on CGN to identify incompletely
    uttered words
  • Use: a fragment gives structural information (the
    Interruption Point) of the disfluency
  • het veilig gebruik van interne-- [IP] sorry van
    electronic commerce
  • Fragment presence is often used in disfluency
    detection, but fragments are not simple to
    detect: typically short and not in the lexicon,
    yet these criteria account for less than 50%
    (a minimal sketch of this cue follows below)
  • 0.9% of the CGN data is a fragment (3k words)
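
  A minimal sketch (Python) of the naive "short and not in the
  lexicon" cue mentioned above; the toy lexicon, the length threshold,
  and the '--' break marker are illustrative assumptions, not the CGN
  setup.

    # Naive fragment cue: short token that is not in the lexicon.
    # Toy lexicon and threshold are assumptions for this demo.
    LEXICON = {"i", "uh", "have", "followed"}

    def looks_like_fragment(token, max_len=4):
        """Flag a token as a candidate fragment: short and out of vocabulary."""
        word = token.rstrip("-")  # drop a transcription break marker, if any
        return len(word) <= max_len and word.lower() not in LEXICON

    tokens = "I uh I have fo-- followed".split()
    print([t for t in tokens if looks_like_fragment(t)])  # ['fo--']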

6
Data
  • Spoken Dutch Corpus, 203 transcribed discourses
  • Various genres, 1-7 speakers
  • 45k sentences, 341k lexical tokens
  • 3k tagged fragmented words (0.9%)
  • Instance generation (sketched after this list)
  • define informative properties (features)
  • automatically extract feature values
  • assign binary class Non-/Fragment
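
  A sketch of the instance-generation step, assuming fragments are
  marked with a trailing '--' (standing in for the actual CGN
  annotation); each token yields one instance with automatically
  extracted feature values and a binary class.

    # One instance per token: automatically extracted features plus a
    # binary Fragment/Non-Fragment class. The '--' marking is an assumption.
    def make_instances(tokens):
        instances = []
        for i, token in enumerate(tokens):
            label = "Fragment" if token.endswith("--") else "Non-Fragment"
            features = {
                "focus": token,
                "position": i,                   # occurrence position in utterance
                "sentence_length": len(tokens),  # total number of words
            }
            instances.append((features, label))
        return instances

    for features, label in make_instances(
            "het veilig gebruik van interne-- sorry".split()):
        print(label, features)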

7
Cues in learning incomplete words
  • Based on the corpus and the literature: readily
    available, word-based features
  • Lexical window: neighbouring 2 words left/right of
    the focus word (string)
  • Overlap in letters/matching words (binary)
  • Sentence-based, general (numeric)
  • Context-type (binary)
  • Vector of 22 features (a subset is sketched below)
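
  A sketch of a subset of those 22 features: the 2+2 lexical window
  (strings), a binary letter-overlap cue, and sentence-based numeric
  features. The padding symbol and the two-letter prefix rule are
  illustrative assumptions.

    PAD = "_"

    def letter_overlap(focus, context_word):
        """Binary cue: does a context word repeat the focus word's letters?"""
        stem = focus.rstrip("-").lower()
        return int(len(stem) >= 2 and context_word.lower().startswith(stem[:2]))

    def extract_features(tokens, i):
        # embed the focus word in a 2-2 window, padded at sentence edges
        padded = [PAD, PAD] + tokens + [PAD, PAD]
        left2, left1, focus, right1, right2 = padded[i:i + 5]
        return {
            "left2": left2, "left1": left1, "focus": focus,
            "right1": right1, "right2": right2,
            "overlap_right1": letter_overlap(focus, right1),             # binary
            "word_match": int(focus in (left2, left1, right1, right2)),  # binary
            "position": i,                                               # numeric
            "sentence_length": len(tokens),                              # numeric
        }

    tokens = "het veilig gebruik van interne-- sorry van electronic commerce".split()
    print(extract_features(tokens, 4))  # focus word: 'interne--'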

8
22 features
9
Feature vector
10
Learning the Fragment class
  • Memory-based classifier implemented in TiMBL
  • Nearest neighbour approach
  • compute similarities between neighbouring
    instance(s) and the instance to be classified
  • algorithm parameters to be defined: number of
    NNs, similarity metric, feature weights
  • extrapolate class from NNs to query instance
  • Learning = training + testing phase; 10-fold CV
  • Performance evaluation (a simplified classifier
    and the metrics are sketched below)
  • accuracy
  • precision, recall, F-score on the In-Disfl class
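
  A simplified memory-based classifier and the evaluation metrics, in
  the spirit of TiMBL's nearest-neighbour approach but not its
  implementation: overlap similarity over symbolic features, majority
  vote over the k nearest neighbours. Uniform feature weights are an
  assumption; TiMBL would typically estimate them (e.g. by gain ratio).

    from collections import Counter

    def classify(train, query, k=3, weights=None):
        """train: list of (feature_tuple, label); query: feature_tuple."""
        weights = weights or [1.0] * len(query)
        def similarity(item):
            feats, _ = item
            # weighted overlap: sum the weights of matching feature values
            return sum(w for f, q, w in zip(feats, query, weights) if f == q)
        neighbours = sorted(train, key=similarity, reverse=True)[:k]
        return Counter(label for _, label in neighbours).most_common(1)[0][0]

    def precision_recall_f(gold, pred, target="Fragment"):
        tp = sum(g == p == target for g, p in zip(gold, pred))
        fp = sum(p == target != g for g, p in zip(gold, pred))
        fn = sum(g == target != p for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return prec, rec, f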

11
Parameter Optimization
  • Learning algorithms' parameters can take many
    values
  • Unknown which setting(s) perform well on this
    data
  • Can be set manually, but the search space is huge
  • Set automatically: construct a large number of
    different learners by varying algorithm
    parameters, estimate performance on training data
  • Iterative deepening: classifier wrapping with
    progressive sampling (sketched after this list)
  • Good settings are tested on more (training) data
  • Best setting applied to held-out test data
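
  A sketch of the wrapped progressive-sampling search: score every
  parameter setting on a small training sample, keep the best half,
  and re-evaluate the survivors on progressively larger samples. The
  parameter grid and the evaluate(sample, setting) callback are
  illustrative assumptions.

    import itertools
    import random

    def progressive_sampling(train, evaluate, grid, start_size=500, keep=0.5):
        settings = [dict(zip(grid, combo))
                    for combo in itertools.product(*grid.values())]
        size = start_size
        while len(settings) > 1 and size <= len(train):
            sample = random.sample(train, size)
            settings.sort(key=lambda s: evaluate(sample, s), reverse=True)
            settings = settings[:max(1, int(len(settings) * keep))]
            size *= 2                      # deepen: next round sees more data
        return settings[0]                 # best setting, applied to test data

    # hypothetical grid of memory-based-learner parameters
    grid = {"k": [1, 3, 5, 7], "metric": ["overlap", "mvdm"],
            "weighting": ["none", "gain_ratio"]}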

12
Performance on Fragment Detection
  Learner             Accuracy    Precision   Recall      Fβ=1
  1-letter baseline   97.4        54.3        43.9        48.5
  MBL                 99.6±0.1    81.3±4.5    65.3±4.4    72.4±4.2
  Opt. MBL            99.6±0.1    83.9±3.5    67.7±4.6    74.9±3.9

13
Results of default vs optimized learners
14
Error analysis
  • Frequent false negatives: fragmented items that
    resemble true words
  • False positives: short true lexical items
  • Named entities, foreign words, acronyms cause
    confusion
  • Annotation errors

15
Memory-based disfluency chunking
  • Goal
  • Create a preprocessing module that filters out
    disfluencies
  • Method
  • Syntactically annotated Spoken Dutch Corpus
  • Disfluency: everything that was not annotated as
    part of the syntactic tree
  • Machine learning techniques: memory-based
    classification, feature extraction, data
    attenuation, algorithm parameter optimization
  • Task: learn where disfluent chunks start/end
    (conversion to per-word classes is sketched below)
  • [het veilig gebruik]Out-D [van interne--
    sorry]In-D [van electronic commerce]Out-D
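
  A sketch that turns the bracketed chunk annotation above into the
  per-word In-D/Out-D classes the chunker learns; the bracket/tag
  notation follows the slide's example, the regular expression is an
  assumption of this demo.

    import re

    def chunk_labels(annotated):
        """'[van interne-- sorry]In-D' style chunks -> (word, class) pairs."""
        pairs = []
        for words, tag in re.findall(r"\[([^\]]+)\](\S+)", annotated):
            pairs.extend((word, tag) for word in words.split())
        return pairs

    example = ("[het veilig gebruik]Out-D [van interne-- sorry]In-D "
               "[van electronic commerce]Out-D")
    for word, tag in chunk_labels(example):
        print(word, tag)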

16
CGN Material
  • orthographically transcribed, morpho-syntactically
    tagged, complete syntactic tree built
  • 31k words not under the tree → regarded as disfluent
  • 27k disfluent chunks
  • ik uh ik heb met de nodige scepsis uh deze gang
    van zaken zo'n zo'n jaar aangekeken

17
Representing the material for ML
  • Extract simple properties from the corpus
  • Each word, embedded in its context
  • Overlap (if any) between the focus word and the
    context words
  • I uh I have fo-- followed ...
  • Worked well for fragment detection (Lendvai 2003)
  • Mark class In-Disfluency / Out-Disfluency →
    instance

18
Feature vector of some instances
19
Task baselines
  • Majority-class baseline: always predict Out-Disfl
  • 90% correct, 0% recall on the target class
    (In-Disfl)
  • FilledPause-baseline: predict that all and only
    (easily detectable) FPs are In-Disfl
  • not 100% precision, as 1 in 4 FPs is part of a
    larger chunk
  • both baselines are sketched after the table

  Baseline          Accuracy   Prec   Rec   Fβ=1
  Majority class    90         -      0     -
  Filled Pause      93         76     28    41
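
  A sketch of both baselines on (word, gold_class) pairs; the
  filled-pause list is an illustrative assumption for Dutch.

    FILLED_PAUSES = {"uh", "uhm"}

    def majority_baseline(data):
        return ["Out-D" for _ in data]   # always predict the majority class

    def filled_pause_baseline(data):
        return ["In-D" if word.lower() in FILLED_PAUSES else "Out-D"
                for word, _ in data]

    data = [("ik", "Out-D"), ("uh", "In-D"), ("ik", "In-D"), ("heb", "Out-D")]
    pred = filled_pause_baseline(data)
    accuracy = sum(p == g for p, (_, g) in zip(pred, data)) / len(data)
    print(pred, accuracy)  # misses the repeated 'ik' inside the chunk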

20
Attenuation
  • Infrequent feature values are problematic for ML,
    e.g. rare words
  • The word form of those may still be informative,
    e.g. CAPS → named entity, -dt suffix → 2nd person
    singular verb form
  • Mask the actual word but retain its formal
    properties: attenuate, below a frequency threshold
  • zeven _ ik uh ik heb met de
    nodige scepsis uh deze gang
    van zaken
  • MORPH-NUM _ ik uh ik heb met de
    MORPH-ge MORPH-is uh deze MORPH-ng van
    zaken
  • Compresses the data (a sketch follows below)
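
  A sketch of the attenuation step: tokens below a frequency threshold
  are masked while their informative surface form is kept. The MORPH-*
  placeholder names mimic the slide's example; the exact rules are
  assumptions, and frequencies are counted over the toy sentence here
  rather than over the whole corpus.

    from collections import Counter

    def attenuate(tokens, threshold=2):
        counts = Counter(t.lower() for t in tokens)
        out = []
        for t in tokens:
            if counts[t.lower()] >= threshold:
                out.append(t)                  # frequent: keep the word itself
            elif t[0].isupper():
                out.append("MORPH-CAP")        # capitalised: likely named entity
            elif t.isdigit():
                out.append("MORPH-NUM")
            else:
                out.append("MORPH-" + t[-2:])  # keep the suffix, e.g. MORPH-is
        return out

    tokens = "ik uh ik heb met de nodige scepsis uh deze gang van zaken".split()
    print(attenuate(tokens))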

21
Optimizing MBL
  • Learning algorithms' parameters can take many
    values
  • Unknown which setting(s) perform well on this
    data
  • Can be done by hand
  • Construct a large number of different learners by
    varying algorithm parameters, automatically
    estimate performance on training data
  • Iterative deepening: classifier wrapping with
    progressive sampling
  • Good settings are tested on more (training) data
  • Best setting applied to held-out test data

22
MBL performance in disfluency chunking

23
Disfluency chunking
24
Evaluation
  • Given annotated material, memory-based machine
    learning performs well in chunking disfluencies
    using simple features (80% is also a typical
    parsing score)
  • Parameter optimization and the attenuation
    technique are beneficial for the task: large
    improvement over the baseline learners
  • Most reliable features in classification
    (estimated by the classifier): focus word, word
    overlap info on Focus-Right1, Focus-Right2
    (a feature-weighting sketch follows below)
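
  The slides say feature reliability was estimated by the classifier;
  in memory-based learning this is commonly done with gain-ratio
  feature weights. A self-contained sketch, assuming instances as
  (feature_tuple, label) pairs:

    import math
    from collections import Counter

    def entropy(items):
        total = len(items)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(items).values())

    def gain_ratio(instances, feature_index):
        labels = [label for _, label in instances]
        values = [feats[feature_index] for feats, _ in instances]
        # information gain: class entropy minus entropy after the split
        remainder = 0.0
        for value in set(values):
            subset = [l for v, l in zip(values, labels) if v == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        split_info = entropy(values)   # penalises many-valued features
        return (entropy(labels) - remainder) / split_info if split_info else 0.0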

25
Future Work Perspectives
  • Use ASR lexical output instead of transcriptions
  • Use upcoming prosodic annotation
  • Automatically tag ASR output morpho-syntactically,
    use as feature
  • Combine module with shallow parser