1
Machine Learning of Disfluencies
  • Piroska Lendvai, Antal van den Bosch, Emiel
    Krahmer
  • Dept. of Computational Linguistics
  • Tilburg University

2
Overview
  • Machine Learning and Disfluencies Introduction
  • Case studies
  • ML-based detection of fragmented words
  • ML-based disfluency chunking
  • Machine Learning and Disfluencies Perspectives

3
Introduction
  • Disfluency in speech
  • Causes problems in NLP
  • Repetition, deletion, insertion, substitution,
    incompletely uttered word, filled pause,
    abandoned sentence
  • Have (some) hierarchy
  • phono-/morpho-/syntactic level
  • repair (delete/insert/substitute), reformulate
    (has editing term between 2 constituents),
    hesitate (repeat, filled pause)
  • (some) Have structure
  • het veilig gebruik van interne-- [IP] sorry van
    electronic commerce
  • Corpora with disfluencies annotated: CGN, MapTask,
    Switchboard, ATIS, TRAINS, Verbmobil

4
ML and disfluencies
  • Methods: CART trees, decision trees, unsupervised
    clustering, statistical classification,
    rule-based correction of syntactic parses, rule
    induction, memory-based learning
  • Tasks: detect disfluencies and correct/erase them,
    identify cues to facilitate repairing, classify
    repeated items as un/planned, define/model
  • Features used: lexical context, fine-grained
    prosody, syntax, word overlap in context,
    presence of other disfluencies, occurrence
    position, total number of words in utterance,
    gender, eye-contact, semantic information density

5
ML-based fragment detection
  • Explorative study on CGN to identify incompletely
    uttered words
  • Use: a fragment gives structural information (the
    Interruption Point) of the disfluency
  • het veilig gebruik van interne-- [IP] sorry van
    electronic commerce
  • Fragment presence is often used in disfluency
    detection, but fragments are not simple to
    detect: typically short and not in the lexicon,
    yet these criteria account for less than 50%
    (a minimal sketch of this cue follows below)
  • 0.9% of the CGN data is a fragment (3k words)
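
  A minimal sketch (Python) of the naive "short and not in the
  lexicon" cue mentioned above; the toy lexicon, the length threshold,
  and the '--' break marker are illustrative assumptions, not the CGN
  setup.

    # Naive fragment cue: short token that is not in the lexicon.
    # Toy lexicon and threshold are assumptions for this demo.
    LEXICON = {"i", "uh", "have", "followed"}

    def looks_like_fragment(token, max_len=4):
        """Flag a token as a candidate fragment: short and out of vocabulary."""
        word = token.rstrip("-")  # drop a transcription break marker, if any
        return len(word) <= max_len and word.lower() not in LEXICON

    tokens = "I uh I have fo-- followed".split()
    print([t for t in tokens if looks_like_fragment(t)])  # ['fo--']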

6
Data
  • Spoken Dutch Corpus, 203 transcribed discourses
  • Various genres, 1-7 speakers
  • 45k sentences, 341k lexical tokens
  • 3k tagged fragmented words (0.9%)
  • Instance generation (sketched after this list)
  • define informative properties (features)
  • automatically extract feature values
  • assign binary class Non-/Fragment
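
  A sketch of the instance-generation step, assuming fragments are
  marked with a trailing '--' (standing in for the actual CGN
  annotation); each token yields one instance with automatically
  extracted feature values and a binary class.

    # One instance per token: automatically extracted features plus a
    # binary Fragment/Non-Fragment class. The '--' marking is an assumption.
    def make_instances(tokens):
        instances = []
        for i, token in enumerate(tokens):
            label = "Fragment" if token.endswith("--") else "Non-Fragment"
            features = {
                "focus": token,
                "position": i,                   # occurrence position in utterance
                "sentence_length": len(tokens),  # total number of words
            }
            instances.append((features, label))
        return instances

    for features, label in make_instances(
            "het veilig gebruik van interne-- sorry".split()):
        print(label, features)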

7
Cues in learning incomplete words
  • Based on the corpus and the literature: readily
    available, word-based features
  • Lexical window: neighbouring 2 words left/right of
    the focus word (string)
  • Overlap in letters/matching words (binary)
  • Sentence-based, general (numeric)
  • Context-type (binary)
  • Vector of 22 features (a subset is sketched below)
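
  A sketch of a subset of those 22 features: the 2+2 lexical window
  (strings), a binary letter-overlap cue, and sentence-based numeric
  features. The padding symbol and the two-letter prefix rule are
  illustrative assumptions.

    PAD = "_"

    def letter_overlap(focus, context_word):
        """Binary cue: does a context word repeat the focus word's letters?"""
        stem = focus.rstrip("-").lower()
        return int(len(stem) >= 2 and context_word.lower().startswith(stem[:2]))

    def extract_features(tokens, i):
        # embed the focus word in a 2-2 window, padded at sentence edges
        padded = [PAD, PAD] + tokens + [PAD, PAD]
        left2, left1, focus, right1, right2 = padded[i:i + 5]
        return {
            "left2": left2, "left1": left1, "focus": focus,
            "right1": right1, "right2": right2,
            "overlap_right1": letter_overlap(focus, right1),             # binary
            "word_match": int(focus in (left2, left1, right1, right2)),  # binary
            "position": i,                                               # numeric
            "sentence_length": len(tokens),                              # numeric
        }

    tokens = "het veilig gebruik van interne-- sorry van electronic commerce".split()
    print(extract_features(tokens, 4))  # focus word: 'interne--'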

8
22 features
9
Feature vector
10
Learning the Fragment class
  • Memory-based classifier implemented in TiMBL
  • Nearest neighbour approach
  • compute similarities between neighbouring
    instance(s) and the instance to be classified
  • algorithm parameters to be defined: number of
    NNs, similarity metric, feature weights
  • extrapolate class from NNs to query instance
  • Learning = training + testing phase; 10-fold CV
  • Performance evaluation (a simplified classifier
    and the metrics are sketched below)
  • accuracy
  • precision, recall, F-score on the In-Disfl class
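
  A simplified memory-based classifier and the evaluation metrics, in
  the spirit of TiMBL's nearest-neighbour approach but not its
  implementation: overlap similarity over symbolic features, majority
  vote over the k nearest neighbours. Uniform feature weights are an
  assumption; TiMBL would typically estimate them (e.g. by gain ratio).

    from collections import Counter

    def classify(train, query, k=3, weights=None):
        """train: list of (feature_tuple, label); query: feature_tuple."""
        weights = weights or [1.0] * len(query)
        def similarity(item):
            feats, _ = item
            # weighted overlap: sum the weights of matching feature values
            return sum(w for f, q, w in zip(feats, query, weights) if f == q)
        neighbours = sorted(train, key=similarity, reverse=True)[:k]
        return Counter(label for _, label in neighbours).most_common(1)[0][0]

    def precision_recall_f(gold, pred, target="Fragment"):
        tp = sum(g == p == target for g, p in zip(gold, pred))
        fp = sum(p == target != g for g, p in zip(gold, pred))
        fn = sum(g == target != p for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return prec, rec, f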

11
Parameter Optimization
  • Learning algorithms' parameters can take many
    values
  • Unknown which setting(s) perform well on this
    data
  • Can be set manually, but the search space is huge
  • Set automatically: construct a large number of
    different learners by varying algorithm
    parameters, estimate performance on training data
  • Iterative deepening: classifier wrapping with
    progressive sampling (sketched after this list)
  • Good settings are tested on more (training) data
  • Best setting applied to held-out test data
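
  A sketch of the wrapped progressive-sampling search: score every
  parameter setting on a small training sample, keep the best half,
  and re-evaluate the survivors on progressively larger samples. The
  parameter grid and the evaluate(sample, setting) callback are
  illustrative assumptions.

    import itertools
    import random

    def progressive_sampling(train, evaluate, grid, start_size=500, keep=0.5):
        settings = [dict(zip(grid, combo))
                    for combo in itertools.product(*grid.values())]
        size = start_size
        while len(settings) > 1 and size <= len(train):
            sample = random.sample(train, size)
            settings.sort(key=lambda s: evaluate(sample, s), reverse=True)
            settings = settings[:max(1, int(len(settings) * keep))]
            size *= 2                      # deepen: next round sees more data
        return settings[0]                 # best setting, applied to test data

    # hypothetical grid of memory-based-learner parameters
    grid = {"k": [1, 3, 5, 7], "metric": ["overlap", "mvdm"],
            "weighting": ["none", "gain_ratio"]}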

12
Performance on Fragment Detection
  Learner             Accuracy    Precision   Recall      Fβ=1
  1-letter baseline   97.4        54.3        43.9        48.5
  MBL                 99.6±0.1    81.3±4.5    65.3±4.4    72.4±4.2
  Opt. MBL            99.6±0.1    83.9±3.5    67.7±4.6    74.9±3.9

13
Results of default vs optimized learners
14
Error analysis
  • Frequent false negatives: fragmented items that
    resemble true words
  • False positives: short true lexical items
  • Named entities, foreign words, acronyms cause
    confusion
  • Annotation errors

15
Memory-based disfluency chunking
  • Goal
  • Create a preprocessing module that filters out
    disfluencies
  • Method
  • Syntactically annotated Spoken Dutch Corpus
  • Disfluency: everything that was not annotated as
    part of the syntactic tree
  • Machine learning techniques: memory-based
    classification, feature extraction, data
    attenuation, algorithm parameter optimization
  • Task: learn where disfluent chunks start/end
    (conversion to per-word classes is sketched below)
  • [het veilig gebruik]Out-D [van interne--
    sorry]In-D [van electronic commerce]Out-D
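
  A sketch that turns the bracketed chunk annotation above into the
  per-word In-D/Out-D classes the chunker learns; the bracket/tag
  notation follows the slide's example, the regular expression is an
  assumption of this demo.

    import re

    def chunk_labels(annotated):
        """'[van interne-- sorry]In-D' style chunks -> (word, class) pairs."""
        pairs = []
        for words, tag in re.findall(r"\[([^\]]+)\](\S+)", annotated):
            pairs.extend((word, tag) for word in words.split())
        return pairs

    example = ("[het veilig gebruik]Out-D [van interne-- sorry]In-D "
               "[van electronic commerce]Out-D")
    for word, tag in chunk_labels(example):
        print(word, tag)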

16
CGN Material
  • orthographically transcribed, morpho-syntactically
    tagged, complete syntactic tree built
  • 31k words not under the tree → regarded as disfluent
  • 27k disfluent chunks
  • ik uh ik heb met de nodige scepsis uh deze gang
    van zaken zo'n zo'n jaar aangekeken

17
Representing the material for ML
  • Extract simple properties from the corpus
  • Each word, embedded in its context
  • Overlap (if any) between the focus word and the
    context words
  • I uh I have fo-- followed ...
  • Worked well for fragment detection (Lendvai 2003)
  • Mark class In-Disfluency / Out-Disfluency →
    instance

18
Feature vector of some instances
19
Task baselines
  • Majority-class baseline: always predict Out-Disfl
  • 90% correct, 0% recall on the target class
    (In-Disfl)
  • FilledPause-baseline: predict that all and only
    (easily detectable) FPs are In-Disfl
  • not 100% precision, as 1 in 4 FPs is part of a
    larger chunk
  • both baselines are sketched after the table

  Baseline          Accuracy   Prec   Rec   Fβ=1
  Majority class    90         -      0     -
  Filled Pause      93         76     28    41
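
  A sketch of both baselines on (word, gold_class) pairs; the
  filled-pause list is an illustrative assumption for Dutch.

    FILLED_PAUSES = {"uh", "uhm"}

    def majority_baseline(data):
        return ["Out-D" for _ in data]   # always predict the majority class

    def filled_pause_baseline(data):
        return ["In-D" if word.lower() in FILLED_PAUSES else "Out-D"
                for word, _ in data]

    data = [("ik", "Out-D"), ("uh", "In-D"), ("ik", "In-D"), ("heb", "Out-D")]
    pred = filled_pause_baseline(data)
    accuracy = sum(p == g for p, (_, g) in zip(pred, data)) / len(data)
    print(pred, accuracy)  # misses the repeated 'ik' inside the chunk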

20
Attenuation
  • Infrequent feature values are problematic for ML,
    e.g. rare words
  • The word form of those may still be informative,
    e.g. CAPS → named entity, -dt suffix → 2nd person
    singular verb form
  • Mask the actual word but retain its formal
    properties: attenuate, below a frequency threshold
  • zeven _ ik uh ik heb met de
    nodige scepsis uh deze gang
    van zaken
  • MORPH-NUM _ ik uh ik heb met de
    MORPH-ge MORPH-is uh deze MORPH-ng van
    zaken
  • Compresses the data (a sketch follows below)
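
  A sketch of the attenuation step: tokens below a frequency threshold
  are masked while their informative surface form is kept. The MORPH-*
  placeholder names mimic the slide's example; the exact rules are
  assumptions, and frequencies are counted over the toy sentence here
  rather than over the whole corpus.

    from collections import Counter

    def attenuate(tokens, threshold=2):
        counts = Counter(t.lower() for t in tokens)
        out = []
        for t in tokens:
            if counts[t.lower()] >= threshold:
                out.append(t)                  # frequent: keep the word itself
            elif t[0].isupper():
                out.append("MORPH-CAP")        # capitalised: likely named entity
            elif t.isdigit():
                out.append("MORPH-NUM")
            else:
                out.append("MORPH-" + t[-2:])  # keep the suffix, e.g. MORPH-is
        return out

    tokens = "ik uh ik heb met de nodige scepsis uh deze gang van zaken".split()
    print(attenuate(tokens))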

21
Optimizing MBL
  • Learning algorithms' parameters can take many
    values
  • Unknown which setting(s) perform well on this
    data
  • Can be done by hand
  • Construct a large number of different learners by
    varying algorithm parameters, automatically
    estimate performance on training data
  • Iterative deepening: classifier wrapping with
    progressive sampling
  • Good settings are tested on more (training) data
  • Best setting applied to held-out test data

22
MBL performance in disfluency chunking

23
Disfluency chunking
24
Evaluation
  • Given annotated material, memory-based machine
    learning performs well in chunking disfluencies
    using simple features (80% is also a typical
    parsing score)
  • Parameter optimization and the attenuation
    technique are beneficial for the task: large
    improvement over the baseline learners
  • Most reliable features in classification
    (estimated by the classifier): focus word, word
    overlap info on Focus-Right1, Focus-Right2
    (a feature-weighting sketch follows below)
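
  The slides say feature reliability was estimated by the classifier;
  in memory-based learning this is commonly done with gain-ratio
  feature weights. A self-contained sketch, assuming instances as
  (feature_tuple, label) pairs:

    import math
    from collections import Counter

    def entropy(items):
        total = len(items)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(items).values())

    def gain_ratio(instances, feature_index):
        labels = [label for _, label in instances]
        values = [feats[feature_index] for feats, _ in instances]
        # information gain: class entropy minus entropy after the split
        remainder = 0.0
        for value in set(values):
            subset = [l for v, l in zip(values, labels) if v == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        split_info = entropy(values)   # penalises many-valued features
        return (entropy(labels) - remainder) / split_info if split_info else 0.0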

25
Future Work Perspectives
  • Use ASR lexical output instead of transcriptions
  • Use upcoming prosodic annotation
  • Automatically tag ASR output morpho-syntactically,
    use as feature
  • Combine module with shallow parser