Context and Learning in Multilingual Tone and Pitch Accent Recognition presentation

1
Context and Learning in Multilingual Tone and Pitch Accent Recognition

Gina-Anne Levow
University of Chicago
May 18, 2007

2
Roadmap

Challenges for Tone and Pitch Accent
Contextual effects
Training demands
Modeling Context for Tone and Pitch Accent
Data collections & processing
Integrating context
Context in Recognition
Asides: More tones and features
Reducing Training Demands
Data collections & structure
Semi-supervised learning
Unsupervised clustering
Conclusion

3
Challenges: Context

Tone and Pitch Accent Recognition
Key component of language understanding
Lexical tone carries word meaning
Pitch accent carries semantic, pragmatic, discourse meaning
Non-canonical form (Shen 90, Shih 00, Xu 01)
Tonal coarticulation modifies surface realization
In extreme cases, fall becomes rise
Tone is relative
To speaker range
High for male may be low for female
To phrase range, other tones
E.g. downstep

4
Challenges: Training Demands

Tone and pitch accent recognition
Exploit data-intensive machine learning
SVMs (Thubthong 01, Levow 05, SLX05)
Boosted and Bagged Decision trees (X. Sun, 02)
HMMs (Wang & Seneff 00, Zhou et al 04, Hasegawa-Johnson et al 04)
Can achieve good results with huge sample sets
SLX05: 10K lab syllabic samples -> > 90% accuracy
Training data expensive to acquire
Time: pitch accent labeling at 10s of times real-time
Money: requires skilled labelers
Limits investigation across domains, styles, etc.
Human language acquisition doesn't use labels

5
Strategy: Overall

Common model across languages
Common machine learning classifiers
Acoustic-prosodic model
No word label, POS, lexical stress info
No explicit tone label sequence model
English, Mandarin Chinese, isiZulu
(also Cantonese)

6
Strategy: Context

Exploit contextual information
Features from adjacent syllables
Height, shape: direct, relative
Compensate for phrase contour
Analyze impact of
Context position, context encoding, context type
> 12.5% reduction in error over no context

7
Data Collections I

English (Ostendorf et al, 95)
Boston University Radio News Corpus, f2b
Manually ToBI annotated, aligned, syllabified
Pitch accent aligned to syllables
Unaccented, High, Downstepped High, Low
(Sun 02, Ross & Ostendorf 95)

8
Data Collections II

Mandarin
TDT2 Voice of America Mandarin Broadcast News
Automatically force aligned to anchor scripts
Automatically segmented, pinyin pronunciation lexicon
Manually constructed pinyin-ARPABET mapping
CU Sonic language porting
High, Mid-rising, Low, High falling, Neutral

9
Data Collections III

isiZulu (Govender et al., 2005)
Sentence text collected from Web
Selected based on grapheme bigram variation
Read by male native speaker
Manually aligned, syllabified
Tone labels assigned by 2nd native speaker
Based only on utterance text
Tone labels: High, Low

10
Local Feature Extraction

Uniform representation for tone, pitch accent
Motivated by Pitch Target Approximation Model
Tone/pitch accent target exponentially approached
Linear target height, slope (Xu et al, 99)
Base features
Pitch, Intensity: max, mean, min, range (Praat, speaker normalized)
Pitch at 5 points across voiced region
Duration
Initial, final in phrase
Slope
Linear fit to last half of pitch contour
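
A minimal sketch of the base pitch features listed on this slide. It assumes the pitch track for one syllable's voiced region is already available as a list of speaker-normalized values sampled at a uniform rate. The function names, the `frame_step` parameter, and the sample contour are illustrative and do not come from the slides, where the features were extracted with Praat.

```python
def five_points(contour):
    """Sample pitch at 5 evenly spaced points across the voiced region."""
    n = len(contour)
    return [contour[round(i * (n - 1) / 4)] for i in range(5)]


def slope_last_half(contour, frame_step=0.01):
    """Least-squares slope (pitch units per second) over the last half of the contour."""
    half = contour[len(contour) // 2:]
    times = [i * frame_step for i in range(len(half))]
    t_mean = sum(times) / len(times)
    p_mean = sum(half) / len(half)
    num = sum((t - t_mean) * (p - p_mean) for t, p in zip(times, half))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den if den else 0.0  # a single frame has no slope


def local_features(contour, duration):
    """Collect the scalar pitch features for one syllable."""
    return {
        "pitch_max": max(contour),
        "pitch_mean": sum(contour) / len(contour),
        "pitch_min": min(contour),
        "pitch_range": max(contour) - min(contour),
        "pitch_points": five_points(contour),
        "duration": duration,
        "slope": slope_last_half(contour),
    }


# Hypothetical falling contour (normalized pitch), 9 frames at 10 ms each.
falling = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
feats = local_features(falling, duration=0.09)
```

Intensity features and the phrase-initial/final flags would be added to the same dictionary in the same way.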

11
Context Features

Local context
Extended features
Pitch max, mean, adjacent points of preceding, following syllables
Difference features
Difference between
Pitch max, mean, mid, slope
Intensity max, mean
Of preceding, following and current syllable
Phrasal context
Compute collection average phrase slope
Compute scalar pitch values, adjusted for slope
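
The two local-context encodings can be sketched as follows, assuming each syllable is already represented as a dictionary of scalar features. The key names, the sign convention (current minus neighbor), and the sample values are assumptions made for illustration; the adjacent pitch points mentioned for the extended features are omitted for brevity.

```python
DIFF_KEYS = ["pitch_max", "pitch_mean", "pitch_mid", "slope",
             "intensity_max", "intensity_mean"]
EXTEND_KEYS = ["pitch_max", "pitch_mean"]


def extended_features(neighbor, prefix):
    """Extended encoding: copy selected raw features of an adjacent syllable."""
    return {f"{prefix}_{k}": neighbor[k] for k in EXTEND_KEYS}


def difference_features(current, neighbor, prefix):
    """Difference encoding: current-syllable value minus the neighbor's value."""
    return {f"{prefix}_diff_{k}": current[k] - neighbor[k] for k in DIFF_KEYS}


def with_context(syllables, i, use_pre=True, use_post=True):
    """Feature vector for syllable i, optionally augmented with context."""
    cur = syllables[i]
    feats = dict(cur)
    if use_pre and i > 0:
        feats.update(extended_features(syllables[i - 1], "pre"))
        feats.update(difference_features(cur, syllables[i - 1], "pre"))
    if use_post and i + 1 < len(syllables):
        feats.update(extended_features(syllables[i + 1], "post"))
        feats.update(difference_features(cur, syllables[i + 1], "post"))
    return feats


# Two hypothetical syllables with made-up feature values.
syls = [
    {"pitch_max": 5, "pitch_mean": 4, "pitch_mid": 4, "slope": -1,
     "intensity_max": 70, "intensity_mean": 65},
    {"pitch_max": 3, "pitch_mean": 2, "pitch_mid": 2, "slope": 1,
     "intensity_max": 68, "intensity_mean": 60},
]
vec = with_context(syls, 1, use_post=False)  # preceding context only
```

Switching `use_pre` and `use_post` gives the preceding, following, both, and no-context conditions compared in the experiments.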

12
Classification Experiments

Classifier: Support Vector Machine
Linear kernel
Multiclass formulation
SVMlight (Joachims), LibSVM (Chang & Lin 01)
4:1 training / test splits
Experiments: Effects of
Context position: preceding, following, none, both
Context encoding: Extended/Difference
Context type: local, phrasal
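
Reading the split on this slide as four parts training to one part test, a deterministic version can be sketched as below. The slides do not say whether the data were shuffled or stratified, so the every-fifth-sample rule and the `fold` parameter are assumptions for illustration only.

```python
def split_4_to_1(samples, fold=0):
    """Hold out every fifth sample (offset by `fold`) for testing."""
    train = [s for i, s in enumerate(samples) if i % 5 != fold]
    test = [s for i, s in enumerate(samples) if i % 5 == fold]
    return train, test


# Stand-in for a list of (feature_vector, label) pairs.
samples = list(range(20))
train, test = split_4_to_1(samples)
```

Varying `fold` from 0 to 4 would rotate the held-out fifth.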

13
Results: Local Context

Accuracy (%)
Context          Mandarin Tone   English Pitch Accent   isiZulu Tone
Full             74.5            81.3                   75.9
Extend PrePost   74              80.7                   73.8
Extend Pre       74              79.9                   73.6
Extend Post      70.5            76.7                   72.3
Diffs PrePost    75.5            80.7                   75.8
Diffs Pre        76.5            79.5                   75.5
Diffs Post       69              77.3                   72.8
Both Pre         76.5            79.7                   75.5
Both Post        71.5            77.6                   72.5
No context       68.5            75.9                   72.2
17
Discussion: Local Context

Any context information improves over none
Preceding context information consistently improves over none or following context information
English/isiZulu: Generally more context features are better
Mandarin: Following context can degrade
Little difference in encoding (Extend vs Diffs)
Consistent with phonetic analysis (Xu) that carryover coarticulation is greater than anticipatory

18
Results & Discussion: Phrasal Context

Accuracy (%)
Phrase Context   Mandarin Tone   English Pitch Accent
Phrase           75.5            81.3
No Phrase        72              79.9

Phrase contour compensation enhances recognition
Simple strategy
Use of non-linear slope compensation may improve
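
The linear phrase-contour compensation described earlier (compute a collection-average phrase slope, then adjust scalar pitch values for it) might look like the sketch below. It assumes each phrase is a list of `(time_from_phrase_start, pitch)` pairs, one per syllable; the data layout, function names, and numbers are hypothetical.

```python
def fit_slope(times, values):
    """Least-squares slope of values against times."""
    t_mean = sum(times) / len(times)
    v_mean = sum(values) / len(values)
    num = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den if den else 0.0


def average_phrase_slope(phrases):
    """Mean of the per-phrase pitch slopes over the whole collection."""
    slopes = [fit_slope([t for t, _ in p], [v for _, v in p])
              for p in phrases if len(p) > 1]
    return sum(slopes) / len(slopes) if slopes else 0.0


def compensate(phrase, avg_slope):
    """Remove the average declination from each syllable's pitch value."""
    return [pitch - avg_slope * t for t, pitch in phrase]


# Two hypothetical phrases with falling pitch (times in seconds).
phrases = [
    [(0.0, 10.0), (1.0, 8.0), (2.0, 6.0)],
    [(0.0, 9.0), (1.0, 8.0), (2.0, 7.0)],
]
avg = average_phrase_slope(phrases)
adjusted = compensate(phrases[0], avg)
```

A non-linear variant would replace `avg_slope * t` with a fitted curve evaluated at `t`.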

19
Context Summary

Employ common acoustic representation
Tone (Mandarin), pitch accent (English)
Cantonese: 64%, 68% with RBF kernel
SVM classifiers, linear kernel: 76%, 81%
Local context effects
Up to > 20% relative reduction in error
Preceding context greatest contribution
Carryover vs anticipatory
Phrasal context effects
Compensation for phrasal contour improves
recognition

20
Context Summary

Employ common acoustic representation
Tone (Mandarin, isiZulu), pitch accent (English)
SVM classifiers, linear kernel: 76%, 76%, 81%
Local context effects
Up to > 20% relative reduction in error
Preceding context greatest contribution
Carryover vs anticipatory
Phrasal context effects
Compensation for phrasal contour improves
recognition
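
The relative error reductions quoted in the talk can be checked against the local-context results table. The sketch assumes the reduction is measured relative to the no-context error rate and compares the no-context row with the best context row for each language; the function name is illustrative.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Percent of the baseline error removed; accuracies are given in percent."""
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


# Accuracies (%) from the results table: (no context, best with context).
results = {
    "Mandarin tone": (68.5, 76.5),
    "English pitch accent": (75.9, 81.3),
    "isiZulu tone": (72.2, 75.9),
}
reductions = {name: relative_error_reduction(base, best)
              for name, (base, best) in results.items()}
```

Under that reading the reductions come to roughly 25% for Mandarin, 22% for English, and 13% for isiZulu, in line with the "> 12.5%" and "up to > 20%" figures on the slides.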

21
Aside: More Tones

Cantonese
CUSENT corpus of read broadcast news text
Same feature extraction & representation
6 tones
High level, high rise, mid level, low fall, low rise, low level
SVM classification
Linear kernel: 64%, Gaussian kernel: 68%
Tones 3, 6: 50%, mutually indistinguishable (50% pairwise)
Human levels: no context 50%, with context 68%
Augment with syllable phone sequence
86% accuracy: for 90% of syllables w/tone 3 or 6, one tone dominates

22
Aside: Voice Quality & Energy

w/ Dinoj Surendran
Assess local voice quality and energy features for tone
Not typically associated with tones
Mandarin/isiZulu
Considered
VQ: NAQ, AQ, etc.; Spectral balance; Spectral tilt; Band energy
Useful: Band energy significantly improves
Mandarin: neutral tone
Supports identification of unstressed syllables
Spectral balance predicts stress in Dutch
isiZulu: Using band energy outperforms pitch
In conjunction with pitch -> 78%

23
Roadmap

Challenges for Tone and Pitch Accent
Contextual effects
Training demands
Modeling Context for Tone and Pitch Accent
Data collections & processing
Integrating context
Context in Recognition
Reducing Training Demands
Data collections & structure
Semi-supervised learning
Unsupervised clustering
Conclusion

24
Strategy: Training

Challenge
Can we use the underlying acoustic structure of the language through unlabeled examples to reduce the need for expensive labeled training data?
Exploit semi-supervised and unsupervised learning
Semi-supervised: Laplacian SVM
Unsupervised: k-means and asymmetric k-lines clustering
Substantially outperform baselines
Can approach supervised levels

25
Data Collections & Processing

English (as before)
Boston University Radio News Corpus, f2b
Binary: Unaccented vs accented
4-way: Unaccented, High, Downstepped High, Low
Mandarin
Lab speech data (Xu, 1999)
5-syllable utterances: vary tone, focus position
In-focus, pre-focus, post-focus
TDT2 Voice of America Mandarin Broadcast News
4-way: High, Mid-rising, Low, High falling
isiZulu (as before)
Read web sentences
2-way: High vs low

26
Semi-supervised Learning

Approach
Employ small amount of labeled data
Exploit information from additional, presumably more available, unlabeled data
Few prior examples: several weakly supervised (Wong et al, 05)
Classifier
Laplacian SVM (Sindhwani, Belkin & Niyogi 05)
Semi-supervised variant of SVM
Exploits unlabeled examples
RBF kernel, typically 6 nearest neighbors, transductive
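
Laplacian SVM uses unlabeled examples through a neighborhood graph built over all points, labeled and unlabeled, and penalizes decision functions that change sharply between neighbors. The sketch below is not a Laplacian SVM; it only illustrates that graph ingredient, building a symmetric k-nearest-neighbor graph with binary weights, its Laplacian L = D - W, and the smoothness penalty f^T L f. All names and the toy points are assumptions; `k=6` mirrors the slide, while the example call uses `k=1` because the toy set is tiny.

```python
def knn_graph(points, k=6):
    """Symmetric k-nearest-neighbor adjacency matrix with binary weights."""
    n = len(points)
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            w[i][j] = w[j][i] = 1  # connect if either point is a neighbor of the other
    return w


def graph_laplacian(w):
    """Unnormalized graph Laplacian L = D - W."""
    n = len(w)
    return [[(sum(w[i]) if i == j else 0) - w[i][j] for j in range(n)]
            for i in range(n)]


def smoothness(f, lap):
    """f^T L f: small when connected points receive similar outputs."""
    n = len(f)
    return sum(f[i] * lap[i][j] * f[j] for i in range(n) for j in range(n))


# Two hypothetical clusters of 1-D points.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
lap = graph_laplacian(knn_graph(points, k=1))
smooth = smoothness([1, 1, 1, -1, -1, -1], lap)  # labeling that respects the clusters
rough = smoothness([1, -1, 1, -1, 1, -1], lap)   # labeling that cuts across them
```

A labeling that follows the cluster structure incurs no penalty, while one that splits neighbors is penalized, which is how the unlabeled points constrain the classifier.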

27
Experiments

Pitch accent recognition
Binary classification: Unaccented/Accented
1000 instances, proportionally sampled
Labeled training: 200 unacc, 100 acc
80% accuracy (cf. 84% w/SVM trained on 15x the labeled data)
Mandarin tone recognition
4-way classification: n(n-1)/2 binary classifiers
400 instances, balanced; 160 labeled
Clean lab speech, in-focus: 94%
cf. 99% w/SVM, 1000s of training samples; 85% w/SVM, 160 training samples
Broadcast news: 70%
cf. < 50% w/SVM, 160 training samples
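
The n(n-1)/2 construction for the 4-way tone task can be illustrated with one binary classifier per pair of tones and a majority vote. The pairwise classifiers here are hypothetical nearest-prototype stand-ins on a made-up one-dimensional feature, not trained Laplacian SVMs, and the voting rule is a common choice rather than one stated on the slides.

```python
from collections import Counter
from itertools import combinations

TONES = ["High", "Mid-rising", "Low", "High falling"]


def pairwise_tasks(classes):
    """All unordered class pairs: n(n-1)/2 binary problems."""
    return list(combinations(classes, 2))


def vote(sample, classifiers):
    """Each pairwise classifier casts one vote; the most-voted class wins."""
    votes = Counter(clf(sample) for clf in classifiers.values())
    return votes.most_common(1)[0][0]


# Made-up 1-D prototypes standing in for trained binary classifiers.
prototypes = {"High": 3.0, "Mid-rising": 1.0, "Low": -2.0, "High falling": -0.5}


def make_classifier(a, b):
    """Binary decision between a and b: pick the nearer prototype."""
    return lambda x: a if abs(x - prototypes[a]) <= abs(x - prototypes[b]) else b


pairs = pairwise_tasks(TONES)
classifiers = {(a, b): make_classifier(a, b) for a, b in pairs}
predicted = vote(2.8, classifiers)
```

For four tones this gives six binary problems.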

28
Unsupervised Learning

Question
Can we identify the tone structure of a language
from the acoustic space without training?
Analogous to language acquisition
Significant recent research in unsupervised clustering
Established approaches: k-means
Spectral clustering (Shi & Malik 97, Fischer & Poland 2004): asymmetric k-lines
Little research for tone
Self-organizing maps (Gauthier et al, 2005)
Tones identified in lab speech using f0 velocities
Cluster-based bootstrapping (Narayanan et al, 2006)
Prominence clustering (Tamburini 05)

29
Clustering

Pitch accent clustering
4-way distinction: 1000 samples, proportional
2-16 clusters constructed
Assign most frequent class label to each cluster
Classifier
Asymmetric k-lines
Context-dependent kernel radii, non-spherical
> 78% accuracy
2 clusters: asymmetric k-lines best
Context effects
Vector w/preceding context vs vector with no context: comparable
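
Scoring a clustering by assigning each cluster its most frequent class label can be sketched as follows. The cluster ids and labels are invented for illustration, and the function names are not from the slides.

```python
from collections import Counter, defaultdict


def label_clusters(cluster_ids, true_labels):
    """Map each cluster id to the most frequent true label among its members."""
    by_cluster = defaultdict(Counter)
    for c, y in zip(cluster_ids, true_labels):
        by_cluster[c][y] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}


def clustering_accuracy(cluster_ids, true_labels):
    """Fraction of samples whose cluster's majority label matches their own."""
    mapping = label_clusters(cluster_ids, true_labels)
    correct = sum(mapping[c] == y for c, y in zip(cluster_ids, true_labels))
    return correct / len(true_labels)


# Hypothetical output of a 2-cluster run over 10 syllables.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
true_labels = ["unacc", "unacc", "unacc", "high",
               "high", "high", "low", "unacc", "high", "high"]
mapping = label_clusters(cluster_ids, true_labels)
accuracy = clustering_accuracy(cluster_ids, true_labels)
```

The labels are used only for evaluation; the clustering itself never sees them. With few clusters, minority classes such as "low" here can end up with no cluster of their own.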

30
Contrasting Clustering

Contrasts
Clustering
3 Spectral approaches
Perform spectral decomposition of affinity matrix
Asymmetric k-lines (Fischer & Poland 2004)
Symmetric k-lines (Fischer & Poland 2004)
Laplacian Eigenmaps (Belkin, Niyogi, Sindhwani 2004)
Binary weights, k-lines clustering
K-means: Standard Euclidean distance
# of clusters: 2-16
Best results: > 78%
2 clusters: asymmetric k-lines > k-means
Larger cluster counts: all similar
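
For reference, the k-means baseline with standard Euclidean distance can be written in a few lines. This is a plain Lloyd's-algorithm sketch, not the spectral k-lines methods; taking the first k points as initial centers is an arbitrary choice made to keep the example deterministic, and the toy points are invented.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; initial centers are simply the first k points."""
    centers = list(points[:k])
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        assignment = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centers


# Two hypothetical well-separated groups of 2-D feature vectors.
points = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2),
          (9.5, 10.2), (0.2, 0.4), (10.1, 9.8)]
assignment, centers = kmeans(points, k=2)
```

In practice the initial centers would be chosen at random, usually with several restarts.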

31
Contrasting Learners
32
Tone Clustering I

Mandarin: four tones
400 samples, balanced
2-phase clustering: 2-5 clusters each
Asymmetric k-lines, k-means clustering
Clean read speech
In-focus syllables: 87% (cf. 99% supervised)
In-focus and pre-focus: 77% (cf. 93% supervised)
Broadcast news: 57% (cf. 74% supervised)
K-means requires more clusters to reach k-lines level

33
Tone Structure
First phase of clustering splits high/rising from low/falling by slope
Second phase splits by pitch height
34
Tone Clustering II

isiZulu: High/Low tones
3225 samples, no labels
Proportional: 62% low, 38% high
K-means clustering: 2 clusters
Read speech, web-based sentences
70% accuracy (vs 76% fully-supervised)

35
Conclusions

Common prosodic framework for tone and pitch
accent recognition
Contextual modeling enhances recognition
Local context and broad phrase contour
Carryover coarticulation has larger effect for
Mandarin
Exploiting unlabeled examples for recognition
Semi- and Un-supervised approaches
Best cases approach supervised levels with less
training
Exploits acoustic structure of tone and accent
space

36
Current and Future Work

Interactions of tone and intonation
Recognition of topic and turn boundaries
Effects of topic and turn cues on tone realization
Child-directed speech & tone learning
Support for Computer-assisted tone learning
Structured sequence models for tone
Sub-syllable segmentation modeling
Feature assessment
Band energy and intensity in tone recognition

37
Thanks

Dinoj Surendran, Siwei Wang, Yi Xu
Natasha Govender and Etienne Barnard
V. Sindhwani, M. Belkin, P. Niyogi; I. Fischer, J. Poland; T. Joachims; C-C. Chang, C. Lin
This work supported by NSF Grant 0414919
http://people.cs.uchicago.edu/levow/tai
