Title: Context and Learning in Multilingual Tone and Pitch Accent Recognition
1Context and Learning in Multilingual Tone and
Pitch Accent Recognition
- Gina-Anne Levow
- University of Chicago
- May 18, 2007
2Roadmap
- Challenges for Tone and Pitch Accent
- Contextual effects
- Training demands
- Modeling Context for Tone and Pitch Accent
- Data collections processing
- Integrating context
- Context in Recognition
- Asides: More tones and features
- Reducing Training Demands
- Data collections structure
- Semi-supervised learning
- Unsupervised clustering
- Conclusion
3Challenges Context
- Tone and Pitch Accent Recognition
- Key component of language understanding
- Lexical tone carries word meaning
- Pitch accent carries semantic, pragmatic, discourse meaning
- Non-canonical form (Shen 90, Shih 00, Xu 01)
- Tonal coarticulation modifies surface realization
- In extreme cases, fall becomes rise
- Tone is relative
- To speaker range
- High for male may be low for female
- To phrase range, other tones
- E.g. downstep
4Challenges Training Demands
- Tone and pitch accent recognition
- Exploit data intensive machine learning
- SVMs (Thubthong 01, Levow 05, SLX05)
- Boosted and bagged decision trees (X. Sun, 02)
- HMMs (Wang & Seneff 00, Zhou et al 04, Hasegawa-Johnson et al 04)
- Can achieve good results with huge sample sets
- SLX05: 10K lab syllabic samples -> > 90% accuracy
- Training data expensive to acquire
- Time: pitch accent labeling takes 10s of times real-time
- Money: requires skilled labelers
- Limits investigation across domains, styles, etc.
- Human language acquisition doesn't use labels
5Strategy Overall
- Common model across languages
- Common machine learning classifiers
- Acoustic-prosodic model
- No word label, POS, lexical stress info
- No explicit tone label sequence model
- English, Mandarin Chinese, isiZulu
- (also Cantonese)
6Strategy Context
- Exploit contextual information
- Features from adjacent syllables
- Height, shape: direct, relative
- Compensate for phrase contour
- Analyze impact of
- Context position, context encoding, context type
- > 12.5% reduction in error over no context
7Data Collections I
- English (Ostendorf et al, 95)
- Boston University Radio News Corpus, f2b
- Manually ToBI annotated, aligned, syllabified
- Pitch accent aligned to syllables
- Unaccented, High, Downstepped High, Low
- (Sun 02, Ross & Ostendorf 95)
8Data Collections II
- Mandarin
- TDT2 Voice of America Mandarin Broadcast News
- Automatically force aligned to anchor scripts
- Automatically segmented, pinyin pronunciation lexicon
- Manually constructed pinyin-ARPABET mapping
- CU Sonic language porting
- High, Mid-rising, Low, High falling, Neutral
9Data Collections III
- isiZulu (Govender et al., 2005)
- Sentence text collected from Web
- Selected based on grapheme bigram variation
- Read by male native speaker
- Manually aligned, syllabified
- Tone labels assigned by 2nd native speaker
- Based only on utterance text
- Tone labels: High, Low
10Local Feature Extraction
- Uniform representation for tone, pitch accent
- Motivated by Pitch Target Approximation Model
- Tone/pitch accent target exponentially approached
- Linear target height, slope (Xu et al, 99)
- Base features
- Pitch, intensity: max, mean, min, range
- (Praat, speaker normalized)
- Pitch at 5 points across voiced region
- Duration
- Initial, final in phrase
- Slope
- Linear fit to last half of pitch contour
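The base pitch features listed above can be sketched in a few lines. This is a minimal illustration, not the extraction code used in the talk: the function names, the evenly spaced 5-point sampling, and the input format (time and speaker-normalized pitch samples over the voiced region of one syllable) are all assumptions.

```python
from statistics import mean

def linear_slope(xs, ys):
    """Least-squares slope of ys against xs (0.0 if xs has no spread)."""
    mx, my = mean(xs), mean(ys)
    denom = sum((x - mx) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom

def sample_points(values, n=5):
    """Pick n evenly spaced samples from the first to the last value."""
    if len(values) == 1:
        return [values[0]] * n
    last = len(values) - 1
    return [values[round(i * last / (n - 1))] for i in range(n)]

def base_pitch_features(times, pitch):
    """Hypothetical helper: base features for one syllable's voiced region."""
    half = len(pitch) // 2
    return {
        "max": max(pitch),
        "mean": mean(pitch),
        "min": min(pitch),
        "range": max(pitch) - min(pitch),
        "points": sample_points(pitch, 5),
        "duration": times[-1] - times[0],
        # Slope: linear fit to the last half of the pitch contour.
        "slope": linear_slope(times[half:], pitch[half:]),
    }
```

Fitting only the last half of the contour follows the slide's description; the assumption here is that the target is best approximated late in the syllable, after carryover effects from the preceding syllable have decayed.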
11Context Features
- Local context
- Extended features
- Pitch max, mean, adjacent points of preceding, following syllables
- Difference features
- Difference between
- Pitch max, mean, mid, slope
- Intensity max, mean
- Of preceding, following and current syllable
- Phrasal context
- Compute collection average phrase slope
- Compute scalar pitch values, adjusted for slope
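The two local context encodings can be contrasted with a small sketch. The feature names, the dictionary representation, and the sign convention (current minus neighbor) are illustrative assumptions, not details given in the slides.

```python
def extended_features(neighbor, keys, prefix):
    """Extended encoding: copy the neighbor's own values into the vector."""
    return {f"{prefix}_{k}": neighbor[k] for k in keys}

def difference_features(cur, neighbor, keys, prefix):
    """Difference encoding: current value minus the neighbor's value."""
    return {f"{prefix}_diff_{k}": cur[k] - neighbor[k] for k in keys}

def add_context(prev, cur, nxt, keys=("pitch_max", "pitch_mean")):
    """Hypothetical helper: current syllable plus both encodings, both sides."""
    feats = dict(cur)
    feats.update(extended_features(prev, keys, "pre"))
    feats.update(extended_features(nxt, keys, "post"))
    feats.update(difference_features(cur, prev, keys, "pre"))
    feats.update(difference_features(cur, nxt, keys, "post"))
    return feats
```

Dropping the "pre" or "post" calls, or one of the two encodings, gives reduced feature sets of the kind the experiments compare by context position and encoding.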
12Classification Experiments
- Classifier: Support Vector Machine
- Linear kernel
- Multiclass formulation
- SVMlight (Joachims), LibSVM (Chang & Lin 01)
- 4:1 training / test splits
- Experiments: Effects of
- Context position: preceding, following, none, both
- Context encoding: Extended/Difference
- Context type: local, phrasal
13Results Local Context
Context (accuracy, %) | Mandarin Tone | English Pitch Accent | isiZulu Tone
Full                  | 74.5          | 81.3                 | 75.9
Extend PrePost        | 74            | 80.7                 | 73.8
Extend Pre            | 74            | 79.9                 | 73.6
Extend Post           | 70.5          | 76.7                 | 72.3
Diffs PrePost         | 75.5          | 80.7                 | 75.8
Diffs Pre             | 76.5          | 79.5                 | 75.5
Diffs Post            | 69            | 77.3                 | 72.8
Both Pre              | 76.5          | 79.7                 | 75.5
Both Post             | 71.5          | 77.6                 | 72.5
No context            | 68.5          | 75.9                 | 72.2
17Discussion Local Context
- Any context information improves over none
- Preceding context information consistently improves over none or following context information
- English/isiZulu: Generally more context features are better
- Mandarin: Following context can degrade
- Little difference in encoding (Extend vs Diffs)
- Consistent with phonetic analysis (Xu) that carryover coarticulation is greater than anticipatory
18Results Discussion Phrasal Context
Phrase Context (accuracy, %) | Mandarin Tone | English Pitch Accent
Phrase                       | 75.5          | 81.3
No Phrase                    | 72            | 79.9
- Phrase contour compensation enhances recognition
- Simple strategy
- Non-linear slope compensation may improve further
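One simple reading of the phrase contour compensation is sketched below: estimate an average linear slope over all phrases in the collection, then remove that trend from each scalar pitch value according to its time within the phrase. The function names, the per-phrase least-squares fit, and the exact adjustment formula are assumptions made for illustration.

```python
from statistics import mean

def fit_slope(times, pitches):
    """Least-squares slope of pitch against time within one phrase."""
    mt, mp = mean(times), mean(pitches)
    denom = sum((t - mt) ** 2 for t in times)
    if denom == 0:
        return 0.0
    return sum((t - mt) * (p - mp) for t, p in zip(times, pitches)) / denom

def average_phrase_slope(phrases):
    """phrases: list of (times, pitches), times measured from phrase start."""
    return mean(fit_slope(ts, ps) for ts, ps in phrases)

def compensate(time_in_phrase, pitch, avg_slope):
    """Remove the collection-average linear trend from one pitch value."""
    return pitch - avg_slope * time_in_phrase
```

With a negative average slope (declination), syllables late in the phrase are raised relative to their raw values, so a late high tone is no longer confused with an early low one.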
19Context Summary
- Employ common acoustic representation
- Tone (Mandarin), pitch accent (English)
- Cantonese: 64%, 68% with RBF kernel
- SVM classifiers, linear kernel: 76%, 81%
- Local context effects
- Up to > 20% relative reduction in error
- Preceding context greatest contribution
- Carryover vs anticipatory
- Phrasal context effects
- Compensation for phrasal contour improves
recognition
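The relative error reduction figures can be checked directly against the local context results. The sketch below uses the no-context accuracy and the best accuracy per language from the results table; the function and variable names are illustrative.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Percent of the baseline error removed, given accuracies in percent."""
    base_err = 100.0 - baseline_acc
    new_err = 100.0 - improved_acc
    return 100.0 * (base_err - new_err) / base_err

# (no-context accuracy, best accuracy with context), from the results table
RESULTS = {
    "Mandarin tone": (68.5, 76.5),
    "English pitch accent": (75.9, 81.3),
    "isiZulu tone": (72.2, 75.9),
}

REDUCTIONS = {
    task: relative_error_reduction(base, best)
    for task, (base, best) in RESULTS.items()
}
```

Mandarin and English come out above 20% and isiZulu at roughly 13%, which matches both the "up to > 20%" and the "> 12.5%" statements.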
20Context Summary
- Employ common acoustic representation
- Tone (Mandarin,isiZulu), pitch accent (English)
- SVM classifiers, linear kernel: 76%, 76%, 81%
- Local context effects
- Up to > 20% relative reduction in error
- Preceding context greatest contribution
- Carryover vs anticipatory
- Phrasal context effects
- Compensation for phrasal contour improves
recognition
21Aside More Tones
- Cantonese
- CUSENT corpus of read broadcast news text
- Same feature extraction representation
- 6 tones
- High level, high rise, mid level, low fall, low rise, low level
- SVM classification
- Linear kernel 64%, Gaussian kernel 68%
- Tones 3, 6: 50%, mutually indistinguishable (50% pairwise)
- Human levels: no context 50%, context 68%
- Augment with syllable phone sequence
- 86% accuracy; for 90% of syllables w/ tone 3 or 6, one dominates
22Aside Voice Quality Energy
- w/ Dinoj Surendran
- Assess local voice quality and energy features for tone
- Not typically associated with tones: Mandarin/isiZulu
- Considered
- VQ: NAQ, AQ, etc.; spectral balance; spectral tilt; band energy
- Useful: Band energy significantly improves
- Mandarin: neutral tone
- Supports identification of unstressed syllables
- Spectral balance predicts stress in Dutch
- isiZulu: Using band energy outperforms pitch
- In conjunction with pitch -> 78%
23Roadmap
- Challenges for Tone and Pitch Accent
- Contextual effects
- Training demands
- Modeling Context for Tone and Pitch Accent
- Data collections processing
- Integrating context
- Context in Recognition
- Reducing Training Demands
- Data collections structure
- Semi-supervised learning
- Unsupervised clustering
- Conclusion
24Strategy Training
- Challenge
- Can we use the underlying acoustic structure of the language through unlabeled examples to reduce the need for expensive labeled training data?
- Exploit semi-supervised and unsupervised learning
- Semi-supervised: Laplacian SVM
- Unsupervised: K-means and asymmetric k-lines clustering
- Substantially outperform baselines
- Can approach supervised levels
25Data Collections Processing
- English (as before)
- Boston University Radio News Corpus, f2b
- Binary: Unaccented vs accented
- 4-way: Unaccented, High, Downstepped High, Low
- Mandarin
- Lab speech data (Xu, 1999)
- 5-syllable utterances: vary tone, focus position
- In-focus, pre-focus, post-focus
- TDT2 Voice of America Mandarin Broadcast News
- 4-way: High, Mid-rising, Low, High falling
- isiZulu (as before)
- Read web sentences
- 2-way: High vs low
26Semi-supervised Learning
- Approach
- Employ small amount of labeled data
- Exploit information from additional, presumably more available, unlabeled data
- Few prior examples: several weakly supervised (Wong et al, 05)
- Classifier
- Laplacian SVM (Sindhwani, Belkin & Niyogi 05)
- Semi-supervised variant of SVM
- Exploits unlabeled examples
- RBF kernel, typically 6 nearest neighbors,
transductive
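The way unlabeled examples enter a Laplacian SVM is through a neighborhood graph built over all points, labeled or not. The sketch below shows only that graph step, a k-nearest-neighbor graph Laplacian L = D - W, and not the SVM solver itself. Binary edge weights and the symmetrization rule are assumptions; the default k=6 follows the slide.

```python
import math

def knn_graph_laplacian(points, k=6):
    """Return L = D - W for a symmetrized k-nearest-neighbor graph.

    points: list of equal-length coordinate tuples (labeled and unlabeled).
    Edges get weight 1.0 if either endpoint lists the other among its
    k nearest neighbors (an illustrative choice).
    """
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            weights[i][j] = weights[j][i] = 1.0
    laplacian = [[-weights[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        laplacian[i][i] = sum(weights[i])  # node degree on the diagonal
    return laplacian
```

A classifier function f that changes little across graph edges has a small value of f^T L f, which is the smoothness penalty that lets unlabeled neighbors shape the decision boundary.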
27Experiments
- Pitch accent recognition
- Binary classification: Unaccented/Accented
- 1000 instances, proportionally sampled
- Labeled training: 200 unacc, 100 acc
- 80% accuracy (cf. 84% w/ 15x labeled SVM)
- Mandarin tone recognition
- 4-way classification: n(n-1)/2 binary classifiers
- 400 instances, balanced; 160 labeled
- Clean lab speech, in-focus: 94%
- cf. 99% w/ SVM, 1000s train; 85% w/ SVM, 160 training samples
- Broadcast news: 70%
- cf. < 50% w/ SVM, 160 training samples
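For the 4-way tone task, n(n-1)/2 gives 4*3/2 = 6 pairwise classifiers, whose decisions are combined by voting. The sketch below shows that combination step only. The toy pairwise classifiers (nearest prototype on a single scalar feature) and the prototype values are hypothetical stand-ins for trained binary classifiers.

```python
from collections import Counter
from itertools import combinations

def make_toy_pairwise(prototypes):
    """Build one toy binary classifier per unordered pair of classes."""
    def make(a, b):
        # Picks whichever of the two classes has the nearer prototype.
        return lambda x: a if abs(x - prototypes[a]) <= abs(x - prototypes[b]) else b
    return {(a, b): make(a, b) for a, b in combinations(prototypes, 2)}

def one_vs_one_predict(x, classes, pairwise):
    """Each pairwise classifier casts one vote; the most-voted class wins."""
    votes = Counter(pairwise[(a, b)](x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Hypothetical 1-D prototypes, for illustration only.
PROTOTYPES = {"High": 4.0, "Mid-rising": 3.0, "Low": 1.0, "High-falling": 2.0}
PAIRWISE = make_toy_pairwise(PROTOTYPES)
```

Calling `one_vs_one_predict(1.1, list(PROTOTYPES), PAIRWISE)` runs all six classifiers and returns the class that wins the most pairwise contests.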
28Unsupervised Learning
- Question
- Can we identify the tone structure of a language from the acoustic space without training?
- Analogous to language acquisition
- Significant recent research in unsupervised clustering
- Established approaches: k-means
- Spectral clustering (Shi & Malik 97, Fischer & Poland 2004): asymmetric k-lines
- Little research for tone
- Self-organizing maps (Gauthier et al, 2005)
- Tones identified in lab speech using f0 velocities
- Cluster-based bootstrapping (Narayanan et al, 2006)
- Prominence clustering (Tamburini 05)
29Clustering
- Pitch accent clustering
- 4-way distinction: 1000 samples, proportional
- 2-16 clusters constructed
- Assign most frequent class label to each cluster
- Classifier
- Asymmetric k-lines
- Context-dependent kernel radii, non-spherical
- > 78% accuracy
- 2 clusters: asymmetric k-lines best
- Context effects
- Vector w/ preceding context vs vector with no context: comparable
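Scoring a clustering against gold labels, as described above, means giving each cluster its most frequent class label and counting how many samples then match. A minimal sketch follows; the function name and input format are assumptions.

```python
from collections import Counter, defaultdict

def cluster_accuracy(cluster_ids, gold_labels):
    """Map each cluster to its majority gold label and score the result.

    Returns (mapping, accuracy), where mapping is cluster id -> label.
    """
    by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, gold_labels):
        by_cluster[cluster].append(label)
    mapping = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in by_cluster.items()
    }
    correct = sum(
        mapping[cluster] == label
        for cluster, label in zip(cluster_ids, gold_labels)
    )
    return mapping, correct / len(gold_labels)
```

Because labels are used only for this mapping step, accuracy can never fall below the majority-class rate, and with few clusters minority classes may get no cluster at all. Both points matter when reading the 2-cluster results on proportionally sampled data.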
30Contrasting Clustering
- Contrasts
- Clustering
- 3 Spectral approaches
- Perform spectral decomposition of affinity matrix
- Asymmetric k-lines (Fischer & Poland 2004)
- Symmetric k-lines (Fischer & Poland 2004)
- Laplacian Eigenmaps (Belkin, Niyogi, Sindhwani 2004)
- Binary weights, k-lines clustering
- K-means: standard Euclidean distance
- # of clusters: 2-16
- Best results: > 78%
- 2 clusters: asymmetric k-lines > 2 clusters k-means
- Larger clusters: all similar
31Contrasting Learners
32Tone Clustering I
- Mandarin four tones
- 400 samples, balanced
- 2-phase clustering: 2-5 clusters each
- Asymmetric k-lines, k-means clustering
- Clean read speech
- In-focus syllables: 87% (cf. 99% supervised)
- In-focus and pre-focus: 77% (cf. 93% supervised)
- Broadcast news: 57% (cf. 74% supervised)
- K-means requires more clusters to reach k-lines level
33Tone Structure
First phase of clustering splits high/rising from
low/falling by slope; second phase splits by pitch height
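The two-phase structure can be illustrated with a toy sketch: split all syllables into two groups by slope, then split each group again by pitch height. This uses a simple 1-D 2-means as a stand-in for the k-lines and k-means clusterers of the experiments, and the (slope, height) input pairs are an assumed simplification of the full feature vector.

```python
def two_means_1d(values, iterations=20):
    """1-D 2-means: label each value 0 (low group) or 1 (high group)."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        lows = [v for v, g in zip(values, labels) if g == 0]
        highs = [v for v, g in zip(values, labels) if g == 1]
        new_lo = sum(lows) / len(lows)
        new_hi = sum(highs) / len(highs) if highs else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return labels

def two_phase_cluster(syllables):
    """syllables: list of (slope, height). Returns (phase1, phase2) labels."""
    phase1 = two_means_1d([slope for slope, _ in syllables])
    result = [None] * len(syllables)
    for group in (0, 1):
        members = [i for i, g in enumerate(phase1) if g == group]
        if not members:
            continue
        heights = [syllables[i][1] for i in members]
        for i, h in zip(members, two_means_1d(heights)):
            result[i] = (group, h)
    return result
```

For four toy syllables with clearly separated slopes and heights, each ends up with its own (phase1, phase2) pair, giving four candidate tone categories without any labels.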
34Tone Clustering II
- isiZulu: High/Low tones
- 3225 samples, no labels
- Proportional: 62% low, 38% high
- K-means clustering: 2 clusters
- Read speech, web-based sentences
- 70% accuracy (vs 76% fully-supervised)
35Conclusions
- Common prosodic framework for tone and pitch accent recognition
- Contextual modeling enhances recognition
- Local context and broad phrase contour
- Carryover coarticulation has larger effect for Mandarin
- Exploiting unlabeled examples for recognition
- Semi- and un-supervised approaches
- Best cases approach supervised levels with less training
- Exploits acoustic structure of tone and accent space
36Current and Future Work
- Interactions of tone and intonation
- Recognition of topic and turn boundaries
- Effects of topic and turn cues on tone realization
- Child-directed speech tone learning
- Support for Computer-assisted tone learning
- Structured sequence models for tone
- Sub-syllable segmentation modeling
- Feature assessment
- Band energy and intensity in tone recognition
37Thanks
- Dinoj Surendran, Siwei Wang, Yi Xu
- Natasha Govender and Etienne Barnard
- V. Sindhwani, M. Belkin, P. Niyogi; I. Fischer, J. Poland; T. Joachims; C.-C. Chang, C.-J. Lin
- This work supported by NSF Grant 0414919
- http//people.cs.uchicago.edu/levow/tai