Title: Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval Johns Hopkins Universit
1Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval
Johns Hopkins University Center for Language and Speech Processing
Summer Workshop 2000
- The MEI Team
- August 23, 2000
2MEI Team
- Helen Meng, Chinese University of Hong Kong
- Erika Grams, Advanced Analytic Tools
- Sanjeev Khudanpur, Johns Hopkins University
- Gina-Anne Levow, University of Maryland
- Douglas Oard, University of Maryland
- Patrick Schone, US Department of Defense
- Hsin-Min Wang, Academia Sinica, Taiwan
- Berlin Chen, National Taiwan University
- Wai-Kit Lo, Chinese University of Hong Kong
- Karen Tang, Princeton University
- Jianqiang Wang, University of Maryland
3Outline
- Motivation
- Background
- The Multi-scale Paradigm
- multi-scale query processing
- multi-scale document indexing
- multi-scale retrieval
- The Perfect Retrieval Myth
- Experiments and Findings
- Conclusions and Future Work
4Motivation
- Monolingual speech retrieval applications are emerging, e.g.
- http://speechbot.research.compaq.com
[Figure: Internet-accessible Radio and Television Stations; source: www.real.com, Feb 2000]
5Motivation (cont): Internet User Population
[Figure: Internet user population by language (English, Chinese), 2000 and 2005; source: Global Reach]
6MEI The Big Picture
7Concept Demo
Karen, Erika
8Two Prevailing Problems in CL-SDR
- Translation problem
- out-of-vocabulary (OOV) in translation
- too many translations
- Recognition problem
- OOV in recognition
- acoustic confusions
- Solution: subword units may help
- transliteration, e.g.
- Northern Ireland /bei3 ai4 er3 lan2/ (in query)
- recognition of subword units, e.g.
- Iraq -- a rock (in document)
9Background for Mandarin Speech Recognition
- 400 syllables
- full phonological coverage in Mandarin Chinese
- 6,800 characters
- full textual coverage in written Chinese (GB-coded)
- each character pronounced as a syllable
- Unknown number of Chinese words
- one to several characters per word
- character combinations create different meanings
- ambiguity in word tokenization
10OOV and Acoustic Confusions in Mandarin SDR
Query Iraq...
11Subwords for Retrieval
Pros
- Character n-grams
- robust to word-level mismatches due to different tokenization
- Syllable n-grams
- robust to word/character-level mismatches due to homophones
- Partial matches possible
Con
- Subwords contain reduced lexical knowledge cf. words
12The MEI Investigation
- Use of a multi-scale representation for crosslingual spoken document retrieval (CL-SDR)
- Words and subwords
Research Challenges
- Multi-scale query translation
- Multi-scale audio indexing
- Multi-scale retrieval
13Query by Example
English Newswire Exemplars
President Bill Clinton and Chinese President Jiang Zemin engaged in a spirited, televised debate Saturday over human rights and the Tiananmen Square crackdown, and announced a string of agreements on arms control, energy and environmental matters. There were no announced breakthroughs on American human rights concerns, including Tibet, but both leaders accentuated the positive
Mandarin Audio Stories
[Mandarin story text not recoverable from the source encoding]
14Evaluation Collection
- Development Collection: TDT-2
- 17 topics, variable number of exemplars
- 2265 manually segmented stories
- Evaluation Collection: TDT-3
- 56 topics, variable number of exemplars
- 3371 manually segmented stories
- English text topic exemplars: Associated Press, New York Times
- Mandarin audio broadcast news: Voice of America
- Timeline labels: Jan 98, Mar 98, Jun 98, Oct 98, Dec 98
- Exhaustive relevance assessment based on event overlap
15Abstract Task Model
American English Text Exemplar
Mandarin Chinese Broadcast News
Cross-Language Speech Retrieval
Ranked List of News Stories
16Evaluation of Ranked Lists
Story          Relevance Judgments
VOA 0427.22    Relevant
VOA 0521.14    Not
VOA 0604.39    Not
VOA 0419.12    Relevant
VOA 0513.17    Relevant
VOA 0527.13    Not
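The deck scores ranked lists by uninterpolated average precision. The sketch below shows one common way to compute it for the six judged stories on this slide. It assumes the stories are listed in rank order and that the three relevant stories shown are the only relevant ones in the collection; neither assumption is stated in the source, and the function name is illustrative.

```python
def average_precision(judgments, total_relevant=None):
    """Uninterpolated average precision for one ranked list.

    judgments: booleans in rank order (True = relevant).
    total_relevant: relevant stories in the whole collection; if omitted,
    we assume (hypothetically) that all of them appear in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(judgments, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    denominator = total_relevant if total_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


# Ranked list from the slide, assumed to be in rank order.
ranked = [
    ("VOA 0427.22", True),
    ("VOA 0521.14", False),
    ("VOA 0604.39", False),
    ("VOA 0419.12", True),
    ("VOA 0513.17", True),
    ("VOA 0527.13", False),
]
ap = average_precision([relevant for _, relevant in ranked])
print(round(ap, 3))
```

Under these assumptions the relevant stories sit at ranks 1, 4 and 5, so the value is (1/1 + 2/4 + 3/5) / 3 = 0.7. Averaging this figure over exemplars and then over topics gives the mean uninterpolated average precision used in the later slides.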
17Recall-Precision Graph
18Variation Across Exemplars
19Average Across Exemplars
20Variation Across Topics
[Figure: mean uninterpolated average precision (0.0 to 1.0) by topic]
21Comparing Two Systems
[Figure: per-topic comparison of two systems]
22Significance Testing
- Statistical significance
- Null hypothesis: mean average precision across topics is drawn from same distribution
- Paired 2-tailed t-test, significant if p
- For System A vs. System B, p = 0.94
- Meaningful differences
- Rule of thumb: 5-10% relative
- For System A vs. System B, relative difference is
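A minimal sketch of the two checks on this slide, computed over per-topic average precision. The scores below are made-up illustrative numbers, not results from the workshop, and the function names are hypothetical. The code stops at the t statistic and degrees of freedom; turning them into a two-tailed p-value needs the Student t distribution (a table or a statistics package).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test over topics."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 topics")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all per-topic differences are identical")
    t = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


def relative_difference(scores_a, scores_b):
    """Relative difference in mean average precision of A over B."""
    return (mean(scores_a) - mean(scores_b)) / mean(scores_b)


# Illustrative per-topic average precision (invented for this example).
system_a = [0.50, 0.60, 0.40, 0.70]
system_b = [0.40, 0.60, 0.50, 0.60]

t_value, dof = paired_t_statistic(system_a, system_b)
rel_diff = relative_difference(system_a, system_b)
print(f"t = {t_value:.3f}, df = {dof}, relative difference = {rel_diff:.1%}")
```

Pairing by topic matters because average precision varies far more across topics than across systems, so the test looks at per-topic differences rather than at the two means independently.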
23Translingual and Multi-Scale Query Processing
24[Figure: system overview. Query side: English Exemplar (e.g. "President Bill Clinton and ..."), Named Entity Tagging, Term Selection, Term Translation with the Bilingual Term List, Query Construction. Document side: Mandarin Audio, Speech Recognition, Story Boundaries, Document Construction. The Mandarin IR System produces a Ranked List, which is evaluated against Relevance Judgments by Mean Uninterpolated Average Precision. Resource and tool labels in the figure: LDC, BBN, U Mass, Cornell, Dragon.]
25Multi-Scale Query Translation
- Words and Phrases
- (Gina, Sanjeev)
- Subwords
- (Helen, Wai-Kit, Berlin, Karen)
26Bilingual Term List
- Combination of
- LDC English-Chinese bilingual term list
- Chinese-English Translation Assistance File (CETA), inverted

Total English Terms          199,444
Total Translation Pairs      395,216
Phrasal Terms                 81,127
Phrasal Translation Pairs    105,750

Term            # translations
human           7
right(s)        30
human rights    1
27Query Term Selection
- Tagged named entities (BBN Identifinder)
- Person: partners of Goldman, Sachs, Co.
- Organization: UN Security Council
- Dictionary-based phrases
- translatable multi-word units, e.g.
- Wall Street, best interests, guiding principles, human rights
- automatic tagging: greedy, left-to-right, max match
- Chi-squared filtering
- Compared to English background model
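The dictionary-based phrase tagging above ("greedy, left-to-right, max match") can be sketched as follows. The phrase set holds only the four examples from the slide; the maximum phrase length, the lowercasing, and the function name are assumptions made for illustration.

```python
def tag_phrases(tokens, phrase_set, max_len=4):
    """Greedy left-to-right maximum matching of dictionary phrases.

    At each position, try the longest candidate first and back off one
    token at a time; a single token is always accepted as a fallback.
    """
    units = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate.lower() in phrase_set:
                units.append(candidate)
                i += n
                break
    return units


# Translatable multi-word units taken from the slide.
phrases = {"wall street", "best interests", "guiding principles", "human rights"}
tokens = "debate over human rights on Wall Street".split()
tagged = tag_phrases(tokens, phrases)
print(tagged)
```

Here "human rights" and "Wall Street" come out as single units, so they can be looked up in the bilingual term list as phrases rather than word by word.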
28Query Term Translation
- Named entities
- if absent from dictionary, translate individual terms
- e.g. Security Council versus First Bank of Siam
- Numeric Expressions
- special processing for digits
- e.g. 12:30 pm, June 15, 1969
- Remaining terms
- Consult bilingual term list, lemmatize if necessary
- e.g. televised translates as television
29Query Construction
- Unbalanced queries
- Use all plausible translations for each term
- Balanced queries
- Pseudo-term weight: average of the translations' weights
- Structured queries
- Recompute pseudo-term weight from the translations' term frequency and document frequency
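The three query-construction strategies differ in how the translations of one English term contribute weight. The sketch below is one reading of the slide, not the workshop's implementation: the unbalanced case is modelled as a plain sum, and the structured case uses a simple tf x idf formula chosen only for illustration. All names and numbers are hypothetical.

```python
import math


def unbalanced_weight(translation_weights):
    """Every translation counts as its own query term, so an English term
    with many translations contributes more total weight."""
    return sum(translation_weights)


def balanced_weight(translation_weights):
    """Pseudo-term weight: average of the translations' weights."""
    return sum(translation_weights) / len(translation_weights)


def structured_weight(translations, doc_tf, postings, num_docs):
    """Treat all translations as one pseudo-term: pool term frequency in
    the document and document frequency over the collection, then weight
    (here with an assumed tf * idf formula)."""
    tf = sum(doc_tf.get(t, 0) for t in translations)
    docs_with_any = set()
    for t in translations:
        docs_with_any |= postings.get(t, set())
    df = len(docs_with_any)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(num_docs / df)


# Hypothetical weights of three translations of one English term.
weights = [0.2, 0.4, 0.6]
unbalanced = unbalanced_weight(weights)
balanced = balanced_weight(weights)

# Hypothetical statistics for two translations, "t1" and "t2".
structured = structured_weight(
    ["t1", "t2"],
    doc_tf={"t1": 2, "t2": 1},
    postings={"t1": {1, 2}, "t2": {2, 3}},
    num_docs=10,
)
print(unbalanced, balanced, structured)
```

Averaging keeps a term with 30 translations from outweighing a term with one, which is the imbalance the unbalanced strategy suffers from.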
30Strategies in Query Translation
- Phrase-based translation is significantly better
- Named entities and numeral translations are (barely) helpful
- Balanced translation matches Structured queries
- also extends easily to subword units
31Untranslatable Terms
Term         # of occurrences
suharto      97
netanyahu    88
starr        62
arafat       50
bjp          45
vajpayee     44
estrada      44
. h su       19
zemin        7
32Subword Transliteration
Kosovo (/ke1-suo3-wo4/, /ke1-suo3-fo2/, /ke1-suo3-fu1/, /ke1-suo3-fu2/)
English Query Exemplar: ..Kosovo...
Mandarin Audio Document: ../ke-suo-fo/.
Sound alike -- match in phonetic space?
33Subword Transliteration Procedure (1)
34Subword Transliteration Procedure (2)
35Subword Transliteration
- Chinese phone lattice generation
- Syllable bigram language model
- N-best syllable sequence hyp
- N = 1 (one-best hypothesis): /ji li si te fu/ (hyp) vs. /ke li si tuo fu/ (ref)
36Cross Lingual Phonetic Matching
- Documents are indexed with syllable bigrams (in addition to words and character bigrams if necessary)
- Query terms are translated as words where possible, phonetically where necessary
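The "words where possible, phonetically where necessary" rule amounts to a dictionary lookup with a transliteration fallback. The sketch below illustrates only that control flow. The term list and the transliteration table are toy stand-ins: the Kosovo syllable sequences are the ones shown on the earlier transliteration slide (tones dropped), and everything else, including the function names, is hypothetical.

```python
def translate_term(term, term_list, transliterate):
    """Translate via the bilingual term list; fall back to syllables.

    Returns (scale, value) pairs so later steps can tell word-level
    translations from syllable-level transliterations.
    """
    translations = term_list.get(term.lower())
    if translations:
        return [("word", t) for t in translations]
    return [("syllables", s) for s in transliterate(term)]


# Toy bilingual term list (one entry, for illustration only).
toy_term_list = {"television": ["电视"]}

# Toy stand-in for the transliteration procedure.
toy_transliterations = {
    "kosovo": [
        ("ke", "suo", "wo"),
        ("ke", "suo", "fo"),
        ("ke", "suo", "fu"),
    ],
}


def toy_transliterate(term):
    return toy_transliterations.get(term.lower(), [])


in_dictionary = translate_term("television", toy_term_list, toy_transliterate)
out_of_dictionary = translate_term("Kosovo", toy_term_list, toy_transliterate)
print(in_dictionary)
print(out_of_dictionary)
```

Syllable sequences produced by the fallback can then be matched against the syllable bigrams in the document index.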
37Multi-Scale Query Construction
38Multi-Scale Query Construction: Objectives
Bag of English query terms (selected) -> Query Construction -> Multi-scale query representation in Chinese
- Multi-scale representation integrates
- translated phrases, named entities, numeric expressions, translated terms
- transliterated syllables
- words, characters and syllable n-grams
39Multi-Scale Query Construction: Procedures
[Figure: query representations labelled "words + syl bigrams", "char + syl bigrams" and "syl bigrams"]
Syllable bigrams and Transliterations: yi-se se-lie shou-xiang ben-jie jie-ming ne-tan tan-ya ya-hu
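The syllable bigrams listed on this slide can be reproduced by taking overlapping syllable pairs inside each word. The grouping of syllables into four words below is an assumption inferred from the bigrams themselves (there is no "lie-shou" or "ming-ne" in the list); the function name is illustrative.

```python
def syllable_bigrams(words):
    """Overlapping syllable bigrams, formed within each word only."""
    bigrams = []
    for syllables in words:
        for left, right in zip(syllables, syllables[1:]):
            bigrams.append(f"{left}-{right}")
    return bigrams


# Syllables from the slide, grouped into words (assumed grouping).
query_words = [
    ["yi", "se", "lie"],
    ["shou", "xiang"],
    ["ben", "jie", "ming"],
    ["ne", "tan", "ya", "hu"],
]
query_bigrams = syllable_bigrams(query_words)
print(" ".join(query_bigrams))
```

Because the bigrams overlap, a recognition error on one syllable still leaves the other bigrams of the word intact, which is what makes partial matches possible.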
40Multi-Scale Audio Document Indexing
- Hsin-Min, Helen, Berlin, and Wai-Kit
41Previous Chinese Example
42Audio Document Indexing: Objectives
- Augment words with subword-based indexing
- Dragon word recognition outputs are provided
- Character-based indexing
- Characters derived from Dragon's recognized words
- Syllable-based indexing
- Syllables derived by pronunciation lookup using Dragon's recognized words
- Address Dragon's ASR errors
- Augment with alternative (word/char/syl) hypotheses, e.g. syllable lattice (Chen, Wang, ICASSP-2000)
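Deriving the character and syllable scales from recognized words can be sketched as below. The two-entry pronunciation lexicon is a toy stand-in for a real one, the recognized words are invented for the example, and the function name is hypothetical. Words missing from the lexicon simply contribute no syllables in this sketch.

```python
def index_terms(recognized_words, pronunciations):
    """Derive character- and syllable-level terms from recognized words.

    Characters come from splitting each word; syllables come from a
    pronunciation lookup, one syllable per character.
    """
    characters = [ch for word in recognized_words for ch in word]
    syllables = []
    for word in recognized_words:
        # Out-of-lexicon words add nothing at the syllable scale here.
        syllables.extend(pronunciations.get(word, []))
    return {"word": list(recognized_words), "char": characters, "syl": syllables}


# Toy pronunciation lexicon (tones omitted), for illustration only.
toy_lexicon = {
    "以色列": ["yi", "se", "lie"],
    "首相": ["shou", "xiang"],
}

scales = index_terms(["以色列", "首相"], toy_lexicon)
print(scales["char"])
print(scales["syl"])
```

Since all three scales derive from the same word hypotheses, they share the recognizer's errors, which is why the slide also calls for alternative hypotheses such as a syllable lattice.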
43Syllable Lattice Development
- Dragon's recognition accuracies
- Evaluated against anchor scripts
- 82.0% (word), 87.9% (char), 92.1% (syl)
- Syllable substitution errors (5.2%)
- MEI's syllable recognition accuracy
- Trained on Hub4 Mandarin (VOA, 11 hours, 1997)
- 70.2% (syl)
44Strategy
- Improve MEI's syllable recognizer
- Design a structure for document indexing which incorporates
- Dragon's word / character / syllable hypotheses
- MEI's syllable hypotheses
- (hopefully complementary to Dragon's syllables)
45MEI Syllable Recognizer: Improve Acoustic Models
[Figure: VOA Audio for Doc i and Dragon Outputs for Doc i -> Forced Alignment -> Speaker Adaptation of Baseline Acoustic Models -> Speaker-Adapted Acoustic Models -> Syllable Recognition -> MEI Syllables for Doc i]
- Forced alignment with Dragon's output for each document
- Blind speaker adaptation with Dragon's syllables
- MEI syllable accuracy: from 70.2% (original) to 87.7%
46MEI Syllable Recognizer: Incorporate Language Model
47Audio Document Indexing with Multiple Syllable Recognition Outputs
The revised syllable lattice
48Multi-scale Audio Document Indexing
49Fusion of Words and Subwords in Multi-Scale Retrieval
50Loose Coupling
- Merging ranked lists from separate runs
- For each query and document pair, the score is recalculated as $S(Q_i, D_j) = \sum_k w_k \, S_k(Q_i, D_j)$
- $w_k$ are the weights for different retrieval runs
- $k$ denotes a retrieval run at some scale (word, characters, syllables, combinations)
- $S_k(Q_i, D_j)$ is a rank-based score between query $i$ and document $j$ in retrieval run $k$
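A minimal sketch of loose coupling as a weighted sum of rank-based scores over separate runs. The slide does not say how a rank is turned into a score, so the linear mapping below (top rank scores 1.0, a document absent from a run scores 0) is an assumption, as are the run names, weights and document IDs.

```python
def rank_scores(ranked_docs):
    """Map each document in one run to a rank-based score in (0, 1].

    Assumed scoring: linear in rank, 1.0 for the top document.
    """
    n = len(ranked_docs)
    return {doc: (n - position) / n for position, doc in enumerate(ranked_docs)}


def loose_coupling(runs, weights):
    """Merge ranked lists from separate runs by weighted score sum."""
    combined = {}
    for run_name, ranked_docs in runs.items():
        for doc, score in rank_scores(ranked_docs).items():
            combined[doc] = combined.get(doc, 0.0) + weights[run_name] * score
    # Sort by descending combined score; break ties by document ID.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))


# Hypothetical runs at three scales for one query.
runs = {
    "word": ["d1", "d2", "d3"],
    "char2": ["d2", "d1", "d3"],
    "syl2": ["d2", "d3", "d1"],
}
weights = {"word": 0.5, "char2": 0.3, "syl2": 0.2}

merged = loose_coupling(runs, weights)
print(merged)
```

In this toy case d2 overtakes d1 even though the highest-weighted word run prefers d1, because both subword runs rank d2 first.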
51Loose Coupling
[Figure: merging ranked lists from Word, Char2 and Syl2 retrieval runs]
52Tight Coupling
- Unified indexing of words and subword ngrams
- For query and documents
- Combine terms at different scales to form a multi-scale query/document representation, e.g. yi-se se-lie shou-xiang ben-jie jie-ming ne-tan tan-ya ya-hu
- Multi-scale retrieval produces a single ranked list
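Tight coupling can be pictured as putting terms from every scale into one bag, tagged by scale, and ranking documents once against it. The sketch below uses simple term overlap as a stand-in for a real retrieval engine's scoring; that scoring, the scale tags, and the two toy documents are assumptions made for illustration.

```python
from collections import Counter


def multi_scale_bag(words, char_bigrams, syl_bigrams):
    """One bag of terms across scales; tags keep the scales distinct."""
    bag = Counter()
    bag.update(f"word:{w}" for w in words)
    bag.update(f"char2:{c}" for c in char_bigrams)
    bag.update(f"syl2:{s}" for s in syl_bigrams)
    return bag


def overlap_score(query_bag, doc_bag):
    """Placeholder score: number of shared term occurrences."""
    return sum(min(count, doc_bag[term]) for term, count in query_bag.items())


query = multi_scale_bag(["以色列"], ["以色", "色列"], ["yi-se", "se-lie"])

documents = {
    # The word was misrecognized, but its syllable bigrams survived.
    "doc_a": multi_scale_bag([], [], ["yi-se", "se-lie"]),
    # Unrelated story.
    "doc_b": multi_scale_bag(["首相"], ["首相"], ["shou-xiang"]),
}

ranking = sorted(documents, key=lambda d: (-overlap_score(query, documents[d]), d))
print(ranking)
```

A word-only index would score both documents zero for this query; the unified representation still retrieves doc_a through its syllable bigrams and yields a single ranked list.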
53Loose vs Tight Coupling
- Tight coupling combines document scores before ranking
- may need weight optimization
- Loose coupling combines lists post-hoc
- outperforms individual lists
54The Perfect Retrieval Myth
- Erika, Helen, Hsin-Min, Jian Qiang, Berlin
55The Perfect Retrieval Myth
- 100% Average Precision: ALL relevant docs and ZERO non-relevant docs retrieved
Is corrupted by ...
56Bounds on Word-Based Systems
- Using Mandarin VOA documents as exemplars
- matched condition
- Using Xinhua text documents as exemplars
- source mismatch
- Using manual translations of NYT documents as exemplars
57Bounds on Subword-Based Systems
- Character bigrams for indexing
- marginally outperforms word-based systems
- Syllable bigrams
- are quite competitive, though somewhat behind
- Mean average precision of 0.6 is a good CL-SDR target
58TDT-2 Results
59Retrieval Performance on TDT2
60TDT-3 Results
61Retrieval Performance on TDT3
62Summary and Conclusions
- Novel multi-scale paradigm for CL-SDR
- ameliorates the translation and recognition OOV problems
- Multi-scale query and document processing
- cross-lingual subword transliteration procedure (CLPM)
- query and document construction embeds words / characters / syllables
- balanced and structured queries
- Multi-scale retrieval
- tight and loose coupling strategies to fuse words and subwords for retrieval
63Summary and Conclusions (2)
- Extensive experiments on TDT-2, TDT-3
- character bigrams typically outperform words or syllable bigrams in retrieval
- fusion of word and subword units shows potential in multi-scale retrieval
- syllable lattice needs further investigation
64Future Work
- Word-subword fusion techniques merit further investigation
- Multi-scale query expansion for retrieval performance improvement (Wai-Kit)
- Incorporation of acoustic scores in syllable lattice representation for documents
65END